First, I'll just mention you've had a nice overview of this workshop from Dr. Hsiao. I hope you're going to learn lots more in the coming days. I'm going to be focusing on phylogenetic analysis, though my research area is more generally in what I call sustainable infectious disease control. I'm interested in different approaches, very much including genomic epidemiology for better tracking and identifying outbreaks and identifying ways to mitigate them. I have been involved in genomic epidemiology analyses for bacterial infectious diseases. Then, when Cindy Bell of Genome Canada said, hey, maybe we should set up something like what the UK is doing for COVID, Will Hsiao, Gary Van Domselaar and I got together and worked on setting up CanCOGeN, the Canadian COVID-19 Genomics Network, which involves pretty well everyone involved in this workshop. Essentially, we used that to further develop some viral-based analyses, some of which will be talked about here. I do want to emphasize that I get involved in a lot of other kinds of analyses, like genomic island-based analyses, so if you ever have any questions about these other kinds of microbial bioinformatics analyses, don't hesitate to contact me. I also run the pseudomonas.com website with Geoff Winsor in my group, where we perform Pseudomonas genomics analysis, and I'm leading the data integration and database development for the CHILD Cohort Study, which is profiling incredible amounts of detailed data over time for about 3,500 healthy kids, who are right now at about age 13. That study also includes a lot of microbiome work, which I'm also involved in. So I've got my hands in a lot of things, and I'm very happy to answer your questions.
My last point I'll make about myself is that I'm very keen to join the Q&A at the end of the day today, but I've got a dentist appointment conflict, so I'm going to be showing up right after that appointment just for the last little bit. Again, please don't hesitate to contact me if you have any questions. But let's move on to this module, which is a fairly, sorry, this Zoom updated and it is now, there we go, okay. So basically, today what we're going to do is talk about phylogenetic analysis. I also want to mention, sorry, just one second here, shoot, well, I'll bring it up more at the end, but I do want to start with a territorial acknowledgement that I'm on the unceded territories of the Tsleil-Waututh, Kwikwetlem and Musqueam nations, upon which I reside and play. And then I just want to mention I am associated with Molecular Biology and Biochemistry, Computing Science, and the Faculty of Health Sciences at SFU. Okay. So by the end of this lecture, you'll hopefully understand some fundamentals of character-based evolutionary analysis and phylogenetic analysis. Will has given you a bit of an overview, and now we want to get into some details. The goal here is really just to be able to look at phylogenetic analyses, interpret them, and know the basics of how to build a phylogenetic tree. Oh, sorry, am I in the wrong mode? Oh dear, just a minute. There we go. You should be able to see now the, and I'm going to just put on the chat, sorry. There we go, I've got the chat. Can you not see the full screen slides right now? We can, it's fine. It is? Yeah. Okay, sorry. Just lost it now. Yeah, I know. Oh my God. Sorry, I was just saying full screen. Yeah, yeah, that's fine, you know. Okay, sorry about that. Okay, so I also want you to appreciate the basics of how you build a phylogenetic tree and the key differences between methods. But first, let's just step way back.
I want you to appreciate that evolutionary theory itself evolved like many theories do: something comes up, and there are some ideas that are circulating. At the time when this theory was first formulated, there was an appreciation that the world was not constant but changing. Plate tectonics were appreciated. Fossil discoveries were accumulating. And we were starting to see cool things, like the deeper the strata, the more the fossils on, say, this coast of Africa resembled those on this coast of South America. We were starting to realize that this was all making sense, that there was this sort of history. It was still a bit debated whether fossils were the remains of unknown but still living species that were elsewhere on the planet. But Cuvier really did have this landmark study showing that the deeper the strata, the less similar fossils were to existing species. And that really set the stage for On the Origin of Species, Charles Darwin's landmark analysis, which was notable because it gave a mechanism. So, as with any theory, there was already an appreciation for something, and then this was a mechanism that really made people understand it. The idea was that all organisms were derived from common ancestors by a process of branching. This explained the fossil record, the similarities of organisms classified together, in that they shared traits inherited from a common ancestor, and the presence of similar species in the same geographic region. So you would have an ancestral species, and then this sort of branching, and there might be an end of a branch reflecting some sort of extinct species. And then you'd have the living species as the leaves of the tree at the end. So today, we generally feel that there's a common ancestor. Just to give you some way-back context, Earth is thought to be approximately 4.6 billion years old.
Life is thought to have occurred as far back as about 4.1 billion years ago, based on carbon isotope dating and also a fossil microbial mat. All cellular organisms share this last universal common ancestor, or LUCA, that dates back to more than 3.8 billion years ago. What really is notable is the shared components of the genetic code and amino acid chirality, which imply this universal ancestor. And now today, you know the famous quote: nothing in biology makes sense except in the light of evolution. So we really have moved towards a lot of evolutionary-based analyses as we appreciate that we should be looking from that LUCA forward. Okay. Moving forward, my favorite simple definition of evolution is descent with modification. It's basically just changes in heritable characteristics that occur in biological populations over generations. Every time there's a new generation, nature beautifully makes little errors, or mixes up sequences, so that you end up with these slight changes. And there's a natural selection process where the fittest survive, and the ones that have some kind of advantage will do better. We see this all the time with SARS-CoV-2 evolution, where we've got all these immune-evasive variants, or variants that can better bind our human ACE2 receptor, which allows the virus to infect better. But I want to emphasize there's also neutral evolution. There are changes occurring, and sometimes those changes don't really make any difference to what's happening with a sequence. But those still provide nice little clock-like changes, and we can use both kinds of changes in analysis. With the advantageous ones, we can learn something about function, and with the neutral ones, we can learn something about distance between samples, for example, in microbial genomic epidemiology.
So these processes give rise, of course, to biodiversity. There are an estimated 8 million living species, with 2.3 million named and about 80% catalogued in databases. That is, I want to say, controversial, because of course there are many phage and viruses that we really haven't touched on and studied yet. Generally it's thought that you've got this bacterial diversity and a gene pool associated with bacteria, then you've got archaeal diversity and a gene pool associated with that, and eukaryotic diversity and its gene pool. And then these gene pools have viral gene pools, for bacteria or archaea, that are basically like a cloud of things popping in and out. Phage diversity, for example, is generally thought to be roughly 10 times greater than bacterial diversity. So appreciate that you've got this diversity in these gene pools, say with bacteria and the viral diversity associated with them, or in Eukarya, which we also care about, with the viral diversity associated with that. Okay, so then one thing that happens, of course, is asexual reproduction, clonal reproduction, where bacteria do a very good job of making copies of themselves, still with built-in bits of error. But there are also other evolutionary processes that blur the boundaries, in particular horizontal gene transfer, or what we also call lateral gene transfer. Lateral gene transfer is primarily infection by phages, but there are also mechanisms that include the basic uptake of DNA, and conjugation, that result in movement of chunks of DNA.
As somebody who studies those a lot, I can say they disproportionately include genes of medical and even environmental adaptive significance. They disproportionately include virulence factors in pathogens, for example, because these chunks that come over usually confer some sort of advantage, especially if it's something that increases the size of the genome, because having that extra chunk literally has a cost and a benefit. There'll be more about that later. What I want to focus on in this module is some terminology, and getting you to appreciate the basics of phylogenetics. So I'm going to go through a number of terms that will be used throughout the workshop. One thing we won't talk about too much is systematics, which is the study of the interrelationships of living things. But you'll hear taxonomy mentioned, which is really the science of naming and classifying organisms, and evolutionary theory is not necessarily involved. What we're going to focus on right now is phylogenetics, which is a field of systematics that focuses on evolutionary relationships, either between organisms or between genes or proteins. And we talk about looking at, say, the phylogeny of some genes. I'm just going to check if there's anything in the chat. Nope. So phylogenetics, particularly for genomic epidemiology, usually involves molecular sequence data, of course, looking at DNA and proteins. But note that phylogenetics can also be based on morphological features. You can do it based on unusual features, for example, that have some sort of potential homology, or shared ancestry. And we usually infer these relationships using phylogenetic trees.
So here's a tree just looking at some relationships among some primate species, where you have these kinds of bifurcating patterns. This is one of several assumptions in many phylogenetic methods: you can't just have one point branch out to many points; you usually have a bifurcating nature. In terms of the terminology for describing trees, there's usually a root to the tree, there are branches, there are nodes, and then there are terminal nodes, which are called leaves. They can also be called operational taxonomic units. That term also gets used when we cluster sequences together in metagenomics analysis, but there we've largely replaced that term with amplicon sequence variants in many cases, just to make it clear that these aren't necessarily operational taxonomic units. Basically, it's a way of looking at these as groups that have an independent name. But the key with these leaves is that they are usually the living species, or the things that you're analyzing. The internal nodes represent some sort of ancestral species before a divergence event. And the ancestor of them all is the root, but as I'll show you, not all trees have a root. Another term we tend to refer to is clades, and you'll hear that a lot. A clade is basically a monophyletic group, in that it includes the most recent common ancestor and all the descendants of that most recent common ancestor. So these sequences A and B here are a clade. And A and B are what we call sister taxa, which are basically species or clades arising from the same node. Here's a second clade, where C and D are sister taxa. And then clade one and clade two together are sister clades. And if you had a further branch here, this could be considered a whole clade, right? So you can have clades of clades, per se.
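The clade definition above can be checked mechanically. Here is a minimal Python sketch, assuming a hypothetical four-leaf tree written as nested tuples (the representation and the function names `leaves`, `clades` and `is_clade` are illustrative, not from the lecture). A set of leaves is a clade only if it is exactly the set of descendants of some internal node.

```python
# Hypothetical tree matching the example: clade 1 = (A, B), clade 2 = (C, D).
# Each tuple is an internal node; each string is a leaf.
tree = (("A", "B"), ("C", "D"))

def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def clades(node):
    """Return the leaf set of every internal node, i.e. every clade."""
    if isinstance(node, str):
        return set()
    found = {leaves(node)}
    for child in node:
        found |= clades(child)
    return found

def is_clade(tree, names):
    """True if `names` is exactly the descendant set of some ancestor."""
    return frozenset(names) in clades(tree)

print(is_clade(tree, {"A", "B"}))            # sister taxa form a clade
print(is_clade(tree, {"B", "C"}))            # not a clade: leaves out A and D
print(is_clade(tree, {"A", "B", "C", "D"}))  # a clade of clades
```

Note that B and C fail the check even though they sit next to each other when the tree is written out, which is the same point made later about not reading relatedness from drawing order.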
I also want to go over the ways that we show these trees, because there are a number of ways you can visualize them. And I would say this is really important. I find people often misinterpret trees by not appreciating the different ways they can be shown, and the ways they can be manipulated that don't actually impact your interpretation of them. So first, note that this is a cladogram. Cladograms only show the branching order; the branch lengths don't have any meaning. It's just basically a way to line up A, B and C, for example. Here it's just showing that A and B share a common ancestor more recently than either does with C, and that A, B and C share some sort of common ancestor. Now note, you can show this in different ways. This tree and this tree are identical in the sense that they're both showing the same thing: A and B have a common ancestor more recently than with C. Sometimes, if you have long names, it's just more convenient to show it this way. And sometimes this way is preferred because it shows that history, or direction of evolution, going up, for example. The branching can sometimes look a bit uneven, like this, and that's your clue that this is probably a phylogram. Phylograms have scaled branches to indicate some level of similarity, such as the number of sequence changes. So this tree indicates that A has acquired more substitutions than B since the time they shared a common ancestor. These branch lengths can be indicated by using a scale bar, a number on the branch, or both. I would say that today, most people use scale bars and don't show the numbers on the branches in most cases. Usually the numbers you do see are what we call bootstrap values, which are associated with nodes, and I'll bring those up in a bit. I just don't want you to confuse branch lengths with bootstrap values.
The key message is that when the tree is drawn vertically, the distance between two nodes is the sum of the vertical branches between them. And when drawn horizontally, it's the same thing, but the distance between two nodes is the sum of the horizontal branches. So for example, the distance between A and B is four plus two, right, which is six. You don't care about this connecting branch; those branches are just there to separate out the different nodes or leaves on the tree. Now, for the distance between B and C, you might want to take two seconds and think what you think it is. Basically, what you would do is look at this one, this one and this two. And so the distance between B and C is four, right? Okay. The idea is that this allows you to look at relationships and get a real sense of the scale, or degree, of similarity. So these trees, again, can be oriented in different ways. You can have them horizontal or vertical. These are just examples of horizontal and vertical cladograms and phylograms. And then I think one of the more important things that people who aren't familiar with phylogenetic analysis don't appreciate is that you can rotate these branches. Think of it like a little rotating hanging thing, a mobile or something like that; I forgot the word. But anyway, you can rotate these, so for example, this tree is equivalent to this tree. All we've done is rotate the branches so that the leaves appear in a different order, right? Okay. And then you could rotate here, so B, C and D end up in a different order again. So you can rotate these in many ways. Just appreciate that because this one is drawn close to this one here, that is not a sign that they are closely related. You have to look at the branches.
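The branch-summing rule above can be sketched in a few lines of Python. The branch lengths here are hypothetical, chosen only so that the results match the two distances quoted in the lecture (A to B is 6, B to C is 4); the slide's actual tree is not visible in the transcript, and the names `parent`, `path_to_root` and `patristic_distance` are illustrative.

```python
# child -> (parent, branch length). Hypothetical lengths chosen so that
# d(A, B) = 4 + 2 = 6 and d(B, C) = 2 + 1 + 1 = 4.
parent = {
    "A": ("AB", 4.0),
    "B": ("AB", 2.0),
    "AB": ("root", 1.0),
    "C": ("root", 1.0),
}

def path_to_root(node):
    """Map each ancestor of `node` (and node itself) to its distance from node."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        dist[node] = total
    return dist

def patristic_distance(a, b):
    """Sum of branch lengths on the path between two nodes."""
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    # The path turns around at the most recent common ancestor, which is
    # the shared ancestor with the smallest combined distance.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)

print(patristic_distance("A", "B"))
print(patristic_distance("B", "C"))
```

Because the calculation only follows branches, rotating the children of any node changes nothing, which is why drawing order carries no information about relatedness.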
So you have to look at the length of this branch, this branch, this branch and this branch to get a sense of how related D and A are, whereas C and D are clearly much more related, because there's just this branch and this branch connecting the two. Okay. So the root of a tree can be very useful as an anchor for looking at your tree. It's the ancestor of all the sequences in your tree. A tree with a root is really useful because it shows the order of descent, or the direction of evolution. Your root is considered the oldest point, and then you go all the way out to the newest. An unrooted tree lacks a root and doesn't show the direction of evolution, so it's less informative. It's often drawn in what we call radial format, where the branches are sticking out in different directions, sort of like a snowflake in many cases. The unrooted tree is still valuable for giving you a sense of how related some things are. But it can't tell you, for example, where the start of this evolutionary tree is. It could be here. It could be here. It could be here, and then these two branched off and these guys all branched off. It could be here, and these two are more closely related to the root. It could be here, and these guys branched off and these guys branched off. It could even be here, you know. So what we tend to like to do is look at a rooted tree. Here's an unrooted tree where we just don't know where the ancestor is. Is it here? Is it here? Is it here? Here? If we have a root, we can say that this started here, and then species four branched off, then species one, then species two and species three. And that gives us a sense of what the ancestral points are along that tree, which can be very valuable. To root a tree, we often use an outgroup, which is basically a sequence that is thought to be ancestral, or more distantly related to all the members of your tree than they are to each other.
So for example, if you have a mammalian tree and you want to root it, you might root it with a non-mammalian vertebrate as an outgroup. So if you want to root this tree and see what the relationship is, you might root it with zebrafish, for example. And that tells you, in either of these trees, that cow diverged first, and then chimp and human have a shared ancestor relative to cow. Okay. So take a second. You've got the answer in your slides, but try to look at this and tell yourself, based on what you've learned so far, what does this tree tell you? I'll just give you a second to look at this and see what kinds of things this tree tells you. And feel free, if you have any questions, to put them in the chat. I'll look at the chat and check in at the end. Okay, so I'm going to tell you right now: the first thing that strikes me when I look at this tree is that I see these differing branch lengths, and that tells me, without even looking at the method, that it's a phylogram. So these branch lengths are probably meaningful, and sure enough, you can look at the method and figure out if that's true. It also appears to be rooted, because there's a line here. But I really want to emphasize that many people, particularly in the old days, would make trees with a method that would show a fake root, which would just be the midpoint of the analysis, and it would not necessarily be the real root. So you've got to confirm that you've actually got a root here. See if it says it was rooted with something, or whether you can see some sort of outgroup species. Say there's another line here showing something that's not mammalian, for example; then you would say, okay, that's a root.
This also tells you that the mouse lineage has undergone some sort of accelerated evolution. Now, this was exaggerated in the figure, but the main point is that it's showing this accelerated evolution, which is true. And actually, fun fact: because mice have undergone accelerated evolution relative to humans, if you look at the average similarity between human and cow, or bovine, genes, looking at the distribution of that, human and bovine genes are actually more similar in sequence than human and mouse genes. And sure enough, in this tree, A plus B plus C is shorter than A plus D, right? So that's the distance between human and bovine, and that distance here, A plus D, is human and mouse. What I want you to appreciate is that this leads to challenges, because when you build a tree, a method can mistakenly conclude that human and bovine, because they share more sequence similarity, share more recent ancestry. So you have to be very careful to look out for cases where accelerated evolution is occurring in a lineage. It can cause some challenges, and I'd be happy to talk about that more as needed. Okay, so let's move forward with how we build a tree. How are we for time? Okay, we're good. So first, we take sequences, obviously, and we're going to end up with a tree. First things first, and I can't emphasize this enough: you're doing a multiple sequence alignment. Okay, you don't just make a tree from your sequences; you have to make an alignment. And this alignment is really important.
Because what you're doing, if you're, say, taking a sequence that goes from "information" to "bioinformatics" over time, is trying to line things up. You're basically implying, in this alignment from "information" to "bioinformatics", that over time there was this insertion of a letter here, this change to an S or a C here, and this "bio" added on at the front. When you show a multiple sequence alignment, the sequences are all jumbled up; there's no nice order, per se, of the sequences in terms of their evolutionary history. And they all represent existing sequences. So you might be looking at "bioinformatics" and "information" and aligning those two, and not seeing all these intermediates. So a sequence alignment is really placing homologous, or shared-ancestry, positions of homologous sequences into the same column. It's assuming that these sequences have some sort of shared ancestry. And it's assuming that this column here is lined up such that this particular residue has shared ancestry across the sequences, and that over time there have just been changes. For example, in this case, you can see there's this W, or "twyptophan", which is my favorite Elmer Fudd way of remembering that amino acid. And you can see here that there are some sequences that have evolved quite a bit, with some changes, and there are some sequences that share the same residues here, but there are different residues in these other places. The main message I want to make is to care about your multiple sequence alignment, because it is lining up all your data and treating each of these items as characters, just as you might look at different components of a skull, and the analysis is done on these characters. Then another important component is your model of evolution.
So you basically have some sort of model saying what kind of change you think is occurring, and that helps to infer the significance of these changes, right? And lastly, you've got the tree-building algorithm, which is usually either looking at different possibilities for trees, or clustering sequences and making trees based on more of a cluster-based analysis. Okay, so there are four steps, then. Again, as I've emphasized, you're constructing a multiple sequence alignment. And I want to emphasize that this has to use good quality sequences, ideally labeled with relevant contextual data. After this, you're going to hear a great lecture about contextual data from Emma Griffiths. Then we determine the evolutionary, or substitution, model to use, build the tree, and then evaluate the tree. And again, sequence quality issues are also going to get brought up, as well as contextual data and more, in future modules. Okay, so there are really two main types of methods I'm going to go through, and I'm going to go through them pretty briefly. I'm not going to go into a lot of detail. There's detail on the slides for reference, and I certainly encourage you to read up more on them if you want. What I want to focus on more is interpretation rather than the methods themselves. But appreciate that there are two main types: distance-based and character-based methods. Distance-based methods take the sequences, take all those columns of characters, and come up with pairwise distances between the sequences. This is nice and fast, actually, because you're converting everything into a distance matrix. Character-based methods use the aligned sequences directly during tree building, and tend to be a little bit more robust because of that. So distance-based methods make a distance matrix using a multiple sequence alignment. Basically, you can count, for example, the number of sites at which two sequences differ.
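To make the "count the sites at which they differ" step concrete, here is a minimal Python sketch that builds a pairwise distance matrix from an alignment. The three toy sequences and the function names `p_distance` and `distance_matrix` are hypothetical, and this is the simplest possible distance (the proportion of differing sites), with no substitution model and no gap handling.

```python
from itertools import combinations

# Hypothetical toy alignment: sequences already aligned, equal length.
alignment = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "ACGAACTA",
}

def p_distance(s1, s2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(s1, s2))
    return diffs / len(s1)

def distance_matrix(aln):
    """Return (names, m) where m[a][b] is the distance between a and b."""
    names = list(aln)
    m = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        m[a][b] = m[b][a] = p_distance(aln[a], aln[b])
    return names, m

names, m = distance_matrix(alignment)
for a in names:
    print(a, [round(m[a][b], 3) for b in names])
```

Once the matrix is built, the individual columns are gone: only one number per pair of sequences is kept, which is why this approach is fast and also why it discards information.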
Common methods include the unweighted pair group method with arithmetic mean, or UPGMA, which we're not going to go into for time, and neighbor joining, which I'll briefly bring up. And the idea is basically... Oh, sorry, just a minute. I need to hide one thing here. So basically, your input is an n by n matrix, M, where you have the distances between sequences i and j. The goal is to build a tree where each leaf corresponds to a sequence in M, and where the distance measured between leaves i and j in the tree, which is d_ij, corresponds to M_ij. Now, I'm going to cut to the chase without getting into this a lot. Suffice to say that a tree that exactly fits this matrix often doesn't exist, so we have to try to find the most closely matched tree. Basically, you can take a tree and get the distances, but going from distances to a tree turns out to be a challenge. Even finding the best-fit tree turns out to be an NP-complete problem, which means it computationally takes very, very long. So we need heuristic, or approximate, approaches, as are used in many areas of bioinformatics. In this case, there's a heuristic method called neighbor joining, which at each step joins the two closest subtrees that are not already joined. This approach starts out with a distance matrix and a completely unresolved tree, what we call a star phylogeny. Then a matrix is calculated from the original distance matrix to determine the average distance from each node to all the other nodes. Based on that, it takes two nodes and says, these two are the most similar, and it puts those two together under a new node. Then the distances between this new node and the two sequences, A and B, are calculated, and the remaining nodes are reconnected to the new node, which I guess I already mentioned. And that's referred to as star decomposition.
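The star decomposition just described can be sketched in Python as follows. This is a minimal illustration assuming the standard textbook formulation of neighbor joining (the Q criterion, which corrects each raw distance by the summed distances of both nodes to all others); the lecture does not spell out the formulas, so treat these as an assumption. The five-taxon distance matrix is a hypothetical toy example, and the internal node names `U1`, `U2`, ... are illustrative.

```python
def neighbor_joining(names, d):
    """Build an unrooted tree from pairwise distances.

    names: list of leaf labels; d: dict of dicts with d[a][b] distances.
    Returns a list of (node, neighbor, branch_length) edges.
    """
    d = {a: dict(d[a]) for a in names}  # work on a copy
    nodes = list(names)
    edges = []
    next_id = 1
    while len(nodes) > 2:
        n = len(nodes)
        # Summed distance from each node to all other current nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair with the smallest Q value (first one wins ties,
        # so input order can matter).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"U{next_id}"
        next_id += 1
        # Branch lengths from the new internal node to the joined pair.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(u, a, len_a), (u, b, len_b)]
        # Distances from the new node to every remaining node.
        d[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # connect the last two nodes
    return edges

# Hypothetical toy distances between five sequences.
names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {a: {b: float(rows[i][j]) for j, b in enumerate(names)}
        for i, a in enumerate(names)}

edges = neighbor_joining(names, dist)
for node, neighbor, length in edges:
    print(f"{node} -- {neighbor}: {length}")
```

The result is a list of edges with branch lengths and no root; rooting it would need an outgroup. The tie-breaking rule in the pair selection is one place where the order of the input sequences can change the outcome.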
So basically what you're doing is saying, okay, these two are most similar, I'm going to put those together. Then I've got this n minus one situation, and I'm going to say, okay, which are the most similar here next? Maybe F and E are most similar next, and I'm going to join those two next. Or maybe this U1 and C are the most similar next, and I'm going to put C off of here, okay? The result is that you end up with an unrooted tree with branch lengths, which is very valuable. And you can see here that it's A and B, and then C, and then F, and then D and E as a result. This tree can be rooted if one of the sequences is known to be an outgroup. So say F turns out to be an outgroup; you could use that to root the tree and imply this history. Note that it won't necessarily find the optimal tree. And one thing I haven't mentioned in the slides, I'm realizing, is that the input order of the sequences can make a difference. So you do want to be careful to realize that this is a heuristic method. However, a lot of testing has shown this really works well, and it's fast. So if you want a fast sense of what's going on, this is a great method to use. But you do have to watch out. One of the issues is that distance matrices throw information away: many distinct datasets can yield the same matrix. You can also have gaps that can really muck up your analysis. Say you have a really low quality sequence, and it's just got part of the sequence of interest; you're missing a whole bunch of this sequence. Well, gaps are not incorporated into distance matrices. The method only looks at columns where you've got a character. So I invite you to look at this and, knowing that gaps are not incorporated, ask: what component of this sequence would be used by a distance matrix method? Okay, obviously, you want to make sure there's a character in all of these columns.
Well, the answer is that just this part, or about half of this sequence, is being incorporated. Okay. So for certain methods, you do have to watch out. In contrast, character-based methods look directly at the sequences. They take gaps into account, and they're generally regarded as giving more accurate trees, so you'll see that most people are using those extensively. Character-based methods, also called discrete methods, operate on the sequences directly. There are two major methods I'm going to mention: maximum parsimony and maximum likelihood. I'll briefly bring up some other methods too, and we'll also have a bit more in subsequent modules. Maximum parsimony involves finding the tree that describes the sequences using the fewest evolutionary steps. It says, let's just find the minimum number of changes we need to infer this tree. Okay. Maximum likelihood involves finding the tree that is most likely to have produced the data, given some sort of model of evolution. So it turns everything inside out and says, okay, we're going to take this model of evolution, and we're going to make a tree, and we're going to see how well that tree reflects the data. Then we're going to make another version of the tree, and we're going to try all these different trees and see which one fits best. So maximum likelihood, again, tries to find the tree using the multiple sequence alignment. It requires a model of sequence evolution, a tree, and the observed data. In other words, given data D and a model M, find the tree T such that the probability of D given T and M is maximized. Okay. Now, this model can be important. A common model is one that just distinguishes, for example, transitions and transversions.
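In symbols, the maximum likelihood criterion just stated can be written as below. The notation is an assumption for illustration: \(\hat{T}\) denotes the chosen tree, \(D_i\) is column \(i\) of an alignment with \(n\) columns, and the factorization into a product over columns relies on the commonly used assumption that sites evolve independently, which the lecture does not state explicitly.

```latex
\hat{T} = \arg\max_{T} \; P(D \mid T, M),
\qquad
P(D \mid T, M) = \prod_{i=1}^{n} P(D_i \mid T, M)
```

In practice the product is usually computed as a sum of logarithms, since multiplying many small per-column probabilities underflows quickly.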
So transitions are the interchange of purines, A and G, or of pyrimidines, which I think of as "CUT" because there's uracil as well. Transversions are the interchange of a purine and a pyrimidine, which is a much more significant change in the sequence. This should be review for many of you, but basically transversions are not favored. Transitions tend to occur at higher frequency, with less impact on the sequence; there are more possible transversions, but because transitions occur so easily, they are what you see more often.
Now consider, for example, the following four sequences and the tree of 1, 2, 3, 4, where 1 and 2 are grouped and 3 and 4 are grouped, shown by these brackets. What we're looking at is the probability of D1, i.e. position 1, given this tree and a model of evolution that says the probability of a transition is 0.3, of a transversion is 0.1, and of no change is 0.6. What we do is calculate the probability for every possible reconstruction of the ancestral sites X and Y. This is what I want to emphasize: we're trying to look at all the possibilities. With four possible nucleotides at each of X and Y, there are 16 values to calculate, and this is what makes the method time-consuming, because it's looking at every possible reconstruction. With these 16 values calculated, we can determine the probability of the column given the tree and model. So we're saying, there are all these possibilities; what is the probability of this particular scenario? Other positions are calculated in a similar manner, and once we have these we put them all together to get the likelihood of one particular tree. But we need to look at all three trees to find the one that gives the largest value. So we're looking at this one, this one, and this one; these are the three possibilities, where
one and two are grouped, and three and four; or one and three are grouped, and two and four; or one and four are grouped, and two and three. In short, and I know I've gone through that really quickly, the crux of what you should care about most is that this requires searching through many possible trees, which is very computationally intensive. The evolutionary model can also include other things, like time. So this means considering each topology, and the branch lengths for each topology. The result is nice, though, because it's looking at all these possibilities and gives you an actual likelihood for that tree. But I do want to emphasize that it's dependent on the model of evolution used, so you really have to appreciate the model and what kind of changes you're looking at.
There are other approaches. There are Bayesian approaches, which will also get alluded to. These apply Bayes' theorem to estimate a probability distribution for a population of interest, and have the ability to incorporate prior information, like a prior distribution for outbreak onset time. There'll be more about that. Another thing I want to highlight is recombination-aware approaches. Recombination invalidates most approaches, since the columns of your multiple sequence alignment, again, have to be homologous. So it's important, as part of your analysis, to detect possible recombination before performing a phylogenetic analysis, and then ideally you're only performing the phylogenetic analysis on the subset of sequences that you know are homologous and don't have recombination. There are some approaches for how to deal with that, but that's a bit beyond the scope of this short module; you could take a whole four-year degree in molecular evolution analysis. I also want to emphasize that different kinds of methods have been developed that are quite novel. UShER, for example, enables very rapid SARS-CoV-2 analysis by placing a sample
on a very large existing tree, rather than having to redo the whole tree construction, which allows very rapid analyses. We're going to cover more, like Bayesian approaches and time trees, in the phylogenomics module, which is awesome; this is where you really get into the meat of what can be done and what is possible with genomic epidemiology analysis.
Okay, so what is the best tree-building approach? I get asked that. There's no single method that is really best for all circumstances. It depends on the size and complexity of the dataset, the speed of your computing resources, and why you want to perform the analysis. For example, if you just want to quickly find out where a SARS-CoV-2 sequence should be placed in a tree, something like UShER is a very good approach. If you've got a very small number of sequences and you really care about getting a very accurate result, you can do a maximum likelihood or Bayesian approach, or maximum likelihood with some different models; you can try a few and see if you start to get some consensus. When it comes to identifying the best model, you can also try a few and see how well they fit the data. jModelTest is an example of a tool that looks at different evolutionary models of nucleotide substitution. But the key message is that there's no single best method. Often you're trying to do the best you can based on the size of the tree you have and the speed of your computing resources.
I also want to talk about tree evaluation in the last bit of this. Bootstrapping is a very popular approach for testing whether the whole dataset is supporting the tree, or whether the tree you're getting as a result is just a slight winner among nearly equal alternatives. You really want to know how much this is reflecting the true tree. So what you do is generate new datasets, usually a hundred or a
thousand, that are slightly perturbed from the original dataset. For example, you take the columns of the multiple sequence alignment and duplicate some and remove others, doing little perturbations to say, okay, if we added this column twice and removed that column, what happens to the tree? Does it still give the same branching order? Each replicate tree is built using the same method as the original tree; you're just running it the same way on these slightly perturbed datasets. You do it, say, a hundred times, and you see how many times out of a hundred you get the same branching order. In particular, you label the tree at the nodes with numbers indicating how often that cluster occurs in the trees made from the replicates, or a consensus tree is built and labeled. In this case, the tree is showing that 1 means a hundred percent and 0.51 is 51 percent, and you're taking this tree and saying, this is how often this node occurs. So this branch here is saying that virus three and virus four grouped together a hundred percent of the time. But this one here is saying that only 51 percent of the time these two grouped together. And, to finish what I was saying earlier, you can either overlay these values on your original tree with branch lengths, or you can use a consensus tree. But I want to emphasize that these bootstrap values are only about the branching order, nothing to do with branch length. It's saying that 51 percent of the time virus six and virus seven joined together, and that's sort of a warning for you that maybe
they really don't have a shared ancestral relationship. It might be that in other versions of the tree virus six was branching off, say, over here and virus seven was just its own thing, or vice versa, or maybe virus six or virus seven had some other kind of placement. Often you can get a bit of a hint from the branch lengths here. This is sort of saying that virus six, virus seven, and this whole clade including virus one to virus five all branched off at about the same time, and it's hard to resolve what's going on there. But the main point is that you want to get a sense of what the reliable branches are. Like this one is branching off 100 percent of the time, and this virus eight is branching off 100 percent of the time. A bootstrap value of over 70 percent is generally considered good support for the cluster.
Also useful is identifying unique characters versus homoplasy. For example, in animals we have a series of characters that have arisen once in evolution and are unreversed, for example fur in mammals. If you go into the middle of the Amazon and see an animal you've never seen before and it has fur, you can instantly say that it is very likely a mammal. So it's very useful for inferring relationships. It's similar with sequences: if you see a certain character that is strongly associated with a particular clade of bacteria or a particular clade of SARS-CoV-2, you can infer that the sequence is most likely part of that clade, because it has this unique, identifiable character. So if you find an indel, for example, only in sequences associated with an outbreak in a certain geographic location, and the phylogeny suggests that a new sequence found elsewhere may
have originated from that location, it's nice if you can check whether it contains that unique indel. That would further support your phylogeny, saying, okay, that sequence from Ontario may really be related to that BC outbreak, because it's got that additional unique character. So just remember that, in addition to some of the analyses you'll see.
Also useful is identifying convergent evolution. You have, for example, the great convergence of SARS-CoV-2 variants; you may be familiar with names like BA.5 or BQ.1. This is an older picture, but the main point is that as all these variants evolved, they independently started to gain the same immune-evasive mutations, because those are so advantageous for the virus. Those can be useful: if something keeps showing up, keeps on evolving, and there's selection for it, that implies there may be something useful about it, and you can functionally study that further.
Also useful is identifying orthologs, paralogs, and xenologs. I understand that's been mentioned already a bit; apologies, I wasn't able to see that part. Just briefly, orthologs tend to have similar functions, so they tend to be of interest. You've got a duplication and a speciation here, for example, but if the sequences are diverging only due to speciation, this H sequence and M sequence are considered orthologs. With paralogs, you've had a duplication. It doesn't make sense to have two copies of something doing the same thing, so usually they have some divergence in function. So you do want to look out for paralogs, because they can have differing function at some level. It might be a duplication of a transporter in a bacterium, where one copy takes up arginine and the other takes up another amino acid; maybe they both take up positively charged amino acids, for example, so there's a functional change at some level. Also
xenologs are of interest because they tend to impart novel adaptive functionality to the recipient organism. I want to emphasize that with orthologs you're checking that the gene tree matches the species tree, so you really have to look at that, or you can do a reciprocal best BLAST hit analysis. With paralogs there are multiple copies in the same species. With xenologs the gene doesn't match the species tree, even though the rest of the tree does.
Very briefly, because I'm running out of time: there are in-paralogs, where the species divergence occurs before the duplication. So this is an in-paralog here, where you've got a species divergence and then a gene that duplicated in one species to give two copies. And then you've got out-paralogs, which are the vast majority; big transporter families are all out-paralogs, from duplications that occurred before species divergence, and you've got many, many kinds like that. So this is very useful. One thing that's notable is that, for example, you can infer that these species M and species H2 genes are more likely to be functionally similar. With H1 and H3 it's less clear what's going on, whether one or the other has diverged in function relative to what M1 is like, because there's this paralogous relationship. There are lots of different examples; duplications of genes within bacteria that occur as in-paralogs are one example.
Okay, closing comments in my last minute. Phylogenetic analysis allows us to estimate or infer evolutionary relationships between genes or proteins. I'd really like to emphasize that we talk about inference, because we are never saying for sure what happened. We have seen evolution in action: we certainly have seen it with SARS-CoV-2, and you may have read about the Manchester moths and how they changed to be darker colored during the industrial revolution in the UK, when the city environment gave them an advantage if they were darker colored in
the soot. So we have seen evolution, but we're always inferring what's happening. The second thing I'd like to emphasize is that data quality is paramount; we're going to talk more about that later. We have covered some basic methods, but I do want to emphasize that more complex and efficient methods are needed to deal with genome-scale data, and we'll look at some of these techniques and approaches in the rest of the workshop. And note that a phylogeny is best interpreted if you have well-incorporated contextual data, such as geography and patient information; there'll be some more about that.
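Coming back to the bootstrapping procedure covered earlier in this module, here is a minimal Python sketch of the idea. It resamples alignment columns with replacement, rebuilds the clusters with the same method each time, and reports how often each original cluster reappears. The function names are hypothetical, and `closest_pair_cluster` is only a toy stand-in for a real tree-building method, which would be passed in through the same `build_clusters` argument.

```python
import random
from collections import Counter


def resample_columns(alignment, rng):
    """New alignment of the same size, with columns drawn with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_support(alignment, build_clusters, n_replicates=100, seed=0):
    """Fraction of replicates in which each original cluster is recovered.

    build_clusters takes an alignment and returns a set of frozensets of
    sequence indices (the clusters, i.e. the branching order, of its tree).
    """
    rng = random.Random(seed)
    original = build_clusters(alignment)
    counts = Counter()
    for _ in range(n_replicates):
        replicate = build_clusters(resample_columns(alignment, rng))
        counts.update(original & replicate)   # clusters seen in both
    return {cluster: counts[cluster] / n_replicates for cluster in original}


def closest_pair_cluster(alignment):
    """Toy stand-in for a tree builder: just the most similar pair."""
    n = len(alignment)
    pairs = [(a, b) for a in range(n) for b in range(a + 1, n)]

    def differences(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))

    return {frozenset(min(pairs, key=differences))}


if __name__ == "__main__":
    # Hypothetical toy alignment, for illustration only.
    alignment = ["ACGTAC", "ACGTAT", "TCGAAC", "TGGAAT"]
    print(bootstrap_support(alignment, closest_pair_cluster))
```

Note that the support values only describe how often a grouping is recovered, that is, the branching order; they say nothing about branch lengths.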