So I'm going to be talking about WGS-based subtyping in module five, in the context of bacterial analysis. Joining us later will be Jimmy Liu, an SFU grad student, who will be handling the lab component. A bit about me: my name is Ed Taboada. I'm with the Public Health Agency of Canada, in a genomic epidemiology research unit, and our focus is research and methods development around the ecology and epidemiology of bacteria. For this module, don't be scared of the number of slides. There are a lot of slides, but I wanted to take you through the historical aspects of molecular typing and its application to molecular epidemiology, and then start talking about WGS, subtyping from WGS, and some of the analytical considerations. I also wanted to talk a little bit about genomic surveillance in a broader context. Some of our previous speakers have spoken about data sharing and metadata, and I wanted to give you a bit of an overview of the surveillance that happened prior to COVID, when foodborne surveillance was really where a lot of the activity was happening. So again, lots of slides; a lot of it is for context, and we may have to move quickly through some in order to get through the deck, so I'll keep an eye on time.

So, typing in epidemiology. As you know, infectious disease isn't distributed homogeneously in a population. There are differences in exposure to risk factors and, accordingly, differences in the distribution of disease outcomes. The term molecular epidemiology was coined to capture the fact that we're trying to use molecular approaches to identify and characterize infectious disease agents, so that we can infer transmission and try to prevent and control disease. And the molecular epi paradigm has always been fairly simple.
At the end of the day, if you have patients that are ill, and we're able to recover a pathogen from those patients, we expect that where there is epidemiological concordance between cases, there should also be genetic concordance between the pathogens isolated from those cases, and vice versa. So really, a lot of this started with the development of methods to assess genetic similarity between isolates. This goes back 100 years, to when people started using biological approaches such as culturing, moving on to things like serotyping and so forth. Those old-school methods, at their core, were all about trying to identify the possible genetic similarity between pathogens. Now, this evolved in the 90s and onward into a bunch of methods derived from molecular approaches such as PCR: running things on gels, comparing gel patterns and so forth. DNA fingerprinting, as it was called. It also led to the emergence of sequence-based methods like MLST, in which Mark Achtman was a prominent researcher; these were essentially yet more typing methods. All of these methods came out at a time when genomics was in its infancy. There were obviously molecular tools, and we had learned to harness them to develop methods that served as proxies for evaluating the genetic similarity between isolates. But it also was a bit of a wild west: people would develop these methods, release them into the wild, publish a paper claiming it was the best thing in the world, and then a lot of the validation and testing would be left to other people reading the paper and wondering, "is this method going to work for me?"
Now, in terms of the application of molecular methods in the context of public health surveillance, some of this has already been covered, but I'll just reiterate: you're collecting samples from potential sources of exposure, while at the same time recovering isolates and performing genetic analysis, comparing the genetic data of what we think are matching isolates, and then examining the epidemiology of those isolates. More to the point, certainly in foodborne disease, the emphasis is on outbreak detection, and the only way to detect a possible outbreak is to find multiple cases, a case cluster, where everybody is sharing the same strain of the pathogen. At the same time, if you want to prevent and control, then you need to be looking out there in the One Health context, at foods, animals, water and so on, looking for potential matches, so that having found potential matches you can then inform the population about ways to avoid exposure. But the problem has always been the limitations of molecular subtyping, certainly prior to the advent of genomics. With a lot of these methods, you get a lot of matches, and it's impossible to determine how significant those matches are. If you tell me, "hey, my name is John Smith," and I say, "hey, my name is John Smith too," that's not as significant as if I said, "hey, my name is Ed Taboada," and you told me, "my name is Ed Taboada too." Because I'd be like, really? It's not a very common name, whereas Smith is. So that sort of context is really important. Now, in terms of WGS-based subtyping: in the 90s and the 2000s, a lot of these methods were being developed. We had sequencing, but sequencing was cumbersome at the time.
And then you have the emergence of high-throughput sequencing, next-generation sequencing, and that set the table for WGS-based subtyping. I recall, about 10 years ago, being at a conference in France, one of the really good molecular epi type conferences, where the discussion was really about the fact that, on the one hand, we have genome sequencing coming, but we still don't have a ton of data. In the meantime, you have these databases of molecular subtyping data; for example, for PFGE there was PulseNet, and these databases would have data on thousands of isolates. So the whole idea at the time was: okay, we have WGS, which is awesome, but we don't have a lot of it yet. We also have the subtyping databases, with a lot of data in them, but we know those methods are not the greatest. And the idea was: can we bridge the two together through so-called in silico typing? Because at the end of the day, a lot of these methods, whether phenotypic or molecular, can ultimately be tied to something in the DNA. So you have the opportunity to search for the targets upon which these various typing methods were based, using the WGS data, and to infer the type based on the findings of those in silico searches. And I'll use a self-serving example, in silico serovar determination, because we built a tool for doing this. Certainly at the time, serotyping had been around for 100 years, and it is embedded in our surveillance apparatus. It can be quite informative. It doesn't necessarily have a ton of resolution, but it can be very informative, and certainly we know a lot about it. Epidemiologists know it; researchers know it.
So if I tell you that this infection is Salmonella Enteritidis, everybody has an idea of what we're dealing with. And certainly from an in silico typing perspective, we know the O antigens and the H antigens, and we know the specific sequence variants that lead to the differences among the different serovars. If you put this all into a sort of decision-tree scenario, those serovar rules can be used to infer serovar in a vast number of cases. Maybe not some of the weirdos where we don't have a lot of genome sequence data, but certainly the big ones that cause a lot of disease can be predicted very easily. And in time we developed a tool called SISTR, and the idea was: we'll do the serovar prediction as a matter of course. As soon as an isolate has been sequenced, we will perform the in silico prediction, and it will be there, because people still want to know their serovar. And soon enough, we won't really need traditional serotyping, because there will be such a mass of data. At that same meeting in France I was mentioning, Public Health England announced that beginning the next year, in 2014, they would start sequencing every single Salmonella that came through their surveillance. It was a fairly significant thing, because they were essentially saying: from now on, we are doing genomics-based surveillance in the UK. I was shocked and excited at the same time; I was really shocked that they would announce that, and I think they basically pushed everybody else into the same boat, so that we all felt like, you know what, if the UK is doing it, we should be doing it too.
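The decision-tree idea above can be sketched very simply: once you have in silico calls for the O and H antigens, serovar inference is essentially a lookup against the antigenic formula table. This is a toy sketch, not SISTR itself; the antigen-to-serovar entries below are illustrative, and a real tool first has to resolve the antigens from sequence variants.

```python
# Toy antigen-formula lookup for serovar inference (illustrative only;
# real serotyping uses the full White-Kauffmann-Le Minor scheme and
# must first call antigens from sequence data).

SEROVAR_TABLE = {
    # (O antigens, phase-1 H antigen, phase-2 H antigen) -> serovar
    (("4", "12"), "i", "1,2"): "Typhimurium",
    (("9", "12"), "g,m", "-"): "Enteritidis",
}

def infer_serovar(o_antigens, h1, h2):
    """Look up a serovar from in silico antigen calls; None if unknown."""
    return SEROVAR_TABLE.get((tuple(o_antigens), h1, h2))

print(infer_serovar(["4", "12"], "i", "1,2"))   # Typhimurium
print(infer_serovar(["3", "10"], "e,h", "1,6")) # None (not in toy table)
```

The point is that once the antigen determinants are callable from WGS, the serovar falls out deterministically for the common, well-characterized serovars.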
Here in Canada it took us a little while to catch up, but as of 2017 we basically started doing the same thing. So the WGS-based subtyping paradigm is no different: we're still trying to estimate genetic similarity, but now we can use whole genome sequence data. A lot of this is ultimately based on comparative genomics principles, because there are all sorts of different levels of genetic variation in a bacterial genome, and all of those can be brought to bear in the development of methodologies for genomic surveillance. Now, I'm going to discuss multilocus sequence typing (MLST). If you'd asked me 15 years ago whether I'd ever be talking about MLST fondly, I would have called you crazy, because at the time I was not a big proponent of it, and I will tell you why as the slides proceed. MLST was developed by Martin Maiden at Oxford, and the approach is fairly simple. Remember, this was a time when you couldn't sequence 300 isolates in a sequencing run. The idea was: let's analyze seven to nine genes, PCR-amplifying them and sequencing them the old-fashioned way. We then take that sequence data and put it through an analytical approach where you infer the alleles, and the combination of alleles is used as the subtype, with centralized data. Actually, this was a pretty cool thing too, because it really was the first time there was an open database of subtype information based on sequence data, where people could upload their sequences, generate a subtype, and we would all know what it means. In many respects, it became sort of the gold standard for molecular typing and molecular epi.
Because it supplanted comparing gel fragments with sequence data. I shouldn't overstate it in terms of ambiguities, but certainly in the context of MLST, with curation and so on, you move from fragments on gels to sequence data, which is a lot less ambiguous. A lot of different schemes were developed for some of the heavy-hitting pathogens, and the methodology has been used in hundreds of thousands of studies. Now, the weird thing about MLST is that it's not a proper phylogenetic analysis. It's certainly nothing you would compare to some of the stuff Fiona has mentioned, or some of the stuff that's going to be covered in other modules, and let me explain why. If I were going to perform a phylogenetic analysis on one of the genes amplified in MLST, I would take that sequence data, make a multiple sequence alignment, analyze it with evolutionary modeling, and generate a tree. You could even extend that to all seven genes in MLST. But in actual MLST, you don't do the evolutionary modeling, and you sure as heck don't end up with 3,500 data points' worth of sequence data. It collapses each gene into a single allele, and it's the combination of the seven alleles that you use in your analysis, which sounds a little bit crazy until you dig a little deeper into why one would take such an approach. So yes, we're essentially comparing seven loci, seven data points. Because of that, I'm going to take a little detour into sequence evolution: we're going to talk about recombination and mutation, which have already been alluded to.
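The "collapse each gene into an allele" idea can be made concrete with a few lines of code. This is a minimal sketch of the bookkeeping, with made-up sequences and locus names; a real MLST database adds curation before assigning new allele numbers.

```python
# Sketch of the MLST idea: each gene's sequence is collapsed to an
# allele number via a lookup table, and isolates are then compared
# only by their seven-number allelic profiles, never by the
# underlying sequences. Sequences and loci here are made up.

def call_allele(locus, seq, db):
    """Return the allele number for seq at this locus, assigning the
    next number if the sequence is new (a curated database would
    verify the sequence first)."""
    alleles = db.setdefault(locus, {})
    if seq not in alleles:
        alleles[seq] = max(alleles.values(), default=0) + 1
    return alleles[seq]

db = {}
print(call_allele("aspA", "ATGAAACGT", db))  # 1 (first sequence seen)
print(call_allele("aspA", "ATGAAACGA", db))  # 2 (a new variant)
print(call_allele("aspA", "ATGAAACGT", db))  # 1 (same as the first)

# An isolate's subtype is just the tuple of allele numbers across loci:
profile_a = (2, 1, 1, 3, 2, 1, 5)
profile_b = (2, 1, 1, 3, 2, 1, 5)
print(profile_a == profile_b)  # True: same sequence type
```

Note how all information about *how different* two alleles are is thrown away: allele 1 and allele 2 might differ by one SNP or by a wholesale recombinational replacement, and the profile comparison treats both cases identically. That is exactly the property discussed in the recombination detour that follows.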
So, when you're doing phylogenetic analysis, you're really relying on a series of stepwise acquisitions of mutations. That's inherently a vertical process. But when you have species where recombination is a significant feature of their evolution, you can have alleles parachuting in and replacing other alleles. That stepwise acquisition of mutations is completely gone, because an allele can be replaced by a completely different allele, or even by foreign DNA from a different or closely related species. That causes significant problems in terms of phylogenetic signal, and it distorts phylogenetic relationships. One of the things about bacterial populations is that you see a variety of population structures. On one extreme, you have clonal populations that essentially evolve through the nice stepwise acquisition of mutations, the old-fashioned way. You have what are called weakly clonal species, where there's some recombination happening, but the stepwise mutation happens primarily within lineages. You have epidemic population structures, where the vast majority of the population is not really participating in this stepwise evolutionary descent; there's a lot of recombination happening at low levels throughout the entirety of the population. And then lastly, you have the panmictic population structure, which I have a tough time saying, where you have essentially a free-for-all of recombination, and it makes it really difficult to do proper phylogenetic analysis. All this to say that different populations will have different combinations of recombination versus mutation.
A lot of species have a bit of both, and you're left having to figure out the relative contribution of either process. But all this to say that high recombination will have the distorting effect we've been talking about. With epidemic population structures, basically what happens is you have a lot of different micro-lineages in the background, exchanging material recombinationally amongst themselves; it's very much a network-like situation. Every once in a while, you'll have a clone that has gained some sort of foothold in the population, maybe by exploiting a niche. It is these sublineages that then evolve in a more tree-like fashion, and that are more amenable to proper phylogenetic analysis. For many pathogens, this type of population structure makes it really difficult to figure out lineage-to-lineage phylogenetic relationships, so really the only parts of the population structure where you can do proper phylogenetic analysis are these clones. In the context of typing, then, it becomes more important to be able to identify these clones, so that we can do more proper analysis within them. Phylogenies between the different sublineages are something for another day, or more of a research purpose. So let's get back to my joking about the fact that instead of using, say, 3,100 data points for molecular evolution analysis, we're distilling everything into seven data points. One of the reasons why this is advantageous is that by collapsing a particular gene into a single data point...
...you are thereby bypassing the potential introduction of multiple mutations through recombination, as often happens in recombinogenic species. For MLST, this epidemic structure is really the prototype. There's a group led by Ed Feil, at Bath now I believe, who developed an algorithm called BURST that enabled the clustering of MLST data using this epidemic type of approach. BURST was then replaced by updated approaches called eBURST and goeBURST (globally optimized eBURST). Basically, at the end of the day, BURST and its successors are all based on trying to identify central types, here depicted in red, that are related to a cloud of additional subtypes that are very similar, maybe with one, two or three allelic differences, and where you've done the mathematical analysis to figure out that the type in the center is the one that takes the fewest differences to capture the cloud. That becomes the central type, what we call the founder of the clonal complex. I see you have a question, but I don't know if I'm going to be able to hear it. Maybe you can type it? Okay. Sorry about that; sorry if I cut across the other presentation. I see Martin is also typing. Okay, maybe I'll continue, and once the question pops up, I'll come back to it. So this eBURST approach defines clonal complexes, that's what we call them, clonal complexes: a central type surrounded by all of its single-, double- and triple-locus variants. And if you perform this on a population, it looks like a nice constellation of little micro-clusters. You might wonder whether convergent evolution would cause problems for this type of analysis. Here's the thing: it doesn't pretend to be an evolutionary analysis in the least, quite frankly.
We're trying to identify lineages, as I was saying before. It's all about the lineage, man, and the relationship between lineages is an afterthought. Ultimately, when you're dealing with these epidemic-type population structures, you primarily care about those lineages that have a public health consequence. The only thing I will say is: put yourself back 20 years. You didn't have the data, you couldn't sequence your way out of this, and tools for phylogenetic reconstruction that were aware of recombination were in their infancy. This was the pragmatic solution. Was there another question? Okay, I'll see it come up in the chat if anyone has additional questions. Anyway, it's hard for me to defend this because I always thought it was so crazy, but we're going to get there. So, eBURST and MSTs served this purpose at the time. I'll take you through the nomenclature of MLST, because it's part of its charm. The alleles of the different genes each have a number, assigned first come, first served: the first allele discovered for a particular gene is given allele number 1, the second one that comes into the database is called number 2, and so on. So the allelic profile for ST-21 is 2, 1, 1, 3, 2, 1, 5. I need not repeat the alleles; everyone knows what ST-21 means. Now, I was talking about single-locus variants; here's an example. ST-50 differs from ST-21 at one locus, the gltA locus, where it carries the 12th allele that was found. A double-locus variant differs at two loci, and so on and so forth. Okay. And I talked about clonal complexes and the founder, which in this case is ST-21.
And you'll notice that all of the other STs in the table have a similar profile: they share anywhere from four to six of the same alleles as the founder, obviously not all seven, or else they would be ST-21. So it illustrates one of the first efforts to develop a nomenclature. Nomenclatures are really, really important, because they represent a way of telling the world "this is what I've got," with everybody understanding what that means. Without having to share the sequence data, I can tell you I have an ST-21 isolate, and we already know what we're talking about. People describe this as a portable approach, and MLST was the first portable approach to typing because it was sequence-based: everybody knows what the different STs mean, and all you need to know is the name. But I also told you that I wasn't a big fan of MLST, and so now I'm going to show you the other side. The first issue: even back in the day, we already knew that just because you say this is ST-21 and that's also ST-21, the two weren't necessarily very similar at the genome level. Here you see a dendrogram, with a distance matrix attached to it, showing the diversity of a whole pile of ST-21 isolates that are supposedly identical. Seven genes, on a genome that may have thousands of genes, is not a lot of information on which to make such grandiose pronouncements about this being the same as that. The other issue with MLST was that the datasets tended to be overrepresented with some really heavy-hitting STs, so it has very limited usefulness for epidemiological investigation.
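The single-locus-variant bookkeeping behind these clonal complexes is easy to sketch. This is a toy version of the eBURST idea (the real eBURST/goeBURST algorithms add tie-breaking rules this ignores), and the ST-53 profile below is illustrative, not the real database entry.

```python
# Toy eBURST-style founder identification: count, for each ST, how many
# other STs are single-locus variants (SLVs); the putative founder of a
# clonal complex is the ST with the most SLVs.

profiles = {
    21: (2, 1, 1, 3, 2, 1, 5),
    50: (2, 1, 12, 3, 2, 1, 5),  # differs at one locus: an SLV of ST-21
    53: (2, 1, 1, 3, 2, 1, 3),   # alleles illustrative, also an SLV here
}

def locus_differences(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def slv_counts(profiles):
    return {
        st: sum(
            locus_differences(p, q) == 1
            for other, q in profiles.items() if other != st
        )
        for st, p in profiles.items()
    }

counts = slv_counts(profiles)
founder = max(counts, key=counts.get)
print(founder)  # 21: it has two SLVs, the others have one each
```

Note that the comparison is purely on allele identity; the magnitude of sequence difference between alleles never enters the calculation.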
In many respects, MLST is good for long-term tracking and surveillance, but you would never use it in an outbreak situation. So when whole genome sequencing started becoming more routine, with high-throughput sequencing and so forth, the folks who had developed MLST in the first place started suggesting: hey, why don't we just do MLST at the genome scale? That way we could harness the data while keeping some of the same principles that had made MLST the gold standard method. At the same time, you could sequence the whole genome and extract the MLST data, so that it would be backwards compatible. And in time, you would be able to extract a lot more loci, making it way more powerful. So scaling up MLST became a thing that needed to be done. And it should be super easy, right? We already know how to do MLST; now we just throw everything into an MLST schema. Now I'm going to tell you why that's not advisable in the least. One of the first problems is that not all genes are present in all strains. When it comes to bacterial genomes, we have this concept of core genes and accessory genes. The first paper that illustrated this came out in 2005; it was an analysis of eight genomes of Streptococcus agalactiae. They showed, and it was kind of mind-blowing at the time, that while the genomes surely shared a lot of stuff, they were also very different. This gave rise to the pan-genome concept, where you have core genes that are essentially shared by all members of the species, and accessory genes that have varying levels of presence in the population. The combination of the two is what we call the pan-genome of the species. So there isn't a genome for E. coli; there is a pan-genome for E. coli.
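The core/accessory/pan-genome distinction reduces to simple set operations on gene content. A minimal illustration, with made-up gene names:

```python
# Core = genes shared by every genome; pan = everything seen anywhere;
# accessory = variably present. Gene names are invented for illustration.

genomes = {
    "strain_A": {"gyrA", "recA", "rpoB", "prophage_1"},
    "strain_B": {"gyrA", "recA", "rpoB", "plasmid_tox"},
    "strain_C": {"gyrA", "recA", "rpoB"},
}

core = set.intersection(*genomes.values())
pan = set.union(*genomes.values())
accessory = pan - core

print(sorted(core))       # ['gyrA', 'recA', 'rpoB']
print(sorted(accessory))  # ['plasmid_tox', 'prophage_1']
```

In real data the intersection shrinks as you add genomes (and as low-quality assemblies sneak in), which is exactly the core-definition problem discussed later.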
The pan-genome comprises the core, found in everybody, plus the accessory, the collection of genes found among all of the E. coli that have ever been looked at; that totality is the pan-genome. But here's the problem: accessory genes, by definition, are not found in every single genome, and we have no a priori knowledge as to whether or not a particular genome should carry a particular gene. If it's accessory, we just don't know; it could be a coin flip. That makes it problematic from an analytical point of view, because we already discussed the fact that we're dealing with genome assemblies, not complete, circularized genomes. We've talked about reference-guided assembly and de novo assembly; in a lot of cases, we're dealing with de novo assembled genomes, and we end up with contigs and gaps. The better the assembly, the fewer contigs and the fewer gaps, but at the very least, however many contigs you have, minus one, that's the number of gaps. So if you have 50 contigs, you have 49 gaps to deal with. And generally speaking, you're dealing with the potential collision between one of the loci you're going to try to use and one of those assembly gaps. If you can't define the locus, then you can't call it, which means no data for that locus. And again, the problem is that we don't even know if that gene should be there in the first place, because maybe it's accessory; we don't know anything about the strain and whether it should have that gene. So now we can't tell whether it doesn't have the gene because it's not in its accessory genome, or whether the gene is just missing from an incomplete assembly. Especially as the datasets get larger...
...you end up with the near certainty that you're going to have loci with missing data. In any dataset, there's a baseline of genes with incomplete assignments, and then there are certain genes that are just problematic throughout. You can't escape missing data, so you're going to have to do quite a bit of quality control to make this work. The other thing, and this was alluded to by Fiona and I think Will also brought it up, is the concept of orthologous and paralogous genes. Suffice to say that we're dealing with potential gene duplication, and we can't necessarily match one duplicate in one strain to the same duplicate in another strain. In order to map that properly, so that we're comparing oranges to oranges and apples to apples, some fairly sophisticated analysis has to be performed. We're lucky that Fiona and her team have worked on this problem for a long time, but still, this is not something you want to be dealing with on the regular. So yes, the types of loci that may be duplicated just have to go. We also have the issue that certain genes show variability in sequence and/or in length. As you start scaling up to datasets comprising thousands of genomes, you're going to see all kinds of weird stuff, and some of it doesn't even have to do with biology; some of it could just be the propagation of assembly errors and so forth. Because of that, genes with a lot of length variability, or significant sequence variability, where it's not always possible to know whether what you're seeing is real, it's better to just get those guys out. So, cgMLST.
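The quality-control step described above, dropping loci that are missing in too many genomes, is straightforward to express on an allele-call matrix. A minimal sketch; the use of 0 to mark a failed/missing call is an assumed convention (tools differ), and the threshold is illustrative:

```python
# Locus-level QC for a cgMLST allele matrix: drop any locus whose
# missing-call rate exceeds a threshold. 0 = missing/uncallable
# (an assumed encoding; conventions vary between tools).

calls = {
    # locus -> allele call per genome
    "locus_0001": [1, 1, 2, 1, 1],
    "locus_0002": [1, 0, 0, 0, 2],  # missing in 3 of 5 genomes
    "locus_0003": [3, 3, 0, 3, 3],  # missing in 1 of 5 genomes
}

def keep_loci(calls, max_missing_frac=0.05):
    """Keep only loci whose fraction of missing calls is acceptable."""
    kept = {}
    for locus, alleles in calls.items():
        missing = sum(a == 0 for a in alleles) / len(alleles)
        if missing <= max_missing_frac:
            kept[locus] = alleles
    return kept

print(sorted(keep_loci(calls)))  # ['locus_0001']: the others exceed 5%
```

In practice you'd also look at the pattern of missingness per genome, since a few bad assemblies can make many loci look unreliable.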
We're focusing on core genes because it solves several of the problems I was talking about. Especially in the context that, with old MLST, there were curators: you'd think you'd found a new allele, you'd send them an email saying "can you look at this allele and tell me if it's real," and a person would get back to you and say, "yeah, I think that's a new allele, that's a new sequence type, let's make it so." That's not going to happen in a genomics environment, so it pays to get a lot of this sorted out ahead of time, and cgMLST is one of the ways to do it. Because core genes are shared by all members of the species, they should be there; you don't have to wonder whether a gene isn't there because the assembly is no good, or because that particular assembly didn't have a complete version of the gene. At the same time, core genes tend to display mostly SNP-level variation, which, in terms of homology searching, makes it a lot easier to map things: the start of the gene, the end of the gene, and knowing that you've got the right thing. So, as I say here, the core provides a robust foundation for genome-based typing. In terms of designing schemas, I'm going to talk about a particular piece of software developed by some colleagues in Portugal, called chewBBACA. The thing about it is, if you ever had to develop a schema, this is the tool you would use. Hopefully you come into the job and someone's already developed the schema for you, and then you just have to apply it. But if you have to start from scratch, this is the way to do it, because chewBBACA implements a lot of the concepts I've been talking about: loci that show extreme variability in length and/or sequence...
...loci that tend to end up with truncated calls, paralogous genes, and so forth: it gets rid of them. It defines the core, and then removes the problematic genes within the core, so that you're left with a really well-behaved schema comprising core genes only. Not only does it create the schema, it also lets you evaluate the schema, so that you can figure out, okay, this is my core, how much of the core has to go. It also does the allele calling. So yes, a very multi-purpose tool, and as I say here, it helps to standardize the development of schemas using some really well-defined principles. But there are still things you need to do when you look at data. If you use chewBBACA, a lot of these things are no longer something to worry about, but certainly before chewBBACA, if you were trying to do this by hand: trying to define a core is actually more problematic than people think. You'll have a dataset, and you're trying to define the core from it. "Great, why don't I just choose the genes that have complete data across all the genomes," you could tell yourself. And then I would tell you: well, how do you know you don't have some crap genomes in your dataset, driving down the number of core genes you're identifying? I can tell you from personal experience that a good 10 to 15% of the genomes in your dataset are probably going to require re-sequencing, and if you don't take care of that up front, then finding the core becomes a bit problematic. Hopefully someone knows, or you have a rough idea of, what the core should be, or the number of genes in the core. If you don't know that, you're going to have to do some work.
If you do have a rough idea and your core is turning out to be small, then you're probably dealing with a certain proportion of genomes in your data set that just aren't of high enough quality. As I say here, on the core genome definition: you shouldn't be defining a core based on 10 genomes. I mean, you can, but then it's not a core definition for the species; it's the core definition for the 10 genomes you're analyzing. You have to get rid of low-quality genomes, as I said before, and don't start adding genomes from other species. You know, I work primarily on Campylobacter, and people have developed, for example, hybrid Campylobacter/E. coli schemas. That's not helpful. You also need to inspect patterns in missing data; as I mentioned before, sometimes certain genes happen to sit in regions that don't assemble particularly well. And this is another thing I have found really annoying: for example, the same Oxford team that had been the major proponents of cgMLST and wgMLST defined their core nowhere near the ~100% level of what a core should be. They defined it at 95%, which just means you've got 5% of genes that are not really core — and now you have to wonder, when you're doing your analysis, whether or not that 5% is going to cause you a hell of a lot of trouble. And, as I cannot emphasize enough: you've got to get rid of crappy genomes. Okay, let's move into clustering. I'm no clustering expert. But we're dealing with cgMLST data here; the proper clustering that's used by the folks doing phylogenetic modeling, that's a different thing. And again, as I said before, for cgMLST the goal really is trying to identify these lineages. And I talked about BURST.
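To make those core-definition pitfalls concrete, here is a minimal Python sketch of the two steps just described: first screen out assemblies that yield suspiciously few loci (likely poor quality), then call a locus "core" only if it is present in nearly all remaining genomes. The function name, the data layout, and the 0.99/0.9 thresholds are illustrative assumptions, not part of any real tool.

```python
from collections import Counter

def define_core(calls, presence_threshold=0.99, min_fraction_of_median=0.9):
    """calls maps genome ID -> set of locus IDs with a complete allele call."""
    # Screen out suspect genomes first: assemblies yielding far fewer
    # loci than is typical drag down the apparent core size.
    counts = sorted(len(loci) for loci in calls.values())
    median = counts[len(counts) // 2]
    good = {g: loci for g, loci in calls.items()
            if len(loci) >= min_fraction_of_median * median}

    # A locus is "core" if present in >= presence_threshold of the
    # genomes that passed the quality screen.
    tally = Counter(locus for loci in good.values() for locus in loci)
    core = {locus for locus, n in tally.items()
            if n >= presence_threshold * len(good)}
    return core, good
```

Note the order of operations: filtering the genomes before tallying presence is exactly the "take care of quality up front" point above — with the bad genome included, the apparent core shrinks.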
People are generally quite shocked at how rudimentary this analysis is, because it just uses the uncorrected Hamming distance, which is the proportion of differences between two profiles. And then the clustering. I think Fiona didn't want to cover UPGMA because that's clustering for children. But let me tell you, a lot of people use UPGMA when clustering cgMLST data. I won't take you through it, though. We tend to use slightly more sophisticated methods, but they're still rudimentary compared to the phylogenetically aware methods. I will talk about single-linkage clustering a bit more, because it generates results that tend to look like this other approach called the minimum spanning tree. You don't have to learn all the underlying math; I certainly have only a rudimentary knowledge of it. But all this to say that MSTs are all the rage among the cool kids for clustering cgMLST data. One of the nice features of an MST is that it connects all of the different genomes while minimizing the total branch length of the tree. And yes, it is an acyclic graph in which all of the nodes are connected. But what I really want to talk about is this thing called GrapeTree. BURST was the method first developed to cluster MLST data, and our colleagues — actually the same group that developed chewBBACA — developed the goeBURST algorithm, a souped-up version of eBURST that took into account the fact that eBURST wasn't a complete mathematical solution. So goeBURST was a globally optimized eBURST algorithm. And wouldn't you know it, the group that included the persnickety researcher I mentioned when talking about MLST — his lab —
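The uncorrected Hamming distance mentioned here can be sketched in a few lines of Python. The convention that `None` or `0` marks a missing allele call is an assumption for illustration (tools differ on this), as is the choice to skip missing loci rather than count them.

```python
def hamming(profile_a, profile_b, missing=(None, 0)):
    """Proportion of compared loci with different allele numbers;
    loci where either profile has a missing call are skipped."""
    compared = diffs = 0
    for a, b in zip(profile_a, profile_b):
        if a in missing or b in missing:
            continue
        compared += 1
        diffs += a != b
    return diffs / compared if compared else 0.0
```

For example, profiles `[1, 2, 3, 4]` and `[1, 2, 9, 0]` differ at one of the three comparable loci (the fourth is missing), giving a distance of 1/3.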
they worked together with the Portuguese group and developed a method called MSTree V2, which is implemented in a software package called GrapeTree. It has a couple of really interesting features. One is that it handles missing data better than conventional tree algorithms. One of the main issues with genome-scale analysis is that, again, because you're dealing with assemblies, there's going to be a certain level of missing data unless you want to throw out every single genome in your data set. So you end up with a lot of sequence types that differ from other sequence types by only one or two loci, where the difference isn't really a difference — it's just missing data in one of the genomes. You don't want to have to deal with that by hand, and MSTree V2 takes care of it. It also builds on BURST analysis and tweaks some of the math, so that it can identify the founders of clonal complexes in a way that is a bit more mathematically consistent. So this is my preferred approach. In terms of epidemiological interpretation, one of the things you have to deal with is that trees are nice, but they're not practical — certainly not in the context of you having to go and talk to an epidemiologist who's expecting a spreadsheet. So basically you have to take a tree and decompose it into groups of highly related genomes, and this is done by applying a distance or similarity threshold. Let me go on a little aside here. This isn't a new thing: Charles Darwin was already talking about lumping and splitting — where does one species end and the next species begin? This is no different.
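As a toy illustration of the minimum-spanning-tree idea — connect all genomes while minimizing total branch length — here is Prim's algorithm over a pairwise distance function. This is a generic MST sketch, not the goeBURST/MSTree V2 tie-breaking logic, which is considerably more involved; all names are illustrative.

```python
def minimum_spanning_tree(ids, dist):
    """Grow an MST (Prim's algorithm): repeatedly attach the outside
    node closest to the tree, minimizing total branch length."""
    in_tree = {ids[0]}
    edges = []
    while len(in_tree) < len(ids):
        a, b, d = min(((x, y, dist(x, y))
                       for x in in_tree for y in ids if y not in in_tree),
                      key=lambda e: e[2])
        edges.append((a, b, d))
        in_tree.add(b)
    return edges
```

On n genomes this always returns n − 1 edges forming a connected acyclic graph, which is why MST figures never show cycles.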
You have a tree, and you set a threshold that generates groupings that are useful. And the reality is that there is no major guidance on how to do this; a lot of it has to be done empirically. Generally speaking, a lot of it has been based on people who have access to outbreak data setting the thresholds to maximize concordance with that outbreak data. But still, there are a lot of different places where you could draw that line. And ultimately, I'm more of a proponent of using a combination of population structure as well as ecology and epidemiology to help inform where to draw the line. Because we know that ecology and epidemiology influence population structure in many respects — so why not use that information to do something informed by those things? So let's go back. We're answering genomic questions from WGS data, so from a practical perspective, what do you need? If you have a tree, you can imagine dissolving edges or branches that are beyond a particular length, and whatever you're left with — the little clumps — is what you'd define as a group. Those clusters that you've extracted from your tree can then be subjected to a variety of different analytical approaches. If you have metadata, for example, on which of these genomes is associated with a particular type of antimicrobial resistance, then, having decomposed the tree into all these different clusters, you can calculate cluster-by-cluster AMR rates and identify the clusters that are really hot — that have a lot of AMR — versus the ones that are not, and so on. So when you take the tree and decompose it into clusters, that enables the whole analysis to become a lot more —
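The "dissolve every link longer than the threshold and keep the clumps" operation is just single-linkage clustering at a fixed cutoff. A minimal sketch using union-find (the function names and threshold values are illustrative):

```python
def threshold_clusters(ids, dist, threshold):
    """Keep only links at or below the threshold and return the
    resulting connected components (single linkage at a fixed cutoff)."""
    parent = {g: g for g in ids}

    def find(x):  # union-find root lookup with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if dist(a, b) <= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for g in ids:
        groups.setdefault(find(g), set()).add(g)
    return sorted(groups.values(), key=lambda s: sorted(s))
```

With distances a–b = 1, b–c = 2 and everything else large, a threshold of 2 yields the clusters {a, b, c} and {d}: note that a and c end up together even though they are far apart, which is exactly single linkage's chaining behavior.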
quantitative rather than qualitative, let's put it that way. Yeah, so you're making analytical units that you can subject to further analysis. And then you can go back to the tree and start mapping the things you've calculated onto the various clusters you extracted in the previous step, and make a nice visualization like this one here, which shows you regions of the tree that represent large clusters — some with high proportions of human clinical isolates within them. So we have a number of clusters with high rates of human clinical cases and resistance. Now, one thing is that ultimately — it pains me to say this, but — people are going to have to manually inspect the contextual data of an outbreak to ensure that it looks kosher. You look at the date, month, and year of isolation, the source of the isolate, and the location information, and boom, it all comes back concordant: all of these were isolated on the same day, in the same place, confirming the possibility of that outbreak. It pains me to say it, but people are still printing out trees and lining them up against tables of metadata to do this. This has got to get better. But anyway, just to say again: this goes back to one of the first principles I talked about, which is that you expect concordance between genetic similarity and epidemiological similarity. We're not quite there yet. Now I'm going to talk about this next part partly because there's a lot of enthusiasm about approaches that don't have to do all this crazy stuff — approaches that may be more computationally efficient and faster and so on. So I'll talk about MinHash. MinHash is an algorithm that was developed, I think, in the 90s.
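Once the clusters are extracted, each one becomes an analytical unit. A trivial sketch of the cluster-by-cluster AMR rate calculation described above (the function name and data layout are assumptions for illustration):

```python
def amr_rate_by_cluster(clusters, amr_status):
    """clusters: list of sets of genome IDs;
    amr_status: genome ID -> True if the isolate carries resistance."""
    return {i: sum(amr_status[g] for g in members) / len(members)
            for i, members in enumerate(clusters)}
```

The same pattern works for any per-genome attribute — source attribution, human-clinical proportion, serotype — which is what turns the tree into something an epidemiologist can work with in a spreadsheet.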
It was originally developed so that you could compare web pages to one another and efficiently determine whether or not they were extremely similar, or identical, or neither. In the context of text, you would take substrings of the text: once you've taken the web page and deconstructed it into all its different substrings, you apply the algorithm, where each substring goes through a hashing function, and the hashing function spits out a number. MinHash then basically says: okay, for every web page, I'm going to take, say, n substrings to represent it as a sketch, and I'm going to pick the n lowest of those hash values. Then, rather than having to compare word by word, you compare the MinHash sketches to one another, and if they match well, you can figure that out efficiently. So people eventually thought: hey, we're using this for comparing documents — can't we just use it to compare sequence data? Mash is the adaptation of MinHash to the context of molecular data. And they souped things up. One of the things they were very smart to do was derive a distance function that lets you estimate the mutation rate between two different genomes based on their MinHash sketches. The other thing that's super cool is that, whereas with a web page you're taking substrings, here you're dealing with k-mers. And because you're using k-mers, you don't have to worry about where they came from: it could be assemblies, it could be metagenomic data, it could be raw sequencing reads, and so on.
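A toy version of the Mash-style pipeline — bottom-s MinHash sketches over k-mers, a Jaccard estimate from the sketches, and the Mash distance formula to turn that into a mutation-rate estimate. The SHA-1-based hashing and the k/s defaults here are illustrative choices, not Mash's actual implementation or parameters.

```python
import hashlib
import math

def sketch(seq, k=15, s=100):
    """Bottom-s MinHash sketch of a sequence's k-mer hashes."""
    hashes = {int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
              for i in range(len(seq) - k + 1)}
    return set(sorted(hashes)[:s])

def mash_distance(seq_a, seq_b, k=15, s=100):
    a, b = sketch(seq_a, k, s), sketch(seq_b, k, s)
    union_sketch = set(sorted(a | b)[:s])              # bottom-s of the union
    j = len(union_sketch & a & b) / len(union_sketch)  # Jaccard estimate
    if j == 0:
        return 1.0                                     # no shared k-mers seen
    return -math.log(2 * j / (1 + j)) / k              # Mash-style distance
```

Because the sketch is a fixed-size summary, two genomes are compared in time proportional to s rather than to genome length — which is what makes all-against-all comparisons of huge collections feasible.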
And I think the thing that blew me away the first time I saw it was that they had used it to cluster all of the RefSeq genomes, across all these different species. The fact that you could perform basically a kind of genome-wide comparison across a whole bunch of different species that don't share very many genes was kind of wild. So I'm pretty close to the end here. We know that cgMLST is a compromise. You want robust performance, but you know you're going to lose discriminatory power, because you're not using all the genes, only the core, and because for every gene you're collapsing the sequence into a single allele number. So we know that where the discriminatory power isn't enough, it has to be supplemented with analyses with higher discriminatory power — either SNPs, or MLST with more genes — but on a smaller number of genomes. Once you know you have a cluster of interest, you dig in with those approaches. In terms of synchronizing databases, one of the things that has been proposed is allele hashing. A lot of people have their own database, in their own institution, and they're not sharing with the world. Then at some point they decide, we should be sharing with the world — but by that time, everything's baked in: all the allele numbers and the sequence types have been baked in. So how do you deal with that? The best way is, again, hashing is your friend: a particular sequence spits out a particular code that is unique to that sequence. That way, if we're all using the same hashing function, then we're all speaking the same language. Nomenclatures. Most people didn't know about nomenclatures until SARS-CoV-2, because of the development of this particular tool that would assign genomes to a lineage. Nomenclatures are essential.
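A minimal sketch of the allele-hashing idea: derive the allele's identifier from the sequence itself, so that independent databases agree without a central curator. The normalization step and the use of truncated SHA-256 are illustrative choices, not a published standard.

```python
import hashlib

def allele_hash(seq):
    """Name an allele directly from its sequence: any lab applying the
    same normalization and hash derives the same identifier."""
    normalized = seq.strip().upper()
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]
```

Unlike sequential allele numbers, these identifiers never collide across institutions (for practical purposes) and never need renumbering when databases are merged — which is exactly the "baked-in numbers" problem described above.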
You can't do genomic surveillance without a nomenclature system that will systematically issue lineage names or strain names or whatever, based on the sequence that you have. And that enables communication between groups without necessarily having to share the sequence data — although, as you know and as I will tell you, you should be sharing. In terms of bacterial genomes, the only constant so far is that basically every species needs its own nomenclature. And there's the idea that you should always be doing some sort of hierarchical nomenclature, with several levels of similarity. Okay, the recap. This is not a phylogeny. Just to recap: we are not doing a phylogenetic analysis. We're doing the best we can in a context where, because of scale and a lot of the biology, sometimes a true phylogenetic analysis isn't even feasible. WGS-based typing ultimately relies on taking enough of the genomic information that we can develop good estimates of genetic similarity. But realistically, we're talking about developing genomic surveillance on top of this. If you have a PhD student who is going to be analyzing several hundred genomes, just do the proper thing and do a proper phylogenetic analysis — and maybe also put it through some of these typing tools, so that you can generate data you can compare to other studies and so forth. But this is not about being phylogenetically correct. And for most analyses — it pains me that I'm here defending these methods, because I wasn't a fan in the first place, but here we are — it works well for scale-up because it's fairly robust. But at the same time, we know that it doesn't have enough resolution for drilling down into some situations, such as complicated outbreaks where maybe there's not a lot of genetic variability.
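The "several levels of similarity" idea can be sketched as single-linkage clustering at a ladder of thresholds, with one label per level joined into a dotted code — loosely in the spirit of hierarchical-clustering nomenclature schemes. The clustering approach and the code format here are illustrative assumptions.

```python
def clusters_at(ids, dist, t):
    """Single-linkage clusters: connected components of the graph that
    links every pair of genomes at distance <= t."""
    remaining, comps = set(ids), []
    while remaining:
        frontier = [remaining.pop()]
        comp = set(frontier)
        while frontier:
            x = frontier.pop()
            near = {y for y in remaining if dist(x, y) <= t}
            remaining -= near
            comp |= near
            frontier.extend(near)
        comps.append(sorted(comp))
    return sorted(comps)

def hier_codes(ids, dist, thresholds):
    """One cluster label per threshold, coarsest (largest) first,
    joined into a dotted code such as '0.1'."""
    codes = {g: [] for g in ids}
    for t in thresholds:
        for label, comp in enumerate(clusters_at(ids, dist, t)):
            for g in comp:
                codes[g].append(str(label))
    return {g: ".".join(parts) for g, parts in codes.items()}
```

Reading the code left to right then gives coarse-to-fine membership: two genomes sharing a long prefix are close, and the level at which their codes diverge tells you roughly how close.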
It generally has to be supplemented with things like SNP analysis or, you know, higher-resolution levels of MLST.