So I work with the Public Health Agency of Canada on foodborne pathogens using genomics. The thing I'm going to talk to you about is multilocus sequence typing, MLST, which is ironic because I was a big denier of MLST until we started using whole genome sequence data. That's the game changer. So I'm in this unusual situation of having been against MLST for the longest time, but now that we have whole genome sequence data, I'm a reluctant supporter.
The objectives of the module are these: looking at subtyping in the context of molecular epi. One of the big things I want to drive home, following from the previous modules, is that phylogenetics and population structure are really important to all of this, and I will probably cover them from a slightly different angle. When you look at MLST at face value, it doesn't really make sense that the data should be analyzed the way it is unless you take into consideration population structure, the way the populations of a lot of bacterial pathogens are structured. Only then does it make sense to analyze data the way MLST does. So I want to go over some of that.
The other thing to bear in mind is that MLST is just another tool in the toolkit. It is not the tool. It is a tool. It may not be the best tool for every scenario, and it's certainly not the best tool for every bug out there. For some it's really good. For others it's terrible. It's also probably not the tool for outbreak analysis; it's probably better for things like long-term surveillance. So I want to give you context for that. The big thing to keep in mind is that MLST is one of the tools. There are certain things it is very good at, and there are things where it's really not that good. Okay.
So the first thing is, ultimately we're looking at epidemiology. Epidemiology deals with the fact that disease is not evenly distributed, and we're trying to investigate why. In terms of molecular epi, we're really talking about using molecular data to support epidemiological investigations. Primarily we use molecular tools to characterize pathogens and then use that information to infer things about those pathogens: how they were transmitted, how people might have been exposed to them.
In the end, it's a fairly simple paradigm: we expect strains that are epidemiologically linked to be genetically similar, and vice versa. That's it. That hypothesis is the starting point for your investigation. But the key thing to remember is that it holds under optimal conditions. Things don't behave normally all the time. There's lots of noise out there, and the basic assumption about linkage, epidemiological and molecular, can be violated in a number of different ways. It's important to know the things that can violate that simple principle.
Certainly one of the big issues is that until recently, with the adoption of whole genome sequencing, the molecular tools we've been using to investigate pathogens have not been that great. They were better than nothing. In the grand scheme of things, the information you could derive from molecular typing data is certainly better than not knowing anything about the pathogen.
But at the same time, I think the one sin we all committed was that we may have overreached in interpreting that data, considering how crappy it was. As I discuss MLST and then move on to whole genome MLST, I hope to drive home that point. A lot of this you probably already know, but in the context of my presentation I hope it all makes sense. If we don't have time to spend on all the slides, I designed them so that when you're on your own on Saturday night, there's nothing else to do, you're partied out, you can go home and read your notes and they will make sense. And there is going to be some overlap with some of the concepts that have already been discussed. But nevertheless, let's get on.
So we're talking about surveillance, which is really all about getting out there and sampling. You do microbiological sampling of different foodstuffs and the environment, water testing. You've got people who are sick, and you don't know what they got exposed to that made them sick. So ultimately, you go out there and you sample. Then it's all about trying to subtype the pathogen so that you can match it up to wherever that exposure came from.
There are two main things we're really talking about. The first is where you've got two people and they're sick. By subtyping, you match up the bug in the first person and the second person, and you can infer that they were likely exposed to the same thing. With epidemiological follow-up, you can ask, hey, did you eat tacos from those taco stands out there? And they both did. From that, you can figure out where to go next. The other thing is... yes?
The other thing is the whole concept of source attribution, which is trying to figure out, in general, what people are getting exposed to when they become sick. You're trying to assign the likelihood, or the partitioning, of illness to different sources, so that you can then figure out, well, people who eat raw chicken all the time are getting sick from Campylobacter or Salmonella or whatever. And then you can try to develop interventions to reduce that.
We've been using subtyping for a long time. It used to be about phenotypic testing, serotypes, biochemical testing, that kind of thing, which eventually evolved into molecular typing, so using DNA methods to characterize pathogens. The prototypical one is pulsed-field gel electrophoresis. But there's a myriad of methods, so much so that somebody coined the acronym YATM, yet another typing method, because so many methods were being developed in the 1990s and 2000s. And basically, for every method that someone develops, if you want to use it, you have to validate it. So this became really problematic, because every method out there has to be assessed to see whether or not it works. It's the same question you had about this pipeline versus that pipeline: now I have to test both. Well, imagine that with molecular methods, which are in the lab with reagents. It's a nightmare to try to do that.
So enter MLST. One of the really cool things about it, I can now say in retrospect, is that it relies on amplifying seven to nine genes, sequencing them, and analyzing that data. The reason it became really important is that sequence data, unlike a lot of the fingerprinting data we were using before, is a lot more amenable to standardization.
It's easier to compare sequence data across labs than it is to compare gels. So it really took off in terms of popularity. In many respects, it became the gold standard of molecular epidemiology. Schemes for all sorts of different bugs were developed and used in hundreds and hundreds of papers. So the success of the methodology kind of speaks for itself. It's not perfect, but in the era before whole genome sequencing, it was an amazing tool to have.
Now, one of the things about MLST that is kind of counterintuitive is this. As Gary and Phil, among others, were talking about, normally when you have sequence data, you can put it through a number of different algorithms and derive a tree from that. So you imagine, well, with MLST, instead of one gene, we're looking at seven genes. Why don't we just repeat the process that we did for one gene for seven genes? That's what would make sense. That's not what we do.
With MLST, the sequence actually gets reduced to an allele type, and what we compare is the allele types to one another. So instead of using roughly 3,100 data points, we're collapsing everything to seven data points, which seems like an awful waste of data, especially because we're getting rid of the nucleotide differences. You know that hard work those guys just showed you about trying to identify single nucleotide variation? We get rid of it, totally get rid of it. There may be multiple variations in one stretch of sequence in the gene; we don't care about that, and we don't care about counting them, even though that is the signal we typically associate with phylogenetic inference. So we're wasting a lot of sequence, right?
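To make the "collapse to an allele type" idea concrete, here is a minimal sketch. It assumes toy sequences and numbers alleles in order of first appearance; in a real MLST scheme the allele numbers come from the central curated database, so `assign_alleles` and the data below are purely illustrative.

```python
def assign_alleles(sequences):
    """Give each distinct sequence an allele number (1, 2, 3, ...).

    Assumption for illustration: numbers are handed out in order of first
    appearance. A real scheme looks them up in a curated allele database.
    """
    allele_numbers = {}
    calls = []
    for seq in sequences:
        if seq not in allele_numbers:
            allele_numbers[seq] = len(allele_numbers) + 1
        calls.append(allele_numbers[seq])
    return calls


def nucleotide_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical sequences of one locus in three strains (toy data).
locus = ["ACGTACGT", "ACGTACGA", "TCGAACGA"]

calls = assign_alleles(locus)
print(calls)
print(nucleotide_differences(locus[0], locus[1]))
print(nucleotide_differences(locus[0], locus[2]))
```

Strains 2 and 3 differ from strain 1 by one and three nucleotides respectively, yet at the allele level both are simply "a different allele": the count of substitutions is discarded, which is the loss of signal described above.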
Now, you will know this: bacteria divide by binary fission, and in theory, you end up with two clones, right? You start with a mother cell, it divides, and now you've got two clones. As Gary alluded to, in theory, when that happens, those cells are identical to one another. However, there are some processes that generate genetic variation.
One of the big ones is mutation. When a mother cell divides into two daughter cells, there will be some accumulation of mutations, just because the replication machinery is not perfect. So you end up with two daughter cells that are very similar, but not identical. That's what we call a vertical process. You started with the mother cell, it divided, and the daughter cells, through inheritance, will be very similar but not identical.
The big one, however, is recombination. Recombination is a process through which DNA floating around can come into a cell, get integrated, and replace a stretch of DNA that was there already. I'm totally doing a disservice to the concept of recombination, but for the sake of today, that's basically what recombination is: DNA coming in and replacing something that was there already. That process is lateral, because the DNA that came in had no relationship to the DNA that was there before. From the perspective of phylogenetic inference, the problem is that unless you knew that replacement occurred, it's basically throwing off your inference. The sequence has changed to the point where maybe it looks a lot more different than it should. When you make your tree, the tree is not going to be accurate, but you have no way of knowing that. The replacement occurred, and you don't know that it happened.
So mutation is basically what drives our ability to do phylogenetic inference. Without variation, what are we tracking? Everything's the same. Mutation is what we analyze when we're doing phylogenetics. Recombination is a big monkey wrench, because the lateral acquisition of DNA inhibits our ability to properly infer what has happened. It distorts the phylogenetic signal. Okay.
So now we're going to talk a little bit about bacterial population structure, because it's really important to bear this in mind. I'm not a bacterial population structure guy, and I'm not going to pretend to be one. But I can read papers, and so can you. This is a really good paper that you could definitely read, because in it they describe some of the population structures that can be observed out there.
The clonal population structure is what we typically think occurs: you have cells dividing and accumulating mutations over time. In theory, you take their sequences, throw them through the sort of phylogenetic analysis that Gary was talking about, and you can infer relationships and evolution. And then you have some other types of population structure that are a lot less like that clonal structure, to the point where the population that you see on the right is more of a network. You know how we're conditioned to think about phylogenies as trees? Well, in a panmictic population, it's basically strains that are exchanging DNA all the time, kind of bacterial sex, shuffling stuff around to the point where a lot of these clonal relationships are gone. So it's more of a network than a tree. And then you start thinking, well, why are we looking at trees when maybe the population is not a tree?
Most populations are in fact comprised of these clonal lineages. But not every population has the same rate of recombination and mutation. It varies by species, and you have to take that into account in deciding which is the best tool for examining the population. When we were talking about the fact that SNP analysis is really good for certain things and MLST is better for other things, generally it depends on the type of population you're looking at, and whether it behaves better under SNP analysis or MLST analysis.
More importantly, I think, many pathogens exhibit this epidemic structure that I'm going to talk about, where there are a number of genotypes that are quite rare, they're exchanging DNA, and it's more of a network. Then now and then a clone appears that takes off. It could be that it's really well adapted, or there could be any number of reasons why it might take off and increase in frequency in the population. So you have a background of almost no population structure, and then all of a sudden a clone appears, and it might disappear or it might stick around. And then you have other clones popping up at different times in the evolution of the population.
I'm not going to read that quote, but since you have the notes, I would want you to read it on that Saturday night that we talked about. The important thing is that for many pathogens, the clones are what matter. Trying to figure out the relationships between the clones is almost not useful, because there is too much noise. So it's more important to define the clones, define the groups that are important; the relationships between them we can't really figure out. [unclear]
So again, on Saturday night, you can read it. Now, in order to analyze MLST data, we use this approach called BURST, or Based Upon Related Sequence Types. It's an algorithm that was developed in the early 2000s, and it is all about trying to figure out these clonal groups or lineages. It has undergone some refinements. eBURST was the first iteration, and now we have the goeBURST algorithm. But it is based on this concept of trying to identify groups of related sequence types and trying to infer the ancestral sequence type and all of its descendants.
For lack of a better term, we have a network of founders, as well as their acolytes that descended from them by acquiring mutations or differences. We have single-locus variants: you have the founder that then acquires a mutation, and that's an SLV. If two alleles are different, that's a DLV; three, a TLV; and so on and so forth. So you have a network of dudes. BURST then tries to infer the ancestral sequence type of a network of related sequence types by maximizing the number of SLVs, DLVs and TLVs. It's not really that important that you know the specifics. We're trying to figure out the groups, as well as who was the founder of each group. It boils down to that.
So in this particular case, we think that the ancestral guy is the guy in the middle, not the guy on the side, because when the guy in the middle is assumed to be the ancestral guy, it maximizes the number of SLVs, DLVs and TLVs. So mathematically, you can try to infer that. And you can do that at the population level and generate a big map of clonal complexes and their related sequence types. I was going to say, all right, so take-home message.
Many bacterial species do not generate tree-like phylogenies. Not in the long term. In the short term, they may; in fact, they probably will. But in the longer term, a lot of that tree-like behavior disappears. So again, when you're trying to figure out whether to do MLST analysis or SNP analysis, you have to ask, am I looking at short-term trends or longer-term trends, and decide accordingly. For non-clonal, epidemic species, eBURST analysis is actually a pretty good alternative way of doing the analysis.
First, just a couple of housekeeping rules for the nomenclature of MLST. As I was saying before, you have a sequence for a particular gene. The sequence becomes irrelevant. We give it a number, and then we compare the numbers, basically. If one strain has a number one allele in gene one, and another strain also has a number one allele in gene one, you call it a match. Then you tally it up over the seven genes. So the sequence type is a combination of alleles at the seven genes, or the nine genes, or whatever. Then we assign the clonal complexes based on the BURST analysis I was just talking about.
The one really important thing is that there is a central database where we store all this. When you have a unique combination of alleles, somebody has to make the call: oh, this is new, let's give it a new number. When you find a new allele, same thing.
Okay, so that's an allele table, and here I'm just showing an example. In this particular case, allele number one in the gene tkt is what we see for that particular sequence type. That entire combination of seven digits is the allelic profile for sequence type 21. I talked about SLVs, right? They have one difference in one of the genes.
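The nomenclature and the BURST idea described above can be sketched in a few lines. This is a simplified illustration, not the eBURST or goeBURST implementation: the profiles and ST names are invented, and the founder is predicted simply as the ST with the most SLVs, with ties broken by DLVs and then TLVs.

```python
# Toy allelic profiles: seven loci per sequence type. Names and numbers are
# invented for illustration and are not taken from any real MLST database.
PROFILES = {
    "ST-A": (1, 1, 1, 1, 1, 1, 1),
    "ST-B": (1, 1, 1, 1, 1, 1, 2),
    "ST-C": (1, 1, 1, 1, 1, 3, 1),
    "ST-D": (1, 1, 1, 1, 1, 3, 4),
    "ST-E": (1, 1, 1, 5, 1, 1, 1),
}


def allelic_distance(p, q):
    """Number of loci whose allele numbers differ (not nucleotide differences)."""
    return sum(a != b for a, b in zip(p, q))


def lookup_st(profile, profiles=PROFILES):
    """Return the ST whose profile matches exactly, or None if it is new."""
    for name, known in profiles.items():
        if known == tuple(profile):
            return name
    return None


def variant_counts(st, profiles=PROFILES):
    """Return (number of SLVs, DLVs, TLVs) of `st` among the other STs."""
    counts = [0, 0, 0]
    for other, prof in profiles.items():
        if other == st:
            continue
        d = allelic_distance(profiles[st], prof)
        if 1 <= d <= 3:
            counts[d - 1] += 1
    return tuple(counts)


def predicted_founder(profiles=PROFILES):
    """Simplified BURST-style rule: most SLVs, then most DLVs, then most TLVs."""
    return max(profiles, key=lambda st: variant_counts(st, profiles))


print(lookup_st((1, 1, 1, 1, 1, 3, 1)))
print(variant_counts("ST-A"))
print(predicted_founder())
```

In this toy set, ST-A has three SLVs and one DLV, more SLVs than any other profile, so it plays the role of "the guy in the middle". A profile that matches nothing returns `None`, which is the situation where a curator would have to assign a new ST number.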
Again, do not conflate an allelic difference with the number of nucleotide substitutions, because there may be multiple substitutions. In the end, all that matters is that it's not a match. If it's not a match, then it's a one-digit mismatch, and that's all that counts. It could be 50 nucleotide mismatches. We don't care. We collapse it to one difference. I talked about DLVs. And as I said before, the clonal complex is a group of related sequence types that are all quite similar, but not identical, and the BURST analysis tries to identify the central sequence type of a bunch of related sequence types.
Also really important is the concept that we have a database and it's universal and centralized. Someone curates that. In the old days, the old days of like 10 years ago or whatever, if you discovered a new allele, it wouldn't just be added to the database. Someone had to verify it by hand to make sure that this is a real thing. And if you had a new combination of alleles, i.e., a new sequence type, again, someone had to verify that by hand. Which is awesome. Okay, great, that means your database is curated. But it also means that if it's one dude's job to do that, that dude has a crappy life, and it takes a long time to get anything sorted out and curated.
As I said before, I'm a reformed MLST denialist. The big reason is that about 10 years ago, we published a paper showing that all these strains that were supposed to be of the same type, when you looked at their genomes, were really different, like super different. So you have this whole body of publications based on MLST talking about sequence types, like, oh, we understand these sequence types, whatever. And meanwhile, we're like, well, but they're not the same. You do realize that, right?
They are the same sequence type, but that doesn't mean that they're the same strain. The differences that are there are huge, and they could have massive implications for our understanding of disease. So for me, that was always a big issue, a huge issue. We look at seven genes; most genomes have several thousand genes. We're discarding most of the data, most of the information. And as I said before, even in the context of one gene, we're discarding most of the genetic variability there. So that was always an issue for me.
The other thing, too, is that each of these genes would have to be amplified and sequenced manually. We're at a point now where we can sequence a genome for almost the same amount as it would take to generate an MLST profile by hand, one gene at a time. And the other problem with MLST is that when you look at the databases, there are a handful of sequence types that are really popular. So everything looks like the same thing, except that if you dig deeper by genomics, you know that they're not the same. At the surface level, they look the same. So you know that the method is not that great. And that's why I hated it, quite frankly.
Here is an analysis of a bunch of strains that, based on MLST, are identical. You can read it again on Saturday night. On top are some genomic comparisons using microarrays, which is an old technology that you guys will never learn about, because you're too young. Underneath is a cgMLST analysis that Dylan, who's your TA for the next section, has done, showing that there's a lot of structure there, even though the strains are supposed to be identical.
So the solution, of course: genomics. We can sequence a lot of stuff really fast, and basically all it is is extending the concept to a whole genome. Right?
So basically we're taking whole genome sequence data, but just applying the rules of MLST. Right? Awesome. Cool. We can do this. As I was saying before, we were using seven genes. Now we can use thousands of genes. That means you have greater discriminatory power, more resolution, to be able to say both, oh, these guys are really similar, or, they're really different. So can't we just take whole genome sequence data and start analyzing thousands of genes? Makes sense, right? Simple. It doesn't seem like it should be that difficult, except that it is. So I'm now going to show you why it's not as simple as you would think it should be.
The first thing is, in the old days of MLST, we would amplify each gene, sequence each gene, nice and slow and steady, and compare the data to the database. Now we're not doing that. We just have sequence data, and we're trying to bioinformatically extract the data for all the genes. That's not as trivial as you'd think it should be.
The first problem is a big one. When we talk about the bacterial genome, strains can differ a lot in terms of their gene content. Compare two strains of E. coli, and they may have 20% of the genome that is not shared between them. So we have this concept of the pan-genome, the core genome, and the accessory genome. This is probably one of my favorite papers of all time. It was one of the first comparisons of multiple genomes of a particular species, and they basically came up with this concept where the core is the genes that are shared by everybody and the accessory is what's not shared by everybody. And it turns out that the accessory is way larger than the core. The core is basically what we used to call housekeeping genes, essential to the survival of that species.
A lot of things that are really important are accessory, like AMR genes, virulence factors, carbohydrate utilization genes. A lot of the things that we relied on over the years for phenotypic testing of pathogens turned out to be accessory. And the pan-genome is basically the sum of the core and the accessory. Really important.
But here's one of the problems with all of that. When we talk about genome sequencing, and this is something that was alluded to before, we're not really talking about whole genomes. We're generally talking about draft genomes, right? We sequence, we try to assemble, and we don't fully assemble. Most genomes out there are just draft assemblies made of fragments called contigs, with gaps in between. So most of the genomes out there are not complete. And when you have two contigs and a gap in between, if there's a gene right there, you can't tell what it is.
A good assembly might have 50 contigs. I don't know, what's a good assembly for you guys? 30, 50 contigs? Well, yeah, okay, best-case scenario is one single contig, right? But most of the time, if we can reduce it down to maybe 30 to 50 contigs, that's still a bunch of gaps, and genes in them that we can't assign. The problem with that, of course, is that we can't tell whether a gene is missing because it's in a gap or because the strain doesn't carry that gene. So how do you know? You can't know. That's the problem.
And there are other problems. One is the fact that genes are not nicely behaved. Certain genes are nicely conserved, and they don't vary in length. They primarily vary in terms of sequence: polymorphisms like SNPs, or minor insertions and deletions.
Other genes are quite variable in length, and those are really problematic to deal with. When we're trying to fish out the data bioinformatically by homology, it becomes really difficult to figure out what you're looking for with genes that are that variable.
Then we have this whole issue of paralogs versus orthologs. Is anyone familiar with that concept? Yes. Now, take that concept and put it within the same species. If there's a duplication of a gene, now you have two copies, A and A prime, let's say. If you're comparing two strains and they both have the duplicates, A and A prime in each, then A to A is orthologs, A prime to A prime is orthologs, and A to A prime is paralogs. The problem is that duplications occur, not just in bacteria, but in bacteria it's way worse.
Now, one of the awesome things about duplications is that they drive evolution. Imagine you have a gene for whatever, and now it's duplicated. Oh, I've got two copies. One copy I can keep for whatever it was doing before, but my new copy can change, because I still have the old copy, right? So that copy is free to evolve, and that's how a lot of evolution occurs: duplication and diversification of the duplicates.
But what happens if one of the two duplicates disappears? If we're comparing genes that may have duplicated in the past, and we're comparing strains, we don't know whether we're comparing apples to apples or not. That, again, is a problem for this kind of analysis. If you have an inkling that a gene has duplicated and that you have ortholog/paralog issues, you're kind of stuck. You don't know what you're comparing anymore. So then, I think in the words of Gary, if it's a problematic guy, maybe get rid of it.
Even though SNP analysis and MLST analysis are quite different, at the end of the day the one guiding principle that is the same is: if it's a troublemaker, get rid of it, because you cannot deal with it. Next slide, okay. Yeah, so if a gene is problematic, get rid of it. There's no sense in keeping it around in your scheme, because it's just going to give you more trouble, especially now that we have data sets of thousands of genomes. There's a lot of mischief that can happen in evolution. We're dealing with biology, and biology is inherently noisy. If you have a penchant for keeping around things that are biologically noisy, it's going to come back and bite you hugely, as someone famous has said, hugely, bigly.
Anyway, the problem with missing data, in this whole thing about accessory genes versus core genes and gaps, is that as far as MLST is concerned, if we can't figure out what the sequence is, we can't type it. So for all intents and purposes, missing data is our nemesis. We can't assign the true sequence type, only part of it.
Moreover, in the big picture of surveillance, think of an outbreak. You'll remember the big E. coli outbreak in Germany a few years ago, where blame was assigned to different products, Spanish cucumbers and so on. The problem is that nowadays, with whole genome sequence data, we want to be able to get it right. If you have missing data, you basically can't exclude or include something as part of an outbreak with 100 percent certainty. So this untypability becomes a huge issue.
Okay, now one of the things that we know, and we've been working on this for a while now, is that on a big enough data set, for basically every gene you're going to get at least one genome that doesn't have the data for that gene. That's just the reality. So one of the things we've learned over the years is that the bigger your scheme, the more genes you include, the greater the chances that you're going to have missing data. And not all genes behave equally badly. So it behooves you to really try to get rid of troublemakers early on, because they're just going to cause you trouble anyway.
Another thing we know is that not all genes have the same probability of missing data. Some are way worse. You can see a sort of baseline of missing data here, and there are some peaks. Those are not worth keeping around, trust me. We can't really avoid missing data, but you can certainly try to mitigate the problem by getting rid of noisy genomes and noisy genes. Get rid of them.
So this brought about the concept of cgMLST, okay? cgMLST is just MLST, but using core genes. Only core genes. We get rid of everything else. The reason this becomes a really powerful thing is that we need to be able to develop a scheme that kind of takes care of itself, that doesn't require a lot of manual curation, and that will scale up and still behave with thousands of genomes, right? If you keep genes around that are not core, you're begging for a butt kicking. No one's going to have the job of curating every gene in your scheme. No one's going to do that. A computer's going to do that, or a program's going to do that. So the more biologically noisy a gene is, the worse it's going to be to curate.
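The "get rid of troublemakers" step can be sketched as a per-locus missing-data filter. Everything here is an assumption for illustration: the toy allele calls, the use of `None` for an untypable locus, and the 25% cutoff, which is an arbitrary example rather than a recommended threshold.

```python
# Toy allele calls: one row per genome, one column per locus.
# None marks a locus that could not be typed (gene in a gap, truncated, absent).
LOCI = ["locus1", "locus2", "locus3", "locus4"]
CALLS = {
    "genomeA": [1, 2, None, 4],
    "genomeB": [1, None, None, 4],
    "genomeC": [2, 2, 7, 4],
    "genomeD": [1, 3, None, 5],
}


def missing_rate_per_locus(calls, loci):
    """Fraction of genomes with no allele call, for each locus."""
    n = len(calls)
    return {
        locus: sum(row[i] is None for row in calls.values()) / n
        for i, locus in enumerate(loci)
    }


def filter_loci(calls, loci, max_missing):
    """Keep only loci whose missing rate is at or below `max_missing`."""
    rates = missing_rate_per_locus(calls, loci)
    return [locus for locus in loci if rates[locus] <= max_missing]


def compare_profiles(p, q):
    """Return (mismatches, loci compared), skipping loci missing in either genome."""
    pairs = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    return sum(a != b for a, b in pairs), len(pairs)


print(missing_rate_per_locus(CALLS, LOCI))
print(filter_loci(CALLS, LOCI, max_missing=0.25))  # 0.25 is an arbitrary example
print(compare_profiles(CALLS["genomeA"], CALLS["genomeB"]))
```

Note what the last comparison returns: zero mismatches, but over only two of the four loci. That is the untypability problem in miniature. The two genomes cannot be called identical, and they cannot be called different either.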
So a number of different groups have proposed the idea of core genome MLST. Why? Because core genes behave better in general. And so then, yes, we're sacrificing information in exchange for better tractability, better behavior. So yes, you sacrifice some information, but by the same logic as what Gary was talking about, we sacrifice what's noisy. Who cares, right? Yeah, it'd be nice to have it, but if it's just going to cause problems, it's not worth it. So basically, if you stick to core genes, you remove the ones that vary quite a bit. The remaining genes tend to behave really well, like really, really well. And they lend themselves to global surveillance, which is a big thing nowadays: being able to share data across countries, being able to compare the profiles of things that are circulating here versus elsewhere. So how do we design something like this? Well, the first thing is, you have a whole bunch of genomes. We first have to figure out what the core looks like. To do that, first we have to predict the genes, and we use programs like Prokka or RAST that are used for gene prediction. Then we have to identify core genes and accessory genes. And then we do something that we call merging of orthologs. Programs are not perfect. You may have different alleles of the same gene where a gene prediction program might think that there are two different genes, but really they are the same thing. So there are programs that will merge them. And as I was saying before, paralogs we hate. We don't like duplications. If a gene is duplicated, it's not going to go into a core scheme. So this is the time to try to get rid of stuff like that. Then we extract the core genes. There are a number of programs that we can use to do that.
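As a rough sketch of the sorting step just described, suppose gene prediction and ortholog merging have already produced a table of copy numbers per gene per genome. That input format, the function name, and the gene names are assumptions for illustration, and the strict rule used here (exactly one copy in every genome) is one possible definition of core rather than the rule of any specific program.

```python
def split_core_accessory(copy_numbers):
    """copy_numbers: {gene: {genome: number of copies}}; absent genomes count as 0.

    Returns three sorted lists: (core, accessory, paralogous).
    """
    genomes = sorted({g for counts in copy_numbers.values() for g in counts})
    core, accessory, paralogous = [], [], []
    for gene, counts in sorted(copy_numbers.items()):
        copies = [counts.get(genome, 0) for genome in genomes]
        if max(copies) > 1:
            # Duplicated somewhere: kept out of the core scheme.
            paralogous.append(gene)
        elif min(copies) == 1:
            # Exactly one copy in every genome.
            core.append(gene)
        else:
            # Single copy where present, but absent from at least one genome.
            accessory.append(gene)
    return core, accessory, paralogous


# Made-up example with three genomes.
table = {
    "geneA": {"g1": 1, "g2": 1, "g3": 1},
    "geneB": {"g1": 1},
    "geneC": {"g1": 1, "g2": 2, "g3": 1},
}
print(split_core_accessory(table))
```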
And then you have to do a lot of curation. By and large, at this point, we kind of know the size of the core for most bugs, or at least an estimate. So when you're designing something like this, you should have a fairly good idea of what the core was before you started. And then, does your core look like the size of the core that you were expecting? In this particular plot, that's the carriage of 10,000 genes in the pangenome of Campylobacter. That subset of genes that are carried in 100% of the genomes, that's the core. I would say that for a lot of organisms, the core might represent maybe 20 to 30% of the total available pangenome. For highly clonal organisms, it might be more than that. If your core is too big, then you're probably not using enough genomes to assess the core. If it's too small, well, this has happened to us before, where all of a sudden it's like, wow, that's not a lot of core genes. What can happen sometimes is that you have genomes in the set you're looking at that are not from the same species, and then you're going to lose a lot of genes. If I'm looking at a lot of Salmonella and I include some E. coli, the shared core of E. coli and Salmonella is smaller than the Salmonella-only core. So contaminants will reduce the core size. Basically, if you add a genome to your analysis and all of a sudden you lose a lot of the core, that genome is probably not of the species that you think it is. And we've seen that a lot. That point there, trust nobody, I cannot stress that enough. The number of genomes in NCBI that say, oh, this is Salmonella, and it's not, is surprising. So don't trust it.
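The rule of thumb about a genome that suddenly shrinks the core can be turned into a simple leave-one-out check. The sketch below assumes each genome is represented as a set of gene names. The function names and the toy data are hypothetical, and it recomputes the core once per genome, which is fine for illustration but not tuned for thousands of genomes.

```python
def core_genes(presence, genomes):
    """Genes present in every one of the given genomes."""
    gene_sets = [presence[g] for g in genomes]
    return set.intersection(*gene_sets) if gene_sets else set()


def core_gain_when_removed(presence):
    """For each genome, how many core genes come back if it is left out.

    A large value flags a genome that is dragging the core down, for
    example a contaminant or a mislabelled species.
    """
    all_genomes = list(presence)
    full_core_size = len(core_genes(presence, all_genomes))
    gain = {}
    for genome in all_genomes:
        others = [g for g in all_genomes if g != genome]
        gain[genome] = len(core_genes(presence, others)) - full_core_size
    return gain


# Made-up example: three similar genomes and one odd one out.
presence = {
    "sal1": {"a", "b", "c", "d", "e"},
    "sal2": {"a", "b", "c", "d", "f"},
    "sal3": {"a", "b", "c", "d"},
    "odd": {"a", "x", "y"},
}
print(core_gain_when_removed(presence))
```

In the toy data, leaving out "odd" brings three genes back into the core, while leaving out any of the others changes nothing, which is the pattern to look for.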
And unfortunately then they say, well, it's the submitter. And I get it, people are not trying to be deliberately bad. People mean well, but I'll frame it this way. If you get the wrong species, can you imagine the reliability of some of the other data? We've done a lot of work on Salmonella, and probably the most critical information that you can provide for Salmonella is the serovar. We also have a web service that predicts the serovar from the genome sequence data. And trust me, when we were comparing our predictions to what was in the public domain, a lot of serovar designations don't make sense. We know that they're wrong. It's not that people mean to submit the wrong thing, but there are a lot of errors in transcription and submission and all that. So people mean well, but just don't trust it, because you may not be getting what you think you're getting. It's not like fish, when they say, oh yeah, this is Chilean bass, and it totally isn't because the restaurant is trying to scam you. People mean well, it's just that sometimes, when you're submitting hundreds of genomes or whatever, errors happen. Yes, true enough. So certainly, just be skeptical. And definitely there's going to be a lot of polishing required. Now, here's the good news. A lot of work goes into the development of a cgMLST scheme. The good news is that it probably won't be your job to do that. It's going to be some poor jerk's. Well, I shouldn't say jerk. I was going to say poor bastard, but that's bad language. So, sorry.
Can I unrecord that? It's going to be some poor guy's job to design a scheme, and you will then get to reap the benefits of all their hard work by being able to use that scheme, and it's amazing. But they spent all this time building it just so, so that it's amazing. It takes a lot of work. Sorry Dylan, I didn't mean to imply that. Dylan builds schemes. So, it's a lot of work. I've seen it. But it won't be you, most likely. It'll be somebody else. But, and this is the big important thing, when you go back and you're one of the people who get to use these schemes, hopefully from this lecture you will take away some of the good practices that should go into the development of a good scheme. So then when you see a scheme and you're like, it's Saturday night, I'm looking at this data, and this scheme looks really bad because, based on Ed's lecture that really changed my life, you're going to go back and say, I don't like this scheme. I think this scheme is not so good. I want you to be critical, because right now a lot of these schemes are being built by a number of different people, and not all schemes are the same. There are probably three or four schemes for Listeria as we speak, and they're not all equally good. So do be critical when you go back and start using whatever platform you're going to use for cgMLST analysis. Yeah, it takes a lot of effort to do that. Yeah, for Neisseria, right? It's a labor of love to do that, quite frankly. Right, right. So, just be aware. Like anything, if you're using a scheme, make sure that it's a good one, because there are going to be different ones out there. Okay, now this is another piece of good news, really.
Whole genome sequence data can be analyzed and re-analyzed over and over without having to incur more lab work. Like I was saying before, say there are 10 different methods, and your boss comes to you and says, hey, there's 10 methods, can you test all 10 and tell me which one's the best? That used to mean you'd have to go and analyze a bunch of stuff in the lab. Horrible. You wouldn't want to do that. With whole genome sequence data, there are ways to evaluate how good these schemes are, and hopefully, as a scientific community, we will do a lot of that testing to see which ones are better. There are four schemes for Listeria. Which is best? We can tell. There are ways to do that. Okay, nomenclature. It's a neglected part, it's not a lot of fun, but you're here, and I can guide you for another 10 minutes. So nomenclature is just a naming scheme for the subtypes that a typing scheme generates. The importance of nomenclature is that it is the way that people exchange information. If I'm in Spain and you tell me you have an outbreak of ST-37, I know what you mean, because we all agreed on what that means. So nomenclatures are hugely important, and that's actually one of the big reasons that MLST is as popular as it has become, and why a lot of people have a lot of hope that it is a workable solution for global epidemiological tracking and surveillance. Now, when Gary was talking about SNPs and the ephemeral nature of SNPs, if you extend that to the logical conclusion of that thought process, there are a lot of folks that say, look, every outbreak is going to be unique.
There are no two outbreaks that are going to be the same, not with whole genome sequencing. Why do we need to give it a name? It's a very valid argument. I'm more from the other side, where I say, yeah, maybe we don't want to name the outbreak, because it will never look the same. But what if we could at least group outbreaks that are related or very similar to one another? Maybe not identical, but similar. So we do need a method of grouping similar things at a certain level. In MLST we have the clonal complex as a definition of related sequence types, and I'm a proponent of that. I feel like we need to maintain that. We just need to up our game. We have whole genome sequence data, so why don't we update that to reflect the fact that we have more data now? So a nomenclature system for cgMLST built on high-resolution clonal complexes, I think, is a good, valid thing. And a lot of our research actually is in that area. This is not going to make a lot of sense, and I don't have a lot of time. But anyway, go back to that ephemeral nature of outbreaks, where strains are very unique in time and space. I'm not a proponent of every last little guy getting a name. But we should know that they are all a family, and that family needs a name. There's a concept of lumping versus splitting: you either group things together or you split them apart, right? This has been an issue in biology since time immemorial. To me, the whole thing is about trying to figure out the optimal point of splitting or lumping, where you're tracking something useful, that has some meaning, and that is stable enough that it's worth naming. And if you're curious about this, come and talk to us.
We can tell you more, but basically it's about cluster stability. If a cluster of strains is quite stable in time, then it's worth naming. That's basically our point. If it's not stable, then forget it. Why give it a name? You know how you won't necessarily give a name to your pet goldfish? Because it might not last. But if you have a dog, you'll give it a name. Same thing here. That's all I'm saying. Okay. So for example, in Campy, we've done a lot of work in Campylobacter. We have a scheme of about 700 genes. The clonal complex, we think, should probably be scaled back to something like 650 out of 700 genes. Groups that are defined at that similarity level are pretty stable. Anything above that is too unstable. We don't think it's worth giving a name. Okay. So how do you do cgMLST? Oops, sorry. So there's your ingredient list. You can read that up. But basically you need genomes. You need a scheme. Someone's got to come up with it. Again, it doesn't have to be you. Some people are developing programs to be able to design a scheme. These people should be applauded. But most times you won't be the one doing that. If you have genomes and you have a scheme, there are programs that will then let you type that data. So what do you need? When you have a scheme, you need an allele database for every gene in your scheme. Then you need the sequence type definitions: this combination of alleles is ST1, this other combination of alleles is ST2, and so on and so forth. Someone has to come up with that. And then you need allele calling software. One approach is to do it using an assembly. If you have assembled genomes, there are programs for that. Here we mention the mlst program that one of our colleagues in Australia has developed. We have developed one called MIST.
It just looks for the genes and then tells you what the allele is. You can also do it on raw reads. There are programs that do this. I don't know if there's a preference, quite frankly. I think that calling alleles from raw reads is a bit more noisy, but it's also a bit more sensitive. You can pick up stuff that you might not be able to pick up on assemblies. But that's as far as I'll go there. And then when you have results, you have sequence types. What do you do? You need to visualize them. We're going to show you that in a minute. We use a tool built by some colleagues of ours called PHYLOViZ, where you load up allelic profiles of MLST data and it will make tree-like visualizations showing how things are related to one another. So in conclusion, the MLST approach is one of the two primary methods that have been suggested for analyzing whole genome sequence data in the context of surveillance and typing. It's good for certain things, terrible for others. What's it good for? If you have lots of recombination. Certain species are really, really recombinogenic. Others, not so much. Then there are the species that have what we call an epidemic structure, where there are clones or lineages that dominate in time: they increase in frequency, but then they kind of disappear, and meanwhile there's a whole bunch of diversity in the population. That's the kind of thing that MLST is good for. On the other hand, Salmonella, for example, is very clonal, not good for MLST. Listeria, highly clonal, probably not so good for MLST. And I guess the other thing is, some people are talking about combining approaches, using core genome MLST and then something more like whole genome MLST when you need to drill down into an outbreak.
Personally, I hate that approach, because I feel like if you need to delve into an outbreak, there are better tools, like SNP analysis, that are far superior to MLST analysis. So I don't advocate the hybrid approach. I think you should use the tool for what it's good for. To me, MLST is for long-term population tracking. Once you've got an investigation that is looking into an outbreak, you probably want to use SNP analysis. Yeah, so that last point, I disagree with it completely. And now we get coffee and networking.
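For anyone who wants to play with the grouping idea from the nomenclature part of the talk, where strains are lumped together when they share enough alleles (the Campylobacter example was roughly 650 out of about 700 genes), here is a minimal sketch. The talk does not say which clustering method is used, so single linkage is an assumption here, and the function names, the five-locus profiles, and the threshold of 4 are made up to keep the example small.

```python
def shared_alleles(p, q):
    """Number of loci where two profiles carry the same allele (missing never matches)."""
    return sum(1 for a, b in zip(p, q) if a is not None and a == b)


def single_linkage_groups(profiles, min_shared):
    """Group strains so that each one shares >= min_shared alleles with
    at least one other member of its group (single linkage, union-find)."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if shared_alleles(profiles[a], profiles[b]) >= min_shared:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Made-up 5-locus profiles; a real scheme would use hundreds of loci.
profiles = {
    "s1": (1, 1, 1, 1, 1),
    "s2": (1, 1, 1, 1, 2),
    "s3": (1, 1, 1, 2, 2),
    "s4": (5, 5, 5, 5, 5),
}
print(single_linkage_groups(profiles, min_shared=4))
```

Note that s1 and s3 share only three alleles but still land in the same group through s2. That chaining is a known property of single linkage, and it is one reason the choice of threshold and the stability of the resulting groups matter.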