Okay, so I'll be talking to you about whole-genome sequence-based multilocus sequence typing for pathogenic bacteria. A lot of this will build on what Gary and Will have been talking about. In this lecture we'll go over a basic introduction to molecular typing, bacterial population structure and how it relates to typing, and typing methods, multilocus sequence typing (MLST) in particular, both the classical variant and the next-generation versions made possible by whole genome sequencing. I'll also talk a little bit about nomenclatures for public health surveillance programs. So, populations: no population is completely homogeneous. Even when it may seem that way initially, there are always going to be variations in the distribution of disease and in exposure to risk factors. One way we can investigate this is through the concept of molecular epidemiology, which is essentially taking molecular approaches and using them to identify pathogens, so that we can take an in-depth look at their distribution, their environment, and their transmission dynamics. The general rule in molecular epidemiology is that you start with the notion that if strains are genetically similar to one another, we assume they're epidemiologically linked. This isn't always the case, of course, but that's our starting idea. We can then use molecular subtyping to take these groups, pick them apart, and see the internal structure within them. The challenge is that molecular subtyping methods in the lab have always been limited in the amount of genetic similarity they can compare, and historically this has led to estimation problems. When we use molecular typing for surveillance programs, we can look at the potential source of exposure, and we can start building trees to understand the lineage of these strains and their relevance to human health.
Two major parts of that are source attribution, figuring out where a particular strain came from, and identifying when outbreaks have occurred. Molecular typing grew out of historical methods like serotyping and biotyping, which of course continue to be used, but they offer limited resolution compared to molecular subtyping. In the 90s and the early noughts there was a proliferation of molecular typing methods, PFGE, MLVA, and their variations, and there were a lot of these. They were all linked by the fact that they looked at banding patterns on gels and required a lot of human oversight to run and to interpret. One of the big advancements that came out of that era was one that didn't rely on gel banding patterns: multilocus sequence typing, which came out of Martin Maiden's lab at Oxford. What MLST did is look at a small collection of genes, seven to nine, typically seven. And these weren't just any genes; they were small, conserved fragments of what are called housekeeping genes. These genes provide essential functions, so we knew they would always be present in the strains. They were also distributed throughout the genome, so we weren't accidentally overestimating or underestimating diversity by targeting any kind of hypervariable region. These loci were amplified by PCR and then sequenced using Sanger sequencing. And the way we looked at the data is at the level of alleles rather than nucleotide variation directly, so an abstraction of the sequence: one allele at each gene.
And the way that types were assigned is that they were stored in a centralized database, run out of Oxford, which allowed investigators in one lab to say unambiguously whether their strain had the same type as a strain in another lab: they would just check against the Oxford database, and that way everybody is speaking the same language. This was a very attractive feature for public health people looking to type their strains, because there's no ambiguity. So MLST became the gold standard for a lot of different organisms. There's a small typo here, it says over 50 schemes were developed; I think it's more like 150 now. These were used very commonly in both research and surveillance programs. Now, how MLST data is analyzed. If you look at a single gene fragment, approximately 450 base pairs long, you could treat it as a collection of 450 data points to be compared strain to strain, and MLST typically uses seven genes, so multiply that out and suddenly we're looking at thousands of data points we could use to compare. However, that's not how we actually run MLST: we only compare seven values, because the comparison is at the allele level. This seems like it would be a step backwards, but even though we're sacrificing some of that high resolution, it does have advantages. Of course, when bacteria reproduce, the two daughter cells that come off are ideally identical; they're clones of the parent. This isn't always the case: mutations can arise spontaneously, and this is a vertical process, passed from each parent to one of its offspring cells. So mutation is a diversifying process; each time it happens, we've created a new type that we can compare. But that's not the only way diversity can be introduced into bacterial strains. The other, of course, is recombination.
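To make the allele abstraction concrete, here's a minimal sketch of the classical MLST typing logic, using a toy in-memory database. The gene names, sequences, allele numbers, and ST numbers are all invented for illustration; real schemes live in the curated central databases just described.

```python
# Toy allele databases: for each gene, known sequences map to allele
# numbers, which are assigned in order of discovery.
allele_db = {
    "abcZ": {"ACGT": 1, "ACGA": 2},
    "adk":  {"TTGC": 1},
}

# Profile database: a tuple of allele numbers (one per gene) maps to a
# sequence type (ST) number.
profile_db = {(1, 1): 21, (2, 1): 22}

def assign_allele(gene, seq):
    """Look up a sequence; if it is new, append it with the next number."""
    db = allele_db[gene]
    if seq not in db:
        db[seq] = max(db.values()) + 1  # novel alleles increment by one
    return db[seq]

def assign_st(seqs_by_gene):
    """Turn per-gene sequences into an allelic profile, then into an ST."""
    profile = tuple(assign_allele(g, s) for g, s in sorted(seqs_by_gene.items()))
    return profile, profile_db.get(profile)  # None means a novel profile

profile, st = assign_st({"abcZ": "ACGT", "adk": "TTGC"})
```

The key design point is the one made above: two labs comparing STs never need to exchange raw sequence, only the integers, as long as both check against the same central database.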
Recombination is where DNA comes in from some external source, possibly another cell of the same species, and is used to replace a locus in the bacterial chromosome. This can introduce diversity in the same way that mutation does, in that a new allele parachutes in all of a sudden and, as a result, possibly brings in a large number of single nucleotide variations between the two strains. But it can also have a homogenizing effect: if an allele has changed through vertical mutation, it can be horizontally reverted back to the original. So recombination in bacterial populations complicates our interpretation of trees; inheritance is not strictly vertical the way it is in, say, humans. And this is not just on a per-locus basis; it generalizes out to the entire bacterial population structure. There are some general groupings of population structure in bacteria. At one extreme is the clonal structure, where nearly all diversity comes from vertical mutation. Then there's weakly clonal, which is largely the same as clonal, but some new diversity can be brought in horizontally rather than strictly vertically. Then there's the epidemic structure, where you have lots of relatively rare types circulating because there's lots of recombination, but suddenly one of them gets lucky because it's found a host it's adapted to, and there's a proliferation of similar clones, a clonal expansion. And finally the panmictic type, where you rarely see the same sequence twice because recombination is so frequent; there's tons of diversity, and essentially everything is rare. The epidemic population structure is what we typically see in pathogens: lots of rare genotypes circulating around, and when one of them finds a host, a sudden proliferation of that genotype. One way to visualize this is the diagram at the bottom, where you have all these strains as the red dots.
And when there's a sudden expansion, you have a large number of related strains that appear simultaneously, or seemingly simultaneously. This would be your classic outbreak scenario. The problem is that it can make interesting relationships difficult to keep track of, because you have these groups of seemingly clonal strains without an obvious connection between them. Applying MLST to that, the way sequence types were grouped into useful clusters was an algorithm called BURST, Based Upon Related Sequence Types. This was the original algorithm, developed by Ed Feil. Essentially, if you have the seven genes in an MLST scheme, each unique set of seven alleles forms a sequence type, and these are further grouped into related groups: matching at four out of seven loci defined a clonal complex, which was the terminology used for that. The BURST algorithm then went through several advances: there was eBURST, and then goeBURST, globally optimized eBURST. In each eBURST group there's a single founding sequence type with a ring of related sequence types around it, so you end up with a graph structure where each sequence type is a node, and the edges represent locus variants between the founding strain and its relatives. This was very helpful for interpreting MLST data in species that don't have nice vertical inheritance of mutations. I alluded to it a moment ago, but the MLST nomenclature was based on what were called sequence types: unique seven-gene allelic profiles. It might be a little hard to see, but this diagram shows the clonal complex, the sequence type, and then the seven genes. The first row is sequence type 21, and there's an allele designation given to each allele of each gene. Each profile of seven alleles is a sequence type, and some of them are single- or double-locus variants; those are highlighted in gray there.
So that third row differs by one locus from the top row, and therefore it's a single-locus variant of the top row's profile. These sequence types are further grouped into clonal complexes, groups that share four of the seven loci. As you can see, each row has a unique sequence type, but they're all clonal complex 21. The alleles are named essentially in order of discovery. As I mentioned earlier, there's a centralized database, so different researchers would investigate their strains, find a new allele, and submit it to the database. It would just be appended to the list, and the allele number would increment by one. So when you're typing your strain in the lab, you sequence each of these seven genes, compare against the database, and look up the relevant names. To make these databases work, though, they were, and still are, carefully curated by humans; they're still in active use. But even though MLST was great from the late 90s and early 2000s right up until recently, there are problems that have been manifesting themselves as we get better at molecular typing. Foremost is that it only uses seven genes out of the thousands in a strain, so you're really looking at a very small fraction of the genome. There's limited information in that, and strains that are otherwise quite different might show the same ST; the reverse can be true as well, by chance. The obvious solution, then, is to scale the MLST concept up to the entire genome, because we now have full genome sequencing that's fast and relatively cheap. You can take this seven-gene MLST concept and scale it up to hundreds or thousands of genes. And as people all over the world sequence more and submit their data to online repositories, we have ever greater diversity against which to design a scheme.
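The single-locus-variant and clonal-complex grouping described above can be sketched in a few lines. This is a naive illustration of the BURST-style idea, not any of the actual BURST/eBURST/goeBURST implementations, and the profiles below are invented.

```python
# Invented seven-gene allelic profiles, keyed by sequence type number.
profiles = {
    21:  (2, 1, 5, 3, 2, 1, 5),
    104: (2, 1, 5, 3, 2, 1, 9),   # single-locus variant of ST-21
    206: (2, 1, 5, 7, 2, 1, 9),   # double-locus variant of ST-21
    45:  (4, 7, 10, 4, 1, 7, 1),  # unrelated profile
}

def locus_differences(a, b):
    """Count how many loci carry different alleles in two profiles."""
    return sum(x != y for x, y in zip(a, b))

# Link any pair sharing at least 4 of 7 loci (at most 3 differences),
# then merge linked STs into one group: a toy "clonal complex".
sts = list(profiles)
complexes = {st: {st} for st in sts}
for i, a in enumerate(sts):
    for b in sts[i + 1:]:
        if locus_differences(profiles[a], profiles[b]) <= 3:
            merged = complexes[a] | complexes[b]
            for st in merged:
                complexes[st] = merged
```

The real algorithms additionally pick a founding sequence type for each group and arrange the rest as rings of single-, double-, and triple-locus variants around it, which is where the graph structure mentioned above comes from.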
But of course, it's not that easy; otherwise we wouldn't be talking about it here. The problem with scaling up is that you run into the concept of the pan-genome, which Will talked about. The pan-genome is the totality of all genes available to a bacterial species; there's no guarantee that any two strains within a species will share exactly the same genes. The pan-genome is broken into two segments. There are the core genes, which are shared by all strains; they're definitional to the species. These are an expansion of that housekeeping gene concept I mentioned earlier: they're essential. And then there's a component called the accessory genes, or the accessory genome. These are often adaptive, and no two particular strains are guaranteed to share accessory genes; they might, they might not. When you're designing a whole genome MLST scheme, you start to run into problems related to that. When you're comparing two strains, you're usually using draft data, because it's relatively rare that people complete their bacterial genomes; you usually just have the draft state, several large contiguous pieces of sequence data, contigs, with gaps between them. So when you're comparing strains, a large portion of strains will be missing at least some genes out of your scheme, and you end up with a problem: because of the accessory genome, you don't know whether you're missing those genes because they were lost in the sequencing step and are biologically there, or because they're not biologically there, and that's going to be difficult to untangle. And with these incomplete assemblies, let's say you've targeted some gene and it spans the gap between contigs: you might get a partial gene, or you might be missing the gene entirely.
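The core/accessory split just described has a very direct set-theoretic reading: the pan-genome is the union of the strains' gene sets, the core genome their intersection, and the accessory genome whatever is left over. A toy sketch, with invented strain and gene names:

```python
# Each strain's annotated gene content, as a set of gene names (invented).
strains = {
    "strain_A": {"gyrA", "recA", "flaA", "cdtB"},
    "strain_B": {"gyrA", "recA", "flaA"},
    "strain_C": {"gyrA", "recA", "tetO"},
}

pan = set().union(*strains.values())          # every gene seen anywhere
core = set.intersection(*strains.values())    # genes shared by all strains
accessory = pan - core                        # present in some, not all
```

The typing problem described above is exactly that, on draft assemblies, a gene absent from one strain's set might be a genuine accessory absence or just a sequencing artifact, and the set arithmetic can't tell the two apart.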
If it's partial, you might know it's there, but you don't know its identity, the exact sequence, which is what the scheme is based on. Another problem with accessory genes is that they can often show significant variation in length and in sequence. Highly variable genes can be problematic in a whole genome scheme because they can mess with homology searches, and when they're highly variable, it can be difficult to tell whether you're even talking about the same gene, because of the issue of paralogs: through gene duplication events, a genome might carry multiple similar variants of a gene, and suddenly, when you're trying to assign an unambiguous type, you don't know which one you're looking at. The paralog issue can be particularly difficult to work with once you've scaled up to thousands of genes and have a lot of comparisons between things that may or may not be the same gene. It's hard to assign a true sequence type, and a lot of strains might be untypeable, because the classic MLST system required full data at all seven genes; if you scale that concept up to thousands of genes, suddenly you don't know what you have, essentially. This missing data issue becomes more problematic as you add more genomes, because in a sufficiently large dataset, every locus will be missing at least some of the time. A lot of the missing calls tend to come from a small number of genomes, but it can be difficult to identify which genomes were the bad ones if you don't know what's supposed to be there to begin with, due to the accessory content issue. So what ends up being problematic for the idea of whole genome MLST is that missing data is ultimately inevitable, and if you don't have a consistent definition of which genes are supposed to be there, you don't know whether they're missing due to technical problems or biological ones.
Ultimately, you're going to have to be willing to make sacrifices in order to design a scheme that's robust and can be used for surveillance. That's where the concept of core genome MLST comes in, as distinct from whole genome MLST. This is a halfway step between classical MLST, where we're looking at seven highly conserved genes, and whole genome MLST, which takes everything available; with the latter there's just too much sequence, there are too many genes, and it rules out the kind of manual curation that worked for classical MLST. With core genome MLST (cgMLST), we target only the core genes, the genes we know to be present in every member of the species. That way, when we see an absence, we know it was due to a technical error and not to biological absence. This makes the assignment of cgMLST types more tractable than whole genome MLST. Although, just because genes are core genes doesn't make them automatically good; there will still be ones that don't play nicely with your scheme if they have paralogs, which can cause problems, or if there's a large degree of length variation. But if you're very conservative with your definition of the cgMLST scheme, you can end up with a lot of the same advantages of classical MLST: you will always know which genes should be there, you can assign types, and there's no ambiguity. In designing the scheme, the main steps fall into three categories: identifying all potential target loci; separating them into core and accessory genes and paring the set down to a nice consistent one; and extracting those core genes. You identify loci that are present in all strains. Of course, we can't go too crazy: if we're looking for 100% presence across thousands of strains of varying sequence quality, we are going to miss some loci, but you should be as stringent as possible.
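That core-locus selection step is essentially a thresholded filter over a gene presence/absence table. Here's a minimal sketch; the table, gene names, and the 99.9% default are illustrative, not a prescription from any particular pipeline.

```python
def select_core_loci(presence, threshold=0.999):
    """Keep loci present in at least `threshold` of genomes.

    presence: dict mapping locus name -> list of 0/1 calls, one per genome.
    """
    core = []
    for locus, calls in presence.items():
        if sum(calls) / len(calls) >= threshold:
            core.append(locus)
    return sorted(core)

# Invented presence calls across 1000 genomes.
presence = {
    "gyrA": [1] * 1000,             # present everywhere: clearly core
    "flaA": [1] * 999 + [0],        # 99.9% present: passes this threshold
    "cdtB": [1] * 800 + [0] * 200,  # accessory: excluded
}
core = select_core_loci(presence)
```

In practice this is the step where filtering out poor-quality genomes first matters most: a handful of bad assemblies can make genuinely core loci look accessory and shrink the scheme for no biological reason.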
Some definitions of the core genome use something very relaxed, like presence in 80% of strains, but that clearly won't work here. If you're looking for genes that are present in 99.9% of strains, you can come up with a very conservative set that still gives you lots of typing resolution but minimizes your troubles down the road. And once you've designed, or been given, a core genome MLST scheme, you have to do some sanity checks. The first and probably biggest is: does this make sense at all? Does it make sense given the literature? If your core genome is too big, you've probably included accessory genes; if it's too small, you probably haven't captured enough diversity and you're missing core genes. As long as you're capturing as much diversity as you can and filtering out poor-quality genomes, you should end up with a relatively good definition of the core genome of your species. Another common pitfall I've encountered in other designs is that they'll sometimes attempt to merge species, and this complicates results, because they are different species for a reason: they might share much of the core genome because they fall into the same genus, but merging them can end up breaking the scheme. Once you've built the scheme, you have to go through and polish things up and make sure it's as good as it can possibly be before you unleash it upon the world: removing highly variable genes, as I mentioned, and paralogous genes, and looking at the data to make sure everything makes sense and no issues got past the initial quality filtering. Now, I've been talking about this MLST approach, and Gary earlier was talking about SNP typing for bacteria. That's the approach of taking an alignment and basically recording which single nucleotide polymorphisms (SNPs) are in each strain.
With SNPs, you end up with the highest possible discriminatory power, because you're comparing everything the genome has to offer. But if you approach that naively, you're susceptible to the effects of horizontal transfer, the recombination issue I mentioned earlier: a new gene might come in, and instead of a single vertically inherited mutation, or a small number of them, you can end up with a large number of variant positions. The SNP pipeline we ran in the preceding lab works around this by filtering out highly variable regions. MLST, however, is easier if you're trying to assign standard names under a standard nomenclature. It's resistant to the issue of recombination, because if a gene is recombinant, it will change a single allele, but it won't distort the comparison by counting all the different SNPs introduced by that one event. So you've sacrificed some discriminatory power by using MLST, in exchange for a robust nomenclature. And nomenclature is important for the reason I mentioned earlier: the ability to unambiguously discuss the same strain between labs in a way that's reproducible. We then know that a lab in Canada and a lab in Denmark are talking about the same thing if they use the same names, and for public health, of course, this is of particular interest. For core genome MLST too, even though it has reduced discriminatory power relative to SNP typing, it's still very high resolution, and it's often useful to back the clustering off slightly. What I mean by that is this: the graph I've generated here shows the similarity of different clustering thresholds to their neighbors. Essentially, if you go too stringent with a cgMLST scheme, say you have a 700-gene scheme and you compare two strains at all 700 genes, things are unstable and clusters can change rapidly in the very short term.
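The threshold-stability idea behind that plot can be sketched as follows: cluster strains by single linkage at an allelic-distance threshold (two strains end up together if any chain of pairwise distances at or below the threshold connects them), then ask whether the partition changes between neighbouring thresholds. The distances below are invented to mimic a strain sitting right at a cutoff.

```python
def clusters_at(dist, n, t):
    """Connected components over n strains, linking pairs with distance <= t."""
    parent = list(range(n))  # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (i, j), d in dist.items():
        if d <= t:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(i)
    return sorted(map(sorted, groups.values()))

# Four strains; strain 3 sits exactly 45 allele differences from the rest.
dist = {(0, 1): 2, (0, 2): 3, (1, 2): 4, (0, 3): 45, (1, 3): 46, (2, 3): 47}
stable = clusters_at(dist, 4, 44) == clusters_at(dist, 4, 40)    # partitions agree
unstable = clusters_at(dist, 4, 45) != clusters_at(dist, 4, 44)  # strain 3 joins
```

A threshold where the partition matches its neighbours' gives names that stay put as new strains arrive; a threshold where it doesn't, like 45 here, is a poor place to hang a nomenclature.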
By picking a clustering threshold that's backed off a little from that, you can come up with a useful name for your strain that remains stable over time and space. This particular example was drawn from Campylobacter jejuni: we had determined that comparing below 45 allele differences was unstable. Campylobacter is a particularly difficult species to work with in this regard, because there is a lot of recombination. The overall workflow of a cgMLST analysis begins with whole genome data, and then there's a branch: you can either go assembly-free and work directly off the reads, which is what something like MentaLiST or SRST2 does, or you can assemble your genome and assign types from the assemblies using programs like chewBBACA, which came out of the University of Lisbon, or the simply named mlst, and then finally cluster and visualize using programs like PHYLOViZ or GrapeTree; I'll be talking about GrapeTree in my upcoming lab. The sort of data you'd want for an analysis like this is a collection of multi-FASTA files containing your allele definitions, a list of sequence types that you can use to assign to strains given their allelic profiles, and then of course some of the software I just talked about. There are other options: there's SeqSphere+ from Ridom, and BioNumerics from Applied Maths. They each have their own MLST systems and their own MLST schemes; however, these can be quite expensive, as I'm sure you will know. There's also the BIGSdb Genome Comparator, by the same authors as the original MLST databases, but this can be difficult to set up in your own lab; it's a bit of a fussy program. We've been using chewBBACA. Once you've run it, you have MLST or cgMLST types assigned, and then you can visualize those through something like GrapeTree or PHYLOViZ.
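Underneath the clustering and visualization step in that workflow sit pairwise allelic distances, and a key detail is how missing loci are handled: a locus that wasn't called in one of the two strains, say because it spanned a contig gap, is typically skipped rather than counted as a difference. A sketch of that comparison, with invented five-locus profiles (`None` marks a missing call):

```python
def allelic_distance(a, b):
    """Number of differing alleles over loci called in both strains."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(x != y for x, y in shared)

strain1 = (1, 4, 2, 7, 1)
strain2 = (1, 5, None, 7, 3)  # one locus missing, e.g. split across contigs
d = allelic_distance(strain1, strain2)
```

This is why a conservative core-gene definition matters so much: the more loci that go missing for technical reasons, the fewer loci any given pair is actually compared on, and the noisier these distances become.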
These tools can draw a minimum spanning tree, so these are the eBURST-style groups I talked about earlier, and you can annotate them with metadata, because, as I think Will mentioned earlier, sequence on its own is meaningless; you have to connect it to some form of metadata to draw any interpretations. So, to conclude: the MLST approach, also known as gene-by-gene, is one of two primary methods, and it can be contrasted with SNP typing. It's particularly useful if you know your organism is highly recombinogenic. It can be used in clonal organisms, but that's where SNP typing's high resolution is most appropriate. And in real life, a hybrid approach will often be required, of course. And with that, I guess we're due for a break.