Okay. Okay, so Canadian Bioinformatics Workshops; just to remind you of the Creative Commons licensing for the content. Today we're going to talk about AMR analysis, which I'm sure is important to a fair number of you considering how critical it's becoming. The lab and the lecture mirror each other, so the things I do in the lecture we're going to come back to and touch upon in the lab. And my main goal is really practically oriented: there's a lot in the AMR field to track, but not all of it is important to you, so I'm going to highlight where the decision points come from. The learning objectives: you'll understand the molecular basis of AMR; you'll understand AMR reference sequence databases, how they're created and used; you'll learn how to predict antimicrobial resistance genes from genome assemblies or from metagenomic sequencing; and we'll talk a little bit about the promise and pitfalls of predicting phenotype from sequence information. I probably don't need to review this too much with this audience. The simple message is that bacteria are evolving resistance faster than we can bring new drugs to market. The tweet on the left particularly stands out: this is a UTI infection from Spain that was resistant to everything they tested but one drug. And this is unfortunately just more and more common. This is a Canadian statistic; it's not just mortality but also illness. Canada is predicted by 2050 to lose close to 400,000 lives, $120 billion in hospital costs, and nearly $400 billion in GDP. And beyond that is the loss of modern medicine: if you can't control secondary infection, you can't get a lung transplant or a hip replacement, you can't undergo chemotherapy, etc. What's really remarkable, and it drives what we're going to do in the lab today, is the diversity of molecules involved.
It's not like we're dealing with one small group of compounds; we're dealing with a huge diversity of chemical space, compounds that we've discovered over the years and have been losing over the more recent years. And on the left: it's not just that resistance is taking drugs off the market, it's that we aren't finding any more. For antibiotics that target Gram-negatives in particular, we really haven't found anything decent since the 1970s, and there's been a discovery gap since the 1980s. So it's a double whammy: bacteria are overcoming these drugs, and we're really not finding anything fundamentally new, with a few notable exceptions. For the most part we're talking about pathogens that we're quite familiar with. My grandparents' generation thought gonorrhea was trivial: they had a shot of penicillin and it was over. But every decade that goes by, fewer and fewer drugs are able to treat it, and we now have basically completely untreatable strains. So it's not like we're dealing with fundamentally new pathogens, for the most part. The big challenge of controlling AMR, and getting surveillance data around it, is that it's not just in the clinic and not just in people. Antibiotics are used in agricultural settings, and antibiotics and antibiotic resistance genes get into wastewater and therefore into the environment. They're involved in agricultural processes and food production. It's a really complicated web; in Canada we call this the Confusogram, a diagram that's been around since the late 1970s. Canada in particular has a major federal effort to sequence across this space, to understand how both pathogens and antimicrobial resistance genes move around. You might only be working in one of these sectors, but you're not in isolation. It's really important to understand that you're connected to a much more complicated web, whether you're dealing with a pig on a farm, a fish in a stream, or a person in the clinic.
Overall, though, the schema of how we annotate resistance is pretty simple. We perform DNA sequencing, and yesterday we talked about the many options for that, so you have to choose your sequencing platform. That sequence data is compared to reference sequences, things that we know about AMR. And that can lead to prediction of the resistome, the catalog of genes and mutations that confer resistance, which is the tool of AMR surveillance: where are the genes, where are the mutations, how do they move around. Then the Holy Grail is at the bottom right: can we take those lists of genes and mutations and actually predict what the bacterium is going to do, what its phenotype is going to be? This leads to risk assessment and clinical decision making. Antimicrobial resistance genes that barely generate a phenotypic resistance are not high risk; they are a potential future risk if they evolve to be more resistant. But those now circulating that are quite potent are really important for risk assessment and clinical decision making. There are really three fields involved in this. The reference sequence part is the art of biocuration: how you take knowledge and digitize it. Bioinformatics is the algorithm space: how do you annotate a genome, for example. And analytics is how you make reliable phenotypic predictions from genotype. We're going to talk about all three today. In particular, in the lab we're going to use my lab's products, the Comprehensive Antibiotic Resistance Database (CARD) and its Resistance Gene Identifier (RGI) software, including some of our machine learning results. Before we get into the nuts and bolts, there are a few things I want you to think about for context in your own decision making. The figure on the left shows that antibiotics target a huge range of things in bacteria.
And bacteria have evolved an equally large range of ways to resist them. Drugs may target the DNA replication machinery, protein biosynthesis, or cell wall formation; the bacteria may resist by spitting the drug out through efflux, by making the membrane impermeable to the drug, by modifying or protecting the target, or by chewing up the drug. Again, a huge range of drug classes and mechanisms. There are thousands of resistance genes, and not all of them are going to be important to you: some of these drug classes you may not use in your system, you would never prescribe them or use them in agriculture, so why would you care about the resistance? So there is some decision making about which mechanisms, which drug classes, which targets are important to you. Something that's already come up in the course, the elephant in the room of AMR, is lateral gene transfer. We know AMR is created by overusing or misusing antibiotics, but what makes this worse is that resistance genes often become associated with mobile genetic elements such as plasmids, so Fin's lecture this afternoon is going to be critical. This changes the game: it means that an infection in one pathogen can suddenly be in many. This is a screenshot from CARD's website. This is NDM-1, one of the high-threat genes; it undermines one of our antibiotics of last resort (the carbapenems), and we really only have two. This gene was discovered in one patient in Northern India who then flew to England for additional treatment, and it broke out and became a hospital-acquired infection. It began in Klebsiella pneumoniae, but you can see from CARD's surveillance work that it is now in well over 30 pathogens because of plasmids, and it is a global threat. We've sequenced in our one little urban community of about 500,000 people and found almost a dozen different plasmids carrying this particular gene.
So lateral gene transfer is a key part of this, and that's why Fin's lecture is a good thing to attend. Now let's think about analysis. We're going to go back to this metaphor from a former US Secretary of Defense from quite some time ago: there are known knowns, the things we know we know; there are known unknowns, the things that we know we don't know; and then there are the unknown unknowns, the things we don't know that we don't know. People giggle, but this is actually from black swan threat theory. AMR is best looked at from a threat perspective, which is why we do surveillance. The known knowns are the genes and pathogens we are routinely tracking, and in an average public health lab this list is very short. We might only do PCR for 12 different targets; we may only do culture for half a dozen pathogens on a regular basis. The known knowns, when it comes to surveillance, are extremely small, and we might partner that with phenotypic testing, maybe testing eight antibiotics regularly. So it's a tiny fraction of what's out there. The known unknowns are the genes that we know about from the scientific literature, from research, from other surveillance, but that we don't routinely track: we don't have a PCR assay, we don't have a phenotypic test. And particularly, as COVID showed, you always get variants. Genes published once in the scientific literature will evolve, and there will be variants out there; in particular, with mutation-based resistance, novel mutations occur. So while we are doing surveillance, most of the things we could surveil are actually sitting in the known unknowns: we know about them, but we're not looking. The unknown unknowns are the emergent threats. Fortunately they're not hugely common, less than one a year of a truly new resistance mechanism or particularly potent resistance gene.
But this is really the hard part of surveillance: how can you find an emergent threat? Usually this is sequencing related to unexpected outcomes. We see drug failure, we see fatalities, we start to suspect something new is going on, and we start to look and research happens. COVID showed how quickly we can do that now, as opposed to, say, five years ago. So DNA sequencing from a public health lab or a clinical setting really can change the game, whereas traditionally we could only afford a small number of PCR and phenotypic assays. Sequencing is agnostic: if you can get sequence, it will tell you about everything, the known knowns, the known unknowns, and possibly the unknown unknowns, depending on how good your bioinformatics is. The reason this course exists is that there's a massive training gap: how does a public health lab go from traditional molecular biology, PCR, and some phenotypic work to handling genomic data? And the most common worry we hear is information overload: you go from a lab that was tracking half a dozen genes, maybe by PCR, to suddenly getting lists of 50-60 genes from sequencing projects. How do you make decisions based on that? The next way to think about your work is along three vectors. The first vector is what you're looking for. On the left is what I call perfect match screening. Are you only wanting to keep track of genes that we know a lot about? CTX-M-15, for example, is a very potent beta-lactamase, and many public health labs do PCR to track it. There is this gene, I know it, I can point to experiments, it is a threat: I only want to track that. Or do you want to look for novel functional variants? Do you want the capacity to say, I have a relative of CTX-M-15 that I need to worry about, that's affecting outcomes? Or are you trying to do surveillance for emergent threats? The second vector is what part of the genome.
On the left: do you really care about acquired resistance genes, borne on plasmids and mobile genetic elements? These tend to be the highest-threat genes: because of antibiotic usage they tend to become mobile on plasmids, they are expressed reliably, and they usually have high MICs. In the middle is resistance by mutation, which is predominantly genomic. Is that important in your organism, such as tuberculosis, for example? And on the right: are you interested in the more subtle things, intrinsic resistance, efflux, regulatory control? Glycopeptide resistance in particular can come up on this side. Are you interested in all of these? Probably, in your work, so you need to think about where you are on this vector. Then the last vector is what your data is going to look like. On the left, we culture our isolates and do whole genome sequencing, so we get good estimates of the complete genome sequence with low amounts of error: we actually have whole antimicrobial resistance genes. This is the easy stuff to annotate. On the right is community sequencing, in other words metagenomics, where we're looking at sequencing reads and the data is much more fragmented. Here you have some different choices. So where are we at as a field? On the left, perfect match screening for known genes, particularly if they're acquired and plasmid-borne, and particularly if you're doing whole genome sequencing: we're doing really well, all the tools and databases do very well, particularly for the ESKAPE pathogens and the Enterobacteriaceae. Novel functional variants and resistance by mutation: varied results, not as good. There's been a pretty good effort in Neisseria gonorrhoeae and Mycobacterium tuberculosis; for other pathogens, mutational surveillance is modest and the biocuration is modest.
Where we're at infancy is really bioinformatics and machine learning tools to predict emergent threats. It's still really about noticing unexpected outcomes and then starting to look for things; we don't really get early warning from sequencing. We've seen some starts on predicting how important efflux and regulatory mutations are in conferring meaningful resistance. Then the last part is the improvement, even over just the last three years, in whole genome sequencing and whole community sequencing, or metagenomics. There have been real advances since the last time I taught this course: this was in red then, not in blue. Whether it be sepsis, environmental monitoring, or One Health work, the algorithms and annotation space have really improved. Okay, so we're going to highlight a few areas of biocuration first, and we'll start with the elephant in the room. This is a screenshot from Wikipedia of available antimicrobial resistance databases, and this is not the whole list by far. AMR is a big threat, there's a lot of funding in the system, and there are a lot of groups internationally starting databases. A lot of them are targeted to a specific need; a small number are broader in scope. Some of them are updated on a regular basis; some were built once and never updated again. It's messy. I'm a biocurator, and I'll admit we're in a part of the field where everyone and their uncle is doing some work, and there are a lot of boutique databases and software. Unfortunately, if you're in this space you have to do some background reading to see which ones you like. I'm going to mention the big three, but before I do that I want to mention tuberculosis. Tuberculosis tends to have a very tight community, and the metrics for determining MICs are quite different; a lot of it is likelihood-based mathematics. There have been some really good efforts. The American ReSeqTB effort just lost their funding and they're offline, but we captured their data, thank God.
But TB tends to be a specialist space, so if you're working in tuberculosis you've got to do your homework; often there are very specialized resources. The big three, if you're really thinking about general AMR annotation, are NCBI's Pathogen Detection Reference Gene Catalog, ResFinder in Denmark, and ours in Canada, the Comprehensive Antibiotic Resistance Database. We are what you would call friendly collaborator-competitors: we compete with each other, but we also work very tightly together, sharing data and correcting it. CARD and NCBI in particular work together almost daily at times, sharing data and discussing naming conventions and curation. ResFinder traditionally has been focused on the plasmid-borne threat but has really branched out recently. So even though I run one database, honestly you'll do well with any of the three, though Fin might give you a different opinion; he's a big fanatic, I would say. We're going to focus on CARD, but some of the principles apply to the others. As Emma mentioned in her talk, we are ontology driven, and that is a distinction: NCBI doesn't really use an ontology, though they're happy to use some of ours, and ResFinder does not use an ontology. We really came at this from a data science perspective. In CARD are the genes and mutations conferring resistance, and there are some rules; Emma mentioned that we have hard rules. The evidence must come from a peer-reviewed publication in PubMed. It's not just someone emailing us and saying "I found an AMR gene": in that paper there must be clear experimental evidence of MIC, so a lot of our curation is reading the actual methods and results sections to see if it's a truly solid experiment. A lot of proposed AMR mutations are actually correlative, not causative: you can't point to a well-controlled experiment, you're really looking at an association. And that determines the level of evidence, what deserves to get into CARD.
And then the sequence must be available: if they don't make the sequence available, we're not interested. In red there are the beta-lactamases; this is an exception, as NCBI leads the naming of beta-lactamases and we work with them. In the genomic era we're essentially finding beta-lactamase sequences faster than we can do the experimentation to validate them, so they are getting names based on sequence similarity; for only a tiny fraction can you actually find a paper with the experiment. Some groups are starting to do high-throughput experiments with robots to actually generate this data, so we will catch up, but right now genomics is well ahead of what we're doing experimentally. There's a screenshot on the right: these are the public health agencies using CARD in the last seven days, so we're busy, to say the least. And the two arrows I'll highlight: this is what often stresses our researchers. There are 5,000 reference sequences for AMR genes in CARD, including over 1,900 mutations, and given the diversity of those genes there are over 300,000 possible alleles, or sequences, conferring antimicrobial resistance. It's a huge amount of information, but not all of it is going to be important for what you are doing, so in the lab in particular we're going to filter through that to ask what is important to you. CARD has this Antibiotic Resistance Ontology knowledge base. One of the things it does, at the top, is provide permanent accession numbers; it has nomenclature, and critically, it tracks synonyms. Genes get renamed over time, particularly the older ones, and we keep track of all that, so if you know an older name, and that's what you're familiar with, you'll still be able to find the data in CARD. So here's the resistance ontology knowledge base, not shown as a graph but shown as a list.
It tells you everything we know about this particular family, the ANT(6) family, including the sub-terms: these are the four genes in that family. You can see a relationship: it confers resistance to aminoglycoside antibiotics, and you can see the mechanism, antibiotic inactivation. So this is all our knowledge base: if you happen to find a gene in this family, this is all the information you can pull out of CARD. And a little bit more. One of the guarantees of CARD is that you will always get genes classified: when you spit out results from annotating, we have three columns explaining the classification. Every gene is in an AMR gene family; every gene has a resistance mechanism described; and every gene has a list of the drug classes it impacts. We highly curate, almost at 100%, the relationship between AMR gene families and drug classes. We are actively curating the relationships between individual genes and mutations and antibiotics; we have over 3,000 of those. But I will say those relationships, that this gene impacts that antibiotic, are qualitative statements in the ontology. We have the means to start curating the MICs, the quantitative data, but we're in the early days of that; you can guess that it's a phenomenal amount of data. At this point I'm actually convinced it might be cheaper to use robots and do the screening myself to generate the data, as opposed to reading all the papers. And of course within any family you'll have individual genes: at the bottom is the sequence of the ant(6)-Ib gene from Campylobacter, and both the amino acid and DNA sequences are there. Lastly, if you happen to look at an individual gene on CARD (this is the ant(6)-Ia gene), outside of all the ontology information and the papers you will see a section on molecular epidemiology.
For this one gene, as for every gene in the database, we run algorithms against well over 200,000 available genomes, plasmids, whole-genome shotgun assemblies, and genomic islands (thanks to Fiona Brinkman for the genomic island data), and we generate molecular epidemiology data. Where is this pathogen, where is this gene found? So resistance with perfect matches, in other words sequences encoding a protein identical to ANT(6)-Ia, or maybe sequence variants. Not only does it tell you what the pathogen is, it flags whether the gene is found in genomes, genomic islands, plasmids, or whole-genome shotgun assemblies, so it gives you that contextual data. This is a huge amount of compute on our end; for most labs this would just be out of reach, so this is our investment for the community. It takes about three weeks of hard compute on a supercomputer to generate all these data, and we make it available for everyone to use. And we're going to see how this type of epidemiology result can inform your analysis. So overall, it's a high-quality reference database. It is hand-performed expert curation, but the curation is guided by some text mining algorithms that lead us to papers. We try to cover every gene and every mechanism. We provide some analytics, which we'll talk about, and as Emma talked about, we use ontologies and the rest for data harmonization. And just recently, we now provide a standardized naming convention for every single gene in the database; the machine learning people were pulling their hair out because the traditional nomenclature is so messy. We have dedicated human curators and we update roughly every one to three months, and we can do emergency updates when a particularly new and awful gene shows up. So that's the biocuration; this is your reference data. I'm particularly looking at CARD, but you could use NCBI or ResFinder. The bioinformatics is how you use that reference data to annotate new genomes.
So I'm going to give you an example. A few years ago, colleagues of mine gave a seminar in the department about a patient who had Salmonella, Salmonella enterica, in their spine, a very abnormal place for that infection. There was an interesting travel history, misuse of antibiotics, everything you'd expect. A really strange case. So what did they do? Of course, they got a spinal tap and cultured the Salmonella. That's its genome there, and we ran it through CARD's Resistance Gene Identifier, which you're going to use in the lab today. What RGI does is compare this genome to all the knowledge in CARD and give you an annotation, and the website produces graphics like this. Every single box on the outside circle, like MdsB and OXA-1, is an individual resistance gene in this one cultured Salmonella. This is a dangerous Salmonella; your average Salmonella does not have this many boxes. You can recognize some of these gene names; a lot of them you'll have no idea about. So using the ontology, you can reclassify the data by drug class, which is the middle wheel. You can see penam antibiotics, fluoroquinolones, sulfonamides, macrolides, etc., and it actually tells you the number of hits: for the penam antibiotics, there are six genes conferring resistance in this Salmonella genome. The reason we're able to make these different levels of interpretation is that there's a standardized ontology, so we can compute over that ontology to ask, for all those genes, how does this roll up to drug classes, for example. Now, the figure on the top left has a green section that says Perfect, a yellow section that says Strict, and a red section that says Loose (mind you, we're not showing Loose results here). And this is where we go back to our threat theory: RGI provides results at three levels, Perfect, Strict, and Loose.
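We'll unpack Perfect, Strict, and Loose next. First, though, that roll-up from individual gene hits to the drug-class wheel is just a computation over the ontology. A minimal sketch, where the gene-to-class mapping is invented for illustration and stands in for CARD's curated relationships:

```python
from collections import Counter

# Illustrative subset of an ontology: each detected gene maps, via its
# AMR gene family, to the drug classes it confers resistance to.
GENE_TO_DRUG_CLASSES = {
    "OXA-1":    ["penam"],
    "CTX-M-15": ["cephalosporin", "penam"],
    "sul1":     ["sulfonamide"],
    "qnrB19":   ["fluoroquinolone"],
    "mphA":     ["macrolide"],
}

def classes_from_hits(gene_hits):
    """Count resistance-gene hits per drug class, as in the middle wheel."""
    counts = Counter()
    for gene in gene_hits:
        for drug_class in GENE_TO_DRUG_CLASSES.get(gene, []):
            counts[drug_class] += 1
    return counts

hits = ["OXA-1", "CTX-M-15", "sul1", "qnrB19"]
print(classes_from_hits(hits))  # penam is counted twice: OXA-1 and CTX-M-15
```

This is why the standardized ontology matters: the same computation works for any gene list, because every gene carries its family, mechanism, and drug-class relationships.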
So the Perfect hits are the known knowns: they are perfect, identical matches at the amino acid level to the curated gene in CARD. OXA-1 is in the green: we know there are dozens of good papers on its MIC levels, so finding a perfect hit to OXA-1 is a high-threat result. And maybe that's all you care about. You're not interested in variants, and so you would only look at the green. Strict is where some bioinformatics comes in: the Strict hits are the known unknowns, these are variants. But how do we know they're functional variants? RGI predicts that the sequence is similar within some model context and predicts it is a functional variant; we're going to come back to how it does that. So there's some bioinformatic magic to say, this variant I see of a known gene, is it really functional or is it non-functional? If you're not passing that model, you're outside of the model, and you become a Loose hit. A Loose hit either is a new AMR gene, or, more often, it's just nonsense: it's a homologous protein region, but the protein is probably interacting with some other type of small molecule. So a lot of users never turn on Loose. Really, the only time you should turn on Loose is when you look at your gene list and you can't explain your phenotype. Why am I getting macrolide resistance when Perfect and Strict do not show any macrolide resistance genes? That means you might have found something new, and you need to start looking at the Loose hits. And anything in the Loose hits that you think is suspect, you've got to validate: you need to clone and express it. So in clinical and agricultural settings, they never use Loose.
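That decision rule, only opening Loose when the phenotype is unexplained, is simple enough to write down. A sketch with a hypothetical helper; the drug-class names are illustrative:

```python
def classes_needing_loose_review(phenotypic_resistance, annotated_classes):
    """Drug classes where observed resistance is unexplained by Perfect/Strict
    hits: the one situation where turning on Loose is worth it."""
    return sorted(set(phenotypic_resistance) - set(annotated_classes))

# The lab observes macrolide resistance, but Perfect/Strict only explain penams:
unexplained = classes_needing_loose_review(
    phenotypic_resistance=["penam", "macrolide"],
    annotated_classes=["penam"],
)
print(unexplained)  # ['macrolide']: inspect Loose, then clone and express to validate
```

If the returned list is empty, Perfect and Strict explain everything you see and there is no reason to wade into the Loose hits.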
They can explain their phenotype from Perfect and Strict. Researchers like, say, Gerry Wright at McMaster, who goes out to weird geologically isolated caves or ancient permafrost, often have to use Loose. In fact, one of his bacteria, isolated from a geologically isolated cave, could metabolize modern antibiotics but had no Perfect or Strict hits at all. You need to go into the Loose hits and clone and express everything to realize that there was this long-standing threat sitting in these cave bacteria. So again, you need to think about what you care about. Certainly, most of you won't care about Loose; my lab doesn't use Loose that often. But Perfect versus Strict might be a tough call for you, because those Strict hits could be the ones explaining your phenotype. Another thing that's unique to CARD, not done in ResFinder or at NCBI, is that we curate sequences in a model context. This model type is called a protein homolog model; you can think of it as a presence/absence model: do you have the gene or not? This is the protein homolog model for ant(6)-Ia, and basically the model has two pieces of information: the protein sequence, which is at the bottom, and something called a BLAST bit score cutoff. What RGI does is predict all the open reading frames in the genome you sequenced, and it starts BLASTing those open reading frames against the sequences in CARD. BLAST is a local alignment technique (I've got a screenshot of NCBI BLAST there), and essentially what BLAST tells you is: this gene in my patient's pathogen is alignable to a reference in CARD, and here is how good that alignment is. The bit score statistic essentially measures how much information is in the alignment; a bit is the smallest unit of information.
The better the alignment between your query and something in CARD, the more bits it will have; a good alignment might be 500 bits. What happens is that our curators, literally by hand, analyze hundreds of sequences and determine a bit score cutoff such that if you're above it, you're likely a functional variant of ant(6)-Ia, but if you're below it, you're very unlikely to be functional: you're a Loose hit. So the bit score cutoff separates Strict and Loose. I'm going to give a little bit of the results from Karen, your TA. We know that, generally speaking, when CARD RGI calls a Strict hit and says you're a functional variant, it tends to be right. Karen's PhD work is starting to really put numbers on that, and it's reliable. Sometimes, though, it misses things, so there could be a real gene sitting in the Loose hits, and that's where she's trying to make it better. But this is hand curated and inspected for over 5,000 genes. This homolog model (presence or absence, are you a functional variant) covers almost two thirds of CARD. The next most common model is what's called a protein variant model. This is about the drug targets, the housekeeping genes that antibiotics bind to, like gyrB. An antibiotic like a fluoroquinolone binds to GyrB, that protein is no longer functional, and the cell dies. All bacteria have gyrB, so RGI is always going to find a gyrB in your genome. But you get resistance when gyrB picks up specific mutations: if it picks up those mutations it's still functional, but the fluoroquinolone can no longer bind and do its job, so the protein is functional and now not sensitive to the antibiotic. The variant model does exactly the same thing as the homolog model, using a bit score cutoff to find the gene. But once it's found the gene and decided whether you're Loose or Strict, it will only report it if it also has one of the mutations reported in the literature to cause fluoroquinolone resistance.
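The two decision rules just described can be sketched in a few lines. The cutoff values and the gyrB-style mutation list below are made up for illustration; CARD's real cutoffs are hand-curated per gene:

```python
def classify_homolog(identity, bit_score, cutoff):
    """Protein homolog model: presence/absence with a curated bit-score cutoff."""
    if identity == 1.0:
        return "Perfect"   # identical to the curated reference: known known
    if bit_score >= cutoff:
        return "Strict"    # predicted functional variant: known unknown
    return "Loose"         # outside the model; validate before believing it

def classify_variant(identity, bit_score, cutoff, observed_mutations,
                     resistance_mutations):
    """Protein variant model: the gene must be found AND carry a curated
    resistance mutation to be reported."""
    call = classify_homolog(identity, bit_score, cutoff)
    if call == "Loose":
        return "Loose"
    if observed_mutations & resistance_mutations:
        return call
    return "Not reported"  # wild-type housekeeping gene: presumed susceptible

# Hypothetical gyrB-style example; cutoff and mutation set are illustrative.
GYRB_RESISTANCE = {"E466D", "D426N"}
print(classify_variant(0.99, 850, 600, {"E466D"}, GYRB_RESISTANCE))  # Strict
print(classify_variant(1.0, 900, 600, set(), GYRB_RESISTANCE))       # Not reported
```

The second call shows the key difference from the homolog model: a perfect match to wild-type gyrB is found in every genome but is never reported as resistance, because it carries none of the curated mutations.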
Some of these are single point mutations; some of them are combinatorial mutations. This is the second largest model type in CARD: all the housekeeping genes that confer resistance by picking up a mutation or two. And again, we curate all of these from the literature. Those of you who work in TB know that that is essentially all resistance in TB, so we have a lot of curation to do to keep up. Overall, the knowledge in CARD is curated under eight model types, and this is a very unique thing to CARD. I think our two biggest contributions are the ontology and these models, and I give examples of known resistance genes under each type. Only for the top four, in green, do we actually have code in the RGI software, so RGI can only predict genes under those four model types. And in fact three of them have asterisks: they're not done yet, there are some mutations they still don't handle well. So there are still gaps: there may be knowledge in CARD that RGI can't yet use to annotate your genome. For the bottom two we've made pilot-level code, and for the two in white we've not written a bit of code. So this is my biggest warning: there is no software in existence that can predict the complete resistome of any pathogen. Even if the knowledge is curated in a database somewhere, the engineers have not caught up to write the code. So there are always false negatives when you annotate a bacterial genome, because of gaps in knowledge or gaps in software. Okay, that was cultured isolates. What if you're doing culture-free metagenomics? In our case that means taking a rectal swab from a baby; others may be looking at wastewater. You make your sequencing library from this sample, you sequence it, and the bulk of the data is going to be the host (the baby, in my case) or the background DNA, you know, fish poop and all the rest in the water.
Another huge fraction is going to be the bacteria that belong there; this is where they live, we want them there. The AMR genes in your sequencing data are a tiny slice of the data, and this graphic overestimates what that slice is. They're a needle in a haystack. Therefore metagenomics takes a lot of sequencing even just to see them, so it's very expensive. The data tends to be very fragmentary: short, 250 base pair sequencing reads. The analysis is sensitive to the diversity of the reference data; I'm going to come back to this, it's greatly underappreciated out there how important the reference can be. But even if you do find an AMR gene in metagenomics data, you don't know which bacterium it came from, and in a clinical setting that's how you decide your treatment. So you know the AMR gene is there, but if you don't know which bacteria are carrying it, you don't know what to do; you're back to empirical medicine. So, Jared actually talked about read mapping a little bit yesterday in the viral work; this is the workhorse of metagenomics annotation in AMR. Essentially: which gene did that 250 base pair sequencing read come from? We use things like the Burrows-Wheeler transform for read mapping. What that does is a high-stringency mapping of the short sequencing read back to a genome. This is a picture from Galaxy, so they're annotating a human genome here with short sequencing reads. They use the Burrows-Wheeler transform to map them to specific locations in the genome, and at the bottom you can see that reads one, two, and three have been mapped bioinformatically to one common region of the genome. The key thing I want you to think about: unlike BLAST, which can be very generous, read mapping is highly stringent. The read and the reference have to be quite similar for the mapping to occur.
We're going to come back to why that's important, but this is a key thing, and the common software you'll see in the literature is BWA and Bowtie 2. If you're doing metagenomics, this is the one thing I truly want you to walk away with when you're done: don't trust metagenomics for AMR using BWA or Bowtie 2, and I'm going to show you why that is the case. These algorithms in AMR sequence space can actually be misleading, and unfortunately they are the bulk of the papers. So why is that? AMR has what's called the allele network problem. These are the TEM beta-lactamases; it's an old screenshot, there are well over 300 of them now. TEMs confer resistance to beta-lactams, and the convention is that even if two differ by only one amino acid they get a different name, because sometimes that one amino acid causes considerable changes in MIC, so you have to track them specifically. But they evolved from each other. So TEM-1 might only be two nucleotides different from TEM-7. This is a network of highly similar alleles that we have to keep track of, because different alleles can have different phenotypes. Read mapping algorithms, because of that high stringency, will have no trouble figuring out that the sequence they have is a TEM; they will get into that space. But when it gets down to TEM-1 versus TEM-7, where there may be only one or two nucleotide differences, to be honest, it almost guesses. It can tell some TEMs apart, but for many of them it just guesses where the read goes, because they're just too similar for the algorithm. So I'm going to prove it. This is some work that Amos did in our lab: we simulated metagenomic sequencing of a genome at 10-fold coverage, we spiked in seven AMR genes, and we used Bowtie 2.
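A toy example of why a stringent mapper has to guess on near-identical alleles: when a short read ends before the only differing base, both references match it perfectly, and the aligner has no information to choose between them. The sequences here are invented, not real TEM alleles.

```python
# Toy illustration of the allele network problem: a short read can match
# several near-identical alleles equally well, so the mapper must guess.
def hamming(a: str, b: str) -> int:
    """Count mismatches between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

alleles = {
    "TEM-like-1": "ATGAGTATTCAACATTTCCGT",
    "TEM-like-7": "ATGAGTATTCAACATTTCCGA",  # differs by the final nucleotide
}
read = "ATGAGTATTCAACATTTCC"  # the read ends before the differing base

scores = {name: hamming(read, seq[:len(read)]) for name, seq in alleles.items()}
print(scores)  # both alleles match with zero mismatches
```

Both candidate alleles score identically, which is exactly the ambiguity that drives the guessing (and the low MAPQ) described above.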
On the right, when we mapped all the reads, these are the results that RGI gave. It said: I found reads from all of these genes in your metagenomic sequencing. But we only simulated the seven with red arrows. So we simulated CTX-M-15, and a bunch of reads indeed were mapped to that gene, but other reads were mapped to CTX-M-3, CTX-M-157, CTX-M-114, CTX-M-101, etc. Because those CTX-Ms are one or two nucleotides different from CTX-M-15, it started guessing. The other way you can tell there's a problem is the y axis. This is the MAPQ, an assessment of mapping quality. It's not a particularly high number at 35, so the mapper was saying it was having trouble because of these highly similar alleles. And on the x axis, the most completely mapped genes had about 350 reads. We simulated more than 350 CTX-M reads, but they got spread out amongst all these related alleles. So if you look in the literature you will find papers with lists of genes like this, with different degrees of coverage, claiming that all of these genes exist in, say, a wastewater sample or a soil sample. It's probably not true, because of the allele network problem. Next, an algorithm called KMA. This was developed by the ResFinder team in Denmark. It is a read mapping and alignment method specifically designed for redundant databases with allele network problems. So, same data; in the RGI code we simply swapped out Bowtie 2 for KMA. And if you look on the right, we only found seven genes. We only simulated seven genes, and we only aligned reads to seven genes. Five of them were the exact gene that we simulated; two of them were an extremely close gene, because when you're down to one or two nucleotides you still can't get it absolutely perfect, but KMA at least tells us there are only seven AMR genes in there, maybe one cat allele reported instead of another, minor stuff like that.
So, like before, if I'd used Bowtie 2 I might have claimed that there was a huge diversity of AMR genes, particularly CTX-Ms, when really there are only seven. Look at the y axis: the MAPQ has gone through the roof. It was 35 before, so suddenly there's much higher confidence in the mapping. And if you look at the mapped reads, they're up to 550 at the high end, so the reads aren't being accidentally spread around a bunch of references; they are being aligned only to where they belong. So if you're doing metagenomics, if you walk away from this lab with anything, please switch to KMA; that's my answer. Now, the allele network problem has a second part to it. The reference data I just used in that analysis, the 5,000 genes in CARD (this is an old screenshot): the vast majority of them come from clinical isolates, because that's what gets published in PubMed, so the sequence diversity in the reference set is clinical. It doesn't reflect the diversity of genes, alleles, and mutations that you would find across clinical, agricultural, veterinary, and environmental settings. If you're doing the NICU baby stuff like we do, CARD is perfect. But if you're out in soil, you know, adjacent to a farm, or even just wild permafrost up north, you now have a bias: the reference data is highly clinical, but those alleles are probably not in your sample. Remember, BWA, Bowtie 2, and KMA are highly stringent read mappers; the allele diversity of the sequences you're pulling out of soil, for example, might be just different enough that the read mapper doesn't work. I didn't include a slide, but we've quantified that, and you really do have this problem. So what did we do? This is why we did this molecular epidemiology work. We analyzed hundreds of thousands of genomes to find all the variants, and so while there are 5,000 AMR genes in CARD, there are over 300,000 variant alleles underneath those genes.
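For reference, MAPQ is Phred-scaled, so the scores in these plots can be converted into an estimated probability that a read is mapped to the wrong place. A minimal sketch of that conversion:

```python
# MAPQ is a Phred-scaled estimate of mapping error:
#   MAPQ = -10 * log10(P(read is mapped to the wrong place))
def mapq_to_error_prob(mapq: float) -> float:
    """Convert a MAPQ score to the implied probability of a wrong mapping."""
    return 10 ** (-mapq / 10)

# MAPQ 35 (the Bowtie 2 run above) vs a much higher post-KMA score:
print(mapq_to_error_prob(35))  # ~3.2e-04
print(mapq_to_error_prob(60))  # ~1e-06
```

So a jump from MAPQ 35 to the much higher post-KMA values is a jump of orders of magnitude in mapping confidence, which is why the y axis change matters.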
And a lot of the alleles that we pull into our reference set when we do this are not from clinical settings. So you can change your reference set so you have more alignment diversity. The warning is: these are all perfect and strict hits from RGI. The perfects are fine; they match at the amino acid level, they're functional. But you are relying on the strict cut-offs, and for some of these 300,000 alleles you can't point to a paper, right, because they are predicted by RGI. They are a sequence diversity estimate, but no one has actually cloned that one allele out of the 300,000, so you do have to keep that in mind. The metagenomics workflow I would suggest: you start with your metagenomic sequencing reads. Depending on what your sample is from, you might just use CARD's canonical 5,000 genes, because you're working in a clinical setting. But if you think you're actually pulling much more interesting sequence diversity out of environmental or other sources, you might also use the CARD in silico variant set of 300,000. You're going to absolutely use the KMA aligner to overcome the allele network problem, and you're going to get your read counts against the reference sequences, and you will get something like this; this is from our lab today. This is generally what people use as their metagenomics results. We're going to go through all the different statistics that come with this, but essentially it comes down to a list of genes and how many reads have been mapped to them. What I want to point out: KMA, Bowtie 2, and BWA don't account for SNPs. So they only work against protein homolog models, those dedicated enzymes that are presence or absence. For the gyrases that mutate, there's no way to map the SNPs that you may find in the gyrases in your sample against CARD; there's no algorithm for that.
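As one example of the statistics that come with such a read-count table, here is a sketch of a common length-and-depth normalization, reads per kilobase per million mapped reads (RPKM). The gene names, lengths, and counts are invented, and this is not necessarily the exact statistic your tool reports; it just shows why raw read counts alone can't be compared across genes of different lengths.

```python
# Hypothetical normalization of AMR read counts: RPKM corrects for both
# gene length and total sequencing depth, so counts become comparable.
def rpkm(mapped_reads: int, gene_length_bp: int, total_mapped: int) -> float:
    """Reads per kilobase of gene per million total mapped reads."""
    return mapped_reads / (gene_length_bp / 1000) / (total_mapped / 1e6)

# Invented example table: gene -> (mapped reads, gene length in bp)
counts = {"OXA-1": (1200, 831), "CTX-M-15": (350, 876)}
total = 2_000_000  # total mapped reads in the sample

for gene, (reads, length) in counts.items():
    print(gene, round(rpkm(reads, length, total), 1))
```

A short gene with fewer raw reads can still have a higher normalized abundance than a longer gene, which is why these derived statistics accompany the raw counts.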
So in metagenomic sequencing space, you're using about two thirds of CARD, the protein homolog models, those dedicated genes, as opposed to the point mutations in housekeeping genes. Like I said before, there are no algorithms out there that will predict total resistance from metagenomic data, because metagenomic algorithms and workflows have generally ignored SNPs. One of the things that is coming along: say you found the gene, you had 1,200 reads against, you know, OXA-1, but you don't know who's carrying that OXA-1. In that variant data set of 300,000 alleles, a lot of the alleles are only found in specific pathogens, so you can look for signature subsequences, called k-mers, to say: well, I've got OXA-1, but this signature means it's the OXA-1 in E. coli, this is the OXA-1 in Klebsiella, this is the OXA-1 in Pseudomonas, for example. So CARD RGI actually uses these minor variants to predict which pathogen is carrying the AMR gene that you observe. As you'll see in the lab, this is really beta stuff; it has not been validated yet, and the code is awful at this point, it's really slow. We have a new master's student who's doing that validation, and I would expect we'll submit a paper by the end of summer with validation of how accurately we can predict which pathogen bears your gene. Then there's the upper part in yellow. There is a possibility, if you pull all the sequencing data out of your metagenomics, that you could bin the reads by their AMR gene family, sub-assemble, and rebuild the alleles completely. And if you rebuild alleles, then you go back to the normal RGI that uses protein variant models, protein homolog models, efflux models, etc. So there's an interest in not just counting the reads but reconstructing the full alleles, and then we can go back to using the algorithms that can do SNPs and point mutations; it just depends on whether you have enough sequencing data. I put this in yellow.
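The k-mer signature idea can be sketched like this. The signature k-mers and sequences are invented placeholders, not the actual CARD k-mer sets, and real classification uses much longer k-mers and many more of them.

```python
# Sketch of pathogen-of-origin prediction from signature k-mers:
# some k-mers only ever occur in one host's version of an AMR gene.
def kmers(seq: str, k: int = 5) -> set:
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Hypothetical signature k-mers unique to each host's allele:
signatures = {
    "E. coli": {"ACGTA", "GTTAC"},
    "Klebsiella": {"TTGCA", "CCGAT"},
}

def predict_host(read: str) -> str:
    """Pick the host whose signature k-mers the read shares most of."""
    observed = kmers(read)
    hits = {host: len(sig & observed) for host, sig in signatures.items()}
    return max(hits, key=hits.get)

print(predict_host("AAACGTAGGGTTACA"))  # E. coli
```

The same gene call plus a different set of matching signatures would flip the prediction, which is how one OXA-1 observation can be assigned to different pathogens.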
We don't have this publicly available; we're still experimenting with it, and there are some other papers and a few other groups out there with beta software, if this is something you're interested in. But this is really where metagenomics is going: beyond this age where you just count the reads aligned to things, we're going to start predicting the underlying pathogen, reconstructing alleles, and predicting total resistance. So I'm hopeful we'll get to total resistance. Right now, if you're using metagenomics data to predict fluoroquinolone resistance, forget it: it's mainly SNP driven, therefore you're not measuring it when you do your work. I also wanted to mention, and this came up a little bit in Jared's talk, the cost part. Generally, shotgun sequencing to find AMR genes is expensive; you have to sequence millions of reads just to find some reads with AMR. But bait capture, which came up yesterday in the viral work, is a real benefit, so we released CARD on this website as a bait capture platform. These are molecular fish hooks that, when you have a metagenomics library, grab only the library molecules that encode AMR genes and let everything else wash out of the tube. That means you only have to sequence a little, because most of your sample is now AMR. In this example (the paper is out), the columns on the right are shotgun sequencing, and in many of the rows you see no data; you don't even detect the AMR genes. The molecular fish hooks are designed based on all the genes in CARD, and you wash out all the non-AMR. With the same amount of sequencing, on the left under "enriched", you see the AMR genes at high abundance. We have a second paper out now where we're tracking AMR in the NICU when it comes to probiotic usage.
I will mention that one of the students on the course, Dirk, is actually in the middle of validating version two of the CARD bait capture, because since the first version some new genes have evolved that are not in the bait set. But now you can get away with much less sequencing; most of the data that comes out, with the money saved, is focused on AMR only. Again, though, this is in protein homolog space. This is metagenomics, and we didn't make fish hooks for the gyrases, because they would grab out both the antibiotic-sensitive and the antibiotic-resistant versions and waste our sequencing effort. Okay, the last little bit: analytics. Can I predict the behavior of the bacterium based on the resistome, the list of genes and mutations? We tried it in E. coli. We worked with 115 E. coli, we sequenced their genomes, and we tested 18 antibiotics in a Vitek setting, so we knew the MICs and we knew the genomes. We used RGI to predict the genes and mutations, and we simply said: given the ontology rules in CARD, if it says this gene causes resistance to this antibiotic (which is a qualitative statement in CARD), take it as true and apply these rules, this gene leads to resistance to these antibiotics. Do we predict phenotype accurately? The answer is no. The good results are in the dark colors, the blue and the teal; the bad results are in the orange and the yellow. For some drugs we're sitting at about 80% accuracy; some of them are ridiculously low. This is because presence or absence of a gene or mutation is never enough. Things that are on plasmids, like OXA-1: yeah, if you see them, they're going to be expressed at high rates and you'll get the phenotype. But for chromosomally encoded genes and other things, there are issues of expression, there are regulatory issues. And that's why you see this massive gap between the list of genes and the observed phenotype.
But if you repeat this experiment, where you take a training set where you know the genomes, annotate them with RGI, and train relative to the observed resistance, the MICs from a Vitek robot, and you train a machine learning method and then give it these data, the exact same 115 E. coli, suddenly you get extraordinarily good results. Tetracycline is almost the only one we're really worried about, and ampicillin a little bit. We can predict the observable phenotype, admittedly in a Vitek robot, not in a patient, but we can predict that phenotype using machine learning and analytics. So this is where we're going as a field. There's lots of literature, early days honestly, but within five years, I'm betting maybe sooner with new algorithms like generative AI, we'll have subsets where we feel we can predict phenotype from genome sequence with high reliability. And that might be really important from the public health and risk assessment side. What's cool about these AIs is that you can crack them open when they make predictions and ask: who's causing the resistance? In particular, for CTX-M-15, the machine learning suggested that this gene was causing resistance to four drugs. For three of them we knew that from the literature, it was already curated in CARD, but for the fourth one, no one had ever done that experiment; it's not in CARD, there was no knowledge. And when we cloned and expressed it, that's exactly what happened: that beta-lactamase caused resistance to that fourth drug, so it was a broader threat. In fact, for six beta-lactamases in E. coli, the AI predicted a broader range of antibiotics, and when we cloned and tested, it agreed every single time. And this is the future, because the trouble with the literature is there's no standardization of who tests which antibiotics when they publish a new AMR gene. I wish there was; that would make a curator's life easier. But AI can get us there. Is that my time warning? Thank you.
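As a toy illustration of the general approach (not the actual method or data from this study), here is a trivial rule-learner over invented presence/absence profiles; the real work trains proper machine learning models on hundreds of annotated genomes against measured MICs.

```python
# Toy sketch of learning phenotype from gene presence/absence. All data
# are invented. Each isolate is (set of detected AMR genes, phenotypes),
# with phenotypes as "R" (resistant) or "S" (sensitive) per drug.
from collections import defaultdict

train = [
    ({"CTX-M-15"}, {"ampicillin": "R", "ceftriaxone": "R"}),
    (set(),        {"ampicillin": "S", "ceftriaxone": "S"}),
    ({"TEM-1"},    {"ampicillin": "R", "ceftriaxone": "S"}),
]

# Learn: a gene implies resistance to a drug if it appears in resistant
# isolates and never in a sensitive one for that drug.
implies = defaultdict(set)
for genes, pheno in train:
    for drug, call in pheno.items():
        if call == "R":
            for g in genes:
                implies[g].add(drug)
for genes, pheno in train:
    for drug, call in pheno.items():
        if call == "S":
            for g in genes:
                implies[g].discard(drug)

def predict(genes: set) -> set:
    """Union of drugs implied by every detected gene."""
    drugs = set()
    for g in genes:
        drugs |= implies.get(g, set())
    return drugs

print(sorted(predict({"CTX-M-15"})))  # ['ampicillin', 'ceftriaxone']
print(sorted(predict({"TEM-1"})))     # ['ampicillin']
```

Even this toy shows the key difference from the rule-book approach: the gene-to-drug associations are learned from observed phenotypes rather than taken only from curated literature statements.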
I'm going to flag this antibiotic resistance platform from the Wright lab at McMaster. It's an efflux-deficient E. coli platform that you can clone any AMR gene into and therefore directly test the MIC conferred by that gene against antibiotics, without any efflux confusing it. So this is a standardized way to determine which antibiotics a specific gene confers resistance to. It's not complete, but it's getting close to 500 genes, and we actually hope to clone a lot of our mystery genes into it. If you got super excited by that, I want to pour a little cold water on it. We basically repeated the machine learning experiment with Pseudomonas aeruginosa, the same algorithms and all the rest, with 102 isolates and 18 antibiotics, and I'm not even going to show the graph. It's much less reliable in this pathogen. So even if we are going to get analytic prediction of phenotype from genomes, it's not going to be a panacea; it's not going to work for every type of infection, every type of, you know, organ or host. Pseudomonas is tough, right? That probably means it might not all be genetically driven; it might be epigenetically driven, or there are other factors we're not looking at. So you'll see a lot of talks saying the future has come, we'll predict phenotype reliably. You know, bacteria are tough to work with; there will be exceptions. I think with E. coli we'll get a long way; other pathogens and infections, not so much. All right, last thoughts; I want to leave time for questions, so I'm not going to go through all of these, you can look at them. The plasmid-borne AMR genes are generally the high threat, and in the lab we're going to show how we know which ones are on a plasmid and which are not. So the question is how to deploy this molecular knowledge: you've got to really think about those three vectors, what you care about, which results matter to you, and which results you're not going to look at.
All of this is computationally intensive, hence the need for engineers to make it more efficient. Our k-mer algorithms, for example, can't even run on the course computers, but with good engineering they will in the future. This is always a challenge: you can usually find the AMR genes encoded on plasmids, but resolving the complete plasmid can be very hard from sequencing data. Nanopore, which we've talked a lot about these past days, is going to change the game, but you do need GPU processors, which are expensive. The plasmid analysis software, like MOB-suite, is really coming along. And then I'm just going to flag bait capture: your TAs, including Jalees, one working in viral and one in bacterial bait capture design, can answer some of those questions. Metagenomics is a big part of the future, I do think, because even though sequencing costs are dropping, I think bait capture is still going to be a part of it. But look at your algorithms: as I said, no one can predict total resistance, and tools like Bowtie 2 can actually give you wrong results, so you've got to take the time to know your tools.