All right, so we're going to talk about AMR analysis this morning, and we'll do a lab on it. There's a large diversity of software and tools; you've read the papers, hopefully. We're going to pick two to really illustrate first principles. The lecture I'm going to give is about the core things you need to worry about and think about in AMR analysis when you use any particular tool.

So, our objectives. We're going to review the available antimicrobial resistance resources. We're going to take a look at CARD, which is what my group creates. But really, when you talk about AMR analysis, you've got to think about mechanism, because the type of mechanism influences the analysis. We're going to talk about that. We're going to look at both genomic and metagenomic data. When it comes to AMR, genomic is doing much better than metagenomic. We're in the early days of predicting AMR from metagenomic data.

So, a little background, since not everyone is an AMR person. Bacteria are evolving resistance faster than we can bring new drugs to market. We have completely or virtually untreatable strains, such as gonorrhea: in the last five years, out of one high school in the UK, a strain has gone global. We have untreatable TB as well. This is the UK Review on Antimicrobial Resistance. Right now we have about 700,000 global deaths a year from AMR. The prediction is that by 2050 we could be up to 10 million deaths a year. I added a few citations. This is an economic assessment. Not everyone agrees with these numbers and the effects, so I put in some counterexamples. But these are large numbers. We're looking at a large amount of death.

Not only death, but you lose a lot of modern surgery. If you can't control infection, you can't do hip replacements. You can't use chemotherapy. You can't do routine transplant surgery. So general medicine is under threat from antimicrobial resistance. The key thing is we're not dealing with brand new bugs, right?
We're dealing, for the most part, with bugs that we know well. Things like TB: when you reach an untreatable strain, you're thinking about going back to the iron lung. Enterobacter. Staph: every ten years that go by, staph knocks out another class of drugs that won't work on it. Pseudomonas aeruginosa, an opportunistic pathogen: CF patients, who get fluid in their lungs, pick up Pseudomonas from multiple sources. The clap, Neisseria gonorrhoeae: the issue of safe sex becomes increasingly important again.

So it's not only evolution at work because of what we're doing to create resistance; while we're doing that, we're not finding new drugs. The pipeline of antibiotic production has been drying up. Other than one or two, we haven't found any major drug classes since the 80s. The low-hanging fruit, the easy-to-find compounds, have been found. Now it gets expensive and it's hard. As a result, there are really only two companies putting significant money into antimicrobial R&D. So this is just a rapid drop in investment. There are issues over patent time. There are issues over profit models. So you have fewer and fewer players bringing new drugs to market, and those drugs are really hard to find and get through clinical trials.

So how does antibiotic resistance happen? I'm preaching to the choir, so to speak. It really comes down to misuse or abuse of antibiotics. Misuse: patients not finishing their courses, not using the right concentrations or times. Abuse: using antibiotics as growth stimulants in agricultural settings, using the wrong antibiotics to do so, with ones that are supposed to be held in reserve being used in agricultural settings. All of these create a selective environment that causes resistance.

So resistance is a genetic thing. It's encoded in DNA. The drugs themselves target different aspects of the cell. They could attack the cell wall, right?
The viability of the cell wall. They could attack the information machinery, so DNA and RNA synthesis. They can attack aspects of the biochemistry, folate synthesis for example. Protein synthesis. There are all these things that drugs can attack, but for each one of those targets there's a resistance mechanism against it. You can make the cell impermeable, so the drug can't get in. You can alter the target, by mutating it or attaching something to it. You can make an enzyme that just chews up the drug. And most commonly, you can use efflux to spit out the drug as fast as it comes in. All of these are genetic determinants.

The other key thing is that threats emerge quickly. My lab is really a threat lab. That's how we think about things. So this is the NDM-1 case. Who knows a little bit about NDM-1? Okay, so NDM-1 is a gene, a plasmid-borne beta-lactamase that knocks out carbapenems, one of our two classes of last-resort drugs. This literally came down to one Swedish individual in India, 2006, who got ill with a gut infection. That bacterium had picked up this gene Lord knows from where, somewhere in the environment. Maybe it was floating around in clinical settings already. That one patient ended up being flown to the UK for treatment. It got into the UK hospital. So one person, one Swede. If a nation on this map has color on it, that's where NDM-1 has been detected and is being monitored. And Canada has an NDM-1 problem as well. So: one person, and rapid spread of a gene that takes out one of the two major drug classes of last resort. These things can happen very quickly.

So you can ask which resistance genes are in Canada, which genes are moving around, and which ones pose a threat. That's a different question from asking what kind of resistance we see in Canada. That's a phenotypic question, and we have good numbers on that. What's the rate of beta-lactams not working? We have that level of data.
We're often missing the data on what's causing it. There could be 30 or 40 different genes causing aminoglycoside resistance; which ones are at play? Now, with sequencing, we can start to fill in that gap in the data.

So let's back up and think about it from the threat perspective. If you're old enough: this is Donald Rumsfeld. He was the Secretary of Defense for George W. Bush. Here's his great quote: there are known knowns, things that we know that we know. There are known unknowns, the things we know we don't know. But there are also the unknown unknowns, the things we don't know that we don't know. He took a lot of criticism for this, but he was actually quoting Black Swan theory. This is a theory of how you look at threat. Black Swan looks particularly at world-ending types of threat. You could argue that antibiotic resistance is not on that scale, but it's certainly frightening enough. As software developers and bioinformaticians, this is the perspective I want you to take when you're looking at AMR data.

So, the known knowns. These are the genes and pathogens we're tracking. In public health labs, we have PCR assays. We have culture assays for specific genes and specific pathogens. It's a pretty short list. There are about 3,000 to 4,000 AMR determinants out there. If you go to the public health lab in Ontario, we've got about 40 PCR tests for the most common threats.

What about the known unknowns? Well, these are the genes that are in the literature, that we know about, but we're not building assays for them. We're not testing for them in the laboratory. Or they're variants of known genes. We know the aminoglycoside acetyltransferases we've characterized aren't the only ones out there in nature. We know there are sequence variants, and we don't know how functional they are. Do they have a higher or lower MIC? But we know this is a problem. There's lots that we don't track. Travel makes this an issue.
Maybe we're not tracking one because we don't expect it to show up here.

Then there are the unknown unknowns. These are the emergent threats. Gary's going to give one of the last sections of the workshop, about emergent pathogens. Emergent threats are the things that are coming out of the water, out of soil, basically out of the environment, maybe working their way through agriculture before they get to the clinic, but somehow they're getting to us. We don't know a heck of a lot about them. Two of them: NDM-1 in 2006, which took out carbapenems. The other drug class of last resort is the polymyxins, colistin for example. In 2015, another gene was reported from China, called MCR-1. We now know from retrospective sequencing of freezer collections that China wasn't the source. It was in Canada in 2008, I think, is the latest estimate. Where's Gary? He's not here yet to confirm that. But if you've got that one, you've lost colistin and the polymyxins. And roughly three months ago in the U.S. there was a young woman who had a plasmid with both. It was untreatable and she died.

So these are emergent threats. Are they common in day-to-day sequencing and epidemiology? No. But we don't want lag time either. We don't want to find out six months after the fact why the colistin stopped working. We'd rather have algorithms to detect it right off the bat, when that pathogen gets sampled in a clinical setting.

So, as we've talked about already, with DNA sequencing the genes can no longer hide. We now have a revolution we can take advantage of: sequencing that's faster, cheaper, smaller. We can sample at high throughput with high quality and start to look at these genes. From a bioinformatic perspective, this is really a four-step process. We sequence, so we get an isolate or maybe a gut sample, a metagenomic sample, and that has to be compared to reference sequences. You need a point of reference to say, ah, you're NDM-1, you're MCR-1, you're an efflux protein.
From that, we can predict the resistome, the catalog of resistance genes and mutations in our sample. Those first three steps work pretty well. The last one's a lot harder: when I've got that sequence data, can I predict phenotype or antibiogram? There's a subtle difference there. Phenotype is really what a cell biologist thinks about. Antibiogram is what a clinician thinks about: what drug should I use next? What drug should I avoid? That last part is a big jump. We're going to talk a little bit about that.

So the key part is you need this comparison to reference sequences. The third part of AMR analysis is about biocuration. There are many databases. There are about three or four of us that are quite large. We collaborate, we compete. It's that type of ecosystem.

So this is CARD, the Comprehensive Antibiotic Resistance Database. We're going to use it a little bit. This is my group. I'm embedded with drug discovery. I'm a biologist, so I serve that community. But we work increasingly with IRIDA and from a public health perspective. You can browse; we're going to do a little bit of that in the lab. We have downloads, with the data in many formats, so you can run your own analyses and pipelines on it and you don't have to use our software, for example. And we have analytical tools to analyze sequence data.

CARD is an actively curated database. Every 30 days we put out an update. They're rarely trivial. Summer gets a little slower, since the students take off a lot, but we're going to put out a lot of software this summer. This is curation of gold-standard reference molecular sequence data. We need really good data for comparison when we sequence a sample: for the antibiotics, their targets, their mechanisms, the genes, and the mutations conferring resistance. So it's a lot of information. We organize it all using the Antibiotic Resistance Ontology. Fiona gave a little hint on ontologies last night. This is well over 3,000 terms.
We have a lot of biology that we need to encode and put in a framework that both a human and a computer can use. We're going to talk a little bit about ontology. We also write software for analysis of DNA sequencing data. We have a few things that maybe we do a little differently from the others, but there's a lot of overlap with other tools as well. And then we're starting to work on internet-of-DNA approaches: how are you going to create a sharing environment? That's really why we joined the IRIDA consortium, when it comes to how you put this on top of a data sharing environment.

So let's just zoom in. We have an enzyme, a nucleotidyltransferase. This is the ontology term here. You have an accession and synonyms. If you read the literature, you'll find three or four names get used for the same gene, so we spend a lot of time tracking that. If you're thinking about this gene as aadE, you'll still find the data. We've got to get that in there. There's a written definition. It's classified as an ANT(6). You have the publications, and you get the sequence and a little bit of a detection model, which we'll get back to. On the top left is a whole classification scheme, which is the ontology: what its mechanism is, what drugs it interacts with, its evolutionary history. We're going to go into a little depth on that. But essentially this is a single node, a point on an ontology, with a wealth of data attached to it.

So this is CARD, continued. It has high-quality reference data on the molecular basis of AMR. It's expert, hand-curated. These are not algorithms generating this data. We read the literature. We go through it. We look at the quality of evidence for the MIC, the quality of the sequence call. We have rules on what gets in there as being gold standard. Many of the databases are targeted. There are some excellent TB databases. There are some excellent databases for plasmid-borne AMR. We go for all of it: plasmid-borne, genomic mutation.
We are pathogen agnostic. We try to be comprehensive; it's in the title. So that's the goal. We work on advanced analytics: because different mechanisms need different algorithms, we try to bring them all together in one package so you can look at the total resistome instead of a subset.

We really focus on discovery. Doing surveillance for known threats, things that we know the sequence of, is not hard. We can teach that really quickly. But you want algorithms for when you say, well, in the lab I can't explain why the aminoglycosides didn't work for that patient or that strain. We need algorithms to try to predict that, to do discovery for you.

And growth. We are constantly curating. In particular, the low-hanging-fruit mechanisms, so dedicated enzymes or mutations of certain targets, are well curated by a lot of teams. But we also cover things like ribosomal mutations, or operons where you need a whole suite of genes to create the phenotype. So we're really trying to curate everything. Whether we write an algorithm or not is a second question, but we curate the data.

In terms of high-quality reference data, is this to say that every gene in the CARD database has been phenotypically verified?

Yep, so you can go down to the paper and look at the experiment that came with it.

Is that at a clinical resistance level?

Yeah, at the MIC level. When it comes to mutations in particular, at this point the only public data is data that has clinical evidence with laboratory confirmation. There's a second level of mutation data where it's been seen in the lab but hasn't been seen in the clinic. We've been doing that with bioMérieux, the multinational; that data goes public in December according to the contract. That's actually a long list. Interpretation is the challenge there. Just because it happened in the lab, and it's never been seen in the clinic: are you really the first one? And the amount of MIC. We're going to get to that.
So far we're talking about qualitative knowledge. Jumping from that to quantitative, to phenotype, is a big leap. We're going to talk a bit about that.

Let's talk about ontologies. ARO, the Antibiotic Resistance Ontology, is one. There's the FoodOn ontology. There's GenEpiO, the overall effort. So we're just one player of many. My lab is also building an ontology for annotating mobile elements. So ARO is a controlled vocabulary, a codification of drugs, targets, resistance genes and mechanisms. You have these terms and you connect them by relationships. This gene confers resistance to this drug: that's a relationship. This allows computation over an AMR knowledge network. Maybe I want to compute from the molecules because I'm a chemist, so I pick the molecules as my entry point, but I can follow that network down to the genes and mutations. Maybe I'm a mechanism person and I only care about efflux. I can start there on the network and connect to many data points. So it's agnostic about your point of view. I can take a sequence and enter the data from that point of view.

So it allows not only a computational network, but you're starting to build standards, so we can start to share data. The goal of ARO, and it's not there yet, is to have all the databases and all of public health start to use these terms. And this is IRIDA's goal: to work with ARO and take us up to the next level, a professional-grade ontology, so we can start to share data. When I tag data with a certain ontology term, no matter who generated it, it means the same thing.

We really developed it scrum style. We didn't have a lot of funding. It was the Wild West, but now we're moving a lot of resources to it, particularly working on the ESKAPE pathogens; since that's about 70% of clinical cases, that's our target. We are trying to be good computer scientists and follow the rules of ontology.
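The idea of terms connected by relationships can be sketched as a small graph. This is only an illustration: the term names and relationship labels below are hypothetical placeholders, not actual ARO identifiers, and the real ontology is far larger and richer.

```python
from collections import defaultdict, deque

# Hypothetical (subject, relationship, object) triples. These names are
# placeholders for illustration, not real ARO terms.
TRIPLES = [
    ("gene_X", "confers_resistance_to", "drug_A"),
    ("gene_X", "is_a", "efflux pump"),
    ("gene_Y", "confers_resistance_to", "drug_A"),
    ("gene_Y", "is_a", "target alteration"),
    ("drug_A", "is_a", "drug class 1"),
    ("efflux pump", "is_a", "resistance mechanism"),
    ("target alteration", "is_a", "resistance mechanism"),
]


def build_graph(triples):
    """Index triples in both directions so any term can be an entry point."""
    graph = defaultdict(set)
    for subj, _rel, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def connected_terms(graph, start):
    """Breadth-first walk: every term reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        term = queue.popleft()
        for neighbour in graph[term]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


def genes_for_drug(triples, drug):
    """Enter at a molecule and follow the relationship back to the genes."""
    return sorted(s for s, rel, o in triples
                  if rel == "confers_resistance_to" and o == drug)


graph = build_graph(TRIPLES)
print(genes_for_drug(TRIPLES, "drug_A"))            # chemist's entry point
print(sorted(connected_terms(graph, "efflux pump")))  # mechanism entry point
```

The same triples answer a chemist's question and a mechanism person's question, which is the sense in which the network is agnostic about the point of view.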
I'm an engineer at heart, so I don't care about rules half the time, but now we're trying to behave nicely. And we're starting to build computer-assisted curation. We'll talk a little bit about that.

So what does an ontology look like? This is a graph that's a little out of date, which I noticed because my crew were curating resistant gyrase B. All bacteria have a gyrase. The wild type is sensitive to aminocoumarins and sensitive to fluoroquinolones, but it picks up a couple of mutations, the drug bounces off, and now you're resistant to it. So this relationship says that this resistant form has evolved from an antibiotic-sensitive gyrase B, which happens to be a topoisomerase, which is part of nucleic acid synthesis targeted by antibiotics, and finally up to antibiotic target. So this is the target stream, and how we evolved not to be a target.

On this side, we have the chemistry side. Once we have that mutation, we're resistant to novobiocin, clorobiocin, coumermycin. All of those are aminocoumarins, which leads up to antibiotic molecule. So you can connect the data from the chemistry part. From the molecular biologist's side, it is both an aminocoumarin-resistant DNA topoisomerase subunit and a DNA topoisomerase subunit, gyrase B, so it's an aminocoumarin resistance gene, a determinant of resistance, and then you can connect the mechanism. It is a mutation conferring antibiotic resistance; in other words, it's antibiotic target alteration. So I can enter my knowledge network from any point, either browsing the website or running an algorithm, and connect down to that gyrase. That's what an ontology does. We think about classification; we think about how to codify knowledge about things.

So fine: we have DNA sequencing, and we've built an informatic framework of reference data that's organized ontologically. How do we get to the resistome? This really comes down to detection models and parameters. Now, the vast majority of tools out there are naive BLAST or naive Burrows-Wheeler transform.
How similar are you to my reference? With no thought of how dissimilar is still functional. If I'm 78 percent similar, is that still a functional protein? Mostly, algorithms are looking for perfect matches. A lot of the algorithms don't even have models. CARD really thinks about models. So we say: well, you didn't find a perfect match to a known resistance gene, but you're within our model of being a functional homolog. So maybe you actually have an MIC. I can't tell you exactly what it's going to be.

So how does that work? A large amount of the data is what we call the protein homolog model. This is evolutionary biology. So this is CTX-M-13. It's a beta-lactamase often found transitioning from agriculture into the clinic. You can see it in pigs and humans. It is a dedicated resistance gene. It's an enzyme. It chews up beta-lactams. Essentially, the model is a sequence and a degree-of-similarity cutoff, a BLAST cutoff. What you're going to see over this summer is that we're going to get rid of expectation values and switch to bit scores, because they provide much higher resolution for telling different proteins apart.

The goal here is this: if I found a protein that completely matched this, that's a no-brainer. I have a CTX-M-13. It's functional; it causes phenotype. But what if it's four amino acids away? That's probably still a functional beta-lactamase, and that's what the cutoff is about. In the case of beta-lactamases, one amino acid can change phenotype, so you've got a fairly high cutoff to say you're within the family of CTX-M beta-lactamases. You're not some other hydrolase, because you have that degree of similarity. So we spend a lot of time curating these models. Every family, every enzyme, every protein has a different cutoff, because their functional biology is different. The evolution of beta-lactamases is not the same as the evolution of aminoglycoside acetyltransferases.
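The protein homolog model described here, a reference sequence plus a family-specific bit-score cutoff, can be sketched as follows. This is a minimal sketch under stated assumptions: the class, the function, the sequences and the cutoff values are all hypothetical placeholders, not curated CARD values or the actual implementation, and the bit score is assumed to come from an alignment run elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HomologModel:
    """A reference sequence plus a hand-picked, family-specific cutoff."""
    name: str
    reference: str          # reference protein sequence
    bitscore_cutoff: float  # placeholder value, not a curated cutoff


def classify_hit(model, query, bitscore):
    """Label a query protein relative to one detection model.

    `bitscore` is assumed to be the alignment bit score of `query`
    against `model.reference`, computed by an aligner outside this sketch.
    """
    if query == model.reference:
        return "perfect match"
    if bitscore >= model.bitscore_cutoff:
        return "likely functional homolog"
    return "below cutoff"


# Two hypothetical families with different cutoffs, because their
# functional biology (and so the tolerated divergence) differs.
family_a = HomologModel("family_A_enzyme", "MKTAYIAKQR", bitscore_cutoff=500.0)
family_b = HomologModel("family_B_enzyme", "MSLLNDWQRA", bitscore_cutoff=250.0)

print(classify_hit(family_a, "MKTAYIAKQR", 600.0))  # identical to the reference
print(classify_hit(family_a, "MKTAYIVKQR", 520.0))  # a variant above the cutoff
print(classify_hit(family_b, "MSLANDWQRA", 120.0))  # too dissimilar for this family
```

The middle category is the one that matters for surveillance: not a known sequence, but inside the curated model, so possibly conferring an MIC that still has to be confirmed at the bench.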
So we have to look at that and ask: what's the degree of similarity at which this is still likely a functional AMR protein, and not something functional that works with some other small molecule?

So this is all determined in silico?

This is a hand-curated cutoff. We go through every one, looking at similarity profiles and looking at the evolution of the family, and pick a cutoff.

And is there any reason, forgive me because I'm a human researcher, but is there any reason to then go back and actually test, you know, mutate the amino acids and test the activity of the protein?

Yeah, we're going to come to this. When you find a homolog, something where there's no literature for that exact sequence and it's 72% similar, you have a judgment call to make. You don't have any proof. If you ultimately really want to know whether that's conferring the resistance, you need to clone it, express it and prove it. If you're just doing surveillance of likely threats, you're not going to go to the bench for it.

Also, how much do you consider not only the alignment, but the position of a mutation, in terms of being in the active site?

No. So now you're talking about where we want to go. Right now, protein homology is really about the overall question: how similar are you? And the bar is pretty high. When you start to get dissimilar, then you've got to get down into domain-level analysis, docking sites, et cetera. That's where we have to go. But the strategy we have is that we're not going to go there until we have depth in the sequencing database. That's a machine learning question. By building this, we encourage labs to sequence more, gather more data, more phenotypic data. When we hit depth of data, then we can apply machine learning and start to make it an automated process. So our Genome Canada grant with Rob, that's where we want to go. We're talking about building hidden Markov models in negative space. That's it.
We've got to build this first, encourage the sequencing, get depth of data, and then we can do that.

A different type of resistance. So this is TB. TB is an exception: almost all resistance in TB is by mutation. There are very few plasmids involved in TB. This is embB, an ethambutol target; the wild type is entirely sensitive to the drug. This is the catalog of mutations that have clear evidence of elevated MIC. You pick up one or two of these mutations and you become resistant to ethambutol. It's a large catalog. This one is interesting in that each mutation here stands on its own. You only need one. For many proteins, it's co-mutation: I need two or three mutations to call it. That's an even more complex model from the analytical point of view. And the evidence gets weaker in the literature on those ones, so it gets really hard to curate. But you need a different algorithm. You not only have to find this protein, you need to see whether it's the sensitive or the resistant form. So that's a different type of model.

Curating mutations: only two databases do it, us and ARG-ANNOT, and ARG-ANNOT is slowly throwing in the towel, because it's a lot of work. I'll tell you right up front, we are miles behind on TB. Because TB has so many mutations, it's really hard for us to keep up. We just had a big international meeting. Everyone agreed we need more money just to curate all of these. Outside of TB, we keep up pretty strongly with the mutation literature.

Then you move to complex models. Look at glycopeptides, so resistance to vancomycin. This is not conferred by a single gene. What you do is pick up a plasmid-borne cluster. This is the vanA cluster: one, two, three, four, five, six, seven genes in an operon. All must be co-expressed. There's even experimental evidence that they have to be in that exact order in the genome. We've yet to really understand how that works in the cell, because the proteins don't care.
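The mutation-based model from the TB discussion (find the protein, then decide whether it is the sensitive or the resistant form) can be sketched like this. It is a simplified illustration: the sequence and the mutation names are made-up placeholders, not curated embB mutations, and it assumes the query protein has the same length and numbering as the reference (no insertions or deletions).

```python
import re

# Substitution notation: wild-type residue, 1-based position, new residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def has_mutation(protein, mutation):
    """True if `protein` carries the substituted residue at that position."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation}")
    position, new_residue = int(match.group(2)), match.group(3)
    return position <= len(protein) and protein[position - 1] == new_residue


def resistance_calls(protein, single_mutations, co_mutation_sets):
    """List the curated mutations (or mutation sets) found in `protein`.

    single_mutations: each one is sufficient on its own.
    co_mutation_sets: every member of a set must be present together.
    """
    calls = [m for m in single_mutations if has_mutation(protein, m)]
    for group in co_mutation_sets:
        if all(has_mutation(protein, m) for m in group):
            calls.append("+".join(group))
    return calls


# Hypothetical wild type and catalog, for illustration only.
WILD_TYPE = "MKTAYIAK"
SINGLES = ["A4V"]
CO_SETS = [("K2R", "I6L")]

print(resistance_calls(WILD_TYPE, SINGLES, CO_SETS))   # sensitive form
print(resistance_calls("MKTVYIAK", SINGLES, CO_SETS))  # one stand-alone mutation
print(resistance_calls("MRTAYIAK", SINGLES, CO_SETS))  # half of a co-mutation pair
print(resistance_calls("MRTAYLAK", SINGLES, CO_SETS))  # the full pair
```

An empty list means the protein was found but looks like the sensitive form, which is a different answer from not finding the protein at all.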
That creates an entire biosynthetic pathway to make a modification to the cell wall. It changes permeability. It changes the target. You've got to have all of this in the right order, all the pieces. There's a regulatory component, regulators up front. The homology approach has to be different. Just doing what we usually do and saying BLAST found this, this, and this: that's not good enough.

So, France. We work with the French government, because generally they do PCR for these three genes. If a patient lights up a PCR product for those three genes, they isolate the patient, at roughly $10,000 a day. Then they go to mass spec and realize it was never expressed in the first place. They scared the daylights out of the patient and wasted a lot of money. What you want is to do genome sequencing, look at this, and predict phenotype, so I never isolate that patient. That's a really hard problem. One of our students looked at this as a third-year student and decided, why not, I'll pick a tiny project, and we're going to see a demo in the lab of his first product: how you make a confident prediction about glycopeptide resistance.

Efflux is the same thing. Most bacteria have the components of efflux proteins. It's regulation that is the issue. Is it responsive to the drug? Will it up-regulate when the drug appears? That has to do with regulatory frameworks and mutations. That's really complex to model. At this level of work in efflux, nobody is modeling the phenotype. There are lots of algorithms to find the bits, but none give a total picture. That's what we're starting to build into CARD, to take it a step further with what we call metamodels, and looking at phenotype.

So where are we at? This is our database that we're releasing today, basically. Just over 3,600 ontology terms and 2,300 reference sequences. There are 925 SNPs. These are clinically observed and experimentally verified SNPs. There's another 2,000 going to come on in December that are only observed in a laboratory setting or an epidemiological setting.
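The cluster problem from the glycopeptide example, where reporting individual hits is not enough, can be sketched as a presence-and-order check. This is an illustration only: the seven gene names and their order are an assumption added here (the talk only says seven genes in a fixed order), the function is hypothetical, and it ignores strand and sequence similarity. It also says nothing about expression, which is the hard part described above.

```python
# Assumed gene order for illustration; not taken from the talk.
VAN_CLUSTER = ["vanR", "vanS", "vanH", "vanA", "vanX", "vanY", "vanZ"]


def assess_cluster(contig_genes, required=VAN_CLUSTER):
    """Check whether all cluster genes sit together, in order, on one contig.

    contig_genes: list of gene names in genomic order along a single contig.
    """
    missing = [gene for gene in required if gene not in contig_genes]
    n = len(required)
    in_order = any(contig_genes[i:i + n] == required
                   for i in range(len(contig_genes) - n + 1))
    if in_order:
        status = "complete, contiguous, in order"
    elif not missing:
        status = "all genes present, order or contiguity differs"
    elif len(missing) < n:
        status = "partial cluster"
    else:
        status = "absent"
    return {"status": status, "missing": missing}


# A contig carrying the whole cluster versus a three-gene PCR-style signal.
print(assess_cluster(["repA"] + VAN_CLUSTER + ["tnpA"]))
print(assess_cluster(["vanH", "vanA", "vanX"]))
```

A three-gene positive and a complete, ordered cluster get different labels here, which is the distinction the PCR screen in the French example cannot make.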
We've curated over 2,000 publications. There are just over 2,300 AMR detection models. So that's a rough estimate of how many determinants are out there: 2,300. We're a little behind on TB. We just published the latest CARD paper, with an undergraduate as first author, who redesigned it as a third-year student.

It gets used for a number of things. People use it for genome analysis. Some of the other groups just grab the data and build models. We'll talk about Resfams; they use CARD to build Resfams. It's quite excellent work. Fiona's group, if you look at IslandViewer: they use CARD to enrich IslandViewer for AMR. So you see other databases using it to annotate their own data. And some use our software to predict the resistome. Those usage numbers are probably out of date.

From a science point of view, how do you do this effectively? You need biochemists and chemists, people who know the biology, and epidemiologists. You need biocurators, which is a specialist niche: someone who's got knowledge in the biological domain, the data science domain and the comp sci domain, who can think about how to build a model that both a computer and a human can work with, and who can read a lot of papers. You've got to like scholarship if you're going to do that job. And you need data scientists and software engineers. So that's the spectrum of people: they can build a model by reading a paper or a suite of papers, but someone also has to write the code to use that model.

Okay, so let's do a couple of case studies. This is Gerry Wright, my colleague. Gerry is a Royal Fellow; does anyone know Gerry Wright? A few of you know Gerry. Gerry is probably the dominant voice when it comes to antibiotic resistance. He finds resistance killers. He finds mechanisms, discovering new drugs and chemistry. He's one of the biggest players. So five years ago, he was traveling in France and had dinner. By the time he got to Italy, he was very ill. So, a GI infection.
In the hospital in Italy: a multidrug-resistant infection. Edge of death. It turned out to be a Salmonella infection. So, being a good scientist, he took the sample and brought it home, and we sequenced it. This is the Salmonella that infected Gerry. We ran it through the Illumina and got the genome. And when you run it through CARD and the CARD software, you come up with this, which is one visualization, the wheel. Out of the many hits, every tick here is a resistance determinant. To give you context, a drug-sensitive Salmonella might have two or three. So this is a multidrug-resistant bug. Only one of them was a complete match to a known resistance gene, and it was actually a fairly uninteresting one on top of that. The rest were homologous to resistance genes. Our model said: hey, you're not a known sequence, but we think you're similar enough that you're causing phenotype. And I can see aminoglycoside, I can see efflux, et cetera.

The informatic challenge is: okay, this is scary. Gerry's in trouble. He's got a multidrug-resistant bug. But you look at that; does it mean anything to you? It's just a bunch of gene names. Maybe you know a couple of those gene names. So what you want to do informatically is switch it to context. You recode it using the ontology and you create this inner wheel. Now I can say that the perfect hit, that one gene, is an efflux protein causing chloramphenicol and beta-lactam resistance. And it actually has a regulatory component. And then there are all those not-quite-perfect matches. If I go through and just look at the drug ones: chloramphenicol gone, beta-lactams gone, polymyxins gone, aminocoumarins gone, aminoglycosides gone, fosfomycin gone, fluoroquinolones gone. It's a superbug.

So interpretation is the challenge. No matter what tools or algorithms you're going to use, you can find the genes. You can find the mutations. But you need some way to take it down to phenotype, or at least a prediction of likely phenotype.
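The recoding step, going from a list of gene names to drug classes, can be sketched as a simple regrouping of annotated hits. The hit records below are hypothetical placeholders, not the actual results for this isolate, and the field names are assumptions made for the example rather than any tool's output format.

```python
from collections import defaultdict

# Hypothetical annotated hits: gene names, match types and drug classes
# are placeholders for illustration only.
HITS = [
    {"gene": "gene_1", "match": "perfect", "mechanism": "efflux",
     "drug_classes": ["chloramphenicol", "beta-lactam"]},
    {"gene": "gene_2", "match": "homolog", "mechanism": "inactivation",
     "drug_classes": ["aminoglycoside"]},
    {"gene": "gene_3", "match": "homolog", "mechanism": "efflux",
     "drug_classes": ["fluoroquinolone", "beta-lactam"]},
]


def summarize_by_drug_class(hits):
    """Regroup gene-level hits into a drug-class view (the 'inner wheel')."""
    summary = defaultdict(list)
    for hit in hits:
        for drug_class in hit["drug_classes"]:
            summary[drug_class].append((hit["gene"], hit["match"]))
    return dict(summary)


for drug_class, evidence in sorted(summarize_by_drug_class(HITS).items()):
    # Presence of a determinant is not proof of expression; this is a
    # list of drug classes at risk, not an antibiogram.
    print(f"{drug_class}: {evidence}")
```

Keeping the match type next to each gene preserves the difference between a known sequence and a predicted homolog when the result is read by drug class.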
Just because I've got a hit here doesn't mean it was expressed in that genome. That's always the issue. But you can break it down into drug categories, and seeing a wheel like this is not what I want to see. I'd be terrified to see that coming.

So let's talk about that. This is surveillance. Let's step back and talk about what kind of data you're going to see and what kind of algorithms. When we talk about this, there are really three vectors that we're thinking about. The first vector: are we just doing perfect match screening? I only want to do surveillance for known threats that public health has characterized and that have a chance of walking into my hospital. I just want to know what's known in my community. Or am I willing to look at novel functional variants? I say, okay, I admit I don't know enough about aminoglycoside acetyltransferases, so I want to keep my mind open that a new one might walk into the clinic. Or am I really looking for novel emergent threats? All three have different implications for the algorithm.

The second one is: what type of resistance? Are we going to look at dedicated resistance genes, like the plasmid-borne ones? They're plus or minus. If you've got them, you get resistance. Generally they're expressed, usually at high levels. They cause MIC. Or are you going to look at mutations, genomic mutations? Resistance to fluoroquinolones and aminocoumarins is often from mutations around ribosomal RNA or ribosomal proteins. So are we going to look at that level? Or are we going to take it further, to the pathogen I've got and the site of infection I've got?

And what is my data? Is it whole genome sequencing? Have I assembled and got a nice draft genome, nice clean data? Or is it community sequencing: short reads, messy data, with maybe human contamination? So you've got to think about these three vectors. In green, pardon me, is where the community is doing very well.
Lots of software and tools can do the perfect match screening of plasmid-borne threats when we do isolate sequencing. That does well. A couple are working on novel functional variants and resistance by mutation. Essentially, CARD and ARG-ANNOT are the two that focus on that. For novel emergent threats, CARD is the only one that's thinking about that problem, building algorithms to say, I can't explain the resistance, but I see what could be emerging in this patient. What we are not doing well is regulatory and intrinsic resistance. We are working on efflux, because regulation is big there and we do not predict efflux well. And then metagenomics. Metagenomics works well if you're looking for the dedicated genes. Do I have high similarity to a dedicated enzyme? That works well. It completely ignores a lot of resistance, which is mutation driven. So you'd never do metagenomics on a TB sample, because it's all mutational and we have no algorithms for that. But even if you do metagenomics on another pathogen, you're possibly going to miss fluoroquinolone and aminocoumarin resistance, because those are mutation-based resistances. So we've got a long way to go. When Rob and I talk about doing whole-resistome prediction from metagenomics, we're thinking about both the perfect match and mutational screening, with all the noise that comes with that data, and that gets quite hard. All right, so I cited a paper that Kara and I wrote, because the list is long. It was out of date about four minutes after it got published. There are a lot of tools and databases. I put a summary in it. I wanted to highlight a couple. So, tools. ARG-ANNOT. ARG-ANNOT is basically CARD's closest competitor and collaborator. We talk regularly. The two of us work with NCBI on a regular basis around curation. It does very well, both for SNPs and dedicated resistance genes. ResFinder is actually a tool for plasmid-borne resistance. Again, we collaborate.
There's basically 100% overlap between ARG-ANNOT, CARD and ResFinder on dedicated resistance genes. We wrote the RGI. We're going to demo that. I highlight SEAR, which is a nice draft pipeline for metagenomics. We're going to try a different one today, called AMR++. And then I highlight one at the bottom, TBProfiler. So who works on TB, anyone? TB is all mutation based. No one's keeping up with curation of the data. So it doesn't matter how good their algorithm is, their underlying reference data is incomplete, and these guys will admit it. There was a meeting in Germany of the six labs that were generating TB bioinformatics resources for AMR, and they all admitted they were behind schedule on curation and their data sets didn't overlap. So there's a real problem. This is a great little algorithm. It screens for the mutations known at that time, which is definitely fewer than there are in TB in a clinical setting. What I like about that one is you can add your own data as reference pretty easily. So if you know of mutations that are in your community, you can add them to TBProfiler. Databases. ARG-ANNOT and CARD I've kind of already covered. One is European, one is Canadian. Similar efforts. Someday we need just one; hopefully when we build the ontology we'll start merging and be one database at some point. NCBI. Basically, the beta-lactamases, of which there are a thousand or more. It was two elderly scientists curating the nomenclature of the data, and they really wanted to retire. So NCBI took this over. Right now NCBI is the home for naming beta-lactamases, and through sequencing we find many per week. A single mutation can change the MIC for beta-lactamases. They also have their reference gene database for resistance, which is essentially NCBI's version of CARD. But that means it's now integrated into NCBI's annotation format. If you're in the U.S. government system, EPA, USDA, this is where they get their reference data.
It flows from CARD and ARG-ANNOT and Resfams, who collaborate with NCBI, and NCBI packages it and puts it out to the various American agencies. That was a quiet negotiation over the last five years or so to make sure that works well. So NCBI is starting to really do some heavy lifting. NCBI is also going to push editors soon so that, if you publish on AMR, you've got to submit the MIC data to NCBI. So every genotype is going to have an MIC connected to it. You're going to see an editorial drive. Just like when it became mandatory that you had to submit sequence to GenBank if you wanted to publish, soon it will be sequence plus MIC. So this is a really good thing. ResFinder. This simply does the acquired genes. Really nice, tight little tool. We're going to talk a little bit about Resfams. Resfams takes CARD and builds hidden Markov models, so it looks at functional domains. There's a great paper on mining metagenomic data with Resfams. Computationally intensive work. This is not boutique analysis. This is heavy stuff. But it opens a big window onto functional diversity. It says, well, you're not similar, but I really think you're a beta-lactamase. Trouble is, it also bleeds into the hydrolases and into the others, because it's really looking at the functional domain. The false discovery rate is a little uncertain, but it really gives you a broad swath: here's a big list of things that could be beta-lactamases or acetyltransferases. Quite powerful for discovery. It's just that we really don't know what the false positive rate is in that database. I'm going to talk a little bit more about Resfams in a second. Okay. So, software. RGI is just one example; it shares a lot in common with other software. We engineered the RGI going back to the black swan. Now we're going to analyze this clinical isolate data. We want to look for perfects, the known knowns, so the threats we understand.
I want to see those, I don't want to miss them. But I also want to have what we call strict, the known unknowns, where you are within a functional range of similarity to a known gene. So you're a new aminoglycoside acetyltransferase. Those are those hand-curated cutoffs. And then we have loose, for discovery. When I have phenotype I can't explain, I go outside my models. What could explain it? We call that the unknown unknowns, for discovery. Only a very small number of researchers care about this. Either they're drug discovery people or they've got something in the clinic they can't explain. That's when they turn on that algorithm. Most of you will never have need for loose discovery. Let's hope, from your patients' point of view, you will never have need for loose discovery. So this is a sample. We're sequencing a thousand samples in our hospitals. Generally it's kids who come in with any type of multi-drug resistance. It's got to fail three drugs before we see it. Could be anything. Could be lung. Could be tissue. Could be gut. This is a Klebsiella sample that came in, pediatric. We sequenced it, and this individual had three known genes: two aminoglycoside and one beta-lactam resistance gene. Those are in green. And then a large suite of ones that we thought were strict functional homologs. Now, when we looked at the antibiogram and we looked at this list, we could match it. We could figure it out. There was no unexplained resistance. But suppose there was, suppose we looked at it and said, well, we can't explain it with that list of strict and perfect. Here's the strict and perfect broken down by function, both by mechanism and drug class. Say I still couldn't understand why macrolides didn't work in this patient, on this strain. Then I'd go into loose and I would find the category for macrolides, it's all classified, and see how many hits I've got. So if we look at, say, fluoroquinolone, there are 66 hits in the loose. So, kind of back to your question, right?
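The three tiers described above (perfect, strict, loose) can be sketched as a small decision rule. This is a minimal illustration that assumes each hit carries a percent identity, a bitscore, and a per-gene curated bitscore cutoff; the function name `classify_hit`, its parameters, and the numbers in the example are hypothetical, not RGI's actual code or CARD's curated values.

```python
def classify_hit(percent_identity, bitscore, curated_cutoff, full_length=True):
    """Assign a hit to one of three tiers (illustrative rule only).

    Perfect: identical, full-length match to a curated reference.
    Strict:  not identical, but scores at or above the curated cutoff.
    Loose:   below the cutoff, kept only as a lead for discovery.
    """
    if full_length and percent_identity == 100.0:
        return "Perfect"
    if bitscore >= curated_cutoff:
        return "Strict"
    return "Loose"


# Hypothetical hits: (percent identity, bitscore, curated cutoff).
example_hits = {
    "known threat": (100.0, 620.0, 500.0),
    "functional variant": (93.5, 540.0, 500.0),
    "distant homolog": (41.0, 120.0, 500.0),
}
for label, (identity, score, cutoff) in example_hits.items():
    print(label, "->", classify_hit(identity, score, cutoff))
```

In this framing the loose tier is not a prediction of resistance; it is a ranked list of candidates to clone and test when the phenotype cannot be explained by the first two tiers.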
So if you can't explain phenotype and you go into loose, you've got 66 genes to clone to make any proof. This is discovery. These are outside our homology models. They could be working with some other small molecule entirely. But it's a start point. We blinded CARD: we took MCR-1 and NDM-1, took those models out of CARD, ran those genes through, and they showed up right at the top of the loose hits. So we feel pretty good about that. I'm not going to overclaim. There are other classes that are really hard for loose to do well. It does really well for beta-lactamases. I'm not too worried; if a new beta-lactamase pops out, I'll see it. Other enzymes get tougher. For new mutations you need a cohort study. You need much more information for new mutations. So RGI, we're going to see this interface. It has both a web and a command line version. We're going to use the web today, but you can download it and use it at high throughput at the command line with your data. RGI was designed for whole genomes or assembly contigs. So, isolate work. It uses BLAST-based homology plus SNP mapping. It's very effective, based on CARD's canonical reference sequences, things that are in the literature and all this stuff. But what about metagenomics? As we heard from Fiona, we have AMRtime, with Rob, to work on extending this paradigm. The key is I want you to understand what the problems are with metagenomics. So we've got canonical reference sequences, the published diversity of sequences. Someone did a lab study with the proof, right? That's a very small sample of the real diversity of functional proteins that are out there. So we're under-representing the nucleotide sequence diversity of AMR genes in the wild, in the clinic, farm, environment. Most algorithms that do metagenomics use the Burrows-Wheeler transform. What that does is high-similarity nucleotide matching. You've got to have a nice tight fit to line it up against the reference.
But if your reference is not big enough, you'll miss. You could have a resistance gene in your sample that is in the wild, never been seen before, never been published before. Therefore it won't have a high-stringency map to CARD, because CARD doesn't curate that level of data. So you have a high false negative rate when you use the Burrows-Wheeler transform. We're going to work today with AMR++, which does this Burrows-Wheeler transform alignment of raw metagenomic data against their version of CARD. But again, if there's novel stuff in there that is functional but in a slightly different sequence space, it's going to miss it. So this is the number one challenge. Also, when you do the Burrows-Wheeler transform, you're just saying high similarity, 98%. You're not doing any SNP checking at all, right? So you're ignoring that whole component of resistance. If you see a paper or an algorithm saying, I predicted the resistome from metagenomics by Burrows-Wheeler transform, then by definition they've ignored whole suites of resistance that are mutation based, and that's why it's useless for TB, for example. There are no algorithms now that do both homology and SNP assessment of metagenomic data. That's what our Genome Canada grant is about. We want to predict the whole package, not just part of the package. So if you're in metagenomics, keep that in mind. You're going to under-represent. So how does that look? Let's look at three sorts of methods for metagenomics. There's what we just talked about, the Burrows-Wheeler transform and read mapping, on one end. Then Resfams, which look at functional domains and very divergent homology, a really open point of view. Or maybe even just BLASTX: you take that short read of 250 base pairs and BLASTX it against CARD, and that allows some sequence diversity. At this end, you have a high false positive rate. Markov models by definition reach a long way.
So it's going to generate a long list of hits, and you've got to look at how good those hits are, and there are no rules saying where the break is between a functional beta-lactamase and a hydrolase doing something else. BLASTX: we use BLAST in CARD, where we've curated cutoffs for whole proteins. All that breaks down once you have fragments of proteins. So we have no paradigms for the cutoff for BLASTX. You can always generate a BLASTX hit. Does that mean it's a functional AMR protein? At the other end, with the Burrows-Wheeler transform, you're being high stringency. We've got an under-curated reference set. We don't know the sequence diversity in the wild, so we're going to have a higher false negative rate. So there really isn't a good answer yet. And as I said, all these methods are ignoring SNPs. So this is where we are in metagenomics. All right, so let's look a little bit at scale. This is genomic, but it has implications for metagenomics. There's a study we published of Pseudomonas aeruginosa in just shy of 300 CF patients in Canada. These are isolates. Every column is a distinct resistotype. Most of them were only seen in one isolate. A few of them were seen in more than one isolate; that's the histogram at the top. Every row is a different resistance gene. So you see there are a lot of resistotypes in Canadian CF patients for Pseudomonas aeruginosa. If it's green, it's a perfect match to a known threat. If it's red, it's within strict. It's a new sequence variant. So the number one conclusion was there's a lot of underappreciated sequence diversity in clinical CF patients for Pseudomonas aeruginosa. That has immediate implications for doing metagenomic lung sequencing, because our reference is missing all those red dots, and metagenomics needs high-similarity reference data if you're going to do the Burrows-Wheeler transform.
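The false negative problem with high-stringency read mapping can be illustrated with a toy example. To be clear about the substitution: this is not a Burrows-Wheeler aligner, just an ungapped sliding-window identity check standing in for high-stringency mapping, and the sequences, the 98% threshold, and all function names are made up for the sketch.

```python
def percent_identity(read, window):
    """Ungapped percent identity between two equal-length sequences."""
    matches = sum(a == b for a, b in zip(read, window))
    return 100.0 * matches / len(read)


def best_identity(read, reference):
    """Best ungapped identity of the read against any window of the reference."""
    best = 0.0
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        best = max(best, percent_identity(read, window))
    return best


def maps_at_high_stringency(read, reference, min_identity=98.0):
    """True if the read would pass a high-identity mapping threshold."""
    return best_identity(read, reference) >= min_identity


# Made-up "curated reference" and two reads drawn from the same region.
reference = "ATGGCTAAAGGTCTGACCGTTGATCTGGAAACCCTGCGTAAAGCGATTGAA"
known_read = reference[10:40]  # identical to the curated sequence

# A hypothetical wild variant: the same 30 bases with three substitutions.
swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
variant_read = "".join(
    swap[base] if i in (5, 15, 25) else base for i, base in enumerate(known_read)
)

print(maps_at_high_stringency(known_read, reference))
print(maps_at_high_stringency(variant_read, reference))
```

The variant read differs at only three of thirty positions, so it could well come from a functional homolog, yet it falls below the threshold and is never counted. That is the under-representation described above when the reference set does not cover the sequence diversity in the wild.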
So the challenge we saw in this is, oh my god, taking CARD and saying, spit in the cup and let's do metagenomic sequencing of a CF patient, is going to break down because of this. And then we see a lot of other things. There were some gene classes that everybody had. Most of them were efflux. So those are regulatory questions, and that, as we learned, is a whole separate question which our algorithms don't handle. Then you have some interesting ones, a few isolates that have very different resistance profiles, which you hope won't spread. So now we go and look to see if that maybe was a plasmid. This study was interesting. It had some implications: there is a lot of similarity among CF strains, but a lot of interesting one-off strains, and a lot of underappreciated functional sequence diversity. But then a group from Brazil said, your software sucks. You really are not helping Brazil at all. Brazil does have an endemic Pseudomonas aeruginosa strain. We under-called phenotype completely in this paper, so they put out a comment. So we looked at it, and what we found is that RGI has false negatives for a Brazilian epidemic strain of Pseudomonas aeruginosa. So it did not work well for Brazil. It was not a failure of our algorithms. RGI, the algorithm, ran perfectly fine. It was a failure of curation at CARD. For the beta-lactamase SPM-1, we had added the term to the Antibiotic Resistance Ontology, but the curators had yet to build a model for it. So there was nothing for the software to use. And when it came to bicyclomycin, which is a drug used in Brazil and not often used in Canada, we hadn't even entered the drug in CARD yet. We had missed them. So no reference data, no hits. We curated it that afternoon. That evening it was in there and we emailed the Brazilians: we're very sorry. But what was the problem? The problem was that curation is a labor-intensive human effort, with expert human curators.
So generally they are anywhere from students to technicians, reading papers and building models in CARD, following a few rules. The reference sequence must be published and must have a GenBank accession. There must be clear evidence of resistance, with an experiment underneath it. So they missed Brazil entirely. What we've done now is we wrote an algorithm called CardShark. This was two second-year undergraduates. So who plays poker? A card shark is someone who's shuffling the deck and doing sleight of hand as they go, so the card they want ends up at the top. You get that card next. That's what a card shark does. So what CardShark does is every third day it reads PubMed, all of PubMed, shuffles the deck, and puts the most relevant paper at the top. That's the first one the curator reads. It basically says, what paper should you read next? It does that by using CARD as a network of knowledge, the ontology, and building a text mining algorithm around it. So once we got it going we ran CardShark, and guess what came up first? The two Brazilian hits were right in the top 10. Plus an interesting beta-lactamase family which we had missed entirely, because it was in [inaudible] in India. So this is one thing. You need to build this infrastructure. None of you are going to be biocurators like us, but if your biocuration pipeline is not good, you're going to have bad reference data and you're going to get bad results. So next time you're on a CIHR panel, tell them biocurators deserve more money. So CardShark is helping. It's also hurting, because it finds so many papers we can't keep up. TB is the one that loses on this side. So let's scale up further. Surveillance. So Syria, right? Newsweek put out this piece saying the Syrian crisis is maybe going to be the end of antibiotics, right? When you deal with a crisis, like Haiti with cholera, or you look at war zones and refugee camps, you get use and misuse of antibiotics, or people unable to use antibiotics properly, right?
You get shrapnel, which picks up strange bacteria out of the soil. Acinetobacter strains: we have bad strains from British combat vets that came out of the soil, right? So we have this issue that it's a global problem. One crisis in Syria could create a new environment to breed new resistance. It becomes what we call the discovery zone, right? Because these people are going to get on airplanes, the physicians get on airplanes, and that could show up in Manitoba, you name it. So this is why we need a discovery algorithm. So really, what will informatics need to look like when we sequence every hospital, clinic, outbreak, or war zone? How do we build a surveillance network for drug resistance? These are just the last couple of slides. You've heard a little bit about these Genome Canada grants with IRIDA. Will Hsiao has one grant to push ontologies way further to build this world. Rob has a second grant that says, well, how are we going to deal with messy data and metagenomics to build this surveillance? So the Canadian government is starting to put some serious money into this problem. But I'm going to highlight that this comes down to the other revolution. The first revolution was sequencing. We can take advantage of that. It's a market force that we don't need to worry about. It gets cheaper and faster and smaller every day. Then you have the internet of things. Sequencers are plugged into the internet. They're getting smaller. We've got nanopores that go on the end of cell phones now. So just like in the States you can buy a smart milk bottle that tells you, hey, the milk just went sour, or the kid drank it, or you've got shoes that are tracking your steps, et cetera, the hope is that as we build these informatics resources, better reference data, better data sharing, ontologies, we start embedding sequencing as a technology in the neighborhood clinic. So, being a Canadian, think about the north.
If you're a child in the north who is breathing poorly, you get a sputum sample shipped to the south. It's TB, so it takes two weeks to culture and characterize. Maybe it's four to six weeks before the report comes back that there's a new strain of TB. Meanwhile the community is infected with it. Instead, if you make sequencing and DNA isolation technology cheaper, the physician's assistant, whoever's there in that clinic in the north, gets the spit in the cup, spins it down, runs it through a nanopore, and 40 seconds later PHAC, federally, gets a warning, because an algorithm said, hey, that's a new strain of TB, and it has a mutation that we think is functional and that we've never seen. I'm an industrial chair, so that's my job, to get industry and government to come to that side of the equation. So why would you want to do this? The first reason is that great curve Fiona showed last night: the earlier you detect, the earlier you get rapid response, and the less disease and mortality you get. That's what we want to build in the sequencing world. Early detection of emergent threats. With the depth of data we've talked about, you get better at predicting emergence. So when you sequence that thing that picked up a brand new beta-lactamase or whatever, the algorithm warns you. It says, hey, what was that protein? That protein worries me. Go clone it, look at it now, because I think that's going to be a problem in your community. Informing public policy. As I said, right now you can't really see which resistance genes are around and what their density and frequency are. We know what resistance is around phenotypically, but we don't know the underlying genes and mechanisms. With that in place we can get better data. The earlier we know there's a completely drug-resistant strain of gonorrhea, the more we can inform, go to the high schools, do public education, et cetera. We can guide clinical practice. Physicians don't need me.
Physicians know what they're doing. They know what resistance is in their community, and they know roughly how to treat their patients. Where physicians struggle is when they have a patient that's resistant to one drug, so now they're down to four. They know which one's most effective clinically, with the least side effects for their patient, but they don't know which of the four is least likely to cause a community problem, to create more resistance. So which one should they prescribe that will save their patient but also have the least risk of creating more resistance in the future? That's antibiotic stewardship. When we gather this level of data, we can actually hit that. We can say, well, drug number four, there are no resistance genes in your community for that one, so that's the drug I want you to use, because there are no resistance elements that you're going to favor. Or maybe that's the one I hold in reserve, because we don't want to create resistance to that one. So we need that data. Then the last one's my sweet spot: drug development. There are only two companies really doing antimicrobial development, so we're not going to get a lot of new drugs. There will be some. The future is actually in resistance killers, drugs that turn off resistance so you can go back to using your old drugs. Jerry Wright, who I showed, has developed a resistance killer for NDM-1. When you put his killer in the cocktail, NDM-1 is shut off and you can go back to using your carbapenems, the things that gene usually takes out. So resistance killers are a big part. But if you're a small startup or a medium-sized biotech company and you say, I have a resistance killer for the AACs, you go to the federal government and ask, well, how common are AAC genes? Is that 2% of your patients? 80% of your patients? Is it pediatric?
I need this data to build a business model, get investors, and bring this drug to market. If we have no data, they won't take the risk. So by having this data from sequence surveillance, I produce a data set to de-risk the drug development pipeline. They can go, wow, that's 80% of pediatric patients, we can get this to trials and still make a profit before the patent wears out. Or, no, that's not the case, we've got to do it as a non-profit with the federal government and bring the drug to market that way. No data, no risk taking. So that's where I want to go. I'm actually going to show you a real example of that, and some of the algorithms you're going to see in the lab. This is what we call our Wildcard algorithm. This goes to that metagenomics problem. Wildcard takes RGI and runs it against GenBank, everything that's there, and finds all the perfect hits. So there's a whole bunch of pathogens here, mostly the ESKAPE pathogens. Every row is a different resistance gene, and a red tick is how prevalent it is. A faint red tick you can barely see means it only shows up in a couple of isolates, you know, 3%, 4%. A bright red tick says, wow, it's in almost all the isolates for that strain. So there are a lot of resistance genes, and this is just the perfect algorithm, right? These are perfect matches to known threats. We get all these, and you can see of course that every pathogen has its own different repertoire of resistance genes. But if we focus in on the aminoglycoside acetyltransferases, because Jerry Wright is trying to find killers for these, to shut these down, and I look at prevalence, I have perfect hits to a lot of them rarely, but three of them in fairly high abundance, two of them actually in multiple pathogens. Those are the ones for which I want to develop a resistance killer, because those are prevalent in what's in GenBank, assuming GenBank is a decent sample. People don't sequence boring stuff, so it's probably a good sample. The reason we're sequencing a thousand patients is I want a non-biased
analysis of that. So Jerry went to the lab and said, fine, let's go find killers for AAC(6'). So we're doing high-throughput robotic screening. The 6' means that the AAC acetylates the 6' position on the aminoglycoside. So Jerry started spending money doing all that, and I said, well, slow down, let's run Wildcard. Let's take the perfects and the stricts, right? So not only the complete matches to known sequences, but also what our algorithms say are the functional variants. And a lot more things start to light up red. There is an underappreciated sequence diversity out there, which is why metagenomics fails. You look at the AACs, and here are the three that Jerry originally picked, and it didn't get better: now we get some strict hits as well in other pathogens. So good, keep going after these ones with resistance killers. But right at the top there's a highly prevalent group of AAC(3). They modify a different position on the aminoglycoside. This is a slightly different enzyme, only about 70% similar to the one Jerry is trying to shut down, and it's quite prevalent, but only at the strict level. So there's a completely underappreciated sequence diversity of AAC(3) sitting out there in clinical specimens that GenBank has in it. They've never been characterized, they've never made it into the literature, no one's done an MIC on them, but they're quite prevalent, and this requires a completely different chemical screen from those others to find a resistance killer. So Jerry's going to clone those and see if they really raise the MIC, and if they really do, he'll publish a paper saying, hey look, there's a whole other threat from AACs that we don't appreciate, the AAC(3)s. And then he's going to screen and try to find a killer for those as well. That's what we want to build. Let me skip past these; they'll come up for sure in the lab. So, challenges. If you're going to do AMR, if you're going to do virulence factors, whatever you're annotating in your data, whether it be AMR or virulence factors, it comes down
to high-quality reference data. The Virulence Factor DB hasn't been updated in about three years, so there's your warning flag right there that your reference data ages. The funding agencies, CIHR and the like, don't fund biocuration very well, but we all need it. The trouble with AMR is it's not static. It evolves on a weekly basis. So it's not like annotating the zebrafish genome, which I helped do, and then you're done, because that doesn't evolve on a daily basis. But this does. So it's a different kind of biocuration: we're not annotating a static genome, it's a moving target. We're going to be seeing an increasing amount of data at the level of the clinic or farm, so sharing and storage matter, and that's what the ontologies are about. Metagenomics is difficult and computationally costly, so as a result we're only analyzing the easiest subsets of AMR, the Burrows-Wheeler transform type things that are fast. And translational tools are increasingly needed, so that's why I'm actually showing the web. Galaxy came up last night. Who knows what Galaxy is, or is familiar with the Galaxy framework? I'm going to stick in a Galaxy demo this afternoon. So if you don't want to use the command line, say you have a technician, and their poor job is they've got to do all the culturing, sequencing, and annotation, you don't want to teach them command line. You want a translational tool. Why aren't we doing well? Back to your question: minimum inhibitory concentration data is not curated in any of these databases. So CARD may tell you that this gene has a relationship that confers resistance to this drug, but I don't give you context, what the MIC is in E.
coli, what the MIC is in Staph. That data is not being collected. GenBank is now going to formalize that; hopefully it becomes something we get. Then, prevalence matters. For drug discovery, prevalence matters: you want to go after the things that are common threats, not just every hit that RGI makes. But prevalence matters in another way. If you analyze GenBank, there are some associations that are 100%. I always see these four genes in every pathogen. Sometimes they're not functional, they're just always there. If I shotgun sequence, so I sequence a genome really thinly and it's got lots of gaps, like Will talked about, and I see three of them, then just by probability the fourth is probably there. I just didn't sequence it. That's what prevalence gives you. So you can make smart algorithms based on prevalence data that say, well, you really sequenced that thinly, but you saw one, two and three, so I'm predicting you probably have four. [inaudible] even though they're the most critical part of AMR in many ways. Well, my lab is starting on that with GenEpiO, and doing mobilomes. And then all the metadata. That goes to the MICs, but more than that, the point of infection, the source, everything. If you're going to do some type of meta-analysis to predict, hey, this new SNP is causing the phenotype, you need as much metadata as you can get, and it's really inconsistent. Fiona talked a lot about that last night, the exact same challenges. So that is perfect. I'm just going to do my thanks. So Jerry at McMaster and the CARD team, the GenEpiO consortium and IRIDA, and all the [inaudible]. McMaster and Cisco sort of fund my lab and help build this. Sightline is one of those machine learning companies just waiting for us to generate enough data; that's where they want their marketplace. NML and IRIDA, huge partners in this. USDA, NCBI, and then a whole suite of academic collaborators. And again, Fiona. Any questions? Good. Answer the questions in the middle, Rob.
Yeah, so that's totally what's missing in there. So, pardon me, I lost my voice, but if we go back to those AAC(3)s and they're all plasmid-borne, that's an extra signal that it's a real, legitimate threat, right? If they are reasonably homologous to acetyltransferases and they're sitting in the chromosome, they could be interacting with some other small molecule. So that context is everything. Interpretation is everything. Our algorithms are not doing that sufficiently. So we've been playing in the background with how to predict whether a contig is a plasmid contig or not, so we can make that secondary call. The prevalence comes down to that too. If you know, based on Wildcard, that these genes are always plasmid-borne, you can get that level of data analytically. So when I analyze this isolate, I'm calling that a plasmid, because it's never seen in a non-plasmid setting. But right now I can't make that call. The identities are pretty high, so I'm thinking they're probably functional. Okay, coffee break.