All right, we're going to spend this morning, up until lunch, working on AMR. We're going to go through what's out there and how to do the analysis, and then we're actually going to play a bit with some beta-level stuff, to see where the field's going as well, stuff you'll see in the next year.
So, learning objectives. I'm going to do this without my glasses. We're going to review the available tools out there. There are a lot, and there's a diversity; it's sort of a chaotic time in the field. We're going to look at our own resource, the CARD database. With that, we're going to look at the diversity of mechanisms. This is really the trouble with doing AMR bioinformatics: the mechanisms are very diverse. We're going to do some analysis of genomes, some clinical isolates, and then we're going to look at the challenges of metagenomics, because a lot of you are going to start generating metagenomics data, whether it be clinical or environmental.
So background, no real surprises. Obviously, we're losing this battle. We're in the losing phase here. Bacteria are essentially evolving resistance faster than we can bring drugs to market at this point. The UK did a review on antimicrobial resistance. Right now cancer is sitting at around 8.2 million deaths a year, and AMR is sitting at around 700,000 deaths a year. By 2050, AMR is predicted to cause about 10 million annual deaths and cost about 3.5% of global GDP. So we're losing, and we're losing badly in this war.
But it's not like we're seeing things we haven't seen before. It's not like brand new pathogens are coming at us. These are pathogens we've been fighting for decades, right? Tuberculosis is making a comeback with completely drug-resistant strains. Gonorrhea: six years ago, one high school in the UK had 14 cases of untreatable gonorrhea. Now it's in Hamilton, Toronto, Detroit, you name it, it's all around. Pseudomonas, basically every 10 years, just like staph.
We tend to lose a drug class of antibiotics for the treatment of Pseudomonas infections. And then of course Enterobacter.
This is all exacerbated by the fact that while bacteria are evolving resistance to everything we expose them to, we're not finding any new drugs. The drug discovery pipeline has shriveled up. In fact, there's only one company really putting real R&D money into antibacterials. All the easily found compounds have been found, right? So you spend tens of millions screening environmental isolates hoping to find a new drug, and you keep finding streptomycin again. We keep finding the same drugs. So very little money is going into this, and there are very few investors. It's really becoming a government endeavor, not a for-profit model, if we're going to bring new antibiotics to market. We haven't found a really new antibiotic class since the 80s, right? We've had one interesting drug in the last couple of years that's phenomenally different. But if we keep finding compounds of the same classes, bacteria already know how to evolve resistance to them. What we need is new classes, or maybe something completely different from antibiotics.
So how does it happen? Well, it comes from misuse and overuse of antibiotics, whether it be at the clinic, on the farm, or in environmental settings. It's all about population biology. Every population of bacteria has a tiny fraction that inherently has a mutation or something else that causes resistance. You use antibiotics, you select, and they survive. What you end up with is a population of resistant bacteria if you don't use your antibiotics enough, or if you're constantly using them as a growth supplement in a farm environment. But what's worse with bacteria is that once they have resistance genes, they often use conjugation or plasmids or other methods to transfer them to other pathogens. So what starts as a Klebsiella infection quickly becomes drug resistance in 20 different pathogens, and the genes tend to become resident in populations.
So the key thing is that it's really all a genetic undertaking in bacteria. The targets can be anything from the cell wall to the information machinery of the cell. It can have to do with biochemical pathways like folate synthesis, or the membrane, or even protein synthesis. Antibiotics cover a large range of targets in the cell, but generally the cell has a way of coming back at them. It can protect the target, it can close the membrane, it can metabolize the drug. There are lots of things it can do, and these are all genetically based resistance.
The real problem is how fast this occurs. So this map is for the NDM-1 beta-lactamase. We only have two drug classes left, two drugs of last resort: carbapenems and polymyxins like colistin. NDM-1 appeared in 2008 and took out one of those classes. We lose our fourth-generation carbapenems. It actually was one individual, a Swedish gentleman of Indian descent, visiting India, who got ill with gastrointestinal disease and got reasonably well enough to be flown to the U.K. to undergo treatment. The infection picked up a gene from Lord knows where. It was Klebsiella and E. coli. It picked up this gene, NDM-1, probably out of soil or water, and it completely knocked out this drug class. From there it broke out into the hospital in the U.K., and now, this is old data, every country that is in color has issues with NDM-1, and I just saw my first in Hamilton in clinical sequencing. So we're only talking 10 years. And in those 10 years, I just checked our database yesterday, it went from being known in Klebsiella and E. coli to being known in 20 different pathogens.
So this is a rapidly evolving system. We need to move fast, and we need to be informed fast. Which genes are in Canada? Which ones are moving about? Which ones are a threat? This is why we do bioinformatics and genomics to answer this question. So, for people old enough, like me: this was Secretary of Defense Donald Rumsfeld.
He got made fun of for this statement: "There are known knowns, there are things that we know we know. There are known unknowns, that is to say things we know we don't know. Then there are the unknown unknowns, the things we don't know we don't know." He got mocked for this, but this is actually black swan theory. How do you prepare for the unexpected and the disastrous, right? This is the world we're living in, and we're all doing this.
So you're a public health lab, and you're working with the known knowns. You have assays, whether they're PCR or sequencing as a method, and you're looking at genes and pathogens that you are tracking, a very small suite. There are thousands of resistance genes, and we only PCR for the most common threats. So we can do surveillance for the known knowns. But with non-sequencing methods, PCR and the rest, it's a pretty limited battery.
The known unknowns are all the other resistance genes that have been published and characterized in the literature that we're not routinely screening for. Maybe we don't expect them in our community, though an airplane can change that overnight. Or maybe we just can't afford to run that many assays. But we also know there are natural variants occurring. For the gene that's described in the paper, there are probably sequence variants out there in nature that we need to think about, which might have slightly different functionality. Maybe they're a little more effective, a little less effective. So we know those exist, but we don't know what they look like because we're not looking for them.
And then we have the unknown unknowns, the emergent threats. NDM-1 is a good example: it popped out of somewhere into a clinical setting and became a global threat. A few years ago, MCR-1 came out. We talked about NDM-1 taking out the carbapenems. Well, MCR-1 takes out the polymyxins, like colistin.
And about a year and a half ago, we had our first patient with a plasmid that had both, and she was not savable. There wasn't a drug on the planet that could save her. We lost both drugs of last resort. So the unknown unknowns are thankfully rare. We only find maybe one or two a year, but they quickly become a massive public health problem. We often find them too late, well after we have patient or animal problems and we can't figure out the mechanism. It's six, eight months of biochemistry and lab work until we figure out what's going on.
The advantage is that with sequencing, genes can't hide. We have a chance, through surveillance, to take a look at this data. And the real advantage is that this is a market force we can take advantage of. Sequencing cost is plummeting. It's getting cheaper and higher throughput, and the devices are getting smaller, just by market forces. So that technology is moving to where we can afford it even at the local clinic level. With Nanopore, we use it at a clinical level, looking at clinical positives. The day is going to come when we can embed a sequencer at a northern clinic, or even on the back of a pickup truck in an outbreak in Haiti again. So we can take advantage of this, and we need to prepare for it.
Essentially, whatever you're doing, this is your pipeline. You sequence using one of the many platforms: Illumina, Nanopore, PacBio, you name it, even old-school Sanger in our case. You need to compare the result to reference sequences to make sense of it. And you predict the resistome. I'll use both sides here, actually. The resistome is the catalog of genes and mutations that may cause resistance. And this is genotype. The real tough one, the holy grail, is to go from there and predict phenotype: exactly what drugs will not work, and why, and to what degree, in this cell. If you want to learn more about that question, spend some time talking with Kara, whose PhD is on that end of the science.
This is the holy grail we want to get to, and that's where you get into machine learning and all the more interesting sides of computer science.
So that reference part, that's what my lab's really fundamental job is. We do lots of other things, but really we work on the reference side. So we built the Comprehensive Antibiotic Resistance Database. This is an expert-curated molecular reference database: the sequences, the mutations, the regulators, everything that goes on in different pathogens around resistance. So when you get sequence data, whatever its quality or quantity, you can compare it to CARD and figure out what you just saw, what's in your isolate. It really has a couple of themes. You can browse, so it's a knowledge base; we'll do that. You can perform different analyses; we'll definitely do that today. And the data comes in many different download formats, so you can build your own pipeline. You can see it used in IRIDA and other platforms. We also do surveillance, and we're going to talk about this data: we routinely surveil many pathogens to actually know which genes and variants are out there.
So what it is, is a high-quality sequence database on the molecular basis of AMR. It is expert curated, so there's a full-time human curator and lots of junior curators, but it's also guided by text mining. We wrote text miners so that every month we look at PubMed to make sure we didn't miss a paper or miss a mutation. It has breadth, which means we try to cover everything. The word comprehensive is in there, so we've got to do it. Every type of mechanism, whether it be plasmid-borne, or a mutation in a gyrase, or an intrinsic regulatory factor, things like that. On top of that, once we build the reference data, we work on software and analytics, and we work on the discovery zone, in particular emergent threats. But underneath it all, we really work on data harmonization. We've heard a lot about ontologies, right?
How to standardize and share this data. CARD is built on an ontology underneath, and we have a commitment: essentially, unless it's a really busy month, we release every month. We update even if it might only be one or two genes in the slow months, but we're always cranking out data, and we have this commitment to keep it going. This is an evolving threat. It evolves on a daily basis, so this is not like traditional curation, like a human genome, where you sequence it once and curate it forever. The ground changes underneath our feet every week in this battle.
Quick bits of what it looks like. This is a shot of the ANT(6) aminoglycoside resistance genes. You're not meant to see all the details, but this is an AMR gene family, and you can see there are tons of classification details about the mechanism and the drugs it interacts with. There are terms saying which drug classes it's known to interact with. You have this high-level classification, which you can use or not use, depending on what you need. But you can zoom into an individual gene within that family, so ANT(6)-Ia, and you have an ontology term. Here's the ontology accession; it's an ARO number. There are known synonyms. The language is horrible in this field; there are synonyms for everything. There is a written definition. It tells you what gene family it's in, what drug classes it interacts with, what resistance mechanism it uses, and even what pathogens it has been observed in at the genomic level, and whether it's been seen in a chromosome, in a plasmid, or in a whole-genome shotgun effort where we can't tell the difference. If you clicked on any of these buttons, you would bring up that higher-level information. It tracks the key publications, not every paper on this; it's really the ones that defined this gene when it was first studied. And down underneath, you can find the sequences, the DNA and protein, that define this gene.
We'll talk a little bit about the model context. So you have this massive knowledge base, and underneath it all is the ontology. Here's a graph of the ontology. In this case, we're looking at mutated gyrase. We can follow mechanisms. We can look at it evolving from a drug-sensitive gyrase, and we can see what drugs and drug class it confers resistance to, what targets it is associated with, what gene groups it belongs to, like the topoisomerases, what mechanism it works by, and then up to the individual compounds.
This means you could be a drug discovery person, so you would enter CARD through the drugs part of the ontology and do your searches, because you think from chemical space. You might be a molecular biologist who thinks in terms of mechanism, so you enter CARD through the mechanism. You might be a sequence person, so you enter at the model of the individual sequence. Maybe you're interested in topoisomerases for your PhD. Essentially you have a knowledge network. The ontology standardizes the terminology and connects it all, so you can enter and analyze the data from any perspective. It makes it very easy for a human to work through the data, but especially for algorithms to mine the data, because it is a knowledge network.
This effort underneath CARD, the ontology, is part of what we've heard about, the Genomic Epidemiology Ontology, this higher-level ontology led by Will that brings in the FoodOn ontology, the mobilome, virulence, et cetera. So we have a standardized language for harmonizing data and surveillance. As well, it's part of the IRIDA project that you had your first taste of yesterday, where you can standardize your data and standardize your metadata.
What's a little unique about CARD compared to the other resources, well, not a little, it's a lot unique, is that most resources are essentially a collection of sequences, a FASTA file, and you figure out what you want to do with it. We actually provide detection models and parameters.
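To make the knowledge-network idea concrete, here is a minimal sketch in Python. It is not CARD's actual data model or API: the edge list, the relation names (`is_a`, `confers_resistance_to_drug_class`, `has_mechanism`) and the helper functions are all assumptions made up for illustration. It only shows how one set of typed links lets you come in from a drug, from a mechanism, or from a gene.

```python
from collections import defaultdict

# Hypothetical, hand-written triples (subject, relation, object).
# A real ontology would hold thousands of curated terms.
EDGES = [
    ("ciprofloxacin", "is_a", "fluoroquinolone"),
    ("meropenem", "is_a", "carbapenem"),
    ("mutated gyrA", "confers_resistance_to_drug_class", "fluoroquinolone"),
    ("mutated gyrA", "has_mechanism", "antibiotic target alteration"),
    ("NDM-1", "confers_resistance_to_drug_class", "carbapenem"),
    ("NDM-1", "has_mechanism", "antibiotic inactivation"),
]


def build_index(edges):
    """Index subjects by (relation, object) so any term can be an entry point."""
    index = defaultdict(set)
    for subject, relation, obj in edges:
        index[(relation, obj)].add(subject)
    return index


INDEX = build_index(EDGES)


def genes_for_drug(drug):
    """Enter through chemistry: drug -> drug class -> resistance determinants."""
    classes = {obj for subj, rel, obj in EDGES if subj == drug and rel == "is_a"}
    genes = set()
    for drug_class in classes:
        genes |= INDEX[("confers_resistance_to_drug_class", drug_class)]
    return sorted(genes)


def genes_for_mechanism(mechanism):
    """Enter through biology: mechanism -> resistance determinants."""
    return sorted(INDEX[("has_mechanism", mechanism)])


print(genes_for_drug("ciprofloxacin"))
print(genes_for_mechanism("antibiotic inactivation"))
```

The same index answers both questions, which is the point of having one connected vocabulary rather than separate lists per use case.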
How should you use this data? We don't just give you the data. So here is one model. We're giving you the sequence; in this case, it is an ANT(6), so this is an aminoglycoside resistance gene. Here's our protein for it. This is a protein homolog model. It says that if you want to predict whether you have one of these in your genome, you're essentially going to take this reference sequence and run BLAST or some similar sequence comparison against it. And we're going to provide you a bit-score cutoff, which is right here. Clearly, if you BLAST and you get a perfect match, you know you've got the gene. But what if you have a variant of the gene? At what point is it too distant in sequence to actually be functional? So we hand curate all these cutoffs. A bit score is essentially a measure of similarity: how much information is shared between your sequence and the reference. If your bit score is above, in this case, 500, our curators predict that the variant you saw is functional, that this is some related ANT(6) aminoglycoside resistance gene. If it's below that cutoff, it's related to this aminoglycoside resistance gene but probably working on different small molecules, or we're not really sure, and you probably should clone it to be sure.
This takes us a ton of effort, but it gives you context, because you're going to see lots of variation when you sequence your samples. Is that variation probably functional? To be absolutely sure, you probably have to clone and sequence every one. But if it's within the cutoff, you have a high likelihood that it can explain your phenotype, why the drug didn't work. This is the most common model type in CARD. Beta-lactamases fall under it, since they're the most abundant resistance genes.
The second model is the protein variant model. This is the second most abundant model. It's very similar: I have a reference sequence. In this case, we're looking at, what are we looking at? TB, ethambutol resistance.
So the reference, though, is often a sensitive wild type, the normal gene that every bacterium carries. You BLAST to find it. In this case, you get a cutoff, and this cutoff requires a bit score of at least 2000 to say, yep, I found the embB gene in this strain. But that's not enough. Every mycobacterium carries an embB gene, so you get a positive for every single one. The model then compares the hit against a matrix of curated SNPs. These are mutations that have been characterized in the literature to cause elevated MIC; these cause resistance. So this is a two-step model: find the protein, then screen it against the knowledge base of substitutions, insertions, and deletions. Gyrases, aminocoumarin resistance, ethambutol resistance, fluoroquinolone resistance all use this type of model. These are the most common. Mutations in regulators are another common one.
This data is really rare. CARD is really the only one that cranks that out, with the exception of tuberculosis. If you're a tuberculosis researcher, there is an American resource. TB evolves so fast that we can't keep up, and there is a well-funded American resource for TB, so we let them do that. Hopefully we're going to get the data synchronized between the two of us soon. But I would need 50 curators to keep up with TB.
At this point, we have eight different types of models in CARD. We've heard about the homolog and the protein variant models. There's the equivalent for ribosomal RNA genes, so fluoroquinolone, aminocoumarin resistance; those are mutations of ribosomal RNAs. There are protein overexpression models. Kara can talk a little about those if you want the details, but essentially efflux proteins exist in all the strains. It really has to do with the regulator being mutated so that you overexpress the efflux system, and you're kicking out the drug at a high rate. So we're looking at this.
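The homolog and variant models described above can be sketched in a few lines of Python. This is only an illustration of the logic, not CARD's software: the dictionaries, the function names, and the two substitutions listed for embB are placeholders I am assuming for the example, and the cutoffs of 500 and 2000 are simply the figures quoted in the talk. A real run would take bit scores and observed substitutions from a BLAST-style alignment.

```python
# Homolog model: one curated bit-score cutoff per reference protein.
# The value 500 is the example quoted in the talk for an ANT(6) model.
HOMOLOG_MODELS = {"ANT(6)-Ia": 500.0}

# Variant model: a cutoff to find the gene, plus curated substitutions.
# The substitution names are illustrative placeholders, not a curated list.
VARIANT_MODELS = {
    "embB": {
        "bit_score_cutoff": 2000.0,
        "resistance_substitutions": {"M306V", "M306I"},
    }
}


def homolog_call(model, bit_score):
    """Step one only: is the hit similar enough to be predicted functional?"""
    return bit_score >= HOMOLOG_MODELS[model]


def variant_call(model, bit_score, observed_substitutions):
    """Two steps: find the gene, then screen it against curated substitutions.

    Returns None if the gene was not found, an empty list if it was found
    but looks wild type, or the list of matching resistance substitutions.
    """
    params = VARIANT_MODELS[model]
    if bit_score < params["bit_score_cutoff"]:
        return None
    matches = set(observed_substitutions) & params["resistance_substitutions"]
    return sorted(matches)


print(homolog_call("ANT(6)-Ia", 520.0))
print(variant_call("embB", 2100.0, {"M306V", "A10T"}))
print(variant_call("embB", 2100.0, set()))
```

Keeping "gene not found" (None) separate from "gene found, no known mutation" (empty list) mirrors the point made in the talk: every mycobacterium carries embB, so finding the gene alone says nothing about resistance.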
This model will find the existence of the efflux protein and tell you that, but it will especially tell you if it's mutated so that you get higher levels of efflux. We have protein knockouts or non-functional insertions, as in Acinetobacter, and then higher-level things like van clusters, where you need a whole battery of genes to cause resistance, or efflux systems, where you have five or six components all at play, including the regulators.
In orange are the models that we put in CARD that are in use by our software, so you can predict these with our software. The asterisks indicate there are a few mutation types we've yet to write code for. We're not very good at deletions yet. The blue is where students are generating new algorithms as part of their theses, these higher-level questions like, do I have glycopeptide resistance in my Enterococcus? The two here are where we've curated all the knowledge in CARD but haven't written a line of code yet to use it, and that's open game. Anyone can do that if they want.
So where are we at? We're at just over 4,000 ontology terms in the knowledge base, just shy of 2,500 reference sequences, and 1,200 mutations. About 2,300 publications are tied to the original definitions. There are, of course, many more publications we go through, Lord knows how many. And there are about 2,500 detection models. So that's your rough estimate of how big the resistome out there is: we're dealing with about 2,500 resistance types. It's actually a little higher, as a few of them are private.
We published a paper last summer. For the students in the room: this was an undergraduate who got first author and redesigned the whole thing. We are cited at a very high rate, which is great. You'll see the data used for genome analysis. People use it themselves in their own pipelines. You'll see people use it to make brand new models, so we'll talk about the Resfams models. You'll see our data used to enrich a database. We saw Fiona talk about IslandViewer.
But we have a formal collaboration with NCBI to use our ontology, and we trade data back and forth to make sure NCBI and we are on the same page. For the Americans in the audience, how it works in the US is that we curate like mad, pool our data, collaborate with NCBI, and then it goes to the USDA, the EPA, the CDC. So they're the conduit we go through. NCBI is starting to take a lead, and we really are working on the ontology standardization. And lots of people use our software.
From a skill level, this is tricky science. For those of you interested in how to be a bioinformatician: you need to be biochemists and cell biologists first. Most people come into my group from that perspective. You've got to understand these cells. We have to know the enemy. Biocuration, which is a sub-discipline of bioinformatics, is really about standardization and building data structures. You build data so humans and computers can use it. It's a growing, urgent, and underfunded field as well. And then, of course, there are the data sciences and software engineering to build all of it. So you've got to think from multiple perspectives. We have two full-time staff: a software engineer, with a proper degree in engineering, and an evolutionary biologist who's really the lead curator, because this is an evolutionary phenomenon we're battling.
So you have the resistance reference database built. We have all these structures. How do you use it? How do you do AMR prediction? Essentially, you have three axes. I'm going to go on this side. The top axis is whether you want to just screen for perfect matches. Do I have NDM-1? Do I have MCR-1? Do I have TEM-72? Or are you looking for functional variants? So I want to be a little more open-minded; I know there's sequence variation out there. Or am I looking for truly emergent threats? I'm trying to look for the things that are going to scare the daylights out of me. The next axis is what type of resistance?
So am I just looking for dedicated resistance genes like beta-lactamases? They're often borne on a plasmid, so those are protein homolog models. Or am I looking for resistance by mutation, the protein variant models, gyrase mutations, and all the rest? Lots of software doesn't do this one. Or am I trying to work on the subtle side, things that make the cell just intrinsically resistant to the drug? The drug just can't get in. Or regulatory changes.
And then, how good is my data? Have I got whole-genome sequencing or an assembly, so I have huge chunks of the genome resolved and I'm getting whole AMR genes, so I can find the whole thing? Or am I doing community sequencing, metagenomics of lung or water or something like that, where I'm just getting fragments?
You have these three axes. The green boxes are essentially where the community is doing very well. There are lots of players, lots of people building tools and software. It's the low-hanging fruit, bioinformatically. So finding the known and just saying, hey, you've got NDM-1, and that's a plasmid-borne, dedicated gene, and it's from sequencing an isolate: we can do that with our eyes closed. When you move into novel variants, essentially we're talking about curation of the SNPs. So it's us and the ARG-ANNOT database, and a little bit the ResFinder guys in Denmark. We share data and go back and forth. This is a tough battle. These things mutate at high rates, and we constantly have to curate new mutations. I can never tell you that we're comprehensive or caught up. We're always trying to keep up. For the emergent threats, really, there are only two groups working on it, us and the Resfams group, with very different perspectives; we'll talk about those. But what we're not covering very much is intrinsic resistance and regulatory changes. We're the first to work on it. We have those protein overexpression models. It really has to do with mutations of the regulators.
So we say, not only do we find the genes, but we're predicting they're overexpressed. So CARD gets all the credit for that much. Some of the intrinsic? Not so much. Well, we're going to improve on that in the next year. That's going to be about pathogen identification more often than it is about resistance.
Metagenomics is a mess. We're going to play with it today, and I'm going to tell you the limits. There are lots of papers that are not done well, and there are some real technical challenges. But it's going to be a good year for many groups on this front.
So resources: there are tons. When I sequenced Giardia 14 years ago, we got the genome out, and there was this plethora of tools and databases and everything. After about five years, it settled down, and there were two or three really good, reliable things. We're in that state for AMR. There's lots of money in the pipe because it's a severe health crisis, so every week there's another tool or curated database, and it's really hard for a researcher to figure out what to do. Kara published a review with me last year. I'm just going to highlight a few.
So, tools that I like. First of all, the ARG-ANNOT folks. Really good curators. They have a nice tool for genome annotation. Their SNP database is a little small. We've got ResFinder. These are the Danes. Really extraordinarily good. It's very commonly used for plasmid-borne resistance, but they're starting to branch out; they got some funding to do that. Our tool: we'll see the Resistance Gene Identifier today. I highlighted this one. They're probably not supporting it as well as they should, but this is a good preliminary metagenomics tool. We're going to talk about AMR++ today, too. Then TBProfiler. For TB, really, you want to use things like TBProfiler, not CARD or ARG-ANNOT, because these guys are really focused on that pathogen.
Databases. So fine, you've got tools. You need data to compare against. There are really three. ARG-ANNOT, fantastic group. We collaborate.
We send data back and forth. We both come to the table with NCBI. There's us, curating the data of CARD. You have ResFinder, which is, again, the predominantly plasmid-borne one in Denmark. Getting them to come to the table and share a bit more is a little tougher with the Danes. I did highlight the Resfams. We'll talk about that. These are a specific thing called hidden Markov models. They don't generate data. They amalgamate all our data sources and build Markov models, which we'll talk about a little bit later; it's a really computationally intensive tool, but quite powerful. And then NCBI has made a real commitment, particularly around beta-lactamases, because the people that curated them have all retired, like Karen Bush and the rest. NCBI has stepped up in making centralized resources, and we all collaborate with NCBI.
All right, our software. We're going to play with the Resistance Gene Identifier today. The RGI basically thinks like a black swan, right? When you run RGI, you can make hits on three paradigms: perfect, strict, and loose, or what we call discovery. A perfect hit is where the sequence matches the reference completely. It's a known known. So hey, I've got NDM-1 in my isolate, right? If you are doing public health surveillance and you're really using sequencing to track known threats, that's the level of RGI you're interested in. You may not use the rest of it. Then we have strict, where the hit is not the same as the reference, but it's within those bit-score cutoffs, those model cutoffs. It has all the things that the different model parameters say it should, so our curators are predicting it's functional. If it's 97% similar to a known gene and that's within the cutoffs, it's probably a functional variant. If you're doing clinical sequencing, where you're trying to understand which drugs will or won't work, you probably need the strict window, because there's lots of variation out there in nature, right?
Almost the majority of it is uncharacterized. The majority of users never move outside of that window. It's a tiny minority that turn on the loose button, and it's because they have something they can't explain. When they look at the perfect and the strict hits, they still can't explain why their drug doesn't work. There's something new, right? They turn on the loose, and these are all the hits that are outside of our model cutoffs. A lot of them are spurious, complete nonsense. They're related proteins that are working on some other small molecule or something like that. But if you come down to last resort, this is where you go, and this is where you find your emergent threats. Right now, this is really a massive amount of data. It takes a lot of work, and you've got to clone and sequence. We're now starting to collaborate with the protein databases that have structural-level data to say, okay, out of those 4,000 loose hits you got, these four have a good structural profile to actually be your resistance gene; they could physically interact with the molecule that you're working with. So people like Gerry Wright, who sequences really bizarre pathogens or ancient pathogens that have never been exposed to humans, he's always in the loose zone, right? And doing a lot of cloning as a result.
So this is what it looks like on the website currently. You can put in an accession and it'll just grab data from GenBank, or you can load your own files. You can say, well, I've got DNA sequence, so it's going to find the genes for you, or, I've already predicted my genes, I'm just giving it my predicted proteome, I've already done my annotation. You can say, I want perfect and strict, or turn on that loose button to see all those loose hits. And recently there's this option for sequence quality. If you've got good-quality coverage, like a good shotgun assembly, good coverage of your genome, big chunks of the genome resolved, no problem.
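The perfect, strict, and loose paradigms can be written down as a small decision rule. This is a rough sketch of the idea as described in the talk, not the RGI source code: the function names, the tuple layout of a hit, and the example numbers are assumptions for illustration only.

```python
def classify_hit(percent_identity, bit_score, model_cutoff, full_length=True):
    """Assign a hit to one of the three paradigms described in the talk.

    Perfect: identical to the curated reference over its whole length.
    Strict:  not identical, but at or above the curated bit-score cutoff.
    Loose:   anything below the cutoff (discovery territory).
    """
    if full_length and percent_identity == 100.0:
        return "Perfect"
    if bit_score >= model_cutoff:
        return "Strict"
    return "Loose"


def report(hits, include_loose=False):
    """Keep Perfect and Strict hits by default; add Loose only on request."""
    wanted = {"Perfect", "Strict"}
    if include_loose:
        wanted.add("Loose")
    kept = []
    for name, identity, bits, cutoff in hits:
        category = classify_hit(identity, bits, cutoff)
        if category in wanted:
            kept.append((name, category))
    return kept


# Hypothetical hits: (label, percent identity, bit score, model cutoff).
example_hits = [
    ("hit_A", 100.0, 560.0, 500.0),
    ("hit_B", 97.0, 540.0, 500.0),
    ("hit_C", 41.0, 120.0, 500.0),
]

print(report(example_hits))
print(report(example_hits, include_loose=True))
```

Leaving the loose category off by default reflects how most users work: the discovery hits only get switched on when perfect and strict cannot explain the phenotype.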
But if you're doing small plasmids, or it's quick and dirty, low sequence coverage of your pathogen, or maybe it's even merged metagenomic reads, so tiny contigs, you can turn on the low-quality option, and then it will actually look for partial genes. In the high-quality mode it says, no, I've got to find a complete gene, start to stop, before I'll consider it. That's a rule. But if you expect partial genes because your data is small, it's plasmids, it's low coverage, you turn on the low-quality option to find partial genes and weak hits. And this actually works way better than I thought it would, to be honest.
So this is a visualization from the web. This is a multi-drug-resistant Salmonella from a patient in Hamilton, Ontario. The wheel is a snapshot from the web. We don't have loose turned on, so it's not showing any, but there's one perfect hit out of all that. This is a multi-drug-resistant strain. I only see things when three drugs fail; then they come to me. There are tons of strict hits, genes that are within our model cutoffs. We haven't seen that specific sequence before, but it's within our cutoffs. It's a lot of genes.
Then you can use the ontology and literally click on the website, and it goes to this. It's reorganized by the ontology, and now it's organized by gene family. If we look, most of the gene families only have one hit. But if you look in the RND efflux, there are 14 genes. There's always a lot of efflux in bacteria. Or if you look in the MFS efflux, there are seven hits. There are two hits for what looks like a beta-lactamase over there. So you can reorganize the data on the fly using the ontology to break it down by gene families. Well, maybe that doesn't mean a lot to you. So you can reorganize the data and view it by drug class. And what's scary is that one perfect hit covers seven drug classes. So it alone is causing multi-drug resistance.
But now we take a look and I've got 15 genes contributing to fluoroquinolone, 17 to rifamycin, I've got aminoglycosides, I've got macrolides. It's got basically everything. That's a superbug. Chloramphenicol is even in there. Obviously we've never used chloramphenicol on this patient, but we predict that resistance nonetheless. So you have that context, and the key thing about resistance is that one gene can lead to resistance to many drugs and many drug classes, particularly efflux. This is a superbug that has a large battery of genes, but it actually can take just a few genes to generate a very complex phenotype. If we still couldn't explain why a certain drug didn't work in this clinical pathogen, and so we were at a loss, then we could turn on Loose. And with Loose, you can see how many weak hits to resistance genes there are in just one Salmonella isolate, right, in this list. But if we know it was macrolides that we couldn't explain, we could cross-reference this with the ontology to reduce this list to the ones that are a weak hit to a macrolide resistance gene and maybe get it down to 12, 13 candidates. Right now we would look at structure by hand; hopefully in the future we'd have the website do that work, looking at structure to narrow it down even further. We fortunately don't have to use this much, right? It's only when something scary shows up that we turn this on. But we usually have running bets on how long till the next one shows up in the lab, and I usually lose the bets. So you have this capacity, for those of you who have the unexpected. Okay, what about metagenomics? How am I doing on time, man? What about metagenomics? So what I just showed you was an isolate that was cultured on a plate, DNA isolated, we sequenced it on an Illumina, we assembled it with something like SPAdes, so it was high quality data on a purified isolate, right?
Metagenomics is where I get a cystic fibrosis patient and ask them to cough in a cup and I sequence that DNA, right? So that DNA is gonna have tons of patient DNA, human cells that get in it, it's gonna have tons of commensal bacteria that belong there, that's your job, and then there's a tiny little bit that might be my pathogen, and out of that tiny little bit of pathogen, a tiny, tiny bit will be the AMR genes, because most of it's just what makes a cell a cell, right? So you've really got a noise and signal problem in this data. On top of that, we don't know how to assemble metagenomics very well. So we have short fragments. Illumina sequencing reads are 250 base pairs long, and most genes are longer than that. So we only have partial hits. So it is fragmentary, highly noisy data. So how do we deal with that? Where the field is at is essentially there's three methods out there. First of all, if you've got a big server farm, you could just BLASTX the whole thing. Every fragment into protein space, compared to CARD or ARG-ANNOT or whatever. This is computationally intensive. We do it for baseline analysis, but very few labs have this capacity, because the average metagenomic data set is gonna be 160 million sequences, right? This is gonna take forever. So we don't have the algorithmic space, and the reason that you'll see things like the Burrows-Wheeler transform, BWA or Bowtie, all these alignment tools, is simply because this was too dang slow. We had to do something smarter and faster. If you really have a lot of computing, you can use Resfams. Now what Resfams is is really some excellent bioinformatics in a way. So they take CARD and others and they say, I'm gonna build a model, a statistical model of what a beta-lactamase functional domain looks like and an aminoglycoside acetyltransferase functional domain. So it's a mathematical model, and this is what hidden Markov models do.
They're usually used for looking at proteomes to say, I wanna find a distant relative to my histone. I wanna find a distant relative to my beta-lactamase. They're really good at reaching out in evolutionary space and finding weak hits. These guys with Resfams built an excellent resource of them and they're great for protein annotation, absolutely. But then they use a huge amount of computation to throw them against metagenomics, where the data are fragmentary, and it says, I think I just saw a bit of a functional domain of a beta-lactamase. I think I just saw a bit of a functional domain of an acetyltransferase. So they've done it in soil and they came up with a huge list, right? And the trouble is no one knows, when you do this, what the false positive rate is when the data's really fragmentary. So when I look at their list and it says, hey, there's 470 putative beta-lactamases in the soil sample, I go, yeah, but is it 3% correct or 97% correct? No one knows. So when Fiona talked about the AMRtime project yesterday, this is the problem they're trying to solve. They're trying to build these advanced models for mining soil and water and clinical metagenomes, but reduce the false positives. That's really fundamentally what AMRtime's about. But if you look at the Resfams papers, that's the grain of salt. First of all, almost everyone in this room doesn't have the computing power to run those against metagenomics. I don't do it in my lab and I've got a lot of horsepower. But also you're dealing with a false positive rate problem. So what you'll see in the vast majority of the literature is people using the Burrows-Wheeler transform, and you'll see software called Bowtie or BWA, these types of things, or they'll call it read mapping. These are high intensity, fast bioinformatics algorithms, but they're high stringency. I have my little gene fragment, I align it to the reference, but they really don't even consider anything below 97% similarity.
They gotta have a high matching rate. That's why the algorithm can be fast, is the key. Unlike BLAST, which says, look at everything, so it takes forever. This one says don't even consider it if you're not even within 90% similarity. It's all about high stringency read mapping. So it has very low levels of false positives, right? It's very good against a reference database. What we don't know is what the rate of false negatives is, and there's nothing wrong with the algorithm. The false negatives have to do with whether there's anything 97, 98% similar to your read in your reference database in the first place. So if you're sequencing a clinical sample, like a CF patient where we expect Pseudomonas, CARD and others are really good at the sequence diversity of Pseudomonas. So the Burrows-Wheeler transform is gonna work great. It's gonna align the CF Pseudomonas. If you go to a farm, right, and you sequence the microbiome of a pig, metagenomics, because you're looking at AMR transmission in a farm, the sequence variants that are in the wild in a farm might be a long way from the sequences that are sitting in CARD, because those are generally the published clinical sequences. And so the sequence that you've got in your farm might be a perfectly valid AMR gene, but it might only be 93% similar to the reference, and the BWT fails, right? So the false negative rate, which is the real problem of that literature, is highly dependent upon the reference database. Stringent read mapping is sensitive to the sequence diversity in the reference database. And this has been really, really under-examined. Every database out there, canonical CARD, ARG-ANNOT, ResFinder and all the rest, takes what gets in the literature, and 90% of the papers in the literature are clinical sequence variants, and they only publish the first sequence they saw, not all the other variants that are out there in nature. So CARD traditionally tracked the published canonical sequence, the first time it was published, right? That is actually our curation rule.
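The reference-dependence of that false negative problem can be shown with a toy example. The sketch below is not Burrows-Wheeler read mapping and does not call BWA or Bowtie; it is a brute-force, ungapped comparison that only illustrates the effect of an identity cutoff. The sequences, the `mutate` helper and the 97% default are all assumptions for the illustration.

```python
def percent_identity(read, window):
    """Percent of positions that match between two equal-length strings."""
    matches = sum(a == b for a, b in zip(read, window))
    return 100.0 * matches / len(read)


def best_hit(read, references, min_identity=97.0):
    """Best ungapped placement of read at or above min_identity, else None."""
    best = None
    for name, ref in references.items():
        for start in range(len(ref) - len(read) + 1):
            identity = percent_identity(read, ref[start:start + len(read)])
            if identity >= min_identity and (best is None or identity > best[1]):
                best = (name, identity)
    return best


def mutate(seq, positions):
    """Substitute the base at each given position (toy variant generator)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[b] if i in positions else b for i, b in enumerate(seq))


canonical = "ATGGCTAAGC" * 10   # made-up 100 bp "published" reference
farm_variant = mutate(canonical, {3, 17, 31, 45, 59, 73, 87})  # 7 changes
read = farm_variant             # a read drawn from the variant in the wild

print(percent_identity(read, canonical))          # identity to the reference
print(best_hit(read, {"canonical": canonical}))   # canonical reference only
print(best_hit(read, {"canonical": canonical, "farm_variant": farm_variant}))
```

With only the canonical sequence in the reference set the read is discarded, even though it comes from a real variant of the same gene; once the variant itself is in the reference set the same read maps cleanly. Nothing about the mapping step changed, only the reference.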
Our algorithm, our text miner, looks for new things, not information on known things. So it's really underestimating these variants. So we looked at even a CF patient, it was CARs undergrad thesis, and we found that even at the CF level, where we knew a lot about Pseudomonas, we missed things, because the Pseudomonas we saw in the CF clinic was a little sequence distance from the Pseudomonas we had put into CARD. So we knew this was coming. We knew that the bioinformatics was making nice progress, and AMRtime was gonna attack the false positives here, but we as curators had to attack the false negatives. So what did we do? I'll show one more. So this is CF data from Canada, showing you the diversity. Every column, I should say, is an isolate type from a Canadian cystic fibrosis patient with a Pseudomonas infection. Some of the isolates are more common than others, so most of them are one-offs, but a few isolates are seen in more than one patient. Every row is a different resistance gene, and this is using the RGI software. If it's green, it's a known sequence, it's a known node, right? If it's red, it's a Strict hit. It's a sequence variant that's not in CARD or any other database. So that showed that even just in clinical CF patients, looking at Pseudomonas, almost three quarters of the hits to AMR genes were brand new sequence variants that CARD was naive about. It doesn't have that individual sequence. And you can see some more so than others, sort of a whole drug class picked up by this one right here. It's an interesting variant. So this bar along here is predominantly efflux. So what did we do about it? CARD released Resistomes & Variants. Every month, gonna switch to every two months because it's taking so much processor time, we grab every closed chromosome, closed plasmid and shotgun assembly in NCBI for up to 76 pathogens. And every month we keep adding more pathogens to it.
We take that data set, and there's just a screenshot of a few of them showing where the density of the data is. And we run the complete CARD RGI against it and we predict every Perfect and Strict hit, right? So at this point, we have over 70,000 chromosomes, plasmids and shotgun assemblies for 76 pathogens. With RGI, at the last release, just a few days ago, we have over 130,000 allelic variants across everything that's been sequenced. So we're building a database of sequence diversity to work on that false negative problem that comes with doing metagenomics. The reason we built this, other than that we get a lot of great epidemiology data out of it, is really about solving the metagenomics problem. This data is all available on the web and downloadable. So here's an example, and you're going to do this yourself today. Here is, again, another Pseudomonas from CF. At the left, we have the predicted gene in the metagenomics data. On the right, the number of sequences. This is just a subset to test software, so not many sequencing reads from the Illumina hit. Normally you get much higher numbers. So lots of genes in the CF patient's community data. Most of them are homolog models. For some of them, the Burrows-Wheeler transform mapped the read to a known CARD sequence, a canonical published allele, right? But for the vast majority of them, the closest sequence was actually one of these Resistomes & Variants sequences from genome mining against GenBank. It was much closer than the CARD one. In fact, if you lose these, you lose whole rows, because they're outside of that 97% similarity window. So this really illustrated what was key about it. Then you get further, because with Resistomes & Variants you know where they came from. So two of them we know are found on plasmids in GenBank. So on top of that we have context, not just a sequence match.
You can say, well, two of these appear to be plasmid-borne, are known to be associated with plasmids, whereas most of them in this case are not. But on top of that, we said, well, where are those genes known, right? Now I happen to have submitted a Pseudomonas infection. So a lot of these genes have only ever been seen in Pseudomonas. Others are much more broadly distributed, and no surprise, look at that, the two plasmid-borne ones are the ones that are spread out amongst many pathogens, right? So that's a step towards pathogen identification. Now this is a control, I knew it was Pseudomonas, but if you're sequencing a gut infection, you don't know who your player is, right? So in metagenomics, you have two issues. What resistance genes do I have, and who's carrying them? The second is often really the hard part to figure out, and we're starting to make some steps, and you're gonna use some, I'm not even gonna call it beta, alpha software today to look at this issue. So that led to the question: we built these alleles, are they diagnostic? Are we getting down to a real level of molecular epidemiology? So we went to Gerry, right, our colleague. Gerry leads a clinical sequencing effort. Anything where a third drug fails, we get it from Hamilton, so it can be any kind of infection. And so we looked at this and we said, fine, we'll take your whole collection of pathogens, we'll sequence them, we'll run them through the RGI, we'll find all the sequences that are in the wild in a diverse clinical setting, and we'll ask, of all those alleles we see, how many are only found in one pathogen according to all our Resistomes & Variants? So if you look at Kleb, about 55% of the AMR alleles in Klebsiella have only ever been seen in Klebsiella, according to everything in GenBank. They're diagnostic for Kleb. Another 20% are from promiscuous plasmids, right? So these are alleles that are on a plasmid and are just leaping around between pathogens.
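The bookkeeping behind that question, whether an allele has only ever been seen in one pathogen, can be sketched as follows. This is a minimal illustration and not the CARD Resistomes & Variants pipeline: the record layout, the allele identifiers and the three output labels are assumptions for the example.

```python
from collections import defaultdict


def build_index(observations):
    """Index (allele, pathogen, on_plasmid) records from reference genomes."""
    hosts = defaultdict(set)   # allele -> set of pathogens it was seen in
    plasmid_alleles = set()    # alleles seen on a plasmid at least once
    for allele, pathogen, on_plasmid in observations:
        hosts[allele].add(pathogen)
        if on_plasmid:
            plasmid_alleles.add(allele)
    return dict(hosts), plasmid_alleles


def classify(allele, hosts, plasmid_alleles):
    """Label an allele by how specifically it points to one pathogen."""
    seen_in = hosts.get(allele, set())
    if len(seen_in) == 1:
        return "diagnostic: " + next(iter(seen_in))
    if len(seen_in) > 1 and allele in plasmid_alleles:
        return "promiscuous plasmid"
    return "undiagnosed"


# Made-up observations, purely for illustration.
observations = [
    ("allele_1", "Klebsiella pneumoniae", False),
    ("allele_1", "Klebsiella pneumoniae", False),
    ("allele_2", "Klebsiella pneumoniae", True),
    ("allele_2", "Escherichia coli", True),
    ("allele_3", "Pseudomonas aeruginosa", False),
]
hosts, plasmid_alleles = build_index(observations)

for allele in ("allele_1", "allele_2", "allele_3", "allele_9"):
    print(allele, "->", classify(allele, hosts, plasmid_alleles))
```

Counting the share of an isolate's alleles that fall under each label gives the kind of per-pathogen percentages quoted for Klebsiella.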
There is another tiny fraction we couldn't diagnose. Pseudomonas, really, I didn't realize how different in sequence space Pseudomonas is. Almost every AMR allele we find in Pseudomonas is diagnostic for Pseudomonas. We don't find it in any other pathogen. So we can do really well at identifying Pseudomonas. Up here you'll see Proteus. Well, we didn't have Proteus in our reference database, so it was really bad then, but it's better now. So this really comes to the question: by doing this massive computing and providing the database to our users, can you get down to pathogen ID and AMR detection in metagenomics? And we're gonna do that in the lab today. Okay, so let's scale it up the last little bit. So Syria, we're starting to see the first real papers coming out of Syria. If you're in a combat zone, a refugee zone, a medical camp zone, reliable and consistent use of antibiotics is almost impossible. It's a breeding ground for resistance. You have people who are on the move, who are using their drugs, not finishing their courses, getting to another location and being assigned a different antibiotic, being shipped to a different country that has a different antibiotic they tend to use. We have the early evidence that we're breeding drug resistant strains. The combat zones and refugee camps need special care. We're gonna learn the hard way that we need to treat antibiotic usage differently in these types of environments to be reliable, save the patient and save the community, so to speak. But how are we gonna deal with that? What does informatics need to look like when we sequence at every clinic, hospital, outbreak or war zone, right? How do we build a surveillance network for drug resistance? This is partly motivated by the Haiti study, right?
You guys talked about Haiti, that cholera did not exist in Haiti till after the earthquake, and we now know from Canadian work that it was relief workers who introduced cholera, right? But we found out way after the fact. How do we build a system where we find out the next day, right? And close that camp down. So we've been lucky to have some funding to work on this, to think on the big scale around AI and big data approaches. So essentially what I want, we'll go over here, is to work with groups like GenEpiO and IRIDA towards big surveillance. And this is IRIDA's goal, that we could have IRIDA instances and sequencing become routine. As I said, the price is plummeting. So my job is to get governments and the rest to put bags and nanopores in northern clinics, right? Your community hospital has a regular sequencing surveillance system for infections. Hopefully we can get into metagenomic surveillance. So think of every clinic, every hospital, constantly generating that data, right? Companies like Cisco, my sponsor, would be very interested in that. What would you do if you were collecting data in that volume and that velocity and that diversity, just within a country or even internationally? So first of all, you could have rapid response strategies. You all know we had a bad TB outbreak in the north, right? There is a delay. It's hard to grow on a plate. It has to be shipped to an Ottawa lab. It takes a long time to come back and say, hey, it's TB and it's a new resistant strain. If you have a child cough in a cup and there's a bag of sequencers in your clinic when the nurse practitioner visits every two weeks, you can have them cough. It goes on the end. An algorithm could say this is TB and it's got these mutations. You would have rapid response and inform it. So early detection, emergent threat. So this is a no-no, right? It's a TB strain. It's got a mutation in it. Hey, we should be scared.
But also, as we build our discovery models and we start to put in structure and routine sequencing, you want to learn, say, hey, a gene just showed up in a patient in Winnipeg that looks like a brand new beta-lactamase. Go take a look. Maybe it's nothing, but maybe you really look closely before it kills people, right? So we can get to that world. Informed public policy. Drug-resistant gonorrhea is a great case, right? We find out secondarily, as we start to get case numbers up and public health starts saying, hey, we've got multiple strains of this. Often it's in the community long before we get enough public health data to inform the community. So with sequencing, you tell the community, look, 80% of high schools in Ontario have drug-resistant gonorrhea in them. That leads to a public information campaign. Better data, right? Clinical practice. So this is a tricky one. Clinicians don't need us. They really don't. They know what to do, right? Clinicians know when they have a strain that doesn't respond to a drug. They know what kind of resistance they're used to seeing. They know what drugs they could use. They generally know what to expect in the community. Where they struggle is where they have four drugs that they could treat their patient with. They have an opinion on which one's best for the patient, but they often don't know the stewardship side. Which drug should I use that causes the least long-term harm for my community? Because right now we have no genetic data on that. But if we're sequencing for surveillance, we can inform stewardship: please use drug number one, because there's very few resistance genes in our community for that drug, right? Or we know from sequencing efforts that it's the least likely to generate new resistance, right? It tends to be a fairly irresistible drug. We need better data. So this is what we hear from the Royal College of Surgeons all the time.
What's the genetic landscape of resistance, so we can make better stewardship decisions? The long game is actually this one. So drug development is almost dead, right? You have a few innovators in academia, a few little companies doing it. I don't know if antibiotics are going to be what succeeds. There's a lot of interest in looking at phage, right? There's alternate strategies, or immune-boosting strategies. But one of the most interesting is resistance killers. So NDM-1 came out of India. It's a global threat now, it takes out an entire drug class, right? And a big worry. So Gerry Wright at McMaster a few years ago found a compound that turns off NDM-1, so you can go back to using your carbapenems. You can use it as a mixture, as a cocktail. So it's a resistance killer. And it's a compound we know a lot about, actually. It's been through preliminary trials. It's in trials now with the NIH in the States. So through Gerry screening for compounds that shut down resistance, Gerry might in the end save millions of lives by generating this. There is a lot of hope for resistance killers, compounds that shut down the resistance mechanism, so you can go back to our usual drugs. The trouble is economic. So if you're a small company and you are screening for resistance killers and you find a candidate, you go to venture capitalists and say, I need some money to get this through the Valley of Death to bring the drug to market. It's just a Valley of Death in drug marketing. And they all say, oh, great. So it targets this gene, this resistance gene. How common is that gene? Is it pediatric patients? Is it the elderly? And you go to PHAC right now, and they say, we don't know. We do phenotypic surveillance, not genetic surveillance. When we sequence everywhere, we'll be able to say, yeah, that's a pediatric resistance gene. It's 80% of cases in Canada. Or it's the general population. Or it's the elderly. Or it's CF patients. Or it's cancer patients.
That will give you the data where investors will come to this company and invest to bring the resistance killer to market. Or it's a rare case, and so it'll become a non-profit partnership with government, kind of like how all the Ebola therapeutics had their funding stream. So essentially, our long-term plan in getting all this sequencing going is actually to encourage the resistance killers. So we actually combat drug resistance genes directly. We take them out of play. We can go back to our old drugs. And specifically to work on the economics of how you're going to re-boost that by providing better data. So you have real numbers you can take to the VCs, who will put money down and bring that drug to market. OK, a few practical things before we have some fun. Recent releases in CARD. So curation. It never ends, curation. I did 600 curation. I didn't last night listing. You'll see that we've done a new classification around the ontology. So everything is sorted by gene, by gene family, by mechanism, by drug class. So we're making some better ways to organize your results. You'll see now we have in vitro versus in vivo mutations. In vivo are mutations that we're seeing in the clinic or on the farm, something like that. In vitro are ones where people did selection experiments at the bench. So that mutation has never been seen to cause clinical resistance. It could, but we don't have that evidence. And selection experiments tend to be rougher. The ontology, we didn't talk about it today, but we added an entirely new branch for harmonizing phenotypic testing methods, disk diffusion and all the rest. This is with GenEpiO. So if you actually work in a public health surveillance lab and you're doing a lot of phenotypic testing, and you want to harmonize your data for sharing with other agencies, talk with Will and I. We just completely released that branch. The RGI software, so we added that low coverage part, for when you've got fragmentary, low coverage data.
Merged metagenomic reads. So when you sequence, you get a forward and a reverse read, and metagenomic fragments are short, so they tend to overlap. You get a little fragment, a tiny little contig. You get 80 million out of 160 million. We can support that with RGI. Small plasmids, which generally don't do well. A little bit of performance management: if you don't like BLAST because it's too slow, you can switch to DIAMOND. We added ribosomal RNA mutations and efflux over-expression, so it's the only tool out there. No other tool did fluoroquinolone resistance and the like before we added that. Better documentation, some of that. And then the big one was Resistomes & Variants. 76 pathogens, and it's gonna grow. So what are we working on now? You're gonna get a glimpse of this. The great one is for those of you who are in agriculture and agri-food. CARD has really had its funding base on the clinical side. Now we're funded heavily on the agri-food and agriculture side, so we're gonna round it out and we're gonna look at the One Health perspective when it comes to AMR. This pathogen-of-origin idea, allelic epidemiology, we're gonna work with k-mers, we're gonna do it today, on ways to say, it's an AMR gene, but I think it came from Salmonella, so I'll talk about that. Resistomes & Variants. Does anyone use AMR++ for metagenomics? So you guys have had a look here, there's one. So Galaxy frameworks are very popular. A colleague in Florida wrote AMR++. It's a metagenomics pipeline for AMR prediction that works in Galaxy. It's quite nice, but the data's stale. So we're gonna collaborate with them this year to make sure AMR++, which is a decent first-pass metagenomics Galaxy tool, will at least be up to date with CARD data, and be a little bit smarter with it. New visualization tools, you can do some heat maps today. In particular, what CARD has not done is really look at the relationship with the mobilome and with virulence.
So we're gonna collaborate with people in this room on a mobilome ontology, a virulence ontology and a virulence gene identifier. With CFIA, actually, that's gonna be our major collaboration. Prediction of phenotype from genotype, we'll give that one a try. So we just got funded a little more recently, not as recently as Will did yesterday. One of the things we're missing is, if we have a gene or mutation, it just tells you what drug or drug class it works against, but it doesn't tell you the known MIC. Like really, how much does it change resistance? So we're gonna curate that level of data in, so you see the severity of it. A lot more mobilome and virulence with Fiona's group. Allelic epidemiology we talked about, and then a lot of the stuff we talked about, metagenomics, AMRtime, AMR++. But really it's about context. Our big drive this year is, yeah, you found it, but what does it mean? That is really essentially what we're doing as engineers. So the challenges I want you to be aware of if you're gonna do AMR surveillance, essentially, what's tough? So biocuration of reference data is difficult to fund. NIH did it well when I was there, and CIHR, not so much. But without gold standard reference data, the whole thing breaks down. It's reference based. Now Carr can talk to you about machine learning, other methods where there's no reference data, but really we're in the reference world still. The challenge is that unlike sequencing Drosophila and annotating it for the rest of life, it's not static. The resistome evolves on a daily basis. So unlike traditional biocuration, here the target's always moving. So this is never gonna go away. We need a constant funding stream for biocuration. Sequencing technology is getting cheaper and smaller. We'll be doing increasing amounts of it. So storage, right? Storage and sharing is really gonna become a challenge, how to do this. And that's what IRIDA is all about.
That's why we partner with IRIDA. Metagenomics is difficult and computationally costly. And right now, anything you use that's out there, even what we're gonna demo today, is really only using the easiest subset of AMR mechanisms, the protein homolog models. So when you do metagenomics now with any tool, it's not screening for mutation-based resistance, which is at least a third of the resistance out there. So you're grossly underestimating resistance. It's just saying, hey, I found a beta-lactamase. It can't say, hey, I found a mutated gyrase. So we need a lot of progress. We need a lot more translational tools to make this easy to do. What we're not doing well right now: tracking MICs, how much resistance does it cause? Very poor data on plasmids and other aspects of the mobilome. We haven't done a good job on that in curation. Transmission dynamics, how things move around One Health. We just got a big grant funded, but there's very little out there now. And metadata, right? Where is it seen? What has it done? Who does it infect? And there we go. And so we have lots of collaborators, both Canadian and otherwise, listed up there. So a lot of you, and if you wanna be a beta tester, let me know, we'll put you on the beta test list. But Rob Beiko, who wasn't here this year, and Fiona and Will are our big collaborators, NML, Gary's group, USDA and NCBI, our big American colleagues, bioMérieux in France, and then many, many academic teams. And there we go. Perfect. Any questions before we pack some data? Cool. Okay, question. So we're seeing resistance emerging in Candida species. Is that included? So we're prokaryotic so far. We haven't tackled yeast. We're talking about that because we have a group partnering with us working on sepsis, and of course it's not always bacteria, sepsis. The challenge we have is our algorithms were designed from a bacterial perspective. So the concept of an intron is foreign to our data, so far.
There's actually a nice little database I saw of resistance in yeast that came out. I have been tracking it down. We haven't made a commitment to that yet. Yeah. I think that's one of our lines in the sand for funding. Until I get money to work on the fungal side of things, we probably won't do much. But there's not a lot out there. It's tough. Yeah. Just a question about, it's hard to develop companies or how do you envision that this field wouldn't work? Yeah, so CARD, up until recently, half of CARD's funding was industrial. We license our data or software to at least 30 different companies. About a quarter of them are in drug discovery pharma. A lot of them are machine learning or food safety. So we basically talk with industry a lot and say, okay, what do you need? What kind of data or tools do you need to speed up your pipeline? And then we write, and we basically have three players. We have academics that we produce for, we have public health people we produce for, and we have industry. We talk to all three, and that's why we did Resistomes. So it was a needed bit of data. Do they also share data, because you're not in a pharma company? I guess there's some sort of secrecy. Yeah. bioMérieux. Actually they can claim a good, maybe 5% of CARD that bioMérieux helped generate. We have a couple of others where the data will go public at some point, but not yet. It's case by case. So the rule for canonical CARD right now is that it's gotta be published in a peer-reviewed journal, the sequence in GenBank, and a clear, well done experiment to illustrate MIC. So a lot of industrial data has not passed one of those marks, they haven't published a paper. So we're building the class to say, okay, here's an unpublished data set, and here's what it says. We actually have a few of those private now. I have to do it. But we keep those lines really clear. So if industry wants it to be in CARD, they've got to publish it. Get through peer review.
Any more? Okay, possibly break. I think they hit that button. There we go, coffee break.