Good morning, everybody. I want to talk about emerging pathogen detection and the application of metagenomics to detecting novel and emerging pathogens. This is a newer technology, and it allows us to identify emerging pathogens without having access to a pre-existing diagnostic. The objectives of the module are to understand, very briefly, how metagenomics data are generated; how we can assign metagenomic reads to specific taxa; and how we can apply that technology to pathogen discovery, to identify novel and emerging variants. When we perform infectious disease surveillance, there are essentially two arms: the laboratory arm and the epidemiological arm. Epidemiological surveillance is also referred to as syndromic surveillance. This is where you monitor for patients who are exhibiting characteristic symptoms of illness. But there are other sources too: general hospital activity, reports in newspapers of people getting sick, doctors starting to prescribe a much larger than normal amount of antibiotics or antivirals. There are lots of non-laboratory sources of information that can indicate there may be some type of new disease outbreak somewhere in the world, and we have global surveillance systems that track this information all the time. Essentially, they look for anomalies, little blips that say: this is not what we normally see in this country, or at this time of year. That can give us a clue. And then, of course, there is laboratory surveillance, which is the arm we are much more familiar with.
Essentially, what we are doing there is using molecular diagnostic techniques to try to identify a pathogen; this is standard disease surveillance. Traditional laboratory surveillance takes a number of approaches. Microscopy is one: looking at a cultured isolate on a plate as it grows. There are biochemical tests that ask whether the organism has a certain set of biological traits or phenotypes, and other phenotypic tests for things like antimicrobial resistance. As we get into molecular techniques, they involve things like serotyping and nucleic acid amplification; there are a few specific techniques, and I will very briefly review them. The advantage of these molecular methods over traditional methods is their specificity. They work by recognizing some component of the organism, and the tests designed to detect that component are usually very, very good at detecting it. Antibodies, for example, can distinguish different subtypes of, say, influenza, but they may not be able to detect all influenza. So if you have an antibody that is good for H1N1 influenza, but the sample contains H5N1, your H1N1 antibody will not detect it. This is the issue of specificity versus sensitivity. Think of specificity as a lock-and-key mechanism: if you have a lock and there is only one key that fits it, but it fits perfectly, the key is highly specific. If you have a skeleton key that can open a whole array of locks, that key is very sensitive, but it is not specific to any one lock. That is the analogy. Are there any questions about specificity versus sensitivity?
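To make the trade-off concrete, here is a minimal sketch of how sensitivity and specificity are computed from test outcomes. The numbers are invented for illustration: an H1N1-specific antibody that catches most H1N1 cases but misses every H5N1 case shown to it, exactly the failure mode described above.

```python
def sensitivity(tp, fn):
    # Sensitivity (recall): fraction of true infections the test detects.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Specificity: fraction of uninfected samples the test correctly clears.
    return tn / (tn + fp)

# Hypothetical evaluation of an H1N1-specific antibody test:
# it detects 95 of 100 H1N1 cases, but all 50 H5N1 cases count as
# missed infections because the antibody is too specific to see them.
tp, fn = 95, 5 + 50
tn, fp = 200, 2

print(round(sensitivity(tp, fn), 3))  # 0.633
print(round(specificity(tn, fp), 3))  # 0.99
```

The test is highly specific (it almost never fires on uninfected samples) yet has poor sensitivity for influenza as a whole, which is the skeleton-key-versus-single-key point in reverse.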
Techniques that look for general anomalies in syndromic data are very sensitive, but they are prone to false positives: the signal suggests there may be an outbreak when some other factor is producing that unusual, anomalous blip in the data. On the specificity side, as I mentioned, molecular techniques are typically very good at diagnosing one specific pathogen, but they can miss a new pathogen because the diagnostic is not generalizable enough to detect a range of pathogens. Serotyping is a historically popular method for identifying bacteria and viruses; essentially it uses antibody techniques designed to detect specific cell-surface components, known as antigens, of those pathogens. A second class of molecular methods are the restriction-digestion-based techniques. These use restriction enzymes that recognize a very specific recognition sequence in the nucleic acid and cut it at that sequence. That sequence may occur only 15 or 20 times across an entire genome, so applying the enzyme chops the genome into, say, 20 smaller pieces. You can then sort those fragments by size on a gel, which gives you a specific banding pattern. As the organism evolves, some of the band sizes change: some stay the same, some shift, but you can use similarity techniques to say that this pattern looks a lot like something you have digested in the past. PFGE, pulsed-field gel electrophoresis, has been the standard workhorse for a lot of bacterial disease surveillance, especially for foodborne diseases, although much of that is now being replaced by whole-genome sequencing.
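The digestion idea can be sketched in silico: find every occurrence of a recognition sequence, cut there, and report the fragment sizes, which correspond to the band pattern on the gel. The genome string and the EcoRI site GAATTC are toy inputs; real digests also depend on the enzyme's exact cut position within the site, which this sketch ignores.

```python
def digest(genome, site):
    """Cut a sequence at every occurrence of a restriction recognition
    site and return fragment lengths, largest first (like sorting bands
    by size on a gel). For simplicity, the cut is placed at the start
    of the site rather than at the enzyme's true cleavage offset."""
    fragments = []
    start = 0
    pos = genome.find(site)
    while pos != -1:
        if pos > start:
            fragments.append(pos - start)
        start = pos
        pos = genome.find(site, pos + 1)
    fragments.append(len(genome) - start)
    return sorted(fragments, reverse=True)

# Toy 28 bp "genome" with two EcoRI sites (GAATTC):
genome = "AAAA" + "GAATTC" + "CCCCCCCCCC" + "GAATTC" + "GG"
print(digest(genome, "GAATTC"))  # [16, 8, 4]
```

Two cut sites give three fragments; a point mutation that destroyed one site would merge two bands, which is exactly the kind of pattern shift used to track an evolving organism.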
AFLP and ribotyping are similar approaches that essentially extend the use of PFGE. In ribotyping, you use ribosomal probes to fish out a certain subset of the fragments from the total set of, say, 20, leaving maybe three or four; so it reduces the complexity. AFLP is an amplification-based approach: again, you amplify some subset of the fragments and look only at that subset. It reduces and simplifies, but it is less specific than PFGE. And even though PFGE is sometimes described as a whole-genome technique, it really only interrogates those small sections, so it is not whole-genome sequencing in the way we think about it today. The amplification-based techniques are typically ones where you have a set of primers designed to amplify a specific region of a specific genome. You can then use quantitative PCR, where you attach a fluorescent tag and, as the PCR reaction proceeds, you see a rise in fluorescent signal over a number of cycles. It is a very sensitive technique and can detect very low amounts of pathogen. So those are the existing molecular diagnostics that we use routinely for the surveillance of known infectious diseases. Emerging infectious diseases are more of a problem. They are defined, essentially, as diseases whose incidence has increased recently, over the last 20 years or so: for example SARS, pandemic H1N1, MERS-CoV, other variants of influenza, and most recently Zika virus. They occur regularly and will continue to occur regularly.
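The qPCR sensitivity claim can be illustrated with an idealized amplification model: product doubles each cycle, and the cycle at which fluorescence crosses a threshold (the Ct) falls the higher the starting pathogen load. The threshold and efficiency values here are arbitrary toy numbers, not instrument settings.

```python
def cycle_threshold(initial_copies, threshold=1e7, efficiency=2.0, max_cycles=40):
    """Return the cycle at which an idealized qPCR reaction (copies
    multiply by `efficiency` each cycle) first crosses the fluorescence
    threshold; None if it never does within max_cycles. Lower starting
    copy numbers give later (higher) Ct values, which is how qPCR
    quantifies very low pathogen loads."""
    copies = initial_copies
    for cycle in range(1, max_cycles + 1):
        copies *= efficiency
        if copies >= threshold:
            return cycle
    return None

print(cycle_threshold(1000))  # 14: high load crosses the threshold early
print(cycle_threshold(10))    # 20: low load crosses later
```

Even ten starting copies still cross the threshold well within a 40-cycle run, which is the sense in which the technique detects very low amounts of pathogen.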
The problem with these is that your existing toolkit of molecular diagnostics may be too specific to detect the pathogen, because it either has never been seen before, has recently emerged from, say, a zoonotic transmission from animals into humans, or has evolved so much diversity that the existing diagnostic can no longer detect it. So if we want to diagnose emerging diseases, a couple of techniques are available to us. One is to enhance the sensitivity of existing diagnostics. For example, and this is something I am involved in as part of my research, you can generate what are called heterosubtypic antibodies. Instead of being directed at a specific antigen, like an H1N1 antigen, these antibodies are directed at an exposed but conserved region of a surface molecule. In influenza, the H stands for hemagglutinin and the N stands for neuraminidase; both of those molecules are exposed on the surface, and both have highly variable antigenic regions, but they also have some conserved, partially exposed regions. If you are clever, you can design techniques to generate antibodies that recognize just that conserved region. We have demonstrated that you can detect all of the different hemagglutinin subtypes, 15, or I think it is 16 now, and the nine different neuraminidases, with monoclonal antibodies generated this way. They will not tell you whether you have an H1N1 or an H5N1 or anything like that, but they will tell you that you have an influenza virus. These are good to have around in case a new, emergent, reassortant influenza virus appears with a novel epitope that cannot be recognized by any of the previous diagnostics.
This antibody technique will at least tell you, yes, you do have an influenza. You can also use degenerate bases in your amplification techniques. Certain bases, inosine for example, can pair with any of the other bases, A, G, T, or C. If you incorporate inosines into your primer, it can accommodate some variety in the template, or target, sequence it is trying to identify and still recognize it. But these approaches are not perfect. The current best technique for identifying a novel emerging pathogen is a whole-genome sequencing approach, because you do not need any pre-existing diagnostic, and you do not need to enhance the sensitivity of an existing one. You just sequence the genome, look at it, and attempt to diagnose the identity of the novel pathogen using the genomic sequence information alone. The approach we use is shotgun metagenomics. There are two types of metagenomics in common use. One is the single-locus, amplicon-based approach, such as the 16S ribosomal RNA technique, where you specifically amplify out one locus and that becomes a signature for the organism. That is great, but not all pathogens have a 16S ribosomal RNA; no virus does, for example, so it is not a universal technique for identifying novel pathogens. Shotgun metagenomics, on the other hand, is a technique that lets you sample the genomes of an entire community of organisms inside some type of biological specimen. The concept is similar to shotgun sequencing applied to a single cultured isolate, but now you apply it to an entire collection of nucleic acid.
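The degenerate-primer idea mentioned above can be made concrete by expanding a primer containing degenerate positions into every concrete sequence it can anneal to. The primer sequence is made up for illustration, and inosine is treated simply as matching any base, which glosses over its real pairing preferences.

```python
from itertools import product

# IUPAC degeneracy codes (a subset), plus 'I' for inosine, which
# base-pairs promiscuously, so we treat it here as matching any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT", "I": "ACGT"}

def expand_primer(primer):
    """List every concrete target sequence a degenerate primer covers."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in primer))]

# A short hypothetical primer with one inosine and one R (A/G) position:
variants = expand_primer("ACIGR")
print(len(variants))  # 8 concrete sequences (4 for I times 2 for R)
```

Each degenerate position multiplies the number of templates the primer tolerates, which is exactly the trade: more sensitivity to sequence variation at the cost of specificity.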
So you have your clinical specimen, which might be a biopsy from a recently deceased person, or cerebrospinal fluid, blood, feces, any type of biological specimen that you suspect may contain the pathogen you have been unable to diagnose using your existing toolkit of a priori molecular diagnostics. You extract the total nucleic acid, which includes the RNA component, because some pathogens have an RNA genome, RNA viruses for example, as well as the DNA. At the end of that procedure you have a mixture of, possibly, your pathogen; your commensals, the organisms that inhabit our bodies without causing disease, that is, our microbiome; and your host DNA as well. If the specimen is from a human, there will be human DNA in there. So you have a large, complicated mixture of nucleic acid extracted directly from your clinical specimen, without using any enrichment techniques. You can use enrichment techniques, but I am not going to talk about them right now; they end up biasing the sample, and they are not important for what we are trying to learn in this module. Once you have that jumble of host, commensal, and possibly pathogen nucleic acid, you convert any RNA to cDNA so that you have total DNA, then fragment it and ligate adapters, just as you would when generating a normal library for a cultured isolate, put it through your sequencer, and generate reads. Normally you generate quite a large number of reads relative to a regular genome project: a regular bacterial genome might be about 5 million base pairs.
If you want, say, 10x coverage of that genome, you need to generate about 50 million base pairs of reads; for really good coverage you might go to 100x. But even those are small numbers relative to a shotgun metagenomic sample, because the pathogen may contribute only a very small number of reads amid a large number of reads that are not pathogen at all. To improve the probability of capturing those pathogen reads, you have to generate really large numbers: typically 25 to 50 million reads of about 500 base pairs (2 x 250 paired-end), which is over ten billion base pairs' worth of reads. This provides an unbiased survey of your nucleic acid content, to the degree that any approach can be called unbiased: bias is of course introduced at all stages of just about any molecular method, especially nucleic acid preparation. But short of using hybrid capture or another enrichment technique, this gives you the most unbiased survey of the nucleic acid content. As I mentioned, it will contain the host plus the microbiome plus the pathogen, and this can become a problem because of the relative abundance of host DNA extracted along with the sample. That depends a lot on the biological specimen. Human fecal material, for example, typically contains under about 5% host DNA; the other 95% is microbiome DNA. In cerebrospinal fluid, by contrast, typically over 99% of the DNA extracted from a CSF sample is host.
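The coverage arithmetic above is simple enough to write down directly; this sketch just turns the back-of-the-envelope calculation into a function, using the same toy numbers (5 Mbp genome, 10x depth, 500 bp per read pair).

```python
def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Back-of-the-envelope read count for a target average depth:
    ceil(genome_size * coverage / read_length)."""
    return -(-genome_size_bp * coverage // read_length_bp)  # ceiling division

# A 5 Mbp bacterial genome at 10x with 2 x 250 bp read pairs (500 bp each):
print(reads_needed(5_000_000, 10, 500))  # 100000 read pairs

# Compare with a 25-million-read metagenomic run at 500 bp per read pair:
print(25_000_000 * 500)  # 12500000000 bp, i.e. 12.5 billion bases
```

A single bacterial isolate at 10x needs only a hundred thousand read pairs, while the metagenomic run is more than two orders of magnitude larger, because almost all of the output is host and commensal sequence rather than pathogen.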
So you are generating a lot of material that you know in advance you will have to throw away in order to interrogate the interesting fraction that may contain your pathogen. Contamination is also a major concern. I mentioned a little about how labs can enrich the sample: there are methods for removing host DNA and methods for specifically capturing a certain pathogen. They work well if you have a high biological load of commensals, so that there is not much host to remove in the first place. The problem with these host-DNA reduction techniques is that if there is a large amount of host, then in removing a large proportion of the host they will also remove a large proportion of the microbial content, and that can severely bias the result; in fact, you may be removing the pathogen DNA as well. Host material is one type of contamination, but contamination can also come from other sources that can lead you astray: your lab reagents, lab coats, the lab workers, the general lab environment can all contaminate a sample. For example, our laboratory was very active between about 2003 and 2010 doing a lot of SARS work, and when we started developing these shotgun metagenomics protocols for identifying emerging pathogens, we were finding SARS in all of our samples. We were quite surprised, and we ended up doing an environmental test to see where it was coming from.
It turned out to be coming from just about everywhere, even though we have clean rooms for preparing our amplicons and our libraries. It was on lab coats, it was in people's beards; there was SARS basically all over the lab. So you have to expect that you are going to have contamination, and there are ways to identify it. By doing environmental sampling, you can generate a background profile of DNA that you can assign as contamination, essentially bespoke to your laboratory; that is one method of removing it. One of the best methods is to run a blank: the blank goes through all of the reagents and steps of your workup, it just does not contain any of the nucleic acid extracted from the biological specimen. That provides a background set of contaminant DNA that you can expect to find and can remove. Interestingly, when we were analyzing a patient suffering from an undiagnosed disease that was causing meningitis, looking at the CSF samples, we consistently found EIAV, equine infectious anemia virus. It is essentially the equine relative of HIV; it infects horses. We wondered: could this person really have acquired a zoonotic EIAV infection? But when we ran our blank, EIAV showed up in the blank too. So the methods used to prepare the reagents we rely on, our proteases and polymerases and so on, must at some point, I assume, involve some type of equine source, and the viral sequence has traveled along with it.
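The blank-subtraction idea can be sketched as a simple filter over per-taxon read counts: anything whose abundance in the sample is not well above its abundance in the blank is flagged as a likely reagent or environmental contaminant. The fold-change threshold and the counts are invented knobs for illustration, not a published standard.

```python
def subtract_blank(sample_counts, blank_counts, fold=10):
    """Keep a taxon only if it is absent from the blank, or its sample
    abundance exceeds its blank abundance by at least `fold` times.
    Taxa failing the test are treated as background contamination."""
    kept = {}
    for taxon, n in sample_counts.items():
        background = blank_counts.get(taxon, 0)
        if background == 0 or n >= fold * background:
            kept[taxon] = n
    return kept

# Toy counts echoing the EIAV story: the virus is in the blank too,
# so it gets filtered out as reagent contamination.
sample = {"EIAV": 40, "Enterovirus": 3000, "Homo sapiens": 900000}
blank = {"EIAV": 35}
print(subtract_blank(sample, blank))
```

Here EIAV drops out because it appears at comparable abundance in the blank, while taxa unique to the sample survive the filter.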
Okay, so once we have our reads and have accounted for the contamination component, we still do not know which organisms the reads are derived from. The first step is to cluster similar reads together. For people familiar with 16S ribosomal RNA metagenomics, this is the same idea as OTU generation: at a certain similarity cutoff, say 98% or 99%, we just assume the reads are all the same. We cluster the reads into groups of highly similar sequences, take a representative from each group, analyze that representative, and when we get a taxonomic assignment for it, we assign it to everything in the cluster. Once we have those taxa identified, we can consult with our clinical microbiologists, who know a lot more than we bioinformaticians do about the symptoms these diseases manifest, and they can help us identify a possible etiological agent for the disease we are investigating. The clustering step is optional, but it is recommended, and there are programs around that can assist with it; I provide a list of them here. Once we have either individual read phylotyping, which is what we are going to do, or the clusters, we can go to the taxonomic mapping step. This is the process of taking the reads and comparing them to a genomic database in which the organism harboring each sequence is known and assigned to that sequence. By making that comparison, we can infer, as long as the match has high enough specificity, that our sequence came from that organism as well.
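The OTU-style clustering step described above can be sketched as a greedy algorithm: each read joins the first cluster whose representative it matches at or above the similarity cutoff, otherwise it founds a new cluster. The position-wise identity function is a crude stand-in for a real alignment-based identity, and the reads are toy sequences.

```python
def identity(a, b):
    """Fraction of positions at which two reads agree (a crude
    stand-in for alignment-based percent identity)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(reads, cutoff=0.98):
    """Greedy OTU-style clustering at a similarity cutoff: only each
    cluster's representative then needs taxonomic assignment, and the
    result is propagated to every member."""
    clusters = []  # list of (representative, members)
    for read in reads:
        for rep, members in clusters:
            if identity(read, rep) >= cutoff:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return clusters

reads = ["ACGTACGTAC" * 5,
         "ACGTACGTAC" * 4 + "ACGTACGTAT",  # one mismatch: 98% identical
         "T" * 50]                          # unrelated read
print(len(greedy_cluster(reads)))  # 2
```

Three reads collapse into two clusters, so only two representatives go on to the expensive taxonomic mapping step.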
When next-generation sequencing first came out and we were interested in developing systems for emerging pathogen detection using shotgun metagenomics, the most obvious method available to us at the time was BLAST alignment: you simply take your read, take your database of reference genomes with their taxonomic assignments, map the read against that database, and it returns the sequences that contain your read sequence. But BLAST was designed in an era prior to whole-genome sequencing. Even though it is considered a fast alignment technique, it is not fast enough if you need to align 25 or 50 million reads against a large database of microbial genome sequences, especially under the circumstances of an emerging pathogen: if an individual is suffering from an undiagnosed novel pathogen and you get a sample from the clinic, they want an answer right away. If you use BLAST, it might take five or six days to get through all the reads. That is not good enough; you want a result the same day if you can, or in an even shorter period of time. So BLAST is quite slow. Its big advantage is adjustable sensitivity: of all the alignment algorithms used for this type of taxonomic mapping, BLAST is the most sensitive, meaning it can tolerate an adjustable, and quite large, amount of variation between the read you have sequenced and the homologous genomic sequence it is derived from. That is nice, because consider the following case.
If the pathogen is truly novel, never seen before, it may still belong to a known class, family, or order higher up in the taxonomic ranks, even though that exact virus species is not represented in the database. If you use a very specific alignment technique, it is going to say: I did not see it. But if you use something more sensitive, which allows for more diversity between your newly sequenced genome and the reference genomes, it can say: well, I found something that looks a lot like it. So having that sensitivity is nice, and BLAST is not something you necessarily want to throw away. But faster methods have been developed, with decent sensitivity; essentially they have been designed for this purpose, and they can provide enough effective sensitivity to map your read to, say, a family or a class if you cannot match it exactly. When you use the BLAST approach, you may get a hit to one organism, and if the disease symptoms associated with that organism match the ones the patient, or the population, is presenting, you have a good candidate. But you may actually get hits to multiple candidate organisms, because they are similar to one another. Some may be known and some more novel; depending on your sensitivity setting, you can get hits to a lot of different organisms, each with a bit score, effectively an alignment score, that tells you how close the sequences are to each other. So you may want to be conservative in how you assign that read to a particular taxon.
If the read has been derived from a region that is common across a whole set of organisms, there is simply not enough specificity inherent in that read for you to make an assignment down to the species level, or maybe even the genus level. In that case, the best you can do is assign it to the most recent common ancestor of all of the organisms returned by the BLAST analysis. This is a similar concept in other approaches as well. The pipelines and software developed to do this, what we call phylotyping, are smart enough to know that if a read maps to a set of organisms with high enough similarity, they all become likely candidates and you cannot choose one over another. So you look up the taxonomic tree and find the order, or class, or family, or whatever rank, that contains all of those organisms as its descendants. Here is an example: a set of organisms identified by a BLAST analysis, each with a BLAST score. The top three here have very high scores; two are identical and one is almost identical, and then there are other hits with lower scores. You have to supply a cutoff, and if you do not choose that cutoff intelligently, the algorithm is simply going to report the most recent common ancestor of everything above it.
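The most-recent-common-ancestor step can be sketched directly: represent each hit as a root-to-taxon lineage, and the MRCA of the hits above the cutoff is the deepest node on their longest shared prefix. The lineages and names below are toy data echoing the lecture's example.

```python
def lca(lineages):
    """Lowest common ancestor of root-to-taxon lineages, computed as
    the last node of their longest shared prefix."""
    if not lineages:
        return None
    prefix = lineages[0]
    for lin in lineages[1:]:
        shared = 0
        while (shared < len(prefix) and shared < len(lin)
               and prefix[shared] == lin[shared]):
            shared += 1
        prefix = prefix[:shared]
    return prefix[-1] if prefix else None

# Hypothetical hits above the score cutoff, as root-to-species lineages:
hits = [
    ["root", "Bacteria", "Proteobacteria", "Campylobacterales", "C. jejuni"],
    ["root", "Bacteria", "Proteobacteria", "Campylobacterales", "C. coli"],
]
print(lca(hits))  # Campylobacterales

# Admit a distant hit (a lower cutoff) and the assignment climbs the tree:
hits.append(["root", "Bacteria", "Firmicutes", "Clostridia"])
print(lca(hits))  # Bacteria
```

Lowering the cutoff pulls in more distant hits, and the assignment slides up from order to domain, which is exactly the behavior the example on the slide illustrates.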
So if you chose a cutoff of 150, it will take these top two, look up the taxonomic tree, and say: this is Campylobacterales, so it has identified the order. If you include the hit at 140, it will say: these are Proteobacteria, so you are up at the phylum level, which is already pretty high. If you include the Clostridia or the Methanococcus hits, it is basically just going to tell you the read is bacterial, and sometimes it will simply return the root of the taxonomic tree. You will get back something that says "root", which is not very specific and not really helping you out: if you have equivalently high BLAST scores coming from organisms distributed across the entire taxonomic tree, the method will just look all the way up to the top of the tree. Interestingly, when we were developing this BLAST-based approach, we found that reads from certain organisms were mapping to the root all the time, and when we investigated which individual taxa were giving rise to that root designation, they were all Mycoplasma. That confused us. It turned out that at the time, we were doing taxonomic placement with the taxonomic tree downloaded from NCBI, and there had recently been the synthesis and publication of an artificial genome for Mycoplasma genitalium. They did not really know where to place it in the taxonomic tree, because it was completely artificial, so they created a "synthetic" lineage right at the top: everything synthetic would be a descendant of the synthetic node, and then you would have the non-synthetic
node. So because that synthetic genome exists inside your reference database, any time a read hits Mycoplasma, it pulls both from the synthetic branch and from the natural one, and the method just returns their common ancestor, up at the root. There are gotchas like that all over the place. One way around this is to use a weighted most recent common ancestor approach. When you get your hits and start to place them taxonomically, instead of blindly looking up the most recent common ancestor of all of them, you do a kind of cladistic analysis: when I go up one node, what percentage of the organisms returned from the BLAST analysis have I captured? Maybe 10 percent. Go up another node, maybe 30 percent; another node, maybe 75 percent. At some point you can say: if the majority of the hits fall into this one taxonomic clade, I am going to use that information to keep from traveling all the way up to the root of the tree every time. That is the weighted MRCA approach, and it turns out to be very effective at getting rid of outliers that just coincidentally have enough similarity to return a BLAST hit high enough to confound your MRCA assignments. Once you have all of the reads mapped to one or more taxa, you can represent that as a taxonomic abundance, which is what I have here: a taxonomic tree in which every read has been assigned to some node, either a common ancestor or a specific organism, and the number of reads assigned to any particular taxon is represented by the size of the
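The weighted-MRCA idea can be sketched as a descent from the root: keep moving down as long as a single child clade still carries at least a chosen fraction of all the hits, so one or two stray hits cannot drag the assignment up to the root. The 75% threshold and the lineages are toy choices for illustration, not parameters from any particular tool.

```python
from collections import Counter

def weighted_mrca(hit_lineages, fraction=0.75):
    """Weighted MRCA: descend from the root while some child clade
    holds at least `fraction` of all hits; return the deepest such
    node. A plain (unweighted) MRCA corresponds to fraction=1.0."""
    total = len(hit_lineages)
    depth, current = 0, "root"
    while True:
        children = Counter(lin[depth + 1] for lin in hit_lineages
                           if len(lin) > depth + 1 and lin[depth] == current)
        if not children:
            return current
        child, count = children.most_common(1)[0]
        if count / total < fraction:
            return current
        hit_lineages = [lin for lin in hit_lineages
                        if len(lin) > depth + 1 and lin[depth + 1] == child]
        depth, current = depth + 1, child

lineages = ([["root", "Bacteria", "Proteobacteria", "Campylobacterales"]] * 8
            + [["root", "Bacteria", "Firmicutes", "Clostridia"]]   # outlier
            + [["root", "Archaea", "Methanococcus"]])              # outlier
print(weighted_mrca(lineages))              # Campylobacterales
print(weighted_mrca(lineages, fraction=1.0))  # root (plain MRCA behavior)
```

With 8 of 10 hits in one clade, the weighted version reaches the order level, while a plain MRCA over the same hits would return the root because of the two outliers.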
node. You can see here that a lot of reads got a very general assignment, and some got more specific assignments but at smaller abundances. Looking at it like this gives you a better sense of which candidate taxa to examine first: the idea is that the most hits to the most specific taxon are likely to point at the true organism. Any questions about that? Programs like MEGAN, a very early and I think still popular program for this type of shotgun metagenomics approach, use BLAST: you perform the BLAST on the side, MEGAN takes your BLAST results, does all the taxonomic mapping to the most recent common ancestor, and then displays it for you as a tree like the one shown here, where the size of each node represents the number of reads that mapped to it. So that was a useful early method for identifying possible novel or emerging pathogens from shotgun metagenomics data, but as I mentioned, it is very slow unless you have a lot of computational muscle. Back in the days when we had about 500 CPU cores, it would take us about one or two days to get through 25 million reads, or about three or four days to get through 50 million, depending on other people's use of the cluster. That is just not great. But there has been a lot of work on developing faster algorithms; there is one I am going to show you, and in the lab we will look at a second one, that are orders of magnitude faster and still give very, very good results. Kraken, for example, can classify
Kraken, for example, can classify approximately one million reads per minute on a single CPU, on your laptop. That's a pretty big difference from 25 million reads on 500 CPUs taking a couple of days. It's very popular, and we're going to take a look at it. It was developed in Steven Salzberg's lab; I'd consider Salzberg to be probably the most valuable bioinformatician in the world, because his lab developed the Glimmer microbial gene predictor; Kraken, of course; earlier phylotypers like PhymmBL, which were useful in their time until Kraken came along; and the Tuxedo suite, which includes Bowtie, Cufflinks, CummeRbund, and a whole bunch of other tools that are used all over the world today. These folks really have gas in the tank; they do good stuff.

Anyway, this is their approach to taxonomic mapping. They decompose the reads into a set of k-mers, and I believe you've already been introduced to the concept of k-mers, but to refresh your memory: you take a fixed subsequence length, say 10 or 12, start at one end of the read, cut that subsequence out, then move along by a certain step size, either one nucleotide or maybe four, and cut out another subsequence, until you have traversed the entire read and generated a set of k-mers. So these are all the k-mers, with a step size of one, that have been generated from this first read; the second read here; the third read here. There are methods that are extremely fast at generating k-mers. Then you map the k-mers to a taxonomic tree, where you can think of each node as containing the k-mers generated from the reference database of microbial genome sequences. Okay, so instead of mapping reads to genomes with BLAST, we're now mapping k-mers: we traverse the taxonomic tree and look for matches among these smaller subsequences.

Here's the idea. You have a k-mer coming from some read, and you ask: as I traverse my tree from the root toward the tips, do I find that k-mer in one of these nodes? If you do, the reference database contains that same exact k-mer, and computers are really good at doing exact matches quickly, so you add a score to that node: I found a k-mer in my experimental data that matches one that also exists in the reference data. The idea is to map each k-mer as specifically as you can, so that you get down to the very tips of the tree, which is where you have the most specificity; the tips are where we give species-level, or possibly even subspecies-level, assignments, the most specific that can be generated from the original k-mer database. If you can't assign a k-mer that specifically, you assign it to the most recent common ancestor, like this. The nice thing is that each node gives you a direction: if the k-mer is specific enough that it exists in only one organism, the search says, I know it's in this part of the subtree, go down this way; then the next node says go down this way, and so on, until eventually you reach the final placement. So it doesn't have to search the entire tree to find out where to place a k-mer; it starts at the root and follows directions until there are no more directions to follow, and then it has the final placement. That's very fast, because you can place a k-mer in one, two, three, four, five steps rather than searching through a tree that may have tens of thousands of nodes.

Okay, so here we are assigning k-mers to these nodes: one assigned here, one assigned here, three assigned here. But assigning k-mers is not the purpose of the experiment; the purpose is to assign the read to a certain taxon, not the k-mers. The read contains the k-mers; the k-mers get placed in the tree; none got placed here, none got placed here, and they mostly got placed over here. Then, looking at the parts of the tree where you got placements, you assign the read to what's called the highest-weighted root-to-leaf path. This path down here had the most k-mers assigned along it, one, two, three, four, five, whereas down this other branch there were only one, two, three. So you say: this is my highest-weighted path, and I'm going to choose the taxonomic assignment associated with its endpoint. Extremely fast, and very clever. So that's a little bit of the behind-the-scenes of how Kraken works.

Beyond that, there is additional work to be done that is not carried out automatically by the software: the context in which you want to do the analysis is important, and it can inform your experimental design.
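The k-mer decomposition and the highest-weighted root-to-leaf path can be sketched in miniature like this. This is a toy illustration of the idea only, not Kraken itself: real Kraken uses a default k of 31, a compact database built from reference genomes, and much faster exact-match lookups, and all of the names and sequences here are invented.

```python
# Toy sketch of Kraken-style read classification:
# decompose a read into k-mers, look each k-mer up in a database mapping
# k-mer -> taxon node, then assign the read to the node ending the
# highest-weighted root-to-leaf path of k-mer hits.

K = 5  # toy k-mer length (real Kraken defaults to k = 31)

# Hypothetical taxonomy as child -> parent; "root" is its own parent.
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Viruses": "root",
    "Salmonella": "Bacteria",
    "E.coli": "Bacteria",
}

def kmers(read, k=K):
    """Slide a window of length k across the read with step size 1."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def path_to_root(taxon):
    """Path from a taxon up to the root, inclusive."""
    path = [taxon]
    while taxon != "root":
        taxon = PARENT[taxon]
        path.append(taxon)
    return path

def classify(read, kmer_db):
    """Assign a read to the taxon ending its highest-weighted
    root-to-leaf path of k-mer hits."""
    # Count k-mer hits per taxon node.
    hits = {}
    for km in kmers(read):
        taxon = kmer_db.get(km)
        if taxon:
            hits[taxon] = hits.get(taxon, 0) + 1
    if not hits:
        return "unclassified"
    # Weight of a node's path = hits summed along its path to the root;
    # pick the highest-weighted, deepest node.
    def weight(taxon):
        return sum(hits.get(t, 0) for t in path_to_root(taxon))
    return max(hits, key=lambda t: (weight(t), len(path_to_root(t))))
```

With a tiny hand-built database, a read whose k-mers hit Salmonella twice, Bacteria once, and Viruses once gets assigned to Salmonella, because the Salmonella-to-root path carries the most hits.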
There are typically two types of experimental design: one is called serial analysis and the other parallel analysis. The serial analysis is an iterative, reductive diagnostic technique, which will make sense as I explain it. You start with all of your reads: your commensals, possibly your pathogen, and the host. You don't want to waste time searching all of your databases for the host, and you know in advance, as prior information, that host reads are going to be in there. So a very fast host-filtering step, plus filtering of any other contaminants you expect to see, is your first step: use a fast filtering technique to get rid of everything assigned to the host. Then you go to a second database, say bacterial data, and map against that; all the reads that came from bacteria get mapped at that stage and do not progress to any further analysis. Everything that is not host and not bacteria gets passed through and assigned, say, to viruses, then to protists, then maybe to fungi. When you want to conserve the computational resources you're expending on this type of analysis, it makes sense to set things up as a serial analysis, so that you're iteratively removing reads rather than reanalyzing them: once a read has been sufficiently analyzed for one kingdom, there's no need to go further. The other design is a parallel analysis pipeline. This is for when you don't want to take the chance that a read that is diagnostic for a pathogen of a certain class, say a virus, just happened to map to a bacterium.
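Before going further, the serial design just described can be sketched as follows. The stage names and the toy substring classifiers are hypothetical; a real pipeline would call a fast tool, such as a host aligner or a Kraken database, at each stage.

```python
# Toy sketch of a serial (iteratively reductive) classification pipeline.
# Each stage removes the reads it can classify, so later, more expensive
# stages only ever see the reads that are still unexplained.

def classify_serial(reads, stages):
    """stages: ordered (name, classifier) pairs; a classifier returns True
    if the read belongs to that stage's database (host, bacteria, ...)."""
    assigned = {name: [] for name, _ in stages}
    remaining = list(reads)
    for name, classifier in stages:
        still_unexplained = []
        for read in remaining:
            (assigned[name] if classifier(read) else still_unexplained).append(read)
        remaining = still_unexplained  # only these go to the next stage
    # Whatever survives every stage is a candidate novel pathogen.
    assigned["unclassified"] = remaining
    return assigned
```

For example, with stages for host, bacteria, and virus in that order, a read that matches none of the three databases falls all the way through and ends up in the unclassified bin, which is exactly the behavior the serial design relies on.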
Bacteria, for example, contain phage, and phage are viruses. Phage aren't so important for diagnosing emerging pathogens in humans, but the concept is important to think about: there may be matching sequence data in a kingdom you're not interested in, and it will pull your read in and assign it, say, to a Salmonella genome that contains the phage. If you're looking for phage, you're not going to find them unless you search through the Salmonella assignments, see where the reads mapped, and notice they map to a phage region. These are real problems. Can anybody give me a more relevant example, where you might inadvertently filter out an important pathogen at one of these earlier stages? Say a viral pathogen that's not a phage, but that is contained in one of these other kingdoms' databases. Can you give me an example of a virus that might be integrated into the host? Right, retroviruses. So if there's a novel retrovirus, or a novel pathogen with sufficient similarity to an existing retrovirus that sits inside the human DNA, it gets removed right at the host-filtering stage. So you have to be thinking all the time; biology will play tricks on you every step of the way. In those cases it might be better, instead of using an iterative reductive technique, to remove the host-filtering step and just run everything through every possible classification, every kingdom of pathogen you can, and see what's in there.

Typically, at the end, you're going to get a bunch of assignments. And if you have a truly novel virus that has no representation in the database at all, it simply doesn't map to anything. Because these iterative filtering techniques say, if it doesn't map to this database, take the unclassified reads and map them to the next database, and so on, you end up at the end with a set of unclassified reads that belong to that pathogen; it just can't be assigned taxonomically. Can anybody think of anything you might do with those unclassified reads that would give you an indication they're real, and not just some artifactual garbage, junk that isn't even real but that your sequencer has managed to spew out, which does happen? Exactly, very good, you're a good class: you can assemble them. Because if they do come from a real organism, and since this is shotgun sequencing with random sampling, there might be some overlap, then even getting two reads to overlap gives you a sense that those reads were derived from a real pathogen and not just some random junk that was generated. And you can do a Hail Mary: take an assembled set of reads and send them out to NCBI for a BLAST analysis, where you have more sensitivity in the result than you may have with something like Kraken. That might place the contig within a certain family or order of pathogen and give you a partial taxonomic assignment, which might be enough to go forward with. These are all hypothesis-generating procedures at this stage; they give you candidates for what the emerging pathogen might be. If you identify something that is unclassified but assembles and maps to, say, a filovirus, and that matches the symptoms as well, then you can go back to the patient and design some specific diagnostic primers, now that we have
kind of bootstrapped our way into some sequence information from that novel pathogen. We can use it to make a specific diagnostic, go back to that isolate, and test to see whether it shows up positive; that gives us better confirmation that this is the actual etiological agent of the disease.

As for the blank: well, there's a contamination step here that I did not include, and I should have. Normally the way we have it is host plus contamination; we consider the host to be contamination. So I should modify this so there is a contamination step as well, and your blank would go there, either prior to or subsequent to the host step, but before you actually start mapping to any of the pathogen databases. That's when you map against your contaminant database, which contains whatever showed up in the blank, or the collection of sequences you have assembled from an environmental analysis of your laboratory.

The question was: BLAST is the most sensitive, the gold standard, but it's very slow, especially with 25 million reads; what if you reassemble or cluster the reads and collapse them down to, say, one million, if there's good overlap? Right, that's the clustering step I talked about before: you can do the OTU generation and then select just one representative per cluster, so you might be able to collapse 25 million reads down substantially. And if you get 25 million reads and 99 percent are host, the first thing you do is just get rid of the host, using a fast host-reduction algorithm, and we're going to look at one of those today. But maybe the sample is fecal, where less than five percent is host; then you're not getting much computational efficiency from removing the host, because you still have 20 million reads, so you would likely want to do clustering as well. And then there's something I don't really discuss here, because I don't want to get overcomplicated, but it's probably worth ending on: there are methods for assembling your metagenomic read data. They're not going to assemble everything back into full genomes, but if they find overlapping reads that are sufficiently untangled from the rest of the data, they can extend them into little mini-contigs. These have variable success rates, but as the effective read gets longer, there's more information to discriminate one organism from another. That makes sense, right? At the whole-genome level you can discriminate between organisms that differ by a single base pair, whereas at the level of a single gene you might not be able to discriminate between organisms within a certain genus or family. So the longer the contig, the more specificity you have at the taxonomic mapping stage to discriminate between two candidate taxa; the trade-off is that assembly involves a lot of additional computational expense.

There is a lot of interest these days in using the MinION platform for emerging pathogen detection, because it can generate extremely long reads. It generates a lot of data, but it doesn't generate a lot of reads in one run; you're not going to get 25 million reads. I don't recall exactly how many reads you get out of a typical MinION run, precisely because they're so long; does anybody have an idea? Right, around 50,000, in that range. And it's expensive, so if you get nothing but host DNA back, you've just spent a thousand dollars for nothing. But one read that's long enough is sufficient to uniquely identify an
organism. So if you did manage to get the pathogen sequenced through one of the pores of the MinION flow cell, you're pretty much done. It's also portable: you can essentially just plug it into your laptop, and they now have kits, which we're testing, that can generate your library in approximately 15 minutes. It's really a plug-and-play thing: put your sample in with the kit, wait 15 minutes, stick it inside your MinION, and it starts spinning out data. These are very attractive technologies for emergency-response situations. Say there's a bioterrorism event: it could be that the agent was an environmental pathogen, so a person could have become infected because of malfeasance, or just because they were hanging out in a field of sheep. Early in a biocrime event, it's important to know the attribution and the likelihood of whether it's real or not. And if you want to respond to an emerging outbreak of a pathogen like Ebola, you want to be able to detect infected people early and quickly, inside of 15 or 20 minutes, and these MinION technologies are very good for that. The problems are that they only generate about 50,000 reads, although they generate the information very quickly, and that you need a large amount of DNA in the sample to generate the reads, so it's not really that good if you only have very small amounts of sample, which is often the case, unfortunately. I don't have anything to show you on that, because the methodology is evolving very rapidly right now, especially on the bioinformatics side, so I hope in a year or two to be able to introduce the MinION real-time, long-read technologies for emerging pathogen detection.

That's the end of my talk. Are there any questions? I think I managed to get through it in an hour, which is not bad.

Phage typing? Pretty much a dead technology, yes. Phage have a very specific tropism: some of them will infect only one subtype, very specifically, and not another. So their presence is an indication of the subtype of the organism that harbors them. But now that we have whole-genome sequencing, it just never really became a popular molecular diagnostic method. I'm not aware of any standard method used for diagnostics, reference services, or surveillance at the National Microbiology Laboratory that uses phage typing. It was an interesting thing going on at the side, but it's no longer really a viable technique. That's my opinion; I could be wrong, I usually am. Well, that's true too: the variation in the phage that a Salmonella carries is also a method for typing that Salmonella. But I think it was mentioned earlier that public health labs, when they sequence their Salmonella, do a full genome sequence of every isolate they get, so the phage information is in there, along with a lot of additional information available for typing. We have the Salmonella in silico typing system that I think was demonstrated for you yesterday, and in silico serotyping is basically the primary method for Salmonella serotyping globally right now; it's not phage typing. Okay, any other questions? Okay, thanks.