So over the next hour I'm going to go over this module on emerging pathogen detection and identification. My name is Aaron Petkau, and I'm a bioinformatician at the National Microbiology Laboratory here in Canada, with the Public Health Agency of Canada. I've been working in bioinformatics at the National Microbiology Lab since I started as a student in 2008, and full time, following that student work term, since around 2010. I've been involved in a variety of projects, all related to microbial genomics and bioinformatics analysis, in particular for infectious disease epidemiology. My background is in computer science, so I've been involved quite a bit in designing and developing a number of different software packages over the years, as well as doing a lot of computational biology and data analysis. I'll be presenting to you not on anything I've developed, but on software that I've run and used previously for pathogen detection and identification, along with some additional background material. The learning objectives of this module are to learn the processes and techniques used to detect existing infectious diseases; to understand metagenomic sequencing, in particular shotgun metagenomic sequencing, and the use of the Kraken and Kraken 2 software for data analysis; and to learn how to use the Kraken and Kraken 2 software for identifying new pathogens from metagenomic sequencing data. As a bit of background material: novel and emerging infectious diseases refers to diseases that have not previously occurred in human hosts, have occurred in humans before but only in small isolated populations, or have occurred throughout history but have only recently been recognized as distinct diseases.
The figure on the right here shows deaths from major epidemics over the past 20 years, of which the most recent and ongoing one is the COVID-19 pandemic, which has the highest bar. But there have been a lot of other infectious diseases, some of them emerging or re-emerging, over the past 20 years, including the original SARS virus back in the early 2000s. And, as was mentioned in previous modules, increasing global connectivity as well as climate change increase the chance for novel pathogen emergence and spread. The figure here shows flight paths in the world over roughly 90 years, from 1930 to 2020, and, as expected, the world is quite a bit more interconnected now than it was then. This means there is an increasing chance of seeing novel or emerging infectious diseases, and existing diagnostic methods may fail to detect an emerging pathogen. So, to give a general overview and lead into existing methods for detecting different pathogens, I'll introduce infectious disease surveillance, the purpose of which is primarily to describe the current burden and epidemiology of disease, to monitor disease trends, or to identify new outbreaks or new pathogens. Different people have divided surveillance up in a number of different ways, but one way is into three categories: disease-specific surveillance, syndromic surveillance, and event-based surveillance. Disease-specific surveillance is focused on gathering information that is very specific to particular diseases, either through lab reports or genomic sequencing or other techniques.
As I introduced on this slide, disease-specific surveillance is looking for specific diseases, while syndromic surveillance is looking at non-specific health indicators. As an example: school or work absenteeism, or drug sales from different pharmacies, which are not necessarily associated with any particular disease but could be an indication of some emerging infectious disease event. Both of these types of surveillance involve structured data; that is to say, there are standard reports or a standard structure to the data. Event-based surveillance, on the other hand, is based on unstructured data, for example news reports, to provide early warnings of disease. This table shows the use of the Global Public Health Intelligence Network, which does event-based surveillance, and the sources of information that were used by this network to give an early indication, for example, of the COVID-19 pandemic based on reports of viral pneumonia of unknown origin, or of other outbreaks based on government health notice reports. So this can give you quite an early indication of some sort of disease event, but you still need further investigation, if you're using this sort of general data, to confirm, identify, and characterize the cause of that disease. There were many publications early on in COVID-19, where there was a large amount of interest in sequencing the virus, or sequencing metagenomic data from people who were infected, to attempt to identify the novel pathogen causing that disease. For infectious disease surveillance, we're going to focus mainly on disease-specific surveillance, and in particular on laboratory-based methods for gathering data related to pathogens. The figure here divides things up based on the source of that data.
As an example, you can do surveillance of clinical samples, which would involve taking the sample, isolating the particular infectious agent, and performing additional diagnostic testing, for example to identify the subtype of that infectious agent, or to identify the particular pathogen, and producing reports which can then feed into some surveillance system and give an indication of outbreaks of disease. You can also perform surveillance using food samples, where genomic material is extracted from food or cultured, and again pathogen identification and typing, using a variety of lab-based or genomic sequencing methods, can be used to generate reports which feed into surveillance systems. The same goes for environmental surveillance, where you gather information from the environment, for example wastewater samples, and again perform a variety of lab-based tests to attempt to identify, characterize, or quantify how much of a pathogen is within that wastewater sample and produce reports for surveillance purposes. However, these all require laboratory-based testing techniques that are specific to a particular pathogen. They may involve different typing methods, or some form of culturing to identify and isolate a particular pathogen, to be used, for example, for sequencing and then characterization that way. So, going into some more detail on these traditional lab-based pathogen detection techniques: they can involve nucleic acid amplification, sequence typing, culturing, or methods in proteomics such as mass spectrometry. Some of these, I know, have been introduced in previous modules. The first one I'll mention is nucleic acid amplification techniques.
This is something people are likely familiar with from the COVID-19 pandemic, since it was used to attempt to identify as many cases as possible of people infected with COVID-19. It involves amplification of particular signatures contained within the genomic sample, related to the genome of the pathogen of interest. For example, real-time quantitative PCR (RT-qPCR) may involve collecting a sample, for example a nasal swab; extracting, in the case of COVID-19, the RNA from that swab; converting the RNA to complementary DNA (cDNA); and then performing PCR amplification using primers which uniquely bind to regions associated with the SARS-CoV-2 virus. Through repeated PCR amplification cycles, you measure the fluorescence of a particular dye that binds to the DNA being amplified, which gives you a plot of fluorescence intensity over cycles. If that intensity increases exponentially, at least initially, then levels off and surpasses a particular threshold, that is a positive sample; whereas if it is just a flat line, it is a negative sample. Not only that, but the particular cycle where this threshold is crossed, the cycle threshold (Ct), can be used to quantify the amount of RNA, and hence the amount of virus, in the original sample: a lower Ct value indicates that the threshold was met earlier, and so there was more RNA, that is more virus, in the original sample. Another traditional laboratory-based surveillance technique for identifying particular pathogens is multilocus sequence typing (MLST), which was introduced in previous modules. Just as a recap, classical multilocus sequence typing is based on six or seven, or maybe a few more, genes.
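To make the earlier Ct-to-quantity relationship concrete, here's a minimal sketch of my own (not part of any standard qPCR software) that estimates the fold difference in starting template between two samples, under the idealized assumption of perfect doubling each cycle:

```python
def fold_difference(ct_a: float, ct_b: float) -> float:
    """Estimate how much more starting template sample A has than sample B.

    Assumes 100% amplification efficiency, i.e. the template doubles every
    cycle, so each one-cycle difference in Ct corresponds to a 2-fold
    difference in starting material. A lower Ct means more template.
    """
    return 2.0 ** (ct_b - ct_a)

# A sample crossing the threshold at cycle 20 has ~8x more template
# than one crossing at cycle 23.
print(fold_difference(20.0, 23.0))  # 8.0
```

Real assays calibrate against standard curves because efficiency is rarely exactly 100%, but the exponential relationship is the key intuition.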
The different alleles of these genes can be used to assign a particular sequence type, or subtype, to a particular bacterium. In this figure, for genomes or samples A, B, and C, there are roughly seven different genes, or loci, being investigated for the multilocus sequence typing. Each allele gets assigned a unique integer if there is any genomic variant within that particular gene, and then the combination of all these allelic identifiers (one, three, four, and so on) gets assigned a unique sequence type identifier. This can then be used to identify and classify subtypes of particular bacteria. It can be performed through a variety of methods, including whole-genome sequencing, where the genomic sequence is investigated to identify these particular loci and the alleles at each locus. This was later extended to include hundreds or thousands of genes within a genome, through the use of core-genome MLST or whole-genome MLST. Core genome means that you are looking at the genes that are found within all, or at least most, of the organisms in question, whereas whole genome may also include genes that aren't part of the core genome; that is to say, some of the organisms you're investigating may be missing particular genes. In any case, these give you a very detailed and fine-grained classification system that goes beyond looking at a small fraction of the genome: you're essentially looking at the entire genome of a particular bacterium and using that to classify and cluster different bacteria and identify different subtypes or clusters.
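As a toy illustration of how an allelic profile maps to a sequence type (the profile table below is invented for the example, not a real MLST scheme):

```python
# Hypothetical 7-locus scheme: each allelic profile (a tuple of allele
# integers, one per locus) maps to a sequence type (ST) identifier.
ST_TABLE = {
    (1, 3, 4, 1, 1, 2, 5): "ST-1",
    (1, 3, 4, 1, 1, 2, 6): "ST-2",
    (2, 3, 4, 1, 1, 2, 5): "ST-3",
}

def assign_sequence_type(profile, table=ST_TABLE):
    """Return the ST for an allelic profile, or 'novel' if unseen."""
    return table.get(tuple(profile), "novel")

print(assign_sequence_type([1, 3, 4, 1, 1, 2, 6]))  # ST-2
print(assign_sequence_type([9, 3, 4, 1, 1, 2, 5]))  # novel
```

Real schemes (hosted, for example, on PubMLST) work the same way in spirit: an exact combination of allele numbers defines the sequence type, and an unseen combination is a candidate new type.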
Another type of traditional lab-based pathogen identification, or rather identification of particular subtypes of individual pathogens, is serotyping, which is subtyping based on cell surface molecules. Basically, if you mix a particular type or subtype of bacteria with different antibodies and you see a reaction, that indicates there is some unique combination of antigens on the surface of that cell, and the differences between these antigens can be used to classify, type, and identify different pathogens. This can also be performed if you're doing whole-genome sequencing, where you can use sequence-based comparison methods to look for the particular sequences coding for the different antigens and use those sequences to identify the serotype. Finally, another method used for identifying and classifying different pathogens is pulsed-field gel electrophoresis (PFGE), where the idea is that the DNA is cut into fragments using restriction enzymes, which fragment the strand of DNA into a variety of pieces of different sizes. Depending on the variation in the original genome, these cuts occur at different locations, which produces different fragment sizes, and these fragments can then be separated out on a gel, which you can see in the banding pattern right here. By comparing the banding patterns with each other, and looking for the presence or absence of different bands across a whole collection of samples, you can then cluster the different bacteria that were run through this PFGE process, which is shown in the dendrogram up here.
Some of the advantages and disadvantages of these traditional diagnostic methods for pathogen detection: the main advantage is specificity. Since these detection methods involve targeting particular pathogens, or identifying and classifying particular subtypes of pathogens, for example qPCR for SARS-CoV-2, they allow you to more easily differentiate and extract a signal from that pathogen amongst the noise of all the background material within the sample, whether that's host genomic material, from a human for example, or environmental genomic material. However, the disadvantage is that if a pathogen has evolved too much, you might miss it. It might not fit into the existing classification system you have set up for subtyping different pathogens, or it might not even be detected at all. This is a large disadvantage for emerging, and in particular novel, pathogens, where you would expect a large amount of difference from the existing pathogens we've already identified and classified. And this is where shotgun metagenomic sequencing can be used as a way to better handle that sort of situation. Shotgun metagenomic sequencing is a genomic sequencing technique that provides an unbiased survey of nucleic acid content, unbiased here meaning that you aren't, for example, isolating or culturing for a particular bacterium, or amplifying particular targets. You're just taking the collection of nucleic acids within a sample, fragmenting them, and sequencing whatever it is you have. The nucleic acid can be DNA or RNA; with RNA you would often first convert it to complementary DNA before sequencing.
For clinical metagenomics, the specimen will contain both host nucleic acid plus microbial nucleic acid, which can include commensal microbes, those normally found within you, as well as any potential pathogens. So there's a lot of data in there, and you need specialized techniques to sift through that large amount of data if you wish to use it to identify particular pathogens. Additionally, metagenomic samples can also contain contaminating DNA from sources external to the sample. The figure, from the paper shown here, compares culture-dependent genomics to metagenomics. In both cases you are collecting a biological sample. With culture-dependent methods, you isolate pure cultures of particular microbes and sequence those pure cultures after first extracting the DNA. Whereas with culture-independent methods, that is metagenomics, you have a complex microbial community, which can be derived from the host (that is, a human) or from environmental or food-based samples. You extract the DNA, or convert RNA to complementary DNA, and this gives you a wide variety of DNA fragments from a variety of different microbes that aren't separated in any meaningful sense. When you perform metagenomic sequencing, you end up getting reads which match a wide variety of different microbes or other organisms, such as human or plant sources, whatever happens to be in that particular biological sample.
For isolate genomics, because you have isolated pure cultures of the microbe in question ahead of time, you basically know that all the sequence reads belong to that particular clonal colony of that microbe, which gives you more information that can be used, for example, to assemble a draft, or, depending on the length of the reads, potentially nearly complete, genome of that particular microbe. With metagenomics, and particularly if you wish to do assembly of metagenomic data, you're getting much more of a mixture of data, and it's a lot more difficult to extract full genomes from metagenomic samples. One thing about metagenomic samples, as mentioned before, is that they don't contain just the microbe or pathogen you're looking for; they contain an unbiased sample of whatever genomic material was within that sample. Often you're only interested in particular sets of genomic material; for example, you usually aren't interested in human genomic material when sequencing clinical samples where you're trying to identify a particular pathogen. There are a number of different methods for reducing the amount of host genomic material within your sample, and the amount of host genomic material can depend quite a lot on the source of that sample, or specimen type. The two main categories of host depletion are wet-lab and computational methods. With wet-lab methods, you are either enriching the microbial nucleic acid or reducing the host nucleic acid, though this must be applied with care, since, for example, host nucleic acid reduction may also impact the amount of genomic content recovered from the microbial community as well.
There are a number of different methods available: CpG island hybridization, rRNA depletion, poly(A) selection, or selective host lysis and DNA degradation. The figure at the right, from a paper evaluating a number of different methods for selective host lysis and DNA degradation, shows human DNA as the white bars, with the colored bars being microbial, from bacteria or viruses. It's evaluating a number of different compounds to lyse the host cells, in this case human, and degrade the genomic material from those host cells. You can see these remove a large amount of the human genomic material but, depending on the compounds you use, can also impact the microbial genomic material as well. So those are the methods for host depletion in the wet lab; however, there are also computational approaches you can use. Mainly, that is to map your reads to the host genome and remove any reads that match that host genome. You can also use a combination of these different methods. So that is one potential issue with metagenomic sequencing; another issue is contamination, namely that nucleic acid external to the sample can be introduced. This can happen at all stages of sample preparation, whether it's collection, extraction, or library preparation, and there is a wide variety of sources contamination can come from. One way of dealing with contamination is to make sure to use negative controls when performing sequencing, to verify that there is no contamination in the sequence data.
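Going back to the computational host-removal step for a moment, here's a minimal sketch of the filtering logic (my own illustration; in practice an aligner such as Bowtie 2 reports which reads matched the host reference, and the read IDs and sequences below are made up):

```python
def remove_host_reads(reads, host_read_ids):
    """Drop reads whose IDs were flagged as aligning to the host genome.

    `reads` is an iterable of (read_id, sequence) pairs;
    `host_read_ids` is the set of IDs an aligner matched to the host.
    """
    return [(rid, seq) for rid, seq in reads if rid not in host_read_ids]

reads = [("r1", "ACGT"), ("r2", "TTGA"), ("r3", "GGCC")]
host_hits = {"r2"}  # e.g. r2 aligned to the human reference genome
print(remove_host_reads(reads, host_hits))  # r1 and r3 survive
```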
Once you have your library prepared, you can then perform metagenomic sequencing, and the overall process for using these metagenomic sequences for pathogen detection is: your nucleic acid gets sequenced into reads, which come from a wide mixture of the different species that were found within that biological sample. Each individual read can then be directly assigned to a particular taxon, or taxonomic category, and this can be used to investigate what sort of material is within your sample. Alternatively, or optionally, you can perform a metagenomic assembly, where the reads are combined together into larger contiguous fragments, or contigs. The contigs from that assembly can then also be assigned to different taxonomic categories to investigate the microbial and other organisms within that genomic sample. The contigs can also be binned, based on relatedness or a variety of other methods, to attempt to organize them so that all the contigs associated with a particular species are grouped together. This can be used both as input for taxonomic assignment and also, for example, to investigate the genetic content of a particular organism, which may be broken up into a number of different contigs. However, no matter which method you use, there's always this extra step of interpretation of the data, which can itself be challenging. So, describing these two optional steps, metagenomic assembly and binning, in a bit more detail: metagenomic assembly can be performed using a wide variety of assemblers. A number of them were either designed for metagenomic data sets, or are enhancements of existing genomic assemblers, such as MEGAHIT and metaSPAdes, among many others.
Many of these assemblers are de Bruijn graph assemblers, meaning that they break your reads up into fragments of size k, called k-mers, construct graphs from these k-mers, and then follow paths through the graph to output contigs. Once you have your metagenomic assembly, another optional step is binning, which groups contigs based on shared characteristics. A number of different methods exist, including examining differences in tetranucleotide frequencies, differences in abundances, or codon usage of the different contigs, the goal being to group and organize these contigs by species. When you go through the process of doing a metagenomic assembly to construct a number of contigs, and then binning those contigs into species-level collections, you end up with a metagenome-assembled genome (MAG), which can then be used for additional analyses. Primarily, the idea is that because you have your contigs grouped into different species, you can do gene prediction and gene annotation to investigate the different genes within particular species. You may also be able to identify, for example, bins associated with species that don't match an existing reference database, which could indicate a novel pathogen, or at least something not represented in your reference database. In any case, you can work either from the metagenomic assembly alone or from the binned metagenome-assembled genomes.
You can use that information for taxonomic assignment, or for phylogenetic analysis. Phylogenetic analysis can be performed, for example, by searching for and extracting particular targets or genes and performing multiple sequence alignments of these genes among the different species bins, among other approaches. Whether you're using the reads directly for taxonomic assignment, or the assemblies, the idea in general is that you want to take that genomic content and identify which taxon, or which organism or collection of organisms, it belongs to. This can be performed at a number of different taxonomic ranks; for example, you can identify the species a read belongs to, but it could also be assigned at the level of genus, family, order, and so on. This is typically performed by having a pre-existing reference database of known taxa and then matching the reads or contigs to that database to perform taxonomic assignment. Since there are a number of different ways of performing taxonomic assignment, there's a question of which method is best, or which you should use, and what the considerations are when measuring these methods; mainly, you want a method to be fast, sensitive, and specific. One way you could do taxonomic assignment is directly using BLAST. For example, you could take all the reads and use BLAST to align them to a reference database of genomes from a wide variety of taxonomic categories, and you could fine-tune the BLAST parameters to adjust specificity and sensitivity. However, especially if you're trying to perform taxonomic assignment directly from reads, BLAST is often not fast enough; it's not really the best or fastest tool for that job, though it could be used if you perform metagenomic assembly first.
There is the additional challenge, if you're using BLAST alone, of managing the data and sorting out from the BLAST reports which taxonomic category a particular read or contig belongs to. With BLAST, if you took a particular read and aligned it to a reference database covering a wide variety of organisms, you may get matches to a wide variety of different species. The matches may all be very similar to each other, or even identical, across different species, for example if the read was derived from genomic content that is common among those species. So there's this additional question of which taxonomic category you should assign to that particular read, which is represented here: your read matches a wide variety of species because it derives from a region common to all species within a particular genus. There's additional post-processing you may need to do on the BLAST reports if you wish to use BLAST for taxonomic assignment. There are better solutions out there that not only surpass BLAST in performance for processing the millions or tens of millions of reads in a metagenomic sample, but also do this additional post-processing to sort out which taxonomic rank particular reads or contigs should be assigned to. The one we're going to discuss in this presentation is Kraken. Kraken, originally published in 2014, does not rely on alignment, so it doesn't have to do all the work that BLAST does during alignment; it's a method for very quickly assigning taxonomic labels to metagenomic sequence data, or any sequence data in general.
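Before moving on to Kraken, here's a small sketch of the kind of post-processing just described (my own illustration, not part of BLAST): when a read hits several species equally well, one reasonable resolution is their lowest common ancestor in the taxonomy. The taxonomy here is a toy example.

```python
# Toy taxonomy: child -> parent. "root" is its own parent.
PARENT = {
    "root": "root",
    "Enterobacteriaceae": "root",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella": "Enterobacteriaceae",
    "E. coli": "Escherichia",
    "E. fergusonii": "Escherichia",
    "S. enterica": "Salmonella",
}

def lineage(taxon):
    """Path from a taxon up to the root."""
    path = [taxon]
    while taxon != "root":
        taxon = PARENT[taxon]
        path.append(taxon)
    return path

def lowest_common_ancestor(taxa):
    """LCA of all taxa hit by a read: the deepest node on every lineage."""
    common = set(lineage(taxa[0]))
    for t in taxa[1:]:
        common &= set(lineage(t))
    # The deepest shared node is the one with the longest lineage.
    return max(common, key=lambda t: len(lineage(t)))

print(lowest_common_ancestor(["E. coli", "E. fergusonii"]))  # Escherichia
print(lowest_common_ancestor(["E. coli", "S. enterica"]))    # Enterobacteriaceae
```

A read hitting two Escherichia species resolves to the genus; a read shared across genera resolves to the family. This is exactly the ambiguity Kraken's database handles up front.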
The original software was published in 2014, and an updated version, Kraken 2, was published in 2019, which significantly reduced the database size required for a Kraken analysis, the database being the reference collection built from the large set of taxonomic categories used to classify reads. So, going over briefly how Kraken works: it is primarily based on k-mers, a k-mer being produced when you break up a read, a contig, or any other stretch of sequence into small fragments of a fixed size, with k being that size; in this case k is 4. To generate the collection of k-mers from a sequence such as a read, you use a sliding window: you pick the first four nucleotides to form the first k-mer, then you slide over by one to generate the second k-mer, which is the next four nucleotides, and slide over by one again to generate the third k-mer, and so on. The collection of all these substrings of length four from the read forms the collection of k-mers derived from the read. Kraken, both in constructing the database and in taxonomic assignment of reads or other sequences, is entirely based on k-mers. To build a Kraken database, imagine that for each of the different organisms you're interested in representing in the database, you've taken its sequence ahead of time, for example for an owl, and divided it up into all of its substrings of size k, namely all of its k-mers, in this case all k-mers of size four. And you've done that for all the different organisms you want to include in the Kraken database. Then, to construct the Kraken database, for every k-mer associated with your organisms, you go and look up where it occurs.
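The sliding-window k-mer decomposition described above can be sketched in a couple of lines:

```python
def kmers(sequence: str, k: int = 4) -> list[str]:
    """All overlapping substrings of length k, sliding one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmers("ACGTAC"))  # ['ACGT', 'CGTA', 'GTAC']
```

A sequence of length n yields n - k + 1 k-mers, which is why even short reads contribute many database lookups.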
You assign each k-mer to a particular category in the taxonomic tree. For example, if a particular k-mer only matches owl and nothing else in the tree, you assign it to owl; a k-mer that only matches snake gets assigned to snake. If there's a k-mer that is in common between a snake and a turtle, then it gets assigned to the least common ancestor of those two organisms, in this case reptile. Or if you have a k-mer that matches turtle, snake, and owl, it gets assigned to an even higher taxonomic category, in this case vertebrate. You do this sort of assignment for every k-mer, assigning each to a particular node in the taxonomic tree, and you end up with, essentially, a large table of your k-mers and the classification category assigned to each k-mer based on those observations. Now, for the original Kraken databases, there is an additional data structure used, called the minimizer, which sits on top of this table of k-mers and classifications and is used to search the database more efficiently. The minimizer is a substring, an l-mer or sub-k-mer of a k-mer, that is stored in an additional table. The idea with minimizers is that you group collections of k-mers that are very similar to each other and assign them a single identifier, which makes it more efficient to search through the Kraken database. Comparing Kraken 1 to Kraken 2, this is one of the primary places where Kraken 2 was able to improve on Kraken 1. Kraken 1 stores, within each record in the database, the k-mer and the least common ancestor associated with that k-mer.
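The k-mer-to-LCA assignment described above can be sketched with toy genomes and a toy taxonomy (not the real Kraken data structures, which use minimizers and compact hash tables):

```python
# Toy taxonomy: child -> parent, rooted at "vertebrate".
PARENT = {
    "vertebrate": "vertebrate",
    "reptile": "vertebrate",
    "bird": "vertebrate",
    "snake": "reptile",
    "turtle": "reptile",
    "owl": "bird",
}

def lineage(taxon):
    path = [taxon]
    while taxon != "vertebrate":
        taxon = PARENT[taxon]
        path.append(taxon)
    return path

def lca(a, b):
    """Least common ancestor: first node of b's lineage also in a's."""
    ancestors_a = lineage(a)
    for node in lineage(b):
        if node in ancestors_a:
            return node

def build_database(genomes, k=4):
    """Map each k-mer to the LCA of every genome containing it."""
    db = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = taxon if kmer not in db else lca(db[kmer], taxon)
    return db

db = build_database({"snake": "ACGTT", "turtle": "ACGTA", "owl": "GGGGG"})
print(db["ACGT"])  # shared by snake and turtle -> 'reptile'
print(db["CGTT"])  # unique to snake -> 'snake'
```

The k-mer "ACGT" occurs in both reptile genomes, so it is stored at reptile; k-mers unique to one organism keep their leaf label, just as in the slide's example.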
Kraken 2, by contrast, stores in each record a hash code derived only from the minimizer of the k-mer. This means that while records in Kraken 1 are quite a bit larger, 96 bits for example, records in Kraken 2 are much smaller, in this case 32 bits, which is one of the main ways Kraken 2 is able to create significantly smaller databases. But with those technical details aside, for Kraken in general you can think of the database as a table of k-mers, each with an assignment to a particular taxonomic category. When you want to assign a read to a taxonomic category, you initially break that read up into all of its k-mers. Then for every k-mer you look up, in the Kraken database, the particular taxonomic label associated with that k-mer, and count up the number of k-mers that match each category in the tree. Once you've done this for all the k-mers within a read, this gives you, within the larger taxonomic tree of all organisms in the Kraken database, a subtree with your k-mer counts placed at a variety of different taxonomic nodes. The idea then is that you classify the read itself, the taxonomic assignment of the read, according to the root-to-leaf path that has maximal weight. It's easiest to just depict it here: in this case the read is assigned to the snake category, because there are three k-mers at snake plus one k-mer along the root-to-leaf path leading to snake, whereas if you went all the way to turtle, there would only be one k-mer at the turtle node, and only one at the owl node. So in that way the read itself is classified as a snake. The way you can think of it is: I know it's a snake because the k-mers mostly mapped to snake; they provide the most evidence for snake.
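The root-to-leaf scoring just described can be sketched like this (same toy taxonomy idea; real Kraken handles ties, unclassified k-mers, and internal-node assignments with more care):

```python
# Toy taxonomy: child -> parent, rooted at "vertebrate".
PARENT = {
    "vertebrate": "vertebrate",
    "reptile": "vertebrate",
    "bird": "vertebrate",
    "snake": "reptile",
    "turtle": "reptile",
    "owl": "bird",
}
LEAVES = ["snake", "turtle", "owl"]

def path_to_root(taxon):
    path = [taxon]
    while taxon != "vertebrate":
        taxon = PARENT[taxon]
        path.append(taxon)
    return path

def classify(kmer_counts):
    """Pick the leaf whose root-to-leaf path collects the most k-mer hits."""
    def weight(leaf):
        return sum(kmer_counts.get(node, 0) for node in path_to_root(leaf))
    return max(LEAVES, key=weight)

# From the slide: 3 k-mers hit snake, 1 hits reptile, 1 turtle, 1 owl.
counts = {"snake": 3, "reptile": 1, "turtle": 1, "owl": 1}
print(classify(counts))  # snake
```

The snake path scores 3 + 1 = 4 (snake plus its reptile ancestor), turtle scores 1 + 1 = 2, and owl scores 1, so the read is classified as snake, matching the slide's example.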
Some k-mers may map to higher taxonomic categories, but mostly they match snake. So this is the way you can classify, or assign a taxon to, an individual read. With a metagenomics data set of millions or tens of millions of reads, you would end up with millions or tens of millions of taxonomic assignments. Then, to provide further interpretation of that data, there's additional post-processing where you may want to summarize, for example, how many reads belong to different taxonomic categories, and visualize this summary. This can be performed with a number of different software packages. Krona is one that can be used to visualize the percentage of reads in the different taxonomic categories they are assigned to as a hierarchical pie chart, as you can see on the right here. A newer software package that can do this is Pavian, which can again take as input the same Kraken report, as well as input from a number of other tools. With Pavian you can visualize things using Sankey plots, as well as a number of other useful visualizations and tables that can help with interpretation of the taxonomic assignments of your metagenomics data set. So once you have all these taxonomic assignments, there's still the question of how you know which taxon is the causative agent of your disease. One way you can sort this out is by comparing the symptoms caused by a candidate pathogen with the symptoms of the disease. So if you identify a pathogen, or a number of different pathogens, within the data set, you can use additional information such as the symptoms to sort out which pathogen is potentially the cause of that disease.
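The summarization step is essentially a big tally. A minimal sketch, with made-up per-read assignments standing in for the millions produced by a real run, is the kind of per-taxon table a Kraken report holds and that Krona or Pavian then visualize:

```python
# Collapse per-read taxonomic assignments into a per-taxon summary table.
# The read assignments here are invented toy data.
from collections import Counter

# One assignment per read; "unclassified" is a typical Kraken outcome too.
read_assignments = [
    "snake", "snake", "reptile", "snake",
    "owl", "unclassified", "snake", "reptile",
]

summary = Counter(read_assignments)
total = len(read_assignments)
for taxon, count in summary.most_common():
    print(f"{taxon}\t{count}\t{100 * count / total:.1f}%")
```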
However, you can also compare your data to controls. For example, if you have metagenomics data from multiple individuals, some of whom are infected with an unknown pathogen and some of whom are healthy controls, you could use this information to at least narrow down which pathogens are potentially causative of the disease: you can narrow down which organisms are found within the samples from those who are sick when compared to those who are healthy. I did also want to mention that there was a paper published last year defining a standard protocol for metagenomics analysis using Kraken, which was written by the authors of Kraken and Kraken 2. It lays out a number of different steps that can be used for pathogen identification using Kraken. Once you have your NGS reads, you remove host data by aligning to your host genome, such as the human genome; in this case, they recommend using Bowtie 2 for that alignment. Then, once the reads have been filtered to remove host data, you use Kraken or KrakenUniq to classify those reads into the different taxonomic categories, as I described in the previous slides, and then visualize with Pavian. Similarly, in this protocol, if you have a collection of biological samples from a large number of people, which may include controls as well as those who may be infected, or other environmental samples that may contain a mixture of organisms, you can use Pavian to help identify which organisms are elevated in one sample compared to the others. In this case, they recommend using Z-scores of the read counts to identify which taxonomic categories are elevated in one sample over another.
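The Z-score idea can be sketched directly: for one taxon, compare each sample's read count to the mean and standard deviation across all samples. The counts below are invented (with a hypothetical outlier sample labelled S71 to echo the protocol's example); the real protocol computes these scores within Pavian.

```python
# Hedged sketch of per-taxon Z-scores across samples. Counts are made up.
from statistics import mean, stdev

# Read counts for one taxon across five samples; S71 is the outlier here.
counts = {"S68": 2, "S69": 0, "S70": 1, "S71": 950, "S72": 3}

values = list(counts.values())
mu, sigma = mean(values), stdev(values)
z_scores = {s: (c - mu) / sigma for s, c in counts.items()}

for sample, z in z_scores.items():
    flag = "  <-- elevated" if z > 1.5 else ""
    print(f"{sample}: z = {z:+.2f}{flag}")
```

A taxon whose reads are spread evenly across samples gets Z-scores near zero everywhere; a taxon concentrated in one sample stands out with a large positive Z-score in that sample.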
For example, the protocol gives example data for sample S71, where one organism, "Salia LJREA" as transcribed, and my pronunciation is pretty bad, looks like it is likely what is in that sample. You can tell from this plot, which shows that there are a lot more reads of this organism in S71 and basically no reads from that organism in any of the other samples. They then have some additional steps to extract the reads associated with each taxonomic category, align those reads back to the reference genome for a particular organism, and examine those alignments to make sure that you have even read coverage across that reference organism, as a confirmation step to see whether that organism is actually in your data. So this is a very nice protocol that gives step-by-step instructions, exactly which commands to run, and example data if you want to investigate and do additional reading. For our lab, which we're going to do right after the break, we're going to use a different data set and a simplified version of this protocol that doesn't necessarily use all the same software stated here, and which includes additional metagenomics assembly steps that are not present in this analysis protocol.
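The coverage confirmation step can be illustrated with a toy check: after aligning a taxon's reads back to a candidate reference, look at whether coverage is spread across the genome or piled onto one small region. The depth values below are invented; a real workflow would derive per-base depths from the alignments.

```python
# Toy sketch of the confirmation step: even coverage across the reference
# supports the organism's presence, while hits piled onto a couple of
# positions suggest a conserved or repetitive region. Depths are invented.

def coverage_stats(depths):
    """Breadth of coverage and mean depth from a per-base depth list."""
    covered = sum(1 for d in depths if d > 0)
    breadth = covered / len(depths)
    mean_depth = sum(depths) / len(depths)
    return breadth, mean_depth

# Even coverage across a (tiny) ten-base genome.
even = [5, 4, 6, 5, 5, 4, 6, 5, 5, 5]
# Same total depth, but concentrated on two positions.
piled = [0, 0, 25, 25, 0, 0, 0, 0, 0, 0]

print(coverage_stats(even))   # high breadth, mean depth 5.0
print(coverage_stats(piled))  # low breadth despite the same mean depth
```

Both toy genomes have the same mean depth, so breadth of coverage is what separates a genuine hit from a misleading one.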