Okay, so I realize some people are not back from lunch yet, but that should be okay; it's just a very quick introduction to 16S at the beginning, and then we'll talk a little bit about what we're going to do for the lab. For those of you who didn't hear my very brief introduction this morning: my name is William, I'm from the BC Public Health Lab, and I'm also a clinical assistant professor at UBC. The dual affiliation allows me to apply both genomics and metagenomics in a public health diagnostic lab setting, so if you have any questions about that, feel free to come talk to me.

The learning objective for this afternoon is to focus on marker genes, as you already know. At the end of this module, including the lab, we're hoping that you'll be able to understand and perform marker gene based microbiome analysis. You'll be able to analyze 16S rRNA marker genes — that's what we'll be focusing on, but the same protocols can be applied to other types of marker genes — use 16S rRNA marker genes to profile and compare microbiomes, select suitable parameters, and hopefully see, in some cases, what the parameter and algorithm choices are when you do marker gene analysis. We'll also explain the advantages and disadvantages of marker gene based microbiome analysis. The last part I'll keep fairly brief; Morgan will go into a little more detail tomorrow on using PICRUSt, and will actually compare marker gene based analysis versus predicted functional gene based analysis — you saw a bit of this this morning.

My lecture essentially follows this flow: we'll talk a little bit about the wet lab component — extracting DNA, amplifying the target using primers — then filtering out the errors and building OTU clusters using two of the most popular tools for marker gene analysis, and then we'll talk a little bit about diversity. We won't have too much time to go into that; this is meant as a brief introduction, and I know about half of you have experience with 16S already. The hope is that those of you who don't have experience will learn something new, and those who do will share your experience — as Rob mentioned this morning, your real-life experience will be very valuable for other people. Also, this is being recorded, so if I say something wrong or incorrect, it's better that you catch me right now than to let it live on and be wrong perpetually. Let's hope that doesn't happen. So feel free to ask questions, and feel free to stop me anytime if anything's unclear.

So Rob already talked this morning about how ribosomal RNA genes are the most popular universal phylogenetic markers. They have certain attributes that make them a very popular choice for community profiling analysis: ribosomal RNA is present in all living organisms, and it is conserved because it plays such a critical role in protein translation. As a result, it's also relatively rare for ribosomal RNA genes to be acquired horizontally. There are some documented cases — actually, Rob and I both studied horizontal gene transfer, for our PhD and postdoc respectively.
That's when we actually first met, I guess about 10 years ago, so horizontal gene transfer also holds a special place in my heart, although I haven't been keeping up as much as Rob has. And because ribosomal RNA is fairly conserved, it behaves like a molecular clock, so it's useful for phylogenetic analysis and for building a universal tree of life, allowing you to place organisms onto a single tree. Of these genes, 16S rRNA is the most commonly used, for the reasons Rob mentioned this morning: it has a good database, and it's just about the right size to give you sufficient phylogenetic signal.

Okay, so given that 16S rRNA is well studied, its attributes not only make it a good tool for phylogenetic analysis, they also make it a useful tool for understanding the composition of a microbial community. So we'll talk a little bit about alpha diversity — the different measurements that you can use to profile a community — and, extrapolating from there, you can also use the same marker for relating one microbial community to another; typically this type of comparison uses a beta diversity index to compare your communities. Lastly, as we gain more understanding of how the microbiome interacts with a host or with an environment, you can actually use the microbiome profile as a readout, as a biomarker, for diseases or for a condition in the environment. So you can develop classifiers using microbiome information to tell you whether a sample comes from a diseased individual, or a diseased tree, for example, or from a healthy one. It allows us to use the entire microbiome profile to characterize a condition.

Of course, 16S is not the only marker gene used, and 16S only applies to prokaryotic organisms. If you have eukaryotic organisms, such as protists or fungi, you would need either 18S or ITS or some other type of marker. Both 18S and ITS have reasonably comprehensive databases, and they're typically the choice for eukaryotic organisms; ITS is especially useful for fungi, not so much for protists. For bacteria, besides 16S, people have also used the chaperonin gene cpn60, and an ITS has been developed for bacterial genomes as well. A while back, groups like Jonathan Eisen's lab proposed using the recA gene as a universal marker, for the reason that recA is a single-copy gene — and there are other single-copy genes — whereas 16S, as Rob mentioned, has multiple copies, and that sometimes confuses the analysis when you have multiple copies of 16S in a given organism. For viruses, of course, it's not just very hard to come up with a universal marker, it's impossible. But there are specific biomarkers for specific families of viruses, such as gp23 for T4-like bacteriophages or RdRp for coronaviruses. Anyone else here work with viruses or protists, and do you use any other types of marker genes? Not yet developed — okay.
Yeah, so I got this list from one of the studies that I'm participating in, which looks at watersheds in BC. We essentially profile the water from both contaminated and pristine sites and use the different biomarkers to characterize the different fractions of organisms, so we have some experience with all of these biomarkers from our study. If you have any questions regarding them, feel free to come talk to me afterwards. I wanted to mention that some of these marker genes evolve faster, so they're useful for strain-level differentiation, whereas 16S is typically not good at the strain level.

Okay, so I'm going to go into the wet lab component a bit — extraction and amplification. My knowledge of the wet lab is sort of by proxy, I guess, from talking to the technicians, the postdocs and so on, but I have spent quite a bit of time talking to them, so I'm aware of the challenges and differences, and I'll try to highlight some of these for you.

So, DNA extraction. Very early on in the human microbiome studies, it was known that different DNA extraction protocols give you fairly different microbiome profiles, and as a result some standardizations were made for large-scale projects such as the HMP or the Earth Microbiome Project. I've listed an example from the Earth Microbiome Project, which is the same protocol used by the HMP for their DNA extraction. The HMP has done a study — if you go to their website there are links to studies — showing how the different extraction kits end up preferentially emphasizing different sets of organisms.

As I mentioned, DNA extraction can also be done after you fractionate your samples to separate out the different organisms based on their cell size and other characteristics. Miguel from our institute produced a video in an online video journal highlighting the protocols and some of the challenges associated with this particular approach. Unfortunately it's not an open-access site, so you might have to go through your institute to access it, if your institute happens to subscribe, or come talk to me if you're interested.

Okay, so during the DNA extraction process it's very common — almost impossible to avoid — to pick up contamination from the lab. This can come from the reagents, from the lab environment, or from the sequencer — from the previous run, if you didn't wash the sequencer properly. All of these can result in contamination; in other words, DNA that is not part of your sample ends up in your sequence readout. Usually the level of DNA contamination is quite low compared to your sample, especially if you're extracting from fecal material or other high-yield material. But if you are extracting from low-yield material that generates very little DNA to begin with, contamination can become a significant issue. In those cases it's recommended that you actually include an extraction negative control — run through the DNA extraction process with no input.
So just use pure molecular-grade water — the best water that you can possibly get — and run it through the process. You'll be surprised that you still end up with something in the negative control, but then bioinformatically we can subtract all those organisms from our analysis. So it's very useful to have an extraction negative control in your lab pipeline.

The other source of contamination is contaminants already in your sample; this includes host or environmental DNA — DNA that comes from the host or from the environment. These types of contaminants are usually not easily removed in the wet lab. You can try subtractive hybridization, or some kind of fractionation process, to remove some of them, but at the end of the day you probably do need to rely on bioinformatic, computational tools to remove these sequences from your samples. Here I've listed one that's recommended by the HMP project for removing human contaminants. One thing that may be worth noting — I'll get to that.

Okay, so once you have your DNA, the next step is to amplify the biomarkers that you're interested in. For target amplification, I just want to highlight some terminology in this picture here, so we're all on the same page. When I say adapter, I mean the sequence that's complementary to what's on, say, the sequencing chip — the part that hybridizes to your sequencing chip — so it's called the sequencing adapter. Immediately downstream of the sequencing adapter is the primer that you use to amplify your target; in this case, it would be the 16S primer that you use to amplify the V4 region of the 16S ribosomal RNA gene. In between these, you can introduce a unique index, or unique barcode, to allow you to pool multiple samples into the same run; later on, using this unique barcode, you can pull the individual samples back out of your pooled run. I think someone mentioned dual indexing earlier — in that case the index is simply present in both the 5' and the 3' primer sequences.

Okay, so the PCR primers are designed to amplify specific regions of the genome. It's worth noting that sometimes, when you get a dirty sample, there can be a lot of amplification inhibitors in it; in that case you usually need to dilute your sample before you can successfully amplify the target region. In other, more complex samples you can actually end up with non-specific amplification — in other words, you have multiple bands on your gel after you run your PCR. One way we have dealt with that issue is to use gel size selection: cut out only the bands that correspond to the marker genes we're interested in. If you're a wet lab person, you know that's a very tedious process to do by hand, so we actually collaborate with a company that uses robotics to do the gel extraction.
Okay, so again, I think Rob mentioned this earlier, but the 16S rRNA gene has nine hypervariable regions, and if you amplify different hypervariable regions you can actually get slightly different microbial profiles and OTUs depending on the region you selected. There are a few papers on target bias — I think I've listed one of them here, but I can certainly send you more. In our studies we mostly focus on the V4 region of 16S, using the Illumina protocol, because, as you might have noticed in the earlier slide, the V4 region is just the right size — about 250 base pairs — which gives you a decent overlap to allow you to correct for sequencing errors, which typically happen at the 3' end of your sequences. I'm sure most of you know that sequence quality degrades as you move from the 5' to the 3' end, so the ability to do some correction at the 3' end is desirable, and we'll see how that can be done in the lab session.

Okay, so even on a desktop sequencer like the MiSeq, you generate significantly more sequences than you need to characterize a given sample, so the strategy is typically to put multiple samples into a single run. As I mentioned already, the way you do it is by tagging each of your samples with a unique barcode that you can later use to separate the reads back out into the different samples.

The number of reads that you need to characterize a sample really depends on the sample itself. Rob Knight famously says it only takes about a thousand sequences to separate your elbow from your ass. When the environments are very different — hopefully your elbow and your ass are — you don't need a lot of reads to differentiate the samples, but when the samples are much more similar, you potentially need more reads to tell them apart. That's sort of the guiding principle. When we first started metagenomics studies, people were aiming for ten to a hundred thousand reads per sample. We now realize that's probably overkill, for several reasons. One is that you typically don't need that many to differentiate one type of sample from another. Second, a lot of sample environments are highly uneven — you have very abundant organisms and very rare organisms — so even if you sequence to ten or a hundred thousand reads, you're still not capturing those rare organisms. We have done studies where we spiked in a known amount of DNA from a known organism to see if we could detect it in the sequencing, and we found that if you spike in a low amount, you know it's there, but you're not going to see it in your sequence output. For those reasons, unless you have a specific reason to sequence to high depth, it's typically advised that you sequence more samples rather than sequencing any one sample to very high depth.
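To make the demultiplexing idea concrete, here is a minimal sketch of assigning pooled reads back to their samples by barcode. The barcode length, its position at the start of the read, and the sample map are illustrative assumptions; real runs are demultiplexed by the sequencer software or by pipeline scripts in QIIME or mothur.

```python
# Toy demultiplexer: assign reads to samples by exact barcode match.
barcode_to_sample = {"ACGTACGT": "stool_01", "TGCATGCA": "water_02"}  # hypothetical map
BARCODE_LEN = 8  # assumed barcode length at the 5' end of each read

def demultiplex(reads):
    """reads: iterable of (read_id, sequence). Yields (sample, barcode-trimmed sequence)."""
    for read_id, seq in reads:
        barcode, insert = seq[:BARCODE_LEN], seq[BARCODE_LEN:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            continue  # unknown barcode: sequencing error or bleed-through; discard
        yield sample, insert

reads = [("r1", "ACGTACGTTTGGAGGATGCCA"), ("r2", "TGCATGCAGGCCTTAACGTTA")]
for sample, seq in demultiplex(reads):
    print(sample, seq)
```

Real demultiplexers also tolerate one or two mismatches in the barcode; exact matching is the simplest version of the idea.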
Oh, and the last point is about the way these massively parallel sequencers — the MiSeq is an example — work. It's sequencing by synthesis: the machine takes pictures, snapshots, of the DNA molecules as they grow, and each time a new base is added, a signal is emitted. So you can imagine that if all the reads on your flow cell are homogeneous — all the same — then for each cycle of your sequencing you're going to get either a lot of spots lit up or no spots lit up. Just imagine looking at a picture that's all white or all black: you get very little information, right? For that reason, when you're sequencing one type of amplicon, you usually need to spike in some additional sequences that are different from your amplicon. PhiX is the default control library that people spike into their Illumina runs in order to make the sample more heterogeneous and allow better resolution. You might think that if you don't spike in any other sequences you get more reads, but it's actually counterproductive, because the quality of your readout will be much lower than if you spike in a different sequence. Hopefully that's clear.

Okay, so the last point I want to cover here is that you can do one-step versus two-step PCR amplification. The protocol described by Rob Knight's group uses one-step amplification: as shown in the earlier diagram, the adapter, the barcode, and the primers are all in one construct, and you only need one PCR step to amplify the target genes and have the sequencing adapters already on your molecules, ready for sequencing. That type of approach is great in the sense that it's a much quicker protocol, but it's not suitable if you're using degenerate primers, or if you're trying to amplify many different types of amplicons, because with these long primers the adapter and the barcode can interfere with the target primers. So if you do the one-step amplification, you actually have to test each of your long primers to make sure they do amplify your target properly and there's no marked bias due to the presence of a barcode. For 16S this has been tested quite nicely, but for other markers there are very few existing one-step primers available.

For that reason, when we did our study with multiple target sites to amplify, we ended up using a two-step process: we used published target primers — we test the target primer to amplify the target — and then we ligate the adapters and the barcodes onto the amplicons after the PCR step. With this approach you do lose biomaterial, because the ligation is certainly not a hundred percent efficient, but it is compatible with degenerate primer amplification steps, and this approach is supported by most of the primer vendors: Illumina supports it, the company we use supports it, and New England Biolabs also has a set of primers that you can use for this type of approach.

Okay, so now I'm moving into the bioinformatic analysis component. I'm going to introduce you to two different tools: one is QIIME, and the other is called mothur. QIIME stands for Quantitative Insights Into Microbial Ecology — it only takes geniuses like Rob Knight to come up with these hard-to-pronounce acronyms that actually mean something. mothur, on the other hand, doesn't stand for any acronym as far as I know, but the developer, Pat Schloss, has a series of tools named SONS and DOTUR — "daughter" — so I guess the next logical step in his naming scheme was mothur.
So that's where the name came from. Okay, so these two tools started off with very different emphases. Initially they had very different functions, and you kind of had to pick one or the other when you needed to do a certain type of analysis, but over time they started to converge, and nowadays you can pretty much use either one to accomplish your analysis. There are some subtle differences, both in the philosophy of their design and in some practical aspects.

QIIME is actually a Python interface that glues together many different programs, whereas mothur is a single program with minimal external dependencies. Because of that, it's much easier to install, set up, and learn mothur. QIIME, on the other hand, has a large number of dependencies — it could easily take you a day or two just to install everything QIIME requires — but they do make a virtual machine available that you can download and launch on your own computer. The downside of a virtual machine, of course, is that it's limited by the size of your machine: if you load it on your laptop, that's not a very powerful machine for analysis, and very rarely can you actually deploy it on a server or another powerful machine.

Did I miss anything? Yeah — the other point is that because QIIME is glue that brings together many different programs, it's more scalable, in the sense that you can use different programs for different purposes, and some of those programs are highly parallelizable. mothur, on the other hand, is best suited to run on a single machine, and for some of the analyses that machine should be as powerful as you can get. We have run into situations where we have servers with 32 CPUs — or 32 cores — and one terabyte of memory, and we still ran into trouble running mothur on a large data set, because some of the steps require loading the entire data set into memory, and it's very resource-intensive. mothur has since come up with better ways to break large data sets down into smaller ones, but it still does not scale as well.

I'm going to highlight two links here. One is a publication put out by the QIIME group giving you a walkthrough of how to use QIIME — a lot of the material I'm showing here came directly from that paper. The other is the MiSeq SOP from the mothur website, and again a lot of my tutorial comes from that material. If you want to do marker gene analysis, it's well worth your time to be familiar with both of these resources.

Okay, so here's an overview of the bioinformatic workflow. I'll go through these steps individually, but just to show you: you typically start off with your sequence data as input, and along with that sequence data you'll have some additional information about your samples — the metadata. The sequence data needs to be processed in order to be ready for OTU picking, for clustering. Once you have picked the OTUs, you can either directly assign the OTUs to different taxa, or you can do an alignment, and from the aligned sequences you can do a phylogenetic analysis, which will give you a phylogenetic tree. When you do taxonomic assignment, what you end up with is
an OTU table — a list of OTUs and which OTUs are in which sample. From the OTU table or the phylogenetic tree you can then perform downstream analyses, looking at the diversity of your samples or visualizing their composition.

Okay, so my slides will refer to these numbers here as we go through them, so you can flip back and forth to know which stage of the analysis we're at. I did the same for the lab, so you can refer back to this diagram to see which stage of the analysis you're in when doing the lab.

Okay, so I mentioned contamination already. Contamination removal in a bioinformatic analysis typically involves mapping your reads to a database containing the suspected contaminant. So if you want to remove human sequences, you search against the human genome; if you want to remove, say, mouse or rat, you search against their respective genomes. Typically, for this type of search — you're dealing with millions of sequences — a BLAST search, which I'm sure most of you are familiar with, is a fairly slow algorithm, so it could take days just to search through your data set. So typically what we do is use a short-read aligner for this type of search. I won't go into detail at this point, but the HMP contamination-removal protocol will give you a sense of how that can be done, and if you want more information I'd be happy to talk to you more about short-read aligners and what they are. In our own analysis, we found that most of the short-read aligners give you comparable results, so it really doesn't matter which one you pick.

What does make a difference is whether your database includes the host variants. As you know, the human genome, when it was sequenced, was a composite of multiple genomes, and later on the different variants were added to the database; you can choose to download just the core genome, or download the list of variants as well. What we found is that if you include the host variants — in other words, SNPs or whatever types of variants are available — in your search, it helps to improve the matching and therefore removes more host sequences from your data before downstream analysis.

Any questions so far? Any mistakes? So — most of the short-read aligners actually have a fairly stringent threshold: typically about 2 to 3 percent divergence. You could probably go up to about 5 percent, but anything above that and they become much less efficient at matching, so you usually use 2 to 3 percent as a cutoff. That's why I said to include the variants: some of the more variable regions could easily exceed a 2 to 3 percent threshold.

Yeah, so — Nick Loman. Just Google Nick Loman, or he's on Twitter like no one's business, so you can find him on Twitter or on Google. He and others published a paper looking at common contaminants found in reagents. I don't know if there's a specific database associated with what they found, but you can certainly pull the organisms that correspond to their findings out of NCBI.

[Question: if you're not doing de novo picking, do you still need this step?] Less so, yeah. You can probably get away without it if you're not doing de novo picking, which we'll talk about in a bit — you can probably skip this step. And certainly for amplicon-based analysis this is less critical, because your PCR amplification is hopefully specific enough that you have very few contaminants to begin with. It's more of an issue for metagenomic analysis.
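As a concrete illustration of host-read removal by mapping, here is a hedged sketch using minimap2 and samtools — one possible aligner pair, not the HMP's specific tooling; any short-read aligner (BWA, Bowtie2, ...) works comparably, and the file names here are placeholders. It assumes both tools are installed on your PATH.

```python
# Sketch: map reads against the host genome, keep only the unmapped reads.
import subprocess

host_ref = "human_genome_with_variants.fa"        # hypothetical host reference (incl. variants)
reads_in = "sample_reads.fastq"
reads_out = "sample_reads.host_removed.fastq"

# Map reads to the host; in the SAM output, flag 4 marks reads that did NOT
# map, i.e. the non-host fraction we want to keep for downstream analysis.
mapper = subprocess.Popen(["minimap2", "-ax", "sr", host_ref, reads_in],
                          stdout=subprocess.PIPE)
with open(reads_out, "wb") as out:
    subprocess.run(["samtools", "fastq", "-f", "4", "-"],
                   stdin=mapper.stdout, stdout=out, check=True)
mapper.stdout.close()
mapper.wait()
```

The `-f 4` filter is what implements the 2-3% divergence logic indirectly: reads within the aligner's identity tolerance of the host genome are mapped and discarded, and everything else passes through.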
Any other questions? Okay, so I think I covered this already, but basically the first step of pre-processing is to remove the barcodes and the PCR primers. Those are artifacts of your amplification, so you need to remove them from your actual amplicons before your analysis. This step is fairly well worked out, so I'm not going to drill into it too much, but again, if you have any questions, feel free to talk to me.

Okay, so once you remove the obvious adapter and primer sequences, you're left with, hopefully, what you were truly trying to amplify. But some of these amplicons will have low quality, so the next step is to remove low-quality reads. Both QIIME and mothur have fairly good protocols worked out for removing low-quality sequences, and one of the common parameters used is the minimum length of consecutive high-quality bases — in other words, after you cut out the low-quality region, how much of your original amplicon is left? If it's shorter than a certain proportion of the original amplicon, it might be discarded entirely rather than kept as a short fragment.

[Question: what if you have 454 data rather than Illumina?] Right — so if you have Illumina data, I would actually recommend using mothur: as you will see, mothur has the MiSeq SOP, which is designed to do the filtering and the paired-end assembly for you. I'll show that in a bit. If you want to use QIIME, I'm pretty sure QIIME now has an Illumina-based protocol to merge your sequences, and in our lab I'll show you one of the tools that can be fit into QIIME to do that. Both QIIME and mothur were developed when 454 was the platform of choice, so both of them have since adapted to Illumina for sure.

Okay, so besides the length of the read, you also look at the maximum number of low-quality bases — a long stretch of low-quality bases is another sign that the entire read might not be very good. Next is the number of ambiguous bases, also an indication of low quality. And you can set a quality threshold to scan through your entire sequence; if there are regions of low quality, you can discard the whole sequence. Here I'm just listing some of the tools that can be used. As I said, both mothur and QIIME have built-in filters, but there are other tools that you can use independently to do quality filtering and paired-end assembly — I'll show some of that later. FastQC is a very popular tool for summarizing sequence quality: all you have to do is feed it a FASTQ file and it gives you a summary report of how good your sequence quality is.
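Here is a minimal sketch of the read-level filters just described: a minimum length after trimming, a cap on ambiguous bases, and a sliding-window quality check. The thresholds are illustrative, not the QIIME or mothur defaults.

```python
# Toy quality filter applying the three criteria discussed above.
def passes_filters(seq, quals, min_len=200, max_ambig=0, window=50, min_q=20):
    """seq: base string; quals: list of per-base Phred scores (same length)."""
    if len(seq) < min_len:
        return False                      # too short after primer/adapter trimming
    if seq.upper().count("N") > max_ambig:
        return False                      # ambiguous bases signal low quality
    for i in range(0, len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_q:
            return False                  # a low-quality stretch inside the read
    return True

print(passes_filters("ACGT" * 60, [30] * 240))   # True: long and high quality
print(passes_filters("ACGT" * 60, [10] * 240))   # False: low quality throughout
```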
Okay, moving on to OTU picking. OTUs are essentially arbitrarily formed clusters based on sequence identity. 97% sequence similarity is the default cutoff for most of the programs, and that roughly corresponds to the species level — although, as I'll show later, there are publications indicating that 97% is really just a very rough guideline, and you really should be tailoring the threshold to the organisms that you're interested in targeting. I'll highlight three different approaches — de novo clustering, closed reference, and open reference — and there's more detail about OTU picking at this link here.

Okay, so I'll start with de novo clustering. This is conceptually probably the most straightforward to understand: you have a group of sequences that you want to cluster based on their sequence identity. You start with pairwise comparisons of all your sequences to establish how similar each pair is, and then you can use hierarchical clustering, basically starting from the most similar pairs of sequences and building up your clustering from there. Then, say you have a 97% cutoff: once you build your hierarchical clustering, any sequences that fall outside your similarity cutoff form their own cluster, so you end up with multiple clusters, each roughly 3% different from the others.

This process requires a lot of memory, because, you can imagine, if you have, let's say, a thousand sequences, that's a thousand times a thousand pairs of comparisons that you need to do. If you have a million, that's a million times a million, which is a trillion pairs of sequences that you're comparing. That could easily overwhelm even the largest machine you have, and this is typically the step where we run out of memory on our computers.

So besides hierarchical clustering, there's a faster, greedy algorithm, developed in a tool called UCLUST, which later became part of the USEARCH tool that we'll see in the lab. Instead of doing all pairwise comparisons, this type of approach tries to find a centroid for your sequences, and everything that falls within a certain distance of the centroid is grouped into the same cluster. So, for example, an OTU is represented by this particular sequence in the middle — that's what we call the representative sequence — and all of these others are the equivalent sequences for that OTU.
So when you call an OTU, you have to keep in mind that what you're essentially saying is that every sequence falling within the OTU is going to be treated the same, as far as your downstream analysis is concerned. That's why picking the threshold is quite important: you don't want to lump together sequences that shouldn't be treated as the same organism or the same taxon.

So how is the centroid chosen? Typically, you rank your sequences from the most abundant to the least abundant and go from top to bottom: you take the most abundant sequence as a centroid, scan through the rest of your sequences to see which ones fall within 97% of the sequence you picked, and so on. This is why it's called greedy: the one you choose first affects your resulting OTU clustering. There are different protocols — I'm just describing the most common way of doing it — and different tweaks to the protocol can improve your OTU picking.

[Question about the threshold.] Yes, the threshold T is user-definable — it's defined by you. Are there rules for it? Yes and no. 97% is the value people use for species-level comparison, and tools like USEARCH actually recommend that you don't go below 97%, because below that the algorithm becomes unreliable — we can talk more about why, but 97% is the short answer. Typically at 97% people say "species", but in reality that range is anywhere between 90 and 95% — it's really all over the place, and I have some stats for you later on.
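Here is a toy version of the greedy, abundance-sorted centroid clustering just described — the UCLUST/USEARCH idea. Real implementations use optimized alignments on sequences of varying length; assuming pre-trimmed, equal-length reads and plain Hamming identity keeps the sketch simple.

```python
# Toy greedy centroid clustering, abundance-sorted as in UCLUST.
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-trimmed reads."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seq_counts, threshold=0.97):
    """seq_counts: dict of unique sequence -> abundance. Returns centroid -> members."""
    clusters = {}
    # Most abundant sequences are considered first and become the centroids;
    # this ordering is exactly why the algorithm is called "greedy".
    for seq, _count in sorted(seq_counts.items(), key=lambda kv: -kv[1]):
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]   # no centroid close enough: start a new OTU
    return clusters

demo = {"ACGTACGTAC": 50, "ACGTACGTAA": 10, "TTTTGGGGCC": 3}
print(greedy_cluster(demo, threshold=0.9))
```

Running this puts the two similar sequences into one OTU under the first (most abundant) one, and the dissimilar sequence becomes its own centroid — which also illustrates the order-dependence the lecture mentions.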
The other point I should mention: if you look at this diagram, say the threshold T is 3% — that's the distance to the centroid. But if you look at this sequence here and that sequence there, they're obviously not 3% apart; that's more like 6%. So when you pick 97%, what ends up in your OTU could actually be as low as 93 or 94% identical. That's something to be careful about. One way to get around this issue is to do the OTU clustering twice: you first cluster at high stringency, 98 or 99%, and then you cluster again at the lower stringency. That helps to cut out some of the issues associated with using a very relaxed threshold to begin with.

Okay, so moving on to closed reference. This simply means you take your sequences and match them to an existing database of reference sequences. As shown in the picture here, your list of sequences is compared to an existing database; anything that has a match in the database is kept in your OTU table, and anything that doesn't match is simply discarded. This type of approach is fast, and it can be parallelized, because the matching can be done independently: you can take the first ten sequences to do the matching on one machine, submit the next ten to a different machine, and afterwards combine the results. Since the matching is just between your query sequence and the database, the other sequences in your data set have no bearing, no influence, on whether it matches or not — does that make sense? Whereas de novo clustering you can't parallelize easily, because the algorithm has to scan your entire input data set to establish the clusters. That's why closed reference is much faster, and it's suitable if you have a comprehensive and properly annotated database, such as the 16S databases. But it certainly doesn't work very well if your reference database is poorly annotated or very sparse: you're going to get a lot of sequences with no hits to your database, and in those cases you're much better off using either de novo or open reference, which we'll see next.

So open reference, simply put, is just a combination of a closed-reference search followed by processing of the unmatched sequences using de novo clustering: you take the rest of the sequences that don't match your database, do de novo clustering on them, and at the end you merge the two OTU tables. There are different ways of implementing this approach; suffice to say, if you have the time and want the most robust data set, this is the approach to use.

Okay, and I already mentioned representative sequences — typically the centroid of your OTU — and there are several ways of picking them. One is based on abundance: the most abundant sequence is likely to be the most relevant for your analysis. Another is to just use the centroid from the de novo OTU picking. You can also pick the longest sequence available in the given OTU; the caveat there, of course, is that if your longest one happens to be a chimera, or some sort of hybrid sequence that's not real, that could be an issue. Basically, if you know your amplicon is 250 base pairs long, you shouldn't be picking a representative sequence
that's 500 base pairs long. There are a few other methods that are less often used, but the most common is just to use the most abundant sequence as the representative.

Okay, so as I mentioned, during the PCR amplification process you can actually end up with chimeric sequences — artificially joined sequences from more than one template. For example, in this case here you have two templates, and you get a chimeric sequence that's partly from sequence X and partly from sequence Y. We'll see some of the tools that can be used to remove chimeric sequences; typically, detection is based on identifying a three-way alignment — this part aligns to X, that part aligns to Y — which is a good indication of a chimeric sequence.

All right, moving on to taxonomy assignment. It's important to differentiate OTUs from taxa. OTUs don't have names, but we as humans often like to give a name to an entity, right? Instead of referring to things as OTU 1 and OTU 2, it's usually much more meaningful for us to refer to something as E. coli or Salmonella or Bacteroides and so on. So, as a result, most of the time we end up assigning OTUs to known taxa in an attempt to make sense of the data. But bear in mind that the OTU, or the sequence, is not exactly the same as the taxon — there's the issue of resolution. Here's a good example: imagine you have different animals that you call animal A, animal B, and so on. Once you assign a given name, let's say "elephant", we commonly know what you're referring to. But there are always things like a mammoth, which looks kind of like an elephant — would you classify that as an elephant, or would you call it something else? The point I'm trying to get at is that the name and the actual entity are not exactly equivalent; keeping that in mind will help you understand the difference between an OTU analysis and a taxonomic analysis.

The process of assigning an OTU to a taxon is simply a similarity search: we map an OTU to a known taxon based on similarity, and again, the cutoff you use can affect your interpretation of the results.

[Question about OTUs that don't get assigned.] Right — so that's where understanding the difference between OTU and taxon is the key. You can do OTU clustering at a given cutoff and then do the taxonomic mapping at a different cutoff, so an OTU that's not assigned right away could still be placed into a taxon depending on the cutoff you're using. That's the key message here: the two are not exactly the same, and it's important to keep that in mind. And in the case where the OTU really cannot be assigned to a known taxon, you keep the OTU designation — there's not much you can do; you just call it OTU X and move on with your analysis.

[Question about comparing across markers.] Right — so once you assign things to taxa — let's say you use cpn60 for your taxonomic mapping and assign something to E. coli, and you also use 16S to classify your sequences and assign things to E. coli — then at the taxonomic level you can treat those as equivalent. Whether that's ideal or not — it's probably not — but it's possible to do that.
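Going back to the assignment step itself, here is a deliberately naive sketch of taxonomy assignment by similarity search: compare each OTU representative to a tiny mock reference set and take the best hit above a cutoff. Real pipelines use the RDP naive Bayes classifier or BLAST/USEARCH-style searches against Greengenes or SILVA; the reference sequences and cutoff here are made up for illustration.

```python
# Naive best-hit taxonomy assignment against a toy reference set.
def identity(a, b):
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n

reference = {  # hypothetical reference sequences -> taxonomy strings
    "ACGTACGTACGTACGT": "Bacteria;Bacteroidetes;Bacteroides",
    "GGCCGGCCGGCCGGCC": "Bacteria;Proteobacteria;Escherichia",
}

def assign(otu_seq, cutoff=0.8):
    best_tax, best_id = None, 0.0
    for ref_seq, tax in reference.items():
        ident = identity(otu_seq, ref_seq)
        if ident > best_id:
            best_tax, best_id = tax, ident
    # Below the cutoff the OTU keeps its anonymous label: OTUs and taxa are
    # not the same thing, and some OTUs simply stay unassigned.
    return best_tax if best_id >= cutoff else "unassigned"

print(assign("ACGTACGTACGTACGA"))  # close to the Bacteroides reference
```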
So once you assign things to a taxon, again, you're treating everything within the taxon the same, and then you can do the comparison that way. That will probably not be your primary comparison, but it can form a meta-comparison: when you're trying to compare across different environments, you can use different markers, and hopefully they give you complementary signals that allow you to characterize your environment better than a single marker would. That's the approach we took for the watershed project: we look at 16S, we look at a whole bunch of biomarkers, and we compare and contrast what the different biomarkers report as the important organisms. Actually, in a lot of cases we identified the same important organisms or species using different markers. But the key is, you have to assign things to taxa in order to do that comparison.

Okay, so here, just to show quickly: you have a set of OTUs, and the matching — or mapping — algorithm is important. As I mentioned, OTUs and taxa are different things, so when you report your analysis, you actually have to report the matching algorithm you used to go from OTU to the specific taxonomic assignment. Also, the different taxonomy databases can give you different results, as we'll see in the lab section, so keep that in mind. I think Rob mentioned the different databases already, so I'll skip over that.

Well — as I mentioned here, Greengenes is preferred by QIIME and SILVA is preferred by mothur. The key difference is that Greengenes uses a much shorter alignment and SILVA a much longer one. Your 16S genes are the same size, right? But the different alignment methods create templates that are shorter or longer, and the argument made by Pat Schloss is that the longer alignment is more accurate than the shorter alignment, which forces bases to align when they shouldn't. So if you believe Pat Schloss, you use SILVA; if you believe the QIIME group, you can use Greengenes.

[Question about update frequency.] Yeah — every six months or so. The database is updated every six months or so.
RDP is actually updated a little more frequently than that, because it's essentially an automated pipeline that pulls sequences from NCBI and converts them to their own taxonomy.

Okay, getting to the end — bear with me for another ten minutes or so. One caveat I want to mention, and this has come up a few times already: only 11% of the 150 or so human-associated bacterial genera have species that fall within the 95 to 99 percent 16S rRNA identity cutoff. So the 97 percent cutoff that we typically use really only applies to about 10 or 11 percent of the human-associated genera. That gives you a sense that 97 percent is really just a rule of thumb, and if possible you really want to explore the diversity within your species of interest before deciding what cutoff to use.

Okay, so once you do the taxonomy assignment, you typically get a bar graph that shows you the proportion of each taxon in your sample. That's pretty self-explanatory, but we'll see some of it in the lab.

All right, the next few steps fairly quickly. An OTU table is actually just a sample-by-observation matrix: you have the OTUs on one axis and the samples on the other, and you list the number of occurrences of each OTU in a given sample. That's essentially what an OTU table is. The OTU table can then be mapped to taxonomy information — you can have a separate file, or it could be in the same file, with an extra bit of information that says OTU 1 is, say, Bacteroides or some other organism, and so on. Typically, in the pipelines you will see that the extremely rare OTUs are filtered out; these are often attributed to sequencing errors, and they're removed to facilitate downstream analysis. This echoes my point from earlier: these really rare sequences, no matter how deep you sequence, you're often not going to catch them, so there's really not a whole lot of point sequencing a sample to huge depth unless you know for sure you're trying to capture some rare OTUs or rare organisms.
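To make the "sample-by-observation matrix" concrete, here is a sketch that builds a tiny OTU table from toy per-read assignments and applies the rare-OTU filter just mentioned. The sample names, OTU labels, and threshold are all hypothetical; real pipelines store this in the BIOM format instead of a text table.

```python
# Build a toy OTU table (samples x OTUs) and drop extremely rare OTUs.
from collections import Counter

assignments = [  # hypothetical per-read (sample, OTU) assignments
    ("stool_01", "OTU_1"), ("stool_01", "OTU_1"), ("stool_01", "OTU_2"),
    ("water_02", "OTU_2"), ("water_02", "OTU_3"),
]
counts = Counter(assignments)               # (sample, OTU) -> read count

samples = sorted({s for s, _ in counts})
otus = sorted({o for _, o in counts})

# Rare-OTU filter: OTUs whose total count falls below a threshold are often
# sequencing errors, so they're removed before downstream analysis.
min_total = 2
otus = [o for o in otus if sum(counts[(s, o)] for s in samples) >= min_total]

print("OTU\t" + "\t".join(samples))
for o in otus:
    print(o + "\t" + "\t".join(str(counts[(s, o)]) for s in samples))
```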
Okay, and this OTU table can be converted to the common BIOM format, which is a binary format — again, it speeds up the analysis by compressing the OTU information into a BIOM file.

Okay, so, sequence alignment. I think most of you here are familiar with sequence alignment, but it is the required step before you can do distance-based analysis or, for example, phylogenetic tree analysis. The traditional alignment programs that you might be familiar with, such as ClustalW, MUSCLE and so on, are way too slow to do multiple alignments of thousands, tens of thousands, or millions of sequences, so new types of methods have been developed, such as PyNAST, which we'll look at later on, and Infernal. These are called template-based aligners. Essentially, you can almost think of it as a BLAST or USEARCH search against a pre-aligned data set to establish the closest relatives of your query sequence. Once those are found, instead of aligning your sequence against thousands of sequences in the database or in your own data set, you only need to align it to a handful of templates pulled from the database based on similarity. In other words, it cuts down the number of sequences you have to feed into the alignment process, which dramatically speeds it up.

Okay, so once you have the alignment, you can use any of the phylogenetic analysis programs that you might be familiar with. The exception is when your data set is really large — then you might need to use programs such as FastTree, which uses a fast approximation of maximum likelihood to build your phylogenetic tree.

Okay, so I mentioned that we usually pool samples into the same run. As a result, it's very hard to get an even number of reads from each sample, but the sampling depth can actually affect your richness and diversity calculations. So the common current practice is to essentially rarefy your samples down to the lowest — or to an acceptable — number of reads among your samples. So if you have ten samples, and the lowest read count for a given sample is a thousand while the highest is ten thousand, then you're actually throwing away 9,000 sequences by rarefying to the lowest common read count. It's been shown that this type of approach, at least statistically speaking, is not very sound: you're losing a lot of your reads, and the diversity information you get from the artificial truncation of your read depth can actually affect the downstream analysis as well.
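Here is a minimal sketch of that traditional rarefying step — subsampling every sample, without replacement, down to the depth of the smallest sample. The counts are toy numbers; QIIME and mothur have their own rarefaction commands, and, as just noted, this approach has known statistical problems.

```python
# Rarefy each sample down to the smallest sample's depth.
import random

otu_counts = {  # hypothetical sample -> OTU -> read count
    "stool_01": {"OTU_1": 800, "OTU_2": 200},
    "water_02": {"OTU_1": 30, "OTU_2": 50, "OTU_3": 20},
}

depth = min(sum(c.values()) for c in otu_counts.values())  # 100 in this toy case

def rarefy(counts, depth, seed=42):
    pool = [otu for otu, n in counts.items() for _ in range(n)]  # one entry per read
    random.seed(seed)
    sub = random.sample(pool, depth)        # draw 'depth' reads without replacement
    return {otu: sub.count(otu) for otu in set(sub)}

for sample, counts in otu_counts.items():
    print(sample, rarefy(counts, depth))
```

Note how stool_01 goes from 1,000 reads to 100: that discarded 90% is exactly the loss the critique is about.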
It's actually There was it actually affected that the answering analysis as well So the the new approach that people are recommending is covariance stabilizing transformation Essentially control them if you have higher depths in your sampling usually that results in higher variance So by controlling the variance in your samples That's a one way of squishing down your high depth reads to a more to a more manageable I should say more comparable Value as the rest of your samples so there are a couple of references there that you can and look into for that type of approaches Okay, so quickly Rob already talked about alpha diversity So and and richness is just a number of species observed or estimated in your sample Evenness is a relative abundance of each samples diversity takes both evenness and richness into account and Suffice to say there are a lot of different diversity measures both implement in chime and mother and As to which one to choose I don't know if anyone has really a good explanation But the rule of thumb is typically try a few popular ones and hopefully they all give you the same stories and contradicting each other This is sort of a rare faction plot showing when you have close reference It's hard to read here. You your OTU count is much lower around 20 compared to when you have open reference with the novel Was a same sample OTU count is much higher when you have to know for open reference And the rare fraction curve essentially plots the Estimate number of OTU's other currents at a given sequencing depth So as your sequencing depth goes up, you would expect that the curve Would go up as well and what you're hoping to see is that the curve essentially? Evens out with plateaus out that means you reach the saturation of your diversity Okay, beta diversity is comparison across samples or across environments and again multiple Different measure measures for those the key differences is probably once that measured the membership of your Samples so looking at OTU abundance looking at the OTU presence or absence only versus once I look at the structure of your Community so that looked at the relative abundance of OTU's and these are some of the common measures used Okay, so for beta diversity analysis you first need to do pairwise comparison across your samples and establish a distance matrix for all your samples so the distance of course will be The similarity will be the highest say one between your samples or the distance will be zero It's typically a score between zero and one So here I'm showing the similarity So one simple one against itself has the highest similarity Simple two against simple one has lower similarity and so on so forth. So by establishing a matrix like this then you can transform it to Pico a plot or other types of plot to look at the relationship of your of your samples Okay, so unifrack is one of the popular beta diversity measures the way it works essentially you map all your OTU's to a phylogenetic to the same phylogenetic tree and You sum up the branch lengths that are unique to each samples. 
So for example in this case the purple ones shows the The branches are unique to the red samples versus the green samples and when you have a very different community then the You'll see a much longer unique branch So the the the larger the unique fraud score the the more that the more different the communities Against each other the two versions weighted versus unweighted again correspond to so the weighted one Correspond to the abundance measure and the unweighted one correspond to the presence or absence of OTU's Okay, so the way I like to describe these principle coordinate or component analysis and Rob sort of put me on spot to try to explain I had to ask him what the huge what the differences are and I'll get into that a little bit but conceptually what we want to do is that when you have a multi Variable high dimension Sample it's very hard to visualize that On a graph we're much more familiar with two-dimensional visualization or three-dimensional visualization so in order to view multi-dimensional data the distance with the the distance in terms of the p-co-a principle coordinate analysis or the Covariance in terms of the the principle component analysis essentially it's projected into a lower dimension And the most common of course is two or three dimensions so you can actually view it in in a graph I don't know if anyone tried to plot four-dimensional graphs But if you figure out how to do that, let me know probably some sort of time-serial analysis But anyway anything five-dimension. I don't think it's it's imaginable, but just imagine you have 3d product 3d object You want to project it such that you can still tell the salient feature of that object Right, so this is the chair if you project it this way you can still tell it's a chair But imagine if you project it this way Now all a sudden that could be a ladder or could be something else right that gives much less information than this particular projection so the goal of these projection is to maximize the the To maximize the variations with the salient features that can explain your data set In the lower dimension so clear. 
Okay, so here's a plot of a principal coordinate analysis. It's a two-dimensional plot showing that these samples are different from this set of samples along the first dimension, and the second dimension further separates the blue ones from the yellow ones. That's how you read a PCoA plot — basically dimension by dimension. And hierarchical clustering can also be done on the same data set.

[Question: what's the biggest difference between PCA and PCoA?] So, PCoA starts from distance information. When you have distances, you need to project them onto a fixed coordinate system in order to interpret the result — so PCoA is essentially a projection of your distance matrix into a fixed coordinate space, and from there you do what amounts to a principal component analysis, trying to identify the covariance in your data and project it so that the maximum variance — the salient features — is shown in your graph. And PCA itself — I don't know if anyone has a better explanation — is a decomposition of the covariance in your data, with the basic idea of trying to maximize the features that separate your data sets the best. So yes, essentially PCoA is projecting the distances into fixed coordinates and then doing principal component analysis, and you still end up with the components of your vectors.

Okay, so, hierarchical clustering. You can input the same data set and get a hierarchical clustering. The key difference here is that your samples are now forced into a bifurcating tree, and a lot of the time the samples are not related in a bifurcating manner, right? The ordination approach looks a lot more natural than the bifurcating method. You can still see the yellow ones and the blue ones sort of separate out in the tree, but as humans we tend to over-interpret the relationships in a bifurcating tree, so I think fewer people are using hierarchical clustering to show ordination results.

Okay, so, marker gene versus shotgun sequencing. As I mentioned, Morgan will talk more about shotgun metagenomics tomorrow, but suffice to say: marker gene analysis is still cheaper and computationally less demanding. It provides taxonomic profiling rather than both taxonomic and functional profiling, and you need software such as PICRUSt to map from taxonomy to predicted functions, which you'll see tomorrow. With 16S, the majority of reads can be assigned, whereas you end up with a lot of unassigned gene fragments when you do shotgun metagenomics. And again, contamination is more of an issue in shotgun than in marker gene based analysis. What's worth noting is that
when you look at the taxonomic classification, the samples can look quite different, but when you look at the functional level, the same samples now look much more even. One way to interpret that is that different bacteria actually encode similar functions and fill similar niches. So sometimes it's a bit advantageous to look at the functions, because you probably get a more consistent output — a more consistent pattern — than looking at the taxa alone. Morgan will cover PICRUSt tomorrow, so I'll skip that.

Okay, so we'll take a quick break and then move into the lab. Any questions? [Audience question.] Yes — you mean within a genome? No, it doesn't, so you can correct for that if you wish.