So you get me again, this time for the 16S marker gene analysis part of the lecture, and we'll also have a bit of a hands-on tutorial after this. The subsequent modules will have hands-on tutorials too, so you can actually try out some of the tools we cover in each module. The learning objective for this module is to understand and perform marker gene-based microbiome analysis. Specifically, we'll focus on 16S ribosomal RNA: we'll use this marker gene to profile and compare different microbiome samples, we'll look at some of the parameters that may affect your marker gene analysis, and we'll touch on the advantages and disadvantages of marker gene-based microbiome analysis, although you'll get a lot of that in the subsequent sessions comparing marker gene-based analysis to shotgun metagenomic analysis.

So the general process is to extract DNA from your samples, amplify with targeted primers, and once you get your sequences, filter the errors and build OTUs or ASVs. Then you can carry out statistical analysis, looking at the diversity within and across your samples, and looking at differential abundance of different features in your samples. When I say features, that roughly corresponds to the subpopulations in your samples, so each OTU or each taxon is considered a feature.

Okay, so why do we use ribosomal RNAs? There are several good reasons. One is that they're universally present in all living organisms, allowing you to compare different communities using a single marker gene. They also play a critical role in protein translation, and because of this functional constraint they're relatively conserved and rarely acquired through horizontal gene transfer, so they carry a phylogenetic signal that allows you to reliably relate organisms to each other. They also behave like a molecular clock, in that they have a relatively stable rate of mutation, so again they're useful for phylogenetic analysis and have been used to build trees of life relating all organisms to each other. And 16S, due to its size, has been the most commonly used. So we use the 16S rRNA gene as a proxy to understand the microbial community. It's a tool that allows you to place organisms on a phylogenetic tree and to understand the composition of the microbial community, and we'll look at how to do that using QIIME 2 later today. It's also a tool that allows you to compare one community to another, which is referred to as beta diversity. And in Rob's lecture, you'll also see how to relate the different microbial features to the characteristics you're interested in, such as obese versus lean populations, or disease states such as IBD versus no IBD, and to find differentially abundant (or, more precisely, differentially proportioned) features that are associated with the properties you're interested in.

16S is not the only marker gene used, and depending on the organisms you're interested in, there are different marker genes available. For eukaryotes, the commonly used markers are ITS and 18S. One reason people use these particular marker genes is that databases have been established for them, allowing you to compare your data to reference databases. And for bacteria, some of the other markers used include cpn60, which evolves faster than 16S and so provides better resolution than 16S for closely related organisms.
Some of the other genes proposed, recA for example, are single-copy genes, which means you don't have to deal with the copy-number issues you get with markers such as ribosomal RNAs, which are present in multiple copies per genome. For viruses there's no real universal marker gene, but different types of viruses have different markers available to them, as the community establishes a standard single conserved gene for a given type of virus. For example, g23 has been used for bacteriophage comparisons, and RdRp, the RNA-dependent RNA polymerase, has been used for different RNA viruses. The publication here has quite an extensive list of marker genes that have been used by researchers for different organism types.

So, a few considerations for selecting an appropriate marker gene. First, it should have sufficient resolution to differentiate the subpopulations of organisms you're interested in. If you're studying, say, different strains of a given bacterium in a single host, then 16S would not give you the resolution for strain-level differentiation, and you might need much faster-evolving markers to do so. In the case where you're interested in taxonomic information, a reference database of known species or known organisms is needed for taxonomic assignment, so availability of a good reference database for the samples you're interested in would be necessary. Single-copy genes, as I mentioned, are preferred, but that's not always possible; and certainly for historical reasons ribosomal RNAs, which are not single copy, have been the most common markers for microbiome studies. When comparing across studies, you'll need to use a standardized marker, such as 16S. But more importantly, if you're just sequencing a subregion of 16S using a short-read platform such as MiSeq, then the different variable regions can also give you different resolutions, so you need to use consistently the same region when you're comparing across data sets. As Pauline mentioned, experimental protocols can also affect your results significantly, and as mentioned earlier, the different bioinformatic pipelines can also affect your results, hence the importance of keeping track of what you did.

These couple of slides I kept in here for historical reasons, as a reminder, because a lot of people ask questions about DNA extraction and about contamination; that's why we had a separate lecture today covering some of these topics in more depth. Here I've just listed some references on issues with DNA extraction. One thing that has been pointed out is that it's possible to carry out some sort of fractionation to separate out different organisms based on cell size or other characteristics, and it's also possible to select for certain fragment sizes using an automated gel extraction process; if you're interested in that, there's a video journal article on the topic. And just to reiterate: controlling for contamination is a major issue in microbiome studies, where you don't really know the composition of the microbial community, so as Paul mentioned, there are different controls that need to be used throughout your experiments. Also, when it comes to amplicon-based studies, if your target is really low, you can get non-specific amplification, primer dimers, and so on.
So when you have low yield or low input for your target DNA, you need to be especially careful about the controls used to verify that your observation is real. This is the typical anatomy of a target amplification construct. Say you're interested in the V4 region, hypervariable region 4, of the 16S ribosomal RNA gene. Typically you would design a PCR primer that anneals to the conserved regions flanking the region you're interested in, but a nifty way of doing a single-step PCR instead of a multi-step PCR is to attach the sequencing adapters, in this case the Illumina P5 and P7 sequencing adapters, to your PCR primer, so you can amplify and sequence the target without having to do a two-step PCR amplification. In addition, given the throughput of modern sequencers, we will typically multiplex multiple samples in one run, so there's a barcode index that uniquely identifies each sample, and typically that's also incorporated as part of your primer design. This is the reference design published quite a few years ago by Rob Knight's group and adopted by Illumina, but nowadays there are quite a few different amplification protocols available.

Okay, so I think I mentioned most of these already: in your PCR primer design process, you need to be careful of inhibitors in your samples that may inhibit the PCR, so adding an internal positive control to make sure the PCR worked is important. In some complex samples you can end up amplifying non-target DNA, so downstream in your bioinformatic analysis it's useful to verify that you did indeed amplify the right targets, and sometimes you might want to run your samples on a TapeStation or on a gel to verify what you have.

Okay, so I mentioned there are different variable regions in 16S ribosomal RNA. Historically, V4 was chosen because it's the right size for Illumina paired-end sequencing; at that time reads were 150 bp, but now they're 250 bp, so the target has been extended to usually cover the V3 and V4 regions. And again, sticking with the same protocol for your entire study is important. As you can see in this little graph here, the different variable regions have different levels of conservation: the y-axis shows the average proportion of the most dominant base, so the lower the peak, the less conserved that particular V region is, and you can see V4 is actually not the most diverse region, generally speaking. This paper is a study that looked at the different variable regions and noted that they selectively pull out, or highlight, different fractions of your microbial community, different subpopulations. For example, V1 and V3 favor Prevotella and the organisms listed here, whereas V4 and V6, for example, selectively pull out a slightly different set of organisms, such as Campylobacter and Enterococcus, which are more gut-associated organisms, and so on. It's also worth mentioning that certain bacteria shown in the graph, for example Fusobacterium, can be missed, because some of the primers used for these regions, even though they're supposed to be universal, still have biases, so they might under-represent certain populations of bacteria.
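To make the fusion-primer anatomy described above a bit more concrete, here is a rough Python sketch of how the pieces are laid out in a one-step design. The adapter, index, and pad sequences are placeholders I've made up for illustration, and the 515F sequence shown is just a commonly cited V4 forward primer; check everything against your actual protocol, since adapter, index, and pad sequences vary between designs.

```python
# Sketch of an Illumina-style one-step fusion primer for 16S V4 amplification.
# Sequences below are illustrative; verify against your own protocol.
P5_ADAPTER   = "AATGATACGGCGACCACCGAGATCTACAC"  # commonly listed P5 sequence (verify)
SAMPLE_INDEX = "ACGTACGT"                       # hypothetical 8-nt sample barcode
PAD_LINKER   = "TATGGTAATT"                     # hypothetical pad/linker
PRIMER_515F  = "GTGYCAGCMGCCGCGGTAA"            # commonly cited V4 forward primer (verify)

def build_fusion_primer(adapter: str, index: str, pad: str, locus_primer: str) -> str:
    """Concatenate the components in 5'->3' order, as in a one-step protocol."""
    return adapter + index + pad + locus_primer

if __name__ == "__main__":
    fwd = build_fusion_primer(P5_ADAPTER, SAMPLE_INDEX, PAD_LINKER, PRIMER_515F)
    print(len(fwd), fwd)
```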
Okay, so a current MiSeq run gets you roughly 25 to 30 million paired-end reads. Depending on your study design, the diversity of the microbial populations you're interested in, the differences between your sample types, and essentially the effect size you're after, you would need to design your experiment to sample to different depths. In most cases, if the communities you're comparing are quite distinct, a few hundred to a few thousand reads is sufficient to differentiate them, but if you're looking for rare biomarkers, or for differences between very similar communities, then you will need to sequence to much deeper coverage. In most cases, the current recommendation for marker gene analysis is still about 10,000 to 100,000 reads or so, which should give you enough coverage to characterize your microbial population; if you're sequencing much deeper than that, the returns diminish and it becomes less cost-effective. So, as a result, you will use unique barcodes to differentiate your samples and multiplex them in one run.

The other issue associated with sample multiplexing, or sequencing amplicons on a MiSeq in general, is that because of the way the images, or signals, are processed on the MiSeq, having very similar sequences creates difficulties for the machine. You can imagine that if a given base position is identical across all the clusters, the image gets either very bright or very dark depending on the cycle, because the machine takes a snapshot of whichever base lights up during that cycle; in other words, the reads are all synchronized. In that situation, you typically need to introduce some diversity into your sequencing pool, either by adding some reference sequences such as PhiX, or alternatively by sequencing different marker genes in a single run to diversify the sequence pool.

And, as I mentioned, because you're often processing a large number of samples when doing a survey with a marker gene-based approach, the convenience of one-step amplification followed by sequencing is significant compared to the two-step approach. However, the two-step approach allows you to attach the same sequencing adapters to different types of marker genes. Using that protocol, for example, you can mix 16S with 18S and with other marker genes and look at different fractions of your microbial population. For example, when we were doing a study looking at what kinds of microorganisms, including viruses, are in water samples, we typically used the two-step amplification process: first use targeted primers to amplify the marker genes, then attach the same sequencing adapters to all the different markers and sequence them all in one go. As I mentioned, this has the added advantage of diversifying your sequencing pool and therefore improving the quality of the sequencing run.
Okay, so now moving into the different analysis platforms we can use for marker gene analysis. In this workshop we'll mainly be using QIIME 2, which is a redevelopment, or refactoring, of the very popular software called QIIME. So how many of you have used QIIME 1 or 2 before? Okay, so about half. How many have used QIIME 2? Okay, I'll get into that a bit later. So QIIME 2 provides a nice, cohesive platform to run your analysis in, and we'll see that in the demo session. But there are also scripts that you can download and run, and one of the popular ones is from Morgan's group, called Microbiome Helper; Gavin, I guess you're currently the maintainer for that? Sorry? Yeah, okay. So this is a set of scripts that you can download and use to process your marker genes. And there's a third one that was popular, and is still being used, but is less popular these days, called mothur. How many of you have used mothur? Okay, the same people. So I guess you've tried both mothur and QIIME; which one do you prefer? QIIME? Okay. So mothur was very popular before, because the first version of QIIME was really quite difficult to use: it's essentially a collection of scripts that you have to run, and sometimes the entire workflow isn't nicely connected, whereas mothur provides a uniform platform and a uniform way of entering commands, so it's much easier to learn and much easier to use. But it has fallen out of favor a little bit, and the next slide talks about that. You can also build your own custom workflow by coupling together scripts, command-line tools, and other pieces of software. How many of you have done that, building your own workflow for marker gene analysis? Just one or two? Okay, that's good to know; hopefully the tutorial session will be useful to many of you.

All right, so I want to do a bit of a comparison between QIIME and mothur, since, as I said, mothur is less used these days, for several reasons. One is that QIIME 2 has vastly improved its user interface. It's still command-line based, but the commands are now standardized, and as you'll see later, it can help you keep track of the steps you take in QIIME 2. Mothur has always done that for you, in a sort of log file that keeps track of the steps, and some people find it nice to be able to know each step of your analysis. The key difference between the QIIME approach and the mothur approach is that QIIME has always existed more as a wrapper around tools: it takes existing tools and adds them as plug-ins to the QIIME environment, whereas mothur typically re-implements popular algorithms and, as a result, was able to provide a more cohesive user experience. Both QIIME 2 and mothur are now very easy to install, typically involving downloading the files and running a single command to install them on your system. In the case of QIIME 2, the installation script will pull the necessary files and all the dependent programs for you automatically, a vast improvement over the previous version of QIIME.
The one downside of QIIME 2, however, is that because it's a re-implementation, a rewrite of the original QIIME, it's still missing quite a few functions that were available in the original QIIME but haven't been ported over to QIIME 2. Some of those might be historically used and no longer relevant, but others, such as visualization of phylogenetic trees, would be nice to have natively incorporated in QIIME 2 as well; that's not the case at the moment, so you need external viewers to view some of your results. The other key difference is that, for whatever reason, mothur hasn't implemented the ASV-based approach. I'll talk about the differences between ASVs and OTUs shortly, but ASVs have been proposed as a more accurate way of denoising your samples and identifying subpopulations in your sample. Does anyone know why mothur chose not to incorporate an ASV-based approach? I haven't been able to find the reasons for it. Okay, yeah, I've found a few blog posts, or rants really, by Pat Schloss that still defend the OTU-type approach, but for whatever reason mothur hasn't implemented an ASV-based approach. Right, so most of you have heard of OTUs before, and for historical reasons they're still covered in this workshop, but we're also introducing a newer concept, the amplicon sequence variant, as a new approach to identifying features in your samples.

So this is the general bioinformatic workflow that we will go through today in the workshop. Typically you start with one or more FASTQ files; in some cases they're still multiplexed, in other cases they've already been demultiplexed. There are different scripts or processes that will allow you to take multiplexed samples, demultiplex them first, and then run them through the workflow here, or, if your samples are already demultiplexed, to associate the metadata with each of the demultiplexed samples and then carry them through the workflow. As I mentioned, the metadata need to be present as part of your analysis, and there are tools that will help you manage your metadata and make sure it's in a standardized format that's acceptable to QIIME.

Okay, so the first step is typically to remove the primers and adapters found in your sequences, if that hasn't been done for you already. A lot of sequencing centers will give you demultiplexed data back rather than the raw data, and in that case the adapters and primers are often already removed. So when you get your data back from a sequencing center, it's best to check with them what they've done to your sequences before deciding what to do yourself. So the preprocessing is: remove the primers, demultiplex, and also look at the sequence quality and remove any reads of low sequence quality. Optionally, you can also check that your samples consist of only the targets and not other non-specific amplifications, so there can be a decontamination step in the preprocessing. Once you finish the preprocessing, the next step is called feature identification. This is where you either collapse all your duplicated reads, combine your reads into OTUs based on some predefined cut-off such as 97%, or use the ASV approach.
Essentially, what you're trying to do is define the subpopulations in your microbial community, and each subpopulation is called a feature in your analysis. Once you have the features, you can carry out taxonomic assignment for each feature by comparing it to reference sequences based on sequence similarity and assigning a taxonomic name to your sequences. In the same step, you can build your feature table, which I'll explain shortly; the feature table can contain both named and unnamed features that you want to process and test for differential abundance downstream. Alternatively, you can take those sequences and do phylogenetic analysis to build a phylogenetic tree; but to do that you first need to align your sequences, so you can generate the distance matrix used for building the tree. Both feature tables and phylogenetic trees can then be used as input for downstream analysis.

Okay, so we'll look at each of these steps a little more closely. For the preprocessing, as mentioned already, the MiSeq allows you to multiplex multiple samples in a single run, so the reads from each sample need to be linked back to their sample, and this is done using unique barcodes. The demultiplexing step often also removes the barcodes and the primer sequences, and QIIME has several different scripts to help you do that; this is outlined in the importing tutorial. I should also mention that mothur and other tools have equivalent demultiplexing functions that will work on your samples.

Now, about quality filtering. Once you demultiplex your samples, some of the reads will be of lower quality, and QIIME filters the reads based on some quality parameters. QIIME 2 and QIIME 1 have very similar parameters, but it actually took me a while to find the statement about how QIIME 2 filters by default. Essentially, it scans your read for low-quality bases and, if the number of consecutive low-quality bases is greater than three, it truncates the read at that point from the 3' end; the minimum Phred quality score required for a base to be considered of sufficient quality is 4. If the truncation goes too far into your read, in other words if more than 25% of the read would be truncated, then it drops the entire read. And because some of the downstream denoising steps do not tolerate ambiguous bases, by default QIIME 2 also filters out any read that has ambiguous bases, in other words Ns, in the sequence. So that's the default filtering in QIIME 2, but there are other quality filtering tools available, and often the sequencing center, instead of running QIIME 2, will use some of these standalone tools to identify and remove the adapters and primers, and in some cases trim low-quality reads for you. So again, it's important to find out from your sequencing provider what they did to the sequences if they did not give you the original multiplexed FASTQ file. And there's a tool called FastQC that's useful for summarizing your sequence quality.
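To make that default filtering rule concrete, here is a minimal Python sketch of the logic as just described: a run of more than three consecutive bases below Phred 4 triggers truncation, reads that lose more than 25% of their length are dropped, and reads containing Ns are dropped. This is my own illustration of the described behaviour, not QIIME 2's actual code, and the parameter names are assumptions.

```python
def quality_filter(seq: str, quals: list[int],
                   min_q: int = 4, max_bad_run: int = 3,
                   min_len_fraction: float = 0.75) -> str | None:
    """Truncate a read at the first run of > max_bad_run consecutive
    low-quality bases; drop it if too short afterwards or if it has Ns.
    Returns the (possibly truncated) sequence, or None if discarded."""
    bad_run = 0
    cut = len(seq)
    for i, q in enumerate(quals):
        if q < min_q:
            bad_run += 1
            if bad_run > max_bad_run:
                cut = i - max_bad_run      # truncate where the bad run started
                break
        else:
            bad_run = 0
    trimmed = seq[:cut]
    if len(trimmed) < min_len_fraction * len(seq):
        return None                        # lost more than 25% of the read
    if "N" in trimmed:
        return None                        # downstream denoisers dislike Ns
    return trimmed

# Example: a read whose tail collapses in quality gets truncated, not dropped.
print(quality_filter("ACGTACGTACGTACGTACGT", [38] * 16 + [2, 2, 2, 2]))
```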
We'll see an example of a FastQC report in the tutorial session. Okay, so I mentioned some of the amplification might be non-specific, off target. So you might want to take a subset of your sequences and search them against 16S or whatever reference database you have, to make sure they're on target. There are also ways to look for host contamination, for example by searching against host databases, the human genome if you have a human sample, just to verify that your reads did not come from the host. Some tools, especially machine learning-based classifiers, will happily take any random sequence and spit out a prediction for you; if you give them non-16S sequences, these classifiers might not tell you that it's not 16S, they'll happily make a prediction anyway. So if you're not sure that all your sequences are on target, it's best to do some spot checking to make sure.

[Question from the audience] Maybe I'll jump to a diagram; actually, it's in my other slide deck. In QIIME, what you can do is visualize the average quality per base position, so visually you can see where the quality drop-off is, and QIIME will let you specify that you want to cut at that position; you can do it at both the 5' and the 3' end. So it's a visualization of the quality, and you visually decide where to cut. I don't know if it has an implementation of the sliding-window approach to trimming; I didn't see it. The old version does, but I don't think QIIME 2 has that implemented yet, so I guess it's relying on the visual approach. PRINSEQ, Cutadapt, and Trimmomatic all have similar functions that allow you to trim back the low-quality ends. Some of them let you specify the parameters; all of them will trim for you, but the sliding-window calculation, you say that's in PRINSEQ, right? Yeah. So you have ways of specifying which features or which reads you want to remove from further analysis.

Okay. So just to reiterate, the goal of marker gene analysis is that we're using these marker genes as proxies for microbial community profiling, to understand who's there and to know the relative abundance of the subpopulations. So we want to be able to divide, or partition, the community into subpopulations based on their sequence variation. As I mentioned, OTUs use a hard cut-off at a fixed percentage, typically 97% identity. The real goal of the bioinformatic analysis, though, is to distinguish true variants from artificial variants that are due to PCR amplification or sequencing errors; in other words, to try to reconstitute what's in your samples regardless of the PCR and sequencing errors. Using a hard cut-off, as the OTU approach does, simply isn't very accurate. If we can reconstitute the subpopulations, then we have a better chance of understanding the subpopulations' biological and functional roles in the community. It's also maybe important to stress that the traditional OTU approach assumes that every sequence in an OTU is treated the same.
Just like when we say a given species, or a given strain of bacteria: you know that it's a population of bacteria, but when you give it a label, you essentially treat that whole population the same. So the assumption of an OTU is that all individuals in the OTU have the same role or function in the community, and certainly when you do downstream statistical analysis, you're treating each OTU as the minimum unit of analysis.

Right, so as I mentioned already, OTU picking is based on an arbitrary sequence identity, and 97% has been shown to roughly correspond to the species level. But it's actually known, from studies looking at reference genomes, that the 16S divergence for a given named species ranges somewhere between one and five or six percent, and also that the different copies of the rRNA genes within a single genome can have up to one to two percent variation as well, making this approach not very accurate.

Okay. So QIIME supports three different OTU picking algorithms; this is the original QIIME. The new QIIME can still do this, but it has actually moved away from OTU picking toward ASVs, which I'll talk about next. The three approaches are de novo clustering, closed reference, and open reference. [Audience question] Yes, the OTUs there are clustered at 99% sequence identity. Based on my experience so far, with the OTU-based approach you really have to understand your microbial community to pick the right cutoff, whereas the ASV-based approach, which doesn't pre-specify a cutoff, gives you, I think, a more accurate representation of your microbial community. By the way, the slides I have here are updated from what you might have, so we'll upload the new slides later this afternoon.

Okay. So for de novo clustering, what you're essentially doing is taking all the reads, comparing each read against all the other reads, and creating a pairwise comparison table. Based on the percent identity between two reads, you start grouping them, typically using a hierarchical approach. What that means is that you take the first two sequences that are, say, 99% identical, group them together, and then find the next sequence that's within 99% identity of the average of this group. If the next sequence falls outside the 99% (or 97%) cutoff, you create a new OTU, a new group, and then again find sequences that are within the cutoff of what's called the seed sequence of that OTU. Because you're doing pairwise comparisons to set up the analysis, this approach requires a lot of disk and memory space. People have come up with ways of subdividing your reads into smaller groups and only doing comparisons within each group. One approach, for example, is to first do a taxonomic analysis of your samples and then, based on high-level taxonomic information, at the family level or even the genus level these days, subgroup your reads into different families and build your OTUs within each family, to reduce the number of sequences you have to compare pairwise. There have also been studies to figure out what type of distance measure within a cluster is the most robust.
According to Pat Schloss, who is the developer of mothur, average linkage clustering is the most robust when you change the input data. But again, as I said, the OTU approach is generally falling out of favor now.

Okay, so this is the clustering approach. As I described, a greedy algorithm is used to achieve time and space savings. This is achieved by clustering incrementally, using a subset of sequences as centroids. When you first encounter a sequence that falls outside of the existing OTUs, you make it the new centroid, and then you identify other sequences that are similar to that first sequence you encountered. But because you're taking the first sequence you encounter as a centroid, what effectively happens is that if you permute, or change, the order of your sequences, a different centroid will be picked the next time you run the analysis. As a result, you can actually create unstable clusters just by changing the order of your sequences, and that's one key disadvantage of this greedy algorithm-based approach to clustering. There are some existing tools that try to get around these issues, for example by first sorting your sequences from longest to shortest, or by favoring the longest sequence as the centroid, and so on: different heuristics to improve the stability of your clustering.

In contrast to the de novo clustering approach, closed reference essentially matches your sequences to an existing database of reference sequences, so it's very much a way of assigning taxonomic information to your sequences. The downside of this approach is that if your sequence is not found in the database, it's discarded rather than used for downstream analysis. This is quite fast, because you're just comparing your list of sequences to a reference database, and you can also subdivide your input sequences and parallelize the process across different computers; each subset of your sequences can be run on a different node in a cluster, because the mapping of a sequence to a reference read does not depend on the other sequences in your data set, so each can be determined independently. This is suitable if the samples you're studying have a really good reference database, in other words a well-studied sample type. For example, a lot of reference genome sequencing has been done for the human microbiome, so it has a pretty good set of reference sequences available for comparison. Contrast that with environmental or animal studies, where you have much less information about the genomes found in your samples and certainly far fewer reference genome sequences available. So closed reference is generally good for well-studied sample types but not very good for novel sample types. The approach called open reference essentially combines de novo OTU calling with the taxonomic mapping approach and gives you an OTU table that consists of both unnamed and named taxa.

Because your OTUs actually represent collections of sequences, and when you're doing analysis you can't analyze every sequence in a collection individually, the next step is to choose one sequence in each cluster, each OTU, as the representative sequence.
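Here is a minimal Python sketch of the greedy, order-dependent centroid clustering just described, including picking the most abundant member as the representative sequence. The identity function is deliberately crude (position-by-position matches on equal-length strings, no alignment), so treat this purely as an illustration of why reordering the input can change the clusters, not as a stand-in for real tools.

```python
from typing import Dict, List

def identity(a: str, b: str) -> float:
    """Crude identity for equal-length sequences (no alignment; illustration only)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(reads: List[str], threshold: float = 0.97) -> Dict[str, List[str]]:
    """Greedy centroid clustering: each read joins the first centroid it is within
    `threshold` identity of, otherwise it becomes a new centroid. Because centroids
    are picked in input order, shuffling `reads` can change the resulting clusters."""
    clusters: Dict[str, List[str]] = {}          # centroid -> member reads
    for read in reads:
        for centroid in clusters:
            if identity(read, centroid) >= threshold:
                clusters[centroid].append(read)
                break
        else:
            clusters[read] = [read]              # new OTU; this read is the centroid
    return clusters

def representative(members: List[str]) -> str:
    """Pick the most abundant member as the representative sequence."""
    return max(set(members), key=members.count)

if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTACGTAC"]
    for centroid, members in greedy_cluster(reads).items():
        print(representative(members), len(members))
```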
You can use that representative sequence to generate a phylogenetic tree, for example, or to calculate a distance matrix. And because you're picking a single representative sequence for a collection of sequences, or a collection of organisms, the downstream analysis treats all organisms in one OTU the same: you reduce a group of organisms down to a single representative sequence plus the number of equivalent sequences in that OTU. The representative sequence can be picked based on some predefined criterion; often it's the most abundant sequence in the OTU, or the longest sequence, and sometimes people use the sequence at the center of the OTU as the representative. But again, how you pick the representative sequence can affect your downstream analysis.

Because of this problem, a couple of years ago a new way of denoising, a new way of identifying features, was introduced, called the amplicon sequence variant. This is an attempt to avoid an arbitrary dissimilarity threshold, and the goal is to distinguish true sequence variants from artifacts introduced during sequencing and, to a much more limited extent, during the PCR process. The biggest disadvantage of this approach is that it's not able to catch early-round PCR artifacts. Imagine you have an early-round PCR error that's propagated through the PCR process and generates multiple copies of itself; this type of algorithm is not able to differentiate that from a true variant in your sample.

So DADA2 is one of these approaches; it stands for Divisive Amplicon Denoising Algorithm. Conceptually, how it works is this: imagine your starting material has these four different species, let's call them four different taxa, and the size represents the abundance of each taxon. In the amplification and sequencing process you might introduce some minor variations due to sequencing errors. So, for example, these rare ones are derivatives of this original population, and as a result you get these small red dots around it; same thing for the green, same thing for the blue, and so on. With the OTU approach, you're essentially drawing fixed-radius circles around these in an attempt to hide the artifacts derived from sequencing errors or PCR errors. What DADA2 attempts to do is, statistically and using customized error profiles, to fold these minor populations caused by sequencing errors back into their original populations, re-establishing the original population structure.

This is one of the supplemental figures from the paper, and it's a little hard to see here, but essentially it's comparing the ASV approach to the OTU approach: one axis shows the frequency, basically the relative abundance, of a feature, and the other shows the distance, the divergence, across these features. Importantly, this is a mock community, so the original population structure is known. What you're seeing here, for the blue symbols, are features generated by DADA2 compared to the OTU approach. What you see is that for features that are more similar to each other, in other words more closely related taxa, whether abundant or rare, DADA2 is able to differentiate them better than the OTU-based approach.
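Before getting to the accuracy results on the mock community, here is a toy sketch of the denoising intuition: rare unique sequences that sit within one mismatch of a much more abundant sequence get folded back into that parent. This is not DADA2's actual error model (which statistically fits run-specific, quality-aware error rates); the fold-change and mismatch thresholds here are arbitrary assumptions, purely for illustration.

```python
from collections import Counter

def hamming(a: str, b: str) -> int:
    """Number of mismatches between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def toy_denoise(reads: list[str], min_fold: int = 10, max_mismatch: int = 1) -> Counter:
    """Fold each rare unique sequence into an already-accepted sequence that is
    within `max_mismatch` of it and at least `min_fold` times more abundant.
    Purely illustrative; real denoisers model quality-aware error rates."""
    counts = Counter(reads)
    ordered = [seq for seq, _ in counts.most_common()]   # most to least abundant
    denoised = Counter()
    for seq in ordered:
        n = counts[seq]
        parent = None
        for cand in denoised:                            # candidates: accepted variants
            if denoised[cand] >= min_fold * n and hamming(cand, seq) <= max_mismatch:
                parent = cand
                break
        if parent is None:
            denoised[seq] += n                           # accepted as a real variant
        else:
            denoised[parent] += n                        # treated as an error of `parent`
    return denoised

if __name__ == "__main__":
    reads = ["ACGTACGTAC"] * 100 + ["ACGTACGTAT"] * 3 + ["ACGTTCGTAC"] * 40
    print(toy_denoise(reads))   # the 3-copy variant folds into the 100-copy parent
```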
In terms of accuracy, because the mock community composition is known, you can also see that the square symbols are what's actually in the mock community, and the DADA2 approach reconstitutes what's in the mock community better than the OTU-based approach. The squares on this side, and also the triangles, are considered correct, whereas the crosses and the stars are considered incorrect, and you can see DADA2 does make some errors in reconstituting the population, but generally performs better than the OTU-based approach.

Now, even though there are publications recommending the use of ASVs over OTUs, this blog post here, which I'll pull up, actually discusses some of the advantages and disadvantages of ASVs; there you go, it has a nice description of the pros and cons of the ASV approach. To summarize: as I showed you, the main improvement is that for organisms that are closely related to each other in your community, the ASV approach has better resolution in separating them out into subpopulations in your sample, rather than lumping them together into a single OTU. The other key advantage is that, unlike OTUs, where you have one representative sequence standing in for a collection of slightly variable sequences, up to three percent different (or up to one percent, depending on your cutoff), the ASV approach provides a consistent label, because the algorithm essentially regenerates the original population structure: each ASV is supposed to represent a cluster of identical sequences, so you get one sequence, rather than multiple sequences, per cluster. As I mentioned, if you have early-round PCR errors, an ASV could actually represent an artifact of your experiment, but generally speaking it does a pretty good job of reconstituting the populations. Now, the disadvantages: because it doesn't have that clustering step, if your sample population is highly diverse or complex, you can end up with a lot of ASVs, which makes the downstream analysis, maybe not more difficult, but certainly more time-consuming, and it may require more computational resources. It's also sensitive to data quality, as I mentioned. Lastly, for 16S and other multi-copy genes, because each ASV is supposed to represent a cluster of identical sequences, if you have a genome with multiple 16S copies that are slightly different from each other, each of those becomes a separate ASV, whereas in the OTU-based approach they would be grouped into a single OTU if you use a loose enough threshold that the variation between copies within a genome is not an issue.

Okay, so the other feature-identification post-processing step is the removal of chimeric sequences. I won't get into too much detail here, but the general principle is that you look at your reads and identify sequences where a read maps to two different targets, two different reference sequences, in the database; that typically signals that you have a chimeric read. As you can see here, one part of the query sequence matches sequence A very well, and the second part of the query, the one in the middle, matches sequence B very well.
There are tools that help you pull out those cases as potential chimeric sequences, and studies have shown that approximately one percent of the reads in a marker gene study can be chimeric. Okay, so taxonomic assignment, as I mentioned, is analogous to closed-reference OTU calling. (I'm not sure why the slide keeps jumping.) The process of taxonomic assignment is simply to give OTUs a name you can refer to. Typically you take the representative sequence from each OTU, do a sequence similarity search against a reference database, and then take the top hit, or the top several hits, as candidates for your taxonomic assignment. What's important is to specify the matching algorithm used and the taxonomic database used, as these can also affect your taxonomic assignment. Here's a list of common databases that have been used for taxonomic assignment. And here I mention that the reason people use 97 percent is based on the assumption that within bacterial genera, species-level sequence identity sits somewhere between 95 and 99 percent, but it's been shown that there are actually a lot of exceptions to that rule. [Audience comment] Yeah, I think QIIME 2 still ships a version of Greengenes, but maybe just for archival purposes, and it hasn't been updated; good to know. SILVA, I think, is still the one being maintained, as far as I know, but RDP I don't think is being actively maintained either; I don't know, do you have any information? Okay. So Greengenes, for historic reasons, is still kept as an archival copy in QIIME, but it hasn't been actively maintained, and the same goes for RDP; last time I looked, it also hadn't had a lot of activity. I should also mention that NCBI has taken over some of the role of curating 16S sequences, so NCBI actually does have a curated 16S database as well.

All right, this is just to show that once you have the taxonomic information, a common display is a stacked bar chart showing the relative proportions of the different taxa across different samples or sample types. Okay, and I just want to quickly mention the structure of a feature table. Typically a feature table consists of the read count of each feature in a given sample, so it's a two-dimensional table, a matrix, with features in one direction and samples in the other. Each feature could be an OTU or an ASV, and this cell represents the number of reads for that feature in sample one; in sample two, for example, the same feature has half that number. As we'll talk about later, you typically need to transform the read counts to proportions, or normalize them in some other way, before you do analysis, because first of all you sequence different samples to different depths, so you have different read totals per sample, and also the counts aren't independent: features are correlated with each other, so a naive count-based approach to analyzing the samples will have unacceptable false discovery rates and biases. The read table is usually kept in the BIOM format, which we don't need to get into, except to say that there are standardized formats for microbiome data.
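As a small illustration of the feature table layout and the counts-to-proportions transformation just described, here is a minimal pandas sketch; the feature and sample names and counts are made up, and a real workflow would read the table from a BIOM file rather than hard-coding it.

```python
import pandas as pd

# Toy feature table: rows are features (OTUs/ASVs), columns are samples.
counts = pd.DataFrame(
    {"sample1": [120, 30, 50], "sample2": [60, 0, 140]},
    index=["ASV_001", "ASV_002", "ASV_003"],
)

# Samples were sequenced to different depths, so convert read counts
# to relative abundances (each sample column sums to 1).
proportions = counts.div(counts.sum(axis=0), axis=1)

print(counts)
print(proportions.round(3))
```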
Also, a bit of historical context: the traditional multiple-alignment approach to generating an alignment is too slow when you're dealing with tens of thousands, sometimes millions, of sequences, so template-based aligners were developed. One of them is called PyNAST, and another, Infernal, is actually based on RNA secondary structure. These template-based aligners only require each query sequence to be aligned to the template, rather than the all-against-all comparisons needed in a multiple alignment, so they're much faster, but they're also less accurate and generate distances that are less representative of the true phylogenetic distances. Multiple-alignment algorithms have since been improved to handle tens of thousands of reads, and that is currently the preferred approach when you want to generate sequence alignments. The next step is to generate phylogenetic trees from your sequence alignments, and again different tree reconstruction tools are available; the most popular is probably the FastTree algorithm, which is implemented in QIIME 2, mothur, and many other tools, because it allows a large number of aligned sequences to be built into a phylogenetic tree relatively quickly and relatively accurately.

Rob, are you covering rarefaction? Okay, you are, so I won't go into it in depth. As I mentioned, each of your samples can have a different number of reads, and the number of reads can affect your downstream analysis when you're trying to determine the diversity of your samples: typically, the more you sequence, the higher the diversity readout you get. The rarefaction process essentially takes the same number of reads from each of your samples, normalizing your samples to the same read count. This approach has been discussed extensively in the community, both pros and cons: some people say it has a minor effect on your results, others think it significantly reduces the sensitivity of your analysis, and so on. I've highlighted some alternative approaches, and why the field has moved toward more compositional approaches rather than count-based approaches; Rob will mention some of the compositional approaches in his lecture.
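Here is a minimal sketch of what rarefying means in practice: subsampling each sample's reads, without replacement, down to a common depth. This is only to illustrate the idea; dedicated implementations (in QIIME 2, for example) operate on whole feature tables and can repeat the subsampling many times.

```python
import random
from collections import Counter

def rarefy(feature_counts: dict[str, int], depth: int, seed: int = 42) -> Counter:
    """Subsample `depth` reads without replacement from one sample's counts."""
    total = sum(feature_counts.values())
    if depth > total:
        raise ValueError("sample has fewer reads than the requested depth")
    # Expand counts into a pool of read labels, then draw `depth` of them.
    pool = [feat for feat, n in feature_counts.items() for _ in range(n)]
    random.seed(seed)
    return Counter(random.sample(pool, depth))

# Two samples sequenced to different depths, both rarefied to 50 reads.
sample_a = {"ASV_001": 400, "ASV_002": 80, "ASV_003": 20}
sample_b = {"ASV_001": 30, "ASV_002": 25, "ASV_003": 5}
print(rarefy(sample_a, 50))
print(rarefy(sample_b, 50))
```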
Right, so I'll take a few minutes just to finish the last few slides, introducing a few more concepts, and then we'll jump into the hands-on portion of the marker gene analysis. Okay, so we talked about generating a feature table, and using the feature table you can characterize your samples or compare your samples to each other. I want to introduce a couple of concepts. One is called alpha diversity analysis, which is the diversity of organisms in a single sample, or a single environment in more traditional ecological studies. It's typically a combination of two different measures: one is richness, the number of species or taxa observed or estimated, and the second is evenness, in other words the relative abundance of each taxon. You can imagine that for the same richness, say 10 species in two environments, the evenness can differ: in one environment there's one organism that's highly abundant and the rest are at low abundance, whereas in the second environment the subpopulations are quite even, in other words you have 10 species that are all present in roughly the same numbers. If you go into those communities and sample them, by sequencing, the results you get can be quite different. In the first one, where you have a highly abundant organism followed by low-abundance organisms, a highly uneven environment, you'll have to sequence a lot deeper in order to see the rare organisms, whereas in the second one, because they're all relatively even, you'll be able to detect them with much shallower sampling. So the alpha diversity can affect the level of sampling you require to characterize your population. Alpha diversity measures take both evenness and richness into account, and there are many different measures of alpha diversity; some common ones, such as Shannon entropy and phylogenetic distance (PD), have been incorporated into QIIME and into mothur, and often they give you similar characterizations of your community.

This is just a graph showing that if you plot your samples, in this case two samples, by their alpha diversity, in this case phylogenetic distance, the y-axes are slightly different, but you can see that for the same set of samples, sequenced to roughly the same depth of coverage, closed-reference OTU picking gives you a lower diversity, in this case around 20, compared to both open-reference and de novo picking, which give roughly the same value, about 30, in the PD measure. So again, just to show that the OTU picking process can affect your diversity measures.

Beta diversity, on the other hand, is the difference in diversity across samples: how different your samples, or sample types, are from each other. Again, there are several measures of beta diversity. One of the popular ones is called UniFrac, and another one implemented in QIIME is the Bray-Curtis dissimilarity, which looks at OTU abundance across different samples as a measurement of the differences between those samples. Contrast that with the Jaccard measure, which only looks at OTU presence or absence, so it does not take abundance into account. So when you're characterizing your samples, it's useful to know whether your similarity measure takes relative abundance into account or just measures the presence or absence of OTUs. As I mentioned, the different measures will often give you comparable results if your signal is strong enough, but they do have their own caveats that are worth looking into when you're doing comparisons and picking which one to use, and often by searching the discussion forums you'll find people commenting on the application of a given diversity measure to their samples and the issues they see with the measurements.
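To make some of those measures concrete, here is a small sketch that computes Shannon diversity for two toy samples and then the Bray-Curtis and Jaccard distances between them using SciPy's distance functions. UniFrac is left out because it additionally needs a phylogenetic tree; the counts are made up.

```python
import numpy as np
from scipy.spatial.distance import braycurtis, jaccard

# Toy feature counts for two samples (same features, same order).
sample1 = np.array([120, 30, 50, 0], dtype=float)
sample2 = np.array([60, 0, 140, 10], dtype=float)

def shannon(counts: np.ndarray) -> float:
    """Shannon entropy (natural log) of a sample's relative abundances."""
    p = counts / counts.sum()
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

print("alpha (Shannon):", shannon(sample1), shannon(sample2))

# Bray-Curtis uses abundances; Jaccard here uses presence/absence only.
print("Bray-Curtis:", braycurtis(sample1, sample2))
print("Jaccard:", jaccard(sample1 > 0, sample2 > 0))
```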
Now, how is beta diversity calculated? Instead of the feature table, where you have OTUs by samples, you build a table of sample-by-sample distance, or similarity, measures. Sample one compared to itself is obviously identical, so you get a value of one; sample one compared to sample two, for example, is less similar, so you get a score of 0.4; and the lower the similarity, or the greater the dissimilarity, the lower the score. You can use this table to look at the pairwise distances directly, but dimension reduction algorithms, such as principal components analysis or principal coordinates analysis, allow you to take it and reduce it to a two- or three-dimensional graph, so you can see more intuitively how the different samples group with each other and how similar or dissimilar they are. Let me see if I can get to it; this is an example of a principal coordinates plot, taking the distance measure and projecting it, in this case onto two dimensions, and intuitively you can see that this population here is much more different from these two populations. These dimension reduction algorithms work by taking a high-dimensional representation, and I have a graphical illustration here: it takes a multi-dimensional object, in this case three-dimensional, and reduces it to two dimensions. Intuitively, you can see that of these two projections, one gives you much more information, and you can tell it's a chair, whereas the other is projected in a way that gives you a shape that's harder to interpret, whether it's a chair or something else. So the way these dimension reduction programs work is to try to preserve as much information as possible from the high dimensions while projecting into the lower dimensions so you can visualize it, and commonly they do so by maximizing the variation captured in your variables.

Okay, so you can also take all your samples, use the distance measure to build a hierarchical tree, and in a way view the clustering as a sample tree: samples that are more similar to each other will show up as clusters in the tree. But this forces you to show the samples as a bifurcating tree, so sometimes it can be misleading to force your samples into bifurcating trees. I think I can skip this, because I'm pretty sure I saw some slides covering the marker gene versus shotgun comparison.

I want to end with two fairly new references, both published this year, on general best practices for 16S microbiome studies. The first one actually covers pretty much all the topics we're covering in this workshop, but hopefully the workshop gives you an opportunity to ask questions related to these best practices. So if you have a chance, take a look at these, and we can certainly make them available to you if you can't download them yourself, if you don't have access through your institution.

Okay, so typically for these labs, and certainly in the subsequent sessions, we will give you a web link to a version of the lab with instructions and so on. But when I was preparing this marker gene lab, I was looking at the different tutorials available online and realized that there are some great tutorials out there, and it's not really useful for me to reproduce what they do. So instead, what I want to offer you is three different options, all of which give you an opportunity to use QIIME 2 to process some data sets.
The first option is just to follow the default QIIME 2 tutorial; this is for beginners who have never used QIIME 2 before and just want to do the step-by-step tutorial. You can also try the second tutorial, the fecal microbiota transplant tutorial, which gives you a different starting point, where the samples have already been demultiplexed, and you take the demultiplexed samples and import them into QIIME 2 for analysis; whereas in the first one, you take a single FASTQ file that has not been demultiplexed and have QIIME 2 demultiplex it and do all the processing for you. The second option is to use a book chapter that Rob and his student Michael wrote on 16S analysis; by and large it follows the same steps as the QIIME tutorial, but it uses a different data set. Do you have the printout? Okay, great, so we'll give you the book chapter that you can take a look at, and I'll show you shortly how to start the analysis. The third option is for experts, basically: if you have run QIIME 2 before, or have done other analyses before, and you have your own data set that you want to put through these workflows, you're welcome to try that, and we'll try to help you as much as we can if you run into issues or get an error you don't know how to fix. So the third option is only for people who have already used QIIME 2 in the past. I should also mention that the second option uses the same data set as module 7, so in a way module 7 will give you the results, but this gives you an opportunity to play around with the data set and generate the results that will be used for statistical analysis in module 7.