All right, so thanks for coming to the workshop. I realize that many of you have experience with microbiome studies already, and some of you don't. So the way I want to structure this is that I'll give a very basic overview for those who haven't had much exposure to the field. For those of you who have a lot of experience, I'd like to encourage you to share that experience, so feel free to pipe up. I don't have that many slides, so there's time for discussion and to talk about problems or issues you might be facing, or even recommendations you might have. Throughout the workshop, it's not just about learning from us, but also learning from each other, and hopefully you'll find someone you can collaborate with in the future.

There are seven modules in this workshop. It's quite an intensive workshop; we go through the material fairly quickly. There are different areas I'm trying to cover, ranging from marker gene analysis, to metagenomic taxonomy analysis, classifying and binning of sequence data, and then metagenomic functional analysis, where you do functional prediction using metagenomic sequences and learn a bit about the databases and pathway tools you can use for that kind of analysis. We also have John, who will be covering metatranscriptomics analysis, and in, I guess, the last module, Fiona will talk a little bit about biomarker discovery. In this particular module, we're just going to define some terms, talk about general approaches to doing these types of analysis, and have a general discussion.

So the general learning objectives for the entire workshop are as follows. First, define the objectives of different types of metagenomic projects. Second, process raw data files using appropriate quality control; we'll talk a little bit about the importance of quality filtering before you proceed with your sequence analysis. We'll also show you some standard pipelines that we and others have developed for marker gene analysis, metagenomics analysis, and metatranscriptomics analysis. Then we'll show you, once you've processed your data, how you can analyze the results, including some of the statistical and network approaches for analyzing them. And as you know, this is an evolving field, so it's also important to recognize the technical limitations, and also the conceptual limitations, of metagenomic studies, which we'll bring up during the workshop.

Specific to module one, our hope is that by the end of these 45 minutes you'll know the key terms of metagenomics, be able to define the objective of a metagenomics experiment, and be able to choose the appropriate technology for designing your experiments. I actually took out a part about interpreting the content of sequence files, namely FASTQ files, but during the tutorial you'll have a chance to look at FASTQ files. For those of you who have never dealt with a sequence file, which these days may be like telling airline passengers how to fasten their seat belts, we will show you what sequence files look like and what their content is; there's also a small sketch of a FASTQ record just below. And the last bit is the hands-on tutorial, where we'll show you how to acquire data from the different online resources and reference databases available.
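Just to give a rough idea before the tutorial, here is a minimal sketch of what a FASTQ record looks like and a toy Python reader for one. The file name is a placeholder, and a real pipeline would use an established parser rather than this; it's only meant to show the four-line structure of a record.

# Each record in a FASTQ file is four lines, for example:
#   @read_001
#   ACGTTGCATTAG
#   +
#   IIIIFFFFBBBB   <- per-base quality characters (higher score = higher confidence)
#
# Minimal sketch of reading records from a FASTQ file ("sample_reads.fastq" is hypothetical).

def read_fastq(path):
    """Yield (read_id, sequence, quality_string) tuples from a FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().strip()
            if not header:
                break                      # end of file
            seq = handle.readline().strip()
            handle.readline()              # the "+" separator line
            qual = handle.readline().strip()
            yield header[1:], seq, qual    # drop the leading "@"

for read_id, seq, qual in read_fastq("sample_reads.fastq"):
    print(read_id, len(seq), "bases")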
Okay, so jumping right into the definition part. For those of you who are new to the field, or even those who weren't involved at the very beginning of it, there can be some confusion about what a microbiome is, what the difference between microbiome and microbiota is, and what metagenomics and marker gene studies are. There's a bit of an interesting history discussed by Jonathan Eisen at this link here. Sorry, not sure. Okay, this link here. The term microbiome is attributed to Joshua Lederberg by Lora Hooper and Jeff Gordon in their paper, describing the collective genome of our indigenous microbes, what used to be called the microflora. Nowadays, given that microbes, or prokaryotes, are not really flora, or plants, the term microflora is out of style. The idea was that a comprehensive genetic view of Homo sapiens as a life form should also incorporate its indigenous microbiome, so in this sense the -ome stands for genome. But there's a competing view that reads the -ome in microbiome as in biome, so the microbial biome, or microbial community. Microbiota describes the actual set of microorganisms found in a particular setting, and sometimes people use that interchangeably with microbiome when the -ome is taken to mean the biome rather than the genome. So that gives you an idea of why people sometimes say microbiome and sometimes say microbiota. For clarity's sake, you can use microbiota to describe the organisms and microbiome to describe the genes encoded by those organisms. That resolves some of the ambiguity, but again, there's no real standard.

Meta? Sure. Yeah, so that's a bit of a philosophical discussion, in that people use marker genes to represent the organisms, right? That's why it's been used that way. Okay. Is there any consensus? I would say you're getting a profile of the microbiome; I would usually say that you're describing the microbiome. So I tend to use microbiome for the thing that you're looking at, in a general sense, the microbiota really means the organisms, and 16S is just an approach to get at the biome. Same with metagenomics, it's just an approach. Yeah, so for microbiota, if you use it in the sense of the set of organisms, I would also tend to agree with Morgan that a profile of the microbial community is a more accurate way of describing what you get. But again, if you use microbiome to mean the microbial biome, then people can also accept that as a term. I wouldn't really worry about it, because people use them interchangeably. I mean, we're trying to teach good practices. Yeah. I wouldn't spend any more time on it. Yeah, Morgan's more practical than I am, but it's safe to say that maybe you just define the terms, like I did here, and then use them to disambiguate what you mean.

The important distinction here is maybe not so much microbiome versus microbiota, but metagenome versus microbiome. Metagenome is a term that came later, in '98, from Jo Handelsman, when she described how advances in molecular biology and eukaryotic genomics had laid the groundwork for cloning and functional analysis of the collective genomes of soil microflora, which she termed the metagenome of the soil. You can see that microflora was still in use at that time. The meta in metagenome actually means beyond, so it was their attempt to say we're now going beyond just looking at individual genomes.
We're looking at the collective genomes of a community; that's how the term metagenomics came about. But it does not encompass marker gene surveys, which, as you'll see, predate this type of metagenomics analysis by quite a bit. So the important distinction here, and the community largely agrees this is best practice, is not to use metagenome when you're just doing marker gene surveys, but to refer to that as marker gene analysis or 16S analysis. Earlier on, when human metagenomics was being conceived, the US National Research Council put out this little booklet, The New Science of Metagenomics. I don't know if any of you have seen a copy; when I started my PhD I actually received a physical copy, but nowadays I think you can still find the PDF.

The big picture of what we're doing here, why you're taking this workshop, or why we do metagenomics analysis at all, is really to explore the relationship between microbiomes and their habitats. This includes host environments and natural environments, but the point is to be able to interpret the microbiome, or the microbial community, in relation to the environment or the habitat. To accomplish this, we use a series of experiments and computational tools to infer that relationship. These tools include marker gene-based analysis, metagenomic analysis, metatranscriptomic analysis, metaproteomics, and metabolomics, which is the analysis of the metabolites from a system. Some people have even come up with the term culturomics, which goes back to trying to culture individual organisms, but using much more systematic ways of defining the culture media to try to grow more organisms. And culturing, as most microbiologists know, is a problem. There's a paper that talks about the great plate-count anomaly, the conclusion being that less than 1% of the organisms across many habitats are culturable, and the rest, while you can observe them indirectly, never grow in the lab. With culturomics-type approaches this is gradually being tackled, and it's probably not true for habitats that are much better defined, such as the human gut or other human body sites; there the numbers range from about 10 to 15% of organisms being culturable from fairly well-defined sites. But in any event, it's still nearly impossible to culture all of the constituents of a microbiome, and therefore we need a way of interrogating the community without culturing. So metagenomics offers an effective, if imperfect, way of profiling the structure and the functions of a microbial community.

Metagenomics and marker gene-based analysis really go hand in hand with the development of molecular biology, specifically DNA sequencing technology. Sequences are the fundamental units of analysis, if you will, so you need to sequence the genes and interpret their functions based on the sequence data. In a way, what we're trying to teach in this workshop is how to do sequence analysis properly. There's a large number of sequencing platforms available; some are more popular than others, and some are on their way out. It's, again, a fairly actively developing field that demands attention, perhaps not in this particular lecture, but certainly worthwhile discussing if you have questions. And I'm sure many of you have seen this graph of the huge leap in the output of sequencing platforms.
I think this scale shows something like a 10^12-fold increase in sequencing output in less than 10 years; this is sequencing output per instrument run. It's really what's driving the data deluge we're getting and the need to develop fast and accurate bioinformatic algorithms to analyze the data. Okay, so the human microbiome projects started in the mid-2000s, and they stemmed from the realization that after the Human Genome Project, scientists wanted to tackle a bigger question. And what's bigger than the human genome, which gave us a catalog of about 25,000 to 30,000 genes, plus of course the splice variants and transcripts? The microbiome, the human gut microbiome. It's estimated that collectively the human microbiome encodes two to three million genes, and a typical person carries more than a hundred species, depending on how you define species, and of course a much higher number of strains in the gut. And that's just at any given time; the system fluctuates quite a lot. Okay, any questions so far? This is a very basic, maybe even redundant, introduction to what you already know, but if you have any questions or comments, feel free to pipe up.

So next I'm going to go into a little bit of the history of how metagenomics... oh, sure, a question about metabolomics studies. I'm not an expert in that field, but from what I've seen, lots of studies are done using fecal samples and extracting the metabolites; I don't know if anyone else has comments on metabolomic studies. Sometimes people just use urine to get the metabolites, but of course that doesn't capture everything, so it depends on the system you're working with. And metabolomic studies, from what I've seen, also tend to be much more targeted in terms of which metabolites are analyzed, so it varies quite a bit from system to system. Any other questions?

Okay, so the story of how we came to do metagenomics really starts in the 70s, the era of molecular biology. In the 70s, several important technologies were developed, including DNA sequencing, which improved quite a bit over the decade. Also, arguably the first bioinformatics software package, the Staden package, which to this day can still be downloaded and used, was released in 1979. At that time, molecular data were also becoming available, both protein, or amino acid, sequences and nucleotide sequences, so there was also work on putting these sequences into an evolutionary framework, including matrices describing how sequences evolve. Okay, and this statement, pulled from the Staden package paper published in '79, is actually very interesting. It says that the continuing rapid fall in the cost of computer components is making it possible for most DNA sequencing laboratories to have their own small computer, and that the fact that DNA sequencing is now a fast procedure, together with the availability of computers, gives the possibility of more efficient overall strategies for sequence determination. It's an interesting historical perspective: dideoxynucleotide sequencing, which took days to do, was at that time considered state of the art, a fast procedure compared to some of the earlier technologies.
And I bet when Staden wrote this statement, he wasn't thinking about the next-generation sequencing that's available today and how much faster we can do sequencing. I bet if you made a similar statement today about how fast sequencing is, it would probably look just as quaint 10 or 20 years from now, as the technology keeps improving. And of course it's not just sequencing technology; the cost of computer components has also dropped significantly since the 70s and 80s. Okay, so this is just to show that, painstakingly, a few small viral genomes, a few kilobases long, were published in the 70s by manual sequencing. We still publish genomes today, of course, just at a much larger scale and a much faster pace, and maybe at a slightly lower quality than those initial genomes.

The 80s are really the beginning of: now that we have sequencing technology, what can we do with it? Norm Pace's lab in Colorado in the mid-80s started to look at marker genes in fairly simple communities, such as hot spring communities, where at the time it was believed there were only a handful of microbial organisms. In the Octopus Spring study in '85, for example, the lab effectively took a sample from the hot spring by putting a sponge in it, taking it out, rinsing it off, and collecting the total RNA from the sample; they then ran it on a gel and cut out the band corresponding to the 5S rRNA. You have to remember that PCR was not really published or invented until the mid-80s as well, so at that time they had no way of amplifying a specific fragment. They had to use chemical methodologies to isolate ribosomal RNAs and then sequence them, so it was a very painstaking process back then. This is just to show the three sequences they isolated from the community; they simply named them one, two, and three and compared them to some of the known RNA sequences available. They also published a tree of the known sequences at that time in this paper, so it's quite an interesting paper for some historical perspective. In the late 80s, NCBI was founded, and the Ribosomal Database Project (RDP) was also funded back then; it still exists and functions to date, although it has taken on the quite different challenge of managing large amounts of ribosomal RNA sequences, most of them from unknown organisms.

In the 90s, sequencing technology continued to improve; we now had capillary sequencers that could be run automatically. In the early 90s, PCR was used to amplify and clone 16S genes, which were then sequenced, and 16S gradually became the de facto marker gene for microbial community studies. The 90s are really defined as the genomics era: the first bacterial genome was published in '95, and by '98, when probably fewer than 20 genomes were available, Jo Handelsman was already thinking about going beyond genome sequencing, and the term metagenomics was coined. Again, she was trying to look at the functional aspects of her soil community. Also in '98, Illumina was founded, and next-gen sequencing technologies were being developed and improved during this period, from the 90s to the early 2000s.
ARISA, automated ribosomal intergenic spacer analysis, which sequences the intergenic spacer regions between ribosomal genes, was also conceived. It's an alternative to using 16S or other ribosomal genes as the marker. These intergenic spacer regions of course evolve faster than functional genes, so they're very useful for differentiating very closely related organisms. Some fungal organisms fall into this category, where the ribosomal genes themselves have very little variation but the spacer sequences have much more. Okay, so in 2000, there was an interesting paper that showcased the direct cloning and identification of a functional new type of rhodopsin, called proteorhodopsin because it was isolated from a proteobacterium. This shows that you can actually go from sequence to taking DNA back into the lab and identifying new functions, as opposed to observing the phenomenon or phenotype first and then looking for the sequences. In the early 2000s, the term microbiome started to become popular, and at that time none of the next-gen sequencers were on the market yet, so while some of the early metagenomic studies were performed in the early 2000s, these were done using Sanger sequencing, at a cost that's quite a bit higher than today, hard to imagine now. Some of the well-known ones, including the acid mine drainage paper, looked at much simpler communities, and when that paper came out people were really surprised that you could shotgun sequence a community and assemble the genome of an organism in that community to near completion; that was the novelty of these papers. Jeff Gordon's lab at Washington University in St. Louis started to look at the interactions of the gut microbiome and the host, so a series of studies looking at lean and obese twins were conceived at that time. They had access to a cohort of twins that had been tracked longitudinally for 20 to 30 years, and they were able to get samples from these twin pairs and sequence, at that time, the 16S marker genes from these individuals. In 2005, the 454, the first next-gen sequencing platform, became commercially available, and around that time Craig Venter carried out the Global Ocean Sampling expedition, where he essentially got on his yacht, toured around different parts of the ocean, and took samples. That data set is available, though I think a lot of people found it to be heavily contaminated, but it shows an early effort to sequence the environment and get a global perspective on how the microbiomes in these environments differ. In 2008, the Human Microbiome Project was funded; the project effectively looked at both healthy and diseased individuals to try to get a better understanding of the human microbiome across multiple body sites. And while we're not going to use mothur directly in this workshop, it's worth mentioning that at the end of the 2000s the software package mothur was first published by Pat Schloss, and I think Pat was from Jo Handelsman's lab, so there's definitely quite a connection, a sort of synergy, between the bioinformatics development and the wet-lab development. In the late 2000s, Illumina sequencers also became available, and by 2010 Illumina was the dominant sequencing platform, with the highest throughput coming from the HiSeq machines.
With large amounts of data able to be generated for a relatively small amount of money, people really started to look at sampling globally, so the Earth Microbiome Project was conceived. During that time QIIME, which we'll talk about in this workshop, was also formally published, though of course QIIME had already been in use by the community for quite a few years by then. Okay, so then comes the desktop sequencing phase, which started to put desktop sequencers into medium and small-sized labs to make sequencing available to all of you, really. The 2010s, the decade we're in now, is marked by microbiomes of everything, so you can see all the different sites being sampled; there's a whole list here. Recently we're also starting to see citizen scientists getting interested in the microbiome. The American Gut Project was started in 2013 as a way to characterize the microbiomes of the general public. Oxford Nanopore's technology had a limited release in 2014, so people started to test out these so-called third-generation sequencing platforms. And last year there was a Kickstarter campaign started by Jennifer Gardy here and another researcher in California, last name Ganz, to essentially crowd-fund sequencing of your cats: you send in a sample of your cat's poop, and of course the money, and they will characterize the microbiome of your cat for you.

Okay, so that's a brief introduction to the different pieces of scientific history that contributed to the development of the field. Before I go on to the next part, any questions or comments? Does anyone want to talk about their favorite sampling project? I think they're trying to understand certain feline diseases and the microbiome. The website actually tells you the rationale for doing it, and the people behind it are real scientists, including people like Jonathan Eisen and Jack Gilbert. These are people who publish real metagenomic studies, not just someone in a cabin trying to get money from sequencing cats. There is a dog microbiome, yeah; the dog microbiome has been characterized. I believe they're only looking at marker genes, which probably precludes the identification of parasites unless they're looking at 18S sequences, so my guess is no, they're just looking at bacterial 16S sequences. And don't expect that project to diagnose your cat's ailments for you; it's not at that point yet. I know vet bills are expensive, but. Any other questions or comments? Any other scientific tidbits people would like to share? Okay, if you think of something, feel free to pipe up.

Okay, so this is really just one-slide-each coverage of the different analysis pipelines, which of course we'll get into in a lot more detail over the next few days. The big-picture analysis pipeline is: you collect the microbial sample, you generate the sequence data, or not necessarily sequence data but omics data, you run some pipelines, including QC and binning of your sequences and so on, and then you follow up with some statistical analysis that hopefully gives you new insights into the microbial community and the interactions it has with the host. So that's the 10,000-foot view of the workflow.
Of course, the details are a lot harder, and that's why you're here, but a lot of the details are also still being worked out and improved. For marker gene analysis, which we'll talk about today, the process is as follows. You extract the DNA first; when we talk about sequencing ribosomal RNA genes, we're actually sequencing the gene version rather than the expressed product, so usually you extract DNA rather than RNA. You amplify with targeted primers aimed at specific regions of the genome. We'll talk about this more, but because NGS platforms produce short reads that typically don't cover the entire marker gene, these primers usually target specific regions of the gene rather than the whole gene, with the idea that you'll get a contiguous fragment for your analysis rather than having to assemble the target region. Then, because sequencing platforms produce errors, you need to filter out the errors and build clusters, which we'll talk about in depth later on. Once you have your so-called OTUs, you can do your diversity analysis. For metagenomics analysis you also extract DNA, but you sequence random fragments instead of targeted fragments. Again there's QC and annotation of the sequences, but in this case, because the fragments are randomly sheared, you have the option of trying to assemble them. After you get your sequences, you can again carry out taxonomic analysis to look at the diversity of the community, and you can also try to predict the functions of the community. This is what Morgan will cover tomorrow. Metatranscriptomics studies involve extracting RNA rather than DNA. Of course, the majority of your RNA is probably ribosomal RNA, so you first have to subtract out the ribosomal RNAs, then reverse transcribe the sequences into cDNA. Again, QC is an important step, and then you can carry out analysis looking at gene expression and function. John will cover metatranscriptomics on Friday.

Okay, so I'll quickly go over some of the major concerns in metagenomics analysis, and if you have any additional concerns, do bring them up now; I might have missed something, and then people can be aware of other issues. So, clustering is not about contigs, in that you don't put fragments together into a single contig; clustering is just grouping reads into sets of reads. Also, typical marker gene analyses have a defined start and end site bounded by the primers, so typically you don't need to assemble; you just need to collapse the reads that are similar or identical to each other. Does that make sense? Whereas with assembly, because you're doing shotgun sequencing and the fragments come from random locations, you're trying to find overlapping fragments and assemble them into longer fragments. The key reason for doing that is that longer fragments usually give you more information about the function of a gene than a 200 or 250 base pair fragment. Okay. Sure. So with assembly you're trying to form contigs, right? But what scaffolding does then is, well, let me backtrack a bit. When sequencing, you have the option of doing paired-end sequencing or mate-pair sequencing, where you know the two reads come from the same fragment of DNA.
There can be a gap between the reads from the two ends. What scaffolding is supposed to do is take this paired-end or mate-pair information and try to order your contigs. It's easier to draw this out. Is everyone interested in what scaffolding means? Okay. Right, so let's say this is your chromosome, and with shotgun sequencing you have random overlapping fragments; these are the DNA fragments. Can everyone see that? Okay. The DNA fragments can be longer than your reads, so when you sequence, you may only be capturing the ends of a DNA fragment; these are your reads. These would be called paired-end reads. Mate pair is a little bit different: instead of sequencing inward, you sequence outward, so effectively the orientation of your end reads is different, and as long as your software knows the orientation of the ends, the assembler will be able to put the sequence back together. So scaffolding, let me draw a few more reads. If I have overlapping reads, let's say in this region, the assembler can reconstruct the entire region by finding the overlapping reads, and this is called a contig. But you'll notice there's a region here with a gap in it; the region is spanned by a DNA fragment that was not sequenced to completion, so there's a gap, but we have reads from its ends. For example, if I know that this end and that end come from the same piece of DNA, then I can scaffold. Let's say you have a contig here, contig one, and let's say this region also has coverage, so you have contig two here. When the assembler gives you back the results, it just reports them as two separate sequences, and you don't know the order of these two sequences in the genome. But because of the mate-pair information, the positional information telling you that this read and this read came from the same piece of DNA, and this read and this read overlap, you now know that these two contigs go in this particular order, even though this region here has not actually been sequenced. Usually scaffolders will give you a run of N's where the unsequenced region is, but the order of the contigs can be established based on the positional information from your mate pairs. So that's called scaffolding, and this is called assembly, or contig generation.

Okay, how much time do we have? Okay, so the first issue, which I've already alluded to, is the sequence data quality issue. Sequencing errors exist, and the next-gen machines typically have a 0.1 to 0.01% substitution error rate in the reads they generate. The third-generation machines have much higher error rates: PacBio, for example, is about 10% right now, and Oxford Nanopore is about 15 to 20%. So one in ten to one in five bases in your read could be erroneous, and this of course affects your taxonomic identification and so on.
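To make that error-rate point a bit more concrete, here is a minimal sketch of the kind of read quality filter these pipelines apply before clustering or assembly. It assumes Phred+33 encoded FASTQ quality strings; the function names and threshold values are just illustrative, not a recommendation, and real tools do trimming and filtering in more sophisticated ways.

# Minimal sketch of quality filtering, assuming Phred+33 encoded FASTQ qualities.
# A Phred score Q corresponds to an error probability of 10**(-Q/10),
# so Q20 is about 1% error and Q30 is about 0.1% error.

def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def passes_filter(quality_string, min_mean_q=25, min_length=100):
    """Keep a read only if it is long enough and its mean quality is acceptable."""
    scores = phred_scores(quality_string)
    if len(scores) < min_length:
        return False
    return sum(scores) / len(scores) >= min_mean_q

# Example: "I" encodes Q40, "!" encodes Q0.
print(passes_filter("IIIIIIIIII", min_mean_q=25, min_length=5))  # True
print(passes_filter("!!!!!!!!!!", min_mean_q=25, min_length=5))  # False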
Moreover, during PCR amplification of marker genes, chimeric sequences can form. This is when the primers essentially hop from one DNA molecule to another, and you end up with a sequence that comes from two different DNA templates. We'll see a little bit later how you can remove chimeric sequences. The third data quality issue has nothing to do with sequence quality but with metadata quality. A lot of these microbiome projects study very complex communities, and there are multiple factors affecting the microbe-host or microbe-environment interaction, but often studies don't collect enough data about the environment or the host; they just collect sequence data. The message here is that you also need to collect data about the environment or the host to really be able to interpret metagenomic studies. The other issue with metadata is that even when the data is collected, it's often embedded in supplemental files in the publication and not made available together with the sequence data in public repositories, and as a result, reusing metagenomic or microbiome study data is quite a challenge.

Someone asked why the third-generation platforms have higher error rates. Well, they sequence a single molecule, so the signal from a single molecule is much weaker than on the next-gen sequencing platforms, which actually sequence a bundle of identical fragments to amplify the signal. That's the simple explanation: it's a signal issue, and the signal is weaker in these third-generation platforms when you only look at one molecule at a time.

Okay, the next issue has to do with comparability and reproducibility of experiments. 16S, as I mentioned, is about 1,600 base pairs long, and your reads are only about 250 base pairs long, so you can't sequence the entire 16S gene directly. Therefore, people focus on different hypervariable regions, or V regions, and studies both from the HMP and from other groups have shown that different V regions can give different taxonomic results. One of the reasons is that these different V regions evolve at different rates. We'll go into this in a lot more detail in the marker genes section, but the short answer is that people typically don't compare different V regions. So let's say there's a public data set you're really interested in that uses a specific region: if you want to compare your results to that data set, then it's best to do your analysis based on the reference data available. That's the short answer. Anyone else want to comment? Sometimes some regions may be more informative than others, depending on what we need. Yeah, there are other considerations, such as picking a faster-mutating region if you're looking at more closely related organisms, and a more slowly mutating region if you're looking at more distantly related organisms, to give you better resolution. Yeah, for your own study... yeah. One day we'll be able to sequence the entire 16S gene, and then the debate will be gone. Right, that's true. Yeah, there are some enrichment approaches now where you can look at more than one gene. But the immediate solution, I think, is to base your choice on the reference data set you want to compare to, and for your own study design, stick to one marker.
In the studies that we were involved in, we looked at multiple markers, and they all gave slightly different stories, so for your own sanity it's probably best to just focus on what the community-accepted marker is and, as Morgan said, be aware of the biases that the different markers may have. Yeah, there's a handful of studies that show that, and I'll also show a plot of the range of variation within the V regions. There are also studies comparing workflows, of course, comparing the different bioinformatics tools, and the key message there is that there are always slight variations among these tools. I think the majority of tools typically give you the same story if you look from a high-level view, but when you try to drill down to a more detailed interpretation of your community, the different tools can really give different results. It's difficult to evaluate tools for microbiome analysis because the ground truth, in other words what the community is really like, is unknown; that's why you're studying it. So people have been using mock communities, where they spike in known organisms, or they take simulated sequence data to form a simulated community, and they use these to evaluate tools. Again, these are nowhere near the complexity of a true community, and the tool that performs best on a mock community or simulated data is not necessarily, and usually not, the tool that performs best on empirical data. So again, it's a work in progress, and probably the best advice here is to stick with a fairly commonly used tool, a standard practice, and focus on interpretations that you can make, rather than hitting your head against the wall on interpretations that really require improvements in technology. For example, if you're trying to understand strain-level variation, that might be a question that's much harder to answer with current sequencing technology, but if you're just trying to get a sense of differences at the genus level, at a coarser taxonomic level, then that's a much easier question to address with the technology available.

Okay, so because the reads are quite short, and assembly sometimes creates what are called chimeric or mosaic sequences, it's difficult to interpret strain-level diversity in metagenomics: your assembler might actually put together fragments from different strains of the same species rather than differentiating the strains when you carry out metagenomics analysis. That leads to the question of whether you should assemble metagenomics reads or just take individual reads and do your taxonomic profiling or functional prediction. Again, this is a highly contested area, and it's a trade-off between longer sequences giving you more information and assembly of the reads potentially creating chimeric contigs, or contigs made of DNA from different non-clonal organisms.

Okay, so a quick word on taxonomy versus OTUs, and again we'll get into this a little more in the marker gene analysis section. Taxonomy is essentially a label that you give to a group of organisms, and this stems from the human urge to name things and to classify things. Sometimes the classification works well, in other words your label describes the group of organisms very well, but in cases such as E. coli, as you may know, it's a poor label, in that there are different kinds of E. coli, some pathogenic and some non-pathogenic. So if you just found E. coli at the species level in your sample based on a 16S study, you often cannot interpret whether it's pathogenic or not. Be aware that taxonomy is essentially a name given to a group of organisms, and that group may or may not be homogeneous; the organisms in it could be quite varied. OTUs are sort of an attempt to address this issue. Evolutionary theory predicts that more similar sequences are likely to correspond to more similar functions and a closer phylogenetic relationship. OTUs are arbitrary in the sense that you can define the cutoff arbitrarily, but once you define a cutoff, say a 97% cutoff, then at least you know the organisms in that cluster differ from each other by only about 3% on average. Again, you don't know functionally whether that's a good cutoff; picking a cutoff for your OTUs is somewhat arbitrary and somewhat defined by community practice, and when you interpret the results you need to take them with a grain of salt. You might actually want to try cutoffs at different levels and see whether your hypothesis still pans out, or whether your results change significantly with the cutoff you use. Yes, so to a lower or higher percent cutoff. Typically people do it iteratively, so you cluster at, say, 97%, and if you want to relax that you can take the result and cluster at a lower percentage; be aware that, because of the algorithms used, this can give you different results than if you just take the sample and cluster directly at 95%, and we'll talk a little bit about that later. So this is just to show that you're forming OTUs by essentially drawing circles around seed sequences, and if you're using an algorithm that's greedy, in other words first come, first served, then you can see cases where a sequence falls somewhere between two seed sequences, so sometimes that particular sequence gets assigned to one OTU and sometimes to the other. That's when changing your cutoffs may make sense: if you have communities that are likely to be overlapping, a more relaxed cutoff may be more inclusive and encompass the whole community, or alternatively you might make the cutoff more stringent to minimize situations where you have ambiguous assignments. And of course some of the sequences will fall outside of these circles, outside of known reference sequences, and then you need de novo clustering approaches to address these outliers.
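Here is a rough sketch of that greedy, first-come-first-served clustering idea: the first read becomes a seed, and each subsequent read joins the first existing seed it matches at or above the identity cutoff, otherwise it starts a new OTU. The identity function here is a naive position-by-position comparison over equal-length, pre-aligned sequences, purely for illustration; this is not how mothur or QIIME actually implement clustering, and the function names are mine.

# Toy sketch of greedy OTU clustering at a given identity cutoff (e.g. 0.97).
# Assumes equal-length, pre-aligned sequences; real tools align sequences and
# use far more careful heuristics.

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def greedy_otu_cluster(sequences, cutoff=0.97):
    """Assign each sequence to the first seed it matches at >= cutoff identity."""
    seeds = []                 # one representative ("seed") sequence per OTU
    assignments = []           # OTU index for each input sequence
    for seq in sequences:
        for otu_index, seed in enumerate(seeds):
            if identity(seq, seed) >= cutoff:
                assignments.append(otu_index)
                break
        else:                  # no seed was close enough: start a new OTU
            seeds.append(seq)
            assignments.append(len(seeds) - 1)
    return assignments

reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
print(greedy_otu_cluster(reads, cutoff=0.9))   # [0, 0, 1]

Note how the result depends on input order and on the cutoff, which is exactly why re-clustering at a different cutoff, or with a different tool, can shuffle borderline sequences between OTUs.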
I think Morgan will talk more about functional annotation problems when he talks about metagenomics, but here I'm just highlighting two points from one paper. It describes a protein function prediction challenge, where a large group of researchers takes a long list of genes of unknown function, hypothetical proteins or hypothetical genes in genomic sequences, and makes functional predictions using tools like BLAST, SIFTER, Argot, and so on. The functions are predicted as Gene Ontology terms, which is why they break down into broad categories of molecular function and biological process annotations. They then wait 11 months or a year and go back to the literature to see whether some of these genes of unknown function have since been studied in the lab and had a function confirmed by wet lab studies. In this particular challenge, about 850 genes that were unknown a year earlier now had a function confirmed by wet lab work. They then use that wet lab annotation as the gold standard and look back at the predicted results. There are two key points. One is that the new generation of tools designed specifically for function prediction seem to outperform the generic approach of using BLAST to assign annotations. By the way, does everyone know what annotation means? It just means that you provide a description, and I'm using it loosely here to mean a description given to a gene. Does anyone have a better definition? The typical usage is that a curator looks at the evidence, makes a judgment about what the function of a sequence should be, and assigns that function to the sequence; that's called an annotation of that sequence. Again, an annotation is in a way a label given to that protein or gene. The second moral of the story is that, using the F-score measure, which takes both precision and recall into consideration as their harmonic mean, there's still a lot of room for improvement, and these tools are generally better at annotating molecular functions, in other words the general biochemical functions of a protein, than the biological process or pathway the protein is involved in. The reason is likely that homologs, or genes in the same family, can assume different biological roles in different tissues, organs, or environmental niches, so while the biochemical function may be known, the actual biological process a protein is involved in is much harder to predict accurately.

Okay, so I'm going to very quickly go over some of the resources for 16S and other types of analysis; again, these are things we'll cover in more depth in each of the sections. For 16S there are quite a few public databases available: RDP, the Ribosomal Database Project, has been around for a long time, SILVA is a newer database, and Greengenes is another competing database. Both SILVA and Greengenes give you pre-aligned 16S and 18S sequences, or templates, which allow you to align your own sequences against those databases. The key difference between SILVA and Greengenes is that they use different algorithms to generate the alignment template, and SILVA alignments are typically a lot longer than Greengenes alignments. There's debate over whether the longer alignments give you better results, with the trade-off that you need more memory and more space for the longer alignments, versus Greengenes, which has shorter alignment templates. Some people, especially Pat Schloss, the developer of mothur, argue strongly against Greengenes, saying it doesn't give as good an alignment as SILVA. So mothur by default uses SILVA, and QIIME by default uses Greengenes. Can those two be used interchangeably, or would you use them in different circumstances?
You can use both, because both mothur and QIIME will work with either; there's no reason you can't just try both and see if they give you the same answer. mothur uses SILVA by default. Are they in competition? Yeah, I guess they are. Do you want to elaborate on that? Competition is good. I think what we're getting at is that they are trying to one-up each other, but as an end user you're free to use both, and both mothur and QIIME can take templates or profiles from either database. If you look at publications coming out of Rob Knight and Greg Caporaso, the developers of QIIME, versus papers coming out of the mothur camp, they are often directly commenting on each other's results, so there's definitely competition going on. Politely, of course. Well, I don't know; in the Twittersphere it's less polite, but in publications it's usually more polite. Not on blogs. Who's Schloss? That's Patrick Schloss's lab in Michigan. The reason mothur... actually, I'll save that for when we talk about 16S. Okay, so here's a list of genomics databases that are available. This is where you can download reference genomes, and sometimes, if your community is well defined, with known organisms and reference genomes available, you can actually download the reference genomes and use them for your metagenomics analysis. We'll get into that more, or maybe not. Will you talk about reference-based metagenomics analysis? Reference genome-based? Yes? Okay, I'm sure Morgan will get into it; he has tonight to prepare, anyway. Okay, so there are also a few metagenomics databases, and we will actually cover some of these in the tutorial, and lastly functional databases, which Morgan will for sure cover tomorrow. Okay, any questions?
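Since the hands-on tutorial involves acquiring data from online resources and reference databases, here is a minimal sketch of one way to pull a reference sequence from NCBI using Biopython's Entrez module. The accession number and email address are placeholders, not the ones used in the tutorial, Biopython is assumed to be installed, and the tutorial itself may use different resources.

# Minimal sketch of fetching a reference sequence from NCBI (requires Biopython).
# The accession and email below are placeholders for illustration only.
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"   # NCBI asks for a contact email

# Fetch one nucleotide record in FASTA format.
handle = Entrez.efetch(db="nucleotide", id="NR_074540.1",
                       rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()

print(record.id, len(record.seq), "bp")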