All right, so this is what I call the "holy crap, look at the data" slide, right? You've all seen them; it's the first sentence of every abstract of every microbiome paper ever: the continuing rapid fall in the cost of computer components, blah blah blah, DNA sequencing is now a fast procedure, yada yada yada. Pretty standard stuff. What I really like about this quotation in particular is that it is from 1979: Roger Staden, developer of possibly the first bioinformatics package, which you can still download from the web. So that's pretty cool.

All right, so welcome to my talk. I'm going to be talking mostly about microbiomes, and mostly about marker genes. There are two things I wish to correct. The program says I'm a Canada Research Chair, but because of incompetence I'm no longer a Canada Research Chair. The other is that Morgan is actually going to be giving the read quality and 16S tutorial this morning. The reason for that is that his software, Microbiome Helper, is very effective for both 16S and metagenome analysis, it allows us to give a more consistent picture, and it wraps a lot of tools together anyway. So it's not like you're going to be missing out. Okay, so here's what we're going to talk about: defining stuff, big questions, how I would assess the microbiome using 16S, and then the cranky part of the talk: all the limitations of 16S, and how you can fall into certain traps that have been fallen into many times before.

Okay, so let's define the microbiome. One thing that's annoying is that there is no canonical citation for the microbiome. There's the classic Joshua Lederberg definition, summarized as the collective genome of our indigenous microbes, our microflora, the idea being that a comprehensive genetic view of Homo sapiens needs to include the genes in our microbiome. One thing that's interesting is that there are two different definitions depending on how you parse "microbiome." There's microbiome as in the genome, and the various other -omes, of the microbiota; and there's microbiome as in the biome, the collection of tiny things that actually live inside us. I tend to follow the latter: the idea that your microbiome is all the bacteria and various other things that live in and on you. Either is acceptable; just be clear about which one you mean.

Okay, so when people say microbiome, they often mean the human microbiome. But it's very important to remember that pretty much every habitat on Earth also has an associated microbiome. This includes soil, seawater, and non-human animals. People have done studies of the gut and rumen microbiota of herbivores, carnivores, and omnivores to do comparisons, right? Different diets, different functions expected and found. So there are a ton of microbiome studies done in pretty much every habitat on Earth. Forest Rohwer, whom some of you might know, has claimed to have sequenced a virome from every single habitat on the planet, which is a pretty bold claim.

Okay, so the microbes. Again, when people talk about the microbiome, they're usually thinking about bacteria. But there's actually a lot more to it than that, and we've already had a question about bacteriophage: viruses that infect bacteria.
So it's important to remember that even if your particular study is focused on bacteria, and there's nothing wrong with that, there are many, many other constituents of the microbiome. That other cool domain of life, the Archaea, exists in the microbiome; there are some important methanogens living in our guts that are often missed by the standard 16S primers. So that's important to keep in mind. And then there are various other things to consider: viruses of us, viruses of bacteria, and microbial eukaryotes, which are very important and include some major pathogens like Giardia. And then some people also include overly complex organisms like worms and other types of worms, right? So keep in mind that there's this big ecosystem in which bacteria are extremely important, but they're not the only piece of the puzzle. Trying to characterize it all at once is super difficult, but just remember that everything's there.

Jock already had a slide about this, and of course we disagree on the number of genes in the human genome. Actually, I don't care: somewhere between 20,000 and 150,000; I said 25,000. The key message is that we have 20-something thousand genes, and if you look inside a typical human gut, you might find two to three million different genes, and quite a few species as well, whatever "species" means. The point is there's a lot of biodiversity. And so one of the important things, and presumably why you're all here, is to learn about culture-independent methods. Because we all have a giant pile of guts living in our microbes. I said that backwards. We have a whole bunch of microbes living in our guts. This is being recorded, by the way. But there are obviously some limitations, and we need approaches that look at the big picture.

And so there's the great plate count anomaly. How many of you have heard of this before? Okay, excellent, quite a few of you. The idea is that if you take all of the microorganisms from a particular habitat and try to culture them, you might be lucky to culture one percent of them. That statement is patently false for many habitats, and it's becoming more and more false as people come up with different culture techniques, different ways of actually getting these things to grow. Nonetheless, even if everything were culturable, it would still not be practical to culture every last bug you find in the human microbiome. Some people, like Mike Surette, are extremely good at culturing a ton of stuff, but you can never be sure that you're getting everything that matters. And the other thing is that there are many ways to do this, right? In addition to trying to grow things, there are many different types of data you can generate and many different types of analysis. That's what this workshop is about.

So one of the simplest approaches to characterizing the microbiome is marker gene analysis. The idea is that you have all of bacterial biodiversity and you want to grab onto one thing that everyone has, so that you can target it in a molecular analysis and do comparative studies to get at who's living there. And the most commonly targeted gene is... [16S! 16S!] ...the 16S ribosomal RNA gene. Yes, exactly. For all of its limitations. It's universally present, and it's often, but not always, present in single copy.
And it has nice patterns of conservation and variation that we'll get to later on. The key is that we sequence these genes and then use fancy bioinformatics techniques to translate the DNA sequences into proxies for the actual microbes that live in that habitat. This is from 1985, one of the first environmental DNA surveys. These authors looked at a fairly low-diversity, super-hot habitat: Octopus Spring in Yellowstone, if I recall correctly. And they actually looked not at the gene but at the ribosomal RNA sequence itself, which is what you see in use there. By doing this, they were able to find sequences that matched pretty well to previously known things, and also completely new things, which should come as no surprise, right? So this was the first illustration that this could be quite powerful and effective. It's all been done before.

And then, to get hung up on semantics for a moment, the term metagenomics is often applied both to marker gene studies and to the environmental shotgun analysis that you'll be learning about this afternoon. Again, as long as you're clear about what you're talking about, okay. But you should definitely not refer to marker gene surveys as metagenomes around me. Okay? The metagenome term was coined in 1998, and the idea is that you want to do some sort of functional analysis of the genomes that can be found in a particular habitat. What's interesting is that most people treat metagenomics as a synonym for environmental shotgun sequencing, but that's not where it came from. It was actually based on cloning and expression, but it's the same idea, right? Pull out a bunch of DNA, chuck it into some vectors, express stuff, and see what happens. The use of the term has evolved over time, and it's now taken by most people to mean sequencing of random fragments of DNA from the environment. You're still getting at function by looking at everything you can find, but you're not actually doing the expression and characterization; that step can come next. And by its strictest definition, it does not encompass marker gene surveys, because sequencing 16S is not functional analysis, right? So doing 16S surveys environmentally is not metagenomics.

You will see some related techniques as well. You can go from boring old DNA to looking at what's actually expressed in a habitat and ask about the metatranscriptome: which genes are actually switched on in a particular habitat, by a particular set of organisms, under a particular set of conditions? Metaproteomes, metametabolomes, culturomes, which is large-scale culturing rather than sequencing, and then more refined techniques like stable isotope probing, to follow different metabolites as they go on a whirlwind tour of various bacteria. Now, several things would influence your choice of method. Obviously one of the most important is cost, and this is one reason, certainly not the only one, why 16S analysis remains probably the most popular approach, I think. Reliability is an interesting one, because different approaches will give you data with different levels of confidence, right? Think about sequencing error, think about false positives; those profiles change from one data type to another. Reference databases are another consideration, right?
And then interpretability, which is the part where we talk for the next three days, and hopefully you don't get too depressed, to teach you how to interpret some of these things.

So, big questions. What are some of the things we want to know about the microbiome that we might get at using culture-independent techniques? Here's one, the classic: who is there? You take a sample from somewhere, poop, the table, the air, and you do some sort of culture-free profiling. As I said before, you can use the sequences you get as proxies for biodiversity. Here's an example: the famous Costello study from 2009, where they used 16S to profile about 32 different body sites, including the gut microbiome, for which poop is the usual proxy; lots of different skin sites; various sites on and in your face; and so on. In the little pie charts there, the colors represent different taxonomic classes, in this case. Look at the skin sites and they tend to be dominated by Actinobacteria; this was not hugely surprising. But the gut samples right here are dominated by two classes in particular: Bacteroidia is the purple one and Clostridia is the yellow one. This is fairly typical. Now, the ratio of those two phyla, sorry, classes, can vary a great deal, so that's an interesting thing to look out for. So that's taxonomy.

What about functions? Well, you can do environmental shotgun sequencing. These are fairly recent results from our lab, where we've taken a cohort of elderly subjects in a nursing home and done this sort of microbiome profiling. These are metagenomes, right? Shotgun sequencing. We've taken those sequences and mapped them against a highly curated database of antimicrobial resistance genes. By doing this, we can take the database hits, map them to the ontologies, which are basically just categories of different classes of drug resistance, and summarize this over our various subjects. The key takeaway is that we can look at specific types of function and at the distribution of abundances across everybody in our cohort. Here's an example: cephalosporin. Note that this is a log scale. Some people have relatively high abundances of cephalosporin resistance genes, and some people do not, so this is something that's fairly variable within the population. Elfamycin resistance: elfamycins are basically never used, yet everyone has it, and it's a fairly tight distribution, so everyone seems to have it in the same proportion. There's an interesting story there that we can talk about later on; it's not actually elfamycin resistance. That will be a good cautionary tale. And then you can look at something like beta-lactam resistance: penicillin, ampicillin, and friends. It's also fairly abundant, but there's a somewhat wider distribution, so some people have more of it than others.

Another question: what does the microbiome correlate with? We can do surveys all day every day, but it's not very interesting or useful unless we can start identifying associations with different types of potential drivers in the environment. Here's an example. This is actually from 2006, but it's still one of my favorite examples; it's not even DNA sequencing. These authors, Fierer and Jackson, took about 95 soil samples throughout the Americas and characterized the soil in various different ways.
pH, phosphate, nitrate, salinity, quite a few different drivers. And the one they found to be the most important, or seemingly the most important, was pH. So pH increases along this scale, and this axis is the diversity of the sample; for those of you who care, it's Shannon diversity. You can see that at low pH, around four, not a lot of organisms like to live in these habitats, because it's a pretty difficult habitat to live in. From there, diversity increases to a peak at an almost neutral pH, and then as we get alkaline it starts to tail off. Presumably, if we kept going to a pH of 10, we would see a similar crash in diversity. Okay, so there's one: you can test against various potential variables and identify the ones that matter most.

I had to throw this in. Some of you may remember the clickbait kissing microbiome study from a few years ago, where they approached couples at an Amsterdam zoo and said, hey, would the two of you like to participate in a study that involves kissing? And what they showed, you have to look at the supplementary material, is basically a couple of things. This is the number of kisses per week within a couple, versus the dissimilarity of their salivary microbiomes: the more frequently they kiss, the less dissimilar their salivary microbiomes, which makes sense. And this one is simply time since last kiss; same idea. But the best part, and this is not relevant at all, is that they asked the members of each couple to independently report how often they kiss each other. You have to look at the supplementary material.

Okay, another question is how microbiomes respond. Again, not super recent, but this is one of my favorite studies, and it's an example of a bizarre multi-dimensional projection that actually tells a really cool story. The backstory is that you infect mice with C. diff, as one does, and, we'll get into this later, this is a two-dimensional summary of biodiversity. You've got like a zillion different bacteria in there, but you can't present all of that, so you try to come up with a much more succinct summary. Long story short, sick mice are over here: if they have C. diff, their microbiome tends to look like this. They had four treatment groups, and, I should mention, healthy looks like this. So you can see that they're pretty well separated by C. diff infection, which is not really surprising, right? They have C. diff or they don't. And then what they did was really interesting. They took these mice and subjected them to various treatments; they were interested in particular in probiotics. What you can see is, from a starting point here, the progression over days as the mice recover from their C. diff infection. So what's happening here? What is orange? Orange is simply a fecal microbiota transplant from a healthy mouse: pre-transfer, three days, four days, six days, 14 days. How about gray? Gray is one of the passaged inputs, meaning they cultured the fecal bacteria overnight before giving them to the recipient mice. And you see a similar progression. Although, actually, oh, I'm sorry, I'm lying, I got this backwards. This one is a Bacteroides and Lactobacillus formulation: pre, here's six days, here's 14 days. That's interesting. The blue is the overnight passaged culture.
Pre, six days, 14 days. And then this is really interesting: the last one, Mix B. This was taking six bacteria from that passaged culture and giving just those six to the recipient mice: a controlled, reasonably well-characterized microbiome sample. Pre, six days, 14 days. Last thing I'll point out: you can also treat C. diff with antibiotics, such as vancomycin. Those mice ended up here. It's my favorite figure because it tells such a great story with such a usually abused and horrible visualization.

Okay, so like I said, I'm mostly going to be talking about 16S; Morgan will take over for metagenomics in the afternoon, but we're going to start simple. The basic pipeline: extract DNA; amplify using PCR with some set of primers; sequence; deal with the inevitable errors in your sequences; build clusters if you're feeling brave; and then look at diversity, similarity, and associations with environmental data, as one does with 16S surveys. Okay, I don't want to dwell on this too much, but first comes sample collection and DNA extraction. There are kits for this. Different types of environment have different inhibitors of DNA recovery, so this is just something you need to read up on: look at the potential inhibitors and then find the appropriate kit or the appropriate experimental protocol.

This one is great. So in science, we have these things called controls, which are useful. This is my favorite: a paper published a few years ago about the microbiome of a particular organ that was thought not to have bacteria in it. The paper was like, oh my God, look at this, we sequenced the microbiome of this organ and we found a huge representation of taxonomic group X. And then you're like, okay, that's interesting. And some people pointed out that taxonomic group X is a common contaminant of the reagents that they used. So, careful, careful. And it's redacted here to protect the guilty.

Here's an interesting one that sometimes doesn't get considered: size fractionation. One thing that's really interesting is that different constituents of the microbiome, not surprisingly, have different sizes; all bacteria are not the same size. But if you fractionate based on size, you can get completely contrasting ecological results. This is a great example, looking at an oxygen minimum zone in the ocean, and basically saying: if we look at the 0.2 to 1.6 micron fraction versus the greater-than-1.6 micron fraction, different genes are enriched, which strongly suggests very different ecological roles. Another example: how many of you have seen this figure before? Yeah, this went viral last year. This is from Laura Hug, who's one of the organizers of the Canadian Society of Microbiologists meeting that starts tomorrow. This is the candidate phyla radiation. Here we have the archaea; here we have the eukaryotes, which are pushed off to the side and colored uniformly, which I love; and here we have the bacteria. They took the tree and added the super-tiny cells they had discovered, which they named the candidate phyla radiation. These are smaller than 0.2 microns, so people previously hadn't really found evidence of them because they weren't looking that small. These things are small, they have small genomes, and what they're doing is to be determined.

What's your PCR strategy?
So there's the typical, somewhat boring stuff, like choosing the right melting temperature for your primers. The sequencing platform you're using influences the read length you get; primer specificity matters, and I'm going to talk about that in a moment; and then there's comparability with previous studies. That one might be the most depressing slide in here, so just a warning, and I'll show you a couple of examples later on. I'll also point you to the Earth Microbiome Project as an example of an attempt to standardize these types of analysis, so that, wonder of wonders, miracle of miracles, if you do a study in 2016 and I do a study in 2018, we have some hope of being able to compare them in a meta-analysis and learn something that isn't lies. So that's good.

This, and I'm sure some of you have seen it before, is the 16S ribosomal RNA gene, which is typically about 1,500 nucleotides in length, and these are the different base positions. What you're looking at is the degree of phylogenetic conservation along the gene. There are some regions, the peaks, that are highly conserved across everybody, at least across bacteria. And then these much less conserved regions are the variable regions. The common practice is to design primers that target the conserved regions, so that the primers will bind to most things, and to have the amplified fragment span specific variable regions, which should give you some sense of the diversity within your sample, although that has its limitations.

This is an interesting paper where they looked at the different types of bacteria that could be amplified using different sets of primers targeting different variable regions. Long story short: if you use something in the V1 to V3 region, toward the 5' end of the 16S gene, you'll get an over-representation of certain groups of bacteria. Use V4 to V6, you get a different set, and you often fail to detect Fusobacterium, which can be pretty important, actually. So that's alarming. And then V7 to V9 favors a particular set of organisms as well, and fails to detect Selenomonas and Mycoplasma. If you're studying a system where missing Mycoplasma would be a serious problem, that matters. One bit of good news is that with the advent of long-read sequencing, and eventually error rates of less than 10%, one can hope, full-length 16S ribosomal RNA genes have actually been shown to be quite effective at detecting a lot of things and, as you'll see, differentiating a lot of things as well.

Here's another paper: an evaluation of general 16S ribosomal RNA gene PCR primers. These people validated a whole pile of primer pairs, 512 primer pairs based on 175 primers, and found about 10 that worked well enough to be worth recommending. So this is a pretty exhaustive study, and if you look at the literature, you'll see these standard primers being used over and over again. At the very least, pick one that is somehow justifiable in your study, and stay with it. Feel free to interrupt if you have any questions.

So, in terms of size fractionation, what would I suggest? Wow. I mean, it depends totally on what you want to look at, right, the habitat. Some habitats, like soil, I think, can have many more larger bacteria.
Whereas other places, you know, if you look in the ocean, you'll have these picophytoplankton: Prochlorococcus, and my other favorite, Synechococcus. So you might want to target a couple of different strata of size diversity. And if you're specifically looking for something new? Well, I think if you choose your filters carefully you can catch a lot of stuff, so you just have to make sure that you're going to catch the stuff you're most interested in. Yeah. So you're asking: what if you sequence two different regions? Yeah. So basically, you have two different loci within the 16S that you amplify. You're going to get better taxonomic resolution, but it's going to cost more, of course, because you need to sequence each sample twice. Sorry, was that your question? Excellent. And the other part of the question, is it better? Well, I'm not sure about accuracy, because you'll still have the biases, but you will miss fewer things; you're going to get kind of the union of the two. And one hope is that, at least within a set of primer pairs, the biases are comparable.

So the question is about using mock communities, where you take water or whatever medium and you throw bacteria in at defined proportions, and then you try to say: okay, what happens if I do various types of sequencing, different library preps, perhaps different sequencing approaches? Can I recover the right bacteria in the right proportions? And often the results have been fairly terrifying. You'd expect 10% of each of ten bacteria mixed in equal proportions, and often that's not what you get. I think there's a paper out of John Eisen's lab, Morgan, correct me if I'm wrong, that built mock communities and then tried different library preps and sequencing techniques, or it might have been DNA extraction. Other people have looked at different V regions, and also at doing shotgun sequencing, the metagenomics, and then trying to reconstruct 16S sequences from that. And the answers are predictably different. So, yeah, it's something to keep your eyes on. And sometimes, like I said, the best you can do is try to be internally consistent.

Okay, one more and then I'm going to keep going. Sorry, you mean in terms of sequencing error? No? You're interested in one particular region, but you also want to know about the flanking community, and you're wondering what a good strategy is. Well, I'd certainly stick with standard primers unless you have a very, very targeted question. Other people might have opinions on this, and it will be a great topic of discussion later on, but at the very least, standard primers have known properties. So unless you're really worried about missing particular organisms, you can stick to a standard V4, V6 to V8, V4 to V5, something like that.

Okay, so steps in the analysis. Quality control. If anything is left after quality control, you can assess sample diversity, and there are basically two ways to do that: I don't care about taxonomy, and I care about taxonomy. Then there's looking at similarity among samples: comparing things and asking, how similar are they? Are they similar at all? And if they are or aren't, can we associate that with, for example, metadata? Metadata is basically taken to mean everything that isn't the sequences, as if DNA is the only thing that matters.
Oh, by the way, we've also collected this other crap; who cares? So often people call it contextual data, or environmental data. Again, just be clear on what you're talking about; people are never going to agree on this. Semantics. Next, we can also look at machine learning classification. I'm not going to go into any detail, but as a teaser for what's coming on Thursday, I have a couple of neat examples. And then functional prediction, which segues into something that... hey, Morgan, are you talking about PICRUSt? Okay, cool. So it segues into something that Morgan will be talking about, according to him.

So there are a couple of standard pipelines. You've seen QIIME; mothur is another one. These have been developed by different labs to accomplish more or less the same goals, but they do it in very different ways, so the interfaces are very different. QIIME has actually just been updated to QIIME 2, which is a massive, massive change, API-based. I haven't had the time to look at it much myself, but the people in my lab tell me it's much better laid out, much easier to use, and much more extensive. So that is promising. And then Morgan is going to be walking you through his software, Microbiome Helper, which wraps a lot of the other approaches, both for marker gene surveys and for metagenomes.

So let's talk about quality. When you get sequences, you will get quality scores for each base in those sequences, typically expressed as what's called a Phred score. The Phred score is a log-scaled transform of the probability that a given base call is wrong: higher Phred scores are good, lower Phred scores are bad. So you can set a threshold to define what constitutes high quality and what constitutes low quality, and then the stuff you can do with that is obvious. If a read has overall poor quality, it goes. If part of a read is pretty good and the rest is bad, trim it, if you want to keep the good part. Often you'll get ambiguous bases, where the sequencer says, I don't know, I have no idea; maybe you can accept a couple of those, but if a read ends up having a lot of them, you might want to toss it out. There are various quality filtering tools for all of this. And one important part of quality control: when you do PCR amplification of a marker gene, you can often get PCR chimeras, where the front half of a read is from this gene and the back half is from that gene. I used to say it's like artificial lateral gene transfer, but really it's a methodological artifact. There are several tools, including UCHIME, that identify these by comparing against a reference database and basically saying: this half is similar to one thing, that half is similar to some other thing, sorry about your luck, it's probably a chimera.

FastQC is a very widely used program for looking at read quality. The URL is up there, in very difficult-to-read colors. I think I deleted the figure, so let's see if I can describe it. It's very colorful. Imagine, if you will, a read that goes across the top: position, position, position, position. On the y-axis you have the Phred quality scores. What FastQC does is summarize, over an entire run, the quality scores per position. So here's a million reads: at position one, the quality score is 36 plus or minus five. That's pretty good. But as you go toward the end of the read, the quality tends to fall off, and you go into yellow: borderline. And then red is like, no. The good example is green all the way along, maybe tailing off a little at the end. The bad example is a little bit of green, followed by yellow, followed by a lot of red, like a stoplight. That's the "I don't know how I feel about this run" kind of situation.
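(To make the Phred arithmetic concrete, here is a minimal Python sketch; it is not from the talk, and the quality values and the threshold of 20 are invented for illustration. A Phred score Q encodes an error probability of 10^(-Q/10), so Q20 means a 1-in-100 chance the base call is wrong and Q30 means 1-in-1000.)

```python
def phred_to_error_prob(q):
    """A Phred score Q encodes P(base call is wrong) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10.0)

def trim_3prime(quals, threshold=20):
    """Toy quality trimming: cut the read at the first base below `threshold`."""
    for i, q in enumerate(quals):
        if q < threshold:
            return quals[:i]
    return quals

read_quals = [36, 35, 34, 30, 28, 22, 18, 12, 8]  # typical fall-off toward the read's end
print(phred_to_error_prob(36))   # ~0.00025: almost certainly correct
print(phred_to_error_prob(8))    # ~0.158: roughly a one-in-six chance of being wrong
print(trim_3prime(read_quals))   # [36, 35, 34, 30, 28, 22]: the low-quality tail is removed
```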
Quality control having been accomplished, we can now start treating the sequences as valuable things we wish to look at. One of the questions we can ask is: how diverse are these samples? Alpha diversity simply refers to looking at one sample at a time and seeing how diverse it is, which can be expressed in many different ways. So, a few different units you can consider. One is individual sequences. You amplify your V4 region and treat each unique sequence you get as an independent thing: you say the frequency of AGCCAGCCAGCC... is 20%, and so on. But there are a couple of problems with this. One of the key ones is that you will inflate your diversity because of sequencing error. You're going to say this sequence is present at 0.0002% abundance when it's actually an error: here's the real sequence, and one nucleotide away is the oopsie, right? This is one of the reasons people do clustering of sequences, and I'm going to show you the cute little animation in a moment, but the basic idea is that you look at the entire constellation of sequences and you cluster the ones that are similar to each other. More details shortly. Another obvious thing you might want to do: taxonomic names are often like a big warm fuzzy hug, and so sometimes you want to say, oh, these are Bacteroides, these are Prevotella, and so on and so forth. But that approach is extremely perilous, because there's Bacteroides and there's Bacteroides, and lumping them together can hide a lot of important stuff, so you have to be very, very careful about that. I've got some examples for you later on.

So, okay, I'd like you to read this, and if you reach the end, I'm going to assume you consent: presentation of OTUs, operational taxonomic units, should not be taken as an endorsement of the OTU strategy, nor should it be assumed that OTUs have any biological meaning at all. Do not hold me responsible for the use of OTUs. Now let me tell you how to do it. Okay, so here's the approach. The first step for operational taxonomic units: choose a percent identity threshold. You say, all right, every pair of sequences that is at least 97% identical is going to go into the same bin. Each of the little blue dots here is a sequence. I should mention that the 97% threshold is very commonly used, because it certainly absorbs most of the sequencing-error problem, and because 97% is a magic surrogate for species, although if you go back to the classic Stackebrandt work and look at how percent identity actually maps onto species, it's not nearly so clean. You can also simplify the dataset this way, because you can often go from, like, 1.5 million sequences down to, like, 50,000, so it can make the problem more tractable. Sorry, was your comment that the threshold should be higher? Well, the outcome of the next few slides is that you should not use OTUs; there are much, much, much better ways to cluster that are available now.
I mean, you don't want to treat everything as real, because of the sequencing errors, but there are much, much more thoughtful approaches you can use; there are quite a few of them now, and many of them are very good. So: this pair of sequences is 96% identical, so you don't want them in the same OTU, okay? What you do is start with some sequence; there are different ways to choose it, but whatever. You say, this is going to be the center of one of my OTUs. Now you apply the 3%: it's basically a radius of size 3%, and everything inside gets assigned, okay? And then you just keep doing this; you cluster things until you're done. Now, there are several problems with this. One of the main ones is that OTUs are not real. There is no clean separation of sequences into 97% identity balls: here's an overlap, right? These things overlap, and this sequence could be assigned to this OTU or that one. So there are important decisions to be made along the way, which are often mathematical rather than biological. That's a serious problem.

How do we do it? There are a few different ways. One is to use a big reference database, say Greengenes, if you can find it, and cluster against that: your seed sequences come from a reference database that's already taxonomically annotated, and then you take your own sequences, sprinkle them in like fairy dust, and they get clustered. The alternative approach is to use no reference database at all and say: within my own set of sequences, I start with these and I cluster. One of the recommended approaches is to combine the two: start with the closed-reference approach, pull in what you can, and then whatever's left, whatever hasn't been clustered yet, is treated as new, and you cluster that de novo. This hybrid approach is open reference.

When you do this, you get an OTU table. Across the top are your samples; this is, again, our nursing home study, as an example. So each column is a sample and each row is an OTU. Our OTU identifiers here are, unhelpfully, also integers; down this first column, those are IDs, not counts. But you can see that for OTU number 13 there are 88 sequences in sample 1, 55 sequences in sample 2, 166, yada yada yada. Now, there are several problems with this. One is that each sample will tend to give you a different total number of sequences, so if you compare based on raw sequence counts, you're not comparing apples to apples; you are comparing apples and oranges. Apples and oranges, by the way, fall into the same OTU. The point is that if you get more sequences from this sample than from that sample, you have to do something about that. There are a few things you can do. One is to convert to proportions, and actually, given the time, I'm not going to go into the details; we can talk about this later if you like, but proportions are bad. Rarefaction: you identify the sample with the smallest number of sequences and sub-sample everything else down to that count, so you throw away some of your data. And then there are model-based methods that tend to be a lot more statistically justified; they use things like the variance of the counts to get better estimates of the underlying diversity. Each of these has its limitations. Each of them is used, and the best you can do, whichever one you use, is to be aware of the limitations.
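(Of those options, rarefaction is the easiest to show concretely. Here's a minimal Python sketch, not from the talk: sub-sample every sample, without replacement, down to the depth of the shallowest one. The three-OTU table is invented, though the counts echo the example numbers above; real pipelines do this on the full OTU table.)

```python
import random

def rarefy(counts, depth, seed=42):
    """Sub-sample one sample's OTU counts, without replacement, down to `depth` reads."""
    pool = [otu for otu, n in enumerate(counts) for _ in range(n)]  # one entry per observed sequence
    random.seed(seed)
    kept = random.sample(pool, depth)  # draw `depth` sequences without replacement
    out = [0] * len(counts)
    for otu in kept:
        out[otu] += 1
    return out

# Three OTUs in three samples; totals differ (292, 100, 196), so everything is rarefied to 100.
table = {"sample1": [88, 4, 200], "sample2": [55, 40, 5], "sample3": [166, 0, 30]}
depth = min(sum(c) for c in table.values())   # the shallowest sample sets the depth
print({name: rarefy(c, depth) for name, c in table.items()})
```

The obvious cost is visible in the output: sample1 and sample3 each discard well over half of their observed sequences.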
So, we can visualize diversity. Basically, here we have subject identifiers along the x-axis and the relative abundances of different groups along the y-axis, and it gives you a nice quick visual summary of how individuals compare and differ. You can see here, for example, that blue, I believe, is Bacteroides, and subject 26 has no Bacteroides. And so you can start to look for things; orange is Akkermansia, about which more later, and so on and so forth. Now, OTUs I don't like, and taxonomy is also questionable. Here's another one of these bar graphs, and one of these colors might refer to the genus Akkermansia, or to the species Akkermansia muciniphila, whatever, and then other groups as well, Bacteroides, whatever. And then you get the mystery fraction, because a lot of your sequences are going to map to no known taxonomic group. And that's okay, right? You just discovered things. It's exciting.

So how do we assign taxonomy? A few different ways. One of the most obvious is sequence similarity search. It's all 16S genes, everything's homologous, right? So you use something like BLAST, and whatever best BLAST match you have between your sequence and something in the reference database is what you call your sequence, using some sort of threshold, right? The best match is organism so-and-so, therefore my sequence is organism so-and-so; you wouldn't say that if your 16S sequence were only 50% identical to the match. So it's simple, but it's also too simple, because what do you do if you get matches of roughly equal identity or e-value to very different things in the database? Phylogenetic placement is pretty cool. Instead of the similarity-based approach, you build a phylogenetic tree of your reference database: you've got the Bacteroides up here, you've got the Prevotella over there, whatever, and you map your sequence to somewhere in that phylogenetic tree. That mapping can serve as the basis for the taxonomic assignment. One thing that's nice about this is that it immediately gives you some idea of how precisely you can annotate the thing. If it maps into the genus Streptococcus, you can say it's probably Streptococcus; but if it joins the tree at some higher level, say class level, you can say, well, I don't know what the genus is, but it's probably class Clostridia. Makes sense?

Another way to do it is machine learning classification, which is fun. The Ribosomal Database Project classifier does this, with a sort of naive Bayesian approach. Basically, the RDP classifier takes your sequence and breaks it down into a series of what are called k-mers, where k is just an integer, a number. A k-mer decomposition simply counts the different strings of nucleotides of length k. So if k equals 2, you'd count the occurrences of AA, AC, AG, AT, CA, and so on, and you get this vector, this list of counts, one for each of those two-mers. You can go further: k equals 3, and so on. That's good because it's fast; it lends itself really well to classifying a million sequences quickly.
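(A k-mer decomposition is easy to write down. Here's a minimal Python sketch of just the counting step, with an invented sequence; the actual RDP classifier uses 8-mers and feeds the counts into a naive Bayes model, which is not shown here.)

```python
from collections import Counter
from itertools import product

def kmer_vector(seq, k=2):
    """Count every overlapping window of length k, over the full DNA alphabet."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Report a fixed-length vector: one entry for each possible k-mer (4**k of them).
    return {"".join(kmer): counts["".join(kmer)] for kmer in product("ACGT", repeat=k)}

vec = kmer_vector("AGCCAGCCAGCC", k=2)
print(vec["GC"], vec["CC"], vec["AA"])  # 3 3 0: position is gone, composition remains
```

A classifier is then trained on one such vector per reference sequence, which is why the whole thing scales so well.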
But think about it: by breaking things down into little counts of length k, you lose all the positional information. You don't know whether this k-mer was next to that one or way over there. But it actually works, which was kind of surprising to me; the classifier paper has been cited some ridiculous number of times, like 3,000 or 4,000. So it's pretty effective as a quick-and-dirty approach. Here's an example of what the output looks like. We've got two sequences over here, GD6J-something, whatever, and it gives you a classification at each rank. Bacteria is the domain level; in this case, the sequence has been assigned to Planctomycetes, which is phylum level; and so on down. What's nice is that it also uses a bootstrap-based approach to give you a confidence score on each assignment. Look at the second one there: Bacteria, domain, confidence basically 100%; Firmicutes, phylum, 32%; and on down to class, 0.26. Often what you will do with this is set a minimum threshold. You might say anything less than 0.7 doesn't count, I'm going to disregard it. 19% support? No. Do not assign a genus name on that basis.

Alpha diversity: how do you calculate it? Well, the simple way is to count your OTUs, or count your species or genera or whatever units you're using, and that gives you what's called species richness. You can also use information-based approaches such as Shannon diversity and the Simpson index, which tell you about both the richness and the evenness of your sample. Richness would say 10 and 10: it would not distinguish a sample with 10 species at 90%, 1%, 1%, and so on, from a sample with 10 species at 10% each. Shannon diversity or the Simpson index would. Phylogenetic diversity is fun. You take your sample with all its sequences, build a phylogenetic tree from those sequences, and then sum the branch lengths. Think about what this means. If you have a sample that for whatever reason contains only Streptococcus, you're going to build a tree from those sequences, and the sum of branch lengths is going to be very small: not a very diverse set. If you have a sample that spans bacteria and archaea, the sum of branch lengths is probably going to be pretty large. So that can be pretty important. Here's an example; I won't dwell on it too much. This one is cool because it looks at different types of data as well, metagenomics, reference genomes. These are different body sites, and, long story short, some are more diverse than others. Okay, we knew that. But the choice of diversity measure, and the choice of OTU threshold, taxonomic mapping, whatever, is obviously going to be very important for this.

Okay, part three: beta diversity. Now you take two samples at a time and compare them. So here's one of your pairs of samples: how similar are they? How dissimilar are they? You can probably imagine in your head ways that you might want to compare them; it turns out there are dozens. Once you've done this, you put each pairwise comparison into a dissimilarity matrix: samples one, two, three, four, five, and each cell is a dissimilarity. Then you can summarize that matrix using various approaches, such as, and I'm sure some of you are familiar with this, principal coordinate analysis: ways of trying to cram that matrix into some sort of low-dimensional representation that may or may not be useful.
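(Here's a minimal Python sketch, not from the talk, tying the alpha and beta pieces together: richness and Shannon diversity on the alpha side, and one common weighted, non-phylogenetic dissimilarity, Bray-Curtis, which is named properly in a moment, on the beta side. The abundance vectors are invented to echo the 90%-versus-even example above.)

```python
import math

def shannon(counts):
    """Shannon diversity H = -sum(p_i * ln p_i): sensitive to richness AND evenness."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0 for identical abundance profiles, 1 for no overlap."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / (sum(a) + sum(b))

even = [10] * 10           # ten taxa at 10% each
uneven = [91] + [1] * 9    # ten taxa, one of them at 91%

print(len(even), len(uneven))               # richness: 10 vs 10, indistinguishable
print(round(shannon(even), 2))              # 2.30: the maximum for ten taxa (ln 10)
print(round(shannon(uneven), 2))            # 0.50: dominance drags the index down
print(round(bray_curtis(even, uneven), 2))  # 0.81: one cell of a dissimilarity matrix
```

Computing `bray_curtis` for every pair of samples fills in the matrix that ordination methods like principal coordinate analysis then try to project.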
So, given a pair of samples described by, say, OTU abundances, calculate their dissimilarity. Beta diversity measures can be non-phylogenetic or phylogenetic: do I care whether the two samples are phylogenetically very different? If I do, I'll use a phylogenetic measure; if I don't, I won't. Weighted or unweighted: do I care about the relative abundances of my different things? There are approaches for that as well. And there are a ridiculous number of measures. Bray-Curtis is weighted, non-phylogenetic. Jaccard is unweighted, non-phylogenetic. Weighted UniFrac is weighted, phylogenetic. This is a paper we had a few years ago where my PhD student Donovan compared a total of 39 different beta diversity measures. This is not a phylogenetic tree; it's a correlation tree. Each leaf is a different measure, and their proximity is proportional to the similarity of their predictions. So things like Bray-Curtis and normalized weighted UniFrac basically give you the same answer all the time. Donovan identified several groups of measures that tend to give you different kinds of answers, and so one thing that might be useful, if you're looking for patterns, is to try one measure from each of the colored boxes, okay?

Excuse me, what do I mean by normalized UniFrac? Oh, God, it's been a while. So it's weighted, meaning it considers relative abundance. What does the normalization do? Help me out here, I forget. I think it has to do with the distance to the root: weighted UniFrac considers branch lengths, but it doesn't really account for the fact that some leaves are much farther from the root than others, so the normalization tries to compensate for that, I think. It's been a while since I've thought about it, and I don't think it's very widely used.

Okay, what do you do with the dissimilarity matrix? This is that projection, where you take your complex dissimilarity matrix, which is based on all that taxonomic stuff, and try to project it into a few dimensions. This one compares the gut microbiota of three different populations: U.S. subjects, Amerindians, and Malawians. You can see that on some axes of differentiation you can separate this population from that one, but along that other axis they're really not very differentiated at all. And then there are methods you can use to say: okay, this is principal component, or principal coordinate, one; what does it mean? You can look at how the different taxonomic groups feed into that dimension, and that can give you some clues as to what's really differentiating this group from that one. From a dissimilarity matrix you can also do hierarchical clustering and say, these samples are pretty similar, these are not. There are different techniques for that, UPGMA and friends, for joining up the different samples. All right. Different beta diversity measures can yield dramatically different results; I'm not going to dwell on that. I know there was a figure, but it disappeared again.

Associations with metadata. Okay, so you've got the diversity measures; now what? Well, one thing you're probably interested in is: how does the diversity of the microbiome vary with salinity, or temperature, or how recently a cow walked over that spot, right?
And so you'll take your diversity information and your metadata, or whatever habitat information you have, and you'll use some kind of statistical technique. ANOVA, many people are probably familiar with. PERMANOVA is simply a non-parametric, permutation-based version of it. You can look at matrix-correlation approaches like the Mantel test, which I'll show you in a moment if the figure has survived. One of the simplest things you can do is take a particular taxonomic group, or operational taxonomic unit, whatever, and regress it against something. So this is the abundance of a particular OTU, and this is some metadata; in this case, it's the frailty of the subjects in our nursing home study. You can fit a line and say, well, okay, there's a positive relationship, and you can put a p-value on it; in this case, the p-value is something like 0.001. And then you should look at the plot and ask: is that useful information? You always need to consider residuals and effect size as well. You can do ANOVA to compare different categories; this is actually the abundance of a particular function in people who are 60, 70, 80, or 90 years old, so you run your ANOVA test and get a p-value. LEfSe, from Nicola Segata and Curtis Huttenhower's group, is really cool: it will tell you which taxonomic groups, mapped onto a tree, are over-represented or under-represented, relatively speaking, between two different sample groups. Red is over-represented in one group, green in the other. And then there's other stuff; the Mantel test is a different type of regression, between dissimilarity matrices. Again, I'll just leave it there.

Machine learning classification: you know what, I'm going to hand this off to Francois Levy. Basically, these methods are really powerful and really useful, but there are like a billion different pitfalls you need to consider. This is one of my favorite studies, well, one of my favorite things that I've done in my life. It uses support vector machines and random forests to distinguish supragingival plaque communities from subgingival plaque communities, using not just OTUs but a more phylogenetically informed approach. I'll just leave it there, but the classification accuracy was good. Functional prediction: this one's on Morgan. Take your 16S, take a reference database of genomes, and try to predict the functions in the metagenome from your 16S.

Okay, problems. The complaining part of the presentation. Data quality is obviously a problem: sequencing errors get introduced, and both the rate and the type of error depend on your sequencing platform. Chimeras I've already talked about. Reproducibility, I've already touched on: two different variable regions can give quite different results, and ditto for sequencing platforms; there are papers showing all of this. And then there are a bunch of different tools, and your choice of workflow is going to influence what comes out at the end. This is from a mouse frailty study we did a few years ago, and this is interesting. We had young, middle-aged, and old groups of mice, and we didn't have very many of them. So we thought: okay, can we use sequences from reference databases to build up our own dataset? We did this kind of similarity ordination approach, principal coordinate analysis, and the reference set was actually microbes from six different strains of mice published in a previous study. And look at what we get.
Our old mice are differentiated from our middle-aged mice, which are differentiated from our young mice. We were hoping that the reference mice would fall in as kind of reference middle-aged mice, but they all fall over there. So the two studies are not comparable, which is why this is not in print.

This is cool. Okay, the figure disappeared again, but the basic idea is that 16S is only one option. You can use other markers that are more highly variable; the internal transcribed spacer (ITS) sequence is an example. People have used that specific marker to find very interesting patterns in the ocean, within a single species, that would not have been detectable at all using 16S. Okay, so be careful.

Okay, taxonomy and OTUs suck. What else have we got? So this is coming out of work by Mike Hall, a PhD student in my group. He has developed an algorithm and software, called Ananke, which clusters sequences not based on sequence similarity, but based on their time series: if two things tend to go up and down together over time, they'll be assigned to the same time-series cluster. So here's a time series from one particular lake in Wisconsin over the last 11 years; the data set was collected by the McMahon lab. This is what you get if you do OTU clustering of the sequences. You get this pattern. It's seasonal, right? But if you do time-series clustering and look carefully, you can actually see that this OTU is actually composed of more than one thing: one time-series cluster comes up predictably a week or two before the other one. So what have we lost by clustering into OTUs? A lot.

Same problem with taxonomy. I think I'm pretty close to the end; yeah, two more slides. What you're looking at here is time-series clustering for our nursing home subjects, and each color represents a different time-series cluster of Akkermansia. So here's Akkermansia muciniphila, in red. Notice how different subjects have different types of Akkermansia, and even within subjects, different types of Akkermansia have different temporal behaviors. Are these things ecologically the same? Should they be treated as the same thing in the analysis? No. But they all get the same label, right? This is potentially a problem.

Last example: biological relevance. Here's a microbiome sample: 99.9999% of it is one organism, and 0.0001% is something else. So does that rare thing matter? We found it in the sample, and if you sequence deeply enough you can probably find almost everything. So where do you draw the line in terms of biological significance? It's a very complicated question.

So, summary. There are many ways to do this. Marker gene analysis is widely used and relatively inexpensive, and it comes with a bunch of limitations. What can we do about them? Well, that's the rest of the workshop. Just to finish off: you can sequence the microbiome of anything you want, and people have. Children's inflatable pools. Chicken poop, thanks to John. We've got squash bugs. We've got kimchi; who likes kimchi? Statues: people are interested in the degradation of statues. Roller derby: skin-to-skin contact in roller derby. Donkeys. But not the whole donkey; just certain parts of the donkey. So. And somebody asked about power calculations and so on; I just found this online. I haven't looked at the web server yet because the internet was down, but I think it gets at what somebody was asking about.

Compositionality. Yes. So the question is: you've listed some of these methods, but are they applicable to compositional data without transformation?
So the question is about compositional data, and this is a problem that comes up, for instance, when you convert your counts into proportions: by converting to proportions, you lose a degree of freedom in your data, because you constrain each sample to sum to one. And people have shown that certain types of analysis suffer from this. One thing I didn't talk about at all is co-occurrence analysis: if you run co-occurrence analysis on compositional data, you are going to get a lot of spurious results. Again, some of these problems can be remedied with other types of approaches, and again, there are like 15 different ways to assess co-occurrence.
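(To see the closure problem concretely, here's a minimal self-contained Python sketch with invented counts: three taxa whose absolute abundances are completely independent acquire a clear negative correlation as soon as each sample is converted to proportions.)

```python
import random

random.seed(1)
n = 500
# Independent absolute abundances for three taxa: no real association at all.
a = [random.randint(50, 150) for _ in range(n)]
b = [random.randint(50, 150) for _ in range(n)]
c = [random.randint(50, 150) for _ in range(n)]

def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = sum((p - mx) ** 2 for p in x) ** 0.5
    sy = sum((q - my) ** 2 for q in y) ** 0.5
    return cov / (sx * sy)

print(round(pearson(a, b), 2))  # ~0: the raw counts really are independent

# Closure: convert each sample to proportions summing to one.
totals = [x + y + z for x, y, z in zip(a, b, c)]
pa = [x / t for x, t in zip(a, totals)]
pb = [y / t for y, t in zip(b, totals)]
print(round(pearson(pa, pb), 2))  # clearly negative: spurious, induced by closure alone
```

With three symmetric components, the induced correlation sits near -0.5, which is exactly the kind of spurious co-occurrence signal the answer above warns about.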