Please join me in welcoming today's speaker, Dr. Julie Segre. Hi, good morning, and thank you to all of you who have joined here in the auditorium and also online, now and in the future. This is very much a developing field, but I think we're also at a moment in which many of the methods that we use for interrogating the microbiome are finally hardening, so that I can give you a talk where, if you were to use these approaches, your paper would still be current when you tried to submit the data in a year or two. That's because in the last five to 10 years this really has been something that has evolved with the sequencing technology. When I first started looking at microbial communities, I used Sanger sequencing to look at the 16S. Then we bought the 454 Roche instrument to use that. That instrument has been discontinued. So we're all now very much in sync using the Illumina MiSeq and the HiSeq, which means that there's more ability to do cross-study comparisons. But I wanted to talk through today what the standards are that our community is using and how we set up these studies. So I have no disclosures. And I'll start with the introduction. I'm going to give a very human microbiome focused talk because that's really my field of expertise. But I would say that the types of analysis that we use, especially from the sequencing realm, would be generally applicable whether you were looking at ocean communities or human communities. We are all looking to understand the myriad microbes: the fungi, bacteria, viruses, archaea. And from the human perspective, why we're interested in this is that there is some variation in the human genome, but if you think about the orders of magnitude here, your body also is covered in these bacteria. So we are estimating trillions of microbes. And their genetic potential is quite diverse at the strain level, at the species level, at the genus level.
And so this is another way in which there is a second genome that is associated with all ecosystems. These microbes are diverse and dynamic. And that is really some of what you will see now as our society grapples with the fact that microbial communities may go through times where they bottleneck, and then what's going to come out the other side, either with humans in terms of antibiotics or oceans in terms of oil spills. So just some general terms that I'll use. The humans, of course, are host to many of these microbes. The microbiome is the microbial community, the totality of all of that DNA. You'll see many of these things in the press. The microbial cells outnumber the human cells. Some people would say it's about an equal number if you're only talking about bacteria and fungi. I choose to include viruses in that, so I would say that the microbial cells will then outnumber them. As much work as we've done to understand the function of human genes, the microbial DNA is really understudied. So if we sequence a human genome, we may know the function of 75% of the genes, and we may understand that they can play multiple roles. When we annotate bacterial genomes, even things like E. coli, we haven't had that same rigorous testing of what the multiple functions of a protein are. And we have a hard time even trying to predict protein structure, or protein function based on structure. So we're really entering a new period of discovery in which we need to have some assays where we can then see how this genetic potential reads out in terms of function. So while we've really focused for the last 100 years on the pathogenic potential of these microbes, thinking about Ebola, thinking about tuberculosis, thinking about Staph aureus, there are also many beneficial microbes. And that's part of the reason why we are so interested: the role of these commensal beneficial microbes, which aid in vitamin synthesis and digestion.
Really important is the education and activation of the immune system. This is a way in which perhaps the microbial communities are being read out systemically, and over the lifespan of a human. One of the major roles also of these beneficial microbes comes from the fact that nature abhors a vacuum. So if a pathogen comes in, enters the system, and the microbial community isn't stable or isn't present, maybe because you've taken antibiotics, then you are more able to be colonized by a pathogen. So now to launch into how we have been studying microbial communities. The original way, of course, which is still done for many reasons, is just culturing microbes on these plates. This is blood agar. And it was recognized early on, of course, that the majority of these bacterial species don't grow in culture. This has been called the Great Plate Count Anomaly, where there are microbes that are really hardy and very easy to grow in culture. So I could grow Staphylococcus epidermidis. I could grow it from every body site. But it might not be that common. There is a bottleneck or a distortion of the microbial community that you read out by culturing it. And also, our systems have been set up to grow these microbes in isolation, whereas we know that many microbes really rely upon the community in order to flourish. So the sequencing came about with the idea that maybe we could get a different perspective on what the microbial communities are. And just to bring it full circle, we do a lot of culturing now, too. But we usually do it informed by sequencing. We know what we are trying to culture, and we pick culture conditions that will then enable those bacteria or fungi to flourish. So it's quite different. If I want to capture the fungi that live on the skin, I put olive oil on my plates to capture the Malassezia. Or I could inhibit the growth of the bacteria with certain antimicrobials.
So this was really the first experiment that we did where we were even saying, does sequencing give us a different answer? I'm sorry, I've gone a bit out of order and already said, well, of course it does. But I think from first principles you have to understand that, because we're about to make a big investment in sequencing all these microbes: does it tell us anything different? So this was our first experiment, which I set up with Patrick Murray, who was the head of clinical micro. He put the skin swabs on many different culture plates, the blood agar, the chocolate agar. You can see the different morphologies that we recovered. And then we sequenced every isolate we got. We took a parallel swab from the same site, put it straight into lysis buffer, and just sequenced what we got. And the results did not astonish Patrick. What you can see from here is that the orange is the Staphylococcus. So this is comparing the alar crease, the side of the nose, and the umbilicus, which is the belly button. And what you can see here is that if I just do a survey, which is the DNA sequencing, I get a community that has mostly Propionibacterium, that dark blue; it has some cornflower blue, the Corynebacterium; it has some Staph also, those orange; and the other Firmicutes in red. When I put it into culture, I lose that diversity that you're seeing up at the top, including that little green Proteobacterium. And what I get is basically Propionibacterium and Staph, which we know how to culture. The umbilicus, or the belly button, is even more extreme, where you're seeing that the sequencing would say that there's a lot of Corynebacterium there. We can culture those Corynebacterium, but they're mostly being overgrown by the Staph and the Firmicutes.
So we're using this now to say that there is similarity between the community you get from DNA sequencing and the one you get from culturing, but that sequencing gives a more reproducible and accurate representation of what the community is. Sorry, I should have put this into the slides. I would say that for all of our experiments moving forward, we standardize to the Human Microbiome Project mock community, which can be ordered from BEI, which is sort of like ATCC. It's a not-for-profit repository. And we sequence the mock community. We actually do it on every single plate of sequencing that we do to standardize our experiments. I have shared it with people, but I would also recommend that people just order it if you're starting to do sequencing experiments, because it has a known answer. That's something that you're seeing here, where I'm giving you two different results and you're saying, but what is the truth? And that's where the mock community, which is a mixture of 20 bacteria that have all been put in at the same concentration, is very beneficial, because it allows you to standardize across sequencing lanes. If you change protocols, anything that you change, we always standardize back to that same mock community. We run it with every plate of sequencing that we do so that we know if a plate has failed. It also helps us if we do a study and then we collect more samples maybe two years later and we think, well, have we changed things in the laboratory in those two years without even realizing it? We always go back and compare that exact same sample again. So, topics for today. Well, first of all, there'll be the random things where I go off and realize that I should have put something like BEI and the mock community in. But I'm going to first talk about bacterial diversity studies, fungal diversity studies, bacterial genomes, metagenomics, and then finish with where the technology is going.
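The mock community check described above can be sketched in a few lines. This is only an illustration, assuming an even mock community in which every member is expected at the same read fraction (a simplification, since 16S copy number differs between genomes); the function name, taxon names and read counts are hypothetical.

```python
import math

def mock_community_bias(observed_counts, n_expected_taxa):
    """Return log2(observed / expected) read fraction for each taxon.

    Assumes an even mock community: each member is expected to contribute
    1 / n_expected_taxa of the reads.
    """
    total = sum(observed_counts.values())
    expected_fraction = 1.0 / n_expected_taxa
    bias = {}
    for taxon, count in observed_counts.items():
        observed_fraction = count / total
        # A small floor avoids log(0) for taxa that dropped out entirely.
        bias[taxon] = math.log2(max(observed_fraction, 1e-6) / expected_fraction)
    return bias

# Hypothetical read counts for a 4-member even mock community.
counts = {"taxon_A": 500, "taxon_B": 250, "taxon_C": 250, "taxon_D": 0}
bias = mock_community_bias(counts, n_expected_taxa=4)
for taxon, value in bias.items():
    print(f"{taxon}: log2(observed/expected) = {value:+.2f}")
```

Running the same calculation on every plate gives a simple number to track over time: a taxon whose bias suddenly shifts suggests that something in the protocol changed.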
Bacterial diversity studies are typically based on the 16S gene. The 16S is a ribosomal RNA. I'm sure you're all aware that the ribosome is where proteins are synthesized. The ribosome is a mixture of ribosomal RNAs and also of proteins. These ribosomal RNA genes can be present in multiple copies in the genome, and they also have a structure where there are regions that are more conserved, because they are necessary for structure, and regions that are more variable. This 16S gene has really been used as the signature phylogenetic marker for decades now, and it allows you to identify bacteria and archaea. You see it here, where on the left is the ribosomal RNA gene. You can see these stems and loops, and the stems are of course more highly conserved because they have a structure where you're going to have to have a double-stranded RNA there. On the right-hand side is the variability across the gene, where each of the variable regions where you might get more information is marked. You place primers in the highly conserved regions and you sequence across the more variable regions, which helps you to identify the genus, and sometimes get to the species level. So this is the basic workflow, where from a microbiome sample you can have multiple members of the community. You do one DNA extraction directly from the sample. We don't do culturing beforehand. You amplify the 16S gene and you can use that for taxonomic classification. You can also use that for doing population-based analysis, where you talk about alpha diversity and beta diversity. That basically means: how many different species are there in this community, and how does this community compare with another community? Okay, so I put this in really just for the handouts, to have it there.
Okay, so pretty much I've said people are using 16S, but before that, what are the things that you need to consider when you're setting up a study? First of all, I think it's really useful to define the question as precisely as possible. Here's one question: I want to compare wild-type with knockout mice. It turns out if you come and talk to me, I'll have a lot of questions about that. Are these mice littermates? Because what we've seen is that there can be variation even due to just cages and how they've been breeding, and you'll see one example of that. But I'll also ask you what controls you need. So I think it is important to try to be as clear as possible about the study design. That's not the focus of today's talk. I'm going to talk more about these other questions here. What sequencing platform will you use? What region of the 16S gene will you amplify? How many reads per sample do you need? What are the hidden technical issues? I'll focus here on chimeras. What analysis tool will you use? How will you display your data? How will you compare your results with other published studies? And what information do you really need from these studies to yield a testable hypothesis? So I want to take you through my cookbook and how you would follow this recipe. From the very beginning, one of the things that we do struggle with is calculating the bacterial load. Here, I would say that typically people are using a qPCR approach to ask how many copies of a bacterial gene there are. Most people are still using 16S rRNA. I would say that there has also been some effort from Elhanan Borenstein and others to identify genes that are single copy to get an even more accurate assessment. I did say that the 16S gene can have multiple copies in a genome, which you may have to control for. But I think you all understand that a qPCR could tell you how many copies of bacterial genomes you have in a sample.
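The arithmetic behind a qPCR load estimate can be sketched as follows. This is a minimal illustration, assuming a standard curve of the form Ct = slope * log10(copies) + intercept; the slope, intercept, Ct value and the 16S copy number per genome are all made-up numbers, not values from the talk.

```python
def copies_from_ct(ct, slope, intercept):
    """Invert a qPCR standard curve: Ct = slope * log10(copies) + intercept."""
    return 10 ** ((ct - intercept) / slope)

def genome_equivalents(gene_copies, copies_per_genome):
    """Correct a marker-gene count for multiple copies of the gene per genome."""
    return gene_copies / copies_per_genome

# Hypothetical standard curve and sample measurement.
slope, intercept = -3.32, 38.0
sample_ct = 28.04

gene_copies = copies_from_ct(sample_ct, slope, intercept)
genomes = genome_equivalents(gene_copies, copies_per_genome=5)
print(f"16S copies: {gene_copies:.0f}, genome equivalents: {genomes:.0f}")
```

With a single-copy gene, `copies_per_genome` is 1 and the correction disappears, which is the appeal of the single-copy approach mentioned above.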
I think the hard part is really, what are you going to normalize your sample to? If most of you do gut studies, maybe you can normalize to the grams of stool. And I guess it's just something that you have to consider, in that maybe there's more undigested food material, so maybe the grams of stool isn't always the right measure. Jeff Gordon has done this, and he's measured it versus how many calories are being excreted. For us with the skin, we sometimes think about it where we're trying to just say, per square centimeter, how many bacteria do I get? And we're comparing when I swab the skin with when I scrape the skin with when I do a full-thickness punch biopsy. There the difficulty is that we can normalize to square centimeters, but we do wonder whether there's variability from the user who collected the sample. So probably what most of you did come here to talk about is the DNA sequencing. There, the method that will give you at this point the most information from sequencing the 16S is to use an Illumina MiSeq with an amplicon. What you're doing down here is putting in these primers. This is amplifying the V1-V3 region. There are other primers that are very standard that will amplify V4, V6. You're amplifying the 16S gene and putting it on the Illumina MiSeq, and that's the sequencing platform that we have standardized to. I would say for a small study, what I have seen is that the sequencing is limiting, because there really is an investment here for these primers. The way that we do it in a production lab, a boutique production lab, is that we have these primers, but we then have a stutter linked to an Illumina barcode. You can read more about the stutter in all three of these papers.
The stutter addresses a problem with amplicon sequencing. What is hard when you go onto an Illumina instrument with a PCR product is that every cluster will read the same A and C and G, because they're all going to read the primers that you used to amplify that PCR product. So the first 20 base pairs are all going to look the same, and it's hard for the Illumina instrument to get the register if every cluster is lighting up with the same base. That's why we use the stutter: we put in between zero and four extra base pairs on the different primers, and that means everything is now off register from each other, and then we can detect the amplicons much better. If you don't do that, then people often load PhiX just so that not everything is reading A, C, C, A, G, or whatever the primer sequence would be. But to get to this point, I would say scale is the issue. In a small study, the sequencing is limiting, and I still think that's a lot of why at the NIH we are really trying to create a microbiome initiative where there would be some place where you could load 100 samples and he would have 50 samples and she would have 25, and we could do this together rather than all having to set up the different reagents, set up the same platform, and have a few samples every once in a while. Because right now we do multiplex 400 samples together in one lane, but even we, who are a microbiome lab, have a hard time finding 400 samples. There are other means of sequence data acquisition. Some people will talk about oligotyping or PhyloChips, where you send it off and they put it on a microarray. I think that the analysis of that data can be more straightforward because it's more like looking at microarray data. It probably is more expensive, but these things are always hard to cost out.
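The effect of offsetting the primers by zero to four bases can be seen in a toy example. This is a sketch only: the primer and spacer sequences below are arbitrary placeholders chosen for illustration, not the sequences used in the papers mentioned above.

```python
# Placeholder primer: NOT a real 16S primer, just an arbitrary 20-mer.
PRIMER = "ACGTTGCAACGTTGCAACGT"
# Hypothetical spacers of 0 to 4 bases placed in front of the primer.
SPACERS = ["", "T", "GT", "CGT", "ACGT"]

def distinct_bases_per_cycle(reads, n_cycles):
    """Count how many different bases the instrument sees at each cycle."""
    return [len({read[cycle] for read in reads}) for cycle in range(n_cycles)]

unstaggered_reads = [PRIMER for _ in SPACERS]
staggered_reads = [spacer + PRIMER for spacer in SPACERS]

unstaggered = distinct_bases_per_cycle(unstaggered_reads, n_cycles=10)
staggered = distinct_bases_per_cycle(staggered_reads, n_cycles=10)
print("no offset:  ", unstaggered)
print("with offset:", staggered)
```

Without the offset every read shows the same base at every cycle; with it, the early cycles see a mixture of bases, which is what the instrument needs to register the clusters.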
And I guess the limitation is that if your goal is to find a unique or novel species, you can't find that on something that has defined material. The other really good method is the Illumina HiSeq. That is what these big studies like the Earth Microbiome Project are doing, and they're pretty much analyzing the V4 region there. So it's a shorter read and it will give you less phylogenetic information, but that certainly is what a lot of the larger studies that you'll see are doing. Okay, so you get these 16S reads back. How do you figure out what they are? If you think the answer is that you would go and BLAST it, you will unfortunately BLAST your sequence, it will match tons of things, and probably the majority of what it matches will be things that say uncultured, from a 16S rRNA sequencing study. That won't help you very much, because unfortunately people like me have just littered GenBank: we had to deposit all of our data from all of our studies, and we just annotated it as uncultured 16S. I'm going to talk about the tools that we use: mothur, QIIME and CloVR. I'm going to really focus on mothur and QIIME because they're the workhorses. Built into all of these are a lot of tools, and I'll try to unpack some of them. I have to say that in the olden days, underlying mothur were SONS and DOTUR; you won't see those anymore. They were all built as separate tools, but they've all been brought together so that it's one-stop shopping now, either at mothur or QIIME. It's also been a place where the community then adds additional resources. So at one point my lab did a fungal study and we built this fungal database. What we did was then load it into mothur and into QIIME. So it's gotten to be a place where we really bring together tools. So, the 16S sequences. We use pretty much a reference-dependent database.
So if you want to classify a sequence, within mothur or within QIIME you can go in and use the Ribosomal Database Project, which we all have more or less standardized to. It is very similar to SILVA and very similar to Greengenes. It will give you an assignment for a bacterium, where it's a curated reference data set. So it has brought into play what the high-confidence differences are between two different genera and between two different phyla, or, well, not phyla, but between two different families or orders. So it'll give you that kind of resolution, and you can feed into these databases any of the different regions of the 16S gene. There are some differences: if you want to get beyond the genus level, then there are some regions that are better for getting to the species level. So for Staphylococcus you would want to use the V1-V3 region, but for Lactobacillus you'd want to use the V4-V5. So it is important to think about what the genus is that you care most about, or what the tissue is that you care most about, and you may want to tailor your sequencing to that. I should also say that each of these primer sets has its own bias, and that has been documented. That's where, again, the mock community comes in really useful. You definitely want to test out your primers and your sequencing on the mock community, because if there are signature taxa for your body site, you want to make sure that you're actually recovering them. So from this it'll return something that you can make into a bar chart that basically says what genus each of the sequences belongs to. If you get a sequence that has no reference, you may think that you have identified a novel bacterium, but there are other explanations, and I'll get to that in a second. I won't say much more about it, but I wanted to at least give you the basic facts. This is the RDP database.
As I said, it's based on aligned, curated, annotated 16S genes, where a lot of work has gone into classifying. I'm just giving you the real specifics of it because there were choices that had to be made. For example, RDP uses Bergey's taxonomy. There are sometimes bacteria that change from one name to another, and this can be frustrating to people, but we continue to discover more microbes and more distinctions, and there is a community that determines when something gets reclassified. Alongside the RDP Classifier there are also tools like Probe Match and SeqMatch. This is the SILVA database. They're really quite similar. I don't think at this point you'd get a different answer from using SILVA than from using the RDP, but I wanted to at least make you all aware of this, and there are some things that you can do in these tools. Pretty much they'll all do more or less the same things, but it may be that one of them visually appeals to you more than the other. That's a constant challenge, and I'm sure everyone has talked about it in this lecture series. Genomic information is so rich that a lot of times it's the display of the information that really is important in terms of understanding the depth of it. So it may be that, because of what you're trying to pull out and identify, the visualization tools that are built into one of these programs will appeal to you more than the other. Okay, if you get a novel sequence you may think that you have something truly novel. I would say that probably the first thing you should think about is, do I have a chimeric sequence? You think, how could I have a chimera? These things are just 300 base pairs long. Well, that's another thing that the HMP, the Human Microbiome Project, really took a close look at, and I have to say I think everyone who served on that committee was shocked at how many chimeras we had.
So let me tell you about the test we did. How do chimeras occur? Well, it's incomplete extension during PCR. Basically what happens is you start amplifying on one strand and then that cycle of PCR ends. When the next round of PCR starts, you've ended in the middle of a very conserved region, and now that fragment can prime on anything. So your query sequence would end up being something that had started as a green and then ends up as a blue. And when you go into the database, it can't assign that, and so it'll say this is a novel species. Well, how often does that happen? Again, this is the use of the mock community, where we were doing two things. First, you will see here with the mock community that we were trying all these different primer sequences. These are the 20 different bacteria that we had in the community, and you can see that some of them ended up being overrepresented by certain primers and some of them ended up being underrepresented, and each set of primers has its own bias. Not great, but at least it's been documented. But then along with that, every set also has this percent of observed chimeras. So remember, we put 20 bacteria in, and then how many species do we get out again? Well, it turns out that depending on how you cluster and what your criteria are for pulling out chimeras, you could end up with at least 40 species in here, but you could also end up thinking that you had 350. So now built into things like mothur and QIIME are tools like ChimeraSlayer that will identify these kinds of sequences and remove them from your run. And you may say, but what if this really is a novel bacterium and this really is what I want? You can certainly go back and look through those sequences. They're not removed from your dataset, but you'd need to use that data with caution. So with this kind of sequencing data, I just want to show you some of the results and how we can use this data.
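The "starts as a green and ends up as a blue" picture can be turned into a toy check. This is only a sketch of the idea, not the algorithm used by ChimeraSlayer or any other real tool; the reference sequences are invented, pre-aligned and equal in length, and the `min_gain` threshold is an arbitrary assumption.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def looks_chimeric(query, references, min_gain=0.05):
    """Flag a query that two different parents explain better than any single one."""
    mid = len(query) // 2
    best_whole = max(identity(query, ref) for ref in references.values())
    left_parent = max(references, key=lambda n: identity(query[:mid], references[n][:mid]))
    right_parent = max(references, key=lambda n: identity(query[mid:], references[n][mid:]))
    left_id = identity(query[:mid], references[left_parent][:mid])
    right_id = identity(query[mid:], references[right_parent][mid:])
    # Length-weighted identity when each half is matched to its own best parent.
    two_parent_id = (left_id * mid + right_id * (len(query) - mid)) / len(query)
    return left_parent != right_parent and two_parent_id - best_whole >= min_gain

# Invented references that share a conserved "GGGG" stretch in the middle.
references = {
    "green": "AAAAAAAAGGGGCCCCCCCC",
    "blue": "TTTTTTTTGGGGAAAAAAAA",
}
# Extension stops inside the conserved stretch and resumes on the other template.
chimera = references["green"][:10] + references["blue"][10:]

print(looks_chimeric(chimera, references))              # chimeric read
print(looks_chimeric(references["green"], references))  # genuine read
```

A genuinely novel organism would usually be moderately different from every reference along its whole length, whereas a chimera matches one known parent very well on the left and a different one very well on the right.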
This is the data from the NIH Common Fund Human Microbiome Project, where 250 healthy subjects were surveyed at five major body sites. In some of those sites, like in the oral cavity, there were multiple samples taken. We then asked, what are the bacterial communities, using 16S amplification? And you can see that the major determinant here is the body site. So you'll see in the gut, or in the stool, there's a lot of these Bacteroidetes and there's a lot of these Firmicutes, the yellows and the browns. This is actually the average of the data; I'll show it again in a minute. Whereas in the nares, you're seeing a lot more of these blues, the Actinobacteria, and the vagina is going to have the Lactobacillus, that red. So the major finding here was that the body site is more determinant than the individual. And in fact, it goes even to the specific body site. So the bend of my right elbow is most similar to the bend of my left elbow. But after that, the bend of my elbow is most similar to Andy's, more than it would even be to inside my nose, because this is a moist epithelium and this is a drier crease. So again, this is showing more of the individuality. You're seeing the same features as I was saying. The Lactobacillus is really dominating in the vagina. The gut is again these Bacteroidetes and Firmicutes. The mouth is going to have this high representation of Streptococcus. And you can see that this again shows that the body site is the major determinant. So you can use this as a way of guiding what bacterial communities you would expect to find. When you set up a study, if you can recruit a small number of healthy volunteers, then you could sequence those and assess whether you got data that was similar to the larger Human Microbiome Project. That would allow you to leverage the larger data set.
Just to show one example from our own work of how you think about these changes in bacterial communities. This is a study that we did where we looked at the skin microbial communities as children transition through puberty. I'm putting this up because I think it's a fairly obvious study with a fairly obvious explanation. The kids here on the left are all pre-pubescent. The kids here on the right are all post-pubescent. And what you can see is that these kids, before they go through puberty, have a lot more of the reds, which end up being all of these Proteobacteria. They also have a lot more of these Streptococcus, which also makes sense if you think that kids get impetigo, a strep infection which adults don't get. We always thought it's because they're icky kids or something, but maybe it's really because there is more strep that naturally colonizes their skin. The change that we're seeing here is that post-pubescent, there's more of these Corynebacterium and these Propionibacterium, the greens. That also would make sense in that these are bacteria that require lipids for their growth, and when you transition through puberty your skin becomes oilier. So it would make sense that these bacteria could become more prominent. So that's an example where even in a healthy state you could see a transition very clearly. And we can lay it out down here, where you're seeing which bacterial genera go up and down. As I was saying, in the older kids it's these Corynebacterium and the Propionibacterium. I guess this also makes the point that you have to think about some of these things. For us, we did the study because we were wondering, when we have kids in a study, do we have to age match them? And from this the answer is clearly yes.
So okay, you've got this 16S data and you could plot it from RDP as the bacterial genus and species, but you've probably also seen other types of analyses, typically when people are looking at the community level. There, some people will just work at the genus level based on the 16S. But within both mothur and QIIME there is this other way, in which a lot of the studies end up being done on what we call operational taxonomic units. So let me take a minute to explain that to you. You could say that these bacteria all belong to Staphylococcus or Streptococcus, or you may say that these are all Firmicutes, but really the sequence data is related to the phylogenetics, and we have these definitions that typically, to be the same species, sequences have to be at least 97% identical at the 16S level. So we have the sequence data, and what we do is take the sequence data and cluster it based on sequences that have 97% identity. It's a computational, mathematical way of talking about sequences that are similar without having to go through this loop of identifying what every bacterial genus is, and some bacteria don't have that proper specification down to the species level. For example, the Corynebacterium: we just haven't sequenced that many of them. So I can't just assign things and say this is a Corynebacterium accolens, this is a Corynebacterium simulans. I don't have enough sequenced reference genomes, but I can see in the sequences that these are all Corynebacterium, and that these sequences are much more similar to each other and those are much more similar to each other. So I want to be able to retain that level of resolution, but I don't always have the reference genomes to make an assessment and say to the species level what this sequence is.
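The idea of grouping reads at 97% identity without naming them can be sketched with a greedy, centroid-based pass. This is an assumption-laden toy, not how mothur or QIIME implement clustering: the reads are invented, already aligned and of equal length, and the first read to found a cluster simply becomes its centroid.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def mutate(seq, positions):
    """Introduce a substitution at each listed position (toy data generator)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for pos in positions:
        chars[pos] = swap[chars[pos]]
    return "".join(chars)

def greedy_otus(reads, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold identity."""
    otus = []  # each entry: (centroid_sequence, [member read names])
    for name, seq in reads.items():
        for centroid, members in otus:
            if identity(seq, centroid) >= threshold:
                members.append(name)
                break
        else:
            otus.append((seq, [name]))  # no match: this read founds a new OTU
    return [members for _, members in otus]

base = "ACGT" * 25  # invented 100-base "16S fragment"
reads = {
    "read1": base,
    "read2": mutate(base, [10, 50]),             # 98% identical to read1
    "read3": mutate(base, [1, 20, 40, 60, 80]),  # 95% identical to read1
}
otus = greedy_otus(reads)
print(otus)
```

The output groups reads purely by sequence similarity, so the resolution is kept even when no reference genome exists to attach a species name.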
So this allows us to really capitalize on the sequencing data and say that I have these operational taxonomic units, and I can assign them based on 97% identity or 99% identity. There also are differences here in terms of whether you are a lumper or a splitter. You can use the furthest neighbor or you can use the nearest neighbor as your joining method. By that I mean that you can have a centroid sequence and you can say, anything that is 97% identical to it I will put together into the same OTU. That could mean that two sequences are really only 95% identical to each other. We actually require that every sequence within an OTU is at least 97% identical to every other sequence. I don't know if that's a little bit too much of a nuance, but just think about it in the general way, that you can either be a lumper or a splitter, and you have to make those kinds of decisions. Okay, so then you have these OTUs. What are you going to do with them? I think the two most common things that people do are to look at community membership and to look at community structure. So let me just distinguish for you in a toy way what I mean by that. Let's say I have two groups and I'm making two kinds of fruit salad. In my group A I'm going to use mostly apples and oranges, but I'm also going to put in some bananas, some pears and some grapes. In the second group I only have apples and oranges. If I think about community membership, where I ask how many categories of fruit are shared between them, then it's only two of the five. If I think about community structure, where I ask, if I pick a piece of fruit out of A and I pick a piece of fruit out of B, how common is it that I would find the same piece of fruit in A and B? Then the communities look much more similar to each other, because 94% of the pieces of fruit in group A are apples and oranges, and that's 100% of group B.
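The fruit salad comparison can be worked through numerically. This is a sketch under stated assumptions: the exact counts are invented so that apples and oranges make up 94% of group A, membership is measured as the fraction of shared categories (a Jaccard index), and structure is measured as the overlap in relative abundance (one minus the Bray-Curtis dissimilarity on proportions). Real tools offer several other calculators.

```python
def shared_membership(a, b):
    """Jaccard index: shared categories divided by all categories present."""
    present_a = {k for k, n in a.items() if n > 0}
    present_b = {k for k, n in b.items() if n > 0}
    return len(present_a & present_b) / len(present_a | present_b)

def shared_structure(a, b):
    """Overlap of relative abundances: sum of the smaller proportion per category."""
    total_a, total_b = sum(a.values()), sum(b.values())
    categories = set(a) | set(b)
    return sum(min(a.get(c, 0) / total_a, b.get(c, 0) / total_b) for c in categories)

# Hypothetical counts: 100 pieces of fruit in each salad.
group_a = {"apple": 47, "orange": 47, "banana": 2, "pear": 2, "grape": 2}
group_b = {"apple": 50, "orange": 50}

membership = shared_membership(group_a, group_b)
structure = shared_structure(group_a, group_b)
print(f"membership similarity: {membership:.2f}")
print(f"structure similarity:  {structure:.2f}")
```

The two numbers disagree on purpose: membership weights the rare fruit as heavily as the common fruit, while structure is dominated by the abundant categories.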
And both of these are accurate comparisons of the community, and that's why we have both of these measures. And so we really do assess both, because you could imagine that when I'm thinking about how a bacterial community transitions, that concept of community membership is gonna be important if rare species end up blooming and causing disease. If you have C. diff in your gut community and you take antibiotics, you could end up having a much greater colonization of C. diff, whereas if you don't have C. diff in your original community and you're not exposed to it, it can't bloom. So in that case, the rare species are important, but if you wanna talk about what is a community that maybe provides colonization resistance, then it may be more important what the dominant community members are. So I would say most of the time we calculate both of these and we look for whether there are discrepancies. For community membership, I'm giving you here one example, and I'm sorry, I seem to have forgotten the reference for this, it was from a PNAS paper that was done very early from Jeff Gordon's lab, in which they're looking at mice that are from a cross where they're genotyping obese mice, ob/ob mice that have the mutation in the leptin gene, and they're looking to see how they cluster. And what they're finding here, this is community membership. So they're looking to see, do they share microbes? And it's not about the relative abundance, it's just do they share microbes? And here what you'll see is that the code here, these are the pups, M33 means it's the third pup, the first and the second. And what you see here is that the pups will end up looking most like their mother. And here again you're seeing they cluster based on who was the mother of this litter. And in this case here's mother two.
So at the level of where you are inheriting your microbes from, in this case where it's an experiment with mice where presumably the father might have even been taken out of the cage, I don't know why they didn't analyze it, you inherit your microbes from the mother and then they're shared amongst the siblings. Another study that we were looking at, though, these are littermates, and what we're looking at here is the community structure. So here we're looking more at enrichments in certain bacteria that are shared by the genotype of the mice as compared to the wild type, even though they were born to the same mother. Because in this example, where the mice have a defect, there are certain bacteria that are more commonly colonizing because the skin is impaired in those mice. So that's kind of the two different measures and why they might give you different readouts. One of the questions of course is how many reads do you need? And I would give you a ballpark estimate of like 1,000 sequences for a first pass analysis. You typically will over generate. So like in a MiSeq run it's hard to not generate 10,000 reads. But it is still probably important to think about how diverse the community is, especially if you're doing ecologic studies. So I would say it also depends on how you're clustering them. That's why I sort of talked about the OTUs first. But for some sites we see very low diversity. So if you look even at the y-axis here, this is like a very low diversity sample where we really think that there's just four species. Whereas for this person's belly button we really are still accumulating new sequences. So it's just worth checking with a rarefaction curve to see how diverse your community is. And then as I was saying, there are these different ecologic measures, richness, evenness, diversity. They all are telling you something different about the community. And they all are easily calculated within mothur or in QIIME.
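A rough sketch of the measures just mentioned, computed from a table of OTU counts. The counts are invented for a low-diversity sample with four OTUs, Shannon diversity and Pielou evenness are used as representative indices, and the curve here follows a single random ordering of the reads, whereas dedicated tools average over many random subsamples.

```python
import math
import random


def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Shannon diversity index H (natural log)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def evenness(counts):
    """Pielou evenness: H divided by its maximum, ln(richness)."""
    s = richness(counts)
    # Evenness is undefined for a single OTU; 0.0 is returned here by convention.
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def rarefaction_curve(counts, depths, seed=0):
    """OTUs observed in the first d reads of one random ordering, for each depth d."""
    rng = random.Random(seed)
    reads = [otu for otu, c in counts.items() for _ in range(c)]
    rng.shuffle(reads)
    return [len(set(reads[:d])) for d in depths]


# Hypothetical low-diversity sample: 1,000 reads spread over four OTUs.
sample = {"otu1": 700, "otu2": 200, "otu3": 90, "otu4": 10}
depths = [10, 50, 100, 500, 1000]
curve = rarefaction_curve(sample, depths)

print(richness(sample), round(shannon(sample), 3), round(evenness(sample), 3))
print(curve)  # flattens once all four OTUs have been seen
```

If the curve is still climbing at the deepest point, as in the belly button example, more reads would keep uncovering new OTUs.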
And there are very good tutorials within both mothur and QIIME. They were both written by ecologists who really are trying to translate this for people who may not have the full background. In addition, I wanted to just highlight these two papers that I think really tried to talk you through the factors that you should consider when setting up a microbiome study. Okay, I mean for those of you who are looking at the time and wondering how I'm gonna get through all of this, topic one was the major topic because that's what most people are interested in. I'm gonna kind of give more of a flavor of the rest of the work. I'm gonna talk for a minute about fungal diversity, because it's very similar to bacteria, but it does require a different sequencing method and a different database. So I've talked a lot about the 16S amplification. In fungi, there are ribosomal RNA genes, the 5.8S and the 28S and the 18S. For those of you who have ever run eukaryotic RNA gels, you know that those are the bands you're looking for when you're running a northern. And some people do sequence the 18S. It gets harder, especially if you are looking in human samples, to find primers that are specific for fungi rather than humans. And the primers that have worked best for us have been primers that are actually amplifying the ITS1 region, the internal transcribed spacer that's between the 18S and the 5.8S. This is also the region that is used by most clinical micro labs to identify fungi. And so the databases for these are also just the most well-developed. This has some difficulties, in that I was talking to you about how the 16S sequence has structure to it and has those more conserved and variable regions because it is a functional RNA. Here you are working within a non-coding RNA and even just a spacer sequence. So you don't have that fixed-width alignment to do your classifications.
You can have 20 base pairs coming in and out and obviously it's not affecting the structure. So really, in the way that we then align, we don't penalize for these kinds of large insertions and deletions. We do have custom ITS databases that have been resolved at the different phylogenetic levels. So it is similar to how we do the RDP classification for this. And we get different results. So in our skin bacterial communities, we're gonna look at the skin and we're gonna say it's mostly Corynebacterium, and we talked about how the left elbow was different than the chest and the forehead. There are totally different communities when you look at the feet, I'm sorry, at the fungi. So in this paper with our skin, we looked at the different communities of fungi, and we really are here trying to develop datasets where you can then ask whether there are fungal-bacterial interactions. For the skin, we really found that it was mostly Malassezia, but we could find tremendous fungal diversity on the feet, which probably wouldn't surprise you if you think about the fact that this is where you see many of the fungal infections amongst healthy volunteers, so that would be toenail infections and athlete's foot. But just to say that the fungal community's tools are not as robust, but they certainly exist if you want to do those kinds of studies, and you can then see, you know, for us we looked at the fungal diversity versus the bacterial diversity and saw the discrepancies. Okay, now I'm gonna move on and talk about bacterial genome sequencing. So again, I've come up with sort of, you know, the things you should ask yourself before you embark on this study. Because I think many times you might be thinking about sequencing a microbial isolate and then wanting to annotate it or use it in your studies. So, you know, first just defining the study objective.
Really for us a lot of our next question is, you know, what reference genomes exist, because if there is a very good high quality reference then you can often take your reads and scaffold them onto an existing reference. But, you know, for the most part we're gonna talk today about what sequencing platform you will use. What depth of sequencing do you need? What assembly tool do you use? How are you wanting to display your data? What are you gonna compare to other published studies, and how will this information yield a testable hypothesis? But I do put forward those first two questions because I think they can often drive the decisions that you'll make later on. So how to assemble a bacterial genome? Just, you know, a staphylococcal genome is 2.5 megabases. I'm gonna talk here about gram negatives, which are more like six million base pairs. And our typical way of sequencing these is still on the Illumina MiSeq, where we're getting, you know, 30 to 50 fold redundancy. You can also do these on the HiSeq. I'm trying to kind of give the examples of a MiSeq because I think that's probably still more accessible to people as an instrument. So what happens is that you take your bacteria, you lyse them, and probably most people right now are gonna make a Nextera library, which is where you insert the transposon with the Illumina barcode right into the DNA. You know, probably previously people had sheared the DNA and made these libraries, but this is really sort of a one-hour easy DNA prep to get these kinds of reads to feed straight onto an Illumina instrument. So you end up with these reads that are 100 or 300 base pairs, and sometimes they are paired-end reads. And what you get is that one read then leads into another and you can assemble these into contigs. So I say it like, you know, and then you just assemble them into contigs. Well, it turns out that this really actually is something that we spend an enormous amount of time trying to quality control.
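The fold redundancy quoted above translates into a read count with simple arithmetic: coverage is roughly the number of reads times the read length divided by the genome size. A small sketch, assuming uniform coverage and counting each read separately (so a read pair counts as two reads); the function name is illustrative.

```python
def reads_needed(genome_size_bp, fold_coverage, read_length_bp):
    """Reads required so that reads * read_length / genome_size >= fold_coverage."""
    total_bases = genome_size_bp * fold_coverage
    return -(-total_bases // read_length_bp)  # ceiling division on integers


# A gram negative of about 6 Mb at 50-fold with 300 bp reads.
gram_negative = reads_needed(6_000_000, 50, 300)

# A staphylococcal genome of about 2.5 Mb at 30-fold with 100 bp reads.
staph = reads_needed(2_500_000, 30, 100)

print(gram_negative, staph)
```

Either isolate needs on the order of a million reads or fewer at these depths, which is part of why many isolates are usually barcoded and pooled on one run.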
You know, how are you gonna really assemble these sequences? Because there are choices that the assemblers are making about when to break a contig versus when to bring it together. Underlying most of the assembly programs are still Phrap and Velvet, and a lot of people are still using the parts, the guts, of the Celera assembler. Probably right now, most people for bacterial genomes are using SPAdes or MaSuRCA. And I can tell you, we just recently did a reanalysis of this in our lab, they do give you different results. And I don't really know what to tell you on that. So we have ways in which we then benchmark these assemblers against each other. We often, in our lab, have gold standard genomes. In our case, I'll get to this, we generate a fully assembled genome and we're benchmarking to that. But it is difficult because some of these assemblers will give you longer contigs, but maybe some of those have less support. And I don't really have an answer of like, this is the path forward. NCBI is working very hard on this too, and Richa Agarwala and Bill Klimke are also looking at this issue. And I think it's just gonna be something where it depends what kind of data you want and what kind of genome you have. Right now we've defaulted to SPAdes, which we error correct with Pilon. But I'll try to highlight where those differences might come into play. I just wanted to sort of even explain to you quantitatively how these assemblers work and why you might have differences. And it's really in these decisions about how they are simplifying: they go into these hashing methods and try to build de Bruijn graphs, and it really is in these kinds of simplifications of the linear stretches and in the error removal that different programs make different decisions. And we still so far don't have like a true method. So as I was saying, evaluating these assemblies is something within genomics that people are really still working on.
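To make the de Bruijn graph idea concrete, here is a deliberately simplified sketch: reads are hashed into k-mers, each k-mer becomes an edge between its prefix and suffix, and a contig is extended along a linear stretch until the graph branches. The function names and toy reads are assumptions; real assemblers also handle reverse complements, sequencing errors, coverage, and incoming branches, which this ignores.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from start while the current node has exactly one successor.

    The walk stops at a dead end, at a branch, or when it would revisit a node.
    """
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Two overlapping toy reads assemble into one contig.
linear = de_bruijn(["ACGTTG", "GTTGCA"], k=4)
contig = walk_unbranched(linear, "ACG")

# Two reads that disagree after "ACG" create a branch, so the walk stops there.
branched = de_bruijn(["ACGA", "ACGT"], k=4)
stub = walk_unbranched(branched, "ACG")

print(contig, stub)
```

Whether to stop at that branch, pick one path, or discard one path as a likely error is the kind of decision on which different programs differ.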
I mean, meetings that I go to will have, like, the Assemblathon, where everyone takes these sort of genomes and people compare. What did yours say? What did mine say? And oftentimes we come back to, that's why it's really important to deposit your reads into NCBI and into the SRA, because the assemblies that you deposit can have biases in them. And often, if I wanna compare my study with someone else's study, I will just grab their reads. So we get these contigs back. Many of them are quite large. We do look at coverage. That's one of the things we're looking at. This is genome coverage. And some of the plasmids can be at higher coverage. The ribosomal RNA operons will be at higher coverage. And these are also the kinds of things that you have to know. Like these ribosomal operons, because, as I was saying, there's five copies of them in a genome. That's what breaks assemblies. I mean, every time a sequence in a short read library enters the ribosomal RNA, you're gonna break, because those operons are large. And there's no way for a short read technology to know where to come out on the other side. That is where we have turned to PacBio genomes for creating references, because the PacBio, which is this single-molecule sequencing technology, can produce these very long reads that are 10 kb, 17 kb. And so that's long enough that it actually can read through all of these ribosomal operons. And from a PacBio genome, we can actually generate a fully assembled reference genome that will give us the chromosome and all of the plasmids. And then we can scaffold short reads onto that. Those have ended up being very valuable for us. And those are the genomes that, when I'm looking for a reference, if there is a PacBio genome, it tends to be more complete and I will use that as a reference.
Genome aligners, if you want to then find what the changes are. This is often what people are trying to do, looking for single nucleotide variants or insertions and deletions, and again there are options that you can use. For genome annotation, NCBI offers a pipeline, PGAP, and also the Joint Genome Institute has a pipeline. For some organisms now, I should have included it, I'm sorry, Platypus from University of Maryland is another one that we're using for genome annotation. And so for these, typically, you submit your genome sequence and they will return to you an annotation really quite rapidly. I mean, not real time, but rapidly, like days. And the reason that you'd want to do that is that within a bacterial species, and certainly within a bacterial genus, there's gonna be a variable region. So like in Staph epidermidis, every Staphylococcus epidermidis has 80% core genes, but there's 20% of genes that are in this variable region, which is also called the pan genome. So that would mean that as you sequence more bacterial genomes, you will continue to get more genes that are in that species. And so you'd want to annotate the particular genes in this strain that you've sequenced. I can say based on experience that it often is true that the differences that are in this pan genome are the least annotated. They often do come back as open reading frame, function unknown. But you still have to sort of know the basic annotation of this genome. So now I wanted to just talk about some examples where you're saying I want to compare two genomes, find SNPs, find mutations, find deletions, insertions. And I distinguished here between SNPs and mutations because those are two different things. Often we're using single nucleotide variants when we're wanting to build a phylogenetic tree.
And those are markers or signatures of the evolutionary tree, but they don't necessarily change an amino acid or cause any change in the function. And so if you identify a single nucleotide variant and you want to say that it is a mutation, you of course would need some functional studies to support that. So just as an example, this is a study that we did looking at three different multi-drug resistant Acinetobacter baumannii. And our question was whether these three strains, that all were seen at the NIH Clinical Center, evolved from a single origin or whether they had all come into the Clinical Center with an independent origin. So every time that there's a SNP relative to the reference genome, we're gonna code it, and we can use Circos here to make these very nice colorful plots. What you're seeing here are SNPs relative to the reference, and we had these three strains, A, B, and C, and we're looking to see whether there is a relationship between A, B, and C. You can find that there are these regions obviously that are unique to each of our three strains and they're in these clusters. And that's actually why I wanted to say that if we had just looked at this without having sort of stitched together the contigs, we might have called that there were thousands of SNPs different, but in fact what you see is this clustering of the SNPs in the red regions, the blue regions, this green region here, and again here, blue, red, green. That's a recombination that can't be counted as a hundred different SNPs. That's one event that caused that, and I'll show that to you here. What we've had is that the O-antigen biosynthetic locus has really come in and recombined, and that's right here, right near the origin.
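One crude way to express "a dense cluster of SNPs is one event, not a hundred" is to group SNP positions by spacing and count each dense group once. This is only a sketch: the gap and cluster-size thresholds are arbitrary assumptions, the positions are invented, and dedicated recombination-detection software relies on statistical models rather than fixed cutoffs.

```python
def group_snps(positions, max_gap=1000):
    """Group sorted SNP positions into runs whose neighbors are at most max_gap bp apart."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


def count_events(positions, max_gap=1000, min_cluster=5):
    """Count each dense cluster as one putative recombination; count other SNPs singly."""
    events = 0
    for group in group_snps(positions, max_gap):
        events += 1 if len(group) >= min_cluster else len(group)
    return events


# Hypothetical genome: three scattered SNPs plus 100 SNPs packed into a 500 bp region.
scattered = [10_000, 250_000, 500_000]
dense_region = list(range(1_000_000, 1_000_500, 5))
snps = scattered + dense_region

print(len(snps), count_events(snps))  # raw SNP count versus inferred events
```

Counting 103 raw differences would make the strains look far more distant than four events does, which changes how a tree built from these data should be read.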
So when we're talking about how to build a phylogenetic tree, we wanna look at these SNPs that have each arisen independently of each other, and when we have hundreds of SNPs clustered together, we have to be able to distinguish that that's a recombination, which could be a single genetic event rather than hundreds of independent events. I'll just talk for a minute about how we did use it when we had a clonal outbreak, and here we had a cluster of patients who all had the same carbapenem-resistant Klebsiella pneumoniae. When we sequenced all of these isolates, they were much more similar to each other, and the SNPs we did find are all across the genome and are all independent SNPs, and the sharing of SNPs did help us to identify that there was a closer relationship between patients one, two, three, and five than with this other cluster, and you can even see that we could narrow it down when we have these SNPs up here, 12, 13, 18, that there would be a closer relationship because they share these common SNPs that must have evolved during the spread of the outbreak. So that we can use to reconstruct transmission. I have to say this was an example where the genetic information really is very clear. I would say for many of the questions that we get, though, if things are 100 SNPs apart, is it clonal, is it not clonal? It's very hard for us still to make that judgment without having more references, and that's where I wanna just be very clear that genomics is powerful but it doesn't point the direction of the arrow, and there are times when all I can say is it's this many SNPs apart. I can tilt one way or the other on whether this is more or less likely, but we really can't say. It is the combination of the epidemiologic information and the genomic information that you need, because we certainly have had examples where there are clonal strains circulating in the US.
We have received two patients that clearly have no epidemiologic link and their isolates will be 10 SNPs apart. Maybe they, three hospitals ago were someplace similar or maybe these are just dominant strains that really have locked their genomes. So I just wanna be clear that we're not saying that there is some minimal information that if two isolates are within 10 SNPs that they're clonal and if they're more than that, they're not. This is really something that the global healthcare system is struggling to incorporate genomics into it. So I'm gonna talk now about metagenomics which I have to say is probably the topic in which there is the greatest change coming on now. This is basically, we've talked about sort of using these markers and using these, this is basically like you take a sample from someone's stool, someone's skin, something like this, you just feed it straight onto a sequencer and then wow do you have a bioinformatics challenge. So it's a very complex mixture and it's very complex computationally. So what do we do? I think shotgun metagenomic analysis, you do this when you wanna know really who's there and their abundance and you wanna know their function and you wanna know what genes are present and you wanna identify pathways and you wanna identify strains and you wanna recover genomes and you wanna find novel pathogenic organisms but you get just so much data that you probably do need an analysis plan before you start getting these sequences because they are overwhelming. So I've been talking about on the left where you're doing these sort of marker genome studies and now as I said, we're just gonna get fragments of DNA back. So what do you do with them? Well the reason that you would do this is if you're trying to think about differences and you may even have as we were, as I sort of talked about where there are these pan genome, open genomes, different strains, different species can have different genes. 
So you could get something here, which is a Phil Hugenholtz example, where you'd have the sort of similar bacterial community, but within this the open part of the genome or the flexible part of the genome might encode different genes, and so you'd end up having what look like two similar bacterial or fungal communities that actually do have very different genes that they're encoding, and that's when you'd need to get to metagenomics. This is the other reason you wanna get to metagenomics, and I actually have to say I find these studies totally cool. So they're trying to find out, how does a termite digest wood? So that's actually a function we might want to know, because we might wanna use that to find new metabolic enzymes. How does the rumen of a cow degrade this biomass? How do you create energy from biomass? And so these are metagenomic studies where the question is, how are you gonna find it? You don't know what it is that you're looking for. They're looking for new metabolic enzymes. But yeah, so you get this large data set and the first thing you probably say is, wow, this is a lot more than I was expecting. So I really am trying to break it down here: you can either do read-based methods or you can do assembly-based methods, where you try to assemble your reads and then use these larger contigs to identify genomes and clusters and do gene calling, or you can just do read-based mapping. So I'm gonna talk about those two strategies. If you're looking for function, you can use KEGG, COG, these kinds of tools that leverage functional databases. And this is really as good as it gets, and the only issue here is that they tend to be more focused on sort of metabolic core functions and they're not gonna return as much of the unannotated dark matter of the microbes. There are some pathway tools that you can use.
This is Curtis Huttenhower's HUMAnN, where you feed in your reads and you'll get out pathway coverage, pathway abundance. This is certainly a good place to start, and Curtis keeps all these tools available through the bioBakery and he does continue to improve them. It's a fairly solid generic look at your data. This would be an example of the kind of output that you get. On the top, I'm showing you the great differences that we're seeing, which we talked about at the beginning, where the stool has all these Bacteroidetes and these Firmicutes, whereas the oral community is gonna have more of the Streptococcus and so on. When you look at them in terms of their functional output, they all look much more generic, right? We know that there are differences in what these communities do, but as I was saying, these are the functions that are most often annotated. Every Bacteroidetes is still gonna have to go through cell cycle division. So those functions are gonna be better known. So it kind of gives you this sort of blurred view where everything looks much more similar than maybe it would if we incorporated what the unannotated functions would tell us. But these are certainly, you know, as good as it gets. Some people are trying to call genomes out of metagenomics. I think that if you have hard to culture organisms, this is one of the things that you can do: you can just shotgun metagenomic sequence and then try to bin the reads. This actually was pretty cool, where they're binning them here based on, sorry, that should say tetranucle, oh, it sort of does, tetranucleotide frequencies. And from a metagenomic sample, they're like pulling apart the reads into different genomes. If you can culture the organisms, it's much easier to culture them and match the isolates to the metagenomic reads.
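A small sketch of what a tetranucleotide frequency profile is and how it could separate sequences from different genomes. The sequences are artificial, the distance is plain Euclidean, and the sketch ignores strand (real binning tools typically merge each 4-mer with its reverse complement and combine composition with other signals); all names here are illustrative.

```python
import math
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_profile(seq):
    """Relative frequency of each of the 256 tetranucleotides in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)  # skips 4-mers containing N, etc.
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def euclidean(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy contigs: a and b share a composition, c has a very different one.
contig_a = "ACGT" * 50
contig_b = "ACGT" * 45
contig_c = "GGGCCC" * 30

profile_a = tetranucleotide_profile(contig_a)
profile_b = tetranucleotide_profile(contig_b)
profile_c = tetranucleotide_profile(contig_c)

print(euclidean(profile_a, profile_b) < euclidean(profile_a, profile_c))
```

Sequences whose profiles sit close together are candidates for the same bin, on the assumption that composition is fairly consistent within a genome.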
And I realize I'm not giving you a path forward here, but this is kind of the state of things, you know, if you set up a metagenomic sequencing project, you should be prepared to spend at least a year analyzing your data, at least we do. You know, there is this sort of idea about how to form these linkage groups. Because something that could make it more powerful, and is sort of an intermediate between single reads and having full genomes, is to try to bin your reads into clusters. And the way this has been leveraged, originally in this paper from Carlson, is that if you have multiple samples, you think, well, if these two, you know, reads are from the same genome, then I would expect that they would be found at the same frequency in the different samples. So you form as many contigs as you can, but they are often quite small. And then you cluster those contigs based on their frequency in multiple samples. And that can get you to be able to reconstruct larger metagenomic clusters. And that's kind of where the state of the art is moving. I wanted to still talk about something that is another way that we leverage metagenomic data. And for us that's called strain tracking, where I've been talking about how there is this pan genome. So, you know, my favorite, we talk about Staph epidermidis, how it's 80% core and then the 20% are often these more diverse mechanisms. So Evan Johnson at BU wrote this program called Clinical PathoScope, where, you know, if reads come from the core, then they're gonna map to, you know, every genome. You have to have a set of reference strains. You have to already have sequenced genomes for phylogenetically diverse strains. There can be SNPs that distinguish these. In the pan genome, you're going to have reads that map to some strains, but not others. And then Evan's program, PathoScope, takes both the information from the SNPs and also from the pan genome and will reassign, so that now you would assign all of these reads to strain A.
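The co-abundance idea described above (contigs from the same genome should rise and fall together across samples) can be sketched with a correlation-based grouping. The coverage values, the Pearson threshold, and the greedy "compare to the first contig in each bin" strategy are all assumptions for illustration; published methods are considerably more careful.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length coverage profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coabundance_bins(coverage, min_r=0.95):
    """Greedily bin contigs whose profiles correlate with a bin's first contig."""
    bins = []
    for name, profile in coverage.items():
        for members in bins:
            if pearson(profile, coverage[members[0]]) >= min_r:
                members.append(name)
                break
        else:
            bins.append([name])
    return bins


# Hypothetical mean coverage of four small contigs across four samples.
coverage = {
    "contig1": [10, 40, 5, 20],
    "contig2": [11, 38, 6, 21],
    "contig3": [30, 2, 25, 3],
    "contig4": [29, 3, 27, 2],
}

bins = coabundance_bins(coverage)
print(bins)
```

This only works with multiple samples in which the underlying organisms vary in abundance; with one sample, or with organisms that always co-vary, the profiles carry no signal to separate them.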
You know, that's obviously been done with a lot of simulated data, but also we've done that with our human data, where we then are looking at, you know, from a single individual, if they could have all of these different strains, we look to see, you know, on their body sites, what strains of P. acnes do they have? Or I don't know why these, I have to redo, sorry. These slides are cut off, but they should be full in your handouts. I don't know what I've done wrong to them. But this happens to me every time I present on a PC, and I don't know what it is. So we've used this data to sort of look and see, this is one healthy volunteer and they have different strains, and different individuals have different strains. Well, you can see some of it here. So here for the P. acnes, you can see that individual C will have those brown strains, but individual A is only having these blue and green, and the purples are, you know, between the two. So you can start to use this to then say, you know, what strains are carried by the different individuals, and you may from this see strains that are particularly enriched in a disease state. That's what we're looking for. That's why we're going all the way to the strain level, because it might be that some strains of P. acnes are more associated with the development of acne than the commensal beneficial ones. Strain tracking is also able to be done with read assignment to find the core and the accessory genomes. This is if you don't have reference genomes, but you have many more reads. So, you know, it's two different ways of leveraging the kind of data you're going to get out of metagenomics. So why are strains important? I think it's really to find the accessory genes, to determine whether prebiotics or probiotics can have a lasting effect. Could you get a new strain in? How stable are these strains? And to understand what's happening with diseases.
Really, as my last topic here, I'm going to talk just about, within the context of metagenomics, how some people are now trying to use this where you have a patient who presents with fever of unknown origin, and you want to know in a clinical setting, could you identify the pathogen? So one way of doing that is Clinical PathoScope, which I've just discussed, from Evan Johnson, and there is also SURPI from Charles Chiu, which is going through the same kind of analysis, where you're taking raw sequences and you're saying, what does this match to? I think both of these are very powerful, and I would just caution you that you will get an answer. And, you know, it often is related to just how many times a sequence has been deposited in the database, you know, it will make a match for you. This has been used very successfully by Charles, where he was trying to diagnose a patient who had exactly that, you know, recurrent illness, and they couldn't identify what it was. They used this sequencing to identify Leptospira, which they could then validate in a clinical test, and define the best treatment for this patient.
For all of these studies, I've talked all about the microbial DNA, and I would just caution you that, you know, with the genomic data sharing policy, you also will get human DNA, and you really have to think about, you know, even if your studies involve the microbial DNA, being very careful what you're doing with the human DNA. Especially, you know, if your goal is to sequence a microbial community, you will likely recover human DNA, and you shouldn't just deposit it in the database in an open way without, you know, filtering it out. And because you will recover human DNA, I consent all of my patients, all of our subjects, for whole genome, whole exome sequencing, because I do want them to be aware that even if I am trying to sequence their microbial DNA, I will recover human DNA, and I just think that's something that patients should be aware of. In the last two seconds, I'll just, you know, close, because this is actually a smaller part of my talk than it's ever been before. Where is the sequencing technology now, and where is it going? A lot of the stuff is right now just going on the Illumina MiSeq and the HiSeq. I think the PacBio has a role for us right now for looking at long reads to get these good reference genomes, and you know, is there any new technology on the horizon before I give this talk again in two years? The only one that I'm aware of is this MinION, which you can see is a small handheld device. It's a portable small cell. Could be used for fast diagnosis, like think Ebola. And so that's probably the only new thing on the horizon. And I'll just finish by saying, you know, I sort of talked at the beginning and talk at the end about how sequencing is the start. Really, you're trying to generate a testable hypothesis with the sequencing data. Maybe you're trying to identify a novel pathogen, but then you still have to think about how would I test this, you know, and what do I do with that?
So with the sequencing, what I've really tried to talk about here is coming back to Koch's postulates, where you're trying to assess that, you know, there's maybe a microbe that causes a disease, but with our more nuanced view now, where there's a microbe causing a disease, but it may be causing that disease only in the context of a certain microbial community. So you need to understand what that microbe is, and probably down to the, like, sequence level, you know, because different strains may or may not be able to do that function. And it may or may not be able to do that depending on the context of the microbial community. So, you know, with that, I'll close. We're really trying to understand the role of possible pathogens in the context of a microbial community. So thank you all very much.