 Okay. Thank you, Tierra. And it's a pleasure to be here. And I did get sort of relegated to the last position because I guess you all are thinking that these microbes are other. But I'm going to try to talk to you today. Well, I have no financial disclosures. Welcome to being an NIH employee. You know, sort of why the human microbiome and why do we even consider this part of a genome analysis course? Because the microbes that live in and on our skin and gut and vagina and oral cavity, these microbes have coexisted with humans since time began. And if you think about a lot of even human health, human disease, these microbes are intricately connected with those processes as well. And I don't think I need the lights down quite this much. I mean, if it's normal, but I don't have any fluorescent images. So we could probably take the lights up a little more. Okay. Thanks. Microbes are, you know, if you think about it, what you've talked about is all this diversity among the human genomes. But really, every human cell with the, you know, with the exception of somatic rearrangements and some somatic alterations, really, they have sort of the same protein encoding potential of, you know, transcriptional editing and so on. But they're really about these 25,000 human genes. Whereas the microbes are really diverse and dynamic. And if you have 1,000 microbes living on your skin in a square centimeter, they could all have quite different genomes and genome encoding potential. So humans are host to many microbes. I'm going to define the microbiome as the totality of microbial community DNA and really underscore that microbial cells outnumber the human cells. And most of these you'll find in your gut. But even on the human skin, we estimate that when you do a punch biopsy, you could identify a million bacteria per square centimeter. So these microbes are really the great wild, the unknown, in that many of these microbes have unknown functions. 
So I'm going to talk about how they're both commensals and pathogens, but I'm going to really focus today's talk on sequence analysis. So you've probably heard of some of the pathogens, the Mycobacterium tuberculosis, the Staph aureus that cause infections. And in fact, a lot of human therapies are directed towards really controlling these microbial communities, if you think about why you take antibiotics. So it is this balance, and I do have to say that there is a discussion going on in the public domain of sort of microbes being both good and bad. There certainly are beneficial microbes that are required for vitamin synthesis, digestion, and education and activation of the immune system. And even in systems like the skin where their functions may be less direct, a lot of what the skin bacteria are doing is preventing more pathogenic bacteria from colonizing. So today's talk is going to focus on the analysis methods. But I want to underscore that the reason we would do this is to understand really microbial-host and microbial-microbial interactions. Okay, so this is the way that traditionally we've explored bacterial communities, on these sort of agar plates. The issue here is that the great majority of the bacterial species don't grow in culture and they certainly don't grow in isolation. So that's led to what we call the great plate count anomaly, where what you're going to grow is the lab weeds, the things that are really good at growing on blood agar. But you aren't going to get a full picture of what the microbial community is. And it's going to be hard then to even compare whether the bacterial community has changed when you're going through this bottleneck of plating things in culture. So what we're going to explore in today's lecture is direct sequencing. But before I even get to sort of how you do direct sequencing, I wanted to sort of come back and say, the first experiment that my lab did was to ask, would I actually get different results?
Because investing in and sort of developing all of these sequencing analyses wouldn't really be beneficial if it's just going to give us the same answer. So the first experiment that we did, and others did in similar systems, is to take parallel samples. In this case, we took skin swabs. One of the swabs we put into DNA lysis buffer and just sequenced the community directly. The other one we grew on chocolate agar and blood agar and then identified each individual isolate. You can do it either by MALDI-TOF or you could do it by sequencing. And these are the kinds of results that we got. So the alar crease is the side of the nose and the umbilicus is the belly button. So what you can see is that these are relative abundance plots. I'm going to show you more of them. I'm going to go into more detail on them and how they were generated. But I just want to at least start by saying that you do get different information. If you see, the cultures will favor bacteria that we know how to grow, Propionibacterium and Staphylococcus. But in fact, when you look at the survey data on the left for the two sites, you can see that there's also this preponderance of these Actinobacteria, in particular the Corynebacterium in the cornflower blue color. Corynebacterium are easily overtaken by Staphylococcus. They will often take five days to create a colony or you have to supplement the media with additional lipids. But if you use just standard culture techniques, as you can see, especially here from the umbilicus, we would estimate by sequencing that the proportion of Corynebacterium is nearly 50%. We can culture a few of them, but I'm going to vastly overestimate how much this site is dominated by the Staphylococcus and other Firmicutes and really sort of miss that there is this preponderance of Corynebacterium here. Okay. So topics for today's talk, now that I've sort of introduced why it is that I'm even going to make this investment in sequencing and analysis.
I'm going to have five topics. They're not going to all get equal amounts of time. I'm going to spend the most amount of time talking about bacterial diversity studies, which are really going to set up the intellectual space in which we do these sort of microbial diversity studies. Here, we're going to focus on the 16S gene of bacteria as a genetic fingerprint. Then I'm going to talk about how we used a similar method, but had to develop a slightly altered pipeline, to look at fungal diversity. Then I'm going to talk about bacterial genomes and how we look at sequences of the full genome. Then we're going to come and combine all of this and talk about metagenomics, which is just shotgun sequencing of the entire microbial community. And then in one slide, I'm just going to talk about where this technology is going. And I hope I remember to talk about also where NIH is going with this. Okay. So bacterial, fungal, microbial diversity studies, they're based on these core marker genes. And the core marker genes are typically part of the ribosome. Now you may think about the ribosome as where proteins are being made, but it's actually this combination of proteins and ribosomal RNAs. So all of this orange here in the bacterial ribosome is the rRNA. The 16S rRNA gene is here. And this is highly transcribed, oftentimes with multiple copies in a bacterial cell. And it has a structure, sort of a secondary structure, because of how it's integrated and forms the ribosome. But this is also a core marker where you can use the sequence, and you can see that all bacteria have a 16S, all eukaryotes have 18S. I mean, if you've run a Northern, you know they have an 18S, a 5S, and a 23S. But these ribosomal genes can be used to distinguish that these are the eukaryotes, these are the bacteria, and these are the archaea. And these core marker genes can help distinguish even at the kingdom level.
We're going to exploit the ribosomal RNA genes to now do phylogenetic classifications within the kingdoms. So this is the bacterial 16S rRNA gene. And it's drawn out here where you can see the secondary structure of the rRNA. And the great thing about the 16S rRNA gene is that you can see it's this mixture of stems and loops. So the stems obviously have greater constraint on their evolution, because you can't change one base without changing the opposing base. And that constrains those sequences more than the loops. So what you end up with is regions that you see in the boxes, those are more variable. And then there are regions that are more constrained. So on the right, you can see it across here where the 16S gene is 1,600 base pairs long, that's how it gets its name. And you can see here, now this is the mean information entropy. OK, this just means whether it's conserved or not. So in shorthand, I would take this to mean that only 20% of the time is this region different between bacterial species. But you can see these regions like the V2, V3, V6, these have a much greater amount of diversity between different bacterial species. So you can set the primers for amplifying the 16S gene from a mixed community in the conserved regions and amplify across the variable regions and then use the variable regions to identify what type of bacterial genera and species you have. And this is what's done in clinical micro labs and it's also what's done in research labs. So the workflow looks like this. You take a sample, which can be a stool sample, a skin swab, a teeth picking, and you straight away harvest the genomic DNA. You'll amplify the 16S gene and you'll get all of these. They'll all amplify with the same conserved primers, but they'll have different sequences in the intervening part so that you can use those sequences then to assign that this is a Firmicute and it's a Staphylococcus and it's a Staphylococcus aureus.
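As a rough illustration of what an information entropy plot measures, the sketch below computes Shannon entropy for each column of a tiny alignment. The sequences are made up for the example and are not real 16S data, and the function names are hypothetical; conserved columns score 0 and fully variable columns score 2 bits.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy, in bits, of the bases in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def entropy_profile(aligned_seqs):
    """Entropy at each position of equal-length, already aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_seqs)]

# Toy alignment (invented): positions 0-3 are conserved, positions 4-5 vary.
toy_alignment = [
    "ACGTAC",
    "ACGTGA",
    "ACGTTT",
    "ACGTCG",
]

profile = entropy_profile(toy_alignment)
for position, bits in enumerate(profile):
    label = "conserved" if bits == 0 else "variable"
    print(position, round(bits, 2), label)
```

In this picture, primers would be placed over runs of low-entropy positions, and the high-entropy stretch between them is what distinguishes one taxon from another.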
OK, so I realize that this is a busy slide, but I had to put this in there because when you look at the handouts, these are the questions that I ask if someone comes to me and says, I'd like to look at the microbial diversity. And I say, OK, here are my nine questions. Study design: really define what it is that you're trying to answer. Are you trying to say, is there a difference between these wild type mice and some JAS mutants? Well, if you want to know the difference between, that's the simplest example, wild type mice and mutants. If that's the simplest example, then I'm already going to start to have questions for you. There is variation between mice. So you have to think about study design. How many mice do you need to power this study? Or how many humans do you need to power this study? And oftentimes that's a really hard question for people to answer because they don't know what size effect they're going to be looking for. Maybe the bacteria is going to go from never present to present at 30%. In which case, you could do that study with three mice, three humans, and three controls. But that's the first question, and really the hardest one: what study design are you going to use? If you want to compare wild type and knockout mice, I prefer that the mice are littermates. That way you're at least controlling for differences between facilities, which exist. There are differences between cages. There are differences between rooms. If you bring in your knockout and wild type mice from Taconic, I can't tell you anything about the history of these mice. And it could be that they would just a priori have different microbial communities because at Taconic, they were raised in two different rooms. That wasn't the intention of these mice for doing this experiment. OK, so we're going to spend a lot of time talking about study design. Then we're going to talk about what sequencing.
I mean, so that would be the sort of, I mean, that question I'm not going to address here so much. You'll see some examples of why I'm so concerned about study design. I'm going to really focus on what sequencing platform will you use? What region of the 16S gene will you amplify? How many reads do you need? What are the hidden technical issues? What analysis tool will you use? How will you display your information? How will you compare your results with other published studies? What information will yield a testable hypothesis? I mean, I would have to say that many people come to me and they say, I just want to know if there's a difference. OK, and I really have to say formally think about these questions. OK, one of the ways that bacteria can change is they can change in biomass. You can see a difference where it used to be that there were 10,000 bacteria per square centimeter on the skin. I can tell you that when we look at diabetic mice, their bacterial load is 40-fold higher. So one of the things that I do think is an easy first thing that you want to think about is from your genomic DNA has the bacterial biomass changed. This is a standard QPCR, where we set primers in the conserved region, and we look to see how many counts does it take to cross the threshold. I'm not going to go through that. That's a fairly standard thing, but it also is possible to do that for bacterial load. How many bacteria are here? The difficult thing is what do you normalize to? I mean, maybe you're going to normalize to grams of stool. Fine, that's great, but you have to think about what are you going to normalize to? For us with skin, it's a little harder because what do we normalize to, the square centimeters? But sometimes it could differ depending on if you swab really hard versus if you swab lightly. But if you have a solid sample like stool and you can measure against grams of stool, great. OK, DNA sequencing to assess bacterial diversity. 
This is what I'm going to spend the most time on. I would say that two years ago when I gave this talk, actually I know because I looked through my slides, there was a large amount of time that I spent talking about what platform you should be using for sequencing. And there were several options two years ago. By now, there's no option. The Illumina MiSeq is the dominant platform on the market at this point. And I'll sort of go through for you basically why everyone else has fallen away. So Illumina will now give you 300 base pair reads that are paired. So what I'm showing you here is that you can amplify different regions of the 16S gene. But I've given one example here where base pair 1 is the first base pair of the gene. So 8 means that you're starting at base pair 8, and with 505 reverse you get this 500 base pair amplification, which would give you the V1 to V3 regions. So the variable regions 1, 2, and 3, the 5 prime end of the gene. And it's more or less that you're going to get 300 base pairs in this direction, 300 base pairs in this direction. In fact, you'll really only get 250 because you have to remember that you're going to be sequencing through the primers. But you get 50 base pairs of overlap and you get sort of almost a 500 base pair region. I've given you some basic stats. It's a three-day run on the MiSeq. I'll tell you later the MiSeq costs about $100,000. You get two runs on one instrument. Each run will cost $2,000. Which, if you multiplex 500 samples, means you're really only paying $4 a sample. So scale is the issue. If you're going to put 500 samples on, you need to be able to identify them all because they're actually all going to be the same. So I have given you some references down here. It's a pretty complicated strategy of how to build these primers. We end up using dual index barcoding. So there's a barcode on this primer and a barcode on this primer.
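The multiplexing arithmetic can be sketched in a few lines. This is only an illustration: the run cost and sample count are the figures quoted in the talk, 24 barcodes per primer is the number the talk goes on to use, and the barcode labels and sample names are hypothetical placeholders (real barcodes are short DNA sequences).

```python
from itertools import product

# Figures quoted in the talk.
run_cost_usd = 2000
samples_per_run = 500
cost_per_sample = run_cost_usd / samples_per_run  # dollars per sample

# Hypothetical barcode labels standing in for real index sequences.
forward_barcodes = [f"F{i:02d}" for i in range(1, 25)]
reverse_barcodes = [f"R{i:02d}" for i in range(1, 25)]

# Every forward/reverse combination is a distinct dual index.
barcode_pairs = list(product(forward_barcodes, reverse_barcodes))

# Give each sample its own unique pair (sample names are made up).
sample_to_pair = {
    f"sample_{n:03d}": pair
    for n, pair in enumerate(barcode_pairs[:samples_per_run], start=1)
}

print(f"cost per sample: ${cost_per_sample:.2f}")
print(f"available barcode pairs: {len(barcode_pairs)}")
print(f"samples assigned: {len(sample_to_pair)}")
```

With 24 barcodes on each side there are 24 x 24 = 576 distinct pairs, which is enough to label 500 samples on one run.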
If you have barcodes on both primers, then you can develop 24 possibilities on the left primer, or on the forward primer, and 24 possibilities on the right primer. And that actually gives you a huge amount of space where you have 24 and 24. And you have really an enormous number. I mean, say it's 20. So you'd have 400 possibilities of how these two primers could be put together. And that gives you enough diversity that you can uniquely pick how to put barcodes on both the primers and get 500 samples that you could multiplex together. So for a small study, sequencing is limiting. For a large study, bioinformatics is limiting. And I guess even at this point here, I would say that what we're really talking about, and I'm sorry, I don't have any slides on this because I hadn't really intended to dwell on this, but I've come away from preparing for this talk thinking that we need an NIH solution. If people want to do microbiome sequencing, there has to be some way in which people bring their samples together and there is a core facility that multiplexes the 500 samples and puts them on an instrument. If you have 20 samples and I have 20 samples and someone else has 20 samples, it needs to go someplace central where people could access this technology, because we've now gotten to the point where it really doesn't make sense for you to sequence 50 samples and set up this whole infrastructure just for sequencing 50 samples. Okay, and I guess I should say more about that at the end if I have time. The reason that two years ago I was more conflicted and now I'm not is that for the last few years, we've been using 454 pyrosequencing, which gave us 500 base pair reads. They were sort of in that sweet spot of longer reads. Illumina was still at 75 base pairs, 100 base pairs, but Roche is no longer supporting the sequencing platform. So as of July, this instrument is going offline and there will no longer be any support.
So the other thing that people have talked to me about, which I have no experience using, is the PhyloChip, which is kind of like a 16S microarray. It's limited to known taxa. You can get species level designation. It's more expensive. You'll never find unique or novel sequences and it basically locks you in to this platform. So I haven't explored it because it's not suitable to my needs, but I could imagine that computationally it might be easier for your lab to use that if you had six samples and you wanted an answer, and you had experience analyzing microarray data. The Illumina HiSeq: if you wanna do production sequencing you definitely should use the Illumina HiSeq. We do all of our metagenomics on it. But you have to be ready for the scale of this data. The instrument runs for 10 days and produces four billion clusters. So this is what people are using for multiplexing and using for doing whole exome sequencing of the human genome or whole genome sequencing. But remember these bacterial genomes are 1,000 times smaller than a human genome. Okay, so those are my thoughts on sequencing technology. I'll come back to sort of where the future is, but let's say now that you have your sequences. Well, you're probably traditionally used to sort of putting them into BLAST. Well, BLAST is not gonna work for you, because you're gonna BLAST your sequence and unfortunately what you're gonna come up with is that your sequence matches a lot of other people who have done bacterial sequence studies, and it's gonna match a lot of uncultured 16S sequences. So I would have to say that one of the great things about the Human Microbiome Project is that we've generated a lot of sequence, and one of the worst things that we've done is generate a lot of sequence. You cannot just put your sequence in and BLAST it and get a unique identification that this is truly a Staph epidermidis.
It will match so many uncultured sequences that my lab has produced that you really need something else, and you're gonna have hundreds, thousands of sequences. So the first thing that you're gonna need to do is think about how you're going to classify your sequences. So the 16S gene, as I said, is a fixed alignment and we have these programs where you're gonna align each of your sequences to a known reference set of bacterial sequences, sort of what you would think of as type strains, where these are really highly curated and accurate genomic sequences. So you access it either through the Ribosomal Database Project, SILVA, or Greengenes, and I will give you in a minute sort of how you access these databases. They've all been sort of wrapped into pipelines. So here I'm just showing, this is an old slide, but you can see that here if I'm talking about a 100 base pair sequence or 200 base pair sequence, 400 base pair sequence, even if I had 100 base pairs, I should be able to get it to the order level, the family and the genus level. I would say easily, for us, with 400 base pairs, we can basically always get to the genus level because the microbes found in a human are even better defined than what you would find out in the soil and the water. So these are numbers based on environmental sampling. With a few hundred base pairs of sequence, we should pretty much be able to identify to the genus level. If you wanna get to the species, you have to have special considerations, and I think actually getting to the species level is worthwhile, especially if you're trying to use this data to talk to anyone who has a clinical question. For us, the difference between Staph aureus and Staph epidermidis when we're looking at clinical samples from kids with eczema is really important.
So we always are sure to sequence the V1 to V3 region, so the five prime end of the gene, because that allows us to determine whether it's Staph aureus or Staph epidermidis. But you have to think about that in advance, because if we looked at the V4 to V6 region, the middle part of the gene, the Staph aureus and Staph epidermidis are basically identical in that mid portion of the gene. So if we had sequenced that part of it, we would not have the resolution to distinguish between species. Lactobacillus, similarly. But you have to think in advance, what is the species that I might really wanna study? So if you come back and you think that you have a lot of bacterial sequences that don't match a family or don't match an order or even don't match a genus, you might have to consider some other explanations. And in three slides, I'll get to that. So here's how you would access RDP. There's a good tutorial. Most of these microbiome tools now actually have really good tutorials associated with them. The SILVA database also is really very robust and solid. And now the thing I was saying about how you may find sequences that aren't assigned to a genus, this is what I would worry about. And when you're going through these pipelines, chimera checking is gonna be built into them. And so what I mean by that is that, sort of in a joking way, combine a chicken with a rabbit and you get this, I don't know, chappet. Okay, how would that happen? Because that's a huge problem with microbiome sequencing that we all struggled with a few years ago. Okay, here's what happens. You're going through 30 rounds of amplification or 23 rounds of amplification with these conserved primers from the 16S gene and you're amplifying the five prime region from V1 to V3. So you start amplifying and then that cycle is over. So you haven't extended all the way. You've run into some little secondary structure or something and it drops off.
What happens is that then when you start the next round of amplification, you're in one of the conserved regions, and these 30 base pairs of conserved region are enough that you'd actually pick up on another template and create a chimeric template where now instead of having a Staphylococcus sequence, you're having a Staphylococcus mixed with a Streptococcus. So the way that you look for chimeras is that within ChimeraSlayer, you are BLASTing this end of the sequence and saying, tell me what this matches, and you're BLASTing this other end of the sequence and saying, tell me what this matches. And it will be returned as a chimera if these two BLASTs don't match the same sequences. Okay, so how do we even validate this chimera detection? Because a lot of people will say, okay, so do the extension for longer, or maybe it'll be better when we aren't doing full length 16S sequencing that's 1.6 kb. Maybe if we did shorter regions, we would get away from this chimera problem. And I guess I would come out in the end and say you need to deal with this problem computationally. If you come up with some brilliant way to deal with this experimentally, that's great, but we deal with it computationally. The way in which we analyze this is we generated a mock community. And I think this is also a really important control. This would be another reason why I would say that these kinds of experiments need to be centralized. We generated a mock community, which is 20 bacteria that are mixed together, and I'm sorry, by we I mean the Human Microbiome Project, mixed 20 bacteria together in known amounts. And then we PCR amplified that mock community. So you'll see here what the bacteria are that were mixed together in the community. And if it's yellow, it means it was underrepresented. If it's blue, it means it was overrepresented, and especially when you see these pluses, it's overrepresented.
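The split-and-compare idea behind the chimera check described above can be sketched as follows. This is a toy illustration and not ChimeraSlayer's actual algorithm: it assumes the query and references are already aligned and the same length, it uses simple position-by-position identity instead of BLAST, and the reference names and sequences are invented.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_match(fragment, start, references):
    """Name of the reference whose same-position slice best matches the fragment."""
    end = start + len(fragment)
    return max(references, key=lambda name: identity(fragment, references[name][start:end]))

def looks_chimeric(query, references):
    """Compare each half of the query to the references separately.

    Returns (flag, left_hit, right_hit); flag is True when the two halves
    are best explained by different reference sequences.
    """
    mid = len(query) // 2
    left_hit = best_match(query[:mid], 0, references)
    right_hit = best_match(query[mid:], mid, references)
    return left_hit != right_hit, left_hit, right_hit

# Invented toy references, not real 16S sequences.
references = {
    "Staphylococcus_toy": "AAAAAAAAAACCCCCCCCCC",
    "Streptococcus_toy": "GGGGGGGGGGTTTTTTTTTT",
}

clean_read = "AAAAAAAAAACCCCCCCCCC"    # matches one reference end to end
chimeric_read = "AAAAAAAAAATTTTTTTTTT"  # 5' half from one parent, 3' half from the other

print(looks_chimeric(clean_read, references))
print(looks_chimeric(chimeric_read, references))
```

A real tool has to cope with breakpoints anywhere along the read and with parents that are closely related, which is why this is handled inside the analysis pipelines.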
Now, this was done a few years ago when we were still exploring Sanger sequencing versus 454, although the results are fairly similar even with the Illumina platform. What you can see is that although I've told you that these are primers that are in the conserved regions, they don't actually amplify all of the bacteria similarly. So you can see the Methanobrevibacter are underrepresented. Some species, some genera like the Pseudomonas or the Clostridium, in these primer-specific ways are overrepresented. But the other thing that came out of this is we put 20 known bacteria into these sequences. Okay, so yes, I know you're looking at this and you're sort of saying, how can I even use this data at all? I mean, this is an ascertainment bias. And it means that you cannot go and say, I'm gonna sequence V1 to V3, and then compare that directly with V6 to V9, and say that there is a difference between these two samples. It does need to be standardized, the platform, the primers. Okay, but the other thing that came out of it is that when we put this through programs without looking for chimeras, what we would see is, this is the number of sequences that we thought we were finding in our samples. And it was typically between 100 and 200, or in some platforms even higher. In some it was slightly lower, but really what was happening here is that we were creating chimeras. And so this would be a chimera between a Propionibacterium and a Staphylococcus, and then it would look like a new sequence. So these chimeras need to be recognized and pulled out of the analysis to bring you down so that you actually can then, from a mixed community of 20 bacteria, identify 20 bacteria. Your analysis will be misleading if you don't consider chimeras. Okay, with all of those technical concerns, this is an example of how we would display the data from this kind of genus-specific look at the data. The Human Microbiome Project looked at these different body sites.
And what you can see is that the body sites have different bacterial communities. So here in the vagina, this red that you see is the Lactobacillus, which is really more prevalent in the vagina than it is on the other body sites. You can see the oral cavities share similarity. The skin sites share similarity, and the gut has more Bacteroidetes than you see elsewhere, but you can also see them in other sites. So this is the kind of way that we could display data based on classifying the sequences to the phylum and genus level. This would be another way that we take that same data and look at it. This is the box and whisker plot, where again, you're looking at what the bacterial phyla and then genera are. And I'm showing you the same data and now saying that if I look in the gut, you're seeing these Bacteroidetes and these Firmicutes, whereas in the skin, you're seeing more Actinobacteria and you're seeing these Proteobacteria. Also, if you see Firmicutes, they're more likely to be of a different genus than they would be in the gut, a different genus in the vagina. So these are the kinds of ways that you can show it. Those circles are really good in terms of kind of giving a visual display, but they're my least favorite way because they can be slightly misleading when you really wanna talk about relative abundance. This is some data from my own lab, where I'm just trying to show you now how I would display this if I was looking at two samples. And so here, I'm looking at the skin microbiome of kids before they've gone through puberty, so their ages are down below, but they were all Tanner staged, and kids who have gone through puberty. And what you can see, I think, is that you look at these RDP plots and you can see that there's very different bacterial communities in, this is actually the nares, so inside the nose, from kids before they've gone through puberty than after they've gone through puberty.
And we can then display this, and what you're seeing is that in the postpubescent, so Tanner stage five, which is the red, there's an increase in Corynebacterium and Propionibacterium, so the bacteria that live on lipids (kids as they go through puberty will become more oily), and a decrease in these other Gammaproteobacteria, Betaproteobacteria, and so on. So this is the kind of way that I can show you this and these are the ways in which you would use this type of data. So those sorts of methods are often enough for people, that they look at that and they think, all I really want to know is, is there a difference? But the next step really does take on some more computational challenges. So then if you go beyond just what is there and you start to say, how have these communities really changed? You need some more. I mean, you could look at those RDP plots and say, as I showed you, the staph has gone up, the Corynebacterium has gone down. But really what you're gonna do is, you're gonna look at your DNA sequences and they're gonna sort of all cluster where they're gonna be a certain distance away from each other, and you have to calculate the pairwise distances between each of these sequences, and then we're gonna draw clusters, which mean that every sequence in here is at least 97% identical to every other sequence. This is our way of computationally sort of assigning things to what is like a species, that they have to all be at most 3% different from each other. That's not the equivalent of species, but we're gonna use that definition as sort of, here's how we think about bacterial sequences that should be very similar to each other. So sometimes it's gonna be really clear that these all belong to the same cluster and they're all 97% identical. There are gonna be times when I'm gonna have to rip things apart because this sequence is 3% different from this sequence which is 3% different than this sequence.
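Here is a minimal sketch of clustering under the rule that every member of a cluster must be within 3% of every other member (often called furthest neighbor or complete linkage clustering). It assumes equal-length, pre-aligned sequences and uses a naive quadratic loop; the function names and toy sequences are hypothetical, and real pipelines use far more efficient implementations.

```python
def distance(a, b):
    """Fraction of positions that differ between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def cluster_otus(seqs, cutoff=0.03):
    """Merge clusters while the farthest pair across a merge stays within the cutoff."""
    clusters = [[s] for s in seqs]

    def linkage(c1, c2):
        # Complete linkage: the distance between the two most different members.
        return max(distance(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cutoff:
            break  # no remaining merge keeps every pair within the cutoff
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy 100-base sequences (invented): s1 and s2 differ by 3%, s2 and s3 differ
# by 3%, but s1 and s3 differ by 6%.
s1 = "A" * 100
s2 = "CCC" + "A" * 97
s3 = "CCC" + "A" * 94 + "GGG"

otus = cluster_otus([s1, s2, s3])
print(len(otus), [len(c) for c in otus])
```

In this toy chain the first two sequences merge, and the third is split off into its own cluster because it is 6% away from the first, which is the "rip things apart" case.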
I could use the nearest neighbor joining method and kind of make that into a giant blob, but we've sort of standardized that we have to have everything within the 3%, or it could be the 1%, and that this is how we're gonna get sort of a standard resolution. And these would all be identified, for example, if these were all Staphylococcus, this would be the Staphylococcus, maybe capitis and hominis and epidermidis and aureus, and that's why they're all so close, or it could be ripping apart two sequences that are both Staph epidermidis because they are just that different from each other computationally. These are computational definitions that we've tried to match with typical microbiologic data as closely as possible. So the pipeline tools that you would use, actually all of RDP and SILVA could be implemented in these, but now really you can't go further with this OTU assignment and everything without going into some sort of computational pipeline. We use mothur, QIIME, CloVR, LEfSe. They all have very similar needs or similar outputs, but some of them may be more visually appealing to you. Okay, so here's the kind of analysis that I'm gonna do. I'm gonna talk about it in terms of fruit salad. So let's say that I have two fruit salads and they each have 100 pieces of fruit in them, but one of them has 60 apples, 34 oranges, two bananas, two pears, two grapes. If I'm gonna say, have I made the same fruit salad twice, right? This is really what I'm looking at when I'm talking about, are these two bacterial communities the same, and I've got 100 sequences in each one. If you looked at it and you just said, if I pick a type of fruit out of group A, will I find it in group B? Two times out of five, I would say yes. So if I pick an orange, I will find it. If I pick an apple, I'll find it, but if I pick a banana out of this fruit salad, I will not find it in this other fruit salad.
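The fruit salad comparison can be worked through in a few lines. Only salad A's composition is given in the talk; salad B's composition below is an assumption made up for the example, chosen so that it shares apples and oranges with A and nothing else. The two measures are the simple one-directional versions used in the talk, not the formal indices that analysis pipelines report.

```python
# Salad A as described in the talk.
salad_a = {"apple": 60, "orange": 34, "banana": 2, "pear": 2, "grape": 2}

# Salad B is hypothetical: 100 pieces, sharing only apples and oranges with A.
salad_b = {"apple": 50, "orange": 40, "kiwi": 4, "plum": 3, "cherry": 3}

shared_types = set(salad_a) & set(salad_b)

# Membership: of the kinds of fruit in A, what fraction also occur in B?
membership = len(shared_types) / len(salad_a)

# Structure: of the pieces of fruit in A, what fraction are of a kind found in B?
structure = sum(salad_a[fruit] for fruit in shared_types) / sum(salad_a.values())

print(f"membership: {membership:.0%}")
print(f"structure: {structure:.0%}")
```

Membership treats a rare fruit and a dominant fruit equally, while structure is weighted by abundance, so the same pair of salads can look quite different under the two views.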
So if I looked at community membership, I would say they're only 40% identical. If I looked at community structure, which is I take every piece of fruit out of this and I say, every time I take a piece of fruit out of group A, have I found it in group B? Well then 94% of the time, I would pick a fruit out of A and find it in group B, and I would say that these two fruit salads were very similar. And I use this as an example because really here, there's two different ways you can do the analysis. You can either focus on the dominant members of the community or you can focus on the rare members in the community, and I cannot tell you a priori which is the more biologically relevant. If I'm talking about a kid who has severe atopic dermatitis, what I wanna know about is, is there Staph aureus on that kid's skin? I wanna know, could this bacteria bloom and cause an infection? And then I would really care about, tell me everything that's here, and focus on the rare bacterial species. But if I'm looking at some sort of context of dysbiosis where I'm wondering, is there community stability here and is there a diverse community? Then maybe I care more about the concept of community structure. So I think it's important to look at both of these. So community membership here. Now I'm giving an example. I'm sorry, I left this off. This is from a very early Jeff Gordon paper where what you're looking at is what are the bacterial communities on animals, obese or non-obese mice? And what you can see is that at the community membership level, the pups are most like their mother. This is one litter of mice from this mother. This is another litter of mice from this mother. And these two mothers are sisters. Now here's another litter of mice from a different mother. And at the level of community membership, they're gonna look most like their mothers, which means that bacteria are inherited from their mothers. So I haven't said community structure.
They would not be community structure. They would segregate by their phenotype. But community membership, they would segregate by who is their mother, because there will be rare species inherited directly from the mother, shared by the pups, that would not be present in mice born to other mothers. But as just a simple example, if we look at knockout mice compared to wild type mice, they're gonna segregate based on what is the phenotype or the genotype. Okay, have I sequenced enough? This used to be sort of an issue for us because with Sanger sequencing, it was quite expensive to generate 100 sequences. But now basically on an Illumina or an old 454 instrument, I would say you need a ballpark of 1,000 sequences for a first pass analysis. And most human sites will have started to level off by that. If you wanna look at it, you can use Chao1 rarefaction curves, which again are built into the pipeline. What you can see is that here for the bacteria of the toe web space, it's a very limited community. There are only four different OTUs. We've probably captured them all by the time we've sequenced 300, 400. But the umbilicus, or the belly button, is a much more complex community and we'd have to sequence 1,000. But most of the time, you're gonna get to that now. There's a lot of ecological measures that you can use. These each have a meaning. Richness is how many OTUs do you find, and that can vary across different body sites. Evenness is how evenly distributed are these sequences? And the diversity sort of accounts for richness and evenness. If you're gonna talk about richness, like how many different bacterial species do I find in this sample, it's really important to subsample your data. You can't be looking at 1,000 sequences from one patient and 10,000 sequences from another patient. Because obviously the 10,000 sequences gives you the opportunity to have more species.
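As a rough sketch of the subsampling point, the snippet below rarefies every sample to the same depth before counting OTUs, and also computes a Chao1 richness estimate. The pipelines mentioned in the talk have their own implementations; the function names, the seed, and the toy data here are assumptions for illustration. The Chao1 function uses the bias-corrected form, observed richness plus F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs seen exactly once and exactly twice.

```python
import random
from collections import Counter


def subsample(otu_labels: list, depth: int, seed: int = 0) -> list:
    """Randomly draw `depth` sequences (without replacement) so samples are comparable."""
    if depth > len(otu_labels):
        raise ValueError("sample has fewer sequences than the requested depth")
    return random.Random(seed).sample(otu_labels, depth)


def richness(otu_labels: list) -> int:
    """Number of distinct OTUs observed."""
    return len(set(otu_labels))


def chao1(otu_labels: list) -> float:
    """Bias-corrected Chao1 estimate of total richness from singleton/doubleton counts."""
    counts = Counter(otu_labels)
    f1 = sum(1 for n in counts.values() if n == 1)  # OTUs seen exactly once
    f2 = sum(1 for n in counts.values() if n == 2)  # OTUs seen exactly twice
    return len(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy data: one OTU label per sequence. The deep sample has 10,000 sequences,
# the shallow one 1,000; both are compared at a common depth of 1,000.
deep_sample = ["otu%d" % (i % 50) for i in range(10_000)]
shallow_sample = ["otu%d" % (i % 20) for i in range(1_000)]

common_depth = 1_000
deep_rarefied = subsample(deep_sample, common_depth)
shallow_rarefied = subsample(shallow_sample, common_depth)
richness_at_common_depth = (richness(deep_rarefied), richness(shallow_rarefied))
```

Repeating the subsampling at increasing depths and plotting richness against depth gives the kind of curve that levels off once the community has been captured.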
So again, you have to quantitatively think about what are the assumptions going into the analysis. In particular, I think that this reference tries to really go through study design issues. But there's nothing like experience. Okay, I know I've spent half my time talking about 16S sequencing, but I think that's sort of the heart of really probably what I thought this audience might be interested in trying to do on their own. I'm gonna go more quickly now through fungal diversity and bacterial genome sequencing. So this, I can say, is pretty much the same framework. Like you did the 16S amplification, now you're gonna amplify the ITS1 region. In the ribosomal genes of eukaryotes, you have the 18S, the 5.8S, and the 28S. And the ITS is the internal transcribed spacer, and ITS1 is between the 18S and the 5.8S. So you can sink one primer into the 5.8S and the other primer into the 18S and amplify this region. That's what's used in clinical micro labs for fungal identification, and that's what we're gonna use for fungal identification. These databases that I talked about, like RDP, did not exist. So my lab created an ITS1 database that's now available through mothur and through QIIME so you wouldn't have to do this again. We created this database, really Keisha Findley and Joy Yang did, by mining GenBank, resolving taxonomy and reducing redundancy. So bacteria and fungi, they're really different beasts, and again you come back to, are we gonna find anything different? Well, there are famous examples of bacterial-fungal interactions, and certainly many of the antibiotics were originally identified from fungal species because there is this interplay of bacterial and fungal communities. So just as one example of why we would do fungal analysis, when we looked at the human skin bacterial survey, we saw a tremendous amount of variation.
We thought it was physiologically determined, where the blue are the oily sites and they have a predominance of the Propionibacterium, that light green, whereas the moist sites and the dry sites had different bacterial communities on them. This is the fungal survey that we did of the human skin, and what you can see is that it's actually not as diverse as the bacteria. In fact, we see a predominance of Malassezia. There are different species of Malassezia, but you can see that in fungi, all the action is really in the feet in healthy volunteers, which may help you understand why we see toenail infections, athlete's foot and other foot fungal involvement, that these are the sites that have the greatest fungal complexity. But the bacteria and the fungi are really associating differently with the human body. So richness, the number of bacterial or fungal genera that we see at core body sites, is fairly limited for both of them. The arms are really where the bacterial diversity is and the feet are where you're seeing the fungal diversity. So unfortunately, of course, I got the slide from someone in my lab who made it into a movie, but you can see here that if I'm looking at the bacterial communities, they're really separating dry versus moist versus oily, whereas the fungi are really separating based on arm, torso, head and foot. So there's two different ways in which the human body has now arranged these bacterial and fungal communities, but the analysis tools would be very similar. Okay, so now I'm gonna do another sort of deep dive into bacterial genome sequencing. And again, I come up with, what are really the questions that you wanna ask? What are the study objectives? When we sequence bacterial genomes, why are we doing this?
Are we trying to find out if two hospital isolates are the same, or are we trying to determine whether maybe the staph that are present, the Staph epidermidis that are present on human skin, are quite different than the Staph epidermidis that are on indwelling medical devices? What is the question? The first issue is gonna be, do reference genomes exist? If you had a reference genome that's high quality, you can sort of assemble your DNA sequences using that as a reference. If not, there are certainly methods for doing a de novo analysis, but it will make it more complicated. Again, what sequencing platform are you gonna use? What depth of sequencing? What assembly tools will you use? What alignment tool? How will you display your data? How will you compare your results with other published studies? And what information will yield a testable hypothesis? Sounds like the exact same things as what I talked about before, but again, you get back an enormous amount of data from these kinds of studies, so you do have to think about what really are you trying to answer? So I'm gonna sound a little bit like a broken record here or maybe that I'm doing. How do we sequence a bacterial genome? We have really, at this point, standardized, again, where if I were to give you sort of a standard advice, we use the Illumina MiSeq, which again is creating these 300 base pair sequences. But with the 300 base pair sequences, if you sequence them at depth and you're just randomly starting these sequences at different places, you're going to cover where you know that you've got this sequence and then you've got this, and you can kind of carry yourself all the way across into creating these contigs. Contig is the sort of shorthand term for a contiguous piece of DNA that goes from one end to the other. And so for a gram negative of six million base pairs, you'd end up with less than 100 pieces that were each about 100 kilobases.
You can do lighter sequencing coverage than this, but I'm not going to really get into that. So there's a lot of different ways that you can get your data from a sequencing machine into an assembled genome. Again, this is something I would probably leave for a core sequencing facility that has some ability to QC their data and to assemble it. There are different methods, and even NCBI is building a website where you can compare your results with these different methods. Basically, by now, the sequencing data is pretty good. So any of these would give you results that you could reliably trust. Okay, I mean, just as one example, and this is what I'm saying, this is Velvet, which is kind of a workhorse assembler. This is what they would be looking at, and they'd be having all these sequences. They're much longer than this, but I'm giving you this as an example. And you'd go through and you'd sort of hash it out. You'd have these linear stretches, but you'd have regions where there was a bubble because you had two sequences that maybe had a heterozygous base pair. You'd try to resolve, is this a sequencing error or is this maybe something that actually needs further attention? But you simplify these linear structures, you remove the errors, and you return an aligned sequence. We don't wanna discard this information that there may be a base pair difference, and actually most of these alignment tools will return that data to you, and you should be able to look at that, because more and more what I'm looking at is genetic heterogeneity, where in a patient sample there may be a little bit of underlying heterogeneity. In particular, if the patient is long-term colonized, they may be colonized with multiple strains simultaneously. Not strains that you would see as a difference if you looked at them with culture, but changes that have occurred just through evolution as single base pair changes. So assemblies, the kind of words that we use, we talk about coverage.
So that's how deeply has the genome been covered? So usually if I say it's a six million base pair genome and I've given you back 60 million base pairs of sequence I would say that's 10x coverage. But in fact, you need to look at this and say, here are core genes that I should find in every genome. Have I found these genes, a set of 100 genes that you would expect to find in every genome? There are assembly parameters because you wouldn't want it to be that one region has been sequenced very deeply and another region maybe was not represented. That's less of an issue than it used to be in the olden days. When we used to clone before we sequenced, there were large regions of bacterial genomes that were not cloneable and they would create these gaps where we wouldn't find promoters of genes and so on because you couldn't put a promoter into a cloning vector. Those kinds of regions are no longer an issue because we're just straight shotgun sequencing. The bias is tremendously reduced from what it used to be when we had to go through the bottleneck of cloning. But typically the sequencing is getting cheap enough that we'll generate things like sort of 25, 30X coverage. As I said, the N50, I mean this is the kinds of things that you wanna be at least conversant in. The N50 size is the point at which 50% of the bases are in contigs of this size or greater. So like I would say I have an N50 of 78 kilobases. Some genomes I can sequence them just as deeply and I'll get an N50 of 150 kilobases. So really there are certain things that break bacterial genomes. And so those are the kinds of parameters that I want returned to me when I look at have I sequenced this deeply enough. This is a helpful display for me. I look at the contig length. So now I've assembled a contig and this contig may be 50,000 base pairs. This is the assembly. And I've sequenced here at 25X. But there will be some regions that seem to be at 2X coverage. So like they're at 50X instead of 25X. 
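The three assembly numbers discussed here (fold coverage, N50, and contigs that sit at roughly twice the typical depth) can be computed with a short sketch like the following. The function names are hypothetical, the toy contig table is made up, and the 1.8-fold cutoff for flagging a high-copy contig is an arbitrary assumption rather than a standard.

```python
from statistics import median


def mean_coverage(total_sequenced_bases: int, genome_size: int) -> float:
    """Fold coverage: bases of sequence generated divided by genome length."""
    return total_sequenced_bases / genome_size


def n50(contig_lengths: list) -> int:
    """Length L such that contigs of length >= L contain at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs supplied")


def high_copy_contigs(coverage_by_contig: dict, fold: float = 1.8) -> list:
    """Contigs whose depth is at least `fold` times the median depth
    (candidates for plasmids or multi-copy regions such as rRNA operons)."""
    typical = median(coverage_by_contig.values())
    return sorted(name for name, cov in coverage_by_contig.items()
                  if cov >= fold * typical)


# The example from the talk: 60 million bases over a 6 million base genome.
example_coverage = mean_coverage(60_000_000, 6_000_000)

# Made-up contig depths: most contigs near 25x, one at about twice that.
toy_depths = {"contig1": 25, "contig2": 26, "contig3": 24, "contig4": 25, "plasmid": 50}
flagged = high_copy_contigs(toy_depths)
```

A contig flagged this way is only a candidate; whether it is a plasmid, a repeated operon, or an assembly artifact still has to be checked against the annotation.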
So for example in this Staph aureus, this is the USA300 plasmid. And it's at two copies per cell. This would be a high copy plasmid. This would be the rRNA operons, because in this case you're seeing that these rRNA operons are multi-copy in a Staph aureus. So maybe there's four copies of the rRNA gene or six copies in a Staph aureus. So they'd appear to be at a higher copy number. Okay. Genome aligners. So really I'm using an aligner now when I want to get at the next question, which is to say I have genome A and I have genome B. What are the differences between these two genomes? And I want to find single nucleotide variants and I want to find indels. MUMmer, Mugsy, Mauve. Again it really depends on small differences in terms of what's the display you want. We've used all of them and they've all worked well for us. Genome annotation. Okay, that's different than genome alignment. This is where you have the DNA sequence and you want to predict and name the genes encoding proteins, actually, and non-coding RNAs. We used to use the JGI tool, Glimmer, GeneMark. We've now moved over to using PGAP, which is being supported by NCBI. You could certainly use two different tools and see if they predicted the same genes. There's gonna be differences in terms of annotation, in terms of whether it returns to you that this is an iron transporter or a metal transporter. I mean these sort of different names that they're gonna use for it. But pretty much it should be able to tell you what an open reading frame is. Okay, if your question is how much variety is there within this genome? This is a classic paper by Hervé Tettelin and Claire Fraser where they looked at Streptococcus agalactiae. And what they're doing here is that they're sequencing a number of genomes. So this is again the olden days, so they probably sequenced 12 genomes.
And what they said is they annotate the genomes and then they say, when I look at two of them, how many genes do they have in common versus how many protein-encoding genes are distinct between the two genomes. So when you sequence five genomes you see that the fifth genome that you're adding here is gonna add something like 60 new genes. So there's 60 genes that I'm finding in genome five that I didn't find in any of the four previous. And then when I'm looking at five genomes I'm saying that there's 1,900 genes that are shared between all five of them. And you can sort of see that these curves level off, but really what I'm saying here is that this is an open pan-genome. I'm expecting that every time I sequence a new Streptococcus agalactiae I'm gonna add 30 more genes. What that means is that there's a very open genome, that there is a tremendous amount of strep genome diversity. These genes are being brought in by horizontal gene transfer, they may be there on the chromosomes, they may be there on the plasmids, but every time you sequence a new strep you should expect to find new genes. And the core genome, which can be used for phylogenetic studies, would be about 1,800 genes, and you should expect to find these 1,800 genes in all Strep agalactiae. But that helps to understand why it may be that if you have one Strep agalactiae and you have another one, you can't expect that they're gonna do the exact same functions, because they have different genes. And if you just compare two of them, they may be as different as, say, having a hundred genes different between the two of them, which if you think about a bacteria with 2,000 genes, I'm talking about a 5% difference between the protein-encoding functions of two Strep agalactiae, and we've certainly seen similar numbers for other bacterial genomes. This is not true of all species of bacteria; some of them have a lot more activity going on.
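The core and pan-genome bookkeeping described here comes down to set operations on each genome's annotated gene families. The sketch below assumes the genes have already been grouped into families with shared names across genomes, which is the hard part in practice; the function name and the toy gene sets are hypothetical and far smaller than a real genome.

```python
def pan_genome_curve(genomes: list) -> list:
    """For genomes added one at a time, report how many gene families are new,
    how many are shared by all genomes so far (core), and how many have been
    seen in total (pan)."""
    core = None
    pan = set()
    rows = []
    for genes in genomes:
        new = genes - pan                       # families not seen in any earlier genome
        pan |= genes
        core = set(genes) if core is None else core & genes
        rows.append({"genomes": len(rows) + 1, "new": len(new),
                     "core": len(core), "pan": len(pan)})
    return rows


# Toy gene-family sets for three isolates of one species (made-up names).
isolate_1 = {"dnaA", "gyrB", "recA", "capA"}
isolate_2 = {"dnaA", "gyrB", "recA", "tetM"}
isolate_3 = {"dnaA", "gyrB", "phage1", "phage2"}

curve = pan_genome_curve([isolate_1, isolate_2, isolate_3])
```

If the "new" column keeps returning a nonzero number as genomes are added, the pan-genome is open in the sense used in the talk; if it falls to zero, it is closed. The result depends on the order in which genomes are added, so published curves typically average over many orderings.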
Okay, so that's when you look at the whole genome and you're saying what's different. I'm now gonna talk about if you look at very close comparisons between bacterial genomes. So when I'm really saying I wanna compare genome one to genome two, and I'm looking for SNPs, single nucleotide variants, sometimes they're gonna cause mutations, and I'm gonna look for insertions and deletions. And I give two examples here from the NIH Clinical Center. This was in 2007. We had three strains of multi-drug resistant Acinetobacter baumannii and we were wondering what was the phylogenetic relationship of these three strains that were circulating in the hospital. When we first sequenced them there were thousands of SNPs. So it really helped us and enabled us to make sense of this data when we mapped this data to a reference genome. In this case, the black outer circle is an ICU strain from Rome, ACICU. And I'm now putting genome A as the first circle inside of that, and what you see is that any time that there's a SNP, so a single change between ACICU and strain A, I'm marking that in blue. So there are some regions where there's just individual little SNPs, but there's also thousands of SNPs right near the origin of replication, and that was actually what was confusing us, that we saw these thousands of SNPs. What we didn't understand was that they were blocks of recombination and they weren't independent SNPs, so we can't be treating them independently. The red is where strain B differed from strain A, but you can see they also still shared this region of similarity that was different from ACICU. This region, if you can see it, this is green, red and blue. This is another region where the three strains differed. Strain C also here in the green is different than the blue, the red and the blue. So what I have here is a complex picture, where before I saw these as assembled genomes I was kind of confused: how are there so many differences between these strains?
But once you see them in this way you understand that it's not thousands of events. This is a block of recombination that made a major change to the genome, and then a smaller number of SNPs. And the block of recombination is the O-antigen biosynthetic cluster. And this is another ribbon drawing, so again I'm trying to talk about how you display genomic information. So in strain A this maps to strain B, this is the same region of the genome. It maps to strain C, and you can see these other ribbons that connect the distal region. But what you have in the middle of each of these is genes that encode the O-antigen biosynthetic cluster, but it's been a wholesale recombination so that they swap out this O-antigen, and we believe that this is a way for them to basically elude the immune system, to now have a new biosynthetic locus. So that's how you would look for SNPs and areas of recombination. I'm gonna talk now about closer similarity and I'm gonna really just focus on the technical side of this. This was a clonal outbreak at the NIH Clinical Center that's been written about, and I'm not gonna discuss those elements of it, but we have these 18 samples and we're trying to understand what's the relationship between them. We looked at the transmission where we had different patients, and we tried to understand what was a possible relationship of how they are transmitting between different wards in the hospital. But if you looked at just the epidemiologic data you would end up with this sort of plate of spaghetti, and there would be questions about maybe what was the relationship between patients nine and 10, patients nine and 11, when in fact genetic data, as you'll see later, would suggest that there was a relationship, in red here, between four and 10. The epidemiologic data is really complex, and we didn't have resolution with standard microbiology, where we would use pulsed-field gels or targeted sequencing to understand them.
So what we did was we sequenced each isolate, and we started with the index patient, where there was tremendous heterogeneity, sorry not tremendous, a small amount of heterogeneity. This is six million base pairs, and I'm only showing you the seven base pairs that differ between different strains, the single nucleotide variants. The urine was all the ancestral allele, the groin and the lung shared four different SNPs, and these are spread out through the chromosome, so if you imagine that circle there would be four little points where they would be different, and the throat would have three different points. But what we saw with whole genome sequencing, and this is an example of how we use aligned data that's been assembled, is the isolate from the throat was a near perfect match just to the isolates from patients two, three and five, and we could see that these SNPs, which are the genetic fingerprint of this genome, are then shared amongst these other patients. And we can start to use this genetic data and the epidemiologic data, which are both dense, but using two dense data sets will actually give us the resolution to reconstruct the transmission. Okay, I know. I'm gonna talk about metagenomics for five minutes. Metagenomics is the sort of shotgun sequencing for multiple organisms. You basically have a sample like a stool sample, and instead of doing this amplification of the marker genes you're just gonna fragment the DNA and sequence it. I can only say that metagenomics is really complicated analysis. This is why you would use it. So for example, let's say I had these two populations and I wanna know if they're the same.
Well, if I just looked at them with marker genes I would say that there's four circles, five circles here, one squiggly and two boxes, one squiggly and three boxes, and they would look like fairly similar communities. But in fact, there are genes within these genomes, as when I've talked about bacterial genome diversity. There are pink genes that have been used to adapt to environment A, and down here you're gonna see that there's an enrichment in these green genes that are adaptive for habitat B, and they may be dispersed throughout different bacteria, but this is really an even greater level of resolution into what are the microbial communities. These are some of my favorite examples of when you might use this. So here these are two studies where they used metagenomic sequencing to find new metabolic enzymes. What they're asking is, here you have the termite hindgut and you're saying how does a termite break down wood? How does a cow break down biomass? And from this, doing these kinds of shotgun sequencing and then targeted analysis, you can say what are the new enzymes that I would find here? Most people are gonna look at it, and if you wanna do things like say what are all the genes here? What are the strains here? I would say that this is really only for the computational experts. There are very few pipelines for doing this analysis. You have to really be able to do everything on the command line. You're gonna have billions of sequences, and while there's a lot of excitement in the field, doing read-based metagenomics, assembly-based metagenomics, there are tools, but really a lot of it is you get these very complex data and then we don't really have good ways of saying what functions do these encode. So you end up going into something like KEGG or COG, which just say you have carbohydrate metabolism.
So I think that by the time I give this talk in two years we'll have pipelined this out, but you have to at this point be willing to go into programs like HUMAnN, which are gonna take you through a very complex analysis to try to understand what's in your data set. And so here's the type of results that you could get out. This is the Human Microbiome Project and I've shown you, these are the different body sites, and I've shown you that at the phylum level they're quite different, so you'll see all this yellow and here the blue. But when you look at what metabolic pathways they encode it seems fairly consistent. You can see the differences, and you can see that even amongst the samples that have different phyla they maybe make up more similar communities, and there's a greater difference between the tongue than there is between the stool. But a lot of this similarity is driven by the fact that the core genes whose function we're able to understand are really pretty basic things like ATP synthesis, purine metabolism. These are probably not getting at the question at a fine level, like iron transport, heavy metal resistance. In terms of metagenomics, one of the things that I would caution you about, if anyone's interested in metagenomics, is that as we start to do metagenomics on patient samples we do recover human DNA.
I think it's important therefore to consent the patients for human genome sequencing. You can put it aside, but while you're sequencing microbial DNA you probably will end up sequencing the patient's human DNA, and if you have consented them then you can access and utilize that data. But I think it's important to tell patients that that's part of the microbial sequencing pipeline. It's just really hard to separate human DNA from bacterial DNA, and you may think you're getting a stool sample which would be mostly bacterial, but if you then end up collecting a sample from a patient, say, who has diarrhea, there will be human DNA in there. Sort of as the last topic, where is sequencing technology going? Right now the Illumina MiSeq is sort of the dominant instrument in this pipeline. I would say that Roche is dropping out of the market and PacBio is coming on big. The benefit of the PacBio instrument is these long reads. I know we've been telling you from the genome project how we're going to shorter and shorter reads and how this is all gonna be fine for you, but in fact these longer reads have really been very helpful for us in terms of assembling full genomes. I know I tried to convince you that if I gave you a genome in 100 pieces and each piece was 100 kb that you should be happy with that, but I actually think you would be happier if I gave you a fully assembled genome where I said here's the chromosome end to end and here are the three plasmids. And so, it's a more difficult instrument to run, but from a user perspective I really like the PacBio. The long reads give me an accurate full genome assembly. I find this particularly useful when I'm trying to understand if plasmids have moved from one bacterial species to another. It's really hard to assemble plasmids with short reads.
So my first priorities are to do reference genomes on the PacBio so that you can have a really good reference, and if you do choose to use shorter reads you could assemble them to a really strong reference. And I think it's important for us as a community to start thinking about how we're gonna move from DNA sequence data to a clinical report, and I think that's a place where the NIH could take real leadership: if you had a patient who had a bacterial infection, what kind of genomic information could you utilize in terms of understanding infection control or transmission between hospitals? So I think this is a place where, given our current stature, we could really lead on developing a national database of hospital pathogens. Sequencing is just the start. Then you have to go and really look at Koch's postulates. I've really only focused here on how you would get the information that you need to say this is the bacteria or the bacterial community that I have here. I just wanna stress as the last part that genomics generates information that you can use to test a hypothesis, but genomics does not in and of itself give you the answer. Like if you see the same isolate of MRSA, you can't say what is the direction of the arrow. You can't say that this patient contaminated the environment or this patient acquired the isolate from some sink drain. We cannot use genomics to draw the direction of the arrow. Genomics can say there is a connection, there is a relationship, but it needs to be integrated with biological and clinical information in order for us to test hypotheses. So thank you, that's it.