Good morning, and welcome to today's lecture, the ninth in the Current Topics series. Up until now, the lectures have focused primarily on the genomic analysis of mammalian systems, most notably human. But today, we're going to switch gears and discuss not humans, but the small creatures that live on and inside them, namely microbes. In healthy adults, it's estimated that microbial cells outnumber human cells by a factor of 10 to 1. However, these communities of microbes have not been studied until recently, so we don't know too much about their influence on human development, physiology, immunity, or nutrition. Our speaker today, Dr. Julie Segre, is working to understand how the human microbiome, that is, the collection of bacteria and other microbes that share our space, influences human health and disease. In particular, she studies the microbes that inhabit the skin, the human body's largest organ. Many common skin conditions are associated with a change in either the number or the types of microbes that colonize the skin. By sequencing the DNA of bacteria collected from the skin of humans and mouse models of human disease, Dr. Segre's group has investigated how these bacteria contribute to health and, conversely, how changes in the bacterial community structure can contribute to chronic skin disorders such as psoriasis and eczema. Analysis of microbial diversity has traditionally been based on culturing microbial samples, but this method is limiting because only those bacteria that can actually grow in culture can be sampled. So, Dr. Segre's lab has been using new genomic tools that identify bacteria based on species-specific sequences in their 16S ribosomal RNA genes. More recently though, using next-gen sequencing techniques like those that Elliott Margulies discussed last month, she's begun studying complete genomic sequences of microbial organisms. In her lecture today, Dr. 
Segre will be discussing techniques that she and others have used to analyze the genomes of microbial populations. Dr. Segre got her PhD from MIT working with Eric Lander at the MIT Genome Center. She then moved to the University of Chicago where she did a postdoctoral fellowship with Elaine Fuchs. And she came to NIH in 2000 with a Burroughs Wellcome Career Award in Biomedical Research as a tenure-track investigator. She's currently a senior investigator in the Epithelial Biology section in the Genetics and Molecular Biology branch of NHGRI. Please join me in welcoming Julie to present this morning's lecture.
Okay, thanks. Well, I'm going to get started, and thank you, Tia, for that very nice introduction, which was probably better organized than my entire talk. So, I'm going to start by talking just about why the human microbiome. For a lot of simple genetic disorders, like cystic fibrosis, we can find the genetic basis. We understand the genetic basis, the common Delta F508 mutation, and you've heard a lot about the genetic basis of disease in the previous weeks. But for a lot of the medical management of these disorders, including something like cystic fibrosis, which is a very simple Mendelian trait, the complication comes in the treatment, and one of the major complications of CF is the bacterial infection that happens in the lungs. So, you really have to think about the host-environment interaction. That's true of the simple genetic disorders, and by that I mean that there's one genetic cause. It's even more true of some of the most complex genetic disorders, where there seem to be many components in our genetic makeup, but there's also a large degree of interaction with the environment, for disorders like inflammatory bowel disease or the skin disorder eczema. 
So in terms of why the human microbiome: our human cells all carry the same DNA, that is, the same encoding potential, in every cell, but the bacteria and the other small microbes that live in and on our bodies actually outnumber the human cells. And it's not just that they outnumber us. If you do the math, a bacterial genome, as you'll see, is about three million base pairs, and a human cell has about a thousand times more genetic material, so we still outweigh them. But if you think that each bacterium could have a different genetic makeup, whereas human cells all have the same genetic makeup, the genetic diversity of the microbes may be as great as the human genetic diversity. So when we think about the cause of human disorders and the treatment of them, it's really important for a lot of these to think about the gene-environment interactions. So about two years ago the NIH Roadmap launched the Human Microbiome Project, which is to understand the genetic component of the human as an ecosystem, and the primary goal is to establish a baseline of what the human microbiome is, to empower future clinical studies. One of the large studies going on is recruiting 250 healthy individuals and sampling them at five different sites to understand what the diversity is, what the core microbiome is, and then what each individual has uniquely and what they share. The five sites are really the five epithelial linings of the body: the gut, the nasal cavity, the oral cavity, the vagina, and the skin, although we are also investigating some of the sites that are harder to sample because it's more invasive, like the lung epithelium. 
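The "do the math" comparison above can be written out as a short calculation. This is only a back-of-the-envelope sketch using the round numbers quoted in the talk (a 10-to-1 cell ratio, a 3 Mb bacterial genome, a thousandfold larger human genome); the variable names are illustrative and the real figures vary by species and by estimate.

```python
# Round figures as quoted in the talk; treat them as illustrative, not measured values.
BACTERIAL_GENOME_BP = 3_000_000   # "about three million base pairs"
HUMAN_GENOME_FOLD = 1_000         # a human cell has "about a thousand times more"
MICROBES_PER_HUMAN_CELL = 10      # microbes outnumber human cells "10 to 1"

# DNA content of one human cell, in base pairs.
human_bp_per_cell = BACTERIAL_GENOME_BP * HUMAN_GENOME_FOLD

# Total microbial DNA associated with each human cell, in base pairs.
microbial_bp_per_human_cell = BACTERIAL_GENOME_BP * MICROBES_PER_HUMAN_CELL

# How many times more human DNA there is than microbial DNA.
human_to_microbial_dna = human_bp_per_cell / microbial_bp_per_human_cell

print(f"Human DNA per cell:            {human_bp_per_cell:.1e} bp")
print(f"Microbial DNA per human cell:  {microbial_bp_per_human_cell:.1e} bp")
print(f"Human : microbial DNA is roughly {human_to_microbial_dna:.0f} : 1")
```

Under these assumptions the human DNA still outweighs the microbial DNA by about a hundredfold, which is the sense of "we still outweigh them"; the point about diversity is separate, since every microbe can carry a different genome.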
So the goals of this project are to sequence the diversity on those sites, but also to sequence bacterial reference genomes, and together those two will hopefully empower us to do metagenomic studies, which I will get to in more detail. Metagenomics is the analysis of a combined microbial environment, as if you were able to just scrape the skin and then sequence the entire microbiota that you get off that. The other goals are to look at the correlations of changes in microbial communities with disease state and to explore the new ethical, legal and social implications of this field of research. So the microbiome studies really started in the environment, and this is Craig Venter's tour of the Sargasso Sea, where he collected seawater in all these different places and looked at bacteria like Prochlorococcus, which is a common bacterium in the sea. But he also looked at the diversity and found that the diversity of the seawater is really enormous, and actually the environmental diversity seems to be much greater than the human diversity. So this is one type of study that you can do, where you examine the environment in different places of the ocean, and these are all the different places where he picked up bacteria. Or you can look in one place, and this is Phil Hugenholtz's study of a saline mat. That's something where he looked from the top down, so it's like if you started at the top with the grass and went underneath to the dirt, but took a core of the earth. In that case, you start at the top and you look at what can live at the top, what needs the oxygen, and then you go down and you see what can live in the different areas. What he showed here was that as he goes down into the saline mat, the types of bacteria change as you travel down. So in the environment there are these different ways that we can survey, and those are really the ideas that we wanted to apply to the human. 
So now this is a study, and then I'll get more into the details of how we do this, but what I'm starting with is just what sorts of questions we want to ask. This was a study Norm Pace did where he looked at what's living in the shower heads across the United States. These are the types of environments that then interact directly with humans. He looked at what the biofilms are, and on the right are all the different types of bacterial organisms that he found in shower heads. So it really is about the human ecosystem, the environmental ecosystems, and then the interaction between the two. So starting into how we do this type of analysis: the study of bacterial diversity is typically focused on sequencing the 16S rRNA gene. With the ribosomes of a eukaryotic cell, and also of a prokaryotic cell, we tend to think about the large subunit and the small subunit of the ribosome, and that's where proteins are translated. But in fact a ribosome is really 60% ribosomal RNA, and those RNAs give the ribosome a lot of its structure and also aid in the translation of the messenger RNAs. So this is one of the most highly transcribed genes in a cell, and anyone who's ever run a Northern knows that in a eukaryote you base it on the 28S, the 18S, and the 5S gene; I mean, you can see those bands on an RNA gel. Similarly in bacterial cells there's a 16S gene, and there are multiple copies of this 16S gene. The reason that we use this gene for assessing bacterial diversity is that it's a standard housekeeping gene. It's in many copies in the bacterial cell, and moreover it serves as an evolutionary clock, so that if you know the sequence of a 16S gene, you can use that to identify what type of bacteria you have. And this is a way in which we talk between the genomics community, microbiologists, and physicians: it is a standard nomenclature where the 16S is used by all of us to identify what type of bacteria we find. 
And so all the previous studies that I had shown you, of the Sargasso Sea and the shower heads, are based on 16S sequencing. Actually the Sargasso Sea study, I'm sorry, had metagenomic sequencing in it, but you can still pull out the 16S gene even from metagenomic samples and look at that. So the 16S gene is extremely abundant, and it's very easy for us to characterize it, and then to talk between different disciplines with that. One of the features of the 16S that's going to become important is that it has this structure. When I talk about this as being an evolutionary clock: it's an RNA, it doesn't get translated into a protein, but it's an RNA with a lot of secondary structure. What you see here are these stems, and that's a double strand between two different stretches of ribonucleotides, and so these have evolutionary constraint on them in a way that these loops are less constrained. So when we want to make an identification of the bacterial gene, now I'm going to show you the 16S gene as linear. It's 1.6 kb, and I'm showing it to you here in a linear form. This is the conservation, if you looked in 50 base pair windows. There are regions that are highly conserved, and there are regions that are less conserved. I'll get back to how we use the sequence to identify bacteria, but one of the first things that you might want to do, for which you don't need to know anything else about the bacteria, is to find out what the bacterial load is. I do have to say that this has to be done with the caveat that you have to be clear that you are assessing what you think you're assessing. So if you want to calculate the bacterial load, you can put primers into these regions that are more highly conserved and you can amplify like you do with a standard quantitative PCR. 
This is again not quantitative RT-PCR; you're using quantitative PCR to know what the bacterial load is. And we did something like this when we were trying to assess what a swab is. If you swab the person's skin, how much bacteria do you remove? If you scrape the person's skin, how much bacteria do you remove? And if you do a full-thickness punch biopsy, how much of the person's skin do you remove? That's to understand the depth of sampling and how much bacteria you are taking off. But of course there, it's very important that we always used a very standard shape for sampling, and that what we were interpreting was not that this is how much bacteria is present, but this is how much we were able to sample. But it can give you some idea whether, of two samples, one has a higher bacterial load than the other. So the way that you calculate the bacterial load is very similar to how you do a standard qPCR. You see how many cycles it takes, and you can see here that if we have 300 pg of bacterial DNA, it will take 17.8 cycles. If you decrease the bacterial DNA tenfold, it takes about three more cycles. Two to the third is eight, so three cycles would be eight times less DNA; for ten times less DNA it actually works out to about 3.3 cycles. And then if you use three pg of DNA, that's another three, you know, 3.3 more cycles. And we can use that to calculate copy number just as you would do for most qPCRs. This I'm just showing you here as normalized with E. coli, and showing that if you spike the bacterial DNA with human DNA, it just doesn't change the cycle number. So even if you have human DNA mixed together with the bacterial DNA, you can still calculate the bacterial load. Okay. So how do you study microbial diversity, which is more what those studies were trying to do: look at what is present, not just how much? So there's really three ways. 
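Before going through those three ways, the cycle arithmetic from the bacterial load example can be sketched in a few lines. This is a minimal illustration, not the lab's actual analysis: it takes the single reference point quoted in the talk (300 pg at 17.8 cycles) and assumes perfect doubling every cycle, which real assays only approximate; the function names are hypothetical.

```python
import math

REF_AMOUNT_PG = 300.0   # reference input quoted in the talk
REF_CT = 17.8           # cycle number observed at that input

def expected_ct(amount_pg, efficiency=1.0):
    """Cycle number expected for a given input, relative to the reference point.

    efficiency=1.0 means the product exactly doubles each cycle (an assumption).
    """
    if amount_pg <= 0:
        raise ValueError("amount_pg must be positive")
    fold_less = REF_AMOUNT_PG / amount_pg
    # Each cycle multiplies the product by (1 + efficiency).
    return REF_CT + math.log(fold_less) / math.log(1.0 + efficiency)

def amount_from_ct(ct, efficiency=1.0):
    """Invert the relationship: estimate input DNA (pg) from an observed cycle number."""
    return REF_AMOUNT_PG * (1.0 + efficiency) ** (REF_CT - ct)

for pg in (300, 30, 3):
    print(f"{pg:>4} pg -> about {expected_ct(pg):.2f} cycles")
```

With perfect doubling, each tenfold dilution adds log2(10), roughly 3.3 cycles, which is why "three more cycles" slightly undercounts a tenfold drop. In practice the efficiency would be fitted from a full dilution series rather than assumed.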
First of all, there's fingerprinting, where you could amplify the 16S gene and chop it up with restriction enzymes and look and see if there are any differences. I'm not going to cover that. I'm giving you a reference for that. It is certainly the cheapest, but it's very limited in terms of what type of information you can get out of it, because those bands don't have any molecular identifying information associated with them. PhyloChips are very much like microarrays, where you put down unique probes for each of the known bacterial lineages and you can hybridize to them. I think that PhyloChips are certainly going to be extremely important in the future, when we have assessed the microbial diversity and you want to go in and see how much of the different bacterial phyla there are. But at this point, like any microarray, chips will never find unique sequences for you. As you'll hear about with microarrays, the reason people are moving from microarrays to RNA sequencing is just this: you will only find what you are looking for there. They will be very powerful in the future, but at this point we are still, in this project, in a discovery mode. Finally, what I will focus more on is the sequencing and the taxonomic classifications. One of the points that I really want to make in this talk is that for a small study, the sequencing is limiting. If you want to compare two samples and you want to look at 200 sequences from each one, probably the sequencing will be the limiting factor. How can you get 400 sequences? But if you want to do a larger study, a longitudinal study, and use five different animals or five different people and look at them at three different time points, at that point the bioinformatics becomes limiting, and I'll explain more about why that's my opinion. 
This is an example of the PhyloChip, which again is very useful if what you're trying to get is an overall sweeping characterization. This is a study by David Relman and Pat Brown's group, where they're looking at the intestinal microbiota in the first year of life. Of course the first year of life has a lot of changes. The child is going from breast milk to oatmeal to semi-solid food to table food. What they found was that during this first year of life there are many changes in the bacterial populations, and that the diversity is both between infants and between time points. What you can see is that the diversity doesn't look the same between these three infants, and that there are little spikes where the bacterial diversity will suddenly switch. We don't really have any understanding of why these shifts occur. In this case they can correlate it with the clinical metadata, but what they're finding is that at this point the resolution that we have doesn't really give them the pattern in here. It just seems like there's a lot of diversity between kids. If you want to sequence the 16S gene, now I'm kind of getting into the nitty-gritty of it all. As I was saying, there are these variable regions, and this picture shows you the conservation as you go across in 50 base pair windows. V1 stands for variable region 1. That would be one of those loops that you saw in the E. coli structure, whereas these regions here are the stems, so that's a very highly conserved sequence. 
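The conservation plot described here, scored in 50 base pair windows, can be approximated with a short script. This is a sketch under stated assumptions: the input is a set of already-aligned sequences of equal length, conservation is taken as the fraction of sequences sharing the most common base at each position (other measures exist), and the toy alignment is invented for illustration rather than taken from real 16S data.

```python
from collections import Counter

def column_conservation(alignment):
    """Per-position conservation: fraction of sequences carrying the most common base."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(alignment))
    return scores

def windowed_conservation(alignment, window=50):
    """Average the per-position scores over consecutive, non-overlapping windows."""
    scores = column_conservation(alignment)
    result = []
    for start in range(0, len(scores), window):
        chunk = scores[start:start + window]
        result.append(sum(chunk) / len(chunk))
    return result

# Invented toy alignment: eight conserved positions followed by two variable ones.
toy_alignment = [
    "ACGTACGTAA",
    "ACGTACGTCC",
    "ACGTACGTGG",
    "ACGTACGTTT",
]

# A window of 4 is used only because the toy sequences are short; the talk uses 50.
print(windowed_conservation(toy_alignment, window=4))
```

High-scoring windows correspond to the stems where primers can be placed, and low-scoring windows correspond to the variable regions (V1 and so on) that carry the identifying information.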
The way that we sequence the 16S gene is that we put our primers into these highly conserved regions that are here at the ends of the gene. Or, if you want to do 454 sequencing, which gives you 400 base pair sequences, you put your primers here into the highly conserved regions in the middle of the gene. So you can sequence with full-length Sanger sequencing and assemble the sequences together to give you this 1.6 kb, or you can use these shorter reads, and I'll show you in a second the difference between those two. But one of the real take-home messages from this is that the primers that you pick significantly determine the microbial diversity you see. Even though we talk about these primers as amplifying all bacteria and being in these conserved regions, that of course needs to be taken with a grain of salt. Even if you put a primer here into this region and you think that you're amplifying 95% of the sequences, that's based on what we already know, on what has been found. So you certainly are missing 5% of the sequences and perhaps more, and the 5% that you're missing if you're using this type of primer here could be very different from what someone misses who is using primers that are here and here in the different conserved regions. So it's unfortunate at this point, but this is not like the human genome, where there is a reference. It is still very dependent on the tools that you are using. So at this point it's still complicated to use someone else's data set as a reference and then just use your own as a test set to see, for example, if your mice with a genetic defect have a different microbial profile than someone else's. It's very hard to compare data sets between labs and know whether the difference is biological, that there really is a difference, or whether it's technical, that it's based on the two groups using different primers or different amplification techniques. 
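The point that a "universal" primer only covers the sequences it actually matches can be made concrete with a small coverage check. This is a toy sketch, not a primer design tool: the primer and the three template sequences are invented for illustration, matching is tested on one strand only, and the only real-world element is the standard IUPAC degenerate base code.

```python
# Standard IUPAC nucleotide codes: each primer symbol maps to the bases it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_matches(primer, template, max_mismatches=0):
    """True if the primer anneals somewhere on the template within the mismatch budget."""
    for start in range(len(template) - len(primer) + 1):
        site = template[start:start + len(primer)]
        mismatches = sum(1 for p, t in zip(primer, site) if t not in IUPAC[p])
        if mismatches <= max_mismatches:
            return True
    return False

def coverage(primer, sequences, max_mismatches=0):
    """Fraction of the reference sequences that the primer would be expected to amplify."""
    hits = sum(primer_matches(primer, seq, max_mismatches) for seq in sequences)
    return hits / len(sequences)

# Hypothetical primer and reference set, invented for this example.
toy_primer = "GATCMTGG"          # M accepts A or C
toy_references = [
    "TTGATCATGGCT",              # matches (M = A)
    "TTGATCCTGGCT",              # matches (M = C)
    "TTGATCGTGGCT",              # G at the degenerate position: one mismatch
]

print(f"exact matches:       {coverage(toy_primer, toy_references):.2f}")
print(f"allowing 1 mismatch: {coverage(toy_primer, toy_references, 1):.2f}")
```

Any coverage figure computed this way is only as good as the reference set, which is the speaker's caveat: sequences that have never been found cannot be counted as missed.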
Yeah, to a great extent, if we all agree to use the same primers. That's one of the goals of the Human Microbiome Project, to say that we will all use these primers and standardize the conditions. But there is a fair amount of pushback, because of course any of those primer sets has its own bias, and the bias of what we might lose if we were doing oral cavity sampling might be very different than if we were doing gut sampling. So it may end up being that we have primers that are specific to the body site, or we may have to make some agreement like that. But in order to assess that, there just recently was a paper that a Swedish group published on a mock community, and that's a similar analysis to what the American Human Microbiome Project is doing, where you build a mock community: you put in 20 bacteria that are known, you put them in different ratios, and then you amplify them with all the different primers. Because, again, data is going to lead the way. I can't convince you that my primers are better than your primers if they have different results. But if we both assess them on a mock community, and then I say my primers give me back the microbial diversity and your primers are skewed, in that they amplify Firmicutes and not Actinobacteria, then you might be convinced to use my primers. Until we have that type of data, there's no way that anyone can convince anyone else that their primers are more representative. And so yes, you just got into one of the issues. One of the reasons that we're really trying to build a human microbiome consortium is to agree upon standard principles of how we are going to assess microbial diversity. And I think, as with all of these types of projects, it's fine to break the rules as long as you understand that you are breaking the rules and why you're breaking the rules, because anyone will agree that you can use your own primers if you have a reason for doing that. And we may even move beyond 
all of this when we get, as I was saying, PhyloChips; then we'll move into a different bias. This is really just in the discovery mode. So how many reads do you need? This is probably going to end up being important in terms of assessing how much a project is going to cost you to do and how much bioinformatics you need. Well, as I said, the two approaches that we currently are using: one is to get the full-length 16S with Sanger sequencing, and the advantage of that is that it allows you to assess your sequences compared to samples of microbial isolates that have been cultured, because it really takes almost that much sequence to make a unique identification. The 454 Roche instrument gives you 400 base pair reads, and every base pair is much cheaper than doing Sanger sequencing. It gives you a lot of data, and for many things the assessment that you can get with a 454 would be sufficient. You can sequence 400 base pairs and cross two variable regions and the conserved regions. Then Solexa/Illumina, and I know Elliott Margulies talked about all of this, so if you're interested more in the technology you should certainly refer back to his earlier talk in this lecture series. These are going to move towards a hundred base pair tags. They've moved from 35 to 75 and they will probably get to a hundred base pairs. That's still a little small to identify bacterial genera, but it's great for whole-genome bacterial sequencing, and the hundred base pairs, again, might be enough to tell you if there's a dramatic shift, if that's what you're looking for. So this is just looking at what type of information you get if you compare a 400 base pair sequence with a full-length sequence. James Cole's group has put this analysis together, where you look at a full-length sequence, and here of course you have to refer back to your own ideas about bacterial taxonomy, and this is, you know, the 
standard kingdom, phylum, class, order, family, genus. So you can get different levels of taxonomic identification if you use different sequence lengths. Here I'm showing you the full-length sequence compared to a 400 base pair sequence, and these other data points are what 454 FLX was and, you know, what you would get if you were getting into Illumina sequencing. So if you have a full-length sequence, of course, you could identify basically everything at the phylum level and most things at the genus level. Now if you go to a 400 base pair sequence, you can still identify everything at the phylum level, and your level of resolution to get down to the genus level drops slightly. You just have to decide where you want to be, what's your sweet spot for your type of analysis, and that is really about what level of organization and taxonomic diversity you want to pick up. So let's move on from that, because the sequencing technology has been covered. Let's say that you have these bacterial 16S sequences, and I'm just showing you here the 16S sequence from Staph aureus. So you've got these sequences, and maybe you've got full-length sequences or you have 400 base pair sequences. What are you going to do with them now? Well, you'd like to identify the bacterial sequence, and the next thing you'd like to do is probably align your sequences. There are tools within NCBI, like BLAST, where you could see if your sequence matches a previously cultured sequence. If it's a known bacterium then you will get a match in here; like if this is Staph aureus, you would get a match. But you may also be picking up the sequence of something that hasn't been previously cultured, and there are databases that contain large amounts of 16S sequence. So this is the Ribosomal Database Project that was developed by James Cole, and what this allows you to 
do is to align your sequence specifically against other 16S sequences. This is about a million 16S sequences that have been aligned and annotated, and it uses a naive Bayesian classifier so that you can identify what sequence you match most closely. It kind of takes a lot of the work out of it for you, so that you might match a lot of things at the 95 percent level, and then a smaller number of things at the 99 percent level, and a very few things at the 99.9 percent level. Now one of the things is that different taxonomies have been developed for classifying bacterial sequences. You can use Bergey's taxonomy, or you can use that of the NCBI. For most bacteria this won't be an issue; all the taxonomies will agree. But there are instances where different groups have their own taxonomy, so it is important to at least be aware of the fact that different taxonomies exist. Within the RDP classifier there also are some very nice tools, and I've pointed out that there is a tutorial associated with RDP. You can use this sequence then: if you said that you wanted to make primers that specifically amplified Staph but didn't amplify Strep, you could go in and try to use the Sequence Match or the Probe Match parts of this program to pull out what primers would be ideal for doing those types of analyses. So that's based on RDP classification, and they do have a pyrosequencing pipeline. That's one of the things that is going to be an undercurrent to this: people are switching from full-length sequencing to pyrosequencing, the 454, and many of the tools are somewhat ready for getting pyrosequencing data but not really ready for pyrosequencing data. So if you do intend to use 454, I'll try to point these things out, but you should really look at these tools and see which are being adapted for short pyrosequencing reads, because that's a real moving target, where everyone knows that people want to develop tools for that but 
that ends up being hard. I mean, you're trying to feed a lot of sequence data into these programs that were originally built for a smaller number of full-length Sanger sequences. Yeah? Right, so the question was what's the number of species that a human has. I don't know the answer to that, and I think that I can answer it at the phylum level. What's interesting there is that there are about a hundred known bacterial phyla, and there seem to be only about eight that can inhabit the human, whereas a hundred of them can live in the sea and in the soil. If you sample the sea or the soil, with every sequence you get, you keep getting something unique. It seems to be a very diverse environment. There's definitely been a selective pressure put on what bacteria can live on the human. So at the phylum level we're talking about a bottleneck, where it's about eight of a hundred phyla. But when you get down to the level of genus, or really what you're asking, species, we don't yet have enough of an exploration to know how many differences there are, and whether it might end up being that the diversity is at the centimeter level, or if there really is a core microbiome. And part of it, and I guess I should just say this while I'm giving my opinions, is going to be the concept of what a species is, because bacteria exhibit horizontal gene transfer. So the concept of a species would probably mean that you're talking about a conserved 16S sequence. Even for that, I don't know how many 16S sequences we would find on a human, and I don't even have an order of magnitude for you. That's the interesting thing: they talked about the human genome, how there were going to be a hundred thousand genes, and then probably there's 25,000, but the diversity is in the splicing and in the protein forms, and so that can be a number that you could give in many 
different ways even for humans. Yeah, I guess the hundred thousand genes was ultimately tracked back, I think, to Wally Gilbert, and as a physicist, when it turned out that there were 25,000 genes, he said how remarkable it was to have been within an order of magnitude. So I guess the fact that I haven't heard a number for the order of magnitude of bacterial genera on the human is just that really no one knows. Well, I haven't heard a number. So what could you do with an RDP classification? Well, one of the examples that I think really got the field started is Jeff Gordon's study of lean versus obese mice. This is an animal model, and what you see here is the ratio of, sorry, of Firmicutes to Bacteroidetes in the lean mice (this is just the homozygous wild type versus the heterozygous mice) and in the ob/ob, which are the obese mice. What you can see is that there's a statistically significant shift, an increase in Firmicutes and a decrease in Bacteroidetes. Now with all of these there is the concept that this is correlative data. Is the shift in the microbiota causing the obesity, or is it a result of the obesity? That's the type of information that this data can't answer for you, but this is at least hypothesis generating: the obesity may be affected by the ratio of Firmicutes to Bacteroidetes, and that may be a possible target for drugs. I'm going to show later that Jeff's lab has gone on and studied this in humans, but, oh sorry, I guess I'm talking about that right now. So Jeff's lab has gone on and shown that in humans, where you see up in the upper right-hand side the ratio of Firmicutes to Bacteroidetes. These are people who were previously considered obese and have been put on a diet, and what you can see is that as they become leaner the proportion of Bacteroidetes increases and the proportion of Firmicutes decreases, and they did 
an assessment of the change in body weight if they were put on a carbohydrate-restricted diet or a fat-restricted diet. So there are things that we can do easily in mice, and then if there's an answer that's provocative in mice you can move that into humans, knowing that to do that on mice is often 10 mice, and the variation in humans is much greater, so you need to enroll more people to do that study, and that's a much greater undertaking. So, before I talk about how we do the analysis of the data: we've just talked about how you do an original classification, and now probably the next thing you'd like to do is a sequence alignment, and to start thinking about what kinds of shifts you have below the level of phylum. But we get into an issue that I want to spend some time on, because it may not occur to you when you're first starting this analysis, but it will end up having a tremendous effect on your data, and that's the issue of chimeras. What you're trying to do is amplify the 16S genes and then sequence them, but what happens is that you will get chimeras, and I'll explain to you in the next slide how you'll get chimeras. The problem with chimeras is that you will end up with sequences that are not really derived from a bacterium, and you need to be aware of that. So here's how you get a chimera. You're amplifying many bacteria together, and what'll happen is that during the amplification process not all of your PCR products can complete their extension. So you're moving across and you're amplifying here, but then the PCR cycle is over. Even if you extend for a minute and a half, you keep starting strands that you don't finish, and so you fall off. Now what you have is that you're at a conserved region, and you anneal that to a different bacterial 16S, and you now extend; 
You started on a blue sequence, but now, in the second extension, you extend through the green, so what you've got is something that's half blue, half green. In your data set this would look like a novel bacterium, because it won't match anything that's previously in the database. So you'll think you have these novel bacterial species when in fact what you have is chimeras. People worry a lot about this. We worry a lot about this, because right now we're trying to build a reference, and if you have these chimeras in a reference, then you will have databases that are filled with sequences that are artifacts of the PCR, not true bacterial identifications. I spend this time on chimeras because if I were reviewing a paper from any of you and you didn't chimera check your sequences, I would tell you to go back, chimera check, and send me the paper again, because the rate of chimerism is actually significant. There is a lot of question about it, but a lot of people are becoming convinced that the rate of chimerism is higher than we had previously appreciated. So there are a lot of people working on algorithms to do this chimera checking. Pintail is one of the methods, and Bellerophon is one of the methods. What these are doing is that they look at the whole sequence, but then they look at what this end of the sequence matches and what that end of the sequence matches, and if those match two different bacterial genera, then it is likely to be called, or at least flagged, as a chimera, and you can go in and manually curate whether you want to accept it. So chimeras can occur between different genera, and there are also
other algorithms that look at what you would do if you had chimeras that appear to come from within the same genus. So we have Pintail and Bellerophon, and ChimeraSlayer, which is still under development, is probably the first one that can deal with the shorter 454 XLR sequences. They're trying to develop it so that it can be used with 454 sequences, and it is the same idea as Bellerophon, where you look at the two different ends of the sequence and ask whether they look like they are from the same bacterial sequence. The other thing that you need to look out for, if you're taking samples from a human, is host sequence contamination. It's really important because of the IRB: when we consent individuals into our clinical study, I don't want to be putting their human DNA into databases that are publicly available on the web when I've told them I'm going to put their bacterial DNA there. Ethically, you really should make every attempt to filter the sequences from human subjects before submitting to the public databases. This is something else that the Human Microbiome Project is working on, because I think this is something that we want to develop community standards for. What you typically do is require a positive signal, that it looks like bacterial DNA and does not look like human DNA. So we've filtered all of our sequences, and we do this before any further analysis, to make sure that we aren't putting human sequences into the databases. Okay, so now you've got your sequences, and you've removed the two things that I worry about, the chimeras and the human contamination. So how do you align your sequences to start thinking about whether you've had a shift in the microbial diversity? Well, we're all fairly familiar with programs like BLAST,
where you can BLAST two sequences against each other, or Clustal, which allows you to put a lot of sequences in. But the issue is that if you use one of these programs, you lose a lot of the information that we do have about the 16S structure. That is to say, if I have to put a gap in between two sequences, should I put that gap into a conserved region or into a variable region? Knowing what I do about the 16S sequence, I would probably put it into the loop structure rather than the conserved structure. So there are programs such as NAST, and now, as I said, NAST-iEr, which is the Broad's adaptation of NAST for 454 sequences. This takes what we do know about 16S sequences and does a fixed-width character alignment: it puts the gaps in, it puts the base pair changes in, and it understands the structure of the 16S sequence to do an alignment for you. From that you can build a phylogenetic tree, so that you can begin to look at what types of sequences you have and what the branch length is between these different sequences. If you go into RDP, which is probably the first database that you would look at, it reads this out for you. I can't read it on this screen, but here you have 59 sequences that are in this category, 42 in this one, and you can open these out and see what you have. But if you want to do any more sophisticated computational analysis, what you really want is to define these taxonomic groups, so you're probably going to have to delve into this. This is where Patrick Schloss has developed a suite of software tools called DOTUR, SONS, and now mothur, which groups all of these analyses together. The root of DOTUR is the operational taxonomic unit, OTU. The operational taxonomic unit is our way of saying that this is a species, because species really
means something if you have organisms that engage in sexual reproduction, where you can talk about whether the F1 is fertile to produce an F2. But bacteria undergo transduction and conjugation, so they could have the same 16S sequence and not really be the same species. So taxonomic unit is a way of saying they have the same 16S sequence while moving away from the concept of species, although that's sort of the equivalent of what we mean. OTUs cluster sequences based on how close they are to each other. This is from the original DOTUR paper, and what you see here is that, depending on how similar you want your sequences to be, you can look at the variation. If I say that everything in my group has to be 100 percent identical, then every sequence that Patrick Schloss characterized here is unique. This is the original paper, so it is only looking at 140 sequences, but at that level, from the environment, every sequence is unique. If you start allowing a 3 percent difference, so that they're 97 percent identical, then there begin to be sequences that cluster together. Now we're obviously doing many more sequences and getting many more repeats, and to put things at the species level we often classify at 97 or 99 percent identity. I should have said at this point, and I'm sorry I didn't include it in these slides, that when we look at the sequences we often apply a Lane mask. If anyone has questions you can just ask me about that, but what I mean is that we often don't examine the most variable positions, because they can change and then revert, and often in those real loop structures the rate of change in the base pairs is actually greater than in the
other regions. So we want to look at positions where all the sequences are changing at the same evolutionary rate. With the abundance of sequences that we have, we're typically clustering at 97 or 99 percent identity, and then we want to start thinking about how similar these groups are. We have two ways to think about that, community membership and community structure, and these are two different measures. Community membership gives predominance to the rare species. In this example, I'm asking how many categories they share. Group A has a banana but group B doesn't, and the same with the pear. So the number of categories of fruit that they share is only 40 percent; they only share two out of the five. But if you ask how many pieces of fruit they have in common, now you're giving more weight to the dominant species. You ask, if I find this orange here, do I find an orange in the other fruit salad? You're going to find the orange 34 times, and it's always going to be in both A and B, and there are going to be six times when you pick up a piece of fruit and it's only in the fruit salad of A and not in B. That's why the community structure would look like 94 percent here, because that's about each individual sequence instead of how you've grouped them. The other terminology that is used is weighted or unweighted, and weighted means you've given greater weight to the dominant species. This slide is admittedly hard to read, but what I'm trying to show is that in that same analysis where Jeff Gordon showed that the obese mice clustered together, in that they all had the same change in the ratio of Firmicutes to Bacteroidetes, if you looked at their community membership the mice are most similar to their mothers. So M3 is mother 3, and then
all of her pups are most similar to her in community membership, so the mice inherit the microbiota from their mother. For community structure, meaning the ratio of sequences, who they look like is genotype dependent, but for the actual identification of the individual sequences, the pups look most like their siblings and their mother. So those give you two different pieces of information, and they're each correct: a pup looks most like its mother in terms of community membership, but a pup looks most like other individuals of its genotype when you consider community structure. So that was the other example, where pups cluster based on their genotype for community structure. Besides the method that Patrick Schloss developed, DOTUR, Rob Knight's group has developed UniFrac, which is a unique fraction metric. But again, as I was saying, this can be weighted or unweighted; it's really the same principle, where they're looking at community membership and community structure. There are p-values that can be assessed for UniFrac, and there are visualization techniques, so it allows you to do the same types of analyses, and those are the two major groups of analysis tools. So how much diversity is there in a population? Well, there are techniques for calculating that too, once you've defined your operational taxonomic units, and this gets to the point of how deep you need to sequence: how many dominant groups do you expect to see, and how many rare groups? The easiest one to calculate, I think, is the Chao1 rarefaction term. What you see here is just some of our first runs of sequencing, where between the toes we saw that there were really four predominant species. So if I'm doing Sanger
sequencing, I'm probably not going to be able to capture that many more rare species, because there are these predominant sequences. But in the umbilicus, or the belly button, as I sequence 400 sequences I've captured 55, and because I keep finding rare sequences, I predict that if I kept sequencing I would find many more. If you really want to get into a deeper analysis, there are indices that have been developed by the ecologists to look at the richness, which would be the number of OTUs that we just talked about, the evenness of them, and the diversity. The evenness is whether there's one predominant one and then many rare ones, or whether they're spread out. There are algorithms for calculating many of these types of indices. This is just a note: if you are going to use 454 sequencing, I think you should consider this VAMPS method to form the operational taxonomic units. The reason I say that is that on an individual read basis there still is a fairly high number of errors within 454, so if you're classifying sequences at 99 percent identity, you don't want your sequence identity to be driven by the error rate of the individual 454 reads. Everything that I talked about for bacterial diversity has similar strategies being developed for classifying fungal diversity, but those are probably not yet ready for prime time. If Tia invites me back in a year and a half, hopefully we'll have made more progress on that. This is just a highlight to give you an idea of what type of sequence diversity we have found. This is on the human skin, and what we found was that the
ecosystem really determines the bacterial diversity. The blue text here means that this is an oily surface, and you can see that those are dominated by this dark blue Propionibacterium. The green are the moist surfaces, and those have a predominance of these green Proteobacteria. And the red are the dry sites, and these actually have the greatest diversity of all the sites. But at each site, and you'll see this better here, there is different microbiome diversity. When you look at the back of four different people, it's all predominantly this Propionibacterium, and that's quite different than if you look at the antecubital crease, which is the bend of the elbow, of those four people. So the site is more determining of the microbial diversity than the individual, and the bacteria that can live in a moist environment like the bend of the elbow are quite different from the bacteria that make their home in the sweaty armpit, which has large hair follicles. This is nicely illustrated in Rob Knight's work, where you can see, when he does a principal coordinates analysis, that the green are the oral cavity samples and the blue are the gut, and they cluster away from the pink, which are the skin sites. There is some mixing, of course, when he's looking at some of the sites. You can also see in his UniFrac analysis which sites group together: the left versus right of the arm is not that different; the differences are really between the sites and then between the individuals. I just gave some references in case you're interested in more about the techniques or more about the Human Microbiome Project. That was the bulk of my talk, but now I'd like to make a few other comments about what else we are trying to do with the technology, in
case these are things that people came to hear. We can also sequence the whole genome of bacteria, and again this is pretty technology driven. Elliott Margulies went over the Roche 454 and the Illumina Solexa, but really what's happened is that we've moved away from cloning. It's not just that these technologies are much cheaper per base pair, it's that it's very hard to clone bacterial DNA into bacteria. When we used to have to make libraries and then sequence them by Sanger, everything was cloned into a bacterial plasmid, but a bacterial plasmid doesn't like to have another bacterial promoter sequence in it. It actually doesn't mind having a eukaryotic promoter, but it doesn't like having a bacterial promoter. So when you tried to clone and sequence bacterial DNA, what you would miss was the promoters of the genes, and often the beginning sequences of the bacterial open reading frame. When we get away from cloning as our way of making libraries, we can actually get much better coverage of the sequence of a bacterial genome. I know Eric Green went through this in terms of how to do an alignment: we wind up with these reads, which are whatever length they are, and then, as they talked about, we get these paired-end reads, where you get about 200 base pairs from two different regions, and for bacterial genomes it seems that the sweet spot is to have them 8 kb apart. From the unidirectional reads you build these contigs, and then with the paired ends you get some sequence from this region plus some sequence from that region, and you can bridge between those two. I know other people went through this: there are assemblers that we use to assemble these sequences. This is all done computationally, and it's really pretty straightforward. Roche 454 uses Newbler, and Elliott has expertise in
developing Velvet, which really came out of Ewan Birney's lab. Then you can look at evaluating these assemblies, and there are algorithms for how much coverage you need. If we're doing Roche sequencing, we generally oversample, so we get something like 20 times oversampling, and that gives you contigs that start at different places so you can get the assemblies. What's cool about those coverages is that when you're doing 30-fold coverage, things that are at a different level than the genome pop out at you. This is a log scale, so you're looking here at what is a 500 base pair contig and a 1 kb contig. We sequenced here at about 30-fold coverage, so most of our contigs are big contigs; ignore everything smaller than 500 base pairs, those are just individual reads. This is Staph aureus at 30-fold coverage, and what you see are some things that are at twice the genome's coverage, so this is a plasmid, and some things that are at five times the coverage, which is the rRNA operon; as I was saying, there are about five copies of them. Here's another plasmid that pops out, and here's another. So when you look at just these contigs, the things that are over-sequenced are bits of the genome that are there at higher than one copy number, and those just pop out for you. One of the things that we're interested in now, which I alluded to before, is whether there is a reference genome. We talked about how things might have the same 16S sequence, but here's an example, and there are many more of these, of Streptococcus, where Claire Fraser-Liggett's group went in and sequenced 12 of these genomes. What she found was that as she sequenced additional members she kept finding new genes, and that even though they
would have the same 16S sequence, they had different genes in them, and that's because they engage in horizontal gene transfer. Over on the right, she looked, as she went up to 12 genomes, at how many new genes she was finding and how many genes each one had. What she found was that as she kept sequencing she kept finding unique genes. If this curve had gone down to zero as she kept sequencing new genomes, then she would say, well, at least I've identified all the genes. But no, every time she sequenced a new genome she kept finding new genes, meaning that this is an open genome that can keep bringing in new pieces of DNA. There are examples like Staph aureus, which has a very dynamic genome, but if you sequence the Staph aureus that's present in hospital isolates, this USA300, it has actually been driven to being a fairly fixed genome. So with whole genome sequencing we can find deletions, we can find mutations, we can find insertions, and you can do comparisons between different genomes. These are just small vignettes that I wanted to talk about, and another type of microbiota is of course the viruses. This is also technology driven. The question is whether a virus is what's causing these new diseases, and a lot of this is zoonotic transmission, meaning animal-to-human transmission, or a virus that is associated with a disease but that we've never detected before. Examples are SARS and Merkel cell carcinoma. What people can do now, and the price is dropping, is take samples from people who have a chronic or an acute disease, do the sequencing, then just digitally subtract out the human DNA and find what sequences are left that might match viruses. The issue here is that this is quite expensive, because you have to
sequence basically an entire human genome and then throw away that data to look for these viruses. But once you've done that, anyone else who wants to see whether this kid with diarrhea has this type of viral infection can do that really cheaply, because they just need to do a PCR. You spend all the upfront cost finding the virus, but then other people can use some sort of PhyloChip or a PCR-based assay. So it's about an initial investment to find these viruses that are infecting humans, and then a much cheaper assay in the future. I'm just going to talk, in a few slides, about this arenavirus that was found. What happened was that there were three people who got organ transplants from one individual, and after getting these transplants all three died. You can see that all three rapidly declined, and of course they were on immunosuppression. What they did with these three people was sequence the RNA from all of these individuals, and it really is the needle in the haystack, because they have to throw away all of the human sequence. But they did find fragments that matched an Old World arenavirus, and from that they could PCR walk and find this complete new virus that was infecting these individuals. And this is true of everything I've said, so this is where I'm starting to wrap up: sequencing is just the start. We will find correlations. Even in the case of that arenavirus, they don't find it in other people and they find it in all three of these patients who died, but that's really just the beginning, because what we are generating is hypotheses, and then they need to be corroborated. The first of Koch's postulates can be sequence based, that you find this in the people who are sick but you don't find it in the healthy people, but the rest of them really require biology, because
you need to be able to grow the microorganism, and you need to be able to establish the mechanism by which this microorganism would cause a disease. That's a really different component, and it's sort of like finding a disease gene locus in a human and then trying to figure out how that's causing the disease. The final few slides are just on this concept that we're going to get a lot of sequence, and the ultimate goal is not even the 16S, which is just a signature of the bacteria, or the 18S, which is the signature of the fungi, but really to understand how all of these organisms coexist. That's the metagenomics concept that I brought up in the beginning, where you'd like to just be able to scrape someone's skin and from that say, here's the entire microbial profile. The reason that this is important is that you could be misled. If you looked at these two populations, A and B, and you just looked at what types of shapes of bacteria you have, you would say, here I have four circles, one squiggle, and three rectangles, and down here I have two rectangles, five circles, and a squiggle, so they might look pretty similar. But if you look at the actual genes that are in those bacteria, there is a huge representation of these pink genes, some gene that's encoded in all of these bacteria, which you didn't pick up by 16S. This pink gene is something that allows them to adapt to live in habitat A, and this population doesn't have that much of the green, whereas this other population has a lot of the green, which allows it to live in habitat B. And everyone has a lot of this blue; say this is the 16S gene, which is an essential gene. So metagenomics will get you beyond 16S: there may be two different bacteria that have two different 16S signatures but have the same gene content, or the same protein-encoding potential, to live in that habitat. So we are starting to see more metagenomic sampling, but the
problem is that once you get those metagenomic samples, we don't really have the tools to know what the bacterial protein-coding potential is. You get a lot of information about the bacterial phylum and what types of sequences you have, but then when we go into these COG or KEGG categories, which are just catalogs of orthologous genes, it just doesn't give you much resolution. You've got this large amount of sequence, and we just don't have very much information to classify these sequences. But I'd like to finish by asking the question of how we, as Americans, relate to our microbiota, because there is this preponderance right now of wanting to kill all the bacteria that live on our bodies, and in particular on our skin, without understanding that the bacteria are really contributing to our health. Part of the goal of this project is to lose the language of warfare and to understand, ultimately, the relationship of these microorganisms with each other and with our own human cells, so that our goal is no longer to sterilize our bodies but to live in peace and harmony with our microbiota, and to consider that the microbiota are not just driving infections but are actually also promoting our health.
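
As an editorial aside on the community membership versus community structure comparison from earlier in the talk, here is a minimal Python sketch of the two ways of counting. The fruit counts, the names salad_a and salad_b, and both function names are made up for illustration. The structure measure is a simple stand-in (the fraction of individual pieces that fall in categories found in both salads), not the calculation that mothur or UniFrac performs.

```python
from collections import Counter


def membership_similarity(a: Counter, b: Counter) -> float:
    """Unweighted: shared categories divided by all categories seen."""
    cats_a, cats_b = set(a), set(b)
    return len(cats_a & cats_b) / len(cats_a | cats_b)


def structure_similarity(a: Counter, b: Counter) -> float:
    """Weighted: fraction of individual items that belong to shared categories."""
    shared = set(a) & set(b)
    in_shared = sum(n for k, n in a.items() if k in shared)
    in_shared += sum(n for k, n in b.items() if k in shared)
    total = sum(a.values()) + sum(b.values())
    return in_shared / total


# Hypothetical fruit salads: two dominant fruits in common, a few rare ones not.
salad_a = Counter({"orange": 20, "apple": 14, "banana": 1, "pear": 1})
salad_b = Counter({"orange": 18, "apple": 16, "grape": 2})

print(f"membership: {membership_similarity(salad_a, salad_b):.2f}")  # membership: 0.40
print(f"structure:  {structure_similarity(salad_a, salad_b):.2f}")   # structure:  0.94
```

With these invented counts the two salads share only two of five categories but nearly all of their individual pieces, which is the contrast the fruit salad example draws between unweighted and weighted comparisons.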