All right. Good morning, everyone, and welcome to this final week of the 2012 Current Topics in Genome Analysis series. Just a little bit of housekeeping before we get to today's lecture. For any of you who missed any of the lectures along the way, just by way of reminder, you can view any of them by going to the course website, which is again at genome.gov/COURSE2012, and all of the lectures to date have been posted for viewing. We're very heartened by the fact that there have already been over 25,000 views of the lectures in this year's series, and I would encourage any of you who are engaged in any sort of teaching activities to use these lectures as an educational resource to supplement your own curricula. All of the handouts will continue to be available on the website as well until the next offering of this course in the spring of 2014. For those of you who have been signing in each week for your CME credits, there is one final thing we have to ask you to do, and that's to complete an online questionnaire, which will be hitting your inboxes later this week. Once we have that in place, we'll be submitting all of the registrations to the CME office to make sure that you get your credits and a certificate for participating in the course. For everyone, we also ask that you complete a brief online survey and let us know what you liked about the course, what not so much, any comments that you have about specific lectures, and any suggestions that you have for improvement. Basically, any constructive criticism that you have regarding any aspect of the course would be welcomed. The survey invitations will go out to everyone on the course mailing list at the end of today's lecture. Please rest assured, all of the survey responses are collected anonymously.
We read each and every one of the survey responses we get back, and the changes that we make based on feedback from all of you are really a key part of why this course is successful from year to year. This will take you less than five minutes to complete, so I would very much appreciate your time in completing the survey once you receive the invitation email today. So just in closing, I really hope that you found the course both interesting and informative, and Tyra and I really encourage all of you to apply the concepts and the methods that we've presented to you over the last 13 weeks to your own research interests. So thank you for your participation and support, and we look forward to seeing all of you again in the spring of 2014 for Current Topics 2014. So, to today's lecture: it's my pleasure to introduce to you Dr. Julie Segre, who is a senior investigator in the NHGRI Intramural Research Program. Dr. Segre's main focus is on the body's largest organ, namely the skin. Over the years, her research program has provided us with a great deal of insight about the genetic pathways that are involved in building and in repairing the skin. Now obviously, the skin provides a critical barrier to invasion by microbes, but at the same time it also provides a major home to them. And through her lab's work and her work as part of the Human Microbiome Project, Julie's efforts continue to provide us with new insights into how the bacteria that constitute the skin microbiome contribute both to chronic skin disorders, such as psoriasis and eczema, and to overall human health. So given her role as a thought leader in the field, I'm quite happy that Julie could join us this morning to share with us her perspectives on the genomics of microbiomes and microbes. So please join me in welcoming today's speaker, Dr. Julie Segre.
Okay, thank you. So I'm going to just launch right in.
So, no financial disclosures, which is one of the great benefits of being an NIH employee; it makes all that a lot easier. I've actually been involved in the Human Genome Project for 20 years now. And this part, the Human Microbiome Project, really started about five years ago, when we started thinking about the fact that humans are really superorganisms, and that what's contributing to health and disease is the chromosomes that encode the human, but also the multiple microbes that live in and on our body, including fungi, bacteria, viruses, and archaea. Now, on those 23 human chromosomes that you've heard so much about, there are 25,000 genes, with lots of alternative splicing. But in fact, each human cell has more or less the same gene-encoding potential, whereas the microbes that live in and on our body actually have quite varied protein-coding potential. And as you could imagine, the bacteria that live in your gut are quite different from those that live on your skin. As well, when we start to think about some disorders, like allergic disorders, asthma, and hay fever, that have really increased in the last 20 to 30 years, it can't be our genetic material that has changed in that short a time span. So it's something about the gene-environment interaction. We use that phrase a lot, but what would really be integrating this gene-environment interaction? One of the ideas of this project is that perhaps the gene-environment interaction is really being integrated by the microbes that live together with the human cells. And as you could imagine, although our human DNA is evolving in a very slow way, the microbes that live on us could be evolving more rapidly as we integrate antibiotics into common human health. Which is to say that my mother probably didn't take antibiotics very much because, growing up as a kid during the war, they weren't very available. But I took antibiotics as a kid. My children take antibiotics as kids.
So are we going through a bottleneck where we are actually changing the microbial diversity that lives on our human body? Part of this is really just to set a baseline and understand what the microbes are that live on us. How do they change during disease states? And how does that integrate with human health? So this is part of the Human Microbiome Project, a large project that is part of the NIH Common Fund, born as the NIH Roadmap. The goal of this project is to assess the microbial diversity of 250 healthy individuals at five sites and to make all of this data publicly available. The data has now been collected. The papers are under review. But the data is already publicly available, although some of the data about the clinical features of the patients is in controlled access. The DNA sequencing of the microbes is in open access and would be freely available for anyone wanting to use it as controls for their own experiments. I'm sorry. And the five sites that are being studied are the gut, the nasal passage, the oral cavity, the vagina, and the skin. In several cases, as in the oral cavity, there are multiple sites being sampled so that you could compare the left cheek versus the right cheek of the mouth, or the left arm versus the right arm, and understand what the variation is between individuals and between sites in an individual. So here the goal is also to sequence bacterial reference genomes. The first paper, on the first 180 bacterial reference genomes, has been published. And here it's really to expand the repertoire of bacteria that have been sequenced. Predominantly, the bacteria that have been sequenced are ones that are involved in disease. So the MRSAs that have been sequenced are the ones that are circulating in hospitals, and likewise the MRSEs, the Staph epidermidis that have been circulating in hospitals. But what are the bacteria that are part of the normal, healthy human microbiome?
And part of the reason for doing these bacterial reference genomes is really to enable metagenomics, which I'll get to as the final topic of today. Metagenomics is the analysis of the combined coding potential of a mixed population. So imagine that a spaceship comes down in New York City on a crowded street where people are crossing, and basically sucks them up and takes the DNA from all of them simultaneously. That's what metagenomics is. You would just be scraping your skin and sequencing everything, and then you try to sort out which sequence came from which genome. So instead of sequencing microbe by microbe, we'd like to eventually sequence all together the entire gene-encoding potential of the bacterial community, because there is such an interaction between the bacteria in how they control and interact with each other. The goal is also to look at the correlation of changes in the microbial communities with disease states. There are some classic projects being looked at here: inflammatory bowel disease, Crohn's disease, psoriasis, eczema. And really to understand what the relationship is between the onset of these diseases and the bacterial communities. Now it's very interesting because what we're seeing now is that it's not just that the bacteria of the gut control gut disease. In fact, they're seeing that enzymes produced in the gut are having an effect on coronary heart disease, because drugs that are used to treat coronary heart disease, or even products that contribute to coronary heart disease, are metabolized by bacteria that inhabit the gut. Similarly, the oral cavity seems to have an effect on multiple systems of the body. So we're beginning to understand that the immune system is educated by the microbes, but also that faraway sites are being affected by the microbes. As well, as with the Human Genome Project, we're also exploring the ethical, legal, and social implications of this field of research.
So for example, it's really unclear how probiotics will be regulated in this country. That is to say that right now probiotics can be used as drugs if they go through the full IND process of a drug, but it's very hard to do that. On the other hand, most of the things that you'd buy in Whole Foods are actually being regulated just as natural products or food additives, and therefore there's not the same level of clinical scrutiny of their manufacturing and efficacy. So that's one of the things that we're looking at as part of the project: how to really regulate probiotics. Also, how can we change the impressions of people in this country? There are these nasty bugs out there that you certainly would want to avoid, but the language of warfare is not always applicable here. There are lots of healthy bacteria, and our goal should not be to just kill them all. And it's interesting because, again, people are very interested in the probiotics that go into their gut, and then they want to sterilize their exterior using all these hand sanitizer products, which have their role. But we can't lose sight of the fact that hand washing has been extremely effective and has really been tested over the years. So how did this project start? Well, microbial diversity had been studied in the environment for decades before the Human Microbiome Project started. These are just some early articles about microbial diversity studied in the environment, where in this example they take samples at all these different spots in the Sargasso Sea and look at what the microbial diversity is. And what they found here was that by DNA sequencing, you could recover a much greater diversity of microbes than you could by culturing.
And this has been well known, and also suspected in human studies. We call it the great plate count anomaly: you can see when you take a sample that there's a great diversity of bacteria, but when you try to grow them up, they're really not as diverse as what you could see in the original sample, because certain bacteria are really lab weeds. For the skin, it's really the Staphylococcus that just grow tremendously well when you put them on agar plates and let them grow as individuals. And these environmental samples have been studied in many places. That one was studying different places in the Sargasso Sea, but this is another one, which is a saline mat, where they sample at all these different sites as you go down in depth through the mat, and they're looking then at what biological functions are being performed. These sorts of things are also being done in the ocean at different depths. And what you can see is the effect of sunlight, because you go from much more photosynthesis to less, and with those kinds of processes you can really see the extent to which the bacteria are responding to the environment. OK, so there's the environment of the oceans and the sea, but there's also the environment in which humans live. We rarely think about this, but this was an experiment that Norm Pace's lab did, where they went and tested shower heads all over the US and looked to see what was in them. Now, this is an example of something where you think you're standing in the shower and you are getting clean. And in fact, you are, but people very rarely change the shower head in their home. And it turns out that that is a moist, warm environment that we are creating in our homes that really has a great potential to grow bacteria. So part of this is also to think about where the environmental point sources are. And it turns out that there are lots of bacteria that live in your shower head.
And some of the bacteria are dependent on what type of shower head you have. It doesn't necessarily mean that you should run off and change your shower head, although I've often meant to do that after reading this article. So how do we look at bacterial diversity? I mean, there's the culturing. But now we've entered this realm where we think that the DNA sequencer is a very powerful microscope that can tell us what bacteria are in a certain place. So the way that we look at bacterial diversity is by sequencing the 16S rRNA gene. Ribosomes, as I'm sure you all know, are made up of proteins and these ribosomal RNAs that guide the tRNAs through. A ribosome is actually 70% ribosomal RNA. The crystal structure has been solved, and that's been really one of the beautiful works of biology. But these 16S ribosomal RNAs are RNAs, which means that they're not translated into proteins, but they have a lot of secondary structure because they're part of this ribosome. And the 16S rRNA gene has been used as a signature for bacterial genes for a very long time, because there are regions that are more highly conserved because they have to form these stems, and there are regions that are less conserved because they form these loops. And in fact, there is a phylogenetic distinction, where all the Firmicutes are more like each other, and then all the Staphs are more like each other, and all the Streptococcus are more like each other. You can use the 16S sequence similarity to go from the phylum, to the order, to the family, to the genus, to the species level. And so it's the 16S gene that Carl Woese and then Norm Pace really developed as a molecular signature for bacterial diversity. I'm going to spend the first part of this talk talking about how we use the 16S gene for characterizing bacterial diversity. So here's another display of that same 16S gene. And what you can see here is that, again, I've laid out the 16S gene.
It's 1,500 base pairs. And I've laid it out so you can now see the sequence similarity of these different regions. These regions up here are those stem regions that have to be highly conserved. They probably interact with the tRNAs, and you really just don't have much wiggle room on those. But then there are other regions, and you can see that they vary: some are very highly diverse and some are less diverse. Now, in fact, we use different regions to get different levels of specificity, in that these most diverse regions are sometimes hard to use if you want to get to the level of phylum or something, because they are so variable. And I'll kind of go through that. But the first thing is that sometimes you just want to know how much bacteria is there. And so quantitative PCR primers have been designed in fairly conserved regions, where you can put Ns in so that you capture most bacteria, and then you do a qPCR and figure out, for example, do these mice have a greater bacterial load than those mice? Or does this site on the human, like the oily sites of the human skin, have a greater bacterial load than the dry sites? So actually, this needs to be standardized. For example, this is how we calculate the bacterial load. We took bacterial DNA, and using Avogadro's number, we figured out how many picograms of bacterial DNA we were putting in, and then spiked that with an increasing amount of human DNA, because when you get samples, some of those samples have a significant amount of human DNA in them. And then we did the qPCR curve, where as you decrease the amount of bacterial DNA by 10-fold, you're increasing the number of cycles by about three, and by about three again. And from this, we can calculate what the bacterial load is. So for the skin, which as Dr. Baxevanis mentioned is my lab's area of focus, we were wondering how much bacteria we were getting by the different methods.
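The standard-curve arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not the lab's actual pipeline: the 650 g/mol average mass per base pair is a common approximation, and the 2.5 Mb genome size and the Ct values in the dilution series are hypothetical numbers chosen so that each 10-fold dilution costs roughly 3.3 cycles (perfect doubling would give log2(10), about 3.32).

```python
AVOGADRO = 6.022e23          # molecules per mole
BP_MASS_G_PER_MOL = 650.0    # assumed average mass of one double-stranded base pair

def genome_copies(dna_picograms, genome_size_bp):
    """Convert a mass of double-stranded DNA into a number of genome copies."""
    grams = dna_picograms * 1e-12
    return grams * AVOGADRO / (genome_size_bp * BP_MASS_G_PER_MOL)

def fit_standard_curve(log10_copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def copies_from_ct(ct, slope, intercept):
    """Read an unknown sample's copy number off the fitted standard curve."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical 10-fold dilution series: (log10 copies, observed Ct).
standards = [(6, 15.0), (5, 18.3), (4, 21.6), (3, 25.0)]
slope, intercept = fit_standard_curve([s[0] for s in standards],
                                      [s[1] for s in standards])
efficiency = 10 ** (-1 / slope) - 1      # 1.0 would mean perfect doubling per cycle

copies_per_pg = genome_copies(1.0, 2_500_000)   # hypothetical 2.5 Mb genome
unknown = copies_from_ct(20.0, slope, intercept)
print(round(slope, 2), round(efficiency, 2), round(copies_per_pg), round(unknown))
```

An unknown sample's Ct is read off the fitted line with `copies_from_ct`; dividing the result by the swabbed area would give bacteria per square centimeter.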
So with this, we were able to calculate that when you swab someone's skin with, like, a Q-tip, you can release 10,000 bacteria per square centimeter, whereas if you scrape the skin, removing those white things that form the dust bunnies in some people's bedrooms, that would yield 50,000 bacteria. But if you use a biopsy, you'll actually get a million bacteria per square centimeter. And that's because the bacteria don't just live superficially. They in fact live very deep in the skin, in the hair follicles and the sweat glands. So you can get more, although you don't need to, because we can get most of our answers with a subsample of the bacteria. So one of the questions that's really been emerging is, how do you study microbial diversity? The earliest studies would take the 16S DNA, amplify it, and then do a fingerprint. And based on the number of fragments, they'd calculate how many different types of 16S gene they had. But that's really limited by the resolution of a gel. So that's the cheapest method, but it's very limited in resolution. The PhyloChip or the GeoChip are kind of like microarrays, where they have the different 16S sequences laid out on a slide. And you can use that to say, I have this much of the Staph epidermidis, this much of the Strep agalactiae, this much of the Strep pyogenes. And I think that this not only has a role right now, but it has a continuing role, in that the analysis of these types of microarrays is often much easier than sequence analysis. But the problem is that with any of these microarrays, you are only going to find what you already know is there. And you'll never find the unique species.
So what we need in order to build these chips is a very good reference library of all the possible bacteria that could be found at this site, so that you can interrogate that, because you'd hate it if the population you were looking at had some unique bacteria and you were just not assaying it. So that brings us to sequencing, because sequencing is gene discovery. This is how you can find a full dynamic range and compare multiple complex samples. For a small study, the sequencing may be limiting, but for a large study, and I would actually even say for a small study also, the bioinformatics becomes limiting as you go through this. Most of the programs that I'll talk about for sequence analysis do require you to dive right in and do some command line programming, or at least run them on the command line, and to have some understanding of what the issues associated with your sequencing may be. So this is an example of PhyloChip, just to show you how this type of data can be used if this is the type of experiment you want to do. This was looking at the intestinal microbiota in the first year of life of children. On the x-axis is days, and on the y-axis is the percent of sequences, a relative abundance, that belong to these classes of bacteria. And the punchline of the story was that there's great diversity between infants and between time points, with these sort of spikes that you can see, these blooms here, and that that is part of the normal process. For infants, as you can imagine, the gut makes major shifts as they go from breast milk to cereal to eating a diverse diet, much as their skin microbes do as they go from always being held, to being seated, to then crawling around and exploring their environment. Also, as infants they have these rolls of fat, and then their skin kind of changes as they start running around and become leaner.
So something like PhyloChip can give you this overall perspective of what the microbial diversity is over time and how stable it is. But if we want to get to sequencing, it again becomes this issue of where we are going to put our primers. You want the primers to have the specificity to amplify as many of the types of bacteria as possible, accepting the fact that there will always be some amplification bias with any primers. You put the primers into conserved regions, and then the phylogeny is determined by the variable regions. And the size of these amplicons is really, again, technology limited. If you were using Sanger sequencing, amplifying the 16S gene, doing a ligation, and having it Sanger sequenced, then what many of the early studies did was amplify the full-length 16S. But most projects, if not all projects, have now switched over to pyrosequencing. The Human Microbiome Project has been using the V6-V9 region. There's actually a V3-V5 primer set. And also, from the 5' end of the gene, which is not shown here, there's actually a conserved region here that reads into the V3 region. One of the things I would say is that there is a fair amount of variability. If you take the same sample and amplify it with the V1-V3 primers at the 5' end, and take that same sample and go for the middle region, and go for the end, the results aren't going to be exactly the same. Some of those primers are better at amplifying Firmicutes. Some of them are better at amplifying the Streptococcus. Some are better for Lactobacillus. And so really, the region that you pick is often driven by what type of bacterial diversity you are expecting to see in your sample type. For example, for the skin we always use the V1-V3 primers because it's very important to us to get a very good handle on the Staphylococcus. That's just one of the important bacterial genera in skin diseases, including atopic dermatitis.
But people who are studying vaginal microbiota may use the V6-V9 region because that gives them better resolution. Also, for the oral cavity, they typically use the V1-V3 because they need to differentiate the different Streptococcus. So what region you use is often dependent on what body sites you're looking at. However, that does bring up an issue: if you are sequencing the V3-V5 region, you can't use as a reference someone who studied healthy controls but sequenced the V6-V9 region. You may find that there are great differences between your disease state and the healthy controls, but that really could be driven by primer choice rather than by the disease state. So these are some of the complications of these studies: we are still utilizing these kinds of sequencing techniques that clearly have biases. I sort of said this, but just to reiterate, there's Sanger sequencing. There's 454. And I would say keep your eye on Illumina. The Illumina read length of 70 to 100 base pairs has often been too short to really get that much specificity, but especially now as the MiSeq comes online and people get paired ends of two times 150, sequencing 150 base pairs from the V3 and 150 base pairs from the V1, you're getting in exactly the same range as the 400 base pairs of a Roche 454. As those read lengths go up to 200 base pairs, 250, you're going to be getting a much more powerful data set from Illumina at a much lower cost than you are from the Roche 454. And if Roche really does go up to 600 or 800 base pairs of DNA in an amplicon, it's not clear whether that additional sequence would really be useful and important for sequencing the 16S gene, because you typically get enough resolution to get all the way to the species level with 400 base pairs.
And that's sort of laid out here in this reference, where they talk about the sequence length. If you just had 50 base pairs, you may not really be able to get even to the genus level, but you could get to the phylum level. So it really depends on the sequence length, the primers that you could use, and what type of specificity you want for your bacterial diversity sequencing. Maybe it's enough for you to just know whether there has been a shift in the phylum or the class. And then you can think about using different sequencing platforms. Okay, so you've got a sequence. So you've got a 400 base pair sequence or you have 2,400 base pair sequences. Well, you can't just BLAST them anymore. And this is kind of frustrating to a number of people who work even in clinical microbiology labs. You used to be able to BLAST something and get it to match something. But by now, so much bacterial sequence has been dumped into BLAST that when we put in a sequence and try to say what it is, what we usually get is that it matches thousands of other sequences that our lab has deposited into GenBank that are uncultured. And it just comes back, uncultured skin bacteria. Well, that really doesn't actually help us that much. And so this is one example where more data, I mean it's great, but it has gone beyond what you can do with BLAST. Fortunately, there is a solution, which is that bacterial sequences do have their own classification systems, and I refer you here to these tutorials, which will actually take you through this. We have the Ribosomal Database Project, which is curated by Jim Cole's lab and contains approximately a million 16S sequences that have all been classified based on common microbiology taxonomy. And I do get into this because for clinical microbiologists, there are different taxonomic systems.
There's Bergey's, Euzéby, NCBI, Phil Hugenholtz's, but they're basically the same. You will find bacteria that have been reclassified or renamed, and you can look into all of that. But within RDP, you can find what the bacterial classification is. You can do things like find probes if you want to try to make either a little microarray or to do in situ hybridization. You can do SeqMatch. But basically by this point, you need to move into these kinds of bacterial programs because BLAST is now quite limited. This is just an example of the RDP pyrosequencing pipeline, where the data is processed and formatted, and then RDP will already give you some of the analysis tools that tell you, from a sample, how diverse is this sample? Have you sequenced enough to achieve saturation? And I'm going to kind of go through some of those. But it's one of those programs that will take you sort of soup to nuts. Okay, if you are dealing with human sequences, then for 16S, I would say that we very rarely, almost never, end up amplifying human DNA with those 16S primers. But it is something that you need to have a filter for, ethically, because we release all our data to the public. There's a distinction when we consent patients: we would put their microbial sequence in open access, but we wouldn't put their human DNA into open access. So just a shout out that one of the issues is to really think about that. So, at the level where I was saying that you could gain insights even from the highest level, this is an early paper from Jeff Gordon's lab looking at the difference between lean mice and obese mice. And these are genetically obese mice. They have a mutation in the leptin pathway. And what you can see is that even at this level, the shift is quite great. So what the Gordon lab can see here is that the, oops, oh that's not good.
The obese mice have an increase in the amount of Firmicutes and a decrease in the amount of Bacteroidetes. And it's really this kind of a wide sweep. Now actually, some of the most interesting things that are coming out in mouse studies are how having the mice in the same cage affects them, in that mice are coprophagic. They will eat each other's poop, and that actually kind of makes their microbiomes conform. So you can even see differences depending on whether mice are housed in the same cage. They are much more consistent than if they're in individual cages, and they will go towards a norm if they are in the same cage with each other. And you can actually even transfer the microbiome from an obese mouse to a lean mouse. On those sorts of things, actually, Richard Flavell was here two weeks ago talking about experiments that he's done with Jeff Gordon's lab on that topic. So again, it is very interesting: when you start to think about setting up mouse studies, you have to think about how you're housing the mice, because these microbiomes are not unique to the mouse. There is a community aspect to these microbes. Now in humans, I would say we are just at the beginning of studying this. There are these sort of small studies that report that a couple living together will start to conform, not to the same, but to a similar microbiome. And certainly twins, either monozygotic or dizygotic, have greater similarity than siblings. So we're just at the beginning of understanding how microbes are shared between people. With mice, there's a lot more sharing that goes on. The findings on obesity have also now been shown in humans. This was one of the studies that got a lot of press when it came out five years ago, looking at what bacteria live in the human gut.
And these are people who are put on a diet, and as they become leaner, the amount of Bacteroidetes increases and the amount of Firmicutes decreases. And that is correlated with changes in body weight. But Jeff is very clear on this, and so I want to be sure to say it: in terms of weight, the microbes are playing a role and are perhaps being educated and selected by the diet that you are picking, but there is still a tremendous role here for what your diet is and how many calories you are consuming. So it's not to say that microbes are the whole story if you're consuming many more calories than you need. Okay, so coming back to sequence analysis. This is another one of the dirty little secrets about bacterial sequence analysis, which is chimeras. When you're doing PCR to look for the 16S gene, and this would be true whether you were sequencing or going to a PhyloChip or any other means, you start by amplifying, say, a Staphylococcus, and the PCR cycle is over before you're done amplifying this 400 base pair product. Then when the next round of PCR starts, you have almost exact sequence identity at the 3' end of that product to match any other sequence in there. So this is what chimeras will look like, where you start by amplifying parent B and then you switch over and now you're amplifying parent A. The reason that chimeras are this dirty little secret and are so pernicious shows up when you're thinking about how diverse your sample is. We did an experiment where we took 20 known bacteria, mixed them together, and did the 16S PCR, and we generated thousands of unique bacterial sequences by generating chimeras between the sequences.
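To make the chimera problem above concrete, here is a toy Python sketch. It is not ChimeraSlayer's algorithm: the random 400-base "parents", the fixed midpoint split, and the `min_gain` cutoff are all assumptions for illustration, and real 16S sequences are far more similar to each other than random sequences are, which is what makes real detection hard.

```python
import random

def make_chimera(first_parent, second_parent, breakpoint):
    """Copy the first parent up to the breakpoint, then switch to the second."""
    return first_parent[:breakpoint] + second_parent[breakpoint:]

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def looks_chimeric(query, references, min_gain=0.05):
    """Flag a query whose two halves are explained better by two (possibly
    different) references than the whole query is by any single reference."""
    mid = len(query) // 2
    best_whole = max(identity(query, ref) for ref in references)
    best_left = max(identity(query[:mid], ref[:mid]) for ref in references)
    best_right = max(identity(query[mid:], ref[mid:]) for ref in references)
    best_split = (best_left * mid + best_right * (len(query) - mid)) / len(query)
    return best_split - best_whole >= min_gain

rng = random.Random(0)
LENGTH = 400  # amplicon length in bases
# Twenty hypothetical "known bacteria", here just random sequences.
parents = ["".join(rng.choice("ACGT") for _ in range(LENGTH)) for _ in range(20)]

# Starts as parent B (index 1), then switches to parent A (index 0).
chimera = make_chimera(parents[1], parents[0], 150)

# Upper bound on distinct two-parent chimeras: ordered pairs times breakpoints.
possible_chimeras = len(parents) * (len(parents) - 1) * (LENGTH - 1)
print(possible_chimeras)
print(looks_chimeric(chimera, parents), looks_chimeric(parents[0], parents))
```

The count is only an upper bound, since two chimeras coincide whenever the parents agree around the breakpoint, but it shows how 20 input sequences can plausibly turn into thousands of apparently unique ones.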
And so when you think about how diverse is my sample, we knew in that case we had only put 20 bacteria in, but because of the multiple ways that a chimera could be formed, either here or here or here, those each would be viewed as unique sequences. So you can't consider how diverse your sample is without correcting for chimeras. And we used these 20 bacterial DNAs that had all been mixed together as a training set to develop a chimera detection program. And the one that basically everyone is using now is called ChimeraSlayer. There were other programs earlier, such as Pintail and Bellerophon, but really this is the most well-tested program by now, and it really reduces the false sequences that you could otherwise generate with that kind of PCR amplification. So then you wanna figure out how many different bacterial species you have in this sample. And so you have to start binning the sequences, and you wanna start by doing an alignment of your sequences. But for these 16S sequences, we know a lot about the structure of a ribosomal RNA and we wanna use that information to generate an alignment. Many of the alignment programs that have been written for looking at DNA sequence are based on the assumption that the sequence should encode a protein, and that therefore you would penalize something that had a one base pair insertion or deletion because that would cause a frameshift. But what we know about the 16S gene is that there are gaps and that those gaps in the different regions might mean something different. So a gap in a stem actually should be penalized more than a gap in a loop, and indels may not be that different from a position where you have a base pair that doesn't match up.
So again, this is something that the Human Microbiome Project has worked hard on, to come up with a fixed-width character alignment format, NAST. And again, NAST was the original program that is specifically designed for aligning 16S sequences. NAST has now been changed or made better, and it's now called NAST-iEr, which again is from the Broad site. NAST is the original alignment based on this Ribosomal Database Project, the curated data set. The difference is that NAST-iEr now allows you, if you have paired-end sequencing and you don't have the middle region, to put Ns in the middle of your sequence without being penalized. You can still do that type of alignment, and it is aware that you could have a gap in your sequence. So what you have now is these sequences aligned, but now you wanna build a phylogenetic tree and you wanna calculate the branch length between each of these sequences and start to bin them. And for this, typically people are using ARB, which is based on the SILVA database, to build the phylogenetic tree. And so you'll end up with a parsimony-generated dendrogram. And then this tree is input into the next step, which is typically to define these taxonomic groups by sequence similarity. And actually now there's two programs. There's mothur and there's QIIME, and I'll get to QIIME. Either one of these is now this whole soup-to-nuts program of how to do everything from taking your bacterial sequences and bringing them through to an analysis and a visualization and a display tool. But mothur will take your sequences and, once you have that phylogenetic tree, it will group them based on which sequences are similar to each other. And you can set that similarity and you can say, I want the groups to be 99% identical. I want them to be 97% identical.
And again, it depends on what level of specificity you want to have. A lot of our projects will look at 99%. Other projects will look at 98, 97%. And it is a craft. It's like FACS sorting. If you say you want 99% similarity, you will have many more groups. And it's really about what level of specificity you are trying to do your analysis at. The other thing that you have to be aware of when you're forming these taxonomic groups is the nearest neighbor joining method versus the furthest neighbor joining method. And all of this is really documented and explained, but furthest neighbor means that any two sequences in a group have to be at least 99% identical to each other, whereas the other method means that you kind of pick a root sequence, and these two can be 99% identical and these two can be 99% identical, but then these two other sequences might be 98% identical to each other. How that becomes important is if you think about what the error rate of the sequencing instruments is. In other applications, sequence error is not as big of an issue because you are often doing an alignment and you're looking at multiple reads of the human genome, and you have like 50 reads of this region of the human genome, and you're then saying 25 of them are A's and 25 of them are C's, and you call the genotype as AC. But in our case, again, like I stressed with this issue of chimeras, each one of these sequences is being taken as its own sequence that is uniquely representing a single bacterium. So we don't have the same way of correcting sequencing errors. And probably the more realistic view of the human genome data would be, instead of 25 A's and 25 C's, you'd be getting 24 A's, 24 C's, one G and one T, and you ignore the G and the T. We don't have a reference by which to know whether this is improbable or not. There could be one bacterium out of 25 that actually does have a G at this position.
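The nearest versus furthest neighbor distinction can be made concrete with three sequences. The sketch below is a toy agglomerative clusterer, not mothur's implementation; the distance matrix is invented so that A and B are 99% identical, B and C are 99% identical, and A and C are only 98% identical.

```python
def cluster(dist, cutoff, linkage):
    """Merge clusters greedily while the linkage distance is within cutoff.

    linkage=min gives nearest neighbor (single linkage);
    linkage=max gives furthest neighbor (complete linkage).
    """
    clusters = [[i] for i in range(len(dist))]
    while True:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[x] for j in clusters[y])
                if d <= cutoff and (best is None or d < best[0]):
                    best = (d, x, y)
        if best is None:
            return clusters  # nothing left that may be merged
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

# Hypothetical pairwise distances (1 - identity) for sequences A, B, C.
dist = [
    [0.00, 0.01, 0.02],
    [0.01, 0.00, 0.01],
    [0.02, 0.01, 0.00],
]

nearest = cluster(dist, cutoff=0.01, linkage=min)
furthest = cluster(dist, cutoff=0.01, linkage=max)
print("nearest neighbor groups:", nearest)
print("furthest neighbor groups:", furthest)
```

With nearest neighbor, C is pulled in through B even though it is only 98% identical to A, so all three land in one group. With furthest neighbor, C stays on its own because every pair in a group must meet the 99% threshold.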
So again, this is where you have to kind of understand the data and start to think about the number of sequences. This is old Sanger data, I can see, because now we would no longer take it up to 140 sequences, it would be going to 3,000. But the idea remains the same: if you're looking for 100% identity, then is what you're measuring here sequencing errors, or is what you're measuring here actually bacterial diversity? So most experiments classify at 97 or 99% identity. Okay. Now when we do these types of analyses, there's really two different types of methodologies that we can use. And these are the most common. We look at community membership, which is known as the Jaccard index, and community structure, which is the theta. These are two different ways of looking at the data sets. I've diagrammed it out here pretending that we're talking about a fruit bowl. You've got these two fruit salads, and the first question is, what categories of fruit do they have in common? This would be like, do they all have Streptococcus? Do they all have Staphylococcus? I would say actually they're not that similar, because only two out of five of the categories are shared between the two bowls. But the other way of asking it is the community structure. If I took 100 pieces of fruit out of the first fruit bowl, how many of those 100 pieces of fruit would I find in the second bowl? And there the answer is about 90%. So the question is what's important? Is it community membership or is it community structure? If you think about two bacterial communities, whether or not they are similar probably has to do with what kind of protein encoding potential they have. And so you think about community structure.
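The fruit bowl comparison can be worked through numerically. This is a minimal sketch with invented fruit counts chosen to match the "two out of five" and "about 90%" figures; the talk only says "the theta", so treating it as the Yue and Clayton theta (the calculator mothur calls thetayc) is an assumption.

```python
def jaccard_similarity(a, b):
    """Community membership: shared categories / all categories observed."""
    present_a = {k for k, n in a.items() if n > 0}
    present_b = {k for k, n in b.items() if n > 0}
    return len(present_a & present_b) / len(present_a | present_b)

def theta_yc_similarity(a, b):
    """Community structure (assumed Yue-Clayton theta) on relative abundances."""
    total_a, total_b = sum(a.values()), sum(b.values())
    keys = set(a) | set(b)
    pa = {k: a.get(k, 0) / total_a for k in keys}
    pb = {k: b.get(k, 0) / total_b for k in keys}
    cross = sum(pa[k] * pb[k] for k in keys)
    return cross / (sum(p * p for p in pa.values())
                    + sum(p * p for p in pb.values()) - cross)

# Hypothetical bowls: five fruit categories in total, two of them shared.
bowl_1 = {"apple": 50, "orange": 40, "grape": 10}
bowl_2 = {"apple": 50, "orange": 40, "kiwi": 5, "pear": 5}

membership = jaccard_similarity(bowl_1, bowl_2)
structure = theta_yc_similarity(bowl_1, bowl_2)
print(f"membership (Jaccard): {membership:.2f}")
print(f"structure (theta):    {structure:.2f}")
```

The same two bowls score low on membership and high on structure, because the shared categories are also the abundant ones. Note that mothur reports these calculators as distances (one minus the similarity), so its output would be the complement of the values here.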
But if what you wanna say is, does this bacterial community have the potential to bloom, and maybe there's some infectious bacteria that certain people are susceptible to and other people aren't, then you're worried about community membership, because it could be a bacterium that's there at very low levels but under other circumstances could bloom and suddenly you have a Staph aureus infection. So that's what you wanna look at. So we typically calculate both of these, and they can be used for very different types of questions. This is an example of using community membership, where this data is again looking at those obese mice. It's laying out the genotype of the animal, but then also which mother it is from. So M11, M12 and M13, those are the pups of mother one and these are the pups of, oh sorry, that's too small even for me to read. The pups of mother three and mother three cluster together, the pups of mother one and mother one cluster together, and mother one and mother three are sisters. So what you see here is that at the level of community membership, pups are most like their mothers, and the next step is that they are most like their cousins if their mothers are sisters. And so this is saying that, at least in mice, microbes are inherited from the mother. But when we looked at a mouse mutant, what we saw with community structure, and you saw this also with the ob/ob mice versus the wild type mice, was that they are most like other mice of the same genotype, because what bacteria they have may be defined by their mother, but the proportions of those bacteria are then defined by their genotypes. So that's why community membership and structure can give you different types of information. And UniFrac is like mothur; it is part of QIIME now.
I would say these are two of the most highly used methods, UniFrac and mothur, for generating these same kinds of data. UniFrac does the same thing, calculating the branch length and giving you statistical analysis, and it generates the same kind of data. Again, I always think it's useful to use two independent methods to look at your data sets, because if you see something being statistically significant with one method and with another method, you have additional confidence that you're really looking at something real. How much diversity is there in the population? Here we calculate these rarefaction curves where we just say, how many OTUs am I seeing as I add additional sequences, and how many would I predict? And again, what we're seeing here is that that's very dependent on the body site. As you delve into it, we have measures for all of these things. These have pretty much been developed from environmental sequencing. So the richness is the number of OTUs or species. The evenness is the distribution: do you have 91, 1, 1, or do you have 10, 10, 10, 10, 10? And then there's Shannon diversity, which is pretty much what people put out there. The Shannon diversity index takes into account the richness and the evenness. So all of these are the kinds of ways that you would characterize the community structure beyond even saying what are the bacteria, and that people use to compare. So this is an example of work from my own lab where we looked at a survey of what are the bacteria on the different parts of the human skin. And what you can see here, these are those plots from the RDP where each of the bacteria are just classified at the phylum and then genus level. And what you can see is the blue sites here are the oily sites, and they have a high preponderance of these Propionibacterium, which are lipophilic bacteria. The moist sites, these creases, have a lot of the Corynebacterium and also in some cases have a lot of the Staphylococcus.
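The richness, evenness, and Shannon measures mentioned above can be computed directly from OTU counts. A minimal sketch follows, using the natural-log Shannon index and Pielou's evenness (Shannon divided by the log of richness); the choice of Pielou's evenness and the two count vectors, modelled on the "91, 1, 1" versus "10, 10, 10" contrast, are assumptions for illustration.

```python
import math

def richness(counts):
    """Number of OTUs actually observed."""
    return sum(1 for n in counts if n > 0)

def shannon(counts):
    """Shannon diversity index H = -sum(p * ln p) over observed OTUs."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def pielou_evenness(counts):
    """H divided by its maximum possible value, ln(richness)."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0

# Two hypothetical communities of 100 sequences and 10 OTUs each.
dominated = [91, 1, 1, 1, 1, 1, 1, 1, 1, 1]
even = [10] * 10

for name, counts in (("dominated", dominated), ("even", even)):
    print(f"{name}: richness={richness(counts)} "
          f"shannon={shannon(counts):.3f} evenness={pielou_evenness(counts):.3f}")
```

Both communities have the same richness, so richness alone cannot tell them apart; the Shannon index and the evenness drop sharply for the community dominated by a single OTU.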
And the dry sites, which are typically, well, the buttocks and the arms, those actually have the greatest diversity. So there's a lot of different ways that you can look at all these communities. This is probably the easier way to see that now. If I'm showing you four different healthy volunteers, you can see again that all four of them have a lot of Propionibacterium on their back. And the antecubital crease is the bend of the elbow. So you can see that these people's backs are more similar to each other than their back is to their arm. And then really what we see from this is that the ecology of the site is dominating what bacteria live there rather than the individual. So the bacteria are responding to, is this an oily site, is this a dry site, more than who is this that I'm living on? Because in general humans provide many different microbiological niches for bacteria. This is an early analysis of the different habitats. And the colors here, the greens up here, these are all sites from the oral cavity. These are all from the gut. And then these are different sites from the skin, including hairy sites and inside the ear. And there is wider diversity than you see for the oral cavity and for the gut. But what you see here is basically again that at a higher level it is the body site that is defining what bacteria live there. So stay tuned for further insights from the Human Microbiome Project and further tool development. That was talking a lot about bacterial diversity. We're also doing work to look at fungal diversity. Fungi are eukaryotes that have an 18S and an ITS, the internal transcribed spacer. And we're using QIIME to adapt the same methods that we used for bacterial sequences to look at fungi, because there's probably a tremendous amount of relationship between the bacteria and the fungi. So now I'm gonna talk about sequencing bacterial genomes, and this sort of goes hand in hand.
But again I would say that these sequencing instruments that are coming online, that are present in many of the sequencing centers and may also come into people's labs soon, are really ideal for sequencing microbial genomes. Microbial genomes are about three to six million base pairs. And the type of data that you get from a HiSeq or from a Roche is really perfect. You get a lot of depth of coverage and they're fairly affordable to do now. So what you get is these short reads, and I know other people have talked about that. And you then align these reads into contigs. There are also ways that you can get paired-end reads, and depending on the size of the paired-end read you can bridge these contigs. There are several different assemblers that are used for assembling bacterial genomes. This is also fairly out of the box; we just use standard assemblers. And it really is that you can give your DNA, make a library, and for many of these instruments you will now get back large contigs of DNA sequence. You don't get back a finished total reference genome. Velvet is another one that we've used. So you do have to think about what level of coverage you need, so how big the contigs are gonna be and how many of them there are. And you have to start thinking about that because you kind of need to pilot that. For a six megabase genome you can make these calculations of how much sequence you would need, but there are things that break contigs. Like any time you have a ribosomal RNA operon, that will break the contig because there are many copies of that in a genome. If you start on this contig and you hit the 16S or the rRNA operon, and there's five copies of that, then you don't know which is the next contig that you're going into. And transposons, phage insertions. There's things like that that will just break contigs.
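The "how much sequence would you need" calculation for a six megabase genome can be sketched as simple arithmetic. The 100 bp read length is an assumption (the talk gives none), the 30-fold target is borrowed from the depth mentioned for the Staphylococcus genome, and the uncovered-fraction estimate uses the idealized Lander-Waterman assumption of uniformly random reads, which repeats such as rRNA operons violate.

```python
import math

def reads_needed(genome_size: int, read_length: int, coverage: float) -> int:
    """Reads required so that reads * read_length / genome_size = coverage."""
    return math.ceil(coverage * genome_size / read_length)

def expected_uncovered_fraction(coverage: float) -> float:
    """Lander-Waterman estimate: fraction of bases hit by no read."""
    return math.exp(-coverage)

genome_size = 6_000_000   # six megabase genome, as in the talk
read_length = 100         # assumed short-read length
target_coverage = 30      # 30-fold depth

n_reads = reads_needed(genome_size, read_length, target_coverage)
print(f"{n_reads:,} reads for {target_coverage}x coverage")
print(f"expected uncovered fraction: {expected_uncovered_fraction(target_coverage):.2e}")
```

Even when the random-coverage estimate says essentially every base is covered, the assembly still comes back as multiple contigs, because multi-copy elements cannot be resolved by depth alone.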
But we also can use that information to try to generate information about what is the genome that we're looking at. So here's a Staphylococcus genome that we sequenced in the lab. And what you can see here, this is the contig length, and you can see that most of the contigs are here, and I'm sorry I can't even read this, but this is sequenced at about 30-fold depth. And you can see most of our contigs are this size. We don't really trust anything that's a tiny contig, but what you can see is that over here this sequence, these three that all assemble, is present at two times the amount of the other parts of the DNA. This is found at five times, and these are plasmids that are high copy. And that gives you an idea that these encode either repetitive sequences or plasmids that are in the genome. One of the things that we have found as a field: people talk about Staph epidermidis or they talk about Streptococcus agalactiae, and different isolates will look indistinguishable morphologically and by traditional biochemical means. But then what you can see is that some of them have drug resistance to this or that, and they also sometimes have different invasive properties. And so what we find is that in fact there is a pan genome, where for something like Streptococcus agalactiae, and this project was based on sequencing, when you sequence like 11 or 12 of these genomes, what you see is that about 80% of the genes are found in every genome. This is looking at the number of core genes as you start sequencing, and you'll find that there are approximately 1,800 core genes. But in fact, each of these genomes has an additional 400 genes that aren't part of this core genome. Those are a sort of random mixture that are found in some of them but not others. That's called the flexible genome, the open genome, the variable genome.
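The core versus flexible genome idea reduces to set operations once each isolate has a gene list. A minimal sketch follows; the three isolates and their gene contents are invented for illustration (real comparisons involve thousands of genes and need ortholog clustering first, which is skipped here).

```python
def pan_genome(genomes):
    """Split gene content into core (in every genome), pan (in any genome),
    and the flexible genes each genome carries beyond the core."""
    gene_sets = list(genomes.values())
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    flexible = {name: genes - core for name, genes in genomes.items()}
    return core, pan, flexible

# Hypothetical isolates of one species; gene labels are placeholders.
genomes = {
    "isolate_1": {"dnaA", "gyrB", "recA", "capsule_X"},
    "isolate_2": {"dnaA", "gyrB", "recA", "phage_Y", "tetM"},
    "isolate_3": {"dnaA", "gyrB", "recA", "tetM"},
}

core, pan, flexible = pan_genome(genomes)
print("core genome:", sorted(core))
print("pan genome size:", len(pan))
for name, genes in flexible.items():
    print(f"{name} flexible genes:", sorted(genes))
```

As more isolates are added, the core can only shrink or stay the same while the pan genome can only grow, which is why the core gene count in the talk levels off as more genomes are sequenced.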
But what we basically are seeing is that of course we talk about bacteria as species, but they're not engaging in sex. They can switch around their genetic information with recombination, horizontal gene transfer. They don't have the same constraints. So there are bacteria that are extremely similar, but there also are a lot of bacteria that have a core genome and then have a flexible genome. And that actually starts to get at why perhaps some strains are more invasive, some strains are able to metabolize this or live in the bloodstream versus the urinary tract. So that tells us about the genome structure, but also by sequencing the bacterial genomes we can pick up mutations that have occurred, either as insertions, deletions or point mutations. This is a project that we did in my lab for the NIH Clinical Center where we had three different bacterial strains of an Acinetobacter that were circulating in the hospital, and we wanted to understand what is the phylogenetic relationship. So we sequenced these three genomes, and exactly as I told you, they formed contigs, and now we're using Marco Marra's program Circos, which was actually designed for cancer genome sequencing, but I can tell you it's perfect for microbial genomes, which actually are circles. These are three different strains that were in the NIH Clinical Center in 2007. We're looking at strain A and we're saying, anytime that strain A is different from the reference to which we aligned it, you code that SNP blue. Anytime B is different from A you code it as red. Anytime C is different from B or different from A you code it as green. And what you can see is that these three genomes are different, but they're different because there have been these regions of homologous recombination.
There are also SNPs that distinguish them, but really what was confusing to us was that there are these blocks, which you can now see are blocks of homologous recombination when you line them up like that, and that allowed us to then understand the phylogenetic relationship, that B and C were more closely related to each other than they were to A. And if you look at these blocks of recombination, this is exactly what you're seeing. This is again Marco Marra's program, where you're seeing it now as a pinwheel. So these genes here from A are identical and found in B and found in C, but in fact these genes in the middle region, this is defining exactly the block of recombination. These genes are each unique to A, B and C, and these encode actually the O-antigen biosynthetic locus; the O-antigen is on the outside of the cell and is used for detection by the human immune system. So this is an example where you really get down into the nitty gritty of understanding the different bacteria. So I've talked about bacteria, I've talked about fungi. I wanna give a two minute shout out to viruses, because I think this is gonna be another really important area of microbial genomics, but it's the hardest one, which is to find novel viruses or even to understand what is the viral diversity that lives on us. And it's the hardest because you really have to do de novo sequencing, and there's RNA viruses, DNA viruses, and you have to think about what you are gonna use as your control. But I wanted to show this as one example where they used genomic sequencing to identify a novel virus. This was a case of someone who was an organ donor, and then the three people who received organs from this person, who of course are now immune suppressed because they've just received an organ transplant, all died within a month from this fever and this sepsis.
And so the question is, was there an underlying viral origin in these tissues? So what you can do is these can all be sequenced independently, but you're really looking here for the needle in the haystack. You're looking for something that you see in these sequences that you never see in the human genome. This ended up finding a novel arenavirus. This type of method has also been used for finding the virus in Merkel cell carcinoma, but for viruses, this is really an area that needs a lot of sequence being thrown at it. So coming back even to the regulatory issue of how we are gonna keep a healthy microbiome. Sequencing is just the start. If you wanna talk about a microbe being associated with a disease, then historically you should satisfy Koch's postulates: that the microorganism is found in abundance in organisms suffering from the disease but not in healthy animals, and that you should be able to isolate it in culture and then transfer it into a healthy organism and recreate the disease. It's not clear to me, when we now start to think about the microbiome, that it is going to be individual organisms. It may really be that you get an introduction of something like vancomycin-resistant enterococci, you know, the VRE, but that is only pathogenic in the context of limited microbial diversity of the gut, and that perhaps if there is a VRE but there's also sufficient amounts of the commensal Bacteroidetes, they would keep that VRE in check. So that makes it difficult for us to move forward. I mean, the sequencing is all about generating hypotheses, but then thinking about how we're going to test them becomes complicated because we may not be able to satisfy what are the original tenets. In the last few minutes, I'm just gonna talk about what is the most complex part of what we do, which is metagenomics.
So again, I was saying this about the spaceship coming down and sequencing the DNA from all those people in the middle of the crosswalk in New York, but that's really what we would like to ultimately achieve, which is to understand who are all the players together, which would get us all the bacterial, fungal, viral, archaeal DNA all together. In some cases, you'd probably also generate human DNA because the bacteria live in such intricate association with the human. You will end up with a very complex mixture, and the computational analysis is very complex. So what do I mean by that? With metagenomics, and we sort of talked about this in the context of the pan genome, you could imagine that you'd be looking at two different populations and that you'd see the pink, the green and the blue, but it's really about getting at the level of, the green gene is enriched and the pink gene is reduced in this population. And you wouldn't get this by looking at 16S, because maybe these are all the same type of bacteria, but within that type of bacteria they have diversity. So for example, when I look at 16S sequences, I can't tell you if this is a methicillin-resistant Staph epidermidis or a methicillin-sensitive Staph epidermidis. So I would need to do this kind of metagenomic sequencing to understand what is really in those genomes. But oftentimes the sequences will then be discontinuous. So in humans, there are the first studies of metagenomics. There's been a lot of metagenomic sequencing also from this group MetaHIT, and it's generated a lot of controversy about how many different types of gut microbiomes there are. Are there two? Are there seven? Are there eight? How many different vaginal types are there? Are there five? Are there three?
And what we're getting at here is, what is the diversity that constitutes normal and what is going to constitute dysbiosis or deviations from the norm? And there's a lot of room for argument here because we have not yet solidified how we will analyze these rich complex microbiome sets. So the tools don't yet exist to catalog and comprehend microbiome data. These are from the human gut microbiome. And what's really kind of sad about this is that from this rich data set you can look at what bacterial phyla are present or what KEGG or COG terms are present, and they look fairly similar. But in fact, you're taking a very detailed information set and you're reducing it to 20 or so categories. And that may not be the level of resolution that you need to really understand what are the differences between these two bacterial communities. But it's really hard to know at what altitude you need to be looking at this kind of metagenomic data. Metagenomic data has been very useful in these types of experiments where you're looking for new metabolic enzymes. And this is in terms of what are the new energy sources that the world could be harnessing. So the two energy sources that have been examined from a metagenomic perspective are the termite hindgut, so how does it take wood and turn that into energy, and the cow rumen. What they do is they put this into the cow stomach and they incubate it, and then they look at what bacteria they find in that cow rumen after that food has been digested. And in these cases, they're actually getting to a level of specificity where they know that they're looking for certain classes of enzymes and they can find these with metagenomics. So these are some of the examples where metagenomics is now being used to find new enzymes that could be used in energy production.
But in terms of the human genome, we really still need some computational tools to think about, not if you're looking for one gene, but if you're looking at the whole classification, how would you really deal with metagenomic complexity? And that is just very much an open question. So that's my presentation for today. Thank you all for coming and for participating in the course series. It's really a pleasure for NHGRI to host this. Thank you. A couple of questions. Anyone with questions, come to the microphone, please. Thank you. Well, thank you for your comprehensive presentation. It looks like we are still in the process of discovering more of these variations and all their implications. And so since we are putting in all the money, is there any way we could get something back in terms of investment? I attended a talk on autoimmune disease. It was a beautiful talk showing us what the issues are. And then at the end the question came up that maybe our immune system is so aggressive that we need more challenges for it every day. So I asked, what is a good system? They said we need good parasites. I said, okay, what is a good parasite and where should we put them? In what body compartment: on the skin, in the gut or somewhere else? So since you're covering some of these areas, do you think we could get some mileage out of these sequences? Right, so I think that in terms of human disease, the first progress will be in the context of using the microbiome as biomarkers. In the same way that a diabetic checks their blood sugar level, we would hope that a kid who has eczema would be able to check their skin microbial diversity and see when they were about to have a flare. So that's one possibility.
I think in the intensive care units, you would use a rectal swab to say, is this antibiotic basically doing something bad to the person's GI population, so that this person is now at increased risk for developing a VRE infection? More generally in terms of health, you're getting to the ideas of Stan Falkow and Marty Blaser, that we've gone through these bottlenecks and that we're not properly educating our immune system because of, first of all, the use of antibiotics in early life and the lack of understanding of how that may affect kids six months and a year later and 20 years later and 40 years later. There is something called the hygiene hypothesis, which draws on findings that kids in daycare or kids on farms have less allergic disease. But I think that's why we just really need a baseline to understand what the microbial diversity is now and whether we are messing with it. In the same way that my husband would love to see what Chicago looked like 500 years ago, I'd love to know what the microbiome looked like 100 years ago, before we started using antibiotics and also before the urbanization of our society. So we also use a lot of antibiotics after a lot of infections and other things. So what is a good way to repopulate the good bacteria in your gut? You know, I certainly have to encourage people to take antibiotics if you have strep throat. And I think Activia is a marketing genius, but it's not clear to me that it changes the microbial diversity of your gut if you're a normally healthy individual. And so I don't really have any comments other than, you know, eat a healthy diet, get exercise and don't smoke. Generalized health record. Yeah. Any other questions? All right, let's take a moment to thank Julie once again. Thank you.