Good morning, and let me begin by congratulating Eric Green and his colleagues on ten spectacular years of NISC. It is truly a pleasure to be here as part of this star-studded cast today. Had I known that we were actually coming to a birthday party with cake, I would have been even more enthusiastic about participating. What I would like to do in the little bit of time that I have this morning is give you an overview of what 12 years of effort in microbial genomics has provided in terms of our understanding of the invisible microbial world. I think it's safe to say that one of the landmark accomplishments in the microbial genomics field was this paper on the sequencing of the Haemophilus influenzae genome that came out of TIGR in 1995. With all due respect to Fred Blattner, who had already begun sequencing the E. coli genome, I think that one of the most important conclusions from this publication was that a whole-genome approach could in fact be used to study pieces of DNA much larger than those found in lambda or cosmid clones. I think back to when my colleagues and I were nearing completion of this project, which, I should say at the outset, despite some hubris we weren't entirely convinced we could pull off. We were sitting around talking about how wonderful it was going to be, with the completion of this approximately 2 million base pair chromosomal sequence, to understand for the first time the workings of all of the genes, and the proteins they encoded, of a free-living organism. We were absolutely convinced that once we finished the annotation of this bacterium, we would be able to place every predicted gene into the appropriate biochemical or metabolic pathway and for the first time really understand what life was all about.
Well, as you all know, nothing could have been further from the truth, and the biggest surprise was that between 30 and 35% of the predicted open reading frames encoded proteins of unknown function. This led us to choose Mycoplasma genitalium as a second organism for whole-genome analysis. It had been estimated to contain approximately 500 protein-coding genes, an estimate that turned out to be very, very close. We had decided that if we weren't able to place 1,800 genes fully in context, surely we'd be able to do that with 500 genes. Yet again, that pesky one-third of the genome appeared to encode proteins of unknown function. From that point forward we decided there was really no point trying to think, at least at that point in time, about fully understanding the workings of a free-living organism based on genome sequence alone. I think you will see from the rest of the talks today, and all one has to do is pick up any genome sequence paper, that we still have a long way to go. Those of us who are interested in fully understanding what all of this genome sequence means can hopefully be guaranteed gainful employment for many, many years to come, given how much there is still to understand. But that said, we carry on, and I think that each genome sequence project that we complete provides considerable additional information. This is a figure that I suspect almost everyone in the audience has seen. It shows, up until the year 2002, the growth of information in GenBank, plotted both as billions of base pairs of DNA sequence and as the number of sequences going into GenBank. Although this ends at 2002, this exponential growth in sequence information has continued, fueled in very large part by the completion of the human genome sequence and subsequent analysis of other larger eukaryotic genomes.
But I put this up to make the point that when we think about microbial genomics, the contribution of these organisms, which are typified by much smaller genomes, has perhaps been small in terms of total base pairs of DNA compared to what's now found in GenBank, yet the number of organisms from the microbial world that have been studied by genome sequence analysis is really quite substantial. These are data that I took from the Web yesterday, and they give you a compilation of completed or high-coverage draft genome sequences for bacteria, archaea, viruses, and various single-celled eukaryotes. You can see that these numbers are really quite substantial, and these efforts have provided us with a very, very different view of microbial life than we had prior to any genome sequence information. I think it's safe to say that over the past decade or so our view of comparative genomics has changed in microbial genomics, and this is also now, I think, a paradigm that is being followed with more complex eukaryotic organisms. Initially, in the microbial arena, there were many discussions about the appropriate number of microbial organisms, at the genus and species level, that should be targeted for sequence analysis. I remember participating in some of these very early discussions, where the number being put forward was somewhere on the order of a dozen or so. This, in part, reflected what was known at the time about the number of major bacterial divisions. It was also, I think, based on our ignorance about the diversity found in the microbial world. As we began to get into microbial genome sequencing projects, it became clear that this number of a dozen or so was a great underestimate if we were really going to use these large-scale approaches to understand microbial diversity.
Then the discussions moved to another level, where people talked about looking at differences between species using genome analysis. But I think at that point, which probably reflected some of the discussions in the late 1990s and early 2000s, there was still very little enthusiasm for taking genome analysis one level further and looking at differences between isolates. I think it was almost in part through some accidents, if you will, in terms of duplicative funding, that we were able to see that, at least in the microbial arena, there was still a tremendous amount of information to be revealed when you started to look at differences between isolates. So what I would like to do for the remainder of the talk is try to summarize for you what I think are some of the most important lessons that we've learned from 12 years of comparative microbial genomics. If I had to summarize this in one sentence, I would say without question that this information has changed many of the existing concepts that provided the foundation of microbiology before this work began. The first lesson is that many of the distinctions that we thought could be used to neatly separate organisms on the tree of life have begun to erode as we've accumulated genome sequence information from microbial species across both the bacterial and archaeal domains of life. We've also started to look more at various viruses and phage. Some of these distinctions that we felt were really hallmarks of microbiology, such as cell or genome size, cellular complexity, composition of lipids, and cellular organization, have been turned upside down. Just one example, which may at first glance seem somewhat trivial but has really come from comparative genomics, has to do with genome size.
If you look at what's summarized here, and this is a figure that came from a review that I wrote with Naomi Ward a couple of years ago, above the line are the sizes of the genome sequences that had been completed at that time from bacterial species. Below the line is everything else in the microbial world: archaea, viruses, and various single-celled eukaryotes. You can see that there really is no neat distinction. One cannot say that there is a cutoff in genome size that distinguishes one type of life from another. Another lesson that we have learned is that the microbial genome is a terribly dynamic entity that is shaped by multiple forces. This is something that we need to continually remind ourselves of, because with every complete genome sequence that's been reported, we have to remember that what we're looking at, for most of the past work, is really a snapshot of a clonal set of cells grown in culture in the laboratory or in association with other cells. This by no means reflects what goes on in the natural world, although one can be lulled into a sense of security, if you will, that we can in fact neatly define genomes based on genome sequence alone. Some of the forces that play a role in shaping microbial genomes are the following. The first is genome reduction. When we think about evolution in the microbial world, it doesn't always mean that organisms are becoming more complex. In fact, we have a number of examples to suggest that some of the organisms extant today are reduced forms of ancestral organisms. In many cases, this genome reduction represents loss of metabolic pathways and cell surface molecules, resulting in many, many examples of what we would call minimal genomes. One of the interesting conclusions from looking at these reduced genomes is that there appear to be as many possible routes to genome reduction as there are organisms that we study.
Many of these are organisms that apparently have evolved as endosymbionts or as parasites associated with hosts, and many of these minimal species appear to have undergone irreversible genome reduction. Based on our ability to make inferences from genome sequence information, it's hard to think that these are organisms that could now survive on their own. One other mechanism that I want to mention on this slide, and it will come up again in the next few slides in my talk, is the process of lateral gene transfer, whose impact in shaping microbial diversity I think we really underestimated prior to having genome sequence information. This was not a new idea that came along with the advent of microbial genomics. There were many well-studied examples, perhaps one of the best being the transfer of antibiotic resistance genes on plasmids. But we now have a very long list of examples of lateral gene transfer among organisms within the bacterial domain, between bacterial and archaeal organisms, and also between bacteria, archaea, and some eukaryotic organisms. This seems to be one of the more dominant forces in generating microbial diversity. It certainly appears to play a very important role in the evolution of new pathogens, or pathogens with differences in virulence and transmissibility. This leads us, I think, to consider rethinking how we define microbial species, given that there is so much lateral gene transfer going on. Bacterial taxonomy has a very long history. For the past 30 years or so, much of our definition of microbial species has been based on various molecular methods. Perhaps one of the most widely used is 16S ribosomal RNA classification, which came out of the work of Carl Woese and colleagues in the late 1970s.
But I think that even with these molecular approaches, there are some real limitations to our current thinking about bacterial taxonomy and classification. The first is that there are many examples where conflicts exist between phenotypic information and phylogenetic information. Bacterial species don't always exhibit the kind of phenotypic or genetic cohesiveness that we might expect of them from molecular classification schemes that are based on the sequence of one or a very limited number of genes. I think one of the major limitations right now is that there is no way to classify non-cultured bacteria under the current paradigm. As we know, if you go out into the natural world, most of what you find there in terms of microbial species cannot be successfully grown in culture in the laboratory. We may never get to that point, although there certainly has been some progress made; work from Steve Giovannoni's lab is one example. It may be that the reason a lot of these organisms elude our attempts to grow them in pure culture in the laboratory is that they are absolutely dependent on the neighbors that they coexist with in microbial communities in nature. This is just an example of what has been revealed, again, through molecular methods. This is a figure taken from a review by Phil Hugenholtz in 2003 showing what at the time were the major divisions of the bacterial and archaeal domains of life. The divisions shown here by the white bars are divisions for which no member has yet successfully been grown in culture. You can see that in total those represent nearly half of all of the major divisions of bacteria and archaea. Again, there is a lot that we still do not understand. So when we think about all of this in terms of the species definition, one of the things that we are wrestling with is the fact that genome sequence reveals to us metabolic potential.
But this is not the same as phenotype. If one goes back to Bergey's Manual as an example, which has really been the Bible, if you will, of microbiology for more than 100 years, one can find some real disconnects between what is listed for a given species in Bergey's Manual and what has been revealed so far based on genomic information. Comparative genomics at the level of different strains or different isolates has also revealed considerable genetic diversity in what we are calling a species. This is one of the first examples, and I think still one of the most striking, and it came out of work from Fred Blattner's lab in the late 90s and up until 2002-2003. This was looking at the genome sequence of three isolates of E. coli: the K-12 laboratory strain, which was the first E. coli genome sequence that was completed, and two clinical isolates that cause disease. One is an enterohemorrhagic strain, O157:H7, which has received a lot of attention in the press in the past few years, being associated with outbreaks of food poisoning, with spinach, et cetera. The other is an E. coli strain that causes urogenital infections. If you look at this Venn diagram in terms of gene content, you see in the middle that only about 40% of the genes among these three isolates of E. coli are shared. All of the rest are genes that are either found in two but not all three, or in many cases are unique to one of these three strains of E. coli. This was, as I say, one of the first examples. But this is, again, a theme that has started to recur over and over again, both with pathogens and non-pathogens, as we've had the benefit of comparative sequence information. These are just some of the organisms for which this same type of genetic diversity has been described. Many, but not all, are pathogens. Some of these are very important environmental organisms, like Thermotoga maritima.
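To make the Venn diagram comparison concrete, here is a minimal sketch of how the regions of a three-strain gene content comparison can be counted once each genome has been reduced to a set of gene identifiers. The function name `venn3_counts` and the toy gene identifiers are invented for illustration; they are not the E. coli data from the slide, and the sketch assumes that genes have already been matched across strains, which is the hard part in practice.

```python
def venn3_counts(a, b, c):
    """Count genes in each region of a three-set Venn diagram.

    a, b, c: iterables of gene identifiers, one per strain.
    Assumes identical identifiers mean "the same gene" across strains.
    """
    a, b, c = set(a), set(b), set(c)
    all_three = a & b & c
    in_at_least_two = (a & b) | (a & c) | (b & c)
    return {
        "all_three": len(all_three),
        "only_two": len(in_at_least_two) - len(all_three),
        "unique": len(a - b - c) + len(b - a - c) + len(c - a - b),
        "total": len(a | b | c),
    }


# Toy gene sets, hypothetical and far smaller than any real genome.
strain_1 = {"g1", "g2", "g3", "g4"}
strain_2 = {"g1", "g2", "g5", "g6"}
strain_3 = {"g1", "g2", "g4", "g7", "g8"}

counts = venn3_counts(strain_1, strain_2, strain_3)
# Fraction of all distinct genes that are present in every strain.
shared_fraction = counts["all_three"] / counts["total"]
print(counts, shared_fraction)
```

With these toy sets, two of the eight distinct genes sit in the middle of the diagram, so the shared fraction is 0.25; the roughly 40% figure quoted for the three E. coli isolates is the same kind of ratio computed over real gene sets.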
What this means is that it is not possible to fully describe a particular bacterial or archaeal species with a single genome sequence. That is fundamentally important. One might make the same argument when thinking about higher eukaryotic organisms. But while we all agree that those of us sitting in this room differ in terms of our DNA sequence, we presumably don't differ in terms of gene content by 25 or 30%, although we may all know one or two people where we suspect that might be the case, but we don't yet have any strong proof of that. So it was based on a number of these observations that some of my colleagues at TIGR and I, in collaboration with Rino Rappuoli's group at what was then Chiron Vaccines, now Novartis, set out to ask the question: how many genome sequences are necessary to fully define a bacterial species? Can we even do this? It was of academic interest to us from the point of view of better understanding molecular diversity and evolution. It was of interest to our collaborators at Chiron Vaccines because they were interested in targeting a number of key pathogens and developing novel vaccines, taking a genomics-based approach. What they wanted to make sure of was that they were not focusing on potential new vaccine candidates that had limited distribution among large numbers of pathogenic strains. What emerged from these studies, and I'm not going to go into any detail, was the notion of what we described as the pan-genome, the complete repertoire of genes associated with a given species. What was revealed in essentially all cases that we have looked at since our initial study on group B streptococcus, and others have followed on with the same kind of analysis, is that in the majority of cases one can describe genomic diversity as follows. There appears to be a core of genes that one can find associated with all isolates of a given microbial species.
Surrounding this is a set of what we're calling non-essential genes, genes that are found in some, but not all, strains. Then there is yet another set of genes, the strain-specific genes, which we've depicted here as a genomic halo because we have yet to put a definitive number on what these strain-specific genes represent. We believe that in some cases these strain-specific genes very likely outnumber the core genes and the non-essential genes. Yet these are all strains that, under the current microbial species definition, we are calling members of the same species. This model we have termed an open pan-genome model. It seems to be consistent with organisms that we know can be found in multiple environments. There is very good evidence to suggest that these strain-specific genes have likely been acquired through lateral gene transfer. These organisms tend to be found in environments where there are other bacterial and archaeal species present, and this likely provides an opportunity for multiple lateral DNA exchange events. We can contrast this, in a very limited number of cases, with a closed pan-genome, where the number of genes associated with a species seems to be much easier to quantify. This tends to be what you see with intracellular pathogens, for example. Again, these are organisms that seem to live a much more isolated lifestyle and presumably have more limited access to a global microbial gene pool. Just as an example of what all of this means, this came from a review article published a couple of years ago that took this notion of the pan-genome, and core genes versus unique genes, and looked at two examples: one, a set of group B strep strains, and the other, a group of Bacillus anthracis, Bacillus cereus, and Bacillus thuringiensis strains.
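The pan-genome categories described above can be sketched in a few lines, assuming each strain has already been reduced to a set of gene family identifiers. Everything named here is hypothetical: the function names, the `dispensable` label (used for what the talk calls non-essential genes, those found in some but not all strains), and the toy strains. This is an illustration of the bookkeeping, not the analysis pipeline used in the group B streptococcus study.

```python
from collections import Counter


def partition_pan_genome(strains):
    """Split a pan-genome into core, dispensable, and strain-specific genes.

    strains: dict mapping strain name -> set of gene family identifiers.
    Assumes at least two strains; with one strain every gene is trivially core.
    """
    n = len(strains)
    # For each gene, count how many strains carry it.
    occurrences = Counter(g for genes in strains.values() for g in genes)
    core = {g for g, k in occurrences.items() if k == n}
    strain_specific = {g for g, k in occurrences.items() if k == 1}
    dispensable = set(occurrences) - core - strain_specific
    return core, dispensable, strain_specific


def new_genes_per_genome(strains, order):
    """Number of previously unseen genes contributed by each genome in turn."""
    seen = set()
    added = []
    for name in order:
        added.append(len(strains[name] - seen))
        seen |= strains[name]
    return added


# Toy data, invented for illustration only.
strains = {
    "s1": {"a", "b", "c", "x"},
    "s2": {"a", "b", "c", "y"},
    "s3": {"a", "b", "d", "z"},
}

core, dispensable, strain_specific = partition_pan_genome(strains)
growth = new_genes_per_genome(strains, ["s1", "s2", "s3"])
print(sorted(core), sorted(dispensable), sorted(strain_specific), growth)
```

In this framing, an open pan-genome is one where the count of new genes per added genome does not fall toward zero as more strains are sequenced, while a closed pan-genome is one where it does. Deciding which applies to a real species depends on many more genomes, on the order in which they are added, and on how gene families are defined.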
These trees were constructed based on gene presence or absence, and the length of the lines represents degree of relatedness, again as defined by gene presence or absence. It's a little bit hard to see here without taking a ruler to this, but the take-home message was that at the extremes of this tree for group B strep there was a greater distance between organisms than there was between some of these Bacillus anthracis isolates and Bacillus cereus and Bacillus thuringiensis. This certainly supports other evidence suggesting that this entire group of Bacillus organisms really should be considered a single species rather than three different species. So it may be that, as this idea goes forward, our current classification scheme of grouping organisms together as species ends up being modified. This brings me to the final point of my talk: we may need to consider the possibility that different criteria need to be adopted for understanding the species concept, depending upon what part of the phylogenetic tree one is looking at and what organisms one is studying. I think one of the best ways of doing that will be through the science of metagenomics, which in the microbial arena really represents the next frontier. As you all know, the field of metagenomics is the study of microbial communities as an entity rather than looking at single organisms, and I think that as we go out into natural environments and begin to look at what's there, this will help us to clarify our notion of what constitutes a bacterial species. One of the projects being led by NHGRI, although this is clearly a trans-institute project within NIH, is the Human Microbiome Project, which looks at the microbial communities that we share our space with.
We are not sterile organisms. The role of our immune system is not necessarily to make a sterile organism, and there are multiple environments in the human body that are very densely populated with microorganisms. The GI tract is probably the most complex, but all of these environments represent a fair amount of complexity, and if you look at what we believe to be present in all of these environments collectively, there may be 150 times more microbial genes associated with humans than there are genes in the human genome. The advantage of metagenomics, as you all know, is that it can theoretically access 100% of an environment. It bypasses the need for growing microbial organisms in culture, which, if we do this from native environments, really ends up missing most of what's there because we can't get these organisms to grow. But when we study complex communities, we end up with a different set of data than we have been used to dealing with so far. Starting with a single organism in culture, we've essentially been able to generate complete, closed genome sequences; here diversity equals one. When we're talking about metagenomics projects of any sort, we're looking at multiple organisms present at different levels of abundance. Studies so far have not gone as far as I think we would all like into looking at the diversity, and what one often ends up with are incomplete data sets.
Yet all is not lost. As we demonstrated in a paper published last year in collaboration with Jeff Gordon at Washington University and David Relman at Stanford, one can in fact go a long way with metagenomic analysis. This was looking at organisms in the human distal gut to begin to understand who's there and what these organisms are doing. To get back to the point that I made about natural diversity, this is a figure from the paper looking at the dominant archaeal species present in the human GI tract and aligning all of the sequence reads from this effort against a reference genome that had been completed at Washington University. This axis should actually read percent identity; we're looking at DNA sequence identity here. What you see is that every read that aligns to this reference sequence from Methanobrevibacter smithii is not aligning at 100% DNA sequence identity, suggesting that in this environment there is a fair amount of diversity within these organisms that we are classifying as the same species based on 16S ribosomal DNA analysis. Based on annotation and comparative analysis of two normal, healthy human volunteers, we see that there are more genes that are unique than there are shared in these data sets, the caveat being that these are very limited data sets given the complexity of the community.
Nonetheless, we can begin to take this information and do metabolic reconstruction. What is shown on this very busy slide are maps of a couple of pathways involved in starch and sucrose metabolism that are known to be encoded by human distal gut bacteria and are not present in the human genome. We can begin to make putative assignments of our predicted open reading frames from this study to various genes in the pathway, and we can begin to get information about relative levels of abundance. Although there are lots of boxes here shown in white, for which we have no matches, we suspect that had we gone deeper in this study we would have started to fill in more information. So one can in fact start to make inferences and do comparisons with these metagenomic data sets.
They've also turned out to be very, very useful from the point of view of mapping protein expression onto metagenomic data. This is work from one of my collaborators, Bob Hettich at Oak Ridge National Lab, using the human gut metagenomics data set as the Rosetta Stone for interpreting some metaproteomic data that he has generated. This is now looking at a different set of subjects, healthy individuals and a number of subjects suffering from Crohn's disease, but you can see that there are actually a fair number of protein matches. Again, to get back to this notion of diversity within communities, if you look at these two individuals and at the top 15 hits based on proteomic analysis rather than genome sequence analysis, there is some overlap, as shown by these colored arrows, but the hits aren't all consistent. I think what's very interesting, looking at these two individuals and at these two sets of black arrows, is that contained within the top 15 apparently most abundant proteins are hypothetical proteins. This gets back to the point that I made at the beginning of the talk: there is still very much to understand about what is going on in these microbial cells that I think we tend to think of as being fairly simple.
So with that, let me conclude. What I have come away with from thinking about microbial species and using large-scale approaches for the past 15 years is that there is still a tremendous amount that we don't understand. I particularly like this quote from Shakespeare: "In nature's infinite book of secrecy, a little I can read." That is today. But I think if we look forward into the future another 10 or 20 years, we will hopefully look back and say that what we all discussed at today's meeting was primitive in terms of our technologies and our understanding of life on earth. Let me just conclude with one acknowledgement slide. It is absolutely impossible for me to give credit to all of the people I worked with at TIGR over the past 15 years, many of whom have now moved with me to the University of Maryland, as well as to a very large number of outside collaborators who have broadened our horizons and allowed us to use our expertise and large-scale approaches to study some fascinating organisms. With that, if there's any time, I'll take a question or two.
We will try to squeeze in a couple of questions for each speaker, so if you come up to the microphones, there's a microphone on each aisle. If not, I'm going to take the liberty to ask. Claire, I'd be curious to hear some of your thoughts on how all this new knowledge and insight, and even redefinition of strains and species so far, will affect clinical microbiology.
That's a very interesting question, and I've heard a number of discussions about this. I think right now the sense is that, in terms of clinical issues, there is a sort of level of comfort that comes from being able to make an identification of a particular genus and species, and I can understand that from the point of view of diagnostics. But I think that we will start to move away from that as we begin to get a better understanding that not all Staph aureus isolates are the same. So I think that many people are slowly coming along to this notion that we need to redefine genus and species, to think about a new definition. The clinicians in particular seem to be struggling with this, but I think that eventually we will reconcile all of this. Quite honestly, I'm not sure it matters; what matters is that we have the appropriate technologies, the appropriate understanding of what it is we're looking for, and the appropriate markers to study that.
Okay, I think we will move on then.