Thanks, Adam. It's really a delight to be here. I want to thank Eric for inviting me, and also wish you a happy birthday and many more years of great success with the NISC.
So why population genomics? What are we referring to here? This is simply taking the set of questions that population genetics addresses and taking it to the genomic level. Population genetics is interested in trying to infer the balance of the roles of mutation, drift, migration, and selection to make a statement about the way evolution works. In other words, it's not just a description of evolution; it's an attempt to really understand the mechanism behind evolutionary change. Unfortunately for population genetics, this means having to drag armies of students through very tedious lectures dealing with these nasty statistics like pi and theta and rho. I'm going to spare you much of this, but at least give you a little bit of the flavor of what we can do at a genomic level.
Another question that one can address with population genomics has to do with past population demography. It turns out a lot of the issues with the HapMap project, with inference about complex traits and identifying genes associated with complex traits, are contaminated by confusion about demographic history and population structure, so we need to understand that. In addition, having genome-wide data from multiple individuals allows us to do scan statistics, looking across the genome for heterogeneities in these various statistics about levels of nucleotide diversity, recombination, and so forth.
So the basic building block for modern population genetics is the idea of having complete data: sequence data from multiple individuals ascertained in some uniform and random way. That ideal data set would look like this, where you have complete sequence with no errors, of course, and no missing data. That's the ideal.
Of course, what we actually get is usually something far from that, and what we want to do with those data in any case is to estimate some of the primary parameters: how much variability is there, and what's the cause of that level of variability? The primary parameter arises from every direction. If you look at theoretical population genetics at all, it's astonishing the degree of convergence on this parameter, theta, which is four times the population size times the mutation rate. You can derive this as being a primary determinant of the level of variability in a population forward in time with a Fisher-Wright model, backward in time with a coalescent, from an infinite sites model, or from an infinite alleles model. It's an amazingly convergent statistic. So one of the things we'd like to do is to estimate it from various sorts of genome data.
The particular kind of genome data I'd like to talk about is the very exciting short read data that we've been hearing a little bit about already. We've heard several talks dealing with short read data and the interest in using these short read technologies to recover the sequence of the individual. Often this is for the purpose of identifying the individual mutation or the set of mutations in that individual, getting high accuracy for an individual. And we've seen in those circumstances that it's necessary to go to reasonably high read depth, 10, 20, or 30, for those individuals. I want to look at the opposite situation, where the data that are available are actually rather sparse. So this is ten different individuals where you're looking at a coverage of well under 1x in each individual, but you have many individuals. Can you do any kind of useful analysis with that? It turns out that in the population genetic setting, for estimating these sorts of genome-wide parameters, this sort of data works very well.
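A minimal sketch of how theta could be estimated from the ideal, complete, error-free data described above, using two standard textbook estimators: Watterson's estimator, based on the number of segregating sites, and nucleotide diversity pi, the average number of pairwise differences. The function names and the toy alignment are illustrative assumptions, and the sketch does not handle the missing data or sequencing errors that come with sparse short read data.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns where at least two sequences differ."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def watterson_theta(seqs):
    """Watterson's estimator per site: S / (a_n * L), a_n = sum_{i=1}^{n-1} 1/i."""
    n, length = len(seqs), len(seqs[0])
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / (a_n * length)

def nucleotide_diversity(seqs):
    """Pi per site: mean number of pairwise differences divided by length."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * len(seqs[0]))

# Toy alignment (hypothetical data): 4 sequences, 10 sites, 2 segregating sites.
toy = ["ACGTACGTAC",
       "ACGTACGTAC",
       "ACGAACGTAC",
       "ACGAACGTTC"]
print(segregating_sites(toy), watterson_theta(toy), nucleotide_diversity(toy))
```

Under neutrality both quantities estimate the same theta, so comparing them is one common way to look for departures from the simple model.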
And I'm going to show you some of the directions to approach this. This was actually a problem that arose, the first that I was aware of it, with the Celera Genomics data, when they had sequences from six individuals and it fell in my lap to estimate things like nucleotide diversity from those six individuals. We did a rather crude ad hoc method. In the end it wasn't published because they weren't releasing the SNPs at the time, but it was a problem that I kicked around with Rasmus Nielsen right at the beginning, and he was enchanted by the problem.
It's actually quite interesting. Not only are the data very gappy, but of course there's a relatively high error rate; a typical single-read error rate for Solexa reads might be on the order of a percent or two. There are also site-specific errors, so particular regions of the genome might have different sequence error rates. In addition, if you're sampling from diploid individuals, you're actually sampling from the two alleles of that individual, so there's a binomial sampling from each individual. It's a very interesting sampling problem. Along with Rasmus Nielsen and Ines Hellmann, we have a paper submitted that derives a Watterson estimator in the face of all those errors. All you need is a very good error model. You have to know what the determinants of error are, what the rate of error is, and what the base neighborhood is for those error rates. If I can specify an error model and have that kind of data, I can estimate theta very well.
The particular test data set that we're going to talk about was actually funded by Adam Felsenfeld. It's a pilot project for one of those interminable committees one sits on for NIH. This was one that was dealing with selecting genomes to sequence, and in particular the idea of using short read technologies came up in that context for SNP finding across the whole genome.
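The binomial sampling of the two alleles of a diploid individual, mentioned above, can be illustrated with a small calculation. This sketch assumes each read draws either allele with probability one half, independently, and ignores sequencing error; the function name is hypothetical, and this is not the estimator from the submitted paper.

```python
def prob_both_alleles_seen(depth):
    """Probability that `depth` reads at a heterozygous site include both alleles.

    Each read is assumed to draw allele A or B with probability 1/2.
    All reads show the same allele with probability 2 * (1/2)**depth.
    """
    if depth < 2:
        return 0.0  # zero or one read can never reveal both alleles
    return 1.0 - 2.0 * 0.5 ** depth

for depth in (1, 2, 3, 5, 10):
    print(depth, prob_both_alleles_seen(depth))
```

At one or two reads per individual, a heterozygous site will often show only one of its alleles, which is why the sampling has to be modeled rather than read directly off the reads.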
How does one most effectively identify polymorphic sites across an entire genome without having to fund another HapMap project for every organism? So the idea of just throwing them into 454 or Solexa was very appealing. We proposed doing this with just ten lines of fly, six from North Carolina and four from Africa. The 454 sequencing was done at the Wash U Genome Center. Elaine Mardis was our primary contact, and she was terrific. This was back in the days of the GS20. One run was done for each of the ten lines. It gave 3.4 million reads, about 351 megabase pairs of sequence in total.
The alignment looked like this. This is actually a region of chromosome 2 from the real data. You can see that there are regions that are gaps in the data and other regions with more reads; the depth averages about 2.5 across the whole project. About 74% of all those reads had a unique fit. Mosaik here is one of the assemblers I'll talk about in just a second, but it's a pretty interesting start to the data.
The first thing to ask is, how homogeneous is the depth? Are we sampling some regions of the genome better than others? Are there gaps, and so forth? For the North Carolina population there were six lines, remember, so the depth for North Carolina is always going to look greater than the depth for Africa, where there are only four lines. It was quite homogeneous across the chromosome (this was the X chromosome), except for occasional spikes one way or the other, presumably from some kind of repetitive element. In some cases, we could clearly see that's what it was. For these data, there was a reasonably good fit to the Lander-Waterman equation for coverage from a whole genome shotgun.
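As a rough sketch of the Lander-Waterman expectation mentioned above: under the idealized assumption that reads land uniformly at random, the depth at a base is approximately Poisson with mean c, so the expected fraction of the genome left uncovered is e^(-c). The function names below are illustrative, and the model ignores repeats and any other source of uneven coverage.

```python
import math

def depth_probability(k, mean_depth):
    """Poisson probability that a base is covered by exactly k reads."""
    return math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)

def expected_fraction_covered(mean_depth):
    """Expected fraction of bases covered by at least one read."""
    return 1.0 - math.exp(-mean_depth)

for c in (0.25, 1.0, 2.5):
    print(c, round(expected_fraction_covered(c), 3))
```

Comparing an observed depth histogram against `depth_probability` is one simple way to see whether the tails are fatter than the idealized model predicts.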
I have to say, however, that since doing this, the realization was that the power to discriminate the goodness of fit to Lander-Waterman was quite poor for these data, because the read depth was so low. You saw from Richard Gibbs' slide, quickly flashed by, for the Jim Watson data that there was actually a bimodal distribution. There's also been something like 7x coverage of the C. elegans genome done by 454, and there's a pronounced excess fatness of the tails of the coverage distribution. So there are too many regions of the genome with insufficient coverage and too many regions with excess coverage, and those departures from the Lander-Waterman equation are really crucial for use of these short read data for inferences of things like expression level by counting methods. So I think this is a really important problem we need to get a handle on with these methods: what really determines coverage?
The average coverage for the African lines is about 40% of the genome. For the North Carolina lines it's about 60%, and pooled it was about three quarters. Most of the regions that were covered by one were covered by the other, and so forth. We could look at particular regions of the genome that had particularly poor coverage, so these are 10 kb fragments that had less than half coverage, and again it's quite spiky. Particular regions of the genome are falling in those gappy regions.
So the real problem, the primary problem of this whole pilot, was to infer polymorphic sites. Where are there SNPs in the genome, and can we devise an inferential method that would have a relatively low false positive rate and reasonably good accuracy in determining those SNPs?
For this, we turned to some folks who had already been thinking about it, Gabor Marth and Aaron Quinlan at Boston University, who had been working on problems of this sort with Sanger sequencing and recently turned their attention to short read technologies. They realized right off the bat that the critical thing was to understand error. So one of the lines that we sequenced was actually the iso-1 stock, the stock that was done to something like 14x coverage by Sanger sequencing, the reference Drosophila melanogaster genome. That gave lots and lots of reads that did have errors in them, and we could then build this error model. PyroBayes is their base caller, which not only calls bases but also calls the confidence in the bases. It's currently online; you can download it and start to play with it. They're going to town with it.
So anyway, they produced this multi-alignment of the 10 strains all across the reference genome of melanogaster and called some 660,000 SNPs across the whole genome. About 1,200 of them were submitted for validation to the Wash U Genome Center. Those 1,200 had a posterior Bayesian probability of being a SNP of about 90%, and 92% in fact validated. So it's looking pretty good. Now, this is not the 99% confidence of SNP calls in each individual. The question is, is this a polymorphism at that position of the genome? And that accuracy is pretty good.
There's actually a quite strong correlation of the nucleotide diversity in particular regions. If there's very low diversity in Africa, there will be very low diversity in North Carolina, and so forth. That's expected. These are populations that are derived one from the other. Basically, the fly population pretty much followed the human population in migrating out of Africa, so we expect them to show that kind of pattern.
Divergence between species, so melanogaster versus simulans, is also correlated with this level of diversity within Africa. What we're asking here is: is there simply heterogeneity in mutation rate? Are the regions with high diversity getting that high diversity from an elevated mutation rate in that part of the genome? If so, you would expect there to be an elevated divergence between species, and in fact you do see this to some extent. However, this correlation is much stronger than this one, so mutation doesn't drive all of that difference. Something else must be going on.
One of the things that you can do to get at what's going on is to look at the ratio of polymorphism to divergence. The level of polymorphism ought to be predicted by this parameter theta. The divergence is determined by twice the mutation rate times the time since the divergence between the two species. The ratio of those has mu divide out, as you can see, so we ought to get a distribution that depends only on the other factors, namely effective population size and the time since divergence. When you do this, you have to simulate to get the expected level, and the expected level is shown here under a neutral simulation. On this axis is the diversity to divergence ratio, and for those 10 kb chunks across the whole genome, we get this distribution. What we actually observe has a much greater variance. In other words, some regions of the genome have much more diversity than expected under that neutral simulation, and others have much less than expected. That means the other parameters, effective size and time since divergence, must be what differ between those regions of the genome, and those are precisely the parameters that are driven up and down by things like natural selection, by correlations with recombination rate, and so forth.
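The cancellation of the mutation rate described above can be written out explicitly. This is a sketch under the neutral model, taking theta as four times the effective population size times the mutation rate and the divergence as twice the mutation rate times the divergence time, as stated in the talk. The symbols N_e, mu, T, and D are notation chosen here, with T measured in generations if mu is per generation.

```latex
\theta = 4 N_e \mu, \qquad D = 2 \mu T
\quad\Longrightarrow\quad
\frac{\theta}{D} = \frac{4 N_e \mu}{2 \mu T} = \frac{2 N_e}{T}
```

Because mu no longer appears, regional variation in the ratio points to variation in effective size or coalescence time rather than to mutation rate alone.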
So there's some interesting heterogeneity across the genome in these parameters of evolution. One of the things that's often done with data that address polymorphism across the whole genome is to look for signatures of natural selection. We've certainly seen this many times over now with the human genome, and it's still an area of interesting ongoing research. We can detect selective sweeps by troughs in diversity. If there's a favorable mutation, selection is going to drag that particular variant up in frequency, replacing all the others and hence reducing the local diversity around that adaptive mutation. We can look at this across the whole genome and ask, are there particular regions where there are big dips in nucleotide diversity? It doesn't jump out at you for this particular data set, although there are regions where there's a curious dip for both the African and the North American sample, and it is actually significant at the genome-wide scale. Some of them are regions of the genome that have already shown signatures otherwise. The gene bag of marbles on the X chromosome is one interesting candidate that's being pursued in Chip Aquadro's lab, for instance.
We've also seen in the human data efforts to look at differences between populations as a means of identifying regions that might have undergone region-specific natural selection. We can do the same with flies. This is the difference between African and North Carolina diversity. Are there regions where there's a big spike up or down in the diversity? In fact we see them. Those are again candidates that are nominated for potentially interesting region-specific selection.
On the issue of demography, flies show the same sort of demography as humans.
Namely, there was a smaller ancestral population that grew rapidly at some point in the past. In the African population there was not much change since that time, but there was a very narrow bottleneck and expansion into Europe and the Americas. This is seen in the site frequency spectrum in African versus European flies. European flies have a big excess of rare variants, because the variation that would make it through that bottleneck would show this skew in the site frequency spectrum. So that was already published. Do we see this with these data? It's not nearly so clear, because we only have a depth of two and a half on average, but what you can ask is, are there differences between different parts of the genome with respect to the relative frequency of different polymorphic sites?
One of the things you see is that there's a consistently reduced diversity in North Carolina compared to Africa. Remember, there were only four African lines and six North Carolina lines, and nevertheless there's more variability in the African lines. Again, the demographic reason for this is well known: African populations have a larger effective size and look more diverse.
If you look at the X versus the autosomes, the mean for the X chromosome is about 0.004 and the mean for the autosomes is about 0.006 for the North Carolina population, so you see that the X is less diverse than the autosomes. This is seen in almost every organism. The reason for this, of course, is that there are fewer X chromosomes than autosomes. So the X chromosome has a smaller effective size, and in that mutation-drift balance theta ought to be smaller and you end up with less diversity. Even if it were strictly neutral, you'd end up with less diversity. But in fact, the theoretical expectation is that the X ought to have three quarters the diversity of the autosomes, and it's more like half in this case. Well, it turns out there's good theory for this.
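The three-quarters baseline quoted above can be made concrete by counting chromosomes. The sketch below uses the standard textbook formulas for effective population size with separate numbers of breeding females and males, and assumes neutrality and equal mutation rates on the X and the autosomes; the function names are hypothetical.

```python
def autosomal_ne(n_females, n_males):
    """Effective size for autosomes with separate sexes."""
    return 4.0 * n_females * n_males / (n_females + n_males)

def x_linked_ne(n_females, n_males):
    """Effective size for the X: females carry two copies, males one."""
    return 9.0 * n_females * n_males / (2.0 * n_females + 4.0 * n_males)

def expected_x_to_autosome_ratio(n_females, n_males):
    """Expected ratio of neutral diversity, X over autosomes."""
    return x_linked_ne(n_females, n_males) / autosomal_ne(n_females, n_males)

# With equal numbers of females and males the ratio is three quarters.
print(expected_x_to_autosome_ratio(500, 500))
```

Under these assumptions a skewed sex ratio alone moves the expectation only between 9/16 and 9/8, so ratios near one half call for something more, such as the bottleneck effect taken up next.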
In a population that's undergone a bottleneck, you actually expect to see a more severe reduction in diversity on the X compared to the autosomes. This is a paper by John Pool and Rasmus Nielsen. If you compare them now, in Africa the X to autosome ratio is about 65%, which is, again, already lower than the 75% expectation under neutrality. For North Carolina, it's about 50%. North Carolina is a derived population, and you can see that there was in fact a greater reduction in diversity on the X compared to the autosomes, nicely consistent with that expectation.
The final point I wanted to illustrate from these data at a genome-wide scale is that regions of low recombination, particularly around centromeres, show dramatically reduced heterozygosity. You see that both in Africa and North Carolina. One can then look at the local recombination intensity, estimated as centimorgans per megabase pair. This is estimated from many of the mapping experiments done with flies over the years. For a given local recombination rate, what's the diversity in Africa and in North Carolina? You see a pronounced positive correlation. This is widely described in the literature as being attributable to the Hill-Robertson effect. In regions of very low recombination, a positively favored mutation is going to drag down diversity: lower recombination will make a larger region get swept to fixation and drop the diversity more than in a region of high recombination. Also, in regions of low recombination, if there's negative selection, so deleterious mutations are occurring, that will reduce the effective population size to a greater extent when there's a lower recombination rate, the so-called background selection model. So some combination of those two is driving this positive correlation. You see it in these completely independent samples, four individuals here and six individuals here.
We see this positive correlation between recombination rate and level of diversity. One thing that might drive that is if recombination itself were slightly mutagenic, so that when recombination occurs you also get mutations. That would also drive a positive correlation between recombination and divergence. In fact, in flies we do not see that. This is the recombination rate again on this axis against the melanogaster-simulans divergence, and there's no correlation. So it seems that mutagenic recombination is not producing this positive correlation. It really is a Hill-Robertson-like effect: the local environment for adaptive mutations is favoring a greater reduction in diversity in regions of low recombination. This made a paper back in 1992 with just 12 data points. Here are 30,000 data points, and it still seems to be true. That was Begun and Aquadro back in 1992.
I wanted to take the last couple of minutes to shift gears. This is another project that was funded by NHGRI, to look at the comparative genomic lessons that we learned from a dozen different Drosophila genome sequences. This is a particularly sociologically interesting project that featured data from all the genome centers except Stanford, I think, including Agencourt Biosciences. I guess NISC didn't contribute to it, but many, many different groups over many years contributed to these data. The choice of these 12 species was made on the basis of the huge diversity in the ecologies and lifestyles of these different flies. There are over 400 million years of evolution spanned by that tree, so there's phenomenal saturation of mutation at many, many sites in the genome. Manolis Kellis was in charge of the analysis of the annotation of the melanogaster genome and how we could improve that annotation using these data.
This is just to illustrate something that I think you all know very well, which is that in protein coding regions you expect to see more substitutions between species at synonymous sites, because after all they still retain the same amino acid sequence. You expect to see substitutions that preserve the reading frame, and you expect to see substitutions that replace one amino acid with another that has very similar properties. A number of software tools, Exoniphy and so forth, are very good at finding exons in the genome based on these sorts of signatures. So we can color code regions based on the attributes of the substitutions along this 12-species alignment. So these are the 12 fly species. Do those substitutions smell like protein coding substitutions? If so, color them green. If there are substitutions that are radical non-synonymous changes or frame shifts, which are less likely to be protein coding, we color them red. When we do this, we can identify regions of the genome that are more likely to be protein coding, and we can then use that to improve our annotation of any given genome.
This is just to indicate the difference between that kind of strategy and a strategy based on simple conservation. This is the track from the UCSC browser, the phastCons track, showing the degree of conservation. It's a very nice metric for overall conservation of the sequence. Here's the region of this particular gene, CG9945, the boxes again being exons, and you see high conservation for most of those exons. But if you look carefully, you see there are regions where there actually is high conservation but we failed to annotate an exon.
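A minimal sketch of the kind of evidence described above: classifying a codon difference between two aligned species as synonymous or non-synonymous, and checking whether a gap length preserves the reading frame. It uses the standard genetic code; the function names are hypothetical, and this is not the scoring used by the exon-finding tools or in the 12-genome analysis.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitution(codon_a, codon_b):
    """Label the difference between two aligned codons."""
    aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
    if codon_a == codon_b:
        return "identical"
    if aa_a == aa_b:
        return "synonymous"
    if "*" in (aa_a, aa_b):
        return "nonsense"
    return "nonsynonymous"

def preserves_reading_frame(gap_length):
    """An insertion or deletion keeps the frame only if its length is a multiple of 3."""
    return gap_length % 3 == 0

print(classify_substitution("CTT", "CTC"))
print(classify_substitution("GAT", "GAA"))
print(preserves_reading_frame(6), preserves_reading_frame(4))
```

A region whose differences across species are mostly synonymous and frame-preserving looks coding; one dominated by non-synonymous changes, stops, and frame shifts does not, however conserved it is overall.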
Subsequently, we see that it also has a very high probability of being coding based on this evolutionary model, and in fact, when we go back, we see that there is another transcript and there is in fact an exon in that region of the genome. So that's a novel annotation of the melanogaster genome that came about because of the comparison across these 12 species, which gives us much more power to detect these things. We also see things like this, where FlyBase was right in saying there's not an exon there, but we see very high conservation. When we look at the protein coding signal, the substitutions that do occur in that region look very un-exon-like, and in fact there's not an exon there. So this is, again, a case where there's a big difference between simple conservation and the protein coding signal. These sorts of methods resulted in many new annotations of the melanogaster genome, particularly in protein coding regions: some 413 cases of translation start changes, 912 cases of different splice signals, 240 cases of polycistronic genes, and so forth. So we gain considerable power in the annotation of a genome by taking this comparative approach.
Just two other stories from the comparative evolutionary analysis. One of them has to do, again, with this X versus autosome comparison. Now, I made it sound simple. Comparing the X and the autosomes, the X has a smaller effective size, so it ought to have more drift than the autosomes. Things bounce around stochastically more with a smaller effective size. On the other hand, the X chromosome is hemizygous in males. So any mutation that's expressed in males, even if it's a recessive mutation, is immediately exposed. There's no masking of a rare allele in males; there's no such thing as recessivity.
And so one expects that deleterious mutations ought to be more effectively screened by natural selection. Recessive advantageous mutations ought to be more effectively and more quickly identified by natural selection and dragged up in frequency. So these things lead in opposite directions. The latter results in an expectation that the X chromosome ought to be evolving faster. Across the set of papers that have addressed the evolution of X versus autosome in different organisms, it's a wildly chaotic literature with very poor consistency. If you look at the neutral divergence, so this is at synonymous sites, and the divergence at non-synonymous sites, now with full genome data for all these species, again you see this pattern where in some lineages it looks like the X is faster and in other lineages the X is slower, and you can see why the literature has been so confused. Even with whole genome data, the pattern is still rather on a knife's edge: depending on changes in the demography and other things, either the X or the autosomes look like they're evolving more quickly.
The one thing that's absolutely universal is codon bias: the degree of codon bias is greater on the X chromosome than on the autosomes all the time. Codon bias refers to the differential use of synonymous codons. The fact that codon bias is greater on the X means that the population is better able to discriminate between these very weak differences in selection between alternative forms. Those are almost certainly recessive differences, and the fact that the X chromosome is detecting them is saying, again, that it's probably because of this increased efficacy of natural selection in seeing variants on the X chromosome compared to the autosomes. It's remarkably consistent on every branch of this tree, except for a couple where there's just insufficient power.
One final point has to do with the evolution of innate immunity.
This is the idea of taking a pathway for any process you can imagine and following it down this 12-species phylogeny, which is kind of exciting to imagine. So how is the pathway tuned? How is pressure from pathogens exerting itself on this pathway? Do we see accelerated evolution in recognition or effector molecules? How does this go? This is the work of Tim Sackton in my lab and a number of collaborators, a paper that's still in review at Nature Genetics. One of the punch lines that came from this is centered on the gene Relish, which is one of the transcription factors that launches transcription of a number of the antimicrobial peptides. There's an inhibitor domain on Relish that's joined by a linker. For that linker region, this is showing the signature of positive selection on this axis against position along the gene. Most of the signatures of positive selection are right in that linker region. Other proteins are involved in cleaving that linker: Dredd has a caspase domain that actually does that cleavage, and again the caspase domain shows this excess signature of positive selection. So it's an intriguing case where in just the melanogaster lineage we're seeing multiple signatures of positive selection on just that part of the pathway.
So this is a paper with over 244 authors. It's going to be in the November 8th issue of Nature. It's been a tremendous thrill, opportunity, and privilege, really, to be working with this group. And with that I'll close for questions. Thanks.
Questions from the floor.
So Andy, what are the challenges in taking these approaches and applying them to studies of human populations? Obviously flies are going to be much simpler, and clearly we want to be able to do the kinds of comparisons you are showing with conservation and better annotation of proteins. But it must not be an easy generalization.
Well, there are a number of things that are different in the human situation, because we're coming at it in a different context. We have so much data from the million-SNP typing platforms and so forth that I think doing something like short read sequencing on top of those million SNPs will let the two really leverage each other in a very exciting way. Some of the primary issues of estimation of these parameters are perhaps less important for humans than for model organism studies. We're much more interested in the medical questions, where it's so important to get real accuracy of individual calls. But that's where, in the context of the HapMap project and understanding the haplotype background, we have these methods that allow one to impute missing data. If you combine that sort of imputation with short read sequencing, I think they could really fuel each other.