The next two agenda items will be project updates. We regularly try to give council routine reports about ongoing projects, and so at this meeting we're going to present reports on the ENCODE project first and then on microbiome research. So please. Thank you, Mark. Happy to give an update on the ENCODE project, and in doing so I realize that a lot has happened since the last time I reported to council. Just a reminder of why we're doing this project: when the human genome sequence was completed, we really were not sure how to read the sequence. I like to say that we had no instruction manual; there are no readily available punctuation marks. Evolutionary conservation can help us identify functionally important regions. At least 5% of the genome is conserved, and about 1.5% of the genome is protein coding. But we really want to know what the function of the non-coding conserved sequences is, and even what the function of the non-conserved sequences is. We're moderately good at identifying protein-coding regions, but the fine structure of the coding sequences is hard to predict from the sequence alone. We know that regulatory regions can be very far away from genes, and we felt we needed an unbiased experimental investigation to gather this information. So we launched the Encyclopedia of DNA Elements, or ENCODE, project with the goal of compiling a comprehensive catalog of functional elements in the human genome as well as in the genomes of model organisms. We started this in 2003 with a pilot project focused on 1% of the human genome sequence, and then we scaled to a production phase in 2007, looking across the entire human genome sequence. In 2007 we also launched the modENCODE project, using our production efforts to map functional elements comprehensively in the genomes of C. elegans and Drosophila melanogaster. 
With funds from the economic stimulus in 2009, we also funded a limited production effort in the mouse, with the idea that we wanted to gather information that could help us annotate the human genome sequence, and throughout this project we have supported several rounds of technology development. I reported about a year ago, at last May's council, the results of a mid-course review we did of the ENCODE and modENCODE projects. The purpose of this review was to assess progress toward meeting the modENCODE and ENCODE consortia goals and to consider options for the future of these projects. I'm not going to go into the details of this review except to say that the outcome was a recognition that these projects were in very high production mode, and that it would be advantageous to continue them for an additional year to take advantage of the high-throughput production capabilities that had been generated, to increase the amount of data being produced, and to give NHGRI time to plan for the future. Those projects were going to terminate this year; however, they have now been extended for an additional year, and we're at the beginning of the fifth year of funding. This is just a reminder of the different functional elements being studied. This is a figure from the marker paper for the modENCODE project, published in 2009, outlining the different functional elements being studied in modENCODE. These include transcription factor binding sites, histone and chromatin modifications, DNA replication sites, the sites of transcription and the different products of transcription, as well as fine-tuning the annotation of the genome. In addition, in the human ENCODE project we're mapping chromatin structure using DNase I hypersensitivity mapping, as well as DNA methylation. 
So these projects were supported with the idea that they would be community resources, and we were hoping they would be used by the community to further understand the regulation of gene expression and, hopefully, to understand the genetic basis of disease. One of the hallmarks is a rapid prepublication data release policy, and the analysis of these data has required the development of common data reporting formats, data standards, and analytical tools. There are multiple ways to access the data. The ENCODE data can be accessed through the ENCODE portal, at encodeproject.org, hosted by UCSC. You can also find the data at the UCSC genome browser and Ensembl, and through NCBI on their epigenomics page. The modENCODE data can be found at their portal at modencode.org, as well as at FlyBase and WormBase, which host the fly and worm data respectively. Can you just give us some idea? There are all these places where you can access the data. Are we accessing the same views and the same data, or does each one have some value add or differentiating factor? I'm going to let Peter answer that question. I think each one has a different value add. If you look at encodeproject.org and modencode.org, you're going to get the data coordinating centers: the raw data and the derived data. You'll be able to get tracks on the genome, and you'll be able to download all the data if you want to do large-scale analyses. Ensembl and FlyBase and WormBase add value on top of this. Ensembl makes predictions; they have a regulatory build framework that they layer onto Ensembl, which is cell line or tissue specific. FlyBase and WormBase try to bring the data in as they would any curated data, so they layer it on top of the other data that's available. Thank you. 
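For anyone downloading the data for large-scale analysis as described above, the tracks are typically distributed in BED format. A minimal sketch of parsing one record (the file name and peak name here are hypothetical, not from the talk):

```python
# Minimal sketch: parsing one BED-format line, the common format in which
# ENCODE/modENCODE peak calls are distributed as genome browser tracks.
def parse_bed_line(line):
    """Parse a BED line into (chrom, start, end, name).
    Coordinates are 0-based, half-open, per the BED specification."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "."  # name column is optional
    return chrom, start, end, name

# Hypothetical DNase peak record for illustration
line = "chr1\t1000\t1500\tDHS_peak_1"
print(parse_bed_line(line))  # ('chr1', 1000, 1500, 'DHS_peak_1')
```

The same three required columns (chromosome, start, end) are what the browsers at UCSC and Ensembl render as track intervals.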
Okay, so I'm now going to walk you through a bit of specific information for the individual projects. This slide shows the data submissions from the modENCODE project over time. You can see that while there was a bit of a lag in the beginning, there's been quite a steep increase in the amount of data submitted, now topping, I'm sorry, almost 2,000 data submissions. A lot of work has taken place to analyze the data, which, as I mentioned, comes from a lot of different data types. Last year the modENCODE consortia published two major papers in the December 24th issue of Science, including an integrative analysis of the C. elegans data as well as an integrated analysis of the Drosophila data. Along with those were 19 companion papers from individual groups, published in Nature, Genome Research (including a modENCODE special issue), Genome Biology, and Database. It was very exciting to see the fruits of all of this work in publication. As I mentioned, we've extended this project for a fifth year, and in this fifth year we plan to have increased coordination of data generation within each species and between the two species. Ongoing right now are plans for more integrated fly and worm data analysis. There will be a data analysis working group meeting this coming weekend in the building next door, in advance of the consortium meeting that Eric mentioned, and this group hopes to take advantage of the additional fly and worm genome sequences that are now available or soon will be. The modENCODE community was very excited at the recent Drosophila genetics meeting, where it received a lot of very positive feedback from the community, and this prompted them to do a community survey of the use of the modENCODE data. 
And I don't have time to go into all the details here, and I can't actually see it all, but the first question was who is using the data. I should say that over 650 individuals responded to the survey; not everyone responded to all the questions, but it was quite a large response. The majority of investigators are from North America, with some from Western Europe as well as from Asia. The majority of individuals declared that they were basic researchers in model organisms or in genetics. Almost all of the individuals work in academia, and 80% of the respondents said they're either using the modENCODE data or plan to use it in the future. The individuals using the data span from PIs to postdocs and graduate students, as well as educators, shown here, and they're using the data several times a week, several times a month, monthly, or weekly. I don't expect you to see all the details on this slide, but one of the questions was what data types are you actually using. The top here shows the data for C. elegans, and you can see that there's broad use of the different data types, the most-used being RNA and transcription factors, but some groups report using all the data types, and there are similar results for Drosophila. So it's very encouraging that there's wide use of all the different data types. Then the query was what kinds of analysis are you using these for, and the vast majority were single-gene studies or studies on classes of genes. There were also a number of genome-wide studies being conducted. So this is very encouraging: the data are getting out, and the community is using them. I want to briefly mention work on the mouse ENCODE data that's been submitted. 
This project got started about a year and a half ago, and after some lag there's been pretty steady data submission, now topping 100 data sets. We expect a lot more of this coming, and we're looking forward to being able to use these data in the integrated analyses. Shifting over to the ENCODE data, there was a similar lag in data submissions, but then quite a steep increase, with almost 2,000 data sets submitted and no signs of stopping. One way that ENCODE has found useful in thinking about the data is to think about the ENCODE dimensions. There are three: one is the different methods and the different factors, such as transcription factors, being studied; the second is the number of cell types being studied; and the third is the dimension across the genome. A recent summary of the data shows that 164 assays have been performed in ENCODE, including 114 different ChIP assays, and that over 180 cell types have been interrogated across different assays. And of course we are studying across the entire genome, as most of these assays use high-throughput sequencing methods to gather their data, so this is fairly agnostic across the genome. In summary, there are over 3,000 experiments. These have generated five terabases of sequence, which represents over 1,700-fold coverage of the human genome sequence, which is quite remarkable. Now, as Eric mentioned, the ENCODE consortium recently published a user's guide to ENCODE. This was published last month in PLoS Biology, in recognition of the vast amount of data being generated and with the goal of making it as useful to the community as possible. 
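The fold-coverage figure quoted above follows directly from the totals: terabases of sequence divided by genome size. A back-of-envelope check, assuming a ~3-gigabase haploid human genome (the exact quoted figure implies totals slightly above the round numbers used here):

```python
# Back-of-envelope check of the coverage figure: ~5 terabases of sequence
# generated across ENCODE assays, over a ~3-gigabase human genome.
total_bases = 5e12   # ~5 terabases (rounded; an assumption for illustration)
genome_size = 3e9    # approximate haploid human genome size
fold_coverage = total_bases / genome_size
print(round(fold_coverage))  # ~1667-fold, in line with the quoted "over 1,700"
```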
So the purpose of this user's guide was to explain what data is currently available, what data the community can expect to see, and how to access the data, and it has several examples of using the data. One example looks specifically at how the ENCODE data can help with interpreting GWAS SNPs, and I'm going to talk a little more about that in a minute. The ENCODE community did not do a user survey like modENCODE, but we can use a snapshot of the recent Cold Spring Harbor Biology of Genomes meeting that Eric mentioned, where there were several plenary talks and poster presentations from ENCODE investigators; interestingly, there were 21 non-consortium abstracts using ENCODE data, so it is quite encouraging that the data is getting used by the community. As for the plans for the fifth year to have increased coordination of data generation, there already is some coordination on common cell types, but we've expanded those. There's very serious work going on now on integrated data analysis across the data types, which we hope will lead to a publication this fall. There was an analysis working group meeting in March that hammered out some of the details of this analysis, and there are weekly analysis calls. It's been a very intensive effort, headed up by Ewan Birney and Ian Dunham. We hope to have an integrated analysis of the mouse and human data now that mouse data is being submitted, and there are plans under way to discuss the possibility of analysis across all species. As I mentioned, there's a consortium meeting coming up next week, and there will be a session of the joint analysis working groups for both ENCODE and modENCODE that will explore this possibility. We're not the only group generating this type of data. There's a Common Fund Epigenomics project that I'm sure you're all aware of. This was funded in 2008 and included a big effort on what are called the reference epigenome mapping centers. 
NHGRI staff has ongoing coordination and communication with the NIH staff managing this project, as well as communication with the mapping center participants, and this in many cases is easier than you might think, because many of them are the same individuals, which is fortunate for us. We plan to have a joint meeting next week to discuss ways to further synergize. We're actually having back-to-back consortium meetings, so this will be a short session in between the two, and we want to discuss several topics, including opportunities for directed data generation to have what we call complete data sets, opportunities for integrated data analysis, and, I think most importantly, figuring out ways that we can maximize the accessibility and utility of the data to the research community. I just want to close with some exciting recent results on the intersection between GWAS hits and ENCODE. It turns out that at least half, if not more, of the GWAS SNPs in the NHGRI catalog developed by the Office of Population Genomics fall within regulatory regions, mapped either by DNase I hypersensitive sites or by transcription factors. The data on this slide are from John Stamatoyannopoulos, who is a PI in both the ENCODE and the Epigenomics groups, and this summarizes some of his results with data from both projects; I think it's close to 100 cell lines. What he's showing here is that the GWAS SNPs map very closely to DNase I hypersensitive sites: over 53% of the GWAS SNPs fall right within DNase I hypersensitive sites, and if you look more critically at the GWAS SNPs, at the externally replicated ones, that actually increases to 63%. Then if you consider the SNPs that are in complete LD with SNPs in DNase I hypersensitive sites, it looks like 75% of the GWAS SNPs are actually mapping to DNase I hypersensitive sites. 
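The percentages above come down to a genome-wide interval intersection: for each GWAS SNP, ask whether its position falls inside any DNase I hypersensitive site. A hedged sketch of that computation on toy data (this is an illustration of the idea, not Stamatoyannopoulos's actual pipeline; the positions and intervals are invented):

```python
from bisect import bisect_right

def fraction_in_intervals(snps, intervals):
    """Fraction of SNP positions falling inside any of the given half-open
    intervals. `intervals` must be sorted and non-overlapping."""
    starts = [s for s, _ in intervals]
    hits = 0
    for pos in snps:
        i = bisect_right(starts, pos) - 1  # rightmost interval starting <= pos
        if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
            hits += 1
    return hits / len(snps)

# Toy data: hypothetical DHS intervals and GWAS SNP positions on one chromosome
dhs = [(100, 200), (500, 800), (1200, 1300)]
gwas_snps = [150, 450, 600, 1250]
print(fraction_in_intervals(gwas_snps, dhs))  # 0.75
```

The real analysis does this per chromosome across the full DHS maps, and the LD-expanded figure repeats the intersection after adding all SNPs in complete LD with each GWAS hit.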
Some other interesting features are that the disease- or trait-associated SNPs, the GWAS SNPs, are localized in pathologically relevant cell types. For example, SNPs associated with inflammatory bowel disease map to T cell DNase I hypersensitive sites. Many of these SNPs mapping in DNase I hypersensitive sites alter allelic chromatin states, indicating that they are functional, and the GWAS SNPs are localizing in physiologically relevant transcription factor binding sites. An example is shown on this slide. This is from a paper published just recently by the Bernstein group, reporting on some of their ENCODE analysis. They mapped nine different chromatin marks in nine cell types, and by looking at the combinations of where these chromatin marks mapped, they were able to identify 15 different chromatin states associated with different functions, such as promoters, enhancers, repressed and inactive regions, as well as transcribed regions. They found, interestingly, that this landscape of chromatin states differed significantly across the different cell types. They also found that some GWAS SNPs map to enhancers that are active in relevant tissues. An example is shown on this slide, looking at a subset of the SNPs associated with erythroid phenotypes. The SNPs are listed here, mapped against the different chromatin states, which are shown in the different colors, across the nine different cell types. If you look at the second one here, K562, which is an erythroleukemia cell line, these orange and yellow boxes indicate strong enhancers, and these SNPs are all mapping in enhancers in K562 cells. Then if you focus on this one SNP in red, it turns out that it maps less than 100 base pairs from an enhancer in K562, and the nucleotide change at this SNP actually increases the match to the consensus binding site for the transcription factor GFI1B. 
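The core idea behind the chromatin-state maps described above is that the combination of marks present at a region, rather than any single mark, suggests its function. The published work learns states with a multivariate hidden Markov model; the toy lookup below only illustrates the combination-to-state idea, and its rules are hypothetical simplifications, not the published 15-state model:

```python
# Toy illustration of combinatorial chromatin states: which marks co-occur
# at a region jointly suggests a functional label. The rules below are
# hypothetical simplifications, NOT the Bernstein-group HMM.
def toy_state(marks):
    """Assign a crude state label from the set of marks present."""
    if "H3K4me3" in marks:
        return "promoter"          # promoter-associated mark
    if "H3K4me1" in marks and "H3K27ac" in marks:
        return "strong enhancer"   # enhancer mark plus activity mark
    if "H3K4me1" in marks:
        return "weak enhancer"
    if "H3K27me3" in marks:
        return "repressed"
    if "H3K36me3" in marks:
        return "transcribed"
    return "inactive"              # none of the assayed marks present

print(toy_state({"H3K4me1", "H3K27ac"}))  # strong enhancer
```

Running the same region through nine cell-type-specific mark profiles is what produces the differing state landscapes the slide shows.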
It strengthens the binding site of that transcription factor, and this transcription factor is a putative repressor in K562 cells. So clearly this is a very intriguing finding that warrants follow-up. In terms of the implications for ENCODE: when we look at the correlations between functional elements identified by ENCODE and the GWAS SNPs, we feel they can lead to testable hypotheses for how disease-associated genotypes can lead to disease phenotypes. It appears that the power to illuminate disease-related variation is related to the depth and quality of the data, and clearly more cell types will allow coverage across more disease and trait phenotypes. So we feel that ENCODE is positioned to have a significant impact on interpreting genomic information associated with human disease. Now I just want to make brief mention of all the participants. There are over 200 participants in ENCODE and modENCODE, most of whom are coming next week to our consortium meeting. This slide lists the nine different PIs associated with the project along with all of their co-PIs; the groups marked with asterisks are also mouse ENCODE, thank you, mouse ENCODE PIs, and it also includes two other mouse ENCODE groups, including Ross Hardison, our council member here. There are several additional participants, including Eric Green, who was involved until he became the NHGRI director. modENCODE, similarly, has a large number of PIs; there are 12 PIs associated with the project, as well as numerous co-PIs. And for ENCODE, modENCODE, and mouse ENCODE, there are many additional scientists, graduate students, postdocs, bioinformaticians, data analysts, and so on who are participating, and they have really worked well together as a consortium to make the whole much more than the sum of the parts. 
I just then want to end by acknowledging my other colleagues, Peter Good, Mike Pazin, Dewey Zheger, and Mark Guyer, who all helped tremendously on this project, as well as our program analysts, Rebecca Loudon and Leslie Adams. Happy to answer any questions. Mike. So you said that 53% of GWAS hits fall within DNase I hypersensitive sites? I was just curious: what fraction of the genome is DNase I hypersensitive sites? Oh, I thought I knew that number. Do you remember that number? It's pretty high. Okay, so it's a multi-fold increase. The numbers are on the order of 10 to 20%, and John presented today that this increase is statistically significant. I should also say that it's been fortuitous that GWAS SNPs seem to be enriched for ENCODE regions simply because of the way the assays were developed: the requirement for certain GC content, what makes a good probe, happens to share a lot of the same features as regulatory sequences. So they're biased in a direction that's to our advantage. If I can expand a little bit on that, this is now wearing my cap as part of Ewan Birney's analysis working group in ENCODE, but we recognize that characterizing this phenomenon as clearly as possible is really key. I actually see this as one of the major driving goals for the ENCODE project. And happily, whether you use John's work on the DNase I sensitivity or the Bernstein and Manolis Kellis work on the histone modifications, you see substantial enrichment. Now, there is this ascertainment bias: if you just look at the genotyping microarrays, they're also enriched for these functional categories that we get from the ENCODE data. 
But we're also working to try to integrate everything that we know, the factor binding sites and the DNase I sensitivity and so forth, and very recently it's become quite clear that even when we use the distribution of the genotyping SNPs as our null model, we still see quite a significant enrichment in several of these functional categories. The other point is: is anybody using it? Are they finding anything that really is helping them? I'm starting to see multiple papers coming out where people have GWAS hits in noncoding regions, and they start with the ENCODE data to formulate hypotheses, then go back and do the same kinds of functional assays in the pathologically and physiologically relevant tissues. So I think it's really, really exciting.
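The null-model comparison just described can be sketched as a resampling test: draw many sets of SNPs from the genotyping-array background, record how often each set's in-DHS fraction reaches the observed GWAS fraction, and report that as an empirical p-value. The sketch below uses invented figures (a 20% background overlap, 100 GWAS SNPs) purely for illustration; it is not the consortium's actual analysis:

```python
import random

def enrichment_vs_null(gwas_overlap_frac, array_snp_in_dhs_flags,
                       n_gwas, n_draws=10000, seed=0):
    """Empirical p-value sketch: resample sets of n_gwas SNPs from the
    genotyping-array background (a list of 0/1 in-DHS flags) and count how
    often their in-DHS fraction reaches the observed GWAS fraction."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        sample = [rng.choice(array_snp_in_dhs_flags) for _ in range(n_gwas)]
        if sum(sample) / n_gwas >= gwas_overlap_frac:
            exceed += 1
    return exceed / n_draws

# Hypothetical background in which 20% of array SNPs fall in DHSs
background = [1] * 200 + [0] * 800
p = enrichment_vs_null(0.53, background, n_gwas=100)
print(p)  # far below 0.01: 53% overlap is well above the 20% background
```

Using the array SNPs themselves as the background is what controls for the GC/probe ascertainment bias mentioned above, since the null sets inherit the same bias as the GWAS hits.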