Okay. If we can reconvene, we're going to proceed with the open session, with a series of three reports on workshops that have taken place since the last council meeting. I want to set this up by pointing out that the first two of these workshops were held on behalf of NIH; in fact, we received funds from headquarters to help put them on. They were held at the request of NIH leadership, who were interested in these topics, and we were the obvious lead to facilitate them. The first is on establishing a central resource for data from genome sequencing projects, and Lisa Brooks is going to give us a summary of that work. Before Lisa starts, one logistical thing Comfort asked me to pass along: if you haven't signed in on the sign-up sheets at the desk outside, please do. Thank you.

Because the full title is such a mouthful, we're going to call this the data aggregation workshop. I should say that this is something Adam Felsenfeld and I organized, taking on work that Teri Manolio had started. Basically, this is a problem arising from success. As you know, a great deal of human genome sequence data is being produced. As of June, the estimate was that by the end of this year we'd have 18,000 whole genomes and 54,000 exomes in at least 200 studies: the requisite exponential, even super-exponential, growth in sequence data. This has been a smashing success for Jeff Schloss and the sequencing technology program, but once the data are in use, you have to figure out what to do with it all.

The issue is that there are many genome sequence data sets, and there is enormous value in being able to analyze across many of them together. Many scientific questions are best addressed in aggregated data sets. You can do very large-scale GWAS and sequence association studies, much larger than is possible with any single study. Gene-gene and gene-environment interactions require very large sample sizes. And because genes, variants, and exposures are shared across diseases, it's very useful to look at multiple data sets for multiple diseases. Another idea that's been discussed is a kind of human knockout project; not that one is actually happening, but the notion is to look for people with loss-of-function variants, especially homozygous loss-of-function variants, or other interesting genotypes, or to look at the genotypes of people with extreme phenotypes. If you have a very large data set where people have been sequenced and there is a lot of phenotype data, you can look for those kinds of individuals.

So there are many scientific questions for which there's great value in looking across data sets together, but there are also many obstacles. This workshop focused on what scientific questions we're trying to address, what the obstacles are, and how we get past them: what should we do to allow analysis of aggregated data sets? One of the big issues, of course, is simply data access. Getting access to all these data sets is not easy. We discussed four models of data access. One is open data release, as in George Church's Personal Genome Project; it's also, of course, what the 1000 Genomes Project has done, where the data are simply out there and participants have consented to that.
The data are publicly available and very easy to find. So that's one model. The second is streamlining the current system. With dbGaP now, one has to ask permission for each data set, and if you're interested in 30 data sets, that's quite a lot of work to gain access to everything you want to combine. A third possibility, discussed as a way to overcome the problem of requesting 30 data sets separately, is a research commons: researchers would be registered, perhaps by NIH, and once certified they could simply have access to all the data sets at once, including the underlying individual-level data. Finally, there's the notion of a central data server, where some group would hold the underlying data sets and run analyses on them, providing researchers not with the individual-level data but with the results of the analyses. That has the interesting property of providing answers without releasing the underlying data. So those are the models that were considered.

There was also discussion of calling variants. It's very clear that variant calling works best on a huge set of data. If you need evidence to call a variant in sequence data, having a whole bunch of people lets you accumulate evidence for something that is fairly rare. Otherwise, you don't believe things that are only seen once, and you lose a lot of rare variants. With a very large data set, you can do a much better job of calling the variants; a toy illustration of this pooling effect follows at the end of this passage.

Another problem, simple to state, is harmonizing phenotype and environmental data. If you want to combine many different data sets, it really helps if the phenotype and exposure data are collected in standard ways. Blood pressure is always the example: blood pressure in one study should mean the same thing as blood pressure in another. We know that's actually fairly difficult; a toy sketch of what harmonizing even one measure involves also follows below. And of course there are the further problems of computing on large data sets and running the analyses.

The overall question of the meeting was: there are a lot of problems here, and a clear scientific reason for wanting to do this, so what do we have to do to make it happen? I want to thank our co-chairs, Michael Boehnke and Wylie Burke, who helped us plan, along with David Altshuler and Paul Flicek. We had staff from 10 NIH institutes and centers, so this was a trans-NIH activity, and almost 90 attendees from a range of areas.

A major consideration was how to deal with data already collected, where the consents have already been signed and the data are already there, versus what's needed going forward, where you can ask participants to consent to certain things and design your phenotype measures in certain ways. There's a difference between looking toward the past and looking toward the future. Very valuable data sets have already been collected, and huge amounts more data are coming. So what can be done now? Some things can be done right now; other things will require changes in policies.
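To make the variant-calling point concrete, here is a minimal sketch of why the same weak evidence becomes convincing when it recurs across an aggregated call set. The binomial null model, error probability, and sample count are assumptions chosen for illustration; none of these numbers come from the workshop itself.

```python
# Toy null model: how often would sequencing error alone make a candidate
# variant appear in k or more samples of an aggregated call set?
from scipy.stats import binom

P_ERR = 1e-5        # assumed per-sample chance that error alone mimics the
                    # variant at this site (illustrative, not a measured rate)
N_SAMPLES = 10_000  # assumed size of the aggregated call set

for k in (1, 3, 5):
    # Survival function: P(X >= k) for X ~ Binomial(N_SAMPLES, P_ERR)
    p_null = binom.sf(k - 1, N_SAMPLES, P_ERR)
    print(f"allele seen in {k} sample(s): P(error alone) = {p_null:.2g}")
```

Under this toy null, a singleton has roughly a 10% chance of arising from error at any given site, far too high to trust across millions of sites, while the same allele recurring in three or five samples is far harder to explain as error. That is the sense in which aggregation rescues rare variants.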
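And to show what the blood pressure example means in practice, here is a minimal harmonization sketch. The study names, field names, units, and conversion rule are all invented for illustration.

```python
# Toy harmonization: two studies record systolic blood pressure differently,
# so each needs its own rule to map onto a common definition (mmHg).

def harmonize_sbp(record: dict, study: str) -> float:
    """Return systolic blood pressure in mmHg under a common definition."""
    if study == "study_A":
        return record["SBP"]                 # already a single reading in mmHg
    if study == "study_B":
        readings_kpa = record["bp_sys_kpa"]  # repeated readings, in kPa
        mean_kpa = sum(readings_kpa) / len(readings_kpa)
        return mean_kpa * 7.50062            # convert kPa to mmHg
    raise ValueError(f"no harmonization rule for {study}")

print(harmonize_sbp({"SBP": 128.0}, "study_A"))                # 128.0
print(harmonize_sbp({"bp_sys_kpa": [16.0, 16.8]}, "study_B"))  # about 123
```

Even this trivial case needs a per-study rule, which hints at why doing it across dozens of real studies, with differing protocols, is hard.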
So the question is what sorts of policy changes are needed, and how to make them happen. Another point, of course, is that the solutions don't have to apply to everything. There is so much data out there, and so many data sets coming, that solutions applying to 90% or even 50% of the data will still be extremely valuable. Solutions that apply only to new data, not to old data, will also be extremely valuable.

The meeting produced nine recommendations, and I'm going to go through them. First, these data, the sequence data and the phenotype and exposure data, should be deposited in one or several central databases. It is very hard to access data sets if they're scattered all over the place, so central databases are a very efficient way of bringing them together.

Second, the consent procedure should seek permission for broad data use. That means broad sets of users: NIH feels there is real value in having companies, for instance, be able to develop drugs and therapies; we think that's a good idea, so we don't want situations where data are unavailable to commercial users. It also means broad use in the sense of not being disease-specific. There's a lot of interesting evidence that the mechanisms underlying one disease relate to other diseases, so allowing broad use makes the data as useful as possible. There is absolutely a recognition that certain studies may be particularly sensitive, either because some populations are particularly sensitive about this, or because of topics like HIV and drug use that you really wouldn't want getting out in certain ways. So we're not talking about absolutely every single study. But for most studies this should be the default option, and researchers should have to justify why anything else is being considered. That's a very important one.

Third, new governance procedures should be created to oversee these central databases, so that the public has input and there is accountability.

Next, summary data. You all know about the Homer et al. paper from David Craig's group. Based on that, allele frequencies that had previously been publicly released were put behind dbGaP access, so you now have to ask permission in the dbGaP way, and they are no longer publicly available. There was a strong consensus at this meeting that these summary data are extremely useful: knowing allele frequencies, the significance of allele frequency differences between case and control groups, and the magnitude of allele effects. The risk to individual participants, while not zero, is fairly small, and the value is so large that we really want to work to change that NIH policy so that summary data, not the individual-level data but the summary data, can be made public; a small worked example of what summary data alone support appears below.

Next, the process for accessing data from dbGaP should be streamlined. That's the near term; on a somewhat longer time frame, there's the registered-user idea discussed before, a system in which it is actually very easy for people to get access to data sets.

Another major point was a let-many-flowers-bloom approach: there are multiple ways of aggregating and analyzing data, and that variety should be supported. There's not just one way of doing things.
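Returning to the summary-data recommendation, here is a minimal sketch of the kinds of analyses that need only summary counts and no individual-level genotypes. The allele counts are invented for illustration and come from no real study.

```python
# Toy example of analyses possible with summary data alone: allele
# frequencies, significance of the case/control difference, and effect size.
from scipy.stats import chi2_contingency

cases = (60, 1940)      # (ALT, REF) allele counts among cases (invented)
controls = (30, 1970)   # (ALT, REF) allele counts among controls (invented)

af_cases = cases[0] / sum(cases)
af_controls = controls[0] / sum(controls)
odds_ratio = (cases[0] * controls[1]) / (cases[1] * controls[0])
chi2, p_value, dof, _ = chi2_contingency([cases, controls])

print(f"allele frequency: cases {af_cases:.3f}, controls {af_controls:.3f}")
print(f"difference: chi2 = {chi2:.1f}, p = {p_value:.1e}, OR = {odds_ratio:.2f}")
```

Allele frequencies, a significance test for the case/control difference, and an odds ratio all fall out of four numbers, which is the point about how much scientific value summary data carry at comparatively little individual risk.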
Because of the very clear value of variant calls based on sequence data from very large, aggregated data sets, central processing of sequence data, from sequence reads to variant calls, would in particular be a very useful thing to do. And then, of course, harmonizing phenotype and exposure data retrospectively (oh, actually that got cut off on the slide) should be encouraged, and prospectively it should also be encouraged.

Okay, so these things are not all easy to do, but there are certainly steps that can be taken, and some have already been taken. There's been a lot of discussion with Laura, and there will be more, about the sorts of policy changes needed to implement some of these things; data access through dbGaP or a research commons and release of summary data are really policy matters, and we're talking about them. But some things are already happening. For instance, dbGaP has put together general research use data sets, I believe 19 of them so far, and with one access request you can get access to all of those data sets. That's definitely moving in the right direction. Things like supporting central databases and having broad consent are not things we can simply order up; they need a lot of discussion with researchers and with other NIH staff to build the understanding that this is actually important. Clearly we have to work with other initiatives like Big Data; some of the centers being discussed in the Big Data context relate to centers that would be useful for some of these processes. And clearly a lot more work is needed for things like registered users and phenotype harmonization. The report was just finished on Friday, so we'll get it to you as council members, and we'll probably be putting it up on the NHGRI website. We still need to do more work with it. So, Mike, do you want to say anything more about this report?

Lisa gave a really good summary. Perhaps the single most important thing she didn't say is what a fantastic job she did; midwifing this process through would not have happened in anywhere near as good a form without her fantastic effort. I just want to re-emphasize a couple of points she made. A lot of what we're talking about here is changing the default, the expectation of what should happen. It's not that everything should change; it's that instead of needing a justification for why we would make summary data available, we need a justification for why we shouldn't. When there is so little there that would seem objectionable, it seems very logical to turn that around. The same holds for broad consent. Surely we don't believe that broad consent is the right choice all of the time, but we do believe it probably should be the default unless there are compelling reasons otherwise. That's a little tougher than the first point, but thinking in those terms seems to make a lot of sense. And on the issue of existing data versus data to come, there was really strong consensus that 80% solutions are very worthy of taking forward.
Actually, I was impressed by the degree of consensus at that meeting around these key recommendations, even though it was a very broadly representative group. So I was pleased with how it came out, and I think it's important that we take forward as quickly as possible some of the relatively easier things. For the things that require policy changes, or at least need to be addressed as policy, it's going to be very important for some of the key leadership (and I'm looking at you, Eric) to actively use your bully pulpits to take these forward, if you believe in them, because some of these will not happen without leadership. But thanks, Lisa, you did a great job.

Let me pick up on Mike's theme and, as always, put some of these things into broader context. In my director's report I spoke about the recommendations of a working group of the Advisory Committee to the NIH Director, the Data and Informatics Working Group. Jill Mesirov was a member of that working group, and they delivered their report in June. A big part of it, packaged under the phrase "big data" but actually broader than big data, is now heading toward significant discussion at the end of this month by the institute directors about next steps. As I told you in the director's report, NHGRI staff have been asked to take a major leadership role in formulating the plan, and indeed this workshop report has fed into that, and into an overall framework for a proposal that the NIH institute directors will discuss in a few weeks. And Mike's absolutely right: this isn't just about building this or building that. I feel very much that this is a campaign; maybe that's exactly what you were saying. It's not just that we need to fund this RFA. There are cultural changes underneath, philosophical points that really need to be stressed, and we're in the midst of it. It's going to require lots of things; money is one part, an important part, but it's not just money. So you're absolutely right, and this will be a very important part of it. Teri?
I will point out, actually, for this particular one, without being able to predict what the future holds for any of these possible initiatives, that the history here is that the idea started to bubble up: we have all this sequence, but it's not aggregated, we can't compute on it, we can't do this, that, and the other. As Lisa alluded to, Teri Manolio was originally asked by Francis Collins to do an inventory. That was done about a year, 15 months ago, and was presented at a retreat, a leadership forum of the institute and center directors, about a year ago, September of last year. Teri presented it, and there was resounding enthusiasm for taking it to the next level. As always, we got asked to lead, which, as always, led to a workshop. So here we are a year later. That said, I don't think these kinds of recommendations are going to be controversial, and the other institute directors are not going to be seeing this for the first time, so I'm hoping that at least this slice of the bigger pie will be non-controversial; we've already softened the ground with the presentation a year ago.

Yes, I just want to say, in support of your campaign for this, that I think it's a really good idea. A lot of the policy emphasis has been on protection and privacy issues, and making the data more useful is a big part of respecting the participants who contributed the data. I think that's an important message to get across.

Okay, so that was the first of the three workshops. The second workshop, again a trans-NIH workshop that we were asked to organize, will be reported by Teri Manolio: a workshop on sequencing in cohort studies and large sample collections.