I'm here to give an overview, more of a snapshot, of where we are with sequence data being produced by NIH. I'm not asking you to read any of this; it's just an example, a snapshot of part of what we call the inventory of NIH sequencing projects. Most of what I'm going to talk about today is derived from this, and it's available at that URL. A couple of notes: this information is projected through roughly the end of 2012, and this is a spreadsheet, not a database. The data have been gathered manually from many program directors at NIH, so there are some inconsistencies, but it's extremely useful for planning.

Some high-level numbers: by the end of 2012 we expect just under 200 projects and just under 69,000 samples sequenced. Roughly 18,000 of those are whole genome sequencing, though there is some overlap between whole genome and whole exome sequencing. These counts cover whole genome and whole exome only, not other targeted sequencing, and some fairly large targeted studies are therefore excluded. It's growing fast: the current count is just shy of 30,000 samples, and again we expect 69,000 samples by the end of the year. They may not all be in dbGaP by that time, but they should be done.

The data are organized into projects, and there are different ways of taking slices across them. There are many small projects of fewer than a hundred samples, and a few quite large studies of a thousand individuals or more; I think it's interesting that roughly 50,000 of the samples live in that large-study bin. It's easy to take other cuts of the data. You can look at the distribution across different areas of research. This categorization is arbitrary; I did it by eyeballing the descriptions of the studies, so of course many fall into multiple categories, there's overlap between some, the Mendelian count is surely an underestimate, and so on. The vast majority of these studies are cancer studies.
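The kind of slicing described above is straightforward to reproduce over the inventory spreadsheet. Here is a minimal sketch; since the inventory is a manually gathered spreadsheet, the record layout and field names (`area`, `n_samples`) are hypothetical, not the actual column headers:

```python
from collections import Counter

# Hypothetical rows from the inventory spreadsheet; real column names may differ.
projects = [
    {"name": "P1", "area": "cancer",    "n_samples": 40},
    {"name": "P2", "area": "cancer",    "n_samples": 2500},
    {"name": "P3", "area": "heart",     "n_samples": 1200},
    {"name": "P4", "area": "autism",    "n_samples": 950},
    {"name": "P5", "area": "Mendelian", "n_samples": 60},
]

# One cut: tally samples per research area (in reality a study may
# fall into multiple categories, which this simple tally ignores).
samples_by_area = Counter()
for p in projects:
    samples_by_area[p["area"]] += p["n_samples"]

# Another cut: bin projects by size, separating the many small projects
# (fewer than a hundred samples) from the few large ones (a thousand or more).
def size_bin(n):
    if n < 100:
        return "small"
    return "large" if n >= 1000 else "medium"

projects_by_bin = Counter(size_bin(p["n_samples"]) for p in projects)

print(samples_by_area["cancer"])  # 2540
print(projects_by_bin["small"])   # 2
```

With real data, the interesting figure from the talk is how many of the samples fall in the large-study bin, which the same grouping would show directly.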
The largest studies probably live in the heart, diabetes, and autism bins right now. Thinking about other interesting ways to look at this, I came up with the idea of high-value samples, high value for our purposes today. Out of the roughly 68,000 total, about 26,000 samples have no data use limitations. They're still almost all in dbGaP, but they don't say "for disease X" specifically, so they are potentially easy pickings for aggregation. And 2,600 of those, the 1000 Genomes samples, have no restrictions at all, though they have no phenotype data. Just under 16,000 samples have consent conditions that permit participant re-contact, so in principle one could go back and re-phenotype them; for example, we've been talking about how, when you find a very interesting genotype, it would sometimes be desirable to go back and do re-phenotyping. I could only find 25,000 samples for which physical samples, blood or cell cultures, are available. I think that's probably low and may just be an artifact of inconsistent reporting.

The things we definitely want to know, some of which are in this inventory but difficult to tell because they're reported inconsistently: phenotype data, very important; exposure data; population data; age of participant; et cetera.

Of course all the data are in dbGaP, and we all know the problems. The data are balkanized: there are 200 projects and roughly 400 consent groups, one must go through multiple data access committees to access them, and there are inconsistent metadata, inconsistent quality, and all the other things that have been talked about today. The problem scales with these issues. And not all the data are in dbGaP; some NIH-funded cancer data are now in CGHub, which is another thing that has to be taken into account.
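The "high value" cuts above are just compound filters over per-sample records. A sketch of what that looks like, again with illustrative field names (`use_limitations`, `recontact_ok`, `physical_sample`) that are assumptions, not actual dbGaP fields:

```python
# Hypothetical per-sample records; field names are illustrative only.
samples = [
    {"id": "s1", "use_limitations": None,        "recontact_ok": True,  "physical_sample": "blood"},
    {"id": "s2", "use_limitations": "disease X", "recontact_ok": False, "physical_sample": None},
    {"id": "s3", "use_limitations": None,        "recontact_ok": False, "physical_sample": "cell line"},
]

# "High value" in the talk's first sense: no data use limitations,
# so the sample is not restricted to research on a specific disease.
no_limits = [s for s in samples if s["use_limitations"] is None]

# Samples whose consent permits re-contact, and hence re-phenotyping.
recontactable = [s for s in samples if s["recontact_ok"]]

# Samples with physical material (blood or cell cultures) still available.
with_material = [s for s in samples if s["physical_sample"] is not None]

print(len(no_limits), len(recontactable), len(with_material))  # 2 1 2
```

Because these are independent predicates, the same approach answers the later question of whether the high-value properties are concentrated in the same studies: intersect the filters and group by study.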
I think that disease-specific databases could proliferate, so whatever we come up with will have to be operable across multiple databases, and of course there are more data outside the U.S.

In summary, there are a lot of data, and it's easy to imagine this tripling by the end of 2013 to over 100,000 samples; it will probably be more. A question to keep in the back of our minds is: what number of samples should we plan for? If we do nothing, of course, these data will remain divided. On the other hand, people were talking about 50% solutions being useful, partial solutions being useful, and there are large numbers of samples concentrated in a few studies. What I didn't look at carefully enough is whether the high-value samples, by which I mean no restrictions and in principle recontactable, are concentrated in those studies. But it's far from optimal even in the best of circumstances, and it's not scalable.

With that I would like to thank a number of people: Nicholas Clem, Teri Manolio, and Ian Marperi for putting together the inventory, Nicholas for digesting the data and for some of the slides, and the folks at dbGaP, and Lisa. And with that, if anyone has any questions, I'll be happy to take them.

There are large projects outside the U.S., and you have a few friends from the U.K. What is now the total estimated number from all of those projects this year, and are there any similar discussions ongoing with those?

I don't know the answer to your first question. I really don't. And I'm not just thinking of the U.K.; I'm thinking of China, with some real unknowns about what's actually going to be available and along what kinds of timelines. But given the amount of sequencing capacity out there, I think NIH can't have more than half of it, probably something like a third at this point. So I wouldn't be at all surprised if, in five years, well, you can do the extrapolation.
To follow up on Aravinda's point, and also on the question you raised about populations: you can imagine there's going to be a lot of reticence among people outside the U.S. about depositing their data into dbGaP, and one of the really important things that could come out of this meeting is a way by which these data could be shared without needing to go into U.S. databases. The other point, which also dovetails with Aravinda's, is that Nature about a year ago put out a fairly informal little study of what genome sequencing was going to look like, and it predicted something like 30,000 genomes by the end of this year, with the vast majority coming from the populations that have already been largely studied: European populations, and now some East Asian populations, largely due to BGI and so on. The other thing that would be good to think a lot about, if we're building this resource for the future and for what the face of medical genomics is going to look like, is diversity. Some of us just came from a big conference on health disparities in genomics, and we should think about diversity in populations as we build the resource that's really going to be the driving force for the next 10 years or so.

Ewan?

Let me just say that, of course, there is an analogous system to dbGaP, though it's not a one-to-one mapping: the European Genome-phenome Archive. I don't have numbers to hand. We operate in a slightly different way, in that we have to work in a situation where multiple countries submit their data, so the relationship between the submitter and the database is slightly different from here; there's less centralized control, and that's good. We have submissions from across the world, including the U.S., so we don't have any boundaries in that sense. But what we don't have is a centralized planning process.
So we don't have the equivalent of your spreadsheet to look into the system. What we can do is make estimates, and the estimates are scary. Extrapolating the plots, and similar to here, the biggest driver is cancer. When we do our estimates of volume, the biggest driver by far is cancer, and once you've mentally solved that, you've implicitly solved everything else, at least volume-wise. I think here we're talking more about consent, which is a very different thing, so I appreciate that those two things are different.

Yeah, this reminds me that another issue that comes up is that the number of groups depositing sequence is going to influence how easy it is to centralize things, and I'd be interested to hear about your experience with that.

I agree that there are a lot of practicalities here that you have to merge in with the goals. You have to have working systems. I know it's obvious, but the system must work. In other words, submitters must be able to submit at a level where the data, when somebody else retrieves them, are actually reasonably good to go. They don't have to be perfect, but reasonably good to go. And that already places some constraints on the entire system.

My question is a ground-rules question. Since we don't know what it is, does it include tumor-normal pairs in cancer? Given the size of that histogram, are we going to combine them, basically any genomes and exomes from humans, into a single resource?

Yeah, I don't know. Oh, I see. Personally, I think there are two orthogonal axes here to explore. One is a consent axis: the mechanics of what people are allowed to do, when, and how. The other is a practical axis. I think it's really important to keep those two things orthogonal, because as soon as you start trying to blend them, you end up in a really complicated discussion.
So it's better to ask first what the consent process is for researchers to get access to a big enough set of things, and then, given a good set of practical ways of allowing researchers access, what the best way is to satisfy practical access to the data.

Anyone else? Yeah.

Is there an assumption that the raw read data and alignments will be part of this resource? Or are we still going to be discussing the issue of storing just called genotypes versus raw sequence versus some sort of compressed sequence? Is that part of the discussion?

I'm going to ask Lisa's help on this. I think we're not making any specific assumptions there. Yeah, Steve.

In our computational group, we considered access to raw sequence, genotypes, and haplotypes as important for the different questions we'll be facing in the next five years.

So all are on the table, I think. All are on the table, but there could be different strategies for how you manage each of those types of data; we want to make them all available to analysts. All right, thanks.