Let me begin, if I can, by asking Bob a question. You mentioned in your presentation that good-quality science is made up of several components, and one that you emphasized, actually three times, was replication, replication, replication. As we go from 5 to 10 to 20 to 50 to 100 genome-wide association studies, and as the numbers of individuals, cases, controls, and populations studied get larger and larger, it really would be wonderful to have the tools and the opportunity to conduct in silico replication. Are we near the point where we can have such tools and implement them effectively to allow for inexpensive replication?

Well, I think David addressed that a little bit in his presentation, and actually Jim is relevant here too. In fact, two of the recent publications on prostate cancer used the NCI results as another replication, since they were on the web. So I think, as more and more of the actual results of these studies get posted, you have the ability to replicate any of your findings instantaneously.

What about more global tools to allow for replication across traits? I imagine dbGaP might give some of that capability.

Well, we'd love to, but we're not sure how to do it. We do assume that people will be doing in silico replications, particularly where there's access to large cohort studies like Framingham, because there is a large number of phenotype measures available that might be appropriate to many of these studies, since they do physical exams on people. And so even though the exact measure may not be exactly the same, since participants are typed on large numbers of SNPs and you can impute across platforms, you could almost certainly find out whether your results were consistent or not.

We have questions from Teri Manolio and Debbie Nickerson.
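The in silico replication described here, checking your own hits against association results another group has posted on the web, can be sketched as a simple consistency test: same effect allele, same effect direction, nominal significance in the posted study. All SNP IDs, effect sizes, and thresholds below are hypothetical, chosen only for illustration:

```python
# Minimal sketch (hypothetical data): does a posted study's summary result
# support a discovery hit? Criteria: same effect direction after aligning
# effect alleles, and nominal significance in the posted study.

def replicates(discovery, posted, alpha=0.05):
    """Return SNP IDs whose posted results support the discovery finding."""
    supported = []
    for snp, (allele, beta, _p) in discovery.items():
        hit = posted.get(snp)
        if hit is None:
            continue  # SNP not reported (or not imputable) in the posted study
        p_allele, p_beta, p_p = hit
        # Flip the sign if the posted study reports the other allele.
        if p_allele != allele:
            p_beta = -p_beta
        if p_beta * beta > 0 and p_p < alpha:
            supported.append(snp)
    return supported

# Hypothetical summary statistics: SNP -> (effect allele, beta, p-value)
discovery = {"rs111": ("A", 0.18, 2e-8), "rs222": ("G", 0.10, 4e-7)}
posted    = {"rs111": ("A", 0.15, 1e-4), "rs222": ("C", 0.09, 0.3)}

print(replicates(discovery, posted))  # ['rs111']
```

In practice the alignment step also has to handle strand flips and allele-frequency checks, but the core of "replicating instantaneously" off a posted results file is no more than this kind of lookup.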
One thing that I think people would find very useful with these large databases is a very simple, very obvious way to cite them when we use them, and then to acknowledge that use in papers and other places. It would be just grand to say: to cite this (and you should, if you're using the data at all), here's what you put in your reference list. Is that available on the CGEMS website?

I mean, people did in fact reference where they really got the replication, and they referenced the website.

And to answer that for dbGaP: in the authorized-access download, you agree to data-use conditions. Typically those are things like "I'll only study eye disease," so the use is consistent with the consent, or "I promise not to try to re-identify individuals," but usually there are also a few sentences that say, "and I will cite this study," not by citing the database, but by citing the study, the original PIs. A different type of citation comes when you publish having analyzed a lot of this sort of data; the accession numbers are really for that. They don't really credit the database; they're for tracking the data.

Debbie?

So I'm curious, Jim: when you take the phenotype data, will you take only normalized phenotypes, or will you get the raw data as well? A lot of these large studies are going to be consortia of 10 epidemiological studies looking across maybe 20,000 to 30,000 people at a time. The one example I can think of is CARe, but those phenotypes will be normalized. It's kind of like the difference between having raw versus processed genotypes.

Well, just like the genotypes, we'd like to have both, and so we encourage our submitters to submit both. When they try to submit only one or the other, we get back to them and encourage them.
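The "normalized phenotypes" at issue here are often produced by something like a rank-based inverse normal transformation of the raw measure before meta-analysis; a minimal sketch, where the particular offset and the raw values are illustrative and not what any specific consortium actually ships:

```python
# Minimal sketch: rank-based inverse normal transform, one common way a raw
# quantitative phenotype gets "normalized." The rank offset of 0.5 is one
# convention (Blom-style 3/8 offsets are also used); the data are hypothetical.
from statistics import NormalDist

def inverse_normal_transform(values):
    """Map raw values to standard-normal scores based on their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices by ascending value
    z = [0.0] * n
    for rank, i in enumerate(order):
        z[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return z

raw_ldl = [210.0, 95.0, 130.0, 180.0, 102.0]  # hypothetical raw LDL values
scores = inverse_normal_transform(raw_ldl)
# The largest raw value maps to the largest score; scores are symmetric
# around zero, so the original distribution's shape is discarded.
```

The last comment is exactly the point raised in the exchange: once only the transformed variable is deposited, a downstream user can no longer see outliers, skew, or assay artifacts in the raw measure, which is why dbGaP asks for both.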
But of course, this is voluntary, so if they really only want to submit the derived variable and won't reveal the raw data, we can't really force it. I would say that, for the most part, most studies are willing to share it all.

It's kind of like, "I'll give you what the sequence is, but I won't give you the trace," so you won't know if it's really junky and whether the base calls are really true.

That's exactly right. So that's the problem. We have had extensive discussions with some submitters who didn't want to submit the background information, and like I say, in most cases it was resolved: they found that they could. In a few cases, they found they couldn't.

So I have a follow-up on Debbie's question, which is that there's a lot of accumulated knowledge that builds up when people do a study, right? And when you go into a new study, it can sometimes take, I don't know, a year, if you're that quick, to get up to speed on all the nuances. So is there a place in dbGaP where someone can say, "these are the things you need to watch out for"? How do you help people avoid making mistakes that someone more familiar with the data would not have made?

Well, that's a very good question. The original submitter can supply whatever documentation or comments they're able or willing to make. As you know, though, a lot of these things are sort of, "well, something funny is happening here"; you talk to the PI, and they say, "oh yeah, five years ago we changed exactly how we were doing that, and now we're doing something else." This has those same problems: if it's really still just sitting in the head of the PI, that's where it is. However, these data are not anonymous. The name of the group that supplied them and the people working on the project are there, and our assumption is that if someone really starts to dig into one of these datasets and finds interesting things, they're very likely to want to contact the PI and collaborate.
And that seems like a good thing. In fact, I would guess that the PIs may find themselves with more collaborators than they want; as opposed to losing their data, it's going to be the opposite.

Yeah, maybe I'd extend that into a comment about the candidate-gene SNPs and the seeming lack of replication. I went into doing these genome-wide associations, having done candidate-gene studies before, thinking, you know, the genome-wide approach is just going to find so much more; we're going to learn that the candidate-gene approach really didn't work very well. And what's amazing is that, especially for traits like lipids, the things coming up at the top are the known candidate genes, and that's true for diabetes as well. So I would see the glass as more full for candidate genes: maybe not across the whole spectrum, but in terms of having actually picked out of the whole genome the things that were associated, it's pretty amazing.

I agree with you that the diabetes and lipids fields did a much better job than the cancer people did. I'm speaking from my own view of the cancer perspective: we've done an absolutely terrible job, myself included, with candidate genes. And that's why I'm so pleased with the GWAS results, because we can now actually start to look at gene-environment interactions for genes that appear to be important.

Bob, if I could return to the theme of replication. Several times today we heard about the Wellcome Trust Case Control Consortium, and it was a real tour de force: 14,000 cases divided among seven diseases, and 3,000 controls. This is a paper that was published in a top-tier journal, but I don't remember seeing replication in there. Does it surprise you that a genome-wide association paper was published in a top-tier journal without replication? David, do you have any more?

Actually, my question wasn't directed at the Wellcome Trust Consortium.
It was a question about journals and their criteria for publication in the absence of replication. Debbie, you had a comment on it?

Well, just that most of the things in that paper had already been replicated in prior publications. It is true that there were some things in there that had not been replicated, but for the most part, most of the findings reported had been reported before. Thank you.

Raju Kavindaraju has been waiting patiently.

Well, this is for Jim Ostell. You mentioned that dbGaP, as I perceive it, is geared toward individual investigators, so that they can apply individually. But probably no individual investigator, except perhaps a few, is clever enough to look at all the multiple aspects of the data. What if multiple investigators get together as a group, and they want to take this data to different places? Is there a mechanism built in to allow such a group to apply, get the data, take it elsewhere, and then come back and provide the results?

Well, this has been a hard thing for the NIH policy people to deal with, in terms of this broader sharing of individual-level data. The compromise position right now is that, even though the bar for getting at this data is lower than it traditionally was, each application has to come from one institution, because what they're looking for is validation from an accepted institution; it's actually the institution that's taking responsibility for the behavior of the investigator, the same as with a grant. So for a group of collaborators: you can list collaborators on the application, but only collaborators from the same institution, because the signing official is the one signing off on it. If you're going to be collaborating with people from another institution, they need to apply as well, so that their institution also signs off on a separate application.
One of the things you agree to is not to redistribute the data: essentially, you're agreeing that you will hold it and protect it, and that you're not going to put it somewhere where people can take it or Google can index it. So it is a little clumsy if you want to collaborate very flexibly, but this is a big step forward, and that's the compromise position for now. Thank you.

Just a quick comment, because it is two o'clock and we'll need to move on, about the replication requirements that were initially put in the literature, which journal editors were really quite strong in requiring. It became kind of obvious that sometimes people were splitting their samples and doing other things that may not have been as scientifically productive and useful. So while replication is very important, there may be reasons not to insist on it, and that's somewhat covered in the Nature paper that came out just the page before the Wellcome Trust Case Control Consortium paper. In that case, there really was a group of diseases with many things in common that the editors thought was much more important to get published than to wait for replication. Thank you. One more. Quick.

I just want to say that this is all very exciting, but there is still a bit of trepidation that I think several people feel. So it's exciting, but it's also scary, and it has to do with maintaining the confidentiality of our study participants: it is conceivable that there are situations where a genotype could be used to infer identity, even if the name and the traditional identification factors aren't in there. So I just want to get it out there that there still is an issue. And part of the issue as well is that there's confusion about what exactly is going to be required, both of data providers and of data users. So there's still, I think, a policy of some sort that NIH is coming up with.
So I wonder if there's any hint we could get on the guidelines for when data needs to be contributed, and what kind of data. Is it everything on the questionnaire? I mean, you could really go on and on about what kinds of clinical data. Is there any information we could get on that?

Well, I hear it's covered in the next two talks. And I can also say that the policy is being finalized now, I understand, and they're talking about releasing it in August, I believe. So I don't want to jump the gun, because a lot of those decisions are being made by people who aren't me, or by lawyers.