So I'm going to talk about what happens once people decide they do want to share the data. We still have to do something with it. Based on the impetus from the GAIN project, which is a public-private partnership between NIH and a number of commercial pharmaceutical companies, and the Framingham SHARe project (I'm sure all of you have heard of Framingham), who had decided to make their data more widely available, we developed a database called dbGaP. And it's now taking information from lots of other studies and initiatives as well. So we had to develop this very quickly. We had to design it with the idea that every study was going to be different. So rather than attempt to model the studies themselves, we attempted to model the design of the study. So for the main elements of a whole genome association, you need a model of the phenotype. You collect a genotype, and you attempt to measure the association between the two. For phenotype, typically the protocols, questionnaires and documentation are the best explanation of the measured attributes, but they're text. And they're often on paper, yellowing in filing cabinets. They're scanned PDFs. In many cases, they're unavailable even to the sponsoring NIH Institute. And so you typically have to talk to the PI to understand the study. And it's very difficult to get a good sense of the guts of the study if those aren't available. The data is relatively simple in structure. It's typically a square table where the row is the individual and the columns are measures. The problem is that the column names are often obscure. And they're in lots and lots of different formats, sitting at different sites scattered all over the place. What would normally tie these two things together, the columns of data and the meaning, is something called a data dictionary. And these are often unavailable. Or they're embedded in a SAS system and not widely available, not readily computable.
So what do we do if we want to rapidly intake a large phenotype study? Well, first we try to have minimum barriers. So we take the data however the PI has it. We then automatically load the columns and rows of the data into a generic database which doesn't model phenotype, it models a square table. And that means we can use the database to quickly analyze the contents of the columns. We can look and see, oh, there's just numbers in this column, so this is probably an integer measure, and we can calculate a mean and standard deviation, for example. We can search for things that we shouldn't have in the database, like dates and social security numbers and things like that. And we can look and say, for example, a column is a set of strings, but if we count up the occurrences of those strings, 50% of them are the string male and 49% are the string female and 1% are something else. And then we can review this report with the submitter. And for example, in that last example, the PI can look at this and say, oh, no, actually, there's some mistakes here, and we can correct it. It's much faster for us to take the data from the PI and then tell them what we see and have them correct us than it is to ask the PI to describe the data to us. And that's just based on long experience.
We organize the documents next. NCBI has lots of experience with electronic documents from doing PubMed and PubMed Central and online books and things. So we basically take whatever documentation they have, whatever form it is, and we actually mark it up on paper, where we circle sections of the document and say this one goes to column one, this one goes to column two. We review it with the PI again after that process. And then we mark it up into XML, which is an electronic markup language, sort of like HTML, which you use to make webpages. Once we have it in that form, though, it means that we can display it back as an HTML page, you can read it and it looks nice.
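The column checks described above (guessing that a column is numeric, computing a mean and standard deviation, counting how often each string occurs, and flagging values that look like dates or social security numbers) can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual dbGaP loader: the function name `profile_column`, the two regular expressions, and the report fields are all assumptions made for the example.

```python
import re
import statistics
from collections import Counter

# Hypothetical patterns for values that should not appear in a shared table.
SSN_LIKE = re.compile(r"^\d{3}-\d{2}-\d{4}$")
DATE_LIKE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$|^\d{4}-\d{2}-\d{2}$")


def is_number(text):
    """True if the string parses as a number (note: also accepts 'nan' and 'inf')."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile_column(name, values):
    """Summarize one column of a square table, given as a list of strings."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    report = {
        "name": name,
        "n": len(present),
        "missing": len(values) - len(present),
        # Count values that look like things we should not be holding.
        "ssn_like": sum(1 for v in present if SSN_LIKE.match(v)),
        "date_like": sum(1 for v in present if DATE_LIKE.match(v)),
    }
    if present and all(is_number(v) for v in present):
        numbers = [float(v) for v in present]
        report["kind"] = "numeric"
        report["mean"] = statistics.mean(numbers)
        report["stdev"] = statistics.stdev(numbers) if len(numbers) > 1 else 0.0
    else:
        # Treat everything else as strings and report relative frequencies.
        report["kind"] = "string"
        counts = Counter(present)
        report["frequencies"] = {v: c / len(present) for v, c in counts.most_common()}
    return report


# Made-up column resembling the male/female/something-else case in the talk.
sex_report = profile_column("SEX", ["male"] * 50 + ["female"] * 49 + ["M"])
age_report = profile_column("AGE", ["40", "50", "60"])
```

A report like `sex_report` is the kind of thing that can be put in front of the submitter, so that the odd 1% gets noticed and corrected by the person who knows the study.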
Because it's a tagging language, we can also put the tag in that links the section of the document to the data column in the table, which means we can index it. So even though controlled vocabularies weren't used in a lot of these cases or there's slightly different wording, we have enough text that we can find words like blood pressure, heart rate, things like that. And we just index them as text. And so you can search this the same way you'd search PubMed. It's basically a free text search. You don't get everything perfectly, but in sort of a browsing mode, you can find the studies and sections of protocols that you might be interested in and might be relevant to you. In addition, because we've broken it up, it means we can deal with each subsection of the document independently. So we can show you the parts of documents from many studies that have the words blood pressure in them, as opposed to sort of browsing study by study. Here's just a quickie example of what happens. This is a real example. Under ethnic category in the original questionnaire, there's two choices, Hispanic or Latino or not Hispanic or Latino. The column on the left there is what we actually found. And so obviously, if you were interested in using this variable, you wanna know what's really behind it before you go to all the trouble of requesting the data and downloading it, planning your experiment, whatever it is. So what we can do is we can render the questionnaire on the web, just like this. And then if you put your mouse on it, we can tell you summary data of what's behind it. We can show you the summary data because there's no individual data here. There's no violation of privacy, it just gives you a sense. So you can actually kind of dig around in this database and get a sense of the studies and the quality of the data underneath. So with this basic design, it means for completely unrestricted public use, I can browse and search projects and studies. 
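To make the tagging idea concrete, here is a minimal sketch of document sections that carry a link to a data column and are indexed as plain text. The XML element and attribute names (`section`, `variable`), the sample text, and the functions `build_index` and `search` are invented for illustration; the talk does not describe the real schema or search engine.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical markup: element names, attribute names and text are illustrative only.
DOC = """
<document study="example-study">
  <section id="s1" variable="SBP7">
    <title>Blood pressure measurement</title>
    <p>Systolic blood pressure is recorded after five minutes of rest.</p>
  </section>
  <section id="s2" variable="HR7">
    <title>Pulse</title>
    <p>Heart rate is counted for thirty seconds.</p>
  </section>
</document>
"""


def build_index(xml_text):
    """Map each lowercase word to the (section id, variable) pairs whose text contains it."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)
    for section in root.iter("section"):
        text = " ".join(section.itertext()).lower()
        for word in set(re.findall(r"[a-z]+", text)):
            index[word].add((section.get("id"), section.get("variable")))
    return index


def search(index, query):
    """Return the sections that contain every word of the query (a plain AND search)."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    return set.intersection(*[index.get(word, set()) for word in words])


index = build_index(DOC)
blood_pressure_hits = search(index, "blood pressure")
```

Because each hit carries both a section and a column name, the same lookup can go from a search term to the relevant piece of a protocol and on to the variable behind it, without any controlled vocabulary.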
I can look at all the questionnaires, protocols, supporting documents. I can look at summary data for all the phenotype measures. I can look at summary data for all the genotype measures. I can identify studies of interest. We can point you at the process and the authorities you have to go to if you want to download individual-level data. This is all de-identified data, but because it's a lot of data and the genotypes could possibly be used to match you up to other genetic information, it does require an authorized access mechanism. You have to submit a research plan. You have to show that you're a legitimate researcher. It's not as high a bar as participating in identified work. You don't go through your IRB necessarily, although some studies are gonna require it, Framingham for one. But it can vary by the study. You can also view pre-computed and published associations. And that's because the association is also summary data. You're just showing the correlation between SNPs, for example, and columns in these data tables. And so those can be public. All these public items have accession numbers, which I'm gonna show you when I demonstrate the database, that are citable just like a GenBank accession number or just like a PubMed ID. And the idea is that when you publish a paper that's based on these studies, you can actually refer to exactly which column of data you are using in your analysis. If the data's been updated since you looked at it, it'll get an incremental version number. So you can say I used version two or I used version one. In addition, it's possible to submit the association itself as a data table, which is basically a column of SNPs, the column ID in the database that you are associating with, the scores for the association, and then a link to your publication. And I'll show you an example of that.
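The versioning and the submitted association table can be pictured with a small sketch. Everything here is hypothetical: the accession string `VAR000123`, the `.v1`/`.v2` suffix format, the class names, and the sample values are made up for the example and are not the real dbGaP accession scheme.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedItem:
    """A public item where every revision stays retrievable under its own version number."""
    accession: str                      # hypothetical identifier, e.g. "VAR000123"
    versions: list = field(default_factory=list)

    def update(self, content):
        """Store a new revision and return the citation for it."""
        self.versions.append(content)
        return self.cite()

    def cite(self, version=None):
        """Return a citable string for a given version (default: the latest)."""
        version = len(self.versions) if version is None else version
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"{self.accession} has no version {version}")
        return f"{self.accession}.v{version}"

    def get(self, version):
        """Return the content that was stored for an earlier or current version."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"{self.accession} has no version {version}")
        return self.versions[version - 1]


@dataclass(frozen=True)
class AssociationRow:
    """One row of a submitted association table."""
    snp: str           # SNP identifier
    variable: str      # versioned citation of the phenotype column
    score: float       # association score, e.g. a p-value
    publication: str   # link or citation for the paper


# Made-up values: the column is updated once, and the old version is still citable.
column = VersionedItem("VAR000123")
column.update({"mean": 128.4})
latest = column.update({"mean": 128.1})
row = AssociationRow("rs0000001", column.cite(1), 1e-6, "placeholder citation")
```

The point of the design is that an analysis pinned to version one keeps pointing at exactly the data it used, even after version two appears.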
And I'll just mention that editors of both Nature and Science have seen this and said they think it's great, and they wanna know when they can start requiring accessions in order to publish an association, so that you can actually replicate or test people's assertions in these.
For authorized use, the only thing you have to be authorized for is to download the individual data, the data on individuals for genotype and phenotype, to do the recomputation.
For the genotype data, I'll just quickly go through this. In general, we're trying to work as much as possible with the Institutes sponsoring genome-wide associations early. In some cases, for example with the GAIN project, the genotype data is coming directly to NCBI from the genotyping vendors. We're doing the handshake with the vendor because, as was mentioned by a number of people, this is a lot of data. It takes a certain amount of technical expertise to handle it. And so that way we're dealing with sort of the three vendors, as opposed to every individual lab having to deal with various vendors. For some of the studies, we've dealt with big centers like CIDR, for example, for the macular degeneration study. We're collecting both the genotype calls and the underlying intensity data wherever we can get it. One reason to collect the intensity data, and I'll just say it again, is for these quality control checks, for example, where you can look and say, well, I have an outlier, is it because it's a bad SNP? And you need the intensity data to be able to see that. The other thing is that this is a developing science, and so people are changing the algorithms they use to call the alleles from the intensity data. If we have the intensity data, it means when a new algorithm comes out from the vendor, we can actually re-call all the genotypes that are already in the database and produce allele calls version two, as opposed to everybody having to do it. And we will have both available.
So that if you've downloaded version one, that's still valid, you can still refer to it, but you can also go back and get version two without having to recompute it again.
All right, so this is just a short list of current activities. It's actually expanded quite a bit beyond this since I made this slide. And what we're finding is that people are sort of coming out of the woodwork. We're not seeing, in a sense, too much pushback on this. There is some pushback, but instead we're tending to see people discovering this and saying, oh, actually, I'm already collaborating widely, this would be very helpful to me to be able to put my study in here. So this seems to be taking off very fast. I'm also just mentioning here the resequencing study. So we mentioned that once you have an association, you wanna start to do deep sequencing to look for the causative alleles. And there's already projects starting to do that. And part of the notion of this database, the reason it's called genotype and phenotype, is because there's sort of a phenotype component, which is all the phenotype measures. And then genotype can exist in a number of different other sort of satellite databases. So genotyping data is one type of data, resequencing data is another type of data and would exist in another resource. Fine mapping may be yet different than that. But those can all be coupled to the same individual phenotypes in the original study, because typically people are gonna go back and do more work on the same individuals, especially if the DNA is available or cell lines are available.
All right, so this is the quick demo. You can find dbGaP either by going to that URL up there, and I can never remember URLs. So my other suggestion is, almost everybody uses PubMed. You can go to PubMed, pull down the menu, go down to dbGaP on that menu, push go without a query, and you'll go to the dbGaP website. Again, this is an older screenshot. There's a much longer list.
All the GAIN studies are listed here now. But in the main screen here, you can see that you can browse by study. These were the first two studies in the database. You can expand sections in the case of a study that has multiple components. So for example, Framingham has several exams, and there are other sub-studies within it. It's organized that way. So you can sort of navigate the structure of the study, be it temporal or be it by subspecialty. In the case of the Parkinson's study, there's actually separate groups of cases and controls, because there's a repository of controls from which they are drawn for these studies. So another thing that you'll start to see here is, where there are pooled common controls, you get overlapping sets of those used in different association studies. So the controls, in a sense, are not copied over and over again, they're referenced. So if we use this individual, even though they're anonymized, you still reference the same underlying individual.
All right, you can also browse by disease. So we classify the studies by using the MeSH terms that are used in PubMed to classify the disease. And you can also then expand by study within the disease.
For each study, there's a main study page. This is the one for the macular degeneration study from NEI, AREDS. And you can see in the upper left-hand corner there the accession number and the version number. So you can cite the study if you wanna cite it in a general way. There's a summary of the overall study that's worked out with the study provider, just to tell you what it's like. It includes a timeline, and it includes a set of publications provided by the PI of the study that are relevant to that study. There's the diseases covered by the study, the principal investigator, so it's kind of a landing spot, with access restrictions at the bottom and inclusion-exclusion criteria. On the right are the subsections or the subpieces of how you might wanna look at this study.
There's associated analyses. In this case, there's one genome-wide association here. This is one we calculated in concert with the AREDS group. And I'll show you that in a minute. There's the variables associated with the whole study. So you can just browse through if you're looking for something particular that you already know. But as you can see, these variable names are not terrifically informative. We do keep the column-heading names that are in the original study, because people working on the study refer to them by these names. And so you still wanna find them. And then we have all the associated documents here. If we go down and just look at one of them, you can get it. This is actually a fairly ugly page. We'll change that, but... This is the examination procedures document. You can get it as HTML. Oh, sorry. You can also search within the document, and we'll list all the variables that are described in this particular one. If you go to the HTML version, the document is presented effectively as a webpage. With the table of contents, you can navigate within the document conveniently like that. We also make a PDF version. So if you wanna download the document, have it locally, print it out, that sort of thing, they're available this way. Which is actually very convenient, and our Framingham collaborators got very excited about this. And they've been very helpful about tagging and getting the documents in, because they wanna use this for their own work.
All right, going back to our main study page, you can also search for things. So I could type in the search terms, systolic blood pressure. Searching through the studies, it tells me there's actually only one study, the AREDS study, that has those words in it. You can see from the folder tabs, there's 24 variables and three study documents that contain those words. If I go to the variables tab, it starts to list all the individual variables and the information about them.
These are all diastolic blood pressures, which you can quickly tell. And it's because in that document section, they also mentioned systolic blood pressure. This is the imperfection of going by text. But you can pretty quickly find the systolic blood pressure measures. And then that takes you to a variable summary. In this case, we're showing you that this is at follow up year seven. We give you the overall statistics for it, number of cases and controls, females, males, means and standard deviations, some sample values, this is all public. Down at the bottom of the page, we're showing you the sections of the documents now that are linked to that variable, not the whole document. In the blood pressure measurement section, you could just browse through that one little encapsulated bit if you wanted to. Or if you use the hot link, you can go look at the whole document in that part of the whole document to find it. So that allows you to get sort of from the variable to the document. And you may have made this connection because you searched the way I just searched, or you might make this connection because you did an association and you got a hit to this column. And now you can dig around a little more about how it was collected. There's the questionnaire that goes with it. So going back up to our association study, if you click on the link to the association, there's an association page which describes how the association was calculated. This was against the AMD stat variable, which is a derived variable. So they did lots of measures of acuity and visual and then classified them by whether or not you have macular degeneration. If you want to browse the analysis across the genome, you get this little genome browser. Each one of those is a chromosome. And the color of the boxes shows you the highest scoring hit in that bin. 
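The genome overview just described, where each box is colored by the highest-scoring hit in its bin, comes down to a simple grouping step. The sketch below assumes fixed-width bins and uses -log10 of the p-value as the score; the bin size, the function name `best_hit_per_bin`, and the sample hits are assumptions for illustration, since the talk does not say how the real browser bins or scores.

```python
import math


def best_hit_per_bin(hits, bin_size=5_000_000):
    """Group SNP hits into fixed-width bins and keep the strongest one per bin.

    hits: iterable of (chromosome, position, p_value) with p_value > 0.
    Returns {(chromosome, bin_index): (score, position)} with score = -log10(p).
    """
    best = {}
    for chromosome, position, p_value in hits:
        key = (chromosome, position // bin_size)
        score = -math.log10(p_value)
        if key not in best or score > best[key][0]:
            best[key] = (score, position)
    return best


# Made-up hits: two fall in the first bin of chromosome 1, one in the second bin.
example_hits = [
    ("1", 1_000_000, 1e-3),
    ("1", 2_000_000, 1e-8),
    ("1", 7_000_000, 0.5),
    ("2", 100, 0.01),
]
bins = best_hit_per_bin(example_hits)
```

Each value in `bins` is what a display would need to color one box and to know where to zoom when that box is clicked.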
You have a number of filters there on the left, so you can refilter this display based on p-value or the Hardy-Weinberg equilibrium, the minor allele frequency or the call rate. In this case, I'll just change the p-value, and now I'm just getting the very high scoring hits. If I click on that bin on chromosome one, I get a zoomed-in view of this region of the genome. That's the high scoring hit regions in the red. And right below it is a section of the genome browser, and you can see it's mapping down to complement factor H, which is nice because that's the right answer. And then in that little rolling window in the middle there, we're showing you the individual SNPs that are in that region and their scores and some of the background, minor allele frequency and that sort of statistical information for it. If I mouse over a section I'm interested in, that box scrolls and lights up in purple, it's kind of hard to see there, the SNPs that are right there. In this case I've added another map, so you can see the checkbox on the left. I just added the deCODE map, which is showing up down there at the bottom as additional markers, if you wanna see them in the context of where you are. You also see the G and the C to the left; that's summary data for specific SNPs. So the G is the genotype summary for that particular SNP in this study, and that gives you a report like this, where you can see the genotype frequency of cases versus controls, allele frequency, the success rate of this particular SNP, and some summary information down at the bottom. Again, you don't have to download all of this to see this, so you can decide if you're serious about this or not.
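The numbers in a per-SNP genotype summary like that follow directly from genotype counts, which is why they can be shown without exposing any individual. As a minimal sketch, assuming a biallelic SNP with alleles labeled A and B, and reading the "success rate" as the call rate (an assumption), the arithmetic looks like this. The function name `snp_summary` and the counts are made up for the example.

```python
def snp_summary(n_aa, n_ab, n_bb, n_missing=0):
    """Summary statistics for one biallelic SNP from genotype counts in one group."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        raise ValueError("no called genotypes")
    # Each AA individual carries two A alleles, each AB individual carries one.
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    return {
        "genotype_freq": {"AA": n_aa / called, "AB": n_ab / called, "BB": n_bb / called},
        "allele_freq": {"A": freq_a, "B": 1 - freq_a},
        "minor_allele_freq": min(freq_a, 1 - freq_a),
        "call_rate": called / (called + n_missing),
    }


# Made-up counts for the two groups shown side by side in the report.
cases = snp_summary(10, 40, 50)
controls = snp_summary(25, 50, 25, n_missing=3)
```

Comparing `cases["allele_freq"]` with `controls["allele_freq"]` gives the kind of at-a-glance difference the report is meant to show before anyone requests the individual-level data.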
If you go to the C, we show you the cluster plot, which you saw before. In this case it looks something like that, and here you can see that there's three nice clusters. But in that little circle, the three black Xs mean there were three that weren't called out of this set for this particular SNP, and that's summarized on the right. So of course, from this location, because of the rs number, you can go from the association down to the base pair.
This is a lightning summary of authorized access. If you decide from all that that you want to get access to the data, the process is you have to have a login account, and you can use your eRA login; if you apply for NIH grants, you have an account. You can log in using that. You can select the studies you're interested in. You fill out an application form saying what your research purpose is. That will go forward to your institution's signing official, who's the same signing official who signs for your grant applications, and that's basically the institution just saying, this is a legitimate researcher and we're not gonna abuse this data, do things like try to re-identify participants and that kind of thing. Once the signing official signs off, it goes through something called the data access committee, which just does sort of a quick review to make sure that your use looks appropriate. If you're granted access, then you can log in, and you have a little page, and you can pick the data you want and download it. And all those pieces end up closing the loop between publication, the genome, genotyping and phenotype. Thank you.