Thank you very much. I'm glad to see that so many people have stuck it out for this afternoon. I realise everybody is probably quite keen to get back to their hometown now that the meeting is drawing to an end. We've already heard open data sharing mentioned a few times. I think Winston Churchill used to say that you should say things three times in each speech to make sure the point gets driven home. So maybe this will be the last time that you hear this, but I doubt it. I think it's worth discussing.

I come from a disease lab background. The lab I run is a neurogenetics lab that works on the genetics of neurological disorders like stroke, Parkinson's, Alzheimer's, things like that. And almost without exception, I think the participants of the studies that I'm involved in, the patients with disease, their caregivers, and the controls who volunteer, would be horrified if they didn't think that we were doing the most we possibly can to try and figure out these diseases. And I think it wouldn't occur to many of them that we would worry so much about sharing the data. Of course it's valid to worry about privacy, but I think in general the patients really want us to share the data and get it out there. So this is what I'm going to be talking about today: persuading collaborators of the pros of open data access.

Clearly, particularly for epidemiological work, most of the cohorts have been worked on for years, decades even, and investigators have poured their entire careers, their entire lives, into collecting the data and making sure that the data is very well collected, solid data sets. It's really their career, all of their work, distilled into one large project or a couple of very large projects. So the knee-jerk reaction of course is: why should I share this data? I'm giving away my life's work, and more than that, I'm giving away the work of the postdocs and junior investigators who work so hard on these data. My funding will be in jeopardy when I give away these data; once I've let it out into the open it's gone forever and I have no control over its use, so why should I be funded to collect more? And another point, which comes down to the appropriate scientific analysis of data, is that you want to make sure that your data is being treated in a proper way, and really the investigator who collected the data is the best person to do that because they know the data so well.

So what are the pros? For genome-wide association, you've heard big numbers all day. You generate billions of data points in a good genome-wide association set, and with epidemiological cohorts you're usually typing many phenotypes, hundreds, thousands of phenotypes. There is simply too much for one investigator to do in a timely manner; to analyze 3,000 quantitative traits across half a million genotypes is just too much for one investigator to do in a rigorous way, making sure that they don't miss the obvious things and the not so obvious things. We've talked about spurious results in data, and odd results in data which actually might point you towards something that's particularly interesting; if you're analyzing many thousands of associations you're more likely to miss those. I think that sharing the data obviously creates goodwill in the community.
It really encourages collaboration, and this is going to be the theme of what I'm trying to get across: when you release your data to the public, either through databases like dbGaP or private databases, it doesn't generally encourage people to take your data and run. It really encourages them to contact you and try and work on collaborations, expanding upon your data and expanding upon their data. There is a final point which is really the absolute driver of this, and you've heard this many times today, and that is that generally no one cohort, no one investigator, can unequivocally prove association for effects that are fairly small, where you have an odds ratio of 1.1 to 1.4, something like that. Size absolutely matters. The number of patients, the number of samples that you have, matters.

This is a rough calculation of the sample size that's required to find effects with odds ratios of sort of 1.2 to 2; that's the gray numbers. You need several thousand individuals to be able to detect an effect reliably in your cohorts at a level that survives correction for multiple testing. But more than that, once you do find an effect with an odds ratio of 1.2 or 1.3, even in several studies where you have 1,000 cases and 1,000 controls, such as the diabetes studies shown here, three separate studies with a few thousand cases and a few thousand controls, you then need to be able to replicate those hits in even larger cohorts. So you should be able to read here that the final number of subjects analyzed in this study, which identified effects with odds ratios of about 1.1 to 1.4, was about 33,000 samples. And there really are no studies, or very few studies, where you could do that individually. This kind of analysis, to unequivocally show association and to really weed out the true hits, requires collaboration across investigators, across sites, across countries usually.
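To make that sample-size argument concrete, the sketch below runs the standard two-proportion power calculation for a single SNP under an allelic test, assuming equal numbers of cases and controls, a control risk-allele frequency of 0.3, 80% power, and a genome-wide significance threshold of 5e-8. The exact figures on the slide will differ, and the assumed allele frequency and threshold are mine, not the talk's:

```python
from scipy.stats import norm

def gwas_sample_size(p0, odds_ratio, alpha=5e-8, power=0.80):
    """Rough per-group sample size (cases = controls) for an allelic
    case-control test of one SNP at genome-wide significance.
    p0 is the assumed risk-allele frequency in controls."""
    # Risk-allele frequency in cases implied by the odds ratio
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    z_alpha = norm.isf(alpha / 2)   # two-sided significance threshold
    z_beta = norm.isf(1 - power)    # power term
    # Alleles needed per group for a two-proportion z-test
    alleles = (z_alpha + z_beta) ** 2 * (p0 * (1 - p0) + p1 * (1 - p1)) / (p1 - p0) ** 2
    return alleles / 2              # two alleles per individual

for odds_ratio in (1.2, 1.5, 2.0):
    n = gwas_sample_size(p0=0.3, odds_ratio=odds_ratio)
    print(f"odds ratio {odds_ratio}: roughly {n:,.0f} cases and {n:,.0f} controls")
```

Under these assumptions the requirement falls from several thousand cases and controls at an odds ratio of 1.2 to a few hundred at 2.0, which is exactly why the smaller effects can only be nailed down by pooling tens of thousands of samples across cohorts.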
There's another advantage to establishing these collaborations now, and that is that this is really the first of a wave of genome-wide technologies. Genetics is kind of different to most approaches in biology in that it's very driven by technical advances and increases in technology; that's what drives what we do. And genome-wide association has been just an incredible leap forward. Even three or four years ago, if we'd been talking about hoping to type a few hundred thousand SNPs reliably, accurately, and individually, I don't think I would have believed it, and yet now we can do this. But this, like all technologies, is transitional. We will move on from genome-wide association to assays of epigenetic variation, so being able to assay histone methylation in a genome-wide, quantitative, and fairly unbiased manner. Looking at digital expression, where we count transcripts by sequence and look at splice variants by sequence across the genome in different tissues; you can imagine doing this in brain and liver and kidney. Genome-wide assays of allelic expression, so seeing whether different alleles are expressed at different rates within an individual. High-density resequencing, which you've heard about: resequencing genomic regions, such as in the ENCODE project, to try and gather up rare variation. And of course genome-wide resequencing, which one would guess is only three or four years away from being affordable as an approach to disease.

All of these approaches, genome-wide association and all of these follow-on approaches, are going to require tens of thousands of samples to be able to get at real effects, real effects that are associated with traits or diseases. And I think that we have a really great opportunity now, as genome-wide association comes along, to get our ducks in a row, hence the ducks along the top, to really organize ourselves and make friends with our competitors and colleagues. Come up with analytical plans that will work. Come up with strategies for promoting the postdocs and junior faculty who work on these kinds of approaches, and really get our act together. Maybe we'll be lucky enough to be like physicists in a few years' time, having hundreds of authors but everybody getting their due credit. I never thought I would say lucky to be like a physicist, but there are no physicists in the audience.

So what are the rewards and challenges of open data sharing? We've just heard about this, so I'll talk about it with just one slide, and that is: ensuring that your sample collection and consent allow you to share data openly, or at some level of data sharing, taking into consideration things like the intellectual property of your host institution and tiered consents, so being able to track who consented to what level of data sharing. I think there is a lot of confusion; it's a transitional period in terms of institutional review boards and in terms of understanding the impact of genetics on our everyday lives, a transition that I would guess is going to play out over the next 10 years. We're all trying to get on the same page over what the real barriers to sharing these data are and what the perceived barriers are. I would imagine that all of the investigators here who've worked across institutions have realized that no two IRBs really act in the same way or agree on the same restrictions, so we really need better ideas about what the real barriers to this sharing of data are. I would say that this, again, is really worth doing now, as we get our ducks in a row, because more and more funding bodies, both government funding bodies and funding bodies outside of the government, are mandating that sharing at some level is part of the research plan of any grant that goes into an institution.

Then there is striving to use standardized phenotype and genotype measures that can be applied across studies. From a genotype perspective, this is difficult but not impossible. There are really only two main platforms at the moment, the Affymetrix and Illumina platforms, and each of those has different chips, and each of those has different releases and different versions, which vary ever so slightly, enough to give you spurious associations and enough to cause you some small worries. But these differences are known and they can be dealt with. Jim has already spoken about the advantages of using a database like dbGaP to help standardize measures of genotype and to be able to display the raw genotype data. And I think that being able to release the raw, unfiltered genotype data, being able to see the intensity plots, the B allele frequencies and log R ratios that give you these intensity plots, is key to being able to make sure that there is uniform data across and between studies.
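As an illustration of the kind of raw-intensity display being described, here is a small sketch that plots B allele frequency and log R ratio along one chromosome for a single sample. The file name and column names are assumptions for the example; in practice these values come out of the array-calling software's per-SNP export:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: Position, BAF (B allele frequency), LRR (log R ratio).
snps = pd.read_csv("sample_chr1_intensities.csv")

fig, (ax_baf, ax_lrr) = plt.subplots(2, 1, sharex=True, figsize=(10, 4))
ax_baf.plot(snps["Position"], snps["BAF"], ".", markersize=2)
ax_baf.set_ylabel("B allele frequency")
ax_lrr.plot(snps["Position"], snps["LRR"], ".", markersize=2)
ax_lrr.set_ylabel("log R ratio")
ax_lrr.set_xlabel("Position on chromosome 1 (bp)")
plt.tight_layout()
plt.show()
```

Plots like these make it easy to spot samples or batches whose intensity distributions drift between chip versions, which is exactly the kind of small platform difference that can produce spurious associations.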
So phenotype data, standardizing phenotype data, is much, much more of a problem in my opinion. We want to strive to standardize phenotype data both across studies that are looking at similar things and also across studies which are fairly disparate: can somebody who's interested in the genetics of blood pressure use blood pressure measurements that are taken in an Alzheimer's disease study, or in a Parkinson's disease study, or in a study done elsewhere in the world? I think that coming up with easily transferable phenotypic measures is a critical stage in our leverage of genetic association data. Unfortunately, I attend a lot of consensus conferences for neurological diseases, and getting neurologists to agree on a single diagnosis or a single way to look at something is almost impossible. And whenever you decide something by committee, there's always a downside; it's never the most efficient way. But I think it's important to decide on standard phenotypic measures and stick to them, at least in the short term, so that we have a core set of data.

Then there's deciding what, where, and how to post, and also the continued support of the data and reasonable responses to queries; no doubt when you post your data, you'll be asked about it by investigators who are interested in your field. There are several levels of data that you can post. You can simply post the association data: the p-values and the odds ratios that you achieve for each SNP and each trait (a small sketch of this level of release follows below). That's actually quite a complex thing for epidemiological studies, where you're likely to be looking at many traits across individuals, so you'll be posting a lot of data there. You can post genotype frequencies, or individual genotypes, which is I think one of the more powerful approaches; this allows people to apply novel analytical techniques to your data and actually dig down deeply into the data. And of course the release of data with the most utility is the raw image data, the CEL files, so that people can reconstruct the data, re-call the genotypes, and make sure that data across and between studies are called in a very similar or identical way. Another consideration, of course, is what level of phenotypic data: do you release all of your phenotypic data in one go, knowing that you're putting everything out there and allowing everybody to delve into your data immediately?

There's also a lot of discussion over when you should release your data. Release on publication, or after a certain amount of time following data production, seems to be a fairly standard approach; that seems to be the approach being taken in the GAIN initiative. One advantage of this, of course, is that it drives the investigator to analyze their data and get it out into the public domain as quickly as possible, knowing that they have only a certain amount of time to beat everybody to the punch. The alternative approach is to release your data immediately upon data production and data cleaning and know that you're in a race with everybody else to check your particular phenotypes.

Where should you deposit your data, and how? You can choose to deposit your data in a central database such as dbGaP, which I really feel is going to become the main repository for these kinds of data, and certainly NCBI have been dominant in hosting genomic-scale data over the past 10 years or so. You can also choose to deposit data, or subsets of your data, on institutional websites, or distribute them on an ad hoc basis, sending them out to investigators when they ask for them. I would say that dbGaP is the way to go.
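Going back to the most summarized level of release, the association statistics, here is a small sketch that computes an allelic odds ratio and p-value for a single SNP from case and control genotype counts. The counts are invented for the example, and a posted association file is essentially one such row per SNP per trait:

```python
from scipy.stats import fisher_exact

# Invented genotype counts (AA, AB, BB) for one SNP
cases    = {"AA": 310, "AB": 480, "BB": 210}
controls = {"AA": 360, "AB": 470, "BB": 170}

def allele_counts(genotypes):
    """Collapse genotype counts into counts of the B and A alleles."""
    b = 2 * genotypes["BB"] + genotypes["AB"]
    a = 2 * genotypes["AA"] + genotypes["AB"]
    return b, a

case_b, case_a = allele_counts(cases)
ctrl_b, ctrl_a = allele_counts(controls)

odds_ratio, p_value = fisher_exact([[case_b, case_a], [ctrl_b, ctrl_a]])
print(f"B allele odds ratio = {odds_ratio:.2f}, p = {p_value:.3g}")
```

Releasing individual genotypes rather than these summaries is what lets other groups rerun this kind of test with their own covariates, imputation, or novel methods.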
Having your data in a centralized database really means that it's a one-time investment. You send in your data dictionary, your phenotypic data, your genotypic data. You usually get a few questions from Jim and colleagues about what's missing, what they would expect, and any problems they see with the data. But once that's dealt with, that's essentially all you have to do with that data; a large amount of infrastructure burden has been removed from you. And there's the advantage that there's some standardized data cleaning and manipulation: things like standardized calling algorithms and imputation can be applied at dbGaP. And of course, as Jim demonstrated elegantly earlier, your data is immediately tied to all of the other databases within NCBI, so it's a nice way to be able to navigate through the data.

Depositing on a local website or distributing in an ad hoc way, you know, there are investigators who will burn their data onto a DVD and send it off in the mail, although many of these data sets are way too big for that kind of approach, requires a really significant amount of time and resources. And you have to be prepared to answer questions. When we first produced our data, we just put it up on a website through Coriell Cell Repositories as genotype and map files, and we were getting four or five queries a day from individuals who had downloaded the data, asking about an individual SNP, what direction it was in, or why one was missing. Just having the time to deal with this kind of thing is no small burden at all.

So I wanted to talk briefly about the experience of my lab. We've shared data openly for the past year or so. We're, I guess, a medium-sized genome-wide SNP genotyping operation. In the past year and a half, we've generated genome-wide data on about 6,000 individuals, so that's about three billion genotypes, using the Illumina Infinium platform mainly, although we also have some Affymetrix data. It's been on neurological diseases, so stroke, amyotrophic lateral sclerosis, Parkinson's disease, neurologically normal controls, Alzheimer's disease. But we've also been doing some other work on diverse populations from around the world, looking at genetic variability across populations, and some epidemiological cohorts, particularly those focusing on aging phenotypes; I'm from the National Institute on Aging, so that's directly related to the mission of my institute.

We've published four genome-wide association studies, very small genome-wide association studies really, with between 300 and 1,000 cases and between 300 and 1,000 controls in each of these. These were generated by six postdocs, and each one of those six postdocs now has a first author paper from these data, and each of them is a joint first author on the other five papers. So they all got quite a lot of bang for their six months' investment in this work: they all got essentially six first author papers from this work, and I think the journals are willing to publish large lists of joint first and joint last authors and to emphasize the role that each person has played in this kind of work. The paradigm for our kind of research with these small cohorts was not to find genes of moderate and small risk.
We were looking for genes of very large risk, but with the knowledge that they probably weren't there; the main aim was to generate an initial data set for neurological diseases, really get the ball rolling in Parkinson's disease, ALS, Alzheimer's disease, and stroke, and have those data in the public domain so that other investigators could mine the data and also augment the data. The nice part about genetic association data is that it's digital and can be added to. So we've released just over half a billion genotypes. These genotypes have been downloaded 500 times by 500 unique visitors from the Coriell website, and the data are also available at the dbGaP website. There are two manuscripts published by our competitors, and five other manuscripts in press or under preparation by other downloaders who have let us know that they've analyzed the data and have something interesting that they're prepping; these are either disease associations or people who are developing software to deal with these data and are using them as test data sets.

The real experience that I've had from releasing data out into the public domain has been an absolutely positive one, in that it's really encouraged collaboration. Most of the people who work on the data we've released have contacted me at one point to ask about the data, to ask about what else we're doing, to tell me about what they're doing, and really to ask about collaborations. They want to share their sample series with a lab that has experience in doing this kind of work, and they want to see if we're doing any more so that perhaps we can get together and find true risk genes. I think most people are realizing that we need very large cohorts and they can't do it by themselves, so everyone's starting to play nicely together.

An example of where this has really benefited us: we analyzed copy number variation in a few thousand individuals that we've assayed with the Infinium platform, and this is essentially looking at each chromosome by eye and scanning along the chromosome, looking for copy number changes, because the algorithms that are out there don't work particularly well. We were contacted by a lab that specializes in CNVs, and now we collaborate on producing an efficient algorithm to automatically detect CNVs, and that's going very well. If this works, which I feel it will, it will save us the work that we would otherwise put in on future chips. And to give you an idea of how much work defining copy number variation by eye is, it took six people from my lab three months, pretty much full time, to identify and plot all of the regions of copy number variation. So that's been a really positive outcome.
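To give a feel for what automating that by-eye CNV search involves, here is a toy sketch, not the collaborators' algorithm, that simply flags windows of SNPs whose mean log R ratio deviates from zero, since deletions depress and duplications raise the log R ratio. The window size, threshold, and simulated data are all assumptions, and real methods typically model both log R ratio and B allele frequency, for example with a hidden Markov model:

```python
import numpy as np

def flag_cnv_windows(positions, lrr, window=50, threshold=0.3):
    """Toy CNV screen: slide a fixed-size window of SNPs along a chromosome
    and flag windows whose mean log R ratio deviates from zero."""
    flagged = []
    for start in range(0, len(lrr) - window, window):
        mean_lrr = float(np.mean(lrr[start:start + window]))
        if abs(mean_lrr) > threshold:
            flagged.append((positions[start], positions[start + window - 1], mean_lrr))
    return flagged

# Simulated chromosome: diploid noise plus one deletion that lowers the log R ratio
rng = np.random.default_rng(0)
positions = np.arange(10_000) * 3_000      # roughly 30 Mb of SNP positions
lrr = rng.normal(0.0, 0.15, size=10_000)
lrr[4_000:4_300] -= 0.5

for start_bp, end_bp, mean_lrr in flag_cnv_windows(positions, lrr):
    print(f"possible CNV at {start_bp:,}-{end_bp:,} bp, mean log R ratio {mean_lrr:.2f}")
```

Even a crude screen like this turns the three-month by-eye exercise into something a machine can pre-filter, which is the value of the CNV collaboration described above.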
So the bottom line: I think that data sharing is becoming standard, and I think that most of the geneticists with large cohorts who perform these kinds of studies realize that at some point they have to share their data with their colleagues. It requires a certain amount of time, resources, and upfront planning to be most effective, thinking about how the genotypes should be deposited and how the phenotypes should be scored, but in my opinion it really repays the investment. If we're to move forward, we have to share the data, we have to work across studies, and collaborators, sorry, people in your field, generally don't want to just take the data away from you. They know, they understand, that you know the data you've produced better than anyone else, so they want your experience and your perspective on those data. And the final point, which of course will be a real driver, is that funding bodies are beginning to mandate data sharing. There are several charities, at least among those funding neurological disease research, where open data sharing is absolutely mandated by the funding body.