I was charged to give a presentation and encourage discussion on these four potential routes to increase researcher access. Supposedly you all read the papers on the plane coming over, so this is more of a summary to remind you of that reading. And just to remind you of one thing: I'm actually standing in for this handsome gentleman here, Paul Flicek. He is very sad that he can't be here, but he had a small genetics experiment that successfully completed about a week ago. So congratulations to Paul, Melissa, and William, who is the new addition.

I'm going to talk about some overarching goals and then these four areas. One of the givens to think about: some of this is about how researchers formally interact with data sets, and it's a total given, but worth reminding everybody, that we're going to do that in a way which is consistent with the consents. It has to be consistent with the consents.

Why are we trying to do this? There's the business of just making research go faster; David actually expressed this far better. One of the things I'm very aware of is that there's a lot of serendipity in science, and whenever you place barriers between a scientist and access to something, you cut down that chance of serendipity. Serendipity really is a huge driver, and very unpredictable, so the flatter and more open the system is, the more serendipity can occur.

Again, this is another obvious statement, but compliance with consents is not a zero-sum game with researcher access. There isn't some trade-off going on here; it's about creating a system which satisfies consents and maximises the utility of this data. And somebody mentioned this, Lucia, I think: research participants overwhelmingly seem to want to see this happen. Multiple times people have gone and asked why participants donate their samples and data and sometimes undergo quite extreme procedures; if you've ever had a double skin biopsy, it's not fun. I haven't done one. It's pretty impressive, and they expect all of that to be really helping research. That's one of the things to remember.

I'm also very aware that we shouldn't let 1% or 10% or even 20% of edge cases prevent a good solution. The cut-off is somewhere, maybe it's 50%; I'm certainly happy with an 80% solution. So there's this business of historical consents, and a business of special cases, for example studies looking at things where people naturally feel there's a higher risk of harm if the information were de-anonymised.

So there were four solutions presented in the papers, and I'm going to go through each one with a few pros and cons. My pros and cons are meant to stimulate debate; they're not really a summary set of pros and cons.

First, there's an open-access proposal: anonymised identifiers for genotype and phenotype information. That means you wouldn't have people's personal names, you wouldn't have postcodes or zip codes; everything would be under anonymised identifiers, but all the data would be open and downloadable. Of course the participants have to understand this and consent to it; one is never suggesting this is done outside the remit of the consents. This totally maximises the serendipity aspect.
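As an illustrative aside: under this open-access model, a released record would carry only an opaque, randomly assigned identifier, with no name, postcode, or other direct identifier. A minimal hypothetical sketch; all field names and values here are invented, not drawn from any real release:

```python
# Hypothetical open-access record: one opaque identifier links the genotype
# and phenotype data; no name, postcode/zip, or date of birth is released.
record = {
    "participant_id": "anon-7f3a92",   # random, stable, meaningless identifier
    "genotype": {"rs429358": "CT", "rs7412": "CC"},
    "phenotype": {"bmi": 27.4, "smoker": False},
}
```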
Interestingly enough, this model is already in widespread use for molecular phenotyping, and that's because of the HapMap samples. Of course many people don't think of humans just as cell lines and molecular phenotypes on cell lines, but it's worth pointing out that there is already big use of this. There's an interesting question in my mind about whether it could be extended to other phenotypes, normal phenotypes in particular; and it has already been extended to normal and disease phenotypes, and the PGP is a great example of that process.

So what are the pros? There's zero headache in researcher access; there's maximal use and reuse of the data; and it's the most likely, I think, to release the total utility of the data. The cons are that there's a small but higher risk of participant harm. You have to think quite carefully about how this would occur, but it certainly can occur. The authors of the paper, Laura and others, pointed out that there's an unknown risk to participation: would an open-access proposal change the participation rate, in particular among disadvantaged groups? That's very hard to know unless you actually run an open-access proposal and see what happens. And it's probably a harder sell to local IRBs and would be the biggest change from current practice, so the system would find it the most challenging to get its head around.

So what about streamlining our current access? In the paper written by Adam and some others, there were eight points. I'm not going to list all eight, though they are in your pack, as it were, but they're relatively obvious: consolidate DACs rather than having a DAC for every study, share more language and terms, have broad consents, and have standardised consents. For me all of these are relatively obvious and good things, and one should, as I'll come on to, definitely do them.

There is also an interesting proposal to change, or perhaps remove, a precautionary principle around the release of genotype numbers, that is, openly releasing genotype counts. I think this would be very, very useful and very powerful. Some people in this room know this debate. It was assumed four or five years ago that we would be able to release p-values and genotype numbers in aggregate, because nobody could de-anonymise an individual from them. But a very gifted geneticist went off and showed that, at least in a closed-world scenario where you have somebody's genotype, you can work out whether they came from the cases or the controls (a sketch of that statistic appears below). And on the back of that paper, because of the precautionary principle of not wanting to do something that would put participants at risk, the NIH and then the Wellcome Trust said no to the full release of p-values and genotype numbers. Now this, I think, is open; at the end the authors ask why we don't change this, and I think that's a good thing. It's certainly worth debating more properly and fully.

So, the pros here: it improves researcher access, and it sets up broad, consistent consents for the future. And releasing genotype numbers, and therefore p-values, is a broad level of reuse that we don't do now and would be great to do.
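To make the de-anonymisation result referred to above concrete: the Homer et al. attack rests on a simple distance statistic comparing an individual's genotypes against the allele frequencies of a mixture (e.g. the case group) versus a reference population. A minimal sketch follows; the simulated frequencies and all parameter values are illustrative only, not from the talk or the original paper:

```python
import numpy as np

def homer_statistic(y, pop_freq, mix_freq):
    """Homer-style membership statistic for one individual.

    y        : the individual's allele dosage per SNP, scaled to 0, 0.5 or 1
    pop_freq : allele frequencies in a reference population
    mix_freq : allele frequencies in the mixture being tested (e.g. the cases)

    Positive values indicate y is closer to the mixture than to the
    reference, i.e. evidence that the individual is in the mixture.
    """
    return float(np.sum(np.abs(y - pop_freq) - np.abs(y - mix_freq)))

# Illustrative simulation: an individual genuinely drawn from the "cases"
rng = np.random.default_rng(0)
n_snps = 10_000
pop = rng.uniform(0.1, 0.9, n_snps)                       # reference frequencies
cases = np.clip(pop + rng.normal(0, 0.02, n_snps), 0, 1)  # shifted case frequencies
y = rng.binomial(2, cases) / 2                            # the individual's genotypes
print(homer_statistic(y, pop, cases))                     # tends to be positive
```

Summed over enough SNPs, the statistic tends to be positive for true members of the mixture, which is the closed-world result that led to aggregate p-values and genotype counts being pulled from open release.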
There is this very low risk, if one takes Homer et al. as truth (there are people who dispute that this is really feasible in the real world), that you've changed the risk for the participants. But it's quite a complicated scenario, because you have to genotype someone, and presumably by genotyping someone you've already got quite extensive access to that individual. So again it's quite a complicated thing to work through. And interestingly enough, I don't know whether this is a true con or not, but it's worth remembering that this is our current status quo and it perpetuates the current system; my opinion is that we can do better than just streamlining.

So, a researcher commons? I think the best way to think about a researcher commons is that you're not changing any aspect of the consents, the researchers, or the current system; you're just moving a decision ahead: you're pre-authorising people against the consents. There's no change required in the legal mechanics of this process, except that one decision moves from being post-enquiry to being pre-enquiry. Individuals would be pre-authorised to see a class of data that has broad consents. This is very practical in the sense that many researchers could become such pre-authorised individuals, call them certified researchers, and then there could be large common data sets, consistent with those consents, which certified researchers could go and access.

It needs a certification authority, something that does this pre-authorisation. In the current scheme that is a DAC, and by taking the streamlining approach of having fewer DACs, one can imagine all the broad consents coming under one DAC, which would do the pre-authorisation. The whole thing then becomes identical to the current model; it's just that the decision is being made at a different time. Another option is to use other bodies, for example the American Society of Human Genetics, to certify researchers.

Two other good things about certification. First, in a certification scenario you could ask researchers to regularly, once a year or once every two years, update their certification, go to a seminar, and think about it properly. To be honest, I don't think we do a good enough job of ensuring researchers understand the difference between data where they have to keep the information separate and data which is fully open. If you go into a practising large-scale genetics or genomics lab, there can be quite a lot of mushiness between those two areas, and I think certification would help keep researchers aware of this. Second, one can imagine internationalising this: different groups would recognise certifications across different countries. It's the obvious way to do it: the US would recognise certifications coming from n other countries, and those countries would recognise the certification that comes from the US. It sounds complicated, but it's probably far less complicated than it seems, because everybody has the same goal of a common set of accesses.
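To illustrate the mechanics just described, only as a sketch: pre-authorisation reduces to a registry of certified researchers with expiring credentials, and internationalisation reduces to each authority keeping a list of peer authorities it recognises. The authority names and data structures below are invented for illustration and reflect no real system:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Certification:
    researcher_id: str
    authority: str          # e.g. a consolidated DAC or a learned society
    expires: date           # forces periodic re-certification

# Hypothetical mutual-recognition table: each authority lists peers it trusts.
RECOGNISED = {
    "US-DAC": {"US-DAC", "UK-DAC", "EU-DAC"},
    "UK-DAC": {"UK-DAC", "US-DAC"},
}

def may_access(cert: Certification, data_holder: str, today: date) -> bool:
    """A data holder grants access if the certificate is current and was
    issued by an authority the holder recognises."""
    return (cert.expires >= today
            and cert.authority in RECOGNISED.get(data_holder, set()))

cert = Certification("r-123", "UK-DAC", expires=date.today() + timedelta(days=365))
print(may_access(cert, "US-DAC", date.today()))  # True: US recognises UK here
```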
So, the pros: it improves researcher access; serendipitous research is more easily achieved; and it provides a context for centralised systems, either institutional or broader. What are the cons? There's a reputational risk, I think, of this seeming like researchers bending the rules inappropriately. I think that's an education issue and a presentation issue rather than a real issue, but it's there to think about. When you present this, it has to start from the correct starting point: we want to maximise utility, we're not changing anything about the way researchers interact with this data, we're just making it easier for utility to come out of it. And again, it perhaps perpetuates the current system to some extent.

The fourth option, I think, is in a different class: it's a technical option. The three other options were really about changing the formal way researchers get authorised for access; this one is a technical solution to some of that, though it does have a little piece of interaction, and it's on the list to discuss. So: a central server that provides analysis results. You can imagine a whole variety of levels here, for example the important but somewhat mundane process of imputation, and then different levels of calling, statistical modelling, and all sorts of other things. There's a particular suggestion put forward in this paper that low-level data, reads and genotypes, are kept private, but people can use a cloud-like or app-store-like infrastructure, so that flexibility is provided in the precise analysis routine executed on top (a sketch of what such an interface might look like follows below).

So the pros: it enables more research over the data sets, it might provide a mid-level access option, so we'd have a different level of access, and the heavy lifting happens only once. Now, that is also a con. The heavy lifting happening only once might cause an almighty IO bottleneck the world has never seen before, as all the servers on the planet try to access five disks or something like that. And as well as an IO bottleneck, there would undoubtedly have to be a help-desk function for something like this, so it may have a people bottleneck too. And there's the analogous con of perpetuating the current system.

For me, we shouldn't miss the chance to do the mundane business of releasing the full sets of p-values. The level of impact that would have is underappreciated; we just have to decide that it's okay for that to be a big result beyond the narrow statistical genetics community, reaching people who are, I think, somewhat sophisticated but closer to the set of biologists David mentioned.
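To illustrate the central-server idea described above: low-level data never leave the server, and callers may only invoke approved analyses that return aggregate results. A minimal sketch under that assumption; the class and method names are invented for illustration, not taken from the paper:

```python
import numpy as np
from scipy.stats import pearsonr

class AnalysisServer:
    """Toy model of the central-server idea: low-level data (reads,
    genotypes) stay on the server; callers get only aggregate results."""

    def __init__(self, genotypes: np.ndarray, phenotypes: np.ndarray):
        self._g = genotypes    # individuals x SNPs (0/1/2 allele counts)
        self._p = phenotypes   # one quantitative trait per individual

    def allele_frequency(self, snp: int) -> float:
        # An aggregate statistic: the kind of thing the proposal would expose
        return float(self._g[:, snp].mean() / 2)

    def association_p_value(self, snp: int) -> float:
        # Only the summary statistic leaves the server, never the genotypes
        _, p = pearsonr(self._g[:, snp], self._p)
        return float(p)

    def raw_genotypes(self):
        # Deliberately refused: read- and genotype-level export is barred
        raise PermissionError("low-level data do not leave the server")

# Usage with simulated data
rng = np.random.default_rng(1)
server = AnalysisServer(rng.integers(0, 3, (500, 100)), rng.normal(size=500))
print(server.allele_frequency(0), server.association_p_value(0))
```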
Now, this is just to stimulate debate; it's not trying to summarise. The first thing to say is that these are not set up as mutually incompatible models. We do not have to choose one of these; we're allowed to choose all of them at the same time, for different reasons. I think it's a real mistake to treat these as if they're in competition. Many aspects of them are just good or great, and I see no reason why choosing to do one of them prevents us from doing another.

The other thing I notice in this discussion, and I feel like an old man here, not least because I come to windowless ballrooms in Bethesda too often, is that in these situations the tendency is to look for new things. That's entirely appropriate, but one mustn't forget the old things as well, not least because the old things have reasonable utility, and much of the reason people complain about them is really that the problem itself is hard. There's a little bit of moaning about dbGaP; I think that's a bit unfair, because it's mainly about the policies around dbGaP rather than the actual dbGaP mechanics. So all of these should be in the mix: something old, something new, something borrowed, something blue. I don't know what the blue thing will be. So I'll leave that slide up to help stimulate some of the debate, and now it's over to you for discussion. Lincoln?

I have a comment and a question. The first is a comment on the central server: I think we should avoid falling into the trap of thinking the central server is going to solve the data-access policy problems. I don't think we can rely on the central server to implement restrictions on what you're using the data for or how you combine it, because sufficiently clever people can figure out how to link up different data sets via the central server's API and violate those restrictions.

To give justice to the paper about the central server, they do talk about this, and they basically say that when you get access to the central server you sign a document saying "I'm not going to do this", which is the basic document we all sign when we get access to these data: that we're not going to do anything bad, as it were. But I would agree that the central server is a different class of solution. It's a technical solution, not necessarily a data-access solution, and I think it's a great, convenient, powerful thing to place on top of any of the other solutions.

The question is this: a lot of the serendipity to date has come from the ability to access raw reads at the BAM and CRAM file layer. It's becoming increasingly difficult to move these really large data sets at the raw level across the internet, and it's wasting a lot of local resources, because we each have a petabyte of disk for these mirrored data sets. You mentioned at one point a cloud infrastructure for accessing the higher-level data sets. Do you see any possibility of providing a cloud environment, maybe on a fee-for-service basis, which would allow computational biologists to get read-level access?

I think this is what CGHub is thinking about, or at least people placing things beside it. I know that we at the EBI are actively piloting direct access to what we call a private cloud or community cloud, and I get ticked off by a cloud person regularly for using the wrong term: basically a cloud that we run to let other people in, precisely to solve this problem. But I think the problem of data movement is again orthogonal to the access-criteria question, and it's useful to separate those two things out. There is an absolute problem there, and the solution is to move the compute much closer to the data, but that's a separate thing from the policy about researcher access, which I'm sure you'd agree on. David?

We would certainly like to have, as you say, cloud services associated with CGHub, but if you build something on top of CGHub and you essentially end up redistributing the data via this mechanism, then that runs afoul of the current policies. I think this comes back to this.
That's why I think these are in the same class as the central-server case: it's about trying to say that there is an allowed scenario where an authority gives out some level of access without it being inside some controlled compute environment. So I agree there is an interaction between policy and the technical aspects of running clouds; for a cloud solution to work, some of those policies must be looked at.

Right. My main feeling here is that we really need to encourage creativity, creative solutions on top of these data. The storing of the data is a very low-level thing; that should be taken care of. But if, at that point, we do too much control and centralisation and limiting of the creative things you can do with the data, we're really screwed.

This goes to Gonçalo's point: the important thing here is not to set these things up as oppositions, as either/or rules. It's totally fine for us to do both: to allow researchers to download more data directly onto their own disks, with appropriate in-house security, and run things across it, and to allow cloud access. You don't have to go for only one model.

Right, but we need to make sure that we allow for a class of people who actually create value added on top of the data and then redistribute that value added to the medical and scientific community.

So, Lincoln, again, is that... well, let's get Mark and Mark and Mark. The Marks.

I was glad to see your last slide, that these are not mutually exclusive or competitive activities. In fact, one way I think about it is that these are all good things, and some samples and some collections may be available for certain types of sharing and others may not. That doesn't have to mean they're in separate places; we could have a solution that suits that very well. And since you brought it up, I think it would be helpful to the community if at some point, whether in this meeting or not, we collectively weighed in, in some way, shape or form, on the genotype and p-value sharing question, the Homer et al. issue, because it is a point of great confusion, and it would be enormously beneficial to geneticists around the world, who look to groups such as this to make position statements, because it isn't at all consistently handled. Many of the disease consortia do in fact still produce genome-wide full p-values, or odds ratios and betas and so forth, for public use, and I think that's a very good thing; and then there are many that don't. There are projects in the sequencing realm, such as the ESP, which has put out exact allele counts, and so you don't even require particularly sophisticated computational tricks to take someone's DNA sample, sequence it, and identify whether they're in that project or not. So I think a clear articulation of what we consider reasonable risk, from all the ways of considering this, would be really important for the field going forward. Aravinda?

I'm just going to add to this in a way that might be useful.
You're already getting into the technical details of the architecture and how this should be distributed, but the main impediments I see are these. Certain kinds of information are pre-computable, and even there, I think Mark is right that there's not widespread agreement as to what presents what level of danger. But those things can be listed, we can discuss them, and most of them, I personally believe, can be shared and should be shared. On the other hand, there's the data I need, at whatever level, in order to compute something that has not yet been done, for whatever reason; I guess you're calling that serendipitous discovery, or new discoveries. Depending on the data you need, there are different levels of guard that one has to put up, and it may be useful just to separate the two. One is much more easily done than the other, and most of the impediments to it are, as Mark said, confusion or cultural issues; whereas the other is a very, very different issue, and a very important one, of how we can compute on the total amount of data and make new discoveries.

Well, I think this is an excellent discussion. Let me just repeat that I think it would be brilliant if an outcome of this meeting were a re-evaluation of Homer et al. I know of people who have attempted papers about this; Steve has been an author on one of them, and I know that David Balding has done some analysis on this and actually didn't manage to get it published, because it wasn't considered cool enough, because it was saying that the attack doesn't work. So there's a whole bunch of different people here. At the moment, just practically, we, and our browsers, are not allowed to put up p-values. That's the truth of the matter, and that's because of this ruling, as it were.

Who governs that?
So it's the DACs: the NIH made this precautionary policy, and the Wellcome Trust DAC made a precautionary policy. So there's a very clear piece of governance, and it's about informing that piece of governance about what should happen.

But I don't think that's actually the case; each consortium on its own is making up the rules. So I think that, as a community, we could speak to the pros and cons of that. It would be good just to clarify that a little bit.

Yeah. Mark asked the question of why we are not doing it, and I gave the straight answer. I think one thing that would be really useful is to actually articulate what the current view of the community is. There's no kind of position paper on what's easy to share, and the default right now, even though there are exceptions (Mark said his disease studies will sometimes share the full list of results), is very much not sharing, which is kind of odd. This data has very little value to you once you've published your paper, but it has a lot of value to other people who want to do follow-up studies on it, and you're cutting all of that off because you think you're doing the right thing in some way. I think it's excellent that we're coming together. Sorry, Debbie.

I was just going to say, a lot of this data the journals have; it's at the back, in the supplemental information of the publications.

Not the full p-value list.

Not the full list, but I agree it could be gotten, even though it's not in the supplement. The original Homer paper focused mainly on allele frequencies and genotype counts, but I think the main objective of having ways of assessing risk, that's something that's agreeable. Now, one thing you don't see addressed here: you could talk about genotypes and p-values, but as we get to rarer and rarer variants, it's more important to be able to share those, and I think the challenges become a little bit larger, because there is this question of whether somebody wants to be identifiable to someone who has access to their DNA, and we're all out in public places. So I think those are good questions too. Mark, and then David.

That's why I give the ESP a great deal of credit: they have put out all of this information on a base-by-base, count-by-count basis, and maybe there are discussions from within that group that would illuminate the rest of us as to how that decision was arrived at, because it is of enormous value and has contributed to many serendipitous discoveries in the last year. To their credit, that project has done this, and I think we need to be cognisant of the positives that can come from that kind of sharing, as well as the perceived or imagined risks. David, sorry, and then Lincoln.

I think it's good that you said "all of the above" and that they're not mutually incompatible; it's important we realise they're actually all components, although I'd be in favour of having multiple attempts at this, for the exact reason that serendipity doesn't just come from the analyst having access to the data but from different people trying to solve this kind of problem. Nonetheless, to me these all look like different layers, or switches thrown, on one system. If the data is together, and again I don't want it to be in one place, it could be in multiple places, but if the data is together, then some data will say it's available for open access, so you flip the switch and everyone can access it. Another way of dealing with the data is that if you want access to the individual-level data, you can go get it yourself and put it on your computer, or you could work in a researcher commons. Some people might want some way of interacting with the data where they're not the serendipitous computational people; we should serve them too: people with questions who are not computational, who right now are not served unless they can get the ear of a computational person. And then you could also release, or not, the p-values and the variant counts. In other words, these are all layers of access and ways of interacting. What we do not have yet, even as individual cohorts and projects come together, is everything together; they're not all together, and everyone agrees with that. So there's a big win here if we can figure out how to get it together, and recognise that there's not going to be one size fits all: there are different layers of access, and they're all potentially elements of what you want; they're not mutually exclusive options. Lincoln?

Just drawing back a little from the specifics to the philosophy of access control: I think we should be moving away from supply-side data access, where we control what data researchers can access, towards a focus on what researchers are doing with that data. This is why I think the certified-researcher model is actually very useful, because then the international or national community has a club: if a researcher does something egregiously bad with the data, such as de-identifying a participant, the community can withdraw his certification, and that basically puts him out of the research business; he can no longer access any of the data sets. I think the responsibility should rest on the individual researcher, and we need effective controls on researcher behaviour; we should stop trying to put our fingers in the dykes of the databases, because that is ultimately self-defeating.

Yeah, I think streamlining is very important, but I guess the question I would put to you concerns the comment you made that research participants are overwhelmingly in support of this. That's not my experience; I'd certainly drop the "overwhelmingly". It may be entirely my bias from the cohorts I've interacted with, but I think those who are in favour are in favour of it with conditions, and there is a much larger push right now for control, over both uses and users, and for getting information back. If you look at who's willing to put anything into a "bank", the two most important features are: (a) I get research results back, and (b) I get paid. So my fear with all the streamlining is that we also have to look at what's happening from the public's perspective: they want to participate in this, and it's not just about giving something; they want to hear back. And a little bit of my fear about opening this up widely is: how do you possibly address that issue? Because if the supply dries up, we're all dead in the water.

I appreciate that, and I'm aware there may be a cultural difference here; I'm hanging out with too many Scandinavians, and they are incredibly "of course you want my body for research". So I think this has to be handled sensitively and responsibly. Very often, in my experience, you go back to the top and remind everybody why we are doing this in the first place, and then you bring people down to the specifics. It doesn't surprise me that one has to go back and say, look, it's a basic human health thing that we're trying to achieve, and go back and forth between those levels. That's why, in those cons, if you noticed, I did put the reputational risk to the system. Personally, I'm less worried about the mechanics; I think that's handleable. What I think would be a bit of a disaster is a reputational risk of researchers seeming to cook the rules for themselves. That's not what we're trying to do, but it's not enough that we understand that; we also have to present it correctly. So we just have to handle this appropriately. Yes, sorry. And I don't want to necessarily represent the ESP conversation on the variant server; Debbie, did you want to?

I can make a comment on what I remember from the discussion, because I think it's relevant to this. This comes back to my earlier comment that the constituents we need to think about are the cohorts and the populations that are providing the data, though I also strongly believe data should be available, p-values and that kind of thing. What was discussed is that, in aggregate, having a lot of cohort data together provided a level of anonymity that an individual cohort from one place, like Framingham where I work, doesn't have, and there was a consensus that the level of risk was therefore lower. I think that's what we're seeing with GIANT and other big consortia, where they're putting the data out there because basically all of the cohorts have agreed that by putting all those data together the risk is lower. Now, is there data to support that? Has there been a study to show it? Some people have looked at that, but I'm not sure it's been published. That may be one thing that could come out of this meeting: that the risk could be lower with the larger aggregates.

I think there's a kind of modern-day consensus opinion among the big cohorts, or something like that, and if we can formalise that, write it down and publish it, that is the way scientists communicate with other scientists; and then we can take those publications to the appropriate policy people and say, go have a look at this, which I think would be excellent.

But a little bit of caution has to be there. When you're enrolling in a study, and there may be 15,000 individuals in a schizophrenia cohort, the neighbour being able to determine that you were in that study is something that some people wouldn't want. So it depends on the nature of the study, and that's where you're evaluating the risk: what is the risk in Framingham, what is the risk in a BMI study? This goes back to the very early thing I said about the 80/20 rule. In the Scottish study in Tayside they have a very interesting view: you can access an awful amount of information once you're authorised, because the entire Scottish healthcare system is at your disposal at that point, but in fact you're only allowed to access certain information in a closed room with no USB key; you walk up to the thing. The sex offenders' register and that sort of thing is in that class, whereas BMI is something you can transfer to a secure location somewhere else. So this business of risk takes in the probability of something happening plus the, is it hazard or harm? It's too late for me. Harm, thank you. So the harm of what would happen is the right thing to weigh, and it's totally appropriate for us to have a nuanced view between different studies: this study is off the table for this class of access because the risk is higher, because of the level of harm to the participant if it's de-anonymised. At the back; I'm sorry, I can't read your name.

So, I'm not a lawyer, but I play one on TV. In some regards there is precedent in the law for what you're talking about, with respect to managing information in a manner that is not necessarily impregnable to harm, or even risk, but is sufficiently protected, and there is a lot of case study around this in various domains, so I would encourage this community to look at some of those domains; it happens in epidemiology, and it also happens in medicine in general. The other thing I wanted to voice is that I'm very hesitant to support open access in this domain, particularly because we do not know, at this time, all of the harms that could transpire. While I do agree that you should be sharing this information, it's extremely dangerous to do so when you do not necessarily know who is downloading or using the information. So all of this has to be predicated on setting up, not necessarily access control, but some ability to authenticate the individuals who are downloading this information before you set it free; it's necessary for auditing purposes.

I'm sorry, I'm unaware of whether the session is out of time or not. Any more questions? Next.

I just wanted to make a comment on the earlier issue of public responses. I think the contrast with the Scandinavian countries is interesting, because my impression at least is that they've had a level of public discourse that hasn't happened so much in North America; for instance, residual blood spots were discussed in Denmark, and so on. And from Canada we have similar results, with people saying yes, in principle we think this is a great idea, but calling for more participation and transparency.

I think it's definitely a societal thing; but maybe it's saunas, who knows, we've got to do the epidemiology of that. Sorry, Jake.

I don't think it's anything that silly, actually; I think it's access to healthcare. As someone who consents people for genome sequencing: in short, people are terrified of losing their insurance, or can't afford life insurance, and so their concerns about their data, compared to someone in a country with a much stronger social safety net, are hugely different. So if we're talking about United States participants, there's going to be a huge variety, and it will be culturally, and I think socio-economically, quite different with regard to people's willingness to have open access.

And GINA doesn't change the landscape?

No, because it doesn't protect you against life insurance, it doesn't protect you against long-term disability insurance, and even if you talk about health insurance, and I'm not a lawyer either, here or on TV, I think the definition of insurance there is actually a little vague; I think it has to be an employer with 50 employees or something, I may be wrong about that. But there clearly are very real issues that it doesn't protect against.

All right, so thank you very much, a good discussion all round. I think it's over to Laura.