Good morning, everyone. Thank you ever so much for joining us. Today Beate and myself, Cristina, from the UK Data Service, are going to be discussing solutions that secure data facilities can implement to ensure reproducibility for papers based on personal and confidential data. We're going to see that, in theory, there are a couple of solutions out there, but one key area that we want to understand better is what would be the best solution from a researcher's point of view as well.
So we are going to start, if my slides are working, of course, with a very brief icebreaker. If you follow the link in the chat, I'm just pasting it now into a browser, or if you scan the QR code on the screen with your mobile phone, there are just a couple of questions that we want to ask. We're going to run this for three or four minutes. It just gives people more time to join us. As we were just saying previously, with working remotely and so many online sessions, it is quite easy to be running from one Zoom to another, or from one Teams to Zoom and so on, quite back to back. So we're trying to give everyone a bit of time.
Apologies about that. The link is now in the chat. It only went to the hosts because that was my option. So please follow the link in the chat now or scan the QR code on the slide. And I will start sharing the web page. In here, you can also scan the QR code in the right corner. That will take you to the survey as well. We can see a couple of people have started completing it, and I can see more people are joining us. So we're just doing a little icebreaker for our presentation, trying to get to know everyone that's in the audience.
The very first question is, which of the following stakeholders do you primarily identify with? And we can see we have PhD students, professional practitioners, not-for-profit researchers. We're just trying to find out more about what kind of research you are doing.
And at the end of the session, we also have a similar survey, but looking at the solutions that are able to enhance reproducibility when it comes to secure data facilities. So we're trying to see whether a solution is fit for purpose for everyone, no matter their background, their stakeholder group, or the domain they are working in. I can see a couple more people have joined, so I'm just quickly going back. If you'd like to join the poll, you can follow the link in the chat or scan the QR code in the right corner of the screen. There are just four questions, trying to get to know each other in a Zoom webinar format.
So the next question is, which discipline best represents your primary area of research? We can see social sciences are in the lead, but we also have health and life sciences as well. We did think that most of the participants would be from social sciences, health and life sciences, and maybe arts and humanities as well.
Now, trying to think more creatively, what is the first word that comes to mind when you hear reproducibility in the context of research? And we can see we have quality, openness, assurance, transparency, all words that would come to mind when we hear reproducibility. And this is why we want to do the talk, because at the UK Data Service we do want to lead on the discussion: how can we ensure transparency, assurance, quality of data and openness of data by ensuring reproducibility even in secure data facilities? Of course, it's much easier when we're talking about open data that one can just download, or safeguarded data that is behind an end user license agreement, because you can simply provide the link to where the data is hosted. It is much harder when it comes to secure data facilities, where there is an entire application process to be able to gain access to that data.
And our final question, what single word best describes a challenge you associate with reproducibility in research?
Timeliness, publish, access. Exactly, access is the most difficult one in our experience as well. Publish, yes: more and more journals are now asking for reproducible research. And this is one of the main drivers of the discussions that we're hearing about ensuring reproducibility in secure data facilities, because, as we're going to see, there is a current way of ensuring reproducibility, but journals are trying to move away from it because it's not as transparent as it should be.
Now I'm going to change my share back to the PowerPoint presentation. I think we've allowed sufficient time for everyone to join. So thank you ever so much for joining us. I do hope you have enjoyed the icebreaker, and it's really nice to get to know you.
A brief introduction about ourselves. I'm Cristina Magder, Data Collections Development Manager at the UK Data Service. In my role, I am leading on ensuring that data are made available for social and economic research via the UK Data Service. This includes large-scale governmental surveys and longitudinal data, but also smaller-scale academic research as well. I also lead the research data management training portfolio, because of course in order to be able to have all of this data shared, we need to support researchers to understand how to make data available, how to clean their data and how to publish their data. I'm going to pass to my colleague for a brief introduction.
Thank you, Cristina. My name is Beate Lichtwardt. I work in user support and training, and I'm also representing the UK Data Service in the International Data Access Network. Our aim there is to make controlled data, or secure access data, available abroad, and also to make international secure access data available to our UK researchers via our safe room. I'm also co-chair of the International Secure Data Facility Professionals Network. This is basically a network where practitioners and leaders of secure data facilities discuss problems of the present but also of the future.
So for example, we also discuss topics like the one we are talking about today: how can we ensure reproducibility? We are all facing similar challenges, although we are at different stages in the development of secure data facilities. Thank you, Cristina.
Thank you, Beate. Now moving on, what are we looking at today? We will do a brief introduction to the UK Data Service. If you have not heard of us, we hope it will offer a great deal of data that you can use. But the main focus of our talk is reproducibility in secure data facilities. So we first start with the current situation: what is happening now? What are the journals requiring? But also, what are the landscape changes? We've come together with different colleagues and looked at what potential solutions might look like. The most feasible ones are these two scenarios: scenario A, where a reviewer is allowed to conduct the reproducibility test within the secure lab, within the secure data facility (we refer to these here as trusted research environments as well); or scenario B, where a reproducibility service is established by the data provider, in our case the UKDS Secure Lab. We're also looking at the outlook, and we're going to have a brief discussion, because we really want to hear back from you what solution would work best for you, or whether you have any other solutions in mind.
So when it comes to the UK Data Service, we hold the largest collection of social, economic and population research data in the UK. And it's not only about providing access to this data, it's also about providing support, guidance and training in order to facilitate high quality social and economic research. And education as well: a variety of our data is also available for teaching purposes. The discussion today forms part of our aim to support researchers throughout their data journey. So not only giving them the data, but supporting them in meeting, for example, reproducibility requirements imposed by journals. The UK Data Service is a partnership.
I've been working at the UK Data Archive, based at the University of Essex, but we do have colleagues in Manchester at the Cathie Marsh Institute for Social Research, who lead on teaching and aggregate data, our Jisc colleagues, who lead on impact, and our colleagues at University College London, who lead on the census data. We also support the development of best practices for data preservation and sharing standards. And we're part of different consortia, for example CESSDA, the Consortium of European Social Science Data Archives, to ensure that other repositories make FAIR data available and support their researchers throughout their journey.
When it comes to stats, we have over 9,000 data sets in our collection, with around 300 new data sets and new editions every year. The long-standing governmental or longitudinal surveys do get updated, some quite regularly, depending on the methodology they're using, and this is where the new editions are coming from. We have around 48,000 registered users, and they account for 130,000 downloads or accesses of data in a year. We've made a brief calculation: every six minutes, someone accesses data from UKDS, which I think is quite great. It shows that we do live in a data-driven world and more and more people are using data.
In terms of researchers, it's really nice to see that the icebreaker covered the different types of researchers that we have at the service as well. We do have a lot of academic researchers and students that use our data and our training services, but we also have government analysts, charities and foundations, independent researchers and think tanks, and also business consultants and data analysts. So we need to ensure that our proposed approaches fit the entire designated user community.
Data sources vary. We have national statistical authorities: we make Office for National Statistics data available as safeguarded data.
So we know that on the ONS website there are quite a lot of open data sets that are made available. However, in order to be able to conduct more research, we do need more granularity in the data. This is where the variables are a little bit banded, for example we don't make available the full ethnicity variable, and with that done we make the data available as safeguarded data. All of that data is available via the UK Data Service. We do have other government departments that deposit data with us, such as the Department for Work and Pensions and the Department for Transport, intergovernmental organizations such as the home market, different research institutes, such as the Centre for Longitudinal Studies and ISER as well, the Institute for Social and Economic Research, but also individual researchers. Our ReShare repository contains more experimental data and even trials data as well.
Now, the focus today, rather than on safeguarded data, is our Secure Lab. We first set up Secure Lab back in 2011. It's quite unbelievable when you think about it. Working with the Office for National Statistics, we came up with a Five Safes framework to facilitate access to this sensitive, confidential data. Under legislation, this is personal data, so it needs to have a lot more safeguards in place to be used for research.
So when we're thinking of safe data, we do ensure that there's nothing that directly identifies participants in the data, but the granularity of the data is so high that it does constitute personal data. In order to protect that, we put in place safe projects, which means that all the projects need to be approved by the data owner or the research accreditation panel, and they need to demonstrate public good. Safe people: researchers have to do the safe researcher training and pass an exam. While this sounds quite scary, the exam is not scary at all; if you do the training, it is very straightforward. Safe settings: the data doesn't leave the Secure Lab environment.
People can only access it remotely. We have now put in place working from home for a variety of our data sets, so researchers can access their computer at work from home, and then they can access Secure Lab. And lastly, but just as importantly, safe outputs. Nothing that leaves Secure Lab can do so without being checked by our user support and training team, because they need to ensure that there's no secondary disclosure that leaves the Secure Lab environment. So just from looking at this and hearing this brief introduction, there are quite a lot of steps to access data in a secure data facility. So of course, reproducibility is raised as a big question.
So what is the current situation? Journals, of course, do appreciate that it's quite difficult to reproduce papers based on secure access data. So what they are doing now is accepting the code that the researchers are using. Some of them advise researchers to submit the code to a code repository. It might be something like GitHub, or, if they use data from a specific data repository, the code can be submitted to that data repository as well. So there is a workaround, but going back to the icebreaker, when we think of reproducibility, we do think of openness and transparency. Just depositing the code is not as transparent, because what if the code doesn't work? There is that question there. And someone who actually wants to reproduce that work would currently have to go through the entire application process. And even there, there is a question mark: could they actually get access to the data that the researcher used?
We selected PLOS as an example of a journal, because they do have a very nice and clear data availability policy. They clearly explain that they appreciate there are ethical and legal frameworks in place, and they do not wish to contravene them.
So this is why, while they are insisting on the data being made available, and that authors should deposit their relevant data in a public data repository, they appreciate that this might not be possible because of legal or ethical constraints. In that case authors need to clearly state where the data was accessed from, whether there is any code, and where the code is that they have used, to allow others to reproduce their research. Interestingly enough, one of our safe researcher training attendees said that typically a data access statement saying that the secure data can be accessed via the UK Data Service would be fine. This would be accompanied by R scripts and do-files. So we've done a little bit of engagement with our users to see what kind of challenges they have encountered and what kind of solutions they have found. However, we started getting more and more queries around how we can ensure more transparent reproducibility, despite the fact that a lot of researchers have this solution in place.
We can see an example of a data availability statement on the slide. They're clearly saying the Millennium Cohort Study can be accessed from UKDS. It gives the full citation of the study, including the DOI. That's very important, because we need to know which edition of the data was used so that we can reproduce the work. And they also provide the code in their paper. Now on to my colleague Beate to discuss landscape changes.
Exactly. So as you said, Cristina, at the moment the status quo is that researchers submit their code, their do-files, their syntax to the journal, together with indicating in their submitted paper the exact reference, including the DOI. So the necessary steps are taken to indicate how somebody else could potentially get access to the data, but, as Cristina said, not necessarily for the purpose of reproducibility. Now, well-established secure data facilities like ourselves, for example, but also others.
We are in frequent communication with our colleagues, obviously across the UK, but also across Europe and internationally. We are all increasingly receiving inquiries to enable robust and transparent reproducibility, and we will be required at some point to facilitate and assist peer reviewers prior to a journal article's publication. We see such requests especially for economics publications, so they seem to be the driver in that direction. In this context of the constantly evolving secure access data landscape, reproducibility has indeed become a growing concern for all involved, and that's journals, researchers and data service providers alike. So Cristina and I, in this talk, will develop and examine possible solutions as to how secure data facilities could handle the new reproducibility requirements for secure access data, and we will also discuss the very practical implications of each of the proposed processes and possible solutions.
So do I have control? I cannot change to the next slide. Cristina, please, could you click me to the next slide? Thank you.
So what could potential solutions look like? Scenario A would be that we could allow direct access for peer reviewers in the secure data facility, and others could do that as well. So basically we are talking all the time about what secure data facilities in general could be doing, but then also refer back to what that would mean for us if we did that. And the other one, scenario B, is that we could establish a tailor-made service within the secure data facility to certify reproducibility, basically to stamp it and say: this has been reproduced, we have tried and tested it. So the main aims of our presentation today are, in the short run, to outline how secure data facilities can support the peer review process better and, in the long run, how we can help to pave the way for enabling reproducibility of scientific research based on secure access data.
So the current situation at the UKDS Secure Lab looks like this. The original data is within the TRE, the Trusted Research Environment or secure data facility. The researcher undertakes the analysis within the TRE, then it undergoes the statistical disclosure control checks of the user support and training team, then the output and the code are released to the researcher for publication. The researcher prepares everything and submits it to the journal, the code and the article, and after the normal peer review process, without replicating the results, there will be a publication, hopefully.
Now in scenario A that would change to this flow chart. So again we have the original data within the TRE, the researcher carries out the analysis, and the output is then released, after the SDC checks by two output checkers, to the researcher for publication. This will then be submitted to the journal, and then the peer reviewer would, in this scenario, apply to access the TRE for reproducibility purposes. We will talk in a minute about the challenges that would pose, but this is how it would go. So then the reviewer will get access and can run the code on the original data. That would need to happen via remote execution, and I will explain a little bit later why that is. Then the reproducibility checks are carried out within the TRE, the reviewer feeds back to the journal, and the publication can go ahead.
Now we have various challenges with that scenario. One is time. At the moment the application process for accessing secure access data at the UK Data Service Secure Lab takes three months. The peer review process is already quite lengthy, and if you now have to add three months for getting access in order to then carry out the peer review itself, that will cause a bit of a problem.
However, a possible solution could be to have a new and streamlined process for reviewers. Of course, here we would need to have considerations regarding single- and double-blind reviews, but this could be designed in a different way to overcome that challenge.
The next one is access itself. As Cristina already mentioned, it is not necessarily the case that you can get access to the secure access data for the sole purpose of review and reproducibility. However, that could be factored into the deposit license agreement, and we could overcome the challenge this way.
Another challenge is obviously, as always, costs. We would need resources to train reviewers and to establish the processes and procedures to allow reviewers access to secure access data for reproducibility purposes. For that we would need funding. And then we have other challenges like, as I mentioned already, the single- and double-blind reviews: how do we go about that if there are public registers and you can basically see who the researchers are, etc.? But again there are solutions to that, and we have just made a note of it here.
And finally, technical arrangements within the TRE. Within our secure data facility, we would need to have a certain setup so that the peer reviewer can access the code but not all the other folders that we have at the moment in the Secure Lab project, etc. So basically we would need a new solution whereby we establish a new setup for reproducibility processes only, providing access to the code, which would then need to be remotely executed. The peer reviewer wouldn't have direct access to the full data, but would run the code on the original data, which are in a different folder, just to do the necessary bits to reproduce the results used in the publication. This is very, very important and we need to think carefully about that.
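To make the remote-execution idea a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names `make_review_space`, `read_code`, `run_remotely` and `mean_income` are hypothetical, the records are made up, and it does not describe how the UKDS Secure Lab or any other TRE is actually implemented. The point it shows is the separation of roles: the reviewer can look at the code and trigger a run, and only the summary output of that run comes back.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Analysis = Callable[[List[Record]], Dict[str, float]]


def make_review_space(
    data: List[Record], code: Analysis
) -> Tuple[Callable[[], str], Callable[[], Dict[str, float]]]:
    """Hypothetical review space: hand back only 'read code' and 'run code'."""

    def read_code() -> str:
        # The reviewer can inspect what was submitted (here: name and docstring).
        return f"{code.__name__}: {code.__doc__}"

    def run_remotely() -> Dict[str, float]:
        # The code runs next to the data; only its summary output is returned.
        return code(data)

    # The raw records themselves are never handed back to the caller.
    return read_code, run_remotely


def mean_income(records: List[Record]) -> Dict[str, float]:
    """Mean of the 'income' column."""
    values = [r["income"] for r in records]
    return {"n": float(len(values)), "mean_income": sum(values) / len(values)}


# Made-up records standing in for secure access data (an assumption).
secure_data = [{"income": 100.0}, {"income": 200.0}, {"income": 300.0}]

read_code, run_remotely = make_review_space(secure_data, mean_income)
result = run_remotely()
print(read_code())
print(result)
```

In a real facility the returned output would still go through the statistical disclosure control checks mentioned earlier before leaving the environment; that step is deliberately left out of this sketch.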
Now coming to scenario B, what would that look like? So let's say we established a reproducibility service within the UK Data Service Secure Lab, so within the UK Data Service. Again we have the original data in the TRE, the researcher does the analysis, the output is checked and released to the researcher, the article is submitted to the journal, and then the journal requests a reproducibility stamp prior to publication. At that point the in-house service would run the researcher's code on the original data within the TRE, then all the checks would be carried out, and then the in-house reproducibility service would feed back to the journal whether it was able to reproduce the findings or not. And then the publication would happen.
What would be the challenges for that scenario? Let's have a look. So again it's time: at the moment the application takes about three months. However, that could be a default access controlled via agreement, so that would not be a huge issue and would be even easier compared to scenario A. Again, the access hurdle could be overcome by having it as a key component in the deposit license agreement. The big difference here is the cost. Whereas we would need a little bit of funding for scenario A, we would need quite a considerable amount of funding for establishing a service doing reproducibility stamps. A completely new sub-service would need to be established, and we would need quite a bit of funding for that. Again we would then consider all the issues around single- and double-blind reviews, but these can be overcome. And technical arrangements obviously also need to be in place for scenario B. So for the projects, with their data, documentation and code: where do we put it, and who has access for how long?
Again we need new processes for that, and again, the same as for scenario A, the controller or the reviewer from the in-house service would only have access for a very limited time and only to what is absolutely necessary, the code, and would then run the code and just check the reproducibility.
Now there is actually an example of scenario B, so it's not a completely new idea, and that is the French reproducibility certification agency cascad-CASD. You might have heard of it. If not, there is a link underneath that illustration; please click on it and have a look, because there's quite a bit of information available. We also had a presentation by Eric Debonnel and Roxane Silberman in the ISDFPN, and that was very interesting. What we took from that is basically that cascad-CASD is a project that started as a pilot in 2018. It was then a three-year project from 2019, and it is now in its four-year extension after that. An interesting fact, I found, is that the average workload for a request would be one and a half days. I also find it very interesting that they established a very simplified, certification-specific accreditation process that lasts just two weeks. So they have actually managed to agree with the authorities to reduce that particular access process to two weeks, which is fantastic, I think. And again, the cascad controller will only access data and code for the time of certification. They also have a way to allocate different cascad controllers with different skills, in terms of which software they are experts in and which subjects they are experts in. And they also mentioned that it may be very important to have experts who are not from the same subject, so there is no conflict of interest, which is also a very interesting point.
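As a rough illustration of what "just check the reproducibility" could mean in practice, here is a short Python sketch that compares the figures reported in a paper with the figures obtained by re-running the code. Everything in it is an assumption made for illustration: the function name `check_reproducibility`, the tolerance, and the example numbers are hypothetical, and it is not a description of how cascad-CASD or any in-house service actually certifies results.

```python
import math
from typing import Dict, List


def check_reproducibility(
    reported: Dict[str, float],
    reproduced: Dict[str, float],
    rel_tol: float = 1e-6,
) -> List[str]:
    """Return the names of reported results that are missing or do not match."""
    failures = []
    for name, value in reported.items():
        # A result fails if the re-run did not produce it, or if the two
        # numbers differ by more than the relative tolerance.
        if name not in reproduced or not math.isclose(
            value, reproduced[name], rel_tol=rel_tol
        ):
            failures.append(name)
    return failures


# Made-up numbers: what the paper reports versus what the re-run produced.
reported = {"coef_age": 0.125, "n": 1500.0}
reproduced = {"coef_age": 0.125, "n": 1498.0}

failures = check_reproducibility(reported, reproduced)
verdict = "reproduced" if not failures else "not reproduced"
print(verdict, failures)
```

A real service would also need to record which edition of the data and which software versions were used, since either can change the numbers; this sketch only covers the final comparison step.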
Now, I would just like to end regarding cascad-CASD by saying that this Certification Agency for Scientific Code and Data, which is basically what cascad stands for, is a non-profit certification agency, and, very interestingly, it was created by academics with the support of the French National Science Foundation and a consortium of French research institutions. I think it is very important to bear in mind that the goal of this agency is to provide researchers with an innovative tool allowing them to signal the reproducibility of their research. And here, I think, this is the key: to signal the reproducibility of the research. They don't need to wait for a journal to come back and request that particular stamp. What they can do is request it themselves, and it's free of charge at the moment. Then they can submit their article to the journal already with that stamp, saying: look, there is actual proof that this is reproducible. It's like a signal of quality, if you wish, as we had in the beginning, for example, when you mentioned what reproducibility actually means for you. And that's very important. It's about changing the game completely, from "there's absolutely no way anybody can reproduce what has been done here" to "I already provide my article together with such a stamp, saying it has already been reproduced, and now I would like you to please publish it". I think that's really a game changer, and this is very, very interesting. Of course we would be very interested in seeing how we can facilitate reproducibility for secure access data much more in future.
Okay, now the outlook for the UK Data Service Secure Lab. Coming back to the UK Data Service: at the moment, what we do is code submission. We encourage code submission to the journal. We can't do it, the researcher has to do it, and it goes hand in hand with a complete data
citation, including the DOI, in the references section. But in future we will determine the feasibility of reviewer access to the UKDS Secure Lab, which was scenario A, and we will also look into whether it would be possible, and what the odds are, of establishing a reproducibility service for the UKDS Secure Lab in the long run.
And now we would like to come back to an interactive part, and we would like to hear from you again, as in the beginning, so this is where the circle closes. Please could you join the discussion via the browser again. It is the same link, and I think Cristina has already put it in the chat again. Or you use your phone camera and scan the QR code, whichever way. There are quite a few questions, so I think more time is definitely needed to think a little bit about them.
If anyone is having any issues with the poll, please do drop a line in the chat if it doesn't work. I've re-posted the link, just the same link, or again the same QR code in the right corner; if you scan that, it will take you to the survey. You don't have to use your real name. You can put a pseudonym, or you can skip the name question. We're just trying to find out more from you about how you ensure reproducibility, what would be the best solution for you, and whether there are any challenges that you envision. Because, as Beate was saying, cascad was able to get off the ground because it was all about engagement with researchers. We want to do this to support researchers, so we need to hear from you.
One of the questions we are also particularly interested in is: between scenario A and B, which do you prefer from your perspective, and why? I think we saw in the beginning that we have some PhD students in the audience, and maybe you're not yet at the point where you have had this problem when submitting something to a journal. Especially when you're at the beginning of your PhD, that might not have applied to you yet, but maybe you have second-hand experience via your colleagues. Yeah, well, that makes
sense. So basically somebody said scenario B sounds preferable because this is something like a stamp of quality. And of course that would need to be an accredited service provider, so then yes, that would be acknowledged and recognized.
Yes. Well, actually that was interesting, because cascad had quite a bit of take-up in the beginning, and, as I said, researchers wouldn't wait for the journal to request it. They would go themselves and ask for it to be done, so they can submit their article already stamped with a certificate.
Yeah, that's also a very important point, yes: that peer reviewers might be less likely to agree to peer review with more work to do. Okay, the next one: they're already relatively slow. Well, I mean, yeah, it would definitely slow down the whole process by the three months of getting access, and then also afterwards to carry out the work. Although we have seen with cascad, for example, that it actually takes one and a half days, as a matter of fact. Thank you very much for that response, that's very useful. What we are also thinking is that it takes much longer to establish scenario B, and it would presumably be quite helpful to have scenario A while looking into setting up scenario B, so as a stepping stone.
And is there also some answer regarding what methods you currently use to ensure the reproducibility of your research? "Have a trusted peer reproduce the work." Yeah, okay. So basically that means you haven't used secure access data, because you wouldn't be able to share the data. But yes, absolutely, that would be the normal procedure for safeguarded data, that you can share the resources and the data. Okay, thank you.
I think this is a very valid concern, because with the peer reviewers we do hear from researchers that it is currently taking so much longer to publish. And that's not to criticize journals or peer reviewers, they just have so much work. So it
might be that the reproducibility service actually has quite a tight deadline to meet. Potentially at the beginning it is probably not that doable to have one and a half days. I think in time maybe that's something to aim for, and cascad has been amazing, but maybe at the beginning we could impose something similar with, say, four or five working days. That would still be much better than taking months to have something actually approved. It is quite similar to the feedback that we got from IASSIST as well; the researchers there, and the data providers, were more keen on a reproducibility service.
I think the main worry from our end is getting enough support from the research community to be able to get the funding to set something like that up. We can leverage legislation very easily when it comes to a reproducibility service. It would be much easier than having it done via an external peer reviewer, because it would be people that work within the UK Data Service infrastructure. We have very high standards that we need to follow; even when we get hired, we need to do a lot of things beforehand. So I think it would be quite easy to leverage that to our advantage to convince data owners to allow it. And it does seem to be the preferred method from a researcher perspective, so this is certainly helping in terms of planning for the future. While originally we were thinking solution A might work better for journals, solution B definitely seems to, and it probably works better for journals as well, because they don't have the time to have a peer reviewer actually apply. Even if it's a straightforward application process, they would still need to apply to be able to use that data, and there wouldn't be a guarantee. We do something similar in ReShare, the self-deposit repository. Sometimes we get peer reviewers asking for the data itself, and even that usually takes
a couple of weeks, because it's about getting in touch with us and making sure that they have all the details that they need. There's no discussion between the researcher and the peer reviewer, so they have to wait to hear from the journal, and it can take quite a lot of time. So it does seem, I don't know, Beate, but it does seem like we might be looking at getting some funding for a scenario B project.
Yes, I think that would be the long-term goal. But I would still say that for the interim it would be very helpful if we tried to implement scenario A, as I said, as a stepping stone, and also to improve the situation immediately quite a bit.
Yes, because we're generalizing: there might be some journals that actually have the capacity, and it would be quite easy for them to have a peer reviewer that actually checks the code.
Yes, indeed. And I think, by signalling that it is in principle possible to go down that route of scenario A and have a peer reviewer coming into the UK Data Service Secure Lab and carrying out such a peer review, just by doing that alone it already has much more credibility, because it's no longer "it's absolutely not possible, there is no way you could". So in terms of quality assurance this would be a huge step forward. How much it would be taken up would be interesting to see. To be honest, I would like to see what happens then. But I think the current situation of it not being possible at all is not acceptable in the long run. That can't continue.
No, and it's a pity, because the whole idea of reproducibility is to ensure transparency, and there is no way of making this data accessible under other conditions. It needs to be in a secure data facility, it can't go out of that. But that doesn't stop us from implementing solutions to ensure proper reproducibility. Like we had on the previous slides: as people usually share their data and all the study resources and all the methods, could
that not be facilitated in place, in the secure data facility? I'll just double check whether we have any responses to the final one. We're very grateful for the discussion. If there are any other comments, the Q&A is enabled, and I think the chat is enabled as well, and we do want to hear more from researchers. And if you want to find us later, we have our contact email addresses here. If you would like to work with us on this project as well, that would be much appreciated. We're trying to find out more from your experiences and what would work best for you as well.
Absolutely, please get in touch. Then I would like to say thank you very much for attending today's session. It has been a pleasure, and enjoy the rest of the Research Methods e-Festival.