…get started. So welcome, everybody, and thanks for joining us today for this first day of the CNI 2020 fall virtual meeting. I'm Cliff Lynch, the director of CNI, and I'll be introducing the session. Just a couple of quick logistical things: you have a Q&A button at the bottom of your screen; you can use that to pose questions, and we'll take all the questions at the end. There's also a chat box; you're welcome to introduce yourself in there and to use it as we go along. At the end of the session, Diane Goldenberg-Hart of CNI will beam into existence and moderate the Q&A. Everybody is muted at the moment except for our presenters. Closed captioning is available if you'd like to make use of it, and I don't think I need to say much more about the mechanics, other than that the session is being recorded and will be available later.

Now let me introduce the session really briefly. We have with us from Weill Cornell Medicine both Terrie Wheeler and Peter Oxley. Peter will be doing the presentation, and Terrie will join Peter to handle the Q&A. This is a session I'm really looking forward to, because it weaves together quite a number of issues about how to essentially document and share research work, and it incorporates a strong commitment to reproducibility. It gets at how we manage research workflows. It also has some very specific and interesting issues: as a medically situated piece of work, it has to deal with some confidential data where sharing is limited. It also, very provocatively, incorporates electronic lab notebooks as part of the flow here. I think you'll find this very interesting, and as I say, I'm really looking forward to it. With that, let me just thank Peter and Terrie for joining us and hand it over to Peter.
Wonderful. Thank you so much, Cliff, for that lovely introduction, and thank you for the opportunity to present what we've been doing here at Weill Cornell. I'm excited to share what we have been wrestling with in terms of research reproducibility and how we can improve it for the institution and for our researchers. I hope there are lessons here that will benefit everyone tuning in now or watching the recordings later. It's going to be a very quick run-through with the limited time we have, but don't hesitate to ask detailed questions at the end, and I'm happy to go a little further into the weeds with people afterwards if that helps.

I want to start with how our journey began and what some of our motivating factors were. As we started to engage earnestly with the issue of research reproducibility, there were three primary drivers behind why we want to do this better. These will probably be familiar to many if not most of you, but it's always worth revisiting them and allowing them to shake us up a little, because these are issues of genuine concern, and we as institutions that value contributing to the scientific endeavor want to help as much as we can to address each of them in turn.

The first of these is what is better known as the reproducibility crisis. You've probably heard this phrase; in fact it has perhaps become a little overblown or overused by now. While we could spend our whole time today just debating what reproducibility means, I think it's important to recognize that reproducibility, on many different levels, is an important hallmark of what science is, and it is one area where everyone can always be working to do better. There have been numerous studies over the years, particularly in the last decade or so, attempting to reproduce established scientific papers at some level, and they basically all come to consistent conclusions: depending on how detailed you want that reproduction to be, it is in fact very difficult to properly reproduce a lot of the work that has made it into the published record of science. On this screen I'm showing a slightly older but perhaps well-known study from John Ioannidis, who looked at the specific field of molecular biology. The reason I want to highlight this one is that it shows nicely the breakdown of issues they identified in attempting to analytically reproduce a number of papers; they weren't trying to reproduce the work experimentally or biologically. You can see that more than half of the papers could not be reproduced, and when you look at the reasons why, half of those were because the data were not available to run the analysis on, and then we have other issues, including software availability and the methods provided being unclear. These are not deep philosophical issues that are hard to root out; at one level these are actually surface-level issues that we can begin to chip away at in order to help this process of reproducibility.

Connected to this first motivating factor are the mandates coming from funding bodies, industry, and government, more loudly and more explicitly as the years go by. These are related in many ways to the issues we just saw, where data availability is really important, particularly when those data have been gathered using publicly provided funding, or taxpayer funds. There is a scattering of requirements that I've put on the screen here involving NIH, NSF, and others; there are funder mandates, and there are legal mandates as well, that institutions need to be aware of. These are the more straightforward of the retention requirements that researchers need to be aware of, and they can become much more convoluted and hard to predict. In some instances, for example, there is a legal requirement to be able to access the original data for a study up to six years after the last citation of the original paper by its original authors. That particular requirement makes the retention period ultimately unknowable at the time of publication: you don't know how long you have to retain the data in that case, because it depends on what happens in the future. So data retention in itself, even when the rules and requirements are clear, can be very complicated to implement. And while many institutions will consider the researcher the appropriate steward of the research data while they are actively working on a project, or at least while they are active faculty members, the institution still has a right of ownership of the data. Institutions have their own legal responsibilities for data retention and for being able to provide those data, and as such the institution needs to take responsibility for how the data are handled as well, not to mention the fact that when researchers move on to another institution or retire, the data left behind still have to be managed by someone. These are important things for an institution to be considering and working towards.

Our third motivating factor, and perhaps the more insidious, negative one here, is the persistent presence of academic misconduct in science. I have a couple of quotes here from some of the more well-known studies that have looked into this in the last decade. While we may not be heading towards a crisis of academic misconduct, it is certainly persistent, and we certainly want to be doing what we can to disincentivize academic misconduct within our own institutions and to help researchers police it themselves and identify it when it is happening within their labs.
Another aspect of this, of course, is that there is an increasing number of allegations against institutions where academic misconduct is concerned. Even when no genuine misconduct has occurred, the accusations that come in need to be addressed, and so the institution, in order to appropriately defend its researchers who are doing the right thing and doing honest research, needs to be able to address those allegations easily and quickly. Having systems in place for robust research reproducibility internally will substantially help towards this effort as well.

As much as these three drivers have something of a push factor towards reproducibility, we of course also, as an institution, have our own ethical mandate and a desire to maximize the value that comes out of every data set that is gathered. Particularly when we're looking at medical research, and the fact that patients have been part of this process, we have an ethical obligation to do our science with integrity and to make responsible use of patient data. This is intimately linked to research reproducibility practices, because you cannot guarantee that a patient's data have been used ethically unless you're able to validate the research that patient's data have enabled. So we strive to maximize the value of the data we've been collecting in the research enterprise as well, and all of this feeds back into a desire to improve our research reproducibility practices.

Now, quickly, some things we are not claiming in this presentation. The systems we are going to present today do not automatically make research reproducible. That comes from improving reporting standards at the point of publication. It requires having clear principles of reproducibility, such as the FAIR principles of findable, accessible, interoperable, and reusable information; ideas like literate programming, which have been around for a long time now; and communities of practice, such as the open science community and more specific communities like The Turing Way, working with researchers to identify how, at the level of publishing into the academic community, research can become more reproducible. Instead, what we want to provide here at Weill Cornell is an infrastructure that helps make research reproducible: can we give the scientists the tools that enable them to make their research more reproducible, and perhaps, to take a phrase from The Turing Way, can we make it "too easy not to"? What we're trying to capture, then, is this: can we identify and track the raw data being generated as part of the research process? Can we describe and store the workflows that have been applied to those data? Can we link both of these to the results, which typically are the end product and what gets published? And can we track and store all of these appropriately, and then find them and bring them back to bear when necessary, either for further research or for defending against legal challenges or claims of misconduct?

In order to do all of this, we need to deal with a number of hurdles. None of these are unique to Weill Cornell; I'm sure many of you can identify with them. They include the fact that we have lots of confidential data as part of our research projects, so we have to be very careful to handle those data according to appropriate security standards. As far as research reproducibility goes, there's "open" and then there's "open": we have to figure out ways of still tracking the work that was done in a reproducible way that does not necessarily involve exposing the data to everybody, while still being able to provide it to the right people when appropriate. The second hurdle, of course, is that researchers are already incredibly busy doing the research. They don't want fifty different things added to their plate that they must now also do to meet institutional policies and make their work more reproducible. Even though most researchers would certainly agree with the goals of reproducible research, they don't want extra burdens placed on them that slow down the primary goal of doing the research, so we need a solution that does not add effort for the researchers but instead removes the effort required for them to do the work and make it reproducible, as it should be. We are also a very large organization doing many different kinds of research, which means the data and workflows we deal with are incredibly diverse: they are generated from all sorts of different systems, and the analytical pipelines run on very different forms of infrastructure. Being able to capture all of that, with a system flexible enough to apply to as broad a spectrum of research as possible, is something we need to be aware of. And finally, in terms of data retention, we want to make sure our solution becomes a place of high-quality data retention and research reproducibility. We can't create a system that becomes just an electronic dumping ground for everybody's data and workflows, which would make it really difficult to sift through and identify the relevant data and workflows when required, not to mention the cost of maintaining irrelevant pieces of information within the system. These are some of the things that have been front and center in our considerations for how we build the system.
So we've come up with essentially a four-piece solution, with the pieces fitting together to address all of the issues I've raised this afternoon.

Piece number one is a supported electronic lab notebook solution that allows researchers to capture small data sets and the basic workflows within the lab, with the flexibility to meet the research needs of many different projects. There are many different electronic lab notebooks out there, and different ones will fit different institutions. We are going with LabArchives, and I don't want to endorse it above all others, but in our case it works very well for us, because the solution we have with them allows direct storage of raw data files and automated capture of files produced by research equipment into lab notebooks. They also allow the capture of analytical workflows by integrating with a number of software tools that researchers use for their analysis, making it easier for them to work internal to the notebook, or connected to the notebook, so that it automatically captures the work they are already doing. Because of its basic layout, effectively giving researchers the choice of how they upload and organize the data within each individual notebook, it provides a degree of flexibility in how it is used for capturing research data. It has file versioning and immutable timestamps, which can become very important for establishing that particular research was done at a particular time. I'm now working with the LabArchives team on integrating their notebook with Jupyter notebooks: they have already started to build an extension on their end for the display of Jupyter notebooks within LabArchives, and I'm also building a JupyterLab extension, using their APIs, that will enable people doing computational work within Jupyter to access data stored in their LabArchives notebooks, as well as to upload their Jupyter notebooks into the electronic lab notebook system, further allowing capture of a multitude of different computational workflows in a single location. These lab notebooks themselves are trackable, identifiable, and shareable between people within a lab or between labs, and ultimately, as faculty members move on, notebooks can be handed over to a new person, so that we as an institution will always have access to the work that has been done within them.

Piece number two is a file management system for everything that doesn't fit within the electronic lab notebooks, so that we can still tag and track the other data and files being produced. We are working with Starfish on this, to produce a system that allows researchers to continue working within the institutional storage solutions we already provide, the various file-share solutions, but in a way that now lets us identify which files are associated with a particular project, so that we can access them more readily. In essence, this works by allowing a researcher, within the file system, to uniquely flag folders and files associated with a project or publication. All files carrying those flags can then be hashed to provide a unique identifier, which allows us to validate in the future that any specific file we are attempting to retrieve is in fact the same file that was originally tagged by the researcher. A primary feature of the tracking system itself is its ability to track where files are moved within institutional storage, so even if a researcher relocates their data sometime in the future, the Starfish system can find where those files have been placed, as long as they remain within institutional storage.
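As a rough illustration of that flag-and-hash step, the following is a minimal sketch using only Python's standard library. It is not the actual Starfish implementation: the function names and the choice of SHA-256 as the digest are my own assumptions.

```python
import hashlib
from pathlib import Path

def hash_file(path: Path) -> str:
    """Compute a SHA-256 digest of a file, reading in chunks so large
    data files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def register_flagged_folder(folder: Path) -> dict:
    """Hash every file under a folder the researcher has flagged for a
    project, returning {relative path: digest} as a fingerprint record."""
    return {
        str(p.relative_to(folder)): hash_file(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }

def validate_file(path: Path, recorded_digest: str) -> bool:
    """Check that a retrieved file still matches the digest recorded when
    it was originally tagged, even if it has since been moved."""
    return hash_file(path) == recorded_digest
```

Because the digest depends only on a file's contents, validation still succeeds after the file has been relocated within storage, while any modification to the contents is detected.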
Then, finally, we can create actionable scripts allowing researchers to indicate when they have finished a research project, so that we can move the project into an archive state, or capture more of these files for the future.

Piece number three is to provide secure file access management, and this is done through our Data Core at Weill Cornell. This really is a piece shaped by the patient data that much of the research here will need. What we do in essence with Data Core is provide a virtual, secure, collaborative workspace that allows appropriately authorized people to gain access to a specific set of data and software tools in order to do the research, as allowed by the data governance that maps to those data sets. All of that is then monitored for the lifecycle of the project by the Data Core curation team, who track expiry dates and authorization, as well as other requirements for data retention or data destruction at the end of the project lifecycle, to ensure all of these requirements are met, saving the researcher from having to remember all of these things. When necessary, the team also provides curation for data import or export, either directly for the researchers or between third-party data providers.

Those are the essential pieces of the infrastructure, but how do we connect them into a single solution that ties them all together? The answer, for us, is our data catalog. The data catalog not only provides a place to capture the metadata associated with any particular data source; it ties that source to projects, to grants, to publications, and to each of the other infrastructure tools we've just looked at. The catalog functions on the back end of the Data Core's administration system, allowing the curation staff to track data use agreements, governance expiry, and the appropriate authorization of users to access those data. It also captures the unique tags that Starfish generates, so that when researchers have generated a new data set associated with a publication or a grant, they are redirected by Starfish into the data catalog and prompted to enter at least a minimal amount of information about the data, to help make it findable and more reusable in the future. The data also become trackable, because the data catalog holds that unique flag, which it can pass back to Starfish should someone need to recover those data files, allowing the Starfish system to go and track them down, validate them with the hash, and present the files to the appropriate individuals. At the moment this still sits behind a curation team; we don't serve up files to anybody who searches or asks for them, but the ability to flag levels of access down the line, tracked by the data catalog, is something we're keen to develop. Finally, the unique tags associated with each of the electronic lab notebooks are also stored within the catalog, so that any notebooks used as part of a project or publication can likewise be tracked down, enabling a request to be made to the PI or faculty member to share access to that lab book with the appropriate researcher if necessary. All of this is also built into the discovery layer within the catalog, which allows individuals from Weill Cornell to search not only data but also data use agreements, reuse conditions, and authorization levels for access, to help people understand not only what is available internally within the institution for sharing, but under what conditions and for what purposes.
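To make the linkages described above concrete, a catalog record could be modeled along these lines. This is a simplified sketch in plain Python: the real catalog is a Django application backed by Postgres, and every field and function name here is hypothetical, not the production schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class CatalogRecord:
    """One data-source entry, linking a data set to the projects, grants,
    publications, and infrastructure tags around it (illustrative fields)."""
    dataset_id: str
    project: str
    grants: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)
    starfish_tag: Optional[str] = None        # unique flag from the file system
    eln_notebook_ids: List[str] = field(default_factory=list)
    data_use_agreement: Optional[str] = None
    access_expiry: Optional[date] = None      # governance expiry tracked by curation

def find_by_starfish_tag(records, tag):
    """Resolve a Starfish flag back to its catalog entry, so the underlying
    files can be located and validated against their recorded hashes."""
    return next((r for r in records if r.starfish_tag == tag), None)

def access_current(record, today):
    """True if the record's governance has not yet expired."""
    return record.access_expiry is None or today <= record.access_expiry
```

The point of the sketch is the shape of the linkage: a single record carries both the file-system tag and the notebook identifiers, so either system can be reached from a search in the discovery layer while the governance fields gate what is actually released.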
Together, this maximizes the value of each of the data sets we're producing, and particularly for the larger patient data sets that have been built, we can make them discoverable and available to the largest number of people, who can then be provided with appropriate access to do their research. Finally, and briefly, the triggers that prompt all of this are project and grant completion, publication, or a faculty member leaving the institution, each of which leads to making sure that all of the research is appropriately captured within the catalog. I will finish with a quick nod to the many hands that have made this system what it is today, and if we have time for questions, I'm happy to take them now.

Great, thank you so much, Peter. That's a really interesting integrated system. I realize that we are right at time, but we do have one question, so we'll go ahead and take that now: what infrastructure is the data catalog running on?

Yep. It is built using the Django web framework to host the front end; it uses a Postgres database, which is actually AWS-hosted, on the back end; and we use Elastic Beanstalk as the infrastructure that everything sits on top of to serve the data. There's no protected health information stored within the catalog itself; it doesn't store PHI, it only stores those pointers to the governance and the authorization.

Great, okay, thank you for that answer, and thank you so much for that question; our questioner thanks you as well, Peter. I really want to thank our presenters for being here with us today at CNI, and I also want to thank all of our attendees for making time in their busy days to join us. As we are a little past time, I don't want to hold folks up any more, but I will invite anyone who would like to stick around after I turn off the recording to sort of approach the podium and have a chat with our presenters; we'd be delighted to have you do that. We can turn your microphones on and you can ask live questions or make comments. So with that, I'm going to thank Peter and Terrie again; thank you all for joining us, and we will be back in about a half hour with our next session as part of CNI's fall 2020 meeting. Take care, everyone.