Good morning, all, and thank you for joining us today for an update from federal agencies and nonprofit organizations on their public access programs and initiatives. Shortly we will hear from six different representative organizations about what's coming and what's new for their program areas. Given current conditions and travel restrictions, this is a hybrid panel, just to let you all know, with some representatives joining us virtually and some in person. We will have a live Q&A with all of the panelists once all the presentations have been completed, so please be thinking about some good questions as you're listening to the presenters.

So with that, I would like to briefly introduce today's panelists. My name is Cynthia Hudson Vitale. I'm the director of Scholars and Scholarship with the Association of Research Libraries. Joining me are Kathryn Funk, program manager with NLM and NIH; Jason Gerson, a senior program officer for the clinical effectiveness and decision science program at the Patient-Centered Outcomes Research Institute; Josh Greenberg, program director for the Technology Program at the Sloan Foundation; Martin Halbert, a science advisor for public access at the National Science Foundation; Bob Hanisch, director of the Office of Data and Informatics in the Material Measurement Laboratory at NIST; and Carly Robinson, assistant director for information products and services at the U.S. Department of Energy, Office of Scientific and Technical Information. So with that, we're going to kick things off with two virtual presentations, which should be queued up shortly.

Wonderful to be with you all today. My name is Carly Robinson. I'm the assistant director for information products and services within the Department of Energy's Office of Scientific and Technical Information, or OSTI, and I wanted to provide an update today on the persistent identifier, or PID, services that we provide. First, a little bit of background about our office and how we fit within the Department of Energy, in case you're not familiar. DOE program offices fund about $12 billion in R&D each year, and that goes out to our 17 national labs and to grantees at universities and other institutions. From that funding come R&D outputs: things like journal article accepted manuscripts, software, data, and technical reports. We estimate there are about 50,000 R&D outputs that come from DOE funding each year. And that's where my office, OSTI, comes in. We work to collect, preserve, and disseminate DOE-funded R&D results, making them publicly available to the general public, to DOE, and to other agencies through custom-developed search tools that we provide and also through other commonly used search engines. Our office goes all the way back to the Manhattan Project and the early Atomic Energy Commission, but most recently our role is called out in the Energy Policy Act of 2005, which says the Secretary, through OSTI, shall maintain within the Department publicly available collections of scientific and technical information resulting from DOE-funded research.

We have a number of different strategic priorities that fall within our mission space. I'll mention a few here, but I'm going to focus on one. We work to develop and provide leading-edge search tools, including OSTI.gov, which is our primary search tool for finding all DOE-funded research results. We work to integrate AI and machine learning into our workflows and tools.
We support strategic decision making through analytics of R&D outputs. We partner with other agencies and the private sector to promote access to unclassified R&D results; one of the agencies that we partner with, for example, is NSF. And we work to promote open science and the linking of research objects through persistent identifier services, and that's what I'll be focusing on today.

This is a very busy slide, but I'll briefly run through the different persistent identifier services that we provide at a very high level. I won't go into very specific detail, but I'm happy to delve in or answer any questions you all may have. First off, we provide persistent identifier services assigning digital object identifiers, or DOIs, to research outputs. We assign them to technical and workshop reports and to conference posters and presentations, all through our ingest system, which is the system that folks use to submit these research outputs to us; that system is called E-Link. Also within E-Link, we provide our DOE Data ID Service, which assigns DOIs to data sets. Through DOE CODE, which is our platform for submission and discovery of DOE-funded software, we now assign DOIs to software as well. And we partner with other federal agencies through our Interagency Data ID Service to help with DOI assignment for their data and their research outputs.

In addition to DOIs for research outputs, we also have a pilot service providing DOIs for awards, called our Award DOI Service. It's still in its pilot phase. We're working particularly with DOE user facilities to assign DOIs to the awards that they provide to use those facilities. We're working with two facilities so far, and they have assigned over 8,000 award DOIs for all of their awards, looking retrospectively but also moving forward.

We also run the U.S. Government ORCID Consortium, which is focused on persistent identifiers for people, specifically ORCID iDs. The consortium is a group of U.S. government organizations that are ORCID members through the consortium, that share best practices and are working together on the use of ORCID within the U.S. We have 16 members so far. Those include DOE organizations, our national labs, and our user facilities, but five other federal agencies are members as well: USDA, NASA, CDC, FDA, and EPA are all members of the U.S. Government ORCID Consortium.

And then lastly, we are looking at persistent identifiers for organizations, which is definitely a little more challenging because there are a number of different persistent identifiers for organizations. One of the things that we're doing in this space is maintaining our own internal organization authority of text-based organization names. What we're working to do is map our internal authority to persistent identifiers for those organizations, particularly ROR, or Research Organization Registry, IDs, but also other types of persistent identifiers for organizations like GRID IDs, Ringgold IDs, and Funder Registry IDs. So we're working to map our authority across all of the organization identifiers that we have. And we also maintain the Open Funder Registry entries for the Department of Energy, which are DOIs for funding organizations: a listing of all of the DOE funding organizations that should have Funder Registry DOIs.
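As a concrete illustration of that organization-mapping idea, here is a minimal sketch, not OSTI's actual pipeline, of how entries in a text-based organization authority might be matched to ROR IDs using ROR's public affiliation-matching API. The sample organization names and the trust placed in the service's "chosen" flag are illustrative assumptions.

```python
import requests

# Hypothetical entries from an internal, text-based organization authority.
INTERNAL_AUTHORITY = [
    "Oregon State University",
    "Pacific Northwest National Laboratory",
]

def match_to_ror(org_name):
    """Query ROR's public affiliation-matching endpoint and return the
    ROR ID that the service marks as a confident ('chosen') match."""
    resp = requests.get(
        "https://api.ror.org/organizations",
        params={"affiliation": org_name},
        timeout=30,
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        if item.get("chosen"):  # ROR flags at most one high-confidence match
            return item["organization"]["id"]  # e.g. "https://ror.org/..."
    return None  # no confident match; leave for manual curation

if __name__ == "__main__":
    for name in INTERNAL_AUTHORITY:
        print(name, "->", match_to_ror(name))
```

In a real mapping effort, the non-matches that fall through to manual curation are usually the bulk of the work; the API call only handles the easy cases.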
There are also some links on the right-hand side, so if you are interested in finding out more about these services, you can go there. But in addition to describing our services, I want to talk about how they're helping to promote open science and why that's so important for our mission space.

This is a data set record within OSTI.gov, which again is our primary search tool for finding DOE-funded research results. It's a very traditional record landing page, with information that describes the record, like the abstract, the authors, the publication date, the research organization, and so on. What we're working to do with persistent identifiers is to associate them with as much of this metadata as possible. Historically, a lot of this metadata has been text-based, which has served well for years, but when you can associate persistent identifiers with the text, it helps in a lot of ways. It can help, for example, with disambiguation of authors or of organizations. It can also help with findability and discoverability, making research, researchers, and other research objects more discoverable. And you can create connections between these persistent identifiers that are a lot easier to use and work with than text-based information alone.

In this example, for this data set, we currently have the DOI, which you can see on the left-hand side; this is a DOI that we've assigned to this data set. We also have the ORCID iDs associated with the authors of this data set; you can see the little round green ORCID iD symbol next to their names. In the future, we're working to add Research Organization Registry identifiers for the research organizations and potentially award DOIs for the funding. The other thing we can do is connect research outputs. This data set references a journal article, and we can connect the DOI for the data set to the DOI for the journal article within the metadata. So we're working to make those connections as well.

I wanted to show you another example. This is a record page for one of the awards that we've assigned a DOI to. This is from the Environmental Molecular Sciences Laboratory, which is a user facility at Pacific Northwest National Laboratory. On the left-hand side (sorry, it's a little bit small) you can see the award DOI that we've assigned. Again, there's traditional metadata: the award description and some investigator information. One thing I wanted to note about the investigators is that we again have ORCID iDs associated with those people, so you see the little green ORCID iD symbol. And in the Award DOI Service, we are also linking to the ROR identifiers for the investigator organizations. In this example, the lead investigator is from Oregon State University, and (sorry, it's really small) there's a little ROR ID symbol next to Oregon State University that links out to their ROR record. So again, wherever possible, we're trying to have persistent identifiers within the metadata that we provide.
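To illustrate the dataset-to-article linking just described, a DataCite-style metadata record can carry the connection as a related identifier. The sketch below shows the general pattern rather than OSTI's exact schema, and every identifier in it is a placeholder, not a real record.

```python
# A sketch of how a dataset's metadata can link related research objects
# through persistent identifiers (DataCite-style relatedIdentifiers).
# All identifiers below are placeholders, not real records.
dataset_record = {
    "doi": "10.xxxx/example-dataset",  # DOI assigned to the dataset itself
    "creators": [
        {
            "name": "Researcher, Example",
            "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",  # ORCID iD
            "affiliation": {"ror": "https://ror.org/00xxxxx00"},        # ROR ID
        }
    ],
    "relatedIdentifiers": [
        {
            # The journal article that this dataset references, expressed
            # as a DOI-to-DOI connection inside the metadata record.
            "relatedIdentifier": "10.xxxx/example-article",
            "relatedIdentifierType": "DOI",
            "relationType": "References",
        }
    ],
}
```

Once connections like this exist in the metadata, a harvester can follow them in either direction, from article to underlying data or from data back to the article, without any text matching.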
And as I mentioned, one of the exciting things that becomes much easier once you have persistent identifiers is doing more analytics and visualizations with this information. This is something that's not quite launched yet, but we're working on it. This visualization is tied to a journal article record: the journal article DOI is at the center, and we're linking from that DOI to funding information and to associated organizations, authors, and research outputs, all by linking up persistent identifiers. Of course, you have been able to do this with text-based information, but having those persistent identifiers really helps with disambiguation, so you know which author is which even if some have the same name, and likewise with organizations. And you can start to explore this information in other ways. For example, you could look at authorship networks, use this to help find peer reviewers, or look to see what other funding is going into a certain space. Persistent identifiers are really helpful for that kind of analytics as well. So I will stop there. I really appreciate your time, and I'm happy to take questions whenever it's time. Thank you.

I'm Martin Halbert, the science advisor for public access at the National Science Foundation. Within NSF, I'm responsible for steering the development of the NSF Public Access Repository and cross-agency coordination of NSF policy on public access reporting requirements for the billions of dollars of research awards that the National Science Foundation makes every year. I'd like to briefly share with you today some perspectives about a few incremental next steps that I think are needed to improve public access to research in the contemporary international ecosystem of digital repositories of research materials, an ecosystem that is continuing to rapidly evolve. Let me preface these comments by saying that they do not represent an official stance of the National Science Foundation, but rather my own observations about improvements which are likely to be needed for repositories and research public access practices broadly.

The White House Office of Science and Technology Policy issued a memorandum in 2013 entitled "Increasing Access to the Results of Federally Funded Scientific Research" that directed major federal grant-making agencies to each develop a plan to support increased public access to the results of research funded by the federal government. In response to this memorandum, the National Science Foundation established both policies and procedural requirements for researchers who receive NSF awards to make juried research papers publicly accessible for free digital download. The NSF Public Access Repository, or PAR, was created as a mechanism to facilitate such public accessibility. Version 1.0 of this system operated from 2015 to 2020, and the repository now provides access to tens of thousands of new scientific publications each year. This year, in 2021, we enhanced the system with new capabilities that enable NSF-funded researchers to publicly report and make accessible research data sets created in the course of their projects. We call this enhanced version PAR 2.0, and even though we have initially made public access to data sets optional, we immediately saw quick uptake of this functionality by our researchers.
By the end of calendar year 2022, we plan to release version 2.5 of the system, with additional capabilities for significantly more types of research products and outputs. We plan to release version 3.0 of the system in 2023, featuring a variety of capabilities to express robust interlinkages between research products. In the course of planning the specifications for these anticipated future versions of the NSF PAR, my developers and I became aware of several limitations in the practices and framing of the current research ecosystem, limitations which I'd like to discuss with you through a series of overarching comments about the life cycle of selected research products and their types.

From long-standing practices inculcated in scientific cultures, researchers have a broadly shared understanding of many expectations about the life cycle of research products in the print era. The main category of research output that was to be routinely made publicly accessible was the peer-reviewed scientific paper, made available in print form. Researchers would indicate the antecedent publications which they referenced in the course of their research through citations. Citations in publications, aggregated in ever larger indices made available in research libraries, enabled scientists to discover new research findings and then build upon them in new investigations. This oversimplified representation obviously glosses over many details of the research life cycle, such as who is responsible for exactly which components of each category, how it all gets funded, etc. But this was the general understanding about these key aspects of the ecosystem. Each part of the cycle was relatively easy to understand, at least conceptually. Scientists just had to complete quality research, write it up in a format that they all understood, and send it off to the publisher in order to get credit and succeed professionally.

This all got much more complex in the digital age. As networked computer systems rapidly became more interconnected and complex over the decades, information exchange rapidly became more interconnected and complex as well. Scientists could now easily share not just publications but underlying data sets, research software code developed to undertake research on the data sets, computer models and specifications for understanding the data, etc. These many new forms of research products are made publicly accessible via a huge range of file-serving computer systems which generically came to be referred to as repositories, some of which are centralized and fairly simple, some of which are very decentralized and complex in nature and operation. Because resolvable URLs for information resources on the web tended to move around over the years, due to changing system administration practices and other reasons, the innovation of persistent identifiers stabilized the accessibility of digital resources in a way that was somewhat akin to the fixity of printed publications. Using persistent identifiers and other innovative new standards, a new breed of ancillary repository services has now arisen which provides new interlinking capabilities in the latest, more robust form of the repository ecosystem. But many kinks and practical stumbling blocks remain to be worked out in this new life cycle model. So we finally come to my comments on the problems in the current ecosystem which I think need some attention.
Let's start with the researchers and note that societies and cultures cannot turn on a dime, but typically require generational periods for deep changes to take place. Research cultures have not had time to internalize and habituate themselves to the many new technical requirements undergirding interconnected repository systems. They understand their own research areas, but not metadata standards meant to enable discovery across globally linked networks. For the librarians in my audience, simply put: they don't think like catalogers. Moreover, they don't feel that they should have to; rather, that's the responsibility of someone else. Repositories are fundamentally file servers, and although these systems are functionally acting in the role that publishers played in the print era, almost nobody understands them in that role in the digital era. Repositories can provide access to virtually any format of digital information, but relatively few research communities have reoriented themselves around the notion of publishing data sets, code, etc., and getting credit for that activity in the same way that they earn advancement and tenure for publications. Finally, most standards for persistent identifiers have not, until recently, even begun to regularly differentiate different types of research product formats, often with the framing assumption that everything being referenced is simply a research paper. And it is unclear who among the ecosystem of interacting stakeholders has the responsibility and the right incentives to ensure quality and rigor in the metadata linkages between different research products. Any one of these problems is enough to break the fragile assumptions in the research information life cycle, and we're rapidly seeing that happen in many instances today. Until we figure out how to effectively mobilize and coordinate our efforts to remediate these issues, I'm not sure how we're going to improve public access or, more broadly, open science as a whole. So my general admonishment is that we need to work on these topics and on remediating these issues as a shared group of stakeholders across this landscape. Thank you for listening to these comments. If you have feedback that you'd like to provide to NSF regarding public access or any topics in open science more generally, please contact us at this email address. Thank you.

Now we're going to turn it over to one of our in-person panelists, Dr. Bob Hanisch, who's going to talk to us about open data and open science at NIST.

Thanks very much, Cynthia. This is the first time I've been in person at a conference in two years, and it is really a thrill to see real people in a real space. Let me tell you briefly about open data and open science at NIST. I'm going to talk about three topics: first, data publication and our NIST Public Data Repository; second, laboratory information management systems, which are for data acquisition; and third, a project that I talked about a year ago at this event, the Research Data Framework, with an update on the progress we've made in the past year.

Our Public Data Repository hosts all of the public-facing data sets that are associated with peer-reviewed publications from NIST research staff. As I think you may know, NIST is largely an intramural research organization focusing on the development of technologies and tools for measuring quantities and properties of physical systems at very high precision. We publish thousands of papers per year, and now an equal number of data sets in our Public Data Repository.
You can do a keyword search, an author search, whatever, and as is typical for good website search tools, you'll find the data sets and the associated publications that are based on those data sets in our Public Data Repository. If you drill down, you will find a homepage that's been automatically generated for each of these data sets as a result of the author uploading a certain amount of metadata into a system called MIDAS, which I'll mention briefly on the next slide. So these data publications are preserved in the NIST Public Data Repository; a homepage is automatically generated for each, and each is automatically assigned a DOI. Researchers can also create their own homepages, and we have other data sets that exist independently, but they're all linked from our Public Data Repository search interface. Users upload their data through this MIDAS platform. Data can be organized into folders and subfolders, retaining the logical structure of a data collection, and can also link to websites such as GitHub repositories.

This whole process starts with an inward-facing tool called MIDAS, for Management of Institutional Data Assets. There are two functions to MIDAS. One is to create data management plans; the other is to actually publish your data through our enterprise data inventory tool. I will admit here, with some chagrin, that getting our staff to write data management plans has been a very uphill battle. I have some staff members who have spent more time telling me why they shouldn't have to write a DMP than it would take to simply write the DMP. But slowly but surely, I think we're making progress on that front as well. One side effect of the pandemic has been a kind of explosion in data publications: as our staff were unable to enter their laboratories and collect new data, they focused on publishing old data that they had not yet had time to publish. That's a trend I hope continues despite the, we hope, lessening of the pandemic.

If you're going to have data that is findable, accessible, interoperable, and reusable (the FAIR principles), it's really important to collect that data in the first place in such a way that it can be FAIR. Laboratory information management systems, or LIMS, enable this by providing capabilities to researchers in an integrated way: capturing and organizing data from equipment, preserving and recovering data, tracking samples, converting data to open formats, and so on. What we found at NIST is that, given the wide variety of instruments we have, many of which are highly customized, there's no single solution. There's certainly no off-the-shelf commercial solution that fits our needs; commercial LIMS packages are simply designed for routine processing, particularly in fields like pharmaceuticals, and they're not flexible enough for highly diverse and dynamic research situations. Our LIMS data flow looks something like this, generically: data from the instrument is passed through into a shared repository, can trigger automatic data processing steps, and ultimately is pushed out through our Public Data Repository. One of the key elements here is metadata extractors. For a lot of commercial instrumentation, the vendors provide data in proprietary formats, and for us to collect sufficient metadata to catalog that information, we have to write custom metadata extractors that unpack those proprietary formats. We're focusing first on LIMS for electron microscopy.
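As a concrete, and entirely hypothetical, illustration of what such a metadata extractor does, the sketch below unpacks a made-up fixed binary header into an open, catalog-ready record. Real vendor formats are far messier and instrument-specific; only the pattern (read proprietary fields, decode them, emit standard key-value metadata) is the point here.

```python
import json
import struct

def extract_metadata(path):
    """Unpack a (hypothetical) proprietary instrument file header into
    open, catalog-ready metadata. The layout below is invented: an 8-byte
    instrument tag, a float64 accelerating voltage in kV, and a uint64
    POSIX timestamp, all little-endian, in the first 24 bytes."""
    with open(path, "rb") as f:
        header = f.read(24)
    instrument, voltage_kv, timestamp = struct.unpack("<8sdQ", header)
    return {
        "instrument": instrument.decode("ascii").strip("\x00"),
        "accelerating_voltage_kV": voltage_kv,
        "acquired_epoch_seconds": timestamp,
        "source_file": path,
    }

if __name__ == "__main__":
    # Write a tiny synthetic file so the sketch is self-contained, then
    # show the extracted record that would be indexed for search.
    with open("sample.dat", "wb") as f:
        f.write(struct.pack("<8sdQ", b"TEM-01\x00\x00", 300.0, 1638400000))
    print(json.dumps(extract_metadata("sample.dat"), indent=2))
```

In a LIMS pipeline, a record like this is what gets pushed to the shared repository and indexed alongside the raw data file, so the data can be born FAIR rather than retrofitted later.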
We now have, I think, eight or ten electron microscopes that are integrated into a network called NexusLIMS. NexusLIMS is very easy to operate: an experimenter comes to the microscope, says "start a session," and all of the metadata coming from the instrument is automatically collected and automatically uploaded to a central file store, where it can be examined by anybody on that research team, or actually anybody within NIST. It's an inward-facing system, again, but it enables data to be born FAIR and exported FAIR through our Public Data Repository.

The last thing I want to mention is our Research Data Framework. This is a project that we started two years ago already. But what do we mean by a research data framework? It's a map, a guide to the research data space: who, what, where, when, why. It's a dynamic guide for stakeholders engaged in research data management, and a resource for understanding the costs, benefits, and risks associated with research data management. Most importantly, it's a consensus document formed by engaging the research data community. The S in NIST stands for standards, but we very rarely invent standards from whole cloth; we build standards by engaging the community and reaching consensus, and that's what this is about as well. Ultimately, we hope it's a tool that will help change the research data culture in organizations.

The research data ecosystem is complex, as you all know. There are various funding models and sustainability plans, and various roles within an organization. Decisions need to be made about how long to keep data, how to assess data quality, and ultimately how we measure the value of research data. We see many benefits in this framework in terms of increasing research integrity, reducing costs and maximizing efficiency, guiding risk management and reduction, and increasing scientific discovery and innovation through the FAIR principles. There are many stakeholders in the research data ecosystem: government laboratories, universities and research libraries, publishers, professional societies, and industry, a particularly important stakeholder for us at NIST, since we're part of the Department of Commerce and our charter is to help improve the competitiveness of U.S. industry. We also work with national and international collaborative organizations, standards bodies, funders, researchers, and the public.

We have a publication that describes the framework core. The basis for this is six stages in the research data life cycle: envisioning; planning; generating and acquiring; processing and analyzing; sharing and reusing; and preserving and discarding. We are just now conducting two pilot studies, one with what we call our university cohort, and another with our materials science cohort, choosing a vertical across a particular research discipline. We have started these workshops (the most recent materials science one was held last Friday), where we are engaging the community, testing and refining the framework core, and obtaining feedback for our NIST team to create framework profiles and implementations. One other thing I would like to share: we are developing guidance so that if you are a CDO, or if you are a librarian, you can see what your roles and responsibilities are and who else working in the research data ecosystem is an important interface for you.
I'll just wrap up by saying these are just a few of the things we're engaged with at NIST: the Research Data Alliance, the FAIR Digital Objects Forum, and so forth, and the National Academies Roundtable on Aligning Incentives for Open Science, which has now concluded three years of activities and has released a toolkit to help organizations posture themselves in the research data space. Most recently, and this is news breaking yesterday, the OSTP Subcommittee on Open Science has put together a new infrastructure working group that I'll be co-chairing along with Carly Robinson and Scott Thompson from USAID, to help better integrate government-funded resources in computation, data management, and networking to improve open science and open data. And I will conclude there. Thank you.

Great. So we have two more brief virtual presentations and then a final in-person one before we get to the Q&A. So let's take it away.

Thank you for the invitation to speak here today about public access at the National Institutes of Health. My name is Katie Funk. I am from the National Library of Medicine, which is one of the institutes at NIH, and I am the program manager for PubMed Central, which is the repository for all NIH-supported publications. Today I wanted to give you some brief context about what we fund at NIH, and then talk about how we make it accessible and what our current priorities are.

NIH has a budget of around $40 billion annually. Almost three quarters of that goes into our extramural research program, which is our grant funding program. We make awards to over 2,600 academic universities, hospitals, small businesses, and other organizations throughout the U.S. and internationally annually. We also support an intramural research program, which makes up the other quarter of our budget. All together, that produces around 100,000 peer-reviewed journal articles annually. The NIH public access policy, then, is about returning those products of federally funded research to the public, in support of the NIH mission of advancing science and improving human health. We think access to research findings can really do that.

The NIH public access policy has been around since 2008 and has really served as a template for a number of other federal agency and private organization policies. It applies to all investigators funded by the NIH; that includes the intramural research program I mentioned and the extramural research program, as well as contracts. Those investigators need to submit, or have submitted for them, an electronic version of their final peer-reviewed manuscript to the National Library of Medicine's PubMed Central, and that should be submitted at the time of acceptance in order to be made publicly accessible no later than 12 months after the official date of publication. In the ensuing years since the policy took effect, we've really seen the interest in and the value of public access to NIH research and PMC. There's been year-after-year growth of this content within our database, and in 2020 we saw more than half a billion views; that includes web views as well as PDF downloads of NIH-funded articles in our database.

Authors have two roads to compliance with the requirements I walked through earlier. Forty-five percent of the papers that NIH funds annually come in directly to PMC.
They are submitted as the final article by the journal or publisher, and this really comes down to PMC's role as a journal archive and the relationships we have with a number of publishers and journals. The other 55 percent, or just over half of what NIH funds annually, comes in through the NIH Manuscript Submission system, or NIHMS, as the peer-reviewed author manuscript. Those peer-reviewed author manuscripts make up more than 800,000 records in PMC at this time. Nearly two thirds of them are deposited annually with some type of embargo, with just over half of them being embargoed for the maximum allowable 12-month period. When you think about the time peer review takes, the time publication takes, and the time the embargo takes, this has really gotten NIH thinking about options for speeding dissemination further.

In 2017, we released guidance that encouraged investigators to make preprints and other interim research products, such as pre-registrations, publicly accessible. We gave some provisions on the types of licensing to choose, how to select a preprint server, and what they should include when posting a preprint: competing interests, requirements of support, data sharing, that sort of thing. Building on that, in 2020 the National Library of Medicine committed to a pilot program to make all NIH-funded preprints reporting on COVID-19 research accessible and discoverable through our PubMed Central and PubMed resources. This was building both on the NIH guidance and on our role as a repository for NIH research within PMC. The result has been nearly 3,000 preprints added to our databases, dating from January 2020 to the present. The vast majority of them come from medRxiv and bioRxiv, but we're also indexing and curating preprints with NIH support from Research Square, arXiv, SSRN, and ChemRxiv. And again, we're seeing the value of this type of access. There have been more than 2.5 million views of preprint records in PubMed Central and a nearly equivalent number of views of preprint records in PubMed during this time, so in total we're seeing five million instances of engagement with preprints and that sort of accelerated access.

We're very interested, as you would expect, in how the preprints in the pilot have sped access to NIH-funded research reporting on COVID-19. In this chart, the gray bars are the preprints that have been published in a journal; the blue are the preprints in our databases not yet linked to a published journal article. At this point, I believe almost 50 percent of the preprints we have archived and indexed have been published and 50 percent have not. And while, of course, the further back in time you go, the more likely it is that a preprint is linked to a journal article, it is by no means 100 percent. If you go back to what was available at launch, the preprints posted January to June of 2020, two thirds of them have been published and one third hasn't. I think understanding the characteristics of those preprints and the motivations of those authors is an area we'd really like to better understand. We also found, in an analysis completed in May of this year of 800 preprints that had been linked to a published journal article, that there were about 100 days on average from preprint posting to journal publication.
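A back-of-the-envelope calculation makes the point in the next remarks concrete. The durations below are the approximate figures cited in the talk (about 100 days of accelerated review, six to seven months of more standard review, a 12-month maximum embargo), not measured values.

```python
# Rough arithmetic on how much earlier a preprint makes findings public,
# using the approximate durations cited in the talk.
ACCELERATED_REVIEW_DAYS = 100    # observed preprint-to-publication lag (COVID-19 era)
STANDARD_REVIEW_DAYS = 6.5 * 30  # roughly 6-7 months of more typical peer review
MAX_EMBARGO_DAYS = 12 * 30       # 12-month maximum allowable embargo

# With a preprint, the findings are public on the day of posting.
# Without one, readers may wait out peer review plus the full embargo.
standard_wait = STANDARD_REVIEW_DAYS + MAX_EMBARGO_DAYS
accelerated_wait = ACCELERATED_REVIEW_DAYS + MAX_EMBARGO_DAYS
print(f"Standard review + max embargo: ~{standard_wait / 30:.0f} months")     # ~18 months
print(f"Accelerated review + max embargo: ~{accelerated_wait / 30:.0f} months")
```

Under these assumptions, the wait from manuscript completion to public access lands in the 18-month to two-year range that the next remarks describe, all of which a preprint sidesteps.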
And when you consider that that was in a time of accelerated peer review, there's really a lot of potential for preprints in a time of more standard peer review, for lack of a better term, which is more like six to seven months. Then, once you add the embargo on that content, you could potentially be speeding access to research by almost 18 months to two years through preprints.

NIH is also very interested in making sure a paper is not just human-readable, but machine-readable and machine-accessible. This seems to us to have a lot of potential for maximizing the impact of what we fund and for driving new discovery. That said, we are seeing only about a quarter of NIH investigators actually choose the CC BY or public domain licenses that we're recommending. There's a significant number of authors selecting other Creative Commons licenses, but there's still about a third who are selecting no Creative Commons license at all. So I think there is a lot of room here for funders to start thinking about how we can increase the number of openly licensed articles that we support, either through preprints or through journal publication.

Finally, in addition to speeding dissemination, we do not want to compromise rigor and reproducibility, and we continue to engage very much with open data and data discovery. In PubMed Central, since 2018 we've been highlighting supplementary materials, data availability statements, and data citations in what we call the associated data box, which shows up before the abstract, to really expose this content. And we're very much supportive of the new NIH data sharing policy that will be taking effect in January of 2023. This is really an expansion of existing policy to apply across the spectrum of what we fund, not just to large awards. It's going to be a real opportunity to get investigators and grant applicants thinking at the beginning of the process: how are they going to share their data? How are they going to make it open? How can we ensure this type of reproducibility throughout what we do, and again, maximize the impact of what we're doing here at NIH? So with that, I want to just say thank you again for the invitation to be on the panel. There are a few links here, and I'll turn it over to the rest of the panel.

Good morning, everyone. I'm happy to be with you. My name is Jason Gerson. I'm a senior program officer at PCORI, the Patient-Centered Outcomes Research Institute, and I'll be speaking with you today about PCORI's policy for data management and data sharing. For those of you who are not familiar with PCORI, we are a funder of clinical comparative effectiveness research, or CER. That is to say, we primarily fund randomized trials and observational studies that generate and synthesize evidence comparing the benefits and harms of at least two different methods to prevent, diagnose, treat, or monitor a clinical condition or improve care delivery. The studies are meant to measure benefits in real-world populations and generate real-world evidence; to describe results in subgroups of people, that is to say, they're large enough to assess heterogeneity of treatment effect; and to help consumers, clinicians, purchasers, and other policymakers make informed decisions that will improve care for individuals and populations. Ultimately, the studies we fund should be generating evidence to inform clinical or policy decisions. From its creation in 2010, PCORI has articulated a commitment to open science.
The first two bullets on the slide describe policies and processes in place to make the results from our funded research available to the public. The policy for data management and data sharing, which was approved by PCORI's Board of Governors in September 2018, was meant to make the underlying data and data documentation available. On this slide is an articulation of the policy's guiding values statement. At its core, the policy is meant to articulate the idea that the research data funded by PCORI is a scientific asset, the utility and usability of which should be maximized in order to encourage scientifically rigorous secondary use, and in turn to foster scientific advances that ultimately improve clinical care and patient outcomes. So we wanted to put in place a policy that would call for the systematic creation and preservation of research data and documentation in order to facilitate data sharing for the longer term.

Now I'll describe some of the core features of the policy. First and foremost, it articulates the expectations for our funded investigators for data management and data sharing. It specifies the data and data documentation to be shared. It designates a specific repository where data is to be deposited; in this case, it's ICPSR at the University of Michigan. It provides funding to support investigators' time and effort to prepare data. It specifies when data will be made available for third-party requests. It describes the third-party data review process. And finally, it articulates criteria and review processes regarding exemptions from the policy requirements.

Let me say a few words about the data deposition process. Only de-identified data will be deposited to the repository, de-identified in accordance with the HIPAA Privacy Rule. PCORI thinks about the data as a full data package. We took our cues from an Institute of Medicine report, published in 2015 or 2016, that spoke of this full data package, which includes the analyzable data set, the full protocol, metadata, the data dictionary, the full statistical analysis plan, and analytic code, among other things. The idea here is that the data documentation provides essentially a roadmap to the data set itself, and this full constellation of documents allows the deposited data to be more interpretable by third-party requesters and those who were not part of the original research team (an example manifest follows below). We also want to ensure that the data that does get deposited was collected through informed consent processes that permit the data to be used for secondary research purposes and shared with researchers who weren't part of the original investigator team.

For those of you who are less familiar with PCORI, we have a large research investment called PCORnet, which is essentially a network of networks comprised of large health systems around the country. The data exists in a distributed data network, so there's a lot of EHR and claims data that cannot be deposited in a repository for legal or proprietary reasons. As such, the policy describes a set of analogous but not identical data products that might be deposited by investigators doing research within PCORnet. These are described on the bottom half of the slide. I won't belabor this point in the interest of time, but they are called out in the policy document itself, so you are free to look at that as well.
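To illustrate the "full data package" idea in code, here is a hypothetical manifest with a simple completeness check. The element names paraphrase the package components described above, and none of this is PCORI's or ICPSR's actual tooling.

```python
# A hypothetical manifest for the "full data package" concept: the
# de-identified analyzable data set plus the documentation that serves
# as a roadmap for third-party requesters. Not actual PCORI/ICPSR tooling.
REQUIRED_ELEMENTS = [
    "analyzable_dataset",         # de-identified per the HIPAA Privacy Rule
    "full_protocol",
    "metadata",
    "data_dictionary",
    "statistical_analysis_plan",
    "analytic_code",
]

def missing_elements(package):
    """Return the required data-package elements not yet deposited."""
    return [name for name in REQUIRED_ELEMENTS if not package.get(name)]

# An in-progress deposit (file names are placeholders).
deposit = {
    "analyzable_dataset": "trial_data_deidentified.csv",
    "full_protocol": "protocol_v3.pdf",
    "data_dictionary": "dictionary.csv",
}
print("Still missing:", missing_elements(deposit))
# -> Still missing: ['metadata', 'statistical_analysis_plan', 'analytic_code']
```

The point of treating the deposit as a package with an explicit checklist is exactly what the policy describes: a data set without its protocol, analysis plan, and code is far less interpretable to anyone outside the original team.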
So, as I said previously, our funded investigators will deposit their full data package, or the applicable data elements, and they'll work with the ICPSR staff to curate it. The data will be hosted by ICPSR, not PCORI, in a repository that we together named the Patient-Centered Outcomes Data Repository, or PCODR. Awardees will ultimately enter into a data contributor agreement, or DCA, with ICPSR. This DCA governs the data deposition process and establishes the investigators' rights and obligations, as a core governance document for our data sharing activities.

I'll say a few words about the data request review process. This review will help ensure that data requests have scientific merit. It will evaluate whether there's a scientific purpose that is clearly described; whether the proposed analyses will make use of the data in a way that contributes to generalizable knowledge to inform medicine and/or public health; whether the proposed research can reasonably be addressed using the requested data; and whether the requester team has the appropriate expertise to conduct the proposed research. Requests will be reviewed by an independent committee comprised of five individuals, including representatives from ICPSR and PCORI, a data scientist, a clinical researcher with expertise relevant to the data request, and a patient representative. Approved requesters will sign a data use agreement, which outlines their rights and responsibilities, including commitments not to attempt to re-identify individuals using the data. And all approved requesters will conduct their analyses in a virtual data enclave.

The data sharing policy also allows PCORI investigators to ask for an exemption from the policy based on the types of data included in the research project, or if appropriate informed consent to share the data has not been obtained. As I said earlier, a number of PCORI studies include electronic health record data, claims data, or other proprietary data sources, which cannot be deposited into a repository. So the policy allows investigators to ask for such an exemption and to provide a written explanation to PCORI, with any supporting documentation, to demonstrate why it's not feasible for them to comply with part or all of the policy. PCORI will review such requests on a case-by-case basis.

So here are some of our early implementation lessons to date. The overarching one is that implementation of data sharing for clinical research requires careful deliberation about the details. It's not simply a matter of directing awardees to deposit their data somewhere and declaring victory. In the process, we've taken the time to speak with investigators individually about their questions and concerns. This has been time-consuming, but we view it as necessary. Many of the investigator teams don't have extensive experience with data sharing, and these conversations with PCORI program staff have so far proven very helpful in allaying their concerns and making clear the timelines and steps involved in engaging with ICPSR. The choice to designate a single repository that we view as a scientific partner has been critical. It allows ICPSR to work directly with the investigator teams on the data curation activities, and it allows them to do something they do very well, which is to educate awardees about the data sharing process. This was viewed as a kind of core value that ICPSR brought to the table, and it has borne fruit so far.
And, also among the lessons, funding the time and effort of investigators to participate in data sharing activities has been critical. Other funders may treat this as an unfunded mandate; PCORI has opted not to, and the time and effort that goes into preparing the data and data documentation is something that we think should be funded. Finally, deriving a set of clear contractual milestones and deliverables, and communicating them to awardees, has also been very helpful. It reduces the ambiguity about what's expected of them and what the timeline is for completing the data sharing activities. So with that, I will stop, and I'm happy to take your questions towards the end of this session. Thank you.

This is the first time I've spoken at a podium wearing a mask, so I have to remember to take it off. I'm Josh Greenberg. I'm a program director at the Alfred P. Sloan Foundation, where I run the technology program. I realize there is a looming coffee break, so I will be very brief. The technology program has been around for about 11 years. Historically, if there were a sort of "previously on" reel, you would have seen things about data curation, data science, and scholarly communication. About two years ago, just before the pandemic, we did a decadal program review and decided to refactor the program into tighter scopes on more specific problems and opportunities. The main one that's live right now (you can go to our website and find more detail about all of these things) is what we're calling Better Software for Science. And there, in the context of the presentations that we heard, the key point I would make is that when you talk about software in the research enterprise and you talk about open source, "open" carries more connotations than it necessarily does in the context of open access or open data. We get to benefit from some of the work that's been done in software communities in thinking about not just access or licensing, but process: how community resources and collaborative processes of development and maintenance of resources happen. A lot of the work we're doing is trying to figure out how to reconcile what's happened in other software communities with what happens within the research enterprise.

It's been great to see the emergence of more funding directly targeted at software, not where software is a side piece of a particular project grant. The Chan Zuckerberg Essential Open Source Software program has been fantastic at putting tens of millions of dollars into the scientific open source context. They didn't come up today, but NASA, in addition to a number of the agencies represented here, has been doing great work in their open-source science initiative. And we're seeing a proliferation of new organizations, like Open Collective, Code for Science and Society, and NumFOCUS, that are thinking about stewarding open source. These are all pieces of a system.

So what we're doing at Sloan is really trying to think about three ways of deploying our resources to build up that system. The first is tooling and processes for publication, review, archiving, and curation of software, the kind of real bread-and-butter issues as they map into the unique context of software. The second is around workforce: who builds and maintains that software?
I would put here that I think there's a really important question to answer about the role of libraries, as knowledge organizations on campuses, in not just stewarding software frozen in amber, but continuing to think about maintenance and curation of software. And finally, bureaucratic innovation within research organizations. You'll hear later today a presentation about open source program offices in universities, and we're very interested in those kinds of constructs to call out these issues and get them a seat at the table. We also have an exploratory program where we're trying out newer, more frontier ideas. We have a small portfolio of grants around open source hardware; when you move from bits to atoms, things get interesting in different ways. We have a small program on trust in algorithmic knowledge, which is really thinking about what the reproducibility agenda looks like in the context of machine learning and trained models. And finally, a nascent program on virtual collaboration, where really Cliff did a better job articulating the issues yesterday than I would here. A lot of what he was gesturing at about the future of meetings and collaboration, we're trying to figure out how to play a constructive role in. So I think I'll leave it there, at time.

Thank you all for joining us. If folks want to stick around for a few more minutes, I think we have time for maybe one or two questions, and the panelists are available as well. I can kick things off, if that would be appropriate. I'm curious to hear from the panelists: the Biden administration and the National Academies have recently launched initiatives focused on scientific integrity. What kinds of conversations are you having within each of your organizations about this topic, and how do you see it potentially influencing your programs going forward, if at all? Bob, do you want to go first, since you volunteered?

Does this work? Yes. Scientific integrity is certainly what we do at NIST. We measure things very carefully. We document uncertainties. We look very deeply into fundamental physics and chemistry and biology to understand these systems. I co-authored a paper about a year and a half ago with my colleague Anne Plant addressing the so-called reproducibility crisis, and we think it's less a crisis than it is a failure, in some cases, to fully understand all of the parameters that can affect an experiment or a computation and to document all of those things. But really, at the core, a principal value for us at NIST is doing measurement extremely well and characterizing those measurements extremely well. So in my view, the essence of what we're trying to do around open data and open science is to expose that integrity.

Do any of the online panelists want to add to that?

Just to add, this has been a huge topic at the National Science Foundation. I've had many conversations about this with Skip Lupia, the assistant director at NSF, and also with Manish Parashar, the head of the office that I'm situated within, the Office of Advanced Cyberinfrastructure, about ways that we can integrate this work. One of the focus points in the immediate future is reproducibility, replicability, and generalizability, to use the vocabulary from the recent National Academies study on this. There's a lot more I could say, but I'll yield to the rest of the panel.
I would just add that, in making federally funded research available, we see that as a long-term commitment, and that means keeping on top of retractions, keeping on top of findings of research misconduct, and making sure that's transparent to the public when they access information. There's the front end, on the reporting side of open data and open science, where there are steps we can take, but there's also making sure that we're committing to preservation and transparency in the scientific record.

We do have a question from the audience. Danielle, do you want to go ahead?

Hi, thanks so much for your talks today. I'm Danielle Cooper from Ithaka S+R, a not-for-profit research organization that focuses on research practices, including several projects currently on data sharing. And just as a side note: hi, Martin, I wish I could meet you in person. I really appreciate the candidness of some folks on the panel talking about the limitations around data sharing, especially in terms of researcher culture and motivation, and also in terms of the professionalized knowledge around metadata that's needed to do this properly. I was curious whether anybody would want to speak to what you're seeing going on in other geographic locales. I'm thinking, for example, of Canada and the way they're trying to get universities involved through the Tri-Agency policies, or any other jurisdictions, maybe Europe or the UK. Is there anything you see going on in those areas that you would especially hope we could adopt here, either in terms of funder requirements and policies, or institutional or researcher cultures?

That's a really good question. I'm following a lot of what's going on in terms of open research clouds and open research commons. You've probably all heard of the European Open Science Cloud, the Australian Research Data Commons, et cetera; there are many of these. And the U.S. is rather sadly, I think, behind in terms of getting our collective act together. That's one topic that will be addressed by this new working group under the OSTP Subcommittee on Open Science. There's a lot that we can learn. U.S. funding for science is much more siloed than it is in many other countries, but given the tremendous investment we've made, I think it behooves us to find better ways to cooperate and collaborate across the federal government, and also with the private sector in terms of their computational and data management services.

Just a plus-one to many of Bob's comments. I would in particular call out, in addition to the European Open Science Cloud, which is actually quite an interesting and broad-ranging initiative, the international GO FAIR movement; the American branch includes some of my awardees. I think there are some particular things we could achieve if we do a better job of coordinating internationally, and I think we have some fantastic opportunities in 2022. I was just in a long conversation about this: NSF is coordinating with the Australian equivalents of the National Science Foundation and NIH on the overhaul of their public access policy and their risk mitigation strategies. And I'm having a call tomorrow with my counterpart in the French government about some possibilities of coordinating with the second open science initiative of the French government. So it's a fantastic time for aligning thoughts.
I would also echo the points about the various Subcommittee on Open Science working groups. Together with Katie, I'm one of the co-chairs of one of the working groups, and we're undertaking a number of new initiatives in 2022 that should be very interesting; where we go with this could be very provocative.

All right. Well, I don't want to keep anyone from the coffee and snacks that are outside. Thank you, panelists, for lending us your time and telling us more about your programs and initiatives. It was incredibly informative. Have a lovely day. Thanks.