Welcome to our closing session for the fall 2018 member meeting. I hope that you've had a good meeting here. Certainly the sessions I've been able to look in on seem to have gone really well. I have to say I'm seeing an unusual amount of energy in the discussion at many of the sessions, and that's really delightful to see. I also feel that our use of Sched to help relocate some sessions into larger rooms is definitely having a good effect. As far as I know there was only one session in the last round that was very crowded, so we will continue to strive to do better, but I feel that we're definitely on the right track and that it's proving to be a very useful tool.

It really just falls to me to do a couple of quick things. The first is to ask you to join me in thanking our wonderful staff at CNI, who have made this conference run very, very smoothly, and also the two volunteer helpers from Georgetown who worked with the staff and made it possible for us to capture a considerably larger number of breakout sessions than we would have been able to otherwise. So please join me in a round of applause for all of them. The other folks I'd like to thank are all of the presenters who contributed their time, their effort, their work, and their good thinking to the very rich set of breakout sessions that we've enjoyed. Those are absolutely the backbone of the meeting, and I am so grateful to everyone who took the time and energy to give us that superb set of breakout sessions. Please join me in a round of applause for our presenters.

And now let me get on to why you're really here. Our closing plenary today is going to be given, and I'm just so pleased that we were able to work this out, by the Director of the National Library of Medicine, Patricia Flatley Brennan. You have Dr.
Brennan's official biography in your conference materials and on the website, but I want to say a couple of other contextual things. First, it is really hard for me to say enough good things about the National Library of Medicine and the amazing contributions that organization has made over the last 20 or 30 years to the advancement of science and healthcare, and to leadership in the way we think about science as it has become increasingly driven by data and information technology. It's a fabulous organization that has had wonderful people and is now thriving under a new leader who is carrying it into the future in a wonderful way.

Let me say a few words about Patty herself. I believe, though I haven't tracked this down and haven't even told Patty, that I first became aware of her work probably a good decade ago, when I saw some of it presented at a conference showing off high performance computing and communications applications. An interesting place for someone who comes to the profession from the side of nursing. She's done amazing work with AR and VR in healthcare and was quite early to that work, I believe. At the same time, her pathway in through nursing has, in my conversations with her, resulted in her having a very broad view of the challenges and opportunities of data and information in the healthcare context. Sometimes bioinformatics, in my experience, can become extremely physician driven, and as all of us know, it's a much bigger world out there than just physicians. There's an army of professions, and the patients themselves, who are all part of this very complex healthcare system that we're trying to evolve into the digital world.
So I think you will find her perspectives to be both very nuanced and very broad, and I am just thrilled to have her here to tell us about all the wonderful things that NLM is doing and all of the fabulous thinking they're doing about their role within NIH and within the national and global healthcare and biomedical sciences world. Patty, welcome.

Good afternoon. Thank you very much for coming indoors on one of the most beautiful days in Washington, and for actually staying here for this time. I'm going to make it worth your while, I promise. There has never been a greater need for trusted, secure, accessible, valued information in the world than we face right now, at this very time. Libraries, data scientists, informaticists, networked information specialists: we are essential to the future. I'm very proud to be here as the director of the National Library of Medicine. I'm proud to see some of my NLM colleagues here, our regional medical library and National Network of Libraries of Medicine colleagues and good friends. But I'm actually here because I owe a debt, a debt to Cliff and a few others of you in the room: in my very first year at the National Library of Medicine, with transition guidance from Betsy Humphreys, we established a strategic plan, and that strategic plan is now guiding our investments into the future. I'm here today to talk with you about the National Library of Medicine: its strategic plan, where we fit in with the NIH, how we are preserving trust in data, and what options are coming forward that might be of interest to your institutions. And I promise not to keep you here until much after 6 p.m. No worries. I want you to look at the line in the last box: how the National Library of Medicine creates trust in data. What does a library do? Lots of places have shelves. Lots of places have server racks.
There are very smart people who know how to enumerate and count and curate things. So what does a library do? We fundamentally create trust, and the substrate of discovery right now is data. At the National Library of Medicine, we are committed to data-driven discovery, in a partnership with the NIH to make sure that data-powered health becomes a reality in our society. As I said, today I'm going to talk to you about our strategic plan and also about the National Institutes of Health Strategic Plan for Data Science. Data don't organize themselves, as you've probably figured out already. You've got to do something; you've got to put some stakes in the ground, and that's what we're trying to establish. We are preparing for, and actually doing some things for, data-driven discovery. I'll be sharing them with you today, very pleased to show you what's happening, and we'll be talking about training. But I suspect there might be one or two of you in the room who don't really know much about the National Library of Medicine. So let me first introduce you to our library by showing you a brief video.

As an institute of the National Institutes of Health, we are fundamentally a research engine. The National Library of Medicine supports direct research in the areas that you saw here, but we are best known for our products and services, which we deliver to the world millions of times a day: PubMed, MEDLINE, and MedlinePlus for lay people; our early out-of-the-box innovations such as the Visible Human Project; and our very essential WISER assistance in a moment of crisis. The National Library of Medicine is committed to delivering high quality, trustable information, to serving as a repository of that information, and to applying research techniques to make that information discoverable and useful. We know that today a million citations a year are added into PubMed.
Not everybody is reading all those citations, and we must find ways as a library to make them accessible, to use knowledge extraction techniques and modern machine learning techniques to make the information useful to society, and also to preserve its relevance to our society and our biomedical community. Now, the National Library of Medicine will continue to provide its fundamental core services, such as dbGaP, our database of genotypes and phenotypes, or ClinicalTrials.gov, which is our repository of both the declaration of clinical trials and their results. But over the next decade, we're committed to advancing in three key areas, and that's what I want to talk to you about right now. First, our first pillar is to accelerate discovery and advance health through data-driven research. Second, to reach more people in more ways through enhanced engagement and dissemination. And third, to build a workforce for data-driven research and health.

Let's take these apart a little bit. To accelerate discovery and advance health through data-driven research requires that the library stay clear and true to its core mission, and yet we envision a future where we'll be fostering the ecosphere of discovery by connecting digital research objects. What a library does is build the arcs and structure the ovals in the diagram that you see on the screen. We are envisioning a time where there is a seamless pathway from the literature to the data underlying that literature, perhaps to the people who conducted the investigation, or to the pathways and protocols used to carry out the investigations. We are envisioning interconnections that allow an investigator to extract knowledge in an efficient fashion. What the library does, and has always done, is make connections, name things, and make them visible to others. We are committed to continuing to do this, and to doing it under a new framing of knowledge, data science, and open science.
And this is a game-changer for the way we think about what is trustable information. The National Library of Medicine, like many libraries around the country, has had a successful long-term partnership with publishers, who provide the vetting of the knowledge, who provide the structuring of the knowledge, and in our case send us the XML files to create the bibliographies. That partnership has been important. When we move to data-driven discovery, we've removed the intermediary; we're dealing with the raw substance. This requires that we rethink what the library does. And we're doing this in an era where open science is a philosophy that is rapidly being embraced around the world: open access to data, open participation in research, open sharing of information. While we've had two decades where the economic value of discovery has been heralded as the reason why we do research, now we're realizing that opening the data and sharing the discoveries actually accelerates the economy in a much quicker fashion. An open science model, though, changes the game a lot in terms of how we think about the protection of intellectual property rights, the protection of an individual participant's data, and fundamentally, the rights of our investigators to exploration without the intrusive supervision that all of our digital pathways now make possible. So we move into this era of open science recognizing there are new players, new rules, and new kinds of data to deal with.

But fundamentally, we know as a library that fewer and fewer people come up to Bethesda to see us. There's a great big fence around NIH right now; it's a lot harder to get on campus than it used to be. We remain accessible and available to people, and we are fortunate because we know that the people who know us like us. But what we have to remember is that we need to reach out. We need to get to more people. We need to get to those people who don't know us.
We are committed to enhancing our engagement, reaching more people through better understanding, through knowing in the moment what the person is after, through going beyond pattern matching of terms and concepts, so that the question and the answer are responded to at a level that is not just factual but operational. We're committed to enhancing information delivery. We are expanding our investment in standards, particularly health data standards, but also the structuring of data around formalized terminologies, including biological data as well as image data.

We are experimenting with PubMed Labs. Now, those of you who've grown to love PubMed might be getting a little nervous when I say we're going to change it. But we're going to change it for the better; we've got some very exciting things coming your way. If you haven't seen our new experimental site, then when I'm done speaking, please Google PubMed Labs, and you'll get to see all of the new innovations. One of the most exciting innovations is using machine learning strategies to present your list of citations not in the reverse chronological order with which you're familiar, the most recent citation first, but in something we call relevance-based ranking: understanding what you're after and what relates to it, in terms of the behavioral patterns of others and actually a thousand different factors, so we can give you information more quickly and with greater relevance. We know this is important for many reasons, but the fundamental one is that 80% of the people who launch a PubMed search never go to the second page. So if your best article is on the second page, or the piece they need most for that next study is on the second page, they won't find it. So we're working to make information more accessible in the moment. We also recognize that many of our resources are used through machine-to-machine processes.
That is, there isn't a human in the loop at the point of searching or receiving. So we need to do several things. We need to make sure our pathways are trustable. We need to make sure that the National Library of Medicine brand is well known and well understood, because the trust that it can imbue in the user of our information should extend their ability to do their work. We recognize that less than 50% of our users nowadays are human eyeballs, and we have to figure out how to convey the same level of resources to them.

We're looking to reach new users in new ways. We're experimenting with augmented reality and virtual reality strategies. If you look at the second box from the left, you see what looks like a nutrition label hovering over an orange juice container. Imagine delivering information in the moment that someone needs it. Now, we can't do this today; that Google Glass thing failed. But there will be things in the future where we can, and the library is getting ready for that. The library is getting ready to see how we translate information that was once in a permanent structure on a piece of paper to information that's floating in the air. We're looking to use augmented reality for experiential information presentation, so that the information that is needed at the moment of need, not at the moment of want, can be held by individuals. We also recognize that no matter what happens with technology, we are fundamentally a human enterprise, and we need to use our skill set to reach people wherever they are. We are very proud of the fact that the National Library of Medicine has cultivated the National Network of Libraries of Medicine, organized across the United States. We now have 7,200 points of presence around the country.
Health science libraries, hospital libraries, public libraries: places where trusted health information is accessible. And this platform is now powering engagement, so that the All of Us program, the major initiative by the NIH to engage a million people in a massive scientific exploration, now has a person on the ground in the neighborhood who can answer health questions, who can be there and be present. So the National Library of Medicine is looking toward a period of engagement mediated by technology and fostered by individuals.

We also recognize the importance of building a workforce for data-driven research and health, and we recognize there are different types of workforces that are going to be needed. Certainly we need data scientists, and we partner with our colleagues across the NIH to determine how best to prepare researchers who have data science skills within a health science framing. They also need partners on our university campuses and in research institutes, where your roles become critically important. The alignment of data science investigation, whether it's in physics, biochemistry, or chemical engineering, means all of these can share the important shareable parts: the analytics, the data management strategies, and the advanced visualizations. And the uniqueness of health information, biochemistry information, or mechanical engineering can then be built on top in those areas. We are committed to enhancing research training for biomedical informatics and data science, because it's essential to remember what biomedical informatics brings to the conversation: formalization tools, the structuring of data, and making it possible to understand information relevant to the culture and context of health. That is critical. But we also recognize that this must be done in a way that fosters a diverse workforce, one that brings individuals into the academic workforce and the research workforce in new ways. So we are committed to training across society.
We are committed to using hackathons, which can cleverly engage young people in using different kinds of computational techniques, to get them excited about and interested in becoming part of the workforce. We recognize that we have to support laypeople in understanding the value of data science and data-driven health discoveries, so we've established a new extramural research program called Personal Health Libraries that brings the power of data science into the hands of your neighbor. We recognize fundamentally that what the National Library of Medicine must do is foster our distinctiveness as a reliable, trusted source of health information and biomedical data, including the analytics that are done and the visualization tools that are used to make this possible.

Now, we don't do this alone. We are one of 27 institutes and centers at the NIH, and we are delighted to be enjoying, right now, a $37 billion annual budget. You have a right to know what we're doing with that money and how we're fostering data-driven discovery. If you've not visited the NIH campus before, let me call your attention to the diagram at the back; that's our physical layout. On the left-hand side, midway up, you see a building with a diamond-type roof on it and a tall building next to it. This is the National Library of Medicine's space. We are in a wonderful spot because we're the gateway to the campus. And yet we recognize that we share the responsibility for data management; we are not the data dump of NIH. We have to accelerate effective use of data across NIH. And with 27 institutes and centers, I can tell you, being one of 10 children at home, that getting 27 institutes and centers to look in the same direction, much less walk in the same direction, is a significant challenge. However, we've made progress this year. In part stimulated by Congress, we built the National Institutes of Health Strategic Plan for Data Science.
We received recognition from Congress that data science is important, and we've received additional funding to foster it. We now have a plan for how this will be rolled out, and we're in the midst of an implementation process. Let me walk you through it briefly. The NIH Strategic Plan for Data Science should be aligning with the plans that are going on in your institutions, and we should be developing synergies with them. We are focused first on building a good data infrastructure; second, on modernizing the data ecosystem; third, on improving data management, analytics, and tools, because the t-test that worked in an experiment of 40 people is not going to work on a billion dots, I can tell you that right now. We also need to focus on workforce development and remain committed to stewardship and sustainability of our data.

Under each of these columns, you see the key activities. Please note, under data infrastructure, that the NIH recognizes that optimizing storage and security is our primary responsibility. We are dealing with very precious patient-level data, and we must make sure it's secure. We're also, though, proposing to interconnect the NIH data systems, because, frankly, 27 institutes and centers have led us to about 300 different data repositories, not all of which talk to each other. To modernize our data ecosystem, we've taken steps toward creating better repositories, finding strategies to safely share individual-level data, and improving the integration of observational data with information that comes out of traditional research activities. In terms of data management, we're committed to generalizable workflows, generalizable visualization tools, and an increased ability to catalog and know where our resources are. As for workforce development, as I explained earlier, NLM is spearheading this, but it's an NIH-wide commitment to expanding the understanding of data science among researchers as well as clinicians.
And finally, in terms of stewardship, we are committed to the FAIR principles: all data should be findable, accessible, interoperable, and reusable. Wonderful aspirations, a little hard to deliver in practice. We've already started down the path of implementing these, and we're recognizing some cross-cutting issues that require the NIH to actually come together and sing from the same prayer book. We know first that there must be a common infrastructure and architecture upon which more specialized services can be built; we need a basic underlying infrastructure that supports the entire NIH operation. We will not do this alone. Although $37 billion sounds like a lot of money, we must leverage commercial tools, commercial resources, and new kinds of expertise from other fields, because that $37 billion should be driven toward cures, toward health, toward people, and not necessarily toward creating an infrastructure that could be better supported and used through the Department of Energy, the National Science Foundation, or other government bodies. We have recently launched an initiative with private-sector cloud computing providers, particularly Amazon Web Services and Google, to provide low-cost, permanent, accessible cloud storage for our major research projects. We're committed to enhancing training, and what we recognize is that enhancing training can't simply be adding courses into a program; it is fundamentally rethinking what core knowledge people have to bring to doctoral training in the biomedical sciences, and what fundamental knowledge we should be helping mid-career individuals access and acquire. We're committed to securing and structuring our data resources as FAIR, and to ensuring information security.
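Those stewardship commitments can be made concrete as a simple checklist over a data set's metadata record. This is only an illustrative sketch: the field names and checks below are placeholders, not an NIH or NLM schema, but they show one way "findable, accessible, interoperable, reusable" might be evaluated operationally.

```python
def fair_gaps(record):
    """Return the FAIR principles a metadata record fails to evidence.

    Illustrative checks only: a persistent identifier plus a title for
    Findable, a resolvable access URL for Accessible, a community format
    and vocabulary for Interoperable, a license plus provenance for Reusable.
    """
    checks = {
        "findable": record.get("identifier") and record.get("title"),
        "accessible": record.get("access_url"),
        "interoperable": record.get("format") and record.get("vocabulary"),
        "reusable": record.get("license") and record.get("provenance"),
    }
    return [principle for principle, ok in checks.items() if not ok]

# A hypothetical metadata record that is missing a license.
record = {
    "identifier": "doi:10.1234/example",       # hypothetical DOI
    "title": "Example cohort dataset",
    "access_url": "https://repo.example.org/ds1",
    "format": "CSV",
    "vocabulary": "MeSH",
    "provenance": "collected 2017, NIH-funded study",
}
missing = fair_gaps(record)  # only the "reusable" check fails here
```

A real assessment would of course go much deeper (resolvability of the identifier, machine-readability of the license, and so on), but even a checklist like this shows why FAIR is "a little hard to deliver in practice."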
We have moved to a model of identity and access management that we believe will extend beyond NIH-identified investigators, so we'll be able to move toward sharing the NIH data resources more broadly, making use of identity and access management resources that are vetted by institutions, just as you give individuals permission rights in your own institution, but that will also allow citizen scientists to actually engage with and use our data resources. A significant challenge the library is taking on is curation at scale. We recognize that the most expensive aspect of any kind of storage nowadays is the curation part, the human engagement: understanding how we label things, how we make them findable and identifiable. We've launched several research programs to develop new tools for computational curation, as well as to engender best practices, and that requires that we work closely with community input to promote and refine data standards. The word community takes on new meaning when we move into interdisciplinary models where communities intersect at the edges, and so as we build new standards and new terminologies, we are committed to working with some partners we may not have worked with before, physicists, chemists, humanists, to understand how to extract the essentials of curation that are valid across all disciplines. And we are fundamentally committed to coordination with funding agencies so that we can avoid unnecessary duplication. It makes no sense to repeatedly stand up and structure data repositories. And I state that, frankly, as more of a question that I want your guidance on than a conclusion we have reached: we recognize that institutional data repositories are becoming a big focus of concern in many of our universities, and we need to hear how best to work with you. The vision that we have at the NIH is that there should be a sustainable infrastructure that rests on a federated data commons model.
There will not be a single point of all the data in the world collected within the NIH fence; rather, think about how we federate these with the interoperable tools that you see written down the center: common user authentication strategies, shareable APIs for data access and computing, automatic implementation of the FAIR principles, dockerizing or containerizing data and analytics, reusable workflow management, digital object identifiers (the world is finally realizing how critical it is to identify digital objects with unique IDs), and finally data standards and sharing. What you see floating around here are some of the major NIH investments. What's outside of the cloud, which we know needs to be connected for our scientists, is environmental analysis, transportation information, agriculture trends: data that needs to be interconnected. The NIH cannot do this alone, and we need to work with our partners.

In addition, though, we recognize that the NIH must make sure that data are available from the moment they are exported from a research project. When I started my academic career, the outcome of a research grant was the hypothesis resolution: you closed it up, you wrote a paper, you went away. Now the research data are becoming a very important, perhaps even the most important, part of the research process. So how we foster an era of shareability, during this time of transition when we do not have all the answers and all the pathways, is requiring the NIH to take some very early, fledgling steps. Let me show you where we are with our data sharing policies at this point in time. We encourage our investigators, and we encourage, we do not yet require, data sharing, but we're encouraging all NIH-funded investigators to share their data through open access data repositories. For data sets that are small, it's possible to attach them to PubMed Central articles, so that as an article is published, the supplementary materials can include the data.
For data sets that are slightly larger, up to 20 gigabytes, we recognize that commercial resources such as Dryad and FigShare are very good partners, because they assign unique IDs, they store and manage data sets, and they allow for some very interesting investigator-driven management of those data sets. When we get to very, very large data sets, though, terabyte- and petabyte-sized data sets, we need different approaches to storing them, and that's why the NIH is investing in these partnerships with commercial cloud providers. We are going to continue to evolve in this area, and we need to hear from the communities, including the networked information communities, about where the investments should be: what you rely on NIH to provide, and what you'd rather see provided within your own institution.

The National Library of Medicine, though, is key to making all of this happen; you've seen that thread running all the way through here. So let me show you a little bit about where the National Library of Medicine is going. The National Library of Medicine is committed to creating the 21st-century collection, which has the characteristics that you see written across the bottom line here. We are serving as custodians for some of the content, we are serving as connectors for some of the content, and we are building discovery tools for the remainder of the content that might be needed to make the 21st-century collection possible. We need to find new ways for attribution: who is the author of a data set, and how do we hold accountability here? How do we devise automatic indexing strategies so we can make data accessible as quickly as possible? And how do we create personalized presentation and delivery, whether you're a group of kids sitting around a high school gymnasium looking at a health education video, or a scientist in the moment needing a specialized piece of information?
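The deposit pathways described here amount to a simple size-based routing rule: small data sets ride along as supplementary files on a PubMed Central article, data sets up to about 20 gigabytes go to resources such as Dryad or FigShare, and terabyte- to petabyte-scale data sets go to the partnered cloud providers. A minimal sketch of that rule follows; the 20 GB and terabyte-scale boundaries come from the talk, while the 100 MB cutoff for "small" is only a hypothetical placeholder, since the talk doesn't give a number.

```python
MB, GB, TB = 1024**2, 1024**3, 1024**4

def suggest_deposit(size_bytes):
    """Suggest a data-sharing pathway from the size tiers described above.

    The 100 MB 'small' cutoff is a placeholder assumption; the 20 GB and
    terabyte-scale boundaries are the ones mentioned in the talk.
    """
    if size_bytes <= 100 * MB:
        return "supplementary file on the PubMed Central article"
    if size_bytes <= 20 * GB:
        return "repository such as Dryad or FigShare"
    return "partnered commercial cloud storage"
```

In practice the choice also depends on data type, consent terms, and community norms, but size is the axis the talk uses to distinguish the tiers.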
We need to know you better so we can deliver to you better. We recognize that the 21st-century collection of biomedical knowledge has to be a permanent and discoverable archive of text and data: a wide range of data, image data, sound, genomic data. We are listening to trends in science and scientific communication to understand how to best partner with the way the scientific community is interacting. We recognize that open science principles are important. Preprints and other interim products of research, which are now acceptable as part of an NIH dossier, must be supported, and we also recognize a growing but not completely adopted approach to data sharing, which we want to foster. So we are planning, as the library, to prepare for and lead new directions through collaboration.

Now, when we create a collection at the National Library of Medicine, we're first driven by our Board of Regents policy; the Board of Regents provides the federal and public oversight of how we make a collection. Here's what our collection looks like. If you think of the National Library of Medicine's collection as the large oval here, on the right-hand side you see PubMed Central. That's our full-text repository: over 5 million articles, about half of which are fully open access and machine interpretable. On the left-hand side you see a circle reflecting PubMed. That is our bibliographic database, with 27 million citations in it now, and within that bibliographic database, slightly over 90% is what we consider our specialized collection, our MEDLINE collection, highly indexed and highly accessible. So first we think in terms of how we create a literature repository, either by holding or by connecting to important literature. We are driven in many ways by the NIH public access policy.
The public access policy is almost 10 years old, and it specifically states that after 12 months the full text of any article reporting NIH-funded research must be available; we created the PubMed Central archive to make that possible. Recently, as I've mentioned, preprints have become an important part of communication, and yet the NIH has taken what appears to be a paradoxical stand on this. On one hand, we do accept preprints as an interim product of research that you can report in grant applications and as satisfaction of grant responsibilities. But we are not housing preprints at the National Library of Medicine, and we don't anticipate doing that at this point in time; we're working with publicly accessible archives such as bioRxiv to make preprints available.

We are increasingly focused on how to share data. Now, NIH has had a data sharing policy for over a decade, largely driven by the genome process and the work on the human genome, which came with a commitment that said we want to make sure data are accessible and not stored or held for private gain alone. Any researcher requesting NIH funding must provide a data sharing plan. At present this is not a scored part of a grant, which means it doesn't contribute to the evaluation of the grant, but in the future it will become more central to the evaluation of a project. And it's possible for an NIH-funded researcher to use NIH funds to actually support data sharing; we are encouraging experimentation among our researchers to do this. We're also providing them services, and that's where our data discovery emphasis in PubMed Central and PubMed becomes very important. If you think about the article as the nexus for discovery, surrounding the article are some of the things you saw earlier on our sphere, or ecosphere, of data-driven discovery.
Patterns, profiles, research grants, preprints: we recognize that in this present day, 2018, anchoring around an article is still the most common way people will think about organizing information. So we're looking at ways to link data to articles, and we have been successful in doing this. Within our strategic plan we say we're going to stimulate new forms of scientific communication to make linkages for data. And we do this in part because we've had a 20-year history within NCBI of doing it well with genomic data, but also because we recognize that making data shareable and accessible enhances the rigor and reproducibility of research projects and builds the public's trust. Now, if you look at publications in PubMed Central, our full-text database, authors say many funny things: "The data are available on the author's website," or "All the data can be downloaded here," and a URL is provided. In these sections you find a wide range, from references and pointers to specific NCBI accession numbers for genomic data, to an individual investigator's laboratory that you can contact to get information. We recognize this is not enough. So as of October 2017 we've started to allow the connection of supplementary data to articles that are published in PubMed Central. Supplementary files may include computer code, implementable algorithms and computational models, but can also include the actual raw data. Investigators are expected to, and responsible to, ensure that the data are shared under the appropriate human subjects considerations. We allow for data citations that should facilitate access to the data and any associated metadata, code or related materials, so the data can actually be reused, can actually be employed in secondary analyses. And you see in the upper right-hand corner what the data citation looks like in a PubMed Central record, how accessible they are.
And then below that, in green, you see the XML code. This is often provided by the publishers; I would say in most cases the data are coming in from publisher sources. Our current snapshot suggests we're making progress. There are data availability statements in about 136,000 of our articles, and almost 20% now have some kind of either supplementary material, including data, or a data availability statement. And this allows us to gain some good experience. How big are the files? How complicated are they to work with? Frankly, they're working really quite well. We find that each time we make data slightly more available we see a rise of 20 to 30% in downloads per day; in the most recent statistics I received this morning, there were almost 40,000 downloads of articles with data availability statements in them. So we know that the data are being used, and we're exploring different ways to understand how they're being used by individuals. So far I've been describing data associated with a full-text article, but it's also possible to have data associated with a bibliographic record within PubMed. These we refer to as our secondary source IDs. The secondary source IDs can be provided by the individual author or by the journal itself. On the lower right-hand side of the screen you see a typical PubMed citation record with the abstract displayed, and the related secondary data are available right there at that point. Now, this still requires a great deal of human engagement; we're working to make sure machine engagement can happen as well. We do recognize, though, that public data services like Dryad or figshare are becoming important, so we use a resource called LinkOut that allows us to link data sets that are deposited in Dryad back to a PubMed record, and this record can be maintained and updated: that is, when an individual indicates a new data set has been added, the PubMed record is updated itself.
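The slide's XML isn't reproduced in this transcript, but to give a rough sense of what a machine-readable data citation can look like, here is a small sketch that parses a simplified, hypothetical JATS-style fragment. The element names (`element-citation`, `data-title`, `pub-id`) follow the JATS tag library, but real PubMed Central records are richer and the exact tagging may differ; the DOI and title shown are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical JATS-style fragment: a data availability
# section containing a citation that points at a deposited data set.
FRAGMENT = """
<sec sec-type="data-availability">
  <title>Data Availability</title>
  <p>The data underlying this study are available in the repository.</p>
  <element-citation publication-type="data">
    <data-title>Example survey responses</data-title>
    <pub-id pub-id-type="doi">10.5061/dryad.example</pub-id>
  </element-citation>
</sec>
"""

def extract_data_citations(xml_text):
    """Collect (title, DOI) pairs from data-type element-citations."""
    root = ET.fromstring(xml_text)
    citations = []
    for cite in root.iter("element-citation"):
        if cite.get("publication-type") != "data":
            continue  # skip ordinary bibliographic citations
        citations.append({
            "title": cite.findtext("data-title", default=""),
            "doi": cite.findtext("pub-id", default=""),
        })
    return citations

cites = extract_data_citations(FRAGMENT)
```

Machine-readable structure of this kind is what lets the data citation be harvested and linked automatically, rather than relying on free-text statements like "data available from the author."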
Not by itself, though: humans still have to work on that. We are seeing on PubMed Labs a whole range of new possibilities for data citations; PubMed Labs allows us to experiment with different kinds of data accessibility tools. So as we prepare for the future we try to develop new ways to expose data. We are also looking at bringing better scientific data into PubMed Central so that a range of associated data sets can be accessed by IDs or by URLs. An important feature that we're working on right now is the linkage of data sets to an article through an individual's controlled My Bibliography. My Bibliography is an NLM-sponsored utility that individuals can use to list all of their articles together. They are able to update linkages over time, so if, a year or two after an article is published, an individual wants to attach a different data set or an additional visualization to that article, it's possible for the individual to do that, and the PubMed record will then be updated. In this way we are able to engage our authors in helping to keep our information current. Lots of things going on with data. We are trying to follow the principles that we know make a library resource useful, secure and available to the public. I want to briefly touch on some additional data resources that are available at the National Library of Medicine and then return to the major NIH directions in data science. Many of you are familiar with our clinicaltrials.gov repository. This is a place where you can declare clinical trials and have the results reported. Importantly, it serves as a point of public accountability for drug trials, which is useful for FDA applications and for NIH accountability. But I want you to think not in terms of the interface that allows individuals to locate a trial that might be useful for somebody in their family to participate in, but rather to think about the concept of clinicaltrials.gov as an information scaffold.
Think back to the idea of creating an ecosphere of discovery, and you'll see here that we've moved from the focus being sharply on an article to the focus being on the trial declaration, so we're able to connect additional relevant parts of a research study, the protocols, the analysis plan, the results database, even individual participant data, through a single point. We recognize that investigators and scientists and society will come to our resources in lots of different ways, and we need to provide flexibility in how they do that. Clinicaltrials.gov is one example. We know that research studies use different kinds of measurement instruments, and having standard approaches to measuring familiar and important concepts in biomedical research is a critical accelerator of interoperability as well as of extending studies. So we have developed something referred to as the Common Data Elements Portal. Common Data Elements are a commitment across the NIH to enhance the findability and interoperability of data for common concepts such as depression, family structure, or adherence to medications. The use of Common Data Elements in a research project allows validated measures to be used, so it increases the rigor of research studies. It makes it easier to correlate findings across studies, and it improves our ability to extract knowledge when individual studies may be too small to resolve hypotheses on their own. The National Library of Medicine is fostering the Common Data Elements in both human- and machine-readable forms, allowing structured information to be available earlier, at the point of research planning. It also allows investigators to harmonize across different measures, to select the most appropriate measures, and to reduce duplication of effort. At the other end of the spectrum, we maintain an enumeration of data sharing repositories.
Over 350 repositories of completed studies, with varying levels of accessibility, are summarized in our chart of data sharing repositories. This allows us to take data collected for one study that have been approved for reuse and make them accessible and reusable. We're making small steps, but the steps are important on this trajectory of fostering data-driven discovery. My slides, by the way, will be available afterwards, so if you've been trying to jot down URLs, we'll be able to get these to you through the CNI listing. Let me finally move toward closing, and then conversation, to tell you what's happening at the NIH to integrate the solution, to move it out of a single institute and bring a broad strategy in. We have just launched something called the Office of Data Science Strategy within the NIH Office of the Director. Susan Gregurick is the director of this office. She's a fantastic colleague and a great advocate, trained through most of her experience within NSF and the large-scale data systems there. Susan's efforts in the Office of Data Science Strategy represent a commitment from the NIH that a number of the projects I've just described to you, data sharing within articles, common data elements, the scaffolding of information around clinical trials, and the ecosphere of data discovery, are becoming an NIH priority. So we are making another significant step in ensuring that data will become an important substrate for discovery. But we recognize that we require a rapid infusion of new workers into this workforce. So the NIH is launching several data science fellowships, and I'm bringing them to your attention here because applications are open now and you may have students or faculty in your institutions who might be able to take advantage of these fellowships. The first fellowships we have are graduate data science summer programs.
The graduate data science summer programs are managed through the Office of Education on the NIH campus, and they are built on an initial consortium of local universities; you see UVA, George Mason and George Washington, though students from all universities can apply. Projects have been proposed, and we anticipate bringing in 15 interns this summer, who may be master's-level or early PhD students, to come and address specific data science challenges within biomedical data. Now, we're doing this for several reasons. We certainly want to expand the workforce, but we also recognize that many of these young students are bringing in approaches to data analytics that will be novel to us, so we will be able to learn from them, and it will be a sharing process that we're really quite excited about. If you go to the NIH website and simply look for data science training, you'll find this. Another fellowship program happening right now is something called Coding it Forward, part of the federal Civic Digital Fellowship program. It's a student-led nonprofit that sets up partnerships with students, usually college-level students (they don't have to be graduate students), for a 10- to 12-week internship program at NIH. Our Office of Data Science Strategy is going to be coordinating some activities so the fellows have a point of presence; they will have somewhat more supervised training than our graduate fellows, who will be deeply embedded in laboratories. The third fellowship I want to bring to your attention is one designed to increase the capacity of NIH to handle large-scale data analytics. Our goal here, and this will be launched by mid-summer of this year, is to create a program called the National Data and Technology Advancement Fellows, the NIH Data Fellows.
The idea is that we would bring in individuals from industry or from academic programs, perhaps on a one- to two-year IPA (a sabbatical would be appropriate for this), whose expertise is in data science, not necessarily in biomedical data, to work with our massive data resources and investigate how their approaches and strategies can benefit from, and maybe expand, our ability to work with the data we have available at NIH. Our goal is to bring a cohort in in 2019 for a two-year period. Our hope is to accelerate both the capacity we have for analytics at NIH and interest in biomedical data around the country. We expect to have two to five fellows in each cohort, and we recognize that one of our major challenges is going to be bringing individuals in at the appropriate level of salary, because, as you have all discovered, I'm sure, data science training now commands very high salaries, so we need to find ways to be competitive with that. Let me close by talking to you about a critical policy issue right now, and invite you to send me your personal comments if you have any thoughts after this session. NIH is proposing a data management and sharing policy for all NIH-funded research. This policy is available on the website now, even though the comment period is closed, and it is designed to help the NIH hear from the community about what policies make the most sense in light of the way institutions are managing data and the way institutions hope that the NIH will be funding data management, and to drive us toward a period where data-driven discovery becomes the norm, not the exception. Before we establish a policy, we need to understand from the community the different ways that biomedical data are defined.
We need to understand what institutions and individuals believe are the requirements for data management and sharing plans. When I was an investigator, it was enough to say that data would be made available by the investigator via email. That doesn't work anymore: it provides no protection for the data set, and it doesn't make the data set accessible. But without excessively burdening institutions, what are the next steps that would make sense for NIH to consider? And what is the timing of this? Should we be doing experiments over the next three to five years? Is the community ready for us to put some standards in place for data sharing? We're hearing both responses, frankly. Some people want us to get moving more quickly; others are saying test out a few different structures, because we know institutional capacity for data management isn't where it needs to be right now. The considerations that we're bringing together here have to do with the budget for data management and sharing, particularly who will be paying for this. On the use of existing repositories, we are particularly interested in hearing from institutions that have already made heavy investments in intramural data repositories. We see a lot of encouragement to use local repositories, and we need to know how we're going to make data in those repositories accessible, shareable and understandable by others. And finally, we need to hear from around the country what the community's expectations for shared data actually are. We're investing in data management in part because we recognize it's good stewardship, but in part because we believe it will accelerate us toward that future. Over the last hour I've taken you on a very fast walk through the National Library of Medicine's strategic plan, our focus on new ways of data sharing, and the NIH investment and interest in this, and now I'd like to hear a little about your questions, your comments, and the directions we should be going.
Thank you very much for your time. And we've left some time for questions; I believe there are two microphones up if anyone would like to begin the comments or questions.

That was great. In your discovery section, you mentioned the desire to do more personalization, to deliver the information that your users need, and we've been having a couple of conversations here about patron privacy and what steps we should be taking to protect it. So do you have any thoughts on the tension between those two goals?

Well, the tension is an appropriate tension, for certain, and the issues about privacy... actually, don't go away, because I'm going to ask you: what are the key privacy concerns that you have? I can tell you the ones we have. We're first and foremost interested in reducing the number of clicks. Secondly, we're interested in allowing individuals unfettered access to the information they want without undue scrutiny. And at the same time, we recognize we need to understand the trajectory through our resources so we can determine how best to organize them for individuals. So: what are people willing to have captured about them, whether or not it's identifiable? And the third part we're particularly interested in is how much of your history should inform your future. Are individuals, if you will, self-reflective enough and adept enough to be able to say, toss that search, don't ever use that one, but build on this one because it made sense for me? So can you tell me some of the issues that came up in the discussion here?

For our users, it's when they're making decisions about what they're willing to share. And then, I think more alarmingly, probably, it's the tools we are using to aggregate that information: what data are getting shared among, say, publishers and third-party vendors and non-libraries.

So when data becomes an asset, a material asset. Yeah, I see. So: I'm from the government, I'm here to help you.
So of course we would not do that. I say that with... we don't do that, but I do recognize, you saw me say the word trust 15 times in this talk. I know the federal government has a lot of trust-building to do around health data; we have not always been honest or good stewards about this. And I would be interested in guidance from this community, if there is some, about how we explain what we do and don't do already. So: we do not profile users. We do not know users who choose not to let themselves be known to us. But we do have an option called My NCBI through which you can actually be known to us on a very regular basis, and we develop a lot of dialogue. That's the second level of helping people understand what exactly it is that we do with their data. We do not share or expose our service logs. We do not allow investigators, except for the improvement of a specific service, to come in and look for our Thursday-night users or our high-profile users or our everyday users; we don't have the ability to track individuals at that level. But saying what we do and don't do is one level of building trust; illustrating how we use the information is another. Now, I would also be interested in knowing, as we interact with the publisher community and with commercial information sharers, what kinds of questions should we be asking them about best practices? And to what extent should the National Library of Medicine actually be fostering the dialogue around best practices? I've been in some conversations that have been really... This is a neat time for scientific communication, very exciting. But there is also the awareness that scientific communication, particularly the labeling of journal titles and articles, has taken on all sorts of meanings. Beyond impact factors and whether or not your h-index is high enough, the presence or absence of a Nature cover article in your CV speaks volumes, according to some people.
Others are arguing that we should redact journal names from tenure cases and from CVs so that we level the playing field and base evaluations on the science and not on the presentation of the science. When we move to the citizen scientist, a layperson who, if you will, is not of the guild that academics are, we have a whole new way that we have to explain what we're collecting and not collecting, what we know and don't know about individuals. And to be very honest, you can get PubMed results through Google, so we have these very confusing displays of information. When you're on our site, there are certain behaviors you can count on us for; but when you're receiving our resources while not on our site, that's "NLM inside," we have another level of information. We do not, at this point in time, provide page view statistics. I don't know, was there interest in page views, or was there concern about page views? I've heard both sides: from people who say that's a measure of how widely your work is being read, and from those who say it's now a distortion.

The page view issue. Yeah.

Okay, that's helpful. We're very open on policy and ethics; we've expanded our investment in this area. Jerry Sheehan had been our public policy lead for many years. He's now the deputy director, so he brings to that position at NLM a very keen awareness of our public responsibility and also our collaboration across the federal government. Dina Paltoo is now our director of public policy, and she's supported by Rebecca Goodwin; Lisa Federer's work is also really helping us think about the ethics of communication and the ethics of knowing what kind of information people are looking at. Other comments or questions? Yes.

Hi, Patricia. Eric Mitchell, UC San Diego. I'm really intrigued by the data sharing and management policy, the last topic.
As we picked up this topic at San Diego, about making data more openly available, we realized we didn't have a good grasp even on the policies that govern how research data are managed within our institution. I'm curious: as we take on this kind of modest endeavor, how can we do so in a way that will dovetail well with what government agencies are going to do if they also pick up this topic? How can we make sure we're working together on it? Or maybe you've got advice for us on what we should stay away from.

Thank you very much for the question; this is exactly what I had hoped would come out of having this conversation here: to learn what's happening in the communities around the country and how it aligns, or needs to be aligned, with what we do. So I'd like you to think about three things that we find really important. One is the life cycle of data: helping investigators early on in their planning to think about what the data products of their research are and how long they are going to be valuable. Now, everyone thinks their data are really valuable and are going to be used by lots of people. That's a fine starting point, and we might want to help them think about the long term, and, even if it's not required in a federal document, to begin at the department level, or at the office of research or the research and sponsored projects office, to gather that information, because it will give the institution an idea of what kinds of downstream commitments might be there. So first of all, think about the life cycle of data. It's pretty clear to us that data are used a lot in the first two years.
They're used less often over the next five years, but there's a really long tail for some data, and we're finding, as the librarians in the room will know, that it's much more difficult to get rid of data than it is to acquire them: deaccessioning costs more than accessioning. So once you've made a commitment to preserving data in some way, from an institution's perspective, and not the six floppy disks sitting in my apartment up on Connecticut Avenue, which came from my 1984 study on AIDS, you really have to look at what the institutional resource is. The second thing I think is really critical is, to the extent possible, to forecast cost. From the perspective of the National Library of Medicine, we have just launched a study with the National Academies to provide some econometric modeling of long-term data sustainability, and we're interested both in how we tell an investigator at time zero the lifetime cost of that data and in who pays for that cost. Now, one of the things that cost forecasting is really helpful for is that it begins to uncover the presumptions that we have. "This data will never need to be refreshed again because it's in an operating system that's never going to change." That's a lie. You have to help people understand where the barriers are so that you can make more realistic plans. But we also have issues such as: where will the curation happen? My goal as a librarian is to push the curation upstream. I want you all to do the curation, according to my standards by the way, and then let me take it in. I don't want to have to curate on ingest, but right now that is our challenge: we are curating on ingest. So our challenge now is to say where that cost is going to sit. It is not free to curate data, and if data are going to remain in an institution, somebody is going to have to at least manage the directory and manage the identification of it.
So the cost question actually takes you back to laying out the life cycle a little more. And the third concern, which probably should have been said first, but I apologize, I'm being really instrumental in my thinking these days, is the ethics of the data: the ethics of what is kept, why it's kept, and who governs its reuse. And boy, this requires the IRBs to get on board really quickly. The institutional review boards and ethics boards really need a lot of assistance in thinking about data in a different way, because the IRBs tend to think largely about the value of the study and, if you will, the informed consent of the participant; the reuse of the data is not brought in. And there are two sides of the reuse challenge here. One of them is that if you don't reuse the data, you've exploited your participants and not given them full value. The other side is that if you do reuse it, in a way that in 1998 wouldn't have disclosed the individual's identity, 20 years later, when we've got better next-gen sequencing, we can know it is your mother. We have to rethink how we keep vibrant policies in place. So: life cycle, econometric modeling, ethics. Those are the three pieces I would focus on. And I can't stress enough that I would be very interested in hearing from you; even though the comment period is closed, I'm happy to share the comments with the committee, and they will take informal comment later. To what extent should data management costs be a direct cost on a grant or program versus an indirect cost? For those of you who aren't familiar with that lingo: if we make it a direct cost, we can allow, say, each individual investigator $3,000 on this grant to pay for curation, or for buying in to store data in Dryad; Dryad costs $4,000 right now, I think, for permanent storage. So we could identify a cost, and it would be paid to the individual.
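To make the cost-forecasting point concrete, here is a back-of-the-envelope sketch of what telling an investigator the lifetime cost of a data set at time zero might involve: an upfront curation fee, recurring storage, and periodic refresh or migration costs over the retention period. Every number in it (the curation fee, storage rate, refresh cycle, and refresh cost) is hypothetical and chosen only for illustration, not drawn from any actual NIH or repository pricing.

```python
def lifetime_data_cost(
    gb,
    years,
    curation_upfront=3000.0,   # hypothetical one-time curation fee
    storage_per_gb_year=0.30,  # hypothetical storage rate, $/GB/year
    refresh_every_years=5,     # assumed migration/refresh cycle
    refresh_cost=500.0,        # hypothetical cost per migration
):
    """Forecast the total cost of keeping a data set for `years` years."""
    storage = gb * storage_per_gb_year * years
    refreshes = years // refresh_every_years  # number of migrations incurred
    return curation_upfront + storage + refreshes * refresh_cost

# Example: 100 GB kept for 10 years under the assumed rates.
total = lifetime_data_cost(gb=100, years=10)
```

Even a toy model like this surfaces the presumptions the talk warns about: a long retention period forces refresh costs into the forecast, and the storage term makes clear that "we'll just keep everything" has a price that compounds with volume and time.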
Institutions may or may not want that direct cost. If we put it in the indirect cost model, as an investigator I can tell you that never works, because the investigator never finds any access to the resources; somebody else at Wisconsin does. But the challenge really is: if the institution is willing to say we're going to be stewards of the data, then it has to generate some income to do that, and indirect cost recovery is one place to do it. So I'm very interested in hearing your thoughts about data management costs: should they be a direct cost or an indirect cost on research projects? Are there others that you thought of that I haven't thought of? Any feedback on indirect versus direct cost?

Yes, thank you. Again, just speaking for my own institution: tapping into indirect cost, or tapping into institutional support, means coming up with almost this bulletproof thing that everybody will adopt and that would be perfect. And so it strikes me that a direct cost model would allow a marketplace of solutions to thrive.

Oh, that's helpful. I hear some support for the direct cost model. Okay. Yes, oh yes.

And there's the corresponding risk that comes when people start sharing their horror stories about ransomware, and our attention is currently on that. How do we even just get a sense of the landscape so that we can understand the risks we face and start planning for the future, to your point of surfacing those one or two really valuable data sets that have a 100-year lifespan? I'm curious: do you think we're taking our eye off another ball that is more important?

Well, I certainly think that you've, if you will, crystallized an idea that I haven't heard expressed this way before, which is that form and content have to go together. When you're setting a data management policy, the mechanism for sharing, whether it's going to be an uncorrupted file or a hard drive or some cloud-accessible space, has to be decided at the point that the data are being shared, and not later on.
I run a small institute in NIH; we're about a half-a-billion-dollar operation, with about a hundred million dollars a year in research funding and the rest in support of our services. NCI is five times as big as we are. What we all know is that the downstream costs of data management are starting to erode our research initiation budget, and that is not necessarily bad or avoidable, but boy, we want it to be conscious if we're making that decision. In an institution, the more commitment you make to internal data management on campus, the more you're building in downstream costs that are not going to be easy to move around. So NIH is particularly interested in knowing, first of all, how we define a high-value data set, which is in fact a risky thing to do, because maybe it is high value and maybe it isn't. But once we define a high-value data set, can we partner with the private sector and the institutions to actually have an NIH solution for storing it, either low-cost, discounted access to storage or some kind of actual repository at NIH? This brings in local control and who's going to have say over the data, and it does get equally complicated. But the thing I think we must keep our eye on is that if we don't preserve efforts for investigator-initiated inquiry and new models of big science and small science, we will end up becoming stewards of data, and that may not actually help the discoveries of the generations of the future. We have to make sure we appropriately invest in data, that we appropriately market data as an opportunity for research, which is going to take a lot of culture change, at least in my base programs of engineering and nursing, and also think forward about what questions can best be answered with existing data sets, even as a thought exercise, so that can help guide our principles for storing. I would encourage institutions to begin experiments on a small level with some pretty clear objectives, whether it's identifying a
topical area, a clinical domain, or a particular set of investigators, and start the structuring of the data storage planning, because we find that it is only when you have to actually get down and say "this set of bits, called this name, is going there" that you start to realize where you can begin to set boundaries. Keep an eye out for some projects coming out of NIH called the Commons Pilots; there have been some very interesting university-based tests going on over the last year or so that have been really helpful.

We're getting close to that time, so I think we can take one, maybe two more comments, and then the sunshine is waiting for you.

Hi, I'm Tim McGarry from Duke EOS. One thing I became concerned about at an NSF-sponsored workshop was the burden that data management places on the institution, and I'm concerned that if we don't have a solution for this at scale, we're going to actually have a loss of competition in research: institutions that have to get out of the research game because they simply can't afford it, and collaborations lost between privileged institutions and institutions of different sizes and different scales. I think that's going to really jeopardize research at large if we don't solve this at scale and solve it collegially.

If I could ask you to please send that comment to me in email; that's a really important perspective that has not been brought up yet, about how universities maintain research competitiveness and select policies that are useful for that. Let me ask: are you particularly concerned about biomedical data, or atmospheric data, or all data?

At least at Duke we're talking about all data, and from our context we're actually concerned that because Duke does have resources, it will be seen as having the opportunity to take on more of the burden, and we're afraid that the collaborations we do with other institutions will suffer; there's a lot of value and diversity of thought that we will lose
out of that, and so it's going to strain the actual fringe areas of research that we typically are able to take on, because we are able to resource ourselves and others.

It's a very interesting set of concerns, and I think it's not an easily solvable one, because right now we have yet to find the political appetite to make good choices about what data to throw out. Our approach has been: if there's data, it's precious, we'd better save it, we might need it. And then somebody has to bring up 1962 and NASA. It really is going to require us to make very, very difficult decisions early on, take some risks, and evaluate. I'm excited to see some new models of risk and risk propensity coming into the conversations about data, but I think your point about losing competitive edge and losing the diversity of participants is actually a particularly serious one. I thank you for bringing it up.

Yes, one last one, one last one. That's what I'm talking about. Siri says okay. Does anybody have Alexa? How did Alexa start talking to me this morning, too?
Hi Patricia, thank you for the comments. I want to dovetail on one of the previous questions that was asked about the NIH data management and sharing plans. So, we sent you the comments; the deadline was yesterday, and you'll have plenty from us. I was more interested in the data infrastructure that one of the previous questions asked about. Not knowing the scope, what are the plans of NIH, or what considerations have been put forth into what this data infrastructure is, considering you've started partnerships with AWS and GCP and others?

We are committed to a federated infrastructure. At this point in time we are committed to a five-year experiment with AWS, and I think that the Google relationship, which I haven't completely seen yet, is also for five years; the details are under a program called STRIDES on the NIH website. We are very clear that there will be a plural, there will be multiple commons, that we need to interconnect, so our focus is on the interconnection and less on the integration of all data. The challenge with this, of course, is data management, application maintenance, those types of issues. We work closely with the National Cancer Institute and NIAID, both of whom have taken very different, basically institute-specific, approaches. NIAID is less invested in broad-based cloud integration and more interested in portals as a pathway to integration. NCI's structure, particularly driven by Bob Grossman's thinking from Chicago, is actually large-scale repositories and the idea of compute in the cloud; this is useful for very, very large data sets. But what we recognize is that NIH has to have a strategy for small, medium, and large data sets. Investigator control will remain an important but not sole characteristic. Right now, access to dbGaP resources, for example, is driven by the institute within NIH where an individual gets their permission for access, and that does have a costing structure underneath it. I believe
that the federated system will not sustain us for more than 10 years, because it's too cumbersome and there will be too much redundancy in it. So I expect that within 10 years we will see new data models that come more out of clustering of data, models that will be more beneficial in use rather than so focused on storage. The challenge that I experience at NIH is that the legacy data sets bother people so much that they're not focusing enough on the new data that's coming at us. I think we need one solution for legacy data sets, but we really need to be thinking from the beginning: what is our willingness to tolerate throwing away data? So I thank you very much for your comments and your questions; you've given me homework to do. Good luck, and please keep working, because remember, the critical part for our country is to have information for health.

Thank you so much; that was just wonderful. The most wonderful thing for me about it is the way you've opened up a conversation with the institutions and the disciplines represented here, who are going to be key stakeholders in this, and I think it's really, really important that that conversation continue. I'm delighted that those of you who didn't send in comments yet can still get some informal ones in, and I urge you, as you reflect on what you heard today, to take that invitation and opportunity seriously as Patty reaches out to the community in future. We will, of course, pass some of those calls out through the CNI announcement facility to help make you aware of them. Please join me in thanking Patty, and in hoping that she will not be a stranger and will continue to stay in touch on these issues. With that, the only thing left for me to do is to wish you safe travels, good holidays, and a wonderful new year, and to say I hope to see lots and lots of you in Kansas City in early April. Thanks for coming.