Yeah, you can switch, and Jon, you're ready to go. Okay, so just by way of introduction: an immediate thanks to Jon Lorsch for his willingness to come speak to council. Jon, as I told you earlier, is the Director of the National Institute of General Medical Sciences, a good friend of the Institute, interacting with us on many different levels. He's spoken to this council once before, to introduce himself when he was the new director of the Institute, but he's here in a different capacity today, one he was interested in coming to talk to this council about, and one that has also bubbled up in discussions we've had with council. One of the many hats Jon wears is co-chair, along with Steve Katz, of an important group this council has heard about called the Scientific Data Council, a trans-NIH group that is part of the governance system Francis Collins created a number of years ago to deal with all the unique challenges we're now facing in biomedical research data: a trans-NIH strategic group thinking about many different issues, a large subset of which we talk about at this council's meetings. I'm a member of that Scientific Data Council, Jon is the co-chair, and among the many things it is doing, which Jon is going to talk about, is the development of a strategic plan for data science at the NIH level. That plan is now reaching a very mature state and is highly relevant to things we've talked about here.

As you will quickly figure out, NHGRI and many of us are obviously very interested in the data science challenges, with genomics, as I've always said, being a bit of a poster child for some of these challenges, but not the only data type. In particular, there are the issues surrounding data resources, many of which are funded by multiple institutes; we fund the ones related to genomics. This council deals with the complexities of supporting those, and there has been lots of discussion about how, in the long term, these are going to be sustained, especially at a trans-NIH level. As Jon is going to tell you, some elements of this new strategic plan directly try to put a framework together for the long-term sustainability of these important trans-NIH data resources. So I'm done riffing; it's all yours.

We solved that data science problem. There are a lot more to solve.
That's what I'll tell you about now. I want to tell you about the strategic plan for data science that was developed by the Scientific Data Council, as Eric said, in conjunction with the IC directors and NIH leadership, and with a lot of input from the community. It was actually requested by Congress in the 2017 omnibus appropriations bill, and I think it was a very insightful thing for them to request, because it allowed us to galvanize everybody into developing the plan in a relatively short period of time.

The overarching issues the plan focuses on are, first, modernizing the data resource ecosystem to increase its utility for researchers and other stakeholders and to optimize its efficiency of operation. I just want to pause there and highlight the part of the sentence shown in red: to increase its utility for researchers and other stakeholders. Although that sounds obvious, this is a shift in the way NIH, and I think the ecosystem in general, has been approaching this. For many reasons, most of them historical, the focus from NIH's point of view has been on the resources themselves, the PIs of those resources, what their needs and desires are. We're trying to focus instead on what is best for the research community: what resources, and what organization and construction of those resources, will be most useful to the people who actually use the data, not what is most convenient for the resources themselves, which I think has been more of the emphasis. We want to enhance data sharing, access, and interoperability. We want to improve the ability to use electronic health records, which is obviously an area of really rich potential but a very difficult one right now, as well as observational data, for research, while at the same time ensuring the confidentiality of the data in those systems, which is obviously a critical issue. And finally, underpinning all of this, we want to modernize the infrastructure this is all built on, as well as increase our capacity, that is, training and related issues.

A few definitions. We started out having to define what we even meant by data science, and the definition NIH agreed to is shown here: an interdisciplinary field of inquiry in which quantitative and analytical approaches, processes, and systems are developed and used to extract knowledge and insights from increasingly large and/or complex sets of data. Eric already mentioned the other term, which is an underpinning of the entire plan: FAIR, which stands for findable, accessible, interoperable, and reusable. All the data whose sharing NIH supports should adhere to these FAIR principles.
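To make the FAIR principles concrete, here is a minimal sketch of what a FAIR-style metadata record for a shared data set might contain. The field names are illustrative, loosely modeled on DataCite/schema.org vocabularies, not an NIH-mandated schema:

```python
# A hypothetical FAIR-style metadata record for a shared data set.
# Field names are illustrative, not an NIH-mandated schema.
dataset_metadata = {
    # Findable: a globally unique, persistent identifier plus rich metadata
    "identifier": "doi:10.9999/example.dataset.001",  # hypothetical DOI
    "title": "RNA-seq of Drosophila embryogenesis (example)",
    "keywords": ["transcriptome", "Drosophila", "time course"],
    # Accessible: a standard, documented protocol for retrieval
    "access_url": "https://example-commons.nih.gov/datasets/001",  # placeholder
    "access_protocol": "HTTPS",
    "authorization": "open",  # would be "controlled" for sensitive human data
    # Interoperable: community formats and shared vocabularies
    "format": "FASTQ",
    "organism_term": "NCBITaxon:7227",  # fruit fly, via a shared ontology
    # Reusable: clear license and provenance so others can build on the data
    "license": "CC-BY-4.0",
    "provenance": {"generated_by": "Example Lab", "date": "2018-06-01"},
}

for field in ("identifier", "access_url", "format", "license"):
    print(field, "->", dataset_metadata[field])
```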
Now, in order to tackle this, we broke what we think of as data science into five different domains. You could parse them differently or subdivide them further, but this seemed convenient to us. The domains we focus on are: data infrastructure, that's the hardware, the platforms, the architecture, and so on; data resources, the methods, practices, and features used to increase the value and utility of data beyond its native state, so things like, as we'll talk about in a second, the databases and knowledge bases; data management, analytics, and utilization tools, the tools, programs, and software that actually allow people to draw inferences and insights from data; workforce development, the capacity-building piece; and policy, stewardship, and sustainability, the policies we have in place that allow us to address the other four domains in the optimal way.

The organization of the plan is shown here. It starts with overarching goals; those are basically each of those big domains and what we are trying to do in them. Under each of those goals there are one or more strategic objectives: how, in more general terms, we are actually going to achieve those overarching goals. Underpinning each strategic objective are more specific implementation tactics, though the plan doesn't yet get to the granularity of actual implementation plans; that's the phase we're now embarking on. And under each implementation tactic will be milestones and performance measures. So I'll tell you about the overarching goals and the strategic objectives; in some cases I'll show you implementation tactics, and in others I'll just give you some general ideas. I won't tell you about the milestones and metrics yet, because those are being developed as part of the actual implementation plans.

The first overarching goal is to support a highly efficient and effective data infrastructure. It has two strategic objectives. The first is to optimize data storage, access, and security, and a really important point here, one we came to early, is that we need to rely when possible on the private sector, because they are the experts in doing this. That's where most of the developments in technology and practice are happening, and we really don't want to spend time reinventing the wheel, or spend money on things industry can do a lot better and has already done in many cases. So we're going to leverage what's happening in industry when possible rather than trying to reinvent it. The second objective is to connect NIH data systems together. I'll talk in a minute about the NIH Data Commons strategy and idea; that will be one of the hubs connecting the data systems NIH operates. But we also think NLM, particularly NCBI, needs to be a core hub, connecting the systems it is itself supporting and running.

So, the Commons. I think most people on this council are familiar with the idea, but it's going to be a cloud-based system, and as Eric mentioned it's already under development: one or more clouds in which high-value, generally consortium-generated NIH data sets are going to be placed. The initial data sets, as Eric said, were TOPMed, GTEx, and the consortium of the model organism databases. These will be linked together using the FAIR principles so they can be searched interoperably, the data can be found, and those data can be used together in some way, rather than being siloed as they currently are. Users from the outside will be able to gain access to these data, use them in different ways, and hopefully advance knowledge by being able to connect disparate kinds of data together under these FAIR principles. If you want to learn more about it, there's information on the Common Fund website.
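As a toy illustration of what FAIR-linked, cross-data-set search in a commons could enable, here is a sketch with an invented in-memory index standing in for the cloud APIs the real pilot would use; the data-set names are the ones Jon mentions, but the records are made up:

```python
# A toy stand-in for FAIR-linked search across commons data sets. The index
# is in-memory here; the real pilot would put TOPMed, GTEx, and the model
# organism databases behind cloud APIs.
commons_index = {
    "GTEx":   {"SOD1": "expression across dozens of human tissues"},
    "TOPMed": {"SOD1": "variants from whole-genome sequencing"},
    "MODs":   {"SOD1": "orthologs and phenotypes in fly, worm, yeast"},
}

def search_commons(gene):
    """Return every data set in the commons holding records for `gene`."""
    return {name: records[gene]
            for name, records in commons_index.items()
            if gene in records}

# One query spans what are currently separate silos.
for dataset, summary in search_commons("SOD1").items():
    print(f"{dataset}: {summary}")
```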
The second overarching goal is to promote the modernization of the data resource ecosystem, and I'll spend a fair amount of time on this because I think for this council there are some really important conceptual changes we're making, ones we'd like your thoughts on and your help disseminating. The first strategic objective is to modernize the data repository ecosystem. The first implementation tactic underlying that objective is to separate the support of databases and knowledge bases. This received a lot of feedback when we put it out for public comment, both positive and some less than positive, and I think it's going to be an area where significant change management is required. Again, I want to emphasize that the goal here is to make this system and these data resources more useful to the community. That again sounds obvious, but it's a shift away from the way we've approached things in the past.

So what do we mean by databases and knowledge bases, and what is the separation? Databases we're defining as data repositories that store, organize, validate, and make accessible the core data related to a particular system or systems. As an example of what we mean by core data, for the kinds of things most of you think about, it would be things like the genome sequence, the transcriptome, and the protein sequences. This is data that, although it is evolving and more is being added, is probably not changing in its exact nature at the same rate as the data you might find in a different kind of data resource. Now, when we put this out for public comment, one of the issues some people raised was that although the core data aspect is important, another significant issue has to do with the level of curation, particularly human curation, in different data resources. For databases as we're construing them, although some curation is necessary, it mostly has to do with quality assurance and quality control; it's not adding additional information or different levels of knowledge on top of the core data. That is the job of the knowledge bases. What a knowledge base does is accumulate, organize, and link growing bodies of information related to one or more core data sets. Again, examples from the world most of you inhabit would be things like expression patterns, splicing variants, localization, interaction networks, and pathways from one or more organisms.
Frequently, knowledge bases also contain a lot of publication information: information gleaned from the literature and then incorporated into the framework in order, hopefully, to make it more convenient or easier to make discoveries. Again, the community made clear to us that the distinction they thought was important really had to do with the level of human curation, and generally knowledge bases require a great deal of curation, typically human curation. So this is the distinction we're making. ELIXIR, which I think many of you are familiar with, a consortium of twenty-some European countries wrestling with these same issues, has made a similar distinction; they actually commented on the strategic plan and applauded us for making it. They don't use the same words we do, database and knowledge base, but conceptually this is very much along the lines of what ELIXIR is also thinking about and working toward in its organization.

Okay, so that's the first implementation tactic. In addition, underlying this strategic objective, we want to use appropriate mechanisms, review, and management to support and evaluate each of these different kinds of repositories, databases and knowledge bases; I'll tell you the importance of that in a minute. We also want to dynamically measure data use, utility, and modification. A really important issue is: what data are we going to store, how long are we going to store it, when are we going to move it to some sort of cold storage, and what are we perhaps going to jettison altogether because it's no longer useful? We can't store everything forever; that's been made abundantly clear by experiences like dbGaP, for example. We want to ensure privacy and security for information that is personally identifiable and sensitive; that's critically important. And we want to create unified, efficient, and secure authorization and access for these kinds of sensitive data. We can't have every system holding sensitive personal data using different protocols to get into it; that's completely inefficient. We need a trans-NIH solution: a single point of entry, a single way of authorizing and determining what level of access different users have to different data sets. It's actually one of the things that's already in the implementation phase.
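To illustrate the single-point-of-entry idea, here is a toy sketch in which one central authority issues a signed token listing a researcher's approved data sets, and every repository verifies that same token rather than running its own bespoke protocol. A production system would use vetted standards such as OAuth2/JWT; this stdlib-only version just shows the shape:

```python
import hmac, hashlib, json

CENTRAL_KEY = b"shared-secret-held-by-the-central-authority"  # illustrative

def issue_token(user, approved_datasets):
    """Central authority bundles approvals and signs them once."""
    payload = json.dumps({"user": user, "datasets": approved_datasets})
    sig = hmac.new(CENTRAL_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return payload, sig

def repository_allows(payload, sig, dataset):
    """Any repository runs the same check instead of its own protocol."""
    expected = hmac.new(CENTRAL_KEY, payload.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # token was tampered with or forged
    return dataset in json.loads(payload)["datasets"]

payload, sig = issue_token("researcher@university.edu", ["dbGaP:phs000001"])
print(repository_allows(payload, sig, "dbGaP:phs000001"))  # True
print(repository_allows(payload, sig, "dbGaP:phs999999"))  # False
```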
We also want to employ explicit evaluation, life cycle, sustainability, and sunsetting expectations for the resources we fund. This is not to say that every resource is going to be sunsetted in some near-term period of time, but this needs to be an explicit part of the discussion during evaluation, funding, and so on, because again, we can't fund all data for all time, and we have to make decisions about what's important and when it's no longer as important.

Okay, the second objective is to support the storage and sharing of individual data sets. Much of the discussion when we talk about data sharing tends to revolve around these large, what are frequently called high-value, data sets, generally made by consortia. That doesn't mean other data sets aren't high value, but hopefully the ones we're storing from GTEx and so on are. The problem, of course, is that most of the data being generated by the research community is not from these large consortia; it's from individual labs doing their individual work. So we really have to find a mechanism so that the data from that work, from the individual labs, that is again worth sharing, can be shared. And again, this needs to be FAIR: findable, accessible, interoperable, reusable. So how are we going to tackle that? Initially, and NCBI is already working on this, NCBI is going to make it possible to link data sets to published papers in PubMed Central: you publish a paper, and you can simply attach the data sets to it, so that people can find them and use them (a sketch of what such a link might carry follows below). Eventually that will evolve into something more FAIR, that is, you can find the data easily, they're in some kind of standardized format, and hopefully we move toward interoperability. Longer term, the goal is to expand the Data Commons, once it leaves the pilot phase and assuming it's successful, to allow people to submit their individual data sets into this cloud commons environment, again in a FAIR-compatible format. Again, we do have to think about what the rules of entry will be: can we put anything up there, or is there going to be some kind of standardization of what goes up, how long it can stay there, and how we evaluate whether it should be there or not?
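Here is a sketch of the kind of link record the PubMed Central data-attachment idea implies, tying a paper to a persistent, citable data-set identifier. The structure and field names are hypothetical, not NCBI's actual schema:

```python
# Hypothetical record linking a paper in PubMed Central to its data set.
def make_data_link(pmcid, dataset_doi, description):
    """Bundle a paper's PMC ID with a persistent data-set identifier so the
    data are findable from the publication and citable on their own."""
    return {
        "paper": {"pmcid": pmcid},
        "dataset": {
            "doi": dataset_doi,  # persistent, citable identifier
            "description": description,
        },
    }

link = make_data_link(
    pmcid="PMC9999999",  # placeholder article ID
    dataset_doi="doi:10.9999/example.dataset.001",
    description="Raw counts and processed matrices underlying Figure 2",
)
print(link["dataset"]["doi"])
```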
The third strategic objective under the second goal is to leverage ongoing initiatives to better integrate clinical and observational data into the ecosystem. There are several tactics under that. One is creating efficient linkages among the NIH data resources that contain this kind of information; again, they shouldn't be silos, and you should be able to use them together to synergize and optimize their utility. Again, we want to develop and implement universal credentialing protocols and authorization systems, which I mentioned. And we want to promote the use of the NIH common data elements repositories; this is something NLM is leading, common data elements for clinical-type data (see the sketch after this section). If everyone is using different data elements, there are no standards, the utility is obviously very low, and getting to a FAIR-principles-driven environment is going to be extremely difficult. So that's something we want to emphasize, and whenever we're talking about standards, community-driven standards are critical; I just want to emphasize that in general we want the community to develop them, and therefore to use them. The kinds of things we're talking about here, say All of Us, the eMERGE network from NHGRI, the Cancer Moonshot, TOPMed, ECHO, and so on, are the kinds of resources we need to make sure are linked together, not silos, and as new resources of this kind are launched by NIH, they should already be primed to fit right into this system once we make it interoperable, as we're hoping to do here.
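To show why common data elements matter, here is an illustrative element in the spirit of the NLM common data elements repository: if everyone captures systolic blood pressure the same way, data sets can be pooled. The field names and ranges are invented:

```python
# An illustrative common data element (CDE); fields are made up.
systolic_bp_cde = {
    "name": "systolic_blood_pressure",
    "definition": "Peak arterial pressure during ventricular contraction",
    "data_type": "integer",
    "units": "mmHg",
    "permissible_range": (40, 300),
}

def validate(cde, value):
    """Check a captured value against the shared element definition."""
    low, high = cde["permissible_range"]
    return isinstance(value, int) and low <= value <= high

print(validate(systolic_bp_cde, 120))   # True
print(validate(systolic_bp_cde, 1200))  # False: fails range check
```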
Overarching goal number three is to support the development and dissemination of advanced data management, analytics, and visualization tools. The first strategic objective there is to support useful, generalizable, and accessible tools and workflows, and the implementation tactics under it should look somewhat familiar. The first is to separate support for tools from the support for databases and knowledge bases, and I'll get to the importance of this in a second. Again, we want to use appropriate funding mechanisms, review procedures and criteria, and management for tool development, just as we want to for databases and knowledge bases. We want to leverage commercial tools, software, workflows, and expertise; again, we recognized very early on that there are many things in this space where industry is way ahead of NIH, and we just will not catch up and should not try. We should really leverage what they have, and that's an important principle here. And finally, we want to promote the development of open-source, openly shared, and reusable tools, software, and workflows. As much as possible, we want these things to be free to users, open to users, and open source, so that people can actually improve what's there.

I'll get to more details on this in just a second, but let me go through the other two objectives under this goal. The second is to broaden the use of specialized tools. We recognize that not all tools useful for biomedical research originated in biomedical research; an example would be algorithms from astronomy that have been adapted for use in cellular imaging, and there are several examples of that. So we want to support ways for tools, algorithms, and approaches from other fields, physics, engineering, computer science, and so on, to move into biomedicine as fast as they can. We also want to support research on improving methods for using electronic health records and other clinical data. When we have dozens or even hundreds of different systems at thousands of different institutions, it's very difficult to really mine this essential gold mine of electronic health records, and I think additional research is required to figure out how to do that in an optimal and secure way. Finally, the third objective is to improve discovery and cataloging resources. Again, if we're going to have a FAIR-compliant ecosystem, there need to be standards. Our view is that in general the community, the experts, needs to develop those standards, but then they need to use them. So the goal is for NIH to serve as a convening body, hopefully with the help of professional societies, and FASEB has already expressed a strong interest in being involved, for developing those standards, and then to find ways to ensure they actually get used, because if standards are developed and not used, they're not useful at all.
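As one picture of what a community-standard catalog entry for a tool might carry, so tools are discoverable and credit is clear, here is a hypothetical entry; none of these fields are an adopted standard, and the tool itself is invented:

```python
# A hypothetical catalog entry for a shared analysis tool.
tool_entry = {
    "name": "example-aligner",  # invented tool
    "version": "2.1.0",
    "description": "Short-read aligner adapted from an image-registration "
                   "algorithm originally developed outside biomedicine",
    "license": "MIT",  # open source, reusable
    "source_repository": "https://example.org/example-aligner",  # placeholder
    "inputs": ["FASTQ"],
    "outputs": ["BAM"],
    "citation_doi": "doi:10.9999/example.tool.001",  # credit for developers
}

print(tool_entry["name"], tool_entry["version"], tool_entry["license"])
```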
Okay, so let me tackle this, because I think it's something that will be of interest to this council: why do we think we should separate the support and evaluation of databases, knowledge bases, and tool development? It's come up twice, in both strategic objective 2-1 and strategic objective 3-1, and there are a number of reasons we think it's going to be of great benefit to the overall research community.

The first is an overarching one that doesn't have to do specifically with any one of those domains. Historically, NIH has funded data resources, and evaluated them during peer review, as research grants. This made sense, because all of this dates back to the 1980s and 1990s, and literally the system has not changed dramatically since that time in terms of how we support these things. In those days, these really were research projects: you start off with a book, which is the fly genome, and you have to figure out how it's going to become digital; that was a major research endeavor. But we just kept using those mechanisms, and once these things became much more hardened resources, NIH did not shift away from that paradigm. That had a number of implications. First, there was a misalignment of the goals and the review expectations: if we were funding them as research, they needed to have innovation, they needed to have hypothesis generation or testing, and so on, and that was the focus of the review, rather than what we really should have had, which is a focus on user service, utility, efficiency of operations, usage of the database, and so on. So what we want to do is shift away from treating these as research projects, which in general the hardened ones, at least, are not, and think instead about how useful they are to the community, and how much value the taxpayers and the research community are getting for their money.

The other thing this has led to is an entanglement of tool development with resource management, because if you want to meet the innovation criteria when your resource grant is evaluated as a research grant, you generally put in development of new tools; that's how it's going to look innovative and research-aligned. What we've heard from many reviewers, as well as from a variety of councils, is that this has in some cases made panels reluctant to say that the tools are really not of the highest quality, or that maybe the people running the databases or knowledge bases aren't the best people to be making the tools, because the panels don't want to lose the core resource itself. You can imagine: if you have a core genome resource for a particular organism or system, the panel is very reluctant to give a score reflecting the fact that they don't think the tools piece is at the cutting edge, because they don't want to risk losing the core information. That's completely understandable; our council and other councils felt the same way. That led us to the realization that we really need to separate these things, and that tool development should be evaluated and funded on its own merits, separate from the importance of, say, the core data itself or the knowledge base piece itself.

The other issue is that database and knowledge base functions, needs, and uses are not the same, and that's another reason we think they need to be evaluated and funded separately, using distinct mechanisms that can be targeted to their specific goals and needs. Core data, for example, genome information or protein sequence information, can be absolutely essential to the research community, but it's a separate question whether the knowledge base part, the additional pieces, the publication information, the other experimental information put on top of the core data, is as well. In some cases it certainly is; in other cases it might not be. This is another area where we got feedback from study section members and councils: they felt reviewers were frequently reluctant to say, hmm, not all of this knowledge base, this curation that's going on, is as useful as it could be, because they didn't want to jeopardize access to the core genome or protein sequence information. Again, we need to separate those things so we can evaluate each on its own merits and not feel that somehow the core genome information is being held hostage to the other activities going on around it.

Another issue, as I alluded to before, is that the cost of human curation is extremely high. In many of these data resources, 40% or so of the total costs are for human curation, so a lot of money is going into that activity. Again, it goes back decades; in some cases it's still necessary and still needs to be done that way, and in other cases it probably doesn't.
Just think about what science was like in 1990, say. Getting information about the key papers in a field was not entirely trivial: you had to go to the library and look in those big green books, remember that whole thing? Now we have Google, now we have PubMed, and it's a very different world. I think we really need to critically ask ourselves what's worth the money and what's just a historical artifact. I know I'm being pretty aggressive and up front, but I think these are issues we really need to have on the table and not just dance around.

And finally, we have a strong need for usage, utility, impact, and efficiency metrics for these kinds of resources. How much are they being used? How much impact are they having on the community? How big is the community they're serving? For the data within them, how much is each different data set being used, so that decisions can be made about what to keep and what not to keep? We don't really have good ways of measuring this yet, at least not in a coherent way, and figuring out how to do that is a key part of what we need to be doing.

I do want to clarify one thing that's been a source of confusion. Although we're talking about splitting up the evaluation, the funding decisions, and the organization of database, knowledge base, and tool support, that doesn't mean the same group couldn't do all three of those things. It may be that the database part is essential, the knowledge base piece is really important for the community, and they are also the right people to be developing the analytical or visualization tools used with the data. That could be entirely true, but we need to evaluate each of those things separately to ensure it's true, and to ensure they're being done as efficiently as possible.

Now, I did come up with a little graphic to try to make one further point that builds on this, another really overarching issue, and it gets back again to focusing on what's going to be best, most useful, and most efficient for the research community. This is the artist's conception, the artist being me, and Brent's laughing; he's a hundred times better artist than I am. This is our conception of what the current data resource ecosystem, or a little piece of it, looks like. The first thing you notice is that it's highly siloed. Just to push and be a little more up front here, you could think of this as part of the model organism database ecosystem: each organism is a silo, more or less, right now. In addition, as I said, there has been a conflation, an entanglement, of the different functions.
The database piece, the hexagons up here, is part of each of these data resources; the knowledge base piece is there too, that's the cylinders; and incorporated into those is also the tool development that's going on. These are now inextricably linked to each other in ways that make it very difficult, again, for us to assess their utility and efficiency.

What might be, and to us is, more appealing would be something that looks like this. Again, you can think of it as a reconstruction of the model organism databases. At the center core we have the support and organization of the databases; this may be the genome information, the protein sequences, the transcriptome information. And I've actually arranged it so it's not just a single one: now all of the organisms in this particular space have their genome sequence, for instance, linked together in a completely seamless, interoperable way. So instead of going into each database or knowledge base and pulling out the sequence for flies and worms and Saccharomyces separately, and then going off and doing your alignments or whatever else you're going to do, you could seamlessly go in and see all those things together immediately, along the lines of the sketch below. How much more useful would that be to you as a user than the current construction of the system? Now, I personally think it would be great if that could happen through NCBI; that's kind of a one-stop shop. But it could also be some organization in the outside world that does it. Having it all together, I think, as a user myself, would be a dramatic enhancement of current capabilities.

Then the ring around the outside is the knowledge base piece, and we've honed the knowledge base pieces down to their most important components, so they're not maybe as expansive as they were. Note that they're all linked together, and they're also linking into the genome and protein sequence information in the middle. In terms of linking them together, the thought is: what if the additional information you currently get in a model organism database, this knowledge base piece, were all built on the same platform? It would all be seamless, so that when you looked at, say, expression patterns in one organism, you'd get the same visualization, with the data organized the same way, as in another organism. You could easily use it all together, and the need to learn different platforms and different systems would disappear. That again, I think, would be a major advance and an increase in efficiency and usability for the ecosystem.
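The cross-organism retrieval Jon describes is partly possible through NCBI today. Here is a sketch using Biopython's E-utilities wrapper to pull records for several organisms through one interface rather than visiting each organism's database separately; the accessions are placeholders, so substitute real RefSeq identifiers of interest:

```python
# Fetch sequence records for several organisms through a single interface
# (NCBI E-utilities via Biopython). Accessions below are placeholders.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI asks for a contact address

accessions = ["NM_000001", "NM_000002"]  # placeholder RefSeq IDs

for acc in accessions:
    handle = Entrez.efetch(db="nucleotide", id=acc,
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    # Organism, identifier, and length all come back in one uniform format.
    print(record.id, record.annotations.get("organism"), len(record.seq))
```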
And then finally, on the outside is where the tools live, and our idea here is that this is a sort of tool depot. There are examples of this now, commercial and otherwise, where people can put tools that are freely shareable, and people can even modify them if they're open source. These would be, hopefully, as free and as open as possible. All of these different layers can communicate with the other layers: the tools can communicate directly with the core data or the knowledge bases, and the knowledge bases can communicate directly with the core data, all in a highly interoperable, interconnected way. So again, I just ask you to think: wouldn't this be a lot more useful to you as a user than the current siloed and entangled ecosystem?

Now, I showed this at a meeting where Jim Ostell, the director of NCBI, happened to be, and it turned out it resonated with him, because as he thought about it he realized that this is actually the way NCBI tries to organize itself. He had this picture drawn up, which is obviously much better than mine, and it shows how the core data sits in the middle; around the outside are the various knowledge bases that interconnect with and use data from NCBI currently; and around the outside of those are the tools they have organized for use with these various things. So again, and this is my personal opinion, what we really should be focusing on at some level is expanding this NCBI model so that it's a one-stop shop for many of the things we all want to do. It's a place that has a brand, has expertise, and could, I think, be brought really into the 21st century to make it as useful as possible to everybody.
Okay, very quickly, since I'm running out of time: the last two goals. The fourth is to enhance workforce development for biomedical data science. The three strategic objectives are, first, to enhance the NIH workforce; that's internal. How do we get program staff, intramural scientists, and so on who have cutting-edge knowledge and skills in the data science arena? Two ideas here. One is to increase the actual training; that's something NCBI has already started, and I think we could expand there. Another thing being worked on right now is something called the data fellowships, which would be a national service fellowship where people from industry or academia could come for two or three years, as a sabbatical essentially, or before they start something else, and work on a high-profile, high-impact project, bringing skills that NIH doesn't have readily available and in general can't afford, because someone can go work for one of the major tech companies for three or four times the money we're able to pay. This may be a way to bring people in for national service on something really high-impact, All of Us, the Cancer Moonshot, these kinds of things, bringing knowledge and expertise we simply don't have access to in a straightforward way.

The second objective is to expand the national research workforce. The original statement of one of the tactics was to enhance quantitative and computational training for graduate students and postdocs. One piece of input we got from the community was that we should not limit this to graduate students and postdocs; we should push it back to undergraduates and even earlier, and in the final version of the plan we have expanded it accordingly. For instance, NIGMS has a number of undergraduate programs where this could be leveraged, and we also have a K-through-12 education program, the SEPA program, where again we could begin to leverage this. The third strategic objective is to engage a broader community. This is things like citizen science and working with library scientists. On the citizen science side, an area we're very actively pushing is codeathons, bug bounty programs, contests, trying to bring a much wider array of people into working on data science-related issues for biomedical research. You can imagine: there are high school kids out there who can do things I can't even dream of myself, and if we could bring them into the system and let them go at it for a few days, we really might get some very interesting things happening.

Finally, appropriate policies to promote stewardship and sustainability. We need to develop policies that support this FAIR data ecosystem: data sharing policies, standardization policies, and so on. But a very important point here is that those policies need to be achievable, not just aspirational. We could put a data sharing policy in place that says everyone has to share all their data for all eternity, and we already know from dbGaP that that's not going to work. It's just too expensive, not all the data are useful, and it would take all the money we have, plus a lot more, to do it. So we really need to figure out: what data do we need to share? How do we share them? How do we decide when they go away, and how long do we keep them? These are things we're going to need the community to help us wrestle with.
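A toy version of that retention question: given usage metrics, which data sets might move to cold storage or come up for community review before retirement? The thresholds and numbers are invented for illustration:

```python
# Crude retention policy driven by usage metrics; all values invented.
datasets = {
    "study_A": {"size_tb": 200, "downloads_last_year": 4200},
    "study_B": {"size_tb": 850, "downloads_last_year": 12},
    "study_C": {"size_tb": 3,   "downloads_last_year": 0},
}

def storage_tier(info, cold_threshold=50, retire_threshold=1):
    """Heavily used stays hot; lightly used goes to cheaper cold storage;
    unused becomes a candidate for community review before deletion."""
    downloads = info["downloads_last_year"]
    if downloads >= cold_threshold:
        return "hot"
    if downloads >= retire_threshold:
        return "cold"
    return "review-for-retirement"

for name, info in datasets.items():
    print(name, "->", storage_tier(info))
```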
We also don't want to add unnecessary burden. We could put a policy in place that makes you do all kinds of crazy stuff that adds no value, and makes your institutions pay for all kinds of systems that add no value. Again, we really need to think this through carefully; we don't want to go for the aspirational if it's not achievable and useful. And finally, enhancing stewardship. We want to develop standard use and utility metrics, as I said: how much are certain data sets being used? That will allow us to decide when to get rid of them and when to keep them. We also want to come up with standard review expectations for data resources, knowledge bases, databases, and tool development. And we want to establish sustainability models for data resources. We can't keep every resource going forever, but there may be places where there's enough of a community that they can keep a resource going in some other way. NSF has experimented with models of this that have had some success, and I think we need to think them through in advance and try to help the community move to those more sustainable models when necessary.

So, finally, next steps. The plan was delivered to Congress in May. It was actually discussed at the Senate hearings right at the beginning; Senator Murray talked about it, which was kind of nice. The posting of the final plan is imminent; you should see it in the next few weeks, I hope. The implementation phase has already started; you know about things like the Data Commons, and there's a big project going on to leverage cloud space from the major providers in a way that would be useful not just to NIH but accessible to the community, and that we hope will lower costs and increase access. This is going to ramp up fast. It's a very ambitious plan with a lot of pieces, but it's very important, so we really want to get it going quickly. And finally, performance measures and milestones are going to be key. We need help from you in deciding which ones will be useful and which won't cause perverse things to happen if we implement them; that's an area we're going to be working on very fast during this implementation phase. So thank you; I'm happy to take questions. Carol?

So, can you say a few words about the funding mechanisms you envision for this landscape? Sure. There are lots of different pieces to it.
In terms of the database, knowledge base, and tools pieces, we actually will soon have template funding opportunity announcements that the ICs can use, with distinct features for the database and knowledge base pieces; you should start seeing those on the streets within the next year. What we really need to do is disseminate them across the institutes so everyone starts using these approaches, to make them maximally useful. A tools development announcement will follow fast on the heels of that. So that's the first piece, but there are all sorts of other places where we need to think about the appropriate funding mechanisms. For instance, in the work I mentioned on getting cloud resources from major providers, we have some access to what's called other transaction authority, which is different from grants or contracts, and that's been very useful so far. So we're exploring other ways to interact with the ecosystem that are more nimble and allow us to change direction quickly as technologies or processes evolve.

Just a couple of comments. One is that I think the diagram you put up there is actually an architecture, and it forces people to fit a presupposed solution, which may or may not be the modernized solution. By balkanizing funding in the way that's been articulated, the unintended consequence is that you won't be able to modernize fast enough. ELIXIR does say that they recognize there's a difference between, say, a data deposition archive and a knowledge base, but they don't then force them to be funded or evaluated by separate mechanisms. There's still an overarching set of criteria against which those resources are evaluated, which allows the resources to follow the trends in modern data science and architecture and software engineering and everything else. My fear is that this is going to put us at a disadvantage, because it's painting us into a corner in many ways, by funding what should be an interoperating ecosystem as separate components. That's one thing. The other thing that's difficult for me is the reliance on commercial software alongside the bullet point underneath about being open source; those are two opposites. A lot of commercial software is not open; it's a black box. So even if it is an optimal solution, and I'm not sure how you actually measure that, it goes against the grain of being open source, and I think that's a problem for a research community. And the final thing I'll say is that it costs a lot of money to generate data, tons and tons of money. The cost of curation is always brought up as this giant problem, but no one ever talks about the cost of actually generating the data, and the return on investment for manual curation of some of this knowledge, which makes it accessible to the development of new computational approaches, is quite high. It's as if everyone wants to pay to have the data generated but not for the stewardship, and I think stewardship costs should be factored into any large-scale data production enterprise. I know that covered a wide range, but this is a really fundamental, key issue.
It's an exciting area, and it's great that there's this focus on it right now, but I'm really concerned about some of the directions this is going, in terms of the long-term consequences for data science.

I think at some level the answer to all your questions is that it's going to be a balance. There's no one cookie-cutter solution to everything, and we certainly recognize, for instance with the database and knowledge base split, that there is a gray area and that it's going to evolve. Something can start off as a more dynamic knowledge base, and eventually those components harden and it becomes a database. There are things in the middle. Again, I want to emphasize that the way this is organized, you can still have a data resource that's all three of those things; it's just that the individual components need to be looked at separately, to make sure each one is doing its job. Now, how do you define in some cases what's the data part and what's the knowledge base part? That's going to be a nuance, and it's going to be up to reviewers, at some level, to decide.

The pushback I would give you is this: you're saying this is more hardened than the current system, but the current system is itself a completely historical one that goes back to the 1980s and 1990s and has been completely hard-baked into place, and it's created all of the inefficiencies I described. The goal here is to break out of that. I'm sure we'll have to make tweaks along the way, but if we just keep doing what we're doing, we'll never get any better, and we'll continue to live in the same inefficient and not maximally useful system we currently live in.

I would say there's also been a ton of innovation. I completely agree that it always makes sense to look at how these things are funded and to think out of the box about better ways to do it, and I think some of the funding criteria you put up are amazing; the community really is going to embrace that concept very openly. But if you look at what's come out of the way things have been funded, you mentioned a lot of inefficiencies, but the whole community can also point to many, many times when it's been transformational: things like sharing knowledge through the Gene Ontology Consortium, for example, which completely transformed how genomics data were analyzed and how information about genome biology was shared across the community. That never would have happened, necessarily, under this particular funding mechanism, so I guess I don't see that point.

I agree with you, there are absolutely transformative things; I just don't see why transformative things aren't completely possible, and maybe more possible, under this conception. But to emphasize one thing about curation: you're absolutely right that some curation is incredibly high value. I just think it needs to be assessed on its merits, and not linked to things where the community says, I don't want to say anything about this, because if we lose that we're doomed.
Thanks for that presentation. My first question comes from my experience, 15 years of experience, developing tools and analyzing data, where I often find that when two different groups, one deciding what the formats are and one deciding how to analyze the data, are kept separate, it often happens that the formats that are selected aren't optimal for analysis. What we often end up doing is just skipping around those formats and getting the raw data, so all that effort was a waste of time. So I wonder whether you've put any thought into how to improve that, because it has been happening.

Yes, and often not just at NIH; it's a really good point, and certainly we would value your feedback on how to improve it. Our conception, and again your feedback is important here, is that we need to help the community figure out the correct standards. But I think your point may be, first, that the right people aren't always at the table. That's point one. And then maybe, second, that you don't know what the future is going to be: you could come up with a standard right now and then develop a new algorithm that doesn't work with that standard. The second one is really hard to deal with, because it's the nature of this incredibly quickly evolving system. The first one I think is solvable: we just need to make sure the right people are there, that their voices are heard, and that what they need is incorporated and then used.

That's good to hear. My second question relates to this separation of tools and data generation, which in general I think is a good idea. You often have very talented data generators, including people who develop great new technologies and new laboratory techniques; they're not necessarily the best tool developers, because you can't be the best at everything. Then you have the best tool developers over here. But the data generators need the paper, so they need to analyze the data; they'll figure out how to do it, they'll hire a postdoc and make them become the data analyst. The problem, then, is that if you're going to say let the data generators just generate the data and let other tool developers develop the tools, has there been any thought about how to reward the data generators, if they're not getting the papers?

Absolutely. That's one of the things we talk about in the plan: ways to give credit, for instance, for data sets. Software is another one: you can make an incredibly useful software tool, and sometimes it's published, which is great, but how do you get credit for something you made that may not be published? Having DOIs, having ways for those things to count on a CV or a biosketch, that's part of what we're going to have to figure out how to do. Aviv, do you want to jump in now?
Just on this specific area: this is actually something that in the Human Cell Atlas we've discussed at great length, multiple times, because the plan for our data is to be open. One of the critical points that was raised, and that we've actually been discussing with journals recently, is not just to have DOIs for data but also to have journals leave room for citing each data set later on. Journals still largely limit the number of citations, and you can imagine that if you move to integrative analyses and use a lot of data sources, each with its own DOI, you could end up with no ability to cite them all, even with the best of intentions. So I think this is something where NIH can actually play an important role in making sure that results are acknowledged and that the means to cite them are provided.

Yes, I think that's a good point. Just to emphasize, I agree with Aviv, and it would be great if NIH could come up with standards the journals could follow. There's a model for this: NIH, through Larry Tabak, convened a group of journal editors, you may have been there, Mark, to talk about reproducibility, and they came up with a set of principles and standards for what they were going to do, for instance no limits on methods sections. We could do something similar; it's a really good idea.

Can you explain a little more the role of NCBI and NLM in this? It seems to me they have to have a lead role. I agree one hundred percent, and my personal feeling is that they should be absolutely at the center, NCBI especially. I think this is an opportunity to take the great stuff that's there and really leapfrog it into something that's going to point the way for the rest of the century, the way it did starting around 1980. So I completely agree.

You said at one point that we can't store everything forever, so I'm just curious: obviously there are costs that accumulate, but is it the actual storage costs that are currently causing, or that you anticipate as, the issue, or is it the curation costs? Taking dbGaP as an example, you referenced it in the context of containing too much data, but the two issues seem to be different there. Both those things are true. With dbGaP it has been the storage; the BAM files are just too much, and we had to make some choices. Of course storage costs go down, but the trouble is that the size of the data keeps going up, and it has outpaced the decrease in storage costs; that has been a problem. And then you also come up against the question of at what point the curves of the cost of generating, or regenerating, data and the cost of storage cross each other. There are going to be places where it's actually cheaper, when someone needs some data, to regenerate them than to store them for long periods of time, and that's another thing we need to wrestle with in this whole question.
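A back-of-the-envelope version of that crossover: cumulative storage cost under a falling per-terabyte price versus a one-time regeneration cost. All numbers are invented; only the shape of the comparison matters, and, as raised just below, some data, like those from irreplaceable samples, cannot be regenerated at any price:

```python
# Toy storage-versus-regeneration comparison; every number is illustrative.
def cumulative_storage_cost(size_tb, years, cost_per_tb_year=25.0,
                            annual_price_decline=0.15):
    """Total cost of keeping `size_tb` online for `years`, assuming the
    per-TB price falls by a fixed fraction each year."""
    total, price = 0.0, cost_per_tb_year
    for _ in range(years):
        total += size_tb * price
        price *= 1.0 - annual_price_decline
    return total

size_tb = 500                  # hypothetical BAM-file collection
regeneration_cost = 60_000.0   # hypothetical cost to re-sequence on demand

for years in (1, 3, 5, 10, 20):
    keep = cumulative_storage_cost(size_tb, years)
    print(f"{years:2d} yr: store ${keep:,.0f} "
          f"vs regenerate ${regeneration_cost:,.0f}")
```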
I'd be curious, it'd be interesting to just look at a simple matrix: how many entries are there in GEO, and what do those collectively add up to, versus how many entries are there in dbGaP and what do those collectively add up to? Something like GEO is the kind of resource that, at its current rate, could just keep doing what it's doing forever, and it's really just these massive human data sets that are causing the issue, where we have to grapple with doing something more sophisticated. Absolutely, but of course we don't know what other kinds of data sets are coming online. Cryo-EM is something we're particularly interested in; it generates enormous quantities of data. What of that do you store, and for how long? All the same questions. But my point is that 95% of the community will generate data sets that are entirely manageable, and it's really just this restricted subset that is creating the headaches. Well, I don't mean it in quite that way, but accept that the total data the community is generating is very large, even if the individual pieces are small. And so that's the point I talked about earlier: how do we store that in a way that's accessible, and how do we decide what to store? You don't want all the kinetics gels from my lab, probably. Maybe you do; I'll give them to you. But those are the questions when we talk about data sharing and storage: what do you want, and what do we even mean by data?

So we're going to go Jonathan, and then Steven, and then Dan Roden on the phone. Okay, so just following up on that a little bit: you made a comment that regenerating the data may be cheaper. That's provided you can; there are certain resources that may be very limited, where you can't do that, so that may factor into how you decide. I also wanted to go back a little and second one of the things Carol said when she was talking about commercial and open-source software. I think it's really important that we not overemphasize the commercial software, because it is often a black box, and having open source, I think, is a much better way of doing things if we can possibly do it.

I agree there has to be a balance, but can I just add a point to that? I think we agree, although there are places where the tech industry is so far ahead of us that they really do have the best thing, and you have to decide what you want. An additional question, though, is what happens when an algorithm, say, is developed in academia. What we've heard from a lot of sources, and I hope I don't offend anyone now, is that in general the level of the code in an academic program is not up to industry standards. We heard an example from Eric Dishman: when he was at Intel, he worked with a very prominent academic center on their code, and because his team understood the architecture of the Intel chips, they were able to increase the speed of the program a thousandfold by rewriting it, relative to what had been written there. So what are we going to do to help with this problem?
There's this idea of finding ways to allow what we call systems integrators or systems engineers to work with academic groups on particularly high-value algorithms or prototype programs, to bring them up to industry standards of speed, efficiency, and utility. That would be a way where we're paying an industry-standard group to do things at the industry level, while still releasing the result as open source.

Yeah, and I agree with that, and I think one of the things is to hold academics to a higher standard of coding, because a lot of that stuff really isn't very good. It's glued together, and we do this all the time, because it's for our own purposes. So I think holding the academic folks to a higher standard would be something. Then just one more point, amplifying something Raphael said. I get the idea of separating the databases and the knowledge bases and the tools, but what's the incentive now for the academic community to do the databases, if they don't have the other stuff? If the only thing they're doing is putting the data together, it's not a very academically satisfying kind of thing. So what's the incentive for the academic community to want to continue to do the databases?

Well, I guess one question back to you is: what's the incentive for the academic community to do any user-service resource? And hopefully there is some, which is some sense that this is an important function. Going back to what I said earlier, though, I do think that for some of these core things NCBI really makes a lot of sense to me; they're just set up to provide resources and access to the community. So maybe that's where we should move some of those core functions, and then let the higher-level things, where knowledge is being accumulated and hopefully refined, happen in the academic community. That's just an idea. Okay, Steve, and then Dan Roden, and then Trey.

Yeah, just a quick question. You mentioned briefly during your presentation looking at some other fields, for example astronomy, and I was just curious how you're thinking about that, because it certainly seems to me that they face a lot of the same questions: there's certainly a tremendous amount of data being generated there, and it is replacing older data all the time. Another area is of course something like NOAA, if you start to look at weather patterns and that sort of thing. So how are you thinking about that, and what kinds of lessons are you drawing?

That's a great point. One of the ways we're thinking about it, and this is very explicit in the plan, is to have as much collaboration and coordination with other federal agencies and international agencies as possible: somewhere like NSF, which has data more similar to ours, at least in some spheres, but also NOAA, the Department of Energy, and so on, to learn their lessons, to help them learn from what we're doing, and to leverage one another's resources, because if we're building unnecessarily redundant things, that doesn't make sense either. As part of the plan we did look at some of those case studies and think about them, but we need to continue doing that as we implement as well.
Yeah, it seems to me that in biology the definition of what constitutes raw data, versus how it gets mixed up with biological data, is somewhat confusing. Yes. Maybe it's much more straightforward if you take a CCD image of a region of space. I'd be curious whether you could decompose things so that you actually did have pure data, rather than mixing it up with the biology.

Dan Roden is on the phone. Can you hear me? That was a very helpful overview. Can you give us some sense of your thinking, or the group's thinking, on what the mechanism for funding this very large effort within NIH might be? That is a key question, and I think there are two pieces to it. One piece is that the institutes all have funding in this space. Can we, through these kinds of corporate endeavors, make the system more efficient so that they get more for their money by contributing to some of the collective efforts? That's one part, and hopefully it will encourage them to contribute more, because they see the value. And I think the siloing has been a piece of what's missing.

I will say that we have continual discussions in this council about who should fund FlyBase, or this, that, or the other database, where we feel we contribute disproportionately compared to other users in the community. So this is an important question for this council in particular. And Eric and I have talked many, many hours about that exact point, and the Scientific Data Council has actually talked about it quite a lot. One of the reasons we came to that knowledge base, database, and tools development separation was because of that particular space. We came to the conclusion that we need a corporate solution to that problem, and your problem is a historical one: you started funding these things back when they were part and parcel of bringing the Human Genome Project along. Now we're several decades later, and does that still make sense? To get a corporate solution, we really have to convince everybody that we have a much more efficient way of doing things, and one that's going to be much better for all of their grantees. That's where much of this came from.

And to emphasize, I know you've implied it, but to emphasize it: we all concluded, with you being a clear champion of it, and with Congress having asked for it, that the strategic plan needed to be the first thing done. You couldn't have conversations about broader corporate funding models and plans without the blueprint. So the strategic plan is the blueprint; it's just now coming public, and the next big thing we're going to be doing is thinking about how this is going to be funded corporately. I mean, let's just put things on the table here.
I think the likelihood of NHGRI being able to go around to the ICs and say, would you fund FlyBase, WormBase, each of these things individually, and contribute to their funding over the long term, is very low. You might get a few years here and there, but you're never going to get a long-term solution that way. The way to get a long-term solution is to create a unified data resource out of these things, in a way that's more efficient, more clearly useful, that I think leverages NCBI, and that is therefore a corporate tool many different communities will use. That's something I think the ICs would be willing to contribute to in the long term. But a piecemeal, one-off, give-us-a-little-money-here-and-there approach for each of these silos is not going to work. And again, that's going to be radical; it's not going to happen overnight; it's going to take a few years at least. Trey? Okay. Any last comments? Anyone on the phone? Oh, Carol.

So, NLM hasn't been mentioned in your overview. Well, NCBI I talked about a lot. Right, but NLM as an overarching component is critically important in many of these places. Think about the electronic health records work: both in their intramural program and extramurally, they're one of the leaders in thinking about how to utilize EHRs, which are going to be transformative once we figure out exactly how to do it. And bringing the information scientists, the library scientists, into the system is something many institutions are wrestling with. I know when I was at Hopkins, figuring out how to reconfigure the library, the medical library, was key, and the nascent issues we've talked about here were all part of that discussion. And the other key component, and Jon, I don't think you were quite in the room for my director's report when I mentioned it, but I showed the advertisement for the recently launched search for the NIH chief data strategist. Reporting directly to the NIH director, this person will play a key role in the Scientific Data Council and a key role in coordination across the institutes, with NLM, and all of that. So that's another key piece.

The only last thing I'll say is that your emphasis on workforce development is, I think, really very important. We haven't had many comments about that, but it is essential. And the citizen science component of it is pretty innovative and new, and hopefully there will be funds for that as well; there are places to partner with other organizations there, too. Thank you very much.

Okay, well, thank you, Jon. We knew this was going to be an important discussion, and we appreciate you coming out here and talking to us. So thank you. Okay, you've earned lunch, but first we need to take our family photo; it's council photo time. Weather permitting, I think we're going to do it outside. But let's be back, oh, sorry, Dan, let's be back at 1:15 to resume with the open session concept clearance.