Okay, before NHGRI can issue a contract solicitation for a research and development contract, it's required to have the underlying concept approved by an advisory group in an open, public meeting. And we've always used the advisory council for all FOAs that we issue, so that you'll know the full range and scope of things that are being planned by NHGRI. Now, we heard earlier this morning a presentation about the NIH Data Commons, and that was a segue or introduction to the concept that Valentina Di Francesco is going to present to the council. There will be ample time for discussion with the council, but at some point when that discussion is concluded, I will be taking a vote asking the council to approve or disapprove the concept. So, Valentina.

Good afternoon, I am Valentina Di Francesco, a Program Director in the Computational Genomics and Data Science Program at NHGRI, and I'm speaking on behalf of the NHGRI Sandbox Group. This is a group of program directors and program analysts that has been working together for several months now to develop this concept. They're all listed here, but a special mention goes to Kevin Lee, Chris Wellington, and Ken Wiley. Ken Wiley is going to help me answer some of the questions today. The outline of the presentation is the following. I'm going to touch on some of the current challenges caused by the genomic data avalanche that are associated with genomic data sharing and analysis. I will describe the features of the proposed sandbox. I will touch on some of the available tools for interoperability that can be used across different emerging data commons. I will briefly describe the types of users we expect for this resource, and finally describe the funding mechanism.

So, the challenges of genomic data sharing and analysis. In an era of increasing sequencing capacity and decreasing cost, the bottlenecks are becoming more and more linked to data management and data analysis. For that reason, researchers need scalable, high-performance storage and computing infrastructure and technical expertise. The problem is that the technical expertise and the scalable computing infrastructure are not readily available at every research institution. Some research institutions have all of that; others really need a lot of help from that point of view. The other challenge has to do with distributing big data over the internet. People have been complaining about the inability to transfer terabytes and petabytes of data over the internet; they use hard disks, tapes, and so on instead. The distribution is also very inefficient, because in some cases the same data sets are distributed to multiple local systems or even copied multiple times onto the same cloud resources. The other issue is that, for example, NHGRI is funding a number of sequencing studies whose data reside in different local systems. There is little to no attempt at consolidating and harmonizing the phenotypic information that is being made available by the studies to the rest of the scientific community, and so phenotypic data integration necessitates a bit of coordination. And finally, and this is quite recent news, NCBI has recently communicated to NHGRI, as well as to other institutes, that it is reducing its data archival role for the NIH.
Okay, so this clearly has a big impact on what we're going to do in terms of providing data generated by our studies to the rest of the scientific community. So the solution proposed today is this concept of an NHGRI data sandbox. For us, this is a resource meant to democratize genomic data access, sharing, and computing. It leverages a scalable cloud-based infrastructure to collocate data storage and computing capacity. It provides commonly used tools for analyzing and sharing data. It provides both unrestricted and controlled-access data and metadata from NHGRI-funded programs. And it is a trusted partner of dbGaP. With respect to genomic data, what I want to emphasize here is that I don't mean just genome sequence data, but also data coming from genomic studies; "genomics" should really be used in a very broad sense. The other thing we would like the sandbox to do is provide data harmonization services across the NHGRI-funded programs. Also, as was pointed out earlier today, not all tools are readily available to run on a cloud platform with high efficiency, so we expect personnel in the sandbox to keep developing and optimizing tools for the cloud platform that they will use. And finally, it will facilitate interoperability with other data resources and data commons.

You've seen this slide earlier today from Vivian, so I am not going to go through it again. But you can generally imagine that this resource will have an access portal and user interfaces. It will implement security measures to make sure that controlled-access data is kept private. It will implement the FAIR guiding principles, and where standards do not yet exist, it will help contribute to their development, and so on and so forth. So I'm not going to go into the details of this slide. One thing I do want to emphasize is that in 2015, NIH issued a notice that allows the use of cloud computing services for storage and analysis of controlled-access data subject to the NIH Genomic Data Sharing Policy. That is what allows us to propose this concept, which relies completely on a cloud platform for sharing data with the rest of the world. The other thing is that, in order to do that, we are going to require the cloud providers, and the group that is going to support the platform, to utilize a cloud platform that is FISMA Moderate and HIPAA-compliant.

So you heard from Vivian earlier today about this emerging concept of a federated data model. The one on your right is the NHGRI data commons, or sandbox. We have been talking about what types of data sets we may want to use to initially populate the sandbox, and we came up with a few. These are just proposals; they are not written in stone. And of course, the data sets will grow and change over time. But for the moment, we thought that ENCODE data, data from the eMERGE network, data from the 1000 Genomes Project, and data from the Genome Sequencing Program would be good data sets to initially populate the sandbox. At the same time, as you heard, there are multiple NIH data commons emerging: the NCI Genomic Data Commons; one to support Human Microbiome Project (HMP) data; and other data commons to support TOPMed data from NHLBI programs, as well as the PMI efforts.
So the issue that always comes up, and we heard it earlier today, is how we are going to make sure that this data is not siloed, and that there is actually support for users to go across different data resources and deploy methods across different cloud systems, or across different data commons in general. These are the tools that we think will alleviate some of the concerns with respect to interoperability: the establishment of a common user authentication system; the adoption of common APIs for data access and computing, which could be the GA4GH APIs or others that may emerge; the adoption of FAIR principles; the broad use of Docker containers to deploy analysis tools on different resources; catalogs of digital objects, so that people can easily find the data where it resides; and the adoption of common data standards and ontologies. I'm listing all of these because I want to emphasize that these tools and technologies are either already built or already being worked on. Some of the BD2K programs are really pushing the development of some of these tools. So I think the real issue here mostly has to do with social engineering, making sure that the right group of people are around the table to make decisions about common adoption of some of these solutions.

In terms of users of the sandbox, we expect two different types of users, primarily. One is a computational genomics scientist who is familiar with cloud or high-performance computing, has developed a tool, and wants to test it out or utilize it on a high-performance computing platform. The other is a researcher who does not have this level of expertise and may want to use graphical user interfaces, or interfaces to analysis workflows, to simplify their analysis needs. We don't really expect genome sequencing centers to utilize these resources, because they have their local systems and will probably not need to come to the sandbox to do their own analysis. With respect to what types of users will come to us, there are two that we think will come to the sandbox. One is an individual user, a user who is not part of an NHGRI-funded consortium. They may want to upload and compute on the sandbox data as well as on their own data: they will be able to upload their data, combine it with data that already resides in the sandbox, and do the analysis they need to do. The other feature that is very important is that they will not need to download the data from the sandbox; they can do all the analysis where the data resides, and by doing that we minimize potential data incidents. They may also want access to commonly used tools and workflows. As I said, one of the activities of the sandbox is going to be optimizing tools and workflows so they run smoothly on a cloud computing platform. So users will be able to use those tools and also have a workspace for their own new tool development and sharing.
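To make the point about common data-access APIs concrete, here is a minimal sketch of a client for the GA4GH htsget protocol, one of the API families a sandbox like this could adopt. The server URL and read-set ID are hypothetical, and a real controlled-access deployment would also require an authentication token on both requests; this is an illustration, not anything specified in the concept.

```python
# Minimal sketch of a GA4GH htsget client. The endpoint and ID below are
# hypothetical placeholders; controlled-access servers would also expect
# an OAuth2 bearer token on each request.
import requests

HTSGET_SERVER = "https://htsget.example.org"  # hypothetical endpoint
READS_ID = "demo-NA12878"                     # hypothetical read-set ID

def fetch_region(reads_id, chrom, start, end, out_path):
    """Request one genomic region and write the returned BAM blocks to disk."""
    ticket = requests.get(
        f"{HTSGET_SERVER}/reads/{reads_id}",
        params={"format": "BAM", "referenceName": chrom,
                "start": start, "end": end},
    )
    ticket.raise_for_status()
    # The ticket is JSON; its "urls" list names blocks whose bodies,
    # concatenated in order, form a valid BAM file for the region.
    # (This sketch assumes https URLs; the spec also allows inline data: URIs.)
    with open(out_path, "wb") as out:
        for block in ticket.json()["htsget"]["urls"]:
            part = requests.get(block["url"], headers=block.get("headers", {}))
            part.raise_for_status()
            out.write(part.content)

fetch_region(READS_ID, "chr1", 100_000, 200_000, "region.bam")
```

Because a client like this pulls only the requested slice of a read set, it fits the theme above: users compute against data where it resides instead of bulk-downloading whole files.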
So if a user wants to have access to the NHGRI sandbox, how would that happen? Well, we're not changing anything, really. The way we propose it will work is the following: the user will be a registered user of eRA Commons. They will have to submit a data access request to dbGaP, using all the data authorization protocols that have been put in place in dbGaP. So if they're asking for NHGRI data, the NHGRI data access committee will review the application and approve it, and then, with an approved DAR, the user will be authorized to use the data that resides on the NHGRI data sandbox. No changes to the current protocol, basically. We're utilizing a system that is already in place, with all the security checks on approvals and consents and so on. And there will be no transfer of data from dbGaP to another system.

OK, so with respect to potential costs to individual users: given what you heard, a user could upload potentially gigabytes or terabytes of data, which is going to cost money, and we will not be able to cover the cost for all their needs. It's not a free resource. So there will be costs associated with data storage, with computing, and with data egress. I have provided some cost estimates just to give an idea of what the expected costs are. I want to emphasize that costs are changing all the time; they have been decreasing quite rapidly lately. The costs will also change depending on the actual tools that are going to be used, on how well those tools are optimized, and so on. So please take these numbers just as a general indication. The current data storage cost is about $350 to $450 per terabyte per year, depending on the speed of access to the data; if you need fast access to data residing on spinning disk, that will be the most expensive end of that range. With respect to computing, imagine a user who wants to run a standard variant-calling pipeline. In that case, the cost I was quoted is about $270 per sample on AWS (Amazon Web Services) On-Demand instances, or, if you use AWS Spot instances, the cost drops significantly, to $30 per sample. Likewise, for an RNA-seq analysis pipeline using the Tuxedo tool suite, the cost would be around $16 per sample on AWS On-Demand, or $2 per sample on a Spot instance. These are just an indication of what the costs could be.
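To put the quoted figures together, here is a back-of-the-envelope sketch for a hypothetical project. The sample count and per-sample data volume are illustrative assumptions, not numbers from the talk, and as noted above the prices themselves shift frequently.

```python
# Back-of-the-envelope cost sketch using the figures quoted in the talk;
# real prices change often and depend on the tools and platform chosen.
STORAGE_PER_TB_YEAR = (350, 450)   # USD/TB/year: slower vs. fast spinning-disk access
WGS_PER_SAMPLE = {"on_demand": 270, "spot": 30}    # variant-calling pipeline, AWS
RNASEQ_PER_SAMPLE = {"on_demand": 16, "spot": 2}   # Tuxedo tool suite, AWS

# Hypothetical project: 100 genomes at ~0.1 TB each, stored for one year
# on fast storage and run through the variant-calling pipeline on Spot.
n_samples, tb_per_sample = 100, 0.1
storage = n_samples * tb_per_sample * STORAGE_PER_TB_YEAR[1]
compute = n_samples * WGS_PER_SAMPLE["spot"]
print(f"storage ${storage:,.0f}/yr + compute ${compute:,.0f} "
      f"(vs ${n_samples * WGS_PER_SAMPLE['on_demand']:,.0f} on demand)")
# -> storage $4,500/yr + compute $3,000 (vs $27,000 on demand)
```

The On-Demand versus Spot gap, $27,000 versus $3,000 for the same hypothetical hundred genomes, is why tool optimization for the platform matters so much to the usability concerns raised later in the discussion.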
The other type of user of the sandbox is potentially a member of an NHGRI consortium. Consortium members could use the sandbox to get access to cloud resources where they can deploy new data sets and workflows. They would have a common infrastructure to share and compute on consortium data, and they could take advantage of interoperability tools and data submission services to other data commons if necessary, including data submission services to dbGaP, in case they accept those data. Members of a consortium could interact with the sandbox in two ways. They could have their own separate data coordinating center, and that center could submit data to the sandbox. Or they could leverage the data management services of the sandbox directly. For all intents and purposes, some aspects and functions of a data coordinating center could easily be taken over by the data sandbox, and in that case, each project could directly submit its own data to the sandbox for storage and sharing with other members of the consortium. A consortium member who wants to access consortium data would then only need an account on the data sandbox. In that case, program officers working with the NHGRI Sandbox will recognize the user as a consortium member and grant authorization to the relevant and appropriate data sets. So user authentication and authorization is going to be done by the Sandbox, in collaboration with NHGRI staff as well. We think this system should greatly simplify access to data generated by a consortium, compared to current methods.

Finally, with respect to the funding mechanism: as Rudy mentioned earlier, we're proposing that this resource be supported through a contract funding mechanism, which is required to establish a trusted partnership with dbGaP. The reason a trusted partnership is necessary is that we want the sandbox to make controlled-access data accessible to the broad scientific community. The contract also allows us to have a clear set of deliverables and milestones that the staff of the resource will have to meet to receive additional funding. They will have to satisfy special reporting requirements, and mostly they will have to work very closely with NHGRI staff. As I said, there will be a lot of interaction over the selection of the data sets that the sandbox should support, over the types of services the sandbox should make available to consortium members and other users, and so on and so forth. The funding period is for seven years. We selected seven years because we wanted to make sure this resource has time to develop and to be tested in terms of its utility to both outside users and consortium members; typically, NHGRI-funded programs are for five years. Having said that, as I said, there will be milestones set for this resource, and if for whatever reason the project is not successful, we will not support it any longer. As for the timeline: because this is a contract, it is going to take a while before we're actually able to issue an RFP, but we think we should be able to make an award in the summer of 2018, and about nine months after that we should be able to have a resource that is open to the public. And with that, I'll take any questions. And Ken too.

Cattle, go ahead. Thanks, Valentina. So I have a couple of questions, actually. The first is: how prescriptive do you envision the RFP being with regard to the data sets that are expected to be in this instance and the tools that are to be available to the researchers? Is that going to be defined up front, or is it going to be the responsibility of the applicant to come up with those ideas? So that's one. The second one is: can you say a little more about the governance model over those seven years? Is it going to be entirely NHGRI staff, or is there going to be an external advisory board as well, sort of monitoring progress?

Yeah. So with respect to data sets, I think we will select some data sets to initially populate the resource, so we will be prescriptive from that point of view. I can tell you we had a lot of internal discussions about what would be a good set to start with. But we really want to make sure that we start with a defined set, so people know how to plan when they're making their budget proposal and so on. With the tools, we have not had any discussion yet about the specific tools, so I can't answer that question yet. The second question was?
Governance body. Governance body. With all our programs, we have always had a scientific advisory board, a group of outside members who will serve as advisors to this. So it's not going to be just NHGRI staff.

So for that advisory board, is it going to be a combination of the contract awardee naming people to the advisory board and NHGRI, or is NHGRI going to? Typically what we do with contracts is that some people are proposed, and NHGRI staff will vet them and approve them or not. In any case, we will make sure that there is no obvious conflict of interest in the management and oversight of this group.

Before we get too much into the details of this, can you just give me an indication of how this is different from and how it relates to the NIH Data Commons? I'm still confused about that. Yeah. So this huge cloud is supposed to give you an idea of a number of different data sets from different projects across the NIH that are supposed to contribute to the NIH cloud. What happens is that NCI has set up their own data commons, which is part of the NIH cloud. NHGRI would like to set up our own data commons, which is also going to be part of the NIH cloud. And the question then remains: how will these groups of data sets interact with each other? What I'm saying is that we are building tools that will facilitate interoperability in the NIH Data Commons. All of these data sets, the PMI, TOPMed, HMP, the NCI data, ENCODE, and GSP, will eventually all contribute data to the NIH Commons.

But then why do you need to have the sandbox populated with your own data? It seems to me what the sandbox sounds like is a workbench with all kinds of tools that you then use to access data that's in the commons. Because we need to facilitate coordination across all of these activities. Right now, the situation is that people are just contributing to dbGaP, and the data is not coordinated. There is no help in trying to have tools that run across multiple data sets and so on. So if you have one place where you have the ENCODE data sets and the GSP data sets, with tools that run on them, I think that is a good service for users. And right now, that is not doable. And I'd like to add: if we look across programs just in our institute, we've set up some programs that have selected their own cloud providers. The idea is that this will provide an avenue not just for us to be interoperable with our own data sets, but also with other sets. And let's face it, we can't have one commons, one cloud provider, holding all our data; that's not feasible. So we really don't have a choice but to try to parse out the elements that we can, and build them in a way that makes them interoperable with each other. I understand that's a challenge, but it's something that the groups representing this model here are working to build in that capacity.

So I have some concerns about the computing costs that you suggested. You were estimating it would be about $270 for aligning a 30X coverage genome, and I guess I have some concern that those kinds of costs might be prohibitive for a lot of users. To contrast that with my own experience doing things like that at my home institution: computing there is heavily subsidized, and so that would be effectively free for us.
And so users are going to be trading off the clear utility and advantages of doing cloud computing against the costs, and I'm wondering what your thoughts on that are. I'm just worried that the cost might greatly hurt the usability of this. I mean, it is a concern, right? But as I said, our intention is really to make sure that users, especially researchers who don't have a local system that can support this type of analysis, have a place where they can actually do it. And the local system isn't really free, either; in a sense, NIH is paying for it through the overhead and other activities of the different institutions. That's true to some extent, but an individual investigator is doing the calculation: is it cheaper for me to do this computation on a server in my lab, or on a system I have locally, as opposed to in the cloud, where I'm paying the compute cost directly? So I can tell you that there are now people developing tools that will give an estimate of the cost of your compute before you actually submit it to a cloud resource. That may help investigators make a decision about what costs they're going to incur and whether they can afford them. But I hear you.

First of all, thanks for a very nice talk, Valentina. A comment and then a question. I think some of what Jonathan is worried about exists because groups like his and mine, and people around this table, and your funded community in general, have not had this resource. So one only wonders, five years or x years forward, whether we would continue to make those decisions to build up our own computational infrastructure. The idea of this, if it works, is that you wouldn't have to have any of that and you wouldn't be making those decisions. But your point is: let's make sure then that we get it right. Back to Mark's question about how this relates to the commons as a whole, I think it is a good question to really understand what the relation is. And my question relates to your perceived allocation or focus. There's the hardware, which can be broken down into computing versus storage. But then it seemed like there was quite a bit of a software construction and service component to your vision, and I was hoping you could clarify some of that. To give some concrete examples: you mentioned foreseeing a bunch of data harmonization. That could be light harmonization, or that could be a major activity. For instance, would you re-call every genome sequence in the sandbox using a common pipeline? That would be very useful, but also a major commitment. Or did you mean other things by harmonization? As another example, you mentioned that simpler tools and web interfaces might be provided by the sandbox for users who are not computationally savvy. I'm wondering how much development effort you foresee going into those tools under this initiative, versus simply providing a place where some of us can put the tools we've already developed or are in the process of developing. A lot of questions, I realize, but they all relate to the relationship at the ground level with the larger project.

So with respect to what types of services the sandbox is expected to provide: realignment to a new version of the human genome, I can easily imagine, is going to be one of the services.
But once again, we're going to have a governance body, including NHGRI people and outside researchers, who are going to tell us whether certain computational exercises need to be done, with the costs associated with them. So the answer is yes to all of the above, with a careful evaluation of priorities and cost. And that's the kind of thing a contract will allow us to do. So what was your next question? Well, I think if it's yes to all of the above, then that gets all of the questions. On the harmonization part: what I was really referring to was mostly harmonizing phenotypic information, standardizing information associated with data sets, and things like that. But again, we are going to decide over time. As I said, we're going to set different milestones, and the harmonization component is probably not going to be the first milestone, as opposed to maybe the fourth or the fifth. We will have to set priorities about what we need to do first and what comes later. And then, back to the relation with the commons: presumably all of those NHGRI-specific activities would somehow live in the framework that the commons is supporting. Is that right? Yeah. Yeah.

So a slightly different question. You talked earlier about milestones and, sort of, judging. Can you give a little more idea of what you're going to consider metrics for success of this program? Yeah. I mean, we have not really decided yet how to come up with the metrics for evaluating this program. And that's the answer. So let me just add one more thing. Some of those are going to be built, as we're thinking this through, off of what we've learned from existing models that are currently going on both within our institute and outside. We have certain programs where we are actually looking at things such as user interactions with the data sets that are in different clouds as one of the metrics. We're also looking at the different data sets in the cloud platforms that we have: we have data sets from different platforms, and how are they being utilized to work together and build collaborations that weren't already there? Those are things we can also look at. So we're not coming from completely in the dark as far as the metrics we would consider for milestones; we're building off of what we're seeing around the landscape as a starting point, and I understand that could potentially change as we move forward. The sandbox is designed to be scalable, both to scale up and to scale down, based off of these metrics that we're going to put in place. So, to help answer your question: it hasn't been finalized, but we are looking around the landscape of existing infrastructures on cloud platforms to guide us in what we should look at for our metrics and milestones.

And I appreciate that it's a very fluid situation and the field is moving extremely quickly, but it makes me a little nervous when your basic answer is, I don't know what those metrics are. So I can tell you some of the metrics that have been used by similar resources in the past. What we didn't do is make a decision about whether or not these are the ones to use. But I can tell you that the initial population of the sandbox with data sets, and making those assets available, is definitely a milestone. Making this resource a trusted partner of dbGaP is another milestone.
I mean, these are key aspects of this resource that need to be achieved. Implementing some of the tools, for example a genome browser that will be able to serve the data to some of the users, and X number of tools, which we still need to define, will be another milestone. So there will be different staging steps that this resource has to meet. And finally, another thing that has been done, which I think we're definitely planning to do, is to provide incentives to outside users to come and use the data and test it out. This has been done by the NCI Cloud Pilots, and they learned a lot through that system. I imagine that a version of that, giving incentives to people to come and test out the resource so that we can improve it, is part of the plans. What we have not finalized yet is what those metrics will be, but... Yeah, and I appreciate it. I mean, I would not expect final metrics at this point; you've got time to think this through, and there are a lot of issues you need to think about. But I would think that one of those metrics would be: if only 10 people are using this three years after it's there, we're probably not where we want to be. And that's where, in some of our programs, we're testing that kind of auditing: how do we define the critical level of interaction with a data set to say it should be kept? We're testing those out. And like Valentina said, we haven't finalized those things, but we are implementing tests of these auditing ideas in different programs, and we are collecting that data to help better assess how we want to move forward.

So can I just clarify, when people are talking about milestones: are we talking about milestones for the offeror meeting goals, or milestones for whether it's useful to the community and getting broad use, or both? I was actually thinking more about the broad community, but I think the two are interrelated. So it would be both.

As you can tell from this conversation, this is a very rapidly moving field. The technology's changing, the nature of the data's changing, and the needs of the community are changing. What are the plans to stay very flexible and meet the changing needs, particularly of the community, because they are our constituents, using a contract mechanism? How are you going to interact with any contractor? This has nothing to do with the solicitation, but how do you plan on interacting in a contract environment to maintain maximal flexibility to meet the needs of the community? So there are contract mechanisms, and unfortunately I cannot go into the details, but we're looking at a contract mechanism that will allow us to have exactly that flexibility with respect to the definition of tasks and concrete projects that need to be achieved in a reasonable amount of time. The mechanism exists.

Yeah, I guess I had a couple of questions, one somewhat building on this discussion. We started out, and it was mentioned earlier this morning, that one goal of this was to make it available to universities and sites that may not have this infrastructure, but I really didn't hear anything about training. Those people are unlikely to have the resources on hand to be able to use it, and I would certainly think it would be very important to build that into the concept; I tried to look through it, and I didn't really see that addressed.
I have to admit to being concerned about the seven years, particularly if you're talking about not even awarding it for a year and a half, if I understood correctly. Well, it takes a lot of time to develop a call for proposals. But it's really hard for me to imagine what our compute space is going to look like eight or nine years from now, and so I assume there's some type of annual review or update of this, because, yeah. There it goes. No, I understand you can't talk about dollar amounts. No, no, I want to give you an example of a contract that we also oversee, a very large one that just got renewed for seven years, which in past iterations ended up doing things that were not written into the contract from the beginning. Right. And so you can write these. I just want to assure everyone: you can write government contracts of the nature we're talking about to be incredibly flexible, and they can be terminated at any time. Okay, well, so that was my question.

And I'm still a little unclear, and I'm not a computer scientist, on the storage versus compute space. You started out by saying dbGaP is limiting the amount they're going to store in the future, but it doesn't sound like you're talking about this space being long-term storage. So I guess I'm a little confused about storage versus accessing data for a year while you're computing on it, but that's different than... So the news from dbGaP and NCBI is actually fairly recent. As we were designing the concept, we were not thinking about this serving as an archival system; we were assuming that was going to be done by dbGaP. So we still have to adjust the concept and decide what to do about archiving.

I think I recall that many years ago, NCBI was criticized for developing tools for literature searching and things like that, on the grounds that it was pushing private companies out of that space, that it sort of dominated it and therefore suppressed the development of those tools in the private sector. You're talking about developing tools for sequence analysis, and I wonder if you're worried about being subject to that kind of criticism in the future. The resource personnel working at the sandbox are not going to develop tools so much as adapt tools, to make sure they work on the infrastructure and the interfaces they have. Individual users or consortium members can utilize the computing resources, the storage area, and the workspaces to test out their tools or develop new tools. But I don't expect the sandbox people to be the people developing new tools, as opposed to optimizing the tools built by other people and serving them to the rest of the community.

Okay, this is a very interesting proposal, because I think having all of these different data sets together in the same place where you can compute on them is going to be really valuable. But on the other hand, I'm a little concerned with the cost, I think, like Jonathan was saying. Other than things like the fly stock center and stuff like that, I'm not really aware of anything where you have to pay to get something. If you had to pay to look at the Santa Cruz genome browser, people probably would not use it. So I think the cost has to be right: it has to be cheap enough that people will use it.
And, like Jonathan said, a lot of people basically have free compute power at their institutions; even though NIH is still paying for it through everyone's indirect costs, PIs don't see it. As soon as you start charging it as direct costs on their grants, it's going to be totally different. So this has to be so good that people want to pay real money to use it, or it has to be basically free. I think that's going to be a really hard balance to strike. And so I don't know if there's a way of giving people, like NHGRI grantees, a certain amount of free use per year. I think it's going to be very tricky. Vivian, were you going to talk about the credits model? That's exactly what I was going to say. Why don't you go ahead and say a few words about that?

Right, I didn't talk about it this morning because of time constraints, but in the Commons we're also working on what we call a credits model, which allows a grantee to get dollar-denominated credits from NIH to expend on a collection of services in the Commons. That test is going on: 81 requests have come in now, and we're going to be testing it this year. The purpose is to see how people are using the data, and to give them a method so they can actually pay for that use and see what the cost is. And we expect the cost to keep changing, in part because it's a changing landscape, and in part because we see the cloud providers wanting to be more a part of this, so there's a cost reduction there. But we're also seeing this as a way of saying: okay, NIH would be paying a certain amount in this credits model so that you can then use it for those services. Right now that's outside of the grant system; the intent is to move it inside the grant system. So a grantee would get, say, an R01, want to consume NHGRI data, and be able to get some dollar-denominated credits to expend on that use. It also gives us, since somebody was asking about metrics here, how much you're using, at what cost, and what NIH is actually expending on the storage and use of that data. That's another part of the Commons model. For those who want to know more about it, I can take that offline. Yeah, I think something like that would help a lot, because if the cost model is wrong, it won't be used.

So actually, my question was again about the payment model. The draft only mentions that the contract will be used to build the resource, not to pay for users to use the resource, right? So there's an expectation that people will have to pay to use it, and there aren't really any details. Is the credits model kind of what the thought is? I'm having a hard time figuring out: if this thing were built and I wanted to use it, do I use my credit card, do I use grant funds? How would an individual user actually pay for use of this resource? It's at a very high level currently in here, and I'm wondering if that's going to be spelled out in the RFP or if it's going to be something completely different. We will have to spell it out in the RFP, yes.
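As a rough illustration of the credits model Vivian describes, here is a minimal bookkeeping sketch. The $5,000 award amount and the charge categories are hypothetical assumptions, not details from the talk; the per-sample and storage figures reuse the estimates quoted earlier in the presentation.

```python
# A minimal sketch of dollar-denominated credits bookkeeping; the award
# amount and charge categories below are hypothetical illustrations.
class CommonsCredits:
    def __init__(self, award_usd):
        self.balance = award_usd
        self.ledger = []  # (category, usd) pairs, usable for usage metrics

    def charge(self, category, usd):
        """Deduct a charge from the remaining credit balance."""
        if usd > self.balance:
            raise ValueError(f"insufficient credits for {category}: ${usd:.2f}")
        self.balance -= usd
        self.ledger.append((category, usd))

# E.g., a grantee with $5,000 of credits running 100 Spot-priced
# variant-calling jobs ($30/sample) plus 1 TB of fast storage ($450/yr):
grant = CommonsCredits(5_000)
grant.charge("compute", 100 * 30)
grant.charge("storage", 450)
print(f"remaining: ${grant.balance:,.2f}")  # remaining: $1,550.00
```

A ledger like this is also what would feed the usage-and-cost metrics mentioned above: NIH can see what is being consumed, by whom, and at what price.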
I wanted to go back to the question about single versus multiple solutions. I can see all the value of having just one; it's already complicated enough and expensive enough, no matter what the cost is. But I also see the challenge, and this is a little bit like what Mark said: it can stifle certain types of innovation. The standard tools and analytical approaches can be developed by anyone on top of it, but there is inherent innovation in the sandbox itself, in how it's engineered, in the design choices that are made there. How would that be handled, and what would be the opportunities for opening up more than a single one and learning some lessons in the process? So I don't know, Vivian, if you want to say something about what NIH is planning to do through CIT with respect to using multiple cloud platforms as a solution for NIH. Okay, but I'm trying to tie that in with what you were just saying: if I understood this right, you're asking whether the sandbox itself is in for its own level of development, right? Yeah, I'm asking about the sandbox itself, not the general case; within NIH there will be multiple solutions, and they will all play with each other. I think we've asked that question before. Here there is a choice to make a single one for NHGRI, and I'm only asking about that choice.

So in terms of making a single one as opposed to using multiple cloud platforms: as you said, it's a matter of cost. Now, we are hoping that we can leverage some of the contracts that NIH is negotiating with cloud platform providers, so that we could get reductions in cost. That would be great from that point of view, because it would allow for competition across different vendors. With respect to the science that goes into data management and into serving a resource such as this one: absolutely. This will have to be an efficient resource, and some of the tools that need to be developed are brand new. So there will be a component related mostly to computer science and data management that the contractor will have to be able to satisfy, and that will be reflected, again, in milestones, in metrics, in efficiency, and so on. So we will make sure that that research component is included, but it is going to be focused on the things I just mentioned: data management, support of the resource, and finding efficient solutions for supporting the resource.

Let's disambiguate, because I think, Aviv, you're asking an important question, which is that the workspace itself requires work in itself, right, and then there's how you do that across the different places. So it's actually a multiplicity of issues. The first one was just answered by Valentina, right? The second one, which is about working across platforms, is currently underway as part of the discussions we're having across the Commons efforts. And Valentina's comment about Andrea Norris and CIT is that she's helping us with some of the negotiations with the cloud providers where we need that. And then there was an additional set of questions related to how this fits into the overall Commons. This is an example of a commons, right? The architecture is the same; we're all working on very similar platforms. What we have to figure out is how we work together, because we're all doing similar things. Where are they the same and where are they different? That's what we have to do. We haven't done that yet.
I don't have a sense yet of whether we are talking about a single contractor that's likely to provide the services, or multiple contractors who will provide different components. We're looking for one: one award, one awardee. They may have subcontractors. And is that likely to be a commercial entity? No, actually, especially for the R&D component we were discussing, we would love to see academics applying for this as well, to be honest.

And what do you anticipate will be the impact on NIH personnel itself? Is this something that's going to have to expand the workforce here in order to maintain this long term, or is it going to reduce staff with economies of scale, or what's your sense of the staff impact? You mean from the extramural point of view, or? Oh, really from the intramural point of view, of interacting with the research community to help maintain this in a way that the community wants to see it develop. Yeah, that's a good question. I don't think we have touched on that yet, but it's something to keep in mind.

Just one more point of clarification, kind of following up on Sharon's comment about outreach. Is there going to be an outreach requirement in this, or is there going to be a separate group funded to do it? No, there's definitely going to be an outreach requirement and a training requirement. Right within that? Right within this resource, yes.

I come back to the point that I think I made earlier, and that's the business of how you're going to harmonize this workspace, I decline to use the word sandbox, with other planned or underway workspaces at PMI, at NHLBI. And I don't know what the role of NCBI is in all this, except I just heard that they decided they don't want to have anything to do with it, and I'm not sure how that decision got made, but that's a separate question. So the business of harmonization with others: how are you going to arrange, in a contract issued to a sole vendor, to make sure they harmonize with other vendors who may not be the same? You make it a task, a requirement: require interoperability with other entities as they emerge. Vivian gave a number of emerging data commons; they're all emerging, and the NCI Genomic Data Commons is adopting the GA4GH APIs. We will probably keep the requirement generic, without specifying the particular APIs that will need to be implemented, because again, the world is changing over time. But what we have typically done in the past is set up the interoperability requirement as a task of the contract. And I think this is another example of the importance of making sure that we are all working together, between the Commons and NHGRI and NHLBI and those other entities, because as we are starting to develop the language in our contracts, we get a better idea of how to make that language interoperable, so that when we have a solicitation, offerors understand what the expectations are: it's not just for us, it's also how we work in a federated model. I think that's why we're able to start by having this conversation with our colleagues around NIH. And by the way, a number of us program officers have been involved in a lot of these discussions on the development of these interoperability tools. A number of program directors at NHGRI have been part of working groups and so on, to make sure that these tools are developed properly and that we can adopt them whenever they are ready.
So what's the vision for what happens after seven years, in the best-case scenario? If it's successful and everyone depends on this, is the idea to renew the contract at that point for another X years, or would the idea be to internalize this among intramural staff, or some other idea? So at the end of the seven years, we would like to do an assessment of what has worked and what has not worked, an overall assessment, which will also occur annually during the process. And again, we look at this as working within the federated data commons model, and we would come to a conclusion about what the next step should be for this iteration. What parts of this effort work? What parts don't? Where do we actually need to focus our attention to make this better? Because what Vivian has talked about, and what Valentina has talked about, is what you've heard across the NHGRI community. We're still building some of these concepts out, and we're going to continue doing that. So at the end of the seven years, I can see us reassessing, looking at what the community wants, making further adjustments, and maybe doing another iteration with the federated commons, letting the community help us decide the best way to go forward.

So I have one metric that might be too crazy broad to bring up, but why not? You're looking like you're ready for something. And that is: will there be an expectation that there will be increasing numbers of, let's say, extramural-funded projects that are cross-institute? If you've got the capacity to have these interoperable big data sets, focused on different kinds of diseases but with a lot of things that are interrelated, do you envision breaking down the silos that the institutes are now in? Now, I recognize there's cross-funding sometimes, I understand that, but do you envision a model of investigation of human disease in which this can really facilitate cross-institute funding of projects? That would be a pie-in-the-sky metric, but it might be one, okay? My reaction: I'm not sure it could be a metric that would be a term of the contract. I can imagine it being a metric for whether we think this whole line of development is desirable; it could be something we'd be very pleased to see, if that's what you meant by metric. I would agree with that.

Valentina, I think you wore them out. So we've heard. I think Jay, who's been quite quiet, always worries me a little. I think this is a great thing; I'm very positive about this. One question I had: once you start something, it's hard to stop it, right, as we know. And I'm just thinking about defining success in advance, and this came up a little already. Do you anticipate that 10%, 20%, 50% of extramural investigators will use this, you know, 100,000s? Before this gets going, try to have a sense in advance of how we're going to say this was successful relative to the investment that was put into it, right? And the second thing I was just wondering about is, I could imagine this starting off and there being a temptation to issue RFAs where we require people to use it, right?
And I'm just curious if there has been thought about that: oh, we would never do that, or maybe we'll do that. Because if you do start requiring people to use it, then how well it goes will have enormous ramifications for everything else we're doing, right? So, something to think about. Do you want to take the first one? So the first question, and let me make sure I understood what you were asking, is how do you determine the validity of the sandbox itself, as to whether or not it should be continued or shut down? Is that... How do you measure success, basically, and what are the expectations for success in advance, in terms of usership? Okay, so that is still something in development, but I can give you an idea of a potential way to do this. First, within the NHGRI programs, we can look at surveys from the members who are using it and those who are not using it, who maybe still have their own local infrastructure, and do a comparison analysis to see whether what the sandbox is doing is actually fulfilling those goals. If you're outside of the NHGRI programs, let's say an outside investigator, we can track those individuals, and I should be careful using the word track, but we can follow the individuals who are using it and send out surveys to see whether they feel these efforts are fulfilling their needs and their goals. Another thing we could do is hold annual meetings to get feedback from the scientific community. There are multiple ways to determine the validity of this sandbox, both within the infrastructure itself and outside it with the community. And that's how we can determine whether we should scale up, scale down, or shut down. Now I forgot your second question. So now I may turn that over to Valentina for the second one. Yeah.

The second question was whether you'd require people to use it. Well, I frown upon the word require, okay? We would like to encourage. So the NCI Cloud Pilots have used incentives to get people to use the resource, a little bit of funding to experiment with the environment, right? That worked; it got a lot of attention for one or two of the three pilots they had in place. So we discussed proposing something similar as part of this whole project. So the answer is yes, we are planning to do that.

I just want to make sure I understand the intent of this at the grassroots level. At my institution, we have a bioinformatics core that serves people who do, for example, RNA-seq analysis and that kind of thing. Is it your intent to serve as that bioinformatics core for those people, displacing my institution's bioinformatics core? Which wouldn't be a bad thing, frankly. The intent is not to displace it; the intent is to complement it, in case the bioinformatics core at your institution cannot deliver on some type of analysis for whatever reason: too much data to process, or the tools there are not the latest versions. It's really to complement it. I don't expect this to replace it at all, because, as was pointed out, there are some costs, and in some cases people are not going to pay them. I think your costs would be cheaper than going to a bioinformatics core.
Right, but in other situations, to scale up and deliver analysis results on large numbers of samples and large amounts of data, the bioinformatics core will not be able to do it. I think what would benefit us, just listening to this discussion, is a few concrete use cases. So for instance, use case one: I am a researcher who has a few thousand genomes I have collected from my patients, say, or from my study at my university. Is this a use case? I would upload all those data to the sandbox, and then I would pay a staff member who works in this extramural entity to align them against the reference and call variants, then give them back to me, and then deposit those files in that same commons. Is that a use case you have in mind? That is one use case. The other use case could be a consortium that wants to work with the sandbox: they all generate terabytes of data, and they need a place to consolidate the data and to run one pipeline on all of these data sets. That's another use case. What's hard to envisage or foresee here is the amount of staff time. Computing time is one thing, but, for instance, in our bioinformatics core, which is very expensive although effective, I would argue, most of the resource is actually the bioinformatics analysts you're paying, and that's not going to scale in the same way. It's a different scaling problem. Well, the bioinformatics analyst doesn't have to be a sandbox staff member; it could be the data managers of the data sources, who will run their own pipelines, get the results, and make sure the results are shared across the centers. So it's not necessarily somebody who works at the sandbox, as opposed to just providing the environment for somebody else to run things in.

Park, you have a follow-up, right? Yeah, I think there are two levels of users here. There's the naive user, who doesn't know how to do any of these analyses and is probably not going to want to learn; that's the kind of user who goes to my bioinformatics core. And there's the more skilled user, like Jonathan I think is alluding to, who knows how to do it and is probably not going to go to the sandbox if the costs are too high. So there are those two groups of users, which are wildly different, that are going to have to be dealt with, and I'm not sure one operation can deal with both of those.

Oh, I'm sorry. It's actually an answer to both now, I think, from my perspective. Regardless of which user it is, a lot of the reason there's so much work going into this, all these analysts and so on, is that things are done in such a non-engineered way. The data is not sharded right. Everyone has to stand up their own pipeline from scratch, to do the exact same thing again, with their own code, in their own place, and to learn the best-practices pipeline. They're not doing innovative work this way; they're just doing the same thing again and again and again. They're moving the data around a hundred times, and the data is getting bigger and bigger. At some point it won't be that movable, not for everything any of us does, but for certain types of genomics data that's no longer going to remain a viable model. And on top of that, there's all this organizing and massaging and moving around that is not done in the best way.
And even if it mimics the right analysis steps, it's not done in a very effective software-engineering way, so it's slow, it breaks a lot, and it doesn't scale. So I would say I think this would reduce the amount of personnel required, not increase it. There would still be distributed people in different institutes and so on, but they would have something to work with that they don't have to build from scratch every time. And, though that's for Jonathan to say for his own people, I think for certain types of analysis done at scale by computational experts, at least I can talk about my own lab, at some point you don't want to do it for yourself every time, because that's not where your core expertise is. You want to develop the next method, not build the next infrastructure on which the method would run. I think it merits trying out. It doesn't preclude anyone from still doing their own homegrown solutions, but that would presumably be the incentive for people to go in.

So to me, the rationale for this is great. I think the concept makes a lot of sense, and the only thing I'm really concerned about is whether the cost is going to be high enough that everybody's going to be doing the cost-benefit analysis for themselves. There's the time for the people in their labs, and there's the cost of doing it locally versus in the cloud, which may be better for the reasons you're saying. So people are going to go through these calculations, and I think for this to be a success, the costs have to be low enough that a lot of the time they'll say: yes, I'm going to take advantage of all the great things that are up in the cloud and in the system, instead of reinventing the wheel locally. There's going to be some cost point that will make an individual investigator flip one way or the other. Aviv. And the main issue with the cost is what Brenton said before: the actual cost could end up being cheaper, but most people live in weirdly subsidized worlds, and so they don't know the actual cost. If we know it, we don't care, and if we don't know it, we quote-unquote don't care. In fact, we sometimes end up actually paying for it, not just through indirect costs, but we don't keep a mental accounting of it. So we don't really know what we paid for anything, because it's distributed in all of these little pockets and bits and pieces. So it feels cheap even though it's expensive. Here it would feel expensive, even if, possibly, I don't know, it will be cheap. But maybe the cloud vendors would give everything away for free. That would be nice.

Brent? Yeah, I just have a logistical question, and pardon my naivete. Since this would be funded through a contract mechanism, does it go through peer review, and would it come back to council at some point down the road? It will go through a process called a technical review, which is what you're thinking of when you think of peer review. It will not come back to council; no contract comes back to council. Yeah, we cannot discuss funding plans with council. All right, I'm imposing the five-second rule. I was foolish to suggest that she had worn you out, but I think we do need to come to a close. We've heard a lot of good comments here, and the staff will bear them in mind. But first I need to call for a vote. Can I ask for a motion to vote? I'm going to ask for a motion to approve the concept.
Can I get a second? All those in favor, please hold your hands up. Thank you. Opposed? Abstaining? Thank you. Wonderful, thank you all. Okay, moving right along. One of the working groups of council is the, excuse me, Genomics and Society Working Group. I believe there are nine or ten members, and Shanita, Gail, and Jeff, look, they're all lined up there; these are our council members. Yeah, we clustered you together, right? Yeah. With his microphone off. Okay, a requirement of any working group of council is an annual report, and Lisa Parker, the chair of the working group, is here. Lisa is professor of human genetics and director of the Master of Arts in Bioethics program at the University of Pittsburgh, and she's going to give you the annual report.