Thank you. What I'm going to talk about today is the work I've been doing for the last two and a half years, and actually a little bit beforehand, at NHGRI, which is related to the data commons. I'm going to walk through the pieces that led to why we wanted to do this, the structure that we're actually using, and what you will see further this afternoon from Valentina Di Francesco and Ken Wiley about how that's being built, based on some of the conversations we've had. I'll describe a little bit about how we're working across other data commons: there are genomic data commons, but beyond that there are commons over non-genomic data that exist outside of NIH but are often run by NIH researchers. Also a little bit about the work that's going on inside NIH, which is relevant to the conversation you'll have this afternoon about the sandbox, because there are multiple commons underway and clearly there will be questions around how we actually interoperate. So I'll try to cover all of those pieces. Let's look first at what's driving the need for the data commons. I think it comes down to a number of factors, and you heard some of them from Eric Green this morning in his director's report. The first one is obvious: the mountains of data. All of you are probably dealing with that already, but it's changing the way we're thinking about data in a number of different ways. First of all, we now have to really pay attention to how we're going to handle the computational aspects. It's also driving the way we do discovery, and I'll give an example of that in a minute. There's also an increasing need for, and increasing support for, data sharing. Again, Eric pointed to some of the genomic data sharing policies, but I'll speak a little bit more about that as well.
There's also the emerging concept of FAIR, which is findable, accessible, interoperable and reusable. These principles allow you to make sense of the data, so that at least when you've got two different data sets you can start asking questions across them in some common ways. And a critical factor, which has also come up in some discussions, is the availability of digital technologies and infrastructures that support data at scale. The most obvious one at the moment is the cloud, which has come up in multiple conversations, both in the form of clouds that exist within institutions themselves and in terms of commercial clouds; the ones most people are aware of are AWS (Amazon), Google Cloud, and Azure (Microsoft). They're playing an increasing role in this field, and it's these four factors that we're going to look at. The first one is the mountains of data. We all know we're generating a lot of data, and we have computational needs, but it's also driving the way we're doing science. It's the multiplicity of data, and the volume of it that we want to be able to search through and across, that's allowing us to ask important questions, so we need the infrastructures that support asking them. What we're seeing a lot of, for example in cancer, is that headlines like these are starting to come out, and they're incredibly important; it's really data-driven discovery that we're looking at. In addition, I mentioned the need and the desire for data sharing. A couple of things have happened. This is the memorandum that came out from the White House Office of Science and Technology Policy, addressing the issue that we need to share more of our data, especially because we know it's digital data.
This has changed the way that funding agencies look at data and consider how it should be shared and why, and that's impacting all agencies across the United States. Obviously NIH is one of them, but I've had multiple conversations with other US government agencies, and I'll speak about that in a little while. This is known as the Holdren Memo; it came out in 2013, and essentially it's about ensuring that we have open and shareable digital data. We're also looking at the genomic data sharing policy as an example of improving the sharing of data. Here are just some examples. Again, Eric Green was talking about this this morning, but it's really about requiring public sharing of genomic data sets. At NIH, certainly for grants over 500K, there's a data sharing policy, but we're looking at how to extend that: what's the right way to think about data sharing that we should include in what we're doing on an everyday basis? So that's another of the factors. Then there's the FAIR concept I was just talking about: findable, accessible, interoperable and reusable. This is starting to gain steam in the communities, and we're starting to see papers like this, which are the foundational ones, looking at how we need to think about this concept. In the future, when we have data at scale, how do we actually analyse it? How do we use it? How do we interoperate with it? These are founding principles that I think are essential to the commons. It's not just a compute infrastructure; it's what you do with the data and how you handle it.
One good example here is that many projects that include metadata don't necessarily include standards to make that metadata more useful. What we're finding is that common standards for metadata are really important, and we need to adopt them and push them through the communities. That's part of the FAIR process; it's just one example. Now, touching on the kinds of technologies that support this: we see headlines like these, and we've seen examples where Google, Amazon and Microsoft are impacting the way we do biomedical science. They're involved in a number of different areas; I've just got a couple here, but you can see how they're involved. In particular, they're heavily involved with a lot of the NCI work because of the genomic data commons there; the volume of data simply cannot be stored locally, nor can it be used locally. That's one of the key reasons NCI wanted to use those commercial systems. So all of these factors come together: what do we really need to do moving forward? They're all important, but we need a cohesive strategy around them, and that's been a common discussion I've seen at NIH. What do we do to bring these pieces together? The data commons, in my mind, is really about enabling data-driven science. The idea is that you want to enable investigators to leverage all possible data and tools to accelerate biomedical discoveries, therapies, et cetera. Importantly, you want the data infrastructure and the data science capabilities to work as a collective to do this. So how do we think about doing that? That's what I'm going to cover next. Up until now, these have been the shaping principles that helped us think about this; now we're actually taking the direction of how we do something about it.
We think about the data commons as something that treats the products of research, whatever they are (data, methods, papers, anything), as digital objects; it doesn't matter what they are, they're digital. And we want to be able to do something with those digital objects in a shared virtual space. Here are some examples of the things you can do; they seem obvious, but I think it's important to state that we want to do them. As you can see, they're starting to move towards the idea of using the FAIR principles, which is the other thing we want. You want the commons to be able to work on these objects using FAIR principles: you want them to be findable; you want them to be accessible and usable, because being able to access something doesn't mean it's necessarily usable; you want them to interoperate; and you want to be able to reuse them. An example: you have some data that's large; you want to be able to find it; you want to be able to share it with somebody else who can actually use it; they need to interoperate with it for their experiment; and they need to be able to reuse it in such a way that they don't have to recreate the wheel or generate the data again. It's relatively cheap to generate certain data sets again, but the compute right now is still fairly expensive in these kinds of infrastructures, so what we want to do is promote that kind of reuse. I think of the commons as a platform that allows transactions to occur on FAIR data at scale, and I'm going to focus on the at-scale part today. The reason is that, in the context of NIH, there are many very large data sets that we deal with, and it's very difficult for us to operate on those at scale. They either sit at repositories or they sit at local institutions, which is terrific, but it's often extremely difficult to operate on those data sets within those environments, which is why people are moving towards the cloud.
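As an editorial illustration of the idea above, here is a minimal sketch of treating research products as digital objects that a commons can make findable. All class names, fields, and the example identifier are my own invention for illustration, not any commons' actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class DigitalObject:
    """Any research product, data, method or paper, treated uniformly."""
    object_id: str                                 # persistent identifier
    kind: str                                      # e.g. "dataset", "paper"
    metadata: dict = field(default_factory=dict)   # descriptive terms

class Commons:
    """A shared virtual space where objects are indexed and searchable."""
    def __init__(self):
        self._index = {}

    def publish(self, obj: DigitalObject):
        # Findable: every published object is indexed under its identifier.
        self._index[obj.object_id] = obj

    def find(self, **terms):
        # Findable: search by metadata terms, not by knowing a file path.
        return [o for o in self._index.values()
                if all(o.metadata.get(k) == v for k, v in terms.items())]

commons = Commons()
commons.publish(DigitalObject("doi:10.x/abc", "dataset",
                              {"organism": "human", "assay": "RNA-seq"}))
print([o.object_id for o in commons.find(assay="RNA-seq")])
```

The point of the sketch is only that "findable" is a property of the platform's index, not of where the bytes happen to live.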
You can move the data set there, operate on top of it, and often compute against it quite cheaply. Storage is different, but people are doing it; it's a common use case that I see amongst NIH researchers. So what we're looking at is: if you think about a commons for data at scale, how do we incorporate that in our project? What I've been working on for the last 18 months is a platform, and essentially its structure is in three parts. You have a platform layer, which is compute, and you obviously need that because you're going to operate on data at scale. You have data: what I call reference data sets, which are mature data sets that have been out there, have been cleaned and polished, and are ready for use by the community; and user-defined data sets, which are often smaller, potentially not as clean, but definitely useful and definitely usable. And you want to be able to operate over the top of that using services and tools. So essentially you have a compute layer, a data layer, and a services layer. The most visible piece, certainly the kind NIH funds a lot of, is scientific analysis tools. But a key area is also the services component: the APIs that allow interoperability between the data and allow transactions between computations on various systems. Indexing is also really important: if we have a whole lot of data in various places, not just the cloud, you need to be able to find it. So we need to think about those things. They're essential parts of the system, but they're often not the kinds of things we fund, simply because they're not directly involved in solving a scientific issue. Yet they're essential to the transactional analysis that needs to occur when using FAIR data at scale.
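To make the indexing piece of the services layer concrete, here is a toy sketch of a location-resolution service: it maps a dataset identifier to every place a copy lives, so the data can be found regardless of which cloud or repository holds it. The API shape, the identifier, and the bucket URLs are all hypothetical, not any commons' actual interface:

```python
class DataIndex:
    """Toy index: dataset identifier -> known storage locations."""
    def __init__(self):
        self._locations = {}

    def register(self, dataset_id, url):
        # Record one more place a copy of this dataset resides.
        self._locations.setdefault(dataset_id, []).append(url)

    def resolve(self, dataset_id):
        # Return every known copy, wherever it resides (any cloud, any repo).
        return self._locations.get(dataset_id, [])

index = DataIndex()
index.register("dataset-001", "s3://example-bucket/dataset-001")   # AWS copy
index.register("dataset-001", "gs://example-bucket/dataset-001")   # Google copy
print(index.resolve("dataset-001"))
```

The design choice the talk is pointing at is exactly this indirection: tools ask the service layer where data is, rather than hard-coding a single provider's path.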
The other thing to consider, on the top layer here, is what I call an app store, user interface or portal; I haven't quite settled on the right term. App stores are easiest for most folks who don't deal in computational biology, because when you've got an app it just works and it connects magically. And that literally works on a platform: if you think about platforms like Airbnb and Uber, which most of you have used, they work on the very same principles as here. What we need to do is create interfaces for our biologists who don't have the computational experience, so they can access this information quickly, efficiently and correctly. If you want more information about this, there's a link at the bottom; feel free to follow it. Just a quick comment about the left-hand side of the slide, for those folks doing more computational work. The cloud at the bottom is infrastructure as a service (IaaS); the platforms that allow you to interact with these systems are platform as a service (PaaS); and software as a service (SaaS), on the top line, gives you software over the entire stack. We've seen examples of SaaS applications in the commercial world; in the genomics world, Seven Bridges Genomics and DNAnexus are two examples in this particular environment. Those SaaS components are where biologists with low computational knowledge interact most readily with the system, but that does not mean we shouldn't be looking at the other pieces. There are also examples, both in the commercial and in the academic space, at the PaaS level; for example, the Broad's FireCloud fits into that category. So when you're looking at the IT stack, these are the kinds of things we see in industry, and that's the way they talk about it.
And we see examples within the biomedical industry that are populating this space. I think it's really critical for NIH to understand how this operates and to be aware of this terminology, as it impacts the biomedical sciences. On the right-hand side of this diagram, I talk about digital object compliance; essentially, I'm talking about being FAIR. How do we actually make this whole thing FAIR? These are the conversations we're having at the moment. This next slide is complex, but I'll try to walk through it as best I can. This digs down into the architecture, and I want to shout out that this is work I've been doing with Alastair Thomson, Kishel Jekish and Wayne Ogan over at NHLBI. They are also building a commons, and it's built on this structure as well. It's simply a drill-down of the previous diagram I gave you, and Valentina Di Francesco and Ken Wiley are going to be looking at this diagram as well. I want to point out just a couple of things here. The green layer at the bottom is the compute, and all I've done is break out a couple of pieces. The type of storage you need, whether it's close, nearline or offline storage, will depend on where you want your data to be. I'm not going to go into a lot of detail here; I'm just trying to show you that we're looking at the architectures needed to support this. We can go through it in question time, or I can take it offline. A key point here is that we want to look at cost tracking and management. Many times when you use commercial clouds, the costs are quite prohibitive for research scientists, especially at the storage layer. So what we need to know is: what are we spending our money on, for what, and what's the right way to do it? Researchers cannot support terabytes or petabytes of data on commercial clouds using their grants.
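The cost point above is easy to see with back-of-the-envelope arithmetic. The price below is an illustrative assumption (cloud object storage is commonly quoted in cents per GB-month), not a quote from any provider:

```python
# Assumed, illustrative price: roughly 2 cents per GB-month of object storage.
PRICE_PER_GB_MONTH = 0.02

def monthly_storage_cost(terabytes):
    """Rough monthly cost of keeping `terabytes` of data in cloud storage."""
    gigabytes = terabytes * 1024
    return gigabytes * PRICE_PER_GB_MONTH

for tb in (1, 100, 1024):   # 1 TB, 100 TB, ~1 PB
    print(f"{tb:5d} TB -> ${monthly_storage_cost(tb):,.0f} per month")
```

Even at these assumed rates, a petabyte runs to roughly twenty thousand dollars a month for storage alone, before any compute or egress, which is the scale problem a single grant cannot absorb.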
They have to do this another way, and what we need to do at NIH is figure out the right places to look, what the costs are, and the right way to do this. The data layer I spoke about before: you obviously have data in this system, and we need to look at the security and data access rules. We're going to deal with a lot of human data, so we have to be careful about what the security access points are. There's a lot of discussion around the right way to handle interactions with dbGaP, and what the right ways are to do this, both now and in the future. I just want to point that out, because human data is obviously central to the nature of what we do. FAIR data access is the service layer from the previous slide, which is critically important; I want to highlight it again because it's the boring part, but it's the necessary part to make this work. Now let's get to the point about the staging areas, the sandboxes and the workspaces. People have slightly different views about what these mean, but in my mind there are two parts. When you run a project, especially a project like ENCODE, which NHGRI funded, there's a period of time when the data is not in its clean, final, mature state. It needs a lot of work, and you need the researchers to be able to work with each other to clean that data up. I had the same experience dealing with the Human Microbiome Project data: you have multiple sequencing centres and multiple scientists. What you want is a sandbox, a scratch space, for those collaborators to work together to pull the data together so that it's mature and ready for the community to use at large. You also want researcher workspaces so that they can collaborate inside these environments. That is one of the benefits of the cloud: you can create these collaborations, and they're not limited by geography.
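The security and data-access rules mentioned above can be made machine-readable, so the platform itself can enforce who may use a dataset and until when. A toy sketch, with every field name and identifier hypothetical:

```python
from datetime import date

# Hypothetical machine-readable approvals: who may use which dataset,
# and when that approval expires.
approvals = [
    {"user": "researcher-42", "dataset": "study-001",
     "expires": date(2017, 12, 31)},
]

def may_access(user, dataset, on_day):
    """True only while an unexpired approval covers this user and dataset."""
    return any(a["user"] == user and a["dataset"] == dataset
               and on_day <= a["expires"]
               for a in approvals)

print(may_access("researcher-42", "study-001", date(2017, 6, 1)))   # within term
print(may_access("researcher-42", "study-001", date(2018, 1, 1)))   # after expiry
```

The expiry field is the point: access that the system can check is also access the system can end, which comes up again in the Q&A on sanctions.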
So those are the pieces I wanted to point out here. My colleagues Valentina Di Francesco and Ken Wiley will speak later today about the sandbox for NHGRI. Over the top of that are the user interfaces and portals; I mentioned those briefly before, and this is simply a drill-down of what I was saying, consistent with what the sandbox has been developing. There are also other data commons, and this is a question I get asked as well. I've been working closely with the Genomic Data Commons at NCI, and also with Fred Hutch, which is building the Hutch Data Commonwealth. NIAID developed what's called Nephele, which was essentially an early commons over human microbiome data; this is work I did with them while I was part of NHGRI, and it's an early form of a platform commons. We're also seeing the New York Genome Center developing its own commons, and the UC Santa Cruz group developing methods for deploying commons. These are key players in the field, and I've talked to them extensively. The structure I've described here is the structure these groups are using. The reason I think that architecture is reasonable is that it's based on the conversations I've had, both with industry and with researchers, about the right way to think about it. I'll talk a little later about where I think NHGRI should play in that infrastructure. I just want to touch on a few other data commons. I mentioned the ones external to NIH, shown here on the left; on the right is where we're going internally at NIH. Clearly the commons idea is taking hold: the National Heart, Lung, and Blood Institute is putting together a commons built around TOPMed data.
I've been working with them extensively on the structure of their commons, exactly what it would do, and how it supports the use cases it needs. The Common Fund also has data sets: I mentioned the Human Microbiome Project, but I'm also working with the GTEx project, which was mentioned earlier today, and they have other very large data sets which are resources the community wants to access. And lastly, the National Human Genome Research Institute is essentially looking at the same thing, and you'll hear far more detail about the kinds of data sets and systems they want to operate on this afternoon.

Another thing to point out is that this kind of engagement doesn't happen just at the biomedical research level. All of the US government agencies shown here on the left of the slide are starting to contact me, directly or indirectly (mostly directly), about the commons work: what does it mean, and how does it impact their agency? The platform is really agnostic to the type of data; it's still the same. So we're starting to engage in conversations around what it means to have a commons, what it means for different organizations, and how it should be done. Importantly, they can bring perspectives that we don't necessarily see; it's always good to get a different perspective. Additional work I've been doing is with the European Commission. There is a European Open Science Cloud, which is using the CERN cloud, and I've been involved in conversations around the right way to create relationships there. There's also ELIXIR, a European-based organization looking again at the FAIRness of data and at sharing data across the EU. They're very interested in a lot of the components we've talked about in the commons and want to know the right way to create relationships: can we interoperate? One of the conversations we had was whether we could run a service that runs at CERN while the data actually sits here in the United States. We've been having all of these conversations simply because we know this is coming and researchers need to use these kinds of technologies, so we want to find the right way to integrate and interrelate with the appropriate folks.

I'm going to finish up on this busy slide, which is about interoperability with other commons. The first thing is discussion: we're really in the early days. Different people are building commons because they know they need to operate on FAIR data at scale; they all plan to use some form of cloud; they have some level of security layer; and there are services. We all have the same problems, so we need to get together and talk, but we also have common goals, which are to democratize data and to enable collaboration and sharing in a way that is consistent across all of them. What we also realize is that we don't want to reinvent the wheel, so we want to look at reusing currently available open source tools that support this interoperability. There are examples of this: the Global Alliance (GA4GH), which was raised again today, UC Santa Cruz, the Genomic Data Commons, and the New York Genome Center all have tools that they're either using or want to use, and they realize it makes no sense to reinvent the wheel. When you create these open standards amongst the groups, you avoid some of the silos, but we need to have those conversations. So we're planning right now a meeting of the commons developers, the key people working in this field, and also to bring NIH staff together, because, as you saw from one of my previous slides, NHGRI, the Common Fund and NHLBI are talking about this, but I'm hearing of other work being done inside NIH as well; PMI faces very similar issues and structures to the commons here. We are also looking at a session at Bio-IT, a very large conference on the East Coast, in Boston; we were approached about having a commons session there. It's starting to reach the level where people are asking us to participate.

The rest of the slide covers some of the details, but what I'm really getting at is this: we've got a lot of stuff that we've already built; we should reuse it and not reinvent it. When we share that information and reuse it, we hopefully avoid some of these silos, because we're starting to have the same conversations. If I'm using the same API to exchange my data and you're using the same API, the chances are we're going to have some level of interoperability, and that interoperability happens at a number of different levels. That's what the rest of this is: open standard APIs that allow transactions at the data and the compute level when you deploy across clouds. One of the concerns has been that each of the cloud providers works separately, understandably, but some of the groups I just mentioned have been working on tools that let you interoperate amongst the three different cloud providers. There are understandable concerns that using a single cloud provider locks you in, so what we're looking at is how you deploy across each one of them using open standards. Docker, which is containerization, is about preparing your tools for use in the cloud; creating a registry of these containers, with IDs and a store so that people can find them and redeploy them, is an important aspect, and there's work going on within the commons and BD2K that we'd like to reuse. There's workflow management: when you do something in these environments, it's not usually one tool, it's a collection of tools operating in synergy to produce the analysis you want, and it's critical that we do the same for those, to make sure that we have the right workflows that can be reused. Another one is discoverability, which is about indexing: if we do this with all these different data sets residing in commercial clouds, it will be really important that we can find all of our objects wherever they are, so we have to find ways to do that. There are efforts within BD2K to look at indexing, but there are newer methods out there as well; the Europeans have been looking at some indexing methods too, and we're discussing those. So there's a collection of ways to do it; the important point is that we know we need it, but we don't yet know how to solve it. Globally unique identifiers: it's really important that we all have a set of identifiers we can agree on, otherwise it will be apples-and-oranges comparisons. There's a lot of work being done in the GA4GH on this, and in other areas too, but it's critical for the way we go forward: if you call something an apple, I know it's an apple, because we've agreed on the same identifiers. The last point is a common user identification system: when you're using human data, do you have authorization to use it? Is it you? And are you allowed to use this data?
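Stepping back to the globally unique identifier point a moment ago, one approach worth sketching is deriving the identifier from the object's content (for example via a SHA-256 checksum), so that two copies of the same data resolve to the same ID on any cloud, while any change to the bytes yields a new ID. This is only an illustration of the idea, not GA4GH's actual scheme, and the "go:" prefix is made up:

```python
import hashlib

def content_id(data: bytes) -> str:
    """Derive a globally unique, location-independent ID from content."""
    # Same bytes always hash to the same digest, on any machine or cloud.
    return "go:" + hashlib.sha256(data).hexdigest()[:16]

copy_on_aws = b"ACGTACGT..."
copy_on_gcp = b"ACGTACGT..."
edited_copy = b"ACGTACGA..."

print(content_id(copy_on_aws) == content_id(copy_on_gcp))  # same bytes, same ID
print(content_id(copy_on_aws) == content_id(edited_copy))  # changed bytes, new ID
```

This is why content-derived identifiers also interact with versioning: a new version of a data set is, by construction, a new identifier.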
We need to look at the right way to do that, whether with one data set or across them. I can see reasons for keeping them separate, but at the same time you don't want to do 25 of these if you're going across a bunch of human data sets funded by NIH. And lastly, the NIH Commons working groups: there are several of these, and Valentina raised some of them this morning. They are made up of BD2K members, ELIXIR members, and the broader research community outside of BD2K, and they cover areas such as: Commons FAIRness metrics (what exactly does it mean to be FAIR, and what guidelines do you need to develop so that others can apply them?); interoperable APIs (what standards are we using, and how are we coding to them?); Docker registries (can you containerize your tool for use in the cloud, so I know what it is and that I can reuse it?); and data object registries beyond tools (can we identify those objects, because that impacts the way we can index?). These are the kinds of conversations going on right now at the technical-detail level about how we as commons interact with each other. But we also know we simply need to get together and talk: we have a lot of the pieces, and now it's time to sit down and work on a plan to ensure we can actually do this. Lastly, I have an acknowledgment slide, because I really would like to thank all these people. I won't go through them individually, but my point is that there's a lot of work being done across a whole lot of different institutes. This truly is trans-NIH, and it's broader than that: I've had a lot of discussions with industry and with research scientists outside of NIH and outside of the United States about how this could and should be done. A lot of these folks come from different ICs, and we're also working with Andrea Norris from the Center for Information Technology, who is also the CIO of NIH. I think this slide shows just how broadly we're reaching across NIH, and that's important, because this kind of thing impacts everybody and their ability to do science. Okay, I'm done. Thank you very much, and I'll take any questions. Carol?

Yeah, thanks, Vivien, for that overview. I think we all realize how transformational this kind of work is going to be for biomedical science, to be able to share data seamlessly, so these are all great things. On the architecture you spell out, it's good to know that some of the commons being developed are following at least a common architectural framework, but the seamlessness of this whole thing lies in the details, and I know there are a lot of discussions going on. So I have two questions. One: are there specific agreements that have been made in those discussions to enable that seamlessness across different cloud-based commons resources? And two, on the architecture, and I think this is really important: you talked about mature data, but even mature data change. The example I think of very often, because it affects me directly, is genome assembly updates. Version changes lead to different annotations on that genome assembly, which will then propagate through this architecture and affect every single layer of it. I'm wondering whether those discussions have been had as well, about how to accommodate that kind of fundamental change and its ripple effects through the stack of the architecture you've defined.

Let me see: your first question was about agreements between the groups, and the second one was about version control and change, is that right? Okay. On the first one, about agreements: no agreement has been made. What has happened is that everyone who's developing this knows we need to speak.
so this is why we want to bring everyone together to get to a point of agreements but a lot of the work has been done already the g a four g h has done a lot of the work here one of the key things that we all agree on so far is let's not reinvent the wheel so we want to see what's already out there what do we need to reuse and then to come to an agreement what that is and communicate that so that if others come along we say here is what we used in why so that's where we've got so far with agreement in terms of version I didn't get into here because that's just not enough time absolutely versioning comes at exactly what you've described and also versioning comes in how do you do that across clouds so it happens at a number of different layers we definitely expect to see that let me just distinguish the datasets that i would call that in the early stages which they're often uh... very early parts of a project undergo tremendous types of change which is sometimes different to the more mature ones still change on both sides but i think what we want with the early stage datasets is to have a shared space where people can simply work on their own it's not for the broader community yet it won't make sense to them there are also issues around the use of that data for the versioning for the more what i call mature datasets absolutely i do not see them as static but the discussions have been around how do you ensure that the versioning is correctly done how do you ensure the versioning happens seamlessly across those clouds and is there a point where you don't necessarily need to store them when do you no longer version them that kind of conversation is happening not finalized yet but the good questions one of the issues that comes up all the time or seems to come up all the time in developing these very very large data resources is the business of privacy uh... of security and of sanctions against individuals who misuse the data uh... 
Can you comment a little bit on how you would see a structure that would guarantee research participants and researchers maximum security, while at the same time figuring out a way to slap the hands, or worse, of people who abuse the system?

So the questions here are security and policing, or sanctions: two S-words. As far as security goes, yes, that's something we're really looking at, which is one of the reasons I showed the architecture slide. We're paying a lot of attention to how you actually access this data in the appropriate way: how do I know that you're you, how do I know that you are permitted to use that data, and for how long? The "how long" part is important. If you've been allowed to use data for a period of time, it would be nice to know that at the end of that time you're no longer allowed to use it, that the use has ended. We're looking at ways to do that with machine-readable files, so the system itself can know that and work through it. That's the technology piece, but I think there's a bigger issue here around consent, which is part of what you're describing, particularly when you're going across datasets: how do I know the right consent was set up to use two datasets together when they were never designed to be used like that in the first place? That's why the conversations are not happening just at the technology layer, the security API level. They're also occurring, and this is why I have Dina Paltoo and the crew over at the Office of Science Policy involved, around what policies we need in place when we're dealing with issues of consent, and how that impacts the way we develop the technology. Those conversations are underway.

In terms of sanctions, I'd say we know less about that, but it's important to work on. What I will say, though, is that once we can actually track the use of this data, once somebody has actually accessed it and is using it, we can electronically tag it, and there are ways we can look at this that we couldn't before. What's the right way to balance that? If I know you've got the data for a period of time and are allowed to use it for so long, I can actually look at the tracking logs. How can I use that to help figure out the right sanctioning, not necessarily against you, but should it be needed? There will be people who go into these datasets and say, oh, this is Eric Green's genome, and look, he has X. That will happen. It's a real snooze, I'm telling you right now; go ahead and try. Then Gal?

Yes, so first of all, thanks for the presentation. I think we all support the notion of the commons, and it's clear that what you've presented is very well designed and thought out. I wanted to ask a question that is maybe the flip side of that, about openness and data access. It's become very clear to many of us who have benefited from the general access one now has to genomic and other molecular data, pioneered in part by this institute over the past ten years, in terms of requiring people to deposit data in the public domain in a timely fashion, not subject to publication requirements and the like, that the same culture does not exist at the clinical data level. So one thing is certainly that one needs to worry about privacy; however, I think there is a separate issue, which is simply culture.
It's become painfully clear to some of us that when one tries to relate these different genomic and molecular features to different clinical outcomes, it's simply been harder to get hold of those outcomes, that kind of data, and it isn't just an issue of privacy and making sure that patients are protected; I think it runs beyond that to culture. Clearly, for the commons overall to be successful, you want people, once they're inside that firewall, however you define it, to be able to access data freely and to leverage datasets from all sorts of tranches: genomic, molecular, metadata, and clinical, if that's indeed a separate object.

So what I hear is issues of clinical data and also the culture changes needed. Exactly. Okay, let's take the first one, clinical data. One of the reasons I included the Clinical Center here is twofold. The Clinical Center has been very active in discussions around how they participate in a commons. I spent almost a year working with them on which datasets we could actually use, because they are bound by very strong legal constraints for obvious reasons, and we identified some datasets that could potentially move into this commons environment. There are leaders within the Clinical Center really trying to look at this, and we've identified datasets we'd like to try as part of a commons set, for a couple of reasons: they're the right datasets to look at, they're very important, and it also breaks some of that culture issue. So the Clinical Center is very important on that front. I've also been given the honor of serving on the Clinical Center steering committee; they've asked me to do that because they want more of this in their environment, and that's kicking off this week.

In terms of the culture, that's harder, but I think what you need is people who are going to be leaders in this field taking it and running with it: the example of the Clinical Center, what we see with the PIs out there, and also some of the commercial companies that see this as a value-add to their business but are willing to have a conversation beyond that. There's another culture change that needs to happen, and that's at the data and tool level. The publication is still the currency of what we do, and what we don't yet make the currency is data, and that's where we need to go. That doesn't exclude the paper; it makes data just as important. If a researcher can present an application to NIH that has a collection of data and tools, potentially with DOIs associated with them, as part of the application and review process, that also changes the culture of what we do, and I think the commons will start moving us toward that; we can cite data for those very reasons. Those are the kinds of things that have to change within the context of NIH, and that can only come from us also testing these systems to try to change the way we handle culture. I hope that helps a little bit.

Yeah, I too was fascinated, and a little bit overwhelmed, by your task, but a couple of things really struck me, and they're really about the nosology of what you're doing, the words being used for this new vision, or maybe not so new, but it is a vision for the future. One word is "commons," which has a long history in law and ethics, and which can be thought of as being, as you said, very democratizing: what is our commons, really, as people in the U.S., as humans, and so on. The other is "FAIR," a wonderful acronym that I kept thinking meant fair, and I think it could, but when I looked at what F-A-I-R means, it's not so much equitable as it is making access, and finding things, broader; not necessarily fair in an ethical sense. It made me wonder, and I'm glad the two questions before me came first, because they set up your responses about privacy and security and about culture change. Especially when I think about who has access and for how long, and some of what's been written previously, which may now seem almost pedestrian, about the rise of biobanks and the need to create new standards for people who put material into a bank and may need access later: who is the community you're talking about? Is it researchers? Or, if we think about the vision, and maybe the PR, around the Precision Medicine Initiative, it's highly democratized: access to data by a number of different groups, not just the people putting data into this cloud. It's much, much broader than that. So I feel there are a number of really important ethical and legal issues that this vision generates, which I'm sure you have lots of thoughts about.
People have asked you about that already, I'm sure, so I wonder if you have a couple of comments on this broad-ranging set of questions.

You're so good at that; I'm just trying to get my own head around it. The first one I'd probably summarize by saying: whether you said "nomenclature" or "nomenclature," whether it's potato or potato, there are a lot of issues around terminology, and we haven't settled a lot of it. I think we need to come to a set of terms that people agree to. There's the commonwealth, there's the commons, then there's the tragedy of the commons; there are all sorts of things here. Words have meaning, which is really important, so the terminology is something we're very careful about. I have no issue with moving to something else; I want the community to have a conversation so that they're comfortable with a set of terms we can live with, whatever that is.

And when you have that conversation with the community, and this touches your second question, it's not just about which API I need, what the right markup language is, and whether I used a JSON object. That's all the technology side, which is fine, but if we don't address the social and political issues behind this, and also the ELSI issues you raised, it's really not a good thing. So I have done a lot of work with the Office of Science Policy, because I know it's critical to what we do, and I'll give you an example. A couple of years ago the policy for dbGaP did not permit us to use the data in the cloud, and a lot of researchers had concerns about this. I could see that researchers were doing it anyway, but the policy folks didn't know much about the cloud, so you essentially had a schism between the researchers and technologists who wanted to do this and the NIH policies that precluded it: technology was ahead of policy. So I and a group of people worked closely together to change that, and now you know how to move dbGaP data to the cloud for a period of time, and you're permitted to do so. That's an example of why, because the technology is changing so rapidly, you must work with the folks in policy and ELSI together to ensure that everything moves at the same pace. There are times when they need to be educated in the technology, but there are plenty of times when we need to be educated in the ELSI and policy aspects, particularly because we're dealing with human subjects, and when you start to traverse different datasets there are going to be issues that could stop us, not technology-wise, but simply because of the way we're dealing with very sensitive data. So I think it's critical that we include these conversations from the very beginning and ensure we discuss those things from the start, not at the end.

Relevant to that, I think it would be helpful if your group also thought about developing educational or templated materials for consent. Many of us still use outdated language about the data being on a specific network and things like that, and I think part of the reason you're getting a lot of these questions is that, while your answers are very reassuring, the slides talked a lot about public sharing and things that give less of an impression of the security of the data. So I do think that aspect is very important, because most of the people in these studies have not consented to public sharing of their data.

That's a really good point. You can see where I come from: when I look at that architecture, the API for the security layer is critical, as are the policy pieces, but you're making a good point that the language in the slides will be confusing to people, and we should call that out. You also raise another critical point, which is training. It's not included here, but it's essential. We need to include the folks who don't come from these kinds of backgrounds, to help us with some of the use cases, to be part of it, and also to provide training outward so people know how to use the system. I just didn't include it here, but it's critical.

I want to point out that training actually has two components. There's training of what we usually think of as users of a workspace, say data analysts, but there will also be substantial training required for method developers and software engineers in bioinformatics and computational biology, because a lot of them have historically been trained in very different software paradigms and haven't picked up some of these newer things. Not that they're incapable of picking them up, but you need to find the space for that to happen. For example, thinking slightly longer term, there are administrative supplements to encourage it, which I think the NCI cloud pilots did: if you had an NCI grant to develop some software, you could apply for a relatively modest administrative supplement to push your tool to become dockerized, or to interact with data in workspaces handled by the NCI cloud pilots. And I know that in ITCR, an NCI program that supports a lot of software handling cancer data, people really took advantage of that
opportunity, and it served to train the software engineers, which is a very different community from the users.

I 100% agree, and I'd point to the previous slide here; I'll leave it at that. What we were talking about before is exactly that: training the traditional users we think of, but also ensuring that we use open standards and document them so people know how to reuse things. It's very common, as I'm sure you know, that we all write our own API; if you just can't understand the documentation, you'll redo it, and then no one else can reuse yours. That's a real problem, so we need to ensure we have that documentation.

Writing the documentation is one component of the training, but there has to be some additional incentivization to really push people in; there's a little bit of pull and a little bit of push. A lot of the interaction around this, at least in my limited experience, requires more than documentation, especially in the early phases of these data platforms, none of which is fully mature; they're works in progress, in different phases of progress, across these many different entities. So you need real effort, and by that I mean person effort, portions of FTEs, on both the commons developers' side and the incomers' side, with whatever they're dockerizing or making compatible, and that needs real incentives and real resources. It's not all just going to happen on its own.

To connect to what Carol was saying about agreements: although we haven't reached agreements, there is a discussion around, if we are going to reuse something, what the right way is to teach each other how to do it, so that we document it, teach each other, and then teach the community. That is definitely under discussion.

I was wondering if you could say something about whether this will lock the community into using one of the cloud computing providers. In general, establishing monopolies is bad, because it makes things more expensive and less good; can you comment on that?

Yeah, it's a double-edged sword. There are some groups thinking of using just one cloud provider, because multi-cloud is problematic enough, but then you get yourself into vendor lock-in. The work we're doing, certainly with NHLBI, and I don't know yet about NHGRI, though I think it's up for discussion, and certainly the work I'm doing with the Common Fund, is to work across three clouds, and there are a couple of reasons for that. Let's look at the downside first: you've got three clouds whose environments are different, and you have to replicate across them, which makes more work. On the other side, what you're doing is creating innovation and competition. There are also tools; certainly, in talking with the Genomic Data Commons, I know they're looking very much at tools for operating agnostic of the cloud. Rather than committing to one cloud provider, I'm saying I'm going to use multiple clouds, so what's the right way to deploy across them? Some of the tools from UCSC I was talking about are really about that: they're agnostic of the cloud, because they know we're going to need that, and it will drive innovation and competition. Importantly, many of the cloud providers have wanted us to work only on their platform, and we're resisting that for these reasons; otherwise they could potentially hold our data hostage, which we don't want.

I agree completely; I think that's fantastic. Hopefully this becomes a huge resource that the community in general wants to use.

And just to add to that point, we're not planning to use the cloud as an archive; we want to use it for working copies of the data, for the reasons I just gave. There are archival copies at repositories like EBI and NCBI, and that's great; in this phase, what we need to do is use working copies. That way we also avoid lock-in; we don't want to be in a situation where our only archival copy goes away.

I'm going to ask a question that may sound very simplistic, but can you give me at least a vague definition of what "the data" is? We all talk about the data, and we probably all have different ideas of what data we mean when we talk about putting it there. Is it any data that happens to be large? Is it any data at all? Does it include derived data that you're generating off of these things? Can you give me some sense of that?

In the case of the HMP data, what we looked at was the final derived datasets, but we also felt the whole-genome assemblies were really useful. We all know the derived data is what people tend to use most, for a couple of reasons: it's what they tend to know how to use, but also they often didn't have the ability to do more, because their systems couldn't handle the larger whole-genome sequence data. What we're finding is that people are interested in looking at the additional, larger datasets like WGS, and the tools that make that easier are starting to come around. So I'd say it's a changing conversation: originally it was derived data, but now we're starting to see people wanting to look at the earlier, larger datasets before the derived ones, in part because they want to mine them and it's easier to do, and in part because they want to mine across datasets, where the derived dataset isn't necessarily what you want to use. So by data I mean everything from the early data, speaking of genomic data, through its derived forms, to phenotype-genotype associations, which is where a lot of people want to go. There may be a certain variant associated with disease, but people are looking at all the other information as well. You've got a SNP call with a certain quality score; that may be fine for your project, but maybe I want to look at how deep the coverage was, or at the quality score, because I've looked at some other data, I'm studying a different disease, and I care about that. So there are reasons to go back to the data at earlier stages, for scientific reasons I think are really valid. When we talk about data, we talk about the entire gamut.

Do we know how much that's going to be in the future? No. What we do know is that it will be dynamic, and I think it would be a mistake to assume it's static and that we know what it is right now. Our system has to be sufficiently flexible to handle that, and to think about the use cases around it as we move forward; otherwise we're in a really deep hole.

First, thank you for the informative presentation. From my own experience, there are two parts to a commons that get conflated. One is a place to bring together and integrate data and then serve it to the community, usually for local analysis. The second is a place for people to do analyses and to discover, based on data and tools that are already present. How much are you putting those two together in the so-called NIH Commons, and how much are you separating them?

Let me see if I understand your question. One part is a kind of sandbox approach, where you've got early data and you want people to collaborate, and the other is when people want to actually use the products of that research. Is that right?
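The earlier point about going back to quality scores and coverage depth can be made concrete with a small sketch. This is purely illustrative and not part of any actual commons interface; the record fields loosely follow VCF conventions (QUAL, DP), and the thresholds and names are made up for the example:

```python
# Illustrative only: shows why keeping upstream metrics (quality, depth)
# alongside derived variant calls lets a second study re-filter the data
# with its own, stricter criteria. No real commons API is implied.
from dataclasses import dataclass

@dataclass
class VariantCall:
    chrom: str      # chromosome name
    pos: int        # 1-based position
    ref: str        # reference allele
    alt: str        # alternate allele
    qual: float     # Phred-scaled call quality (VCF QUAL)
    depth: int      # read depth at the site (VCF DP)

def reusable_calls(calls, min_qual=30.0, min_depth=20):
    """Re-filter someone else's calls with thresholds suited to a new study.

    The original project may have accepted qual >= 20; a different disease
    study might demand deeper coverage, which is only possible because the
    upstream metrics were kept with the derived calls.
    """
    return [c for c in calls if c.qual >= min_qual and c.depth >= min_depth]

calls = [
    VariantCall("chr1", 101, "A", "G", qual=45.0, depth=35),
    # acceptable for the original project, too shallow for the stricter reuse:
    VariantCall("chr1", 202, "C", "T", qual=22.0, depth=12),
]
kept = reusable_calls(calls)
print(len(kept))  # only the first call passes the stricter thresholds
```

If only the final filtered call set had been archived, the second study could not apply its own thresholds at all, which is the argument for keeping the earlier-stage data in the commons.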
And then your question is how much of that discussion is occurring at NIH. The discussion is happening a lot, but people conflate those two things, particularly inside NIH, and I also see it outside, among people who are just starting up commons conversations. The reason, I think, is that, as Carol was getting at, the data changes, whether it's mature or in its early stages, and people don't delineate between those things; we have to do a better job of explaining that. I think you'll hear something from the sandbox crew this afternoon about that, and it's important to disambiguate the two. Some of the early tests we're doing at the moment are with what I would call the more mature datasets, which are ready for consumption by the broader community, having gone through iteration with their original communities, simply because we know people want to use that data, and we want to test what that will look like and how people will use it. That's where we started, but what we've noticed over the last six to nine months is that people are starting to look more at the sandboxing idea as they understand more of it. The conflation comes from a lack of understanding of the whole piece: there are not a lot of people with technology backgrounds who can grasp all of this, and you really have to think very carefully about what you're trying to do, otherwise you're just going to boil the ocean. Those are the issues I'm seeing right now.

My last statement, I guess, is a comment. The tragedy of the commons, and Garrett Hardin's work, get dismissed in these conversations by some people as just words, but I actually think there's a lot that can be learned from his work and brought into this. We can learn a lot about how to avoid that tragedy from that literature.

Yes, and additional work I've done that I didn't explain here is that I've been working with some economists and people who work on e-infrastructures, not necessarily for the biosciences, because the way we handle governance here, and the way these models work, is very different from anything we've ever done, and I think it's really good to get that external perspective on what we do. We should almost think of this as a business model that we need to drive, and that business model also extends to how we operate with the commercial providers in this space, who are trying to crowd in quite strongly. We as a research community need to figure out the right language for interacting with them, and the right social, economic, and political considerations to include when we build this commons.

Okay, we're ready for lunch. Can you be back here at one thirty, please?