Thanks, Eric, and thanks for the lead-in you did on BD2K, which will save us a bit of time. I'll also try to address the question that just came up regarding biosketch size versus the size of the rest of the proposal. Personally, I see all of that as part of the changing face of scholarship and what it means to be a scholar in the modern era, and that certainly figures into our thinking; I'll say a little more about that. But the basic notion we've come to, and I'll describe a little of how we got there, is that we're thinking about the NIH, as much as we can, as a digital enterprise, and I'll say a little about what that implies as I go along. The general idea is that the NIH is not 27 individual information silos; a lot more interoperability needs to occur between them, and in some ways data and other aspects of the digital world could act as a catalyst in that process. So that's really the basic idea of what we're thinking. As Eric already pointed out, we had a report that came out, and Lucilla and Jill and others, who perhaps are not part of this council per se, were involved in it. I looked at that report before I even thought about taking this job, and I thought it was an absolutely outstanding beginning for what needed to be done. We've tried to follow it pretty closely in what we're doing, and those are the major findings there at the bottom.
This idea of cataloging and making accessible the components of the digital enterprise of course spans all aspects of the research lifecycle: initial ideas, initial data, the hypotheses that are drawn, the software used for analysis, the labeling of reagents, and everything else along the way, right through to final dissemination, which now comes not just in the form of papers but in a variety of other forms as well. How we catalog, find, use, and make all of that available is clearly part of this. To do it in a biomedical research environment that's more analytical than it was before, training is clearly a large part, and I'll say a little about what we're doing to address these things. Then there's the idea, not that these things weren't going on already, but that there should be a point person and an office to deal with this on a trans-NIH basis. As such, I report directly to Francis Collins as the director of NIH and talk to him every week or two about what we're doing and get suggestions. As Eric has already pointed out, none of this would really have gotten to where it is without the engagement of Eric and many people in this institute, which made it all the easier to come here, and I thank them deeply for that. I started on March 1st, and just by way of background, I came from UC San Diego, where I was a professor of pharmacology, but I also did quite a bit in the open access space and was the associate vice chancellor for innovation. So I have all of these things buzzing around in my head when I think about these problems.
Over the six months that I've been here, we've talked to many different stakeholders, right up to last week, when Jill and Lucilla joined us for another meeting. I'm not allowed to call it an advisory committee, of course, but it was a group of people whose input we really appreciate, a whole series of stakeholders who came together to help us move to the next steps. At this fast-moving pace, a report that's now two years old is, in this day and age, pretty outdated. Clearly a refresh was needed; we've begun that process, and it will undoubtedly impact what we do going forward, including in FY15. But that doesn't change the overall notion of what we're trying to do, which is to foster an ecosystem within the extramural and intramural communities, and between the private sector and the public sector, such that biomedical research can be conducted in a way that's consistent with this digital enterprise. The last part there, in italics, is actually the NIH mission. Nothing changes with the mission; from our point of view, this is something added to help us get there. This is just a flavor of what makes up this ecosystem and the respective groups, and I'm not going to go into a lot of detail, but I want to highlight a couple of aspects. Clearly this discussion goes on across the 27 institutes and centers. I've now met with all of the IC directors, tried to draw out the commonalities in the concerns and opportunities they see in fulfilling their independent missions, and figure out what we can do together. We've also begun to talk to a number of the other agencies.
A theme of what I have to say going forward is the idea that, with flat budgets and growing data needs, the more efficient we can be, the better. Sustainability is the thing, if anything, that keeps me awake at night, and that kind of sustainability can be more easily achieved by cooperation across agencies; we've been having a number of discussions, and I'll give you a couple of examples. Then of course there are the higher levels of government above the NIH, within Health and Human Services, and the other branches of government, which all feed into this. There are directives coming from the Office of Science and Technology Policy, which, as you know, have driven some of these developments, and that continues: there's now Open Data 2.0, and we meet every couple of months at the White House to discuss how things are moving across agencies. I'll give you an example of how I feel we can benefit from that as well. Then there's a lot going on with the private sector, and these are just examples, and with other organizations, some of them data-driven, like the Research Data Alliance or ELIXIR in Europe, but also PCORI, the Patient-Centered Outcomes Research Institute, and PCORnet; lots of interactions there. There are interactions with groups we traditionally haven't engaged quite so much: the CCC, a group of esteemed computer scientists; groups around statistics and biostatistics; the traditional societies; foundations; and so on. Lots of different kinds of groups, and I'll illustrate what we're doing to work with those folks in a second. Then there's a set of goals that have come out of all of these discussions, with examples of what we might do going forward and some of the things we're already starting on. So I'll just pick out the first one.
Obviously I'll say more about sustainability, but one of the ideas that came up, and in fact Eric had a lot to do with this, was: what do we do going forward? Perhaps we need to stimulate different types of thinking around how we sustain the data resources that we're already funding. One way of doing that is to not have these things drop off a cliff: rather than fund, fund, fund and then suddenly not fund, have a model that ramps the funding down, with the expectation that new models will be found, either becoming more efficient through mergers and the like, or through interaction with the private sector. We could talk much more about what that might mean; that's just one example. I'll pick another at random: integration. Consider phenotype homogenization across, in particular, the institutes. There are common data elements for describing clinical information, but there's also a lot of variation, and that makes any form of interoperability all the more difficult. Addressing that going forward is clearly a need. All of that's easy to say, but how do we get there when our little group is effectively eight full-time people across all of NIH, with about a hundred folks across the various ICs who contribute to BD2K, and significant, at least from my point of view as a former PI, significant extramural funding that we can put toward these problems? That's really what we have as raw materials. So what can we do? We've organized ourselves around five thematic areas, and I'll say something about each of them quickly: sustainability, education, innovation, process, and collaboration. We had a retreat internally, and, as I mentioned, last week we had a working group meeting to get further input into all of these things, and there was a lot of buzz about that meeting on the social networks.
There has been a lot of follow-up around the world, actually, because there were international folks here as well, and I'm very excited about where this is going. So let me say a few things about each area to give you a sense of what we have in the hopper. First, sustainability. Vivien Bonazzi, who is well known to many of you from this institute, and George Komatsoulis, who had a lot to do with the NCI Cloud Pilots and is now working at NCBI, are leading an effort to establish what we call a commons. That's a public-private partnership, and I'll give you a sense of what it is in a second, but it's a way of addressing, or at least evaluating and then potentially addressing, this sustainability issue. I should emphasize from the get-go that the things we're proposing to do in the next year and onwards are what I would call agile: small experiments where we will evaluate different aspects of what we're attempting, to see how well they're working. First of all, the commons in itself is not a compute infrastructure. We're not building any kind of massive infrastructure to support this; we're essentially using what's there and trying to get it to work in better ways. An example: cloud computing is of growing importance, so let's leverage what goes on there, but not just the cloud; let's combine what goes on in the cloud with what goes on in institutional compute resources, in national labs, in high-performance computing environments, and see what we can do with those things together. But that requires that we think about things a bit differently. One experiment along those lines: if we have essentially anonymized patient information, the idea of moving it into a cloud environment has some connotations. That has been discussed, that step is being taken, and at least it's bubbling up for approval.
dbGaP is what I'm talking about: it will actually be in a secure cloud environment, and that will change in some ways how it's accessed; Vivien was involved with that and could say something about it if there are questions. It also leads to new funding strategies and new business models, and I think the key to all of this is actually the business model piece; that is fundamentally different. You can think of the commons as just an environment where elements of this digital enterprise, the research objects, reside. To be commons compliant really means just two things: whatever research objects go into this environment are identified in some way, and they have some level of provenance associated with them, and possibly metadata as well, depending on what they are. Just by virtue of those two simple things, it opens up a wealth of possibilities for what can happen with that content. In some ways the commons is like the internet, at least in the way I think about it: if I asked every one of you what the internet was, you would all tell me something different, and yet you all use it every day quite effectively. The commons will mean different things to different people: it could be a place to collaborate, you could think of it as an extension of what NCBI does, out in the community rather than within the NIH, and so on; lots of different ways of thinking about it. But ultimately it's driven by a serious situation: we effectively have the why. OSTP and all these other initiatives, including the genomic data sharing plan that was just released, speak to the why of data accessibility, but they don't speak to the how. And what I've learned in Fed terms since I've been here is that this is an unfunded mandate, and that has serious ramifications for what we do with different data types.
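The two commons-compliance rules described above can be made concrete in a short sketch. This is purely illustrative: the `ResearchObject` type, its field names, and the identifier format are assumptions for the example, not an actual commons specification.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchObject:
    """A hypothetical commons research object: any artifact of the
    research lifecycle (a dataset, software, a reagent description)."""
    identifier: str = ""                            # persistent ID, e.g. a DOI
    provenance: dict = field(default_factory=dict)  # who produced it, when, how
    metadata: dict = field(default_factory=dict)    # optional, type-dependent

def is_commons_compliant(obj: ResearchObject) -> bool:
    """The two rules from the talk: the object is identified in some way,
    and it carries some level of provenance."""
    return bool(obj.identifier) and bool(obj.provenance)

dataset = ResearchObject(
    identifier="doi:10.9999/example.dataset",  # illustrative identifier
    provenance={"creator": "Example Lab", "created": "2014-09-01"},
)
print(is_commons_compliant(dataset))            # True
print(is_commons_compliant(ResearchObject()))   # False
```

Everything else in the talk (tracking, discovery, citation) builds on just these two properties being present.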
We haven't really addressed the how yet, but we need to, because we want to maintain the end game, which of course is all the usual things that we do, and there are different data types and styles of data within this environment. There's the long tail: all the data currently produced in many, many laboratories, for which the data sharing plans are suddenly, in principle, going to say, okay, you need to find a home where this data will exist after the grant is over, and certainly while the grant is active. There's what comes out of high-throughput centers, and of course what comes out of clinical research and patient activities. Within this context there are different stakeholders, and there are different ways, through BD2K in particular, that we're hoping to address this with the notion of a commons. So let me say just a little more about what that implies. To conceptualize it in its very simplest mode, you can think of it as a sort of Dropbox: suddenly there's a commons icon sitting on your desktop, and data sets and other parts of the research enterprise can be dragged and dropped into this environment. When you do so, additional information about the provenance associated with those objects could be asked for, much as happens in a Dropbox environment now. Each of the components of the Big Data to Knowledge initiative, including for example the Data Discovery Index, will be expected to find and make these things more accessible. So immediately things could potentially happen here: for any of the research objects in the commons, how much they're used would immediately be tracked, and how people comment on them and use them can be tracked as well.
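The usage and comment tracking just described follows directly from every object having an identifier. A minimal sketch, assuming a hypothetical `CommonsTracker` class of my own invention, might look like this:

```python
from collections import defaultdict

class CommonsTracker:
    """Illustrative usage tracking for identified research objects: once
    everything in the commons carries an identifier, access counts and
    comments can be recorded against that identifier."""

    def __init__(self):
        self.access_counts = defaultdict(int)   # identifier -> access count
        self.comments = defaultdict(list)       # identifier -> list of comments

    def record_access(self, identifier: str) -> None:
        self.access_counts[identifier] += 1

    def record_comment(self, identifier: str, comment: str) -> None:
        self.comments[identifier].append(comment)

tracker = CommonsTracker()
obj_id = "doi:10.9999/example.dataset"   # illustrative identifier
tracker.record_access(obj_id)
tracker.record_access(obj_id)
tracker.record_comment(obj_id, "Useful control samples.")
print(tracker.access_counts[obj_id])   # 2
```

These per-object counts are exactly the "data-level metrics" raised later in the discussion.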
It opens up a place to collaborate, and it lets you figure out things that are very hard to do right now: finding data right now is very hard, and finding good data is even harder. These are all examples of what can happen in this space, and we can certainly talk more about that. But the key element, I think, is the so-called business model; the slide is less important than the concept. What happens now is that we give resources to investigators to buy computing. Typically they buy servers and software that sit in labs, and sometimes, and I say this as a former PI, pointing a finger at myself, that money is not necessarily all spent on what it was originally intended for. Secondly, that may not be the most optimal use of the equipment, because it's idle part of the time, and so on. So the idea here, and this is not either-or, this is something that could be phased in, is to give credit: you get dollar credit to compute, rather than the money per se, and you spend that credit in a commons-compliant resource, in other words, a resource that has agreed to support the commons with the two simple rules I described before. Then there's a broker. An investigator will go to the resource they find most compelling: if they have lots of data but don't do much computing, it's more compelling to go to a resource that supports that at a good price, as opposed to other researchers who don't have much data but do massive amounts of computing. It drives competition into the marketplace. Investigators spend their credits at a particular resource or resources; those resources then send a bill to a broker; the broker sends one bill to the NIH; the NIH pays that bill back to the broker, and the funds are distributed.
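The credit-and-broker flow above can be sketched in a few lines. The function and record format are hypothetical, invented for illustration; the point is only the aggregation step: many investigators spend credits at many resources, but the NIH sees a single consolidated bill from the broker.

```python
from collections import defaultdict

def broker_bill(spend_records):
    """Aggregate credit spends into per-resource totals plus one NIH bill.

    spend_records: list of (investigator, resource, credits_spent) tuples.
    Resources bill the broker; the broker sends a single bill to the NIH,
    which pays the broker, which distributes the funds to the resources.
    """
    per_resource = defaultdict(float)
    for _investigator, resource, credits in spend_records:
        per_resource[resource] += credits
    nih_bill = sum(per_resource.values())
    return dict(per_resource), nih_bill

records = [
    ("pi_a", "cloud_provider_1", 400.0),  # data-heavy lab, light compute
    ("pi_b", "hpc_center_1", 900.0),      # compute-heavy lab
    ("pi_a", "cloud_provider_1", 100.0),
]
per_resource, total = broker_bill(records)
print(per_resource)  # {'cloud_provider_1': 500.0, 'hpc_center_1': 900.0}
print(total)         # 1400.0
```

Because investigators choose where to spend, the per-resource totals reflect the market competition the talk describes.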
We believe, and again we want to test this out in various ways, not as a big initiative, that what we can get for the compute dollar will far outweigh what we currently get. It's not clear what we actually spend right now on what you would call computing and compute infrastructure, but it's certainly well over a billion dollars a year. If we could spend at least a portion of that to get more compute for the money, and potentially open up things we could do scientifically, which of course is the main point, that would be a huge plus. So the idea is to evaluate this in the time to come. What came out of the meeting we had last week, from a group that included a number of the cloud providers, Dave Glazer being just one example of the people who were there from the major cloud providers, is the idea that this all has to be done as a virtuous cycle. To test this out, we need to find applications that scientifically can work within this context but that have outcomes that will appeal to the researcher in a relatively short period of time. Clearly that will have to be the driver. This is not a build-it-and-they-will-come model; it's identifying virtuous applications that are motivated, that produce data and the associated tools and results, and that further the motivation. That's how it has to work. Sorry, I went on a bit about that, so I'd better speed up. That was sustainability; let's quickly look at some ideas in education. The thing that drives me in all of this, I have to say, personally, is the notion of what I call the Google Bus. When I was in San Diego, Google did not have an office in San Diego, but they had one in Irvine, and slowly but surely the folks in my lab found themselves on the Google Bus, going up to Irvine every day to work.
I could not keep them in the academic system, and I think that was a real disappointment, because, yes, I would never be able to compete financially. But some of those people were not so concerned about finances; they were concerned about being appreciated as scholars, and I would say this class of data scientists is often underappreciated in academic systems. This is something we're working hard to address, and it's particularly important if you think about how the NIH spends money: if we're spending money training these folks, it would be really good if, at the end of all that, they continued to contribute to biomedical research and not something else, which might also be worthwhile, but it's how we spend the money. There are various things in the works to address that, including cross-cutting efforts. For example, we're having a workshop with the NSF where we're going to go out and talk to administrators, particularly at academic medical centers, to highlight best practices where these kinds of people are kept in the system. That's just one example. Then, apart from the various training initiatives going on through BD2K, we're looking at a series of other initiatives. I'll highlight just one that excites me: there's a mass of online courseware out there, and a mass of physical courses that one can take, but how do you find them, and how do you find the best ones? One of the discussions we've been having with partners in the EU is the idea that we develop metadata standards to describe these courses. In other words, there would be standard descriptions for these things, which don't exist right now; we would do this across various courses in Europe and in the US, and that would increase findability and potential usability.
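To make the findability point concrete, here is a small sketch of what standardized course descriptions could enable. The field names (`title`, `topics`, `format`, `provider`) are invented for illustration and are not the metadata standard under discussion, which does not yet exist.

```python
# Hypothetical minimal metadata records for training courses; the schema
# below is illustrative only, not an actual standard.
courses = [
    {"title": "Intro to Genomic Data Analysis",
     "topics": ["genomics", "statistics"],
     "format": "online", "provider": "Example University"},
    {"title": "Clinical Data Standards",
     "topics": ["clinical", "standards"],
     "format": "in-person", "provider": "Example Institute"},
]

def find_courses(catalog, topic=None, fmt=None):
    """Once courses carry standard descriptions, they become findable
    with simple queries across providers and countries."""
    results = catalog
    if topic is not None:
        results = [c for c in results if topic in c["topics"]]
    if fmt is not None:
        results = [c for c in results if c["format"] == fmt]
    return results

print([c["title"] for c in find_courses(courses, topic="genomics")])
# ['Intro to Genomic Data Analysis']
```

The same query would work across any catalog that adopted the shared schema, which is the whole argument for the standard.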
That's just one example of a series of standards initiatives we're undertaking. Typically one thinks about standards associated with different data types; I threw this in because it's a standard about something quite different. Michelle Dunn is leading that initiative. The innovation piece is really the BD2K piece: how we get the best that the extramural community has to offer, associated with what we're doing in the world of data science, and furthering the mission around biomedical research. Eric highlighted some of these things; actually, all of them are being funded. Mark Guyer is sort of like the Hotel California: you can check out, but you can never leave. He may have retired, but he lives on in various ways, including H3Africa and BD2K. I can say that without Mark, and Jenny Larkin, who is working on this as well, I would be completely lost. It's overwhelming to come here and figure out how the system works, so they've been invaluable to me, and I really appreciate that. As for where we're thinking of going: Eric mentioned several things, and I won't go into those any more, but just to give you a flavor. First, we've been looking at governance models. We've been talking to the World Wide Web Consortium; we've been talking to the Research Data Alliance. How do we oversee these various initiatives: the 11 centers we're about to fund, the Data Discovery Index Consortium, and the new initiatives that will be funded going forward? How do these all play together? If we're going to create an ecosystem, there has to be some degree of non-onerous working together, and we need to figure out how best to achieve that. This is something this institute has done extremely well over the years, so there are lots of lessons to be learned here. That's really what I mean by governance model.
Then, just to give you a flavor of some of the things that are going to happen: we've identified things and agreed to fund them this year. I already touched on sustainability and standards. There will be a standards framework: there are already a number of efforts that describe the standards that are out there; they need to be coalesced, and we need to figure out what new standards we should begin to support. These are not things we decide ourselves; these are things the community tells us are really needed, so community engagement in all of this is really important. We want to really explore ethics and other aspects of the research use of clinical data: how we engage the private sector in the use of clinical data, and we're going to have a workshop around that, and how we begin to use electronic health records more for outcomes research; clearly that needs to be discussed. Then there are these other communities. The gaming community is very interested in potentially working on problems related to biomedical research, and these are the kinds of groups not traditionally brought into the NIH fold, so getting some of these people engaged is clearly important. Then there's process, and I say this in the context of the questions asked earlier about the biosketch and so on. What I mean by process is what we do internally to manage and handle grants and other aspects of the enterprise. We clearly need more clinical data harmonization. One example I've been talking about recently, and I've presented this to the directors and associate directors within the Office of the Director, and they've now given the go-ahead to look at it more carefully and then present it to the IC directors, is the idea of having the NIH support the notion of data citation. That says a lot.
The rationale for doing this now is that, without getting too technical, there's an extension coming this month to JATS, the format that PubMed and PubMed Central ingest from all of the publishers, which supports data citation. In other words, we can cite data in a formal way. It can be presented to a human in a variety of ways, just like the citation of a paper, but underneath there's a standard format, which means we now have a way of citing data. If the NIH says we support the notion of citing data as a legitimate form of scholarship, that's a huge statement. If you put a data citation in a biosketch or a progress report, not only does that say something, but you can also do something with it, because the citation is resolvable: we immediately have resolution to that data, we can find it, and we also have provenance, so we know who is responsible for it. It begins to elevate and value things that traditionally have not been particularly valued. I have a real personal interest in this, because I have a paper that's been cited 17,000 times. No one has ever read this paper, no one, and all it is is a paper about a database. Why are we recognizing data with a paper when what's really valuable, in this particular case, is the data itself? There are many examples; I'd love to talk about this for ages, but scholarship is totally screwed up. Let me just leave it there; maybe we can talk a bit more about that, because it seemed to be a theme that came up before. In that context, we have data sharing plans already across the NIH: any grant over 500K direct costs needs a data sharing plan, and we're pushing that down, hopefully soon, to all grants.
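The "resolvable" property of a data citation can be illustrated with a toy resolver. This is not the JATS data-citation format itself; the compact `scheme:value` input and the `resolve_data_citation` function are assumptions for the sketch (though the `doi.org` resolver is the real DOI resolution service).

```python
def resolve_data_citation(citation: str) -> str:
    """Turn a machine-readable data citation into a resolvable URL.

    Assumes a compact 'scheme:value' identifier, e.g. 'doi:10.9999/xyz'.
    The resolver table is illustrative, not a specification.
    """
    resolvers = {
        "doi": "https://doi.org/",                 # real DOI resolver
        "accession": "https://example.org/data/",  # hypothetical repository
    }
    scheme, _, value = citation.partition(":")
    if scheme not in resolvers or not value:
        raise ValueError(f"unresolvable citation: {citation!r}")
    return resolvers[scheme] + value

print(resolve_data_citation("doi:10.9999/example.dataset"))
# https://doi.org/10.9999/example.dataset
```

Because the underlying format is standard, the same citation string works in a biosketch, a progress report, or an automated pipeline.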
But where it stands today, and I know this as a former PI until recently, is that those plans are not necessarily treated as seriously as they should be, and they're certainly not evaluated in any formal sense. There's no reason why they shouldn't be. It's ironic that we have data sharing plans that are not machine readable in any form. If we could read at least elements of those plans, we could say, okay, there's a commitment that some data is going to go into this resource in year two of this grant; well, that resource could be checked in year two, for that grant number, to see whether in fact some data has appeared. That cycle could be automatically completed, so the idea that data sharing plans actually have some teeth could be realized automatically. These are all things we're looking at. I won't go into other aspects of this, but there are also ideas around microfunding and open review of grants: having grants that don't get funded sit in the open instead of going into a black hole. For certain types of grants, if the PIs were willing, they could be opened up, and there might be interest from philanthropists and foundations, particularly for what I'd call non-competitive, data-style grants. Again, this is something to think about and try. Okay, finally, collaboration. Let me just say a couple of things about this. The whole notion of public-private partnership matters for a variety of reasons, but certainly for sustainability. It turns out, and I never would have discovered this if I weren't involved with what goes on at OSTP now, that other agencies, NOAA obviously, well, maybe not obviously, have already begun to establish public-private partnerships.
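The automated check described above is straightforward once plans are machine readable. A minimal sketch, assuming an invented plan structure and a mock repository index standing in for a real resource query:

```python
def check_sharing_commitments(plan, repository_index, current_year):
    """Check machine-readable data sharing commitments against deposits.

    plan: {'commitments': [(resource, year_due, grant_number), ...]}
    repository_index: mock mapping of resource -> set of grant numbers
        that have deposited data there.
    Returns the commitments that are due but unmet.
    """
    failures = []
    for resource, year_due, grant in plan["commitments"]:
        deposited = repository_index.get(resource, set())
        if current_year >= year_due and grant not in deposited:
            failures.append((resource, grant))
    return failures

# Plan promises a deposit to a (hypothetical) repository in year two.
plan = {"commitments": [("example_repo", 2, "R01-XX-000001")]}
index = {"example_repo": set()}   # nothing deposited yet

print(check_sharing_commitments(plan, index, current_year=2))
# [('example_repo', 'R01-XX-000001')]
```

In year one the same call returns an empty list, since the commitment isn't due yet; that's the "cycle automatically completed" idea in a dozen lines.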
That is a potential way of feeding resources back into support for the primary data from which these other things are built, as with the Weather Channel, so this is something we should at least look at. We're doing some joint things with the NSF, and there are two other things I'll mention. There's a group, the HIROs, the Heads of International Research Organizations, drawn initially from within the G8, though with about 20 members it now goes a bit beyond the G8: organizations worldwide, like the NIH, that get together twice a year, with Francis Collins as chair, to discuss pressing issues of import to all of them. Big data has come up. Actually, before I even started this job, last December, I went and talked to this group, and I've subsequently talked to, or am in the process of talking to, the data representatives from all of these organizations. It's clear that there are opportunities to work together, and there's a lot of excitement about learning more about what each of us is doing, what we're doing well, and what we're doing poorly; that's an exciting development, I think. Then ELIXIR is a project in Europe; I believe around 17 countries within the EU are now signed up, and they're developing nodes for biomedical research within their respective countries. They were at the meeting we had last week, and we've already discussed, this just happened today, so it's hot off the presses, that there will be at least two working groups to define aspects of how we could work together going forward. That's just an example of what we're doing in collaboration. I just wanted to give you broad brushstrokes of what's going on, and I'd much prefer, if there's any time left, to have a chat about what we're thinking of doing and have you tell us whether we should be doing something else. Thank you very much. Thank you, Phil. Open for discussion.
The one immediate point, you said it, but I really want to emphasize it, because I felt the weight of this when I was acting, and I know it has not let up, it's only increased: having a point person at NIH for these external groups to have a conversation with is, I think, incredibly valuable, because previously these things would just slip through the cracks or bounce around. There wasn't a central person who could even have the conversations with outside groups, whether funding agencies, other countries, companies, and so forth. I'm sure it's killing you, because it was starting to kill me, but it does show that NIH made the right move in creating this leadership position. I do a keynote pretty much every day. I used to say, when I did research, and I still do a little research at NCBI, that I would never give the same talk twice; now I just give the same talk endlessly, with some variations, so at least I'm getting better at it. But it's also the phone calls and the emails and the meetings we're getting invited to; there are finally communication channels that didn't exist before. So, Phil, this idea of a, I'm not going to call it a cloud, a common accessible computational resource, is appealing in a lot of ways, but I would urge you all to be mindful of, and careful about, the fact that if you start having people go to different clouds based on the particular thing they're doing and how cost-effective it is to do it there, we may get ourselves into a situation where there's a lot of data replication and movement of fairly big, quote unquote, data sets across the net, and so there's a certain tension there.
I understand the attraction of using the most cost-efficient method for delivering computation to the community, but there's also something to be said for minimizing the number of times a particular data set is replicated and the number of times it's shipped back and forth between resources. So I just... Yeah, and I take that as a point, but I would argue, at least in principle, again, this needs to be tested, right? But the idea is that if this is more of a common space and you can identify and find data sets, the likelihood of so much replication is less. In other words, you might actually go out there and look for a data set before you generate it yourself. I mean, I think it's highly dependent on the particular project. That's why I think pilots obviously are useful, but I think it's important to look at what we all call driving biological projects and different projects, because there's going to be a tremendous amount of data generation as well as usage of existing data sets in the coming years. So one has to look at those maybe in slightly different ways. But it also does open up the opportunity to really do things with what I call data-level metrics. So, Nona, you know, it's very hard right now to go and find... First of all, if you can find a data set, to have some sense of how often it's been used... No, I agree, and that's a different issue than the one that I was talking about. I'm just throwing that in there because I thought of it. You're preaching to the choir on that point. Okay, we have Eric and then Dee Dee, then Bob and Lucille. We're just going to go right down the row. So, Phil, thank you for the perspective first. I have two questions. One is as a member of council; the other is as a member of the community. So the council question is: NHGRI over the last decade has done a lot to drive and push the formation of databases, omics databases, that have grown and are now integral community resources.
And really the question, as a member of council, is: who now basically picks up these databases and helps continue to build them for the good across NIH? Because they're no longer really just serving NHGRI. They're NIH resources. So that's the first question. In part because it goes beyond genomics; it might have started with genomics, but now other things are getting added. True. So let me just do that one first, because I'll forget... I won't forget, because it's so damn important. But, I mean, clearly, this is something... In my interactions with IC directors, I've asked them two fundamental questions going around. The first one was: how much are you actually spending on data-related activities? That's a hard question to answer, and none of them had a particularly good answer. Eric and Jon Lorsch at NIGMS probably had the best answers, and in fact that's led to looking at this further, and I'll get to that in a second. And then the second is: well, how much should you be spending? So there's this tension, in a flat budget, between how much we put into effectively supporting the data we already have for reuse versus new data generation. And so dealing with those questions is an ongoing thing; all of the institutes are effectively surveying themselves at this point to actually look at that. The second piece of your question is, I think, tricky. I mean, none of this is easy, but if I just give specific examples, right, you have model organism databases, and they exist as separate entities. And that has certain value. This is just a personal viewpoint in a sense. That has a lot of value, because I think we can't underestimate the curation and quality that goes into producing data.
But then on the other hand, you know, you have to go to different resources and access things in different ways to answer questions, which becomes more and more prevalent these days, because in a translational world you tend not to be using a small number of resources anymore; you tend to flip around a whole series of resources. Now, one argument, and this actually came up quite independently of what I'm saying, at the workshop last week, was the idea that you take these highly curated, let's just call them data objects, that come from model organism databases, and you actually put them into a shared environment that is easier to use than what it takes to pull things from different resources right now. And maybe you could do all sorts of ortholog discovery and prediction and all sorts of things that are currently hard in the way that we maintain data. And it's very hard for me to say this, because I actually spent 15 years working with the Protein Data Bank and other resources where we did everything we could to support the community. But maybe, you know, it's time that these things are opened up a bit more than they are now. And then the community question is: we're living in a very confused world with reference to cloud computing, or web-based computing, at the NIH. I'm involved in many projects, like everybody around this table. In one, we're rewarded for being innovative and cost-effective with cloud computing. And then, you know, on Wednesday, for another project, we're treated like Darth Vader and getting cease-and-desist emails because we're putting the data on a cloud. And I'm wondering what's going to happen to have some consistent policy and procedure across the NIH so we can all take advantage of this great resource, because I have to say, today it's extremely confusing as an investigator. Right.
I mean, I think certainly it's the not-trivial job of our little group to try and help homogenize these viewpoints. And I think it's inevitable, as these new technologies emerge, that there are those who see them as a real advantage and those who see them as a real problem. And, you know, I think what we need to do, which is why I like this idea of a virtuous cycle, is we need these things to actually work, and to work well, and we need to spread the word that way, through success, rather than... but it takes time. I'd be really interested to hear good ways of doing this now, but I feel the pain. I know, because when I talk to different groups I hear this. But, you know, that kind of unification is not something that can be mandated. It's something that I think grows as the science develops. Yeah, so thank you, and I think you have a lot of really creative ideas, and so it'll be very exciting if you're able to implement them effectively. Well, this is the bit that worries me currently: talk is cheap and action somewhat expensive. I think you're on the good side. If I don't come back in a year or two, you'll know it didn't work out so well. So I have maybe just a logistical question, but you talked about these compute credits and that NIH would pay the bill. How does that work for universities and investigators when part of their evaluation is based on research dollars in the door? Well, it's a question of whether it's direct or not, but that would still be counted. They would still write a grant and say, I want X dollars for computing, and they would say, I have a grant that's worth so much. It's just that that piece of the money would not be given as dollars; it would be given as credit to spend. So they still count it? It would still count. I figured it was a logistical thing, but I didn't see how you were planning to do it.
I thought you were going to ask what happens when it's working, but then the grant runs out. So what happens? There's no more credit. Who pays the bill to keep whatever it is that they generated? Well, that's exactly the problem we have now, and the fact is we have no good way of dealing with it. So effectively what seems to happen, and you're all PIs, so correct me if I'm wrong, is that stuff sort of sits on a website and gradually it just ages. I did a very simple experiment once. Nucleic Acids Research publishes all these different databases every year, and it's sort of this thing that keeps growing. Every one of them has a URL, and it's all open access. So I just pulled all those URLs, and then I went and pinged them, and I pinged them for a period of time, and what you find is that you get an attrition rate of about 10% a year: 10% of those resources are no longer accessible. I mean, sometimes they move, and it's a little complicated, but it basically drops off 10% a year for five years. So at the end of five years, all those papers in the literature that point to them are absolutely meaningless. So that was a bit of a rant. But then what do you do about that? So I think the idea of measuring what is valuable is tricky, because value is not just how many people use it; it's what impact it has on the communities. And that's much harder to deal with. We don't even touch that now in what we do. We don't really know how the data that we currently fund is used. I mean, we might know that so many users in so many countries download so many terabytes, but we can't dig in and look at individual items of data: the what, why, when, and how, and what the implications of that are. So the idea of this kind of environment is potentially to make that at least slightly easier.
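The link-rot experiment described here, pulling the published database URLs, pinging them, and tracking attrition over time, can be sketched roughly as follows. This is a minimal illustration, not the speaker's actual script; the timeout value and the compounding-attrition model are assumptions, with only the ~10%-per-year figure taken from the talk.

```python
# Sketch of the database link-rot experiment: check whether URLs still
# respond, and model the compounding effect of ~10% annual attrition.
from urllib.request import urlopen
from urllib.error import URLError

def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL responds at all, False otherwise."""
    try:
        urlopen(url, timeout=timeout)
        return True
    except (URLError, ValueError, OSError):
        return False

def survival_after(years: int, annual_attrition: float = 0.10) -> float:
    """Fraction of resources still reachable after `years`, assuming a
    constant attrition rate compounding annually."""
    return (1.0 - annual_attrition) ** years

# With ~10% attrition per year, only about 59% of resources survive 5 years.
print(round(survival_after(5), 2))  # prints 0.59
```

Note that under this simple compounding model, five years of 10% annual attrition means roughly 41% of the resources, not all of them, have disappeared; the speaker's "absolutely meaningless" is rhetorical emphasis.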
The questions are still very hard, but the mechanics become a bit easier because you have access to this stuff. I even hesitate to bring this up, given this extremely long and daunting list of tasks that you have. But I think one of the issues that has come up, once you start talking about patient information, is that the local IRBs and the OHRP don't have a clue about what to do about the cloud, and they're unaware that there actually is security, such as it is. And so I think one of the government groups you're going to need to interact with is going to have to be the OHRP. Yeah, there's no question about that, and, you say you don't want to add to a daunting list, but it's clear that we need extra help with that. So apart from the small group of people, the intent is to hire one person who's going to be really dedicated to this type of issue. And the other thing that's daunting is that the types of data that we cover are so broad that whoever comes into this job is not going to understand the spectrum fully, and I readily acknowledge that I have weaknesses, and the idea is to fill those holes with real expertise. So hopefully we can start addressing this, but I agree it's a big issue. Right, so in relation to genomic data sharing, I think that was a great achievement, the policy. What about a clinical research data sharing policy, which is totally needed? We do have technology to protect the data. And also I think we need to demystify the cloud as a completely publicly available resource: there are public clouds, there are private clouds, there are access controls, there's a whole lot that can be done, and I think the time is right for a clinical research data sharing policy.
Yeah, I mean, first of all, I would also give great credit to the genomic data sharing policy. It has nothing to do with me; it was Eric and Laura and many other people around NHGRI who made that happen, and it's been a very interesting process to watch, and it's to be highly commended. And I think the idea that we follow on with additional ones is clearly very important. With respect to the clinical data, I think it's difficult, but at the same time there is pressure for this, increasing pressure, you know, in various ways, from what you do yourself, but also, I mean, we heard a lot about it at the meeting last week. And I'm also thinking, from my own naive point of view in all of this, that there's a cultural change going on. The way I put it is that, for the first time in history, in my view, healthcare is becoming patient-centric: the patient has much more control and understanding of the information related to their healthcare than they have had before, and I think there's a question of whether and how we can leverage that in this kind of environment. So there is talk of various types of large cohorts where patients are actually taking their healthcare records and, in an anonymized way, putting them into a resource that could be used for research. Of course, this is already happening in some countries, particularly, I guess, those that have socialized medicine; I mean, we know what's going on in parts of Europe. And in the discussions we have with the ELIXIR folks, one of the workshops we're going to have in short order really is to specifically address this: can we create an international cohort that would be accessible in this way? And of course there are oodles of questions around doing that, but if the consent is coming from the patients themselves initially, it seems to me that it's at least a big step in the right direction. So, Phil, thank you for that, and I just wanted to say that your comments on
the need to recognize the contributions of data scientists, and the need to have access to published data sets recognized in at least a similar way that publications are recognized for career-path development. Many of us in this room have been talking about this within our own institutions, because we need to recruit and retain these individuals, so I think having your voice added to that is going to help us put into effect some of these changes in the culture of academic advancement. So thank you for that. I would say I have a real passion for this, because I was one of those people, and I didn't necessarily want to take the path that I took to do what I wanted to do, but in the end I had to become a tenured professor to do what I really wanted to do. Exactly. No, it's very, very important. And then a very specific thing: you mentioned a couple of times the common clinical data elements, under your common data elements umbrella. So how do you plan to get the input from the community on what those would look like, and how do you envision engaging with a broad sector of the community on that particular project? Well, I mean, I think in general terms these communities are emerging. I mean, just this morning I was on a call with the Global Alliance for Genomics and Health, right? So it's clear that's got a lot of momentum; many of us in this room have been engaged with them in one way or another, and these are important developments. And I think there are several other of these sorts of organizations that really, in a sense, are doing these things for themselves, but they need assistance. And what's interesting is these are often things that, in my mind at least, don't actually cost large amounts of money, because the community is already in, but it needs something to catalyze it. And so I think we're looking for those kinds of wins in a general sense. With respect to, you know, these common data elements associated with clinical
research, I mean, some of that is sort of going on already. NCI is collecting these on behalf of a number of institutes, both from an intramural and an extramural perspective, and if I had more cycles I would have looked at this more already. But I think it's something that is working; it's a question of whether it's working optimally, and, you know, having someone look into this specifically is what I was thinking, but I'm open to other ideas. Well, I just think that there's a big appetite for this now. So many of us are now developing information systems around these data, and, you know, where do we go to find out what the current standards are, so that when we build the infrastructure it's going to be interoperable with other groups? I mean, that's the challenge we're facing right now; we have to make a decision now, and so knowing where to go to find out what the developing standards are would be incredibly useful. Right, I mean, certainly the Office of the National Coordinator and their efforts are something we're trying to wrap our heads around, as to what is being contributed, and the idea of having these workshops this coming year was specifically to do that.
So, all of us data scientists, I think we have to thank you for being the Lorax who speaks for the trees. But, you know, I think that the attribution for data and the use of other people's data is very important. I also think we have to be more out there about the attribution of people's software tools, some of which may just be a little R tool library or something like that. And, you know, having had my first data set published in the new Nature data journal, I almost think that one has to engage journals, to have journals begin to recognize the importance of papers citing the software that they use. And I know that you're thinking about software identifiers and so on, but it's one thing to have an identifier; it's another thing, when a reviewer is looking at a paper, when a journal editor is looking at a paper, or at the final copy editing of that paper, to really say: okay, you say you did this analysis, what software did you actually use? And in citing it, give credit to that person. And it's just as important for large software packages as it is for the script that's written and then made available by a postdoc or a graduate student. So, just another Lorax moment here: Vivian was leading a workshop on what we were doing with software, and there's a report on all this stuff. Another thing I should say is that we're trying to be as transparent as possible, so almost everything we do is tweeted, is on social networks, Google Docs, and what have you, and you can comment on it. So, you know, certainly the sentiment from that group was exactly what you just described vis-a-vis software, and we clearly have to take care of that. Just another comment on what you said regarding publishers. So again, you know, I live half of my time in a dream world, but just last week I had a conversation with folks at the Public Library of Science. For those who don't know, I actually founded one of the PLOS journals, and I've done quite a lot with them over the years, and we talked about the idea of, effectively, micropublication. So
this isn't to say it's going to happen tomorrow, but there's a willingness to test this sort of thing. In other words, you write a piece of software to do the research; you effectively publish that at the time, you get some form of PLOS attribution for it, and it sits there, immediately accessible, and other people can use it. And then, as you move further down the research lifecycle of what you're doing, the results, and even the data associated with them, also get micro-attribution. So you're kind of building the publication as you go along, and at the same time, of course, all of this stuff is available. On one hand it leads to competition, potentially; on the other hand it leads to collaboration. So I think it's a time in history to really try these things out, and then of course measure exactly whether they succeed or fail. And in a small way, as you know, we've done a number of these experiments with PLOS; a few of them have been very successful and a few of them have tanked. So, you know, I think it's about trying and moving forward. Okay, I'm going to step in here. We probably could talk to Phil all day because of the great interest and relevance to this council, but he'll be back, I'm sure, and more importantly, this council will be regularly updated about all these things, because there'll be so much interaction between Phil's organization and NHGRI. So thank you, Phil, and thank you, council, for an excellent discussion. So, Rudy, you've earned lunch, but let's try to be quick about this. Hoping that the throughput upstairs can meet the demand here, let's try to get back by, let's say, 1:15. Okay, we'll aim for that. Thank you.