If you're up for it, I'm ready to begin. I should say that this is being recorded, and I'm fine with that. If you're fine with that, I encourage you to interrupt me at any time and call out whatever you want, and I'll try to address it. I'm all for having an interactive dialogue rather than just a talking head, which you could get off the internet. So I'm actually going to talk, well, I don't know how I got to this title, but I'm going to talk a little more broadly than this. But I think the nucleus of what I'll say has to do with this idea of a commons, which I'd say is one of my major foci at the NIH now. Just by way of introduction, I've been in the role now for about eight months. I moved from being a professor of pharmacology at UC San Diego, where I was interested in a lot of things around open science; I co-founded an open-access journal and maintained a lot of databases. So that was my background coming into this. And I went to the NIH because there's a realisation that the future of biomedical research is going to be significantly different from what it is today. We talk about it, and we'll see this in a minute, as a digital enterprise. The whole enterprise is becoming more analytical, and that changes the dynamic quite significantly. The question is, how can we maximise our rate of discovery in this changing environment? That's what interests me. But I want to show one slide, related to my background, to illustrate how far we've come in terms of these objectives, certainly the objectives that I had as a researcher and now have at the NIH. This is actually taken from a newsletter in 1976, when I was a graduate student in Australia.
It is the sum total of the resource called the Protein Data Bank, which is something that I became responsible for later on. I started researching with this data, and it would arrive by mail and take three months to get to the lab I was in in Australia. At which point we would print the data, look at the numbers, go ooh and aah, and actually make some interpretation of it. Interestingly, this is related to my area of research now: this is a complete proteome of Ebola, and I get that in milliseconds, as part of a whole composite view drawn from a much richer set of data. I happen to be interested in looking for targets for repurposing drugs, potentially to treat Ebola patients. But the point here is just this absolutely stunning change.

The other thing that's stunning is that if I were giving this lecture at the beginning of the last century, in 1900, I would have been dead for 15 years, because the average life span at that time was about 45. It's now, as you can see, close to 80. That's an incredible success. Obviously it doesn't come just from what the NIH does, but the NIH has a strong influence. At the same time we've seen amazing successes in the way we use IT, and at the interface between those two sets of developments are some things the NIH has done that we're certainly very proud of and that I think have really impacted research in the biomedical sciences, namely public access, PubMed Central, and now a series of new initiatives, which is what I'm going to tell you a little about today, the kinds of things coming down the pipe to try to continue this trajectory. Obviously a lot of this has to do with data. This is my data deluge slide; everybody has one of these.
This is the total contents of the National Library of Medicine and NCBI. It's now 20 petabytes of information, which is of the order of 400 million filing cabinets, if you care. But there's much to do; there are still lots of problems to be dealt with. We have too few drugs, they're not personalised, and they take too long to get to market. We don't spend enough time on rare diseases. Clinical trials take an enormous amount of time, they're very expensive, and they're not reactive: when problems are discovered, it often takes a great deal of time before that problem is reflected in what happens to that drug in terms of its place in the market. We're not doing a great job of education and training for the new workforce.

And the thing I'll focus on most today, because I think it crosscuts this audience more than anything else, is that research itself is not cost-effective. In fact, as we move into this digital enterprise, you could argue that 80 to 90% of the time is being spent doing things that are not cost-effective. It's basically wrangling and munging and doing things with data that could definitely be improved. It's a lot easier to improve those things than to improve the experiments themselves, or our understanding of them, because those are real physical, analogue things. There's no such excuse in the digital world for not doing better. So how do we do better? That's really what I'll focus on. I won't go into this, but here's another illustration of why we have so much to do: while in this country, and much of the western world, life expectancy has gone way up, in places like Nigeria and Angola it's actually still where we were in 1900. So the idea that we can influence what's happening in the developing world by what we do is, to me, absolutely critical.
But there's a lot of promise out there, and I'm going to say a little about what those promises are in healthcare, and then I'll get into what I think is probably of most interest to you if you're not in the healthcare area or the related biomedical sciences. This is taken from the 100,000 Genomes Project, which has started now in Great Britain and which has a fairly short timescale. The goal is, within several years, to fully sequence 100,000 people and use that as a diagnostic tool. I think this is a real step forward, and we're going to see a lot more of this kind of endeavour, including in this country, where the situation is much more complicated, by virtue of economics in particular.

Another example I'll just show you in passing, which illustrates where we could potentially go. This is a graph made from the electronic health records of the complete Danish population of 6.2 million people, and what it shows is comorbidity: the likelihood that if you have one disease, you will then get another disease. Each node in that network is a disease or a disease state, and the thickness of each edge reflects how often people progress from one state to the other. This is an incredibly powerful predictive tool based on a whole population. We don't really have the facilities yet in this country to do anything like this, in the sense that we have a much larger and more diverse population, but we don't have the same kind of access, or the same level of homogenisation of the data within the system, that they have in Denmark. Interestingly, you can then take this kind of thing further. Because in Denmark your social security number links to your bank records, your health records, and your social security information, in principle you can start layering all sorts of other features on top of these kinds of graphs.
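The comorbidity network described above is, at heart, a weighted directed graph built from per-patient disease trajectories. Here is a minimal sketch of that construction; the disease labels and patient records are invented for illustration and have nothing to do with the actual Danish study.

```python
from collections import defaultdict

def build_comorbidity_graph(trajectories):
    """Build a weighted directed graph from per-patient disease sequences.

    Each node is a disease state; the weight on edge (a, b) counts how
    many patients progressed from state a directly to state b.
    """
    edges = defaultdict(int)
    for states in trajectories:
        for a, b in zip(states, states[1:]):
            edges[(a, b)] += 1
    return dict(edges)

def progression_probability(graph, source, target):
    """Fraction of observed transitions out of `source` that go to `target`."""
    total = sum(w for (a, _), w in graph.items() if a == source)
    return graph.get((source, target), 0) / total if total else 0.0

# Invented trajectories standing in for coded health records.
patients = [
    ["diabetes", "hypertension", "stroke"],
    ["diabetes", "hypertension"],
    ["diabetes", "renal_failure"],
]
g = build_comorbidity_graph(patients)
```

With a whole population's records, edge weights like these become the empirical progression rates he describes, and extra layers (socio-economic features, say) can be attached to the same nodes and edges.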
If you start layering things like socio-economic metrics and parameters on top of it, you start to see other kinds of trends emerge. Intuitively you would expect that kind of thing, but now we're actually measuring it, and measuring it in some detail. This is the promise of healthcare going forward. We have lots of problems here in how we manage the data, as well as in the economics, as I've said, and they're slowing us down in many ways. But the drivers to get to this kind of thing are really mounting, because in my view, for the first time in history, the patient is at the centre of the healthcare system. You have control over your own healthcare information, to some extent, and as you start to use that, amazing things can happen.

So the point is, what are we doing at NIH to utilise these various types of disruption? I'm going to tell you about a few of them. What we're trying to do is create much more of an ecosystem to support biomedical research and healthcare than we've had before. Right now, too much just gets lost from the system. Money is spent on a grant, some data is generated, a piece of software is generated, and there's a publication, which of course is valuable, but a lot of the enterprise that went into that publication gets lost. We really need to do a better job, and that of course is affecting our productivity big time. The way we've been thinking about dealing with that is by essentially creating a three-legged stool, because there are three aspects of this that have to move in lockstep: community, policy, and infrastructure. Within that, of course, is the need to sustain, collaborate, train, and so on. But those are the three legs of the stool, and I'm going to say a little about each one of them.
But none of that really makes a lot of sense unless, on top of all of that, you have a so-called virtuous research lifecycle, so that the motivation for researchers to engage with community, policy, and infrastructure is what they're actually achieving in their scientific endeavours. That's the driver, and we mustn't lose sight of it; at the same time, you want to facilitate it through this ecosystem. So let's look at each leg of that three-legged stool, at what's actually happening and what's coming down the pipe.

Increasingly, the notion of data sharing, and the encouragement of data sharing, is something the NIH has taken on and taken seriously, as have other agencies. Of course, we now have mandates from the federal government which we've had to respond to as an agency, about how we're actually going to move forward with sharing. That's already manifest in things like the genomic data sharing policy, which provides much more flexibility and accessibility to genomic information while maintaining the required, or desired, privacy for the individual patient. It also means we'll have data sharing plans on all awards, which is something we haven't had in that sort of detail before. But we will also enforce data sharing plans. Right now, people applying for grants have to write a data sharing plan, but no one actually looks seriously at that plan and checks whether the PIs did what they said they were going to do. It really isn't part of the review criteria in the way it should be. But it could be. First of all, it's quite simple to make the data sharing plan itself machine-readable. If you say you're going to put data in repository X on date Y, we should be able to go to repository X on date Y and see whether that has actually happened. If it has, then that would release the next amount of funding, if you like.
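The repository-X-on-date-Y check he describes could, in principle, be automated. A minimal sketch, assuming the plan is recorded as structured data; the repository names and accession numbers are invented, and `dataset_exists` stands in for whatever real repository API a deployment would wrap.

```python
from datetime import date

def check_sharing_plan(plan, dataset_exists, today=None):
    """Verify a machine-readable data sharing plan.

    `plan` lists promised deposits as (repository, accession, due_date);
    `dataset_exists(repository, accession)` would wrap a real repository
    lookup in practice -- here it is injected so the logic stays testable.
    Returns the promises that are overdue and unmet.
    """
    today = today or date.today()
    overdue = []
    for repo, accession, due in plan:
        if due <= today and not dataset_exists(repo, accession):
            overdue.append((repo, accession))
    return overdue

# Hypothetical plan: deposit two datasets by the given dates.
plan = [
    ("GEO", "GSE0000001", date(2015, 1, 1)),
    ("SRA", "SRP0000001", date(2015, 6, 1)),
]
deposited = {("GEO", "GSE0000001")}  # pretend only the first was deposited
missing = check_sharing_plan(plan, lambda r, a: (r, a) in deposited,
                             today=date(2015, 7, 1))
```

An empty result could then gate the release of the next funding increment, which is exactly the workflow sketched above.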
I'm making a somewhat simplified description of this, but you get the idea. Right now, the fact that there's a human in the process, and that the rules are not well adhered to, just means it doesn't work very well. So that's one example of how to deal with this, which we're actually in the process of looking at. Another piece which is critical in all of this is the idea of elevating data citation to be considered by the NIH as a legitimate form of scholarship: the idea that we will actually endorse and encourage people to provide citations to data sets. When you submit a grant application or a renewal or some other kind of report, we should embrace and support data citation as a legitimate form of scholarship, and how that fits into our own workflows is currently being reviewed. There's a small development that makes this pretty straightforward. A working group has worked out an extension to JATS, the XML representation that's ingested by PubMed and PubMed Central from the majority of publishers. That extension covers what a data citation should look like in machine-readable form. So we will now begin to see and get that kind of information more than we had before, and we should be able to use it effectively. And it's coming from other sources as well, because there are now data journals that are part of this. So those are a couple of examples of what we're doing with policy.

Let me say a little about what we're doing with infrastructure. We have an initiative called Big Data to Knowledge. We just funded $32 million worth of grants, and this coming fiscal year it's going to be more like $80 million. So far we've funded 12 centres of excellence, and each one of those has its own virtuous cycle: it's doing its own research. But they're all associated with different types of data.
Everything ranging from genomics data to electronic health records to mobility data, data coming from handheld devices, and so on. So there are lots of different data types in this. We've also funded a data discovery index consortium, which is building a means of actually indexing and finding that information. Right now it's very hard to find data sets relevant to what you want to do. Typically you Google for them, or try to; that's not very satisfactory. Or you get to them through papers, or ultimately you have to work directly through the investigators, or there's simply no way to get hold of them, which of course slows the whole process down. The same can be said for software, and for standards: knowing which standards to comply with, which standards to use, and so on. So these are all initiatives underway to help facilitate a better ecosystem. Of course, all of them have to be connected, and what we're starting to see happen in this environment is these various groups forming working groups to do specific work across the respective centres. That, to me, is the beginning of an ecosystem. Whether we can sustain it is an open question, but it's something we're really trying to do, and the way we're trying to do it is through this idea of a commons. I'm going to say a little about what that means to us. Other labs will join the commons, and in fact there are no restrictions on anyone joining this kind of consortium. We've been having talks across a variety of federal agencies, and particularly with other funding agencies in other parts of the world, because I'm very keen to see if we can do some of these things together. So what is this commons? Essentially, the idea is that it's exactly like a commons in a village.
It's a place where you go and you share experiences. So it's really a sharing environment, and at this point it's conceptual; you'll see how it's instantiated physically in a minute. Within that environment, the idea is that we all conform to the so-called FAIR principles: finding, accessing, interoperating with, and reusing the content. And that content, as you should have gathered already, relates to all the different types of research objects. It could be data, software, narrative, and so on. Of course there's provenance and attribution associated with that. But the computing platform itself is agnostic. These are the folks who are actually driving this forward right now, Vivien Bonazzi and George Komatsoulis. Really, what it is is a set of digital objects in this space, a means of searching through them via the indexes I described, and an agnostic compute platform. That's really the commons.

Let's look at each of these pieces very quickly. The conceptual framework of the commons is made up of public clouds, the usual players; it could be high-performance computing resources; it could be institutional repositories and in-house compute solutions; and then things from the private sector. For this to work, the business model by which it operates is a critical facet, and I'll tell you something about that in a second. But just to quickly go over the other pieces: how we identify these research objects, which is critical in this space, is something that has to come from the community. And that's the other part of this that I haven't emphasised yet: none of this makes any sense unless it's driven by the community. If the NIH just says we're going to do this, people will say, yeah, okay, if you give us some money. But that doesn't change anything. It has to come from what they really see as important.
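Two of the pieces just listed, digital objects with persistent identifiers and an index for finding them, can be sketched in a few lines. Everything here is invented for illustration: the identifiers, the metadata fields, and the toy inverted index say nothing about the consortium's actual design.

```python
from collections import defaultdict

# Hypothetical research-object records: a persistent identifier plus
# minimal metadata. Real records would carry provenance and attribution.
objects = {
    "doi:10.0000/ds1": {"type": "data",
                        "description": "Ebola proteome mass spectrometry"},
    "doi:10.0000/sw1": {"type": "software",
                        "description": "variant calling pipeline"},
}

def build_index(records):
    """Minimal inverted index: each lowercase keyword maps to the set of
    persistent identifiers whose description mentions it."""
    index = defaultdict(set)
    for pid, meta in records.items():
        for word in meta["description"].lower().split():
            index[word].add(pid)
    return index

def search(index, query):
    """Identifiers whose descriptions contain every query term."""
    hits = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()

idx = build_index(objects)
```

The essential point survives even this toy version: once every object has a stable identifier, any index, catalogue, or compute platform can key off the same handle.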
So there's undoubtedly momentum broadly across the research community for this kind of identification. There was a meeting in January in the UK to discuss what these initial identifiers would look like. They could be DOIs; they could be something built on the Handle System; the community will decide. And then the indexes we're building, those tools, will key off those identifiers. An example that brings all of this together is a community that's emerged out of all of this called the Global Alliance for Genomics and Health. Just out of interest, because people are looking a bit bored: who has actually heard of the Global Alliance for Genomics and Health? A couple, okay. That tells me a lot about who I'm talking to, and maybe I'm talking at the wrong level, because you're obviously very good at some things. How do I get out of this horrible gap? You can beat me up after this.

So this is a group that about 100 or so institutions have signed on to, and they're really active in a variety of ways. They've got a series of working groups, and one of those working groups is dealing particularly with APIs. An example of how this would work: a question a researcher asks is, I want to know what variation there is at this position, chromosome 7 position X, across the human population. How do you answer such a question right now? You would go to some centralised resource, if such a thing exists, but it would only have a small fraction of the data. In the world of the commons, more of that data could actually be available from more sources, if those sources are accessible through some mechanism. That mechanism is really an application that sits out there, and one being prototyped right now will be able to do this. What it's really doing initially is testing people's willingness to share.
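The prototype he describes has the shape of a federated yes/no query: ask every participating source whether it holds a given allele at a given position, and get back only booleans. A minimal sketch with mocked sources follows; the institution names and the variant coordinates are invented, and no real endpoint or API signature is implied.

```python
def query_beacons(beacons, chromosome, position, allele):
    """Ask each participating resource the single yes/no question:
    'do you hold this allele at this position?' Each beacon is a
    callable standing in for a remote API; only booleans come back,
    so no individual-level data leaves the source."""
    return {name: beacon(chromosome, position, allele)
            for name, beacon in beacons.items()}

def make_beacon(variants):
    """Build a mock source over an in-memory set of (chrom, pos, allele)."""
    return lambda c, p, a: (c, p, a) in variants

# Invented example: two institutions, one of which holds the variant.
beacons = {
    "institute_a": make_beacon({("7", 140453136, "T")}),
    "institute_b": make_beacon(set()),
}
answers = query_beacons(beacons, "7", 140453136, "T")
```

The design choice worth noticing is that the aggregator never sees genomes, only presence/absence answers, which is what makes it a plausible first test of willingness to share.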
And the number of resources coming into this environment to be shared in this way is growing fast, so it's already becoming a powerful tool. All it does is go off to APIs into each of these different resources, APIs written by communities interested in access to this information, and it returns nothing more than the letter that happens to be present at that position in the DNA. So it's not revealing anything; there are protections around that. It's just returning a piece of information that I don't think anyone, however private they wanted their information to be, would object to having released in de-identified form. The consent model is another matter, for all of us. But that's the basic idea: we're testing the willingness to share, and we're testing the ability of communities to come together to build the tools necessary to make this work for the community. This is what it looks like, but I'm not going into the details; I'd much prefer to have a discussion.

What's also important is the idea of a business model, and the business model for this is different. How many of you here are affiliated with academic institutions? Most of you, probably. So this is where it's important for you to wake up, because this is where things could potentially be different. I would emphasise that what we're doing right now is a small test to evaluate this process, and the business model is as follows. What happens right now is that you, or folks in your institution, write a grant to the NIH and put a line item in for hardware, software, and things like that: essentially the management of data resources. If they're successful, they get that money. Then what happens? Well, I know this because I was a PI. Maybe I'll siphon off some of that money to do something else.
So I'm not actually spending the money on exactly what I said I would. Or I buy the equipment, and occasionally it's over-utilised, but a lot of the time it's just ticking over, not actually doing anything. So this is not a supply-and-demand model, and it's not necessarily cost-effective. As we move to an environment where people are using cloud resources more and more, it's easy to think about a different kind of business model, one built on credit. And I don't mean to make it sound like credit card debt. Essentially what happens is you write the grant, and instead of being given those dollars in hard cash, you're given credit in that amount. Then you can spend those dollars in whatever commons-compliant resource you like. Commons-compliant means nothing more than that the environment has agreed to share, with appropriate protections, and eventually maybe to start using the same research object identifiers and the other things I mentioned. You, as the customer, decide where to spend that credit within a commons-compliant environment. It could be that your institution is part of the commons, in which case you might choose to spend it there; it could be a public cloud; it could be some other kind of resource available to you. But you choose, and you can change at any time. Who you choose will depend on the pricing model of each of those resources, so it drives competition into the marketplace. Also, you get assigned a certain number of credits; if you don't actually use them, those credits can be reassigned to someone else, so you get this supply and demand. And of course, if you have demand for more than you originally requested, you can ask for more. I should say that the model we're using for this also enables a public-private partnership, so it enables the private sector to be part of this enterprise.
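The credit mechanics just described reduce to simple ledger accounting: credits are granted instead of dollars, can only be spent at commons-compliant providers, and unused balances flow back to the pool. A toy sketch, with invented names and numbers; this is an illustration of the idea, not NIH policy.

```python
class CreditLedger:
    """Track compute credits awarded to grantees and spent at
    commons-compliant providers (a toy model of the proposed scheme)."""

    def __init__(self, compliant_providers):
        self.compliant = set(compliant_providers)
        self.balances = {}

    def award(self, grantee, credits):
        """Grant credits in place of hard cash."""
        self.balances[grantee] = self.balances.get(grantee, 0) + credits

    def spend(self, grantee, provider, credits):
        """Spend credits, but only at a commons-compliant provider."""
        if provider not in self.compliant:
            raise ValueError(f"{provider} is not commons-compliant")
        if credits > self.balances.get(grantee, 0):
            raise ValueError("insufficient credits")
        self.balances[grantee] -= credits

    def reclaim(self, grantee):
        """Return a grantee's unused credits to the pool for reassignment."""
        unused = self.balances.get(grantee, 0)
        self.balances[grantee] = 0
        return unused

ledger = CreditLedger({"public_cloud", "campus_cluster"})
ledger.award("lab_pi", 100)
ledger.spend("lab_pi", "campus_cluster", 40)
```

Because grantees choose where to spend and unspent credit is reclaimable, pricing pressure and supply-and-demand enter a system that currently has neither.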
And so what it enables, and what we want to test and evaluate, is the idea that we can actually do more with our computing dollars than we currently do to support biomedical research, and do it in a way that is more cost-effective. We don't actually know how much we spend at the NIH on computation and data-related activities right now, because it's hard to sort out: we have obviously computational types of awards that we make, but there are also awards for which computation is just one part. So we don't know, but it's undoubtedly well over a billion dollars a year. The question is, can we do better with that money? And can we incentivise this research to be done in an environment that facilitates sharing and reuse, and reduces a lot of the time and energy that's currently wasted before we get to the real science? These are all open questions. And this is not a major undertaking at this point; it's something we're piloting and testing, trying to evaluate whether any of the things I just said turn out to be true. It's an experiment. So that's what I wanted to say about the commons.

Now I'll say a little about what we're doing in training. We're undertaking a whole series of training initiatives, with particular emphasis on minorities and underrepresented communities. We're going to be putting out a request for a workforce development centre, which will focus on collating a lot of the existing course materials around data science that are relevant to biomedicine. That's a problem right now: there's a lot of material out there. Part of the idea is to develop metadata representations that describe virtual and physical courses, so that they can be properly catalogued in such an environment. I think that would be a real facilitator. That's just an example; there are also actual courses and other things we're supporting.
We're also trying to support other types of communities. For example, yesterday and today (I was there yesterday and unfortunately couldn't go today) we supported a workshop with the gaming community. Some of the famous gamers were willing to come and describe the kinds of things they do, the kinds of algorithms they apply, the kinds of tools they use, with a view to whether we can do something that will benefit biomedical research. Reaching out to communities that we've not traditionally reached is a very interesting experiment. I was worried to death when I went up there yesterday that they would all be sitting around not communicating with each other. It was quite the opposite: they were jumping up and down and running around; there was a lot of energy in that room. Whether anything meaningful will come of it, I don't know, but it's an interesting experiment. I use that to illustrate that we're trying to think about and do some things that are a little different from the usual.

What we've also done is poll the NIH's 27 institutes and centers about the kinds of problems they have that are preventing them from fulfilling their individual missions, and then try to collate what that means. These are some of the broad areas where we feel we need to work to improve the situation. I won't go into a lot of the details; I'll put these slides online on SlideShare. But the homogenisation of large, disparate, structured and unstructured data sets, and how you integrate that information and use it together: these are obviously huge problems, but with huge potential benefits, particularly as we look at everyone from their genes all the way through to their behavioural patterns and their electronic health records. So lots of opportunities.
Visualization, modeling, looking at sparsely populated data: these are all kinds of problems that exist, and you can apply them in lots of different healthcare situations. One of these projects, relating to mobility, which I mentioned before, looks at the outcomes of surgery in children with cerebral palsy, and also at their gait physiology, to try to understand how mobility can be improved. And there are lots of other examples like that. This is a trans-NIH initiative. I report directly to Francis Collins, the director of NIH, and it covers all of those 27 institutes and centers. There are more than 100 people involved, and it goes to show that data, how we handle and manage it, and how we use it effectively, is a big part of what we hope to do going forward. That's really what I wanted to say, and I hope to get a lot of pushback, a lot of argument, a lot of questions. Thank you.