Thanks very much, that's great. This title might seem rather ambitious, implying that open data and the future of science are in some fundamental way interrelated, and that is indeed precisely what I'm going to argue. We have expended a great deal of effort and time on trying to understand how we might have open publication of science, and big data is the flavour of the month, but I would argue that in a fundamental way open data is more important than both, though of course all of them are entailed together in the same space. As Paul says, I speak as a scientist. I'm not a data manager, I'm not a data scientist, but I am a scientist who deals with a great deal of data in his own specific domain, which is glaciology.

So let's get started, and a very useful way to get started is to look through the lens of history. This fellow is Henry Oldenburg. Henry was the first secretary of the newly created Royal Society in London in the early 1660s. He was a German theologian and an inveterate correspondent: he corresponded with people we would now call scientists from all over Europe, and indeed far beyond. And he had the bright idea: wouldn't it be good, rather than keeping my correspondence private, to publish it? He persuaded the newly created society to do exactly that, so the image you see on the right-hand side of the screen is the title page of the first volume of the Philosophical Transactions of the Royal Society, which continues as a journal to the present day. But Henry made two requirements of his correspondents. First, that they should write in the vernacular and not in Latin, which might not seem a big deal now, but it certainly was in Henry's day. And secondly, and crucially important, he required that his correspondents should provide not only the concept which they wished to argue for, but also the evidence, let us say the data, behind it.
Without those two, Henry wasn't prepared to have them published. Historians of science have come to regard that move to publish openly, and to ensure that evidence and concept were tightly connected in that open publication, as the bedrock on which much of the scientific advance of the last two to three hundred years has been based. The bedrock for two reasons: principally because it permitted others to scrutinize the logic of the argument connecting evidence and concept, and also because it allowed them to replicate either the observation or the experiment, such that if they failed to do so, or could not, the presumption is that they had effectively invalidated the concept being promoted. This gives rise to what has been called the principle of self-correction in science, which means that science is very good at showing where scientific ideas are wrong.

There are two rather lovely comments from two great writers, one the essayist Arthur Koestler, the other Charles Darwin, in relation to this self-correction process. Koestler writes: the progress of science is strewn, like an ancient desert trail, with the bleached skeletons of discarded theories which once seemed to possess eternal life. How well we recognize that. And Charles Darwin: false facts are highly injurious to the progress of science, for they often long endure; but false views do little harm, as everyone takes a salutary pleasure in proving their falseness.

But Henry Oldenburg's world, and Koestler's world, and Darwin's world has changed. It has changed dramatically, and it has changed because of this, something that we all recognize: the capacity to acquire, to store, to manipulate and to instantaneously transmit data has exploded over the last two to three decades and is increasing exponentially by the year. That poses problems, really difficult problems, and it also creates opportunities.
Now I want to talk a little about the problems, then about the opportunities, and then say why open is such a crucial concept. So let me make it personal. About 30 years ago, a colleague and I published a paper in Nature in which we presented seven hard-won data points from an experiment in a glaciated area. We described the experiment in detail, along with the nature of the apparatus we used, and we evaluated the errors and the uncertainties, in such a way that others were able to replicate the observations, to amend them, to add to them and, as a consequence, to develop the concept that we proposed. It has now become quite a basic theory in glaciology.

Three years ago, I was involved as principal investigator in a major experiment on the Rutford Ice Stream in Antarctica, which you see here. We used a variety of sensors, and the data we collected amounted to, I'm not quite sure, but let's say about seven petabytes, rather than seven individual data points. Even the pages of Nature are inadequate to contain seven petabytes. So the crucial question for us, as we prepare publication, is: can we present that material and the data in such a way that others will be able to scrutinize it with the same rigour and to the same good effect as happened 30 years ago? That is not a small challenge. It is very difficult to do, and I'm sure that others will be looking at it with some interest, particularly as I'm making the claims that I am now.

The difficulty of doing this, and some of the changes that have happened in recent years, have produced some quite severe problems. This is a paper in Nature from three years ago in which an American group looked at the top 50 benchmark papers in preclinical oncology and failed to replicate the research findings in 89 per cent of cases; in only 11 per cent of cases could they replicate the results. The reasons were various. One of them was fraud, where people had invented data, but quite a number were a consequence of either the data or the metadata.
That is, the metadata, the data that permits you to use the data, was either absent or incomplete. Our argument in our Royal Society report is a fundamental principle: it must be, as it has been for much of the history of science, that the data providing the evidence for a published concept is published concurrently with that concept, together with the metadata; and that to do otherwise ought to be regarded by all, including publishers, as scientific malpractice.

This is the cover of The Economist from some two years ago, with the heading, as it says here: scientists like to think of science as self-correcting; to an alarming degree, it is not. What is happening now is that non-replicability is seeping out of the laboratory doors and becoming a matter of concern to economists and indeed to many others. If we are not careful, we shall have a crisis of confidence in science, unless we ensure that the means whereby conclusions can be replicated are present. In the example I showed you, even if those results which could not be replicated because of the absence of data were in fact correct, there was no way of demonstrating it, and therefore no way of validating them. In other words, that work has, or ought to have, the same status as myth.

But openness doesn't mean very much if all we do is dump data in some recoverable archive. We have argued for what we call intelligent openness, and who wouldn't want to be intelligent? Data must be discoverable: you have got to know that it exists. It has got to be accessible: you have got to be able to get at it. It has got to be intelligible: you have got to be able to understand it. It has got to be assessable: you need to be able to ask questions such as, does this person or this group have a particular financial interest in a particular outcome? And it has got to be usable. Only when those criteria are fulfilled is data properly open.
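Those criteria of intelligent openness lend themselves to a simple, machine-checkable checklist. The sketch below is my own illustration, not any existing standard: the record fields and the `check_openness` function are invented, and each criterion from the talk is mapped to one concrete property a dataset record might carry.

```python
# A minimal checklist for "intelligent openness". All field names are
# invented for illustration; real repositories use richer schemas.

def check_openness(record):
    criteria = {
        "discoverable": bool(record.get("persistent_id")),  # e.g. a DOI exists
        "accessible":   bool(record.get("download_url")),   # the data can be fetched
        "intelligible": bool(record.get("metadata")),       # formats, variables, units
        "assessable":   bool(record.get("provenance")),     # who made it, funding, interests
        "usable":       bool(record.get("licence")),        # reuse terms are stated
    }
    return criteria, all(criteria.values())

dataset = {
    "persistent_id": "10.1234/example",  # hypothetical identifier
    "download_url": "https://data.example.org/glacier.csv",
    "metadata": {"variables": ["velocity"], "units": ["m/yr"]},
    "provenance": {"creator": "field team", "funder": "public"},
    # no "licence" field: reuse terms unstated, so the data is not properly open
}

criteria, is_open = check_openness(dataset)
print(criteria)
print("properly open:", is_open)
```

The point of the exercise is that four out of five criteria met is still not open: a dataset with no stated reuse terms fails the test just as surely as one that cannot be found.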
But it is also crucial, of course, as many will know, that the software that manipulates the basic data, to create the data that might well be publicly accessed, is itself open, simply because different groups encoding the same equations can come up with quite different answers, and that is a serious and insidious problem. Intelligent openness must also be audience-sensitive. We are very good at providing data to our fellow scientists, together with the metadata they require to utilize it. But at the other end of the spectrum, we are very much less good at providing data to citizens. Now, why should we? The reason, of course, is that many scientific opinions, views and concepts carry implications for social and economic life and for the lives of individual citizens. And as citizens lose the habit of deference, they want to know what the evidence is for this particular reaction to an infectious disease, or what the evidence is for climate change, so that they might as citizens vote for a party that argues that one should do something about it. We don't do that very well. We neither provide the background nor the data that would permit them to be, if you like, intelligent audiences for scientific views.

But of course there are boundaries to openness. We have argued that openness should be the default position, but that there need to be legitimate, proportionate exceptions: for commercial interests, where legitimate, for privacy, for safety and for security. The crucial point, to our mind, is that all these boundaries are fuzzy; trying to define them precisely in a small number of words is not easy at all. In the commercial domain, for example, there are political pressures which might require commercial bodies to open their data, and some business models in some sectors of industry are now shifting towards a much more open approach.
The boundaries are fuzzy, and if you asked a lawyer for an opinion about where those boundaries might lie, you would probably get 10,000 pages of opinion and a lawyer who was significantly richer at the end of it.

So let's move on to the benefits. I have talked about the problems; what about the benefits? Thinking about it in a rather fundamental way, what is clearly happening is that we have had a major technological change. We have moved from the era of Johannes Gutenberg, who invented movable type, the printing press that suddenly made written material much more broadly available and at reasonable cost, to one in which the tyranny of the library, which requires a location, the tyranny of the lecture hall, which requires a location, and the tyranny of the book, which is heavy and difficult to carry around, have been broken by new technologies, and we are in the process of creating a community which is exploiting the usability of modern technologies and therefore changing the way in which it behaves. The scientific community is in many ways rather conservative, but it is now shifting, and those changes, both of technology and of approach, are permitting us to utilize data on a massive scale, crucially to integrate data from diverse sources, to analyze complexity in a way that we have not been able to do before, and of course to communicate it instantaneously.

If we think about what is really happening in the scientific understanding of the world, of the cosmos, then we are moving from a stage where science was good at analyzing uncoupled, relatively simple systems, systems such as planetary motions, in which we ignore what happens beyond the planets, to one where we are looking at highly coupled systems, where the component parts interact with each other to produce extremely complex behaviour.
We have been doing that for 30 years using the modern computer. On the left-hand side of this diagram you see a fractal generated by simulating, in this case, a six-component complex system in which the relationships between the components are governed by some quite simple equations. We have been doing that for a long time, but now, using if you like big data, or I would rather say broad data, where we are able to bring together big data from a whole variety of sources, we are able to characterize complex systems in the way I have just described; and putting together, on the one hand, the description of the system and, on the other, the capacity to forecast its evolution puts extremely powerful tools for scientific discovery in our hands.

One obvious example is modern weather forecasting. On this diagram you see an initial condition, which can be deduced from surface observations on the one hand and satellite observations showing the patterns of circulation in the atmosphere on the other. That initial condition then becomes the initial condition of a model which applies the laws of motion to the system and predicts changes in atmospheric motions and characteristics. But of course, after 24 hours, after five days, after a week, after a month, model and data diverge. What we can now do is pull the model back, once it diverges, to make it fit the data, and the model also records how much correction has been necessary for a particular atmospheric state, such that, by iterating between model and data, we can produce forecasts far more effective than they have ever been.

We can also integrate data from a whole variety of sources. At the top here you see some historical soil maps; on the left-hand side you see a number of variables held, for example, on the UN Environment Programme's website. We can integrate those together to give a much better evaluation of soil fertility than has been possible in the past.
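The iteration between model and data that underlies forecasting can be sketched as a simple "nudging" scheme. This toy example is my own illustration, not operational forecasting code: the "atmosphere" is a one-variable logistic map, the model's equations are deliberately slightly wrong, and at each step the assimilating model is relaxed towards a noisy observation of the truth.

```python
import random

def step(x, r):
    # one step of a simple nonlinear system (logistic map with growth rate r);
    # chaotic for these r values, so small errors grow quickly
    return r * x * (1.0 - x)

random.seed(1)
r_true, r_model = 3.7, 3.8          # the model's equations are slightly wrong
truth = model = assimilated = 0.4
gain = 0.7                           # nudging strength: how hard we pull towards data

free_err = assim_err = 0.0
n_steps = 200
for _ in range(n_steps):
    truth = step(truth, r_true)
    obs = truth + random.gauss(0.0, 0.01)      # noisy observation of the true state
    model = step(model, r_model)               # free-running model: drifts from truth
    assimilated = step(assimilated, r_model)
    assimilated += gain * (obs - assimilated)  # pull the model state back towards data
    free_err += abs(model - truth)
    assim_err += abs(assimilated - truth)

print(f"mean error, free-running model: {free_err / n_steps:.3f}")
print(f"mean error, with assimilation:  {assim_err / n_steps:.3f}")
```

Even with an imperfect model, repeatedly nudging the state towards observations typically keeps the forecast close to the truth, while the free-running model decorrelates, which is exactly the divergence-and-pull-back cycle described above.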
A good example: a couple of years ago Monsanto paid about a billion dollars for a company which held historic rainfall and infiltration data coupled with soil properties and their quality, and their conclusion was that this permits them to move to evaluating agricultural yields at a much higher resolution, crucially important in a world where feeding a growing population is an important priority.

Then there is the internet of things, which many people know a great deal about, probably more than me. Let's say you have one satellite, one device, that is looking at the Earth's surface, evaluating moisture levels perhaps, and it then needs something from the infrared domain, so it asks another satellite for that information. The other satellite supplies it; the first satellite then says, actually, this is the wrong frequency band, shift the frequency band; the other satellite does so and sends the data back. The result is that we get increasingly sophisticated evaluations of real-time properties of the Earth's surface, and of course there are many domains in which the same is true.

We are also able to do time-critical work, where we have impending disasters, possibly debris flows or floods as a consequence of major earthquakes, where literally minutes matter, and where having preparation that will translate the measurement of an earthquake into its downstream implications, debris flows, flooding and the like, is crucially important. There are many such domains; the tsunami is a good example. And now, after the Sendai meeting in Japan a couple of months ago, where the international community was looking at ways in which disaster monitoring and response could be enhanced, these sorts of technologies that I am illustrating here will be absorbed into those processes. So big data is good, and we must have open data to support sound publication. But, more broadly, why is sharing so important?
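The satellite-to-satellite exchange described a moment ago is, in essence, a request-and-renegotiate protocol between autonomous sensors. Before moving on, here is a toy sketch of that idea; every class name, satellite name and band name is invented for illustration and does not correspond to any real mission or standard.

```python
class Satellite:
    """A toy autonomous sensor that can answer observation requests from peers."""

    def __init__(self, name, bands):
        self.name = name
        self.bands = bands  # frequency bands this sensor can observe in

    def request(self, other, band):
        """Ask another satellite for data in a given band, renegotiating if
        the peer cannot observe in that band (the 'shift the frequency band'
        step in the talk)."""
        if band not in other.bands:
            band = other.bands[0]       # fall back to a band the peer supports
        return other.observe(band)

    def observe(self, band):
        # stub reading; a real sensor would return measured radiances
        return {"source": self.name, "band": band, "value": 0.42}

moisture_sat = Satellite("moisture-sat", bands=["L-band"])
ir_sat = Satellite("ir-sat", bands=["thermal-IR", "near-IR"])

# the moisture satellite needs infrared context and negotiates the band
reading = moisture_sat.request(ir_sat, "microwave")
print(reading)
```

The design point is that neither device needs a human in the loop: the request, the band renegotiation and the data return all happen machine-to-machine, which is what makes the real-time Earth-surface evaluations described above possible.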
Well, I bring in George Bernard Shaw, the Irish playwright and maverick if you like, for evidence: if I have an apple and you have an apple, and we exchange them, we will still have just one apple each; but if you have an idea and I have an idea, and we exchange them, then each of us will have two ideas. Essentially, what digital technology now permits us to do is to double, to treble, to quadruple those ideas through a process of sharing. It is a dynamic which I think the scientific community, the research community in general, is beginning to recognize and build on.

This is a splendid example of a whole scientific community doing these things. This is the European Molecular Biology Laboratory and its ELIXIR programme, which is now being rolled out internationally. If you look around that circuit: at the top left is the starting point, some biological phenomenon you want to measure; you sequence the DNA from the phenomenon; laboratories around the world contribute that data to the ELIXIR programme, which archives it, classifies it, shares it with other data providers, analyzes it and adds value to it, and provides tools for researchers to use it in their own ways, such that the enterprise is immensely creative. It is certainly my view that it is these bottom-up initiatives to create really powerful resources, much more powerful than in the past, which are going to be important in driving the open data initiative forward.

And of course there are recent examples of rapid, open collaboration between laboratories worldwide, in this particular case on a severe gastrointestinal infection in Hamburg, which over a relatively short period of time permitted those laboratories, by working together, to come up with a series of recommendations which were put in the hands of public health authorities worldwide, and which would have given them the opportunity of countering a highly infectious outbreak had it become more dispersed. The second example here is in
relation to the rise of antimicrobial resistance, where data sharing is absolutely crucial.

But it is also important to recognize that there are lots of ways, other than these relatively conventional ways of doing science, which can exploit new technologies in highly imaginative and valuable ways. Take Tim Gowers. Tim is a Fields Medallist in mathematics, the equivalent of a Nobel Prize, and about five years ago he put on his blog an unsolved problem which had remained unsolved for many decades, together with a series of ideas about how one might address it. About 27 people, with some 800 separate contributions, contributed ideas; these were rapidly worked through, and they claim that after a month the problem had been solved; indeed, they had solved a rather more difficult generalization of the problem than the specific one he posed. Tim's comment: it's like driving a car, whilst normal research is like pushing it.

So why don't we do more of this? The answer is a very simple one: the criteria for credit and promotion are adapted to old ways of doing things, maybe the first-author article in Nature or Science. The real problem is to analyze what has been absolutely fundamental to the scientific research process over the last one or two hundred years, and what has simply grown up as a matter of convenience and habit which inhibits the development of new technologies that would permit the scientific community to be much more creative and to have much better productivity.

So a simple conclusion that you might draw, given the way in which this machine is massacring my lovely diagrams, is that all we need to do to exploit the data deluge openly is to have processes and tools for acquisition, curation, storage, management, access, reuse and citation, all of which exist to a greater or lesser degree, and surely we should just say to researchers and their institutions: just do it, there's no problem. Well, actually, no, there is a problem; there are lots. The late Jim Gray was
an extraordinary data science guru, and this is a quote from something Jim wrote: when you see what scientists are doing day in, day out in data analysis, it's truly dreadful; we are embarrassed by our data. I have listed a series of problems here. One of them, I think, is that many of the simple statistical techniques we use in analyzing data derive from an era when we didn't have much data, and what data we had wasn't necessarily terribly good; classical statistics was developed primarily to permit us to exploit that sort of setting. What has happened is that we have inverted the whole process: the data volumes are now enormous, and we need to rethink the way in which we use statistical approaches in order that our inferences are valid. So it's no good just having beautifully managed data that is open and all the rest of it; we have also got to think quite profoundly about the processes by which we move from data to inference, and indeed it is one of those domains where we need to get our mathematical colleagues involved in a very serious way.

One of the problems in all this is exemplified here. Let's say we have a lot of Earth observation systems, and systems of systems; on the right-hand side you see the glue, with all these satellites acquiring data which we then feed into servers, and we can generate algorithms which are able to distinguish properties of the Earth, very frequently properties that have previously lain below our capacity to resolve. We see patterns there which we have not seen before. The question that arises is: how is a multi-component analysis delivered to the human brain? Typically, of course, it is through some illustrative means. In former times we had two variables plotted on a graph; how do we present a 40-component analysis? How do we comprehend it? Is it even possible? Well, actually, there is a possibility, and indeed it is being realized in some domains, but
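The point about classical statistics in a data-rich setting can be made concrete: screen enough unrelated variables at the conventional 5 per cent significance level and "discoveries" appear by chance alone. The following sketch is pure simulation, my own illustration rather than anything from the talk: an outcome and 200 candidate predictors, all noise, with no true relationship anywhere.

```python
import random

random.seed(0)
n_samples, n_variables = 30, 200
outcome = [random.gauss(0, 1) for _ in range(n_samples)]

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# 200 candidate "predictors", all pure noise, none truly related to the outcome
hits = 0
for _ in range(n_variables):
    predictor = [random.gauss(0, 1) for _ in range(n_samples)]
    # |r| > 0.36 corresponds roughly to p < 0.05 (two-tailed) for n = 30
    if abs(correlation(predictor, outcome)) > 0.36:
        hits += 1

print(f"{hits} of {n_variables} noise variables look 'significant' at p < 0.05")
```

With a 5 per cent false-positive rate per test, roughly one variable in twenty will clear the threshold by chance, which is why inference procedures designed for a handful of hard-won measurements cannot be applied unchanged to enormous data volumes.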
there is a disconnect between machine analysis and human cognition. So what might the role of the machine be in domains where machines are extremely powerful analyzers of variability? It is a sort of black box; can we look inside it? Of course, another key issue is who owns the black box: is it Google or some other ICT company, or is it in the public domain? And what does it mean to be a researcher in a data-intensive age? I have proposed answers to those questions, but time is too short; you might want to ask about them. It seems to me that we should attend to the thing which has been the agent of at least material progress during human history, which is knowledge, much of which has in the past been in the public domain, though of course there has always been a knowledge intelligentsia. What ought to concern us is the potential for knowledge to become privatized: might it be that we could conceive of a tragedy of the commons for new knowledge? That is clearly something we have to be wary of.

In a sense, one way in which these various strands of the scientific community's use of new technologies could be exemplified is by a possibility that exists now, and is actually being implemented: one where we have all the data open and online, all the publications open and online, and the two interoperating. Those of you who have looked at some of the interactive publications now appearing, and seen the active interactions there between paper and data, where data can be called up whilst you are reading the paper and then manipulated in a variety of ways, will begin to see what is happening: an environment is being created which can be a realization of the sorts of things I have been talking about. Very large questions still remain.

Now I want to talk a little about the way in which some of these big issues might be realized, or are being realized, in practice. I think the first thing we have to recognize is that although
science is an international enterprise, most of us do our science through national systems, and those national systems vary in the nature of the incentives there are for us to do science, the nature of funding, and the processes by which science is organized. So, in a sense, there needs to be a national system that supports the sorts of processes I have been talking about, which will necessarily involve major changes in institutions. But I think the national system can be characterized quite simply, in the terms I show here. At the bottom you see I put national policies and infrastructure: national policies are really about governments expressing a view about the importance of open research, open science, and indeed open government data and other sorts of data, and about ensuring that the infrastructure is in place to manipulate data. Then we need to ensure that institutional management and support are such that scientists and groups in institutions can operate effectively in an open research data environment, and that support should cover both big data analytics on the one hand, which relates to the Jim Gray quotation, and open research data on the other, the two of which interact. Then, of course, there are the scientific or research inferences we make on the basis of that analysis; we need to ensure that the process of inference is a valid one. And then there is the knowledge output. The key thing, for those of us who are paid from the public purse, is the presumption that the knowledge we are creating is knowledge which society, should it wish, ought in principle to be able to use in a productive and helpful way. Clearly it is in the interests of society as a whole that those institutions that generate data should be aware of the different audiences they are there to serve.

There are challenges in all this for institutions and individuals at all levels. At the highest level, governments need to express a policy for open research
data: that this is important, that it matters, that it will benefit them nationally and in relation to national priorities. Then there are those that fund research and those responsible for research strategy; very often they are the ones that determine the incentives for the bodies they fund. They need to accept that the cost of open data is a cost of science. You cannot say: well, which do we do, do we do science or do we have open data? I would say that open data is science, and science without open data isn't science; it's as simple as that. And they should mandate intelligent openness, so that when projects are completed the data is deposited. As for the publishers, we need them to mandate concurrent open deposition, and to recognize that it is, you might say, at least malpractice for that not to take place.

Universities and institutes have really difficult problems. I have referred to the issue of incentives and promotion criteria, which they need to think about. They need to be proactive and not merely compliant. What do I mean by that? A few years ago, when this whole issue began to take off in the UK, and governments and research councils expressed the need for a move in this direction, many universities said: what is the minimum we can do to satisfy those requirements, the minimum that won't cost us much money or disrupt our staff? What I think is now beginning to happen is that many universities, possibly even most, are beginning to recognize that the future excellence and relevance of the research done in their institutions depends on their being proactive in this domain. That is a fundamental hurdle to get over; once you are over it, in a sense the system starts to drive the priorities I have been talking about. But universities also need to think about the way they manage their data, and to have management processes able to create open data which can be utilized by their people and by others, and also to support the use of data by their own scientists. The library
function is a key issue, and many universities are struggling with it at the moment; I would argue that many of our libraries, particularly in science, are now doing the wrong things and employing the wrong people. And of course we also need to think about training, but I'll come back to that shortly. Fundamental, at the base of all this, is changing the mindset and the intent of we scientists. We need to stop thinking of this data as our data, and I have as good an excuse as anyone to say it's mine: I get cold and miserable in awful places for many months to get it. The reality is that it isn't ours. Our fellow citizens, the taxpayers, have paid for the data that we collect, and as a consequence we should regard ourselves simply as custodians of data on their behalf. The function of research is not to provide nice careers for scientists; it is to provide knowledge for society as a whole, although giving scientists a good career is an excellent way of making sure you get good people in. You have got to get the argument the right way round. And one of the big challenges, I think, is engaging citizens, which I'm not going to have time to talk about today.

This is a useful slide from Tony Hey, which many people will know; it's about skills and roles. Think about domain researchers: biologists, geologists, social scientists. Do they need to become informaticians, data scientists? No, they don't. I prefer my biologists to be able to distinguish between a lion and a tiger rather than necessarily to do complex statistical analyses. However, we need to ensure, first of all, that they are better educated and better trained in appropriate data techniques, and aware of the responsibilities that go with them. Tony suggested we could subdivide data specialists, or data scientists, into data engineers, who operate at a low level close to the data, writing code and the like; data analysts, who explore data through statistical and analytical techniques and give
strong support to scientists in doing that; data stewards, who manage, curate and preserve data; and information specialists, archivists, who are, if you like, the librarians of the post-Oldenburg world.

In the UK we have tried to respond to these things by bringing together what we call a research data forum, about 30 to 35 people representing key components of what we might think of as the UK science system: the funders, the publishers, the universities, researchers, and various other bodies, the British Library, Jisc, which many of you will know about. Its purpose is to drive practical change, not to work as a talking shop. We are shortly publishing a concordat on the principles underlying open research data in the UK, which we expect everyone to sign up to; we are driving the use of DataCite, which is an important tool, and analogous things. That is really part of the national effort that is required, sensitive to national cultures and ways of doing things.

But then there is the international scene, and at the moment there are three formal bodies operating in this domain with overlapping interests. All of them are involved in advocacy, but in detail in rather different aspects of this whole issue. The key thing for those of us who are involved, and I am involved as president of CODATA, is to ensure that we collaborate and coordinate our activities, because the resources available at an international level are still very small compared to the size of the community, and therefore the effort has got to be as efficient and as collaborative as possible.

So that has all been about open data. The legitimate question, I think, is: what is this all about anyway? Many people have referred to open science, and this would be my definition of what open science might be. I think there are three component parts: it is doing science openly; it is having open data; and it is having open access publication. On the open data side, of course, I tend to
have been talking about university research; there is also, of course, administrative data held by public authorities, there is public sector research data, from meteorological offices and the like, and there is research data of the type that many of us know and love. Those are the data, and then there is the open publication of the outputs. Outputs for whom? Well: researchers, government, business, citizens, citizen scientists. I think we feed the researcher community reasonably well, though from what I have said you will have gathered not as well as I think we ought to. Government and the public sector: we try quite hard to serve them well, but I am not sure we do a good job. Businesses set themselves up to take what data they need from the research community, and it is a question of how efficient and how aware they are. Citizens, I think, we do not serve well, and my view is that we really have to address that issue very seriously, because ultimately I would argue that this is about science being a public enterprise and not a private one conducted behind closed laboratory doors.

Why should it be public? I think there are two reasons. The first is that science has changed, and will continue to change, the world we live in: it will change our economies, and it will change many of the aspects that determine how our society works. As a consequence, in a democratic society we need to ensure that citizens too are aware of these issues and are able, in some sense, to partake in them. And at the largest level I would say this is ultimately about democracy. It is about a society which has knowledge, which has some understanding, and where elected politicians can be brought to account, because knowledge in the public domain is as great as knowledge contained within the governmental domain. If you have a system where government has a monopoly of knowledge, you have a tyranny. Thank you.