So, we are starting this afternoon session with three presentations. We begin with Science as an Open Enterprise from Professor Geoffrey Boulton. I kindly ask all the presenters to introduce themselves and their role at the beginning; you have half an hour, or a little less.

Well, good afternoon. My name is Geoffrey Boulton. I'm from the University of Edinburgh. I'm a geologist and I play with big data. I also chair the science policy centre of the Royal Society, which is the UK's national science academy, and I chaired the production of the report referred to here, which was called Science as an Open Enterprise. And with luck I might even be able to use this device.

It's important to make a very clear distinction. This morning we talked primarily about open access publishing; this is about open data. Open access publishing is an important objective, but if we fail to change anything there, science will nevertheless continue. With open data, by contrast, unless we grapple with its problems and challenges, the progress of science will be seriously inhibited. The two are, however, very intimately linked, if you think about it, and I'll suggest later on ways in which that linkage might best be seen.

But it's useful to start with a little history. The gentleman on the top left is Henry Oldenburg, a German theologian who happened to be the first secretary of the newly created Royal Society in London in the early 1660s. Oldenburg was an inveterate correspondent. He corresponded with what we would now call scientists all over Europe and had a remarkable collection of letters setting out their ideas and the things they had found. And he thought: wouldn't it be a good idea, rather than keeping these private, to publish them?
And he persuaded the new society to publish its Philosophical Transactions, which was the first, and indeed the longest-lived (it is still published today), scientific journal in the world. Many historians of science regard the advent of such journals as absolutely crucial in underpinning the scientific revolution of the 18th and 19th centuries. But Oldenburg required two things. First, the letters he published must be in the vernacular and not in Latin, in which, of course, almost everything previously published in what we call science had been written. And secondly, and absolutely crucially, he required not only that scientists publish their opinions, but that they publish the data, the information, the evidence on which those opinions were based; and the two must be published together, in the same issue of the journal.

That principle is as important now as it was then, because, in a sense, what it did was establish what many of us believe is the process of scientific self-correction: science corrects itself as long as you make available the evidence, which others can scrutinize, potentially recapitulate, and criticize. And he made this splendid statement in the first issue of the journal, encouraging those who would write to him to find out new things, to impart their knowledge to one another and contribute what they can to the grand design of improving natural knowledge, to the universal good of mankind.

And, of course, the large question is the one at the bottom: how do we do that in what is not becoming, but has already become, a post-Gutenberg era, where massive digital acquisition has largely replaced the printing press for much of science? And what we're confronted with is this marvellous picture from Delacroix showing a great storm of data.
And it's important to recognize that this vast storm of data, the like of which we have never seen, not only provides us with opportunities but also poses major problems for the very principle that Henry Oldenburg enunciated some 350 years ago. One of the problems with all this arriving data is the rate at which it is increasing: it has been calculated that if we find no new technologies, then at the current rate of data production, within a decade the whole Earth's electricity supply would be required just to cool the computers. So there is a problem of sustainability, and indeed a problem of choice.

So what are these challenges, problems and opportunities? The first is what I've called here closing the concept-data gap: maintaining self-correction. About 35 years ago I published a paper in Nature with seven hard-won data points in it. We published the data, we estimated the potential errors and uncertainties, and we gave full details of the experiment, such that others could scrutinize the data, replicate it where they could in new experiments, add to it, and thereby evolve the concepts we had developed. About two years ago we did an analogous experiment in Antarctica, but this time with not seven data points but about seven petabytes of data, and I suspect even the generosity of Nature would not extend to including seven petabytes. The problem we have is: how do we make that data available so that others can scrutinize what we have done, in the way that our work of 35 years ago could be scrutinized? The answer is that it is extremely difficult to do, and I'll talk at some length about why that is so. Indeed, there are many of us who believe that science is currently sleepwalking into a major crisis, and it's a crisis of replicability.
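The practice described above, publishing data together with the uncertainties, experimental details and integrity information that others need to scrutinize it, can be sketched as a small bundling step. This is a minimal illustration only; all field names are hypothetical and the measurement values are invented, not taken from any real dataset or standard.

```python
import hashlib
import json

def publish(measurements, metadata):
    """Bundle measurements with the metadata needed to reuse them,
    plus an integrity checksum so a reader can verify the data later."""
    payload = json.dumps(measurements, sort_keys=True).encode("utf-8")
    return {
        "data": measurements,
        "metadata": metadata,  # units, instrument, processing: the "how"
        "sha256": hashlib.sha256(payload).hexdigest(),
    }

# Invented example: a few hard-won data points with estimated uncertainties.
measurements = [
    {"x": 1.0, "y": 4.2, "y_err": 0.3},
    {"x": 2.0, "y": 5.1, "y_err": 0.4},
    {"x": 3.0, "y": 6.3, "y_err": 0.2},
]
metadata = {
    "units": {"x": "m", "y": "kPa"},
    "instrument": "hypothetical strain gauge, model unspecified",
    "processing": "raw voltages converted with calibration version 2",
}
record = publish(measurements, metadata)
print(json.dumps(record, indent=2))
```

The point of the sketch is that the metadata and checksum travel in the same record as the numbers, so a later reader can both reinterpret and verify them, which is precisely what the irreproducible papers discussed next failed to allow.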
Early last year there were papers, one published in Nature and one in Science, which took the top 50 benchmark papers of the last decade in preclinical oncology, a crucial area of medical research, and concluded that no more than 11% of those papers were replicable. The reason most of them were not was, firstly, that they did not include the data, or did not refer to a source for it; or, even where they did, the metadata (the data that permits you to understand how to use the data) was absent, and the details of the equipment and apparatus were absent. The ultimate consequence was that replication was extremely difficult, and in all but a very few cases actually impossible.

Now, that is an absolutely fundamental issue. If we cannot get at, scrutinize and reuse the data that underpins published work, then frankly the published work is no better than myth. And the view of many of us, adding to the comment made this morning about Peter Medawar, is that too much science is now published in a way that ought to be quite unacceptable. The two, the data and evidence and the concept, must be published together. So that is the problem: we have somehow got to close that gap.

The challenge, the opportunity, is I think a very obvious one: how do we exploit this data deluge in ways that are productive? One of the ways we do that, of course, is by sharing data, which has become a crucial issue. It is interesting that the fields of science in which data sharing is most advanced and most productive are new fields; older sciences find it more difficult, bogged down and weighed down as they are by their traditions. Bioinformatics is a classic case: if you talk to a young student in bioinformatics, it is utterly self-evident to them that sharing data is more valuable, even to them individually, than keeping their own data and hugging it to their chest.
And the consequence, of course, is that schemes like this one, the ELIXIR scheme for sharing bioinformatics data through a central hub and a series of national hubs, underpinned by computing tools, standards and training (particularly for young scientists), are what is needed to exploit this opportunity. We should be looking at areas like this, which are very much bottom-up-driven ideas about how science can advance most rapidly in key areas.

One other important point to make is that there have been many examples in recent years where openness has proven to be an immensely efficient way of working. Early in 2011, you might remember, there was an outbreak of a very rare Shiga-toxin strain of E. coli in northern Germany, in Hamburg; but openness and collaboration across many scientific labs around the world meant that within three months the toxin strain responsible for the human damage was isolated, and the techniques were embedded in regional public health responses. So open science is not only crucial to the promotion of science itself; if it is sufficiently open, it can be immensely efficient in cutting down the timescale of discovery.

Of course, there are now enormous numbers of databases available in most areas of science. This is an example from the life sciences.
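The hub-and-national-hubs arrangement just described can be sketched in a few lines: a central hub keeps only an index of which national node holds which dataset, and resolves queries to the right node. Everything here (node names, dataset identifiers, record shapes) is invented for illustration and is not ELIXIR's actual interface.

```python
# Hypothetical national nodes, each holding its own datasets locally.
NODES = {
    "node-uk": {"ecoli-2011": "outbreak isolate sequences"},
    "node-de": {"proteome-42": "protein frequency tables"},
}

# The central hub keeps only an index from dataset id to holding node.
HUB_INDEX = {
    dataset_id: node
    for node, datasets in NODES.items()
    for dataset_id in datasets
}

def resolve(dataset_id):
    """Ask the hub which node holds the dataset, then fetch its record."""
    node = HUB_INDEX.get(dataset_id)
    if node is None:
        return None  # unknown dataset
    return {"node": node, "record": NODES[node][dataset_id]}

print(resolve("ecoli-2011"))
```

The design choice the sketch illustrates is that the data itself stays with the national nodes; only a lightweight index is centralized, which is what lets such schemes grow bottom-up, node by node.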
These are a series of life science databases, but these other colours show other databases of a cognate character: some of them are geographical, GIS-based databases, some are population databases, some are environmental data. They all have several things in common: they deal with issues that interact across all of those domains. And of course the key question is: can we link them together? In this area the answer is yes, we can. Can we link them intelligently? The answer is: not yet. We are all used to putting a series of keywords into the Google website, but what Google cannot do is respond to an intelligent question. An intelligent question might be: what is the role of pogo sticks in necromancy, and tell me how it works. Google cannot answer that question. But it is being worked on, and if we get to the point where such questions can be answered, we can exploit these linked databases in ways that have hitherto been quite impossible, and it will give great depth to the diversity and integrity of scientific understanding.

One of the other things that is crucial is that data needs to be dynamic. It is a public perception, I hope not shared by you, that data is somehow static: once you have got it, that's it. Well, it is not true. Most of the raw data that we get is electronic: voltages, amperages, electronic information. What we then do is use algorithms and other correlations to translate that into numbers: heartbeats per second, frequencies of particular proteins, and so on. And the algorithms and theories that we use to relate the basic information, which is often electronic, to real information about real phenomena change through time. We learn more, and therefore we need to change the data. What we need to do, of course, is ensure this data is dynamic; about three quarters of it isn't. It is dead data: it cannot change, because the computational mechanisms that would permit it to be updated simply are not there. That is something we can work on. And the economic implications are
enormous. This is a publication from the US about 18 months ago, and here are just a few of its numbers. 600 dollars would buy you a disk on which you could put all the music ever created by the human species. 250 billion dollars is the estimated annual benefit of using data well to the European public sector (not the private sector: public-sector administration), which is more than the GDP of Greece. And 300 billion dollars is the amount estimated to accrue to the US health system, both private and public, if it were able to use data in the most cost-efficient and effective way. So there are big payoffs even for those who care only about financial returns.

But it is crucial to say that it is not just about curating, retrieving and integrating data; it is also about what we do with it. Jim Gray is probably the big data guru, and Jim is quite a guy. When you read this it is really rather distressing: when you go and look at what scientists are doing day in and day out in terms of data analysis, it is truly dreadful; we are embarrassed by our data. And it is true: neither in our training, nor our education, nor to a large degree in our practice have we come to terms with the data world; we are still working in my world of seven data points. What we tend to do is look for patterns in the data: I have a theory, I go to a very large database, and if I try hard enough I can find a distribution that will fit my theory. That is a fundamental error. The proper question is: what are the inherent patterns in this data? And the patterns that are inherently there are different from the patterns you might find if you look hard enough. We really need to think about how we do this much more seriously, and it needs to be integrated into our training to a much greater degree.

We also partially report data. The classic examples are in clinical trials: about 75 to 80% of published trials report positive results, while trials with negative results are rarely published, or rarely have been published. The consequence, of course, is that the relationship between cause and effect for particular medical interventions is distorted. Our view, very simply, is that this is scientific malpractice and it should be banned. But equally we would say that the non-publication of the data together with the concept in the journal is equally scientific malpractice. And who is guilty of it? The answer is that the publishers are guilty, whether they are private publishers or learned societies, and we are guilty of conniving with them as reviewers and editors.

One of the other things we do not do very well is use the right sort of logic. It is all a sort of experimentally based inferential logic, and we need to think in Bayesian terms to a much greater degree; those of you who do not know what Bayesian logic is, ask the person next to you.

And of course the other key issue that has arisen dramatically in recent years is that of fraud. This is a headline from the Guardian, which is a serious, sensible, well-trusted British newspaper: "science is broken", they had in their headline, "it's time to stand up for good science". The examples are numerous and they are growing more frequent. And what is the cause? The cause, according to the Guardian, is the rewards and pressures that promote extreme behaviour and normalize malpractice; in other words, you will do anything to get a good paper that will get you into Nature or Science and earn good citations. So what are the cures? Obviously, telling young scientists that personal integrity matters is important, but sometimes it is not strong enough to offset that pressure. A key issue, though, is whether the system has integrity, and systemic integrity, we would say, requires that the data on which an idea is based must be open for others to scrutinize and replicate. And indeed I
think I would go further and say that peer review needs to be open too. Think of the way in which we are addressing planetary challenges at the moment, challenges such as climate change and major infections, which are enormously important for our societies and our fellow citizens. They are important because, if we decide that climate change is a significant issue, then the costs to a national exchequer and to personal finances of making major changes are considerable. It is really no longer acceptable that this is done on the say-so of a few of we scientists, who say: well, this is what is happening, this is what you, government, should do. Our fellow citizens want to know, and should be given the means of knowing; and the question is, how do we give them the means of knowing if the data that underpins these ideas is inaccessible to them? In that sense this is a reflection of citizens' demand for evidence. My view is that we scientists are going to have to stop thinking of ourselves, as we sometimes do in our worst moments, as a priestly caste, and remember that actually we are like the guy down the road who mends your car tyres: we fulfil a function in society. We are just functionaries. We think we are important, and maybe we are, but at base we are just members of society.

And of course the other key development of recent years is the growth of what has been called citizen science, whereby amateur scientists, who may not have had any training in research or in a particular discipline, are becoming involved in major, formal and serious research programmes, simply because the professionals have found there is great value in it. The astronomers, some of the protein chemists, environmentalists and many others have created programmes that have real scientific value but also embrace our fellow citizens. And in my view the 2030 question is: can we imagine what the development of social media, and of the
interaction that has occurred as a consequence of the availability of instantaneous electronic transmission, might mean by 2030 for the business of doing science? It might have stopped being "science" and just be regarded as the common property of the human species trying to understand itself and the world it lives in; and one fervently hopes that might happen.

Here is a marvellous example. This is Tim Gowers, a Fields Medallist in mathematics, the equivalent of a Nobel Prize in maths. About four years ago he put on his blog a mathematical problem that had long gone unsolved, together with some ideas about how it might be solved, and he simply asked anyone reading his blog whether they had any contributions to make. After about 32 days, 27 people had made really substantive contributions, more than 800 contributions in all. Those contributions were rapidly developed, or they were discarded as not being appropriate; one of the most crucial actually came from a secondary school mathematics teacher from Oregon. And Tim reckoned that after 32 days they had solved not only the special problem but actually a rather more profound generalization of it. His comment was: it's like driving a car, whilst normal research is like pushing it. And the question is, why don't we do more of this? The answer, I think, is very simple: the criteria for credit and promotion prevent us from doing things like this, which are much more difficult to measure, because the result is not a paper in Nature or wherever you like. And the last reason why this domain is crucial is that the more data relating to us is in the hands of the state, even the benign states that currently flourish in Europe, the more important the balance between personal freedom and state control becomes; a balance that demands, as Voltaire once said (actually, I have forgotten exactly what he said, but it boils down to this), perpetual vigilance.

But there is a problem, and the big problem is that open data of itself actually has no value. The only way in which it has value is if it can be communicated, and we call this intelligent openness; and who would not wish to be intelligent? That is data satisfying these four criteria. Firstly, it should be accessible: you should be able to find it. Secondly, it should be intelligible: you should be able to understand it. Thirdly, it should be assessable: who is this person, what qualifications do they have, are they expert in the field or not? And fourthly, it should be reusable: you should be able to use it again. Our view is that only when those four criteria are fulfilled is data properly open.

But it is also important that this, which is the metadata, the data about data, must be audience-sensitive. If, let's say in the domain of climate change, I am making data available so that my fellow citizens can critically evaluate the evidence on which scientific assertions are based, then it has to be presented in a very different fashion from the data I might make available to my colleagues, and the amount of work involved is enormous. If that were required of all of us, let's say of all who are in receipt of European Community funding, science would stop tomorrow, because we would all be doing the very difficult task of making our data publicly intelligible. And actually one of the important questions to ask of Neelie Kroes is what she means when she says data has got to be open, because frankly, if it all has to be open in a form that fellow citizens can use, then, as I say, science would stop tomorrow. There is a real problem there.

And the other, final point, which is important to say to politicians, who frequently misunderstand this: scientific data rarely fits into an Excel spreadsheet. Most of it won't and can't. And yet if you look at the regulations that frequently go through our parliaments, they say "machine-readable data", and in their minds, I think, they have Excel; and it just isn't true. Which data, and for what purpose? These
are the sort of purposes that I think are rather crucial. Time is getting short, so I will just pick out one or two. This is the dilemma of choice, and the dilemma is this: if we required all the data that is generated, or even all the data that supports publication, to be accessible, then frankly our systems would silt up tomorrow. Somehow we have to choose what to curate and what not to curate; and the question is, how do you know that in 20 years' time you won't have thrown away something valuable? The answer, of course, is that you don't. At the same time, in universities in particular, we face contradictory injunctions: today we get the injunction to share, collaborate and disseminate, and equally there are injunctions to commercialize, which in too many minds means guarding your IP and keeping it close to your chest; that is actually a mistake on their part.

But there are boundaries to openness, and for us these are the three. Commercial interests are a boundary, and I will talk a little about that in a moment. Privacy is a boundary: it is quite obvious that the public benefit from the use of health-system data could be enormous, but at the same time mathematicians have demonstrated that complete anonymization of data cannot be achieved; it just cannot be done. In other words, there will always be a tangible, finite risk of individual health data becoming accessible to others, and somehow we have to use that scarce resource, judgment, to determine where the balance should lie. And safety and security: there has been a great fuss in the last year, a very interesting fuss, about the potential use of scientific information for terror, and the discussion of it has been very interesting. The key thing is that all these boundaries are fuzzy; they are not very tight.

Don't worry about the detail here. These are a series of industry sectors, and the colours indicate the extent to which the business model in those sectors benefits from open data, and the areas where it depends entirely on commercial confidentiality; you can see the pattern is a complex one.

This is what I call the data management ecology, and it runs from individual collections of data, the laboratory bench on which you or I might collect data, through the institution, the university or institute, to national and international data centres. It is interesting to note that many of these big international databases have arisen from the laboratory bench: someone had a clever idea about how to collate and use data, their colleagues agreed (what a good idea, let's do it all together), it became national, it became international; so very often you find that databases rise up the system like this. It has been suggested that the total sum of little-science data, which is what you or I might have on our laboratory bench, probably exceeds the total sum of big-science data, meaning CERN and similar enterprises. What is equally clear is that there is massive data loss at the level of the individual and the individual institution.

I have got a minute and a half, I think. These are the views of young researchers; let me pick out one or two. These are the people who are going to do tomorrow's science, and frankly their views differ from those of us who are as ancient and antiquated as I am. A common view is that data is not a private preserve; it is a public resource. The evidence and the concept must be published together. Well, you can read them: science data should be as easy to remix as music is to a DJ; and all of these are really quite important. The crucial one is at the bottom, and we will come back to it: the cost of intelligent openness is an integral part of the cost of doing science. It is not a question of either paying or doing science; they are the same thing, and you cannot separate the two.

These are some essential enabling tools, where research is currently going on and where we need more research; and in the trials that the
commission will undertake in the next two or three years, it is crucial that they fund research in these domains at the same time. If we think of the actors in research and what they should be doing, the key issues are these. The publishers, the publishing domain, are utterly crucial: freeing up text and data mining matters, and in my view the way in which publishers restrict text mining is a tangible obstruction to the doing of science. The key question for employers, the universities, is what responsibility they have for the knowledge and the data that their institutions collate; I will move over that. These are international, European and national efforts. My argument is that the bottom-up drive is very powerful and very important, and what is crucial for those who take top-down views is that they should not decide to do things in ways that will be inflexible, given the indeterminate direction of that bottom-up drive and the other things that will tell us how to do science in ten years' time. This is where the rector from Liège and I would differ fundamentally: I would say you can crush the bottom-up if you are not careful, and by doing that you crush the future.

What should our realizable aspiration be? I think it is pretty simple, and it comes back to the open data and open access issue. A realizable aspiration, I think, is that all the scientific literature is online, all the data is online, and the two can interoperate; and of course that is what you are here for. And just to quote a phrase that we use in relation to political devolution in Scotland: it is not an event, it is a process. We are not going to stop doing this; we are not going to "achieve" it; it is going to continue. The important thing now is to realize that we have been doing nothing about this for the past, I don't know, several decades, and we had better catch up with the rest of the history of the human species. Thanks very much.

We have some time for a question now. Who wants to comment on the presentation that you have heard?

[Question inaudible]

No,
it's not the re-use of data, it's the integration of data. I don't know when you were last in a good nightclub, but the DJ will move seamlessly from one recording to another; and the problem at the moment with these great accumulations of databases, this population of databases, is that they are hard-edged. Mixing them so that they can effectively talk to each other is potentially doable, but at the moment we are not putting enough effort into doing it, and the benefits of doing it are enormous. For example, think of my specialist field: suppose I want to know everything that is known about the acoustic impedance of sediments. In principle I could find that out by doing a bit of text mining, which of course I am not permitted to do; and what's more, if we could go a little further and have this remixing process, I could correlate, in creative ways, data from this publication and that publication and produce a different sort of synthesis. I think what is happening, partly because of the financial mechanisms we have and partly because we do not yet know how to do it, is that rather than being able to exploit the great three-dimensional depth of scientific knowledge that is out there, though we do not have it in our heads, we are limited to the few things we can remember and the few things that come up in a Google search. I think there is gold in those hills, and we should be going after it.

[Question from the audience about the business case for open data]

Well, I think "the business case" is something one should be rather wary of. I take the view that one of the values of universities to society, which they themselves have lost a sense of, is that of the ivory tower: the ivory tower is actually a rather important contribution of the university to society, the thinking of unthinkable thoughts. I would be wary about the business case, because politicians and civil servants, and business people, will interpret it in a particular way. I think you should pose it rather differently, and say that it is in the interests of us as citizens and societies to have as much knowledge at our fingertips as we are able to get, because arguably, I think unarguably, the progress of the human species is dependent on knowledge. The large question is how we maximise our capacity to understand things in the general case. We have found out an awful lot in the past, but we have tended to find things out in bits and pieces, here and there. Arguably, you could say that the invention of the scientific disciplines in the last 100 years or so was a necessary step: the cosmos was too complicated for us to understand as a whole, so we invented physics and chemistry and the rest, and those were the nuts and bolts. But putting the motor car back together again, we now have these big global challenges that depend not on one bit of knowledge but on lots of them, and the issue of integration is an absolutely fundamental one. So I would say the business case is that open knowledge is a fundamental driver of, if you like, human ecology. It is interesting that when we talk about economics we think of it as something that bankers and treasuries do; but an ecologist speaking of the economy of a coniferous woodland means the total ecology, how it lives and works, and the human species has an economy which is much bigger than the monetary economy. I think we just have to step back; if we cannot be idealistic, then nobody else will be. So I apologize for going on.

[Comment from the audience] I just want to make a comment. You spoke about intelligent openness, and I completely agree with you that sharing does not simply mean retrieving information; it means also being able to understand that information and to use it. Now, at least my experience is that
once scientists share data, this is not enough if others do not have the software tools for analyzing that data. So I was wondering whether this is part of the big vision: the European Commission, for example, is putting a lot of emphasis on sharing data, but it is not yet saying much about sharing software.

Yes. I am not sure it is a question of sharing software so much as of the software that links to the data. The crucial issue is: could another scientist, particularly in my field, in the public domain, reuse my data? If the answer is "only by using some proprietary software to which they cannot have access", then the answer is no, they could not; in other words, it is not open. What slightly concerns me about the Commission's approach, and at one level I applaud it, is that you have got to understand what the underlying problems and issues are; and although the technical staff in the Commission involved in this do, it is not clear to me yet that the Commission as a whole has grasped this particular issue of intelligent openness, which I would say is absolutely vital. For example, governments worldwide have gone for open access policies for government data, freedom of information policies and the like, and they say sunlight is the best disinfectant, believing this will diminish public suspicion of government. Of course that does not happen; they are wrong in believing it. The classic response to a freedom of information request is that a civil servant dumps a vast amount of data or letters on your desk which you cannot use; in other words, it is a completely wasted effort most of the time. We should not fall for that rather silly trick; we should do it so that the things we do are genuinely scrutinizable by other people.

Yes, definitely, that was my point.

So, thanks a lot again, and I think that we can move on. Thanks.