[Opening remarks unintelligible in the recording] ... that if we go over time we can let him sleep somewhere or just let him sleep gently. We'll just see how we are feeling at that time. What I'm going to do is offer some brief framing remarks to position some of the work that we are doing with Digital Science, partly in the context of Carnegie Mellon's strategic plan and partly in the context of Open Access Week. In a sense I'm almost doing the sales pitch, which I hope I get some sort of compensation for. I said to Daniel that I didn't want him to come here and do the usual company stuff of "here are our shiny products, please buy them", partly because we've bought them, at least on this side of Oakland, but partly because I really do value Digital Science not just as someone who supplies some of our infrastructure but as a partner in the research enterprise. And therefore Daniel will talk about some of the research that Digital Science released yesterday as part of Open Access Week. I think it would be two weeks ago tomorrow that President Obama was in this building talking about the future of science. And a number of his remarks in the panel discussion, which were completely unscripted, very much played into the sort of things that we are talking about this week.
The more we open up data, the more we all feel empowered. We need to give our children the skills to help them prioritise and analyse data, because it will help them to evaluate and make good decisions as children, as adults, as informed citizens. We need places where we can say that information is reliable and factual, and I think at that point I tweeted "don't forget about libraries". And we're going to have to build, in the wild wild west of everything on the internet, a proper curatorial function. So try and keep those presidential thoughts in mind over the next hour or so as we think about the scholarly record and think about open data. I'm often asked, in the context of a sentence that includes the word library and the words strategic plan, what is the future of the library? And there's no simple answer. We can all speculate, we can all guess, but to a large extent none of us really know. The one thing that I see on campus, and I'm pleased to see many of my colleagues here who represent our relationship with different disciplines at CMU, is that the portfolio of services we deliver has a very different demand curve depending on the discipline with which anyone interacts. And here I just threw together this to show I can use Excel. You can imagine a world in which one discipline very much relates to the printed collections we provide through our libraries, the reference services they obtain from their liaison librarian, and so on, whereas another discipline might have a very strong focus on the services we provide in research data management, in research evaluation and impact. And the way in which we interact with each of those disciplines is of course very different. We don't have a single sense of the library portfolio. And thinking about that observation in the context of the professional skills that we as librarians deploy begins to take us on some interesting journeys.
Traditionally when we think about a library building we think about things like this: books and journals on shelves. And that of course in many university settings is supplemented by special collections of rare books and archives. Over the last couple of decades we've seen the addition to those of either born-digital or digital-surrogate scholarly content from commercial publishers and through the open web. And also through the lens of technology we see research data, other digital products of research, and learning and teaching objects. We can begin to parse these out in different ways, and I'm building on work here developed by our colleagues at OCLC. What we see on the upper part of the screen is a fairly homogeneous sense of a library collection. Here at CMU we know that more than 70% of the books on our shelves are held in at least 100 other libraries. And if you look at the AAU universities, almost every library subscribes to pretty much the same content. We can kind of take that for granted; perhaps we shouldn't, but if we don't want to be here until tomorrow morning let's take it for granted and turn our focus to those resources that are pretty much unique to any given institution. And inevitably in the context of this talk my focus is on the products of the research enterprise. Those are things that librarians typically have not related to: they've not interacted with research data or with other observations, the software that has been created as part of the research process, and so on. But the intellectual processes of building and curating those collections that I've asked you to take as being homogeneous and everywhere the same are pretty much the same intellectual processes that are required to curate, showcase, and disseminate access to those unique or relatively scarce scholarly resources. Much of that point is also borne out in another lovely picture from OCLC, their reflection of the evolving scholarly record.
And the convenient shorthand of this model is that the library traditionally focused on acquiring and providing access to the outcomes of research, typically journal articles and scholarly monographs, and we've done that well for generations. But in the digital research world the products of the research process also become amenable to curation, to capture and to dissemination: the methods used in research, the evidence generated during the course of the research project, and the community discussion that takes place in the open, on Twitter and blogs and elsewhere, as the research is being conducted. And Daniel may observe, for example, with one of the products of Digital Science, Figshare, that there are examples of data being captured from the microscope and disseminated immediately in a Figshare repository for community engagement. All of that requires capture and curation. And then on the outcomes side of the research process we similarly have a number of artefacts that can be properly curated and looked after: the community discussion, peer review, the revision and reuse of data. So taking all of that, we can begin to think about the research workflow in the digital landscape. Nothing terribly exciting about this; it's a simple conception of the research process beginning from an idea through to grant planning, experimentation, dissemination and so on. Anyone who has had the sort of career that I've had will have spent countless hours mapping different publishers and their product offerings. This one is a little bit out of date because I know that Nature and Digital Science are no longer a closely coupled organisation. But nevertheless you get a sense here of Digital Science having populated the research life cycle with a number of products. Every academic's best friend, Elsevier, has similarly been very smart in building a suite of products and offerings that captures the life cycle.
And we see this, it's certainly not exclusive to Elsevier, but many publishers recognise that their traditional business model of publishing academic articles may not be sustainable in the long run due to open access. So let's build an ecosystem that locks authors in more tightly. Many libraries have kind of not got that far yet, and I do worry about the future of the library as a generic enterprise if we continue to focus solely on our role as the procurers of scholarly content, the disseminators of that same material, and the providers of learning spaces. In a research university we need to take a broader view of the research workflow and build our services over the next few years around that workflow. Because the reality is that as the consumption of scholarly content goes digital, the library as an organisational unit is increasingly bypassed. And if we don't understand how our services plug into the research workflow, we might find that we go the way of so many other media distributors that traditionally populated our high streets, thinking of the Borders and the Tower Records and others. So at Carnegie Mellon we have been, over the last 12 months or so, beginning to build a range of services around that. This is very much in the early stages; many of these are still being tested before wider release. But we are recognising that there is a bigger story to tell here. And that is very much borne out in the university's strategic plan for the next decade, where there is a specific reference to the concept of the 21st-century library. Over the next while we'll be having debates about what that means. But we know that by and large we're going to be focusing on enhancing the quality of the learning spaces that we provide. We're recognising that our collections are part of a global distributed network of material. That as a top-tier research university we have a scholarly record that has to be looked after, partly to satisfy the funders who make the research possible.
But partly because this university is committed to the public good that its research does and to the widest possible dissemination of its research findings. And we recognise that all of this has to be made possible through the expertise of our librarians, who bring information specialism to the research process. Another way of looking at the research life cycle: you could probably fill a presentation with 30 different research life cycle models, but this one comes from Digital Science, and I show it to depict the sort of data that is generated during the research life cycle. And I do that to flag up the Elements product that we have licensed, as have our colleagues at the University of Pittsburgh and most of the top 25 universities in this country, as a way of capturing the data about the research activities that take place on campus. I'm not going to demonstrate Elements, believe me; it's not quite there. But we know that amongst other things it will help our faculty understand the extent to which they are meeting their research funders' open access requirements. Faculty will be able to track their publications; we'll be ingesting data from Scopus, Web of Science and elsewhere to let us, or to let individual faculty, see how their publications are doing. We will be able to go into dashboards which show them citation counts and Altmetric scores; they will be able to generate CVs for grant application purposes and to generate NIH biosketches at the push of a button. But building upon Elements we will also be able to integrate with another Digital Science product, Figshare, where the data and publications generated from research will be linked to the Elements records. We'll have a simple handoff, because we're using the same researcher data feed, between the Elements record and the research products stored in the Figshare repository.
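The ingestion step described above, pulling the same faculty publications from several sources and linking them to one record, boils down to merging records that share a persistent identifier such as a DOI. The sketch below is hypothetical; the field names and the merge policy are invented for illustration and are not the actual Elements data model.

```python
# Hypothetical sketch: collapse publication records from multiple sources
# (e.g. Scopus, Web of Science) into one entry per DOI.

def normalize_doi(doi):
    """DOIs are case-insensitive and often arrive with a resolver prefix."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def merge_records(records):
    """Group records by normalized DOI, keeping the union of their fields."""
    merged = {}
    for rec in records:
        key = normalize_doi(rec["doi"])
        entry = merged.setdefault(key, {"doi": key, "sources": []})
        entry["sources"].append(rec["source"])
        for field, value in rec.items():
            if field in ("doi", "source"):
                continue
            entry.setdefault(field, value)  # first source to supply a field wins
    return list(merged.values())

records = [
    {"doi": "10.1000/xyz123", "source": "scopus", "title": "A Study"},
    {"doi": "https://doi.org/10.1000/XYZ123", "source": "wos", "citations": 12},
]
merged = merge_records(records)  # one entry combining both sources
```

The normalization step matters more than it looks: the same paper often arrives with different DOI casings and prefixes from different indexes, so matching on the raw string would silently create duplicates.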
Our Figshare repository is nowhere near ready, so these are just screen grabs to illustrate, for future reference, what it might look like. I've shamelessly purloined a couple of screen grabs from the University of Melbourne, which I think was the first institutional Figshare deployment. The university uses that as their research data repository. We will be building upon the work at Melbourne by having Figshare both as our data repository but, we hope, also as our publications repository. Watch this space; talk to David Shearer, who's sitting at the back, if you'd like to know more about the work that he is doing to make that possible. I also showed on the Elements screenshot, though I passed over it quickly, the Altmetric score. Altmetric is another Digital Science product that we are implementing to help us understand the media and public reception of our faculty research publications. Again, this comes from the same list of CMU researchers and will demonstrate to the author and to the wider world the extent to which their research is being picked up and spoken about. It's another way of looking at the quasi-impact of research; it's different from citations but offers another way of telling a story about research. All of this points also to the need for us to be tracking our research funding: not just looking at where our grant funding is coming from, but benchmarking against peer institutions, understanding how we're doing against comparable universities very close to Boston in Massachusetts, looking at areas where we are doing research and understanding who else is doing research in the same space to help identify research collaborations. We'll be having many more conversations on campus about these products over the next few months as they come close to release. My development team told me not to say anything about target release dates, because as soon as I say that, people will believe it, and they just want to be sure.
But I'm hopeful that over the next few weeks some of these services will begin to see a real place on campus. There are some notes about these in our Boundless publication, which is on the table at the back of the room. But I'm going to stop there. I've made the point already that Digital Science to us is not just a supplier or a service provider but really is seen by us as a thought leader and a research collaborator. I'm really grateful that Daniel has agreed to come and be with us this evening. He became managing director of Digital Science in July of last year, having held a number of positions in the Digital Science group before that. He is a theoretical physicist; he did his PhD at Imperial College in London and is a fellow of the Institute of Physics in the UK. So Daniel, thank you for being with us. Welcome. We're going to do a very swift change of MacBooks and I will invite you to talk with us. And then once we have heard from Daniel we are going to move into the fireside chairs and have a conversation with all of you. So please be storing up your questions. So thank you very much, Keith. That was a very generous introduction and also excellent expectation management. So if I fall asleep while giving my talk then perhaps I can be forgiven. But it's very lovely to be here this evening, or this morning, depending on what time zone I'm feeling like at the moment. But yes, Digital Science is, perhaps for its sins, academically led. And by that we mean not just that I have a PhD but that we take a fairly academic approach to how we conceive the market and how we try to work with the market. Indeed, we don't generally talk about it as the market. We generally think about it as colleagues and friends. And it's actually wonderful to work with universities and research organisations around the world. I think it's one of the few markets where one can really consider the people one works with and serves as one's friends.
And you can go out to dinner with them and have an honest conversation about all sorts of things that are happening in the industry that we all work in. The topic that I wanted to address this evening is open by default. And it's a slightly controversial title, I think. It is the title of the article that Mark Hahnel and I wrote in the report that Digital Science launched this week on people's attitudes to open access and open data. And we were deliberately slightly controversial in our comments. It was something we wanted to really bring out about the nature of academia and the nature of the system in which we all exist, the system of meritocracy if you like. But what actually is esteem in an open access context, in an open data context? Because the things that are happening to academia today are actually changing the very fabric of the structure which everybody has been used to. I always like to start these types of talks with a story that I read a long time ago and thought was very amusing: we have come some way since the time of Newton. The reason I put this particular lovely picture of Opticks on the screen is that Opticks is an excellent example of not being open with your research. Newton did all of the research for Opticks in the 1670s. It was cutting edge at the time. And in 1687, some of you may know, he published the Principia, his magnum opus at the time. And in some sense, in some very real sense, Opticks is the Principia's second volume. But there were contentious parts of the Principia. And, not a relation of mine, but a chap called Robert Hooke, who was in the Royal Society at the time, made Newton's life very difficult. He questioned, he found fault. He bogged everything down to a level that Newton found very challenging to deal with. He didn't enjoy the discourse because it was quite spiteful. And if you read the documentation of the time, there was real vitriol between the two.
In fact, when Newton made the comment that if he had seen further than others, it was because he was standing on the shoulders of giants, it was a deliberate slight at Hooke, who was a very short man. And so there's bad blood in this history. Newton did not share Opticks and publish the manuscript until 1704. 1704 is an interesting choice: it is one year after Hooke died. So this really was not an open data ideal case. And I think that that's something actually to dwell on. People don't do that these days. Let's hope not, anyway; I don't know of any recent stories of this kind of nature. And I think this highlights that we have moved through a number of ages of research. My colleague Jonathan Adams, who's an excellent researcher and who's our chief scientist at Digital Science, wrote an article a couple of years ago, which appeared in Nature, on what he called the fourth age of research. Now Jonathan likened the first age of research to the ability, or the capability, to do research on one's own: when a problem is sufficiently tractable that you can hold it in your own head, you can do it on your own and you don't really need anybody around you. This is very much the mode in which Newton worked, and people of that period. And as we move on through the ages, research becomes more collaborative and effectively becomes an institutional-level project. So perhaps you need some postdocs; you need some people around you to have a discourse. You're able to hold more of the problem in multiple people's heads and discuss it. And then you get to the scale of problem which is really national: you need a national level of funding in order to do the level of work that you need to do. And then most recently, Jonathan's argument in this article is that we're now in the era of international research.
There are so many problems that are intractable unless we can use international funding and international intellects, sets of minds across the world with specific capabilities which we wouldn't otherwise have access to. So he argues that we're going into the fourth age of research in terms of internationality. He did a bibliometric analysis showing that more than 50% of papers in certain regions now have at least one co-author from another country, so there are at least two countries on the paper. I actually have a slightly different way of positing Jonathan's categorisations, and I try, again, to be perhaps slightly controversial. I think of the individual era as almost the unregulated era. This is the era of the gentleman scientist. This is the era of the philosopher, where basically you don't need funding for your research and you can do exactly what you want. It's curiosity-driven, and in some sense many people feel this is the purest form of research, the best form of research. We then, actually much more recently, so my tie-ups don't work out quite to the same level as Jonathan's, have the era of evaluation. In the UK we're very well aware of things like the REF, or the RAE as it used to be, various government exercises which measure us. And many regions in the world have exactly that kind of ethos: Australia, New Zealand, a lot of the geographical regions that Digital Science works in are well aligned with having those kinds of government exercises. But more and more, funders operate some kind of exercise to rank or rate your research to decide whether they want to give you more money. Universities give you internal targets: they want you to publish a certain number of papers in a year, and they prefer them to be in certain journals. So we are in general in an era, or have been in an era, of evaluation. And I think, big projects and small projects alike, we have entered an era of collaboration.
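The bibliometric measure mentioned above, the share of papers with at least two countries among the author affiliations, is simple to compute once each paper is reduced to a list of author countries. This is a toy version with invented data, not the analysis from the Nature article.

```python
# Toy version of the internationality measure: fraction of papers whose
# author affiliations span at least two countries.

def international_share(papers):
    """papers: list of per-paper lists of author-affiliation country codes."""
    if not papers:
        return 0.0
    international = sum(1 for countries in papers if len(set(countries)) >= 2)
    return international / len(papers)

papers = [
    ["US", "US"],        # all authors domestic
    ["US", "GB", "US"],  # two countries -> international
    ["DE"],              # single author, domestic
    ["CN", "AU"],        # international
]
share = international_share(papers)  # 2 of 4 papers are international
```

In real bibliometrics the hard part is upstream of this function: disambiguating affiliations and assigning each author a country reliably, which is exactly where persistent institutional identifiers help.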
And so the era of collaboration, I think, came almost after the era of evaluation. I think collaboration is a good era to have been in. It promotes international ties, it promotes international travel, cultural dissemination, different ideas. I think this was a very powerful era to be working in. And unfortunately we seem to have entered the era of impact. And I think the era of impact is actually a step backwards from the era of collaboration. The era of impact is really characterised by this idea that your research really must be applied to something. And while this is laudable at a certain level, and certainly I think valuable for some areas of what we do, it does mean that we're not necessarily looking at and funding speculative research in the right way, in the same way that we once were able to in the era of collaboration. I'm actually penning an article at the moment called Impact Selfies, because I think the zeitgeist of the day is selfies and blogging and telling everybody your innermost thoughts. And at some level the research analogue of the selfie is the impact case study that we had in the UK for the last REF that we ran. In that we have almost 7,000, in some cases very deftly written, marketing pieces which tell you why a piece of research should be looked at and why it's interesting. And there is a value to this. As social media develops and the communications mechanisms around it develop, there are interesting insights to be drawn from that. So I'm not saying it's an entirely bad thing. But it does mean that impact, in the way that we currently conceive it, is quite often at odds with collaboration. Because if you're interested in impact you're interested in IP, you're interested in owning things, and you're interested in, typically, the economic effects of your research. And obviously I think the era of impact was born in the financial crisis of 2008 and 2009, which is where we find ourselves now.
And I'm hoping that, whilst we can learn something from this era, the pendulum will swing in the other direction ultimately, and this will be just another facet of what we do rather than the governing principle of what we do. So we live in a difficult era for openness. The era of collaboration was a good era for openness, and it was the era in which a lot of the open technologies that we see, DSpace and EPrints, came into being. But the era of impact is a much more challenging era in which to be open. But I think it's good to have a framework in mind when you're thinking about impactful things and open data things. And so, being a theoretical physicist, line diagrams are my forte; this is how I think of the world. So here is a kind of first-year physics student interpretation of the open data landscape. I view the open data landscape as this great void of undiscovered and unmined data: things that we will discover through experimentation, through surveying if we're in the arts and humanities, through video recordings or audio recordings, all sorts of different types of digital output that we're now collecting. And there is the wave front of discovery. This is an expanding wave front. You can tell it's expanding because I'm going to put arrows on it to make it expand. So now it's an expanding wave front into this undiscovered and unmined territory. And once you get hold of a piece of data, you don't usually just make it completely open. There are ethical issues. There are many considerations. It's a complex landscape to think about. But there is a width to that ring. At the forefront we discover the data. There's then a period of time where that data is embargoed or closed. And then at some point we make it open. That's the hope. And everything in the middle of that is curation and processing. So once we have the data, we're able to curate and process it. But a couple of things to note: once it's in that open area, so many more people can curate it.
So many more people can have access to process it in ways that we didn't imagine when we wrote the experiment initially. So the challenge that we have is: how thick should that annulus be? If you just think about that width as a time width, should it be six months? Should it be one month? Should it be instantaneous? What should it be for a particular field of research? And I think the width of that annulus is actually the central problem that we all deal with in terms of openness. One of the scariest things in this field, I think, is small data. The era of collaboration really was the era when things like CERN started existing: the Large Hadron Collider, the Hubble experiments, all of these large multinational experiments, or experiments that really took significant amounts of national funds and involved many universities, many academics, thousands of people. We see 5,000 people on a paper relatively regularly in some areas. The problems of big data were solved, or are being solved, partially by grid computing and partially by the invention of the internet. They fuelled a whole generation of technologies. Whilst I don't think we can say that big data is a solved problem, it's certainly one that is well funded and on which we've spent a lot of time thinking. The small data problem, however, is the scary one. It's the one where we don't yet really have a handle, and it's almost an unseen problem. Many academics, and I'm speaking personally, so I think I can own this one, have a small set of data. They've run some computer simulations and those simulations sit on their computer in their office or on a server inside the university's machine room. In order to back up the data, the academic uses one of these devices, a USB data key, and their data security plan is taking it home with them at the weekend, making sure that they have a copy with them at all times.
This is how theses get lost and people get under a lot of stress just before their PhD viva. Of course, things have moved on significantly in the last couple of years and people now do exactly the same thing with Dropbox. So the long-tail problem, the problem of people with small data sets, is actually at the forefront, I think, of the problems that we need to solve, simply because it's almost an unseen problem. So in order to address this, I came up with a naive manifesto of things that you should do if you're dealing with data, and I tried to keep it deliberately naive and deliberately simple. So I said: capture early. Publication is no longer just about being the final record of work. I think most people recognise that the publication itself is changing. It's been the same for 350 years, and now, as we develop more open data, as we develop different ways of visualising data or interacting with data, as we develop larger collaborations, we develop different ways of inventing milestones in the progression of a piece of work. It's no longer the static, papery thing or the PDF-y thing that we're all comfortable with. This is something I think a lot of academics are challenged by. But by capturing data at the earliest stage in the experiment, or in the survey that you're doing, or in whatever your analogue is in your research area, we make sure that we have the maximum possibility of keeping the best record possible. Capturing everything, I think, is really just a nod to the fact that storage is now cheap. We don't have to be that discriminating in what we capture. We've already got massive problems around curation; capturing more isn't going to make the problem that much worse for most people. Capturing everything: think of all those Doctor Who episodes that got lost because the BBC couldn't afford to keep the tape, because it was just too expensive, and they taped over those priceless Doctor Who episodes. There's a research equivalent.
We don't need to worry about that problem anymore. We can store, in essence, anything we want to store; curating it all is the challenge. Then share early, and I've put "wherever ethically allowable". I think this ties into the idea that academics don't go into academia for money; I think that's pretty clear. They do go into it a little bit for fame, I think. There's an aspect there where one always hopes that one's work is going to be taken seriously by colleagues and one is going to get recognition for the nice ideas, the clever write-up, the interesting thought. But I think a lot of people, and this is perhaps me being terribly naive and just the set of people I go around with, but I think a lot of people are in it for the betterment of humanity. They actually do want to make a difference at that level, and it's extremely conceited to think that I'm smart enough to have the idea, plan the experiment, do the analysis, capture the data, write the experiment up, and disseminate it correctly. There are many, many things in the value chain that Keith so eloquently talked about earlier when he was doing his circuits around the research lifecycle. No one person can do all of that anymore. That's why it's a collaborative affair. But it's also a conceit to think that just the people in your lab might be able to do all of those things. And so if we really are in a collaborative mode, shouldn't we share early? And shouldn't we allow people from around the world to do the processing of our data, because they may be much better at it than we are? We may be able to get a faster result, a better result, a more correct result, a more reproducible result by sharing the data. Now, of course, that's where the issue lies, and I'm going to talk a little bit more about that in just a second. But I think the final piece of this manifesto is actually an important one, and I think it's one that actually isn't that contentious.
Structure the data if you can, this is absolutely critical, and get persistent identifiers into your data wherever possible: if you're dealing with academics, use ORCIDs; if you're dealing with institutions, use an institutional identifier; if you're using geolocations, use something that's recognisable and understandable to others. So persistent identifiers and structured data, I think, pervade the whole stream. But the question I really wanted to ask is: are we, and should we be, open by default? And I think that's really tied up in that third point that we looked at on the last slide. It's actually the reason why we wanted to do the State of Open Data report and why we wanted to do this piece of research. Due to the collaborativeness that's required to address some of the problems that we see today, we really need to think deeply about the system of incentives within research. If you have a Fields Medal, and Tim Gowers at Cambridge does, then you feel fairly comfortable in sharing your ideas on your blog and getting everybody to work on your problems. And everybody is willing to work on your problems because you have a Fields Medal. So if you're senior and you're good, this isn't an issue for you. If you're in a highly competitive grant landscape, if you're a junior, if you're a postdoc, it's extremely difficult to share early. Nobody gets made professor, or at least I have not yet heard of a professorship being awarded to someone, for the amazing data that they've created or the database that they've put together which thousands of researchers use around the world to create interesting results. And so I think for me the kind of call to action is that institutions now have excellent tools to monitor what people are producing. They have excellent tools to understand the value of what's being produced.
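The persistent-identifier point a few sentences back can be made concrete. ORCID iDs carry a built-in check character computed with the ISO 7064 MOD 11-2 scheme, which is ORCID's published algorithm, so structured data can be validated at the point of capture rather than cleaned up later; the wrapper function itself is just our own sketch.

```python
# Validate the check character of an ORCID iD (ISO 7064 MOD 11-2, the
# published ORCID algorithm). The function wrapper is an illustrative sketch.

def orcid_is_valid(orcid):
    """Check a hyphenated 16-character ORCID iD such as 0000-0002-1825-0097."""
    chars = orcid.replace("-", "")
    if len(chars) != 16:
        return False
    total = 0
    for ch in chars[:-1]:            # first 15 characters must be digits
        if not ch.isdigit():
            return False
        total = (total + int(ch)) * 2
    remainder = total % 11
    check = (12 - remainder) % 11    # a result of 10 is written as 'X'
    expected = "X" if check == 10 else str(check)
    return chars[-1].upper() == expected

# 0000-0002-1825-0097 is the example identifier used in ORCID's documentation.
```

A check like this catches most transcription errors at ingest time, which is exactly the kind of structure that makes data usable by people outside the lab that produced it.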
And I'm looking forward to the first professor being appointed because they've created a large amount of open data, or because they've curated a database which is fantastically useful to a large number of people, or because they've done some particular data analysis in a very clever way when they were not necessarily the PI on the grant, because the data analysis people are generally not the PIs on the grant. So I think there's a real opportunity there, and that's what Mark and I highlighted in our piece in this report. There are some interesting facets to the report. We tried to be very international in what we did, and a number of pieces are written by people who are not at Digital Science, because we wanted to pull together a set of opinions. We've got pieces from Japan, pieces from Africa, pieces from America and Australia, pieces from Europe. We have pieces from the humanities as well as from the hardcore sciences. We have a lovely foreword written by Sir Nigel Shadbolt, who is the founder, or one of the founders, of the Open Data Institute in the UK. And as the centrepiece we have the results of a survey of 2,000 academics who filled out the rather hefty questionnaire that we circulated. We actually got an amazing return; 2,000 academics is a significant return. We did this in partnership with Springer Nature, whose team has for a long time maintained a really excellent panel of academics across a really broad range of subject areas. We felt strongly that it shouldn't just be people who use Figshare, because we can get very skewed results if we only talk to people who are already engaging with open data. So we specifically removed that variable and looked at a wide range of people. And I think the good news is that awareness is high: 73% of the respondents said that they were aware of open data, aware of the sources of open data, and aware of the reuse of open data.
There were ways in which they were engaging with the area. And if you look at the breakdown at discipline level, I thought it was absolutely staggering that it's actually fairly homogeneous. It's not exactly the same everywhere, but the lowest level of engagement is in materials science at 60%, which is still above half; this is a good result. And at the highest level you have 84% in the social sciences, which is probably not where people thought it was going to be. People probably also didn't anticipate that academics in the humanities would be so aware of open data issues. So I think it is fascinating to look into these data and understand where the awareness lies. Another breakdown that we did was by continent. You can see that Africa had a small return if you look at the N, but 80% of respondents there were aware. Another very significant part of the return was again at 65%. Australasia is very hot on open data; the government has instituted a large number of initiatives to try to get people to put data into cloud services that it provides at a country level, so it's unsurprising that this is at 87%. The UK is marginally ahead of Europe because it has also put quite a lot of resources into this, but if you take Europe as a whole, it's still at 76%. So as a planet, we are fairly aware of open data and fairly engaged with it. And I actually think this comes back... I thought about this because I was surprised at the results initially. We also have a breakdown, which I'm not showing in this talk, by age group, and you can see that postdocs and professors, PhD students, lecturers, readers, all the way through, are engaging at similar percentages with open data. I had anticipated it would be very centred around postdocs, because they tend to be the people working most actively with the data, but it's actually across the spectrum.
I think there's probably a slight bimodality to it, in that you probably get more of a peak around postdocs and professors, and people mid-career are probably a little bit more... they have to do their teaching, they have to tick the boxes, they don't necessarily have time to be as aware as those other two categories. But still, it's not significantly lower than the other two; it's really in a good space. What we do find is that clinicians are slightly less aware than your average core academic. But the interesting thing in the geographical split was that I think we're seeing network effects. I think we're seeing the effects of now being an extremely collaborative research society. This is no longer America taking a lead, deciding that openness is the way to go, creating that culture and it gradually ebbing out around the world. It's a number of places in a highly connected network deciding that openness is the way to go, and this is spreading incredibly quickly. Open data is moving forward an order of magnitude faster than the open access movement did. It's building on the open access movement and riding that wave crest, and it is establishing itself far more quickly. Obviously, we can't yet say with confidence how fast things are moving, and what we hope to be able to do in the future is to run similar surveys and use this one as a benchmark so that we can see progression. I think the one concerning piece of the puzzle was that understanding of different licence types is extremely fractured. 64% of academics weren't sure what licence they'd applied when they made their data open, or what licence they'd consumed data under. This was quite opaque to them. That doesn't necessarily imply that they don't know what the licences mean, although I suspect there's an element of that in play as well. But there's a clear education piece that needs to take place, where we make things a lot simpler.
If we surveyed this room, and if you were not a room full of librarians, the number of you who know what CC BY-NC-ND is would probably be minimal. These are not easily understandable names, and they're not written down in a simple way for an academic who doesn't have time to engage; this is a specialist activity. So this is clearly an area of interest to explore further. I think another area, which is actually very heartening, is what I call acceptance. The number of people who actually value a data citation is now in excess of 68%. If you count the responses that are more positive, you're looking at a significant majority of people who now value having their data cited and citing data themselves. This is, I think, a phase change compared to where we were two or three years ago, when many people wouldn't have cared about having their data cited; it wouldn't have been on their agenda at all. A couple of short years later, we are in a situation where people really value this. This is really the first step in the call to action that I made earlier: if we are going to have an esteem-based system where people get promoted for the right things, this is the first step towards making that professor for doing things with data. Finally, I just wanted to give a little bit of context to some of the reports that we do, and show you some of the research that we're working on. We have three classifications of report that we put out. This is actually our second science report; our first was done with the University of the Arctic collaboration, where we did a lot of analysis of Arctic research around the world: how it was funded, the types of publications coming out, the kind of citation attention it's getting and the kind of metric attention it's getting. Those were actually three pilot reports and working papers that we released just a couple of months ago. Then we have the white papers that we do.
These really look in detail, usually at technical matters in the space, where we try to give people a level of understanding of how we are thinking about approaching the space ourselves and how we think about the enabling technologies that make things happen. Then there are the digital research reports. These are deep dives into topical areas. Some of the most popular reports have been on things like collaboration and the diversity of research portfolios: how to foster the right mix of research in the sector so that we have the scale to address some of the interesting problems. The one I highlight here was actually the result of having a lot of open data available to me. A colleague and I at Digital Science wrote a report on why we should not vote for Brexit. As soon as we published the report, it got a lot of media attention, which was great, and it was actually noticeable in the rhetoric in the UK. Before the report, lots of people were saying, well, the European Union just takes money from us. After the report, people said, apart from science, the European Union just takes money from us. We made a real dent in the argument there, in a highly unsatisfactory way, but it does show you that there is a great deal of power in open data. If you can bring the concepts together, if you can publish something cogent supported by data, it's very difficult to argue against. I think that was a nice outcome for us. And I think that's where I'll conclude my comments for this part, and we move to the armchairs. We'll move to the armchairs. This is where I have to stay awake; we'll work on it. I was looking for the bottle of red wine, which I think would have suited the evening, but thank you very much, Daniel. Over to you. I've got some prompts, but you've listened to the two of us for about 45 minutes. Thoughts, reflections, questions, concerns? Why did the Brexit vote go the way it did when your data showed that it shouldn't? Anna? Great talks, both of you.
Thank you. My question is mostly for Daniel, about the open-by-default idea and losing a sense of curation because there is too much information for us to easily digest. Take the example you often cite of things being sent directly from lab equipment, microscopes, whatever, to Figshare. If everyone did that, there would be so much to absorb. So how do we keep the distillation quality that peer-reviewed publications offer in this kind of open-by-default world? I think it's an excellent question. The way I would put it is that there are going to be layers, different layers of data. I'm a great believer in the publication of negative results; I don't think we do it nearly enough, and I don't think we do it in a structured way. I think there need to be a lot more curation mechanisms around negative results, and a lot more sharing mechanisms. In the specific example you gave, sharing the raw data and making it available is relatively innocuous as an exercise. We're not talking about human trials; it's a physics experiment, we're using a scanning tunnelling microscope, and there's no industry partner for whom this is commercial and confidential. It's the purest form of investigative research. Making those data available as soon as is practically sensible creates a data fabric that people can query in interesting ways, if we provide them the tools to do so, and if the machine is pushing out sufficiently good quality metadata to contextualise the data that you're putting out. If we know it's such-and-such a machine, it's coming from such-and-such a lab, these were the dial settings on the machine, these were the parameters that we put into the experiment, and all of that is bundled alongside the data, then we understand the file format, we can consume the data, and we can do interesting things with those data. At some level, those data may never need to be curated at a really detailed level.
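The machine-generated context just described, the instrument, the lab, the dial settings and the run parameters bundled alongside the data, might look something like the following sketch. The `Instrument` class and every field name in it are hypothetical illustrations, not any real instrument's API:

```python
# A sketch of metadata pushed out by the machine itself: every capture
# bundles the raw data with the settings and parameters needed to
# reuse it later, so no one has to annotate by hand.
import datetime
import hashlib

class Instrument:
    def __init__(self, instrument_id, lab, settings):
        self.instrument_id = instrument_id
        self.lab = lab
        self.settings = settings  # the "dial settings" on the machine

    def capture(self, parameters, raw_data):
        """Return the raw data wrapped in self-reported provenance."""
        return {
            "instrument": self.instrument_id,
            "lab": self.lab,
            "settings": self.settings,   # recorded automatically at source
            "parameters": parameters,    # experiment inputs for this run
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "sha256": hashlib.sha256(raw_data).hexdigest(),  # integrity check
            "data": raw_data,
        }

stm = Instrument("STM-01", "Example Lab",
                 {"bias_voltage_mV": 100, "setpoint_nA": 0.5})
bundle = stm.capture({"scan_area_nm": 50}, raw_data=b"\x00\x01\x02")
print(bundle["instrument"], bundle["settings"]["bias_voltage_mV"])
```

The point of the sketch is only that the provenance travels with the data automatically; anyone who later queries the data fabric gets the context needed to interpret it.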
I'm making vast generalisations here, obviously. There are lots of experiments that would require curation in order to be fully understood. But once it gets to the journal level, you, as the writer of the explanation of the research, are going to choose a data cut that makes sense and supports your arguments, and in making that choice of data, you are curating. The research itself also has an element of curation. When I do my research, I work a lot with Mathematica and I do a lot of work with data sets. I look in specific regions that I think are going to be interesting, I find the data to support my arguments in those regions, I talk about why that region is interesting, and I talk about those data. I've made a conscious decision to curate the data that I'm going to expose in the paper, that I'm going to share, that I'm going to put out there. All I'm saying is that there could be a layer below that, where I share everything that I've done, and then I pick my points from what I've done to package for the paper. You made the point in your presentation that the storage of data is now so affordable that we should just do it and not be selective. That doesn't account for the costs of curation. I wonder whether we are still in the realm of having to think about the trade-off between curation costs and the predicted reuse of data, if you have any thoughts on that. As my last answer suggested, I don't think everything needs to be curated. I think we need to get a lot better at building our technical machinery and our methodology so that provenance data is collected at source. No lab researcher is 100% accurate all of the time at annotating all of the things that they should annotate in an experiment; that's very difficult to do consistently. Computers, generally, are very good at doing what we tell them to do.
If we program them in such a way that they capture the things that we need at source, then we will have those data available in the future. Therein lies the technical challenge, and I think it's a completely surmountable one. We need to put out guidelines on how pieces of equipment should self-record, given that we're now moving into that era. That would be a general framework which is valuable not just in the sciences, but across the research spectrum. Thanks. Alexis? Open data is a bit like herding cats. One of the things we're struggling with at the University of Colorado Boulder is buy-in. I've heard you talk a little bit about that, but it's not just buy-in from the scientists that this is a good idea; it's also buy-in from the administration, which gets to Keith's point. We have a petabyte library right now that was funded with $5 million by the NSF, and there's no buy-in for the university to do more than that. That's a concern; it's a major research institution. Even once the data is sitting in the petabyte library, you've got the issue of who's going to do the necessary curation and who's going to do the preservation. No one has articulated the roles yet, and I'm concerned that we're going to be losing ground before we gain ground. So I'm wondering, one, what's the magic pill to give to our faculty members to have them buy in to open data? And the other is, how do we get that plan of action collectively, as a community of both curators and content creators? Both excellent questions. From my perspective, it's interesting that you use the herding cats image. Academics are most interested in doing their research, and anything that stops them from doing their research is generally viewed as an impediment, because they're curious and they want to do more. So whatever mechanisms we put in place either have to assist in the doing of the research, or they have to lighten the academic's workload somewhere else.
In programming, you know, we have a thing called the principle of displacement. If one client asks for a piece of functionality, it gets a certain number of votes and we might be able to develop it. If all the clients ask for a piece of functionality while we're developing something else that fewer people have asked for, then the new piece of functionality comes in and the other piece gets voted down and delayed. The academic world is very similar: if you can free up academics' time because they're not doing so much administration, they will engage with other activities for you. So this is the trade-off, and one should typically be very careful about what administrative tasks one places on faculty; they have to be things that actually pay off for the faculty member. Specifically with open data, my observation would be that the tools are now starting to run ahead: tools like Figshare, Dryad and Community of Science are running ahead of where the framework is in the organisation. We don't necessarily have the staff in place, the support people doing the roadshow that explains what this piece of kit does. And I've already talked about the fact that we don't necessarily have the technologies in place in the pieces of equipment to gather the data in a way that doesn't impact the academic. So you're right, we are at a slightly painful point at the moment, because we're waiting for at least two other pieces of the puzzle to come into place before we've got a really streamlined structure which means people can buy in. I've been to a number of talks recently that suggested curation was going to be the next big thing. I'm not sure that it is. I think the next big thing is producing smarter pieces of equipment and technology that are able to self-report.
And I think that's now very much the missing piece of the puzzle. To replace the academics, or to replace the equipment? A little bit of both. I'd be interested to have your comment on how metrics might evolve in this area, because in my experience that's one of the big sticking points. Academics are interested in doing their research, but they're also interested in getting promotion and tenure, and it's what counts when you go up for promotion and tenure that provides the incentives. I think that until we can get some metrics related to open data that are recognised as valuable, we may not see a major shift. But another parallel and interesting development is that I have a number of new PhD students in my introductory doctoral seminar, all of whom are interested in digital humanities. And they're actually saying: wouldn't it be great if the data really were seen as an end product rather than a byproduct? Wouldn't it be great if producing a tool, doing the curation, processing the data, were themselves valued, if innovating in that area and your intellectual property in it were valued as well? So I think looking at ways to disseminate that, publicise it and measure it, to come up with measures that are meaningful in the broader context, really expanding the basket of measurements and trying to promote that, could be a positive thing. But looking at open access, and the altmetrics that are associated just with open access to papers and research in other formats, it takes a long time for that side of things to catch up with what's actually happening on the ground. No, I completely agree with you. Despite my previous title, which was Director of Research Metrics at Digital Science, I'm actually not a big believer in metrics per se.
I think I'm on record on this: I gave a talk at an ACS conference a few years ago about the problems of the h-index, and the fact that I think it's a complete abomination that someone would try to reduce my scholarly output to a single number. You know, if you want to rank me as a researcher, if you want to understand how good my research is, then I think you should read it. Now, obviously, in government exercises that's challenging to do, because you have thousands of outputs and you need some kind of proxy to work with. But in a smaller context, in a research environment such as a university where you're looking at tenure and promotion, I think there should be a rigorous period of review and people should be reading outputs. And I think "outputs" should be extended to exactly the outputs you're talking about. As you say, research data is becoming less and less a byproduct and more and more the product. And you see it in other areas as well; it's not just the research data. In the cases of things like CERN, the Hubble Telescope and the WMAP project (they're all physics examples, sorry about that), the equipment itself, the engineering feat of creating it, is significant. There's significant new knowledge being created in the construction of these devices, and I think that also needs to be recognised. You see any number of people on soft money whose careers are eked out over years and years in very precarious positions, because they are on large projects making real contributions, but not contributions that are papers. Consequently, they're not recognised, and I think that's very sad and something that needs to be addressed. Open data is in some sense the most obvious way to point out the fallacies, but it's only one facet; there are many others. Daniel, I actually have a follow-up question to this string of questions.
It goes back to what you were talking about with the increase in interest in people wanting to see their data cited. When trying to look at the impact that a researcher has beyond a single number, what do you think the future is for the research narrative, for trying to explain the impact that research has as it goes through the cycle from researcher to researcher to researcher? It's a bit out there, but I guess I think that publication in its current form will probably die a death over the next 20 years, and what you'll see it replaced by is a more continuous research reporting mechanism. I think you will see collections of write-ups of analyses, collections of data, many people contributing to what originally would have been one person's research. You will have highly collaborative, long-term, longitudinal publications, if you will, that will almost be versioned; it'll be much more like a GitHub experience, or something like that. I don't know quite how it's going to come about. But it's clear that whilst the PDF and the paper have had a great deal of resilience as the atom of research, because they're very consumable by the human mind, which is what gives them their resilience, the quality and the nature of the research we're doing is moving on. Open data actually came up in a discussion I had with Amy Brand when she was still working with us, before she went to MIT Press. We both agreed, from different perspectives, that the actual nature of research itself has changed over the last 20 to 30 years; research now has very different characteristics to what it had then. And so the mechanism of disseminating that research has to change; it has to be different in the future. I can imagine, based on today's technology, what that looks like, and obviously that's going to be wildly wrong. But I think it's not going to be this discretised, atomic, quantum thing anymore.
I think it's going to be more continuous, more elongated. I think it's going to build in and crystallise, in a more reproducible way, things that we now think of as ephemeral. If you think of a lot of areas of research, arts and humanities are wonderful examples, where people have world premieres or museum exhibitions, I think there will be ways of encapsulating those in much more solid terms than we currently do. And those may actually be the vanguards that inform what we do in the rest of the research picture. So I would actually look to the digital humanities for how that's going to be reflected in the STEM areas. So you've just made some people in the middle very happy. Ula, a final question. Thank you. Yes, my name is Ula and I'm a liaison librarian here, and I'd like to keep us on this question of incentives for open data. You mentioned the scenario of the postdocs who are reluctant to make their data open for fear of losing control over it, or of someone scooping their research, as well as the senior professor, the senior faculty member, who has not really built their career on open data or the use of data. My concern is where we are on the spectrum of getting to where open data becomes something that is fully embraced. From my understanding of the research, a great deal of the embrace of open data comes from mandates: mandates from publishers, mandates from government funders for data to be made open and accessible. And so I wrote this down while you were speaking, because I wonder if essentially the tipping point is going to be when those who embrace open data advance at a faster rate than those who do not. Where are we on that spectrum, and how do we push ourselves a little faster towards that tipping point? That would be nice, wouldn't it? That would be nice if that were actually correlated. It's a really subtle question.
It's an interesting one. I would say that mandates have certainly been responsible for the initial uptake. The analogue I would use is this: I started my PhD in 2000, a good eight to ten years after arXiv was in place, and when I started there was no question about the order of things. You did the piece of research, you wrote it up, you put it on arXiv and you submitted it to the journal. That's how it's done; there's no other way of doing it; that's what you learn. If you miss the step where you put it on arXiv, you've missed a step. In fact, in the physics journal community, when you submit to any IOP journal, you can just put in the arXiv number and they'll go and pull the paper down for you and automatically push it into their reviewing process. It's a really slick process; it saves me a lot of time as a researcher, and I can get back to other problems. Until we have really slick workflows around really obvious ways of sharing data and lodging your claim, which is kind of what Figshare tries to do today, it's going to be a drawn-out process. I think it will move quickly once you have an arXiv-like effect, where nobody really questions it: where you've got PhD students coming in and they're taught that this is the way you do it. You produce your experiment, and you will have written an interface to the API over here that allows you to publish your data at the appropriate time. It will automatically have the time delay built in, if that's what you think it needs, and your ethical approval procedure will have agreed that you're allowed to share these data on this timescale, those data on that timescale, and these other data never, because they're just too sensitive and would compromise people, or whatever it is.
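The embargo arrangement just described, some data shared immediately, some after a delay, some never, is straightforward to automate once the ethical approval is machine-readable. Here is a minimal sketch under that assumption; the dataset names are invented and no real repository API is used:

```python
# Each dataset carries the embargo its ethical approval allows:
# 0 days (share now), N days (share later), or None (never shared).
import datetime

def plan_deposits(datasets, approval_date):
    """Return (name, release date) pairs for shareable datasets only."""
    plan = []
    for name, embargo_days in datasets:
        if embargo_days is None:  # too sensitive: never deposited
            continue
        release = approval_date + datetime.timedelta(days=embargo_days)
        plan.append((name, release.isoformat()))
    return plan

datasets = [
    ("calibration_runs", 0),        # innocuous: share immediately
    ("trial_measurements", 365),    # share after a one-year delay
    ("participant_records", None),  # never shared, per ethical approval
]
print(plan_deposits(datasets, datetime.date(2016, 10, 25)))
# [('calibration_runs', '2016-10-25'), ('trial_measurements', '2017-10-25')]
```

With the schedule computed, the actual upload at each release date is the part an arXiv-like deposit API would handle, which is exactly the slick workflow being argued for.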
I think the tipping point comes when that's really clearly understood and it's something taught at PhD level, when you're starting to learn about the research process, because in teaching it to the PhD student, the person doing the teaching has to believe it. My supervisor certainly did. Like I said, there was no question of doing it any other way; that was simply how we did it. Given the requirements to get onto arXiv and to be allowed to publish there, it's actually one of the things every PhD student should go through: making sure that when they finish their PhD, they are set up with the right to publish on arXiv, so that they're a recognised person who publishes on arXiv, with the backing of their supervisor and another person to say, yes, this is someone who engages in the research community. That should be part of the process. So when you publish on Dryad, Community of Science, Figshare, whatever your favourite mechanism is, built into that should be an understanding of how to get onto it, how to use it responsibly, what kind of data is appropriate, and how you should be engaging with it. That needs to be part of the PhD. So, I'm conscious of time and conscious of a crowd outside, so I'm going to wrap us up with one final question and encourage a brief response, because it's perhaps an unfair question: in all of this landscape, where do you think we will be in five years' time? Will open data and these norms that you've articulated be in place? I think if we're lucky, yes. If we are fortunate, within five years we will have seen our first professor appointed for sharing their data and for the things they do with their data. We will see not all of the community, but sub-communities, engaging in the way that high energy physics does with arXiv. And if we see that at any scale, then we can regard ourselves as somewhat successful.
It's very interesting that it's only really this year, the latter half of this year, that preprints have all of a sudden become a hot topic and a number of publishers are looking at them. bioRxiv has suddenly started to take off. It has taken tremendous time to get traction in that area, whereas RePEc and arXiv have been around for a long time now, making good strides in their areas. Biology has been somewhat intransigent about adopting a preprint server, but this year we are starting to see it move. So I think it is a tipping-point effect. In five years I think we'll see communities starting to work this way, but I don't know if we will have reached the big tipping point yet. So thank you very much, ladies and gentlemen. Please join me in thanking Daniel, and please grab some refreshments.