I'm a pipette-in-hand lab scientist, but outside of work I have a great interest in how openness can change the way that we do science. So I guess I would probably class myself as one of the young people that Geoffrey referred to in his talk, and many of the views expressed on that slide chimed with me. I'll go on and briefly explain what the Open Knowledge Foundation does. The title of my talk is Why Open Data Means Better Science; however, I think Geoffrey did a fantastic job of explaining many of the points I was going to cover, so we'll frame it as reinforcement when I repeat points he's already made, but I will try to put a slightly different spin on things. So the Open Knowledge Foundation is a global movement to open up knowledge in all areas, not only science. It's an organisation that works largely with government data; that's where it all started. But now we have community-based volunteer working groups for everything from open economics (Velichika, our coordinator of open economics, is here today) to open humanities and transport data, a whole bunch of arenas in which openness, and particularly access to open data, can really change the way in which we do things. One thing that is perhaps of interest to the data repository people here is that the Open Knowledge Foundation runs open source data portal software called CKAN, which forms the basis of many open government data publication projects, including the UK's, and it was recently announced that the US will now be using CKAN for its open government data. However, it's also of use for academic data, and for those of you who are around tomorrow for the more technical interoperability workshops, I'll be around if you have any questions about CKAN and how it might be used in an academic context. As for the Open Science Working Group, I've been coordinating this group for just over two years.
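[Editor's note: the CKAN software mentioned here exposes a JSON "Action API" for searching a catalogue's datasets. A minimal sketch, assuming the standard package_search endpoint and response envelope; the portal URL (demo.ckan.org) and the query term are illustrative only.]

```python
import json
import urllib.parse


def package_search_url(portal, query, rows=5):
    """Build a CKAN Action API URL searching for datasets matching `query`."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Extract dataset titles from a package_search JSON response body."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("CKAN API call reported failure")
    return [pkg["title"] for pkg in payload["result"]["results"]]


# Example URL one might fetch with any HTTP client:
url = package_search_url("https://demo.ckan.org", "transcriptome")
```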
We founded it in 2008 as the Open Data in Science Working Group, and I think it goes to show how much the breadth of open science has grown, and how excited people have got about the open science movement, that we recently decided to change our name to the Open Science Working Group to represent our broad membership of people interested in all areas of open research: not just the data, but obviously open access, citizen science, and ways of doing research transparently. This is the definition that we use to describe open science: scientific knowledge in all its forms, from methodology, how we go about the research process, through to the research outputs, be it code, data or publications, that people are free to use, reuse and distribute without legal, technological or social restrictions. We're a community-based group, which means that we're essentially a network of people around the globe who are interested in this area. We currently have around 450 people on our main list from all across the world, and it's growing rapidly as the open science movement expands, particularly now that in the last couple of years we've seen major reports coming out from the Royal Society, the European Commission and the UK government, and it's becoming a little more mainstream. We do projects to produce guidelines. One of the very first projects the working group embarked on was the Panton Principles for Open Data in Science, which were a very clear statement, aimed at researchers, about how data should be made open in terms of licensing. The basis of this was that science itself is built on reusing and openly criticising the published body of scientific knowledge, and for that to happen, and for society to reap the full benefits, it's crucial that data is made open. So the Panton Principles are a set of four basic principles.
The main message is that if you want to make your data available for other people to reuse, you should very clearly indicate under what terms you're making it available. It's still quite difficult sometimes to actually work out how open a data set is. So we recommended very explicit licensing, and in fact that all scientific data should be dedicated to the public domain, possibly using the CC0 waiver. Creative Commons have a zero licence which places no requirements on the user of that data, although there are others of an equivalent nature. The idea here is that you remove as many restrictions as possible on reuse, so that anyone can take that data and reuse it. For open access, the recommended licence is CC BY, and I'm sure many of you will be aware that the only difference there is a requirement for attribution. By recommending CC0 we're not saying that you should not attribute data that you use, but rather that attribution falls within a scientific community norm: if you don't cite or in some way acknowledge data that you've used, that's bad science, and you'll be penalised by your community for plagiarising or for not making use of data in an acceptable way. That said, it raises interesting questions, when people are feeding many, many data sets into a project, as to how you actually go about fulfilling that community norm of citation and acknowledgement, and I'm sure that's something we'll be discussing in presentations later on. So this is pretty much the status quo at the moment for the availability of data; as a bench scientist I've experienced this frustration myself. Data availability varies massively between disciplines. If you've got something like genomic data, where the community already has quite large global databases and there's a cultural norm of putting your data in that one place, then yes, it's relatively easy to find.
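[Editor's note: the explicit-licensing recommendation here can be made machine-readable. A minimal sketch assuming the Data Package metadata convention promoted by the Open Knowledge Foundation; the dataset name, title and file path are invented for illustration.]

```python
import json

# Hypothetical minimal metadata declaring a CC0 dedication for a data set,
# following the Data Package convention (a datapackage.json file).
datapackage = {
    "name": "example-expression-data",  # invented dataset name
    "title": "Example gene expression measurements",
    "licenses": [
        {
            "name": "CC0-1.0",
            "title": "Creative Commons Zero v1.0 Universal",
            "path": "https://creativecommons.org/publicdomain/zero/1.0/",
        }
    ],
    "resources": [{"name": "expression", "path": "expression.csv"}],
}

# Serialise the descriptor; in practice this would be saved as
# datapackage.json alongside the data file it describes.
text = json.dumps(datapackage, indent=2)
```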
You can use it, you know that it's licensed properly, you can just take it and do what you like with it, and attribute it as you might any other data set. But there are also lots of disparate databases and data sets, often sitting on individual servers. So if you want to acquire a data set to verify a publication and look in more depth, it often requires a series of individual requests to the data holders. You might think, well, okay, that's just sending off some emails. Obviously if we're doing a major project involving many, many data sets that's going to be difficult, but for your average lab scientist surely it can't be that hard. Well, actually there are quite a few publications in the literature demonstrating that it is relatively difficult to get hold of data. I should point out that I don't know of any publications where the authors have come out with a very positive conclusion about data sharing; that may be publication bias, but I really haven't cherry-picked deliberately. In all of the publications on the top row there, the researchers wrote to or emailed authors to get the data sets backing up the assertions those authors had made publicly in the scientific literature, with very little success. The empirical study of data sharing by authors in PLoS journals and the empirical investigation of data sharing in medical research each contacted authors; I think the PLoS study contacted ten authors and received one data set back, even after repeated attempts. Many received outright refusals, despite the fact that authors usually sign to say they will make their data available when they publish in some of these journals. So it clearly is a problem, and the status quo is not what it should be. An interesting article, for anyone who hasn't seen it, is in the psychology field: these researchers wrote to people who had contributed articles to a psychology journal over a certain period of time.
They re-analysed that data using different statistical methods, and they actually found that willingness to share the research data was lower the closer the statistical significance was to 0.05. Many of those data sets, they found, could be pushed into the non-significant region with a bit of tweaking of the statistics. You can interpret from that what you will, but they also suggest that maybe, if you're aware that your data is not of the best quality, or that you've analysed it in a way which perhaps slightly pushes it towards one end of the scale (and we heard a quote in the previous presentation suggesting that data analysis among scientists can be quite poor), then that might affect your willingness to share. Okay, so our vision is to have data shared openly at the point of publication, to back up any claims that you make in the scientific literature. It's fantastic if it comes out before that, and indeed for many genomics projects bits of data sets will come off the sequencing machines and go straight onto the GenBank databases, which is brilliant, but for most scientists the point of publication is the point at which they will make their data available. So why does open data mean better science?
As I say, many of these points we've already heard, but verification is one: you can actually drill down into the details. Particularly in fields with quite complicated statistical analyses, it might be interesting to look at whether the data was actually suitable for that statistical analysis, whether the authors did what you would have done, and really dig down quite a lot. Some of you may have heard of a study at Duke University on choosing chemotherapy based on genomic sequencing. This led to a long battle by two other statisticians, who eventually got the data, redid the analysis in a way that they thought was better, and found that this choice of treatments was not in fact effective. They fought very hard to block the clinical trials that were going on there, and eventually succeeded, although interestingly the real contention actually arose after it was independently discovered that the lead researcher had lied on his CV. That was the spark point, not the fact that statisticians had been trying to publish papers based on a different statistical analysis of the data, and actually struggling to do so; I'll come back to that point later. Scientific publication as it stands is not really designed for people who want to verify or replicate studies, and it can be difficult to publish those kinds of papers. Reproducing or replicating studies is also an area where open data can lead to more rigour in science, simply by checking that work is in fact reproducible; we've already heard that in many cases papers are not reproducible. One area where there's particular work on this is big data science, and also computing: can you actually get computer simulations to give the same results when run by a separate group? Later on we'll talk about what you might need on top of the data to actually achieve reproducibility and replication. So, we talked about re-use of data, and I'll just
briefly give a couple of examples of that, because I think we've already had quite a few. Re-use of data can involve anything from a meta-analysis onwards. We talked briefly about clinical trials and how many of those have a positive publication bias, but the gold standard of medical evidence is a systematic review of all publications on a particular treatment. This takes into account the fact that there is variation, that you are going to get some negatives and some positives, but if you gather together enough studies you'll get an overall impression of whether the treatment is actually effective. Getting the data, and having access to the data itself rather than just the published result, is essential for those kinds of analyses, which can not only give a better impression of the scientific consensus but also lead to new scientific discoveries. One impediment to re-use is that data is often hidden away in publications, and so content mining, as already mentioned, is a growing area that enables us to drag in this data from various places. When I say hidden, that can mean that although the data has been published with the paper, it's, for instance, in a PDF format where you can't actually read it programmatically, so it's of no use for re-use; in a second we'll talk about why just being open doesn't necessarily mean that you're going to improve science. Again, we mentioned efficiency: enabling different groups to do parallel analyses of data, so you don't have to wait for one group to release their data set before another group can work on a different aspect of it. And citizen science as well: if you have a massive data set, a good way in some areas to analyse it is to release it to many, many minds, and many, many eyes will hopefully make light work of the analysis. However, open data doesn't necessarily mean better science all the time. So firstly, is openness enough? Well, open data is really a means to an end. I think this was quite clearly stated in Geoffrey's talk as well: just making
stuff open is not sufficient to make it useful and to actually enable all of the things we've just talked about. Firstly, location and discoverability are very important. You may have openly licensed your data, but if it's sitting on your own website and it's not linked properly to any publications or any major databases where people might be looking for it, then it can be quite hard to find. I've had this myself, where I didn't know for several months that an important transcriptome (important to me, anyway) had been released, because it was not put in the place where I would expect it to be, which is the NCBI or EMBL databases, and the paper it was published in did not make it obvious that it was part of the data they produced. So location and discoverability are key, as are standards and formats. I just mentioned PDFs: reproducing your data as a table in a PDF is a common problem in open government data, where governments will say we've released the data, but actually all you've got are pages and pages of PDF tables, which essentially need to be hand-transcribed or they're of no use. Standards, particularly for metadata, so how to describe the data and actually make it interoperable, often come from the community, and they're also important, although there is an argument that just getting the stuff out there in as simple a format as possible is the best way, and then we can work on the rest later. Still, the point is that when you release data it should be released with reuse in mind, and that's a cultural shift from what people are used to doing, which in some fields is essentially releasing data because they have to, rather than because they actually want people to reuse it. In the same way, putting your paper under an open access licence doesn't necessarily mean that you've written the paper in a way that someone else could build on; you may not have written the method in enough detail for someone else to
reproduce it. So getting this idea of making your work reproducible and available for reuse at the core of everything, from experimental design all the way through to documentation, is quite key. Data alone is not enough, and again Geoffrey referred to this in the last talk: if you want to reuse the data and reproduce the work, the code or software used is also important, as is infrastructure for how the data is stored and linked, as is training. In the corner you can see Lucy, who is a community coordinator at the Open Knowledge Foundation. She's been doing a lot of work with data journalists, or rather with journalists to turn them into data journalists, and actually using data for analysis, especially large data sets, does involve specific skills which often just are not taught at the graduate level or the early-career science level. Introducing training to meet these new challenges is really important. From my own experience, a lot of graduate training just pretends the internet never happened: we're taught how to write a paper in the old style, we're taught how to make a presentation, we're taught how to have good discussions with our supervisor, but there's very little along the lines of retrieving, using and publishing information in the digital age. And data alone is not enough in the sense that, in many cases, it needs to be linked to other data to be of any reuse value: how do you integrate it into an ecosystem of data, and more generally into the digital commons, all the other forms of knowledge and data, be that a paper, or perhaps government data if you're interested in interrelating scientific data sets with policy, or geographic information? All of this data is floating around, and the question is how you make it come together into one common ecosystem. Okay, so how do we move forward? Well, there are several boundaries to moving forward with making open data actually improve
science. Many of these are cultural, so there needs to be a lot of work on community building within disciplines, to enable people to work out not only how they want to release data, but also how they might reuse it. In fact, for things like content mining, researchers often just aren't aware of the tools that are available and what they could do with that research methodology, so making the possibilities clear is quite important. Some communities have already done this very well: in genomics, for instance, bioinformaticians routinely release data openly, and Tim Gowers, as has been mentioned, has done a great job in bringing together mathematicians to fight for open access and to promote open collaboration on projects. So it's really about finding more community champions in different areas to stand up and talk about these things, and to derive their own standards and their own ways of doing things that can then be linked with other areas. But infrastructure is important too; we heard a little about what the European Commission is planning on that, and I'm sure there will be more talk about it tomorrow. It needs to be made as easy as possible, once researchers have created a data set, for them to push it out there, and standards are also important for linking within that data infrastructure. So open data has the potential to improve both the quality and the pace of research, and I think the question is not really why open data means better science, because it clearly does, but how we define what good openness is and how we actually go about achieving it. I think there's still quite a lot of advocacy needed in certain areas to promote this idea that openness and sharing is good for you, because a major barrier is the lack of incentives in many areas to actually put your data set out there when you could hold on to it and get another publication. I've heard a lot of comments like that, especially from people where, for instance, the
reproducibility criterion doesn't apply. For ecologists, it's very unlikely that someone would go and reproduce their experiment, so they would say, well, why should I release all of my data when that's not actually going to happen? But there are other re-analyses that can be done, and linking together of data sets, and it's a shame not to release data now, just because there currently isn't an obvious way it could be used, without thinking forward to a future where we might want to link that data with another data set. So I would conclude by saying that to move forward we definitely need to get communities together to talk about this, and encourage the kind of bottom-up approach that Geoffrey was suggesting. We can do this through organisations like learned societies, and I know the EU is working very hard to bring people together in working groups to talk about this, rather than it coming just from an institutional mandate, though I think mandates definitely do have a role to play, particularly at the beginning, to get the stuff out there in the first place and actually show some of the possibilities. So if anyone's got any questions, that's me. Thank you.