[Welsh-language introduction; transcription garbled.] I'm going to be talking about data publication specifically. Now, one of the downsides of presenting so late on during the programme is the fact that an awful lot of what I'm going to say has already been covered by other people over the past day, day and a half. But that's good, because it means that I can skip over the bits that I'd need to explain, because you've heard it all already, and I can concentrate on the more advanced, interesting bits. As Najwa said, I work for the British Atmospheric Data Centre. It's one of a federation of data centres funded by the UK's Natural Environment Research Council, and we cover all sorts of environmental data: ecology, hydrology, atmospheric science, polar data, earth observation, oceanographic data, and so on, so there's an awful lot of stuff we deal with. So I'm going to start off, again preaching to the choir here: why do we want to link data and publications? Well, when it comes down to it, the whole process of doing science is about reproducibility and testing our assertions, and if we don't have the data we can't do that. The internet is brilliant, it can do all sorts of really cool things, like provide pictures of cats, but it also allows us to link things to other things quickly and easily.
Now this is a bit of a two-edged sword, because when you're talking science you want to have this object of record, this fixed thing which is permanent, that you can identify and use as the basis of your analyses or draw your conclusions from. So when it comes to applying that to data, we still have quite serious problems with issues like data persistence, data and metadata quality, and of course attribution and credit for the data producers, and as I said, if you've ever created something you're proud of, you know what a data producer feels like. Historically speaking, our journals have always published data, it's just back in the days... well, we've got the example there, never mind. So there we've got a picture: Robert Hooke, 1665, drawing cells, and then from the scientific papers of William Parsons, 3rd Earl of Rosse, around 1850, that is measurements of stars in the spiral nebula. So that is data, and it's in a hard-copy, dead-tree format. Actually, these aren't, because I stole them off Google, but never mind that. But now we're in the situation that we're generating so much data that we just can't afford to print it out on dead trees and bind it up as books; it just doesn't work. A lot of people are creating data, we're only going to get more of it, there's lots of talk about data floods and data deluge stuff. We've got to think of ways that we can start coping with this. We either sink or swim, so I think it's about time we started building some boats. So let's not reinvent the wheel here: we already have a working method, a strong historical precedent, for linking between publications, between one thing and another. And it's commonly used and commonly understood by the research community, it's already used to create metrics to show how much of an impact something has, and it can be applied to digital objects. So we've got this system, it works, people use it. Let's take that and build on that, and extend citation to other things like data and code and multimedia.
And the best bit is, it's just tweaking researchers' perceptions slightly. They don't have to learn an entirely new way of doing things, it's just extending what they already do. So I won't go into the reasons for citing and publishing data, I've been through these before. Let's just take it as read that citing and publishing data is a good thing for as many people as you can possibly think of. From a strictly data-centre-specific point of view, we want people to cite and publish data because, if they do, there's more incentive for the researchers who produce the data sets to share their data in trusted repositories, in appropriate formats and with full metadata. And that's really important for us, which is why we're involved. So my affiliation in the programme is given as BADC / PREPARDE, and PREPARDE is the project that I'm managing at the moment. It stands for Peer REview for Publication and Accreditation of Research Data in the Earth sciences, and yes, it's a twisted acronym, but let's not worry about that too much. We're about seven months into the project. We have a range of partners; we're led by the University of Leicester and Jonathan Tedds. And in our range of partners we have librarians, the California Digital Library; we have academic publishers, Wiley-Blackwell and the Faculty of 1000; we have data centres, the BADC and the US National Center for Atmospheric Research. And of course we have the Digital Curation Centre, who are providing support and, in general, kind of an overview. And we're not just looking at the earth sciences; the earth sciences is our primary focus, but Faculty of 1000 deals with life and biomedical science research, so we're keeping an eye on that. So we've only got 12 months to do this project, we haven't got an awful lot of money, so we're trying to do the best we can with what we've got. We're looking at very pragmatic ways.
One of the main things that sparked off this project, and it's a really good opportunity that we've got here: Wiley-Blackwell launched Geoscience Data Journal last April, in collaboration with the Royal Meteorological Society. It's an open access journal, and it publishes short data articles which are linked to, and cite, data sets that have been deposited in approved data centres and given a DOI or other permanent identifier. So what's a data article? Ryan's mentioned this before: it's a paper of a couple of pages or so of information about a data set, giving details of its collection, processing, software, file formats, etc. The data article describes the when, how and why the data was collected and what the data product is. But what it doesn't do is go into the analysis of the data set, and it doesn't require any novel conclusions to be drawn. Because I know from my own experience that when you spend months of your life collecting data and making sure that it's fit for publication, you just don't have the time to start running cumulative distribution functions over it or applying the latest statistical technique. Often it's easier to just pass that on to somebody who does the analysis. So, the traditional journal model: you have your author, they write a paper, they submit it to the journal, the reviewer over here reads the paper, provides comments, whatever. Question is, where's the data in all of this? It's somewhere off in the corner, or hidden on a CD in a desk drawer somewhere. Not ideal. So, a data journal: the author writes the data paper and submits it to the journal. And at the same time, or hopefully just before they submit the data paper to the journal, they submit the data set to a repository. And the data article itself has a link within it to the data set in the trusted repository.
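That data-journal workflow can be sketched in a few lines of code. This is purely an illustrative model, not any real journal or repository system; the class names, the example DOI, and the titles are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """A data set deposited in a trusted repository, with a permanent identifier."""
    doi: str
    repository: str
    title: str


@dataclass
class DataPaper:
    """A short data article; it cites the data set rather than analysing it."""
    title: str
    authors: list
    dataset_doi: str = ""  # the crosslink to the deposited data set


def submit_data_paper(paper: DataPaper, dataset: DatasetRecord) -> DataPaper:
    # The data set must already be deposited (and its DOI minted)
    # before the data paper is submitted, so the article can link to it.
    if not dataset.doi:
        raise ValueError("deposit the data set and mint a DOI first")
    paper.dataset_doi = dataset.doi
    return paper


dataset = DatasetRecord(doi="10.5285/example", repository="BADC",
                        title="Example atmospheric data set")
paper = DataPaper(title="A data article describing the example data set",
                  authors=["A. Researcher"])
submit_data_paper(paper, dataset)
print(paper.dataset_doi)  # the reviewer follows this link to review the data too
```

The key design point is the ordering: the repository deposit happens first, so the article always carries a resolvable identifier rather than a promise of one.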
And then the reviewer can review the article, and also can look at the data set itself and review the data set as well. So here is a data paper mock-up. It's not a real thing yet, because GDJ doesn't actually have any published data papers yet, although we are so close. I've been told within the week. Anyway, the important thing to note in this mock-up is that the data citation is right at the top of the paper. So underneath the title and author name and all the rest of it, the first thing you see, before even the abstract, is the citation for the data set. And you also get the citation for the data set in the reference list, which allows it to be picked up by those automatic systems that do citation counts. So PREPARDE is looking at a number of topics, well, three main ones. We're looking at workflows and cross-linking between the journal and repository. We're looking at repository accreditation: how do you know you can trust a data centre? And scientific peer review of data. And I'm going to mainly talk about workflows and cross-linking, because that fits in nicely with this interoperability work. I'm happy to talk about the other topics some other time if you'd like. So, to start off with, we went and captured all our data repository workflows. And we discovered very quickly that the data centre and journal workflows are very varied. We don't have a one-size-fits-all method. Even within the BADC, we have multiple workflows depending on what level of interaction we have with the data sources: whether we have an engaged submitter who's willing to answer questions about the data set and provide extra metadata, or whether we have somebody who just says, "here, take my data set, no, I don't want to answer any questions about it ever again". So there's lots of differences. Now, I expect you're not able to read that workflow down the bottom. Don't worry about it. It's just to give you an indication of how much is actually involved.
Similarly, this is the workflow for NCAR. And there are similarities, but there are also an awful lot of differences. And I suspect that if you went and looked at the workflows from any data repository you could mention, it's going to be really, really hard to standardise. If you want these slides, I'm happy to make them available after the fact. So here we go, journal workflow. Again, there are certain threads that are common through various journal workflows, but there are always wrinkles, there are always idiosyncrasies. So we're still working on the comparisons of the workflows and the identification of the sweet spots, the best places to put the crosslinks between the journal workflows and the repository workflows. And the aim is to minimise the effort needed to submit a data paper by taking advantage of already-submitted metadata. If you think about it, as a researcher you've already filled out this huge long metadata form when you deposit your data in a repository; you don't want to have to do exactly the same thing all over again simply to submit it to a data journal. And also, if you have to do it twice, there's more chance that typos and other errors will creep in. So, crosslinking. This is what we have to do for PREPARDE. Like I said, we have to be pragmatic. We don't have a lot of time. We don't have a lot of money. We have to demonstrate crosslinking between Geoscience Data Journal and the BADC. Great, we can do that. That's easy. The problem is, as soon as you start adding other data centres and other data journals into the mix, you have a plethora, a vast wodge of links, all of which are potentially ever so slightly, subtly different. And this direct crosslinking isn't scalable. We need to have off-the-shelf solutions that can work across multiple research domains.
So the ideal situation is if we have a registry in the middle, where the data centres can link with the registry and the data journals can also link with the data centres through it. And the registry could provide other functions as well as being an intermediary. It could do data centre certification, certifying that data centres are trustworthy. It could administer linking mechanisms. It could provide certain metrics functions. Now, the disadvantage is, if you have a single registry, you've got a single point of failure, and also it's really hard to standardise across different research domains. And this last point I added after seeing all the presentations yesterday, and I'm thinking I need to talk a bit more with people: could OpenAIRE be this registry? And if it could, brilliant. So, the question is, do we have a start here already? So DataCite, Herbert already mentioned this, and somebody else did yesterday whom I've forgotten: they have a standardised set of bibliographic metadata that has to be submitted before a DOI for a data set can be minted by a repository. So we've got this standardised metadata up here. That gets deposited in the DataCite metadata store, which is made openly accessible. Given a DOI, a journal can then easily find the DataCite metadata. DataCite also have the content resolver. So the journal can pull the DOI-specific metadata from the DataCite metadata store and into the article. That's easy. What we don't have is the return link, where the journal can let the repository know that the data set has been cited, whether that happens via the DataCite metadata store or in a direct way. I think OpenAIRE has got the potential to do that. So, just for completeness' sake, the DataCite metadata schema: it's a good schema. It's very general.
It only has five mandatory properties and another 12 optional properties, but it has to have that level of generality, because when you're dealing with everything from art history to zoology, you've got to be careful: you've got to have stuff that's as widely applicable as possible. For contrast, and again I'm not expecting you to be able to read any of this, this is MOLES, version 3.4. This is the metadata scheme that we use at the BADC. It's a bit complicated, right? But the important thing here is that that's all the metadata we collect for our systems, and only a very small subset of that metadata scheme is actually going to be needed for the data journal. So it's mapping between the two. And I suspect a lot of other repositories are in a similar situation. Right. So what we, as PREPARDE, are going to do: we already have a link from the BADC through DataCite to GDJ. And GDJ can also pull the standard DOI metadata from the DataCite metadata store. We've already shown that works. What we need to do is fix that return link, so that GDJ can inform the BADC, or NCAR, or whatever repository it is, that their data set has been cited and published. And we've got to be careful about how we do this, because of the scaling issues that I mentioned earlier. We're going to be pragmatic. We might have to start with a manual workaround, where we just fire off an email to a centralised email address at the repository. But we've got to keep an eye on what we can do to make it more scalable in the future. And if we can take advantage of work other people are doing, then that is definitely the way to go. So, tell us what you think. We're having a workshop on cross-linking between data centres and publishers, planned for May at Rutherford Appleton Laboratory, which is quite close to Oxford in the UK. And we've got a workshop on the peer review of data happening in March at the British Library. We love getting input from people. Come and talk to us.
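Coming back to the DataCite schema for a moment: because the five mandatory properties (Identifier, Creator, Title, Publisher, PublicationYear) are always present, a journal pulling a record from the metadata store can both validate it and build a citation from it. The sketch below is illustrative only: the record contents, the DOI, and the `cite` format are hypothetical, not DataCite's own API.

```python
# The five mandatory properties in the DataCite metadata schema.
MANDATORY = ("identifier", "creator", "title", "publisher", "publicationYear")


def missing_properties(record: dict) -> list:
    """Return any mandatory DataCite properties absent from a metadata record."""
    return [prop for prop in MANDATORY if not record.get(prop)]


def cite(record: dict) -> str:
    """Build a simple data citation string from the mandatory fields alone."""
    return (f"{record['creator']} ({record['publicationYear']}): "
            f"{record['title']}. {record['publisher']}. "
            f"doi:{record['identifier']}")


record = {
    "identifier": "10.5285/example",  # hypothetical DOI
    "creator": "Researcher, A.",
    "title": "Example atmospheric data set",
    "publisher": "British Atmospheric Data Centre",
    "publicationYear": "2013",
}
assert missing_properties(record) == []
print(cite(record))
```

This is exactly the kind of mapping the talk describes: a journal only needs this small, guaranteed subset, however rich the repository's internal scheme (like MOLES) happens to be.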
Drop us a line, drop me an email. Go and have a look at our project website and our blog. Talk to us; if you think there's something we should know, tell us about it. A key part of the PREPARDE project is stakeholder engagement. We want to produce something that will work. It might be a tall order, but that's what we want to do. So that's me. That's my email address. That's my Twitter handle. And there is a picture of a cute kitten, because the internet is made of cats, after all. So thank you very much. Thank you very much.