Thank you very much for inviting me to give this presentation. I'm very pleased to be part of this series on data citation, which has already produced some wonderful presentations and some great web content, so it's very nice to be added to that. I'm going to spend a little time at the beginning explaining about the Data Archive and its collection, to give you a bit of an overview, and then I'll move on to the data citation methodology that we've chosen and implemented at the Data Archive.

We've been based at the University of Essex for quite a long time now, over 40 years, selecting, ingesting and dealing with a variety of social science data sources; I'll explain a little about where they come from in a minute. We've been designated by The National Archives as a place of deposit, which means we can hold public records. A lot of the data that we bring in, and the support services that we provide, support research, learning and teaching in higher and further education. So we can really support any data user, but our primary audience is higher and further education, although that also includes many other researchers who want to use social science data. More recently, last year, we went through a long process that enabled us to gain the information security standard ISO 27001, which means we can hold disclosive data. That involved a lot of auditing and a lot of work, but it means we're a trusted digital repository in that sense.

Right, so here's just a little picture of where we are, to give you a flavour. We have a very nice green campus, with buildings that were built in the 1960s, and some of the towers
you can see are actually listed buildings. They're so ugly that they're listed. Our campus is modelled on an Italian village called San Gimignano; it just doesn't quite look Italian or medieval, but it's a nice place to be. And we have a brand new building, built about three years ago, dedicated to the Data Archive and to the social research centre that conducts our panel studies in the UK; that's our nice building on the right there. So that gives you a flavour of what we look like.

This is our Data Archive website; you can go and explore that later. And this particular diagram shows that we're part of a bigger international federation of social science data archives around the world, with one in Australia, down there in Canberra, and one in America, at the University of Michigan, who we work closely with. Most of Western Europe has these data archives, and we're part of this bigger international network. We work very closely together on metadata standards, solutions, training materials, those kinds of things. So we have a family of services.
We were running a number of different services, which you can see listed in grey on the left. From 1 October we have a brand new five-year grant, called the UK Data Service, which brings all of these data services together under one roof. We're currently being rebranded, so we don't quite know what our entity looks like at the moment, but it will wrap in all of these support services, supporting all kinds of social science data in the UK. We also have a number of research and development projects: some on metadata, some on repository systems, one on providing access to geodata, others on preservation and preservation standards. So we bring in quite a lot of grants to do forward-looking research.

As an organisation we have, in the last year, adopted the OAIS functional model, which means our divisions are based around its sections: ingest, data management, storage, security, access and preservation. We have two slight adaptations, in that we have an area of descriptive information here, and an area there that is supposed to say 'pre-ingest'. So we have a metadata department that sits on its own, and we have an Access and Data Support unit, where a lot of work goes on to provide support and access for users. It's nice to follow a functional model, and it seems to work very well.

So where do our data come from? A lot come from central government agencies such as the Office for National Statistics, the Home Office and the Department of Health, which give us important government series, some dating back to 1960: for example, what was called the Family Expenditure Survey, combined with a food survey, from the 1960s.
So we have series of data going all the way back to the 1960s, and it's a very important time series collection. We also provide access to statistical time series, some of the macro data banks from international government organisations like the World Bank and the IMF. And we gain data that comes from research projects: when an academic is funded by the Economic and Social Research Council, as part of their grant they have to offer data. So there's a stream of research data coming in, which can be anything from surveys to qualitative data, sets of interviews, psychology experiments; many vastly different kinds of data sources, actually. We collect data from market research, such as opinion polls, and we have some historical sources, historical databases and records going back to the 14th century. And of course we facilitate access to data held in all the other data archives around Europe, so if somebody wants data from them, we can help.

Just to give you a picture of what we do: there are about 6,000 collections, 6,000 datasets, in our whole collection. We add about 230 a year. We've got about 20,000 or so active registered users, and downloads of something like 60,000 datasets a year. We have a lot of activity in our user support unit.
So that's the kind of size that we're looking at. On the other side, we've also done quite a lot to try and promote data. We've got a whole website here that tells you what people are doing with data, showing one-page descriptions of what data they used and what they did with them. There are about 100 in there at the moment, and they span research (the purple ones) and teaching (the blue ones). So they're a nice promotional tool for producers and for users to see what's happening; you can go and have a look at those if you want to get a flavour of what people are doing.

So, back to the topic of data citation. For many years, as part of our end user agreement for accessing data through our online system, you've had to acknowledge the original source. There's a statement, part of the end user agreement (number nine), that says you'll acknowledge the original data creators in any publication. Now, that is a contractual agreement, and people are supposed to do it, but it's very hard to enforce: even though we provide a method of citation, people don't necessarily do it that well, particularly when the titles of the datasets are complicated. And, just to show you, we've always given citation information as a separate file, which tells you that you have to cite the data like this. So we've always had a way of doing that, from quite early on, even from the 1970s. But of course this is different from an acknowledgement.
It is a citation, not an acknowledgement; and acknowledgements, which many people do use, are not adequate, as we know, for citing data. This is why we've moved to a more robust system of citation.

You all know why we should cite data; I'm sure you've seen some of the wonderful presentations that came before. It's about acknowledging the authors' creativity and sources, and their identity as well. It helps you find data, and it helps promote reproducible research: you can get back to original results and play with them. It allows the impact of data to be tracked; we now know that if you put a DOI into Google you can immediately get some idea of what's happening to the data. And it provides a good structure which, I think, means data creators will feel there's more incentive to deposit data; it becomes more of a reward structure. So there are lots of positive things.

Our wish to use persistent identifiers goes back some time. We wanted to use them: we already had URLs, we already had citations, but we wanted something more permanent and unique. Our definition of digital objects was the thing that took us longest to decide; exactly what did we want to identify? It should be very clear what the granularity is. And, as we know, there are a lot of identifiers being applied to many different things; as you probably know, you can already go and get an ORCID iD, which gives you a unique identifier for yourself as a researcher, and there are identifiers for institutions too. So this idea of persistent identifiers is very important across our whole community, particularly for long-term data providers.

Now, our data collections are not actually single digital objects; they're collections of sets of materials.
So one of our collections could have a hundred files in it, or could have one file; it's a collection of materials that make up one particular study. That's important for us: our unit of description is a research study, or a particular survey that was done.

We wanted to capture changes made to data. We do already version data, but we wanted to make sure it was commonly understood and that we had a robust way of versioning data. We wanted to think about what a significant change to data might be; that's very important for us. And we wanted to use highly structured data, so that machines could understand what we were doing. Our other requirements were that this particular workflow fitted in with our current digital preservation activities and, because there's a lot of investment in planning this, that we got it right the first time, rather than having to go back and start a different system. So there were lots of things to think about at the planning stage.

Just to reiterate: about 15% of our collections have some kind of change in the first year, which is a bit of a pain really. We often get issued new datasets, and particularly new variables: they've upgraded a variable, they've derived something different, they've replaced it. There are quite a lot of changes, which we issue as new editions, and it's not always clear which edition you're working with. So we wanted a more robust way of signalling those changes. An example: with a longitudinal study, collecting data every year, every time you add a new wave it creates a new edition, because you add that particular wave to the existing years of data that you have. Regrossing is rebalancing estimates so that they fit in with current population trends, which come from population statistics, and sometimes that is done to the data. And there can be changes to documentation: new parts added, bits that were missing.
So there are these ongoing changes, which mean we have to re-issue data, and versioning is therefore very important for us. One of the important things is that we wanted to distinguish between what is a major change and what is a minor change. That's been the most important thing for us, because some of these have a much larger impact than other changes to the data. We wanted to make sure we weren't versioning every single change that happens, because that would be difficult to keep up with. And actually, on the whole, although there's a lot of talk about verification and replication of data in our field, we've mostly seen people wanting new data for new research, to ask new questions of the data. So at the moment we just make current versions available. Earlier versions can be accessed, but not from the website; you have to come and ask for them. How we make previous versions available is something we need to think about. But sometimes there can be 20 different versions and editions of one dataset, so you can see the problem; most people really do want the most accurate, up-to-date one.

So let me define a little what we mean by low and high impact. An example of low impact is that somebody might simply add a reference to the bibliography, or fix spelling mistakes; that can happen quite a lot. Sometimes we get supplied variables that shouldn't be in there, accidentally supplied. Sometimes there are minor changes in documentation that are not going to affect the content or the way that you use the data. Sometimes we add new indexing terms: we index our data using a thesaurus, and sometimes terms change, terms get dated, new terms get added.
So we do re-index sometimes. And then sometimes we add little extra bits of documentation as they come about, and sometimes there's a change in access conditions, say when data move from a public licence to a more restricted licence. So there can be changes of all those kinds to think about.

Then, in terms of high impact, that would be something that's really going to affect the content and usability of the dataset. That would be adding a new variable, which would be important to know about. We might have new value codes: somebody might have changed the options given for a question, so instead of just 'yes' and 'no', a 'don't know' could have been added as another code; sometimes that does happen. Sometimes the weighting variables that help you balance a dataset have been changed: the statisticians who own a government survey might have done a lot of work afterwards, and those changes are not available until they've actually given us the data. Changes in formats: that doesn't happen very often, but it could; there could be file migration up to a new version of a format, and that could signal quite a big change. Those are the kinds of things that we want to distinguish between. That's very particular to our sorts of data, but some of it has relevance to other kinds of data, scientific data, in terms of what's really important, and what's not so important, for users to think about.

So the next step is defining what an instance of a changed collection is. What is this collection, and how is it changing?
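To make the low/high-impact policy above concrete, it could be sketched as a simple classifier. This is my own illustrative sketch: the change-type labels are hypothetical, not the Archive's internal vocabulary.

```python
# Hypothetical sketch of the low- vs high-impact rules described above.
# The change-type labels are illustrative, not the Archive's actual terms.

HIGH_IMPACT = {
    "new_variable",       # a variable added to the dataset
    "new_value_codes",    # e.g. a "don't know" code added to yes/no
    "weighting_changed",  # reweighting/regrossing by the data owner
    "format_migration",   # files migrated to a new format version
    "new_wave",           # a new wave added to a longitudinal series
}

LOW_IMPACT = {
    "bibliography_addition",
    "spelling_fix",
    "accidental_variable_removed",
    "minor_documentation_change",
    "reindexing",                  # new thesaurus terms applied
    "extra_documentation",
    "access_condition_change",
}

def impact(change_types):
    """Return 'high' if any change warrants a new edition/DOI, else 'low'."""
    unknown = set(change_types) - HIGH_IMPACT - LOW_IMPACT
    if unknown:
        raise ValueError(f"unclassified change types: {sorted(unknown)}")
    return "high" if set(change_types) & HIGH_IMPACT else "low"
```

The key design point is that a single high-impact change dominates: a release containing ten spelling fixes and one new wave is still a high-impact release.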
We do make various changes to data. They might be internal changes, made at the point of ingesting data, and we don't particularly release those; for us they are things that change internally but not externally. What's more likely is a low-impact change, where we release the change and have a new external instance that keeps the same identifier, because the change isn't important enough. Or there's a high-impact change, where we release a new dataset and decide that it's worth having a new identifier. I'll show you what these identifiers mean and how they change.

So we started working with DataCite, through the British Library, who are one of the agents for assigning DOIs, in 2011. Our director, Matthew Woollard, has been very involved in this whole area for some time. As you probably know, DataCite has been established for some time; it provides a format for citing data, and it works quite closely with data publishers to promote the idea of data citation. Not only does it provide a methodology and issue DOIs, it also does quite a lot of advocacy in this area.

Most of you will know what a DOI actually is, but it's about persistent identification of digital objects: a systematic string of letters and numbers, nothing that complicated. Subscribers who want DOIs basically join and, once they've got a contract, get access to DOIs that they can mint, and those records end up in the DataCite metadata store and resolver system. I'm not sure how many there are at the last count; I think in May there were about 1.3 million objects in there, and there are probably a lot more now. Of course there are other handle systems out there, but we chose to go with DataCite because we work very closely with the British Library in many other areas.

So how are they created?
As a data publisher, which is what we are, we register and obtain DOIs. We mint them: we have a system for doing that within our own infrastructure for dealing with metadata, and we use the API to get the DOIs. It's actually a fairly straightforward process; I'm not going to go too much into the technical side today.

Thinking about what you want to allocate your DOI to: it has to be allocated to some kind of object, something that has an identity. We actually use our core metadata as the thing that relates to the data collection. So you think, OK, core metadata: the title is the name of a particular study; that should be quite straightforward. It's not always straightforward, because sometimes the titles of studies change, which one might not expect; it may be that one of the government partners decides to change the name completely, and we need to reflect that change somehow. So our solution is that our metadata record is the object that our DOI identifies, and that is an external instance.
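As an illustration of the minting step, here is a minimal sketch of registering a DOI through DataCite's public REST API. This is an assumption-laden example, not the Archive's actual infrastructure: the endpoint, field names and payload shape follow DataCite's current REST API documentation, and the study details are invented.

```python
# Sketch only: payload shape follows DataCite's public REST API
# (POST /dois, JSON:API format). Values below are invented examples.

def build_datacite_payload(doi, url, title, publisher, year, creators):
    """Assemble a registration payload for a DataCite-style REST API."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "doi": doi,
                "url": url,               # landing ("jump") page for the record
                "titles": [{"title": title}],
                "publisher": publisher,
                "publicationYear": year,
                "creators": [{"name": c} for c in creators],
                "types": {"resourceTypeGeneral": "Dataset"},
                "event": "publish",       # register and make the DOI findable
            },
        }
    }

payload = build_datacite_payload(
    doi="10.5255/UKDA-SN-2000-1",             # hypothetical study number
    url="https://example.org/catalogue/2000",  # hypothetical landing page
    title="Example Longitudinal Survey, Waves 1-13",
    publisher="UK Data Archive",
    year=2012,
    creators=["Office for National Statistics"],
)

# The actual mint would be an authenticated POST, for example:
#   requests.post("https://api.datacite.org/dois",
#                 json=payload, auth=(repository_id, password))
```

The point of the sketch is simply that minting is a metadata push: the publisher assembles its core metadata, attaches the landing-page URL, and the registration agency stores and resolves it.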
So it's a representation of one of our data collections at a particular point in time. When you get a DOI and type it in somewhere, it resolves to a 'jump page', which lists the whole history of changes to the dataset, and that then points to the catalogue record where you can go and get the data. So there's a kind of front page that gives you the change log. And, as I said before, high-impact changes get a new DOI.

Showing what this looks like: here is a DOI on the left, and I'll show you its structure. Hypothetically, it's study number SN 2000, version 1. This denotes a study, a survey that's been continued over 13 years: it has waves 1 to 13, and this is an instance-specific dataset. It's fixed at one point in time and has its own accompanying metadata. In the next step we've made a change, as you can see in the blue box in the middle: we've added another wave. That means we've changed our DOI; it's now '-2'. Again, this is instance-specific data; it's actually a different dataset, it's changed, and it has a different metadata record. And then finally we've added further waves, up to wave 15, and the DOI has changed again.
It's now '-3', with its own instance-specific metadata. When you go to our jump page you will see that history, but you will be taken to the most recent version of the data, because at the moment that's what our users want. We don't give them instant access to some of the older waves, 1 to 13, because for our kind of data, waves 1 to 13 are already available within waves 1 to 15; so you can still get at that particular data. But how we provide access to previous versions is something we want to think about.

Right, so this is an example of our jump page. If I've plugged in this DOI, 10.5255/ and so on, I get a little page that gives me a kind of potted history of what's going on. It has the most recent DOI and changed dataset version at the top, and it gives me the citation and the change log. So it gives you a very short, human-readable description of what's happened: in February 2012, for the third edition, we added in wave two of the study. You can instantly see the changes as you move down, and you can see what the older datasets were. So it's quite a simple way of defining change and telling people what the versions are doing.

Then, in terms of the way we create and update DOIs: when we create a new catalogue record, when we've got a new dataset and we're ready to publish it, we mint a brand new DOI through DataCite, we update the change log and we create a new citation file, because the citation has actually changed. We have something as part of our documentation called a citation file, which people can download, and it gives you the way of citing the data. Also, if we update a catalogue record by implementing various changes, we decide whether they're high- or low-impact changes; if there's a high-impact change we create or update the DOI, we update the log and we create a new citation file. So there's that process of bringing in new datasets and updating the old ones.
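The jump page described above is essentially a rendering of the version history: current citation first, then the change log, newest edition at the top. A minimal sketch, with invented study details and my own field names:

```python
# Hypothetical sketch of rendering a "jump page" change log as described.
# Entry fields, wording and study details are illustrative only.

def render_jump_page(study, entries):
    """entries: oldest-first list of (edition, date, note) tuples.
    Returns page text: the current citation, then the history."""
    edition, date, _ = entries[-1]
    lines = [
        f"{study['creators']} ({date[:4]}). {study['title']}. "
        f"{study['publisher']}. {study['doi_base']}-{edition}",
        "",
        "Change log:",
    ]
    # Newest edition shown first, as on the page described in the talk.
    for edition, date, note in reversed(entries):
        lines.append(f"  {date}  edition {edition}: {note}")
    return "\n".join(lines)

page = render_jump_page(
    {"creators": "Office for National Statistics",   # invented
     "title": "Example Survey",
     "publisher": "UK Data Archive",
     "doi_base": "10.5255/UKDA-SN-2000"},
    [(1, "2010-06-01", "first release, waves 1-13"),
     (2, "2011-05-01", "wave 14 added"),
     (3, "2012-02-01", "wave 15 added")],
)
```

Because the citation is always generated from the newest entry, re-issuing a dataset automatically updates both the recommended citation and the visible history.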
So there's that process about bringing a new data sets and updating all the ones So here's our format as you can see. This is a kind of constructed DOI that very similar to what other people are using you've got your own identifying organization Identifier which for us is 10.5 cube by five. We've got our readable identifier, which is UKDA SN is what we use to study numbers all our collections called SN and study number one two three four And the results identified to us you what what the actual data set is You know they got to six thousand at the moment we've issued about six thousand these things for our collections And the results version is what I was talking about something that is the current version And that will change that will increment every time we have this high impact change So on the left there at the bottom high impact change means you get a just simply just go from one to two So it's quite straightforward with a low impact change. We don't change the the version better But internally we're keeping track of what's happening. So we know what's the minor version So we know that there's been a change, but it's not denoted externally and just an example here when we have a DOI if I plug in the DOI at the top in the in the data site metadata Registry or search interface. 
it will give me a link to the dataset, and if I click on that, the bottom screen tells me where the landing page is, so I can actually resolve to the jump page at the Data Archive, and it gives me the citation there. Now, there is a standard way of citing with DataCite, where you send up five fields and that is what's displayed. But for us this doesn't display the version, and that's very important to us. So we're changing what we send, adding further fields to the DataCite metadata, because it matters that the particular version is reflected in those fields. The five core kernel fields that are used are not exactly appropriate for our kind of data, so we need to add in resource type and version, and we're currently doing that now. Every collection, I guess, is going to have slightly different requirements, and maybe versions are not always as important.

If you put the DOI into Google, you can immediately track what's happening to the data, and get a good idea of who's actually citing it. That's a nice, simple way for us to find out who's using our materials. In the past that's been very hard: we've been able to put keywords, like 'Health Survey for England', into various Thomson Reuters indices to see who's using things, but that's a very manual way of finding out. So we're hoping this can help track the impact of some of these datasets, and also, of course, identify the users, go to them and say 'would you like to give us a case study?', and then add that reference to our metadata; we keep a list of secondary-analysis publications for each study. So it enables us to really add value to our catalogue records, and to tell creators and users what's happened.
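To show why the extra fields matter, here is a sketch of a displayed citation built from DataCite's five mandatory kernel fields (creator, publication year, title, publisher, identifier) with the optional version added, as the talk says the Archive is doing. The exact citation layout is my assumption, not DataCite's rendering:

```python
# Illustrative citation formatter: the five kernel fields plus Version.
# The layout is an assumption, not DataCite's actual display logic.

def format_citation(creator, year, title, publisher, identifier, version=None):
    parts = [f"{creator} ({year}). {title}."]
    if version:
        parts.append(f"{version}.")          # e.g. "3rd Edition"
    parts.append(f"{publisher}.")
    parts.append(f"http://dx.doi.org/{identifier}")
    return " ".join(parts)
```

Without the `version` argument the citation is still valid, but a reader cannot tell which edition of the data was analysed, which is precisely the gap the Archive wanted to close.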
So we're very pleased that that seems to work very well. Now, thinking about linking different resources via metadata: ideally, at the bottom there, what we would like is some discovery portal that enables you to find data wherever it is. We don't currently have that in the UK, but there are various registries and overarching portals working towards those things. You can see in the top box on the right a specialist data repository; that's what we are. In the middle there's a research council outputs repository: for us there's something called the Research Outputs System, where anyone who has a grant has a web page related to that grant, and all of their outputs have to be uploaded to the system, so you'll find all of their publications and outputs in there. Also, there's a demand now from our institutional repositories that if you publish, you need to put the latest version of your publication into that repository. So we're seeing a lot more digital objects appearing, more systematically, and some of these repositories are requiring that the materials go in. It's a way of tying these up in metadata stores, so that we can begin to link articles and data together more easily. That's still a little bit clunky at the moment, but we're hoping in time it will happen. And we're very impressed by the ANDS research portal; that's a fantastic model to follow, really very good, so we're very jealous of you guys.

We've also been doing quite a lot of awareness raising, because once we had DOIs, it was a good idea to go out there and say: look, you need to cite data properly. So we applied to our research council and got a small four-month grant just to do some advocacy. We produced a brochure with them about why you should cite data, aimed at social science researchers, and that was quite widely disseminated.
We also spoke to professional organisations in the social science and humanities domains, and we contacted every academic publisher and editor of the journals in our domain. We've done some outreach work with postgraduates, talking about why you need to cite data, how to do it, and why it's important. And some of the communications with academic publishers have been quite encouraging: some of the smaller journals have started to include data citations in their guidance. There's already been quite a lot of pressure from other communities, and I think we're starting to see, more and more, robust ways of citing data given by some of the academic publishers and journals, which is very important.

We've done some events with the British Library and DataCite on metadata, and with JISC, who work a lot on building repository infrastructures; metadata, of course, is absolutely critical for enabling sharing. And we have things called Doctoral Training Centres in the social sciences, where universities are given a particular specialist centre to train in various areas of social science. They provide expert, advanced training, and they're useful because postgraduates doing high-end research PhDs are a useful new, young audience to approach. It's important to get to them as well, because they'll be the next generation who are publishing.

So it's been a really nice, short, successful project, we think, in combination with the British Library, and, more importantly, sponsored by the research council; the brochure was actually produced and endorsed by them. The brochure has been widely used, and it's very simple.
It has to be very simple: it doesn't talk about the technical issues, but about why you need to do this. And I know our sister archive, ICPSR in Michigan, is doing a very similar thing; they have quite an advocacy programme at the moment.

On DOIs: you've probably recently seen the Data Citation Index that's been released, and we're very pleased to be included in that particular system. I think 20 repositories have been indexed, and it's going to be part of the Web of Knowledge. I think it could be quite a big draw, actually; apparently there are about two million records in there. They're also doing advocacy work, which I think is going to go quite a long way in slowly promoting quality citation. So I think that's a real added bonus, that it's being taken seriously at that level. This was referred to last week, with an example there of the Data Citation Index and what it will all look like: it will pull up metadata records and show you what the collections look like. I'm not sure how it's going to work, because I don't think it's live; I couldn't find a live site, but it looks quite exciting.

So what are our challenges for the future? We seem to have sorted out our data citation methodology; it seems to be quite robust, and we haven't encountered any problems yet, apart from needing to change the way DataCite displayed some of our citations. We need to be thinking about parts of data, rather than whole collections: what does it mean to cite, and have persistent identifiers for, single files?
For example, subsets of data. If you're getting data from an online system (we have an online data browsing system called Nesstar, which allows you to browse and visualise survey data and create tabulations online), when you subset data or download it, how do you provide a permanent reference to that particular part of the data? And extracts of textual data: if you're looking at qualitative data and you want to use an extract, I'm very keen on having permanent identification of that particular extract. That's the next challenge for us, which we'll be looking at in the next six months. We're also thinking about creating better relationships between data and outputs, and making sure that we're linking funding information to our grants and research outputs. There's quite a lot to do in terms of linking and connecting, and more powerful, granular citation; and I know other data centres in the UK are also looking at this kind of issue. So those are some of the challenges that we want to be working on.

And that really is an overview of what we're doing. I'm certainly happy to answer questions on any particular issue, to provide clarification, and I can show you more things on the website. I hope that was useful.