Welcome to today's webinar on data review and publishing standards. Today we're going to look at how we review data and documentation, curate them and apply publishing standards to the data within the UKDS Discover catalogue. Discover comprises two elements: the main catalogue, which contains large-scale social surveys and government data series among other things, and the ReShare self-archiving repository, which contains data largely gathered from research-council-funded researchers. We'll describe how we curate data for both parts of the collection; we'll start with the main collection and then move on to ReShare. You'll see that there are many similarities in the processes, but some slight differences too. So my name is Sharon Bolton and I work in the ingest services team here at the UK Data Archive, and we're going to talk about curating data and documentation for the main part of the collection now. Our collections development team receive data and documentation from depositors for the main part of the collection. When the acquisition process is complete, the licence has been signed and all the data and documentation have been received, the package is passed to the ingest services team. The first thing we do is run a quality assessment on the materials. This helps us uncover any issues with the data and documentation early in the process, so that we can contact the depositor if there is anything we need to sort out. Within this quality assessment we look at the data, metadata and documentation. For example, we look at the integrity of the data: do we find any errors in the data? Are there large amounts of missing data? We also look at the metadata within the data files, that is, the file, variable and value label metadata. Is it accurate and complete? Does it describe the variables properly?
We then look at the documentation and check that we have everything users need to make informed use and analysis of the data. We ensure that the documentation covers the methodology and, if a survey has been conducted, the questionnaire; that any derived variables within the files are documented; and that the weighting methods are described, including how to use the weighting variable. If we find that we need extra pieces of documentation to cover all this information, we contact the depositor and ask for them to be sent. At this stage we also set a publishing standard for the data, which depends on its likely use. If we're going to put it into our Nesstar online data browsing tool, we need to apply a high level of enhancement to the data, depending of course on the condition it arrived in. For example, we combine the data variables with the question texts, so they can be seen side by side on screen. So we look at all the materials and check that we're happy with the quality and able to proceed to curation. We also undertake an anonymization and disclosure review, where we check the data for any likely confidentiality issues. The kinds of things we look for, depending of course on the content of the individual data file, include direct identifiers: have they been left in the data? Does the data contain names, addresses, telephone numbers, or anything else that can directly identify respondents, such as email addresses or images? Unless explicit consent has been given to share these images or other characteristics with other researchers, we ensure that any direct identifiers are removed from the data. We also look at indirect identifiers, which tend to be demographics and key variables: for example, age, ethnicity, education and employment, religion, household size, detailed income or geography.
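Part of the direct-identifier check lends itself to simple pattern matching. A minimal sketch of the idea, using a few hypothetical regular expressions (a real review combines automated flags like these with human reading of the files):

```python
import re

# Illustrative patterns only -- invented for this sketch, not the
# Service's actual rules. Each maps a label to a regex that flags a
# common kind of direct identifier.
PATTERNS = {
    "email":    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "uk_phone": re.compile(r"\b0\d{9,10}\b"),
    "date":     re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def flag_disclosive(text):
    """Return (label, match) pairs for strings that may need
    anonymizing before the data can be shared."""
    return [(label, m.group())
            for label, pat in PATTERNS.items()
            for m in pat.finditer(text)]
```

Such flags are only a starting point: they catch well-structured identifiers, while names and addresses in free text still need human judgement.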
Indirect identifiers may not directly reveal the identity of people in the data file, but we need to check whether combinations of them could reveal identity. For example, are there cases unique within the sample? We need to balance the confidentiality protection of respondents against removing or anonymizing so much of the data that we restrict its research usability. In some cases anonymization may be very simple to achieve, but in other cases it may be far more complex. If it can't be achieved without compromising the research utility of the data, we will consider more restrictive access conditions, and here at the UK Data Service we have a range of those. For example, the depositor may give permission individually to each user who applies for the data, so they have an idea of who is using their data and for what purpose. With the depositor's agreement, we may instead enter the data into our secure access system. That way we can preserve the usability of the data without over-anonymizing it. We discuss all these solutions with the data creator, and not just for confidentiality reasons: we will agree data edits, recoding, banding, aggregation and so on for anonymization, or the kind of access restriction they may want. What we try to do is open a dialogue with the depositor as soon as possible, liaise with them and build up a good relationship, so that we can work together to provide a very good dataset for the secondary user. At present, with regard to confidentiality and disclosure, we're developing the use of software tools to automate some elements of disclosure review, especially where indirect identifiers are concerned. All the software we're currently using is open source, based on algorithms which check the indirect identifiers and key variables against each other. It's early days yet, but we've found some quite encouraging results so far.
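The core of those uniqueness checks can be sketched in a few lines: count how often each combination of indirect identifiers occurs, and flag the sample uniques. A minimal illustration, with hypothetical variable names (real open-source tools such as the sdcMicro package do much more, including risk scoring and automated recoding):

```python
from collections import Counter

# Hypothetical quasi-identifiers; in practice these are chosen per
# study from the indirect identifiers described above.
QUASI_IDENTIFIERS = ["age", "sex", "region", "occupation"]

def sample_uniques(records, quasi=QUASI_IDENTIFIERS):
    """Return records whose combination of quasi-identifier values is
    unique within the sample (k = 1) -- prime candidates for recoding,
    banding or aggregation before release."""
    combos = Counter(tuple(r[q] for q in quasi) for r in records)
    return [r for r in records
            if combos[tuple(r[q] for q in quasi)] == 1]
```

A record flagged here is not necessarily identifiable in the population, but it is exactly the kind of case the review examines by hand.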
When the quality assessment and the disclosure review are complete, we move on to the main stage of data curation and processing. We look at whether we need to make any enhancements to the data and, as I mentioned, we do this in consultation with the depositor. We look at data integrity: we may find errors which we need to rectify, or note if no solution is available. We might look at out-of-range codes, missing values and so on, and if we don't have enough metadata or information on what those particular elements mean, we go back to the depositor, get the information we need and add it to the data. We also create additional metadata from the data files. We might add metadata directly or improve it, but we also generate a data dictionary or codebook for each of the files. When processing is finished on the data files, we generate multiple data formats covering both dissemination and preservation. Dissemination formats include other software-specific formats: for example, for data deposited with us in SPSS, we will always make a Stata version as well, for those users who prefer Stata. For preservation, we generate an ASCII fixed-width version of each of the files. This may not be very user-friendly, but it's very good for long-term preservation. We archive the metadata alongside it, so that no matter how software changes in the future, you will be able to load the data into whichever package you use for statistical analysis and match the data and metadata. Moving on to documentation processing now: this is very similar for most types of study across all the access levels, and we will see some very similar processes for the ReShare documentation processing as well. For the main collection, we convert most software-specific documentation to PDF/A, the archival standard variant of Adobe's PDF format.
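The preservation idea can be pictured like this: a fixed-width ASCII data file plus a small dictionary file recording where each variable starts and ends, so that data and metadata can be re-matched in any future package. A minimal sketch, with an invented column specification (not a UKDS format definition):

```python
# Hypothetical column specification: (name, width) pairs, as would be
# agreed from the study's data dictionary -- illustrative values only.
COLUMNS = [("caseid", 6), ("age", 3), ("sex", 1), ("income", 8)]

def write_fixed_width(records, data_path, dict_path):
    """Write records as right-justified fixed-width ASCII, plus a
    companion dictionary giving each column's start-end positions."""
    with open(data_path, "w", encoding="ascii") as out:
        for rec in records:
            out.write("".join(str(rec[name]).rjust(width)
                              for name, width in COLUMNS) + "\n")
    with open(dict_path, "w", encoding="ascii") as out:
        start = 1
        for name, width in COLUMNS:
            out.write(f"{name} {start}-{start + width - 1}\n")
            start += width
```

Because both files are plain text, any statistical package with a fixed-width reader can reconstruct the dataset from them decades later.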
This usually applies when we receive documentation in Microsoft Word or RTF formats. The aim of transferring it to PDF is to make it available to users in an easy-to-use, well-supported format, while the PDF/A standard also makes it suitable for long-term preservation. If we receive elements like variable catalogues, codebooks or data dictionaries in Excel, we will usually keep the Excel version for dissemination purposes, so that users find it easy to use, but we also create an archival format for long-term preservation: usually a tab-delimited text version of each of the worksheets in the Excel file. For PDF files, we add bookmarks and headers for easy navigation and to make sure it's easy to tell that a piece of documentation goes with a particular study. We may sometimes create additional documentation as well, for example the data dictionaries I mentioned, or glossaries, guides, data lists and that sort of thing. For every study that we curate we also create a read-me file. This describes what we've done to the study while it's been processed, along with anything that might be useful for users but isn't necessarily covered in the documentation. Also, if any data or documentation problems remain in situ after we've processed the data, this is the place where we can advise users so that they're aware of any issues. Sometimes we receive hard-copy documentation. This is quite rare now, because most of the files we receive are in electronic format, but if we do receive paper documentation, for example when we're archiving quite an old classic sociology study, we can scan and convert it to electronic form and apply optical character recognition as well, so it's much more searchable and usable.
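The tab-delimited preservation copy is straightforward to generate once the worksheets have been read into memory (for example with an Excel-reading library). A minimal sketch, with a hypothetical sheet name, that writes one tab-delimited text file per worksheet:

```python
import csv
import os

def export_sheets_as_tab(sheets, out_dir="."):
    """Write each worksheet -- supplied here as {name: list of rows},
    e.g. as read from the Excel file -- to its own tab-delimited
    text file for long-term preservation."""
    for name, rows in sheets.items():
        path = os.path.join(out_dir, f"{name}.tab")
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f, delimiter="\t").writerows(rows)
```

Plain tab-delimited text carries none of Excel's formulas or formatting, which is exactly the point: only the content needs to survive.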
Once the data and documentation are processed, preserved on our preservation server and collated into a format which is easy for users to download and use, we create a catalogue record for each study in the collection. The aim of the catalogue record is to ensure that users can read it and decide whether this is the study they want to use for their research. We compile the catalogue record using information about the study provided by the depositor, and we might augment that with study reports or links to project websites too. Technically, we create a study-level metadata catalogue record in DDI format (DDI is the Data Documentation Initiative), and that goes into the Discover catalogue. The DDI format is a recognised metadata schema, and it helps ensure that records from our catalogue are transferable across the web into other catalogues as well, which helps people discover our data. As well as the catalogue record itself, we also create a keyword index. For this we use a standard thesaurus, which contains keywords we can use to describe the same concepts across a range of studies. This ensures standardised searching across Discover, so that if you're looking for data on employment, you will retrieve a list of all the studies in our catalogue which have had that index term tied to them. That means results are standardised, and it helps you find the data you need. Our catalogue page also includes a citation, intended for users to copy and paste easily into their publications once they've used the data. This includes a digital object identifier, or DOI, which enables the referencing of data in publications and helps make the provenance of the data much more visible; if you put the DOI link into any browser, it will take you to information about that study. Here you can see an example for a recent Quarterly Labour Force Survey, which credits
all the parties responsible for creating the data: the Office for National Statistics and the Northern Ireland Statistics and Research Agency. It names the dataset by title; it lets you know that this is a data collection, for example, rather than a book or a journal article; it gives you the UK Data Service as distributor and the study number; and finally it gives you a link to the digital object identifier. Okay, in a nutshell that's what we do to curate data and documentation for the main catalogue. I'm now going to hand you over to my colleague Vale, who will talk to you about the self-archiving processes. So, Sharon described what we would consider to be our gold standard for assessing, reviewing and publishing the data collections that we receive from depositors who deposit their datasets with us. Over the last few years we've also had a smaller repository where researchers can self-archive their own data collections, and we mainly use that to receive datasets resulting from research projects funded by the research councils in the UK. Here we've implemented procedures that mimic our gold standard, but whereby we expect researchers to do most of the data preparation, data processing and data curation work themselves before they upload the data onto the repository; we then do the quality control and the review. So I'll show you how we have implemented that in practice. One first form of assessment is to check whether the data collections we receive fit within our data collections development policy, which researchers can consult, and which basically asks for datasets that are of use to the wider social science community. In the deposit process you can see the division of responsibilities: researchers create a metadata record for their data collection, upload the data and documentation files after they have prepared them, select the suitable access and licence conditions, and submit that to us. We then carry out the
review and the publishing. We have made this deposit process as easy and straightforward as possible. For example, we have managed to harvest a lot of metadata from existing systems, so we reduce the amount of metadata that researchers have to type in and provide themselves. How can we then ensure, when we rely on researchers to do this themselves, that we get high-quality datasets? First of all, we give quite prescriptive guidance to depositors on how to prepare their data files and their documentation files, and that is available directly in the system; I'll show you that soon. So: guidelines on how to anonymize data, what our recommended file formats are, how they should prepare their documentation, and so on. Then we provide step-by-step guidance on how to upload and deposit the data and documentation. This is also available as a video that researchers can consult before starting the process. On the home page we also showcase exemplar datasets that we have received, whereby we say: these are really good data collections, well prepared and well documented; these researchers have really put in a good amount of effort, so we showcase them on the home page of ReShare. And then we review each of the datasets before we publish it, checking for disclosure risk, copyright breaches, the validity of the file formats and the level of the documentation. So I'll show you. On the ReShare page, which you should see now, we have the example data collections on the home page itself; these are real collections in the ReShare repository that people can go and consult. We have links to help information, linking people to the guidance, and we'll look at that in detail; it's quite prescriptive: do this, do that, prepare this, and so on. And finally, the review procedures that we follow are also online in the ReShare repository, so it says specifically what we do before we publish a dataset in terms of quality control. This, for example, is the prescriptive guidance on anonymization. This is
where researchers prepare their data before they upload it into the repository. It's prescriptive to make it easier for researchers to follow: remove names, remove addresses, change dates of birth to a year, remove information that's in the file properties, and so on. These are all instructions that people can follow, such as checking for hidden tracked changes in text files. It's worth noting that a lot of the data collections we receive in the ReShare repository are qualitative: transcripts of interviews, recordings of interviews, and similar collections. We also have a text anonymization tool that can help depositors check for disclosive information that might be in textual transcripts. We provide equally prescriptive guidance for preparing the documentation. So we have a list of what we expect researchers to upload: a read-me file, which again comes with instructions on what we understand a read-me file to be and what should be in it. It should describe each of the files that is uploaded, what is in the files and how they might be related. So if you have a collection of 20 interviews, provide us with a list of those 20 interviews and indicate what is in each of the interview files. If in addition to that you have various documentation files, list their names and tell us what each one is: it might be the consent form, it might be interview instructions, it might be the questionnaire, and so on. That single read-me file then gives the user an easy overview of what is in the dataset. We want clear variable descriptions and code labels in data files, if we are talking about quantitative data. We want a questionnaire form or a data dictionary if the dataset results from a survey; if it results from interviews, we want to see topic and question lists. We want a data list for textual data collections, and again the website guidance describes what a data list is, how to make one, and where to see good examples in the
collection of how researchers have done it. We want to see a copy of the consent form and the information sheet, a description of methods, and so on. When we receive the dataset after the researcher has submitted it, we equally follow very detailed review steps. First of all we do some reviews at the level of the entire data collection. We check that the metadata provided is clear, is in sufficient detail, and indeed describes the dataset that has been uploaded. We check the metadata and the consent forms for any legal or ethical information that might influence whether or not we can make the data available for reuse by other researchers. If it's a qualitative data collection, we check the consent agreements that have been uploaded to make sure that data sharing is indeed allowed. We check copyright status and permissions that might apply, for example where datasets result from secondary analysis of existing data or from combinations of existing data files. And if the data result from a research project funded by the UK research councils, then by providing a grant number researchers can link directly to a project record on the Gateway to Research, which provides a lot of the publications and other outputs from their project; so we also check that that link has indeed been included automatically in the metadata record. At the level of the files that have been uploaded, we check that they open. We check that the formats conform to our recommended file formats; the file format recommendations are available in the help guidance via an easy link to our table of information online, so all of this is in the guidance. We check that the file properties contain no names of people or other disclosive information. And we check that the access and licensing the researcher has selected is indeed the most suitable for whatever
confidentiality concerns we might have. And we check that documentation files have been set to open access. In more detail, we then have further review steps. For quantitative data, we check that all disclosive variables have been removed from the data file; typically these are names, dates of birth, addresses, place names, geography and so on. If there are string variables, textual variables, in a data file, we check that there's no disclosive information in them. We check that there are no hidden tracked changes in files, and that quantitative files come with variable descriptions, labels, codes and values. If we find that any of this is not in order, we simply return the data collection to the depositor, highlighting our concerns or findings and asking them to make changes before uploading it again. In the case of qualitative data, as I've already said, we receive a lot of collections of interviews, and these can be quite extensive, so we check at least a 10% sample of the data items. We check that there's no disclosive information in textual files, or in recordings if recordings have been uploaded. We check again for tracked changes that might remain in transcripts, and, if there's any blacked-out or redacted information, that it is not reversible.
At times researchers might just use black highlighting to hide information, which of course is easily converted back or removed by a user. And finally we have a look at the documentation, to make sure that what we asked people to upload has indeed been uploaded: that we have the essential documentation that we wanted, and possibly any further desired documentation. The main aim here is to have sufficient documentation so that users can understand the data, but from our experience we know that there are certain essential documents we absolutely need for that, such as a questionnaire form or a data dictionary, topic and question lists for interviews, information on the consent form, the information sheet that was given to participants, and so on. Again, if any of this is missing, we contact the depositor and ask them to upload it. And finally, related resources can be uploaded. We check that those links are working and that there are no copyright issues, for example where publications have been uploaded; and if publications are available via the Gateway to Research, then only that link should be included, because that's a good standard for that kind of information. That describes the procedures that we follow and that we have implemented over the last few years, so thank you very much for your attention.