Okay, hi everyone, welcome to the RDA Gold Standard Record Exemplars Project webinar. My name is Philippa Broadley and I'm the project manager for this particular project. Before I go any further, I'd like to acknowledge the contributions of the following people towards the completion of this project: Lance Devine, who is a HPC and Research Support specialist at QUT; Gauri Edisuria, who is also a HPC and Research Support specialist at QUT; Professor Kerry Raymond, a professor within the School of Electrical Engineering and Computer Science; Martin Borchardt, the Associate Director of Information Resources and Research Support within QUT Library; Dr Jo Young, the manager of HPC and Research Support at QUT; and Andrew White from ANDS. I'd also just like to briefly explain a few terms that I'll be using during the course of this webinar. The first one is HPC and Research Support, which stands for High Performance Computing and Research Support, QUT's advanced computing facilities section. Research Data Finder, once launched, will be the QUT data registry, otherwise known as our metadata repository. And lastly, Research Master is our research administration system, which is managed by the Office of Research at QUT. I've also included some links to project information down the bottom of this slide: SourceForge, which contains all of the open source material that we've produced, and our blog, which has more information about the project. Both can be found just by searching. I would also like to mention that unlike the Metadata Stores project, which roughly 20 other institutions are completing, there are only two institutions undertaking the Gold Standard Project: Griffith University and QUT. So while we did share some information, there was minimal collaboration with Griffith University, as ANDS wanted two independent assessments of exemplary records.
So today what I thought I'd cover would be a project overview, so we'll talk about the scope, aims and objectives, timeline, budget and risks. Then we'll have a look at the records and what actually constitutes a gold standard metadata record. I'll talk about the return on investment for this project, any lessons that we've learned, recommendations, and where to from here. And then we'll have time for questions. Okay, so the project. The RDA Gold Standard Project was designed to address the issues of richness, connectivity and quality with respect to research data description records. This was because many institutions are contributing data descriptions of varying levels of quality to Research Data Australia, and up until now there have been no examples of good practice, no guidelines for the economical production of high quality, information rich data descriptions, and no investigations of the return on investment for the effort and cost of producing such records. So starting out on the project, we aimed to investigate issues around RIF-CS metadata record design with respect to quality and connectivity, information sources, information types, the development of scripts, automation and usability. We also aimed to increase awareness of and access to research data captured from selected QUT research activities. We aimed to make selected improvements to our metadata repository to support a greater range of metadata fields and relationships, which will hopefully support the requirements of producing enriched metadata records. We aimed to contribute metadata records to Research Data Australia and also to progress the Seeding the Commons program. We knew that to achieve our objectives, we would have to investigate, test, produce and document the creation of exemplar data description metadata records.
We would also have to test the integration of various data and information sources, develop Research Data Finder as an instrument for collecting and managing research data information, and produce a report on project findings for use by other institutions, which would include recommendations and a return on investment assessment. So let's have a look at the scope of the project. In scope was a project plan; one sample of a richly enhanced RIF-CS collection description for data that is accessible for reuse, either via open or mediated access, which was to be manually added to RDA through Online Services; a document outlining strategies and requirements for improving record quality, which included best practice recommendations; and between five and 12 richly enhanced RIF-CS collection descriptions, which were to be made available in RDA via automated methods, for example harvesting via OAI-PMH. We were also required to present our project findings at a public event or forum, and that is this webinar. We were also required to produce fortnightly progress reports, which has been done in the form of blog posts, as well as a final project report. We were also required to investigate data and information sources in existing QUT systems and repositories to identify sources for integration and linkage purposes, and we had to provide an assessment of ANDS tools and standards. The scope was defined by the use of selected research activities from all of the QUT disciplines and project types, to ensure a diversity of issues, data and users was explored. Out of scope: the project did not extend to include the investigation or development of data storage solutions, and that was quite important to note. The project began in November last year, 2011, and was due to end in August this year. However, staff absences and delays from ANDS have pushed the completion date out to the end of this week.
This project is 100% funded by ANDS, with a total of $125,000 going towards the salaries of two project staff. So that's one High Performance Computing and Research Support specialist employed for six months, and a data librarian, myself, for nine months. Going into the project, we knew that there were a number of risks, but we were confident that we could overcome these if they ever eventuated. The first one was that the automation or harvesting of the records may not be possible for whatever reason. We might encounter staff illness, and therefore prolonged absences. There was a dependency on the HPC and Research Support specialist's knowledge and expertise. And we also thought that we might run into the problem where QUT researchers were unwilling to commit time to share information about their data, but fortunately we didn't. We considered enriching only existing RDA records to meet project requirements, but this wasn't considered the best option for QUT, as we thought it more beneficial to create new records so that an assessment of the record creation process from start to finish could be carried out. We believe, however, that the learnings from this project would still be valuable to other institutions wanting to improve their existing records. So for D2, the manually created collection description, the data set we chose to use already had published metadata online, and we used this to create a basic record in Online Services. Metadata for one collection, one activity, and three party records were entered into the web form until the records were compliant with the ANDS metadata content requirements. The required and recommended icons and messages were used as a formative quality check tool. As some members of the project team were not familiar with the RIF-CS version 1.3 metadata schema, the content providers guide was consulted quite extensively during record creation.
During the second phase of the project, approximately 20 researchers were contacted by email and telephone, with 12 agreeing to meet with us to discuss the practicalities of sharing their data. In total, I conducted 11 data interviews in collaboration with Professor Kerry Raymond, who helped identify potential interviewees. These data interviews were a mix of unstructured conversational consultations and structured interviews using an interview template, part of which you can see here. If anyone would like to view the full data interview template, please don't hesitate to contact me. A total of 10 collection records, along with 19 party and six activity records, associated with at least one member of each of the six QUT faculties, were created. The six QUT faculties represented are Law, Business, Science and Engineering, Education, Health, and Creative Industries. We handcrafted the records using text files, specifically YAML, which stands for YAML Ain't Markup Language, a human readable structured data format. The YAML files were then transformed into Java objects and consequently into XML and RIF-CS records. Records were added to a feed which was exposed for harvesting via an OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) provider. To ensure that records were RIF-CS compliant, we created a YAML template, which you can see on the screen now, and used this in the development of all types of records. Any element not required for a given record type was deleted from the template. In order for us to be satisfied that the newly created records were of a gold standard, we decided that any submitted records must comply with the content providers guide, and that meeting the minimum metadata requirements would also assist in the creation of gold standard records.
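To give a feel for the transformation step: our actual pipeline went from YAML to Java objects to XML via Groovy, so the Python below is only a rough sketch of the idea. The input dict stands in for a parsed YAML file, the group, key, title and description values are invented for illustration, and the RIF-CS namespace shown is an assumption to be checked against the content providers guide.

```python
import xml.etree.ElementTree as ET

# Assumed RIF-CS namespace; verify against the current content providers guide.
RIF = "http://ands.org.au/standards/rif-cs/registryObjects"

def record_to_rifcs(record):
    """Build a minimal RIF-CS collection registryObject from a dict
    (standing in for a parsed YAML record file)."""
    ET.register_namespace("", RIF)
    root = ET.Element(f"{{{RIF}}}registryObjects")
    obj = ET.SubElement(root, f"{{{RIF}}}registryObject", group=record["group"])
    ET.SubElement(obj, f"{{{RIF}}}key").text = record["key"]
    coll = ET.SubElement(obj, f"{{{RIF}}}collection", type="dataset")
    name = ET.SubElement(coll, f"{{{RIF}}}name", type="primary")
    ET.SubElement(name, f"{{{RIF}}}namePart").text = record["title"]
    desc = ET.SubElement(coll, f"{{{RIF}}}description", type="full")
    desc.text = record["description"]
    return ET.tostring(root, encoding="unicode")

xml = record_to_rifcs({
    "group": "Queensland University of Technology",
    "key": "example.edu.au/collection/001",  # illustrative key only
    "title": "Example dataset title",
    "description": "Free-text description of the data set.",
})
```

A real gold standard record would of course carry many more elements (rights, coverage, related objects and so on); the point here is just the shape of the YAML-to-XML mapping.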
PDF reports of RIF-CS records were generated from the feed and sent to researchers for proofreading before we submitted the records, which we actually had to resubmit twice after making changes. So this is a list of all of our gold standard records, which you can view on Research Data Australia now that they have been published. After we familiarised ourselves with the RIF-CS metadata schema, conducted data interviews, and began creating the records, we decided that a number of individual schema elements should be employed for their capabilities to enhance record quality. The first one is the citation field. Because there were no clear guidelines from ANDS as to the preferred citation element, the citation metadata element was chosen over the full citation element, as it was thought to provide a more standard style of citation. Also, the citation metadata element was able to be better mapped using the Resource Description Framework, or RDF, present in our research data registry. The downside of using this element is that in previews, we found that the citations appeared cumbersome and overly long; we discovered that the added context element adds unnecessary length and bulk. So after we submitted our records for assessment the first time, it was suggested that we use the full citation element instead, because we didn't have any contextual information that would provide any extra meaning for any of our data sets. So we did this, and we used the DataCite style of citation. The second element that we thought would be beneficial for enriching records was the related information element. This was used to provide links to authors' personal webpages in QUT ePrints, which is our institutional repository, industry partners' websites, collaborating researchers' websites, and other information relevant to the collection, party or activity record being described. It should be noted that we didn't actually have any service records.
If no related links were provided by the authors themselves during data interviews, we would seek out related information that would add value and depth to the collection being described. So for example, in a couple of records, we actually linked to news articles that were talking about the findings of the particular researcher's project, as well as the data itself. The third element is related objects. We thought this element could be used to create linkages between data collections, parties, and the research activities that were undertaken, as it is being used by most institutions now. The inclusion of this element allows users to discover more sources of information and creates linkages between systems. The next elements are spatial coverage and temporal coverage. As you can see here, for one record, or actually for a couple of records, we included spatial coordinates, and we used the ISO 19139 DCMI Box format to include the geospatial coordinates. We included this where it was relevant, to provide users with a pictorial representation of where the data was collected, just as something extra. Where geospatial notation wasn't available, we included free text representations of where the data was collected. With temporal coverage, we included this for all gold standard records using the W3CDTF format. We thought that the use of this element provides researchers with a data collection date which, when combined with other information such as access rights, would allow re-users to make an informed decision about whether the data is suitable for reuse. The next two elements were the title and description elements. Descriptive collection record titles that contain discipline-specific keywords, but that also allow non-technical users to understand the data content, were considered necessary and were created for all 10 of the records created.
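As a rough illustration of the two notations just mentioned, here is a hedged Python sketch. The coordinates and place name are made up, the DCMI Box helper covers only a simplified subset of the notation's parameters, and the exact formatting should be checked against the spatial coverage documentation before use in real records.

```python
from datetime import date

def dcmi_box(north, south, east, west, name=None):
    """Format a bounding box in DCMI Box-style notation (as used with the
    iso19139dcmiBox spatial type); simplified subset of the parameters."""
    parts = [f"northlimit={north}", f"southlimit={south}",
             f"eastlimit={east}", f"westlimit={west}"]
    if name:
        parts.append(f"name={name}")
    return "; ".join(parts)

def w3cdtf(d):
    """W3CDTF date value: an ISO 8601 date such as YYYY-MM-DD."""
    return d.isoformat()

# Invented example values, roughly the Brisbane area:
box = dcmi_box(-27.0, -28.0, 153.5, 152.5, name="Brisbane region")
collected = w3cdtf(date(2010, 1, 1))
```

The free-text fallback we used when no coordinates were available would simply go into the spatial coverage element as a plain "text" type value instead.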
Full descriptions were created for each collection, party, and activity record. As with record titles, collection record descriptions were aimed at a non-technical audience. Party record descriptions contain biographical information, as well as research interests and awards. Activity records contain funding information, grant details, project objectives and start years, as well as other information. You can see in this collection record I've specified exactly what the data is; in other collection records, I've even specified the file size and type. Contact details: we included full postal addresses, not just email addresses, to promote opportunities for collaboration. And subjects: researchers were asked to provide Australian and New Zealand Standard Research Classification codes, so that's the ANZSRC codes, as well as any other relevant keywords. Both the codes and local keywords were used in the records to facilitate information retrieval, which in turn will hopefully lead to information discovery and reuse of data sets. We also used some of the mandatory RIF-CS elements, including the identifier element. Record keys were used as identifiers, and National Library of Australia, or NLA, party identifiers were added to party records. When available, we also added Scopus IDs and Thomson Reuters Researcher IDs to enhance information discovery as well. Also, we thought it possible that in the future the NLA might integrate with global researcher identity systems such as these. We also used the digital object identifier field. We added DOIs for three of the nine harvested collection records, and hopefully this will assist in the citation of collections. All data described in the 11 collection records were single data sets. That means that during the creation of the gold standard records, potentially problematic issues such as multiple versions, subsets and parent collections were not encountered.
Also, we included information for the previously mentioned elements. We also considered the new metadata elements released with RIF-CS version 1.3 in 2011 and included relevant information in collection records. The first of these was rights. This field was used in all collection records to describe the restrictions around access and reuse of the data. We did attempt to find openly accessible data for sharing and reuse; however, this wasn't always possible, and the rights statement and access rights elements were used to clarify any issues around reuse and provide additional information. The license element was also used to link to the license applied to the collection, if there was in fact one. Existence dates was the other one. This element allows provision of dates of party existence, so that's birth and death dates for individuals and dates of establishment and disestablishment of groups, activity existence, so the start and finish dates of projects, as well as service existence. QUT's information privacy rules prevent us from disseminating QUT staff and students' personal information, so there are no birth and death dates included in party records. By ensuring that as much descriptive information as possible was included, but only if it added value, refinement or association, and that no metadata elements were left blank, we were able to enrich our records to what we believe is a gold standard. While this strategy works for QUT, we realise it may not be applicable at other institutions. Okay, so now to a sort of technical summary. The Cite My Data machine-to-machine service was implemented after a few minor hitches. Three production digital object identifiers were minted, and these were added to collection records. A further outcome was the ingest of all the gold standard records into QUT Research Data Finder.
Groovy, which is an object-oriented programming language for the Java platform, was used for the transformation between Java objects and different data formats. Groovy was used for its significant benefits, including increased developer productivity, static code compilation, simplified testing through its support for unit testing, and immediate compilation to Java bytecode, so it can be used anywhere Java is accepted. The YAML format for storing data was used primarily because it is flexible, far nicer to read than XML, and a human readable file format that is ideal for storing object trees. YAML was also useful for constructing gold standard records with the new elements of RIF-CS version 1.3, where there were a number of unknowns regarding what information would go into a full gold standard record. In addition, VIVO software was used for presenting research data sets within QUT Research Data Finder, with software extension occurring without the need to modify the base source code. Apache Tomcat was used for deploying Research Data Finder, and the OAICat Java servlet web application for harvesting the RIF-CS feed. All of the integration was carried out on one of QUT's Linux servers. Okay, so now to the return on investment assessment. Initially, it took us up to four days in total to create one collection, one activity, and three party records. This time includes interviewing researchers, doing background research, creating the YAML files and sending the records to researchers for approval before assessment submission. Not included is the time required to fix errors in the XML at the time of harvest, of which there were a few. Over a period of four months, 11 collection, seven activity, and 22 party records were created. Our unfamiliarity with the RIF-CS metadata schema, the ANDS Online Services web interface, as well as other ANDS tools and services prolonged the creation of records.
So we can see that the return on investment assessment shows the cost of creating records in the beginning to be $241 per record. That's the staff hourly wage multiplied by the 29 hours, or four days, it took to create those five records, divided by five: $241. As time went on and we became more familiar with the tools and services at our disposal, it took us less time to create records. So instead of taking up to four days to create five records, record creation time was down to two, sometimes two and a half, days. Comparatively, the same assessment for five records created over two and a half days is $151 per record. So at an average of $196 per record, the manual process of creating records using YAML for automatic harvest is quite labour intensive. Conversely, creating records using the web interface, Online Services, is much quicker; entering data into a web form takes considerably less time, roughly one day or 7.25 hours. So assuming that five records are created in a day, and using the same formula as above, it is estimated that one record costs $60.47 to create. Once our data registry is operational, we anticipate that record creation in QUT Research Data Finder will cost less than $60.47 per record, because we hope that many party and activity record fields will be prefilled from Research Master and our academic profile system, as per the requirements of the Metadata Stores project, which QUT is also undertaking. Okay, so we had quite a few lessons that we learned through this project; I've just included the top six here. The rest are included in the gold standard document, which QUT will make public shortly. So the first lesson we learned is that when constructing a data interview template, the most efficient method is to use the RIF-CS metadata schema to inform the questions to be asked of researchers.
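The arithmetic behind these figures is simply (hourly wage × hours spent) ÷ records produced. The hourly rate in the sketch below is my back-calculation from the quoted per-record totals, not a figure stated in the talk, so treat it as an assumption:

```python
def per_record_cost(hourly_rate, hours, records):
    """Cost per record = (hourly rate x total hours spent) / number of records."""
    return hourly_rate * hours / records

RATE = 41.70  # implied staff hourly rate in dollars (back-calculated, assumed)

early = per_record_cost(RATE, 4 * 7.25, 5)    # four 7.25-hour days, five records: ~$241
later = per_record_cost(RATE, 2.5 * 7.25, 5)  # two and a half days, five records: ~$151
web   = per_record_cost(RATE, 7.25, 5)        # one day via the web form: ~$60.47
```

Plugging in a 7.25-hour working day reproduces the quoted $241, $151 and $60.47 figures to within rounding, which is why the web-form route comes out roughly four times cheaper than the YAML route.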
This method assures staff that at the end of an interview, if all questions have been answered, then it is unlikely that they'll have to go back to the researcher to seek more information. Also, record creation time is decreased, as data interview questions can easily be mapped back to metadata elements, which allows a smoother import of data into the ANDS Online Services web interface, text files or equivalent, or institutional data registries. Number two: when in doubt, refer to the content providers guide. We continually referred back to the guide during record creation, as it was often necessary to check our understanding of the intention of a particular element or to apply the rules to new record contexts. Referring to the guide also reduces the number of errors that need addressing before harvesting can occur, and we found this out the hard way. Number three: data interviews should be kept as short as possible. We learned that a list of more than 30 questions can be daunting for some researchers. Reducing the number of questions researchers needed to answer, simplifying the questions to enable yes-no answers, and generally making the task as painless as possible helped to encourage researchers to commit time to providing information about their data. The strategies we used to get around this reluctance also included sending an electronic copy of the data interview questions ahead of time, and leaving an electronic copy of the questions with the researcher after the interview. Number four: if applicable or possible, include rights information in your collection records, so use the rights statement, access rights, access rights URI, license, and license URI elements in the collection records. We found that the inclusion of this information makes for a complete, autonomous record that negates the need for potential data re-users to contact researchers associated with the collection.
Along with its various child elements, the rights element assists users in identifying which data or data sets are available for sharing and reuse. Permissions and restrictions around reuse can also be specified through a combination of rights, license, and access information, and privacy and security restrictions can also be specified here. Number five is the related information element. We found that this is the most appropriate element for including information related to the collection, party, activity or service that is not suitable for other elements. However, this doesn't mean just throwing in as much information as you can; it has to be relevant. So include information on, or links to, relevant publications, industry partners, other researchers not present at your institution, and any other information that may be relevant to the collection being described. The more enhanced the record, the more accessible the information is. Lastly, the sixth top lesson that we learned is to include as many identifiers as possible in party records. Identifiers such as the NLA party identifier, Scopus Author ID, and Thomson Reuters Researcher ID enhance the connectivity of party records, in that they will assist in linking party information with collection, activity and service records, aid with information discovery, and, because they are unique, allow group and people records to be disambiguated. So where to from here? As a result of meeting all the project milestones, QUT now has a better understanding of the process of identifying, piloting and testing the use of various data and information sources to enhance our records. We also gained a better understanding of the RIF-CS metadata schema and how it can be mapped to our data registry's underlying software infrastructure. This knowledge has already come in handy during mapping exercises for the Metadata Stores project.
We established relationships with researchers and now have a greater understanding of the benefits of making their collections available to fellow researchers. Lessons learned during the Gold Standard Record Exemplars project will inform the workflow for registering new collections within QUT, and that will be via automated and semi-automated updates. We'll also produce an internal instructional guide for entering data for new records into Research Data Finder, and this guide will outline the specific elements to be used to create highly connected, rich metadata records. The QUT Library research support team and liaison librarians will be upskilled to manage the future identification and description of RIF-CS records. In doing this, we'll be able to increase the awareness of and access to research data captured from QUT research activities. So that's it from me, thanks for listening. We'll go through some questions now.