Hello, I'm Leigh Ann George, Coordinator of the SPEC Survey Program at the Association of Research Libraries, and I would like to thank you for joining us for this SPEC Survey Webcast. Today we'll hear about the results of the survey on data curation, and these results have been published in SPEC Kit 354, which is freely available on the Web site. Before we begin, there are just a couple of announcements. First, everyone but the presenters has been muted to cut down on background noise, so if you are part of a group today watching the Webcast, please feel free to speak among yourselves. You won't disturb the presentation. But we do want you to join the conversation by typing questions in the chat box that you'll see in the lower left corner of your screen. At the end of the presentation, I'll read the questions aloud, and then our presenters will answer them for you. The Webcast is being recorded, and we'll send all registrants the slides and the link to the recording in the next week. Now let me introduce our presenters. Heidi Imker is the Director of the Research Data Service at the University of Illinois at Urbana-Champaign. Cynthia Hudson-Vitale is the Data Services Coordinator in Data and GIS Services at Washington University in St. Louis Libraries. Rob Olendorf is Science Data Librarian at Penn State University. Lisa Johnston is the Research Data Management/Curation Lead at the University of Minnesota Twin Cities Libraries. Jake Carlson is the Research Data Services Manager at the University of Michigan Library. Claire Stewart is Associate University Librarian for Research and Learning at the University of Minnesota. And Wendy Kozlowski is Data Curation Specialist at Cornell University. You can use the hashtag ARL SPEC Kit 354 to continue the conversation with them on Twitter. And now let me turn the presentation over to Lisa. Thanks, Leigh Ann. Hi, everybody. This is Lisa Johnston.
And I'm going to take the lead for the presentation section of our webcast today. And then my co-authors are on the line ready to jump in when we move into the Q&A. So first, with this survey, we focused on data curation, which can be broadly defined as the active and ongoing management of data throughout its life cycle of interest and usefulness to scholarly and educational activities. Data curation activities might include quality assurance, file integrity checks, documentation review, metadata creation, file format transformations, rights management, and the list goes on. It's important to note that data curation services may be provided with or without a local data repository. So your library might support researchers when preparing their data for deposit into an external data repository or a disciplinary data repository. Data curation can also be viewed as a subset of a broader suite of research data management services. And there have been a number of studies, and in fact, a SPEC Kit, on that broader topic of research data management services, which include data management plan support, training researchers in data management best practices, and other consultative roles. But what we specifically wanted to understand with this survey is how libraries are taking a more hands-on approach to curating research data. So with our demographics, we sent out the SPEC Kit survey in January 2017. It went out to all 124 ARL institutions, and 80 institutions completed this survey, for a response rate of about 65%. The goals of this survey were threefold. We really wanted to get a picture of the current staffing and policy and technical infrastructure in place at ARL member institutions for data curation. We wanted to know the current levels of demand for these services, and we wanted to understand any challenges that those institutions face in providing these data curation services. We did have a secondary goal, however.
We also wanted to begin to establish a community of practice for data curators. The authors of the survey are also collaborating on the Data Curation Network project. This is a project that is funded by the Sloan Foundation to develop a cross-institutional staffing model that will help account for the wide range of data types, formats, and other disciplinary aspects of curating research data. So we really hoped this survey would also help us better understand the landscape of data curation activities and the staff that are currently doing this work. We do have an open data set for all of the survey data available on our website, and we definitely welcome opportunities to collaborate with others on additional analyses of these results. So with the first question, we actually branched the survey for those who indicated yes, their institution was currently providing data curation services. So these are our current providers, and 51 of the institutions that responded fell into that category. Those that responded to this question with no, or that were in the process, were branched to a separate part of the survey and asked to rank the importance of various data curation activities. And this portion of the survey is actually interoperable with the data set that we collected from 91 researchers last fall. So we're actually able to compare the rankings of importance for data curation activities between a librarian group and a researcher group. Interestingly, only 20% of our sample, 16 libraries, indicated that they neither provided nor were actively developing data curation services. So today, what we're going to focus on is the yes category here, those that did say they were current providers of data curation services. So of those current providers, we found that data curation services appear to be a relatively recent initiative. More than half of the libraries that currently provide services started doing so in 2010 or later.
The 51 responses to this question on the source of the greatest demand for data curation services also showed interest from researchers across a variety of subject domains. Life sciences and social sciences, you'll see here, are topping the list. But also, perhaps surprisingly given the fact that STEM receives so much attention in discussing data issues, we heard that the arts and humanities were seeking data curation services from these libraries and actually edged out both engineering and the applied sciences as well as the physical sciences. Interest in data curation services does not yet, however, appear to have translated into strong staffing levels. Our survey asked how many staff focus either 100% of their time or a portion of their time on providing data curation services. And the responses showed that a majority of the libraries we heard from place responsibility for data curation on a few individuals who also have other duties to carry out. So they are partially dedicated to providing these services. Looking closer, we found that most of our current providers of data curation services also have a repository service for data. So these data repositories can be self-deposit repositories, mediated deposit, or a combination of both. Many limit the upload size of the files associated with the data set, with an average reported limit of about 2.5 gigabytes per file. More than half of the current providers also assist with deposits to external data repositories. And some that were often mentioned were ICPSR, figshare, and the Open Science Framework. Looking closer at that pie chart, breaking it down a little bit more, we asked about the type of data repository service. And the majority of those, 29 institutions, have an institutional repository that accepts data. Smaller numbers had a standalone repository that was specifically built for data, or some other, often a combination, of services.
So we also asked what platforms the institutions were using. And DSpace was the most common repository platform, being used by 22 of the reporting institutions. But also 11 institutions use Dataverse, either as a hosted or a local installation. Ten use Fedora/Hydra, and seven use Islandora. Other platforms were also very common and indicated in our survey results. They included bepress Digital Commons, CKAN, RSTAR, Databrary, other custom Ruby on Rails applications, HUBzero, SobekCM, and then hybrid combinations of a variety of all of these other platforms. The relatively nascent nature of data curation services and treatments that we saw across the ARL institutional landscape is also displayed in the number of data sets that we are seeing in the repository services. So even though the Office of Science and Technology Policy memo, which really asked all the federal funders to come up with plans to make sure that data can be shared and released to the public, happened in 2013, we're only now really seeing the technical and human curation infrastructure reach the point of accepting and curating data sets today in 2017. So of the 46 libraries that do accept data, most are only receiving approximately one new data set a month. We did see three institutions receiving more than ten data sets a month. Similarly, you'll see this trend with the total number of data sets in the repository, where most institutions had fewer than 50 data sets in their entire collection. Ten libraries did have between 51 and 200 data sets, and seven reported having over 200 in their data repository. Describing data sets using standard metadata schemas is really important for data discovery, dissemination, and reuse. We do have, however, many schemas to choose from. They may be discipline specific or institution specific. So the current providers, those that we asked this question to, indicated about six major metadata schemas in use.
Those would be Dublin Core, which you'll see is the highest proportion here, but also MODS, DDI, DataCite, and Dataverse, which is actually based on a number of standards. And then other institutions, the other category, also employ things such as GeoBlacklight, MARC, VRA Core 4, or a custom metadata standard. Additionally, I should note that many organizations indicated they use more than one schema for different purposes. So in fact, a few institutions reported that they used up to four different metadata schemas. So curating sensitive data is also a topic that we wanted to understand for our population. Fewer than half of the current providers indicated that they support curation services for private or sensitive data. One of our respondents actually explained how that process for curating sensitive data is just not an easy thing to do. They explained how they had to go through a lot of institutional collaboration groups, working with places like the IRB, understanding a variety of different types of sensitive data file formats, and working with research quality assurance on their campus to ensure that they are actually in compliance to do so. Another key component of the data curation life cycle that we wanted to address was data preservation. So preservation services such as emulation, file audits, migration, secure storage, and succession planning all help ensure that the data and the repository technology are usable and stable over the long term. In our findings, we saw that the most common preservation-compliant metadata standards used were METS and PREMIS, but there was little standardization across institutions in ways of providing backup services, where many were employing tape systems and cloud services to ensure redundant copies of the data. All right, turning to the next section of our survey, we wanted to understand how the variety of data curation activities were being offered in ARL institutions.
So our survey asked respondents who indicated they were current providers to indicate whether their service provides any one of these 47 data curation activities that we grouped into five different aspects of data curation: ingest activities, appraisal activities, processing and review activities, access activities, and preservation activities. If an activity was not currently included as part of their service, we asked them if they plan to or aspire to include that activity in the future. So here are some of those results. The most universally provided data curation services were the ingest activities. So these activities included things like creating metadata, collecting deposit agreements, authenticating the depositor (are they who they say they are?), accepting documentation in association with the data set, validating the file formats, and preserving the chain of custody. We saw that 92% of our current providers provided one or more of these services, and all but chain of custody were offered by more than two-thirds of the current providers. The access category covers these 11 activities that were likewise commonly supported. These curation activities in the access category had noticeably uniform levels of support for data sets, and we think it might be because access is frequently a function of existing repository technology solutions. So these are already built-in aspects of repository technologies. So we saw that 43 libraries out of our current providers were currently providing one or more of these access services. However, only 14 of the current providers actually provided data visualization services, so that one definitely fell down on the lower end of the spectrum. Most of the responding libraries provide support for the 18 processing and review activities, but we saw an interesting bimodal distribution of these results.
So there were activities that were currently supported, and then there were activities that respondents would like to provide, but are unable to at this time. And one of the comments we had kind of summed this up, that they're slowly working toward supporting these activities, but many are still going to be out of their reach, and actually peer review was highlighted in several of the comments as something that may be out of reach or out of scope for the ARL institution libraries to support. And you can see on this next slide the rest of the processing and review activities, where you'll see that similar bimodal distribution of those results. Preservation activities came with a noticeably uniform level of support for data sets because, also, we found that these were a function of the repository technology in many cases. As one of these comments points out, some of these activities are dependent on infrastructures provided by departments outside the libraries. So that was also a theme amongst our comments, of where the level of support was coming from, whether or not it was in the library or outside. And then the appraisal activities around risk management, rights management, and selection. We saw a number of comments here that echoed that idea: is this the role of the library, or is something like risk management really the responsibility of the depositor? So as you can see, our conclusions around these divisions of what roles we support and what roles we aspire to support are kind of falling into these two categories. So a number of data curation activities were falling in this category where libraries would like to perform them, but were unable to at this time. And some of those that rose to the top included repository certification, creating a software registry, or providing interoperability for data sets within their collection. However, we also found that a number of data curation activities were activities that libraries had no interest in providing.
And it should be noted that several of these activities were found to be highly valued by researchers in some of the previous research done by the authors of the SPEC Kit for the Data Curation Network project. So some of these activities, like code review, peer review, and software registry, showed up in both categories, and de-identification should not necessarily be abandoned. But like this quote says, is it something that the library needs to do or should do, or do we need to be asking who to partner with, and how to connect researchers to some of these other services? So finally, we asked, what are the challenges, or what is the level of challenge, that libraries face when providing data curation services? And you'll see here that the majority of the potential challenges in providing data curation services were all rated very challenging. And expertise in curating domain data was actually one of the areas that most of our libraries found challenging. And that makes a lot of sense to us, in particular when you're thinking about the type of multi-disciplinary data sets that must be coming to all of our ARL institutions. How do we effectively curate this wide range of data file formats and data types? It's a really great reminder for us of why it might be a good idea to pool our expertise in a data curation network, so that the specific types of data might be curated by experts that could be either local or at another institution. So with that, I'll just briefly mention our conclusions in this study: we're kind of seeing growth in data curation services, but perhaps not yet maturity, in the ARL institutions' support. What we saw were a few institutions that did have really well-operated, long-standing repositories that showed a high level of sophistication with the type of data curation services they were providing.
The larger subset were respondents that recently took steps to launch their curation services, either through an established IR or through a standalone data repository. And then a smaller group of our survey respondents have established maybe some core research data services, perhaps training researchers in best practices on data management, or DMP reviews, but have not yet embarked on actual data curation activities or applying more hands-on approaches for curating data. So with that, that is, I believe, the rundown of my slides for today. And now what we'd really love to do is open up the conversation to questions, and I will welcome all of my co-presenters to unmute themselves, and we'd be happy to talk more about these findings. So please do join the conversation by typing your questions in the chat box in the lower left corner of your screen, and I'll read the questions as they come in. We do have a question from Amanda, who said, you reported that the median number of data sets curated is 39. Is that because of the labor that's required or just the number of submissions? So can everybody hear me? Yep. OK, I think those two things feed into each other because, as Lisa said earlier, a lot of libraries are only getting maybe one data set a month. So there's an issue with flux, in terms of not as many data sets coming in, so then there's no commitment to adding staff in order to do the curation, but on the other hand, if you don't have staff to do the curation, then you can't do it. So we have this kind of circular problem of those two things feeding into each other. Any other opinions offered are welcome. This is Rob. I feel like ours is more the rate of submission at this point, but it could easily become saturated so that curation becomes a bottleneck, I think, as well. This is Jake.
And I think one of the challenges we had with this survey, as with any survey, is that we could see the responses as to what people told us, but we didn't necessarily know why they were telling us what they told us. So I think Heidi and Rob are spot on in terms of it's sort of a chicken-or-the-egg type problem, but I don't know that we can confirm that based on the results of this survey alone. So that's probably a question that would bear some more research. Melissa has a question. Did anyone mention having a data catalog, but not necessarily an institutional repository? I don't recall any mentions of catalogs. Does anybody else? That would probably be the kind of thing that would have come up in a comment, maybe. No, I didn't hear anything about a catalog either, but that's a really good question, because I know some institutions do have those, essentially an index of data that's available on campus, but not necessarily housing them locally. Yeah, it wasn't mentioned. There's kind of a related question from Carrie. Did you get any sense that the libraries participate in keeping tabs on data sets that aren't going through the library, but might be deposited directly into some other repository like Dryad or GenBank? We did ask a question about whether libraries were willing to curate data sets on behalf of researchers who would be submitting to an outside repository. And I don't remember the numbers off the top of my head, but several of them indicated yes. But nobody mentioned, that I can recall, whether or not they were involved in the process, or if they were trying to track things on their own to keep an eye on it. John had a question of, did you ask if institutions have a data librarian, because you were asking about staff members' work responsibilities? Go ahead, Rob. I think the survey, and it got a little confusing sometimes, I think, for the respondents, but the survey was directed at curatorial staff.
So in the text, it looked like some people were including data librarians, but other people weren't. And so I think you're pointing out something that we really can't get a hold of in the survey. Lisa, did you want to add to that? No, no, that's great. Thanks, Rob. Carlin, go ahead. Sorry, just some of the comments that were provided on pages 18 and 19 of the survey about exactly how data responsibilities are divided are pretty interesting and might give some hints that would give some context to that question. Yeah, and we definitely did see, and I think this was one of the earlier slides, too, that there is definitely a lot of fractional support for staff. Carlin asked about the slide with the bar graph that showed how many respondents have staff devoted exclusively to data curation: the bar appeared to be higher for exclusives, but the speaker said most of them were working on it as part of their duties. Do you want to go back to that slide and bring it back up on the screen? Yeah, I think the X-axis on the bottom is more tor- Yeah, so if you look at the number of staff only having one, and institutions of 12 have that, but if you go farther down, there's less and less. So you can see that the partial support is actually a greater, I'm not explaining this very well, sorry, there's a greater variability in having a fractional number of staff. Hopefully that explains that. So like if there's an N of, yeah, sorry, go ahead. No, go on, continue. I was just gonna say, so for an N of 49, only 13 had one that was fully dedicated, right? And then only two had two staff dedicated. Did you ask anything about funder enforcement of data management plans? No, our survey was focused exclusively on data curation services, so we really did not venture over into the DMP and other ways that libraries can provide research data services. We really didn't touch much on the rationale behind providing data curation services.
I could see that potentially being one of them, trying to satisfy funder mandates and giving researchers a way to do that, but we did not ask that question directly. Did you ask about ownership of the data, whether the researcher owns it or the institution owns it? No, we didn't get into ownership. Joyce asked, what were the most common partnerships? Was it IT or the Office of Research or some other party in the institution? That's a good question. We might be able to extract that from the comments, but it wasn't a specific question that was addressed in the survey. It was primarily focused on specific activities that people were undertaking to curate their data, and less on collaborations, if I'm remembering correctly. Does this data give you any sense of what percentage of data sets were stored institutionally and replicated in a disciplinary repository? This is Wendy. Again, we didn't specifically ask that question. We did get feedback from people that they supported data sets. In some cases, they supported curation of data sets that were in disciplinary repositories, but we didn't get any specific numbers on that. I think that's a really interesting question because it's not like the curation processes are necessarily different. I don't know the impact on workflows, but we didn't specifically get an answer to that question either. There's a question about whether institutional archives or archivists are involved in data curation. I don't think you specifically asked that either. Would you suggest they look at the responses to question four on staff responsibilities? Yes, I would definitely recommend that. We did allow everyone to enter, as free text, which staff members were involved with data curation. So that definitely could be pulled out. But no, we don't have the numbers right now.
And did you ask about how libraries are recruiting people who have the expertise to do this new kind of curation work? No, we did not ask that either, but I have a feeling we have a whole new survey right now just through these questions. There you go. I'm sorry. It was highlighted as- Oh, sorry. I was just gonna say, we did ask for some examples of job descriptions and there weren't many actually submitted, just a handful really, but recruitment was not discussed. I'm sorry, Heidi, what were you gonna add? I was just gonna say, in the challenges section, recruiting and retaining people was definitely something that was a high concern or high area of challenge for people. And I do remember a comment where somebody put in something to the effect of, it took us two years to hire somebody and then they left. So I think retaining people has also been really challenging for some people. Yeah, I think the situation and the survey results that we saw really speak to kind of an emerging area for libraries. And the fact that we still have so many questions that we weren't able to cover in the survey, because there really are so many things that we're interested in further exploring, really speaks to, I think, the stage where we're at: we're trying to figure this all out and figure out a direction and move forward in libraries collectively. David asked a question about those institutions who aren't providing data curation services. He wonders, is it likely that curation is handled somewhere else in the institution, or that there's little need for it? Did you get any sense from the comments from those respondents about why not? I mean, we didn't get any comments that were just like, no, we don't have to do this because somebody else is doing it. At least I don't recall any. So I think it's an open question, but if it is happening, I'm not sure that our respondents knew that it was happening elsewhere. We did get a look, sorry Rob, go ahead.
Okay, our institution just, we started curating the data just this year. Previously we accepted the data without curation and it just never got curated, and that turned out to be a big problem for us. Yeah, my gut reaction is it's probably just not happening, but I don't have an actual data-driven reason to believe that. And Cynthia and I did take a look at some of the free-text responses to those questions that were answered by those who didn't necessarily provide curation services, and we kind of categorized those into the reasons that were given in the free-text answers as to why they didn't provide services. Not necessarily whether they were being provided elsewhere, but one of the reasons was that they didn't necessarily feel that that was the job of the library, and Lisa mentioned that in the slide. So that reason, which we categorized as responsibility, whose responsibility is it to do curation, and whether or not it was a scalable activity within the library, were the two top reasons that were given for why people didn't do the bulk of the curation activities. It doesn't address where it's being done, but that potentially it shouldn't be being done in the library. Kristen has a question about libraries that are doing data curation versus those who are doing data management services. I know you were really trying to separate those types of activities out, but from the responses could you get a sense for who's doing each of those types of services, or both? This is Lisa, I can respond to that. We did not address that in this particular survey. However, Inna Kouper and several other librarians did actually do an interesting study of ARL institutions by reviewing the websites of the libraries to identify what services were being offered, and were able to categorize the different levels of support for data management services, data curation services, data visualization services, et cetera. So that study I would point you to, to get a better picture of how that breaks down for ARL.
There are a couple of related questions about libraries working with the researchers. One is, did they work with them on how best to structure their data before submitting the data set? And the related question is, did you get a sense that faculty are reluctant to plan ahead for depositing large data sets? Again, we didn't ask that, I don't recall that coming up specifically, but I think the answer's yes. It's a little difficult to say based on the survey results given that we were surveying librarians. So it could be a bit of an extrapolation to make that conclusion on the survey results. But I think there's other, you know, research that's been done that would indicate, yeah, that in fact is likely a scenario. And I think among the authors, we've discussed this a bit informally, and certainly planning ahead is not researchers' strong suit all the time. There's a question related to the recent Nature editorial and this study: should libraries continue to invest staff and budget in data curation services? Are you familiar enough with that editorial to respond to the question? Is this the empty rhetoric editorial? It's a little hard to tell from the description what that's referring to. In Nature magazine. I think she's referring to the editorial that came out on June 12th that says that institutional IR support for data is patchy, and curation, I forget exactly what it says. Who wants to take this? It says something like, discourages curation and data standards. I think in the next, yeah, but I think it says in the next paragraph or something like that that it points to figshare, which also doesn't do curation or have data standards, as being useful. I think there is clearly a lot of momentum that's going towards this, right? So there are Nature editorials, and there is some pushback to the medical journals in terms of not being huge enough proponents behind it.
I think, so it's definitely a space that's growing, and if libraries want to be part of that space, then I think they would continue to want to invest in staff in this area. I personally, I found the editorial to be, there's some truth to different parts of it, but it has to be an evolution. Things aren't just gonna suddenly appear that are perfect, and I think IRs are certainly a part of that. Please, somebody else chime in with your thoughts on that too, yeah. I guess I would also add that I don't know that we've done a great job as a library community in really articulating why library-based cultural repositories are an important component of sharing data and making this more actionable in terms of scholarship. And I think we have some work to do in terms of trying to explain the advantages that we have of being sort of neutral in our outlook and of being really funded and provisioned for long-term planning and long-term thinking, in ways that we haven't done yet. And just to add- Oh, Carol and F. Go ahead. And I think that this is an important point that wasn't necessarily something we focused on in the survey questions, but it certainly was one of the reasons why we formed the Data Curation Network project: to not necessarily have the conversation just be focused on the repository, the repository platform, but on the expertise that's needed to really make that data findable and reusable. And someone on Twitter also pointed out that there's a lot of opportunity here for collaboration between librarians and the disciplines. And I think those are all really great things to consider as we try to plan a future for this bundle of services. Several people have asked for links to the reports. This editorial, a couple of reports you mentioned, we will include those links in the follow-up information we send to the registrants. But I think our final question that we have time for today is, so how can these results be used?
Are there any low-barrier areas for improvement to these kinds of curation services that libraries can start looking into? This is Lisa. I think the results showed us that libraries are actively developing data curation services. We're seeing sort of a middling right now, where some institutions might be further ahead and some institutions might be just starting down this path, but we are seeing that evolution. So for us, having these results to show kind of a temperature check of where we're at is gonna help us respond to things like that Nature editorial that just got linked in the comments, and demonstrate that yes, actually, libraries do care a lot about interoperability of data sets and the curation of data sets, and we're making this happen. And so that, I think, is one of the more exciting findings that we have here today, just seeing the progress. Well, thank you all for joining us today to discuss the results of this survey. Again, you will receive the slides and a link to the recording within the week, and we will get you follow-up information on how to access the reports that have been discussed this afternoon. So join me in thanking our presenters, and I look forward to your participation in our next webcast.