I'm Tim McGarry, the associate university librarian for digital strategies and technology at Duke University, and I'm going to lead off our panel. I have a few slides, and then each of our presenters will introduce themselves as we go. We're talking about developing and scaling research data management and curation services, and I'm going to start with the story at Duke and where we are in research data management and data curation. To set some general context: many of you will know that back in 2011 the National Science Foundation began requiring data management plans in all of their grant proposals. In 2013, following up on NSF and other federal agencies who had been starting to work on this, the White House Office of Science and Technology Policy began requiring any federal agency with more than $100 million in research and development expenditures to develop plans for making the results of federally funded research freely available to the public. This included both publications and the data that supported the publications, which was a change from the data management plan and publication policies up to that point. In January of 2016 the National Science Foundation and the Department of Energy announced that they were jointly working on a public repository for publications, requiring all of their PIs to submit publisher-based articles into that repository within 12 months of actual publication. At the same time, though, they were also requiring that the data be made available at the expense of the research institution or the PIs. So that solved part of the problem, but it also shifted some of the challenge to us as a university. Specifically at Duke, around the same timeframe in 2015, we got our first notice that a researcher who submitted a proposal to NSF had that proposal rejected.
Someone who had had success with NSF in the past had their proposal rejected because of an insufficient data management plan: the plan lacked detail on how they were going to deposit their data. So we realized at that point, okay, the NSF is getting serious and they're going to start enforcing these policies. At that same time, with just a little bit of collusion, the libraries, the Office of Information Technology, and the Trinity College of Arts and Sciences all requested from the provost a very significant amount of funding for research data storage. The provost quite naturally asked, why are you all asking for this amount of storage, why can't we do this together, what is the real reason behind this? So she asked us to write a white paper as well as a charge for a faculty working group to study this problem and come up with a series of recommendations. Through the end of 2015 and early 2016 we had an interdisciplinary faculty working group meeting for about nine months, which submitted a list of recommendations for how we should be doing digital research data services at Duke. This is maybe difficult to see from the back, but hopefully when you download our slides you can see a little more detail. This is the workflow and map that we came up with in the faculty working group as our recommendation for how to support digital research data. It starts in the upper left corner with project creation and ends in the red circle on the right-hand side where the project ends; as we all know, some projects really do end, but for the theoretical example here it goes through this process. There are two people at the very beginning: we recommended that new research data consultants be hired to help faculty and researchers with this work.
Jen and Sophia, who are to my left, are those two people, and they'll talk about their work in a little bit. Then there's the gray box in the lower left, the general technology that we're going to be providing to all faculty; we may just need a boost to this. This is the 80% of the work that we expected we needed. Then there's the box in the middle, the gradient between pink and white with all the circles in it. Those are the specialty substrates that we felt needed more detail-oriented investigation: areas that probably have nuances or very specific functional requirements that aren't going to be generally met for every faculty member or researcher, would probably need a lot more investment, and probably aren't something we should just give away for free. In either case, whether you went through the general technology substrate or the specialized technology substrate, the idea was that the data would get into a repository, whether that's an open Duke repository or a protected Duke repository, and we also made the case that it could go to a repository outside of campus as well. Then there is an actor in the lower middle of the page called the repository ingest specialist.
We made a recommendation as a faculty working group to hire two of those as well, and we were actually able to do that at the same time as hiring Jen and Sophia. The provost accepted every single one of these recommendations: the workflow, the people involved, and the increases to the technology infrastructure that the central IT organization should be running. In January 2017 we were able to implement a baseline set of services for all faculty projects, not faculty as individuals but all faculty projects, with baseline virtual machines, a certain level of core computing, a certain level of memory in their VMs, and a certain level of archiving that would go into the repository after the project was completed. So I'm going to transfer over to Jen now, and she'll talk a bit about how we're implementing these services.

Okay, thanks Tim. As Tim said, my name is Jen. I am one of two research data management consultants at Duke, and I started there in January of 2017. He gave you a bit of the why and the history, and now I'll go into the how and what we're doing right now. Our research data management program has a three-pronged focus: building knowledge and skills through our education program, supporting researchers' data management needs throughout the research lifecycle, and a data curation program built around our own data repository. Our education program runs during the fall and spring semesters with a variety of one- to two-hour workshops on data management topics. While our workshops have been developed for Duke researchers at any level, the inclusion of our workshops as part of the responsible conduct of research series has resulted in our primary attendees being graduate and postdoctoral students. I will say, though, that both research staff and faculty have also attended.
While graduate students and postdocs are required to attend RCR workshops for credit, they do have a choice in what they take, and we're glad that they often choose to take ours. Given that graduate students and postdocs are often the ones doing the on-the-ground data management work for labs, projects, and out in the field, we are glad to be teaching them best practices in data management early, while they are developing their research skills and habits. The statistics we have here are from the 2017 academic year. During that time we offered 12 workshops on a variety of topics, such as those listed above the statistics box. Our total attendance across all workshops was 435. The discipline breakdown shows the highest rate of attendance from the sciences, with the social sciences making up the second largest attendee group, followed by those from the health sciences and medical center and the humanities. We have also been actively collaborating with members of the Duke research support community, such as IRB staff from our office of research support, staff from research computing, and our library's own office of scholarly communications, to develop comprehensive workshops that touch on their areas of expertise as well. Another education initiative we have been working on is with the Advancing Scientific Integrity, Services, and Training office within the Duke School of Medicine, to develop two modules for an online course on research quality and reproducibility that will be offered through the Epigeum training arm of Oxford University Press starting in fall of 2018. Our lifecycle services are designed around the cyclical nature of the research process. The services are holistic but also designed to meet discrete points of need, which we could frame as potential pain points that researchers encounter during the research lifecycle. We'll now take a closer look at our four main service quadrants.
With regard to data management planning, DMPs in and of themselves can be a bit of a pain point for researchers. Where we can provide assistance is by reviewing and optimizing formal data management plan language and structure. In addition, we have boilerplate text that Duke researchers can include in their plans if they want to use Duke's research data repository as their platform for data sharing and preservation. With regard to data workflow design, one of the biggest pain points we are trying to alleviate is the complicated nature of collaborative research, with researchers working across institutions in distributed environments and systems. As an example, we assisted a large project out of the Duke Marine Lab in developing a project site on the Open Science Framework that would allow them to use the systems they work with natively but share materials through one primary access point. While this is an example of a large-scale solution, we can also provide advice and sometimes training on resources and tools that help with individual workflow issues as they arise. For data and documentation review, the pain point we are attempting to alleviate is getting all of your proverbial ducks in a row as it relates to preparing data for broader sharing. This involves identifying appropriate metadata standards, reviewing and providing suggestions for documentation, guidance on intellectual property or data sensitivity issues, and how to prepare your data should you wish to deposit it with Duke or another repository determined by your funder, publisher, or discipline. And finally, there is that pain point of where to publish data when it hasn't been stipulated by funder, publisher, or discipline. And this is what Tim was talking about: we have a local solution under the Duke umbrella with the Duke University Libraries' research data repository.
I'm now going to transition over to my colleague Sophia, who's going to tell you all a bit more about our curation program.

Thank you, Jen. So when we were developing our curation services, we were looking at how we could help ensure that the data researchers were entrusting to us meet the FAIR principles: findable, accessible, interoperable, and reusable. I'm sure many of you are aware that FAIR is being referenced widely and viewed as the ideal set of principles that all research data should adhere to. So now I'm going to quickly walk through our data curation workflow and some of the things that we do to help ensure that Duke data meet that goal of FAIRness. The pipeline begins when researchers have data they want to deposit. Prior to this point, there may also be some conversations to ensure that our repository is a good fit for their data. If it is, then they'll submit their data, documentation, code, and metadata for the dataset package. We also ask that all researchers select a Creative Commons license to assign to their dataset, and we strongly suggest the CC0 waiver to allow for the broadest reuse possible. Once we receive the data, we perform a review of the data package. One of the big things we do is obviously look at the data, and we also look at the documentation: we want to ensure that there's enough information for others to understand the content and the context of the data. If it's human subjects data, then we'll also do a once-over to see if we spot any potentially problematic direct or indirect identifiers, to lessen potential inadvertent disclosure. We'll look at the file formats to see if we have suggestions for how to optimize those for portability across systems and for preservation, and we'll check for any other potential issues, such as missing files or minor errors in the data. If we spot any issues or areas for improvement, we'll communicate with the depositor, and this may trigger a resubmission of revised files.
After any potential issues are addressed, we'll ingest the data into the repository. At this point, we may transform files into non-proprietary or standardized formats. We'll also arrange the files in a logical way using our data model, to help secondary users understand what's in the overall data package. We'll generate standardized Dublin Core metadata; part of that is also assigning the license. As a final step prior to publishing the data, we assign DOIs to all datasets to help meet that goal of making them persistently findable and easily accessible online. We also include that DOI within a standard data citation that we generate for all datasets. Before publishing, we do some reviewing and testing to make sure everything looks good, and we ingest some administrative documentation that tracks our curation procedures on the back end for provenance and preservation purposes. Then, finally, the data are published and made publicly accessible within our repository. So thus far, we've had around 20 datasets go through this data curation pipeline. As far as disciplines, we've seen deposits mostly from within the sciences: a lot from chemistry, a few from biomedical engineering, some from the life sciences. I think we've had one from the social sciences, and we have some clinical data that should be coming in soon as well. We've also had a particular uptake by one lab in chemistry, which presents an interesting story about depositors' motivations. This lab has really integrated data submission into their overall workflow: after they complete a study and are writing a paper on that data, they send us the replication data, it goes through the pipeline, and then we send them the DOI, which they use to link the data in the article.
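[Editor's note] The review-and-publish pipeline described above could be sketched in code roughly as follows. This is purely an illustrative toy, not Duke's actual tooling: the names (`DataPackage`, `review_package`, `FORMAT_SUGGESTIONS`) and the example DOI are hypothetical, and the real review is done by human curators.

```python
# Illustrative sketch only: a toy model of the curation review and citation
# steps described in the talk. All names here are hypothetical.
from dataclasses import dataclass

# Proprietary formats a curator might flag, with suggested open alternatives.
FORMAT_SUGGESTIONS = {".xlsx": ".csv", ".sav": ".csv", ".doc": ".txt or .pdf"}

@dataclass
class DataPackage:
    title: str
    creators: list           # e.g. ["Smith, A.", "Jones, B."]
    year: int
    files: list              # file names in the deposit
    has_documentation: bool  # is there a README describing content and context?
    license: str             # Creative Commons license chosen by the depositor
    doi: str = ""            # assigned just before publication

def review_package(pkg: DataPackage) -> list:
    """Return human-readable issues to raise with the depositor."""
    issues = []
    if not pkg.has_documentation:
        issues.append("Missing README/documentation on content and context.")
    if not pkg.license.startswith("CC"):
        issues.append("Please choose a Creative Commons license (CC0 suggested).")
    for name in pkg.files:
        for ext, alt in FORMAT_SUGGESTIONS.items():
            if name.endswith(ext):
                issues.append(f"Consider converting {name} to {alt} for portability.")
    return issues

def data_citation(pkg: DataPackage) -> str:
    """Build a standard data citation that embeds the assigned DOI."""
    return (f"{', '.join(pkg.creators)} ({pkg.year}). {pkg.title}. "
            f"Duke Research Data Repository. {pkg.doi}")
```

A deposit with no README and an `.xlsx` file would come back with two issues for the depositor; once those are resolved and a DOI is minted, `data_citation` produces the citation string that accompanies the published dataset.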
And for the PI, he really sees this as an essential step to ensure that the data being generated by the graduate students in his lab are properly described and safely archived. So for him, it's not only giving access to secondary users, but making sure that he himself, as the PI, will have access to the data into the future. Some other motivations we've been seeing from our depositors are the need to comply with growing journal data sharing policies and mandates. Some have also expressed to us that they really like the idea of depositing their data at Duke because they get institutional backing for long-term preservation and stewardship. Anecdotally, we've also had some express that they like the idea of having a second set of eyes on their data prior to it being released out into the wild. And we've been able to see through our work how we've helped people create richer documentation for their data, to meet that goal of reusability. As far as next steps for the program, we're currently in the process of implementing a new software platform for data within our repository: we're going to Hyrax, and we hope to be rolling that out in the fall. Another big next step, which we're really excited about, is that we're going to be joining the Data Curation Network, which Claire is going to describe in a lot more detail in just a moment. As far as lessons learned, we've obviously learned lots of lessons while developing our program, and I'm going to highlight just a few. One: obviously, we're good academics, so when we started our program we looked externally. We wanted to know what other people were doing. We performed an environmental scan, we did research into data curation practices, and we specifically looked at some of the early outputs from the Data Curation Network.
It was really useful to look at what some of our peers with established programs, such as Minnesota, were doing; it helped us avoid reinventing the wheel and provided invaluable information for benchmarking. As we've been thinking about how to further enhance the FAIRness of data within our repository, we've also really recognized through our work that a certain level of data curation benefits from in-depth subject and data type expertise. This is someplace where we see a lot of value in the Data Curation Network model of sharing curatorial expertise across institutions. And finally, we've been really lucky at Duke that we've been able to collaborate with folks both internally at Duke and externally while we developed this program. This helped us to keep a holistic view of how our program fits within the broader scholarly and data ecosystem, both at Duke and beyond. So now I'm going to pass it over to Claire, and she's going to talk about what's going on at the University of Minnesota and the Data Curation Network.

Thank you. Good morning. I'm the Associate University Librarian for Research and Learning at the University of Minnesota, and I'm going to give a very aggressively abbreviated history of what we've been doing at Minnesota and then talk about the Data Curation Network in a little more detail. So like Duke, and I'm sure most or many of you, we've been offering research data management and DMP support for quite a while, with our first formal workshops on that topic starting in 2011. In 2013, Lisa Johnston was sponsored by the libraries to participate in the President's Excellence in Leadership program, and she chose as her project to develop a data curation pilot, which proposed and tested the processes that we are now using to curate research data at Minnesota.
Her mentor for the project was actually the then Vice President for Research, Brian Herman, which was a great connection and champion to have made. Following that 2013 pilot and the lessons we learned, two things happened. One, we launched our data repository, DRUM, in late 2014; and two, in early 2015 the University released its policy on research data management, which articulates specific requirements and ownership, and also specific obligations and authorities for the group of organizations that you see here, including the libraries. So this was a pretty foundational document that conveyed specific roles for the library to the entire university. DRUM is part of our DSpace instance, the same DSpace instance that we're using for our IR, the University Digital Conservancy. Here you're seeing just two of the datasets that have been deposited within the last month, so we're seeing a nice diversity of content coming in. I know you can't see this detail; these are just some of our statistics, and you'll be able to see them more easily when you download the slides. After a slow start and then a couple of big peaks, we're now seeing more of a steady growth in deposits to DRUM. About 75% of the submissions are coming in from the sciences, the same thing that Duke is seeing, but we have some notable use from other areas as well. The department with the most deposits is anthropology, thanks in part to some graduate students. We've had about 85,000 file downloads from DRUM since it started. And we've also been gradually expanding our capacity to perform the data curation that the deposits require, in many cases moving from temporarily funded positions, we had two CLIR postdoc fellows who had a role in data curation and also a 50%-time grad student at one point, to more permanent lines.
And we're really trying hard to create balance across all of our disciplines. So since the design and launch, we've been expanding on all fronts. From the inception of the policy in 2015, there was something called the use case characterization committee (I forget exactly what the full name stands for) that was supposed to be monitoring use and needs across campus. Just last year, that group spawned a new subgroup, which we really think is going to be where most of the planning and work takes place: this new storage council. You see on the right there the core units: MSI, our Supercomputing Institute, and the AHC, the Academic Health Center. And you see a very clear role for the libraries in helping to lead. So this is not just going to be about storage; this will be about advocacy and developing champions on campus, and really integrating very deeply into the academic side, the schools and colleges. There has been huge growth in demand for training. We're having a lot of trouble keeping up with our research data boot camp, which we now offer twice a year for graduate students. We've had to double the enrollment from 50 to 100 seats, and we still have over a hundred people on the waiting list every time we offer it. So my dean and I are meeting in a couple of weeks with the dean of the graduate school to talk about how we're going to scale up this training that the library is increasingly offering for graduate students. And this is only one example; we're also doing this in digital scholarship. It's an interesting emerging role for us. One of the things that has been really exciting but challenging, and it's interesting that both Duke and Minnesota have had this experience with chemists, because I frankly did not expect chemistry to be the discipline that moved most quickly toward research data deposit, is that the National Science Foundation was getting frustrated.
Specifically, the Chemistry Directorate was frustrated with its funded projects and the slow rate of adoption of data curation practices, deposit, and sharing. So they targeted their Centers for Chemical Innovation to lead the way. They turned to all of them, including the Center for Sustainable Polymers, which is the center at Minnesota, and said: you have to accelerate the work that you're doing in research data deposit and sharing, we want to see that. They gave them a very short time horizon. And this is a big grant, a $16 million grant for the university. So the center turned immediately to the libraries, and we spent about six months having a team of science librarians work with them to develop a pipeline and a protocol, a way to get all of the data supporting their publications deposited and publicly available. We can see that coming; we know it's going to dramatically increase our deposits and the demand on our staff time. Before I turn to the Data Curation Network, I just wanted to mention that we are in the final stages of a new strategic plan for research data services. Since 2016, our team has been co-chaired by a colleague of ours in the Liberal Arts Technology unit, so we've had pretty strong collaboration with that part of the university, but as part of our new strategic plan we're anticipating much deeper integration with the collegiate technology units. There are 75 separate recommendations coming out of this. I don't know how we're going to do all of them, or how we're going to staff them, but this is where we really feel things like the Data Curation Network become important, so that we can actually consider scaling up in the way that we believe we'll need to.
So I'm delighted to start talking about the Data Curation Network by announcing that the Sloan Foundation has just funded the three-year implementation phase of the network, as a follow-on to the planning grant that they provided. We're extremely grateful to the foundation and to Josh Greenberg and his staff, not only for the support, but for the really critical guidance and input that they provided about the design of this project. These are the Data Curation Network partner institutions. Duke, Dryad, and Johns Hopkins are officially joining as of the implementation phase, but in actuality they've been involved for at least the last six or eight months in testing the model, participating in sustainability planning, and helping us with the now-successful proposal to Sloan. Tim, Jen, and Sophia have already talked about the requirements in the larger environment and the importance of having data that's FAIR. This is the central belief that we had coming into forming the Data Curation Network: it's not enough just to provide access. These data have to be well curated, and they will be more valuable if they are well curated. They'll be easier for scholars and future collaborators to find, they'll be easier for them to understand, and it will be more likely that they can be replicated. But we also felt very strongly that it was not possible either to fully automate this curation or for a single institution to create the capacity that we would need to curate all of the data that we were likely to see coming in.
We're all working in heterogeneous environments with lots of interdisciplinary research, and we don't feel it's realistic to expect that every academic library can hire a data curator for every type of data they're going to have coming in: GIS data, tabular and statistical data, audio and video, genomic data, all of the things that we've already seen coming into the Minnesota repository. The Data Curation Network is our answer to that, and it will actually create a shared pool of staff capacity across its partner institutions. We believe it will accelerate the local curation capacity that we each have, that it will really strengthen the collaboration between libraries and disciplinary researchers and projects, and that it will significantly enhance libraries' collective voice and influence in conversations about the future of research data. These are the major steps in building the network. I'm going to talk very briefly about what we learned in the planning phase, then how the workflow model looks right now and how we think we are going to be assessing our work, and then finish up with a few thoughts about sustainability before we turn to discussion with all of you. These were the major activities of the Sloan-funded planning grant, and I don't have time to go into each of them in detail. There is a very detailed report on our website with all of the results from the various activities that we undertook during the planning phase and all of the instruments that we used; it's all available openly on our site. So I'm just going to talk about a few things. One of the first things we did was to start with some baseline measurement across the six original partner institutions, so that we could identify where our practices converged and diverged.
Even though we came together because we were all already offering curation services, the way in which we were offering them was pretty different, or at least there was some difference across our institutions. So we knew it was important to identify where we might need to focus in order to develop a normalized model that we could use across our institutions. We also, and this was something that Sloan had specifically urged us to do in order to establish that researchers would actually value the curation services, held focus groups with researchers at each of the six institutions. We had over 90 researchers participate, and we asked them to share their perspectives on the importance and value of 47 different curation activities. We also asked them about the distributed approach, which is something we get a lot of questions about: how would people actually feel if their data were being curated by someone who was not at their own institution? We get that question a lot from librarians. Interestingly, the researchers didn't really seem to care about that; no concerns on that front. And there are some pretty clear indications that there are curation activities that they value. We also proposed and carried out an ARL SPEC Kit survey, to which 80 ARL libraries responded. About two-thirds, 51 of the 80 who responded, indicated they were already providing some kind of curation service, with an additional 13 who planned to and 16 who did not. But we do see a mismatch between the things that researchers told us they value highly and the services that we are actually providing or planning to provide. So this was an incredibly valuable activity to go through, and it really helped set the stage for some of the things we're going to focus on in implementation. Because we have limited time, I'm going to skip over the detail of a couple of the other things we did in planning, but they're covered in detail in the report.
I just wanted to briefly acknowledge all of the colleagues who were so wonderful about sharing their time and expertise. We had a lot of conversations with these projects and people at various institutions, and those have all been invaluable in helping to improve the model and the plan for the Data Curation Network. We hope to return and talk more about partnership as we move into implementation. These are our current partners for the implementation. As I said, Sloan was very keen to see a strong connection to disciplines and disciplinary repositories, so we're particularly pleased to have Dryad as a partner; they've been wonderful to work with so far. Pretty much all of the staffing that you see here is contributed effort, except for the coordinator: the coordinator position is actually funded by the grant. Everything else, at this point in the first phase of implementation, is contributed effort by the partners. Here's a very brief overview of the workflow model for the Data Curation Network. It is by design repository agnostic. The local institution remains the point of contact with its own local researcher and retains responsibility for the technical function of the repository if it's a local deposit; the data could also be going into a disciplinary repository. Once datasets are selected for DCN curation, the coordinator will conduct a preliminary review and then assign each one to an appropriate curator at whatever institution has the capacity at the time, based on an expertise match on the discipline and the file format. Then there will be a mediated communication between the coordinator, the local contact, and the curator. We do have an articulated set of steps that we go through, file-format-specific and domain-aware curation activities; we call them the CURATE steps. There's more detail on our website. And we, of course, plan some assessment activities.
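[Editor's note] The coordinator's assignment step described above, routing a dataset to a curator by discipline and file format at whichever partner institution currently has capacity, can be sketched as follows. This is a hypothetical illustration only: the DCN's real process is human-mediated, and the names here (`Curator`, `assign_curator`) are invented for the example.

```python
# Hypothetical sketch of the DCN coordinator's expertise-match step.
# The real process is human-mediated; these names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Curator:
    name: str
    institution: str
    disciplines: set = field(default_factory=set)  # e.g. {"chemistry", "ecology"}
    formats: set = field(default_factory=set)      # e.g. {"csv", "shp", "fastq"}
    has_capacity: bool = False                     # does this partner have time now?

def assign_curator(discipline, file_format, curators):
    """Return the first curator with capacity matching both discipline and format."""
    for c in curators:
        if c.has_capacity and discipline in c.disciplines and file_format in c.formats:
            return c
    return None  # no match: escalate back to the coordinator for manual handling
```

For example, a chemistry dataset in CSV would be routed past a fully booked chemistry curator to an available one at another partner, and a dataset with no expertise match anywhere would come back to the coordinator.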
These are the two prongs of the approach we're taking to assessment, and some of the key questions that we'll be seeking to answer. A really important part of the Data Curation Network is that we plan to transition to a self-sustaining model in year three. We've already consulted fairly extensively on this as part of the planning, but we will be engaging a consultant to help with what we know will be some pretty significant financial and legal issues to address. The model that seems to resonate right now, and the one that we're going with for initial implementation, is a hybrid of an in-kind and a fee-for-service model. The DCN will operate as an alliance of institutional partners who will contribute staffing and funds, but then we plan to also offer this as a fee-for-service to other institutions who either don't have the capacity to hire their own staff or need supplementary expertise. And the fee-for-service could actually go in other directions: it might not just be fees for curation services, it could be fees for training or consultations. So I think there are a number of ways that we might think about generating the revenue that we believe we'll need to sustain the network, at least the coordinator position and, if this is really successful, hopefully additional curation capacity. We did some preliminary testing of the membership model, because that was where we originally started and what we thought would be the model. We got a lot of skepticism from our own deans and university librarians when we tested it, and I think this has to do with membership fatigue: there are just so many projects out there, and they don't really know how to evaluate which is the right one to invest in. So we haven't completely abandoned that as a possibility, but the signs so far are that it's probably not the best model to go with. So just a last thought before we open it up for discussion.
We do plan to expand to new partners, and we would love to talk with any of you who are interested about this and our possible directions. We're planning to formally open up in 2020, but this would be a great time for us to be talking about the network in the community, and about what everyone else thinks about the fee-for-service model, the model overall, and how we can really strengthen it. So that's it for me. I'm going to stop here and just leave up on the screen answers to some of the questions that we've been getting about the Data Curation Network so far, but there are many more, and I think we'd all be happy to take those or other questions that you might have. So thank you.