Data Governance in the Scholarly Communication Ecosystem
Remarks by Mackenzie Smith, Geneva Henry, Ann Campion Riley, and Joy Kirchner at the 161st ARL membership meeting, convened by William Walker

Good morning. I'm Bill Walker from the University of Miami. I'm pleased to have this opportunity to reintroduce some of the fellows whom you just met. We had the pleasure of hosting a week-long institute last February in Miami on the politics of technology: not how we did technology well, but the politics of technology. It brought together everyone from our president, provost, new CIO, the chair of the faculty senate, and the vice president for human resources to work with the fellows. The benefits back to the library were enormous, because for a week the whole administration was thinking about the library, the future of the library, where technology was leading us, and our strategic plans, and they were wowed by the talent in the room, the inquisitiveness, and the ability of this cohort to grasp the challenges that are ahead. So the library came out really quite well after everyone left, and the president spent an enormous amount of time thinking about the issues that had been raised during the week. I'm very pleased that today's two presentations flirt with that topic, the politics of technology. We'll do the presentation on data governance first. We're running a little late, so we'll probably present for 20 minutes or so, then we'll take questions on that session, and then we'll go on to the second presentation, which is on data mining. In selecting a research topic for the RLLF program, Joy Kirchner from the University of British Columbia, Ann Campion Riley from the University of Missouri-Columbia, and Geneva Henry from Rice were interested in digging into the challenges of working with digital research data and its governance.
While increased attention is being paid to the curation of digital data and to licensing using Creative Commons waivers or licenses, users need to be aware that data governance issues impact its sharing, reuse, citation, and publication. While little work has been done in this area, Joy, Ann, and Geneva teamed up with Mackenzie Smith, who at the time was working with Creative Commons and still supporting research at MIT. Mackenzie has been a leader in this area of data governance and agreed to serve as their mentor. What follows now is a summary of their work and a discussion of the thorny issues around data governance.

I'm glad Bill cleared up that confusion, but I'm not a fellow. My role today is a minor supporting role, to just try to set up why this topic became an issue that rose to the attention of the fellows and was chosen as a project for them. So I'm going to keep this very brief, but I think we saw at the scholarly communication panel yesterday that data has been elevated to a recognized part of the scholarly communication ecosystem now, a very important part. It has a lot of similarities to traditional publications, especially as we move towards things like data publishing, but it has some really, really important differences that we're only starting to wrap our heads around, like the fact that it often can't be copyrighted. I got interested in this problem after starting work at MIT on how to begin to support eScience, e-research, and all the data issues. One of the topics that would come up week after week was around what we now call data governance: things like who owns the data, who gets to control the data, what quality concerns we have about the data, the provenance, whether or not it can be used and in what context, what format it sits in, and many, many other issues which you can frame as not really technological challenges, but policy and legal challenges.
And that is not my background or a particularly strong interest of mine, but I realized that unless we recognize it as an issue and start to form some research around this, we are all going to get trapped in the same boat that we found ourselves in with journals. And to the point made this morning about MOOCs and course materials, we're going to find ourselves in a situation where we're buying back our own research if we don't get on top of this now. Because while we're only starting to think about it as the library community, the publishers are way ahead of us in many ways in thinking about how to commercialize a lot of this research data as well. So it's a pressing issue, and I was really glad when the fellows decided to take this on as a topic. I hosted a workshop on data governance last year and invited these three wonderful fellows to come and participate in that workshop. The results of it are online at the Creative Commons website if you're interested. They're going to be talking today about a little bit of what we did, and hopefully you will be inspired to accept this as an important topic and help me begin to figure out how to solve this. Thank you.

Thank you, Mackenzie. I will keep this short, so you're going to get a whirlwind overview of data governance just to introduce you to the topic. I want to briefly cover a few areas: give you a definition of what we're talking about when we say data governance, then the legal and policy issues that come up around that, and the technology landscape and what that is looking like. Then we're going to turn it over to Joy and Ann, who will give you a couple of case study scenarios that they've actually worked with at their universities where these data governance issues have come up. Okay, so what is data governance? Once I figure out how to do the slides, I'll be able to tell you. Okay, so data governance is that overall system of decision rights that need to be defined around data.
It's the responsibilities that describe what can and cannot be done with the data, what actions need to be taken with the data, who can do it, and when. It's all about the rights. It's very similar to the issues we've dealt with around rights for publications, but data does get to be a little bit special. There are laws and there are policies associated with data. There are also strategies that you need to consider around data quality. This is all part of data governance. And then there are the processes that you need to have in place to ensure that data is being properly managed, that all of the rights associated with it can be known, and that it will be persistent into the future. So why is it important? I think Mackenzie gave you a really good overview of why it's important. Sharing is a big thing. There's a big emphasis right now with our funders on sharing data. They don't want to keep paying for data to be recollected once they've paid for it to be collected one time. Furthermore, we do international collaborations across the board now. That's sort of the norm. Very few scientific researchers do research solely within their own institution. They are collaborating nationally and very often internationally on their research, and being able to share is important. So you have to know what rights are associated with that data if you're going to be able to share it. I think the biggest area where we've been brought into this is with the funder mandates that have come along. That has driven a role for the libraries to go in and start working with the researchers more. It hasn't really gotten us too much into this whole international area, where the laws are different and the policies are different as well. So, the issues that arise with this: there are the legal issues of what can and can't be done. As I'm sure most everyone in this room realizes, there's the saying, well, data can't be copyrighted.
First of all, you have to realize when you say that, that it is a United States norm, and I believe it's the same in Canada as well. And that is true. A database can be copyrighted. In the United States and Canada, formatting of the data can be copyrighted. When you go to Europe, data itself can be copyrighted. So when you get into these international collaborations with your scientists, they are not thinking about this. That is something that you need to understand, and you need to be there for them to help educate them and help bring the faculty and the administration along. The cultural norms of how scientists have standardly shared data are very informal: I like this scientist. I trust them. They can use my data. I don't trust this person. They can't. And then there are the policy issues around this. Every institution needs to have a data policy. Your institution should have one; if it doesn't, it needs to get one. That's something where you can step in and help out with the administration. So, the legal issues: there are a number of ways to handle this. You can look at controlling data ownership through contracts. You can look at possible waivers. And you can look at policies as a means of doing this as well. There are pros and cons to all of this. We don't have time to go into them today, and I think the material on the Creative Commons website that Mackenzie mentioned will go into a lot of the details to explain some of the pros and cons of doing this. But on the legal issues, data ownership is a really big deal, and it's very unclear who owns it. I think you'll see from the case studies that PIs think they own it, postdocs think they own the data, and the institutions think they own the data. So data ownership is very muddled right now as you work your way through it. And just be aware of the fact that a database can be copyrighted when somebody tells you that, well, that data is not copyrightable.
If it's been formatted and there's a presentation to it, that formatting and that value add can be copyrighted. And then, as we talked about, there are contracts, public licenses, and waivers. With contracts, you have a lot of control, but you can't perfectly control everything, and that does restrict sharing. With public licenses, there is a Creative Commons public license available for licensing the data. And then, does that not show enough? Okay. And then waivers. I'm looking at one set of slides and you're seeing something different. So waivers are yet another approach to get around the restrictions on sharing data. There are cultural norms to be aware of as you go to international collaborations around this: how you restrict the data, what the funders are expecting you to do with the data, and then, long term, how this will be supported, sustained, and able to be repurposed and reused. NIH, NEH, and NSF are all current funders that do have requirements on being able to share data and do some form of data stewardship. You have to be aware of your technology environment. We need data sharing licenses that are machine readable so that researchers know what can and can't be done with the data. I will end it at that and go into the case studies, which will bring up these issues in a much more relevant context.

So we wanted to share a few real-life scenarios that we've encountered dealing with data governance, and I think I'm going to go first. This story illustrates the need for awareness on the part of researchers, librarians, and faculty supervisors, especially of graduate students who are involved in these data sets and data issues. One day I'm sitting at my desk and I get a call asking that something be removed from our institutional repository, called MOspace. The fellow was asking for a dissertation to be taken out because it used, quote, "my data," unquote, and he wasn't ready to release that data for people to have.
And so I said, well, are you the author of the dissertation? No. Are you the advisor of the person who wrote the dissertation? No. Are you representing the graduate school in any way? No, but it's my data. So as the supervisor of the repository, this was a pretty easy no: I wasn't going to take it out. But his belief that this was his data was founded on the fact that he worked in the same lab with the fellow who had completed his dissertation. Interestingly, the data was in GenBank, which is, as most of you know, a very open, open access repository affiliated with NIH and NCBI, the National Center for Biotechnology Information. I said to him, where is your data? And he said, well, it's in GenBank. I immediately looked online and saw all the information about GenBank to try to cite it back to him: here are the rules that GenBank makes. So I did that. It wasn't effective. He said, but it's my data. So I referred him to his lab director and the supervisor of the dissertation, and we had a little conversation. The fact that he didn't know what the rules were, or I should say what GenBank's policies were regarding data reuse, I think was interesting. And he was a postdoc researcher, so he wasn't a brand new master's student. I had some question in my mind as to what the advisor and the lab supervisor might have known and what their involvement was in this conversation too. I know Joy has some more examples, but I share that just because these are very real issues, and there is a lot of lack of awareness out there. Thank you.

First up, I want to thank Mackenzie Smith and the RLLF program for allowing us to come to this incredible workshop last December. There were many high flyers. There were researchers and publishers represented. So I really encourage you to take a look at the Creative Commons site and take a look at the report. It was fascinating to be a part of that, so thank you for that.
I didn't realize when I signed on to this particular project that it would actually have a direct bearing on my own work at my own institution. Very, very briefly, just to set the context: at the University of British Columbia, like most Canadian institutions, we've been dealing with a fairly contentious copyright environment. Part of my institution's response was to really step up an educational campaign and educate our community about their copyright compliance obligations. In fact, I was one of the folks tapped in the library to lead that response. Last year, on August 8th, there was an announcement from our provost that the copyright environment had changed and that our faculty and our students must be more aware of their copyright obligations. In two weeks, we set up a whole structure for educating our community. In one year I went to 100 faculty departmental meetings, talking to around 2,100 faculty and putting on workshops, and through those workshops and discussions I heard a lot about data as well as other reuse of material. What I've been hearing from our faculty was great confusion about what ownership of data means, whether there was a policy on campus that governed ownership of data, whether there was an obligation through their grant funding agencies, whether there were open data obligations, and so on. But what I want to illustrate, because I think this is a very interesting scenario that came to me recently, was working with our graduate students. We've pushed our campaign now to reach the PhD and graduate students on my campus, and they are truly, truly confused about what it means to reuse data, how they credit it, how you deconstruct what is copyrightable and what isn't, what is factual, what they can reuse, and so on. So there has been a lot of discussion with them around this issue.
The scenario I want to talk about was one that came to me last week that I'm still sorting through. It was a new graduate student working in a biology lab. Her professor asked her to build on the data that a former PhD student had been working on. That former PhD student left the institution and never completed their degree. The student then was required to contact the former student about building upon that data and reusing it. The former student felt she owned the data and wasn't really willing to release it. The faculty member felt he owned the data because, after all, it was his grant funding dollars that supported the lab and the research. In working with this student, who's truly caught in the middle, I looked for what kind of grant funding obligations applied to that faculty member and whether there was an institutional policy around data, which may be there; it's not clear to me. I looked through many of the policies, deconstructing what is actually needed. So I'm still working through this issue, but I think it illustrates very well the need to have a policy and some real understanding around the governance issues for all of us on our campuses, and around what is copyrightable and what isn't. That's my scenario. Thanks.

Questions, comments, advice?

Hi, Bridget Burke from Boston College, and actually referencing experience at the University of Alaska Fairbanks. I think there's another piece of this puzzle, and that is communities, especially when you're collecting either narrative or quantitative data in a cross-cultural setting. There's a definite movement toward really insisting that there's an ethical reason to make sure those communities also receive that data in a format that is useful to them. They have been the subjects. They deserve to get it back. So it is a sort of policy issue. It's also an ethical issue. I wonder if anyone's had experience with that.

Thank you. Anyone?

Is this on? Yes.
One of the things that we are also thinking about is how this is going to play out in the humanities. Most of the data sets, and the data experience overall, have primarily been in the science areas. We know that digital humanities is a growing area. Not all of the texts that people want to analyze are in the public domain, of course. And we heard from John Unsworth a little bit yesterday about some of the issues involved there. So we're really not sure where this is going. We're not sure what happens if people are integrating their criticism and their commentary into these texts, or how that is going to play out as we try to look at data sets for textual kinds of sources.

In sharing with the different cultures, I think that's a really good example of how things start to get really complex, because that gets into some of the cultural norms, not only about how scientists share information with each other, but about what is expected in those cultural communities. If we do not, as libraries, take the forefront on how this gets managed, help guide how those policies are set up, and identify the types of licensing or waivers that need to be put in place immediately, it is likely that those cultures will not have a voice in this at all. So it's very important. We're about the only place that is going to care about that kind of information, and it can be handled if the data is properly licensed and if the policies are set forward at the start of the research project and of collecting the data.

That's a very good point. Other comments? Yes?

Hi, Elaine Westbrooks from the University of Michigan. Did you look at sensitive data at all in your case studies or research?

Sensitive data is a great one, and Mackenzie just brought up the example of HIPAA. We've got sensitive data all over the place, and there are a lot of existing policies that will deal with much of that. So that area actually is probably better covered, I would think, than many of the other types of data.
They try to make Mackenzie disagree. So they make it really clear what you can and cannot do, and they have really honed those policies quite well. It's everything else that people haven't paid attention to.

One last comment from the audience. Yes, Mackenzie.

I think I should make one more remark that might make this a little more relevant to many of you, which concerns conversations on campus with your Vice Chancellor for Research or equivalent, and that is that a lot of universities are desperately trying to commercialize their own research, and that comes into direct conflict with both federal agency funding policies and, in a lot of cases, the best interests of scientific progress. So we are the opposite side of that debate, and if we're not there, then, you know, the Vice Chancellors for Research will have their way. So I think it is really important to engage in this topic as data becomes more and more of a resource, an asset of the university and the researchers that we're supporting.

So maybe with that. Jim.

Jim Neal, Columbia. I think this is really important in the institutional context as we've discussed it. We are in the process of setting up a working group within the Digital Preservation Network, DPN, to take a look at this suite of issues from the collective perspective, so that when we start to gather lots of data input from a variety of sources, we can ask how these data governance issues translate into that collective environment. So this was really great. Thank you.

Thank you, Jim.

Thank you for listening. Music was provided by Josh Woodward. For more talks from this meeting, please visit www.arl.org.