Good day, ladies and gentlemen, and welcome to today's governance of research data webcast. To submit a question or comment at any time during the webcast, please click on the ask question button at the bottom of your screen. You can type your message into the box and click on the submit button. At this time, it is my pleasure to turn the floor over to Mackenzie Smith. Ma'am, the floor is yours. Thank you, Patrick. Good afternoon, everybody, and welcome to this very-close-to-the-holiday ARL DLF E-Science Institute webcast on the topic of research data governance. My name is Mackenzie Smith and I will be presenting today. I am the special consultant for ARL leading the design of the E-Science Institute. I'm also a research director at MIT and a science fellow at Creative Commons, so I've been doing quite a lot of work over the past year or two on this issue of data governance, which I will be walking through with you today. This is not a topic that we discuss very often in our community, but one that I think is becoming increasingly important, especially in institutions that want to begin doing data curation. So hopefully you'll find this interesting and it will provoke some questions and discussions. I also wanted to remind you that we have the first of our capstone events for the Institute next week in Atlanta, Wednesday through Friday, so we'll be seeing some of you there. And the next webcasts will be in December. We have two scheduled, and they are listed on the Canvas site if you'd like to see when those are and get registered for them soon. I also have on the call today Deetta Jones, who's going to be moderating the questions. And I wanted to warn you that we have a short poll at the end of the presentation and the question session, just four questions that we'd like your answers to before you sign off for the day. So we'll make sure to finish the presentation a little early so that you can answer those questions for us.
And with no further ado, I'll just dive into the presentation. So I wanted to start by defining what we mean by data governance. This is somewhat new terminology, but I think it will resonate with many of you who work in the library field. Data governance is the system of decision rights and responsibilities that govern who can take what actions with what data, and when, and under what circumstances, and to do what. So it's the who, what, when, where of the data rights and responsibilities realm. In particular, it includes all the laws and policies associated with data and the strategies for managing data in the context of an organization. And that organization could be a legal entity like a university, or it could be a virtual organization like a highly distributed research collaboration that has some kind of formal MOU to govern it. Data governance includes all the processes that ensure data assets are managed throughout the organization. This includes business processes and risk management. So this is extremely familiar to those in the archives community who have to do this regularly. But in the library community, the risk management aspect has not been as front and center as some of the other issues, like the legal issues we deal with around copyright and licenses. Finally, data governance ensures that data can be trusted and that people are accountable for the actions affecting the data. So it's this set of things, primarily laws and policies and policy compliance, that we're talking about here, and as with any other kind of content that universities and their libraries and archives deal with, there are many policies and laws that apply to data that we need to be aware of. So that's what the presentation will go through today. And I wanted to start by explaining why this is such an important issue, and that is that there are really two big goals for data curation and data archiving. One is research reproducibility.
There's this idea, this scientific principle, that any research results should be reproducible by others. And in order to do that, you not only need to know what that researcher did, i.e. have access to the research paper, but ideally you would have access to all the material and other resources that they used to do the research, the experiment or whatever it was, including the data. And secondly, there's growing pressure to be able to reuse that data in new contexts. So not just reproduce the research that was done with it originally, but use the same data for new purposes and integrate it with other data to achieve new research capabilities. And I didn't want to leave out that the other two drivers for data archiving and data sharing are the fiscal responsibility we all have for tax-funded research, which is driving a lot of the new policies coming out of the U.S. government funding agencies, and also just having the broadest possible impact that we can for the research, and sharing data is one means to do that. So those are the goals of data archiving, and you can see right here that there are some policy implications for doing this. So just to take a step back for a minute, I wanted to mention a few trends in research data archiving that are relevant for governance. One is mandates. We see journal publishers and funding agencies mandating some sort of statement about data sharing and data curation. The journal publishers are doing it usually in conjunction with particular journals that they have, where they will mandate that you deposit your data in a particular type of repository before you can publish. So an example of that would be many of the biology and bioinformatics publishers requiring deposits at GenBank before you can mention a gene in an article that they publish.
More recently we have the evolutionary biology community getting together and saying, okay, in our journals we will mandate that the primary research data has to be archived in a trusted repository before publication, and they have the Dryad service coming out to help do that. So on both sides, journals and funders, we're seeing increasing expectation that data will be archived, and that comes in the form of a mandate which institutions are required to address. However, there's no common practice for how this will happen yet, and the expectations are somewhat unclear. They're a little fuzzy. Researchers have no clue and very little interest in dealing with this stuff, so institutions are scrambling to figure out what they should do. Therefore, there's a really interesting opportunity for libraries and their partners to help with this problem and come up with a reasonable strategy. So let me just talk for a minute about what the researchers' attitude is to all of this. What do researchers need from this? First of all, in order to justify archiving and sharing their data, researchers need credit for that work. They want professional academic credit like they get for publishing research articles, because it's just as time-consuming and labor-intensive for them to publish their data as it is to publish a research article, and today the system of tenure and promotion and similar kinds of reward systems don't really take into account work that's done with data. So if government funding agencies and publishers are going to require them to share their data and do so explicitly, they would like some way of getting credit for that, and I'll be talking a little more about that later. Secondary goals for researchers are that they would often like to be able to reuse the data themselves, either to reproduce their own research or to do new research on their own data. So they want exclusive access in many cases, but mainly they just want to be able to get back to the data themselves.
Secondly then, they want to be able to integrate their data themselves and make it easier for other people to integrate their data with other data. So that's interoperability, and this is one of the things that the government agencies are pushing for because they want to be able to repurpose the very expensive investments that they're making in producing this data. And as big science, e-science and data-intensive research become more and more interesting, the need to have mechanisms to pool data and aggregate it and integrate it becomes more and more of a high priority for us. The next rationale is that increasingly agencies require you to explain how you're going to share and archive your data in your grant proposal. We all know about the data management plans, and there is a need for researchers to get help in complying with the grant terms. So if they have promised that they will do X with their data, they need to be able to make good on that commitment in conjunction with their university and its services. And finally, researchers need help. They need advice from people they trust because they're not interested in the legal issues themselves. They do not have time to understand the nuances of what is and isn't copyrightable. This is something they trust the libraries to help with, in addition to the research office. So there is an important role for libraries to play in coming up to speed on these issues so they can advise researchers on sensible strategies. So let me just point out something that I'm sure is not news to many of you, and that is that in the National Science Foundation grant proposal guidelines, where they now have these data management plan requirements, of the five categories of things that you may include in your data management plan, two are actually policy related. They're not related to what the data is or how you'll share it or any of the mechanisms.
They're related specifically to policies for who can get access to the data, under what confidentiality and security mechanisms, what intellectual property you're claiming, and what other rights you may have over your data, patent rights, for example. And then this point four is policies and provisions for reuse and redistribution. What can people do with your data once they get their hands on it? Can they subset it? Can they combine it with other data? I think this is an acknowledgement from the NSF that this is too often implicit. People post their data sets on websites. They may or may not include documentation with that about what the data is, but they almost never say what rights they're asserting over the data and what you as a fellow researcher can do with that data. So the NSF and NIH and other agencies are saying that's not really okay. We have to be more explicit about this, so that if you're not just talking to a colleague down the hall but collaborating with a researcher in another country, it won't be left to the lawyers to figure out if this is okay to do or not. And sort of a footnote to that is specifically around international collaborations. A lot of us are dealing with research universities that collaborate extensively outside the U.S., and the NSF policy in particular, in its frequently asked questions, answered the question of whether you should be concerned about data management policies if you are collaborating internationally, and their answer was definitively yes. This is an issue because, of course, data is a type of content, and the intellectual property laws vary widely across jurisdictions. So if you're collaborating with researchers in other countries, it may be that the laws don't mesh very well.
So if you're combining data, you may find that you think that you can share it under certain circumstances, but one of your collaborators is under a legal regime where that is not possible or you need special permission to do it. So you have to be somewhat careful about these kinds of agreements, especially if you're working internationally. So a lot of that was background, kind of the rationale for why this is important to do. Now I'm going to turn to some definitions before we dive into some of the legal issues. So I just wanted to make a point that I think gets lost a lot when we talk about licensing data and sharing data, and that is that credit is what researchers really want, as we've already discussed. They're looking for recognition of the effort that they've made and its contribution to science through mechanisms that in the past involved citation. When you publish a research article, it gets cited. That's the mechanism to acknowledge your dependence on someone else's research. So citation is the norm in scholarly communication for supporting evidence, and it has become a proxy for giving credit, which is why you have these bibliographies in promotion and tenure cases and things like that. So credit is what researchers want, and that's typically done through citation. Legal tools, as we'll see in a minute, usually talk about attribution as the requirement, and I just need to point out that attribution is not the same thing as citation and not the same thing as credit. It can be, but that's not really what it means in the legal system. So we have to be careful not to assume that when we say we require attribution, we're going to get the credit that the researchers are looking for, and there are ways of addressing that or thinking about that that we'll go through in a minute.
Okay, so I need to spend a few minutes, probably about 10 minutes, going through the legal mechanisms for sharing data today, and I need to preface that by saying that this is not very well-established law, because data can be anything. Absolutely anything can be data. So in e-science, when we talk about data, we're basically talking about numeric data or observational data. Things like an MRI would be considered data, things like sensor output are data, statistical models can be data. They take many, many forms, but the law doesn't care whether something is being used as data or not. It cares what form it's in and whether or not it was a creative work. So the mechanisms for sharing data take three forms. There are licenses, and these I will talk about in more depth in a minute. There are contracts, and then there are waivers, which means that you don't expect anything. Both licenses and contracts are legal tools, so they typically allow you to require attribution, whereas a waiver does the opposite. It says, no, you don't have to give me any credit, or you don't have to give me any attribution, I should say. So I just wanted to make the point that licenses and contracts are ways that you can set terms and conditions on the users of the data, whereas with waivers, you're kind of washing your hands of what happens later. But that's not necessarily a bad thing. All right, so copyright. In the United States, copyright does not apply to facts. You can't copyright facts at all. And most scientific data that we deal with is factual in nature. So that makes the U.S. somewhat unusual, in that data is not copyrightable, period. However, you can copyright collections of facts, particularly if a lot of creativity went into the selection of them, and you can copyright organizations of facts. So if you design a database, that database schema is copyrightable, and the information inside of it is not.
And then there is varying case law on things like whether or not you can copyright an ontology, for example, or a taxonomy, or a controlled vocabulary. Most of the case law indicates that you can. So what you have in reality is this mixed bag of aspects that are copyrightable and other aspects that are not copyrightable, kind of a messy mix of the two. So just to finish this point, if you're looking at a database, and you've just found it and you haven't agreed to any contracts or anything, you could quite legally extract all the data from that database without infringing on the copyright of the database as a whole. Okay. I have a question that came through, and I just want to jump in and ask it now. Is this a good time for me to ask this of you, or should we wait? I think if these are questions related to things I've said, it's fine with me to just dive in and ask. Okay. This is back a few slides, but I'll go ahead and ask. It's about the goals of the researcher, so I'll go ahead and read the question. What is the basis for your point regarding the goals of the researcher? Did you conduct a survey, focus group, or some other type of research? Yes. Well, yeah. A lot of research has been done on the goals of researchers, particularly through initiatives like the DataNet program, and in places like the UK, where JISC has made this a very high priority. So they've surveyed researchers, they've written lots and lots of case studies of researchers, and of course I'm generalizing. You know, the attitudes of researchers vary widely across disciplines and countries and career stages and all kinds of things, but if you kind of boil them down, the thing that has emerged over and over again is the need to provide credit for the work, so that there's some benefit to the researcher directly, other than just, you know, that it's a good thing to do, right?
So I don't have citations for that research with me today, but it's not hard to find, and I'd be happy to try and dig it up and put a list of a few studies with the slides later. Thank you. And while we're taking questions, one more just came through. Okay. Please go into more detail on the difference between attribution and credit. Okay. Going back to that slide, I think this will be clearer as we go through the next set of slides, because the reason I'm bringing this up is so that as we go through the kinds of legal tools that are available, you understand that legal tools for attribution may not be achieving the goal that we all assume they will. So I wanted to define these up front, but let me get through the slides on licenses and contracts and waivers and then see if this is still a question. Okay. So we were talking about copyright and what it applies to. Okay. Yeah. So I just wanted to give an example of a license that is typically applied to copyrighted content when you want to share it, and that's a Creative Commons BY license, which is an attribution license. All the Creative Commons licenses require that you have this BY term, which means that attribution has to be given to the person who is putting the creative work out there. And the reason for that is that in the creative world, attribution is the sort of acknowledgement, that "copyright by Mackenzie Smith" kind of statement, and that is what most creators are looking for when they publish works. It's a little different from what scientists need. So Creative Commons licenses apply to data and databases to the extent that they are copyrightable. Okay. This is a very, very important and nuanced point: some data is copyrightable, in which case you're absolutely encouraged to apply a Creative Commons or similar kind of license to it when you want to share it. But other data is not copyrightable, and you can put a license on it, but it has no effect under the law. It's not enforceable.
So the second point is that only data uses that implicate copyright trigger attribution requirements. If I see a BY license that requires attribution, I could do something like extract the data from that copyrighted database and reuse it without providing attribution, because I'm not infringing on copyright in that case. So it only triggers the condition if you've actually infringed copyright, and it's often the case that you're not infringing copyright if you just use data. Data uses that don't implicate copyright do not trigger the attribution. That's another way of saying what I just said. So in other words, a lot of people are applying licenses, Creative Commons and other kinds of licenses, to data, but if the data isn't copyrightable it will have no effect, except possibly to slow down the user while they go check with their lawyer. If the data wasn't copyrightable, you can't use this kind of mechanism to assert rights that you don't have under the law. So the issue with licenses is that they apply to an underlying right that you have, and it's hard to know whether you have copyright for a particular data set. Some of them are very clear-cut. If it's just an Excel spreadsheet with a bunch of numbers in it, no, it's probably not copyrightable. If it's a very sophisticated database of MRI images, then probably yes, it is copyrightable. Those images are usually considered creative works even if they are just scientific images. You'd almost have to get a legal opinion in each case about whether the data is copyrightable or not, which is a very difficult situation for most researchers, who wouldn't really know, and even most university lawyers who are IP experts often don't really know much about data in particular, so they have a hard time making good assessments about whether something would fall under copyright or not. So it's really hard to know when a license can apply, and this creates some risks.
You may be applying a license to data that you can't copyright, so you're misled into thinking that you're now protected. And on the other side, a user who finds data that is licensed may think that they have to comply with that license when in fact they don't. They're then over-complying with what is required of them by law. License attribution requirements are also very inflexible. Let me give you a scenario that's actually happening quite often. You have a variety of data sets, each of which has one of these attribution licenses attached to it, and some poor researcher now wants to combine a thousand of those data sets. Each of the thousand requires attribution back to the provider of the original data set, so you end up with a situation called attribution stacking, where as the data set accretes other data sets and grows over time, you have to provide credit every time you do anything with it to all of the providers via some mechanism, and the mechanism for doing that is often not specified in the license, so it gets very onerous. This is already a big problem with examples like Wikipedia. If you download that, the attribution page is 50 pages long. And so we're seeing this in science as kind of a problem for people who want to aggregate data and want to comply with the terms of the licenses, but the licenses require them to do things that would just be absurd. Furthermore, it's not clear how you would provide attribution, what mechanism you would use to do that, if the product is an aggregated database of a million or two million data points. So again, this is probably not a very efficient kind of mechanism or legal tool for us to use in the e-research scenario. So the opposite of a license is a contract, and where this gets a little confusing is that these are typically called licenses. If you're buying a software package or signing up with a new vendor, you sign an end user license agreement or license agreement.
Those are actually contracts, contracts between you and that vendor or you and that supplier, but these have become very popular for data archives, and they're called DULAs, data usage license agreements. So in this slide you just see a few examples from fairly large data archives that have large communities around them, and this is just to make the point that these are almost the default behavior now for specialized collections of data that have been collected in a particular domain where they have a governing body, usually the organization that's maintaining that archive. So explaining contracts in a little bit more detail, contracts don't rely on underlying legal rights. It doesn't matter if you have copyright or not, or other kinds of rights on your data. These trump any kind of underlying legal rights. And they rely on a classic kind of offer and acceptance model, like a click-through or a terms of use statement on the website where it says something like, if you download this data set you are agreeing to the terms of use that you can see over here on a web page. So these are mechanisms that you should all be familiar with in some scenario or other, and they usually spell out exactly what the formalities are for using the data sets, whether it's attribution or some other kind of term and condition that they'd like to impose on you if you're going to use the data. These have huge downsides. If you've ever read one of them, it's confusing. They're written in legalese, which most staff can't make any sense of and most researchers won't even look at. There's no standardization for these across research, across science. Each user agreement usually has different terms and conditions. Many of them are contradictory, so if you're then trying to combine data from two sources with opposite terms and conditions, you often just couldn't legally combine the two data sets without going back to the supplier and asking for a new contract, something written explicitly for you.
So there's some evidence that researchers just avoid data if they can't figure out what the terms of use are. This wouldn't necessarily be a big barrier if you're working in your own discipline and it's an archive that you interact with all the time, like ICPSR, for example, if you're a social scientist. If you use that all the time, the fact that you don't understand their terms and conditions wouldn't slow you down. Where this typically becomes a problem is in interdisciplinary research, where somebody may want to use data from a field outside their own. They don't understand the conventions, they end up having to interpret the data usage license agreement, and they don't understand what it is requiring of them, so they just give up. So this is a fairly big problem today with this approach, even though it's a very popular one with existing data archives. So here's a little legal issue, and I really should point out at this moment that I'm not a lawyer, so these are things that I've picked up from talking to lawyers at various universities and at Creative Commons, but I think they're generally true statements, and we can provide you with the sort of references that you might want if you need to hear from an actual practicing lawyer. What I've been told is that unlike a license, which relies on an underlying right like copyright, contracts only bind the two parties that agree, the offerer and the acceptor, which means that if you obtain licensed data, you're bound to the terms of that contract. If you share it with someone else who didn't agree to the terms of that contract, they are not bound by it. So this seems very counter-intuitive, but what it means is that that license, that contract, only applies to you, the first recipient. It doesn't apply to downstream users of the same data unless you can somehow force them to go back and agree to the same contract. So you very quickly lose control over the data in this scenario if it gets out beyond the initial recipient.
So in that sense contracts are actually more limited than licenses, because licenses which rely on copyright, so we'll call those copyright licenses, extend to anybody who touches the data, because the law applies to everybody equally. So hopefully that point is clear, but again this is something where I think practice in the wild isn't really reflecting this legal reality. So the other point about contracts, and why we don't like them so much, is that they often have broader reach than a copyright license would, because they're not tied to a legal right. As I said, contracts trump what you have by default under the law, which means that you can actually require more from users than the law would require. So not only attribution, but you see a lot of contracts that require things that seem a little bit over the top, and you can take away the public's rights. So let me give you a couple of examples. There are people who share data under these data usage license agreements that require anyone who downloads and uses their data to automatically include the data provider as an author on any paper related to the research that used the data, even if that person has no idea what you did with their data. I'll show you a couple of other examples, but we don't like contracts that limit what we, under the law, would like people to be able to do with research outputs, whether they're articles or data. So if you're familiar with the whole issue around exercising fair use and the erosion of fair use rights over time, here's another situation where these contracts are eroding the rights that we have to use this data for whatever we need to.
As another example, kind of a funny one, the Canadian Government recently created this wonderful program to share government-funded research publicly on an open data site. The URL's right there. When they initially posted it, they included a data license agreement, one of these DULAs, for unrestricted use, and they were trying to make this as liberal and open as possible, but one of the conditions that they had there is that if your use of the data would malign the country of Canada, you can't do it. So people screamed and said, well, how do we know what you would consider maligning the country of Canada? And so within a week, I think, they had to take that term out of this license. But the point is, if you inject terms and conditions in there that are vague and subject to interpretation, it can be quite disastrous and not achieve the goals that you had in mind. One thing I think we're toying with is the idea of creating standardized data usage license agreements. So this is an area where I think libraries could potentially play a role, helping define what terms and conditions satisfy the needs of the researchers but would be standardized in a way so that you don't have to worry about whether you're agreeing to contracts that contradict each other or don't interoperate very well. So that's an idea that is getting some traction. On the other hand, the idea of using contracts in general is not everybody's favorite choice, so this is something that I think we'll be debating in the community for a while. Can I jump in with a question here? If this is the case, why are licenses referred to as contract law, not copyright?
Yeah, I mean, the language here is really confusing, and I'm sorry, I didn't invent it. When I say license, I'm talking about something like a Creative Commons license that is basically saying, yes, I have this right, this copyright, and I am licensing this to you without requiring what copyright requires, right? So that's where the Creative Commons licenses come in. They are actually licenses. The things that we're looking at now are called licenses, but they are actually contracts, because there's no underlying right that they're relying on. So I'm afraid that the vocabulary is just kind of confusing, but I wanted you to understand that these usage licenses that we all click through all the time are in fact contracts under the law, because they don't rely on a right. I'm sorry that is confusing, but that's the best I can do. Okay, now we're going to get to the third option, and that is a waiver. Okay, so you have some rights in the data. Maybe you're not sure. It's a mix: some of it's copyrightable, some of it's not, and particularly in other parts of the world where they have sui generis data rights, like in Europe. In order to provide data interoperability, and legal interoperability of data, really the only way you can achieve that is by giving up your rights. So this concept of a waiver is relatively new, just the past couple of years, but it does provide legal certainty to everyone, because there's no need to decipher what may or may not be copyrightable in a given data set, and there's no need to try and ascertain what the contract is requiring of you, because there's no contract. It's better than saying nothing, because it's explicitly letting researchers know, yes, it's okay for you to do what you want. So from a legal perspective, waivers are just the best possible solution, because they take away all the uncertainty and all the friction. However, for the data producer it means a complete loss of control. They can't require any terms and conditions of the people. They
can't require attribution, they can't require registration, they can't require to be a co-author, they can't say anything at all, because they've waived all their rights. So what we recommend in this scenario is that you stop relying on the law to protect your data and start relying more on the scholarly norms in your discipline. Let me just provide a quick example. In publishing research articles, the traditional kind of scientific research publishing, there's no law that requires a researcher to cite another researcher whose work they built on. If they fail to cite research that they relied on, there's no court of law that would do something about that. It's the community that would come down on that person if there was clear dependence on the other researcher; the remedy would be scholarly opprobrium. We've just relied on scholarly norms in publishing forever. So it's interesting that as we move more towards data publishing and data sharing, everybody goes straight to the legal tools like licenses and contracts, instead of working together to come up with new norms for how we could provide the right kind of credit and for what kind of disciplinary action we could take if people fail to do what they're supposed to do. I suspect that part of the reason for that is that there's some sense that there might be commercial value in the data that wasn't there in the publications, or that the data is somehow maybe subject to patents, but nobody's really sure how that might play out, so they're just being very conservative and falling back on these legal tools. But in a perfect world we could all just waive rights to data, so that it's nice and clean and out there and integrable and reusable and all that, and then have a set of scholarly norms that we've developed that would tell us what we need to do to provide credit and meet the needs of the scholarly communication ecosystem.
Alrighty, so this is the question: some of the concern may also have to do with authenticity, knowing that the data
hasn't been altered since it left the source. Is there a legal mechanism for this, or is this a technical issue (checksums, etc.)? That's a very good question. There isn't really a legal mechanism. You could put in a contract that the recipient of the data is not allowed to change the data or subset it in any way, so it's not changed. Really, beyond that, there's no way that you could ask people not to change the data. So authenticity is sort of a technical problem that goes along with the legal tools under the data governance umbrella. So you're raising a very good point, that authenticity, which I would link to provenance... I need to speed up so I can get to that, because there is a set of technical issues that kind of fall out of all of this that I need to tackle, and that's one of them. So let me speed through the rest of this legal stuff and then we can move on to those.
I just wanted to throw this slide up. CC0 is the waiver that was put out to implement what I just described. So you have Creative Commons licenses for cases where it's clearly in copyright and you want to make sharing possible, so there are no standards for that at all, and then you have these waivers, this one being the one that's most typically used. Lots of data providers are starting to adopt CC0 as the waiver because it's very clean and well understood, so if you were going in that direction, this would be where you would start.
Okay, so just to summarize the legal tools quickly. The law is a mess in this area. This law is often from domains that are not science or social science; it's from the commercial realm. The licenses have uncertain scope because of this uncertainty about what can be copyrighted, and the licenses are inconsistent with this norm of citation and acknowledgement that we're looking for. Contracts are a way of having absolute control over your data and what happens to it, but you have to remember that a contract only controls the person who agreed to it and not anyone else, and it can be
extremely burdensome if you've imposed a lot of terms and conditions that people don't understand or can't really comply with. And the third option, waivers, is legally very clean, very nice, supports scientific reuse and all of those good things, but you've completely lost control and have to rely on scholarly norms, which is difficult for many researchers to get their head around.
The next step. Okay, so every approach requires some loss of control. Even the contract only works with the parties that are bound, so there's no perfect solution to this that doesn't involve any loss of control. And I think I've already made these points. I think we as a community want to push towards the least technically difficult solution, which would be the waivers, but we recognize that this is very problematic from a getting-credit point of view.
I'm going to move on now, but I'm getting a lot of static on the line here. Are other people hearing that, or is it fine? Yes, I'm getting it too. You won't be able to... other people won't tell you, so I just want to acknowledge that there are a couple of people who are having trouble with sound. I'm going to have to speak loudly. Mackenzie, I have a couple of other legal related questions. Should we hold those to the end, or should we try to tackle those now?
Let's take... well, let's see, I'm trying to remember how many. I just have two. Yeah, okay, quickly. I have three more slides on the legal stuff and then we're moving on. Okay. Taking the example of data gathered from a survey, would you say the originally collected data is not copyrightable, and that any aggregation, transformation, or computation on the dataset would make the resulting dataset copyrightable, since the creative process was applied to it? That's a good example. I can't really answer that question since I'm not a lawyer, but I can give you an opinion as a non-lawyer, and that is that the survey data itself, the answers that people gave you, would not be subject to copyright. The variations of them may be, depending on how much creative work went into organizing them and coding them, quite possibly. So I think you could copyright the aggregation of them, and in fact I think that's what a lot of the social science data archives do in order to try and kind of control what happens to that data. But what I'm trying to do today is get you to ask those kinds of questions so that we can collectively come up with some guidelines, and that would need to involve real lawyers, both general counsel at our institutions and IP lawyers who work with the community, like Creative Commons. So don't take my opinion as legal certainty, but those are the right kinds of questions to be asking.
Okay, next one. Do waivers leave data open to legal controls by aggregators that will control or otherwise hinder access? Yeah, this comes up a fair amount. This is the reach-through, and the theory is they do not. So let's say you put your data out under a waiver, so it's effectively in the public domain. Can somebody aggregate your data and slap restrictions on it? They can, but they can't prevent people from getting to the copy of the data that you made available. So they can never override your initial waiver, but they could create a separate product that was under a
different legal tool. The classic example of this is Westlaw, right, where the federal cases that they were aggregating were all in the public domain by default because they were produced by the federal government, but a company figured out how to aggregate them and then add some bells and whistles and then sell that back as a legally controlled product. But they couldn't prevent people from getting to the original law through the normal mechanism; they just imposed barriers of convenience, I would say. So, good question. You do give up control over commercial exploitation, and that is something that bothers a lot of scientists: they want that non-commercial clause, and then the question is, is there any legal way to get that? Okay, great.
So I just wanted to say quickly, there is another set of legal issues that we have to keep in mind, and those are privacy and confidentiality. This comes up quite a bit in research data. There are laws like HIPAA that govern this, and then there are norms like review boards. The licenses and waivers don't typically address these; only a contract could address these kinds of issues, because they're not related to copyright or data rights. And what I'm seeing is that there's a growing tension between the drive to share data and the need to keep it confidential and ensure a certain amount of privacy and security. An example of that is this notion of a portable consent agreement. If you're a patient in a clinical trial, say, and you sign a consent form for the original research, and then the researcher wants to share that data with some large clinical trial aggregator, how does that consent form pass along with your data? There are people working on those kinds of mechanisms now to make this a little easier, but right now, if data has any kind of privacy or confidentiality on it, then this problem gets much, much worse. So I'm just saying that as kind of a warning that if you're struggling to figure out what you're going to
do initially, you might want to start with data that doesn't have any sort of HIPAA regulations applied to it, at a minimum.
Okay, so data licensing practices, just in summary. There are wide variations in specialized subject archives. If you go to a place like ICPSR or GBIF, they will typically have a contract called a usage license agreement. Institutional repositories, or institutional platforms like DSpace, will usually support optional Creative Commons licenses or waivers that would assert copyright by default. They're not particularly nuanced about data as opposed to traditional kinds of materials that obviously do have copyright, so there's a need for real thought and development around these IR platforms for how to handle data. Finally, US government data, and that of some other countries too, is automatically in the public domain, but they typically don't bother to tell you that; you just have to know it. That's fine if you're getting the data directly from the government, but it causes problems for reusability downstream, because there's no way of knowing, five users later, that that data originally came from the US government and is therefore in the public domain. So we need better mechanisms to have the license metadata travel around with the data itself.
Okay, and now let's see. I'm just going to step back for a second and say, as we are running out of time, that what I've been talking about is kind of legal interoperability of data, which is a necessary prerequisite to being able to reuse data effectively. But it sort of brings in other issues, social and technical issues, that are related to the legal issues I've just talked about, and we've had a couple of questions already about some of those. Here's the list of things that kind of come up, and I'll just do one slide on each of these so you understand what the issue is and how it relates to data governance. So starting at the top, we've already talked about license... sorry, moving back up... license interoperability already. Sorry,
getting back to my slide. So I don't need to go over that again; I think you probably get it now, that the licenses need to be complementary so that you can't end up in a situation where you have two conflicting licenses and you're trying to combine the data. We've talked a little bit about attribution stacking, which is where lots and lots of licenses are requiring you to provide attribution, and you're trying to combine large numbers of data sets and run into this problem of needing to give credit to millions of people.
So another thing that's absolutely critical in the data governance arena is persistent, reliable, globally unique, citable URIs for IDs. We need them for people, so we can provide credit in a reasonable way. There are initiatives out there like ORCID, the Open Researcher and Contributor ID platform, which is building a registry of identifiers for researchers. We need identifiers for institutions, so we can provide credit to the grantee, which is typically the institution, not the PI; there is a NISO initiative called I2 which is working on institutional identifiers but is not completely finished yet. We need IDs for the documentation for data, and I would include in that the research articles, which are a form of metadata, really, for research data; there are things like CrossRef DOIs which provide those persistent IDs. In the data realm there is an initiative called DataCite, which is also creating DOIs, but for data sets in particular. I think both CrossRef and, primarily, DataCite provide DOIs for data sets. This does leave questions about extractions, subsets of data sets, and individual data elements, and how we would provide a unique and persistent ID to those. That's quite a technically challenging issue, but I wanted to make sure you know that in order to implement these legal tools effectively, we've got to have IDs for all of these things.
The next topic is provenance. This came up before, and it's related to authenticity. You
need to know where the data came from and exactly how it was produced, so that you can determine if it is authentic, if it's of a high enough quality, how you might be able to integrate it with your other data, and how it can be preserved over time. So this would include things like the methodology for creating the data in the first place. It's citation-like metadata, but also things like methodology, instrumentation, protocols for instruments, things like that, which we typically don't include with the metadata for assets that we're controlling. So we really need to think twice about where this kind of documentation would live. Is it just the research article? Can you just link to the research article and be done? Or do we actually need new metadata schemas that will allow us to manage this kind of information as well as discovery kinds of metadata?
The metadata that we do need for data sets includes the classic kind of who, what, when, where discovery metadata, but it also includes the how metadata, which is the provenance and preservation kind of thing. Additionally, you need metadata about the data itself: how is the data structured, and what do the data elements mean? If you're familiar with social science data, this would be things like the code book for the data, but in a lot of scientific fields there's nothing equivalent to that, so it's an ad hoc kind of documentation effort. Think about the classic situation where you're getting a spreadsheet and it has labels for all of the columns, but you don't necessarily know what they're talking about: it's weight, but what unit of measurement was used? So this kind of metadata is going to be a huge growth area, I think, for helping with data curation, and it's again linked to this definition of governance, because it's part of the business process that you need in order to be able to do things like preserve the data over time.
So then, on to data structure. I think this is sort of self-evident. Data integration at large scale is very, very labor
intensive today. Many disciplines use proprietary formats; a good example of that might be the FITS format in astronomy. In almost every field we are gravitating towards new web-based standards, this being the current best practice. Linked open data is the way that's usually phrased these days, so that is using standards like RDF and XML. But there will be a massive effort going forward to define which ontologies and which schemas we want to use for data, so that it's much easier to integrate and reuse at scale, and this has huge implications for the software and other kinds of tools that we use to do science.
So data accessories are the documentation and metadata and the software that you need in order to process, to analyze, to visualize data. Data is not useful without these tools and without this documentation, so governance extends beyond the data itself to the accessories that you need to do anything with it. And those accessories should always be open source and themselves discoverable and documented and archived and preserved. So this is giving us a pretty big mandate.
But I want to conclude by saying that data governance really is an institutional issue. It's not something that individual researchers will take on, nor something that scientific disciplines can tackle, because they're not legal entities; they're kind of virtual communities. Libraries are very well positioned to work on these issues because we have a lot of experience with this stuff from other domains, like licensing the research literature, and like helping with privacy and confidentiality for archival collections and patron confidentiality. We've worked a lot on standards and large-scale interoperability, for example in the MARC standard for bibliographic metadata. So I think we may not have quite the legal grounding that we'll need in the specifics of what's going on with data, but we have the right sort of mission and experience to really help with this area of data governance. So are there any trailing
questions? There is, there's one last one. Is there a way yet to prohibit aggregation of waived data for commercial purposes? Well, there's a couple of ways. One is, if the data is copyrightable, there's a non-commercial variation of the Creative Commons license. I'm sorry, the static is getting worse, so I'll talk louder. Creative Commons has one license that allows you to restrict commercial use of the content. Of course there's a lot of controversy about that, because nobody's really defined what commercial exploitation is, and choosing a term like that can really backfire on you, but it is available. And then of course you can always create a custom contract; like I said, you could create one of these data usage license agreements that prohibits any commercial use of the data. You just have to be extremely careful, because there's no legal definition of non-commercial, so things like a blog that has ads on it can count as commercial use. So you just have to be pretty clear about whether there are specific commercial uses that you don't want, and not just stop anything from happening. Any other questions? That's the last one.
What I've done, thinking about time, is I've gone ahead and posed the first of the four questions at the end that we'd like to get people to respond to. Please submit your responses as quickly as you can so that we can get through the other three. Stop the question now. For the sake of time, we'll just go ahead. As part of your e-science planning