Alright, thanks a lot, Karen. So Karen asked me today to talk about Dryad's attitudes and practices regarding data citation. I think this was somewhat inspired by a talk that I gave over the summer at the Open Repositories Conference. So I'll be referencing some material from that talk, but this is by no means a duplicate of that talk. So certainly if you're interested in some more low-level details, you can view the video of that talk online or download the PowerPoint and the short article that go with it. So what I'm going to do today is talk very briefly, or maybe not so briefly, about Dryad's background, where it came from, what we do, talk a little bit about how we arrived at our current situation for encouraging citation of data and what our practices are now. And then I'll talk a little bit about how well we've done in actually getting people to accurately cite data. And then I have a few open-ended questions that I'd like to use to seed the discussion afterwards. All right, so beginning with the background of Dryad, about six to eight years ago, I don't remember the exact time, the U.S. National Science Foundation commissioned a blue ribbon panel to look at the future of scientific data. And one of the things that they found was that data was being lost at quite a rapid rate and they needed better methods for archiving and managing data. One of the things that happened somewhat as a result of that panel was the founding of several different synthesis centers across the United States. These are centers that are charged with taking areas of science and building bridges between all the sub-disciplines. So the center that I work at is called the National Evolutionary Synthesis Center. It's funded primarily by the National Science Foundation, and it's jointly run by three universities. That is Duke University, the University of North Carolina at Chapel Hill, and North Carolina State University.
The center has three different major areas of interest. The first of those is bringing scientists together to work on these synthesis projects, projects that bridge sub-disciplines in evolutionary biology. The second is an educational area where the center helps to support biology teachers in their teaching of evolution. And the third is an informatics area where we are charged with supporting all of evolutionary biology in their data management and archiving. And so that was really a focus of the founding of the center that led to Dryad as a data archive. Okay, so Karen at the beginning there showed a quick blurb taken from the Dryad website, saying that Dryad is a repository of data from articles in the biosciences. Over the years, we've gradually increased our scope. So we started out, the original language was data underlying articles in evolutionary biology and ecology. And then we moved to the language that Karen showed you, and we're about to roll out the language that you see on the screen here now, where we want to archive not just data underlying articles, but data underlying literature in general, provided of course that we can have some kind of control that it's not just random crank literature, but it is somewhat peer reviewed literature. And we've expanded our disciplinary scope to be all of science and medicine. Of course, we will still be very highly focused in the biosciences because that's where Dryad originated. So this is just a quick shot of what the homepage of Dryad looked like a couple of weeks ago. One big thing to note here is I took this screenshot when we first crossed 2000 data packages. I'll talk a little bit later about exactly what a data package is, but essentially it's 2000 submissions to the repository. This doesn't sound like a huge amount, but it does represent a fairly rapid rate of growth.
We've doubled our holdings over the course of the last year, and I think we're in a pretty good position to most likely double them again in the coming year. So one of the reasons why we founded Dryad as this very general purpose repository is because biology has a long tradition of very specific repositories and databases. All of these things that you see here listed in the left hand column are repositories and databases that are focused on particular types of data content. Focused on either genetic data or image data or data representing a tree of relationships between various species. All the databases on the right are databases that are focused on a particular group of organisms, either a particular species or a particular set of related species. So you can see there are already very many of these. I just typed this list up very quickly one day off the top of my head. I don't even have a biology background. These are just the things that I've heard my colleagues talking about. So why would we want to build yet another system? Well, the main reason is because there are so many holes in between these existing systems and there is still a large amount of data that just falls through the cracks. And that was really the purpose of Dryad is to catch all of that data that would otherwise have no home or would be orphaned. So here's just a quick picture to kind of give you the general idea for how things work with Dryad. We have a researcher who writes an article. He has some data associated with the article. There may be this genetic data or some kind of relationship expressed in tree form. But then there's all the extra data that goes along with it. They're raw measurements of various weights or lengths or rates, temperatures, etc. And those typically don't have a home so they can go into Dryad. And then the vision has always been that Dryad will pass off some of this data to these more specific databases as appropriate. 
So we're still trying to work on that part of the vision, but I think we've done a pretty good job at the central initial step of collecting all the data in one place. So one of the things that we've done that has made Dryad a success is we've worked very closely with journals. And again, coming out of the National Evolutionary Synthesis Center, we worked initially with journals that were associated with evolution and ecology. So here's a group of journals that initially signed on to this idea of a joint data archiving policy. This was a policy where all of these journals said on a certain date we are going to collectively require all of our authors to archive data associated with their articles. And the journals hadn't been able to do this before because there was no single repository that could hold all of that data. But when Dryad came along, it enabled these journals to pass this mandate and have a collective action to really start moving the field into data archiving. So there are a few keys underlying this policy. The first key is that data is deposited at the time an article is published. This is the time when scientists are most motivated to actually jump some hoops and get their data in order for archival. The criteria for choosing data that will be archived is that the data must be sufficient to repeat the results represented in the article. So if you have raw information coming off a machine, that might not be particularly useful to another scientist who's trying to reproduce the results until you've run one processing pass over that to correct anomalies and deal with calibration issues. So a lot of times people ask, you know, should I submit data in the form of a video I shot of a mouse running a maze? And we'll say, no, you can simplify that one step. 
And whatever you recorded from that video to actually get to your results, if it was just a time for the mouse completing the maze or if it was history of which turns the mouse took, then use that data, that one extra step of processing and deposit that. Many of our journals were a little hesitant about presenting their authors with this mandate, and they thought that the authors would revolt and just perhaps try to find some other journal that was outside of this core scope for publishing. Because authors are understandably hesitant about providing access to all of their data because they think that someone else may reanalyze the data and find some kind of errors or perhaps produce extra articles before the original author has time to fully analyze it. So one of the core tenets of this archiving policy was the availability of an embargo, and Dryad was designed to allow an embargo by default. An author can choose an embargo of up to one year, but that can also be extended with approval of the journal editor. And then there are certain cases that are flagged for exception. Basically Dryad's policy is that we are a service for the journals. So there are some things that sort of immediately qualify data for exception, either for a very long term embargo or for not being archived at all. This includes certain types of human subjects data, things like precise locations of endangered species, but it's really again up to the journal's discretion. And then there was the fact that there was this core set of journals that were really critical to the evolutionary biology community, and they all coordinated in putting this policy in place. And that really made it possible to get a large group of authors together and all submitting data at one time. So here's a quick look at what actually happens when an author submits data to Dryad. They submit their article to the journal and go through the normal manuscript review process with the journal staff. 
Usually, somewhat after they submit their manuscript to the journal, they will send their data to Dryad. Many of our journals just send us information about when an article has been accepted for publication, and the authors deposit their data at the time of article acceptance. It's one of the things that's on their checklist along with producing camera-ready copy. But we do have a few journals that have their authors submit data to Dryad a little earlier in the process, and they actually make the data in Dryad available to the peer reviewers. So the exchange of data between journals and Dryad is really fairly simplistic. It's just a couple of emails that are sent either direction. The journal sends us an email with basic bibliographic information, title, author, abstract keywords. And when a data deposit is complete, then Dryad sends the journal back an email that contains both the identifier of the manuscript and the new DOI that's associated with the deposited data. So this is our growth curve, and I mentioned that we've doubled in the past year. I took this graph about a month ago, so we're now a little bit past the 2000 mark here. And here's an example of what an actual data deposit looks like in the Dryad system. You'll see that the suggested citation information is here at the top. I'm going to go into more detail on that. And then a little farther down there are the individual files that are part of this data deposit and links to download those files. Okay, just a couple of fast examples here of the scope of data that's available in Dryad. So the vast majority of data is this kind of tabular information, usually in a spreadsheet format. Sometimes it has column headings that are easily discernible, and sometimes the column headings are meaningless unless you've read the accompanying article. Just a little different type of tabular data. And then we have, again, this genetic type of data. We have these phylogenetic trees that express relationships. 
We have images. We have little bits of source code that's used to process data. We have things like Mathematica notebooks. Really a random selection of content, and that's what Dryad was designed for. So here's a look at Dryad's staffing, and this is actually not. This is forward-looking. This is not current. We don't have all of these positions either filled or funded at the current time. But this is the idea of what size of an organization we'll need. We currently have about five FTE when you add up all the part-time bits. And once we fill out this org chart, we'll have around seven FTE. Okay, so that's where Dryad has come from. Now I want to get a bit more in-depth on the citation aspect. When the Dryad board first met to discuss how this content should be cited, one of the biggest questions that came up was, should we be citing data at all? Shouldn't we just always refer people to cite the article because isn't the article the unit that is most important for scientific discourse? There was a lot of discussion on this topic and on the fact that most of the tools and systems set up around science right now are very focused on an article and they're not focused on data. One of the big problems that came up when we were discussing data was the fact that data comes in different sizes. So people sometimes say, oh, I have a data set. And sometimes they mean they have one file that has a lot of information in it. Sometimes they mean they have a thousand files and each of these files has a little bit of information or maybe each of the thousand files has a fairly large amount of information. So the language surrounding data was very confusing and there was no really easy answer when we're trying to encourage people to cite data should we encourage people to cite all one thousand of these small files or is there some other unit that we should use? And then there was a lot of discussion on what is useful for citations in the future. 
Where do we want the community to go, because Dryad was being developed at the time as a new initiative and we had a chance to really change the culture of at least this one area of science and we hoped to expand into more areas of science. So we acknowledged that the people who dealt with data exclusively really preferred credit for their effort in creating the data, and they preferred fairly low level citation information because they might want to do something like record exactly which piece of data was used for a particular calculation and be able to replay that calculation at some point in the future. And then there was this idea of altmetrics. At the time that Dryad was first created the altmetrics term had not even been coined yet, but we were aware that people were getting increasingly frustrated with this article-based and journal-based measurement of impact. So after much discussion and getting many different journal editors' and scientists' input, we finally decided that we need to encourage people to cite both the article and the data, that an article and its accompanying data are two very distinct intellectual products and they're both very worthy of citation. But we didn't want to worry about this issue of weighting data more heavily if it was more abundant. So when you had a submission that consists of many different data files, we always want to collect those together into one data package, and that way all of the data associated with a single article gets one data citation. So we're now weighting an article and the set of data that accompanies that article equally. So then we came to the issue of, okay, well what kind of information do we include in the actual citation? And one of the earliest articles on this topic is by Altman and King. They suggested that a citation for data should include authors, publication date, title, some kind of persistent identifier, and a fingerprint to validate that the identifier was not mistyped.
There have been various articles and studies on data citation since then, but they've all ended up with fairly similar sets of information, although I think largely the numeric fingerprint has kind of fallen away and is not really used anymore. We certainly don't use it in Dryad. So we also looked around at current practice in various repositories. Dryad is based on the DSpace software. So we looked first at what DSpace does, and DSpace puts a citable URL right front and center. It's this URL format called a handle, which is very persistent but it's not very well known. So we weren't really interested in following in the footsteps of DSpace and maintaining a reliance on this handle system. We also looked at other repositories. This is an example from the EPrints software system, and here this is an article type of information. So the citation is very similar to a traditional article citation. So we did want to acknowledge that data was really important and we wanted to give data equal weight with articles, and we knew that articles such as this one always get a DOI. So we were very interested in applying DOIs to our data sets or data packages in order to signify that these really are first class scientific objects. And there was a lot of worry at the time Dryad was starting up because DOIs from Crossref were fairly expensive. Luckily, at the time that Dryad was starting up, the DataCite organization was also starting up, and they provided us with an alternative in which we can mint DOIs much more cheaply. So this is the citation format that we settled on, and you've seen it a couple of times in the slides now, but this is exactly how it appears in the Dryad system when you're looking at a data package page. So first we say please cite the original article, and we provide a link back to the article at the journal's website, and that provides traffic to the journals. And then we say please cite this data package as its own entity.
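As a rough illustration of the citation elements just described (authors, publication date, title, and a persistent identifier, with one citation for the whole data package), here is a minimal Python sketch. The `DataPackage` class, the `format_data_citation` function, the "Data from:" title prefix, the repository label, and the example DOI suffix are all assumptions made for illustration; the talk does not spell out the exact string that Dryad displays.

```python
from dataclasses import dataclass


@dataclass
class DataPackage:
    """Hypothetical record holding the citation elements named in the talk."""
    authors: list[str]
    year: int
    title: str  # title of the associated article
    doi: str    # persistent identifier of the data package (placeholder below)
    repository: str = "Dryad Digital Repository"  # assumed label


def format_data_citation(pkg: DataPackage) -> str:
    """Build one citation string for the data package as a whole."""
    authors = ", ".join(pkg.authors)
    # Assumed layout: Authors (Year) Data from: Title. Repository. doi:...
    return (
        f"{authors} ({pkg.year}) Data from: {pkg.title}. "
        f"{pkg.repository}. doi:{pkg.doi}"
    )


example = DataPackage(
    authors=["Doe J", "Roe R"],
    year=2011,
    title="An example article title",
    doi="10.5061/dryad.example",  # made-up suffix, for illustration only
)
citation = format_data_citation(example)
print(citation)
```

Note that the sketch carries no numeric fingerprint, matching the remark that this element has largely dropped out of use.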
Now, within the Dryad system we have pages where you can view the entire data package, and we have pages where you can view more details about one particular file within the data package. But even when you're viewing those more detailed file pages, the suggested citation is still for the data package as a whole. So we're kind of relying on the people who are more technically minded who really want to cite something at that lower level of a single file. We're relying on them to realize that there are different identifiers in place and cite the particular file they want, but the vast majority of people we want to direct to cite a particular data package as a whole. So these are the basic principles that we use when we're creating our identifiers. First of course is that we use DOIs, both for their prominence as an identifier within scientific discourse and also because there are a wide range of tools and services that support DOIs. We did strive to keep our identifiers as simple as possible. One thing that's been a little bit contentious is using syntax in the identifier. We use a very small amount of syntax to illustrate certain relationships between different identifiers. So we use a syntactic ingredient to show that a particular data file is part of an associated data package, and we also use a little bit of syntax to indicate different versions when we have successive versions of a data item. So I know that this has historically been a little bit of a debate within the data archiving community, and we're taking our stance on it and we'll continue those arguments. But as we've been using this, we've at least found within our own staff that having that syntax there is very valuable. So we do allow scientists to submit new versions of their data. We still have a little bit of technical cleanup to do because we added a lot of this versioning functionality into the DSpace software.
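To make the idea of syntax inside an identifier concrete, here is a small Python sketch that parses an identifier into a package, an optional version, and an optional file number. The layout `<package>[.<version>][/<file>]`, the `ParsedId` type, and the function names are hypothetical; the talk only says that a small amount of syntax marks file-in-package and version relationships, not what that syntax looks like.

```python
import re
from typing import NamedTuple, Optional


class ParsedId(NamedTuple):
    package: str                    # identifier of the data package
    package_version: Optional[int]  # None when no version suffix is present
    file_number: Optional[int]      # None when the identifier names the package


# Hypothetical layout: <prefix>/<repo>.<name>[.<version>][/<file>]
_PATTERN = re.compile(
    r"^(?P<package>10\.\d+/[a-z]+\.[a-z0-9]+)"
    r"(?:\.(?P<version>\d+))?"
    r"(?:/(?P<file>\d+))?$"
)


def parse_identifier(doi: str) -> ParsedId:
    """Split an identifier into package, version, and file parts."""
    match = _PATTERN.match(doi)
    if match is None:
        raise ValueError(f"unrecognized identifier: {doi!r}")
    version = match.group("version")
    file_number = match.group("file")
    return ParsedId(
        package=match.group("package"),
        package_version=int(version) if version else None,
        file_number=int(file_number) if file_number else None,
    )


def is_part_of(file_doi: str, package_doi: str) -> bool:
    """True when file_doi names a file inside the package named by package_doi."""
    f = parse_identifier(file_doi)
    p = parse_identifier(package_doi)
    return (
        f.file_number is not None
        and p.file_number is None
        and f.package == p.package
    )


parsed = parse_identifier("10.5061/dryad.abc12.2/3")
print(parsed)
print(is_part_of("10.5061/dryad.abc12/3", "10.5061/dryad.abc12"))
```

The point of the sketch is the design trade-off raised in the talk: with syntax in the identifier, the file-to-package relationship can be read off the string itself, at the cost of making the identifier less opaque.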
But the basic philosophy here is that whenever we have a change to the data that actually changes the meaning of something, then we create a new identifier, and that's done at the discretion of the scientist or the curator. And then there are other changes that are what we would consider meaningless. For example, correcting a spelling error or sometimes adding a new keyword just to enhance discovery. And in those cases we retain the current identifier and we don't create a new version. So there's a URL at the bottom here that references this presentation that I gave at Open Repositories, which goes into a little more detail on some of these issues. Okay, so that's how we came about our current system of citations. Now I'll go through some of the ways that we actually encourage people to use these citations. So of course the first thing, which I already mentioned a few times, is that this citation suggestion appears at the very top of a page when you're viewing content in Dryad. And then if you click on these little words down at the bottom of that citation, "Cite" or "Share", those expand the box to open an area here where you can use various other external tools with this particular Dryad object. So there are links here to various social media sites where you can send this citation to those sites. And there are links here for downloading this citation in formats appropriate for various citation management tools. Another way that we encourage citations is that at the time an author has completed a deposit to Dryad, they get an automated message that includes all of the information confirming that their deposit has been archived in the repository. And it gives a suggested way for adding a citation into their article. Now this isn't exactly the same as the full citation that we use on the Dryad pages, because what we're aiming at here is to get the author to make the link between their article and the associated data.
So this isn't what you would traditionally call a citation, it's a self citation where the author is citing data that should be found with this article. And you'll see in a minute that the ways that these citations are being presented in journal articles vary fairly widely right now. The journals have not yet really agreed on how an author should cite their own content. But then farther down in the letter here, we also refer the authors to our informational page on depositing data, and that contains some more information about how citations should be managed. A few more things that we do to encourage people to properly cite their data. We've had ongoing discussions with all of our journals about their policies for citations. So when we first start working with a journal, one of the first questions we ask is how are you going to present a link to the data within your articles? And usually when we ask that question, this is the first time they've thought about the issue. So we just kind of get that in their minds and then we keep following up with that every time we have an interaction with the journal. We also are fairly active in making Dryad a tool that exists for the journal's benefit. So anything that we can do to improve the journal's experience of Dryad makes it easier for them to promote Dryad to their authors and to promote proper citation practices to their authors so that the authors and their readers can find the content in Dryad. And then we make it easier for people to use the Dryad content, not just the information that's displayed on the web pages that I showed you before, but we have various APIs and mechanisms by which people can access the Dryad data, both the data files themselves and all of the citation oriented data. So one of the things that's not easy to display on a slide is that we support the COinS microformat.
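Since the COinS microformat is hard to show on a slide, here is a minimal Python sketch of what such a tag looks like: an empty `span` with class `Z3988` whose `title` attribute carries URL-encoded OpenURL key/value pairs. The `coins_span` function, the choice of Dublin Core fields, and the example values are assumptions for illustration; this is not necessarily the exact set of fields that Dryad pages emit.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title: str, creators: list[str], date: str, identifier: str) -> str:
    """Return a COinS span that citation managers can detect in a page."""
    pairs = [
        ("ctx_ver", "Z39.88-2004"),                 # OpenURL ContextObject version
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:dc"), # Dublin Core metadata format
        ("rft.title", title),
        ("rft.date", date),
        ("rft.identifier", identifier),
    ]
    # Repeat the creator key once per author.
    pairs += [("rft.creator", name) for name in creators]
    query = urlencode(pairs)
    # The query string sits in an HTML attribute, so "&" must become "&amp;".
    return f'<span class="Z3988" title="{escape(query)}"></span>'


span = coins_span(
    title="Data from: An example article title",
    creators=["Doe J", "Roe R"],
    date="2011",
    identifier="doi:10.5061/dryad.example",  # made-up identifier
)
print(span)
```

A reference manager scanning the page looks for the `Z3988` class and decodes the `title` attribute back into citation fields.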
So every page in Dryad that displays data content also includes a COinS tag, and you can use that with tools such as Mendeley or Zotero to automatically import the citation into those tools. And then looking forward, we're working with a lot of other repositories and groups that are interested in data citation. So I'm certainly glad to be here talking with you all this evening or this morning in Australia. But we've worked with DataCite of course extensively in getting our DOIs registered, and we're still continuing to work with them in order to ensure that the metadata associated with those DOIs gets to as many places as possible. We've worked a little bit with Total Impact. I know that the next talk on your schedule is from Heather Piwowar of Total Impact. So I'm sure that will be an excellent talk. Heather's a great speaker. And one of the motivations for creating Total Impact was Heather's involvement with Dryad. And then I think someone earlier this evening mentioned the Thomson Reuters Data Citation Index. That's set to come out soon and Dryad is one of the initial repositories that's included in that index. So we're excited to see that come about. Okay, so we've done all these things in trying to develop proper citations and encourage people to use citations. Now let's take a look at whether they're actually using them. So in terms of self citation, we did just a sampling of content in Dryad. We took some random data packages and looked up the articles associated with those data packages to see how those articles referred to the data that was deposited in Dryad. 77% of those articles had what we would call good citations to data. These are citations that include an indication that the data is in Dryad as well as the fully specified DOI so that people can actually locate the data. 2% of them had what we call bad citations, which I'll show you an example of in a second. And then 21% didn't have any citations at all.
So this is still a challenge that we're working with, trying to get all the journals to provide a standard space for people to cite the data that's associated with the article and trying to get the authors used to actually putting that information in. Some journals now have specific sections where they ask the authors to list any links to data that's associated with the article. Some journals have a little box or other space on the front page of the article where they list a data citation, but there's still a large number of journals that don't have awareness of this. And there's 77% that have good citations. That number is suspiciously close to our other number of 79%. So 79% of all the content in Dryad comes from journals that are tightly integrated with the Dryad system. So we have this initial cadre of journals that signed on to the data archiving policy, and then we have about 20 other journals that have now integrated their system so that they're automatically sending us the metadata and receiving the Dryad confirmations of everything. And they're actively encouraging their authors to archive data. So that accounts for 79% of the content in Dryad. The remainder is just authors who are enthusiastic about data archiving who found Dryad on their own. So it's very interesting to see that when we have 79% of our data coming from these integrated journals, 77% of content is actually getting cited properly. And I would suspect that there is a very large overlap between those integrated deposits and the items that get cited properly. So here are a few examples of how people are doing self citations, that is, citing the data that underlies the article being written at the time. So sometimes they appear just in the running text. Sometimes they appear in a special section labeled supplementary material or supplementary data or something of that nature. And sometimes they simply appear in the references section with no other mention elsewhere in the article.
But we do have a few bad citations, and this would be an example of one of those. So in this case, the authors simply referred to the Dryad database and they didn't provide the full identifier. They just said, oh, there's the accession number 1888, which if you know a little bit about the history of Dryad and how the identifiers were assigned in the early days, you could reconstruct and actually locate that object. But there isn't an easy way for a casual reader to go from that information to the actual object in Dryad. But things are encouraging. This is an email that I happened to get about two hours before the talk started from an author who had worked with our curators to get their data in final format in Dryad. And he seemed very encouraged by being able to submit his content, and it looks like he will now actually place a citation into his article. And this particular journal associated with this data is not a journal that's tightly integrated with Dryad. This is just an author who came to us and deposited outside of that integrated stream. Okay, but of course the real measure of success is not when individual authors are citing their own content that they've placed into Dryad, but when people are pulling data out of the repository, reusing it and then citing it properly. And the results are not so promising in this area yet. The best that I can say is that it does take a lot of time for this kind of reuse to happen. So the data has to appear in Dryad. Someone then has to notice that the data has appeared in Dryad. They have to do some research inspired by or simply making use of that data. And then they have to write an article that writes up that research and they have to get that article published. So there's quite a bit of time lag in this citation process. And I think we're just now getting to the point where Dryad has enough content that most researchers, at least in the biosciences, can find something in Dryad that is relevant to their research.
So we're starting to see a few citations, but not many. And I'll give you some examples in a second here. The second major thing that we're running into is that it's very tricky for us to actually find out when these citations occur. Because you saw previously that even the self citations are fairly erratic, and these reuse citations are just as erratic. Even though we've gone through all this effort encouraging people to properly cite their data, we haven't really gotten the community accustomed to properly citing the data on a regular basis. So one of the very encouraging signs that we have right now is that our content is at least getting accessed quite a bit. So this is one of the more popular objects in Dryad. This page has been viewed over 500 times. And one of the associated data sets has actually been downloaded more frequently than the page has been viewed. I assume that there are some external links that go directly to this data object. So people are accessing the content without even needing to visit the Dryad page. And we've seen a continual increase in downloads since we've been recording that information. So we're expecting to see more and more citations. We do have a few citations that we know of in the wild. There is a data package from a researcher named Amy Zanne. It was deposited in Dryad in 2009, which is a few years ago now. And it's been heavily downloaded from Dryad. We're just starting to see citations to this content pop up. So the Web of Science now says that this item has been cited 27 times. We actually didn't learn about these citations until just a couple of months ago when someone deposited new data in Dryad that referenced this original data package. So even though some of these citations are going on out there in the world, we don't always know about it. We don't have a really good way of tracking it because of the inconsistent ways that the citations are being recorded.
Okay, I want to wrap up with just a few questions on things that I've been wondering about lately. And perhaps you all might have some feedback on them. So the first is, I mentioned that this issue of syntax in identifiers is fairly contentious. I had an argument just last week with someone from Crossref who said that we were crazy for adding this little bit of syntax into our identifiers. But we found it to be fairly useful just for our own staff and therefore we're happy that we decided to include it. The second, which is maybe of broader interest, is: is this concept of a data package something that is going to work across all areas of science? So we found it really very helpful within evolutionary biology and our other core starting disciplines that we're giving credit for the data on a level similar to the article. We don't give out more citations for more data files. We give out just one citation for all the data that's associated with an article. But that has the drawback that it always requires that there is an article or some other publication that's associated with this data. So there's a question about whether data can in fact become first class, that you can actually use the data without having some kind of accompanying article or other documentation. And finally, the issue of these tools like Web of Science, which really doesn't provide very good support for us to find out about citations. And can we get these publication oriented tools to support data citations? There were a few months where Dryad data citations were appearing in Google Scholar. But eventually Google sort of noticed and they pulled them all out. So we were really disappointed to see that Google Scholar is now focused only on articles and they don't allow any data citations. So maybe the answer there is to put our efforts more into tools that are data centric. Okay, I think these slides will be available. So you can certainly follow these links to find out more about Dryad.
And I did want to quickly thank just a couple of people who were very helpful in the development of these concepts. The Dryad Board of Directors had all these great conversations about what data citation means and ended up with the policies that we have today. Todd Vision, who is the principal investigator on the various grants associated with Dryad, he was the source for some of the information and a few of the figures in these slides. And Peggy Schaefer also helped to collect some of the information here.