And thanks to all of you for taking the time out to be here. So I'd like to talk today about research data publication and scholarly communication, and the intersection of the two. Many of you may recall, sometime around last year, the 350th anniversary of the scientific journal, at least of the oldest currently extant scientific journal, the Philosophical Transactions of the Royal Society. The image that the Royal Society used to promote their event, for which there were all sorts of interesting symposia on scholarly communication, where it's come from, where it's going, and how little has actually changed, was this foldout page from one of the early issues: observations of the tides made by the royal astronomer on St. Helena, published as a table. So from the very beginning of scientific publication, data was part of the product that was output to the world.

And in the days when publications were fewer and farther between, and page space wasn't so limiting, there was still a culture in many cases of publishing the data in the paper itself, in tables. This is a really classic and beautiful data set that's been used bazillions of times to test new methodologies, from a fellow named Bumpus, at Cold Spring Harbor, I believe, or Woods Hole, who captured house sparrows that had fallen down in a storm, measured the body parts of the ones that survived and the ones that didn't, and reported it. There are pages and pages and pages of data in this paper. And it was one of the first examples of natural selection happening in the wild, at a time when Darwinian natural selection was still kind of a controversial thing. So this data set has been used for education and for research on new methodologies, in part because it was available in this form. Not every data set was published in this way, but it was certainly not unheard of.

Then we entered the 20th century, and we began to experience pressure on the page counts in our journals, and people began to take up digital data and computers, and the data lived here and the paper was here. Slowly we had a divergence of the narrative in the article from the data behind it, until we came to the point where we are today: the data live here, the narrative article is here, and often only a thin slice of the data is reported, in some kind of summary or statistical form, in our publication.

So I want to talk about two attempts to reverse this trend that are coming from opposite directions. I'll spend most of my time talking about the first project, which Carl introduced, Dryad, which is about how to access the data integral to the paper when it's left out, when Bumpus's sparrow measurements are not in the article. And then, coming from the other end, there's all the contextualized data and information that is in the paper, which might serve all sorts of reuses if it could be integrated with contextualized data and information from other studies, but which is reported in narrative form, in free text that's available for humans to read but isn't easily extracted. I'm sure you've heard of various efforts to achieve this latter goal through various kinds of natural language processing and machine reading and so on. I just want to give one example of that, to show how it might be done in one discipline. So, problem number one: data is left out.
Just as evidence of this, I'm going to try to point to some studies to give you a sense of what the evidence base is for some of the policy things I'm going to talk about in this first half. There are numerous examples I could have picked, but this is a nice one from Ioannidis' group, who looked at, I believe, 500 articles in 50 high-impact journals and cataloged very carefully what data policy applied to each article at the time, and how the authors complied, or didn't, with the intent of that policy. I believe there were about 150 articles for which no data policy was present in the journal, and for none of them was the full data provided in an archived form. For the others, there was some kind of data policy, even if it was only that the data needed to be shared upon request. But even then, only about half of the articles complied with the spirit of the data policy that was there. So there's a very patchy ecosystem, so to speak: authors sometimes share data in repositories, depending in part on what discipline they're in and what repositories exist; sometimes it's supplemental information; sometimes they say the data can be shared upon request; and sometimes they don't share at all.

And to motivate problem two: data that gets included in the narrative text becomes difficult to extract for its full reuse. This is an example of a description of a phenotype for one taxon in the biological systematics literature. I dare you to read this and pronounce everything correctly; I'm not even going to try. This is a natural language description of some incredibly detailed observation. It is essentially descriptive data in text form. Now, if I were a computational data scientist and I wanted to combine this with similar observations made in other taxa, reported in other papers, this free-text version by itself is not of much use. So there have been huge efforts in biology and other sciences, though I'm most familiar with the ones in biology, to extract this information, mostly using human curators reading the literature and putting these observations into databases in a structured form that can then be queried and integrated and analyzed. Steve Pettifer has compared this to making a hamburger and then trying to take it back and make a cow out of it. The original authors had a cow. They produced this lovely, delicious hamburger. But if you actually want to use the cow, the backward process is much more difficult.

All right, so let me start with part one: what data gets left out? A nice way to conceptualize this is the data pyramid. There's a very small amount of data that gets published in tables and figures in the articles themselves. There's a second layer, which we want to focus on, which is the stuff just behind the paper: the original data that would allow you to see whether you can recreate the same statistical results that the authors did, or re-plot a figure in a way that you prefer. Below that are data collections. They may or may not be associated with publications: surveys that are done in a systematic way by federal agencies, and satellites and sensors and so on, that exist in their own structured way and aren't necessarily associated with the scholarly communication world. And then there's the raw data, the data sets stuffed in your hard drive and file drawers, which would require much more organization and documentation for someone else to make use of than will probably ever become practical.
The other distinction which is useful is to think of the long tail of data. I've heard it called artisanal data, which I like. Imagine a file format developed by a graduate student or a small community of scholars, useful for a short amount of time for a certain number of experimental studies, that isn't ever quite seen again. This is found in all disciplines, although it's much more common in some than others, and in my area of ecology and evolutionary biology, it's the norm. So that's on the right. On the left, you have things that are generated in abundance and are typically well structured, for which there often exist federal repositories, like protein structures and gene expression data; I'll show an example of the latter in a bit. So the data volume is very rich on the left, but there's a long tail of extremely valuable stuff to the right that's much more heterogeneous and more difficult to deal with. And this is typically what's lost. Putting those two things together: it's that second layer of the pyramid, the things just behind publications, and this long tail of heterogeneous, often tabular, quantitative, spreadsheet-type data.

And some of that stuff is really valuable. There was a very striking case a few years ago of an economics paper from Reinhart and Rogoff, very distinguished economists, who published an analysis that was somewhat influential in policy circles for justifying the austerity response to the financial crisis of 2008. A graduate student, after much, much effort, eventually got his hands on the Excel file that Reinhart and Rogoff used, and found that there was a formula error that completely changed the major outcomes of the paper. So that's artisanal data: small scale, and extremely valuable for someone else to have taken a look at, even a graduate student at a state school.

There have been some nice studies recently that have measured this problem of data attrition over time with small-scale data, because the norm had been for many journals to have a policy whereby data was shared upon request. Difficult to enforce, but it was there. So Tim Vines and his group went to seek data going back 20 years, and kept very careful track of every stage in the process that failed. Could they find an email address that still worked for the author? Some fraction failed at that point. Did the author respond? Some fraction failed there. Did the author respond only to say, I have no idea where that data is, or, don't bother me? That would be another set of failures. Sometimes they would get the file and discover, this is completely not what I was expecting, this is unusable, and that's another point of failure. This plot is just for one stage of attrition, the data still being extant given that the author responded, and you can see that as the age of the paper increases from 5 years to 20 years you get this steep reduction. What they found was that after 20 years, essentially none of the files they had originally requested could be recovered. The odds of recovery declined by about 17% a year.

Another really illuminating study of attempting to recover data by request comes from Jelte Wicherts in the Netherlands. The American Psychological Association has a particularly strong data-upon-request policy, where the authors actually have to sign that they're willing to share upon request, across all APA journals. And so this group asked for the data behind papers from recent issues.
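To make that rate concrete, here's a minimal sketch of how a constant 17% annual decline in the odds of recovery compounds over a paper's age; starting from 100% is an illustrative assumption, not a figure from the study.

```python
# Minimal sketch: compound a 17%/year decline in the odds of a data set
# being recoverable (the rate reported by Vines et al.). The 100%
# starting point is an illustrative assumption, not a study figure.
decline = 0.17

for age in (0, 5, 10, 20):
    remaining = (1 - decline) ** age   # fraction of the initial odds left
    print(f"{age:>2} years: {remaining:.0%} of initial odds remain")

# After 20 years only ~2% of the initial odds remain, consistent with
# essentially no files being recoverable from the oldest papers.
```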
There's this wonderful quote: six months later, after 400 emails, detailed descriptions of our study aims, approvals of our ethical committee, signed assurances not to share data with others, and even a full resume, only 27% of the authors complied. What was really interesting is the follow-up study, the second reference I have there, "Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results." They showed that there was a substantive difference in the number of statistical errors that could be detected from the articles themselves, without looking at the data at all, between the papers whose authors shared the data upon request and those whose authors did not. You could argue about why that's the case, but it was very illuminating.

This is all in contrast to surveys that have been done of researchers when you ask them about their willingness to share. There seems to be a general appreciation of the value of sharing data. Obviously there are some concerns, and not everybody says they'd be willing to share data, but about four fifths are. One of the major things asked for in return is "it's important that my data are cited when used by other researchers"; I'll unpack that attitude in a bit. Another is "it is appropriate to create new data sets from this data": others can actually reuse it, not just look at it. So despite this fairly strong stated willingness, when researchers are actually pressed on the matter, the behavior is different. And so we need to look for incentives, both positive and negative, if we want to change that behavior.

One of the positive incentives that keeps coming up, as I just mentioned, is this idea that if your data is available for others to reuse, you get some kind of citation advantage for it. So a postdoc named Heather Piwowar and I looked at a very large data set of about 10,000 studies with gene expression data. Gene expression is a very common genomic data type, and it's a great study system for this because about half the time the data is shared, as it's supposed to be, since most journals require it, and about half the time, a little more than half the time, it's not. So you have a natural comparison. It's not a randomized comparison, but it's a natural one: you can compare the papers for which the data are available in public repositories and those for which they aren't. There are well-known cofactors that affect the citation counts of papers, known from bibliometric studies, such as the number of authors and institutions and things of that nature. So this is the result of a multiple regression that tries to factor those out, although it's still just correlational; it's not a randomized trial. What we see here is gene expression data from roughly the first year in which this data type was collected, 2001, up through 2009. The orange hump shows the citation distribution for papers where the data were not available, the blue hump for those where they were, and the green in the middle is the overlap. There's a displacement of the two distributions that's reliable. Overall, it's about a 10% citation advantage, again factoring out the other things that affect citation counts. But in some years it's much higher than that; it was as high as 30% in the early years.
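To make the method concrete, here's a minimal sketch, with synthetic data and hypothetical variable names, of the kind of count regression that estimates a citation advantage while adjusting for cofactors; it is not the actual analysis from the study.

```python
# Hedged sketch of a citation-advantage regression on synthetic data.
# The variables and the ~10% built-in effect are illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
data_available = rng.integers(0, 2, n)           # 1 = data in a public repository
n_authors = rng.poisson(5, n) + 1                # cofactor: team size
journal_impact = rng.normal(5, 2, n).clip(0.1)   # cofactor: venue

# Synthetic citation counts with a built-in ~10% advantage for shared data
mu = np.exp(1.0 + 0.1 * data_available + 0.05 * n_authors + 0.1 * journal_impact)
citations = rng.poisson(mu)

X = sm.add_constant(np.column_stack([data_available, n_authors, journal_impact]))
model = sm.GLM(citations, X, family=sm.families.NegativeBinomial()).fit()

# exp(coefficient) on data_available is the multiplicative citation
# advantage after factoring out the cofactors; ~1.10 here.
print(np.exp(model.params[1]))
```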
So there's a substantial advantage, but note that this is citation not to the data per se but to the article for which the data were available. And that's a really important distinction that's going to affect what we think we can achieve by promoting data citation as an incentive.

Anyway, taking all these things into account: data wasn't really being shared by request; it was tough to enforce; even when researchers in good faith wanted to share, they couldn't recover data that was several years old; attitudes are generally favorable; and there are some positive consequences of sharing in terms of impact. There had been many, many workshops sponsored by NSF and the Ecological Society of America and other societies over the years, trying to find solutions. A group of editors and society officers happened to be around the North Carolina Triangle, associated with the center where I was, NESCent, or with one of the local universities. There happened to be a critical mass of people at the time who were sick of going to these workshops and said, we can provide the nudge to journal policy if we band together and roll a policy out for our discipline as a whole. Evolutionary biology is a relatively small field; there are about a dozen major journals, and pretty much all of the most important ones agreed to do this together and roll it out in 2011, after a lot of feedback and letting people know what was coming down the pike.

So this is the Joint Data Archiving Policy. It starts with a statement of principle: as a condition for publication, the data supporting the results in the article, not everything you've ever collected, but the data supporting the results in this article, should be deposited in an appropriate public archive, and it was left up to the journals to specify what that means. One controversial provision was that authors may elect to embargo access for a period of up to a year after publication. A lot of people were very queasy about including that, but without that line it would never have gotten adopted. And then exceptions can be granted at the discretion of the editor, either a longer embargo or a complete exception for sensitive data of various kinds. This is essentially the policy that was adopted by all these journals, and a number of others have adopted it since then. I'm aware of earlier efforts in other fields; in economics there were replication data sets going back, I think, more than a decade. And since this time, as data's become more visible, there have been much more detailed and elaborate data policies. The one that got a lot of attention a couple of years ago was when PLOS introduced theirs, both because they have a large research community that pays attention to them, but also because it was quite strong and quite detailed: it specified how the mechanisms of enforcement would work, what counted as an allowable exception, things of that nature. But this high-level agreement of principle in 2011 was very influential in my field.

Until that time there hadn't been much in the way of an appropriate public archive that would accept data from any of these journals. Now there's a richer ecosystem of repositories with different flavors. Dryad is one of them; it was developed specifically to support this policy. The only one on this slide that existed before 2009, I believe, was Pangaea. Dataverse and Dryad both came online about 2009, and the others are more recent than that.
So there's now a lot of choice in appropriate public archives for journals to choose from for long-tail, heterogeneous data. These aren't specialized repositories with domain scientists on staff who deeply curate the contents, except perhaps in the case of Pangaea; they accept pretty much whatever the authors are working with. Dryad has a particular niche of only accepting data associated with a vetted publication. It's usually an article, sometimes a book, sometimes a thesis, but as you can see in this list of recently published data, each of these is a data set associated with an article in a particular journal, and I'll mention later the uptake in the world of journals. It's a fairly straightforward process: when an author submits data, Dryad mints a DOI for it, so it's citable, and we'll talk about that in a bit, and then Dryad manages long-term preservation and access.

Some of the features: there's a data package with multiple data files. We don't reach into those files and change them in any way; they're basically how the author had them at the time of analysis. Dryad has a mandate to be the scholarly record and not mess with that record. It's really up to the author and the journal and the reviewers and the editor to decide what should be in the repository. There's strong reciprocal linkage between the original article and the data. There's a staff of data librarians who provide user support and do some quality control, to make sure that the calling card for the data is as accurate as possible: that it matches the right article, that the metadata is correct, and that it's discoverable and indexed well. So all the attention is focused on that end, rather than on reaching into the files themselves.

The magic sauce for adoption for Dryad has been submission integration with journals. When manuscripts are submitted for review, or after acceptance, there's interaction between the journal and the repository, so it's not an independent process where the authors have to decide, oh, I'll go to this repository and re-enter all the information separately and so on. It just makes that process simple. There's customization for different journal policies, which I'll show in a bit. And the data are free to download and reuse, citable, and preserved. Financial sustainability is achieved through upfront data publication charges: either they're sponsored by some organization or they're paid by the individual researcher. Dryad started as an NSF-funded project through the National Evolutionary Synthesis Center and has spun off now as a nonprofit organization.

So this is how the integration of manuscript and data submission works. This is one workflow, a common one, where the author prepares the manuscript, with the data files that were used for creating the figures and doing the analysis already in hand. They submit the manuscript and it goes to review. Upon acceptance, the journal tells Dryad, because we have a relationship with the journal, that this thing is coming. And so we can create a record for it, so that when the author comes to us the record is already waiting for them, and all they need to do is supply the metadata associated with the files themselves.
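To illustrate the shape of that journal-to-repository handshake, here's a hypothetical sketch; the payload fields, function name, and manuscript ID are all invented for illustration and are not Dryad's actual integration API.

```python
# Hypothetical sketch of the journal-to-repository notification that
# pre-creates a record for an accepted manuscript. Field names and the
# manuscript ID are invented; this is not Dryad's real API.
from dataclasses import dataclass, field

@dataclass
class ProvisionalRecord:
    """Placeholder repository record awaiting the author's data upload."""
    manuscript_id: str
    journal: str
    title: str
    authors: list = field(default_factory=list)
    status: str = "awaiting_author_upload"

def on_journal_notification(payload: dict) -> ProvisionalRecord:
    # The journal sends article metadata; the repository pre-creates a
    # record so the author only has to attach files and data metadata.
    return ProvisionalRecord(
        manuscript_id=payload["manuscript_id"],
        journal=payload["journal"],
        title=payload["title"],
        authors=payload.get("authors", []),
    )

record = on_journal_notification({
    "manuscript_id": "MEC-2016-0421",   # made-up example ID
    "journal": "Molecular Ecology",
    "title": "Example article title",
    "authors": ["A. Author", "B. Author"],
})
print(record.status)  # awaiting_author_upload
```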
Once it's been curated, the identifier is sent back to the journal, so that the publication can come out with its data citation included at the same time as the data shows up in Dryad with an accurate article citation. Those are as simultaneous as we can make them.

This is an example of a data package. It's a simple one, with just one file, but it's a zip file, so there's probably quite a bit more in there. The title of the article is prefaced with "Data from:". You can see the downloads that are reported, and these are passed on to altmetrics providers. And then you can see the recommended citation policy that was agreed upon when Dryad was founded, which is that when using this data, please cite the original publication, because in many cases the data aren't understandable without the publication. The publication is, in effect, a very rich metadata file that says why you collected the data, how you collected it, and shows some of the results from it. Many authors in this long-tail data space are very anxious about having the data used in the absence of that context. But in addition to the article citation, there's a data citation, which looks a lot like the article citation, though the author list can actually be different. I should have shown an example where that was the case. If the first author of the data was the person most responsible for collecting it but wasn't the first author of the paper, they can actually get that credit. And there's a separate UI for the data.

So if we want data citation to work, what do we mean by data citation? The common understanding is that somewhere in the bibliography, a reference to the Dryad repository with the DOI for the Dryad data would show up, just like it would for an article. But there are other options. What's becoming common now, and this is an example from PLOS, is a data availability statement. The original article has the data listed as metadata associated with the whole article, separate from the bibliography. So here's the data availability statement within a PLOS article. It's not in the bibliography, so it wouldn't be counted by Science Citation Index and Scopus the way that other citations would. I'll get back to that in a moment.

As I said, there's customization for different journals. One axis of differentiation is whether the journal is integrated at all. There are many journals that partner, and may pay for publication, but haven't actually integrated the submission process. Journals may choose to have data submitted at the time of acceptance or during review, and some journals, such as PLOS, require the identifier to be provided before the authors can even submit a manuscript in the first place, not just for Dryad, but for any data associated with an article. Some journals allow the embargo: Proceedings of the Royal Society allows a one-year, no-questions-asked embargo, while PLOS doesn't allow any embargoes, so that option is turned off for those authors. And some journals are very sensitive, I never really understood why, about the knowledge that a paper is about to come out with this title and these authors. So for Proceedings of the Royal Society B, for instance, we're not allowed to reveal that there's a paper coming out, so the record is hidden from discovery until the article is available. Others, and this is Proceedings of the Royal Society A, are fine with it: once the data's in there, if the authors choose to make it available even before the paper comes out, they don't mind.
So all these kinds of customizations are available. I think it creates a challenge for us: it's a somewhat complex ecosystem for an author coming to Dryad to know what the policies are for their journal and how to follow them. We need to figure out better ways to make this less intimidating.

Oh, I'm sorry, yes. So I said all the content is paid for up front. About two thirds, I think somewhere between two thirds and three quarters, of the content in Dryad is sponsored by an organization. That may be a publisher, a society, an institutional library, or a project grant. One sponsor, the Canadian Healthy Oceans Network, sponsors deposits on behalf of all its researchers. And then there's an escape route for those without an organizational sponsorship, where they can pay themselves, out of a grant or however they can muster the money. So the cost to the submitter is zero for these, and PLOS authors pay $120. The charges range from about $95 to $120 depending on the sponsorship and membership details. The question was what happens if you have no sponsorship: this first row is for a journal with no partnership, no anything, and there it would be $120. Another question is whether it matters what kind of data or how much. This covers up to 20 gigabytes, however many files, whatever kind of data; above 20 gigabytes there are incremental charges for storage. The average size, as I'll show in a second, is much less than 20 gigabytes.

This shows the uptake since 2009, when Dryad was launched. We're now receiving about 3,000 to 3,500, possibly a little more, data packages per year. So it's the volume of a very large journal, basically. When it started in 2009, the average size of those data packages was 3.4 megabytes; it's now 210 megabytes. So there's clearly a trend there that we need to be considering: things are getting bigger. No, the median shows the same trend, yes. Although we have right now about 90 integrated partner journals, we have content in the repository from over 460. So there are a lot of one-offs: someone has used it with their main journal but is publishing in this other one. As time goes on, we accumulate a record of who's depositing from which journals, and we start talking to those journals about integrating. The other statistic to note is the downloads. I think the easiest way to think of it is that on average, and this is a little old, it's probably higher now, a file will get about 20 downloads within a year of deposit. The question was whether we're set up for much larger, terabyte-scale data sets. Not currently; that is something we'll need to plan for, because it's becoming more and more common. We're beginning to work on technologies for uploading a lot of proprietary image formats with OMERO, and that will get us into that zone pretty quickly.

Okay, so I want to contrast these journal policies, which led to the uptick in deposition with Dryad that we've seen, with something that's happened in parallel, which is funder data policies. I'll use the NSF as an example, because I'll show you some data on it in a moment. Their longstanding policy, which no one was ever aware of or really cared about much, as far as I can tell, is that investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered with their NSF funds.
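As a rough illustration of that charge structure, here's a minimal sketch; the $120 base charge and 20-gigabyte allowance come from the talk, while the overage rate and block size are invented placeholders.

```python
# Hedged sketch of the data publication charge described above.
# $120 base and a 20 GB allowance are from the talk; the $50-per-10-GB
# overage rate is a made-up placeholder, not Dryad's actual rate.
import math

def submission_charge(size_gb: float, base_usd: float = 120.0,
                      included_gb: float = 20.0,
                      overage_usd_per_10gb: float = 50.0) -> float:
    # Flat charge up to the included allowance, then per-block overage.
    if size_gb <= included_gb:
        return base_usd
    extra_blocks = math.ceil((size_gb - included_gb) / 10.0)
    return base_usd + extra_blocks * overage_usd_per_10gb

print(submission_charge(0.21))  # 120.0 -- the average package is ~210 MB
print(submission_charge(35.0))  # 220.0 -- two overage blocks at the made-up rate
```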
There's no systematic enforcement of this, but there's been increasing visibility. In 2011, NSF introduced the requirement for data management plans, and some of the programs provided more detailed guidance on how to share data; there was no requirement to archive data at the time of publication. In 2013, data was allowed to be counted as a product within the biosketch. And 2016 brought NSF's response to the White House mandate for all federal agencies to have a public access plan. There were no substantive changes, in my view, to the data policy in response to it, but there was clarification that the cost of archiving can be included in grants. Some have said that the introduction of the data management plan requirement in 2011 was a big, influential milestone. So what does the data actually tell us about its effect? It's somewhat difficult to tease apart, because all of these things were changing at once, but there's a nice study from McGee et al. looking specifically at phylogenetic data, which is within Dryad's wheelhouse, so we looked at it pretty closely. The y-axis is showing the proportion of data available. The leftmost box is for articles that were neither funded by NSF nor published in a journal with the joint data archiving policy, and you can see there was actually an increase in the archiving of phylogenetic data even in that background set. The second box shows NSF funding only; there's a bit of a step increase, and this is again the result of a regression, so we're not seeing raw data here. The third box shows journal policy only, and it's somewhat remarkable: it seems to go from the zone of about 10% to somewhere above 60, 70, 80% availability. And then the far right shows the two combined. So it appears from this analysis that the journal policy is much more influential; even though the two came about at the same time, when you try to tease them apart, the journal policy seems to have been more important.

But just having a stated journal policy is one thing; how you implement it has been shown to affect uptake substantially. So again, more work from Tim Vines' group; he's done some really nice work on this, and he was for many years the managing editor of Molecular Ecology, which was one of the pioneers in data archiving in our field. This is showing the percent of eligible papers with data available online. On the left are papers from journals that had no archiving policy at all, that hadn't adopted JDAP; these are all in the evolution and ecology realm again. In the middle are journals that just recommend archiving, which was a common weasel way to adopt JDAP: just replace the one word "requires" with "recommends". Some of those journals have since switched to "requires", based on results like these. And on the right are those that require it, grouped into two classes. On the left of that pair are two journals with the expectation that the data would just be reported in the full text or the bibliography, and on the right are ones that have a data availability statement, like the one I showed you for PLOS, where you have to state as part of the manuscript, something that's checked in production, that the data are available at this place. And that little extra step correlates, in this case, with much higher compliance at these journals.
There may be other factors in the implementation that matter, but this is what popped out in their analysis. Another question with this policy was: well, we gave authors this out, to embargo their data for a year, and there was a lot of argument over how that had to be in there or else the policy couldn't be accepted. The reassuring and pleasant outcome is that it's not taken advantage of as much as you might expect. Looking at all the files in Dryad for which the journals allowed an embargo to be chosen by the authors, the vast majority, over 90%, either made the data files available immediately, even before the article was out, or at the time of publication. 750 here chose a one-year embargo, and a very, very small tail asked their editors for longer embargoes. There were actually just a couple of journals allowing these longer embargoes, very much dependent on the editors.

With articles, we publish these things, and it's a professional obligation, and it's our professional reward, and it makes us feel good, and it's not clear whether the reward system for articles is a negative incentive or a positive incentive. You can't say it's either a carrot or a stick; it's what Geoff Bilder calls "carrotish sticks". And that's kind of a nice place to be, and it's where we'd like to be with data: where it's both part of what you think is required of you as a professional scientist and part of how people judge you as a professional scientist.

So there are some good examples of rewards happening via data citation, but they're actually fairly rare. This is an example where the three authors here have no overlap with the authors of the yellow-outlined paper that's cited, which is actually a data set in Dryad. The authors of this paper deposited their own data on top of the data that they reused from Dryad. So this is where the ecosystem is working; we do see cases of it. And this is also a case where the first author of the data set is not the first author of the paper it's associated with, so this person is getting credit for that data in a way that they wouldn't otherwise have gotten.

More common is a case like this. This is from a PLOS paper, a PLOS Genetics paper. The methods section says, we compiled a list of species with multiple sex chromosome systems from this database, which is a database that's actually in Dryad, but it's cited, as you can see in the reference, via an article, a data journal article in Scientific Data. So this is part of the confusion over what a data citation means. I think in many cases people are getting credit for reuse of data, but they're not getting it via data citation; they're getting it via credit to the article the data is associated with. And in the data availability statement, they report this already-published data, which came from this earlier paper, as the data that theirs is based on. So there's confusion over what the appropriate way to cite data is, and it's not clear what the best answer should be.

In the case of Dryad, we've tracked forward through all of the papers we could find in Europe PubMed Central that had Dryad data associated with them, thousands of articles, to see what the original authors had done. Did they cite the data associated with their own article in the bibliography? Did they put it somewhere in the full text?
We couldn't distinguish the data availability statement from the full text, because they're not that well separated in the XML we had on hand, but we could tell whether the citation was in the bibliography or not. The purple bars here show, for the four years, the fraction of times the data is cited in the works cited section alone. The red bars at the bottom are the fraction of times it's cited in both the works cited and the full text. And the other two bars are somewhere in the text only, or not at all; a shocking number are not at all. But the important number is the combination of the two middle bars: very rarely are the original authors citing their own data in the bibliography.

Fortunately, the linkage between the data and the article can be achieved without a formal citation. This shows an example from an Elsevier journal. They're not a partner journal; the data knew about the article, but the article didn't necessarily know about the data. Fortunately, in the DOI system, when we cross-link the data DOI and the article DOI, they carry metadata that links them to each other. So an indexer, or a sophisticated publisher like Elsevier, can discover that linkage even if the authors and the editors and everyone in between messes up all of the steps to actually get the citations into each other's bibliographies.

Another challenge that we face is quality control. This is not work of my own, but of folks who had looked at data in Dryad, were concerned about trying to reuse data that wasn't all that easy to work with, and wanted to look more systematically at how often the data archived in the repository is actually reusable. What we see here is a score, on the top, for the completeness of the data, relative to what they could tell from the article ought to be in the archive, and on the bottom, the reusability of what was provided. A score of five is the best, a score of one is the worst, and with this somewhat arbitrary line down the middle, about half of the articles they looked at in Dryad were obviously missing files, and well more than half of the files that were available had some issues with their reusability or interpretability, such that the original analysis could not be recovered. This is probably inherent in the nature of there being many journals that only require data to be submitted upon acceptance: not only has no reviewer looked at it, there's not even the threat that a reviewer will. Some journals require data to be submitted at the time of review, but they're maybe afraid to ask their reviewers to actually do any work to look at it. So perhaps it's not too surprising that there are issues with the quality of data in the repository, and overcoming that, incentivizing authors to take an interest in having good data archived, is the next big challenge for us as a repository.

In the absence of good data citation infrastructure and other incentives, it's an open question whether other metrics of data impact might be of value as incentives, such as the number of downloads, which presumably is going to be more dependent on the quality of the data. This is a screenshot from a paper of mine within Impactstory, for people familiar with that altmetrics tool.
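Here's a rough sketch of the kind of check just described, under the assumption that the articles are available as JATS XML (as from Europe PMC); the helper function and the DOI suffix are hypothetical.

```python
# Hedged sketch: given JATS XML for an article, test whether a Dryad
# data DOI appears in the bibliography (<ref-list>) versus elsewhere in
# the body. The DOI suffix below is a made-up placeholder.
import xml.etree.ElementTree as ET

def locate_data_doi(jats_xml: str, data_doi: str) -> dict:
    root = ET.fromstring(jats_xml)
    # Check every reference list for the data DOI.
    in_biblio = any(data_doi in "".join(ref_list.itertext())
                    for ref_list in root.iter("ref-list"))
    # Check the article body (full text) for the data DOI.
    body = root.find(".//body")
    in_full_text = body is not None and data_doi in "".join(body.itertext())
    return {"bibliography": in_biblio, "full_text": in_full_text}

xml = """<article><body><p>Data: doi:10.5061/dryad.xxxx</p></body>
<back><ref-list><ref>article references only</ref></ref-list></back></article>"""
print(locate_data_doi(xml, "10.5061/dryad.xxxx"))
# {'bibliography': False, 'full_text': True}
```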
It aggregates metrics about many different research products, so not just articles but also data, SlideShare decks, and other kinds of products that might be out there emanating from your research, and it tracks their impact not just in other articles but as a Mendeley bookmark, or a download in this case, or a Delicious bookmark, or other types of use that can be easily tracked on the web, often much faster than citations accumulate. So it's a useful way for someone to see when their data is having a high impact. We don't do much with this now except export those numbers for altmetrics providers, but it would be interesting to begin experimenting with it: flagging highly downloaded data, giving people badges, some kind of reward for having high-quality or high-impact data. We may be doing that in time.

Okay, I'll jump ahead so I can cover the second piece real quick. So, part two. We talked a bit at the beginning about the opposite problem from the one Dryad is trying to solve, which is data that should be with an article but is missing. What about the data that's in the article but isn't actually all that reusable? Recall that complicated phenotype statement I showed you at the beginning of the talk. That is an example of a naturally occurring phenotype. There are different types of phenotypes that biologists study. Geneticists are interested in mutants: this is a wild-type and a mutant fish, and you can see what the differences are between them. That's one type of phenotype that might be reported. There's also naturally occurring variation among taxa. And there are different languages that are used, different objects that are studied, different things that are of interest. So while they both have a superficial similarity in what's under study, being able to combine phenotypes reported in one literature with those in the other is actually not so straightforward.

Who remembers Total Information Awareness? We were so young and innocent and naive back in those days; now we're living it. The idea here is to take advantage of all of the articles that we're writing and reading with our human minds: rather than walk down the hall and ask your colleague, do you happen to know if this phenotype is relevant to anything you study, or go to a different taxonomic expert and ask if they know of any phenotypes similar to this one, is there a way to do that for the literature as a whole, to make these connections given the free text? We have to take all of the stuff that we're outputting and actually input it into a system that can read it, even though it's not human.

So this is a profile of phenotypes for a particular genetic locus in mouse. You can see this table showing all the different things that are affected by different alleles, different genetic variants, of this gene in mouse. That's an example for a model organism genetic mutant. Now, suppose we wanted to combine that with a phenotype from a different domain, evolutionary biology in this case. The model organism communities have done a really nice job of building ontologies that allow you to say, if this is the phenotype, it has a relationship to these abstract concepts, concepts that are used more widely than just this taxon or just this gene. So here we can compare a phenotype that occurs with some gene to a phenotype that occurs in some taxon, based on their relationships within these ontologies. I'm going to skip over some of the details for the sake of time.
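As a toy illustration of that idea, here's a minimal sketch comparing two phenotype terms via shared ancestors in an invented mini-ontology; real systems like the one described use much richer, information-content-based semantic similarity measures.

```python
# Toy sketch of ontology-based comparison: two phenotype terms are
# compared via the overlap of their ancestor terms in a shared anatomy
# ontology. The mini-ontology below is invented for illustration.
parents = {
    "pectoral fin": ["paired fin"],
    "forelimb": ["paired limb"],
    "paired fin": ["paired appendage"],
    "paired limb": ["paired appendage"],
    "paired appendage": ["appendage"],
    "appendage": [],
}

def ancestors(term: str) -> set:
    # A term plus everything reachable by walking up the hierarchy.
    out = {term}
    for p in parents.get(term, []):
        out |= ancestors(p)
    return out

def jaccard(t1: str, t2: str) -> float:
    a, b = ancestors(t1), ancestors(t2)
    return len(a & b) / len(a | b)

# A fish fin and a tetrapod limb share abstract ancestors, so they can
# be compared even though the surface terms differ.
print(jaccard("pectoral fin", "forelimb"))  # ~0.33
```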
But I wanted to describe this system, which takes in phenotype data from taxonomic studies here, and genetic data here, into our little total-information-awareness machine, which we call the Phenoscape Knowledgebase. Basically, we generate hypotheses for the genes that underlie phenotypes by combining data from these different sources, using techniques that aren't super sophisticated, but that serve to show the potential for extracting this structured, contextualized information from the literature. We use semantic similarity. This is an interface showing a search for a particular gene, COD, from zebrafish. The profile for this gene returns lots of similar profiles for other taxa, in this case fish, with statistics for how good a match each is, so you know when you can begin to ignore them. I don't want to go into too much detail here, but just to say that we can do this large-scale search. The question is whether it's giving us biologically, scientifically useful output.

The test case for this is a study of the fin-to-limb transition in vertebrates, and I want to thank the folks who did a monster load of data curation to be able to answer this. As you're all aware, I hope, in the evolution from aquatic vertebrates to terrestrial tetrapods, there were tremendous changes in these particular bones, which are well studied in the fossil record. So we have good phenotypic profiles for the bones along these particular branches of the evolutionary tree of vertebrates. We can take each of these branches, some set of phenotype changes that are thought to have happened at that point in evolutionary history, and search, using the semantic similarity engine I mentioned, against all of the genetic mutants known from mouse, from zebrafish, from Xenopus, from human, see which genes come back, and compare those to the ones that have been reported in the literature as good candidate genes. This is an extremely well studied system. Unfortunately, no one was around to observe the transition, so we don't actually know what the genes are, but we can ask how well we recover the expert knowledge, and whether there are any surprises that, in retrospect, the experts should have been able to pick out.

The red bars here show the similarity scores for genes that were in the set experts had identified in the literature. The blue bars show the distribution of similarity scores for those that weren't recognized in the literature; these are normalized distributions. And it's a very interesting shape. There is clearly differentiation between the two distributions, but among the ones picked out by the experts there's an almost bimodal-looking distribution, where these, you might imagine, are accurate, and these maybe are false. We don't know what the answer is, so we can't say that for certain, but it's very suggestive. The other interesting thing is that there's still quite a bit of overlap between the distributions. Some of those in the right tail of the blue may actually be good hypotheses, and the provenance of each hypothesis is there for someone to look back at and ask, should I have picked this out originally? Was the literature just missing it? And there are also some really poorly scoring things that were identified as candidates, so we can go back to those papers and ask, was that a misfire? Or is there actually a reason why they should be candidates that our system missed?
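To give a flavor of how one might quantify the separation between those two score distributions, here's a small sketch using a rank-based test; the scores are synthetic stand-ins, not Phenoscape output, and the bimodal shape is built in by construction.

```python
# Hedged sketch: synthetic similarity scores for expert-identified
# candidate genes versus all other genes, compared with a rank-based
# test. Numbers are invented stand-ins, not the real analysis.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
expert_scores = np.concatenate([rng.normal(0.8, 0.1, 40),    # "accurate" mode
                                rng.normal(0.3, 0.1, 20)])   # possible misfires
other_scores = rng.normal(0.35, 0.15, 500)

# One-sided test: do expert-identified genes score higher overall?
stat, p = mannwhitneyu(expert_scores, other_scores, alternative="greater")
print(f"U={stat:.0f}, p={p:.2g}")

# High-scoring genes in the "other" set are candidate new hypotheses;
# low-scoring "expert" genes are worth re-examining in the literature.
```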
So that was just meant to motivate the utility of large-scale analysis of data that is currently entrapped within natural language across lots of articles. All right, I'll close there with some high-level lessons from the two stories. Data underlying the scientific literature is being published more frequently now than at any time since perhaps the 17th century. Dryad provides a model for publishing long-tail data that leverages existing journals and our scholarly communication culture as scientists, and there's a growing evidence base for how to design an effective journal policy, although there are still some holes. Work still needs to be done on understanding how we can incentivize the publication of high-quality, reusable data, and we still need to figure out how to ensure that the data expressed as free text within articles is used to its fullest potential, which is not the case today. Big teams were responsible for both of these projects. I particularly want to thank Heather Piwowar, whose work I talked about a lot, on the Dryad project, and Jim Balhoff and Prashanti Manda, who were on my team working on Phenoscape. All funded by the NSF. Thanks very much.