Okay, super. Excellent. Thank you so much for inviting me to be here. I'm really thrilled; this group is doing so much good stuff, in both breadth and depth, and I'm really honored you've invited me to talk, so thank you very much. I am Heather Piwowar. I am a postdoc with Duke University, affiliated with the Dryad repository. I live in Vancouver, Canada, so I'm also a postdoc at the University of British Columbia; I'm lucky to have multiple affiliations. As my day job, I track the impact of data. This is really a working presentation: for the last few months I've had my head down and my fingers in a bunch of different projects, and I'm going to try to pull them all together into this presentation to let you know what I'm currently working on around data metrics and making them matter. Specifically, I've been working in three areas: collecting more evidence, so doing research studies; thinking about a framework for all the ways that tracking data matters and what that means for our policies; and building tools to make that data tracking easier. I'll try to interleave those, and I'd love feedback in the questions. As you know, we're going into OA Week, and I look forward to continuing to tell this story, so your questions and feedback would be really appreciated. Okay, I'm going to start with some things I've done previously, to get us warmed up and to give you a sense of what I mean by tracking data and what we can learn from it. Let's imagine that we find 10 data repositories that existed in 2005.
So I did this: I took 10 data repositories, some of them journal data repositories, picked 100 random data sets that were deposited in 2005, and then used citations to the papers that describe the data collection, from Web of Knowledge and Google Scholar, plus text mining in the full text of papers to look for those accession numbers. I then calculated, for each of those 10 repositories, how many times there is evidence that the data is being reused in the scholarly literature over time. This graph shows some preliminary results: for data sets deposited in a publicly available repository in 2005, how many times, cumulatively, is there evidence they've been used? You can see there are differences between repositories, and there's lots more we can dig into here to learn about best practices. Using similar data, we can start to examine whether making data available truly gives it the new lease on life that we think it does. In this graph, the light blue line that's going down is evidence of data reuse by authors who have surnames in common with the authors who deposited the data sets. That's our best guess, at large scale, of who had access to the data: how many papers did the data-collecting authors publish? You can see, for data made available in 2007 in this case, they published a lot initially and then it tapered off over time, presumably as they went on to collect more data and do bigger and better things. The dark blue line that's increasing is authors who we don't think necessarily had ties to those authors; they're people who found the data on the internet, in the data repositories, and you can see that's going up. So this is the beginning of evidence that data has a new lease on life. Now, the true height of that blue line, and how it compares with the light blue line, is more evidence that we're hoping to collect.
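The reuse-counting step just described can be sketched roughly like this. This is a minimal illustration only, not the actual analysis code: the tiny corpus, the GEO-style accession pattern, and the counts are all made up for the example.

```python
import re
from collections import defaultdict

# Tiny stand-in corpus: open-access full texts with publication years.
FULLTEXTS = [
    {"year": 2007, "text": "We reanalyzed GSE1133 and GSE2109 ..."},
    {"year": 2009, "text": "Expression profiles from GSE1133 were reused ..."},
]

# Illustrative GEO-style accession pattern; each repository has its own format.
ACCESSION_RE = re.compile(r"\bGSE\d+\b")

def cumulative_reuse(fulltexts, deposited_accessions):
    """Per year, the running count of papers showing reuse evidence."""
    per_year = defaultdict(int)
    for paper in fulltexts:
        hits = set(ACCESSION_RE.findall(paper["text"])) & deposited_accessions
        if hits:  # the paper mentions at least one tracked data set
            per_year[paper["year"]] += 1
    running, cumulative = 0, {}
    for year in sorted(per_year):
        running += per_year[year]
        cumulative[year] = running
    return cumulative

print(cumulative_reuse(FULLTEXTS, {"GSE1133", "GSE2109"}))
# {2007: 1, 2009: 2}
```

In the real study this matching would run against full-text corpora and be combined with citation databases, but the shape of the output, a cumulative evidence-of-reuse count per repository over time, is what the graphs show.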
More analysis I've done with this sort of reuse data is to look at the impact per funding dollar of making data available. If we look at traditional research funding, the amount of funding a study gets, and how many papers are produced, about four hundred thousand US dollars results in about 16 papers. Now, that's a gross oversimplification, but it's that sort of order of magnitude. At the funding level of the Dryad data repository, when it's at scale, and at levels of data reuse similar to the ones I've been measuring in the previous slides, four hundred thousand dollars would facilitate a thousand reuse papers. This sort of evidence can potentially inform funders that yes, you really are getting good scientific return on investment, bang for your buck, by funding data repositories. That analysis was written up in a Nature letter and on my blog. So that's some of the analysis I've done in the past. Now let's put this in the greater context of why we should track the impact of data; more specifically, let's delineate some reasons. The first and most obvious is to encourage data archiving. If we can show researchers who make their data available that they get a higher citation rate, or more recognition, that's a personal incentive to make their data available. Furthermore, the more useful they make their data, the higher those reuse-tracking numbers would be, and the higher their reward would conceivably be. So tracking data, and making that tracking information available, rewards the most useful collection, curation, and dissemination of data. It also allows us to include all the relevant contributors in the reward structure: often the people who collect data don't necessarily pass all the hurdles to be an author of a paper, but they are certainly an author of a data set. Enabling data sets to be standalone entities that receive their own rewards facilitates these sorts of
contributors playing a more official role in the scholarly ecosystem than they have until now. Doing this sort of tracking, pulling out and teasing through the data, also allows people to discover associated analyses, data sets, and research communities: who reuses what? Do those people cluster together in some way you couldn't tell based on, for example, the keywords of their papers? Importantly, it alerts investigators to re-analyses. It's easy to forget, when we are not the authors of data sets ourselves, that when you make your data available it's really hard right now to actually know who does anything with it. And we're interested: as researchers, we want to know that they did the right thing, that they were responsible with it; we're curious; it strokes our ego; we want the opportunity to respond. To do that we need to know when it happens, and Google Alerts just don't cut it; we need more robust data tracking than that. It will allow us, in the future, to build filters for frequently used data sets. People may want to say: if other people have used a data set in the past, that's a good indicator that it's easy to use, or perhaps good quality, so let me look for those. Alternatively, some researchers might ask: are there data sets out there that meet various criteria but have not been reused a lot? That may be an opportunity to leverage scholarly resources that are neglected but potentially important. Tracking data will also help when a data set is detected to be problematic, whether because a method is found to have limitations that people didn't understand before, or there might have been data manipulation, or poor ethics in the data collection that's only discovered later. Right now it's really hard to understand the implications that has had on the scholarly literature, and this is the way science is supposed to work: we're supposed to keep learning new things. So we
really need to know what the follow-on effects are, and being able to track data is important to facilitate that. Importantly, I think it will help us avoid harmful shoehorning. Right now we're doing various hacks, I think, on the scholarly communication system to let data enjoy many of the benefits that scholarly articles have: we're assigning it DOIs, we're creating data journals, and various other things. Some of those I think are appropriate; some I think have limitations. And we're really doing it because there are not good ways to track data and to reward data, so by building those good ways we can stop shoehorning when it's not appropriate. Data is actually, as you know, gaining ground, and is relatively far along in this sort of acceptance in the scholarly ecosystem relative to some other research output types: software citation is much farther behind data citation, and citation of other products is even farther behind that. So by tracking data sets, we're trailblazing for other important research types. And finally, and importantly, it helps us drive policy, funding, and tool requirements based on evidence. To the extent we're doing science, and science is based on evidence, surely our science policy should be based on evidence as well. Okay, so now I'm going to take a break in this articulation, as I try to highlight for you all the different ways that data tracking is important, and give you some first results. These haven't been published yet, so I'd appreciate your feedback later. I looked at the Gene Expression Omnibus (GEO) database. It has one particular type of data in it, gene expression microarray data, and it's hosted by the NCBI in the US. For various reasons it's a great one to do reuse studies on, so I've focused on it several times, and I focused on it again for this study. What I did (I don't have a slide in here) is I used full-text mining to
identify 11,000 papers that created gene expression microarray data, got the PubMed IDs of the papers that described their data, along with their accession numbers, and looked for those identifiers in PubMed Central wherever I could find them, considering those the citations to the papers. In this graph there's one little panel for every year that data was deposited into GEO, from 2001 until recently. The tan line is studies where the data behind the gene expression microarray study is not available in GEO, and the blue line is studies where the data is available in GEO, where I can find that link to the GEO data repository. The graph is a density graph of the number of citations those papers have received. What you can see is that the tan and blue lines are not on top of each other. If they were on top of each other, that would mean there's no difference in the number of citations depending on whether data is made available or not. The fact that the blue line is systematically to the right means that studies with data available have received more citations, in aggregate, than similar studies without data available. I've got a few more graphs of that same data set showing a few different things. This one, actually, for time, I'm going to skip. This one, remember I had a light blue line and a dark blue line earlier for authors who created the data: in this graph, the orange line going down is the original data-collecting authors. The number of papers they have published with their data sets is high in the first few years after data publication, at the zero point, and goes down in the years after the data has been published. The blue line here is the number of papers by authors who we think are not the same people as those who collected the data
and again you can see it going up. The different panels are the different years the data was deposited. Early on, in 2001, there wasn't very much action here at all; it gets really interesting in 2004, 2005, and 2006, when there have been enough years for data reuse to occur, and you can really see that blue line taking off while the original-author line isn't. So it's yet more evidence for that new life that we believe data archiving facilitates. This graph here is the cumulative number of reuses in the years since data publication. Some people say, if we make the data available, will anyone use it? This graph really says yes: in aggregate, data really is used; those lines are going up, and there's no sign of them flattening out. This one is interesting: in this case the x-axis is the year the data sets were made available, and the line is how many citations the papers have when their data is made available. The fact that in 2005, for example, that line is well above the dotted line means that papers published in 2005 with available data have received about 25 percent more citations than similar papers whose data has not been made available. You can see that as we get to papers published more recently, that citation benefit tapers off. That's potentially for a lot of reasons we can talk about later; one is that they haven't had long enough to achieve the appropriate amount of reuse. Some of you with eagle eyes, good memories, and a deep fascination for data citation might notice that these estimates are lower than the estimates I've calculated previously; we can talk about that at some other point if you'd like. And finally, this graph here shows an interesting thing. Like I said, I looked for accession numbers: GEO doesn't give DOIs for data, they give unique accession numbers, so I looked in PubMed Central for
accession numbers and tried to calculate, when someone reuses data, do they just reuse one data set in their paper, or do they reuse many? Each of these orange dots shows how many data sets a data-reuse paper used, calculated this way. You can see at the top that some data-reuse papers used as many as 50 data sets; most, down near the bottom, used only one. But you can see that number is increasing over time, so data-reuse studies, I think, are getting more and more sophisticated, which is another good indicator that we're on to something with all this data archiving stuff. Okay, so studies like that, you can only do when you can track data. I've used some hacky estimating methods to do it; if we could track data better, these sorts of studies would be easier and more accurate. So when I say do it better, what do we want to be doing? What do we want to count? We want to count data set citations in the academic literature, citations in papers wherever they may be. We also want to track impact beyond citations: is somebody using a data set for method validation that doesn't make it into their paper? It's part of their research, but they don't have a need to cite it in their publication because it's more of a background step. We're not tracking that right now, but we potentially could be, as lab notebooks go online, as people write blog posts, things like that. We also want to track impact beyond academic research; education and training is one obvious example of impact that we're just not tracking at all right now, but it would be great to do that. And reuses of the reuses: if somebody does an analysis that uses many data sets, how much impact does that study have, and how much impact do the studies that rely on that study have? These second-order uses are really important to
estimate the impact that the original ground-level data has, and they are currently out of scope for the most part. We should also track something I'm calling impact flavor. Some data sets are probably useful for method validation, some are probably more useful for replication, some for looking at certain kinds of new hypotheses. It's not just a one-dimensional scale where everything is competing in the same way: chocolate and strawberry ice cream can both be really good, and the world is better for having both of them. We don't just want data sets with more citations in the academic literature; we also want ones that are really useful for grad students, and ones that are really useful for citizen scientists to build on. There are a lot of different ways things can have impact, and we want our metrics to reflect that. So how do we get there? One thing everyone has been working hard on is standardizing: standardizing data citation format, standardizing where citations should go, standardizing identifiers. That's really the bedrock; there's been a lot of emphasis there, and there needs to be, and that's great. And as everyone knows, there are steps beyond that, and I think we're starting to have the time and energy to look at those: educating, encouraging, expecting, and enforcing data citation. There are websites that are great places to point people to, to know what to do. Talking to our journals, talking to our funders, acting as peer reviewers, doing everything we can to really raise awareness here. There are still a lot of problematic citation policies out there. One is a limitation on the number of references, which is really problematic if we expect people to cite data. Another one I just learned about a few weeks ago: the journal Cell, which is a high-impact journal, I think does not allow people to cite data in their reference lists, from what I understand because they don't consider it a
peer-reviewed resource in general, and they only allow peer-reviewed resources to be cited in the reference list. Things like that are really going to get in the way of our policies, so we need to do some talking, some listening, and some figuring out, I think, around those. We also need to open up machine-readable reference lists, we ourselves need to share more data and more usage data, and we need to build tools. So now I'll segue into the third area I've been working on. As all of you are probably aware, for years now we've been making lots of DOIs for data sets and encouraging people to put them in reference lists. At least for me, I naively assumed our existing citation tools would therefore be able to track those, and they were not able to: just because something looked like a DOI did not mean Scopus or Web of Science could magically index it and make the information available. So for more than two years that has not been working, and the fantastic news is that now Thomson Reuters is all set to release the Data Citation Index, I believe this month, though I think some of you on this call may know a lot more about it than I do. That's game-changing: all of a sudden we really can start to make good on the promises we've been making to people to cite data sets, and I actually want to pause to really emphasize that this is a fantastic step and I can't wait until it comes out. I want to pause because now I want to say "but," right? The problem is, of course, that Thomson Reuters is doing this for the same reasons and in the same way that they make all their products, and they need to recoup their costs; the way they do that is by subscription, and subscriptions are barrier-based. All of that means we can't mash up that data; the citation data is going to be barrier-based in
some ways. Just as right now citation data for articles does not flow like water, nor will citation data for data flow like water, and that's a real shame, because it really limits the ways we can use it for all of the purposes I outlined earlier. For that reason, and some others I'll mention briefly, Jason Priem, a grad student at the University of North Carolina at Chapel Hill, and I have founded a nonprofit called ImpactStory. It used to be called total-impact; we renamed and relaunched a month ago. The idea of ImpactStory is to go beyond the impact factor for articles, to do item-based metrics rather than container-based metrics, and then to move beyond the article, to data, to software, appreciating all those things as the first-class scholarly objects they are. The web promises new tools for conversation, so there aren't just citations; there are reference managers, social bookmarking, and social networks. As Karen mentioned, Mendeley is one of these social reference managers. Right now I think people mostly use Mendeley for articles, but I really think there's a chance people may start to use it more and more for data, just as a way of bookmarking data, to say: this is in my data library, this is something I want to have handy to reference. Twitter: I don't think many people are talking about data sets on Twitter right now, but surely, I choose to believe, and I think, why not, it's only a matter of time before people tweet and say, did you see this recent data set? Wow, either wow good or wow bad, who knows. All these different ways that people talk and interact online, all these scholarly tools: I think data is going to work its way into them. All of those different tools and their associated metrics go under the name altmetrics right now. Bibliometrics is more citation-based; altmetrics includes this broad way
of capturing use wherever scholars are doing the things they're otherwise doing, so we can track and see what's happening. There are various altmetrics tools; in addition to ImpactStory, I really encourage you to go check these out: altmetric.com tracks DOIs to data sets; PLOS Article-Level Metrics is article-based but is a granddaddy in this field; and there are ReaderMeter and ScienceCard as well. I'm going to tell you a bit more about ImpactStory and give you a couple of screenshots; it's at impactstory.org if you want to have a look. As of last night, I think, it now accepts an ORCID iD. If you're a researcher, or want to pretend to be one on TV, you can go register an ORCID iD; it just takes a few seconds to get, and it's a unique identifier for researchers, so they can associate their scholarly products with their unique researcher identifier. Right now the focus there is obviously on articles, but other types of products are allowed and encouraged, including data sets, and I think that will start to get more and more play. So you can enter your ORCID iD, or your article IDs, or your Dryad author name (there's the ORCID badge), and what ImpactStory returns for you, based on the various scholarly products you tell it about, is metrics. Here someone entered a bunch of articles and a data set, and you can see that ImpactStory went off and looked all around the web for metrics of use and impact, and puts data sets right on that same page. You could imagine this as a live CV for a scholar: it not only includes their articles, it also includes their data sets; there are my data sets in Dryad, for example. And when you drill in, when you click on this, it doesn't just give you these badges that say "highly viewed"; it then says what "viewed" means in Dryad and
reveals the usage data. These numbers here are percentiles: how many views has that data set received relative to other Dryad data sets deposited that same year? The badges are given when the percentile is over 75 percent. So this starts to show you a way we can give context to the amount of usage that data sets have, so that scholars can really be proud of it, show it off, talk about it in their grant packages, and so on. ImpactStory is designed to be mashed up: the data is as open as we can make it, as open as the data sources will let us make it. You can download it in comma-separated-value format, it's got an open API, and it is itself open source and always will be. So, as a wrap-up: are there other ways we can make data count? There certainly are, and the more people on board with this, the better. We really need to do agile research with decision makers: as funders start to implement policies, as journals start to implement policies, let's really be doing the research to know what the benefits of these policies are, what the drawbacks are, whether it's worth it. Let's really be critical and evaluative and learn from what's going on. And finally, whenever we can, with our peer-review hats on, with our conference-organizing hats on, with our grant-reviewing hats on, let's ask for and act on evidence of data impact. I think we really are in an age right now where there are thousands of flowers blooming, and it's a great time to be out there with our measuring sticks to understand exactly how big those blooms are. Thanks very much.
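(A footnote on the percentile badges described above: the logic can be sketched like this. The 75-percent threshold and the "highly viewed" label come from the talk; the function names, data, and everything else are illustrative stand-ins, not ImpactStory's actual code.)

```python
from bisect import bisect_left

def percentile(value, peer_values):
    """Percent of peer values strictly below `value`."""
    peers = sorted(peer_values)
    return 100.0 * bisect_left(peers, value) / len(peers)

def badge(views, peer_views, threshold=75.0):
    """Award a badge when an item beats the threshold percentile of its peers."""
    return "highly viewed" if percentile(views, peer_views) >= threshold else None

# 100 peer data sets deposited the same year, with view counts 1..100;
# a data set with 90 views sits at the 89th percentile and earns the badge.
peer_views = list(range(1, 101))
print(badge(90, peer_views))  # highly viewed
```

The key design point is the comparison group: views are contextualized against data sets deposited in the same repository in the same year, so the badge reflects relative standing rather than a raw count.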