In the interests of time, I think we will get started now. Thank you, everybody, for joining. This is the Open Metrics session of GREI. GREI, for those who don't know it beforehand, is the Generalist Repository Ecosystem Initiative. I'm Mark Hahnel, and I was the co-chair of the Open Metrics subcommittee with Matt Buys of DataCite. Today we're going to be talking about how we came up with the plan to work on this particular topic, what came beforehand, what we've done so far, and what we're looking to do in the future. I'm joined by my co-presenters: Iratxe Puebla of Make Data Count is going to start us off, and then I will tell you a little bit about the actual work we did, with some help from Jim Myers of Dataverse. So without further ado, I will hand over to Iratxe.

Yes, thank you so much, Mark. And thank you also to the GREI initiative for the opportunity to be here today to talk about data metrics and the Make Data Count initiative. Before getting into the actual initiative, I want to give a little bit of context as to why we believe data metrics are so important. I think everyone in this audience today will agree that data sharing is valuable. The question we tend to get a little more stuck on is understanding the actual value that we assign to data sharing. And this is an important question to tackle, because we will need to assign value to this practice if it's a behavior we want to nurture and incentivize. So it is important that, as a community, we start understanding how data are found, accessed, and utilized, not only as part of research projects and activities, but also to inform policy development. We need to start thinking about the questions I listed here: who is using data, and for what purposes? What data sets are being underutilized? How does the impact of open data translate into societal benefit, and what is the return on investment for all of these efforts we are putting into open data? It is not possible to complete this type of evaluation and answer these questions without a set of transparent and responsible data metrics. And this is where the Make Data Count initiative comes into place. Our focus is to develop and promote open data metrics that enable evaluation and reward of research data usage and impact. Make Data Count is a community effort. We work with many stakeholders. DataCite has played a big role over the years in a number of initiatives around infrastructure and outreach, but we work with many others in the community, again bringing different hats and expertise, who are interested in supporting the development of data metrics. The initiative has three main areas of focus. One is that we build open infrastructure and co-create the standards for adoption of open data metrics with the community. We also do quite a bit of work around outreach and supporting adoption campaigns for data metrics. And the third area, importantly, is that we collaborate on bibliometric studies to start building the evidence base for how data are used and what the trends and practices are. We need this information to contextualize the metrics; it provides the important nuance and interpretation that makes it possible to use them for different purposes. So I also wanted to spend a few moments on what we mean here by data metrics, because I realize it may be somewhat abstract for some.
By data metrics, we are referring to meaningful and contextualized quantitative or qualitative measures of how open data sets are accessed and utilized. There are some parameters that already give us information on data usage, such as views and downloads of data sets, as well as data citations. These three parameters are not the only ones that could capture all the possible uses of data; we know there are others, but we view them as important because they already give us some pointers as to how data sets are being used. For example, views and downloads already tell us that researchers found a data set interesting or relevant enough to their work to actually spend some time on it and go take a look. And data citations start to build bridges between the data and other outputs, so they are clearly signaling that the data has been used or reused as part of research. When we talk about data metrics, we like to think about this as a journey, where the eventual destination is to incentivize researchers to share data. Again, we think this is something that is very valuable, but we also know we are not there yet, because we are lacking incentives. Current research assessment frameworks are often focused on journal publications and do not include data. What this means is that researchers sometimes see sharing data as a burden, as something they have to comply with, a tick-box exercise, because a journal asked them to share their data or they feel they need to comply with a particular policy. It's not something they do because they perceive it as bringing them tangible professional benefit. So to get to this point, we need to start several steps back and begin with the community discussion around the best practices we would like to see. The good news is that there has been a lot of work over the years in this space, having these community conversations on what best practices around data usage should be. A lot of these have been led by Make Data Count, but they have also involved different groups. An example of these conversations and best-practice recommendations is the FORCE11 Joint Declaration of Data Citation Principles, published now almost 10 years ago. These principles outline the purpose, function, and attributes of citations to data sets. Importantly, they highlight the need for data citation practices that work both for human readers and for machines: we need to be able to scale and propagate that information through the available scholarly communication infrastructure. There has also been quite a lot of interest in developing best practices for data citations among publishers and data repositories, and a number of recommendations and resources have been produced. But the conversations in this area have not focused only on data citations; there has been quite a bit of work on usage more broadly. I wanted to mention in particular the COUNTER Code of Practice for Research Data, which was led by the Make Data Count initiative in collaboration with the COUNTER group. This is an important resource because it provides a standard for data repositories and platform providers to standardize the usage metrics reports they produce and distribute. It is a way to provide those usage reports in a normalized and consistent way across repositories.
This was published five years ago, and it has been implemented by a number of repositories, including those in the GREI initiative. I should mention that a lot of things are changing in the data space, and quite a few things have happened since 2018, so we are in conversations to work on a new, revised version of this code of practice for data. Another important way the community came together around usage metrics was the RDA working group dedicated to this topic, which brought together different stakeholders. This clearly signaled that there was strong community interest in developing reliable usage metrics for data and in having repositories produce them reliably and consistently. The recommendations from this working group were published last year, and I have highlighted some of them here. A couple of things are particularly worth noting from the recommendations listed: there was an endorsement of the COUNTER Code of Practice for reporting views and downloads, and of utilizing DataCite for aggregating that information. Another important recommendation from this group was that data usage is very nuanced, and we shouldn't be tempted into oversimplification: developing a single opaque metric is not going to be to the benefit of the community. So we know this is challenging, but we need to work on it in a nuanced way. Right, so a lot of this groundwork has taken place around community best practices. Once we have that, what we would like to see is adoption of these best practices by the different actors in the community, and what we know is that we are not there yet in terms of the adoption we would like to see. Thinking, for example, about the COUNTER Code of Practice: a number of repositories have implemented it, but adoption has not been as broad as we would like. We know there are some challenges around implementation, because the processes can be quite time consuming for repositories; implementation requires developer understanding of the code of practice, as well as a certain amount of code to maintain in order to do the log processing and produce the SUSHI reports. So one of our current areas of work is to address the challenges we are seeing with implementation for data usage, and in response DataCite is working on a usage tracker to facilitate consistent collection of data usage information from repositories. The usage tracker collects web-based usage and doesn't require log file processing, and the idea is to encourage adoption across repositories so that we have a consistent way of collecting that information at scale. The tracker is currently in beta, but if any of you are interested, we are very happy to send you additional information, and some documentation is also available on the DataCite website.
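To give a sense of the log processing burden the tracker is meant to remove, here is a minimal sketch of the kind of work the COUNTER Code of Practice asks of repositories. This is not the Make Data Count production pipeline: the tab-separated log layout and the 30-second double-click window are illustrative assumptions.

```python
# Minimal sketch (not the production MDC pipeline) of COUNTER-style log
# processing. Log layout and the 30-second double-click window are assumptions.
from datetime import datetime, timedelta

DOUBLE_CLICK_WINDOW = timedelta(seconds=30)  # assumed de-duplication window

def parse_line(line):
    """Parse one access-log line; the tab-separated layout
    (timestamp, client IP, user agent, dataset DOI) is hypothetical."""
    ts, ip, agent, doi = line.rstrip("\n").split("\t")
    return datetime.fromisoformat(ts), ip, agent, doi

def count_downloads(lines):
    """Count downloads per DOI, collapsing rapid repeat requests
    from the same client into a single event."""
    last_seen = {}   # (ip, agent, doi) -> timestamp of last hit
    totals = {}      # doi -> counted downloads
    for line in lines:
        ts, ip, agent, doi = parse_line(line)
        key = (ip, agent, doi)
        prev = last_seen.get(key)
        if prev is not None and ts - prev < DOUBLE_CLICK_WINDOW:
            last_seen[key] = ts  # double-click: refresh the window, don't count
            continue
        last_seen[key] = ts
        totals[doi] = totals.get(doi, 0) + 1
    return totals
```

Maintaining this kind of code, plus robot filtering and report generation, is the per-repository overhead the web-based tracker is designed to avoid.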
So that was related to data usage. There have also been some challenges around the adoption of best practices for data citations. I think one of the key elements here is that the workflow is a multi-step process involving different stakeholders who have different priorities and obviously come at it from different perspectives. Essentially, we need this information to propagate through the system with everyone doing the right thing at each step; if at any moment one of the lights here goes red, or a switch is off, the citation will not actually make it into the systems that expose that information to the community. So, to go over some of the challenges in this process: researchers obviously should cite their data so that the process can start, but that's not always the case; there is not necessarily a strong culture of citing data sets just yet. The citation should then get to repositories, but we know that not all repositories capture citation information, and also that some repositories use accession numbers instead of DOIs, which means we are not collecting that metadata for reuse. When the citation reaches the publisher stage, we have also seen challenges in getting journals to optimize their processes so that the metadata for the data citation makes it all the way to Crossref and is eventually aggregated, through Crossref and DataCite, in their Event Data service. So essentially what we have seen is that there are many instances of data sets being used and cited, but the system is not optimized at the moment to capture all of that information.
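As a hedged illustration of the aggregation step just mentioned: a repository or researcher could look up the citation events recorded for a dataset DOI via DataCite's public Event Data service roughly as below. The endpoint path and parameter names are assumptions based on DataCite's documented /events API, and the DOI shown is hypothetical.

```python
# Hedged sketch: querying DataCite Event Data for citation events involving a
# dataset DOI. Endpoint and parameter names are assumptions to verify against
# DataCite's documentation.
import requests

def citation_events(doi, page_size=25):
    resp = requests.get(
        "https://api.datacite.org/events",
        params={"doi": doi, "page[size]": page_size},
        timeout=30,
    )
    resp.raise_for_status()
    for event in resp.json().get("data", []):
        attrs = event["attributes"]
        # Each event links a subject and object with a relation such as
        # "references" or "is-referenced-by".
        yield attrs["subj-id"], attrs["relation-type-id"], attrs["obj-id"]

if __name__ == "__main__":
    for subj, relation, obj in citation_events("10.7910/DVN/EXAMPLE"):  # hypothetical DOI
        print(subj, relation, obj)
```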
We know that we need to make this simpler for everybody, and this is the motivation for another of our ongoing projects: we are developing the Data Citation Corpus, a global, open resource. This is a project we are leading at DataCite with support from the Wellcome Trust. The goal is to develop a comprehensive corpus that incorporates data citations from different sources into a single centralized resource available to the community. The corpus aims to incorporate citations that come through the metadata deposit workflow via persistent identifier authorities such as Crossref and DataCite, but also to capture citations that other groups may be collating through different tools and strategies, for example by using machine learning to mine the full text of articles to identify where data sets are being mentioned. We are at the early stages of this project, developing a prototype that will initially include a seed file with information on data citations from DataCite Event Data, as well as a file of data citations provided by the Chan Zuckerberg Initiative; they are one of the groups who have been using machine learning to identify citations to data sets by mining the full text of articles. As we move ahead, we plan to add additional citations, because we want to ensure broad coverage as part of the corpus, and also to make improvements to the functionality, so that eventually we have it at the production stage and can make the corpus available to the community, both through an API and a data dump, as well as through a dashboard of visualizations for users. To give you a glimpse of what the prototype looks like and what we hope to achieve in the future, although we recognize we are still at an early stage, we want to show you a view of the current version of the dashboard, built on the seed file of citations we currently have available. The idea is to provide a user interface where different users, whether researchers, repositories, institutional representatives, or funders, can come and access the information on data citations available through the corpus, with the ability to filter data citation information by different parameters such as affiliation, funder, and others. What we want is to continue developing the features so that the corpus responds to the needs of different stakeholders. So we know that there is a need for infrastructure that makes all of this simpler for everybody to capture and expose, but we also know there needs to be a community focus on prioritizing data metrics now. There has been a lot of interest in the community, as we saw, for example, through the RDA working group, but we wanted an opportunity to convene stakeholders from institutions, funders, even government agencies, and have focused conversations on data metrics and data evaluation. To do this, we convened a meeting last month, the Make Data Count Summit, where we had a fantastic set of presentations from stakeholders coming from different perspectives, as well as very pointed conversations about the next steps to prioritize this topic. One of the takeaways for me was the very strong support across stakeholders for putting work into this in the short term, as well as the emphasis on the work that has already been done on standards and infrastructure, signaling that we need to support what's already there and continue to iterate and scale the data usage information we make available to the community. There is a summary of the summit on the Make Data Count blog, which I invite you to visit if you want to learn more. Right, so just before I wrap up, I want to make a call to everyone here: the time is now, and we can all play a part in moving ahead with data metrics and data evaluation. A reminder for researchers: please cite the data sets you use, and that includes your own; don't forget. Repositories can collect data usage and citations and submit them to DataCite for aggregation, and the good news is that if you're already doing this, those citations will make it into the corpus. So we thank you for that.
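On the point about repositories submitting usage to DataCite for aggregation: below is a hedged sketch of what such a submission might look like, assuming DataCite's usage reports endpoint accepts COUNTER-formatted JSON with bearer-token authentication. The endpoint URL, token handling, payload wrapping, and report fields are assumptions to check against DataCite's documentation; the repository ID, DOI, and counts are hypothetical.

```python
# Hedged sketch: a repository submitting a COUNTER-style usage report to
# DataCite for aggregation. Endpoint, auth scheme, and field names are
# assumptions to verify; all identifiers and counts are hypothetical.
import requests

DATACITE_REPORTS_URL = "https://api.datacite.org/reports"  # assumed endpoint
API_TOKEN = "REPLACE_WITH_REPOSITORY_TOKEN"                # hypothetical credential

report = {
    "report-header": {
        "report-name": "dataset report",
        "report-id": "dsr",
        "release": "rd1",                    # COUNTER Research Data release
        "created-by": "example-repository",  # hypothetical repository ID
        "reporting-period": {"begin-date": "2023-09-01", "end-date": "2023-09-30"},
    },
    "report-datasets": [{
        "dataset-id": [{"type": "doi", "value": "10.1234/example"}],  # hypothetical DOI
        "performance": [{
            "period": {"begin-date": "2023-09-01", "end-date": "2023-09-30"},
            "instance": [
                {"metric-type": "total-dataset-requests", "count": 42},
                {"metric-type": "unique-dataset-requests", "count": 30},
            ],
        }],
    }],
}

resp = requests.post(
    DATACITE_REPORTS_URL,
    json={"report": report},  # wrapping key is an assumption
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Report accepted:", resp.status_code)
```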
And a call for anybody, coming from any perspective: we very much want your input as we work on the corpus. We want to make it useful to different users, so if you're interested in learning more, you have my contact details there; feel free to get in touch. I would also be very happy to hear about any success stories you have about data usage. We want to amplify those stories so that we signal the value of open data. So again, feel free to contact me if you have any stories to share. Right, and that was all from me. I'm going to hand over to Mark, who is going to be talking about the metrics group at GREI. Mark?

Thank you so much, Iratxe. I will indeed. Oh, I should just point out as well: I see there are questions coming in, and Amanda did remind us in the chat box, but if you have any questions for myself, Iratxe, or Jim, please put them in the Q&A section and we will get to them at the end. We should have some time; we're not going to take the full hour, I'm sure, unless there are a lot of questions. So, for those unfamiliar, the Generalist Repository Ecosystem Initiative is an NIH-funded effort to get the generalist repositories working together and to be consistent in the way they think about things. And for those who don't know what a generalist repository is: there is a line that all funders will tell you, which is, we want you to publish your data, and when you do, if there's a subject-specific repository, make it available in the appropriate subject-specific repository, where you get a lot more help. But we also want there to be a home for every data set that you publish, and so there are these generalist repositories available, one of which is figshare, which I started. As an example, when I was a researcher, I had some genomic sequences and I sent them to GenBank, because a subject-specific repository existed for those. But then I couldn't find a place to make all of the other files available, like videos of stem cells moving from one side of the screen to the other. And data in this sense is just the files you generate that are needed to back up the paper when you publish: it can be anything from a spreadsheet data set, to 10,000 videos, to spinny molecules, anything like that. So when I started figshare, we had the line "get credit for all of your research." The Generalist Repository Ecosystem Initiative brings together these repositories, and we have these working groups. The Open Metrics subcommittee was very interesting to me because of this whole "get credit for all of your research" idea: how will we get credit for all of our research? There's a lot of heterogeneity in data, so we're looking at views, we're looking at downloads, we're looking at citation counts, as Iratxe just mentioned. Because: are people going to download videos? Are people going to cite data sets in the way that we want them to? A lot of people cite in the methods section; we want them to cite in the references section. So this is all stuff we wanted to think about. And we were very lucky in this working group to be able to piggyback off Make Data Count. There are a lot of different areas to think through, but this was one of the groups with an easier start because of all the great work that had been done by Make Data Count in thinking through consistency. And when you think about it from the side of the NIH, who funded the work: they want to get to an end state where we can give credit to researchers for making their data openly available, making research ultimately more transparent and reproducible, and giving the funders more bang for their buck, as well as this newer idea of data sets feeding machines for AI, as we've seen with several projects, DeepMind and the rest.
So they want to know that this impact over here is measured the same as that impact over there, that a data set published in Zenodo is measured in the same way as a data set published in figshare, and go from there. It's a four-year program, and we're in year two. The working groups can spin up and wind down; we already have a follow-on working group from this one, which I'll get to. The goals for year two, which you can probably read faster than I can say them, but I'll go through some of them. First, a roadmap for technical priorities: a multi-year roadmap in line with the priorities defined, such as implementation of the usage tracker that we just heard Iratxe talking about. This is about making sure that a view and a download are consistent no matter where you put your data: you can put it in any of the repositories that suit you, and the views and downloads will be measured consistently. Second, relational metadata submitted to DataCite. DataCite is an aggregator of all the metadata associated with data sets and their DOIs, and we want to make sure we are sending the appropriate metadata to DataCite when we mint DOIs, so that the information can be filtered. An example: the NIH might say, I just want to know about downloads and citations across all the data sets the NIH funds, across all the different repositories, because there might be some in Mendeley Data and some in Zenodo, but those repositories also hold data sets from other funders. How do they find what they would like to see? Third, priority metadata. How do we, as a group of repositories, agree to push the same pieces of metadata to these aggregators so you can filter? A great example is ROR, the Research Organization Registry: with ROR IDs, you as an institution can start tracking all of your data sets across all of these different repositories. And then we start thinking: this is all well and good going forward, but what about all the data that already exists? If you published something five years ago, surely you want the metrics for that to be rewarded in the same way. So for priority metadata, the goal is to define the metadata properties needed to address the metrics use cases and coordinate with the Use Cases subcommittee, which is another of the GREI subcommittees. In addition, the subcommittee will explore whether aggregators can donate seed-file metadata regarding subject classification: again, looking at the different types of metadata we all want to be consistent about, and subject classification is obviously one of those.
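To make the relational and priority metadata goals concrete, here is a hedged sketch of the kind of payload a repository might send when registering a dataset DOI through the DataCite REST API, using the schema's fundingReferences, relatedIdentifiers, and ROR-based affiliation fields. The attribute names follow the public DataCite metadata schema, but the DOI prefix, article DOI, ROR ID, and credentials are hypothetical.

```python
# Hedged sketch of registering a dataset DOI with filterable metadata via the
# DataCite REST API. Field names follow the DataCite metadata schema; all
# identifiers below (DOI prefix, article DOI, ROR ID) are hypothetical.
import requests

payload = {
    "data": {
        "type": "dois",
        "attributes": {
            "doi": "10.1234/example-dataset",  # hypothetical DOI
            "titles": [{"title": "Example stem cell imaging dataset"}],
            "publisher": "Example Generalist Repository",
            "publicationYear": 2023,
            "types": {"resourceTypeGeneral": "Dataset"},
            "creators": [{
                "name": "Doe, Jane",
                "affiliation": [{
                    "name": "Example University",
                    "affiliationIdentifier": "https://ror.org/00x0x0x00",  # hypothetical ROR ID
                    "affiliationIdentifierScheme": "ROR",
                }],
            }],
            # Funder metadata lets aggregators filter, e.g., NIH-funded data sets.
            "fundingReferences": [{
                "funderName": "National Institutes of Health",
                "funderIdentifier": "https://doi.org/10.13039/100000002",
                "funderIdentifierType": "Crossref Funder ID",
            }],
            # Relational metadata connects the data set to the article it supports.
            "relatedIdentifiers": [{
                "relatedIdentifier": "10.5678/example-article",  # hypothetical article DOI
                "relatedIdentifierType": "DOI",
                "relationType": "IsSupplementTo",
            }],
        },
    }
}

resp = requests.post(
    "https://api.datacite.org/dois",
    json=payload,
    auth=("REPOSITORY.ACCOUNT", "password"),  # hypothetical repository credentials
    headers={"Content-Type": "application/vnd.api+json"},
    timeout=30,
)
resp.raise_for_status()
```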
Then the definitions: this gets back to apples-to-apples comparisons. Establish clear definitions for metrics, leveraging existing standards (thank you, Make Data Count) and community initiatives; the intention is to standardize the definition of metrics across the generalist repositories. This is an ongoing work in progress, and I'll come back to the citations concept at the end as well. And finally, baseline metrics: develop a roadmap for baseline metrics, defining what metrics we currently have available, their limitations, and the timelines to begin using those data metrics in research evaluation. So, the idea is: what can we give the NIH today about the metrics of the data sets they funded in each of our repositories, and at what point can we say that the apples-to-apples comparison has begun? We are already giving reports out. Here are each of the different generalist repositories that I mentioned: Dataverse, Dryad, figshare, Mendeley Data, OSF, Vivli, and Zenodo. And you can see that our year-two goals, now that we're over halfway through the year, are 46% complete. We also want to be aware that there's been this shift toward everybody pushing in the same direction. If you've heard of FAIR data — findable, accessible, interoperable, and reusable data — we want to push everybody in the same direction. But there are also dependencies, right? These are seven different organizations, each with their own roadmaps and their own capacity. We all agree that we're going to do something to get consistency, like the usage tracker, but there are several different ways to implement the usage tracker, and DataCite has released a much easier, much lower-burden implementation of it, so some folks were waiting for that before getting going. There will always be dependencies outside the scope of the group, but we've agreed that we're all going to implement the DataCite usage tracker, so we will have consistent views and downloads recorded across all of the different repositories. At this point, I'll hand over to Jim, who is going to talk about what ideal metrics could look like. And I put him on the spot here, because the different repositories are obviously at different stages, and what Jim can show us is where Dataverse has gotten to and some of the concepts there.

Okay, thanks, Mark. Hopefully the screen is showing. Yes. So I'm coming from the Dataverse community, and not to go into Dataverse too much, but Dataverse is software that's used by institutions around the globe; in the US, the Harvard Dataverse and the Harvard team, where the open source software originated, are part of the GREI initiative. I'm going to show metrics today from the Qualitative Data Repository (QDR), out of Syracuse, which uses the Dataverse software; it's primarily social science and is also involved with various NIH projects. Essentially, what I want to give you a sense of is the sort of graphics, and the underlying API, that the Dataverse software has for showing what's going on in your repository. The main thing here is that we're picking up the metrics from a live Dataverse instance, using HTML and JavaScript to create live graphics, and I'll walk through a few of these. The first thing I want to show is that this is for the entire QDR repository; you can also drill down into individual subcollections and get the metrics for just those subcollections within the overall repository. And since we're talking Make Data Count, I want to start there. We at QDR were the ones who helped develop some of the Make Data Count support in Dataverse. QDR started in about 2018, and by about October 2019 we turned on the Make Data Count metrics, so we have about four years' worth of metrics now.
We're showing on the page here the total views and total downloads, and then the unique views and unique downloads, where — it's defined in Make Data Count — you're basically trying to figure out whether the same person is repeatedly downloading, so that you're not counting that multiple times in the unique downloads. I won't go into the technical details of what "unique" means there, but you'll see the counts are lower because you're getting rid of some of that duplication. The other thing we can look at, on the holdings side, is how many data sets and how many files there are over time, what subjects are going into the repository, how many files we have of different types, and the sizes per type. It's fun to see that most of our material at QDR is PDF, but the video files, though far fewer, take up almost the same amount of storage as the PDFs, even though there are many more PDFs. We can look at subjects and other properties both by data set and at the collection level. The last thing I want to point out, back at the top, is that QDR and Dataverse have their own internal counting mechanism for downloads. It was interesting to note that despite our efforts, via robots.txt and other things, to tell robots they shouldn't be fetching things, we see about 120,000 downloads in the internal count, whereas Make Data Count shows only about 95,000. So roughly 20-30% of the downloads handled by the software are still coming from robots, and Make Data Count filters those out, which is one of the ways everyone is trying to get consistent metrics across repositories. Another thing we put into the API and the graphics: Dataverse has data sets that range from two or three files to those with thousands of files, and if we just counted downloads of files, the data sets with many more files would look much more popular. So we did another version of "unique": if somebody has downloaded any files from a data set, that counts as one hit to the data set, to track when people are actually using the data — it's not the number of files, it's the number of data sets they're using. So we can track over time which data sets are popular; you can see, as I hover over this, it's giving the DOIs, and just for fun we've made it so you can click through: it resolves the DOI, and there's the data set back in Dataverse. With that, I want to talk about the metrics API underneath. It's all documented in the Dataverse Guides, and I won't read through it, but the point I want to make is that the API gives you anything from total metrics aggregated over all time, to metrics aggregated to a given month or just the past few days, to a monthly time series; you can look at the tree of collections and sub-collections you have over time if you want to, and you can also subset by collection. We're giving out JSON, which is what the graphics software parses; there's also a little CSV button (I didn't show you) that lets you click and get a CSV version of the same metrics. And then you can filter by various things: filter by collection, and since some repositories using Dataverse have both their own collections and material they harvest from other locations, you can get metrics for either your own repository or everything you're harvesting. And Make Data Count does more things than we were showing on the graph: you can subset by country, you can subset by whether access was by machine or not, and the API has all of those things. In our documentation, at the collection, data set, and file level, all of the variants you just saw are collected in a big table of all the endpoints, so you can see there are quite a few.
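As a hedged illustration of the API Jim describes, the snippet below pulls a few metrics from a Dataverse installation's /api/info/metrics endpoints. The endpoint paths follow the Dataverse Guides as best understood here, and the parentAlias parameter for subsetting by collection, the base URL, and the collection alias should all be treated as assumptions to verify against your installation's documentation.

```python
# Hedged sketch: pulling repository-level metrics from a Dataverse
# installation's metrics API. Verify endpoint paths and the parentAlias
# parameter against the Dataverse Guides; the base URL and collection alias
# are hypothetical.
import requests

BASE = "https://dataverse.example.edu"  # hypothetical installation

def metric(path, collection=None):
    """GET a metrics endpoint, optionally subset to one collection."""
    params = {"parentAlias": collection} if collection else {}
    resp = requests.get(f"{BASE}/api/info/metrics/{path}", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]

# Totals aggregated over all time.
print("datasets:", metric("datasets"))
print("downloads:", metric("downloads"))
print("unique downloads:", metric("uniquedownloads"))

# Aggregated up to a given month, and over a recent window.
print("datasets to 2023-09:", metric("datasets/toMonth/2023-09"))
print("downloads past 30 days:", metric("downloads/pastDays/30"))

# The same numbers for a single subcollection (alias is hypothetical).
print("subcollection downloads:", metric("downloads", collection="qdr-subcollection"))
```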
The software itself: the API is baked into Dataverse and is always on. The software for the metrics graphics just uses a library called D3, plus D3Plus, which does all the nice graphics; we're essentially not doing much more than piping out the JSON we've got and configuring which columns and which numbers show up where, and D3Plus does the hard work of drawing things. That's open source and out on GitHub as well. The last thing I want to mention is that the graphics we use came from an earlier effort that aggregates across all the Dataverse installations in the world, or at least all those willing to sign up for it. So we also have metrics that go across all of the Dataverses out there, in all the countries around the world; we can tell which versions all the Dataverses are running, and things like that. I think with that, that's it for me, so I'll stop sharing at this point. Thanks.

Thanks so much, Jim. I think everyone will agree with me that this is one of the reasons we are aiming, as a group, to get to some consistency. I'm just going to round us off with a couple more slides about what the group has done, and then we will open up for questions, so if you have any, put them in. I think the metrics that Jim and the team have been working on demonstrate not only the value of a consistent standard — we saw the difference between the stats counted internally and the stats counted against the standard, and obviously every repository will have different ways of measuring things, so a standardized setup solves that problem — but also the ways you can slice and dice the information, so you can start recognizing the impact and the different ways in which people are using data. Even further than that, all of the generalist repositories were on a call yesterday, on an NSF-funded grant, about the actual costs of data publication, because funders are starting to say they'll cover the costs of data publication. But what does that mean, and how do you quantify it, when one person has lots of data and another has a little? Some of the metrics we saw from Jim can start feeding that research into different types of use cases across all of the different repositories. As a working group at the repository level for GREI, as I mentioned, this was one of the easier groups because of all the great work done by Make Data Count. We all agreed on the consistency that we wanted, and we all agreed on promoting standardization, usage tracking, and better metadata practices; we've been actively talking about this for the last two years. On the repository metrics roadmap: we are all committed to implementing key metrics, such as COUNTER-standard views and downloads, usage tracking, and collecting related articles. Progress varies for different reasons: different organizations have different priorities and are focusing on different parts, but we've all agreed to get to the same level by the end of the four years of the funding.
And we have some good examples of being able to track NIH programmatic research: all DataCite DOIs with NIH in the funder identifier field, either as a Crossref Funder ID or a ROR ID, can now be found across DataCite, and hopefully you'll find more of them going forward. You can also filter this by repository. I won't click through the links there, but I'll share the slides so everybody can, and datacite.org is the place where you can start filtering some of this. So we're very lucky, as I mentioned, that we have such good infrastructure already: we have DataCite to aggregate the information, and we have Make Data Count doing the research and building the tools to help us do it. In terms of the subcommittee's impact: collaboration among repositories within the GREI groups is crucial for success in implementing data metrics; if we don't agree to do this, it will never work. Repositories are transitioning from adopting best practices to a more nuanced approach, allowing for tailored solutions to specific use cases and downstream requirements, like all the different ways of slicing and dicing the data that I mentioned. These are the objectives of the GREI initiative around this section, and you can see that connecting digital objects, implementing best practices for data repositories, and supporting discovery of NIH-funded data tick a lot of the boxes within the objectives, which is why we are continuing to develop this. Repositories are actively tracking NIH programmatic research through the DataCite DOIs with NIH in the funder identifier field, providing transparency and accountability. So the funders will, at some point, be able to say: you said you were going to do this in your data management plan; did you? Where is that data? Because we searched for it and couldn't find it. There is going to be accountability at that level as well. And just to round us off, looking forward: the repositories are all working toward this timeline, so within the four years of GREI, all of the repositories will be ticking the boxes when it comes to views and downloads metrics. The report underscores the collective commitment of the repositories to advance data metrics, emphasizing comprehensive and contextualized data evaluation practices. I think that's important because of the heterogeneity of data I mentioned before: in the previous setup of traditional peer-reviewed articles, you had one unit of publication, the peer-reviewed article, and one unit of impact, citation counts. Now we have a plethora of research outputs, and we want to make sure those outputs can be quantified and their impact measured; different communities will have different levels of impact, and we want to make sure the impact is captured both qualitatively and quantitatively. So, the next steps: we're going to continue the collaboration and make sure the repositories reach the goals. Just as the funders might check on researchers to make sure they've made their data available, we will be making sure the repositories are implementing the usage tracker and so on. And the last part here is what comes next. As I mentioned at the beginning, working groups come along and wind down, or continue without continued meetings, and we just had our in-person event a couple of weeks ago, where we all got together and discussed the priorities going forward.
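On the point above about finding DataCite DOIs with NIH in the funder identifier field, here is a hedged sketch of such a lookup against the public DataCite REST API. The query syntax follows DataCite's documented query parameter, and 10.13039/100000002 is the Crossref Funder ID for the NIH, but treat the exact field path and the client-id filter as assumptions to verify; the client ID shown is hypothetical.

```python
# Hedged sketch: finding NIH-funded data sets registered with DataCite via the
# public REST API. The query field path and client-id filter are assumptions;
# the client ID is hypothetical.
import requests

NIH_FUNDER_ID = "https://doi.org/10.13039/100000002"  # Crossref Funder ID for NIH

params = {
    "query": f'fundingReferences.funderIdentifier:"{NIH_FUNDER_ID}"',
    "resource-type-id": "dataset",
    "page[size]": 10,
}
# Optionally restrict to a single repository, e.g. a hypothetical client ID:
# params["client-id"] = "example.repository"

resp = requests.get("https://api.datacite.org/dois", params=params, timeout=30)
resp.raise_for_status()
body = resp.json()

print("total NIH-funded datasets found:", body["meta"]["total"])
for doi in body["data"]:
    attrs = doi["attributes"]
    title = attrs["titles"][0]["title"] if attrs.get("titles") else "(untitled)"
    print(attrs["doi"], "-", title)
```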
The next working group, which will be chaired by myself and Iratxe, is on data citations. So again, more complexity: how do you define a data citation, how do the different repositories count them, and how can we build up the Data Citation Corpus — which, again, is a fantastic leap forward from Make Data Count — so we can have these metrics and attribute the credit to researchers? At the same time, we need to wait for that to be robust so we can all start implementing it in the same way. The reason I mention this: figshare does the State of Open Data report every year, and we have the next one coming out in November, about four weeks from now. We survey researchers and ask them what they're going to do and how they relate to data, and this is a sneak peek of some of the results — it's not published yet. "Which of the following is most likely to encourage you to share your data?" This is between five and ten thousand respondents, and the number one answer every year, by a long way, is full data citation. So if we can assure people that we can give them credit for all the research they publish, including their open data sets, then we can hopefully foster the normalization of data publishing: more people will publish their data, it will be described in a better way, it will get more usage and more citations, and we'll have academia moving further, faster, because of the transparency, reproducibility, and everything else. So I'm going to finish there and open it up for questions; I'll stop sharing my screen. We have a couple of questions in the Q&A. I'll start with the bottom one, because I just saw it come in: how many respondents does the survey have? It's a partnership between figshare, Digital Science, and Springer Nature, and Springer Nature has a very broad reach, so it's spread across respondents from some 200 countries. This year we're starting to look at the nuance between what the different countries are doing: is everybody publishing their data, and is everybody giving the same reasons for publishing their data? Each year we get between five and ten thousand responses; I think this one had about seven thousand respondents for 2023. And I think the other two questions are for you, Iratxe, so I will ask them and hopefully you can help; we're also joined by Matt Buys of DataCite. "How are data citations not associated with publications handled by MDC?" I don't know if anybody wants to have a go at that — Matt, maybe?

Maybe I can start. From the perspective of the metadata submitted to DataCite, it is possible to create citations between data sets and any other resources. The important thing is to create the correct metadata to record that citation, but it's not restricted to citations from journal articles to data sets: it can be from a data set to a data set, or from other resource types to a data set, et cetera. Something I would nuance regarding the corpus is that we started with citations from publications to data, mostly because we know that's the most common use case; as I said, that's where data citations tend to happen, and we want to be iterative here, so I wanted to start with the place where we know this is happening and where we can best understand how to handle it in the corpus. I don't know, Matt, if you want to add to this?

Yeah, I would just mention broadly that citations can be tracked across any research output or resource, broadly, across DataCite.
I'll put a link in the chat if folks are interested, with DMP IDs as an example, but it can be done across the board. So yeah, I think you covered it, Iratxe.

Okay, great. And maybe one for Iratxe and Jim: "How about usage of data accessed through APIs or programmatically? How is that tracked, or how are the metrics measured?" I know we see this at figshare as well: you sometimes have data sets with more downloads than views, and it confuses people — how is that possible, something must have gone wrong — but it's programmatic access to the data sets themselves. I don't know if anybody has an answer, from Make Data Count or Dataverse?

I can maybe jump in and say: the usage tracker that Iratxe and Mark mentioned tracks web-based usage in the repository, so any machine or API usage is not captured by that JavaScript tracker. But it can be tracked, and is tracked, by many repositories doing log processing, and Jim can talk a bit about that. That log processing is developed in accordance with the COUNTER Code of Practice that we at Make Data Count helped develop, working with COUNTER, and Iratxe is leading a lot of the efforts going forward, working with COUNTER to evolve that code of practice. And we'll also drop in a link: if you're tracking usage, you can submit it to DataCite associated with a specific identifier.

Yeah, at the technical level, the way the log processing deals with this is that it looks at the agent making the call. When an HTTP call is made, you can tell who called, and there are three categories in Make Data Count, in the counter-processor software. There's human use, which is basically if you're coming in from Chrome, Edge, Safari, something like that: if it's a browser, it's counted as human. If it's Python, curl, those sorts of things you would script, it's counted as a machine count. And if it's coming from SemrushBot or Google or something, it's actually one of the robot counts that are removed from the overall counting of your statistics; they maintain a list of all the robots they know of out there, and those are the ones that get subtracted out. But if you do the log processing, you then have it available: you can ask how many human counts do I have, and how many machine counts do I have, in all the views and the downloads and so on.
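A hedged sketch of the agent classification Jim describes: the real counter-processor tool uses maintained, much longer robot and machine lists, so the few patterns below are illustrative stand-ins only.

```python
# Hedged sketch of classifying HTTP user agents into the three categories Jim
# describes (human / machine / robot). The real counter-processor tool uses
# maintained lists; these patterns are illustrative stand-ins.
import re

ROBOT_PATTERNS = [r"SemrushBot", r"Googlebot", r"bingbot"]       # subtracted out
MACHINE_PATTERNS = [r"python-requests", r"^curl/", r"^Wget/"]    # scripted access
HUMAN_PATTERNS = [r"Chrome/", r"Firefox/", r"Safari/", r"Edg/"]  # browsers

def classify_agent(user_agent: str) -> str:
    """Return 'robot', 'machine', or 'human' for a raw User-Agent string."""
    for pattern in ROBOT_PATTERNS:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return "robot"
    for pattern in MACHINE_PATTERNS:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return "machine"
    for pattern in HUMAN_PATTERNS:
        if re.search(pattern, user_agent):
            return "human"
    return "machine"  # conservative default for unrecognized scripted agents

print(classify_agent("Mozilla/5.0 ... Chrome/118.0 Safari/537.36"))  # human
print(classify_agent("python-requests/2.31.0"))                      # machine
print(classify_agent("Mozilla/5.0 (compatible; SemrushBot/7~bl)"))   # robot
```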
Fantastic. We still have another couple of questions, so keep them coming. "When will GREI repositories have standardized metadata collection during data submission? As of now they vary widely, with some collecting the bare minimum." I can have a go at this, and if anybody else has thoughts, please add them. One of the issues generalist repositories will always face is that they are, by definition, general: you need metadata fields that work for the humanities as well as the life sciences, although some generalist repositories are more thematic and focused on a certain area. I think there are ways we can allow for enhanced capture of metadata, agreeing on certain schemas for certain types of data, and if we move toward that, then we can start. I always say the last 10 years have been about encouraging people to make their files available on the internet, and the next 10 years are about making those files useful. The best way to do that is to improve the metadata we're capturing, and the best person to do that is the author; but obviously authors are busy and take the path of least resistance — how much metadata can you really ask them for? And to expand on where we're going with that: there's a lot of automated metadata capture now, and just to touch on the metrics side of things, as we move into this space around data citations, a lot of people will cite the data set they used for a publication, and they might cite it again in another publication because they used it for that paper too. This idea of self-citation is not necessarily a bad thing, and the metadata you can capture around those citations could open up a whole new layer. So metadata is a working group, it's being thought about, and we're working toward it, but I couldn't give you a date. I don't know if anybody has anything to add there; otherwise we can push on to the next question. "If the NIH starts collecting monthly metrics of public data sharing, will they make the report available? If yes, where?" Well, I know that the NIH collects a report from figshare, and from each of the other repositories, every month, so they are collecting that data on some level. And the transparency of making these metrics consistent across all the repositories means that DataCite, of which Matt Buys is the executive director, will have all of this data openly available — for the NIH, but also for other funders and other organizations. So hopefully we get to the point where there's full transparency and you can just go in and look at the data without needing to request access. Right, Matt?

Yeah, absolutely. It's all CC0 data, and I think we've seen some interesting trends, like CWTS and folks like that moving to open sources. When we're talking about metrics, and we want to build trust and transparency in the work being done, it's really important that we have open data behind it. Folks like CWTS are really making some of those shifts to using open sources, and I think that's going to happen more broadly across the community.

Yeah, eating your own dog food, right? And making sure that the different algorithms we come up with to measure different levels of impact, in different communities, for data, are transparent and agreed upon — or, if not agreed upon, at least you can see exactly how they've been built. I think that's the future we're heading toward. And I think that is the end of the questions we have. Thank you so much for coming along, everybody. Does anybody have any last comments before we wrap up?
Sorry, I was just going to mention briefly that there was a question about resources for best practices, and I think Matt answered that; we thought it was easier to provide the links for people to access directly, and they're there in the answered Q&A. So do have a look, and if anybody knows of other resources, we're very happy to have those shared too. Thank you.

Okay, fantastic. Well, thank you very much for coming along. Thank you once more to Iratxe, to Matt, and to Jim. If you have any questions, reach out to any of us, and have a lovely day.