Good morning. Hi everybody. I'm Jamie Wittenberg. I'm the Assistant Dean for Research and Innovation Strategies at the University of Colorado Boulder Libraries. I'm joined here by my illustrious colleagues: John Chodacki, the director of UC3 at the California Digital Library, and Kristi Holmes, who directs the medical library at Northwestern University and is an associate dean in the medical school there. I don't think I need to make a case to this group that data citation matters, but you're here, and I just love a captive audience, so I can't resist. There are a lot of library folks here, and I think I could reach out to any library represented here and expect, with some degree of confidence, that if I requested a list of every article a researcher at CU Boulder published in the last ten years, one of your librarians, any of your public services staff, could probably help me find it. But if I made that same request about the underlying data, we would be in trouble. We don't have the uptake, and there isn't really a way to access metadata about all the data out there that our researchers are producing. If we're not citing it, we're not capturing it, and we're not capturing the totality of metadata related to use and reuse. This matters because data citation provides evidence that supports the reproducibility of research. It enables other researchers to locate and access research data more easily. It enables reuse and innovation, and it gives credit to our authors. But it also allows libraries to develop a record of the research published at our institutions, and it establishes provenance for the work we're doing to advance that research. Here's some evidence from 2021 about the disparity in the landscape between published data sets and the articles that cite them. This is a slide with numbers that DataCite pulled in 2021. Yes, the landscape has changed. No, it's not comprehensive and definitive.
I'm using these numbers to convince you that there is a disparity between the data sets that are out there and the articles that they support. It doesn't represent everything; it represents what we can find, right? That top number shows the published data sets that are declared by repositories to DataCite. What that means is that an author deposits a data set with your repository, you issue a DOI, and all of that information goes back to our central infrastructure, DataCite. That third row shows the data citations declared by publishers to Crossref through references. That 7,629 is the 2021 number of data sets actually referenced in articles, the citations that articles are sending back. So even though this isn't comprehensive, we're looking at five and a half million versus 7,000 and some change. A really big disparity. What this tells me is that there are millions of published data sets that are citable, but the number of these that are declared is very low. When we talk about trying to change this, a lot of the time what we're talking about is leveraging various incentives and regulations to influence researcher behavior, right? Colloquially, in this space, we call these carrots and sticks. I have been calling these carrots and sticks for ten years, so this isn't a criticism of calling them carrots and sticks, if that's what you're doing; many folks in the community are, as you can see on my slide. And this isn't even a criticism of a carrot-and-stick approach, an incentive-and-regulation approach, to advancing data citation. Even though our title does call for a feast upon fire of carrots and sticks, that was just a trick to get you to show up today. We actually have moved the needle on this over the last ten years, and the incentives and the regulations do matter, especially for data sharing.
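For anyone who wants to poke at these counts themselves, both declaration pathways are exposed through public REST APIs. The sketch below only builds query URLs; the endpoints are the real public DataCite and Crossref APIs, but the specific filter parameters are illustrative assumptions, not the exact queries behind the 2021 numbers on the slide.

```python
# Sketch: building queries against the public DataCite and Crossref REST APIs
# to compare registered datasets with declared data citations. The endpoints
# are the real public APIs; the filter parameters are illustrative assumptions.
from urllib.parse import urlencode

def datacite_dataset_query(year: int) -> str:
    """URL asking DataCite for DOIs registered as datasets in a given year."""
    params = {"resource-type-id": "dataset", "registered": year}
    return "https://api.datacite.org/dois?" + urlencode(params)

def crossref_citation_query(dataset_doi: str) -> str:
    """URL asking Crossref for works whose references mention a dataset DOI."""
    params = {"query.bibliographic": dataset_doi, "rows": 20}
    return "https://api.crossref.org/works?" + urlencode(params)

print(datacite_dataset_query(2021))
print(crossref_citation_query("10.5061/dryad.example"))
```

Fetching and paging through the responses is left out here; the point is only that the two declaration streams live in two different registries, which is part of why the totals are so hard to reconcile.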
So when we're talking about the incentives, a lot of the time we're talking about research integrity, right? You've all heard this before: citing data facilitates validation of the underlying evidence and makes research more rigorous. There's a citation advantage. It promotes collaboration. There are, increasingly, calls for tenure and promotion committees to use secondary data as evidence of researcher impact. And researchers can access benefits from their university, like the institutional repositories that you have, or free storage, as I like to call it. The sticks are usually compliance related. Funders require data management plans. The convention in academia is that we cite each other's work, so that's another research integrity case when it comes to the stick. Last year's OSTP Nelson memo mandates data sharing for federally funded research. And there are compliance requirements at institutions when it comes to data retention and disposition. But let's return to the disparity: the space between the data that are out there and the articles that should be citing those data. Carrots and sticks have done a pretty good job of moving the conversation and an adequate job of influencing the behavior. But it isn't enough. It's not comprehensive, and I don't think it will be in another ten years at this pace. Especially when it comes to data citation, rather than data sharing, the carrots and sticks don't really work. So what if we, as libraries, funders, infrastructure providers, and probably a couple of other people, could just take a step back from this binary and approach citation differently? Take the burden off of researcher behavior or editorial oversight, and use machine learning technology to harvest references to data sets, even if they're not formal data citations that you would find in the references section of an article.
What if we could harvest those and turn them into citations, building a comprehensive corpus that represents all of the relationships between articles and the underlying data sets associated with them? Well, we did do this, and I am going to turn it over to John Chodacki to talk about how. Thanks. So, how many people here have heard of Make Data Count? So we did a really great job of marketing at CNI and other events. Make Data Count is a community-led group. We are just a bunch of interested people in the world of data metrics who have come together for the past eight years to look at these issues, and at the tensions in our communities that have created either closed metrics or the lack of information and lack of sources that Jamie was referring to. The goal of Make Data Count is not just to build solutions and find approaches to some of these issues, but also to advocate, and really to move toward a space where we have truly open, auditable, responsible sources of information for data metrics that can then contextualize and incentivize research. But this is a journey, right? One of the things that we talk about a lot within the Make Data Count community is that we are very nervous that the community will go too fast, because, as we just heard, the source information up until now really hasn't been there for contextualized data metrics to be available to the community. And so we want to make sure that we're very deliberate, that we understand where the state of the art is at each stage, and that we really bring the community along and build contextualized, responsible metrics together. And we want to do this in the open. We want to avoid the locked-down, impact-factor or h-index kind of approaches, where the math behind them isn't really understood, or the traceability isn't really understood.
And so we've done this by advocating three big things, and I'm going to jump into the citation corpus, but real quick: the approaches we've taken within Make Data Count are coming up with common, normalized approaches to tracking usage information, so views and downloads, as well as looking at the current practice of data citation, as Jamie was mentioning, and figuring out what the better and best approaches are for both publishers and repositories, and how they affect researchers. But we know that data citation is not only a challenge for us as a community, as was laid out; it's also something that researchers value. In surveys that we've done over the tenure of Make Data Count, over and over, researchers not only say that they value it, they say it is the top way they think of showing the value of their research and their peers' research. But we have a challenge: where are the data citations? We have people and culture challenges. These can be researchers themselves not being incentivized to not only share their data, but cite data. We also have challenges with technologies. Many of us are sophisticated enough in this space to know that we can always blame the publishers, right? The publishers don't track it. If they track it, they don't send it. If they send it, they don't send it right. And then we can't find it, and we end up with 7,000 out of 5 million. These are the stories in these conversations. They end up being conversations where we have scapegoats, and those are in many ways technologies, right? But the problems that we are talking about when we bring up these challenges are well understood. We know where the roadblocks are. We know where the red lights are. We understand that if we just did things better, and perfectly, the information would flow. But that doesn't change the fact that we have also built an ecosystem that is very complicated.
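To make the "common, normalized approaches to tracking usage" concrete, here is a toy sketch of the kind of normalization the COUNTER-style counting rules describe: collapsing rapid repeat requests from the same session into one counted event. The 30-second double-click window is a standard COUNTER convention; the event structure here is invented for illustration and is not Make Data Count's actual processing pipeline.

```python
# Toy sketch of COUNTER-style usage normalization: repeated requests from the
# same session within a short "double-click" window count as one event.
# The event tuples and this exact logic are illustrative assumptions.
from collections import defaultdict

DOUBLE_CLICK_WINDOW = 30  # seconds, per the usual COUNTER convention

def count_unique_requests(events):
    """events: iterable of (session_id, dataset_id, unix_timestamp)."""
    last_seen = {}
    counts = defaultdict(int)
    for session_id, dataset_id, ts in sorted(events, key=lambda e: e[2]):
        key = (session_id, dataset_id)
        # Count only if this session hasn't hit this dataset recently.
        if key not in last_seen or ts - last_seen[key] > DOUBLE_CLICK_WINDOW:
            counts[dataset_id] += 1
        last_seen[key] = ts
    return dict(counts)

events = [
    ("s1", "10.5061/dryad.x", 100),
    ("s1", "10.5061/dryad.x", 110),  # double-click: not counted again
    ("s1", "10.5061/dryad.x", 200),  # outside the window: counted
    ("s2", "10.5061/dryad.x", 105),  # different session: counted
]
print(count_unique_requests(events))  # {'10.5061/dryad.x': 3}
```

The point of normalization like this is that two repositories reporting "views" mean the same thing by it, which is a precondition for any responsible comparison.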
Even in the open space where we work and think, in the Make Data Count movement, there are just tons of different projects out there working on similar things. And so one of the issues we have is not only trying to come up with a better approach and figure out how to manage the information, but also to do it together. So the project that Jamie was alluding to is really trying to address these challenges: the challenge of culture change, the challenge of technologies, and the challenge of the ecosystem that we've built. It's called the Global Data Citation Corpus. It is a Wellcome Trust-funded project coupled with Make Data Count. So Make Data Count, as a long-lived change movement, is here for the long haul, and the project that we are currently partnering with Wellcome on is specifically on this data citation issue. It's in collaboration with the Chan Zuckerberg Initiative in a very unique way, because of their ability to have publisher agreements that offer full-text mining, and the computational power to run a lot of the machine learning algorithms that we'll be discussing. We are also working very closely with Europe PMC to bring in not only DOI information, but also accession numbers and other types of identifiers. So you can kind of imagine what the workflows would be for this type of project. We are still very much interested in promoting best practices through the traditional workflows of publishers and repositories submitting metadata, but we also want to move forward and build a corpus of information that can be used for contextualizing research and incentivizing the sharing of research. This diagram shows the approach that we're taking to building this. We are in the first phase, building the prototype. We have worked very closely with DataCite and Crossref, and also with CZI to mine information from across published content.
Inside this prototype, we're really looking at how we define what a data set is, understanding what a citation is and what the difference is between citations and mentions, leveraging machine learning to improve the algorithms over time, extracting information and growing that number from 7,000 to over 10 million, and looking at how we can not only have links between papers and data sets, but also ensure that there's a link back to the repository, so that there's access to the actual data. And this is something that is currently available. The video is just a simple animation, but if you go to the URL on the slide there, you'll be able to play around with some of the data. We're planning on doing a formal launch and announcement in January, and working with bibliometricians on taking the information we've already extracted to start doing contextualized projects. And there are a lot of ways for each of you to get involved. For that, I will pass it on to Kristi. Okay, great. So I think I get to talk about the most fun aspect of this project, which is that it's community oriented. Everyone who's working on this initiative recognizes that nothing is possible if we don't work together. And that's libraries, that's infrastructure providers, publishers, people who use and think about data in meaningful ways. It really does require that we're all coming together. On this slide you can see we've identified a few of our key stakeholders: the usual suspects, researchers, institutions, funders, and the research ecosystem. But I think libraries are particularly well poised to go beyond these traditional ways of thinking about who partners in this area and take it to the next level. Libraries really do serve as connective tissue. And I'm going to bring in my medical campus perspective, right?
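The distinction between formal citations and in-text mentions is the crux of the extraction step. The real pipeline uses machine learning over licensed full text via CZI; the sketch below is only a toy regex stand-in that illustrates the kinds of identifiers (DOIs and accession numbers) being pulled out of article prose rather than out of the reference list. The accession pattern is an assumption modeled on GenBank-style IDs; real accession formats vary widely by repository.

```python
# Toy stand-in for the corpus's extraction step: pull candidate dataset
# identifiers out of running prose. The real pipeline is ML-based; these
# regexes are illustrative assumptions only.
import re

DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
# GenBank-style accession IDs; an illustrative assumption, since accession
# formats differ across repositories.
ACCESSION_PATTERN = re.compile(r"\b[A-Z]{1,2}\d{5,8}\b")

def extract_data_mentions(text: str) -> dict:
    """Return candidate dataset identifiers mentioned anywhere in the text,
    not just in the formal reference section."""
    return {
        "dois": [d.rstrip(".,;") for d in DOI_PATTERN.findall(text)],
        "accessions": ACCESSION_PATTERN.findall(text),
    }

sample = ("Sequence data are available in GenBank under accession MN908947; "
          "processed tables are deposited at https://doi.org/10.5061/dryad.abc123.")
print(extract_data_mentions(sample))
```

A regex like this catches the easy cases; the harder work the project describes, such as telling a genuine data citation apart from a passing mention, is exactly why machine learning is involved.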
So libraries really are the connective tissue here. They're helping to support all of those key traditional stakeholders in meaningful ways, understanding how they can plug in to these activities, understanding the context, and what it means for this data citation corpus to live and thrive in a meaningful way, so that we really do tip the scale, tip that balance over to actually being able to make data count. So we're not just working with these infrastructure providers and traditional stakeholders; I think we're also coming to the table in meaningful ways across the spectrum. We're looking at not just different kinds of domains, or different kinds of roles outside of those more traditional stakeholder roles; we're also thinking about this in our libraries, on our campuses, in our consortia, and in the bigger scholarly ecosystem. So, building on the role that we have in bringing people together and facilitating great work and collaboration, this really is an outstanding opportunity for libraries. And I know, at least in our library, I'm especially excited about being able to support early-stage investigators in particular, thinking about how they're creating more than papers, right? We're more than our papers, we're more than these single outputs. The influence and impact that they have on their research domain is really the full complement of research objects, and research data is an important part of that. Okay, so the community. How can you stay informed and get involved? There are a few different ways to do that. Certainly we've got some links here. There's a project website where you can learn more about the project. We also have a listserv that you can sign up for to receive regular updates. As John mentioned, it's a very exciting time on the project. We've got a lot of things that are coming out.
Certainly there will be the formal launch of the corpus initiative in January, and regular communications and activities through the program. We've also set up a Zenodo community for Make Data Count. All of these research outputs, these slides, any kind of one-pager types of things created through the duration of the project, will be available openly in that Zenodo community. We want to try and work in the open in every way possible. There are also some key ways that you can get involved. You know, community isn't just a throwaway comment, like, oh yeah, the community is important. No, the community is actually very important. And I think libraries play an important role in this aspect. Not only are we able to bring to the table an understanding of user experience, specific contexts, and motivations, I think we can also drive information back into the project. A really good example of that: one type of data that is not yet in the corpus is clinical trials data, and I would love to see it there, right? Those are aspects that I think we can also inform, and we can start that conversation in meaningful ways. I would encourage everyone to take the tools out for a test drive. John demonstrated that really nice little animation of the dashboard, and we encourage you to play with it and provide feedback. And then please do also, we invite you, we're really excited to connect with various people in the community through Make Data Count events. There will be a Make Data Count Summit this fall, so sign up for the listserv so that you get the information about that. And there is also a paper coming out in March 2024: the Harvard Data Science Review has a special issue coming out called Democratizing Data: Data as a Public Asset, and there will be a paper about this initiative in that special issue. So what else can you do?
Well, there are adjacent and related activities that I think we can all invest in to really help take that conversation to the next level. One of the most recent, just announced a couple of weeks ago, is the joint statement on research data developed by STM, DataCite, and Crossref. This statement calls for best practices with respect to data sharing and data citation. They've done a really beautiful job of communicating these best practices. There's nothing controversial, but a lot to get excited about. So I really encourage everyone to take a look, have these conversations on your campuses, and really engage the different roles and responsibilities on your campus to understand how this ecosystem is changing and evolving into a really bright and wonderful space. With that, I just want to point out that we are, of course, grateful for the tremendous support from the Wellcome Trust, and for the partnership with CZI and EMBL-EBI on the data corpus. I also want to point out that for a lot of the work we're talking about today, there are lots of different applications. One of the applications that both John and I are involved with is the Generalist Repository Ecosystem Initiative, funded by the NIH Office of Data Science Strategy, where we're thinking about how we leverage this tremendous effort and think about data metrics in a meaningful way for generalist repositories. So it's a fabulous time to be in data, and I think there are some really outstanding things happening. The contact information is there, and with that we'll open it up for questions. Hi, Talay Alon from the World Bank. Kudos on the project, it's very exciting and the future looks bright. The project as it is constructed is very much enmeshed with Crossref. The question is, are there plans to expand it to other DOI registrars, citation services, equivalent services? The question was, is there a plan to expand the sources?
So one of the designs of the corpus, actually, is to allow us to aggregate from sources that not only we are creating, but that the community is creating. We know that there have been similar projects and similar approaches, and also a lot of cleanup work, happening across the community globally. We also know there are challenges with the approach we're taking, in the sense that it's only a certain number of published articles that we're able to look at. And so part of the plan really is advocacy to the community, looking for sources that can help complement what we're doing. The corpus itself is intended and designed to be able to aggregate from multiple sources. And we're not specifically working with Crossref per se on the machine learning process; that's actually happening through publisher agreements that CZI has negotiated. So while we do partner with Crossref, and we are friends, and we work together on a lot of analysis, the machine learning itself, I just wanted to clarify, is not specifically happening with Crossref per se. Thank you. I'll add that if you go to our website, you can take a look at our advisory group. We have representation from a lot of different sectors: publishers, infrastructure providers, repository managers, libraries. And one of the first questions that we asked as we started to have these conversations in our advisory group meetings was, does it scale? What happens when we need to include a broader range of sources in the corpus? So our design, I think, is very intentional in that we're in the pilot, early stages right now, and we would love to hear from folks in the community who are testing our new tools and our dashboard about what you would like to see in terms of expanded scope. But that has always been top of mind since our initial conversations, and I think it's a matter of execution, feasibility, and expansion. Thanks.
This is Peter Leonard from Stanford Libraries. Great presentation, thanks so much. I was curious if the machine learning model weights are something you might be interested in sharing, or are perhaps already sharing. Yeah, actually, it is something that is already available, and it's open. We've been working with CZI. Some of you may remember there was a project called Meta, before the rebranding of Facebook to Meta; there was a project called Meta.org. Many of the algorithms that we're working with came from that project, and that's what we're referring to when we say CZI: that Meta team within CZI. Their work on extracting mentions of software and data has been open, and it's something that we were auditing when we were looking for partners to help us with this. And so that's definitely something we think about when we talk about working in the open: ensuring that not only the results of the work are made available, through open data dumps and through dashboards, but also that the algorithms themselves are available. And I think that's really core to the mission of Make Data Count as an initiative, a project, and an organization: this idea that if we are not transparent about how we are creating and formulating metrics when it comes to data citation, then we can't have trust, validation, and rigor when we're looking at the citations themselves, and it erodes the usefulness and the value of those citations. So as an initiative, that's something that's very top of mind for us. Yeah. Hi. I'm sorry. Go ahead. Oh, thank you. Hi, I'm Sarah, from Weill Cornell Medicine. It's an awesome initiative. I work at the Wood Library, just to give some context to my question. I really love the idea. My question is, usually when I think about data citation, I think about data reuse. And for data reuse, you need to think about metadata.
So I wanted to ask if it is in the scope of this initiative to extract metadata associated with the data set. And if so, what have you done so far, and what are your plans? Yeah, sure. So I think you really do make an important point, which is that metadata is important. It helps to provide context. Not only can it be a tool for making things discoverable or helping to call things out, it really does help to drive that contextual perspective. One aspect that I'm particularly interested in is subject area. When we look at classification and indexing of publications, there are various schemas that organize and present information according to these different subject areas. But that is one output and one type of output. When we're looking at data, you know, data really are at that core, but they're used in different ways; they're applied in different contexts. And so how do you understand what that full complement looks like? It's a bouquet of contexts, right, in terms of how we're using data sets. One of the things I will share that we've looked at just locally is that the way data are used and cited can tell an interesting story. There are some data citations that really are about reinforcing best practices or standards, or that are used for comparison or validation. There are other data citations which actually help to translate knowledge to the next step, so that you can see beyond the generation of knowledge to its application, or the outcomes of that knowledge being applied. And that's just one aspect of the metadata. But certainly, as we've said in our comments today, everything is very open, and this is a very eager and aspirational group. We want it all. Yeah. And we want it to be available for everyone. Yeah. And I think that the initial scans that we're doing are looking specifically for DOIs and accession numbers.
So in that sense, with our partnerships with DataCite, Crossref, and EBI, we're able to extract a minimum viable metadata record for each of the connections. We were talking about aggregating other sources; for each of you who have sources, please get in touch, we would love to start to bring those together. On the kind of logo soup that was up there of all the different open projects: the Curtin Open Knowledge Initiative, COKI, we've been working with them on bringing in what they've been able to collect and looking at their metadata records to ensure that they meet that minimum viable standard. And also the OpenCitations group that's based in Bologna; we're working very closely with them on ensuring that anything that comes into the corpus has that same minimum viable metadata. Very interesting. One quick follow-up question. Regarding metadata, since you mentioned that you have some metadata extracted, do you have anything to standardize that metadata? Because data are very diverse, right? You have different types. How would you deal with that? Yeah, without getting into all the details, we've really been trying to standardize to a very small set, with most of it available through links out, so you can get more in-depth metadata from the main sources. Great, thank you so much. I think we're at time. So everyone, thank you so much. Please don't hesitate to get in contact with the program or with any of us, and we'll be delighted to make any connections. Thank you for joining us.
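As a closing illustration of the "minimum viable metadata record" and the "standardize to a very small set" ideas from the exchange above, here is a rough sketch of normalizing citation links from multiple sources into one shared record. The field names and the normalization rules are assumptions for illustration; the corpus's actual schema may differ.

```python
# Sketch: normalizing citation links from multiple contributing sources into
# a shared minimum viable record. Field names are illustrative assumptions.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CitationRecord:
    article_id: str   # DOI of the citing article
    dataset_id: str   # DOI or accession number of the cited data set
    repository: str   # where the data set resolves, for access to the data
    source: str       # which pipeline or partner contributed the link

def normalize(raw: dict, source: str) -> CitationRecord:
    """Map a source-specific raw link onto the shared minimum record.
    DOIs are case-insensitive, so we lowercase the citing DOI."""
    return CitationRecord(
        article_id=raw["citing"].strip().lower(),
        dataset_id=raw["cited"].strip(),
        repository=raw.get("repo", "unknown"),
        source=source,
    )

rec = normalize({"citing": "10.1234/ARTICLE.1", "cited": "10.5061/dryad.xyz"},
                source="czi-ml")
print(asdict(rec))
```

Keeping the record this small, with links out to the registries for richer metadata, matches the answer given above: standardize a tiny core, and defer depth to the main sources.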