Hello, I'm Daniella. I'm Ted. And I'm the product manager for Dryad, and I work at California Digital Library. And I work at Metadata Game Changers. Ted's put in here the ROR for CDL, and no ROR for Metadata Game Changers. Well, we have a DUNS number. It's a traditional identifier that we're trying to work into ROR.

So what we're going to talk about is that we have this problem. Dryad has 27,000 data publications. And because of differing standards, and because there was no standard for institutions when we were thinking about publishing articles and data, we never actually collected the institutional affiliations for any of the authors of these data publications. That's a huge missing piece that we're looking for. And why is that a problem? Well, a lot of institutions want to know the research output from their institution. We want to know the usage by researchers in their institution. We want to be able to see what's coming into a central place like Dryad and then send it back out to an institution. And right now we just don't have any standard way to find that. We don't even have that raw information to start with.

So we came together and talked about how we were going to find all this information for these Dryad data publications. We know that in the past Dryad has only required that data be related to an article, and we know that journals send that information to Crossref. So that was the first place we were going to start looking. But when we started, we found that we could only find half of all the affiliations in Crossref, and that's because there isn't a standard so far for institutions to be sending that information. So we also had to start looking at places like PLOS that have open APIs that allow us to start pulling this information. Bring on Ted.

So we get the affiliations out of either Crossref or PLOS.
We also have a small army of curators that are manually generating some, which is of course hard and much appreciated work. And it all goes into the meat grinder, and out of that comes a token. The token goes into ROR, to try to convert the token into an identifier. And the goal in the end is to get those identifiers into the DataCite metadata for these datasets. What I'm going to focus on today is really this part: taking the affiliation data from Crossref and from PLOS and trying to generate RORs.

Who here knows what ROR is? ROR is the Research Organization Registry. It's a new open identifier; Maria Gould gave a talk on it earlier today. She's back here if you have questions.

So this is what a perfect affiliation string looks like. People who are teaching scientists how to do things effectively with their publications should work on writing simple affiliation strings that we can extract things like names of universities from. This is a good one because it has consistent delimiters, semicolons in this case, and there's a single affiliation delimited by those delimiters. So we can recognize that pretty easily and turn it into a ROR. Seen like that, this is a very simple project.

Unfortunately, the world looks a little bit more like this. We have some number of tens of thousands of these strings that we're trying to work with. And here we are at csv,conf, and commas are really wonderful when they separate the tokens that you're interested in. You can see some of these have commas, but some of them have semicolons. So there are multiple delimiters, and we need to work with multiple delimiters. And then of course there are some strings that don't have any delimiters at all around the tokens you're trying to find. So my approach to that is to create some targets, which are strings that are expected in organization names.
These are things that you find by looking at the data, doing word counts, or just becoming familiar with it. They become the targets that we look for to identify the names of organizations in these strings. So it looks like this. The most common word in names of research organizations happens to be "university", and it happens in many languages, so we use "Univ", which works in most languages. Things like "National" are also useful in a lot of situations. And a lot of these tokens are things like "Department". One of the things we know about organizations is that they're hierarchical: there's usually a Department of X, University of Y. ROR is currently focusing on the university level, but I want to build this tool so that we can work at different levels of the hierarchy if that becomes necessary. So "Department" can be important. But there are also strings where "Department" is actually the identifier for the organization you're looking for. If it's the Department of Agriculture or the Department of the Interior, or other things like that, you want to be able to find those tokens that start with "Department" as well; they're at the right level in the hierarchy for ROR.

This is just a selection of the "Univ" matches, and we already know that the University of Arizona is a perfect token in this case, a perfect organizational name. There are other challenges in this dataset: things like names of universities or other organizations in non-English languages, which unfortunately is a challenge for me if it's not English or German; a lot of unusual characters that appear in these strings and cause some trouble; and then, as I mentioned before, strings that are delimited by other things. In this case they're delimited by commas.
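The target-based extraction described above can be sketched in a few lines. This is a simplified illustration, not the actual Dryad tooling; the target list here is a hypothetical subset of the kind built by inspecting the data:

```python
import re

# Target substrings expected in organization names, found by doing word
# counts on the data. "Univ" matches "University", "Universidad", etc.
TARGETS = ["Univ", "College", "Institut", "National", "Department of"]

def candidate_tokens(affiliation, delimiters=",;"):
    """Split an affiliation string on several possible delimiters and
    keep the pieces that contain one of the target substrings."""
    pieces = re.split("[" + re.escape(delimiters) + "]", affiliation)
    return [p.strip() for p in pieces
            if any(t.lower() in p.lower() for t in TARGETS)]

print(candidate_tokens(
    "Department of Entomology; University of Arizona; Tucson; AZ; USA"))
# → ['Department of Entomology', 'University of Arizona']
```

Note that this finds both "Department of Entomology" and "University of Arizona" as candidates; as discussed above, picking the one at the right level of the hierarchy is a separate step.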
Each of these is obviously a simple problem, but the simple problems add up when you're trying to munge a fair amount of text. This slide shows a set of replacements that are made in the process of converting the original affiliation strings into these tokens; a lot of them are replacements for "Univ." abbreviations. Another good thing when you're writing these strings into your publication platforms is to avoid abbreviations. That's good practice in most data processing tasks.

So this is that original input that we looked at, and these are the tokens that were identified by looking for those targets. Now we have other challenges. One is that we have affiliation strings with multiple targets. The one at the top is a laboratory in China; it comes from China Agricultural University, which is the token that we're looking for here, and the rest of these are things that we need to avoid. They would be called candidate tokens, but in the first string we just want one. In the second string, we've got a few different things. We've got the Cavendish Laboratory, which sounds like it could be at the appropriate level for ROR. We've got Imperial College London, and of course "College" is one of our targets. But notice in the second string, and the way I highlighted this makes it a little difficult to see, that there are actually affiliations for four authors here, and the affiliations for authors two, three, and four have "2", "3", and "4" written in front of them. This is a situation that occurs in a lot of these examples. And people also write extraneous text, like "from the Department of Zoology", or just extraneous labels. They write those labels because they understand that labeling things is good, and it generally is good practice.
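The abbreviation replacements mentioned a moment ago can be sketched like this. The replacement table is a small hypothetical sample, not the full list from the slide:

```python
# A few of the abbreviation expansions applied before target matching
# (hypothetical subset; the real list is built from the data).
REPLACEMENTS = {
    "Univ.": "University",
    "Dept.": "Department",
    "Inst.": "Institute",
    "Natl.": "National",
}

def normalize(affiliation):
    """Expand common abbreviations so tokens match full organization names."""
    for abbr, full in REPLACEMENTS.items():
        affiliation = affiliation.replace(abbr, full)
    return affiliation

print(normalize("Dept. of Zoology, Univ. of Arizona"))
# → Department of Zoology, University of Arizona
```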
But unfortunately, when you're trying to process a bunch of these things, those labels can become difficult. So we're back to our dataset. Now we've got a bunch of affiliation tokens, like the ones we recognized back here, and the question is how we can convert those to RORs. ROR.org has a nice API that lets you give it a string, or give it other kinds of identifiers, which is going to turn out to be pretty important, and get search results back. There's also a first pass at a reconciler for ROR that is compatible with OpenRefine, so that's another approach you have.

Many of those affiliation tokens that we talked about make an exact match. You can go against the ROR API with something like "University of Arizona", you get back an organization with that name, and you get the ROR. In a lot of cases, because of these differences in delimiters, or these things that are piled together, the matches are not quite so easy. Or it could be that some of those unusual characters are at different places in the words; unfortunately, there doesn't seem to be a systematic replacement in that case, and those things, of course, came from Crossref or PLOS, so we want to keep them in the strings we're searching for. In other cases you need to look at these affiliation tokens with a human brain, at least in this first pass, and say: this is not an exact match, but it is a valid match. So we have two kinds of matches, either exact or valid. And that gives us about this many exacts, and about this many exacts plus valids.

The results so far: in a sample of the Crossref metadata, we've got 8,826 DOIs. Of course, those DOIs are publications, they have multiple authors, and some authors have multiple affiliations, so the number of affiliations expands. The number of affiliation tokens gets a little smaller in this case, which is good; we have 11,000 of those.
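The exact-match lookup against the ROR search API can be sketched as follows. This is a minimal sketch assuming the public `api.ror.org/organizations` query endpoint and its JSON response shape (`items`, each with `name` and `id`); treat the field names as assumptions to verify against the current API docs:

```python
import json
import urllib.parse
import urllib.request

ROR_API = "https://api.ror.org/organizations"  # public ROR search endpoint

def ror_search_url(token):
    """Build a ROR search URL for an affiliation token."""
    return ROR_API + "?query=" + urllib.parse.quote(token)

def lookup_ror(token):
    """Query the ROR API and return (name, ror_id) for the top hit, or None.
    A top hit whose name equals the token is an 'exact' match; anything
    else would go to a human for the exact-vs-valid judgment."""
    with urllib.request.urlopen(ror_search_url(token)) as resp:
        data = json.load(resp)
    items = data.get("items", [])
    if not items:
        return None
    return items[0]["name"], items[0]["id"]

# Example (requires network access):
# lookup_ror("University of Arizona")
# → ("University of Arizona", "https://ror.org/...")
```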
And we're matching those up with 2,500 RORs. So in this case we have 71% of the DOIs that we're able to assign RORs to, and 65% of the authors for whom we actually have RORs. Those numbers grow in two ways: we increase the number of RORs that we know, or we go through and look at the data and try to match things up, at this point by hand and by improving the algorithms that we're using. In the PLOS case, we started with a smaller dataset of 2,496 DOIs, roughly 7,500 and 8,600 affiliations and tokens, and 1,592 RORs. In that case, we actually have RORs for 91% of the DOIs and 75% of the authors. I'm hoping these numbers can improve. There are some organizations that don't have RORs, though not very many. But there are a lot of affiliation strings that don't actually include names of organizations: they're addresses, or random words. There are some affiliation strings that are just a department. So I don't think we'll get to 100%, but I'm hoping that we can get, like in the PLOS case, at least over 90% of the DOIs matched with RORs. And of course this will improve as a function of time.

So, future directions. What this project is really about is the adoption of unambiguous identifiers in metadata systems and metadata repositories. We have a lot of metadata repositories; in the case of Dryad, it was one that didn't have the information we needed. We also have repositories that have affiliation information, or that have human names, but don't have identifiers for those affiliations, and don't have things like ORCIDs for those names. How many people here know about Crossref? Crossref, as we'll talk about in a minute, has 110 million DOIs. How many of you know about ORCIDs? How many have ORCIDs? Well, it turns out that only 9% of the records in Crossref actually have ORCIDs.
So there's a lot of room for getting identifiers in there. Something like 13% of the records in Crossref have affiliations. So we're going to try to get this system attached to Crossref, and also to DataCite, for inserting these identifiers, so that we have enough in there that we can demonstrate the benefits. There were a lot of great thoughts today about helping people understand the benefits of various aspects of open science, and the benefits of unique and persistent identifiers for people, organizations, publications, instruments, algorithms, locations, etc. Helping people understand that requires having some critical mass so that you can demonstrate the benefits. That's really what we're trying to start here.

Many people here are familiar with OpenRefine. There was a great talk earlier about the Carpentries, by Cary, that mentioned OpenRefine. It's a useful tool; the user interface is sort of challenging if you have large datasets, and I need to look at and try to learn the API for that. Simon is sitting here in the back in this really bright-colored checkered shirt. Simon is a Wikidata expert and he's our first partner this week; there may be others today, we've still got options. He's adding ROR as a property to Wikidata. What that means is that we can use the Wikidata reconciler that's already built into OpenRefine, and that has existed and evolved over the years, to reconcile these names. And then, once we find a Wikidata ID for some organization, we can just say: give us the ROR for this organization. That's going to be a huge change. It will also allow us to connect to the existing wiki pages for those organizations as landing pages. That's another cool thing about this work. Simon's got an ambitious schedule for it; we hope to have it working during this month, and it's going to be super cool.
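Once ROR IDs are a Wikidata property, the "give us the ROR for this organization" step is a one-line SPARQL query against the Wikidata Query Service. A minimal sketch, assuming P6782 is the Wikidata property for ROR IDs (the QID in the example is a placeholder, not a real organization item):

```python
import urllib.parse

# Wikidata's public SPARQL endpoint.
WDQS = "https://query.wikidata.org/sparql"

def ror_from_wikidata_query(qid):
    """SPARQL asking Wikidata for the ROR ID of an organization item,
    assuming P6782 is the ROR ID property."""
    return "SELECT ?ror WHERE { wd:%s wdt:P6782 ?ror . }" % qid

def sparql_url(query):
    """URL for running a query against the Wikidata Query Service as JSON."""
    return WDQS + "?format=json&query=" + urllib.parse.quote(query)

# Example (requires network; Q12345 is a placeholder QID):
# urllib.request.urlopen(sparql_url(ror_from_wikidata_query("Q12345")))
print(ror_from_wikidata_query("Q12345"))
```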
And then there's testing and implementation with more partners. So we're looking for partners; if some of you here are interested in this, we'd love to work with you. Another nice thing about Crossref in this case is that it has over 12,000 members. Even if you can only convince 1% of them that this is interesting, that's enough to keep us busy.

I developed some tools a little while ago for visualizing Crossref metadata, and in the upper right-hand corner of this is the percentage of records in Crossref that have affiliations. On the left is the Korean Society for Plant Biotechnology; the orange here is data that's older than two years, and the blue is data that's two years old or less. So this journal, in the last two years, has had a huge increase in the number of affiliations in its Crossref metadata. I can use this kind of analysis of the metadata to identify potential partners, ones who have obviously made some organizational decisions that resulted in increasing the number of affiliations in their metadata. When you're trying to convince people that identifiers for those affiliations might be useful, that's a good target audience.

On the right is Hindawi; probably many of you know about Hindawi. It's an open publisher that does roughly 20,000 articles a year. They have about 40,000 resources in Crossref, and you can see they have a long history of populating those metadata records with affiliations. They're also interested in open science. You can see that all around this circle the lines are close to 100%: they've got a lot of affiliations and DOIs. So those are the kinds of groups that I'm looking to partner with in this work.

Any questions?
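The affiliation-coverage analysis described above can be approximated with the public Crossref REST API. A sketch, assuming the documented `has-affiliation` filter on a member's works (the member ID here is a placeholder):

```python
import json
import urllib.request

CROSSREF = "https://api.crossref.org"

def member_coverage_urls(member_id):
    """URLs to count a Crossref member's total records and the subset
    with affiliations, using rows=0 to fetch only the counts."""
    base = "%s/members/%s/works?rows=0" % (CROSSREF, member_id)
    return base, base + "&filter=has-affiliation:true"

def pct_with_affiliations(member_id):
    """Percentage of a member's Crossref records that carry affiliations
    (requires network access)."""
    total_url, aff_url = member_coverage_urls(member_id)
    total = json.load(urllib.request.urlopen(total_url))["message"]["total-results"]
    aff = json.load(urllib.request.urlopen(aff_url))["message"]["total-results"]
    return 100.0 * aff / total if total else 0.0

# Example (requires network; "1234" is a placeholder member ID):
# pct_with_affiliations("1234")
```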