I'm very happy to see you in person after a two-year break, and to see many familiar faces. Today, Jeff, Rick, and I are going to present a project we have been working on for a couple of years. The abstract says it started in 2019, but if you go back to 2017, from the library's perspective we were treating it as a research and development project. It was really about learning and staff development for our technical folks: understanding how to talk to the APIs provided by our vendor databases, and also trying to enrich the metadata for ND authors, particularly for the SAM series. The third goal was to create some form of authority control for ND authors, particularly for article authors. Our thinking was that if we can create an authority control for Notre Dame, there will probably be scalability across other colleges and libraries, and they will be able to create it for their institutional authors as well.

Back in 2017, the Provost Office already understood that the library is the only knowledge organization on campus, so they came to us asking whether we could support the university's effort to gauge the scholarly impact of our professors and researchers. We started a conversation from there, so they knew this was really a side project: there was only a small team, and we were testing the infrastructure and the technology, learning the ropes. A year later, the Provost Office connected us with the Office of Research, which had a more defined goal: gauging the scholarly output of the centers and institutes that report through the Office of Research. That's really the origin of this project. The project started in 2019 with phase one, and we just completed phase two over the summer.

From the university leadership's perspective, they would like to gauge scholarly impact accurately. The data is often gathered through several methods, but it is not very accurate, and particularly if they want a longitudinal study of scholarly impact, they would like a more automated way of getting the data in and storing it, so that they can get a holistic picture. One thing we learned is that it is impossible to automate the process 100%. Someone has to be in the middle of it to make sure the metadata we get from the vendors is massaged, normalized, and mapped to the university's organizational structure. Every university is a bit different in how it structures its colleges, departments, and disciplines.

From the center and institute perspective, this is an annual process. They probably start several months before the deadline, and it is a coordinated process across many colleges. For a center or institute, the fellows actually report through several colleges and departments, and in some instances there are also fellows coming from our peer institutions. It is very hard for them to get a clear picture without having their staff and faculty go through the entire publication list to decide what should be assigned to their center. And if you look across all the centers, the Office of Research often sees a lot of duplication in the reports coming in, because faculty are often affiliated with multiple centers. So for us, this is not a normal faculty profile project.
We really have to differentiate, to have another lens to understand our local structure, who is reporting where, and that metadata needs to be enriched from our side. That's really the part we do: from the library perspective, we first talk to the APIs, and then we have our technical services folks do the first review to make sure publications are accurately assigned to the right center. Over time, we build a more accurate picture of who we should harvest, and from where.

Phase one started with one center, the Harper Cancer Research Institute. Harper is a really complex situation and a very good test case to start with: we have a pre-med degree but no medical school, so the center's fellows come from multiple departments (mathematics, computer science, biology, chemistry), and we also have a collaboration with the IU medical school. That gave us a very complex situation to solve and made us think about how it all maps into the center, from the center's perspective. It was really a learning experience for us. By the end of phase one, and that's the associate director's comment about the first phase of the pilot, we moved to phase two, where we started supporting all seven centers from the Office of Research. This is another testimonial, from ND Energy. I'm going to stop here and ask Rick to give the demo.

Sure, so I'm very excited to be here. This is like coming home, since the last conference I was at was two years ago, at CNI in December. As John has been alluding to, one of the interesting things about this project is that it's not just about author disambiguation; it's also about doing that next layer of discernment, where you look to see which publications are not just written by a particular author, but which of their 20 or 30 publications apply to the center they're a part of. And as John mentioned, many of these members belong to multiple centers; I think the average is two or three across the pilot group, so that's one big complex aspect of it.

Thinking about the human-in-the-loop portion, what we've really focused on is letting the machines do what they do well, letting people do what they do well, and matching those things together. So there is some automated harvesting, data processing, and a confidence calculation of sorts to trim back the large volume coming from these various sources; for some of them, the number of results can be in the tens of thousands, depending on how well the query is tuned and how many results are returned. So we have a step where data is pulled in, and then a portion where a human does an actual review.

In terms of the data coming in, these are the sources we're pulling from, and John talked about how there are varying degrees of information across them. What we have seen is that, from record to record, it's inconsistent how much of the record is populated and how complete the fields are, so we've worked hard to combine data from multiple sources as well. With the tool, there's the typical login via the campus authentication system, and once you're in, we get to the first step, where the library does the author disambiguation.
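Because no single source fills in every field, a lot of the work is merging records for the same publication into one normalized shape. The sketch below is illustrative only; the field names, the dataclass, and the merge rule are assumptions, not the project's actual schema.

```python
# Minimal sketch with hypothetical field names; the real normalized format differs.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HarvestedRecord:
    source: str                                             # e.g. "pubmed", "scopus", "semantic_scholar"
    doi: Optional[str] = None
    title: Optional[str] = None
    authors: list[str] = field(default_factory=list)        # raw author name strings
    affiliations: list[str] = field(default_factory=list)   # raw affiliation strings

def merge_records(a: HarvestedRecord, b: HarvestedRecord) -> HarvestedRecord:
    """Combine two harvested copies of the same publication,
    preferring whichever copy actually has a value for each field."""
    return HarvestedRecord(
        source=f"{a.source}+{b.source}",
        doi=a.doi or b.doi,
        title=a.title or b.title,
        authors=sorted(set(a.authors) | set(b.authors)),
        affiliations=sorted(set(a.affiliations) | set(b.affiliations)),
    )
```

Preferring non-empty fields and unioning the author and affiliation lists is just one plausible merge rule; the point is that completeness varies record to record, so the combined record is usually richer than any single source.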
The library does that portion where they look at each candidate and decide: accept or reject. We're able to pull in a lot of information here; you can see the various sources, open the article via various links, follow the DOI. So this is the portion where a record has been brought in, and the person looking at it can see very quickly whether this article actually belongs to this particular author. In this example we can see that, yes, it obviously lists Notre Dame when you click through to the author, but that information is not coming to us in the metadata. It's these kinds of things: interfaces like this are built for people and are very easy to use, but the information the computer would need just isn't there consistently. That's a big reason we built in this review step.

There's another aspect here that I mentioned briefly: confidence calculations. To push down the articles that really are not applicable but still come back from the queries to these sources, we built in confidence measures that look at last name, initials, affiliation, et cetera. That helped a lot; as I'll show later in the numbers, it cut more than half of the candidate records coming in.

Once the library does its portion, the library providing that authoritative role in the process, confirming that yes, these publications are for these people, it gets to the next step, where the centers do the review themselves: I see these publications are confirmed for these people, now I'm going to check which ones actually apply. In the tool, again, there's a link to go directly to the publication, but instead of looking at the person, they're now focusing on the abstract, looking to see which parts apply. In this particular case it's the Institute for Precision Health, and the reviewer just looked at the abstract to see what would be applicable there.

I talked a bit already about how many of these researchers are part of multiple centers. Here's an example: an article that was accepted, I think for the Harper Cancer Research Institute, for Professor Chen. When you look at another center that he's a part of, NDnano, does it apply or not? In this case, as we'll see in a second, it doesn't. It was pretty important that we could simultaneously accept for one center and reject for another without one decision affecting the other (a minimal sketch of that decision model follows below). One of the things I've consistently been telling our pilot participants is that by rejecting this, you're not saying that Danny Chen didn't write the article; that's already confirmed. You're just saying it doesn't apply to this particular center or institute.

Then there were definitely some cases that we heard about within the process. The folks primarily doing the center review are administrators and managing directors within the centers and institutes, and they do that first pass, but there's usually a subset where they're not totally sure which ones actually apply. So we built in a mechanism here to be able to filter down to the ones that are pending, then do a download.
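To make the per-center accept/reject behavior concrete, one way to model it is to key the review decisions by publication, author, and center, so that a rejection for one center never touches the authorship confirmation or any other center's decision. This is only an illustrative sketch under that assumption; the names and structures below are hypothetical, not the tool's actual data model.

```python
# Illustrative only: decisions keyed by (publication, author, center).
from enum import Enum

class Decision(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"

# Library step: is this really the author's publication?
authorship: dict[tuple[str, str], Decision] = {}          # (doi, author_id) -> Decision

# Center step: does a confirmed publication count for this center?
center_review: dict[tuple[str, str, str], Decision] = {}  # (doi, author_id, center_id) -> Decision

def record_center_decision(doi: str, author_id: str, center_id: str, decision: Decision) -> None:
    """Scope the decision to one center; authorship itself is untouched."""
    if authorship.get((doi, author_id)) is not Decision.ACCEPTED:
        raise ValueError("authorship not yet confirmed by the library")
    center_review[(doi, author_id, center_id)] = decision
```

Under this kind of model, Professor Chen's article can stay accepted for Harper while being rejected for NDnano, which is exactly the behavior described above.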
I'll let this cycle through again so you can see it. We're able to sort and filter by the pending status and by the counts, to make those easy to see. Then, once they do the download here and open it up, they're able to send that list to whoever they need to, potentially the researchers themselves. We found that trying to duplicate what Excel or Google Sheets already do ended up being more effort than necessary, so we fell back to the standard tools that everyone has, everyone can use, and everyone can access for this portion of the process. The researchers go through, confirm which ones apply, and send the list back, and then the center folks go into the tool and enter those decisions.

Now, in addition to the review, once all the information has been entered, we built in a reporting dashboard. The thinking here is that library leadership and university executive leadership would be looking at this. There's the ability to do various filtering, down to a center or to a particular author, and again you can download the results and take a look. Here's a view of what that looks like: whatever is in view under a particular filter is what gets downloaded, so it could be everything or a subset. Within the dashboard we also have a larger view where you can go through and filter, and another piece that shows co-author collaboration networks. You can see what it looks like as it gradually filters down to just the author filter, then down to, in this case, ND Energy, and in a second it updates to the full set of collaborations that are there.

Okay, so looking at some raw numbers. Overall, for the 315 total authors you see in the middle there, we harvested a little over 8,000 possible matches, and by doing that initial confidence cut, we were able to filter that down to about 3,500. One important distinction: there's another line there about distinct publications. We considered each author and publication match as a unique pair, in addition to just pulling down publications, because in many cases we had publications with 20 authors, where one was a very confident match with someone in the center and another was a much more tentative match; they might have someone on the paper with the same last name and even the same first initial as someone in the center. One other thing we found: we had set the bar at roughly last name plus first initial, but we also added a step to tune this a little more and push some publication matches below the line. If only the first initial matched, but the record carried a full given name that clearly did not match, we treated it as a non-match (a rough sketch of this kind of rule appears below). A person looking at that would say, of course that's not them, but the computer was saying, well, we still think this is 50%. That tuning was a big part of what allowed us to make the cut. Then, looking at the raw numbers of distinct publications: of the amount that was harvested, these are the ones confirmed as matching at least one author, in many cases two or three authors, depending. As you can imagine, with the collaborations across these centers, they're all working together on various projects.
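As an illustration of that kind of rule, here is a hedged sketch of a simple name and affiliation confidence score. The weights, field handling, and function shape are assumptions made for the example, not the project's actual scoring; only the idea of demoting initial-only matches when a conflicting full given name is present comes from the talk.

```python
# Hedged sketch: hypothetical weights and fields, not the project's actual scoring.
def name_confidence(candidate_given: str, candidate_family: str,
                    target_given: str, target_family: str,
                    affiliation_matches: bool) -> float:
    """Score a harvested author string against a known center member."""
    if candidate_family.lower() != target_family.lower():
        return 0.0

    score = 0.4                       # family name matches
    cg = candidate_given.lower().strip(". ")
    tg = target_given.lower().strip(". ")

    if cg and tg and cg[0] == tg[0]:
        score += 0.1                  # family name plus first initial lands near the "50%" cases
        # Tuning rule described above: a full given name that clearly differs
        # pushes the match below the line, even though the initial agrees.
        if len(cg) > 1 and len(tg) > 1 and cg != tg and not tg.startswith(cg):
            return 0.0

    if affiliation_matches:
        score += 0.4                  # affiliation string mentions the institution

    return score

# "J." against target "Jane" keeps its initial-only score;
# "John" against target "Jane" drops to 0 despite the matching initial.
```

A cutoff over scores along these lines is the sort of step that trimmed the roughly 8,000 harvested candidate pairs down to about 3,500 before any human review.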
Then, looking at those adjusted percentages to explain them a little bit: if you look at the third line, affiliated with center, that 70% number is 934 out of roughly 1,300. So it's saying that once you've gone to that step, 70% of those were then affiliated, and if you look at the overall number harvested, it was about 40% of those.

Okay, I think we managed to move through this pretty quickly, because we wanted to leave a lot of time for discussion and really hear your thoughts after hearing more about what we're doing. We're very interested in finding potential partners, even if you would use just a piece of what we have done. We know there are similar tools and processes out there that various folks have already built, and we think what we've done here is interesting, so we wanted to hear more of your thoughts. I'll pause there. We actually have Jeff Spies on the line, so we'll see if we can pull him in over Zoom. Okay, are there any questions? If you have a question, please speak through the speakers so Jeff can hear you.

Yeah. The project focuses on publications. Could this tool be tweaked in a way that could also collect metrics or other data from different places per publication?

Yeah, absolutely. We have that infrastructure set up to do all of that. We deliberately scoped it to the center and institute and Office of Research requirements; at this point in time they're only interested in the publication numbers. But we can definitely pull in impact factors and citations. We're also in conversation with the Office of Research about connecting to our funding data: basically, for all the proposals coming in, we could see how much is funneled through each center and then see the scholarly impact from those proposals.

Yeah, and we have done some of that. We stopped at more of the prototype stage, with our Office of Research covering most of the cost for various pieces, especially for Jeff, who we have had as a consultant. They had a big say in how our scope was determined on the first pass.

Hi, John Dunn, Indiana University. Thanks very much for sharing this; this is really interesting work. I have two questions, one for you two and maybe one broader question. The first is: to what degree is PACE tied to the particular data sources you're using, in terms of its architecture, versus being extensible to other data sources that could feed in the citation data? And secondly, I think one of you just observed this: a lot of institutions have developed parts of this sort of research information management, tools to support citation-processing workflows, open access efforts, research metrics, and so on. Is there anywhere these projects are getting together to compare notes, share experiences, and potentially share code? There may be, and I may just not be aware of it, but if not, I think that would be extremely valuable.

For the latter first: that's certainly our hope, that today can be the start of that conversation. We have some guesses about where that could happen; for example, a lot of things have been happening in the Wikidata community that are very interesting and complementary. That also speaks a little to the extensibility question. The tool is built at the moment so that pulling in from many sources is possible.
It takes some work to build in a configuration, but it has been set up so that the system can ingest records harvested from various sources. The expectation is that records are pushed in in a standard, normalized format; the system then pulls them in and makes an additional call to Crossref for DOI metadata, regardless of which source the record came from, to normalize and pull those things together. The last six months involved a lot of time refining and normalizing that portion.

As I said, when we started back in 2017 and tested some of those vendor APIs, we actually talked to our vendors. Some of those APIs are free; some are premium services we have to pay for. Just to be frank, I think they are more interested in selling us products than in thinking of this as a partnership. The reason we are doing this is that there is local metadata involved, really mapping your own institution's structure, and that is something you cannot pull down from any of the databases. They may have very good department information, but people change and institutions change focus, so those are the areas where the library really has something to offer. We talked to the vendors and said, hey, we can be a partner, we will share that data to enrich your database, and that will benefit everyone. I don't think we talked to the right people; the sales side wanted to sell us a product rather than treat it as a collaboration.

With that said, on campus, if there is a phase three or phase four beyond the Office of Research, there is a very good possibility of connecting to the faculty annual review systems already on campus. People are very interested in that, particularly from the Office of the Provost's perspective. We would actually love to have our faculty be the first source of truth, coming in through their annual reports. So there are definitely ways we can connect. The point we are trying to make is that somewhere, either on campus or on the vendor side, someone has to understand how to map more closely to what the campus needs. In this case, particularly through the lens of the centers and institutes, a lot of vendor databases just don't have it. Another problem is coverage: if you think about the disciplines involved, no single vendor solution can cover all of the scholarly output. Someone has to do that job for the university.

This is really exciting to see, thank you. Tom Kramer from Stanford; nice to see you. Also two questions. One: we have a very similar set of scripts and processes that we have been piecing together, so it's really great to see how well you have packaged this and presented it, especially to staff at the institutes. I'm wondering, similar to John, have you found that certain sources are better than others? We spent a fair amount of time comparing Scopus, Web of Science, Dimensions in particular, ORCID, and Crossref, trying to figure out where the cost-benefit is of expanding sources versus the additional noise. We have not found deduplication using DOIs to be a panacea, because different databases have more or fewer DOIs, so we get a lot of noise. And the second question is about the institutes use case you're presenting: can you talk a little bit about campus profiles?

Yeah, and speaking to the sources first: are you able to switch to the slides view real quick?
Our friend over at the desk there... oh, do I switch it? Okay, let me see what I can do here. I think I have to go out of full screen for Jeff. It's slow to respond. Well, while Rick is pulling up the slide: from the campus perspective, you know, this is very progressive for our campus. There is no coordinated effort around general faculty profiles yet, so this is really the first use case to surface as an interest. But again, Tom, from our perspective, if they want to go to a more general sort of profile, we have the authors here with very high confidence, and we can definitely add more layers or more lenses, through colleges, through departments, or even for individuals. It's just there; it's just that the need is not quite coordinated across campus yet.

Well, I'm going to stop fumbling and just try to speak to it; there's nothing wrong with it, that's just really where we are. So, in terms of sources that were really good: PubMed was probably the best in terms of API ease of use as well as coverage. Beyond that it's a bit of a best-of-breed situation. We had a set of centers that publish more in conference proceedings than in particular journals, so we had to have some sources covering that. We were originally looking at Google Scholar, and we approached that very cautiously in terms of what we were legally able to do, and we stumbled upon Semantic Scholar as something very comparable. It's extremely similar to Microsoft Academic Graph, if you're familiar with that, which is being discontinued; fortuitously, we were looking at this when we already knew about the discontinuation of Academic Graph and were ready to jump onto it, but then discovered Semantic Scholar, which has very good conference paper and preprint coverage. Then Web of Science and Scopus have, as you know, very similar coverage of a lot of STEM-oriented material; it really just depends on the discipline which one is best. And Crossref really ended up providing a nice catch-all, but that was one of the things I was alluding to earlier: Crossref would often return 10,000 results, and we would not accept that volume at that point because it was diminishing returns in terms of what could be sifted through. In those cases it was more productive to focus on a different source for that particular author. So when we do a query of Crossref, we apply a limit: if fewer than a thousand results are returned, we take them and use them; if there are more, we ignore them (a rough sketch of that cutoff follows below). And I'll give Jeff a chance to speak as well here, if there's anything you want to add.

Yeah. One of the big challenges in metadata is this affiliation stuff, especially when you get down to the center and department level. We're right now applying a set of heuristics that is giving us quite good results, and they're fairly simple overall, but that set of rules that generates the confidence is based on heuristics. You can take that further, but as Rick said, it's also a question of diminishing returns. What is the benefit of having some brilliant AI sweeping through things when we can really apply a set of fairly simple heuristics and human insight?
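For reference, here is a hedged sketch of the Crossref cutoff Rick described, along with the DOI metadata lookup used to normalize records regardless of their source. It uses the public Crossref REST API; the thousand-result threshold comes from the talk, but the function shapes, paging, and query tuning are assumptions made for the example.

```python
# Hedged sketch against the public Crossref REST API; the harvester's actual
# query tuning, headers, and paging are not shown in the talk.
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"
MAX_ACCEPTABLE_RESULTS = 1000   # the cutoff mentioned above

def harvest_crossref(author_name: str) -> list[dict]:
    """Query Crossref for an author, skipping the source when the
    result set is too large to be worth sifting through."""
    resp = requests.get(
        CROSSREF_WORKS,
        params={"query.author": author_name, "rows": 100},
        timeout=30,
    )
    resp.raise_for_status()
    message = resp.json()["message"]

    if message["total-results"] >= MAX_ACCEPTABLE_RESULTS:
        return []               # diminishing returns: rely on other sources for this author

    return message["items"]     # first page only in this sketch; real code would page through

def doi_metadata(doi: str) -> dict:
    """Normalization step: whatever source a record came from, look up its
    DOI on Crossref so the merged record has consistent metadata."""
    resp = requests.get(f"{CROSSREF_WORKS}/{doi}", timeout=30)
    resp.raise_for_status()
    return resp.json()["message"]
```

In the workflow this limit sits at harvest time, before any confidence scoring or human review.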
As far as the sources go, we're going to have different sources with different benefits, like Rick was saying about Semantic Scholar and preprints and conference papers. So what we're thinking about is sort of a modular way to apply heuristics, or apply learning if that's what it takes, to the different sources to get those results, and in the end passing that to the human and then learning from it, figuring out whether we need that much complexity on the front end or whether we could just have these thresholds and cutoffs.

I think one last thing to mention while we have Jeff on: Jeff and I worked on the SHARE project in the past, as some of you may be aware, and a lot of our history with that project has informed how we've approached various things within this project. SHARE was approaching things from a very broad perspective, and this was a different approach, focusing first on a local problem rather than trying to tackle the large one first. So it's been interesting being part of both of those types of experiences.

I think it's 11; people may want to come in for the next session here, so maybe one last question. We'd definitely love to show you more about how the tool works, so come talk to us. Yes, and we definitely want to hear about collaborations; we're really keen on finding out what more we could do with others.