Hi there, my name is Paul Albert, and so begins my presentation. This presentation is titled "Using Data from NIH's iCite to Dynamically Provide Bibliometric-Based Decision Support." I would like to give some credit to my colleague, Sarbajit, who has done a lot of the development work here.

There is a lot of work in the area of bibliometrics, but not a lot of it is truly useful. I would argue that something I am calling bibliometric-based decision support is an example of a case where bibliometrics can be useful. Here are some examples. Administrators decide whether faculty should be promoted, get tenure, or have changes in space allocation. At Cornell, we are periodically asked to help evaluate prospective department chairs. Faculty need to identify their most influential work for applications to funding agencies or to Homeland Security; we have had researchers come to us and ask for their most influential work to submit for green card renewals. Different offices are asked to nominate faculty and students for awards. This is the age of equity, so we periodically have analyses of institutional practices and promotion rates for women and underrepresented minorities. And we in the library are also asked to provide language to the Committee of Review summarizing researchers' impact.

Here are a couple of examples of cases where we have provided such language. What we see here is problematic for two reasons. One is that it took a long time to produce: four to five hours for each one. Second, it does not really say that much. If you look at the first blurb as an example, it lists off some of the publications and some of the journals, which are impressive, I suppose, and then it says the person's overall research citation impact is remarkable, in the top 2% of her field. But how do you define "field"? How do we know? What is the benchmark? Overall, it is kind of weak, and we felt we could do better.

All the use cases I provided can really be distilled down into a few essential bibliometric questions, bearing in mind that the ability to answer these depends on having data about who wrote what. First, for a given person, which of their publications have been most influential, or are likely to be most influential? Second, how does the influence of a person's collective output compare to peers, both at the institution and within the larger biomedical research community? And third, what is a person's career trajectory?

NIH has released a metric called the Relative Citation Ratio, or RCR, through its iCite service. They have validated it, they endorse it, and they have made it available via an API. It counts up citations for a given article and uses that as a proxy to approximate the influence of the article. The citation counts come from various sources, including Crossref. We have done analyses comparing the citation counts collected in iCite against Scopus, and they are within a couple of percentage points, which made us feel good.

Let's talk about what RCR is. RCR is the ratio between the number of times an article was cited and the citation rate of publications with the same year of publication and field, where the field is defined by something called a co-citation network.
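To make the ratio concrete, here is a deliberately simplified sketch in Python. The numbers are made up, and the real iCite methodology derives the expected rate from the article's co-citation network benchmarked against R01-funded papers, so treat this only as an illustration of the ratio itself. More on the field definition in a moment.

```python
# Simplified illustration of the Relative Citation Ratio (RCR).
# NOTE: illustrative only; iCite's actual method derives the expected
# citation rate from the article's co-citation network, benchmarked
# against NIH R01-funded papers.

def simple_rcr(article_citations_per_year: float,
               expected_citations_per_year: float) -> float:
    """RCR ~= article citation rate / field-expected citation rate."""
    return article_citations_per_year / expected_citations_per_year

# Hypothetical article: 30 citations/year in a field whose benchmark
# papers average 15 citations/year.
print(simple_rcr(30.0, 15.0))  # 2.0 -> cited twice as often as the benchmark
```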
A co-citation network is the group of articles that have been cited by the same other articles; you basically create these clusters and then infer a field that way. A value of 1.0 for RCR is considered the median, or 50th percentile, and the benchmark is research articles that are the product of R01 grants. For those of you who don't know, that is NIH's most prestigious funding mechanism. Once you have an RCR, iCite computes an NIH percentile on a scale of 1 to 100, in which 100 is the strongest.

The iCite service also offers an Approximate Potential to Translate (APT) score, for which I have not really found a use case as of yet, and the same goes for the high-level category in which articles can be filed. This last field, though, the list of citing and cited-by PMIDs, is useful in my experience. There are cases of potential researcher malfeasance where, in order to be considered as a possible case, the articles need to have been cited within a certain time period. Having this item-level citation data, that is, which publication cited which, is super useful for helping our colleagues in compliance make these determinations. Here are a couple of papers about RCR, which you can look into if you wish.

Now I am going to do a quick demonstration of iCite. I know a lot of people already know what it is, but maybe some people don't. Here is the URL. You can input an author name or a list of PMIDs. I chose somebody who has a relatively unique name, so it is basically disambiguated for me; for somebody with a more common name, I could just input a list of PMIDs. I am going to click Process. This web interface gives you some info: a bar graph of pubs per year, something called weighted RCR per year, and a table, which you can export as a spreadsheet. You can also limit to only research articles. Let's sort by RCR in descending order. Here we have what could be said to be one of Lewis Cantley's top articles, and it has an NIH percentile of 100.0, which is the highest possible. This other one just came out recently, so even though it has a high RCR, an NIH percentile has not yet been computed, because they wait, I think, two years.

This data is also available through a publicly available API. You supply PMIDs as a parameter, and it outputs all this data. What we do at Cornell is we have a Python script that uses this API and updates our data on a daily basis. All this data is loaded into a reporting database, which I will talk about in just a moment.

Before I proceed, though, one question that comes up in my experience regarding RCR is: what if a paper is really bad, and it gets cited a lot just for being bad? So I chose everyone's favorite whipping boy, Andrew Wakefield, who spread a lot of vaccine misinformation and had a couple of papers retracted. As you can see here, he has two papers that are relatively strong according to the RCR approach, but they don't really have anything to do with vaccines. So I think this is a case where the RCR approach is not really undermined that much by the work of Andrew Wakefield.

All right, so now I am going to do a little demonstration of a bibliometric report generator.
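For anyone who wants to try the API, here is a minimal sketch in the spirit of the daily-update script mentioned above; this is not Cornell's production script, and you should check the iCite API documentation for current field names and request limits.

```python
# Minimal sketch: pull RCR data from the public iCite API.
# Not Cornell's production script; verify field names and limits
# against the iCite API documentation.
import requests

ICITE_URL = "https://icite.od.nih.gov/api/pubs"

def fetch_icite(pmids):
    """Fetch iCite records for a list of PMIDs (roughly 1,000 max per call)."""
    resp = requests.get(ICITE_URL, params={"pmids": ",".join(map(str, pmids))})
    resp.raise_for_status()
    return resp.json()["data"]

# Example PMID; substitute your own.
for rec in fetch_icite([26001965]):
    print(rec["pmid"],
          rec.get("relative_citation_ratio"),
          rec.get("nih_percentile"),
          rec.get("citation_count"))
```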
So this is a tool whose goal is to address a lot of the use cases I provided at the outset. The system is not done, so it is not yet in production. What I will do is show you how it works, and hopefully that will give you a good enough idea of what is going on.

We have a stored procedure, and the stored procedure expects a person identifier as input. Here is our person identifier. I invoke it, and it outputs this. This is RTF data; I copy all of it into a text file, and then I open it in Microsoft Word. So this is an automatically generated bibliometric report of Harold Varmus's output. There is a summary at the top, but I am going to skip down for a moment to the influence section.

The influence section displays item-level data. These are individual articles, and they are all research articles as defined by NIH; if one were a review, it would not be included here. I am breaking the articles into two categories: articles in which Harold Varmus is first or last author, and articles in which he has any author position. Now, this authorship data is not available from NIH. It comes from a tool called ReCiter, which is what we use for disambiguating our faculty and other scholars of interest. ReCiter has a heuristic: it goes through all the authors on a paper, identifies the correct one, and that inference is used to populate the report.

Okay, so what do we have here? We have a paper from 2015 with a relative citation ratio of 117, and when you map that against articles in the same field and the same year, the NIH percentile is 100.0. So this is maxed out. This person happens to be a Nobel laureate; I chose somebody who is fairly accomplished.

What we can do is use that individual item-level data to make summary judgments about his output. At Cornell, we care a lot about senior-author rank. ReCiter infers the senior author, and there is a feature in the system that allows you to override the inferred author rank; I won't go into that except to say that it exists, because there are co-senior authors all the time. What we are doing here is taking the top five and top ten most influential senior-author research articles by NIH percentile and averaging them, which gives us 99.9 and 99.3. That is a comparison of Harold Varmus against the larger biomedical research community.

Then we compare these statistics to Dr. Varmus's peers. There are 291 full-time faculty at the full professor level, which Varmus is, with five or more senior-author research articles, and approximately 40 fewer who have written ten or more. Within these groups at Cornell, Varmus ranks at number three and number three. This language is, I think, a cut above what we were previously providing to the Committees of Review, and it allows us to rank people in a predictable and consistent way.

So let's see what else this report offers. I am going to scroll down here. In addition to the iCite integration, we have an integration with Altmetric, and Altmetric provides things like the count of Mendeley readers.
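The top-five/top-ten averaging step is straightforward once you have item-level data. Here is a minimal sketch; the field names (nih_percentile, author_position, is_research_article) are illustrative assumptions, not ReCiter's actual schema.

```python
# Sketch of the summary statistic described above: the average NIH
# percentile of a person's top-n senior-author research articles.
# Field names are illustrative, not ReCiter's actual schema.

def top_n_avg_percentile(articles, n):
    """Mean NIH percentile of the n strongest senior-author research articles."""
    pct = sorted(
        (a["nih_percentile"] for a in articles
         if a.get("is_research_article")
         and a.get("author_position") == "last"   # inferred senior author
         and a.get("nih_percentile") is not None),
        reverse=True,
    )
    return sum(pct[:n]) / n if len(pct) >= n else None  # needs at least n articles

articles = [
    {"nih_percentile": 100.0, "author_position": "last",  "is_research_article": True},
    {"nih_percentile": 99.8,  "author_position": "last",  "is_research_article": True},
    {"nih_percentile": 97.5,  "author_position": "first", "is_research_article": True},
]
print(top_n_avg_percentile(articles, 2))  # 99.9
```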
And there is some published research supporting the notion that the number of Mendeley readers for a given article is a leading indicator of how many times that article will be cited. This article has been saved 46 times, and the citation count is only one. That is pretty low, right? So the citation count is probably going to increase.

Older research articles are included as well, because those fall outside the scope of RCR: RCR is not computed for articles published before 1980, so for those we show just a citation count.

Then here are some additional statistics. You have something called the h-index. Many of you know this, but it is the largest number h such that h of a person's articles have each been cited at least h times. The h5-index is the same calculation restricted to articles written in the last five years.

Then you have counts of all these different article types, and links to the full text, even where the link is not blue. Because all of this is complicated, there is an explanation section: any time someone has a question about what NIH percentile is, or RCR, or author position, or h-index, it is all listed here.

Okay, so that is the demonstration of the bibliometric report generator. As you saw, all that person-level information is output. We also have institution-level statistics, so you can go from the person to the institution and identify the people who are most impactful, as well as the individual research articles that are most impactful. The latter can be used for nominating certain publications for awards.

Now, all the data we have regarding who wrote what comes from a system called ReCiter. It is an open-source, home-grown author disambiguation system that uses something called identity-driven author disambiguation. The way ReCiter works is that we take data from identity systems and create a profile for a person. We use that profile to retrieve candidate articles from bibliographic sources. Then, with that data, we estimate the likelihood, using machine learning, that a given scholar wrote each article. Because we generally have a lot of features and signals to rely upon, we can exclude a large number of unlikely candidates. In step four, librarians curate the most likely articles for the scholars in question; this takes about 70 minutes a week to curate the publications for 5,000 or more scholars. In the last step, we use web services to expose the data to downstream systems.

This is a diagram of how all the systems talk to one another, basically what I showed before. The data from the source systems feeds the ReCiter disambiguation application, and the disambiguation application gets its publication data from PubMed via the PubMed API. ReCiter then outputs its data to the reporting database, shown in fuchsia. Data is also imported from Altmetric and iCite. Over in the bottom left, you have Publication Manager, where curation happens, where reporting happens, and where there is going to be a button for the Bibliometric Report Generator. And it is going to look like this.
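Because the h-index definition is easy to garble in speech, here is the standard computation in Python; this is a generic implementation, not code from the report generator.

```python
# h-index: the largest h such that h articles each have >= h citations.
def h_index(citation_counts):
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i          # at least i articles have >= i citations
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four articles cited at least 4 times each
```

For the h5-index, you would filter the input to articles from the last five years and call the same function.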
There is going to be a big yellow button, and the user will be able to output an RTF or Word document file and then use it as they need.

All right, in terms of follow-up and next steps: we already have scripts to import data from iCite and Altmetric, and you are welcome to download and use them. Here is a link to ReCiter, the open-source disambiguation system I described. As for the Bibliometric Report Generator, we anticipate completing development in spring 2022, and we will release it as open source. If you want to contact me to be notified upon release, feel free to do so, and if you want to contact me for any other reason, you may do so as well. And that concludes my talk. Thank you very much.