Hello, this is Paul Albert, and I'm joined by my colleague, Sarbajit Dutta. Today we're going to talk about ReCiter, which is an open source publication management system for academic institutions, especially medical institutions. Why did we build ReCiter? Well, we started with use cases that we see here at Weill Cornell. Various parties have various needs, and I'm going to run through them here; I imagine they're pretty consistent with the experience of other institutions. Departmental administrators don't want to have to track publications down, and they don't want to have to email their faculty and other scholars. External affairs wants to know which publications to promote. Department chairs want analytics, analytics that account for author rank, and they want to use them to make decisions including appointment renewals, promotions, and year-end bonuses. Deans have analytic needs too, but slightly different ones: they want to know whom to recommend for awards, who deserves tenure, whether promotions are being done equitably, and whether the institution's strategic direction is bearing fruit. Finally, faculty largely want to be left alone. They have tons of work to do, and they just don't want to be bothered.

Now, many of you might be wondering why we have a separate system when there's a perfectly good one in ORCID. Let me answer that question. First, a brief description of ORCID. ORCID is a system that mints identifiers for scholars, and these identifiers are tied to the scholars' publications. At the time of publication, a scholar submits their manuscript and provides their ORCID identifier. The metadata associated with that article is supplied to an aggregator like Crossref, which sends the data back to ORCID, and ORCID then syndicates the data to downstream systems.

Okay, so in what ways does this system fall short of the use cases I just mentioned? First, ORCID misses a lot of older publications. These are our three most prolific scholars at Weill Cornell, and they're missing some thousands of publications. In fact, if you look over the past five years, fewer than 6% of Weill Cornell papers have an ORCID asserted for even one author. And that's not to say those papers have ORCIDs for all their authors: you can be compliant with funders' requirements to use ORCID and have only a subset of authors with ORCID identifiers. Significant delays are a problem: this paper appeared in PubMed in July 2019 and only made its way to ORCID in February of 2020. We also have a situation in which scholars can effectively hide their profiles. If you're a scholar, you can go into ORCID and say, I don't want anyone to see my profile, which undermines the whole point of the system. There are duplicate profiles; these are Weill Cornell faculty who have multiple ORCID profiles. The identity data is sometimes weak; here are a couple of examples of that. ORCID says that in cases where the data is imperfect, a scholar can assign an institutional representative to serve as a trusted party. That makes sense in theory, even though it's a lot of work. However, there are some problems, and one of the big ones is that there's a large group of people who cannot or will not give you permission to serve as their trusted party. Here's a list of them, and one not-insignificant group is people who are deceased. Here's a count of publications by year by Weill Cornell-affiliated individuals who are deceased, and the true number is probably much higher than this.
Another problem is that ORCID does not have a concept of "rejected." Let me explain why this is useful. We have a scholar by the name of Hong Ding, and this is a relatively common name. Here's an article that may have been written by Hong Ding; it came out in 2019. If Hong Ding did not write this article, how are we expected to remember that? Are we supposed to just keep that in mind every single time? That's a problem. We need systems to keep track of not only the articles someone wrote, but also the articles someone did not write. ORCID also does not have a concept of inferred rank for an author; it just lists the contributors in order. So we can't necessarily use this system for the types of publication reports that our chairs and deans are interested in. And finally, I was concerned that we only had nine problems, so here's the tenth: I do not care for the shade of green.

Okay, in sum, ORCID requires me to accept the following. First, we're adding to scholars' and administrators' administrative burden. We need to dedicate some significant portion of an FTE to do the following: convince scholars to let us serve as their trusted party and convince them not to hide their profiles, independently maintain ORCID-to-institutional-identifier relationships, claim publications for the 98% of publications that fall through the cracks, and accept that there's going to be a delay when these publications reach ORCID. We can't capture author rank. We can't capture publications by the deceased, unresponsive alumni, scholars with no email on file, and the 10% of people who hide their profiles. We can't infer whether the absence of feedback on an article means it's rejected. And because of all the above, we have to buy or build and deploy a shadow publication management system anyway.

Okay, let's talk a little about what a more effective publication management system would look like. First, it absolutely needs to be controlled by institutions, such that there's no need to get explicit permission from scholars in order to curate their publication lists. Second, it needs to have a user interface that allows third parties to quickly review possible articles. Third, the output should be suggestions for each scholar, scored by likelihood and excluding unlikely possibilities. Fourth, the system should suggest articles for missing or prior affiliations: a lot of scholars fail to include their correct affiliation, we've noticed, but people also publish before they arrive at our institution, and those publications are factored into decisions regarding promotion and tenure. To just throw up our hands and say, sorry, no one told us about that, I think that's not acceptable. Fifth, we need to be able to run it daily for over 10,000 people, or however many people you think it's important to track. And finally, it needs to output data through a suite of web services to all the downstream systems.

ReCiter is the system we built to satisfy those goals. Let's talk about how ReCiter works. ReCiter works by taking data from institutional identity systems and creating a scholar's identity. From there, we construct queries to retrieve candidate articles from PubMed and, optionally, Scopus; using Scopus improves accuracy by a couple of percentage points in our tests, but it's by no means required. Then we use a set of machine learning techniques to estimate the likelihood that a given scholar wrote a given article. In step four, we collect feedback from users, typically librarians and other administrators, as to whether a given article was written by a given scholar. Finally, we use web services to expose these publications to downstream systems.

So let's talk about the first step: using institutional systems to create a scholarly identity. We look to librarians as a kind of model for how we go about doing disambiguation, and librarians need data about the scholar in order to do a good job. What kind of information is needed? It turns out a lot of it is available within institutions themselves: people's institutional affiliations, the years when they got their degrees, especially the terminal degree, their grant numbers, their name aliases or variants (we've had some people with up to four or five different names), their departmental affiliations, their colleagues and relationships who can show up as co-authors, and their email addresses.

Here's some sample data for a Weill Cornell scholar by the name of Jochen Buck. There's a ton of information about him, and all of it can be leveraged to make more accurate predictions about which articles he has written. We take all this identity data, put it into a standard JSON object, and use one of the APIs that ReCiter offers to load that data into the ReCiter system.
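To make that concrete, here's a minimal sketch of what such an identity payload might look like when posted to a local ReCiter instance. The endpoint path, field names, and all values below are illustrative assumptions rather than ReCiter's documented schema; the project's Swagger documentation describes the real identity API.

```python
import requests

# Hypothetical identity payload; field names are illustrative, not
# ReCiter's actual schema. The values mirror the kinds of institutional
# data discussed above: name variants, department, terminal degree year,
# grants, and known email addresses. All values here are made up.
identity = {
    "uid": "jdb2002",                                  # institutional person identifier
    "primaryName": {"firstName": "Jochen", "lastName": "Buck"},
    "alternateNames": [{"firstName": "J", "lastName": "Buck"}],
    "primaryOrganizationalUnit": "Pharmacology",
    "doctoralYear": 1991,
    "grants": ["R01XX012345"],
    "emails": ["jdbuck@example.edu"],
    "knownRelationships": [{"name": "A. Colleague", "type": "co-investigator"}],
}

# The endpoint is an assumption for illustration; a real deployment
# would use the identity endpoint listed in ReCiter's Swagger UI.
resp = requests.post("http://localhost:5000/reciter/save/identity/", json=identity)
resp.raise_for_status()
```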
All right, next step: we retrieve candidate articles from PubMed and, optionally, Scopus. As I mentioned, Scopus only improved accuracy in our tests by a couple of percentage points, so it's by no means required.

The exact queries used to retrieve candidate articles depend on a couple of factors. The first is how common the name is. If a person has a relatively uncommon name, like Albert B, we'll simply look up that name, along with known email addresses. If the name is more common, like Yi Wang, so Wang Y, we'll look up that name only in conjunction with additional metadata such as known grants, departments, et cetera. If the person has multiple names on record, say because they got married or divorced, we'll look up both names independently. We'll also look up compound names, such as where somebody has a hyphenated last name or a multi-part Arabic or Spanish-style surname.

Here's a look at one of the APIs. This is part of the ReCiter PubMed Retrieval Tool. It's a standalone application, and you can interact with it through a series of web services. ReCiter's APIs are documented with Swagger, which lets you get up to speed quickly and understand how an API works and what parameters you can use, without much background. Here's what it looks like: you input your query here, you decide which fields you want, and in this case it takes the XML from PubMed and outputs it as JSON. And here's something similar for Scopus. This, too, is a separate standalone application; you put data in, in this case the identifier you retrieved from PubMed, and it returns JSON as well.

We've tried our best, with these different applications and with ReCiter as a whole, to insulate developers from quirks in the source systems. With PubMed, we transform the data from XML to JSON, and we allow you to input an API key that lets you retrieve ten records per second instead of the standard three. With Scopus, we delete the duplicate authors that are sometimes output by the Scopus API. We've added some nice touches: we take dates and convert them into a standard sortable format, and we infer a single canonical publication type. I mentioned before that it's important to know an author's rank: is somebody a first author, a last author, somewhere in between? ReCiter will attempt to infer the target author's rank. In addition, ReCiter can be configured to only look up publications that have been added to the source systems since the last time it ran.
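Here's a rough sketch of the query-construction logic just described. The function, thresholds, and the assumption that emails and departments can be matched via the affiliation field are mine for illustration; ReCiter's actual implementation lives in its Java codebase.

```python
# A minimal sketch of the name-based query strategies described above,
# assuming an identity dict like the earlier example. PubMed field tags
# such as [au] (author), [gr] (grant number), and [ad] (affiliation)
# are standard PubMed search syntax.

def build_pubmed_queries(identity: dict, name_is_common: bool) -> list[str]:
    queries = []
    # Look up every name variant independently (e.g., maiden and married names).
    names = [identity["primaryName"], *identity.get("alternateNames", [])]
    for name in names:
        term = f'{name["lastName"]} {name["firstName"][0]}[au]'
        if not name_is_common:
            # Uncommon name: the bare name lookup, plus known email addresses.
            emails = " OR ".join(f"{e}[ad]" for e in identity.get("emails", []))
            queries.append(f"({term})" + (f" OR ({emails})" if emails else ""))
        else:
            # Common name (e.g., Wang Y): require additional metadata such as
            # known grant numbers or departmental affiliation to limit noise.
            grants = " OR ".join(f"{g}[gr]" for g in identity.get("grants", []))
            dept = identity.get("primaryOrganizationalUnit", "")
            qualifiers = " OR ".join(
                q for q in [grants, f"{dept}[ad]" if dept else ""] if q
            )
            if qualifiers:
                queries.append(f"({term}) AND ({qualifiers})")
    return queries

# e.g., build_pubmed_queries(identity, name_is_common=True) might yield:
# ['(Buck J[au]) AND (R01XX012345[gr] OR Pharmacology[ad])']
```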
Okay, so now we have this pool of candidate records. Let's use machine learning to figure out how likely it is that any one publication was written by our scholar of interest. We believe that institutions should not pay their busiest people to do busy work; our future Nobel laureates should not be reviewing their own publication lists, it should be someone else. Now, if it's going to be someone else, we need to provide contextual clues and evidence as to why a particular article has been suggested, because no system is 100% perfect. There are going to be some errors, and there are going to be some ambiguous cases.

Here's an example of a candidate record for the scholar Curtis Cole. We have data from our identity object, data from the article, and then a score for how closely the two match, in the final column there. For example, our identity record for Curtis Cole has his full name, and we can compare that, in the second row, to the name on the article. That's a pretty good match, so it gets 3.31 points. Our scholar of interest is listed on a grant with a couple of individuals, and those people appear as co-authors, so that's more evidence. There's a grant identifier that matches. Curtis Cole has an appointment in Population Health Sciences, and we've done some analysis and determined that people in that department are far more likely, by an odds ratio of actually more than 1,000 to 1, to publish in the journal category of medical informatics, so a couple more points for that. We also look at the target author's institutional affiliation, co-authors' institutional affiliations, departmental affiliation, and the discrepancy between the year an article was published and the year of the doctoral degree; obviously, if the publication came out well before the year of the doctoral degree, it's less likely that our scholar of interest wrote it.

ReCiter also uses clustering: we group similar articles together, look at the average score of that cluster, compare it to a given article, and use that to adjust the article's overall score. We also look at candidate article count. This is a kind of Bayesian fudge factor in which the likelihood that any one article was written by this scholar decreases as the number of candidate articles increases, and vice versa. Finally, we look at the inferred gender of the name: using data from the Social Security Administration, we guess somebody's gender from their name and compare the identity side against the article metadata itself.

Okay, so we take all these different types of evidence, add them up, and come up with a raw score. In this case, the raw score is 16.62, and this gets mapped to a standardized score of 10. Our standardized scale runs from 1 to 10, where 10 is highly likely and 1 is not so likely.

Now, looking at Jochen Buck again, I can show you that it's actually a relatively small number of articles that are truly ambiguous. For most, it's pretty clear that Jochen Buck almost certainly wrote them or almost certainly did not. The reason I can make this determination is that we've taken all those raw scores, computed a percentile for each, and looked at how often articles at a given percentile were accepted versus rejected. If you look at that, the 16.62 raw score lands at a percentile around 0.93 or 0.94, so it's highly likely that article will be accepted. And as the raw score decreases, say from a percentile of 0.9 down to 0.86 or 0.85, the likelihood that the article will be accepted declines by about 80%. So, as I hope you can see, there's actually a very small window where the likelihood that an article will be accepted changes dramatically; only around 5% of articles are genuinely ambiguous. Articles on the left of the curve, those with percentiles below roughly 0.85, we simply discard. They're so unlikely that they don't even need to be shown to the user, and it's this design decision that lets reviewers move quickly through candidate publications for a variety of individuals.
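Here's a toy sketch of that aggregate-then-standardize step. The feature names, weights, historical distribution, and cutoff below are invented for illustration and are not ReCiter's actual values; only the overall shape (sum evidence, convert to a percentile, discard below a threshold, map to a 1-10 scale) follows the description above.

```python
from bisect import bisect_left

# Invented evidence scores for one candidate article, echoing the kinds
# of features discussed above (name match, co-authors on a shared grant,
# grant identifier, department-to-journal-category affinity, etc.).
evidence = {
    "nameMatch": 3.31,
    "coauthorsOnSharedGrant": 4.20,
    "grantIdentifierMatch": 5.00,
    "departmentJournalAffinity": 2.75,
    "affiliationMatch": 1.36,
}
raw_score = sum(evidence.values())  # 16.62

# A sorted sample of historical raw scores lets us turn a raw score into
# a percentile; these numbers are made up.
historical_scores = sorted([1.1, 2.3, 4.0, 5.2, 7.7, 9.8, 12.4, 14.0, 15.1, 16.9])
percentile = bisect_left(historical_scores, raw_score) / len(historical_scores)

DISCARD_BELOW = 0.85  # candidates under this percentile are never shown
if percentile >= DISCARD_BELOW:
    standardized = max(1, round(percentile * 10))  # 1-10 scale, 10 = highly likely
    print(f"suggest article (raw={raw_score:.2f}, score={standardized})")
```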
All right, moving on. Let's talk a little about how we collect feedback. ReCiter has a standalone application called Publication Manager, and I'm going to give you a quick demonstration of it. Here's Publication Manager, where we've looked up the profile for Jochen Buck. Publications can go into one of three categories: suggested publications, meaning they haven't received any feedback; accepted; or rejected. So the user can look at a suggestion and ask, did Jochen Buck write this? Let's see. I'll click on Show Evidence, and this is similar to the table I showed you before: data from the institution, data from the article. They can quickly review this and say, yep, and here's another one, yep, and then click refresh to get new suggestions. This is a completely stateless application; none of these data are actually stored in the application itself, only upstream in the ReCiter database.

And this is our white whale, as it were. If you look up Wang Y in PubMed, you'll get probably 180,000 publications by now, but ReCiter does a pretty good job with him because we use those stricter search criteria, so we're not introducing too many false possibilities. You can see there's a variety of data that can be used to make the decision, and in fact ReCiter performs fairly well even in cases where publications are only available for some prior affiliation.

Here's a director of the library; I just wanted to illustrate that last piece of evidence we looked at before. ReCiter infers that the gender of this individual is 99% likely female. Even though the author name on the article has what's called a Levenshtein distance of one or two from hers, so these are somewhat similar names, the fact that that name is inferred male decreases the likelihood that this article was written by her. And in fact, it was not, which is why it's rejected.

Okay, so we've done all this work, we've collected feedback about the scholars, and we have this great data. Now let's use it: let's expose it to downstream systems through a set of RESTful APIs. One of the ways we do this is an API that we call feature generator. With feature generator, you can supply the person identifier; you can supply a threshold, the minimum score for an article; you can say whether to use the gold standard in order to test the system or to improve the quality of the suggestions; you can ask for accepted articles, rejected articles, or all of the above; you can say whether you want to rerun the analysis portion of ReCiter's engine; and you can choose whether to re-retrieve none, all, or only the articles that have been added to the source systems since the last time the system ran. Let's take a look: we input a person identifier, input that threshold, choose to use the gold standard to improve the quality of the scores, and skip the re-analysis, and ReCiter gives you the curl command and the JSON output here. I should mention that the output includes both the article and the evidence behind why a particular article was scored the way it was.
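As a rough illustration, calling such an API from a script might look like the following. The endpoint path, parameter names, and response field names are assumptions pieced together from the options just described, not the documented interface; ReCiter's Swagger UI lists the real ones.

```python
import requests

# Hypothetical call to the feature-generator API; parameter names below
# are illustrative stand-ins for the options described above.
params = {
    "personIdentifier": "jdb2002",              # institutional person id (made up)
    "totalStandardizedArticleScore": 7,         # minimum article score threshold
    "useGoldStandard": "AS_EVIDENCE",           # use feedback to improve suggestions
    "fields": "ALL",                            # accepted, rejected, and suggested
    "analysisRefreshFlag": "false",             # don't rerun the scoring engine
    "retrievalRefreshFlag": "ONLY_NEWLY_ADDED_PUBLICATIONS",
}
resp = requests.get("http://localhost:5000/reciter/article-retrieval/", params=params)
resp.raise_for_status()
for article in resp.json().get("reCiterArticleFeatures", []):
    # Per the description above, each record carries both the article
    # metadata and the evidence behind its score.
    print(article.get("pmid"), article.get("totalArticleScoreStandardized"))
```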
Okay. I think it's true of any data system that there's some critical point at which the quality of the data becomes good enough that you can confidently use it to perform certain tasks. One of these, maybe the most important use case, is using ReCiter to update our scholarly profile system, VIVO. With ReCiter, we keep track of pending publications, so a publication can hit PubMed on Tuesday, we update on Wednesday, and it appears in VIVO on Thursday. That's a pretty tight window for how quickly we can update publications. The code used to do this is available at the URL above, and I should mention that everything in ReCiter is open source.

Reports: we use ReCiter for a ton of different reports, and we've gotten a lot of positive feedback from our administrator colleagues. I'm going to walk you through some of them. The first is our New Pubs report. Because we have this quick turnaround, we're able to provide information to all these different parties and let them know what's happening at the college. At a place where knowledge is revered, it's really important that we're able to share knowledge of which publications have come out in the last 30 days. Some users started saying, I want to know which publications are related to COVID, so in that seventh column we flag cases where a paper was related to COVID.

A recent thing we've released is the Trending Pubs report. With the Trending Pubs report, we take articles that have been accepted and published in the last couple of years, and then use the Altmetric API to return the number of Mendeley readers each publication has. We take the number of Mendeley readers, divide it by the publication's age in days, and come up with a Trending Pubs score. And you'll notice that we're right in the heart of the COVID pandemic, so that is obviously the trending topic.
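A minimal sketch of that computation follows. Altmetric does offer a public per-DOI endpoint, but treat the response fields and this exact scoring as illustrative assumptions rather than the report's actual implementation.

```python
import requests
from datetime import date

def trending_score(doi: str, published: date) -> float:
    # Altmetric's free v1 DOI endpoint; the response structure assumed
    # here (a "readers" object with a "mendeley" count) may differ.
    resp = requests.get(f"https://api.altmetric.com/v1/doi/{doi}")
    resp.raise_for_status()
    readers = int(resp.json().get("readers", {}).get("mendeley", 0))
    age_days = max((date.today() - published).days, 1)  # avoid divide-by-zero
    # Mendeley readers per day of age, per the description above.
    return readers / age_days

# Hypothetical usage with a made-up DOI and publication date:
# print(trending_score("10.1000/example.doi", date(2020, 3, 15)))
```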
In some cases, these trending articles have already been cited a number of times. In other cases, say the third, sixth, and seventh, we see zero times cited, zero times cited, one time cited, but these are papers that are probably eventually going to be cited a lot, just because they've attracted so many Mendeley readers.

We can also compute an h-index. This is just a simple script that loops through the accepted publications for a given person.
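For reference, here's what such a script boils down to, assuming you already have a citation count per accepted publication; the counts below are made up.

```python
def h_index(citation_counts: list[int]) -> int:
    # The h-index is the largest h such that the person has h papers
    # with at least h citations each.
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Made-up citation counts for a scholar's accepted publications:
print(h_index([48, 33, 20, 11, 8, 5, 2, 1]))  # prints 5
```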
And then we can also do citation impact analyses. Using ReCiter tools, we've downloaded roughly 2.4 million NIH-funded papers from PubMed, looked up their citation counts in Scopus, created baselines, computed percentiles for those articles, and then compared our accepted articles to that baseline. The Office of the Research Dean was interested in this statistic for papers on which scholars served as senior authors, and we were able to compute the average percentile score of each scholar's top five performing publications, rank scholars accordingly, and then do the same for the top ten. When we performed this task, the top-ranked person was our one and only Nobel laureate, so it does seem to have face validity, but it's something we want to iterate on and explore more.

All right, now we're going to talk a little about some of the technologies in ReCiter, and for that I'm going to turn it over to my colleague, Sarbajit.

Thank you, Paul. So these are the technologies we've used for ReCiter. As you can see, it has a lot of components. ReCiter follows a microservices principle: it talks to a bunch of different components, and each of the components is independent of the others and can function on its own to do the specific task it was designed for. As you can see here, we get identity data from the different institutional systems we have, such as HR, grants, directory, and faculty affairs, and we aggregate that identity data and feed it into the ReCiter application. ReCiter is a Java Spring Boot application. It's very lightweight, and it exposes different sets of APIs serving different purposes. On the backend, ReCiter uses a NoSQL database, a proprietary Amazon AWS offering called DynamoDB. We chose DynamoDB because of its really fast reads and writes, which give the application really good performance. We also use S3, AWS's file storage service, in cases where the size of a stored item exceeds 400 kilobytes. We have seen cases where a researcher has more than a thousand publications; in those cases, the item gets much bigger, and we use S3 to store it. As a set of microservices, ReCiter talks to the PubMed Retrieval Tool and the Scopus Retrieval Tool, each designed to talk to its upstream source, PubMed and Scopus respectively. And the ReCiter Publication Manager that Paul just demoed is used for accepting or rejecting publications and providing feedback on them. Apart from that, ReCiter also exposes data that can be used by downstream systems; for example, we use it for the VIVO profiles that Paul just showed, and you can use it for different reporting purposes or for department or lab websites and faculty review tools. Next slide, please.
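To illustrate the storage pattern Sarbajit describes, here's a hedged sketch of the DynamoDB-with-S3-overflow idea using boto3. The table name, bucket name, and key layout are invented, and ReCiter's actual persistence layer is written in Java; only the 400-kilobyte item ceiling is DynamoDB's real limit.

```python
import json
import boto3

DYNAMO_ITEM_LIMIT = 400 * 1024  # DynamoDB's 400 KB per-item ceiling

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
table = dynamodb.Table("Analysis")  # table name is hypothetical

def save_feature_document(person_id: str, doc: dict) -> None:
    payload = json.dumps(doc)
    if len(payload.encode("utf-8")) < DYNAMO_ITEM_LIMIT:
        # Small enough: store the document directly in DynamoDB.
        table.put_item(Item={"uid": person_id, "document": payload})
    else:
        # Too large (e.g., a scholar with 1,000+ publications): park the
        # body in S3 and keep only a pointer in DynamoDB.
        key = f"analysis/{person_id}.json"
        s3.put_object(Bucket="reciter-overflow", Key=key, Body=payload)
        table.put_item(Item={"uid": person_id, "s3Key": key})
```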
Now, we've put a lot of thought into making ReCiter easy for anybody to install. Anybody without deep technical knowledge can clone and install ReCiter and have it ready to use on their local system. ReCiter can be installed locally if you want to test it out, and it can also be installed in AWS. For AWS, we use something called CloudFormation, which is essentially a set of instructions that lets you stand ReCiter up in AWS with a really quick turnaround: with this CloudFormation template, you can have ReCiter and its set of dependencies and components running in AWS in practically 20 minutes. Next slide.

Since ReCiter is API-based, it's always good to have documentation for the APIs, so we chose Swagger UI, which gives you a very crisp, user-friendly interface to test your APIs and see which parameters and values you should be using. It gives you a friendly code interface as well, and you can test your APIs really easily. Next slide.

Since ReCiter runs on AWS, it can take advantage of AWS's many managed services. With managed services, you pay your costs and Amazon takes care of the rest. We use some of these in ReCiter as well, such as point-in-time recovery, having instances spawn whenever necessary, whenever significant traffic is going to the application, and software updates. Essentially, maintenance work is offloaded to AWS, so you don't have to worry about any of that with ReCiter in AWS. Next slide.

Now, with a system as big as this, running for almost 15,000 faculty on a daily basis, there's always a concern about cost. We've architected the system in such a way that it minimizes cost without jeopardizing the performance, security, or safety of the application. As you can see, we're currently hovering around $300 to $400 a month for Weill Cornell Medicine, and we run ReCiter daily for almost 10,000 to 12,000 users, so it's a very cost-effective solution. Next slide.

For ReCiter's next steps, we're looking to improve the UI. We're implementing bulk feedback for suggestions, which department admins and division heads can use to review publications very quickly for their respective departments. We're also working on a reporting view, which allows ReCiter to generate different types of custom reports for administrative heads or for compliance purposes. We're looking to integrate multiple bibliographic sources as well: ReCiter currently works with PubMed and can also use Scopus, but we've gone to conferences where we've heard that people want other sources such as Web of Science and WorldCat, so we're looking to include those too. We're also looking to use machine learning for the clustering process, which would further increase accuracy in targeting publications. And we're looking to get a grant to further develop ReCiter and increase its adoption across various organizations. With that, I'll hand it off to Paul. Paul?

Sure. I want to give a shout-out to all the great people who've worked on ReCiter. Thank you, all of you. If you want to learn more, please feel free to contact us. ReCiter is completely open source: you can download the code and use all of it, part of it, whatever you wish. We would be happy to talk to your group or your colleagues if you'd like. So thanks for listening.

Also, I wanted to mention there's a Slack channel for ReCiter as well. We'll include that in the presentation deck, so you can contact us on the Slack channel with any questions or concerns. Thank you.

Good point, thank you.