Thank you for joining us today for our presentation, "Building the Unicorn, or How to Balance Magic and Practicality in Research Information Systems." My name is Cynthia Hudson-Vitale; I'm the director of Scholars and Scholarship at the Association of Research Libraries. Joining me today is Dan Coughlin, head of strategic technologies and interim head of research informatics and publishing at Penn State Libraries.

Today I'm going to quickly cover some of the challenges institutions face in discovering and accessing research information. Dan will then walk through a Penn State case study and the approaches they have found beneficial for addressing some of these challenges. We'll wrap up with some best practices stakeholders can adopt to improve the discoverability of, and relationships between, research assets.

Research information systems are not new to libraries or institutions of higher education. As far back as 2003 in the United States, and even earlier in Europe and the rest of the world, institutions were exploring ways to aggregate information about research conducted on their campuses, whether from large commercial databases or using online tools. OCLC has been leading the exploration of CRIS implementation across academic institutions and has a number of excellent reports out; additional work has been conducted by ACRL; and of course there are many independent researchers who share their experiences and work on research information systems.

The implementation and adoption of these tools has not been without challenges. Specifically, institutions have faced an increased workload for staff who have to implement the systems or clean up metadata before it is loaded into the research information system; the need to develop new skills to implement these platforms; the ongoing engagement with stakeholders required to keep information up to date; difficulties ingesting information into the system, which I think speaks to continuing challenges with metadata interoperability and quality; and, finally, difficulty in creating a comprehensive research information management environment. As we know, there are many options out there for aggregating information about research conducted on your campus, and that comprehensiveness is a challenge to achieve and assess.

With that, I'm going to turn it over to Dan to talk a little more about how he's addressing these issues at Penn State.

Thank you, Cynthia, for the introduction. I'm Dan Coughlin, the IT department head at Penn State Libraries, and I'm going to discuss the technology we developed at Penn State that we refer to as "the unicorn," though it does have a more formal name at this point: the Researcher Metadata Database, or RMD.

Years ago, around 2013, I worked on a project to develop tools to analyze journal usage at our libraries. Without going too far into the depths of difficulty that existed back then in doing broad-level analysis on journal usage, I'll say that one of the salient lessons from that exercise was that the effort to organize the data, clean the data, and produce even the most basic statistics was rather large. Doing any real data analysis on that project ended up being very difficult due to time constraints.
Some of those processes were also so specific to that project that they could not be repeated; the data lived on spreadsheets somewhere, and we couldn't really query it for anything other than the information we had already produced.

When we initially started this project, it was about discovering the research outputs produced on our campus: what articles did Penn State publish this year or last year, which authors published which papers in which journals. There's a bit of a parallel to that earlier project in that we're trying to measure journal usage, but largely at a different phase of the research lifecycle. The earlier project sat at the phase where usage is likely made up of literature reviews, reading articles, and downloads that support your research; in this case, we're at the end of the research lifecycle.

The real similarity, though, is in the challenges this problem presented: competing data sources, complex proprietary data structures, and silos of information. Our research outputs exist in several systems. For one, they exist in the journals they're published in, and there are thousands of those; we didn't want to query every single journal out there. We also had systems that were already doing pieces of this, so we wanted to leverage that.

The reason I talk about that complexity of the data sources, and compare it to the project on measuring journal usage, is that on that project I had one year of journal usage data, from 2012, that I had worked with, and people thought it would be great if they could use the tool I had built. However, I was not able to leverage it for any additional or annual data imports, or to make it something that could be maintained or updated to keep the lights on, so to speak.

So our first goal for our unicorn, for the Researcher Metadata Database, was to be better than that. We wanted this application to be able to repeat its process of importing data on an annual basis, and to let us work on data analysis more than data aggregation. Frequently when we talk about RMD this is something we overlook, but I do think it's a critical aspect; perhaps I place a bit more importance on it because I was bitten by not doing it previously. Then there were the other goals. We wanted to aggregate data from enough sources that we felt confident using it as the data backbone of our open access workflow. We wanted to provide access to the data via an API, which could help units on campus with faculty profiles or institute and department websites. Third, we wanted to generate administrative reports for common questions asked by other departments or administrators. And, as I mentioned, we wanted to be able to repeat the process of aggregating that data so we could do this every year. (A sketch of what that repeatable pipeline can look like follows below.)

I'm going to talk a little bit about each of these goals and what we did for these use cases. One of the initial questions people would bring up about powering our open access workflow was: how are you going to get the research that you don't know about? And, not to sound glib about it, but we felt that a significant amount of the research we were interested in for open access already existed in other systems, and if we could at least initially focus on that content, then we were at a good starting point.
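Before getting into the individual sources: to make the "repeatable" goal concrete, here is a minimal sketch, in Python, of how an aggregation pipeline like this can be structured so that each import run is idempotent. Every record is upserted by a stable key (source system plus the record's ID in that source), so the job can run nightly or annually without duplicating rows. The table layout and source names are illustrative assumptions, not RMD's actual schema.

```python
import sqlite3

# Illustrative schema: one row per (source, source_id); re-imports update in place.
DDL = """
CREATE TABLE IF NOT EXISTS publications (
    source     TEXT NOT NULL,   -- e.g. 'digital_measures', 'pure'
    source_id  TEXT NOT NULL,   -- the record's ID in that source system
    doi        TEXT,
    title      TEXT,
    year       INTEGER,
    PRIMARY KEY (source, source_id)
);
"""

def upsert(conn, source, records):
    """Insert new records and update existing ones; safe to re-run any time."""
    conn.executemany(
        """INSERT INTO publications (source, source_id, doi, title, year)
           VALUES (:source, :source_id, :doi, :title, :year)
           ON CONFLICT (source, source_id) DO UPDATE SET
               doi = excluded.doi, title = excluded.title, year = excluded.year""",
        [dict(r, source=source) for r in records],
    )

if __name__ == "__main__":
    conn = sqlite3.connect("rmd_sketch.db")
    conn.execute(DDL)
    # In a real pipeline these records would come from the source systems' APIs.
    sample = [{"source_id": "42", "doi": "10.1000/example", "title": "A Paper", "year": 2020}]
    upsert(conn, "digital_measures", sample)
    upsert(conn, "digital_measures", sample)  # second run updates in place, no duplicate
    conn.commit()
```

The point of the design is that "import" becomes a repeatable operation rather than a one-off spreadsheet exercise, which was exactly the lesson of the 2013 journal usage project.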
So, what were some of those data sources? Digital Measures, which has had many names at Penn State, is a vended software package that we use for faculty activity reporting. Our faculty input data into it; we automate that when possible, but they input data, and Digital Measures is able to create a university dossier. So Digital Measures has data in it, it's part of our faculty's process, albeit reluctantly, and it's a good source of data on research outputs. RMD queries Digital Measures every night for updates. An important piece of this for a school like Arts and Architecture at Penn State is having performances in there: we wanted to make sure we captured all research outputs, not only the more traditional journal articles, and using Digital Measures enabled us to do that.

Pure is another vended product, this one from Elsevier, and it's our expert database on campus. Pure allows you to search for a specific researcher in its interface, and it's an excellent source of data for publications. We query this nightly for research outputs to add to RMD as well.

We have our electronic thesis and dissertation (ETD) system, which is a product we developed ourselves. All of our graduate students use it to submit their master's theses or PhD dissertations. We query this data source so we can display faculty committees on profile pages: if a graduate student is looking at a faculty profile page, they can see the work that faculty member's graduate students have done, and they may be able to get a sense of the work they would be doing if that faculty member were their advisor.

We have news.psu.edu, where Penn State publishes stories on their faculty to promote their work. We query this system for any stories on our faculty so we can link out to those stories from their profile pages.

Web of Science is where we pull additional publication information. Those publications include grant numbers, which was really interesting to us, because at some point we really want to be able to link a grant to a person to a publication: this grant was awarded to this faculty member and ultimately produced this publication. Web of Science did a good job of including metadata on the grant number a publication was derived from, which helped us with part of this, linking the grant to a publication. The difficult part was then linking that data to the correct author, the current Penn State researcher, which is a good segue to NSF.

We actually have grant data in other systems, but some of that grant information is not public and is considered sensitive. Instead of worrying about authorization decisions on grants, we decided to query NSF and get Penn State grant information from a public source, so that, in short, all of our data was publicly available. Because of the way NSF's information is organized, it was easier to associate a grant with a Penn State researcher. Between those last two data sources, Web of Science and NSF, in some cases we are able to link an NSF grant to a Penn State researcher and link that researcher to a publication found in Web of Science; a sketch of that matching logic follows below. Today we have a little more than a couple thousand grants linked to one or more publications.
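That linking step is easy to hand-wave, so here is a minimal sketch of it under assumed record shapes (the real NSF and Web of Science payloads differ). NSF awards give us grant number and PI name pairs scoped to Penn State, Web of Science gives us publication and grant number pairs, and a name match against a researcher table ties the three together.

```python
from dataclasses import dataclass

# Illustrative record shapes; the actual NSF and Web of Science schemas differ.
@dataclass(frozen=True)
class NsfAward:
    grant_number: str
    pi_name: str             # e.g. "Jane Q. Researcher"

@dataclass(frozen=True)
class WosPublication:
    doi: str
    grant_numbers: tuple     # grant IDs Web of Science attached to the paper

def link(awards, publications, researchers_by_name):
    """Yield (researcher, grant_number, doi) triples we can assert."""
    awards_by_grant = {a.grant_number: a for a in awards}
    for pub in publications:
        for grant_number in pub.grant_numbers:
            award = awards_by_grant.get(grant_number)
            if award is None:
                continue  # grant isn't in the NSF / Penn State set
            researcher = researchers_by_name.get(award.pi_name.lower())
            if researcher is not None:
                yield researcher, grant_number, pub.doi
```

In practice the name match is the fragile step, which is exactly why the talk circles back to identifiers like ORCID at the end.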
Next is ORCID. This was the first data source that we were able to successfully write information to: initially, as a test, we wrote employment and employer information to ORCID, and then we worked on writing publications from RMD to ORCID. As a researcher, you can log into RMD; on the screen you see when you log in, you view your publications. If any of your publications have not been written to ORCID yet, you can click the little button that says "add to my ORCID record," and once it's been added you'll see text indicating that the information has been written to ORCID.

ScholarSphere, which is our institutional repository, is the second data source that we write to. I'll go into the workflow of our open access implementation a bit later, but from that listing of publications, the previous screen I was showing from RMD, you can click on a publication and then upload a file from RMD. In the background, RMD hands that PDF or document to ScholarSphere and emails you when it's complete, to let you know you've successfully uploaded the document from RMD to ScholarSphere. ScholarSphere then returns an open access URL for RMD to store, so it knows about the deposit.

Next, if you're wondering how we know whether a publication is or is not openly accessible: we query Open Access Button by sending a DOI, and we retrieve an open access URL and a status, which is green, bronze, hybrid, gold, or closed. We store that information for the publication as well.

Some of these data sources we update automatically every night, like Digital Measures, Pure, and the Penn State news. If they're a bit more involved, like our electronic thesis and dissertation system, which requires us to get a feed, we update them semiannually. We've considered adding additional data sources; for example, we have NSF but we don't have NIH, which is one that comes to mind. But for now, these are the data sources we're drawing from and, in some cases, providing data to.
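To make that open-access check concrete, here is a sketch of a lookup along those lines. Open Access Button is a real service, but the endpoint URL, parameters, and response fields shown here are assumptions for illustration only; its actual API documentation is the authority on the real interface.

```python
import requests

def oa_status(doi):
    """Ask an OA-lookup service about one DOI.

    The endpoint and response fields below are assumed for illustration;
    check the Open Access Button documentation for the real interface.
    """
    resp = requests.get(
        "https://api.openaccessbutton.org/find",  # assumed endpoint
        params={"id": doi},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    # RMD stores both pieces: where the open copy lives, and its OA color
    # (green / bronze / hybrid / gold), or 'closed' if no copy was found.
    return data.get("url"), data.get("status", "closed")
```

The stored URL and status then drive both the profile pages (paywall-free links) and the open access workflow described later.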
We provide some of this information that we've aggregated via an HTTP, RESTful-ish API. ORCID and ScholarSphere aren't the only places we provide data to; the API is a bit different in that we're providing information to the units on campus that ask for it. This is a screenshot of our documentation for the API. From here, people are able to try out the API without doing any programming, to see whether the data each endpoint provides is usable to them. That said, you can see we largely query by user, but we do provide queries by publication and, not seen here, queries by organization; the only reason that's not shown is that I couldn't fit it on the screen. Our API is organized around the use cases and requests we've received, and the largest of those use cases is department and directory web pages, so that's what a lot of our API is geared toward. (A sketch of what calling it might look like follows below.)

I think a fair question would be: Digital Measures has an API, Pure has an API, why don't you just point campus units to those APIs? We do provide that. But at the time we were developing RMD, it became clear that a number of units on campus wanted this information to promote their faculty on college, department, or institute websites, and they really liked that the information was continually being updated, so they wouldn't have to track down faculty to make updates themselves. By creating this API as a single entry point for our campus, we prevented a fracturing of users, where some units get data from Digital Measures and some from Pure, leaving people confused about which is the "right" one, so to speak. Imagine telling people: go to Digital Measures for the School of Arts and Architecture; go to Pure if you want a faculty member's research interests; go to Pure if you want a richer set of publication outputs. We just didn't want that kind of confusion.

Also, the data from these sources isn't always exact. As we go through the data in Digital Measures, we have people manually deduplicating and improving its quality, and a single entry point centralizes that task. The other benefit is that if Elsevier or Watermark upgrades their API, every unit on campus does not have to go through that upgrade. In some cases, I believe the Pure API is upgraded in a non-backwards-compatible way annually; imagine several units being unable to do anything other than work on that upgrade at a certain point in the year. Having our API as the entry point centralizes the task of API maintenance and upgrades as well.

We did not, however, want to become a central profile site for our researchers. We wanted to provide access to this data so that the different units could create their own profile sites. The concern with creating the profile site ourselves was that we would balloon into serving a number of interested parties and it would become an all-consuming part of our portfolio. We're libraries: let's provide them with the information they need, and they can create their own sites.

To support that, we created a demo of what a profile could look like using the API from RMD. You can see we can provide citations for publications from Digital Measures or Pure. We can provide graduate advising from our ETD system and link into the ETD system for those theses. We can provide links to stories in the Penn State news. We can provide links to grant information the researcher has been a part of. And going back to the publications, we can also provide links that are not behind a paywall for any content that's openly accessible, so not only do we have the citation, we have a link to the content. Additionally, because we're using our faculty activity reporting system, Digital Measures, we can provide a listing of performances for faculty in fields that are less focused on publications.

So that's our profile demo website. I'll show some examples of units on campus that are using the API. Here are the Huck Institutes, using content we've provided to drive their profiles. And here is the College of Agricultural Sciences, also using our API for their directory profiles and the listing of publications for their food science department.
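As an illustration of how a unit might assemble a directory profile from endpoints like these, here is a small sketch. The base URL, endpoint paths, and field names are assumptions for illustration; RMD's actual API documentation defines the real ones.

```python
import requests

BASE_URL = "https://example.psu.edu/rmd/v1"  # hypothetical base URL

def fetch(path, **params):
    """GET one (assumed) RMD endpoint and return its JSON payload."""
    resp = requests.get(f"{BASE_URL}{path}", params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

def build_profile(user_id):
    """Gather the pieces a directory profile page displays into one dict."""
    return {
        "publications": fetch(f"/users/{user_id}/publications"),
        "advising": fetch(f"/users/{user_id}/etds"),        # ETD committee work
        "news": fetch(f"/users/{user_id}/news_stories"),    # Penn State news links
        "grants": fetch(f"/users/{user_id}/grants"),
        "performances": fetch(f"/users/{user_id}/performances"),
    }

# A department site might render citations plus any open access links:
for pub in build_profile("abc123")["publications"]:
    line = f'{pub.get("year", "n.d.")}: {pub.get("title", "Untitled")}'
    if pub.get("open_access_url"):
        line += f' -> {pub["open_access_url"]}'
    print(line)
```

The design point from the talk holds regardless of the exact paths: the unit writes a few GET calls once, and the data underneath keeps refreshing nightly without anyone chasing faculty for updates.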
Now I'm going to quickly discuss our use case of generating administrative reports. Basically, what we were looking to do here is serve the units that don't have IT support but still want data from this system so they can promote their faculty: how can we provide access to this data in aggregate? We have an administrative side to the Researcher Metadata Database, and here is a screenshot where you can see all the publications we've aggregated from those various sources, a listing of over 330,000 publications. You can click through to see which of these are open access, which exist in ScholarSphere, and so on.

The ability to provide this data to others is in this button right here, where you can export the publications in a number of formats and share them. So you could run that export and provide the College of Engineering with a listing of their publications by year, if you wanted to. Some of this could also be done from the API with a bit of programming logic; this is just another way for us to provide access to that data.

Showing a screenshot about open access is a nice segue to our last use case, and that's powering our open access workflow. In 2020, Penn State passed an open access policy saying that faculty need to provide an open access copy of their scholarship published in 2020 and beyond. So how can we enforce this policy? "Enforce" is a bit of a loose word; in my experience in academics, it's more like: how can we ask faculty nicely to do something they're required to do by policy, limit the interruption in letting them know about the policy, and limit the interruption in getting them to comply?

Step one: with RMD, we have a good sense of what research outputs we know about from Digital Measures and Pure. We need to deduplicate the publications from these data sources, and we're able to do a significant amount of that deduplication algorithmically. But when you have 330,000 publications and an algorithm with a 90-plus percent success rate, you still have about 30,000 publications that need to be manually deduplicated, so we have a tool in RMD that allows our staff to do this. (A sketch of the algorithmic pass follows below.) Then we query Open Access Button with the publications, so we're able to find out which of them are openly accessible. Every four months or so, we email our researchers asking for either a URL or the content of any publication for which we were not able to find an open access copy. The researcher logs into RMD and either provides an open access URL, uploads a copy of the article to ScholarSphere, or submits a waiver of the policy. You saw that screenshot earlier, when I was going through the data sources and showing how RMD writes to ScholarSphere.

So, in creating this application: we are able to repeatedly get new metadata from these various sources, in some cases nightly, in some cases by asking for an export and importing the data. We can provide access to the data via an API, most popularly for unit or department directory pages. We can build and provide administrative reports. And we can power our workflow to comply with the open access policy. Overall, the Researcher Metadata Database has been a critical aspect of our open access workflow, and we've been able to help other units on campus as well by providing access to this content. When we began our open access policy, one of the goals was to have 25% of our publications available via open access. Currently we're at over 33% of our publications, and that number is actually much higher, nearly 50%, if you look at snapshots of more recent years.

One of the most difficult things is dealing with ambiguity. Is this researcher who we think it is? Is this document the same as another document we're looking at? Is this publication the same as another publication we're looking at? Working on this project provided real insight into the power of identifiers like DOIs and ORCID iDs.
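The talk doesn't spell out RMD's actual matching rules, so the sketch below is just one plausible shape for the algorithmic pass: treat two records as the same publication when their DOIs match, or, failing that, when their normalized titles and years match. Anything the pass can't decide is what lands in the manual review tool.

```python
import re
from itertools import groupby

def normalize_title(title):
    """Lowercase, strip punctuation, collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def dedupe_key(record):
    """Identity key: prefer the DOI; fall back to normalized title plus year."""
    doi = (record.get("doi") or "").lower().strip()
    if doi:
        return ("doi", doi, 0)
    return ("title", normalize_title(record["title"]), record.get("year") or 0)

def dedupe(records):
    """Collapse records that share a key; keep one representative per group."""
    for _, group in groupby(sorted(records, key=dedupe_key), key=dedupe_key):
        yield next(group)  # a real system would merge fields, not just keep one
```

The records this heuristic cannot confidently collapse, say, DOI-less records with slightly different title strings, are exactly the kind of residue that makes up the roughly 10% needing human judgment.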
I'll turn it back over to Cynthia to discuss some of those identifiers in more detail.

Thank you, Dan. And just to build on what Dan was saying there at the end: I think one of the easiest things we can do within higher education, and within the scholarly and scientific communications ecosystem broadly, is to adopt persistent identifiers (PIDs).

When we talk about the landscape of PIDs these days, I think we're really talking about a couple of landscapes at once. One layer of that landscape is all about figuring out what we want or need to identify. There are DOIs for articles and data sets and preprints, ORCID iDs for people, and ROR IDs for research organizations, plus a whole wider landscape of identifiers for grants, instruments, facilities, samples, conferences, and projects. There's a lot out there. Another layer of the landscape is the systems, tools, structures, infrastructures, and workflows that we need, and the standards for how we might use and organize them. And the final layer is all about how we connect these things: they can't just live as a jumble once we have the infrastructure. We need to move them into a system of organization so we can connect these things together, and if we get people on board with adoption, we can really achieve meaningful insights about research.

When Dan was talking excitedly about being able to join the grant information with the faculty member at Penn State, and then with the article that came out of it, that's a very near possibility. There are still some challenges we need to overcome. It's not enough to have a PID, and it's not even enough to have a set of PIDs or the infrastructure; they need to be connected to one another, and they really need to be used. Obviously, outreach and adoption are a key part of this: continuing to bring the community together and raise awareness, and also asking how these connections can be set up seamlessly, to minimize friction and address some of those human challenges we talked about earlier.

So what does this all mean in terms of how this landscape might be networked to unlock discovery? The core of this is obviously the PIDs themselves. But as we've said, PIDs alone don't do much. They need to be collected into workflows, systems, standards, and infrastructure, and that infrastructure needs to be robust and open to make their use easy and sustainable. Then they need to be deposited into global research infrastructure, such as Crossref, DataCite, and ORCID, so that the metadata can be searched and the systems can connect. All of this development is really aimed at activating the research landscape with PIDs. (The sketch below shows what a fully connected record might look like.)

And just to back up for a moment: these best practices and the information I'm sharing about persistent identifiers flow from work the Association of Research Libraries conducted along with partners at the Association of American Universities, the Association of Public and Land-grant Universities, and the California Digital Library. That work brought together folks to talk about the research data landscape and the current state of persistent identifiers for data sets, but the best practices I'm going to share in a minute are really global, and I think they are critical for meeting our needs within research information management systems as well.
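To make "connected" concrete, here is a minimal sketch of what a publication record might look like once the key PIDs are attached and linked. The field names are illustrative, and while the identifier types are real (DOI, ORCID iD, ROR ID, Crossref Funder Registry ID, grant DOI), the values shown are made up for the example.

```python
# A hypothetical, fully "networked" publication record. Every entity the
# record touches (output, person, organization, funder, grant) carries its
# own persistent identifier, so systems can join on IDs instead of names.
publication = {
    "doi": "",          # the output itself (example value)
    "authors": [
        {
            "name": "Jane Q. Researcher",
            "orcid": "",  # the person
            "affiliation": {
                "name": "Penn State",
                "ror": "",      # the organization
            },
        }
    ],
    "funding": [
        {
            "funder": {"name": "NSF", "funder_id": ""},
            "grant_doi": "",         # the award
        }
    ],
}
```

With records shaped like this, the grant-to-person-to-publication join Dan described stops being a fragile name-matching exercise and becomes a simple lookup on identifiers.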
So it really comes down to five easy things we can do to advance our work within research information systems: digital object identifiers (DOIs) to identify articles, preprints, and research data, as well as other publications and outputs; ORCID iDs to identify researchers; ROR IDs, from the Research Organization Registry, to identify organizations and link an organizational affiliation to an author or co-author; Crossref Funder Registry IDs to identify research funders; and Crossref grant IDs to identify grants and other types of research awards. It's through the adoption of these key PIDs that we're going to see great advancements in connecting, finding, and discovering the research created at our institutions.

Just to wrap things up, here are some references for the challenges I mentioned earlier. On behalf of myself and Dan Coughlin, I want to thank you; you're more than welcome to contact us if you have any questions, and we thank you for your time today.