Welcome to this CNI project briefing on automating the collection of article data and repository content. I'm Stephen Pryor, Digital Scholarship Librarian at the University of Missouri, and as you might infer from the title, I'll be describing an in-progress project we are developing to move toward automating the collection of information about what our authors are publishing, where and how they're publishing it, and getting those publications into our institutional repository or into some other form of open access availability.

We have a few things we want to accomplish with the different phases of this project: analyzing and visualizing our author and publication activity to support our internal decision making as well as external communications with the provost, the board of trustees, and other entities on campus and within our multi-campus system; contacting and maintaining relationships with our faculty authors; and eventually evolving toward the document retrieval process we'll see later.

Our process currently works like this: we have some raw data; we read that data and combine it with a few different sources; we send emails to authors; and we download articles and create metadata and batch files for DSpace. A lot of these steps, as you'll see, involve Python and various APIs. I have a set of Jupyter notebooks, and there's a GitHub link at the end of this presentation, so you can look at them yourself and see what is occurring in each of these steps.

The first step is to download raw data, which I am currently retrieving from Scopus. As you may know, Scopus lets you search by affiliation, so we can search for University of Missouri (plus a couple of stray affiliation records that still exist, which I also check from time to time) and get a list of articles published by University of Missouri authors. I can narrow that down by year or author name, and apply other advanced search limits, such as open access availability. I can then export the result set with a certain amount of citation information: document type, DOI, publication date, authors, and so on, along with the correspondence address, affiliations, publisher, and some other metadata. I export all of that as a CSV.

At that point I go to my first Jupyter notebook, which helps me gather additional data about each article. I read in my Scopus data and define a function to query Unpaywall, a database that, given a DOI, will tell me whether an open access copy of the article is available, and if it is, where that copy lives, what version of the document it is, and so on. I call that function with the list of DOIs from my Scopus data frame and then combine the Unpaywall data frame with the Scopus data frame on the fields I want: best open access license, publication date, OA status, best OA URL, and the like.
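To make the Unpaywall step concrete, here is a minimal sketch of the kind of lookup and merge this notebook performs. Unpaywall's REST API does require a contact email on every request, but the column names, the `MAILTO` address, and the file names here are illustrative assumptions rather than the exact code in the notebooks:

```python
import pandas as pd
import requests

MAILTO = "your_email@example.edu"  # Unpaywall requires a contact email (placeholder address)

def get_unpaywall(doi):
    """Query the Unpaywall REST API for one DOI and flatten the fields we care about."""
    resp = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": MAILTO})
    if resp.status_code != 200:
        return {"DOI": doi, "is_oa": None}
    data = resp.json()
    best = data.get("best_oa_location") or {}
    return {
        "DOI": doi,
        "is_oa": data.get("is_oa"),
        "oa_status": data.get("oa_status"),            # gold, green, hybrid, bronze, closed
        "published_date": data.get("published_date"),
        "best_oa_license": best.get("license"),
        "best_oa_version": best.get("version"),        # submitted/accepted/publishedVersion
        "best_oa_url": best.get("url"),
        "best_oa_url_for_pdf": best.get("url_for_pdf"),
    }

scopus = pd.read_csv("scopus_export.csv")              # the CSV exported from Scopus
unpaywall = pd.DataFrame([get_unpaywall(doi) for doi in scopus["DOI"].dropna()])
combined = scopus.merge(unpaywall, on="DOI", how="left")
combined.to_csv("scopus_plus_unpaywall.csv", index=False)
```

Merging on DOI keeps the join unambiguous, since the DOI is the one identifier both datasets share.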
After I've combined that Unpaywall data with the Scopus data, there's one other field I want to add to the dataset. I took a look and built a list of the top publishers, which cover probably 90 to 95% of the publications in the list, and for each one I checked whether it has a general green open access policy that allows authors to deposit their post-peer-review manuscript version, the author accepted manuscript. If a publisher has such a policy, I add it to my list of green OA publishers. At this step I go through the articles and note, for each one, whether the publisher is a green OA publisher or the article is already available open access, either gold open access or in a fully open access journal, and I create a field that flags whether the article is potential OA or not (there's a sketch of this below).

The reason I do this is that I might want to generate a chart saying that 40% of our items are currently open access, but that if everybody who could deposit a manuscript did deposit one, we could have 90% of our content available open access. Adding this flag gives me a dataset showing what our potential could be if all of those green OA possibilities were realized. I then save out a new CSV with all my new data.

The next step is to take this Scopus-plus-Unpaywall data and add even more information to it: contact information, names, and departments for the authors, where I can get them. I've broken this phase out into a separate Jupyter notebook. I read in my CSV of article data and set up an LDAP connection. Scopus gives us, for the corresponding author at least, a name, an email address, and department information, but the department is basically whatever the authors provided with the publication, so it's not standardized and it's all over the place. I wanted to gather this, as much as possible, as a standardized list, so I look it up against our campus directory. I set up a connection to our LDAP server, iterate through the article data, and for each row parse the correspondence address. For corresponding authors who are University of Missouri authors, I can match them right away with an entry in the directory and pull their title, their department, and their official name. I make this match on email address, which is a pretty good unique key available in the Scopus data.

Later in the presentation, when I talk about improvements to be made to this process, the major caveat will be that this only retrieves data for the corresponding authors, because that's who I have email addresses for. I figured the corresponding author is the one who is, quote unquote, in charge of the disposition of the publication, so starting with them seemed like a good way to go, since we're collecting this information for the purpose of contacting the author for permission or notification. We'll get to that in just a moment.
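The potential-OA flag reduces to one condition per article. Here's a minimal sketch, where the `GREEN_OA_PUBLISHERS` list and the column names are illustrative assumptions, not the exact names in the notebooks:

```python
import pandas as pd

# Hypothetical hand-built list of publishers with a general green OA policy
GREEN_OA_PUBLISHERS = {"Elsevier", "Springer Nature", "Wiley", "Taylor & Francis"}

combined = pd.read_csv("scopus_plus_unpaywall.csv")

# Potential OA: the article is already open access, or its publisher
# allows deposit of the author accepted manuscript (green OA).
combined["potential_oa"] = (
    combined["is_oa"].eq(True)                      # .eq(True) treats missing values as False
    | combined["Publisher"].isin(GREEN_OA_PUBLISHERS)
)
combined.to_csv("scopus_unpaywall_flagged.csv", index=False)
```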
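The directory lookup itself can be sketched with the ldap3 library; the server name, search base, and attribute names below are placeholders standing in for whatever the campus directory actually exposes:

```python
from ldap3 import ALL, Connection, Server

# Placeholder server and base DN; the real directory will differ
server = Server("ldap.example.missouri.edu", get_info=ALL)
conn = Connection(server, auto_bind=True)  # anonymous bind; production may need credentials

def lookup_author(email):
    """Match a corresponding author's email against the campus directory and
    return their official name, title, and standardized department."""
    conn.search(
        search_base="ou=people,dc=missouri,dc=edu",
        search_filter=f"(mail={email})",
        attributes=["displayName", "title", "department"],
    )
    if not conn.entries:
        return None
    entry = conn.entries[0]
    return {
        "official_name": str(entry.displayName),
        "title": str(entry.title),
        "department": str(entry.department),
    }
```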
This block of code collects that information about the authors, and then I save the spreadsheet out. This version of the spreadsheet has all of our article data: the DOIs, the titles, some information about the funding body, our corresponding author fields, and the data that has been added along the way, including the Unpaywall fields such as best OA URL for the PDF and best OA version.

At this stage, with everything exported to a spreadsheet for the purpose of contacting these authors, one other thing I wanted to do was bring the subject librarians in. For each department that appears in the corresponding department field, I've created a list of email addresses and the subject librarian associated with that department, and I do an INDEX/MATCH in the spreadsheet to match the department name with the corresponding librarian (there's a pandas equivalent sketched below).

Once that match is made, I've created what is essentially a mail merge template that lets us send customized emails to the authors describing their article: congratulations, then either a notification or an ask of some sort. We sign it with my name and add their subject librarian, in case they're familiar with their librarian, or especially if they're not familiar with their subject librarian. This template is where we started with the process: if an article is already available open access with a suitable Creative Commons license that allows us to repost it under that license, we email the corresponding author. The message says congratulations on your publication; according to our records it is available open access, and we can post it; if you don't want us to, let us know. So far, out of the few hundred articles we've sent emails about, I've gotten only a few responses back, and they've been very positive: "Yeah, I'm all for this, go ahead and do it." Nobody has opted out. There are a lot more of these to send, so I'm prepared, but once we connect this back to the spreadsheet, we can see that we really are filling in information.
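The spreadsheet's INDEX/MATCH has a straightforward pandas mirror if you'd rather keep the whole pipeline in the notebooks; the librarian table and file name here are made up for illustration:

```python
import pandas as pd

# Hypothetical department -> subject librarian lookup table
librarians = pd.DataFrame([
    {"department": "Biological Sciences", "librarian": "A. Librarian", "librarian_email": "alib@example.edu"},
    {"department": "History", "librarian": "B. Librarian", "librarian_email": "blib@example.edu"},
])

articles = pd.read_csv("articles_with_authors.csv")  # assumed file name

# Equivalent of INDEX/MATCH: a left join on the standardized department name
articles = articles.merge(librarians, on="department", how="left")
```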
At this point in the process I have the spreadsheet combined with all of this information, so we can start looking at our visualizations. Using something like Tableau, Google Data Studio, or Microsoft Power BI (all of them work; I've tried all of them), and for now I'll just show the Tableau workbook, we can start breaking down the publications in our dataset: how many are closed; how many are closed but potentially open access in MOspace, which is where the potential-OA flag I generated comes in; which kind of open access is available; and which version, so green submitted, green accepted, green published, hybrid accepted, hybrid published, gold, and bronze. In Tableau or Google Data Studio you can also set up click-through navigation, and we'll see a combined table in just a minute. One other thing we might want to look at is OA availability by publisher, so we can see how much our authors are publishing with certain publishers, how much of that is OA, and again what type of open access.

We might also want to look at our corresponding authors and see which departments are publishing the most in this set of publications, and how many of those publications are open access; we can apply different sorts to these views as well. We can set up different dashboards for different information. Right here, for example, we could say we want to see all of our green open access through Elsevier, click on that in the chart, and get a list of all of the matching articles. The visualizations are genuinely useful for us; we've used them in our own internal discussions about publishing agreements and subscription deals, and more and more we're able to take this kind of information and make arguments about whether we want some sort of open access clause included in a particular publisher's contract.

Finally, there is one other thing we're able to do with this dataset. From Unpaywall we have direct PDF links where available, and otherwise links to the best OA version. If I go back to my Jupyter notebook, my third process is to actually retrieve the open access articles. As I said before, where we have articles and we're sending out those emails (congratulations, you've published an article; according to our records it's available open access; we would like to include it in our repository), I'm able to first go through and make that license judgment based on Unpaywall's best OA evidence and best OA license fields. I can then go through those records and build XML blocks: this block of code builds a block with a citation and outputs XML that I can import directly into a DSpace batch job. Finally, there's the get-articles step, a function that, for the direct PDF links, will download the PDFs directly, or else launch a browser page, because some publishers don't allow a robot or spider; they don't appreciate being connected to that way. I save those links to a list, and we can download them manually if we need to.
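Here is a minimal sketch of what that XML-building step might look like, targeting DSpace's Simple Archive Format, where each item is a directory holding a dublin_core.xml and a contents file. The metadata crosswalk and column names are illustrative assumptions, not our exact mapping:

```python
from pathlib import Path
from xml.sax.saxutils import escape

def build_dublin_core(row):
    """Render one article's metadata as the dublin_core.xml DSpace expects."""
    fields = [
        ("title", None, row["Title"]),
        ("date", "issued", row["published_date"]),
        ("identifier", "uri", f"https://doi.org/{row['DOI']}"),
        ("identifier", "citation", row["citation"]),
        ("rights", None, row["best_oa_license"]),
    ]
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<dublin_core>"]
    for element, qualifier, value in fields:
        if not value:
            continue
        qual = f' qualifier="{qualifier}"' if qualifier else ""
        lines.append(f'  <dcvalue element="{element}"{qual}>{escape(str(value))}</dcvalue>')
    lines.append("</dublin_core>")
    return "\n".join(lines)

def write_item(row, batch_dir, n):
    """Write one Simple Archive Format item directory for a batch import."""
    item_dir = Path(batch_dir) / f"item_{n:04d}"
    item_dir.mkdir(parents=True, exist_ok=True)
    (item_dir / "dublin_core.xml").write_text(build_dublin_core(row), encoding="utf-8")
    # The contents file lists the item's bitstreams (here, the harvested PDF)
    (item_dir / "contents").write_text(row["DOI"].replace("/", "_") + ".pdf\n", encoding="utf-8")
```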
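And the get-articles step amounts to something like the following: try each direct PDF link, save whatever comes back cleanly, and queue everything else for a human with a browser. The column names, file names, and user-agent string are assumptions:

```python
from pathlib import Path
import webbrowser

import pandas as pd
import requests

manual_queue = []  # links that refused the scripted request

def get_articles(df, out_dir="pdfs"):
    """Download PDFs from the direct Unpaywall links; queue refusals for manual retrieval."""
    Path(out_dir).mkdir(exist_ok=True)
    for _, row in df.iterrows():
        url = row.get("best_oa_url_for_pdf")
        if not isinstance(url, str) or not url:
            continue
        try:
            resp = requests.get(url, timeout=30,
                                headers={"User-Agent": "Mozilla/5.0 (repository-harvest)"})
        except requests.RequestException:
            manual_queue.append(url)
            continue
        if resp.status_code == 200 and "pdf" in resp.headers.get("Content-Type", "").lower():
            (Path(out_dir) / (row["DOI"].replace("/", "_") + ".pdf")).write_bytes(resp.content)
        else:
            manual_queue.append(url)  # publisher blocked the spider

df = pd.read_csv("scopus_unpaywall_flagged.csv")  # assumed file name
get_articles(df)
for url in manual_queue:
    webbrowser.open(url)  # hand the blocked links to a real browser for manual download
```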
So that is our current process and pipeline, which has definitely evolved: we started by basically downloading data from Scopus to throw into some sort of visualization tool, added Unpaywall, then started gathering other information, eventually culminating in collecting articles and creating batch metadata for DSpace.

As for future improvements, some seem obvious. I have a lot of refining to do to make this more repeatable. Right now the initial step is downloading CSVs from Scopus, and I'm working a little on automating that collection with the Scopus API instead of the spreadsheet. The upside of the spreadsheet is that when it's the input, we can do some further merging of information before the initial step. Specifically, we have access to Web of Science and to Academic Analytics, and before the Unpaywall and contact information steps I might do some processing to merge additional data into the CSV for the DOI list that feeds the spreadsheet. Overall, though, there are a lot of moving parts here, and refining the process so that it's repeatable without too much tinkering is a priority.

Part of this also involves saving the data to a database, to make it easier to check for duplication, merge those additional data sources, and cache author information so I don't have to do the LDAP lookup every time. Caching author information, and maybe saving the Scopus IDs for authors, which are also part of the Scopus dataset, would let me do things like expand to contacting authors in cases where the MU author is not the corresponding author but we still want to include the paper; expanding out that way might allow me to match those authors against our own internal database. All of this data currently starts from Scopus, Unpaywall, and our campus directory; there may be additional and better data sources, and better ways to connect to them, and I'm on the lookout for those.

So far we've been doing the email contact for articles that are OA and that we can ingest automatically into our repository through this PDF harvest and batch XML for DSpace. But Shareyourpaper just launched, an interesting tool that would let us simply include a URL in the email and set up an automated process for authors to select their manuscript file and add it to our repository in a correct way, which again automates a lot of the metadata process for those articles.

Here's my GitHub link as it currently stands; the Jupyter notebooks are there, along with a README file discussing what's going on and how to use them. Thank you, and feel free to send me a note if you have any questions, comments, or thoughts that might make my life easier. If you're interested in this project, I'd appreciate hearing from you. Thank you.