Good afternoon, everybody. Thank you for joining us for the last session today. My name is Erik Mitchell. I'm the university librarian here at UC San Diego, so welcome to San Diego. I have the honor today of kicking off this presentation by the LEADING 2021 fellows. I'm here as one of the PIs on this project, which is led by Dr. Jane Greenberg at Drexel University, along with Kenning Arlitsch at Montana State University, myself, Rachel Frick at OCLC, and Jake Williams at Drexel University.

LEADING is an IMLS-funded early- to mid-career professional development program focused on developing data science skills in LIS doctoral students and librarians. Our overarching goal is to build a community of expertise that informs and helps drive experimentation and innovation in information science research and practice. We're grateful to IMLS for their support of this effort, and we're excited to be moving into our second year of fellows. This project builds on LEADS, another IMLS-funded effort led by Dr. Jane Greenberg, which piloted this approach with doctoral students back in 2018 and 2019. I'll just comment that I was a member of the advisory board and remember the presentations from those fellows well, and I'm so excited to be able to share the stage with our fellows today.

A key feature of our project design is the coordinated collaboration between library and information science schools, libraries, and national library organizations like OCLC. For example, our LEADING project is grounded in the data science expertise of the faculty at Drexel University, who developed a boot camp and support our LEADING fellows throughout their six-month curriculum. Our partner libraries bring pressing data and information science issues as well as a network of dedicated mentors. And a bit later in this session, you'll hear about the incredible work of OCLC in leading a data challenge that was a pivotal part of this effort. Following the boot camp, which starts this entire experience for our fellows, the fellows join a project site to engage in a real-world data science project. We began with 15 sites in 2021, and I'm excited to share that we have four new sites joining us in 2022. I'll just comment that this all works, of course, because of the hard work of our fellows, the incredible faculty we get to work with, and an amazing advisory board, of which, of course, Cliff is a member. Thank you, Cliff. Thanks also to our various task forces, especially our diversity, equity, and inclusion task force, which supports us in making sure we recruit a great cohort of students.

Similar to other community-of-expertise projects, LEADING is structured around a node concept. Our goal in organizing into nodes is to create opportunities for fellows and mentors to work together as a team and learn from each other. The node concept also seeks to broaden and deepen the connections among our fellows and mentors during this intensive and, we hope, transformative six-month period. I'll also say that in our first year we learned a lot about what it means to launch and lead a community of expertise, and I want to give a shout-out to my own fellow, Crystal Goldman, who's not here on the stage today, as well as to Sam Grabus, our project coordinator, for all the amazing work they did. We just wanted to show you where the 2021 fellows and their nodes are, and I think on my next slide you'll actually get to see their faces. You should recognize a few folks here today.
If you want to learn more about our fellows, all of them have profiles on our website, and I certainly encourage you to learn more. At the December CNI meeting, we had the opportunity to share the work of six fellows through a lightning-round-style session, and we're excited to highlight the work of another five fellows with you today. Following the next five lightning talks by our fellows, Rachel Frick will give us an update on the OCLC data challenge I mentioned, and I expect we'll have a few minutes for Q&A at the end. Part of our work in LEADING focuses on helping fellows tell the story of their research impact through something we call a quad slide; you can think of it as a visual elevator pitch. So with that introduction, it's my pleasure to turn it over to our fellows. Amanda?

Thank you, Erik. Give me just a second; I'm going to make the notes part bigger so I don't have to scroll so much. Perfect. Can everybody hear me okay?

Hi everyone, my name is Amanda Whitmire. I am the head librarian for Stanford's marine biology branch library, and I am on a mission to make historical biodiversity data discoverable, accessible, and usable. I wonder how many of you might have ever used iNaturalist or the Merlin Bird ID app? Any other bird nerds out there? Perfect. Those of you who have used these apps to observe and identify an animal or a plant out in nature have captured a very specific kind of observational data called a species occurrence, and this kind of data is foundational for studies in ecology and biodiversity. If you're a researcher trying to detect something like changes in biodiversity resulting from climate change, for example, you need this kind of data over a long period of time. I work at a marine station that's been in the same spot for over 100 years, and I manage several physical collections with observational data locked away in their pages. If I can find a way to pull observational data out of historical documents, it would be a huge benefit to the researchers at the marine station and beyond.

So for my LEADING fellowship with the Academy of Natural Sciences, I was really interested in exploring how much progress we could make toward developing a computational workflow to extract species occurrence records from the corpus of the Proceedings of the Academy of Natural Sciences of Philadelphia. A species occurrence, the kind of data you capture with iNaturalist, is an observation of an organism at a specific place at a specific time. This kind of data has a home in GBIF, the Global Biodiversity Information Facility, which currently holds over two billion occurrence records, but less than 1% of those are from before the year 2000, and less than one tenth of 1% are from before 1950.

An important characteristic of what we're trying to do with species occurrence records is that it's not like a sentiment analysis, where you can be really vague about the interpretations, and it's not as simple as finding the species names in the text. We need to find a species name in the context of a sentence or paragraph or list that unambiguously shows that something was present at a given location at a specific time. Unsupervised AI would make far too many mistakes in this assessment. So what we need the AI to do is cut through the hundreds of pages of irrelevant text and show us a reasonably good candidate bit that might contain a species occurrence.
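Just to make that concrete, the first-pass filter we have in mind is something like this minimal sketch (the taxon names here are hypothetical placeholders; in practice the list comes from the name-finding tool I'll describe in a moment):

```python
import re

# Hypothetical taxon names; in practice this list comes from a
# name-finding tool run over the OCR'd corpus.
TAXA = {"Mytilus edulis", "Fundulus heteroclitus"}

# A very rough cue for 18th/19th/20th-century dates, e.g. "1899".
DATE_CUE = re.compile(r"\b1[789]\d{2}\b")

def candidate_sentences(text):
    """Yield sentences that mention a known taxon and a date-like string."""
    for sent in re.split(r"(?<=[.!?])\s+", text):
        if DATE_CUE.search(sent) and any(taxon in sent for taxon in TAXA):
            yield sent
```

A human still has to confirm each candidate; the point is only to shrink the haystack.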
I need a tool that helps me filter through text efficiently and shows me small bits that I can quickly evaluate as being a species occurrence or not, and that's actually a really difficult problem. Finding species occurrence records in text requires that you can find taxonomic names associated with a place and a date, and exploring our ability to find those three things was our first step.

There's a fantastic tool for finding taxonomic names in text called Global Names Finder, or GNfinder. It can find matches for misspelled or partial taxonomic names, which is really helpful if you're working from OCR results, which we are. I was able to create a list of every single taxonomic name that GNfinder found in the Proceedings corpus, which was important for our next step: using the natural language processing tool spaCy to help us find locations and dates, in addition to the taxonomic names, through a process called Named Entity Recognition (NER). To help us understand how spaCy performs NER in the context of the text, I created a simple online Streamlit app to visualize the results; there's a bitly link on the slide if you'd like to test it out. I also shared all of my R Markdown scripts via blog posts, which chronicle my journey on the fellowship, and everything I'm talking about today is on GitHub. But don't look, because it's a mess.

After spending some time getting acquainted with the corpus and testing out GNfinder, Python, and spaCy, Steve and I took a step back and really thought about how we could wade through the corpus and winnow it down to a manageable size for the NLP process. Significant portions of the Proceedings are dedicated to meeting minutes, donations to the library, descriptions of fossils or geology, and other topics that aren't relevant to our pursuits in biodiversity. Steve had previously done some work parsing the corpus into sections, and he took that work a step further by using WordNet to determine which sections are probably about plants or animals. Within those sections, I tested a couple of non-machine-learning approaches to further reduce the corpus. For example, I looked for articles that had the word "collected" in the title, like "The list of fishes collected at Port Antonio, Jamaica, 1899," but there were only 20 of those, unfortunately. I also tried text mining the corpus and keeping only the sentences that included a taxonomic name, but there were 52,660 of those, so that criterion wasn't specific enough.

We decided to pick one of the sections of the corpus and do a bit of a close read on how the NLP performed, really trying to get a grip on how spaCy's linguistic model sees the entities and the relationships between them. We started looking at compound words and n-grams, and at what it thinks are verbs and dates and places, and there were a lot of challenges: lots of places where the OCR wasn't good, or where the text isn't structured in sentences. And that's where we ran out of time for the fellowship. We're still planning to continue the work, and we're currently navigating how this might intersect with the next Academy of Natural Sciences LEADING fellow. I'm extremely grateful to have had this experience. As a mid-career librarian who manages a branch library, this fellowship gave me the opportunity to dedicate my time to learning some of the theory and practices in data science that are directly relevant to my job, and I'm already using what I've learned in a couple of collections data projects that I have going on.
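Before I hand off, here's roughly what the core of that spaCy step looks like; this is a minimal sketch of what the Streamlit app visualizes, not the app's actual code:

```python
import spacy

# A small general-purpose English model; its performance on noisy OCR
# and 19th-century prose still has to be evaluated by a human.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Specimens of Mytilus edulis were collected at Port Antonio, "
          "Jamaica, in June 1899.")

# Keep the entity types relevant to an occurrence: where and when.
for ent in doc.ents:
    if ent.label_ in {"GPE", "LOC", "DATE"}:
        print(ent.text, "->", ent.label_)
```

Pairing those place and date entities with the GNfinder name list is what turns a raw page into a reviewable candidate.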
So thank you for your time, and I'll pass it on to Chris.

Good evening, I'm Chris Wiley. I am an engineering and physical sciences research data services librarian. I'm also a second-year information science doctoral student, and as of March 16th, I'm also the interim head of research data services at the Grainger Engineering Library, part of the University Library at UIUC. My research interests are in exploring researchers' practices, behaviors, and patterns, aligned with theoretical frameworks; most of the existing research work I've done has focused on data sharing, data policies, and research data management. I use qualitative interviews and focus groups, so I thought it would be great to combine this method as a complement to citation analysis and data visualization: in essence, visualizing the collection using both qualitative and quantitative methods.

Librarians at Montana State University provided a list of faculty members, and Qualtrics was used for the survey. There were about 92 faculty members; we sent targeted emails to these particular faculty members, and 18 responded. The survey had nine questions, and its focus was to assess faculty perceptions of finding the information they need for their research, teaching, and learning, as well as the helpfulness of the library, and to learn researchers' and faculty's perspectives on possible additional services. Overall, faculty rated their skills at finding information using library resources as highly effective, and rated the library as highly effective too. Their biggest area of displeasure or discontent, if that's even a word, is journal loss.

With the assistance of my boss, William "Bill" Mischo, who presented at this conference earlier this year, Scopus was used to visualize faculty research impact. It includes publications, citations, and updated NSF and/or NIH grants. The other part of the slide is a breakdown of the types of cited references: 1,987 cited references from 66 faculty members. What I learned is that Montana State provides electronic access to 86% of the cited references.

One of the lessons I learned was that using both a citation analysis and a survey was a new experience for me. If there had been more time, I'd have hoped for more responses, yet I realize there's never a way to gauge the number of responses a survey will receive, no matter when you give it. Things I'd like to keep working on are, first, exploring the most frequently cited journals for trends and patterns, plus altmetrics; second, for the faculty who discussed journal access in the open-ended portion of the survey, I'd really like to dig into why they feel that way and determine whether there's any correlation between the initial 66 faculty and the faculty who responded to the survey. And lastly, because I really love research, if I had more time I'd also look at data sharing and data policies, because I'm really interested in that.
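As a rough illustration of the 86% coverage figure I mentioned, the calculation boils down to something like this sketch (the file and column names are hypothetical, not our actual analysis code):

```python
import pandas as pd

# Hypothetical inputs: cited references exported from Scopus and the
# library's electronic journal holdings, both keyed by ISSN.
cited = pd.read_csv("cited_references.csv")        # columns: issn, title, year
holdings = pd.read_csv("electronic_holdings.csv")  # column: issn

covered = cited["issn"].isin(holdings["issn"])
print(f"Electronic coverage: {covered.mean():.0%} of cited references")
```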
Thank you so much for the opportunity to speak with you today.

I'm Chenyue Zhao, a doctoral student in the School of Information Sciences at the University of Illinois at Urbana-Champaign. It's great to be here giving my first in-person conference presentation; I'm so excited. As a LEADING fellow, I've been working on a project on the "scholarly elite," characterizing uneven distributions in access to institutional repository content. My mentors are Kenning Arlitsch at the Montana State University Library and Jonathan Wheeler at the University of New Mexico Library.

For those unfamiliar with RAMP, the Repository Analytics and Metrics Portal is a web service that aggregates use and performance data for institutional repositories. In other words, RAMP data is about the access and usage of items in institutional repositories, like whether an item appeared in search engine results and received any clicks. Previous studies have shown that a small number of items receive most of the clicks; only about 1% of items are accessed. So what I was really interested in is what affects the click rates, or access, of items in institutional repositories. Previous studies have shown that some metadata of research articles can affect their usage, including citations and downloads. So my questions for this project were: does metadata affect the click rates of institutional repository content, and how? The objective of this study is to understand the relationships between certain metadata fields and the click rates of IR content. Understanding this can help institutional repositories improve their search engine optimization, increasing visibility on search engines and attracting more clicks.

My mentors provided me with all RAMP data from 35 institutional repositories between January 2019 and May 2019. I only included theses and dissertations in my project, because they are primary and highly accessed items in institutional repositories. All titles, abstracts, keywords, and subjects were extracted from the related metadata fields; I used text mining and natural language processing for data collection and processing, and I also conducted some statistical analysis.

Some of my findings are consistent with previous studies on research articles: I found that metadata can, to some extent, affect the click rates of IR content. First, shorter, clear, and tightly constructed titles are more likely to be discovered and clicked. Second, titles containing a colon, titles with more numbers, and titles with certain positive and negative words are often associated with higher click rates. However, having geographic information or other specific contextual information in titles can affect click rates positively or negatively, depending on the context. I also found that shorter abstracts and more keywords and subjects can attract more attention. All these findings are very preliminary, and the study has some limitations, like the limited data sample.
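To make the title findings concrete, the features involved are easy to compute; here is a minimal sketch (the file and column names are hypothetical, not my actual analysis code):

```python
import pandas as pd

items = pd.read_csv("ramp_items.csv")  # hypothetical columns: title, clicks

features = pd.DataFrame({
    "title_words": items["title"].str.split().str.len(),
    "has_colon": items["title"].str.contains(":").astype(int),
    "has_number": items["title"].str.contains(r"\d").astype(int),
    "clicks": items["clicks"],
})

# A first, rough look at how each title feature relates to clicks.
print(features.corr()["clicks"])
```

Correlation is only a starting point, of course; the real relationships depend on context, as the geographic-information finding shows.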
So, in conclusion, I feel so honored to have worked on this project and to have worked with Kenning and Jonathan, and with Carrie, another LEADING fellow on the same project. They always gave me a lot of feedback and support, and regularly checked on my progress to see if they could help. Personally, through the LEADING program I have learned a lot of data science techniques and skills that I hadn't learned before, and I feel much more confident about solving problems involving data analytics in my future studies. More importantly, I realized that data science is so important in my field, because it can help us make the right decisions and provide better services. Thank you so much.

Everyone, thank you for having me. My name is Etiana Uriere. I'm the metadata librarian for the University of Texas Rio Grande Valley. As a 2021 LEADING fellow, my co-fellow Hiva Catavar and I worked with the UC San Diego Library on its Farmworker Movement collection. Hiva and I were tasked with exploring ways in which the collection could be better accessed and organized, including methods, automated or manual or a bit of both, that could extract important data from the collection to be transformed or repurposed. Initially, we both set out to explore the online and offline elements of the collection to better understand what was in the collection and how those elements were presented. Very early on in our exploration, we realized that the goal was to take that extracted data and find ways to connect it with other elements of the collection, which included images, audio, and video. Because we didn't have a lot of time to work with the collection and solve the problems the mentors had identified, I wanted to work on things that would make it easier for the UCSD staff or other future fellows to continue exploring this collection, as well as other collections. My focus, and what I enjoyed most about this experience, was having the space to try different tools that could be useful for digital collections and to provide examples of their use to others.

I relied primarily on Python tools and packages to create much of the output during this fellowship. I first created a Jupyter notebook detailing the use of two popular PDF mining packages in Python. Many of the documents available in the collection were in PDF form, and I felt it would be useful to provide an introductory demo guide on the why and how of PDF mining. After completing the notebook, I moved on to writing code to scrape the metadata for the images in the collection galleries, which was a bit of a hassle, but I made it through, and I've included a bit of the code here on this slide. Concurrently, I took a look at the oral history section of the collection and thought it could be better organized; I created a spreadsheet, that's the second screenshot on the lower right, giving a better level of detail about the oral history audio.

The last project I worked on was informed by some of the things I saw while working on the other projects and by some of the potential projects suggested by the mentors. I created a Streamlit site (Streamlit is an open-source tool that creates web apps in Python) that can convert MP3 audio files into WAV files, which work better with Python. It can also transcribe those and other audio files to text using IBM Watson's speech-to-text service, which, in my limited run-through of speech-to-text Python packages, I found to be cost-effective and fairly easy to use in lieu of manual transcription done by staff or during community events. The site can also take text input, in the case of this collection the text mined from the PDFs, and identify named entities using spaCy along with, if available, the Wikidata QIDs for those entities, via a pipeline built on OpenTapioca. The site is still a work in progress that I hope to maintain and improve over time, but my hope, as with the other things I've done and identified over the course of this fellowship, is that the UCSD library staff and possibly other individuals and institutions will find some use for these tools in tackling similar projects.
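As one concrete example of the audio piece, the conversion and transcription steps look roughly like this sketch (file names and credentials are placeholders; the site reads its configuration elsewhere):

```python
from pydub import AudioSegment
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# MP3 -> WAV conversion; pydub requires ffmpeg to be installed.
AudioSegment.from_mp3("interview.mp3").export("interview.wav", format="wav")

# Placeholder credentials and region URL.
stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

with open("interview.wav", "rb") as audio:
    result = stt.recognize(audio=audio, content_type="audio/wav").get_result()

# Stitch together the top alternative from each recognized chunk.
transcript = " ".join(
    chunk["alternatives"][0]["transcript"] for chunk in result["results"]
)
print(transcript)
```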
Thank you.

Hi, I'm Christopher Rausch. I'm here to talk about a project with the California Digital Library and Drexel's Metadata Research Center. We've called it YAMZ; the acronym stands for Yet Another Metadata Zoo, but it really is more than that. It's a dictionary of terms, both fixed and evolving, meant to be selectively referenced by future standards. Term entries are complete with versioning and persistent identifiers as well as provenance information. The motto is "YAMZ: better, faster, cheaper vocabulary standardization." With YAMZ, authenticated users can contribute terms, vote, track, and also add commentary.

As a little bit of background, metadata is essential to managing research data, and there are often multiple domain-oriented metadata standards addressing the same or similar data sets. The YAMZ metadata dictionary project was created to help address this problem. By providing an online tool for metadata managers to elicit the feedback of domain experts in order to vet metadata quality, YAMZ is intended to improve metadata curation, management, and overall usefulness in the context of specific disciplines. One of the main goals of the project was to move beyond the repository phase and begin to introduce YAMZ to academic communities and invite them to participate, because the purpose of YAMZ is in part to tackle the difficulty of consensus building around metadata in general and domain-relevant vocabulary specifically. The architects of YAMZ have a great deal of experience with this topic, having worked on the original specifications for URLs, web archiving, Dublin Core, BagIt, and a great many standards since then. YAMZ began as part of a National Science Foundation effort to support metadata preservation and interoperability.

As an ongoing project, YAMZ has developed a significant code base over generations of contributors. A main focus of my participation was reviewing this code base in the context of current best practices and making it as easy as possible to contribute to the public repositories. The first part of the assignment was made possible with the help and participation of LEADING advisors and mentors. Next was the mandate to begin broader outreach to the academic communities that are, in principle, the beneficiaries of YAMZ. Busy professionals don't always have time to contribute to collective metadata refinement through complex processes, but they will probably use tools that help them with their own work. Of course, first the code had to be up and running and available for review. We also wanted to look at the sustainability of YAMZ as both an open-source application and a functioning tool for users to upload and refine terms relevant to their domains.

In order to accomplish these goals, we developed a plan for analyzing the existing code base and documentation in order to understand the inner workings of the prototype. Even though the code was well documented, this took time, as I'm sure is true of any project. I was able to consult with previous fellows to get a general sense of how things worked, and I spent a good deal of time just walking through the code. The previous prototype was hosted as a free app on Heroku, which imposed limitations on the amount of data it could accommodate as well as on its efficiency as an application. We made the determination to move the code base to a more robust environment. Initially, we used a grant from the NSF XSEDE program to prototype on a Linux/Nginx platform. XSEDE was a valuable source of computing resources, and it was up to the task because the environment was capable of creating many instances of virtual machines where different configurations were possible.

YAMZ's functioning was well documented: there were several existing publications describing how the prototype should work from both the social and the technical perspective, and we made a plan to modernize the code base and leverage more of the functionality GitHub provides for collaboration. One significant success was the update of the code base to reflect current best practices using Flask. Flask is a web framework with an extensive community, and the modernization helped bring YAMZ in line with current practices that can be referenced online and in technical guides. This also made the application receptive to Flask plug-in-like components, letting us add functionality quickly and laying the foundation for continuous integration.
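To show the shape of that modernization, here is a minimal sketch of the kind of structure we moved toward (illustrative only; the routes and data model of the real application live in the YAMZ repository):

```python
from flask import Flask, Blueprint, jsonify

# A term-lookup blueprint; plug-in style components like this make it
# easy to add functionality without touching the application core.
terms = Blueprint("terms", __name__, url_prefix="/terms")

@terms.get("/<int:term_id>")
def get_term(term_id):
    # Placeholder response; the real app serves terms from its database.
    return jsonify({"id": term_id, "term": "ablation", "status": "vernacular"})

def create_app():
    app = Flask(__name__)
    app.register_blueprint(terms)
    return app
```

The application-factory pattern also makes the app straightforward to test and to run under different configurations, which matters for continuous integration.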
We were able to present our proposed enhancements at MTSR, and the subsequent publication will hopefully serve as a roadmap for future fellows. Finally, there's a draft community engagement plan in place that will be available to this year's fellows, should they find it useful.

An example serves to illustrate how YAMZ works. The Global Cryosphere Watch glossary represents a compilation from 27 sources in a centralized location. When the terms are imported into the working YAMZ prototype, we see that there are variations on the term "ablation." The original intent of the GCW was to merge terms from disparate sources into an authoritative vocabulary, but it had existed in its current state for a number of years. The subsequent import into the metadata dictionary allows for ranking of and feedback on terms through a sociotechnical interface that has the potential to facilitate this process, and analogous processes, across disciplines. So, in conclusion: maintaining continuity over time when project participants change can be difficult. LEADING provided a way to ensure the long-term continuity of a promising project across funding cycles, and it invites insights and contributions from the community it serves.

Will you join me in giving these people another round of applause? Thank you. I'm Rachel Frick, I'm with OCLC, where I'm one of the executive directors in the research department, and I'm going to briefly talk about the LEADING data challenge. Although I might have been one of the people who initially had this idea, I handed off its execution to my former colleague Andrew Pace. As a LEADING education hub, OCLC hosted a data challenge. It was originally to take place in person in Dublin, Ohio, and to include both the LEADS fellows and the LEADING fellows, but as with everything else last year, we pivoted to a virtual event. The intention of the data challenge was to create an educational opportunity for the fellows to gather, collaborate, and exhibit their skills while working with various sets of data. When we adjusted the approach to a virtual event, we took a page from our colleagues at WebJunction: they had developed a course around virtual escape rooms, and we used some of those approaches in this data challenge and in how we organized the work. Over the course of two to three afternoons, we had enough fellows and auxiliary staff to field three teams of four fellows plus additional folks. They worked with a particular data set, a large data set from OCLC WorldCat focused on children's literature, and we asked them to create a challenge statement, analyze the data, create a visualization, and then present that back for judging.
All the participants received a stipend for participating, but they also competed for a cash prize; a little bit of incentive there. As you can see, the winning challenge statement was an inquiry around, and I'm trying to read this from far, far away, basically how well libraries were serving communities based on language, comparing language information in the collections data against community population data. So how well are they serving those populations? I'm happy to say that we got a lot of positive feedback on the data challenge, and this is something we're going to be replicating this year, taking some of the plus/delta feedback to make improvements. We're also playing around with some questions: do we still want to do it virtually, since we were able to have more participation that way? Should it be a hybrid event, both in person and virtual? And should we have these fellows compete with some real-world data scientists in our libraries? So if you're interested in rumbling with the fellows, let me know. I'd also like to say I was really grateful for the work of my colleague Andrew Pace as well as our data science team at OCLC. And with that, I'm going to open it up for questions for our fellows here.

We had a couple of questions. Are there any questions from the group here today that you would like to ask the fellows? I know we only have about five minutes left, and we stand between you and the reception, so I wanted to ask you all a couple of questions. You all had different and interesting experiences, but how did you find the program overall, given the support through all these different types of challenges? Did you find a common tie between all your projects? Was it good to work together as a cohort through the boot camp, even though your fellowships might have had different data challenges?

I mean, the boot camp was great, first of all, because it was just a chance to get to know everyone, and it kind of put us on the same page in terms of the data science requirements. So it sort of forced you to go through some formal learning, which I think is a great way to start a program like that.

For me, the boot camp was a great way to figure out where exactly we might start and what the particular sites would involve as far as data science. It was also a great refresher on what it entailed, what we needed to learn, and how that could be applied. Anybody else?

So where do you see your projects headed after the fellowship? Is this something you're going to continue working on? Are there skills you developed that you'll be able to apply in your next venture?

What we're hoping for is continuity, to keep this project going so the fellow who comes after can develop the project further. We put together a plan, though I know it's up to them to implement it, to engage community members and build the project. And the Metadata Research Center at Drexel really took ownership of that process, because vetting metadata is something they do institutionally, and it has decided to become a co-sponsor of the project.

Great. I could take that. I'm definitely planning on continuing to collaborate on the Proceedings corpus text analysis, but one of the reasons I chose that particular project to apply for is that it is so directly relevant to the work I do at the marine station.
As a subject librarian, I have the luxury of focusing on a relatively narrow set of stakeholders at the marine station, and that's the lens I look through in my experience and my work as a librarian. My understanding of their research process is that they're not going to come to the library to look for scanned things to work with. What they really want is research data, right? So I'm very motivated to take these extremely rich collections I have in my library and get them into a format they will actually use. Not only am I interested in continuing to work with the Proceedings corpus, I'm also taking what I did in the fellowship to my own collections in the library, and in many cases those are easier collections to work with, because the Proceedings is just huge and weird. I mean, they talk about who showed up to the meeting that day; it's just all over the place. I have collections like student research papers, theses and dissertations, and research data sets, so the work is well scoped for me. I'm already applying a lot of the processes I used in the fellowship to my own work, and it's been extremely useful.

Great. Chris?

I was going to say, yes, I do plan to continue the work on it, and I got an opportunity to present with my cohort member at Electronic Resources & Libraries, as well as to publish a paper on it through the American Society for Engineering Education, which should be coming out. It has also piqued my interest in really trying to refine the survey, because I'm curious how a survey given to engineers and scientists on the UIUC campus would work in finding out how they really see the services we offer. So those are some areas I definitely plan to keep looking into.

Any other comments from the panelists here?

As far as maintaining the Streamlit site, I plan on maintaining and updating it, pending a few updates from spaCy and Streamlit, but hopefully it's something people can fork, add to themselves, and branch out from there.

Great. One more chance to ask the fellows any questions? All right. Well, if you want to know more about LEADS or the LEADING program, or more details about the work, as Erik said earlier, you can go to the LEADING website, and keep watching to see how this project goes in our second year. And thank you very much. Thank you.