Good afternoon, everyone. We're going to get started. Hi, I'm Jane Greenberg, a faculty member at Drexel University's College of Computing and Informatics. It's great to see people. It's been just great to have an opportunity to interact with colleagues and friends; I forgot how much fun it is to attend a conference, so thank you to CNI. Getting dressed up in work clothes, that's another story, but it's really great to be here, and it's our absolute pleasure to have an opportunity to share about the LEADING program. LEADING is an IMLS national digital infrastructure initiative, developed to scale up the original LEADS program. LEADS stood for LIS Education and Data Science, and it was also a nationally supported IMLS project, developed for iSchool doctoral students across the country. iSchool doctoral students came and participated in Drexel's data science education program: they completed an online curriculum and then a boot camp, and then they had immersive experiences working with LEADS mentor sites, major libraries, archives, data centers, and so forth. They worked with real data, and it was an intensive summer program for PhD students. Every time I or some of my colleagues spoke about LEADS, we would hear from early- and mid-career people: "Ooh, can I participate in something like that? I would love to participate in something like that." So the motivation for LEADING is to make the program available to early- and mid-career people, and the LEADING project now combines LIS early- and mid-career fellows with iSchool doctoral students. You'll see on the right-hand side the overall model, and on the left-hand side there are what are called nodes and mentor sites. Many of the mentor sites were participants in LEADS and stayed on through the LEADING project.
On the right, you see more of a visual, with a sort of triangle in the center where you see Drexel, San Diego, and Montana State University; we oversee the coordination of the fellows that participate in the program. OCLC is also a hub, a co-educational hub, and Drexel University serves as the chief coordinating hub and oversees the data science program. A little bit about the data for the LEADING program: as I said, there are four hubs and 15 nodes. We have 31 dedicated mentors who work with the fellows and eight Drexel LIS data science faculty. We also have a diversity, equity, and inclusivity task force, made up of LEADS fellows from the earlier program, iSchool doctoral students, some of whom have moved on now to faculty and professional positions. They've been integral to helping us with recruitment, reviewing curriculum material, and so forth. We also have an advisory board that helps oversee the LEADING project as we're growing this network. And I just want to pause for a second: if you are at an institution that has had a LEADS fellow or has a LEADING fellow, if you are a mentor, or if you're a member of our advisory board, could you please wave your hand or something so we can just sort of see? There's a few of you around here, and I want to say, right over there, thank you, because you're really what helps make LEADING successful. I also want to give a little kudos to Rebecca Koskela, who's sitting there. Rebecca, sorry, you've got to put your hand up. Part of the model for LEADING came from what I learned from Rebecca while participating in the DataONE DataNet and working with fellows. Even though the fellows come to Drexel for the data science boot camp, they do the fellowship virtually.
So there's our data: there are 24 fellows in the first cohort, and here's our map. There's an interactive map on the LEADING website so you can see how we're expanding; we reach as far as Hawaii, and we've got Jonathan here today. The map can't quite do it justice here in a slide because there are hubs on top of nodes on top of mentor sites. And here are our 2021 fellows. I wish we could have them all here, but we are absolutely thrilled to have six of our fellows here, and you'd rather hear from them than me. So I'm going to turn it over now to our first fellow, Jay Winkler. Thank you so much.

Hello. Hi, my name is Jay Winkler, and I am the metadata archivist at ICPSR, a social science data archive housed at the University of Michigan. I've been working with a team from the Temple University Libraries on a project involving enhancing and analyzing Wikidata records for Black artists from the Philadelphia area. My project mentors, Synatra Smith, Alex Wermer-Colan, and Holly Tomren, worked with a team from the Philadelphia Museum of Art to conceptualize the project, and I worked with a second fellow at my site, Rebecca Bayeck of the Schomburg Center at the New York Public Library. For the unfamiliar, Wikidata is an attempt to apply structured data to all of the world's information. It's built on RDF triples, so each item can have statements attached showing the properties of that item; in our research, our most common triple was that a person has the occupation "artist." Our project had two phases. The first was to make a large number of Wikidata edits to help the Philadelphia Museum of Art improve its presence on Wikidata and also improve the presence of Philadelphia's Black artists more generally. To those ends, I personally made about 1,500 Wikidata edits, with my partner making a similar number.
Most of those edits were made using OpenRefine, which has really robust tools for direct ingest into Wikidata. Phase two involved designing queries in SPARQL, a SQL-like language for querying RDF triples, and we used that to gather information from the Wikidata Query Service. We developed a large query as a team that gathered demographic information about the artists, and we used that to form our research questions. What I became very interested in was how frequently race and ethnicity were tagged on Wikidata. I saw that African-American was the most commonly used ethnicity in our data set, but that its use was far outpaced by null entries. Almost everyone who has their ethnicity tagged at all on Wikidata is a person of color, making invisible the overall whiteness of the platform. Obviously not every untagged person is white, but I believe that Wikidata's structure and its editors' habits make it really easy to treat whiteness as a default. My main research focus became finding a way to track whether the collection of artists with tagged ethnicities was growing or shrinking over time relative to the whole Wikidata collection. The query service won't return the creation date for items, so I ran a SPARQL query to get a collection of every artist on the platform who was born in the United States, and then I was able to run that list through the Wikidata API to gather the creation dates for each artist's item. My research question was how this is changing over time, and ultimately my results concluded that it has stayed remarkably steady. I have two caveats up front. First, I did collapse all other tagged ethnicities into a single "other" category. That is not ideal, but the numbers for most are really small, and having "other" as a single category still provides useful insight on whether or not ethnicity tagging is happening at all.
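To make the querying-and-tracking step concrete, here is a minimal sketch, not the team's actual code: the general shape of a SPARQL query against the Wikidata Query Service for artists with an optional ethnic-group claim, plus a small helper for computing the tagged share over time from creation dates fetched separately via the API. The simplified US filter (country of citizenship rather than place of birth) and the sample values are illustrative assumptions.

```python
from collections import defaultdict

# Sketch of the query shape: P106 = occupation, Q483501 = artist,
# P172 = ethnic group. OPTIONAL keeps untagged people in the results,
# which is exactly what makes the null entries countable.
ARTIST_QUERY = """
SELECT ?artist ?ethnicity WHERE {
  ?artist wdt:P106 wd:Q483501 .               # occupation: artist
  ?artist wdt:P27  wd:Q30 .                   # simplified US filter (assumption)
  OPTIONAL { ?artist wdt:P172 ?ethnicity . }  # ethnic group, if tagged at all
}
"""

def tagged_share_by_year(items):
    """Given (creation_year, ethnicity_or_None) pairs -- the creation
    dates come from the Wikidata API, since the query service won't
    return them -- compute the fraction of items tagged per year."""
    totals, tagged = defaultdict(int), defaultdict(int)
    for year, ethnicity in items:
        totals[year] += 1
        if ethnicity is not None:
            tagged[year] += 1
    return {year: tagged[year] / totals[year] for year in totals}
```

A flat or falling curve from `tagged_share_by_year` is what "stayed remarkably steady" would look like in this framing.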
Second, I don't think this data is sufficient to conclusively say whether or not any given ethnic group is underrepresented on Wikidata, but I do think the defaulting of whiteness is evident in this data. In the total collection, Wikidata's items about African-Americans as a percentage of all artists settled in at around 6% in 2014 and have stayed within a percentage point of that ever since; for reference, African-Americans represent about 13% of the US population. "Other" accounted for about 4.5%, meaning around 89% of artists have no tagged ethnicity. When I broke this down by city, I found that Detroit had an unusually high "other" percentage, but most cities track pretty closely to the national data. One thing I want to point out is the Philadelphia subset: first of all, Philadelphia has always far outpaced other cities in African-American representation, always hovering closer to 15%. But if you look in the little circle on the chart, you can see the effects of our project, which I found really gratifying: there's about a month where it goes from 15% to 18%, and that's clearly due to phase one of our project. My big takeaways from LEADING in general are that I feel like I have a much stronger foundation in data science coding and a lot more confidence solving problems through code. It's also given me a lot of direction as an early-career academic in my first role where publication is an expectation. When the fellowship ends this month, I'm going to continue working with the Detroit Institute of Arts to do a Detroit-based version of phase one of our project, and then I'm going to complete some independent research tracking these sorts of patterns in Wikidata's gender and sexuality data. Thank you very much, everyone.

Hi, everyone. My name is Emily Pingo-Brian, and I'm the digital repository metadata librarian at Worcester Polytechnic Institute in Worcester, Massachusetts.
My project is unlocking and linking World War II Japanese-American incarceration biographical data, with a focus on social network analysis. I had a partner for this fellowship project, Lencia, whose presentation will follow mine. For the project, we used historical records obtained by Richard Marciano from the University of Maryland, who served as our mentor; thank you, Richard. Part of the historical dataset I used was the incident cards, also known as index cards, from the Tule Lake concentration camp located in California. These cards included descriptions of events and incidents, including crimes that incarcerees were accused of by public officials, including camp police. The information described the event, the category or type of event, as well as other information about the individuals. What I was really interested in, and my question for this project, was: how can we use this archival historical data to show how the incarcerated people connected with each other, to reveal relationships and other events, especially communal acts of resistance, and to tell us more about the life of Japanese-Americans in these concentration camps? The objectives were to promote the access and use of historical public records as data, to understand the history of the records while using the data ethically and responsibly, and to create a model overall for other narratives of forced displacement in America. The approach that I used was social network analysis, a field of data analytics that uses networks and graph theory to understand social structures. In this case, social network analysis involved extracting data from the data set of index cards and applying graph algorithms found within networking tools, specifically using Python libraries including Plotly and NetworkX.
One of the accomplishments we're really proud of is that we created Jupyter notebooks; I was able to create one on social network analysis that included the Python computational work as well as the output, which is these visualizations. In the visualization on this slide, the large blue circle is a simple node graph made with NetworkX, and it's pulled from a particular event that occurred on November 4th, 1943, at the Tule Lake concentration camp. It was a congregation of Japanese-Americans who were dissatisfied with the situation, the lack of supplies and lack of food, and this was deemed a riot by public officials. So looking at November 4th, the category of the card was "riot," and there were almost 4,000 cards recorded from that particular event, including the individuals accused of being involved in that riot. While it can be a difficult graph to read, you can see the scope of the event itself and how many people were involved, or accused of being involved. I was really interested in this riot event and wanted to know more about what led up to it and what followed, to tell us more about the events and the lives of the incarcerees. One particular event I found through the computations: on November 1st, 1943, there was a theft, so this occurred before the riot, and there were nine cards with the same exact description, of nine accused incarcerees found picking vegetables in an area of the farm where they did not work. I used another Python library, in this case Plotly, to create a bar chart: the x-axis is the different event categories from the index cards, and the y-axis is the number of cards. This confirmed that, yes, there were nine cards from that specific event, but it also shows other events leading up to the riot event and following it, including disorderly conduct and, after the riot, those who were jailed in the military stockade.
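As a hedged sketch of how card data can become a network: the rows and names below are made up, not the real Tule Lake records, and the project's notebooks used NetworkX and Plotly for the actual graphs and charts. This shows only the core extraction step, where people named on cards from the same event become linked nodes, and card counts per category become bar-chart input.

```python
from collections import Counter
from itertools import combinations

# Hypothetical incident-card rows: (person, date, category).
cards = [
    ("person_a", "1943-11-01", "theft"),
    ("person_b", "1943-11-01", "theft"),
    ("person_a", "1943-11-04", "riot"),
    ("person_b", "1943-11-04", "riot"),
    ("person_c", "1943-11-04", "riot"),
]

# Group the people named on cards from the same (date, category) event,
# then link every pair -- these edges are what a NetworkX node graph
# like the one on the slide would draw.
events = {}
for person, day, category in cards:
    events.setdefault((day, category), set()).add(person)

edges = sorted(
    {tuple(sorted(pair))
     for people in events.values()
     for pair in combinations(people, 2)}
)

# Cards per category: the input for a Plotly bar chart of event types.
cards_per_category = Counter(category for _, _, category in cards)
```

From here, `nx.Graph(edges)` would reproduce the node-graph structure, and the category counts feed the bar chart described next.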
From the examination of these results, the theft and other events like disorderly conduct were clear displays of the rising discontent in the Tule Lake camp. They also lead to questions regarding the informal networks of the accused incarcerees and other intentional acts: was the theft planned, what was talked about among the accused, and is this why some of the incarcerees were involved in the riot later? With social network analysis, we can easily view different interactions between individuals over time and how they relate to other events, ultimately presenting additional opportunities for analysis and research. This has been a really amazing project to work on. I feel very honored to work on this project and to work with Lencia, as well as with my mentors, Richard and also Greg Jansen from the University of Maryland. Personally, I've learned a lot about the technical aspects of working with the data, how far I can push myself and how far I can work with the data itself, but also, most importantly, how to work with historical data respectfully and sensitively. My hope is for others to use these computational notebooks as examples and guidance when working with archival data, and to continue to develop new strategies and ultimately create new opportunities for research, especially with data involving other historical events of forced displacement. Thank you very much.

Hi, everyone. My name is Lencia Beltran. I'm the special collections archivist at UNC Wilmington. As my partner Emily said, I'm also working on the Japanese-American project. My focus and interest has been to visually understand the movement and spaces of how Japanese-Americans lived while incarcerated during World War II. With that being said, I used geographic data taken from three separate archival records.
That includes one of the data sets Emily mentioned during her presentation, the incident card or index card data set. Before I could create the interactive maps, I had to find the locations of all five movements for 25 selected individuals. I selected the individuals based on how often they appeared as being involved in the uprising at Tule Lake. Once I found those locations, I could then find the latitude and longitude, which is what I used to create the points on the maps. This first visualization here shows all five movements for one individual person out of the 25. Those movements are the point of origin, the assembly center, the first incarceration camp, the relocation to Tule Lake, and the final departure state. To briefly describe what you're seeing here: the individual originated in San Diego, California. In 1942, they were sent to the assembly center in Santa Anita, and from there they were sent to Jerome, the first incarceration center, in Arkansas. In September of 1943, they were relocated to Tule Lake, and it's recorded that in 1946 their last departure state was Pennsylvania. Now, if I could have you look at the map on the lower bottom: this map shows the number of individuals at each location, as marker size, for all five movements. What I want you to take away from this map is that if you compare it with the first one I showed you, you'll see that not all 25 individuals were sent to the same assembly center or the same first incarceration center, and they didn't all have the same final departure state. And then the map that is on the top right... oh, real quick, I'm going to backtrack: the first and second maps were created using Plotly, which is one of Python's libraries, and the third map was created using Folium, again another Python library.
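The aggregation behind the second map can be sketched in a few lines. The movement records below are hypothetical stand-ins, not the archival data, and the real maps were drawn with Plotly and Folium from latitude/longitude points; this shows only the counting step that sizes each location's marker at each of the five movement stages.

```python
from collections import Counter

STAGES = ["origin", "assembly_center", "first_camp", "tule_lake", "departure"]

# Hypothetical rows: one person, five ordered stops.
movements = {
    "person_1": ["San Diego, CA", "Santa Anita, CA", "Jerome, AR",
                 "Tule Lake, CA", "Pennsylvania"],
    "person_2": ["Los Angeles, CA", "Santa Anita, CA", "Jerome, AR",
                 "Tule Lake, CA", "California"],
}

def location_sizes(movements):
    """Count how many of the selected individuals passed through each
    location at each stage -- the marker sizes on the combined map."""
    sizes = {stage: Counter() for stage in STAGES}
    for stops in movements.values():
        for stage, place in zip(STAGES, stops):
            sizes[stage][place] += 1
    return sizes
```

Each location would then be geocoded to a latitude/longitude pair before plotting.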
Folium is an alternative way to make interactive maps to visually see movements. Folium uses layers, so it allows users to select or deselect what is visibly shown if they want to see specific information. It also allows the creator to add GeoJSON or shapefiles, which I added; those are the blue shapes on the map, and they represent evacuation orders. And real quick, the image between the two maps is the code that I used to filter the data to find the locations. In conclusion, my experience with LEADING and as a LEADING fellow has been a positive one. I learned a lot from my two mentors, Dr. Richard Marciano and Greg Jansen from the Advanced Information Collaboratory; I also learned a lot from my partner, Emily; and I learned that data science and computational approaches can be used for historical research, not just numerical data. Thank you.

Okay, hello. My name is Hyun Seung-goo. I'm an assessment librarian at the University of Northern Iowa. My project is a university library project focused on data and decision making for consortium acquisition. A particular goal of this project is to create a visualization tool that would help identify which Boston Library Consortium (BLC) members can work together to save or share acquisition costs, by showing subject-based e-book usage patterns. To achieve this goal, we exported consortium e-book usage data from 2018 to 2021 from the JSTOR admin portals and prepared it for analysis as a pilot study. We then created multiple network graphs modeling the overlapping usage of titles in particular subject areas across BLC members. In particular, we chose a bipartite network graph, built with Python and its library NetworkX, where one kind of node is a BLC member and the other kind of node is a subject.
The line between a member node and a subject node represents an institution's usage of a particular subject area, divided by its usage across all subject areas. Here is a particular example of a bipartite network graph. If you look at "history" at the center and the lines from history to the individual BLC members, the thickness of those eight lines is similar. This means that these eight BLC members have a similar e-book usage pattern in terms of history. However, if you look at "philosophy" on the right side, you will notice that philosophy is a unique subject of BLC member 1, whose e-book usage in it does not have a similar pattern to any other BLC member; so BLC members might not need to consider cost sharing in terms of philosophy. Additionally, if you look at "sociology" next to history, you will notice that the lines from sociology to the individual BLC members are also similar in thickness to each other; however, compared to history, the lines are thinner, reflecting lower e-book usage. So one takeaway from this project: we were able to see which subject areas are strong or weak candidates for consortium acquisition. History and sociology are good examples of strong candidates for consortium e-book acquisition, and philosophy is one example of a weak candidate. In terms of takeaways for me, this project let me step further toward my goal of becoming a well-rounded assessment librarian, which I mentioned in my application form, by helping me gain immediately applicable collection-related knowledge and skills that I did not have before, and by helping me build an extensible partnership with the BLC. Going forward, we are planning to gather more data from more vendors, and we are planning to use LC classification instead of JSTOR disciplines.
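The edge weight just described, a member's usage of a subject divided by that member's usage across all subjects, can be sketched as follows. The usage counts are invented for illustration; in the project, weights like these set the line thickness in a NetworkX bipartite graph.

```python
# Hypothetical usage counts: outer key = BLC member, inner = subject.
usage = {
    "member_1": {"history": 120, "sociology": 80, "philosophy": 300},
    "member_2": {"history": 110, "sociology": 90},
    "member_3": {"history": 130, "sociology": 70},
}

def edge_weights(usage):
    """Weight for the (member, subject) edge: that member's usage of
    the subject divided by the member's total usage over all subjects,
    so a thicker line means a larger share of that member's reading."""
    weights = {}
    for member, subjects in usage.items():
        total = sum(subjects.values())
        for subject, count in subjects.items():
            weights[(member, subject)] = count / total
    return weights
```

Members whose edges to the same subject have similar weights are the "strong candidates" for sharing acquisition of that subject.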
We are also planning to build an interactive network web application that would enable individual users to configure their own thresholds until they get meaningful insight. And I really want to thank everyone for their support and their willingness to meet with me, talk with me, and help me. Thank you.

I'm Jennifer Proctor. I'm a Ph.D. candidate at the University of Maryland, College Park, and I worked with OCLC on a project to detect missing diacritics in catalog records that were ingested into OCLC from works published in Russia. Okay, here we are. The project goals were to build a machine learning model capable of identifying incorrect Cyrillic MARC records on ingest, and at the scale OCLC works, this is a big data problem. We also wanted to develop methods for identifying and mapping corrections for transliteration errors. In order to do this, we needed to label 4.5 million rows using OpenRefine and Python. We tested dozens of machine learning algorithms, encountered many scalability issues, and got a lot of "33 terabytes of RAM cannot be allocated" errors. We also had to do some research on other diacritic restoration projects, of which there are about one to two dozen around the world, in about six-ish languages, so this is an emerging area. And we had to develop new methods for splitting transliterated text into compound characters that would fit trigrams, 4-grams, and 5-grams. Because while in Cyrillic you can just count three letters, four letters, five letters, when you transliterate, one letter can turn into four. It can have diacritics; it can have stress marks that look like quotation marks that the software then doesn't know what to do with. And on top of all of that, different catalogers can parse those inconsistently, which all complicates using computers and AI to parse these records. We labeled the data, built machine learning models, reassessed the data, scaled the models, and then added some explainability methods.
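The compound-character splitting problem can be illustrated with a toy example. The mapping below is invented, not OCLC's actual rules; it just shows the idea that multi-letter transliterations must be treated as single units before n-grams are counted, so that "three letters" in Cyrillic stays "three units" in the Latin transliteration.

```python
# Toy digraph/tetragraph table (longest match first) -- an assumption,
# not the project's real transliteration rules.
COMPOUNDS = ["shch", "zh", "kh", "ts", "ch", "sh", "iu", "ia"]

def split_compound_chars(text):
    """Greedily split transliterated text into units, treating each
    multi-letter transliteration as one unit (one Cyrillic letter)."""
    units, i = [], 0
    while i < len(text):
        for compound in COMPOUNDS:
            if text.startswith(compound, i):
                units.append(compound)
                i += len(compound)
                break
        else:
            units.append(text[i])
            i += 1
    return units

def ngrams(units, n=3):
    """Build n-grams over units, not raw characters, so a trigram here
    corresponds to three Cyrillic letters."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]
```

Counting n-grams over these units rather than raw characters is what keeps the Latin-side statistics aligned with the Cyrillic originals.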
Throughout all of our challenges, I remained super excited about this project, because the challenge for libraries, archives, museums, and natural language processing technology in general of handling this kind of linguistic complexity is what drove me into PhD research, and this is the first opportunity I've had to get my hands on data and actually work on this kind of project. So it's been fantastic. These are the kinds of results that we've been looking at. For the calibrated classifiers, the first model that we got results from used a hash on the title, turning strings into numbers so that they could be calculated. We hashed it with two to the eighth columns per title, and that got us some results, but as you can see, all the curves kind of converge together, which isn't ideal. Two to the third got us a little bit better; they tended to go into their own little boxes. But it's still not great. So on top of that, we still needed to use rules-based and statistical modeling with n-grams. The other chart shows some of those n-grams, once we figured out the problems with separating out what counts as one character, when one character can actually be four different characters that represent one Cyrillic letter. So we are looking at a combination of machine learning models, rules, and statistical modeling together to give us this capability of helping OCLC clean up its Cyrillic ingest at big scales.

Okay, hello, everyone. My name is Jonathan Young. I am a science and technology librarian at the University of Hawaii at Manoa, and my project was with Montana State University. Our goal was to develop a measure for an academic library to calculate a return-on-investment value, or cost-benefit analysis, based on the grant revenue coming into a university that's supported by library subscriptions. And first I want to say a disclaimer.
Obviously the library supports the university in many ways beyond just grant revenue, but this is one simple way to apply data science to get a kind of lower-bound estimate of how much support the library is giving to the university. My approach is shown in that figure, and it basically adds two factors to the total grant amount, using the assumption that library support of grant proposals is primarily through the references cited in the proposal. You can see again why this is an assumption and a lower bound. Those two factors are, first, the library support percentage: the proportion of the cited references found in a grant proposal that are provided through a library subscription. And second, the citation dependence: how much of a grant's probability of success is actually dependent on having cited references at all. But the main problem is that if you look for the data, and certainly this is true for Montana State and for the University of Hawaii at Manoa, the data you're given on grant proposals doesn't include the proposal itself, and it definitely doesn't include the cited references used in the proposal. So I needed to come up with an approach to estimate the cited references that were used in a proposal. I first tried natural language processing, and that failed completely, but I settled on using Web of Science data, looking at the cited references in a principal investigator's publications and using those as a proxy for the cited references they would use in the proposal. I combined four data sets. One was a validation dataset from a public database, ogrants.org, that did have cited references in the full proposals; it showed that the process worked, and it also showed that there was a difference between cited references in funded and unfunded grants, so it's great to see the library actually does make a difference.
That also allowed me to calculate the citation dependence term, by estimating the probability of success for a grant with zero references. The other three data sets were the Montana State grant data, the Web of Science data, and the Alma library subscription data, and I combined those to get that final library support value; you can see the result of plugging all the numbers into the formula there. For Montana State, over the last five years, we come out with an ROI of $4.83 of grant revenue supported by the library for every $1 of library materials budget, which is consistent with previous work using non-data-science methods such as surveys. So that's good to see, and I hope it'll be useful to library directors like Kenning from Montana State. But the data science approach actually has another advantage in that it can provide more data at a journal-by-journal level, which is what I'm most interested in, not as a library director but as an academic selector and liaison. That was one of the great things about the LEADING project: it provided the flexibility for me to tailor the project in a way that was going to help me in my role. As an example, on the bottom left: at the start of the pandemic I had just started my position, and I was told, like I'm sure many universities, that I needed to re-evaluate some of the journal selections. So I created a chart like this with as many data points as I could, such as usage and impact factor citations, but it was missing what role these journals were playing in our grant revenue at the university, so I've been simultaneously trying to focus on that, and I'm very happy that the LEADING project has been allowing me to do that. Going forward, kind of in the far future, I'd also like to integrate it into a visualization like the network map I show on the right, to help my role as liaison as well. I thank the LEADING program for giving me the kickstart to be able to do that kind of work as a librarian.
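The ROI calculation described above can be sketched as a one-line formula. The numbers plugged in below are made up for illustration, not the Montana State figures, which came from the grant, Web of Science, and Alma data sets.

```python
def grant_roi(total_grants, library_support_pct, citation_dependence,
              materials_budget):
    """Grant revenue attributable to library-licensed cited references,
    per dollar of library materials budget: scale the total grant
    amount by the library support percentage and by the citation
    dependence, then divide by the budget."""
    supported_revenue = total_grants * library_support_pct * citation_dependence
    return supported_revenue / materials_budget

# Illustrative inputs (assumptions): $100M in grants over the period,
# 60% of cited references library-licensed, 40% of success probability
# attributed to having citations, $5M materials budget.
roi = grant_roi(100e6, 0.60, 0.40, 5e6)  # -> 4.8 dollars per dollar
```

Because both factors shrink the numerator, the result reads as a lower bound on the library's contribution, matching the disclaimer above.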
So I have the pleasure of wrapping this up here. I would just like to say thank you to all of our presenting fellows for really amazing work. This is just a selection of the incredible 24 projects that we got to work with this year through our 24 fellows. Even though we're just in our first year of the LEADING program, extending the LEADS program, we've already seen some fellows secure new jobs coming out of this, and one of the big goals of this program is to support the career development of early-career professionals. We will be launching the 2022 applications for LEADING, so if you have doctoral students or early-career professionals who would be excited to be part of this program, we would really like to talk more, and we would certainly be glad to see applications to the program. As I wrap us up, we would really like to thank IMLS for their support of this project, and a big thank you to our advisory board, including our good friend Cliff Lynch, our diversity, equity, and inclusion board, and all of our fellows, mentors, and faculty who are making this program a success. You can stay in touch through the Metadata Research Center at cci.drexel.edu; it's up in the slides we'll be sharing. I think we've got about six or seven minutes for questions, so I'll turn the floor over to all of you for the questions you have. Thank you. I'll give folks a moment to join us. Maybe a question for our fellows: I think you presented really exemplary products, and we know that one of the challenges you ran into was actually scoping your work. Would any of you like to reflect on what that experience was like?
I think I was somewhat lucky in that a lot of my scoping actually came from my mentors. I've got a forthcoming blog post about the actual queries I ran, and at one point in the blog post I say we cheated a little bit by creating a Wikidata list prior to starting the project. Had my mentors not created that list, a lot of my queries would have just timed out because they were way too big, so I got lucky in that sense. Other than that, when I started moving away from some of the stuff that was prepared for me in advance, a lot of it was just: what are the cities I find the most interesting, and is there any way I can integrate Detroit and/or Ann Arbor so I could take it a little more local?

A lot of my scoping work came with when do I stop labeling and start modeling, and how do I deal with not knowing how long a model will need to run. I just have to wait and see whether it's going to finish, whether there are going to be results when it finishes, or whether it's going to crash while I'm sleeping and I'm going to have to start over again. So some of the results are this weekend's results; I did my best to scope, and we got there, but some of it was a little hair-raising.

I'll give our audience members one more moment while I shield my eyes from the blinding light... I'm not seeing any movement, so I will thank you all for being part of our session. We're really looking forward to continuing this project with all of our fellows and mentors, and so thank you so much.