I think it's about time to get started. Let me welcome you all to the second day of the fall 2020 CNI virtual member meeting. I'm delighted you joined us today. I'm Cliff Lynch, I'm the director of CNI, and I will be introducing the session very briefly. After we hear from our speaker, Diane Goldenberg-Hart from CNI will beam in and moderate the question and answer session at the end. You have both the chat tool, which you should feel free to use to comment on the discussion as we go along, and also a Q and A tool, which I invite you to use to pose questions at any point, although we will get to all the questions at the end of the session. I'll also note that we do have closed captioning available, which you are welcome to turn on if you wish. And for those of you who didn't hear the conversation between Bill and me at the beginning of the session, the session is being recorded, and the recording will be available afterward through our usual channels. And I think that's all my announcements. So let me move on and take us right to Bill Ingram from Virginia Tech. Bill is going to talk to us today about mining electronic theses and dissertations to identify and understand trends in graduate research. I'm familiar with a fair body of work that tries to mine the scholarly published literature, journal literature, to identify emerging trends, but the ETD corpus is really quite a special corpus with some unique properties, and I think it offers some unique insights into things. So I'm going to be very interested to hear what Bill has to tell us about this. And with that, thank you for joining us today, Bill, and over to you. Great, thank you. I hope everybody can hear me, and thank you for the warm welcome. So I'll just jump right in. I am a librarian at Virginia Tech and a researcher, and my research explores the application of computational methods and techniques to library collections.
So the idea of collections as data, and using computational methods to mine them; I'm interested in machine learning, natural language processing, and the like. And what you're looking at here is just a summary slide of the grant that I am working under. This is from IMLS. It's funded me for three years to explore all of these techniques against the corpus of ETDs. We're particularly interested in ETDs because they are longer documents, they resemble books, and we're focusing on three areas, information extraction, classification, and summarization, using machine learning and deep learning, and ultimately building better digital libraries by adding value through these services. This is the team that I'm working with. My co-PIs are Ed Fox and Jian Wu, from Computer Science at Virginia Tech and Old Dominion. And then there are two graduate students who work with us full-time: Bipasha works for me, and Muntabir works for Dr. Wu at Old Dominion. And I also wanted to just list a few names of students, past and present, who have been working in our lab. Two of them just finished master's degrees with theses on ETDs, and the rest of the folks listed down there are also working on ETD-related research. Those of you who were at CNI last year might remember that I gave a talk about this project, about bringing computational access to book-length documents. After the talk, I was approached by the chief strategy officer at ProQuest. I had mentioned in the talk that we were interested in the ProQuest subject categories for doing automatic classification. And so we had a nice conversation, and that led to me meeting the team that is responsible for the new TDM Studio, the text and data mining studio at ProQuest. And that led to conversations with, among others, the two folks that I've got on this slide, John Dillon and Austin McLean.
I'd actually known Austin before, but this opened up a collaboration which led to a pilot of their new software. So this talk is about that pilot, about the data that we were using, and about the study that we did, and I'll try to address all of these things. I want to give a quick overview of the TDM Studio, introduce the research question that we're trying to answer, talk about the data and the methodology, and then share some results, and hopefully have some time at the end for discussion. Before we do that, though, I just want to offer a disclaimer that this isn't a product endorsement for the studio. These are just my opinions; this is my research, and they don't reflect the views of ProQuest or of IMLS or of Virginia Tech Libraries or anyone else. That said, we did have an enjoyable and very positive experience working with ProQuest and with the TDM Studio. This is a sort of high-level overview of what the studio is. There's an interface where you select content, and for us, this was obviously ETDs, but I believe any content that your library is subscribed to is available, including a lot of newspapers, which I thought was very interesting; it runs right up to the present, and I think you could even bring in yesterday's news, perhaps even today's news, into what they're calling the workbench, which is a Jupyter interface for interacting with the data, either with Python or with R, and then exporting your results and graphing them, et cetera. And so I'll be showing you lots of graphs here at the end. I should mention that, although I'm not an expert on the studio, I'll try to field any questions, but there's a slide that I'll put at the end if you want to get in touch with the ProQuest folks, because they're the experts on this, obviously. In the beginning, you log in, and once you have an account, you are given this web interface for selecting the data that you want to bring into your studio.
So this should be fairly familiar for anyone who's worked with digital libraries: you select either publication titles or a particular database whose content you want to bring into your instance. I think what's interesting here is that the content rights have all been cleared for TDM. That's actually kind of a big deal, that you can bring in, like I said, newspaper articles, The New York Times. It's not so much of an issue with ETDs, since most of them are openly licensed, but for the content that other folks might be interested in, it's all cleared for doing text and data mining. And you can build datasets of up to two million documents, which we almost did. It's sort of a faceted search and browse. We chose the ProQuest Dissertations and Theses Global collection and drilled down into that because of, as you see, the two-million-document limit. So we had to winnow this down a bit in order to meet that limit, but we finally did. And once you have your collection set up, you set off the movement of the files. This takes a fair amount of time, but once you have the files moved over, you can interact with the data using a standard Jupyter notebook, which folks who are doing data science are really familiar with. This is just a screenshot of what that looks like. Now, the question that we wanted to answer. I should say that this pilot was three months, so we didn't have a lot of time, and what we wanted to do was a study to see what we could do with the data. This is a bit different from my normal research in that it's more of a text mining task, less of a machine learning, classification type of task, that we set for ourselves. But I still think it's really interesting. So what did we want to do? We wanted to see what we could learn through text mining of the ETD corpus about how graduate research topics have evolved over time.
And especially interesting is the interdisciplinarity between or among different majors or departments in graduate research. This is particularly interesting for me because, early in my career in libraries, I was working on putting together the systems to collect and display ETDs. And so now we've been gathering ETDs for almost 20 years, and we have this great corpus. It is really a reflection of research that's happening across the country, across the world. And it's just really exciting to be on the other side, to mine it and make use of it, especially in this way. So we were able to move over roughly 1.3 million ETDs. This is from the year 2000 to 2018. We were talking earlier, and I can't remember why we stopped at 2018, but I think it could have been that 2018 was the most recent year for which mostly full collections were available. We started around 2000 because we wanted to get born-digital documents; I didn't want the added complication of having to use OCR text. But still, this is a lot of dissertations. What you get in the studio is the full text, XML, and metadata. We wanted to get department metadata; that was important for the study. So we ended up having to winnow this down even further, down to just 600,000 documents, so that we would have the department metadata. And this is the top 20 departments that we harvested and are working with, minus the one there in the middle, "department not provided." With this dataset, we extracted the title, abstract, department, and year of publication. We organized this into batches by year and by major. And the intuition behind this is that the top terms present in the title and abstract would indicate the research topic of the paper. I'll say a bit more on that in a moment. The sources of data there ended up being, I think, over a thousand, but this is the top 20 institutions. This was completely random; we didn't really look at what the sources were.
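As an aside, the extraction and batching just described might look something like this minimal Python sketch. The XML element names and the five-year window are illustrative assumptions only, not the actual ProQuest schema or the study's exact batching.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real ProQuest XML schema differs.
SAMPLE = """<record>
  <title>Deep Learning for Gene Expression Analysis</title>
  <abstract>We apply neural networks to RNA-seq data.</abstract>
  <department>Computer Science</department>
  <year>2017</year>
</record>"""

def extract_fields(xml_text):
    """Pull out the four fields used in the study: title, abstract,
    department, and year of publication."""
    root = ET.fromstring(xml_text)
    get = lambda tag: (root.findtext(tag) or "").strip()
    return {
        "title": get("title"),
        "abstract": get("abstract"),
        "department": get("department"),
        "year": int(get("year")),
    }

def batch_key(record, span=5, start=2000):
    """Assign a record to a (department, year-window) batch,
    here using five-year windows starting at 2000."""
    lo = start + ((record["year"] - start) // span) * span
    return (record["department"], f"{lo}-{lo + span - 1}")

print(batch_key(extract_fields(SAMPLE)))  # ('Computer Science', '2015-2019')
```

Grouping records by such keys would give the per-department, per-period batches the rest of the pipeline operates on.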
We were more concerned with getting the numbers, but this is just how the spread looks, and then a little bit more detail here. As you can see in the top graph up there, the green one, there's a long tail of universities that the data came from, but the majority are from these top 20 shown here in pink. So, the methodology that we used. Again, what we're concerned with is trying to find out: what is the research topic of this paper? Our first attempt was to use TF-IDF, term frequency-inverse document frequency, a measure we used to try to collect the most important two- or three-word phrases across the corpus within a major. Initially this looked very promising. You can see from the first two columns, this is computer science. I don't know if you're seeing my mouse moving, but this is computer science here: user interfaces, digital libraries, et cetera. That seems good; these look like research topics in computer science. Over here in biology, these also seem to be working well. However, the problem was we were also turning up a lot of mainly irrelevant phrases, such as "result show" and "future work," "high level," these kinds of things, which just made way too much noise, so that wasn't working. What we ended up doing is using this tool called Wikifier that was developed by Dan Roth's lab when he was at Illinois. He has since moved to UPenn, but what the tool does is take a stream of text; it's mainly for named entity disambiguation. It runs the text through a search against Wikipedia and tries to return the entities that have matching Wikipedia articles. And so here you see the underlined terms; those are actually links to the articles in Wikipedia. And the bolded terms, I believe, are terms that it pulled out as context in order to disambiguate what the entity is. So this is not our work; this is just a tool that we are using.
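Before moving on, here is roughly what that first TF-IDF attempt over two- and three-word phrases looks like as a standard-library-only toy. This is a sketch of the technique, not the pilot's actual code; a real pipeline would more likely use something like scikit-learn's TfidfVectorizer.

```python
import math
from collections import Counter

def ngrams(text, n):
    """All contiguous n-word phrases in a lowercased text."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def top_phrases(docs, k=10):
    """Rank two- and three-word phrases by a simple TF-IDF score:
    total phrase count times log(N / document frequency). Phrases in
    every document score zero, but generic phrases like 'future work'
    that appear in merely *most* documents still leak through, which
    is the noise problem described in the talk."""
    tf, df = Counter(), Counter()
    for doc in docs:
        phrases = ngrams(doc, 2) + ngrams(doc, 3)
        tf.update(phrases)
        df.update(set(phrases))
    scores = {p: tf[p] * math.log(len(docs) / df[p]) for p in tf}
    return [p for p, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

docs = [
    "wireless sensor networks for environmental monitoring",
    "routing in wireless sensor networks",
    "gene expression in prairie grass",
]
print(top_phrases(docs, k=5))
```

Because plain n-gram statistics cannot tell a research topic from boilerplate, concept linking against a knowledge base, as Wikifier does, is the natural next step.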
If you're interested in the Wikifier, I suggest reading the paper that I have linked below. Okay, so, anyway, what we did was use the Wikifier to identify what the terms are, and then we were able to use this to rank them so that we could figure out what the topic of the paper was. This is a step-by-step account of what we did. The first step was, for every document in the batch, to use Wikifier to wikify the text that we extracted. And again, this text is just the abstract and title. Then we ranked those terms to find out what the research topics for that department or major were. The way we rank them is by calculating the document frequency across different periods of time. What I mean is: how many documents contained that phrase within a certain batch of time. And we normalized against the number of documents, so that if one document had just that phrase over and over and over again, it wouldn't unfairly skew the results. Then we plot the results on a graph, so that we can see what the highest document frequency terms ended up being for the department, and compare these with other departments. And then finally, we plot multiple departments, so that we can see what shared topics they have across them. So, I made a lot of these, and what I am going to focus on is the intersection of computer science and biology, which I just think is interesting myself. And I know the most about computer science out of any of these, not so much about biology, but let's explore the data. I'm going to show a series of these sort of bubble graphs, and I made these with Giffy on the data. This is computer science from, let's see, 2001 to 2005. And these are the major topics here, for various reasons, and we could talk about that if we have time, but there's some sparsity here in these early years. It doesn't really get interesting until, let's see, until here.
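Stepping back to the method for a moment: the per-batch ranking and the cross-department comparison just described can be sketched as the following standard-library-only Python. This is an illustration of the idea, not the study's actual code, and the min() rule for combining shared weights is an assumption, since the talk does not specify how shared topics were scored.

```python
from collections import Counter

def topic_weights(docs):
    """docs: a list of documents for one department and one time window,
    each document being the set of wikified topics found in its title
    and abstract. Count how many documents mention each topic (document
    frequency: a topic repeated inside one document still counts once),
    then divide by the number of documents, so differently sized batches
    are comparable."""
    df = Counter()
    for topics in docs:
        df.update(set(topics))
    n = len(docs) or 1
    return {t: c / n for t, c in df.items()}

def shared_topics(dept_a, dept_b):
    """Cross-department view: keep only topics present in both weight
    maps, scoring each by the smaller of its two weights (a topic is
    only as 'shared' as its weaker department)."""
    return {t: min(dept_a[t], dept_b[t]) for t in dept_a.keys() & dept_b.keys()}

cs = topic_weights([
    {"machine learning", "sensor networks"},
    {"machine learning", "gene expression"},
    {"data mining"},
])
bio = topic_weights([
    {"gene expression", "t cells"},
    {"gene expression", "climate change"},
])
print(cs["machine learning"])   # appears in 2 of 3 CS documents
print(shared_topics(cs, bio))   # only 'gene expression' overlaps
```

Plotting each period's weight map as circle sizes gives bubble graphs like the ones that follow.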
So now you can really start to see these research topics emerging, and the size of each circle represents the ranking, the weight of the topic. You can see here, now that we're nearing the end of the first decade of the 2000s, these are the topics that are emerging in computer science. And it's interesting: machine learning, of course, very big. There are these sensor networks and wireless sensor networks; you don't really see the sensor network stuff a lot anymore, so that was evidently very hot during this time period. You see social networks starting to emerge; that wasn't really in the data at all pre-2006. As we move into the 2010s, you see machine learning continuing to have a very strong presence here. The sensor networks are starting to get smaller, data mining is starting to get bigger, and social networking and social networks are still getting bigger. I will say that I put this together pretty quickly, and I didn't notice until it was too late that the plural versions of a lot of these things are represented separately. So for pairs like social network and social networks, just imagine that they're combined and bigger. But you can see how this is trending nonetheless. Finally, here is the most recent batch. Again, machine learning, but you're starting to see deep learning and neural nets really starting to rise, as well as big data. Oh, and there's something; let me go back one. If you notice, right here, really small, there's big data right in the middle. It really didn't take off until, I don't know, 2015, and now suddenly big data is huge, and I think if we continued this into 2020, we would see it continue to grow. Okay, so that's computer science. I wanted to do the same thing with bio.
This is particularly, I don't know, sort of funny to me: again, in these first few years, there isn't as much data, so the topics aren't emerging in the same way as they do later, but it seems that the biologists were interested in prairie grass for the most part. Then you start to see more of what I expected, with T cells, gene expression, genetic analysis. A lot of these words I don't even know how to pronounce; I assume these are genes that they're studying. This goes into, oh, let me go back for a minute. One of the things that's really been interesting in doing these is seeing how different topics emerge and grow. On this slide, you don't see anything about climate change, but it really emerges in the next batch of four years. So now climate change is suddenly on the map; people are interested in studying that, and that surprised me. I thought it would have shown up in the research much earlier, but it turns out it didn't. Again, nothing else too surprising in this. Finally, the last of the bio stuff: gene expression, of course, just keeps showing up. T cells have just gotten bigger and bigger, and stem cells were another one that I saw rise through the data. I need to hurry up. So here's the intersection of CS and biology. The first batch of years really didn't have many results at all, but you start to see them here with DNA sequencing, statistical analysis, actually not very surprising. Gene expression, huge, and again, climate change. And so keep in mind, this is the intersection of CS papers and bio papers. So that means there were CS papers written about climate change, about gene expression, about DNA sequencing. And then finally, here's the most recent batch, where you see the intersection of CS and bio.
What surprised me about this, and I don't know if you share this surprise, or maybe it was my naivete: I expected to see more of the computer science terms showing up in this intersection, but for the most part, this intersection is all biology terms. And I'm reminded that computational biology is a subfield of computer science; it's not a subfield of biology. So that alone could explain it, but I thought it was fascinating that computer science is blending into a lot of other fields. Just for reference, I wanted to show the intersection of econ and math. No real surprises here; the big interdisciplinary topic between econ and math is game theory, so that's what you see here. And then in the most recent batch, again, Monte Carlo, Markov chains, et cetera. But you do see machine learning here, so we've got computer science in this area as well. So let's see, oh my goodness, we don't have very much time, but I wanted to open up for questions. Before I do, I just want to revisit the research question: what can we learn through text and data mining about the evolution of research topics? I think we've shown that it is possible to determine the research focus of an ETD using methods from natural language processing, specifically using the Illinois Wikifier for concept disambiguation. And graphing the document frequency of these research topics allows us to visualize the relative importance of these topics within and across disciplines. So that's it. I wanted to thank ProQuest for allowing us to use the studio to do this, and of course, thank you to the IMLS for their continued support. And I have a slide here at the very end; if you want to get in touch with ProQuest about this product, you should try one of these options here. Okay, so if there are questions... I'm full screen, so I can't see if there are questions, but. Great, thank you, Bill. There actually are some questions. That was a really interesting talk.
And I know people are curious to know more. We have first a comment from Rebecca Bryant and then a question. Rebecca says: this is more of a comment for Bill than a question. I think that the Council of Graduate Schools would be very interested in hearing about your research, as you help inform how graduate education has changed in the past two decades. And she goes on to ask: a number of institutions have now stopped sending their data to ProQuest. Is your research here dependent upon the dataset ProQuest maintains, i.e., are institutions not sending copies or metadata to ProQuest not included? In this particular experiment, they're not. Although from what I know, and again, somebody from ProQuest would know more about this: because of the copyright clearance, you can't pull data out of the studio, obviously, but you can bring in your own data. And as part of the larger grant-funded research, we've actually been amassing quite a large corpus of ETDs for our own research, all by harvesting from open repositories, and we've got about 500,000 of those. The opportunity with ProQuest, though, was to be able to really have, you know, the firehose of ProQuest. I don't know if it would have been possible or feasible to collect 1.3 million ETDs by crawling institutional repositories. So that was the advantage here. Interesting. I hope I answered the question, but... Yeah, well, that's a really interesting question. Thanks, Rebecca, for bringing that up, and thank you for addressing it, Bill. Next, we have a question from Michael Seadle, who asks: how are you handling multi-language issues, or is the focus mainly on English-language works? And it looked like your corpus there was mostly US-based, Canadian, and British. Yeah, I mean, that's why we made it easy for ourselves. Although there is a woman in my lab who is doing her master's work on Arabic ETDs and is doing automatic classification of them.
It's actually been a challenge for her to gather enough Arabic ETDs to have training data for her models. So if anybody knows a source of Arabic ETDs, please share that with me. But yeah, that's an interesting topic as well. Great, thank you, and thanks, Michael, for the question. Now we have a question from Cliff Lynch. Cliff says: it looks like the time to produce a thesis is much longer than the time to get papers or conference papers published, so you're going to be recognizing topic emergence more slowly in the ETD database than in literature analysis. How much more slowly? What's the average time between selection of a research topic and acceptance of the thesis? That's interesting. I don't know the answer to that, but I could imagine a good study in which you would compare the emergence of topics in the journal and conference literature versus the ETD literature. I think that would be interesting. Indeed, and Rebecca Bryant is piggybacking on Cliff's question: I think it would be really interesting to combine this study with data from the Survey of Earned Doctorates. So lots of fodder to follow on from your project there. Well, I want to thank Bill so much for coming to CNI to present the results of his work, and in fact his project, with us. It's really interesting, and we'll look forward to hearing more about this. I also want to thank our attendees for joining us. I see that we are a little bit past time, so I'm going to go ahead and turn off the recording on this session and just invite any attendees who are still around and have time, if they'd like, to stay back and chat with Bill and ask questions; just raise your hand and I'll be happy to unmute you. And we'll have another session as part of the fall meeting here at two o'clock: Summarizing Web Archives Through Storytelling, with the Dark and Stormy Archives project, with Shawn Jones of Los Alamos National Laboratory.
So we hope to see you there, or at another session from the conference. Be well, everyone, take care, bye-bye. Thank you for attending. Thank you, Bill.