Hello everyone. My name is Jian Wu. I'm an assistant professor of computer science at Old Dominion University. I'm very honored to give this presentation at the CNI 2021 Virtual Membership Meeting. The topic of my presentation is towards aiding research by improving access to electronic theses and dissertations (ETDs) from multiple domains. My collaborators are Mr. William Ingram at the Virginia Tech University Libraries and Dr. Ed Fox in the Virginia Tech Department of Computer Science. Our project is supported by the Institute of Museum and Library Services. The main aim of this project is to investigate innovative ways of using machine learning and natural language processing and see how they can be applied to a national corpus of electronic theses and dissertations. We identified three research areas: the first is document analysis and information extraction; the second is adding value through automatic classification and summarization; and we are also interested in building better user services for digital libraries. The core team of our research includes Bill Ingram, who is the principal investigator, and Dr. Fox and me as co-PIs. We also have two PhD students, Bipasha and Montever. In addition to the core team, we also have a number of graduate and undergraduate students participating in this project. Some notable students include Samuel Uding, who was responsible for web crawling; Winston Shields, who worked on the web user interface design; Marcia, who worked on metadata extraction; Sampanaka, who designed the figure extraction framework; and Palak and Emman, who worked on topic and subject classification.

In this presentation, I will first give a brief background introduction. Then I will focus on how we acquired the data, built the repository, and used the data we collected to build a language model. After that, I will summarize our results and conclusions and briefly talk about future work.

After World War II, there was an increase in the number of graduate degrees conferred in the United States. The graph on the right shows the number of doctoral degrees in the United States from 1950 to 2019. The number has increased steadily since then, which means the number of ETDs has also increased over the years. On the other hand, state-of-the-art machine learning and deep learning tools are often data-driven. For example, pre-trained models in computer vision such as VGG-16, VGG-19, and ResNet are trained on large-scale human-annotated data from ImageNet and MS COCO. In NLP, large-scale models like BERT and XLNet are trained on large corpora such as BooksCorpus, English Wikipedia, ClueWeb, and Common Crawl. One advantage of these language models is that they are built with self-supervised training, which means they do not need labeled data.

To the best of our knowledge, there are three large ETD collections. The first is NDLTD, which contains over 6 million ETDs from around the world, but it only contains the metadata records. The second is ProQuest Dissertations and Theses Global, which contains about 5 million ETDs, but it is not publicly available. Our collection contains about 450,000 ETDs, including full text and metadata records. Our ETDs were obtained from over 40 university libraries using two access approaches. Early on, we downloaded sitemaps, found the landing pages from the sitemaps, and then downloaded the PDFs. Later on, we used OAI-PMH to obtain the metadata, from which we found the landing page URLs and downloaded the PDFs; a minimal sketch of this kind of OAI-PMH harvesting follows.
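To make the harvesting step concrete, here is a minimal sketch of what OAI-PMH-based harvesting of ETD metadata and PDFs can look like in Python, using the Sickle and BeautifulSoup libraries. The endpoint URL, output file names, and crawl delay are placeholders; our actual crawler used custom per-repository parsing, as described next.

```python
import time
import requests
from bs4 import BeautifulSoup
from sickle import Sickle

OAI_ENDPOINT = "https://repository.example.edu/oai"   # placeholder OAI-PMH portal
CRAWL_DELAY = 10                                      # seconds; tune per robots.txt

sickle = Sickle(OAI_ENDPOINT)
records = sickle.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True)

for record in records:
    meta = record.metadata                 # Dublin Core fields as a dict of lists
    landing_pages = [u for u in meta.get("identifier", []) if u.startswith("http")]
    if not landing_pages:
        continue

    # Fetch the landing page and look for a link that ends in .pdf
    html = requests.get(landing_pages[0]).text
    soup = BeautifulSoup(html, "html.parser")
    pdf_links = [a["href"] for a in soup.find_all("a", href=True)
                 if a["href"].lower().endswith(".pdf")]

    if pdf_links:
        pdf_url = requests.compat.urljoin(landing_pages[0], pdf_links[0])
        pdf = requests.get(pdf_url)
        out_name = record.header.identifier.replace("/", "_") + ".pdf"
        with open(out_name, "wb") as f:
            f.write(pdf.content)           # PDF goes to the file system; metadata to the database

    time.sleep(CRAWL_DELAY)                # respect the repository's crawl delay
```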
This diagram shows the crawling pipeline using OAI-PMH in more detail. From the OAI-PMH portal, we first identify the set and the metadata prefix, which gives us a list of the ETDs; from that list we can download the XML metadata, go to the landing page of each ETD, and download the PDFs. The PDFs are saved in the file system and the metadata is stored in the database. We strictly follow the crawl delays in the robots.txt file.

It took a long time to collect all the ETDs, and we met a number of challenges and learned several lessons. First of all, not all the PDFs are downloadable, because some of them have restricted access. Second, the HTML DOM structure varies across repositories, so we had to write custom parsers for the HTML files. Third, not all metadata records have the same fields; for example, the department, discipline, subject, and year issued are often missing. Even when metadata is available, it often has inconsistent formats, as shown here, and sometimes values are simply missing. Finally, even when we strictly followed the crawl delay in the robots.txt file, our requests could still be blocked, so we had to use trial and error to find the best crawl delay.

So far we have collected 451,000 ETDs from 42 universities. This table shows the top 10 universities; Ohio State University has the largest number of PDFs we collected, about 55,000. We want to emphasize that our collection does not reflect the national collection of ETDs, because we did not go to all the university repositories. This plot shows the number of ETDs distributed over the years. You can see that before 1945 the number of ETDs increases very slowly, and after that it increases gradually. After 1997 it increases rapidly, because that was the year the ETD initiative was launched and many universities started to require students to submit ETDs. One caveat here is that a significant number of ETD dates are not available from the university-provided metadata; we are going to fix that by directly extracting the metadata from the PDFs.

We built a repository using the ETDs we collected. The diagram on the right shows the hierarchical structure of the ETDs: each folder contains the PDF and the XML files. In the future, after we extract the figures, we will also store the figures together with the PDFs. The total size of the repository is 3.4 terabytes. It is hosted at ODU Computer Science and mirrored at the Virginia Tech University Libraries. We also built a database, hosted in MySQL. Here is the schema of the database. The main table is the ETD metadata table, which contains 451,000 rows. We also have a PDFs table, which contains 463,000 rows, because an ETD may contain multiple PDFs. The subjects table, extracted from the metadata, contains 1.9 million rows. You can see that there are also other tables, which we will populate using information extracted from the ETDs; a simplified sketch of two of the main tables is shown below.
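As an illustration only, here is a sketch of how the two largest tables might be created with the mysql-connector-python package; the connection settings, column names, and types are simplified placeholders, not our actual schema.

```python
import mysql.connector

# Placeholder connection settings; the real database is hosted at ODU and mirrored at Virginia Tech.
conn = mysql.connector.connect(host="localhost", user="etd", password="secret",
                               database="etd_repository")
cur = conn.cursor()

# Simplified, illustrative version of the ETD metadata table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS etd_metadata (
        etd_id      INT AUTO_INCREMENT PRIMARY KEY,
        title       TEXT,
        author      VARCHAR(255),
        university  VARCHAR(255),
        department  VARCHAR(255),
        degree      VARCHAR(64),
        year_issued SMALLINT
    )
""")

# An ETD may have multiple PDFs, hence a separate table keyed back to etd_metadata.
cur.execute("""
    CREATE TABLE IF NOT EXISTS pdfs (
        pdf_id    INT AUTO_INCREMENT PRIMARY KEY,
        etd_id    INT NOT NULL,
        file_path VARCHAR(512),
        FOREIGN KEY (etd_id) REFERENCES etd_metadata(etd_id)
    )
""")

conn.commit()
cur.close()
conn.close()
```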
When we were building the ETD database, we found that a lot of information was missing. Because of that, we developed a metadata extraction framework using machine learning. On the left-hand side is an example of a cover page from a scanned ETD. Using this method, we can extract the title, author, university, degree, program, advisor, and year. Our papers on this work were published at JCDL 2020 and 2021.

To briefly introduce the method we used: our model was built on a sample of ETDs selected from 1940 to 1990, to make sure that they are scanned ETDs, and the ETDs were selected from multiple universities, as shown in this diagram. We used a conditional random field model with a combination of visual and textual features. The right panel shows the results: we achieved F1 measures of 81% to 97% for all seven metadata fields.

We also built an ETD search engine to make the ETDs more accessible. The search engine is built on MySQL, which hosts the metadata; Elasticsearch, which indexes the full text and provides the search service; and the ETD repository, which contains the files we collected. All of these are accessible from the web UI. Our current web UI provides the basic search functions, including a single-box search, advanced search, auto-complete, spell check, and voice queries. We also allow users to vote for the ETDs they like and add their favorite ETDs to lists, and we offer a RESTful search API.

So we have a lot of ETDs; what can we do with them? We can build a lot of services. The first is segmentation, in which we chunk large ETDs into smaller sections. The second is summarization, in which we condense long ETDs into shorter paragraphs so that users do not need to spend a lot of time reading the whole document. We can also classify ETDs into subject categories, extract and search figures and tables, and do research topic analysis. Here I just want to quote a diagram used by Mr. Ingram in the 2020 CNI presentation, in which we used a named entity extraction tool called Wikifier to extract named entities from computer science and biology ETDs. The diagram shows that many named entities are cross-listed in these two domains.

Using the ETDs, we can also train language models. The power of a language model depends on its training text. General language models like BERT or RoBERTa are trained on Wikipedia, BooksCorpus, ClueWeb, Common Crawl, and Gigaword. SciBERT, for scientific documents, was trained on PubMed papers, which are predominantly from the medical and life sciences. We found in our preliminary study that pre-trained language models may not work well for ETDs on certain tasks. This is likely because of low vocabulary overlap: discipline-specific jargon may not exist in the language model's vocabulary, so these terms are encoded into default vectors, and too many default vectors make the language model less meaningful because they do not carry much information.

Training a language model from scratch is very expensive; it takes several days to train the BERT and SciBERT models using GPUs or even TPUs. Because of that, we chose to fine-tune an existing language model, and because many ETDs are in scientific domains, we started with SciBERT as the base model. We extracted the text from about 8,600 born-digital PDFs, which gave us 300 million tokens. We fine-tuned the language model using PyTorch and the Transformers library; it took only about 20 hours using one NVIDIA GPU. A minimal sketch of this kind of fine-tuning setup is shown below.
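Our exact training configuration is not spelled out in this talk, but a minimal sketch of domain-adaptive fine-tuning of SciBERT with a masked language modeling objective, using the Hugging Face Transformers and Datasets libraries, might look like the following; the corpus file name and hyperparameters are placeholders.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from SciBERT and continue pre-training on text extracted from born-digital ETDs.
model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# "etd_corpus.txt" is a placeholder: one plain-text passage per line.
dataset = load_dataset("text", data_files={"train": "etd_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of the tokens for the masked language modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="scibert-etd",
                         per_device_train_batch_size=8,
                         num_train_epochs=1,
                         save_steps=10_000,
                         logging_steps=1_000)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"],
                  data_collator=collator)
trainer.train()
trainer.save_model("scibert-etd")
```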
To investigate the impact of front matter, we did two experiments. In the first experiment, we used the raw text extracted from all documents. In the second experiment, we removed the front matter, including the title page, table of contents, and acknowledgements. For the evaluation, we used perplexity to indicate the quality of the language model. Perplexity is commonly used to indicate the power of a language model to predict unseen words; the lower the score, the better. The evaluation results clearly indicate that the second language model, in which the front matter is excluded, achieved a lower perplexity score, which means that the front matter confuses the model and should be excluded. In the future, we will also perform further evaluation of how the language model affects classification.

In conclusion, we have collected 450,000 ETDs, including the full-text PDFs and metadata. We collected the metadata using the OAI-PMH portals of multiple universities. From an analysis of the ETDs, we found inconsistent metadata, which we will fix using machine learning techniques. We also trained a language model using the documents we collected, based on 300 million tokens extracted from ETDs. Our training process uses fewer resources than the existing general-purpose and scientific language models, and we will investigate whether it achieves comparable performance on subject classification tasks.

Finally, I want to briefly talk about our ongoing and future work. The first is to improve the metadata quality. As I mentioned before, we have a significant fraction of missing and inconsistent data, which we could fix by directly extracting the metadata from the cover pages. This table illustrates the problem: you can see that the department names provided by the library metadata are different from the department names printed on the cover pages of the ETDs. We also want to resolve the university names. For example, some ETDs use the full name of the university and some use the short name, and even a space around an ampersand will make the database think they are two different universities.

We also need to improve the language model. One way is to increase the number of training documents. We also want to remove the traces of tables and figures from the text, and in terms of evaluation, we want to use a unified schema for subject classification, such as the one used in the Microsoft Academic Graph. One big challenge we are facing is obtaining clean text converted from PDFs.

We also want to add new features to the user interface. For example, we want to provide multi-modal search by allowing users to search chapters, figures, and tables. We also want to improve the document summary page, like the one on the right, by segmenting the ETDs into chapters, highlighting concepts, and providing short explanations. And we want to convert the existing frameworks, such as metadata extraction, segmentation, classification, figure extraction, and knowledge graph generation, into web services. That's it. Thank you very much, and I welcome any questions.