Hello everyone, it's 11:10, so I think we'll start. My name is Bill Michaud from the University of Illinois. My colleague, Alessandro Cabana, could not be here today, so I'm doing both parts of this presentation. The slide deck is actually an amalgamation of two different presentations, one with background information about machine learning and the other on our own practices. This talk leans toward the second part, our practices. It's really a work in progress, maybe a snapshot of a process. So, Aaron decided to put in a lot of slides for background, so you can refer to some of these later.

First, I think it's useful for us to do an environmental scan. In our case, we emphasize the fact that we have tools we can use to implement interconnected services. These include things like robust APIs, DOIs as the glue that allows interconnectivity, asynchronous GPU parallel processing, and, for the most part, open data sets that can provide data and content. So the hope here is that we can use machine learning computational tools to add to our armamentarium for system development. If you look at machine learning and AI in library services, what we want to do is focus on systems, services, and service frameworks that provide the scaffolding for these interconnected services. For example, in our bento-style discovery system, we sometimes issue 14 to 16 asynchronous API calls in order to retrieve results. That scaffolding and that interconnectivity are extremely important. So we want to look at machine learning across all these different systems and see where it can apply.

This whole thing really began with a question: can we add a topic modeling component to our API-based bibliographic database service? And as an extension, what about adding ML to our bibliometric, discovery, and delivery services?

Some deeper background. Machine learning, as I'm sure a lot of you know, takes sets of observations and identifies patterns and anomalies; pattern recognition is really the heart of any machine learning algorithm. Machine learning uses a mathematical model. In fact, all machine learning is numbers, so documents, for example, must be vectorized. Our focus is on document clustering: we want to identify key concepts from a corpus of documents and devise a way to partition the corpus into groups of related documents. What we've done is develop some tools to extract words and phrases and build indexes that can be presented to the clustering software.

More background: clustering is unsupervised machine learning, versus text classification, which is supervised. Supervised learning uses a predefined set of training data, where you've already assigned documents to one of a set of pre-established classes. Again, all machine learning is numerical, so everything is vectorized, and you do that in order to use similarity or distance measures to find related documents. In the work we're doing, we're using the cosine function, but there's also the dot product, a linear algebra function, and Euclidean distance is sometimes used as well. Most clustering uses what's called a bag-of-words approach, versus treating text as phrases, and we'll talk about that later. Certainly, explainability and reproducibility are critical in all ML projects. And an important point: in a lot of projects, most of the time is actually spent on processing, cleaning, and preparing the data. That's integral to any successful project.
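To make the vectorization and similarity ideas concrete, here is a minimal sketch using scikit-learn, which we'll come back to later; the three sample "abstracts" are invented for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for article abstracts (invented for illustration).
docs = [
    "Topic modeling of bibliographic metadata with k-means clustering.",
    "Clustering article abstracts to identify research fronts.",
    "Asynchronous API calls in a bento-style discovery system.",
]

# Bag-of-words vectorization: every document becomes a numeric TF-IDF vector.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Cosine similarity between every pair of documents; values fall between 0 and 1.
print(cosine_similarity(X).round(2))
```

The first two documents share clustering vocabulary, so their cosine value comes out higher than either's similarity to the third; that is exactly the signal a clustering algorithm exploits.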
So we looked at a number of off-the-shelf clustering environments, with the idea that we would determine whether any of them were really ready to be used literally off the shelf. I've got a list here of some of the things we've looked at. We've been focusing a lot on scikit-learn, the Microsoft Azure Cognitive Services toolkit, the Google platform, and Wolfram Mathematica. Some of the others are not being used as much; Amazon Comprehend we've had a project or two with. We tried to test a lot of these different systems. We also tried to test particular algorithms: k-means, k-means++, LDA, and spectral clustering for phrases. It turns out that the choice of the initial clusters is important in any machine learning clustering algorithm. Supervised classification algorithms use a number of different techniques, and regression is also a supervised activity. One of the things you see in the literature is that people are using a lot of different algorithms; sometimes people try multiple algorithms to see what fits their data best. That's actually one of the open issues in machine learning.

A lot has already been done on machine learning in libraries. Notre Dame and I, in a grant a couple of years ago, looked at topic modeling in library discovery systems. The Library of Congress has done a lot of work and held a number of seminars and conferences. Notre Dame has actually released a set of papers on machine learning in libraries, which is what I've linked to here. There's also the IDEA Institute on AI, which is IMLS funded and which I've been involved with. Overall, there's a feeling, as Ryan Cordell's report commissioned by the Library of Congress indicated, that libraries could become focal sites for the translational collaboration required to cultivate responsible machine learning. But for a lot of other people, machine learning is kind of like the law of the hammer: give a child a hammer and suddenly everything needs pounding. It is important to pause and consider whether AI techniques are the best approach before trying to use them.

Real quickly: there's already a lot of hype and a lot of hope around machine learning. I've got a couple of references here, and a couple of examples of positive machine learning projects, including work at the Library of Congress and particularly some in medical AI. And then some projects that did not work out, for example Google Flu Trends and some of the AI radiology problems; we're still at a very early stage on all of these. And it's clear that real-life experience with clustering, within libraries and elsewhere, still shows that there are some real issues. One of the terms being used now is augmented intelligence, to describe the interface between the human component and the machine learning. A couple of other slides here cover issues with AI, including a well-known column in Science where the AI people themselves question the value of what they're doing and how it's being done. I also have a number of clustering comments taken from the literature. Specifically, topic modeling with clustering techniques still has a lot of issues and a lot of open questions.

So I'll talk a little bit about what we've been doing. We have a database service where, for example, we provide access to articles on a topic using the Scopus API.
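As a rough sketch of what that retrieval step looks like, the Elsevier Scopus Search API can be called over HTTP; the query and API key below are placeholders, and this is an assumed minimal usage pattern rather than our production service code.

```python
import requests

# Elsevier Scopus Search API endpoint; the key and query are placeholders.
SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"

def search_scopus(query, api_key, count=25):
    """Retrieve one page of Scopus search results for a topic query."""
    response = requests.get(
        SCOPUS_SEARCH_URL,
        headers={"X-ELS-APIKey": api_key, "Accept": "application/json"},
        params={"query": query, "count": count},
    )
    response.raise_for_status()
    # Each entry carries article metadata: title, abstract link, citations, etc.
    return response.json()["search-results"]["entry"]

# Example (requires a valid institutional API key):
# entries = search_scopus("TITLE-ABS-KEY(biofuels)", api_key="YOUR_KEY")
```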
This particular project used a database of 57,000 articles on biofuels that we downloaded through the Scopus API, and we worked with a couple of faculty members in civil engineering to do topic modeling. They wanted specifically to look at these articles and ask: what are the research fronts? What are the hot topics in this area? We used a couple of the clustering techniques and settled on one called comparative text mining (CTM). We consulted with a computer science faculty member who had worked on this technique, and the project eventually ended up as a published article in a journal called Renewable and Sustainable Energy Reviews. The k-means biofuels analysis revealed seven clusters, I'm sorry, eight clusters. Again, because k-means is a bag-of-words approach, you get individual words, and you have to try to interpret those individual words within the topic clusters. The CTM results actually allowed us to do some more phrase generation, and that produced 12 clusters. I've got some of these listed here: low-emission diesel, fuel cells, et cetera. The clusters were all analyzed against the k-means clusters, and we derived six topics of interest for this particular article.

The other project we've been working on extensively with clustering techniques is our research impact visualizations. We've put together a database of research impact indicators for 500 faculty from nine different departments in engineering and the physical sciences, with publication metadata covering 10 years. The visualizations themselves, for the Cancer Center at Illinois, indicate the number of publications a particular faculty member has generated, the number of times they've been cited, the number of co-authors, and the number of grants received. This is a research impact indicator service. But we have all this data, and we can use it to correlate different impact indicators. So we can ask whether the people who publish the most articles also have the most grants and have also been the most cited. We can run the data we've collected through clustering and try to identify, again, research fronts or research areas within each department or, more interestingly, across departments.

So the Cancer Center topic extraction yields this sort of k-means analysis in visual terms. Again, a key point is that there's often a need for domain knowledge, for disciplinary experts, to look at these clustering results and turn them into something useful: phrases and particular terms. So if you look at cluster four, optical coherence tomography: from the individual words in the original k-means cluster four, we can derive that semantic meaning.

So we've been looking a lot at clustering examples that use phrases rather than a bag of words. There are a couple of different techniques available, and our assumption is that using phrases instead of words should better capture the semantic meaning of the documents and help us do better topic modeling. That turns out to be somewhat true. We've been looking at a couple of different clustering techniques: one is called kernel k-means, and one is called spectral clustering. To test this, I generated a sample discovery set of about 430 documents from the Scopus API, documents I understand because they're about library discovery systems, and put together specific phrase indexes. We've been using one of the Microsoft Cognitive Services tools to generate the phrases.
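A minimal sketch of that phrase-generation step, using the Azure Text Analytics client library; the endpoint and key are placeholders for a real Cognitive Services resource, the abstract is invented, and this is an assumed usage pattern, not necessarily our exact configuration.

```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Placeholders: substitute a real Cognitive Services endpoint and key.
client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("YOUR_KEY"),
)

# One abstract in, a list of key phrases out (abstract invented for illustration).
abstracts = ["Bento-style discovery systems aggregate results from many APIs."]
for result in client.extract_key_phrases(abstracts):
    if not result.is_error:
        print(result.key_phrases)
```

The extracted phrases, rather than individual tokens, then become the features fed to the phrase-based clustering runs described next.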
So you see on the left-hand side the abstract, and on the right-hand side the phrases derived using the Microsoft tool. We ran these through our scikit-learn k-means; the spectral clustering there was not successful, so we used the Wolfram spectral clustering tool, which actually did break the corpus nicely into four different clusters. And you can see here the results you get back from Wolfram. Again, the results you get back from any of these systems are problematic in the sense that you have to interpret them. The Wolfram system gives back the documents in each cluster, but then you have to characterize those documents by deriving keywords and looking at the term frequencies.

So again, here, under the hood, vectorizing: if you look at the discovery set, the total number of words is 72,000; there are 7,370 unique words, but actually 9,711 unique phrases. So the phrase derivation yielded more elements. We're using, again, some of these similarity or closeness measures; this is the cosine value. There are a number of other techniques and issues involved with vectorizing. One is dimensionality reduction, which is really a set of linear algebra techniques (a short sketch of this appears below, just before the questions). I've had the privilege here of digging out my old linear algebra from when I was a math major and trying to figure out some of the techniques being used in these clustering models.

So there's a question here about clustering versus classification: can we actually do really good topic modeling without additional classification of a few articles? This slide, I know, you won't be able to read; it shows how we're building the word indexes and the phrase indexes. And if you look at the center section, on comparing two documents, you'll see that these values, which are between zero and one, don't differ by much. And if you start comparing all the values across the corpus, even for a small set, for example just the 430 documents, that's 430 × 429 / 2, about 92,000 pairwise comparisons. So this is a lot of machine processing and a lot of machine time.

So we've learned a lot from our experience with the clustering algorithms. One interesting question, I think, is: can we use what we've already learned from IR systems? For example, inverted file structures, proximity searching, field limiting. Can any of these techniques be carried into the machine learning environment? Can this be our contribution? Can our augmented intelligence application here be bringing into some of these ML algorithms the experience and knowledge we've already gained from information retrieval systems? And as Cliff mentioned in the plenary, we need to see a generation of tools that will open this up and make all of this easier to use. As information professionals, we have a role that we can play here.
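To make the dimensionality-reduction step mentioned above concrete, here is a minimal scikit-learn sketch on invented stand-in documents. Reducing the TF-IDF matrix with truncated SVD (the linear algebra technique behind latent semantic analysis) before clustering is one standard approach; it is offered as an illustration, not necessarily the exact pipeline we used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# docs would be the 430-odd abstracts; three invented stand-ins shown here.
docs = [
    "Evaluating bento-box discovery interfaces in academic libraries.",
    "Relevance ranking and known-item searching in discovery layers.",
    "Spectral clustering of key phrases from article abstracts.",
]

# N x D term matrix: N documents by D unique terms (or phrases).
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Collapse the D term dimensions into a few latent components (LSA),
# then re-normalize so cosine-style comparisons still make sense.
lsa = make_pipeline(TruncatedSVD(n_components=2), Normalizer(copy=False))
X_reduced = lsa.fit_transform(X)

# Cluster in the reduced space instead of the full term space.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_reduced)
print(labels)
```

Working in the reduced space cuts the cost of those tens of thousands of pairwise comparisons, since each document vector shrinks from D terms to a handful of components.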
All right, let's stop here; we've got a few minutes for questions. Does anybody have questions? You can come up to the front, or just shout the question out. Tom? "You may have mentioned this, but we have done some experimentation on full text versus just bibliographic records. Have you found that just bibliographic records have yielded anything useful?" Yeah, that's a really good point, which I should have made. A lot of the analysis you see is on full text. Again, I think this is where you end up not using what we already know. We've learned an awful lot about full-text searching: proximity searching, looking for words within the same paragraph, within the same sentence. One of the fundamentals we learned with IBM STAIRS and systems like BRS is that it's important to be able to search within paragraphs and within sentences. When you load the full text into a machine learning system, you've lost all of that, so that becomes a real problem. For convenience, we're using the metadata: abstracts and title words. We haven't actually looked at the difference between that and full text, but when I look at the literature, there are huge processing times and computational issues with doing full-text comparisons. Remember, you're building these huge vectors; one dimension is essentially your dictionary, your universe of terms, so it's an N by D matrix where the D value is the total number of terms. Any other questions? The slides will be available, so please feel free to contact me. "That was great; I'll follow up, if I may. It sounds like abstracts are key. Have you found a difference between metadata without abstracts versus metadata including the abstract?" Yes, yes. The abstracts are extremely important. Titles are written to attract attention, so they're not always descriptive; we've all seen a few article titles like that. So the abstracts, I think, in terms of document description, are extremely important, and they're really a substitute for full-text processing, which takes an awful lot of computational time. So this is an extremely important area, and it's going to be very fruitful; we're going to see a lot of work on this in the next couple of years. We're looking forward to making some inroads and putting together some insights about how we can use machine learning in library applications. Any other questions or comments? Then I'll thank you for your time.