All right, great. Thank you all for coming to the last session of the day. I'm Aisha Dutta, and I'm going to be talking about my attempt at automating subject classification for journals. I'm an R&D software developer at Crossref, and this was my first machine learning project, so it was a good way for me to get my feet wet and take a survey of some of the methods that are available. So this is the title, also known as a tour of methods for attempting to automate subject classification, because the numbers, spoiler alert, weren't great. So we can rest assured that AI isn't going to take this over anytime soon.

To give an overview: I work for Crossref, which, for those of you who don't know, is a nonprofit organization that provides infrastructure and software services to the research community. We receive a lot of bibliographic metadata from various sources, such as research organizations and publishers, and some of the metadata we receive contains subject categories. At this point that all comes from Elsevier, so it's the Elsevier journals that carry subject categories. Across our metadata the categories are therefore unevenly distributed, because we at Crossref don't have subjects for the rest of our journals. So my R&D department was curious to see whether we could take the subject categories from Elsevier and infer them across the other journals.

Elsevier uses a system called the All Science Journal Classification (ASJC). You can go to their website to take a look, and I can make the slides available later; I'll show a little more of it shortly. Basically, from their website, they have in-house experts who determine the subjects based on the journal title and the contents it publishes. I used Elsevier basically because this was a prototype project and theirs was a machine-actionable dataset, so it was the lowest-hanging fruit. Elsevier has a downloadable spreadsheet with a variety of data points, including journal titles, print and electronic ISSNs, and the ASJC codes. It is multilingual, which is great; it's mostly English, but there are a few other languages, as you can see up there. From the Crossref API, I checked which Elsevier ISSNs we also have, did some deduping and things like that, and as of last June we had about 26,000 journals.

Now I'm going to zoom out a little bit. In machine learning parlance, this is a type of classification problem, and because a journal can have one or many subject categories, it's known as multi-label classification. For example, the Journal of Children's Services has several subjects assigned to it, as you can see, and I was curious to see what the models would do with that.

I had to do some pre-processing of the data, because we don't have many abstracts and we have no full text, so basically all I could use were the journal titles and their article titles. For the first round I grabbed 10 to 20 article titles per journal and used Spark to parallelize the API calls; otherwise it would take forever. I also added some stop words: there are stop-word lists covering the ISO standard languages, so I used those, plus some domain-specific terms I kept coming across, such as "journal", "academic", and "book". I also scrubbed the article titles, because in my voyage of discovery I noticed there were HTML entities, XML, and some mathematical and chemical formulae in there, so I stripped all of that out to create essentially distilled, pristine article titles. A rough sketch of that clean-up is below.
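This is a minimal sketch of that kind of title scrubbing, assuming hypothetical helper names and a tiny English-only stand-in stop-word set; the actual run used the ISO stop-word lists plus the domain terms just mentioned.

```python
import html
import re

# Tiny stand-in stop-word set (the real run used the ISO stop-word lists
# plus domain terms like "journal", "academic", "book").
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "for",
              "journal", "academic", "book"}

TAG_RE = re.compile(r"<[^>]+>")          # crude HTML/XML tag matcher
NON_ALPHA_RE = re.compile(r"[^a-z\s]")   # drop digits, formulae, punctuation (English-only toy version)

def clean_title(title: str) -> str:
    """Return a distilled, lower-cased version of an article title."""
    text = html.unescape(title)          # decode HTML entities, e.g. &amp; -> &
    text = TAG_RE.sub(" ", text)         # remove <sub>, <i>, MathML, ...
    text = NON_ALPHA_RE.sub(" ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_title("The &lt;i&gt;Journal&lt;/i&gt; of H<sub>2</sub>O Purification"))
# -> "h o purification"
```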
So these were the methods I was looking at. One is the very classic set of tools from scikit-learn: I used TF-IDF, which I'll explain more as I go on, and the multi-label binarizer, which generates sparse matrices for the models to crunch through and generate predictions. Then I also used a neural network: the Allen Institute for AI released SciBERT, a model based on BERT, which Google released, but trained on scientific text, and it's open source. I used that with the multi-label binarizer to see what I'd get. SciBERT has since been rolled into the very James Bond-sounding SPECTER, another model that uses sentence transformers and citations in its training to generate more fine-tuned predictions. And then finally, since ChatGPT burst into our consciousness, I was curious to see what would happen if I took the cheapest model, Ada, and fine-tuned it. If it had given me good results, I would have looked at some open-source large language models as well.

TF-IDF, just to give a brief definition because I was new to this myself a couple of months ago, is basically a statistic that measures the importance of a word in a document relative to the whole corpus. It can scrub out stop words and generate unigrams and bigrams from the words you feed it, and it produces a sparse matrix, which I'll define again soon, of the titles. These titles are the features I'm using to train the model; that's the data I'm going to send it. The multi-label binarizer is essentially another encoder that generates a sparse matrix of the subjects I send it; these are known as the labels. So I end up with a sparse matrix of features and one of labels that I can then feed to the various models.

This is an example of a sparse matrix. This one is from TF-IDF, so you see columns of words across the top and two rows that represent journals. I'm saying, show me the journals where the word "account" comes up in the titles, and this is the answer I get. Essentially you have a whole bunch of zeros with a few non-zero values, and it's much faster to compute predictions on that.

I first wanted to see what the baseline prediction would be if I did absolutely nothing. So I took the existing column of subjects that each journal belongs to, added another column containing all of the subjects, and did a very simple comparison. I get a lovely F1 score of 0.06; that's the measure of how accurately the model can predict something, so it's a good baseline to start from. Then I took all the models that can be used for multi-label classification and ran the numbers to see where I got the best scores. For example, as you can see, for the MLP classifier the F1 score was 0.38 with 18,000 features drawn from the processed titles, and so on. LinearSVC, spoiler alert, was the one to beat, at 0.43 with 20,000 features. A sketch of this pipeline is below.
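Here is a minimal sketch of that scikit-learn pipeline: TF-IDF features, binarized labels, the predict-everything baseline, and a one-vs-rest LinearSVC. The two toy journals and their subjects are invented for illustration, the F1 averaging is my assumption (the talk doesn't say which was used), and I fit and score on the same tiny set just to keep it self-contained; the real run used about 26,000 journals with a proper train/validation split and a grid search over the feature count.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# One string of concatenated, cleaned titles per journal, plus its subjects (toy data).
texts = [
    "children services social care foster adoption law policy",
    "alloy corrosion fatigue microstructure coating polymer",
]
subjects = [
    ["Law", "Developmental and Educational Psychology", "Sociology and Political Science"],
    ["Materials Science"],
]

# Features: uni- and bigram TF-IDF over the titles (a sparse matrix).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=20_000)
X = vectorizer.fit_transform(texts)

# Labels: one binary column per subject.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(subjects)

# Naive baseline: predict every subject for every journal.
baseline = np.ones_like(Y)
print("baseline micro-F1:", f1_score(Y, baseline, average="micro"))

# One-vs-rest LinearSVC, the best-scoring classical model in the talk.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, Y)
print("LinearSVC micro-F1:", f1_score(Y, clf.predict(X), average="micro"))
```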
So these are not great scores, and I could look into different ways to go further. One way was to add more data, more titles, some citations; the other was to look at the subjects a little more closely, because these subjects are very, very granular, and the Elsevier classification system is based on a hierarchy. So I thought, okay, if the Journal of Children's Services has several subjects assigned to it, what can I distill those down to? According to their scheme, for example, Law and Developmental and Educational Psychology belong to what they call the supergroup subjects Social Sciences and Psychology. As you can see, the green ones are what they term supergroups, and then they have more granular subjects inside those hierarchies. So I distilled it down to Psychology and Social Sciences for the Journal of Children's Services; I lost some granularity and gained some performance. With LinearSVC at 20,000 features, the F1 score jumped up to 0.72, which was a nice improvement over before. I again did a grid search, which is basically just varying the number of features I send it, and got a minimal improvement to 0.73. So I thought, okay, this is as far as I can go with the set of titles I have.

Then I was curious to see what BERT would do. Allen AI's SciBERT is a BERT model trained on scientific text, and they have a website built using SPECTER that, I suppose, does a lot of semantic recommendation and things like that. I was curious what would happen if I generated a matrix using that model, so I generated embeddings of the same journal and article titles, which took about 18 hours on my laptop. Once I had that, I again used the multi-label binarizer for the subjects and got a score of 0.71, but with far fewer features: only 512, because that's just the dimension the language model gives you. So it's kind of cool that with only 512 features, as opposed to tens of thousands of TF-IDF features, you get essentially the same result. But I thought, this is still not good enough.

By this time I was looking at whether I should add more data, and then ChatGPT came along, so I was just curious to see what would happen. I asked the great ChatGPT what the topic would be for these titles, and it actually came very close to what was in the training data. For example, for the title you see at the top, that journal does belong to the subject of materials science. Then I asked, how does it do with multilingual titles? This is a collection of Dutch titles for a Dutch journal, and it is accurate. So I thought, all right, let's see what happens, and I tried fine-tuning through OpenAI with Ada, which was the cheapest model. Essentially the prompt is a number of article titles and the completion is the supergroup subject codes. The documentation is quite nice, they have a good API to figure out what you need to do, and $10 later I get a model, yay. Then I ran the validation dataset against this model and got a score of 0.69.

So at this point I thought, all right, I'm going to put this back on the shelf, because we just don't have enough data right now to go forward. But going further, I can see revisiting these methods on a richer dataset to see what we can do, and we could also look at different classification systems to see what we get out of them. Rough sketches of the supergroup collapse, the embedding step, and the fine-tuning data format follow below.
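First, the supergroup collapse. This is a hedged sketch: as I understand the ASJC scheme, the codes are four digits and the first two digits identify the top-level area, so the collapse can be driven by the code itself. The specific codes, the names, and the helper function are illustrative, not the project's actual mapping table.

```python
# Illustrative subset of the ASJC top-level areas, keyed by the first two
# digits of the four-digit code (the real project may have used Elsevier's
# own supergroup column instead).
SUPERGROUPS = {
    "32": "Psychology",
    "33": "Social Sciences",
    # ... the remaining top-level areas
}

def to_supergroups(asjc_codes):
    """Collapse granular ASJC codes to their top-level subject areas."""
    return sorted({SUPERGROUPS.get(code[:2], "Unknown") for code in asjc_codes})

# Journal of Children's Services, with illustrative granular codes for
# Law, Developmental and Educational Psychology, and Sociology:
print(to_supergroups(["3308", "3204", "3312"]))
# -> ['Psychology', 'Social Sciences']
```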
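Next, the SciBERT embedding step, as a minimal sketch using Hugging Face transformers; this is my reconstruction, not the exact script. The pooled vector then replaces the TF-IDF columns as input to the same multi-label classifier.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# SciBERT, released by the Allen Institute for AI, available on the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden state into one fixed-size dense vector."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = embed("children services social care foster adoption law policy")
print(vec.shape)  # a few hundred dimensions, versus tens of thousands of TF-IDF columns
```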
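Finally, the fine-tuning data. The legacy OpenAI fine-tuning flow used with the Ada base model takes a JSONL file of prompt/completion pairs; the separator token and the completion formatting here are illustrative choices, not necessarily what the actual run used.

```python
import json

# One record per journal: concatenated article titles as the prompt,
# supergroup subjects as the completion (illustrative formatting).
records = [
    {"prompt": "children services social care foster adoption law policy\n\n###\n\n",
     "completion": " Psychology, Social Sciences"},
    {"prompt": "alloy corrosion fatigue microstructure coating polymer\n\n###\n\n",
     "completion": " Materials Science"},
]

with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# The fine-tune itself was kicked off with the then-current OpenAI CLI, roughly:
#   openai api fine_tunes.create -t train.jsonl -m ada
```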
It's a whole different subject, but I'd also be curious to see how the various classification systems measure up to each other; that might lead to interesting discussions about bias and things like that. Topic clustering is an interesting way to go about this too, and then we could also see how research flows across time. We could again look into semantic search, so you could take the embeddings, measure how similar they are to other titles, and then perhaps make inferences as to what subjects they might be. Crossref itself is postponing this project for now, just because we don't have enough data. So that's your whistle-stop tour of these methods. Thank you very much.

All right, so if you have questions, raise your hand and I'll get the mic to you.

What was the classification system that you were using?

Sorry?

What was the classification system that you were using?

It was Elsevier's All Science Journal Classification system. It's just something proprietary that they have.

Do you know what GPT was using?

I don't, and that's a very good question. It's a closed model, so I don't know how it came up with this. Yeah. Any other questions?

Hi, thank you. I'd like to ask, do you think that if you had more information, like the abstracts or the keywords or something like that, it could perform better?

I think so, yeah. I think it probably needs more data for it to do anything. So yeah, for sure, that would help.

All right, thank you. Anyone else?

Sorry, some of this was a little over my head, but you kind of framed this as a failure, or not a failure, I'm sorry; you didn't quite get the results you were looking for. But how do you think about the fact that ChatGPT was so accurate? Should we stop trying to do things as well as ChatGPT in the open and just surrender? Obviously not, but it just seems so impressive that it could do that. What are we to make of that, basically? I'm not sure if that's a fair question.

No, it is a fair question. The problem is we don't know, because it's a closed model, and that's why I was curious to see how it would work with open models. I did some ad hoc querying of open models and it was pretty decent, so it would be a good tool. I, again, don't know about ChatGPT itself, because we don't know what's happening behind the scenes. But I'm a former librarian, so I'm all for open everything, and we should know the provenance of the data. As with all AI, it's a good tool for sure, and I'm curious to see how the topic clustering works, because I think that is an interesting way to go.

Does it... oh, never mind. It helps for the recording, though. Ah, okay. I was just wondering where you were hoping this project would land. One of the examples you gave, the Journal of Children's Services, is really interdisciplinary, so I'm curious what you were hoping to do with it. If it was able to classify lots of journals really well, what were you or Crossref hoping for? What are the future aspirations?

Yeah, I mean, first, if it did well, then we would probably do a test run and add it to our metadata, and then it might also be interesting to open it out to the community and see what other kinds of services we could build with it.
Because there are a lot of companies that provide semantic search, clustering, and recommendation systems, so what happens if we democratize that? It would involve a lot of work, a lot of maintenance, and a lot of double-checking to make sure the data is right, so it's a huge endeavor to get into. Okay, with that I want to say thank you so much, everyone.