Hello and welcome to my talk for ESMARConf 2022. I'm Bronwyn Hunter, I'm a PhD student at the University of Sussex, and today I'm going to be presenting and demonstrating how we can use transfer learning to facilitate rapid text classification in R. For background, my PhD is broadly focused on the synthesis of different data sources, such as social media and academic literature, to understand large-scale patterns of wildlife exploitation.

So why use automated text classification? Identifying relevant data sources to include in an evidence synthesis, particularly when looking at different sources such as academic literature and grey literature, can be one of the most time-consuming stages, and given increasing publication rates this can create barriers to evidence synthesis. As a result, researchers are increasingly using automated text classification methods, often based on machine learning, to conduct the article screening stage. In addition, some of the more state-of-the-art machine learning algorithms can actually achieve performance comparable to manual labelling.

Although machine learning is increasing in uptake, some of the more advanced techniques often require specific programming skills, particularly in Python. They also often require large amounts of training data, which have to be labelled manually, to achieve high performance. So in this presentation I'm going to be talking about how we can adapt some of these approaches that are generally used in Python for use in R, and also how we can use something called transfer learning to reduce the amount of manually labelled data we need to build these text classifiers.

There are many different machine learning models that we can use for text classification: random forests, logistic regression, Naive Bayes, k-nearest neighbours, support vector machines, and then some of the deep learning-based methods like neural networks and transformers. Given the number of different models available, it can be difficult to know which one to use. Today I'm going to be talking particularly about transformers, and hopefully I can convince you that, even though these models are quite complex, how we use them is actually a conceptually appealing approach.

So why use transformers? To illustrate why we might want to use these more complex transformer-based models, I think it's first useful to think about some of the pitfalls of the other models I've mentioned. Some of the simplest text classification models, such as Naive Bayes, look only at word frequency to make a decision about, for example, the topic or the sentiment of a text. Whilst this can work quite well for longer texts, and it has been successfully applied in some evidence syntheses, because word order isn't taken into account these methods often fall down for shorter pieces of text, and in evidence synthesis we often only have perhaps the title or the abstract of a paper.

In contrast, more recent text classifiers use what we call recurrent neural networks. These models are based on the logic that a word's meaning is a function of its context. In these deep learning models, words are fed in sequentially, and what we call the hidden state, which here is represented by these circles, is a function of the hidden state of the previous word, such that the hidden state of the final word contains information from all of the words in the sequence. This sequence representation can then be fed into a classification decision.
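(A quick gloss in standard notation, my own addition rather than something from the slides: the recurrence being described is h_t = f(h_{t-1}, x_t), where x_t is the embedding of the t-th word and h_t its hidden state, so the final hidden state h_T is the only summary of the sequence that reaches the classifier.)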
As you can see from this illustration, the representation of the question mark, the final word in the sequence, contains very little information from the first word, "what". In essence, these models struggle to represent relationships between words that are far apart in a sequence, and in practice, because words need to be fed in one by one, they can also be quite slow to train.

So, now that we've thought about some of the other models we could use, what is the transformer approach? Whilst I don't have time to go through the full architecture represented by the diagram on this slide, I will just highlight some of the key features of this model. Firstly, in contrast to recurrent neural networks, texts are fed in as a whole rather than sequentially, meaning that these models are much quicker to train. The key feature, though, is what we call self-attention, and this self-attention mechanism is part of one of the encoder blocks in the model. Essentially, self-attention looks at an input sequence and decides at each step which other parts of the sequence are important. So in the example "the boy is holding a blue ball", we know that "holding", "blue" and "ball" are all related to each other, but the word "blue" is not actually related to the word "boy". This self-attention mechanism is able to learn these associations between the words in a text and hopefully build a more realistic representation of natural language.
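(For reference, and again as my own aside rather than something on the slide: the self-attention step is usually written as the scaled dot-product attention from the original 2017 transformer paper, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q, K and V are query, key and value projections of the input tokens and d_k is their dimension, so each token's new representation is a weighted combination of every token in the sequence.)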
Now, since the introduction of transformers back in 2017, there has been a proliferation of models that make use of these self-attention mechanisms to build models of natural language. One such model is called BERT, and this is the one whose use I'll demonstrate today. It was first introduced by Google, and it stands for Bidirectional Encoder Representations from Transformers. What this model does is take the encoder blocks from the original transformer model and stack them on top of each other.

The commonality between all of these models that make use of transformers is that they take a pre-training and fine-tuning approach when applied. Essentially, these models are pre-trained on a large body of text, which helps to build a representation of natural language. BERT, for example, is trained via masked language modelling, where some of the words in a sequence are masked and the model has to predict which words fit in the sequence, as well as next sentence prediction. Once the model has been trained on a large body of text, we can then adapt it for use on a range of different tasks in natural language processing, one of which is text classification. In essence, the learning from the pre-trained model is transferred to the application, and that's why we call this transfer learning.

Hopefully the previous slides have given you a little bit of understanding of transformer-based language models and their key features. Now onto the important part, which is how we use them in practice. Whilst the models are quite complex in themselves, how we use them has been made really simple by Hugging Face, via a Python library called transformers. What Hugging Face has is a repository of these pre-trained language models. We can then take one of these models off the shelf and adapt it for whatever task we want to do. In the case of text classification, we can take BERT, for example, obtain a representation of our text via the CLS token from BERT, and then add a classification layer, which is what makes the decision as to whether a text is relevant or irrelevant.

So say we have a collection of abstracts that we've downloaded from Web of Science, for example. We can take a subset and label them as relevant or irrelevant based on our inclusion criteria. The training data are then used to fine-tune this whole architecture, and the testing data are used to assess how well the classifier is performing. If it performs well, we can then feed in the rest of our data and use the classifier to obtain our final set of relevant abstracts. One thing I like about this approach is that there are loads of different pre-trained models to choose from, some of which have been trained on domain-specific corpora. For example, SciBERT is a BERT-based model that has been pre-trained on a body of scientific literature, so learning from that pre-training can be transferred to the classification.

How well, then, does this approach actually perform? Here I've shown some results from my own PhD work, where I compared the performance of different machine learning approaches in classifying academic abstracts for relevancy. I looked at Naive Bayes, which I mentioned earlier, a simple feed-forward neural network, and finally our fine-tuned BERT model. The graph on the left shows F1 score, which is a function of both the precision of the model and its recall, that is, how many of the actually relevant abstracts the model retained in the final dataset. As you can see, even at small training sizes BERT achieves higher performance than the other models, and particularly when we look at recall, BERT is by far outperforming the other models and is actually retaining 97% of the relevant abstracts in the final dataset. So I hope that this is a fairly convincing example of where BERT can achieve high performance without the need to label thousands and thousands of data points.

Whilst Hugging Face has made it really easy to use these transformer models in Python, there aren't equivalent tools for making use of transformer models natively in R. This is where the reticulate package comes in. Reticulate allows us to use Python libraries in R, and thus, if we're doing other stages of evidence synthesis in R, we can basically streamline our analyses. Before you get started with reticulate, you need to make sure that you have Anaconda installed on your computer, which will allow us to make use of Python. You'll also need to set up a virtual environment which has either TensorFlow or PyTorch installed. The fine-tuning of some of these models also requires you to have a GPU available.

Now I'm going to run through some code examples. These are just snippets, so the full code will be available on my GitHub repository, but here I'm just going to illustrate some of the key parts; a consolidated sketch follows at the end of this walkthrough. Once we've loaded in reticulate, we can install our desired Python libraries, in this case the transformers library, using py_install(). Then, using the import() function, we can import that library into our environment. Once we have the transformers library imported, we use the dollar sign operator to access the models and methods within the library. So, to load in our tokenizer, which splits the text up into tokens (words or word pieces), we call BertTokenizer$from_pretrained() with the name of the model that we want to use.

Then we can also load in our model to be fine-tuned, and here we're using BertForSequenceClassification. So the architecture that I illustrated earlier, the BERT model with the classification head, is actually available as a whole model that we can download and fine-tune. The transformers library has what we call a Trainer module, and this is the method that we're going to use to fine-tune our model, but the first thing we need to do is set our training arguments. Most of these I'm leaving as default values, but the key ones that we want to set ourselves are the output directory, that's where the final model is going to be saved, the number of training epochs, the batch size, that's the number of training examples that are loaded into the model at once, and the evaluation batch size. Again, we're calling methods within the transformers library using that dollar sign operator, so transformers$TrainingArguments.

Next we want to initialise our trainer, so we call transformers$Trainer, and we initialise it with the model that we loaded in earlier, the training arguments that we set in the previous slide, the training and testing data, and then the method which we're going to use to evaluate the model's performance; this is defined as a separate function, which you can see in the full code on my GitHub. We then create the trained model by calling trainer$train(), and once that model is trained we can use it to generate predictions. By using the predict function we get a prediction score for each of the classes, and then by taking the argmax we get the class with the highest probability.
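Pulling those steps together, here is a minimal sketch of the whole pipeline in R. The environment name, model name and hyperparameter values are illustrative assumptions, and train_dataset, test_dataset and compute_metrics() are placeholders defined in the full code on my GitHub:

```r
# Minimal sketch of the fine-tuning pipeline; names and values are illustrative
library(reticulate)

# Point reticulate at a conda environment that already has PyTorch installed
use_condaenv("transformers_env")

# Install (once) and import the Hugging Face transformers library;
# convert = FALSE keeps returned objects on the Python side until we
# explicitly convert them back into R
py_install("transformers", pip = TRUE)
transformers <- import("transformers", convert = FALSE)

# Load the tokenizer, which splits text into tokens (words or word pieces)
tokenizer <- transformers$BertTokenizer$from_pretrained("bert-base-uncased")

# Load the pre-trained BERT model with a classification head on top
model <- transformers$BertForSequenceClassification$from_pretrained(
  "bert-base-uncased", num_labels = 2L
)

# Set the training arguments, leaving most options as their defaults
training_args <- transformers$TrainingArguments(
  output_dir = "bert_screening_model",  # where the fine-tuned model is saved
  num_train_epochs = 3L,                # passes over the training data
  per_device_train_batch_size = 16L,    # examples fed to the model at once
  per_device_eval_batch_size = 16L      # evaluation batch size
)

# Initialise the trainer with the model, the training arguments, the
# training and testing data, and the evaluation function (the last three
# are placeholders defined in the full code)
trainer <- transformers$Trainer(
  model = model,
  args = training_args,
  train_dataset = train_dataset,
  eval_dataset = test_dataset,
  compute_metrics = compute_metrics
)

# Fine-tune the model
trainer$train()

# Predict per-class scores for the test set, then take the argmax to get
# the class with the highest score, converting the result back into R
preds <- trainer$predict(test_dataset)
predicted_class <- py_to_r(preds$predictions$argmax(axis = -1L))
```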
So those are the most important functions, and hopefully you can see that with just a few lines of code we can train a really high-performing model for text classification.

Thank you for watching the presentation. I hope it's given you a little bit of an introduction to transformers and how we can use them for text classification in an evidence synthesis context. If you want to know more about any of the models or concepts that I've talked about, I've linked here a few blog posts and videos that provide a nice introduction to them. I've also included a QR code with a link to my GitHub repository, where you'll find the full code and also a couple of the datasets that I used to generate these examples. Thanks very much.