Welcome to this deeper dive into information retrieval. Information retrieval is a text mining task, so if you haven't already, I suggest that you first go watch my short introduction to the core concepts in biomedical text mining. Specifically, information retrieval is the task of finding relevant papers. It comes in several different flavors or subtasks; the ones I will cover today are ad hoc retrieval, document similarity, document clustering, document classification, and active learning.

Ad hoc retrieval is the task where the user comes with a query and your goal is to find relevant papers for that query. This is not as simple as doing an index lookup to find papers that mention the words the user typed in, because not all the relevant papers will match those specific words. For that reason, search engines make use of a number of tricks. One of these is stemming, which strips word endings so that different forms of the same word match. Another is automatic query expansion, which takes your query and expands it with additional synonyms or related terms. And finally, a resource may have already annotated the papers with so-called subject headings, which are subjects of interest that users may be searching for.

If we look specifically at PubMed, typing in the query psychiatric diseases will not simply search for these words. Instead, automatic query expansion will expand it into this much bigger query. As you can see, it will look for the MeSH term mental disorders, which is a subject heading like I mentioned before, and it will search the text for the words mental and disorders in addition to the words I typed in, psychiatric and diseases. This allows PubMed to find many more relevant papers for my query. However, even with all of these tricks, ad hoc retrieval is far from perfect. A complementary and very different technique is document similarity.
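To make the two tricks concrete, here is a minimal toy sketch of stemming plus query expansion. This is not PubMed's actual algorithm: the suffix list and the synonym table are made-up examples standing in for a real stemmer and for MeSH-based expansion.

```python
# Toy stemming + query expansion (illustration only, not PubMed's algorithm).

SUFFIXES = ("ing", "ed", "es", "e", "s")  # made-up, minimal suffix list

def toy_stem(word):
    """Strip one common English suffix so that word variants match."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-made synonym table standing in for MeSH-style subject headings.
SYNONYMS = {"psychiatric": ["mental"], "diseases": ["disorders"]}

def expand_query(query):
    """Add synonyms for each query word, then stem everything."""
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(SYNONYMS.get(word, []))
    return [toy_stem(t) for t in terms]

print(expand_query("psychiatric diseases"))
# "disease" and "diseases" now stem to the same token, so both match
```

Note how the singular and plural forms collapse to the same stem, which is exactly why stemming lets the search engine match papers that use a different form of a query word.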
Here we instead start with a paper of interest, and we want to be able to rank all other papers by similarity. To do this, we treat each paper as a bag of words. For each word we calculate the term frequency, that is, how many times the word appears in the paper; this puts more weight on repeated words, which are likely more important. We also calculate the inverse document frequency, which is based on the fraction of papers that mention the word (typically the logarithm of its reciprocal); this puts more weight on rare words and down-weights words that appear in almost every paper and are therefore probably not very informative. These two metrics are multiplied together into one called TF-IDF, term frequency times inverse document frequency. Having done this, we have turned each paper into a vector. In this vector representation, each word corresponds to a dimension. We can now compare such vectors and say how similar they are, for example by looking at the angle between them, which is what the so-called cosine similarity does; it is the most commonly used metric. However, it is important to keep in mind that whenever we use a bag-of-words approach, we completely ignore the order of the words in the documents; we only look at the word content.

Having defined document similarity, we are able to do document clustering. The idea here is that we start with a corpus, which could be, for example, all of PubMed, and we want to divide it into meaningful clusters where each cluster corresponds to a specific topic. The way to do that is to use the document similarity metric to calculate all-against-all similarities, so that we know how similar any two documents are. Given this similarity matrix, we can apply standard clustering algorithms, such as hierarchical clustering or k-means clustering, to discover document clusters.
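The TF-IDF and cosine similarity steps above can be sketched in a few lines of plain Python. This is a minimal, from-scratch illustration on a three-"paper" toy corpus; a real pipeline would use a library implementation (for example scikit-learn's TfidfVectorizer) and would then feed the resulting similarity matrix into a clustering algorithm.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document (a list of words) into a sparse TF-IDF vector (dict)."""
    n = len(docs)
    # document frequency: in how many papers does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this paper
        vectors.append({
            word: count * math.log(n / df[word])  # TF times IDF
            for word, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy corpus: two psychiatry "papers" and one unrelated one.
docs = [
    "mental disorders in young adults".split(),
    "treatment of mental disorders".split(),
    "protein folding in yeast".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # two psychiatry papers: higher similarity
print(cosine(vecs[0], vecs[2]))  # psychiatry vs. unrelated: lower similarity
```

Note that the word "in" appears in two of the three papers, so its IDF weight is low: sharing it contributes much less to the similarity than sharing "mental disorders" does, which is exactly the down-weighting of uninformative words described above.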
A complementary approach to this is document classification. In this case, we start with a labeled corpus; that is, we have documents that have been assigned labels, typically manually. This could be as simple as dividing papers into relevant and irrelevant papers for whatever you are trying to do. We then want to train a method to categorize other papers. Again, we want to turn each paper into a vector representation. We could use bag of words, but we could also use more advanced deep learning techniques, such as pre-trained language models that take sentence structure into account. In either case, we turn each paper into a vector, which we then use as input to train a classifier. Since the text has been turned into vectors, we can use standard machine learning techniques: take the pile of labeled papers, do cross-validated training, and of course set aside some of the papers as a held-out test set to evaluate the performance of the classifier at the end.

However, the problem you often have is that you don't have a labeled corpus, and making one is an awful lot of work. For that reason, the method of active learning has become rather popular. Here you start with a query, which of course results in a big pile of papers. You don't have labels, but you do have clear eligibility criteria for which papers you want and which you don't; this is the typical setup whenever you want to do a systematic review of some topic. What you do is pick at random, say, 20 papers out of the pile that match the query, and label each of them as eligible or not. You then train an initial classifier on this very small training set, use it to re-rank all the papers, and manually screen the top-ranked papers again. You can then add those to the training data and retrain the classifier.
You rinse and repeat this process until you reach the point where you find no more eligible papers, and then you stop. That's all there is to it. That's the topic of information retrieval. I'm going to have more presentations covering other topics in text mining, and if you're interested in this, have a look at the presentation linked up here. Thanks for your attention.
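The active learning loop described above can be sketched as a short runnable program. Everything here is a stand-in: `label_manually` plays the role of the human screener applying eligibility criteria, and the keyword scorer plays the role of a real classifier (in practice you would train, say, a logistic regression on TF-IDF or language model vectors). The corpus, function names, and batch sizes are all illustrative assumptions.

```python
import random

def label_manually(paper):
    # Stand-in for a human screening a paper against eligibility criteria.
    return "psychiatry" in paper

def train_classifier(labeled):
    # Toy "classifier": score papers by word overlap with eligible examples.
    eligible_words = set()
    for paper, eligible in labeled:
        if eligible:
            eligible_words |= set(paper.split())
    return lambda paper: len(eligible_words & set(paper.split()))

def screen(papers, seed_size=20, batch_size=10):
    """Active-learning screening loop for a systematic review."""
    random.seed(0)
    pool = list(papers)
    random.shuffle(pool)
    # Step 1: label a small random seed set.
    labeled = [(p, label_manually(p)) for p in pool[:seed_size]]
    pool = pool[seed_size:]
    while pool:
        # Step 2: train on what we have and re-rank the remaining pool.
        score = train_classifier(labeled)
        pool.sort(key=score, reverse=True)
        # Step 3: manually screen the top-ranked batch, add it to the training data.
        batch, pool = pool[:batch_size], pool[batch_size:]
        new = [(p, label_manually(p)) for p in batch]
        labeled += new
        # Step 4: stop once a whole batch yields no more eligible papers.
        if not any(eligible for _, eligible in new):
            break
    return [p for p, eligible in labeled if eligible]

# Toy corpus: a few eligible papers buried in a pile of irrelevant ones.
papers = [f"psychiatry cohort study {i}" for i in range(8)] \
       + [f"yeast genetics study {i}" for i in range(40)]
found = screen(papers)
print(len(found), "eligible papers found")
```

The point of the re-ranking step is that the eligible papers float to the top of the pool, so most of them are found after screening only a fraction of the pile, which is what makes active learning attractive for systematic reviews.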