Welcome to this short introduction to the core concepts in biomedical text mining. So why text mining? The reason is that the biomedical literature is so vast that, even by very conservative estimates, you would get a pile well over 10 kilometers tall if you were to print it all out. In other words, there is too much to read. For that reason, whether we like it or not, we need a computer to help us find papers, find things in papers, pull out information, and ultimately turn the literature into databases that we can use instead.

The first task is what is known as information retrieval, that is, to find texts of relevance. The most common approach is called ad hoc retrieval, and that is in fact what you do every time you go to PubMed and type in a query. What happens behind the scenes is that PubMed has a very large document collection that has been indexed so that it can quickly find all the papers matching the query you type in.

However, that is just one approach to information retrieval. Another common approach is document similarity, which is what recommendation engines use. The idea is that you take each document and turn it into a term vector, where each dimension of the vector corresponds to a different word in the document. You then use a weighting scheme to place more emphasis on the words that are more important, and calculate vector similarity, thereby quantifying how similar documents are and ultimately ranking them by similarity to your documents of interest.

Once you have found your documents, the next task is named entity recognition, that is, to find names in text. Unsurprisingly, the key ingredient for doing a good job at this is a good dictionary of the names you want to recognize; in biomedicine, that would be things like genes and proteins or diseases. It is not enough to have a long list of names: you also need to know what they mean, and thereby which names are synonyms. For example, you need to know that cyclin-dependent kinase 1 is the same as CDK1.

In addition to the dictionary of names, you also need a so-called blacklist, which is a list of names that are a bad idea to match from the standpoint of named entity recognition. For example, there is a human gene called SDS that you would almost certainly want to block, because in the literature the name SDS typically refers to a small-molecule compound instead.

Beyond the names themselves, you can also use deep learning approaches to look at the context around the names. That way you can learn that when a name is followed by the word "expression", for example, it is very likely to be a gene name, whereas if the very same name were followed by the word "buffer", you should be more skeptical. The way you do this in practice is to train classifiers that predict the type of the entity in the middle of a text based on the words around the entity, but not the name itself.
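To make the indexing idea concrete, here is a minimal sketch of ad hoc retrieval with an inverted index. The documents and query are invented for illustration; a real system like PubMed adds tokenization rules, ranking, and much more.

```python
# Minimal sketch of ad hoc retrieval with an inverted index.
# The documents and query are toy examples, not real PubMed data.
from collections import defaultdict

documents = {
    1: "CDK1 phosphorylates substrates during mitosis",
    2: "TP53 mutations are common in human cancers",
    3: "CDK1 activity is regulated by cyclin B",
}

# Build the index: each term maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return the IDs of all documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

print(search("CDK1"))  # [1, 3]
```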
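Document similarity can likewise be sketched in a few lines with TF-IDF, one common weighting scheme, here using scikit-learn; the three toy abstracts are made up for illustration.

```python
# Sketch of document similarity with TF-IDF weighted term vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "CDK1 drives the cell cycle by phosphorylating mitotic substrates",
    "Cyclin B binds CDK1 and activates it at the G2/M transition",
    "SDS-PAGE separates proteins by molecular weight",
]

# TF-IDF places more weight on terms that are frequent in a document
# but rare across the collection.
vectors = TfidfVectorizer().fit_transform(docs)

# Cosine similarity between all pairs of document vectors.
similarity = cosine_similarity(vectors)
print(similarity.round(2))  # the two CDK1 abstracts should be most similar
```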
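The dictionary plus blacklist approach to named entity recognition might look like the following sketch. The names, identifiers, and blacklist entries are illustrative; a real tagger also handles multi-word names, punctuation, and spelling variation.

```python
# Sketch of dictionary-based named entity recognition.
# The synonym table maps every known name to a canonical identifier;
# the identifiers here are placeholders, not verified database records.
synonyms = {
    "cdk1": "GENE:CDK1",
    "cyclin-dependent kinase 1": "GENE:CDK1",
    "tp53": "GENE:TP53",
}

# Names that look like genes but usually mean something else in text.
blacklist = {"sds"}  # SDS almost always refers to the detergent

def tag(text):
    """Return (name, identifier) pairs for dictionary hits not blacklisted."""
    hits = []
    for token in text.lower().split():
        if token in blacklist:
            continue
        if token in synonyms:
            hits.append((token, synonyms[token]))
    return hits

print(tag("CDK1 was detected whereas SDS was used in the buffer"))
# [('cdk1', 'GENE:CDK1')]  -- SDS is blocked by the blacklist
```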
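The context idea can be sketched with a simple bag-of-words classifier, assuming the candidate name has been replaced with a placeholder so the classifier sees only the surrounding words. The training examples are invented, and real systems use deep learning on much larger corpora.

```python
# Sketch of a context classifier: predict the entity type from the words
# around a name, with the name itself masked out.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Context windows with the candidate name replaced by a placeholder.
contexts = [
    "we measured [ENT] expression in tumor samples",
    "[ENT] expression was induced by treatment",
    "the pellet was washed in [ENT] buffer",
    "samples were lysed in [ENT] buffer overnight",
]
labels = ["gene", "gene", "not_gene", "not_gene"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contexts)
model = LogisticRegression().fit(X, labels)

# "expression" after the placeholder pushes the prediction toward "gene".
test = vectorizer.transform(["strong [ENT] expression was observed"])
print(model.predict(test))  # ['gene']
```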
Once you have found all the entities in a text, the next job is information extraction, that is, to pull out relations between them. The easiest approach is co-mentioning, the idea being that if A and B are mentioned together, they might have something to do with each other. To make these methods stronger, you generally do counting: if people keep mentioning A and B together, the two are more likely to have something to do with each other.

The question is, of course, what to count. Should you count a co-mention only when the two names appear within the same document, the same paragraph, or the same sentence? The answer is that you should do a weighted count, where you place more emphasis on co-mentions within sentences and less emphasis on co-mentions that, for example, only share a document. That way you get the best of all worlds. After calculating the weighted count, you further want to do frequency normalization to correct for the fact that the literature is heavily biased; for example, there are many, many papers about the gene TP53.

In addition to simple co-mentioning, more and more methods these days make use of deep learning, just as for named entity recognition. This is typically done using so-called pre-trained language models; you may, for example, have heard of BERT. You can then fine-tune such a model for specific tasks, such as pulling out physical protein interactions. That way you can train the method to recognize statements like "A binds to B" and, based on that, pull out the relations in a structured form that can go into databases such as the STRING database.

If you are interested in that topic, I suggest you take a look at the presentation shown here. Thanks for your attention.
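To make the weighted counting above concrete, here is a minimal sketch; the weights are assumptions chosen for illustration, not values from any particular system.

```python
# Sketch of weighted co-mention counting: a sentence-level co-mention
# counts more than one that only shares a paragraph, which in turn
# counts more than one that only shares a document.
W_DOCUMENT, W_PARAGRAPH, W_SENTENCE = 0.25, 0.5, 1.0  # assumed weights

def comention_score(co_mentions):
    """Sum the weight of the tightest shared context of each co-mention.

    co_mentions is a list of strings: "sentence", "paragraph" or "document".
    """
    weights = {"document": W_DOCUMENT, "paragraph": W_PARAGRAPH,
               "sentence": W_SENTENCE}
    return sum(weights[level] for level in co_mentions)

# Two sentence-level co-mentions and one document-level co-mention.
print(comention_score(["sentence", "sentence", "document"]))  # 2.25
```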
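Frequency normalization can then be sketched as an observed-over-expected ratio, so that heavily studied entities such as TP53 do not dominate purely by mention count; the counts below are invented.

```python
# Sketch of frequency normalization: compare the observed co-mention
# count of a pair to what would be expected from how often each entity
# is mentioned on its own.
def normalized_score(c_ab, c_a, c_b, c_total):
    """Observed-over-expected ratio, akin to pointwise mutual information."""
    expected = c_a * c_b / c_total
    return c_ab / expected

# A pair with modest counts can outscore a pair of very popular entities.
print(normalized_score(c_ab=10, c_a=20, c_b=30, c_total=10_000))       # ~166.7
print(normalized_score(c_ab=50, c_a=2_000, c_b=3_000, c_total=10_000)) # ~0.08
```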
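Finally, a rough sketch of relation extraction with a pre-trained language model, using the Hugging Face transformers library (with PyTorch). The model choice, entity markers, and labels are assumptions for illustration; in practice you would fine-tune on a corpus annotated with the relation of interest, such as physical protein interactions.

```python
# Sketch of relation classification with a pre-trained language model.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)  # e.g. "binds" vs "no relation"

# Mark the two entities so the model knows which pair the label refers to.
sentence = "[E1] CDK1 [/E1] binds to [E2] cyclin B [/E2] during mitosis."
inputs = tokenizer(sentence, return_tensors="pt")

# Before fine-tuning the output is meaningless; after training on labeled
# examples, the logits score each relation type for this entity pair.
logits = model(**inputs).logits
print(logits.shape)  # (1, 2): one score per relation label
```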