Welcome to this deeper dive into relation extraction, a text mining task. If you haven't already, I strongly recommend that you watch my introduction to the core concepts of biomedical text mining before proceeding. Relation extraction is all about getting relations from text, and it builds on top of named entity recognition, which I described in the previous presentation. There are several ways to extract relations: co-mentioning, rule-based systems, machine learning, and, very commonly nowadays, deep learning. I'll go over all of these before briefly covering typical use cases of relation extraction. Co-mentioning builds on the simple idea that if entities are mentioned together, that likely implies some association between them. Of course, if two entities are mentioned together just once in the entire literature, that is not very reliable. For that reason, we rely on counting, thereby trusting a pair more if the two entities are mentioned together often. We count in a way that takes distance into account, putting more weight on entities mentioned, for example, in the same sentence than on entities mentioned in different sentences of the same paragraph or far apart within the same paper. Then we normalize for frequency; that is, we compare how often the entities are co-mentioned to what you would expect by random chance. This corrects for overstudied entities, such as the gene TP53 or the disease cancer, which appear so often in the literature that they also accumulate many co-mentions purely at random. The main advantage of co-mentioning is that it is fast and builds a consensus view of the literature. For that reason, it is surprisingly powerful, especially when you consider that it completely ignores context: it does not look at any words in the sentences other than the entities themselves. For the same reason, of course, it cannot extract the specific relation type.
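To make the counting and normalization concrete, here is a minimal Python sketch. The distance weights, the corpus format, and the scoring formula are my own illustrative assumptions, not the actual scheme of any particular co-mentioning system:

```python
from collections import Counter
from itertools import combinations

# Hypothetical distance weights: a same-sentence co-mention counts more
# than a pair of entities that only share a document.
W_SENTENCE, W_DOCUMENT = 1.0, 0.25

def comention_scores(documents):
    """Score entity pairs by distance-weighted co-mention counts,
    normalized by what random chance would predict from each entity's
    overall frequency (corrects for overstudied entities like TP53).

    `documents` is a list of documents, each a list of sentences,
    each sentence a list of recognized entity names (NER output).
    """
    pair = Counter()    # weighted co-mention count per entity pair
    single = Counter()  # number of documents mentioning each entity
    for doc in documents:
        doc_entities = set()
        for sentence in doc:
            for a, b in combinations(sorted(set(sentence)), 2):
                pair[(a, b)] += W_SENTENCE
            doc_entities.update(sentence)
        for a, b in combinations(sorted(doc_entities), 2):
            pair[(a, b)] += W_DOCUMENT
        for entity in doc_entities:
            single[entity] += 1
    # observed / expected: how much more often the pair co-occurs
    # than independent mentions would predict
    return {
        (a, b): observed / (single[a] * single[b] / len(documents))
        for (a, b), observed in pair.items()
    }

docs = [[["TP53", "MDM2"], ["TP53"]],   # doc 1: two sentences
        [["TP53"], ["BRCA1"]]]          # doc 2: two sentences
scores = comention_scores(docs)
# the same-sentence pair (MDM2, TP53) outscores the pair (BRCA1, TP53),
# which only co-occurs at document level
```

Note how the normalization penalizes TP53 for appearing in every document: raw counts alone would overstate its associations.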
If two genes are mentioned together in a sentence, you can say that they likely have some association, but by co-mentioning alone you cannot say of which type. To do that, you can build a rule-based system. Such systems rely on a natural language parser to understand the grammatical structure of the sentence. Afterwards, you use manually crafted rules to match patterns in the text and that way pull out associations. Consider an example sentence: "The expression of the cytochrome genes SIG-1 and SIG-7 is controlled by Hab-1." The sentence is parsed into smaller fragments: first "SIG-1 and SIG-7"; then, on top of that, "the cytochrome genes SIG-1 and SIG-7"; then "the expression of the cytochrome genes SIG-1 and SIG-7"; and finally, "Hab-1" is connected to that via a verb. That way, the system can automatically extract that Hab-1 controls the expression of SIG-1 and that Hab-1 controls the expression of SIG-7. A major advantage of rule-based systems is that they don't require any training data, because you make the rules manually. They are relatively fast, and their output is interpretable, since whatever is extracted comes directly from the rules that you wrote. They can also be hard to beat, because well-crafted rules give very high precision. Their main downside is that they achieve only moderate recall. Another disadvantage is that they only work within single sentences, which is part of the reason recall is limited and why many rules are needed to achieve good recall. Also, keep in mind that since you craft the rules yourself, the system is only ever going to be as good as you are: if you're not good at crafting rules, you won't be able to make a good rule-based system. The major alternative is machine learning, which has been tried for many years.
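As a toy illustration of the idea (real systems match rules against a full grammatical parse, not flat regular expressions), a single hand-written rule in Python might look like this; the `[GENE:...]` placeholders and the rule itself are invented for this example:

```python
import re

# Entities are assumed to be pre-tagged by NER as [GENE:name];
# the rule is a hand-crafted pattern for one specific phrasing.
PATTERN = re.compile(
    r"expression of (?P<objs>.+?) is controlled by \[GENE:(?P<subj>[^\]]+)\]"
)
GENE = re.compile(r"\[GENE:([^\]]+)\]")

def extract(sentence):
    """Apply the rule and emit (subject, relation, object) triples."""
    triples = []
    match = PATTERN.search(sentence)
    if match:
        subj = match.group("subj")
        # one triple per gene in the coordinated object phrase
        for obj in GENE.findall(match.group("objs")):
            triples.append((subj, "controls expression of", obj))
    return triples

triples = extract(
    "The expression of the cytochrome genes [GENE:SIG-1] and "
    "[GENE:SIG-7] is controlled by [GENE:Hab-1]."
)
# -> Hab-1 controls the expression of SIG-1, and of SIG-7
```

Note how the rule covers exactly one phrasing; that is why precision is high when rules are well crafted, but also why many rules are needed before recall becomes acceptable.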
The idea is that you select a corpus of text, manually annotate relations in it, mask the entity names so that the system cannot see which genes or proteins, for example, we are talking about, and then vectorize the sentences somehow, for example using a bag-of-words approach or, more recently, word embeddings. Having done all of that, you can apply machine learning: you do cross-validated training on part of the documents and test on a held-out set of documents that were not used for training. These systems can get much better recall than rule-based systems. However, this generally comes at the price of considerably worse precision. A big part of the reason is that the syntax is lost: with a bag-of-words approach, even though you look at the other words in the sentences, you do not take their order into account. Also, language is complex, which means that for a machine learning method to learn the many ways we can phrase a certain type of relation requires a very large annotated corpus, which is obviously expensive to make. Deep learning addresses many of these problems, especially now that people use transformer models, which can encode sentences of variable length, together with pre-trained models, which have a lot of benefits. This allows you to handle the complexity of language before dealing with the specific application, through so-called self-supervised learning. Imagine that you have a huge, unlabeled corpus of text and you mask, for example, 15% of the words. You can then use this very large dataset to train a model to predict the masked words; that way, the model learns language. A very popular example is the BERT model from Google, which was trained on a corpus of 3.3 billion words from English Wikipedia and the BooksCorpus. This allowed the model to learn, essentially, the English language.
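The preprocessing steps of the classic machine-learning route can be sketched in a few lines of Python; the sentence, entity list, and vocabulary here are invented for illustration:

```python
import re

def mask_entities(sentence, entities):
    """Replace recognized entity names with a placeholder so the
    classifier cannot memorize specific genes or proteins."""
    for name in entities:
        sentence = sentence.replace(name, "ENTITY")
    return sentence

def bag_of_words(sentence, vocabulary):
    """Vectorize a sentence as word counts; note that word ORDER is
    discarded, which is exactly how the syntax gets lost."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [words.count(term) for term in vocabulary]

masked = mask_entities("Hab-1 controls the expression of SIG-1",
                       ["Hab-1", "SIG-1"])
# masked == "ENTITY controls the expression of ENTITY"
vocab = ["entity", "controls", "expression", "binds"]
vector = bag_of_words(masked, vocab)
```

A classifier trained on such vectors sees that "controls" and "expression" co-occur with two masked entities, but it can no longer tell which entity controls which, illustrating the precision problem discussed above.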
BioBERT was then created by a group in Korea, who trained the model on another 18 billion words from PubMed abstracts and PMC open-access articles, allowing it to further learn biomedical English. We thus have a model that understands not just English but biomedical English, and we only need to fine-tune it for the specific task at hand, which could be, for example, to pull out physical protein-protein interactions from text. Since the model only needs to learn the specific task, you can get away with using a much smaller annotated corpus, which is a huge advantage, and still get state-of-the-art performance. The main disadvantage of deep learning nowadays is that it is very compute-intensive, so you need access to supercomputers to do this at a large scale. So what can we use relation extraction for? The core use case is to take a large corpus of text, like the biomedical literature, and turn it into structured data. We can then use this structured data to populate a database or a knowledge graph, whichever you prefer to call it. A good example is the STRING database of protein associations and interactions, which relies on text mining in two ways. It first uses co-mentioning to pull out functional associations between proteins, since two proteins that are frequently mentioned together in the literature are likely to work together somehow. On top of that, it uses deep learning to pull out specific interaction types, such as physical protein-protein interactions. The approach is completely generic, however: I have previously used it for drug-target associations, disease-gene associations, and associations between diseases and organisms, and at the moment we are working on pulling out disease-lifestyle associations. So it is very broadly applicable. That's all I wanted to say about relation extraction.
Thanks for your attention, and if you want to hear more about how these methods are used in databases, take a look at this presentation.