Welcome to this presentation about methods to automatically extract biomedical relations from the scientific literature. I will talk in particular about an application for the discovery of melanoma-relevant genes. Most biomedical research requires a long and painful preliminary search and inspection of the relevant scientific literature to find useful information. In the case of melanoma, for example, the amount of relevant literature has been increasing exponentially, as you can see in this slide. Simply finding relevant papers is obviously not enough. For example, if your specific interest is finding genes associated with melanoma, it will be difficult to formulate a query which captures exactly what you want. And even if you have a good query, you still have to inspect hundreds of papers. Life science databases simplify this procedure by providing already structured data. However, they are not always available for the specific area of interest. Even when they are available, they might not be up to date with the latest research, since the process of constructing them is slow and expensive. This research is motivated by the desire to leverage the power of novel AI developments, in particular in the area of natural language processing, to support the process of extracting relationships from the literature, for example, relationships between genes and diseases. As a result of this activity, we extracted a large set of genes which, according to the literature, are potentially related to melanoma. We make these results available to all interested researchers, together with the corresponding evidence from the literature. We started our work from an existing resource, the Melanoma Gene Database, MGDB. This is a manually curated database that contains information about genes related to melanoma, together with supporting evidence from the literature.
This evidence comes in the form of a PubMed reference, as you see here, and a supporting snippet of text from that publication, as you can see here. We applied techniques of automated relation extraction to the task of detecting new relationships between genes and melanoma, using the data in MGDB as our training set. In order to correctly detect relevant relationships, our algorithms must be able to distinguish between statements clearly stating a relation between the two entities, like the first one in this slide, and statements where a gene and melanoma are mentioned but no relationship is clearly stated, like the second one in this slide. In general, there are two main approaches to the problem of relation extraction. The first is a rule-based approach, based on manually written rules, which uses lexical and syntactic properties of the sentences and is often strongly limited. The second is a supervised approach, where an algorithm is trained on the basis of existing examples. In this study, we focus on the second approach. The main activities described in the article associated with this presentation were the following. First, we constructed a dataset of PubMed abstracts, considering the papers used by MGDB and adding additional automated annotations. This dataset, which we call the MGR base dataset, is basically an enriched version of MGDB. Second, we tested various machine learning algorithms for the extraction of relationships. We tested them over the MGR base dataset, which we partitioned into a training and a test subset. Finally, we applied the best of the algorithms found in the previous step to a much larger set of publications, not included in the original MGDB, in order to obtain novel relationships between genes and melanoma.
This allowed us to create an entirely new dataset of gene-melanoma relationships, containing 2,265 new genes potentially relevant for melanoma and not yet included in the resource from which we started, MGDB. This slide describes the steps of building the MGR base dataset. First, for each gene in MGDB, we got its gene ID, its snippets, and the PMIDs of the papers containing those snippets. Then we retrieved from PubMed the whole abstracts of those papers. Next, we applied our own automated annotation solution, OGER, the OntoGene Entity Recognition system, which detects mentions of genes and diseases in abstracts, annotates them, and adds their corresponding unique identifiers: identifiers from Entrez Gene for genes and from MeSH for diseases. In this case, only one disease is considered: melanoma. Although melanoma has different subtypes, with different MeSH IDs, we decided to conflate all of them into a single entity, the most generic entity for melanoma, to avoid data sparseness problems. As a side note, OGER is capable of recognizing a much larger set of entities, and we invite the listeners to test it at the URL mentioned at the bottom of this page. Finally, for each sentence which contains a gene identified by MGDB as relevant and also contains a mention of melanoma, we assume that this particular sentence is a positive instance for our learning algorithm, that is, a sentence which captures one of the relations originally annotated by MGDB. This approach is based on an assumption which is known in machine learning as distant supervision. This slide shows the result of this process in one of the abstracts. You see the entity mentions, as well as the one relationship which has been detected, in this case in the last sentence of the abstract. The MGR base dataset, constructed as previously described, contains 907 abstracts and a total of 1,244 distinct relationships.
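The distant-supervision labelling step just described can be sketched as follows. This is a minimal illustration, not the actual code used in the project: the function name and input structures are assumptions, and the entity mentions are taken as already detected by the annotation step.

```python
def label_sentences(sentences, mgdb_genes):
    """Mark a sentence as a positive instance when it mentions melanoma
    together with a gene that MGDB lists as melanoma-relevant."""
    labelled = []
    for sent in sentences:
        has_melanoma = "melanoma" in sent["diseases"]
        for gene_id in sent["genes"]:
            # Distant supervision: co-mention of melanoma and a known
            # relevant gene is assumed to express the relation.
            label = 1 if has_melanoma and gene_id in mgdb_genes else 0
            labelled.append((sent["text"], gene_id, label))
    return labelled

# Illustrative input: one sentence co-mentioning a known gene (Entrez ID 673,
# BRAF) and melanoma, and one sentence mentioning only a gene.
sentences = [
    {"text": "BRAF mutations drive melanoma progression.",
     "genes": ["673"], "diseases": ["melanoma"]},
    {"text": "TP53 was sequenced in all samples.",
     "genes": ["7157"], "diseases": []},
]
print(label_sentences(sentences, mgdb_genes={"673"}))
```

The first sentence becomes a positive training instance, the second a negative one; this is exactly the assumption that makes distant supervision noisy, which is why the learned classifiers are needed on top of it.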
We randomly split this dataset into a training set containing two-thirds of the abstracts and a test set containing the remaining one-third. We can use the MGR training subset to train several machine learning algorithms on the task of correctly classifying co-occurrences of a gene and melanoma in the same sentence. That is, the goal is to decide whether that sentence explicitly states that there is a relationship or not. The algorithms trained on the training subset were then tested on the test subset, and this slide shows the results. I describe these results starting from the bottom. The most crude approach is to assume a relationship every time a gene and the word melanoma occur in the same abstract. Unsurprisingly, this method achieved 100% recall, but very low precision. The next baseline, co-occurrence at the sentence level, assumes a relationship every time there is an occurrence of a gene and melanoma in the same sentence. Again, recall is very high, and precision is better than in the previous case, but still only about 50%. We can apply a traditional machine learning algorithm, decision trees, which produced more balanced results, with precision and recall around 65-68%. The more recent method of convolutional neural networks scored around 70% on both precision and recall. Finally, the most recent method that we consider is called BioBERT, which is a deep neural network with a transformer-based architecture derived from the well-known BERT model, which was developed at Google. BioBERT uses the same architecture as BERT, but it has been pre-trained on the whole of PubMed. Pre-training is a way to instill into the neural network the implicit knowledge contained in a vast collection of documents, and it does not require any type of annotation. In order to adapt the pre-trained network to a specific task, a second step called fine-tuning is required.
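The sentence-level co-occurrence baseline and the precision/recall scores used to compare all these methods can be sketched as follows. This is a toy illustration with made-up sentences, not the actual evaluation code from the study.

```python
def cooccurrence_baseline(sentence):
    """Predict a relation whenever a gene mention and the word
    'melanoma' appear in the same sentence."""
    has_gene = bool(sentence["genes"])
    has_melanoma = "melanoma" in sentence["text"].lower()
    return 1 if has_gene and has_melanoma else 0

def precision_recall(gold, predicted):
    """Standard precision and recall over binary labels."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, predicted))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, predicted))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy test set: the second sentence co-mentions a gene and melanoma but
# states no relation, so the baseline over-predicts.
test_set = [
    {"text": "NRAS activation promotes melanoma growth.",
     "genes": ["4893"], "gold": 1},
    {"text": "PTEN and melanoma samples were collected separately.",
     "genes": ["5728"], "gold": 0},
]
gold = [s["gold"] for s in test_set]
predicted = [cooccurrence_baseline(s) for s in test_set]
print(precision_recall(gold, predicted))
```

On this toy data the baseline reaches perfect recall but only 50% precision, mirroring the pattern reported on the real test set: co-occurrence catches every true relation but also many sentences that state none.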
During fine-tuning, the network undergoes some additional training using a labelled dataset to learn how to deal with a specific problem. We fine-tuned BioBERT on our MGR training dataset to learn how to classify sentences expressing the relationship of a gene with melanoma. The results of this method on our test set are around 75%. The next step consisted in taking the best model from the previous experiment, that is, BioBERT fine-tuned on the MGR base dataset, and applying it to a larger set of PubMed articles. We started with a fairly broad PubMed query, which retrieved about 89,000 PubMed abstracts. After applying our fully automated model, we detected 2,265 potential new genes related to melanoma, with evidence from 6,866 abstracts. These are genes which were not mentioned in MGDB. Unfortunately, we did not have the resources in this project to validate the entire set of 2,265 genes. So how can we estimate the quality of this dataset? Well, first, the scores obtained by BioBERT on the MGR base dataset can be used as an indirect evaluation of the quality of this MGR extended dataset. The scores are shown in this slide. Additionally, we asked two domain experts with experience in melanoma to manually inspect a subset of 700 genes. We gave them the sentences detected by our algorithm as evidence for the gene-melanoma relationship, and we asked them to assess whether the sentence clearly stated a potential role of the gene in melanoma, as opposed to just a co-occurrence of the two terms. Importantly, we did not judge if a gene is relevant or not for melanoma. We only verified if the sentence was stating such a relationship or not. In total, 85% of those relationships were judged as correct. On 100 of those relationships, we also computed the inter-annotator agreement, which was 73%. The MGR extended dataset, with all the evidence, is freely available at the URL mentioned in this slide, which of course you can also find in the paper, which is here.
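As a concrete detail of how such a sentence classifier sees its input: a common convention for BERT-style relation classification, used in the original BioBERT relation extraction experiments, is to replace the two entity mentions with placeholder tokens before fine-tuning, so the model learns the relational pattern rather than specific names. Whether this study used exactly this preprocessing is an assumption; the sketch below only illustrates the convention.

```python
def mask_entities(text, gene_span, disease_span):
    """Replace the gene and disease mentions with placeholder tokens,
    processing spans right-to-left so earlier offsets stay valid."""
    spans = [(gene_span, "@GENE$"), (disease_span, "@DISEASE$")]
    for (start, end), token in sorted(spans, key=lambda s: s[0][0],
                                      reverse=True):
        text = text[:start] + token + text[end:]
    return text

# Character spans as an entity recogniser such as OGER would provide them:
# "BRAF" at 0-4, "melanoma" at 31-39.
sentence = "BRAF mutations are frequent in melanoma."
print(mask_entities(sentence, gene_span=(0, 4), disease_span=(31, 39)))
```

The masked string is what gets tokenized and fed to the fine-tuned model, with a binary label indicating whether the sentence states a relation.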
This dataset is currently frozen at the literature available when we completed our work in spring 2021. However, since the process is fully automated, it could easily be repeated in order to include recent publications, and then repeated regularly afterwards. So if anyone is interested, please contact us. Additionally, to allow for an easy inspection of our results, we have created a web demo, which you can find at the URL mentioned here. You can see an example of this demo. It allows you to easily browse all papers where our system detects some gene-melanoma relationship. It will show all genes detected in the paper and highlight those that are assumed by the system to be related to melanoma, showing also the evidence in the text. For example, this is evidence for one relationship. There is also a second browsing modality, where you can enter a gene name or identifier, and the system will offer you a list of articles where the gene occurs with melanoma. You can then browse only those articles for evidence related to that specific gene. There are two major conclusions to this work. First, we were able to show that modern NLP techniques based on recent AI developments can be used to efficiently process the scientific literature, for example, to extract structured relationships among biomedical entities. In particular, these models might allow larger-scale and lower-cost data annotation. And most importantly, this process could easily be repeated on similar datasets. In other words, we could start from any large enough manually annotated resource and expand it through fully automated techniques. Second, as a specific application of this method, we have built an annotated dataset of causal relationships between genes and melanoma, which contains both concept-level and mention-level annotations. This dataset has two potential uses.
One is to support research in melanoma, and the other is to train and test more advanced methods for relation extraction from the literature. The set of detected genes and supporting evidence is made publicly available. Finally, I want to acknowledge my colleagues and co-authors, in particular Roberto Zanoli of the Fondazione Bruno Kessler in Trento, who did most of the practical work described in this article and also conceived the idea of starting from MGDB and using it for the distant supervision approach which we then adopted. Alberto Lavelli, also at FBK, collaborated in the supervision of this work. Finally, Theresa Löffler and Nicolas Perez Gonzalez performed the manual evaluation of the MGR extended dataset. We are grateful for the financial support of the Swiss National Science Foundation. And finally, let me thank my previous employer, the University of Zurich, where I was when most of this work was performed, and my current employer, the Dalle Molle Institute for Artificial Intelligence in Lugano. Thank you for your attention.