Welcome to this short introduction to the core concept of network-based discovery. Network-based discovery is an example of knowledge discovery, that is, the goal is to discover new knowledge from existing knowledge. In network-based discovery, the goal is typically to discover indirect links between entities. The field lies at the intersection of several other research areas that I've covered in earlier talks, namely text mining, enrichment analysis, knowledge graphs, and machine learning.

So let's start with text mining. An early example of network-based discovery is Swanson linking, named after Don Swanson, who discovered a link between fish oil and a certain disease, Raynaud's syndrome. The idea was that both the disease and fish oil were related in the literature to blood viscosity, which led him to propose that fish oil could be used to treat the disease. This is known as literature-based discovery, and there are two variants of it. The first is what is called closed search. In this case, you have two different concepts, A and C, and you're looking for an intermediate concept, B, that links them. This can be used to explain observations: if you've already seen that there seems to be a relation between A and C, but you don't know why, you can go hunting in the literature for an explanation. The other variant is open search. In open search, you start from one concept, A, then link to B and further to C, thereby making new discoveries. As you can probably imagine, there doesn't have to be only one such path. One could look for multiple indirect links between A and C; that is, A might be linked to B1, B2 and so on, all of which are linked to C. Having multiple such links would obviously strengthen the association.

Another way of doing network-based discovery is enrichment analysis. Imagine that you're starting from a disease of interest and you've performed a genome-wide association study or a transcriptomic study. In either case, you end up with a list of genes. You then do gene-set enrichment analysis or over-representation analysis to discover pathways that contain many of the genes you already found to be involved in the disease. That is, you have found an indirect link between the disease and the pathway. This is equivalent to doing an open search looking for multiple indirect links: you have A, the disease, being linked to many different B's, the genes, which are linked to the pathway, C. The only difference is that in this case we're not looking in the literature. Rather, the links between the disease and the genes come from experimental data, and the links between the genes and the pathways typically come from curated knowledge. So evidence can come from many different sources.

So why not use knowledge graphs for this? In a knowledge graph, you can have many different types of entities or concepts, and you can have many different types of evidence integrated, including text mining, experimental data, and manually curated knowledge, leading to many different types of links between the many different types of entities. That gives you a graph like this, where you have, for example, genes and anatomy terms and diseases and drugs linked to each other in various ways. You can now go into this graph and also look for longer paths. You don't need to have A linked to C just via B; you could have A linked to B to C to D. The problem with longer paths is that you get a combinatorial explosion, and many of these paths may not make sense. For this reason, path searches are generally constrained with so-called metapaths, which allow you to follow only meaningful paths. For example, you could have a compound, a drug, linked to one disease, which has a gene involved in it, which is in turn involved in another disease, thereby linking the compound to that second disease via a longer, indirect path. Or the compound could belong to a pharmaceutical class that already contains another compound used to treat the disease.
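To make metapath-constrained search concrete, here is a minimal sketch in Python using the networkx library. The toy graph, node names, and edges below are made up purely for illustration; only the Compound-Disease-Gene-Disease metapath mirrors the example from the talk.

```python
# Minimal sketch (not from the talk): enumerating indirect links in a toy
# heterogeneous knowledge graph, constrained by a metapath. All node names
# and edges are invented for illustration.
import networkx as nx

# Build a tiny knowledge graph in which every node carries a 'type' attribute.
kg = nx.Graph()
nodes = {
    "aspirin": "Compound", "ibuprofen": "Compound",
    "migraine": "Disease", "arthritis": "Disease",
    "PTGS2": "Gene",
}
kg.add_nodes_from((name, {"type": t}) for name, t in nodes.items())
kg.add_edges_from([
    ("aspirin", "migraine"),     # compound treats disease
    ("migraine", "PTGS2"),       # disease associated with gene
    ("arthritis", "PTGS2"),      # disease associated with gene
    ("ibuprofen", "arthritis"),  # compound treats disease
])

# Metapath from the talk's example: Compound - Disease - Gene - Disease.
metapath = ["Compound", "Disease", "Gene", "Disease"]

def metapath_walks(graph, metapath):
    """Yield all simple paths whose node types follow the given metapath."""
    starts = [n for n, d in graph.nodes(data=True) if d["type"] == metapath[0]]
    stack = [[s] for s in starts]
    while stack:
        path = stack.pop()
        if len(path) == len(metapath):
            yield path
            continue
        wanted = metapath[len(path)]
        for nbr in graph.neighbors(path[-1]):
            if graph.nodes[nbr]["type"] == wanted and nbr not in path:
                stack.append(path + [nbr])

for path in metapath_walks(kg, metapath):
    # The first and last node of each path form a candidate indirect
    # (compound, disease) link.
    print(" -> ".join(path))
```

Each returned path connects its first and last node only indirectly, so the compound-disease pairs at the ends are candidate predictions; a real system would of course also track edge types and weight or score the paths, which is where we turn next.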
Of course, when you look for these longer paths going via many different types of concepts, you need some sort of scoring scheme for ranking them in terms of which are your most reliable predictions. And the obvious way to do that is to turn to machine learning. What is most commonly used in this field is unsupervised learning. Specifically, you can do random walks within the knowledge graph, constrained by metapaths, and thereby obtain a metapath-based embedding. This is a vector representation of every node or entity in the graph in terms of what it is connected to in its neighborhood. You can then use these vectors for clustering or for link prediction, thereby getting the indirect links that we're after. Another way of casting the problem is as a supervised learning problem, where you can use graph neural networks directly on top of these knowledge graphs to do label prediction.

Although a lot of work has been done in this field, there are still many open challenges. One relates to the fact that networks are not perfect, and for that reason the methods need to be robust to errors. That is difficult to achieve when you're looking for indirect paths, because all it takes is one wrong link to produce many wrong predictions. For this reason, it's important to be able to benchmark the methods, which is another challenge. Prediction is, as it's famously stated, very difficult, especially about the future. And the only way of really testing whether these methods are able to make predictions about the future is to do pseudo-prospective tests. That is, we try to make predictions based on what we knew up until some point in the past, to see if they can predict what we know now (a small sketch of such a time-split test is included at the end of this transcript). Unfortunately, this is really difficult to do in practice: it's very hard to make a clean dataset that doesn't leak any information from the period you're trying to predict. But the most difficult problem is how to avoid trivial discoveries. These are high-scoring predictions that often dominate your output and that are true in principle, but completely trivial to experts. For example, we may have a drug that is used to treat a disease, and you can trivially predict that another, nearly identical drug might work for the same disease. This is not interesting to predict, and unfortunately it's very hard to filter such predictions out. So when you use these methods, you tend to have to go through long lists of predictions that are correct but uninteresting to find the few interesting ones.

That's all I have to say about network-based discovery. If you want to learn more about knowledge graphs, how they are created, and what else they can be used for, I suggest that you take a look at this presentation next. Thanks for your attention.
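As referenced above, here is a minimal sketch of what a pseudo-prospective, time-split evaluation might look like in Python. The edge list, the years on the extra edges, and the shared-neighbour score are invented for illustration; a real benchmark would use a versioned knowledge base and take far more care to avoid information leakage.

```python
# Minimal sketch (not from the talk) of a pseudo-prospective, time-split test:
# rank candidate links using only what was known before a cutoff year, then
# check which of them actually appeared afterwards. The edge list, years, and
# scoring scheme are invented purely for illustration.
from itertools import combinations

# (entity_a, entity_b, year_first_reported) -- a stand-in for a real edge list.
edges = [
    ("fish_oil", "blood_viscosity", 1980),
    ("raynauds_syndrome", "blood_viscosity", 1982),
    ("drug_x", "gene_y", 1984),
    ("fish_oil", "raynauds_syndrome", 1986),  # the "future" link we hope to predict
]

cutoff = 1985
train = {(a, b) for a, b, year in edges if year <= cutoff}
test = {(a, b) for a, b, year in edges if year > cutoff}

def neighbours(entity, links):
    """All direct neighbours of an entity in a set of undirected links."""
    return {v for u, v in links if u == entity} | {u for u, v in links if v == entity}

def score(pair, known_links):
    """Toy open-search score: the number of shared intermediate (B) terms
    indirectly linking the two entities in the pre-cutoff network."""
    a, b = pair
    return len(neighbours(a, known_links) & neighbours(b, known_links))

# Rank every pair of entities that was NOT yet directly linked before the cutoff.
entities = sorted({e for link in train for e in link})
candidates = [
    p for p in combinations(entities, 2)
    if p not in train and tuple(reversed(p)) not in train
]
ranked = sorted(candidates, key=lambda p: score(p, train), reverse=True)

for pair in ranked[:5]:
    hit = pair in test or tuple(reversed(pair)) in test
    print(pair, "score =", score(pair, train), "| appeared after cutoff:", hit)
```

In this toy setup, the fish oil to Raynaud's syndrome pair is the only candidate with a shared intermediate term before the cutoff, so it ranks first and is confirmed by the post-cutoff data, which is exactly the behaviour a pseudo-prospective benchmark tries to measure.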