So I'm very happy to welcome Nataša Pržulj here today. She's a star in computational biology — I remember reading her papers about graphlets already as a PhD student and as a young professor. Nataša did her PhD in computer science in Toronto, then became an assistant professor at UC Irvine, then moved to UCL in London, where she climbed all the career steps of a professor, from assistant to full professor. Recently she additionally joined the Catalan Institution for Research and Advanced Studies, ICREA, and the Barcelona Supercomputing Center. Along the way she won numerous awards, three individual ERC grants, and she was admitted to several academies — I'll list a few here: the British Computer Society Academy of Computing, the Young Academy of Europe, then the Academy of Europe, and two years ago the Serbian Royal Academy of Scientists and Artists. So we are very happy to have you here, Nataša. We are very much looking forward to learning about your recent work on COVID-19. Welcome.

Thank you. Thank you very much, Karsten. Thanks for the invitation and thanks for this kind introduction. Today I will first present an overview of what we have been doing over the past, I don't know, 15 or 17 years, in the sense that we view medicine as a complex world of interconnected entities that needs to be mined for additional biological information. I have to start with the motivation — why we are doing what we are doing, why we need new methods. Then I'll only have time to give some illustrative examples of the new methods we develop to mine these interconnected data. In that regard, I will describe how we design new methods to mine a single type of omics data — as Karsten mentioned, mostly based on graphlets — and then I will focus more on how we mine multiple layers of heterogeneous data. We'll talk about only two applications, in cancer and in COVID-19, but basically we need to understand all of the steps to be able to understand how we came up with our predictions for repurposing of approved drugs for COVID-19 patients. In the end I will conclude and, because this is a school, say something about my views on what we need in the future of this field.

Okay, so let's start with the motivation. Technological advances have yielded an astounding harvest of various molecular and clinical data in biology and medicine. We are witnessing an explosion in the availability of genomes, epigenomes, transcriptomes, proteomes, metabolomes, phenomes, exposomes, metagenomes, et cetera. This is what we mean when we say omics data. This data growth was guided by empirical reductionism, which strives to dissect a biological entity into its constituent parts to understand it better. However, even in the times of Charles Darwin, 150 to 170 years ago, we knew that knowing the parts is not enough. Darwin in his Origin of Species wrote that biology is a tangled bank, with all of its entities interconnected. At around the same time in Germany, the insight that all diseases involve changes in normal cells forever changed the way we practice medicine. This data growth about the cell has now made us hit the wall of biocomplexity. We know that cells are not just loosely coupled arrangements of quasi-independent molecules, but highly intricate and precisely integrated networks of entities and interactions within the cell and within the environment.
All of these different data types complement each other, and this is why they need joint modeling and mining. So the time has come to replace the mostly reductionist, molecular perspective that dominated the 20th century with a new and holistic view of the living world, which is required to explain biological and medical phenomena in all their complexity. However, that requires establishing a perspective and framework not only for one problem, but for biology and medicine in general, with the foremost challenge being how to re-synthesize biology: how to put all of its elements back into their complex, dynamic environments, connect them all within a unifying framework, and reformulate biological paradigms within our nonlinear world.

So my vision has been to bridge this gap by developing mathematically principled frameworks for integration of all network data — frameworks that marry biomedical problems and data with algorithms from all sorts of areas of computer science and mathematics, such as machine learning, nonlinear mathematical optimization, network science, algebraic biology, et cetera. And due to the computational intractability of the problems that we are dealing with on large datasets, all of these have to be approximate, and we have to do it using high-performance computing. This is why I'm now sitting in one of the largest supercomputing centers in the world. So we will see how we propose modeling and computational advances that will hopefully link medicine's reductionist past with its holistic future and enable displacement of the dominant molecular representation of biology with a new and integrative paradigm that is deeper, more comprehensive and more inspiring. I'm currently holding an ERC consolidator grant to do this, and very recently I've also received a proof of concept grant to actually commercialize the results of the consolidator grant. In the past I had an ERC starting grant, and some of the first results I'll show came out of that one.

Now, computational challenges. I'm not sure who is in the audience, but basically: why do we need new tools to mine these data? We have so much — why isn't what we have enough? Well, we've had genomic sequences for quite some time now, maybe 20 years or more, and the problems of aligning sequences are so-called computationally easy, meaning that we can solve them exactly in time polynomial in the size of the input data. And we still don't really know what the genome is doing; we have gaps in our knowledge of that. Dealing with network data, with large networks, is much more complicated, in the sense that these problems are computationally intractable — NP-hard, NP-complete, et cetera; there are many of these classes. This means that we can mathematically prove that we cannot exactly solve these problems on large datasets, even given all the compute power of the world and all the time of the universe. That is what it means. But we still want to solve them, and our only way to do that is by solving them approximately. This comes from the theory of computation, right? One such problem seems trivial: aligning or comparing two networks. There is an underlying computationally intractable subgraph isomorphism problem that tells us we cannot do this exactly.
And this is why our only way of addressing this is by designing approximate methods, carefully tuned to extract new knowledge from particular data, because an inherent property of every approximate method, every heuristic, is that it is guaranteed to fail on certain data. This is what computational intractability means: you cannot have universal solutions. This is why machine learning methods can fail at tasks that are trivial to a human eye, right? So this is why we design these carefully tailored heuristics to extract new knowledge from individual data types.

So now we have various biotechnologies, each measuring a different aspect of the same cell. We have gene co-expression networks, where nodes are genes and, if they are co-expressed in a cell or under a certain condition, there is an edge, a link, connecting them; or physical protein-protein interactions, bindings between proteins to do something; or epistatic genetic interactions, metabolic interactions, et cetera. We want to extract information out of each of these individually. Also, we have various ontologies, such as the disease ontology and the gene ontology, and we have drugs and their chemical similarities, the side effects that link them, et cetera. Once we can extract as much information as possible from each of these layers, another challenge is how to integrate and fuse them to see what they collectively tell us, and also how to do that in a patient-centric way to improve tasks of precision medicine — so we wouldn't treat all COVID patients the same way, but certain patients would get certain treatments and others different treatments, based on their genetics and exposure.

Right, so this ends the motivation, and let me go over some illustrations of these new methods for mining one type of molecular network data to learn about the biological function of genes and proteins and about disease. I will give some examples from cancer, rare disease, et cetera. All right, so as I mentioned, one of the main workhorses of the cell is the so-called protein-protein interaction network. While genes are a blueprint — they code for proteins and do other things — once the proteins are made, they are three-dimensional structures that usually don't do anything by themselves: they bind together, and through this physical binding they change their three-dimensional conformation and do a particular thing, whether they cut other proteins, transport something from one part of the cell to another, et cetera. To abstract all of this, we model the proteins as nodes in the network, and wherever physical binding is possible we put an edge, a link. Then, presented with two protein interaction networks, let's say of a healthy and a diseased cell, we want to compare them. We just said we cannot do that exactly, and this is why, over the past 30 years or so, people started comparing simple things — simple heuristics — trying to figure out whether these networks are similar or not. So we compare their sizes: are the numbers of nodes and edges similar or not? Then, the number of links that a node has is its so-called degree, and we compare the distribution of degrees over all nodes in one network versus the other to try to figure out if they are similar. But here is an illustration of how very quickly you can come up with examples on which all of your heuristics will fail.
And this is one example of networks G and H that are exactly the same size — same numbers of nodes and edges, exactly the same degree distribution, each node in each network has two edges incident to it — and yet you can just eyeball them and see that they are very different: one is connected, to start with, and the other one is not. And this is why, quite some time ago in 2004, I came up with the notion of graphlets, which are small subgraphs of large networks that you can view as the Lego pieces of networks. Basically, I am asking whether we can figure out, from the way these graphlets are put together in the network — in Lego terms — whether in the end we have a pumpkin or a helicopter. This was the subject of my ERC starting grant. And it's not enough just to count them: at the same computational complexity as counting them, you can get much more information if you observe the symmetries within them, the so-called automorphism orbits. But let's not get into the mathematics, because I would like to keep the flow from these data to COVID patients, right?

So let me just illustrate what we do with these graphlets. Let's say we have this network and this node A in it. We are going to generate a feature vector of the wiring around that node — how that node is wired into the network — as follows. The first coordinate in this feature vector will be how many edges it touches; the edge is the only two-node connected graphlet. It touches four edges, and this is why this coordinate is four. Then we count how many three-node paths node A touches at an end node, and you'll see that's three: one, two and three, right? Et cetera. We count how many triangles it touches — one triangle, so that coordinate is one. How many squares it touches — one square, so that coordinate is one, et cetera. This is how we generate these feature vectors for the nodes, and then you can push them through machine learning methods if you wish. We can do this for all nodes, and then we have a matrix of node feature vectors. You can compare these feature vectors, right, to figure out something about the nodes — whether their wiring is similar, because wiring is indicative of similarity in biological function and involvement in disease. Or, if you want to figure something out about the network, you can go along the columns. The first column, or zeroth column here, is your degree distribution, right? And then you have the orbit one distribution — I don't see my mouse — okay, orbit one distribution, orbit two distribution, many distributions describing these networks. And then you compare all of these distributions to figure out if the networks are similar. Again, this is approximate, right? But we are putting many numbers in there to better gauge whether they are similar. And I'm missing some slides, but it doesn't matter — okay, hopefully not.

So we used this in network alignment. We said that network alignment is computationally intractable, so we found various ways to seed the alignment around nodes that are similarly wired and then extend around them, or do all sorts of other tricks to align networks, let's say of yeast and human, because yeast is much more annotated, much more studied, much better known than human, so we could possibly transfer annotation from yeast to human.
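To make these wiring feature vectors concrete before moving on, here is a minimal sketch (my own toy illustration, assuming the networkx library; real graphlet counters such as ORCA enumerate all orbits of 2-5-node graphlets) of counting a few orbits around a node, together with the kind of degree-identical pair of networks that such counts can tell apart:

```python
import networkx as nx
from itertools import combinations

def partial_gdv(G, node):
    """Toy graphlet degree vector for `node`: how many edges, 3-node path
    ends, triangles and squares it touches (a small subset of all orbits)."""
    nbrs = set(G[node])
    gdv = {"edge": len(nbrs)}  # edges incident to the node = its degree
    # end of an induced 3-node path: node - u - v, with no node - v edge
    gdv["path_end"] = sum(1 for u in nbrs for v in G[u]
                          if v != node and v not in nbrs)
    # triangles through the node
    gdv["triangle"] = sum(1 for u, v in combinations(nbrs, 2) if G.has_edge(u, v))
    # induced squares (4-cycles) through the node: node - u - w - v - node, no chords
    gdv["square"] = sum(1
                        for u, v in combinations(nbrs, 2) if not G.has_edge(u, v)
                        for w in set(G[u]) & set(G[v])
                        if w != node and w not in nbrs)
    return gdv

# Two networks with identical degree distributions (every node has degree 2)
# that graphlet counts tell apart: a 6-cycle versus two disjoint triangles.
G_cycle = nx.cycle_graph(6)
G_triangles = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)])
print(partial_gdv(G_cycle, 0))      # two path ends, no triangles or squares
print(partial_gdv(G_triangles, 0))  # one triangle, no path ends or squares
```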
We found out that similarity in the wiring patterns of proteins in protein interaction networks means similarity in biological function, membership in protein complexes, similarity in subcellular localization, tissue co-expression and involvement in disease in human — not only for nearby nodes, such as these two, but also for faraway nodes that reside in a network neighborhood that looks like the mirror image of this one. This is a drawing for illustration — I mean, it's real data, but it's drawn to illustrate. Then we push these through simple machine learning methods to find that this correlation between the wiring patterns around proteins and their biological function carries over from yeast to human — for instance, membrane proteins have one characteristic wiring pattern while transcription factors have another, et cetera. Early on, about 11 years ago, we validated biologically that in this way, by observing only the wiring patterns, we can find new cancer genes — in this particular case, genes participating in melanin production pathways, which are implicated in skin cancer, of course — and our predictions were phenotypically validated by siRNA screens in Professor Ganesan's lab at UC Irvine, where I was quite some time ago now.

Okay, but in our recent work that got published last year at the ISMB conference — the top conference in the field, with acceptance rates of around 15% — we used these methods to study not the protein interaction network, but a new type of data that recently emerged, so-called chromatin structure networks. Basically, the DNA is a very long molecule, and it couldn't fit into the small nucleus if it wasn't packed in a very particular, very ordered way. We now have data on which parts of the DNA, even though they are very far apart in the sequence, are packed together, very close in space in the nucleus. If they are packed close in space in the nucleus, then once this part of the chromatin gets opened up and transcribed, because the same transcription factors bind there, they can be transcribed together. So they don't have to be close in sequence, but if they are close in space, they can be transcribed at the same time. These are the chromatin structure networks: basically, which parts of the DNA are in close spatial proximity. We took the chromatin structure networks of chronic lymphocytic leukemia (CLL) cells and of the control cells, naive B cells — this is where this kind of leukemia originates from. And this is just a visualization of the chromatin structure network of the naive B cell, the control, the healthy cell, and of the CLL cell, and you can see that basically the chromatin structure is destroyed. This is just a spring embedding that we use for visualization. On the left, in the healthy cell, you can see clear clusters, clear organization, and on the right-hand side it is a mess. And we kind of knew this — that there are rearrangements of the genetic material in cancers. But what is interesting is that this structural difference exists in normal cells even before CLL hits, which means that there are hotspots in the packing of the chromatin where the mutations and the rearrangements go. Anyway, if you'd like to learn more, this is the paper from last July, and we need to move on in this talk. So these graphlet-based measures we generalized to directed networks, and they perform very well.
They are the best performing for network analysis so far, and they are very robust to noise. I have not addressed this, but everywhere we test for robustness to noise, because biological data are no different from any other data — all data have noise, right? So we need to deal with that. Quite a while ago, back in 2004, I was the first to propose that protein interaction networks are actually geometric, and we have since seen the advent of geometric and topological data analysis, geometric deep learning, et cetera. We can track the dynamics of networks — how they change, how the nodes change across networks over time, et cetera. Also at ECCB last September, another of the top conferences with acceptance rates similar to ISMB, around 15-20%, we proposed probabilistic graphlets: basically, graphlets are generalized to have probabilities on the edges, and we have shown that if you use them on data that are themselves probabilistic, you can better identify condition-specific functions in the cell. We generalized Laplacians to graphlet Laplacians — this was in Bioinformatics two years ago — and in the same paper we generalized spectral embedding and clustering to work with these graphlet Laplacians, and we have shown that by using them we get different and complementary annotations. And graphlets have not only been used in biology and bioinformatics; they are used in other areas, for instance for image category recognition and classification and photo cropping, in social networks for recognizing personality and affective states, in large corporations for maximizing productivity, in theoretical computer science, et cetera.

And let me just say a few words about multi-scale organization, because in the cell it's not only entities and their pairwise interactions — sometimes several interactions are necessary at the same time, for instance for a protein complex to be functional. So, for instance, if we have a protein complex of three proteins, we model it as three nodes and three edges, like that, as a triangle. But if we have three complexes of two proteins each, we model it in the same way, and by doing this we lose biological information. A better way to model it is that the complex of three proteins is one set of three nodes, while the three complexes of two proteins each are three sets of two nodes each, right? And this can be captured, of course, by a hypergraph. This is not a new notion — mathematicians have known it for a long time; in '89 Claude Berge had a very good book on that. Basically, hypergraphs are just like graphs in the sense that they have nodes, but the edges are not only between pairs of nodes: any subset of nodes can be a hyperedge. In systems biology, previously only some simple measures had been generalized to hypergraphs, such as centrality, clustering and maybe degree distribution. So we went further, and at ECCB three years ago we published a model of the protein interaction network as a protein hypernetwork, where we took the direct, pairwise protein interactions and also modeled all protein complexes as hyperedges and all pathways as hyperedges. We have this model that includes multi-scale protein interactions, and then we generalized graphlets to hypergraphlets and orbits to hyperorbits.
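As a small aside to make the modeling point concrete (my own toy example, not code from the paper): a single three-protein complex and three two-protein complexes give exactly the same set of pairwise edges, but different hyperedge sets, which is the information the hypergraph keeps.

```python
from itertools import combinations

# One complex of three proteins vs. three complexes of two proteins each
complex_abc = [frozenset({"A", "B", "C"})]
three_dimers = [frozenset({"A", "B"}), frozenset({"B", "C"}), frozenset({"A", "C"})]

def to_graph_edges(hyperedges):
    """Flatten hyperedges into pairwise edges, as a plain graph would store them."""
    return {frozenset(pair) for h in hyperedges for pair in combinations(sorted(h), 2)}

# The plain-graph view cannot tell the two situations apart...
assert to_graph_edges(complex_abc) == to_graph_edges(three_dimers)
# ...while the hypergraph view keeps them distinct.
assert set(complex_abc) != set(three_dimers)
```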
For three-node hypergraphlets we have 65 orbits, but if you go to four-node hypergraphlets, you get a feature vector of over 6,000 coordinates. And then you do the same as before: you have these feature vectors and you cluster them, or do whatever you like with them. Basically, we did some simple canonical correlation analysis between these feature vectors and GO annotations — this particular example is about GO biological processes — and we found that there really is a strong correlation between these particular orbits here and some processes in the brain: synapse organization, learning and memory, et cetera. We don't know why — we are computer scientists in the lab, even though now we hire some biologists and environmental engineers, et cetera, as the lab has grown dramatically over the past few years. And this is just an illustration from the paper that if you model the protein interaction network with hypergraphs and analyze it with hypergraphlets, you get much more biological information — this is in terms of the enrichment of the clusters you get when you cluster these feature vectors of proteins, both in yeast and human. Hypergraphlets lived on with Predrag Radivojac, my colleague from the US: last September, I believe, we published hypergraphlet kernels and classification of biological networks with them. So if you care, you can read that.

A similar notion to hypergraphlets comes from geometry, or rather from algebra: abstract simplicial complexes. In analogy to graphlets, we made simplets — all of the possible small abstract simplicial complexes. We modeled the protein interaction network with them: the regular protein interaction network is just pairwise bindings, with proteins being in one-dimensional simplices, and then we also took all protein complexes in the network and modeled them as simplicial complexes. And again we did the analysis based on these simplets — now the feature vectors are over simplets — and we are again extracting much more biological information with them than with the regular protein interaction network. The red is the enrichment we get from our new model and the new methods that analyze it, the blue is the protein interaction network, and the random control is at the bottom — you cannot even see it. Okay.

All right, now we have learned about graphlets and what you can do with them, and this is ongoing research in the lab — we have a lab member who is still furthering this work based on graphlets. However, this is all for one layer of information, let's say the protein interaction network, and you see that even within that one layer you have multi-scale organization. We have multiple complementary layers of biological information, and I will show you some illustrations of what we do with them. First I will give an example in cancer, and then I'll move slowly towards COVID-19. Right. So what we did in this paper from 2019 that we published in Nature Communications is the following: we take cancer tissue and make a cancer-specific gene co-expression network, protein interaction network and genetic interaction network from the databases listed here in grey brackets, and the same for the healthy tissue controls. We take these networks and fuse them into a new model, a new network that integrates all of this tissue-specific, cancer-specific information. Then we compare them using these graphlets.
This is why I spent some time talking about graphlets: by this comparison of the new model of an integrated cell — the new network that we call an iCell — we find new cancer genes. We validate them in many ways, including through biological experiments that we did in our lab, and we show that these are new genes that do something in cancer and that we cannot see as different in any of the constituent data types alone. They only come out of this data fusion, this data integration. And this is what I mean by the different data types complementing each other. In this study we took basically all of the genes that appear in the protein interaction, co-expression and genetic interaction networks, and all of their interactions. While we have around 3,000 genes in the overlap of these data types, we have only one shared interaction — and this is not a mistake. It is simply because we are using different biotechnologies to look at the same thing. So there is only one shared interaction, and this is why most of the previous data integration methods fail. We tested all of the state-of-the-art ones at the time: they either have memory issues (this one is based on tensor factorization), or they diverge after about a hundred iterations, or they don't converge, or they don't produce clusters, or they produce a very large number of very small clusters, et cetera. So this is what motivated our approach: how do we make these data complement each other, how do we address this problem?

So, as I said, we took these networks from these databases — three different kinds of omics networks — and we also took tissue-specific gene expression from the Human Protein Atlas. And please note, this is not microarray or RNA-seq: these are antibody stainings, low throughput, with a low number of samples, so they don't represent all patients, just a small number of them. If you don't see your favorite gene there, it's not a problem with the method; these data are like this, low throughput. We only consider genes that have expression available in the Human Protein Atlas and that have at least one protein-protein interaction, and we create tissue-specific networks where nodes are genes that are expressed in the tissue according to the Human Protein Atlas, linked by the interactions coming from these databases — PPI, co-expression and GI. We create these tissue-specific networks for the four most common cancers in human and for their control tissues, the tissues of origin. And you can see that these networks are large, so you can safely do your analysis on them: they have around 10,000 nodes and 100,000 edges or so, both in the control and in the cancer.

And what do we do with them? This is now the key step. We color-coded the protein interactions here in red, the co-expression network in blue, and the genetic interaction network in green. Now, as we all know, networks can be represented as matrices: gene by gene, and if there is an interaction we put the value of the interaction in the matrix — whether it's one or zero, or maybe the value of the expression, it doesn't have to be one or zero. So these are square matrices, and for these three networks we have three square matrices. And then we factorize them. This is another optimization problem, computationally intractable: you are minimizing the objective function, going to a local minimum — this is the best you can do.
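As a rough sketch of this kind of shared-factor decomposition, described in prose next — a simplified, unregularized least-squares version with made-up dimensions and learning rate, not the exact objective or solver from the paper:

```python
import numpy as np

def fuse_networks(adjacencies, k=50, iters=500, lr=1e-3, seed=0):
    """Jointly factorize several n-by-n adjacency matrices A_i ~ G @ S_i @ G.T
    with one shared factor G (the sharing is what fuses the layers).
    Plain projected gradient descent on the summed squared error; the
    learning rate and iteration count would need tuning on real data."""
    rng = np.random.default_rng(seed)
    n = adjacencies[0].shape[0]
    G = rng.random((n, k)) * 0.1
    S = [rng.random((k, k)) * 0.1 for _ in adjacencies]
    for _ in range(iters):
        for i, A in enumerate(adjacencies):
            R = G @ S[i] @ G.T - A                      # residual for this layer
            G -= lr * (R @ G @ S[i].T + R.T @ G @ S[i])  # gradient w.r.t. G
            S[i] -= lr * (G.T @ R @ G)                   # gradient w.r.t. S_i
            np.clip(G, 0, None, out=G)                   # keep factors non-negative
            np.clip(S[i], 0, None, out=S[i])
    return G, S

# The fused, iCell-style network is then read off the shared factor as G @ G.T
# (thresholded to keep only the strongest fused interactions).
```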
So you factorize these matrices into products of matrix factors. This one is G — genes times something low-dimensional, k — these are the compressed versions of the networks, k times k, and then again k times genes. This is a minimization, and after that minimization we get our G. Now we multiply G times G transpose to get a new network. This is the new model that emerges from the fusion of these data, because in this factorization we are factorizing three matrices while keeping this matrix factor G the same across all three factorizations — this is how the data fusion happens, okay? And then we have our network, the iCell. Here is just an illustration — these are Reactome pathways, KEGG pathways and GO biological processes — a visualization showing that you get clusters that are functionally enriched. We see that iCells capture this: in the bottom table, for all of these — whether it's the breast control iCell or breast cancer, prostate control, prostate cancer, et cetera; iCell is pink — the enrichments are much higher in the clusters obtained from the iCell than in any of the constituent data types. This is additional functional information that emerged from the fusion, despite there being essentially no overlap between these networks. These iCells are not random — they are far from random — and while protein interaction networks we could safely model with geometric models, we really don't know yet what kind of model fits this new iCell network; but let's leave that aside. And here, to say that each dataset contributes to this fusion: you can see how the percentage of enriched genes increases as we add more data into the fusion.

Okay, so now we have these tissue-specific networks from the data — PPI, GI, co-expression — and the iCells; let's see what they tell us. Between cancer and control tissues, for the four cancers at the top of the screen, we find the genes that are cancer-silenced, cancer-activated, always silenced (present in neither cancer nor control) or always activated (present in both cancer and control). And interestingly, it is the always-expressed genes that are enriched in drivers. Once we found that out, we focused on these always-expressed genes. In the cancer and control versions of the four different data types, including our iCell, we found the most rewired genes — we did that with graphlets, which is why I talked about graphlets — and checked their enrichment in drivers. We found that only in iCells — not in PPI, not in GI, not in co-expression, only in iCells — is rewiring of the genes indicative of involvement in cancer: the top 500 most rewired genes were enriched in drivers and driver-related pathways. Great. So then, for our four most frequent human cancers, we took the 20 most rewired genes in the cancer iCells — they were not 80 in total because there was some overlap; there were 63 of them — and we validated almost all of them, either in the literature (these are PubMed IDs) or through knock-down experiments that we did in the lab. For instance, if we knock down this gene in breast cancer cells, we get significantly reduced viability of the cancer cells, and if we knock down these genes in these cancer cells, we get significantly increased viability. So they do something.
Also, through survival curves: we went back to retrospective patient data, and patients differing in these genes show different survival. And notice that only 17 out of these 63, or 27%, are differentially expressed in cancer — so differential expression alone cannot uncover all of these genes that emerged from the fusion. Once we figured this out, we went pan-cancer, with everything we could find — 20 cancers in total — did the same sort of thing, and found the most highly rewired genes. And we have this top prediction, a gene that was not known to have anything to do with cancer before, and we went to the Human Protein Atlas and found that it indicates different survival in patients of eight different cancer types. So here I would conclude with the iCell: it is a new concept, we get new cancer-related genes that emerge from the fusion, we need to go beyond any of these data types — including differential expression, et cetera — and it is versatile; we can do many things with it.

And let me just illustrate what we can do. This is our ongoing work in rare thrombophilia. We have very few patients — this is a very severe form of thrombophilia that is only present in Northern Italy, the Balkans and the Far East. We only have five subjects from two families: two brothers, one diseased, one healthy — sorry, the healthy brother has a daughter who is diseased — and we have two sisters. With so few patients we cannot do much, but if we do this kind of patient-level fusion — we have the patients, their germline mutations mapped to their genes, and these different interaction types — then we can minimize objective functions similar to the ones before and we get clusters. For instance, this cluster is specific to mutations found only in the diseased women, and these are mutations specific only to the healthy brother, so we suspect that these mutations are protecting the healthy brother. But this is ongoing work; it is an illustration of how this is a machine learning method that provides mechanistic explanations. You can fuse heterogeneous data types not only over the genes, over the same kind of entity — such as protein interactions or genetic interactions — but also between patients and the mutations of their genes, or patients and the drugs that they could receive or have received, or between drugs and the places where they bind, the proteins that they bind to, et cetera. Basically, you are decomposing matrices across these data types and sharing the matrix factors between the decompositions. And let me not go into this in detail — this is a study from 2016 that we published in the Pacific Symposium on Biocomputing, for ovarian cancer patients, where from the same formulation we address three tasks of precision medicine: patient stratification from the clusters of patients, cancer gene prediction from clusters enriched in cancer genes, and drug repurposing from the drug-target interactions that, through the matrix completion property, we fill in based on the data. And again, all of the data contributed to this fusion, and you get higher precision, recall, AUROC, et cetera, when you put in all of the data.

And finally, a little bit about COVID. We put this on a preprint server, so you can read it. It is about the ongoing pandemic — I don't need to tell you about this; we are probably all sick and tired of it already and frustrated and want to contribute, right?
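Since matrix completion comes up again in the COVID pipeline, here is a minimal sketch of the idea — a toy masked low-rank factorization with made-up parameters, standing in for the regularized matrix tri-factorization used in the actual framework:

```python
import numpy as np

def complete_drug_target_matrix(R, mask, k=20, iters=2000, lr=0.01, seed=0):
    """Fill in unknown drug-target associations by low-rank completion:
    fit R ~ W @ H on the observed entries only (mask == 1), then read
    predictions for all pairs off the reconstruction."""
    rng = np.random.default_rng(seed)
    n_drugs, n_targets = R.shape
    W = rng.random((n_drugs, k)) * 0.1
    H = rng.random((k, n_targets)) * 0.1
    for _ in range(iters):
        E = mask * (W @ H - R)       # error on observed entries only
        W -= lr * (E @ H.T)
        H -= lr * (W.T @ E)
        np.clip(W, 0, None, out=W)   # keep factors non-negative
        np.clip(H, 0, None, out=H)
    return W @ H                      # scores for all drug-target pairs

# High-scoring pairs that were unobserved in R are the repurposing candidates.
```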
So quite a while ago we saw, from Krogan's lab, the first protein interaction map between viral proteins and human proteins; since then there have been more, but in cancer cell lines, so it's questionable. There was also, at around the same time in May, a study of expression in the lung tissue of COVID patients — so, expression data. So we again constructed these networks: from the SARS-CoV-2 viral proteins, the viral-host interactions; then the molecular interaction network, which we construct from protein interactions, genetic interactions and metabolic interactions, because metabolism is very important here; and we have drug-target interactions, drugs targeting human gene products, proteins, right? This is just where the data come from. And then we again factorize these matrices — the viral-host interactions; here is the factor linking the human genes to drugs — and we regularize by the drugs' chemical similarities and by the molecular interaction network. Then, by co-clustering, we get clusters of genes and drugs emerging from all of these data, and matrix completion for drug repurposing of these drug-target interactions, right? And we again get clusters with much higher enrichments — of genes and drugs, in terms of GO biological annotations or drug categories from DrugBank, et cetera — than we could get without this fusion. I will not focus on this; it basically shows that the approach works, that we get good results when we do this.

Now let's focus on this part, because we want to repurpose known drugs to be able to bring a cure to these patients faster, if possible at all, to help in these efforts. We focus on the drug-target interactions, and we predicted over 800 of them — about 500 drugs targeting around 200 human genes. Basically, we are not going into antivirals: antivirals are hard to design, very specific, and we have very few of them. Instead we are going after the human host, and we want to target the human host with drugs to help the person who is infected, right? We only focus on FDA-approved drugs, to reduce the number of them, okay? And we have certain validations in databases that we did not use for the fusion, so we are confident that this method works. And then we went back into the molecular interaction network. We have the viral interactors — from the virus to the human host, these are the human host proteins that interact with the viral proteins — and then we have the genes that are differentially expressed in human. Nothing is very specific there, and we cannot really target those very well. But the common neighbors — if you go into the molecular network, the common neighbors of the human viral interactors and of the differentially expressed genes — that intersection is very large and very distinctive in its structure, in terms of degree, clustering coefficient and different kinds of centrality, and also in terms of graphlets. So these common neighbors in the human molecular network are specific, also in terms of graphlets, very different from the rest. When we found out that these common neighbors of the human viral interactors and the differentially expressed genes are specific in topology, in their wiring in the molecular interaction network, we then looked at them functionally, and they are functionally enriched in viral processes, while the other sets of nodes are not.
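A minimal sketch of how such a common-neighbor set could be pulled out and profiled by simple topology measures — assuming networkx and hypothetical input lists; the actual analysis also uses graphlet-based statistics rather than only these centralities:

```python
import networkx as nx

def covid_target_candidates(G, viral_interactors, de_genes):
    """Common network neighbors of (a) human proteins binding viral proteins
    and (b) differentially expressed genes, with a few topology descriptors."""
    neigh_of_viral = set().union(*(set(G[v]) for v in viral_interactors if v in G))
    neigh_of_de = set().union(*(set(g) and set(G[g]) for g in de_genes if g in G))
    common = neigh_of_viral & neigh_of_de
    deg = dict(G.degree())
    clust = nx.clustering(G)
    btw = nx.betweenness_centrality(G)  # expensive on large networks; sample in practice
    return {g: {"degree": deg[g], "clustering": clust[g], "betweenness": btw[g]}
            for g in common}
```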
And then basically we decided that, because these genes are so central and so connected, this is probably the best place to hit with already approved drugs. So we reduced the number down to around 150 drugs, targeting about 50 human genes. 11% of them were already in clinical trials, and 70% of them were already in the Corona Drug Interactions database, right? And then we went a little further into the functional analysis of these particular genes that we propose should be targeted, and we found enrichments in interesting processes, but all of them are linked to VEGF signaling and nitric oxide signaling. You can read the details in the paper, because I should conclude this presentation soon. But basically all of these share this commonality, and so we propose modulation of these kinds of pathways for treating patients.

And I would love to conclude. This methodology is versatile: you can put into it all DNA elements, expression, methylation, copy number variation, clinical profile similarities, treatments, adverse effects, et cetera. However, each time you pose this problem you have a different NP-hard continuous optimization problem, for which you have to propose the objective function and the optimization solver, prove convergence and correctness, and of course the optimization will be slow, so you need high-performance computing to do this. I will not talk about the other things that we did with it; just to conclude, I hope I conveyed the message that we are dealing with a complex system of heterogeneous interacting entities, that each of the data types that we have is limited, but they all carry complementary information, and this is why they need joint organization and mining within the same framework. Everybody knows that these are important issues. I already acknowledged my grants, which is required in every talk, so this is why I have to put them on many slides.

So, to holistically mine all the data we need paradigm shifts, both conceptual and methodological. Conceptually, we have to analyze all data types within the same framework, with new bottom-up, data-driven biological concepts generated from these data — and they may tell us that the cell is governed by yet-undiscovered principles of life, so we have to rethink biology and approaches to medicine; we have seen only the first prototype, the iCell. Methodologically, we have seen that even when mathematical formalisms do exist — for instance hypergraphs, to capture multi-scale organization — we don't have algorithms to mine them, and this is why we developed hypergraphlets, simplets, et cetera. And we need to utilize dependencies, for instance in the local network topology, as we did with graphlets and orbits, which are dataset-dependent — and this is why these methods work — to uncover the latent, low-dimensional structure of the data and exploit that structure for devising efficient toolsets for particular data only, because this is the best we can do. But no matter what we do, unless somebody proves P equals NP, computational issues remain with us, due to computational intractability and the large size of our data. This is why training is key: we have to train embedded data scientists who will be solving problem-specific things in specific labs and designing these heuristics. And to that end, I wrote this book with colleagues a year or two ago; it was published by Cambridge University Press.
It is a textbook, and it has chapters — five of them written by my lab, the rest by contributors; Karsten's chapter is there. So I think this is a very good introduction to the field for people from all sorts of backgrounds. And we have to work with biomedical collaborators and industry. Also, I'm hopefully taking this strictly out of the academic setting, trying to scale out and productize the iCell. And the ERC — I cannot thank them enough; they were instrumental for my transition from the USA to the EU, and none of this would have been done without them. Of course I also thank the other funding agencies, and none of this could have been possible without my group members — I acknowledge them all, and also my collaborators. We are currently hiring, so please email me if you are interested, at all levels: postdocs, PhD students, research engineers. And this is how my group looked in Barcelona — this is only part of the group — a couple of years ago; this was last year, when we could still travel, at a retreat; and this is this year, all online. I would like to thank you all for your attention, and I will take some questions.

Natasha, we thank you for this very clear and comprehensive introduction to your work. Thank you very much. We send you a virtual applause, and we now have time for a few questions. Let's start with the audience on Zoom — please raise your hand if you have a question. I don't see a raised hand yet, so I will start. Natasha, this is very exciting work. As I said, already more than 10 years ago I read your papers about graphlets and the network motifs that you study. Now, one question about these biological networks: they tend to be incomplete, and they tend to contain false positives, so we cannot fully trust the edges. Has someone systematically examined how this affects the space of graphlets — studies like simulating networks with a certain ratio of missing edges or false positive edges, and then seeing how this affects the graphlet spectrum or the network motif spectrum of the graph?

Exactly — and this is what I mentioned about robustness to noise. From the first paper in 2004, where I introduced these graphlets, we have tested robustness to noise: basically, we randomly add a certain percentage of edges, or rewire or remove them, et cetera. These methods — it depends on the method, they are all heuristics, right — but sometimes, especially with the orbits and so on, they can tolerate even up to 70% noise. That is a lot of noise to tolerate, right? Now, in terms of the quality of the data: even as early as 2004 there were huge efforts, for instance funded by the Canadian government, where expert postdocs actually read the papers and confirmed the validity of the interactions. Since then the methodology has improved and many different labs have confirmed interactions, so if an interaction is reported, we are pretty sure it is really there. It's just that the data will be sparse — of course they will not cover the whole interactome by any stretch of the imagination, because some things simply cannot be tested with the biotechnologies we have right now. So that effort was important.

And what about biases in data collection? For instance, a certain bias where one gene or protein is particularly relevant for the biomedical community, so they study all the interactions of that protein and it is very well covered, whereas for other proteins the knowledge is very incomplete.
Let's say you have a big hub in the network, and that hub creates a lot of small graphlets attached to that node. Is this impact, this problem, fully understood, or can you correct for these kinds of biases in your methods?

Yeah, you can correct — I mean, these are algorithms, you can do whatever you want. For instance, there are these sticky proteins — some of them really are sticky, and there is no bias there: they really are hubs, right? It's not just that they have been studied more. In the early studies, of course, roughly 20 years ago, the data were severely biased by certain study designs — bait and prey: you are only interested in this pathway, it has 15 proteins, you tag them and pull everything down, so these 15 will be your hubs, right, because you pull down, I don't know, hundreds of proteins with these 15. And of course then you will get this scale-free topology. But even as early as 2004 this scale-free story faded a little bit, because people started studying more, biologists became aware of this, and they started with the matrix approach — all-to-all in a certain region. For instance, Marc Vidal at Harvard — he is a pioneer of these yeast two-hybrid methods — basically does all-to-all screens: let's explore the entire interaction space of yeast, Arabidopsis, human, and so on; and we have worked together on some of these things. So yes, we are making efforts on that, and if it is deemed that there is still a bias somewhere, you can always correct for it.

Very good. Are there further questions? If not, I have a higher-level question for you. Similar to myself, you have been working a lot on these explicitly defined network statistics, like the graphlets and other types of patterns in the network. Now there is this alternative way of doing deep learning on graphs — the previous talk by Matthias Niepert was about that. So do you see a role for that? In your world and in your work, are there maybe problems that you couldn't solve so far with the more statistics- or count-based approaches, where you see a big potential for deep learning on graphs?

Yeah, when you think about it — I mean, again, I'm not a machine learning expert; I kind of picked up machine learning as I needed it, and we modified some methods — but factorizations, you know, can be shown to be equivalent to some deep learning approaches, right? So in that sense, we are now toying a lot with these embedding methods, like everybody else, and you can do these embeddings in various ways, either through neural networks or through factorizations — some of them are completely equivalent, et cetera. It all depends on what you want to do, but I think the integration of all of these approaches will absolutely be necessary to solve these problems. Let's see where it takes us, but yes.

Yes, one question from the YouTube audience that I'm reading out: thanks for the interesting talk; you have mentioned using IID, the database for PPIs — could you comment on using the STRING database for the PPI information?

Yeah, sometimes we use STRING. It's just that we used IID when it was available — I think maybe it was taken down recently, I don't understand why. STRING is also great. It's just that, for filtering, it was easier for us to use IID, because we never want to deal with predicted interactions.
We only want interactions coming from biological experiments, and if you want even more confident data, then interactions that various labs have confirmed. But yeah, we have used STRING — in this particular illustration of the method, no, but it would still be possible.

Good, thank you. I'm checking whether there are any further questions. In fact, there are none, and we are right on time. So thank you very much, Natasha, for joining us today. Thank you very much for this great talk and overview — we appreciated it. Thanks for the invitation, it's been a pleasure. Yeah, you're most welcome. This was a great talk and a great element of our symposium. Thank you very much.