Good time to start. Welcome back from the break. It's a great pleasure for me to introduce the second speaker of today, Mathias Niepert. Mathias is from NEC Labs Europe. His research interests are in relational learning, learning on graphs, and learning with graphical models, and he's doing extremely interesting work in this domain. He did his PhD at Indiana University Bloomington, then had postdoctoral stays at the University of Mannheim and the University of Washington, and then he joined NEC Labs Europe, where he has had a very steep career: after a few years, he's now the manager of the machine learning group. A stellar career. We are very happy to welcome him here today and to learn about these exciting topics. You remember that in Fabian Theis's talk, this work on graph data and relational data was also highlighted as one of the next challenges in the field, so it's great that you're here. Mathias, it's great to have you. But before you start, I also want to mention that you are a co-founder of several open-source digital humanities projects, such as the Indiana Philosophy Ontology project and the Linked Humanities project, so I also want to give credit to these activities of yours. It's great to have you here, and we are now looking forward to your talk.

Yeah, thanks a lot for inviting me. Looking at the set of speakers, it's an amazing symposium; my impostor syndrome is going really strong today. My talk will be about neural relational learning and some biomedical applications. As Carsten said, I'm the manager and also the chief scientist at NEC Laboratories Europe. NEC is a Japanese IT company with many different application areas and use cases, one of them being in the biomedical domain and specifically in drug development. Before I start, I also want to credit my collaborators, the machine learning and biomedical AI groups at NEC Labs, and specifically Alberto, Brandon, Caroline, and Timo, who have worked a lot on the types of projects that I'm going to present to you today.

The first question that I want to elaborate on is: why does an IT company like NEC care about graph data, and specifically about the biomedical domain? You can see four different types of graphs that we have worked on in the past two to three years. We work quite a bit with patient data; NEC is a big vendor of software for medical records, specifically in Japan. Chemical compounds, of course, can be represented as graphs. We've worked quite a bit with biomedical knowledge graphs, where the nodes of the graph are things like drugs, proteins, and diseases, together with the ways they interact, and there can be many different types of interactions. And then, specifically for a project that I'm going to talk about today, a network of peptides, in this particular case neoantigens, where we want to learn something about neoantigens that tells us how likely they are to elicit an immune response in cancer patients. So, as you can see, we have this natural way of representing biomedical or medical data as graphs, and now the big question is how we can apply machine learning methods to these graph-structured data sets. Here I should also make a disclaimer: my overview of graph-based machine learning is quite narrow and biased.
Machine learning for graphs has a long history, and this is specifically also the history of kernels for graphs, which have been worked on a lot. So I just want to mention that my view here is quite biased toward recent methods that use neural-network-based machine learning for graph-structured data.

Let's go through a couple of examples of how we have used representation learning for graphs for different data sets in the biomedical domain. For instance, in drug discovery, chemical compounds, as I said before, can obviously be represented as graphs. The problem here is taking a set of graphs, or maybe a big database of chemical compounds, and mapping these graphs into a vector representation so that you can classify these graphs downstream, or solve a regression problem over them. The second type of problem we typically encounter is that we have, for instance, tabular data such as patient medical records. We induce a graph representation on top of this and then apply graph neural networks to learn node representations; in this particular case we learn node representations only and perform node classification. For example, we might be interested in predicting things like discharge destination, length of stay, and in-hospital mortality for the patients that are the nodes in this graph. Those are the typical target problems for patient outcome prediction, which is quite important in the medical domain. Similarly to the previous example, we might also have certain biomedical entities, in this particular case peptides, that is, neoantigens, together with particular measurements about them. We also want to represent these peptides as a graph and then learn vector representations for the peptides so that we can perform, in this particular case, a ranking of epitopes according to how likely they are to elicit an immune response in a cancer patient. And then finally, we might be interested in learning representations for nodes and edges. So we might have, as I mentioned before, a biomedical knowledge graph, where we have drugs and proteins, for instance, and different types of interactions between them, and we might be interested in predicting missing links in this graph. One particular type of missing link could be: are two drugs, if you take them together, more likely to cause a severe side effect that wouldn't show up if you took these drugs individually? So here the problem is the problem of link prediction in graphs.

This was just to give you an example of the different types of prediction problems you can address when you are given a graph. Of course, what I haven't really told you is how to actually do this. So what have people done here in the past and in recent years? There is a large body of literature; the area of graph neural networks in machine learning is growing tremendously, I would say there are probably hundreds of papers per month coming out. So I will now try, on a high level, to categorize and sort these different approaches a bit, and then show in two instances how we have used graph-based machine learning in a particular biomedical application.
The way that I like to start explaining graph-based machine learning, and specifically neural-network-based graph machine learning, is with the success story of convolutional networks for images. An image can be looked at as a grid graph. What the convolutional network does is move a small local kernel, typically three-by-three pixels, from left to right and top to bottom over this image, and read off that information; by stacking more and more layers, it starts off with this very local representation of the image and then builds more and more global feature representations, and this works extremely well. The nice thing about images, as I mentioned before, is that an image can be represented as a regular graph, a grid graph. So the big question for arbitrary graphs, where the notions of top-to-bottom and left-to-right really don't exist, is: what are good local structures that we can use as a substitute for this square-shaped grid that we use in convolutional neural networks for images? When you look at the literature, you can see three different types of local structure being used. One is triples. One is paths or random walks that are extracted from the graph and then used as the local structure on top of which we learn more global feature representations. And finally, k-hop neighborhoods of nodes, where we aggregate information from the neighborhood of a node to recompute or update the vector representation of the node itself. We'll go into two of those areas, and specifically into how we have used those types of graph-based machine learning methods in the biomedical domain.

This one here is the classical example of representing a knowledge graph as a three-dimensional tensor. You have your entities, like drugs and proteins in this particular case, and you have relationships between them, so you can represent a knowledge graph essentially as a set of triples: subject-relation-object triples. The way this can be represented, and this goes back to RESCAL from 2011, is essentially as a three-dimensional tensor where every element of the tensor is one if that particular triple is true. The typical approach in what's called tensor-factorization-based knowledge graph embedding methods is that you first choose a representation, an encoding, for the entities and relationships. In the particular case of RESCAL, which was the first method to do this, we choose vector representations for the entities and a matrix representation for each relation type. Then we choose a scoring function, and there are many out there now. The idea of a scoring function is that you combine the vector representations of the entities and, in this case, the matrix representation of the relationship into a score. And then finally you choose a loss function that makes the scores of the triples that you know to be true higher than the scores of those triples that you don't know anything about.
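To make the scoring-function idea concrete, here is a minimal sketch of a RESCAL-style bilinear score with a margin-based ranking loss. All sizes, initializations, and the example triple are illustrative placeholders, not the actual implementation discussed in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, dim = 1000, 12, 64

# Learnable parameters: one vector per entity, one matrix per relation type.
E = rng.normal(scale=0.1, size=(n_entities, dim))        # entity embeddings
W = rng.normal(scale=0.1, size=(n_relations, dim, dim))  # relation matrices

def score(s, r, o):
    """RESCAL-style bilinear score e_s^T W_r e_o; higher should mean
    the triple (s, r, o) is more plausible."""
    return E[s] @ W[r] @ E[o]

def margin_loss(pos, neg, margin=1.0):
    """Rank a known-true triple above a corrupted one by a margin."""
    return max(0.0, margin - score(*pos) + score(*neg))

# Example: a true triple vs. the same triple with a corrupted tail entity.
pos = (17, 3, 42)                            # e.g. (drug, targets, protein)
neg = (17, 3, int(rng.integers(n_entities)))
print(margin_loss(pos, neg))
```

In a real system the entity and relation parameters would be trained by minimizing this loss over many positive and corrupted triples.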
Then you train the model, and what you're doing implicitly is factorizing this three-dimensional tensor, which in this case represents a biomedical knowledge graph, into different parts. If you want to do relation prediction, you give the particular triple that you want to evaluate to the scoring function; intuitively, if the score is high, it's more likely that the triple is true than if the score is low. What's been happening in this area of work over the last ten years is that people have proposed more and more different types of scoring functions, starting with RESCAL in the beginning. There are now many, many more of these scoring functions, but the basic idea is always the same: we combine vector representations of entities and relationships into a score, and we train that score to be high for triples that we know to be true.

Okay, so given that we can now do link prediction in a knowledge graph, how can we actually use this, for instance, in drug discovery? Which combination therapies may result in severe adverse events? Which proteins are strongly associated with a particular disease? Which entities in different biomedical databases are actually the same? All of these questions can be answered using link prediction. The typical process that we follow in our company is to first create the knowledge graph from biomedical databases and other types of sources. This is a problem not to be underestimated; a lot of people focus on method development, but a really important step is to get a good, comprehensive, and clean knowledge graph representation of the particular domain you are interested in. One of the things that we specifically do is that we have developed a method that can extract and incorporate rules. So in addition to the kind of factorization-based method that I mentioned before, we also extract rules. We might also enrich the knowledge graph with additional modalities, such as sequence data or other types of data sets, for instance text. Then we train our graph machine learning model on the different data modalities and finally predict the missing relationships between the entities in the graph.

Just to go through one example: if we want to build a knowledge graph to predict polypharmacy side effects, we can go to different biomedical databases and collect information about the entities in this biomedical knowledge graph and about the relationships between them. We enrich it with additional features, for instance from the Gene Ontology or from other biomedical sources. And then we apply our rule-based method. The idea is that, given this large biomedical knowledge graph, you apply a rule mining method that results in particular rules that say something like: if drug A and drug C both target a particular protein B, for instance they both up-regulate that protein, then A and C may have a side effect together.
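As a toy illustration of the kind of mined rule just described, here is a hypothetical check over a small set of triples; the drug and protein names and the single-rule form are made up for illustration:

```python
# Toy knowledge graph as a set of (subject, relation, object) triples.
# All names are made up for illustration.
kg = {
    ("drugA", "upregulates", "proteinB"),
    ("drugC", "upregulates", "proteinB"),
    ("drugD", "downregulates", "proteinB"),
}

def shared_upregulated_target(drug1, drug2, kg):
    """Body of the example rule: do the two drugs up-regulate a
    common protein? If so, predict a possible polypharmacy side effect."""
    targets = lambda d: {o for s, r, o in kg if s == d and r == "upregulates"}
    return bool(targets(drug1) & targets(drug2))

print(shared_upregulated_target("drugA", "drugC", kg))  # True -> candidate pair
print(shared_upregulated_target("drugA", "drugD", kg))  # False
```

Mined rules of this kind are typically probabilistic, each carrying a confidence estimated from how often the rule body and head co-occur in the graph.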
These rules are then combined with the type of factorization-based method that I mentioned before, and with other modalities that might be associated with the nodes in the graph, into one joint machine learning model (a small code sketch of this combination follows below). The idea is that for different modalities, different feature types, you can choose different encoding functions: for molecules you might choose a graph neural network, since a chemical compound, which might obviously be associated with a drug node, is naturally a graph; you might also have sequence data associated, for instance, with the proteins in the graph. For each of those you choose an appropriate encoding function, and there are many out there now that you can pick and choose from. Then you aggregate the information from these different modalities, you concatenate or average, some sort of aggregation mechanism. And then you go back to what I presented before: you apply the scoring function to make the score of triples that you know to be true in your knowledge graph higher than the score of other triples.

We are doing this in many different contexts. For instance, we're doing this, as I mentioned before, for polypharmacy side effect prediction, where we are interested in understanding which drugs, if taken together, might cause severe side effects, and we show that our method does really well here and outperforms previous state-of-the-art methods. But we can also ask, as I mentioned before, questions such as which proteins are strongly associated with a particular disease. We can also compare with existing, more traditional network-based machine learning methods, and we can show that we do quite well on these prediction problems.

The nice thing, and this is something I should mention here, is that since we are extracting these rules, one of the nice side effects is that we can actually look at the model when it makes these predictions. When it tells us, for instance, "I think this protein is really associated with this disease," it also provides rules that we can inspect and take a look at, and that is something we found quite important. And what we did, especially for the polypharmacy problem, is that we could look at the rules and in some cases actually find evidence in the literature that a rule had found something meaningful: for instance, that a certain up-regulation happened that was not explicitly modeled in the knowledge graph, was essentially predicted, and could then be verified in the literature to be a correct prediction. So this explainability component of including rules in these factorization methods is something that we found to be quite important.

Okay, so this was the first type of machine learning model that can be used, and what we've been using it for.
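Here is the promised rough sketch of the multimodal combination: per-modality encoders feeding one node embedding that enters the triple score. The encoders below are random stand-ins (a real system would use a trained graph neural network for molecules and a trained sequence encoder for proteins); everything here is assumed, not NEC's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 32

def encode_molecule(graph):
    """Stand-in for a trained graph neural network over a compound."""
    return rng.normal(size=dim)

def encode_sequence(seq):
    """Stand-in for a trained sequence encoder (e.g. for proteins)."""
    return rng.normal(size=dim)

def node_embedding(node):
    """Aggregate whatever modalities this node has into one vector;
    averaging here, concatenation is another common choice."""
    parts = [node["kg_embedding"]]           # the learned KG embedding itself
    if "molecule" in node:
        parts.append(encode_molecule(node["molecule"]))
    if "sequence" in node:
        parts.append(encode_sequence(node["sequence"]))
    return np.mean(parts, axis=0)

drug = {"kg_embedding": rng.normal(size=dim), "molecule": "<compound graph>"}
protein = {"kg_embedding": rng.normal(size=dim), "sequence": "MKTAYIAK..."}
W_r = rng.normal(size=(dim, dim))            # relation matrix, as before
print(node_embedding(drug) @ W_r @ node_embedding(protein))  # triple score
```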
The second class of machine learning models for graphs is based on neighborhoods, where we aggregate neighborhood information into a new vector representation of the node that we are currently looking at. There is, of course, a huge number of graph neural networks by now, but one large class of them can be unified under what's called a message passing neural network. Here the idea is that we compute node vector representations recursively by aggregating neighborhood information. One of the things we choose is the depth of the message passing neural network, some particular depth k, for instance. Then, at each step, we look at a particular node i and its vector representation, and we compute it by first looking at the neighbors of node i, so all of the neighbors j and their vector representations from the previous step. We apply a learnable function h, that is, a function with learnable parameters, to each of them. We then aggregate the resulting vector representations of the neighboring nodes. This is a crucial step, because in a graph every node might have a different set, a different number of neighbors, and the structure might be completely different, so we have to choose an aggregation that is equivariant or invariant to, for instance, the number of neighbors. Finally, a function g, which is also learnable, combines the vector representation of our node i from the previous step k-1 with this newly aggregated representation of the neighborhood of node i (a small code sketch of this update follows below). Now all we need to do is specify labels for some of the nodes in the graph; then we can train the parameters of this graph neural network end to end and predict the class labels for the nodes where the class label is missing.

Because this might not be 100% intuitive just based on the formula, I personally like the insight that what essentially happens in a graph neural network is that, for every node in the graph, the neighborhood is unrolled into a directed acyclic computation graph that can then be used in a deep-learning-framework type of way. For instance, if we are looking at a particular two-layer graph neural network, and we want to look at the computation graph that's constructed for the yellow node that you see on the left here, what happens is that, because it's a two-layer graph neural network, you first look at all of the nodes that are two hops away. Those vector representations are aggregated and combined, and now we have new vector representations for the one-hop neighbors of the yellow node. Then, in the second step, we aggregate the representations of the one-hop neighbors into the new vector representation of our yellow node. At the top, when we are training, we apply a loss function and can then backpropagate through this computation graph. So this is, I think, a nice intuition of what's really happening in these graph neural networks.
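Here is the minimal sketch of that recursive update, with tanh projections standing in for the learnable functions h and g and a mean as the permutation-invariant aggregation; the graph, depths, dimensions, and initializations are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy graph as an adjacency list, with one feature vector per node.
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
dim = 16
H = rng.normal(size=(4, dim))                    # h_i^(0): initial features

W_h = rng.normal(scale=0.1, size=(dim, dim))     # learnable function h
W_g = rng.normal(scale=0.1, size=(2 * dim, dim)) # learnable function g

K = 2  # depth: each node ends up seeing its 2-hop neighborhood
for _ in range(K):
    H_new = np.empty_like(H)
    for i in range(len(H)):
        # Transform each neighbor's previous state, then aggregate with a
        # mean, which is invariant to neighbor order and count.
        agg = np.mean([np.tanh(H[j] @ W_h) for j in neighbors[i]], axis=0)
        # g combines the node's previous state with the aggregated neighborhood.
        H_new[i] = np.tanh(np.concatenate([H[i], agg]) @ W_g)
    H = H_new
# H now holds node representations; given labels on some nodes, W_h and W_g
# would be trained end to end for node classification.
```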
So now, how have we actually used this particular class of graph neural networks? One thing I should mention is that I'm a machine learning person, not an expert in the biomedical domain; that's why I'm working with other people who are experts here. But on a high level, what you do when you're trying to design cancer vaccines is try to understand which of the antigens in a cancer cell will elicit an immune response, because then, through, for instance, a vector-based vaccine, you could essentially teach the immune system to attack the cancer cells via this particular neoantigen. "Neo" here means that it's an antigen that develops within a particular mutated cell. The idea is that a neoantigen undergoes different processing steps, from transcription to cell-surface presentation and receptor binding; these are different stages, and for each of these stages we might be able to collect features that tell us something about this particular neoantigen. That's exactly what we do in our work: we collect this type of information; some of it is publicly available in many different biomedical databases and biomedical studies, and in other cases we generate it ourselves. So now we look at these epitopes; epitope, peptide, and neoantigen you can think of as one and the same thing in this context. The idea is that we build a graph representation of these neoantigens, these peptides, for which we have at least one piece of experimental evidence. And that's the crucial point here: these measurements are very expensive in many cases, and in many cases we only have very few data points for these peptides. The nice thing about this graph representation, and I'll go a bit more into the details in a second, is that we don't need every peptide to have the full set of possible measurements. We might have just one measurement for one particular peptide, and we can still include it in the data set. The graph between the peptides can be constructed in many different ways; one of the ways that we've tried is to use BLOSUM50, so applying some sort of sequence similarity measure between the peptides. And then we have what we like to call a multimodal peptide graph, where the nodes are the peptides, the neoantigens, and for each of them we have at least one of the measurements that I mentioned on the previous slide. Okay, and then we have a particular graph neural network that we really like to use, which we call embedding propagation; this is something that we published about three years ago at NeurIPS, and it's essentially an unsupervised version of the kind of message passing graph neural network that I explained on the previous slides. So how do we use this for predicting, or for ranking, neoantigens according to the likelihood that they will actually elicit an immune response? We start with our input data; again, these are the graphs and the different types of measurements, and you can think of it as: whenever you see a colorful circle there, it means that we have a measurement available. The cross means that data is missing, so this particular feature type is not available for this particular peptide.
So we can induce a similarity graph, an affinity graph. We've also been working in recent years on including the graph learning itself in an end-to-end pipeline, but in this particular case we would again do it based on, for instance, sequence similarity. What we then do is essentially this: we look at a particular node, a particular peptide; this is what happens when we're training this graph neural network. What is a bit special in our approach is that we learn a vector representation for the features that are available for the peptide, and we learn a separate representation for the missing values, for what is missing for this particular peptide. We then aggregate these two feature representations and apply a contrastive loss. The intuition is that we want the peptide's vector representation to be similar to the vector representations of the neighboring peptides. Now we can run this message passing scheme, and by running it we essentially propagate this information, and while we are propagating it we are actually performing an imputation in embedding space. So when we're finished training this model, what we end up with is, for each of the feature types that we might be interested in, an actual vector representation: before, we had missing values there, but now we have a vector representation for each of those feature types. What I should also mention, which is quite crucial, is that we still distinguish: we're not just lumping everything into one vector representation, we can still distinguish the different vector representations corresponding to the different feature types that we might have for a particular neoantigen. We don't have any missing values anymore; we have these different types of embeddings. And now we can just concatenate these embeddings and train a standard off-the-shelf, even white-box if that works, supervised model, and use that to make predictions, for instance of how well a particular peptide might elicit an immune response. So this is the core idea: we run this graph neural network, we exchange messages, we impute the vector representations, we learn vector representations for these different feature types, and in a second step we apply a supervised model. When we do this, and if we use something like logistic regression as the supervised model, we can actually see which of those embeddings contributed the most to the logistic regression model making a particular prediction. We can then track that back to the particular feature type, and even to the particular neighbors that were especially responsible for the imputation if the data was missing. So it provides a way of interpreting the behavior of this imputation method and of the way the classifier then actually works in the end. I should say that this is something, and I'll mention this in a second, that we have actually integrated into a larger bioinformatics pipeline.
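Before moving on, here is a heavily simplified sketch of the imputation-in-embedding-space idea just described. The actual embedding propagation method trains the representations with a contrastive loss between each node and its neighbors; this sketch only shows the propagation step and the separate handling of observed and missing feature types, with made-up sizes and a toy graph:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, n_types = 16, 3                     # 3 feature types per peptide (toy)

# Similarity graph over peptides, e.g. k-NN on BLOSUM50 sequence scores.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

# observed[i][t]: embedding of feature type t for peptide i, None if missing.
observed = [[rng.normal(size=dim), None,                 rng.normal(size=dim)],
            [None,                 rng.normal(size=dim), None],
            [rng.normal(size=dim), rng.normal(size=dim), None]]

# A separately learned placeholder embedding per feature type for missing values.
missing_emb = rng.normal(scale=0.1, size=(n_types, dim))

def propagate(observed, steps=2):
    """Exchange messages over the graph; missing slots are filled from the
    neighborhood, i.e. imputation happens in embedding space."""
    H = [[e if e is not None else missing_emb[t].copy()
          for t, e in enumerate(row)] for row in observed]
    for _ in range(steps):
        H_new = [[None] * n_types for _ in H]
        for i in range(len(H)):
            for t in range(n_types):
                nbr = np.mean([H[j][t] for j in neighbors[i]], axis=0)
                # Missing slot: take the neighborhood average outright;
                # observed slot: mix own state with the neighborhood.
                H_new[i][t] = nbr if observed[i][t] is None else 0.5 * (H[i][t] + nbr)
        H = H_new
    return H

H = propagate(observed)
# Downstream: concatenate the per-type embeddings (no missing values left)
# and train an off-the-shelf supervised model, e.g. logistic regression.
X = np.stack([np.concatenate(row) for row in H])
```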
The pipeline itself was built mostly by Brandon alone, who did the bulk of the work of putting all of these different pieces together into a real piece of bioinformatics software. One of the things that we then do is predict how likely a particular neoantigen is to produce an immune response. This is a pretty heterogeneous data set, as I mentioned before; it comes from different studies. Here we compare our embedding propagation framework to other methods such as gradient boosted trees or convolutional neural networks, which have been demonstrated to work well on similar problems, and we can show that for this type of prediction problem, so predicting or categorizing immunogenicity, our approach works quite well and is statistically significantly better than the other methods. You can think of the pipeline like this: we collect additional data modalities, which is really expensive, as I mentioned before; it includes things like whole-exome sequencing and RNA sequencing, usually from blood samples or a biopsy. And then, after treatment, we check what the T-cell response is. I mentioned before that we're doing this as a ranking, so why are we ranking peptides? The reason is that we give this ranking to a company which then creates viruses that carry the information of these peptides and hopefully stimulate the immune response in the patient. And of course you don't want to include all possible epitopes that you can find; you want to narrow this down to a smaller number of epitopes, typically in the range of 20 to 25. So that's why it's a ranking: we want to find those epitopes that are most likely to elicit a reasonable immune response. We evaluated this with patient data, and what's nice is that we could also show that we can find additional epitopes that would actually be missed if we just looked at methods based on, for instance, binding predictions, like NetMHC or something. So this is something that we're currently working on. I should also mention, and I think this is maybe something that doesn't happen so often, that this bioinformatics pipeline, with this graph-based machine learning at the core of it, was actually approved by the FDA and EMA and is currently used in a clinical trial; we're currently in phase one clinical trials. This was also published, and again, this is work where I want to plug the name of Brandon Malone, who put this pipeline together. So this is to show you that graph-based relational learning really can make a difference. I also wanted to mention that one of the areas where we apply this is gene regulatory networks, with gene expression data, where you usually have a high-p, small-n kind of situation, and where you probably also want to group the genes into clusters before you apply a machine learning method.
And here, something I also wanted to mention: this type of graph neural network approach, which exchanges messages between the nodes and thereby implicitly clusters them, has worked quite well.

Alright, and then finally, I'm probably finishing a bit early today. As I mentioned before: if the graph is given, that's great. But in many cases the graph isn't actually given. The graph might be noisy, or you might use a particular similarity measure to create an affinity graph, but you're not 100% sure that it's actually the best graph you can find. So one of the research projects that we are working on is to try to learn the graph structure at the same time as you learn the graph neural network. For instance, what you see on the slide here is a particular data set where the data points are situated on a manifold. What typically happens, as I mentioned before, is that you construct the graph in a first step: you induce the graph on top of the data using some similarity measure between the input features. And then, in a second step, you apply a graph neural network or some other method to perform node classification, to compute these vector representations of the nodes. One of the interesting things, I think, is to take both of these steps, which are usually performed separately, and combine them into one step. I think that's a really exciting area of research, which also touches on the notion of discrete-continuous learning: how can we actually do this, how can we induce and update and improve the structure of the graph, how can we discover structure in initially unstructured data, training an end-to-end pipeline that then, for instance through a graph neural network, performs a node classification task? So I think this is a really nice recent research area. Okay, yeah, and that's it from my side. Let me know if you have any questions about this work. Thank you very much.

Thank you, Mathias. This was a very clear and very fascinating talk, thank you very much for that. I'm sure there will be a lot of questions. Would anyone like to start from the network? Giovanni, please.

Hi, yes. First of all, thank you, I found it really interesting, a very fascinating talk. The question that I have is actually related to the top left of the slide, because I was just thinking that an application of this embedding propagation would be quite amazing for patient data, because often there is missing data on some of the features that we could have, and so on. So my question is: in the applications, for example the patient network on the top left, if we want to extend that model to new data, how do we introduce new nodes into the graph and then use the previous setup to make new predictions?

Yes, this is a really good question, and a question that also comes up a lot when we present this work to people who might be interested in using it.
So, do you have to retrain the network over and over again when you're adding new nodes to the graph? In the machine learning community, the terms often used here are transductive learning, where you assume that the nodes are already part of the graph, so all the patients are already there when you train your model, and inductive learning, where you train your model and then get new nodes, new patients in this case, and the question is how you update without retraining the entire model. There are ways to do this, and I should mention that this is something we actually evaluated in our paper on embedding propagation. Why is embedding propagation especially suitable here? Because when you add a new node, you also know its connections to the other nodes, that's the assumption. What you can do now is apply the aggregation function once: you apply the learned functions h and g that I mentioned before, these are usually projection functions, to the features of the new patient, you get a vector representation, and then, instead of learning a vector representation for the new node, you just take the aggregation of the vectors of the neighboring nodes and use that as the feature representation of the new patient. This is something that we tested empirically, and it turns out that it works extremely well. That said, this is for sure a shortcoming of this approach, because, as you said, this might work a few times, but at some point you have to retrain; you don't want to add 50% more nodes and not retrain your model. So this is for sure a shortcoming, and some people are thinking about how to do this kind of updating of the model in a more efficient way without retraining everything, so maybe just retraining locally. Yeah, it's a good question. Thank you.

Thank you, Giovanni. Mathias, Volker Tresp is next. 

Yes. Mathias, a very, very interesting talk, great results. Knowledge graphs are of course also intensely used in my team at Siemens and the LMU; I think it's a very nice general representation for data. One question I had is about the combination with rules. I'm not sure if I completely followed everything, but I think there's a lot of recent work where people apply some type of rule learning in the context of knowledge graph or triple prediction and get quite good results. My understanding, my feeling, is that the results are quite good, first of all, that's very important, but they get a lot of rules, not just one, maybe thousands or something, and then the combination of these rules gives you very good performance. Can you comment on that, and also on the interpretability of these rules?
The current status, I would say, is that purely rule-based methods are actually catching up at the moment. There have been a couple of recent publications where essentially just using rules, without embedding anything or using fancy machine learning, can do extremely well on some of the knowledge graphs; that's what you already mentioned. And typically what these methods do is they also have an internal ranking of the rules. These rules are usually probabilistic in nature, so they say something like: in 90% of the cases where the body of the rule, the left side, was true, the right side was also true. So you can internally rank how confident you are that particular rules hold, and then you can start by applying only the rules that you're extremely confident in. That is essentially what these methods do; their way of dealing with a lot of rules is to rank them internally and then only apply the top-ranked rules. So that's one side: they are catching up. On the other hand, you still do have knowledge graphs, and this is also what we observe, where embedding-based methods, the type of tensor factorization methods that also came out of your lab, still outperform rule-based methods. Our stance here is that we should try to combine the two. There are now a couple of proposals for this, and one way to do it is really simple: we use a probabilistic model, in our case a product of experts. We get a score from the rules, so the rules tell us "we believe this triple should be scored highly," and we get a score from the embedding method, and then we just combine these two probabilistically. We just say: if both of them give us a high score, then the overall score should be higher. That's something that works extremely well and is also interpretable, because you can then look and see, okay, these rules actually contributed to the prediction of the method.

So you can select some of these, let's say 1000 rules, and say that for this particular prediction only these five or so mattered?

Yes, exactly. It's specialized to a particular prediction, so you get the rules that were really important for that particular prediction.

Thank you.

Hi, thank you very much for the talk. When you were presenting the context at the very beginning, the first application that you showed was the classification of entire networks, where many networks are the input and then you want to classify the networks entirely. Could you elaborate a little bit on methods to do that? For example, are they mostly supervised or unsupervised, and do you know some names of such methods?

Yes, so let me actually go back to the slide just to provide a bit more visual context. So, this one here. The question, because I didn't really cover this in my talk, is: what do you do if you want to learn representations for entire graphs, like for instance chemical compounds?
One way to do this, based essentially on the graph neural network presentation that I gave in my talk, is that you first learn vector representations of the nodes in the graph, so to speak, and then you globally aggregate those vector representations into one vector representation for the entire graph. You can do this end to end. For instance, say you apply what I explained in my talk, this kind of message passing graph neural network. It ends up computing vector representations for the nodes of the graph, and then, instead of stopping there, you say: now I take the sum or the average of these vector representations, which gives me one vector representation for the entire graph. Then I apply some loss function; for instance, if I want to classify graphs, I apply a loss function that says this molecule is toxic, or whatever, and this one isn't. That is the most naive way, let's say, of using standard graph neural networks, which typically compute node representations, also for graph classification. But this is just one way, the one which builds on what I presented in my talk. There are other ways to do this; for instance, graph kernels are one prominent example, and in my opinion they are still not at all outperformed by graph neural networks. This is a very controversial topic, because oftentimes the benchmarks that are being used are, I think, not very meaningful. So graph kernels are another good choice. And then we've also worked on a couple of other methods, and of course, as always in machine learning, beyond the example I gave of averaging the vector representations from a graph neural network, more sophisticated ways of aggregating information from the node representations into a graph representation have also been proposed, like for instance a form of differentiable clustering, where you hierarchically cluster more and more and end up with one vector representation; this is what's called DiffPool. So there are many different ways of doing this, and the first thing I would try is graph kernels or a graph neural network with some simple aggregation function on the node vectors.

Okay, I will think about it. Thank you.

Yes, so there's one question on Slido and then Rima from the network, in this order. Stefan on Slido asks: what if not a node is added to the data but a new feature dimension, a modality? Is there an easy way to extend the graph without having to rebuild it?

Yeah, also a very good question, and I think here the answer at the moment is no, because what a graph neural network really does is learn a mapping from the feature space associated with the nodes to the embedding space. And if a particular feature wasn't part of the training at all, then it's not possible to just add it after the fact. If for one node you suddenly get a new value for an existing feature, that's possible, but if you add a completely new feature type that you didn't have before, then I think you really have to retrain.

Thank you for that. Now Rima is next, a question from inside the network. Hello.
Thank you, Mathias, for the nice talk, that was really amazing. I'm actually also working on problems applied to clinical data, and I have one question concerning the way you chose to build the vector representations. You mentioned you have one vector representation for the available data and another for the missing data, right? I was wondering, have you tried an architecture with only the available-data vector representation, and if you did, what performance did you get? And the second question is: how different is the vector representation of the available data from that of the missing data in your case?

Yeah, so the first question is whether we compared. As I mentioned in my talk, what we do is that we distinguish: we separately learn a vector representation for the available data for each node and one for the missing data. And the question was: what if you don't do this, if you just pretend everything is there, or use some sort of standard value for the missing entries? Did we compare? The answer is yes, we did, and we also compared to standard imputation methods, because there are of course standard ways to impute missing data, for instance imputing certain statistics of the feature values that are actually available. We compared to these other methods, and our approach worked significantly better. So separating the missing and the available features worked really well. And how do the representations differ? This is a bit of a difficult question to answer, but I would say that what really happens is that, because you have this contrastive loss between every node and its neighboring nodes, you are propagating information from the neighboring nodes, their missing data and also the data that's available, into this missing-feature representation. Essentially you have a flow of information from the neighboring nodes into the vector representation of the missing data, and that is really, I think, what makes the difference. But that is an intuition that I have; it's not something quantitative that I can give you here. Does that make sense?

All right, yeah. Thank you.

Thank you. Now, I would have a few questions, but yes. We have done a lot of research on graph kernels, as you mentioned, and there are some related concepts between these two fields. We made some empirical observations about limitations of graph kernels, and I wanted to ask you whether you experienced the same in graph convolutional networks. One thing is that the Weisfeiler-Lehman kernels also use this neighborhood aggregation scheme, and what we observed in practice is that the depth of that scheme, how often you repeat the aggregation, how deep you go when unrolling the graph, as you called it, is not very deep. In applications I haven't really seen examples where you have to go beyond level three; often level one or level two is already sufficient to get the best predictive performance. So my first question is: have you seen examples where one should go much deeper?
That is a very good question, because similar to what you mentioned about graph kernels, in graph neural networks, and graph convolutional networks specifically, usually the best choice is depth two. So when you do this aggregation step, this unrolling that I mentioned, you only go up to the two-hop neighborhood of each node. This typically works best; people have tried to go to depth three, four, five, and usually that doesn't work well, you see the performance actually degrade. But there is recent work that tries to look at this more from a methodological point of view: maybe we need, for instance, residual connections, like the residual shortcut connections people have used in deep neural networks for images; maybe we should try to use those in graph neural networks too, because maybe there is some sort of vanishing gradient issue going on. So there is a lot of work in this area. Personally, my point of view is that it really depends a lot on the type of graph. And this is the irony: the typical benchmark data sets that people use for graph neural networks, the citation networks Cora and CiteSeer, have what's called the smoothness property, meaning that things that should be classified in similar ways are actually close to each other in the graph. If you then apply a graph neural network, what happens is that you're making the vector embeddings of things that are close to each other in the graph similar. And because of this property of the graphs, that things with the same label are close in the graph, you do really well even if you're just using depth two or one. So I think what is really missing, and it's really a bit of a data issue, a benchmarking issue, is good benchmark graphs where distance in the graph is not necessarily indicative of having the same label. And I think this is actually something where the biomedical domain can maybe contribute: you might have a big network where the labels are completely different in different parts of the network, where even if two things are close to each other they might have different labels. That is what's actually missing. And if we see more of that, really different types of graphs with different properties, then I think going to further depth, having the ability to propagate from further apart over the graph, can actually be beneficial. So I think it's really a question of what type of graphs are used for these benchmarks.

Yes, I agree. I think these cases can exist; it's just that their relative frequency, compared to the ones where neighborhood implies the same label, is so much lower in the applications that I'm aware of that it's not so easy to find them, or to find them at scale in applications. Thank you very much; I agree with what you said. Another aspect you mentioned, here on this slide 30 but also in your outlook on slide 37, is this construction of the similarity graph.
I've also looked at a lot of applications of this in bioinformatics, and I came to very similar observations as you did. Very often I asked myself, and I also ran experiments checking: what if you skip the graph altogether? If you just take your input data and compute a similarity matrix on it, or a kernel matrix or whatever you want to call it, and learn directly on that, you get rid of the intermediate step of defining a graph on it. I have seen many examples where the intermediate graph is not really necessary; you could just learn on the full similarity matrix as well. So in light of this, I found it very interesting that in your outlook you make the same observation but come to a different conclusion, maybe, that one should learn the intermediate graph end to end.

This is a really great question, and it actually opens the door to a lot of related topics in discrete-continuous learning. The question, for me, boils down to: why do you need to create a discrete and sparse structure to begin with? Why not keep it dense and have all of the weights there, even if they're tiny, so that it's essentially a dense, fully connected graph, and then just train this? Then we don't even have to worry about finding the right graph structure. There are a couple of possible rebuttals to this. One, and this is maybe a bit counterintuitive, is efficiency: when you introduce a sparse structure and use sparse matrix multiplication, you can actually compute this much, much faster than with a dense matrix. So, surprisingly, if you learn sparsity and leverage it meaningfully, you are more efficient. The second answer, and this is something that we actually addressed in one of our papers, is that in some cases the accuracy is actually better if you learn a sparse representation. And interestingly, we just discussed this in our group recently, there is now an increasing number of papers, even in reinforcement learning and other areas, where the latent code is made discrete; there is something about the compression that happens when you go to a discrete space that can potentially be superior to a continuous space where you just have larger and smaller weights. And the third answer, which also makes a lot of sense to me, is interpretability. For instance, I mentioned gene regulatory networks: you could treat the measurements, the gene expression data, as one big matrix, not care whether there is a graph there, and just use it somehow downstream to solve a particular problem. But if you actually do this learning, if you induce a particular structure on top of the genes, for instance, in this case, you might first of all be able to include domain knowledge that you might have.
Right, so you might have particular inductive biases that you can build in, and you can look at the resulting structures and see: okay, here my algorithm decided that there should be an edge between these two genes, and you can take a look at it. That's maybe a bit more interpretable than saying "my method assigned a weight of five to this pair of genes and a weight of 0.1 to this one." So I think there are several potential answers to this, and again, as always in machine learning, in the end it boils down to what you need in your application, what the data looks like, and what works. But I agree with you: the answer shouldn't always be that you should learn a graph, or that you should have a graph. Sometimes it also works perfectly fine to not even care about the graph and just treat the data as a dense matrix.

Very good answer. Very good answer, and a very inspiring talk and introduction to this field. So, many thanks on behalf of the entire network and the YouTube audience for your presentation, Mathias; it was a pleasure to have you here.

Thanks a lot. It was fun. And thanks for the invitation again.

And we send a round of virtual applause to you.