Today, I'm very pleased to have Bastian Rieck from the Helmholtz Center in Munich. Bastian is a principal investigator there in the Institute of AI for Health. His main research interests are in algorithms for graph classification and time series analysis, with a focus on biomedical applications and healthcare topics. Bastian is also enticed by finding new ways to explain neural networks using concepts from algebraic and differential topology. He is a big proponent of scientific outreach and enjoys blogging about his research, academia, supervision, and software development. Bastian received his master's degree in mathematics as well as his PhD in computer science from Heidelberg University in Germany. It's my great pleasure, and I also know him from other collaborations, to have him here today to talk to us about topological graph neural networks. So Bastian, please take it away.

Thank you very much for the kind introduction, and also thanks for this cool invitation. I'm really looking forward to this, and I want to keep this talk as interactive as possible, under the proviso that the network connection plays well with us. If it doesn't, I'll obviously try to rejoin. But if at any point you have a question or a comment or something that I should clarify, just unmute yourself and let me know, because that way we can keep it a little bit interactive. Of course, I'm also happy to take your questions later on after the talk, and I'll stay a little bit longer for a nice discussion, if that's of interest.

Right. So I'll be talking about topological graph neural networks. This is one specific project that we've been working on for a while now. I don't have to tell you in this reading group what graph classification is, but I just want to set the scene a little bit. The goal of graph classification is, given a graph, to tell us something about a potential label set. For instance, the graph could be a molecule that is flammable or water-soluble or whatnot. This is in stark contrast, as you might know, to node classification, where the goal is to say something about individual nodes in the graph; we're not dealing with that for now, we're dealing with this on the graph level. Now, when we represent graphs in machine learning, we run into the issue of their cardinality, meaning that two graphs G and G' can have a different number of vertices. So in general, we require a vectorized representation f mapping from some space of graphs to some high-dimensional space. And this representation, I need to stress this, needs to be permutation invariant: you need to be able to permute the vertices or the edge indices and still get the same high-dimensional representation out of it, because your graph should be invariant to the order in which you provide its constituent parts.

Now, just for some background, because it will come in handy later on: I don't know how many of you have heard of the WL test for graph isomorphism, or how familiar you are with it in general. It's a relatively ancient method, going back to the 60s and to seminal work by Weisfeiler and Leman, meant to help you tell some graphs apart. Originally, it was intended as a graph isomorphism test, but this turned out not to be valid.
However, it's still kind of a nice precursor to modern graph neural network techniques, as we will see later on. Fundamentally, and really at a very high level, the WL test is based on a very simple iterative scheme: you start by creating a color for each node in the graph. This color could, for instance, be based on its label or its degree or some other property that you have in mind. You then collect the colors of adjacent nodes in a multiset, compress the colors in the multiset together with the node's own color to form a new color, and continue this relabeling scheme until the color sequence is stable, that is, until there are no changes anymore and it has reached convergence. This compression is, of course, some kind of rudimentary hashing scheme; you use a perfect hashing scheme to ensure that the same neighborhoods and the same color sets are always mapped to the same representations. The central result of this process is that if the compressed label sequences of two graphs diverge at any point, then the graphs are guaranteed to be non-isomorphic. But as with many of these invariance claims, the other direction is not valid: non-isomorphic graphs can still give rise to coinciding compressed labels.

I don't want to play around with this for too long, because some of you might already guess that this is heading towards message passing in graph neural networks, but just so that we're all on the same page: for this very simple graph example, if you take a look at the vertex labeled three, you will see that it has a neighborhood that is different from all the other neighborhoods. When you rehash it, it gets a label that is reminiscent of this new neighborhood, whereas the other vertices, such as v1, v2, and v7, which have the same label and the same neighborhood, all get hashed to the same new label. When you replace the labels in this fashion, you find that you gain much more fine-grained information than before: you start out with a binary label set, and now you already have more fine-grained information about the graph. If you want to do some classification with this, you can now take the labels and their counts, put them in a feature vector, and use your favorite graph kernel or high-dimensional similarity measure, whatever you had in mind, to compare those feature vectors.

So that's all nice, and it works, and it's a really excellent baseline to this day. It also has some beneficial properties, so people keep using it for various reasons. Here are some of them. One is that the calculation is very efficient for small values of h: if you don't go very deeply into the graph, it's still relatively efficient, because you just have to collect labels. It also has really good empirical performance, so it's actually still pretty good at telling graphs apart, at least if those graphs have discrete label sets; with continuous labels it's a little bit different. But extensions for continuous features as well as topological variants exist, and here are some pointers to that from my own previous work. The big elephant in the room, though, which we as machine learners want to surpass as quickly as possible, is that it's still a static aggregation over neighbors: you treat all the neighbors the same, and you just aggregate all the information of the graph indiscriminately.
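To make the relabeling scheme above concrete, here is a minimal sketch of 1-WL color refinement in Python. The graph representation (an adjacency dictionary), the function names, and the use of Python's built-in hash as a stand-in for a perfect hashing scheme are my own illustrative choices, not anything from the talk.

```python
from collections import Counter

def wl_refine(adj, labels, num_iterations=3):
    """1-WL color refinement.

    adj:    {node: list of neighbours}
    labels: {node: initial color}, e.g. a discrete label or the degree
    """
    colors = dict(labels)
    for _ in range(num_iterations):
        new_colors = {}
        for v in adj:
            # Collect the colors of adjacent nodes in a multiset and compress
            # them, together with v's own color, into a new color.
            neighbourhood = tuple(sorted(colors[u] for u in adj[v]))
            new_colors[v] = hash((colors[v], neighbourhood))  # stand-in for a perfect hash
        # The coloring is stable once the number of color classes stops growing.
        if len(set(new_colors.values())) == len(set(colors.values())):
            break
        colors = new_colors
    return colors

def wl_histogram(adj, labels, num_iterations=3):
    """Feature vector for comparing graphs: counts of the final colors."""
    return Counter(wl_refine(adj, labels, num_iterations).values())
```

Comparing the color histograms of two graphs then gives the test described above: diverging histograms certify non-isomorphism, while identical histograms prove nothing.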
So this has to be surpassed somehow, and of course, this being the right reading group for it, we all know that message passing on graphs was meant to do exactly that. Here we start with some node, say the one in the middle, and we look at its neighbors. We start collating their high-dimensional attribute information and aggregate it somehow using an aggregation function, which can be a simple function or something more complex. That's the building block of modern graph neural networks, I would say. You can repeat this process multiple times, of course, and later on combine these high-dimensional representations using concatenation. Then you can use a readout function, which could for instance perform graph-level pooling, in order to obtain a graph-level representation. This representation can subsequently be used for classification, for regression, or for whatever task you had in mind. So that's the overall building block of a graph neural network architecture. If you're familiar with the paper by Xu et al., you will find this slide very familiar. All GNN architectures in a nutshell, or most of them, are based on some variation of this common theme: they all learn hidden node representations h_v based on aggregated attributes. Those aggregated attributes are obtained from neighborhoods, and the more iterations you have, the deeper you go into the graph, so you have some kind of information propagation or information diffusion process, if you will. If you repeat this process K times, for a big K, you can collate all the hidden representations that you obtain at the end and put them into a readout function to obtain your overall high-dimensional feature vector of the graph.

So far so good; I'm not telling you a lot of new things here, I'm guessing. But here's the status quo: graphs are fundamentally topological objects, they have some kind of connectivity. At the same time, GNNs, despite their really excellent empirical performance, are somewhat incapable of recognizing certain topological structures. Either they need a big receptive field, so lots of layers, in order to do so, or they cannot detect them at all. We will see some examples of this later on. So a natural research question that drove this project was: what can we gain by imbuing GNNs with knowledge about the topology of the graph?

This is where we need to take a little bit of a step back and zoom out, and I'm now giving you a little topological introduction so that we're all on the same page when it comes to some of the things I will say later on. Given a graph like this one here, what are topological features of the graph? Well, on a rudimentary, fundamental level, there are two types of features that I want to talk about. The first one is denoted by β0: the number of connected components. In this case, this is pretty easy: this graph is connected, it's not fully connected, but it's connected, so every vertex can be reached from every other vertex, and it has exactly one connected component. The second feature is denoted by β1, and this is the number of cycles in the graph. Most of us would probably agree that this graph has two cycles, and if we were forced to point towards them, we would say that those are the cycles here. And that is also true.
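As a minimal sketch of what these two counts are, assuming networkx is available: β0 is the number of connected components, and β1, the number of independent cycles, follows for graphs from the relation β1 = |E| - |V| + β0. The example graph below is my own stand-in with one component and two cycles, in the spirit of the graph on the slide.

```python
import networkx as nx

def betti_numbers(G):
    """Return (beta_0, beta_1) of an undirected graph G."""
    beta_0 = nx.number_connected_components(G)                     # connected components
    beta_1 = G.number_of_edges() - G.number_of_nodes() + beta_0    # independent cycles
    return beta_0, beta_1

# A small stand-in graph: one connected component, two cycles.
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)])
print(betti_numbers(G))  # (1, 2)
```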
However, in some, let's say, deeper topological sense, you could also take one big cycle that encompasses all these nodes here, and just one of the smaller cycles. It would still be two independent cycles, just represented a little bit differently. So what we are talking about later on is more the existence of a cycle, and not so much its localization. I want to make this very clear: we're talking about counts much more than we are talking about where exactly things happen in the graph, because already having access to the counts can be a game changer for certain applications.

So how can we capture these topological features? Now you have to make a little bit of a leap of faith with me, because due to time I won't be able to explain all of this in great detail, but I'm happy to answer more questions about it, or feel free to reach out to me over any medium that you prefer. The main idea for capturing topological features uses a mechanism, or a framework, called persistent homology. The intuition behind it is this: suppose you're not just working at the level of an individual graph, but I give you something more, namely a way to look at the graph as it's being built. I tell you something along the lines of: this graph has n vertices and m edges, and they are all inserted in this or that order. And while you insert edges and vertices into the graph, you can monitor how the topology of the graph, the number of components and the number of cycles, changes during this process.

So let's start very simply and just add an edge. This edge connects two vertices, so it reduces the number of connected components by one: previously you had two disconnected vertices, now they are connected by an edge, so one connected component has been removed. The same happens when we add more and more of these edges; let's add this one as well. And note that on the right-hand side, while we do this, we are filling up a descriptor which will turn out to play an interesting role later on. This descriptor is known as the persistence diagram. It contains the topological features, the number of connected components, for a given threshold, or for a given step in this insertion process. So whenever something happens on the topological level, we just add a point there, and the coordinates of that point are determined by the ordering of the insertion process.

Now notice what happens when we add this edge here: something different happens. We don't merge two connected components, because the edge that we added connects vertices that are already in the same connected component; instead, we create a cycle. To denote this in the diagram, we just add a vertical line, because we know that the cycle has been created at this point, but we don't know what will happen with it later on. Having said that, let's just move through the remaining edges and see what happens. As you would imagine, we destroy some additional connected components, and so on and so forth. And with the last edge that we add, we create a second cycle. Now we're left in somewhat of a pickle, because we have these two cycles that have been created, but we don't really know what is going on with them.
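A minimal sketch of the bookkeeping just described, under the simplifying assumption that all vertices are present from the start (so every component is born at step 0) and the edges arrive in a prescribed order: a union-find structure records which edges merge components (finite points in the dimension-0 diagram) and which edges create cycles (the "vertical lines"), and whatever never dies is added at the end as an essential feature. This is my own toy version, not the implementation from the paper.

```python
def persistence_0d(num_vertices, edges):
    """edges: list of (u, v) pairs in insertion order.

    Returns (component_pairs, cycle_births):
      component_pairs: (birth_step, death_step) for every connected component,
                       with death = inf for components that never disappear
      cycle_births:    insertion steps at which a cycle is created (its death stays open)
    """
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    component_pairs, cycle_births = [], []
    for step, (u, v) in enumerate(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            cycle_births.append(step)        # edge closes a cycle
        else:
            parent[ru] = rv                  # edge merges two components; one of them dies here
            component_pairs.append((0, step))
    # Components that survive the whole process are essential features.
    surviving = num_vertices - len(component_pairs)
    component_pairs += [(0, float("inf"))] * surviving
    return component_pairs, cycle_births

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]
pairs, cycles = persistence_0d(6, edges)
print(pairs)   # five finite merge steps plus one essential component
print(cycles)  # two cycle-creating edges
```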
And we also have one other thing in this process that I haven't really discussed yet: if you count the number of red dots on this side and the number of vertices on the left-hand side, you will find that one vertex is missing. That is because we can't actually say anything about the last component that the graph has; we can't count the destruction of this connected component, because there is no other connected component to merge it into. So what we do at the end, to commemorate this, is to add the last connected component, bringing the total up to the right number, and we also add the two cycles at a very high level in the diagram, to mark them as being part of this graph. That's the really rough overview of what persistent homology does. It can do so much more in practice, and I'm really happy to discuss it in much more detail with you, but for all intents and purposes of this talk, it's sufficient to think of persistent homology as a specific way to look at multi-scale topological features. Instead of just saying "this is a graph with one connected component and two cycles", you take a look at how these connected components and these cycles are forming the graph together.

All right. One critical choice in this whole process, as you might have guessed already, is the choice of filtration, that is, the choice of ordering of the edges and vertices. I can give you a graph here and you will find that different orderings lead to radically different filtrations: you can have a filtration like this, which basically adds everything at the same time, or a filtration like this, which builds the graph in a much more leisurely manner. And this is a critical insight for using persistent homology, or in general for looking at the graph at multiple scales: if your ordering determines what scales you get, then you had better pick a good ordering. And ideally, and this is every machine learner's dream of course, if you have to pick something manually, why not make it machine-learnable?

That's what we did in a precursor to the work that I'm going to talk about, and I'm mentioning it here so that you can see how things are connected. The precursor is, fittingly, named graph filtration learning, because it's about learning graph filtrations. What we did there is initialize a certain function on our graph using a graph neural network, in this case a GIN. Essentially, we had a single GIN-ε layer with one level of message passing, some hidden dimensionality, et cetera, resulting in a single set of function values on the nodes of our graphs. We could then use those function values to sort the edges accordingly; or rather, we sort the vertices, and the edges are sorted by extension. Once we have this, we can create what we called a readout function based on persistent homology, a readout function based on the learned filtration, with the possibility to learn a task-specific representation.
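To connect this to the persistence computation above, here is a sketch of how learned vertex values can induce the edge ordering, under the usual sublevel-set convention that an edge appears once both of its endpoints are present, that is, at the larger of its two endpoint values. The vertex values below are made up for illustration; in the precursor work they would come from the GIN layer.

```python
def edge_order_from_vertex_values(f, edges):
    """f: {vertex: filtration value}. Sort edges by the larger endpoint value,
    i.e. by the point at which the edge becomes fully present in the filtration."""
    return sorted(edges, key=lambda e: max(f[e[0]], f[e[1]]))

# Made-up vertex values standing in for the output of a learned filtration function.
f = {0: 0.1, 1: 0.5, 2: 0.3, 3: 0.9, 4: 0.2, 5: 0.7}
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]
ordered = edge_order_from_vertex_values(f, edges)
pairs, cycles = persistence_0d(6, ordered)   # helper from the previous sketch
```

In the learnable setting, the diagram coordinates would be the function values themselves rather than insertion steps, which is what makes the construction amenable to gradient-based training of the filtration.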
The overall idea was to take the initial values on the vertices of the graph, calculate a set of these persistence diagrams with the persistent homology framework that I showed you before, do some neat projection tricks into a neural network, and have a loss term so that the weights and the function values on the vertices can be adjusted. Now, without going too much into the technical details, one thing shouldn't be too surprising: the gradient from the loss function to the projection function, unless you pick a very strange projection function, basically always exists, right? Because this is how we create our functions and our architectures. What was a little bit more surprising in this paper is that we were also able to show that the gradient through the persistence computation exists under certain conditions. So we were able to propagate gradients back to the weights of this filtration, of this ordering process, in order to create a better ordering for a specific task that we had in mind, for instance graph classification later on. So that's the precursor work; just keep in mind that we can learn one filtration. The exact details are not super important, but I wanted to tell you that this psi function, this projection function that we have in this work, is really very simple: it just projects a point from a two-dimensional space into a one-dimensional space by virtue of having two trainable parameters. We can, of course, stack multiple copies of this and learn individual parameters, and thus we get a way to project one topological descriptor into some higher-dimensional space, and everything works because of backpropagation and differentiability.

Now, let's move on to the work that I actually wanted to talk about; this is what I'm here for, this is what you're looking for. This is joint work with Max, Edward, Michael, Yves, and Karsten. Max is currently still writing up his thesis; Edward should probably write up his thesis at some point, and if he's listening to this video, I'm publicly going on record with this; and Michael just submitted his thesis on Monday, which is spectacular. I'm really happy to see all those awesome people graduate. Yves and Karsten are already professors, so their careers are great and there's nothing to worry about there. Anyway, in this paper, which is still under review, we introduce a topological layer for graph classification. I apologize in advance, because I will hammer in a few points while we go over the nitty-gritty details. One of the points that I really want to stress is that the cool thing about this layer is that it's a plug-and-play approach, because we made sure that the input of the layer is just some d-dimensional vector, and the output of the layer is also a d-dimensional vector. I know that this sounds super trivial, and of course it can be achieved, but it turns out to be crucial for making the whole thing applicable: what we want in the end is a simple layer that we can take and use to replace an existing layer in a graph neural network, and suddenly make this graph neural network aware of topological structures, of topological features in the data set.
And we can only achieve that if we ensure that the layer is compatible with existing representations and that you don't have to jump through too many hoops to make it work. Now, how do we manage to do this? It's really very simple, at least when I show this diagram: we start with the node attributes, which can come from some GNN, but they can also just come from the original graph, whatever you prefer. We use a differentiable node map phi that maps these d-dimensional node attributes to a k-dimensional vector space, and this k-dimensional space can be thought of as k different learnable filtrations of the graph. So instead of learning a single filtration that is initialized by some GNN, after which the GNN is ignored for the remainder of the training, we now learn filtrations in an end-to-end manner, because we can inform downstream layers and we can in turn be informed by upstream layers. That's a really cool thing to have, and a really cool direction, I think. The filtrations that we learn, and I should stress this in case you're wondering, are not learned with any kind of, let's say, relationships between individual filtrations; they are learned independently, but they provide individual perspectives on the graph. Now, each of these filtrations results in a potentially different set of persistence diagrams, as shown here, and we can use the aforementioned projection function psi to project all of them into some convenient space, which gives a representation psi(v). We can concatenate this representation with the original input data in order to obtain a new output. Of course, in this concatenation, you might have to make sure that you go down again to the proper dimensionality, but those are technical details. The bottom line, and this is where I have to hammer it in again, is that we obtain an output vector of the same dimensionality as the input vector. And that's essentially all there is to it at a very high level: this architecture gives you a plug-and-play way to make your network topology-aware, or at least that's our main claim.

Now, let me show you why this is true in the theoretical sense, and then let's see whether the theory is our friend when we actually do it in practice. On the theory side, the paper by Xu et al. also shows that your typical run-of-the-mill GNN architecture is not more expressive than the WL test for graph isomorphism, commonly abbreviated as 1-WL. As a small caveat, the paper that I'm citing here actually discusses the discrete-label case; when you have continuous node features, it's a little bit different, and maybe we can have a discussion about this afterwards. But the WL test is still used as a kind of bound, or a kind of proxy, for the expressivity of a graph neural network, and this is also what we did. We have a theorem in the paper where we show that persistent homology, so this whole ordering process, is at least as expressive as 1-WL. It is, however, an existence theorem.
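To make the plug-and-play claim above concrete, here is a toy, shape-level sketch in PyTorch of a layer that maps (n, d) node features to (n, d) outputs via k learned vertex filtrations. This is not the authors' implementation: it skips essential features, pools each persistence diagram into a single graph-level summary that is broadcast back to every node (the actual layer maps dimension-0 points back to individual vertices), and treats the sorting and pairing as fixed within a forward pass, with gradients flowing only through the birth and death coordinates. All class and helper names are my own.

```python
import torch
import torch.nn as nn

class ToyTopoLayer(nn.Module):
    """Toy plug-and-play sketch: d-dimensional node features in, d-dimensional out."""

    def __init__(self, d, k=4, hidden=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, k))  # k vertex filtrations
        self.psi = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, d))  # embeds (birth, death) points
        self.out = nn.Linear(2 * d, d)                                                    # back down to d dimensions

    def forward(self, x, edges):
        # x: (n, d) node features; edges: list of (u, v) tuples of an undirected graph
        n = x.shape[0]
        f = self.phi(x)                                    # (n, k) filtration values
        summaries = []
        for i in range(f.shape[1]):
            fi = f[:, i]
            order = sorted(edges, key=lambda e: float(torch.max(fi[e[0]], fi[e[1]])))
            parent = list(range(n))

            def find(a):
                while parent[a] != a:
                    parent[a] = parent[parent[a]]
                    a = parent[a]
                return a

            birth = [fi[v] for v in range(n)]              # birth value of the component rooted at v
            births, deaths = [], []
            for (u, v) in order:
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue                               # cycle-creating edge; ignored in this dim-0 toy
                w = torch.max(fi[u], fi[v])                # filtration value of the merging edge
                # Elder rule: the component with the larger (younger) birth value dies at w.
                young, old = (ru, rv) if float(birth[ru]) > float(birth[rv]) else (rv, ru)
                parent[young] = old
                births.append(birth[young])
                deaths.append(w)
            if births:
                diagram = torch.stack([torch.stack(births), torch.stack(deaths)], dim=-1)  # (p, 2)
                summaries.append(self.psi(diagram).sum(dim=0))                              # pool the diagram
            else:
                summaries.append(x.new_zeros(x.shape[1]))
        topo = torch.stack(summaries).mean(dim=0)          # (d,) topological summary over all k filtrations
        topo = topo.unsqueeze(0).expand(n, -1)             # broadcast to every node (simplification)
        return self.out(torch.cat([x, topo], dim=-1))      # (n, d): same shape as the input

# Usage with made-up data: 6 nodes, 8-dimensional features, the toy graph from before.
x = torch.randn(6, 8)
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)]
layer = ToyTopoLayer(d=8)
print(layer(x, edges).shape)   # torch.Size([6, 8]), same shape as the input
```

The point of the sketch is only the shape contract: whatever happens inside, a d-dimensional vector per node goes in and a d-dimensional vector per node comes out, so the layer can stand in for an existing one.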
Being an existence theorem, it does not necessarily show that we can actually learn such a function, but we can prove that if the WL label sequences for two graphs G and G' diverge at any point, then there exists an injective filtration function, some ordering of the nodes, such that the corresponding persistence diagrams are not equal. I'm not going too much into the details here, but the bottom line is that we can construct an appropriate filtration function from the 1-WL label sequence, and then we have to apply some mathematical tricks to show that it is actually injective. Why do we need injectivity? Well, because one of the ingredients of our new layer requires that the gradients are uniquely defined, and if you don't have injectivity at the vertices of the graph, so if two vertices are assigned the same function value, then things start breaking down. It could still work; it's just mathematically more elegant if you enforce it like this.

Now, this is kind of the lower bound, but let's also see whether we can do more, and this is my favorite example, actually, because it's so simple yet so powerful. It turns out that there are non-isomorphic graphs that the 1-WL test cannot distinguish, but persistent homology can distinguish them, and quite trivially so. On the left-hand side, you have two triangles, and on the right-hand side, you have a hexagon. If you initialize these graphs with the degree of each node, you will see that the WL test always results in the same counts of labels. So the WL test cannot tell these two graphs apart; it doesn't really know whether it's looking at the left graph with two cycles or the right graph with a single cycle. But persistent homology, or rather topology-based algorithms, can tell these graphs apart pretty quickly, because they can just count the number of connected components, which is two on the left-hand side and one on the right-hand side, or they can count the number of cycles, which is also two on the left-hand side and one on the right-hand side. However you twist and turn it, topology-based algorithms are easily capable of telling those graphs apart. Together with the preceding theorem, this means that we actually end up in a regime that is more expressive than the 1-WL scheme. So that's the theory side.

All right, let's see how it fares in practice. Our experimental section in this preprint is pretty cool, I would say, even if I say so myself, because we take an existing GNN architecture, we replace one of its layers by our layer, and then we measure or assess predictive performance. We do this in such a way that the network sizes are roughly comparable to each other; essentially, the new layer doesn't add too many additional parameters, so it's a fair comparison. We're not just increasing the size of the parameter set by some orders of magnitude and then saying, oh, now we're a little bit better. We really want to show that this swapping procedure already provides benefits, because the new layer can help the network pick up an additional topology-based signal that it was unable to pick up before. Now, the experimental setup that I'm going to talk about has three distinct components. The first one is pretty simple and what you would expect.
So we assess expressivity on synthetic data sets, which we picked ourselves, so we can pick our battles here. Then we assess the predictive performance on data sets without node features, to show whether an approach that can only leverage topological structures does better. And third, we do the usual predictive performance assessment on benchmark data sets, at least on small ones; we'll see about that in a moment.

To battle, then: the synthetic data sets that we've been looking at pose a binary classification problem. We generate the same number of graphs for each of the classes, and we generated the data sets such that they have simple topological structures that are nevertheless challenging to detect with standard GNNs. One data set is our old friend, the cycles data set, where we either have one big cycle or several small cycles. The other one is our necklaces data set, where we either have a closed necklace with one big loop, or a smaller necklace with smaller loops.

Now, let's take a look at what happens when you run WL on the cycles data set. When you initialize it with the degrees of the nodes, you can probably guess what will happen: regardless of how many iterations you do, you will not end up learning anything. The counts are still the same, the histograms are still the same between the two classes, and you're unable to tell those graphs apart with the WL test and node degrees as the labels. How does that look in practice, and does it really work or does it really fail? You can see the expressivity, the test accuracy, on the y-axis. If we look at the WL test, the green line, you can see that it pretty much stays at 50%; good, that works as expected. We also have a violet-ish line, which is standard persistent homology without any additional tricks and without any learnable representations. That one is, of course, pretty good here, because it's literally designed for that purpose; it also works. And now the actual comparison that I want you to make: let's compare a standard GCN architecture with k layers and a GCN architecture with one TOGL layer, one topological graph layer. You can see that with only one or two layers, the plain GCN shows a substantial gap in predictive accuracy, whereas with topology-based graph learning, you already end up being able to perfectly tell these graphs apart. I mean, that's what you would expect, right? It's a very simple topology-based structure, so this should work; this is more of a sanity check.

Now, for the necklaces data set, you would expect something similar. Indeed, and I should mention, by the way, that these are Max's drawings, which he did for the dissertation he is currently writing: you can see that if you apply the WL test to some of the necklaces, they also end up having the same histograms between the two classes, so you wouldn't expect a very high accuracy from WL there either. I should stress that some of the generated necklaces actually lead to different WL representations, so it's not a data set on which WL completely fails, but it fails in most cases, and we can see this on this slide: regardless of how many iterations of WL you do, you don't end up getting a very high accuracy.
Similarly, the purplish line, the violet-ish line, whatever you want to call it, of a standard persistence-based approach without a trainable filtration is also pretty bad; it helps in most of the cases, but not in all of them, and it really doesn't matter what you do here. You can also see that if you start with a regular GCN, you need quite a lot of layers in order to approach good performance, whereas if you start with our trainable representations, you end up having very good performance from the very beginning. So this is also very nice, and a very nice sanity check, which shows that these sorts of things work as expected, or as anticipated.

Now let's move on to the next experiment, going more towards the real stuff. What happens if we take some existing data sets, remove the node attributes, and replace them with random node features? That way, we ensure that the node features don't leak information about the topology into the graph itself. Suppose you're given a data set with some molecules or some proteins: knowing the identities of some of the constituents of a molecule can give you a hint as to what type of topological structure to expect; you know that carbon atoms tend to form cycles and things like that. The safest way to ensure that this doesn't cause any problems is to remove the node attributes altogether and replace them with random ones. In that sense, any improvement that we see when we swap out an existing layer for our layer must be caused by having access to topological information and nothing else. And this is indeed what is happening. You can see that in all three of these comparisons, or configurations, whenever we swap out a layer for a topology-based one, the predictive performance goes up, in some cases quite considerably so. For instance, on the DD data set we gain more than 10 percentage points on average, and here on MNIST as well. For PATTERN, I'm not sure what is going on, don't ask me; it is a node classification task, so we're not going to look at it in too much detail. But in general, I would say that it works fantastically well if you're in the setting, and I have to stress this, of having only random node features, thus only having access to topological features, which are nevertheless trainable. There is still this difference compared to previous approaches, in that we can make our features trainable end to end.

Now, for the last set of experiments, this is the mixed bag of results: let's take a look at how we fare when we classify benchmark data sets overall. I should stress that the takeaway message from this table should not be that the approach is really bad; the takeaway message should be that existing methods are already very, very powerful. If you look, for instance, at the gated GCN, you will find that it almost outperforms all the methods on all the data sets, even without having access to topology-based information. However, the purpose of this experiment was not to outperform all existing methods on all data sets and get all the bold cells in the table. The purpose was to show what happens if you make an existing architecture topology-aware, and whether that has any benefits or not. And we marked the lines in which this topology-based replacement has a benefit.
We marked those lines in italics whenever the comparison worked out, although I'm just noticing that there is, I think, a typo there; this one should also be in italics. But anyway, you can see that in comparison to a regular GCN, when you replace one of the layers with our topology-based layer, you gain some performance benefits on some of the data sets. ENZYMES is a little bit weird; we didn't get that one to fit, and you can read the epic story about it in the paper. We also don't know why it didn't work that well. We hypothesized that it's partially due to the size, but also partially due to the fact that we should have added more regularization. But we can't change the experimental setup now; we would have to rerun everything for fairness' sake. Still, you can see that overall this really helps improve the expressivity of the GCN. You can observe similar results for a GIN: replacing one of the layers helps on CIFAR, it also helps on DD, let's again not talk about ENZYMES too much, and it also works for PROTEINS, for IMDB, for REDDIT, and for CLUSTER. This paints a much more interesting and nuanced picture, I would say, in that the integration of topology-based information can be helpful in most of the cases. But I also want to stress that these architectures are really doing very, very well on their own, so it's not like we can claim that we are now the first ones to fit all these data sets; and I have a thought on why that might be the case, coming up in a few slides. Now, the comparison with other topology-based methods is maybe not so important or relevant for this audience, but here we are also quite good. There are some existing methods out there that do graph classification based on topological features, but ours is the only one at the moment that really operates as an end-to-end, closed-loop system, in the sense that we are able to work together with GNNs and are not reliant on some fixed, discrete feature choices.

Last, I want to show you some beautiful visualizations of the kinds of filtrations that we're learning. We initially thought that we would learn something akin to a degree-based filtration. But if we scale the nodes by their degree and color them by their filtration value, so that a high value is colored more purplish than a low value, you can see that this doesn't necessarily correspond, as we walk through three different filtrations: this would be the first one, the second one, and the third one. This shows, at least for one random example, and I should stress that this is a random example, not handpicked or anything, just one of the graphs that we wanted to visualize, that we empirically appear to learn something quite different and quite orthogonal to a degree-based filtration.

And with this, I'm at the end of the main part, and I want to say a few things about where we could go from here. I also gave this talk to a more topology-oriented audience, and there a colleague, Mikhail, suggested something very nice. I originally had the saying the other way round, but I'm inclined to agree with him that this is the better version: looking at the data sets now, I'm forced to conclude that if all you have is nails, everything looks like a hammer. Our data sets, the way they are, might actually stymie some progress in GNN research, because it seems that their classification does not really require a lot of structural information.
And of course, I'm coming at you with a topology-based background, so I'm saying, oh yes, topology will save all these things, and it's super important to have some knowledge about this. But I really want to stress that it's strange to me that so much information appears to sit in the node features themselves. I'm really wondering whether we could have some more foundational data sets that require an interplay of node features and topology, because it's abundantly clear to me, and I'm happy that this is recorded, because now I'm on the record with it, that it's not that topology is necessarily superior to the node features; you need both to do good classification, right? But it's strange to see that so much information is already concentrated in the node features that it can circumvent the additional information that you would otherwise get from the graph, using message passing, for instance. Nevertheless, I'm hopeful, and recent work by Michael Bronstein, for instance, shows this as well, that higher-order structures, for instance cliques, can be crucial in discerning between different graphs. Here's a very interesting paper, or preprint, I'm not sure whether it's published by now, where they essentially imbue a GNN with additional information about cliques and other sorts of graph structures; this is also shown to increase predictive power. Now, I keep wondering: where is the integration of topology-based features actually the smartest? We already tried this out a little bit with the GIN, but the overall question remains whether there are smarter integrations than just saying, hey, here's a new layer, let's just drop it in. Would it maybe be smarter to remove all the layers and do a message passing that is fundamentally topology-based? Would it be smarter to have something like a loss term that can navigate between the two extremes? It's unclear; it's still an open question for me. And the last two points I wanted to make: I would also be really interested, from the learning-theoretical point of view, in stating conditions under which we're guaranteed to learn appropriate filtration functions, and in showing what we can gain from doing so. We showed empirically, of course, that the learnable filtration is superior to the non-learnable one, but it would be interesting to see whether we can say something more fundamental about this. With this, I'm at the end. Thank you very much for your attention, and I'm exceedingly happy that the network connection worked out so well, at least until now; it's probably going to go down right about now. I'm happy to take any questions and comments.

Excellent, thank you so much, Bastian. That has been an absolutely wonderful talk, and we are also very grateful to your hotspot for holding up. Yeah, thank you. So we are opening the Q&A. How much time do you have, Bastian? I remember that you potentially had some time constraints. No, no, this has been solved now; let me just check again, but unless someone really wants to, no, this is all good. Excellent, we'll shift that into the future, where it's a problem of future Bastian and not of present Bastian. Excellent.
So I was wondering: you mentioned that the model didn't work so well for the ENZYMES data set, and I specifically remember that ENZYMES is already processed in a way that the node features literally reflect some kernel method from when the data set was introduced, like ten years ago. Do you think that might really be the reason? In some experiments I've seen that the node features themselves already hold most of the information, and you don't really need the edges in that data set.

That could very well be the case. I'm jumping back here to show you that it doesn't appear to be, I mean, don't hold me too accountable here, but it doesn't appear to be a fundamental issue with our method, because you can see that in the structure-based experiment, where we have only random node features, we are able to fit the data set. Sometimes the standard deviation is a little worryingly high, which could, I think, be related to the size of the data set; I think it's only 600 graphs, which is not a lot for a GNN. But other than that, it at least sort of works. You can also see, if we now jump to the next slide, that we are at least 40 percentage points or even more away from what one would consider good performance here. For the other data sets, maybe I should show these side by side at some point, or do a scatter plot, I just realized this, but for PROTEINS, for instance, we are actually not that bad: with just the topology, we go up to 74 percent; with the topology and node features, we go up to 76 percent. I would say that we're getting there somehow, right? But for ENZYMES, it's pretty crazy: we're at 20 percent with structural information alone, and at 65 percent if we use everything. Here you can see that we also had some problems getting this to work or to fit reliably: very high standard deviations and very big problems in the training itself. So yes, it could very well be that the node features are really problematic here. I also have the feeling that it might be caused by the fact that multiple versions of the data set are available, and if you read an older paper, it's sometimes a little unclear which variant of the data set they are referring to. But yes, if I had the choice to do all of that again, I would probably opt for removing it and having something bigger there, but we were a little bit restricted, or constricted, by the technology of our time and by the compute resources we had available. Since persistent homology is not yet implemented in a parallel fashion, running all of this was a little bit problematic.

Okay, thank you. We have a question in the chat from Denise, so I'll read it out: can you replace the cycles and connected components in your method with small subgraphs, graphlets, which are quite important in biological networks? Also, what is the advantage of using this method over simply counting graphlets and using them as node features in message passing?

That's an excellent question, and I think I'm going to answer it in exactly the opposite order. First, the advantage over simply using the counts is that here we have, theoretically at least, a learnable feature representation. We can give different weights to the cycles, or we can see how the cycles are being detected, and it's not just a big inductive bias where we say we always count them.
So theoretically, the network could learn a kind of zero embedding for the whole topology and just ignore it altogether, whereas if you add the additional counting features, you have a strong inductive bias, and you also have to do the counting. This method, if you implement it right, can be really efficient, because it takes roughly n log n to detect the number of cycles and the number of connected components, whereas I'm not sure how efficient you can make graphlet counting in general; you would have to specify what types of graphlets to count and what types to ignore. I think this is also a little bit related to the second part of the question, since these would be higher-order structures in some sense. But now coming back to the first part, whether the method can be adjusted to account for graphlets: I don't think that would be easy. I think it could be useful and relevant, because it's one additional type of feature that you might want to incorporate, but I don't think the way we have the method set up right now makes this easy to do for graphlets; sorry to disappoint you there. Maybe it can be seen as a somewhat complementary approach: the cycles being, theoretically, of arbitrary length, and the graphlets being more like localized features. Maybe that combination would be interesting.

Thank you. I think now over to Christopher. You are still muted, Christopher. Yeah, my bad. Thanks for your great talk. So you showed that the persistent homology approach can be more powerful than 1-WL, right? Yeah. Do you know any results, or do you have any insights, about 2-WL? Do there exist graphs that 2-WL cannot distinguish but your approach can?

Yes. Now, I honestly have to apologize, because I'm blanking a little on how far those results went. We have some discussion of this in the appendix of the paper, where essentially we look at strongly regular graphs and check how well we can distinguish them. I know that persistent homology was kind of okay in an untrained setting: if we use the degrees and some higher-order clique information, then we are capable of distinguishing them, but not without that information. So again, it comes down to the second point here. I would say, without knowing the results by heart, that we definitely need higher-order structures to account for this, which would, I guess, also point towards the fact that it might not be more expressive than 2-WL. I'm not sure, because 2-WL also, no wait, 2-WL does not necessarily include cliques of arbitrary size, right? It just uses edge information. Yeah, edge and non-edge information. So I guess it might not be more expressive in this hierarchy, but it would be super interesting to actually look at this and hash it out. Sure, thanks a lot. You're welcome.

All right, we have another question in the chat from Vijay: will the topological approach not still be susceptible to overfitting issues, especially due to its expressiveness in terms of cycles and connected components? That's an excellent question. We have actually hypothesized about this, because we had some additional experiments where we looked at how many layers we can fit and how all of this behaves.
And our current hypothesis is that we can prevent some oversmoothing, because if one gives a lot of weight to the underlying topology, that already pulls your representations back into a more favorable space: if you have to preserve all of that, if you have to be aware of all the cycles, then this enforces some kind of inductive bias, some kind of rigidity, that you would not get otherwise. We haven't done any large-scale experiments there; as you can see, we are currently at four or five layers or so, and for this we would probably need many more layers and longer experiments. But we hypothesized that having this information available might at least help prevent some oversmoothing issues.

Right, over to Michael. Thanks. I had one short and one maybe slightly longer question. So here, actually, it's very good that you're on this table, the last line with GATs. Generally, we see that topological features do improve GATs on the previous table, but on the node classification task here, on PATTERN, there is a significant drop, almost 15 points. Do you have any intuition why it seems to work better on graph classification? Is it the regularity of the PATTERN data set, or is it something else?

Yeah, I think this was actually pointed out to us during the review process, or maybe we stumbled upon it ourselves, I'm not sure, but I'm happy to credit the reviewers there; the previous review round was really helpful in that regard. I think that this data set has a regularity that is not really helpful in our case. This comes down to what I said earlier about the ENZYMES data set: it would probably have been a better idea to regularize this a little more strongly than we did in the end, because that way we would have a good way of saying, hey, this information is really not usable, let's just ignore it altogether. That just doesn't work at the moment, so the regularity doesn't really help us yet. But it's strange; what I can't explain is why it only happens for GAT. That I can't explain at the moment, I'm sorry.

The second question is a bit more abstract, so if it's total nonsense, feel free to dismiss it. Do you see any connection, any trace, in those charts that you showed, where, as we construct the graph edge by edge, we build this new chart? It resembles somehow the idea of dictionary learning for graphs. There was one question in the chat about graphlets, which are more regular structures, and there has also been a bunch of papers on dictionary learning that try to identify similar parts, or tokens, of smaller graphs like molecules, and they also compare graphs based on such a decomposition, not node by node or feature by feature. Is there any connection to that, or did I just make it up from a wrong premise?

No, no, this is an excellent correspondence, an excellent association. You can indeed see these persistence diagrams, I'm just jumping back here, as some kind of dictionary. The only difference, I would say, is that currently you cannot really control a lot about how this dictionary is being created.
You only have control over the ordering of the edges, and you will always get cycles and connected components in that dictionary. If you wanted the dictionary to contain some other structures, one would have to think about how to manage that. But yes, it is exactly that: it's some kind of fingerprint or summary, where you essentially give up some structure on the left-hand side, but you gain some other structure on the right-hand side. Namely, you always know that those diagrams live in a two-dimensional space, that the points are situated above the diagonal, and some other nice metric properties. That is what you gain, but you lose the knowledge about localization, so you don't necessarily know where the cycles are in the graph. You can sometimes link it back, but for technical reasons, it doesn't always give you very concise features in the graph. But yes, it's a very nice way to summarize the graph. That's true, it's a very helpful intuition; thanks for that. Yeah, thank you, that's a great link; I hadn't actually thought about it like this. I always say that it's a fingerprint, but I think having a codebook or some other kind of dictionary representation is actually much more appropriate. Thank you for that.

I had a question along similar lines: recently there has been a lot of fuss about different learnable positional encodings for graphs. Those still try to derive the position of each node within the graph, whereas here, as you were saying, you're more getting the general information and appending it to pretty much every node through the projection, summarizing the global fingerprint of the entire graph. But a lot of the positional encodings are based on spectral features that are not that different, as they also summarize the connectivity of the entire graph. So what might the connection to that be? Have you considered testing it or viewing it from that angle, as more of a positional encoding, and whether you could do that? That was what I wanted to ask.

That's a good question, and I honestly can't answer it easily. In a way, maybe it would be useful to have some kind of positional encoding of the topology-based features; that I could certainly see. In some sense, provided that your filtration plays nicely, you can already see a topological feature as some kind of positional encoding, in that it tells you something like: this node has a very big structural role to play, so if you remove it, the graph is separated into two parts, or something like this. So this still kind of happens. But I would have to think more deeply, to be honest, about how to make a bigger connection there. I think it could be there, though, and I think it could also be useful to imbue the model with additional knowledge about positions in the graph via a positional encoding; that could very well lead to a very cool representation later on. I'm sorry, I don't have more for you there. I kind of dumped that question on you without warning... No, that's totally acceptable. And I was just wondering whether any of those weights that you learn from the learned filtration could have some kind of interpretation of that sort. Potentially, yes.
But I also have to say that this would be a fascinating new research direction. What I showed you on the last slide is, honestly, a little bit whimsical when I put it like this, because what one should actually do is take a look at what those filtration values actually correspond to, right? But that would be a different paper altogether. Looking at what we have actually learned and trying to relate it to things we already know would be really fascinating: seeing, for instance, whether those values are related to positional encodings or whether they're somewhat complementary, that would be really cool.

Excellent, thank you. Maybe an easy detail question now: you mentioned that you swapped one of the layers for this topological layer in the experiments; was it always the first one? No, no, we experimented with this; we have a nice ablation study in the appendix. In the main paper, we always replace the second layer, but there is no clear advantage to it always being the second layer. This was more a choice that we made: if it were the first layer, the layer doesn't have access to the actual message passing; if it were the last layer, the subsequent layers don't have access to the topology-based features. So in a sense, we took a middle ground and said, okay, let's do the second layer. But whether it's the second or the third, there's no significant difference over all the data sets, so it's purely a choice made out of convenience. I would strongly suggest not to put it last, though, because then you essentially just gain a readout function that gives you some more topology, but not really a lot of additional information that you can still leverage. And I would probably also not put it first, because then the topological information is not informed by something you can learn really well; it's not a fixed or static layer, the features can of course still be adjusted, but you just have the input labels and you can't actually change anything about their representation. So that was the long answer to why we put it second. That's an excellent answer, because that's exactly what I was wondering: where would you position it, and what impact does that have. So thank you so much. Of course, you're welcome.

All right, on to the next question. Do we have another question? I don't see any in the chat, so if anyone has a question, please jump in, or we can conclude here if we have run out of questions; we're already a little bit over time. Bastian, thank you so much for taking the time and spending it with us, sharing your work. I think it's a very fresh way of processing graphs and thinking about the problem, in a world of mostly WL-style message passing, and it's been very educational. So thank you so much for the presentation. With that, I think we will adjourn for today and hope to see you next week. We'll probably have more of a reading-group-style meeting next week with one of the PhD students here at Mila, where we hope to go through more of an introduction to the topic of equivariant graph neural networks. So with that, thank you all for coming, and once again, Bastian, thank you so much. Thank you very much for having me; it's been a pleasure and my honor. Thank you very much. Thank you so much, Bastian. Bye-bye, everyone. Bye-bye.
Bye, everyone.