So today at our seminar we are very pleased to have Bastian Rieck, who is working at ETH Zurich on the interaction between topological data analysis and machine learning and has done some very interesting work in this area. Today he will talk about topological representation learning for structured and unstructured data. So Bastian, please feel free to start your presentation.

Thank you very much, and thanks for the invitation. It is really great to disseminate my work, or rather our work, a little bit in this seminar. Today we are going to talk about structured and unstructured data and how to do representation learning for them. I already talked to the organizers, so I know the audience already knows a little bit about TDA, and I am not going to bore you with the details. Typically I would start this presentation with topological objects, avatars, the sphere and whatnot, but for this audience I want to set the stage a little differently, with a quote from Solzhenitsyn that I find very appropriate, because we are really living in the future in a sense. The quote goes thus: "Nerzhin, his lips tightly drawn, was inattentive to the point of rudeness. He did not even bother to ask what exactly Verenyov had written about this arid branch of mathematics, in which he himself had done a little work for one of his courses. Topology belonged to the stratosphere of human thought. It might conceivably turn out to be of some use in the twenty-fourth century, but for the time being..." I think this is a great quote because it means that we are truly living in the future. The book is a few decades old, and people then thought that topology might never see the light of day and might remain confined to the realm of the purely abstract, but as you can see from this seminar and the work that has been going on at conferences, this is far from true. In a sense we are the avant-garde, because we are already living in the twenty-fourth century with our topological data analysis. I am happy that this prediction turned out to be a little bit wrong.

So without further ado, let me briefly introduce what we are going to talk about. Most of the TDA in this presentation will be based on persistent homology, often on a Vietoris-Rips construction, although for graphs we will not have to construct this complex ourselves because it will be given to us. Just so everyone is on the same page: the Vietoris-Rips construction builds a complex comprised of simplices whose vertices have pairwise distance less than or equal to some epsilon. Equivalently, growing Euclidean balls of radius epsilon/2 around every point, we create a simplex for every set of balls that intersect pairwise. On the right-hand side we monitor the resulting persistence diagram, which at first contains only connected components. As soon as three or more balls intersect pairwise, we also get triangles and thus cycles. Most of them are still confined to the diagonal, of course, because nothing has been destroyed yet, but at some point the big cycle in the complex is destroyed as well. So this is all known to you. We also know that these persistence diagrams are very convenient descriptors.
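As a small aside for anyone who wants to play with this, a minimal sketch of such a Vietoris-Rips computation might look as follows. This is purely illustrative, assumes the gudhi library and a toy point cloud, and is not code from the work presented here.

```python
import numpy as np
import gudhi

points = np.random.default_rng(0).normal(size=(100, 2))   # toy point cloud

# Simplices are created whenever all pairwise distances are <= epsilon, i.e.
# whenever the Euclidean balls of radius epsilon/2 intersect pairwise.
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)

# The diagram is a list of (dimension, (creation, destruction)) tuples;
# dimension 0 tracks connected components, dimension 1 tracks cycles.
diagram = simplex_tree.persistence()
print([pair for dim, pair in diagram if dim == 1])         # the cycles
```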
A lot of work has been done concerning their stability properties, stability in the geometrical sense. When you assume that you have a function on the vertices of a simplicial complex, you can bound the changes in the persistence diagram as you vary this function. The most convenient distance between diagrams, which we will in fact be using in the remainder of this talk, is the bottleneck distance. It is defined by a matching: the infimum over all bijections between the diagrams of the supremum distance matching one point in one diagram to its partner in the other diagram. As illustrated here, if you have these two diagrams, the blue one being a perturbed or noisier version of the red one, then the bottleneck distance evaluates to the distance between those two points. So in some sense it is a very coarse distance, and there are refinements, of course, such as the Wasserstein distance with an exponent p, but most of the time the bottleneck distance is pretty much sufficient, and we will see why on the next slide.

Namely, we have the classical stability theorem, of which there are many versions by now, some in rather different contexts. The original version was given, I think, by Cohen-Steiner and colleagues and was about triangulable spaces. If we have a triangulable space, which could be a manifold, and two continuous tame functions f and g from this space to the real numbers, then the corresponding persistence diagrams satisfy bottleneck stability, meaning that their bottleneck distance is bounded from above by the supremum distance between the respective functions. Tame here means that the functions do not have an infinite number of critical points, so for most of the nice functions we want to deal with this is satisfied; in particular, it is satisfied when you are dealing with real-world data that have been discretely sampled.

The consequence, and this is often misunderstood (in fact, a recent paper by someone in the community, whose name escapes me at the moment, showed that this theorem is often misquoted), is that we are robust to small-scale perturbations in the Hausdorff sense. If the perturbations become too large, we lose that bound. So if we take one circle here and calculate its persistence diagram, with the cycles shown in blue and the connected components shown in red, and we have another sampling of this circle, then the theorem applies, because the distances are well bounded and not a lot of perturbation is happening. But if we add a few large-scale perturbations, a few outliers, into the dataset, we lose the stability result. Even though only a small fraction of points have been added to this circle dataset, you can see that we lose the structure of the persistence diagram. Of course we still find the overall cycle in the dataset, but the corresponding point in the persistence diagram has lost a lot of its persistence. So the stability theorem only tells us that we are robust to small-scale perturbations, not to large-scale perturbations. And why am I stressing this? Because there are some implications for machine learning here.
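To make the flavour of this statement concrete, here is a small, purely illustrative sketch (not the speaker's code) comparing a noisy circle with a slightly perturbed version of it, measuring the Hausdorff distance between the point clouds and the bottleneck distance between their one-dimensional diagrams. It assumes gudhi and SciPy and does not reproduce the exact constants of the stability bound.

```python
import numpy as np
import gudhi
from scipy.spatial.distance import directed_hausdorff

def noisy_circle(n, noise, rng):
    t = rng.uniform(0, 2 * np.pi, n)
    return np.c_[np.cos(t), np.sin(t)] + rng.normal(scale=noise, size=(n, 2))

def cycles_diagram(points):
    st = gudhi.RipsComplex(points=points, max_edge_length=2.0) \
              .create_simplex_tree(max_dimension=2)
    st.persistence()
    return st.persistence_intervals_in_dimension(1)

rng = np.random.default_rng(0)
x = noisy_circle(200, 0.01, rng)   # the "red" sampling of the circle
y = noisy_circle(200, 0.05, rng)   # a slightly perturbed "blue" sampling

d_hausdorff = max(directed_hausdorff(x, y)[0], directed_hausdorff(y, x)[0])
d_bottleneck = gudhi.bottleneck_distance(cycles_diagram(x), cycles_diagram(y))
print(d_hausdorff, d_bottleneck)   # small perturbation, small diagram change
```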
Namely, we need to be extremely careful when working with mini-batches. Typically in deep learning you have this mini-batch notion, meaning that you take a subset of your samples and train on that; we will see more details about this in a minute. But when we are dealing with such a mini-batch M-tilde of some point cloud M, we need to be careful about the perturbations and about the stability of this. For example, if we take a point cloud with 100 points that are normally distributed in two-dimensional space and draw 50 subsamples of increasing size, you can see that the Hausdorff distance between the point cloud and the mini-batch decreases with increasing mini-batch size, as it should. Nevertheless there is some variance, and it does not go directly to zero, but it progressively goes down. This means that if your mini-batch size is too small, you might not benefit from very tight bounds in the aforementioned theorem, because your Hausdorff distance is still rather large; in essence you would need a better bound. Empirically speaking you can still do the training, of course; it does not mean that your training process will not work. It just means that you have to be careful about the theoretical bounds you can give at this point.

All right, so what do I want to do in this talk? I want to bridge the chasm, as I call it, because persistent homology, as we all know, is inherently discrete: it deals with topological properties of a space, with connectivity and simplices. Deep learning, on the other hand, is inherently continuous in some sense, meaning that you have the backpropagation step, the gradients, and everything else that makes everything differentiable. So the challenge is: can we make the calculation of a persistence diagram differentiable, in particular if we have some control over the input spaces? We will see that this is possible and that it works out.

I want to briefly step through some exciting work by Poulenard and colleagues, who were among the first to show how this optimization could work and how to make it differentiable. There are other works out there that also do this, but I find their phrasing, terminology, and proofs to be the most accessible, to be perfectly honest; at least that is my flavour of mathematics. Some terminology before we step into a simple example: we assume that we have a function f from a manifold to the real numbers, so we can see persistent homology as a map from this tuple to a set of creation and destruction points, a set of persistence diagrams in a sense, where we ignore the dimension for the time being. Now we need a map s, which maps a point in the persistence diagram to a pair of simplices; this is why it is called s. It maps the geometrical point in the diagram to the pair of simplices giving rise to that point, and we can also evaluate it for a single coordinate of a point in the diagram. Second, we need a filtration map or vertex map, which we denote by v for vertices. The idea here is that we map a simplex to one of its vertices; how that mapping looks in practice depends on the filtration.
For the sublevel set filtration, for instance, we will typically map a simplex to the vertex with the largest function value. Finally, we consider the map p, which can be seen as a point-to-vertex map, namely the composition of the two: p is what you get when you apply v after s, so you first go to the simplex and then to the vertex. This is the map we will take a look at, because it is what drives differentiability.

Let us see that in practice; as I said, this is just for intuition, there is a lot more in the paper, and I urge all of you, if you have not read it so far, to read the paper I will cite in a second. For a simple function on the left-hand side and its corresponding persistence diagram, these mappings look as follows. We first go from one point in the diagram, via the s map, to the respective simplices. So here s of (0, 4) is the pair of simplices {a} and {a, b}; I denoted them as subsets to make this a little more accessible. That is easy, it is just the persistence pairing; some people call this the pairing and then obtain the diagram by applying the function to it. We can now go one step further by applying the v map: having access to the simplices, we look at the vertex that gives rise to the simplex's function value in the filtration. That is easy to do here because it is a sublevel set filtration: for {a} it is the vertex x3, and for {a, b} it is the vertex x4. Thus the p map, the composition of v after s, maps the point (0, 4) in the persistence diagram back to the data, namely to the tuple of x3 and x4.

So far that is just an intuition of how this works, but the interesting thing is that you can get a gradient through this calculation. Poulenard and colleagues make the following observations. If the function values are distinct, so if f is injective, then p is unique: the mapping back into the domain is unique. Moreover, if the function values are distinct, this map is also constant in some neighbourhood. That is great news because it means that we can calculate the gradient very easily. Assuming we have an input function f that depends on some parameters theta (I am using the machine learning convention of denoting all parameters by theta), we have that f evaluated at p of some creation point c_i equals c_i, because the mapping is constant and unique. So we can evaluate the gradient as a partial derivative, and we do not need to differentiate through p itself, because it is locally constant; we only have to care about the function itself. To summarise, the partial derivative of a diagram point with respect to theta is the derivative of the function itself, evaluated at the image of the map p at c. This is of course great news, because it means that we can actually have a trainable function that gives rise to a filtration of the dataset later on. Again, I urge you to read this; it was done in the context of topological function optimization, where the idea was to obtain a function that mimics the input data as well as it can, but we have used this whole analysis for deep learning purposes later on.
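To make this more tangible, here is a minimal toy sketch (my own illustration, assuming gudhi and PyTorch) of optimising a parametrised vertex function through a sublevel-set filtration on a small path graph. The parametrisation, the learning rate, and the objective of shrinking total persistence are all arbitrary choices for the example, not the setting of the paper.

```python
import gudhi
import torch

# Parameters theta of a hypothetical filtration function f_theta on 5 vertices.
theta = torch.tensor([0.4, -0.2, 0.7, 0.1, 0.9], requires_grad=True)

def vertex_function(theta):
    return torch.sin(theta) + theta        # illustrative parametrisation

for step in range(10):
    f = vertex_function(theta)
    f_np = f.detach().numpy()

    # Sublevel-set filtration of the path graph 0-1-2-3-4: every simplex gets
    # the maximum function value of its vertices (the v map from above).
    st = gudhi.SimplexTree()
    for i in range(5):
        st.insert([i], filtration=float(f_np[i]))
    for i in range(4):
        st.insert([i, i + 1], filtration=float(max(f_np[i], f_np[i + 1])))
    st.persistence()

    # The pairing (s map) is treated as a constant lookup table; gradients
    # flow only through the function values it selects (the p map).
    loss = 0.0 * f.sum()
    for birth_simplex, death_simplex in st.persistence_pairs():
        if not death_simplex:              # essential class, never destroyed
            continue
        b = max(birth_simplex, key=lambda v: f_np[v])
        d = max(death_simplex, key=lambda v: f_np[v])
        loss = loss + (f[d] - f[b]) ** 2   # e.g. shrink total persistence

    loss.backward()
    with torch.no_grad():
        theta -= 0.1 * theta.grad
        theta.grad.zero_()
```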
With that, I want to jump into the first paper, the topological autoencoders paper that appeared last year at ICML; it is joint work with Michael Moor, Max Horn, and Karsten Borgwardt, all from ETH Zurich. Maybe we do not need this, but let me try it just for fun: I have an animation prepared. The motivation is as follows. If you take a special kind of dataset, in this case a dataset consisting of spheres that are nested within one big sphere, and you train a classical or vanilla autoencoder, as we like to call it, you get this kind of training process. You can see that the different spheres nested in the big sphere are neatly separated from each other, which is good for reconstruction purposes, but at the same time you lose the information about the containment relationship: you lose the knowledge that there is a big sphere enclosing those smaller spheres. Now let us look at what the topological autoencoder can do, and then I will give you the details of how this actually works. You can see that during the training process the spheres are kept at a different level, and in particular the large enclosing sphere is visible in the latent space. Let us get back to the slides. This is great, because it means that you can keep track of the connectivity and the enclosure relationships in the dataset.

Let me give you a brief overview of how the topological autoencoder works. In the top branch you can see the standard thing you would do when you train an autoencoder. We have an input dataset and we put it through a neural network with a bottleneck in the middle, meaning that it decreases the hidden dimensionality of the data. The layer in the middle is often called the latent code, because it is the latent space representation the network has learned. From this we get a reconstruction X-tilde of the input, and the classical thing to do is to calculate a reconstruction loss: how well are you able to reconstruct your dataset after passing it through this bottleneck? Why would you do this? Because you want a dimensionality-reduced representation in this latent code, which you can use for compressing the dataset, for storing it, or for other purposes such as visualization; this is often done in particular in computational biology.

The new thing about our approach is the second branch, which evaluates a topological loss. For this we calculate persistence diagrams on the mini-batch level of the input dataset and on the mini-batch level of the latent codes, and then we try to match them as closely as possible. In essence, we have a loss term that measures to what extent the topological descriptors of the input space and the latent space are similar, the idea being that we can regularise the latent space and force it to represent the relevant features of the input dataset. This works in theory because we have a very nice theorem that bounds the probability of exceeding a threshold in terms of the bottleneck distance; this is where we link back to the bottleneck distance that I mentioned before.
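At the level of pseudocode, one training step of this two-branch setup could be sketched as follows. Here `encoder`, `decoder`, and the weighting `lam` are placeholders, and `topological_loss` refers to the pairing-based term sketched a little further below; this is an orientation aid, not the released implementation.

```python
import torch
import torch.nn.functional as F

def training_step(x, encoder, decoder, lam=1.0):
    z = encoder(x)                       # latent codes of the mini-batch
    x_hat = decoder(z)                   # reconstruction of the input
    recon_loss = F.mse_loss(x_hat, x)    # standard autoencoder branch
    topo_loss = topological_loss(x, z)   # compares diagrams of X and Z
    return recon_loss + lam * topo_loss  # joint objective for one step
```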
I will not go into the details of the proof here, but the idea is that if we have a bound on the Hausdorff distance between the original point cloud and its mini-batch subsample, and if that bound is good, then the probability of exceeding a given bound in the bottleneck distance can also be bounded. In essence this means that, under some conditions, the subsampling procedure is stable, and mini-batches are topologically similar to the full dataset provided the subsampling is not too coarse. This gives some credence to what we are trying to accomplish.

The gradient calculation here is interesting because we only operate on the level of distance matrices of the dataset; this also opens up some exciting possibilities for future research, which we have not investigated so far, to be honest. The idea is that we have the distance matrix of the original input space and a persistence diagram on the right-hand side. If we play our cards right, namely if the distances in the dataset are all pairwise different, which amounts to a genericity or general-position condition, then we can map every point in the persistence diagram to precisely one entry in this distance matrix. And now the kicker is that each entry of this matrix is a distance, which means it can be changed during training, at least in the latent space representation. Of course we cannot change the distances in the original space, because that is what we started with, but we can change the distances in the latent space. Let me illustrate this mapping: you can see that this point, for example, maps to these two distances here; these distances are generated by two points, and we can change their positions to adjust those distances.

This leads to a loss term which, interestingly, has been called the "topological signature loss" by some publications; we unfortunately did not give it a name ourselves, which shows that if you do not name something, other people will name it for you. I am not sure I would want to call it the topological signature loss, but that is what happened. This loss term comprises two components: one goes from the input space to the latent space, the other from the latent space to the input space. I have to say from the outset that it is not a metric. You know that the bottleneck distance is hard to calculate, even though we have approximation schemes, and the Wasserstein distance is also hard to calculate, even though we again have approximation schemes, so this does not pretend to be a proper metric; it is just a loss term that uses some similarity properties between the diagrams. I do not want to go too much into the details, but everything boils down to looking at the persistence pairing of an input mini-batch, pi of X, or of the latent mini-batch, pi of Z. The persistence pairing tells us which indices are paired with which other indices; it is a kind of lookup table for the implicit complex that underlies the representation, and we can subset the distance matrices with these pairings. So we get one component by comparing the distance matrices of the input space and the latent space, subset using the persistence pairing of the input mini-batch, and the other component by comparing both distance matrices subset using the persistence pairing of the latent mini-batch.
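To give a concrete feeling for how such a pairing-based loss can be implemented, here is a simplified sketch restricted to zero-dimensional persistence, where the Vietoris-Rips pairing corresponds to the edges of a minimum spanning tree of the distance matrix. It is my own illustration rather than the released code, and it assumes PyTorch and SciPy.

```python
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def persistence_pairing(dist):
    # For 0-dimensional Vietoris-Rips persistence, the paired (finite) entries
    # of the distance matrix are the edges of a minimum spanning tree. The
    # pairing is computed on detached values and treated as constant.
    mst = minimum_spanning_tree(dist.detach().numpy()).tocoo()
    return list(zip(mst.row.tolist(), mst.col.tolist()))

def selected(dist, pairing):
    return torch.stack([dist[i, j] for i, j in pairing])

def topological_loss(x, z):
    ax, az = torch.cdist(x, x), torch.cdist(z, z)    # distance matrices A^X, A^Z
    pi_x, pi_z = persistence_pairing(ax), persistence_pairing(az)
    # Compare the same selected distances in both spaces, once for each pairing.
    l_xz = ((selected(ax, pi_x) - selected(az, pi_x)) ** 2).sum()
    l_zx = ((selected(az, pi_z) - selected(ax, pi_z)) ** 2).sum()
    return 0.5 * (l_xz + l_zx)
```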
The idea is that we cannot directly control the difference between the two pairings, but we can control to what extent the same distances are selected in both spaces. That is why it is a bidirectional loss, and as you can see, it only involves very simple arithmetic operations. So if the distance matrix A is differentiable and the distances are unique, then the pairing is constant in a small neighbourhood, which means we can take a small gradient step and actually train. That is the magic that makes this approach work in the end.

So how does it look? Let us do a qualitative evaluation first, and then I will show you some quantitative results. I have to stress that there are many more datasets on which we tried this, but I am only showing this one to keep a nicer narrative structure, because I want to give a broad talk rather than a deep dive into a single paper. Anyway, the spheres dataset, which as I said contains several spheres nested inside one bigger sphere, cannot be disentangled correctly by PCA. UMAP does a good job of representing some of the spheres, but it fails to show the enclosure relation, and the same goes for t-SNE. Isomap does something very strange here. The standard autoencoder with the standard reconstruction loss fails to show the enclosure relationship, but at least it shows that the spheres have some kind of shape, and the topological autoencoder does what you would expect it to do, as I showed you before. That is very neat. We can zoom in a little and see that, thanks to the use of these distances, it can also to some extent leverage information about different scales in the dataset: it is able to tell us that the big sphere is in fact geometrically larger than the other spheres, whereas you do not see this effect as pronounced elsewhere. To the credit of the standard autoencoder, there are some differences in scale there as well, but with our method you know that you can trust those differences in scale, because that is how it was set up.

Now let us talk briefly about how to evaluate this, because that turned out not to be so easy. We tried a lot of different evaluation measures for dimensionality reduction; some of them are geared towards preferring one method over another, so we did what anyone would probably do and developed our own measure. It is based on the distance-to-measure density estimator, which I think some of you know pretty well, since as far as I understand you developed it. We use a simple discrete variant of the distance-to-measure density estimator, which evaluates appropriately weighted squared Euclidean distances between all points. This evaluation has the advantage that it is well defined both on mini-batches and on the full input dataset, and then for any smoothing parameter sigma we evaluate the Kullback-Leibler divergence between those two density distributions. So we measure the similarity between the two densities, and the idea is that if your dimensionality reduction method is good, it should yield an embedding that preserves this similarity.
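As a rough illustration of this kind of measure (my own simplified surrogate, not the exact estimator from the paper), one could compare density estimates of the input space and the embedding like this, assuming NumPy and SciPy:

```python
import numpy as np
from scipy.spatial.distance import cdist

def density(points, sigma):
    # A simple discrete density estimate based on squared Euclidean distances,
    # standing in for the distance-to-measure estimator described above.
    weights = np.exp(-cdist(points, points, 'sqeuclidean') / sigma).sum(axis=1)
    return weights / weights.sum()

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# score = kl_divergence(density(X, sigma), density(Z, sigma)) compares the
# density of the input space X with that of the embedding Z for one sigma.
```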
So how does it look? This is of course not the only measure we tried, but looking at the quantitative evaluation, the best result is always underlined and bold, and the second-best result is shown in bold. We can see that the topological autoencoder ranks very well here, and I have to stress that it is not directly set up that way: this Kullback-Leibler measure does not occur in our loss term in any shape or form, so in some sense we are not geared towards preserving it, but the way persistent homology works gives us the possibility to preserve those distances a little better. A very interesting point that occurred to us: when you look at the mean squared error of the reconstruction itself, lower values being desirable, you can see that we actually make the purely geometrical reconstruction a little bit worse. The standard autoencoder of course wins that one, because that is exactly what it is set up to do. This happens because the topological constraint is somewhat at odds with the purely geometrical reconstruction: to reconstruct the dataset you need somewhat different information than to obtain a latent space that is topologically faithful, so in some sense we are pulling the autoencoder in a different direction.

All right, this concludes the part about unstructured datasets: we can now learn on point clouds and obtain good visualizations. I could go on here, and please feel free to ask me anything about the autoencoders later on; there are a lot of interesting things we can do afterwards that we have not investigated so far, but at least we have a first notion of what is going on and what topology is capable of. Now I want to switch gears and go to structured datasets, and in particular I want to talk about how to learn graph filtrations. This is joint work with Christoph, Florian, Marc, and Roland, most of whom are from Graz University, and it also appeared last year at ICML.

Here let me give a brief digression, although maybe this can be an audience participation question: who is already super familiar with graph neural networks and would be bored by this? Just write in the chat if you already feel confident; otherwise I have two or three slides that explain this a little. Okay, we have one person who is not super familiar, and that is already enough for me; I am happy to do it, I just do not want to bore anyone. (I may be wrong, but I think most people are not familiar with GNNs, so yes, I think the slides are worth it.) Perfect, then you are in for a treat, and I hope we can also discuss something else later on when it comes to expressivity; I have some additional things to say there. The whole idea of a graph neural network is that you want to learn a graph representation. This is made a little complicated by the fact that graphs can have different edges, different numbers of nodes, and so on and so forth. So what you do is the following: the community has converged to a fairly
generic setup that involves message passing between nodes and edges. The idea is that you first learn some hidden node representations based on aggregated attributes; this aggregation happens over the neighbourhoods of the graph, and in fact the k-th iteration of this aggregation contains information from up to k hops away in the graph. You repeat this iteration K times and obtain graph-level and node-level representations.

More specifically, you have an aggregate function, which takes the hidden representations of the vertices learned in the previous step and accumulates them over the neighbourhood of a vertex. This aggregation can take different forms: typically it could be a sum over small multi-layer perceptrons applied to those representations, but you only need access to the neighbourhood of the vertex and to the previous hidden representations. In the first iteration you start with the original node features, or one-hot encoded labels, or whatever, so the first hidden representation is the original input data, and you aggregate from there. With this in hand, you have a combine function, which takes the attributes of the current vertex and merges them with the aggregated representation from the previous step; the combine function is typically also a sum, or max pooling, or something like that. At the end you apply a so-called readout function, which takes all the hidden representations of the vertices at the end of the iteration scheme and constructs a graph-level representation. So you build the graph-level representation from the individual node-level representations. This terminology follows the paper "How Powerful are Graph Neural Networks?".

I want to zoom in on the message passing character here so you can see how it works. If this is your graph and we look at v4, you can see that it has v2, v3, and v5 as neighbours, and these could carry arbitrarily high-dimensional attributes. Now you aggregate those features over the neighbourhood, which is easily done using a sum: since those vectors have the same dimensionality, you can sum them and obtain a new representation. Whether that is a good representation is a slightly different question, but you can repeat this process multiple times and update the vertex representation of v4 accordingly. Of course you need to account for the fact that v4 already has attributes of its own, so, as I said, some people concatenate them into a bigger vector, or sum them, or whatever; there are multiple versions. At the end you use the readout function to read out all the high-dimensional hidden representations of your graph, and from this you obtain your graph-level representation.
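For readers who like code, here is a bare-bones sketch of one such aggregate/combine/readout scheme in PyTorch. The sum aggregation and the small MLP are arbitrary choices; this is an illustration of the general idea, not any particular published architecture.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h, edge_index):
        # h: (num_nodes, dim) hidden representations, edge_index: (2, num_edges)
        src, dst = edge_index
        # AGGREGATE: sum the neighbours' representations for every vertex.
        agg = torch.zeros_like(h).index_add_(0, dst, h[src])
        # COMBINE: merge each vertex's own representation with the aggregate.
        return self.update(torch.cat([h, agg], dim=-1))

def readout(h):
    # READOUT: build a single graph-level representation from all vertices.
    return h.sum(dim=0)

# Usage sketch: h = node_features; apply K layers, h = layer(h, edge_index);
# the graph-level representation is then readout(h).
```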
Now, the motivation for this project was that we want to learn a filtration. We often employ a predefined filtration function that goes from the vertices V of a graph to the real numbers; the degree, for example, is commonly employed, or it could be the heat kernel signature, which I have seen used with great success in the past. We then typically extend this function to the full graph by assigning an edge the maximum of its two vertex values. But now the question is: is it possible to learn this function, that is, to learn a filtration function that is geared towards a specific task? The answer turns out to be yes, and I will give you an intuitive overview here. We take our message passing scheme, apply persistent homology to the resulting low-dimensional node representations, and obtain a persistence diagram. We then use a projection or coordinatization function psi; applying n such functions gives us an n-dimensional vector, and then we use a neural network to classify based on this representation. The gradient from the loss term to this projection function is already known to exist, because the projection is a simple coordinate projection, so there is nothing new there. The nice thing is that the framework of Poulenard and colleagues gives us conditions under which the gradient also exists further back, so that we can go all the way back to the original graph.

I see a question in the chat: with respect to which filtration is the persistent homology of the graph being computed? That is an excellent question. In fact we start with, not quite a degree-based filtration, but an essentially random filtration on the nodes: a multi-layer perceptron projects the node representation down to a scalar value, and I have some details on the next slide about this. This gives us a single scalar value per vertex, which is our starting filtration, but it is trainable, because the parameters giving rise to the filtration function at each vertex are trainable. The idea is that we make this function trainable based on the loss term at the end of the neural network. Does that answer the question to some extent? Okay, perfect.

So the nice thing about this paper is that we can show that the gradient exists here as well, so that we can go from the persistence diagram representation back to the actual graph and train it. We do this by applying a differentiable coordinatization scheme; I am going to skip this slide because the details are not that important, it is just a continuous projection. You can use different things here: the PersLay layer would be an excellent choice, for example, or the landscape layer, now I think called PLLay, for persistence landscapes. There are multiple options; it just has to take a diagram and project it. We initialize the filtration using a single graph isomorphism network (GIN) layer; again, the details are not important. It is a single one-level message passing graph neural network with a specific hidden dimensionality, followed by a two-layer multi-layer perceptron, and the important thing is that it projects everything down to a scalar value. So in the end we have a rather complicated way of obtaining a function with values between 0 and 1; it is typically initialized using the vertex degree or uniform weights, but this representation is, as I said, trainable over the course of the training process. We know that if f is injective on the graph vertices, the gradient exists, which is one of the takeaways here; we can initialize it easily, and we can integrate this into existing neural network architectures.
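As an illustration of what such a differentiable coordinatization might look like, here is a minimal PersLay-style sketch with Gaussian structure elements. The centres and scales are illustrative learnable parameters, not the exact parametrisation used in the paper.

```python
import torch
import torch.nn as nn

class DiagramCoordinates(nn.Module):
    """Maps a persistence diagram of shape (m, 2) to a fixed-size n-vector."""
    def __init__(self, n=16):
        super().__init__()
        self.centers = nn.Parameter(torch.rand(n, 2))   # one centre per coordinate
        self.scale = nn.Parameter(torch.ones(n))

    def forward(self, diagram):
        # diagram: tensor of (birth, death) pairs; the output is invariant to
        # the number and order of points, so diagrams of any size are allowed.
        d = torch.cdist(diagram, self.centers)            # (m, n)
        return torch.exp(-self.scale * d ** 2).sum(dim=0)  # (n,)

# Usage sketch: vec = DiagramCoordinates()(torch.tensor([[0.0, 0.4], [0.1, 0.9]]))
# and a standard classifier is then trained on `vec`.
```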
The results are interesting insofar as they show that with one GIN layer and the graph filtration learning procedure we are always better than a pure persistent homology calculation on the graph itself, but not necessarily better than other kinds of pooling methods. So the results are a little bit of a mixed bag: we are better than some pooling methods on the IMDB-BINARY dataset, but a little worse on the IMDB-MULTI dataset. In essence, though, we are always able to beat a predefined filtration on the graph using persistent homology, which was a good takeaway for us.

Now, towards the end of the talk, let me briefly show you what else we can do. Graph filtration learning is nice, but it has one small disadvantage: it still does not make full use of a graph neural network, because we only use the graph neural network at the beginning, to come up with a filtration function, and then we use a regular neural network for the rest, so we are not really able to exploit the power of a graph neural network. Personally, I am a big fan and believer in hybrid methods, that is, methods that can incorporate both geometrical and topological information, and this is what I am going to show you now. This is a recent preprint of ours called "Topological Graph Neural Networks"; it is joint work with Max, Edward, Michael, Yves, and Karsten. Max, Michael, and Karsten are from ETH here, and Edward and Yves are from KU Leuven in Belgium, so it is a very nice international collaboration. This slide just illustrates the complexity of the problem, but let me briefly outline what we are doing. We are pitching a method that can be integrated into a graph neural network: a layer that uses different filtrations of the input dataset to obtain a representation that can then be used in downstream layers of an arbitrary GNN architecture. In essence, it takes the graph filtration learning approach and makes it applicable inside a hybrid GNN architecture.

We achieve this by taking the node attributes of a graph on the left-hand side, calculating hidden representations using a node map, and calculating k different filtrations of these vertices. So we do not restrict ourselves to a single filtration but take a high-dimensional filtration representation by simply calculating multiple filtrations of the dataset. Some of you might now think we are doing a multifiltration approach, but that is unfortunately not true; I wish it were. It is only "multi" in the sense that we have multiple filtrations stacked on top of each other, but those filtrations are not yet interacting with each other. This is something we want to tackle in the future, and if any of you has ideas how to do that, I would be more than happy to write papers together with you; for now we do not have an interaction term between those filtrations. In any case, with those k filtrations we get k persistence diagrams, and we can now use a strategy similar to the graph filtration learning paper and use the psi coordinatization function to obtain k projections of those diagrams, which we later aggregate together with the original node attributes into a new output node attribute of the input graph.
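To make the shape bookkeeping concrete, here is a schematic toy version of such a layer: my own sketch, restricted to zero-dimensional persistence computed with a small union-find, not the implementation from the preprint. The number of filtrations k, the sigmoid, and the linear maps are all illustrative choices.

```python
import torch
import torch.nn as nn

def zero_dim_pairs(f, edges):
    """Toy 0-dimensional persistence pairing of a vertex filtration on a graph
    (union-find, elder rule): for every vertex, the index of the vertex whose
    value destroys the component it created, or -1 for essential components."""
    parent = list(range(f.numel()))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    destroyer = [-1] * f.numel()
    order = sorted(edges, key=lambda e: max(f[e[0]].item(), f[e[1]].item()))
    for u, v in order:
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        young, old = (ru, rv) if f[ru] > f[rv] else (rv, ru)
        destroyer[young] = u if f[u] > f[v] else v    # edge value kills it
        parent[young] = old
    return destroyer

class TopologicalGraphLayer(nn.Module):
    """Schematic toy version of the layer described above, dimension 0 only:
    k learnable vertex filtrations, a per-vertex persistence feature from each,
    and an output with the same dimensionality as the input node attributes."""
    def __init__(self, dim, k=4):
        super().__init__()
        self.filtrations = nn.Linear(dim, k)
        self.combine = nn.Linear(dim + k, dim)

    def forward(self, x, edges):
        f = torch.sigmoid(self.filtrations(x))          # (num_nodes, k)
        feats = []
        for i in range(f.shape[1]):
            fi = f[:, i]
            destroyer = zero_dim_pairs(fi, edges)       # constant index lookup
            pers = torch.stack([fi[d] - fi[v] if d >= 0 else fi.new_zeros(())
                                for v, d in enumerate(destroyer)])
            feats.append(pers)                          # per-vertex persistence
        topo = torch.stack(feats, dim=-1)               # (num_nodes, k)
        # Aggregate with the original node attributes; the output keeps the
        # input dimensionality, so the layer can be stacked inside any GNN.
        return self.combine(torch.cat([x, topo], dim=-1))

# Usage sketch: out = TopologicalGraphLayer(dim=8)(torch.randn(6, 8),
#                                                  [(0, 1), (1, 2), (3, 4)])
```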
In essence this closes the loop: if you discount everything that happens between these two arrows, we have a kind of black box that goes from an input node attribute of the graph to an output node attribute of the same dimensionality. That is the important thing, because it makes the layer fit very neatly into a graph neural network: a graph neural network assumes that the dimensionality of those features does not change for deeper layers, so we can stack multiple of those layers, and we can make the network topology-aware with a single integration of that layer. I see a suggestion in the chat, namely that one could use linear combinations of the filtrations as new channels. That is a very exciting suggestion, I am going to take you up on it, and maybe we should discuss it at some point; that would be a very cool way to move forward.

I am not going to go into more details here, because this is still pretty fresh, but I want to show you what it can actually do and why we are excited about this line of research, namely expressivity. For a graph neural network, expressivity refers to the fact that there are certain graphs that are not isomorphic but that the network cannot distinguish. Of course the isomorphism problem is somewhat contrived, I would say, because in reality you are not interested in knowing whether graphs are isomorphic, you are interested in classifying them; nevertheless, you want to make sure that your graph neural network is capable of seeing differences between different types of graphs. We have two nice synthetic datasets that show that the integration of this topology layer, which we call TOGL, for topological graph layer, is beneficial for performance.

The first dataset is the cycles dataset. It has two classes: one is a single cycle on some number of vertices, and the other is a set of triangles, or small polygons, with the same total number of vertices. So one class has exactly one cycle and one connected component, while the other class has several cycles and several connected components. With TDA we are almost trivially capable of distinguishing these classes, because we just have to count the number of cycles, or even more simply the number of connected components, but this is not so easy for established graph neural network methods. This part demonstrates that with our method, TOGL, we end up with a test accuracy of 100 percent; when I say "our method TOGL" here, what I mean is that we take a normal GCN, a graph convolutional network, and inject one topological layer, that is, we replace one convolutional layer by one topological layer so that the total number of layers stays the same. You can see that the standard GCN takes a few layers to capture those graphs; this is in line with the fact that GCNs are quite expressive and can count cycle lengths, but they need sufficiently many layers to do this counting, so if your cycles are too long and you do not have enough layers, you cannot find them. A standard method from graph kernels, by contrast, the Weisfeiler-Lehman method, which is often also used for graph isomorphism testing, is completely incapable of distinguishing those graphs. This also
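As a tiny illustration of why this dataset is trivial for topological features (my own example, assuming networkx), the two classes already differ in their Betti numbers:

```python
import networkx as nx

def one_big_cycle(n):
    return nx.cycle_graph(n)                  # class 1: a single n-cycle

def many_triangles(n):
    assert n % 3 == 0                         # class 2: disjoint triangles
    return nx.disjoint_union_all([nx.cycle_graph(3) for _ in range(n // 3)])

g1, g2 = one_big_cycle(12), many_triangles(12)
betti0 = lambda g: nx.number_connected_components(g)
betti1 = lambda g: g.number_of_edges() - g.number_of_nodes() + betti0(g)
print(betti0(g1), betti1(g1))   # 1 connected component, 1 cycle
print(betti0(g2), betti1(g2))   # 4 connected components, 4 cycles
```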
demonstrates, since the WL test, as it is often called, is commonly used to describe the expressivity of graph neural networks, that we are strictly more expressive than they are: we have a dataset on which we outperform them because we can see features that they cannot see, and we also have a theoretical proof of this fact in the preprint, if you are interested in checking it out.

A second dataset, and this brings me almost to the end, is the necklaces dataset. It is a little more complicated in terms of structure: one class consists of two separate cycles, and in the other class the two cycles are merged. This is more complex in terms of topology and in terms of the features that can be used, so here we do not see the effect that we are the only ones capable of classifying it. But you can see that there is a saturation happening for the WL algorithm: it needs a sufficiently large number of iterations to look at the graph and understand it, and it never quite reaches the level of performance of a GCN or of our method. The GCN itself requires a certain number of layers to approach this performance, but it also never quite reaches it, because it cannot actually count cycles correctly; with the topological graph layer, however, we do reach this performance. That is really neat, but these are synthetic datasets, where it works well and we are very expressive; let us look at some empirical results.

We are not going to discuss everything in detail, but let us just say that we have a GCN-4, meaning four layers, which is one of the best architectures out there for the GCN (not for all graph neural networks, I should stress). The idea was to take comparison partners that have roughly the same number of parameters as our method, making a fair comparison possible; otherwise you could say "well, your TOGL method has an order of magnitude more parameters, so of course it is more expressive". We want to avoid that kind of criticism, so we made sure that all architectures have roughly the same number of parameters. If we now compare the GCN-4 and the Topo-GCN-3-1, meaning one topology-based layer and three convolutional layers, we are more or less on a par on PROTEINS, we definitely do not compare as well on the ENZYMES dataset, who knows why, we are a little better on average on the DD dataset, a little better on average on the IMDB-BINARY dataset, and a little worse on the REDDIT-BINARY dataset. Again, I do not want to go into too many details; I want to summarize these results with a very nice image I found on the internet: not great, not terrible. Clearly there is some form of signal in there, but it is not very consistent.

To make sure we are getting a consistent signal, we did something rather daring: we removed the node features and node labels from the graphs we were looking at, from the molecular graphs such as DD and ENZYMES. When we looked at their visualizations, we saw that they have some interesting topology, in contrast to the social network datasets, which contain more motifs such as stars and the
like. Those molecular datasets, by contrast, actually do have cycles and some nice connected components and structures of that kind, so we were thinking: maybe all this performance is driven by the node features and node labels themselves. Let us remove them to ensure a topology-based comparison, so that we do not carry any information about the nodes themselves, which might be a confounder in this case. And lo and behold, this turns out to be the right approach: you can see that here our method, shown in blue, is suddenly a lot better, whereas the standard GCN is not capable of performing as well. This even looks statistically significant, although we have not actually evaluated that, because those datasets are very tiny, so obtaining statistically significant results is very complicated. For ENZYMES the signal is a little less clear, but we still see that with a sufficient number of layers we are, on average, definitely better than the GCN, and the same happens for PROTEINS. In a sense, our conclusion was that the assumption that topology can be helpful is definitely true if you have a classification task that requires structural information, whereas for some of these datasets the node information already contains confounding information, so that a pure look at the topology itself will not be helpful for the classification.

With this I want to conclude and leave you with a few takeaway messages. First, it is now pretty much clear that persistent homology can be made differentiable, so integration into arbitrary neural network architectures is possible. This is great news, because it means that we can build very cool architectures that help us out in different tasks. Moreover, topological features can improve representation learning; that has been demonstrated not only by this research, obviously, but also by a lot of other groups in other settings. One caveat, though: often the main performance driver is not fully clear. If you just throw topological features at a new problem, you really need careful ablation studies to disentangle the actual performance and make it clear that the gains you are seeing come from exactly the component you are changing. If you change too many components at once, for example the filtration, the way persistence is calculated, and a lot of other things, it is very hard to decide what is actually driving the performance. That is not bad from the model perspective: you can still say "this is my model and it outperforms everything else", so you do not need to point to individual components. But for this preprint we wanted to point towards these components and make sure that the integration of topological features is reasonable and useful. And I am really happy to see something happen here: in previous talks I always said that the future belongs to hybrid models, and now this is one of the first hybrid models arising from our lab. There are other hybrid models, obviously, but this is one of the first from our lab, and I am really happy that it shows particular promise
for graph classification, provided that structural features are present and important in those graphs. With this, I want to thank you very much for your attention, and I also want to extend a warm thanks to my main co-authors: Christian, Christoph, Edward, Karsten, Max, Michael, and Roland. We have done a lot of good work in the past, and I am really happy that we had the opportunity to collaborate on this. Thank you very much.