Welcome to the last lecture of this tutorial. This is where we finally get to the recent advances in topological machine learning: we will look at different machine learning methods that are driven by topology-based features. As always, if you have feedback or questions, write me an email or send me a DM on Twitter; you can find the slides and additional information on the website.

That being said, a quick recap: we saw that persistence diagrams are, in some sense, the basic or natural topological feature descriptor. They have some disadvantages, but also some neat properties. However, there are multiple alternatives depending on the application you want to solve, and all of them have different key properties. In essence, everything boiled down to saying: it is your data, so it should be your choice of descriptor. That was perhaps a little overwhelming, so now let's take a look at what people have actually done in this field. This is, I would say, the keystone lecture, in which we put everything together: how can we actually build topology-based machine learning methods, and how do those models perform in practice?

First of all, the simple feature-based analysis pipeline. This is pretty great because it is suitable for point clouds, for graphs, for whatever; it works whenever you can pick an appropriate filtration. That might be a filtration based on the vertex degree, or a filtration based on distances. You then calculate your persistence diagrams, you vectorize them, for example using the persistence images shown on the right-hand side, and you feed the result as features into an arbitrary feature-based algorithm, such as an SVM or a random forest, to classify your objects.
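To make this pipeline concrete, here is a minimal sketch, assuming point-cloud inputs, the giotto-tda transformers VietorisRipsPersistence and PersistenceImage (the library comes up again at the end of this lecture), and a scikit-learn classifier; the data X and the labels y are purely synthetic placeholders.

```python
import numpy as np
from gtda.homology import VietorisRipsPersistence
from gtda.diagrams import PersistenceImage
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy data: 50 point clouds with 100 points in 3D each, plus binary labels.
# In practice, these would be your own point clouds (for cubical data, the
# corresponding cubical persistence transformer would be used instead).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100, 3))
y = rng.integers(0, 2, size=50)

# Step 1: distance-based filtration -> persistence diagrams (dimensions 0 and 1).
diagrams = VietorisRipsPersistence(homology_dimensions=(0, 1)).fit_transform(X)

# Step 2: vectorize the diagrams as persistence images.
images = PersistenceImage(n_bins=20).fit_transform(diagrams)
features = images.reshape(len(X), -1)  # one flat feature vector per object

# Step 3: feed the features into any off-the-shelf classifier.
print(cross_val_score(SVC(), features, y, cv=5).mean())
```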
A brief example of this, which is also interesting because it uses a cubical setting, so it goes beyond the simplicial setting, takes an fMRI volume as input. Here the filtration is induced by the activation, the BOLD signal of the fMRI data. We use persistence images to obtain a time-varying embedding, because every subject in this fMRI study had a time series of fMRI measurements attached to it, and those time series all had the same length, so our work was cut out for us in that sense. We were then able to describe the topological dynamics using a dimensionality reduction algorithm: essentially, this boiled down to calculating the persistence images and computing a dimensionality-reduced representation of each image. That made it possible to learn about differences between subgroups in the population we were looking at. I am not going into too many details here, but it was pretty cool to see what happened across the age-stratified subgroups: all subjects watched the same movie while being recorded in an fMRI machine, and as the age of the subgroups increased, their topology-based representations became more complex. If you disregard a little bit of noise, which we can also remove in other representations, you can see that for the younger children, one cohort consisted of children aged 3.5 to 5.5 years, the topology-based representation of the activation is simpler, and it grows in complexity with age.

Of course, this alone should not convince you that it is a useful descriptor, but we correlated it with other measures. We were able to predict the age group of the children directly from the topology-based summaries, for example. We were also able to show that if we restrict the topological analysis to certain parts of the brain, we can disentangle different parts of the visual system and different levels of complexity in visual processing. We could show that younger children are, for instance, unable to use information about memory or about more complex tasks while watching such a movie; they are primarily visually focused, I would say. That is one of the findings of this paper: the older you get, the more you are able to make sense of complex relationships in the same movie, whereas the younger you are, the more the processing is driven by the purely visual input that goes into your brain. So take this as a simple example, or side note, of doing this for time-varying cubical complexes, which is, to my understanding, also one of the first applications of this sort.

Another thing that works really well is the classification of unlabeled graphs, again using classical machine learning models. Here you would take the degree filtration: you look at the edges and vertices of a graph, sort them accordingly, and repeat the analysis pipeline I described previously. You can then learn weights for the topological descriptors to improve the predictive power. This is the paper by Zhao and Wang that I mentioned in the last lecture. It makes it really easy to classify unlabeled graphs: since you don't have labels for the nodes or the edges anyway, the degree turns out to be a relatively good descriptor of the information contained in the graph.

Now let's move on to something more complicated. This is a small digression, but I hope it is well worth it. We are looking at the WL iteration, the Weisfeiler-Lehman iteration, and its subtree feature vector. This was developed in the 1960s by Weisfeiler and Lehman, and the idea was to create a test for graph isomorphism, that is, a test to figure out whether two graphs are isomorphic to each other. To spoil it: the test does not actually work that well, by which I mean that it does not solve the graph isomorphism problem. If it did, that would be awesome, because the iteration is really simple to calculate; you can essentially run it in polynomial time. However, it turns out that the features produced by the iteration scheme are still useful for describing the dissimilarity between two graphs. So how does it work? This might sound eerily familiar, because it is essentially what all the graph neural networks are doing, and there is a recent paper, well, not so recent anymore in machine learning terms, by Jegelka and colleagues on the graph isomorphism network, which makes this link between the WL iteration and graph neural networks very explicit. Anyway, the process is very simple.
You go to a node and look at your own label; here, the labels are depicted as colors, because otherwise we would have to talk about node labels and labels of labels, which would make it more complicated. So you look at your own label, and you look at the labels of your neighbors. For example, node A has one neighbor, namely C; node A is blue, and its neighbor is also blue. So you record your own label as blue and the adjacent labels as blue, and the same works for the other nodes. There is no sorting whatsoever; it just looks a little neater when I draw it like this. It is simply a set, or a multiset if you want. Now you hash this information: you take your own label and the adjacent labels, turn them into a multiset, and hash it, that is, you give it a new color. This hashing needs to be perfect, so identical neighborhoods must be mapped to identical labels. Essentially, anything that has a blue label and also one blue label in its neighborhood is now mapped to this green color here. You can see that this works: A, B and G are all mapped to the same color, whereas D and F, which have a different label for their own node, are mapped to a different color. I hope that none of you is too colorblind, because it is really hard to illustrate this with a color palette that can be distinguished by all people equally, but let's hope it works. So this is how you obtain your hashed labels. Once you have this hashed-label representation, you can create a histogram feature vector, which just counts how often each color appears in the graph. In this case, the green label occurs three times, the orange one once, the violet one twice, and the pink one once, which gives you a feature vector (3, 1, 2, 1) for your graph. You can now compare two graphs G and G' by evaluating some form of kernel, or distance, or whatever you prefer, between those two feature-vector representations. Moreover, you can of course repeat this iteration step using the hashed labels: as long as you have a perfect hashing scheme that is capable of telling the different colors apart, you can repeat the process, and by repeating it multiple times you incorporate more information about the neighborhoods of the graph. This is the Weisfeiler-Lehman iteration, or subtree feature vector, in a nutshell: you take a graph, you repeat this process, you get a feature vector whose depth depends on how many iterations you run, and you can compare graphs using this representation.
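As a side note, here is a minimal sketch of this relabeling scheme in plain Python; the function name and the dictionary-based graph representation are my own illustrative choices, and the compress dictionary plays the role of the perfect hashing scheme, so identical neighborhoods always receive identical new labels.

```python
from collections import Counter

def wl_feature_vector(adjacency, labels, iterations=2):
    """Weisfeiler-Lehman subtree features: a histogram of iteratively rehashed labels.

    adjacency: dict mapping each node to an iterable of its neighbors
    labels:    dict mapping each node to its initial label (a color, say)
    """
    compress = {}                          # perfect hashing: multiset -> fresh label
    histogram = Counter(labels.values())   # iteration 0: the original labels
    for _ in range(iterations):
        new_labels = {}
        for node, neighbors in adjacency.items():
            # Own label together with the multiset of neighbor labels ...
            key = (labels[node], tuple(sorted(labels[n] for n in neighbors)))
            # ... mapped to a new label; identical neighborhoods get identical labels.
            new_labels[node] = compress.setdefault(key, len(compress))
        labels = new_labels
        histogram.update(labels.values())
    return histogram

# Two graphs G and G' can then be compared via any kernel or distance on their
# histograms, e.g. by counting matching (label, count) entries.
```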
Now let's move to a recent paper of mine, together with Christian and Karsten, called "A Persistent Weisfeiler-Lehman Procedure for Graph Classification". The idea behind it is really simple: the Weisfeiler-Lehman algorithm can vectorize labeled graphs, and persistent homology captures their relevant topological features, so we can combine the two to obtain a generalized formulation. This requires a distance between the multisets that we generate in this representation. So how can we define a distance between label multisets?

Well, suppose that we have two multisets A and B, defined over the same label alphabet Σ = {l1, l2, ...}. Then we can transform these multisets into count vectors: how often does label l1 occur in A, how often does it occur in B, and so on, giving two count vectors (a1, a2, ...) and (b1, b2, ...). This lets us calculate a multiset distance as the Minkowski distance between those count vectors. So instead of a very complex label distance, we just use a count distance. And since the nodes and the multisets are in one-to-one correspondence, we always know which multiset belongs to which node, so this gives us a metric on the graph. Of course, you actually have to show this, and it is a little more involved, but we are not interested in that here: it does become a metric.

How does this look in practice? Going back to the original example, the distance between the nodes C and E is the distance between their respective label multisets. C has the neighbors E, B, A and D, so we observe three blue nodes and one red node; E has the neighbors G, C and F, so we observe two blue nodes and one red node. The distance according to our scheme, with Minkowski exponent 1, is the distance between (3, 1) and (2, 1), which is just 1. Likewise, we can do the same for C and A, and we see that C and E are much closer in this distance than C and A. That makes sense, because the neighborhoods of C and E are more similar to each other than the neighborhoods of C and A. So far this only gives us a multiset distance between the individual label multisets; we want to extend it to a distance between vertices of the graph, and for this we incorporate the label from the previous iteration of the relabeling scheme. We take the previous Weisfeiler-Lehman label, l indexed by h − 1, as well as the label from the current iteration h, and we evaluate the following distance: an Iverson bracket that is zero or one depending on whether the previous labels differ, plus the distance between the two label multisets as shown on the previous slide, plus τ, a small constant that is required to turn this into a proper metric; for more details, see the paper. The intuition is that this turns any labeled graph into a weighted graph, and we can calculate the persistent homology of weighted graphs, because the weights yield a natural filtration. Suddenly we have built a bridge between the labeled world and the unlabeled, or weighted, world.
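A hedged sketch of these two distances, following my reading of the description above rather than the exact definitions in the paper; the function names, the default tau and the Minkowski exponent p are illustrative choices.

```python
from collections import Counter

def multiset_distance(A, B, p=1):
    """Minkowski distance between the count vectors of two label multisets."""
    ca, cb = Counter(A), Counter(B)
    labels = set(ca) | set(cb)
    return sum(abs(ca[lab] - cb[lab]) ** p for lab in labels) ** (1.0 / p)

# Worked example from above: C sees {blue, blue, blue, red}, E sees {blue, blue, red}.
print(multiset_distance(["blue"] * 3 + ["red"], ["blue"] * 2 + ["red"]))  # -> 1.0

def vertex_distance(prev_label_v, prev_label_w, multiset_v, multiset_w, tau=1e-2, p=1):
    """Iverson bracket on the previous labels, plus the multiset distance of the
    current neighborhood multisets, plus a small constant tau."""
    return int(prev_label_v != prev_label_w) + multiset_distance(multiset_v, multiset_w, p) + tau
```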
So what properties does this vertex distance have? The number of iterations h of the relabeling scheme determines how sensitive the metric is with respect to differences between neighborhoods. What you see here is an adjacency matrix of the graph, with weights given by the multiset distance I just described. If you do not perform any relabeling at all, so h = 0, you never look into the graph itself and you get a very coarse metric: it is either zero or one, depending on whether the neighboring node has the same label or not. But if you increase h a little, you get much more information out of the metric, which gives you a much more nuanced view of your graph for classification.

In fact, this makes it possible to create what we call persistence-based Weisfeiler-Lehman feature vectors, and in contrast to the original Weisfeiler-Lehman relabeling scheme, we also get cycle information. I am not going into too many details here, but we calculate the persistence of every feature and aggregate it over the label that we observe for that feature, and we do the same for the cycles. And, as one of the attendees remarked, this is remarkably easy to do here: we are only looking at connected components and cycles, so the algorithm can be implemented really efficiently; it is tantamount to going over the graph once. It is linear in the number of edges and the number of nodes: we only have to look at every part of the graph at most once, and this gives us all the information needed to compute those feature vectors, with everything weighted by the persistence of the corresponding features. This generalizes the Weisfeiler-Lehman scheme in the following sense: we can redefine the vertex distance to recover the original Weisfeiler-Lehman subtree features, simply by using a distance that is one if the vertex labels disagree and zero otherwise. So this whole spiel was a way of showing how to generalize an existing algorithm and how it is actually a specific instance of a persistence-based algorithm.

And as you can see here, it really works: compared to other, more complex approaches, it performs very favorably, in particular since we did not even attempt complex hyperparameter tuning for our own algorithm; the hyperparameter grids of the competitor methods were much larger than ours, so I am reasonably sure we could gain a few more percentage points. I particularly want to highlight the following number, because I find it interesting: PWL-C, persistent Weisfeiler-Lehman with cycles, gains more than four percentage points in this classification scenario by integrating the cycles. So adding more topological features makes the classification more effective and more expressive in practice. And you are free to try this out: there is a GitHub implementation, and I would really urge you to take a look if you are interested in topology-based graph classification, because it is a very neat demonstration of what can be done here. All of this is based on a single weight-based filtration, so adding more filtrations, or more complex filtrations, could change the results even further.
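To illustrate the "going over the graph once" remark, here is a minimal sketch of how the connected-component part can be computed with a single union-find sweep over the weighted edges. It is a simplification under stated assumptions (vertices enter the filtration at weight 0, only dimension 0 is tracked), not the actual P-WL implementation.

```python
def zero_dim_persistence(n_vertices, weighted_edges):
    """0-dimensional persistence of a weighted graph via one union-find sweep.

    weighted_edges: list of (weight, u, v) tuples; vertices enter at weight 0.
    Returns (creation, destruction) pairs; one component never dies.
    """
    parent = list(range(n_vertices))

    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = []
    for weight, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:                   # the edge merges two components: one dies here
            parent[ru] = rv
            pairs.append((0.0, weight))
        # else: the edge closes a cycle, which would create a feature in dimension 1
    pairs.append((0.0, float("inf")))  # the essential connected component
    return pairs
```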
Right, so now, moving on: what else can we do? The thing you have been waiting for, because so far this was still a somewhat shallow approach; we used an SVM to obtain all of those results. Now let's take a look at, well, the godfather of topology-based machine learning, if you will: the paper "Deep Learning with Topological Signatures" by Hofer and colleagues. I would say this is arguably the first successful combination of deep learning and topology; it is a NeurIPS paper from 2017. The rough outline, and we will look at the details in a second, is this: you take a graph filtration, again based on the degree, to obtain persistence diagrams. You define a layer that projects persistence diagrams to a 1D function, you learn the parameters for multiple such projections, you stack the projected diagrams, and you use them as features. This gives you a way to learn a projection of your topological information in an end-to-end fashion. The one thing you cannot do is backpropagate through the persistence diagram calculation itself; how to go back and change the persistence representation based on a classification or embedding objective is the last project I will show in this lecture. In this paper, you learn how to project persistence diagrams efficiently and effectively so that a classification objective can be reached.

The details are really simple, even though they might not look that way at first. The main ingredient is a differentiable coordinatization scheme of the form ψ mapping a diagram to the real numbers. Writing (c, d) for a tuple in the diagram, in what I call creation-persistence coordinates, these are the coordinates you obtain by rotating the diagram by 45 degrees, you calculate an exponential expression. I am not going into all the details, but there is a parameter pair μ0, μ1, which acts as a mean, a pair σ0, σ1, which acts as a smoothing parameter, and a parameter ν, which is a kind of thresholding parameter. Supposing your point lies in the right range with respect to ν, you calculate an exponentially weighted expression in which you subtract μ0 from the first coordinate and μ1 from the second. The right way to think about this is as a trainable projection: it projects your persistence diagram onto the real line, and it learns the right μ, the right σ and the right ν to do so. Since this is evaluated for one point p at a time, ψ(p), you can represent the whole diagram as a sum over these projections. Using several of these coordinatization schemes, you obtain a differentiable embedding of a persistence diagram into some finite-dimensional real space, and that is really all there is to it. You stack those on top of each other, and thus you make these embeddings trainable.
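A deliberately simplified sketch of such a trainable projection in PyTorch: the class name StructureElement and the plain Gaussian-style bump are my own simplifications, and the paper's actual ψ additionally treats points below the threshold ν separately via a logarithmic rescaling.

```python
import torch
import torch.nn as nn

class StructureElement(nn.Module):
    """One trainable projection psi of a persistence diagram to a real number.

    Each diagram point, given in (creation, persistence) coordinates, is weighted
    by a learnable exponential bump with center mu and sharpness sigma, and the
    weights are summed over the whole diagram.
    """
    def __init__(self):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(2))    # learnable center (mu_0, mu_1)
        self.sigma = nn.Parameter(torch.ones(2))  # learnable scale (sigma_0, sigma_1)

    def forward(self, diagram):                   # diagram: tensor of shape (n_points, 2)
        return torch.exp(-((diagram - self.mu) * self.sigma).pow(2).sum(dim=1)).sum()

# Stacking several elements yields a differentiable embedding of the diagram,
# which can then be fed into any downstream network.
elements = nn.ModuleList(StructureElement() for _ in range(16))
embed = lambda diagram: torch.stack([psi(diagram) for psi in elements])
```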
Pardon me for the maybe weird analogy, but this is similar to a convolution-based approach, where you learn different feature filters or feature maps; it is essentially the same thing, just for persistence diagrams. To show you the rough classification pipeline: you take a graph, you filter it, you apply this magic persistent-homology function, you obtain a diagram, you evaluate the ψ function with n different parameter sets, and you use the resulting features in a deep learning architecture, which makes it possible to do anything you want with them, including, of course, graph classification.

To summarize the paper: they show excellent performance for social network graph classification. In particular, when you compare these numbers, you find what I mentioned earlier, I think in lecture two: if you include the essential features, the features in the persistence diagram that have no finite destruction time, then suddenly your classification accuracy goes up by quite a lot. They can considerably outperform other approaches on these data sets by including cycle information, because in this case, and you have to refer to the paper for the details, the essential features correspond to cycles in the data set. Another advantage of this approach is that it is really simple to implement and use, and the feature maps you get are even interpretable, because they tell you something about the means and the smoothing, so you can look at them and reason about them. Moreover, it is highly generic and definitely not restricted to graph classification problems; I summarized the paper very succinctly, but they also have experiments on shape classification. For this lecture, though, I am more interested in graph classification. Again, you can try it out; you can find more information via this QR code. Christoph, one of the authors of this paper, now has an excellent repository that shows how to combine persistent homology and PyTorch, so if you ever wanted to start a project with this, that is your go-to GitHub repository, I would say. And I am not being paid by them, just so you know.
All right, moving on to another project that I am very happy to discuss: the topological autoencoders paper, which was just accepted at ICML 2020, with my colleagues Michael, Max and, again, Karsten. Here we are trying to solve a different problem: previously everything was about classification, but now we are looking at autoencoders, at a way of constraining topological information. I was really looking forward to this part; I had tested this slide multiple times and I am happy to see that the animation works. The basic idea is: what happens if we tell an autoencoder something about the underlying topological characteristics of the space? What if we can constrain an autoencoder with the topology of the latent space? The example here is based on a set of high-dimensional spheres that are nested inside a bigger, enclosing sphere. As the animation shows, if you use a regular autoencoder, all these spheres are pushed apart. You can still use this latent representation to visualize to some extent what is going on, you can see that some spheres get compressed or stretched, but you do not see the nesting relationship. With the topological autoencoder, by contrast, you clearly see that something is going on: the big sphere lives at a different scale and subsumes the smaller spheres in the latent space. So that is the rough outline of what we wanted to solve, and why we think it is important.

To give you an overview, what we are proposing is essentially this: we take some input data and some autoencoder, and along the upper arrow we get the usual reconstruction loss; we train the autoencoder architecture and try to get a good reconstruction, which is the standard way of doing things. But we add topological information by looking at the input data on a batch level and at the latent representation, also on a batch level, and we derive a topological loss from that. So we do not only ask how well we can reconstruct the data, but also how well the latent representation preserves topological information. That is the goal of this project.

So how do we do this? The main intuition, and this is where we come full circle to the Wasserstein and bottleneck distances, is that we want to align the persistence diagrams of an input batch and of a latent batch using a loss function. This works in theory because we have a nice theorem which tells us that if we subsample a point cloud repeatedly, which is essentially what mini-batching does, then the probability that the bottleneck distance between the persistence diagrams of the subsample and of the full point cloud exceeds a threshold can be bounded in terms of the Hausdorff distance between the two point clouds. In other words, the mini-batches of a point cloud are topologically similar to it as long as the subsampling is not too coarse; if we took only two points, it would of course be extremely coarse, but in general we can control this via the Hausdorff distance between the subsamples. That is the theoretical underpinning of the alignment process.
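At the level of the training loop, the combined objective looks roughly like the sketch below; the names autoencoder, encode, decode and the weighting factor lam are placeholders of mine, and topological_loss stands for the batch-level term whose internals are sketched a bit further down.

```python
import torch

def training_step(autoencoder, batch, lam=1.0):
    """One step of the combined objective: reconstruction plus topological loss."""
    latent = autoencoder.encode(batch)
    reconstruction = autoencoder.decode(latent)
    rec_loss = torch.nn.functional.mse_loss(reconstruction, batch)
    topo_loss = topological_loss(batch.flatten(1), latent)  # computed per mini-batch
    return rec_loss + lam * topo_loss
```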
The other thing we had to solve, and this is where it gets really interesting, is that previous approaches were unable to backpropagate into the persistent homology calculation, because, if you recall from the very first lectures, everything we do there has a discrete ring to it: we perform a matrix reduction, we add columns in some order, and so on. So the other challenging part of this project was to figure out how to get a gradient calculation going. For this, we looked at the distance matrices of the spaces; notice that we are not looking at the objects themselves, just at distance matrices. We observed that every point in the persistence diagram can be mapped to one entry of the distance matrix, and since each entry is a distance, it can be changed during training, at least in the latent space, because we have full control over the latent space while training an autoencoder: we can change the point positions, those change the distances, and a distance is a continuous and differentiable function of the two points, so it can be trained with a gradient.

So how does it look? If we have a distance matrix that gives rise to a certain persistence diagram, we can look at this mapping and say: this point in the persistence diagram maps to this distance here. Notice that the distance matrix is actually over-specified; we would only need the upper or lower triangle, because it is of course a symmetric matrix. But essentially, all points in the diagram are in one-to-one correspondence with entries of the distance matrix, provided that the distances occurring in the latent space are distinct. That is our one caveat; it is the condition that makes the mapping differentiable and generic, because if it is satisfied, the gradient exists and is unique.

With that intuition squared away, our loss boils down to a sum of two terms: one that measures the loss from the input space to the latent space, and one that measures the loss from the latent space to the input space, and both of them boil down to comparing distances that have been pre-selected by the persistence calculation. Without going into too many details: we look at the distance matrix of the input mini-batch, call it A^X, and of the latent mini-batch, A^Z, and at the persistence pairing of the input mini-batch, π^X, and of the latent mini-batch, π^Z. The persistence pairing is nothing but the raw form of the persistence diagram: instead of drawing the points directly, we record which simplices are paired; this is just notational convenience, we could equally well have written a diagram here. The loss that goes from the input space to the latent space is defined by evaluating the distances we get if we use the persistence pairing of the input mini-batch for both the input distances and the latent distances; this is not a mistake, it is π^X both times. For the loss that goes from the latent space to the input space, we use the persistence pairing of the latent mini-batch both times. As you can see, there is always only a matrix difference involved, which keeps everything differentiable as long as the pairwise distances are distinct.
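Here is a minimal sketch of what such a loss can look like for the zero-dimensional part. The helper persistence_pairing, the MST-style sweep, and the restriction to dimension 0 are my own simplifications of the construction described above, not the authors' reference implementation.

```python
import torch

def persistence_pairing(distances):
    """Index pairs (rows, cols) of the distance-matrix entries selected by the
    0-dimensional persistence computation: the edges that merge two components
    when distances are swept in increasing order (a minimum-spanning-tree sweep)."""
    n = distances.shape[0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    rows, cols = torch.triu_indices(n, n, offset=1)   # all pairs i < j
    order = torch.argsort(distances[rows, cols])
    sel_r, sel_c = [], []
    for k in order.tolist():
        i, j = rows[k].item(), cols[k].item()
        ri, rj = find(i), find(j)
        if ri != rj:                                  # this distance destroys a component
            parent[ri] = rj
            sel_r.append(i)
            sel_c.append(j)
    return torch.tensor(sel_r), torch.tensor(sel_c)

def topological_loss(x, z):
    """Bidirectional loss between an input mini-batch x and its latent codes z,
    both of shape (batch, dim)."""
    ax, az = torch.cdist(x, x), torch.cdist(z, z)
    pi_x = persistence_pairing(ax.detach())           # pairing of the input batch
    pi_z = persistence_pairing(az.detach())           # pairing of the latent batch
    l_xz = 0.5 * (ax[pi_x] - az[pi_x]).pow(2).sum()   # pi_x used for *both* matrices
    l_zx = 0.5 * (az[pi_z] - ax[pi_z]).pow(2).sum()   # pi_z used for both
    return l_xz + l_zx
```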
The remaining ingredients, a norm, a square, a factor of one half, do not change differentiability at all. So, in essence, we have a bidirectional loss that is differentiable and that can change the topological structure of the latent space to resemble that of the input space. This is really neat, because it gives you a way to generate a latent space that closely approximates the topology of the input mini-batch.

So how does it look in practice, and what can we do with it? First of all, I want to note that this is a highly generic formulation, which is really awesome: we can plug this loss term into anything, even into a PCA if we want, and make an algorithm topology-aware to some extent; we can also tune how much influence the term should have, and so on. But I really want to point you towards these two plots: this is what you get with a vanilla autoencoder, and this is what you get with the topology-based autoencoder, that is, with the additional loss term. You can see that the high-dimensional sphere data set is represented really well: it shows you that there is an enclosing sphere, that there are smaller spheres, and so on. Other algorithms, in particular t-SNE or UMAP, all rip apart, to some extent, the structure that is inherent to the data; they are not capable of preserving both of these topological features at the same time. And this Isomap embedding in particular is really odd here: it does not even tell you that there is an enclosing sphere; we are not sure what is going on there. So you can clearly see that the additional topological information is extremely helpful in regularizing the model.

That is only the qualitative evaluation, though. I do not want to bore you with too many details, but the quantitative evaluation also shows favorable results: the best result is bold and underlined, the second best is just bold. I want to draw particular attention to this column, the mean squared error of the actual reconstruction, wherever that is applicable. You can see that including the topology-based loss term of course changes your reconstruction objective: topology pulls in one direction and reconstruction pulls in another; the two goals can be somewhat orthogonal to each other. But adding the loss term does not incur too much of a penalty: it goes from 0.81 for the plain autoencoder to 0.86 for the topology-based autoencoder, and likewise for the other data set. It behaves as it should: the reconstruction error goes up a little when we add more constraints, because forcing the latent space to have a certain topology, shape or geometry necessarily costs a bit of reconstruction quality. In general, though, we perform extremely favorably, in particular on density-based measures: these KL divergences are computed with a distance-to-measure density estimator and essentially assess how well a method preserves the density of the original space in the latent space.
And you can see that, on a batch level, we are always among the top performers there. With this I am coming to an end; there are still a lot of open questions, so let me briefly give you some along the way. The first question we have to ask ourselves: should we learn filtrations, or use fixed ones? I have a recent paper with colleagues from Salzburg, Austria, in which we try to learn graph filtrations; if you are interested, there will be a link on the website, and it is also an ICML paper. In it we show that learning a filtration can be beneficial in some cases, but of course there is also the question of whether it is better to use a fixed filtration that is perhaps robust to certain aspects of your data set. Another challenging question that always comes up in these analyses: can we map topological features back to features in the data? Can we say, this cycle is created by the following elements? It turns out this is not as simple as it might seem, because we are working with an algebraic formulation, and it is possible that the algebra does something different from the geometry. Nonetheless, there is hope that we can make this a little more explainable or interpretable. The last question that I would say is still somewhat open is how to scale these methods to massive data sets, because with the autoencoder we ran exactly into the problem I described early on: the bigger the mini-batch, the slower the algorithm, because we have to account for all these pairwise distances. On the flip side, we only use distance matrices, so we are not restricted to any particular object representation: if you have some objects and you can calculate distances between them, we are good to go. That is the price to pay, and maybe there are smarter ways of doing this; maybe we do not need all the features, maybe we can use sparse filtrations. There is a lot of research still to be done, I think.

Before I summarize, I want to briefly point at what is next. This is a not-too-shameless advertisement: there is a workshop at NeurIPS 2020 that I am co-organizing, called "Topological Data Analysis and Beyond"; if you are interested, take a look at the website. You can submit your work there, or you can simply attend for what will be extremely good talks from a very diverse and interesting set of speakers covering many different domains. We purposefully tried to include not only the theoretical rock stars of the field, but also the people in the trenches who are using topology for interesting problems, in biology for example. The talks will all be pre-recorded, so you can watch them at your own leisure; also, NeurIPS registration is 25 dollars for students this year, and I am not being paid to say this, but I wanted to point it out. I can also recommend the giotto-tda library, and again, it might seem that way, but I am not being paid by them either. It is a neat way of integrating TDA into your own projects, because they follow the scikit-learn approach: they have all these transformer classes with fit_transform and so on, so it is really neat and easy to use.
Last but not least, I also want to urge you to join the "TDA in ML" Slack community if you are interested. There are a lot of people in there now, more than 250; people are creating all kinds of side channels and private channels for their projects, and you can find someone willing to answer your questions or to give a tutorial. We will also be using it to coordinate a lot of things for the workshop and to field questions to the speakers, so give it a look; we would be happy to have you there.

Pardon me, a question from the audience: "Is the link for the Slack community active? I tried it about two weeks ago and it wasn't." That is a very good point. It should be active, but Slack sometimes has very weird ways of deactivating those links. I once created one that should have an infinite duration of validity, but it is possible that it no longer works. If it does not and you want to join, just tell me via email or via Twitter and I will create a new link. It is definitely open for all; we are happy to have you on board regardless of whether you are merely interested, have already done some work in the area, or want to learn it from scratch. Everyone is welcome; bring your huddled masses; we are trying to turn this into a neat community of like-minded people. The same goes for the workshop: if you have anything you want to share with the community, or you want to learn something, please come by, look at the talks, and give us some feedback. If you already feel familiar with TDA, feel free to join the program committee; we are looking for reviewers, and we are also taking recommendations for other people if you see that we missed someone on the list. I would be happy to make this a very nice and inclusive event for the NeurIPS community and for everyone else who is interested in machine learning, so do let me know if there is anything you want to do in that vein, and I will be happy to support it. The same goes for emails, of course: if you have any questions, just write me and we will try to make it happen.

All right, with this I want to end this lecture, this tour de force in a sense. Let me just say that I think topological features are incredibly versatile; I hope I could impress this fact on you a little bit. Their integration into modern machine learning architectures is certainly an ongoing research topic: there are some successes already, and some neat hybrid approaches that really start to showcase the benefits. I would say that right now, topological machine learning shines when working with structural information, as in the case of graphs, but there are also tons of other applications just waiting to be discovered. To be very optimistic, I would say the future probably belongs to hybrid approaches that are capable of including more information than just these persistence diagrams; maybe you need information about the curvature of your object, maybe you need other information about the manifolds. I think topological machine learning is a first step towards that, and it will be an incredibly versatile and rich field for the years to come.
I hope that I could impart a little bit of this on you, and that you continue thinking about it, or maybe even start to become a user, a practitioner or a researcher in this area; that would make me very happy. With that, I thank you very much for your attention, and now let's start the last question round before I have to stop; I think we are more or less right on time. So thank you very much for your attendance, and thanks for watching.

A question: "I have looked into topology, the mathematical concept itself, although I am not a mathematician, and I would like to ask: why is it called topological data analysis? It is not obvious to me that it starts from topology as it is defined abstractly, but I guess it has something to do with it." Yes, that is a very good point. I think it is the difference between point-set topology and algebraic topology; algebraic topology is defined on these sorts of structural things. "Sorry, what did you say?" It is the difference between algebraic topology and point-set topology, where you really try to describe open sets and closed sets and things like that. There is of course some overlap between those subjects, but in general I would say it is called topological machine learning because it incorporates these ideas of structural information, of connectivity information. In some sense, you find the same information in the point-set topology approaches as well: when you define which sets are open and which are closed, those sets can be used to create neighborhoods, and from this neighborhood information you can learn more about the connectivity of your space, and so on. So I think it is, let's say, a higher-level abstraction of looking at data, because TDA, topological data analysis in general, takes a very simplicial view of data sets: it assumes that you have some way of creating a simplicial complex or a cubical complex from the data, and it describes simple connectivity invariants rather than functions on these data sets. Maybe that would be a topic for the future: how to also describe functions on those spaces and what we can learn from that.

"So in your case, are you using algebraic topology or point-set topology?" For all of this I would say it is definitely algebraic: you describe a simplicial complex, you describe simplicial chains and all these sorts of things, and in the end you do algebraic calculations by reducing matrices, which yields a set of invariants that you can use to describe your space. So I would say this is more of an algebraic approach. "So it is like a subset of point-set topology, and point-set topology is the more general topology?" Well, at the risk of alienating the topologists who are watching this, I would say it is the other way around: point-set topology is a little more fundamental, and then you go a little higher in terms of the abstraction level and you arrive at algebraic topology and differential topology,
which are both ways of looking at manifolds, or at more complex spaces: differential topology is more about describing, I would say, functions on those spaces, while algebraic topology is more about calculating invariant information about those spaces. But I am not sure I would want to be quoted on this; it is more my personal way of structuring those two fields.

I am seeing some more questions. "At the moment, is topological data analysis essentially the analysis of homology groups via these persistence diagrams?" I would say mostly; there are also other approaches, but they boil down to similar concepts. If you are familiar with the notation, there are also people working on something called the mapper algorithm, which gives you a way to calculate nerves of your data and visualize them; that is another aspect. But I would say, and I am confident in saying this without making more enemies along the road, that the workhorse of topological data analysis right now is definitely persistent homology, where you calculate those persistence diagrams and all the other descriptors. There are many other approaches nowadays, for example ones that look at circular features in a data set, but the workhorse is still this kind of machinery, because it is also the one that is easiest to integrate into a standard machine learning pipeline.

"So fundamental groups are too hard?" Well, people are working on this, and you are absolutely right, this is coming. If I had to give the same lecture for a topology-based audience, say at a Young Topologists workshop or something, I would definitely say that fundamental groups are coming; there is computational homotopy theory, which goes in that direction. I just have to be very clear that I do not feel confident in saying how far along they are, that is, what you can actually do with this. I find it mathematically fascinating and intellectually very pleasing; I am just not aware of any usage in machine learning currently. "Okay, thank you." You're welcome.

"Have you considered other systems as well? I have two things in mind. First, your topological autoencoder sounds a little bit like a variational autoencoder, in the sense that you are putting constraints on the latent space in the middle of the autoencoder: you put topological constraints on it, and variational autoencoders put a probabilistic aspect on it. I don't know whether it makes sense to combine these in any way; I don't have concrete ideas myself, but I wanted to ring that bell." This is a very insightful comment; in fact, I would say we are essentially doing that, just looking at topological variation in a sense. There is the variational view and there is this topological view, and it would be interesting to see whether the two can be combined. I also have to say that I had not given this any thought until now, but this is definitely a good analogy for us:
we are to regular autoencoders what a VAE is to a regular autoencoder, the same thing in a topological sense, so we are also trying to regularize a little bit. "Second, have you considered combining this with more probabilistic approaches? Again, variational autoencoders combine autoencoders with probabilistic ideas, but this morning I also heard a lecture about probabilistic circuits, which are something like a neural network with a probabilistic quality, so that you can ask not just whether a picture shows a cat or a dog, but things like: what is the probability that if this and this happens, then that and that will happen? It is much more versatile than a regular neural network. Again, I have no idea whether topology comes into the picture there, but maybe there is some connection." I think there certainly is. In fact, there is a whole subfield that looks at the topological features of random simplicial complexes and random fields, for example: you take a random process on a structured domain and you look at what the distribution of your topological features would look like. I think the future will also belong to approaches that are capable of assessing such a probabilistic view of these features, because that is essentially what you get: if we pick multiple mini-batches, we get a distribution over topological features, and capturing this and regularizing over it would be a good perspective. So you are absolutely right. In fact, to give you another plug, I am looking forward to the talk by Peter Bubenik, the inventor, or creator, or discoverer, whatever you want to call it, of persistence landscapes; he will also be speaking at the NeurIPS workshop, and I think it will be partly about probabilistic modeling, so I am really looking forward to that talk and to learning more about this myself. I think it could be an interesting view. At the same time, I also want to stress that topology is much more fundamental, and I do not mean this in a way that denigrates probability theory, but rather in a way that curbs the enthusiasm for topology, even though my job is to raise enthusiasm. The idea is that it captures a lot of fundamental properties of a space, but the question is whether those properties are actually driving classification, or driving the problems you are working on. It is a different lens through which to view the world, whereas probability theory already has all kinds of neat applications and a neat intuition: you have this causality view, this modeling view, and so on, and these sorts of things are still missing in the topology world. I think they will come, given enough time, but building these bridges between pure mathematics and the machine learning world is not easy.

"Currently, in your topological autoencoder, you are calculating the persistence diagram only once, right, and then you feed it as a feature into the autoencoder, with the loss of course, but..." Wait a second, I have to stop you there.
We calculate it only once on the input mini-batch, because of course, once we have picked a mini-batch, we cannot change its topology; that topology is fixed. But we do allow the persistence diagram of the latent space to change. "Sure, okay. I was trying to make the connection to what you said earlier, and now I see it: you have the mini-batch, and with varying mini-batches you would, in a sense, develop a probability, because you can calculate how probable a feature is. Your persistence diagram is one random variable, a very complex one, but a random variable, and over this complex random variable you calculate probabilities, and then you have a probabilistic model; that way you could combine all of that. Okay, I understand." Yes, absolutely, but there is a lot that still has to be done. We are really happy with the way this autoencoder approach turned out, but it is also pretty clear that we are scratching the surface of something much more complicated, even when it comes to which kind of information you actually want to preserve; having an additional layer of regularization, for example, would be very interesting. All right, any other questions?

"Hello, can you hear me?" Yes. "First, thanks for the great talk, and if it is okay I would ask one more thing. At the beginning of the lecture you said this is used on graphs a lot, for example via the Betti numbers. Has there been any consideration of looking at neural networks themselves, to analyze them with Betti numbers, for instance? If a hole somewhere in the network vanishes, maybe the network is starting to overfit, something like that." Yes, in fact we have another paper on this very topic; it is called "Neural Persistence", an ICLR paper from 2019. I did not include it in the slides because it goes a little beyond the idea of classification, and I was mostly focusing on representation-based learning, but you are absolutely right: I think topology can also be useful in describing what is going on within a network, when it starts to learn, and so on. I will put it on the website afterwards, and you can also find it on my Google Scholar profile. Right now, I have to say, the state of the art in this particular research strand is that we are limited to fully connected networks; we tried to extend this view to convolutional networks, but it did not work as easily as we expected, so for now we can only do this for fully connected networks. But we showed how to use this persistence value, essentially the topological complexity of the network's weights, as an early stopping criterion that does not require additional validation data. That was a first step in that direction, but again, it is just scratching the surface. I think this is the right way to approach certain analyses in the neural network world,
by which I mean that you have to understand what the network is actually learning, what the network is, in a way, seeing, and what kind of structural information you can get out of it. But again, this is a very open topic for future research, I would say. "Okay, thanks." You're welcome. Anything else?

"I am working on a clustering method, I will give the talk tomorrow; it is called Gulf Shift, and this is not meant as advertisement, but I am currently looking into combining it with some other methods. You have probably heard of DBSCAN, and there are methods related to DBSCAN that can be interpreted as topological data analysis methods, where terminology like persistent homology also comes up, though without the persistence diagram itself, and my method is again somewhat related to it. I would be interested in talking about that, maybe later; it is actually why I came to your course, because it has nothing to do directly with the persistence diagrams you are using, but it is still strongly related. You can check out my talk for a quick introduction to what I am doing, if you are interested." Definitely, I will take a look at it. That is the cool thing about a virtual conference, right: you can look at things in your own time and pick and choose a little bit. I am really happy to hear about these other directions, because that is also something that is not currently on our radar; there are a lot of things I did not mention so far, for example what to do for time series analysis, or how topology can actually help in clustering, or whether it can help in clustering at all, which is really the question. So I am really interested in this; thanks for bringing it up. "How can I contact you, or send you the talk?" I think the easiest is via the conference platform, or via the email that I posted here, or via Twitter; all of these would work. Okay, anything else?

"Hello, I am just curious: what is the persistence diagram of a persistence diagram? A persistence diagram is a 2D diagram, so you can think of each point as a data point; what happens if you apply the persistence computation to a persistence diagram?" Oh, that is a good question. I am not sure what would happen if you did this iteratively; I think you might converge at some point to a space that has only a single topological feature, but I am not sure about this. Iterated persistence would be an interesting thing to try out, for sure. Oh no, wait, you would not actually converge; the number of features would not actually decrease. I am not sure whether anyone has ever considered this so far, but it would be an interesting direction. "Okay, thank you. But do you have any intuition, given how the transitions connect to each other, how they would cluster, let's say?" I am not sure. I would say that, first of all, the first step of this iteration can already take you from a very high-dimensional space to a very low-dimensional one, but then, if you start reapplying it, you would never get rid of the connected components,
the zero-dimensional features. So in essence, I think, it would always keep the cardinality of the persistence diagram and would merely change the scales. Maybe the iterated persistence diagram would converge towards a diagram that has all its points on the diagonal; that I could sort of understand, because it would mean that, like an iterated compression, you lose more and more information with every step. But I am not so sure about that; it is just a conjecture. "Thank you."

"Sorry, another quick question; maybe you mentioned it at the beginning. The complexity of building these persistence diagrams, is it quadratic, or exponential, in the number of data points? How bad can it actually get?" Formally, the worst-case complexity is two to the power of n, where n is the number of data points, but that is the really worst case you can get. In general, for low dimensions you have efficient algorithms, around n log n, and in practice it is observed that the computation is sublinear or linear in the number of simplices of your simplicial complex. That is the answer I would give. "But the number of simplices can be..." Starting from your data points you can of course have a quadratic number of simplices; the simplicial complex determines the complexity class of the persistent homology calculation. "But this is exponential in theory?" In theory, yes. There is a disconnect, though, because many people do not actually calculate the whole Vietoris-Rips complex; they stop at a certain dimension, and then of course you have far fewer simplices than two to the power of n. But yes, in general you could say that this is the worst-case complexity. "Thank you."

Okay, any other questions? In that case, thank you very much for attending, and I look forward to hearing back from you; it would be interesting to get some additional feedback on whether you liked this, or maybe even to see you in the TDA in ML channel. If you have any further questions, feel free to contact me over any of those communication channels. I wish you a very pleasant conference, very good research topics and very inspiring talks. So bye bye, and see you soon.