Okay, so in this third lecture we're finally going to look at topological descriptors and how to use them in practice: what properties they have, what their respective advantages are, how we can start to integrate them into a machine learning framework, into our algorithms, or even how we can create hybrid algorithms. Again, the usual disclaimer: if you're interested in learning more about this, or you have questions, feedback, whatever, give me a shout over email or on Twitter. You can find the slides and additional information on the website, topology.rocks/ecml_pkdd_2020. All right, let's do a quick recap. We found a multi-scale generalization of Betti numbers called persistent homology. We found that it is rather versatile and that it can be applied to point clouds or to structured data in general, such as graphs. The resulting descriptors are called persistence diagrams; they essentially contain information about the creation and destruction of topological features along certain thresholds. In this lecture we will take a broader look at the landscape of topological descriptors: what choices of topological descriptors other than the persistence diagram do we have, what are their properties and respective advantages, and how is all of this geared towards finally using them in practice within a machine learning architecture? So let's first take a look at the persistence diagram, which we already encountered previously. The points here are tuples in R x R, with a potential point at infinity. Persistence roughly corresponds to the distance to the diagonal: the further a feature lies from the diagonal, the higher its persistence. People usually say that high persistence means high prominence in a data set, whereas low persistence means noise, although that does not necessarily hold for all kinds of data sets, and this whole notion is a little bit under review, I would say. The diagram has certain appealing properties, namely that it is more or less two-dimensional, but the multiplicity of each point is not apparent: in a non-generic situation where two features appear and disappear at exactly the same thresholds, they are assigned exactly the same spot, so the multiplicity of each point, of each feature, is not visible. Also, and this is more of an aesthetic issue, the space under the diagonal is typically unused, so these descriptors look a little bit odd to people. Before we move on to the properties, let me briefly remark that if you are familiar with set function learning, for example, you might think that this is a perfect example of a descriptor that can be fed into a set function, and I agree. At the same time, it turns out that this is not so easy to feed into a modern machine learning algorithm, because the persistence diagram can have a different cardinality for different inputs: the number of features is of course bounded by things like the number of simplices or the dimension, but it can still vary from input to input, so you have to account for this. It does not give rise to a simple feature vector, or at least not to a fixed-size feature vector.
But anyway, it has certain properties that are really nice and that we will be looking at. One of them is stability, by which I mean that if you calculate the persistent homology of two point clouds, and you modify one of them a little bit by a small perturbation, then the resulting persistence diagrams will probably look very similar. This is illustrated here with these point clouds: all three point cloud configurations look more or less the same, and you can see that all of their diagrams are clustered around the diagonal a little bit, even though, say, the red one on the left-hand side has a different width than the blue one on the right-hand side, and so on. Essentially, they keep their overall shape. So that is the intuition, but it turns out that we can also formalize this. Before moving into a purely machine learning approach, there is also a lot of statistics that people do, or could do, with persistence diagrams and topological descriptors, and one of those things involves calculating distances between these descriptors. It turns out that persistence diagrams give rise to a really neat metric structure, even though that metric is not so easy to calculate. It is called the bottleneck distance. Given two persistence diagrams D and D', their bottleneck distance W_inf is defined as the infimum over all bijections from one diagram to the other of the supremum of the matching distance between a feature and its matched counterpart in the other diagram, i.e. W_inf(D, D') = inf_eta sup_{x in D} ||x - eta(x)||_inf. This looks a little hard to parse, but it is actually quite simple: eta denotes a bijection between the point sets of the two persistence diagrams, to be more precise, and the infinity norm is just the L-infinity distance between two points in the plane. Now you might notice that the two diagrams are not guaranteed to have the same number of features, but this is solved easily in practice by permitting the bijection to map a feature in one diagram to its projection onto the diagonal in the other diagram. Essentially, you add infinitely many points on the diagonal of both diagrams, and this makes it possible for eta to be a bijection. There is also a generalization of this distance called the Wasserstein distance, which you may have encountered in machine learning for general probability measures, for example. In contrast to the previous formulation, where we take a supremum, we now sum over the distances of all matched points, with a suitable weighting parameter (an exponent p) that I am not going into any further. So those are two distance measures that make the space of persistence diagrams a metric space. This is neat, but it also turns out that this metric space is kind of hard, or at least cumbersome, to compute in, at least for machine learning purposes. Let me briefly illustrate the difference between the two distances. We take these two diagrams, where the red points come from one diagram and the blue points come from the other, and we want to evaluate how close they are to each other. In the bottleneck distance, we only look at the supremum over the matched pairs of the bijection, whereas in the Wasserstein distance, we sum over all matched pairs; a minimal sketch of that computation follows below.
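To make this concrete, here is a small sketch of the p-Wasserstein distance between two diagrams, using SciPy's Hungarian-method solver and the usual trick of adding diagonal "ghost" points so that diagrams of different cardinalities can be matched. This is only an illustration; dedicated TDA libraries such as GUDHI or persim ship optimized implementations, and the bottleneck distance requires a different min-max matching algorithm, so it is not shown here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_distance(D1, D2, p=2):
    """p-Wasserstein distance between two persistence diagrams.

    D1, D2: arrays of shape (n, 2) and (m, 2) with (birth, death) pairs.
    Unmatched points are matched to their projection onto the diagonal,
    which is what makes diagrams of different cardinality comparable.
    """
    D1, D2 = np.asarray(D1, float), np.asarray(D2, float)
    n, m = len(D1), len(D2)

    # L_inf distance of a point to its diagonal projection: (death - birth) / 2
    diag1 = (D1[:, 1] - D1[:, 0]) / 2.0
    diag2 = (D2[:, 1] - D2[:, 0]) / 2.0

    # Cost matrix of size (n + m) x (m + n): real points plus "ghost" diagonal points.
    C = np.zeros((n + m, m + n))
    # real-to-real matches, with the L_inf ground metric
    C[:n, :m] = np.max(np.abs(D1[:, None, :] - D2[None, :, :]), axis=-1) ** p
    # a real point of D1 matched to the diagonal
    C[:n, m:] = diag1[:, None] ** p
    # a real point of D2 matched to the diagonal
    C[n:, :m] = diag2[None, :] ** p
    # the remaining ghost-to-ghost block stays at zero cost
    row, col = linear_sum_assignment(C)
    return C[row, col].sum() ** (1.0 / p)

# Toy usage: two diagrams with different numbers of points.
A = [(0.0, 1.0), (0.2, 0.3)]
B = [(0.1, 0.9)]
print(wasserstein_distance(A, B, p=2))
```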
Notice that when I say bijection, the remaining points, the gray points, are simply projected onto the diagonal. This also incurs a cost, but that cost is accounted for by the optimization algorithm, and as I said, this is what makes it possible to compare two diagrams of different cardinalities with each other. As you can probably imagine, the bottleneck distance is a bit more susceptible to noise, because it uses only a single distance, the largest one, to determine the distance between persistence diagrams, whereas the Wasserstein distance is a bit more forgiving, a bit more mellow, because it integrates information from the two persistence diagrams as a whole. And we can actually make this a little more formal; I am mentioning this so that you know what people are talking about when they say that persistent homology is stable. One thing we have to define first is the notion of a tame function: a function is tame if it has a finite number of homological critical values and its homology groups are finite-dimensional. The takeaway should be that almost all functions you will encounter in data analysis or data science are tame, because you are working with finite data, so the function should also be finite in this sense. But of course, these theorems come from a very general mathematical point of view in which you could encounter all kinds of functions. Anyway, moving on, we have the famous stability theorem, which states that for a triangulable space with continuous tame functions f and g mapping from the space to the real numbers, the corresponding persistence diagrams D_f and D_g satisfy d_B(D_f, D_g) <= ||f - g||_inf. What does that mean? It means that the bottleneck distance between the persistence diagrams is bounded from above by the supremum (L-infinity) distance between the two functions. Notice that this is really intriguing, because on the right-hand side you have a function space distance, and on the left-hand side you have a topological distance. So if your functions are close in that sense, if one arises from the other by a small perturbation, for example, or they arise from the same process, then you also know that the corresponding persistence diagrams are not arbitrarily far apart. I would say that this theorem, which is due to Cohen-Steiner and colleagues from 2007, laid the foundation for all practical uses of persistent homology, because it relates quantities that are not easy to relate. Of course, that should not detract from any other research in the area; in fact, there is a multitude of other stability results created by other people since then: there is Gromov-Hausdorff stability, there is Lipschitz stability based on the Wasserstein distance, and so on and so forth. But this is the first one, the one that laid the foundation for all of these, I would say. However, this stability only pertains to small-scale perturbations: it assumes that the Hausdorff distance between the spaces is still somewhat bounded, right? It turns out that you can have a few perturbations that are relatively large-scale, if you want, and those mess up your calculations a little bit. I am just mentioning this for the sake of completeness. So let's take two circles here and calculate their persistence diagrams; the red dots are the connected components and the blue dots are the cycles, the tunnels.
And you can see that, in both of these cases, we eventually find this circular feature, and the persistence diagrams look relatively similar. They were calculated for two different point clouds, so they should not be exactly the same, but their overall structure is very, very similar. But now look at what happens when we start adding a few points that increase the Hausdorff distance between the two spaces quite considerably. These are outliers at a large scale, so this is a large-scale perturbation by a few points. Suddenly this third persistence diagram does not look anything like the other two: you can clearly see that something is going on with the cycle here, and its persistence decreases noticeably. This is a little unfortunate, but of course it can be rectified, it can be remedied, by using different filtrations or by taking care that such large-scale perturbations do not appear, et cetera. I just want to mention this because people who are new to the field sometimes have the wrong notion that persistence is super stable and that the descriptors will not change come what may with your data, and that is just patently not true. If you have a large-scale perturbation in your data set, the persistence diagrams can change quite considerably. Now for an interlude, because to finally start playing with some cool descriptors we need some kernel theory, which is one of my favourite topics. This should be familiar to some of you if you followed the whole neural tangent kernel story, for example. Anyway: given a set X and a function k that maps pairs of elements of the set to the real numbers, we call k a kernel if there is a Hilbert space H, so an inner product space that is also a complete metric space, and a map phi from X to H such that k(x, y) is the inner product of phi(x) and phi(y) in that Hilbert space. This might sound strange if you are not familiar with the terminology, but what it implies is that a kernel is a way to calculate the similarity or dissimilarity between two objects via an embedding into a high-dimensional feature space, namely a Hilbert space. And why should you do this? The cool thing is that this feature space H can be high-dimensional, even infinite-dimensional, which can make it easier, simpler, to classify objects, to tell them apart. That is the notion I want to impart on you. And this brings us to the first kernel between persistence diagrams. Why would you do this? Recall that the distance calculations with the bottleneck distance and the Wasserstein distance are kind of tough, because you have to solve an optimization problem, and there is the question of how to assign features from one diagram to the other. It is all nice and cosy, but in most machine learning applications you are actually not that interested in an actual metric between two objects; rather, you are perfectly content with knowing that two objects are very similar or very dissimilar, right? And this is where the kernel formulation comes in, because kernels can be evaluated much more efficiently, and they immediately give rise to a number of neat classification techniques.
And here, let me present the first kernel between persistence diagrams. It is due to Reininghaus and colleagues and was presented at CVPR in 2015, so in machine learning terms this is from beyond the dinosaurs, I guess, but in mathematical terms it happened yesterday. The kernel and the feature map definitions are really, really simple. The kernel between two diagrams D and D' is a sum, weighted by a smoothing parameter sigma, of exponentially weighted distances between pairs of points from the two diagrams. I have to mention this one term here, this q-bar: q-bar is the point you get when you mirror a point to the other side of the diagonal. You can clearly see that this can be calculated easily: it boils down to a sum over all the distances, and it only involves the points that actually exist in the two diagrams, so there is no problem with cardinality. You do not have to extend the diagrams in some form or other, you do not have to add extra features, you can just use the diagrams as they are. There is also an associated feature map, which is calculated by a similar expression, weighted by 1/(4 pi sigma) and so on, with slightly different normalization constants, but it uses essentially the same information. So what is this good for? First of all, the feature map illustration gives you an idea of how this kernel actually looks at a persistence diagram: it takes such a diagram, adds heat at the points, and diffuses that heat over the diagram. So this could also be called a diffusion-based kernel on the space of persistence diagrams, and the larger you make the sigma parameter, the longer this diffusion process runs and the more heat you spread over your diagram space. I like this feature map view because it gives you a way to compare diagrams using only this representation, and we will in fact see another topological descriptor later on that does something similar. But why would you use this, what is the use of such a kernel? First of all, let me mention that there are alternative formulations nowadays, of course. Oh, yes, a question: "That just looked like a sort of smoothing over the data points, or you could say a convolution of a kernel with the data points, if the data points were Dirac functions. I am not sure which terminology you are familiar with, but am I right here, or is that nonsense?" "I can't hear you." You are absolutely right; I had the mute button on my microphone pressed. You are absolutely right, and in fact, if you read the paper, and I can recommend it, I love this paper, they use essentially this formulation. They take the diagram, they lift it into 3D space, essentially, and they say: let's assume we have Dirac functions and we just start diffusing them. So every persistence point stands for a Dirac function, and you start diffusing them; that is exactly what is happening here. This is why I like it so much, because it is a very neat way of bringing together multiple concepts. It just says: we know how to do diffusion on 2D domains, and the persistence diagram is such a domain.
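For reference, here is a minimal sketch of this multi-scale kernel, following my reading of the closed-form expression in the paper: every pair of points from the two diagrams contributes a Gaussian term, and the mirrored point q-bar contributes a negative term that enforces the boundary condition on the diagonal. Treat the exact normalization constants as something to double-check, and note that the cost is quadratic in the number of diagram points.

```python
import numpy as np

def multiscale_kernel(D1, D2, sigma=1.0):
    """Multi-scale / scale-space kernel between persistence diagrams
    (after Reininghaus et al., CVPR 2015).

    Each diagram point acts like a Dirac heat source; its mirror image below
    the diagonal acts as a sink, enforcing the boundary condition on the diagonal.
    """
    D1, D2 = np.asarray(D1, float), np.asarray(D2, float)
    total = 0.0
    for p in D1:
        for q in D2:
            q_bar = q[::-1]                      # q mirrored across the diagonal
            d_pq = np.sum((p - q) ** 2)
            d_pqbar = np.sum((p - q_bar) ** 2)
            total += np.exp(-d_pq / (8 * sigma)) - np.exp(-d_pqbar / (8 * sigma))
    return total / (8 * np.pi * sigma)

# Toy usage with two small diagrams.
A = [(0.0, 1.0), (0.2, 0.3)]
B = [(0.1, 0.9)]
print(multiscale_kernel(A, B, sigma=0.5))
```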
To return to the diffusion picture: now we just have to pick the right boundary conditions, solve a nice partial differential equation, and suddenly we have this way of calculating the dissimilarity between two persistence diagrams by means of this diffusion map. In that sense, and maybe the authors of this paper are listening, the only drawback I see with the paper is the title: they should have given it a much cooler one, like "the diffusion space kernel" or something like that. They mention diffusion, they mention scale space in the paper, but it just does not have a super catchy title, to be honest, unlike the other kernels that exist by now. There is one based on sliced Wasserstein distances, there is one based on kernel embeddings, there is one based on Riemannian geometry, namely the persistence Fisher kernel. All of these titles perhaps give you more of an idea of what is going on, whereas "stable multi-scale kernel" sounds a little bit dry. But I have to say that is the only criticism I have of the paper, because these multi-scale approaches, these direct smoothing approaches, can also be implemented really, really rapidly. In fact, I would say that this kernel is still among the best performing ones if tuned correctly, because you have this one tuning parameter, sigma, the smoothing parameter, and it is a continuous parameter, so it can really be tuned in a continuous manner if you set things up correctly. That is really neat and not something you get very often. So I think I have made my point that I like this paper. But let's briefly look at some applications, because I find it fascinating how this opens up the door for machine learning. If you want to go that far, you could say that this was the first paper that actually opened up this field of topology-driven machine learning, because suddenly these descriptors, these persistence diagrams, became comparable and usable in a modern machine learning pipeline, and that is amazing. Some applications: you can use kernel PCA, so principal component analysis based on a kernel, for visualization, for dimensionality reduction, for feature generation. And you can use, oh, I think we are getting a question in the chat. I can't hear you; you can either type the question or try to unmute yourself, I think I gave everyone the power to unmute. Okay, if it does not work, we will just move on, and you can type the question or we do it in the next Q&A session. Anyway, you can do kernel PCA for these sorts of things; there are also kernel SVMs, support vector machines, for classification, and kernel SVRs, support vector regression, for, well, you guessed it, regression tasks. So suddenly you have this very simple embedding method that opens the door for all kinds of calculations. That is one class of methods for learning with persistence diagrams: you take the persistence diagram and you calculate a kernel; a small sketch of how such a kernel plugs into a classifier follows below. But there are also other classes of methods.
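As an aside, here is a minimal sketch of how such a precomputed kernel typically enters a classification pipeline. It reuses the multiscale_kernel function from the sketch above; the diagram lists and labels are placeholders, and scikit-learn's SVC with kernel="precomputed" is just one convenient choice.

```python
import numpy as np
from sklearn.svm import SVC

def gram_matrix(diagrams, sigma=0.5):
    """Pairwise multi-scale kernel values, usable as a precomputed Gram matrix.

    Relies on the multiscale_kernel sketch defined earlier; `diagrams` is a list
    of persistence diagrams, each an array of (birth, death) pairs.
    """
    n = len(diagrams)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = multiscale_kernel(diagrams[i], diagrams[j], sigma)
    return K

# Hypothetical usage with placeholder training data:
# K_train = gram_matrix(train_diagrams)
# clf = SVC(kernel="precomputed").fit(K_train, train_labels)
# For new data, the kernel between test and training diagrams is needed:
# K_test[i, j] = multiscale_kernel(test_diagrams[i], train_diagrams[j], sigma=0.5)
# predictions = clf.predict(K_test)
```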
One descriptor that I like very much is called the Betti curve. It is a simplified representation of a persistence diagram: instead of this complex kernel mapping, whose complexity of course gives you a lot of freedom and expressivity but also comes at a price, a very simple alternative is a function that maps the persistence diagram to an integer-valued curve. This is what I like to refer to as a Betti curve, and other people call it that as well, I guess. You take the persistence diagram and you draw it in a different fashion, as what is called a persistence barcode: you turn every point into a horizontal line that sits at some arbitrary y position and runs between the two x positions. A feature created at (0, 1), for example, this dot here, becomes a line that goes from 0 to 1 at an arbitrary y position. This is called a persistence barcode because it roughly looks like a barcode, right? And by the way, those representations are interchangeable: you can map one to the other and back again, and the order of the bars does not matter. Once you have this representation, you just look at how many intervals intersect a vertical line at every threshold, and you graph this count; that gives you the Betti curve. So we go over these thresholds here: at threshold zero we have one, two intervals that are alive; at threshold one we have one, two, three, no, wait a second, this one is actually not alive anymore, so we have one interval that is alive and another one that is alive, so it is again two; and so on and so forth. We move through the thresholds on the x axis and just draw this integer-valued curve. As you can guess, this is a very simplistic way of representing a persistence diagram. Moreover, it is not an injective mapping: multiple persistence diagrams can map to the same Betti curve. So you lose a lot of representational power in a sense, but you gain a lot of other things. First of all, this is ridiculously easy to calculate: you just go over the diagram once and you are done. It is also a very simple representation, because what you get is a piecewise linear function, and piecewise linear functions give rise to all kinds of interesting things. You have vector space operations: you can add these curves, you can do scalar multiplications, and so on, so you can create mean functions, things like this. You can even go further and calculate distances and kernels again; even these curves give rise to a kernel. A very simple kernel formulation here, for example, would be the integral of the absolute difference of the two Betti curves, raised to a certain power. I wrote a preprint about this, which has been in press since 2017; in 2019 I started uploading it to arXiv because I did not want to wait so long. Unfortunately I called these objects persistence indicator functions there, for reasons that will become clear if you actually read the preprint, but there I describe some additional experiments with this representation, and it really turns out to be rather neat and very easy to calculate and use in a practical pipeline. Let me briefly show you what cool things you can do with this; a minimal computation is sketched below. You can exploit the vector space structure of these Betti curves and calculate a mean Betti curve over different samples from a data set.
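To make the "one pass over the diagram" point concrete, here is a minimal sketch of sampling a Betti curve on a fixed grid of thresholds. The convention of counting intervals with birth <= t < death is one common choice among several.

```python
import numpy as np

def betti_curve(diagram, thresholds):
    """Betti curve: for each threshold t, count intervals with birth <= t < death.

    diagram: array of (birth, death) pairs; thresholds: 1-D array of sample points.
    Returns an integer-valued vector of fixed size, regardless of diagram cardinality.
    """
    diagram = np.asarray(diagram, float)
    births, deaths = diagram[:, 0], diagram[:, 1]
    return np.array([np.sum((births <= t) & (t < deaths)) for t in thresholds])

D = [(0.0, 1.0), (0.0, 0.4), (0.5, 2.0)]
ts = np.linspace(0.0, 2.0, 9)
print(betti_curve(D, ts))   # [2 2 2 2 1 1 1 1 0]: an integer-valued step function

# A mean Betti curve over many diagrams is just an average of such vectors:
# np.mean([betti_curve(d, ts) for d in diagrams], axis=0)
```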
For example, here on the left-hand side I took, I think, 100 samples from a sphere data set and plotted the persistent homology features in dimension one via their Betti curves, and the same on the right-hand side for a torus. You can see that you get a mean representation out of this, which becomes smoother and smoother, of course, because it is an average. This makes it very easy to calculate means of distributions of features, so essentially it could give rise to something that I would like to call topological hypothesis testing, because it is a very simple way of comparing two distributions by topology-based means. That is just one of the applications. There is another one that is really neat: a classification scenario. We will see this data set in a few other examples as well; it is essentially a co-occurrence graph data set from Reddit, the internet platform. It uses the REDDIT-BINARY data set, which is created from two different kinds of subreddits, i.e. forums, and the task is to classify each graph according to whether it comes from a discussion-based forum or a Q&A-based forum. These graphs are really easy to assess using persistent homology, because we can calculate a filtration based on the vertex degree of a graph (we will see more details about this later), calculate the persistence diagrams for d = 1, so for the cycles, then calculate the Betti curves and use a kernel SVM for the classification. Already this very simple approach gives us a precision-recall curve with an AUPRC of 0.93, which is not too bad given that we did not use any information from these graphs except for the cycles. That is what you get out of the box with a simple Betti curve representation of this data set; more details are in the preprint. Of course this is not the highest score you can get here, but it is one particularly interesting example that, I find, highlights the opportunities you get from these curves. We can also make this a little more complex. Similar to these curves, where we just looked at individual intersection counts, we can also look at the rank of features, at how the topological features of a diagram are covered by one another, and peel off those layers iteratively. This leads to something called a persistence landscape, and you will see the reason for the name in this illustration. First, take a look at the persistence diagram here: we draw these additional lines that indicate the covering, i.e. how many other features are contained inside a given region, and then we simply count how many intersections lie in those regions. For reasons of simplicity, we take this representation and rotate it: we rotate the diagonal down by 45 degrees and then continue calculating in that space. It is just a convention; you do not need the space below the diagonal anyway, so why not make your life easier, right? Then we look at how many intersections we have between those triangles. For example, this big triangle here intersects this other big triangle here, in this region here, which is why we have the shading, and we have at most, unless I am counting incorrectly, three intersections, because of course we also have at most three points.
We can see that this point here intersects, in this region, with all three of the other features. What we can do now is peel off layers iteratively. First we only look at the outermost intersections: we walk along the highest line here, along the maximum, and that is our first layer, which you can see expressed here. Then we peel off this layer and see what remains, always walking along the highest remaining line. In the second layer we walk along here; we are not allowed to go down here, because there is an intersection, so we go up here and come down here, and that gives the second layer. Then we peel that off and we are left with this single triangle. This is why it is called a persistence landscape: you get a sequence of these piecewise linear functions, which are hierarchical in a sense. I do not want to make this more complicated than it has to be, but there is of course a relationship to the Betti curve, in the sense that the Betti curve is a very simplistic way of looking at these features, whereas with the persistence landscape you get an idea of the hierarchy. The neat thing is also that you can define this landscape for different values of the hierarchy: you can say, I want to look at the k-th persistence landscape layer, and you can always extend it beyond your initial domain with zeros and it remains valid. So it is a kind of integer-valued, no, it is not integer-valued, pardon me, it is a real-valued descriptor, a real-valued curve that you can extend arbitrarily along the x axis. This formulation is due to Peter Bubenik, in "Statistical Topological Data Analysis using Persistence Landscapes", a JMLR paper from 2015, which also contributed to the foundations of topological machine learning. Its primary focus is the beneficial statistical properties; it also permits the efficient calculation of distances and kernels, but the primary goal of the paper was to get a notion of statistical hypothesis testing in this space, making it possible to calculate means of those landscapes and to compare the means with each other. It is extremely readable, and it is a really useful approach that is now used very often in TDA to ascertain the dissimilarity between different topological distributions. Right, and I am not going into the details here because there is so little time, but there is a very recent preprint from this year in which this persistence landscape is sampled at regular intervals to obtain a fixed-size feature vector, the built-in hierarchy is exploited, and the whole thing is actually used as a neural network layer. The preprint is called "Efficient Topological Layer based on Persistence Landscapes", and it is also worth a read because it shows you how to connect these descriptors to a neural network architecture, although I have to stress that we will see a very easy way to do so in the last lecture, so there is no need to go through it now. I know that having yet more descriptors does not make this easier for beginners, but if you are already somewhat aware of what is going on in TDA, this paper can be very interesting to you. A minimal sketch of sampling a landscape on a grid follows below.
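Here is a minimal sketch of that sampling idea: each diagram point contributes a tent function, and the k-th landscape layer takes the k-th largest tent value at every threshold. Evaluating it on a fixed grid directly yields a fixed-size feature matrix; the grid and the number of layers k are choices you have to make.

```python
import numpy as np

def persistence_landscape(diagram, thresholds, k=3):
    """Sample the first k persistence landscape layers on a fixed grid.

    Each point (b, d) contributes the tent function min(t - b, d - t), clipped at 0;
    the k-th landscape layer is the k-th largest tent value at each threshold t.
    Returns an array of shape (k, len(thresholds)).
    """
    diagram = np.asarray(diagram, float)
    b, d = diagram[:, 0][:, None], diagram[:, 1][:, None]
    t = np.asarray(thresholds, float)[None, :]
    tents = np.clip(np.minimum(t - b, d - t), 0.0, None)   # shape: (n_points, n_thresholds)
    # Sort tent values in descending order per threshold and keep the top k layers,
    # padding with zeros if the diagram has fewer than k points.
    tents = -np.sort(-tents, axis=0)
    layers = np.zeros((k, t.shape[1]))
    layers[:min(k, len(diagram))] = tents[:k]
    return layers

D = [(0.0, 1.0), (0.0, 0.4), (0.5, 2.0)]
ts = np.linspace(0.0, 2.0, 9)
print(persistence_landscape(D, ts, k=2))
```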
All right, to make this clear and to cite some of my colleagues' other work: there are also other functional summaries of persistence diagrams. I showed you the Betti curve as a very simple one, which just counts intersections, but you can generalize this: you can evaluate so-called template functions over a persistence diagram, which makes it possible to incorporate more than just the raw point information. One simple template function, for example, is a tent function: you go to a point in the diagram, look at its neighbourhood, evaluate the function there, and sum over all these contributions. In the language of embeddings, you take a template function g that operates on persistence pairs, so on points of the persistence diagram, and you obtain a simple embedding based on summation: you map your diagram D to a representation by summing g over all its features. Then you can use multiple template functions with multiple parameters, and so on and so forth, and you obtain a feature vector this way, potentially a very high-dimensional one. Bear this example in mind, that you can create a point-wise summary of persistence diagrams with this sort of approach, because it will be very, very helpful in the last lecture. I also have to mention another approach that is a little more clunky, to be honest, but it is an interesting way to approach this: histogram-based vectorization of persistence diagrams. You take the persistence diagram and you cluster it, which gives you a way to learn representatives of each region; then you learn a bag-of-words representation and use the quantized bag of words as a feature vector. This works really well, and the authors have some very smart techniques for overcoming some of the issues, but in general I would say that the parameters of this representation are not as easy to pick, because there is no intuitive description of the resulting representation. With the persistence landscape, for example, you have this intuitive view of a hierarchical curve going through your diagram, and you lose that here a little bit, because the clustering can yield different representations of your data set depending on the input conditions, right? And now let's move to a very simple and successful descriptor of persistence diagrams that is also multi-scale, similar to the persistence diagram itself. It is called the persistence image, and it is due to Adams and colleagues from 2017, also a JMLR paper. The overall idea is very simple; it is closely related to the multi-scale kernel we saw earlier in that it uses a kind of kernel density estimation, a kind of smoothing. You take the diagram and rotate it into birth-persistence coordinates, because that is easier to calculate with, and then you evaluate a function on this plane that turns the diagram into a surface. You do this by evaluating a weight function w for every point of the diagram together with a probability distribution phi centred at that point and evaluated at positions (x, y) in the plane, so you assign every point of the plane a kind of height. Typically phi is a probability distribution such as a normalized symmetric Gaussian, and w is a fixed piecewise linear weight function that essentially tells you how much you want the persistence of a single point to count. Now you can discretize this surface using an r x r grid, and this yields a persistence image.
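Here is a minimal sketch of that construction under simple assumptions: Gaussian bumps in birth-persistence coordinates, a persistence-proportional weight, and a fixed grid. The weight function, the grid extent, and the smoothing sigma are all choices; the original paper allows more general weight functions and distributions.

```python
import numpy as np

def persistence_image(diagram, resolution=20, sigma=0.1, extent=(0.0, 1.0, 0.0, 1.0)):
    """Minimal persistence image: transform to (birth, persistence) coordinates,
    place a Gaussian at every point, weight it by persistence, and sample the
    resulting surface on a fixed resolution x resolution grid.
    """
    diagram = np.asarray(diagram, float)
    births = diagram[:, 0]
    pers = diagram[:, 1] - diagram[:, 0]          # rotate: persistence instead of death
    weights = pers / max(pers.max(), 1e-12)       # simple piecewise-linear weight choice

    x0, x1, y0, y1 = extent
    xs = np.linspace(x0, x1, resolution)
    ys = np.linspace(y0, y1, resolution)
    X, Y = np.meshgrid(xs, ys)

    image = np.zeros_like(X)
    for b, p, w in zip(births, pers, weights):
        gauss = np.exp(-((X - b) ** 2 + (Y - p) ** 2) / (2 * sigma ** 2))
        image += w * gauss / (2 * np.pi * sigma ** 2)
    return image                                  # fixed-size (resolution x resolution) array

D = [(0.0, 1.0), (0.0, 0.4), (0.5, 2.0)]
img = persistence_image(D, resolution=10, sigma=0.2, extent=(0.0, 1.0, 0.0, 2.0))
print(img.shape)   # (10, 10): flatten it to obtain a fixed-size feature vector
```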
I think the intuition behind this is quite clear: you take the diagram, you evaluate a kind of density that is appropriately weighted, because you want to give very persistent features a high weight, and then you can represent the resulting surface at multiple scales. The price you pay, of course, is that if you use a very high resolution r, you also have r squared pixels to store in your image. At the same time, it allows you to zoom into certain regions of your image that you might not be able to resolve otherwise. You can see this illustrated here: if we start with a very coarse resolution there is not much going on, but at this resolution here we already get something progressively more useful. The properties are very interesting and very useful in practice. First of all, it can be shown that this representation is also stable; bear in mind the earlier caveat about large-scale perturbations, but for small-scale perturbations the persistence image is also stable. What I particularly appreciate about it is that it has an intuitive description in terms of density estimates, so a persistence image can be explained to someone easily: you just tell them, hey, I have these points in the plane and I use density estimates with some additional weights, and that is it. However, the resolution and the smoothing parameter are sometimes hard to choose. In some cases they do not make much of a difference, so these parameters turn out to be relatively stable, but in other applications it turns out to be very crucial and very critical how you choose them in practice. Moreover, the thing I dislike most about persistence images, to be honest, is that the representation is not sparse: there is this quadratic scaling with r, so you get feature vectors, but they are really, really large. On the other hand, that also makes them easy to use in a classification setting, because you now have a fixed-size feature vector: it turns all your persistence diagrams into fixed-size images, and those you can feed into whatever algorithm you want to pick. I have seen people use random forests for this, I have seen people use multi-layer perceptrons, and I have even seen people use CNNs. So it is a very versatile way to convert topological features and bring them into a machine learning pipeline. Recently there have also been some extensions. I want to point you towards a NeurIPS paper from 2019, so NeurIPS has now become a target for TDA as well. In this paper the authors look at learning weights for persistence images: instead of just taking the same weight function for all cells of the persistence image, the idea is to learn a weight function that is more appropriate for the downstream classification task. This is achieved by implementing a minimization scheme on the persistence image, and it amounts to calculating a weighted distance between such images; the authors show that this new weighted kernel can also be used in an SVM, and they raise the bar quite considerably for graph classification. So, interestingly, graphs, as we will see in the next lecture and the last lecture, have become a prime target for TDA approaches.
I suspect that this is because graphs are simplicial complexes already and are not too high-dimensional, so they are a very easy target for incorporating these methods and extracting features. But I have to stress that while this approach was presented for graph filtrations and graph classification, it can be implemented for any other kind of filtration, for any other kind of input data, because once you have a persistence image of your data set, there is no need to remember what the original objects represented; you can just apply the weighting algorithm. All right, there are also other vectorization methods. I have not seen them used that much, but I implemented some of them and tried them out, and they are really neat. Without too many details: you can also take a persistence diagram, calculate a kind of minimum distance to the diagonal and the distances between points, sort all those values in descending order, and pick the first few of them, potentially padding with zeros; this also gives you a very nice fixed-size feature representation. It is effective, in the sense that it can be shown to incorporate a lot of the information of the persistence diagram, and it is also stable. But again, the computation of this descriptor scales quadratically in the number of entries of a persistence diagram. And this illustrates a very neat trade-off, a no-free-lunch situation, I think: either you lose some expressivity, as with the Betti curves, which are simple to calculate, one pass through a diagram and that is it; or you lose some efficiency in the computation, because you either scale quadratically with the size of the feature representation, as with the persistence image, or you scale quadratically in the number of entries of the persistence diagram, as with many kernel representations and with this representation, and so on and so forth. So you have to pick your poison in a sense: where do you want to spend your precious computational cycles? Last, I want to point towards some other summary statistics, which are particularly interesting when you have time-varying persistence diagrams. I used them in previous research; we will come back to that in the next lecture. There are well-defined norms of a persistence diagram, based either on the maximum persistence that you observe or on the sum of all persistence values that you observe. The fact that these are norms in the mathematical sense is a rather deep result that has to be proven first. These norms are also stable with respect to small-scale perturbations and are thus really useful for obtaining simple descriptions of persistence diagrams, in particular when you have them over the same time scale. Suppose you have a time series of persistence diagrams; suppose every one of your time series has the same length, or you can always pad them, then you can just calculate this summary statistic and you get a neat curve, a 1D function, and you can compare those with each other. It is of course not perfect, it will not replace persistence diagram kernels or whatnot, but it is an excellent proxy, I would say, for more complicated distance calculations, because calculating distances between curves in the plane is much easier than calculating them in other spaces. A minimal sketch follows below.
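As a small illustration, here is a sketch of such a summary statistic: the maximum persistence (an infinity norm) and the p-th power sum of persistence values (total persistence for p = 1). Applying it to each diagram of a time series, which is made up here, turns the series into an ordinary 1D curve.

```python
import numpy as np

def persistence_norm(diagram, p=2):
    """Simple summary statistic of a diagram: the p-norm of its persistence values.

    p = np.inf gives the maximum persistence; a finite p gives (sum of pers^p)^(1/p),
    i.e. the total persistence for p = 1.
    """
    diagram = np.asarray(diagram, float)
    pers = diagram[:, 1] - diagram[:, 0]
    if np.isinf(p):
        return pers.max()
    return (pers ** p).sum() ** (1.0 / p)

# A time series of diagrams becomes a simple 1-D curve that is easy to compare.
series = [[(0.0, 0.5), (0.1, 0.9)], [(0.0, 0.7)], [(0.2, 1.5), (0.3, 0.6)]]
curve = [persistence_norm(D, p=np.inf) for D in series]
print(curve)   # one number per time step, approximately [0.8, 0.7, 1.3]
```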
And, a little bit more complex, I also want to point towards very recent research by my colleagues from Oxford University, a paper by Chevyrev, Nanda and Oberhauser. It is about a generic vectorization of persistence diagrams based on so-called signatures. The observation here is that different representations of a diagram can give rise to paths in a very high-dimensional space, and then you can use the so-called path signature, which is a kind of universal non-linearity on paths of bounded variation, to compare them. Path signatures have several beneficial properties, one of them being stability, of course, another being that they can be extended easily, so there are all kinds of ways of making the calculations efficient. The initial results in this paper are quite promising, but also a little bit computationally heavy, because in addition to the persistent homology calculation we now also have to calculate the signature alongside it. So there is an additional layer of complexity; the price might be steep, but it might be worth it in the end if you get a more expressive representation. Let me end with this neat diagram: a very simple way of figuring out which method to use in practice. I hope this will not trigger hate mail from the authors of the other methods, but it just might; it is just my opinion. So let's start here: you just take a look at what you want to do. If you are interested in statistics, just use persistence landscapes, because that is a no-brainer; that is what they were invented for, and in particular they allow you to do hypothesis testing. If you are interested in summaries, and maybe you just want a quick first look into your data, use the norms of the persistence diagrams, because that is easy. Moving on, if you want to do something more complex like classification or visualization, then it depends on how many caveats you have in your calculation. If you require features, so if you are interested in knowing what the features represent and you want to have some control over the calculation, then use persistence images. If you are not interested in that and you are okay with having a continuous representation, try the Betti curves for large diagrams, because they scale really, really neatly, and try the multi-scale kernel for smaller diagrams. Of course this is vague: small and large, what does that mean in practice? The point I am making is that the Betti curves scale a lot better than the multi-scale kernel, because you only have to calculate the representation once for every diagram, whereas with the multi-scale kernel between persistence diagrams you have to go over all pairs of points from the two diagrams. If that turns out to be a limiting factor for your analysis, you are better off with something else. But this is the rough overview of what is currently out there. Of course, you could equally well say that one of these boxes could be synonymous with any other kernel method, and another one with any other template function method, but I did not want to make this more complicated. So this is where I am ending right now, with a summary of all the topological descriptors. The takeaway message from this lecture should be that there is the original persistence diagram structure, and it is somewhat cumbersome to work with due to its multi-set structure.
There are also different mathematical aspects of it that are not super nice, one of them being that there is no well-defined mean in the space of persistence diagrams. Hence, there are numerous topological descriptors for different usage scenarios. There are two classes of methods that might be useful here: one is kernel-based, the other one is feature-based, although I should also say that some of the kernels give rise to finite-dimensional features, so it is kind of a blurry mix in between. But that is roughly where the field is at the moment, and this is where I want to leave you for now. Again, I am happy to take any questions before moving on. "It seems that you discussed a little bit the speed of the methods that derive features and whatnot from the persistence diagrams. But my intuition would have been that this is computationally the easiest thing, easiest in the sense that it is not so computationally expensive. I would have thought that calculating the persistence diagram in the first place would dominate everything else, but apparently this is not the case? Because we discussed Ripser before, right, and you said it has very good constants but still a very bad, exponential or something like that, complexity class. That is why I thought this would dominate everything computationally, but now it seems it does not; maybe I was mistaken." Well, it depends on the application a lot. The thing is that you can have very efficient algorithms, I did not mention this here, if you constrain yourself to, let's say, zero- and one-dimensional features. In that case you can have algorithms that are essentially as efficient as sorting, so n log n, and there you do not have the bad scaling properties. But the thing with the distances is that a persistence diagram can have a lot of features that you have to account for. Computing the diagrams themselves in parallel is easy: if you have different batches for which you want to calculate persistence diagrams, you can do this in an embarrassingly parallel way, right? But if every one of those diagrams then has, let's say, a thousand points, then for the metric distance calculations you have to solve an optimization problem over what is essentially a thousand-by-thousand matrix. You can work around some of that with the kernels, where you also have to evaluate this thousand-by-thousand matrix, or at least half of it, because you have to take into account all pairwise distances between individual features, but you do not have the additional minimization or optimization step, right? So the point I am trying to make about the distance calculations is that the minimization or optimization step forces you into a different complexity class that you would otherwise not be in. Essentially, people use things like the Hungarian method to solve this optimal transport problem, this Wasserstein distance problem. And you are also right that there are some recent advances in optimal transport theory: there is entropic regularization, there are Sinkhorn divergences, things like this.
So all of this can be sped up, but again, sometimes we are only talking about constants here, and in the end you still have this very large matrix that encodes all of your distances, all of your similarities. You cannot escape this combinatorial hell, I would say. This is why, in practice, I would say that for graphs, for example, you are better off using a kernel representation or a template function representation, because you can compare those representations much faster than you can evaluate the actual distances. And moreover, maybe I am making some new enemies here, but you are actually not that often interested in having a proper metric in machine learning, right? Oftentimes it is perfectly sufficient to know the similarity information, to know that these two objects are very dissimilar from each other, or that this feature vector is dissimilar from that other feature vector. Having one quantifiable number is actually not that often what you are after, I would say, in particular not for classification, right? Does that make sense? "I am not sure whether I would agree, but I see what you mean, at least." I mean, do not get me wrong: this is not my way of saying that these distances are useless. I included them because they are really, really useful in some contexts; they just do not have the nicest scalability properties. "So the trick, why calculating the persistence diagram is not killing everything, is that you restrict it to these one-dimensional loops. So basically your simplicial complexes from the first lecture are always graphs, and you do not have surfaces or anything like that. How would it be if you did have surfaces, would this then explode exponentially?" No, with surfaces you are still fine. In fact, I would say that if you have a very structured simplicial complex, for example a grayscale image, so a rectangular domain with some scalar values on it, then all of this scales rather neatly; you can still do this, and there are nice duality theorems that make the calculations faster in practice. "A three-dimensional grayscale image, that is what you mean, right?" Yes, exactly. You only get into trouble when you are in a very unconstrained setting, where you say: oh, I have a 100-dimensional point cloud, let's calculate a 100-dimensional Vietoris-Rips complex. In that case you are really in trouble. And I have to stress that I do not know of any papers that actually use features of dimension higher than two, at least not in practice for big data. Okay, big data is also a hand-wavy, vague term, but I would say that at machine learning conferences I have never observed anything beyond dimension two, I think. "So a surface would be the largest?" Yes, exactly. "Like a triangle or something like this, the surface of a triangle." Yes, and that still gives you a lot of information. Maybe as a very idealistic comment, this is also where the idealized algebraic topology world and the machine learning world diverge the most, because for an algebraic topologist it would be anathema to say, oh, we only calculate up to dimension two or something like that, because they know that interesting stuff might be hiding in higher dimensions, right? For the machine learners, for us, well, I am not sure what I am anymore, maybe I am between both worlds.
But in practical applications, you can often just be pragmatic and say: okay, if two-dimensional persistent homology is sufficient to give me a classification accuracy that is good enough for my purposes, then why should I go any higher than that, right? That is the kind of trade-off I am talking about. "I see, thanks." You're welcome. There was also another question in the chat, but maybe... "Yeah, can you hear me now?" Yes, I can hear you. "Okay, perfect. The app crashed, so I missed the last 10 or 15 minutes, so perhaps you already answered this, but there is something I want to check whether I got right. You talked about persistence diagrams, and if I understood correctly, they depend on the kind of filtration we choose, right?" Yes, absolutely. "Okay, but once we choose one kind of filtration, the persistence diagram has no path dependency, right? Because you said at some point that when two points have the same score, since we add one after the other, we have to choose which one to add first, and that may give... well, I asked myself, does that not give rise to a path dependency? Because you might then create and destroy topological features in a different fashion. But that does not happen, right?" Well, you are raising an excellent point here, and I have to disentangle this. First of all, typical persistent homology implementations avoid that sort of dependency, or rather they sidestep it, by taking into account, for example, an index of a simplex: they would say that if we cannot tell two simplices apart, we take the one with the lower index, or something like that. In our case it would be similar to adding things in lexicographic order, for example: you could say, okay, first add the simplex AB and then the simplex BC, or something like that. But you are absolutely right, it can create these dependencies. In fact, in one of the papers that I am going to discuss in the last lecture, my colleagues and I actually looked at exactly this, because there is a fascinating relationship to other concepts in machine learning, namely to your max pooling operator, for example. The max pooling operator has essentially the same problem: if you have the same maximum somewhere in your data, you have a choice to make, and this choice is implementation dependent, and it could, theoretically, lead to different results downstream. In the paper I am going to discuss, we worked around this by making a genericity assumption: we assume that all the distances that we observe are distinct, so unique, and none of the distances is allowed to occur twice. In practice this can be achieved by symbolic perturbation, but it is a point to take into account. So yes, there can be differences in the way that your reduction works and in the way that your filtration sets up your persistence diagram. This is an excellent point, and it is something one has to consider when choosing an implementation. And, software engineering tangent incoming, this is why it is so important that the libraries that perform persistent homology calculations have well-defined test cases in which they actually check whether their representation makes sense and whether they calculate the right features in the right order, and so on and so forth.
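Just to illustrate the kind of convention meant here, a minimal sketch with made-up simplices: ties in the filtration value are broken deterministically by dimension and then by the (lexicographically ordered) vertex tuple, so the reduction sees the same order on every run.

```python
# Hypothetical simplices and filtration values, for illustration only.
simplices = [
    ((0,),      0.0),   # vertices
    ((1,),      0.0),
    ((2,),      0.0),
    ((0, 1),    1.0),   # two edges that appear at exactly the same threshold
    ((1, 2),    1.0),
    ((0, 2),    2.0),
    ((0, 1, 2), 2.0),   # a triangle appearing together with its last edge
]

# Sort by (filtration value, dimension, vertex tuple): ties are resolved the same
# way on every run, so the reduction, and hence the pairing, is reproducible.
filtration_order = sorted(simplices, key=lambda s: (s[1], len(s[0]), s[0]))
for simplex, value in filtration_order:
    print(value, simplex)
```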
Getting this right is actually a very tough thing to ensure in practice, but people need to look at it.