something that I don't really know anything about as of today, because I was browsing the internet and found this thing called topological data analysis. I was googling something along the lines of "is algebraic topology useful for anything?", because I was beginning to have a bit of an existential crisis about my life. A question on MathOverflow came up that talked about topological data analysis, with links to a couple of papers. It turns out this branch even has its own Wikipedia page, and there are people doing fairly active research in the area. So I thought I would learn about it, and what we're going to see today is the product of the last couple of days of me just learning stuff. It's really interesting, and there are a couple of excellent papers online. One of them is bite-sized, about 15 pages, by a guy called Robert Ghrist, and his paper is basically what this seminar is based on. There is another paper by a guy whose first name I forget, which is terrible — Gunnar Carlsson — which is much longer, more like 50 pages; it goes into a lot more detail, gives more applications, and is more recent. There will be references at the end, and I recommend you have a look at them, because they're very nicely written papers. So if you know anything about homology, which you will after this talk, that's something you can go and learn about.

I'm going to start by motivating what all this stuff is about. Like I say, this really is an application to the real world of something that was born out of pure mathematics. Our setup is as follows. We have what's called point cloud data: a finite set of points, possibly sampled from some surface or space we know something about, sitting in a typically high-dimensional ambient space. I'll give an example in a moment. We want to extract some qualitative information about the data. One example, given in Carlsson's paper, is that you can distinguish between the two types of diabetes — the early-onset kind and the kind you get later (I think this might be known medically anyway) — by noticing that the data points cluster around two different centers in whatever high-dimensional space they live in. The relevant topological property there is: how many connected components does this space have? You can work that out by applying this theory to the corresponding space.

Now, we've got these points in n-dimensional space, usually real space, and that usually comes with a lot more structure than we necessarily need. It comes with distances, coordinates, a choice of origin, all that kind of stuff. Maybe all we want to know is, like I say, the number of connected components, or the regions your data is avoiding for some reason, that kind of thing. So topology might provide the methods we need — though I should say it's not always appropriate. Topology is what you get when you take geometry and forget distance, forget coordinates. All you care about is the stuff that doesn't change when you bend it around and squish it and throw it out the window. So as an example, here is some point cloud data, and you need to use your eyes and your brains to distinguish which of the following is a photograph. On the left is a picture of me circa 2001. On the right is a "picture of me" circa 2013 — no, it's everything that's not a picture of me. Can anyone tell me which of these is a photograph?
Any guesses? B? No, it's the other one. But although this is a picture of me, in my school uniform, what you're seeing is not really me. Okay, you are seeing me, but whatever: what you're seeing is a finite number of pixels, each colored in one of finitely many ways. You can interpret this as a point in a very high-dimensional space, where you have one dimension per pixel, and potentially more dimensions for the red, green and blue values of each pixel. Later on we're going to restrict to grayscale. So I guess the real question is: if we want to understand what our brains are doing, and we want computers to be able to distinguish between a photograph and random pixels, what are the underlying principles we should be looking for? What are the things we're trying to find?

So here's the idea of what we're going to do. You're given a bunch of points in R^n. Typically this will be a finite set, because when you actually do an example you can't do infinitely many things; we're humans. You want to build some simplicial complexes — I'm going to define what those are. The idea is that they're a combinatorial construction which encodes the topology of sufficiently nice topological spaces, and nice compact subspaces of R^n, like the ones you'll build from finite point sets, are sufficiently nice for all of this to work. We're going to study algebraic invariants of the resulting simplicial complexes (simplicial complexes and topological spaces can be thought of as the same thing for our purposes): their homology groups, and things called Betti numbers. I will define all of these terms later. And we're going to see which of these invariants persist as we vary the parameters involved — for different parameters we're going to get different simplicial complexes from our data. We're then going to encode this information using something called a barcode, which tells us, as our parameter varies, how the generators of the homology groups change or don't change. This gives us qualitative information: we're literally going to look at the barcode and be like, oh, that line's longer than the other one, and that should hopefully be enough. So with that in mind, I'm now going to do some topology at you. The next section is all about simplicial homology. Before I continue, are there any questions?

[About the photo:] I should say it was my mother that made that costume for me. My school had a Tudor day, where we went in dressed as people did in Tudor times, and she was preparing me for a Tudor wedding: I got married, in a mock ceremony. "Are you still married or did you get divorced?" Well, we never officially got divorced.

So anyway. Definition: an abstract simplicial complex is a pair (V, Σ), where V is a finite set and Σ is a set of non-empty subsets of V which is closed under taking non-empty subsets. What would this look like? Sorry, if you're sat over there you probably won't be able to see it, because I'm going to be in the way of the video as well. You can think of the elements of V as being points, and you can think of the elements of Σ as being simplices.
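To make that definition concrete, here's a minimal sketch in Python — my own illustration, not something from the papers — of one way to represent an abstract simplicial complex: list the maximal simplices and close up under taking non-empty subsets.

```python
# A minimal sketch: an abstract simplicial complex as a set of tuples,
# generated from its maximal simplices by closing under faces.

from itertools import combinations

def close_under_faces(maximal_simplices):
    """Build Sigma: every non-empty subset of a simplex is also a simplex."""
    sigma = set()
    for s in maximal_simplices:
        s = tuple(sorted(s))
        for k in range(1, len(s) + 1):
            sigma.update(combinations(s, k))
    return sigma

# The hollow triangle: three edges, no filled-in face.
hollow_triangle = close_under_faces([(1, 2), (1, 3), (2, 3)])
print(sorted(hollow_triangle, key=len))
# -> the three vertices (1,), (2,), (3,) and the three edges
```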
So for example, this here would be a one-simplex, which is a line segment. If you join these together and shade it in, you get a two-simplex, which is a filled-in triangle. If you add this point to the mix and fill it all in, you get a three-simplex, which is a solid tetrahedron, and you continue like that to get higher-dimensional simplices. And one question that I'm sure is burning in the back of your mind is: how do these simplicial complexes arise? Well, for example, a graph is a vertex set V together with a set of pairs of elements of V, and each graph gives rise to an abstract simplicial complex just by throwing its edges and vertices into the set of simplices. You just forget that you were working with a graph.

Every abstract simplicial complex X can be associated, in a nice way which I'll describe later, with a topological space, which we denote |X|, with vertical bars. Each simplex — I'm going to call the elements of Σ simplices, because that's what they are — will correspond with a geometric simplex. So for example, if this is your data, then the associated topological space really does look like a tetrahedron.

Here's an example (this is from Wikipedia, by the way). We have 18 vertices, and all of these simplices. Because I'm including all of the vertices in my simplicial complex, there's a zero-simplex for each of them. The one-simplices are the edges: {1,2}, {1,3}, {3,4}, this one over here involving 4, and then {6,7}, {6,8}, {6,9}, {7,8}, {7,9} — you get the idea. The two-simplices are the filled faces: this one, this one, this one, and all the faces of this tetrahedron-looking thing. It's ambiguous from the image whether the middle of that is hollow; I'm going to assume it's hollow, because it will make something I say later slightly more interesting.

So we want to study the structure of these things somehow — we want to assign an algebraic invariant to these simplicial complexes. Given a simplicial complex X and an abelian group A (you can think of A as being the additive group of integers if you want; that will make your life a lot easier), the group C_k(X; A) of k-chains in X with coefficients in A is the group of formal A-linear combinations of the k-simplices, the simplices with k+1 vertices. I'll draw some things over on this side, depending on how far to that side of the room you are, because I'm going to be drawing a lot of stuff on the board. If you just have this hollow triangle, with vertices v1, v2, v3 — I'll call this edge [1,2], this one [1,3], and this one [2,3] — then C_1 is all formal combinations of the edges, and C_0 is everything of the form a·v1 + b·v2 + c·v3 for a, b and c lying in A. I am pretty much uniformly going to take A to be the integers, although this theory works very nicely when you take A to be a field, because then everything later on is a vector space.

There is a boundary map, and it's defined as follows: you get from a k-chain to a formal sum of (k−1)-simplices by taking, on each simplex, the alternating sum of its one-dimension-lower faces. Written out: ∂_k [v_0, ..., v_k] = Σ_{i=0}^{k} (−1)^i [v_0, ..., v̂_i, ..., v_k], where the hat means that vertex is omitted.
It's the alternating sum of what you get when you delete one vertex at a time from each face. So if you were given, say, this completely filled-in triangle, the boundary map applied to it would be this face, minus this face, plus this face. Or maybe the other way around — it's just an alternating sum of faces, and it doesn't really matter how you orient it, as long as you're consistent. To make the signs well defined, just assume the vertices are totally ordered; assume they're natural numbers or something like that.

Let me keep tabs on what we've defined. C_k is the k-chains, which are formal sums of k-simplices. We've got the boundary map ∂_k : C_k → C_{k−1}, the alternating sum of faces. I'm now going to define Z_k to be the kernel of ∂_k, and B_k to be the image of ∂_{k+1}, the boundary map from one dimension higher. If your formal sum of simplices came from applying the boundary map to a chain one dimension up, it's a boundary; if the boundary map sends it to zero, it's a cycle. And you can think of a cycle as what you get when you go around a loop: the edges have to cancel head-to-tail, like going around a square or a triangle.

Okay. Here's something I'm not going to prove: when you apply the boundary map twice, you get zero, ∂_k ∘ ∂_{k+1} = 0. And this is because when you add and subtract stuff, and then subtract and add it again, everything cancels in pairs — that's honestly how the proof goes. In particular every boundary is a cycle, so we can define the k-th simplicial homology group of X with coefficients in A to be the quotient of the cycles by the boundaries: H_k(X; A) = Z_k / B_k. Let me write that down so we have it: homology is cycles over boundaries.

So here's an example. I'm going to take V to be the set {1, 2, 3} — three vertices. There's obviously a zero-simplex corresponding to each singleton, which is just the vertices, and the one-simplices are the edges around the triangle. So this is a graph which is just a hollow triangle; I'll draw it on this side of the board. It is not filled in. And the intuition that I'm going to spoil at this point is that the algebraic invariant we're building goes some way towards counting the number of k-dimensional holes in the space. Here we have a hole which is enclosed by a bunch of one-simplices, so I'm going to call it a one-dimensional hole, and we want the first homology group to be something nontrivial. Now, this complex has no two-simplices — no filled-in triangles, just edges and vertices — so B_1, the image of the boundary map coming into C_1 from above, can only be zero, because C_2 is zero. That means H_1 is just the group of cycles Z_1, and this is a nice calculation for everyone to see: we want to find the kernel of the boundary operator ∂_1. (A sketch of the same computation in code follows; then we'll do it by hand.)
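A hedged NumPy sketch, my own: over a field (floating point stands in for the rationals) the Betti numbers fall straight out of the ranks of the boundary matrices, though this won't see torsion the way integer coefficients would.

```python
import numpy as np

# Boundary matrix d1 : C_1 -> C_0 for the hollow triangle.
# Columns are the edges [1,2], [1,3], [2,3]; rows are the vertices 1, 2, 3.
# Each column is the alternating sum of faces: d[v0, v1] = [v1] - [v0].
d1 = np.array([
    [-1, -1,  0],   # coefficient of vertex 1
    [ 1,  0, -1],   # coefficient of vertex 2
    [ 0,  1,  1],   # coefficient of vertex 3
], dtype=float)

rank_d1 = np.linalg.matrix_rank(d1)    # = 2
b0 = 3 - rank_d1                       # dim ker d0 (all of C_0) minus rank of d1
b1 = (3 - rank_d1) - 0                 # dim ker d1, minus im d2 (no 2-simplices)
print(b0, b1)                          # 1 1 -- one component, one hole
```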
And here's what you get by hand. Take a formal combination a[1,2] + b[1,3] + c[2,3] and apply the boundary map: ∂[1,2] = [2] − [1], ∂[1,3] = [3] − [1], and ∂[2,3] = [3] − [2]. Do the math, set the total equal to zero, and you must have a = −b, a = c, b = −c. So the resulting group is generated by the single element [1,2] − [1,3] + [2,3], which you can picture as: go along 1 to 2, then go along 2 to 3, then go along the edge [1,3] backwards, from 3 back to 1. So really it's just going all the way around the triangle, one full loop. Okay, this is good. So H_1 is the group you get when you quotient this group generated by one element by zero, which is therefore isomorphic to the integers: Z, that is, Z to the power one. One copy of Z, one hole — bear in mind, this is not an accident.

So now I'm going to talk about how this actually relates to topology. As I mentioned before, every simplicial complex X gives rise to a topological space — you just send the vertices to points, more or less. We can make this rigorous: if your simplicial complex has n vertices, map the i-th vertex to the i-th standard basis vector of R^n, and then for each simplex take the convex hull of the points corresponding to its vertices. So if you form the geometric realization of our triangle, you'd be in three-dimensional space, because it has three vertices: you've got the points (1,0,0), (0,1,0) and (0,0,1), and the convex hulls of the pairs give you the edges. So that is what the actual embedding into three-dimensional space is, not the sketch I drew.

You can also go in the other direction, assuming your space is sufficiently nice — think of a compact subspace of R^n, which in our case is definitely what it's going to be, although you can do this more generally. You do it by triangulation, and the idea is that you cover your space sufficiently finely and put in vertices, edges and so on wherever they fit, so that the complex matches the space, including its holes. That is possibly the worst explanation of triangulation ever given, but the point is this: homology is independent of the triangulation, so I can define the homology of the space to be the homology of any simplicial complex that triangulates it. So if I have a triangle, for example: the geometric realization of that simplicial complex is the hollow triangle, and the hollow triangle is homeomorphic to the circle S^1. So the first homology group of the circle, with coefficients in the integers, is exactly what we computed: Z.

Before I move on — this is going to matter when we do the real work — I need to introduce homotopy. Two maps f, g : X → Y are homotopic if there is a continuous map H : X × [0,1] → Y with H(−, 0) = f and H(−, 1) = g; you can think of it as a continuous family of maps that starts at one and finishes at the other. And if I can get this working, I have a nice animated GIF, which I stole from Wikipedia, illustrating this. You start with a function f, which you can think of as tracing out a path in your space, and you've got a function g, tracing out another path, and the homotopy tells you, at each given time, which in-between function you're currently at.
So in this case — this is a very special case — it's a homotopy between two curves relative to fixed endpoints. Just think of bending stuff. And we say that two spaces are homotopy equivalent if there are maps between them whose composites are not necessarily equal to the identities, which is what a pair of inverse homeomorphisms would give you, but merely homotopic to the identities. This is a weaker notion of equivalence than homeomorphism, and it's what you get when you can morph one space into another. The old joke about topologists thinking that coffee cups and donuts are the same thing comes back to this sort of deformation.

Another example: the interval [0,1] is homotopy equivalent to a point. Intuitively, squish the interval down to the point and then stretch it back out. You're allowed to squish — that's something you can't do with a homeomorphism — so these are two spaces that are homotopy equivalent but not homeomorphic. The exact homotopies aren't very interesting, but: f squishes the interval to the point, g includes the point at 0, and the homotopy at time t multiplies by t, so at time zero it gives you the constant map at 0, and at time one it gives you the identity. So, slogan: two spaces are homotopy equivalent when one can be continuously deformed into the other. Do interrupt me if you have any questions at any point, by the way, or if you want me to go back in the slides and re-give definitions. There's a lot to take in, and if you're not familiar with algebraic topology you might get lost, so please tell me to go back to definitions, or to explain in more detail.

Okay, anyway: the reasons why you care about simplicial homology are manifold. The reason is manifold. First of all, it's functorial, and as a category theorist I cannot not mention functoriality. If you're given a continuous map from one topological space to another, that induces a map on homology groups — it's been a long day — and this assignment respects composition and respects identities, and therefore it respects inverses, and it respects isomorphisms, and all that nice stuff. Second — and I'm kind of ruining it already — it's homotopy invariant. If you have two maps that are homotopic, so you can squish one into the other, then the maps they induce on the homology groups are equal. Homology only depends on the homotopy class, if you like, of the maps you feed it. So in particular, take a homotopy equivalence: the composites are homotopic to identities, and when you apply the homology functor, homotopic maps become equal maps, so the composites become identities and the induced maps are isomorphisms. Homotopy equivalent spaces have isomorphic simplicial homology groups. If you take nothing else away from this, take away that homology respects continuous maps and is homotopy invariant: spaces that squish into one another have the same homology groups. I should say why we care about things that squish into one another having the same homology groups.
It's because we're going to use this on our data, and we're going to be kind of cruel to our data. We're going to squish it and project it and thrash it around, and we want to make sure that while we're doing all this abuse, we're not messing with any of the invariants we actually want to study. The holes can't be got rid of by continuous deformation. So we've got all this data; we squish it, whack it around, hit it with a hammer, do whatever we want with it, and homotopy invariance guarantees that nothing we care about changes. That's why this invariance is so important. The other thing is that, because we're applying this to statistics, computability is going to be very important. If you take some more refined invariant, like the fundamental group or the higher homotopy groups, those are typically extremely difficult to compute. Statisticians would laugh at you, walk away, and go and do their t-tests instead. Yes?

"What sort of complexity are you looking at for computing these homology groups?" Huge. "But isn't that a problem in practice?" Huge, but possible — and in cases where your data is sufficiently nice, very, very computable. "In the sense of...?" In the sense that this little midget of a laptop could do it. My little midget of a laptop could certainly compute the homology groups of that triangle. "But the homology groups when you have millions of points of data?" Yes — yeah, sure, that's harder. If there's a complexity answer you're after, the literature is a slightly tricky one. There's a slogan which says homology is just linear algebra, and to an extent it's true — but only to an extent, and some topologists would probably kill me for saying it, especially since the theory gets more involved than that.

Anyway: Betti numbers. Yes, Betti. The k-th Betti number is the rank of the k-th homology group. In the case when the coefficient group A is a field, that's just the dimension of the vector space you get as the k-th homology group. The rank of a group is the smallest number of generators — the size of a smallest generating set. For example, the integers are generated by the number 1 under addition, because 1 + 1 + 1 + 1 is 4, and by induction you get every n. Is that true? Well, I just said it, so it's true.

So the intuition here is that the homology groups, by means of their Betti numbers, tell us how many holes our space has. In the case of the triangle, we found that the first simplicial homology group has rank 1, so we say it has one one-dimensional hole. And β_0, the rank of the zeroth homology group, is just the number of connected components of the space. So who wants to guess the zeroth Betti number of the triangle? One, right — it's the number of connected components, and this has one connected component, because the triangle is all there is. Okay, good. So we've already seen that the first homology of the circle — which is the triangle; circle, triangle, same thing — is Z. Now I'm going to do a slightly more complicated example. I'm going to add a point and join it to two of the other vertices, so I have two triangles stuck together along an edge. So I have something that now looks like this.
I've joined 4 to my triangle, so I've got 1, 2, 3, and then this extra bit. Okay? I want to compute the homology groups of this thing. Can anyone guess the zeroth homology group? Z — that's right, same as before, because it's connected. But that's not the one we're going to work out. We're going to work out the first homology group. Does anyone want to guess what that might be? "Z squared?" Yes: it's going to be the free abelian group on two generators. So, okay, I should have said: β_1 = 2. There are no two-simplices here either, so when we look at the boundaries — remember, we're working out cycles divided by boundaries — there are no boundaries of two-simplices, because there are no two-simplices; that's just 0. So whatever we get is just the group of cycles, and we only need to find what the cycles in this thing are. To do that, you do the same mathematics I did earlier — it's just linear algebra, nothing to it — and you find that the group of cycles is generated by the thing we had before, which is what you get when you go around the first triangle, plus one new thing, which is what you get when you go around the other triangle. So you've got one generator corresponding to this cycle and one corresponding to this cycle, and sure enough, a free abelian group generated by two elements is just Z^2. So the Betti number is 2, and 2, sure enough, is the number of holes.

And this is very robust. For example, take this bigger complex from before — we're not going to actually grind out the computation. What's the zeroth Betti number of this thing? Three, because it has three connected components: you've got lonely vertex 5 (rhymes with "lonely", lonely 5), you've got this bit at the end, and all of this big piece here is connected. Okay. How many one-dimensional holes does it look like this has? One: this hollow triangle here is the only hole bounded by one-simplices, and there are no other one-simplices kicking around to make holes. I mean, there's a triangle of one-simplices here, but it's filled in — there's no hole there. And if you remember, I said the tetrahedron was hollow, so the second Betti number is 1 as well: a hollow tetrahedron encloses a two-dimensional hole, and there's no other hollow three-simplex. And H_k is trivial for k greater than 2, because there are simply no cycles or boundaries to consider — there are no simplices of dimension k at all.

So now I'm going to talk about how you actually get complexes from data. We've been talking about having a finite collection of points in R^n, scattered around as point cloud data like raindrops or whatever. And typically — I mean always — finite subsets of R^n are discrete: you can separate all the points by open balls. Now, if all you have is a bunch of zero-simplices, the homology groups are going to be very boring. There are no one-simplices, no two-simplices, no three-simplices, so the zeroth Betti number is just the number of points in your set, and all the other homology groups are zero, because there's nothing to consider. So yes, that may technically be the answer, but we're not going to look at just the points. We're going to look at how the points sit in relation to one another.
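Before that, since I keep saying this is all "just linear algebra", here's a hedged general-purpose sketch — mine, with field coefficients again, so it sees ranks but not torsion — that reproduces the glued-triangles answer:

```python
import numpy as np

def boundary_matrix(k_simplices, faces):
    """Matrix of d_k: alternating sum over deleting each vertex in turn."""
    index = {f: i for i, f in enumerate(faces)}
    D = np.zeros((len(faces), len(k_simplices)))
    for j, s in enumerate(k_simplices):
        for m in range(len(s)):                 # delete the m-th vertex
            D[index[s[:m] + s[m + 1:]], j] = (-1) ** m
    return D

def betti_numbers(simplices, top_dim):
    """Betti numbers b_0..b_top_dim from ranks of boundary matrices."""
    by_dim = {k: sorted({s for s in simplices if len(s) == k + 1})
              for k in range(top_dim + 2)}
    rank = {k: (np.linalg.matrix_rank(boundary_matrix(by_dim[k], by_dim[k - 1]))
                if by_dim[k] and by_dim.get(k - 1) else 0)
            for k in range(top_dim + 2)}
    # b_k = dim C_k - rank d_k - rank d_{k+1}  (cycles modulo boundaries)
    return [len(by_dim[k]) - rank[k] - rank[k + 1] for k in range(top_dim + 1)]

# Two triangles glued along the edge (2,3): four vertices, five edges, no faces.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
print(betti_numbers([(v,) for v in (1, 2, 3, 4)] + edges, top_dim=1))  # [1, 2]
```

One connected component, two holes — exactly the answer from the board.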
So I guess the point I'm trying to make is: suppose your points were dotted around like this, and they kind of look like they'd form a circle if you gave them enough time. You'd probably want the thing you ultimately build from them to look a bit like a circle when you take its geometric realization. We want a hole, right? We want to be able to say: oh look, there's a hole. Our ultimate goal is a way of computing some kind of homology of something built from this data which says exactly that — because that's the entire point of everything we're doing.

And there are a couple of obvious things to do. Remember this photograph — this beautiful, beautiful photograph. If we can extract continuous structure from this discrete set of pixels in our minds, we should be able to do so computationally. That's what I'm trying to get at: we've got a bunch of discrete points kicking around, and we want to join them together. When I look at this picture, I'm not just thinking about how embarrassed I am, and I'm not just thinking "here's a pixel, here's a pixel, here's a pixel" all the way across the picture. I'm thinking: these pixels look like they should be joined together in some way. In my mind, I'm beefing the points up a bit and connecting them. And this is the intuition that's going to drive how we get a simplicial complex from discrete data: we're going to build a simplicial complex based on nearness of points. This is going to bump us up to really annoyingly high dimensions, but fortunately homology can deal with that — it takes quotients and everything works out nicely. For example, the homology of the solid torus is still the same as the homology of the circle, because you can squish the solid torus down to its core circle, even though the simplices involved live in high dimensions.

So, okay, good. One possible way to do this is the Čech complex. The Čech complex says: around every point, put a ball of radius ε. Whenever two of these ε-balls meet, put in a one-simplex; whenever three balls have a common point, put in a two-simplex; whenever four meet, a three-simplex; and in general, whenever k+1 balls have a common intersection, put in a k-simplex, along with all of its faces, so that what you get really is a simplicial complex. So around each point there's an ε-ball, and for each point x you look at the sets of points in your data whose balls all meet. If such a set has k+1 elements, you put in a k-simplex. That's one option. I'll give an actual picture of this in a moment, but the idea is: if you've got these three points here, and you put these balls around them, then this ball intersects the other two, this one intersects the other two, and this one intersects the other two, with a common point, so you get this kind of filled-in two-simplex here; whereas this point over here only meets one ball, so you just get the one-simplex joining them.
"Don't they need to have the same radius?" Oh yeah — imagine they have the same radius. Good point, very good, thank you. These balls are all the same size; this one is just drawn closer to you.

Another way is to say: I'm going to put in a simplex whenever the points are pairwise within ε of each other — "are you in my ball?" instead of "does my ball intersect yours?". So in this case, if this point isn't in that ball, and that point isn't in this ball, and so on, you put in nothing, and you just get a discrete set: four vertices, no one-simplices, no two-simplices. So we have the Čech complex, which is "does my ball intersect yours?", and the Rips complex, also known as the Vietoris–Rips complex, which is "are you in my ball?" Maybe I should say neighborhood — "are you in my neighborhood?" It sounds like a gangster. These are going to give you different-looking complexes. (I believe, by the way, that the labels on this slide should actually be the other way around: this one is the Rips complex and this one is the Čech complex.) So suppose we're given this data, and you fix some ε — there may be a factor of 2ε floating around here — then the balls overlap in different ways, and you put in the appropriate simplices depending on how they overlap.

So are these things actually different? Can you just replace ε in one of them with 2ε in the other and get the same thing? Fortunately, the Čech and Rips complexes can be compared just by changing the sizes of the balls — they sandwich one another. And the Rips complex, given that you're only ever computing pairwise distances, is computationally slightly easier to work with. They're both still computationally a bit of a nightmare; I think there are people working on that, or at least people have declared it a problem, so it'll get worked on if enough people decide this is interesting. We're going to use the Rips complex, just because we need to fix one of them; I believe everything I say would also work with the Čech complex. So I'm going to be asking: are you in my neighborhood? (A code sketch of this construction follows in a moment.)

The question we have now is: are the Betti numbers of the Rips complexes, as ε varies, enough to classify the holes in the data? We've got all these points, we put balls around them, we've built a complex, and we're hoping the complex tells us something about holes in the data. Well, the answer, unsatisfyingly, is no. (This is the problem with using other people's images — sometimes they don't render properly in your PDF.) We've got these points, and they've kind of got a hole in them, in a very loose sense — imagine there are more points than are drawn; we want to say there's a hole in the middle, and we want to detect it. Well, if I pick a smallish value of ε, my homology groups think the data has two holes in it. As I make ε bigger, my homology groups now think my data set has three holes. Bigger again: still three holes — but notice, not the same three holes.
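Here's the promised sketch of the Rips construction — my own illustration; it enumerates all candidate simplices, which is fine for toy examples and hopeless for real data:

```python
import numpy as np
from itertools import combinations

def rips_complex(points, epsilon, max_dim=2):
    """All simplices on points pairwise within epsilon: 'are you in my ball?'"""
    n = len(points)
    close = lambda i, j: np.linalg.norm(points[i] - points[j]) <= epsilon
    simplices = [(i,) for i in range(n)]
    for k in range(1, max_dim + 1):
        simplices += [c for c in combinations(range(n), k + 1)
                      if all(close(i, j) for i, j in combinations(c, 2))]
    return simplices

# Eight points on a circle: a moderate epsilon links each point only to its
# two neighbors, so the complex is a loop; a huge epsilon fills everything in.
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
print(len(rips_complex(circle, epsilon=0.8)))   # 16: eight vertices, eight edges
print(len(rips_complex(circle, epsilon=3.0)))   # every candidate simplex appears
```

Feeding the ε = 0.8 output to the betti_numbers sketch from earlier gives [1, 1] — one component, one hole, exactly the circle we hoped to see. The point of the example on the slide is that this nice behavior is not stable as ε varies.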
So my homology groups now think there are holes here, here, and here — but this original hole has been filled in by higher-dimensional simplices. It's been filled in; there's no hole there anymore. And then one hole, then no holes. That's kind of bad. So we need to find some way of preventing this from happening — which doesn't seem very likely — or some way of detecting when it happens. Yeah, Kevin? "Doesn't it have a lot to do with the choice of epsilon?" It does. I imagine you could try different values of ε for each ball, but then you'd kind of already need to know where the holes are in order to decide where to put which values of ε — at least to do it in a computable way. And if your data is 3 million points in 47,000-dimensional Euclidean space, it's very difficult to see where the holes are. The point is that we want to use one single parameter, do everything in one go, run some computation, and have the holes fall out. You could maybe do some iterative process where you run it with a constant value and, if holes misbehave, go back and re-run with different values in different places; I don't know. But we will have another solution — keep watching. (By the way, I took all of these images from Robert Ghrist's paper, because I don't know how to make them myself and his look really pretty; you can find them there.)

So we're going to do something called persistent homology. It's called persistent homology because you look at what persists as the parameter varies — I don't know who came up with the term. The idea is that we're going to detect the k-dimensional holes by varying ε, and we're going to record the result using these things called barcodes. Let me explain what barcodes are, after a sip of my delicious lemonade. Along the x-axis down here is the parameter ε. Here ε equals zero, which never does anything, so just imagine it starts slightly to the right. At particular values of ε, we look at the Rips complexes with balls of radius ε — each of these dashed vertical lines corresponds to one of the complexes pictured, at increasing values of ε as we go along. Each of the horizontal bars coming down the y-axis corresponds to a generator of the relevant homology group. If you remember, the generators of homology groups are cycles modulo boundaries — ways of going around a hole, where two ways that differ by a boundary count as the same hole. This top barcode is H_0: at the first dashed line we have 1, 2, 3, ..., 10 connected components, and that line is supposed to intersect ten bars. As we increase ε, the number of connected components decreases, and that happens by bars dying — generators of the homology groups getting merged away.
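The H_0 row of the barcode is the one part that's easy to compute from scratch, so here's a hedged sketch of it (my own): sort the pairwise distances and merge components with a union-find, recording a bar for each component that dies.

```python
import numpy as np

def h0_barcode(points):
    """Bars [birth, death) for connected components of the Rips complex."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    bars = []
    for eps, i, j in edges:                 # every component is born at eps = 0
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri                 # two components merge: one bar dies
            bars.append((0.0, eps))
    bars.append((0.0, np.inf))              # the surviving component persists
    return bars

rng = np.random.default_rng(0)
two_blobs = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
print(h0_barcode(two_blobs)[-2:])   # last finite bar dies when the blobs merge
```

Two well-separated clusters give two long bars, which is exactly the "oh, that line's longer than the other one" reading we're after. Back to the slide.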
And the same happens here and here and here, and sure enough, past this value of ε the complex is connected — and it stays connected, because adding more simplices can never disconnect it. So you'll notice that one generator persists for all time, and we'd like to conclude that our data is essentially connected, given that it persists — though maybe our ε values are just too big, or something. Anyway. The idea is that we look at this barcode and see what persists. We decide what a reasonable range for ε is, look at all the snapshots across it, and work out what persists.

What's the formal version of persistence? As I mentioned, as you increase the size of the balls, the complexes embed into one another — all you ever do is add simplices. An inclusion of complexes is, if you like, an inclusion of topological spaces, and that induces a map on homology groups. We define the persistent homology group to be the image of that map. So given an increasing sequence of ε values, going from the i-th to the j-th — say from the third to the fifth — we define the (3,5) persistent homology group to be the image of the map induced by the inclusion of the complex at the smaller ε into the complex at the larger ε. And there's a theorem of Zomorodian and Carlsson which says that the Betti number — the rank — of this persistent homology group is equal to the number of bars of the barcode which persist across the entire parameter interval. So if you decide on two values of ε that you think are interesting, you can read off which generators of your homology groups survive over that whole range. If you're interested in varying ε from this line to this line, the generators of your persistent homology group are this bar, this bar and this bar, and nothing else: not this one, because it stops; not this one, because it appears and leaves; not this one, because it only starts partway in. (Actually, arguably you'd want to count that last one too — that's what you'd see if you shifted the interval.) So these persistent homology groups tell you which generators persist over your chosen range.

I guess another question is: is this actually useful? This is all very nice in theory — yay, we can count holes — but we don't yet know whether it's good for anything. So now I'm going to talk about an example discussed in the Robert Ghrist paper: detection of natural images. If you remember back to nearly the very first slide, when I showed a picture of my beautiful self and a picture of a load of random pixels, you were able to detect which was which, and we might hope that corresponds to some tangible property of the data set involved. So Lee, Mumford and Pedersen — I don't know how you pronounce that; looks Danish to me — took some images and actually did this. There were a number of studies, and I don't remember exactly what each did, so look at the references at the end if you're interested. You can consider a digital image as a vector in R^n for a very large value of n: one dimension per pixel, maybe more dimensions if it's in color. Ours are going to be grayscale.
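One practical aside before the image example: nobody computes these barcodes by hand. A sketch assuming the third-party ripser package (pip install ripser); any persistent-homology library would do the same job.

```python
import numpy as np
from ripser import ripser  # third-party: pip install ripser

# A noisy circle in the plane -- the toy version of "data with one hole".
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(0, 0.05, (100, 2))

dgms = ripser(X, maxdim=1)['dgms']   # persistence diagrams for H0 and H1
# Each row of a diagram is (birth, death) for one bar.  For a noisy circle,
# dgms[1] should contain exactly one long bar: the hole persists.
long_bars = [(b, d) for b, d in dgms[1] if d - b > 0.5]
print(long_bars)
```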
We want to make this lower-dimensional, just so that we can actually compute things realistically — again, this computability business. So here is a nice big zoom-in of my mouth and nostrils, the most attractive parts of a human face. The idea is that we're going to look at 3-by-3 sub-squares of pixels. If I look at this particular 3-by-3 patch, you can see it's darker here and slightly lighter there; if I look at this one, I've got a couple of almost-white pixels here and then darker ones up here. We take as our data all of the 3-by-3 patches that appear in our images. We could potentially have loads of data points, but at least they're going to sit in nine-dimensional space, because a 3-by-3 square has nine pixels: each patch lives in R^9, and the values it takes are the integers between 0 and 255. So for example, this pixel here would be near 255, because it's almost white, and this one near 0, because it's almost black.

Okay, so what are we going to do with this data? If you have a bunch of points in R^9 taking integer values like that, it's going to be kind of difficult to analyze directly — what are you going to do with it? So we're going to project onto a 7-sphere, which is my go-to solution to all the problems I have. Can't sleep at night? Project onto a 7-sphere. It works a treat. First, we reduce the data set to something tangible: we ignore the patches which look almost constant. If your nine pixels are nine very similar shades of gray, you get thrown out of the data set; if they're nine shades of gray that differ a lot from each other, you stay in. So we keep only a certain percentage of our data set — the patches with sufficient contrast. Then we mean-center: we subtract from each patch the mean of its nine values, so that everything has the same average brightness. That's one dimension gone. And then we normalize with respect to a norm that some clever computer-vision person invented, which roughly measures how far a patch is from the constant patches; we divide each patch by its norm. So: we reduce one dimension by mean-centering (subtract the mean), and normalizing projects us onto the unit sphere of the remaining eight-dimensional space, which is S^7. Our goal is to investigate how our data sits in S^7. We've done all these tricks — we've kicked and shafted and stabbed and pitchforked our data — and now it's on this nice seven-dimensional sphere that I'm sure you can all visualize easily.
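Here's a rough sketch of that pipeline — mine, with one loud caveat: the real study measured contrast with a particular norm (the "D-norm"), and I'm substituting the plain Euclidean norm of the mean-centered patch, which keeps the shape of the construction but not its exact geometry.

```python
import numpy as np

def patches_on_s7(image, keep_fraction=0.2):
    """image: 2-D array of grayscale values in [0, 255]."""
    h, w = image.shape
    patches = np.array([image[i:i + 3, j:j + 3].ravel()       # 3x3 -> R^9
                        for i in range(h - 2) for j in range(w - 2)])
    centered = patches - patches.mean(axis=1, keepdims=True)  # mean-center
    contrast = np.linalg.norm(centered, axis=1)   # stand-in for the D-norm
    high = centered[contrast >= np.quantile(contrast, 1 - keep_fraction)]
    # Normalizing makes each patch a unit vector in an 8-dimensional
    # subspace of R^9 (coordinates sum to zero): a point on S^7.
    return high / np.linalg.norm(high, axis=1, keepdims=True)

rng = np.random.default_rng(2)
fake_image = rng.integers(0, 256, size=(64, 64)).astype(float)
print(patches_on_s7(fake_image).shape)   # (number of kept patches, 9)
```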
And our goal is to investigate how this data sits in there; maybe we can say something about the data. To do that, we're going to look at the codensity — oh yeah, Sam? "I want to make sure I understand. Usually, in data analysis, each image is one point in some big-dimensional space, right? Is that what we're doing here?" No. You're taking these 3-by-3 pixel sub-images as individual points in a nine-dimensional ambient space. "Okay, so we're not retaining the structure of where the patches sit in the image?" Right — we don't care how they're joined together. I guess the goal of all of this is just to find out what distinguishes natural images — photographs of actual stuff — from hit-the-random-number-generator-and-spray-pixels-on-the-screen images. We can do it in our minds, and this might give us a way of doing it with our computers. We lose loads of information about the picture — we wouldn't be able to reassemble it from the data we keep. We're just looking at whatever data we can get our dirty hands on, and hoping that the 3-by-3 patches of a picture are enough to tell you something about the picture. "That is what you're hoping for?" Yes. Someone much cleverer than me spent a lot of time thinking about how you'd analyze a picture, and I imagine loads of methods were kicked around, and sampling small sub-squares seemed a good idea at the time.

So: you define the codensity of a point x, with parameter k, to be the distance from x — one of these 3-by-3 patches — to its k-th nearest neighbor. The codensity is low exactly where the data is dense, which is why it's called codensity: the distance from you to your k-th nearest neighbor is small when your k nearest neighbors are all close to you. Now, we might not want to look at all of our data points. Given our set of patches, we let V(k, t) be the set of points x for which the proportion of the data set that is denser than x — has smaller codensity — is at most t percent. In other words, based on k-th nearest neighbors, you're keeping the t percent densest part of your data set. The parameters are: how many neighbors do you look at, and how dense do you insist on being. This is just a way of narrowing the data down to something more manageable. So this now gives us a bunch of points on our sphere, narrowed down to where the data is densest, and we're going to investigate the persistent homology groups that you get from the inclusions of the complexes built on this data set.

So this actually happened, and what I'm going to show you is what actually came out. Take k = 300 and t = 30: in other words, we look at the points whose distance to their 300th nearest neighbor puts them in the densest 30 percent of the data — denser than the other 70 percent. So this is a fairly tight restriction of the data set. And this is what the barcode of the first homology group of that data set looked like. (One quick code sketch of the codensity filter first.)
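The codensity filter, sketched with SciPy's KD-tree doing the nearest-neighbor queries — my own illustration again; on the real data set, with vast numbers of patches, you'd need to be more careful:

```python
import numpy as np
from scipy.spatial import cKDTree

def densest_points(X, k, t_percent):
    """Keep the t% of points with the smallest codensity.

    codensity_k(x) = distance from x to its k-th nearest neighbor,
    so small codensity means x sits in a dense region.
    """
    dists, _ = cKDTree(X).query(X, k=k + 1)   # k+1: nearest neighbor is x itself
    codensity = dists[:, -1]
    return X[codensity <= np.percentile(codensity, t_percent)]

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 9))
X -= X.mean(axis=1, keepdims=True)                 # mean-center, as before
X /= np.linalg.norm(X, axis=1, keepdims=True)      # toy points on S^7
print(densest_points(X, k=300, t_percent=30).shape)  # roughly 300 points kept
```

Now, back to that barcode.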
So you notice that you've got all these generators of the first homology group appearing as ε increases — little one-dimensional holes — and then, bloop, they get filled in by simplices. But one of them: ε gets bigger, and bigger, and bigger, and it goes all the way to here before the big hole in the middle is finally filled in. Just from looking at this — and like I say, we're after qualitative information — you'd hope that the space the data is sampled from has one distinguished hole in its first homology: it looks like there's a circle in there. So what the authors did was look at this and say, hmm, and go back through the data set. You've got this 7-sphere, and you can work out what the points along such a circle — a great circle on the 7-sphere, if you like — look like as patches, by varying a parameter. And what they found was: sure enough, there was a big loop. These pictures label points of R^9, showing the corresponding 3-by-3 patches, and the circle consists of the light-to-dark linear gradient patches, rotating through all the orientations. So, going back to this photo: does this make sense? Well, I've got dark bits fading to light bits down here, and some here, and some here. It makes sense that those gradient patches are the sort of thing you see a lot of in a natural image. But it's kind of cool that it came out of the homology groups, I think. That's pretty cool.

Then they lowered the value of k and lowered the value of t: k = 15, t = 25, so you're only looking at your 15 nearest neighbors, and you want to be in the densest 25 percent of the 7-sphere, if you like. Dense points on the 7-sphere; we just look at what happens. In this case, we get lots of little loops that appear and disappear — but then, for a reasonably large parameter interval, there are five generators. Now, you can get a space with five one-dimensional holes as follows (a picture's coming up anyway; I don't know why I'm bothering to draw this): take the circle we just found, and two other circles which each intersect it twice but don't intersect each other — you can arrange that, because we're in high dimensions. Count it up and you get exactly five independent loops. So again they went back to the data set and had a look. And sure enough, they found the same circle as before — that's presumably the long-lived generator, the large family of patches with this linear gradient behavior — but also two new circles of patches, which look like this. They're patches which interpolate between light and dark, but in a very particular way: one family varies in one direction and the other in the perpendicular direction, and each new circle shares a couple of patches with the primary circle, which is how they intersect it but not each other. By changing a parameter along each circle, you trace out these families of 3-by-3 patches. And we can kind of see one in the photo, right? There's a dark stripe in the middle and lighter on the outside.
And so the idea is that by varying these parameters, you can use the homology groups to find out what the dominant — and less dominant, but still prominent — features of a natural image are. So I'm just going to review what we did, and then that will be it. We take a huge data set, like a bunch of pictures (when they actually did this, they used loads of pictures — it was crazy), and we want to find interesting, useful, qualitative information about it. We might not know what we're looking for, but we might be able to find it with these barcode things. We kick and shove and push and squeeze and crush our data set — but not so hard that anything we care about cracks — and interpret it as a subset of a sphere. We compute its persistent homology groups, using these Rips complexes, and see how they react as ε varies. We look at the barcodes, like we did, and see which generators persist over time. And then we go back and look at the data, and see what it was that made those generators of the homology groups do what they were doing. And hopefully that tells us something about the data. The papers you'll want to look at are by Robert Ghrist and Gunnar Carlsson; they come up if you search "topological data analysis" on Google. I've not read them in full — this is just a summary of what I read over the last couple of days — but I find it really cool. So thank you all for listening. Are there any questions?

"When you have a hole, does that mean that there's some type of 3-by-3 arrangement that does not show up?" Well, the 3-by-3 arrangements correspond to points on the sphere, and I guess the interesting thing is that it's a one-dimensional hole and not, say, a seven-dimensional one. If your points were distributed uniformly over the 7-sphere — or, easier to imagine, over a 2-sphere — you'd expect, when you build this complex, to just get a sphere back: one two-dimensional hole and no one-dimensional holes. What the calculation is saying is: you've got this circle going around your sphere, like the equator, and there's not very much going on near the North and South Poles.

"So in the cases you're showing, what would the poles be, so to speak?" Well, I imagine an example of a 3-by-3 arrangement that you wouldn't see a lot of in a natural image is the one you get by alternating black and white pixels, like a tiny checkerboard. That seems pretty far from the kind of thing you see in a natural image, so I'd imagine it sits on the 7-sphere somewhere far away from this equator we were talking about. Yeah?

"So it seems like this work is primarily looking at using this for detection tasks. Are there any other applications of this sort of methodology?" So, I think this is all very new, and this particular study was a "let's see what we can find" kind of thing — I don't think they set out needing some particular piece of information about some particular thing. It was more: let's do a thing and see what happens, and this happened. I think there's a hope of applying it in biology, medicine, that kind of thing. There are probably answers to all these questions in the parts of the papers I've not read yet.
There are definitely more applications in the Carlsson paper — I read probably the first 20 of the 50 pages, and he goes over more examples and into more detail about all this stuff. So hopefully it's finding uses; maybe not.

"You had to project quite a bit — how bad is this computationally? How long did it take?" I mean, they did it in 2010, and computers were slow back then — actually I think it was even earlier, 2003 or something, for that particular study, so you're talking something like a 133 megahertz processor. Honestly, I just have absolutely no idea, I'm sorry. Robert?

"I just want to make sure I understand. You were talking about projecting those data points to S^3..." Well, it was actually S^7. "S^7, okay. And you're not actually using the distance on the sphere, but rather..." Yeah, you're using something called the D-norm, which is a different way of measuring distance — it's not the usual distance. It's a quadratic form in the logarithms of the grayscale values of the points in the grid. But I imagine it's topologically equivalent to the Euclidean distance, because topology. "Do you have an intuition for the projection? How do you project?" Well, this thing defines a norm, and you project onto whatever unit sphere is defined by that norm. For example, if you use the infinity norm, your "sphere" looks like a cube, right? Similar idea. But topologically it's all the same, as I said.

"It seems like this idea is easy to game. You just look for the things that seem to persist, and then generate an image with those features. Is that going to look like a natural image to us?" Probably not. But this tells you one feature of natural images, right? If you're doing natural image versus random image, a natural image is going to have more of this gradient structure, probably. It's definitely not a sufficient condition for an image to look natural. We've thrown away a lot of information. For a start, we got rid of almost all the data points by saying "you don't have enough contrast". Then we did all this normalization — "I don't care how bright you are" — and then we projected onto a sphere: "I don't care what your distance is from anything". So we lose loads of information; it just so happens that the information we keep is relevant to the problem. I'm sure there's another class of images with the same properties, and then the question would be: what distinguishes those from natural images? And you could maybe use this machinery to find out. Debbie?

"Doesn't this depend on the resolution? It feels like this trick, taking the 3-by-3 sub-grids, kind of depends on the particular resolution. If you had enough pixels, the 3-by-3 sub-grids would tell you basically nothing." I imagine you'd need to be taking photos at a fairly standard resolution, and if you wanted to move to high resolution, you might want to increase the size of your patch. For example, for a 32,000-by-32,000-pixel image, you might want a 10-by-10 grid instead of a 3-by-3 grid.
Or you could just decrease the resolution artificially and then look at 3-by-3 grids. The paper that actually did this probably tells you exactly what resolution of photos they used; it's not something I know. But for sure, if your picture were a million by a million pixels, and it was just a picture of my ugly mug spread out that much, almost all of the 3-by-3 patches would show essentially nothing — no change, no contrast. So yeah, you'd probably need a reasonably low resolution. Or bigger boxes.

"Is homology the only thing they use to recover the information? Because it seems really insensitive to the shapes of the holes — the only thing you recover is the number of holes there were." Yeah. But I suppose the point here was that recovering the number of holes was sufficient to then go and explore the data and find out what the holes were telling you. Knowing that there is a hole helps you go and find out what it is. And there is more information available than the barcode alone: you can find out what a generator of the homology group actually is, go back to the data, and have a look — and then these patterns pop out. But yes, you lose loads of information; that's just what happens. And the fact that you lose loads of information is almost kind of good, because it means you're doing less work: it's less computationally expensive, and you're throwing away potentially irrelevant data as well. But absolutely, you lose loads of data.