The second speaker is Vipul Periwal. He's going to talk about something other than large deviations. So we talked about a lot of large deviations. No large deviations here, and no kinetic Ising models. All right, light entertainment. So this may not have been evident in this school and set of talks, but biological data is very noisy. And so theorists, when they arrive at trying to model biology, usually fall flat on their face because they don't know how to deal with very noisy data. All right? So please, keep this in mind. If you think that your beautiful differential-equation parameter finding is working great, try it with real data, and if it still works, then you've got something. So one of the policies that is very strictly enforced in my group is that if we're developing anything theoretical, we have to have some real data that the algorithm should work well on. We don't even start doing any modeling unless there's some real data that we aim for. So a typical problem that my group has studied for a while is the arrangement of cells in pancreatic islets. And what's an islet? The reason your blood glucose levels stay roughly constant most of the time, even when we keep eating cookies and little bits of pie out there, is because your pancreas has these little organoids called islets, which have these cells that in this picture are colored red. And there are green cells, which are the alpha cells. The beta cells, the red cells, secrete insulin. Insulin gets muscle and various other organs to absorb glucose, and that reduces your blood glucose levels. And the green cells, the alpha cells, produce glucagon, which is responsible for your brain getting enough glucose even when you're not eating, or haven't eaten for a while. So there are three sub-regions in this islet. There's the mantle, which is the surrounding region. There's a core, and this empty region here is where the blood vessels are.
Now, the arrangement of beta cells in an islet is supposedly functional, and this is an area of huge controversy, because for a long time, the only things that were available were slices through islets. These are three-dimensional organoids. You go around taking slices, and really there's a lot of imagination out there. And you can come up with all sorts of hypotheses as to how they're arranged. So one of the hypotheses is that alpha cells form a sort of cover around the beta cells, which are in the center, OK? Now, a postdoc in my lab, Debbie Striegel, did a very beautiful geometric analysis finding this kind of structure, the probability of a given islet having such a structure, and so on and so forth. So when Manu, whose work I will be presenting today, started in my lab, Debbie had moved on to better things. I asked him to check out what Debbie was doing so that we could finish up her work and publish. So Manu did that, but then he got bored, and he said, why don't I try topological data analysis on this, just as a different way to verify Debbie's results. So Debbie did very, very careful geometric analysis in three dimensions, with angles and when you get a sphere and so on and so forth. Manu decided to do topological data analysis on this, so he could quantify when and if there were these mantles. So now I'm going to tell you what topological data analysis is. I'm going to talk about a specific sub-area of topological data analysis called persistent homology. And so I have to tell you what homology is. Homology is something that came out of the work of Poincaré. Poincaré was trying to figure out how to mathematically quantify what it means for a shape to be, say, a torus versus a sphere. You can deform it in different ways, and it's still a donut.
So what you want to do is find an invariant characterization that doesn't depend on the geometric embedding: whether you make it a very elliptical donut or a round donut, it's still a donut. It's characterized by having two non-contractible loops. So that's what Poincaré was aiming for. And so there's a huge branch of mathematics called algebraic topology, which basically converted the study of these invariant shapes to algebra and group theory. Generically, just algebra, really. And persistent homology is a way of trying to compute the homology when you don't have some continuous nice manifold; what you have is a bunch of data points. And you're trying to figure out: where are the regions where there isn't any data? There's a hole. How do I quantify, in this noisy data set, that this is not a significant hole and this is a significant hole? That's the whole aim here, to be able to quantify the significance of these holes and when they exist. So what's the basic idea? Suppose you take one parameter, which is basically how far away the nearest point that you're connected to is. And then we're going to vary this parameter, which is labeled tau here. As you increase tau, you get connected networks. Now if you get connected loops, or connected networks, that you can contract, that's not a hole. But as you scan through tau, you may see a non-contractible loop appear; and as you keep increasing tau, you start getting connections across the loop so that you can squeeze it out. So then that hole that you found dies. So that's what I mean when, as we go on, I refer to a topological feature being born; and then, as we keep scanning tau, that feature dies in the persistence diagram. A significant feature is usually one that has a long trace in the persistence diagram. Now in this two-dimensional case, it's sort of easy. You say, oh, this is a significant hole.
And it depends on the length scale of your data whether you decide that this is also a significant hole and this is not. See, that one's filled in, so it can be contracted to nothing. Something that has a hole in it cannot be contracted. So in a persistence diagram, as I said, there's a tau value at which a feature is born, and there's a tau value at which the feature dies. So if I do the persistence diagram for this cloud of points, intuitively you might think there's only one feature. But you'll notice that on this persistence diagram, with birth on one axis and death on the other, most of these features are pretty close to the diagonal. There's only one feature that's up there, way off in the persistence diagram. What that means, when a feature is off the diagonal, is that there's a long interval between the birth and the death. There's quite obviously nothing below the diagonal, right? Unless you believe in some sort of reincarnation, in which case you might be dead before you're born. But in this topological setting, that doesn't really happen. So every feature is up here. And the further away they are from the diagonal, the longer the length scale over which they persist. Yes? So that is sort of obvious, right? Now here's a three-dimensional point cloud. I'll give you one look at it. Can you tell that there's a hole in it? No, it's not so obvious when you go to higher dimensions. And yet, I can give you this point cloud, which is just a bunch of three-dimensional vectors, and you do the automatic persistent homology calculation, and you'll see there are no non-contractible loops. But in the center of this, there is actually a two-sphere, a hollow two-sphere. Not obvious if you just look at it. So that's what we're trying to get at: to find non-trivial topological features, features that are way off the birth-death diagonal. That's persistent homology. So how do we compute this? Yes? Yes?
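The birth-death bookkeeping is easiest to see in dimension zero, where the features are connected components: every point is born at tau = 0, and a component dies when it merges into another as tau grows. A minimal stdlib-only sketch (the two-cluster data and all numbers are invented for illustration, not from the talk):

```python
import math
import random

random.seed(0)
# Two well-separated 2-D clusters: all components are born at tau = 0,
# and all but one die at some merge scale.
pts = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(20)] + \
      [(random.gauss(3, 0.1), random.gauss(0, 0.1)) for _ in range(20)]

parent = list(range(len(pts)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

# Kruskal-style scan: walk pairwise distances in increasing order (the
# tau axis); the tau at which two components merge is a death.
edges = sorted((math.dist(p, q), i, j)
               for i, p in enumerate(pts) for j, q in enumerate(pts) if i < j)
deaths = []
for tau, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        deaths.append(tau)

# 40 points born at tau = 0; 39 merges; one component lives forever.
print(f"{len(deaths)} finite H0 bars; longest dies at tau = {max(deaths):.2f}")
```

The one long bar, far from the diagonal, is the second cluster; the 38 short bars near the diagonal are the noise-scale merges inside each cluster.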
A hot dog is just like a sphere? A hot dog is a sphere? Points of a hot dog would be less than a sphere? No, no, wait. You're talking about a solid hot dog, the shape of a long cylinder? Yes? Yes? That's it. So when you start the persistence, you're not looking at the longest? No, no. You're looking at the shortest scale at which you can contract it. And this is a metric notion, not a topological one? No, no, no, wait. Any actual data science starts with a point cloud with actual numbers, so there is a metric notion because of that. But if you change the metric in any smooth way, leaving it a metric, the persistence diagram will be deformed smoothly into the persistence diagram in the new metric. And things that were not on the diagonal will still not be on the diagonal, OK? What the metric determines is how far from the diagonal a feature lies. Yes? Could I tell the clusters apart? Yeah, well, as we go, we will get into how we actually tell what is significant and what isn't. That's a whole other question. OK, so how do we compute this stuff? There is what's called a zero-simplex: one point, right? That's what's called a one-simplex, which is an edge. That's what's called a two-simplex, which is given by three points filled in, right? Yes? And that's a three-simplex, a tetrahedron: A, B, C, D, filled in, OK? A simplicial complex is just a combination of simplices like this, OK? Like this is a simplicial complex where we include the points A, B, C, D; we include the edges AB, BC, CD, AD, and AC; and we include the filled two-simplex ABC. So in this case, ACD is a hole. Its interior is not present in our simplicial complex, right? That's why it's not shown filled in: ABC is in the simplicial complex, ACD is not, OK? So that's a hole. There are edges around it. And we will discuss how exactly we can say that this ACD is different from ABC, given this simplicial complex, OK? Any questions?
Any questions at all about this? Because we're going to do a lot of this, so we will go a little bit faster. Please make sure you understand what this is, OK? All right. OK, so now let's see how, given such a simplicial complex, we are actually going to compute the persistent homology diagram. So you'll see here lengths attached to each edge: BC is length 1.5, AC is 2.75, BD is 3.25, and so on, right? So I start at some value of tau, and we are going to be increasing tau as we go. At tau equals 0, nothing is connected, right? At tau equals 1, let's see, what's the smallest edge? 1. AB gets connected, right? So you see the simplicial complex changes as we keep going, right? Yes? Is that clear? The simplicial complex is changing. Initially it was just four singletons, four points. Then we got one edge and two isolated points, right? Then two connected edges and one isolated point, right? Yes? As we change tau, right? By the time we get to tau equals 2.75, we have everything except this edge BD, which happens to be longer than 2.75, right? So that's not included; everything else is included. And now if you look at this, there's a hole born when we added the edge AD, right? It's a hole because there is no way that I can deform the loop A-B-C-D to nothingness, right? Now here, at 2.75, another hole is born. Why? Because neither ABC nor ACD can be squeezed to nothing. On the other hand, when we add the two-simplex ABC, now this can be squeezed to nothing and one of the holes dies: the hole ABC dies, right? Yes? So this is how we compute the whole persistence diagram, the birth-death diagram. I'm sorry, yes, please. Is it the same way? OK, yeah. So basically, the idea is that this part is all included, right? Everything inside, right? So imagine that I can just deform B until ABC is right on top of AC, right?
So that's just gone, OK? And why would we add it? Because what's inside here is not in there, so it's a hole. I can only squeeze things which are filled in, which are part of the simplicial complex, right? The triangle ACD is not in the simplicial complex, OK? So I can't squeeze it to nothing, because its interior is not in the simplicial complex, OK? It's not a question of what I want to add. This is provided by the data, right? So we have to work with the edges, at a given value of tau, that the data gives us, right? Yes? Well, I'll have to look at precisely what the sequence is. Let's see. When we go from here, which is 2.25, to here, then we get AD, right? Oh, I didn't go on to 3.25, right? I stopped this at 2.75, so I didn't talk about the last step where you add the last edge. If you add the last one, then everything goes away, yeah. You're right. If you go up to the biggest length in there, then there's nothing left. Yeah, you're absolutely right. But I didn't draw the 3.25 step, OK? Yeah, yes? Why did I add ABC? Because the length that we're at here allows me to add the edge AC, right? Here I had the loop A, D, C, B, because all these lengths were 2.5 or less, right? And at 2.75, I can also add this last edge AC, right? ADC? No, sorry, AD was already added. It's already there in this set, yeah? No, no, no, I only add one edge, right? I added AC, because that's the edge at 2.75. So basically this triangle ABC is now a filled simplex, right? No, we can't squeeze ACD, because the inside of the triangle ACD is not in the simplicial complex. It's only the edges that are in there, not the interior. Now, this is just drawing pictures. It actually corresponds to specific matrices with 0s and 1s, depending on which of the entries are included, OK? So when I talk about birth and death, it's actually a column operation that reduces certain entries to 0 and so on.
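That remark about matrices of 0s and 1s can be made concrete. Here is a sketch of the standard boundary-matrix column reduction over Z/2, applied to the four-point example with the edge lengths quoted in the talk (AB = 1, BC = 1.5, CD = 2.25, AD = 2.5, AC = 2.75, BD = 3.25). It uses a full Vietoris-Rips filtration, i.e. a triangle is filled as soon as all its edges are present, rather than the partial complex drawn on the slide, so take it as an illustration of the method, not of the exact slide:

```python
from itertools import combinations

# Pairwise distances from the talk's 4-point example (A, B, C, D).
dist = {frozenset('AB'): 1.0, frozenset('BC'): 1.5, frozenset('CD'): 2.25,
        frozenset('AD'): 2.5, frozenset('AC'): 2.75, frozenset('BD'): 3.25}
points = 'ABCD'

# Vietoris-Rips filtration: a simplex enters at the length of its
# longest edge; vertices enter at tau = 0.
simplices = [(0.0, (p,)) for p in points]
for k in (2, 3, 4):  # edges, triangles, the tetrahedron
    for combo in combinations(points, k):
        t = max(dist[frozenset(e)] for e in combinations(combo, 2))
        simplices.append((t, tuple(combo)))
simplices.sort(key=lambda s: (s[0], len(s[1])))
index = {s: i for i, (_, s) in enumerate(simplices)}

def boundary(s):
    """Codimension-1 faces of a simplex, as a set of column indices
    (over Z/2 a column is just a set of nonzero rows)."""
    if len(s) == 1:
        return set()
    return {index[s[:i] + s[i + 1:]] for i in range(len(s))}

# Standard reduction: add earlier columns until this column's lowest
# nonzero row is fresh; that row's simplex is the birth, this column's
# simplex is the death.
columns = [boundary(s) for _, s in simplices]
low_of = {}
pairs = []
for j, col in enumerate(columns):
    while col and max(col) in low_of:
        col ^= columns[low_of[max(col)]]
    if col:
        i = max(col)
        low_of[i] = j
        dim = len(simplices[i][1]) - 1
        pairs.append((dim, simplices[i][0], simplices[j][0]))

for dim, b, d in pairs:
    if d > b:
        print(f"H{dim}: born {b}, dies {d}")
```

It recovers exactly the story told above: three H0 merge events at 1, 1.5, and 2.25, and one H1 feature born at 2.5 (the edge AD closes the loop) and dying at 2.75 (the triangles fill it in).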
There is a more precise way to do this; it's not just guesswork, is what I'm trying to say. But the key point is that as you change the length scale tau, there are edges that get connected, and those get added to the simplicial complex, OK? And let's go over this very carefully. Between 0 and 1, they're all just four separate points. Between 1 and 1.5, you get this link between A and B; when you add the 1.5 edge, you get this link B to C, right? And then at the next length, 2.25, you get a link between C and D, right? And then when you add AD at 2.5, you get this whole loop, right? OK? And then you can add one more edge at 2.75, and that gives you this, right? OK. Now everything at that point is included. You have A, B, C, D and all the edges; they're all part of this, right? OK. Now the connection between B and D is not part of it, because that is 3.25. That's still not connected, right? OK? Sorry? Question? 3.25 is also part of the ABC triangle, so we're not just adding? No, it's not. 3.25 is between B and D. ABC is this: 1, 1.5, 2.75. Yes? OK. So now there are some computational challenges. If you have a certain number of points, you have to be able to calculate with one point at a time, then two points at a time, because those are the edges, all the pairs of points. So if you have 1,000 points, then there are 1,000 times 999 divided by 2 edges, right? And if you also want to compute H1, the first homology group, which counts the non-contractible loops, then you have to include all the three-point combinations, right? So the total number of those simplices for 10,000 points is about 4 times 10 to the 10, OK? That's your matrix size, OK? Just one. Saving each simplex as four integers in computer memory will take about 165 gigabytes, OK?
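For reference, the exact combinatorics behind that blow-up can be checked directly. The talk quotes rounded, slide-dependent figures; the encoding assumed below (four 4-byte integers per simplex) is an assumption for illustration:

```python
from math import comb

# Exact binomial counts for the sizes mentioned in the talk.
n_edges = comb(1000, 2)       # pairs of points among 1,000
n_triangles = comb(10000, 3)  # triples among 10,000, needed for H1

# Naive storage if each triangle is kept as four 4-byte integers
# (three vertex ids plus a filtration index; an assumption here).
bytes_naive = n_triangles * 4 * 4

print(f"{n_edges:,} edges, {n_triangles:,} triangles")
print(f"~{bytes_naive / 1e12:.1f} TB to store the triangles naively")
```

Whatever the exact per-simplex encoding, the point stands: enumerating and storing all triples is hopeless at genome scale, which is what motivates processing simplices without materializing them.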
And we're not even talking about the computational time of actually computing the homology, OK? We're just talking about storing the data that you need for the computation, OK? So here are the pre-existing algorithms. Before we got to Manu's work, these were the algorithms: Dionysus, GUDHI, DIPHA, Ripser, Perseus, and Eirene. N is the number of data points, and P and T are things to do with the sparsity and the connectivity. So this is how much time and how much memory these algorithms take. Ripser is by far the best of these. And the dashes represent that the algorithm could not even handle the data, OK? And we're not talking about a huge number of data points: 512, OK? GUDHI could handle 50,000, and Ripser could too, OK? So this was the limitation at the point where we came into the picture. And since I like to keep my fellows, my postdocs, entertained, when Manu proceeded to show that his persistent homology approach to calculating islet structure reproduced Debbie's results, I said, well, try something more interesting, like the data that we'll talk about, which captures how the chromosomes are arranged in a nucleus, OK? And if you look at 1 kb resolution, that's 3 million data points, OK? So it seemed like it should keep him out of my office for a few months, right? I mean, look, this is the limitation here, 50,000 points, right? Surely he should be able to do 3 million, right? Yeah. So unfortunately, Manu is very persistent, so he did it. And he had to develop a fair number of new methods to index the data, store the data, reduction algorithms, and so on. It took about two years, but, you know, he didn't bother me for two years, so all good.
And he had to come up with new algorithms to process simplices without storing them all in memory, because no matter how powerful a server I get him, I'm not getting him a server that will store the two-simplices or three-simplices of 3 million data points, right? No way. It does not exist on Earth. And his algorithm we call Dory. Everyone seen the movie with Dory, the fish? Why do we call it Dory? He wanted to call it Goldfish. I said, what Goldfish? Dory, because we need very little memory, OK? So Dory can handle the human genome Hi-C data, 3 million data points, in about three minutes, OK? So he did that, right? And he said, oh, look, I computed the persistence diagram. And I said, that's useless scientifically. Why is it useless scientifically? Because in science, we actually want to know: where is the feature? It's not OK to just say there is a feature. We'd like to know which part of the chromosome is bounding the feature, right? It's completely useless to tell me that there are 3,000 holes in the chromosome arrangement. I need to know what genes are around each hole, right? So that was his next task, to keep him out of my office for another year and a half. And the point is the question: where are the features? Now, persistent homology is a topological thing, right? It doesn't care so much about where exactly the cycle is, because it's supposed to be invariant under little stretches and deformations here and there, right? But that exact property makes it difficult to find the exact location of any topological feature, right? So what I was asking him was to find tight representatives: in your data point cloud, find a tight representative for each cycle, OK? This is definitely a metric or geometric question. Given that metric, the distances between points, find the tightest representative you can, OK?
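To see why tightening is subtle, here is a deliberately naive shortening sketch (the loop coordinates are invented): greedily dropping vertices always shortens a generic closed loop, so without any homology-class check it happily collapses the cycle all the way down to a triangle:

```python
import math

def cycle_length(cycle):
    """Sum of edge lengths around the closed loop."""
    return sum(math.dist(cycle[i], cycle[(i + 1) % len(cycle)])
               for i in range(len(cycle)))

# A jagged representative of a loop (invented coordinates).
jagged = [(0, 0), (0.5, 0.4), (1, 0), (1.4, 0.5), (1, 1), (0.5, 1.3), (0, 1)]

def shorten(cycle):
    """Greedy shortcutting: drop any vertex whose removal shortens the
    loop. NOTE: this ignores topology entirely, which is exactly the
    failure mode the talk warns about; the real algorithm must verify
    the result stays in the same homology class."""
    improved = True
    while improved and len(cycle) > 3:
        improved = False
        for i in range(len(cycle)):
            candidate = cycle[:i] + cycle[i + 1:]
            if cycle_length(candidate) < cycle_length(cycle):
                cycle, improved = candidate, True
                break
    return cycle

tight = shorten(jagged)
print(f"{cycle_length(jagged):.2f} -> {cycle_length(tight):.2f}")
```

By the triangle inequality every removal of a non-collinear vertex shortens the loop, so this sketch never stops until a triangle remains: the shortest nearby loop, found without topology, need not represent the same class, which is why the real tightening took a year and a half.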
So here is the initial cycle that his algorithm computes. As you'll see, it's definitely not tight, OK? You shorten it naively and you get this, which is a little bit jagged, right? So here is the smoothed-out version of this cycle. This is all the same topology, OK? It's all a non-contractible loop, right? But this, the red one, is a tight cycle, OK? So with a little effort, we got this tight-cycle algorithm, so we can now tell exactly where these non-trivial topological features are, OK? So now we're doing science, OK? I don't think it helps you much to see everything that is going on, but this is his basic algorithm. First, you have to find the cycles. Then you have a greedy algorithm for shortening them. Then you have local smoothing for further shortening. And then he had to invent a whole different way of computing homology, because when you want to find tight cycle representatives, that's actually very computationally intensive, even more than the initial part. So he had to invent a way of covering the whole point cloud in covers, computing the representatives within each cover, then worrying about the intersections between covers, and so on, OK? It's a little bit involved. But four years after I initially set the problem to him, he just got referee reports. Math journals take a long time, OK? I guess they care about the thing actually being true. Physicists, who cares? Anyhow, he is busy revising his papers. So now let me tell you about the application that I set him on and that he successfully computed. So here's the human genome, DNA and stuff like that, at different scales. That's how the chromosome is all bundled up inside the nucleus. You look closer, you start to see individual chromosomes. You look even closer and you start to see what are called chromosome territories. They're actually kind of segregated inside the nucleus.
And the question is: how do you figure out how it's all arranged? It turns out that things like gene transcription actually depend on how the chromosome is arranged. So in different cell types, you get different arrangements. As a cell differentiates and develops, the arrangement changes, OK? And there are several diseases that arise from incorrect folding, for instance, OK? So is this folding random? It certainly is not. Is it non-functional? Well, as I said, it actually does matter. Why does it matter? Because there is a protein called cohesin, for instance. When there's a loop, the loop is arranged by cohesin in such a way that the promoter, which causes a gene to be transcribed, is actually brought close to the gene, OK? It used to be a big mystery in the arrangement of the human genome: why is the promoter of a sequence so far away from the actual gene, OK? It's because things are looped, OK? And there is a method to this looping. So how do we figure out the distance between bits of the chromosome? This is very cool; biochemists and biologists are amazing experimentalists. This is the idea they came up with. They put in a molecule called a crosslinker that links nearby DNA. Then they have restriction enzymes that chop the DNA everywhere around the crosslinked sites. So then you're left with the crosslinker and these little fragments. These fragments have sticky ends, very amazing biochemistry. So then you purify and shear the DNA so that you end up with two joined bits that record which parts were close together, right? And then, because we have the whole sequence of the genome, we can actually figure out where one side was and where the other side was, OK? So because you know the sequence of the genome, you can figure out from these segments what was close to what.
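The step from ligated fragment pairs to a matrix is just binning the two mapped coordinates. A toy sketch, where the 1 Mb bin size and the read pairs are invented for illustration:

```python
from collections import Counter

BIN = 1_000_000  # hypothetical 1 Mb bin size

# Each ligated read pair gives two genomic coordinates that were
# spatially close; these pairs are invented for illustration.
read_pairs = [(1_250_000, 58_300_000), (1_400_000, 58_900_000),
              (7_000_000, 7_200_000), (58_100_000, 1_100_000)]

contacts = Counter()
for a, b in read_pairs:
    # The matrix is symmetric, so store only the upper half.
    i, j = sorted((a // BIN, b // BIN))
    contacts[(i, j)] += 1

print(contacts[(1, 58)])  # contacts between the bins at 1 Mb and 58 Mb
```

Each entry of `contacts` is one pixel of the contact matrix; real pipelines add read mapping, filtering, and normalization on top of this binning.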
And you get this contact matrix, the Hi-C contact matrix, shown here for chromosome 14. This took something like a few million dollars of research funding to actually do, OK? Very, very nontrivial work. OK, so Manu computed the genome-wide Hi-C persistence diagram. Now, auxin is a molecule that induces degradation of cohesin. So when you put in auxin, a lot of the loops are just gone, OK? And what Manu showed was that in the auxin-treated data set, the number of loops that he computed changed by like 70%, OK? So that shows that he is detecting loops, and that auxin treatment got rid of a lot of the loops in his topology. This is just showing that for many different crosslinkers and many different digestions, where digestion is this cutting into bits, you get the same persistence diagram. This is just showing consistency of the loops he found across different protocols. That's a revision the referee asked for, so he did that, OK? He also applied this to protein folding. Let me show you some examples of holes inside proteins. So here's a pheromone-binding protein from Bombyx mori, which is, I believe, a moth. When there are two ligands bound, bombykol and iodohexadecane, then you actually get voids. In the unbound pheromone-binding protein, there is no void. But for these two odorants, you get these holes. This is a GADD45 gamma protein. In the human version, there is a void; not in mouse. Actually, it turns out to have some relation to disease: the GADD45 alpha version actually has a relationship to a human disease, but we couldn't find a structure for alpha. Cocosin is a coconut protein, a possible food allergen, and when you have mutations, you find three different voids. OK. Then lastly, I'll show you some pretty pictures. I started out as a physicist; in fact, I started out working on the cosmic microwave background. So I asked him, well, how about the galaxy distribution?
We should be able to find voids in the galaxy distribution. So he did that on one dataset with 110,000 proteins. Sorry, galaxies. So I was talking to a cosmologist at KIAS. And I said, well, we've been handling really large datasets, so I figured the galaxy catalog was not that big a dataset. And he said, the galaxy distribution is smaller than a chromosome? No, not quite. But 3 million is bigger than 110,000. So this is the persistence diagram for the galaxy distribution: holes, voids in the galaxy distribution. The red ones are the ones that, with a lot of work, we figured are statistically significant, which is Antonio's question. So there's a picture of voids in the 110,000-galaxy sample. These arrows indicate the direction of increasing z, increasing redshift. These are examples of the tight representatives that he found around distinct voids. And this shows the statistical significance: two different ways of establishing that the voids he found were statistically significant. So I think I'm just about out of time. So there: no kinetic Ising models, no large deviations, but some biology. All right, questions. Please. Thanks for the talk. I was very impressed. My question is: you came up with several examples to apply your method to analyze topology. For pancreatic islets and the galaxies, they are just points. Yes. They are not connected, right? But any data set, and this is the reason for computing persistent homology, any data set is always discrete points. Yeah, but for the chromosome, you have additional information. So we actually use that information in other ways. It's actually very interesting to use the linear organization together with the actual point-cloud information. But we don't use that in computing the topology. We use it in interpreting which genes are involved in whatever void we found. OK. OK? Yeah, no, no, no.
That is, when you have additional information and you get the topological information, you put them together, and that's the science part. Right. Yeah. And the other question is that there already exist some tools to analyze this structure. Which structure? Such as the structure factor or the pair correlation function, given positions. Sorry, I didn't hear that. I mean, given positions, you can calculate the structure factor. The structure factor? OK, like what, for instance? Which is a Fourier transform. Yeah, of course, you can compute the Fourier transform. And the pair correlation function also gives some information. Oh, yeah, there are other ways too. The reason for this approach is that it is incredibly robust to noise, OK? For instance, when I was talking to cosmologists, they said, oh, you have to include the fact that the masses of the galaxies will change the shape of it and so on. And I said, that's exactly the point: it doesn't matter what the mass of the galaxy is, because it will only deform the shape of the void from whatever explosion might have caused the void, right? But the topology, the existence of the void, does not depend on deformations, right? It's independent. So that's what makes it robust to noise. There are lots of other ways to analyze data. I'm just showing you a way that is geared entirely to being noise resistant, if you like, OK? Please. Questions? Wait, just one second. So I want to ask about the intuition behind choosing the shortest boundary. We didn't choose anything. You see, if someone gives you a data set, you have to think about what the units are and so on, right? Before you decide what the minimal distance is: suppose the data set has elements that are of different units, different magnitudes, and so on, right? So you have to have some rational way to get rid of the dimensions, to non-dimensionalize the data.
Then you can talk about the minimum distance that's present in the data set, right? And you start at that distance, right? And then you can look at the maximal distance in the data set, and you stop at that distance, right? My question is, I understand that part, but the algorithm that your student developed: the first algorithm gives a very noisy loop, and then he developed a smoother kind of loop. But that smooth loop, if we consider a torus, then choosing the shortest loop may not represent the global structure of the torus. If we choose... No, no, that's absolutely on point; you're absolutely right. So when we say something is the tight loop, we actually need to show that it is in the same homology class as the initial representative. So you don't forget the topology. If you just forget the topology and say, oh, I'll find the shortest loop that's close to this one, then you could quite often lose it. It took him a year and a half because it is a subtle question to find a tight representative that is still in the same homology class. It's non-trivial, really. Thank you. So I have the same question regarding this point, about this notion of tightness. As you just said, this is within the equivalence class. Yes, it has to be in the same equivalence class. I understand, but still, within that class, you need to have some kind of quantifiable measure. Yes, yes, yes. No, you're right, yeah. How would you define that? Well, we define that as the sum of edge lengths, right? Like a Hamiltonian cycle: you just sum the lengths of the edges. For example, yeah. And that tight loop is entirely dependent on what metric you put on the data set, right? If you were to change the coordinates in such a way that it changes the distances between points, then the tight representative will change, right?
The topology won't change, but the tight representative will change if you change how you compute distances, right? Yeah, OK. I think we'd better stop here. Thank you, Vipul. Yep.
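As a footnote to the units discussion above: one rational way to non-dimensionalize, and then read off the range over which tau is scanned, might look like the following sketch. Z-scoring each coordinate is only one possible choice (the talk does not prescribe a method), and all numbers are invented:

```python
import math
import statistics

# Toy data whose two coordinates have wildly different units
# (invented numbers for illustration).
data = [(170.0, 0.002), (180.0, 0.004), (160.0, 0.003), (175.0, 0.001)]

# Rescale each coordinate by its own mean and spread, so that no
# single unit dominates the metric.
cols = list(zip(*data))
mu = [statistics.mean(c) for c in cols]
sd = [statistics.stdev(c) for c in cols]
scaled = [tuple((x - m) / s for x, m, s in zip(row, mu, sd)) for row in data]

# The tau scan runs from the smallest pairwise distance present in the
# data to the largest, as described in the discussion.
dists = sorted(math.dist(p, q) for i, p in enumerate(scaled)
               for q in scaled[i + 1:])
print(f"scan tau from {dists[0]:.2f} to {dists[-1]:.2f}")
```

Without the rescaling, the first coordinate (magnitudes around 170) would completely swamp the second (magnitudes around 0.002) in every distance, and the persistence diagram would only reflect one axis.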