All right, so I'm going to talk about using persistent homology to generate features for drug discovery. In some sense this is just one application of persistent homology, and there are others, so I think you can learn something from this even if you're not particularly interested in drug discovery. Maybe you'll learn a little bit about persistent homology, and of course Dimitri is a world expert on persistent homology if you have any follow-up questions or issues.
So what's the context for the problem we're looking at? We want to create new drugs to cure diseases, and to do that we have to find new compounds to run through drug trials. Before you start just giving people drugs, you test them in labs, on cultures, on cell lines, to make sure that they inhibit the things you want them to inhibit. And before you do that, the space of all possible compounds is too large, so you need some way to filter it down to a potential set of compounds to actually try. So you start with: in the world of all compounds, which are the ones I want to be looking at? We're going to be looking at this step here, sorting through compounds to find potentially promising ones.
Now, as far as drug discovery goes, this is a part of the problem you can tackle with data science and math and statistics, but we should understand it's a very small piece of the problem. The actual drug development process is a multi-decade process to do this for real. And I'm going to be dealing with a very constrained data set and a filtering problem, but there's nothing to really stop you from stepping back and doing something on a much larger set of compounds.
It's just not something that I did in this process.
So this step is called virtual screening. Why do virtual screening at all? Physically testing compounds is very expensive, and virtual screening is a computational approach. Typically it's going to be based on biomedical knowledge, and the reason I say that is that I have no particular biomedical knowledge; typically you would have a real expert in chemistry, biochemistry, etc. Virtual screening is a way to reduce costs. Our typical goal will be something like: we would like to get 90% of the potential inhibitors while reducing the total space of compounds to 10% of the original data set. That's a ballpark of what you want to do.
There are lots of methods. Again, I'm not really an expert here. I did this work in collaboration with an expert, and he told me these are the competing methods you should put on your slides. The message, though, is that they're very detailed calculations, maybe quantum mechanical or semiclassical, with a lot of input from chemistry and biology, but they're very detailed calculations about particular molecules. What we're going to do is philosophically a little bit different: we want to organize molecules and extract something about particular molecules from how they sit in the space of all molecules. So rather than getting into the nitty-gritty details of a particular molecule, we want some sensible way of organizing them so that we can say over there are the antibiotics, over there are the chemotherapy drugs, right? That's this spatial view. It has to do not just with a particular molecule but with how it sits relative to other molecules. So that's going to be our approach here.
Okay, so again, our goal is to take a database and find the set of relevant bioactive compounds. We're going to look at a particular kind of problem, which is DHFR inhibitors. The main takeaway from this slide is that DHFR is needed for cellular reproduction. So if you can stop E. coli from reproducing, you have an antibiotic. If you can stop cancer cells from reproducing, you have something to treat cancer, right? So we're going to target a certain thing that's needed for cells to reproduce and try to block it.
The same pathway exists in multiple different kinds of organisms, not just in humans. If you're targeting human cellular reproduction, you're going to be developing a cancer drug. If you're targeting bacterial reproduction, you're going to be developing antibiotics, and you can also target things like malaria. So in general DHFR inhibition is very promising ground, because there's lots of potential for drugs with applications in a lot of different domains. That said, with a particular drug you really want to do one of these things. When you give someone antibiotics, you don't want to also be killing them, right? You want either an antibiotic DHFR inhibitor or a human one. You don't want overlap there. And that's a little twist that complicates our story: the process of inhibiting is very similar across these different organisms, but there are some subtleties that distinguish them, and that confounds things and makes our problem more difficult. And in fact, this problem is interesting historically as well.
So methotrexate is a DHFR inhibitor, and it's considered the first historical example of successful cancer drug design, meaning that people really wanted to go after this particular thing and they designed a drug to go after it. On the top we have, I believe, the actual thing, and on the bottom we have the inhibitor that's going to stop this reaction from happening. I could have that backwards. Since I made the slide, I forgot which order; I'd have to look at the file names of the two pictures I'm loading. I apologize for that. And then here's one that's similar but doesn't work.
So designing drugs to do things is very hard, right? A lot of knowledge, biological and biochemical, and years of work go into doing something like this. Methotrexate was designed in the 40s and 50s and it's still used today. It has bad side effects like hair loss and ulcers. Maybe we all know something about chemotherapy from popular culture; these are all really bad side effects. Why do we use drugs with these side effects? Because it's really hard to design new ones that don't have them. So the fact that we've used something for many decades with awful side effects is a testament to the difficulty of this problem.
And then there's this twist to the story, which is actually my favorite part: when they were finally able to use x-ray imaging of how this compound was actually binding and stopping this reaction from happening, it was doing it backwards from how it was designed to. So in a certain sense, this successful design story ended up being a story about luck, right? It happened to work, but it works backwards, right?
And that's again why this is a hard problem: even when you're successful, maybe you just got lucky. Yes? Yeah, you sound like more of an expert on this than I am. No, that's right, yeah.
Okay, so it's not really "to the rescue." I don't want to oversell this. This is a slide I stick in here for computational topologists, so they feel good about the work they're doing. In this case Dimitri's the only computational topologist, so it's a little bit of a pointless slide. I apologize for that. No, his work is so self-evidently valuable it doesn't need me calling it out. I will say there was a period where computational topology was developed before it had great applications, so it's nice to have good applications. Anyway, I should stop talking about that.
All right, so what are we going to do? For each chemical compound in some database of chemical compounds, we're going to calculate a set of topological invariants called barcodes. We're going to take a metric, or distance, on the space of barcodes to assign a distance on the space of chemical compounds. And then what we're hoping to see, and by the way this is how data science is done: I was like, I have this idea, I hope it works, and the only way to find out is to try it. There's no reason to necessarily believe that it's going to work. So you hope that the compounds with certain biological properties will be grouped together geometrically in this space of all compounds. So can I say, retrospectively, why that happened?
I don't know, but it did happen, right? We'll see that. And that's how a lot of data science problems go. Again, this is a little bit of a different perspective that I'm emphasizing here, because I think one of the gains from using these techniques is that we're not just looking at individual compounds, but at the compounds relative to each other. It's finding the antibiotics grouped together in this compound space that's interesting, not necessarily the particular barcodes of an individual antibiotic.
Okay, so we're using a standard online database of compounds. There are some basic things we know we need to have, which we use to pre-filter the database. We also did a literature search to find all the known inhibitors, and those go into our data set as the outcome that we want to predict, right? The things that we think we know things about are the ones from the literature search.
Okay, so persistent homology. I'm going to tell you now how I make these topological signatures. This here is sort of a technical slide, which says that when you have a series of inclusions of spaces, you can track topological information as it passes from one space to the next in the inclusion. This is as technical as we're going to get here, and I'm actually going to forget that for a moment and show what this family of inclusions means. So here I have a toy chemical compound. These dots are the atoms, and I'm going to create a barcode, one of these topological signatures, from this toy arrangement of atoms. What I'm going to do is connect atoms to their nearest neighbors, right, first the ones that are distance one half from each other, then the ones that are distance three quarters from each other.
And then when I get a complete simplex, so that's like a triangle, when I get all the edges of a triangle filled in, I'll fill in the triangle as part of the space. So as I increase this distance parameter, I start with the points and then I'm connecting things up. What you notice is that as I connect things up, each space is included in the next one along the line. And the machinery of persistent homology will then create the barcode for us.
So what does this look like? Here we have our atoms. These are the closest atoms; they would be the first ones connected. Then I connect these ones further out. Notice now I have these loops, right? So I started with the points, some of those got connected, at the next step I end up with everything connected and with some loops, and then if I keep going, those loops get filled in. So I'm going to record this information about the connected pieces and the loops, and I'm going to summarize that with what's called a barcode.
H naught means connected pieces, so that top barcode is recording information about connectivity. H one is going to record information about the loops. At the beginning, I have one line for each point and no loops, so there's nothing on the loop line. Some of those points connect up, so their connected components die. There's some rule I'm being a little bit vague about, but I started with, sorry, not eight, nine connected components, and now I have seven. So two of those bars died and seven live on. Then at the next step I get these loops, so all of a sudden in H one you see that there's a loop. And when I keep going, those loops get filled in and they die. So those loop lines are short because they only exist for a certain range of parameters. Right?
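To make the connectivity half of that barcode concrete, here is a minimal sketch of how the H naught bars of a distance filtration could be computed. It is not the code used in the talk: the function names and the four-point toy arrangement are made up for illustration, it only handles connected pieces (no loops), and it assumes the convention that an edge appears once the scale reaches the distance between its endpoints. Every point is born at scale zero, and each time an edge joins two previously separate pieces, one bar dies.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """H0 bars (birth, death) of a distance filtration on `points`.

    Assumed convention: the edge between two points enters the
    filtration when the scale equals their Euclidean distance.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over the points

    def find(i):
        # Follow parent links to the representative, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges from shortest to longest, as the scale grows.
    edges = sorted(
        (dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # two pieces merge ...
            bars.append((0.0, d))    # ... so one component dies at scale d
    bars.append((0.0, inf))          # the last piece never dies
    return bars


def components_alive(bars, scale):
    """Number of connected pieces present at a given scale."""
    return sum(1 for birth, death in bars if birth <= scale < death)


# Hypothetical toy arrangement: two close pairs of atoms, far apart.
toy_atoms = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (2.0, 0.75)]
bars = h0_barcode(toy_atoms)
for birth, death in bars:
    print(f"H0 bar: [{birth}, {death})")
print("pieces at scale 0.6:", components_alive(bars, 0.6))
```

Counting the bars that are alive at a given scale reproduces the kind of bookkeeping described above, where some components have died and the rest live on. The loop bars need the filled-in triangles as well, which is what a persistence library handles.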
So I'm tracking these features about loops and connectivity in the barcode as I change this distance parameter and connect up the points. This is one particular example, which I'll call the Rips filtration: I'm using distance to the nearest neighbor to filter the space. But there are lots of other ways I can filter the space. Another way is, I can say I'm going to filter from the edge of the compound down towards its middle. How do I do that? I assign the middle of the compound the value two, and as I move away from there I have some decreasing function. Now I'm just going to filter by sublevel sets of this function. What does that mean? Again, we look at the barcode. I start with the lowest values, which are the things furthest away from the core, and then I add in more points. I still just have two pieces. As I continue this process, I still have two pieces, but now I also have some loops. Then once I get to the middle, one of those pieces dies, so I have one piece, right? That's why the barcode ends with one piece. And those loops live on forever: once I have two loops, they never disappear in this view of coming from the edges in.
So the point is that different filtrations give us different barcodes. This is feature generation, and feature generation in this context means coming up with filtrations that capture meaningful information about the chemical compounds. The thing to note here is that this is not a magic black box that you can't look into. Different filtrations tell you something: you can read off information about the chemical compound from these barcodes. Right? So for the first one, right?
We learned about the number of atoms, which was the number of H naught bars at the beginning. And we learned about the size of the voids, which corresponds to the length of these H one bars: if there were really big circles, those bars would be longer, because they would persist over a larger range of distance parameters. So we can learn something about the compound by looking at the barcode. And a different function gives a different barcode that reads off something different about the compound. In this example we learned that this compound has two flares far away from the center, and that the circles are far away from the center. We know that because the circles entered the picture long before the compound itself was connected. Right?
Sorry, we're going to skip these slides. We're going to come up with functions on the atoms, use the filtrations of these functions to make barcodes, and then use those barcodes to create a space of compounds. So I've explained how we get these barcodes. We're actually going to end up with a lot of barcodes for every molecule, and this is a two-sided coin. It's nice because it means it's very easy to come up with functions that tell us something about these atoms. But it's bad because we're going to have trouble handling all these different barcodes, and it becomes a little bit of a mess. But we'll get through it. One of the ways we get lots of barcodes is that there's more than one notion of distance floating around. I said we're connecting things that are a certain distance away, but there are at least two very natural distances in this problem.
One is just thinking of the molecule as being in three-dimensional space: there's the Euclidean distance between the atoms. But you can also think of a molecule as a connected graph with bonds being the edges, and then the distance is the distance along this graph. So we're going to want to consider both of those kinds of distances. And now we're going to filter them in different ways. There's the distance filtration, the Rips filtration, which we already talked about. The centrality filtration, which is also called eccentricity, up to a one over or a minus sign depending on your conventions, we've already looked at in great detail. Some other kinds of filtration you might consider are by atomic mass, where you put in either the heaviest or the lightest atoms first, probably heaviest makes more sense, then add in progressively lighter ones and see how the molecule connects itself up. You can also look at partial charge, which is how charge is distributed over the molecule, and you can filter by super- and sublevel sets of partial charge. Those last two are explicitly chemical, meaning they're explicitly dealing with something about chemistry. But the other ones are also implicitly chemical, because filtering by distance has to do with bond distance, with how the molecule has wrapped itself up, twisted itself up in space. So chemistry is really permeating all of these different measurements.
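As a small illustration of the two notions of distance, the sketch below computes both for a made-up four-atom chain bent into a U shape, and then an eccentricity value per atom from the bond-graph distance. The coordinates, bond list, and function names are hypothetical and are not taken from the talk's data; the bond-graph distance here simply counts bonds along the shortest path.

```python
from collections import deque
from math import dist, inf


def graph_distances(n_atoms, bonds):
    """All-pairs bond-count distances, by breadth-first search from each atom."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    table = []
    for source in range(n_atoms):
        seen = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        # Atoms that cannot be reached along bonds get infinite distance.
        table.append([seen.get(t, inf) for t in range(n_atoms)])
    return table


def eccentricity(distance_table):
    """Largest distance from each atom to any other atom."""
    return [max(row) for row in distance_table]


# Hypothetical molecule: a chain of four atoms folded back on itself.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
bonds = [(0, 1), (1, 2), (2, 3)]

graph_d = graph_distances(len(coords), bonds)
euclid_d = [[dist(p, q) for q in coords] for p in coords]
ecc = eccentricity(graph_d)

print("ends of the chain, along bonds:", graph_d[0][3])
print("ends of the chain, through space:", euclid_d[0][3])
print("eccentricity per atom:", ecc)
```

The two ends of the chain are three bonds apart along the graph but only one unit apart in space, which is the kind of folding information that gets picked up by using both distances. Negating or inverting the eccentricity values gives a centrality-style function whose sublevel sets run from the edge of the molecule in towards its middle.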
Then there are various parameters that you end up having to choose to do this: cutoff parameters, what's the underlying complex that you're filtering. Trying different parameter choices, just trying them all and throwing them into the ML toolkit, leads to this proliferation of lots of different barcodes. So at the end of this process we end up with hundreds of barcodes for each compound.
Now, every compound has its collection of barcodes, and we want to use this to make a distance between chemical compounds. There are two very natural distances floating around: one's bottleneck and one's Wasserstein. Here is their representation as formulas, but they may be most easily understood as edit distances between barcode diagrams. If I have these two barcodes and I want to match them up, I try to find the best matching, where the best matching is just an alignment of these bars. Then the cost of the matching is going to be the distance between the barcodes; this is just how the barcodes mismatch. The difference between the two distances is that for bottleneck you just take the maximum mismatch, and for Wasserstein you sum up all the mismatches.
After that process, we now have a metric space of chemical compounds. So we did the thing we said, which is that we've geometrically organized all these compounds. What we're looking at now is a visualization of this metric space, and what we want to know is: have we captured something about the bioactivity of the compounds? What ends up happening is that all of the known human inhibitors from the literature search end up being mostly in that upper flare over there, with a couple in these other spots. And I can look at the ones for E. coli, and they're mostly elsewhere. Red means that there are a lot of E. coli inhibitors in that region of the space. You'll see there's a little mixing at the top, but mostly the E. coli inhibitors are very distinct from the human inhibitors in this view.
So this is a visualization, which is nice, but we want to quantify our ability to separate these out, not just visualize it. What do I need to do? Most machine learning tools require that you give them vector information. Right now I have this abstract metric space; I have these barcodes, which consist of bars, which are just unordered lists of birth and death points. I want to turn that into some nice vector representation so I can feed it to whatever machine learning method I want. To do that, I'm going to use the fact that the space of barcodes is itself a geometric object. It's an algebraic variety. The details there are complicated because it's an infinite-dimensional variety, and as soon as I say the words infinite-dimensional, you say whoa, so it's got some complexity. It's singular, which makes it more complicated. But it turns out that when you consider all of the infinite dimensions, all the singularities paste together and you end up with something really nice. And the point is, you can just write down its ring of functions; these are some simple examples here. The thing to notice is that x i is the birth of the ith bar and y i is the death of the ith bar, so where it starts and where it ends.
And so x i plus y i is not a legitimate function, because when the difference between birth and death goes to zero the bar disappears, right? When it has zero length, it's no longer a bar, it's gone, and the function should not have a value on something that doesn't exist. So anyway, here are some examples from this ring of functions. Just by taking polynomials up to a certain degree that live in this specific ring of functions, I can embed the space of barcodes into some Euclidean space and apply regular machine learning. In this case I'm going to use a support vector machine.
When I do that, this is the confusion matrix between the different classes of compounds. There's the human inhibitors and the E. coli inhibitors, and then I also have pneumonia and another kind of target, which, I apologize, is briefly slipping my mind. The point is that most of the weight is on the diagonal. We do a great job separating out human and E. coli. You saw that visually in that diagram, where the E. coli-targeting and human-targeting compounds were in different parts of the space. For the pneumonia and whatever the other one is, there's a little bit of confusion going on, which you can see. And the point is that this ability to separate out compounds by the species they target is basically comparable to state-of-the-art computational chemistry, which takes something like five orders of magnitude more compute to do the molecular simulations compared to calculating these barcodes. And five orders of magnitude was a made-up number. It's way more complicated; it's at least that much more complicated, probably even more.
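The condition on legitimate functions can be sketched in a few lines. The example below is only an illustration under stated assumptions: the four particular polynomials are chosen for the example and are not the ones on the slide, and `barcode_features` is a made-up name. What they share is that every term carries a factor of the bar length, y i minus x i, and each one is a sum over all bars. That makes the value independent of the order of the bars and unchanged when a zero-length bar is added, which is exactly what x i plus y i fails to do.

```python
from math import inf, isclose


def barcode_features(bars):
    """Map a barcode (list of (birth, death) pairs) to a fixed-length vector.

    Illustrative choice of polynomials: each term includes the bar length
    (death - birth), so zero-length bars contribute nothing. Infinite bars
    are dropped here, which is an assumption made for simplicity.
    """
    finite = [(x, y) for x, y in bars if y != inf]
    return [
        sum((y - x) for x, y in finite),           # total persistence
        sum(x * (y - x) for x, y in finite),       # birth-weighted persistence
        sum((y - x) ** 2 for x, y in finite),      # squared lengths
        sum(x ** 2 * (y - x) for x, y in finite),  # higher-degree birth weighting
    ]


def naive_sum(bars):
    """The non-example from the talk: sum of birth plus death over all bars."""
    return sum(x + y for x, y in bars if y != inf)


bars = [(0.0, 1.0), (0.5, 2.0)]
shuffled_with_empty_bar = [(0.5, 2.0), (3.0, 3.0), (0.0, 1.0)]

features = barcode_features(bars)
print("features:", features)

# Reordering bars or adding a zero-length bar leaves the features unchanged ...
same = all(
    isclose(a, b)
    for a, b in zip(features, barcode_features(shuffled_with_empty_bar))
)
print("stable under reordering and empty bars:", same)

# ... whereas birth plus death picks up a contribution from the empty bar.
print("naive sum changes:", naive_sum(bars) != naive_sum(shuffled_with_empty_bar))
```

With hundreds of barcodes per compound, concatenating one such vector per barcode would give a single feature vector per compound, which could then be handed to a support vector machine or any other standard classifier.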
Okay, so what was the goal here? From a set of chemical compounds, we created this set of barcodes, and we used a metric on the barcodes to organize the compounds geometrically into a space of compounds. Then we can use the structure of the space to look for potential new compounds. The reason we think that might work is that for existing compounds, we've managed to group them into distinct groups. And then there's this piece, which I'd like there to be more appreciation for, which is that we're using other compounds to measure: your spot in this space of compounds has to do with your relationship to other compounds. Rather than just thinking about the particulars of a particular compound, it has to do with its relationship to the other compounds, which I think is really interesting. There's this global space of compounds that you can start to try to understand.
There are lots of ways in which this project is incomplete, and there's a lot of work to be done. One of them is that I'm considering these filtrations individually. When I have a function like distance or centrality or partial charge, I'm taking a filtration with respect to just that one thing. But how does partial charge interact with centrality? I would like to take both of these filtrations simultaneously to understand their interactions with each other. On the math side that would be called multidimensional persistence, and there's a lot of work to be done there. It's really still open how to do that properly, in the right way.
One thing that we really would like to keep, because there are a lot of benefits, is this idea that when you have a filtration, you have a barcode, some invariant, and you know how to compare invariants from different spaces. That was one of the things here; it's a little bit of a subtle point. There's no direct way to compare the chemical compounds, but the barcode provides a uniform language in which I can compare the topology of differing compounds. So that ability to compare invariants is really important. So Dimitri, I know, is about to release Dionysus 2 once his documentation is finished, which he's going to work on this afternoon: faster, more efficient persistence calculations. And there's lots of work on the chemistry side on what other functions would be useful for measuring information about the compounds.
So I think there's an intriguing idea here. I don't really work on this at all anymore, but the reason I go around giving talks on it is that eventually I want to talk to a chemist who says, I'm going to do something with this. What are we capturing about the compounds with this metric space of persistence barcodes? Being able to say more directly: in the E. coli group there were four different groups of E. coli inhibitors, so what do those correspond to in chemistry? From a very cursory examination of this question, it's not completely obvious. There are some subtleties in these barcodes causing them to break out into those groups. But I think that could be really, really interesting, particularly because the barcodes themselves are, well, for me at least, very intuitive.
I understand maybe we went a little fast here, but the point is that they're not some weird feature. They're a feature that has a direct connection to the geometry of the compound. If you understand them, you can read information back and forth. And I think having features like that is really interesting. So, all right. That's it.