Muted the last time, I saw. Ian? Yes, can you hear me now? Great. Welcome. Ian is from the University of Toronto. Take it away.

Yes, thank you very much. I'm a graduate student at the University of Toronto, and today I'll be talking about similarity metric learning on the L1000 Connectivity Map.

Okay, so the L1000 dataset, which was originally published in 2017 by the Broad Institute, is an example of a perturbational dataset. These datasets measure changes in some sort of biological feature space due to perturbations like compound treatment or genetic reagents. L1000 in particular measures changes in gene expression in cancer cell lines as a result of perturbation. Another example of a perturbational dataset is the Cell Painting morphological assay, also from the Broad, which similarly measures changes due to perturbation by compounds and genetic reagents, but the feature space is morphological features, so imaging of cells after treatment. These datasets can be thought of as matrices where the rows are features and the columns are signatures of the changes induced by perturbation.

So, the applications of these datasets: there are a couple of different things you can do with them. For example, you can characterize unknown compounds by taking the signature of an unknown compound and identifying compounds or genetic reagents that produce similar changes. You can nominate therapeutics for a disease state. These analyses require identifying similar signatures, but it's not obvious what similarity means.

So similarity is defined by a similarity function, which is related to a distance function. It takes two vectors; vectors that are similar, that is, close to each other, are given a score close to one, and if they're dissimilar, their score would be near zero or minus one. Some canonical examples are things like Pearson or Spearman correlation, gene set enrichment analysis, or cosine similarity, and similarity functions are used for all sorts of analyses like clustering and model fitting. I maintain that a good similarity function (and there is a vast space of these) is one that correctly discriminates related pairs of signatures from unrelated pairs. Of course, the only way to really assess this is if you have some sort of a priori benchmark, where you know that the pairs in orange here are similar, and you expect that a good function will make this distribution of similarities different from the distribution of all pairs of signatures.

So the technique that I'm using is from the field of self-supervised learning, a discipline in machine learning. Self-supervised learning learns an embedding that brings pairs known a priori to be similar close together, while leaving samples that are dissimilar far apart. For example, given two pictures of the same handwritten signature, it learns an embedding that brings them close together. It's been widely successful in fields like image processing and natural language processing. The key idea is that it uses label-free data augmentations: you start out with, say, a picture of a dog and apply a series of transformations that leave the identity of the sample unchanged while still generating new data. Importantly, you don't need to know that this is a picture of a dog to use this technique. These transformed images are all assumed to be similar, to belong to the same class, and so you can then learn the embedding to bring them close together.
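To make that benchmark idea concrete, here is a minimal sketch, not the speaker's code, of how one might score a similarity function: compute similarities for a set of pairs known a priori to be related and compare them against the background distribution of all pairs. The toy matrix and the `related_pairs` list are assumptions for illustration.

```python
# Sketch only: benchmarking a similarity function against a-priori related pairs.
import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity: near 1 for parallel vectors, 0 orthogonal, -1 opposed."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

rng = np.random.default_rng(0)
# Toy data matrix: rows are features (e.g. landmark genes), columns are signatures.
n_features, n_signatures = 50, 40
X = rng.normal(size=(n_features, n_signatures))

# Hypothetical ground truth: pairs of columns known a priori to be related.
related_pairs = [(0, 1), (2, 3), (4, 5)]

related = [cosine_similarity(X[:, i], X[:, j]) for i, j in related_pairs]
all_pairs = [cosine_similarity(X[:, i], X[:, j])
             for i in range(n_signatures) for j in range(i + 1, n_signatures)]

# A good similarity function separates these two distributions.
print(f"mean related similarity:   {np.mean(related):.3f}")
print(f"mean all-pairs similarity: {np.mean(all_pairs):.3f}")
```

Any candidate function can be dropped in place of `cosine_similarity` and judged by how well it separates the related-pair distribution from the background.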
That doesn't really work well in biology, because learning an augmentation operation that preserves the identity of the sample while being meaningful is pretty hard in biology. So the solution is to use replicate data. We can take a signature of a perturbation in some cell context, and the assumption is that replicates of that signature should tend to be similar to each other, whereas different signatures are not assumed to be similar. So we want to bring replicates close together while leaving the distribution of similarities for other pairs of signatures unchanged.

So we begin with cosine similarity, just a basic inner product, and we modify it by introducing this parameter M. This is a matrix, and there are a lot of different formulations of it; we've experimented with quite a few. The one that I'm going to talk about here is a very simple linear transformation, where M is factored into two unitary matrices and a re-weighting in principal component space. This learns an embedding of your original data space, and that defines a similarity function. We can then optimize this parameter M in such a way that replicate similarities, shown here in red, are brought close together, whereas similarities of non-replicates are kept apart. Conceptually, what this very simple linear transformation is doing is stretching the dataset along its axes of variation in such a way that ground truth, in particular replicate similarity, is maximized. (A sketch of this parameterization and training loop follows below.)

So our benchmarking is as follows. To begin, we train a metric starting from some training dataset that has replicate structure: we know that these columns are replicates of each other, and so on. We can then apply it to two different benchmark datasets. First, we can apply it to replicates, so the same compound, but for unseen compounds and in unseen cell lines; this is data that was not seen in the training step. Then, for a more realistic case, we can apply it to signatures of compounds that have the same mechanism of action but aren't necessarily replicates; these could, for example, be two EGFR inhibitors. We can then benchmark by looking at the similarities of our ground truth compared to the similarities of all other pairs, and the ranks of our ground truth compared to all pairs, and compare these across two or more different metrics.

So, to give a little bit of intuition, this is how the embedding has changed the similarity distribution. On the left you have cosine similarity of compound signatures in HepG2, a liver cell line, from L1000, where replicate similarities are in red and non-replicates are in blue. On the right you have the learned metric: you've done the embedding and are computing this rectified cosine, and as you can see, the distribution has shifted somewhat. But this doesn't just shift the distribution; replicate recall has improved. If you look at the ranks of replicate pairs compared to the entire dataset, this figure shows the CDF of replicate ranks, where the most similar pairs are on the left, the least similar pairs are on the right, and the vertical axis is the fraction of replicate pairs recovered. What you can see is that the learned metrics, shown here in purple and orange (two variants trained on two different cell contexts), show improved replicate recall compared to conventional off-the-shelf similarity functions, in that you see this lift near ranks of zero.
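Here is a minimal sketch of the parameterization described above, assuming M = V diag(w) Vᵀ with V fixed to the principal components and only the re-weighting w learned. The contrastive-style loss, the variable names, and the toy replicate structure are assumptions standing in for the speaker's actual implementation.

```python
# Sketch only: learn a "rectified cosine" s(x, y) = x^T M y / (|x||y|),
# with M = V diag(w) V^T, V fixed to principal components and w learned.
import torch
import torch.nn.functional as F

def rectified_cosine(X, w, V):
    """All pairwise weighted cosine similarities between columns of X.
    Embeds each unit-normalized column as diag(sqrt(w)) V^T x."""
    Xn = X / X.norm(dim=0, keepdim=True)         # unit-normalize each signature
    Z = torch.sqrt(w).unsqueeze(1) * (V.T @ Xn)  # embed in re-weighted PC space
    return Z.T @ Z

torch.manual_seed(0)
n_features, n_signatures = 50, 40
X = torch.randn(n_features, n_signatures)        # toy features-by-signatures matrix

# Principal components of the centered training data; held fixed during training.
V = torch.linalg.svd(X - X.mean(dim=1, keepdim=True), full_matrices=False).U
w = torch.zeros(V.shape[1], requires_grad=True)  # softplus(0): equal weights to start

# Hypothetical replicate structure: columns (0,1), (2,3), ... are replicates.
rep_i = torch.arange(0, n_signatures, 2)
rep_j = rep_i + 1

opt = torch.optim.Adam([w], lr=0.05)
for _ in range(200):
    S = rectified_cosine(X, F.softplus(w), V)    # positive weights keep M PSD
    off_diag = ~torch.eye(n_signatures, dtype=torch.bool)
    # Contrastive-style objective: raise replicate similarities relative to
    # the background distribution of all pairs.
    loss = -S[rep_i, rep_j].mean() + S[off_diag].mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The similarity view and the embedding view are the same object here: applying diag(√w) Vᵀ to each signature stretches the data along its axes of variation, and plain cosine similarity in that stretched space is the learned metric.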
And in the more practical case of trying to identify compounds with the same mechanism of action, you see that rectified cosine, the learned metric, does better at identifying pairs of compounds that share a mechanism of action than cosine does. As you can see, there's an increase in small-rank pairs in the cases where we know there's a ground truth. And controlling the false discovery rate at a particular threshold, the learned metric improves recall of compound pairs by 10 to 20 percent. That's pretty substantial, given that this is just a computational technique; there's no legwork involved apart from learning a new similarity function. (A sketch of this recall-at-threshold calculation follows at the end.)

Very briefly, we see the same results when the method is applied to the Cell Painting perturbational dataset, which, again, uses cell morphology features. You see the distributions have shifted under the learned metric, shown here on the right; the metric on the left is somewhat saturated, and on the right you see improved discrimination near the high end. Again, replicate pairs are better discriminated by the learned metric when controlling for the false discovery rate. You see a more modest improvement than in L1000, but you're still looking at something like a 5 to 10 percent improvement in the fraction of replicate pairs you're able to recall. And as with L1000, compounds that have the same mechanism of action show improved recall with the learned similarity function. This turns out to be a pretty hard problem in the Cell Painting space, as my collaborators are familiar with, but you're still seeing improvement using the learned distance function.

So, briefly, to recap: there are two interpretations of this method. For one, you have an improved similarity calculation that's able to better recover known biology and, presumably, discover novel biology. You can also think of it as an embedding into a new space that may be more appropriate for applications like clustering or modeling; it's a more natural basis for the data. I'm in the process of submitting a package to Bioconductor, which I'm calling BioSimLearn for now, to enable researchers to apply this method and learn similarity functions specific to their own datasets. All you need to use it is a data matrix with known replicate structure, some kind of ground truth you can use to train the model to recognize replicates, and it has a range of applications.

Thank you very much for your attention, and thanks to my collaborators. Please reach out to me if you have datasets that you would be interested in using this on. Thank you.
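To close the loop on the evaluation described above, here is a minimal sketch of comparing two metrics by recall of ground-truth pairs. It is illustrative only: the talk controls the false discovery rate, for which a simple top-fraction cutoff stands in here, and the similarity matrices and `truth_pairs` are toy assumptions.

```python
# Sketch only: evaluate a similarity matrix S against ground-truth pairs
# (e.g. same-mechanism-of-action compounds) by ranking them among all pairs.
import numpy as np

def recall_at_top(S, truth_pairs, top_fraction=0.05):
    """Fraction of ground-truth pairs scoring in the top `top_fraction`
    of all pairwise similarities (a stand-in for an FDR-controlled cutoff)."""
    iu = np.triu_indices(S.shape[0], k=1)
    cutoff = np.quantile(S[iu], 1.0 - top_fraction)
    truth = np.array([S[i, j] for i, j in truth_pairs])
    return float(np.mean(truth >= cutoff))

rng = np.random.default_rng(0)
n = 40
S_cosine = rng.uniform(-1, 1, size=(n, n))       # toy baseline similarity matrix
S_cosine = (S_cosine + S_cosine.T) / 2
truth_pairs = [(0, 1), (2, 3), (4, 5)]

S_learned = S_cosine.copy()
for i, j in truth_pairs:                         # pretend the learned metric
    S_learned[i, j] = S_learned[j, i] = 0.9      # promotes the true pairs

print("cosine recall: ", recall_at_top(S_cosine, truth_pairs))
print("learned recall:", recall_at_top(S_learned, truth_pairs))
```

Both the replicate benchmark and the mechanism-of-action benchmark reduce to this shape of calculation: fix a cutoff on the background distribution, then ask what fraction of known-related pairs clear it.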