Symmetries in machine learning models, by Soledad Villar from Johns Hopkins University, partly a visiting scientist at Apple these days, from what I understood. Soledad, you are welcome, please go ahead.

Thank you. Thank you so much to the organizers for inviting me. The first time I was here was about ten years ago, when I was a master's student in Uruguay, in Montevideo, where the other organizer, whom I haven't met yet, is from. I think it is very nice that this institution gives people from South America the opportunity to come here, do research, and learn. I came as a student, and now, ten years later, I come back to give a talk, so it is a very emotional and very nice opportunity. Thank you to the organizers and to the institute.

I am going to talk about approximate symmetries in machine learning models, mainly about one project with my PhD student, Teresa Huang, who is graduating next year; she is a great researcher and also a great person, so I am very happy to work with her. It may look like I have 47 slides, but I really only have about 15, just as a disclaimer so that you don't get anxious; the rest of the slides are there to clarify questions or to discuss other topics if you want me to, but the main part of the talk is shorter.

The motivation for this talk is inductive bias in deep learning. If you are not familiar with some of these questions, please ask; I am happy to make this interactive and to make sure I am addressing the audience in the right way. There is a well-known paper by Belkin and collaborators from 2019 that describes the double descent phenomenon occurring in very overparametrized machine learning models, which are the models people use these days for ChatGPT and all these deep learning systems. These models are heavily overparametrized: typically you have far more parameters than training data points. In that situation there are many, many functions in your class that can fit the data perfectly; some of them generalize well and some do not. You want to understand which ones generalize well, and how to design the class of functions so that when you train it with gradient descent, just local optimization, you converge to a local optimum that hopefully has good generalization properties and works well on unseen data.
The double descent phenomenon has a plot of this form. In the classical statistical regime you have the well-known bias–variance trade-off: if your model is not very expressive it may not have much variance, but the bias is large because you may not be able to express the data; and when you have too many parameters, the model can fit your training data perfectly but does not generalize, it overfits. In the plot, the training error decreases monotonically with the capacity of the model, while the test error goes down and then up, and there is a sweet spot, the best bias–variance trade-off point, which gives the best model you can use. What Belkin and collaborators observe is that in these deep learning models, in these overparameterized models, you still see this bias–variance trade-off up to the point where the model is very overparameterized and overfits; but if you keep adding more and more parameters, some form of implicit regularization kicks in, the test error decreases monotonically again, and sometimes in the overparameterized regime you reach a test error smaller than the best you can achieve in the underparameterized regime. That is the idea. There is a fairly simple explanation for linear models of why this peak occurs and why the error decreases again: a couple of papers by Hastie and collaborators explain this for linear models, and a paper by Bartlett and collaborators, called benign overfitting, explains why you can have a smaller error in the overparameterized regime. I have some slides about it and I am happy to discuss it later, but for now I will just take this as the setting; a small numerical illustration of the linear case is sketched below. So the question is: how can you design models with the right inductive bias, so that in the overparameterized regime you still get good performance? It does not always happen that the curve goes up and comes back down; you can design models where the test error stays large even in the overparameterized regime. The question is how to design models that have these nice properties.
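(Not from the talk.) As a minimal sketch of the phenomenon in the misspecified linear regression setting analyzed in the Hastie-style papers: fitting min-norm least squares on a growing number of features typically shows the test error spiking near the interpolation threshold and coming back down afterwards. The dimensions, noise level, and variable names here are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 2000, 200             # d = total number of candidate features
w_true = rng.normal(size=d) / np.sqrt(d)        # hypothetical ground-truth linear model

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + 0.5 * rng.normal(size=n)   # noisy linear responses

Xtr, ytr = sample(n_train)
Xte, yte = sample(n_test)

# Min-norm least squares on the first p features, for increasing p.
# Expect the test error to peak near the interpolation threshold p = n_train
# and to decrease again in the overparameterized regime p > n_train.
for p in [5, 10, 20, 35, 40, 45, 60, 100, 200]:
    w_hat = np.linalg.pinv(Xtr[:, :p]) @ ytr          # min-norm interpolator once p > n_train
    mse = np.mean((Xte[:, :p] @ w_hat - yte) ** 2)
    print(f"p = {p:3d}   test MSE = {mse:.3f}")
```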
One idea that people are focusing on is the use of symmetries in the design of machine learning models. If you look at the machine learning architectures that are successful for many problems these days, they all exploit some exact or approximate symmetry. Convolutional neural networks, the architecture that changed how deep learning was perceived, are approximately equivariant with respect to translations; graph neural networks are equivariant with respect to the action of permutations; and transformers can also express many symmetries. So there is something in the design of the class of functions that exploits symmetries, and since this workshop is about structured data and structured models, imposing symmetries in machine learning models is one of the themes.

Another motivation to think about symmetries is that symmetries are everywhere in the physical sciences. Some of them come from the actual world: symmetries arising from conservation of energy or conservation of momentum are given by physical law; these are called active symmetries, and they are connected with conserved quantities of certain physical systems. But you also have symmetries that have nothing to do with the actual world and everything to do with how you represent it. If I have a physical system, I can express it in a coordinate system, and the choice of coordinates is arbitrary; a different choice of coordinates is a reparameterization of the world, and the predictions of my model should transform predictably under that reparameterization. You can write this as invariance or equivariance with respect to a group action. Since there is no unique way to represent the world, there are always arbitrary choices, and if you want machine learning models that generalize, perhaps to different coordinate systems or to different ways of representing your data, the ideal way is to write them in a coordinate-free way. If you can implement coordinate freedom, units equivariance, or gauge invariance in machine learning models, you may be able to generalize better. That is the claim.

How do we write these symmetries mathematically? Using group actions. The idea is that you have a group G that acts on the data, and you may want to find a function that is invariant with respect to that group action, meaning that if I act on the input with a group element, the output does not change. For example, in an image classification problem, if I rotate the image, the classification does not change; that is an invariance. For an equivariance, the group acts on the output as well: a function is equivariant if every time I transform the input, the output transforms by the action of the same group element. It does not need to be the same group action; the group can act differently on the input and on the output, but the action of a group element on the input corresponds to the action of the same group element on the output. Here is an example where the group action is not the same: I have a dynamical system, and the goal is to predict its state after a certain amount of time; if I rotate the dynamical system, the output rotates in the same way. So what does equivariant machine learning do? It parametrizes the class of functions, the hypothesis class where you are going to do the learning, so that for every choice of parameters the corresponding function satisfies the symmetries. You are doing the learning in a space where every function satisfies the symmetries by construction.
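In symbols (my notation, not from the slides): for a group $G$ acting on inputs by $\rho_{\mathrm{in}}$ and on outputs by $\rho_{\mathrm{out}}$, invariance and equivariance of $f$ read

$$
f(\rho_{\mathrm{in}}(g)\,x) = f(x)
\qquad\text{and}\qquad
f(\rho_{\mathrm{in}}(g)\,x) = \rho_{\mathrm{out}}(g)\, f(x),
\qquad \text{for all } g \in G \text{ and all inputs } x.
$$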
One question is how you do that. I will not go into the details, but you can use representation theory or invariant theory to parametrize the class of functions, or you can do some kind of averaging over the group elements if your group is small; that is essentially what equivariant convolutions do. If you have questions I am happy to come back to this later. In practice, most machine learning models do not actually parametrize the class so that every function satisfies the symmetries; instead they do data augmentation, which means applying the group transformations to some of the inputs and using that to promote the symmetry, and there is a lot of mathematics you can study in that space as well.

As an example, let us talk about graph learning. If I have a graph, I can express it as an adjacency matrix, but several adjacency matrices correspond to the same graph: if I permute the rows and the columns I get another one, because the ordering of the nodes is not a property of the graph. There are many ways to write the same graph as a matrix, and that is a passive symmetry, another parametrization of the same object. So if I am going to learn a function of the graph, the function should be invariant with respect to this group action, the permutation group acting by conjugation on the input matrix. For instance, the length of the shortest path has that property: it is invariant with respect to permutations. If instead you want to learn an embedding, which is what the typical graph learning problem does, you learn for every node a vector in R^d, and then whatever downstream task you have to perform, you perform it on these learned node representations. That representation needs to be equivariant with respect to the permutations: if I permute, say, the green and the yellow nodes, the corresponding embeddings are permuted as well. In this case the function is equivariant with respect to the action of permutations, where the action on the input is by conjugation and the action on the output is by multiplication by the permutation matrix.

So the question is: how can we efficiently parametrize the space of invariant and equivariant functions with respect to permutations? There are many ways to do this, but the one practitioners use the most is message passing neural networks, because they are easy to implement, they are very scalable, and they implement the symmetry in a very simple way. The idea is that you have your graph, and there is a message function, a single shared function that takes the state of a node and outputs a message, which is a vector. Every node also uses the same aggregation function, which aggregates all the messages received from its neighbors in a way that is invariant with respect to permutations. Because the message function and the aggregation function are the same for every node, the construction is permutation equivariant essentially by definition. That is how it is implemented.
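(Not the speaker's implementation.) A minimal numpy sketch of one such layer, with a shared linear message function and sum aggregation, together with a numerical check of the permutation equivariance; the graph and weights are random placeholders of mine.

```python
import numpy as np

def mpnn_layer(A, X, W_msg, W_self):
    # One round of message passing: every node applies the same shared linear
    # message function to its features, messages are summed over neighbors
    # (a permutation-invariant aggregation), and each node updates its state.
    messages = X @ W_msg
    return np.tanh(X @ W_self + A @ messages)

rng = np.random.default_rng(0)
n, d = 6, 4
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                     # random undirected graph
X = rng.normal(size=(n, d))                        # node features
W_msg, W_self = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Permutation equivariance check: relabeling the nodes before the layer
# gives the same result as relabeling its output.
P = np.eye(n)[rng.permutation(n)]
lhs = mpnn_layer(P @ A @ P.T, P @ X, W_msg, W_self)
rhs = P @ mpnn_layer(A, X, W_msg, W_self)
print(np.allclose(lhs, rhs))                       # True
```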
But this architecture has limitations on its expressive power. If you want to express every invariant function with respect to the permutation action, you would need to be able to solve the graph isomorphism problem, because the orbit of a graph under this group action is exactly its isomorphism class. Graph isomorphism is a hard problem, so we do not expect a small architecture that can separate every pair of non-isomorphic graphs. What happens with this architecture is that it has expressivity issues: there are graphs that are not isomorphic, but for every choice of parameters in your model the output is the same. One very simple bad example is the following: take two triangles versus a hexagon. Both are 2-regular graphs, every node has degree two and there are six nodes in total, and these two graphs cannot be separated by this architecture; a small numerical demonstration appears below. Of course you can do other things that give more expressivity to this class of functions and remove that particular issue, but you still have constraints on the kind of functions you can express. There is a lot of literature related to the Weisfeiler–Leman test, a test for graph isomorphism that lets you characterize which functions can and cannot be expressed by these graph neural network architectures, and I am happy to discuss it, but the summary is that message passing gives functions that are equivariant with respect to this group action yet cannot express all such functions, only a subset.

What I want to talk about is how we can change this message passing structure to get better expressivity and better performance in the case where you know you are learning a function on a fixed graph. Say you have a time sequence of graph signals: the graph is fixed and the signals change over time. In that setting you may not want to use a message passing network, because the symmetries it imposes are very strong and may not be that useful for your problem. Maybe it is not necessary to impose all the symmetries, and you can get a better bias–variance trade-off by relaxing or breaking them. There are two examples I can discuss. One is a human pose estimation problem: you have a graph that represents the joints of a skeleton, and from the 2D projected positions of the nodes you want to predict their 3D coordinates. People typically use graph neural networks for this computer vision task.
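(Again a toy demonstration of mine, not from the talk.) With identical constant node features, shared-weight message passing produces exactly the same node states on two disjoint triangles as on a hexagon, so any permutation-invariant readout returns the same value for both graphs:

```python
import numpy as np

def cycle_adj(order, n=6):
    # Adjacency matrix of a cycle through the listed nodes.
    A = np.zeros((n, n))
    for i, j in zip(order, order[1:] + order[:1]):
        A[i, j] = A[j, i] = 1.0
    return A

A_two_triangles = cycle_adj([0, 1, 2]) + cycle_adj([3, 4, 5])
A_hexagon = cycle_adj([0, 1, 2, 3, 4, 5])

rng = np.random.default_rng(1)
W_msg, W_self = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
h1 = h2 = np.ones((6, 3))                      # identical constant initial features

for _ in range(3):                             # any number of rounds behaves the same
    h1 = np.tanh(h1 @ W_self + A_two_triangles @ h1 @ W_msg)
    h2 = np.tanh(h2 @ W_self + A_hexagon @ h2 @ W_msg)

# A permutation-invariant readout (sum over nodes) cannot tell the graphs apart.
print(np.allclose(h1.sum(axis=0), h2.sum(axis=0)))   # True
```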
The other example is traffic flow prediction. You have a road network with sensors that measure the traffic flow, and you want to predict where there is going to be a traffic jam. The network is fixed, there is a connectivity structure that makes sense to study, and there is a signal that changes with time.

In this setting the graph is given by its adjacency matrix, and the signals are supported on the nodes; they are functions on the nodes of the graph. The typical permutation action permutes the signal and conjugates the adjacency matrix by the same permutation. What we are going to do now is fix the adjacency matrix and only permute the signal; that is the group action we consider (both actions are written out below). You can think of the first as a passive symmetry and of the second as an active symmetry, and when the group you use is the automorphism group of the graph the two coincide, because the automorphisms fix the graph. By choosing different subgroups of the permutation group you can trace out something that looks like a double descent curve, which I will show you later. The intuition, going back to passive versus active symmetries, comes from convolutional neural networks: you can think of a convolutional neural network as a graph neural network where the signal is an image supported on a grid graph, and of a graph neural network as a generalization of the classical convolutional network where you change the topology of the underlying graph. In the convolutional symmetry you fix the domain and shift the signal; in the graph neural network symmetry you permute the signal and the graph simultaneously. Here we want to go back to something like the original convolutional symmetry and decouple the permutation of the signal from the permutation of the underlying graph. When we relax the symmetries that these graph neural networks impose, we obtain something like the double descent curve I showed you earlier in the talk. The idea is that when you impose more symmetries, the class of functions you can express is smaller, and when you relax them, the class is larger. So you get a bias–variance trade-off: if you index your classes of functions by the amount of symmetry you impose rather than by complexity, the generalization error has the same kind of shape you observe with double descent, and this can be explained by computing the bias–variance trade-off under the symmetry constraint, with the amount of symmetry replacing the model complexity on the x-axis.
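To make the two actions concrete (my notation): the usual graph neural network symmetry acts on the adjacency matrix and the signal together, while the relaxed setting fixes the graph and acts only on the signal for a chosen subgroup $G$; the two coincide when $G$ is the automorphism group of $A$,

$$
(A, x) \;\mapsto\; \big(P A P^{\top},\, P x\big),\ \ P \in S_n
\qquad\text{versus}\qquad
(A, x) \;\mapsto\; \big(A,\, P x\big),\ \ P \in G \le S_n .
$$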
In a simple setting: suppose G is a subgroup of the permutation group and your data is sampled from a permutation-invariant distribution, with labels generated as some f*(x) plus noise, so there is an f* you do not know that generated your data. Given a function f that is your estimator, you can decompose it into two parts: the projection of f onto the space of G-invariant functions, and the projection of f onto the orthogonal complement of that space. The risk gap is the risk of f minus the risk of the projection of f onto the invariant functions, and you can write this difference as two terms: one is the squared norm of the projection of f onto the orthogonal complement, so whatever part of f you lose by projecting onto the invariant functions, and the other is an inner product term that in some cases you can show is actually zero (one way to write this decomposition is sketched below). In the linear regression case, for instance, if you have more data points than dimensions then the bias is zero; one of these terms plays the role of the bias in the bias–variance trade-off and the other plays the role of the variance. Using this, and I can show it later if you ask, you can write an explicit bias–variance trade-off for linear regression with approximate symmetries, using the typical computations you do for linear regression, and you can construct examples where you change the amount of symmetry, for instance impose more symmetry than your problem actually has, and thereby increase the bias but reduce the variance significantly. The same trick you use in classical statistical settings, where you make the risk of an estimator smaller by increasing the bias a little and decreasing the variance a lot, works here when you change the amount of symmetry you impose in your graph neural network models.
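One way to write the decomposition being described, under assumptions I am supplying (squared loss, data distribution invariant under $G$, $\bar f = P_G f$ the projection of the estimator onto the $G$-invariant functions and $f^{\perp} = f - \bar f$ its orthogonal part); this is my reconstruction, not the exact statement on the slides:

$$
R(f) - R(\bar f)
\;=\; \mathbb{E}\,\big\| f^{\perp}(x) \big\|^2
\;+\; 2\,\mathbb{E}\,\big\langle \bar f(x) - f^{\star}(x),\, f^{\perp}(x) \big\rangle ,
$$

where the cross term vanishes, for example, when the target $f^{\star}$ is itself $G$-invariant, since $f^{\perp}$ is orthogonal to the invariant subspace whenever the data distribution is $G$-invariant.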
So how do you impose, or rather change, the symmetries? The simplest way, if you have a graph neural network built via message passing, where the message function is the same for every node, is to cluster your nodes and then say that nodes in the same cluster use the same message function while nodes in different clusters may use different message functions. That is one way of breaking the symmetries (a sketch of such a layer appears below). Alternatively, if you know exactly which symmetries you want to impose, you can use representation theory to parametrize the space of functions with that specific symmetry, and I can explain that later if you want. We have examples where we do this for the human pose estimation and the traffic flow prediction problems I described earlier: we implement different symmetries and then see which imposed symmetry gives the better performance. It is typically something in the middle between the full permutation symmetry and no symmetry at all. For instance, in the traffic flow prediction we can cluster the nodes in different ways and use a different message function for each cluster, and the classical graph neural network, the one that uses the same message function for every node, has lower performance than the one that breaks the symmetry in this way. Okay, any questions about this?

Yes. No, what is S_n-invariant is the data. Say, for instance, you generate the data from a Gaussian, so it is S_n-invariant. The target does not need to be S_n-invariant; you can think of your target as G-invariant for some group G, and then look at the different groups you could project onto. If there is a mismatch between the actual G that the target satisfies and the G you impose in your model, you can compute the bias and the variance that come from using the wrong group in the estimator.

That is a great question. We only do the clustering on the graph structure, but it would make sense to cluster based on the graph and the signals together, or, if you have labels, you could even use the labels to do the clustering. Here we only cluster based on the graph structure. For instance, in this example the clusters come from the highways: sensors on the same highway are in the same cluster. Yes, exactly. Okay.
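(A toy sketch of mine, not the speaker's code.) One way to break the symmetry as described: nodes in different clusters use different message weights, so the layer is equivariant only with respect to permutations that preserve the clustering, a subgroup of S_n. The clustering and sizes here are hypothetical.

```python
import numpy as np

def relaxed_mpnn_layer(A, X, cluster_id, W_msg, W_self):
    # Like an ordinary message-passing layer, except each node uses the message
    # weights of its cluster; only cluster-preserving permutations commute with it.
    messages = np.stack([X[i] @ W_msg[cluster_id[i]] for i in range(len(X))])
    return np.tanh(X @ W_self + A @ messages)

rng = np.random.default_rng(0)
n, d, k = 6, 4, 2                                # toy sizes; two clusters
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(n, d))
cluster_id = np.array([0, 0, 0, 1, 1, 1])        # hypothetical clustering of the nodes
W_msg = rng.normal(size=(k, d, d))               # one message matrix per cluster
W_self = rng.normal(size=(d, d))

print(relaxed_mpnn_layer(A, X, cluster_id, W_msg, W_self).shape)   # (6, 4)
```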
So that was the first part of my talk. If you were asleep, this is a good time to wake up, because I am going to talk about something completely different, although it also has the flavor of using approximate symmetries in machine learning models. This is going to be about contrastive learning, self-supervised learning. If you have not seen it, it is very widely used right now: self-supervised models are used to learn the embeddings behind language models, vision-language models, and so on. The basic question is how to use your data in a self-supervised or unsupervised way to learn representations that carry meaning. The classical contrastive learning setting is the following. You want to learn an embedding from images to some representation space, let us say the sphere. You take an image, this image of a cat, and you apply an augmentation, a transformation such as color shifting, cropping, or rotation, something that does not change the fact that it is a cat. You then learn a function so that all the augmented versions of the image are mapped to the same point; those are your positive pairs. You also have negative pairs: cats and dogs, for instance, come from different classes, so you want to minimize the distance between embeddings of augmented versions of the same image and maximize the distance with respect to the negative pairs. That is the classical, simpler objective: you learn an embedding, and the loss function they use, InfoNCE, basically says that positive pairs should be close and negative pairs should be far apart.

What we want to do is a form of self-supervised learning where, instead of forcing the augmented versions of the data to go to the same point, we want the embedding to be equivariant with respect to some group transformation. The idea is that augmentations in the input space should correspond to rotations in the embedding space, if you are embedding into the sphere; we want to encode the augmentations as linear transformations of the embedding space. How do you do that? Also, the loss function for classical contrastive learning is written in terms of pairs of points, so we want ours to be written in terms of pairs of points as well. To do this, we use classical results from invariant theory. Suppose we have a function that takes n vectors in R^d and outputs a vector in R^d, for instance the position of one of the particles or the position of the center of mass, such that if I apply an orthogonal transformation to all my input vectors, the output transforms in the same way. Using classical invariant theory, one can show that a function is orthogonally equivariant if and only if it can be written as a linear combination of the input vectors with invariant coefficient functions; the statement is written out below. And the first fundamental theorem for the orthogonal group says that the invariant functions are exactly the functions of the inner products of the input vectors. One direction is easy: if I rotate all my points, the inner products do not change. The other direction is the content of the theorem: you can reconstruct the vectors, up to an orthogonal transformation, from the inner products. To go from the first fundamental theorem to the parametrization of all the equivariant functions, the idea is that equivariant functions can be constructed from gradients of invariant functions, so gradients of functions of inner products; that is why they look like this. How do we use this for equivariant contrastive learning? The idea is that we want the augmentations of the inputs to correspond to orthogonal transformations of the embeddings.
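The characterization being used, as I would write it (the statement for the orthogonal group; the notation is mine): a function $f : (\mathbb{R}^d)^n \to \mathbb{R}^d$ is $O(d)$-equivariant if and only if it can be written as

$$
f(x_1, \dots, x_n) \;=\; \sum_{i=1}^{n} c_i\!\big( \langle x_a, x_b \rangle_{a,b=1}^{n} \big)\, x_i ,
$$

with coefficient functions $c_i$ that depend only on the Gram matrix of pairwise inner products, which is exactly the form the first fundamental theorem prescribes for $O(d)$-invariant functions of $x_1, \dots, x_n$.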
We know from this theorem that a function is orthogonally invariant if and only if it is a function of the inner products; so if something preserves the inner products, it corresponds to an orthogonal transformation. What we do is take the classical contrastive learning setting and add a loss term that is zero if and only if the transformation induced on the embedding is equivariant; the term is written out below. In expectation over the augmentations in the set A and over pairs of training points, we ask that the inner product between the embeddings of the augmented points equal the inner product between the embeddings of the original points. The fundamental theorem then tells you that this term is zero if and only if for every augmentation a there exists an orthogonal transformation Q such that f of a of x equals Q applied to f of x. In other words, you can make that loss term zero if and only if the augmentations in the input space correspond to orthogonal transformations in the embedding space. In practice you are not going to make that loss exactly zero, so you approximate the property by minimizing it: you find a transformation that approximately has it. To make it exactly zero you would need to embed the set of augmentations, which is typically not even a group, because you use croppings and other strange operations, as a subgroup of the orthogonal group. If your augmentations do form a group, and it is a compact group, then you can embed it inside O(D) if D is large enough, but that is not what happens in practice. What happens in practice is that you make the embedding closer to equivariant by using this loss function.
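My reconstruction of that extra term, for an embedding $f$ and a distribution over augmentations $\mathcal{A}$ (the exact sampling and weighting in the paper may differ):

$$
\mathcal{L}_{\mathrm{equiv}}(f) \;=\;
\mathbb{E}_{a \sim \mathcal{A}}\; \mathbb{E}_{x,\,x'}\;
\Big( \big\langle f(a(x)),\, f(a(x')) \big\rangle - \big\langle f(x),\, f(x') \big\rangle \Big)^{2},
$$

which vanishes exactly when every augmentation preserves all pairwise inner products of the embeddings, that is, when it acts on the embedding space by some orthogonal transformation.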
We have experiments showing that, for different image classification problems, imposing this equivariance improves the accuracy of the models. Another perhaps interesting feature: here the augmentations take the input images and make them a little more yellow, and when you look at the image closest to the yellow version in the embedding space, the yellow feature is present; whereas with the invariant approach the embedding does not really see the change in color. So it makes sense to move away from invariance, because you also want to capture the nuances of the changes to the images in the original space. Okay.

With this I am going to finish my talk. The point of this talk was to say that approximate symmetries can be a good inductive bias for machine learning models. I talked mostly about graph learning, but also about contrastive learning. I am happy to chat or discuss invariant functions, equivariant machine learning, representation theory and how to use it in machine learning models, and many other things. These are my references, and thank you for your attention.

Thank you so much for this great talk; it is a very nice start for this conference. Is there any question from the audience?

Can you explain again how using the symmetries is different from the usual augmentation, in the last example?

Yeah. Maybe this plot shows it. Classical invariant contrastive learning maps the augmented versions of the same image to the same point; it is invariant with respect to the augmentations. What we want instead is that a transformation that does some form of augmentation in the input, here changing the coloring scheme, and that you apply in the same way to these two objects, corresponds to a rotation in the embedding space, a small rotation. So the augmentation becomes interpretable: moving in this direction in the embedding space corresponds to changing the color in this particular way.

So concretely, maybe you need to warp the data less to do the embedding, because normally you need to squash all the points together, which might force you to learn a very complicated function in order to map all the different colors to the same point; but now your embedding space is bigger. Yes. And therefore the mapping can be simpler, and maybe that is why it works better. You still need an invariant term in the loss, because you want things that come from the same object to be mapped close together in the embedding space, but you also want to be able to interpret the augmentation in the input space as a linear transformation in the embedding space. Okay, so this is just added then? Added, yes.

Okay, there is another question; in the meantime maybe the next speaker can prepare, I do not remember who it is. For every speaker, you just need to connect to the Zoom and that is it; then you present from your computer. Okay.

Thank you for your talk; I came a little bit late. You mentioned applications of this method: is it applicable to, for example, breast cancer classification? I am working on breast cancer classification using multimodal machine learning. How would we use this kind of strategy or method in that area? I can see that you are doing augmentation, and in the breast cancer area we have to be very careful with medical information, for example in histopathology... I agree, yes. I do not know off the top of my head which augmentations would make sense in your case.
But there are some forms of equivariant machine learning, for example equivariance with respect to small rotations or small transformations of the image, that could be useful; I would need to look at exactly what your specific application is, because it is not something you can take off the shelf. Okay, maybe I will have a chat with you after that.

Someone left a phone here. Yeah, it's mine. And sorry, I forgot there was supposed to be a coffee break before; I am not properly awake yet. I have a question, actually. All of this is based on a priori knowledge of certain invariances and equivariances. Is there a way to learn non-trivial symmetries that you may not know a priori?

Yes, there is some work that does that. There is a paper by Andrew Wilson's group at NYU that learns some form of symmetry during the training of the model, by parametrizing the symmetries using matrix exponentials. But I think that area has not been explored enough, and there is a lot to do in that space. You need very specific assumptions about what you want your group to look like, and then you find your group transformation; in that case it was groups acting linearly in a specific way. In general, I do not think it is a solved problem.

Well, thank you for the talk. That representation theorem for equivariant functions: how general is it? Does it hold if you have a different group acting, with different group actions?

It is more general. To see the equivariant functions as gradients of invariant functions, you need the representation of the group to be orthogonal, but there is a more general way to see it that does not require that. The idea is that if you have an equivariant map from V to W, then you can see it as an invariant map from V times W-star to R; that is where the whole thing comes from. For the orthogonal group, the representation on the dual is the same as the representation on the original space, which is why it acts in the same way; in the general case you have to work with the dual representation. Does that make sense? I can share some references with you, but there is a generalization that looks a little different from what I showed you. Thanks.
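The correspondence being described, in symbols (my phrasing): an equivariant map $f : V \to W$ determines an invariant function on $V \times W^{*}$ via

$$
F(v, \varphi) \;=\; \varphi\big(f(v)\big), \qquad F(g \cdot v,\ g \cdot \varphi) = F(v, \varphi) \ \ \text{for all } g \in G,
$$

where $G$ acts on $W^{*}$ by the dual representation; for orthogonal representations, $W^{*}$ can be identified with $W$ carrying the same action, which is why the gradient picture works in that case.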
Another question on the first part of the talk: how do you choose the subgroup? In that case, instead of taking the whole group of permutations, you take some subgroup; how do you pick these smaller subgroups, which are not all of S_n? What kind of heuristic do you use?

We did it based on the different applications we had. What we typically do, though there may be better ways, is cluster the nodes, take the full permutation group within each cluster, and then take the semi-direct product with the automorphisms of the structure you get after clustering. But maybe you do not have to do clustering, and there may be other ways to relax the symmetries that make more sense. Okay, thanks.

All right. If there are no further urgent questions, let us keep them for the coffee break, which is going to take place on the terrace, and we resume in 25 minutes. Thank you very much, Soledad.