Okay, I guess now it works. Good, the recording is in progress. So, thank you very much for the invitation. My name is Mario Geiger, I am a postdoc at MIT with Professor Tess Smidt, and I will talk about group theory in machine learning.

This talk is about equivariant neural networks. Imagine you have a task to solve and you know it respects some symmetries; in this example, it's rotation. What you can do is impose this symmetry from the start, because you know it is a strong prior on your problem. You can make a neural network, an architecture, that respects this symmetry: if you rotate the input of your neural network, it will rotate the output the same way. This can be done in such a way that it is always satisfied, for any parameters, so even before training; actually, the animation I'm showing was recorded before training.

I'm really bad at doing presentations, so to motivate my talk I made a graph of dependencies of what I have to explain to get to my point. I would like to present a way to build equivariant neural networks, and then I want to present what can improve data efficiency in equivariant neural networks. We discovered some interesting learning curves: they show the generalization error as a function of the size of the training set, and we observe that when you add some equivariance (I will explain what that means) it improves the data efficiency. By following the arrows that show what I have to explain to get to my point, I will have to explain the two architectures that are considered in these two plots. For that I will have to explain what a graph convolution and an equivariant network are, and for that I have to explain some basics of group theory.
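The equivariance property described above can be written as f(Rx) = R f(x) for every rotation R. As a minimal sketch (not from the talk), here is a toy rotation-equivariant map checked numerically with NumPy; the function and the angle are made up for illustration:

```python
import numpy as np

def f(x):
    # A toy equivariant map: scale the vector by its norm (the norm is rotation-invariant).
    return np.linalg.norm(x) * x

def rot_z(t):
    # Rotation by angle t about the z axis.
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

R = rot_z(0.7)
x = np.random.randn(3)

# Equivariance: rotating the input rotates the output the same way.
assert np.allclose(f(R @ x), R @ f(x))
```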
So let's start with what a group is. A group is a set of elements with a binary operation; here you have the Wikipedia definition. It has to satisfy three rules. The first is associativity: (a · b) · c is equal to a · (b · c). The second one is identity: you have an element in your set such that if you multiply this element by any other element, you stay with that same element. And the last one is that for each element of your set there is another element in the set that is its inverse, so if you multiply the two you get the identity.

Now let's move to representations. Because my last point is about neural networks that are equivariant to rotation, I will start with examples of representations of rotations; I think it's good to start with examples, because a representation is a bit abstract if we start with the definition. So these are a few examples of representations of rotations. The simplest and most common one are the vectors: vectors are just three numbers put together, and when you make a rotation they are multiplied by a matrix. Another example, which is the trivial representation, are the scalars: these are single numbers, and when your system undergoes a rotation they are unaffected by it. Then let's take a representation that is a little bit more fancy: a signal on the sphere is also a representation. It's a function from the sphere to the real numbers, and when you rotate it you obtain a new function f' that is the same as the old function but evaluated somewhere else, related by the rotation matrix once again. Finally, a scalar field is also a representation of rotation; this time it is a function from the space R³ to R, and you have the same formula to transform your scalar field under a rotation.

Now, this is the formal definition of a representation. It is a tuple of two things: a function ρ and a vector space V. ρ is a function from the group into the functions from V to itself, and it has to respect two properties. The first is that ρ(g) is a linear function. The second one is more interesting: it has to respect the structure of the group. If you have x, an element of your vector space, and you transform it with ρ(g1), you get something again in your vector space V, which you can then transform with ρ(g2); you can also do that in one step, by transforming x directly with ρ(g2 · g1).

Now let's move to what irreducible representations are. There are two types of representations: some are reducible and some are irreducible. Among the examples I showed you for rotation, the first two are irreducible and the last two are reducible. What it means is that for the first two you cannot find sub-vector-spaces that are also representations: you cannot make them into smaller representations, they are the smallest. But for the two other vector spaces you can find sub-vector-spaces that are also representations. Let's take the example of the scalar field: I can take this function and decompose it on some basis, for instance the atomic orbitals. The atomic orbitals are a nice basis with respect to rotation, because then each of the coefficients transforms independently with some irreducible representation. All these coefficients c together contain the same information as the field itself, but each of them transforms independently with an irreducible representation.
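The second property, that ρ respects the group structure, says ρ(g2) ρ(g1) = ρ(g2 · g1). As a minimal sketch (not from the talk), it can be checked numerically on the vector representation using rotations about the z axis, where composing two rotations simply adds the angles:

```python
import numpy as np

def rho(t):
    # Vector (L = 1) representation of a rotation by angle t about the z axis.
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

g1, g2 = 0.4, 1.3
x = np.random.randn(3)

# Applying rho(g1) and then rho(g2) equals applying rho of the composed rotation.
assert np.allclose(rho(g2) @ (rho(g1) @ x), rho(g2 + g1) @ x)
```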
So, these are all the irreducible representations of the group of rotations. They are indexed by an index L that goes from zero to infinity. The first one, L = 0, are the scalars; they have dimension one. The second one, L = 1, are the vectors, of dimension three. The dimension is always 2L + 1, so then five, seven, and so on, and here are some examples of quantities that transform with these irreducible representations.

What is interesting with the irreducible representations is that every quantity that is a representation can be decomposed into irreducible representations. Take for instance this 3 x 3 object, a stress tensor for instance: it can be decomposed into an L = 0, an L = 1 and an L = 2 part.

Now this brings me to the tensor product; it is the last operation I need to explain to you. You can take two representations, ρ1 and ρ2, and create a new representation by multiplying them together. This new representation acts on the vector space that is the tensor product of the vector spaces V1 and V2, and it works this way: take x in V1 ⊗ V2, so it can be represented as a matrix of dimension dim(V1) x dim(V2), and you simply transform it by multiplying on the left by ρ1(g) and on the right by ρ2(g) transposed. With indices you can write it like this, and there is no more transpose: x'_{ij} = Σ_{k,l} ρ1(g)_{ik} ρ2(g)_{jl} x_{kl}, where k is contracted with ρ1 and l is contracted with ρ2. When you take the tensor product of two representations, it is typically not irreducible, so it is typically reducible, and since everything can be decomposed into irreducible representations, you can make a change of basis and decompose it into a direct sum of irreducible representations.

So this is the big picture. If you have a group that you are interested in, you can find its irreducible representations (if you go to the library and open the right book), and then you can make tensor products of these representations and decompose them into other representations. That is a very great tool that we will use to make equivariant networks.

The last thing about the tensor product is the decomposition rule for rotations; this rule works only for rotations. When you take the tensor product of the irrep of index L = 2 with the irrep L = 1, you can decompose it into L = 1 plus L = 2 plus L = 3, and the general rule is this one: D^j tensor D^k can be decomposed into the irreps from |j - k| up to j + k.

So now we have all the tools to make equivariant neural networks, and the building blocks to do that are polynomials. With the tools I presented you can build any equivariant polynomial, and I put a little θ here because the coefficients of this polynomial can be made learnable.
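This decomposition can be computed directly with e3nn, the library mentioned later in the talk. The following minimal sketch assumes the e3nn API, where FullTensorProduct builds the tensor product of two sets of irreps and exposes the decomposed output irreps:

```python
from e3nn import o3

# The selection rule: D^j tensor D^k decomposes into D^|j-k| + ... + D^(j+k).
tp = o3.FullTensorProduct(o3.Irreps("1x2e"), o3.Irreps("1x1e"))
print(tp.irreps_out)   # expected: 1x1e+1x2e+1x3e

# A 3x3 tensor, i.e. vector tensor vector, decomposes into L = 0, 1 and 2,
# like the stress tensor mentioned above.
tp = o3.FullTensorProduct(o3.Irreps("1x1o"), o3.Irreps("1x1o"))
print(tp.irreps_out)   # expected: 1x0e+1x1e+1x2e
```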
So you have learnable equivariant polynomials, which is the only thing you need to make equivariant neural networks. This is roughly the picture of an equivariant neural network: you have an input that transforms according to some representations, you make a layer with these tensor products and direct sums and some parameters, and it gives you some internal features that also transform with some representations. You can stack these layers together, because the composition of equivariant functions gives you an equivariant function.

This is a very non-exhaustive list of equivariant neural networks that exist in the literature. The most famous one is the convolutional neural network, which is equivariant to translation. Then people came up with neural networks equivariant to the 90-degree rotations; you have just four rotations, so it's a very small group. Then people also did rotations in 2D, which is a continuous group of dimension one. There is also scale, and then 3D rotations; people also did the Lorentz group, and probably much more, but that is what came to my mind.

In my work, what I like to do is to develop a library called e3nn that allows you to build neural networks equivariant to rotation. You can just install it with pip install e3nn, and for instance you can create spherical harmonics, which are just an example of polynomials that are equivariant to rotations, and that is very useful.

Now I have to explain what a graph convolution is before I can explain the architecture of NequIP, which is related to the first figure of learning curves. If you have a graph, the graph convolution is simply that each node sends a signal to each neighbor: if you focus for instance on this node, each node that is connected to it sends it a blue signal, and everything is done at the same time.

Now, in the NequIP architecture from Simon, let's isolate a source node and a destination node. This architecture is made for molecular dynamics, so the nodes here are atoms that have a position in space. Given a source node and a destination node, you have a relative position r, the source node has some features on it, and the message is simply the tensor product between the source features and the spherical harmonics, up to some L, of the relative position. In this formula the parameters and the radial function are missing, but that is not really important for this talk.

And this is what has been observed. This is the mean absolute error on the prediction of forces: they train a neural network to take positions of atoms and types of atoms and to predict the force that each atom undergoes. On the x axis you have the number of training data, and the different curves correspond to different maximum orders of the irreducible representations they use for the features and the messages. They observe that if they use only scalars they get some learning curve, so they can train it, but it takes a lot of data to reduce the generalization error; if they use higher-order representations, the learning curve has a better slope.
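A NequIP-style message of this kind can be sketched with e3nn. This is a minimal illustration and not the actual NequIP code: the irreps and sizes below are made up, and the learnable radial weights mentioned above are left out.

```python
import torch
from e3nn import o3

# Message = tensor product of the sender's features with the spherical
# harmonics of the relative position (radial weights omitted).
irreps_feat = o3.Irreps("8x0e + 8x1o")            # assumed feature irreps on the source atom
irreps_sh   = o3.Irreps.spherical_harmonics(2)    # Y_0, Y_1, Y_2 of the relative position
irreps_out  = o3.Irreps("8x0e + 8x1o + 8x2e")     # assumed irreps of the message

tp = o3.FullyConnectedTensorProduct(irreps_feat, irreps_sh, irreps_out)

h_src = irreps_feat.randn(1, -1)                  # features on the source node
r     = torch.randn(1, 3)                         # relative position source to destination
Y     = o3.spherical_harmonics(irreps_sh, r, normalize=True)
message = tp(h_src, Y)                            # transforms equivariantly under rotation
```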
This other architecture, MACE, is a bit more fancy. In this architecture the message is not just a function of one neighbor but of ν neighbors. Let's say ν equals three: you have three neighbors, so you have the features of the three neighbors, and you also have the three relative positions to the destination, r1, r2 and r3, and this time you build a big equivariant polynomial, which they call F, that has some parameters and is a function of all these features multiplied by the spherical harmonics of the relative positions.

So in this architecture they have two parameters: this L, which is the maximum irreducible representation they use for the features and for the messages, and this ν, which is how many neighbors you take to create this polynomial. If you take ν = 1, you are in the same case as the NequIP architecture. They observe the same thing: L = 1 and L = 2 are better than L = 0. But now, if they increase ν, they can even keep L = 0 and they still observe this improvement in the learning curve. What it means is that even if the features and the messages are just scalars, it is apparently sufficient to improve the learning curve and get a better generalization, I mean data efficiency, as long as you have enough terms in this polynomial: if you have ν = 2 you can for instance take spherical harmonics, and they take them up to L = 3, and you can do the scalar product of two L = 3 spherical harmonics and create a scalar.

So, in conclusion, equivariant neural networks are more data efficient if they incorporate tensor products of order one, two, three, four, and not only scalars. But the features at every layer do not need to be of higher order: as long as you have, inside the layers, some tensor products of higher order, even if you then go back to scalars, it's fine.

So thank you for listening, and the slides are available online.

Yes, we do. I will start there, thanks. I didn't understand the last comment you made on the graph of MACE, so this ν...

It's not just the number of neighbors that you consider for a given destination. Instead of having just one message per neighbor, if let's say ν = 2, for this node you will consider every pair of neighbors: this guy and this guy is a pair of neighbors, this guy and this guy is a pair, there are a lot of pairs of neighbors. You consider all of them, and each pair of neighbors sends you one message. Here I isolate just one triplet, or pair, of neighbors, and this pair sends a message that is a polynomial of the spherical harmonics of the relative positions and of the features on each of the neighbors in the pair.

Okay, so you're saying that if you're getting messages from pairs of neighbors, let's say, instead of just single particles, I guess in this case, right?

Yes, it's MD, yes.

Then you don't need to use vectors in the layers?

Exactly. To have a good data efficiency you don't need these features h to be vectors; they can just be scalars. But because you use two neighbors to create the message, you can take higher-order spherical harmonics of the relative positions and scalar-product them together to get back scalars, and that is apparently sufficient to reach a good data efficiency.
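That last trick, contracting two higher-order spherical harmonics back down to a scalar, can be sketched with e3nn. This is a minimal illustration under assumed shapes, not the actual MACE implementation:

```python
import torch
from e3nn import o3

# Spherical harmonics up to L = 3 of the two relative positions in a pair of neighbors.
irreps_sh = o3.Irreps.spherical_harmonics(3)   # 1x0e + 1x1o + 1x2e + 1x3o
r1 = torch.randn(1, 3)                         # relative position to neighbor 1
r2 = torch.randn(1, 3)                         # relative position to neighbor 2
Y1 = o3.spherical_harmonics(irreps_sh, r1, normalize=True)
Y2 = o3.spherical_harmonics(irreps_sh, r2, normalize=True)

# Contracting the two L = 3 parts (the last 7 components) down to L = 0 is just
# their dot product: a rotation-invariant scalar that still carries angular information.
scalar = (Y1[:, -7:] * Y2[:, -7:]).sum(dim=-1)
```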
Okay, thank you.

So, in the NequIP figure, the task is invariant, right, at the end of the day?

Yes, the task is. They want to predict the forces applied on the atoms, but what they actually do is predict the energy and then take the derivative of the energy. So yes, the task itself is invariant, because you predict the energy.

And the claim is that even if the task is invariant, vector representations are better?

Exactly, exactly.

Can you speculate why this could be?

I have no idea, honestly, except for the many examples that show me that it is the case. I honestly don't know, because it has been proven that you don't need these higher-order operations to be expressive: you can be expressive enough with just scalars. But in practice we observe that it's better for generalization, learning curves and data efficiency. I think this is a common thing in neural networks: a one-hidden-layer fully connected neural network is expressive enough to express anything, but still we use deep neural networks with very fancy layers, because expressivity is not enough, we need data efficiency. I don't know how, but it probably induces an implicit bias that helps; I have no idea.

So in practice it may seem that having an architecture that respects exactly the symmetry that you have in your problem is the best choice, but can we check, for example in those plots that you show with the data efficiency as a function of how you construct your equivariant network, what happens if you compare this with, let's say, a full-blown overparameterized non-equivariant neural net trained with data augmentation? And of course you don't incorporate the cost of the augmentation in the sample size. Do you always see that it's better to have the equivariance embedded in the network?

For this particular example, this architecture, NequIP, is right now the state of the art for molecular dynamics, so it's better than any other competing method that is invariant, that does not contain this equivariance.

With enough data and augmented data, would you learn the equivariance at some point? Would the representations you learn be like equivariant polynomials at some point?

With enough data you can always learn the same thing: if you follow this curve, you can always reach any accuracy. One difference is that here we impose a structure in which every layer is an equivariant polynomial. If your neural network does not impose equivariance and is just a common neural network, by training it, it will learn to be equivariant end-to-end, so from the input to the output it will be equivariant, but the intermediate layers will not be equivariant, so you will not have these polynomials in it. Probably it will take all the layers to, in the end, make something equivariant; that is what it will learn to become.

Any other comments or questions?

Okay, this is on what was touched upon in the previous questions. I wonder, since you're essentially adding structure with this, can you say anything about the interpretability of this kind of network?
Yes, for sure it adds interpretability, because now your internal representation has a structure and has types, so you can look at these internal neurons and look at what they do. Because of their representations, the vectors can point in a direction and the scalars cannot. I don't know of work that has been done that exploits this extra interpretability, but for sure it can lead to better interpretability.

I'm just wondering if there is a simple example that you tried where you actually saw that, I don't know, neuron number one does this, neuron number two does that, and they implement some specific operations?

Yeah, unfortunately, I have nothing in mind like that.

Thanks. All right, so maybe we can move to the next speaker. Thank you a lot.