Yeah, thanks for having me. It's great to speak in front of people again. The way I structured this talk is that I start with a short part on neural networks in general, then move to neural network potentials, and then walk through my path in this field towards applications, with a couple of pointers to what other people did. It should do something if I press this button down here, but it doesn't. I have to switch it on somewhere. OK, thanks.

OK, so in general, what do we want to do with neural network potentials? We want to parametrize chemical space. That could be a potential energy surface in a local region like this, if you want to search for different conformers. It could also be a more global exploration, or even the whole chemical space with completely different compositions. Depending on what you actually want to do, this changes the requirements for how you model your systems. For example, in the very local case you might be fine without permutational invariance, because in a small molecule you can basically just assign an index to each atom. But in a case like this one, where you might have rotations of some groups, you actually want permutational invariance or equivariance. Then there are the materials, where you might have periodic boundary conditions, and you also want to respect that symmetry. And if you go to chemical space, you have different numbers of atoms and different atom types that you somehow need to represent, and you perhaps even want to transfer knowledge from one atom type to the next.

So how can neural networks help with that? Basically, we have our machine learning black box, which for the purpose of this talk is a neural network. We put in the atom types, represented by the nuclear charges, and the positions of the atoms, and we want to get some property out. For example the energy, which is why it's normally called a neural network potential, but you might also want to predict other properties like partial charges, dipole moments, or even some ensemble property. So if we have this setting and we know roughly what a neural network looks like, with all these neurons and connections and some property coming out at the end, how do we actually put in a system? Before we come to that, first the announced neural network crash course.

So what is a neural network? How can you think about it? What is the difference to the kernel learning approaches we've already seen, and what do they have in common? The first common thing is that we start off with a linear model. Here you see basically the simplest neural network, which is called the perceptron, and it is just a linear classifier. You have your input, represented by x1 and x2, so a two-dimensional input, and then a constant input, which represents the bias of the model. As you've seen in other talks, we don't want to deal with biases explicitly, so we just treat them like part of the input and ignore them for the rest of this lecture.
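As a minimal illustration of the perceptron just described, here is a hypothetical toy example (not code from the talk): the forward pass is a weighted sum of the inputs plus a bias, followed by a sign.

```python
# Minimal perceptron sketch: linear model plus sign "activation".
import numpy as np

def perceptron(x, w, b):
    z = np.dot(w, x) + b      # linear part: weighted sum of inputs plus bias
    return np.sign(z)         # sign decides class +1 or -1

w = np.array([1.0, -2.0])     # weight vector, orthogonal to the decision boundary
b = 0.5                       # bias (could also be folded in as a constant input)
print(perceptron(np.array([2.0, 0.5]), w, b))   # -> 1.0 or -1.0
```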
So what we do is multiply the first input with one weight and the second input with a different weight, add the bias, and sum everything together. That's the linear part. And since the perceptron is a classifier, we need to decide: is it class A or class B, or here, the positive blue class or the negative orange class. We can do that by just looking at the sign of this function. The weight vector w is orthogonal to this line here, so it points in the direction that is relevant for the classification. It doesn't matter where the data lies along the boundary: if you move parallel to it, it's still the same class. But if you move orthogonally to the decision boundary, the sign and the value of the function change. So that's the linear model, and then the sign, which we call the activation function.

The origins of neural networks lie in biological modeling of the brain. People thought: OK, there are some activations coming into this neuron, and at some point the neuron says, that's it, I'm firing, and sends a new signal to the next neuron. That's why it's called like that. And this gives us a decision boundary like this: in the two-dimensional case it's just a line, in 3D a plane that separates the data, and in higher dimensions a hyperplane.

Now the problem is, just as in Matthias' lecture, we only have a linear model, and in this case it is also a classifier, while most of us actually want to do regression. The first thing we can do is look at the activation function. The sign is not really nice: it jumps from minus 1 to 1, and you don't want that in a regression, but you also don't want it in a classification — I'll come back to that. So you can use different activation functions. One obvious choice is the sigmoid, or similarly the tanh, which are like smooth step functions going from 0 to 1 or from minus 1 to 1. In modern neural networks you often have so-called rectified linear units: a neuron with an activation function that is 0 if the input is negative, and just passes the input on if it is positive. So you're basically clipping away any negative values. Similar to that are the ELU, the exponential linear unit, and the softplus — oh, there's a term missing here — which saturate for negative inputs and go to the identity in the positive direction. You could also have something like this hard tanh, which is sometimes used as a fast approximation of the tanh: a linear part in the middle, clipped at the top and the bottom.

So you can use very many different activation functions, but that still doesn't give you non-linear predictions, because we are just applying the activation to the output of this linear model. It rescales the output, but it doesn't really change anything in terms of expressiveness. The key is to use multiple layers of neurons.
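For reference, a small sketch of the activation functions just mentioned, using the standard PyTorch implementations (the sample points here are arbitrary, just for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(torch.sigmoid(x))   # smooth step from 0 to 1
print(torch.tanh(x))      # smooth step from -1 to 1
print(F.relu(x))          # clips negative values to 0
print(F.elu(x))           # smooth near 0, identity for positive inputs
print(F.softplus(x))      # smooth approximation of the ReLU
print(F.hardtanh(x))      # linear in the middle, clipped at -1 and 1
```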
And that's why it's also called deep learning. Each of these is basically a neuron, and here we have the inputs and multiple neurons that receive data from the input. What we're doing now is adding a layer in the middle, so we don't just have multiple perceptrons stacked next to each other, but we actually recombine the outputs of these perceptrons. And this structure has a really important property: it is a universal approximator. That means if you have one hidden layer with enough of these units, you can approximate any function.

Now, if we compare that to a kernel method, what's actually the difference? The kernel methods we saw before have a feature map and are otherwise linear models. We have to define this feature map, either explicitly or implicitly by defining a kernel function, but then it's fixed and we can't change it anymore, except for hyperparameters. In a multi-layer perceptron, instead of a fixed feature map, you have — in this case — just this perceptron we've seen, or multiple perceptrons like on the previous slide stacked on top of each other. You have a layer whose output you put into the next layer, and so on. This gives you the possibility to learn the feature map. While in a kernel method you have to pre-define the kernel and with it the representation of your data, here you can learn the representation by adding layers to your network.

Here is just an informal illustration of how this might look. You could have this tanh activation function and two inputs again, with different weights, and if you recombine them you suddenly get this, which looks like a Gaussian function. What you can do is separate the space along different hyperplanes and then recombine them. Even if it's very inefficient, you could in principle carve out each small part of the space and assign it a value by scaling. That would of course not be a very reasonable thing to do, but it's something you can keep in mind when thinking about why one hidden layer is enough to approximate everything.

But of course we normally don't do that. Instead, we add a second hidden layer, and just for fun a third hidden layer. Why do we do that, if I just said one is enough? The thing is, one layer might not be very efficient. Think about it like this: say you have six neurons. You could arrange them as one input, six hidden neurons and one output, or as three layers with two neurons each. The difference is, if you look at all the paths through the network, in the first case you have only six, while in the second case you have eight: two possibilities, times two, times two. And with many more layers and neurons you get an exponential growth of the number of paths with the depth of the network. That means with the same number of neurons, and with that the same number of parameters, you can get a very high expressive power.
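A minimal sketch of such a multi-layer perceptron in PyTorch — the layer sizes are arbitrary choices for illustration; the point is that the hidden layers act as a learned feature map with a linear read-out on top:

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(2, 16), nn.Tanh(),   # first hidden layer: learned features
    nn.Linear(16, 16), nn.Tanh(),  # second hidden layer: recombines them
    nn.Linear(16, 1),              # linear read-out for regression
)
y = mlp(torch.randn(5, 2))         # 5 two-dimensional inputs -> 5 predictions
print(y.shape)                     # torch.Size([5, 1])
```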
There's also an alternative view, the information bottleneck perspective, which I don't want to go too deep into. But basically, each of these layers is a linear model, and if it has some kind of null space, it can remove some information. Layer by layer you can filter out a lot of unnecessary information, which is harder to do in one step in this high-dimensional space with just a single layer. This is only an informal picture; if you want to look into it, be prepared for a lot of information theory and mutual information.

OK, so now let's look at an example of how to train a neural network. This is just a simple multi-layer perceptron: we have the linear transformation in the beginning, then the activation function, the tanh, and that gives us our hidden activations. Then we apply the second linear transformation to get our prediction. In this case we're looking at regression, so we use the squared loss, as you've seen in previous talks. How can we train this? This part is called the forward pass: we take an actual data point x_i from our data set, pass it through the model, and calculate the error, or loss, L_i. If we want to minimize this error, all we have to do is gradient descent on it. So we want the derivative of the loss with respect to our parameters W2 and W1, and we can get it easily by applying the chain rule: take the derivative of the loss with respect to the prediction, then of the prediction with respect to the parameter. And if you want the derivative with respect to W1, you first take the derivative with respect to the hidden activation, and then of the hidden activation with respect to W1. It's just a bit of matrix calculus. If you write it down, you see that these terms reappear from the forward pass — we already calculated them, so we can just reuse them for the gradient, and the same for the activation function. So training a neural network is basically a mix of the chain rule and some dynamic programming, meaning we save the intermediate results. But I would recommend you actually use autograd, because that's the tool made for exactly this task: once you go from something simple like this to a large neural network with millions of parameters, you don't want to do this by hand.

And there's another point here. I started with the perceptron and the biological model it's based on. Of course it is much too simple to model anything that's happening in the brain, and I guess no serious neuroscientist would use it for that. But we're now at a point where you can basically forget about the neuron story: you have some mathematical expression that you can differentiate, that you can apply gradient descent to, and that you can train. That's really all there is to it. And if you think about it that way, you can use any kind of operation where this applies: tensor products, convolutions, whatever you can think of — build your own neural network structure and apply this principle.
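To make the forward pass, squared loss, and gradient descent concrete, here is a hedged toy training loop (made-up data and layer sizes) where autograd does the chain-rule bookkeeping instead of deriving dL/dW1 and dL/dW2 by hand:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.linspace(-3, 3, 200).unsqueeze(1)      # toy inputs
y = torch.sin(x)                                 # toy regression target

for step in range(1000):
    pred = model(x)                              # forward pass
    loss = ((pred - y) ** 2).mean()              # squared loss
    opt.zero_grad()
    loss.backward()                              # chain rule + stored intermediates
    opt.step()                                   # gradient descent update
```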
So this was basically the neural network crash course, and I think that's a good point to take some questions before I move on to the materials part. Any questions? From the online chat, they are asking whether the number of layers should in any way be related to the number of features you want to use. I don't think there's a simple rule for that. That goes back to the tutorial: just try it out on your data. It really depends on your problem and on its complexity. But I would say if you have more dimensions in your data, you can probably afford more layers than if you have just one dimension — it really depends on the case.

Perhaps one more thing, because I said you can use any kind of structure: this kind of neural network training is not a convex optimization problem. With kernels, everything is nicely defined, it's convex, you can solve it analytically. Here that's not the case. So even though you can use basically any analytical expression, if you're not careful you might create something that just falls into a local minimum, and that might not be a good fit for your data. If I go back to this slide: similar to how the feature map in kernel methods makes a problem linear by increasing the dimension of the feature space, we can do the same in a neural network. If we increase the dimension here, we perhaps also make the problem more linear, and we increase the chance that there is a good path to a good optimum in the learning problem. So you shouldn't start with three neurons; use a few more.

OK. This very hand-wavy neural network introduction is now a good starting point to go back to the discussion of how to encode atomic structures, because I just have a fixed number of input neurons, but how do I feed in a system that sometimes has 10 atoms, sometimes 20 atoms, or has periodic boundary conditions? One approach would of course be to take one of the existing descriptors. This is a very old slide, actually. You still have the Coulomb matrix from Matthias, who already talked, the sine matrix, Felix Faber, who gave a talk on the first day, and then the usual suspects like SOAP or the atom-centered symmetry functions. Most of these are used with kernels, but the atom-centered symmetry functions were developed by Behler and Parrinello for an actual neural network approach. A couple of years later there was also the FCHL kernel, for example. So you see, there are a lot of representations you could use.

Let's just have a look at the symmetry functions; we already saw them earlier in the tutorial. We have functions centered on an atom that represent its neighborhood — for example the distances, or also the angles. Then we use these features as input to a local neural network that gives us an energy contribution of this local neighborhood. If we do that for all the atoms, we can take these energies, sum them up, and get the final total energy that we want to predict. And remember, I said you can use any kind of differentiable function to define a neural network.
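As a quick sketch of this descriptor-based scheme — assuming the symmetry-function features are computed elsewhere, and using a single shared atomic network for simplicity rather than one per element — the total energy is just the sum of per-atom network outputs:

```python
import torch
import torch.nn as nn

n_features = 32
atomic_net = nn.Sequential(nn.Linear(n_features, 64), nn.Softplus(), nn.Linear(64, 1))

def total_energy(features):              # features: (n_atoms, n_features)
    e_atoms = atomic_net(features)       # one energy contribution per atom
    return e_atoms.sum()                 # total energy; gradients flow through the sum

E = total_energy(torch.randn(10, n_features))
```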
So in this case, that differentiable function is just the sum here. We don't really need to know these local contributions to the energies; we just need the final energy, and then we do backpropagation through the sum and through all these networks down to the features. So basically you have this first part, where we enter with the charges and the positions and compute all the features, the atom-centered symmetry functions, and then we put this into a neural network to get our prediction.

There's a different approach, and this is what I will spend the rest of the talk on: you build a neural network that directly learns a representation from your raw inputs — in this case the atom positions, or you could also say the distances between atoms. You don't have to define any features beyond these basic things like distances, or perhaps angles if you want. The second part looks the same: the prediction of the energy contributions and the sum.

OK, so how can we do that? Say you have a water molecule and you want to predict the energy contributions from each of its atoms. The first thing you can do is assign a so-called embedding vector to each atom type: one hydrogen gets a vector, the other hydrogen gets the same vector, and the oxygen of course gets a different one. If we did only that and fed it to our output network, E1 and E3 would be the same energy, E2 would be a different energy, and summed up we'd get a total energy. The problem is, if we now move the atoms around, we always get the same energy, because it only depends on the atom types. So the next thing we do is replace the representation of the hydrogen with something that is corrected by the influence of the neighboring atoms: we add to the vector we already have some function of the oxygen representation and the distance between the two atoms, and we do the same between the two hydrogens, and then we get a new representation for this hydrogen. We can do the same for the other hydrogen and the oxygen. Now each atom knows implicitly about the distances in the vector that represents it, which means we have something like a pairwise potential. The next step is of course to repeat this: we do the same thing again, only now the neighboring atoms already know about the distances, so we get higher-order correlations between the atoms. We can do this multiple times, then predict an energy contribution for each atom, sum it up, and train as before. If we repeat this enough times, we can get a complete description of the structure.

And this is what we did back in 2017, when we published what we called the deep tensor neural network. We start with this kind of atom embedding for each atom type that I just showed. Then we add a correction v_ij for each pair of atoms i and j, summing over all neighbors j. And this interaction we modeled like this: v_ij is this kind of tensor layer, where the representation interacts with the distances, and then we also have these linear parts.
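As a rough aside, the embedding-plus-correction scheme described a moment ago might look like this simplified sketch (a toy illustration, not the actual DTNN equations): an embedding per atom type, repeatedly corrected by a learned function of the neighbor representations and distances.

```python
import torch
import torch.nn as nn

n_types, dim = 10, 64
embedding = nn.Embedding(n_types, dim)
interaction = nn.Sequential(nn.Linear(dim + 1, dim), nn.Tanh(), nn.Linear(dim, dim))

def update(x, d):                 # x: (n_atoms, dim), d: (n_atoms, n_atoms) distances
    n = x.shape[0]
    out = []
    for i in range(n):
        corr = x[i]
        for j in range(n):
            if i != j:            # sum corrections v_ij over all neighbors j
                corr = corr + interaction(torch.cat([x[j], d[i, j].view(1)]))
        out.append(corr)
    return torch.stack(out)

Z = torch.tensor([8, 1, 1])                       # water: O, H, H
pos = torch.tensor([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
d = torch.cdist(pos, pos)
x = embedding(Z)
for _ in range(3):                                # several interaction passes
    x = update(x, d)
```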
And then of course the bias, which we never care about. One nice thing is that you can approximate this kind of tensor product — especially this part — by first projecting both the representation and the distance into some kind of factor space, doing an element-wise product, and then projecting it back into the feature space. This factorization was inspired by a machine learning paper by Sutskever, Martens and Hinton, 'Generating Text with Recurrent Neural Networks'. And the idea of these interaction corrections was actually inspired by a paper by Scarselli et al., 'The Graph Neural Network Model', from around 2008. That was one of the early graph neural network papers which, I think, was not that famous at the time, but then got cited a lot once the whole graph neural network field picked up, which was only in recent years.

Shortly after our paper, there was the message passing neural networks paper by Gilmer et al. about using graph neural networks to predict quantum chemical properties. They developed this message passing scheme, which is quite similar to what I just showed you: you sum over the neighborhood of a node in the graph, and you have a message function depending on the two nodes and the edge — in our case I just plugged the distance in there — and then there is a so-called update function, which takes the previous node representation and the message and gives you a new node representation. So this is a general formulation of what I just showed you. They actually also had a reference to the deep tensor neural network, showing that if you set the message function like this — you don't have to read it, it was on the previous slide — and the update function like that, the deep tensor neural network can be seen as a message passing neural network. And one important point here: since we are using the distances, we are automatically invariant to rotation, and of course also to translation of the molecule.

Again shortly after this, we did another iteration on the deep tensor neural network, and it was basically just a minor change — it helped a lot in terms of prediction accuracy, but it was just a small change in how we modeled the interaction, or the message, depending on which language you want to use. And it turned out that with this small change you can view the interaction as a convolution. You have probably seen convolutions in neural networks, where you have a convolution kernel or filter that you move across an image, for example. Now you can think of a molecule as an image, with some filter that you see in the background, and you shift the filter over the structure and evaluate it at the positions of the atoms. The problem is, of course, that if you have a discrete filter like in a convolutional neural network — say a three-by-three filter matrix — and you move the atoms, you get a very rough energy prediction, because you have this kind of discretization error. The values will jump each time an atom ends up in a different pixel. Compared to an image, you don't have this grid structure.
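As a side note, the generic message-passing template just described could be sketched like this (a toy illustration with a message function M and an update function U; the deep tensor neural network corresponds to one particular choice of M and U):

```python
import torch
import torch.nn as nn

dim = 64
M = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.Softplus())   # message function
U = nn.Sequential(nn.Linear(2 * dim, dim))                      # update function

def mp_step(h, d):                       # h: (n, dim) node features, d: (n, n) distances
    n = h.shape[0]
    new_h = []
    for i in range(n):
        # sum messages from all neighbors j of node i
        m_i = sum((M(torch.cat([h[i], h[j], d[i, j].view(1)]))
                   for j in range(n) if j != i), torch.zeros(dim))
        new_h.append(U(torch.cat([h[i], m_i])))   # update node representation
    return torch.stack(new_h)
```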
So the idea is now: instead of having this discrete parameter tensor, we replace it with a neural network again, so we can predict smooth, continuous filters and also get a continuous prediction of the energy. That's basically just a variation of what we did in the deep tensor neural network. It's also a nice view on these things, because it's not just telling you that one atom sends a message to another atom; it also tells you something about the space around the atoms. It's more like a potential, actually.

And since we're at a materials workshop, I also want to say something about periodic boundary conditions. We're summing over neighbors, and if we do that while respecting the periodic boundary conditions, we can predict materials as well. Here you see, for example, different radial filters that I got from one of the first SchNets I trained — I think on the Materials Project data — with different periodic boundary conditions. The periodic boundary conditions are reflected in the convolution filters.

OK, so this is how the SchNet architecture looks. First we have the embedding, which only depends on the atom types. Then we have these interaction layers, and finally the output network, just as before, and just as you would do it with the atom-centered symmetry functions. Each of these interactions is a correction block — it's also called a residual structure, from the ResNet architecture — where you add a correction that consists of atom-wise layers, which are just linear layers applied to each atom separately, and then the convolution I just showed you. To get the filter, we have a filter-generating network; it looks like this in the case of SchNet, but it could be any neural network. And of course, at some point you need a non-linearity to get non-linear features while building up the representation.

OK, perhaps that's a good time to ask again whether there are questions, before I move on to equivariant networks. OK, then I'll just move on; if questions show up in the chat, let me know.

So I just told you that in the message passing we get rotation invariance when we use distances directly to characterize the edges. But this is not ideal in the usual case, because we want local representations so that we scale linearly with the number of atoms. If we make the system bigger, we don't want the number of distances to explode quadratically, or worse; we want linear scaling, and that's why we introduce a cutoff on the distances. But if we do that, the local environments might have a higher symmetry than the whole system. For example, here you have the blue node and the red node, and some chosen cutoff distance, as marked — can you see that? I think you can. If you look at the blue node, it has the same distances to its neighboring nodes as in this other structure, and the same holds for the red one. That's because the cutoff is too small to also capture the distance between the two white nodes. And let's just call them atoms.
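A minimal sketch of such a continuous-filter convolution (hypothetical layer sizes, dense double loop instead of a proper neighbor list): a small filter-generating network maps each distance to a filter vector that modulates the neighbor features, wrapped in a residual correction block.

```python
import torch
import torch.nn as nn

dim = 64
filter_net = nn.Sequential(nn.Linear(1, dim), nn.Softplus(), nn.Linear(dim, dim))
atomwise_in = nn.Linear(dim, dim)
atomwise_out = nn.Sequential(nn.Linear(dim, dim), nn.Softplus(), nn.Linear(dim, dim))

def cfconv(x, d, cutoff=5.0):            # x: (n, dim) features, d: (n, n) distances
    n = x.shape[0]
    y = atomwise_in(x)
    out = []
    for i in range(n):
        acc = torch.zeros(dim)
        for j in range(n):
            if i != j and d[i, j] < cutoff:
                W = filter_net(d[i, j].view(1))   # continuous filter from the distance
                acc = acc + W * y[j]
        out.append(acc)
    return x + atomwise_out(torch.stack(out))     # residual "correction" block
```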
It's not a realistic molecule, but let's call them that. Between those two outer atoms, due to the small cutoff, this distance is just not represented. That means that for the network with this kind of cutoff, both structures look the same, and if they have different properties that you want to predict, that's just not possible with this architecture. What you could do is increase the cutoff, but you don't want that in many cases. One reason is the computational demand. The other reason is that if you increase the cutoff, you have a larger space that you have to model with your local networks, and that might hurt generalization: if I have a small local environment and can partition my molecule into small regions, it's much easier to learn.

So in order to still distinguish these two structures, we need to retain some additional directional information, because that's where the two structures differ. What we proposed to solve this is what we called rotationally equivariant message passing. The message function is now more general: we have the scalar node representations s_i and s_j that we had before, but we also have vectorial representations v_i and v_j for each node, and we take the direction r_ij — the vector pointing from one atom to the neighboring atom, not just the distance. Now, for that to be useful, we have to ensure that our message function is rotationally equivariant. That means if I rotate the input — and this refers especially to the edge vector r_ij and to v_i and v_j — then the output of the message function should rotate in the same way. If you look at it, what this essentially means is a linearity constraint on the directional information. And if you think about it that way, you can easily find a number of equivariant building blocks that you can use to construct your message function. On scalars, you can use any kind of non-linear function — basically any neural network — since scalars won't change the directional information. You can scale vectors however you want. You can take linear combinations of equivariant vectors, so if you have multiple feature channels, you can recombine them linearly. You can take vector products, or scalar products to get from the vector representation to a scalar representation. And since you can only scale a vector, but can apply any non-linear function to a scalar, something that directly follows is that you apply your non-linearities to the scalars and then use them as a so-called gating non-linearity, by rescaling the vectors with the result.

And here is our equivariant variant, which we call PaiNN, the polarizable atom interaction neural network. I will come to the 'polarizable' part a bit later in the talk — I can spoil that one reason was to get a nice acronym, but there's also another reason. So we have this message passing function again, and I don't want to go too deep into the details, but what's important is that if you look at the interactions, they are again convolutions. For the scalar features, it's exactly what we had before.
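To illustrate these building blocks, here is a small aside: a sketch of a gated non-linearity (shapes and sizes are arbitrary), plus a numerical check that rotating the vector features rotates the output in the same way.

```python
import torch
import torch.nn as nn

dim = 16
gate_net = nn.Sequential(nn.Linear(dim, dim), nn.Softplus())

def gated_nonlinearity(s, v):
    # s: (n_atoms, dim) scalar features, v: (n_atoms, dim, 3) vector features
    gate = gate_net(s)                        # non-linear, rotation invariant
    return gate.unsqueeze(-1) * v             # vectors are only rescaled -> equivariant

# check: applying an orthogonal matrix before or after gives the same result
s, v = torch.randn(4, dim), torch.randn(4, dim, 3)
R, _ = torch.linalg.qr(torch.randn(3, 3))     # a (numerically) orthogonal matrix
out1 = gated_nonlinearity(s, v @ R.T)
out2 = gated_nonlinearity(s, v) @ R.T
print(torch.allclose(out1, out2, atol=1e-5))  # True, up to floating-point error
```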
So that part is identical to SchNet, but now for the vectorial features we get these two parts here. The first part is a non-linearity on the scalars multiplied by the vectorial feature — so this is the gating non-linearity — and we convolve this with a rotation-invariant filter. In the second part, we take a scalar node representation and convolve it with a rotationally equivariant filter. This is something you can also find under the keyword steerable convolutions; there's work on that for ordinary convolutional neural networks as well. You can obtain an equivariant filter by taking an invariant function and taking its derivative with respect to the input, r_ij in our case. And we don't really have to use a derivative here; we can just use a neural network to model the derivative and still have an equivariant function. So we have a radial part, and the directional part is just the normalized direction vector.

Now, how does this help with the problem of representing these systems? Here's an example system: a molecule with two rings that we rotate against each other. We used three kinds of networks: SchNet; DimeNet, which stands for directional message passing, I think — they use angles as part of the network to encode some directional information; and PaiNN, where we use this equivariant message passing. We then use different cutoff distances and model the energy profile when rotating the two rings against each other. With a cutoff of four angstrom, we get more or less the correct energy prediction with all networks. Then we reduce it to three angstrom, and you see that SchNet is failing, because we no longer get all the distances within the cutoff and we can't propagate the directional information: once we form the representation of an atom, we have collapsed everything into scalars, so the directional information is gone and can't be passed on in the next interaction pass. And at 2.5 angstrom, even DimeNet fails to capture this, while PaiNN is still predicting the energy profile correctly.

Why might this happen? Here's one guess: if you're using angles, the two structures I showed you before still have the same representation, because the angles are the same. To resolve this directly, you would need something like dihedral angles. But if you use directional information, like these vectors here, it makes a difference, because you can pass this information on by projecting the vectorial representations from the red atom onto the blue atom, and you get different results — a different representation. That means you can resolve these two structures correctly.

OK — actually, what's the time? Oh, OK, good. I was worried my talk would be too short, so I added something at the end; I will probably leave that out. OK, so what we can also do now is predict tensorial properties, because now that we have vectorial representations of the atoms, we can express a tensor as a combination of rank-one tensors, like this.
So you have the scalar part and then a tensor product here. We can use this standard form together with our vectorial representations to predict tensorial properties. For example the dipole moment: you can predict it as a function of the vectorial representation here, plus a scalar times the atom position r_i. You can see this as something like a local dipole plus the contribution of the charges to the dipole, and that's how you can build up this dipole layer. And again, since it's just part of the neural network, you plug this dipole layer in and optimize the whole thing with gradient descent and backpropagation. Here is something similar for the polarizability tensor, where we have the scalar component, and then vectors times the atom positions — and to make it symmetric, we also have it the other way around.

Using that, once we've learned these properties, we can run molecular dynamics simulations. For example, this is a ring-polymer molecular dynamics simulation with 64 beads for aspirin. This simulation would have taken 25 years with DFT, and we could do it in one hour. Of course, the relation is what matters — you can always run longer or shorter depending on what you need. And since we have the dipole moment and the polarizability tensor, we can then calculate infrared and Raman spectra.

One further side note, which is basically a hint at a future talk of this conference, because I'm sure you will hear more about it: this vectorial convolution in PaiNN is actually a special case, because you can see the vectors as rank-one tensors, and you might want to go to higher-order tensors. That means instead of this linearity constraint, we go to polynomials. An elegant way to do that is to use irreducible representations and Clebsch-Gordan products. There you say: the representation of my atom, call it X, of order l, is convolved with some filter W. You can express this as a sum over l and m, where you have a spherical harmonic that represents your direction r_ij for order l and component m, and a radial filter. And to be able to combine these spherical harmonics — the representation is a spherical harmonic and this filter part is a spherical harmonic — you use the Clebsch-Gordan coefficients. Again, this is a kind of convolution. This is an approach that was used, I think, first in tensor field networks and in Cormorant, and I guess you will hear more about this in relation to the NequIP network.

OK, perhaps any questions at this point, before I go to the applications part? The chat asks if I can share some experience of how we came up with such a complex architecture as PaiNN — so the thinking process behind the structure — and also whether the model works for long-range interactions as well. OK, I think this actually looks more complicated than it is. This part is already in SchNet, and this part is really just a logical extension to go to equivariance, to vectors in this case.
And yeah, if you start painting this diagram, it looks a bit more convoluted, but it's really much inspired by how dipole-dipole or dipole-charge interactions would work. In terms of long range: of course, when I apply a cutoff, this doesn't work. When we want long-range interactions, we would use a separate term. For example, you could predict partial charges for each atom and then use an electrostatic interaction as a correction. You can do similar things for dispersion; there are many-body dispersion implementations, for example pyMBD, that you can use, and we also have our own in our code.

I have a slightly more abstract question, rather than something particular about that. There is an argument — it's not mine, but I'm a little sympathetic to it — that any neural network, no matter how complex, is linear in the end, and that all the work before is shifted into finding a suitable representation inside some machinery; wouldn't it be smarter to just come up with such a representation first? How do you feel about that — what's the advantage of a neural network compared to anything else?

So basically you're saying I'm too lazy to come up with a good representation, right? Well, I get your point. For the geometry part, maybe, but I think there's one major thing: if I want to model a really large chemical space, I need a good representation for the atom types and how they interact. What I don't want to do is treat each atom type as orthogonal and then have terms for each pair of atom types and so on — that is, for example, a major drawback of the Behler-Parrinello-type networks. Here, once you have this embedding, it is much easier: you can compress it into a shared space with the geometry. And it's not straightforward how you would model that in a kernel or in a fixed feature space. But I guess we can debate this later.

OK, so there is this FieldSchNet approach — you can see it here; probably the people on Zoom can see it. This was with my co-worker Michael Gastegger and with Klaus-Robert Müller, and it actually predates the PaiNN network. It was meant as an extension of SchNet to include external fields. The idea is that a lot of properties, like the dipole moment or the polarizability, are response properties. So if you can design a neural network that depends on an external field, you can directly get these properties by taking the derivative with respect to the field using autograd. Here, the left side is just SchNet, and on the right side we have the FieldSchNet extension. From the scalar representations we construct something like local dipoles, which is very similar to what we later did in PaiNN — basically a vectorial representation that we build up here. Then we use dipole-dipole interactions between these features, and a dipole-field interaction with an external field, which is another input to the network, to modify our scalar representations. And again, we do multiple of these interactions.
And in the end, we just use a neural network, as before, to predict the energy. Then we can take the derivative with respect to the atom positions to get the forces, as we did before, but we can also take the first and second derivatives with respect to the field to get the dipole moment and the polarizability tensor. Or you can take other derivatives, for example with respect to a magnetic field or the nuclear moments to get chemical shifts. That's also why we later named PaiNN 'polarizable': if you look at this picture, it's a visualization of the representation surrounding a molecule, and once we apply a field, you see how it gets polarized and the symmetry of the representation is broken.

Using that, we can again simulate various spectra — infrared and Raman as before, and we can also do NMR spectra. But the real advantage is that, when we want these response properties, we can input a zero field and just take the gradient, but we can also apply an actual field and model the energy under the influence of this field. For example, with a solvent model we can model the molecule in the solvent. Here, the dashed line is the spectrum in vacuum; in blue you see the machine learning model in a polarizable continuum solvent, where the machine learning model interacts with the continuum model through this field input. But you can also do what we call ML/MM: in analogy to QM/MM, you have a molecular mechanics force field that models the solvent, and the machine learning model, trained on a quantum chemistry method, basically replaces the quantum mechanics and uses the field input to interact with the molecular dynamics solvent surrounding the molecule. In red you see how this changes the spectrum: for example, these peaks get shifted to the left and broadened. Here we used the CHARMM general force field for the solvent.

Finally, I want to show one application to a Claisen rearrangement reaction. If we look at this reaction and sample along the reaction coordinate, you see that in vacuum the machine learning model nicely follows the quantum mechanics reference. Then we can do umbrella sampling in vacuum, but we can also do umbrella sampling with the ML/MM model, in this case with a water solvent, and you see how this reduces the reaction barrier. And this requires less computation time than explicitly training on data for the whole solvent. And one idea that's pretty nice here: we have a differentiable model, so why not reduce the barrier by taking the derivative with respect to the field, basically finding an optimal charge environment that lowers this barrier? We started with a barrier of around 30 kcal/mol, and with an optimal field — which looks something like this, the molecule in the middle and positive and negative charges surrounding it — we can bring it down to around 10 kcal/mol. The problem is, of course, how would you actually create such a field? We haven't tackled that completely yet, but one thing you could do is attach these kinds of charged molecules in the surroundings of the reaction.
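Coming back to the response properties mentioned above, here is a rough sketch of the general mechanism (a made-up toy energy model, not FieldSchNet itself): a field-dependent energy is differentiated with autograd to obtain a dipole-like first derivative and a polarizability-like second derivative; sign conventions may differ from those used in the talk.

```python
import torch

# Toy stand-in for a trained field-dependent energy model E(positions, field).
def energy(pos, field):
    d = torch.cdist(pos, pos)
    return (d ** 2).sum() + (pos.sum(dim=0) * field).sum() + 0.1 * (field ** 2).sum()

pos = torch.randn(3, 3)
field = torch.zeros(3, requires_grad=True)        # evaluate the response at zero field

E = energy(pos, field)
# dipole-like quantity: (negative) first derivative of the energy w.r.t. the field
mu = -torch.autograd.grad(E, field, create_graph=True)[0]
# polarizability-like quantity: derivative of mu w.r.t. the field, row by row
alpha = torch.stack([
    torch.autograd.grad(mu[i], field, retain_graph=True)[0] for i in range(3)
])
print(mu, alpha)
```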
And if you do that — this is just done by hand at the moment — you still get a barrier of around 20 kcal/mol in our predictions. What would be nicer, of course, is to directly generate this surrounding, so you don't have to place it by hand or ask your favorite chemist to do it for you — because in the end, I'm just a computer scientist, I don't know about this stuff; Michael Gastegger did all the chemistry here. For that, what we're currently working on are autoregressive neural networks. I don't really want to go into detail, but the idea is that you want to generate a structure: we have an unfinished molecule where some atoms are already placed, and we want to predict the position of the next atom. If you think about it, what we actually want to predict is some kind of probability distribution, and that, again, is just another target for our neural network — it's basically a regression with some normalization, because we have to get a probability. I don't want to go further into this, because I wouldn't finish in time, but you see, you can do all kinds of nice things.

So, my conclusions. Neural networks are great for learning representations, because we are too lazy to really construct them by hand. Unfortunately it's still a lot of work to implement them, come up with architectures and so on — it's a different approach, but you don't get rid of the work. We can become more data efficient if we encode more of what we know about the problem, for example equivariances, but I guess also in terms of different atom types we might profit from that; a different example was the influence of the field. You can accelerate molecular dynamics simulations, model reactions in solution, and do inverse design. And beyond potentials, with some slight modifications you can get an autoregressive model with which you can actually build structures and condition them on properties — say, you want a structure with a low energy and a certain HOMO-LUMO gap or band gap. As a last point, I want to point to this perspective we've written with Julia Westermayr, who is giving a talk later in the week, and Reinhard Maurer from Warwick, where we make the point that there are a lot of steps in the workflow of materials science and quantum and computational chemistry that you can tackle with machine learning. We list a lot of approaches that already exist and hint at some future possibilities — if you're interested in where you can use machine learning in your workflow, have a look at that. With that, thank you. We have some software, it's called SchNetPack, if you want to start coding, and there's also this book with a lot of contributions, also from some of the people who are here. Thank you.

Thank you very much for this very nice and broad overview of how we can apply neural networks to study problems in materials and chemistry. Is there any question? Hey, thanks for the talk. It's more of a question about how you came to design the neural networks. You started off with deep tensor neural networks, which were message-passing inspired, or doing message passing, then you moved on to SchNet, and then you built on top of SchNet for quite a few pieces of work after that.
So what was the difference between the deep tensor neural network and SchNet, and what was the basic idea? It's actually more of a technicality — it's really this interaction function. Oh, I'm too far, sorry, where is it? Yeah, okay. So here I started with this tensor construction, and everything was wrapped in this tanh. And if you do that, you can't write it as a convolution, because you have this non-linearity inside. It also turned out not to be a good idea in terms of prediction accuracy. So that's one of the differences. Then we also changed the type of non-linearity: here it's a tanh, in SchNet we're using a softplus non-linearity. And I think the deep tensor neural network also didn't have a cutoff back then, because we had these small molecules, but we found that even with small molecules, having a cutoff works better, for the reason I mentioned: the smaller you make your local environments, the easier it is to learn them and the better you generalize. But then, of course, you have a trade-off with the propagation of directional information and long-range effects.

And in your experience, do the representations learn better with SchNet? Yes — although at this point I would actually use PaiNN. The PaiNN network is about three times as data efficient: you need roughly three times more data with SchNet to get the same accuracy as PaiNN, from what we're seeing on our data sets. So this equivariance really helps to generalize better.

Thank you for the talk. I have a question about the interaction layer: how much does it add to the computational time of fitting the neural network? You mean each one? What I can tell you is that with PaiNN, for an aspirin molecule, the whole network takes about four milliseconds, I think. I didn't measure what each interaction layer takes on its own; this is with three interaction layers in PaiNN. I don't think you'll get into the realm of microseconds, but for molecules you're in the millisecond regime; if you go to materials it's a bit more, of course — it basically scales with the number of atoms. It scales linearly, or? Yes, the network itself scales linearly, but then you have to collect the neighbors, and that scales like your neighbor list. You can use any kind of neighbor-list implementation, and that is basically the main part of the scaling.

And if I can ask one more question: how generalizable is SchNet, or PaiNN, in general, compared to other non-end-to-end neural networks? Is it more specialized to the system, or more general? For instance, if I were to do some transfer learning from one system to another that is not really the same. OK — I mean, in principle the representation has the expressive power to transfer; the thing is, of course, if your initial data set is very specific, that could be hard. So this is really data dependent. Thank you.

One last question I wanted to ask is about uncertainty: what ways would you suggest to get uncertainty on the predictions, as we have seen for GPRs and kernel methods?
Right, so what we're doing, which works quite well in many cases, is using ensembles of these networks. We're using this in a project on crystal structure prediction, and that works pretty well; we also have a project with molecules and surfaces where this works very well. I've heard of cases from other groups where, I think for certain reactions, this doesn't work, but that's then a general problem that doesn't just relate to SchNet but also to other methods. There are also other uncertainty approaches one could use, for example dropout ideas, where you basically get an ensemble by using dropout, which probabilistically switches neurons on and off. So there are a couple of approaches that we haven't really studied extensively yet and that would be interesting to look into. Thank you very much, and thanks again for this fantastic talk. Thank you. So we see each other tomorrow at