So let me first thank you, Marta, and the Royal Society for this very impressive award, and Andrew Black and Guillermo Sapiro, who are also at the origin of it. And thank you to all of you, my family and friends, for being here to hear this lecture.

What I am going to speak about is AI, which, as you have all seen, has been revolutionized by deep neural networks. They are invading our lives, the latest avatar being ChatGPT. And strangely enough, although the algorithms are very well mastered, we do not really understand why they work so well. That will be the topic of the lecture: to try to approach this mystery, and to show that there are very deep connections with physics. That will be a central theme of the lecture.

So what kind of problems are these machines attacking? What is AI? There are essentially two types of problems. On the one hand, you have data. What is data? It can be an audio recording, a text with millions of letters, an image with, again, millions of pixels, videos, physical measurements, molecules. What you would like to do is understand the properties of these data, and defining the properties of the data amounts to defining a probability distribution which describes the relations between its different components. Learning such a model is one of the key elements of this domain. The applications include data generation. In fact, the image that you see here is not a photograph; it has been generated by a neural network. The same for these videos over there: they are not natural videos, they have been generated, in fact from a few sentences. For example, "a teddy bear running in New York" led to this video. And there are plenty of other applications, for example recovering better-quality medical images, suppressing noise in data, and so on.

The second kind of problem in this domain is classification or regression, which means that from a data point you would like to get an answer, which I will call y, to a question. For example, image classification: you have images here, each image corresponds to an instance of x, and the value y would be, for example, "this is a car", "this is a mushroom"; the other images correspond to cherries, a dog, a Madagascar cat. That is a problem of classification. A regression problem would rather associate to a data point x a real value y, for example: compute the quantum energy of this molecule. These kinds of tasks can be achieved nowadays with neural networks.

Now, the big surprise is that, a priori, these tasks are incredibly difficult to solve. Why? The problem consists of associating a value y to a data point x, that is, computing a function of x. In learning, in data science, you begin with examples: you have examples x_i and the value of the function for each x_i, which is y_i. Now you are given a new image, a new x, and you would like to know its class, the value y. The first idea that comes to mind is to take this x, look at all its neighbors among the training examples, for which you know the values, and estimate the function by averaging the values in the neighborhood. That works quite well in general, but not in this situation, because x has many variables; in other words, it lives in a very high-dimensional space, and you have no chance of having a training example close to x.
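To make the local-averaging idea concrete, here is a minimal sketch in Python (the data, the target function and the radius are arbitrary choices of mine, purely for illustration): it estimates y at a new point by averaging the labels of nearby training points, and shows how the neighborhood empties out as the dimension grows.

```python
import numpy as np

def local_average(x_new, X_train, y_train, radius=0.1):
    """Estimate y at x_new by averaging labels of training points within `radius`."""
    dist = np.linalg.norm(X_train - x_new, axis=1)
    neighbors = dist < radius
    if not neighbors.any():
        return None  # no training point nearby: the estimator fails
    return y_train[neighbors].mean()

rng = np.random.default_rng(0)
n = 1000                                # number of training examples
for d in (2, 10, 100):                  # dimension of the data
    X = rng.uniform(0, 1, size=(n, d))
    y = X.sum(axis=1)                   # some arbitrary target function
    x_new = rng.uniform(0, 1, size=d)
    print(d, local_average(x_new, X, y))  # works in d = 2, returns None in high dimension
```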
To understand this difficulty raised by the high dimension of the space, consider the interval from 0 to 1 in dimension d, that is, the unit cube. For example, in dimension 2, that gives the square over there on the right. Suppose you want to ensure that you always have a training example at a distance of at most 1/10. How many examples do you need? The number of examples will be 10 to the power d. In dimension 2, that is 100. Now, if d is 80, that is 10 to the power 80, more than the number of atoms in the universe. It is simply impossible. This curse of dimensionality means that you have an explosion of possibilities, and in order to learn, you need somewhere to reduce the dimensionality of the problem. In other words, you need to realize that within x there is not so much information that is really crucial for the classification, for the task, and that is what is very difficult to discover.

Now, what is a neuron? A neuron is a very simple computational unit. You take the inputs, the different values of your data, and you weight them with different weights w_1, ..., w_k. You are, in that way, making a kind of vote over the data with the different weights, which is this linear combination. If this value is above a threshold b, the neuron outputs it; if it is below, the neuron outputs 0. That nonlinearity is called a rectifier. So that is this very simple unit. This simple unit, you put it within a network. The input data x is fed into a full layer of neurons, which is over there, which themselves feed the next layer of neurons, and the number of layers can grow up to several hundred. At the output, you get an estimation of what you think is the right answer to the question.

Now, the field began to explode the day people started using more structured neural networks. In particular this architecture, introduced by Yann LeCun and called a convolutional network, is the basic architecture used on data such as images; in fact, it has applications in almost all fields. I am going to show what it does for an image. The input, which you have here on the left, is the image, your data. The weights of each neuron look at a small part of the image, shown as the small square over there. Because you do not know where the object is, the weights are identical all over the image. This means that the weights, if you put them within a big matrix, define what is called a convolution operator. This is applied, and then all the coefficients below zero are set to zero by the rectifier, which is over here. Then you again apply a set of neurons; that means you again do a transformation with a convolution operator, W_2, and again a rectification. You cascade these transformations up to the output.

Now comes the learning phase. How do you learn? What you want is that the output values are equal to the true values that you would like to have, which are provided in the training data. To do that, you measure the error. The error is defined by what is called a loss, which depends upon the parameters of the network. What are the parameters? They are the weights of the different neurons. And the number of weights can go to millions, billions, even trillions in the latest neural networks.
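As a toy illustration of the two ingredients just described, a rectified neuron and a convolution applying the same weights at every position, here is a minimal numpy sketch (single channel, hypothetical filter sizes, nothing like a production implementation):

```python
import numpy as np

def rectifier(z):
    """Output the value when it is above zero, and zero otherwise (ReLU)."""
    return np.maximum(z, 0.0)

def neuron(x, w, b):
    """A single neuron: a weighted vote over the inputs, shifted by a threshold b, then rectified."""
    return rectifier(np.dot(w, x) - b)

def conv_layer(image, kernel, b=0.0):
    """The same weights applied at every position of the image (a convolution), then the rectifier."""
    h, w = kernel.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(kernel * image[i:i + h, j:j + w]) - b
    return rectifier(out)

rng = np.random.default_rng(0)
image = rng.random((32, 32))                                      # stands in for an input image x
v = neuron(image.ravel(), rng.standard_normal(32 * 32), b=0.0)    # one neuron looking at the whole image
W1, W2 = rng.standard_normal((5, 5)), rng.standard_normal((3, 3))
output = conv_layer(conv_layer(image, W1), W2)                    # two-layer cascade: convolve, rectify, convolve, rectify
print(v, output.shape)
```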
That number of weights is absolutely huge. And what you are going to do is optimize these weights so that the network gives you the right answer. That is the learning phase: it tries to minimize the error. How can you do it? Imagine that the loss, which is here, is a simple convex function like this one. You begin with a value over there, and the algorithm, called gradient descent, moves the parameters progressively in the direction of the derivative, which is given by this equation. If you do so, it is like a ball which rolls down to the minimum of the loss, and you find the best weights, which achieve the minimum error. The problem is that in a network, this is more what the loss landscape looks like: you have plenty of what are called local minima. So a priori, your ball is going to be trapped somewhere here and not reach the right position; if you begin from here, it is going to be trapped here. So the question is: how come neural networks learn so well despite the non-convexity?

The even more impressive question is the fact that these networks give you impressive results over a very wide range of fields: image recognition, audio and speech recognition, scientific computation, where these networks can predict the evolution of differential equations, medical diagnostics, fault detection, even generation of images, music, physical data, generation of text, programming, doing mathematical proofs more recently with ChatGPT. The first question is: how is this possible? You have this curse of dimensionality, so these networks must be able to find a way around it. The second surprise is that the architectures of all these networks, which solve all these kinds of tasks, are quite similar. That means that all these problems share some very strong properties, so that they can be solved with the same kind of algorithm. These will be the two key questions I will try to address.

Now, you can begin to look at your network and look at the weights. What is observed, if you look at a network trained for image classification, is that the weights on the first layer look like the ones I am showing at the bottom; this was measured from an actual neural network. They look like small oscillating waves. That is just the first layer, in other words W_1; the other layers are much more complicated to look at because the weights look random, and I will come back to that. So why are these kinds of wavelets observed? The interesting thing is that if you look back at neurophysiological models, dating back to the 1960s, at the visual system within the brain, in the back of the brain you have the first visual area, which is called V1. Hubel and Wiesel discovered that within this area you observe filters, called simple cells, which have very similar responses, shown here. This comes from a neurophysiological publication, and it is very similar to what you observe in the first layer of a neural network. But then, when you move towards V2, V4, IT, things get much more nonlinear, much more complicated. People have studied that; in particular, here is a much more recent result from the team of DiCarlo in 2018. What they observe is that if you compare a neural network with what is done by the visual brain, there are some very strong correspondences.
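The gradient-descent update described above can be written in a few lines. Here is a minimal sketch on a toy convex loss (the quadratic and the step size are illustrative choices of mine, not anything from the lecture):

```python
import numpy as np

def loss(w):
    """A toy convex loss with its minimum at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def grad(w):
    """Gradient (derivative) of the toy loss."""
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

w = np.array([5.0, 5.0])   # initial weights
eta = 0.1                  # learning rate
for step in range(100):
    w = w - eta * grad(w)  # move in the direction opposite to the derivative

print(w, loss(w))          # w is close to (1, -2) and the loss is close to 0
```

On a convex loss this converges to the minimum; on the non-convex landscape of a real network the same update can get trapped in a local minimum, which is exactly the puzzle raised above.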
In both the network and the visual brain, as I mentioned, in the first layer you observe these wavelets, which are shown here, and that corresponds to this zone. But what they also observed is that the next layers have responses which can predict the population responses of the neurons in the next areas, V4 and IT. So there seem to be some very strong similarities. And the next question is, of course, why? How come a neural network, which deep down has not much to do with biology, suddenly gives you the same kind of answer as what seems to appear in the brain? And why are these wavelets coming in?

So, wavelets. These wavelets have been studied since the 1980s, and in particular we understand their mathematical properties quite well. Why are they useful? Wavelets are used to separate phenomena across scales. If you have an image like the x here on the left, you can split it into an image at a larger scale, where you remove details, and a set of details that you see here, which correspond to the wavelet coefficients. Essentially, they give you information about the edges, where the image varies very quickly. Then if you take this coarser image and split it again, you again get an image at a lower resolution and the details that were erased when going from the small image to the bigger one. You can, in that way, decompose the information by separating the phenomena that appear at the different scales, down to the last one. As was mentioned, this has been used for image compression, because the wavelet coefficients, which are here, have very few non-zero values; they essentially correspond to the edges. But here we are not interested in data compression. Why would that kind of wavelet be of any use if what you are really interested in is analyzing the information within the data?

To get a key to understanding that, I am going to move towards physics and look at the kind of things that are done in physics. In the talk, I am going to move from the problem of finding models of data, and I am going to specialize it to the case of physics, because that will allow us to understand a little bit more the mathematical properties behind it, and then we will move back towards the classification of images. As I said, there are two problems we need to solve. One, we need to understand why we can reduce the dimensionality, the complexity, of the problem. Two, we need to understand why we can make the problem more or less convex so that the optimization converges. And there will be two key ideas. The first one: you need to separate information at different scales, which is what appears in these architectures. But the most difficult thing, as we will see, is to compute, or define, the interactions across scales. We will see why this is fundamental in physics and why it carries over to many other fields.

OK, so let me begin from the physics point of view. The field of statistical physics is about taking the fundamental properties at the microscopic scale, the interactions between particles, and trying to understand how, from these, you can infer properties of the world at the macroscopic scale: properties of materials, and so on. Now, x corresponds to a physical field; you can think of it as an image. The energy of the field is, in fact, what defines the probability of observing the field in this state.
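Before moving on, to make the scale separation described a moment ago concrete, here is a minimal one-dimensional sketch using the simplest wavelet, the Haar wavelet (the toy signal is my own): one level of the transform splits a signal into a coarse approximation (local averages) and details (local differences), which are large only near sharp variations, and the split is exactly invertible, so the details let you go back.

```python
import numpy as np

def haar_split(x):
    """One level of the Haar wavelet transform:
    coarse = local averages (the signal at the next, larger scale),
    detail = local differences (wavelet coefficients, large only near sharp variations)."""
    even, odd = x[0::2], x[1::2]
    coarse = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return coarse, detail

def haar_merge(coarse, detail):
    """Inverse of haar_split: recover the finer-scale signal exactly."""
    x = np.empty(2 * coarse.size)
    x[0::2] = (coarse + detail) / np.sqrt(2.0)
    x[1::2] = (coarse - detail) / np.sqrt(2.0)
    return x

x = np.array([4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # a signal with one sharp edge
c1, d1 = haar_split(x)                     # first scale: coarse approximation + details
c2, d2 = haar_split(c1)                    # iterate on the coarse part: second scale
assert np.allclose(haar_merge(c1, d1), x)  # the details allow you to go back exactly
print(d1, d2)                              # detail coefficients are zero except near the edge
```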
So, coming back to physics: the probability distribution p of the field is the exponential of minus its energy, what is sometimes called a Gibbs distribution. In physics, what you want is, given this energy, to understand the properties of the field at all scales. Now, what is known pretty well in physics is the energy of systems which are not too complicated, like gases, or like the ferromagnetism that you see here, which essentially corresponds to the problem of computing the magnetization of a material from the properties of its spins. For that kind of thing there are good physical models. Where there are no such good models, and in fact no models at all, is whenever you have a physical phenomenon with geometrical structures. For example, turbulence: in 1941, Kolmogorov raised this problem, can we define a statistical model of turbulence? And up to now this has not been derived from the fundamental equations. Or the cosmic web, the aggregation of mass in the cosmos: how can we describe the energy of such a system? One way is to try to do it from data. And the second question that I will ask is: by doing that, can I better understand the principles which govern these machines, in particular neural networks? And I will show that indeed, on the way, you get very good hints about what is happening in these learning problems.

Now, why is physics very difficult? Physics is very difficult because it is also a very high-dimensional problem. You have phenomena which occur at the scale of 10 to the minus 20 metres, the scale of elementary particles, up to phenomena which happen at the cosmological scale, at 10 to the power 20 metres. How does physics deal with this problem? When you deal with a cosmological problem, you do not try to track the properties of each atom involved in the cosmos. What you do is scale separation. In other words, you have one domain of physics which only deals with particles, with atoms and their nuclei. You have another domain of physics which specializes in the next scale, the properties of materials, which is materials physics. Or you have biology, which studies the properties of DNA. Each time, you try to separate the phenomena at different scales. Of course, scales interact, because the properties of atoms are going to influence the properties of materials, which are much bigger. So the idea, which is what is always done in physics, is to study the phenomena at each scale and try to understand the kind of interactions that happen across scales.

Now, why is that a good strategy? Suppose that you have a set of particles, which are these points over there in the plane. Think of these particles as pixels in an image, or as agents in a social network. Everybody interacts. Now, what are the strongest interactions? The strongest interactions are between yourself, let us say, in the middle in red, and the nearest particles, maybe your family, the closest people. Then, for the particles which are a bit further away, instead of trying to look at the interaction with each of them, you can regroup them and look at the influence of the equivalent averaged field on the central one; and the same for the ones which are much further away. You may think, for example, that the life of someone in Russia, picked at random, has little influence on your life, probably. But if you now think of all the Russians, and there is a tension between Russia and Ukraine, as two groups, that is going to influence your life. What does that mean?
That means that, yes, you can regroup the phenomena at the different scales, but then you need to understand all the interactions between the scales. Why have you simplified the problem? Because you went from the particles to the groups, so you are killing the curse of dimensionality. But you still have a complicated problem, which is to understand the interactions between all the scales. How do you model the interactions across scales? That is where we will see neural networks coming in.

So what is the strategy in physics to do that? That is an old problem, studied in particular in the beautiful work of Wilson and Kadanoff, what is called the renormalization group; Wilson got the Nobel Prize for it. The idea is to look at the evolution of the phenomena when you move from one scale to the next. Do not worry, at one point we will come back to neural networks. How do you do that? Here is an image of the cosmos at a fine scale, and you approximate it at the next scale over there. What you would like is to understand how to come back. And how to come back, we have seen how to do it: if you can compute the wavelet coefficients, which are shown here, the high-frequency variations, you can go back to the original. Now, what do we want to do? We want to relate the energies; we want to relate the probability at the fine scale to the probability at the coarser scale, which is here. You have the Bayes formula, which tells you that this is given by the probability of the wavelet coefficients conditioned on the low frequencies. And the discovery of Wilson, and of many physicists around, is that this conditional probability is much simpler than the probability of the field. And you can see it: if you look at the wavelet coefficients, they look like noise, much less structured, largely decorrelated. This is much easier to understand than the structure of these filaments. If you take the same equation and take the logarithm, you get an equation on the energies. What it says is that, yes, it is very complicated to compute the full energy, but if you compute the increments, the interaction terms, these are much simpler. And that is the approach which, as we will see, naturally leads to a neural network architecture.

So this is work that was done with Tanguy Marchand, Misaki Ozawa and Giulio Biroli at the École Normale Supérieure. You can take a very complicated probability distribution, with a very complicated energy, over there. The idea is to say: I am first going to look at it at a very coarse scale, with very few coefficients, and progressively I am going to add details. When you add these details, you compute these conditional probability distributions, these interaction energies, and we will see that these appear within the different layers of a neural network for this kind of problem. Now, why is this also important for the optimization? Because the original energy, which is here, is very complicated; there are a lot of local minima. If you try to learn it directly, you will immediately be trapped and you will never converge to the best solution. Whereas the simpler interaction energies have a very important property: in a large class of problems, they are convex. So if you learn each of these, you have a much easier problem than learning the total. That is what the strategy is going to be. So, now we need to learn these.
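In symbols, with notation of my own rather than that of the slides: write x_{j+1} for the coarse approximation of the field x_j at scale j, and \bar{x}_j for the wavelet coefficients carrying the details lost between the two scales, so that x_j is equivalent to the pair (x_{j+1}, \bar{x}_j). The Bayes formula and its logarithm then read

$$ p(x_j) = p(\bar{x}_j \mid x_{j+1})\, p(x_{j+1}) \qquad\Longleftrightarrow\qquad E_j(x_j) = \bar{E}_j(\bar{x}_j \mid x_{j+1}) + E_{j+1}(x_{j+1}), $$

with $E = -\log p$ up to an additive constant. Iterating from the coarsest scale down to the finest expresses the full energy as a sum of conditional interaction energies $\bar{E}_j$, each of which is much simpler than the full energy and which, for a large class of problems, gives a convex learning problem.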
To learn these interaction energies, you need to build an approximation with parameters, and that is where neural networks come in. What I am doing here, instead of showing a neural network directly, is showing how it progressively appears from these first principles coming from physics. Here is an image on the left. These are the wavelet coefficients obtained with the different wavelets at the next scale, and the next scale. What we want to understand, to build a model of this, is the relation between the different scales: the relation between the wavelet coefficients over there and the coarser image, which is itself represented by all the wavelet coefficients at the coarser scales. You can see that they are very related: the shape of the boat appears at all scales, so obviously these are very related to one another. The big difficulty, and this has been blocking a lot of research in mathematics and physics, is that if you try the naive approach, which is just to correlate the coefficients at different scales, you get zero, because these coefficients oscillate with different signs, and when you compute a correlation it vanishes. Now, if you follow the neural network strategy, you insert an activation function, a nonlinearity, which kills the sign. That is what we do here; I use an absolute value, because mathematically it is easier to analyze. And then, of course, the correlation between the different scales becomes non-zero.

OK. Now, these correlations may have very long-range interactions, and you want to build a model with as few parameters as possible. So what we do is reapply the same strategy: again apply a wavelet transform. That means that from here, I am reapplying a transformation, and with wavelets, this begins to look much more like a neural network. Now, if I want to understand the physics, I want to understand the interactions across all the scales and all the orientations that appear at a given depth, which is here. How to do that? You have to realize that if you can capture these interactions, you capture the physics, whether it is gravitation or electromagnetism and so on; everything is within the nature of these interactions.

Now, there are techniques that have been developed for this. As I mentioned at the beginning, in neural networks the coefficients have a tendency to look random. So one strategy, which may seem strange, and this is work done by PhD students at the École Normale Supérieure, Édouard L'Empereur, Gaspard Rochette and Florentin Guth, is the following. You want to measure the interactions between all the coefficients corresponding to the different scales, and you do random combinations. In other words, you take these coefficients and you introduce neurons with random weights: this is a random matrix. You multiply by it, which means you take random linear combinations of the coefficients, and then you apply your rectifier. These are the kinds of models that have appeared in the field, suggesting that neural network coefficients are close to random. And then you compute your interaction energy with a parameter vector. So the idea is that once you have all the coefficients computed at the different scales, the interactions that happen across coefficients are carried by these random weights, and you compute the energy by learning the parameters of the network.
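Here is a toy numerical sketch of the two points just made, with a one-dimensional signal and Morlet-like filters of my own choosing: wavelet coefficients at two different scales are nearly uncorrelated because they oscillate with different signs, while their absolute values are clearly correlated; and inside a layer, rectified random combinations play a similar sign-removing role.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy signal: sparse spikes, so that localized structures appear at all scales.
x = np.zeros(4096)
x[rng.choice(4096, size=40, replace=False)] = rng.standard_normal(40)

def wavelet(scale, n=None):
    """A real, oscillating (Morlet-like) wavelet at the given scale."""
    n = n or 8 * scale
    t = np.arange(n) - n // 2
    return np.cos(2 * np.pi * t / scale) * np.exp(-0.5 * (t / (scale / 2)) ** 2)

w1 = np.convolve(x, wavelet(8), mode="same")    # wavelet coefficients, fine scale
w2 = np.convolve(x, wavelet(32), mode="same")   # wavelet coefficients, coarser scale

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print(corr(w1, w2))                   # near zero: the coefficients oscillate with different signs
print(corr(np.abs(w1), np.abs(w2)))   # clearly positive: the nonlinearity reveals the dependence

# Inside a network layer, rectified random combinations of the two scales play a similar role.
W = rng.standard_normal((8, 2))                        # random weights mixing the two scales
features = np.maximum(W @ np.stack([w1, w2]), 0.0)     # rectifier applied to random linear combinations
```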
How do you learn them? With your gradient descent, the optimization that is done in a neural network, and this is done with a standard maximum-likelihood approach. Now, if you analyze the mathematics behind it, you realize that doing this random projection followed by a nonlinearity is like doing a Fourier transform in high dimension. That means that, essentially, what you represent are the interactions between the different scales, and you keep only the low-frequency, regular part of these interactions.

So here is an example. These are types of physical phenomena that are not described by any standard energy model. On the left you have the gravitational alignment obtained in the cosmic web at a very large scale; in the middle, turbulence in an electromagnetic field; and on the right, free turbulence. For each of these systems, viewed as an equilibrium system, you would like to compute the energy. To do so, we apply exactly what I have just described: we build this kind of neural network, which separates information across scales, and then we compute the parameters of the energy. Then you can sample the models, and what you see at the bottom has been generated by the computer, from the energy computed by the neural network. You can see that you are indeed reproducing fields which are very close to the original ones, and you can verify it by checking the statistics of the fields, indicating that you can indeed learn physical models of very complex physical energies, including turbulence.

OK, so what does all this say? It says that, essentially, when you have a very complicated phenomenon which lives at many scales, the important things are to separate the scales and then to measure these interactions. Now, is this really what is done by a standard neural network trained for classification? That is what I am going to finish on: I am going to show that you find the same kinds of principles within these networks. If you want to do image classification, the input is an image and the output should be the class of the image. As I said, these networks learn by updating the weights so that, at the end, the error is as small as possible. How do you initialize the weights when you know nothing? You just put random weights; that corresponds to what is called Gaussian white noise, totally independent weights. Then you take your gradient descent algorithm and you progressively update the weights in order to minimize the classification error. And the question is: what has been learned, and what kind of functions can you approximate? You know from the curse of dimensionality that you cannot approximate everything; you can only approximate problems which have the appropriate structure.

From the previous analysis, what is very tempting is to do exactly the same thing as what we did on the physics side. You take your image, you apply the wavelet filters, so you get the first scale of the wavelet transform, and then you apply an operator that computes the interactions across the different scales. Then you repeat: you compute the second interaction operator, and so on. Now, the question is: how do you compute the interaction operators? About five, seven years ago, I made a bet with Yann LeCun that, within the next five years, someone would be able to compute these interaction operators without learning.
So we tried for five years, and we failed. We failed. How do you try to compute these operators? You try to use prior information about the physics: physics is invariant under rotations, and so on. You have all kinds of properties that you try to build in so that you can encode information across scales. The best results that we got are compared with the state of the art on the right here, and what is shown below the prior-based approach is the percentage of error that we obtained: essentially four to five times bigger. So obviously you need to learn; the problem is too complicated. This was to classify images: the top is CIFAR-10, with smaller images; ImageNet has much, much bigger, more complicated images.

OK, so the next stage. I lost, and I had to pay for a three-Michelin-star restaurant, which is totally unfair given that he probably does not need it. But at least we got a research question out of it. So the next stage was to say: OK, we lost, so let us try to learn these operators. That was done by Florentin Guth and John Zarka. They maintain the architecture, you separate the scales, but now you learn the interactions, which are these operators. And once you learn the interactions, you go back to the performance of a standard neural network. So that means that separating scales and just learning the interactions is enough. But then, obviously, the question is: what has been learned within these operators? Again, if you think in physical terms, these interactions are the interactions between the different scales; they describe the physics. In this case, they describe the distinctive properties of the different objects that you want to separate.

So, to see what was learned, what Florentin Guth and Brice Ménard, in collaboration with Gaspard Rochette, did is look at the evolution of the weights. You begin with weights in your neural network which are totally random, Gaussian random, and you let them progressively evolve and learn. What do you observe? When you have a matrix of Gaussian white noise, the spectrum, the eigenvalues of the covariance matrix, is flat. As the learning proceeds, you see the spectrum of the coefficients evolve: in other words, the weights are no longer independent, they become very correlated. However, the surprise was that the weights remain somewhat random; to a first approximation, they still have a Gaussian distribution.

So what does that mean? It means that in such a model, you begin with your data x, you separate the different scales with your wavelet transform, and then you compute the interactions between scales. In this model, the interactions between scales are computed by introducing these correlated Gaussian coefficients. And if you look at the covariance matrix, you realize that what it does, in fact, is reduce the dimension: it takes data living in high dimension and reduces this dimension, and by doing so it essentially eliminates all the information which is not useful and selects the information which is useful for the classification. Then you go to the next scale: you again separate the scales with the next wavelet transform, and then you have these random coefficients, which, as I said, are essentially equivalent to a Fourier transform, which is over there. And then you again separate the scales, and you iterate.
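A minimal sketch of that last ingredient (the shapes, the covariance and the names are mine, purely illustrative): a "colored" Gaussian weight matrix is obtained by shaping white Gaussian noise with the square root of a chosen covariance; when that covariance has a rapidly decaying spectrum, the layer effectively projects onto a low-dimensional subspace before the rectifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def colored_gaussian_weights(cov, n_neurons):
    """Sample weight rows as Gaussian vectors with the given covariance:
    W = G @ cov^{1/2}, where G is white Gaussian noise."""
    vals, vecs = np.linalg.eigh(cov)
    sqrt_cov = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    return rng.standard_normal((n_neurons, cov.shape[0])) @ sqrt_cov

d = 64
# A covariance with a rapidly decaying spectrum: only a few directions really matter.
eigvals = 1.0 / (1.0 + np.arange(d)) ** 2
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
cov = Q @ np.diag(eigvals) @ Q.T

W_white = rng.standard_normal((128, d))          # initialization: white Gaussian weights, flat spectrum
W_color = colored_gaussian_weights(cov, 128)     # after learning: correlated (colored) Gaussian weights

x = rng.standard_normal(d)                       # stands in for wavelet coefficients at one scale
layer = lambda W, x: np.maximum(W @ x, 0.0)      # rectified linear layer
print(layer(W_white, x).shape, layer(W_color, x).shape)
```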
So we call this the rainbow network, because you have noise with a different color at each layer, illustrated here a bit like a rainbow. Within this mathematical framework you can then do the analysis: you can look at the class of output functions, show that they belong to a certain space, and that they are essentially controlled by the covariances. The question is: does it really work? We have made some hypotheses; does it work on real data? That is the test that was implemented by Florentin Guth and Gaspard Rochette, with Brice Ménard, at the École Normale Supérieure. You take your neural network, you separate the scales, and you implement all the scale-interaction operators with just Gaussian weights having a prescribed covariance. That means the following: you first take a task and train a network, you compute the covariance at each layer, and now you want to see whether the model works. How do you test it? You create a new network with totally random weights, but with exactly the same covariances, and you see whether it has the same performance. The performance that you need to reach is 7.8% error on CIFAR-10, so of the order of eight percent, and you check whether, when you apply the model and create plenty of new networks, they reach the same performance. The performance in this case is a bit higher, about 11%, for these kinds of images. The problem is that when you go to more complex phenomena, this simple model with Gaussian weights becomes worse and worse. That is why I said that these remain largely mysteries: each of us has models, but they are limited. There are conjectures that the reason you want to go from 10 or 18 layers to 300 layers is that it simplifies the learning much more and the Gaussianity comes back, but these remain mostly open questions.

I would like to finish by taking a bit of distance from all this. As I said, there is this basic mystery: these problems are incredibly complicated, and these neural networks are not only able to avoid the curse of dimensionality, they are able to solve an incredible variety of problems with the same kind of architecture. One answer, as I said, if you look at physics, is that the key to avoiding this complexity is to separate scales. However, once you have separated the scales, the big difficulty is to encode the interactions between scales. And encoding the interactions between scales, that amounts to discovering quantum physics, that amounts to discovering gravitation. So this is not easy, but by taking that strategy you reduce a problem which is unfeasible to a feasible one. Now, what we have shown is that in the case of turbulence, for example, these kinds of models, given the priors we have from physics, are sufficient to learn new kinds of physical models such as the ones I have been describing. And as we watch the evolution of AI, we see that its impact on physics is getting bigger and bigger. For example, you now have neural networks which predict the weather over the next three days better than computing the solution by running the partial differential equations, namely the Navier-Stokes equations, because there is a lot of uncertainty in the data which the network seems able to take into account better at short range, just three days; beyond that, it does not work so well.
Now, what is interesting in this field is that, as you have seen, the examples I have been showing are quite simple compared to what is being shown nowadays, the creation of incredibly complex images or, as I mentioned, ChatGPT. What is happening is that the field right now is moving incredibly fast, because it is in, let us say, an empirical phase. We have algorithms which are being developed by extremely smart scientists and engineers, and the performance of these algorithms grows very quickly with more data and more computational power. The results are incredibly impressive, as I said: speech, protein folding, large language models. And mathematics moves very, very slowly in comparison: we are still trying to understand the properties of turbulence, the properties of the filaments in the cosmic web, whereas these networks are able to solve proofs and so on. So at one point there is a question you can raise: why try to understand? It is worth asking this question, because after all you can develop a system and check it statistically. Think, for example, of autonomous cars. Autonomous cars are now running in Phoenix and in San Francisco because, statistically, the number of accidents is very small, and it works, although we are very far from understanding any of the mathematics that leads to robust results, or not always so robust, because there was recently an accident in San Francisco. So why try to understand? From a practical point of view, I think there are two main reasons. One is robustness: it is always somewhat dangerous to have an engineered system where you do not understand why it is stable and why it is robust. If you think of building bridges, you can begin by doing it empirically; the Romans built bridges without knowing the basics of mechanics; but if you want to build much bigger, very stable bridges, then you begin to need to understand the mathematics and the mechanics. The second is efficiency: these systems are incredibly costly in energy and in data, and from that point of view understanding is important. But I think there is another reason, for myself at least as important: it is a very beautiful problem. It is an amazing problem. We now have machines which are able to solve problems that range from physics to language to music generation, speech, protein folding, all kinds of problems. That means that there is some kind of structure in the world which is common, because these machines are all able to solve them. And discovering this structure, which is about the structure of information, the structure of knowledge, is a very, very beautiful problem. We are not close to solving it; I have tried to give some elements, at least to approach it, but I think it is a very beautiful problem for anybody interested in doing mathematics in this domain. Thanks, Hermann. Thank you.

Thank you for the excellent lecture, and especially for mentioning robustness at the end, my personal favorite. I would now like to open the floor to questions. Do we have any questions? There is Andy here. So, Shafia, I think there was a question there at the back.

OK. Thank you for a fantastic lecture. You have covered many topics. I would love to hear your thoughts about the interpretation of parameters in these kinds of models. That came to mind because I was thinking of the images, the way you were explaining how the models interpreted those 2D images of a ship, and the parameters at various points.
And it strikes me that the really substantial models you were just talking about at the end are just much bigger; you referred even to trillions of parameters. So it seems like it is a problem of scale as well: in small models you can to some extent visualize what the parameters mean, whereas with the gigantic ones it is much, much harder. So I wonder if you have any thoughts on how we might approach that problem.

Yes, this is an extremely difficult problem. When you have a physical problem, or any problem with three parameters, you can associate to each parameter an intuitive meaning. When you begin to have a model with a hundred parameters, a thousand, a million, the parameterization by itself is very complicated. Now, if you go back to the physics case, in the end all the parameters are used to learn the energy, and then you can relate the properties of the energy to the physics. For example, that is what we are doing on active matter. Active matter is made of particles which have their own free will, so to speak, and can move by themselves. By looking at the energy, you can begin to interpret some of the physical phenomena. But that means you go back to an abstraction level which is still high, because you are looking at the physical potential; you are not looking at patterns. And I do not think that interpretability of such complex problems can be boiled down to saying: oh, there is this type or that type of pattern. The interpretability will probably be at a mathematical level, on constructs such as, in the case of physics, potentials. It is a very good question how to do it for problems such as classification. This is, in some sense, why I myself moved back to physics: in the case of physics you understand the problem, you have a machine you do not understand, and you can try to relate the two. If you want to classify cats and dogs, deep down we do not know what an image of a cat or an image of a dog is; and then you have a machine you do not understand, so you have a machine you do not understand solving a problem you do not understand, which is more difficult. So hopefully we will move on to that problem. Interpretability is a beautiful problem, but I do not think it is going to be solved by simple things such as patterns, or the ideas we had in computer vision in the 1990s or 2000s.

Right, next question. Hi, thanks for the talk. In terms of neural networks, you have talked about them from a kind of reductive point of view: you are saying that the networks learn in terms of wavelets, and you build that up with different hierarchies. But when you look at generative neural networks, you are getting some very, very complex behavior out of them. You showed us before the video of a bear running through the streets; there are a lot of high-level concepts involved in that. So, for instance, if you feed into a generative neural network the concept of the pope in a puffer jacket, can you go backwards through it and split it up, and show which part of the network is associated with pope-ness and which part is related to the puffer jacket?

OK, I did not quite understand the end, but from what I understand... Can you do the opposite and probe the network backwards, so that the concepts within it are highlighted? In the case of generative networks that generate images and so on. So, for the case that I know well, images, if you look at the architecture, people usually use what is called a U-Net.
And what these U-Nets do is precisely separate scales: you go to one scale, the next scale, the next scale, and then you have these interaction elements that come in horizontally, and they reconstruct. So it is very much present. In fact, the first networks that generated images without these multi-scale properties only worked for small images. So you do see that appear. Now, whether there are wavelets inside or not, as I said, you can only see it on the first layer; inside the other layers it is very complicated if you do not do the factorization. Now, for the synthesis part there are many techniques, so we would need to discuss that, because it depends on the way the synthesis is done; if it is score diffusion, then you use these networks at each step, and each time you do see these structures. But yes, that was for the case of images that we studied.

Yes, thank you for a really stimulating talk. I think it is really interesting to compare modern systems and the human brain, and I think one of the important comparisons is the energy footprint of learning. The human brain is extraordinarily efficient: it runs on 10 to 20 watts. So I guess my question is, with current systems having a much higher energy footprint for learning, do you think we have missed the mark with the algorithms or with the hardware?

Both. Certainly both. The hardware is obvious: silicon, compared to neurons, is energy-wise extremely bad. At the algorithmic level, there is something quite obvious, which is that in the neurophysiological system there is DNA, and DNA already encodes part of the solution, probably. We are not born blank: the architecture is already there, but also some of the responses. There are a lot of experiments that have been done on babies; they acquire vision incredibly quickly. And you can see, for example, that these simple cells are already there, already encoded. So that means that you need... Now, that could be addressed by pre-training, but it means the algorithms need to evolve. So, both; probably the worst is the hardware. If we compare to biology, there we are really far off. Thank you.

So, just one last question. Thank you for a fascinating talk, Professor Mallat. I would like to comment on your diagram showing the range of scales considered by physics, from the Planck level to the whole cosmos, and point out one of the problems, which is the reason physics is different from chemistry, and different from biology and from psychology and so on. It is not just a matter of scale; it is a matter of the nature of the systems involved, and the fact that when a system has evolved, it has properties which are not contained in any of its constituents. There are emergent properties as the scale changes, different ones for different scales. And the approach you have taken seems to me to concentrate too much on the scale, and on the interactions between the different scales, without taking account of the different emergent properties that occur in these different situations. Do you agree?

So that is a very interesting question. You are absolutely right: the natures of the phenomena that appear at the different scales, which may correspond to chemistry and biology or to fundamental physics, are different.
But you see the same kind of thing appearing in neural networks; in fact, people speak of emergent properties in ChatGPT, which can suddenly handle a prompt in a way that was not expected at the beginning. There is one point on which I think you would agree: all these systems, whether they are biological or chemical, are built from the same particles. They are built from the same material, in the same way that the emergent properties of a system with trillions of neurons are based on the weights. So there are emergent properties, but these emergent properties result from going to the different scales; that is the way I view it. But you are absolutely right: they are properties which are fundamentally different. This is why statistical physics is both beautiful and incredibly difficult: understanding how these emergent properties come out of the previous layer. And that is what we do not understand in neural networks either. So I am not sure the two points of view are so different.