open the webinar and start recording. Okay, you will see the number of attendees increasing. We are also live on YouTube. Okay, I think we can start now, more or less. Yeah, the number of people has gone over 100. Okay, welcome everybody. It's a great pleasure for me to introduce Professor Stéphane Mallat, who is our colloquium speaker today. He will talk about mathematical mysteries of deep neural networks. Professor Mallat is an applied mathematician. He is a professor at the Collège de France, holding the chair of data sciences. He is a member of the French Academy of Sciences, of the Academy of Technologies, and a foreign member of the US National Academy of Engineering. He was a professor at the Courant Institute of NYU in New York for 10 years, then at École Polytechnique and École Normale Supérieure in Paris. He was also the co-founder and CEO of a semiconductor startup company. I will pass it now to Jean to introduce his research interests and contributions. Thank you. Welcome, Professor Mallat.

I'm very happy to host this seminar today, in particular because the work of Professor Mallat has been very influential in my research and in that of a whole generation of young physicists and applied mathematicians working in the field of signal processing. That is one of the specialties of Professor Mallat: he has made great contributions in particular to the theory of what we call wavelets. For those who have no idea what that is, it is a kind of natural basis in which you can express complicated signals like images; it is the equivalent of the Fourier basis for sounds, but for images and other types of complex data. Professor Mallat has an enormous number of contributions in this field. Later on, he moved towards machine learning, which is the subject of today's talk. He has established a whole series of deep links between this field and natural representations of data such as images, and why algorithms such as deep neural networks are able to learn from such high dimensional data. I'm also very happy because, as you will quickly realize, Professor Mallat is an expert in science popularization. I had the chance to meet him once and to see one of his talks in a very small village in Corsica, in France. The whole village, older people and children, gathered and had the opportunity to listen to him, and it was absolutely obvious that everyone got something from this talk. I think this is really an important skill. With that, Professor Mallat, I leave you the floor. Welcome again.

Thank you very much for this warm introduction. The pressure is really on, so I'll try to live up to the description you've just given. I'm very happy to be here at ICTP to give this talk. Let me begin by sharing my screen. Here it is. I'll indeed be speaking about these mysteries, the mathematical mysteries of deep neural networks, but a bit more broadly I'll be speaking about the whole field of data science, which has been booming in recent years, its relation with mathematics, and the open problems which are around. First of all, what is data science? In some sense, it's about extracting some form of knowledge from data, and what is really new in what has been happening in the last 10 or 15 years is, first of all, the mass of data, obviously, which is constantly acquired by video and different types of sensors, the increase of computer speeds, but also new algorithms.
In particular, these deep neural networks, which existed in the past, got very impressive results with this mass of data and the increase of computer speeds. These results are really not understood, and I'll be focusing a little bit on that. Now, the applications are in many different types of domains: for example, automatic speech recognition, which you may have on your telephone, analysis and translation of language, image recognition and facial recognition. You probably also know that the world Go champion has been beaten by such a machine, using, again, these neural networks. Impressive results have been obtained in driving, which is obviously very important if we are to move towards autonomous cars, and in other domains such as medicine, for automatic diagnosis. Now, what is common to all these problems is that the data we are dealing with, whether it's speech, images or text, correspond to very high dimensional signals. High dimensional in the sense that the number of data points, if you take an audio sequence, is about one million per minute. The number of pixels of an image is of the same order, a million. You also get about a million letters in a book. Of course, the numbers can grow considerably in a social network, with, say, billions of agents interacting, for example, on Facebook. If you go to physics, the numbers grow even more. In some sense, physics has until now been the science of high dimensionality, with the Avogadro number, 10 to the power 24, the number of atoms in a few grams of matter. Now, what is this field about? It's essentially about trying to answer questions and predict the answer given the information provided by the data. The question will be, for example, recognition of images, what kind of sound or word has been produced, what kind of medical diagnosis, and so on. This domain, which we call statistical learning, is really the basis of the explosion of artificial intelligence that you've seen in recent years. Now, when you look at artificial intelligence, it's very interesting to look at what has been happening, because the whole field has completely shifted over the years, from the beginning in the 1970s or 1980s, when the domain was essentially based on logic, to what is happening now, which is much more, as I was saying, a statistical point of view. Now, this switch is not really a surprise. If you look at the philosophy of knowledge, you constantly have this balance between a totally rationalist view, which in the extreme would be a Platonist view, where everything is known a priori and there is nothing to get from experience, only to be triggered by your own senses, and the other extreme, which would be Aristotle, which basically says, well, if you take a baby, it's just a piece of clay that is going to be molded by experience, by data. Now, this second point of view is also much more the Anglo-Saxon point of view in philosophy, Hume, Locke, as opposed to, let's say, the more French Cartesian view, or Leibniz on the German side. And that balance has constantly been discussed, because the fact is that in these problems you need priors and you need data, and how to weigh one with respect to the other is really the central issue of all these questions. So that will be the question we'll be trying to look at: what are the priors in machine learning algorithms, and if you learn from data, what is the nature of what you learn? Now, I'll try to phrase all this in mathematical terms and explain why understanding is so important.
We'll see that algorithms are incredibly efficient, but there are also issues of making sure that things are robust and well controlled, and that's where mathematics will come in. So let me begin with learning algorithms. What does it mean? You have a piece of data that I'll call x here. For example, x can be an image, and you'll ask a question, for example: what is this animal? You want to estimate the answer, and the answer will always be called y. So in this case, y is a dog. Now, how are we going to estimate this answer? The specificity of learning is that you are going to develop an algorithm, so a sequence of instructions, but one which is parametrized, and it is these parameters of the algorithm that you are going to train in the learning phase. Now, how do you set the parameters of your algorithm? You take examples, examples of images x_i, and you give the algorithm the appropriate answer: it can be a dog, a cat, a horse, whatever. And the algorithm is going to configure itself so that it gives the right answer on the examples that you gave. Now, this by itself is not very difficult. What is really going to be difficult, obviously, is to generalize: if you give it a new image it has never seen, it has to give you an answer which hopefully is going to be close to the true answer. So how come you can generalize from examples? That will be the mathematical question we'll be asking, to understand the conditions of generalization. You can also have a more experimental approach to that, which would be, let's say, the computer science approach: develop algorithms which are efficient and test them on the data. And constantly in this field there is a back and forth between these two points of view. Okay, so where does this generalization come from? The generalization comes from the fact that you have an a priori, and the idea is in fact very simple: you have an a priori on the regularity of the problem. I'm going to give you a simple example to get a feeling for that. Suppose you have a bowl of water, and you would like to know whether the water in this bowl is in a solid, liquid or gas state. Now, the data you are going to be given is the temperature and the pressure in the bowl of water. So you have two variables here: x is the pressure and the temperature. You make one measurement and you'll say, okay, here the temperature is between zero and 100 degrees, the pressure is not too high, and it happens to be a liquid. You make many experiments: in this zone your water is liquid, here at high temperature it becomes gas, at low temperature it becomes solid. Now, you are going to be given a new temperature and pressure, and you are asked: is the water going to be liquid, gas or solid? So how do you solve this problem? Well, as you can see, essentially it will amount to finding the frontiers between these different zones. Now, that could be the frontier: given the data, that would be a possible frontier between the examples. But if you look at it, it doesn't look very reasonable, and if that were the frontier, then x would be a gas. It's not very reasonable because in physics, when you change a variable slowly, you expect in general that the result changes slowly. So you expect some form of regularity. Now, if you expect that the frontier is regular, then you will get something more like that, and then it's much easier to estimate the frontier given the position of the points. So you can try to estimate a regular frontier, the one I show in black.
And if both frontiers are regular, there is a good chance that the error is going to be small. So how can you do that? There is a very simple approach, nearest neighbors. You take x and you look, among the experiments you already did, for the one whose pressure and temperature are as close as possible to the point you have. If that example happened to be liquid, you'll say, well, x is liquid. That is indirectly a way to define frontiers which go in the middle between the different examples, and which are therefore piecewise regular. And it works relatively well. Okay, so the problem looks simple. Why is it that people have been working on it for so long with such bad results? The reason is that the situation is very different when x doesn't have two variables but millions of variables. If you take an image, the number of variables is the number of pixels that you have. Each pixel can vary from zero, black, to one, white. So x is now defined by d variables, where d is of the order of a million, and if you want to view it as a point, it's going to be a point in a space of dimension one million, as opposed to two. So what's the consequence of that? Let's ask the same question: what is the animal, dog or cat? So you take examples. Here you have examples of cats; each image is now represented by a point in this space of dimension one million. Now you take other examples, of dogs; here are your points. Now what's the difficulty? The difficulty is that all the examples are very far away from one another. To understand why that's the case, you can ask yourself: suppose I would like to make sure that I always have an example at a distance of, let's say, one over 10 from the point I am given. What would that require? It would require that all the pixels of my image are close to the pixels of one of the examples, up to a variation of one over 10. How many examples do you need to make sure of that? Well, you need to make sure that each time you move by one over 10 along one of the axes, you have a new example. So you would need 10 multiplied by 10 multiplied by 10, as many times as the number of dimensions, so 10 to the power d. Now if d is larger than 100, 10 to the power d is larger than the number of atoms in the universe, and d is not going to be 100 but a million or more. So you can see that the number of examples you would need is totally impossible to have. What that means is that in practice you will never have an example which is close to the image that you have, at least a priori. So how come you're still able to interpolate? It means that there is some form of very strong regularity which allows you to relate all these examples, and that's what one needs to discover in this field. Okay, so what is the type of strategy that people use in machine learning? At the beginning, suppose I have two classes, the red and green points here. The points are very far away and it's a mess. The idea is to take this vector x, which describes the data, and to make a kind of change of variables, to describe it with new variables, so that in this new vector phi of x, which you may think of as features, suddenly, by miracle, the two classes separate. What does it mean that they separate? It means that I can put a plane that is going to divide the two classes in two parts. What it means is that if I move orthogonally to the plane, I have a discriminative direction that will lead from one class to the other one.
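Before the next example, here is a minimal numerical sketch of the two points just made, the nearest-neighbour rule and the 10-to-the-power-d counting argument. The temperature and pressure values and the labels below are invented purely for illustration, assuming nothing beyond numpy:

```python
import numpy as np

# Toy training set: (temperature in Celsius, pressure in bar) -> phase label.
# These values are made up for illustration only.
X_train = np.array([[20.0, 1.0], [80.0, 1.0], [150.0, 1.0], [-10.0, 1.0], [200.0, 1.0]])
y_train = np.array(["liquid", "liquid", "gas", "solid", "gas"])

def nearest_neighbor(x, X, y):
    """Predict the label of x by copying the label of the closest training example."""
    distances = np.linalg.norm(X - x, axis=1)
    return y[np.argmin(distances)]

print(nearest_neighbor(np.array([90.0, 1.0]), X_train, y_train))  # -> liquid

# The counting argument: to guarantee a neighbour within 1/10 along every axis of
# the unit cube [0, 1]^d, you need on the order of 10^d examples.
for d in (2, 100, 10**6):
    print(f"dimension d = {d}: roughly 10^{d} examples needed")
```

Already for d of the order of 100, this count exceeds the number of atoms in the universe, which is exactly the obstruction described above.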
Let me give you an example. Suppose that you have images and your problem is to find whether in an image you have a fire truck or a car. The image can be very complex, but you could imagine a very simple strategy, which is just to count the number of red pixels in the image: if you have many red pixels, you can imagine that it's more likely to be a fire truck than a normal car. So that would be counting the number of red pixels. Now, this is obviously a very naive example, and in general that's not going to work for very complex problems. So the problem is going to be to understand this phi. Before going to that, let me look a little bit at this separating plane. The plane that separates the classes is defined by a vector which is orthogonal to the plane, this vector w. Now, to say that you are on the left side as opposed to the right side, you just compute the inner product, the weighted average of your feature values with the different coordinates of your vector w. If the projection of phi on this direction is larger than b, you'll be on the left side and the answer will be one; if it's smaller than b, you'll be on the right side and the answer is zero. So you can see that what is difficult is not to do that. What is really going to be difficult is to find this miracle phi which suddenly separates the classes in two. So how to define it? There are two strategies. Either I have a simple problem, like my fire truck, and I know a priori how to discriminate the images, so I will be able to engineer the phi of x; or I don't know, and then somewhere I will need to learn the phi of x, the discriminative features. And that's where neural networks come in. So what is the idea of a neural network? First, let me begin with a neuron. What a neuron does is essentially the same thing as this separating hyperplane. I take the input data, multiply each value of the input by a weight of my vector, and sum, so I have the inner product here, and I look at whether this sum is larger or smaller than b. If the sum is smaller than b, the neuron gives you a zero value; if it's larger than b, it gives you the difference between the two. Here is how I'm going to represent the neuron: on one hand I have all the inputs, which would correspond to the dendrites of a real biological neuron, although this is an incredible simplification of what a real neuron is, and then there is the output, which would correspond to the axon of the neuron in biology. Now, a neuron is useful when it's put in a network. So in a network, you have your x, which you put at the input of the network, and you organize the neurons into layers. The first layer is a set of neurons that take linear combinations and produce an output here; that will be layer one. The output of layer one then goes into a new set of neurons that compute linear combinations and apply the nonlinearity I've described, this kind of rectification; that gives a second layer. You can put as many layers as you want. And at the end, you get your phi of x, which you linearly aggregate, compare to the threshold, and you say whether the result is one or zero if you only have two classes, or what real value you want to compute.
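A minimal sketch of the neuron and the layered network just described; the weights here are random placeholders, since nothing has been trained yet:

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: inner product of the input with the weights, compared to b,
    then rectification: zero below b, the difference above b."""
    return max(0.0, np.dot(w, x) - b)

def layer(x, W, b):
    """A layer of neurons: each row of W holds one neuron's weight vector."""
    return np.maximum(0.0, W @ x - b)

rng = np.random.default_rng(0)
d = 8                                   # input dimension (a real image would have ~10^6 pixels)
x = rng.random(d)                       # the input signal

W1, b1 = rng.standard_normal((16, d)), rng.standard_normal(16)   # layer 1
W2, b2 = rng.standard_normal((4, 16)), rng.standard_normal(4)    # layer 2
w_out, b_out = rng.standard_normal(4), 0.0                       # final linear aggregation

print("one neuron's output:", neuron(x, W1[0], b1[0]))
phi = layer(layer(x, W1, b1), W2, b2)            # the representation phi(x)
answer = 1 if np.dot(w_out, phi) > b_out else 0  # two-class decision
print(phi, answer)
```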
So a neural network like this has a lot of parameters; each neuron has many parameters. I'm going to call w the set of all the weights of all the neurons. Now, as I said, in an algorithm like that you learn the parameters from the examples. So you give an example as input, you know the answer y_i, and you optimize all the parameters so that the answer produced by your network here is as close as possible to that answer. So you have an optimization problem, and in this optimization problem you will typically have 10 million parameters; it's a huge optimization problem. Now, two questions come in. First of all, how do you find these optimal weights? That's one branch of mathematics, which is about optimization. And second question, what kind of architecture are you going to use? The architecture of the neural network is very important because that's where you are going to put the prior information that you have on the problem, and I'll give you examples from physics. Okay, optimization. You have a cost function, which is the mean error on all your examples, and you would like to find the w such that the cost is as small as possible. The first idea that comes to mind is the algorithm called gradient descent. Essentially what you do: here you are, that's one setting of the parameters; you try to move your parameters so that the error is reduced, and to do that you move along the direction of the gradient. That's a very classical optimization algorithm. The problem is that you're going to move, you're going to arrive here, and you're going to be stuck. You are in a local minimum, whereas in fact you would have liked to arrive here, which is much better for reducing the error. That's the problem when you have a non-convex problem. So a priori it shouldn't work, because these are very non-convex optimization problems. The bizarre thing is that people do this gradient descent, also called stochastic gradient descent in the way it's optimized, and the fact is it works quite well. So there is a whole series of questions being asked now: how come the local minima that are reached by these algorithms are in fact not so bad? One of the reasons that seems to be very important is that you have a huge number of parameters; in fact, you are doing what is called an over-parametrization of the problem. That's one type of question people are asking. Now, second type of question: how are you going to organize your neural network to incorporate prior information about the structure of the problem? Let me take an example. If you take an object like, for example, this telephone, and I translate the object, it's the same object. What does that mean? The problem is invariant to translation, for recognizing most objects. If that's true, and you have a neuron whose weights are here, then if you look at a different region of the image there is no reason why you should use weights which are different. So the weights are going to be invariant by translation. Now, because this is a linear operator, and a linear operator which is invariant to translation in this sense, which commutes with translations, is a convolution, all the linear operators you are going to put here are going to be convolutions. And this convolution is then going to be followed by the nonlinearity, the rectifier I showed you.
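A small sketch of that last point: when the same filter weights are reused at every position, the linear operation is a convolution, and shifting the input simply shifts the output. The 1-D signal and the filter below are arbitrary toy values:

```python
import numpy as np

def conv_circular(x, w):
    """Circular convolution: the same filter w is applied at every position of x."""
    n = len(x)
    return np.array([sum(w[k] * x[(i - k) % n] for k in range(len(w))) for i in range(n)])

def relu(x):
    """The rectifier nonlinearity applied after the convolution."""
    return np.maximum(0.0, x)

x = np.array([0., 0., 1., 2., 1., 0., 0., 0.])   # a toy signal
w = np.array([1., -2., 1.])                      # a toy filter (nothing learned here)

y = relu(conv_circular(x, w))                    # one convolutional layer: convolution + rectifier
y_shifted = relu(conv_circular(np.roll(x, 3), w))

# Shifting the input by 3 shifts the output by 3: the layer commutes with translations.
print(np.allclose(np.roll(y, 3), y_shifted))     # True
```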
You are going to build a whole series of images in your layer one, which you are going, again, to re-transform with convolutions, nonlinearities and so on. These are the particular architectures called convolutional networks, developed in particular by Yann LeCun in the 80s, and that's the architecture which has had the biggest success in recent years for many, many problems. Okay. Now, the strange thing is that these algorithms don't work for just one or two kinds of problems; they seem to work for very different classes of problems. I mentioned images, speech, language, bio data, physics. How come? What is common to all these problems, such that the same kind of architecture is able to solve them all, when in appearance they have nothing to do with one another? Quantum chemistry, which I'll speak about, and image recognition have in appearance nothing to do with each other. What is generic behind all that? When you begin to ask that kind of question, you are really asking questions that relate to mathematics: finding the underlying phenomena that link all these questions. Let me give you an example of image classification. This is a huge database with thousands of classes. These are different examples: container ship, motor scooter, leopard. You see, you have very complex images, and now these kinds of systems do as well, if not better, than humans on this specialized type of task. Before the 2010s, the error on this particular dataset was typically above 25%, which is quite large. Now, because of these deep networks, the error has gone below 5%, and 5% is typically the error that humans make on this database. Now, it's not that these systems are better than human vision; they are much less flexible, much more specialized. It's just that not everybody knows how to discriminate the Madagascar cat; you haven't been trained for doing that, whereas the system has been trained. But the fact is the results are very spectacular. Let me show you another application, audio recognition and audio separation. Let me make you listen to a sound mixture: you have two persons speaking together, difficult to understand because of the mixture. The problem is, can you separate these two speakers? Three years ago, all the results with classical algorithms were still very bad. In the last two years there has been a huge improvement because of these networks. I'm going to make you listen to the result of the separation. "This allows the shaft to change its length and direction as the car wheels move up and down." The second voice: "His most significant scientific publications were studies of birds and animals." So the system was able to separate the two voices, although it didn't learn from these voices but from very different voices. So again, it was somehow able to find the key elements to do this separation. Why? How? We don't really know. Now, why is it important to understand? Because you can also trick these systems quite easily. These are images to which a very small perturbation has been added. All these images were correctly classified; with this small perturbation you see images which are visually identical, but now all of them are classified as ostriches. Why? We don't really know, but it shows that there are instabilities. Instability means that if you have a system like that in an airplane you are sitting in, it's a bit scary, because you don't know whether the command is suddenly going to diverge in a strange way.
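The talk does not say how the ostrich perturbations were constructed; one standard recipe is a gradient-sign step, shown here on a toy linear classifier rather than on a trained deep network, just to give the flavour of why a visually invisible change can flip a decision (everything below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10_000                                  # number of "pixels"
x = rng.random(d)                           # a toy image with values in [0, 1]
w = rng.standard_normal(d) / np.sqrt(d)     # a toy linear classifier: label = sign(<w, x>)
if w @ x < 0:
    w = -w                                  # make sure x starts on the positive side

eps = 0.02                                  # change each pixel by at most 2% of its range
x_adv = x - eps * np.sign(w)                # move every pixel slightly against the classifier

print("original score:  ", w @ x)                        # > 0: original class
print("perturbed score: ", w @ x_adv)                     # typically < 0: the class flips
print("max pixel change:", np.max(np.abs(x_adv - x)))     # 0.02, visually invisible
```

The point is that in very high dimension, many tiny coordinated changes can add up to a large change of the decision score, even though each pixel barely moves.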
So understanding is now a fundamental question if we want robust algorithms. What does that mean? I explained that understanding how a system generalizes is essentially about understanding the nature of the regularity of the problem, so that from few examples you can interpolate. So in some sense, these algorithms have found some source of regularity in all these problems that we don't understand so well. But there are some elements we can see. First of all, as we move across the layers of a neural network, because each of the neurons aggregates information which, because of the architecture, is in fact very local, you can see that as you go deeper and deeper you progressively aggregate information over a larger and larger domain of the image. In other words, the depth here seems to correspond to some form of scale, to looking at the phenomenon at different scales, and that is going to relate the problem to the wavelets that were mentioned at the beginning. Second thing: what kind of prior information do we have in very high dimension? In physics, we know that a key piece of information is the symmetry of the system and of the environment; in fact, the whole of particle physics is based on finding the groups of symmetries that define the different types of particles. Notions of symmetry are going to be very important. Last thing: when you think of patterns, these are structures which are well defined, and to understand how these patterns emerge in such networks, we'll see that there is a phenomenon of concentration; I'll mention these ideas briefly. Okay, so how can you approach such a problem if you do research in this area? One way is to take this very complicated structure, simplify it, and see whether your results are still good or not. If they are bad, that means you've left something out, and then you have to refine your model. So that's the way we're going to proceed: simplify the network so that it can be understood mathematically, and then see whether you begin to understand, or whether something you haven't understood appears. Okay, so let me first justify why scale makes sense. Take a system in physics with many variables which interact; think of your variables, they could be pixels, particles or agents. They interact, but some interactions are stronger than others: particles which are close interact much more strongly; you interact much more with your family than with people you don't know in another country. So one way to regroup these interactions is to first look at the very strong interactions; then, for the particles which are a bit farther away, you can look at their global impact, for example of your friends on yourself, or of these particles in electrostatics, by looking at an equivalent potential; and the particles which are even farther away you can regroup further: you will regroup, say, everyone in China and look at the influence of China on Italy, which may influence your local life. Now, what's the advantage of these regroupings? You don't have to look at the interactions of billions of particles on yourself, but just at the groups, and the number of groups goes from d to log d. So boom, suddenly you've reduced the number of interactions. That's a very important principle that you find all over physics, and that's where multiscale wavelets come in. Now how do you do that? Let me take an image.
One way to separate the different scales is first to look at the local averages, which form a kind of blurred version of your image, and to separate them from the details. The details are in some sense the edges. And how do you extract the details? Instead of using a sine wave, you use a local wavelet, which computes the local variations of the image, and you see the edges appear: large positive and negative coefficients along the edges of the boat. Now, at the next scale you take this coarser image and decompose it again into an even coarser image plus details, and then at the next scale you decompose it again with its details, and so on; that's the idea. Instead of decomposing on large sine waves, which are delocalized, you look at local variations with local wavelets, at different directions and different phases. Now, why would that be useful for classification? That's the question being asked. Before going to that, let me look at the second type of prior information: symmetries. So what is a symmetry? A symmetry is a transformation that you do on your data which does not change the answer of the problem; the answer, you'll say, is invariant. Very simple example: take a square and rotate it by 90 degrees, it's the same; the square has a symmetry under 90 degree rotation. A circle is different: its symmetry group, the set of all its symmetries, corresponds to all possible rotations. Now, the theory of symmetry turned out to be completely fundamental in math but also in physics, by observing that these symmetries, these transformations, define a structure which is called a group, because if you combine two symmetries which don't change the answer, the composition of the two symmetries is not going to change the answer either. So composition preserves symmetries, and that's where the group structure comes in. Okay, why is that useful? What kind of prior do you have on the symmetries of the problem? Let's look at images. Suppose you have an image and you want to recognize a digit. If you take a three and you move the three, as I said, it's still going to be a three: the problem is invariant to translation, any translation preserves the answer. Not only that: take the three and deform it a little bit, it's still a three; a five, deformed a little bit, is still a five. So you have a much larger group of symmetries, which is the set of all small diffeomorphisms, small deformations. If you look at this video, you can see that by moving along the group of diffeomorphisms you can travel like that across the whole of European painting. As long as the deformation is small, you essentially have the same painting; when it gets large, you go to another painting. Finding the symmetries is a very powerful way to understand and reduce the dimensionality of the problem. Okay, so let's try to simplify the problem. We know that the problem is invariant to translation, so we are going to get convolutions, but we don't know what the filters should be, and in these networks you learn the filters. So let me simplify and say, okay, it's an issue of scale, so I'm going to impose that the filters are all wavelets, and see what it gives. It makes sense, because the wavelets are going to separate the different orientations, the different scales and so on, so it does make sense to put wavelets in there. Now I no longer learn my millions of parameters: the only thing I'm going to learn is the last layer, which is here, and which has many fewer parameters.
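A minimal sketch of the recursive "local average plus details" decomposition described above, here a Haar-like decomposition of a short 1-D signal (the real construction uses oriented 2-D wavelets across scales, orientations and phases):

```python
import numpy as np

def haar_step(x):
    """Split a signal into a coarser local average and the local details (Haar-like)."""
    even, odd = x[0::2], x[1::2]
    average = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return average, detail

def multiscale(x, levels):
    """Recursively decompose: details at each scale, plus the final coarse average."""
    details = []
    for _ in range(levels):
        x, d = haar_step(x)
        details.append(d)
    return x, details

signal = np.array([4., 4., 4., 4., 8., 8., 0., 0.])
coarse, details = multiscale(signal, levels=3)
print("coarse average:", coarse)          # one number summarizing the whole signal
print("details per scale:", details)      # large coefficients only where the signal varies

# This decomposition is an orthogonal change of basis: no information is lost,
# it is only reorganized by scale, like the blurred image plus edge details in the slide.
```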
So only the last layer, which does the last linear combination, is going to be learned from the training data, and I'm going to apply that to physics. Okay, so let me speak very briefly about quantum chemistry. That is what is called an n-body problem: you have n particles which are interacting. What do you know about the problem? You know the positions and the charges, in other words the types of atoms you have in your molecule. What would you like to compute? You would like to compute the energy f of x of the molecule. Why? Because if you know the energy, you will know whether the molecule is stable or not, so you'll know what type of chemistry can or cannot happen. To compute the energy of a molecule from the individual atoms, this is quantum mechanics: you have to solve the quantum n-body problem. But you have some prior information. If you have a molecule and you move it, like my telephone, the energy is not going to change; if you rotate the molecule, the energy is not going to change; if you deform the molecule just a little bit, the energy is not going to change too much. So in some sense you have a problem which is not so far from image processing. The other thing is that you know that the interaction between atoms which are very close is very strong, these are the chemical bonds, but you also know that you have far away interactions, through the van der Waals forces. So what people do when trying to compute this energy is to solve the n-body problem, compute the electronic density, which is the probability of finding an electron at any position, and from this quantum calculation compute the energy. We're going to do it in a totally different way, and that's the work of a group of postdocs and students, Michael Eickenberg, Georgios Exarchakis, Matthew Hirn, Nicolas Poilvert and Louis Thiry. So let's begin with x, which just gives a Dirac wherever I know I have an atom, with the number of electrons it has. In a stupid way, I'm just going to run that through this neural network with wavelets. What does it produce? It produces these kinds of images, which are basically the interferences between the different electrons and the different atoms. From that I'm going to build descriptors, like a neural network, which are invariant to translation and rotation, and then in my last layer I'm just going to do a linear regression, which amounts to learning the last layer. If you do that, you don't learn anything here, and you get about the same results as what a deep network would get by learning everything. That means that in this case, because I understand the physics, I have some prior on the problem, I don't need to learn all this and I can get very good results. Now, these problems, on the other hand, are not so complicated, because it happens that all this has been learned on a database where the molecules are not so big, only 29 atoms in each molecule, because right now people don't have large databases with large molecules. Let me move to a more complicated problem: apply the same strategy to image classification. If you look at a problem like recognizing digits, which is again quite simple, it's the same thing: you don't need to learn the whole thing, imposing wavelets does as well as a deep network.
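A sketch of where the learning sits in that scheme. The descriptor below is a crude, hypothetical stand-in, a charge-weighted histogram of pairwise distances: like the real wavelet scattering coefficients it is invariant to translating and rotating the molecule, but it is not the actual construction. The only trained part is the linear (ridge) regression of the last layer, and the data are invented just so the code runs:

```python
import numpy as np

def invariant_descriptors(positions, charges, bins=16, r_max=8.0):
    """Crude stand-in for the wavelet scattering descriptors: a charge-weighted
    histogram of pairwise distances, invariant to translations and rotations."""
    phi = np.zeros(bins)
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(positions[i] - positions[j])
            k = min(int(bins * r / r_max), bins - 1)
            phi[k] += charges[i] * charges[j]
    return phi

def fit_last_layer(Phi, energies, reg=1e-6):
    """The only learned part: ridge regression from fixed descriptors to energies."""
    A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ energies)

# Fake data, for shapes only: 50 random "molecules" of 5 atoms each.
rng = np.random.default_rng(0)
molecules = [(rng.random((5, 3)) * 4.0, rng.integers(1, 9, size=5)) for _ in range(50)]
energies = rng.standard_normal(50)       # invented targets, just to run the code

Phi = np.stack([invariant_descriptors(p, q) for p, q in molecules])
w = fit_last_layer(Phi, energies)
print("predicted energy of molecule 0:", Phi[0] @ w)
```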
Now, if you move to more complicated images, like recognizing cars, dogs and so on, and you use the same strategy, then your network with only wavelets will give, if you do the experiment, 23 percent error, whereas a deep network which learns has an error which is three times smaller. If you use even larger and more complex images, there is almost a factor of five between what you get by not learning the weights and what you get by learning the deep network. What does that say? It seems that yes, when the problem is more complicated than simple translations and deformations, as in a digit, you need to learn. You need, in some sense, to learn the nature of the symmetries, or the patterns, and the mathematical question is again: what is the mathematical nature of what is being learned? I'm going to finish by just giving you ideas of the directions in which we and others are doing research. When you have a problem like that, it's very hard to attack it purely from a mathematical point of view. You need to go back to the numerics and look at what's happening, what you have missed, what is the thing that you didn't capture. Now, there have been many experiments on these networks. One very interesting set of experiments, carried out by a number of authors, is to look at the evolution of the results when you increase the number of layers. So if you only have one layer, you take your first layer and then you do a direct linear projection; here I have a classification problem with many classes, in the examples I showed from 10 to 1000 classes. The last layer is going to project this representation onto a space whose dimension is essentially the number of classes, and you look in that space at where your examples regroup: class one appears in that zone of my space, class two in that second zone, class three here, and so on. Under what condition will you have zero error when you do a linear classification? If these different classes don't overlap, in other words if your network has been able to separate the different components. If you do that with one layer, you essentially get a decent result, but there are errors due to these overlaps. When you add a second layer, what is observed is that the classes begin to separate and concentrate; when you add another layer, you see more separation, more concentration. So what's happening is that these nonlinear operators are progressively separating and concentrating across the different layers. The question comes back: what is the mechanism of this kind of transport, if you view it as a transport of probability distributions which separates these distributions, and what did we do wrong with these wavelets? What the wavelets are essentially doing is, each time, to take an image and separate the different components, scale, orientation, phase; they separate and separate, but at no point do you really reduce the variability of each of the classes. How can you do that? By doing exactly what people do in neural networks: they don't just recombine the values within an image, they also recombine the values across the channels. Now, what are these channels? That's where these things are very sophisticated, because these channels correspond to the different scales, different angles, different phases, and the network recombines these elements to reduce the variability.
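A minimal sketch of what "recombining across channels" could look like, anticipating the sparsity-and-threshold idea described next. The mixing matrix F below is a random placeholder, not the learned operator of the actual architecture:

```python
import numpy as np

def soft_threshold(a, t):
    """Sparsifying nonlinearity: shrink small coefficients to zero, keep large ones."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def channel_mixing_layer(coeffs, F, t):
    """Recombine wavelet channels (scales, orientations, phases) with a matrix F,
    then threshold. coeffs: (n_channels, n_positions); F: (n_out, n_channels).
    In the real network, F and the threshold t are what would be learned."""
    return soft_threshold(F @ coeffs, t)

rng = np.random.default_rng(0)
n_channels, n_positions = 12, 64              # e.g. 12 wavelet channels over 64 positions
coeffs = rng.standard_normal((n_channels, n_positions))
F = rng.standard_normal((8, n_channels)) / np.sqrt(n_channels)   # placeholder, not learned here

out = channel_mixing_layer(coeffs, F, t=1.0)
print("fraction of coefficients set to zero:", np.mean(out == 0.0))
```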
So you can do that kind of thing with a technique that was developed in the past, which is essentially to find a representation which builds a sparse representation of these different elements, and to apply a threshold, and that is the kind of thing that we did. Then you can develop some mathematics, which partly exists or which has to be developed, and in our case we did that experimentally, by trying to concentrate the variability within each of the classes. Now, that's the kind of result you obtain: basically you use wavelets, and then you learn, essentially, the features that allow you to re-concentrate across the channels. The previous approach had an error of about 23 percent on the small images I showed, and 50 percent on the big images with 1000 classes, whereas the state of the art is about five times smaller. If you just add this concentration, which essentially learns features and thresholds, boom, the error is decreased by a factor of five. In other words, you reach the state of the art that residual or deep networks get. So what happened? It means that you don't need to learn the spatial filters, which drive you across scales; they are essentially captured by wavelets. What you need to learn are the interactions across the channels, which essentially define the very nonlinear geometry of the problem, and what the properties of this geometry are, we don't understand. That is for another day. So that's a way to illustrate a little bit how such research can proceed, and it's the work of two PhD students, John Zarka and Florentin Guth. Let me finish with just one remark on the field in general. This domain right now is essentially being developed, as I said initially, in a very experimental and empirical way, and the successes are very impressive, but it essentially behaves as a black box, and we need explanations. We need explanations because, when it's used and it makes errors, you need to understand the nature of the errors. You need to understand whether an error is, for example, due to bias. When you train your network, you are going to choose a training database, and if the training database is biased, for example because you don't have people of color in your database, it will make errors on them. I take this example because it created a scandal with an algorithm that had been developed by Google a few years ago. It can have biases of all kinds if you don't control your training data well, but you can also have instabilities. Now, can we reduce such a complex system to a simple explanation, in the sense of a system that would analyze how it works? You can view this algorithm as some kind of unconscious intelligence, and in fact it is being used as a model, or people try to see the relation with models of physiological perception, and there is a bridge between neurophysiology and this kind of architecture. Now, if you try to understand how the unconscious algorithm works, it's in some sense equivalent to defining a kind of consciousness for this system. The difficulty is that some problems are too complicated to be reduced to a simple explanation. Take for example the question of whether the weather is going to be good, let's say whether it's going to rain or not in Australia in two weeks. The reason why it's going to rain or not in a certain region of Australia is not due to one or two or three reasons; it's the interaction of billions of variables that defines the evolution of this turbulent system.
So the first difficulty is that explaining is sometimes not possible in terms of one or two reasons. On the other hand, if you don't explain: take for example a system, and that's another example that has appeared, which is used by a jury in order to decide whether you are going to let a prisoner out of prison or not. If the system gives you an indicator saying, well, you should not let him out because there is a high probability of recidivism, you have to justify it, and that has happened legally. In that kind of situation the difficulty is that the system is just going to give an indication to the committee, but it's risky for the committee to go against the advice of the system, and it has been observed that whenever these systems were used, the committee had a tendency to follow the indication. Now, in that kind of situation, one of the prisoners then appealed the decision, so they came back to the company that had developed the system and said, okay, can you explain why? And the answer was, no, the neural net was trained and got that result. And that's not acceptable. So you see that there are many situations in which it's not possible to just say "I don't understand". There are two ways to go about it. One way is to try to put an outside wrapper which tries to analyze and certify the efficiency of the system; the second way is really to open the system and understand the underlying mechanisms. These two questions are now open, and both require mathematics. So in conclusion, I think the first important thing is to observe that yes, these systems are quite amazing; the applications are spectacular, and essentially nobody had predicted such a fast advance of computer vision and speech recognition. If, in the years around 2010 or before, I had been asked whether I believed that a system would be better than the human visual system for recognizing faces, or could do a decent translation of a text, I would have said no, not within the next 50 years, and things have accelerated very fast. Now, from a math point of view this is very interesting, because you see that these problems are generic, they are very similar in physics, images, speech and so on, so somewhere there is some mathematics to understand: what are the functions that are being learned, what is the notion of complexity, and essentially we don't understand. So, as I said initially, we are in a somewhat empirical phase, and this is very important in science; it's rare that the theory is leading. In most sciences you usually have an experimental phase, you discover phenomena, but now we also need to go back to a somewhat more rationalist phase where we analyze what's happening, otherwise it's very difficult to use these things for robust and reliable applications. And that's where I'll stop. Thanks very much.

Thank you very much, Professor Mallat. I think this was an exemplary colloquium, extremely clear, extremely interesting for a very wide audience; I've learned a lot. So maybe we can spend some time on questions. I remind everyone that if you have questions, there is a special chat where you can ask them, and I will start reading the ones that are already there. There is a question by Stefano who asks: in the quantum chemistry example that you gave, does it respect size extensivity, that is, the energy of two distant fragments is the sum of the energies of the two isolated fragments?

Yeah, that's very interesting. The answer is yes, in the way the aggregation is done, and in
fact this extensivity is one of the sources of the somewhat simple structure of the problem. Two isolated fragments are going to have an energy which is the sum if, as you said, they don't interact; but in a molecule they do interact, in particular through electrostatic terms. So the extensivity has to be there if the two elements are totally isolated, but in general within the molecule you don't have isolated elements, or let's say not totally isolated elements. But still, you are right, that kind of property has to be checked, and it is there in that kind of network.

Thank you. Another question, from Rosario: is the way the separation occurs with the number of layers a way to understand the complexity of the problem under study?

So, the structure across layers is a very strong organization, which is indeed a way to think about the problem if the axis across layers indeed corresponds to a scale axis. It hasn't been proved for all networks: when you have a network with 300 layers, it may be that some other phenomenon happens. But if that's the case, and in our case we impose it, then you are reduced to a simpler problem where, as you were saying, the complexity is reduced to the interactions across the channels, and then it can help. That's why we try to follow that path: to simplify it. But when we speak of complexity in these problems, it's usually a way to specify the size of the class that can be approximated by these networks, and in statistical physics the notion of complexity is usually related to the notion of entropy. In this case nobody knows how to relate the flexibility, or the size of the class, to the architecture and to the number of parameters, and in fact what has been observed is that sometimes it's good to increase the number of parameters a lot, which is not going to increase the class but is going to make the optimization easier. So you have this mixture with optimization which makes the analysis of the problem even more complicated.

Thank you very much. Another question, by Luca, and actually another Luca has asked essentially the same question. Luca starts by stating that it's super cool work, which I back up. The question is: does the wavelet model that you presented match state-of-the-art networks in terms of vulnerability to the adversarial attacks that you mentioned?

So, that's what we are checking now. The question is indeed: if you have more prior and you fix some of the filters and learn some others, are you going to have the same vulnerability? A priori I expect that we will, but we'll check. However, we now know where the vulnerability is, because in the way we structure the network we impose that all the operators are orthogonal, or so-called tight frames, so they are well constrained, they are stable; but in between there is something called batch normalization, which is essentially a rescaling of the coefficients, and this rescaling can lead to instability. So we now know where to isolate the potential instabilities, and we're going to try to study that, but it doesn't mean that we have eliminated the instabilities. But that's again a very good question.

Related to that, I have a question myself which is connected. The example that you showed, was the perturbation specific to a particular class or image? It was designed to perturb a specific image. My question is: are there ways to design kind of universal, in some way, perturbations that would perturb a large number of
different images, or is it always very specific to a given example?

I think that some people have found somewhat generic ways to build somewhat generic perturbations; I'm not absolutely sure. But if you define a preset family of perturbations, you can always build a network that is going to be immune to it, because you can include it in the training examples and then it's going to work. So what you can do is try to define perturbations that work for a very large class, but you will never be guaranteed that they will work for any class. However, the question is interesting because it's about understanding whether there is something intrinsic about this perturbation, and that would tell us something about intrinsic properties of the network as well. And that, I don't think anybody knows.

Thank you. A question by Eduardo: have you tried to apply this approach, and I believe he means these wavelet-based neural networks, to problems with outliers? Does classification also work with test images that are very much unlike most of the others?

So, there is one thing: if the image is totally different from the examples, there is a basic hypothesis in machine learning, which is that you have a class and you have a random sampling of this class. So if you have outliers that have zero probability of appearing in the training set, or anything similar, it's going to fail, that's for sure. For example, if you have dogs and all the dogs you show are big dogs, and suddenly a chihuahua appears, it's very unlikely that the network will work. So these notions of outliers are very difficult to define precisely, when it's going to work and when it's not going to work; at this stage, as long as we don't understand the nature of the probability distributions that the network can discriminate and analyze, it's hard to know when an outlier is going to be impossible to handle or not.

A question by Guido, who is interested in the concept of explainability: could one define the class of explainable functions more precisely, and if yes, could one find an optimal explainable function that matches a black box?

So, a class of explainable functions would be, if I understand, a class of functions which are parameterized by many fewer parameters, and in some sense, if you are indeed able to have many fewer parameters, then you should use this model to do the classification, and the fact that we need so many parameters means that probably the problem cannot be reduced. I'm back exactly to that example I gave: there are some problems which are intrinsically high dimensional, and then the notion of explainability will rather be, like in physics, finding the core symmetries of the problem and so on, which is very different from predicting the trajectory of a chaotic system. So it depends. I have a tendency to think that what we can do with an explainable system is to approximate the solution, but not get it exactly, with a smaller number of variables.

Okay, which is related to what we were saying: complicated questions sometimes require complicated answers. Yeah. The question by Alessandro: is there a way to initialize a classical convolutional neural network with a wavelet structure and then to try to fine-tune the weights, and if so, would gradient descent converge to a close enough solution? So can it serve as an initialization?

So, I had a student who tried that kind of thing. It didn't work, and it didn't work because in some sense it was rather trapped in a local
minimum, or it improved within this region, and it didn't get something as good as what was obtained with the usual initialization. The way these networks are initialized is with random weights, and the way people view these optimization problems now is by looking at the evolution of the weights: the random weights you can treat as a probability distribution, and as the gradient descent evolves, you have a kind of evolution of your probability distribution. To begin with random weights is a way to have no prior and to have a better chance to explore the whole system; if you begin with a system which is not very performant and is trapped in a region, there is a good chance you won't get out, and at least that's what we observe. So we hoped the answer would be yes, but in the end the answer is rather no, it doesn't help.

Yes, Atish? Okay, I have a, maybe it's a philosophical question, but one place where one can go from high dimensionality to low dimensionality in physics is not just symmetries but the renormalization group, where you go to a small number of relevant parameters even though your original parameter space is almost infinite dimensional; in the end, at low energies, you can describe everything with a few parameters. And in a way we, as learning systems, have figured out the small number of parameters by which to describe, you know, the standard model, to describe the world, all the phenomena. So is there some kind of notion of this kind of renormalization group in this?

So yeah, that's also very interesting. In fact we are working on this bridge with the renormalization group, with a physicist at the École Normale Supérieure, Giulio Biroli. When you look at the evolution of the phenomena across scales, you change your grid, you go from a fine grid to a coarser grid, a coarser grid, a coarser grid, and there is a flavor of renormalization group. That idea has been around for quite a while; people were thinking, okay, what's the relation, is there a relation with the renormalization group? In the past, in the 90s, there was work showing, or trying to show, the relation between wavelet decompositions and Kadanoff-type or Wilson renormalization group, even in the Wilson version. So yes, there is, and the idea is that in the renormalization group you progressively look at the evolution of your Hamiltonian at different scales, and you look at the interaction terms, in the Wilson case between the different frequency bands, and that kind of machine seems to do that kind of thing. But when you say that at the end, in the renormalization group, we end up with very few parameters, this is true for characterizing just the critical point, because at the critical point you have a homogeneous system, you have very few different types of singularities; but when you are off the critical point, the physical system can have millions, billions of parameters that define its state. At the critical point everything is self-similar, in some sense simpler; if you think about it, these problems are off the critical point, so there is no reason why we should end up, even with a multi-scale technique, back at very few parameters. But I agree that somewhere there are similar ideas. At the same time, if you go back to renormalization, calculations have only been carried out in very simple examples like Ising or phi-four models, which are infinitely simpler than these machines. So for me these kinds of machines are indeed a hope for
physics, to try to attack with renormalization group techniques examples which are more complicated. So there is a very fruitful interaction; it's a line of research currently, and I know a number of groups who are also trying to look at that.

Okay, thank you, Atish. I'll select some more questions; there are tons, so it will be difficult to ask all of them. I think this is a generic question that is often asked in this field, so I will state it: deep networks seem to work better when increasing the number of hidden layers, but doesn't increasing the number of layers at the same time increase the chances of overfitting?

Okay, so increasing the number of layers increases the number of parameters, and any reasonable statistical thinking would tell you that you are then going to get overfitting, unless you have some kind of regularization which allows you to avoid it. What people observe is that, in the way these algorithms are optimized, in particular through stochastic gradient descent, there is an indirect regularization produced by the algorithm, so that, despite the fact that you increase the number of parameters, it doesn't go to crazy, irregular solutions; it remains in a set of regular, or effective, solutions that generalize. Now, this is another one of the types of questions that we don't understand. What over-parameterization also seems to do is to convexify the problem, and that's another thing people don't understand. It looks like there are examples now that have been developed where one can understand this phenomenon, but it does look like, when you increase the number of parameters, the problem gets a little bit more convex, the local minima are not as bad or not as frequent, so the optimization is improved. That's a kind of statistical physics question.

If I may add a comment which is related: it appears also that this over-parameterization seems not to be harmful even without explicit regularization of the neural network, and there are people arguing that there is a kind of implicit regularization also connected to the dynamics of learning, to the fact that we're using stochastic gradient descent. So is this also studied in the context of these neural networks that you presented?

No. In some sense you can brutally divide this domain in two parts: on one hand you have the optimization, which says, given an architecture, try to find the best parameters; on the other hand you have what you call the approximation problem, which says, if I have an ideal optimizer, what should the architecture be in order to get a good approximation? And it's the latter that I looked at, and I'm using somewhat standard optimization tools. This being said, the interaction of the two problems is very strong, and working at the interface of the two is something particularly important when you begin to look at very deep networks, because, as I said, going to very deep networks is probably due to the mix between the optimization and the architecture.

Thank you. Okay, I'll take two or maybe three more questions. A question by Aldo: you mentioned the importance of continuity for learning many functions, that is, the property that the target function changes only a little bit when the variations of the inputs are small. In the quantum chemistry example you gave, this might not be true: the energy can change
Thank you. Okay, I'll take two or three more questions. A question by Aldo, who says: you mentioned the importance of continuity for learning many functions, that is, the property that the target function changes only a little when the variations of the inputs are small. In the quantum chemistry example you gave, this might not be true: the energy can change discontinuously during a chemical reaction. Can a wavelet network learn that kind of structure? Is there a specific architecture that can?

Okay, so you are right, that's absolutely true in quantum chemistry, in fact, when you have symmetry-breaking phenomena, and it's also true in classification: you can suddenly, because of a little structure that appears, realize that in fact you're wrong, that it is not that kind of animal but another kind. So there is a discontinuity: when you go from one class to the other, you have a frontier, at one point you cross the frontier, and you have a kind of discontinuity phenomenon. The existence of these discontinuity phenomena is probably one of the reasons why these systems have instabilities; you have to allow potentially strong transitions. So you're right, one shouldn't say that the solution has to move regularly for any perturbation. But there are some types of perturbation where, when the input changes slowly, the solution essentially moves slowly, and in particular when you look at deformations, diffeomorphisms: when you slightly deform the input, besides pathological examples, which, you're right, can happen in the case of quantum chemistry, things go well. Now, if the problem does not satisfy stability to deformations, the kind of networks we have will fail. In other words, they will fail in quantum chemistry when you have a symmetry breaking caused by a small perturbation, and then you need to reconfigure them. So that's another question which shows these problems are very subtle.

Yes, a question by Roman, who is interested in dynamics as well, I believe: can the wavelet framework be used with recurrent neural networks?

We haven't tried. However, in a lot of the literature, recurrent networks are regularly replaced, even for audio, by multi-scale networks; look at audio, for example, the example that I showed of audio source separation. Why were recurrent networks initially pushed so much? Because they are a way to have very long-range filters by using a recurrent equation. On the other hand, you can do that also by cascading multi-scale filters. So the short answer is: I don't know if it can be used; you can implement wavelets with recursive filters, but it has never been done in that framework. The second level of answer is that I'm not quite sure at this point what the domains are where recurrent networks do better than convolutional networks, but that's a question which reflects my own ignorance.
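As a small illustration of the point about long-range filters, here is a sketch, not code from the talk: a cascade of short filters whose dilation doubles at each stage reaches an effective length that grows exponentially with depth, which is one way cascaded multi-scale convolutions emulate the long memory of a recurrent equation. The 3-tap averaging kernel and the four stages are arbitrary assumptions.

```python
import numpy as np

def dilate(kernel, dilation):
    """Insert (dilation - 1) zeros between the taps of a 1-D kernel."""
    out = np.zeros((len(kernel) - 1) * dilation + 1)
    out[::dilation] = kernel
    return out

# Impulse input: track how far a single sample's influence spreads.
x = np.zeros(128)
x[64] = 1.0

kernel = np.ones(3) / 3.0            # short 3-tap local filter
y = x
for d in (1, 2, 4, 8):               # dilation doubles at each stage
    y = np.convolve(y, dilate(kernel, d), mode="same")

# Support of the cascaded response = effective (long-range) filter length.
print("effective filter length:", int(np.sum(np.abs(y) > 1e-12)))   # 31 taps
# An undilated cascade of four 3-tap filters would cover only 9 taps.
```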
Thanks. Thank you. So I will end on a final, high-level question, which asks your point of view on the following: there is a tremendous gap between our understanding of machine learning and the applications of it. Do you believe that this gap will continue to increase indefinitely, or are we going in the other direction, with mathematics slowly but surely filling this gap?

Okay, I have a tendency to think that the gap is increasing and will increase for a while, and there is a simple reason for that: look at the number of mathematicians working in this domain and the number of engineers; there is probably a factor of 1000 between the two. You have engineers and very bright computer scientists and researchers at Google, at Facebook, in all universities and so on, whereas there are much fewer mathematicians working on it. That's one level of answer. There is a second level of answer. Take a very simple problem, the generation of turbulent images, images which look turbulent; people know how to do that. Well, if you look at turbulence since the 1940s, when Kolmogorov got out his first papers, the math hasn't evolved much; it is still a completely un-understood problem. So we shouldn't underestimate the difficulty of these problems. It's a domain where there is a lot of research pressure, a lot of people are working on it, so very often you have announcements that someone got the ultimate solution to understand deep networks. I have a tendency to believe that it will take time before that happens, again because these systems seem to be able to perform well on very complex physical problems. Take the n-body problem in physics: we know these are very difficult, very bright minds have been working on them, and a solution is not going to come up just like that. So it's going to take time. Yes, I believe that the gap is going to increase for a while, but that means it's a fantastic opportunity for doing math. I mean, if you are a PhD student, you'd better do math than computer science in this area, because there are a lot of people doing computer science and relatively few who do math and computer science. One thing, however: it's a domain where you cannot just do math; you have to do math and computations, because if you just do math and you don't know how to compute, you will have to make very strong hypotheses, and if your hypotheses are wrong, you will be off. So the way I view it is that it's an opportunity for mathematics nowadays.

Thank you very much. I hope this gives motivation and hope to all the young mathematicians and physicists who are watching, and they are numerous when I see the number of participants. Thank you, Professor Mallat. Atish, maybe... Thank you. And for all the ICTP students who would like to ask more questions: you normally have a link, and after a short pause of three to five minutes, Professor Mallat, you can gather in this virtual room and ask some more questions, but please restrict yourselves to 30 minutes; that would be important. Thank you, Professor Mallat, and see you one of these days, hopefully. Okay, see you everyone, thank you very much, bye bye. Atish? Okay, thank you.