... Paolo Budinich. The Italian Scuola is feminine, the Centro is masculine, so we are sister and brother. SISSA was created as a plan B: if the ICTP had not worked out, there was another boat for going on, and that was SISSA. That is the true story; we discovered it in the papers. [unintelligible] We went back to the papers of that period, the 70s, and we discovered the letter by Budinich mentioning SISSA as a sort of plan B. SISSA is international, but it is Italian; it is a name that you can interpret in both ways. The working language of the PhD School is English, and we receive possibly up to 70 students per year from everywhere; overall we have 300 PhD students. In this field we are making a special investment, because we got funding from the government for the next five years, and indeed two of them have already passed, so for the next three years, and we have already started hiring. [unintelligible] OK, go ahead. Our speaker is Marc Mézard. [unintelligible]

... a random ensemble of models on which you can start to do theory, on which you can start to do an analysis that goes beyond just saying: I have observed this database and I find this, I have observed that database and I find these other things. In order to go beyond that, you need to understand the mechanisms, and for that, getting the right ensemble of problems is absolutely crucial. So I will start with a few remarks. We have been working mostly, as everybody does, with the MNIST database of handwritten digits. Every image is a point in a space of dimension 28 squared, which is 784 dimensions, and we will have a look at how this space is built.
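As a quick illustration of the dimensions involved (not part of the talk), here is a minimal sketch in which a batch of 28x28 images is flattened into vectors of the 784-dimensional input space; the random array is a stand-in for the actual MNIST data.

```python
import numpy as np

# Stand-in for p MNIST-style images of 28x28 pixels; swap in the real data.
p = 1000
images = np.random.rand(p, 28, 28)

# Each image becomes one point in a 784-dimensional input space.
X = images.reshape(p, 28 * 28)
print(X.shape)  # (1000, 784)
```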
On the other hand, on the theoretical side, we will also go back to old models, old problems, old modelling, in order to contrast the two. That is, I want first to contrast what happens in a "real", quotation marks, database, if you accept that MNIST is a real database; it is at least a database on which people have spent a lot of effort. And what was the status of the theory? After contrasting these two, I will tell you how to generate a new ensemble that is closer to what we find in real databases.

In most of the theoretical work, especially the work done in the 80s, when there was a lot of activity on learning and generalization in neural networks, most of it was based on input patterns, the data that you feed at the entrance of a network in the learning stage or in the analysis stage, which are i.i.d.: every entry is an independent random number, say with a normal distribution, or binary, whatever. So basically the data set is a set of p patterns, each of them being an n-dimensional vector. Just a small warning: we have no way to make everybody happy with the notation. I am using the physicists' notation; for many decades physicists have decided that the size of the problem is n, and because of that, when we went into neural network studies, we decided that the size of the database was p. So we have p patterns in n dimensions; that is our database. Statisticians are used to having a database of n patterns in p dimensions. There is a way out, because one community tends to use capital letters and the other small letters, so you could say that small p equals capital N and small n equals capital P, if that helps you; but maybe I am confusing you, I am not sure, and anyway I cannot do better.

So I will present a few experiments that we have been doing with what we call vanilla teacher-student models, the ones that were used in the 80s: typically a two-layer neural network with an input layer, an output, and an intermediate hidden layer. The output in such a case depends on the input in the following way. Each neuron in the hidden layer, neuron number k, computes the scalar product of the input with a vector w_k; then one applies a nonlinear function, the transfer function of this layer; and then one takes a weighted sum of all the outputs of the hidden layer to obtain the output. This is the standard two-layer neural network, and we will use it with K hidden neurons. So the size of the input layer is n, that is the number of units, the size of the image if you want; K is the size of the intermediate layer; and we have one single output in this case.

We will use this network to study two binary tasks. One is a task in which, as input, we put an MNIST image, and we want to distinguish the images which are even digits from the images which are odd digits. This is the practical task, if you want: the realistic-database task. And then we have something which is more traditionally what has been done in the past, which is: let us study the same network on input data which is i.i.d.
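To fix notation before describing the teacher, here is a minimal NumPy sketch (not from the talk) of the two-layer network just defined: hidden neuron k computes w_k . x, a transfer function g is applied, and the output is the weighted sum with weights v_k. The choice g = tanh and the 1/sqrt(n) scaling are illustrative assumptions.

```python
import numpy as np

def two_layer(x, W, v, g=np.tanh):
    """Output of the two-layer network: sum_k v_k * g(w_k . x).

    x : input vector of dimension n
    W : (K, n) matrix whose rows are the hidden weight vectors w_k
    v : (K,) vector of hidden-to-output weights
    g : transfer function of the hidden layer
    """
    return v @ g(W @ x)

# Example: n = 784 input units, K = 8 hidden units.
n, K = 784, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((K, n)) / np.sqrt(n)
v = rng.standard_normal(K)
x = rng.standard_normal(n)
print(two_layer(x, W, v))
```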
As input, I take X to be a vector of Gaussian random numbers, and the desired output is given by another two-layer teacher network, which has the same kind of structure, with an intermediate layer of size m, and whose parameters ω_m and ν_m are the ones of the teacher. So you have a teacher, you feed it with some random input, it finds its output by applying this rule, and this is the set of outputs that you want to learn. This is the typical teacher-student model.

In this study we look at the training error, which is the difference between the task achieved by our network and the desired task. Here we take the square error, but it could also be other measures; the results do not depend much on the definition of the error. And then there is another quantity that we monitor: you take two different learning runs, that is, you go through the learning phase twice with different initial conditions. I should say a few words on the learning. For the learning we take the standard procedure of applying stochastic gradient descent on this database. The stochastic gradient descent starts from some initial condition on the weights W and V, and you can start the learning again from a different initial condition and see where you go. You then look at the difference between the two experiments started from different initial conditions; you can look at the square difference.

So let us look at the data. These results are obtained on the MNIST data, and the first thing you can look at is the blue points: this is the generalization error after convergence as a function of the size K. I remind you that K is the size of the intermediate layer, in some sense the size of your network. As a function of K, you find that the generalization error decreases: the larger, the broader your network, the better the result. So far so good. Now look at the orange points: they show the difference obtained with different initial conditions. What you find is that, as soon as K is large enough that you get a relatively good performance on MNIST, the difference between the two experiments is more or less of the order of, close to, the generalization error. It means that the two networks are as close as they can be. Let me make it very precise: the orange points are the difference between the two networks estimated on MNIST images, and the green points are the difference between the two networks estimated on random inputs. With random inputs you try to probe the global function that has been learned, and you see that in two trials you find global functions which are totally different: a random choice of output would put you at a distance of 0.5. So it means that the learning, started from two different initial conditions, converges to functions which are very similar when you look at them on the MNIST database, while as global functions they are totally different; they don't look the same at all.
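For the second task, here is a minimal sketch (not the authors' actual code) of the teacher-student experiment with i.i.d. Gaussian inputs: a fixed two-layer teacher labels the data, two student networks are trained by online SGD on the square error from different initial conditions, and the two are then compared on fresh Gaussian inputs. All sizes, the learning rate and the tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(X, W, v):
    # Two-layer network applied to a batch; rows of X are inputs.
    return np.tanh(X @ W.T) @ v

# Teacher: i.i.d. Gaussian inputs labelled by a fixed two-layer network with m hidden units.
n, m, p = 50, 4, 20000
W_teacher = rng.standard_normal((m, n)) / np.sqrt(n)
v_teacher = rng.standard_normal(m)
X = rng.standard_normal((p, n))
y = forward(X, W_teacher, v_teacher)

def train(K, seed, steps=100000, lr=0.01):
    # Online SGD on the square error, starting from a random initial condition.
    r = np.random.default_rng(seed)
    W = r.standard_normal((K, n)) / np.sqrt(n)
    v = r.standard_normal(K)
    for _ in range(steps):
        i = r.integers(p)
        h = np.tanh(W @ X[i])
        err = v @ h - y[i]
        grad_v = err * h
        grad_W = np.outer(err * v * (1.0 - h ** 2), X[i])
        v -= lr * grad_v
        W -= lr * grad_W
    return W, v

# Two runs of the student (K hidden units) with different initial conditions.
W1, v1 = train(K=8, seed=1)
W2, v2 = train(K=8, seed=2)

# Compare on fresh Gaussian inputs: generalization error and run-to-run difference.
X_test = rng.standard_normal((5000, n))
y_test = forward(X_test, W_teacher, v_teacher)
print("generalization error:", np.mean((forward(X_test, W1, v1) - y_test) ** 2))
print("difference between runs:", np.mean((forward(X_test, W1, v1) - forward(X_test, W2, v2)) ** 2))
```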
Now, if you do the same thing with our second task, i.i.d. data labelled by a teacher network, this is what you get. You find that the generalization error, the blue points, decreases with K. This was a case in which the teacher network, the one that generated the labels, had m = 4 hidden units. So you see that when our network has a number of hidden units larger than or equal to 4, it can make it: it just has to find the exact structure that generated the data, so it can go to zero error, and that is what is found experimentally. So the learning works well; these are the blue points, and the other blue points are covered by the green ones. The green points are the difference between two runs starting from different initial conditions, and here you see that for K larger than or equal to 4 you learn exactly the same global function, so you don't have this effect of two functions which differ only outside the space of images. This is a strong contrast between the two situations: the kind of realistic situation, and the one given by the teacher-student theory.

Another point worth mentioning is the learning dynamics. This is the generalization error versus the number of steps of learning, and this is the MNIST task: you see that it relaxes to some asymptote, with a relaxation that does not have a special shape. If you do it in the teacher-student case, you find instead that it relaxes to a plateau, stays in the plateau for a long time (this is a logarithmic scale), and then at some point decays to zero. This intermediate phase has been seen in the learning dynamics of these teacher-student networks; it reflects a state in which all the hidden units think that they can make it by themselves, in some sense: they all act as if this were a linearly separable problem and align with each other, and then at some point they start to cooperate and to take different cutting planes, so that you can separate the data.

So these are the experiments that we did to contrast a realistic database with the standard teacher-student setup, and I want to point out two facts that are clearly very different: two trials with different initial conditions learn the same function in the teacher-student case and learn different functions on MNIST, and the learning curve has a plateau only in the teacher-student case.

This being said, let us go back to the structure of the data, to try to see what is happening. If you look at the input space, it is a 784-dimensional space, and inside it you have a kind of manifold of handwritten digits. In MNIST, an arbitrary image with 784 pixels is not a digit; the space of the digits is a kind of hidden manifold, and it has been studied quite a lot. Of course it is a fuzzy manifold, because whether something is a digit or not involves some arbitrariness, but basically, by looking at the database, you can get an idea of the intrinsic dimension of this manifold. It is not a linear manifold; it is actually a much folded manifold in this 700-and-something-dimensional space. One way to estimate the dimension of the manifold is to look at nearest-neighbour distances: the number of points within a certain ball should grow like the radius to the power D, where D is the dimension of the manifold, so the nearest-neighbour distance should scale like N^(-1/D), where D is the dimension of the manifold of handwritten digits and N is the number of examples you are looking at.
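Here is a rough sketch (not from the talk) of this nearest-neighbour scaling argument turned into an estimator: compute the mean nearest-neighbour distance for sub-samples of increasing size N, fit the slope of log(distance) versus log(N), and read off D as minus the inverse slope. Real analyses (Grassberger-Procaccia and later estimators) are more careful; the sample sizes below and the sanity check on a linear 2-dimensional manifold are just illustrative.

```python
import numpy as np

def mean_nn_distance(X):
    # Mean distance from each point to its nearest neighbour (brute force).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)
    return np.mean(np.sqrt(np.maximum(d2.min(axis=1), 0.0)))

def intrinsic_dimension(X, sizes=(250, 500, 1000, 2000)):
    # Nearest-neighbour distance scales like N**(-1/D): fit the log-log slope.
    rng = np.random.default_rng(0)
    dists = [mean_nn_distance(X[rng.choice(len(X), N, replace=False)]) for N in sizes]
    slope = np.polyfit(np.log(sizes), np.log(dists), 1)[0]
    return -1.0 / slope

# Sanity check: points lying on a 2-dimensional linear manifold embedded in 784 dimensions.
rng = np.random.default_rng(1)
basis = rng.standard_normal((2, 784))
X = rng.standard_normal((3000, 2)) @ basis
print(intrinsic_dimension(X))  # should come out close to 2
```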
This has been studied quite a lot. The idea of looking at nearest-neighbour distances probably goes back to Grassberger and Procaccia, maybe earlier, I don't know, but I think they were working with it at the time on strange attractors, and many people have looked at it recently. It works relatively well, and what you see, for instance if you look at MNIST, is that the fit, which has been done several times and more carefully even recently, tells you that the effective dimension of the manifold of handwritten digits is around 15. So it means that, among all the inputs you could feed your system with, the only inputs that should matter are the ones in the hidden manifold of handwritten digits, this 15-dimensional subspace. If you get out of that, then in a nice world the network should be trained to say: this does not make sense, it is nowhere close to my hidden manifold, it is not a handwritten digit, I don't want to give you an answer. That is what should happen if you feed it with this or that.

Now there is another manifold structure which is interesting, which is the task. So far I was defining what I call the world, which is the space of the data. Within the space of handwritten digits I have 10 subspaces, one associated with each digit: there is a subspace of all the deformations of the 5, and again you can look at the dimension of this space. This was done for instance by Hein and Audibert in 2005; they found a dimension around 12, with very different ways of computing it, but the results are rather consistent. And you see that the dimension actually varies quite a lot between digits: the 1 is a simple 7-dimensional subspace and the 3 is a 13-dimensional subspace; that's the life of the one and of the three. So basically, if you think of this problem as geometry in a 784-dimensional space: within this space you have a manifold of dimension 15, within this manifold you have 10 sub-manifolds with dimensions varying, depending on the digit, between 7 and 13, and you want to distinguish them. That is one aspect of the geometrical structure of the data. Of course, phrased like that it seems very easy; the problem is that all of this is hidden, because what you have access to, a priori, is only the 784-dimensional space.

I should have said initially that all I have been presenting so far was done with Sebastian Goldt, Florent Krzakala and Lenka Zdeborová; two of them are here, so they can correct my mistakes. Together we did this analysis of the difference between MNIST and the standard teacher-student setup, in order to create a new ensemble of data that has features much closer to those of MNIST. We call it the hidden manifold model, and here is how it is described. Basically, we build the input space, the data space, the hidden manifold, as follows. We have a certain number of features F_r: R feature vectors in the input space; they are configurations of the initial image. Then we take a superposition of these features with certain coefficients c_r, and to this superposition we apply a nonlinear function f.
If this function f were linear, we would be constructing the data in a linear subspace of the initial space. That is kind of interesting but totally trivial in some sense, because all the dynamics orthogonal to this linear subspace is trivial: you can look at what happens by projecting inside the subspace, and you are back to the old teacher-student framework. What is much more interesting is to take the linear subspace generated by these features, take a linear superposition of the features, but then fold it by a nonlinear function. This is the generation of the data, the space of the data; now we have to define the task. In practice, we have been using coefficients and features with normal entries, and we tried several nonlinear functions, from an extreme nonlinearity where f(x) is a sign function to a rectified linear function; the phenomenology of what I show you does not depend much on the choice of the function itself.

Now here is the task. The data was defined as above, so a data point is really defined by what we call its latent representation, the set of coefficients that says where it is located in this intrinsic manifold. What we have looked at are tasks in which the desired output depends on the position in this latent manifold, that is, on the coefficients c_r. We have looked at various functions, and because we are not so imaginative, we have looked at functions of the type we have been using for many years in machine learning: either a perceptron-like function, or something which is a kind of two-layer function. But the crucial point is that, whether it is the first, simpler function, or the second, more elaborate one, they apply to the latent representation, that is, to the space of the dimension within the manifold, the one that you don't know. In the task of learning you know only the X and the desired output; you don't know the c_r, of course. So the whole problem of learning will be, in some sense, to unfold this manifold; once you have done this unfolding, you get a linear, perceptron-like problem that you can learn easily. That is the whole task of what one should do. This kind of geometric reasoning about what is taking place has been considered for instance by Stéphane Mallat, who is one of the people who has been advocating it for several years: trying to understand exactly how deep networks, from layer to layer, are unfolding and kind of rectifying the folded manifold structure. It has to be like that, because in the very last layer you do a linear separation, so it just means that you have unfolded the thing until you made it flat.

So this is the ensemble that we are generating. Sorry, one point that I should mention; it gets a bit technical. If you look at the task: you have the latent coefficients, you apply a linear function to them, then a nonlinear transfer function, and you take a linear weighting of all of them; that could be the desired task. This desired task, if you think of it, depends on the dot products w_m . c, where the w_m and c live in an R-dimensional space, the space of the latent representation. So if the number M here is less than R (sorry, this should be R), if M is less than R, then we have an invariance, in the sense that if we shift c in a direction which is transverse to all of the vectors w_m, the output is unchanged. This is what we call the perceptual sub-manifold.
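Putting the two pieces together, here is a minimal sketch (under my own conventions, not the authors' code) of the hidden manifold model: inputs are a folded superposition of R feature vectors, and the labels depend only on the latent coefficients through a small two-layer function of them. The 1/sqrt(R) normalization, f = tanh, and the sign labels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden manifold model: p inputs of dimension n built from an R-dimensional latent space.
n, R, p, M = 784, 15, 10000, 3
F = rng.standard_normal((R, n))      # R feature vectors F_r in the input space
C = rng.standard_normal((p, R))      # latent coefficients c_r of each data point
X = np.tanh(C @ F / np.sqrt(R))      # fold the linear superposition with a nonlinearity f

# The task is defined on the latent representation, not on X:
# a small two-layer function of the coefficients c gives the label.
W = rng.standard_normal((M, R))
v = rng.standard_normal(M)
y = np.sign(np.tanh(C @ W.T) @ v)

# A learner only ever sees (X, y); the coefficients C and the features F stay hidden.
# Shifting c along directions orthogonal to all the w_m leaves y unchanged
# (the R - M dimensional perceptual sub-manifold just mentioned).
print(X.shape, y.shape)
```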
It tells you that for one given output you have a degeneracy: you have an orbit, like the orbit of the digit 5 that I was talking about, and the dimension of this orbit is R minus M. It can be monitored by choosing various values of the size of the hidden manifold and of this number M.

So this is the experiment that we did for this problem. On the right I remind you of the MNIST results; here is learning and generalization as a function of K, and this is the generalization error. The blue is the standard generalization error; the orange is the difference between two runs of our neural network trained with stochastic gradient descent, in the case where the data is generated by the hidden manifold model. You see that when you compare the two runs on the structured data, on the data which is really within the hidden manifold, we get the same result as on MNIST: the two networks agree on this data. But if you compare the global functions, that is, you put in some random input and look at what comes out in one case and in the other, you find that they implement very distinct functions in the two cases. So this is the same phenomenology as we had with MNIST. Then we can look at the plateau: this was the relaxation for MNIST, and this is the relaxation that we find for our model with the hidden manifold. The famous plateau that existed (the blue here is the vanilla teacher-student model, with its plateau) has disappeared, and we have a standard relaxation, let's say much closer to what we get when we do the learning with MNIST.

So the conclusion is that this model does not have the pathologies of the teacher-student setup with i.i.d. data. I think it is probably a valid model on which to start working much more carefully: it seems to have a learning and generalization phenomenology quite close to the one that we get, at least on MNIST, and it is also a model that can be studied analytically. I would have dreamed of telling you about the analytical solution today, but my collaborators here and I are still working on it; we will come back at the next conference and tell you about the solution of this model. I think quite a lot can actually be done.

Let me, if I have five more minutes, make one comment about another aspect of this structured data, which is a computation that I did a couple of years ago about the Hopfield model. This is different; I will not tell you about learning and generalization, but I want to go back to this old model of the statistical physics of disordered systems, the Hopfield model, and try to see what happens if you plug some structured data into it. The Hopfield model, for those of you who don't know it, is a model of Ising spins; I will be very fast in the description. In this model what you want to do is memorize a certain number of patterns. In the standard Hopfield model you want to memorize i.i.d. random patterns; as usual, most of the science, or at least the statistical physics, was done with i.i.d. input. The idea is that you have an ensemble of spins interacting by pairs with coupling constants J_ij, and Hopfield proposed to use the Hebb rule, which is to study the case in which J_ij, the coupling matrix between the spins, is a superposition of the products ξ_i^μ ξ_j^μ over the various patterns. Then you can look at this problem: you can study the Boltzmann equilibrium with the partition function. This has been done.
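Written out (these are the standard definitions from the literature, not something specific to this talk), the Hebb rule and the equilibrium measure just mentioned read, for N spins s_i = ±1 and p patterns ξ^μ, with the usual 1/N normalization:

```latex
J_{ij} = \frac{1}{N}\sum_{\mu=1}^{p} \xi_i^{\mu}\,\xi_j^{\mu},
\qquad
E(\mathbf{s}) = -\frac{1}{2}\sum_{i \neq j} J_{ij}\, s_i s_j,
\qquad
Z = \sum_{\{s_i = \pm 1\}} e^{-\beta E(\mathbf{s})} .
```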
One thing specifically that I wanted to understand is the mean-field equations that describe this problem at equilibrium; these are called the TAP equations. The simple equations that you could write, these mean-field equations, are well known: they relate the local field h_i on a certain spin to the others. They are a mean-field approximation, but they are supposed to be exact in this case, because it is an infinite-range model. The ones written by TAP relate the local field h_i to the sum over neighbours of tanh(βh_k), the magnetization of neighbour number k, weighted by the coupling, and there is an Onsager reaction term; that is the standard TAP equation for the SK model. This reaction term takes into account the Onsager reaction: I can polarize j, which can in turn polarize i back, so you have to take that off, and that is what the subtracted term does. This is the TAP equation for the SK model, the case where the J_ki are i.i.d., without structure. Here we have a J_ki which has a structure: it is this superposition of patterns. That means the naive equation is wrong, because you have other paths of reaction: I can polarize j, which can polarize k, which can polarize i, because the couplings between ij, jk, ki are correlated, and you have to take these circuits into account.

The right way to do things is to disentangle the partition function by introducing auxiliary variables; it is just a simple Hubbard-Stratonovich transform. You can rewrite the Hopfield model as a set of spins s_i which interact with a set of auxiliary variables λ_μ; the λ_μ are to be interpreted as the projection of the system on pattern μ. So you have Gaussian variables λ_μ, you see here they have an intrinsic Gaussian measure, and the s_i and the λ_μ interact through the patterns. This is a nice representation, well known for many years. In this picture you have the spins, the green ones here, and you have the projections onto the patterns, the λ_μ, and you can run the whole machinery of mean-field equations on that: you write belief propagation, you use the fact that it is a fully connected model, you simplify the belief propagation, go to the TAP representation, and you get the TAP equations which are written here. They were written long ago, there have been a few debates about them, but they are perfectly correct, they can be used as an algorithm, and they work correctly.

Now look at the same problem with structured patterns, that is, patterns which are again superpositions of features. In order to have a tractable solution I will look at the very simple case in which I do not do the nonlinear folding, so I am really in a kind of baby model; I am sorry, it is much simpler than what I was describing before, it is a linear manifold. Let us look at what happens if I want to study the TAP equations in this case. The partition function again implies disentangling the spin variables from the auxiliary variables, which are the projections on the patterns, and again I can do a second Hubbard-Stratonovich transformation. Don't look at the formula; the point is that when I have structured patterns, the thing I have to do is to introduce a third set of auxiliary variables, which are related to the features, these basic objects from which I build the patterns: a pattern is a superposition of features, and from the patterns I build the couplings of the Hopfield model.
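For reference, these are the two standard ingredients alluded to above, written in their usual textbook form (quoted from the literature, not from the slides): the TAP equation for the SK model, naive mean field plus the Onsager reaction term, and the Hubbard-Stratonovich linearization that introduces the pattern overlaps λ_μ for the unstructured Hopfield model.

```latex
% TAP equations for the SK model, with magnetizations m_i = tanh(beta h_i):
h_i = \sum_{j} J_{ij}\,\tanh(\beta h_j)
      \;-\; \beta \tanh(\beta h_i) \sum_{j} J_{ij}^{2}\,\bigl(1 - \tanh^{2}(\beta h_j)\bigr)

% Hubbard-Stratonovich representation of the Hopfield partition function,
% with one Gaussian auxiliary variable lambda_mu per pattern:
Z \;\propto\; \sum_{\{s_i\}} \prod_{\mu=1}^{p} \int d\lambda_\mu\,
   \exp\!\Bigl( -\frac{\beta N}{2}\,\lambda_\mu^{2}
                \;+\; \beta\,\lambda_\mu \sum_{i} \xi_i^{\mu} s_i \Bigr)
```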
Basically, what happens is that I have a set of spin variables, I have the variables which are the projections on the patterns (is the network close to a pattern or not), and then I have this intermediate layer, which is the projection on the features. And it turns out that the TAP equations have to be written as message-passing equations on this graph: a graph with three layers, spins, features, patterns, which exchange messages. The TAP equations can be written; I did not write them there because they are a bit scary, but you can write them. The thing which I find quite interesting in all this is the very structure of what you get: the messages exchanged in the TAP equations go from this layer to the layer which is the finest structure, the structure of the features, and then to the structure of the patterns.

Of course, one thing that you can immediately do, at least if you are a physicist it becomes very natural, is to say: let us look at something a bit more sophisticated, in which a pattern is built as a superposition of features, but the feature itself is built as a superposition of other sub-features. If you do that, then you have to introduce another layer here, which will be the sub-features: you have the initial pattern, the main input, then the sub-features, which are the fine-grained structure, then the features, and finally which pattern you are close to or not. This is something which, to me, is very evocative of what is happening if you look at how a deep network works. Look at a deep network for face recognition, at what the units are doing in the various layers: you know very well that in the first layers a given neuron will be a kind of local edge detector, if you look at the kind of input to which it reacts most; then you go deeper, and you find some neurons which start to be sensitive to something that looks like an eye, or lips, or ears, or whatever; and only at the very end do you get the global structure that tells you: this is your grandmother. This is how it evolves as you go deeper and deeper into the structure, and of course it is related to the fact that when you want to understand a face, you are typically in a case in which the data has this kind of combinatorial structure: the face is composed of elements which are composed of sub-elements which are composed of sub-sub-elements.

So I think that this is one of the major challenges for us: to build and understand ensembles of problems in which, instead of having purely i.i.d. random data, or maybe Gaussian-correlated data, we have something which is much more structured, and the structure has to have this combinatorial character. If you think about it, all the elaborate tasks on which we are using deep networks, whether it is language analysis or image analysis and so on, have this same combinatorial structure; language typically also has a combinatorial structure. What we have done so far are really, in some sense, first steps in this direction: we want to have a good generative model that we can also study. There are many aspects to that. One aspect is the combinatorial structure; the other aspect, if I go back to the learning phase, is the hidden structure of the manifold, which becomes very complicated to find if the manifold is folded; finding a linear sub-manifold is one thing, but finding a folded sub-manifold is much more complicated.
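As a toy illustration of the hierarchical construction described above (patterns as superpositions of features, features as superpositions of sub-features), here is a minimal sketch; all the sizes, the sparsity, and the purely linear superpositions are simplifying assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three-level hierarchy in dimension n: sub-features -> features -> patterns.
n = 784
n_sub, n_feat, n_pat = 40, 15, 10

sub_features = rng.standard_normal((n_sub, n))

# Each feature is a (sparse, random) superposition of sub-features.
mask = rng.random((n_feat, n_sub)) < 0.2
feature_coeffs = rng.standard_normal((n_feat, n_sub)) * mask
features = feature_coeffs @ sub_features

# Each pattern is a superposition of features; these patterns would then be
# stored in the Hopfield couplings through the Hebb rule.
pattern_coeffs = rng.standard_normal((n_pat, n_feat))
patterns = pattern_coeffs @ features

print(patterns.shape)  # (10, 784)
```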
But I think that we are on the way in this direction, and that is, at least in my opinion, one of the important directions to go if we want to understand one day why deep networks are so powerful on practical databases. Thanks a lot.