Okay, can you hear me, both in the room and on Zoom? Okay, yes. So thank you for the invitation and thank you for being here. I'm very happy to present this work. It is joint work with my supervisor, Emmanuel Abbe, Jan Hązła and Christopher Marquis, all from EPFL. The title is already self-explanatory: I will explain how, in some settings, an initial alignment between the neural network architecture and the target function is needed to learn with gradient descent.

Deep learning successfully combines over-parameterized networks with descent algorithms. In the framework of trying to understand how these two elements are combined, we pose the following question: do we need some knowledge of the target function built into the neural network architecture in order to learn with gradient descent, or can gradient descent learn from an arbitrarily small initial alignment? The goal is to answer this question, but of course also to quantify both the initial alignment and the learning horizon and accuracy that we aim for.

Before diving into the definition of the initial alignment, let me clarify which framework we consider. Here is a spectrum of neural network architectures of increasing complexity. On the left side we have the most structured and regular architectures, for instance fully connected networks with i.i.d. initialization. Going to the right, we allow for more complexity, so we can include convolutional networks, ResNets, and so on. On the right side we have the other extreme, what we call the free neural network family, which includes any architecture with any structure; the only constraint is that it has to be implementable on a computer in polynomial time.

Some previous work considers the right extreme, these free networks, and asks which functions are learnable by poly-size neural networks. Closer to our question, it also asks how learnability depends on the initialization. It turns out that if we do not put any constraint on the structure of the network, then one can find a neural network and an initialization that is universal, that can learn any efficiently learnable function. This means that the neural network does not need an initial alignment with the target function to learn. A limitation of this setting is that there are classes of functions that are not learnable irrespective of the initialization, in particular if we train with large batch sizes.

So, to hope for a positive answer to the previous question, we need to put some structure on the architecture. In this work we consider the left extreme of this spectrum: regular neural networks, in particular fully connected networks with i.i.d. initialization. We ask whether, given a fully connected network with a certain initialization, we can understand whether a target function is easy or hard to learn, and specifically hard to learn for a fully connected architecture, not for any architecture. The definition of initial alignment is as follows.
For a target function f with input space X and input distribution P_X, and a neural network NN(θ) parameterized by weights θ and initialized at random from a distribution P^0, we take, for each neuron v in the network (each of these blue and red points), the average over the initialization distribution of the squared correlation between the target function and the output of the network at neuron v at initialization. Then we take the maximum over all neurons in the network. So the question can be rephrased as follows: if at initialization there is no neuron in the network that picks up a good correlation with the target, does this imply that after a reasonable horizon of gradient descent training the correlation will still be small, and hence that the network will not have learned?

As I said before, we consider only fully connected networks for the theoretical result, but here is an experiment for a convolutional network; the definition of the INAL makes sense for any architecture and any input distribution. Here we took the CIFAR dataset and split it into pairs of tasks. For instance, we consider only the images belonging to the classes cat and dog and perform binary classification on these two classes, and the same for bird versus deer and frog versus truck. For these three tasks we estimated the initial alignment. It is just two expectations, so it can be estimated by Monte Carlo simulation, and we do not need explicit knowledge of the target function to estimate the INAL: instead of f(x) in the definition, we can just use the label. We can see that there is indeed a good correspondence: the initial alignment for cat versus dog is very small, and the generalization accuracy after training is also smaller compared to other tasks where the initial alignment was larger, for instance frog versus truck.

For the theoretical result we restrict to the Boolean setting. We assume the target function maps the Boolean hypercube of dimension n to {-1, +1}, so binary classification. We assume the input distribution to be uniform, and we assume f to be asymptotically balanced, so that +1 and -1 appear with the same frequency as the input dimension goes to infinity. The network is again fully connected, with Gaussian i.i.d. initialization with rescaled variance, so the weights on layer j have variance 1/n_{j-1}, where n_{j-1} is the number of neurons in the (j-1)-th hidden layer, and ReLU activation. We can generalize to a larger class of activations, but for this talk I will mention only ReLU activations.

The algorithm is noisy gradient descent with full batch. At each step we take the full population gradient and clip it to the interval [-A, A]: whenever a coordinate exceeds this interval, it is projected back to [-A, A]. This is for technical reasons. Also, at each step we inject Gaussian noise with variance sigma squared. This is the same setting as in previous works.
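Going back to the Monte Carlo estimation of the INAL mentioned for the CIFAR experiment, here is a minimal sketch of both that estimate and the noisy GD step just described. This is not the authors' code: the one-hidden-layer ReLU network, the sizes, the bias initialization, the restriction to hidden-layer neurons, and the helper names `inal_estimate` and `noisy_gd_step` are all illustrative assumptions.

```python
import numpy as np

def inal_estimate(X, y, n_hidden=128, n_inits=50, seed=0):
    """Monte Carlo estimate of INAL = max_v E_theta[ E_x[ y * N_v(x; theta) ]^2 ]
    for the hidden neurons of a toy one-hidden-layer ReLU network.
    X: (m, n) inputs; y: (m,) labels in {-1, +1} (the labels stand in for f(x),
    so no explicit knowledge of the target function is needed)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    sq_corr = np.zeros(n_hidden)
    for _ in range(n_inits):
        # Gaussian i.i.d. initialization, variance 1/n on the first layer
        W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n_hidden))
        b = rng.normal(0.0, 1.0, size=n_hidden)
        H = np.maximum(X @ W + b, 0.0)           # ReLU outputs of the hidden neurons
        corr = (y[:, None] * H).mean(axis=0)     # empirical E_x[ y * N_v(x) ], one per neuron
        sq_corr += corr ** 2
    return (sq_corr / n_inits).max()             # average over inits, max over neurons

def noisy_gd_step(theta, grad, lr, A, sigma, rng):
    """One step of the noisy GD described above: clip the population gradient
    coordinate-wise to [-A, A], then inject Gaussian noise of variance sigma^2."""
    return theta - lr * np.clip(grad, -A, A) + rng.normal(0.0, sigma, size=theta.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.choice([-1.0, 1.0], size=(5000, 20))        # uniform Boolean inputs
    print(inal_estimate(X, X[:, 0]))                     # f(x) = x_1: a degree-1 parity
    print(inal_estimate(X, np.prod(X[:, :5], axis=1)))   # degree-5 parity: much smaller estimate
```

The small example at the bottom illustrates the intuition of the talk: a degree-1 parity already shows a sizeable INAL at initialization, while a degree-5 parity is close to the estimator's noise floor.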
Before the main result, I need one more definition. We introduce the extended version of f, which we call f̄. It is a function that maps the Boolean hypercube of dimension n² (instead of n) to {-1, +1} in the following way: the first n coordinates correspond to the n inputs of f, and then we add coordinates x_{n+1} up to x_{n²}, which are just dummy variables, coordinates that do not affect the output of f̄.

The main result is as follows. Suppose a function f has an initial alignment with the Gaussian ReLU network discussed before that decreases as n^{-c} or faster, so it is O(n^{-c}); we care about the behavior of the initial alignment as the input dimension goes to infinity. Then the noisy gradient descent algorithm, after T steps of training on any fully connected network with E weights and any initialization, outputs a network whose correlation with the extended function f̄ can be controlled by a term that depends on the learning rate, the training horizon, the number of weights and the gradient precision of the algorithm, plus a term that depends on the initial alignment.

The first message of this theorem is that, in some sense, the initial alignment characterizes whether f is weakly learnable on Gaussian ReLU networks: if the initial alignment is very small, so if this constant c is very large, then we need either a very large neural network or a very large number of training steps to achieve non-trivial correlation. By weakly learnable I really mean weak learning, that is, achieving non-trivial correlation with the target function. In the other direction, if the INAL is large, so if this constant c is small, then somewhere in the network the function is already weakly learned at initialization. However, the result is stronger than that: if a function f has a small initial alignment with the Gaussian ReLU network, then it is hard for any fully connected network with any initialization. This includes networks with non-Gaussian initializations and with activations other than ReLU. The price that we have to pay for this is the caveat of f̄: we can prove hardness only for the extension of f, not for the original f.

This caveat is not trivial, in the sense that if we try to prove the same theorem without f̄, then there are counterexamples: one can construct a network with a particular random initialization and a target function that has a very small INAL but is learnable. However, the hope is that if we restrict to Gaussian initializations, then one can remove this f̄; for now, this is left for future work. For now, we can remove the f̄ if we impose some constraints on f; specifically, we assume f to have sparse high-degree coefficients.

For the proof, we make extensive use of the Fourier-Walsh transform. Any Boolean function can be expressed in terms of its Fourier-Walsh transform, that is, as a sum of parities, where χ_S is the product of all x_i with i in S; since we are in the Boolean world, this is just the parity over the set S. These parity functions form a basis of the space of Boolean functions, so one can write f in this way, and the coefficients f̂(S) are just the projections of f onto the basis elements. These are called the Fourier coefficients; they form the Fourier spectrum, and we use them to characterize the hardness of f. To do this, we add another definition, the Fourier weight of f up to degree k, written out below.
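For reference, here is the notation just described, written out in standard Boolean Fourier analysis form; this simply transcribes the verbal definitions above and the one that follows.

```latex
f(x) = \sum_{S \subseteq [n]} \hat f(S)\, \chi_S(x), \qquad
\chi_S(x) = \prod_{i \in S} x_i, \qquad
\hat f(S) = \mathbb{E}_{x \sim \{\pm 1\}^n}\!\big[f(x)\, \chi_S(x)\big], \qquad
W^{\le k}(f) = \sum_{|S| \le k} \hat f(S)^2 .
```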
In words, W^{≤k}(f) is the sum of the squared Fourier coefficients over all sets of cardinality at most k.

The proof then goes by two main steps. In the first step we prove that if the initial alignment is small, then f is of high degree, in the sense that we can bound the low-degree Fourier weight of f by a term that depends on n^{k+1} and on the initial alignment. So if the initial alignment is small and we write f in the Fourier basis, then f has only high-degree terms: it is of high degree in the sense that it is really a sum of only high-degree monomials. The second step consists of proving that these high-degree functions are hard to learn for the noisy GD algorithm on fully connected networks.

For the first step, the main idea is the following: since the initial alignment is the maximum over neurons of the average correlation, we can restrict to the neurons in the first layer. Here I substituted the output of the network at a neuron v in the first layer, which is just a perceptron with a ReLU activation on top. If we can prove that this quantity being small implies that f is of high degree, that is a stronger statement than the original one: indeed, if the INAL is small, then the INAL restricted to the first layer is also small, and if that already implies that f is of high degree, we are done. To do this we use the Fourier-Walsh expansion of f, which is very convenient because the initial alignment can then be expressed as a weighted sum of initial alignments with parities. The final step is then to characterize the initial alignment between parities and a perceptron with ReLU activation. The idea is that this depends on the degree of the parity: for any parity of degree k, the initial alignment is at least Ω(n^{-(k+1)}), so it decreases as n^{-(k+1)} or slower. This concludes the first step, since it means that low-degree parities have a large initial alignment with a Gaussian random network, and therefore, for the initial alignment to be small, the coefficients of the low-degree parities must be very, very small. The proof of these two lemmas is technical, but these are the two main ingredients of the proof.

Then for the second step, which is proving that high-degree functions are hard to learn, the idea is that fully connected networks are invariant under permutations of the input, and we use this invariance as a limitation for learning high-degree functions. To quantify this, we construct the orbit of f, which contains every function obtained by composing f with a permutation of the input space, and we show that if f is of high degree, then the orbit of f̄ is not learnable. The reason why this is true can be quantified through the cross-predictability, a measure for a class of functions that was introduced in previous work in 2020. It is the average squared correlation between two functions sampled uniformly at random from our function class, which in this case is the orbit. So it measures how informative two functions chosen at random from our function class are about each other. The idea is that high-degree functions, and in particular high-degree parities, have very low cross-predictability; a small numerical illustration follows.
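Here is a minimal sketch of such a cross-predictability estimate for the orbit of a degree-k parity under input permutations, under illustrative assumptions; the function name `cross_predictability` and the parameters are hypothetical, not taken from the paper.

```python
import numpy as np

def parity(X, S):
    """chi_S(x) = prod_{i in S} x_i, evaluated on each row of X."""
    return np.prod(X[:, S], axis=1)

def cross_predictability(n, k, n_pairs=1000, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_{S1,S2}[ E_x[chi_{S1}(x) chi_{S2}(x)]^2 ],
    with S1, S2 uniform size-k subsets of [n] (the orbit of a degree-k parity).
    The inner expectation is itself estimated from samples, which adds a small
    upward bias of order 1/n_samples to the result."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_samples, n))   # uniform Boolean inputs
    total = 0.0
    for _ in range(n_pairs):
        S1 = rng.choice(n, size=k, replace=False)
        S2 = rng.choice(n, size=k, replace=False)
        corr = np.mean(parity(X, S1) * parity(X, S2))  # ~1 if S1 == S2, else ~0
        total += corr ** 2
    return total / n_pairs

if __name__ == "__main__":
    # The estimate drops quickly with the degree k: higher-degree parities
    # give a much less predictable orbit.
    for k in (1, 2, 3):
        print(k, cross_predictability(n=20, k=k))
```

For uniform Boolean inputs, two distinct parities are uncorrelated, so the cross-predictability of this orbit is essentially the collision probability of two random size-k sets, roughly 1 over binomial(n, k), which is what the printed numbers reflect.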
In fact, for a high-degree parity we can change just one input coordinate and completely change the output. Then we can adapt previous results to show that the correlation achieved by the algorithm can be controlled for function classes with low cross-predictability.

To conclude, we believe that we can extend the result to other non-polynomial activations. For polynomial activations this is not true: if we consider, for instance, the square activation, then already the parity of degree three has zero initial alignment with this activation, but the parity of degree three is learnable by neural networks in general. So we need an activation with infinitely many non-zero Hermite coefficients, and for polynomials this is not the case. Another direction is to extend to other architectures. The problem with, for instance, convolutional neural networks is that these architectures are not invariant under permutations of the input, so for the second part of the proof we would need to find another class of invariances and work with it. A further direction is to extend to other input distributions, for instance Gaussian inputs or inputs on the sphere. Here one would need to use another basis, not the Fourier basis but, for instance, the Hermite polynomials; we believe that most of the techniques we use here can be applied to that case. With that I'm done, thank you for your attention.

Very nice, thank you very much. That was a very nice start. Can somebody on Zoom confirm that they hear me now? Can somebody on Zoom? Yes. Perfect. Thank you, Francesca. We have time for a couple of questions, and then of course we'll have more questions during the discussion. I see your first question there.

Thanks for the talk. I just wanted to ask: you're talking about the learnability of a function defined over the entire space of inputs, basically, because you take an i.i.d. distribution. What I wanted to know is what you think about the case where the function is defined on some data manifold, and how difficult it is to learn the function on that subspace of all possible inputs.

So you mean, instead of considering the uniform distribution over the hypercube as the input distribution, considering another distribution. Yes; already with a bias, if instead of the uniform distribution over the hypercube we assume a distribution with some bias towards plus one or minus one, i.i.d. among the coordinates, we already have problems: the result already does not hold. So it seems that going beyond the uniform distribution is really a challenge. Even if we extend to other input spaces, we still need some symmetry of the distribution.

But apart from the technical part of proving it, do you think it still holds? Is it still hard to learn the function, or does it become easy if you restrict the space of inputs?

I mean, possibly the INAL can still characterize this case; from the experiments you see that it could still be the same trade-off, but I'm not sure how to reason about it. Thank you.

We have time for another question. Can you just pass the microphone? Thank you very much. I have just a question about the non-polynomial activations: do you think that non-polynomial activations are necessary for your proof, or do you think they are necessary for the good behavior of the neural network?

They are necessary if we want to have a measure that characterizes weak learnability.
Basically, what we are saying is that for a hard function we will not even escape the initialization in polynomial time. And to hope that in non-polynomial time, or with a longer horizon, we will escape it, the activation needs to give at least weak expressivity for our target function. If we have polynomial activations, then we have limitations: if we have a polynomial of degree k and one layer, then we can prove the result only for functions whose Fourier expansion contains parities up to degree k, so we will not be able to prove it for every Boolean function.

So you mean that if I have a neural network with polynomial activation, it will not learn well?

If you have a neural network with polynomial activation, it will have limitations in learning high-degree functions.

Okay, thank you.

And the initial alignment will be zero, so you will not be able to prove that a low initial alignment implies a high-degree function.

Thank you.

Maybe one more short question. Yes, Marco. I was wondering if you could quickly elaborate on which other input distributions you can tackle with this kind of technique?

With very similar techniques, for the first step, proving that a small INAL implies that the function is of high degree, we can for instance do it for Gaussian inputs, using the Hermite basis; then the statement is that the function has only high-degree coefficients in the tensorized Hermite expansion. The problem is the second part, because of what a high-degree function means: in the Boolean case, a high-degree function depends on many coordinates, since you cannot have x_1^k, which is either x_1 or 1. In the real-valued world it is a bit more complicated to characterize high-degree functions, because you have to count how many times each coordinate appears.

Wonderful. Let's thank Elizabeth one more time. Thank you.