First, I want to thank the organizers for setting this up, and all of you as well for being here. In general, I am interested in understanding which kinds of functions deep learning is good at learning. For instance, I think that deep convolutional neural networks are very good at learning hierarchically compositional functions, which I will define over the course of the talk.

So, let me just start by reminding you of the striking difference between the huge success of deep learning at many tasks — here I put just a few examples: ImageNet, AlphaFold, ChatGPT — and our lack of fundamental understanding of why these methods are so successful. One specific question that I would really like to be able to answer when I look at these models is: how much data do you need to learn a task, once you know the task?

So, let me start with a little recap of what is known in the literature. First of all, deep learning architectures learn from high-dimensional data, and we all know that such data are affected by the curse of dimensionality: typically, in high dimension, if you throw down a fixed number of points, their typical distance decays very slowly as you increase the number of points. This generally means that certain tasks, for instance regression, which require a small distance between the sampled data points, are going to be very difficult, and they are going to require a number of points that is exponential in the input dimension. If the input dimension is of order 100, this is like 10 to the 40, which is huge.

On the other hand — and here I borrow a picture, I hope Sebastian doesn't mind — we know that tasks like image classification are learnable, and the reason they are learnable is that they are highly structured. This is an example of the hidden manifold model, in which there is the assumption that data lie on a manifold within the input space: an image of a horse is not just a random point in input space, it is something more than that. Not only is each data point not a uniform random point in input space, but the tasks we want to learn are not generic functions of these points either; they are very structured, and you need all of this in order to understand learnability. For instance, take a dataset like CIFAR, which I think is where the images here are taken from: you can estimate the effective dimension, if you wish, of the data manifold, and you find something which is not the number of pixels but close to 35. But then you only have of order 10 to the 4 data points in CIFAR-10, and you know you can learn from those data, and 10 to the 4 is much less than e to the power 35. Therefore, what is missing is information about the structure of the task you want to learn, and the question of how much data you need really becomes two questions: on the one hand, what is the structure of the task; on the other hand, how does a method, in this case a deep neural network, exploit this structure to learn?
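To make the curse-of-dimensionality remark above concrete, here is a minimal sketch (my own illustration, not from the talk) of how slowly nearest-neighbour distances shrink when the ambient dimension is large: for n uniform points in the unit cube the typical nearest-neighbour distance scales roughly like n^(-1/d), so in dimension 100 even a large increase in n barely reduces it.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_nn_distance(n_points, dim, rng):
    """Average nearest-neighbour distance among n_points uniform samples in [0, 1]^dim."""
    x = rng.random((n_points, dim))
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)
    return np.sqrt(np.clip(d2.min(axis=1), 0.0, None)).mean()

for dim in (2, 10, 100):
    for n in (100, 1000):
        print(f"d = {dim:3d}  n = {n:5d}  mean NN distance ≈ {mean_nn_distance(n, dim, rng):.3f}")
```

In dimension 2 the distance drops noticeably as n grows; in dimension 100 it barely moves, which is the regime where regression-style tasks would need exponentially many samples.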
So now I want to present three ideas which have been put forward in the field for solving this riddle. The first one, which I actually like, and of which we have seen examples in the previous talks, is that learnable tasks are those that depend on low-dimensional projections of the input data: although inputs live in a high-dimensional space, the function, which I identify with the task to learn, only depends on a few variables, say one or two. This idea has given rise to a very nice line of work, of which I only show a few of my favorites here, but there are many, and it is quite powerful because it allows a lot of analytical and rigorous insight. It also works because people have understood that shallow neural networks are able to adapt to this low-dimensional structure, to discover it, so you have both the structure of the data and the mechanism for networks to adapt to it. But of course there are limitations. One limitation is that, in this framework, I don't think you can understand the advantage of a deep neural network over a shallow neural network, and in a sense not even the advantage of a convolutional neural network over a fully connected one — keep in mind that convolutional neural networks are really the architecture we think of when we say that in 2012 they started winning the ImageNet challenge and so on. Secondly, and this is something I personally contributed to, you can show that for these architectures it is not really understood why feature learning should be beneficial. This is something we analyzed in the work I put here, which I won't mention in the rest of the talk but would be happy to discuss offline if anybody wants.
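As a concrete illustration of this first idea (my own sketch, not from the talk), a multi-index target is a function of a d-dimensional input that depends on the input only through k ≪ d linear projections; the projection matrix and the link function below are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 128, 2                                    # ambient dimension vs. number of relevant directions
W = rng.standard_normal((k, d)) / np.sqrt(d)     # the k relevant projections, unknown to the learner

def target(x):
    """A multi-index target: the label depends on x only through the k projections W @ x."""
    z = W @ x
    return np.sign(z[0] * z[1])                  # hypothetical link function

x = rng.standard_normal(d)
print(target(x))
```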
So we need to go beyond this, and the first immediate generalization you can make is to go from depending on a few low-dimensional projections to being invariant, or approximately invariant, to a more generic transformation. Why do I call this a generalization? Because depending on projections means being invariant to all rotations that leave those projections unchanged, and here I am just asking for invariance to a larger set of transformations. What people really had in mind when they introduced this, in 2013 and 2016, in work by Joan Bruna and Stéphane Mallat, were deformations of images: the content of an image is invariant to a small deformation of the image itself. This is a continuous symmetry, so there are lots of degrees of freedom — the image is high-dimensional, but these are degrees of freedom you can actually mod out, so you don't really need to think of the image as such a high-dimensional object. But in this case it is not so clear how neural networks are going to learn these invariances from only a finite set of example data; in fact, people who know there is a symmetry in the data prefer to use equivariant architectures for exactly this reason. And even here it is not really clear what the advantage of deep networks over shallow networks should be: there is no intuitive connection between invariance to smooth transformations and the depth of the architecture that learns the data.

So we come to hypothesis three, which is what I am going to focus on for the rest of the talk: learnable tasks are hierarchically compositional. By this I mean, for those of you who have never heard of this idea, that you can think of a function as a tree, where the leaves of the tree are the components of the input and the nodes of the tree are local computations; you keep putting these local computations together in a hierarchical manner until you get to the label, or the target function, as you like to call it. This idea was introduced in 2016, and the corresponding hypothesis about what deep neural networks learn actually came before the idea of hierarchical compositionality of tasks: deep neural networks, because they are deep, are able to learn more and more abstract representations of the data as you go through the layers. In very simple terms, they will learn very simple concepts, like edges in an image, at the first layers, and then more complex features towards the end, closer to the output. This is a picture to exemplify the idea of hierarchical compositionality in the context of an image, and I think it really provides an intuitive idea of what the advantage of depth could be: you already have depth in the data, so it is only natural to match it with a learning method which is deep. And also, because you learn representations which are on the one hand more complex but on the other hand more abstract, this could explain the progressive reduction of dimensionality observed in the hidden representations of trained deep convolutional networks.
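Here is a minimal sketch (my own, with a hypothetical local rule) of what "a function as a tree of local computations" means: the leaves are the input components, each node combines s of them, and the root is the label.

```python
import numpy as np

def hierarchical_target(x, g, s=2):
    """Evaluate a hierarchically compositional function: apply the local rule g
    to groups of s values, level by level, until a single value (the label) remains."""
    level = list(x)
    while len(level) > 1:
        level = [g(level[i:i + s]) for i in range(0, len(level), s)]
    return level[0]

# hypothetical local computation: 1 if the two children differ, 0 otherwise
g = lambda pair: int(pair[0] != pair[1])

x = np.random.default_rng(0).integers(0, 2, size=8)   # s = 2 and L = 3 levels, so 8 leaves
print(x, "->", hierarchical_target(x, g))
```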
So this is the hypothesis I am going to follow, and I am going to frame our model inside it. Before going on, I would like to give you a few more details of what is known about deep learning for hierarchically compositional functions, not in chronological order. First of all, through the work of Schmidt-Hieber in 2020, we know that hierarchically compositional functions can, in principle, be reconstructed with a number of examples which is polynomial in the dimension. I don't know if it is right to call it this, but it is essentially an information-theoretic result: you know that the information is there to reconstruct the function from those points, but you don't know whether there is an algorithm that does it. Then we also know that deep convolutional networks are able to represent these functions efficiently. What do I mean? If I compare how many neurons I need in a shallow fully connected network with how many neurons I need in a deep convolutional network to represent such a function at a given approximation level, I need much less in the deep convolutional network — exponentially less. But again, this says nothing about learning and generalization. We also know, and this is a contribution of mine and my colleagues, that for those of you interested in the kernel or lazy-training regime, these networks are actually cursed by dimensionality when learning hierarchical targets in the kernel regime, so we really do need feature learning for this kind of problem. And finally, it was understood that what is important for deep learning architectures to learn these types of functions is that there exists some correlation between each part of the input and the label. If you think of the input as a d-dimensional sequence, each element of this sequence has to be somehow predictive of the label, for deep learning to pick up these correlations, work on them, do its magic, and eventually generalize.

OK, so, within this framework, we would like to build a model of a hierarchically compositional task, and we want it simple enough that we can understand the sample complexity of deep learning methods learning it and relate that sample complexity to the structure of the task. This is what I am going to do in the remaining minutes. We call this model the random hierarchy model. Let's start again with this image: I really want to stress the aspect of hierarchical compositionality. We have the dog on the left, the image. You can think of the image itself as a representation of the abstract concept of dog, which is the class. Then you can think of this abstract concept as made by composing two lower-level concepts, which in this case are going to be the head and the pose of the dog, but it could really be anything. And each of these is itself made by composing lower-level features. So what are the parameters we need to describe this kind of structure? First, we have a parameter which is the depth, because the class label is determined via a hierarchy of compositions, and the number of compositions in this hierarchy is the depth of the model, L.
Then, at each level, one high-level feature corresponds to several sub-features: the dog corresponds to head and pose, the head corresponds to eyes, nose, and mouth, and so on. So I have another parameter, which is the number of sub-features at each level. Individual sub-features might also be shared. For this point you need to think of something more general than the picture of the dog on the left. For instance, think of an image dataset where you have dogs and birds: both dogs and birds have heads, but the bird doesn't have the dog's pose, the bird has wings. So the head would be a feature shared between the dog and the bird. To have shared features, I assume that at each level all the sub-features are taken from the same finite vocabulary, and I am going to call the size of this vocabulary v. Then we include another ingredient, which is that at each level you can have several groups of sub-features which lead to the same high-level feature: you might get a dog by composing different types of heads and poses and backgrounds and so on. We call this number the multiplicity, m. By the way, if there are any questions at any point, please feel free to ask. This kind of hierarchical generative model was introduced in papers by Mossel and, in 2018, by Malach and Shalev-Shwartz. Once we know all these numbers, we can forget about the dog and think about the model in more abstract terms: we really have a number, which is the label of a classification problem, and this label generates a set of high-level features, which themselves generate sets of lower-level features, and so on and so forth. Are there any questions on this idea?

So let's make another example for this semantic multiplicity idea. Let's zoom in on one level of the hierarchy, one compositional rule: at each level, a high-level feature can be represented by m distinct strings of lower-level features — yes, please? [Audience: "It's always shrinking going up?"] Okay, no — and this is an important point, actually, thank you. I really don't think of this azure square as the pose itself. I think of it as a number which represents the abstract idea of pose. It is a high-level feature which I take from my finite vocabulary, a single number. And then it doesn't matter, because once you generate the model, what you give to the network is only the input and the output. So here in the middle you can really put whatever you want; the network is completely oblivious to the representation you choose for the inner features. Really think of them as individual numbers: that is just one number. So, what I was saying is that at each level features can be represented in several distinct ways, and I am going to call the distinct ways of representing the same higher-level feature synonyms, because they are, in fact, synonyms: they are groups of objects that have the same meaning at the higher level of the hierarchy.
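To make the role of v, m, and s concrete, here is a minimal sketch (my own, with hypothetical parameter values) of how the composition rules could be sampled: at each level, every high-level feature is assigned m distinct strings of s lower-level features from a vocabulary of size v, and different high-level features get disjoint sets of strings so that the assignment is unambiguous.

```python
import numpy as np
from itertools import product

def sample_rules(v, m, s, L, num_classes, rng):
    """Sample one random composition rule per level of the hierarchy.

    rules[l][h] is an array of m distinct length-s strings of lower-level features
    (values in {0, ..., v-1}) that all represent the high-level feature h at level l.
    """
    rules = []
    for l in range(L):
        n_high = num_classes if l == 0 else v             # the top level expands class labels
        all_strings = np.array(list(product(range(v), repeat=s)))
        # pick m * n_high distinct strings and split them among the high-level features
        idx = rng.choice(len(all_strings), size=m * n_high, replace=False)
        rules.append([all_strings[idx[h * m:(h + 1) * m]] for h in range(n_high)])
    return rules

rng = np.random.default_rng(0)
rules = sample_rules(v=3, m=2, s=2, L=3, num_classes=2, rng=rng)
print(rules[0][0])   # the m synonymous representations of class 0
```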
Okay. Given this kind of generative model, Malach and Shalev-Shwartz understood that if you have a correlation between the input pixels and the label — by correlation I really mean that you can look at your dataset and count, given that I observe blue in the left corner of the input image, what the probability is of being in a certain class rather than another — and if this probability differs from the uniform one, then each single element of the input is going to be somewhat predictive of the label. And if you have this, they show that you can exploit this correlation, with a combination of clustering algorithms and layer-wise gradient descent, to learn these kinds of problems. Our contribution is, on the one hand, to make a more specific assumption, and on the other hand to claim a more general result: I am going to make further assumptions on the structure of the model, which will let me compute what these correlations are and how they affect deep learning; but then, because I am going to use less rigorous techniques, I am going to use them to explain the performance of deep neural networks under standard training, where all layers are trained at the same time. So, if the program is clear, I will go forward with the assumption.

The main assumption is that at each level of the hierarchy we choose the rules at random, in the sense that we choose which strings of low-level features are assigned to a given high-level feature uniformly at random from all possible assignments of low-level features to high-level features. This is really a simple combinatorial problem: we have a finite set of strings of low-level features, we have to assign m of them to each of the v high-level features, and we pick this assignment uniformly at random among all possible ones. Pretty simple. What is important about this random assignment is that it crucially induces correlations. Why? Here I show this in a very small setting where the parameter s, which gives the size of the strings of low-level features, is 2, the multiplicity is 3, and the vocabulary size is also 3, because I only have blue, green, and orange at the bottom. On the left you have an example of a random choice, and with the random choice you see this property that single pixels, if you wish, are already predictive of the high-level feature: if I see a blue dot on the left, I know it is more likely that the high-level feature is gray than not, and the same holds for the orange and the green and so on. On the right, for comparison, I show another possible choice, in which you distribute all the low-level features equally among the high-level features. In this way you have completely killed the correlations, because, as you can see, each of these columns has a blue, an orange, and a green, so there is no way I can tell what the high-level feature is going to be just by looking at a low-level one. So the randomness of the choice induces correlations, and this is a very important point to remember for later.
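A small numerical illustration (my own, with hypothetical colour-to-string assignments rather than the exact ones on the slide) of why the random choice keeps single-position correlations while the homogeneous choice kills them: the first rule set below gives a non-uniform class profile for a given low-level feature, the second gives exactly the uniform one.

```python
# Rules for one level with s = 2, v = m = 3: each high-level feature lists its m synonymous strings.
random_rules = {                       # a random-looking assignment: correlations survive
    "grey":  [(0, 0), (0, 1), (1, 0)],
    "pink":  [(0, 2), (2, 0), (2, 2)],
    "white": [(1, 1), (1, 2), (2, 1)],
}
homogeneous_rules = {                  # each column contains every low-level feature once: no correlations
    "grey":  [(0, 0), (1, 1), (2, 2)],
    "pink":  [(0, 1), (1, 2), (2, 0)],
    "white": [(0, 2), (1, 0), (2, 1)],
}

def p_high_given_low(rules, low_feature, position=0):
    """P(high-level feature | low-level feature observed at `position`), assuming the m rules are used uniformly."""
    counts = {h: sum(string[position] == low_feature for string in strings)
              for h, strings in rules.items()}
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}

print("random:     ", p_high_given_low(random_rules, low_feature=0))      # non-uniform: informative
print("homogeneous:", p_high_given_low(homogeneous_rules, low_feature=0)) # uniform: uninformative
```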
And here I want to give a recap of this model. [Audience question.] I think it is going to be very unlikely. I don't have the number on top of my head right now, but we computed it — well, it is not published, it is in a draft, but we have the analysis there. It is actually a very low probability that, if you throw all your rules at random, you end up with a homogeneous case. For having a homogeneous case like this one, what needs to happen is that whenever you pick which low-level features to assign to a high-level feature, you pick exactly one of each, which is very unlikely if you think about it: it is a very special point in the space of all possible assignments. Of course, at some point I am going to send m and v to be large, which means I am going to have many of them, and if I have many of them then on average there is going to be one of each color. But that is only the average: if you look at single realizations of the model, that is not the case anymore; you will have fluctuations, and you will have homogeneity only with a probability that vanishes as you send the number of synonyms m to be very large. Let's say, when these numbers are around 10, this probability is going to be, I think, of order 1 over 200 or something like that.

Okay, so this is the recap of the model. You have a fixed number of classes, n_c. You have the parameter m, which is the semantic multiplicity of each concept. You have s, which is the number of sub-features you branch into at every level; s is 2 in this case, so you have a binary tree at every level. You have v, the size of the vocabulary, meaning that the yellow square is going to be filled with one out of n_c numbers, the number of classes, whereas every other square is going to be filled with one out of v numbers, because the vocabulary has size v. Of course, I could let these numbers depend on the layer and have different ones at every level, but for simplicity I keep them fixed. And finally you have capital L, which is the depth of the hierarchy. This is really how you generate your dataset: you pick a label at random, one number out of the number of classes; from this number, using a random rule, you generate all possible representations of that class, and you know that each class can be represented in m different ways. Once you have a representation, you take another rule, and to each of its elements, which are the light green and light blue squares, you assign in turn m distinct tuples, and you go to the next level, and so on, until you have repeated this L times and you get your whole dataset.
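Putting the pieces together, here is a minimal sketch (my own, reusing the hypothetical sample_rules function from the earlier sketch) of how a single datum could be generated: starting from the class label, each feature is replaced by one of its m synonymous strings, level by level, until only low-level features remain.

```python
import numpy as np

def generate_datum(label, rules, rng):
    """Expand a class label into an input string by choosing one synonym at random per node, level by level."""
    level = [label]                                   # the root of the tree is the class label
    for level_rules in rules:                         # rules[0] expands classes, rules[l > 0] expands features
        next_level = []
        for feature in level:
            strings = level_rules[feature]            # the m synonymous representations of this feature
            next_level.extend(strings[rng.integers(len(strings))])
        level = next_level
    return np.array(level)

rng = np.random.default_rng(1)
v, m, s, L, n_c = 3, 2, 2, 3, 2
rules = sample_rules(v, m, s, L, n_c, rng)            # from the earlier sketch
x = generate_datum(label=0, rules=rules, rng=rng)
print("input dimension s**L =", s ** L, " datum:", x)
```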
Here I want to summarize a few properties of this model. First of all, by repeating this procedure, given the number of classes, the multiplicity m, and the parameters s and L, you generate a number of data points — the P_max over here — and what is important is that it is exponential in the dimensionality of the input; the input dimension is of course s to the L, because at every level you branch into s different possibilities. Then, by construction, the class label in this model is a hierarchically compositional and local function of the input, by which I really mean the kind of computation I wrote here: first you put together x1 and x2 to make one feature, then you put together x3 and x4, and so on and so forth, and then you put these together, and this is a very special structure for the label. Third, the label is invariant whenever I exchange semantically equivalent low-level features. And also, as I showed earlier, the random choice of rules induces correlations between the input features and the label, which I can compute.

Okay, hoping this is clear, I am going to move forward and start discussing the sample complexity of this model, which is what I am ultimately interested in. There are two simple characteristic sample sizes that follow from the definition of the model. One is the total number of points in the dataset, which is a function of the parameters — the P_max below — and it is exponential in the input dimension (remember, the input dimension is s to the L). But you also have a minimal number of training points, which I define as the minimal number you need to reconstruct the model. Think of the model as just being made of the L rules that I chose at random; each of these rules is the assignment of m strings of sub-features to one out of the v high-level features, which means that each rule takes m times v points to be fixed completely, and so I multiply this number by L and get what I call the minimal number of points needed to learn this model. If you wish, you can get a more rigorous understanding of this by considering algorithms that search the hypothesis class of all possible random hierarchy models: there is only a finite number of functions compatible with that, and if I compute the logarithm of the number of these functions I get this P_min, which is a typical information-theoretic bound for learning. And then the sample complexity of deep convolutional networks is going to sit in the middle, and I will give it to you already: it follows this formula, just the number of classes times the multiplicity m to the power L. Two important features of this: first, it is polynomial in the input dimension, which is nice because we are not suffering from the curse of dimensionality; second, it is completely independent of the vocabulary size, it only depends on the semantic multiplicity, and basically the number of points you need per class is m to the power L.
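To put rough numbers on these three sample sizes, here is a short worked example (my own, with hypothetical parameter values, not the ones used in the experiments):

```python
n_c, v, m, s, L = 10, 10, 10, 2, 3               # hypothetical parameter values

d      = s ** L                                   # input dimension
p_max  = n_c * m ** ((s ** L - 1) // (s - 1))     # total number of distinct data points
p_min  = L * m * v                                # points needed to pin down the L random rules
p_star = n_c * m ** L                             # claimed sample complexity of deep CNNs

print(f"d = {d}, P_min = {p_min}, P* = {p_star}, P_max = {p_max:.1e}")
# d = 8, P_min = 300, P* = 10000, P_max = 1.0e+08   ->   P_min < P* << P_max
```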
OK, now I want to show you how we think we understand where this sample complexity comes from. The question I want to ask to get there is: how would you learn a random hierarchy model? Because of the properties I have introduced, the natural approach would be to learn first the semantic classes of the low-level patches. I don't want to learn the class — the yellow square — directly from the input string; what I want to do first is to understand that blue and orange mean purple and that green and red mean brown. If I can do that, then my claim is that I can reduce the dimensionality from s to the L to s to the L minus 1, because I go back one level in the hierarchy, and if I do this successfully and iterate the procedure L times, then I have learned the model. Relatively simple, but is it really what happens?

To check, we can do the following experiment. We consider a deep convolutional network which has more or less the same structure as the model: the same depth as the model, and the filter size matching the number of sub-features per high-level feature. What we can look at, to answer this question, is what I call the semantic sensitivity of the representations. You train the network, you look at the hidden representations — how the neurons of the first layer respond to each input — and then you play a game of switching elements of the input with other elements, which can be synonyms or not. You ask yourself what is the difference in the representation when I replace an element with a synonym, versus when I replace it with something that has a completely different meaning. For the network to successfully solve the task, it has to be invariant to switching an element with its synonym, and this is a measure of exactly that. What you can see on the left is that there is clearly a characteristic number of training points at which you acquire this invariance: you go from something close to one to something close to zero, a typical sigmoidal shape. These curves are obtained for different parameters of the model — different vocabulary sizes, numbers of classes, and multiplicities — but if you rescale them by the sample complexity I showed before, they all follow more or less the same master curve, and they all go down around 10 to the zero, which is one. This means that this number P star is in fact the number of training points at which the hidden representations of the neural network become invariant to exchanges of synonyms in the model. And we claim — we show — that this is also what controls generalization. Here I am making the same kind of plot, but what I am showing is the test accuracy of these models on unseen examples. Again you have sigmoid-like curves; the zero is not shown because the y-axis is also logarithmic. The curves decay at different training set sizes, but once you divide the training set size by the sample complexity scaling that I found, n_c times m to the power L, you find a collapse, meaning that the random hierarchy model is in fact learned by deep convolutional networks with this number of training points, n_c times m to the L, where L is the depth, m is the multiplicity — that is, how many different groups at the lower level correspond to the same feature at the higher level — and n_c is the number of classes. And n_c is really not super important: you can divide P by n_c and think of it as the number of points per class, if you wish.
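Here is a minimal sketch (my own, with hypothetical function names) of the synonym-sensitivity measurement just described: the relative change of a hidden representation when one patch of the input is replaced, either by a synonym or by a patch with a different meaning.

```python
import numpy as np

def sensitivity(hidden, x, swap, rng, n_trials=100):
    """Average relative change of a hidden representation under input swaps.

    hidden : callable mapping an input array to its hidden representation
    swap   : callable (x, rng) -> x' with one s-patch replaced (by a synonym, or by a random patch)
    """
    changes = []
    for _ in range(n_trials):
        h, h_swapped = hidden(x), hidden(swap(x, rng))
        changes.append(np.linalg.norm(h - h_swapped) / (np.linalg.norm(h) + 1e-12))
    return float(np.mean(changes))

# Hypothetical usage: a trained network should give a ratio close to 0 (invariant to synonyms)
# while staying sensitive to swaps that change the meaning, e.g.
# ratio = sensitivity(net_first_layer, x, swap_with_synonym, rng) \
#       / sensitivity(net_first_layer, x, swap_with_random_patch, rng)
```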
[Audience question about the architecture.] Yes, it is a convolutional network matched to the structure of the problem, so the filter size is the same as s. Here it is a deep convolutional network; the inputs are just put on a string, so it is a 1D convolution with filters of size s equal to 2, but you could make it 2D if you want, because the grid structure of the input is completely irrelevant here. What you really need is that these two elements are close to each other, because they correspond to the same higher-level feature; they are also close to each other in a spatial sense. Yes, here it is just 1D. And when I exchange synonyms, I am swapping, for instance, this semantically equivalent representation — couples of pixels, in this case. Here, for example, a gray pixel at the higher level, which is hidden, can be represented both as red and purple and as orange and green; when I exchange synonyms, I mean I exchange a red and a purple with an orange and a green, because they mean the same thing at the higher level. [Audience: does the structure of the network have to contain the hierarchy of the model?] Yes, it has to be deep enough. Actually, I haven't tried, but my guess is that with a smaller filter size and more depth you would still be able to learn, but it is a bit more complex, because then the first level of the model would correspond to, say, the second level of the network, so you have to be careful about which representation becomes invariant to what. But I think you don't need the matching to be exact. [Audience question.] When I generate the dataset I have all of it, if you wish, and because I have chosen the rules, I know which couples are synonyms of each other. The generation of the dataset really amounts to choosing one rule per layer, and once I have the rules, starting from the labels I don't really sample — I just generate all the data, let's say — and then when I train I only pick a fraction of them. Because I have generated them with this specific procedure, I know which ones are the synonyms.

Okay, this is an example, for instance, addressing the question of Alessandro, where we train a shallow fully connected network, which is still able to represent the target function, but is not going to be able to adapt to this structure, simply because it doesn't have enough depth to represent the model. Therefore the sample complexity here is the maximum number of training points: I need to see a finite fraction of the whole dataset in order to learn. Whereas if the network is deep enough to represent the hierarchy, then it follows the sample complexity of the matched architecture — although there are prefactors which I cannot control.

So let's go a bit into how we came up with this guess. (By the way, how am I doing on time? Okay, fair enough, I think it is fine to do this.) How do you do it? It turns out you can do it by simply counting occurrences. Just a little notation: I am going to denote by mu the tuples of low-level features, the objects that can have the same meaning one level above. If I count — that number over there — the number of data points with a fixed label alpha that have feature mu on, let's say, the first patch, then, because synonymous tuples have the same meaning, this number will be invariant to exchanges of synonyms. So based on this number I can build an invariant representation; that is all I want to say. You can compute the statistics of this number over different realizations of the model, so you can estimate what this number is going to look like, and then you ask yourself: how many points do I have to sample so that the empirical version of this number, computed on a finite training set, is close to the true one? By balancing signal and noise in this signal-to-noise type of trade-off — I will skip the details — you can find a sample complexity. And I think I am going to skip this, thank you all, and leave you with the conclusion slide.
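Here is a minimal sketch (my own) of the occurrence-counting statistic just described: N(alpha, mu) counts how many training points of class alpha show the tuple mu in a given patch; synonymous tuples should have nearly the same class profile, which is what makes a synonym-invariant representation possible.

```python
import numpy as np
from collections import Counter

def occurrence_counts(X, y, s=2, patch=0):
    """Count N(alpha, mu): how many training points of class alpha have the tuple mu in a given patch."""
    counts = Counter()
    for x, label in zip(X, y):
        mu = tuple(int(t) for t in x[patch * s:(patch + 1) * s])
        counts[(label, mu)] += 1
    return counts

def class_profile(counts, mu, classes):
    """The (unnormalised) class histogram of a tuple mu; synonymous tuples should have similar profiles."""
    return np.array([counts[(alpha, mu)] for alpha in classes])

# Hypothetical usage with a dataset (X, y) generated from the random hierarchy model:
# counts = occurrence_counts(X, y)
# print(class_profile(counts, mu=(0, 1), classes=range(n_c)))
```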
[Audience question.] I think it is an interesting point, but it is a subtle question how to scale things to make it emerge, because even a shallow fully connected network — a network which is not deep at all, with one hidden layer — has enough expressive power to learn the model; it is just that it will only learn it once it has seen a finite fraction of the whole dataset. So rather than that, you would have different curves which decay at different points, if the x-axis is the number of training examples, by taking models of different depths. Okay, yeah, I see your point; we haven't done that. What I can tell you, though, is that the convolutional networks here — and L is not huge, these are depth-3 convolutional networks — have many fewer parameters than the fully connected networks I have shown. In the end, for the fully connected network to fit all this data you really need, let's say, 10,000 hidden neurons, whereas for these to fit the data you need of order 30 neurons in each layer; for the exact parameter count I would have to square that at some point, but it is a pretty clear observation. [Audience question.] I have a picture, this one-gradient-step picture, which shows that you can base yourself on the statistics of these occurrence counts, and we know that this sample complexity P star actually comes from a sort of signal-to-noise matching on these numbers: when do I have enough samples in my training set to correctly estimate the numbers I am showing here? But we only relate that to actual training in this one-step gradient descent picture, not in the full training of a deep neural network. And about the — maybe I didn't understand the part about the lower bound? Well, there are some; it is going to be long, so if you want I can tell you offline. I think we can probably do better, but with other methods, not with pure gradient descent [inaudible]. It is just an attempt, if you wish, and a framework in which we were able to compute things, but you do need these correlations, no matter where they come from. In this case it is a model in which the correlations come from the randomness, and we can compute them exactly as a function of all the other parameters of the model. But what we also know is that if you don't have correlations at all, you will not learn: if you know the parity problem, it is like a parity function.