Great, so welcome everybody. It's a great pleasure to have Pietro Rotondo from the University of Milan, who will give a talk today about his recent work at the interface between disordered systems and machine learning. With that, I leave the floor to Pietro.

Okay, thank you, Jacopo, and thank you for asking me to speak at your meeting. As you said, I'm going to speak about this recent work. I'm sharing the screen; is everything okay, can everyone see? Great. I'm going to speak about this very recent work on statistical learning theory with geometrically structured data, and I will explain every single word of this title during the talk.

To start, let me give a short introduction with some motivation and background. Everyone knows that in the last ten years deep learning has achieved a lot of success in practical applications. The kind of architectures in use are sketched in the upper part of this slide: very complex architectures made of many layers, many of them convolutional, with the last few being fully connected. There is also a bunch of tricks you need to implement to improve the trainability of these architectures, for instance batch normalization and shortcut connections.

One of the most impressive achievements of deep neural networks has been in the field of computer vision. In particular, over the last ten years they greatly improved the top classification error, for instance in the competition called ILSVRC, on a super challenging data set, ImageNet, which is made of millions of images in thousands of categories. As you can see, before the advent of deep neural networks the top-5 classification error was around 28.2%, and within five or six years it was reduced considerably. The slide shows results up to 2015, but if you go up to 2019 the results are still improving a lot. Of course, this is only one of the practical applications where deep learning has achieved great success; there are many others that some of you are probably experts in, for instance reinforcement learning, or unsupervised learning with generative adversarial networks. I'm focusing on this one because supervised learning is the setting most related to this talk.

So there has been great practical success with these architectures, but it's fair to say that we understand very little from the theoretical viewpoint. As physicists, one thing we would like to have is a sort of Ising model of deep neural networks, which we currently don't have: toy models that we can investigate analytically and that still let us say something about these architectures.

Let me briefly mention the main fundamental unanswered questions in this field. First, it's fair to say that we don't know why stochastic gradient descent works in minimizing the loss function of deep neural networks. Stochastic gradient descent is a very simple algorithm, but the loss function of a neural network is a non-convex object
with many local minima, and it's not really clear why this algorithm works so well in minimizing it. Another important question is why we need depth to obtain such amazing performance, and why it's not enough to work with a shallow network, by which I mean a neural network with only one hidden layer. In principle, if you look at the problem of expressivity, a shallow network should be enough to approximate any reasonable function; but in practice, with many layers the performance is much better.

Ultimately, I would say the most important problem is that we don't know why these architectures generalize well. What may be puzzling is that these architectures have a lot of parameters, too many, and all the reasonable theories that predict how they should behave tell you that they should overfit; yet in practice this overfitting largely does not occur.

To understand the generalization problem in neural networks a bit better, we need the mathematical theories that describe generalization, so let me introduce the mathematical framework known as statistical learning theory. One of the main goals of statistical learning theory is exactly to provide rigorous bounds on the generalization error in supervised learning. The ingredients of the theory are an input space, which I call X on the slide, and an output space, which I call Y. I'm restricting the analysis to binary classification problems; it could be more general, but to fix ideas it's fine to consider a data set with binary labels.

The framework is probabilistic in nature: objects from the input and output spaces are drawn from a joint probability distribution, and from this distribution you draw your training set, which is simply a collection of P input-output pairs. Another ingredient is what is called the hypothesis space, this G here, which in practice can be whatever you like: a support vector machine, a neural network, anything. Usually there is an infinite number of functions in this hypothesis space.

From these ingredients I can build what is called the true risk. The indicator function inside is telling me whether or not my function g correctly implements the input-output relation (x, y), so it's basically counting how many mistakes I make; and it's the true risk because I'm averaging over the joint probability distribution of the input and output. Of course, in practice we never have this probability distribution, so we define another object, called the empirical risk, which is the very same indicator function but summed over the elements of the training set. From these two quantities I can define what is called the generalization error, which is simply the difference between the true risk and the empirical risk.
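To fix notation (the symbols here are an editorial choice, not necessarily the ones on the speaker's slides), the three objects just introduced can be written as:

```latex
% True risk: average misclassification over the joint distribution mu(x, y)
\epsilon(g) = \mathbb{E}_{(x,y)\sim\mu}\!\left[\,\mathbb{1}\{g(x)\neq y\}\,\right]

% Empirical risk: the same indicator, summed over the P training pairs
\hat{\epsilon}_P(g) = \frac{1}{P}\sum_{i=1}^{P}\mathbb{1}\{g(x_i)\neq y_i\}

% Generalization error: the gap between the two
\Delta_P(g) = \epsilon(g) - \hat{\epsilon}_P(g)
```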
So what mathematicians would like to do is to bound this generalization error, and the strategy they follow is fairly simple.

[Question] I have a question. The empirical risk is, I mean, when you calculate the empirical risk... no, sorry, nothing. It's calculated for each value? [Answer] Yes, exactly. [Question] Okay, I just answered myself; you can continue.

Great. Establishing bounds on this generalization error would be extremely easy if the functions g in the hypothesis space were finite in number. In that case there is an inequality, Hoeffding's inequality, which allows you to bound the generalization error by the square root of the logarithm of the number of elements in the hypothesis space, divided by the number of training patterns. Sorry for not reporting this formula on the slide; it would perhaps have been useful. (A standard form is reproduced right after this passage.)

The problem, of course, is that if the number of functions in the hypothesis space is infinite, you cannot use this result alone, and you need more sophisticated tricks to establish useful upper bounds on the generalization error. In particular, using results known as the union bound and the symmetrization lemma, you can establish a chain of inequalities, and at the end of the day you bound the generalization error with an observable which looks rather obscure if you don't look at the proof. This observable is known as the Vapnik-Chervonenkis (VC) entropy. Given a set of P inputs, the VC entropy counts the logarithm of the number of different classifications of those data points realizable by a given hypothesis space G.

You can then prove a nice theorem which, for any function in the hypothesis space G, bounds the generalization error with this observable; more precisely, with what is called the annealed VC entropy, which is simply an average of the number of different classifications over the probability distribution of the inputs. It's important to notice at this level that this bound depends on the probability distribution of the data; this is super important. Another important point is that the direct computation of the VC entropy is challenging, or even impossible, in most cases; there are only a few cases where you can really compute it. So, since you cannot compute it, what mathematicians usually do is bound it again with something else, this formula here, in which appears a number called d_VC, the Vapnik-Chervonenkis dimension, which is a compact measure of the expressivity of a given model G. Informally, the VC dimension is the largest number of data points that can be classified in all possible ways: if you have P points, there are 2^P possible classifications, and the VC dimension is the size of the largest set of points that your model can classify in all possible ways.
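The finite-hypothesis-space bound the speaker apologizes for not showing is the standard Hoeffding-plus-union-bound inequality, and the VC entropy bound has a standard form as well. Constants and exponents vary across textbooks, so take the following as a sketch rather than the exact statements on the slides:

```latex
% Finite G: with probability at least 1 - delta over the training set,
\sup_{g\in G}\bigl|\epsilon(g)-\hat{\epsilon}_P(g)\bigr|
  \le \sqrt{\frac{\ln|G| + \ln(2/\delta)}{2P}}

% Infinite G: ln|G| is replaced by the annealed VC entropy,
H^{\mathrm{ann}}_G(P) = \ln\,\mathbb{E}_{x_1\ldots x_P}\,
  \mathcal{N}_G(x_1,\ldots,x_P),
% where N_G counts the distinct dichotomies of the sample realizable by G,
% and a Vapnik-style bound reads
\Pr\Bigl[\sup_{g\in G}\bigl(\epsilon(g)-\hat{\epsilon}_P(g)\bigr)>\varepsilon\Bigr]
  \le 4\exp\bigl(H^{\mathrm{ann}}_G(2P) - P\varepsilon^{2}/8\bigr)
```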
This is informal, but it should be enough for the purpose of this talk. And again, since this is an inequality, you can establish another theorem telling you that the generalization error is bounded by a function that depends on the VC dimension. Again these bounds are uniform over the model, so they hold for any function g in it, but this time the bound is data independent: it only depends on the VC dimension, and the VC dimension is a property of the architecture alone.

Now, I can understand that this may all look very formal, so let me start with an example where the VC entropy can actually be evaluated exactly. I consider P inputs, these vectors psi here, which belong to an N-dimensional space, and I formally define something that will be very useful in the following, which I will call a dichotomy, or classification; it's really the same thing. It is simply a function that takes the inputs of my training set and tells me the actual classification I'm implementing on this training set. In this example I'm focusing on a hypothesis space made of linear functions alone, these formulas here, parametrized by the vector w. In pictures, what I'm asking is whether or not we can find a hyperplane that classifies my training set correctly. In this second picture here I'm showing the fraction of vectors w that can realize the dichotomy (0, 1, 0, 1), and in blue the fraction of vectors w that realize these other dichotomies, (1, 1, 0, 0) and (0, 0, 1, 1). Computing the VC entropy amounts to asking: there are 2^P possible classifications, but how many of them can be realized by these linear classifiers?

The answer to this question came in 1965, in a famous theorem by Thomas Cover, which established that under very general conditions on the positions of the points (the technical condition is that the points are in general position, which is a simple generalization of linear independence, but I won't say more on this) the number of linearly realizable dichotomies of P points is given by this not-too-complicated expression here (evaluated in the sketch after this passage). In particular, if we plot this number C(N, P) divided by 2^P, which counts all possible dichotomies, we notice that if the number of points P is not too large this fraction is one, so I can implement all my dichotomies with my linear classifier; but if there are too many points the fraction starts to deviate from one, and after a short crossover it goes to zero: none of the dichotomies is realizable, on average. And if you now rescale this number of points P by the dimension N of your linear classifier, what you find is that there are two special points.
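For concreteness, a minimal sketch of Cover's counting function in code (variable names are mine; C(P, N) counts the linearly realizable dichotomies of P points in general position in N dimensions):

```python
from math import comb

def cover_count(P: int, N: int) -> int:
    """Cover's 1965 counting function: the number of dichotomies of P points
    in general position in R^N that a linear classifier can realize."""
    return 2 * sum(comb(P - 1, k) for k in range(N))

N = 20
for P in [10, 20, 40, 41, 60, 100]:
    frac = cover_count(P, N) / 2**P
    print(f"P={P:4d}  alpha=P/N={P/N:4.1f}  realizable fraction={frac:.4f}")

# For P <= N all 2^P dichotomies are realizable (fraction 1, so d_VC = N);
# at alpha = P/N = 2 exactly half are realizable (the storage capacity);
# for larger alpha the fraction drops rapidly towards zero.
```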
The first special point is the VC dimension, which for the linear classifier is equal to N, the number of free parameters of the classifier; it is exactly the maximum number P of training patterns such that all the dichotomies are linearly realizable. The other special point is known as the storage capacity, and it represents the maximum number P such that 50% of the classifications are realizable. What is particularly important to mention, for those of you who know a bit about the statistical physics of disordered systems, is that the storage capacity is exactly the quantity that can also be evaluated within the Gardner framework for computing observables in statistical physics.

Now I would like to show you something that is, in my opinion, particularly interesting: how Cover's result interacts with a real data set. We can design a simple numerical experiment to check Cover's prediction. In practice, we use the MNIST data set and run the experiment with either the true labels or random labels: from MNIST we draw P images, we try to classify them with a linear classifier, and we count how many times we can classify these P images correctly. What is shown in the plot on the right is that Cover's prediction works, but only for random labels; it does not work for the true labels. I'm not explaining all the details of this experiment, which is not totally straightforward to implement because there are subtleties about what it means for MNIST inputs to be in general position, but if you like I can give you more details later. (A bare-bones version of this experiment is sketched right after this passage.)

What we basically learn from this simple numerical experiment is that Cover's result is missing the correlations between the inputs and the outputs; the dichotomies of MNIST are extremely atypical for precisely this reason. And here is the important point: when you are dealing with a problem in computer vision, for instance, there is a set of transformations that leaves the labels invariant, made for instance of a subgroup of rotations, rescalings and translations. So it is of course not true that a real data set is a random object: the data often lie on low-dimensional manifolds which are embedded nonlinearly in a higher-dimensional space, and this is quite different from having points with random labels.

Let me give you an example of what I mean. The images of MNIST naively live in a space of dimension 28 x 28 = 784, which is the number of pixels in an image. But there exists a concept of intrinsic dimension: the objects of MNIST do not really fill this 784-dimensional space; they actually cover a much smaller subset of it, and using this concept of intrinsic dimension you can estimate the dimension of this manifold to be around 10, more or less.
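A minimal sketch of a separability experiment in the spirit of the one just described. This is not the talk's actual MNIST protocol (which has the general-position subtleties the speaker mentions); it uses Gaussian inputs with random labels, and the separability test is a linear-programming feasibility check of my own choosing:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def linearly_separable(X, y):
    """Feasibility LP: does a w exist with y_i * (w . x_i) >= 1 for all i?
    (Separability through the origin; the margin 1 is WLOG by rescaling w.)"""
    P, N = X.shape
    A_ub = -(y[:, None] * X)      # -y_i x_i . w <= -1  <=>  y_i x_i . w >= 1
    b_ub = -np.ones(P)
    res = linprog(c=np.zeros(N), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * N)
    return res.status == 0        # status 0: a feasible solution was found

N = 50                            # input dimension (stand-in for 28*28 pixels)
trials = 20
for alpha in [1.0, 1.5, 2.0, 2.5, 3.0]:
    P = int(alpha * N)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((P, N))      # random inputs, general position
        y = rng.choice([-1.0, 1.0], size=P)  # random labels (Cover's setting)
        hits += linearly_separable(X, y)
    print(f"alpha={alpha:3.1f}  separable fraction ~ {hits/trials:.2f}")

# With random labels the separable fraction drops from ~1 to ~0 around
# alpha = 2, as Cover's formula predicts; the point of the experiment in
# the talk is that real images with their *true* labels break this curve.
```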
Just to mention, there are many algorithms to measure the intrinsic dimension of a data set, and one of the first was introduced by physicists, Grassberger and Procaccia, in 1983. Of course they did not have machine learning in mind; what they had in mind was a totally different problem in the field of dynamical systems: Grassberger and Procaccia were interested in measuring the fractal dimension of strange attractors in chaotic systems. (A minimal sketch of this correlation-integral idea appears right after this summary.) There is more recent work, for instance by Laio's group in Trieste, and also Vittorio Erba, Marco and I, in Milan, tried to develop a novel algorithm to measure this number.

But the most important thing to mention is that for more than 30 years this feature of real data sets was totally overlooked in statistical physics: people were doing calculations with training sets drawn from probability distributions with no correlations between the inputs and the outputs. Only more recently have people started to consider these input-output correlations, which is what I here call data structure, in their calculations. It's quite important to mention the work by Sompolinsky's group, Sompolinsky and collaborators, which is to some extent inspired by the notion of perceptual manifolds, a concept that comes from neuroscience and brain theory, about which I am not an expert at all. There is also work, whose references I could not fit on the slide, on the so-called hidden manifold model, done very recently by Lenka Zdeborová and her group in Paris.

I think this is a good moment for a short summary of what I have said up to this point. There are basically two different classes of upper bounds in statistical learning theory: the first class employs the VC entropy and gives bounds that depend on the distribution of the data, whereas the second class of bounds is data independent and depends only on the VC dimension. Another important point is that I can compute the VC entropy in one particular case. And actually, what I forgot to tell you on the previous slide is that Cover's theorem does not hold only for linear classifiers but, more generally, for kernel machines: kernel machines are simply machines where you apply an arbitrary nonlinearity, fixed a priori, to your data before classifying them with a final fully connected layer. And there is more than that, because the combinatorial formula by Cover is also used to obtain exact and rigorous results on upper bounds for deep neural networks. So Cover's theorem is not only something for linear classifiers; it can be important for more complex architectures too. But what we realized with the simple numerical experiment I showed you is that Cover's theorem is missing input-output correlations and works only for random labels; it cannot address anything if there are input-output correlations in the data set. And physicists actually addressed this problem of data structure only recently, in the last three or four years.
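As promised above, a minimal sketch of the Grassberger-Procaccia correlation-integral idea. This is the classic estimator, not the novel algorithm by the speaker and collaborators, and the scale choices below are my own:

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, n_scales=20):
    """Grassberger-Procaccia style estimate: the correlation integral C(r)
    is the fraction of point pairs closer than r; for small r it scales as
    r**d, with d the intrinsic (correlation) dimension, so d is estimated
    as the slope of log C(r) versus log r."""
    d = pdist(X)                                   # all pairwise distances
    rs = np.logspace(np.log10(np.percentile(d, 1)),
                     np.log10(np.percentile(d, 50)), n_scales)
    C = np.array([np.mean(d < r) for r in rs])     # correlation integral
    slope, _ = np.polyfit(np.log(rs), np.log(C), 1)
    return slope

# Sanity check on data of known intrinsic dimension: a 2D Gaussian cloud
# embedded (linearly, for simplicity) in 50 ambient dimensions.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 2)) @ rng.standard_normal((2, 50))
print(correlation_dimension(X))                    # should be close to 2
```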
And they did this using what is known as the Gardner framework, a framework based on disordered systems. It's worth mentioning that this framework is super nice and has produced a lot of fantastic results on neural networks, but it is in some way limited: within the Gardner framework I cannot really compute the VC entropy, which is the most relevant observable in statistical learning theory; I can only compute the storage capacity, which is this single point on the curve describing the VC entropy. So the question my collaborators and I started asking ourselves was: can we say something about the VC entropy beyond what has already been said by physicists with the Gardner framework? Can we build a bridge between what we usually do in statistical physics and what is done in statistical learning theory with the VC entropy, in the case where the data set has some sort of data structure? These are the kinds of results I'm going to explain in the following.

As you can understand, as a first step I must define an ensemble of data structures simple enough to be amenable to analytical calculation. So we were led to introduce this toy model of data structure, where the switch is from classifying single points, which is what has been done for more than 30 years in statistical physics, to classifying manifolds. The manifolds we classify in our work are multiplets of K points with prescribed geometric relations. There is an index mu, which labels the manifold, this object here, and another index a, which instead labels which point in the manifold you are considering. And if you consider patterns on a sphere, the only quantities you have to fix to specify the geometric relations are these parameters rho here, which are simply the overlaps, the scalar products between these patterns. For K = 2 the manifolds we are classifying are segments, for K = 3 they are triangles, and for generic K they are more complex simplexes.

Again, we can define these dichotomies, these classifications, as functions that take the training set as input and give you the classification. But this time the idea is that we want to classify the entire manifold in a coherent way: we cannot allow points in the same manifold to have different labels, so we are obliged to impose the constraint that the labels on all the points of a manifold are the same. The question is now a little different from the original question by Cover; it is: there are 2^P admissible classifications, but how many of them are linearly realizable? It is this admissibility condition that makes the difference with the standard classification of unstructured points.
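In symbols, a sketch of the setup just described (the notation is mine):

```latex
% P manifolds, each a multiplet of K points on the sphere:
%   \xi^\mu_a, with \mu = 1,\dots,P labeling the manifold
%              and  a  = 1,\dots,K  labeling the point within it.
% The geometry is fixed by the overlaps (taken the same for every mu):
\rho_{ab} = \xi^\mu_a \cdot \xi^\mu_b

% Admissibility: a dichotomy must be constant on each multiplet,
\phi(\xi^\mu_a) = \sigma^\mu \quad \text{for all } a = 1,\dots,K,
% so P manifolds admit 2^P classifications (one sign per manifold), and
% the VC entropy counts how many of these a linear classifier realizes.
```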
What I'm going to explain is basically how to generalize Cover's reasoning, and Cover's theorem, to this more complex form of data set, to these manifolds. So let me start with...

[Question] Excuse me, can I ask a question? [Answer] Of course. [Question] I'm a bit confused. Maybe I'm biased by the picture I see, but isn't this kind of equivalent to classifying some natural notion of center of mass of the manifold? Then the problem would reduce, in a sense, to the usual classification of random points. I think I'm missing something. [Answer] Perhaps you mean classification with margin. If you think about it for a while: of course these manifolds have a center of mass, but asking to classify them coherently means that every point of the manifold must be classified in the same way, so it's not enough to classify the center of mass. At the very least you would need to classify coherently the entire sphere that contains the simplex, but this is not the same as classifying the simplex. You can construct a sphere that covers the manifold, but asking how many spheres you can classify is different from asking how many triangles you can classify, because a sphere is again an N-dimensional object, whereas this manifold here is only two-dimensional. So the classification capacity of spheres is quite different from the classification capacity of these manifolds. And this allows me to make another important point: classification with margin is equivalent to classifying manifolds which are spheres, but for those I think it is really difficult to compute the VC entropy; it's not something you can do. Of course you can bound the VC entropy with the VC dimension, but you cannot really compute the VC entropy of linear classifiers with margin; you can only bound it. Here, instead, my goal is to show you that these manifolds are simple enough to generalize Cover's reasoning and to compute the full VC entropy of these ensembles of objects. Is that clear? [Question] Yes, it seems clearer now. Thank you, and sorry for the confusion.

Okay, so how does Cover's reasoning work for unstructured points, when I want to classify single points? In this case your goal is to construct a recurrence relation. The idea is that you start with P points, for which you already know the number C(N, P) of linearly realizable dichotomies in N dimensions, and you add a new point. When you add a new point, you can encounter two different situations. The first is the situation where there exists a hyperplane passing through the new point that keeps the dichotomy realizable; in this case you can shift this hyperplane a bit and obtain both classifications of the new point. Call the number of these dichotomies M; they count twice.
They contribute twice to the number of dichotomies of P + 1 points in N dimensions, whereas the remaining C(N, P) minus M contribute only once. And then we are almost done, because we just need to estimate this number M; but M is the same kind of counting as C(N, P), now with one extra linear constraint, namely that the hyperplane must pass through the new point, so it is C(N - 1, P). So I have counted the dichotomies, and I have a recurrence relation with an initial condition. Solving this recurrence gives the Cover formula I showed you before (an executable version of this recurrence is sketched after this passage), and of course I can derive the VC dimension and the storage capacity, the two main quantities: the VC dimension for statistical learning theory, and the storage capacity for statistical physics. What I would like to stress is that this approach is definitely complementary to the usual statistical physics approach, which was pioneered by Gardner in the eighties: as you can see, this approach gives exact results at finite size, and it allows you to compute things that are not possible to compute in the Gardner framework.

Now, I'm not sure how much time I have left, so it's probably fine if I don't show in detail how to generalize the procedure; what is important to know is that you can generalize it to the set of manifolds I described before. First, you can generalize it to segments, that is, to pairs of points. What you realize is that in the recurrence relation there appears a geometric quantity, which is basically the probability that both of the points you are adding lie on the same side of a random hyperplane. This probability depends on the geometry of your manifold, which in the case of segments is a single parameter, the overlap rho. Again you can solve the recurrence relation and compute the VC dimension and the capacity; this capacity, in particular, you can also compute with the standard Gardner framework in the thermodynamic limit. But what is most impressive is that these recurrence relations are again exact at finite N; these are not results valid only in the thermodynamic limit, and therefore they give me access to the VC entropy for a general number of points P and a general dimension N of the linear classifier. So although these results are not theorems, they are hand-waving arguments, let's say they are exact hand-waving arguments. And again, something very nice is that you can compute the storage capacity numerically from very, very small linear classifiers, because you basically see no finite-size effects in this formula. You can actually do more and compute this VC entropy for generic K, that is, for generic simplexes: in these formulas there appear other geometric quantities that depend on all the overlap parameters rho_ab that define your simplex. Again the VC dimension and the capacity can be derived, and there is, again, very good agreement between the theory and the numerical simulations.
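The recurrence for unstructured points, in executable form (a sketch; the structured K-point version modifies the coefficients with the geometric probability mentioned above, and is not reproduced here):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def C(P: int, N: int) -> int:
    """Cover's recurrence: a new point either extends a dichotomy one way
    only, or, when a hyperplane can pass through it (a counting with one
    extra linear constraint, hence the N - 1), it counts twice."""
    if N == 0:
        return 0          # no parameters left: no realizable dichotomy
    if P == 1:
        return 2          # a single point can be labeled either way
    return C(P - 1, N) + C(P - 1, N - 1)

# The recurrence reproduces the closed-form Cover formula exactly:
closed = lambda P, N: 2 * sum(comb(P - 1, k) for k in range(N))
assert all(C(P, N) == closed(P, N)
           for P in range(1, 30) for N in range(1, 15))
print(C(40, 20) == 2**40 // 2)   # True: at P = 2N, exactly half of the 2^P
                                 # dichotomies are linearly realizable
```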
So what we have basically done is to find an example of data structure simple enough that the VC entropy can be derived. But now we can do something more, and here I'm going towards the conclusion of the talk. If you rescale this number of dichotomies by 2^P, which is what I have done up to this point, you obtain these kinds of curves: the gray ones are for unstructured points and these are for pairs of points, for three different sizes of the linear classifier. But if you do not rescale these curves, what you see is a striking difference between unstructured points and structured points. In particular, for unstructured points the VC entropy at a large number of training patterns P grows polynomially with P, or equivalently with the load alpha, which is simply P rescaled by N. Whereas if there is structure, you observe a non-monotonicity, and in fact for structured data you observe the VC entropy going to zero asymptotically at large load. But there is actually more than this, because at very large alpha you observe another phase transition: for different sizes of the classifier, the curves intersect at this point alpha star. So the non-monotonicity is followed by an additional SAT/UNSAT phase transition. But what is this phase transition?

[Question] Sorry, can you remind me what the different shades of red of these curves stand for? [Answer] These are different sizes of the classifier. [Question] And K is the same? [Answer] K is always 2, yes; they are always the same kind of manifolds, just different sizes of the classifier. [Question] Thanks.

Okay. In this kind of problem there is always the constraint satisfaction problem associated with the Gardner storage capacity, which is basically the following: given a set of input-label pairs, which is the training set, find the hyperplane such that the training set is correctly classified. This other SAT/UNSAT phase transition corresponds to a different constraint satisfaction problem, which is: given a set of inputs, find a hyperplane AND a set of labels such that the resulting training set is correctly classified. So at this novel phase transition, what we are asking is whether at least one classification is possible.

Now I'm running extremely late and I don't want to bother you too much, but what I want to tell you is that you can study the features of this phase transition by considering the generating function of these numbers C(N, P), and you can exactly determine the critical point from the condition that the VC entropy must be asymptotically constant in N. Indeed, below the critical point the VC entropy grows as N grows, whereas above the critical point the VC entropy gets smaller when the size of the classifier is increased. So the critical point is defined as the point where the derivative of the VC entropy with respect to N is zero.
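The criterion just stated, written out explicitly (the notation is mine):

```latex
% VC entropy at fixed load alpha = P / N:
H_N(\alpha) = \ln \mathcal{C}_N(P = \alpha N)

% Critical point alpha*: the VC entropy is asymptotically constant in N,
\left.\frac{\partial H_N(\alpha)}{\partial N}\right|_{\alpha=\alpha^{*}} = 0,
% so for alpha < alpha* the entropy grows with N, for alpha > alpha* it
% decreases with N, and the curves for different N cross at alpha*.
```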
This gives a transcendental equation for the critical point that you can compare with numerics. The numerics is quite straightforward to do, because it's enough to see where two systems of different sizes intersect, and as you can see the numerics agrees quite well with the theoretical prediction, here in the case of pairs of points, which is the one we tried numerically.

But you can actually do more. As you may imagine, it is not always possible to construct this combinatorial reasoning, so it would be extremely useful to have a replica formalism to describe this phase transition. For those of you who know a bit of the replica formalism: you can introduce a synaptic volume very similar to the one introduced by Gardner, with the difference that you have one constraint for each multiplet, and the output must be the same for each point in the multiplet. The tricky difference with respect to the usual Gardner volume is that the sum over the labels is not quenched: it's not that you take the logarithm of this volume without the integration over the labels and then integrate over the labels; if you think about it for a while, the integration over the labels has to go inside the logarithm. It is reasonable to believe that this novel phase transition is captured by this slightly different Gardner volume. Just to mention it, you can do the replica calculations with different approximation schemes, annealed, replica symmetric, or one-step replica symmetry breaking, and what you get is that the qualitative shape of the curve of the critical point as a function of rho, in the case of pairs, is the same in all schemes, but you get quantitative agreement only if you employ at least a one-step replica symmetry breaking ansatz. Here I'm cheating a little, but if you like I can tell you more on this point later.

Just to conclude, I hope I could convince you that we can use an integrated approach, based on combinatorics and on statistical physics, to say something about the statistical learning theory of simple toy models of geometrically structured data. In particular, we observe a striking difference between the VC entropy of unstructured and structured data, polynomial growth versus non-monotonic behavior, which of course suggests that for structured data one could improve the rigorous bounds of statistical learning theory. And associated with this non-monotonic behavior is a novel, data-dependent phase transition that corresponds to another SAT/UNSAT problem. When it is not possible to say anything with combinatorics, which is actually what you expect in many cases for more complex manifolds, you can also investigate the nature of the phase transition with the standard Gardner replica framework used in statistical physics. And just to conclude, it would be very interesting to generalize this kind of reasoning to more complex generative models of data, and also, of course, to obtain improved rigorous bounds in statistical learning theory, because what we have done until now is of course not rigorous at all, and it would be nice to extend it with more rigor, let's say. Let me thank you for listening to me.
In particular, let me thank my collaborators on this project: Vittorio Erba, a young PhD student; Mauro Pastore, another PhD student, who is finishing right now; Marco Cosentino Lagomarsino, whom many of you may know; and of course Marco Gherardi, whose picture is bigger not only because he is nice looking, but also because he is the main driver of this project. So, thank you for your attention.

[Host] Thank you. There is time for questions.

[Question] If you go back to your MNIST example: how close would this structured data come to the curve for the true labels? [Answer] That's a good question. The answer is arbitrarily close, because of course I can choose a structure that mimics the curve I find. The problem is that I can find many structures that realize that curve. [Question] And the main point is: how big is K? [Answer] Wait a second, let me think. I would say around five or six, more or less; you need at least K equal to five or six to fit this curve here. [Question] Thanks. [Answer] Just a rough estimate.

[Question] If you're talking, you are muted. Thanks a lot. So the ultimate goal, if I understand well, of statistical learning theory is to get bounds on what you really want to compute, which is the generalization error. I was wondering if your model is simple enough that, using replica computations or other techniques, you could also get the generalization error itself, to see how close the bound is to the exact generalization error, at least in some asymptotic regime. [Answer] You mean using replica computations in something like the teacher-student scenario? [Question] Yes, for example. [Answer] This is something that we have not done, to be fair. We were more keen to understand whether we could say something in statistical learning theory, because my feeling is that computer scientists, maybe things are changing right now, do not really care about the teacher-student scenario. Maybe I'm wrong, but that's my feeling: they care about these VC entropies and this VC dimension. But actually this is an interesting point; I think we could definitely try to study a teacher-student setup to analyze the generalization. [Question] You could even go beyond teacher-student: at the non-rigorous level, if you have a generative model for the data. This is the whole idea, for example, of the paper you mentioned by Florent, Lenka, Marc Mézard and Sebastian. [Answer] We are studying these papers right now, so I cannot say much. [Question] Because in that paper the model is analytically tractable, but it's not teacher-student in the sense that you have matching architectures. [Answer] Do they compute, for instance, the VC entropy? I don't know; I know there is a paper by Lenka's group where they compute the Rademacher complexity,
if I remember correctly. But I don't know if you can really say something about the VC entropy with the statistical physics framework. I know there is a paper, by [inaudible], which is quite overlooked because it does not have many citations, where he claims he can say something about the VC entropy with a statistical physics framework; but that's the only example I know of where results related to the VC entropy are obtained with statistical physics. I think it's super difficult to do that with Gardner volumes and that kind of technique; that's what I know, but of course you may know more than me. [Question] So the question is whether you can access the generalization error directly, or whether it matters to compute the VC dimension. [Answer] Of course, if you are able to access the generalization error, you don't need bounds anymore. That's clear, but that's not usually the case; maybe there are particular examples where you can. The only reason you need the VC entropy is to establish these bounds in statistical learning theory; if you know the generalization error exactly, of course you don't need it. [Question] Okay, thank you.

[Host] There is another question. Karim, you can ask your question if you want. [Question] Hello, can you hear me? [Answer] Yes, of course. [Question] Thank you. I'm just wondering: is there a connection between the VC dimension in classical machine learning theory and in quantum machine learning theory? For example, the VC dimension for a radial basis function is, I think, theoretically infinite. But what about quantum embeddings, when we convert classical data into quantum data? For example, about two years ago there was a paper by IBM where they used quantum feature maps to transform nonlinearly separable data into linearly separable data in a complex space. So is there a relation, or an analogy, between the classical VC entropy or VC dimension and a quantum VC dimension? [Answer] To be fair, I have no idea; I have no idea what the VC dimension is in the quantum case. If you can send me a reference, I would be super interested in reading it. [Question] Can you give me a second? I'll try to get you the paper. [Answer] You can do it even later. I'm sorry, but I really don't know anything about the VC dimension in the quantum case, so I cannot be helpful; I've never read any literature on this topic. I would be glad to read it, but I really don't know anything. [Question] Yes, I think I sent you the paper. [Answer] Thank you. Thanks a lot.

[Host] So, it's quite late, so I'll stop the recording. If you want to leave, you can leave; if you want to chat more, you are free to stay in the room and chat as long as Pietro is willing to stay. Okay, thanks a lot again to everyone.