[transcription garbled] ...there are similarities between the structures of the two systems. In the brain, we have different visual areas. Visual areas are made of neurons. The neurons are the computational units in the brain, and as individual units they perform computations which are kind of similar to what we find in the nets. There are filtering operations, thresholding, normalization, pooling. And in terms of architecture, what we can see is that when it comes to object recognition, this function is mediated by a hierarchy of visual areas that starts in the retina, then keeps going through the LGN, and then arrives at primary visual cortex, V1. And then it keeps going through a sequence of cortical stages, V2, V4, and then the inferotemporal cortex, which form the so-called ventral stream. The ventral stream is the pathway that in the brain is processing object information, shape information, and that is underlying object recognition.
And so, talking about object recognition, I think it is very important to understand what is the key challenge in this task, both for an artificial and for a biological system. And the key challenge is to achieve invariance, and I can easily state this with a very simple cartoon. So if I have to recognize, for example, this target, this car, I need to be able to discriminate it from any other object I may find in the world, right? But at the same time, this is just one of the many images that this object can project on my retina; the car can appear under different poses, positions, translations, backgrounds. I need to understand that this is always my target, just transformed. So the visual system must be able to be sensitive to changes in the identity of the object, but insensitive to transformations of the object. It must be invariant to such transformations. And understanding how object identity is somehow factored out from these other confounding factors is actually one of the key challenges in visual science, but also nowadays in machine learning, where we try to understand how deep networks work, right? And I must say that we don't really have a final answer to this question. I mean, we don't really know exactly how the brain is able to achieve this, but we don't even know how deep networks exactly are able to achieve this, why they work so well, right? But as neuroscientists, as neurophysiologists, we can put out some ideas based on the many decades of investigation of these systems. And I think these ideas can be summarized in an overall hypothesis that is, I think, nicely depicted in this opinion paper that was written by Jim DiCarlo and David Cox back in 2007. So I'm going to show you now a few pictures taken from this paper, and these pictures are based on simulations that Dave performed at the time. But keep in mind that these simulations are actually based on known tuning properties of visual neurons in the ventral stream. So what is going on here is that if you look at the retina, the input layer of the visual system, and you look at how two objects, let's say two faces, Joe and Sam, are represented in the space defined by the activations of the neurons in the retina, well, this representation will be two manifolds which are highly curved and highly intertwined with each other, very much tangled. When you move onward along the ventral stream and you arrive, for example, at V1, these manifolds will still be quite tangled and quite curved, a little bit less, because V1 has developed a better specificity for visual features. So in V1 we have neurons that are edge detectors and that have also gained some small amount of position invariance, for example. But when you arrive at the apex of this hierarchy, at the anterior part of the inferior temporal cortex, then the manifolds will have become much flatter, much less tangled, and as a consequence, it is possible to separate them using a simple linear classifier. So you can place a hyperplane here and tease apart these two object categories.
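To make this linear-readout idea concrete, here is a toy sketch with invented data, not the actual analysis from the study: two hypothetical "population response" clouds stand in for the manifolds of Joe and Sam, and a linear classifier from scikit-learn plays the role of the hyperplane. All numbers and names here are illustrative assumptions.

```python
# Toy illustration of a linear-readout test: if two object manifolds are
# untangled, a simple hyperplane should separate them with high accuracy.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical "population responses": 500 samples per object, 100 neurons.
# The class means are offset, so the two clouds are linearly separable.
joe = rng.normal(loc=0.0, scale=1.0, size=(500, 100))
sam = rng.normal(loc=0.5, scale=1.0, size=(500, 100))

X = np.vstack([joe, sam])
y = np.array([0] * 500 + [1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)
print(f"linear readout accuracy: {clf.score(X_te, y_te):.2f}")  # near 1.0
```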
So this is what we can call the untangling hypothesis, and we can kind of generalize this idea, which comes from neuroscience, also to machine learning. In this case we can consider that the input space, of course, is the pixel space, the output space can be considered as the last hidden layer space, and what the untangling hypothesis states is that the object manifolds become less tangled, they become flatter, and they also become lower dimensional because, of course, as they become flatter, they occupy less volume of the embedding space. Now, I'm calling this a hypothesis, but some of what I'm saying here is actually quite well established. For example, it is known in neuroscience that, in fact, if you read out the representation of objects in the last part of the inferior temporal cortex, these representations are, in fact, linearly separable. So you can actually read out very easily the identity of objects from the population activity of inferior temporal neurons using a simple linear classifier. So the representations are, in fact, less tangled. And the same is kind of true for deep nets, right? I mean, the fact that they work so well is because at the last layer, not the softmax layer, they are separating very nicely the representations of the different manifolds. But what is not really so clear is whether they actually become flatter, so whether this untangling happens because there is a flattening of the underlying manifolds, and whether there is actually a reduction in dimensionality. So the target of our work, as I'm going to describe now, was to understand this. And we focused especially on the dimensionality. So our goal was to track the intrinsic dimension of image representations across the layers of deep networks. And our experimental design was the following. We took various state-of-the-art deep networks: AlexNet, VGG, VGG with batch normalization, and ResNet. And then we estimated the intrinsic dimension in a subset of checkpoint layers. These were the input and output layers, the pooling layers following a convolutional layer or a stack of convolutional layers, or in the case of ResNet, a ResNet block, and the fully connected layers. And the questions we asked were: how does the intrinsic dimension vary across the layers of these deep nets? How linear or flat are the data manifolds? And is there any relationship between the intrinsic dimension in the last hidden layer and the performance that the network can actually achieve? So, of course, in order to do this, we need a way to estimate the intrinsic dimension of these data manifolds. And the intrinsic dimension can be defined as the minimal number of variables or coordinates that you need to describe your data points without significant information loss. Now, if the data live in a linear subspace, on a hyperplane, this estimation is quite easy. We all know that we can use principal component analysis, and we can see this in this simple cartoon. If my data live in a two-dimensional embedding space, but they actually stretch out along a line, I can compute the principal components of the covariance matrix, and I can look at which of these components actually explain most of the variance. I can throw away the components which do not explain much variance, and I keep the others. And so I will figure out that this representation actually lives in a one-dimensional subspace.
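As a minimal sketch of this PCA-based reasoning (assuming the 90%-of-variance criterion that comes up later in the talk), one could simply count the principal components needed to explain most of the variance:

```python
# Minimal sketch of a PCA-based intrinsic-dimension estimate: the number of
# principal components needed to explain a fixed fraction of the variance.
import numpy as np

def pca_id(X, variance_fraction=0.9):
    """X: (n_samples, n_features). Returns the PCA-based ID estimate."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)               # covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, descending
    ratios = np.cumsum(eigvals) / eigvals.sum()  # cumulative explained variance
    return int(np.searchsorted(ratios, variance_fraction) + 1)

# Points on a line in 2D, plus small isotropic noise: PCA-ID should be 1.
t = np.random.randn(1000, 1)
X = np.hstack([t, 2 * t]) + 0.01 * np.random.randn(1000, 2)
print(pca_id(X))  # -> 1
```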
But the problem is how to do the same when the data actually live on a nonlinear manifold, which is the general case. So, in this case, there are some other approaches to computing the intrinsic dimension. What we used was an approach that was recently developed by the group of Alessandro Laio and described in this paper here, Facco et al., 2017. And let me just describe how this works. So, suppose that you have, in fact, data living on this, let's say, one-dimensional manifold. What you can do is go through all the points. Say you start from point i, and then you compute the distances between this point and its first and second nearest neighbors, r1 and r2. And then you compute the ratio between these two distances, mu_i = r2 / r1, and you do this for all the points in your data set. And what Alessandro has shown is that this variable is distributed according to a Pareto distribution, P(mu) = d * mu^-(d+1), which does not depend on the local density of the points, but does depend on the intrinsic dimension d of the data manifold. So, one can simply compute the distribution of mu from the empirical data and then estimate d, the intrinsic dimension. And something which is quite nice is that one can run a scale analysis, because it has been shown that the ID estimate, so the estimate of the intrinsic dimension, is reliable if it is scale invariant. So, this means that we should compute d as a function of the number of data points n, and then if we observe a plateau in this relationship between d and n, we can trust the estimate of d. And just to show you a couple of examples from this paper of Facco et al.: these are simulated data living on a hyperplane, a two-dimensional hyperplane embedded in a 20-dimensional space, and there is Gaussian noise added to these data, to make them spread around a little bit, and sigma is giving the amplitude of this noise. And so, when you do this scale analysis, you can see that d changes as a function of n, but there is a region of n in which the estimate is very stable, and this estimate is actually 2, which is the actual dimension of the data here. And this applies not only to the linear case, but also to nonlinear cases like this one, in which a Gaussian was wrapped around a Swiss roll, again with the addition of Gaussian noise. As you can see, there is a tendency for d to increase while you get more data, as you are kind of filling up the volume of your embedding space, but there is a region where d is very stable, it is flat, and this corresponds to the actual value of the embedding dimension, sorry, of the intrinsic dimension, which is 2. (A minimal sketch of this estimator follows below.) So, we used this approach to systematically track the dimensionality of representations in deep networks. So, we started our experiments with VGG-16, which is by now a quite standard benchmark in the field of machine learning, I mean, it can achieve very good performances. And this was downloaded from PyTorch and pre-trained, it was already pre-trained on ImageNet, but then, when we looked at the representation of the data manifolds, we didn't use the ImageNet images, we actually used another set of images that come from a neurophysiology paper of ours that we published last year, actually this year. And these objects are computer graphics models, 40 computer graphics models. As you can see, they are quite realistic objects, and they are rendered in 36 different views. So, for example, you can see here the 36 views of the starfish. And the total number of images is 1,440.
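Here is a minimal sketch of the TwoNN estimator just described, following the recipe of Facco et al., 2017; the straight-line fit and the fraction of points discarded are simplifying assumptions on my part.

```python
# Minimal sketch of the TwoNN intrinsic-dimension estimator (Facco et al., 2017).
# For each point, mu = r2/r1 (ratio of distances to its 2nd and 1st nearest
# neighbor); under the model, F(mu) = 1 - mu^(-d), so d is the slope of a
# straight-line fit of -log(1 - F(mu)) against log(mu), through the origin.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_id(X, discard_fraction=0.1):
    """X: (n_samples, n_features). Returns the TwoNN ID estimate."""
    n = X.shape[0]
    # k=3 because each query point is returned as its own 0th neighbor.
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = np.sort(dists[:, 2] / dists[:, 1])
    # Empirical CDF; drop the largest mu values, which are the noisiest.
    keep = int(n * (1.0 - discard_fraction))
    F = np.arange(1, n + 1) / n
    x = np.log(mu[:keep])
    y = -np.log(1.0 - F[:keep])
    # Least-squares slope of a line through the origin.
    return float((x @ y) / (x @ x))

# Sanity check: points on a 2D plane embedded in 20 dimensions.
X = np.zeros((2000, 20))
X[:, :2] = np.random.randn(2000, 2)
print(two_nn_id(X))  # -> close to 2
```

The scale analysis mentioned above would then amount to re-running two_nn_id on random subsets of increasing size n and checking for a plateau in the estimated d as a function of n.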
So, why did we start with this object set? Because we wanted, in our first exploration, to contain a little bit the variability and the complexity of the visual features in the data set. So, these are still very complex objects, but they are black and white, actually they are grayscale, they don't have a background, and they span the image space a little less widely. And what we did is we took VGG-16, as I said, pre-trained, but we fine-tuned the last few layers of the network so that it was performing very well on this classification task, on these images. And then we looked at the intrinsic dimension. What we found, again by sampling these checkpoint layers, was this, I would say stereotypical, as I will show you later, hunchback profile. So, the intrinsic dimension starts at a given value, then it grows quite substantially, and then it decreases monotonically to become quite small in the last hidden layers. And something that should be noticed here is that the intrinsic dimension is really much, much lower than the dimension of the embedding space. So, here, these numbers are telling you how many units there are in each layer of this network. So, for example, if you look here, we have 800,000 units, which means an 800,000-dimensional embedding space, while the dimensionality of the actual manifold is about 70. And the same applies in the last hidden layers: you can see that the dimensionality goes down to 7 or 6. So, it's very, very, very, very low dimensional. And we ran the scale analysis, of course, to check that our estimate of the intrinsic dimension was actually reliable, robust.
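For concreteness, here is a hypothetical sketch of how activations at such checkpoint layers could be collected in PyTorch with forward hooks; the actual layers sampled in the study, and the image preprocessing, may differ from this simplified version, which just grabs every pooling output of VGG-16.

```python
# Hypothetical sketch: collect activations at the pooling layers of a
# pre-trained VGG-16 using forward hooks, one feature vector per image.
import torch
import torchvision.models as models

model = models.vgg16(pretrained=True).eval()  # older torchvision API

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Flatten each image's feature map into a single vector.
        activations[name] = output.flatten(start_dim=1).detach()
    return hook

for idx, layer in enumerate(model.features):
    if isinstance(layer, torch.nn.MaxPool2d):
        layer.register_forward_hook(make_hook(f"pool_{idx}"))

with torch.no_grad():
    images = torch.randn(16, 3, 224, 224)  # stand-in for a real image batch
    model(images)

for name, act in activations.items():
    print(name, act.shape)  # (16, n_units): rows are points on the manifold
```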
So, this, of course, was just a small dataset and one architecture, VGG-16. After this first experiment, we wanted to test how systematic and how robust this trend was, this hunchback profile that we found. So, we took a bunch of architectures and several instances of networks for each of these architectures, all pre-trained on ImageNet. And this time, we actually computed the intrinsic dimension for the seven most populated categories of ImageNet. So, there are a few object classes that contain a lot of samples. And we computed the intrinsic dimension independently for each of these classes, using 500 images for each, and then we averaged the results. And this is what we get. So, we see, again, the trend that we saw before, this evolution of the ID with a stereotypical pattern, which is this hunchback shape. So, the representation starts out quite low dimensional in the initial layer, then it goes up, and then it goes down. And again, the intrinsic dimension is much lower than the embedding dimension. But you can also see here that these different networks have a really similar trend, and it would actually be almost identical if only they had the same number of layers. The fact is that these networks have different numbers of layers; AlexNet has just nine layers, while ResNet-34 has 34 layers. So we can plot these same curves on a normalized layer depth, that is, as a function of an index which gives the relative depth of the layer within the network. And what we found is that, surprisingly, all these curves kind of collapse onto a single profile, right? With the exception of AlexNet, the other networks have a peak more or less in the same range of relative depth, between 0.2 and 0.4. Then they fall off with roughly the same slope, although there is a little bit of a difference between the ResNet family and the VGG family. So, there is a considerable overlap between these profiles, and the conclusion would be that, after an initial growth of the intrinsic dimension, deep networks perform a progressive dimensionality reduction of the object manifolds. [Question from the audience] Yeah, there are differences, right? I kind of show this here if you look at the picture of VGG. So, there is actually an alternation of different kinds of processing layers: you have convolutional layers, pooling, then you have fully connected layers, right? Then you have the softmax. So, each network, of course, has its own architecture. What we tried to do was to standardize the checkpoint layers at which we took our measurements, and these were the input and output, but then we targeted the pooling layers after a stack of convolutional layers, and then, finally, the fully connected layers. And this can be done in all the architectures that we investigated. Are there other questions? [Question from the audience] I will come back to this later, because we have some intuition about why there is such a trend, and I will give some ideas later. And so, coming back to this trend here, right, one can wonder at this point if the dimensionality of the representation, especially the one reached in the last hidden layer, has anything to do with how good the network is, with how good the classification accuracy is that the network can achieve, right? So, we tested this, and, again, we used the same architectures, again pre-trained on ImageNet. In this case, we used a random subset of 2,000 training images, taken randomly from ImageNet, to compute the intrinsic dimension, and then we compared it to the classification accuracy, the top-5 error provided by PyTorch when you download the net, right? And this is what we found. There is a very strong and clear linear relationship between the intrinsic dimension and the error rate of the network. So, the lower the intrinsic dimension achieved in the last hidden layer, the lower the error rate. The relationship is very much linear, a very strong correlation. And this applies not only when you pool together all the different architectures, but also within a single architecture. For example, if you just focus on the ResNets, we can see this very strong linear relationship between the performance of the network and the dimensionality in the last hidden layer. Now, something important is that while the error is computed on the test images, okay, our estimate of the ID was instead computed on the training set. Which means that somehow the intrinsic dimension on the training set is a strong predictor of the performance of the network on the test set. So, somehow the intrinsic dimension can be taken as a proxy for the generalization ability of the network.
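Purely as an illustration of how this ID-versus-error relationship could be quantified, here is a sketch with invented numbers; these are not the values from the study.

```python
# Hypothetical (last-hidden-layer ID, top-5 error) pairs, one per network,
# used only to show how the linear relationship could be measured.
import numpy as np
from scipy import stats

last_layer_id = np.array([14.0, 12.5, 11.0, 10.2, 9.1, 8.3])   # made up
top5_error = np.array([20.9, 18.2, 16.4, 14.8, 12.0, 10.1])    # made up

r, p = stats.pearsonr(last_layer_id, top5_error)
slope, intercept, *_ = stats.linregress(last_layer_id, top5_error)
print(f"Pearson r = {r:.2f} (p = {p:.3g}); "
      f"error ~ {slope:.2f} * ID + {intercept:.2f}")
```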
So, coming back to our original hypothesis, we can kind of say that, yes, there is, in fact, a decrease of dimensionality. It looks like these networks are trying to compress the dimensionality of the representation, though we should keep in mind that there is also this initial expansion, and I will come back to that later. But of course, there is still the main question of whether these manifolds are also becoming flatter, more linear. So, this is still an open question, and we wanted to target it. And of course, there are different ways to do this. You can actually try to measure the curvature of the manifolds; there are some geometrical approaches to do this. But you can also exploit the estimation of the dimensionality, because, for example, in a linear case like this, the dimensionality estimate yielded by PCA should be equal to the dimensionality estimate yielded by our estimator, TwoNN. And actually, this is something that Alessandro's group has shown: if you have data that live in a linear subspace, their estimator just returns the same value that is returned by PCA. But when this is not true, when the data live on a nonlinear manifold, PCA will in general grossly overestimate the actual intrinsic dimension, because it is not able to capture, I mean, to see the underlying structure of the data. It just sees data filling a big chunk of space, a big volume. And so, what you should expect is that if the manifold is nonlinear, there will be a big difference between the two estimates: the linear estimate with PCA should be much, much larger than the estimate that you get from TwoNN. And so, what we did was take our architectures and perform PCA; this is an example on the last hidden layer. And we rank-ordered the spectrum of eigenvalues according to their magnitude. And what you can see immediately is that there is a very smooth variation of the magnitude of the eigenvalues as a function of the rank, right? So, we do not really see a gap in the spectrum of eigenvalues, which by itself is an indication that this manifold is probably not linear, right? We should expect to see just a few components capturing most of the variance if the data lived in a linear subspace. So, the data manifolds already do not appear to be linear, just based on looking at the spectrum of eigenvalues. But you can still define an intrinsic dimension based on PCA as the number of principal components that account for 90% of the variance in the data set. This is something that people have done in the past, right? And then you can compare this estimate of the intrinsic dimension with the one yielded by our approach, by TwoNN. And so, these are the trends that we get across the layers of a deep net, VGG-16. This is the same plot that I showed you before, the hunchback profile, in the case of TwoNN, and this is what you get with PCA. You can see that, first of all, there is a big difference in terms of magnitude. So, the estimate yielded by PCA is at least four times larger here. But the shape is also very different. There is basically just a flat plateau. And although there is a little bit of a decrease of the dimensionality toward the last hidden layer, the dimensionality here still remains very high. It's about 100, while, if you remember, here it was, I think, 10. So, it is at least 10 times larger than what we got with our method. So, this is what is happening.
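Reusing the pca_id and two_nn_id sketches from above (they are assumed to be in scope), the same qualitative PCA-versus-TwoNN gap can be reproduced on a toy curved manifold, scikit-learn's Swiss roll:

```python
# On a curved 2D sheet rolled up in 3D, the PCA-based estimate overshoots the
# true intrinsic dimension, while TwoNN recovers a value close to 2.
# Assumes the pca_id and two_nn_id functions sketched earlier are defined.
from sklearn.datasets import make_swiss_roll

X, _ = make_swiss_roll(n_samples=3000, noise=0.05, random_state=0)
print("PCA-based ID:", pca_id(X, variance_fraction=0.9))  # overshoots 2
print("TwoNN ID    :", two_nn_id(X))                      # close to 2
```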
But it is also interesting to compare this profile for trained networks with what you get for untrained networks. So, this is a network in which we simply take a random initialization of the weights; we do not train it. But we can still feed images to the network and see what the dimensionality of the representation of these images is. And what you see is that, in the case of the estimate yielded by PCA, there is not much difference between the curves. The untrained one is just a little bit lower, almost as if it were just shifted down by a few tens of dimensions. While the profile that you get with TwoNN is very different. So, in this case, you don't get the hunchback shape anymore. You simply get an almost flat trend of the intrinsic dimension, which is something that you should expect, because if you just have units that apply a random linear weighted sum to the input, what you are basically doing is simply a rotation of the vectors in the input space. So, the structure of the representation is not changing. You are performing a sort of orthogonal transformation, and so you should expect the intrinsic dimension to remain the same. And so, taken together, all these analyses seem to point to the fact that the data manifolds... oh, sorry. Let me see if it is going to start. Oh, okay. Sorry, guys, but the presentation crashed. Let's try to get it back into shape. Okay, so I think it should be working again. Yeah, so what I was trying to say is that, I mean, we can conclude from this that the data manifolds become lower dimensional, but they don't appear to become much more linear. So they still live on curved surfaces. So, when we come back to this hypothesis here, it looks like, at least in deep nets, these data representations don't really become flatter. So, to conclude my presentation, I just want to point out one other thing, namely this initial increase that we see, this expansion of the intrinsic dimension from the input layer to the first hidden layers, right? We wanted to try to figure out what the reason for this is. And we had a very simple intuition. So, if you look at these images, right, if you look at a dataset like ImageNet, you can see that these images are super complex. They contain a lot of features, but they also contain low-level features. Just think about the color, the hue, the luminosity, the contrast, the textures. So, these features can impose gradients over the image set. And these gradients may actually be correlated with each other. For example, luminosity may somehow be correlated with some specific hue or color or contrast. So, what is probably going on is that these strong gradients of low-level features are squeezing the representation in the input layer to live on a rather low-dimensional subspace, because this is what we see here, right? For all the networks, the representation starts out living in a very low-dimensional space. And so, what could happen is that the network is actually trying to remove these low-level features, these correlations in the data, and this is why there is this initial expansion. So, to verify this, we ran an experiment with a simple network, shown here; it is just based on two convolutional layers, two pooling layers, one fully connected layer and one output layer. And we used as a dataset something much simpler, the MNIST image set of handwritten digits. As you can see here, we don't have any crazy backgrounds. The background is just black, the digits are just white. So, these are very simple images, quite standardized, right? There are no gradients of low-level features here. So, we can train the network to classify these digits; of course, the network can achieve very high performance. Then we can compute the intrinsic dimension. And what we see is that there is no longer any initial increase of the dimensionality. So, the dimensionality starts high, and then it just keeps decreasing monotonically.
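Here is a minimal PyTorch sketch of a network matching that description: two convolutional layers, each followed by pooling, one fully connected layer, and an output layer for the ten digits. The exact channel and unit counts are my assumptions, not the ones used in the study.

```python
# A small CNN for MNIST: two conv layers, two pooling layers, one fully
# connected layer, and a 10-way output layer.
import torch
import torch.nn as nn

class SmallMnistNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # conv 1
            nn.ReLU(),
            nn.MaxPool2d(2),                               # pool 1: 28 -> 14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # conv 2
            nn.ReLU(),
            nn.MaxPool2d(2),                               # pool 2: 14 -> 7
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 7 * 7, 128),  # fully connected layer
            nn.ReLU(),
            nn.Linear(128, 10),          # output layer (10 digit classes)
        )

    def forward(self, x):
        return self.fc(self.features(x).flatten(start_dim=1))

net = SmallMnistNet()
print(net(torch.randn(8, 1, 28, 28)).shape)  # -> torch.Size([8, 10])
```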
So, again, what we think is happening here is that this is different from a dataset like ImageNet, where, you know, if you look at walking dogs, well, they are typically on grass, so there is a lot of green; or a husky appears over a background that contains snow, so there is a lot of white; or here, the sailing boat comes with a lot of blue, because there is the sea. So, these gradients of low-level features are probably responsible for projecting the data onto a low-dimensional subspace. So, how can we test this with our dataset? Well, very simply, we can take our original MNIST dataset and we can add a random luminance background. So, we can impose a luminance gradient over this transformed MNIST dataset. So, you can imagine that what we are now doing is taking all these images that are living in a cloud, right, and then kind of stretching this cloud along a line, which is the one imposed by the luminance gradient. So, when we train the network on these images and we look at the dimensionality, this is what happens. The dimensionality, as expected, starts from a much lower value in the input layer, and then the network actually takes a few steps to increase the dimensionality before it starts decreasing it again. So, this is kind of reminiscent of the hunchback profile that we saw before. And so, our interpretation of this hunchback profile is that there is this need by the network to initially get rid of these low-level features.
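A hypothetical sketch of this manipulation: each MNIST image gets its own random background luminance level, so that a strong low-level gradient now spans the whole dataset. The exact form of the luminance gradient used in the study is not specified here, so this is just one plausible version.

```python
# Add a random background luminance level to each digit image, stretching the
# image cloud along a "luminance" direction in pixel space.
import numpy as np

def add_luminance_background(images, rng=None):
    """images: (n, 28, 28) floats in [0, 1]. Returns corrupted copies."""
    if rng is None:
        rng = np.random.default_rng()
    out = np.empty_like(images)
    for i, img in enumerate(images):
        level = rng.uniform(0.0, 0.5)   # per-image background luminance
        out[i] = np.clip(img + level, 0.0, 1.0)
    return out

fake_digits = np.random.rand(100, 28, 28) * 0.2  # stand-in for real MNIST
corrupted = add_luminance_background(fake_digits)
print(corrupted.shape, float(corrupted.min()), float(corrupted.max()))
```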
And here, as I said, I'm a neuroscientist, so I just want to make a quick connection to physiology. So, just to summarize, it looks like that in a trained network the initial expansion of the intrinsic dimension reflects the pruning of low-level, highly correlated visual features, which are not essential for the classification task. And this is a little reminiscent of a finding that we had a few years ago in rat visual cortex. So, let me just mention this quickly. Here, in this study, we were investigating how visual objects are processed along this pathway, which you can consider to be the homologue of the ventral stream in the rat brain. So, this is a sequence, a progression, of visual areas named V1, LM, LI, and LL. And what we found is that if you look at the amount of information that individual neurons convey about object identity, this steadily decreases across the pathway. But when you break down this information into two components, a component carrying low-level information, the luminance of the visual input, and a component carrying higher-order information, you can see that the higher-order part is actually quite preserved. In fact, it is not different from V1 to LL. What is actually getting thrown away along this pathway is the information about luminosity. So, what we are kind of seeing here is something similar. It looks like, both in the rat brain and in these nets, there is an initial processing stage: the initial layers are trying to, let's say, make the distribution of the data spherical again before performing some further processing to classify these data properly. And now, I just want to show one last experiment that we've done, again with MNIST, which is also quite interesting, because something that one can do is to take a dataset, a set of images, and shuffle the labels. And, of course, if you do so, there is no consistent classification that the network can learn, because to the category 5 will be assigned digits that are not 5. It may happen that there is a 5, it may happen that there is a 1, it may happen that there is a 9 or whatever. But still, previous studies have shown that these networks are such good learners that they can actually learn an arbitrary classification. So, you can actually train the network that I showed you before, and the network will achieve 90% or more correct performance on the training data, right? But, of course, there is no generalization that can possibly be done. So, we wanted to see what happens to the dimensionality in this case, and what happens is shown here. So, in this case, the intrinsic dimension, rather than decreasing in the last layers, jumps up, right? So, it seems that there is something specific about the decrease in dimensionality that has to do with training the network on a generalizable dataset. When this is not the case, the dimensionality simply goes up, rather than down. So, I'm done. I wanted also to point out that this finding is reminiscent of what other authors, like Ma and colleagues, have already reported in the literature. And just to sum up what I showed you: we found this highly stereotyped trend of the intrinsic dimension across the layers of the nets, with this hunchback profile. We saw this very strong correlation between the performance of the network and how small the intrinsic dimension in the last hidden layer can be. We saw that, even in spite of such a compression in dimensionality, the data still live on nonlinear manifolds. And we saw that the initial expansion of the intrinsic dimension appears to be due to this process of discounting low-level information. And this can be depicted, as a final take-home message, in this drawing here, where we can see that in the input layer the data are kind of correlated with each other, so they live in a lower-dimensional manifold. The initial layers kind of make the distribution spherical again; they throw away these correlations. But then the networks start mapping the representations into manifolds that become increasingly lower dimensional. So, this is it. I just want to thank again the collaborators in this study, Alessandro, Jakob, and especially Alessio, who is also about to fly to Vancouver to present a poster about this work at NeurIPS. If you happen to be there, you can actually stop by and ask him about the more technical details that I will not be able to answer. And finally, some acknowledgments to the funding agencies that allowed me to support this work. Thanks.