Yes, you can go. OK, I can start now. OK, thank you very much, Marie-Lou, for your invitation. So I'm John Zarka, a PhD student at ENS Paris under the supervision of Professor Stéphane Mallat, and I'm going to present our work carried out over the last year in collaboration with Florentin Guth, another PhD student, which deals with separation, concentration, and phase collapse in deep networks.

The starting point of the work is the experimental evidence that deep neural networks progressively concentrate each class around separated means, up until the last layer where intra-class variability nearly collapses while the separation of class means is preserved. That means that if we start from an input image x, standard CNNs produce a last-layer representation Φ(x) in whose space classes, represented here by different colors, are strongly concentrated around separated means. This evidence is interesting because it adds geometrical information on how neural networks work, beyond pure classification accuracy. In our work, we aimed at explaining the mathematical principles behind this evidence by structuring a network, with as little learning as possible, which performs explicit separation and concentration operations while reaching ResNet accuracy on ImageNet.

A low linear classification error is achieved when the distances between class means, normalized, or more specifically whitened, by the within-class covariance, are much greater than 1. Assuming, in a rough approximation, that all classes share the same within-class covariance, the linear classification accuracy is correlated with the trace of the between-class covariance Σ_B normalized by the within-class covariance Σ_W, that is, Tr(Σ_W^{-1} Σ_B), which is the Fisher ratio. This ratio simplifies the characterization of any improvement in linear classification accuracy, which may come from a mean separation, where the trace of the between-class covariance increases, or from a concentration of within-class variabilities, in which case the trace of the within-class covariance decreases. Since this ratio is invariant to any invertible linear operator, including scaling, and since a scaling can transform a separation into a concentration and vice versa, we use in our network normalized tight frames, for which F^T F equals the identity.

The first step is to understand how to perform separation, and a simple lemma gives us an answer: if a nonlinear representation Φ admits a linear inverse, then by construction it necessarily improves linear classification accuracy and the Fisher ratio. Wavelets with multiple phases followed by a ReLU, alongside a low-frequency channel, constitute a tight frame which satisfies this linear invertibility property. Wavelets separate image variations at different scales, orientations, and phases, and they appear naturally in the low-level features of standard CNNs like AlexNet; hence, here they are not learned but predefined. We can see here, visually, with a simple example of Gaussian classes, how rectification performs a mean separation of two processes with the same means. Before rectification they are not linearly separable at all, while linear separability increases drastically after applying a ReLU, provided the filter F applied before the ReLU induces a suitable rotation. When this separation operator, with a filter bank at a single scale, is cascaded, it defines a tree, which we call a scattering network.
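To make the Fisher ratio and the rectification example above concrete, here is a minimal numerical sketch. The toy construction (two same-mean Gaussian classes elongated along the two diagonals) is my own illustration, not the actual figure from the talk:

```python
import numpy as np

def fisher_ratio(x, y):
    """Tr(Sigma_W^{-1} Sigma_B): between-class covariance whitened by the
    within-class covariance, used as a proxy for linear separability."""
    mean = x.mean(axis=0)
    d = x.shape[1]
    sw, sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        xc = x[y == c]
        mc = xc.mean(axis=0)
        sw += (xc - mc).T @ (xc - mc) / len(x)
        sb += len(xc) / len(x) * np.outer(mc - mean, mc - mean)
    return np.trace(np.linalg.solve(sw, sb))

def rot(theta):
    """2x2 rotation matrix."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

relu = lambda z: np.maximum(z, 0.0)

# Two zero-mean Gaussian classes elongated along the two diagonals:
# identical means, so they are not linearly separable before rectification.
rng = np.random.default_rng(0)
n = 2000
x0 = (rng.normal(size=(n, 2)) * [3.0, 0.3]) @ rot(np.pi / 4).T
x1 = (rng.normal(size=(n, 2)) * [0.3, 3.0]) @ rot(np.pi / 4).T
x = np.vstack([x0, x1])
y = np.repeat([0, 1], n)

print("ReLU alone:         ", fisher_ratio(relu(x), y))        # means stay equal: near 0
F = rot(-np.pi / 4)  # a suitable rotation aligns the class axes before the ReLU
print("ReLU after rotation:", fisher_ratio(relu(x @ F.T), y))  # means separate: much larger
```

The ratio is near zero with the ReLU alone and clearly larger after the rotation, because the rectifier then turns the difference in class variances into a difference in class means.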
This cascade has a number of channels which grows exponentially with depth, and it needs to be reduced with one-by-one orthogonal projectors. In the case of the scattering tree, those projectors are predefined: they simply consist in pruning the branches of the tree comprising more than two ReLUs and averaging the different phases, and in that case there is no learning at all. But we can also improve this prior by learning those one-by-one projections, which greatly improves results, leading to what we call a projected scattering network.

We then want to reduce intra-class variability and concentrate each class around its mean. We show that this concentration is obtained with an operator φ(x) which is a soft thresholding in a one-by-one convolutional tight frame F, and the F^T ρ F structure of φ keeps the same dimensionality as the input (a short code sketch of this operator is given below). The theorem proves that, under a Gaussian mixture model hypothesis, if the class means are sparse in F, then this operator indeed concentrates each class towards its mean. This result simply stems from the fact that, if intra-class variabilities are not sparse in F while the class means are, thresholding will barely move the means while it contracts the intra-class variabilities. The concentrated scattering network inserts this concentration operator after each mean separation. F needs to be learned, because we do not have prior information on the structure of the class means, but learning can be reduced to one-by-one operators. We may also use a rectifier with a fixed threshold instead of a soft thresholding, which can interpolate between pure separation and concentration, and this slightly improves results.

The main result we obtain is that our deep concentrated scattering network outperforms ResNet on ImageNet, reaching a top-1 error 0.5% below ResNet and a top-5 error 0.2% below, yet with six fewer layers and by learning only tight frames along channels, which perform separation and concentration. And I remind you that there are no learned spatial filters and no learned biases. Projected scattering alone outperforms AlexNet by roughly 2% in top-1 error and 0.5% in top-5, while reducing the error of the scattering tree by a factor of 2. We also observe that the Fisher ratio correlates with the linear classification accuracy: it increases when the classification accuracy increases, but it also increases from layer to layer, which is evidence of a progressive improvement of linear separability.

But additional experiments show that sparsity and biases do not seem necessary: there is only a 1% loss in accuracy when we set the threshold of the ReLU to 0. So the question now becomes: how can we remove within-class variabilities without thresholds? We give the intuition here using translation variabilities as an example, which form a simple subgroup of within-class variabilities. The first lemma tells us that, provided the filter ψ we use has a sufficiently small bandwidth, a translation by τ turns into a variation of the phase e^{iξτ}, where ξ is the center frequency of our filter ψ. This property ensures that the cascade of rectified wavelets with multiple phases remains equivariant to translations, even with subsamplings, which is not the case for traditional CNNs, since a spatial translation here turns into a translation along the phase channels.
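Going back to the concentration operator mentioned above, here is a minimal sketch of a soft thresholding in a one-by-one (channel-wise) tight frame, φ(x) = F^T ρ_λ(F x). The frame here is just a random orthogonal matrix for illustration, not a learned one:

```python
import numpy as np

def soft_threshold(z, lam):
    """rho_lambda(z) = sign(z) * max(|z| - lambda, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def concentrate(x, F, lam):
    """phi(x) = F^T rho_lambda(F x): a 1x1 (channel-wise) tight-frame thresholding.
    x has shape (channels, height, width); F has shape (frame_size, channels) with F^T F = Id."""
    z = np.einsum('kc,chw->khw', F, x)     # analysis: project channels onto the frame
    z = soft_threshold(z, lam)             # contract components that are not sparse in F
    return np.einsum('kc,khw->chw', F, z)  # synthesis: F^T brings back the input dimensionality

# Toy tight frame: an orthogonal matrix (so F^T F = Id holds exactly).
rng = np.random.default_rng(0)
F, _ = np.linalg.qr(rng.normal(size=(8, 8)))
x = rng.normal(size=(8, 4, 4))
print(concentrate(x, F, lam=0.5).shape)  # (8, 4, 4): same dimensionality as the input
```

With λ > 0, within-class variabilities that are not sparse in F are contracted, while a signal aligned with a few frame vectors (like a sparse class mean) is barely moved, which is the concentration mechanism of the theorem.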
Invariance to translations can then be achieved by averaging over those phase channels, which leads to a modulus nonlinearity applied to the high-frequency filters. The modulus concentrates the variability by collapsing the phase induced by the translation. The use of complex-valued high-frequency filters is justified by the fact that, due to the Hermitian symmetry of the Fourier transform of real filters, real filters cannot have a small bandwidth and hence, following the lemma, they cannot turn the translation variability into a phase.

An extreme case of small spectral bandwidth is the Fourier basis. The modulus of the Fourier transform provides a richer invariance to translations than any linear translation-invariant operator, which simply keeps the mean of the signal. But the modulus of the Fourier transform is not well adapted to within-class variabilities more general than translations, since it is unstable to small diffeomorphisms, contrary to the scattering transform, which uses wavelets instead and is obtained directly by cascading this modulus nonlinearity with the filter bank J times. The scattering network concatenates scattering coefficients of orders 0 to J, where order-m coefficients involve m moduli and m intermediate wavelet filters with increasing scales. For instance, order 0 applied to a signal x is simply x convolved with a low-frequency filter φ; order 1 is the modulus of x convolved with a wavelet ψ, then convolved with φ; order 2 is the modulus of the modulus of x convolved with a wavelet ψ, convolved with a second wavelet ψ', then with the low-frequency filter, and so on (a short sketch of these order-0 to order-2 coefficients is given below). This network is approximately invariant to translations, and equivariant and stable to small diffeomorphisms. However, as we saw in the previous slides, only orders up to 2 are tractable and useful for classification purposes, and we saw that the resulting scattering tree is far from the state of the art.

So how can it be improved in our new setting of phase collapse? It can be improved by learning one-by-one operators P along channels, which are complex-valued this time. This gives rise to a learned joint scattering network, in which the modulus nonlinearity is cascaded J times and the learned channel operators P can create new phases associated with a wider group of within-class variabilities along channels than just the translations and small diffeomorphisms handled by the wavelets alone. The modulus then collapses the phases associated with this joint transform in space and across channels. This architecture is similar to the previous projected scattering, but much more constrained, with a lower-dimensional modulus nonlinearity instead of a ReLU with four phases, and it is complex-valued. We observed numerically that the learned complex operators P along channels are somewhat structured, with typically real and imaginary parts in quadrature, like Fourier or wavelet filters. However, the phase collapse of the high-frequency filters at each scale can hurt discriminability by hindering the ability of the network to build a richer set of filters. Adding a skip connection of the high-frequency filters around the nonlinearity in the cascade improves accuracy by giving the network the flexibility to create filters with a wider range of spatial and Fourier localizations, which is important to preserve the spatial localization information of image classes.
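As a rough illustration of the order-0 to order-2 coefficients described above, here is a minimal 1D sketch of the modulus-and-wavelet cascade. The Gaussian and Morlet-like filters and the absence of subsampling are simplifications of mine, not the filters actually used in the talk:

```python
import numpy as np

def gaussian_lowpass(n, sigma=8.0):
    """Low-frequency filter phi (a discrete Gaussian, as an illustrative choice)."""
    t = np.arange(n) - n // 2
    g = np.exp(-t**2 / (2 * sigma**2))
    return g / g.sum()

def morlet(n, xi, sigma):
    """Complex band-pass filter psi with center frequency xi (Morlet-like, illustrative)."""
    t = np.arange(n) - n // 2
    return np.exp(1j * xi * t) * np.exp(-t**2 / (2 * sigma**2))

def conv(x, h):
    """Circular convolution via FFT, with the filter centered at the origin."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(np.fft.ifftshift(h)))

def scattering_orders(x, phi, psi1, psi2):
    """S0 = x * phi,  S1 = |x * psi| * phi,  S2 = ||x * psi| * psi'| * phi."""
    s0 = conv(x, phi).real
    u1 = np.abs(conv(x, psi1))   # first modulus collapses the phase (order 1)
    s1 = conv(u1, phi).real
    u2 = np.abs(conv(u1, psi2))  # second phase collapse (order 2)
    s2 = conv(u2, phi).real
    return s0, s1, s2

n = 256
x = np.random.default_rng(0).normal(size=n)
s0, s1, s2 = scattering_orders(x, gaussian_lowpass(n), morlet(n, 0.8, 4.0), morlet(n, 0.3, 8.0))
print(s0.shape, s1.shape, s2.shape)
```

Each additional modulus collapses one more layer of phase, which is why order-m coefficients involve m moduli and m intermediate wavelet filters, as stated above.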
The result obtained with this architecture is that we are able to reach ResNet accuracy again, being 0.1% above in top-1 error and 0.1% below in top-5, with once again much fewer layers, 11 versus 18, and an even more structured complex-valued network with no biases, fixed spatial filters, and a modulus nonlinearity, which validates the fact that collapsing appropriate phases is sufficient to reach a high accuracy.

Another important point of the learned joint scattering architecture is that it disentangles order, which is defined for the scattering transform and refers to the number of nonlinearities used to compute a neuron, and scale, which is simply the size of its receptive field. Those two notions are mixed in traditional architectures, for which scale and order both grow with depth, that is, the number of layers used. Using triangular one-by-one operators P, we are able to build, as for the scattering network, an order-separated architecture, which shows that higher-order neurons correspond to more irregular responses. We visualize here patterns maximizing the response of neurons of different orders and scales, and we see that the complexity of these responses increases with the order, whereas for a fixed order the depth J only modifies the scale. For order 1, on the left of the figure, the response is maximized by nearly periodic oscillations which depend upon the scale and orientation of the wavelets, while as the order increases, iterated phase collapses create progressively more structured patterns.

As a first heuristic to study the influence of the order distribution on the classification accuracy, we set the proportion of the different orders at each layer j to follow a binomial distribution with parameters j and p, since orders at layer j are comprised between 0 and j; at the last layer, the average order with such a distribution is simply pJ. We evidence a slow decay of the accuracy as the parameter p decreases, and with p set to 0.25, which induces an average order below 3 with a total of 11 layers, the accuracy is still well above that of AlexNet and only 4.5% below ResNet. This architecture opens the possibility of defining a notion of high-dimensional regularity from the decay of the number of neurons at each order.
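A quick numerical check of this binomial heuristic (the helper name is mine; only the parameters J = 11 and p = 0.25 come from the talk):

```python
from math import comb

def order_proportions(j, p):
    """Proportion of neurons of each order m = 0..j at layer j, set to Binomial(j, p)."""
    return [comb(j, m) * p**m * (1 - p)**(j - m) for m in range(j + 1)]

J, p = 11, 0.25
props = order_proportions(J, p)
avg_order = sum(m * q for m, q in enumerate(props))
print(round(avg_order, 2))  # 2.75 = p * J: an average order below 3 at the last layer
```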
In conclusion, we have first shown that separation of class means is achieved with a rectifier, predefined wavelet frames, and one-by-one orthogonal projections, while concentration of class variabilities around their means is achieved by thresholding coefficients in one-by-one convolutional tight frames. Cascading those operators reaches ResNet accuracy on ImageNet, yet with a constrained architecture where learning is limited to one-by-one tight frames and there are no learned spatial filters. We have then shown that thresholding and sparsity do not seem necessary to achieve such an accuracy: in fact, concentration of within-class variability can be achieved with a phase collapse using a modulus nonlinearity, and the resulting learned joint scattering network also reaches ResNet accuracy with a more structured architecture, with no biases, fixed spatial filters, and learned one-by-one complex-valued filters. This architecture also allows us to disentangle two different notions, order and scale, with higher-order neurons shown to correspond to more structured patterns, which offers a continuous path between shallow and deep networks using coefficients of different orders at the last layer. Following up on this, we are currently looking into further structuring those learned one-by-one channel operators P, so as to possibly characterize the corresponding group of learned symmetries, and we are also trying to find a better way to optimize the distribution of the different orders, linked to a notion of regularity. And that's it for me. Thank you for your attention.