So I will present some work we did on the use of deep learning tools in the context of the resolution of inverse problems. It is work that was done in collaboration with my colleague at CVN, Jean-Christophe Pesquet, a former PhD student, Marie-Caroline Corbineau, and also in the context of a collaboration with researchers from the University of Modena in Italy, Carla Bertocchi and Marco Prato. So the applicative motivation is the resolution of inverse problems that arise in image or signal processing. We observe some data y, which is a degraded version of an image of interest x-bar that we would like to retrieve. The model that is considered, a quite standard one, is a linear observation model, where H could be some blur operator or a projection operator, together with a noise perturbation; you can imagine some additive Gaussian noise, for example. A quite efficient and popular strategy to solve this kind of problem is to resort to a variational method, where you define the solution as the solution of an optimization problem. You minimize the sum of a so-called data fidelity term, which takes into account your model Hx and measures the discrepancy between this model and your observed data, and a regularization function. The regularization is needed because, in many applications, the inverse problem is said to be ill-posed (because of the presence of noise or the structure of the matrix H), and adding it gives a better-quality and more stable solution. The idea of the regularization function, as you may know, is to incorporate some prior knowledge about the sought image, and here lambda corresponds to a positive regularization factor. So the advantage of such a method is that it is quite easy to incorporate this prior knowledge, just by playing with the regularization function; you can also enforce some constraints, using specific regularization functions such as an indicator function. You solve your problem using theoretically sound optimization tools, so your method is grounded on clear mathematical concepts, such as the convergence analysis of your optimization method. The difficulty is that, for most interesting choices of f and the regularization function, you do not have a closed-form solution, so you will need an iterative algorithm, which can be costly, for example in terms of complexity. Another drawback is that, at the end of the day, you build this objective function, but it may not reflect the final goal. For example, in image restoration, this function does not really contain any explicit information about the final quality of the image, such as the signal-to-noise ratio of the restored image or some visual perception metric on the final image. And the last point, which is a well-known drawback for users of this kind of technique, is the setting of this famous regularization parameter. Here I consider the case when we have only one parameter to tune, but it can happen, when you have more sophisticated regularization strategies, that you have more and more regularization terms and more and more factors to set, and it can be quite painful and time-consuming. In contrast, we have seen recently the use of deep learning methods, especially in the context of inverse problems, that can lead to very promising results. Deep learning methods, as was said in the previous talk, are generic methods for approximating complex nonlinear functions.
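In compact form, the observation model and the variational formulation just described can be written as follows (standard notation; the exact symbols on the slides may differ slightly, and the squared-norm data fidelity is only the example matching Gaussian noise):

```latex
y = H\bar{x} + n,
\qquad
\widehat{x} \in \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(Hx, y) + \lambda\, r(x),
\qquad \lambda > 0,
\qquad \text{e.g. } f(Hx,y) = \tfrac{1}{2}\|Hx - y\|^2 .
```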
In such methods, especially in what are called supervised deep learning methods, it is natural to incorporate some knowledge about the solution, and this is done through what is called the training phase, because you show your network some existing results that can come from a big database; and it is true that in image processing we have access to more and more, and larger and larger, databases. So it can be very interesting to take advantage of the existence of these databases. The difficulty is that, in the context of inverse problems, it is quite difficult in deep learning methods to account for a physical model, especially a complex physical model. As soon as we depart from the case when H is the identity, that is a denoising problem, and go to a complex linear operator H, it is not so trivial to incorporate it into the deep learning framework. And the last point, which is a typical critique of deep learning strategies, is that they can be quite black-box and sometimes a bit empirical in the way you construct the architecture. Basically, you construct it to obtain, at the end, the best results in terms of a metric, but for the architecture itself it can be quite difficult to interpret why it is working or not, which can be a problem in certain types of applications. So what we would like to do is to adopt this recent framework called unfolding of optimization algorithms, which combines the benefits of both variational methods and deep learning methods. A bit more precisely, what we propose is to make the link between a neural network architecture, specifically a feedforward neural network, a specific but quite common architecture, and a well-known family of iterative optimization methods that rely on the proximity operator introduced by Moreau. Let me give an example before starting the core of my presentation. Let's have a look at a first, basic optimization problem: the least squares problem. Imagine you want to solve your inverse problem by minimizing this least squares function, a quite basic formulation. You have your least squares term and you introduce a constraint that x should belong to a closed convex subset C of R^n, which you can imagine is based on some prior knowledge about your image. A typical strategy to solve such a problem is to use the projected gradient algorithm, where x_{k+1} is defined as follows: you take a gradient step on the least squares term and a projection onto C, and here gamma_k is the step size of your projected gradient algorithm. So let's look at this iteration in more detail. What we can observe is that we have this projection, and inside the projection what we have are actually linear operations. I then choose, or propose, to introduce this operator W_k. What is very interesting is that one iteration of the projected gradient algorithm is actually very related to what happens in one layer of a neural network. Why? Because we see that we have this linear term, a linear weight W_k times x_k, plus something that does not depend on x_k, which is sometimes called the bias term of a neural network layer; and then, as the output of one iteration, we have the projection operation, which acts as an activation function in a neural network. So actually, iterating this projected gradient algorithm is equivalent to passing through this neural network.
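To make the correspondence concrete, here is a minimal sketch of one projected gradient iteration written exactly as a linear layer plus an activation; the operator H, the data y and the box constraint are purely illustrative assumptions, not the example from the slides:

```python
import numpy as np

def projected_gradient_layer(x, H, y, gamma, project_C):
    """One projected gradient iteration written as a network layer:
    x_{k+1} = P_C(W x_k + b), with W = I - gamma H^T H and b = gamma H^T y,
    the projection playing the role of the activation function."""
    n = H.shape[1]
    W = np.eye(n) - gamma * (H.T @ H)   # linear "weight" of the layer
    b = gamma * (H.T @ y)               # "bias" term, independent of x_k
    return project_C(W @ x + b)

# Hypothetical example: constrained least squares with C = [0, 1]^n
rng = np.random.default_rng(0)
H = rng.standard_normal((30, 20))
y = H @ rng.uniform(0.0, 1.0, 20) + 0.01 * rng.standard_normal(30)
project_box = lambda v: np.clip(v, 0.0, 1.0)     # projection onto C
gamma = 1.0 / np.linalg.norm(H, 2) ** 2          # step size below 2 / ||H||^2

x = np.zeros(20)
for _ in range(200):     # iterating = passing through 200 identical layers
    x = projected_gradient_layer(x, H, y, gamma, project_box)
```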
Here I set the number of iterations constant, equal to a large K, and this is just another way to reformulate the first K iterations of the projected gradient algorithm. This is very interesting, because we see that there is a clear mapping between a certain family of neural network architectures and optimization methods. Let's go one step further in this analysis. What we will actually be interested in is this generic architecture called a feedforward neural network. The idea is to consider that each layer consists of a linear operation, with weight matrix W_k and bias b_k, composed with an activation operator R_k. The output of your neural network is the composition of K such operations. Just a remark: this is quite generic in the world of neural network architectures, and of course here you see that you have some linear operation. A famous choice in neural networks is to use convolutional layers; what I would like to emphasize is that you can choose W_k so that W_k x actually models a convolution, which is just a specific choice for the linear operator W_k. Now we are interested in this R_k, so let's have a look at it. In my simple example of projected gradient, R_k corresponds to the projection. In mathematics there is a general notion that generalizes the projection, called the proximity operator. It is defined as follows: for a proper, lower semi-continuous, convex function g, the proximity operator of g at a given point x is defined, in a unique manner, as the minimizer of this function plus the quadratic distance to the point x at which it is computed. When g is the indicator function of a set C, which is the function you would choose if you want to impose the constraint C, this proximity operator is equal to the standard projection onto the set C. Said another way, the projected gradient algorithm is just a special case of what is called the proximal gradient algorithm, which alternates between a gradient step and a proximity step, in the case when the proximity step is performed on this specific indicator function. The proximal gradient algorithm is also known as the forward-backward algorithm, and in special cases it has other names; for example, the famous iterative soft-thresholding algorithm (ISTA) is also a special case of the forward-backward algorithm. So the question is: if I come back to my architecture (before, remember, R_k was my projection) and I now replace this R_k by a proximity operator, can we make the link between proximity operators and the activation functions that are used in the neural network literature? And actually, what can be said is that most of the activation operators used in standard neural network architectures are proximity operators of certain functions. So again, this is a very interesting link between these iterative optimization methods and neural network architectures. Let me give you some examples. You may have heard about the ReLU activation function. The ReLU activation function is nothing else than the projection onto the set [0, +infinity), and, as I said, the projection is a particular case of the proximity operator. You have more sophisticated activation functions; for example, the parametric rectified linear unit, where you add a parameter alpha, is also the proximity operator of a specific function. I listed here other examples, such as the sigmoid activation function, which is very often used, for example in binary classification.
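Written out, the definition just given and the ReLU example read as follows (standard notation, with componentwise application for vectors):

```latex
\operatorname{prox}_g(x) \;=\; \operatorname*{argmin}_{u}\; g(u) + \tfrac{1}{2}\|u - x\|^2,
\qquad
\operatorname{prox}_{\iota_{[0,+\infty)}}(x) \;=\; \max(x,0) \;=\; \mathrm{ReLU}(x),
```

where the second identity is obtained by choosing g as the indicator function of the nonnegative orthant, so that the prox reduces to the projection.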
You have other examples, like the ELU activation function, and it is very interesting to see that for many, many activation functions, as I said, you can find this equivalence: the activation is the proximity operator of a certain function. Here I give the example of the softmax, and another example, the squashing function used in CapsNet. This being said, what we would like to do is to take advantage of this very nice observation that there is a link between iterative proximal methods and neural network architectures, to try to develop some kind of hybrid method combining both worlds. What we hope is that it benefits from both worlds and that we obtain better results for the resolution of our inverse problem. This is the main motivation for this whole project. So now let me go a bit more precisely into the problem. I will focus on a specific class of variational formulations; of course, I don't want to be restricted to projected gradient, I would like to do something a bit more sophisticated. So let me start with the optimization problem we consider in order to solve our inverse problem, to retrieve x from the observation y. This is the formulation we consider: we have our data fidelity term f(Hx, y), a regularization function r that is assumed, in this particular work, to be twice differentiable, and we also incorporate some constraints on our solution. The constraints are defined as a list of inequalities: x belongs to C if it satisfies c_i(x) >= 0 for every i, where all the minus c_i are convex functions, so that the set C is convex. We also assume that the interior of this set is non-empty, so as to have a well-defined problem and to ensure the existence of a solution to our initial problem. You may wonder why we have this strong assumption of twice differentiability: it is because, at the end of the day, we want to construct a neural network architecture, and we would like to benefit from the nice automatic differentiation and stochastic gradient descent tools for training our neural network; that is why we impose twice differentiability in this particular problem. A way, in optimization, to account for constraints is to make use of an interior point strategy. The idea of an interior point strategy is to replace the difficult constrained optimization problem by a sequence of unconstrained optimization problems, where the constraints are imposed in a more implicit manner through the introduction of what is called a barrier, a logarithmic barrier, in the objective function. Basically, you add this function B to your cost function and you make the constraints disappear. What is the shape of this logarithmic barrier? It is minus the sum of the logarithms of the c_i(x). More graphically, what does it mean? We start with our problem; imagine a very simple example, where in blue I have my function f plus r in the scalar case, and the marked interval corresponds to my constraint. In that example, my constraint is to stay in this interval. When I construct my augmented problem with the barrier, I add my barrier, and by construction my cost function looks like this. It means that, because of the logarithmic terms, we add a barrier that prevents the minimizer of this augmented function from being outside the constraint set. Here mu is what is called the barrier parameter; it is strictly positive.
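In formulas, the interior point construction just described replaces the constrained problem by a family of barrier-augmented problems (a restatement of the slide content in standard notation):

```latex
\min_{x}\; f(Hx, y) + \lambda\, r(x) \quad \text{s.t. } c_i(x) \ge 0,\; i=1,\dots,p
\;\;\longrightarrow\;\;
\min_{x}\; f(Hx, y) + \lambda\, r(x) + \mu\, B(x),
\qquad
B(x) = -\sum_{i=1}^{p} \log c_i(x), \quad \mu > 0.
```

For instance, for the scalar interval constraint x in [a, b], B(x) = -log(x - a) - log(b - x), which diverges at the endpoints and therefore keeps the minimizer strictly inside the interval.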
What is done in an interior point method is that you replace P_0 by a sequence of subproblems P_{mu_j}, parameterized by mu_j, and you solve each subproblem approximately for a sequence of mu_j that goes to zero. There exist many analyses; actually, it is not a recent optimization method, it is well known in the world of constrained optimization. You can show convergence, and even, in some particular implementations, the superlinear convergence of such a strategy, depending, for example, on the speed at which you decrease mu_j and on the properties of your functions. What is very nice is that the iterates of this interior point algorithm are all feasible. What I mean by feasible is that they all strictly satisfy the list of constraints c_i, which, again, can be very interesting in specific problems. The main difficulty is that the most famous interior point solvers are based on Newton-like techniques, and they typically need the inversion of an n-by-n matrix at each step. So here we will focus on an improved version of the standard logarithmic barrier algorithm that incorporates a proximal step. The first strategy combining prox and interior point was proposed in these papers. Basically, in that algorithm, at each iteration you need to compute the proximity operator of the whole augmented function f + lambda r + mu B. It defines an algorithm with nice convergence properties, but the problem is that at each iteration you need to solve subproblems that can be quite difficult, even as difficult as solving the initial problem. So it is interesting for the theoretical analysis, but it is not a very practical algorithm. Here, what we propose instead is to take advantage of the fact that f and r are assumed to be differentiable, and to adopt this forward-backward, this proximal gradient algorithm. What it means is that at each iteration we perform a gradient step on the data fidelity plus regularization term, with a given step size, and we compute the proximity operator of the barrier function. This is the generic scheme of our algorithm. Of course, we need a sequence mu_k that evolves and decreases to zero, and some specific assumptions on the step size gamma_k. The nice advantage is that now, at each iteration, we just need to compute the proximity operator of the barrier function. We showed in our work that, for a large set of examples of constraints, we can actually express the proximity operator of the barrier associated with the constraint, and also the Jacobian matrix of this proximity operator, which will be a key ingredient when we do automatic differentiation in our neural network. Here I give the result for the affine constraint and for the hyperslab constraint, which are defined as follows; again, we have a closed form. For the bound constraint, I did not write the expression, but it is in our paper, and this is how the proximity operator looks: it is quite similar to a thresholding operation, with a different shape depending on the values of the step size gamma and the barrier parameter mu. And the last example is a bounded L2 norm constraint, so we see that we can also encompass non-linear constraints.
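As an illustration of what such closed forms look like, here is a small sketch for the simplest case, a componentwise positivity constraint x >= 0; this is not the general bound or hyperslab formula from the paper, just the one-sided special case for which the quadratic equation defining the prox can be solved by hand:

```python
import numpy as np

def prox_log_barrier_positive(x, gamma, mu):
    """Proximity operator of u -> -gamma*mu*log(u) on (0, +inf), componentwise.
    Setting the derivative of 0.5*(p - x)**2 - gamma*mu*log(p) to zero gives
    p**2 - x*p - gamma*mu = 0; the positive root is returned, so the output
    is always strictly feasible (p > 0)."""
    return 0.5 * (x + np.sqrt(x**2 + 4.0 * gamma * mu))

def prox_jacobian(x, gamma, mu):
    """Diagonal of the Jacobian of the prox above w.r.t. x, the ingredient
    needed to backpropagate through the corresponding unfolded layer."""
    return 0.5 * (1.0 + x / np.sqrt(x**2 + 4.0 * gamma * mu))

x = np.array([-1.0, 0.0, 2.0])
print(prox_log_barrier_positive(x, gamma=0.1, mu=0.5))  # all entries > 0
```

Note that as gamma*mu goes to zero this operator tends to max(x, 0), consistent with the thresholding-like shape mentioned above.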
So now we have our algorithm, and we are happy with it, it is quite efficient, but we faced a difficulty: the setting of the step size and the barrier parameter. Clearly, it was quite hard to find a good compromise in order to reach a good result, and it was really problem-dependent; and again we had the famous problem of how to set the regularization parameter of our inverse problem in order to obtain the best image quality possible. So what we propose is to learn those parameters, mu, gamma and lambda, along the iterations. We fix a number of iterations K, and we say that one iteration of our algorithm corresponds to one layer of a neural network, with its activation function and these different terms, and we use a training phase in order to learn the best choices of gamma_k, lambda_k and mu_k for a given image restoration task; these can actually be different at each iteration. This is how it looks: we transform our iterative proximal algorithm into a neural network. This is really the idea of the unfolding of optimization algorithms that I tried to explain in the introduction with the projected gradient example; this is what happens for our proximal interior point algorithm. Each layer contains three possibly different neural networks, for now drawn as boxes, that help us learn gamma, lambda and mu; then, once we have learned those gamma, lambda and mu, we put them as inputs of one iteration of our algorithm, we produce an output, and we iterate over these different steps. Let me be a bit more precise about the architecture we propose. Gamma_k, recall, is the step size of our algorithm, and a step size should be positive. We propose something quite simple: the step size does not have much reason to depend on x_k or on the data, so we set it simply as the output of a softplus activation function, with as input a scalar parameter a_k that is learned during training. This is for gamma, this is what is in this box. For the barrier parameter, what we observed is that it is much better to make it depend on the current iterate x_k, which is quite intuitive in terms of optimization, and we propose this specific network architecture, made of several layers and finishing with a fully connected layer and, again, a softplus; it is important to impose the positivity of the barrier parameter, simply to satisfy the basic conditions needed for a valid optimization method. For the last parameter, the regularization parameter, things are a bit different: what we propose is that this lambda should depend on x_k, but we also want to incorporate our knowledge as people from the world of inverse problems and signal processing, and include some information about the image statistics and the noise level, since we know that these influence the setting of the regularization parameter; so it is not something completely black-box. As I said, box A corresponds to one iteration of our algorithm once mu_k, gamma_k and lambda_k have been provided by the three previous blocks. At the end, we add post-processing layers, which is very common in the literature on unfolded optimization methods, in order to remove small artifacts; this layer can also be learned end-to-end from the training database, and it can be application-dependent.
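A minimal PyTorch-style sketch of the two simplest boxes, with illustrative layer sizes only (the actual architectures and dimensions in the paper differ; the point here is the softplus outputs enforcing positivity of gamma_k and mu_k):

```python
import torch
import torch.nn as nn

class StepSize(nn.Module):
    """gamma_k = softplus(a_k): one learnable scalar per layer, kept positive."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(1))

    def forward(self):
        return nn.functional.softplus(self.a)

class BarrierParameter(nn.Module):
    """mu_k predicted from the current iterate x_k, with a small network
    ending in a fully connected layer and a softplus (positivity)."""
    def __init__(self, n_pixels, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_pixels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Softplus(),
        )

    def forward(self, x_k):
        return self.net(x_k.flatten(start_dim=1))

# Illustrative usage for one layer of the unfolded network
gamma_k = StepSize()()                                       # positive scalar
mu_k = BarrierParameter(n_pixels=64 * 64)(torch.rand(1, 64 * 64))
```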
So, thanks to the form of this algorithm, and to the fact that we managed to express the Jacobian of the proximity operator, we were able to do the training in the standard way, with stochastic gradient descent and backpropagation. As I said, we have an explicit form for the gradient of box A, thanks to what we obtained for the proximity operators; for the three other boxes, we use the standard automatic differentiation tools, because those are quite standard architectures from the neural network world. One last point before showing the numerical results. Something very important when you use neural networks, and quite often criticized, is the lack of robustness of some neural network architectures. There have been some very famous examples showing that when you slightly modify some pixels in an image, a neural network classifier can classify a cat as a dog, and so on, without any visible perturbation of the input. This is a major issue, because if you want to use your architecture for medical image processing or automated driving, you certainly need theoretical guarantees about the robustness of your network. So we rely on a recent work about the stability of fixed-point structures, by Combettes and Pesquet, which allows us to study the stability of the network we propose. Let me briefly present the results. First, something very important: the stability of such iterations is highly related to the concept of alpha-averaged operators, which is defined mathematically as shown, and basically what we need is to have alpha-averaged operators in our pipeline. I don't really have time to go too much into the mathematical details, but you have all the definitions here. What is very nice is that when you show that your architecture is alpha-averaged, which is what we would like to show, you obtain the very nice property that you can bound the variation of the output of your network as soon as the variation of the input is bounded, and this is exactly what you would like: if you do not have too much variation in the input of your network, the output does not completely explode, and you can control very precisely what will happen. The output variation is T(x) minus T(y), and the input variation is x minus y. So we would like to show that our architecture satisfies this mathematical property of alpha-averagedness. Let's go back to our neural network. As I said, it is very important to rephrase it as a feedforward architecture, that is, a composition of activation functions, linearities, and so on. The proposed architecture actually shares this structure, and I can show it very easily in a specific case: when the regularization is quadratic, you can rewrite the gradient directly, you obtain something linear, and you clearly see that you obtain an activation function and a linear operation. What it means is that the prox of the barrier function is just a special kind of activation function. So there is a lot of theoretical work regarding the stability of such networks, and we can inherit these nice results. This is what we did. This is a general theorem: under some mathematical conditions on the weights involved in your feedforward network, you can show that it is alpha-averaged. Let me skip the math a bit. To sum up with a take-home message: the stability of a feedforward neural network depends on its weight operators, that is, the matrices W_k.
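For reference, the standard definition and bound behind this statement (classical fixed-point theory facts, not results specific to this work): an operator T is alpha-averaged, with alpha in (0, 1], if it can be written as

```latex
T = (1-\alpha)\,\mathrm{Id} + \alpha R, \qquad R \ \text{nonexpansive},
```

and in that case the output variation is controlled by the input variation:

```latex
\|T(x)-T(y)\|^2 \;\le\; \|x-y\|^2 \;-\; \tfrac{1-\alpha}{\alpha}\,\|(\mathrm{Id}-T)x-(\mathrm{Id}-T)y\|^2 \;\le\; \|x-y\|^2 .
```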
So let's have a look in more detail at our particular architecture. I will consider a slightly simpler problem, with a quadratic regularization, and I make some assumptions on the linear operators. I introduce some specific notation, and basically what we managed to obtain is that, under some conditions on the eigenvalues of these matrices, we can show that our network is alpha-averaged and we can express the value of alpha, so we can really control the stability of what is happening. We can see that it basically depends on what is in our problem, and under some conditions that can be satisfied in practice we have the stability of the network, which is very reassuring from the application viewpoint. I will finish with some numerical experiments. We apply our architecture to the problem of image deblurring, which means that H corresponds to a convolution with a known blur (we are not in the blind case), and we have some additive white Gaussian noise. We do not assume that we know the standard deviation of the noise, which is important, because we will actually learn it in some way. We consider color images, that is, three-channel images. This is the variational formulation, quite standard: we have a constraint on the range of the pixel intensities, and we use a smoothed version of the total variation regularization because, I remind you, we need twice differentiability for our particular algorithm; here D_h and D_v correspond to horizontal and vertical gradients in the image, so the regularization enforces some smoothness of the image. When you work with neural networks, as you may know, you need to make many settings; among other things, we set the number of layers equal to 40, and this is the final choice we arrived at. For the regularization parameter, as I told you, we use our knowledge of signal processing: we propose an estimator of the noise level, based on the wavelet coefficients of the blurred and noisy image, and this sigma-hat is combined through a softplus operation; so basically, what we learn here are b_k and c_k, only two parameters, and x_k is an input of this function. This is very nice, because, contrary to many existing neural network methods for deblurring, we do not need to know the noise level. For the post-processing, we use something quite standard, as follows. This is our protocol: we train on these images, we validate, that is, we fine-tune the hyperparameters, on 100 images, and we finally test on 200 images; these test images were never seen by the neural network. We tried different kernels (Gaussian, motion, uniform) and different noise levels, to check the validity of the noise estimator too. Also some information about the training: what is very interesting is that we train our network to maximize the structural similarity measure (SSIM), which really expresses what we would like to have, namely a nice visual quality of the image, like the final goal of the resolution of our image restoration problem. Just for information, we implemented this in PyTorch, and the training took three to four days. We compared with some state-of-the-art deep learning-based methods, and also, which is quite interesting, with a standard optimization-based method, where we just solve the variational problem with an optimization method for which we fine-tune the regularization parameter manually; it is quite a good benchmark for the optimization methods.
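Going back for a moment to the noise-level estimator mentioned above: the exact estimator of the paper is not reproduced here, but a classical wavelet-based choice of this kind is the median absolute deviation of the finest-scale diagonal (Haar) detail coefficients, sketched below for a single-channel image:

```python
import numpy as np

def estimate_noise_std(img):
    """Median-absolute-deviation noise estimator from the finest-scale
    diagonal Haar wavelet coefficients: sigma_hat = median(|d|) / 0.6745."""
    img = img[: img.shape[0] // 2 * 2, : img.shape[1] // 2 * 2]  # even dims
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    hh = (a - b - c + d) / 2.0            # diagonal detail coefficients
    return np.median(np.abs(hh)) / 0.6745

rng = np.random.default_rng(0)
noisy = np.zeros((256, 256)) + 0.05 * rng.standard_normal((256, 256))
print(estimate_noise_std(noisy))          # close to the true value 0.05
```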
This is what we obtain in terms of SSIM. What we can see is that basically most of the deep learning-based methods are better than the optimization-based method, and on these five data sets our unfolded optimization, our unfolded learning algorithm, can beat the standard deep learning strategies in terms of SSIM. Let me just finish with some visual examples. You can see what you would obtain by solving the TV problem, and this is what we obtain: we have fewer artifacts, and the images are also a bit less blurry than what you obtain with standard black-box deep learning strategies. So, in conclusion, what we showed is that you can build neural network architectures that are explainable, by mixing ideas from iterative optimization algorithms, especially the proximal gradient algorithm, and neural network techniques. Here we focused on a specific family of optimization methods, our interior point based method, and we showed that we can express the prox of some barrier functions and their gradients. What I would like to say is that this is not the end of the work for people in optimization. I think it was also clear in the rest of the talk that we certainly need better, especially non-convex, optimization methods, for example for the training of these neural networks, which are very difficult optimization problems. What is also clear from the stability analysis is that optimization concepts are not only useful to train neural networks, but also to analyze them, because the stability analysis really requires quite sophisticated tools from optimization. So there is still work for optimizers. Thank you very much; maybe I can answer your questions. So maybe I can start with one question. From what I understood, your starting point was classical optimization approaches, variational approaches for inverse problems, and you try to take advantage of these algorithms to design new neural networks and to have some kind of understanding of these neural networks with these tools. I would say that in the inverse problems community, you go one way, from optimization to learning, and I think people could also go the reverse way, because you can think of, for instance, prior distributions that would be data-driven and could be inspired by, for instance, GANs, generative adversarial networks. So I was wondering: with this kind of approach you could learn the prior distribution; is it something that you know of, and, for instance, could you improve classical approaches to solving inverse problems by taking advantage of neural network approaches? Yes, for sure. There is another track of work, for example what is called ISTA-Net, and there have been many other methods whose goal is to learn the regularization function, that is, to replace the standard regularization by a neural network that performs as a denoiser. Then, of course, there are many interesting theoretical questions about the convergence analysis of such black-box regularizers; there are some very promising results, and it is also a bit related to this alpha-averagedness and fixed-point theory, basically some kind of contraction property of your denoiser. In the context of unfolding, it is true that in the present work we did not learn anything very precise about the regularization, just the regularization constant, but now we are working on combining this with the learning of, for example, the linear operator which appears in the total variation: you could have something more like a dictionary, and learn the dictionary from the database. So yes, for sure, this
is interesting. What is important to note is that the computational cost of those neural networks increases very rapidly: here we learn only three scalar parameters and we already have a training of several days, so you can imagine that you need to pay attention to what you are learning. What is shown in the literature, when you learn the regularization, is that it is important to impose a lot of structure, like a convolutional structure, so as not to explode in memory and not to give too much freedom to the network. But yes, for sure, I think this is promising, and this is maybe the way to further improve the resolution of these complex inverse problems. Thank you. Other questions? So the question is: is unrolling only adapted when your model-based algorithm needs to tune some extra regularization parameters? I would say no. I think it is a bit related to your question, Charles: it is much more flexible than this, you could learn something inside the regularization function, it does not need to be only scalar parameters. I don't know if that answers the question. Maybe the question is also: do I need to unroll my whole optimization algorithm? It is true that the idea of unrolling is twofold. The first thing is that you can use training and learn from a database; of course, you need to have something to learn, it can be parameters, but it can also be the regularization. The second thing I would like to emphasize, maybe it was not so clear here but it was clear in the experiments we did, is that putting these optimization methods in a neural network framework also means implementing them in PyTorch and benefiting from GPU acceleration, and what we saw is that, even from the optimization viewpoint, this allows us to accelerate standard optimization methods by factors of up to 50. I think this is something that is maybe not so well known and that is very interesting for people in inverse problems: just making the effort of going to PyTorch and using GPUs. Thank you very much, see you next time.