Okay, so let me just remind you of what we had done and of where I will continue. We are looking at functions in L²(ℝ), typically, and I had introduced operators W(t,ν) acting on these functions that shift them by t, that modulate them, and on top of that I had introduced a convenient phase factor. It is a convenient phase factor because, as we had seen, the product of two such operators expresses itself beautifully in terms of the operator with the sum of the two shifts and the sum of the two modulating exponentials, times a phase factor which, because of the convention we chose, has a very nice symmetric form. Something I did not emphasize, or maybe I did: this is obviously a unitary operator. If we multiply W(t,ν) by the operator with the negated arguments, we obviously get a zero shift and a zero frequency modulation, so the identity operator; and in the phase factor we get t₁ν₂ minus t₂ν₁, which vanishes for these arguments, so the phase factor is one. So W(t,ν) is a unitary operator whose inverse is its adjoint, and we see immediately that the adjoint of each of these is the operator with the negated arguments. We had also seen that if we take the integral over all of ℝ², with the measure dt dν/2π, of the inner products of an arbitrary function f with a window that is translated and modulated, recombined with a second window, then, understanding this integral in the weak sense (that is, I take its inner product with a test function g in order to give it a good definition), what we get is the inner product of the two windows times the function f; so if I take the inner product of the left-hand side with g, I get that constant times the inner product of f with g. In particular, if the two windows are the same, which is the case that interests us most, and if the window is normalized in L², then we get the identity operator; because that is something we will use quite a bit, let me write it out specially. I like to interpret this as follows. Typically I take windows that are very well localized in time around zero and whose Fourier transform, because the window itself is smooth, is also well localized around zero. So I think of projecting on such a w as localizing the content of my original function around zero in time and in frequency; if I move it around with these translation and modulation operators, then, in time-frequency space, I am localizing around the point (t,ν). So these are all rank-one projection operators that localize around different pieces of time-frequency space, and the superposition of all of them gives me the identity operator; it is sometimes called a resolution of the identity.
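To fix notation for what follows, here is a compact written summary of these facts. The normalization of W is reconstructed from the computations that appear later in the lecture (shift by t, modulation by ν, symmetric half-phase), so the overall sign in the phase factor should be read as convention-dependent:

```latex
\[
(W_{t,\nu}f)(s) = e^{i\nu(s - t/2)}\, f(s-t), \qquad
W_{t_1,\nu_1}\,W_{t_2,\nu_2}
   = e^{\frac{i}{2}(t_2\nu_1 - t_1\nu_2)}\; W_{t_1+t_2,\;\nu_1+\nu_2},
\qquad
W_{t,\nu}^{*} = W_{t,\nu}^{-1} = W_{-t,-\nu},
\]
\[
\frac{1}{2\pi}\int_{\mathbb{R}^2}
    \langle f,\, W_{t,\nu}\,w\rangle\; W_{t,\nu}\,\tilde w\;\,dt\,d\nu
    \;=\; \langle \tilde w,\, w\rangle\, f
\qquad \text{(in the weak sense).}
\]
```

For a single window with w̃ = w and ‖w‖ = 1 in L², the right-hand side is just f: the resolution of the identity.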
Okay, and then finally, something else we had seen: I was, and will continue to be, interested in the case where I discretize, where I do not look at t and ν as continuously varying parameters but only at the lattice points (mτ, nω). In that case I am interested in an operator that goes from L²(ℝ) to, typically (and I will have to come back to this), ℓ², the square-summable sequences, which maps f to its inner products with these shifted and modulated windows, which I call w_mn. One thing we saw last week: if τω = 2π, and if the window w decays faster than (1+|s|)^(−(1+ε)), and its Fourier transform does as well, then the family {w_mn} cannot form an orthonormal basis. Actually, what I claimed is that no such family can form an orthonormal basis regardless of the choices of τ and ω; what I showed you was that it cannot be done when the product τω is 2π, and we will come back today to the fact that τω has to equal 2π for a basis in the first place. As a tool in that proof I had also introduced a transform from L²(ℝ) to L² of the unit square, which I called the Zak transform, defined by a sum; it is well defined for f with sufficient decay, and what we showed is that on such functions it is norm-preserving, so we could extend it to all of L² as a unitary operator, and we proved that an orthonormal basis is mapped to an orthonormal basis, and so we were done. Okay, so today I will embroider further on some of these things. At some point, either today or next week, I should say: many of the things I am doing here have a complete analogue in the wavelet world (I am talking here about windowed Fourier transforms), and I will mention those too, but we will stay mostly within the windowed Fourier world in these presentations. Okay. And I promised you we would come back to this localization operator. Part of what we will do today is look at an operator built out of these superpositions of the W_{t,ν}g pieces, with the measure dt dν/2π: if I integrate over all of ℝ², I get f back.
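For reference, since the formula will not reappear on the board: the discrete family, and one standard normalization of the Zak transform (written here with τ scaled to 1 and τω = 2π; other normalizations differ by phases and scalings), are

```latex
\[
w_{mn} = W_{m\tau,\,n\omega}\, w \qquad (m,n \in \mathbb{Z}),
\qquad
(Zf)(t,\xi) = \sum_{l\in\mathbb{Z}} e^{2\pi i l \xi}\, f(t-l),
\quad (t,\xi)\in[0,1]^2 .
\]
```

Z is norm-preserving on functions with enough decay, extends to a unitary map from L²(ℝ) onto L²([0,1]²), and carries orthonormal bases to orthonormal bases.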
What I will be doing here is integrate over a subset: I will integrate over the (t,ν) within a disk. And, because it is convenient, I will work with a very special window function, a Gaussian; this defines for me a localization operator on the disk of radius R, so I define an operator L_R this way. It is my perfect right: I had a professor who, on oral exams, if a student wrote something particularly obtuse on the board, would say "well, that is your perfect right to write this", but then you had to explain; so I will be explaining. The window g here, which turns out to be a very nice thing for me to use, is the Gaussian normalized so that (you can check this for yourself) it has L² norm 1; for the physicists among us, it is the ground state of the harmonic oscillator, and it has a number of very beautiful properties with respect to this transform. Let's look at ⟨f, W_{t,ν}g⟩. I get the normalization, I have f(s), and then let's write out what the whole thing is: writing out g(s−t) gives an exponential with −s²/2 − t²/2 + st, and the modulation, which gets conjugated, gives −iνs + iνt/2, all under the integral over s. Let me rewrite that in a fancy way. I keep a factor e^{−s²/2}; then I have a factor e^{s(t−iν)}; and I like to incorporate the rest as a double product as well: if I write e^{−(t−iν)²/4}, you see that the double product gives me exactly the cross term; I then already have one quarter of the t² I had, so I need to write another quarter, and I have added a ν²/4 because of the (iν)², so I have to subtract it. This might look like a completely silly thing to do, but what I want you to accept is that it gives me something useful: apart from the factor e^{−¼‖(t,ν)‖²}, where the exponent is really a quarter of the magnitude of (t,ν) squared, it gives me a function that is analytic in t − iν, or if you wish in ν + it. (Question: what does the bracket (t,ν) mean? I am viewing (t,ν) as an element of ℝ², and I write its magnitude squared as the Euclidean norm, t² + ν². Right, from the physics point of view it is strange to add t and ν, which have different dimensions; there is a unit of frequency that I have put to one, which is why I am thinking of the harmonic oscillator with frequency equal to one, and that unit has disappeared on me, I agree.) So what happens is that the transform goes from functions in L² to functions that are analytic. Of course an analytic function can grow in certain directions, but these do not grow so badly that multiplying them by this Gaussian factor fails to make them square integrable again. So they are analytic functions whose growth is controlled just enough that this multiplier turns them into square-integrable functions, and that space is called the Bargmann Hilbert space. So the transform maps into the Bargmann space, which is a very nice and beautiful tool to work with; we will implicitly be doing some computations in there, but I will not typically write them out in detail in the Bargmann space. Okay, so that's the Gaussian.
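Collected in one place, the computation just sketched (with the normalizations as I reconstruct them; the lecture tracks them only loosely):

```latex
\[
g(s) = \pi^{-1/4}\, e^{-s^2/2}, \qquad \|g\|_{L^2} = 1,
\]
\[
\langle f,\; W_{t,\nu}\, g\rangle
  = e^{-\frac14\,\|(t,\nu)\|^2}\; F(t - i\nu),
\qquad
F(z) = \pi^{-1/4}\, e^{-z^2/4} \int_{\mathbb{R}} f(s)\, e^{-s^2/2}\, e^{sz}\, ds,
\]
```

with F entire, and with growth controlled just enough that e^{−|z|²/4} F(z) is square integrable; that is, F lies in the Bargmann space.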
Before that: my goal is to tell you things about that localization operator that will be very useful and interesting, but to get there I am going to introduce yet other things, which will also be interesting and useful. Okay. So now I am looking at this phase space with time and frequency, at points with coordinates (t,ν), because I have implicitly introduced a frequency unit that I put equal to one. In a sense I am using the fact that you can do dimensional analysis in physics to find what dimensions things should have, and then sweep away all the units; that is abuse, if you wish, but it makes my life easy. And I am going to be interested in rotating things in time-frequency space. So, for a rotation over an angle θ, I define R_θ(t,ν) to be (cos θ · t + sin θ · ν, −sin θ · t + cos θ · ν); if you think of (t,ν) as a column vector, this is the standard rotation matrix acting on it. (Does it rotate the way you indicated, or the other way? I am always confused too. Let's see: if θ is positive, a point here goes there, so I think it is correct.) Okay, so let me define an operator U_θ by saying: I integrate again over the full plane; U_θ f is defined by taking the inner products of f with the translated and modulated Gaussians, recombining with W at the rotated label R_θ(t,ν) applied to g, and integrating dt dν/2π. That is clearly a linear operator, and if θ = 0 we already know that it is the identity. What we will see is that this operator acts in a very natural way: I claim that U_θ applied to W_{t′,ν′}g is easy to compute and has a very nice expression; in fact I claim (for the moment this is a claim) that it is just W_{R_θ(t′,ν′)} applied to g. Let's verify that; we do not really need a lot for it. It is clear that we will have to insert this thing in there, so it is a good idea to have a nice expression for the inner product of a W_{t₁,ν₁}g with a W_{t₂,ν₂}g. Well, we know the adjoint is the operator with the negated arguments.
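Written out:

```latex
\[
R_\theta \begin{pmatrix} t\\ \nu \end{pmatrix}
 = \begin{pmatrix} \cos\theta & \sin\theta\\ -\sin\theta & \cos\theta \end{pmatrix}
   \begin{pmatrix} t\\ \nu \end{pmatrix},
\qquad
U_\theta f = \frac{1}{2\pi}\int_{\mathbb{R}^2}
   \langle f,\; W_{t,\nu}\,g\rangle\; W_{R_\theta(t,\nu)}\,g\;\, dt\, d\nu,
\]
```

and the claim to be verified is U_θ W_{t′,ν′} g = W_{R_θ(t′,ν′)} g.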
We know what happens when we take the product of two of them: we get a phase factor. Here I have to take the adjoint of the first member, so I get a phase, e^{(i/2)(t₁ν₂ − t₂ν₁)} (is that correct? let's be careful; yes, that is what I thought too), times the inner product of g with W_{t₂−t₁, ν₂−ν₁}g. So we need to look at inner products like ⟨g, W_{t,ν}g⟩. We had a computation over there that already gave us, in detail, a general f against such a Gaussian, so let's just write out ⟨g, W_{t,ν}g⟩. Instead of the π^{−1/4}, I now get a π^{−1/2}, and an exponential with −s² − t²/2 + st − iνs + iνt/2 under the integral over s. The s-dependence I can rewrite as −(s − t/2)²; that accounts for the s² and the double product; then I have taken a −t²/4, so another −t²/4 remains; and I have −iν(s − t/2) left. Then (you can work out that all the square roots of 2 and π work out) what I essentially have is a Fourier transform of a Gaussian; but whereas with an exponent −(·)²/2 the Fourier transform would give me a −ν²/2, here, because I do not have that factor, I get a −ν²/4, and altogether I get e^{−t²/4 − ν²/4}. So if I plug that in, I get ⟨W_{t₁,ν₁}g, W_{t₂,ν₂}g⟩ = e^{(i/2)(t₁ν₂ − t₂ν₁)} e^{−¼‖(t₂−t₁, ν₂−ν₁)‖²}, with the notation I introduced over there. And since I am introducing Euclidean norms anyway, I will rewrite the phase in a similar way: t₁ν₂ − t₂ν₁ is, after all, the inner product of (t₁,ν₁) with (ν₂,−t₂). So if I introduce the matrix J, with rows (0, 1) and (−1, 0), which has the property that it maps the vector (t,ν) to (ν,−t), then I can rewrite this whole thing as exp{(i/2)(t₁,ν₁)·J(t₂,ν₂)} times exp{−¼‖(t₁,ν₁)−(t₂,ν₂)‖²}. Okay. So I am going to use all of that to look at what the left-hand side of my claim is. By definition it is (1/2π) times the integral over the whole plane of ⟨W_{t′,ν′}g, W_{t,ν}g⟩ times W_{R_θ(t,ν)}g. The first thing I do is change integration variables: I can call the rotated pair my integration variable, and because the measure dt dν is invariant under rotations of the time-frequency plane, I can go back; the first factor then carries a rotation over minus the angle, and the second factor is just W_{t,ν}g. Here I use the computation I worked out there: the integrand becomes exp{(i/2)(t′,ν′)·J R_{−θ}(t,ν)} times exp{−¼‖(t′,ν′) − R_{−θ}(t,ν)‖²}, which is that whole inner product, times W_{t,ν}g, integrated dt dν over 2π. Now, because the Euclidean norm is invariant under rotation, I can move the rotation in the second exponent onto (t′,ν′). And if you look at J: J is in fact nothing but one of these rotations itself, J = R_{π/2}; if you plug π/2 into our definition of R_θ you get exactly J, and all these rotations of the plane of course commute. So (t′,ν′)·J R_{−θ}(t,ν) is the same here as (t′,ν′)·R_{−θ}J(t,ν), and because the dot product is invariant under rotation, that is the same thing as putting the rotation on the first vector and taking the dot product with J(t,ν). You see that with these operations I have put R_θ everywhere on (t′,ν′), and now I can backtrack through the whole computation: this is (1/2π) times the integral over ℝ² of ⟨W_{R_θ(t′,ν′)}g, W_{t,ν}g⟩ W_{t,ν}g dt dν, and since that resolution of the identity gives me back the function, it is, just as I claimed in the beginning, W_{R_θ(t′,ν′)}g. So my claim is established, and I have rotations in time-frequency space. Now you might say: a rotation in time-frequency space, what does that really mean? Well, something that maps time to negative frequency and frequency to time is really the same thing as the Fourier transform. If you take the Fourier transform of W_{t,ν}g (of W_{t,ν}f for any f, actually), you will find that it is the same thing as W_{ν,−t} acting on the Fourier transform; so looking at it on this side and on that side gives exactly the same thing, and the Fourier transform is a rotation over 90 degrees in time-frequency space. The rotations over angles that are not quite 90 degrees are the same as what is known as the fractional Fourier transform. I always find papers on the fractional Fourier transform very complicated to read, but if I think of it simply as rotating in time-frequency space, everything becomes much easier. Now, there are some functions that behave very simply under these rotations. We have seen here that if you take the Gaussian itself, translated and modulated around, all these rotations do to it is rotate its label. Something similar happens when you involve not just Gaussians but Hermite functions. So: the action of U_θ on Hermite functions. I define my Hermite functions as follows: the zeroth-order Hermite function h₀ is just my Gaussian, and h_n is, up to a normalization factor (to keep things normalized), the operator (s − d/ds) applied n times to the Gaussian. For instance, applied once: the negative derivative acting on the Gaussian gives a factor s, and the multiplication also gives a factor s, so I get a multiple of s times the Gaussian, which I then normalize; but that is essentially what Hermite functions are. (Question: can one see U_θ as the exponential of iθ times the Hamiltonian of the harmonic oscillator? Yes, absolutely, and that is why everything that is about to come out comes out so nicely; that is in fact how I first found everything I am going to say here. But people in math these days are not as familiar with the harmonic oscillator any more; well, maybe in France they are, but in the States they are not. And of course s − d/ds is the creation operator of the harmonic oscillator.) So these are the Hermite functions. And that already tells me something very nice about the inner products of these modulated and shifted Gaussians with the h_n: the Bargmann transform of the h_n gives something very nice already.
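Before doing the computation, let me record the two formulas from this stretch of the argument in one place (again, the overall sign of the phase depends on the inner-product convention; only its antisymmetry in the two labels is used below):

```latex
\[
\langle W_{t_1,\nu_1}\,g,\; W_{t_2,\nu_2}\,g\rangle
 = \exp\!\Big(\tfrac{i}{2}\,(t_1,\nu_1)\cdot J(t_2,\nu_2)\Big)
   \exp\!\Big(-\tfrac14\,\big\|(t_1,\nu_1)-(t_2,\nu_2)\big\|^2\Big),
\qquad
J = \begin{pmatrix} 0 & 1\\ -1 & 0 \end{pmatrix} = R_{\pi/2},
\]
\[
h_0 = g, \qquad
h_n = c_n\,\Big(s - \frac{d}{ds}\Big)^{\!n}\, g
\qquad (c_n \text{ a normalization constant}).
\]
```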
Why? Well, it is the integral of h_n(s) against exp{−½(s−t)² − iνs + iνt/2} ds. Up to normalization, h_n is (s − d/ds)ⁿ applied to g. If I use integration by parts for all the derivatives, every −d/ds becomes a +d/ds on the other factor; at infinity everything gives zero contribution, so I have no boundary terms. So, up to normalization, I will find that this is the integral of g(s) times (s + d/ds)ⁿ applied to that exponential. Let's see what one application gives: d/ds acting on the exponential gives me −(s − t) − iν, and then I have to add an s, so the whole thing gives me t − iν. So what I get, up to normalization, is a factor (t − iν); I have to do this n times, and then what remains is just the Bargmann image of the Gaussian itself, and we know exactly what that is: e^{−¼(t² + ν²)}. So if you think in the Bargmann space, this factor e^{−¼(t²+ν²)} is always there; you always have in front of it some analytic function of not-too-fast increase, and what you get here, for the Hermite functions, is just the monomials. The Hermite functions become extraordinarily simple in Bargmann space: they are, up to normalization, (t − iν)ⁿ. Okay, we will use that in a second. So let me get all my boards in order; I have this magic formula here, and yes, I can still see it. What I would now like to show is that U_θ acting on h_n is something really, really simple. Let's compute it, not via the definition of h_n, but via the definition of U_θ: putting the rotation on the other label, as before, U_θ h_n = (1/2π) times the integral over all of ℝ² of ⟨h_n, W_{R_{−θ}(t,ν)}g⟩ W_{t,ν}g dt dν. Let's write the inner product out. The Euclidean norm of (t,ν) is invariant under the rotation, so I can just write e^{−¼(t²+ν²)}; and in front of it I have to write the n-th power of the t-part of the rotated vector minus i times its ν-part. That looks messy, but as you no doubt expect, it becomes much simpler, because t − iν is nothing but (t,ν)·(1,−i), the dot product of (t,ν) with the vector whose two components are 1 and −i. So I have to look at what the rotation does to (1,−i). It gives a vector whose first component is cos θ times 1 plus i sin θ, and whose second component is sin θ minus i cos θ, which is −i times (cos θ + i sin θ); so of course it is e^{iθ}(1,−i). So what has happened is that I have exactly e^{iθ} times (t,ν)·(1,−i): the whole expression is what it would be with no rotation in there, times a factor e^{iθ} raised to the n-th power, which I can bring outside the integral; and then I can backtrack through all the earlier steps.
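Written out, with constants untracked as in the lecture:

```latex
\[
\langle h_n,\; W_{t,\nu}\, g\rangle
  \;=\; c_n\, (t - i\nu)^n\, e^{-\frac14\,(t^2 + \nu^2)} ,
\]
```

so in Bargmann space the Hermite functions are, up to normalization, just the monomials zⁿ in z = t − iν.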
So what I find is that U_θ h_n is just e^{inθ} h_n. In terms of the harmonic oscillator: U_θ is the exponential of iθ times (the harmonic oscillator Hamiltonian minus one half), that is, minus the ground-state energy. Okay. So U_θ is this beautiful unitary operator that acts very simply on the shifted and modulated Gaussians, and that has the Hermite functions as eigenfunctions with very simple eigenvalues. And that is now going to help me deal with the operator I had at the very top there, which I want to analyze further. Let's look at this operator L_R, and imagine that you let L_R act on U_θ f; so the f up there I have to replace by U_θ f. Now we are going to do a little bit of mental gymnastics. The U_θ in front of that f is a unitary operator, so inside the inner product I can bring it over as its adjoint, a U_{−θ} acting on the windows; so let me write ⟨f, U_{−θ}W_{t,ν}g⟩ W_{t,ν}g. But we know what U_{−θ} gives on a W_{t,ν}g: we just let the rotation act on the label, W_{R_{−θ}(t,ν)}g. Then I use the fact that this disk in time-frequency space is invariant under rotation to change variables: the rotated pair becomes my integration variable, and then I have to write a W_{R_θ(t,ν)}g in the other factor; but that, exactly because of my computations, is U_θ acting on the result. (If you need to be super rigorous about it, remember that we defined these integrals in weak form; write everything in weak form and work it through.) So this is the same as U_θ acting on L_R f: the localization operator commutes with U_θ, which is very nice. My picture is this: in time-frequency space I define a disk of radius R, and I localize. Remember, every one of these rank-one projections tries to localize as well as possible around its point (t,ν); in fact, when you use the Gaussian as your window, you can give an explicit meaning to "localizing as well as possible": if you think of it in terms of quantum mechanics, you know you cannot localize extremely precisely in both position and momentum, so you localize as well as is compatible with the uncertainty principle. I superpose many of these optimally localized pieces, but only the ones within the disk: I am cutting out from the whole function only the content it has within this disk. And that operation, it turns out, commutes with rotations in time-frequency space, as it should: if my localization operator makes any sense, it ought to do that, and we have in fact verified that it does. What that means is the following: since the Hermite functions are eigenfunctions of U_θ, and since for θ not a rational multiple of π the eigenvalues e^{inθ} are all distinct, the spectrum of U_θ is nondegenerate; an operator that commutes with U_θ must then preserve its one-dimensional eigenspaces. So the Hermite functions have to be eigenfunctions of these localization operators. And you can actually find the eigenvalues very simply, because λ_n(R) is the inner product ⟨L_R h_n, h_n⟩, and by the definition of my L_R this is (1/2π) times the integral over t² + ν² ≤ R² of |⟨h_n, W_{t,ν}g⟩|² dt dν. And we know exactly what that is, because we computed it somewhere else, and with a bit of luck I may not have erased it yet... well, no, I do not have that luck, but we know that it gave us the Gaussian factor times the monomials.
So let's put that in. It gives us (1/2π) times the integral over t² + ν² ≤ R², with some normalization which I have not bothered to compute (but which one can of course compute exquisitely precisely), of the absolute values: |t + iν|² in absolute value is t² + ν², so I get (t² + ν²)ⁿ, and squaring e^{−¼(t²+ν²)} gives e^{−½(t²+ν²)}, all integrated dt dν. We can even do the integral over the angle; the 2π miraculously drops out, and we get an incomplete gamma function: from 0 to R, r^{2n} e^{−r²/2} r dr, which after renormalizing with u = r²/2 becomes the integral from 0 to R²/2 of uⁿ e^{−u} du. And we actually know what the full normalization must give us: if we integrated over the full space, that is, if R were infinite, we would get 1, because then L_R becomes the identity operator again and we get ⟨h_n, h_n⟩ = 1. So the constant is just one over the full integral, which is a gamma function: n factorial. So we know those eigenvalues. If you plot them, fixing some R and looking at the λ_n(R) as a function of n, they are monotone decreasing. If R is very large, then for n small you have already captured most of the function, so you get a value very close to 1; if n is much bigger than R²/2 you have something very, very small; and what you find is that the decay from 1 to 0 happens over a zone of order R. That corresponds exactly to the picture we had in time-frequency space with this disk: the Hermite functions really live on annuli in time-frequency space, and it is only when their order is compatible with the radius that you capture neither the full norm nor essentially zero; once the order is much bigger, you have essentially none of the norm when you localize on this disk. So it gives you a very nice interpretation of what this operator is. Such localization operators turn out to be interesting for applications in signal processing. (Question: why is one not using a smooth cutoff? It could be a Gaussian; it could be simpler, without the incomplete gamma function. Yes, and in fact you can do that: everything that is rotation invariant will work through, absolutely. The reason I first defined them with a sharp cutoff is that, although we now have other applications for them, at first they were motivated by what I am going to say now.)
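Since λ_n(R) = (1/n!) ∫₀^{R²/2} uⁿ e^{−u} du is exactly the regularized lower incomplete gamma function, the eigenvalue profile and its plunge zone are easy to inspect numerically. A small sketch (the function and variable names are mine):

```python
import numpy as np
from scipy.special import gammainc  # regularized lower incomplete gamma P(a, x)

def lambda_n(n, R):
    """Eigenvalue of the disk localization operator L_R on the Hermite
    function h_n: (1/n!) * integral_0^{R^2/2} of u^n e^(-u) du."""
    return gammainc(n + 1, R**2 / 2.0)

R = 10.0                  # disk radius; R^2/2 = 50 expected "degrees of freedom"
ns = np.arange(100)
lam = lambda_n(ns, R)

# Eigenvalues stay near 1 for n well below R^2/2, then plunge to 0
# over a transition zone of width O(R) around n ~ R^2/2.
for n in (0, 30, 45, 50, 55, 70, 90):
    print(f"lambda_{n}(R={R}) = {lam[n]:.6f}")
```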
And that harks back to a mention I made last time, when people were interested in doing explicit localization, in working on finite time intervals, with what are called prolate spheroidal wave functions. This is the work of David Thomson, which was preceded by the work of Henry Landau, Henry Pollak, and David Slepian. They were interested in the problem of finding good functions to describe the following physical situation: you look at signals, at functions, which you know are limited in frequency; so you know you are looking at f such that, if you take the Fourier transform and multiply it by the characteristic function of the frequencies ξ below some cutoff, you get f̂ back again. So these are band-limited functions; the set of f for which this is true, let's call it B_Ω. And they were going to observe such functions over a finite time interval: for f in B_Ω, you observe just the characteristic function of [−T,T] times f. One thing they wanted to know is: how big is that space? Can we prove things about how many independent functions you can find in there, can you construct them, can you find a nice orthonormal basis for that space, and so on? (Question: so here, when your f belongs to B_T, you mean...? No: f is in B_Ω, and you observe this truncation for functions that are in there; it is the interplay of the projections on the two sides.) There are old papers on this; there is even a paper by Peter Lax. Once, when I was talking about it with him, he said: "oh, we had an old paper on that question, which of course was completely bypassed by the work of Landau, Pollak, and Slepian." So people were very interested in the following operator. In L²(ℝ), let me define P_T, acting on an arbitrary f in L², by multiplying f by the characteristic function of |t| ≤ T; and define Q_Ω by requiring that on the Fourier transform side it consists of multiplying by the characteristic function of |ξ| ≤ Ω. These are projection operators, and B_Ω is exactly the space onto which Q_Ω projects. People were interested in the maximum norm you can get if you start with a function of norm one and let Q_Ω act on it, and then P_T. What that really amounts to is studying the operator times its adjoint: ‖P_T Q_Ω f‖² equals ⟨f, P_T Q_Ω P_T f⟩, since the projections are self-adjoint and Q_Ω combines with itself to give Q_Ω again. So they wanted to know: what can one say about the spectrum of this operator, about its eigenvalues, and in particular how many eigenvalues are close to one? You want them close to one, because you would like the interval where you observe to have captured most of the behavior of the function; you do not want to lose too much, and functions for which you would lose very much are functions you do not care about. So people were very interested in finding the eigenvectors and the spectrum of this operator. And what Henry Landau, Henry Pollak, and David Slepian realized is that this operator commutes with a very special second-order differential operator with non-constant coefficients; by some miracle, it just does. That happened to be a differential operator that people had looked at before, for mechanical problems, and whose eigenfunctions they had called the prolate spheroidal wave functions. Since it commutes with that operator, the prolate spheroidal wave functions are eigenfunctions of our operator as well, and that made it possible to study its spectrum very nicely. The spectrum has eigenvalues that are close to 1 and then decay to essentially 0, and the place where they drop is at about n ≈ 2TΩ/π.
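In symbols, consistent with the definitions just given:

```latex
\[
(P_T f)(s) = \chi_{[-T,T]}(s)\, f(s),
\qquad
\widehat{Q_\Omega f\,}(\xi) = \chi_{[-\Omega,\Omega]}(\xi)\, \hat f(\xi),
\qquad
B_\Omega = Q_\Omega\, L^2(\mathbb{R}),
\]
\[
\|P_T\, Q_\Omega f\|^2 = \langle f,\; P_T\, Q_\Omega\, P_T\, f\rangle,
\]
```

and the eigenvalues of P_T Q_Ω P_T stay near 1 up to about n ≈ 2TΩ/π, the time-frequency area 4TΩ divided by 2π.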
The width of the drop region, however, turns out to be more narrow. Over there, in my disk picture, I told you where the eigenvalues drop, at n about R²/2, and that the width of the transition zone was of order R, so about the square root of the drop location; here the width is of order log(TΩ), so much more narrow. What this tells you is how many possibly independent functions you can have if you are only going to observe from −T to T within the band-limited function space; it gives you a limit on how many functions you can have. Now, I remember these numbers easily because, in all cases, they are the area of a region in time-frequency space divided by 2π: in the disk case I have my explicit formula, but even if I did not remember the explicit formula, it is the area of the disk divided by 2π; in this case we are implicitly defining a rectangle [−T,T] × [−Ω,Ω] in time-frequency space, a rectangle of area 4TΩ, and we divide it by 2π. The number of degrees of freedom that you have when you look at a local region in time-frequency space is the area of that region divided by 2π. Now, if you insist on regions of the time-frequency plane that are not reasonable, this does not apply any more; for reasonable regions it does. (Question: can you say a bit about why the width is log(TΩ)? For that you have to go into the analysis of these eigenvalues; it is not something I can prove in a few lines. It happens to be the case: the transition is extraordinarily narrow. What happens with the prolate spheroidals is really very particular; people have tried to do similar things in more dimensions and so on, but I do not think it ever becomes as nice as in one dimension with the prolate spheroidal functions.) What is inconvenient here, however, is that the eigenfunctions themselves depend on the values of Ω and T. Of course, if you shrink Ω you can expand T by a scaling argument, so things depend only on the product; but if you change the value of the product, the eigenfunctions change. The nice thing about our localization operators was: yes, they do not localize exactly on the rectangle you get by band-limiting in one variable and truncating in the other, but they have the same eigenfunctions no matter what R you take, which is really nice, because it means you can localize on different areas in different ways. Also, I defined the localization operator around zero, but it is very simple, if you want to localize on a disk elsewhere, to just move the whole thing with a shift, and all these commutation relations work through beautifully. Okay. So now I will come back to the question that I tried to sweep under the rug, and that you teased out last time: what is the special role of 2π in the product of τ and ω when I look for these orthonormal bases? Now we can understand it, and in fact we can use the result we proved for our localization operator to do so. Our operator L_R we now know a lot about: we know its eigenfunctions, we have these λ_n(R), and so we can find the trace of L_R. It is going to be R²/2 plus a term of order R: that is how many eigenvalues I have that are close to one, each one giving me an extra one, and then I have a zone of order R where
things decay; beyond it the decay is exponential, so I do not have to worry about a tail. Now, if I have an orthonormal basis, then the trace is also Σ_{m,n} ⟨L_R w_mn, w_mn⟩; and my operator L_R is a positive operator, so I do not have to worry about things that have a trace without being trace class: everything is nice and positive. Now, if my w is nicely localized in time-frequency, then w_00 is nicely localized here at the origin, and w_mn is going to be localized m steps of τ away in time and n steps of ω up in frequency. So if I have this disk, then ⟨L_R w_mn, w_mn⟩, which you can write out as the integral over the disk of radius R of |⟨w_mn, W_{t,ν}g⟩|² dt dν/2π, is going to be mostly one if the point (mτ, nω) lies within my disk and mostly zero if it lies far away; there is a zone near the circumference where I get contributions between zero and one, but the number of lattice points lying within a fixed distance of the circumference is of order R. So I get that the trace is the number of lattice points in the disk, up to order R: since each cell of the lattice has area τω, that is the area of the disk divided by τω, plus something of order R. So the area of the disk over τω, which is πR²/(τω), has to equal R²/2 + O(R); if I divide by R², then, since τω is a constant, in the limit of disks that are big enough I find that τω has to be 2π. It is something you know from quantum mechanics too: in Weyl's formula for the number of bound states of a Hamiltonian, you look at the classical Hamiltonian, you take the volume of the region where its energy is negative, you divide by 2π, and that gives you a semiclassical approximation for the number of bound states; it is the same argument. Okay, so we have revisited this, and we have found that for an orthonormal basis you can only have τω = 2π. The same argument tells you more. Look now at a different plane: not the time-frequency plane, but the plane of the parameters τ and ω, and let us look at different regions in this parameter plane and at the properties of the family of functions w_mn that I get by discretely moving my window in integer steps of τ in time and ω in frequency. We have this hyperbola ωτ = 2π. If I build a family w_mn with ωτ > 2π, then it is clear that these functions cannot span the full space: my mesh is too coarse, too few of my functions are localized within a given disk, and the disk needs about R²/2 functions localized there. So in that region the w_mn do not span H. On the hyperbola itself is the only place where orthonormal bases are possible, but, as we saw last time, they cannot be very nice, meaning they either decay badly in time or badly in frequency; they cannot have good decay in both, faster than 1/|s| in time and 1/|ξ| in frequency. And what happens below the hyperbola, where ωτ < 2π? The same argument will tell me that I am trying to cram too many functions into a space whose dimension is just not large enough.
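To recap the counting argument in one line (the O(R) terms collect the boundary zone in both counts):

```latex
\[
\operatorname{tr} L_R = \sum_{n\ge 0}\lambda_n(R) = \frac{R^2}{2} + O(R),
\qquad
\operatorname{tr} L_R = \sum_{m,n}\langle L_R\, w_{mn},\, w_{mn}\rangle
  = \frac{\pi R^2}{\tau\omega} + O(R),
\]
\[
\frac{\pi R^2}{\tau\omega} = \frac{R^2}{2} + O(R)
\quad\Longrightarrow\quad
\tau\omega = 2\pi \qquad (R \to \infty).
\]
```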
Indeed, if I look at all the w_mn whose labels lie within such a disk, then I have more of them than the number of eigenfunctions of the localization operator with eigenvalue bigger than one half, and so they cannot be linearly independent. So in that region there is no linear independence, and the family is what is sometimes called overcomplete. Okay, I think this is a nice point to break: first because I am thirsty, and second because we will shift to a slightly different topic. We have introduced quite a bit of machinery, and all the machinery you have seen we will use again in different places. Now that I have made up my mind about what we will talk about, I thought I would tell you a little of the list of things we will see next week as well; if there is something you would like me to talk about that is not on the list, please let me know, and I will try to prepare it and squeeze it in. My email address is very simple: it is my first name at math, because I am considered a mathematician these days. Well, I pretend to be a mathematician; I have a little dirty secret: I have no degree in math; all my degrees are in physics. (But at least you have some degrees; some famous people, vastly more famous, do not have a PhD. That's true.) Okay. So this afternoon, since these are the Hadamard lectures, I was thinking of telling you about some work I have done on inverse problems, ill-posed problems in the sense of Hadamard, and that also has to do with time-frequency localization, because that comes in all the time. So: inverse problems. Then next week I would like to tell you a little bit more about the construction of Wilson bases and their use in LIGO, although I will say very, very little about the LIGO use itself, because I was not involved with that at all; it was a major surprise to me that they all of a sudden turned out to be really useful, although I understand why they are useful. And I also would like to tell you, and this will use these localization and rotation operators again, something about what we call synchrosqueezing, for signal analysis. Let me say a few words about it now, so that you know what it is about, and then I will come back to it next week. The very first time, I showed you spectrograms, and if you remember, there were things that in time-frequency were bands; these had to do with the fact that the spectral properties, the frequency properties, of the signal were changing in time, and that is why, as time goes on, the frequency profile changes. But they were also all kind of blurry: there was nothing sharp there, nothing shiny; it was all things that I have to draw by putting my chalk down flat. There are applications where we really want to know things more precisely. You might say: the uncertainty principle means you cannot do things precisely. Exactly; of course you cannot. But there are, on the other hand, applications where you know more, so that you can do better. In optics, for instance, people do super-resolution: they go beyond the Rayleigh limit in resolving sources, because they know that what they see is two sources; if they have two pinpoint sources and they see something blurrier, they can figure out where the two point sources had to be in order to give the response closest to what they actually see. That, in a sense, is fitting: if you have something with very few parameters,
then you can look at what you get and fit those parameters as best as possible. What we want to do is not really fitting, we have many more parameters than that, but we want to make use of the fact that we have an idea of what our signal looks like. In some medical applications we know our signal; an example is the electrocardiogram. I work with doctors, and they have an exquisite way of drawing these. If I do it... is there any doctor in the room? No? Okay, then you will not mind that I draw an electrocardiogram like this. You see this thing repeat; you have all seen, in every ER series on television, the machine that goes bleep, bleep, bleep, and it repeats periodically. Except it is not quite periodic; if it were periodic it would be so easy to analyze, and it is not. In fact, and this is well known, the distance between successive peaks is an indication of respiration: one over the distance between peaks is a first measurement of your respiration, and that is often how your respiration is measured, from your electrocardiogram. Although you might think it so much easier to measure air coming in and out than electrical properties of your heart, electrocardiograms are very easy: they just put on electrodes. Measuring the air is something for which they have to put an apparatus in front of you; it is very cumbersome, and it is not something you can forget about, which you can with the electrocardiogram. So they measure your respiration by these intervals between successive peaks, which means they already use the fact that it is not periodic. And if it is not exactly periodic, then time-frequency tools have much more trouble. So what we know is that we have signals that look like some basic shape, which we do not know exactly, repeating in time: we have a waveform s, say periodic, but composed with a rewarping of time that is not uniform, so φ′ is not constant; we have an amplitude that may change; and we typically have a sum of a few such signals, with amplitudes a_k that are also not constant. But the fact that you have a model like that, even though I am not prescribing what the a_k or the s_k should be, means I still have an idea, and we try to make the reading of the signal more precise that way. We still have a lot to do, and some of the tools I have described come into how we are approaching this at present. We have come far enough that my collaborator on this, who holds not only a PhD in applied math but also a medical degree (he is a fully trained specialist in radiography), and I can now, from electrocardiograms taken from a pregnant mother, extract the heartbeat of the fetus. That is a very, very small signal, with very low SNR, a very low signal-to-noise ratio, and of course it is dominated by the electrocardiogram of the mother. And it is diagnostically important: right now, for babies that they expect to have a heart problem, the only way to measure the heartbeat of the baby during delivery is by putting in a probe, with all the dangers of that; so if you can do it by signal analysis, that is much better. So I will come to this at the very end; and, surprisingly, some of the machinery of the rotations and so on that we have seen today will come back in there.
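A minimal written form of this signal model (the notation is mine; the lecture fixes only the names s, φ, and a_k):

```latex
\[
f(t) \;=\; \sum_{k=1}^{K} a_k(t)\; s_k\big(\varphi_k(t)\big),
\]
```

with each s_k periodic, and with the amplitudes a_k and the instantaneous frequencies φ′_k slowly varying rather than constant.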
So that's the plan for the remaining lectures. But today I want to talk about (yes, the other question) inverse problems and time-frequency localization. The time-frequency localization comes in, in a non-essential way: it is essential to the problems that we solved, but not essential to the thing we did. Still, it is the only thing I have done in my career that relates directly to Hadamard, so I felt I owed it to him. I have talked about, and given you examples of, sound signals and other things that are localized in time and in frequency, and I have also shown you that we can fabricate little nuggets of things that are localized in time-frequency. But in many cases of interest, looking at things like these W_{t,ν} windows, we will be almost forced to look at situations where we have a redundant family of functions. For instance, I discretize, but I want to discretize very finely, because I want to be able to put things in very special locations; that means I might take τ and ω so that their product is much smaller than 2π, and I will have an enormous redundancy of functions. Then, given a function, there are many, many ways in which I can write a linear combination of these building blocks that gives me the function; or, since in practice you always have noise and you do not want to reconstruct exactly but only up to a certain margin, that gives me the function up to a small distance; but even then there are still many, many ways. And this is something that in some cases is useful: you could try to find the way of writing it so that the coefficients have the smallest ℓ² norm, and for that we have very good operator methods. What you do is define an operator T from L²(ℝ) to ℓ²(ℤ²) that maps f to its inner products with this family; you study the properties of that operator, and through them you can find a procedure that gives you this type of reconstruction, the one with the smallest ℓ² norm for the coefficients. Fine. It just turns out that in a ton of other applications, that is not really what you want to do. The ℓ² norm for a sequence is the norm we often prefer to work with because it is so easy: ℓ² is a Hilbert space, for heaven's sake; it is almost kindergarten stuff; you love it because it is so easy. But think of what the contribution is of every single coefficient in the sequence: because the square function is flat near zero, very tiny coefficients do not contribute much to your ℓ² norm. So asking for the coefficient sequence of minimum ℓ² norm means you do not really care about having a lot of dust at very small values. And you could say: why first compute that dust and then put it to zero; why not do something else instead? Recently, people have become very interested in doing exactly that.
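The minimal-ℓ² reconstruction just described is the standard dual-frame recipe; schematically (standard frame theory, not spelled out on the board):

```latex
\[
(Tf)_\lambda = \langle f,\, w_\lambda\rangle,
\qquad
S = T^{*}T : \; f \mapsto \sum_{\lambda}\langle f,\, w_\lambda\rangle\, w_\lambda,
\]
\[
f = \sum_{\lambda} c_\lambda\, w_\lambda
\ \text{ with minimal } \|c\|_{\ell^2}
\quad\Longleftrightarrow\quad
c_\lambda = \langle f,\; S^{-1} w_\lambda\rangle .
\]
```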
So: given an F which is the effect of some operator on a function f, and this is most of the time what you have in physics (most of the time you are interested in an object that you observe through a telescope, through a microscope, whatever), you want the original thing, not what you have measured from it. What you know is that F, the thing you measured, is some known operator A acting on f, plus noise, and you try to construct the best possible approximation f♯ to f. Now, best possible approximation: what do I mean by that? In the case where I know the object is well localized in time-frequency, I might want to construct an f♯ that is a linear combination of such building blocks, but with a coefficient sequence as sparse as possible. It might be that I know the thing I am looking at has a sparse expansion in building blocks of this type, which I construct because they make sense for my problem; they could be time-frequency localized, they could be wavelets; actually, for what I am going to do, it does not matter what they are, but often we know the expansion is sparse because the object has good localization in space and in spatial frequency. So we try to do that. Now, the ℓ² norm is not a good idea for this, but it turns out that the ℓ¹ norm is a better idea, and the reason is that if you compare the two functions, t versus t², the ℓ¹ norm makes you pay much more for small coefficients than the ℓ² norm does. So the idea is to model the whole thing by saying: you want coefficients, let's call them c_λ, where λ runs over some index set, and building blocks, let's call them w_λ, so that the distance between the reconstruction and the observation is small; and typically I like to put this in an L² or ℓ² norm, depending on how I measure it, because typically we know (well, we believe we know) that the noise is close to white Gaussian noise, and then it makes sense to ask that this be small in the ℓ² sense. And then you put in an extra parameter γ, you add γ times the sum over λ of |c_λ|, and you minimize the whole thing. That is how you model the problem. Now, why model it that way? I have given you some arguments already; let me give you a few more. The first: I do believe that making such linear combinations of my building blocks is a good idea, I believe the expansion will be fairly sparse, and I know that the ℓ¹ norm is a better way of measuring sparsity than the ℓ² norm. One way to see this, as a justification: forget about the operator for the moment, and suppose (which is not always the case, but let's imagine it) that the u_n form an orthonormal basis. Then I want the argmin over c of ‖Σ_n c_n u_n − f‖² in ℓ², plus γ Σ_n |c_n|, the ℓ¹ norm of the coefficients. Because I have an orthonormal basis, expanding the whole thing in the u_n turns this into the argmin of Σ_n (c_n − ⟨f, u_n⟩)² + γ Σ_n |c_n|, and now I can minimize over each c_n separately.
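In a formula, with γ for the penalty parameter (the name the lecture settles on later):

```latex
\[
c^{\sharp} = \operatorname*{argmin}_{c}\;
  \Big\| \sum_{\lambda} c_\lambda\, w_\lambda - f \Big\|^2
  + \gamma \sum_{\lambda} |c_\lambda|,
\qquad
f^{\sharp} = \sum_{\lambda} c^{\sharp}_\lambda\, w_\lambda,
\]
```

which, for an orthonormal basis {u_n} and no operator, decouples into the scalar problems of minimizing (c_n − ⟨f, u_n⟩)² + γ|c_n| over each c_n.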
Let me do a quick-and-dirty minimization of that. This is of course not differentiable at c_n = 0, but let me not care for the moment (one can justify this; done correctly, you work with the subdifferential, and you get exactly the same thing). Writing b_n = ⟨f, u_n⟩ and differentiating where c_n ≠ 0, the minimizing equation is 2(c_n − b_n) + γ sign(c_n) = 0. If c_n is positive, this gives c_n = b_n − γ/2, but for this to be positive I need b_n > γ/2. If c_n is negative, then |c_n| is just −c_n, and I get c_n = b_n + γ/2, which is consistent only if b_n < −γ/2. So you see that between these two consistency conditions there is a gap: if b_n lies between −γ/2 and γ/2, then I can have neither this case nor that case; I cannot have c_n different from zero, so I get c_n = 0. That is the dirty derivation. So what you get is that the minimizer is c_n = b_n − (γ/2) sign(b_n) if |b_n| > γ/2, and zero otherwise. What that tells you is that you get exactly what you would have gotten without the penalty term, except for an extra thresholding. And the thresholding I am seeing here is the following: plot the possible values of b_n = ⟨f, u_n⟩ horizontally and what c_n will be vertically. Without the extra term I would get the diagonal; with it, I am subtracting a little something here, adding a little something there, and putting zero in between. This is called soft thresholding, and it is telling you that if my original function had a lot of fluff in some of the u_n, I am cutting it out. So, just as a justification, as motivation: it is something that cuts out small terms. Another justification comes from the area of what is called compressed sensing, something that has had an immense impact on a whole range of signal-processing areas. The idea is this: if indeed I work, as I am proposing, with very redundant families of building blocks, that means I am working in a very, very large coefficient space; I am thinking of very long vectors, a tall column vector in the way people have of drawing it. But the true space in which I work, what I observe, is actually smaller dimensional: my operator A is a matrix that is much wider than it is tall, and I observe y = Ax. It is then of course obvious that, although there might well be solutions, there will be a multitude of them, because the problem is underdetermined. However, it might be that you know ahead of time that the solution, the thing you are looking for, has very few entries, in fact far fewer than the observed dimension. And then you could ask yourself: now that I know I do not have that many numbers to find, can I find them from these observations? It is delicate to say that you have only a small number of degrees of freedom, because the nonzero coefficients could be anywhere: any of the dimensions might be used, so you have a large number of potential degrees of freedom, but you will not use many of them at the same time. So the question is: if you have this matrix A, and you want to find x once y is given, can you do that, and how would you find it?
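Going back one step, here is a numerical sanity check of the soft-thresholding formula just derived, comparing the closed form against brute-force minimization of the scalar objective (the names are mine):

```python
import numpy as np

def soft_threshold(b, gamma):
    """Minimizer over c of (c - b)^2 + gamma*|c|: shrink b toward 0 by gamma/2."""
    return np.sign(b) * np.maximum(np.abs(b) - gamma / 2.0, 0.0)

gamma = 1.0
cs = np.linspace(-5, 5, 200001)          # fine grid of candidate minimizers
for b in (-2.0, -0.4, 0.0, 0.3, 1.7):
    c_grid = cs[np.argmin((cs - b)**2 + gamma * np.abs(cs))]
    print(f"b={b:+.2f}  closed form={soft_threshold(b, gamma):+.4f}  grid={c_grid:+.4f}")
```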
Another justification comes from the area of what is called compressed sensing; still justification. This is something that has had an immense impact on a whole range of signal processing areas. The idea is this: if I work, as I am doing here, with very redundant families, that means I am looking at a very large coefficient space, so I am working with very many, very long vectors (a column vector, in the way people have of drawing it), but the true space in which I work, what I observe, is smaller-dimensional. So my operator A is a matrix that is much wider than it is tall, and I observe y = Ax. Then it is of course obvious that although there may well be solutions, there will be a multitude of them, because the problem is underdetermined. However, it might be that you know ahead of time that the solution you are looking for has very few nonzero entries, in fact far fewer than the dimension of y. Then you can ask yourself: now that I know I do not have that many numbers to find, can I find them from these observations? It is hard to say you have only a small number of degrees of freedom, because those few coefficients could be anywhere; any of the dimensions might be used. So you have a large number of degrees of freedom, but you will not use many of them at the same time. The question is: given the matrix A, once y is given, can you find x, and how would you find it?

The answer is that for a large class of matrices A, it turns out that the optimally sparse x is the argmin of ||Ax - y||^2 + gamma ||x||_1, where you have to choose gamma in the right way. So doing what I was doing there with the l1 norm turns out to be the right thing to do for these matrices. But although this is a large class of matrices, it is not all A, and in practice it is almost impossible to verify that a given A is in this class. Worse, if my physical problem is of a certain sort (here I am not really talking about a physical problem, I am decomposing into time-frequency blocks, but in many cases your operator A comes from the transfer function of the telescope, or the microscope, or whatever), then you know that A will not have this property. What I mean by "optimally sparse" is the solution with the smallest value of what people sometimes call the l0 norm of x: for every coefficient you count a one if it is nonzero and nothing if it is zero, so you just count the nonzero coefficients. That is not a norm by any stretch of the imagination, so you cannot really call it a norm, but that is the notation. What is special about compressed sensing is that you have this collection of matrices for which you can prove that going for the smallest l0 norm is achieved by going for the smallest l1 norm, which is a true norm, which is convex, and which is something you can actually solve, whereas minimizing the l0 norm directly is something you can never hope to get around to.

(Question: do you minimize also over tau? What do you do if you do not know tau?) Yes, so there is a link; and it has nothing to do with the tau of the translation step, so I am sorry, let me call it gamma. In these problems you can pose the question in three different ways. You can minimize ||Ax - y||^2 over x subject to ||x||_1 smaller than some R; that is one problem. Another problem is to look for the argmin of the l1 norm given that ||Ax - y||_2 is smaller than some epsilon. And the final one is the argmin of ||Ax - y||^2 + gamma ||x||_1. In every case you have to pick a parameter. What you can show is that the solution set, as you let R vary, or epsilon vary, or gamma vary, is the same. So you can solve for a particular gamma, then compute ||Ax - y||_2, and that is the epsilon for which you would have found the same minimizer in the second problem; or you can compute ||x||_1, and that is the R for which you would have found the same minimizer in the first. What the correspondence is depends on your operator A, but because the whole thing is convex, and because you are looking at minimizers every time, you can show this equivalence of minimizers. So in that sense gamma determines R, but I cannot give you an explicit formula, because it would depend on A. I like to look at the problem in the penalized rather than the constrained form, because that is typically much easier to implement numerically; constraints are things you have to work harder for. The three formulations are summarized below.
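For reference, the three equivalent ways of posing the problem, with their respective parameters R, epsilon and gamma:

\[
\text{(i)}\;\min_x \|Ax-y\|_2^2 \ \text{ s.t. } \|x\|_1\le R,
\qquad
\text{(ii)}\;\min_x \|x\|_1 \ \text{ s.t. } \|Ax-y\|_2\le \epsilon,
\qquad
\text{(iii)}\;\min_x \|Ax-y\|_2^2+\gamma\|x\|_1 .
\]

As R, epsilon and gamma vary, the three solution sets coincide; the correspondence between the parameters depends on A.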
So I have now finished with my justification; from now on I feel I have justified it, and I am going to look at l1-penalized minimization. In many of these situations the original problem is truly an ill-posed inverse problem in the sense of Hadamard, but the term you impose regularizes it: minimizing the penalized quantity gives you an approximate minimizer of the original problem that is now continuous in all the parameters of your problem. So that was my justification, and I am interested in finding minimizers of this l1-penalized functional.

We actually worked with exactly this formulation in some geophysics problems, where we were decomposing into elementary building blocks with nice time-frequency localization, and the operator itself was the whole operator that reads off the seismic traces from the structure of the earth, which we were modeling with these blocks. For that we relied on the geophysicists, who had great models for that operator; all of it together came into the operator A, x was our coefficient vector, and we wanted that coefficient vector to be sparse. Just on the side: these are papers with Guust Nolet, who was in Nice and has now just retired, so if you look for papers with Guust Nolet and myself as authors you will come across them. The idea was this. If you look at the whole earth, there are all these layers, and it is more or less known at what depth they are; we are not talking about finding oil layers, since oil layers all happen within the thickness of this stick of chalk here, we are talking about deeper structure than that. They look at the seismic traces registered all over the earth from really large earthquakes; those have enormous energy, and things that happen here get propagated and measured there, so from what you register you get some imprint of what the waves went through. So they modeled all that, and what they hoped was to find evidence of what are called plumes. There is a strong belief that in places where you have island chains, like Hawaii, there is a plume of material that wells up from the mantle and causes these volcanic islands to form; with tectonic drift these things move away, and so chains of islands get formed. But the plumes are very localized in space. So the idea was: by building models with spatially localized building blocks, and imposing sparsity (because plumes are sparse, they do not exist in many places), giving the model the freedom to put these plumes in if it needed them, we could see whether we could tease them out. It turned out that the data did not have sufficiently high resolution to get to that, but that is what the effort was trying to do. So what we really had was an operator A acting on a coefficient vector of the building blocks we were using, which we knew to be sparse, and that is why we put in this penalty while trying to reproduce the data, and then saw where the model put the sparse elements.

(Question about the parameters.) You see, here the analogue of (t, nu) is two or three dimensions in space and also spatial frequency, because for these earthquakes there are dispersion relations, and so different
spatial frequencies get dispersed, transmitted at different speeds; so when you measure, in their model you also decompose in spatial frequency. (But the plume is fixed?) Yes, the plume is fixed, but we have building blocks at different scales; we worked with wavelets there, so we had fine-scale and large-scale building blocks, and that is the second, frequency-like parameter. In any case, that is the problem you want to work with: you want to find algorithms for solving it, you want to prove convergence, and you want to prove that it is regularizing. Typically, even if A has no kernel, which is often the case (in the underdetermined situation you do have a kernel, but even when A has no kernel), you still have eigenvalues that can go very small, so the operator does not have a bounded inverse, and you still need regularization; that is what the penalty term takes care of.

This was work I did in the early 2000s with Michel Defrise, the brother of Lucette Defrise, and with Christine De Mol. The algorithm we analyzed had also been proposed by other people, which we were not aware of at the time, although not with a proof of convergence in infinite dimensions; so we were lucky, we had still done something nobody else had done. There was work by several groups and their students, work by Rob Nowak and Figueiredo, and still other people.

But then we found that engineers very often do something else, because algorithms for minimizing l1 norms are slow, whereas whenever you have an l2 penalty you can use conjugate gradient methods and things go very fast numerically. So what many engineers do when they have to solve an l1-penalized problem is to put in a reweighted l2 norm: they turn it into what they call an iteratively reweighted least squares algorithm. Many of the algorithms for solving this (I will come back to that) are iterative: you start with an initial guess, you compute a better guess, and so on. You compute iterates x^n, where x^{n+1} is computed from x^n and the data, and you hope to prove that x^n converges to a minimizer x^#. What iteratively reweighted least squares methods do is say: at the nth step, let us minimize this, but the l1 norm is a nuisance because it is not differentiable, so let us make it an l2 thing. You replace each |x_k| by x_k^2 divided by a weight that, if the iteration converges, will in the limit just give you back |x_k|; and then, of course, if that denominator is zero or very small you are in trouble, so you put in a square and a square root with an epsilon. So at every step you put a different weight into the problem. The trouble is that, with a fixed epsilon, this is not going to converge to what you want: it converges (if you are lucky; and you can prove that it converges) to the problem whose penalty is this epsilon-smoothed version, and that immediately gives you back all the small fluff that you typically get; it does not give you the l1 norm.
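In symbols, and this is my transcription of the scheme just described, with epsilon held fixed for the moment, the reweighting replaces the l1 term by a quadratic one built from the current iterate x^n:

\[
\|x\|_1=\sum_k|x_k| \;\longrightarrow\; \sum_k \frac{x_k^2}{\sqrt{(x_k^n)^2+\epsilon^2}}
=\sum_k w_k^n\,x_k^2,
\qquad w_k^n=\big((x_k^n)^2+\epsilon^2\big)^{-1/2},
\]

so each step is a weighted least squares problem; if the iteration converges, the penalty tends to the epsilon-smoothed sum, which for fixed epsilon keeps small components nonzero (the "small fluff") instead of thresholding them to zero.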
So part of your justification for the l1 norm is gone. On the other hand, if you take the epsilon away entirely, you can really run into trouble. What we then did was find that there is a way of defining epsilon_n's such that the method converges, and converges to the right limit. Again, that was something that had not been known: that you could run an iteratively reweighted least squares algorithm for solving l1-penalized problems in a way that gives you the right solution. Epsilon_n depends on the previous iterates. You want the epsilon_n's to decrease, and to go to zero, of course, but not too fast; so it will actually depend on the norms of previous iterates. But it works, and it is fast: you can do a couple of steps of conjugate gradient with one weight, then readjust the weight and run conjugate gradient again, and so on. This work was done in a paper by my student Sergey Voronin and myself, which should be coming out soon; it took a while after he finished for it to be written up. And there is an improvement on this: we proved convergence for the original method, where each weighted quadratic problem is solved to convergence before the next reweighting; if instead you do not run conjugate gradient to convergence but do just a couple of steps and then go to the next iteration, that was done by Massimo Fornasier and one of his students, and then you get a significant speed-up compared to standard l1 methods.

So there are a number of ways in which you can attack these problems. Optimization theory is in fact a very nice branch of applied mathematics in its own right, and there are very slick ways of approaching this; what I will show you is fairly pedestrian, but it explains what is going on.

So I want to look at ||Ax - y||^2 + 2 gamma ||x||_1. First of all, I like to put a 2 in front of the gamma, because in what I did up there I now realize I conveniently forgot a factor 2: the minimizing equation should have been 2(c_n - f(u_n)) + tau sign(c_n) = 0, so I should have thresholded at tau/2. If I put the factor 2 into the penalty, I do not have to write 1/2 every time. So let me define F(x) = ||Ax - y||^2 + 2 gamma ||x||_1 as a functional of x. Up there I could write the minimizer explicitly, because I did not really have an operator: everything was nicely diagonal in the c_n's, so I could minimize componentwise. Here I have this A. Let me try the same trick anyway, assuming everything is real (complex makes it slightly more complicated, but not essentially different): I get x . A*Ax - 2 x . A*y, plus ||y||^2, which of course does not matter for the minimization, plus 2 gamma times the sum over k of |x_k|. Writing the minimizing equations componentwise, I get 2 (A*Ax)_k (a factor 2 because it is quadratic) minus 2 (A*y)_k plus 2 gamma sign(x_k), and that should be zero; the overall factor 2 (yes, it disappears, thank you) I divide out.
So: (A*Ax)_k - (A*y)_k + gamma = 0 if x_k is positive, and (A*Ax)_k - (A*y)_k - gamma = 0 if x_k is negative. (You don't invert A*A?) No, I am not inverting anything; I am trying to write one formula that incorporates everything, and that would otherwise be completely useless. If x_k is positive, then I have to satisfy the first equation, so gamma = (A*y)_k - (A*Ax)_k; that means x_k + (A*y)_k - (A*Ax)_k = x_k + gamma, which is bigger than gamma, so soft thresholding it with threshold gamma subtracts gamma and gives back exactly x_k. All I have done is written something simple in a more complicated way, but it is true. If x_k is negative, then gamma equals what I have on the other side, (A*Ax)_k - (A*y)_k, so x_k + (A*y)_k - (A*Ax)_k = x_k - gamma; this whole thing is negative with absolute value bigger than gamma, so soft thresholding adds gamma, everything cancels again, and I get back x_k. And if neither case holds, that is, if this whole contraption does not exceed gamma in absolute value, then I get x_k = 0. So it is true that

x_k = S_gamma( x_k + (A*y)_k - (A*A x)_k )

is the minimizing equation for this functional. I did it hand-wavingly here; you can do it rigorously (it would take me another board) by really looking at what happens under a variation: you add tu to x, and you have to distinguish those k for which x_k is nonzero, because there you can make t small enough that sign((x + tu)_k) = sign(x_k); if x_k is zero you cannot do that, you end up with the u, and you get a different equation. That is how you do it rigorously, and in the end you do get this as the minimizing equation.

But you see, this is a nuisance, because the operator A*A sits inside the thresholding: finding x directly from this equation is not trivial. Of course, it suggests that you define

x^{n+1}_k = S_gamma( x^n_k + (A*y)_k - (A*A x^n)_k ),

so that at least you have an iterative procedure; and that, in fact, is what had been proposed by a number of people. But you have to prove that it converges, and that is what we are going to do now. (In such a case it is a proximal method?) Yes, but many of the proofs you have for that are in finite dimensions, and I want to prove convergence in infinite dimensions.
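Here is the suggested iteration as a minimal sketch in Python; this is my own code, written under the lecture's standing assumption that the norm of A is less than one, and it is simply the fixed-point equation above turned into an update rule, not tuned software:

```python
import numpy as np

def ista(A, y, gamma, n_iter=500, x0=None):
    """Iterative soft thresholding for min ||Ax - y||^2 + 2*gamma*||x||_1.

    Assumes the spectral norm of A is < 1 (rescale A and y first otherwise),
    which is what the convergence argument below needs.
    """
    x = np.zeros(A.shape[1]) if x0 is None else x0.astype(float).copy()
    for _ in range(n_iter):
        v = x + A.T @ (y - A @ x)                             # Landweber step
        x = np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)   # S_gamma
    return x
```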
OK. F(x) is by definition ||Ax - y||^2 + 2 gamma ||x||_1, and I want to get rid of this A*A; that is what is causing me the problem. So I am going to define something that looks like my functional but has some extra terms. I am going to assume ||A|| < 1 (with different approaches you can get to ||A|| less than the square root of two, but for the proof I am going to give, ||A|| < 1 works and is easier), and I am going to add a term that, because ||A|| < 1, is nonnegative, that I hope will disappear in the limit, and that will eliminate my ||Ax||^2. What I write is ||x - a||^2 - ||A(x - a)||^2, and I will consider in particular x near the previous iterate: I define x^{n+1} as the argmin in x of G(x; x^n), where G(x; a) = F(x) + ||x - a||^2 - ||A(x - a)||^2.

Now I have a number of interesting observations that help me. I have that F(x) is always smaller than or equal to G(x; a), for any a; and I have F(x^n) = G(x^n; x^n). So F(x^{n+1}), which is the same thing as G(x^{n+1}; x^{n+1}), is smaller than G(x^{n+1}; x^n); and G(x^{n+1}; x^n), because x^{n+1} is the minimizer, is smaller than G(x^n; x^n), which is F(x^n). So at least I see that the F(x^n) decrease.

Next, let us look at G(x^{n+1}; x^n) - F(x^{n+1}). Since the one is defined from the other by adding that extra term, this difference is just ||x^{n+1} - x^n||^2 - ||A(x^{n+1} - x^n)||^2, and that is at least (1 - ||A||^2) ||x^{n+1} - x^n||^2; this is why I wanted ||A|| < 1. Call the constant C = 1 - ||A||^2, which is bigger than zero. Because this difference is sandwiched between the two decreasing values, I get that ||x^{n+1} - x^n||^2 is smaller than (1/C)(F(x^n) - F(x^{n+1})); any finite sum of the left-hand sides equals a telescoping sum on the right, which is bounded by (1/C) F(x^1), because all the terms are positive. It follows that the infinite series of ||x^{n+1} - x^n||^2 converges.

(Excuse me, maybe I missed something, but since the norm of A is less than one, isn't this a contraction operator, so that if there is a fixed point it is the only solution?) Careful: A could have eigenvalues at or near zero, so the iteration map need not be a contraction. And you are right that, as I first wrote the inequality on the board, I had done something wrong; it was pointing the wrong way. For any u we have ||u||^2 - ||Au||^2 greater than or equal to (1 - ||A||^2)||u||^2, a lower bound, not an upper bound; that is what I should have written, and then everything is okay. I made two mistakes that cancelled each other; thanks for catching them. And that is the whole point of introducing this construction: we have a way of sandwiching ||x^{n+1} - x^n||^2 by differences of F values, and those chain together.
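Collected in one place, my transcription of the construction on the board: with F(x) as above and ||A|| < 1,

\[
G(x;a)=\|Ax-y\|^2+\|x-a\|^2-\|A(x-a)\|^2+2\gamma\|x\|_1\;\ge\;F(x),
\qquad G(x;x)=F(x),
\]

and the monotonicity chain reads

\[
F(x^{n+1})\le G(x^{n+1};x^n)\le G(x^n;x^n)=F(x^n),
\qquad
C\,\|x^{n+1}-x^n\|^2\le F(x^n)-F(x^{n+1}),\quad C=1-\|A\|^2>0 .
\]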
OK, so that does not yet give me convergence, but I also have an equation relating x^{n+1} and x^n. Let me look at what G(x; a) is when I write it out. The ||Ax||^2 drops out: I get ||x||^2 - 2 x . A*y - 2 x . (a - A*A a) + 2 gamma ||x||_1, plus stuff without x. So, rewriting the whole thing: ||x||^2 - 2 x . (a + A*y - A*A a) + 2 gamma ||x||_1, plus terms that do not involve x. We have seen a number of times how to get the minimizing equation of such a thing: you take the quadratic part and then apply a soft thresholding. Componentwise the equation is 2 x_k - 2 (a + A*y - A*A a)_k + 2 gamma sign(x_k) = 0. If x_k is positive, then x_k = a_k + (A*y)_k - (A*A a)_k - gamma, and for x_k to be positive I need a_k + (A*y)_k - (A*A a)_k bigger than gamma; I get the same expression with + gamma if x_k is negative, with the symmetric consistency condition; otherwise x_k = 0. So the minimizing equation is the soft thresholding

x_k = S_gamma( a_k + (A*y)_k - (A*A a)_k ).

Great, that is my minimizing equation. If I apply it with a = x^n, so that the minimizer is x^{n+1}, I get

x^{n+1}_k = S_gamma( x^n_k + (A*y)_k - (A*A x^n)_k ),

and this is starting to look like what we want: if I have convergence, then things will indeed converge to the right minimizing equation. If I have convergence, which I have not proven yet.
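Before the convergence argument, a quick numerical sanity check of the characterization just derived; this reuses the ista sketch from above, on arbitrary toy data of my own choosing, to confirm that the iterate approximately satisfies the fixed-point equation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
A /= 1.1 * np.linalg.norm(A, 2)           # enforce ||A|| < 1 by rescaling
x_true = np.zeros(50)
x_true[[3, 17, 31]] = [1.0, -2.0, 0.5]    # a sparse ground truth
y = A @ x_true

x = ista(A, y, gamma=1e-3, n_iter=5000)   # ista() from the sketch above
v = x + A.T @ (y - A @ x)
fixed_point = np.sign(v) * np.maximum(np.abs(v) - 1e-3, 0.0)
print(np.max(np.abs(x - fixed_point)))    # should be tiny: x solves the equation
```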
(But don't I get that from the summability of the squares?) I wish; that would be great, it would be wonderful, but we have to work a little harder. We do know that the differences x^{n+1} - x^n go to zero in norm, but we do not know that they are summable. What do we know? We know that all the F(x^n) are bounded, smaller than F(x^1), because we have a decreasing sequence; so the l1 norms of all the iterates are bounded, and hence the l2 norms are all bounded by some radius R less than infinity. So we have a weakly convergent subsequence: there are indices n_l such that x^{n_l} converges weakly to some x-tilde as l goes to infinity. (What does weakly mean?) Weak convergence means that for any vector v you can think of, the inner product of x^{n_l} - x-tilde with v goes to zero as l goes to infinity. (So there exists a subsequence with a weak limit?) Yes, exactly; the difference goes to zero against every fixed vector. Since taking a component is a particular case of taking an inner product, x^{n_l}_k converges to x-tilde_k for every k. Also, when you apply a bounded operator to a weakly convergent sequence, it is still weakly convergent. And because of the relation between each iterate and the previous one, and because x^{n+1} - x^n goes to zero in norm, x^{n_l + 1} also converges weakly to x-tilde. So it follows that x-tilde satisfies

x-tilde_k = S_gamma( x-tilde_k + (A*y)_k - (A*A x-tilde)_k ):

the weak limit satisfies the fixed-point equation, the minimizing equation for the functional. Now, suppose the minimizer is unique (you can make a more complicated argument if it is not), which is for instance the case if the kernel of A is trivial; even if the kernel of A is not trivial, it is highly unlikely, because of the l1 term, that the minimizer is non-unique, since you would need two very sparse minimizers differing by a kernel element. It is possible, but in most cases the minimizer will be unique. If the minimizer is unique, then every subsequence of the iterates has a weakly convergent sub-subsequence, necessarily with that same limit, and you can parlay that into showing that the whole sequence must converge: if part of the sequence stayed away from the limit, you could extract from it a weakly convergent subsequence with a different limit, and you would have a contradiction. So we have convergence in the infinite-dimensional case. OK, I am going to get a little bit of water; I will be back in thirty seconds.

So we have found that the x^n converge, weakly, to the limit, because they cannot converge to anything else, and that limit is a minimizer; in fact you can prove that they converge strongly to the minimizer too, but the argument escapes me right now, so let me not try to reconstruct it. So we have an iterative soft-thresholding algorithm, the one above, and it is called ISTA, the iterative soft-thresholding algorithm. It turns out that you can speed up this algorithm using a trick from optimization theory invented by Nesterov that I still have not really understood. What it amounts to is that you put in a parameter, a momentum coefficient mu_n, and you play with it. I can understand the derivation and how he shows that it goes faster, and I have not the slightest idea of what is really going on. I asked him, and he said, but you can see that it works; I said yes, but what is going on underneath? Either it is so obvious to him that he could not tell me any other way, or... So that algorithm is called FISTA, the fast iterative soft-thresholding algorithm, and there are a number of things that people have tried to make these methods even faster, with varying success.
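For what it is worth, here is a minimal sketch of the FISTA variant in the standard published form, usually attributed to Beck and Teboulle; the code is mine, under the same assumption ||A|| < 1 as before, and, as discussed, why the momentum helps is another matter:

```python
import numpy as np

def fista(A, y, gamma, n_iter=500):
    """FISTA: the ISTA step applied at an extrapolated point z, plus momentum."""
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(n_iter):
        v = z + A.T @ (y - A @ z)                                 # gradient step at z
        x_new = np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)   # S_gamma
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0          # momentum parameter
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)             # extrapolation
        x, t = x_new, t_new
    return x
```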
But as I said, engineers like to work with something in which you circumvent the l1 norm by putting in weights. So what we really want to look at is this same functional, but what we will look at instead introduces an a, a w, and epsilons; all of these will become sequences. I start with ||Ax - y||^2; then I have ||x - a||^2 - ||A(x - a)||^2 as before; and then, in place of the l1 term, a weighted quadratic penalty, gamma times the sum over k of w_k (x_k^2 + epsilon^2) + 1/w_k. This w has nothing to do with the window functions anymore; it is just a sequence of weights. (Is there a gamma in front? Two gamma?) There is probably a gamma here somewhere, maybe 2 gamma, maybe just a gamma; let us look at it as we go. And there is an epsilon in it, because you had an epsilon before; epsilon will change from step to step. Epsilon is a sequence; there is no n inside the functional itself, the n's come in through the iteration.

What we are going to do is optimize in turns: x^{n+1} is the argmin in x of G(x; x^n, w^n, epsilon_n), where w^n is a sequence of weights and epsilon_n is just a number for the nth step (I will assume the epsilon_n are non-increasing); and w^n is the argmin over w of G(x^n; w, epsilon_n). That last one is easy to see immediately, because the equation we get for w_k is (x^n_k)^2 + epsilon_n^2 - 1/w_k^2 = 0, so w^n_k = 1 / sqrt( (x^n_k)^2 + epsilon_n^2 ). It is still convenient for me to think of w as something I vary in its own right, and that is why there is no 2 in the penalty: you see that in the limit each of these terms contributes one copy of the corresponding term of the l1 norm.

So let us write the whole chain of inequalities again, for G(x^{n+1}; x^{n+1}, w^{n+1}, epsilon_{n+1}): take the same steps as before; the extra quadratic terms we introduced do not matter. What I have is ||A x^{n+1} - y||^2 plus 2 gamma times the sum over k of ( (x^{n+1}_k)^2 + epsilon_{n+1}^2 )^{1/2}; since my iterates are bounded, if I take the epsilon_n non-increasing and going to zero, the difference between this and F(x^{n+1}) goes to zero. Then you look at all the minimizing steps: G(x^{n+1}; x^{n+1}, w^{n+1}, epsilon_{n+1}) is less than what you get putting in epsilon_n, and so on; for the right order of doing these things, look at the paper. What happens (and I am getting to the end of my stamina, it has been a long time) is that you make a very similar concatenation argument, and the whole thing is really predicated on the fact that you can find an epsilon sequence that is decreasing, goes to zero, and for which all of this works. You cannot keep epsilon_n at a nonzero value up to the limit, because then you are solving the smoothed problem; but epsilon is also not allowed to go to zero too fast, because you are really dividing by these w^n, and if one of the components hits zero by accident too early, you are in trouble and your proof will start failing. On the other hand, it has to go to zero for the whole thing to work. The thing that ultimately makes it work is that you set epsilon_{n+1} to be the minimum of epsilon_n and the maximum of two quantities: an alpha_{n+1}, where alpha is some number smaller than one that you pick, and something that depends on the iterates and goes to zero. So it is only if you have already explicitly converged that epsilon becomes zero; and on the other hand, since both quantities go to zero, epsilon has to go to zero. That is what ultimately makes the proof work.
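A schematic version of the alternating scheme in Python; again, this is my own sketch: the x-update below is the exact minimizer of the quadratic surrogate when ||A|| < 1 (in practice one would do a few conjugate gradient steps instead), and the epsilon update is one plausible reading of the rule just described, not the precise schedule of the Voronin-Daubechies paper:

```python
import numpy as np

def irls_l1(A, y, gamma, n_iter=200, eps0=1.0, alpha=0.9, beta=1.0):
    """Iteratively reweighted least squares for min ||Ax-y||^2 + 2*gamma*||x||_1."""
    x = np.zeros(A.shape[1])
    eps = eps0
    for n in range(1, n_iter + 1):
        w = 1.0 / np.sqrt(x**2 + eps**2)    # w^n_k from the w-minimization
        v = x + A.T @ (y - A @ x)           # Landweber step from the surrogate terms
        x_new = v / (1.0 + gamma * w)       # componentwise quadratic minimizer
        # eps non-increasing, forced toward 0, tied to the progress of the iterates
        # (a guessed schedule; alpha and beta are hypothetical tuning parameters):
        eps = min(eps, max(alpha**n, beta * np.linalg.norm(x_new - x)))
        x = x_new
    return x
```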
So the statement is that, by defining the x^{n+1} and the epsilon_n in this correct way, you converge to the minimizer. Theorem; well, actually, I did not write the statement for the previous version either, although in the paper there are theorems, proofs and so on; it is a math paper. Theorem: if F(x) has a unique minimizer, then these iterates converge to that minimizer; in fact it can be formulated even if that is not the case. (What about the norm of A?) You can scale the problem down until the norm is less than one, but of course you have to do that before running the algorithm. (And the epsilon update?) Yes, thank you: it is the minimum of epsilon_{n-1} and the maximum of the two quantities; you have to put the minimum in because I need the sequence to be non-increasing. You also have to worry about the fact that epsilon_n might get stuck, but in fact you can prove, as part of the proof, that it never stays stuck: it can get stuck a number of times, but because the iterate-dependent quantity goes to zero and alpha^n goes to zero, it cannot stay stuck. (And alpha?) Alpha is some number between zero and one, picked arbitrarily.

(Why do you prefer this to the previous algorithm?) Because this one turns out to be much, much easier to program: it is quadratic. In principle I would have to compute the argmin at every step, but for that people use conjugate gradient, and they do a couple of steps and stop; and even if you do that, without computing the full argmin, you can prove that the resulting scheme still converges. (That is similar to what you described an hour ago?) Yes, that is exactly what I described an hour ago; this is it in mathematical form rather than hand-waving. So you have an algorithm in which each step is purely quadratic: you do a couple of conjugate gradient steps, go to the next iteration, and it still converges, and it converges faster than many other proposed methods; engineers typically have very good code optimized for quadratic problems, and they just plug it in. People had been doing iteratively reweighted least squares (this is called IRLS) for a long time, and usually you do not really hit problems, things do not become zero; but depending on the problem you are looking at, if you really know there are sparse solutions, you will hit zeros, and if you hit a zero by accident too early, your algorithm gets into trouble. So people had been putting in epsilons more or less at random, and it was not known that you could choose the epsilons in such a way that convergence was not hurt.

(Apart from the plume problem you mentioned, what are the applications of this, just to have an idea?) Oh, for many problems people work with wavelets because they believe they need the high resolution that wavelets give you at fine scales, but only in certain places, and you do not know where ahead of time. With hindsight (wavelets were successful for image analysis in the late 80s and early 90s), I believe now that this was a first instance of sparse expansions. The paradigm for a long time had been that you want to expand into bases because it
makes your life easy, and you typically expand in a basis that is given to you; very often you have to work harder and harder the further you go in the basis (Hermite polynomials, say), and at some point you are exhausted, you stop, and that is your approximation. Things like wavelet bases gave rise to situations where you can expand into very many coefficients, of which only a few are going to be useful, but you do not know which ones ahead of time. That is what we call nonlinear approximation: the set of vectors with just a few nonzero coefficients is not a linear space anymore, so you approximate out of a nonlinear set, as in our compressed sensing sparse expansions. With wavelets we had enough understanding of the problem, and the progress in harmonic analysis had provided the powerful theorems needed, to show that this expansion was indeed going to be useful: we had, in a sense, guessed the right building blocks for sparse expansions. We now know that there are many situations in which a sparse representation is useful but we do not know the building blocks, and that is what many people are presently doing in learning: trying to learn the right building blocks for sparse expansions. For most of the interesting problems we are still at the stage where we do not even have a good formulation of those building blocks; we have built algorithms that ultimately do some approximation, but we do not understand those algorithms well enough. I think what is going on is that we are building very efficient approximations for problems, and we have to optimize for them. So there are situations for which we have learned to build sparse building blocks beyond wavelets that are useful, and people have done that in compressed sensing and other sparse computations; and then there are, more recently, situations where we have complicated algorithms that do very well and we have to start studying them. That is a different matter, but I think people in computer science who do deep learning should really work with physicists in trying to understand their networks (I was talking with people last week): they do not have the right impetus of saying, I now have a complex object that does good things; what is it, what are the inner workings of that object, how could I stimulate it? They can give it special stimuli and try to see what in the network is doing the work, but their networks are way too complicated, and they do not yet know how to probe and analyze them. But that is yet another stage. So my take is this: these decompositions into sparse expansions are really useful when you know the building blocks. For any problem people use wavelets for, you really want to work with sparse expansions, so these l1-penalized algorithms are really useful; there are other situations where we have constructed building blocks which are not wavelets but which we know need to be sparse, so again these l1-penalized algorithms will be useful; and then there are situations where we have not yet identified the building blocks, but once we have, they will be useful again.
(Is the basis important here? If I am in a Hilbert space, the first, quadratic part does not depend on the basis.) It is the l1 part: the l1 norm is taken on the coefficients in that basis, and that makes it really basis-dependent. You have to be in that basis, so you have to identify that basis first, for exactly that reason.

(Compared to the FISTA algorithm, which solves exactly the same problem: there the guarantee is, more or less, that the value of the criterion decreases to the minimum like one over the square of the iteration count. Do you have comparable results?) Well, it is typically one over n; but once you start doing this with conjugate gradient steps we do not have theoretical rates anymore, because we do not wait for convergence; we just do one or two conjugate gradient steps and then go further. Also, what people do in practice is play with the exponent in the weights: once you believe you are close to the right thing, you change the exponent so that the penalty behaves like an lp norm with p below one, which promotes sparsity even more. You are then no longer convex, but you hope you are already in the right basin of attraction, and of course that speeds things up tremendously. (But just comparing the cost per iteration?) You can apply the Nesterov trick to this as well, to get the same kind of speed-up; and it works for the same weird reasons every time. Do you understand this Nesterov business, really? There must be a reason: if things work, that is my credo, they work for a reason, and it is the mathematician's challenge to find the reason. OK, I am sorry I became a little less limpid at the end.

(One more question: you like the l1 norm, but what about, say, the 11/10 norm or something like it, which would be more convex; is it a bad norm for some reason?) For engineers, anything that is not 2 is something they like less; and as soon as you make p bigger than one, it converges to sparse answers more slowly. Actually, you would like to take p less than one to get faster concentration onto the sparse solution, but since that is not convex you may have local minima that are not global minima, and in practice you have many local minima. So what people do is work with l1 until things really start creeping along; l1 gets to a point where you feel like you are watching grass grow, and when you get there you say, OK, I believe there is nothing serious in the neighborhood except the true minimum, and then you crank down your norm. (If you used the 11/10 norm, you would not be doing thresholding?) It would not be thresholding; you get a funny kind of other map instead. It is continuous; you get smaller coefficients, it is not as bad, but they would not become exactly zero. More questions?