Well, thank you, Alec, and thank you to the organisers, to Karim and Adriana and everyone, for accommodating me. I made it here at last. I come to you from Canberra, Australia; it's evening time here, and I think it's early morning where most of you are. I hope my internet is going to be as accommodating as the organisers of this workshop. Okay, so I will share my screen and get going — I will attempt to share my screen, and hopefully everyone can see that; I'll assume that if I don't get interrupted, you can.

So the title of my first two lectures here — the original title was "Some New Trends and Some Old Lessons in Geophysical Inversion", and an alternate title is "A Focus on Emerging Directions in Geophysical Inversion". Now, I know we have a mixture of audience members, so I want to give an introduction to inversion that puts us all on the same page, and then talk about a few areas which I think are interesting — a personal view — and which I think have potential for important applications in the next decade or so. But as a famous physicist said, prediction is difficult, especially about the future.

There is a reviewed chapter associated with this: there's going to be a publication associated with this workshop and with this talk, with the same title, "Emerging Directions in Geophysical Inversion", co-authored between myself and Andrew Valentine at Durham University. There's a draft of that available, under review, and the topics covered in it are listed at the bottom here: aspects of sparsity-constrained inversion, optimal transport, ensemble methods — a whole series of things which we think are emerging — and there's a summary in there. It's really the areas in red that I want to talk about today: sparsity, optimal transport, generative methods and surrogate modelling as part of machine learning, and also physics-informed neural networks — in particular how they influence geophysical inversion and how they are already being used in geophysical inversion. And I think it's an exciting time to be alive, essentially.

So I want to start in a very basic manner and introduce some concepts which hopefully will be familiar to many people. Data, data everywhere. It was estimated, more than a decade ago now, by Gantz in 2010, that the global data collection rate from all sources was increasing at 58% per annum, equal to 1.2 times 10 to the 21 bytes per annum — more than the estimated number of stars in the universe. An enormous growth, and we've seen that in many areas of our discipline; there are some examples there of how data sets have grown. These huge volumes of data present challenges for data custodianship, but they also lead to new ways of drawing inferences about the real world, about the physical world around us. So as we change the style of data, we change the way we go about making inferences.

But here's the basic paradigm that I've worked on for many years now, and many people will be familiar with it: classifying forward and inverse problems. The forward problem is a deductive process: we take some assumption about the earth — it might be, in my case, the seismic structure of the earth or the properties of an earthquake, or it might be anything, a weather system — and simulation of the physical phenomena is the forward problem, which is where we go from the left to the right.
Typically it may be anything from simple algebraic equations all the way to solutions of complex ordinary or partial differential equations. But in particular they make predictions — predictions that can be compared to observations. They make predictions of seismic wave speed as a function of position in the earth, or the weather, or simply the ages of rocks for geochemists and geochronologists, and those are compared to observations. The corresponding inverse problem is to go backwards: to take observations and try to constrain properties — or even the whole earth, but usually properties — and that goes by names like parameter fitting, data assimilation and inference. It is an inductive process, because we are trying to go backwards to find out what model, what representation, led to the observations we have. Some would argue forward problems are easier to solve than inverse problems; some might argue the other way around. But I always come back to Samuel Karlin, 1983, the famous statistician, who said: the purpose of models is not to fit the data, but to sharpen the question. That has always stayed with me, because when we solve an inverse problem we are at best finding some approximation to the thing we don't know about — it is not the earth, it is always some limited, inadequate approximation — and the purpose of finding such models is then to sharpen the questions. I think that's a good way of viewing all of inversion: as asking questions.

Now, to ask a question, you need to know what you're asking the question about, and that typically means discretizing your unknowns in some way, which often leads to parameter estimation. To do that we need some sort of basis functions, and I've represented that here simply as a sum of known basis functions phi_j times constants m_j; we would be looking for the constants, assuming the basis functions are imposed. Here are some examples: locally supported basis functions, which are limited in space — the representation is sometimes an average of the property you're interested in, a localized average in some way — or globally supported basis functions, an example being the Fourier components of a signal. And there are things that go in between, things that have both local support and finite frequency, which is where wavelets come in, or curvelets, et cetera. So we make inferences about a continuous function by using some sort of basis functions, and everything we then learn is through the lens of the basis functions we have chosen, if we're looking at a discrete problem. We have to recognize that. They're often chosen to suit some phenomenon, to suit the physics of the problem, or for convenience — or, hopefully, to be something meaningful for the question we want to ask of the data.

Okay, so: classes of inverse problem. Typically the simplest is a linear system of equations, d = Gm, where d is our data, m is our model, and G is some, perhaps high-dimensional, matrix which is known and constant — that makes a linear problem. Solving linear systems of equations is a pastime of many of us, and doing it efficiently when problems are large is often a matter of research, but there are many, many ways of going about it. Nonlinear and discrete is when we don't have a linear problem: we simply have a function of some model parameters, d = g(m), and we're trying to find the parameters given d and given g — with G the matrix above, or some function g as below. A sketch of these two discrete classes follows below.
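To make those two classes concrete, here is a minimal sketch in Python. The matrix G, the toy nonlinear operator g, the sizes, and the noise level are all illustrative assumptions, not anything from the talk:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Linear and discrete: d = G m, with G known and constant.
G = rng.normal(size=(20, 5))                   # toy 20-data, 5-parameter problem
m_true = rng.normal(size=5)
d = G @ m_true + 0.01 * rng.normal(size=20)    # noisy synthetic data
m_linear, *_ = np.linalg.lstsq(G, d, rcond=None)   # classic least squares solve

# Nonlinear and discrete: d = g(m), solved by minimising a misfit phi(m).
def g(m):
    return np.sin(G @ m)                       # a toy nonlinear forward operator

d_nonlinear = np.sin(G @ m_true)               # synthetic data for the nonlinear case

def phi(m):
    return np.sum((d_nonlinear - g(m)) ** 2)   # L2-norm data misfit

m_nonlinear = minimize(phi, x0=np.zeros(5)).x  # iterative optimisation
```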
Beyond nonlinear and discrete, we also have linearized and discrete, which is how we typically solve a nonlinear problem: by perturbing the data and perturbing the model. That gives a linear system of equations for changes to some reference model, given differences between observations and predictions — the delta d. Much less studied, but also important, are linear and nonlinear continuous problems, as represented by my two equations below there. Because we have big computers these days, we like to discretize problems, and we usually cast the more complex continuous problems into the simpler discrete ones above.

Okay. Now, there are two common classes of approach to inversion, the left-hand side representing what we might call model building: trying to find some set of unknowns that satisfies constraints, trying to fit data. As you see on the left-hand side here, I've got a simple least squares, or L2-norm, data misfit, plus possibly some sort of regularization. The key distinction between the left and the right is that on the left we typically try to solve an optimization problem, seeking some optimal model in some sense. On the right is a different class of problem, where we are not trying to find a model per se, but some function over the model parameters — a probabilistic interpretation of the problem. This is essentially the well-known Bayes' rule, where we have a distribution, represented here as p(m | d), which is proportional to a likelihood — the probability of the data given the model — times the prior probability of the model. It's a very popular way, increasingly so in recent years I think, of attacking inverse problems, and we usually end up resorting to sampling. So rather than one model, on the left, we might look for an ensemble, on the right, distributed approximately according to the way the data support the model: high concentrations of samples are where the model is well constrained, and we can use the spread, or other properties, of the ensemble to try to understand uncertainty and to ask questions.

Okay, so here's my discrete linear inverse problem — classic least squares fitting of Newton's laws of motion, a trivial example. In this case it's an over-determined least squares problem; we can solve these, we've known how to solve them for many, many years, and we become more adventurous by solving bigger problems. This is a useful starting point for many cases. It's not going to work in all problems, because even linear problems can be underdetermined. But it's a classic problem, just put there as a reference.

Okay, but we can quickly move into problems of interest where they are nonlinear. I'm going to hopefully show you a model here — here's a calculation I did a couple of years ago. On the top left, hopefully you can see a panel of four images, where this is simply a one-dimensional seismic model of the crust and the upper mantle — mainly the crust and lithosphere — that is, a shear wave speed as a function of depth, in layers; very, very simple. From that we can make a prediction of this blue curve here, which is what we call a seismic receiver function. It's a product of seismograms that constrains near-surface structure. The blue is predicted; the gray here is my synthetic 'observed' (in inverted commas) receiver function, with noise.
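Before the movie, here is the kind of one-parameter misfit scan it animates, as a minimal sketch. `predict_receiver_function` is a hypothetical stand-in for a real receiver-function forward solver; only the scanning logic matters here:

```python
import numpy as np

def predict_receiver_function(v):
    # Hypothetical stand-in for a real receiver-function forward solver:
    # maps a one-layer shear wave speed to a synthetic waveform.
    t = np.linspace(0.0, 30.0, 300)
    return np.exp(-t / 10.0) * np.sin(v * t)

rng = np.random.default_rng(1)
observed = predict_receiver_function(3.5) + 0.05 * rng.normal(size=300)  # 'observed', with noise

velocities = np.linspace(2.5, 4.5, 101)        # scan the wave speed in one layer
misfits = [np.sum((observed - predict_receiver_function(v)) ** 2)  # sum of squares of
           for v in velocities]                # the blue difference curve

v_best = velocities[int(np.argmin(misfits))]   # the red dot at the minimum
```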
Now, what I'm going to do is play a little movie, and you'll see that the blue curve down here is just the difference between what we might call the predicted (blue) curve and the 'observed' (gray) curve. We would like that blue difference to be as small in amplitude as possible; then the fit is best. And on the left-hand side I'm plotting essentially the sum of the squares of the blue. Now what I'm going to do is change the velocity model — I'm changing the seismic wave speed in one layer. If you do that, you see the changes in the predicted receiver function; the differences in blue change a bit. So it looks quite complicated. But actually, if you look at the misfit, it's nicely almost quadratic: there's a well-defined minimum, and the red dot — the place we're at at any one time — goes down. When the red dot hits the minimum, the fit between the blue and the gray is as good as possible, and it's at an optimum. So this problem is actually quite simple to solve if we're looking for seismic velocities, and I'm only varying one velocity here.

OK, but below is exactly the same problem, exactly the same setup, except that what I'm going to do is not vary the velocity, but vary the structure of the model, by moving the red dot up and down and redefining the layers as we move through. So if I click on here, hopefully you can see the movie. Now the velocities are not changing; the red dot is moving through the model, changing the model structurally, with different thicknesses of layers, et cetera. And this type of change is much more complicated. The waveforms jump around; the blue goes up and down. If you look at the misfit as a function of that one parameter, the red dot goes up and down; it hits a global minimum around 20 kilometers, which is where the optimal model is, but you see lots of local minima. So this is a nonlinear problem, and it creates a lot of difficulties. But it's actually the same problem — it's just that in one case we vary one type of parameter, and in the other we vary a different class of parameter. So problems can have this mixed character. On the right-hand side you see the same thing as on the left, but as a function of the two layer thicknesses, and you see these enormous mountains of insignificance behind, which can make optimization very difficult; the white area in the foreground is where the global minimum is. So dealing with this class of problem, as we vary two, three, four, or as many unknowns as we want, becomes exceedingly complex. The character of the misfit landscape is a function also of how we measure the fit to the data. So this is what I mean by highly nonlinear problems.

Now, having paused just for a moment on my introduction to inverse problems — we have to deal with linear and nonlinear problems coming in all shapes and sizes — I want to concentrate on a very particular feature called sparsity. Sparsity means emptiness. Unlike what I just described up here: if we have a linear problem which is non-unique, then on the left-hand side the linear misfit function is nicely quadratic, and everything is wonderful — except that the left-hand figure indicates it is non-unique, and we have a region where the fit to the data, given by this misfit surface, is a valley, and it is constant along the valley.
And along the line across the minimum there, there is no unique solution — we have underdetermination. So in these problems, if we're trying to build a single model, or find some preferred model as I described at the beginning, we need to introduce some sort of regularization, some sort of preference, expressed in terms of how we go about minimizing some function to get an optimal model in some sense.

Now, the idea of looking for sparse solutions — and by sparse solutions I mean solutions which are encouraged to have as many of their unknowns zero, or close to zero, as possible. That may seem a little strange. But it turns out that sparse regularization is cheaply given by the term on the right-hand side here — the left-hand term is a data misfit and the right-hand term is a model norm, what we call an L1 norm. Taking an L1 norm of the model, and trying to minimize the combination of fit to data and L1 norm, is a simple way of encouraging a sparse model. The idea of the L1 norm in geophysical inversion dates back at least to the 1970s, when Claerbout and Muir used it, essentially on the data side of the equation, to find robust data misfits; and John Scales in the late 90s used it as a regularization term in seismic tomography, to find robust solutions for seismic imaging. So L1 regularization, on the right there, I'm claiming encourages sparse solutions in underdetermined problems, indicating that we prefer the model components to be zero — and we'll see why that might be a good thing in some cases. But it doesn't guarantee it. Unfortunately, if you wanted to guarantee a truly sparse solution, you should put the L0 norm in there. The L0 norm is essentially a count of the number of non-zero components; the L1 norm is just the sum of the absolute values. The L0 norm says we prefer the fewest number of non-zero components. Unfortunately, if we use the L0 norm we can't solve the problem — it's too hard, extremely difficult, especially as the number of unknowns increases. So L1 is a good compromise, to get tractable, practical solutions to these problems.

Now, a fascinating idea, which we'll talk about in a moment, comes up when — figuratively — all solutions to the inverse problem are in the green, that is, satisfying the data, and all sparse models are in the blue. And there's a particular case, as I've stated above with that definition: when our sensor basis is incoherent with our model basis, and the model is sparse — two conditions I'll mention again in a minute — those are the conditions of the well-known field called compressive sensing. What that means, in layman's language, in simplistic terms, is essentially that every measurement constrains as many of the unknowns as possible — that's what incoherence gives you — and that the number of unknowns that are non-zero is small. If you have that situation, then even though there's an infinite number of models that fit the data, because the problem is underdetermined, and there is always a very large number of ways models can be sparse, it turns out the intersection of these two sets is very small. And you can actually solve such problems almost exactly — to a high degree of probability — when the right conditions hold.
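As a concrete, minimal sketch of sparsity-promoting (L1) regularization, here is one standard solver among many — iterative soft thresholding (ISTA). The toy G and d, and the trade-off value lam, are illustrative assumptions:

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of the L1 norm: shrink values towards zero
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(G, d, lam, n_iter=500):
    # minimise (1/2)*||d - G m||_2^2 + lam * ||m||_1 by iterative soft thresholding
    m = np.zeros(G.shape[1])
    step = 1.0 / np.linalg.norm(G, 2) ** 2     # step size from the spectral norm of G
    for _ in range(n_iter):
        grad = G.T @ (G @ m - d)               # gradient of the data misfit term
        m = soft_threshold(m - step * grad, step * lam)
    return m

rng = np.random.default_rng(0)
G = rng.normal(size=(30, 100))                 # underdetermined: 30 data, 100 unknowns
m_true = np.zeros(100)
m_true[rng.choice(100, 5, replace=False)] = 1.0    # only 5 non-zero components
d = G @ m_true
m_sparse = ista(G, d, lam=0.1)                 # most recovered entries are (near) zero
```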
Now, I want to jump to where this comes from, because the underlying theory here is due to Candès, Romberg and Tao, and to Donoho; and Felix Herrmann was one of the first people in our neck of the woods to jump into this idea of compressive sensing. Here's a simple example. More formally, the data are measured in what we might call a sensor basis — that's the type of data we collect. If we're measuring a time signal, as in the top right here, the red dots are where we measure. Our sensor basis, because we're simply measuring the amplitude of the time signal, would be a delta function; we could instead measure a frequency, or an average over a period of time — those would be different sensor bases. And the model basis is how we represent the underlying model that is to be constrained by this data. If the two are incoherent, in the sense that each datum constrains many model parameters, and the model is sparse, it turns out that you can recover models with high precision. In this case I'm simply saying that my model basis is the Fourier components of the data — I'm just doing a simple regression, trying to recover the signal from the red dots. Now, if this condition holds, where the two bases are incoherent — and they are, if you use amplitudes of a signal and Fourier components — then, as many of these authors have shown, you can get near-exact solutions with relatively few random samples of the data. And that's the idea behind compressive sensing, in a nutshell.

And here's a simple example. I'm going to measure data, on the left, from some band-limited signal, and I'm going to try to recover the Fourier components of it, on the right. So on the right are my model parameters, on the left my data, and I'm only sampling at particular points. Now, if I use a sparsity constraint — p equals 1 there, that's the L1 norm; p equals 2 would be the least squares constraint — we can try to recover the Fourier components from the data using an L2 (p = 2) approach, or an L1, sparsity-based approach. I'm going to show you, very simply, two examples of that.

This is what happens if you try least squares on that problem. The blue is the original signal; the black is where I've sampled it; the green is the model recovered by finding a least squares solution, where I'm minimizing the L2 norm of the model while trying to fit the data as well. And you can see that we fit the data exactly — the green line passes through all the black data points; as you would expect, it's an underdetermined problem, so we can always do that. But between those points the green is a very poor approximation to the blue, true signal. We only know the blue signal at the black points, and we're trying to get the blue curve from the black points — and least squares will always give us something that looks like the green. On the top here we're looking at it in Fourier space; these are the recovered models, the recovered wave-number coefficients. You can see that even though the true model is sparse, as below — there are only 10 non-zero coefficients — at the top, when you use a least squares approach, or L2 regularization, you get power in essentially all of the wave numbers, because that's the solution that minimizes the L2 norm and fits the data. And you get poor amplitude recovery of the signal, and many non-zero coefficients. This is a classic example of a poor solution.
And this was pointed out by the previous authors: this type of problem is handled very poorly from an L2 perspective. But if you require sparsity — impose sparsity on the problem here, and this problem is incoherent, in that the way we sample the data, picking amplitudes, is incoherent with our model basis, the Fourier components — this is the type of answer you get. With — I've forgotten how many data I used there; I think about 10 non-zero coefficients, yes, 10, and I think about 100 or so data points. So there are 10 finite coefficients on the top right, which is what's recovered; on the bottom right is the truth in this case, and you see basically a near-exact recovery. This is using the same data; all we're doing is changing from a regularized solution which requires a minimum-length solution, L2, to one which requires a sparse solution, L1. And the solution is dramatically different, because these are the appropriate conditions for compressive sensing.

Okay, now, where would this be used? Well, as I mentioned before, from Gantz, the amount of data is growing; he also said that since 2007 we've been generating more bits of data a year than can be stored in all the world's storage devices. This idea has been taken up in a number of fields now, including in the Earth Sciences, in seismology, where we have enormous databases. The idea behind compressive sensing is that we could perhaps not collect all that data, but collect random samples of it, and then use compressive sensing either to reconstruct the data — and if the circumstances are right, we can do so with high accuracy — or to use the limited data directly and build the reconstruction into the inversion, as part of the analysis of the data. So it is a potential route to recording only the data you might need, and then using that in an application.

I'll give you a 2D example, as a form of encouragement here. Compressive sensing concepts have applications in image reconstruction, as papers I've seen by Watkin and others have shown. On the left-hand side is the famous painting by Escher; in the middle is 1% of the data of that image, sampled; and on the right we try to reconstruct it, using sparsity. It's not that great, is it? We can actually make out some of the structure of the waterfall here, but it's not great — and we've used 1% of the data. What's characteristic in these problems is that as you increase the number of data and try to reconstruct, there is a threshold, and as you pass that threshold — if we're in the right regime of incoherent bases and sparse solutions — we can recover the image near exactly. So here's the case for 10% of the data, and the recovery is pretty good: only the data in the middle is being used, and the reconstruction by compressive sensing is on the right — again, not perfect, but pretty good for 10% of the data. Now, this is not compression in the sense of adaptive compression: we don't look at the whole image and work out how to represent it with, say, wavelets. We simply have the data in the middle, impose sparsity within the basis that we've chosen, and try to reconstruct the image on the right.
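The 1D version of this experiment is easy to reproduce in sketch form. Here random time samples of a signal with a sparse spectrum are inverted once with the minimum-length (L2) solution and once with L1 sparsity via iterative soft thresholding; the sizes, the solver, and the trade-off value are illustrative assumptions, not the talk's exact setup:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 256
t, k = np.arange(N), np.arange(N // 2)
# Model basis: Fourier (cosine and sine) components of the signal.
F = np.hstack([np.cos(2 * np.pi * np.outer(t, k) / N),
               np.sin(2 * np.pi * np.outer(t, k) / N)])

c_true = np.zeros(N)
c_true[rng.choice(N, 10, replace=False)] = rng.normal(size=10)  # 10 non-zero coefficients
signal = F @ c_true

idx = rng.choice(N, 100, replace=False)        # sensor basis: ~100 random point samples
G, d = F[idx, :], signal[idx]

# Minimum-length (L2) solution: fits the samples but spreads power everywhere.
c_l2 = np.linalg.pinv(G) @ d

# L1 (sparse) solution by iterative soft thresholding.
def soft(x, s):
    return np.sign(x) * np.maximum(np.abs(x) - s, 0.0)

c_l1, step = np.zeros(N), 1.0 / np.linalg.norm(G, 2) ** 2
for _ in range(2000):
    c_l1 = soft(c_l1 - step * (G.T @ (G @ c_l1 - d)), step * 1e-3)
# With enough samples and a suitable trade-off, c_l1 is near-exact; c_l2 is not.
```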
So, as I mentioned before, sparsity has a long history in imaging, but the earth itself is not known to be sparse. So there's a problem: how can we use this in imaging and inversion when the earth isn't necessarily sparse — or perhaps we haven't found the right basis functions in which it could be sparse? There have been many attempts at using, for example, wavelet-type basis functions. I want to present what I think is an interesting direction here, by my PhD student at ANU, together with Andrew: she's looking at this concept of over-complete tomography.

Now, in this, if you look at the image in the bottom right there, we have a smooth model — it's a simple linear tomography problem — a smooth background model given by the gradient, and on top of that we've imposed a pixelized perturbation. Now, both the background smooth model and the localized pixel model are sparse — they're sparse in different bases. The theory of compressive sensing works in a single basis, and the idea of over-complete tomography is: can we apply the same idea to models that are non-sparse in a spatial sense, but are sparse in two different bases, i.e. a localized and a long-wavelength basis? Can we better recover features where we have local anomalies on a smooth background? That's the motivating example here.

And I'm going to show you some of the results in a movie. On the top here we're going to see results from the over-complete approach, in the over-complete box, and from simple least squares — by which I simply mean regularizing with a sparsity-promoting constraint on the left, and with a minimum-length, L2-norm constraint on the right — and we'll see what happens. On the left-hand side you see the ray density, and what we're going to see is a multi-faceted movie in a moment. Here's a typical over-complete solution: with only 10 rays in the problem, it performs poorly. The top is the model and the bottom is the difference between the model and the truth — it's a simple synthetic case — and in the two little boxes I'm showing you the projections of the over-complete solution: the discrete, pixel part and the long-wavelength part. And here, for exactly the same data, is the L2 solution. So in both cases things have not worked too well, but on the right-hand side you very much see the rays, which is what you get in an L2-norm sense, and on the left-hand side you see a smooth background with pixels.

Now what we're going to do is run the movie and look as the number of rays increases. It's a simple linear tomography, but the true model is not sparse in the combined basis — it's only sparse in the individual bases — and we're going to solve for the two at once. On the bottom left you're going to see, essentially, the misfit to the data as a function of the number of data, and we're going to increase the number of data and cycle through this, so everything's going to be moving at once. Let's have a look. Okay, so you're going to see recovered model accuracy — that's what it says, not fit to data — against the number of samples. You'll see a blue, a gray and a pink curve, and hopefully what you'll see, as the number of data increases, is the compressive sensing idea of hitting a cliff: the problem in the sparse basis becomes very accurate very quickly. We'll see that phenomenon. Okay, I'll just try and play it so you can see it.
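To pin down the formulation being animated, here is a minimal sketch of the over-complete idea: the model is expanded in two bases at once — local pixels plus a long-wavelength basis — and sparsity is encouraged on both coefficient sets. The matrices, sizes, solver, and single trade-off parameter are all illustrative assumptions, not the actual setup of the study described:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 64                                         # model cells, e.g. an 8 x 8 grid flattened
B_pix = np.eye(n)                              # local (pixel) basis
B_smooth = rng.normal(size=(n, 8))             # stand-in for 8 long-wavelength basis functions

A = rng.normal(size=(120, n))                  # stand-in for a ray-path sensitivity matrix
D = np.hstack([A @ B_pix, A @ B_smooth])       # over-complete dictionary: d = D [c_pix; c_smooth]

c_true = np.zeros(n + 8)
c_true[rng.choice(n, 3, replace=False)] = 1.0  # a few local anomalies...
c_true[n:] = 0.1 * rng.normal(size=8)          # ...on a smooth background
d = D @ c_true

def soft(x, s):
    return np.sign(x) * np.maximum(np.abs(x) - s, 0.0)

c, step = np.zeros(n + 8), 1.0 / np.linalg.norm(D, 2) ** 2
for _ in range(3000):                          # ISTA over both coefficient sets at once
    c = soft(c - step * (D.T @ (D @ c - d)), step * 1e-3)
# In practice separate trade-off parameters per basis would be tuned, as noted below.
model = B_pix @ c[:n] + B_smooth @ c[n:]       # recombine the two projections
```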
Okay, in my version of the movie we've gone over the cliff now. As we go over the cliff — as the number of data passes through that threshold — the over-complete model becomes near perfect almost instantaneously, and the least squares model slowly catches up. When the data overwhelm both, both become reasonably good; with large numbers of data, both will eventually get there. I'll just try and play that again — here we go again, so you can see it. I'll start the movie, and if you look at the middle you should see the fit to the true solution — and the error in the true solution, which is at the bottom — go to zero as it goes over the cliff, as it passes through the critical number of samples, which here is about 200 rays. See?

So that shows it's possible to extend the idea of sparsity-constrained inversion to an over-complete regime, and to recover models which are smooth and locally anomalous. We hope this will have applications in a range of settings where there's local structure on a smooth background — perhaps in a volcano, or in a local earthquake setting, or imaging a subduction zone, et cetera: anywhere where you have small features and large features — long-wavelength and short-wavelength features — separately represented, using the idea of sparsity to improve the solution for a fixed number of data. Okay, so I'll play that one more time, because it's a movie, for fun, and I'm coming to the end of my time. This is just a first go at this problem — we did not know whether this would work in an over-complete regime; the compressive sensing idea works in a single basis, and these are two bases, either of which can represent the model. It turns out that with a bit of effort and a bit of tweaking — choosing your trade-off parameters in your regularization, which is the same problem you have in all such cases — you can get over-complete problems to work. And there's nothing about the linearity in this that is important: you can apply this to nonlinear problems, or to other pairs of bases, or even more bases — triples or quadruples — but the problem gets bigger and bigger and more difficult to solve. Okay, I think I've come to the end of the first part of my lecture, and I will pause there, as we reach 43 minutes.

— Yes, we will have a 10-minute break. But Malcolm, are you leaving immediately, or are you still with us? Yeah, you are still with us. There is a question, just a very brief question, because the last slide was really exciting in terms of understanding how this influences the results of the inversion; meanwhile, what about the model if more