Let me share my screen again. Good, I hope you can see this; please let me know if you don't. As I said, the next chapter is on auto-tuning, that is, the automatic tuning of the Hamiltonian Monte Carlo algorithm that I presented before, but also of Monte Carlo algorithms in general. As before, it is all about drawing independent models in a Monte Carlo simulation, and that really is the challenge. Why is it important to draw models, or samples, that are independent? First, it is important for efficient model space or null-space exploration, because independence means that you are drawing models that are actually different from each other. If you have a Monte Carlo algorithm that draws models that are dependent, then the models you draw successively more or less always look the same, and that, of course, is not what you want. But independence is also important for the convergence of any Monte Carlo integral. If you want to solve a Monte Carlo integral, for instance to compute a mean, a standard deviation, or a low-dimensional marginal, then one can show that the convergence error of that integral is proportional to one divided by the square root of the number of independent samples. So what matters is not how many samples you draw, but how many of them are actually statistically independent.

Let's look at an example. We try to sample a Gaussian. That is not a very interesting function, but when you look at a 1,000-dimensional Gaussian, it starts to become interesting again. We equip this Gaussian with a covariance matrix that is diagonal. That is also in principle boring, but the diagonal entries vary by about one order of magnitude, from 0.1 to 1.1. Initially, we say that the mass matrix of our null-space shuttle, the variable M, is equal to the identity matrix, which is the simplest thing we can do. The question then is: how many of the samples that we draw with this Hamiltonian Monte Carlo algorithm are actually independent of each other? Let's see how we can measure this. We draw many, many samples, say about 2,000, and we quantify the dependence, or independence, of those samples using the autocorrelation of the sample chain. We take one model parameter, say m1, the first model parameter in our 1,000-dimensional model space, and we compute the normalized autocorrelation of the sequence of samples that we are drawing; it is visualized in this plot in black. You see that this autocorrelation function always stays relatively close to 1. So for model parameter m1, even after drawing 1,000 or even 2,000 samples, the autocorrelation remains high. This means we draw many, many samples, but in the direction of m1, that one model parameter, they actually always look more or less the same; you learn nothing new, even though you draw many samples. For model parameter m1000, the last one, this looks very different. Its autocorrelation function, plotted here in red, rapidly drops from 1 to nearly 0, and it stays near 0 until the end of the simulation.
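To make this diagnostic concrete, here is a minimal Python sketch of the normalized autocorrelation of a single-parameter sample chain. This is illustrative code, not the code behind the lecture's plots; the AR(1) test chain is my own stand-in for a strongly correlated sampler.

```python
import numpy as np

def autocorrelation(chain):
    """Normalized autocorrelation of a 1-D chain of samples."""
    x = chain - np.mean(chain)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # keep lags 0..N-1
    return acf / acf[0]                                  # normalize so acf[0] = 1

# A strongly correlated chain (an AR(1) process) decorrelates slowly,
# qualitatively like model parameter m1 in the black curve of the plot.
rng = np.random.default_rng(0)
x = np.zeros(2000)
for i in range(1, 2000):
    x[i] = 0.99 * x[i - 1] + rng.normal()
print(autocorrelation(x)[:5])  # values stay close to 1 for small lags
```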
This means that for m1000, that one model parameter, you very quickly draw samples that are independent, so you actually see different aspects of model space, which is what you want. We can quantify this a little more by computing the effective number of independent samples. There is a standard equation for this, which I will not derive here: the effective number of samples equals the total number of samples that you actually draw, divided by 1 plus 2 times the sum of the autocorrelation function, which is what you see in the plot. From this, we can compute an effective sample fraction: the effective number of samples divided by the total number of samples. What you see is that for model parameter m1000, this effective sample fraction is about 0.3, which means that in order to get a sample that is independent from the samples you have already seen, you need to draw about three samples. But for model parameter m1, which has this wiggly autocorrelation function in black, the effective sample fraction is only about 0.0047. This means that before you see a model that is different from the ones you have seen before, you have to draw about 200 samples. So you have to draw very many samples before you actually discover a new aspect of model space, and this, of course, makes the algorithm very inefficient.

We can understand this in pictures. What you see here, looking at the red curve, is a projection of a Hamiltonian trajectory, a trajectory of the null-space shuttle if you will, onto the (m1, m1000) plane; it is a two-dimensional projection through the 1,000-dimensional misfit surface. You see that if we choose the mass matrix M to be the unit matrix, the null-space shuttle oscillates through model space along the red curve. In the m1000 direction, the shuttle makes very rapid progress, so you very quickly see a different corner of model space. But in the m1 direction, the progress is very slow: you move along the Hamiltonian trajectory, but in the m1 direction you are not covering much distance. This changes significantly when you change the mass matrix. If you set the mass matrix equal to the inverse covariance matrix of the Gaussian, you find the trajectory shown here in blue, and the blue trajectory indeed makes equally rapid progress in all directions, in the m1 direction as well as in the m1000 direction. One can in fact show that the ideal mass matrix for such a Hamiltonian Monte Carlo algorithm is the Hessian of the potential energy, which, for the special case of a Gaussian, equals the inverse posterior covariance matrix.

This is all nice and good, but the problem is, of course, that the Hessian of the misfit functional cannot be computed or stored explicitly: for all the relevant problems we are interested in, it is simply too expensive to compute and too large to store on any computer. And this brings us to a trick, a strategy to circumvent the problem, based on quasi-Newton methods. The basic idea is to not pre-compute the Hessian, which, as I said, is totally out of reach, but to approximate the Hessian of the potential energy, of the misfit, on the fly, that is, as the sampler is running.
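As a brief aside before we get to this on-the-fly approximation: the effective-sample-size formula quoted above is easy to implement. The sketch below is illustrative and uses a common practical truncation of the autocorrelation sum, which is my choice, not something specified in the lecture.

```python
import numpy as np

def effective_sample_size(chain):
    """N_eff = N / (1 + 2 * sum of autocorrelations), with the sum
    truncated at the first non-positive autocorrelation."""
    x = chain - np.mean(chain)
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf = acf / acf[0]
    tau = 0.0
    for rho in acf[1:]:
        if rho <= 0.0:  # truncate where the autocorrelation first vanishes
            break
        tau += rho
    return len(chain) / (1.0 + 2.0 * tau)

# The effective sample fraction from the lecture is then simply
# effective_sample_size(chain) / len(chain).
```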
So here is how this goes. We use the last couple of samples to approximate the Hessian times an arbitrary vector, which is all we need for the algorithm. This is very closely related to the well-known L-BFGS method from nonlinear optimization, a classical quasi-Newton method. We then use the approximate Hessian as the mass matrix for the computation of Hamiltonian trajectories. Here is the concept in pictures. We start at some model m0. This gives us a first approximation of the Hessian, which we call H0 and which is simply the identity matrix, and we say that this is the mass matrix. Then we progress to the next sample, m1, and we form the first real approximation of the Hessian, H1, based on m0 and m1. As we progress, we get better and better approximations of the Hessian using more and more models. But after some time, we throw away the oldest model: we throw away m0, which was here, and we use a Hessian approximation based on m1, m2, m3, and m4. As we progress further, we use an approximation based on m2, m3, m4, and m5, and so on and so forth. Like this, we get successively better approximations of the Hessian, which we then use as the mass matrix of the Hamiltonian Monte Carlo sampler.

Let's return to the 1,000-dimensional Gaussian, again with a diagonal covariance matrix. We have seen this autocorrelation plot before: it shows the autocorrelation of the sample chain for model parameters m1 in black and m1000 in red. Just to remind you, we had an effective sample fraction of 0.3 for m1000, which is quite OK, meaning that you draw three samples and then you actually see a different part of model space. But for m1 this was very poor: we had to draw about 200 samples before seeing something different from what we had seen before. That is without auto-tuning. If we turn the auto-tuning on, that is, if we approximate the Hessian on the fly using the L-BFGS method, then this autocorrelation plot changes dramatically. Now not only m1000 decorrelates very quickly, but also m1. Very rapidly, for both model parameters, you start seeing different aspects of model space. Our effective sample fractions are now approximately equal, 0.17 and 0.14, meaning that you draw roughly five to seven samples, which is a very small number, and you already see corners of model space that you have not explored before.

Here comes a somewhat more challenging example. This is a two-dimensional projection of the 1,000-dimensional Styblinski-Tang function, an analytical function that has 4 to the power of 500 local minima; so it's a very nasty one. If we do not use auto-tuning, we again have the problem that one of the model parameters, m1000, decorrelates quickly, so you rapidly draw independent samples, but for the other model parameter, m1, the autocorrelation remains very high, meaning that you can draw as many samples as you like and they are still dependent. If we turn the auto-tuning on, then again, as we wish, both model parameters decorrelate quickly and we have a much more efficient sampler. Also note that when the auto-tuning is turned off, so that model parameter m1 is highly correlated, the sampler actually misses this local minimum, which the sampler with auto-tuning accurately recovers. So it improves in many respects. Now, this was a toy problem.
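To make the quasi-Newton machinery concrete, here is a sketch of the standard L-BFGS two-loop recursion, which applies an approximate inverse Hessian to a vector using only a few stored model and gradient differences. This is the textbook recursion, not the lecture's actual implementation; the variant that applies the Hessian itself, as needed for the mass matrix, is analogous.

```python
import numpy as np

def lbfgs_apply_inverse_hessian(v, s_list, y_list):
    """Apply the L-BFGS approximation of the inverse Hessian to a vector v,
    using stored pairs s_k = m_{k+1} - m_k and y_k = g_{k+1} - g_k
    (oldest first). Assumes at least one pair is stored."""
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    q = np.array(v, dtype=float)
    alphas = []
    # First loop: newest pair first.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q -= alpha * y
    # Initial guess H_0 = gamma * I, a standard scaling heuristic.
    gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    r = gamma * q
    # Second loop: oldest pair first.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r  # approximately H^{-1} v
```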
I want to show you one more example that is still a toy problem, but goes more in the direction of an actual geophysical imaging problem: a one-dimensional viscoelastic full-waveform inversion, so one space dimension. The setup is as follows. We use a 1-D viscoelastic wave equation, shown here in the frequency domain: omega squared, where omega is frequency, times the displacement field, the wave field, plus velocity squared, where velocity is the complex phase velocity of the waves, complex because we have attenuation in the system, times the second space derivative of the wave field; and that is equal to some external forcing f. There are two physical model parameters that we are interested in. The first one is the real part of the logarithmic velocity that you see here: we take the real part of the velocity divided by a reference velocity, and we take the logarithm of this. So this is the logarithmic velocity. Then we do the same for the imaginary part of the velocity: we divide it by a reference imaginary velocity and take the logarithm, and this is the logarithmic attenuation. Why do we use logarithmic velocity and attenuation? There are two main reasons. The first is that probability densities on the logarithmic quantities do not change when we switch from velocity to one over velocity, that is, to slowness. And second, when we use logarithmic quantities, it is actually legitimate to use a Gaussian as a prior probability density. If we were to use the attenuation or the velocity itself, we would not be allowed to use a Gaussian, because a Gaussian also admits negative values, but we know that both velocity and attenuation are positive physical quantities. Auto-tuning for this example is really essential, because the sensitivities with respect to the logarithmic velocity and the logarithmic attenuation vary over many orders of magnitude.

The physical setup is as follows. We are in one space dimension; you see distance in kilometers on the x-axis, and on the y-axis you see logarithmic velocity on top and logarithmic attenuation, the imaginary part, below. The true model that we plug in is shown in black, and some initial model is shown in red. The data in one dimension look as follows. What you see in the top plot is a wave field at a relatively low frequency of 0.02 Hz. The artificial observed wave field, which we contaminated with noise, is shown in black, for a source located here at the position of the star; the initial synthetic wave field, which is noise-free by design, is shown in red. If we go to higher frequencies, you see that the artificial observed and the initial synthetic wave fields are really very significantly different; there are cycle skips and so on and so forth. This is just to give an impression of the frequency-domain wave fields that we are using.

Using this auto-tuned Hamiltonian Monte Carlo, you can then compute many statistical quantities, for example, one-dimensional marginal distributions for all of the model parameters. The number of model parameters is 2,000: 1,000 logarithmic velocities on the left and 1,000 logarithmic attenuations on the right. What is particularly interesting is that when you look at this region, for example, you very clearly see that the posterior velocity distribution we recover is bimodal. So you in fact have two families of plausible models based on those data; one is up here and the other one is down here.
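For reference, here is a reconstruction of the governing equation and the logarithmic parametrization as I described them; the sign convention and the reference-value symbols are my assumptions, since they were only given verbally.

```latex
% 1-D viscoelastic wave equation in the frequency domain (reconstructed
% from the verbal description; sign convention assumed):
\omega^2 u(x,\omega) + v^2(x,\omega)\,\partial_x^2 u(x,\omega) = f(x,\omega),
% with complex phase velocity v. The two logarithmic model parameters,
% with assumed reference values v_0^{\mathrm{re}} and v_0^{\mathrm{im}}:
m_{\mathrm{re}}(x) = \ln\frac{\operatorname{Re} v(x)}{v_0^{\mathrm{re}}},
\qquad
m_{\mathrm{im}}(x) = \ln\frac{\operatorname{Im} v(x)}{v_0^{\mathrm{im}}}.
```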
With this, you very nicely recover the non-uniqueness of the inverse problem, which results, of course, from the nonlinear dependence of the observations on the model parameters, but also from the contamination with noise and the sparse data coverage. Important here, too, is that the effective sample sizes, when we do use the auto-tuning, are above 10,000 for all of the 2,000 model parameters; the total number of samples is one million. Had we not used auto-tuning, the effective sample size would have been less than 100 for some of the model parameters, meaning that you would definitely under-sample the posterior distribution.

So here are some conclusions for this chapter. As motivation, I have shown you that the convergence of any Monte Carlo algorithm critically depends on the number of independent samples, not on the total number of samples. In Hamiltonian Monte Carlo, the independence of the samples can be steered through the choice of the mass matrix. The ideal mass matrix is the Hessian of the misfit, or of the potential energy, that is, the second derivative of the misfit function. I argued that this cannot be computed or stored explicitly: it would be too expensive to compute because there are too many second derivatives, and for all the interesting problems we typically deal with, it also cannot be stored on a computer. So we have to find an approximate solution, and that approximate solution is to use an approximation of the local Hessian that comes from the quasi-Newton methods we know from nonlinear optimization, and then to use this approximate Hessian as the mass matrix of the Hamiltonian Monte Carlo sampler. There are many quasi-Newton methods one could use; the one we have chosen is L-BFGS, which, as you have seen in one of the illustrations, uses a small number of previous misfit and gradient evaluations to come up with a local Hessian approximation. In the numerical examples, the toy examples with the Gaussian and the Styblinski-Tang function, I have shown you that the effective sample size can quite often increase by orders of magnitude. In simple words, this means that the auto-tuning strategy increases the efficiency, the convergence, of the Monte Carlo sampler by orders of magnitude, which implies that with auto-tuning you can solve problems that you would otherwise just not be able to solve. I have also shown you a still simplistic application to a one-dimensional viscoelastic full-waveform inversion, where my argument was that this auto-tuning is absolutely essential in order to recover the posterior distributions of both velocity and attenuation. We have also seen that with this auto-tuned Hamiltonian Monte Carlo you can quite nicely recover the multimodality of the solution, so you get an impression of the non-uniqueness of the inverse problem, that is, of the existence of solutions that are very different from each other but still explain your observations equally well. And what is interesting from a geophysical perspective, though I have not shown it on this specific slide, is that even though we run a full-waveform inversion and do things properly, I think there is little hope to actually constrain attenuation in the Earth; the information content of the data is simply too weak.
So this brings me to the last chapter, which has nothing to do with Monte Carlo sampling at all, but with the solution of the forward and the adjoint problem, that is, with the solution of forward and adjoint wave propagation through the Earth. It is a method that we somewhat jokingly call SmoothieSEM, for reasons that you will discover within the next couple of minutes. It is really all about accelerating wave propagation, and doing so through the use of wavefield-adapted meshes. Here comes the basic idea. To illustrate it, we first look at a regular finite-element mesh, which is what you see here. We look at a domain that is 600 by 600 kilometers, and we have a source, for example an explosion or an earthquake, sitting here, and a receiver over there. In a regular finite-element or spectral-element mesh, you would use a certain number of elements per minimum wavelength, uniformly, as in this case here, and this number of elements per minimum wavelength is needed to ensure reasonable numerical accuracy.

Now, in the figure to the right, you can see a wavefield snapshot for the mesh that you see to the left. In the background, in gray scale, you see the seismic velocities, the P velocity and the S velocity, and in colors you see the wavefront of the S wave, this one, and the wavefront of the P wave, traveling through this relatively smooth medium. Interestingly, the wavelength that I mentioned up here, which controls the grid spacing, the size of the elements, is actually an anisotropic quantity: wavelength is anisotropic. And you see this quite easily. Parallel to the propagation direction of the wave, which here is this direction, the wave field varies quite rapidly. But if you look in this direction, orthogonal to it, the wave field varies very slowly. So in the orthogonal direction the wavelength is very large, whereas along the propagation direction the wavelength is very short. There is no such thing as the one wavelength; it is an anisotropic quantity that depends on the direction in which you are looking.

And we can use this to our advantage. Instead of using the regular mesh that we see up here, we use a mesh that is adapted to this anisotropy of the wave field. What you see, in fact, is that the grid spacing in the propagation direction is much finer than the grid spacing orthogonal to the propagation direction, where, as we have seen above, the wavelength is a lot longer. We then use the numerical mesh that you see at the bottom left and compute our wave field, which leads to the figure you see here. With the naked eye, you see no difference between the wave field on top, for the regular mesh, and the wave field below, for the wavefield-adapted mesh. So the mesh complexity is adapted to the wave field. In this specific example, the wavefield-adapted mesh at the bottom uses about eight times fewer elements, which means that the computational cost for the wave field down here is about eight times lower, so almost an order of magnitude lower than for the wave field on top.
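To give a feeling for the numbers, here is a back-of-the-envelope sketch of the elements-per-wavelength rule and of how the anisotropy of the wavelength translates into element size. All numerical values here, including the factor between the two directions, are purely illustrative choices of mine, not values from the lecture.

```python
# Illustrative only: element sizes from the elements-per-wavelength rule.
v_min = 3.0           # km/s, assumed minimum S velocity
f_max = 0.1           # Hz, assumed maximum frequency
n_per_wavelength = 2  # assumed number of elements per minimum wavelength

lambda_min = v_min / f_max                  # shortest wavelength: 30 km
h_parallel = lambda_min / n_per_wavelength  # element size along propagation: 15 km

# Orthogonal to the propagation direction the field varies slowly, so the
# apparent wavelength, and with it the element size, can be much larger;
# the factor of 10 below is purely illustrative.
h_orthogonal = 10.0 * h_parallel            # 150 km
print(h_parallel, h_orthogonal)
```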
This can be extended to three dimensions, and that is what you see here in a toy example. To the left, there is a global-scale mesh, as one would use in traditional global-scale spectral-element simulations of seismic wave propagation. This mesh is good for wave propagation at a long period of 100 seconds, and it has around 400,000 elements. The mesh that you see to the right is wavefield-adapted: it is much coarser in the azimuthal direction here. It produces wave fields that are almost indistinguishable from the wave fields for the traditional mesh, but it has less than 10% of the elements. So here, the computational expense of computing a wave field on this mesh is more than 10 times lower than for the mesh to the left. One can push this a little further and even adapt those meshes to the data coverage that we have; this is what you see here. We have an event up here and stations in East Asia. One could then, for example, cut out those pieces of the mesh where we are not interested in the wave propagation anyway, and thereby additionally reduce the number of elements and the computational cost.

Now, so far so good, but there is an additional difficulty, or an apparent difficulty. What we are interested in is computing sensitivity kernels, that is, derivatives of our misfit, and that turns out to be apparently non-trivial, but in the end, fortunately, very easy. I want to illustrate this with a simple two-dimensional setup. We look at the source-receiver geometry here to the left: all the yellow stars are sources, all the red triangles are receivers. And we have an input model that you see to the right, the shear modulus distributed through this two-dimensional domain; this is what we want to recover, starting from a homogeneous model. When we want to compute sensitivity kernels, we start with a forward wave field that radiates outwards from the source towards the receivers; this is what it looks like for a regular mesh. Then, at all the receivers, we feed in the adjoint sources, which radiate the residuals backwards in time into the medium; this is a snapshot of the adjoint wave field. The product of the forward and adjoint wave fields, integrated over time, produces the sensitivity kernel that we see down here, and the sensitivity kernel tells us how to modify the current model so as to reduce the misfit. So much for the regular mesh. For a wavefield-adapted mesh, this is apparently a little more difficult. We still have a source here in the center, and the mesh follows the geometry of its wave field. But the adjoint wave field looks pretty weird, and that is because at the locations of the adjoint sources, that is, at the locations of the receivers, the mesh is way too coarse. We have a refinement around the actual physical source, but at the locations of the adjoint sources, which are spread all over the medium, the grid is very coarse, especially in the azimuthal direction, and therefore cannot accurately represent such a point source; and this is not a physical source, it is just an adjoint source. From this figure, one might get the impression that the adjoint field is wrong and therefore most likely produces an incorrect sensitivity kernel. But if we just compute that sensitivity kernel, we see that it looks very similar to the sensitivity kernel that we have seen before, even though the adjoint wave field looks wrong.
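As a brief aside, the kernel computation just described has the following schematic form; the operator D is parameter-specific (for the shear modulus it involves gradients of both fields) and the notation is mine, not the lecture's. The puzzle is why this integral still comes out correctly when the adjoint field is computed on a mesh that seems far too coarse for it.

```latex
% Schematic sensitivity kernel: the misfit derivative with respect to a
% model parameter m at position x is the time-integrated interaction of
% the forward field u and the adjoint field u^\dagger,
K(x) \;=\; \frac{\delta\chi}{\delta m(x)}
      \;=\; \int_0^T u^\dagger(x,t)\,\mathcal{D}\,u(x,t)\,\mathrm{d}t ,
% where \mathcal{D} is a parameter-specific (assumed) differential operator.
```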
So what's going on? How can this be? The solution is very simple. If we look at the forward problem in already discretized form, then we have an operator L, just a matrix, acting on a discrete forward wave field u, and this is equal to a discrete forward source f; so L u = f. Then we apply the discrete adjoint method: we take this discrete forward problem and ask what the discrete adjoint problem looks like, and it is very simple. The Hermitian transpose of L, which is the discrete adjoint operator, acts on the discrete adjoint wave field, and on the right-hand side we have the discrete adjoint source. We can simply solve this problem, which is what we did in the figure before, and this, by construction, produces the correct sensitivity kernel. So the discrete adjoint field, when plotted, may look a little strange, but that really just does not matter, because the adjoint wave field is not a physical quantity that needs to look nice, that needs to look like a physical wave field. It just has to be the correct solution of the discrete adjoint equation. And so it is the correct adjoint wave field, even though visually we do not like it, and it produces the correct sensitivity kernel. This is essentially why the method works.

With this, we can solve a 2-D full-waveform inversion problem using the toy setup that you have seen before. Just to remind you of the source-receiver geometry and of the input model: using the regular spectral-element mesh that you see up here, we recover the model that you see to the lower left, and using the wavefield-adapted meshes, we recover the model that you see to the lower right. Obviously, they are very, very similar to each other. And here the beauty is that for the regular mesh, because it has so many elements, we needed 66.6 CPU hours to solve this problem, where solving means finding a model that leads to a misfit reduction of more than 95%. If we use the wavefield-adapted meshes, we achieve a 95% misfit reduction using only 6.2 CPU hours. This means that with this more clever algorithm, with the wavefield-adapted meshes, we have reduced the computational cost by more than one order of magnitude, which of course is very significant: it can make the difference between being able and not being able to solve a certain problem.

Now, the application of this to an actual global-scale full-waveform inversion is work in progress, but I still want to show you what we have at the moment. Our main goal is to construct a long-period reference model of the Earth that includes data from all earthquakes for which we have reliable CMT solutions, that is, with a magnitude of more than 5.5, and we want it to properly account for wave propagation through 3-D heterogeneous media, for topography, oceans, and crustal variations, but also for finite-frequency sensitivity. The setup is as follows. We have a preliminary data set of 1,200 earthquakes, and it is continuously expanding. We have about 3,000 to 4,000 broadband recordings per earthquake, with a minimum period of 80 seconds; meanwhile, we have actually gone down to 60 seconds. We use complete waveforms, that is, really all three components and all the body and surface waves. The initial model is PREM, and we again use an L-BFGS method to solve this optimization problem.
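Before the results, one more aside: the discrete-adjoint argument from above can be condensed into a few lines of generic code. This is a sketch with a stand-in sparse operator and assumed receiver positions, not the actual SmoothieSEM implementation.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
# A stand-in discretized wave operator L: a non-symmetric, diagonally
# dominant tridiagonal matrix. (The real L comes from the spectral-element
# discretization; this one is only for illustration.)
L = sp.diags([-1.2 * np.ones(n - 1), 2.1 * np.ones(n), -0.8 * np.ones(n - 1)],
             [-1, 0, 1], format="csc")

f = np.zeros(n)
f[n // 2] = 1.0                      # discrete forward source
u = spla.spsolve(L, f)               # forward problem: L u = f

# Misfit chi = 0.5 * sum of squared residuals at a few receiver indices
# (receiver positions and "observed" data are assumed for this sketch).
receivers = [20, 90, 160]
d = np.zeros(n)                      # stand-in observed data
f_adj = np.zeros(n)
f_adj[receivers] = (u - d)[receivers]   # discrete adjoint source

# Discrete adjoint problem: L^H u_dagger = f_adj.
u_dagger = spla.spsolve(L.conj().T.tocsc(), f_adj)

# Any gradient assembled from u and u_dagger is now exact for the
# *discrete* problem, however unphysical u_dagger may look when plotted.
```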
Here are some preliminary results. What you see is SH velocity and SV velocity, the propagation speeds of horizontally polarized and vertically polarized S waves, at depths of 150 kilometers, 1,000 kilometers, and 2,500 kilometers. It is a whole-Earth model that contains a lot of detail. I do not really want to go into an interpretation here; it looks very similar to models that have been produced before using much more data and much larger computational cost. And the computational cost is actually my main message here; this is what I want to show. If we had tried to compute this whole-Earth model using the traditional finite-element meshes that I showed before, we would have required about 1.3 to 1.4 million GPU hours on our national supercomputer. Colleagues in Princeton, now in Colorado, computed a global full-waveform inversion model in 2016 that has slightly lower resolution, and that required 320,000 GPU hours. Of course, to some extent this is a comparison of apples and oranges, but what is important here is the order of magnitude. Other colleagues in Princeton in 2022 computed a global full-waveform inversion model with a similar resolution, and they used 1.1 million GPU hours on their supercomputer. Now, for the model that you have just seen, we used about 15,000 GPU hours. So there are two orders of magnitude in computational cost between the traditional methods that you see up here and the full-waveform inversion with the wavefield-adapted meshes that we have developed and are currently using.

Some conclusions for this chapter. We have seen that those wavefield-adapted meshes are designed to be in accord with prior knowledge of the wave field geometry, and that the mesh complexity, that is, the number of azimuthal elements, can be adapted to the complexity of the medium. If your medium is very smooth, you may get away with a very small number of azimuthal elements; if your medium is very complicated and produces a lot of scattering, then the number of azimuthal elements of course needs to be increased. The method works very well for smooth, that is, tomographic, models, and this is what we typically have at global scale. And, as we have seen, it can be incorporated into full-waveform inversion schemes. Of course, there is more bookkeeping to be done, because you need a different mesh for every source. Effectively, I think it is fair to say that we reduce the dimensionality of the problem by around one: we effectively reduce a three-dimensional problem to a two-dimensional problem, and as a consequence, the computational cost is a lot lower. And we have seen this preliminary global-scale application. We are going towards long-period 3-D reference models that account for wave propagation through 3-D heterogeneous media, but also for topography, oceans, crustal variations, and so on and so forth. Our ambitious long-term goal is to come up with such a reference model that includes all earthquakes with magnitude above 5.5. So far, using 1,200 events, the results are, I think, very encouraging. The model makes a lot of sense geologically, and it compares well to other models, while the computational requirements are one to two orders of magnitude lower compared to similar studies that have been done in the past. Now this brings me to the end.
The last few slides I want to show contain some literature that goes into more detail. As I said at the beginning, this presentation, this lecture, was really meant to whet your appetite for more detail, and those details can be found in the papers listed here. But what I would really like to bring to your attention is a more extensive book, called Lecture Notes on Inverse Theory. We uploaded this a few months ago to the preprint server of Cambridge University Press, which is called Cambridge Open Engage; you can see a snapshot of the website here to the right. There you can find this book, which contains really all the basics of geophysical inverse theory, including some of the content that I have just shown, and it is complemented by lots of interactive code examples that you can run and where you can learn more. So really, just Google Cambridge Open Engage, go to the Earth science part of it, and you will find those lecture notes to play with. We also have some other resources, for example a YouTube channel where you can see some of the content that I have shown explained in little video abstracts, and we have a software library in our group: if you go to www.swp.ethz.ch, you will find our software, much of which is actually educational and covers some of the content that I have shown here. And with that, I would like to thank you very much for your attention. I would be happy to discuss more. Thanks a lot.