And with that, may I also welcome Andrew once more. Our speaker today is Andrew Stuart. Andrew is a professor of computing and mathematical sciences at Caltech, and I think it's fair to say, not just judging from his job title, that he likes to live on the boundaries between subjects. He obtained his PhD in 1986 here in Oxford, from what was then called the Computing Laboratory. Before moving to Caltech, he held faculty positions in both mathematics and engineering science at the universities of Warwick, Princeton and Bath. For his work he has received numerous prestigious awards on both sides of the Atlantic, and he was elected a Fellow of the Royal Society in 2020. Andrew is passionate about combining rigorous mathematical tools and models with the vast datasets of the present time. He applies this first-principles thinking to problems such as unfolding, or the analysis of stochastic systems. As we have seen in the last three talks in this series, machine learning provides very powerful methods to address these very challenges in the context of meteorology, and in this talk today Andrew will complete the picture by telling us how machine learning can help to solve some of the equations that take centre stage in climate models. Andrew, please go ahead.

Thank you, Philip, and thank you to both Philip and Peter for the invitation. It's great to have this opportunity to speak to you. I'm aware that this is potentially a fairly wide audience from different backgrounds, and I would very much welcome interruptions and questions as we go along, so that you can help me adjust the talk in a way that will make it more enjoyable for you.

So, the work I'm going to talk about: the title is there. I'm interested in using data to learn solution operators for partial differential equations, although, as I'll discuss at the end, there is potentially wider application of this methodology. The simple take-home idea is that solving partial differential equations is expensive, and it's something that we do many times. If we record the inputs and the outputs to the numerical solution of partial differential equations, we have the opportunity to replace the numerical solution of a partial differential equation with a learnt machine-learning model, which is cheaper and faster to evaluate. I want to discuss some of the theory that underpins that, and some applications of that way of thinking. The work is joint with many people, whom I will describe as we go through the talk.

The way I'm going to organise the talk is as follows. In the problem setting, I'll just describe the basic idea and the concept that I want to study in a number of different settings. Then I'm going to look at two ways of generalising the idea of supervised learning, and neural networks in particular, to the problem of learning mappings which take spaces of functions into spaces of functions. The first of these will be the kernel network approach; having developed it, and shown you a little bit of the concept and the theory, I'll describe applications of that methodology in fluid mechanics. Then I will look at a second method, based on model reduction, and I will describe an application of that in solid mechanics, to a problem in crystal plasticity. I'm happy to make the slides available, so you'll be able to pursue the references in more detail.

So let me try and describe the objective of this work in an abstract setting.
So there are many problems, as I mentioned, typically arising in partial differential equations, but much more generally than that, where one has an input-output map, which throughout the talk will be psi-dagger, taking a space of functions X into another space of functions Y. So the inputs and the outputs of this map are themselves functions, and that way of thinking about the problem is key to understanding what I will try and explain in this talk. The data is given to us in the form of pairs xn and yn, where the xn are inputs to psi-dagger, so the xn are functions, and the yn are outputs from psi-dagger, so they are also functions; the yn and xn are assumed to be related through the map psi-dagger. Just to fix ideas, and to put ourselves in the standard setting of supervised learning, I will assume that the xn are distributed according to a probability measure that I will call mu.

The objective is to try and discover psi-dagger from this data. We're given these pairs of functions; that's all we know about psi-dagger. What we're going to try and do is fit what I call an operator class: psi is a map which takes X to Y and is parameterised by theta, and our objective is to choose a particular distinguished parameter theta-star so that what we learn, using neural networks in the case of this talk, the map psi at the parameter theta-star, is a good approximation of psi-dagger, which is not necessarily known to us except through the data xn and yn.

So this is the standard setting of supervised learning. The only thing that's non-standard about it is that the inputs and the outputs are functions. The successes of machine learning that we've seen, especially over the last decade, have concerned cases where the input space X is Euclidean space, very high-dimensional, and the output space may itself also be Euclidean space or may be, for example, a set of classes of finite cardinality. A classic example is image classification, where the input space would be pixelated images, so a very high-dimensional vector space in Rn, and the output space will be classes classifying the images. In this case, the inputs and the outputs are functions, and the key idea that I want to get across is that effective methods in this area follow from designing the architecture, that is, the parametric dependence of the class that we try and fit, on function space (Banach space in this talk means spaces of functions), and only then discretizing. A different approach, and I will show you its drawbacks, would simply be to discretize the spaces of functions and apply standard neural network methodology. But I want to show you that that doesn't work so well, and I want to explain to you how methods that we have designed, which work directly on spaces of functions and are then discretized, are much more effective. That's the take-home message, but I want to substantiate it. But first, let me ask if there are any questions on the setting.
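To fix notation, the setting just described can be summarised as follows; the squared-norm empirical risk written here is one standard choice, and the talk later notes that other norms on the output space can be used.

\[
\psi^\dagger : X \to Y, \qquad \text{data } \{(x_n, y_n)\}_{n=1}^{N}, \quad x_n \sim \mu \ \text{i.i.d.}, \quad y_n = \psi^\dagger(x_n),
\]
\[
\theta^\star \;=\; \arg\min_{\theta}\; \frac{1}{N}\sum_{n=1}^{N} \big\| y_n - \psi(x_n;\theta) \big\|_{Y}^{2}
\;\;\approx\;\; \arg\min_{\theta}\; \mathbb{E}_{x\sim\mu}\, \big\| \psi^\dagger(x) - \psi(x;\theta) \big\|_{Y}^{2}.
\]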
Okay, so let me start with an example. I'm going to look at a partial differential equation from the study of porous medium flow. It's an elliptic partial differential equation, and it's made up of the following. We have a velocity field v, a permeability a, which characterises the porous medium in question, and the pressure u (it's actually the piezometric head, but you can think of it as being pressure). The equations of interest are mass conservation, which states that the divergence of the velocity is given by the sources and the sinks; the Darcy closure, a constitutive model relevant for porous medium flow, which says that the velocity field is proportional to the gradient of the pressure, the constant of proportionality being the permeability a; and then, for simplicity, I'm going to specify the pressure on the boundary of the domain D. If you substitute this Darcy model for v into the mass conservation equation, you get an elliptic differential equation for the pressure u, which depends on the permeability of the medium a. What I'm going to look at is trying to learn the mapping which takes the permeability, which is a function, a field over the domain D, and maps it into the pressure, which is also a function, a field over the domain D. And throughout this talk I will measure error in a relative sense: I will look at the difference between the truth, which we will have access to for the examples I look at, and the model that we train, in a relative sense.

Excuse me, do we assume uniqueness of the mapping from a to u?

Thank you, that's a really good question. Throughout this talk, yes; this equation has a uniqueness property provided a has sufficient regularity (it needs to be an L-infinity function), and then there will be a unique solution to this problem. There are interesting questions where one has multiple solutions, and that's something we haven't really tackled in detail, but it is a very natural future direction for this methodology. Thank you.
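Putting the pieces just described into symbols (the sign convention and the exact form of the boundary data are my assumptions; the talk only specifies that the pressure is given on the boundary):

\[
\nabla\cdot v = f \ \text{ in } D, \qquad v = -a\,\nabla u \ \text{ in } D, \qquad u = g \ \text{ on } \partial D
\qquad\Longrightarrow\qquad -\nabla\cdot\big(a\,\nabla u\big) = f \ \text{ in } D,
\]
\[
\text{so the map to learn is } \ \psi^\dagger : a \mapsto u, \qquad
\text{with relative error } \ \frac{\big\| u^{\dagger} - \psi(a;\theta) \big\|}{\big\| u^{\dagger} \big\|}.
\]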
Okay, what I'm going to show you here is what happens if you apply what I call the naive approach, which is to discretize everything in sight, so that you have high-dimensional input and output spaces, construct an architecture, and then look at how that architecture performs at different levels of resolution. What we have here is a measure of error against resolution, error on the vertical axis and resolution on the horizontal axis. This is a really nice piece of work by Zhu and Zabaras, which was developed at a particular resolution, the coarsest resolution down here of a 50 by 50 grid, and there they obtained a relative error of less than 2%. However, if you take the architecture that they designed on that particular mesh and use it on different meshes, you see that the error increases, and that's because the architecture has been tuned to a specific discretization and is not capturing the intrinsic physics. The methodology I want to propose in this talk tries to capture the intrinsic physics of the partial differential equation and not be entwined with any specific discretization. Using that philosophy, and a specific instance of the method, you can obtain a relative test error, again on the vertical axis, as a function of resolution, which is constant, meaning, for example, that it's possible to train the neural network at a relatively coarse resolution and use the same neural network, with the same choices of parameters, at a fine resolution and obtain the same error. There are, in fact, three curves here, and the parameter d is a parameter of the neural network which I will explain later in the talk. But the key idea which I want to get over is found by comparing this figure and this figure: this figure is one in which one discretizes the problem, learns an architecture and then tries to use it more widely. That behaves badly. Instead, I'm proposing that you conceive of the architecture without discretization, so you do that on the space of functions, and then look at the method at different resolutions; you find that you're able to transfer between different resolutions, and indeed between different discretization methods such as finite difference, finite element and spectral.

I'm sorry, I hope it's not too late to ask, but what exactly do you mean by discretization here?

Not too late to ask. In this particular example, this is a partial differential equation in space, and we have discretized the spatial variable by a finite difference method. So the resolution, the number 50 here for example, refers to computations done on a two-dimensional grid that is 50 by 50, and up here it's a two-dimensional grid that's 400 by 400. So the partial differential equation is being approximated at different levels of resolution.

Thank you. Thanks. Good question, thank you.

Okay. Sorry for a basic question, but is the neural network trained separately for each resolution?

It is here. In this case we've trained the neural network in this figure for each separate resolution and obtained the same error, but we have similar figures where we train at one resolution and use those fixed parameters at another resolution.

And in the previous case, where the error was getting worse?

Yes, in that case we trained it at every resolution, but we used the same architecture that had been proposed at the coarsest resolution.

Thanks very much. Thank you.

All right, that's the idea. So now there are going to be two specific instances of this idea: two neural network concepts mapping between spaces of functions, in the first case with applications to fluid mechanics and in the second case with applications to solid mechanics. This is joint work with a number of people, and I just want to tell you who they are. Zongyi Li is a PhD student at Caltech in computer science. Nick Kovachki is a PhD student in applied mathematics. Kamyar Azizzadenesheli is an assistant professor at Purdue University who was formerly a postdoc at Caltech. Burigede Liu is a postdoc in mechanical engineering at Caltech and is about to take up an assistant professorship at Cambridge in mechanical engineering. Kaushik Bhattacharya is a colleague in mechanical engineering, and Anima Anandkumar is a colleague who works in computer science. There are a number of papers here and, as I say, I'll make the slides available afterwards; you can get hold of them through Philip if you're interested, or perhaps they can go on the YouTube channel alongside the recording.

So let me describe the basic idea of generalizing, in this case, deep neural networks to the setting in which the inputs and the outputs are functions. First let me just describe how neural networks work as mappings between Rm and another Euclidean space. They typically have this structure here, and I just want to concentrate on the red part of the problem, because that's where the essential differences in infinite dimensions, mapping between spaces of functions, will show up. In this basic version of a deep neural network we start with a linear transformation of the input x, with parameters a matrix A0 and a vector b0 which need to be learned. Then we iterate this red map here, again with parameters a matrix Al, possibly of changing dimension, and a vector bl, where the first v0 is computed from this layer here, and sigma is a sigmoid function; the ReLU would be one example, but think of it as a monotonic function from R to R applied componentwise to this vector. Then one has maybe another linear layer at the end, and that defines the mapping psi, which takes the input x and computes the output through this iteration. The parameters theta are just made up of all of these matrices A and vectors b; they're learnt, and this is fit to data. All right, so that's a classical deep neural network. There's a lot known about this in terms of theory, approximation theory, and there is also abundant evidence that this provides a flexible and efficient method of approximating functions from data.
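In symbols, the finite-dimensional network just described is, up to details of indexing and of exactly where the nonlinearity is applied (this is my reconstruction of what is on the slide, not a quotation of it):

\[
v_0 = A_0 x + b_0, \qquad
v_\ell = \sigma\big(A_\ell v_{\ell-1} + b_\ell\big), \quad \ell = 1,\dots,L-1, \qquad
\psi(x;\theta) = A_L v_{L-1} + b_L,
\]
with \(\sigma\) applied componentwise and \(\theta = \{A_\ell, b_\ell\}_{\ell=0}^{L}\).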
So the generalization I'm going to describe here, which will form the first half of the talk: we're now looking at mapping functions, so the function space X, and I'll think of functions on a domain in Rn mapping into Rm. Every input will be such a function, and the output will be a function also defined on Rn, mapping into R. It's not necessary that the domain D is the same for the input and the output, but in this case I will assume that it is. The key difference in the architecture that I'm going to work with in this half of the talk is described by the red line. The input, the first layer, and the last layer in this particular example are just linear and applied pointwise, so they only involve learning a matrix A0 and a vector b0, and they are just applied pointwise to the function x, evaluated at every point z in the domain D. What's new about this architecture is the red layer here. It involves pointwise evaluation of v and modification through a linear matrix Al, and a shift through a vector bl that does not depend on z (that's a typo on the slide). But the key difference is that we also have an integral kernel operator: this is a non-local operator that maps the function at the (l-1)th layer, so this is the function v at the (l-1)th step (sorry, I'm having difficulty highlighting, let me just look where my cursor is); we apply a kernel to it and integrate. And now the parameters of this neural network that need to be learned are the A's and the b's as before (as I said, that should not be a b of z, that's just a vector b), but we also need to learn the kernel functions, that's these Kl's; we parameterise these too and learn them.
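Written out, the layer just described (a hedged transcription of the spoken description; the lifting and projection layers at the two ends are omitted) replaces the finite-dimensional update above by

\[
v_{\ell}(z) \;=\; \sigma\!\Big( A_\ell\, v_{\ell-1}(z) \;+\; b_\ell \;+\; \int_{D} K_\ell(z,z')\, v_{\ell-1}(z')\, \mathrm{d}z' \Big), \qquad z \in D,
\]

so the parameters are now \(\theta = \{A_\ell,\, b_\ell,\, K_\ell\}\), with each kernel \(K_\ell\) itself parameterised (for instance by a small neural network, or through its Fourier coefficients as described below) and learned from data.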
So this describes an architecture which is made up of both pointwise and non-local operators, and the key thing to appreciate about it (there's a lot of theory being developed right now in this area) is that there are universal approximation theorems for this architecture as a mapping between spaces of functions. There are also other methodologies, such as DeepONet, which is being developed by Karniadakis and co-workers at Brown University, and there is theory about that based on the paper of Chen and Chen, which was the first I'm aware of to look at approximating mappings between spaces of functions. There's also work, which I don't refer to here, coming out of ETH by Siddhartha Mishra and Samuel Lanthaler, which looks at convergence proofs for architectures that map between spaces of functions, and we ourselves have a paper that will be looking at that problem too. The basic point is that there are universal approximation theorems within the class of neural networks that I've just described to you.

However, what you might think is: okay, this is good, but what about the computational complexity? We know there are universal approximation theorems, but this kernel is non-local and in general will be dense, and as a consequence, if one thinks in terms of discretizing this kernel integral operator, one gets n-squared computational complexity, which potentially makes the method rather unattractive, both in terms of training and of using it as a predictor.

Sorry, could I ask a question about the previous slide, and the one before that as well? Is the idea essentially that we learn these weights and biases of the network so that it generalizes across any value, so that we can evaluate this function at any point in its space?

Yes, exactly. I think of it more as: the input points and the output points are themselves functions, and the way I would say it is that we try and choose the parameters theta, which are the weights and the biases and the kernels, so that we approximate the mapping that takes functions to functions, and we do that in a norm which respects the probability measure on the input functions, the one I called mu at the beginning. So the error is small with respect to a measure of accuracy that respects where the input functions come from, with respect to that probability measure. That provides one measure of error; but the specific theorem I mentioned involves continuous functions, and in that specific setting you can think in exactly the way you described, that the method will be pointwise accurate.

Okay, thanks.

Thanks. The broader question that you're asking translates into: what norm do you put on measuring the difference between psi-dagger and psi? There are different ways of doing that. Some of the theory I mentioned works in the uniform topology, which is what you were referring to, and some of it works in Sobolev spaces, so in an L2 sense with respect to derivatives of the functions in the relevant spaces.

Sorry, I would have a question: what's the role, or the interpretation, of your kernel and the integral operator in red in the previous slides?

Yeah. If this is going to work, it has to be non-local, and that's because the mappings that I'm interested in are non-local. If you take a particular point z and evaluate the input there, it does not determine the output at the same point z; you need to know the entire function, everywhere in the domain D, to determine the output at a given point z. So, if there's going to be a universal approximation theorem, you have to add non-locality into the picture. I'll do it in a different way in the second half of the talk, but in this part of the talk we're doing it through kernel integral operators, which are a very natural way of introducing non-locality; and in some very specific problems, such as if you think about Green's functions and how they work, you can clearly see the role of kernel integral operators in that setting. It then turns out, through the universal approximation theorem, that we can approximate many mappings using this formulation. The potential bottleneck, which I'm about to address, is the cost of doing this, so that's what I'm going to do next.

Regarding cost, there are a number of ways around the n-squared complexity of the kernel integral operator, but the one that will be presented in the examples in this talk is to work in Fourier space and to use translation-invariant kernels. Actually, if you like, this is very much related to convolutional neural networks, and to thinking about what they would mean in the case where the images used as inputs are in fact functions. The computational advantage is that the convolution can be computed in Fourier space, reducing the complexity from n-squared to n log n, and the methods I describe today will use that. We have other methods using graph methodology, and the links I give within the talk can be used to follow those ways of reducing the complexity as well; but for this talk, Fourier is enough, that's all you need to understand.
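As a concrete illustration of the Fourier route just described, here is a minimal sketch of one such layer in PyTorch. It is my own reconstruction for illustration, not the speakers' code: names like SpectralConv1d, channels and n_modes are assumptions, and a real implementation (for example the published Fourier neural operator code) adds lifting and projection layers, several stacked layers, and higher-dimensional transforms.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Translation-invariant kernel integral operator applied via the FFT (O(n log n))."""
    def __init__(self, channels: int, n_modes: int):
        super().__init__()
        self.n_modes = n_modes  # number of low Fourier modes retained; must be <= n_grid // 2 + 1
        scale = 1.0 / (channels * channels)
        # one learnable complex matrix per retained mode: the Fourier coefficients of the kernel K
        self.weights = nn.Parameter(
            scale * torch.randn(channels, channels, n_modes, dtype=torch.cfloat))

    def forward(self, v):                          # v: (batch, channels, n_grid)
        v_hat = torch.fft.rfft(v)                  # to Fourier space
        out_hat = torch.zeros_like(v_hat)
        out_hat[..., :self.n_modes] = torch.einsum(
            "bim,iom->bom", v_hat[..., :self.n_modes], self.weights)
        return torch.fft.irfft(out_hat, n=v.shape[-1])   # back to physical space

class FourierLayer(nn.Module):
    """One operator layer: pointwise affine part A_l v(z) + b_l plus the non-local part."""
    def __init__(self, channels: int, n_modes: int):
        super().__init__()
        self.local = nn.Conv1d(channels, channels, kernel_size=1)  # pointwise affine map
        self.nonlocal_part = SpectralConv1d(channels, n_modes)     # kernel integral operator

    def forward(self, v):
        return torch.relu(self.local(v) + self.nonlocal_part(v))
```

A layer like FourierLayer(channels=64, n_modes=16) acts on a batch of discretised functions of shape (batch, 64, n_grid); because the learned parameters live on Fourier modes rather than grid points, the same layer can be evaluated on coarser or finer grids, which is one way to see the resolution invariance discussed earlier.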
All right, let me show you an example, which comes from Burgers' equation. This is a time-dependent partial differential equation that has two characteristics: you have dissipation, which is preferentially stronger at short length scales, and you have an energy-conserving nonlinearity which, in Fourier space, transfers energy from low to high wave numbers. What we're interested in here is the solution operator that maps the initial condition to the solution at time one. So we're trying to find a map that replaces integrating the equation: it will be trained by data obtained from a numerical method, but once you have it you don't have to use the numerical method any more; you just evaluate psi, our approximation to psi-dagger, to replace the solution of the equation. For those of you not familiar with the equation, this is typical behaviour (this label should not be 'a', that should be u0, and this u should be the output at time one): on the left is an initial condition, on the right is the solution at time one. This equation forms a shock, which has a certain thickness related to the parameter nu, the viscosity appearing in the second-derivative term. There are lots of different methods compared here. I'm comparing, as I did earlier, the predictive error on the vertical axis against the resolution of the computation, that is, the number of points used to discretise space. Let's just pick two of these methods: the lowest one is the Fourier version of the neural networks that I described, and what you see here is fully connected neural networks, which is what you would get if you discretise and apply state-of-the-art neural networks on each finite-dimensional space. The key point, in comparing the blue curve and the curve here, is that the method we describe is invariant to resolution, because we are approximating something that is intrinsic to the physics of the equation, and the values of the error at all resolutions are the smallest amongst all of these methods, in particular in comparison with the standard fully connected neural network. If you come from numerical analysis, sometimes these curves are a bit confusing, because in the standard setting numerical analysts think of error improving with resolution; but remember that here there is a finite amount of data, and what's limiting the accuracy is that finite amount of data. What you're seeing with the method we've learnt here is that at every level of resolution we squeeze the maximum out of the data available and get the error, in this case, down to about 2%. Okay, any questions on that?

I have a question about training. Does the non-local term necessitate anything special to ensure convergence, or is vanilla SGD enough?

That's a really good question. In general, I like to try to separate the training from the things I'm concentrating on in this talk, which are the approximation capabilities and what you can do having trained. But it is fairly standard training. It very much depends on who is doing it; there are many parameters to be set when you use off-the-shelf software to train neural networks, which is what we're doing here, and there is definitely a degree to which the expertise of the person doing the training makes a difference. I wish neural networks weren't like that, and it is arguably one of the drawbacks, but it is a standard framework, and Zongyi, who is primarily responsible for much of the numerics you'll see in this half of the talk, is really excellent at getting the most out of standard methodology.

Yes, sorry, please. I was wondering, with this setup where you look at the Banach spaces on input and output, is it a possibility to also look at whole families of partial differential equations, or solution operators for nearby equations, at the same time? So instead of considering one partial differential equation, look at the approximation of the inverse, or the solution operator, also parameterised by the differential operator?

Absolutely. The first problem I described, the Darcy problem, was parametric dependence: there I was indeed looking at the parametric dependence on the coefficient and learning that mapping. Here, for Burgers' equation, I'm learning the mapping from the initial condition to the solution at a later time. And those two can be combined: you can look at the joint mapping from the initial condition crossed with some coefficients that define the operator. That's possible within this framework.

Very good, thank you.

Okay, so Burgers' equation was a warm-up; let me look at some more complex problems in fluid mechanics. I'm going to make a shift here to a different way of thinking about the accuracy of the method, because I'm going to start looking at chaotic partial differential equations, and I'm taking the perspective that what is interesting here is to ensure that the statistics of the map we learn are reproducible and reproduce the true statistics. That is recognising the fact that trajectory prediction is hard over long time horizons in chaotic systems. In numerical weather prediction, what I'm about to describe would not be so relevant; if you're doing climatology, where you're interested in predictions of statistics, this perspective is more relevant. So what I'm going to do is learn the solution operator over a time h but, having learned it, compose it with itself to make predictions about the statistics of the system, and compare those with the true statistics. I'm going to show you the Kuramoto-Sivashinsky equation; the S1 here just means periodic boundary conditions on a domain of length L. Kuramoto-Sivashinsky, for those of you who don't know it, is an equation that arises, for example, in the study of the fluid mechanics of flames. The mathematical properties of this equation are as follows: the linear operator is unstable at low wave numbers and dissipative at high wave numbers, increasingly dissipative at higher and higher wave numbers. So the linear operator generates energy at low wave numbers and removes it at high wave numbers, and the nonlinearity, as in the Burgers equation, conserves energy but transfers it from low to high wave numbers. So you have generation at low wave numbers, transfer through the nonlinearity, and removal of energy at high wave numbers: very much the picture that you have in incompressible fluid mechanics. The result of that is a chaotic partial differential equation. As I said, we're going to learn the operator that takes the solution at time t to the solution at time t plus h, for some fixed h, and then we're going to iterate that map to make predictions about long-term statistics.
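The composition idea is simple to state in code. Below is a minimal, model-agnostic sketch (my own illustration, not the speakers' implementation): psi_h stands for any learned time-h solution operator, for instance a trained Fourier neural operator, and the long-time statistics quoted in what follows (spectra, correlations) are computed from the rolled-out trajectory. For reference, the standard form of the Kuramoto-Sivashinsky equation consistent with the verbal description is u_t + u u_x + u_xx + u_xxxx = 0 with periodic boundary conditions, an assumption on my part since the slide itself is only described in words.

```python
import numpy as np

def rollout(psi_h, u0, n_steps):
    """Compose the learned time-h solution operator with itself n_steps times."""
    u, trajectory = u0, [u0]
    for _ in range(n_steps):
        u = psi_h(u)               # one application advances the state by time h
        trajectory.append(u)
    return np.stack(trajectory)    # shape: (n_steps + 1, n_gridpoints)

def time_averaged_spectrum(trajectory, burn_in=0):
    """Estimate a long-time statistic (here the energy spectrum) from the rollout."""
    u = trajectory[burn_in:]
    u_hat = np.fft.rfft(u, axis=-1)
    return (np.abs(u_hat) ** 2).mean(axis=0)   # average |u_hat_k|^2 over time
```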
So, commonly used methodology in this area are LSTMs or GRUs; these are both recurrent neural network architectures, developed to try and deal with memory effects. They are, in my opinion, rather unattractive as approximations for the problem I'm describing, because they introduce memory where the equation itself is Markovian: there is a map that takes the solution at time t to the solution at time t plus h, and there is no memory. In the LSTM and GRU, memory is introduced to help with the approximation, but it takes you into a class of approximations that don't fit with the equation. We take the Fourier neural operator approach: learn the map over a time interval of length h and iterate it. What you see here: the solution is being plotted as a function of space along the horizontal axis and time along the vertical axis, and you see the error of the LSTM, the GRU and our method; where the plot is black the error is small. What you're seeing in terms of the trajectory here is simply that we are able to approximate the trajectory accurately for a longer time. But I want to get into a little more detail and show you some of the statistics that come out of this.

Let's look at the left panel first of all. The blue and the red are two slightly different ways of training, and the dashed versus the solid is really what I want to emphasise. The dashed is learning the mapping over a time step h and looking at the error only on a time interval of length h, and what you see there is that, as a function of h, the error is very small at small h and grows as h grows; that reflects the fact that it's harder to learn the map over longer time intervals, because of the chaotic response of the system. Having learnt the map over any time interval h, we then compose the map with itself to make a prediction over a macroscopic time, and we then look at the error, in red or blue (slightly different measures of error); but the key point is the difference between the dashed and the solid: when we compose the time-h map with itself and make a prediction over a macroscopic time, we get the errors that you see with the solid curves, which are a relative error of less than 1%. So what the left-hand panel shows is the idea of learning the semigroup generated by this equation over a small time interval and then iterating it to make predictions over longer time intervals, and it shows that that works. That's the left-hand panel. The middle panel compares the ability to predict the spectrum with our method and with the LSTM and GRU, and the takeaway is that they all do pretty well; looking at the numbers, our method fits the spectrum better than the LSTM and GRU. The figure on the right is looking at the spatial correlation. Firstly, let me say that these are the very long-range correlations in this problem and they are very hard to get; none of the methods really nails the correlations in the middle here, but they do well over on the left, and again our Fourier neural operator approach beats the LSTM and the GRU. So the simple take-home message is that learning the solution operator on short intervals and then composing it to get predictions over long intervals works well, both in terms of the trajectory error at a macroscopic time and in terms of the statistics of the resulting dynamical system. Okay, any questions on that?

Just a quick question: what's the spectral gradient in the Fourier spectrum?

You mean what's the slope? I should know, and I don't, I'm sorry; next time I give the talk I'll know the answer to that. But the truth is in here, that's this line here: the truth as computed by a very high resolution spectral method in space, and we use exponential time-differencing, which uses the exact solution of the linear operator, so I think the truth is very accurate. I'll find out the answer, but I'm sorry, I don't know it.

Thanks. Thank you.

All right, now I'm going to look at very much the same setting that we had for the KS equation, but for Kolmogorov flow. These are the 2D incompressible Navier-Stokes equations on a torus, so periodic in space, with a particular forcing; Kolmogorov flow, for those of you who haven't seen it, refers to specific forcings made up of small numbers of Fourier modes, or eigenfunctions of the Stokes operator if you like. I'm going to do exactly the same as we did for the KS equation: I'm going to learn the solution operator and then compose the learnt operator with itself, to make predictions at macroscopic times and to look at statistics. We use the vorticity-stream function formulation. There's a lot of information here in a table which I won't go through in detail, but you can train on the stream function, the velocity or the vorticity, and we have tried all three; you can also use different norms to measure the error in the stream function, the velocity or the vorticity, and we have looked at different norms: the L2 norm, the H1 norm and the H2 norm. So we have a picture of the effect of using different variables to represent this flow, stream function, velocity or vorticity, and of the effect of using different error measures to do the training and the testing. We have tested Reynolds numbers of around 500, but those results are not fully complete for what I want to show in this talk, so everything here is with a Reynolds number of around 40. These are samples from the flow, so you have some idea of the complexity: it's relatively low-dimensional but chaotic. To give you a little more insight into the flow, here we have projected the flow: we do a PCA on the output and project onto the two energy-dominant modes, and what you see is that there is a ring on which the solution likes to live, but there's a lot of activity inside the ring. These are colour-coded by the amount of dissipation present in the solution at any given time, so that's the H1 norm squared, and what you see is that higher dissipation is associated with leaving this ring; so there's a typical solution on the ring, here in space on the left, and a typical solution inside the ring on the right. It's chaotic but relatively low-dimensional; I think that's the take-home message from this. And here we have compared our method, which here is called MNO (that's the Fourier neural operator, but then composed in a Markovian way to make macroscopic predictions), and we have trained it using both the L2 norm and the H2 norm, which is basically the L2 norm of the second derivatives, and we're comparing this with a U-Net, which is a standard method used in this field. The key point here is that, in terms of matching the spectrum, by using our methodology with the H2 norm, that is, matching the second derivatives, we get an almost exact fit with the true minus five-thirds spectrum. You can also look at the invariant measure projected onto dissipation, that's the squared norm of the gradient of the velocity field, and again you see that our methodology is producing a much better fit than the U-Net methodology; the green is the truth, ours is the dotted orange.
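For readers who want to reproduce this kind of statistical comparison, here is a hedged sketch of the dissipation diagnostic just mentioned (the squared norm of the velocity gradient), computed with spectral derivatives on a periodic grid; the function name, the domain size and the normalisation are my assumptions, not the authors' code.

```python
import numpy as np

def dissipation(u, v, L=2 * np.pi):
    """Mean-square velocity gradient for one snapshot (u, v) on an N x N periodic grid."""
    N = u.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)           # angular wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    total = 0.0
    for f in (u, v):
        f_hat = np.fft.fft2(f)
        fx = np.fft.ifft2(1j * kx * f_hat).real          # df/dx
        fy = np.fft.ifft2(1j * ky * f_hat).real          # df/dy
        total += (fx ** 2 + fy ** 2).mean()
    return total

# Comparing the histogram of this quantity over many snapshots from the learned model with
# the same histogram from the reference solver gives the "invariant measure projected onto
# dissipation" comparison described above.
```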
Here is a study of the accuracy of the method as a function of the derivatives used in the fitting. There are a lot of numbers here, but I don't want to go through them; I'll just say that the slides are available if you want to look at these numbers, which get into the details of how the different ways of doing the training affect the accuracy. We can train on the stream function, the velocity or the vorticity, any of those three, and we can also apply an L2 norm, an H1 norm or an H2 norm, and these have implications for the error, which is summarised in this table if you're interested.

Now I want to describe a different method. This is work with Kaushik and Nick, whom I've already introduced, and also with Bamdad Hosseini, who is currently a postdoc at Caltech and soon to be an assistant professor at the University of Washington in applied mathematics. This is a different idea, and I just want to describe it to you very quickly; it actually underlies the very first example I showed, for porous medium flow. We're trying to find the map on the left here, psi-dagger, which takes an input function space X into an output function space Y. The idea is to learn what's called an autoencoder, an approximate factorisation of the identity in the input and in the output space, using the data: so G composed with F is approximately the identity, in both the input and the output spaces. In fact we use PCA, principal component analysis, to find that autoencoder; we tried other things, but that was the best, and we also have a theory for PCA. The point is that these approximate factorisations of the identity on the function spaces introduce a latent space which is finite-dimensional, in both the input space and the output space, and we then train a neural network in this latent space, that's here. The resulting architecture is made up of this step, which takes the infinite-dimensional input and projects it onto a finite-dimensional latent space; a neural network that maps between input and output in the latent space; and then a lifting of the latent space back up to infinite dimensions. The accuracy of this method is controlled by the extent to which these are accurate factorisations of the identity, which can be studied using the theory of principal component analysis on function space, and by the accuracy of the neural network trained in the latent space, which can be controlled with the theory of neural networks. So again there's a theorem that shows that this is a provably accurate methodology for the learning of maps between function spaces.
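A minimal sketch of this PCA-plus-neural-network construction (often called PCA-Net in the literature) is below. It is an illustration under my own assumptions: the latent dimensions, the MLP size and the use of scikit-learn are illustrative choices rather than the authors' implementation, and the latent dimension is plausibly the parameter d that labelled the curves in the first Darcy figure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

def fit_pca_net(X, Y, d_in=70, d_out=70):
    """X, Y: (n_samples, n_gridpoints) arrays of discretised input/output functions."""
    pca_x = PCA(n_components=d_in).fit(X)      # approximate factorisation of the identity on X
    pca_y = PCA(n_components=d_out).fit(Y)     # ... and on Y
    net = MLPRegressor(hidden_layer_sizes=(500, 500), max_iter=2000)
    net.fit(pca_x.transform(X), pca_y.transform(Y))   # neural network between the latent spaces

    def psi(x_new):                            # x_new: (m, n_gridpoints)
        z = net.predict(pca_x.transform(x_new))        # latent input -> latent output
        return pca_y.inverse_transform(z)              # lift back to the output function space
    return psi
```

In practice the PCA modes are themselves discretised functions, so using the learned map at a different resolution means re-expressing those modes on the new grid; the resolution robustness shown for the Darcy example comes from conceiving the whole construction on function space first.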
We've already seen this problem, which is the Darcy problem; now I've put everything together, eliminated the velocity and written it as an elliptic PDE. Coming back to Peter's earlier question, we are indeed here learning the mapping from the coefficient, which is a function, to the solution u, which is a function, and this could be combined with learning the dependence on the initial condition in time-dependent problems. The input can be, for example, a piecewise-constant permeability of the type we see on the left, and the pressure, or piezometric head, on the right is the output. Just to remind you, this was the first example I showed you, in which we obtained a resolution-invariant approximation of the true operator mapping the permeability to the pressure.

What I want to finish with is showing you this method applied in crystal plasticity. You've seen all the co-workers on this paper before; they are the same as the co-workers on the Fourier neural operator papers. This is a problem in solid mechanics and homogenisation. The equations may or may not be familiar to you, but let me just say that the basic equations are nonlinear wave equations for u, which describes the small deformations of the material, and the model is completed by a constitutive model for P; P is the stress tensor, which depends on the gradient of the deformation u and on some internal variables which carry memory, because this is not an elastic response, it is an inelastic response. I haven't written down the equations for the memory, but there are dynamical evolution equations for the internal variables. This is a homogenisation setting in which there's a small length scale epsilon inside the stress tensor, and that small length scale makes this nonlinear wave equation extremely expensive to compute with. What we are going to do is use ideas from homogenisation, together with neural networks, to learn a homogenised constitutive law. The micro-scale model that I describe here, which includes the small scale, has a constitutive model which maps the gradient of the deformation (the strain), the internal variables (the memory variables) and the small parameter epsilon into the stress tensor. What we're going to do is homogenise over the small scales and compute a stress tensor P-bar which involves the past history of the strain. So, if you like, we're constructing a constitutive model which does not depend pointwise in time on the strain: in general in this area you're interested in constitutive models mapping strain to stress, and here we're working with the history of the strain and mapping that to the stress, and we're going to learn P-bar using machine learning. The output of this is a nonlinear wave equation that depends on the past history of the strain through a P-bar which we learn with machine learning. The benefit is that the machine learning is done offline, before the computation, and then, once one has P-bar, you've eliminated epsilon and the cost of the computation is decreased significantly.

There are a number of challenges here, but the primary one is that, in a training-testing scenario for neural networks, we don't know the correct training data, because the correct training data will depend on the solution of the problem that we wish to solve. So what we do is train on random strain paths, and we've shown by testing that, having trained on random paths, we're able to make predictions for more systematic loadings, such as uniaxial tension and loading and unloading. That is really showing, in machine learning terms, that we're able to train with respect to one measure and predict with respect to another. The key point is that, having done this, we're able to get a massive, orders-of-magnitude speed-up. Compared with what's called FE-squared, which involves a finite element discretisation of the micro scale plugged into the macro-scale computation, we're getting a speed-up of order a million by doing this; and even over Taylor averaging, which is a localising approximation widely used in this field to speed things up, we are still getting three orders of magnitude speed-up in this computation. This is a computation of a blunt impact on a piece of metal.
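In symbols, and with the caveat that this is a loose reconstruction from the spoken description rather than the model in the paper (the density term, the argument lists and the symbol xi for the internal variables are my choices), the replacement being described is roughly:

\[
\rho\,\partial_t^2 u = \nabla\cdot P, \qquad
P = P\big(\nabla u(t),\, \xi(t),\, x/\epsilon\big), \qquad
\dot{\xi} = R\big(\nabla u,\, \xi,\, x/\epsilon\big)
\]
\[
\longrightarrow\qquad
\rho\,\partial_t^2 u = \nabla\cdot \bar{P}, \qquad
\bar{P}(t) = \bar{P}\big(\{\nabla u(s)\}_{s\le t}\big),
\]

where \(\bar{P}\), the homogenised, history-dependent constitutive law, is the object learned offline with machine learning; the small scale \(\epsilon\) no longer appears in the macroscopic computation.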
All right, I've used up my time. I just wanted to say that I don't need to convince many of you, I'm sure, that neural networks have had an enormous amount of empirical success, but they are typically being used for regression and classification problems on Euclidean spaces. What we've done is taken that idea and replaced the input and output spaces by spaces of functions, and I believe that will be widely useful in many areas of science and engineering. The key idea is to conceive of the architecture on function space and then discretize. There are generalisations: here I looked at neural networks, but there are other methods, such as the random features method, and I would highlight the work with Nick Nelsen, which pursues similar ideas with a different architecture. There's a lot to be done in this area. We've looked at PDEs, but you can imagine problems where you don't have a model, so you're just given data; there are many problems in robotics, for example, which are generating large amounts of data for which the models are very incomplete, and there's a possibility of learning models directly from time series using this methodology. There's a lot of mathematics needed: there's starting to be approximation theory, but that is only at the beginning and much more is needed; there are interesting questions around theory in the chaotic setting which I described to you; and there are applications to inverse problems, experimental design, and many more. I'm really grateful to you for listening, and for all the great questions. Thank you, and I'm happy to take more.

Thanks a lot, Andrew, that was indeed a fascinating talk, thanks a lot. Okay, we've had a lot of discussion already, which is great; if people have more questions then now is the time.

Can I ask a question please, Andrew? Hi. Of course. Andrea here. Oh hi, Andrea, nice to meet you. Concerning this non-local kernel that you had in the beginning: how much do you need to know about the Green's function of the linear operator, or of a nearby linearised operator?

Interesting: nothing. We were motivated to use kernels, and indeed Kamyar was one of the first people to do this, by looking at problems with Green's functions, where we know what the kernel is, and asking the question, can we learn it? That's a natural place to start with non-locality. But what's remarkable about this architecture is that it's a universal approximator between classes of functions without having to put any structure into the kernel. It's simply the non-locality that comes from the kernel, and then the iteration and the application of the sigma, that enables you to go from universal approximation theorems for standard neural networks in finite dimensions to very similar results in this infinite-dimensional setting. It's kind of surprising; if you'd asked me a year ago I would not have thought that was the case, but it is.

Well, I suppose that is also fortunate, because if you move to nonlinear problems you might invoke a nearby linear operator, but it's not necessarily clear which one should choose. So it's great that one can do this.

Exactly. We had empirical evidence at first, and now there is a developing body of theory. Sid Mishra, who I'm sure you know, at ETH, and his student Samuel Lanthaler have developed a theory to explain this in an L2 sense, and this student here is developing a more general theory which includes working in spaces of continuous functions and other Sobolev spaces.

Thank you, Andrea.

Very nice. Maybe I can jump in here. So you said it's not necessary to put in structure; is it helpful to put in some structure, if you have additional information about your problem, or does that not help in terms of efficiency of the training?
Firstly, working in Fourier space, which is what we did in all of the examples I showed in the first half of this talk, really helps: it reduces the number of parameters you need to make an effective approximation, so it's parsimonious, and it's also fast, because you get the advantage of the Fourier transform. So definitely, and by the way the theory of Mishra and Lanthaler applies to this specific architecture, so definitely that is the case. Empirically, we've also looked at, for example, including in the kernel a dependence on the input function, and I wouldn't say we have a systematic understanding of that, and no understanding from a theoretical point of view, but there is some empirical evidence that, for example, letting the kernel depend on the input is helpful. Other than that, there's a lot that can be done with a fairly standard approach, just viewing Kl as a function and parameterising it as a neural network in a standard way; the fact that it's then lifted up to become an operator gives it this approximation property on spaces of functions. So that's a very good question.

I see, very nice, thanks a lot.

Okay, Matthew, please go ahead.

Thank you, that was a wonderful talk, Andrew. My question is about your function-space-based approach, which seems to work really well in the examples you show: do you have a recipe, or something you think people should look for, for when to use the functional approach versus the more classical pointwise neural network approach?

Everyone, to some extent, gets driven by their methodology towards examples where it works. Why is it working so well here with respect to mesh invariance? It's really because we're using computational models that, even when they're discretised, are faithful representations of the limiting infinite-dimensional object. In that setting, everything I've described here is highly relevant, and in particular I think the fact that you can transfer between training on relatively coarse discretisations and then using the result at finer discretisations, and vice versa, is very useful. However, there are many problems where we don't resolve the physics, and climate modelling is a great example: the setting in computational models is such that you're using a grid of tens of kilometres, and that just doesn't resolve the physics of clouds, for example. So the idea of being close to the infinite-dimensional limit in the computational model just doesn't apply. In that setting I think you need a very different perspective, which is more about learning model error. If you have the idea of a continuum model, which you don't have access to, and you have a coarse model which really is not accurate at certain scales, so there's a significant model error, then I think the correct philosophical approach is to focus on learning the model error, the discrepancy between the model that you're able to compute with and the resolved physics, which is too expensive to compute with. I have some work in that direction for ordinary differential equations, and I have some work with a colleague in climate modelling in a very specific context of that type, but I think that's the general message: for problems where the physics is not resolved by the model, I think using machine learning to learn about model error is the way to go, and that's not something I described in this talk.

Wonderful, thank you very much. Thanks.

Thanks, Ben.
Yeah, thanks. Really awesome talk, I really love the work. It's a related question, I'm not sure, but I'm really interested in how these approaches scale to larger domains and, kind of equivalently, to higher resolutions in the solution of the PDE, or even multi-scale solutions as well. A related problem with neural networks is their spectral bias, this idea that they tend to converge much more slowly when there are high frequencies in the output. So I wonder if you have a sense of how these Fourier neural operators would scale to larger domains; maybe the analogy in a climate model is that you input the initial conditions of the Earth and do a full-blown climate model with this sort of technique.

Yeah, so two things. Let me first point to something which is exactly the point you've made. This is the KS equation: if you look at our ability to predict the spatial correlation, it's pretty poor at large separations. So this is a problem where there is high-frequency activity that induces correlations at quite long length scales, and we're not predicting them accurately; we're doing slightly better than some other methods, but we have not nailed that problem. So you're quite right that even in this setting, where we're looking at PDEs that are close to resolving the underlying continuum, it's hard to predict some of the activity that goes on at high wave numbers. That's the first thing. The second question you asked was more about climate. I would not wish to claim that we're at a point where we could yet just take a climate model and train it based on its initial condition and do anything useful. The reason is that in climate modelling the primary hurdle, in my opinion, is to learn about model error, or sub-grid-scale activity, and represent it accurately in the simulation. So I would argue for trying to use machine learning to learn sub-grid-scale models as being the first thing that should be done in that area, or at least a first thing that should be done, rather than just trying to use this blindly for climate models.

Yeah, thanks.

Thanks a lot. David?

Hi. Again, I'd like to reiterate what everyone else said, I think it's a really amazing talk. So I've been working with some of my colleagues on developing a crystal plasticity model, and I'm more of an experimentalist by trade. We've had some very good results with the model compared with some small-scale laser experiments that we've been doing, but one of the trade-offs has been that it's obviously quite a complex set of differential equations that we have to solve, and that's really caused performance issues: basically, we can model things that are very small, but any time we try and make things much larger it causes issues. I think you were talking about these kinds of hidden parameters: our model has a set of slip systems, it stores dislocation activity on those, and then runs a set of differential equations to work out which slip systems are activated and how the stress has changed. Is that the kind of thing, and from a very quick look at your paper I'm assuming it is, that was in the model you were trying to emulate? The performance speed-ups that you're suggesting would be very useful for my area of work, so I guess my question is: could this be expanded to more complex crystal plasticity models, and would you expect the kind of efficiencies you're seeing to remain the same, or get better or worse, depending on the system of equations people are using?

Thank you, a really good question.
So, winding back to the beginning of what you were saying: yes, these internal variables keep track, through a time-dependent model not described here, of a coarse description of the dislocations and so forth that are going on and inducing the inelastic response. So that's the connection with the part of the model you described.

Just in terms of that: obviously you're using some particular model, but if you changed it to a slightly different model, your approach is independent of the model, you don't really care, as long as you can get enough training data from that model?

Yes. That model has to be good, but yes, it's independent of how one represents the internal variables. A potential limitation of what we've done is with respect to the small length scale. I think, from what you're saying, you're able to do accurate simulations at small length scales, but the problem is scaling them up to macroscopic prediction. Here we're using the commonly adopted approach, which is to assume that the micro scale is periodic, and that is very much a limitation of what we're doing; but my understanding, and I'm not an expert in this area, is that this is a commonly used assumption for the micro scale. So the variation with respect to the short length scale epsilon is small, we are then able to take a unit cell, as is typical in homogenisation, we do the computation on a unit cell defined by this length scale epsilon, and we compute the homogenised constitutive law using that periodic structure, and we then get the speed-ups I've described on this specific model. The two things I would say in terms of caveats are, firstly, that everything here is limited by how good the description of the internal variables is, how well the dislocation physics you're talking about is represented by the internal variables; and secondly, how useful or not it is to work with a periodic model of the micro-scale structure. Those are the modelling caveats. From a computational point of view, an important caveat is that this model, as you progress in time, accumulates more and more memory, so it's a t-squared computation, and that's a drawback. Where we would like to go is to try and find Markovian closures for this; the Mori-Zwanzig formalism from statistical mechanics is a useful way to think about it, so I think it would be interesting to try and find representations that are Markovian, using exponentially decaying modes, to make something which is computationally tractable. We did get a fast speed-up, as I report here, but eventually that's going to get killed by the t-squared nature of the computation.

If you go back to the slide before, oh sorry, the one before that, this one, yeah, okay. So, in the model that we're using, we're interested in the response of the different crystallite grains, so commonly we mesh up lots of small grains, and I guess my question is: does your model then produce some average response for each of the grains and just assume that you have some generalised texture and they all behave the same, or do the grains each behave individually, given that they're in different orientations with different slip systems?

Right, yes, they're all treated individually. This is the unit cell I referred to, and we would do the computation on this unit cell fully respecting the specifics of the microstructure that you see here, and the methodology would apply with different microstructures.
We have to do the computations at the level of this unit cell, but within that you can put in whatever microstructure you want. We generate data, we train the constitutive model using it, and then the assumption is that this is repeated periodically throughout the domain.

Okay, yeah, that's really interesting, because one of the ways we were thinking about expanding this was to take a Taylor average, to say that for all of our grains we're just going to average over that, and that's the way we could scale this up to large systems. But this suggests that this is maybe a way that would be even better than that.

We are able to beat Taylor averaging by using this method, in this specific case.

Okay, thank you.

I don't want to oversell anything: in this specific case. I think this field is very exciting, there are lots of possibilities for using machine learning, and I hope this talk has given some indication of them, but just be careful of all the caveats; I think that's an important thing.

Okay, great. So, one last question: Duncan, please.

Brilliant, thank you very much, Andrew. I won't take too much of your time, but I was just digesting your answer to Matt's question, particularly with respect to modelling GCMs, where, as you rightly point out, we don't resolve the key processes. Is there any opportunity, or any way, of combining this functional approximation with the pointwise approach? I'm envisaging perhaps having targeted high-resolution simulations that might inform the function space, combined with coarse simulations that we can run for a long time.

Thank you. That is a line of work that I'm engaged in with Tapio Schneider. I haven't described any of that work, but in it we're trying to learn models for clouds, and to do so on the basis of data which is both large-scale satellite data and local computations that use models of turbulence. That's our goal. We have not got there yet, and so far we have mainly looked at the effect of using satellite data to learn about cloud models with relatively small numbers of parameters, but where we're going is to look at non-parametric models of clouds and to inform them, using machine learning of the type I describe here, by small-scale data as well as satellite data. That's where we're going; that's a wish.

And that's targeting directly the structural uncertainty of these unresolved processes?

Yes, exactly. What I've learned from Tapio is that poor modelling of clouds is a primary component in the uncertainty that underlies all of the different GCM models we see, so the objective is to try and get better cloud models, one-dimensional in the vertical, that describe the physics of clouds, learn about them from data, and couple them into the GCM.

Great, thank you very much. Thank you.

Fantastic. Well, thanks a lot, Andrew, that was a great end to this seminar series this term. Thanks very much indeed. So that's it for this term; I hope you enjoyed it, and I hope to see many of you again in Michaelmas. Cheers, thank you.