Hi guys, hello again. Today is the last of the simulation lectures we'll do, and we'll talk about partial differential equations, and also, surprise, surprise, Gaussian process inference. So first of all, a bit of an outlook on what we're gonna talk about today. I'll clarify what a PDE actually is and why these equations are important, what all the fuss is about, essentially. And the main part of the lecture will be showing you how to integrate PDE-based models into machine learning models, specifically probabilistic machine learning models. We'll do all that by working through a practical modeling example, just to make things a little more visual. So first of all, PDEs are used as the language of mechanistic knowledge. Now, what do I mean by that? A mechanistic model, as the name says, describes what's going on, for example in the real world, but also in financial markets and situations like that, by describing the mechanism that generated the data. Think about something like "opposite charges attract, like charges repel": you don't directly know how two protons will interact, for example, but you know that they will repel, and you know the strength of that force. This is what I mean by mechanism. You don't know the trajectories of the protons, but you know how to get to the trajectories from that mechanistic knowledge. Another example of a physical system that is described by PDEs is the realm of fluid mechanics, so the description of fluids of all sorts, and actually fluids in a wider sense: for example, climate models and weather models are also based on these PDEs. The important model here is a very important system of PDEs called the Navier-Stokes equations; you might have heard of them already. They're used to simulate systems like the weather and the climate, but also the ocean, so you could build, say, a tsunami model which predicts when a tsunami will hit the coast, or whether it will actually come about. You can see from these situations that the models are typically really large scale: simulating the ocean, or simulating the entire earth's climate over a long period of time, is a really large system. So the problems we're talking about are typically very, very large scale, and they're also very difficult to solve in practice. But PDEs are also an interesting thing to talk about because the theory and practice of PDEs are still a highly active field of research after, you could argue, a century or maybe even centuries of research going into this field of mathematics, or applied mathematics. It's in a sense so difficult that one of the well-known millennium problems is about these Navier-Stokes equations, namely just to state whether the set of equations even has a solution, and how smooth that solution is. That maybe already gives you an insight into how difficult it sometimes is to even talk about these models. Now, because these are typically difficult, and this is, I guess, a common scheme in mathematics, nonlinear or general PDEs are quite difficult to understand, in this lecture we'll restrict ourselves to a simpler class, namely the class of linear PDEs. I'll define what that is in a little bit.
But just know that even if we restrict ourselves to that class, this is still quite a powerful modeling language. Many of the physical processes that happen around us are, or can be, described via linear PDEs quite accurately. For example, thermal conduction, the diffusion of heat in, say, a piece of metal, is described by the so-called heat equation, which is one of these linear PDEs. The phenomenon of electromagnetism, I already talked about protons interacting, is also described by a set, actually a system, of linear PDEs called Maxwell's equations. Wave mechanics, so essentially the description of water waves and waves that propagate through air, can be described to a very good approximation by the wave equation, also a linear equation. And the velocities of particles in Brownian motion are described by the so-called Fokker-Planck or Kolmogorov forward equation; the latter actually also has quite an important place in mathematical statistics, where it relates to stochastic processes. But it's not just physical models that are described by these equations; I already hinted at financial markets: the famous Black-Scholes equation, which is used in mathematical finance, is also such an equation. And finally, if we actually work with nonlinear partial differential equations in practice, we can use linear approximations to these nonlinear equations in numerical simulation, essentially by iteratively re-linearizing. So with all that motivation aside, we typically use these models to describe the behavior of a real-world system. We don't know the exact behavior of that system in advance, but we know, as I said, how the mechanism behind the system works. I'm deliberately phrasing the goal of this lecture as a sort of pseudo graphical model, because we want to fuse the models we know, the probabilistic models, with these mechanistic models, with this mechanistic knowledge, in order to gain some of the strengths of machine learning in practice. This is also sometimes called hybrid modeling, because we have empirical knowledge from observational data, and then we have mechanistic knowledge, which ties it all together quite nicely. So, which is why this is stated here, we know the mechanism and we want to infer the system behavior. However, PDEs usually have some set of parameters, and we usually don't know these parameters in advance. I give a couple of examples here: strengths and distributions of heat sources, or charges in electrical problems; material parameters, such as the speed at which heat diffuses through a certain material; or just forces in classical mechanics. We don't usually know these, but we can measure them. These measurements are typically noisy, and from that we can already see that we need to be able to deal with observation noise in our models, and classical descriptions of these things typically don't handle that too well. So we have these measurements of the system parameters, and the system parameters also need to be known in order to solve for the system behavior, right? Because the equation is governed by these parameters. And sometimes we also have measurements of the system itself; think about something like a simulation of a heat distribution.
We might just place a thermometer at a certain point in our simulation and measure the temperature of a piece of metal, for example, but we can't do that at every point of the simulation. This is why we actually need the mechanistic knowledge, to interpolate between our measurements. Question? Where do we get the measurements from? That depends on the situation. Forces you can just measure; there are essentially measurement devices for that. Sometimes it's actually more difficult to measure this stuff, in which case you can use measurements of the system and propagate the knowledge back to the system parameters, which is then called an inverse problem. Nathanael already talked about this quite a lot in the ODE lecture. So this is actually the flip side of what we're doing here: not necessarily just simulating what the system does, but also inferring what the causal mechanism underlying it is. And our approach here, and actually the goal of this lecture, is to use Bayesian statistical estimation to fuse what we know, the mechanism, with this measurement data, and also with the uncertainties that are contained in most of these system parameters. Why Bayesian statistical estimation? Well, because we have all these uncertainties here, and it would be quite a good thing to propagate essentially all that we don't know onto the solution, so that we can give some confidence to the predictions we make. All right, so let's jump into the PDE world and first of all answer the question of what we actually mean by a linear PDE. First of all, we want to simulate a physical system. Typically, physical systems have some sort of spatial extent, and if it's a system that evolves over time, we also have a time span over which we want to simulate it. So first of all, we define this set D, which is called the domain. Then we look for a function that describes the physical system, say a temperature distribution, or the forces generated by a set of electrical charges. This is the unknown function u, the description of the system that we want to simulate. The mechanistic knowledge in these PDE-based models is given by this equation here, where D is a so-called linear differential operator; I'll show you some examples of that. Generally speaking, this is just a linear combination of partial derivatives of the unknown function u, and while we don't know the function u, we prescribe a fixed value f, called the right-hand side of the equation, for this linear combination. This turns out to be a very elegant description of a lot of physical processes. Now, some examples of linear differential operators: probably the most well-known one is the Laplacian, which is just the sum over all second partial derivatives, essentially the trace of the Hessian, right? The non-mixed second partial derivatives. And another example of a linear PDE is actually an affine ODE. In the case where we only have one input variable instead of d input variables to our function, we can construct this differential operator here, which consists of just the one derivative that we can actually build with this function, and another linear term here. If you rearrange terms in the equation, this just gives you the vector field of an affine ODE.
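To pin down the notation, here is a compact sketch of these objects; I'm reconstructing the formulas from the verbal description, so details may be arranged differently on the slides:

```latex
\mathcal{D}[u] = f \quad \text{on } D \subseteq \mathbb{R}^d,
\qquad \text{e.g.} \quad
\mathcal{D} = \Delta, \quad
\Delta u = \sum_{i=1}^{d} \frac{\partial^2 u}{\partial x_i^2}
\quad \text{(the Laplacian)}

% and for d = 1, choosing
\mathcal{D}[u] = \frac{\mathrm{d}u}{\mathrm{d}t} - A\,u
\qquad \Longrightarrow \qquad
\mathcal{D}[u] = f \;\iff\; u'(t) = A\,u(t) + f(t),
```

i.e. exactly the vector field of an affine ODE.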
So in a sense, affine ODEs are a special case of linear PDEs, and in general, an ODE is a special case of a PDE. It's not always helpful to think about it this way, because PDEs tend to be a bit more difficult to simulate, but it's a closure argument. Now let's talk about some problems we have with these equations if we want to apply them in practice. First of all, we usually do not get an analytic solution. This was already true for a lot of the ODEs we considered, and since this is a wider class of models, we inherit these problems. So we need to use numerical solvers to get to these unknown functions u, which, because the function u is an infinite-dimensional object, will inherently introduce discretization error, unless you know something about the problem; we're going to talk about this. Secondly, we already touched on this: the PDE has parameters, and we can now pinpoint a little better what these are. The right-hand side function f is one of these parameters. For example, the heat sources and charges involved in our problem are typically described by the right-hand side of the equation, and material parameters, for example, are the coefficients in the linear combination of the partial derivatives of the function u. We usually don't know these exactly; I already talked about this. Finally, the classical solvers that have been developed over the past century are sometimes quite difficult to embed in computational pipelines, for exactly this reason: the parameters are usually not known exactly, but the solvers need a point estimate of each parameter, and so it tends to be quite difficult to propagate these uncertainties through the solvers. This is something we aim to solve with the Bayesian approach we take here. Now, there's a bit of a problem, because the PDE by itself usually does not identify its solution uniquely. For ODEs, we have already seen that we also need to formulate initial value problems, because essentially there's a constant of integration involved when we solve these equations. For PDEs, this is not much different, but the types of additional conditions we need to impose are a little more difficult than in the ODE case. Let's look at an example first. We look at the so-called Poisson equation, which just prescribes the value of the Laplacian of the function. And let's think of a solution candidate for this equation which is just linear. If we apply the Laplacian to this linear function (I said this is just the trace of the Hessian matrix), then because this is a linear function, the Hessian is zero; it doesn't have a second-order term. So this is just zero, which means that linear functions are in the kernel of this differential operator. And so, for any solution of the Poisson equation, we can just add a linear term and still get a solution, because, as I'll discuss in a little bit, this differential operator is actually linear. So usually, in the spirit of what you do in the ODE case, uniqueness can be achieved by requiring an additional condition to hold, which is usually a boundary condition, or in this case, to keep everything linear, a linear boundary condition, where again we have a linear operator here. I'll talk about this in more detail in a little bit.
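You can convince yourself of this kernel argument with a few lines of symbolic differentiation; here's a minimal sketch, assuming sympy is available (the helper names are mine, not from the lecture):

```python
import sympy as sp

x, y, a1, a2, b = sp.symbols("x y a1 a2 b")

# Linear solution candidate u(x, y) = a1*x + a2*y + b in d = 2 dimensions.
u_lin = a1 * x + a2 * y + b

# Laplacian = sum of the non-mixed second partial derivatives.
print(sp.diff(u_lin, x, 2) + sp.diff(u_lin, y, 2))  # -> 0

# So adding a linear term to any solution u leaves the Laplacian unchanged:
u = sp.Function("u")(x, y)
lap = lambda w: sp.diff(w, x, 2) + sp.diff(w, y, 2)
print(sp.simplify(lap(u + u_lin) - lap(u)))  # -> 0, same right-hand side
```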
But the physical intuition for why we prescribe something that happens at the boundary of our simulation domain is that, if you think about the problem you're simulating, there might be interactions coming into your system from outside of the simulation domain. And if you don't simulate these outside influences, if you essentially don't model them in your mathematical framework, then anything can happen. If you model the heat distribution in a piece of metal, and there's a truck crashing into it, then what's gonna happen? You don't know. So you essentially need to model everything that happens outside of the simulation domain by summarizing how the outside interacts with your simulation at the boundary, because that's the only sort of influence it can actually have here. A PDE together with a set of boundary conditions is usually referred to as a boundary value problem. Now, a specific example of such boundary conditions are Dirichlet boundary conditions, which essentially say: the operator restricts the function to the boundary, and we prescribe the value of the function at the boundary. If you think about the function u being a heat distribution in some piece of material, then this just says: we know what the temperature is at the boundary. In physical analogy, this might be: you have a huge bucket of ice water, which will not change temperature whatever happens in the interior of the domain, so you know the boundaries of your domain will always stay at zero degrees, essentially. All right, I've already circled around this point for quite a bit, but PDEs are actually statements about functions. We have an unknown function u, and functions are typically infinite-dimensional objects. You might already know that if there's an infinity involved, you need to be a little careful. But it turns out that we have quite a convenient structure for functions, because functions, or function spaces, sets of functions, under certain conditions are actually vector spaces, right? From linear algebra classes, you might be very familiar with the vector space R^n, the space of n-dimensional column vectors. As you know, you can add them, you can multiply them with a scalar, and you can do essentially the same thing with functions, just by saying that the sum of two functions and the product of a function with a scalar are defined pointwise. This is what's stated here. So we have vector space structure, and that comes with all the nice amenities of vector spaces. There can be bases in these vector spaces, right? Every vector is a linear combination of some set of basis vectors, and for function spaces, you have a linear combination of a set of basis functions. At least for the function spaces we care about here, these spaces are actually infinite-dimensional, and sometimes such bases don't even exist in this form; this is what I mean by having to be careful. But for this lecture, they exist, and they are infinite, essentially a sequence of basis vectors. You also have linear maps, and we know that in finite-dimensional vector spaces we can represent linear maps by matrices. The linearity property just means we can pull the matrix inside a sum, and we can pull scalars out of the matrix-vector product.
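In formulas, the pointwise vector-space structure and the Dirichlet boundary value problem read as follows (again my reconstruction of the slide notation):

```latex
(u_1 + u_2)(x) := u_1(x) + u_2(x), \qquad (\alpha u)(x) := \alpha\, u(x)

% Dirichlet boundary value problem:
\mathcal{D}[u] = f \ \text{ in the interior of } D, \qquad
\mathcal{B}[u] := u|_{\partial D} = g \ \text{ on the boundary } \partial D
```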
Linear maps in infinite-dimensional vector spaces, specifically in function spaces, are called linear operators, and this is actually where the term linear differential operator comes from, because a differential operator maps a function, for example, to its derivative. So it's a map between vector spaces of functions, and it has this linearity property, essentially because you have the sum rule of differentiation and you can pull scalars out of a derivative. This is why we talk about linear equations and linear differential operators, and by applying this knowledge you can see that a linear PDE is nothing but a linear system in an infinite-dimensional vector space. So in this lecture we will apply quite a bit of intuition about solving linear systems, which we developed in Jonathan's lectures, to the case of PDEs. This is quite a nice analogy you can always think about when working with these systems. There are two more details. You can define norms on vector spaces, right? For example, here we have the maximum norm, or the infinity norm, on R^n, and the analog of this also exists for certain classes of functions: you can just take the supremum of a function, and that turns out to be a norm on some function spaces. If we have norms, we can also talk about convergence, and if every Cauchy sequence in such a space converges, we can talk about a Banach space. We know that R^n with the infinity norm is a Banach space, but the space of k times continuously differentiable functions on some set is actually also a Banach space with this particular norm. So again, we can take a lot of the intuition from the finite-dimensional case over to the infinite-dimensional case, but maybe not always, so we have to be careful, again. The same actually also holds for inner products and Hilbert spaces, with a couple of caveats which I'm not going to go into right now. Just notice that usually these translations involve replacing indexing into column vectors by function evaluations, and sums by integrals; it's a very straightforward way of deriving these analogs. All right, so let's move on to a more practical topic, which is the toy example that will serve as the main motivating example with which I'm going to introduce the methods we use here: a simple model of the heat distribution in a central processing unit of a computer, a CPU. This is what a CPU in a desktop computer usually looks like. This metal piece on top here is not the chip itself; it's just a cover which is used to extract heat from the chip. The actual silicon is just this little black chip here. These components are particularly limited by the heat they give off, right? If there are currents going through the chip, then due to internal resistances and things like that, it will produce quite a lot of heat, and if these systems overheat, they may take damage or not function properly, and we want to avoid that. So it's actually quite a good thing to know in practice what the temperature distribution in your chip is. You might already guess that this thing is really thin, and so we can model it as a two-dimensional thing; for now, we'll restrict ourselves to a two-dimensional mathematical model of this system.
This is the spatial domain we talk about; it's just a Cartesian product of intervals, in this case the length and the width of the chip. But actually, since there's some homogeneous geometry here, to keep things simple for this lecture we'll restrict ourselves even further, to modeling the temperature distribution on this line slicing through the chip right here. We're going to see a two-dimensional example towards the end of the lecture, and we'll see that this is actually quite a good model assumption for this particular case. But now let's turn to moving from a physical model into math. First of all, we already defined the spatial domain that we're going to work on, which is just this interval from zero to the length of the CPU, along this coordinate axis right here. Now, when we want to simulate problems involving heat, we need to know where the heat comes from, what generates heat in our system. For now, we'll just assume that the GPU built into this CPU is idling, it's not doing anything, while the cores are computing something really hard. So the compute cores of the CPU are working and generating heat. At the same time, these systems are usually built into a computer in a way that there's a heat sink on top of them, which transports away the heat generated by the chip, so as to prevent it from overheating. Actually, this sketch might be a little misleading: since the heat sink squashes what's called a thermal interface material, sort of a paste, around these edges, you could assume that it extracts heat roughly uniformly from the whole surface of the chip. So the heat leaves the chip via its surface. And so if we wanted to model where the heat sources are in this chip, and also where the heat sinks are, where the heat leaves the system, we could look at a function like this, right? We place three Gaussian blobs on the cores themselves, which are the heat sources. You can see the unit here is watts per cubic millimeter, so it's essentially heat per unit volume. And then there's a negative constant function superimposed onto these Gaussian blobs, which models the heat sink, everything that is getting pulled out of the CPU. Now you might wonder, well, we talk about PDEs here, so what is the PDE that actually models this system? I already talked about the heat equation, so here it is. It's a linear PDE, and it's also second order. Why is it second order? Anybody? Exactly, the last term contains the non-mixed second partial derivatives. So this is where the second order comes from. I'm going to talk about what these individual terms actually mean a little later, but for now, note that this function u is going to be the temperature distribution in our chip. We can see that there's a temporal derivative here, a derivative with respect to the time variable. A reasonable assumption that makes things simpler for us is that the temperature distribution stays the same over time, which will eventually happen in a CPU that runs for a while, once the control of the fans on that heat sink reaches a stationary point. So we'll assume that at some point we actually reach a stationary temperature distribution, and you can work that into this model by saying: if the temperature doesn't change, then this temporal derivative is just zero.
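Just to make the shape of that right-hand side concrete, here's a sketch in code; the core positions, widths, and powers are invented for illustration, not read off the slides:

```python
import numpy as np

L = 5.0  # length of the 1D CPU cut; units and all numbers here are illustrative

def q_dot_V(x, centers=(0.2 * L, 0.5 * L, 0.8 * L), width=0.05 * L, power=1.0):
    """Volumetric heat source: three Gaussian blobs over the compute cores,
    minus a constant term modeling the heat extracted by the heat sink."""
    x = np.asarray(x, dtype=float)
    blobs = sum(
        power * np.exp(-0.5 * ((x - c) / width) ** 2) / (np.sqrt(2 * np.pi) * width)
        for c in centers
    )
    # Each blob integrates to roughly `power`, so this constant sink makes the
    # total integral over [0, L] approximately zero: generated heat = extracted heat.
    return blobs - len(centers) * power / L
```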
There is no temporal change in temperature anymore, in which case we arrive at the stationary heat equation, which you obtain from this equation by setting this entire first term to zero. And since we restricted ourselves to this one-dimensional subset of the CPU, essentially this line, this turns into this equation here: the Laplacian in d = 1, with just one dimension, is just the second derivative. Now, what is that? We have actually already seen that in this lecture. What does this equation remind you of? It's just an ODE: because we restricted ourselves to a one-dimensional problem here, it's an ODE. You might say, well, where's the point in talking about PDEs then? It's just so things are a little more visual for this lecture; everything we're going to talk about in the following also applies to the multi-dimensional case, we don't have any specializations to the 1D case here. All right, now how do we inject the mechanistic knowledge from this differential equation into our statistical model? A way of thinking about these PDEs is to interpret them as an observation of the unknown function u, and this works in much the same way as in the ODE case. There is an unknown quantity, we don't know anything about this function u, but what we can observe is a derived quantity: we can observe its image under the differential operator. Because we want the PDE to hold in our system, we know that this image under the differential operator (think of it as just differentiation for now, just one partial derivative) has to be a certain value, namely the value of the right-hand side function. In physics, a lot of this actually has an interpretation: many of the most fundamental laws in physics are conservation laws, so they describe the conservation of some fundamental quantity like energy, mass, momentum, charge, some physical observable, and it turns out that these are usually expressed as PDEs. The heat equation we've just been talking about (notice that I moved one term over to the other side) is a statement about conservation of energy, particularly heat energy. This left-hand side term here is proportional to the change in temperature, and it turns out this is also proportional to the change in heat energy: if temperature changes, heat energy changes. Note that temperature and heat energy are not the same thing, but in this case they are proportional to one another via these material parameters here. And we say that every change in heat energy has to be explained either via a heat source (this q̇_V term is a heat source, essentially a known amount of heat entering the system at any point) or by heat flowing into that point from the surroundings via heat conduction. This makes sense, because the Laplacian computes a sort of curvature estimate of the function; in 1D it actually is the curvature, it's just the second derivative. And if you have a situation like a parabolic bowl, then you would expect, because the surroundings of the center point are hotter than the center point itself, that heat flows into it, because temperatures tend to equilibrate over time. And this is the statement: every change in energy is either explained by conduction from the surroundings, or by a heat source being present at that point.
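In formulas, with the material parameters lumped into constants (a common form of the equation; the slides may group the constants differently):

```latex
% Heat equation, rearranged as an energy balance:
c_p\,\rho\,\frac{\partial u}{\partial t}(t, x) \;=\; \kappa\,\Delta u(t, x) \;+\; \dot q_V(x)

% Stationary case (\partial u / \partial t = 0), in one spatial dimension:
-\kappa\,\frac{\mathrm{d}^2 u}{\mathrm{d}x^2}(x) \;=\; \dot q_V(x)
```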
So there's no energy lost or gained other than what we can explain, essentially. And this is a local statement, because all these derivatives are computed at every point of the domain. I already said that, abstractly speaking, this is just a local mathematical property: a prescribed value of a derivative at a certain point. We can actually rewrite this equation, which is first of all just a notational thing, as the difference of the differential operator applied to u minus the right-hand side being equal to zero, and we define this as the operator I, which in probabilistic numerics is also referred to as an information operator. For the specific case of just the derivative, you've actually already seen these information operators in the ODE lecture; this is exactly what we conditioned on in the probabilistic ODE solver. So this is just a more general framework. It doesn't just apply to differential equations; you can essentially express any piece of information with such an information operator, and you can see it as an extension of the notion of a data point, right? A PDE is arguably not a data point in and of itself, but it still provides you information about the problem that you're trying to solve. So it's a generalized notion of data, or actually of information. All right, now we actually want to solve this differential equation, and we have an unknown function u. So what do we do if we have an unknown quantity in Bayesian statistics? We just put a prior over it, in this case, surprise, surprise, a Gaussian process prior over u. You can see that prior up here: it's a Matérn 7/2 kernel and just a constant mean function. And the observations, which replace the normal point observations in regular vanilla GP inference, are now given by the information operator, right? We require the differential equation to hold at every point of the domain, where we replace the solution candidate by the GP. Now, how do we actually do this? First of all, how do we apply such a differential operator to a GP, and what kind of object do we get from that? Another way of phrasing this is: how do we take the derivative of a GP? And second of all, how do we compute the posterior, how do we condition on this piece of information? Well, it turns out that both of these objects are GPs again, so we have a closure property of GPs under linear observations. But we still have a bit of a problem, because this is an infinite set of observations: we want the PDE to hold at every point of the domain, and the domain is typically something like an interval, so it's uncountably infinite. Computationally, this is going to be quite a challenge; it's basically impossible in the general case, unless you know something analytically about the problem you're solving. So we relax this piece of information by saying: we don't want it to hold at every point of the domain, just at a finite set of training points X. These are called collocation points, because of classical methods that are similar to this; I think Nathanael also talked about this in the ODE lecture. So it's essentially the same approach as in the ODE lecture, just that we don't use state estimation but an actual GP.
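Written out, the information operator and its relaxation to collocation points are:

```latex
\mathcal{I}[u] := \mathcal{D}[u] - f, \qquad
\text{PDE holds} \;\iff\; \mathcal{I}[u](x) = 0 \ \text{ for all } x \in D

% relaxed, computable version on collocation points x_1, \dots, x_n:
\mathcal{I}[u](x_i) = (\mathcal{D}[u])(x_i) - f(x_i) = 0, \qquad i = 1, \dots, n
```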
Now, it turns out that these objects are very similar in form to what you get if you condition a finite-dimensional Gaussian random variable on a linear observation; this already popped up in the lecture introducing GPs, essentially the Gaussian inference theorem on R^d column vectors. But now recall that we said functions also form vector spaces, right? So we can see a GP as a probability measure, or as a sort of random variable, on a function space. We use some GP prior here, and then, as before, we choose a linear operator, in this case a linear operator that maps the sample paths of that GP onto R^n, so a set of n linear observations of that GP. And we introduce a noise variable here which is independent of the GP, just independent Gaussian noise, exactly as in the finite-dimensional case. It turns out that the prior predictive, the image of the Gaussian process under this linear operator, is given by a normal distribution, where the mean is the image of the Gaussian process mean function under that linear operator, and the covariance matrix is very similar to the covariance matrix we get in the finite-dimensional case. Just that we can't really apply a matrix product here, right? Because this is a linear operator, not a matrix, and the kernel is not a matrix either, but a function. The analog of the A Σ Aᵀ term in function space is this: apply the linear operator, which remember acts on functions and returns a vector from R^n, to the first argument of the kernel function, with the second argument fixed; then you have a univariate function in x which returns a real value, and this is exactly something we can input into the operator. Afterwards, indexing into the i-th output, we see the result as a function of the second argument again, mapping from the spatial domain to R, which is again one of these functions we can feed to the operator, so we apply it again. You'll see a concrete example with a concrete differential operator in a moment, which will make everything a little clearer. And the posterior turns out to have a similar structure, too: essentially, we just replace the matrix-vector multiplications by applications of the linear operator to a function, and we replace the A Σ Aᵀ object by that Gram matrix we just saw. I left out a lot of theoretical detail here; we're going to return to this, at least with some vagueness, at the end of the lecture. But know that there's a lot more involved than just writing down these equations, and you need to be really careful when doing this, because of all these infinities involved.
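As a sketch, here's the shape of that statement as I understand it; which kernel argument the operator acts on first is a convention that may be set up differently on the slides. For u ~ GP(m, k), a linear operator L mapping sample paths to R^n, and independent noise ε ~ N(0, Λ):

```latex
L[u] + \epsilon \sim \mathcal{N}\big(L[m],\; LkL^{\ast} + \Lambda\big),
\qquad
(LkL^{\ast})_{ij} := L\big[x \mapsto \big(L[k(x, \cdot\,)]\big)_j\big]_i

% and, conditioned on L[u] + \epsilon = y, the posterior is a GP with moments
\begin{aligned}
m_{\mid y}(x) &= m(x) + L[k(x, \cdot\,)]^{\top}\big(LkL^{\ast} + \Lambda\big)^{-1}\big(y - L[m]\big)\\
k_{\mid y}(x, x') &= k(x, x') - L[k(x, \cdot\,)]^{\top}\big(LkL^{\ast} + \Lambda\big)^{-1} L[k(x', \cdot\,)]
\end{aligned}
```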
Yeah, we're going to talk about this a little later. So, think about this map; this is why it's mapping to R^n. Normally, differentiation maps functions to functions, but this is essentially the composition of differentiation with several point evaluations, right? And point evaluations are also linear, essentially because summation of functions is defined pointwise. We're going to see this on the next slide: an example of this would be to just take a derivative and then evaluate it at a point x. Now, if we actually want to compute this object, what does that look like? Anybody have an idea, for this specific choice of linear operator? Do you mean this x? It's just some x, a fixed x. Actually, yeah, I screwed up here: this is a different x than that one, so just think "x tilde" there; it's just some fixed value, say three. Think of it as three. So you differentiate the one argument, then you differentiate the other argument; it's as simple as that. It's just a little difficult to express in this standard linear operator notation, right? You first fix the second argument to some value; then applying the linear operator to the kernel, which is now just a function of the first argument, means: I differentiate once with respect to the first argument and insert x into that. Then I see the result as a function of the second argument, and I differentiate with respect to that. Simple as that; for differential operators this is actually quite easy to do. And if you have multiple x points here, then you just build a matrix of all pairwise derivatives between these points in order to get this object, which is actually a matrix. Now, if we enter the case that was just brought up, where we don't point-evaluate, where we have a proper linear operator between function spaces, we can also define these covariance kernels, well, cross-covariance functions, which are essentially the same thing we defined above, just that we see the result as a function of the point at which we want to evaluate it afterwards. And what you can show, essentially by applying what you just learned on the previous slide, is that if you have such an operator, and it fulfills certain conditions which we're going to talk about later, then the image of a GP under that operator is just a GP again, essentially by definition. This is how you get the derivative of a GP, for example: if the operator is just a derivative operator, for a scalar GP, you differentiate the mean once, and you differentiate the kernel function once in each of its arguments, to get a symmetric function again. And then that's a GP again, under certain conditions.
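Here's that "differentiate one argument, then the other" recipe done symbolically, as a minimal sketch; I use a squared-exponential kernel because its derivatives are compact (the lecture's prior is a Matérn 7/2, but the recipe is identical), and sympy is assumed:

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
ell = sp.Symbol("ell", positive=True)

# Squared-exponential kernel k(x1, x2) = exp(-(x1 - x2)^2 / (2 ell^2)).
k = sp.exp(-((x1 - x2) ** 2) / (2 * ell ** 2))

# Apply the operator (here: plain d/dx) to one kernel argument at a time:
dk = sp.diff(k, x2)    # operator acting on the second argument
ddk = sp.diff(dk, x1)  # ... and then on the first argument
print(sp.simplify(ddk))

# Pairwise evaluations at collocation points give the Gram matrix of the operator:
f = sp.lambdify((x1, x2, ell), ddk, "math")
X = [0.1, 0.5, 0.9]
G = [[f(a, b, 1.0) for b in X] for a in X]
```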
Now let's apply that to our PDE case. We return to the same GP prior and the same observations, and we already saw that applying the differential operator to the GP yields a GP again, with specifically this form of moments. You can actually see that GP here: this is, in this case, the scaled second derivative of that GP, because that's the differential operator we're working with. You can see that happening because the samples are much less smooth; we lose degrees of differentiability by differentiating these functions, which is also how you know that you're working with a Matérn kernel and not with something like a squared-exponential kernel, because samples from a squared-exponential kernel are actually smooth, so infinitely differentiable. Now that we've seen that, we can also apply the same theorem to this problem here, where the linear operator is now: first apply D, then evaluate at the set of points, and we can compute the posterior Gaussian process; well, we know it's a GP now. If we apply this to our problem, we first define the set of collocation points; the blue dashed lines here are these x points. Then the observations are given by this black function, because this is essentially the right-hand side of the PDE: once we apply the differential operator, the GP has to match the right-hand side function, and this is why the y-values of these observations in this transformed space are just given by point evaluations of the right-hand side function. You can see that in this space we essentially just have a normal Gaussian process regression problem, but we propagate the knowledge we gain from it back to the original GP, which is connected via the differential operator. Well, we can see that it reacted to it, but it doesn't seem to really have worked; there's still a lot of uncertainty left. So what's the problem here? Think about what I said about PDEs not being uniquely solvable. The boundary conditions are missing. It actually does work, it's not broken in a sense: you can see from the individual samples, which you can also see in this plot, that all samples approximately solve this equation; there are just degrees of freedom which the PDE doesn't fix. And this is exactly the linear degree of freedom I was talking about: every sample of this GP differs from every other, at least approximately, by an added linear function, scaled or skewed by exactly this term. Well, since this posterior is just a GP again, we can, for example, impose Dirichlet boundary conditions, so we prescribe the value of the temperature distribution at the boundaries of the interval, the left and right boundary points, which physically you could interpret as measurements from thermometers you might attach there. Then we just have a normal Gaussian process regression problem: the posterior of the previous problem is a GP, which you can take as the prior for a normal GP regression problem where you just observe two values at the boundaries. There's nothing special about this, and now it works. We still have a bit of uncertainty left here.
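Putting the last few slides together, here is a minimal end-to-end sketch of exactly this procedure (condition a GP on the PDE at collocation points, then on two Dirichlet values) for −κ u″ = q on [0, 1]. Caveats: I use a squared-exponential prior instead of the lecture's Matérn 7/2, because its derivative kernels are short hand-derived formulas, and κ, the right-hand side, and all other numbers are illustrative:

```python
import numpy as np

l, kappa = 0.15, 1.0  # kernel lengthscale and thermal conductivity (illustrative)

def k(a, b):  # squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 l^2))
    return np.exp(-0.5 * ((a - b) / l) ** 2)

def Dk(a, b):  # D applied to ONE argument of k, with D[u] = -kappa u''.
    # For the SE kernel: d^2 k/da^2 = d^2 k/db^2 = k * (r^2 - 1) / l^2, r = (a-b)/l.
    r = (a - b) / l
    return -kappa * k(a, b) * (r ** 2 - 1) / l ** 2

def DkD(a, b):  # D applied to BOTH arguments: kappa^2 * d^4 k / da^2 db^2.
    r = (a - b) / l
    return kappa ** 2 * k(a, b) * (r ** 4 - 6 * r ** 2 + 3) / l ** 4

def q(x):  # toy right-hand side: one "core" blob minus a constant sink term
    return np.exp(-0.5 * ((x - 0.5) / 0.1) ** 2) - 0.1 * np.sqrt(2 * np.pi)

Xc = np.linspace(0.0, 1.0, 25)  # collocation points: condition on D[u](x_i) = q(x_i)
Xb = np.array([0.0, 1.0])       # boundary points:    condition on u(0), u(1)
ub = np.array([0.0, 0.0])       # prescribed Dirichlet boundary values

# Gram matrix of the stacked linear observation L[u] = (D[u](Xc), u(Xb)):
G = np.block([
    [DkD(Xc[:, None], Xc[None, :]), Dk(Xc[:, None], Xb[None, :])],
    [Dk(Xb[:, None], Xc[None, :]),  k(Xb[:, None], Xb[None, :])],
]) + 1e-10 * np.eye(len(Xc) + len(Xb))  # small jitter for numerical stability
y = np.concatenate([q(Xc), ub])

# Posterior moments at prediction points (zero prior mean assumed):
x = np.linspace(0.0, 1.0, 200)
kL = np.concatenate([Dk(x[:, None], Xc[None, :]), k(x[:, None], Xb[None, :])], axis=1)
mean = kL @ np.linalg.solve(G, y)                                # posterior mean
cov = k(x[:, None], x[None, :]) - kL @ np.linalg.solve(G, kL.T)  # posterior covariance
```

The posterior mean here is the symmetric-collocation point estimate that comes up again later in the lecture; the diagonal of `cov` gives the pointwise approximation-error estimate.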
Observe that the right-hand side basically does not change, it's essentially the same as before, but now the uncertainty collapses, and the remaining uncertainty you see here is just the approximation error: we didn't require the PDE to hold at every point of the domain, just at these collocation points. This is reflected in the GP's confidence estimate; it says, well, I'm not certain, because I didn't get all the information, essentially. So let's recap a little. What did we do? We've seen that this generalized form of GP inference can produce an approximate solution of the boundary value problem that we formulated, and we get an estimate of the approximation error, which is typically quite difficult to handle. Here we're actually returning to the graphical model from the beginning, if you remember correctly: we have the description of a physical system, which we initially don't know, and we condition it on the mechanistic knowledge we have about the system; in this case, we measured the temperature of the CPU at the boundaries, and we know how heat flows through a piece of silicon. Now, what's unfortunately a little unrealistic about this is that the boundary values are usually not known in deployment; these thermometers don't exist. If you have a CPU in your system, there are thermometers on the CPU, but not at the boundaries. So we need to get rid of that in order to have a realistic model. And secondly, the values of these heat sources are also not exactly known. The way we modeled this was: well, if the cores are computing something, then let's approximate the heat source distribution with Gaussian blobs, but you don't know that it actually is a Gaussian blob; we haven't measured that. So to get around this, it would be quite cool to have the possibility of adding uncertainty both to the boundary values, because these are also not exactly known, and to the exact values of this heat source distribution. Let's return to the case where we just conditioned on the PDE. What is a more realistic, or at least more practical, boundary condition for this? I already stated that we know that heat is extracted approximately uniformly over the entire surface of the CPU, and these boundary points are the side faces of this idealized box that represents the CPU. So you can physically model these boundary conditions: the information about how much heat is extracted from the surface translates into a so-called Neumann boundary condition, which, instead of prescribing the value of the function at a boundary point, prescribes its first derivative there. If there's a lot of heat being extracted, then you can assume that the temperature distribution approaches the boundary with a relatively steep declining slope, and if there's a lot of heat entering, then with a positive slope, at least at the right boundary here.
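Formally, with signs coming from the exterior normals (this sign convention is my reconstruction):

```latex
\mathcal{B}[u](x) := \frac{\partial u}{\partial \vec n}(x)
 = \langle \nabla u(x),\, \vec n(x) \rangle = g(x), \qquad x \in \partial D

% in the 1D model, where the exterior normals at 0 and L are -1 and +1:
-u'(0) = g(0), \qquad u'(L) = g(L)
```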
So if we actually do that, we model the uncertain values of that derivative: we don't know exactly how much heat leaves through the boundary, but we can add plus or minus some uncertainty to it. So we model it with another Gaussian, in this case formally a Gaussian process, but it is really just a Gaussian distribution, because it's just two values instead of a function. And we add an additional information operator, which is given by exactly that derivative at the boundary. Don't worry about this being a directional derivative: normally, if you have a multi-dimensional domain, say a plane, it's not just the derivative, but the derivative in the direction of the normal vector to the boundary, the exterior normal vector, because how much flows out of the surface is given by the component in the direction of that normal vector. But think of it as just the first derivative for now. So this is an information operator describing that Neumann boundary condition. It's also, in this case, not a normal Gaussian process regression problem anymore, because we have another observation through a linear operator; it's just a different one than the differential operator of the PDE, it's this boundary operator B we saw before. And if we actually do that, we see that we got rid of one of the degrees of freedom: instead of there being a linear degree of freedom, where samples can have an additional slope, we fixed the slope. But the one remaining degree of freedom in the samples of this GP is the offset, the translational degree of freedom. And this kind of makes sense, because we just said where heat flows to and where it flows from, not at what absolute scale the system operates. This could be a system running at a thousand degrees Celsius, or at two hundred degrees below zero. The way we fix that absolute scale is with the thermometers that are actually contained in the CPU, just not at the boundaries but in the centers of the CPU cores. So we have thermal readings from the sensors, which are called digital thermal sensors, at three points along this cut through the CPU, and we actually add those. The uncertainty collapses, but it doesn't collapse fully, because there's measurement uncertainty on these thermometers, so we can only do so much; essentially, what we learn from these three measurements is the absolute scale. Again, we already said that we don't actually know the heat source term; there should be some uncertainty on it. It was a rough, eyeballed estimate of what might be going on. So how do we fix this? Maybe you can already guess it: just add another GP. All GPs, everything.
So instead of saying the right-hand side of this PDE is some fixed function, we say it's a GP: a probability measure, an uncertain estimate of a function that we should actually know. We model our prior belief about what that function is by another GP, and we can use the same inference technique. But there's a bit of a problem, because of the physics of the problem: this deterministic right-hand side function we used before was carefully chosen such that it integrates to zero. This makes sense, because if it didn't integrate to zero, then in total there would be more heat entering the system than leaving it, and in that case we would never reach thermal equilibrium; the system would just keep heating up, because there's always more energy entering the system. So we need this right-hand side function to integrate to zero. How do you do that with a GP? If you just add Gaussian noise to that single estimate, then some of the samples will consistently lie above the mean, which we use as the original estimate of the right-hand side function, and then they will have a bigger integral. So how do we solve this problem? How do we essentially guarantee that this integral condition holds for all of the samples of the GP? Exactly: integration is a linear operator. Integrals are linear, because the integral of a sum of two functions is the sum of the integrals of the two functions. So we can formulate another linear observation, a linear information operator, which formulates exactly this condition, which I call a stationarity condition, because it refers to thermal stationarity; see the formula block below. You actually have to add in the boundary effects too, because there's heat leaving via the boundary. So the heat that leaves via the boundary, plus the heat that leaves in the interior, which is modeled by that negative term in the right-hand side function, plus all the heat that is generated by the CPU cores, needs to sum to zero; that needs to equilibrate, essentially. By adding this additional constraint, which doesn't actually affect our function u in the first place, it's just that the prior over the right-hand side function and the boundary function, q̇_V and q̇_A, changes. And if you look closely, you can see that samples that start out below the mean here (the mean actually already did integrate to zero) in the end lie consistently above it, and this happens in exactly such a way that the integral-zero condition holds for all of the sample paths. Well, now we can apply all of these information operators to this joint GP prior, which is actually a multi-output GP prior now, to arrive at this system. And this is, I would argue, a much more realistic model, in terms of the assumptions it makes about the problem, than the one we started out with. You can see that there's still some uncertainty left, which makes sense, because we don't have a certain right-hand side function, we don't have certain boundary conditions, and we don't have certain measurements, so all of these uncertainties contribute to the posterior uncertainty about the solution.
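Here's the stationarity condition written out, with q̇_V the volumetric source/sink term and q̇_A the boundary heat flux; up to sign conventions, the balance is:

```latex
\mathcal{I}[\dot q_V, \dot q_A]
 := \int_{D} \dot q_V(x)\, \mathrm{d}x \;+\; \int_{\partial D} \dot q_A(x)\, \mathrm{d}x \;=\; 0

% a linear constraint, since integration is a linear operator:
\int (\alpha f + \beta g) = \alpha \int f + \beta \int g
```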
But you can actually see, and it's admittedly quite difficult to discern, that the red shaded area in here lies consistently within the blue shaded area. The blue shaded area is the uncertainty about the right-hand side function. So essentially all, or most, of our estimates for the curvature, that is, for the image of the GP under the differential operator, are consistent with the uncertainty about the right-hand side. So that makes a lot of sense. All right, let's recap again. We've seen that this GP approach, combined with the notion of an information operator, enables us to integrate prior knowledge about the system's behavior, which is formulated as a GP prior, the assumptions made by a Matérn kernel in this case. We can inject mechanistic knowledge in the form of the linear PDE that we've been formulating. We can have uncertain boundary conditions and right-hand sides, which actually appear quite often in practice, specifically also in the context of inverse problems. We can use noisy empirical measurements to get rid of some of the uncertainty that we get from uncertain boundary conditions, an uncertain right-hand side, and approximation uncertainty. All of this happens while we provide a quantification of the approximation error that we incur by not requiring the PDE to hold exactly at every point of the domain, and we get error propagation from the uncertain estimates of the system parameters, right-hand side and boundary conditions. The bigger story behind this is that all of this is only possible because, instead of just giving a point estimate of the solution, we relax that a little and answer with an infinite set of solution candidates, weighted by a probability measure. It's essentially one of the promises of Bayesian inference in a different form, but it's very powerful: instead of using a point estimate of that function, use a probability measure over functions, which can give you confidence estimates, and which can give you samples, which in this case all fulfill the conditions you subjected them to. And it's also probably more honest, because instead of saying that we know the right-hand side function we want to solve the PDE with, you acknowledge that there is uncertainty in it; there's practically always uncertainty in these estimates, we almost never know them exactly, and this is a way of actually modeling that. Now, as I already said, you can simulate exactly the same model in 2D. The approach is exactly the same, only you replace your one-dimensional grid of points by a two-dimensional grid of points and change the prior a little bit. You can see that it works, and actually, because there's almost the same temperature across the y-axis here, you could argue that the 1D model was quite a good model in the first place: there's not a lot of variation, and by positing the 1D model the way we did, we said it's essentially the same temperature along that dimension. So it was probably fine to model it like that; this is just a proof of concept here. Now, in contrast to the ODE lecture, we didn't really talk about time here. But what is time other than just another spatial dimension? You could argue we work on a space-time domain, essentially, and you can apply exactly that to a 1D version of
the heat equation. So now the actual heat equation, not the one where we set the temporal derivative to zero. This axis is now the time axis, which is, as I said, just another spatial dimension, and this is the one spatial variable we have. You can see that this, in a sense, resembles an ODE with an infinite-dimensional state space: you can think of one state variable per spatial domain point, and then you have an ODE part which describes the evolution over time. If you slice through this function over time and animate what's happening, you get a physically plausible simulation of heat diffusion, which also has all these uncertainties, because it's a GP, right? There's not a lot of uncertainty here, because I didn't add uncertainty to the right-hand side and the boundary conditions, and I used a relatively dense grid, but towards the end you can see that the grid sparsifies and there's more uncertainty, specifically at the edges. So you can also simulate temporal problems with this approach, essentially by saying: time is nothing special, time is just another input dimension. All right. I talked a little bit about classical numerical methods for these PDEs, which have been developed over the past century. So how does this approach fit in? It might seem a little weird to use a GP when all this other machinery has already been developed, but it turns out that the posterior mean of the method we've been developing, assuming that you don't add any uncertainty, so you have exact boundary conditions and an exact right-hand side of the PDE, is just the point estimate produced by a classical method called symmetric collocation, which is also why these points are called collocation points. So this is just one method; where's the plethora of other methods? More generally, we showed, not in this lecture, but you can actually show, that all so-called weighted residual methods can be realized as such posterior means of a Gaussian process, just without the uncertainty quantification, obviously. Collocation methods are an instance of these, but there are also finite volume methods, of which you might have heard, and there's the large class of so-called Galerkin or Petrov-Galerkin methods, which contains probably the most famous method for solving PDEs, the finite element method, and also spectral methods. The way you get to these methods is essentially by changing the way you discretize the equation. Remember, here we took this PDE residual, this D[u] minus f, and just evaluated it at a couple of points. Well, why just evaluate? You can essentially project onto any other function, because you usually have a Hilbert space, so you have an inner product on functions, and you can just put another function in there; see the sketch below. By realizing it this way, and by carefully choosing the functions that you project onto and the prior you work from, you get all these other different methods. With a different scheme, you can also show that finite difference discretizations of PDEs can be realized via GP inference; this is another paper from our group. Now you might wonder: why would you even use GPs in addition to this plethora of other methods? Well, we get this uncertainty quantification.
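Here's the projection sketch referred to above. Weighted residual methods require the PDE residual to be orthogonal to a set of test functions φ_i:

```latex
\big\langle \mathcal{D}[u] - f,\; \varphi_i \big\rangle
 = \int_{D} \big(\mathcal{D}[u] - f\big)(x)\,\varphi_i(x)\, \mathrm{d}x = 0,
 \qquad i = 1, \dots, n

% collocation is the (formal) special case \varphi_i = \delta(\,\cdot - x_i):
(\mathcal{D}[u])(x_i) - f(x_i) = 0
```

Different choices of the φ_i (and of the trial functions) then give the Galerkin, Petrov-Galerkin, finite volume, and spectral families.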
And what's kind of nice is that, because we know the posterior means of these methods coincide with the classical methods, we can essentially use them as drop-in replacements for the classical methods: you get the same solution, just with uncertainty quantification on top. That's quite useful, because you can reuse existing software stacks.

Alright, a quick summary. In general, you've seen that GPs can be used to solve linear PDEs. But maybe even more importantly, we've shown that these information operators are quite an elegant language for realizing regression, function estimation essentially, from very heterogeneous types of information: not just point evaluations, but all sorts of linear functionals. And that turned out to be quite helpful in these hybrid, physics-informed, or mechanistic models.

In the beginning I said it's incredibly hard to solve these equations. So where is all the hard math that, according to that statement, should be necessary to actually solve them? So far we really just needed derivatives and a little bit of linear algebra to express what we're doing. Well, as I said, there are some very important failure cases that we should be aware of, mainly two points. Maybe you can come up with one of the ways in which what we just did breaks if we formulate the model in the wrong way; there is one very obvious one and one not so obvious one. Does someone have an idea? Well, we use derivatives of GPs, right? So what can go wrong with a derivative? A function might not be differentiable, and if you differentiate a non-differentiable function, the result is simply undefined. So first of all, we need to make sure that the sample paths of our GPs are actually differentiable, because otherwise what we do here is meaningless. The second thing is that GPs, and specifically evaluations of GPs, are random variables, and you might know from probability theory and statistics that random variables are functions which need to be measurable in order to produce a meaningful, consistent statistical model. Now, if we differentiate a GP, then, because the GP was random, we think of the derivative, specifically the point-evaluated derivative, as a random variable as well. It should be a random variable: the quantity we differentiated is random, so that derivative had better be measurable, otherwise we are again in undefined territory and everything we do breaks down. Fortunately, there is a way of choosing your prior, specifically of tuning it to the differential operator you're using, such that all of this works out. But you need to verify some important conditions, and I'm going to try to go over the formal details of that theorem a little. This is actually from an upcoming publication of our group specifically on that topic, to make everything rigorous.
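To make the first failure mode concrete before the formal treatment, here is a quick numerical toy of my own, not from the lecture materials: a Matérn 1/2 prior (the Ornstein-Uhlenbeck kernel) has non-differentiable sample paths, so finite-difference quotients along a single sample blow up as the grid is refined instead of converging to a derivative.

```python
import numpy as np

ou = lambda x, y: np.exp(-np.abs(x - y))  # Matern 1/2 kernel: rough sample paths

def max_difference_quotient(n, seed=0):
    xs = np.linspace(0.0, 1.0, n)
    K = ou(xs[:, None], xs[None, :])                           # Gram matrix on the grid
    L = np.linalg.cholesky(K + 1e-10 * np.eye(n))
    path = L @ np.random.default_rng(seed).standard_normal(n)  # one sample path
    return np.abs(np.diff(path) / np.diff(xs)).max()

for n in (100, 400, 1600):
    print(n, max_difference_quotient(n))  # grows roughly like sqrt(n), no limit exists
```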
So first of all, let's talk a bit about the sample paths of the GP: what even is a sample path, formally speaking, and when are they differentiable? Actually, let me say one thing first, because I'm not going to come back to it: the property called "bounded" here, that is, the statement that the linear operator is continuous, is what gives you the measurability of the derivative, under the assumption that everything else in the theorem holds. I'm not going to talk about this much more; just recall that matrices are always continuous, so the linear maps defined by matrices, in fact all linear maps between finite-dimensional vector spaces, are always continuous. In infinite dimensions that doesn't hold anymore, and you need to verify it for every particular operator. We simply require it here, but we'll come back to it briefly on the last slide.

So, how is a GP even defined? A GP is just a collection of real-valued random variables, one such random variable for every point of the domain of the GP. If you say "I evaluate the GP at a point x1," then that is a random quantity, a real-valued random variable; in the set of random variables that defines the GP, there is specifically one corresponding to the output of the GP at the point x1. Don't be confused by this omega; it just comes from the definition of a random variable. A random variable is a function from the sample space of the probability space, this omega, to some real value, in this case once you fix x. You can think of omega as something like a random number generator: if you sample from a random variable, you first sample an omega from your probability space, which is what the random number generator does, and then you transform that omega through this function f, or f of x in this case, to produce a value. And if you chose the random number generator and the function f correctly, you get, for example, a Gaussian random variable.

For a GP, we a priori only know the distribution of finite collections of evaluations. That's essentially what you do when you sample from a GP: you evaluate the kernel function at all the points you want to sample at, and then you know from the definition that this gives a multivariate normal distribution, and you sample from that. But we use GPs to model functions, not just evaluations of functions. So where do the functions enter in this definition? Imagine we were to sample a function, a sample path, from a GP continuously. We would do it in the following way: we fix an omega, so we essentially tell the random number generator to generate one random event, and then we transform that omega through all of these functions. For every x there is one such function, and if we fix omega, we get a function of x by looking at the collective of all these f(x, omega) with omega fixed. This is what's called a sample path; this is what we mean when we say we want a sample path of the GP, and it's also why we think of GPs as models of unknown functions. Obviously this is something we cannot do in a computer, because there is usually an uncountably infinite number of these x's. But think of it as constructing an infinite-dimensional covariance matrix, taking the Cholesky factor of that, and then using it to compute your regular GP sample.
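In code, the finite-dimensional analogue of "fix an omega, then read off f(x, omega) for every x" looks like this; the kernel and grid are assumed for the sketch:

```python
import numpy as np

def sample_paths(kernel, xs, n_paths=3, seed=0):
    K = kernel(xs[:, None], xs[None, :])                 # Gram matrix K[i, j] = k(x_i, x_j)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(len(xs)))  # finite stand-in for the "infinite Cholesky"
    # omega: the raw output of the random number generator, one draw per path
    omega = np.random.default_rng(seed).standard_normal((len(xs), n_paths))
    return L @ omega                                     # each column is one sample path on the grid

rbf = lambda x, y: np.exp(-0.5 * (x - y) ** 2 / 0.1**2)
paths = sample_paths(rbf, np.linspace(0.0, 1.0, 200))
```

Fixing the seed and taking a single column is the discrete version of fixing one omega and reading off the whole path.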
And if we look at the collective of all such possible sample paths, taking one for every possible event generated by the random number generator, we get the set of sample paths of the GP. What we need to make sure of, for our purposes, is that all of these sample paths are sufficiently differentiable, because otherwise we are doing something undefined.

Second of all, if we state that we want to compute Lf, the application, or the image, of some GP f under a linear operator L, what we actually mean is this: first fix an omega; then you have a sample path, a function which just maps x to the real numbers; then map that function through the linear operator; and afterwards let omega vary again. So it's essentially the concatenation of a sample path with that operator. And in this case, because we chose the operator to map to R^n, this is just an R^n-valued random variable, a random vector, even though there is an infinite-dimensional object in between, namely the GP. As I already said, this is exactly the object that needs to be a random variable: it needs to be measurable, because we condition the GP on the fact that it takes a certain prescribed value. So it had better be measurable.

Again, without going into details: it turns out that the sample paths of a GP actually lie in a reproducing kernel Hilbert space, if you choose an appropriate kernel for that space. I'm not going to go over this in depth; this is just for the people who know what an RKHS is. However, usually, I believe actually always, the problem is that the reproducing kernel Hilbert space of the actual kernel of the GP, the kernel function you choose, is not the space the samples come from. The samples are, informally speaking, usually rougher in terms of differentiability than elements of the original RKHS, and you need to choose a larger space in which the samples are contained. In fact, you can show that a sample from a GP is almost surely not an element of the kernel's own RKHS: with probability zero you draw a sample from specifically that space.

Let's look at a concrete example, which is also very useful in practice, which is why I'm ending on it: the Matérn kernel. It's a scary expression, but you can simplify it quite a bit for specific parameters p, namely integer values, and it turns out that this parameter p controls the differentiability of the GP: the higher p is, the more derivatives you can take. First of all, the RKHS generated by that kernel, the function space it generates, contains functions that are p times continuously differentiable; so if you take a function from that space, you can differentiate it p times. But the problem is that if you have a GP with this kernel, then the samples are not p times differentiable; the samples have fewer continuous derivatives. Informally speaking, you can show that they have roughly d/2 fewer partial derivatives, where d is the input dimension of the kernel. For example, in our case we chose a Matérn 7/2 kernel.
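For reference, the half-integer simplification is easy to write down in code. This is a sketch following the standard closed form for nu = p + 1/2 with integer p (Rasmussen and Williams, Eq. 4.16); p = 3 gives the Matérn 7/2 kernel just mentioned.

```python
import numpy as np
from math import factorial

def matern_half_integer(r, p, lengthscale=1.0):
    # Matern kernel with nu = p + 1/2; samples of a GP with this kernel have
    # (informally) about p - d/2 derivatives, while RKHS elements have p.
    s = np.sqrt(2.0 * (p + 0.5)) * np.abs(r) / lengthscale
    poly = sum(factorial(p + i) // (factorial(i) * factorial(p - i)) * (2.0 * s) ** (p - i)
               for i in range(p + 1))
    return np.exp(-s) * factorial(p) / factorial(2 * p) * poly

# Sanity check: p = 1 recovers the familiar Matern 3/2 form (1 + s) * exp(-s).
r = np.linspace(-3.0, 3.0, 7)
s = np.sqrt(3.0) * np.abs(r)
assert np.allclose(matern_half_integer(r, p=1), (1.0 + s) * np.exp(-s))
```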
In the RKHS of the covariance function, I'll call it the kernel's RKHS, that would mean the functions are three times differentiable. Since we modeled a d = 1 problem, the functions we draw from that GP are one time less differentiable, so two times differentiable, which is exactly what we need. As a rule of thumb: if you need partial derivatives up to order at most m (remember, we have a second-order equation so far, so m is two), then you can use this formula to compute the parameter p of the Matérn kernel that you actually need. And a nice side effect is that if you choose p via this formula, you also know that the differential operator applied to the paths of the GP is bounded, so you really do get a measurable random variable; that comes as a bonus.

Ah, sorry, one more thing: you can generalize this, and I actually should have written it that way here, because I talk about the dimension d. These are really Euclidean norms, so instead of computing an absolute value, you compute the Euclidean norm of the difference. So you can deploy Matérn kernels in an arbitrary number of input dimensions. But specifically for PDEs, and you'll actually see this on the exercise sheet this week, it's better to construct a d-dimensional kernel by taking products of one-dimensional Matérn kernels over the dimensions. This is particularly useful if you have mixed orders of derivatives. Remember, in the heat equation there was only a first partial derivative with respect to time, but second partial derivatives with respect to the spatial variables. So it makes sense to use a Matérn kernel that gives you two partial derivatives in the spatial part, and one that is rougher, exactly one degree of differentiability rougher, for the time dimension. That helps you adapt the prior to the concrete equation you're trying to solve.

With that, I come to an end. We've seen that PDEs are an important and actually quite ubiquitous language for modeling problems in the real world, physical problems specifically, but also problems from other domains. We've seen that we can use our recurring framework of GP inference to actually solve these PDEs. More generally, GPs provide a framework in which we can fuse very heterogeneous information sources into a single regression model by using these affine information operators. And in the end, we've seen that you need to take quite some mathematical care so as not to make mistakes in the model construction, though this mostly applies to the construction of the prior, and I've given you a shorthand for how to get to a specific prior for a specific equation in practice. Thanks for listening, and I'm happy to answer any questions.
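As a companion sketch to the product-kernel remark above, here is one way the construction could look for the 1D heat equation; the concrete differentiability orders are my choice following the rule of thumb, not prescribed by the exercise sheet. One time derivative suggests a Matérn 5/2 in t (samples roughly once differentiable), two spatial derivatives a Matérn 7/2 in x (samples roughly twice differentiable).

```python
import numpy as np

def matern_52(r, ell=1.0):  # closed form for nu = 5/2
    s = np.sqrt(5.0) * np.abs(r) / ell
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def matern_72(r, ell=1.0):  # closed form for nu = 7/2
    s = np.sqrt(7.0) * np.abs(r) / ell
    return (1.0 + s + 2.0 * s**2 / 5.0 + s**3 / 15.0) * np.exp(-s)

def spacetime_kernel(t1, x1, t2, x2):
    # Product kernel, rougher in time and smoother in space,
    # adapted to the mixed derivative orders of the heat equation.
    return matern_52(t1 - t2) * matern_72(x1 - x2)
```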