Welcome everyone to lecture number seven in Numerics of Machine Learning. It's lecture number seven, but it's also lecture number three in the block on simulation. The block started two weeks ago with Jonathan, who told us about state space models and introduced a language to describe dynamical systems in a probabilistic way: we have an estimate for a state and how it evolves over time, and we also have a language to describe an observation process that depends on that state. So the point of view is one where there's uncertainty in the evolution and in the observation, and we try to reconstruct the state. The specific tools we talked about are Kalman filters and smoothers, or extended Kalman filters and smoothers for non-linear problems. Then last week we also talked about dynamical systems, but with a very different language: ordinary differential equations. There, the point of view is no longer that we have a state that evolves probabilistically and an observation process; instead, a differential equation is a very precise way of formulating that we know exactly how the state is going to evolve. But even given the differential equation, we still want to answer the question of what the state actually is, and doing that with a computer is suddenly difficult again, and still amounts to estimation. We were only able to get algorithms that solve ODEs approximately. So today we combine those two ideas. We will look at ordinary differential equations and ways to solve them, but through the lens of Bayesian state estimation. That will be the biggest part of the lecture: looking at ODEs, interpreting them as state estimation problems, and then solving them as state estimation problems.
And by doing that, we get probabilistic numerical solvers, because the algorithms we use give us distributions over the state, or over the ODE solution. So the output is strictly richer than that of a classic solver; we will see that in the lecture. Before I start with the actual content, I have one slide of recap for each of these topics. The first one is numerical ordinary differential equation solvers, the stuff we did last week. We looked at ordinary differential equations, which is this equation in the middle, written big: it describes the derivative of x through some vector field f, which can be easy in the examples we looked at, but can be hard in practice. Vaguely speaking, when you are given an ordinary differential equation, the task is to find x. You then use a numerical algorithm to simulate, starting from x0. But there's a mismatch between this task of finding x and the thing we actually do. One example that popped up last week, and that we'll look at a lot again today, is the logistic ODE. That's one of those cases where the differential equation is quite easy because we can actually solve it by hand, and that makes it nice for evaluating the methods we have. Visually speaking, the problem statement is basically: we have these arrows in the back, which are the vector field f; we have the blue thing on the left, which is the initial condition; and we want to find the black line. Last week we looked at numerical methods that do that. We looked at these three: the first two are methods, and the third is a whole class of methods. But each of them takes the approach that we have some estimate at some time t and want to compute an estimate at time t plus h, in a specific way. And the simplest one we could come up with is forward Euler.
Backward Euler is similarly simple, but there's the tricky implicit part. And the explicit Runge-Kutta methods got, if you recall these Butcher tableaus, way larger and more complicated. But conceptually it's this simple formula: you just need to define what the bi and ci should be, and then there are these a's hidden in the x tilde here. So there's a mismatch between what we do and what we want. I said we want to find x, and what we actually compute we now call x hat, because it's wrong. Fundamentally, these methods try to find a quantity; they cannot find it exactly, so they approximate it. Or, switching up the vocabulary a little, they estimate the true solution, and they provide a single x hat as the estimate. And they don't estimate it randomly; they estimate it from some information, for example by evaluating the vector field. You evaluate the vector field, and thereby you update, or extend, your estimate to a future time point. That's the point of view we will take today. And "estimate" really means there's an error, and the error depends on the step size; I think we saw that last week too. If we make h small enough, the solution gets better. If we could make h infinitely small, we would get the actual true solution, but we cannot. And depending on h, we have either a very noticeable error, as here on the top; here in the middle, we still see it; here on the bottom, we cannot see it anymore in this plot. But there's always some error somewhere. So, fundamentally, numerical ODE solvers try to estimate a quantity. They will be wrong; they will never be completely correct about the thing they return; there will be an error. And they estimate x by using some information, in this case evaluations of the vector field.
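To make the "estimate with an error that depends on the step size" point concrete, here is a minimal sketch of forward Euler on the logistic ODE from the lecture, in Python rather than the lecture's Julia; the step size and initial value are illustrative choices, and the closed-form logistic solution lets us measure the error directly.

```python
import math

def f(x):
    return x * (1.0 - x)  # logistic vector field: x' = x(1 - x)

def forward_euler(x0, h, n_steps):
    # x_{t+h} = x_t + h * f(x_t), iterated over the grid
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + h * f(xs[-1]))
    return xs

x0, h, T = 0.1, 0.1, 10.0
xs = forward_euler(x0, h, int(T / h))

def x_true(t):
    # closed-form solution of the logistic ODE with x(0) = x0
    return x0 * math.exp(t) / (1.0 - x0 + x0 * math.exp(t))

err = abs(xs[-1] - x_true(T))  # shrinks as h is made smaller
```

The single number `xs[-1]` is exactly the "x hat" of the slides: a point estimate with an unreported error.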
And this, once more, sets up the whole motivation: when we see that numerical ODE solvers do estimation anyway, we might as well phrase the problem as an estimation problem and solve it as one. And the tool we use is exactly what we did two weeks ago: Bayesian state estimation. To briefly recap and refresh the vocabulary and notation: we looked at non-linear Gaussian state estimation problems where x, the lowercase x here, is the state, and y are observations. We first write down a generative model of how we think x and y behave, by specifying an initial distribution and a prior, or dynamics model, that describes how x evolves over time. Then we specify a likelihood, or measurement model, sometimes called observation model, that explains how y depends on x. Then we are given some data, and the task is to do inference, which basically means: describe what we think x is, given that we observed the data. And there were kind of four, well, two tasks that we looked at two weeks ago, with two more things hidden in the algorithm. We have the filtering distribution, which we looked at a lot, which is estimating the current state given all the information we have access to right now; and the smoothing posterior, which is estimating x given all the data. Then, to compute these, at some point we compute prediction estimates, so estimating a future state given current information. And there are these likelihood terms hidden in there, which give us an estimate of the next observation we might see. These quantities on the right we can compute with quite simple algorithms: the extended Kalman filter and extended Kalman smoother, whose building blocks are a predict step and an update step. We don't need to go through the formulas.
But it's quite nice to see that the predict step is really a two-line thing, and the update step a five-line thing. We could make it even shorter, of course, and just make things less readable. It's conceptually quite a simple algorithm, and simple to implement; and the smoothing is similarly complex. So, I said it before, and I will repeat this point often today: we combine these two approaches. We combine the methods from the last slide to solve the problem from the first slide, to get probabilistic numerical ODE solutions. One motivational quantity is this one, because it describes the kind of thing we want to have in the end, and it also shows how to phrase the numerical ODE solution as an inference problem. Instead of saying that we want to find an x that fulfills the ODE everywhere, which we cannot compute anyway, we might as well specify the thing we are more likely to be able to compute. That is: we want a posterior over x(t) that satisfies the initial condition, because we look at initial value problems, so x(0) should match the given initial value x0; and that satisfies the ODE, but relaxed, so only on a discrete set of points. The motivation for that, maybe go back again: we only evaluate the ODE on a discrete set of points anyway. In all the methods we looked at, we always go from t to t plus h in some sense; we never look at the continuous thing, or condition on the continuous thing. Which is why we make it part of our problem statement: given the discretization, I want an estimate of the solution. And because I told you that we're going to use the stuff from two weeks ago, that motivates doing Gaussian filtering and smoothing.
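The "two-line predict, five-line update" claim can be sketched directly in plain NumPy; this is a hedged companion to the slides, with the measurement function `h`, its Jacobian `Jh`, and the argument names being my own choices. As the lecture notes later, a linear solve is numerically preferable to an explicit inverse; `inv` is used here only for readability.

```python
import numpy as np

def predict(mu, Sigma, A, Q):
    # propagate the Gaussian through the linear dynamics: two lines
    return A @ mu, A @ Sigma @ A.T + Q

def ekf_update(mu_pred, Sigma_pred, h, Jh, R, y):
    # linearize the measurement model around the predicted mean: five lines
    H = Jh(mu_pred)
    S = H @ Sigma_pred @ H.T + R              # innovation covariance
    K = Sigma_pred @ H.T @ np.linalg.inv(S)   # Kalman gain (solve in practice)
    mu = mu_pred + K @ (y - h(mu_pred))       # corrected mean
    Sigma = Sigma_pred - K @ S @ K.T          # corrected covariance
    return mu, Sigma

# tiny 1D check: predict from mu = 0, then observe y = 1 with small noise
mu, Sigma = predict(np.array([0.0]), np.array([[1.0]]),
                    np.array([[1.0]]), np.array([[0.1]]))
H = np.array([[1.0]])
mu, Sigma = ekf_update(mu, Sigma, lambda x: H @ x, lambda x: H,
                       np.array([[0.01]]), np.array([1.0]))
# posterior mean is pulled close to the observation, variance shrinks
```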
And the reason for that is, of course, not only that we did it in the lecture, but that it's a fast algorithm: it does inference in O(n) time, and, well, Runge-Kutta methods also scale linearly in the number of steps, so we want to be comparable. And to end up with something where we can do Gaussian filtering and smoothing, or extended Kalman filtering and smoothing, we need to construct a state space model. Because, as you've seen in the recap, once we have the state space model, we know what to do. So the main job will be to build up a state space model that, once we solve it, gives us a quantity of this kind. For that, we will need to define a prior model, and a likelihood and data model, or some way of informing the state. Once we have that, we can do inference. So let's start with the prior. On the Bayesian filtering and smoothing slide, I used the lowercase x to describe the state, and I also use the lowercase x in the ODE context, but let's still ask: what is the state, actually, that we should consider? Just using the lowercase x, the ODE solution, doesn't satisfy what we wanted the state to be two lectures ago. I think on the slides we had something like "a quantity that fully describes the dynamical system", and just having x(t) is not enough: take the logistic equation, it's a single point, it could evolve in any direction. One thing that would fully describe the system would be to have x(t) and all its derivatives. That might be infinitely many. But we've seen Taylor series expansions last lecture and two lectures ago, so I hope these are familiar. The point is: if we had access to x(t), its derivative, its second derivative, and so on, potentially an infinite stack, that would be annoying to work with, but it would fully describe the system. And, once more, we truncate the Taylor series expansion.
So we want something that is very informative, even if it's not perfect, which we get when we truncate the whole thing: we only track the first q derivatives, where x superscript zero means the zeroth derivative in this notation, and so on. Which introduces an error term. Now, I will get a bit vague for the rest of this slide, and then very precise on the next slide again. In some sense, the prior we had two weeks ago was always a probabilistic thing, right? There was uncertainty when you go from one step to the next. And that is really motivated by this term: if we do this approximation, there's an error in there, which we don't know if we only do the approximation. So getting a probabilistic model basically means this error is now a random variable, and suddenly we have a transition model, which I will write down on the next slide. So the goal here is really to find out what quantity we can use as a state for an ordinary differential equation, and my claim is that tracking q derivatives is good, and then we have an error term that we can describe. So that is the state: instead of just looking at x(t), we look at x(t), its derivative, its second derivative, and so on, up to the qth derivative. q might just be one, then we have a lot of noise. In the code I will show later, all we track are the first two derivatives; so it's not like we want a hundred derivatives or something. But by tracking a limited number of derivatives, we get to a probabilistic state model. All right, that's the motivation, but the notation doesn't yet match what we did two weeks ago, so let me give you something that does match that notation. That's on this slide. The thing I just described has a name: it's a stochastic process, the so-called q-times integrated Wiener process.
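The error term from truncating the Taylor series can be checked numerically; here is a small sketch on x(t) = exp(t), where all derivatives are known, so the remainder after keeping q derivatives is visible directly (the function name and the test point are my own choices).

```python
import math

def taylor_step(x_derivs, h):
    # x_derivs = [x(t), x'(t), ..., x^(q)(t)]; truncated Taylor prediction of x(t+h)
    return sum(d * h**k / math.factorial(k) for k, d in enumerate(x_derivs))

t, h, q = 0.0, 0.1, 2
derivs = [math.exp(t)] * (q + 1)   # every derivative of exp is exp
err = abs(taylor_step(derivs, h) - math.exp(t + h))
# the leftover is roughly h^(q+1)/(q+1)! -- the term the prior turns into noise
```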
And there's a bit more history and context to this thing than we have time for today, because it's actually a continuous-time object; but we will work with it in discrete time so that we can apply all the tools from two weeks ago. So we have this uppercase X that I will use from now on whenever I talk about the stochastic process, which has a structure that goes from X0 to Xq. And to be a bit more precise, I don't write the lowercase x anymore, because the lowercase x we don't know, right? That's the true thing we would like to find; the uppercase X here is the random process that we are currently defining. But the idea is, of course, that this random process is supposed to model the true solution. This q-times integrated Wiener process prior, the one specific process that I motivated and that we will use today, has transitions of this form. I think the most interesting line is maybe this one, which says that if we have X at time t, we get X at time t plus h with a Gaussian model. Which is actually linear, because we multiply some matrix A(h) with X, and we have some covariance matrix Q(h). And the q-times integrated Wiener process, apart from the Taylor series motivation, is also quite convenient because this A(h) and Q(h) are known in closed form. The formulas look a bit annoying, but you can implement them and build up these matrices, so it's actually quite convenient. And so this object is now something that we can work with, and we will. As for A(h) and Q(h): I told you the formulas look a bit annoying, but when you look at the thing in two dimensions, suddenly it becomes quite structured. A(h) is this upper triangular matrix: ones on the diagonal, then h and h squared over two above it, which might remind you of Taylor series coefficients. And that's indeed where it comes from.
And Q(h) has a similar structure, not the same, but again, written like this it's quite easy to write the code for it, and we will actually see that code. All right, so this is our prior model, and we can plot it. This is a twice-integrated Wiener process prior, because we track two derivatives. Maybe it's easier with the plot: on the right, you see A(h) is a three-by-three matrix, because we track the zeroth derivative, the first, and the second, so it's a three-dimensional thing. If I plot this process X(t), I get these three subplots. There are five lines in each; these are five draws from the process I defined. And maybe that visualizes why it's called a twice-integrated Wiener process: the second derivative is a Wiener process, which you might have heard of before, maybe under the name Brownian motion; it's the same thing. So this is this weird random walk. If you integrate it, you get the middle plot, and if you integrate that one, you get the top plot. Or going the other way around: if we take the green line from the top plot and compute its derivative, we get this green line here; and if we compute its derivative, we get the, well, not so nice line down here. And that is why it's called a twice-integrated Wiener process. If we draw more samples, we maybe get a better feeling for its statistical properties, because at each point in time it's actually a Gaussian. And again, we will not talk about it as a continuous object much, but just so you have the reference: these are Gaussian processes, actually; they satisfy the properties of a Gaussian process. And they are Markovian, because we have these transitions. So it's a so-called Gauss-Markov process. All right.
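The closed-form transition matrices just described can be sketched in a few lines of Python; this follows the standard closed form for the q-times integrated Wiener process (the function name and the diffusion scale `sigma_sq` are my own choices), and for q = 2 it reproduces the structured A(h) from the slide.

```python
import math
import numpy as np

def iwp_transition(q, h, sigma_sq=1.0):
    # A(h): Taylor coefficients h^k / k! on the upper diagonals
    # Q(h): standard closed form for the process-noise covariance
    A = np.zeros((q + 1, q + 1))
    Q = np.zeros((q + 1, q + 1))
    for i in range(q + 1):
        for j in range(q + 1):
            if j >= i:
                A[i, j] = h ** (j - i) / math.factorial(j - i)
            Q[i, j] = sigma_sq * h ** (2 * q + 1 - i - j) / (
                (2 * q + 1 - i - j)
                * math.factorial(q - i)
                * math.factorial(q - j)
            )
    return A, Q

A, Q = iwp_transition(2, 0.1)
# A has ones on the diagonal, then h and h^2/2 above it, as on the slide
```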
And then maybe one other thing to mention at this point, because we will need it later. If we have the whole stack, capital X, then accessing, for example, the first derivative is very easy: it's basically just looking up the correct entry, which, instead of implementing as a slicing operation in NumPy or something, is a multiplication with a projection matrix. So accessing an entry is a linear operation, which makes it very convenient to take the whole stack and access the first derivative, the second derivative, and so on. We will need that in a few slides. All right. To summarize all the stuff I defined here: we defined a prior that we will work with, the q-times integrated Wiener process prior. It's nice to work with because we can write down these transition densities, which we need for Gaussian filtering and smoothing later, and it's convenient because A and Q are given in closed form. The other part that we need is formally two bullet points, but conceptually it's really one thing, and we will see why: it's the likelihood and data combination that will inform the prior so that it actually does the thing we wanted it to do at the top. And to get there, we look at the ODE again, because, well, unsurprisingly, we didn't use the ODE yet, so the other half of the state space model probably has to include it. So we will define a likelihood model and data motivated by the ODE itself. The thing we were given in the very beginning was this ordinary differential equation with the lowercase x, and the goal that we want to find x. That's the perfect thing we want, but it's intractable. Now we have defined this X(t) process on the last slides, and the motivation for X(t) was that it should model the thing we want to have in the end. And I also said that X1 is the derivative of X0, and that each component corresponds to one of the derivatives we want to model.
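The projection-matrix access mentioned at the start of this passage is a one-liner; a hedged sketch (names `E1`, `X` are mine) showing that multiplying by a selection matrix is the same as slicing:

```python
import numpy as np

q = 2
X = np.array([1.0, 2.0, 3.0])   # stacked state (x, x', x'')
E1 = np.eye(q + 1)[[1], :]      # 1x3 projection matrix selecting the first derivative
first_derivative = E1 @ X       # same value as X[1], but as a linear operation
```

Keeping the access linear is what lets the observation models later stay (partially) linear.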
And so, using the language of the thing we just defined, the statement equivalent to solving the ODE is that we want our process to satisfy this equation; if it does, then it also solves the ODE. And then we can reformulate that to put all the X's on one side and define a measurement function, sometimes also called an information operator, that takes X and is defined in the following way, and it should be 0. This is the ideal thing we would like to have in a perfect world with infinite compute. Now, we already talked about discretization; we already relaxed the goal a few slides ago so that instead of trying to solve the ODE everywhere, we try to do things on a discrete grid. And so the requirement basically becomes satisfying the ODE on a discrete set of points. Or equivalently, with the thing we just defined, that we want m of X, evaluated only at discrete points ti, to be 0. And conceptually, that's still the thing we want: that's the information that we have and that we want X to satisfy. And the tool we use to make X satisfy it is to turn this into an observation model, or measurement model, or maybe even information model, and then do inference, in two slides. So this quantity motivates a likelihood model in the following way. In the state space model description I had y's; now I use z's, and I will say why in a second. We have this observation z that, conditionally on X, is given by a Gaussian, where basically we use the measurement function I defined above. And we have data that says this thing is actually 0. And 0 is precisely the reason why it's a z. So this really is the desired thing: we define this observation model, and then in the inference, you will see why this actually helps us satisfy the desired thing. I introduced one thing here, I don't know if you noticed: there's an R.
There's actually no reason for an R to be there, because there's no noise in the measurement process: we defined this measurement function above exactly. In our problem setting there's no reason to assume that f gives us something random, and there's also no reason to assume that we don't actually want to satisfy the ODE; we do want to satisfy the ODE exactly. So there's no noise on either side, and it's actually a noiseless likelihood model, which more formally would be a Dirac likelihood. This one is actually quite convenient, because if you have the Kalman filter update in mind, there's an R somewhere in there, and just plugging in 0 does the correct thing; that is actually what happens when you condition on a Dirac likelihood. In your code, you really plug in 0 as the observation noise. All right. So again, we defined a lot of stuff in this likelihood model. The good thing is we can have a look at it and get a better feeling for what I just defined. Before, I defined a prior for X and how it evolves; now I defined a generative model for z. And so we can look at it the same way as we looked at the prior: I can draw samples. This time, for X, I just used the once-integrated Wiener process, which means that X1 is a Wiener process and X0 is its integrated version; I mostly did that to save some space. And z(t) is defined in a specific way. So the concrete example is again the logistic ODE, and for this given ODE, when you plug the ODE into the left-hand side everywhere, z(t) is defined as z(t) = x1(t) - x0(t) (1 - x0(t)). And that's literally how I computed these lines: I take a line from here, that's the x1, and I subtract a line from the top times one minus the line from the top, to get these sample draws. So it's a generative model, and we can simulate it.
Now, maybe to understand why we did all of that, let's look a bit deeper at the solution and how it relates to the whole inference story. I can plot the solution in there too. We don't really see much because I'm zoomed out a lot, but if I zoom in on the solution, we see the standard logistic ODE solution that we know from all the slides. This is the correct derivative; we can compute it in closed form for this ODE, which is why we look at it. And then z(t): if I take this black line minus the black line on the top times one minus the black line on the top, I get exactly 0, by construction. Because z encodes the fact that x satisfies the ODE, so it has to be 0. But maybe seeing this again explains why we have a noiseless measurement model: this line doesn't wiggle around 0 or anything; it is exactly 0 if you give it the true solution. So this is our generative model. And now, looking ahead at what the inference process will do: the data lives in this plot, not in the other ones. It's in the space where the z's live, and it's kind of a discretized version of z. And then the thing we will do next, in the code and in the homework, is basically go from the generative model to a posterior, which basically says we want this thing to collapse around the z. It's not perfect, it's approximate inference. And by doing that, we also get estimates for the X, and we see that they relate to the solution. So I hope this clarifies why we go through the trouble of setting up the generative model this way: it allows us to go from here to here, and the thing we get out is kind of an ODE solution. That's what we will do in a bit more detail on the next few slides. But basically, there were two parts. We defined a prior in a specific way, the q-times integrated Wiener process prior; it's a class of priors in some sense, because there's a q that we can choose.
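The claim that z is exactly zero on the true solution is easy to verify numerically; a small sketch for the logistic ODE, where the derivative of the closed-form solution is computed independently via the quotient rule (the initial value 0.1 is an illustrative choice):

```python
import math

x0_init = 0.1

def x_true(t):
    # closed-form logistic solution with x(0) = x0_init
    num = x0_init * math.exp(t)
    return num / (1.0 - x0_init + num)

def dx_true(t):
    # derivative of x_true via the quotient rule (independent of the ODE)
    c = 1.0 - x0_init
    num = x0_init * math.exp(t)
    return num * c / (c + num) ** 2

def residual(t):
    # z(t) = x1(t) - x0(t) (1 - x0(t)); zero on the true solution
    return dx_true(t) - x_true(t) * (1.0 - x_true(t))

# residual(t) is zero up to floating-point rounding for every t
```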
But I didn't give you the most general thing we could work with; still, this is a very general way to do these things. Then we have the likelihood model that encodes the differential equation. This is the object that relates the prior to the thing we want to have in the end: this likelihood model and data combination. And, I mean, the data are just zeros. But to really get to the language we introduced two weeks ago, we have to define all these quantities so that we can do inference. So that is actually the main part: we wrote down a state space model, and we can now just solve it. There will be many of these overview slides in this lecture, so let's look at it again. That's the whole thing. I think I didn't talk about the initial distribution, but you can just use a zero-mean, identity-covariance Gaussian. We have a dynamics model that I defined; we have a likelihood model, or information model, that encodes the ODE, so it depends on f; and we have the data. There's one thing I didn't talk about, and that we still need to add in order to actually solve the problem we're given. Does anyone see what's missing? The only information we include right now is the ordinary differential equation, but there are many solutions to an ordinary differential equation. So there's one additional piece of information. Yes? "We're not using the initial value." Exactly, we're not using the initial value. Thanks. That's precisely it: the initial value is missing. So the question is, how do we include the initial value? I skipped it earlier, basically because I think this is the more tricky and more interesting part. How can we include the initial value? Yes? "Maybe by setting mu to the initial value." That does actually work, and it is a sensible thing to do. There's a more general way of talking about it, and it results in the same thing, because it's a noiseless observation model.
But this really drives the point home, and motivates once more, that observation models, or maybe even information models, are the objects that allow us to encode the things we want. And we will do that a lot. So the way we include the initial value is by just defining another measurement that depends only on X0: on X, the full stack of all the derivatives we track, it evaluates only the zeroth derivative, and the data point we have is the initial value itself. And because it's noiseless, it amounts to setting it to the correct value, only in a formally more precise way. So yeah, it was super easy: given the model, just add one more thing we want. On the next slide, I will write down some pseudocode for the extended Kalman ODE filter. There are building blocks that we need, so for completeness I define the Kalman filter prediction and the EKF update. These are the same equations as on the recap slide earlier, only as pseudocode now instead of equations. We will also use a plain Kalman filter update, which is the same, just without computing the Jacobian, since the observation model is already linear. So these are the building blocks we need to implement the extended Kalman ODE filter. And then the algorithm itself is given by this; let's go through it line by line. It's a procedure, or a function, that takes in basically the state space model we wrote down before, which has different blocks: the initial distribution, mu 0 minus and Sigma 0 minus; the transition model, A and Q; the vector field and the initial value, which implicitly define an observation model; and then the time discretization. And then the algorithm first handles the initial value, because we start at time zero and need to satisfy the initial value first, which is a Kalman filter update. Maybe that also explains why we have mu 0 minus and Sigma 0 minus: there is an update that we do first.
This E0 gives the linear observation, because I defined E0 to take the state X and map it to its zeroth entry. This 0 here is the observation noise, which is zero. And then we have the data point x0. So again, that matches the notation here on the right; if you go through the slides again at home, these things should always match. And then, inside the filtering loop, this is just a standard filtering loop, because there's a prediction and an update happening. The prediction uses the transition model, and A and Q depend on the step size the way I defined them, so in the line before we need to compute the step size. And then the EKF update has no observation noise, and the data point is also 0. And then I had to define this measurement function to hide the t, because otherwise it doesn't formally match what I did on the slides before. And then we just return the whole thing. I hope that from this description, getting the extended Kalman ODE smoother is kind of straightforward: we take all of that and we smooth, and instead of returning the filtering distribution, we return the smoothing distribution. All right. I will show a lot of code today, to really prove that these ten lines, with the building blocks and setting up the model correctly, are enough to solve an ODE. Because sometimes the formulas seem more complicated than it actually is. So let me show some code. This is Julia again; we've seen a little bit of it last week, and a bit more two weeks ago. I know that most of you are probably more familiar with Python, but I think the notation should be simple enough today that it's easy to follow, and that was actually my motivation. Because if you look at this Kalman filter predict, for example, on the top. Is it big enough for everyone, by the way? No? Command plus? Yeah, then I have to scroll a bit more, but it should be better.
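For readers following along in Python rather than Julia, the whole extended Kalman ODE filter can be sketched end to end; this is a hedged reconstruction from the pseudocode, not the lecture's notebook: all names are mine, an analytic derivative stands in for the Jacobian package, and the zero observation noise is plugged in directly as discussed above.

```python
import math
import numpy as np

def iwp_transition(q, h):
    # closed-form A(h), Q(h) of the q-times integrated Wiener process
    A = np.zeros((q + 1, q + 1))
    Q = np.zeros((q + 1, q + 1))
    for i in range(q + 1):
        for j in range(q + 1):
            if j >= i:
                A[i, j] = h ** (j - i) / math.factorial(j - i)
            Q[i, j] = h ** (2 * q + 1 - i - j) / (
                (2 * q + 1 - i - j) * math.factorial(q - i) * math.factorial(q - j))
    return A, Q

def noiseless_update(mu, Sigma, H, y):
    # Kalman update with zero observation noise (Dirac likelihood)
    S = H @ Sigma @ H.T
    K = Sigma @ H.T @ np.linalg.inv(S)
    return mu + K @ (y - H @ mu), Sigma - K @ S @ K.T

def ekf_ode_filter(f, df, x0, ts, q=2):
    E0 = np.eye(q + 1)[[0], :]            # project to x
    E1 = np.eye(q + 1)[[1], :]            # project to x'
    mu, Sigma = np.zeros(q + 1), np.eye(q + 1)
    # condition on the initial value first
    mu, Sigma = noiseless_update(mu, Sigma, E0, np.array([x0]))
    means = [mu]
    for k in range(1, len(ts)):
        A, Q = iwp_transition(q, ts[k] - ts[k - 1])
        mu, Sigma = A @ mu, A @ Sigma @ A.T + Q           # predict
        # EKF linearization of m(X) = E1 X - f(E0 X) at the predicted mean
        xp = (E0 @ mu)[0]
        H = E1 - df(xp) * E0
        residual = (E1 @ mu)[0] - f(xp)                   # m(mu)
        # observing "data = 0" makes the innovation -residual
        mu, Sigma = noiseless_update(mu, Sigma, H, H @ mu - np.array([residual]))
        means.append(mu)
    return np.array(means)

f = lambda x: x * (1.0 - x)       # logistic vector field
df = lambda x: 1.0 - 2.0 * x      # its derivative, for the linearization
ts = np.linspace(0.0, 10.0, 201)
means = ekf_ode_filter(f, df, 0.1, ts)
# means[:, 0] tracks x, means[:, 1] tracks x', means[:, 2] tracks x''
```

The filter mean closely follows the closed-form logistic solution, matching the behavior of the Julia notebook described next.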
All right, so this Kalman filter predict that I wrote down here: it's two lines in the return statement, it takes four arguments, and it's exactly the same as on the slides. And that will mostly be the case in this code. So I have these building blocks; and, so I don't hide anything, there are some imports above. And then I have the building blocks that I need to define first in order to build the extended Kalman ODE filter: the Kalman filter prediction, the Kalman filter update, the extended Kalman filter update. Maybe, if you come from Python, this apostrophe is one of the things that might be a little bit surprising at first, but it's just a transpose. And this "divided by S", I think you've also seen two weeks ago: it's multiplying with the inverse, but in a numerically more stable way, because it solves the linear system. And I use a package to compute Jacobians. So these are the building blocks that we have, and the actual algorithm that will be run in this notebook is this extended Kalman ODE filter. It has the same name as on the slides. It takes in the initial distribution, the transition model, and the initial value problem, basically, so the vector field f and the initial value x0, and a time discretization. I also pass in E0 and E1, because you cannot figure them out from inside the function, so I just pass them in. And then the algorithm itself consists of most of the blocks we saw on the slides: one update on the initial value here, so on x0, basically; and then, inside the loop, the four lines we've seen on the slides: computing a step size, doing a prediction, defining the measurement model, doing an update; and then just returning the whole thing. So this closely matches the pseudocode we've seen before. And so we will solve the logistic ODE again; you've seen it before. So I define a vector field.
I define the initial value, the time span, and the true solution looks like this. Then, in order to run the extended Kalman ODE filter, we need to define a prior. We did that on the slides already, and the code is actually the one from the slides. So it's a twice-integrated Wiener process. And if you look at the slides, this was exactly A of h, and this is Q of h. Once again, maybe this A of h is unfamiliar: it's a shorthand notation for a function in Julia. So A is a thing that you can evaluate at a step size, and it returns a matrix. And then I define these E0, E1, E2, which are one-times-three matrices that map to the zeroth, first, or second derivative, respectively — so to the first, second, or third entry of the state vector. So if I take E0 times x, then instead of having the three-dimensional object, I only have a one-dimensional object, which corresponds to the zeroth derivative of the thing we want. The initial distribution I chose arbitrarily here: it's just zero mean and identity covariance. And then I choose a discrete time grid, also with some step size. And if we run the whole thing — so there's a bit of code hidden sometimes; maybe you're more familiar with Jupyter, in Pluto you have the output above the cell — this is the code that is run in order to get the thing. The line that I need to show is only this one, because this says: we call the function I just defined with the objects I just defined in order to compute filter estimates. And then below that, there's some annoying plotting code in order to get the thing on the top. But if I hide this again, then you see that, first of all, we get three plots, because we solved the ODE, but our state describes the zeroth, the first, and the second derivative. So we get an estimate for x, x dot, and x dot dot.
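As a side note, the model pieces just set up — the twice-integrated Wiener process transition and the projection matrices — have a standard closed form. A Python sketch with my own variable names, mirroring the notebook's A(h), Q(h), E0, E1, E2:

```python
import numpy as np

def A(h):
    """Transition matrix of a twice-integrated Wiener process: a Taylor-series step."""
    return np.array([[1.0, h, h**2 / 2],
                     [0.0, 1.0, h],
                     [0.0, 0.0, 1.0]])

def Q(h, sigma_sq=1.0):
    """Process-noise covariance of the same prior; sigma_sq is the
    output-scale hyperparameter in front of the noise."""
    return sigma_sq * np.array([[h**5 / 20, h**4 / 8, h**3 / 6],
                                [h**4 / 8,  h**3 / 3, h**2 / 2],
                                [h**3 / 6,  h**2 / 2, h]])

# projections picking the zeroth, first, or second derivative out of the state
E0 = np.array([[1.0, 0.0, 0.0]])
E1 = np.array([[0.0, 1.0, 0.0]])
E2 = np.array([[0.0, 0.0, 1.0]])
```

As you would expect from a Taylor step, A(h1) A(h2) = A(h1 + h2), and Q(h) is a valid (positive semidefinite) covariance for every h > 0.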
And second of all, there's a dashed line hidden behind the blue points, because the thing that we output is actually an OK ODE solution. So we get a line that is also roughly at the location where it should be. So it's pretty much the same thing, which is pretty cool. I mean, we learned about extended Kalman filters two weeks ago, and now we used one to solve an ODE — we did a whole lecture on how to solve ODEs, but we could basically just use the algorithm that we learned about two weeks ago to get something that, for now, looks roughly the same. So that's pretty nice. I have a block here for playing around with step sizes, and that's exactly what we're going to do — not just to showcase Julia's sliders, but because we will learn from it. Basically, we can change the step size and make it smaller. So down here there's code again that is hidden, which basically just calls the function again. And you see that if we make the step size smaller, then at least in the third subplot, the uncertainties get smaller. And if we make the step size larger, then at some point we will start seeing the uncertainties in the first plot, too. Yes, here, for example. So we really see that we get uncertainties. Maybe one thing about these uncertainties: part of your homework on classic ODE solvers was about local error estimates. And one thing that might be surprising is that when you look at this, the uncertainty doesn't just grow over time. Maybe you actually expected that — I mean, if I make an error now and make an error again and so on, then it should just grow forever. And in some sense, worst case, for any ODE, that is true: if you want to bound the error of a solver on any problem in the worst case, then you can prove that the errors can get larger and larger. But this is not a worst-case error. This is an error estimate for this specific problem.
And for the logistic ODE — if you remember these arrows — as long as you start at a number larger than zero, you will get pushed to the one eventually. So even if you have a large error at some point, as long as you get to the one eventually, your actual error does decrease again. And if the actual error decreases, you also want your error estimate to decrease again. And it does: it shrinks again on the right-hand side in the top plot, and also here in the derivative, because the derivative tends to zero. So it is an actual error estimate. It's not meant to be interpreted as a worst-case bound; it tries to model the error that is happening. There's, I think, one other thing here. At some point the solution gets worse, and this is just to showcase that smoothing generally does improve the results. That also makes sense, because if we look at the line, we want to interpret it as the best estimate that we can have, whereas the line I show right now, without the smoothing, means that this point in the middle is only informed from the left-hand side. And if we don't have to do that, then why would we only inform it from the left-hand side? So if we look at trajectories in general, it always makes sense to smooth. And that matches the motivation from two lectures ago: filtering is nice for online settings, for reasoning about the current point live; but when we have trajectories, then we can get further improvements by doing the smoothing. All right. So the error estimates are kind of cool because of their structure, but there's something that's not great about them. Can someone tell me what is not nice about these error estimates, or describe how they could be made even better, maybe even in this specific plot here? So the blue shaded area is supposed to be an estimate for the numerical error that the solver makes. And the numerical error here is, well, the mismatch between the blue line and the black dashed line.
So the area is like this, but the actual error is like this. Basically, the error estimate is way too large, in some sense. It would be a much better error estimate if it barely covered the black dashed line. And coming back to the whole motivation of why we're doing this: part of the motivation is that we want to have error estimates. We want an answer from our algorithm that gives an estimate of where the solution lies. And I mean, the estimate is fine — the black line is covered — but it is kind of underconfident. So we would like it to be more confident: the blue area could be smaller, and we would still cover the true solution. So, yeah, that's the answer here. The uncertainties are too large. They don't just seem too large; they are actually too large. And that is the thing we will have to fix now, the thing we didn't talk about yet. This whole topic of making uncertainties more meaningful is also known as uncertainty calibration, and we will see that we have to do it in order to get uncertainty estimates that actually mean something. And for doing uncertainty calibration: there's a very simple reason why the uncertainties are the way they are and why they are too large. There was one part of the prior that was kind of hidden along the way — again, one of these things I didn't talk about so that we can talk about it now. There's a prior hyperparameter that we did not set. When I defined the prior, the prior of course depends on Q, and that Q needs to be specified — how to choose Q is a question by itself. But there's one other hyperparameter that was kind of hidden along the way and that directly influences the uncertainties, and that is this sigma squared. On the first slide where I introduced the prior, I said there's a Taylor approximation plus noise — and there was a sigma in front of the noise.
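One way to see the role of this sigma: in the linear, noiseless-observation setting, scaling the initial and process covariances by sigma squared leaves every mean untouched and scales every posterior covariance by exactly sigma squared. A minimal toy sketch of that claim (my own example, not the notebook's code):

```python
import numpy as np

def kf_step(m, P, A, Q, H, z):
    """One Kalman predict followed by a noiseless linear update."""
    m, P = A @ m, A @ P @ A.T + Q
    S = H @ P @ H.T                             # innovation covariance (no observation noise)
    K = np.linalg.solve(S.T, (P @ H.T).T).T     # gain via a linear solve
    return m + K @ (z - H @ m), P - K @ S @ K.T

A = np.array([[1.0, 0.1], [0.0, 1.0]])
Q0 = np.array([[1e-3, 5e-3], [5e-3, 1e-1]])
H = np.array([[1.0, 0.0]])
z = np.array([0.7])

# the same step with sigma^2 = 1 and with sigma^2 = 4 on both P0 and Q
m1, P1 = kf_step(np.zeros(2), np.eye(2), A, Q0, H, z)
m2, P2 = kf_step(np.zeros(2), 4 * np.eye(2), A, 4 * Q0, H, z)
```

Here m1 equals m2 and P2 equals 4 times P1, so sigma can in principle be set after the fact by just rescaling the covariances.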
And that really is a hyperparameter. You can think of it as a GP hyperparameter: when you do GPs with an RBF kernel, there's typically a length scale and an output scale — this is basically an output scale. It makes the thing wider or less wide, because it directly influences how much noise is added at each step. Making this thing a thousand just makes things a thousand times more noisy, and making it 0.00-something makes things way more precise. So it's clear that sigma directly influences the covariances. We didn't set it — it's a hyperparameter — and the question is: how should we set it so that the posterior gets more informative? On a high level, the motivation is exactly the same as for GPs. That's why I mentioned them: in Gaussian processes you have the same problem, you have prior hyperparameters that you need to estimate, and in the first three lectures I think you also saw that you estimate them by maximizing the likelihood — there's a way to compute this likelihood, and then you optimize it by calling an optimizer and computing a gradient somewhere.
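Written out in the notation of the slides, with the data z_1, …, z_N all equal to zero, that same recipe for our filter reads (the closed-form expression at the end is of the type that the derivation and homework arrive at; the "≈" is the Gaussian approximation made by the extended Kalman filter):

```latex
p(z_{1:N} \mid \sigma)
  = p(z_1 \mid \sigma)\, \prod_{k=2}^{N} p\!\left(z_k \,\middle|\, z_{1:k-1}, \sigma\right)
  \approx \prod_{k=1}^{N} \mathcal{N}\!\left(z_k;\; \hat{z}_k(\sigma),\; S_k(\sigma)\right),
\qquad
\hat{\sigma}^2 = \frac{1}{N d} \sum_{k=1}^{N} \hat{z}_k^{\top} S_k^{-1} \hat{z}_k ,
```

where the predicted pseudo-observations ẑ_k and innovation covariances S_k come out of the extended Kalman filter update at step k (run with σ² = 1), and d is the dimension of the ODE. Because of the approximation, σ̂ is a quasi-maximum-likelihood estimate rather than an exact one.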
In our specific context — the formula is not specific to our context, but it's convenient because of it — maximizing the likelihood means maximizing p(data | parameter), where the data was this collection of z's. So it's a bunch of zeros, but they have specific numbers attached to them and they correspond to random variables, so we have z1 to zN. And we can decompose this probability here by writing it as p(z1 | sigma) times the product of the p(zk | z1 to k-1, sigma), the data points before. This is really just a reformulation: if you multiply them up again, you end up with all the z's on the left side of the bar again. And that is convenient, because we have a way of expressing this thing on the right. That is why, when I recalled the quantities that an extended Kalman filter computes, I also mentioned that we have this estimate of what the next measurement should be given all the current estimates — because we need that in order to compute this marginal likelihood, which we want to maximize. And the extended Kalman filter actually computes Gaussian estimates of these: this p(zk | z1 to k-1) is a Gaussian whose mean and covariance pop out of the update equation. And that allows us to rewrite this thing as a so-called quasi-maximum-likelihood estimate, because there is an approximation happening here. Basically, this should be an approximate equality, because it is not exact due to the Gaussian approximation. But we will compute sigma hat as the argmax of this thing on the right, because this is tractable, whereas this quantity for a non-linear model is not tractable exactly. And so this is really a general way of doing parameter estimation in extended Kalman filters. On the next bullet point we will talk about this specific parameter sigma that sits exactly in this location, but conceptually, the standard-approach part is really just a recap of what you do in GP regression; this thing here explains to you how to do parameter
estimation in extended Kalman filtering in general. So this is the quantity you want; in code, this is the thing you compute, you use autodiff, and you optimize, and so on. In our specific context — in ODE filters — we can be a bit smarter than that. Instead of computing this quantity, taking a gradient step, and so on, we can actually solve for sigma hat in closed form. With a bit of derivation — half of the derivation will be part of the homework — in this specific model, if you write down the model and the inference algorithm that we use, so exactly the state space model as we had on the slide with an extended Kalman filter, then we are able to show that we can compute a maximum likelihood estimate in closed form by evaluating these things. So z hat is still the same quantity as before, and these two come out of the update equations; these are the zeros, so we can also just ignore them. And not only can we compute that, we don't even have to recompute the solution with the updated parameter. That is basically part of the homework: you will show that sigma directly scales the covariances, so setting it in hindsight is equivalent to just rescaling the covariances. And because it is equivalent, what we can do is take our estimates that we had — these bad estimates from here — and just overwrite them in a rescaled way, and then we get the estimates that correspond to the maximum likelihood estimate of sigma. Well, we will visualize that, and the code is, again, conceptually quite simple. So this is exactly as in the slides — I don't know why the resolution is a bit off; it gets better. These are the equations from the slides; I just wrote them down again so that you have them for the code. I defined a new update function with the suffix "cal", for calibration. It's exactly
the same one as before, only that in this line I also compute this increment. Inside of the sum we can compute this quantity — which I call an increment because it increases the sum each time — we can compute it in the update step and just return it. Otherwise it's a copy-paste from before. And then I also just copy-pasted the extended Kalman ODE filter from before and changed two lines. This one, where I take the increment and increase the sigma hat, because we want to compute a sum at the end, so I do it in a running way; and I divide by N and by d, where d is the dimension of the ODE — and because it's the logistic ODE, it's one. And then at the end there's this rescaling happening: with the dots we overwrite the covariances, which is also what I had here above — we take the sigmas that we computed and rescale them with the sigma hat that we just computed. And then we can do the same thing: we can play around with the step sizes. One thing that you see directly is that down here we always had some uncertainty before — if we scroll up, even for small steps, even though the error is actually super small — and now we cannot see the uncertainty anymore. So visually that is already an upgrade. And then, looking at the other end, we see that for these step sizes here it barely covers the true solution. So it's much less underconfident; the posterior is much more informative, much closer to the actual solution, more meaningful. And then if we make the steps too large — well, maybe this is kind of a failure case, but it provides one piece of information, which is: it is an uncertainty estimate, and we are still doing approximate inference. Formally, when you run an extended Kalman filter, it's still trying to approximate something as Gaussian that might not be. So that is one reason why it is this way. And even just generally, conceptually and philosophically, getting
uncertainty estimates that are always perfect all the time is an impossible thing — whereas if you had them, you could just improve the solution. So in some sense the goal is to get uncertainty estimates that are as informative as possible. And in many cases that we see here, there is a meaningful relation between the structure of the uncertainty estimate — because it gets smaller again — and the actual error. In these plots we don't really see anything, so one other plot to look at: from the top plot, I just plot the error. Instead of looking at the solution and another line which kind of lie on top of each other, I look at the difference between the lines. The mean of the error estimate is zero, of course. So this top plot is the actual error, and the bottom plot is the same thing in log scale. And then there's a dashed line somewhere that we can also see: for these solutions above where we see uncertainty, this dashed line is the actual error that we are making. So we're kind of above the true solution, and it's still covered by our error estimate. When we make things smaller again, then in this plot we can still not see anything, but if we look at things in log scale, then we see that we are off by an order of magnitude or something — but structurally it does the correct thing. Well, it's underconfident, which, if you have to decide between over- and underconfidence, is typically the direction you prefer to tend towards. All right. And so for all the step sizes that we can play around with here, we do get meaningful error estimates that have a similar structure to the thing they should have. And so with the stuff I defined before, and the uncertainty calibration from this slide here, which also led to code, we get an algorithm that provides error estimates that really depend on the ODE and its discretization, and meaningfully describe the numerical error
that we are actually making. So that is the thing I motivated in the beginning of the lecture, and the thing we went for — and now we kind of did it. In some sense we have a numerical ODE solver; it does provide error estimates; they are useful; they're not always perfect; the algorithm is fast. We kind of checked all the boxes. But maybe that's a good point to ask why we are doing all of this. Why are we doing ODEs in a probabilistic way, and why don't we just call Runge-Kutta? There are two aspects that come to mind, and they also influence our research in the group. There's the part that we talked about so far, which is: fundamentally, we believe it makes sense to have uncertainty quantification whenever you can, to try to quantify the errors that you make, and to be probabilistic about things. And we've seen that we can get uncertainty estimates that are meaningful; we get plots that look nicer than they did before; we get these uncertainties around; we can draw samples from the whole thing. It's kind of nice to be probabilistic about all of this. In the rest of the lecture I will talk about a different aspect, one that is not motivated by the error bars but by the methods that we use. Basically, I believe it's actually a convenient thing to use ODE solvers that are formulated in a probabilistic way, and to go through all the hassle of defining the state-space model and so on, because it's quite flexible. In one question and answer we were basically able to decide that we also want to include an initial value, and in the rest of the lecture we will just include other stuff that we would like to have, and see how that leads to algorithms that we can run, computing things that we didn't even talk about yet, with essentially the same algorithm. There will be two main examples. So, yeah, we leave the space of solving ODEs a bit and look at all the things that we can solve now that we know how to
formulate things in such a way that an extended Kalman filter can solve them. This first example is really motivated by looking at other problems that are not quite ODEs the way we looked at them, but similar enough that we can maybe figure out how to solve them. This is the stuff we did so far: we had a numerical problem, which is an initial value problem, we defined a state-space model, and then we just ran an extended Kalman filter with fairly generic code. What if we actually have this problem? It's a second-order ODE. It actually appears quite often: when you talk to physicists, they are often able to describe the acceleration of something — say there's a planet and it moves around the sun — so you have physics that you write down, and you get a description for an acceleration, but not for the velocity. So if we have this problem, how could we adjust the state-space model to compute the solution to this problem? Does anyone have a suggestion of what we could do? Yes? To include this x dot, you mean? Yes, exactly — that would include this additional piece of information, if we make one that looks similar to this. I will not show the next part yet; there's one more thing we want to adjust. Yeah, again? So you mean adjusting the ODE likelihood here?
Yes, exactly. So if you recall the derivation for this line here: we started from an ODE, then we replaced the lowercase x's with the uppercase X's with the correct superscript, and we ended up with this — where this was basically the left-hand side of the equation, and that was the right-hand side of the equation. If instead, in the beginning of the lecture, I had started talking about this, I would have said all the same words, only that the thing I would have written down is this one. We start from the top, and instead of saying it should be an equality, I put all the terms on one side and say their difference should be zero — that's why there's a minus. And instead of talking about the lowercase x's, I use the prior process that we have here, because the goal is to write down a state-space model. So the second-order ODE from the top becomes a likelihood model, or an information model, that is defined as x2 minus f of x1, x0, ti. The motivation for this quantity is exactly the same as before. And instead of saying, oh, we also need to add the initial value, I would have said we need to add the initial value and the initial derivative. So I just defined this thing to make sure that x1 is equal to the initial derivative that we are given. And this is again something we can work with — we can just run an extended Kalman filter with some predict and update steps. To prove that, I will show code again. It's again a lot of copy-paste from stuff we did before, because not much changed. So in this block we solve a second-order ODE — a specific one, a so-called harmonic oscillator — but basically there are only two lines that changed from before. Exactly two. We also do an update on the first derivative: if you compare this line and that line, the only difference is that here the measurement operator is E0 and here it's E1, because first we update on x0, and then we update on dx0, which is the initial derivative. And then I changed the observation model,
but I didn't even have to change the EKF update call, because the EKF update was defined in a general way anyway — it's this generic update building block that you could use to solve all the exercises from two weeks ago. I did, however, redefine the measurement model as the difference E2 times x minus f of (E1 times x, E0 times x, ti), to match the thing I wrote here in the slides, like x2, x1, x0. And there's calibration happening again, because we always want to do that to get meaningful results. And now we can look at the result of all of this. There's code hidden again, but it basically just defines the harmonic oscillator and then calls the extended Kalman ODE filter 2 — please don't write actual software like this, but it's nice to be able to show that this is actually the only change we needed to make to the function, by copy-pasting stuff. And again we get a probabilistic output of our solver that provides an estimate of what the solution to this ODE should be. And if we take a very coarse grid, then again we make a noticeable error, but the error estimate is on a similar scale. This is also quite nice to see, because the error in this case doesn't just go down to zero — there's not this attractor that we tend towards — so the error estimate really adjusts to the problem setting; it really depends on it. Yes, one question? Yes, exactly — which is conceptually... very good comment; maybe to repeat it: the point was that for the velocity, so for x dot, we don't have an equation, so it's just inferred by the model. And yes, exactly — in the same way as for x itself, we also didn't have an equation, but we have this measurement model, or information operator, that takes the whole stack and maps it to something we know — which, formally speaking, is a zero. But the interesting thing is the definition of this information operator, and now, in the second-order ODE, the information operator itself has access to x0, x1 and x2, and so we
learn about the whole state in the process. Exactly. And so the output that we get actually describes x, x dot, and x dot dot. By the way, we also had an estimate for the second derivative when we solved the logistic ODE, and it doesn't even appear anywhere in the problem — but because our prior says that the second derivative is the derivative of the first derivative, which sounds obvious but is actually encoded in the prior model, we also had an estimate for that earlier. All right. We can basically just solve other things now, with similar motivations. I don't know if you've ever heard about them, but there are differential algebraic equations. They can be written as done here in the middle, in green. So it looks like an ODE — we have this M on the left — but we are not able to move the M from the left to the right, because it's singular, so we cannot just multiply with its inverse. You could for example imagine a diagonal matrix that just becomes zero at some point. So there are some equations that don't actually describe a derivative; they are algebraic equations. Therefore it's called differential algebraic equations. I think the standard examples always come from chemistry, so there are some chemists who care about them. Well, for this example: suggestions on which part of the state space model we can adjust to get an algorithm that we could just run? Yes — exactly. Instead of calling it an ODE likelihood, we could call it a DAE likelihood, and we can encode this equation in here, run it, and we get something that solves the DAE. This is the one where I don't have a code example, but we did that in the paper, and it also works. So we can actually solve DAEs in a probabilistic way with the same algorithm, even though I never really told you what a DAE is. If you really care about DAEs, it probably still makes sense to have a full lecture on them and do all the theory; but conceptually, from an algorithmic perspective, it's very
straightforward how to solve a DAE right now. One more example: sometimes you have an ODE, but you also have something on top. The motivation here could be a planet again that moves around a star, where you can write down a differential equation that describes the acceleration of something. But in addition — when you write that down, you did a physics course before, and you learned that energy has to be conserved — you can write down a bunch of equations that give you a quantity that needs to be constant, or something that needs to be zero if you just move the constant into the equation as well. So you have an ODE, but you have some additional information. The whole thing is kind of overdetermined by now, formally, right? Because you don't actually need that equation; it follows from the rest — that's what you learned in the physics class, probably. But it is additional information that we can use and encode in our model. And also outside of probabilistic numerics there are algorithms like this: people in classic numerical analysis sit down and define solvers that try to conserve quantities of a specific kind. It's a whole research field; you can find a book about them. But from our inference perspective, it's just another piece of information. So, any suggestions on how to include just another piece of information, in a discretized way, inside of the state space model? Yes — exactly, thank you. We add another observation model, and algorithmically, instead of doing one update step, we do two update steps. On the inference side, this is now the model that would be motivated by the equations above. Basically — there should be ti's in here, of course — on the discrete grid we also want to satisfy g, where again, formally, we have these zero observations that we condition on. And algorithmically it's two observation models, so we just do two updates instead. All right. I did this already; there's a video
that maybe motivates this whole conserved-quantity stuff. It's not a planet — I think it's a star that moves around the center of a galaxy, and the whole thing is projected to a 2D plane, because that's how we see it when we look at it. On the left I have a very accurate solution of how it should look. In the middle and on the right I have two extended Kalman filter approaches, where the middle one only has access to the ODE information and the right one has access to the energy and the ODE, and they have the same grid size. If we run the whole thing, then at first it looks pretty nice — they're basically the same. If we go back a bit, they are more closely the same; up to here it's kind of similar. But if we continue to run the thing, then we see that the middle one starts to look noticeably different, and it starts to do weird stuff that, at least visually, is not as appealing as the stuff on the left and right. It seems to have lost some structure that the other ones have. And that is really what's happening: if we compute the energy at each point in time, then the right-hand one has much more constant energy than the middle one — the middle one just leaves the constant-energy plane, and then there's something that we cannot interpret anymore from a physical perspective. And instead of showing a video — if you let the video run for a long time, and in this case I had other step sizes — basically, the trajectory you end up with for the true system is the thing on the left, which visually looks quite appealing, with a lot of structure and so on. And then the one on the right is not the same, right? The lines are probably wrong; there's numerical error. But it keeps the structure that you have much better than the thing here. And if we let things run for long enough, then this here becomes quite degenerate, and it shouldn't even be able to go here, to this middle — there should be an empty
triangle. So even though, mathematically, you don't need to add this conserved quantity, because the ODE fully specifies it already, we're not able to solve the ODE exactly anyway — so putting in more information actually helps us to get a result that is more meaningful. There are classic methods that do similar things, but that is the flexibility and convenience point from before: we didn't have to learn a new algorithm, we just had to think about the problem, formulate the state space model, and then use the tools for state estimation that we already know. All right, there's one more example. It's not a new one if you know it already, but now we have all the tools to solve this problem. It's a combination of ODEs — the stuff we talked about in today's lecture and last lecture — and GP regression and, well, external data. The example is of course the COVID data, where we have a scatter plot of the number of infected people in Germany from, well, the beginning of 2020 to July 2021. We've looked at this last lecture already, in the context of ODEs and parameter inference. In general, today we used ODEs where I only wrote x, and x was always one-dimensional, but we can extend the things we did today to higher-dimensional ODEs — in fact, on the homework sheet we will do a two-dimensional ODE. It basically means that for each dimension you do things individually, but you will see that in the homework and in the tutorial next week. So for this epidemic model we had this SIRD model, where S is the susceptible population, I are the infected — so that's the scatter plot here — R are the recovered people, and D are the deceased. And we already motivated that if I just have a scalar beta, then the result will look way different, and it's not a good model for reality. So one way to improve that model, in order to be able to recover this plot, would be to make beta of t a time-varying thing. Last lecture I showed you a result with a
neural network, which was maybe OK-ish. Today we will make beta a Gaussian process; at least that is what I said last week. More precisely, we make it a Gauss-Markov process, because this is the language that we know how to handle in this context of Bayesian state estimation. We did not really talk about why this is actually a Gaussian process, but I guess the intuition is that we can choose the step size freely: if you can discretize something at will, it might as well be a continuous object. So this beta of t we can write down the same way as we wrote down our state x, with a transition model. It is a linear model; there is an A with a subscript beta now, because it might not be the same A, it depends on the step size h, and it describes the transition from beta(t) to beta(t+h). We also have an additional model for the data that explains how the scatter plot depends on the ODE solution: basically, y given x is Gaussian with some measurement matrix. Maybe this should rather be y_i given the stacked state (S, I, R, D), where the measurement picks out only the I component, with some Gaussian noise; it is one-dimensional here, so we do not need a covariance matrix. I did not plug in the actual ODE, but x here is the solution of the ODE, this four-dimensional object (S, I, R, D), and we measure only the I's in this plot; in the code we can actually measure more of them, but in general we have this linear measurement operator. And then we do the same thing as in the last bunch of slides: write down a state space model and do inference. You see these things get more complicated, larger, and harder to parse visually, which is why I made these three blocks: there is a part for solving the ODE, where we also assume that we know the initial value; there is a data part; and then there is this beta that comes in addition to the ODE solution itself. The first part is the prior for the ODE, where the capital X is supposed to be a model
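To make the transition and measurement models concrete, here is a minimal numerical sketch; the step size, the Ornstein-Uhlenbeck parameterization of beta, and all numbers are my own illustrative assumptions, not values from the lecture:

```python
import numpy as np

h = 1.0                  # step size in days (assumed)
theta, q = 0.1, 0.05     # OU mean-reversion rate and diffusion (assumed)

# Discrete-time Gauss-Markov transition for beta(t) -> beta(t+h), here for an
# Ornstein-Uhlenbeck process: beta_{t+h} | beta_t ~ N(A_beta * beta_t, Q_beta).
A_beta = np.exp(-theta * h)
Q_beta = q / (2 * theta) * (1 - np.exp(-2 * theta * h))

# Linear measurement: the scatter plot shows only the I component of
# x = (S, I, R, D), so H picks out the second entry.
H = np.array([[0.0, 1.0, 0.0, 0.0]])
x = np.array([82e6, 5e4, 1e4, 1e3])   # some SIRD state (assumed numbers)
y_mean = H @ x                        # mean of the Gaussian data likelihood
```

The important point is only the shape of the model: a scalar linear transition for beta that depends on h, and a linear operator H that maps the four-dimensional ODE solution to the observed quantity.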
for the ODE solution, so for (S, I, R, D), and we track a bunch of derivatives, with the same motivation as before. We have a model for beta, again in the same language as above; it is a Gauss-Markov process, and it might also be a q-times integrated Wiener process prior if we choose it that way, with some initial mean and covariance. And now we have a bunch of likelihood models, two of which we already know. There is the ODE likelihood, except that it depends not only on X but also on beta, because the vector field depends on beta. There is the initial-value likelihood, which says that X at time zero should have a specific value, namely the initial count of how many infected people there are, and so on. And then there is the data likelihood, which is the thing I wrote down on the slide before, relating the ODE solution to the scatter plot. We have actually seen this in the last lecture, in the parameter-inference context: we said that we measure the solution with some linear matrix H, because maybe we do not measure all the dimensions, and we also said that it is a Gaussian measurement model. So we have two Dirac likelihoods and one Gaussian likelihood. We can look at this in a more visual way; we have seen a bunch of these diagrams in the very first lecture and two lectures ago. It is the same thing, but with circles and arrows: we have a state that evolves over time, which is a prior; we have a latent force, for which we also have a Gauss-Markov prior; then we have the ODE, which depends on both the state and the latent force, because x is an input and beta is an input; and then we have observations only of X, since we are not able to observe beta. The inference task that we will solve with an extended Kalman filter is basically: given the black dots, what are the white dots?
That is what an extended Kalman filter does; only in the settings we had before, the first row did not exist and the last row did not exist. But in some sense it is the same inference algorithm as before: in the conserved-quantity example we already talked about having two observation models, so we do the same thing here. Yes, there is a question in the back. The question is whether it would also be possible to put beta into the state. Very good question; you are taking away my next slide. The next slide is basically about how to implement this, and the one reformulation that we can do to make it simpler and closer to what we did before is exactly that: X and beta look alike, z depends on both X and beta, so we might as well merge them. So I introduce X tilde, a stack containing X and beta, which means that mu-zero tilde and Sigma-zero tilde are the stacked means and a block diagonal of the covariances, and the same for A and Q. This X tilde now contains all the things we want to infer, and the measurement model for the ODE now depends only on X tilde, same for the initial value and the observations. To keep things similarly visual, I redefine E0, E1, and E_beta as the linear operators that access the correct entries, because we will want to access the zeroth derivative that we need here in the vector field, the first derivative here, and beta here; getting them out of X tilde is just a linear operation. Yes, there is another question, about A tilde and Q tilde. They should be block-diagonal, yes, exactly; very good question. Same as the covariances: these A's and Q's are also block-diagonal. But maybe it is good to mention that the filtering distributions we get out are not block-diagonal: after having done the inference, X and beta are correlated, there will be cross-covariances; only in the prior are they not. Alright, so this is the thing
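The stacking described here can be sketched in a few lines; the helper name `stack_models` and the dimensions are hypothetical, chosen only to illustrate the block-diagonal construction:

```python
import numpy as np

def stack_models(mu_x, Sigma_x, A_x, Q_x, mu_b, Sigma_b, A_b, Q_b):
    """Merge the ODE-state model and the beta model into one
    augmented model for x_tilde = (x, beta)."""
    def blockdiag(M, N):
        out = np.zeros((M.shape[0] + N.shape[0], M.shape[1] + N.shape[1]))
        out[:M.shape[0], :M.shape[1]] = M
        out[M.shape[0]:, M.shape[1]:] = N
        return out
    mu = np.concatenate([mu_x, mu_b])        # stacked means
    Sigma = blockdiag(Sigma_x, Sigma_b)      # block-diagonal covariance
    A = blockdiag(A_x, A_b)                  # block-diagonal transition
    Q = blockdiag(Q_x, Q_b)                  # block-diagonal process noise
    return mu, Sigma, A, Q

# Projection matrices that read the relevant entries out of x_tilde.
# Here: a 1D state with 2 tracked components (x and x'), plus one beta entry.
d = 2
E0 = np.eye(1, d + 1, 0)     # zeroth derivative, for the vector field
E1 = np.eye(1, d + 1, 1)     # first derivative, for the ODE residual
Eb = np.eye(1, d + 1, d)     # beta, the last entry
```

Note that the block-diagonal structure is a statement about the prior only; as soon as the ODE likelihood couples x and beta, the filtering covariances acquire cross-covariance blocks.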
that you have to write down in order to solve this COVID latent-force inference problem, in the way we have now seen in a bunch of lectures. And the inference algorithm really is an extended Kalman filter and smoother. If you look at the paper by Jonathan, together with Nicholas Krämer and Philipp Hennig, they define this carefully and choose the discretization of everything in a smart way, but algorithmically it is an extended Kalman filter and smoother. The motivation is still: it is a state space model, and we know how to handle state space models, because we know how to do extended Kalman filtering and smoothing. And we end up with this result, which gives us a posterior distribution of the same quality as an extended Kalman smoother always gives: a posterior both for I, because it is part of our capital-X state, and for the betas, jointly, from all of this data, where one part of the data is the scatter plot and the other part is the ODE itself, in one joint inference algorithm. Maybe it is also nice to mention that the extended Kalman smoother goes forward and backward once and then returns a posterior, so to get this plot we had to do the forward-backward pass only once; compared to what I showed you last week, with the video that runs for a thousand iterations, that is pretty nice. Alright, that is basically the content for today's lecture. We looked at ODEs from a probabilistic perspective, and I argued that solving an ODE is actually solving a state estimation problem; last week we did not talk about it as a state estimation problem, but fundamentally we tried to estimate an unknown quantity, and there will be an error, so we want to be probabilistic about things, and we have the tools to do state estimation in a Bayesian way, even in an efficient way. So today we defined a way to solve ODEs with extended Kalman filtering and smoothing, which leads to this whole class
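Schematically, that single forward-backward pass is a standard Kalman filter/smoother loop. A minimal linear sketch (the actual ODE update would additionally linearize the vector field; all function names here are mine):

```python
import numpy as np

def predict(mu, Sigma, A, Q):
    """Push the Gaussian state through the linear transition model."""
    return A @ mu, A @ Sigma @ A.T + Q

def update(mu, Sigma, H, z, R):
    """Condition on a linear-Gaussian observation z = H x + noise."""
    S = H @ Sigma @ H.T + R
    K = Sigma @ H.T @ np.linalg.inv(S)
    return mu + K @ (z - H @ mu), Sigma - K @ S @ K.T

def smooth(mu_f, Sigma_f, mu_p, Sigma_p, mu_s_next, Sigma_s_next, A):
    """One Rauch-Tung-Striebel backward step."""
    G = Sigma_f @ A.T @ np.linalg.inv(Sigma_p)
    mu_s = mu_f + G @ (mu_s_next - mu_p)
    Sigma_s = Sigma_f + G @ (Sigma_s_next - Sigma_p) @ G.T
    return mu_s, Sigma_s

# Tiny 1D demo: one predict/update step forward, one backward step.
A, Q = np.eye(1), 0.1 * np.eye(1)
H, R = np.eye(1), 0.1 * np.eye(1)
mu0, Sigma0 = np.zeros(1), np.eye(1)

mu_p, Sigma_p = predict(mu0, Sigma0, A, Q)
mu_f, Sigma_f = update(mu_p, Sigma_p, H, np.array([1.0]), R)
mu_s, Sigma_s = smooth(mu0, Sigma0, mu_p, Sigma_p, mu_f, Sigma_f, A)
```

In the lecture's setting, each time step simply runs several `update` calls in sequence, one per likelihood (ODE residual, initial value, data), before the next `predict`; the backward pass is unchanged.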
of solvers that are sometimes called ODE filters. Then I showcased the motivation: we want uncertainty estimates, we want to be Bayesian about things, and the uncertainty estimates that we get are useful. But there is also this whole different aspect that maybe does not get talked about so much: it is also just very convenient to be Bayesian about things. In this particular example we have seen that, by having this general inference algorithm, we can write down a generative model for a whole bunch of problems. You now know how to solve a DAE, at least in practice, in code, even though you probably never heard of one before, and that is only because you know how to use an extended Kalman filter and how to implement it, maybe sometimes in a slightly non-standard way, because you have two measurements at the beginning or in the middle or something. But we know how to do that, and that allows us not only to solve numerical problems but also to do inference from data in a very efficient way that combines physics and general external observations. So one summary of this three-lecture block, which ends with today's lecture, is that in our minds, Bayesian filtering and smoothing really is the way to formulate and model dynamical systems: it is flexible, because we can just add information if we want to, and it lets us factor out the inference algorithm as something we can take care of afterwards. I thank you all for joining the lecture. Please take out your phone right now and do the feedback, like, right now; I want to see people scanning QR codes. Thanks; I am around if you want to ask questions, and see you next week for partial differential equations.