Hi, everyone. I think we can start. So yeah, welcome to today's lecture of Numerics of Machine Learning. We're in lecture six. My name is Nathanael; I'm also a PhD student with Philipp, and I prepared this lecture together with Jonathan, who gave the last one. We're in lecture six, but we're also in the second lecture of our block on simulation. Jonathan started last week by teaching us about state-space models. He provided us with a language to describe the evolution of a state in a probabilistic way, and to describe observations that depend on the state, also in a probabilistic way, and then gave us a set of tools to estimate states from the observations. We talked about that in a very general sense, but then we also looked at a practical algorithm, the so-called Kalman filter and Kalman smoother, or Rauch-Tung-Striebel smoother, which solves this problem exactly in the linear Gaussian case. We then extended it to the non-linear case and got an approximate inference algorithm which is very fast and works well in many situations. Today we do a similar thing, but we talk about it in a very different way. We also talk about states, and we also talk about dynamical systems, but we use a very different language, because for the most part things will not be probabilistic. We assume that we have a way to describe the evolution of a state in a completely exact, deterministic way. But even then, things are complicated enough that we need to approximate the simulation itself. So where before we did approximate inference, this time we have to do approximate simulation, because doing it perfectly is just not feasible. The language that we use is ODEs, ordinary differential equations, and the algorithms that we use are numerical ODE solvers. So we start with a small intro to ODEs, especially an intro from the perspective of machine learning, and I give some examples of where ODEs appear in machine learning. Then in the main part we talk about numerical solvers, construct a few, and look at some examples and the differences in their properties. In the last half hour we have a block on parameter inference, where it's not really a simulation problem anymore; rather, we want to learn about the system that generates the simulation. How that all fits together, we will see in the last part. Before I go to the examples, let me define an ordinary differential equation. An ordinary differential equation, or ODE, is this thing here in the middle: x'(t) = f(x(t), t). Here x is the unknown function. It has a similar meaning as the state from last week, I think: it's a thing that evolves over time. And it is a function, so it is continuous; x is defined for all t in some time domain. I use t and talk about time because that's the easiest interpretation of an ODE. x really has a one-dimensional input, so only a scalar, and evolves along this one dimension; that's also why it's called an ordinary differential equation. Then f on the right-hand side is called the vector field, and it allows us to describe the derivative of x given x. Typically we want to solve the ODE on some time domain, not on the whole real line, but up to some time T. So in general, an ODE is a way to describe the derivative of a function, but not the function itself. And so the question is: what is x? Because this thing only implicitly describes x.
And so we have to do some work in order to find x. Now to the examples of where ODEs appear in machine learning. The first one is, I guess, super well-known by now; everyone that at least has a Twitter account knows about diffusion models. Even in the first lecture we had a lot of images, I think, that were generated by one. So diffusion models have been quite hyped for the whole year, and they fundamentally rely on differential equations. I'm cheating a little bit with this example, because for the biggest part, stochastic differential equations, or SDEs, do the main work. If you really want to get into those models, you need to understand SDEs. But at the same time, in order to understand SDEs, you need to understand ODEs first, because SDEs are strictly more complicated. We will not talk about SDEs in this lecture, nor in the next, but ODEs are important in order to get there. Also, in one detail, there are algorithms that perform this mapping from noise to data using ODEs: there are papers published last year that make this transformation from there to there by numerically solving an ODE with some Runge-Kutta solver. This mapping from noise to data has also been done before diffusion models already, because people cared about modeling complex distributions. Things like here on the right, or images: there is a distribution of natural images, and to model that you need to describe a probability distribution over them. But that distribution is complicated to work with, whereas the thing on the right is just i.i.d. Gaussian noise, which is easy to work with. And this mapping from distributions that are easy to work with to distributions that are complicated to work with is something that normalizing flows do. A good part of the work on normalizing flows, maybe half of it, does that in a continuous way, by using ODEs again. Then you might have heard about neural ODEs. This also relates to the two points before; there was a nice paper published in 2018 that sparked quite a lot of hype in that field, which reinterpreted residual neural networks as discretizations of a continuous object. Instead of thinking of a residual neural network layer by layer, you think of it as a continuous transformation that is then discretized by a numerical algorithm. One specific algorithm leads to the thing that we know, and in general we get a continuous formulation. That, again, sparked a whole lot of research; there are still many papers being written on neural ODEs, and I would also argue that it accelerated the two points above, because the diffusion model work often also relates to neural ODEs. To give something completely different, in case you are more interested in optimization: it turns out that in order to understand what happens in gradient descent, you can do something similar to what I just presented for neural ODEs and reinterpret gradient descent as a discretization of a continuous object. This continuous object is called a gradient flow, and it's much easier to prove a theorem about what the gradient flow does than about what the discretized thing does. I took this graphic from a very nice blog post by Francis Bach, who does a lot of great work on optimization; he had several posts on gradient flows, and also extends this to SDEs and stochastic gradient descent.
The one example we will do in this lecture is parameter inference, which maybe relates more to the machine learning settings that we're used to. We have some data, and we want to learn from it. And if we know about the process that generates the data, maybe we can learn something more meaningful, like in this example that we see later, where we learn four lines even though our scatter plot shows only two of them. In order to do that, we need some assumptions on the structure or the real properties of these four lines. And a small spoiler for next lecture: the act of solving an ODE is itself a process of learning something, in this case an unknown function, and so next lecture we will interpret numerical ODE solvers as machine learning algorithms. All right, so I hope you now know why we're sitting here, why we're talking about ODEs, and why I am excited about them. So, coming back to the definition of an ordinary differential equation, the question for the remaining two thirds of the lecture is: how do we solve ODEs? How can we find x such that the differential equation is satisfied? Well, we can write down the solution of an ODE in the following way: x(t) = x(0) + ∫₀ᵗ f(x(τ), τ) dτ. That is a true statement; if you compute the derivative, the thing on the top comes out. It's also not very helpful, for two reasons. First, we now have an integration problem, and integration is also non-trivial; that's why there will be a separate lecture on it, I think. Second, the integration problem is particularly annoying because x appears inside the integral too: we kind of need all the values of x from zero to t, but we don't know x, right? So yeah, it's a true statement, and it will motivate a lot of the algorithms, maybe even all of them, that we talk about today, but it's not yet a solution that we can actually work with. There's one piece of insight in it though, which is that the solution depends on the initial value. The problems that we actually care about are so-called initial value problems, which are basically an ODE together with an initial value x(0) given by some x₀. Whenever I say ODE, I probably mean an initial value problem, so an ODE together with an initial value; and when I say solving an ODE, I mean solving the corresponding initial value problem. So let's look at an example that visualizes why ODEs can be a helpful tool for modeling, and how solutions look. It's a very simple, one-dimensional example, the so-called logistic ODE, and we can motivate it from a biological perspective. We don't use x here anymore; we use P, because we want to describe the population size of something, and the change in population is given by P'(t) = r · P(t) · (1 − P(t)/K). So r is a growth rate and P is the population. If we ignore the part in the parentheses, it basically means the population grows proportionally to its size, which makes sense: if 10 people have five babies, then 100 people probably have 50, and so the growth is larger if the population is larger. The part on the right describes an upper bound, because if P/K approaches one, the whole thing becomes zero and we stop growing. So K is effectively an upper bound for the population, also called the carrying capacity or a bottleneck. The solution of the logistic ODE is given by this.
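For reference, since the slide formula itself isn't in the transcript, this is the standard closed-form solution of the logistic ODE (assuming growth rate r, carrying capacity K, and initial value P(0) = P₀):

$$\dot P(t) = r\,P(t)\left(1 - \frac{P(t)}{K}\right), \qquad P(t) = \frac{K\,P_0\,e^{rt}}{K + P_0\,(e^{rt} - 1)}.$$

You can check that P(0) = P₀ and that P(t) → K as t → ∞, matching the carrying-capacity interpretation.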
Today we talk about numerical solutions, so we don't really care about how to solve ODEs by hand. This solution you can get by doing some derivations by hand, but that is explicitly not the goal today. We want strategies with which we can approach general, complicated problems with a computer, which is why I just give you the solution. But you don't really have to trust me on this, because you can compute the derivative yourself. It's one of those nice cases where at least verifying a solution is very easy, even if coming up with it is more complicated. So yes, this function solves the ODE. We can plot it. This also maybe visualizes once more what the problem setting is. The arrows in the back are a visualization of the vector field, because they kind of move the solution around, and here are seven lines that solve this ODE for different initial values, with some chosen parameters K and r. So we get multiple solutions for different initial values; if we specify an initial value, we end up with one solution. And so the goal of solving an initial value problem is basically: given the starting point down here somewhere and the arrows in the back, come up with this plot. That is the task that we're interested in. In general, please feel free to interrupt me at any point if you have questions. If anything's unclear, then we can talk about that. All right, so the main question is: how can we solve ODEs in general? If no one gives us the solution, what can we do? In particular, in most cases it's actually impossible to give the solution in closed form, but we still want a strategy to find x, or at least find something that is close to x. For that, we come back to the definition and the solution formula that we've seen so far. The strategy in numerical ODE solvers is basically to extrapolate step by step, which means that instead of trying to compute x(T) directly from x(0), the question is rather: if we have something at time t, how can we get to time t + h? And then the formula is of course exactly the same as before: this thing is still the true solution if you have the true solution at time t. The difficulty is also exactly the same as before: in order to do this, we still need to solve the integral, which is non-trivial, and we need to handle this x inside, which is the main difficulty in this problem. The motivation, though, is that if h is small, so t and t + h are not far away from each other, then the approximate algorithms we come up with for dealing with this are probably fine, compared to trying to jump ahead directly. So that's the motivation. The main question is: how do we compute this thing here? All the algorithms that we show will be different ways of computing it, and as a result we get numerical ODE solvers. The first thing we will try is using Taylor series expansions, or Taylor series approximations, actually. You've seen this last lecture, when we tried to push Gaussians through non-linear functions and used Taylor series expansions to linearize the function. In general, a Taylor series expansion takes some function g that, here for simplicity, maps R to R, just to have easier notation.
And it says that g at some τ is equal to g at t₀, plus the derivative times (τ − t₀), plus one half the second derivative times (τ − t₀) squared, and so on in this series representation: g(τ) = g(t₀) + g′(t₀)(τ − t₀) + ½ g″(t₀)(τ − t₀)² + … Note that this is exact if you actually let the series run to infinity. What we did last lecture is basically ignore everything to the right and only look at the left part; then this thing is an approximation, and we obtain a linear function, because τ only appears in here with some slope in front of it. That's a Taylor series approximation. Today, we'll do something even simpler as our first algorithm: where last time we did a first-order Taylor series approximation, we will not even go there. We'll do a zeroth-order Taylor series approximation, which basically means we discard everything to the right from here. That is a very crude approximation, but it leads to a numerical algorithm that everyone knows and that surprisingly many people use, and we will see it in a second. So, zeroth-order Taylor series approximation. Okay, the reason we do this is still that we want to approximate this integral on the top. So we do a Taylor series approximation of the thing inside the integral, the integrand. You can interpret f(x(τ), τ) as a function of τ if you just merge f and x together, and then you're in the notation from before; this is why I used g(τ). When we do the zeroth-order Taylor series approximation around t, we approximate f(x(τ), τ) by f(x(t), t), and we have some error term. This error term comes from the (τ − t) here, which is at most h. When we plug this into our integrand, instead of integrating a function that depends on τ, we suddenly integrate a thing that does not depend on τ anymore, so we can pull it out of the integral. What remains is the integral from t to t + h over the constant one, which is just h. So we managed to find an approximation of this integral, and if we plug that into the top, we get our very first numerical ODE solver, which is the forward Euler method. I imagine you've seen it before. The scheme is: you take your estimate at time t and you adjust it by adding h times f of your estimate at time t, which is the local slope. It's an explicit method, because it explicitly describes how to compute x̂ at time t + h. It's a formula: you can literally write this into your favorite programming language and compute the next point in time. All right, so that's the very first solver, and it is actually used in some software; it also appears in dynamical simulations. I think if you look at what OpenAI Gym does under the hood, it basically does this to simulate the systems. Now let's change one thing and get another algorithm: choose a different location for the zeroth-order Taylor series approximation. The motivation is, in some sense, that here we used the local information of the point where we are in order to extrapolate. A way to get more information could be to use the slope at the point where we want to go, which basically means we do the Taylor series approximation around τ = t + h. The math still works out the same, so we get the same approximation here, only that it now says t + h.
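As a worked version of that derivation (the notation is reconstructed from the verbal description, so take it as a sketch): plugging the zeroth-order approximation f(x(τ), τ) ≈ f(x(t), t) into the integral form of the solution gives

$$x(t+h) = x(t) + \int_t^{t+h} f(x(\tau), \tau)\, d\tau \;\approx\; x(t) + \int_t^{t+h} f(x(t), t)\, d\tau \;=\; x(t) + h\, f(x(t), t),$$

which, with estimates in place of the true solution, is exactly the forward Euler step $\hat x(t+h) = \hat x(t) + h\, f(\hat x(t), t)$.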
This formulation is the same as before, only that here it says t + h. And we get the so-called backward Euler method, which is given by x̂(t+h) = x̂(t) + h · f(x̂(t+h), t+h). It's again very easy to write down, but this is not an explicit equation anymore. Maybe now it's clear why I said explicit earlier: this equation describes the quantity that we want, but it doesn't explicitly tell you how to compute it. So, similar to the thing we started with, the ODE, which describes the thing we want but leaves us some work to do, here again it describes the thing we want, but we do have to do some work. One way to do that is basically to move everything to one side, so we have zero equals something, and that is a root-finding problem. We'll talk about that again in the next few slides. So we have our first two ODE solvers, basically: forward Euler, an explicit method, and backward Euler. Let's look at code quickly, to see how easy it is to solve ODEs. It's 13 lines of code, but most of it is because I wanted this to be copy-pasteable: if you copy-paste this into your terminal, you actually get a plot that looks like this. The algorithm itself is basically only in line nine. The rest is problem definition in lines three and four, importing a plotting package at the top, and specifying a step size. Maybe you're confused about this double slash here: it's basically a way to write a rational number, so it's not a floating-point 0.1, but really one divided by ten, which makes things a bit nicer. It doesn't matter a lot. And then here you have the explicit Euler step, and it's really the same as I wrote it down: the new x, where I just overwrite the variable, is x plus h times f(x, t). And it works: we get a solution of the logistic ODE that looks reasonably fine, conceptually the same as what we did before. I didn't plot the true one here, because we will talk about error later. Backward Euler in code is also the same in the sense that we can write it in one line; but as on the slides, where we could also write it in one line, I said that you need to do some work in order to solve it. I chose to implement it using the Roots package, which has a function find_zero that does the thing I talked about: it solves the root-finding problem for me and spits out a new x that satisfies the implicit equation. Again, we get a solution that looks fine, and the code overall is quite easy to read. A few comments about this, though. We've seen in the first three lectures how numerical methods appear when you try to do something with a computer, like doing GP regression, and then suddenly, when you start looking into what it actually means to invert the matrix, you open up a whole field of research. This is exactly one of those points where we can again start digging: our numerical method calls another numerical method, so maybe we should think about that. We can look at what this function does. I did check the docstring, and it turns out that find_zero is actually quite complicated: there's a whole heuristic algorithm that decides, from a set of something like 50 different methods, what the best root-finding method is, and so on.
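The slide code itself isn't reproduced in the transcript, so here is a minimal reconstruction of both solvers in Julia, assuming the logistic vector field and the Roots.jl find_zero function mentioned above; the parameter values are illustrative, not the lecture's:

```julia
using Roots  # provides find_zero, used for the implicit (backward) Euler step

f(x, t) = 3.0 * x * (1 - x)      # logistic ODE with illustrative r = 3, K = 1
h = 1//10                        # step size as a rational number, as on the slide

function forward_euler(f, x0, h, T)
    x, xs = x0, [x0]
    for t in 0:h:(T - h)
        x = x + h * f(x, t)      # explicit update: new x from the local slope
        push!(xs, x)
    end
    return xs
end

function backward_euler(f, x0, h, T)
    x, xs = x0, [x0]
    for t in 0:h:(T - h)
        # solve x_new = x + h * f(x_new, t + h) as a root-finding problem,
        # starting the search from the current estimate x
        x = find_zero(z -> z - x - h * f(z, t + h), x)
        push!(xs, x)
    end
    return xs
end

forward_euler(f, 0.01, h, 3)     # both approach the carrying capacity K = 1
backward_euler(f, 0.01, h, 3)
```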
Today we're not going to dig deeper there, but because we think about numerical methods, we are aware of this happening. For example, I know that there are some packages where implicit differential equation solvers have an interface for the user to specify how this inner problem should be solved. And because implicit methods have this additional complexity, you can actually tell the solver to compute inverses with conjugate gradients, for example. So that is actually what people do, and if you look at documentation on how to solve stiff ODEs, then you need to care about these things. But we want to get to Runge-Kutta, so we will not care about them today; you see the theme, I think. Yeah, so one other question in this context: this thing does something complicated, whereas here it's something easy, so why would we want to do the complicated thing? In particular, this find_zero function is an iterative algorithm; maybe it uses something like a Newton iteration. So this thing is computed many times in order to get to the quantity that you care about. That is more expensive, it is more annoying to code, and it is more annoying to understand. So there's some extra cost involved; why would we want to pay it? Well, the answer is that backward Euler is a different algorithm from forward Euler, and there's one specific property that we will look at that visualizes this difference between explicit and implicit methods, which is stability. Stability is the word that pops up when you look at differences between explicit methods in general and implicit methods. There are different notions of stability: if you open an ODE book, this is not all there is, and people care about way more than the one slide that I'll show. But I think it visualizes the difference between explicit and implicit Euler already. Also, if you've heard about ODEs being stiff, then this is the context. To compare forward and backward Euler, we will look at a very simple ODE: x'(t) = λ·x(t). It's scalar, so it will also be easy to plot, and λ will be a number below zero; here I chose λ = −21, and we'll see in a second why. The true solution is therefore this dashed line: an exponential decay to zero. And it decays to zero quite fast, as you can see, because the plot goes from zero to one, but it's visually zero already at around 0.2. Now, when we solve that ODE with forward Euler with step size h = 0.025, we get these dots here with lines in between. If we zoom in, then around here you see that there's an approximation error, as there always is. But conceptually it does the right thing, because it goes to zero: the true solution goes to zero, and forward Euler goes to zero. In the limit, it's fine. The question is: for which step sizes does it go to zero? We can look at that by just plugging the vector field from above into forward Euler: x̂(t+h) = x̂(t) + h·λ·x̂(t), which gives us the simpler formulation x̂(t+h) = (1 + hλ)·x̂(t), where we just multiply (1 + hλ) onto the previous estimate. And now the requirement becomes clear: if we want this thing to converge to zero, we suddenly know something about the factor in the middle, namely |1 + hλ| should be smaller than one. If we check that for λ = −21 and h = 0.025, it holds: 1 + 0.025·(−21) = 0.475.
But we can also ask the opposite question: for which step sizes does it not hold anymore? We could solve for this, but basically I just have a counterexample. If we plug in a step size of h = 0.1, then hλ = 0.1 · (−21) = −2.1, so 1 + hλ = −1.1, and the absolute value of that is greater than one. So the condition does not hold anymore, and just from what we derived here, it should not converge to zero anymore. We can try that out, and indeed, what we wrote down is correct: when we run the code, we see that it starts oscillating around the solution. At each step it overshoots, then the gradient information it gets is way steeper, it overshoots again, and so on; there's a whole feedback loop happening, and it actually diverges. This basically means that for specific problems we actually need to take small steps, otherwise bad things happen, like this divergence behavior, even though the ODE is actually quite simple to solve, right? So, the goal was to compare forward and backward Euler. We can do the same math for backward Euler, and we see that when we reorder the equation, it becomes x̂(t+h) = x̂(t) / (1 − hλ), so the requirement is |1 / (1 − hλ)| < 1. When we plug in the numbers, that means 1/3.1, and that is indeed smaller than one. And when we run the code, it actually converges to zero. For backward Euler, you can check for yourself for which h this works, but with a negative λ like this one, it should always converge to zero. That's a fundamental difference: backward Euler has properties that forward Euler does not have, and that is why sometimes you might actually want to do the expensive, more complicated thing instead of the easy thing. That's not the full story, though, because one backward Euler iteration is more expensive. So this is not a final answer of the form "for this problem you should always use backward Euler", because when you sit in front of a computer, the thing you care about is probably just how long you have to wait until the ODE is solved. But roughly speaking, when you expect your differential equation to have very steep areas, you should consider using an implicit solver. Or, phrasing it differently: when your solver diverges, maybe you have a problem with the stiffness of the problem, and then you should consider using an implicit solver. All right. This also shows us that we've talked about only two algorithms, both of them super simple, and we already face the problem of which one to choose; this is one partial answer to that. In that regard, it doesn't get better, because we will now talk about a whole class of solvers. This opens up basically 150 new algorithms that you can find in software packages. But it's also a class of solvers that is very widely used: the default solver when you use SciPy is probably a Runge-Kutta solver, a specific one that you saw in the very first lecture and that we will revisit in a few slides. The problem is still the same: we still try to extrapolate from some time t to some time t + h. And the difficulty is also still the same: we need to solve an integral, which is annoying, but particularly annoying because inside the integral we have x.
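To make the experiment concrete, here is a small sketch of the stability check (values as in the lecture; for this linear ODE the backward Euler update has the closed-form rearrangement derived above, so no root-finder is needed):

```julia
λ = -21.0
for h in (0.025, 0.1)
    x_fwd, x_bwd = 1.0, 1.0
    for _ in 1:round(Int, 1 / h)             # simulate on the interval [0, 1]
        x_fwd = (1 + h * λ) * x_fwd          # forward Euler: diverges if |1 + hλ| > 1
        x_bwd = x_bwd / (1 - h * λ)          # backward Euler: |1/(1 - hλ)| < 1 for all h > 0
    end
    println("h = $h:  forward -> $x_fwd,  backward -> $x_bwd")
end
```

For h = 0.025 both estimates decay toward zero; for h = 0.1 the forward Euler iterate grows in magnitude while backward Euler still decays.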
The motivation for Runge-Kutta is now a different one. I said there are two problems: one with the integral, and one with x appearing inside it. Here we tackle the integral problem by numerically solving the integral. Conceptually, numerical quadrature is a term for numerically solving integrals, and it typically works in the following way: when we have an integral of some function g, possibly vector-valued, with some lower and upper bound, we approximate it by discretizing in a specific way, ∫ₐᵇ g(τ) dτ ≈ Σᵢ wᵢ g(τᵢ). "In a specific way" because we have to choose where to evaluate the function g, which are the so-called quadrature nodes τᵢ, and when we sum the evaluations together, we can give more or less weight to each of them; those are the so-called quadrature weights wᵢ. Again, I think you will talk about integration in more detail in three lectures. There are many ways of doing this; you could probably fill whole semester courses on how to do quadrature. So this is really just a conceptual motivation for the kind of operation that Runge-Kutta does, because Runge-Kutta methods, or the explicit Runge-Kutta methods I will talk about today, do a similar operation: we replace the integral with a sum of s terms, with specific quadrature weights and quadrature nodes. Only that, as I said before, there's the additional second difficulty: we still need x̂ at the nodes τᵢ, because it's not given automatically; we have to construct it somehow. And this is basically the difference between different Runge-Kutta methods: if you look at five different Runge-Kutta methods, they'll give you five different answers to how to choose the wᵢ and τᵢ, and how to construct the x̂'s at those locations. All right, formally, Runge-Kutta looks like this. If we compare that to the last slide: here I used w and τ; on the next slide I use bᵢ's and replace the whole f term with the kᵢ's, because that is the notation that Runge-Kutta methods use. Yes, you have a question? [Inaudible question.] Yes, that will be the very first Runge-Kutta method on this slide. But you are completely right. All right, so in general, Runge-Kutta methods look like this; that is basically what I had on the last slide. When you code them up, you implement how to compute k₁, then k₂, then k₃, where k₃ depends on k₁ and k₂, k₂ depends on k₁, and so on. And at the end, you update x̂ by extrapolating again from x̂(t) with step size h, summing up the different kᵢ's with the appropriate quadrature weights. Because this is a bit tedious, and this is the way you implement it (if you write code, you probably want the formula like this), a more compact and more readable way to talk about Runge-Kutta methods is the so-called Butcher tableau. When we look at these equations, structurally they all look the same, right? And I told you that the way Runge-Kutta methods differ is basically by choosing different weights bᵢ, different quadrature nodes c₂, c₃, and so on, and then constructing the x̂'s in a different way.
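Written out in the standard notation (the slide itself isn't reproduced in the transcript), an explicit s-stage Runge-Kutta method is

$$k_1 = f(\hat x(t),\, t), \qquad k_i = f\Big(\hat x(t) + h \sum_{j=1}^{i-1} a_{ij} k_j,\ t + c_i h\Big), \quad i = 2, \dots, s,$$

$$\hat x(t+h) = \hat x(t) + h \sum_{i=1}^{s} b_i k_i,$$

where the coefficients a_ij build the intermediate values, the c_i are the quadrature nodes, and the b_i are the quadrature weights.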
And a general way to write that down is with so-called Butcher tableaus, because all the free variables that need to be chosen can be packed into a tableau: this row here describes how to construct k₂, this row describes how to construct k₃, and so on up to k_s, and at the end the last row describes how to sum them up in the correct way to get the next step. If you go to Wikipedia and look up Runge-Kutta methods, you find an article with a list, and they don't write down equations or anything; it's basically a list of Butcher tableaus. So in some sense, if someone gives you a Butcher tableau, you can just implement it, or even write code that handles general Butcher tableaus, as in the sketch below. All right, so let's look at examples of Runge-Kutta methods. This is just so you can do some pattern matching with the stuff that I wrote below; it's the content from the last slide. And as was said already: yes, the very first and simplest Runge-Kutta method that we can talk about is again forward Euler. So I kind of tricked you into going through all this only to show the same algorithm again. But if we just replace this term in the extrapolation with a k₁, then it actually satisfies the equation above, and we can also write down a Butcher tableau. It's just a very, well, not very special one; but if we insert some zeros, then we get this tableau. And again, if you go to the Wikipedia list of Runge-Kutta methods, this is the first one they give. Unsurprisingly. Okay, I didn't define implicit Runge-Kutta methods, but you can imagine that they exist, and backward Euler fits the definition of an implicit Runge-Kutta method. Normally, here there needs to be a zero for the method to be explicit; there isn't one, so it's an implicit method. But there will not be a definition of implicit methods today. There was a question, sorry. [Question from the audience.] Maybe I can make an attempt, because I could imagine the question is whether we split the integral into different terms and then just take steps in between, so instead of going from t to t + h, we go to the middle point and then to the next one. I'm still not completely sure that I got it right, but if the idea is to, for example, do an Euler step and then do a second Euler step from right there: that is conceptually not what happens here. That would be a different algorithm, and it has different properties. I guess the observation is correct that the bᵢ's sum to one; I think they probably need to, something like that. But you actually have quite a lot of freedom in choosing these numbers. I will show two or three examples that are not forward and backward Euler, where we see Butcher tableaus; maybe we can revisit your question after you've seen those, and you can check whether they are a counterexample to your intuition or not. Yes? Okay. All right. Yeah, so let's jump to an actual example. Maybe the second easiest thing, which we can still try to interpret intuitively in terms of how it was constructed, is the so-called explicit midpoint rule. And that is an example that is kind of close to doing two Euler steps, but it's not the same.
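Since the slide suggests you could "write code that handles general Butcher tableaus", here is a sketch of what such a generic explicit Runge-Kutta step might look like in Julia; the (A, b, c) encoding of the tableau is my choice, not the lecture's, and the state is kept scalar for simplicity:

```julia
# One explicit Runge-Kutta step for x' = f(x, t), defined by a Butcher tableau:
# A is the strictly lower-triangular matrix of coefficients a_ij,
# b the quadrature weights, c the quadrature nodes.
function rk_step(f, x, t, h, A, b, c)
    s = length(b)
    k = zeros(s)                 # stage values k_i (scalar state for simplicity)
    for i in 1:s
        # x̂ at the intermediate node t + c_i h, built from the previous stages
        xi = x + h * sum(A[i, j] * k[j] for j in 1:i-1; init=0.0)
        k[i] = f(xi, t + c[i] * h)
    end
    return x + h * sum(b[i] * k[i] for i in 1:s)
end

# Forward Euler as a one-stage tableau ...
A_euler, b_euler, c_euler = zeros(1, 1), [1.0], [0.0]
# ... and the explicit midpoint rule as a two-stage tableau
A_mid, b_mid, c_mid = [0.0 0.0; 0.5 0.0], [0.0, 1.0], [0.0, 0.5]

f(x, t) = 3.0 * x * (1 - x)
rk_step(f, 0.01, 0.0, 0.1, A_mid, b_mid, c_mid)
```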
So maybe this is already a good example of why there's a bit more happening here than just doing two Euler steps. The motivation: earlier I said we can either take the derivative at the point where we are, or at the point where we want to go. If it's the point where we want to go, we're implicit. So maybe we can take a better derivative in the middle, in an explicit way, because we don't want to be implicit; that amounts to this equation here on the top. But then the question is how to estimate the value in the middle, and the easiest way of doing that is taking an Euler step. That leads us to the so-called explicit midpoint rule, which is the third Runge-Kutta method in the Wikipedia list. But if you check it, it's not the same as doing two Euler steps, because you still extrapolate from the point down here, and k₁ is internally multiplied with half the step size. So you kind of reuse the thing that forward Euler did in order to get a better extrapolation, whereas if you write down two Euler steps, something else comes out, with different properties. We'll talk about them in a little bit. One more, for the fun of it: there's the classic fourth-order Runge-Kutta method. I think it's called classic because it's essentially the original method from the work of Runge and Kutta. These are the equations. I'm sure we could find some intuitive understanding of them, and there are also nice plots on Wikipedia on how it extrapolates with a gradient and adjusts it in some way, but I think we skip that. The point is that the Butcher tableau is still readable enough, the code is readable enough; you can implement this. And Runge and Kutta came up with this method, so they derived the numbers on the right, and that is a proof that is readable: you can check books like the Solving Ordinary Differential Equations I book, which includes a derivation of these coefficients. I think by now it's clear that we cannot choose them completely however we want, but it's also clear that there is some freedom, otherwise there wouldn't be a list of 150 Runge-Kutta methods on Wikipedia. So feel free to check out the derivation; it is understandable, you can do that if you're interested. Ah, there's a question. Sorry? [Question about the quadrature weights.] So: do the bᵢ's, the quadrature weights here on the last line, always sum up to one? Yes, I think so. [Follow-up from the audience.] Hmm, I would be surprised, actually. Yeah, I think I got a bit confused there, sorry; I don't know if that was understandable. Okay, all right, let's talk afterwards. Yes? [Question about how the coefficients are derived.] So the suggestion is that the way to derive these would be to choose a family of polynomials, derive the quadrature rule, and then derive the coefficients from the quadrature rule. I'm not so sure about that, because it's not just an integration problem; it's also about constructing these intermediate values. So I'm not sure you have a direct mapping from integration rules to Runge-Kutta methods. [The questioner notes that quadrature rules choose their nodes as the zeros of a polynomial family, so the nodes should come directly from the family.] But I'm not sure that's true here; you have freedom in choosing where you go, for every one of them. When I looked at the Solving Ordinary Differential Equations I book, sometimes the text was motivated by: there are these quadrature rules; can we find a similar thing that solves ODEs?
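For reference, the two methods discussed on this slide, written in standard form (reconstructed, since the slides aren't in the transcript). The explicit midpoint rule:

$$k_1 = f(\hat x(t), t), \qquad k_2 = f\big(\hat x(t) + \tfrac{h}{2} k_1,\ t + \tfrac{h}{2}\big), \qquad \hat x(t+h) = \hat x(t) + h\, k_2,$$

and the classic fourth-order Runge-Kutta method:

$$k_1 = f(\hat x(t), t), \quad k_2 = f\big(\hat x(t) + \tfrac{h}{2}k_1,\ t + \tfrac{h}{2}\big), \quad k_3 = f\big(\hat x(t) + \tfrac{h}{2}k_2,\ t + \tfrac{h}{2}\big), \quad k_4 = f\big(\hat x(t) + h k_3,\ t + h\big),$$

$$\hat x(t+h) = \hat x(t) + \tfrac{h}{6}\big(k_1 + 2k_2 + 2k_3 + k_4\big).$$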
And then the follow-up question is how to construct these intermediate values. That's why my impression was that it's not a direct mapping; but yes, they are very much motivated by quadrature rules. Thanks for the question. So yeah, we were here. I guess the next one to jump to you know from the very first lecture: the Dormand-Prince, or DOPRI5, method. And at the latest here, it gets hard to read. It's still fractional numbers, so it could be worse, but we see that it's hard to interpret what's going on, and the goal should not be to try to understand what each coefficient does. In this book, there are explanations of what properties these coefficients need to satisfy, and then Dormand and Prince came up with a method that does a lot of things, works well, and satisfies the properties. And there's a question in the back. [Question: why are there two lines at the end?] Yeah, I was coming to that. The tableau is so complicated because the method does a whole lot of things. First of all, things get more complicated when you make the whole tableau bigger; that makes it less trivial, and we've seen before that this one was harder to interpret than that one. So that's already one thing. But the question was about these two bottom lines, and here DOPRI5 does one cool thing: it gives you two solutions at each step. This last line is about how to sum up the k's that you computed before, and there are two different ways of summing them up, which give you two Runge-Kutta methods at the same time. This is why it looks this way: they tried to have an efficient algorithm that gives you two Runge-Kutta methods at once, but with a different degree of accuracy. One of them is more accurate than the other, and there are theorems that say the order of accuracy of the one differs from the other. So basically you can use the more accurate one to estimate the error of the less accurate one. And again, that's why this thing is so complicated. Error estimation is in turn helpful because of one thing we haven't discussed yet, but which I'll touch on now and on the next slide, and which you'll see in the homework: I never discussed how to choose a step size h. If you have an estimate for the error, then you can use it to adjust your step size h; your question then changes from solving the ODE for a given step size to solving the ODE while satisfying some local error tolerance. Yes, there's a question. [Question: which of the two lines is the more accurate method?] I actually don't know which one is the more accurate one. Intuitively, this zero here surprises me; it seems like that line contains less information, so I would assume this one is the less accurate, but I wouldn't guarantee it, because I never implemented it by itself. Yes, another question. [Question: why are there two zeros?] The question is: if there are two zeros here, why even compute that stage? It's because above, the coefficients are not zero, so you need it as an intermediate value. Exactly, all right. So yeah: in the very first lecture, you also switched to Python code and scrolled through the DOPRI5 FORTRAN code on GitHub.
I just copy-pasted one specific part out of it. The rest of the code is still quite hard to read, because it's still FORTRAN code, but at least this huge list of numbers that was shown in the very beginning gets a bit more interpretable, because now at least you know that this hard-to-read thing maps to this slightly less hard-to-read thing. So A65 is, well, a₆₅ somewhere here, and you can find all of those numbers on the left side. Again, this says that given the Butcher tableau, implementing the method basically amounts to implementing all of these numbers and then doing the generic Runge-Kutta thing. And as I mentioned before, the FORTRAN code is also so annoying to read, again because it's FORTRAN, but also because all of this other stuff happens in there. It doesn't just compute the next step: it computes two estimates, it does local error estimation, it chooses a step size; all of these bells and whistles are on top and get called when you call SciPy's solve_ivp. Yeah, and once more, I refer to this book very often, so you can imagine that it is actually this thick; in chapter II.4 there's a section on error estimation, where how this works is actually explained. All right, so we come to the intermediate summary and to the end of this block on how to solve ODEs. I think the main takeaway is: we talked about Runge-Kutta methods, and I showed a bunch of them, ending with DOPRI5. There are a lot of them, and they have different properties. The most complicated one we saw today is DOPRI5, but you can actually get larger methods, where the tableau is 15 lines long, with similar readability. But there are differences between the solvers, and these differences are interpretable in some way. We talked about stability, so explicit versus implicit methods. I only talked about forward and backward Euler, but there are implicit Runge-Kutta methods; I didn't show their tableaus, but they do exist, in Python and Julia and wherever you like your solvers. So if things diverge, maybe try an implicit solver; or if you know that your problem is stiff, maybe try an implicit solver. We also briefly talked about order. I didn't really have that as the focus in the last few slides, so let's do it quickly here, because there was this comment about whether it's the same to do many small intermediate steps or to do the complicated construction. The difference between doing the complicated thing and doing the simple thing is that you get a different convergence rate of your local error. It's not defined here, but the local error is basically the one-step error: how much error do you make if right now you have the true, correct solution, but in the next step you do the estimation? The order of a Runge-Kutta method is defined via the local truncation error being of order h^(p+1). This also shows you that if you make h small, your local error gets small; if h goes to zero, the local error converges to zero, so that is good. The solvers that we looked at: forward Euler has order one, and you will prove that on the exercise sheet. The explicit midpoint method was the next more complicated thing, and it is more complicated because it has better convergence: it has order two.
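In symbols (standard definitions, not verbatim from the slides): writing x̂ for the one-step estimate started from the exact solution at time t, a method has order p if

$$\| \hat x(t+h) - x(t+h) \| = \mathcal O(h^{p+1}) \quad \text{as } h \to 0.$$

And a common step-size controller built on an embedded error estimate err and a tolerance tol, in the spirit of the chapter II.4 reference mentioned earlier, picks the next step roughly as

$$h_{\text{new}} = h \left( \frac{\text{tol}}{\text{err}} \right)^{1/(p+1)},$$

usually multiplied by a safety factor and clipped to a sensible range.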
The fourth-order Runge-Kutta method: the order is in the name. And DOPRI5, again, is called DOPRI5 because it has order five. Thanks a lot. Yes, another question? [Question about the two embedded estimates.] No, it has one order of difference between the two estimates, and that's also the trick of how you do step-size selection: the difference between those two estimates is usable, even though the other one is also not the true solution, because you know both of their convergence rates. Also, regarding the explanation we just heard: this does not mean it's always better to use a higher-order method, because otherwise we would use order-13 methods or something all the time; there's a hidden constant in front. So unfortunately, same as with stability, order only gives you a heuristic: if you want to solve something very precisely, you probably want a higher order. DOPRI5 is probably an okay default setting, but if you only need a very coarse solution, things can be faster and more stable with only an order-three method or so. Then I mentioned also that there are many bells and whistles happening when you call SciPy's solve_ivp or similar. The most important one is probably step-size selection, which uses this information about order and local error to choose a step size. And again, that's something we didn't have time to cover in the slides, but it's the whole question of wondering how to choose h instead of taking h as a given; the question rather becomes how to choose h such that my local error stays below some tolerance. We have an exercise on that on the exercise sheet. Also, just for completeness: sometimes you even have automatic solver selection happening, which tries to do all of the above and makes the implementation of everything way more complicated again, because it only uses heuristics. But as always with automatic things: as soon as they break, you probably need to understand the internals again, so it's good to know about stability and order. All right, that concludes the how-to-solve-ODEs part, or how to do simulation. Are there any more questions on that? Otherwise, we also have time at the end. Okay, so now we flip the problem setting, because up to now I assumed that we are given an initial value problem: we have an initial value, we have a vector field, we know everything exactly and perfectly, and still we have to approximate, because it's a hard problem. Now we assume we don't know f, and maybe we don't even know the initial value. But instead we do have data, which, well, we like to have. And the question is how to estimate f and the initial value from it. Yeah, the motivational example from the first slide was this one, basically. When we have scatter plots like here, the green and the red one, and we just put on our machine learning hats, then we do curve fitting: we use a neural network, or after this course we use a GP, and we get a very nice interpolation, and then we go back to the prior on the sides. But of course, if we know more about the generating process of this data, then we can get a more meaningful answer. And this links the whole idea of ODEs together with the problem setting that we often encounter, where we just try to fit data. In this block, we use ODEs as prior information, because they help us understand the process that generated the data; we can write down the generative model and do inference in that system.
And in this example, I mean, you've seen flavors of this multiple times already. I apologize for not using the same ODE here as in later slides; it's because in this paper we used a different ODE than in the other one. Here it's a so-called SEIR model, an epidemiological model where the letters stand for susceptible, exposed, infected, and recovered. So we don't model the dead; we just say, if you die or recover, you're recovered. You could add a D at the end, and then we would know how to write that down. And it's also quite interpretable. It basically says that with a contact rate β, susceptible people meeting exposed people move from here to here. Oh, there's also a typo: this should be an E divided by N, pardon for that, so that the mass moves from there to there. Then with some rate, exposed people become infected, and with some rate, infected people recover. You can also have an additional term that moves people from susceptible directly to infected; if you look the model up on Wikipedia, that term is there, but we set that second β to zero in this experiment, which makes it easier to write down. And the parameter inference problem is basically finding β, γ, and λ such that the ODE solution fits the data. Here that is not the case: the lines in the back are generated with wrong parameters, and the lines at the bottom are generated with inferred parameters. Having a good solution means having a good fit through the data, but at the same time also having parameters that we can cross-check against what we think they should be: compared to the weights of a neural network, we actually know what λ means. And if it says that, I don't know, the recovery rate is huge or super small, then maybe something's wrong, because maybe we have some additional information from the literature about it. Yeah, so it's not just a simulation problem; there is one hidden in there, but it is rather an estimation problem for the vector field f and the initial value. To write that down, we first write down the generative model for the states x, and then we write down the observation model. The generative model for the states is in general an ODE, where now parameters θ enter the vector field f, and θ potentially also influences the initial value; all the rest is the same. This also means that if we know θ, it's a simulation problem, and we can just solve it with DOPRI5. If we don't know θ, then it's not a simulation problem but an inference problem. And the thing we want to learn θ from are noisy observations: a data set that consists of yᵢ's and tᵢ's, observed as follows. Each yᵢ depends on x at time tᵢ, for generality in a linear way. You can imagine this H, for example, selecting only the infected people: in the example above, this quantity we can measure, but that one we cannot, and that one we cannot, but this one again we can. So H would be a linear map that maps a four-dimensional thing to, say, the last two dimensions; a projection. That's why it makes sense for it to be in there. Then we assume Gaussian noise. And the general goal of this parameter inference problem is to estimate θ from the data, which with Bayes' rule means solving this thing.
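Written out in standard notation (reconstructed from the verbal description), the generative model just described is

$$\dot x(t) = f_\theta(x(t), t), \qquad x(0) = x_0(\theta), \qquad y_i = H\, x(t_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal N(0, \sigma^2 I),$$

and the posterior that Bayes' rule asks for is

$$p(\theta \mid y_{1:N}) = \frac{p(y_{1:N} \mid \theta)\, p(\theta)}{p(y_{1:N})}.$$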
And well, as you know or have probably seen before, actually getting the posterior is not trivial. There are a bunch of things we can do; we could use MCMC or Laplace approximations or something. Today we will talk about a kind of sub-goal, but it really contains most of the complexity that we need to talk about today, which is computing the maximum likelihood estimate. That requires computing the likelihood. The main thing that needs to be done is actually thinking about what this term is; we will do that on the next slide. And since this term also appears up here, it is really the thing we need to understand. We can always add a prior afterwards and then think about how to extend this to get the actual posterior. All right, we can write down the likelihood, because we can write down the generative model, and we know that the data here is just a linear map plus Gaussian noise. So the likelihood of the whole data set is just a product of Gaussians, evaluated at yᵢ, with mean H·x_θ(tᵢ) and covariance σ². Now, it's easy to write down, and it is actually the correct likelihood, but we cannot work with it, because our fundamental assumption here is that yᵢ is measured from the true x. And we spent all this time today talking about the fact that we cannot actually get the true x; we have to estimate it somehow. So we cannot actually evaluate this likelihood, and we have to approximate it somehow, for exactly the same reasons for which we needed to do all the Runge-Kutta stuff before: because we cannot compute the true x. So that's a problem, but we also talked about what to do in this case. At least from the perspective of today, the correct thing to do, or one thing that we can do, is approximate the true solution with an estimate of the solution, which is essentially what we do anyway when we call a Runge-Kutta solver: we take the output and just continue working with it. So we just assume that our solver is good enough, while of course keeping in mind that the solver makes an error. If you use it with a huge step size, maybe this is not a good assumption. But we can make this approximation and plug it into our likelihood from above. We get out this term; all that changed is the hat here. And this term we can now evaluate, so we can actually compute the likelihood itself, and with it the maximum likelihood estimate. As you know, maximizing the likelihood is equivalent to minimizing the negative log-likelihood, and we get basically the mean squared error. In particular, if we assume that this covariance is σ² times an identity matrix and σ is fixed, then it is exactly a mean squared error. So this we can compute. It's a loss function that takes a parameter and gives us a loss, and we need to minimize it; and, I mean, we minimize loss functions all the time, so we'll just do that. So let's look at an example, and I warn you in advance: now we don't have an SEIR model anymore, but an SIRD model. The E disappeared, which was the intermediate stage between susceptible and infected, but instead we start counting the dead people. We have a time span, and we have an initial value that has a free parameter: we say that the number of infected people at the beginning is unknown, so we need to estimate it. And we have unknown parameters β, γ, and η, where β is the contact rate. I generated a data set by choosing some parameters, simulating this model, and then taking noisy samples from it.
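Concretely, writing x̂_θ for the numerical ODE solution, the approximated likelihood and its negative log read (up to additive constants):

$$p(y_{1:N} \mid \theta) \approx \prod_{i=1}^N \mathcal N\big(y_i;\ H\,\hat x_\theta(t_i),\ \sigma^2 I\big), \qquad -\log p(y_{1:N}\mid\theta) = \frac{1}{2\sigma^2} \sum_{i=1}^N \big\| y_i - H\,\hat x_\theta(t_i) \big\|^2 + \text{const},$$

so for fixed σ and H, maximizing the likelihood is exactly least squares on the observed components.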
Note also that the noise level will not be estimated; it will not be part of the loss, but it totally could be. We could estimate a noise term; I just didn't here. And the true parameters are a very low number of initially infected people, and then some contact rate, γ, and η. If we have this, we can also start writing down a loss. This is exactly the thing from the last slide: it's a mean squared error, and there's a 0.1 coming from up here; it doesn't really change much, same as this half here. But we can implement it and we can evaluate it. And then, if we want to optimize it, we start with some initial guess. This is a very bad one, because it's orders of magnitude wrong here. And if we plot that initial guess, we only see three lines, because R and D are stacked on top of each other, those two numbers being the same. But it's a very bad fit. And what the loss does is: given the parameter guess, compute the approximate solution, so these lines, and then compute the L2 loss, which means comparing this line to the scatter plot and that line up there to the blue scatter plot down here. And now we can run an optimizer. It's an optimization problem, so the question of which optimizer to run is the same as in general, and I think there will be time dedicated to that in this lecture series. I used L-BFGS because it worked. And when we run that, we see how it iterates; I take a few larger steps here, and at some point, after about 150 iterations, we have what visually seems to be a perfect fit. But the cool thing is that we don't just care about fitting the data and seeing that the lines are fine; we get an output that we can actually interpret. From this problem, we inferred that the initial fraction of infected people was 10⁻⁵; the population here is normalized from zero to one, so scale it by the population count to get an actual number. We get a contact rate of 0.5, a recovery rate of 0.06, and a death rate of 0.002. And these numbers we can interpret, we can compare with the literature, and so on, which is why it's pretty nice to not fit this with just a neural network, or even a GP. Even if we, at this point, don't get uncertainties out, we get something that we can interpret. All right, so this strategy I explained here scales; it is also how you solve a parameter inference problem in more complicated settings. It's really about having a model for the dynamical system with some free parameters, having the data, and then constructing the loss, where internally it uses an ODE solver, as sketched below. Now we come to an example that, once more, you know from the very first lecture; but in the very first lecture you saw a solution that we don't yet know how to compute, so we'll do something else. The motivation in the very first lecture was: we take the SIRD model from the last slide, and if we compare it to actual COVID data, the black dots from up here, then we see that the curves from here would probably not be a good explanation for it. And even just looking at this model, the model itself would not be a good explanation for it. One way of making it better is, for example, to make the contact rate time-varying, which I think is a very reasonable change, from my experience. It's still only a model, because it still only has these four categories and so on.
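The lecture's code isn't reproduced in the transcript, but a minimal sketch of the pipeline just described might look like this in Julia, assuming the DifferentialEquations.jl and Optim.jl packages. The data `ts`, `ys` and all parameter values are placeholders, and for simplicity the loss compares all four states rather than projecting with H:

```julia
using DifferentialEquations, Optim

# SIRD vector field; θ = (log I₀, β, γ, η), states u = (S, I, R, D),
# population normalized to 1
function sird!(du, u, θ, t)
    S, I, R, D = u
    _, β, γ, η = θ
    du[1] = -β * S * I                 # susceptible
    du[2] = β * S * I - γ * I - η * I  # infected
    du[3] = γ * I                      # recovered
    du[4] = η * I                      # dead
end

# The loss: given θ, solve the ODE numerically, then compute the L2 loss
function loss(θ, ts, ys)
    I0 = exp(θ[1])                            # free initial condition, log-parametrized
    prob = ODEProblem(sird!, [1 - I0, I0, 0.0, 0.0], (0.0, ts[end]), θ)
    sol = solve(prob, Tsit5(); saveat=ts)     # a DOPRI5-like explicit RK solver
    return sum(abs2, reduce(hcat, sol.u) .- ys) / 2
end

# ts, ys = ...  (observation times and a 4×N matrix of noisy states)
# θ̂ = Optim.optimize(θ -> loss(θ, ts, ys), [log(1e-4), 1.0, 0.1, 0.01], LBFGS())
```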
So I'm not claiming this is the model, but we made it more complex at one specific point. The question now is: how can we do parameter inference with this slightly more complex model? Or quite a lot more complex, if we make something continuous instead of just a scalar. What we've seen in lecture one is that, by the end of next lecture, we will end up with this plot, where we find a time-varying beta that gives us an estimate of the contact rate, together with an estimate of the states of the solution, with a good interpolation and with uncertainty estimates. We don't have the tools for this today, so this is the motivation for what I'm going to show. We do, however, have the tools today to just model beta of t with a neural network and do parameter inference on the neural network. I'm not claiming this is the smart thing to do, and as you will see in the results, it doesn't give you something as nice as here. But in principle, we have the tools. The estimation problem is one where, for this COVID data here, I assume that gamma and eta are known, so we don't estimate those anymore; we just take them from the literature, which is also what was done to get this plot. All the parameters that we need to find are the parameters of the neural network here. But we can still write down this loss function that takes a parameter and maps it to a loss: given a parameter, solve the ODE, compute an L2 loss, and then try to minimize that. As the difference in quality between this plot and that plot shows, I prepared this for this lecture, so it could have been a nicer plot. We have S, I, R, and D in four different plots because they vary so much in scale, but I wanted to look at all of them, not only I; that's why there are all those numbers on the left. And the thing I cared most about is beta, which is why it takes up so much space on the right. I will show an animation of what it looked like, in my experience, to train this thing, and you will see that it works okay. I'm not a deep learning guy; I don't really know how to choose neural networks, so I chose hyperparameters after trying things out for a while in preparation for this lecture, and I tried out optimizers in preparation for this lecture too. I get to something that is not comparable to the fit that I had with the GP stuff, but it gives us the two bumps here and here. It fails to fit all of this complex part over there, but it gives us a contact rate. It's way less beautiful than if we had uncertainties all around it, and it's way harder to interpret, but it's a point estimate for a contact rate. At the same time, I say it's hard to interpret, but we're using neural networks, right? And typically neural network things are hard to interpret, whereas here we can plot a line and we know what it means. So compared to typical neural network stuff, this is super nice to interpret. It's just that I don't actually know how good this contact rate is and how well it matches the actual contact rates that we would have observed. Also, the training kind of stops at some point; I just ran it until convergence. It's an optimizer, and there's line-search stuff happening, which is why sometimes it's faster and sometimes it's slower.
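A minimal sketch of this neural-network variant (all names, the architecture, and the hyperparameters are my own hypothetical choices, not necessarily what produced the lecture's plots; I use a small MLP for beta(t), the gamma and eta values quoted earlier as fixed, and a fixed-step RK4 loop so that autograd can differentiate through the solve):

```python
# Sketch (PyTorch, hypothetical setup): time-varying contact rate beta(t)
# as a small MLP, trained by differentiating through a fixed-step RK4 solve.
import torch

beta_net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1), torch.nn.Softplus(),  # keep beta(t) positive
)
gamma, eta = 0.06, 0.002  # fixed, taken as known

def rhs(t, x):
    """SIRD vector field with neural-network contact rate beta(t)."""
    S, I, R, D = x
    beta = beta_net(t.reshape(1, 1)).squeeze()
    return torch.stack([-beta * S * I,
                        beta * S * I - gamma * I - eta * I,
                        gamma * I,
                        eta * I])

def rk4_solve(x0, ts):
    """Classic fixed-step RK4; every step stays inside the autograd graph."""
    xs, x = [x0], x0
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h = t1 - t0
        k1 = rhs(t0, x)
        k2 = rhs(t0 + h / 2, x + h / 2 * k1)
        k3 = rhs(t0 + h / 2, x + h / 2 * k2)
        k4 = rhs(t1, x + h * k3)
        x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        xs.append(x)
    return torch.stack(xs)

def train(ts, ys, steps=500):
    """Minimize the L2 loss between solver output and data, as on the slide."""
    opt = torch.optim.Adam(beta_net.parameters(), lr=1e-2)
    x0 = torch.tensor([1.0 - 1e-5, 1e-5, 0.0, 0.0])
    for _ in range(steps):
        opt.zero_grad()
        x_hat = rk4_solve(x0, ts)
        mse = ((ys - x_hat) ** 2).mean()
        mse.backward()
        opt.step()
```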
And this is the final result. So, yeah, it's okay, but it could be better. In particular, maybe one thing to mention: here on the left, there's this peak happening, and if we try to interpret the result, I would basically be wondering what's going on there. If we compare that to the other result that we have, then from January until April, that would be from here to here, this thing is completely uncertain about everything. So maybe I should not spend half an hour trying to interpret this peak. And the reason why this thing is uncertain is also something we can interpret: if the numbers are low (ah, there's a typo, this E should be an I, it moved from one side to the other), so if the number of infected people is small, then this derivative is small, there's little change, and beta influences basically nothing. If you multiply a very small number with something, it doesn't really change much whether the something is big or not; see the short derivation after this paragraph. And that's why there's so much uncertainty here for beta, and also here, because here again infection numbers are small. Whereas in this plot we don't really get that information, and it's kind of a drastic change to double the contact rate. But again, the big disclaimer is that I did this in preparation for this lecture. Conceptually, I think this is a valid approach, with all the pros and cons that you have when you use neural networks. By doing exactly this, you do not get uncertainty quantification. But as I said before, we talked about maximum likelihood estimates because that's the thing we needed to learn how to do. If you have likelihoods, you can add priors, you can start sampling, and so on; I'm not claiming that this is the best thing to do in that context, but it's not impossible to add uncertainty quantification and still do conceptually this, which is using a neural network or another parametric model. And I gave you this code, you can find it on ILIAS, and you can play around with it for the exercises this week. What we will do next week, as a final thing, is this. Because we had so many lectures on GPs, and they address so many of the things I just mentioned, like the uncertainty, and what's happening on the left would also be helped by using that other class of models. That might be particularly helpful here because we map a scalar to a scalar, which is not a very common neural-network task, let's say, but it is a common GP task: we saw many plots where we did exactly that with a GP. So that might be the more natural choice, and we actually can do inference with it, and that's what we're going to do next week.

So with that, we are at the end of the lecture. We talked about ODEs in general and saw some examples in machine learning, and I hope, again, that I gave enough examples that you see that ODEs appear in core machine learning research and in papers that are submitted to conferences right now; two weeks ago, I reviewed papers that had ODEs in them again, as always. So ODEs play an important role in core machine learning papers, but also when you start applying machine learning to real-world problems: in the real world, you often have equations to write down, because you actually thought about the problem before calling your machine learning algorithm.
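To make that sensitivity argument concrete, here it is as a one-line derivation in the SIRD notation from before:

```latex
% Sensitivity of the infection dynamics to the contact rate:
\frac{\partial \dot{I}(t)}{\partial \beta}
  = \frac{\partial}{\partial \beta}
    \bigl( \beta\, S(t)\, I(t) - \gamma\, I(t) - \eta\, I(t) \bigr)
  = S(t)\, I(t)
```

So whenever I(t) is close to zero, changing beta barely changes the dynamics, which is why the data cannot constrain the contact rate in those regions.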
In general, solving ODEs is non-trivial. That's why we spent an hour talking about numerical solvers, and we only scratched the surface, but we've seen a bunch of examples, and I hope you got a better feeling for the difficulties and complexities that arise when solving ODEs with a computer. And we cannot just simulate ODEs; we can also learn ODEs and their parameters from data, which, again, I think is particularly interesting in the machine learning context, even when there are neural networks in there. Maybe one final remark, to be more general and zoom out: at this point, you know what neural ODEs are and, basically, how to work with them, because in the neural ODE context the difference is just that you don't assume structure, maybe because you didn't think about your problem beforehand, and you just make everything a neural network and try to train that. But conceptually, it's basically the same loss function and the same training procedure. So there we go: neural ODEs. All right, that's the end. Next week, we talk about probabilistic numerical ODE solvers. For the first half, it's the same problem setting but a different point of view, because we know about probabilistic state estimation, and we want to apply that. And at the end, we do parameter inference again. So thanks a lot, and please fill out the feedback form.