Welcome everybody to the fifth lecture of the Numerics of Machine Learning lecture course. My name is Jonathan. I'm a PhD student in Philipp Hennig's group as well. And today I'm going to give you a lecture that I call State Space Models, with which we will open a new chapter, so to say, in this lecture series. We are going to dive into a topic that is actually a very hot topic at the moment and quite emerging. What we are going to talk about in detail are dynamical systems. So over the next three or four lectures, we are going to talk about systems that evolve over time. This is what dynamics really means: an evolution over time. And these systems are quite interesting; we can do a lot with them. Today we are going to start off with the task of estimating the temporal unrolling, so the temporal evolution, of such a dynamical system from observations that we assume to be coming from it. What we are ultimately working towards is the simulation of such systems, because this is a very interesting task that we can do with probabilistic methods, as we will learn in two weeks. So similar to the structure that Jonathan Wenger and Marvin chose for the linear algebra and Gaussian process lectures, we first have to chew through a bit of theory. We have to look at material that has actually been around for quite a while in order to get to know what cool things we can use it for. So this is what we are going to do today. We are going to establish a bit of groundwork. We are going to construct a language in which we will describe dynamical systems. And on top of this language, we are going to build an algorithmic framework that allows us to discover latent components of such systems from the parts that are actually observable, so from data.
To put this lecture content-wise a bit into context: previously you learned about linear algebra and Gaussian processes, so basically a regression task, trying to find a posterior over a function that describes a fixed set of data well. And Jonathan then showed us last week in particular how to make these techniques scale to huge data sets, data sets so huge that they don't even fit onto your computer anymore. Today we are going to, as I already said, open a bit of a new chapter. We are going to shift the setting a bit, to settings in which we are dealing not with fixed data sets but rather with streams of information. So maybe we don't even know all of the data we are going to see, or how big the data set will be in the end, but information is really trickling in one piece at a time. So we want to estimate the state of our system in an online fashion. Okay, and to start off, let us look at a few examples of the kinds of systems I've been going on about. And while looking at those examples, I'm not only going to tell you that these systems are actually useful, because otherwise I wouldn't be telling you about them, but we are going to extract from these examples a few key characteristics of dynamical systems, characteristics we really want to be reflected in the model we choose to describe those dynamics. So let us start off with this example you saw already in the very first lecture given by Professor Hennig. It's data from the COVID-19 pandemic, and not only data: in this red line, let's test this, it actually works, in this red line we see a Gaussian process posterior fit through this data and a bit of a prediction afterwards.
It's the number of infected case counts per day during the COVID-19 pandemic, and on the bottom you see this beta parameter, which is also an evolution over time and which is interpreted in these kinds of models as a contact rate between individuals in a population. And there, of course, you see, as I already said, the key characteristic of a dynamical system: we have this temporal structure, this temporal evolution, right? But what you can also think about is that we can only observe parts of the system. We can measure the infected case counts because we have COVID tests. Maybe they are unreliable, but still, it's a means of gaining information about a part of our system. But we don't have any means of measuring this contact rate. At least I hope so. It would be really questionable if we had means to measure the contact rate between individuals. So this is a part of the system that is hidden to us. It's a latent force. Okay, moving on to the next example, which is a bit different in the speed at which data trickles in. Because while we maybe get a new estimate for the infected case counts in the COVID-19 pandemic once a day, if you think about an application that estimates the orientation of your smartphone in three-dimensional space, it probably has to process a rapid fire of information coming from the sensors in your device. Maybe there are some kind of accelerometers built in there, and some application on your device analyzes this data, has a built-in model for these dynamics, and estimates the orientation of your device. So we really have an online estimation task at hand here, in which we have to quickly update our estimate. And as I already promised, we are working towards a task that is called simulation, the simulation of dynamical systems. This is a really very interesting task.
Here on the right-hand side you see a fancy animation of a dynamical system that is very difficult to simulate, and we're not going to talk about such difficult-to-simulate systems, because this is a spatiotemporal dynamical system: it not only evolves over time but also over two-dimensional space. It's a fluid dynamical system, and a very difficult model to simulate. I just wanted to show you where this might be leading in the future. So what are the characteristics that we have learned about? We already talked about this evolution over time, and we talked about models being only partially observable. Then, especially in these simulation tasks, we are faced with very complex dynamics, so we might be dealing with highly non-linear functions and interactions, and we have to find a way to deal with those. But I think the thing which I want you to keep in mind over the next couple of slides in particular is really this online estimation setting. So we need an algorithmic framework on top of our modeling language that allows us to build upon the most recent piece of information that trickles in and update our belief accordingly. Okay, but now that we know what we want to do, let us actually look at how we are going to do it. First of all, we are going to look at a graphical representation of the modeling language we choose. And this is also a figure you saw in the very first lecture already; Professor Hennig introduced it there, and I'm going to walk you through it step by step now. What you see here in this red box is a sequence of white nodes, and these white nodes represent random variables modeling our system state. Okay, first of all, what is a system state? In order to define this, you could look up a hundred different definitions in a hundred different books, but I chose one of them because it carries everything that I want to say about these systems.
The definition reads: the state of a dynamical system is a set of physical quantities, the specification of which completely determines the evolution of the system. Okay, so there's a lot to be learned from that definition. First of all, a state is not only one quantity, not one number which we want to model, not one trajectory, but really a set of physical quantities that are tracked in the same state and are expected to interact in a potentially very complicated way. So we have to specify this set of physical quantities, and I actually brought some examples of what these quantities could be. For example, a very canonical dynamical system that is often used to teach about dynamical systems is the pendulum model. There you model the angle of the pendulum and the angular velocity at every time step. So this is quite simple: just two quantities that naturally interact with each other, right? But in this COVID example, it might be a little bit more difficult. There you have to track the numbers of susceptible, infectious, and deceased people in a population during a pandemic, but maybe even more than first appears: maybe you want to model something like this contact rate, or a recovery rate, and all of these are potential components of your state. Or x, y, z positions, velocities, and accelerations in three-dimensional space to do orientation estimation with your smartphone. Okay, now that we know what a state is, we have to define how we observe the state, and as I already told you, these models are often called partially observable. We have observations Y. So we collect these observations in a random variable Y that somehow depends on the present state. And there's a lot we have to take care of in this plot, because I, for example, use different indices here. We have Y sub n and X sub k. So this is kind of confusing, right?
But this is just to visually represent the fact that we don't have to collect a measurement at every state or every point in time at which we model the state. So we can have trajectories of states where no observations are available, and we can still model them. However, in the remainder of the lecture, I'm just going to drop these different indices. I'm always going to write Xk and Yk, so we just assume we have a measurement for every state that we model. But that is actually not a restriction. If we now take a step back and look closely at this graphical model, we can actually read off the properties that we want this model to have before we go ahead and define it properly, mathematically. The first thing I want you to notice is that there are only arrows going from one state to the next. There's no arrow going from Xk minus one to Xk plus one. Only two subsequent states are connected with an arrow going forward. This is a property of a so-called Markov chain, and this is probably also what the author of the book I took this definition from meant by the specification of our current state completely determining the evolution of the dynamical system in the future. And there is also only an arrow going from Xk to our current measurement. This measurement is not dependent on any future or past states. Okay, but enough with the talking, let's actually make this formal. This property of only having an arrow between two subsequent states is called the Markov property. Probably all of you know this. The Markov property basically says that the conditional distribution of our current state variable is independent of everything else but the immediately preceding state variable. This is what this equation here says. So it often reads: given the present, the future is independent of the entire past. The second property I just highlighted is that of conditionally independent measurements.
So our measurement random variable Yk depends only on Xk. This is just an assumption we make, which will make computations and derivations later on way easier and very elegant. These are two properties which I want you to keep in mind, and they are actually quite intuitive to think about when you have this graphical model in the back of your head: the Markov property and conditionally independent measurements. Furthermore, I want to highlight something that I didn't write down here on the slide, but which by extension you can also read off from the graphical model: the current measurement is also independent of all future states and all future measurements. Okay, with everything that we have learned so far, we can now go ahead and write down the definition of a probabilistic state space model. And this is just a Bayesian model that defines three things. First, an initial distribution p of X0 over the very first state in our trajectory. Then, for every state Xk in the sequence we are considering, it defines a dynamics model as the conditional probability distribution p of Xk given Xk minus one. So here you see that this already uses the Markov property. And for every measurement Yk, it defines a generative model for our observations, so a likelihood, if you will. It's often called a measurement model. It's just a conditional distribution of Yk given Xk, independent of everything else. And this is the entire definition of a probabilistic state space model, the language we are going to use to model dynamical systems. Are there questions so far? I'm going to bring up these distilled components of the state space model throughout the lecture from time to time, because these are really what we are going to use to build our algorithmic framework. Yes? For this lecture, a state is the same as a Markov state?
Yeah, that depends on the definition of a Markov state. But if a Markov state for your definition is a state that, given its previous state, fulfills the Markov property, then yes. In this case, we are considering states in a Markov model, yes. So the question was whether the term state can be used interchangeably with the term Markov state. Yes? Implicitly, yes. So the question is: if Xk is independent of Xk minus two, but Xk minus one is dependent on Xk minus two and Xk is dependent on Xk minus one, isn't Xk then also dependent on Xk minus two? And yes, implicitly it is, because otherwise we couldn't carry knowledge forward. But conditioned on Xk minus one, it is independent; it depends on the past only implicitly. Yes? Okay, yeah. Yeah, exactly, thanks for the remark. You can actually also read this off from the graphical model. The arrows and the nodes have this conditional independence structure encoded in them. This is something that is, for example, taught in probabilistic machine learning. Thanks for the questions and the remarks. Now let's move on to what we are actually going to do with this definition. In order to analyze such dynamics using state space models, we are basically computing four different conditional probability distributions that we can, so to say, label. First of all, we are going to compute a prediction distribution, which is just the conditional distribution of our present state given all previous measurements, not including the present measurement. We are also going to compute the filtering distribution, which is basically the updated prediction distribution, p of Xk given all the measurements up until k, so now including k. Along the way, we have to compute this quantity which I call the data likelihood; it is just the probability of the current measurement given all the previous measurements.
This is something that is not really of high relevance in this lecture, but we have to compute it anyway along the way, so I'm going to show you how, because it's going to be relevant in later lectures. And last but not least, we are going to compute a smoothing distribution, which is the current state estimate given all the data we have. So if at some point we say we have an entire finished trajectory and we want to compute the current state estimate using future data as well, that's called smoothing. But we're going to do that a little bit later in the lecture. And these four probability distributions we now want to compute for every step k in a potentially infinite sequence. Of course not infinite, but an ongoing sequence. First, we are going to derive how we compute the filtering distribution. And in order to compute the filtering distribution, we have to learn how to compute the prediction distribution, because those two distributions are computed from each other. This might seem confusing at first: how can we compute one thing from the other? It basically just means we get a recursive algorithm, a recursion starting at the initial time step. So we are not going to start the derivation at X0; we are making this a bit more general. We say we have a filtering estimate available at time step k, given all the data up to and including step k. So we have a current filtering estimate, and from that we compute the prediction distribution. How are we going to do this? And this is a recurring theme in deriving these kinds of probability distributions: you basically just introduce the quantity you care about. So from this distribution, you build a joint distribution of what you already have together with the quantity you are looking for: the joint distribution of Xk plus one and Xk given Y1 to k. This joint distribution you can split up into a product of two conditional distributions. Why do we do that?
Because we notice that we actually know both of these distributions. The left one is just our dynamics model, as defined by the probabilistic state space model. This is given. And the right one is the previous filtering distribution, which we know because we assumed it is given. In the very beginning, this is just p of X0, because there's no data to be written down on the right of the conditioning bar. Okay, now we have to get rid of this Xk, because we really don't care about it, right? We want to compute p of Xk plus one given Y1 to k. So what are we going to do? We're just going to marginalize it out, yielding this equation: p of Xk plus one given Y1 to k is the integral over Xk of this product of conditional distributions, our dynamics model and our previous filtering distribution. This equation is so important that it has a fancy name: it's called the Chapman-Kolmogorov equation. This is a convention from the research field of signal processing, and it's a really important equation. It gives us the possibility to predict into the future given past measurements. Cool. So in order to compute our filtering distribution, which was our initial goal, as soon as the new data point Yk plus one comes in, we have to have a tool to correct for this newly gained knowledge. This is called the correction step. Here is the Chapman-Kolmogorov equation copied over from the previous slide. And now we have a new data point Yk plus one, and we are going to integrate it into our estimate of Xk plus one. How are we going to do this? Bayes' theorem. So we just write it down: p of Xk plus one, the probability of the next state given all the data up until this step, is proportional to the product of a prior and a likelihood term, so to say. The likelihood term is easy: as you see, it is the measurement model that we defined in our probabilistic state space model.
The left term, which can be interpreted as the prior, is just the predicted distribution we computed using the Chapman-Kolmogorov equation. So this is what I meant by computing one distribution from the other: we have this scheme of predicting and updating, predicting and updating. And of course, since this is Bayes' theorem, we have to divide by a normalizing constant that we compute using this integral, which is sometimes called the evidence, in order to get a valid posterior distribution. Some of you may have looked closely and noticed that this is actually the term that I called the data likelihood in the beginning. So we have to compute this quantity anyway along the way. It is just the probability of observing the next data point given the previous ones. Okay, cool. In order to put some structure into this, yes? Yes. I told you earlier that this doesn't have to be the case. I just use this convention, if you will, so that not too many indices are flying around, because otherwise I would have to use a different index for the measurements and for the states, which makes everything a bit more confusing and hard to look at. But you don't have to have a measurement for every state that you model. Exactly. We are actually going to see this later on, but I basically neglect it here a little bit, because the algorithm is usually phrased in a way that you don't predict between measurements. But we are going to see this later, thanks. Okay, let's give this a little bit of structure in the form of pseudocode, because this is actually the Bayesian filtering algorithm in its general form. We start at time step zero and then we enter this while loop. This is not a for loop, because we don't have a finished or fixed data set; new observations could trickle in at any time, right? We just predict up until the next step using the Chapman-Kolmogorov equation, and we correct using Bayes' theorem.
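To make this predict-and-correct scheme concrete before we commit to any specific model, here is a minimal Python sketch (Python is also fine for the homework) for a made-up two-state system, where the Chapman-Kolmogorov integral becomes a finite sum and Bayes' theorem becomes a simple reweighting. All numbers here are invented purely for illustration.

```python
import numpy as np

# Hypothetical two-state system: with a discrete state space, the
# Chapman-Kolmogorov integral is a sum and Bayes' theorem a reweighting.
A = np.array([[0.9, 0.1],            # p(x_{k+1} | x_k): rows index x_k
              [0.2, 0.8]])
likelihood = np.array([[0.7, 0.3],   # p(y | x): rows index the observation y
                       [0.3, 0.7]])

belief = np.array([0.5, 0.5])        # initial filtering distribution p(x_0)
for y in [0, 0, 1]:                  # stream of incoming observations
    predicted = A.T @ belief                   # predict (Chapman-Kolmogorov)
    unnormalized = likelihood[y] * predicted   # correct (Bayes' theorem)
    evidence = unnormalized.sum()              # data likelihood p(y_k | y_{1:k-1})
    belief = unnormalized / evidence           # normalized filtering distribution

print(belief)
```

Each loop iteration is one predict step followed by one correct step, exactly the two equations on the slide, just with sums in place of integrals.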
These two equations you just saw; these are the prediction and correction equations. We increase our time step and go on, and so on. Notice that we never go backwards in this algorithm. As soon as we see the next data point, we don't update our previous state estimates, even though we could provide them with more information. We are still in this online estimation setting: we want to be quick with our estimates, and our most recent state estimate always saw all previous measurements, so we can't get better than this. We have one linear-time loop here, and if we decide at some point to stop our state estimation algorithm, we just return the sequence of filtering distributions, the distributions over the current state given all the previous measurements. I say this so often because it's very important. Okay, what I didn't show you yet is how to actually compute those things, right? Because there are integrals flying around, and there's also an integral hidden in this proportionality sign, and we don't even know what our model is yet, so we cannot really compute anything with this. It's just a general collection of formulas. The next step I'm going to take is to actually build a model that can produce plots like this one, where on the right-hand side of this vertical bar, denoting our current time step k, currently at zero, you see the prediction. The left-hand side is then the filtering estimate, which is currently not visible. I have not told you so far how to produce this plot, but I'm going to tell you on the next slide. And this is roughly what it's going to look like. You always see here on the right-hand side the prediction into the future. You would not normally compute this prediction, because it's just not really informative. This is just for visual reasons, so that you can see what these predictions could look like under a certain model assumption, which we are now going to look at.
Okay, back to our general setting. We have our state space model. We have those distributions we want to compute but cannot compute yet, because we don't know what our model is. So let's insert actual models. And we are going to choose a linear Gaussian state space model. We start off with an initial Gaussian distribution with some mean mu zero and covariance matrix Sigma zero. And now it gets really interesting, because we are going to consider linear transformations of the previous state to get a distribution over our next state. So Xk given Xk minus one is a Gaussian distribution that arises from a linear transformation of our previous state. I say linear; it's actually affine, since we have this plus b here. But I would confuse it so often that I just stick to saying linear, while meaning affine, which is more general. This is under additive Gaussian process noise given by this covariance matrix Q. These two matrices have names: A is called the transition matrix and Q is called the process noise covariance matrix, and I will call them that in the remainder of the lecture. This is just a convention from signal processing. And we are going to choose the same kind of model for our measurements as well: a linear Gaussian model with this measurement matrix H and this measurement covariance matrix R. And this is the state space model we are going to choose. Most of you probably know why: because a linear transformation of a Gaussian random variable is a Gaussian random variable. And so you can already guess that all those distributions are then also Gaussian, with different moments. We are going to call the first ones the prediction moments, denoted with a superscript minus. Then we have the filtering moments mu k and Sigma k. The data likelihood is computed along the way.
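To see what this model defines generatively, here is a hedged Python sketch that samples one trajectory and its measurements from a linear Gaussian state space model by ancestral sampling. The one-dimensional model and all matrix values are assumptions made up for illustration, not the ones from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D linear Gaussian state space model (all values made up):
A, b, Q = np.array([[0.95]]), np.array([0.0]), np.array([[0.1]])  # dynamics
H, r, R = np.array([[1.0]]), np.array([0.0]), np.array([[0.5]])   # measurements
mu0, Sigma0 = np.array([0.0]), np.array([[1.0]])                  # initial state

# Ancestral sampling: x_0 ~ N(mu0, Sigma0), then repeatedly
# x_k ~ N(A x_{k-1} + b, Q) and y_k ~ N(H x_k + r, R).
x = rng.multivariate_normal(mu0, Sigma0)
states, observations = [x], []
for _ in range(50):
    x = rng.multivariate_normal(A @ x + b, Q)   # transition with process noise
    y = rng.multivariate_normal(H @ x + r, R)   # noisy measurement of the state
    states.append(x)
    observations.append(y)
```

Sampling like this is exactly how the ground-truth trajectory and observations for the car tracking demo later in the lecture are generated, just in higher dimensions.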
And later on we are also going to see that the smoothing distribution is Gaussian as well. So this is quite convenient. How do we know this? We know it by Gaussian inference. You have probably already seen the theorem of Gaussian inference more than once, and I think all of you understood it, because otherwise you would have had trouble understanding the previous three lectures: Gaussian processes are really built on this theorem. But I want to go through it very quickly again, because later on I want you to recognize these terms in our algorithm. If we have a Gaussian distributed random variable X, distributed according to some mean and covariance matrix, and we have an affine transformation of it, so p of Y given X is a Gaussian random variable whose mean is an affine transformation of X, then we can compute p of Y, which is also Gaussian, but I'm not really interested in that right now. What I'm interested in is the posterior distribution p of X given Y. And it's this really complicated collection of terms that you probably all saw: we add to our prior mean the product of this gain term and this residual term. And for our posterior covariance, we subtract from the prior covariance the product of Sigma times A transpose, times the inverse of this Gram matrix, times A Sigma again from the right-hand side. Okay, if you keep these labeled terms in mind, we can proceed to an algorithm that allows us to compute the prediction and filtering moments for our linear Gaussian state space model in closed form, just using Gaussian inference. And this algorithm has a name which most of you might have already heard: it's called the Kalman filter. It is named after Rudolf Kálmán, a Hungarian researcher who died in 2016, and after whom this very useful algorithm is named.
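The theorem can be written down as a small Python function; the names here are mine. The scalar example below is the textbook sanity check: a prior N(0, 1) combined with a unit-noise observation of y = 2 gives the posterior N(1, 0.5).

```python
import numpy as np

# Gaussian inference, as on the slide:
# prior x ~ N(mu, Sigma), likelihood y | x ~ N(A x + b, Lam),
# posterior x | y ~ N(mu + G (y - A mu - b), Sigma - G A Sigma),
# with gain G = Sigma A^T (A Sigma A^T + Lam)^{-1}.
def gaussian_condition(mu, Sigma, A, b, Lam, y):
    S = A @ Sigma @ A.T + Lam                 # Gram matrix of the marginal p(y)
    G = np.linalg.solve(S, A @ Sigma).T       # gain, via a solve (S symmetric)
    post_mean = mu + G @ (y - A @ mu - b)     # prior mean + gain * residual
    post_cov = Sigma - G @ A @ Sigma          # prior covariance shrunk by gain
    return post_mean, post_cov

# Scalar sanity check: prior N(0, 1), observe y = x + noise with unit noise, y = 2.
m, C = gaussian_condition(np.zeros(1), np.eye(1), np.eye(1),
                          np.zeros(1), np.eye(1), np.array([2.0]))
print(m, C)  # posterior N(1, 0.5)
```

The gain and residual terms here are exactly the ones to keep in mind: they reappear, with H in the role of A, as the Kalman gain and the innovation in the filter equations.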
This algorithm, as I already said, allows us to compute the prediction and filtering distributions in closed form. I'm not going to derive it here, because I want you to derive it in your homework sheets by inserting our linear Gaussian state space model into the prediction and filtering equations. I'm just going to show you the resulting equations. The prediction equations are given by these two expressions. We have to compute the predicted mean, and this is just the linear, or affine, transformation of the previous filtering mean. This is quite easy, right? For the predicted covariance matrix, we multiply the previous filtering covariance from the left and the right with the transition matrix, and we add the process noise covariance matrix Q. So quickly we have computed our predicted moments. For the correction, you will probably recognize terms from the previous slide. We first compute this measured predicted mean, so to say, and save it in this variable y hat. We compute this matrix S as the measurement matrix times the predicted covariance matrix times the transposed measurement matrix, plus the measurement noise covariance matrix R. Then we compute this matrix K as the predicted covariance times the transposed measurement matrix times the inverse of S. And this is a very important quantity: it's called the Kalman gain. If we remember the gain from the previous slide, we see that the structure is actually very similar. And the gain has a meaning: it kind of translates from your measurement space to your state space. It maps the information that you gained in measurement space onto the state space. And therefore we use it to update our filtering mean, by multiplying this Kalman gain matrix with our residual term, where we subtract our measured predicted mean from the data point.
So basically the difference between what we really observed and what we would have predicted to observe. And the filtering covariance matrix is then just the prior, or predicted, covariance matrix minus the Kalman gain multiplied from the left and right onto this Gram matrix S. And along the way we actually already computed this data likelihood term, which is also Gaussian: the mean is given by the measured prediction and the covariance is given by the Gram matrix S. And that's the Kalman filter. We can now write these equations down in this algorithm, which has exactly the same structure as our Bayesian filtering algorithm, but now we can actually compute these distributions, and in closed form. We again have this while loop, and we compute our predicted moments and our corrected moments according to the equations you saw before. In the end we return a sequence of means and covariances that completely specifies the filtering trajectory, which is just a collection of Gaussian random variables. And again, I want to highlight this: it is only ever informed by the previous and present data points, not by the future. Okay, now I'm going to do something risky: I'm actually going to show you code. Let's see if this works. As the name already suggests, we're going to look at a car tracking example in practice. The setting is that we have a car going from its start position to its destination in an x-y plane. Of course cars move in three-dimensional space, but we neglect the third dimension, so basically imagine a trajectory drawn on a map. And based on noisy measurements, for example GPS measurements of this car, we want to find the true trajectory. All we have to do for that is open this notebook and quickly realize that we are not staring at Python code. This is actually Julia code. So larger, let's see, yes, thank you.
I hope everybody can read this, nice. Anyway, we're looking at Julia code, not Python code. I have no special intentions; I don't want to advertise Julia or anything. It's just maybe interesting for you to see. I'm using Julia myself in my PhD. It's a quite nice and emerging programming language, but of course it has its advantages and disadvantages. So if you want to use Python in your homework, please do so; it's completely fine. Okay, this is just some setup we actually don't need. We have to begin by setting up our state space model. Remember we have this car tracking example here, right? So the state is going to consist of four components: x and y position and x and y velocity. The velocities are denoted with a dot. And we are only ever going to measure the first two components of the state vector. We are only going to get GPS measurements, that is, measurements of the position. This might be completely different in practice, but this is just how it is for this example. Okay, this is our state. We have a measurement dimension and a state dimension of two and four, respectively. We are going to initialize this with a Gaussian distribution with mu zero and Sigma zero, mu zero just being the zero vector in four dimensions, and Sigma zero a scaled identity matrix. Okay, we write this down and execute the cell. Cool, so far so easy, right? Now it gets complicated. We now have our dynamics model. It is a linear Gaussian model, as we already said, because then we can do Kalman filtering. We are going to assume that our matrix A doesn't change over time, so it's constant. But it's quite complicated, right? Okay, maybe A is not as complicated as the process noise covariance, which we are going to talk about in a minute. A is just this thing with ones on its diagonal and something that involves dt on its off-diagonal. dt is, as the name already suggests, the time step.
So the difference in time between two subsequent steps, k and k plus one, as I wrote down here. And Q, as you can see, has an interesting structure and also involves polynomial terms of this dt somehow. And I'm actually not going to tell you why, because it's very complicated. Roughly speaking, it arises from the discretization of a continuous stochastic model, a stochastic differential equation. And these objects are really complicated to study, and we don't have the time to discuss them here. But you recognize that it has some structure, and you just have to believe me that it makes sense to use this state space model. Okay, cool. You can actually just translate this to code. We say our time step between two subsequent steps k and k plus one is 0.1, and we just write down these matrices, the transition matrix A and the process noise covariance matrix Q. Okay, as for our measurement model: as I already said, we just measure the first two components of our state vector. So we have this projection matrix H here as our measurement matrix. It basically just selects the x position and the y position from the state vector. And the measurement noise covariance matrix is just a scaled identity matrix again. Okay, so far so easy. Let's execute this. What we are going to do next is simulate this dynamical system once. We are going to draw a trajectory from these dynamics, and we are going to draw a bunch of measurements from this trajectory. That is going to give us our ground truth and the observations on which we do inference later on. And we can visualize them, just a second. So here we have the time series plots on the right: the x position over time and the y position over time. And we see that the measurements are not equally spaced over time; I chose these at random from the sequence of measurements.
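For reference, the model components described here could be written down in Python like this. The time step is the one stated in the lecture; the exact structure of Q is my assumption that it follows the standard discretized white-noise-acceleration model, and the noise scales `q` and the factor in `R` are made-up values:

```python
import numpy as np

dt = 0.1  # time step between steps k and k+1

# State: [x, y, x_dot, y_dot]; positions advance by dt * velocity.
A = np.array([
    [1.0, 0.0, dt,  0.0],
    [0.0, 1.0, 0.0, dt ],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Process noise from discretizing a white-noise-acceleration SDE:
# the dt-polynomials couple position and velocity noise per axis.
q = 1.0  # diffusion intensity (assumed)
Q_axis = q * np.array([[dt**3 / 3, dt**2 / 2],
                       [dt**2 / 2, dt       ]])
Q = np.zeros((4, 4))
Q[np.ix_([0, 2], [0, 2])] = Q_axis  # x-position / x-velocity block
Q[np.ix_([1, 3], [1, 3])] = Q_axis  # y-position / y-velocity block

# Measurement model: observe only the two position components.
H = np.eye(2, 4)
R = 0.25 * np.eye(2)  # scaled identity, scale assumed
```

Note how H is just a projection matrix: applied to a state vector, it returns the x and y positions and drops the velocities.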
Because these two plots are a little bit hard to interpret, I always plot the trajectory of the car left of these two plots, in the xy-plane. So basically the car starts here, then drives along this trajectory, and reaches its destination here. So now what we are going to do is implement our Kalman filter. First the prediction step, and it basically translates directly into code. You can basically write down the equations from up here. We don't have an affine term in this model, so the predicted mean is just A times mu, and we have our predicted covariance matrix, and we return them from our predict function, and we execute this. Easy. And we are going to do the same for the correction. We have all these terms that we already saw; you saw them two slides ago, so I'm not going to detail them anymore, I'll just show you. We translate them here into Julia code. Notice that this matrix K involves the inversion of a matrix. And this, as we learned in the second lecture given by Marvin, we don't do. What I do instead, because I'm lazy and Julia is awesome: I tell Julia that our matrix S has structure, it is symmetric, and then I can just use this left-division operator, and Julia intrinsically does a fancy matrix decomposition and solves the linear system instead. So: don't use inverses. This is our Kalman filter correct function. And now I just wrote down the loop that we saw in our pseudocode. I start off by initializing a list with our initial moments. I enter this for loop. It's not a while loop here, because then we would sit here in two weeks and wait until it finishes, and this is not really interesting. I take from the end of my list our current filtering estimate. I use this and our dynamics model to compute the predicted mean and covariance. And then here's the twist, and this is related to the question you heard earlier: if there is data, I use the correct function and save the filtering estimate to our list of filter estimates.
So the filtered moments, the corrected moments. If there's no data, I'm still going to save something: I'm just going to save the predicted moments, just in order for you to see what happens in between measurements if you execute the Kalman filter. But this you don't have to do, it's not necessary. Okay, we execute this loop. It's actually already finished, because it's not a lot of data points. And then we plot it. This is complicated plotting code we don't have to look at. And we see that this is our resulting estimate. You see here on the right that the x position is regressed, in a way, and here in between those measurements you see the uncertainty, so these are the marginal uncertainties, two marginal standard deviations, that we get in between those measurements. And you see that the further away we get from the previous data point, the larger and more rapidly this uncertainty grows. You see this also here. It's a little bit difficult to plot this uncertainty in two-dimensional space, or in the state space so to say, because to make this plot correct we would actually have to draw Gaussian contour lines around every step that we take, and then this plot would just be cluttered. So I didn't do that. This is the best we get, but it's kind of inaccurate. Yes? No, we don't do anything fancy like that. We just use the prior as given by our A matrix. The question was whether we somehow correct using the time step, or by decreasing the time step in between those measurements, if I understand your question correctly. Not entirely sure what you mean by it being different from the usual discretization; maybe I can try and answer this question offline. Okay, anyway. So you see here that also here the fact is reflected that our model just predicts in between measurements using its dynamics model, and because it's linear, it's just a straight line, right?
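The loop just described, including the save-the-prediction trick when there is no data, might look like this in Python. This is a sketch under my own naming; the lecture's actual notebook is in Julia:

```python
import numpy as np

def kalman_filter(mu0, Sigma0, A, Q, H, R, observations):
    """Run a Kalman filter; observations[k] is None where no data arrived."""
    means, covs = [mu0], [Sigma0]
    for y in observations:
        # Prediction step: push the most recent estimate through the dynamics.
        mu_pred = A @ means[-1]
        Sigma_pred = A @ covs[-1] @ A.T + Q
        if y is None:
            # No measurement at this step: save the predicted moments.
            means.append(mu_pred)
            covs.append(Sigma_pred)
            continue
        # Correction step, solving with S rather than inverting it.
        S = H @ Sigma_pred @ H.T + R
        K = np.linalg.solve(S, H @ Sigma_pred).T
        means.append(mu_pred + K @ (y - H @ mu_pred))
        covs.append(Sigma_pred - K @ S @ K.T)
    return means, covs
```

Between measurements, the returned moments are just the predictions, which is exactly what produces the fanned-out uncertainty in the plot.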
And as soon as new information trickles in, we snap back onto the trajectory, because we trust our data, and the uncertainty also rapidly decreases. And this is our filtering estimate for this task. Okay. And now we go back to the slides and look at what we have learned so far. We learned how to build a state space model. A state space model is the language that we are going to use to describe dynamical systems such as a moving car or a pendulum. We chose it to be linear Gaussian, because then we can compute all the interesting quantities in linear, no, not in linear time, using linear algebra, with matrix-vector products and so on. And basically the essence of what we learned to compute is this posterior: the sequence of filtering estimates that are only informed by the past and the present, being a collection of Gaussian random variables. And this is really cool for this online estimation setting that I've been mentioning so often today, because at our current, our most recent estimate, we have seen all the data that is available, right? Here at this point at the very end, the estimate is informed about all the data points we have seen so far. But this point here, I hope you see this, this point here is not informed about this data point. This is just something this algorithm doesn't do. But if we now know that our car at some point stops, it reaches its destination, then we want to shift our problem statement. We shift away from quick online estimation to the problem setting where we now want to know the best possible estimate given all the data. So the correct estimate, or the best possible estimate, of the trajectory the car took, given all the data, at every time point. So basically, roughly speaking, we want for example to inform this estimate here about this data point here. Okay, yes. And this shifts our problem statement to computing a sequence of posteriors that have a different form. These are called smoothing posteriors.
It's the distribution over our current state given all the data points available, y one up to y capital T. And I'm already going to tease it: this is going to be a collection of Gaussian random variables as well. So this is quite easy to compute, as we will see, but the derivation is actually quite difficult. So this is just a handwritten note from when I was a master's student taking a course on time series analysis. I don't want you to understand or even read these derivations, because it's quite lengthy and we don't have to do it in this course. If you are interested, you can do it yourself; all of you can do it. I just want to show you what I used in order to make this derivation: I used the Markov property, I used Bayes' rule, and I used the Markov property again, and some probability theory in between. What I want you to look at is the thing that we want to compute and the result in the end. So this is what we end up with. These are the general smoothing equations, so to say. And even though this might seem complicated, all of these conditional distributions, all these terms over there, you know already. You see here the filtering distribution, and this already tells us something: we have to compute the filtering distribution in order to compute the smoothing distribution. Okay. Here, we just have our dynamics model, p of the next state given our current state. This term is especially interesting: this is the smoothing distribution at the next time step. This is the quantity we want to compute, just at the next time step. So what does this tell us? Yeah, I saw a good gesture over there. What do you mean? Exactly, we have to go backwards. This is going to be a backward recursive algorithm, based on our filtering distribution, the dynamics model, and the thing down here, which is just our prediction distribution. Notice that there's no measurement model involved here.
It's just using the filtering and prediction distributions, the dynamics model, and our smoothing estimate at the next time step. And then we just marginalize over the next time step, because we don't care for it. If we now go back to the setting of linear Gaussian state space models, I already told you it's going to be a collection of Gaussian random variables. We can write down matrix-vector operations to compute first the smoothing, that's a matrix G, no, not smoothing mean, smoothing gain. Smoother gain, this thing is called. It is the product of the filtering covariance matrix times the transition matrix transposed times the inverse of our predicted covariance at the next time step. And this matrix is all we need in order to compute our updated mean, in which we just add to the filtering mean the product of the smoothing gain times a residual term between the smoothing mean at the next time step and the predicted mean at the next time step. And the smoothed covariance matrix comes about by adding something to the filtering covariance: we first build this residual term between the next smoothing covariance and the predicted covariance, and then we multiply the smoothing gain from the left and the right. Just three equations make it possible to compute this posterior from our filtering posterior. And then we can write this into an algorithm as well. Notice that we have now shifted from formulating while loops to for loops, because this algorithm only makes sense if we have a data set, or at least a fixed portion of time steps that we want to consider, because we now go backwards from the very end up until the beginning. It is highly unlikely that new measurements come up at the beginning of the time series; this is just not how time works. Anyway, we start from the end of our time series, because, as I already told you, at the end of the time series our filtering estimate coincides with the smoothing estimate, and then we go backwards in time.
See here: for k going backwards from the end of the time series down to one, we compute the smoothing gain, and we use the smoothing gain to compute the smoothed moments. And in the end, we return the sequence of Gaussian random variables, and this is our smoothing estimate. Okay, let's look at this in code again. Well, nice. Okay, we're still in the same example, it's just part two, basically. We scroll down for a bit. Oh, actually, I forgot to show you the animation. Wow, let's look at this really quickly. That's an animation of the filtering estimate. Real quick, because I like animations. Whoop, you see how this comes about. You see here, again for visual reasons I plotted a prediction every time, and then you see how the uncertainty snaps back every time we see a data point. Now for smoothing, we again translate the equations we just saw. You don't have to memorize them; of course I put them on the slide such that you can look them up later, but we just translate them into code. Again, we tell the compiler that this matrix here is symmetric, it's a covariance matrix, so we divide by it by solving a linear system. We use the smoothing gain to compute our smoothed moments, we return them, execute, and then we initialize our smoother estimate, our trajectory, with the final filtering estimate from before. And then we go k steps in reverse over our state indices, select from the beginning of our list, because we're going backwards, our current smoothing estimate. We select our filter estimate at time step k and our predicted moments at time step k, which I sneakily saved earlier. I didn't tell you about that, but I saved them because we don't have to compute them again, right? These are pre-computed during filtering anyway. And all of that we provide to the smoothing function, together with our dynamics model. We execute this loop. It's already finished.
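In Python, one backward step of this recursion could be sketched like this. This is my own translation of the three smoothing equations; the arguments are exactly the quantities named above (filtering moments, predicted moments, and the smoothing moments at the next step):

```python
import numpy as np

def rts_smooth_step(mu_f, Sigma_f, mu_pred, Sigma_pred,
                    mu_s_next, Sigma_s_next, A):
    """One backward Rauch-Tung-Striebel step."""
    # Smoother gain G = Sigma_f A^T Sigma_pred^{-1}, via a solve.
    G = np.linalg.solve(Sigma_pred, A @ Sigma_f).T
    # Add the gain times the mean residual to the filtering mean.
    mu_s = mu_f + G @ (mu_s_next - mu_pred)
    # Covariance residual, multiplied by G from the left and right.
    Sigma_s = Sigma_f + G @ (Sigma_s_next - Sigma_pred) @ G.T
    return mu_s, Sigma_s
```

Running this from the end of the time series backwards, initialized with the final filtering estimate, yields the whole smoothing trajectory.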
And look at the fancy animation in which our smoother now smooths out our filtering distribution. I started off here with this plot in the xy-plane because it's just nicer to look at, but this uncertainty here is, as I said, a bit difficult to interpret. So if we want to see the actual clean truth, we have to look at this plot, which plots the same thing, just as a time series. And you can see that we now have a nice interpolant between those points. Actually, this nice interpolant is a Gaussian process posterior we just computed. So the smoothing posterior, this collection of smoothed random variables, consists of the marginals of a Gaussian process posterior that we just computed in linear time, going once forward and once backward. Again, I cannot tell you exactly why this is; I'm just going to tell you that it is. If you're very interested, we can talk about this a bit in the tutorial, but this again has to do with stochastic differential equations and very complicated math. Anyway, what assumptions did we make for that? We assumed linear transitions and linear measurements, we assumed additive Gaussian noise on both these dynamics and measurements, and we assumed the Markov property. Okay, now we can compute a Gaussian process posterior in linear time. Does that mean that you can forget everything that Jonathan Wenger told you over the past couple of weeks? No. I only told you that every smoothing posterior is a Gaussian process posterior, but not every Gaussian process posterior can be computed using Gaussian filtering and smoothing. This is just something that you should keep in mind. Okay, now we can do Bayesian filtering and smoothing. Cool, are there questions? I know it's a lot of stuff that you are learning at the moment. You don't have to memorize every equation; I just want you to have them such that you can look them up. Far more important are the principles I keep repeating over and over again.
Before now taking the next step. What could this next step possibly be? Remember that our ultimate goal, the reason why we are actually chewing through this theoretical part now, is simulation of complicated dynamics. So what kinds of models do we have to look at next? One keyword, one keyword that makes everything so much more difficult. No, we're going to stick to the Markov property, because otherwise it would screw up our framework entirely. Yes? Non-Gaussian is also something. Why did I ask this question? No, we are going to stick to Gaussians. I'm going to tell you at the very end that it is possible to drop the Gaussian assumption, but I'm not going to show you how. Yes? Non-linear. Yes, thank you. We're going to talk about non-linear models. But these are very good comments; especially the non-Gaussian one brings up an entirely new class of algorithms. Anyway, we're going to talk about non-linear models. If we return now to our state space model, you see here that this affine transformation of the previous state has been substituted, has been replaced, by a general function f for the dynamics model and a general function h for the measurement model. And as all of you probably know, non-linear functions of Gaussian random variables are not Gaussian random variables anymore, so everything doesn't quite work out anymore. But this doesn't mean that we have to stop here. We just pretend that we are dealing with Gaussians: we're just going to approximate our model and use Gaussian inference again. How are we going to do this? The principle is actually quite easy. We're going to take both of these functions f and h, linearize them, and then use Kalman filtering again, as easy as that. How are we going to linearize them? As you probably guessed, we're going to use a first-order Taylor approximation, because we linearize. I just quickly wrote this down here again.
So for a general function g, you can form a first-order Taylor approximation, which is an approximation as a linear function around a linearization point zeta. And it involves the Jacobian matrix, which I denoted here J sub g, of the non-linear function g, evaluated at this linearization point zeta. We can rewrite this Taylor approximation and reorder the terms such that we see that we have a linear, or affine, function again, Ax plus b, where A and b arise from the Taylor approximation. Just for you to remember: the Jacobian matrix is just the matrix containing all the combinations of partial derivatives of this vector-valued function g, as all of you probably know. Okay, let's look at a few pictures quickly. So here we have this non-linear function g in blue, just a logistic sigmoid function. We chose one linearization point zeta, this green dot. And at this dot, we approximate the non-linear function g with the linear function g hat, this orange line. And you see, the further we get away from this linearization point, the worse the approximation gets. Okay, but we don't want to talk about analysis here, we want to talk about random variables. So let's draw a few samples from a random variable, Gaussian distributed with mean two and standard deviation one. And now we see what happens if we map these samples, and compute the push-forward measure so to say, by mapping all of these samples through this non-linear function. We end up with this distribution, which is clearly not Gaussian; it doesn't look Gaussian. However, if we map it instead through the linear function, we get this distribution, which is again a Gaussian distribution, as you would have expected. And we can see it's just an approximation of the true distribution. It's really quite different: this one is not symmetric, and the means don't really match up.
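To make the linearized push-forward concrete, here is a tiny one-dimensional sketch with the logistic sigmoid. All names here are mine, and in one dimension the Jacobian is just the derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linearize(g, dg, zeta):
    """First-order Taylor expansion g(x) ~ a*x + b around point zeta."""
    a = dg(zeta)               # slope: the derivative (Jacobian in 1-D)
    b = g(zeta) - a * zeta     # affine offset from reordering the Taylor terms
    return a, b

# Push a Gaussian N(2, 1) through the sigmoid, linearized at zeta = 2.
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))
a, b = linearize(sigmoid, dsigmoid, 2.0)
mean, var = 2.0, 1.0
pushed_mean = a * mean + b   # equals sigmoid(2), since zeta = mean
pushed_var = a * var * a     # the 1-D version of A Sigma A^T
```

Because the linearization point coincides with the mean, the pushed-forward mean is exactly the function value there, which is the same cancellation exploited by the extended Kalman filter below.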
And this one here even maps samples to outside of the image space of the non-linear function, everything. But we're just going to use this approximation for now, because we want to use Kalman filtering, because it's so nice. Okay, so how we proceed now is we just linearize everything that is non-linear in our state space model, and we begin with our non-linear dynamics. So we are given this model: x k given x k minus one is this non-linear function of x k minus one under additive Gaussian noise. We are going to linearize those dynamics, and we get this term involving the Jacobian as our new matrix A, and as our affine term this thing involving the evaluation of our non-linear function and something else involving the Jacobian. This just drops out of the Taylor approximation. Okay, as soon as we did this, we have ended up with an affine model again. So we have a linear Gaussian dynamics model here, where the terms just arose from the Taylor approximation. And what we do now is we just insert those new system components into our Kalman filter equations. So for the predicted mean, and this is really nothing else than the Kalman predicted mean which we saw before, we insert the Jacobian and this affine term, and we see already that we have here plus this Jacobian term and minus this Jacobian term. So these two cancel out, and the predicted mean is just the non-linear dynamics function evaluated at the previous filtering mean. I forgot to say: this comes about because we choose the linearization point smartly. We choose zeta to be the previous filtering mean. This way, all of this cancels out nicely. For the predicted covariance matrix, we also just insert for every transition matrix we find the Jacobian matrix of the non-linear function f, and then we have this term: Jacobian times filtering covariance times Jacobian transposed, plus our process noise covariance matrix Q.
Okay, so we now have equations that allow us to deal with non-linear dynamics, and we do the same for the measurements. We have this non-linear measurement model: y k given x k is a Gaussian random variable that involves this non-linear function h evaluated at x k. We linearize this function, we get this form Hx plus c, where H and c come about from the Taylor approximation as before. We choose as the linearization point zeta the predicted mean now. We recognize that we have a linear Gaussian model again, and we insert everything into the Kalman filter equations. We compute our Gram matrix S just using the Jacobian now instead of the measurement matrix, or rather the Jacobian is our new measurement matrix, if you will, plus our measurement noise covariance matrix R, and we compute the Kalman gain in the same way. For the updated mean, we recognize again, if we insert our Taylor terms into this equation here, we end up with this term after rearranging the terms a bit, and we have minus Jacobian plus Jacobian again. So our predicted measurement mean is just the evaluation of the non-linear measurement function at the predicted mean, and the residual term is this y k minus h of the predicted mean. And for the updated covariance matrix, actually nothing changes. What we end up with are the equations for the so-called extended Kalman filter. And it's nothing else than that: it's just an extension of the Kalman filter to non-linear dynamics and measurements. And they are actually so similar that I chose to make a slide where I just write down the Kalman filter equations next to the extended Kalman filter equations and highlight in red everything that changed. We just predict our mean with the non-linear function and substitute every transition matrix with the Jacobian, and those are our extended Kalman filter equations. This is basically all that matters: just for these two parts, we see that we just have to evaluate our non-linear functions.
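Put together, the extended Kalman filter step described on the slide might be sketched like this. This is a NumPy sketch with hand-supplied Jacobian functions, not the lecture's Julia code:

```python
import numpy as np

def ekf_predict(mu, Sigma, f, Jf, Q):
    """EKF prediction: linearize the dynamics at the filtering mean."""
    A = Jf(mu)                 # Jacobian plays the role of the transition matrix
    # Thanks to the cancellation, the mean is just f evaluated at mu.
    return f(mu), A @ Sigma @ A.T + Q

def ekf_correct(mu_pred, Sigma_pred, h, Jh, R, y):
    """EKF correction: linearize the measurement model at the predicted mean."""
    H = Jh(mu_pred)
    S = H @ Sigma_pred @ H.T + R           # Gram matrix with the Jacobian
    K = np.linalg.solve(S, H @ Sigma_pred).T
    mu = mu_pred + K @ (y - h(mu_pred))    # residual uses the non-linear h
    Sigma = Sigma_pred - K @ S @ K.T       # unchanged from the Kalman filter
    return mu, Sigma
```

A quick sanity check of this sketch: for linear f and h with constant Jacobians, the two functions reduce exactly to the standard Kalman filter step.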
Questions regarding the extended Kalman filter? Okay. Wait, so, yeah: does it still make sense to do something similar if you have a function that you can't differentiate, or don't want to differentiate for some reason, and use linear approximations there too? So linear approximations obtained by evaluating the function at different points. Oh, okay, interesting question. So the question is, or slightly rephrased: what happens if we cannot differentiate our function? What happens if we cannot compute a Jacobian? And you already gave a suggestion for what you could do, like a finite differencing scheme, right? You take a function evaluation at this point and a function evaluation at that point, build a difference, like standard finite differences, and then do the same thing. I don't know if this works, actually. It might well be that it actually works. You are free to try this instead of using a Jacobian. I think it works, it just maybe gets unstable at some point. But yeah, feel free to try it out, and we can talk about it in the tutorial maybe. Okay, cool. You can do the exact same thing for the Rauch-Tung-Striebel smoother, or Kalman smoother as it is sometimes called, but that's not the correct name. So we just build the extended Rauch-Tung-Striebel smoother, again by looking at the Rauch-Tung-Striebel smoother equations and inserting for the transposed transition matrix, again, our Jacobian matrix of the non-linear function f, evaluated at the filtering mean mu k. That's it, that's non-linear smoothing, or one way to do non-linear smoothing. Let's look at this also in a code example. And as the name already suggests, we're going to look at this dynamical system I've been talking about earlier, the pendulum. Whoop, going to open the next notebook. Import some stuff. And build our state space model. Our state now consists of two components, theta and theta dot. Theta is just the angle of the pendulum, and theta dot is the angular velocity.
So our state dimension is two, and our measurements are actually scalar. Okay, so this is quite simple. We are going to start off with an initial Gaussian distribution. We initialize the mean at zero angle, but with an initial momentum, by setting the angular velocity in the beginning to a non-zero value, minus three. I just chose this, it's not set in stone. We are going to encode that we are quite certain about the initial angle but a little less certain about the initial angular velocity, by having here not a scaled identity matrix, but still a diagonal initial covariance matrix. We're going to execute this cell as well. And now we set up our dynamics model. So this now gets a little bit more interesting. We have now this non-linear state space model, where x k given x k minus one is a non-linear transformation of our previous state, and this non-linear transformation looks like this. So f maps our state vector of angle and angular velocity to a new vector of angle and angular velocity. To the previous angle, we add the time step, again computed from the difference of the subsequent time steps k plus one and k, times the angular velocity. We just build this product and add it to our current angle. For the angular velocity, we add to our current angular velocity this term involving our friction parameter alpha, which I just set to 0.3, multiplied with the angular velocity, minus g divided by L, where g is the gravitational acceleration on our planet Earth, which is roughly 9.81, and L is the length of the rod of the pendulum, which obviously also affects our dynamics, and which I just set to three, multiplied with the sine of the current angle. And all of this we again multiply with our time step.
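One plausible reading of the pendulum transition function just described, written in Python. The signs of the friction and gravity terms are my interpretation of the spoken description, so treat them as assumptions; the constants are the ones stated in the lecture:

```python
import numpy as np

# Constants as stated in the lecture.
dt, g, L, alpha = 0.05, 9.81, 3.0, 0.3

def pendulum_f(state):
    """Non-linear transition for the state [angle, angular velocity]."""
    theta, theta_dot = state
    return np.array([
        # Angle advances by dt times the angular velocity.
        theta + dt * theta_dot,
        # Friction (-alpha * theta_dot) plus the gravity restoring term
        # (-(g / L) * sin(theta)); exact signs are my reading of the lecture.
        theta_dot + dt * (-alpha * theta_dot - (g / L) * np.sin(theta)),
    ])
```

With the initial mean at zero angle and angular velocity minus three, a single application of this function already nudges the angle in the direction of the initial momentum.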
Okay, you don't have to understand how this function comes about, or why on Earth this thing here models the dynamics of a pendulum, but I promise you, you will learn about these types of equations next week from Nathanael, and they are actually quite interesting, one of my favorite topics to study. Very cool stuff. We are going to choose as our process noise covariance again this very complicated matrix, which involves some weird polynomial terms of our time step. Again, I'm not going to tell you how this comes about; you just have to accept it for the moment. And this is our dynamics model. Okay, how does this look in code? We set our time step to 0.05, we set our gravitational constant, and this sigma q term basically just scales all the terms in this process noise covariance matrix; we could also put it in front here, it's not very interesting. We have this air resistance term that influences our dynamics, or maybe air resistance is not the best word, a better term would be friction, right, just a general resistance term. And we have the rod length of the pendulum. And we write this down in code. This is basically just what you see here, except that I substituted every theta with "angle" and every theta dot with "angular velocity". It gets quite complicated here in the end; we return the new state vector, and this is our non-linear dynamics with its process noise covariance matrix, and we have our dynamics set up. As for the measurements, we actually have quite a simple model: h of the state vector is just the first component of the state vector, similar to the car tracking example. Notice that this is actually a linear model, so we didn't even use a non-linear measurement function here. And we have a scalar measurement noise of 0.08, just for the fun of it. And so that you can see the extended Kalman filter equations later on, I actually wrote down this linear measurement model as a function h of x. Okay, let's execute this.
We are again going to simulate our state space model, where we draw one realization of the state space model, so one dynamics unrolling so to say, which gives us our ground truth, and we draw the corresponding data as well. And we plot it, and this is what we end up with. Okay, that looks familiar, but a little bit weird, right? You see here that we have some wiggling going on down here, and also here. That's not a perfect oscillator. This is because we are looking at one realization of a stochastic system. This is, if you will, a pendulum in the wind: from time to time there's random momentum coming in, changing the dynamics of the pendulum. That's why we don't see perfect oscillations here. And we have some noisy measurements at a few random time steps, and we are going to use these measurements and the extended Kalman filter and smoother in order to get a good estimate of the ground truth trajectory. Okay, the extended Kalman filter equations in code: we take the filtering mean, the filtering covariance, the non-linear function f that describes our dynamics, and the process noise covariance matrix Q, and compute our predicted mean by evaluating the function, then computing the Jacobian. And what I'm doing here is I'm calling a package that uses automatic differentiation. Probably all of you know it, but maybe few of you have seen it other than in the context of deep learning. So you can use automatic differentiation for other things than training deep networks, so that's cool. Of course, for this very simple model you could just write down the Jacobian on paper, but as soon as the models get very complicated, you might want to consider making use of this very cool technique called automatic differentiation. So I do this here for the simple model, because I'm lazy. I use this Jacobian matrix, substitute our transition matrices with it, and return our predicted moments. For the correction, we do exactly the same thing. We compute our gradient information, so this
is not a Jacobian now, because we have a function mapping from R2 to the real line, so it's a scalar function, so we basically just have one row of the Jacobian, it's our gradient. And we compute the predicted observation using the measurement model on the predicted mean, and then compute, basically just using the Kalman filter equations, our updated moments mu and sigma. Okay, clear. Then, everything I changed in this loop is that I added an "e" here and an "e" here; everything else stays exactly the same. It's just a loop going over our state indices, we skip the first one because it's just our initial moments, we select from the very end our current filtering estimate, predict the next one, and again push the predictions to a list, because we need them later for smoothing. If there is data, we correct for the newly gained observation, else we just push our predictions. Okay, let's execute this. It's actually very fast, it's already finished. And now let's look at the animation, no, let's look, yes, let's look at the animation that shows us our filtering estimate of this pendulum trajectory. Well, okay, here you see it. See here on the right, it's very interesting: the predicted trajectory already encodes this oscillatory behavior, because we informed our prior dynamics about our model, and it's non-linear, so it's not just one line going into the future, but it already exposes these non-linear dynamics. So this is quite cool. And in between measurements, where the prediction maybe was not so good, the uncertainty fans out very rapidly, and then, as the new data point comes in, it snaps back onto this data point and corrects on the basis of the data we've seen. Okay, so this is our filtering estimate. We don't have to make multiple plots now, because this is a time series, and these two marginal standard deviations here in this shaded area are the correct depiction of our uncertainty estimate. And this is the filtering estimate. Okay, cool. Looks nice, doesn't it?
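Picking up the earlier question about functions we cannot differentiate: a finite-difference Jacobian could serve as a drop-in replacement for the automatic-differentiation one. This is my own sketch, using forward differences with a fixed step `eps`, with the caveats about stability mentioned above:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Forward-difference Jacobian of a vector-valued f at the point x."""
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        # Perturb one input coordinate at a time and difference the outputs.
        x_pert = x.copy()
        x_pert[i] += eps
        J[:, i] = (np.asarray(f(x_pert)) - fx) / eps
    return J
```

For a scalar measurement function, this returns a one-row Jacobian, i.e. exactly the gradient row discussed above.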
Now that we have already learned about smoothing, we can just go on and implement the extended Rauch-Tung-Striebel smoother as well, where we substitute our transition matrix with the Jacobian of our non-linear dynamics model. Okay, the smoother again uses automatic differentiation, computes our smoothing gain, smoothed mean, and smoothed covariance, and returns them. Then we execute our backwards loop again; you know it already, so I'm not going to go over that again. And then we can again look at a fancy animation, where our smoother tidies up our filtering estimate from before by going backwards through time and computing actually quite nice interpolants in between those data points. This is a Gaussian process posterior marginal, evaluated at all the points that we considered in our sequence, computed in linear time, once in the forward direction and once in the backward direction. But it's quite cool, right? So this is what it looks like in the end: nice interpolants in between the data points, and it covers the ground truth trajectory nicely with its uncertainty. Okay, whoops, wrong direction...
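The backward pass just described can be sketched as a single smoothing step, again with the Jacobian obtained via automatic differentiation. This is a minimal illustration under my own naming; it assumes the same dynamics function `f` and process noise `Q` as in the filter:

```python
import jax
import jax.numpy as jnp

def ekf_smooth_step(mu_f, Sigma_f, mu_s_next, Sigma_s_next, f, Q):
    """One backward step of the extended Rauch-Tung-Striebel smoother.

    mu_f, Sigma_f: filtering moments at time k.
    mu_s_next, Sigma_s_next: smoothed moments at time k + 1.
    """
    F = jax.jacfwd(f)(mu_f)            # linearize the dynamics at the filtering mean
    mu_pred = f(mu_f)                  # one-step prediction, as in the filter
    Sigma_pred = F @ Sigma_f @ F.T + Q
    G = Sigma_f @ F.T @ jnp.linalg.inv(Sigma_pred)   # smoothing gain
    mu_s = mu_f + G @ (mu_s_next - mu_pred)          # pull information backwards in time
    Sigma_s = Sigma_f + G @ (Sigma_s_next - Sigma_pred) @ G.T
    return mu_s, Sigma_s
```

Iterating this step from the last time index down to the first, initialized with the final filtering moments, yields the smoothing distribution at every step of the sequence.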
Okay, with that I'm actually almost finished with my lecture, so you made it through the theoretical part, such that next week we can look at some really nice dynamics. First of all, though, I don't want to leave without telling you that extended Kalman filtering, or Kalman filtering in general, is not the only way to do Bayesian filtering and smoothing. It's just one way, and it's a way that most of the time actually works quite nicely, and it's very easy to understand and implement. However, I now want to mention, rather than fully present, an alternative called the unscented Kalman filter. While the extended Kalman filter approximates the model and computes the exact solution, the unscented Kalman filter leaves the exact model intact, so the non-linear function, but approximates the distribution. It does so in a somewhat involved way: it deterministically chooses points, called sigma points, from this Gaussian distribution you see here on the right, represented by these samples in the background, maps these points through the non-linearity, and then uses the mapped sigma points to compute a Gaussian
approximation of the distribution on the basis of these points. I'm not going to go into detail; I just want you to know that there exist other methods that you can use if the extended Kalman filter doesn't work. Now, coming back to the question from over there, when you asked whether we can drop the Gaussian assumption as well: there I told you we have to open up a box of more complicated algorithms, and one of these algorithms is the particle filter. It's a sequential Monte Carlo method, and it works for non-linear models, it works for non-Gaussian models, and it approximates the actual true posterior under your model. It's, roughly speaking, a sequential version of importance sampling, with all kinds of complicated stuff added to make it work, and it comes with all the pros and cons of Monte Carlo methods: first and foremost, it's very expensive to compute and a bit harder to implement than the extended Kalman filter. But it's a very cool method, and if you want to learn about it, there is a book called "Bayesian Filtering and Smoothing"; I actually used it a lot to create the content of this lecture. It's written by Simo Särkkä, a researcher at Aalto University in Finland, and he's actually going to visit Tübingen in late March next year as part of the probabilistic numerics spring school that was announced yesterday or the day before. So maybe you even have the chance to listen to a talk given by him if you're interested in the topic, and if you are, I definitely recommend having a look at his book; it's actually very accessible. Finishing off now, let us quickly recap what we learned today. We learned about the language that we use to describe dynamical systems, probabilistic state space models, and we chose a special kind of these probabilistic state space models, a linear Gaussian model, such that we can compute the prediction and the filtering distribution in closed form just using
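The sigma-point idea just mentioned can be illustrated with a small sketch of the unscented transform, the core ingredient of the unscented Kalman filter. This follows the common scaled formulation; the parameter names and defaults here are my own choices for illustration, not values from the lecture:

```python
import numpy as np

def unscented_transform(mu, Sigma, g, alpha=1.0, beta=2.0, kappa=1.0):
    """Map a Gaussian N(mu, Sigma) through a nonlinearity g via sigma points."""
    n = mu.shape[0]
    lam = alpha**2 * (n + kappa) - n
    L = np.linalg.cholesky((n + lam) * Sigma)   # matrix square root of the scaled covariance
    # 2n + 1 deterministically chosen sigma points: the mean plus symmetric offsets
    pts = [mu] + [mu + L[:, i] for i in range(n)] + [mu - L[:, i] for i in range(n)]
    wm = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))  # weights for the mean
    wm[0] = lam / (n + lam)
    wc = wm.copy()                                    # weights for the covariance
    wc[0] += 1.0 - alpha**2 + beta
    ys = np.array([g(p) for p in pts])  # push every sigma point through the nonlinearity
    mu_y = wm @ ys                      # weighted mean of the mapped points
    diff = ys - mu_y
    Sigma_y = (wc[:, None] * diff).T @ diff   # weighted covariance of the mapped points
    return mu_y, Sigma_y
```

For a linear map the transform recovers the exact transformed mean and covariance; for non-linear maps it gives a Gaussian approximation without ever differentiating `g`, which is exactly the contrast with the extended Kalman filter drawn above.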
linear algebra, by using the Kalman filter. We extended this to non-linear models, because we will need this for our simulations later on, by using the extended Kalman filter, which is just an extension of the Kalman filter that linearizes everything in your state space model and then uses the Kalman filter equations again, and now we can handle non-linear dynamics and non-linear measurements in a really cool way. Next week, I strongly recommend you come back, because Nathanael here will tell you everything about how to mathematically describe these dynamics themselves, so not the models for these dynamics, but the dynamics themselves. He is going to introduce you to a very powerful language that captures all sorts of laws of nature. And in the lecture after that, also given by Nathanael, so two weeks from now, he is going to combine today's lecture and the lecture given in one week, where we will learn how to use probabilistic numerics, i.e. Bayesian filtering and smoothing, to actually simulate those dynamical systems. So this is the combination of those two lectures. With that, I want to finish off by asking you to provide feedback. I very much appreciate you listening to me today, and with that I finish off this lecture. Thank you for listening.