Welcome to probabilistic machine learning, lecture number 12. Today I'd like to make a connection back to the beginning of the course. We began this course by observing that probabilistic inference generalizes conditional logic to statements that include uncertainty, by distributing truth across hypothesis spaces. Already in lecture number 2 we encountered an interesting problem with this framework, which is that it can be computationally very hard, actually exponentially expensive, so combinatorially hard in the number of hypotheses. Over the course of all the lectures since then we have essentially been working on building frameworks that can deal with this potentially extreme computational complexity. There was a phase in between where we spoke about sampling methods, which are a generic way to do these high-dimensional integrals if you have access to random numbers generated from the distributions in question. And then we spent a large part of the lectures until now speaking about one particular alternative approach, which is very important, and that's why we spent so much time on it: to consider only random variables which are linearly connected and jointly Gaussian distributed, because in this framework all the resulting necessary computations are of the linear-algebra type and therefore only polynomially expensive. What I'd like to do today, as we are approaching the end of, let's call it, the Gaussian phase of this lecture course, is to make a connection back to lecture number 2, in which we spoke about a very fundamental way of dealing with computational complexity: conditional independence. To remind you, back then I introduced the notion of directed graphical models, which are a visualization of conditional dependence and independence structure in joint probability distributions, and we encountered three different atomic structures in such graphical models.
Atomic in the sense that they are the ones that arise in tri-variate graphs, the first sort of non-trivial joint distribution. Since then we haven't talked that much about this modeling language, which will become more prominent in the rest of the course, so maybe let's see if we can make a connection to what we've done so far. In lecture number 7, when we did parametric Gaussian regression, we encountered this kind of model, which back then I didn't write in this graphical form, but we can do so now. Actually, I kind of did, but in a less clear fashion. We spoke about function values at various locations from x1 to xn at which we are collecting data y; the data are observed, which is why they are drawn filled in. We assume that these function values are generated by some underlying feature functions which are weighted with a joint set of weights, so these weights are the entire explanation of the data set. If we assume that the likelihood of the individual observations factorizes over local Gaussian error terms, created at each location by evaluating the function value at that point and then adding some Gaussian noise, then this corresponds to this graphical model, which is an instance of the fan-out structure. We saw in lecture number 2 that this kind of structure implies that, conditioned on B, A and C are independent, and that's exactly what we see here as well: conditioned on the weights, the function values are independent of each other, and therefore so are the observations.
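As a reminder of the form this takes (notation as in lecture 7: feature vector phi(x_i), weights w, noise variance sigma squared; the exact symbols here are my choice, the slides may differ), the conditional independence of the observations given the weights is exactly the statement that the likelihood factorizes:

```latex
p(y \mid w) \;=\; \prod_{i=1}^{n} p(y_i \mid w)
           \;=\; \prod_{i=1}^{n} \mathcal{N}\!\left(y_i;\; \phi(x_i)^{\top} w,\; \sigma^{2}\right)
```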
We saw that in the Gaussian framework the inference cost here reduces to the expensive part of computing the posterior over w, which in general can still be exponentially hard, combinatorially hard in the size of the hypothesis space over w. In our case w was actually a continuous variable, so in general that wouldn't really be tractable, but because we make joint Gaussianity assumptions, inference in this model was of polynomial, in fact cubic, cost in the number of weights, because we essentially have to invert a matrix of that size, which is a linear-algebra problem. In lecture number 9, as we expanded towards non-parametric models with infinitely many degrees of freedom, we encountered a new kind of model, the Gaussian process. At least one way of looking at these models is that they do not keep track of an explicit set of weights anymore; another way to think about it is that we expand the set of weights towards an infinite set. Under both views it then makes much more sense to reason only about the function values themselves rather than some latent weight space. This leads to an inference algorithm whose computational complexity is still polynomial, cubic in fact, but now in the number of observations rather than the number of weights. In terms of a graph this corresponds to an essentially fully connected graphical model, where every function value depends on all the other function values. Conditioned on all of these function values we still have independent observations, so the likelihood still factorizes in this way, but the prior over the function values cannot be simplified beyond a joint distribution in which everything depends on everything else. We actually saw where these numbers explicitly come from: they are in the inverse covariance matrix of our Gaussian process model, because remember from the lecture
on basic properties of Gaussian distributions: the entries of the inverse kernel Gram matrix, so the inverse covariance matrix, the precision matrix, correspond, with flipped sign, to the covariances of the conditional distributions obtained by conditioning on everything else. For this entry (1,4), for example, the sign of that number is the flipped sign of the covariance between variables 1 and 4 under the conditional distribution of 1 and 4 given everything else. That's one way to construct the graph: by thinking about which dependencies survive when you condition on everything else. In these settings these conditional independence structures didn't really help us that much, because this is in some sense the most trivial graph, the one where everything is connected. We saw in lecture 2 that every joint probability distribution can be represented by a fully connected graph, but that's also pretty useless, precisely because it doesn't encode any additional information about the joint distribution. That's maybe why we haven't spoken so much about graphical models lately, except to make the connection to deep neural networks. But there is one more interesting structure which we already saw in lecture 2, this one up here, a so-called chain graph. One way to think about it is in a generative fashion: we can generate each variable in the chain by conditioning on the previous variable and using that to predict the next one. So the probability of A, B and C is given by the probability of A, times the probability of B given A, times the probability of C given B, and, this is the important bit, there is no A in this last factor. Such graphs, as I said, are called chains, and their structure is already suggestive of an underlying type of data that creates this kind of situation. These graphs have, well, maybe I can already tell you, a kind of temporal structure to them: if you think of
extending this graph going forward, then you can think of something that just evolves in one dimension. That one dimension you might as well call time, even though it doesn't have to be identified with physical time, such that at every point in time the process has essentially a finite memory: what happens at the next time step only depends on what the current situation in the world is, so the current situation decouples the prediction for future states from the values of past states. Today we're going to try and make this connection more formal. We'll think about what exactly this means in terms of computational complexity, and in doing so we will encounter an entire class of models which have very high practical importance and are directly connected to a class of algorithms from signal processing and other domains. These algorithms are so important that I have to bring them up at this point in the lecture. We will only talk about them in this particular lecture and then I will only occasionally mention them at later points, because there are so many other interesting models to look at, but they are important enough that we have to spend a little bit of time on them. They are connected to the kind of data that you might call a time series. A time series is a sequence of observations which are indexed by a scalar variable; let's call it time. It doesn't have to be time in the physical sense, but of course it often is physical time, and we typically move forward through that sequence. Temporal structure is often also associated with time steps of constant step size, but it doesn't have to be. In principle time, at least the way we think about it, is of course continuous, but in practical applications time is often discretized into individual time steps, maybe for physical reasons, because the sensors involved have a certain refresh rate, or for other reasons. These models are called discrete
time models. You can imagine that this kind of data set is extremely important. It shows up in everything that changes through time: climate and weather predictions, sensor readings in engineering systems that run through time, medical data collected over time, from fever curves to all sorts of readings you might take from a patient, descriptions of dynamical processes more generally in physics, and so on. In economics: stock prices, supply and demand data, and so on. The previous lecture actually introduced a time series as well, my body weight measurements, just to keep the example going. An interesting aspect of time series is that they often pose particularly interesting computational challenges. One issue is that, because it's connected to real physical time, you might want to make predictions in real time. Another is that the potential size of the data is essentially infinite, because time just keeps going and you keep getting more and more data. The typical setting you're in is that you get a datum, you want to make a prediction, then the next datum comes in and you have to make another prediction, and so on; or, associated with the prediction, you take a decision. You need to decide whether to buy or sell a stock, then you wait for a certain amount of time, you see how the market evolves, and then you again have to decide whether to buy or sell. This structure directly informs the kind of algorithms we want to use in this setting: we need algorithms which are computationally lightweight, so that they can deal with a potentially infinite dataset, and which allow us to make predictions in an iterative fashion at every point in time. What do I mean by lightweight complexity?
Well, if you think about data coming in at this rate, one datum after the other, then at every time step you have a fixed amount of computational resource available. Let's say you get a datum every second; then you have one second to take your decision, to do inference, before the next datum arrives. That means the computational cost per additional datum has to be constant, and therefore the overall cost of inference on a dataset of size n should be linear in n; it should be O(n). The algorithms for Gaussian regression we've seen so far are not linear in n unless they have a finite set of features. In those models inference is actually linear in n, but they do not allow the weights to change over time. In these models you can become ever more confident about something that doesn't change in time, like the weights, but maybe that's not what you want: you have a dynamically changing system, and you don't want to become ever more confident, you want to track something that changes over time. In the fully connected Gaussian process regression models you in general cannot do this kind of inference on this kind of dataset, because the computational complexity of the inference is cubic in the number of data points. In a time series setting the computational cost would rise faster than data arrive; quickly you'd be overwhelmed by the cost of having to keep track of the entire tail of data behind you, and you'd have to give up. So we need a new kind of model to deal with that, and these models correspond, at least in our Gaussian setting, to data in which the covariance matrix, at least under some transformation, has the structure that its inverse is of tridiagonal form. If that's the case, then we can predict one variable into the future by conditioning on everything that came before, and that's convenient, because if you have a structure like
this, all these zeros in here mean that you don't have to take into account all the preceding data; instead you just have to look at the very most recent observation and use that to predict one step into the future. If this picture is confusing to you, don't worry, we're going to do the actual derivation in a moment, but I wanted to give this intuition first. Clearly this corresponds to this kind of graph, which allows us to predict one variable into the future by conditioning on the previous observation and then predicting forward. At this point I should tell you what these models are called: Markov models, Markov chains. I said this is a chain graph, but this kind of model is really called a Markov chain, after a Russian mathematician who preceded Kolmogorov. He wrote a text in Russian, here is the original from 1906, for a local physico-mathematical society at his university. The title suggests exactly what the text is about: essentially, inference in models where the variables depend on each other, but only with this particular structure. Kolmogorov was actually the one who popularized this paper; because it was in Russian it wasn't well known in the rest of the world, and Kolmogorov in some sense informed the world about it by citing it multiple times. Here is his citation in his Grundbegriffe der Wahrscheinlichkeitsrechnung, the foundational text that I've cited several times before. This is not the place to go through the math of that text in detail, but it may be useful for the further thought process to notice that Kolmogorov already mentions that this kind of independence structure is really at the heart of the considerations we have to make when we build practical inference models. I've told you previously that Kolmogorov pointed out that the tricky part about probabilistic inference, about all of probabilistic reasoning, is not the notion of Bayesian inference; it's not the sum
and product rule, because those follow directly from set theory. The tricky part is the notion of independence, because independence is philosophically so difficult to define. It's easy to define in the formal way that we use here, but it poses lots of philosophical problems. Total independence is boring, because it just means that everything decouples from everything else, and this Markov chain structure is, as Kolmogorov writes here, maybe one of the most important elementary types of structure one has to understand to build interestingly structured inference algorithms. So that's what we're going to do today. To do so, it's a good idea to change our notation a little from previous lectures. So far we've built very general, generic kinds of models which assume that there is some latent function, which we called f, mapping from a very generally structured space X. That X could be almost anything, because it is going to be masked by the feature functions or by the kernel: it could be a vector space, but also the space of graphs, or the space of strings, or something else. We assumed that our function maps from that space to the reals and that all the observations we make are linearly connected to the function values. Now we're going to use models that have this notion of a local memory that evolves over time, and the typical inference setting we're going to be in is that we are at a current point in time, we get one more datum, and then we want to make one more prediction about the current situation we're in. That means, first of all, that we have to consider spaces in which the inputs are ordered, because we need to know what is past and what is future. So we are essentially restricted to an input domain that is a subset of the real line, and we might as well call that input time, because that's a good word for it. So instead of calling the
input x, we're going to call the input t, like time, and we will consider situations in which we have an ordered set of observations y1 to yn, real valued, made at ordered times t1 to tn. Because we are not using the variable x anymore, in this community it is actually typical to call the latent object not a function f but a state x which evolves over time. The word state already evokes the idea of a memory that gets updated from one step to the next. So we will call the latent state not f anymore, we'll call it x. That's maybe a little confusing, because x used to be an input and now it's the function value, but I'm sure you'll get over it. And we'll make observations which are typically local: they might be a linear map, but they are a linear map of the local state, of the local function value, not of all the function values together. Such models have various names. They are called state space models, because the state is the object of interest. From a probabilistic-structure perspective they are Markov chains, as I've said before, and I actually already introduced Markov chains in lecture number 3, when we spoke about sampling algorithms, and lecture number 4, when we did Markov chain Monte Carlo. So here is just a reminder of the definition that you already saw in that lecture: a joint distribution over such latent states is said to have the Markov property if the conditional distribution of the i-th state given all the previous ones can be written as the i-th state given only its immediate predecessor, in symbols p(x_i | x_1, ..., x_{i-1}) = p(x_i | x_{i-1}). Sequences of objects x_i that are distributed according to this factorization structure are called Markov chains. What I'd like to do now is to think about what this structure actually implies for our inference algorithm, and, interestingly, I'd like to point out that for the moment I'm not going
to make any assumptions about Gaussianity, or in fact any other assumptions about the shape of the probability density function, or whether there even is a density. Actually, I will assume that there is a density, but I'm not going to assume anything about its shape; it doesn't have to be Gaussian. We will just see how the structure of the inference process changes if we impose the conditional independence structure that is encoded in this graph. So, to be precise again, we make two assumptions: first, that the joint distribution over the latent x's has the Markov property; and second, that the observations are in some sense local, meaning the generative probability for an individual observation at time t depends only on the latent state at time t and not on all the other states. Let's see what kind of structure this causes in our inference procedure. The typical thing you want to do in this setting, as I said before, is: you're at a particular point in time and you have already collected data up to time t. I'm going to use this notation to denote the collection of data from time zero up until t minus one. What you typically want to do is predict the latent state xt; then you get one more observation yt and you want to update what you know about xt given that new observation; and then you want to predict the next state, xt plus one. That next prediction for xt plus one is again a distribution of the same type, where we now just have one more entry in here, so at that point the loop closes and we can continue forward. Then there is another setting, in which you already have the entire data set and you just want to know, looking back, what you would now say about all the preceding states; we'll talk about that in a moment. So let's first think about the predictive setting, where we have data up
until this time. I'm first going to show you what the result of the derivation is going to be, and then we'll do the detailed derivation so that you don't get confused. It turns out that this predictive distribution can be simplified quite a lot, into this expression. What you can read off here is: to compute the prediction for the state at time t given all the preceding data, take the posterior that you computed in the previous step, the posterior over the state at time t-1 given all data up to and including time t-1, multiply it with the predictive conditional that we have from our assumptions, and marginalize over the state at t-1. This particular equation, p(x_t | y_{0:t-1}) = integral of p(x_t | x_{t-1}) p(x_{t-1} | y_{0:t-1}) dx_{t-1}, is called the Chapman-Kolmogorov equation. Here's the full derivation for it. I know this derivation often confuses people, so let me first give a simplified derivation, which requires suspending your disbelief a little. One simple way to explain where this equation comes from is to consider the joint distribution over xt and xt-1 given all the preceding data. First use just the product rule, rewriting this in the generic form valid for any joint distribution: xt given xt-1 and whatever else was on the right-hand side, times p of xt-1 given whatever was on the right-hand side. Now use the structure of the chain graph: xt is independent of all the previous data, because they are shielded away by the graph, so we can drop this y from the first factor. Now just marginalize over xt-1, and that already gives us exactly the equation we had on the previous slide, the Chapman-Kolmogorov equation. Now, annoyingly, what I've just done here is wave my
hand around a little and just said this is the result of the graph. But notice that this expression here isn't explicitly one of the expressions we have up here, so if we want to be formally precise, I actually have to show that it is true that I can just drop this y from the expression. If you believe me, you can stop listening for two or three minutes now; if you don't, let's do proper inference. What is this conditional distribution? Well, it's given by Bayes' theorem: the probability for xt given all the data can be written as the prior for xt, times the likelihood for all the observations given xt, divided by a normalization constant. Now, annoyingly, we don't really have a prior for xt, so we have to construct it by expanding into objects that we actually can talk about. We write the probability for xt as a joint distribution over all of the latent states x, and then we marginalize out all the latent states that aren't the one we care about. The normalization constant is the same thing, but now we also marginalize over xt itself, so it's just an integral over everything. In that expression we'll do two things: first, let's move these likelihood terms over to the left, and then let's expand the joint p of x using exactly the Markov chain structure, so now we're introducing the assumption we've made. We can write the joint distribution over all the latent states as the probability for x0 times a long sequence of individual terms, where each term is the conditional of one particular x given its immediate predecessor. Those can be grouped into three parts: starting from x0, we move forward through time all the way up until right before the time t that we care about; then there is the individual term for the prediction of xt; and then there's all the stuff that follows afterwards. We can do the same in
the denominator; here I've done the exact same thing in the numerator and the denominator of this fraction. Now we look at these expressions. First of all, you'll notice that for all the quantities that come after xt there is an integral over the corresponding individual elements, the xj with j larger than t, and there is no y associated with them, because we assumed we only have data up until t-1 so far. Those integrals can be moved all the way over here, and the same goes for the denominator. That means we have the same terms in the numerator and the denominator, and they cancel out, so we can get rid of all of these green terms at the back. Now we can have a look at the blue terms, which are a little more interesting. At this point you might want to stop the video and stare at this for a bit, otherwise it goes too fast. We notice that for all the terms involving an x at a time j strictly less than t, there is a corresponding likelihood term and prior term that show up, individually factorizing, in both the numerator and the denominator, so they cancel out. The only thing that's left is this final term, the bit where we compute xt given xt-1, times this entire interesting blue thing, and that blue thing is exactly the posterior over xt-1 given all the data so far. Notice that there is no term with a yt yet, and notice that the remaining integral up here is not over xt; the only remaining integral is over xt-1. Okay, so this gives us the Chapman-Kolmogorov equation, and now we might believe the statement a little more. This is what allows us to take all of the data up until, but not including, the current point t, and predict xt. It is called the prediction step. Now we make an observation at time t, so we
get our yt. What do we have to do to include this yt in our posterior over xt? Well, we just have to use Bayes' theorem, so that's easy; this part is actually trivial. It is called the update step, and the update just consists of multiplying the term computed in the previous step, from the Chapman-Kolmogorov equation, with the local likelihood, which is an object that is available locally at time t, and normalizing. Fine. So now, if our only task is to predict into the future and we don't care about previous predictions we've made, then we're actually done. This is an interesting observation, because this is often the setting we're in: you don't really care anymore about what you predicted in the past, you only want to make sure that your current prediction is the correct posterior. Nevertheless, sometimes there are situations in which you want to look back through time and decide what you would now predict, what you now actually know, about previous states. To do that, it turns out that you still don't have to pay a cubically high price. If the only things you care about are the marginal distributions, the predictions you really should have made at preceding times about the state at each time t, then there is another sequential algorithmic step, which moves back through the time series and basically fixes mistakes, or rather reduces uncertainty, by incorporating the information of all the later, subsequent observations. This is called the smoothing step, and here is how it is computed. We now care about this expression, our posterior distribution at a particular point in time t given all the data, not just the preceding data anymore. To do that, we can do a step similar to the Chapman-Kolmogorov equation: we introduce the subsequent state rather than the preceding state, write this as a marginal over a bivariate distribution, and then use the product rule to
expand. So we write the joint of xt and xt+1 given y as xt given xt+1 and y, times xt+1 given y, and now we expand the individual terms in there. This second factor is something you might already have from the preceding iteration as we go back through the chain in time, so let's assume we have this distribution already. Notice that you actually do have this distribution at the end of the chain: once you've moved all the way to the end you've seen all the data, and at time capital T you have exactly the object you care about. So that's the start of an induction; now let's see what happens as you go backward. We need to think about what this term inside the integral actually is. To do that, let's expand it. It is essentially a posterior, so we can write it in terms of a prior times a likelihood divided by a normalization constant, which is just the standard thing. Now we use the Markov chain structure again. This term here will not change; the denominator we'll think about in a moment. Here we notice that the conditional for all future data, from t+1 until the end, given all the past data and these two states, is actually given by the corresponding distribution where we drop the current state xt, because of the Markov chain structure. The same goes in the denominator, so we can make the same change there, and then notice that the integral in the denominator doesn't matter anymore: that integral is over xt, and xt is no longer part of the expression, so we can move the integral to the right-hand side. We then have the same terms on the left and the right, and the remaining integral is just one, because it's an integral over a probability distribution. So we're left with this expression, which we can think about some more, but you can already imagine what's going to happen. So let's see. Here we do a
similar kind of expansion. We can think of this conditional distribution, again speaking intuitively, as essentially a posterior, so we write it as a joint divided by a normalization constant; you could also say we use the product rule. Then, up here, we turn the crank again and use the product rule to expand this bivariate distribution into two conditionals, and here is the interesting part we now have to look at. We notice that this is a prediction for xt+1 given xt and all the previous data, so here our Markov chain structure helps us again: we can drop this y and are left with this expression. Now we can take this expression as it is, notice that it equals the expression we needed in the integral up here, and plug it in. Then we notice that there is a factor of xt given all the previous y's which doesn't depend on xt+1, so we can take it outside of the integral, and inside the integral we're left with this kind of object. Notice that this is something we already have from the predictions in the first phase of the algorithm, where we moved forward through time, and the other expressions are things we have as well: here is the term we have from our Markov assumption, here is a term we have by induction (we have it at the starting point, where we've reached the end of the chain, and then we can move backward and close the induction), and here is an expression of the same structure as the one we constructed in a previous time step. Altogether this gives the smoothing recursion p(x_t | y_{0:T}) = p(x_t | y_{0:t}) times the integral of p(x_{t+1} | x_t) p(x_{t+1} | y_{0:T}) / p(x_{t+1} | y_{0:t}) dx_{t+1}. By the way, I'm sorry, this should be a zero here, not a one. So now we're left with only expressions that are available at this point in time. This might still be a complicated integral, but at least it only involves quantities that we now know. These three steps, each
of which are of complexity at most linear in the number of time steps, have names in the signal processing community. The first one, moving forward, is called the prediction step. The second is called the update step; it's basically just Bayes' theorem. This particular abstract form is also called the Chapman-Kolmogorov equation, and the step where you go back through time to correct your predictions is called the smoothing step. If you've heard any kind of signal processing before, you might have heard these terms prediction, filtering and smoothing before. So before we move on to actually build algorithms, let's summarize the abstract observations we've just made. If we assume that the stochastic process we're trying to do inference on involves a Markov chain, so if it involves a latent state which evolves in such a manner that the next state in the chain is independent of all the previous states given the previous state, then inference in this chain separates into algorithmic steps that ensure that the overall cost of inference on all the marginal distributions of the states is linear in time. More precisely, this happens in terms of these three steps. Moving forward through time and making predictions at every point t involves predicting from the past data points and then updating using the most recent datum; this is called the filtering process if you keep doing it forward through time. And if at some point the time series ends and you would now like to correct what you believe the states at time t actually were, what their values were, then you can fix, or update, these marginal posterior distributions also in a process that is linear in time; this process is called smoothing, and it involves solving this equation. In both the Chapman-Kolmogorov equation and this smoothing equation there is an integral over a general probability density function, so I haven't told you how to actually do this in practice yet. No
matter what exactly the structure of these probability density functions is, assuming that they all exist and that the corresponding fractions exist, inference in this model is going to be linear cost in time, rather than the cubic cost we had in general Gaussian process regression so far. We're going to move on now to think about how to do this in practice, in concrete distributions, but here is a good point for you to take a quick break. So that's all nice and dandy, but we haven't actually built something yet that we can really work with on a computer; it's just a bunch of abstract symbolic derivations which have shown that complexity in these kinds of models is reduced by this conditional independence structure, but it hasn't given rise yet to a concrete algorithm. So to rephrase, or to convey to you a little bit better what we're trying to do now: we want to construct an algorithm, let's call it inference, which is a for loop, so that means it has O(n) cost, that performs inference in this time series with n individual entries, and which is defined by these three types of objects: an instantiation, an initialization of the induction, then a prediction distribution and an observation distribution. It does that by moving through the time series first from the front to the back and then from the back to the front, doing filtering and updating in the forward pass and then smoothing in the backward pass. What these words mean, filtering, updating and smoothing, we encountered on the previous slides, at least as the abstract expressions that encode what we need to do. To do this in practice on a computer we of course have to choose specific forms for these distributions. Once we have them, we can think about how to do these two integrals, which are going to be the tricky parts here, and well, there's an integral hidden in here as well, which is the normalization constant of Bayes' theorem, so there are going to be tricky integrals. So
what are we going to do? Well, let's make our lives easy and assume, for now and actually for the rest of this lecture, that all the distributions in this process are Gaussian and that all the relationships between the variables in them are linear, because we know from previous lectures that if that's the case, then everything becomes tractable linear algebra again. So let's assume for the moment that we have this kind of relationship; this is a relatively generic way of writing down this set of assumptions. We'll assume that the prediction distribution is a Gaussian which provides a linear relationship between the previous state and the next state, with some Gaussian covariance, let's call it Q, and a linear map called A. We need to initialize: let's say at time 0 we have an initial belief over x_0 which is Gaussian, with a mean and a covariance called m_0 and P_0. And the observation likelihood is again factorizing, as in previous assumptions on previous slides, and the relationship between the latent state and the observed quantity is also linear, with a matrix called H, and Gaussian with a covariance called R. Actually, the names of these variables are standard in the literature; it's not something I came up with just now, they are used throughout this filtering and smoothing literature. This maybe isn't the most generic way to write this down; you could come up with a few more terms, for example there could be a shift in here and in there as well, so it could be an affine map rather than a linear map, but that wouldn't really change things all that much. And these variables A and Q and R and H could all depend on time, they could be different at each step; for the following derivations to all go through you could just add indices t to all of these quantities. I will not do that, to simplify the notation a bit. Such models are then called linear time-invariant Gaussian systems, or LTI Gaussian systems for short.
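To make these assumptions concrete, here is a minimal sketch of such an LTI Gaussian state-space model as a generative program; the specific dimensions and numbers (a one-dimensional state, A = 0.9 and so on) are made up purely for illustration:

```python
import numpy as np

# Hypothetical toy instantiation of the LTI Gaussian model; the names
# A, Q, H, R, m0, P0 follow the standard filtering notation.
rng = np.random.default_rng(0)
A = np.array([[0.9]])   # linear transition map
Q = np.array([[0.1]])   # transition (process) noise covariance
H = np.array([[1.0]])   # linear observation map
R = np.array([[0.2]])   # observation noise covariance
m0, P0 = np.zeros(1), np.eye(1)  # initial Gaussian belief over x_0

def sample_trajectory(T):
    """Draw latent states x_1..x_T and observations y_1..y_T from the model."""
    x = rng.multivariate_normal(m0, P0)
    xs, ys = [], []
    for _ in range(T):
        x = A @ x + rng.multivariate_normal(np.zeros(1), Q)   # x_t | x_{t-1}
        y = H @ x + rng.multivariate_normal(np.zeros(1), R)   # y_t | x_t
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```

Everything that follows, filtering and smoothing, operates on exactly these six quantities.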
Let's see what the prediction, update and smoothing steps correspond to once we make these Gaussian assumptions. We need to implement the Chapman-Kolmogorov equation, which is this expression; here we just plug in the quantities we have from above. The prediction distribution comes from here, we just plug in A and Q, and we assume by induction that we have the posterior distribution over the preceding state given all the preceding observations. I'm now actually assuming we make observations from time 1 onwards; that simplifies a little what we're going to write down. So let's assume we have this distribution, with a mean and a covariance which we call m and P at time t-1; clearly we have such a quantity at time 0 by assumption, so this is going to work if everything stays Gaussian, which it will. Here we have a Gaussian times a Gaussian, and then there's a dx_{t-1} missing, I'm sorry, and that gives a product of two Gaussians, which is another Gaussian times a normalization constant. That other Gaussian integrates to 1, so what's the normalization constant? It's this quantity: a Gaussian distribution over x_t with a mean given by applying the map A to the previous mean, and a covariance given by applying A from the left and right to the previous covariance and adding Q. This resulting distribution is often called the prediction distribution, and its mean and covariance are usually written as m_t^- and P_t^-, because we are not yet at the posterior. This doesn't yet close the induction, because for that we first need to incorporate the next observation, and for that we just do Bayesian inference. That's actually really straightforward: we compute the posterior distribution over x_t given all the data by incorporating the next observation, which comes from this observation model, and we just write down the standard form for Gaussian posteriors. Actually, in this literature there is a notational convenience that is often used.
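In code, this prediction (Chapman-Kolmogorov) step is just two lines of linear algebra; this is a sketch in the notation above, not tied to any particular library:

```python
import numpy as np

def kalman_predict(m, P, A, Q):
    """Chapman-Kolmogorov / prediction step:
    N(m_pred, P_pred) with m_pred = A m and P_pred = A P A^T + Q."""
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    return m_pred, P_pred
```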
It is to introduce two quantities, z and K. z is the residual; this is a quantity you've seen on previous slides in the Gaussian process lecture: z is the distance between what we observe and what we would have predicted to observe. Notice that there is this H in here, because we are assuming that the observations are linearly related through H. And then there is this expression, which you also know from the Gaussian process lecture, which is the matrix to invert times the covariance between the observation and the thing we are trying to predict; this whole object together is often called the gain. It is notational convention in this community, just to confuse us even further, that this matrix is called K. This isn't a kernel, it's a gain, and the K comes from the name of the person who introduced, or at least popularized, the theory for these kinds of algorithms: Rudolf Kálmán. This is called Kalman filtering, and the capital K reminds us of him. Notice that there is a matrix inverse in here, so you might now be thinking: but I thought I could get away without inverting matrices. Well, this is a matrix that is entirely local: this object is of the size of y, so if y is a scalar then this is a scalar, and if y is a vector then this is a matrix quadratic in the size of this particular vector y_t, not of the size of all the previous observations. Using these quantities we can write the posterior mean and the posterior covariance like this; you can convince yourself that these are the same expressions you've seen on previous slides, if you want to, by going back to previous lectures, it's just a different, historically grown notation. Doing this closes our induction: we now have our posterior over x_t given all of the data up to and including t, a Gaussian with a mean and a covariance which we are now allowed to call m_t and P_t. With that we can move on to the next time step and repeat the process.
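The update step, with the residual z and the Kalman gain K just introduced, might be sketched like this; note that the only matrix being inverted has the size of a single observation y_t:

```python
import numpy as np

def kalman_update(m_pred, P_pred, y, H, R):
    """Update step (Bayes' theorem) using residual z and Kalman gain K."""
    z = y - H @ m_pred                    # residual: observed minus predicted
    S = H @ P_pred @ H.T + R              # covariance of the predicted observation
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain: size of y, not of the history
    m = m_pred + K @ z                    # posterior mean m_t
    P = P_pred - K @ H @ P_pred           # posterior covariance P_t
    return m, P
```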
So now we can repeat exactly the same computation with t replaced by t+1. This process, prediction and update, iterating between the two while moving forward through time, is called Kalman filtering. The process actually also works the other way around: if we arrive at the very end of this time series, at capital T, and want to correct all of our marginal beliefs over previous states, then we have to implement the smoother step. That smoother step is a little more elaborate, but it's also possible within the Gaussian framework. We want to compute the posterior over x_t, the marginal posterior if you like, because it's only over x_t, given all the data. We look up on previous slides what we need to compute, and notice that these are all quantities that we have. This is our Kalman estimation distribution, that's what this thing is called: a Gaussian over x_t with mean m_t and covariance P_t. This thing here is our predictive distribution, which we have from above, so we can just plug it in. This thing here is actually the only really interesting object: it is a quantity that we assume to have by induction. And notice that we do have it by induction: if we've previously done Kalman filtering, then at the end of the sequence, of the chain, we have exactly this object; we make our final observation y, and our Kalman estimate at time capital T is this distribution. Now let's assume by induction that we have it for all the steps going backward as well: at time t we have this for time t+1, and, there's a bug here, this is a Gaussian distribution, let's assume, because it is by induction, with a mean that is usually written as the smoother mean, m index t+1 with superscript s, and a smoother covariance, and here I've exchanged a sub- and a superscript, so it should be P_{t+1} with superscript s. And we need this normalization constant here, which is actually a Kalman prediction distribution as well, so it's of
the same type as this thing here; we just need it for t+1, and that's also a Gaussian. So all of these can be multiplied together. I won't actually go through this in detail, you can do it for yourself; it's a little hand exercise if you like, a back-of-the-envelope calculation using Gaussian identities, but it's clear that it's going to work, right? There are lots of Gaussians here, everything is Gaussian and linearly related, there is an integral at the end, and that will give us a new Gaussian distribution. It turns out that this distribution is a Gaussian over x_t with a smoothed mean, a so-called smoothed mean, which is given by the estimation mean from the filtering part, plus an object G, which I'll define below and which is called the smoother gain, times a residual, not between observations this time, but between the smoothed mean at the next location and the predictive mean at the next location. So what we have here is basically the difference between what we thought x_{t+1} was before we got to see later data and what we thought afterwards; that is a kind of correction, and it gets multiplied by the smoother gain. The smoother gain is the covariance between the current state and the next state, times the inverse of the predictive covariance at the next location. There is a corresponding expression for the smoothed covariance; this is also a Gaussian, and these quantities, the smoothed means and smoothed covariances, are, as I said before, denoted m_t smoothed and P_t smoothed. So, to summarize: make the linear Gaussian assumption, so assume that all the quantities in your model are linearly related and jointly Gaussian distributed, and start the induction by assuming a Gaussian distribution on the very first state.
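One backward step of this smoother, with the smoother gain G as just described, might look as follows; this is a sketch of the standard Rauch-Tung-Striebel recursion, and the argument names are my own:

```python
import numpy as np

def rts_smooth_step(m, P, m_pred_next, P_pred_next, m_s_next, P_s_next, A):
    """One backward Rauch-Tung-Striebel smoothing step.

    m, P: filtered (estimation) moments at time t;
    m_pred_next, P_pred_next: predictive moments at time t+1;
    m_s_next, P_s_next: smoothed moments at time t+1."""
    G = P @ A.T @ np.linalg.inv(P_pred_next)      # smoother gain
    m_s = m + G @ (m_s_next - m_pred_next)        # corrected mean
    P_s = P + G @ (P_s_next - P_pred_next) @ G.T  # corrected covariance
    return m_s, P_s
```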
Then, in these time-series Markov chain state-space models, inference on the marginal distributions, the first thing you want in a time series setting, for all states at arbitrary times t, separates into two algorithmic steps, filtering and smoothing, and all of the computations in this process are simple linear algebra operations that are local and can therefore be done in time linear in the number of observations. They consist of the Kalman prediction step, a simple computation giving the predictive distribution; the Kalman update step, which consists of computing these quantities to get the so-called estimation distribution for x_t; and then the smoothing step, which isn't named for Kalman, because as I understand it, it wasn't actually derived by him but instead by three people called Rauch, Tung and Striebel, and it amounts to this relatively simple update that looks like this. Actually, there are variants of how to write this smoothing update; these are the ones connected with those three names. With that, we're at another gray slide to remind you: we've already spoken in the first part of this lecture about the notion of Markov chains and Markov-structured models. I'm going to just leave this up here and won't talk through it again, but we saw that in such models inference has computational complexity linear in time. That by itself doesn't really tell us yet what exactly the inference is, it's just an abstract statement, but if we make jointly Gaussian assumptions with linear relationships between the variables, then we have a linear Gaussian system, and in such systems inference on all the marginal distributions separates into Kalman filtering and RTS smoothing. OK, so this is the practical content that I wanted to cover in today's lecture. I didn't show you a code example, but I would actually hope that some of you might be interested in trying this out yourself: just take a time series and define these quantities that we have here.
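If you do want to try it, a complete filter-plus-smoother loop, putting all of the pieces above together, might look like the following sketch; it is O(T) overall, and the variable names follow the lecture's notation:

```python
import numpy as np

def kalman_filter_smoother(ys, A, Q, H, R, m0, P0):
    """Forward Kalman filter plus backward RTS smoother; O(T) overall.
    Returns the list of smoothed (mean, covariance) pairs for t = 1..T."""
    T = len(ys)
    ms, Ps, mps, Pps = [], [], [], []
    m, P = m0, P0
    for y in ys:                              # forward pass: filtering
        mp, Pp = A @ m, A @ P @ A.T + Q       # prediction step
        S = H @ Pp @ H.T + R
        K = Pp @ H.T @ np.linalg.inv(S)       # Kalman gain
        m = mp + K @ (y - H @ mp)             # update step
        P = Pp - K @ H @ Pp
        ms.append(m); Ps.append(P); mps.append(mp); Pps.append(Pp)
    m_s, P_s = ms[-1], Ps[-1]                 # backward pass: smoothing
    out = [(m_s, P_s)]
    for t in range(T - 2, -1, -1):
        G = Ps[t] @ A.T @ np.linalg.inv(Pps[t + 1])   # smoother gain
        m_s = ms[t] + G @ (m_s - mps[t + 1])
        P_s = Ps[t] + G @ (P_s - Pps[t + 1]) @ G.T
        out.append((m_s, P_s))
    return out[::-1]
```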
Try out for yourself whether you can do this kind of inference. Now, for the rest of this lecture, I want to address a theoretical question that some of you might already have had while following along today, which is that here we're still talking essentially about a regression problem: the data set consists of pairs of x's and y's, of inputs and outputs, everything is jointly Gaussian related, and in the end we're learning predictions for the latent function, which we now call the latent state, at all the times t. But the way we've defined this model, even though it's a Gaussian regression model if you like, is quite different from the way we've defined Gaussian regression models in previous lectures. There is no kernel in this presentation; everything is instead phrased in terms of this Markov prediction distribution and the observation likelihood. The likelihood is quite similar to what we've seen in previous lectures, but this thing here is a new way of representing, if you like, our machine learning algorithm. So the natural question you might have is: how is this family of models related to the Gaussian process regression models that we've discussed in the previous lectures? To answer this question we have to think about what happens in between the observations. Up until now I've only spoken about predictions for function values, states, at particular points in time that are discretely spaced away from each other, and this is fundamental to the way I've done this presentation, because the discrete relationship allows me to define these quantities A and Q, which are these linear maps with this particular form. I can't yet make predictions at time points that lie between the discrete points in time. But of course time, being a physical dimension, is a continuous object, at least for our purposes. So how can we take this model and make it into a continuous-time model? That continuous-time model would then allow us to draw function
values at any point in time, and that's clearly something we need to be able to relate this to Gaussian process models. The way I would like to do this, because I only have half an hour and because I don't want to give an entire lecture course just on this particular relationship, will be a little bit pedestrian, so I ask for your understanding, for your suspension of disbelief, when I'm drastically simplifying some relationships here. Those of you who might have taken a lecture on stochastic processes, or just stochastics, might be a little bit disappointed by the way I do this, but this is a computer science lecture on machine learning and not a stochastic processes lecture. So here is where we are: we currently have these Gaussian distributions, and there is essentially a joint Gaussian distribution here over these discrete time points, which you can get by starting at some point and then drawing forward individual states using the conditional distribution. If you don't understand how that works, you might want to stop the video here for a second and think about how I created this image. I actually made this image with a very specific choice of jointly Gaussian distributions, with a time step delta t which, without loss of generality, I've just decided to be one, so I can index time by natural numbers. Then I've started this process with a Q, so with a predictive variance, that is just one; actually I've set A to one as well, so we're just predicting a new state by taking the previous state and adding standard Gaussian random noise, standard Gaussian meaning zero mean, variance one. Now, one way one could think about making this into a continuous process is by interpolating between these values in distribution, and what I mean by that is that we could make twice as many of these function values, these state values, by taking time steps of size one half and halving the variance. That means after I've
taken two steps, I've added variance one, because I've drawn two independent Gaussian random variables, and the variance of a sum of independent Gaussian random variables is the sum of their variances, as you can easily convince yourself using standard properties of Gaussian distributions. Then I can keep doing that and halve the time step again to one quarter, and by now you might have noticed that this means that the variance of the individual update step is just given by the time step, going from one to one half to one quarter. What we can ask, and you're not alone in asking this, is what the resulting object is if we take the limit of delta t towards zero. It turns out that there actually is such a limiting object, and it is a stochastic process that we've seen before; you might have seen this plot, and just seeing it might convince you that we've encountered this process before. Now, it turns out that doing this precisely is actually shockingly hard mathematically, and it opens up a super complicated can of worms which I cannot talk about in this lecture because it goes way beyond its scope. So let me just say that there is a probability measure that corresponds to, in some sense, the infinitesimal limit of the Q that you get when you take the time step towards zero, and this probability measure is called the Wiener measure. It gives rise to these kinds of sample paths, which are Brownian motion, because, intuitively speaking, at every infinitesimally small time step the process gets an infinitesimally small perturbation up or down, Gaussian distributed, with mean zero and with a variance that is proportional to the size of the step. There is a lot of deep theory behind this which, again, I don't have time to cover. However, you do know this process from previous lectures, and you know that it has a name: it's called the Wiener process, and you know that it's a Gaussian process.
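The refinement construction just described is easy to mimic numerically: each increment is a zero-mean Gaussian with variance equal to the step size, so halving the step halves the per-step variance while the marginal variance at any fixed time stays the same. The function below is an illustrative sketch of this discretized Brownian motion; the parameter names are made up:

```python
import numpy as np

def brownian_path(T=1.0, dt=0.25, seed=0):
    """Discretized Brownian motion on [0, T]: increments ~ N(0, dt)."""
    rng = np.random.default_rng(seed)
    n = int(round(T / dt))
    steps = rng.normal(0.0, np.sqrt(dt), size=n)  # per-step std = sqrt(dt)
    return np.concatenate([[0.0], np.cumsum(steps)])
```

Calling this with dt = 0.5, 0.25, 0.125, and so on produces ever finer paths whose distribution at the coarser grid points does not change; the limit of this construction is exactly the Wiener process.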
In fact, as we now see, it's not just a Gaussian process, it's also a Markov process, because it has the Markov property, and therefore it's called a Gauss-Markov process. There are actually several such Gauss-Markov processes; they are a subset of the space of all Gaussian processes, a strict subset. The path created by this kind of infinitesimal step, in this kind of stochastic fashion, is clearly described in terms of the dynamics of some, let's just call it, stochastic dynamical system. Now, normal, non-stochastic dynamical systems are usually described by differential equations; if they move through time in this one-dimensional fashion, these are called ordinary differential equations. So it might seem natural that there is a corresponding concept for stochastic dynamical processes, and these are called stochastic differential equations; that's as deep as the connection is going to go in this lecture. I would just like to leave you with a big caveat: these stochastic differential equations are in many ways more complicated than ordinary differential equations, because they involve the use of this Wiener measure, which is actually much harder to define precisely than what I just did in this intuitive fashion by taking individual steps and making them arbitrarily small. Such stochastic differential equations are usually written in this notation, and this notation, for the purposes of this lecture, is really just a symbolic thing, a string you write down to define a Gaussian process and, in a way, a Kalman filter. So what you are going to read off now is essentially a definition, a backward definition, of this line up here. I will call this particular object a linear time-invariant stochastic differential equation; it's called linear and time-invariant because it involves linear maps F and L which do not change through time, so they don't depend on t. This thing is called a stochastic differential equation, linear time-invariant, together with an
initial value, that's the value of x at time t0, and let that value be x0. This string is meant to describe the local behavior of a Gaussian process, and it turns out that it describes a unique Gaussian process, which has a mean function given by this object: we take the exponential of F times the time elapsed since t0 and apply it to x0. It turns out that this is possible not just if F is a scalar and these are all scalar quantities, but even if x is a vector and F is a matrix, because there is a corresponding object called the matrix exponential; down here is the definition of what the matrix exponential is. It's kind of a natural definition, and I'm just going to use it for the purpose of understanding what's going on: think of a scalar exponential, but everything works if F is a matrix. So this is the mean of our Gaussian process, and it also has a covariance function, which as you know is the kernel, and that kernel is given by this object: an integral from t0 to the minimum of a and b over, essentially, an outer product, a square of an exponential expression just like before, but with the integration variable tau in here and this other quantity L in here. F and L are often associated with physical processes, physical interpretations. This term here is the behavior of a deterministic system; in fact, if you forgot about this other term, if you just dropped it, then this would just be an ordinary differential equation. This part is called the drift of the dynamical system, and the other is called the diffusion of the process. The drift, as you can see, follows a deterministic ordinary differential equation, so you could solve it using an ordinary differential equation solver, or actually, because it's a linear ordinary differential equation, also in closed form, and its behavior would be given exactly by the mean function.
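For the scalar case, the mean function and kernel just described can be written down directly; the kernel integral is approximated here with a simple midpoint rule, purely as an illustrative sketch:

```python
import numpy as np

def sde_mean(F, x0, t, t0=0.0):
    """Mean of the LTI SDE solution: m(t) = exp(F (t - t0)) x0 (scalar case)."""
    return np.exp(F * (t - t0)) * x0

def sde_kernel(F, L, a, b, t0=0.0, n=2000):
    """Kernel k(a, b) = integral from t0 to min(a, b) of
    exp(F (a - tau)) L^2 exp(F (b - tau)) d tau, via the midpoint rule."""
    h = (min(a, b) - t0) / n
    taus = t0 + (np.arange(n) + 0.5) * h
    return float(np.sum(np.exp(F * (a - taus)) * L * L * np.exp(F * (b - taus))) * h)
```

Setting F = 0 and L = 1 recovers k(a, b) = min(a, b) - t0, the Wiener process kernel, which is exactly the sanity check the lecture performs next.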
Then there is this additional term, which introduces the stochastic nature of this process; this is called the diffusion of the stochastic process, and it therefore necessarily shows up in the bit that defines the stochastic part of the Gaussian process, the kernel, as in this expression where L appears. So this is a way to connect the behavior of this picture to the language that you already know from previous lectures, that of a Gaussian process: here we have a mean function and a kernel function, and you know how to do inference in this process using means and covariance functions. But the same expression also connects to a Kalman filter at discrete times t_i moving forward. This Gaussian process is often called the solution of the stochastic differential equation, and the discrete-time stochastic recurrence relation, defined by the Markov-type predictive distribution that arises from this kernel, can actually be computed with similar quantities: they turn out to be given by this expression for the quantity A, the linear map, and this expression for the quantity Q. So on this slide there's a connection between Gaussian processes and filters. Now, of course, this doesn't answer all of your questions; a natural question might be: but if you give me a filter, how does it map to a Gaussian process? I can't fully answer this in this one lecture, because it would go way beyond the time we have today; if you want, you can ask me in the flipped classroom and we can talk about it there. Instead of doing this complicated bit, let's do a simple thing and provide two intuitions at the end of this lecture. The first one is just a sanity check, to see that what we're doing here actually is exactly what we expect to be doing. So let's see whether we can derive the Wiener process in this particular way. Here I've just copied over the definitions from the previous slide. It turns out that the Wiener process corresponds to the case where, and that's the reason why this is
called the Wiener measure, F is 0 and L is a constant. If F is 0 and L is a constant, then clearly the mean function is just the constant x0, because e to the 0 is 1, so m(t) is just x0, and A is the unit matrix, because e to the 0 is 1, and actually the matrix exponential of the 0 matrix is also the unit matrix. For the kernel, let's look at the exponential in the kernel up there: this is just an integral over a constant function, so e to the 0 is 1, and we just get L squared, which is theta squared, and we're left with an integral over theta squared from t0 to the minimum of a and b, and that's clearly just theta squared times the minimum of a and b, minus t0. So that's our Wiener process; OK, the sanity check works. Well, that isn't particularly interesting, so let's look at another, actually almost equally simple, stochastic process that we haven't encountered so far in our lectures about Gaussian processes, but you might as well encounter it now. This process has a physical interpretation: it describes the behavior of a particle in a gas, if you like, but a particle that is not free. If the particle is free to move around, then its behavior is described by Brownian motion, so by the Wiener process; but if it's not free, and is instead caught in a potential well, and that potential well exerts a linear restoring force, so it's basically like a spring pulling you back towards 0, then the stochastic differential equation is a linear one where dx/dt is just minus a constant times x, so that if you're moving in the positive direction there is a force acting towards 0, and if you're moving in the negative direction there is a force acting towards 0 again. So F is minus a constant, and L is again just another constant, which is usually defined in terms of some other quantities that make things easy, 2 theta divided by the square root of lambda, but those are just some numbers, just think of this as a number. Clearly e to a constant will show up in our mean function:
the mean function is given by x0 times an exponential decay in time t, so if we start at a point x0 that isn't 0, then over time this process will decay back to 0. A is given by the simple exponential function, and Q by this constant times 1 minus an exponential decay. This means that if you take a large step forward into the future, then Q asymptotically becomes 1, it just adds a finite amount of uncertainty, and if you take a very short step, then Q is again almost 0. The corresponding kernel is an interesting thing: it's not the Gaussian kernel, so make sure you don't mistake it for a Gaussian kernel. It's not e to the minus something squared; instead it's e to the minus the time difference between the two points, just the absolute value of it. This process is a very rough process, actually as rough as the Wiener process, but it is stationary, clearly, because this kernel depends only on the distance between a and b, at least at points far away from t0. If you're far away from t0, that means once the process has reached its equilibrium, we have a kernel that is just e to the minus the absolute distance between the points. You can think of this as a physical system that goes into some kind of equilibrium, where if you go far into the future you don't really know where exactly the particle is, but you know that its position is basically bounded by the potential well that keeps it inside, the potential well being quadratic and the corresponding force being linear. This is maybe one interesting stochastic process to end on, but I'll show you just one more generalization, to make sure you don't mistake this to only work on scalar values. Let me just point out that this way of constructing stochastic processes also works if the operators F and L are not scalars but linear operators, so matrices, and this kind of notion can be used to define very interesting stochastic processes.
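In code, the Ornstein-Uhlenbeck process just described might be sketched as follows; note that the parameterization used here, a decay rate lam and a stationary standard deviation theta, is one common convention and not necessarily the exact constants on the slide:

```python
import numpy as np

def ou_discretization(lam, theta, dt):
    """Discrete-time transition quantities of an Ornstein-Uhlenbeck process
    (assumed parameterization): A = exp(-lam dt), Q = theta^2 (1 - exp(-2 lam dt))."""
    A = np.exp(-lam * dt)
    Q = theta**2 * (1.0 - np.exp(-2.0 * lam * dt))
    return A, Q

def ou_kernel(lam, theta, a, b):
    """Stationary OU (exponential) kernel: theta^2 exp(-lam |a - b|)."""
    return theta**2 * np.exp(-lam * abs(a - b))
```

For a large step dt, the added variance Q saturates at the stationary value theta squared, a finite amount of uncertainty; for a tiny step it is almost 0, matching the behavior described above.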
In particular, you can use the fact that we're essentially solving a differential equation here to define stochastic processes that are integrals over processes like, for example, the Wiener process or the Ornstein-Uhlenbeck process; this is what is on this slide here. If you do this as multiple integration, so you introduce more and more of these different states and define F such that it is of this integrator type, then the resulting stochastic process is either the iterated integral of Brownian motion, and we've already seen what this gives us, it gives us polynomial splines, or, if we use the Ornstein-Uhlenbeck process as the starting point, so if there is this corrective term in the lower right corner of F, it gives rise to a family of stochastic processes known as the Matérn-type processes, whose corresponding kernels are called the Matérn family of kernels. These are actually quite popular in machine learning, in regression, because they are less extreme and aggressive than the Gaussian kernel: they only assume that the corresponding stochastic process, or its sample paths, has finitely many continuous derivatives. If you didn't understand this, it doesn't matter; I just wanted to point out that there is a way to do this beyond just scalar objects. With that, we're at the end. We considered today a specific kind of stochastic process which is specifically useful for data types that are structured as time series; that means they have a one-dimensional, ordered input and typically arrive in a sequence that might even be infinitely long. That means, on the one hand, that the kind of prediction we want to make is a local one: we want to keep track of what we know given the past and then predict the future, and it doesn't matter so much what you predicted in the past, you only want to predict locally, that means you want to predict marginals. On the other hand, we need
a way to do this at low computational cost, because we want to keep doing it, and that means that the overall cost of inference has to be linear in the number of data points. Markov chain structured models provide this kind of inference: they automatically give rise to inference algorithms which take the structure of a filter going forward through time and a smoother going backward through time. That's an abstract structure for the inference algorithm, which we will return to in the next lecture, because it will turn out to be a special case of an algorithm that even works for somewhat more complicated structured graphs. If we assume that all the variables in this kind of model are jointly Gaussian distributed and linearly related, then the corresponding algorithm has a concrete form that can be implemented very efficiently on a computer; it's called the Kalman filter for the forward pass and the RTS smoother for the backward pass. These models are defined in a quite different way from Gaussian processes, but they actually are Gaussian processes: they are Gauss-Markov processes, a subtype of all Gaussian processes, and they are connected to Gaussian processes through the notion of a stochastic differential equation. That is an abstract concept that comes with a lot of mathematical caveats, but once you write it down, it essentially defines both a Gaussian process, with a kernel and a mean function, and a linear Gaussian recurrence relationship that is defined in terms of these operators A and Q. This short lecture today was really just a very first and short look into Markov-type models, time series models. We only had time to look at the abstract case, to discover the notion of filtering and smoothing, and then we looked at what is maybe a very restrictive form of definition, these linear Gauss-Markov models. As you can maybe guess from the last few slides, there is a much more complicated world of theory about these models; in particular, there are
inference algorithms for these kinds of settings where the relationship between the true function, the true states, and the observed variables is not linear and not Gaussian; these are called filters, more generally than Kalman filters. And there are also models for the case where the relationship between the latent states is not linear and Gaussian; these are called hidden Markov models. In this basic introductory lecture on probabilistic machine learning there is unfortunately not time to cover all of these. For here, for today, for us, we're finished. Thank you very much for your time, and I'm hoping I'll see you again at the next lecture.
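As a small aside to the summary above: the Kalman filter forward pass and the RTS smoother backward pass mentioned in this lecture can be sketched in a few lines of NumPy. This is a minimal sketch and not code from the lecture; it assumes the linear Gaussian model x_t = A x_{t-1} + noise with covariance Q and y_t = H x_t + noise with covariance R, and all function and variable names here are my own choices for illustration.

```python
import numpy as np

def kalman_filter(A, Q, H, R, mu0, P0, ys):
    """Forward pass: filtered means/covariances p(x_t | y_1..t).

    Hypothetical sketch of the standard predict/update recursion for the
    linear Gauss-Markov model x_t = A x_{t-1} + w, w ~ N(0, Q),
    with observations y_t = H x_t + v, v ~ N(0, R).
    """
    mus, Ps, preds = [], [], []
    mu, P = mu0, P0
    for y in ys:
        # predict: push the belief through the linear dynamics
        mu_p = A @ mu
        P_p = A @ P @ A.T + Q
        # update: condition on the new observation y
        S = H @ P_p @ H.T + R                 # innovation covariance
        K = P_p @ H.T @ np.linalg.inv(S)      # Kalman gain
        mu = mu_p + K @ (y - H @ mu_p)
        P = P_p - K @ S @ K.T
        mus.append(mu); Ps.append(P); preds.append((mu_p, P_p))
    return mus, Ps, preds

def rts_smoother(A, mus, Ps, preds):
    """Backward pass: smoothed means/covariances p(x_t | y_1..T)."""
    T = len(mus)
    ms, Vs = [None] * T, [None] * T
    ms[-1], Vs[-1] = mus[-1], Ps[-1]
    for t in range(T - 2, -1, -1):
        mu_p, P_p = preds[t + 1]              # one-step-ahead prediction at t+1
        G = Ps[t] @ A.T @ np.linalg.inv(P_p)  # smoother gain
        ms[t] = mus[t] + G @ (ms[t + 1] - mu_p)
        Vs[t] = Ps[t] + G @ (Vs[t + 1] - P_p) @ G.T
    return ms, Vs
```

Both passes touch each data point once, which is where the linear-in-the-number-of-data-points cost comes from; the smoothed covariances are never larger than the filtered ones, since they condition on the whole sequence.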