We have now reached the point where I have spent the past two weeks or so connecting the relatively formal study of Gaussian process models, which are maybe the foundational archetype of supervised machine learning from the probabilistic perspective, to a large chunk of contemporary machine learning in the form of deep learning. Hopefully last week's lecture made it clear that this is a worthwhile thing to do, because it lets us think about models from a perspective that a point-estimation framework does not allow: for example, to track changes in the model or in the data over time, or to heal certain pathologies of the model, like its finite uncertainty far away from the data. Maybe it would have been ideal to have that kind of lecture at the very beginning of the class, so that it is clear why we need to do this and why it is still useful to think about uncertainty in 2023. But it took a bit of time to get there. We needed to build this toolbox of linear algebra and algorithms to make things work. Now that we have done this, we can take a step back and look again at the foundational structure of probability theory and how we can use it to create interesting functionality. The functionality I want to talk about this week is how to deal with data that arrives across time.
We already motivated that last Thursday, because sometimes you are just in a setting where data arrives continuously and you have to deal with it. We saw in the very beginning of the class, in the first four lectures or so, that probabilistic inference in its general form can be computationally quite taxing, and that we have to think about the structure we impose on probability distributions to actually make inference possible. Today we are going to do this for time series problems, for problems where data arrives across time. Here's the plan. We are going to think about why this kind of data is a problem at all: where computational complexity arises in our inference models, our supervised machine learning problems, and why we cannot apply the models we have used so far, including deep neural networks, out of the box to time series data. I'll talk very briefly, on one slide, about some examples of why you might care about data arriving in a stream. Then we'll dive into the algebra and think about what kind of structure we need in a probability distribution to allow constant-cost inference per time step, because that is exactly what we need if we have an infinite stream of data. As in the past, we'll first do this on the abstract level: what structure does a probability distribution need to have for inference to be tractable even in principle? Then we'll realize that to make the local steps a real thing on our computer we again have to make further choices, and this time again they will be Gaussian and linear assumptions, because they lead to closed-form linear algebra updates. I'll point out a few connections to other models as we go. That will lead to a type of algorithm that is called the Kalman filter. Raise your hand if you've heard of Kalman filters
before. Okay. And raise your hand if you feel that you can implement one and know how to use it. Okay, that's a small number, and it's going to simplify your homework. Everyone else should be able to do it after this lecture. So, first step: I'm going to tell you something about how conditional independence affects the computational complexity of inference. That's our highest level. We'll realize that if we don't think about the structure of our model, learning is potentially going to be very expensive, so we need to think about structure to make learning possible. This harks back to lecture number two, when we had these graphical models, which since then have fallen a little bit by the wayside. You will remember I had this example, taken from Stefan Harmeling, with two coins and a bell, where the bell rings whenever the two coins show the same side, heads or tails. We spoke about conditional independence and how it can be represented, to some degree imperfectly, by pictures like this, directed graphical models. We realized that if you just have two variables, there isn't much to talk about in terms of independence: two variables are always either independent or dependent, and that's a bit boring. But when you have three variables, there is interesting structure called conditional independence, because given one of the variables, the other two can become independent, stay dependent, or become dependent if they previously were independent. So we had these atomic structures called a chain, a fan-out, and a fan-in, also called a v-structure or a collider; there are different words for it. Remember that these graphs are a graphical representation of a factorization structure. We draw these graphs to represent a joint probability distribution by first drawing one circle with a variable name in it for each variable, and then looking
at a representation of this joint distribution as a factorization into terms, and drawing, for every such factor, an arrow from all the variables on the right-hand side of the conditional distribution to all the variables on its left-hand side. For a general distribution, if no further structure is known, the product rule of probability theory only says that p(a, b, c) is p(a | b, c) times p(b | c) times p(c). That is a fully connected graph, where every node has either an incoming or an outgoing arrow to every other node. Then there is nothing further to say, because everything could depend on everything, and things are boring. But sometimes we know that we can drop certain terms. For example, this is a simplification, because there is no a here in this factor. We had an example of such structures with the alarm and the burglar from Judea Pearl, the classic textbook example. We noticed that these different types of structure lead to different conditional independence structure. In this graph, if we marginalize out b, then a and c are marginally dependent on each other, but when we condition on b they become independent of each other. And in this graph, a and c are marginally independent of each other, but when we condition on b they become dependent on each other. This is sometimes called explaining away, or induced dependence. Back then, maybe that structure was a bit abstract: why do we care about this? Now, almost half a year later, we have almost forgotten about it. But it was with us during the entire last few weeks. It's just that the models we looked at were much more complicated, so this simple atomic structure was not always easy to see. The first type of models we looked at were parametric regression models. We are trying to learn a function f(x), which we observe
with noise, so the y's are noisy observations of f evaluated at some points. We made the assumption that we can write the function as a finite set of feature functions of x times a finite set of weights. Back then I actually drew the corresponding graph more suggestively, to make it look like a shallow neural network with just one actual layer. I mean, actually two layers, but the input layer was predetermined and not uncertain, so we can forget about it. Here is another way of drawing it as a graph that maybe makes the conditional independence structure a bit clearer. We have these unknown weights; that's why this is an empty circle, because it's unknown. Through the feature functions of the x's one, two, three, four, we create the value of the function at all four of these points and then make noisy observations. Remember that observations are circles filled in black. So this is an example of the fan-out structure. If we knew what the weights were, we could predict the observations independently. That's the structure we use in the likelihood to make computation in some sense easier. You may remember, or you can think about it for a few moments while I keep talking, that this is the reason why something like stochastic gradient descent is possible in deep learning: it means we can draw batches of individual y's and consider them independently to learn about w, without having to think about the other y's in the data set. It's also something that affects the computational complexity of the algorithm. A phone? Has anyone found a phone somewhere around you? It's not here either. No. Can you maybe ask the house master, right next to the entrance, in case somebody found it? All right, let's come back to computational complexity and get away from phones.
Another aspect of this graph is that it means there is some kind of finite object w that we can keep track of, and it describes everything we think we need to know. Not everything we need to know, but everything we think we need to know. It's a finite-dimensional representation of the problem. That's why in this case we can work in the so-called weight space, with a probability distribution on the weights, and represent the entire problem this way. We did this with the Laplace approximation, by building a covariance matrix that is constructed from the Hessian of the loss function. You may remember that constructing that matrix is linear in the number of observations y. It's cubic in the size of the weight space in the worst case, but linear in the number of observations. We also had this other type of model, the generic Gaussian process model, a non-parametric model, in which we work directly in the space of function values without assuming a finite latent representation of them. In some sense, as we saw, this is a more powerful representation, because it is effectively infinitely flexible and can learn any function. That's also the reason why infinitely wide neural networks can learn any function: they are effectively Gaussian processes. But the price we pay is that there is no such finite latent representation. The graph between the function values is fully connected. If we want to write down the conditional probability distribution of one of those variables given all the other ones, we really have to think about all the other ones. You may remember that for Gaussian distributions we can read off the conditional distribution of one variable given the others from the precision matrix, the inverse covariance matrix. That's why I've written down these terms in the matrix.
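As a small illustration of reading a conditional off the precision matrix, here is a minimal sketch in plain Python. The function name `conditional_from_precision` and the two example matrices are assumptions made up for this note, not code from the course. It uses the standard Gaussian identity that, for precision matrix P, the conditional of x_i given all other entries has variance 1/P_ii and mean mu_i - (1/P_ii) * sum over j != i of P_ij (x_j - mu_j).

```python
def conditional_from_precision(P, mu, x, i):
    """Mean and variance of x[i] given all other entries of x,
    for a Gaussian with mean mu and precision matrix P (inverse covariance)."""
    var = 1.0 / P[i][i]
    # Only entries j with P[i][j] != 0 contribute to the conditional mean.
    shift = sum(P[i][j] * (x[j] - mu[j]) for j in range(len(x)) if j != i)
    return mu[i] - var * shift, var


mu = [0.0, 0.0, 0.0, 0.0]

# Hand-picked illustrative precision matrices, both positive definite.
# Chain-like: each variable is only coupled to its direct neighbours.
P_chain = [[2.0, -1.0, 0.0, 0.0],
           [-1.0, 2.0, -1.0, 0.0],
           [0.0, -1.0, 2.0, -1.0],
           [0.0, 0.0, -1.0, 2.0]]
# Dense: every variable is coupled to every other one.
P_dense = [[2.0, -0.5, -0.5, -0.5],
           [-0.5, 2.0, -0.5, -0.5],
           [-0.5, -0.5, 2.0, -0.5],
           [-0.5, -0.5, -0.5, 2.0]]

x_a = [1.0, 0.5, -0.3, 2.0]
x_b = [1.0, 0.5, 7.0, -4.0]  # same direct neighbour of x[0], different far values

chain_a = conditional_from_precision(P_chain, mu, x_a, 0)
chain_b = conditional_from_precision(P_chain, mu, x_b, 0)
dense_a = conditional_from_precision(P_dense, mu, x_a, 0)
dense_b = conditional_from_precision(P_dense, mu, x_b, 0)

print("chain:", chain_a, chain_b)  # identical: far-away values do not matter
print("dense:", dense_a, dense_b)  # different: every value matters
```

With the sparse matrix, the conditional of the first variable depends only on its direct neighbour; with the dense matrix, changing any other value changes the conditional, which is the fully connected case described above.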
So if this matrix is dense, which it is for general Gaussian process models, then this graph is fully connected, at least in f. The y's are conditionally independent, that's our assumption in the likelihood, but that doesn't help us much, because the latent variables are fully dependent on each other. So we could still collect batches of the data set somehow and compute gradients, but the gradients would have to be with respect to all of the latent function values. Inference is then going to be cubic not in the size of some latent object but in the number of observations we've made, because there is no latent object. That's bad if we have data that keeps arriving, here on the right. If there are finitely many y's, well, cubic complexity is still not something you want with a very large data set. If you have more than, I don't know, 10,000 data points, you will need to think about how to make things faster. But if you put yourself in the perspective of a theoretical computer scientist, it's a polynomial-time algorithm, so it's kind of tractable. But what if n, the number of observations, rises without bound? If we just sit somewhere and keep getting data all the time, then we need an algorithm that has linear complexity in the number of data points. Why linear? Does someone have an idea? It's the absolute lower bound if you have new data all the time. Yeah, in the sense that it means every individual time step is O(1), constant cost. That's exactly what we need, because if the cost per step keeps growing, then as n gets arbitrarily large the computational cost of one individual step will be arbitrarily high, and we can't do that anymore, right?
No matter what your hardware is, we will eventually run out of it. So that's the worst we could possibly accept. Of course there could also be an O(1) algorithm, but that would mean we never touch most of the data set, so that's not really feasible either. So we need O(n), also from below in a sense: if it's cheaper than O(n), then, at least if we're thinking of serial processing, we're not touching some part of the data set. That's bad; we're missing some of the data. And if it's more than O(n), we will eventually run out of computational resources, and that's not going to work either, at least under the assumption that data keeps arriving all the time. And that is actually the case in some applications: if you are collecting sensor readings through time, for example for weather or climate modeling, or if you're building a self-driving car that drives around and keeps seeing the world around it, or in medical applications where you attach sensors to a patient and read off their vital signs, or in many other settings where things change across time. We need to be able to deal with this continuous accumulation of computational tasks and keep them tractable. Because of this connection between computational complexity and run time, we tend to think of the problems where we need this structure as time-structured problems. More generally, you can think of one-dimensional problems in an ordered space that we move along in some direction. But in pretty much all real-world applications that direction is time, and therefore these types of models or problems are called time series. For our purposes, a time series is a sequence of observations indexed by some scalar variable, which means it lives on an ordered space. And what we often have is that the time
steps are constant, so that there's a regular frequency at which we sample data, for example because that's how your sensors work, or because that's how you decide to poll your system. That simplifies things a little, because then we can think of the indices of observations as integers, as natural numbers, and that simplifies some of the exposition today. On Thursday, Nathanael Bosch will take over for one day, because I'm not in Tübingen, and maybe give you an idea of what we do if the time steps are not constant. So I guess from this slide it should be clear that these are the types of models we should care about, right? Some nodding. I feel like there has to be a slide like this just to say that this is important, but it's also just to set the scene. One interesting thing to note about such univariate spaces: by definition, I've said t_i comes from a scalar variable. An interesting aspect of both the univariate continuum and the row of integers is that they are ordered. We know whether something is before or after something else. That's going to be extremely useful, because it gives us a sense of direction to move through the data from one end to the other. So in such problems we need to achieve computational complexity of order n. The one example we had so far where this would work is the parametric regression setting. There it's actually possible to keep updating a model by putting in one data point after the other, and we did this a few times. We did it when we spoke about parametric regression in the first place. We wrote this piece of code called gaussians.py that has a function called condition, which for
weight space inference just returns another Gaussian. We can keep that and condition it again on the next data point, and so on. Because the object we're operating on is the mean vector and the covariance matrix of the associated Gaussian distribution, that's a finite object that we can hand around through time. We did this again when we talked about how the linear algebra methods that do Gaussian process inference actually work. We realized that the Cholesky decomposition is effectively a bookkeeping process that goes through the data set one datum after the other, in some arbitrary ordering, and at each point keeps the conditional distribution given all the previous data points, so the conditional distribution of w given all of the y's up to a certain point, and then updates it with one new data point. That update, it turned out, is quadratically expensive. Sorry, to be precise: for the non-parametric model each update is quadratically expensive, but we have to do it n times, so overall it's cubically expensive. And we did it a final time last Thursday, where we had this example of the permuted MNIST setup. You keep getting new data and you'd like to make sure that your model learns all of the data. And then a question came up: if your model is not extremely over-parameterized, even given all of the data, not just one data set but all of the incoming data, shouldn't it eventually become over-constrained and stiff, and not be able to learn any more new data? And that's true.
In this setting, because this conditional distribution assumes that w is a constant object that does not change across time, if we keep getting more and more data we will just become more and more confident about w. If you actually think that's the case, then that's the correct thing to do. If there's a description of the world that does not change, and I guess all laws of nature would have this property, or at least we would like them to, so assuming that Newtonian mechanics doesn't change, then we could learn Newtonian mechanics in this way. Or if w is the set of cosmological constants, this finite set of things that describe the world, then maybe we could learn them in this way. But if we are thinking of a task where the problem changes as we move, we need something else. No matter whether the model we're going to talk about is some tiny sensor reading somewhere in the airbag of your car, which has to be processed on a chip that costs, I don't know, 50 cents, or whether it's a deep neural network that you have to hand around in your organization and that has to work on a huge data stream that keeps coming in: these are all of the same type, a setting where we have to keep track of something that changes across time. But the thing we would like to keep track of, and see change across time, will nevertheless have to be something finite, something that does not grow as we get more data. Because if it does, we won't have O(n) inference, and the cost of each step will keep growing all the time. Does that make sense?
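To make the idea of a fixed-size object that is handed from one datum to the next concrete, here is a minimal sketch of sequential Gaussian conditioning in weight space, in plain Python. It is not the course's gaussians.py: the `condition` function below, the features [1, x], the prior, the noise level, and the four data points are all hypothetical choices for illustration. Each step is the standard rank-one Gaussian update for a scalar observation y = phi^T w + noise.

```python
def condition(mu, Sigma, phi, y, noise_var):
    """Condition the Gaussian N(mu, Sigma) over weights w on one scalar
    observation y = phi^T w + noise, with noise ~ N(0, noise_var).
    Returns the new mean and covariance; their size never changes."""
    d = len(mu)
    Sphi = [sum(Sigma[i][j] * phi[j] for j in range(d)) for i in range(d)]
    s = sum(phi[i] * Sphi[i] for i in range(d)) + noise_var  # predictive variance of y
    resid = y - sum(phi[i] * mu[i] for i in range(d))        # prediction error
    new_mu = [mu[i] + Sphi[i] * resid / s for i in range(d)]
    new_Sigma = [[Sigma[i][j] - Sphi[i] * Sphi[j] / s for j in range(d)]
                 for i in range(d)]
    return new_mu, new_Sigma


def features(x):
    # Hypothetical feature functions: a constant and a linear term.
    return [1.0, x]


def run(data, noise_var=0.1):
    mu = [0.0, 0.0]                     # prior mean (assumed)
    Sigma = [[1.0, 0.0], [0.0, 1.0]]    # prior covariance (assumed)
    trace = []
    for x, y in data:
        mu, Sigma = condition(mu, Sigma, features(x), y, noise_var)
        trace.append(Sigma[0][0] + Sigma[1][1])  # total remaining variance
    return mu, Sigma, trace


# Made-up data, roughly following y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

mu_fwd, Sigma_fwd, trace_fwd = run(list(zip(xs, ys)))
mu_rev, Sigma_rev, _ = run(list(zip(xs, ys))[::-1])

print("posterior mean:", mu_fwd)
print("trace of covariance after each datum:", trace_fwd)
```

The state (mean and covariance) stays two-dimensional however many data points arrive, and the trace of the covariance shrinks with every datum. That is the "more and more confident about w" behavior: correct if w really is constant, too stiff if the world changes. Processing the data in reverse order gives the same posterior, which reflects the remark that the ordering is arbitrary for a constant w.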
The object that we're going to hand around, we could think of it as some memory, some finite thing that we hand from one time step to the next. It's also associated with the word state: something that gets moved from one time to the next. There are various reasons why the word state shows up. There is a physical reason: physicists use the word state to designate a full description of a system. In physics, the state of a system is a description in terms of variables that fully identify the system, so that you can predict its behavior into the future. That's the Cartesian view of the world, determinism: if you only knew the state of the entire universe at some point in time, you could predict the entire world forward. But there's also a computer-science-y interpretation of the word state. If you've taken the theoretical computer science class, maybe even with me, you've heard about finite automata, which are state machines. They have a finite set of states and a rule for how the system changes from one state to the next, given some input from the world. I'm going to show you the next picture: you could think of something like this, where the world moves on, and at every time step the world provides an observation, and that does something to the state. Your state changes according to some rule. It's just that the change doesn't necessarily have to be deterministic. For finite automata that change is deterministic, but you could imagine a setting where it's, I'm not going to use the word non-deterministic because that has a technical meaning in theoretical computer science, but stochastic, maybe. That leads us to the following idea. We're going to assume that function values that we observe across time are represented by this graph. That's the first way of thinking about this.
That's maybe a way to get into this thought process. Sorry, I'm not going to jump around; I'm just going to say it and wave my hands instead of hopping back to the previous slides. We had this early slide with the different atomic structures of graphs. We saw that there is one type of graph called a chain graph. This is a chain graph. It has the property that the left side of the graph is independent of the right side of the graph when we condition on the variable between them. In particular, in this graph these two variables are independent of this variable when we condition on this variable, or these two variables are independent of this variable when we condition on this variable. And conditioning means that we write down what we know about the variable we condition on: we build a probability distribution over it and then marginalize it out when we predict. Another way of thinking about what this might mean for our Gaussian probability distributions, and this is not something that will take us particularly far, but maybe just for intuition relative to the distributions we've looked at so far, is that we're looking for Gaussian distributions where the inverse covariance matrix, the precision matrix, has lots of zeros on the off-diagonals. Why? Because on a previous slide I showed you that you can think of these zeros off the diagonal as the presence or absence of arrows between variables, beyond their direct neighbors. One reason to anticipate why this might be a good thing to do is that you may have heard, or if you haven't, the fact that this works implies it the other way around, that there are fast algorithms for solving such types of linear problems. In Gaussian inference we're going to need to compute this thing times a vector, and you can imagine that there's an O(n) way of
doing this that involves back substitution, or maybe just multiplying things directly, because there are clearly only O(n) numbers in this matrix, in this case not even quite 2n, just n plus n minus 1 for the off-diagonal. So what we need is conditional independence structure, and models that have this chain structure are also called Markov chains, after this guy, Andrey Andreyevich Markov. Any Russian speakers in the room? Yes. Ah, okay, so maybe it's the Russian Empire. Oh, Kazan, so it's probably somewhere eastern, southeastern. I need to read up on him. He did present the work that all of this is named after to a society for physics and mathematics at Kazan University, which at that point, in 1906, was I think part of the Russian Empire. That's my understanding. The reason I have this understanding is that this text is only available, at least it was for a long time, in this form, which I assume is Russian because I can't read it. We know about it in the West thanks to this guy, Kolmogorov, who read it and mentioned it in the Grundbegriffe der Wahrscheinlichkeitsrechnung in 1933, and also had a second text in which he pointed out that this was a very important result. He relayed the insight of Markov and said that this is something we should really study, because it's a very interesting structure to use. Why is it an interesting structure? He writes here, in the original German: because it's the first kind of relaxation of independence. This is the other way of approaching this problem.
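Returning briefly to the remark about O(n) solves before the historical aside: when the precision matrix is tridiagonal, a linear system with it can be solved by one forward sweep and one back substitution. The following is a minimal sketch of the classic Thomas algorithm in plain Python. The function name and the small example system are assumptions for illustration, and the lecture does not say that this particular routine is what it has in mind. In practice one would likely use a library routine such as `scipy.linalg.solve_banded`.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve A x = rhs for tridiagonal A in O(n) time (Thomas algorithm).

    diag  : the n diagonal entries A[i][i]
    lower : the n-1 entries A[i+1][i] below the diagonal
    upper : the n-1 entries A[i][i+1] above the diagonal
    Assumes A is well conditioned, e.g. symmetric positive definite,
    so that no pivoting is needed.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper-diagonal coefficients
    d = [0.0] * n  # modified right-hand side
    d[0] = rhs[0] / diag[0]
    if n > 1:
        c[0] = upper[0] / diag[0]
    # Forward sweep: eliminate the lower diagonal, one row at a time.
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Hypothetical 4x4 chain precision matrix: 2 on the diagonal, -1 next to it.
diag = [2.0, 2.0, 2.0, 2.0]
off = [-1.0, -1.0, -1.0]
rhs = [1.0, 0.0, 0.0, 1.0]

x = solve_tridiagonal(off, diag, off, rhs)
print(x)  # each entry is approximately 1
```

Both loops touch every row exactly once, so the cost grows linearly in n, compared with the cubic cost of a dense solve.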
So far in our course we've gotten used to everything being dependent on everything, because those are powerful models. We can do this in 2023 because we have powerful computers; we can think about what we would do if everything depends on everything. But in 1933 or 1906 people couldn't invert matrices with a million entries, and they didn't have gigabytes of memory available. What they could think about were things that were completely independent of each other. Then you can just sum up those independent things and keep them completely separate. But Kolmogorov already realizes that this is a little bit too boring: if everything is independent, there's no interaction. And he even writes that independence really is the, well, the concept that gives probability theory its own kind of character, the thing that makes probability theory so complicated. It's not the fact that we compute conditional and marginal distributions. It's not Bayes' theorem. It's not the fact that there's a prior and a likelihood. These are just the natural rules that arise from keeping track of measure. When you treat measures correctly, so that in particular we're not accidentally losing or adding measure as we operate, that's the idea of probability theory, conserving measure during operations, then those rules are just completely natural.
There's no other way to define them. But independence is actually the Achilles heel of this entire process, because it assumes that things happen fully separated from each other, in a world where we can never really be sure that they are fully separated. If you want to look, in your master's thesis or whatever you do afterwards, for one of the deepest questions you could possibly ask, it's probably one of these. The people who work on the foundations of probability theory, there are even some here in Tübingen, for example in Bob Williamson's group, still deal with this fundamental problem of independence. It is also, by the way, inherited by everything related to causality. There's a corresponding problem in causality: it seems weird to assume that the set of causes of some event is somehow finite and restricted to a particular actual set of causes, similarly to how it's a bit dangerous to assume that there is a finite object, a set of variables, such that if we know those variables everything else becomes independent. This also translates into physics, into the questions of what quantum theory actually says and whether there's randomness or not. A lot of those questions boil down to this particular issue of whether there is a finite representation that separates things from each other or not, local or non-local. And so Kolmogorov says: if we want to consider something where things are not quite fully independent, the first thing we could do is this kind of conditional independence structure given a local set of variables, called the state of the system. And that gives natural rise to, ah, yeah, so here actually, I should have shown this slide a bit earlier.
This gives natural rise to the idea of Markov chains. So that was the high-level part of this lecture. Now, before we go into the break, let me set up some notation so that you can stare at it during the break if you like. Actually, should I do that? Maybe it's better to leave you with the philosophy. How about I just leave this up, because it's a much nicer thing to go into the break with. So if you want to think about something big during the break, wonder whether things are ever actually fully independent, whether you think there is a set of variables such that, if you knew them, everything else was just random. We will continue this conversation in a much more mathematical sense at 5 past 11. So, someone pointed out that I managed to mistype the name Kolmogorov on those two slides. I've just fixed it. Sorry about that; that happens when you type the name quickly. It's corrected now. With the big philosophy out of the way, as always we need to dive in and actually do the math. To do that, I want to be explicit that I'm now going to change notation, to bring it in line with the typical notation for these classes of models. So far in this class we've thought in terms of a latent function f in the regression setting, where there is a bunch of x's, inputs, and we get to evaluate the function f(x), and then we see y. This made sense in this axis-coordinate type of space, where there are x's as inputs and y's as outputs. But now we have this third object, called the state, the latent representation. Those used to be the weights, but there was only one set of weights, and weights is maybe the wrong word.
Maybe it's the wrong word So we're going to change the notation And this is notation that comes more from for example the signal processing community the people who deal with sensors Um and maybe also the physicists who work with these dynamical systems or applied mathematicians who do dynamical systems namely We're now going to assume That there is this latent state of the world and we're going to use the variable x to denote that state So x is not an input to a function anymore At least that's not the first way to think about it. It's A representation of the world the latent thing And then there is a set of observations Which we call y Which we make at a sequence of instances of x So we needed another variable to to index Where we are in this one dimensional space Along which things change and because of the natural connection to the notion of time We're going to use the variable t to denote this one dimensional space So there is now a chain of x's at time t Of which we make a somehow corrupted observation called y at time t And because we assume that there are that we only measure at discrete intervals with at least for today A constant step between the different intervals We could also index the time by t1 all the way through tn And sometimes i'm just going to drop the t and just think of x i from one to n or sometimes also Yeah, one to capital t So there is usually some kind of confusion about whether the index is over time Or over natural numbers For today that won't actually matter so much because i'm just going to assume that the time step between them is constant And then it's just an integer index and on Thursday in artana I will have to deal with what happens if that distance between time steps varies So the question is about about this h song and Here this is still the row about regression right so in regression We assume for full generality that we're making linear observations affine projections of the function And then add some gaussian noise. 
So the little curly "approximately equals" sign here is supposed to mean that there is some noise involved, some probability distribution, typically Gaussian noise. In regression we typically assume that we just evaluate f at some point, but more generally we could evaluate any linear projection of f, because then the framework still works. For example, you could evaluate the gradient of f, or even integrals, or just subsets of it. Here, we are typically going to assume that we make linear observations of x, and here this is maybe a little more important. I'm using the variable H for this. If we actually see all of the state x, that's a boring base case; an interesting situation arises if we only get to see certain parts of x. Then y has lower dimensionality than x, and H is a rectangular matrix.
By the way, not only are x, y and t the canonical names for these variables; everything I'm going to show you for the rest of this lecture is super standardized notation. So the fact that I use H here is not a random decision of mine, it's an absolutely standard thing, and the reason is that it comes from engineers. Signal processing is an engineering discipline, and engineers are very precise with notation so that they can nail it down once and never have to think about the math again; they just use the notation. So if I use the variables A, H, Q and R, a signal processing engineer will immediately know what I'm talking about. I'll mention this a few more times.
So now we're going to make the abstract observation that if we assume that our joint distribution over the latent variables has this form, then things get easy. First of all, what does this form mean?
So this is the simple math representation of this chain graph. On previous slides I had this graph with circles and just one set of arrows going through. That graph corresponds to this expression, at least for the relationships among the x's; the y's I haven't written down yet. Why? Because it means that when we want to predict the i-th state given all the previous ones, that is equal to the i-th state given just the previous one. What does this mean? It means conditional independence: x_i is independent of x_0, x_1, x_2 and so on, when conditioned on its direct predecessor x_{i-1}. Yes?
No, we don't want to learn H. We want to learn x. This is exactly the point: the thing we care about now is x. In regression we were given x because it was the input to a function. Now the input to the function is t, time. Before, we wanted to learn a function, which we could have represented by a finite set of weights; but now there is a changing set of things that affect the world, and we call those x. I know this is confusing. The problem is, if I didn't do it, it would be even more confusing, because the entire notation in this field is in this form. If you pick up any textbook, you're going to see this notation, so we just have to make this switch now. So x is the state; it's the thing we care about, the thing we would like to learn.
When a joint distribution has this property, we call it the Markov property, because it means that things become independent when you know the latent state. And we're going to observe that when we have this structure, there is a class of algorithms that are O(n), which allow inference in linear time. There's a bit of a subtlety to it. If you want to go through the data set and observe one datum after the other, always keeping a local estimate of what the state x is, then that can be done in a single pass through the data, from left to right if you like, and that's called filtering. It's just called filtering; don't ask me why. Now there's a complication: once we have gone through the whole data set and are at the end, whatever we want to call the end, the later observations also tell us something about what previous state values could have been. They are not completely useless. If we want to make a statement afterwards about previous states that is consistent with our current estimate of the last state, then we have to do a final pass back through the data set, all the way to the front, but only once, and then we're done. That backward pass through the data is called smoothing. I say forward and backward because this algorithm is sometimes also called the forward-backward algorithm. But this is not related to autodiff and to backward passes in deep learning, well, at least not superficially. So you can switch off your backward pass and backprop ideas for a moment. This is message passing. If you really care about it: yes, there is a relationship, and they are not completely separate concepts, but this is not the time to think about it.
For that, I now need you to bear with me for three slides in which we're going to do some nasty math, so nasty that I had to reduce the font size. Up here you see our graph. This is the assumption we make about the generative model of what's going on. We are moving through time from left to right, and at every time step we potentially make an observation called y. We would like to know what the latent state of the world x is. For example: what are the weights of my neural network that I currently need to use to predict? Where is the car relative to the outside world? What is the kinematic state of the robot that I'm trying to control? And so on.
We make two assumptions. The important one is the Markov property: the state at time t is independent of all the previous states when conditioned on the immediate predecessor. This notation here is slicing notation; it means all the x's from zero all the way to t-1, just like in Python. I'm going to use capital X for the collection of all the x's through time, indexed in this slicing notation from 0 to t-1, where, maybe annoyingly, I'm using t-1 to mean up to and including t-1. In Python it would be up to t-2 (the slice 0:t-1 stops at index t-2). I actually always thought that's a dangerous choice to make in the notation; it seems more natural to say the endpoints are included. The second assumption is the conditional independence of the observations, which you can see in the graph. The fact that there are no arrows pointing from all the x's to all the y's means that we assume the observations are local as well: they are local observations of the state. This is a less novel assumption; we've always made it in previous applications as well. It just says the likelihood factorizes.
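In symbols, the two assumptions can be sketched as follows. This is a reconstruction from the spoken description, not a copy of the slide; the capital Y for the collection of observations and the final index T are assumed by analogy with the capital X used in the lecture, and the slices include their endpoints.

```latex
\begin{align}
  p(x_t \mid X_{0:t-1}) &= p(x_t \mid x_{t-1})
    && \text{(Markov property)} \\
  p(Y_{0:T} \mid X_{0:T}) &= \prod_{t=0}^{T} p(y_t \mid x_t)
    && \text{(local observations: the likelihood factorizes)}
\end{align}
```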
So now, if these two are true (they are like the axioms of our model), we're going to see where they get us, and that involves these three lines of annoying math. The key thing here is that we're not going to make any further assumptions about what p actually is. We're just saying there is a probability distribution that has this property: it factorizes. The x's could be vectors of real numbers, they could be discrete-valued variables, anything. It's just a probability distribution over variables x and y.
Now assume the following setting. We have this data set in front of us, the world with its time series, and we keep getting observations. Our task for the moment is to predict the state at time t given everything we've seen so far, up to the previous observation y_{t-1}. We have not made our observation at time t yet, but we're currently at t; we've seen everything up to t-1. So what we're interested in is this posterior distribution, a conditional distribution for x at time t given all the previous y's.
First, we write this down as an instance of Bayes' theorem. Remember that Bayes' theorem is the joint probability of everything here, x_t and the y's all the way to t-1, divided by the marginal probability of the observations, of everything on the right-hand side here. But we're not just going to do that. We're going to expand even further by adding in all the previous x's as well, and the later x's: all of them. We can do that using the product rule, or actually, sorry, the sum rule of probability theory. We sort of do it in reverse: we add variables and then say we could have integrated them out to get this distribution over x_t given all the previous y's. So this up here is an expression for p(x_t) times p of all the y's given x_t, and then we extend it with all the other x's as well. And why do we need to do that?
Well, because all the other x's are needed to explain what the y's are, right? We need to put them in. Okay, I've already lost many of you, so let me write it out. We need p(x_t | y_{0:t-1}). You could first write that as p(x_t) times p(y_{0:t-1} | x_t), divided by p(y_{0:t-1}), right? In particular, this is also p of all the x's, including x_t, times p(y_{0:t-1} | all the x's), where we integrate out all the x_j with j not equal to t, divided by p(y). And p(y) we can expand in the same way: it's the integral over p(y | x) times p(x), where the y again runs from 0 to t-1, and this time the integral is over all the x's, not just those with j not equal to t, because they all get integrated out. Okay, that's the first step.
The next step is to explicitly write out the joint distribution over all the y's and all the x's. Let's first think about this term, p of all the x's. For a general probability distribution, we could write the joint as p(x_0) times p(x_1 | x_0) times p(x_2 | x_1, x_0) times p(x_3 | x_2, x_1, x_0), and so on. But here, because we have this chain graph, we can drop all the earlier ones. That's where we use our first axiom. So we get p(x_0) times a product over all times j up to t of p(x_j | x_{j-1}), and then all the later times as well. They are back here, in green if you can see it: for all times j larger than t, p(x_j | x_{j-1}). It's just a reordering of these terms to move them to the right place in the product. It's a product, so it's commutative.
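Written out, the factorization of the state prior described here reads as follows. This is a reconstruction from the spoken description, since the slide is not part of the transcript; the final time index T is an assumed symbol.

```latex
\begin{equation}
  p(x_{0:T})
  = p(x_0)\,
    \underbrace{\prod_{j=1}^{t} p(x_j \mid x_{j-1})}_{\text{times up to } t}
    \;\cdot\;
    \underbrace{\prod_{j=t+1}^{T} p(x_j \mid x_{j-1})}_{\text{later times (green)}}
\end{equation}
```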
We just move things around. Same thing for the y's: we have this assumption that the y's are conditionally independent given their parents x_t. So this term, all the y's given x, factorizes into a bunch of terms, p(y_0 | x_0) and then, in the same product as before, all of the p(y_j | x_j), each given just the local x_j and not the other x's. And there is no y in the future yet, so back here there are no y terms; we haven't seen any future y's. We do the same thing in the denominator, because it's the same expression; the integral is just over a different set of variables. So we can literally copy-paste from above and just keep in mind that we're integrating out one more variable.
Next step. We realize first that these terms back here, these green integrals, are each over an x_j that doesn't show up anywhere else; this entire front part does not contain a j larger than t. So each is just an integral over a probability distribution, and an integral over a probability distribution is just one. So there are lots and lots of ones back here, both in the numerator and in the denominator, and we can forget about them. The green stuff is gone on the next slide.
Now let's look at the front part. Here we realize that a lot of things are the same in the top and the bottom. There's only one difference, and that has to do with this "j not equal to t" bit. At the point where j equals t, you have this extra factor, here it is. At the bottom it gets integrated out, and it can be integrated out because there is no y_t yet, so it's just a probability distribution and integrates to one. But up here we can't do this yet, because x_t is not actually integrated over, right?
It's not in the integration set. So we have to keep this factor separate and move it to the front. Then we are left with an integral over all j less than t, and down here as well. It's just that at the top there is this one extra factor with an x_{t-1} inside. It's a function of x_{t-1} as well, and therefore we can't just do the integral directly. But for all the other terms, top and bottom are the same, and we can think about what they actually mean. Up here, this is effectively p(x_{t-1}) times p of all the y's up to t-1 given x_{t-1}, divided by p of all the y's up to t-1, because we integrate out all the x's. Down here I've left it explicit once, but it's really just the joint distribution over everything that isn't x_t, integrated over all the x's, so the y's are left. So this is an instance of Bayes' theorem as well: p(y_{0:t-1} | x_{t-1}) times p(x_{t-1}), divided by p(y_{0:t-1}). It's the previous posterior, p(x_{t-1} | y_{0:t-1}). And then there's a single integral left that we can't get rid of, at least not yet, that we have to keep around and actually think about, which is this local integral.
So what this entire derivation tells us is that we can build a recursive inference structure: to compute the predictive distribution for x_t given all of the previous data, we first do inference on x_{t-1} from all the previous data, and then do whatever this integral requires us to do. That's a problem we haven't solved yet, but it's a much smaller integral: a single integral over the latent state at the previous time. And because we assume that this latent state is somehow finite, something we carry around, maybe this is something we can actually do explicitly.
There's another way to think about this. I could have waved my hands around and said: there's this graph up there, and if I give you all of the previous y's, then the only way they interact with x_t is through the previous state x_{t-1}, so I could first operate on that and compute this thing. By the way, I have a slide for this. You could have just said that the joint distribution of x_t and x_{t-1} given all the previous y's factorizes in this way: we take x_{t-1} over here, so this is the product rule. That works because of the graph, because of independence, but it's a bit hand-wavy. What I showed you on the previous slide is the explicit derivation that the graph is a correct representation of this kind of behavior.
This gives us exactly the object we care about, and this equation is called the Chapman-Kolmogorov equation. It is another instance of a West-Eastern-Divan-type Cold War situation where two people come up with similar ideas on different sides of the Iron Curtain. So this is the first part of what is going to become our general-purpose inference scheme for time series: if you're currently at time t and have dealt with all the data up to t-1, then to get ready to observe y_t you need to solve this equation, which is a local computation involving what you currently know about the local state. That's why it's so important: it represents this idea of a local state. It says you can deal with the entire past by only keeping track of the predecessor state.
Now comes the next step, and the next step is actually much easier. It's: oh, I need to make an observation.
Let's observe y_t. So the change from the previous slide is that the observations now go all the way up to t: this is y up to t-1, plus y_t. Well, we just do Bayesian inference. It's a local observation, and local stuff is typically tractable. From the previous slide we have what you could call our local prior for x_t given all the previous observations. Now we just multiply with the likelihood for the local observation. Because of this assumption, we can get rid of all the other x's and directly write this likelihood; we've basically already integrated everything else out. Normalize, and that's just Bayesian inference. There you go. I'm not going to write down more than that; you just have to do it. But it only involves an integral over the local x_t, so hopefully we're able to do that.
That means we now have two out of three important steps. We're able to start with the data, move all the way to time t, deal with the past, and make a local observation at time t. And now we could actually run this thing forward through time. We could keep doing this: make a prediction (Chapman-Kolmogorov), make an observation (Bayes' theorem), make a prediction (Chapman-Kolmogorov), make an observation (Bayes' theorem). Those two steps only involve local computations, an integral with respect to dx_t or dx_{t-1}. Those are integrals that we might hope to be able to do, and we could do that all the way to the end.
Now let's say at some point our time series actually ends and we're done. Or we just call it a day: the time series continues, but we say that's enough for now. There's a bit of a complication, which is that the things we've computed so far are always just x_t given the previous data. If we had kept those in memory, we would have needed memory that grows linearly, but we could have done that; we just keep storing everything we've computed so far. Then those stored probability distributions, these p(x_t | y_{0:t}), are now sort of deprecated, kind of outdated. Why? Because they do not take into account that later on we got to see more data. To make them consistent with the later data, we actually want to compute this thing on the top left, p(x_t) given all the y's, not just the ones up to t. For that we now do this bit, which I'll go through a little faster, and which will turn out to be a process where we go from the end of the data stream backwards through time to correct the predictions we've made in light of later experience.
It works like this. We first introduce another variable using the sum rule. Why do we do that? Because we have this intuition that the only way later data provides information to us is through the immediate successor in the chain. So let's put that in and see if it helps. Now we use the product rule: we write this distribution, which is now also a distribution over x_{t+1}, as a conditional distribution times a marginal distribution. I've colored this in... sorry, this is blue and this is green, can you see that, just barely? This projector is not so great. This blue bit is something we now need to look at for a while. So what does this mean?
This is what we learn about x_t when conditioning on x_{t+1} and all the y's. We can probably imagine what it's going to be; you can already see it if you peek ahead down here. By conditioning on the successor state x_{t+1}, we can get rid of all the future observations in this term. The only way information flows to x_t is through x_{t+1}, and x_{t+1} will represent everything we've learned from the later data. We do this in this row. We take the blue bit, write it here, and expand it with Bayes' theorem again. So we write it as a sort of likelihood for all the future y's given x_{t+1}, all the earlier y's and x_t, times a prior, with a normalization constant below. Then we use the factorization property to realize that the future y's are independent of x_t when conditioned on x_{t+1}. Stare at this for a while to realize that those bits cancel out, and we're left with something that only depends on the data up to t and on x_{t+1}, not on the future data. This blue bit we can plug in here, so we can simplify this term and just drop all the future y's. I've copied it here; it's still blue because all the blue bits are the exact same object.
By now it becomes a bit more mechanical. We look at this object and realize that we can write it as a conditional distribution, an instance of Bayes' theorem if you like, or the product rule if you like, rearranged again into a likelihood and what you might call a prior. And notice again conditional independence.
So the future state x_{t+1} is independent of the past y's given x_t, and we can drop all the y's in this bit. We're left with this expression, which we can now plug in up here, because the blue thing is equal to it. If you plug it in, you get this, which tells us that p(x_t) given all the y's is the thing we computed in the previous step, the red bit, which is what we got from moving left to right through the data, times a correction. The correction involves how you would predict x_{t+1} from x_t, and a ratio between what we know about x_{t+1} after having seen all the data and what we predicted previously from the earlier data. Something like this is sometimes also called a cavity distribution. The numerator is the full thing that we computed in the previous step coming backwards. So as the algorithm runs, we are computing p(x_{t+1}) given all of the y's; that's the green bit. And we have stored what we previously predicted, p(x_{t+1} | y_{0:t}). We divide those two distributions by each other (inside this integral; you have to think about what it means to divide one function by another) and multiply with this relationship between the two states. This step is called smoothing.
These three together give us a general-purpose algorithm for dealing with time series. If you have an infinite time series in front of you, you break it up into a recursion, an iterative procedure: first, somehow compute the posterior distribution over x_{t-1} given all the data up to t-1; then locally predict x_t, which is the prediction step; then observe the local y_t and update with Bayes' theorem.
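For reference, the three steps derived so far can be collected in one place. This is a sketch in the inclusive slicing notation used in the lecture; the capital Y for collections of observations and the final index T are assumed symbols, and no particular form of p is implied.

```latex
\begin{align}
  \text{Predict:}\quad
  p(x_t \mid Y_{0:t-1})
    &= \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Y_{0:t-1})\, \mathrm{d}x_{t-1} \\
  \text{Update:}\quad
  p(x_t \mid Y_{0:t})
    &= \frac{p(y_t \mid x_t)\, p(x_t \mid Y_{0:t-1})}
            {\int p(y_t \mid x_t)\, p(x_t \mid Y_{0:t-1})\, \mathrm{d}x_t} \\
  \text{Smooth:}\quad
  p(x_t \mid Y_{0:T})
    &= p(x_t \mid Y_{0:t}) \int p(x_{t+1} \mid x_t)\,
       \frac{p(x_{t+1} \mid Y_{0:T})}{p(x_{t+1} \mid Y_{0:t})}\, \mathrm{d}x_{t+1}
\end{align}
```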
Those two together are called filtering. And if at some point in the future you would like the full posterior that takes the later data into account as well, because you need to go back in time, then you do an extra step called smoothing. It's an extra step because it involves a different kind of data structure. The first two have constant memory requirements: you just keep around your local p(x_t | y_{0:t-1}) or p(x_t | y_{0:t}), and you never have to grow anything. It's constant resource allocation. But if you want to be able to smooth, you need to keep around these objects for every time step (and actually, yes, they are the same ones), and that means linearly growing memory requirements. There are some applications where this is useful, in simulation for example.
Now, there are various ways of representing this. I have basically made the same slide again in gray; it just says what I just said. There are these three steps. You could also think of this as a pseudocode algorithm if you like: moving from left to right through the data is called filtering, and from right to left is called smoothing. There's not much point in me talking more about this. What we need to do in the last ten minutes is observe that everything I've done so far just involves "p of something", and I haven't yet said what the p's are. In general, this will involve an integral, and integrals are hard. So you can imagine that there are probably just two cases in which this is going to be tractable. The first one is if the x's are discrete: if x only takes one out of K values, if it's either one, two, three, and so on up to K, right?
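For this discrete case, the integrals in the prediction and update steps become finite sums, so both steps reduce to array operations. The following is a minimal sketch in Python with NumPy, assuming a K-by-K transition table and a table of observation probabilities; the names `discrete_filter`, `trans` and `emit` are illustrative and do not come from the lecture.

```python
import numpy as np

def discrete_filter(p0, trans, emit, observations):
    """Forward filtering for a chain with K discrete states.

    p0:           (K,) initial distribution p(x_0)
    trans:        (K, K) array, trans[i, j] = p(x_t = j | x_{t-1} = i)
    emit:         (K, M) array, emit[j, y] = p(y_t = y | x_t = j)
    observations: sequence of integer observations y_0, ..., y_T
    Returns an array whose row t is p(x_t | y_0, ..., y_t).
    """
    belief = np.asarray(p0, dtype=float)
    filtered = []
    for t, y in enumerate(observations):
        if t > 0:
            # Prediction step: sum over the previous state (Chapman-Kolmogorov).
            belief = trans.T @ belief
        # Update step: multiply by the local likelihood and normalize (Bayes).
        belief = emit[:, y] * belief
        belief = belief / belief.sum()
        filtered.append(belief)
    return np.array(filtered)
```

Each step costs O(K^2) regardless of how many observations came before, which is the constant-cost-per-step property the derivation promises.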
That's sometimes called a hidden Markov model, and it corresponds to a large degree to a finite automaton. You're in a finite state, you get to observe something that's like an input from the world, and then your state changes in response. Interesting, but maybe also only finitely interesting somehow. And what is the other setting going to be in which things are tractable, if you have a continuous-valued x? Yes: Gaussians. Typically, the only other case with tractable computation is if all of the p's are Gaussian distributions. But then we need something else. What was the trick that makes everything tractable? It's not just Gaussian, but Gaussian and... linearly related, exactly.
So to get to an algorithm that we can actually write in code, and that you can write in your homework this week, we need to make the assumption that the model is what's called linear Gaussian. There are various names for it; what's also often said is linear time-invariant. In these models, the two axioms, this one and this one, are not just written in abstract p notation but are explicitly given a linear Gaussian form. Linear in the sense that the relationship between x_{i+1} and x_i is linear: there's a matrix, which is always called A, that is applied to x_i, and then we add Gaussian noise with covariance Q. Q is another one of these variables with a fixed name. And then we make observations y (just a moment) that are Gaussian around x_i with a linear map called H applied to it, and Gaussian noise with covariance R added. If we assume that A, Q, H and R are a fixed set of matrices that never change, just four matrices to provide, then the system is called linear time-invariant, because A, Q, H and R do not change through time. So, LTI systems; you may have heard of LTI systems before in some other place.
The question is whether time invariance is the same as being stationary in x. It's a bit more complicated. Kind of, but I would need another five minutes to explain that.
If that's the case, and here I'm going to move quite quickly because you can look at the slides afterwards, you can plug the Gaussian expressions into the equations from the previous slide. This first row is Chapman-Kolmogorov, the "taking the step" part. We have learned everything up to t-1, and now we want to predict at time t. We look at the previous slide and see that we have to do this integral, and we plug in the two Gaussian distributions from the definition up here. By the way, of course there's also an initial value; we have to initialize somehow, so the initial distribution is also Gaussian. So we plug this in, and then we stare at it, and then you go back and say: okay, now I have to open up lecture number five, where we first got to see Gaussian distributions (or was it six?). There's going to be some property of Gaussians there: the product of two Gaussian distributions is another Gaussian distribution, and this is effectively an instance of that, or of the rule for linear observations of Gaussian random variables. So you look it up, you find the right equation, and you find that you can write the update like this. Sorry, that's not called the update. It's called the prediction step.
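In symbols, the linear Gaussian model and the prediction step it produces are usually written as follows. This is the standard textbook form rather than a copy of the slide: m_0 and P_0 are assumed names for the initial mean and covariance, m_{t-1} and P_{t-1} denote the mean and covariance of the filtering distribution at the previous step, and a superscript minus marks quantities before the observation at time t.

```latex
\begin{align}
  p(x_0) &= \mathcal{N}(x_0;\, m_0,\, P_0) \\
  p(x_t \mid x_{t-1}) &= \mathcal{N}(x_t;\, A x_{t-1},\, Q) \\
  p(y_t \mid x_t) &= \mathcal{N}(y_t;\, H x_t,\, R) \\[4pt]
  p(x_t \mid Y_{0:t-1}) &= \mathcal{N}(x_t;\, m_t^-,\, P_t^-),
  \qquad m_t^- = A\, m_{t-1},
  \qquad P_t^- = A\, P_{t-1} A^\top + Q
\end{align}
```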
So we have now a prediction for Uh The state at time t given all the previous data That is a gaussian distribution and it has a mean and a variance and those are typically given a name They are called the prediction Mean and covariance the prediction distribution and they are usually written with an m for mean And a minus up there for before the observation And then a p It's just a p. It's that's the notation in this community Don't ask me why It's not a precision. It's a covariance. So it's not the inverse of a covariance Now we need the update step. Well, the update step is actually really straightforward because it's just gaussian inference We have a local local gaussian distribution over x t We make a local observation Of x t in the form of y t by taking a linear map of x t observing y t with noise r And then you look up on the slides at this big slide with all the properties of gaussian distributions What the update is it's this so it involves um a matrix k which Has of course an inverse of a matrix in there because there will be matrix inverses when we do gaussian inference And you can see that this maybe looks a little bit like those grand matrices that you've seen for gaussian process inference And if that's if you if it seems like that that's because it is it's exactly the same kind of algorithm It's just that it the names of the variables tend to be chosen a bit differently So people use to sort of another like they kind of slice the Equation in a different way to give it names And then this k is called the gain and there is a reason why it's called k Because it's the kalman gain after root of kalman hungarian mathematician um Yeah, and it's just patient inference in gaussian models And then finally we need to do this smoothing step backwards Which is associated not with kalman because kalman was a signal processing person and he would never have gone back in time He needed something to works all the time constantly So some other people got to do the 
going back in time thing They are called rauch and tung and strebel and that's pretty much their entire claim to fame except for rauch actually He had a few other things And it's again just gaussian algebra So you can stare at this after the lecture. I don't have to do it now But there is another integral over the state that involves a ratio between gaussians times the gaussian And you can convince yourself that means we need to update the the estimation mean and estimation covariance Using another thing that involves a matrix inverse that is called the rauch tung strebel smooter So another update so in short There's two algorithms that we need to implement That are actually things you can do on a computer rather than abstract mathematical objects They are called the kalman filter And they look like this So it's a for loop that starts at time zero And then all the time just move through Makes a prediction step. So to predict we compute m and p minus Then we observe why we compute these sufficient statistics called the residual The innovation covariance and the kalman gain And then update the mean and the covariance to get something without a minus That's called the estimation mean and the estimation covariance And you can also see maybe why k is so important because it just shows up here And then at the end we just return all the m's and p's that we have computed so far And if we don't want to store those we just throw them away and then the algorithm is constant time cost for each step And constant memory for each step And therefore all of n If we want to do smoothing afterwards, we shouldn't throw them away because then we can't do smoothing But if you keep them around then the memory cost is also linear in time And we can go back afterwards through this for loop And do these updates and compute the two Um Quantities that Uh define the so-called smooth Gaussian estimate, which is another mean and another covariance called the smoothing mean And the smoothing 
covariance. And why do I show all of this to you? Because this is what you'll be doing in the homework: the basic exercise in this week's homework is to implement this algorithm as a for loop in Python, using arrays and actual linear algebra.
So, I'll stop here. Markov chains are the archetype, the algebraic structure that we need to build algorithms that run with constant computational cost per time step, so that the cost of doing inference is linear in the length of the time series. This is a fundamental property. It's not just for Kalman filtering or signal processing; it's a fundamental property of systems that have finite state, and it has deep philosophical implications and connections to physics, to finite automata, and to many other types of interesting models. And they give rise to this algorithmic structure called filtering and smoothing. If we want to implement those in practice, we need to make further assumptions about what the p's, the probability distributions, actually are. If you assume that they are discrete, then it's just an array operation. And if you assume that they are Gaussian and the relationships are linear, we arrive at an algorithm called the Kalman filter, and the Rauch-Tung-Striebel smoother for the backward algorithm.
And you can implement those methods. You can use this framework for tiny little things, with five states or seven states or 42 states, and build them into tiny embedded systems that run in cars or in vacuum robots. Or you can use them on massive sets of weights of large language models across time, as we saw last week. And because everything can be made into a Gaussian distribution if you really want to, you can use the same framework for all of these problems. That's why it's a powerful notion to have in your toolbox.
And on Thursday... So, there are lots of books.
This is the most recent one that might be fun to read; there's a new version of it, actually. And on Thursday Nataniel will be here to tell you a little bit more about the deep underlying mathematics of these models, and how you actually implement them for real-world problems if things are not quite linear and not quite Gaussian. Thank you.
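For readers following the transcript without the slides, the two for loops described in the lecture can be sketched in NumPy roughly as follows. This is a minimal sketch under stated assumptions, not the lecture's reference code: the names `A` (transition matrix), `Q` (process noise covariance), and `H` (observation matrix) are assumptions that do not appear in the spoken text, as is the convention that `m0`, `P0` describe the state before the first prediction step. Only `m`, `P`, `K`, and `R` follow the naming used in the lecture.

```python
import numpy as np


def kalman_filter(ys, A, Q, H, R, m0, P0):
    """Forward pass over observations ys of shape (T, d_obs)."""
    m, P = m0, P0
    ms, Ps, ms_pred, Ps_pred = [], [], [], []
    for y in ys:
        # Predict: prediction mean and covariance (the "minus" quantities).
        m_minus = A @ m
        P_minus = A @ P @ A.T + Q
        # Update: residual, innovation covariance, Kalman gain.
        z = y - H @ m_minus
        S = H @ P_minus @ H.T + R
        # K = P_minus H^T S^{-1}, computed with a solve (S is symmetric).
        K = np.linalg.solve(S, H @ P_minus).T
        # Estimation mean and covariance (no minus).
        m = m_minus + K @ z
        P = P_minus - K @ S @ K.T
        ms.append(m)
        Ps.append(P)
        ms_pred.append(m_minus)
        Ps_pred.append(P_minus)
    # Keeping everything costs O(T) memory, but the smoother needs it.
    return np.array(ms), np.array(Ps), np.array(ms_pred), np.array(Ps_pred)


def rts_smoother(ms, Ps, ms_pred, Ps_pred, A):
    """Backward pass (Rauch-Tung-Striebel) over stored filter output."""
    ms_s, Ps_s = ms.copy(), Ps.copy()  # last smoothed step equals last filtered step
    for t in range(len(ms) - 2, -1, -1):
        # Smoother gain G = P_t A^T (P_{t+1}^-)^{-1}.
        G = np.linalg.solve(Ps_pred[t + 1], A @ Ps[t]).T
        ms_s[t] = ms[t] + G @ (ms_s[t + 1] - ms_pred[t + 1])
        Ps_s[t] = Ps[t] + G @ (Ps_s[t + 1] - Ps_pred[t + 1]) @ G.T
    return ms_s, Ps_s


# Hypothetical toy problem: position/velocity state, noisy position readings.
rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.25]])
ys = np.cumsum(rng.normal(size=(50, 1)), axis=0)

ms, Ps, ms_pred, Ps_pred = kalman_filter(ys, A, Q, H, R, np.zeros(2), np.eye(2))
ms_s, Ps_s = rts_smoother(ms, Ps, ms_pred, Ps_pred, A)
```

The gains are computed with `np.linalg.solve` rather than an explicit inverse, which is a common numerical choice and not something the lecture prescribes. If only the latest estimate is needed, the lists can be dropped and the filter runs with constant memory, as described above.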