Hello and welcome to Probabilistic Machine Learning, lecture number 17. We're already well into the second half of this lecture course, and we've amassed quite a lot of results so far. We started the course with the foundational observation that probability theory provides an extension of propositional logic to statements that carry uncertainty, and that probabilistic inference allows reasoning under uncertainty. In lecture two we already noticed that a downside of this empowerment is that it comes at a potentially significant computational price: probabilistic inference can be combinatorially hard in the number of hypotheses to consider, because we have to keep track of all possible explanations for an observation at the same time. We also saw in lecture two that conditional independence is a crucial tool for reducing this computational complexity, by separating certain parts of the inference from each other. Then there was a long phase in the lecture where conditional independence took a bit of a backseat, because we focused on a specific choice of probability distributions under which even full inference over a joint set of variables has polynomial cost: Gaussian probability distributions. We saw lots of connections and ideas for using Gaussian distributions to create machine learning algorithms, and in doing so actually covered a large part of the range of existing machine learning methods. Now, in the last two lectures, we began to move away from the purely Gaussian framework.
We first encountered another set of probability distributions, a generalization to the notion of an exponential family, under which inference remains tractable in some sense. Then, in the last lecture, we returned to the notion of conditional independence and encountered two different classes of so-called graphical models, which allow a sort of manual design process for thinking about conditional independence when building a machine learning algorithm. Today we will think a little bit more about these. We will remind ourselves of their weaknesses as formal tools, and then try to extend them from something that is just a design tool on a whiteboard into a framework that can be implemented in a more automated fashion, to empower machines to take efficient automated decisions about inference for us. So let's quickly remember what these two families of graphical models are. The first are directed graphical models, also known as Bayesian networks. These are constructed directly from a generative factorization: you take a factorization of a joint probability distribution into conditional probability distributions and then read off the structure of the graph from these conditional distributions by drawing directed edges, where the direction of an edge tells us whether a variable shows up on the right-hand or the left-hand side of a conditional probability distribution. We saw that this framework is powerful in the sense that you can directly construct such graphs, but it has a bit of a weakness, maybe not a particularly severe one: reading off conditional independence structure from such graphs is not entirely straightforward. It requires a somewhat complicated process involving the notion of d-separation to think about how variables become conditionally independent of each other given some parts of the graph.
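The construction just described, reading directed edges off a factorization, can be sketched in a few lines of Python. This is only an illustration under assumed names: the dictionary `factorization`, the helper `directed_edges`, and the example distribution p(a, b, c) = p(a) p(b) p(c | a, b) are hypothetical and not taken from the lecture.

```python
# Hypothetical example factorization: p(a, b, c) = p(a) p(b) p(c | a, b).
# Each key is the variable on the left of the bar; its list holds the
# variables on the right of the bar (its parents in the graph).
factorization = {"a": [], "b": [], "c": ["a", "b"]}


def directed_edges(parents_of):
    """Return one directed edge (parent, child) per conditioning variable."""
    return sorted(
        (parent, child)
        for child, parents in parents_of.items()
        for parent in parents
    )


edges = directed_edges(factorization)
print(edges)
```

Terms without conditioning variables, such as p(a) and p(b) here, contribute a node but no edge.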
As an alternative, we discussed the notion of Markov random fields, or undirected graphical models. These are also graphs with edges, but here the edges don't have directions. By definition, these graphs make it easier to read off conditional independence structure, because the graph is constructed directly such that you can read off conditional independence. However, the price we pay for this, as we saw in the last lecture, is that reading off the joint probability distribution from such a graph can be very hard indeed and potentially requires complex computations. So what I want to do now is to think a little bit more about the relationship between these two families of graphical models, mostly to motivate why we actually need a third and somewhat more extended formal language that goes a little bit beyond what you can do with just a visual presentation as a graph. For that, let's first think about how we would go from one of these families, directed graphs, to the other one, undirected graphs, and see what mechanical steps we have to go through in this process. Let's start with a particularly simple graph, one that we've already encountered several times in this lecture course, for example at the beginning, when we spoke about Markov chain Monte Carlo methods, and then again when we spoke about Gauss-Markov processes or time series models. This is a directed acyclic graph that has the structure of a chain, and we have already started calling this a Markov chain, even though of course this is not a Markov random field; it's a directed graph. How would we turn this into a Markov random field? Well, here is the factorization that you can read off from this graph, and this is sort of a direct map from one to the other. Now what would we do to turn this into an undirected graph?
Well, we need to think about what the potential terms will be. They are the individual factors in this factorization: we just drop the vertical bars, which impose an order between the variables, and replace them with a comma to think of individual factors. Let's do that. The first factor we can directly absorb into the second one, because otherwise it doesn't really help us with the graph; it doesn't create an edge. So we're going to have an unnormalized distribution that contains a factor over variables one and two, then one over variables two and three, and so on, all the way up to n minus one and n. This we can now directly turn into an undirected graph. You can basically read off the undirected graph from this factorization: the factors define the cliques of our graph, that is, the sets of variables that we have to fully connect. Here that's trivial, because the cliques are just pairs, so to fully connect them we just have to draw an individual edge between the two variables: one from one to two, one from two to three, and so on, all the way to the end. Clearly this is again just a chain; we just dropped the directions of the arrows. So that's why this is called a Markov chain, because it corresponds to a Markov random field that is a chain.
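As a hedged numerical check of this conversion, the following sketch builds a small chain with discrete states, absorbs the first factor p(x1) into the first pairwise potential, and confirms that the product of pairwise potentials reproduces the directed joint. The sizes `K` and `n`, the random tables, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 4  # assumed sizes: 4 variables with 3 states each


def random_conditional(shape):
    """Random non-negative table, normalized over its last axis."""
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)


p1 = random_conditional((K,))  # p(x1)
# trans[i][a, b] = p(x_{i+2} = b | x_{i+1} = a)
trans = [random_conditional((K, K)) for _ in range(n - 1)]

# Directed chain: p(x1) p(x2 | x1) p(x3 | x2) p(x4 | x3)
joint_directed = np.einsum("a,ab,bc,cd->abcd", p1, *trans)

# Undirected chain: drop the bars and absorb p(x1) into the first
# pairwise potential, leaving one potential per edge of the chain.
potentials = [p1[:, None] * trans[0]] + trans[1:]
joint_undirected = np.einsum("ab,bc,cd->abcd", *potentials)

Z = joint_undirected.sum()  # normalization constant of the potentials
print(np.allclose(joint_directed, joint_undirected / Z))
```

Because the potentials here are built from normalized conditionals, the normalization constant `Z` happens to equal one; for general potentials of a Markov random field it would have to be computed explicitly.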
So that seems easy, right? It almost suggests that to get from a directed to an undirected graph we might be able to just drop the directions of the arrows. However, this is not true in more complicated settings, where a node has several parents. Think of this directed graph, which is an example I've taken from Chris Bishop's book, but it's a basic example that you can construct for yourself. Of course, this directed graph corresponds to this factorization, as we already know. Now what do we have to do to turn this into an undirected graph? Well, we have to think of the potential terms that this factorization corresponds to. Again we can basically get rid of these first three terms; we don't have to care about them, because they only contain one variable each, so they don't create an edge in our graph. But we have to think about this term here, which, once we drop the vertical bar, corresponds to a potential term that involves all four variables. Now remember that we constructed such potential terms by looking at the cliques of our graph, so we have to create a clique in our undirected graph that contains all four of these variables. In other words, we have to connect all four of these variables with each other, and this amounts to a fully connected undirected graph. This process amounts to marrying the parents, if you like, because x4 is the child node and x1, x2 and x3 are the parent nodes, which were not connected in the previous graph. It is historically, and perhaps very anachronistically, called moralization, because it amounts to marrying the parents of a child to each other so that you have a moral connection between them. The name is maybe not so important; what is important is the effect that it has. It creates a fully connected graph, and of course in doing so we lose any interesting structure you might want to read off from this kind of graph. This fully connected
undirected graph doesn't capture any conditional independence structure at all anymore. Now imagine what we would need to do to go back from this fully connected undirected graph to a directed representation. If I just give you this graph without this particular factorization, then the only thing you can read off is that there is a factor that contains all four variables, but you don't know in which order they show up, which of them is on the left-hand and which on the right-hand side. So the only thing you can do to create a directed graph from this visual representation is to take the set of nodes and give them an arbitrary order. For example, you could index them by numbers from one to four, as I've done here, and then that arbitrary order gives you an opportunity to assign arbitrary directions to the edges. You could just say: for every pair of nodes, draw the arrow in the direction from the lower index to the higher index, so from x1 to x2, from x1 to x3, from x1 to x4, from x2 to x3, and from x2 and x3 each to x4. Now clearly this again gives us a fully connected graph, and, in general at least, it's not even a fully connected directed graph that allows us to return, by dropping edges, to what we had here previously. So by creating an undirected graph we lose information that we would otherwise want to use to encode interesting conditional independence structure. This kind of observation might give you the intuition that directed graphs are somehow universally more powerful than undirected graphs, and if that were the case then of course you would wonder why I ever started to talk about undirected graphs. Well, the fact is that that's actually not true. There are situations in which one of these graph types is more powerful than the other in encoding conditional independence structure, but it's not the case that one is always at least as powerful as the other. In fact, it also turns out that there are probability distributions in
which none of these graphs is fully powerful. Let's do that first, actually, because we already encountered such an example in lecture number two, where I used an example by Stefan Harmeling of a generative process that we called two coins and a bell: someone is throwing two coins, and whenever they show the same face we ring a bell. We saw back then that this process implies all sorts of interesting independence structure. I'm not going to go through it again, because we've done this several times. There are four different independence statements one can make about this joint probability distribution, and they correspond to three different factorizations. We already saw that each of these factorizations corresponds to a different directed graph, and that none of these directed graphs encodes all four of these independence statements at the same time; we can only ever capture some of them. Now maybe an undirected graph could help us with that, but unfortunately the answer is no. Let's look at any one of these individual graphs. We just saw what we have to do to go from a directed graph to an undirected graph: we have to moralize it, we have to marry the parents. In each of these cases that means we end up with just a fully connected undirected graph, and of course that fully connected graph doesn't capture any of the conditional independence structure, because, according to that graph, if you marginalize out any one of these variables or condition on any one of them, the other two remain dependent on each other. So in this case undirected graphs don't really help, but directed graphs weren't fully powerful either. Is it then always true that directed graphs are at least the more powerful ones? No, that's not true either. We've already gone through this example here, where there is a directed graph for which there is no undirected graph that captures the conditional independence structure
encoded in this graph, but here is an undirected graph which encodes conditional independence structure that you can't represent in a directed graph. Let's think of this graph for a moment. It has four nodes, A, B, C and D, with edges between A and C, C and B, A and D, and D and B. If you think about this for a moment, you see that, first of all, any pair of these variables is in general dependent on each other if you marginalize out the other ones, that is, if you don't condition on anything, because they are all connected to each other through the graph. However, if you condition on the variables C and D, then they provide a separating set; they block the paths from A to B. So conditioned on C and D, A and B are independent of each other. The same holds the other way around: if you condition on A and B, then C and D become conditionally independent of each other, because they're separated by A and B. Now what would a directed graph look like that represents this kind of probability distribution? Well, it would have to have directed edges along the same connections that this graph has, between A and C, C and B, A and D, and D and B. But no matter which way around we draw these edges in a directed acyclic way, we are not going to be able to encode both of these conditional independence statements. Let's say we draw arrows from C downward to D, so C to A, A to D, C to B and B to D. That would mean that if you condition on D and also on C, but in particular on D, then A and B actually become dependent on each other, so that doesn't encode one of the properties of this undirected graph. We could try to fix that by drawing the edges the other way, from A to C and C to B, and from A to D and D to B, but then we would have the corresponding problem that if you now condition on B, then C and D become dependent on each other, even if you also condition on A. So this other conditional independence statement is not encoded. From these examples you can already see that neither directed nor undirected graphs are fully powerful as a
representational language for encoding conditional independence structure. You can make this a little bit more formal as well. Here's a statement that I've again taken from Chris Bishop's book. Let's say we consider an individual probability distribution, taken from the set of all probability distributions over a set of variables, and a particular graph that is constructed over that set of variables; it could be either a directed or an undirected graph. Now we could say that if every conditional independence statement that is satisfied by the probability distribution is encoded in the graph, then we call that graph a D-map. For example, the fully disconnected graph is a trivial D-map, because it encodes any independence structure; it makes everything independent. And vice versa, let me say this a little bit more precisely: if every conditional independence statement implied by the graph is also satisfied by the probability distribution, then we call that graph an I-map. It's always a bit difficult to get these statements right. For example, the fully connected graph is such an I-map, because a fully connected graph doesn't imply any independence structure, and therefore of course any probability distribution satisfies this non-existent independence structure. Now the interesting situation arises when a graph is both an I-map and a D-map for a particular probability distribution. Then we call it a perfect map: the graph captures all the independence structure of the probability distribution, and every independence statement you read off from the graph also holds for the distribution. The examples I've gone through on the previous two slides give an overview of how complicated the situation is. Markov chains, for example, are a perfect map for Markov
processes, so for chain-structured probability distributions, and that's true both for the undirected and the directed case. Of course, the fully connected and the fully disconnected graph, both in the undirected and the directed sense, are also perfect maps, for the maximally dependent probability distribution and for the fully factorizing probability distribution, respectively. But there are also examples where only one of the graph types, directed or undirected, provides a perfect map. For example, for a probability distribution that has this conditional independence structure, this directed graph is a perfect map, but there is no corresponding undirected graph. And for a probability distribution that has this conditional independence structure, this undirected graph is a perfect map, but there is no directed graph that is a perfect map for this probability distribution. And yet we also saw an example, in coins and bell, of a probability distribution for which neither any possible directed graph nor the corresponding undirected graph is a perfect map. So in some sense, the sets of probability distributions that can be faithfully represented, in the sense of a P-map, a perfect map, by directed and by undirected graphs are, first of all, overlapping, but neither is a subset of the other. And secondly, around those sets of perfectly representable probability distributions there are probability distributions, even very simple ones like coins and bell, which are not perfectly representable by directed or undirected graphs. So with that we are at our first gray slide, and I can briefly summarize, and actually say something that we've already had in previous lectures. We now have these two graphical languages, directed graphs and undirected graphs, which provide, and that's also historically maybe where they come from, a visual design tool that can be used by a human designer when building a probabilistic model, in particular for machine learning. You can use directed
graphs to write down probability distributions for which you already have a fully generative description, in the sense of a factorization into conditional probability distributions, and then you can read off certain conditional independence structure from that graph, but in general not all of it. Nevertheless, such models can be very helpful, and we will use them as a design tool in the following few lectures. Undirected graphs are perhaps more useful if you have computational constraints: if you want to have certain conditional independence structure in your model, then you can draw the corresponding graph directly, and then you have to pay a price for reading off the joint probability distribution. That's why these graphs tend to show up in models where the computational constraint, the interaction constraint, is sort of natural to the model. This is true both in fields like computer vision, where you might directly think about computational constraints, and in fields like statistical physics, where you want to directly think about the potential terms that show up in your computation. But we saw that both of these frameworks have a certain flaw, which is that they are not fully representative. So if you want to have something like a universal representation of a computation, one that allows us to read off all potentially simplifying aspects, then we have to move away from this relatively simple visual language of variables connected by just lines. Maybe the reason for that is just that relationships between variables are more complicated than just a line. There's a reason why in mathematical descriptions of a function we have a more powerful language than just a simple line connecting two variables. Instead, we actually write down a function that says how these two variables are connected to each other, or how sets of variables are interconnected. So if our goal is not to just
draw graphs on a whiteboard and then think about our model, but to come up with a language that actually allows a computer to reason about inference for us in an at least semi-automatic way, then we need a language that has an explicit role for functions, for functional relationships that are not just represented by a line. We will get to that framework in a moment, and then we'll start thinking about how to use it for automated inference. This framework I just hinted at is really motivated and driven by the idea of an underlying algorithm, which we'll talk about for most of the remainder of this lecture and actually the next one as well. But it's also connected to yet another, a third visual graphical representation of joint probability distributions, called a factor graph. Factor graphs are nice as a visualization, but they are actually not the key thing; the key thing is the algorithm. Nevertheless, I'm going to introduce the graph first. The idea of these factor graphs is at least notionally due to these three chaps, who wrote the corresponding paper. Frank Kschischang: it's tempting to call him a German-born electrical engineer, but really he was just born in Metmang in Germany and spent his entire life in Canada, so maybe we should call him a Canadian electrical engineer. He's now at the University of Toronto. Brendan Frey, Canadian born and raised and still living there, is an electrical engineer, physicist, computer scientist and entrepreneur. And Hans-Andrea Loeliger is again an electrical engineer, from Switzerland, who is a professor at ETH Zurich. You can already tell from the background of these people, who are electrical engineers and largely come from the signal processing community, that the notion we're going to be talking about comes not just from a different community than machine learning. Historically, it actually ties together ideas from many different communities, and that's really what this is going to be about, and
we'll get to it in a moment, once we talk about the algorithm. But first let me take a few minutes to introduce this graphical notation, and again let me stress that the graphical notation isn't really as important in its visual form as the underlying algorithmic ideas. A factor graph is a bipartite graph. For our purposes it's yet another visualization of a joint probability distribution. You can think of a joint probability distribution over variables x as factorized into a bunch of potential functions, as we've already seen several times now. These potential functions might either be conditional distributions, if we're coming more from the direction of directed graphs, or just potential functions, as in the less specific framework of Markov random fields. To draw such a factor graph, we create a bipartite graph. Bipartite means that there are two different sets of nodes, V and F, called variables and factors, and each edge connects nodes of the two different types: variables to factors and factors to variables, but never variables to variables and never factors to factors. That's what a bipartite graph is, a graph separated into two parts. The variable nodes represent the variables that show up in this joint distribution, and the factor nodes correspond to the individual functions, either potential functions or conditional distributions. Now the key part here is that you should think of these functions as explicit objects that are available, of course, to the designer, but more importantly to the computer performing the computation. This is really what makes this framework more powerful: the fact that we are thinking about explicit functions rather than just edges, lines on a board, maybe lines that have directions. In fact, the factors really are the key
idea of this concept, and there is even a variant of this notation which does away with the variable nodes and only draws factors, connecting the factors directly to each other and writing the names of the variables on the edges that connect the functions. This form is due to George David Forney, and it's sometimes called a Forney factor graph. I'm not going to use it here, but it's maybe useful to realize that the factors, the functions, really are the first-class citizens now, and the variables take an almost second-class role. So how would we construct such a factor graph from the kinds of families we already have, just to make that connection briefly? Well, it's very easy, of course. If you have a directed graph, then you already have a set of variables and a set of functions, which are conditional probability distributions, so you draw the factor graph in exactly the way you would imagine; we only have to formally specify this. You take your directed graph, and for each term in the factorization, so for each function that contains a child and its parents, you draw one factor node and then draw connections between the factor and all of the variables and parameters that enter that function. Here is how we would turn our example from the previous lecture, parametric Gaussian regression, into a factor graph. If you already have an undirected graph, then the situation is even easier: you again take all the variables that make up the graph, create a factor node for every potential function, every factor in the graph, and make the connections. So this sounds like a relatively trivial variant of both directed and undirected graphs, and you would imagine that, if you just think about this visual representation, it doesn't really help us much with the problems we've encountered so far. And you would be totally right. Well, actually, for undirected graphs, of course,
adding this new concept of a factor maybe is helpful, and you could have arguments over whether having these factors is actually more expressive, more powerful or not in a computational sense. Here is an undirected graph. What kind of factorization does it correspond to? At the very least there has to be one potential in here that involves these three variables, because that's our clique. But maybe we know that our factorization has a second potential in there. Then we can't represent something like this in the undirected graph, because we already have the corresponding edge between two and three. The factor graph allows us to introduce these kinds of additional functional forms, if you like. So maybe in that sense having factor nodes is more powerful than having undirected graphs. But if you think about directed graphs, about directed graphical models, then you might be thinking: okay, to go from directed graphs to factor graphs we're getting rid of one thing, the directions of the arrows, and we're also gaining a thing, the notational trick of having individual functions. So do we lose something or do we gain something? Is there something in directed graphs that gets lost? Why do we not have directions on these edges anymore? Well, this is really what I mean by the visual form not being that important. Explicitly, I could of course say that if I give you a graph that looks like this, a factor graph with these three variables connected in this way, then just from this graph itself you don't really know whether I'm thinking of a joint probability distribution that has this form or one that has this form. However, if you think more about already having a factorization, like one of these two factorizations that correspond to this and this directed graph, and if I really faithfully represent these factorizations in a factor graph, then
there is actually a difference between this graph and this graph. If I write this particular graph in terms of this factorization and add a factor node for each term, then I will need one factor node for x1, one factor node for x2, and a joint factor for the conditional distribution of x3 given x1 and x2. If instead I have this factorization, with an individual term for x3 and then two generative distributions for x1 and x2, both given only x3, then the corresponding factor graph looks like this, and of course these two graphs are not the same. So from a formal perspective, making these connections precise is relatively tricky, but that's not going to be so much the point. If we really wanted to be precise, we could force ourselves to always include every individual term of a factorization in the graph, and sometimes, when we talk about these graphs as visual aids, we might actually do so. But typically what people care about is how to use such graphs. We should think of these graphs as the implementation of a computer program that has a flow from the start of the program towards the end, and then this kind of situation will be much less of a problem, and we will be more interested in whether we can use the structure of this program to think about the cost of inference. And with that we come to our next gray slide. What I've just done is introduce a visual language which is going to be helpful when we think about a computer algorithm, a program to compute various quantities. The first quantities for which these factor graphs are going to be helpful are marginal distributions. Let's say we have a joint probability distribution over a bunch of variables, from x1 to xn, and we are typically going to represent this in some kind of graph. Let's say we use a factor graph, and we are using that factor graph to encode the factorization structure available in this function
here. Now, one question you might want to ask, and we've asked it in previous applications, is: what is the marginal distribution over one of these variables? This will include in particular situations in which we know some of these variables, so that we condition on them and can therefore do inference, essentially computing marginal distributions over a bunch of unknown variables given some observations. But let's leave that for a little later. This is just going to be one particular computation we're interested in, and there are other related computations, which we'll get to in a moment, that are also interesting in this context. This particular computation, computing the marginal distribution of one of the variables in a graph, gives rise to an algorithm that comes under different names, which are sometimes used to mean very specific things and sometimes slightly more general things, and I'm just going to commit to using one of them in a certain way. It's historically connected to many people, a few of whom I have to name explicitly. In particular, it's due to Judea Pearl, an Israeli-American, I think originally an electrical engineer, even though it's not really clear what field to assign him to anymore, who is now a professor at UCLA. He wrote about this kind of algorithm in his book Probabilistic Reasoning in Intelligent Systems, which was published in 1988. Around the same time, these two people, Steffen Lauritzen, a Danish statistician, and David Spiegelhalter, an English statistician (Lauritzen is now in Oxford, Spiegelhalter is in Cambridge), wrote a paper, published in the same year as Judea Pearl's book, called Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Both this paper and the book are often cited as the origins of the kind of algorithm we'll be discussing. A later formalization, where the name of the algorithm I'm
going to be using, the sum-product algorithm, comes from, is the paper that I've already mentioned by these three people, whom I have already introduced. So this is the algorithm we're specifically going to be talking about. But what's really exciting about this, and what makes it such a compelling idea, is that it's a generalization of a kind of structure that had previously been discussed in many different communities by many different people under various names. Here is a list that I took from a keynote presentation by Hans-Andrea Loeliger himself in 2008, which connects just a few of the ideas related to this algorithm and the variants of it that we're going to encounter over the rest of this lecture and also over the entire next lecture. It includes ideas from statistical physics, where Markov random fields originally come from, to compute marginal distributions in lattice structures, in crystals, in all sorts of thermodynamic statistical systems; from signal processing, where some of these people come from, particularly in engineering, for Kalman filtering and state space modeling, which we've already encountered in previous lectures, but also least squares estimation, statistical formulations of signal processing, learning hidden Markov models, and so on; also from information theory, with parity check codes, so error correcting codes, and compression algorithms; and then more recently also machine learning and more general statistical analysis. So this list is maybe also witness to the fact that machine learning is a culmination of research efforts from many different communities, perhaps unified by the idea of computing with data. That's particularly prominent in this issue of computing a marginal distribution of a joint probability distribution, because it's a very generic kind of problem that you need to address if you're dealing with more or less
any set of variables that are connected to each other through a joint probability distribution.
Okay, so now I've hopefully hyped you up enough about this wonderful idea, and you might understand why I'm willing to invest more than one lecture of this course in this class of algorithms, which we will then also use in subsequent parts of the course. Today we're going to start constructing this algorithm, which I can already say is going to be called, at least for my purposes, the sum-product algorithm. It's also connected to ideas that are sometimes called belief propagation and message passing. Today I will start to construct it in a phenomenological way, by observing a certain structure in a particular class of graphs, and in the next lecture we will generalize it to a more general class of graphs.
The class of graphs we're going to begin with are chain graphs. You already see in this list that this idea is connected to Kalman filtering, an algorithm we encountered in a previous lecture on Gauss-Markov models. So we're going to try to reconstruct the filtering process, but to make things a little more interesting, I'm not going to reconstruct the Kalman filter, because we already spent one lecture on it. Instead I'm going to assume that we deal with a time-series-structured model in which the individual states at time t are discrete, and then we will reconstruct basically a discrete version of the Kalman filter. So here we go. Let's say we have a joint probability distribution over states at time t, where t goes from 0 to n, and we're particularly going to be interested in the state at time t equal to i, so x_i. These individual states are discrete, so they are variables that take a value that is either 1 or 2 or 3, all the way up to k, which are maybe names of a class or of a state. These kinds of models are also historically used in language modeling, so you can think of
an individual word, and then this k is the number of words in your vocabulary. We assume that the joint probability distribution of these variables has a factorization that is encoded in this particular factor graph. So now we're using the factor graph notation. We've already seen at the beginning of the lecture that we could also use a directed or a Markov random field formulation, and they would all look the same; they would all give us this chain structure. We know that the individual terms in this chain structure amount to potentials that involve local pairs of variables. So what we're going to do with this factor graph notation is explicitly write down the factors, the potentials that connect two variables to each other. For the sake of generality, we're not going to assume that the individual factors are generative probability distributions, so we don't think of them as p(x_1 | x_0) and so on, but just as potential terms. That in particular means that we don't know the normalization constant of this probability distribution. This is going to be interesting, because we will discover that we don't actually need this normalization constant for what we're going to do. Of course that's nice, because it means we can be a little more general, and it includes the case where these are generative terms and we just know what the normalization constant is.
Because we have discrete states, maybe it's helpful to understand that these individual terms amount to matrices. In the abstract formulation, this is just a function of two input variables, x_0 and x_1, but notice that each of these variables only takes k possible values, so we can represent the entire function in a table, a matrix of size k by k. Let me just write that down. So for example psi_{0,1} is a table; maybe it shouldn't be an equals sign, maybe I
should write a colon: it is a matrix that contains numbers that relate to x_0 and x_1, which go from 1 all the way to k and from 1 all the way to k.
All right, now let's go back to our computation, and let's say the only thing we're interested in is the marginal probability distribution over the state x_i. If you think of this as a hidden Markov model, in terms of a language model, we want to know the word that was spoken at time i, having made recordings of all the individual variables; but of course there could be any other interpretation of what these discrete variables are. Now what is a marginal? A marginal is a sum over all the states of all the other variables, right? So we have to sum out all possible values of the states at times other than i. So let's do that, and let's look at how we can simplify this computation. In general this is a very complicated computation, because it requires us to sum over the joint probability distribution of n such variables, each of which has k individual states. If we had a general probability distribution for this whole thing, you would think of a high-dimensional array, an array that has n dimensions, where each dimension has k possible states. To sum out all of those variables is super expensive: the cost would be exponential in n, it would take k^n computational steps. However, we know that we have this Markov structure. We know that there is redundancy, repetition in the structure of our multidimensional array, because we can write it, in some sense, as the outer product of these individual quadratic-size terms. So if we want to sum out x_0, let's start with that, then we notice that x_0 is part of only a single such factor; all the other ones do not depend on x_0 at all. So we can move the sum all the way to the very end of this expression, since this is the variable at the beginning of our Markov chain, and then sum out x_0 over
this function psi_{0,1}, which depends on x_0 and x_1. But what does that actually mean? Well, if you come back to our table, then summing over x_0 just means that we sum the numbers in this array along this direction. What comes out is of course a vector or array type object, and whether you think of it as a row or a column vector is not important. It's just an array that contains k values that are a function of x_1, right, one for each of the k possible values of the variable x_1. So I will call this a function of x_1, because that's what it is, but maybe it's more intuitive to think of a finite object, a vector of length k.
Okay, so now we have that vector of length k, and now let's think about what we need to do to sum out the variable x_1. Again, x_1 doesn't show up anywhere in the chain except in the first two terms, that is, in what we just computed and in the subsequent factor. So what we have at this point when we do this summation is this vector-valued object, a function of x_1, and then another table, another matrix of size k by k, that contains the entries of the factor psi_{1,2} of x_1 and x_2. So let me draw another picture for this. We have our new table, so this entire table is maybe called psi_{1,2}, and it contains values for x_1 from 1 all the way to k and for x_2 from 1 all the way to k. The operation we're now going to do is to multiply this matrix element-wise with the vector or array type object with entries from 1 to k. Let me draw it the other way; I need to write it on the right side of the picture. So the thing we're now going to multiply with is this object we computed in the previous step, which is a function of x_1 from 1 to k. We multiply this object with every single column of this matrix, and then sum out x_1. So now we sum out x_1 along this direction: we multiply one, two, three, four, and then we sum.
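The step just described, multiplying the incoming vector into the table and then summing over x_1, is a matrix-vector product. Here is a minimal sketch in plain Python, assuming made-up tables with k = 3 states; the numbers and the helper name `pass_message` are illustrative and do not come from the lecture.

```python
def pass_message(psi, incoming):
    """Sum over the row variable of psi[row][col] * incoming[row].

    psi is a k-by-k table indexed as psi[x_prev][x_next]; incoming is a
    length-k list indexed by x_prev. Returns a length-k list over x_next.
    """
    k = len(psi)
    return [sum(psi[a][b] * incoming[a] for a in range(k)) for b in range(k)]


# Hypothetical potentials for a chain x_0 - x_1 - x_2 with k = 3 states.
psi_01 = [[1.0, 2.0, 0.5],
          [0.5, 1.0, 1.0],
          [2.0, 0.0, 1.0]]
psi_12 = [[1.0, 0.0, 2.0],
          [3.0, 1.0, 0.0],
          [0.5, 0.5, 1.0]]

# Summing out x_0: nothing arrives at x_0 from further left, so the
# incoming vector is all ones and the result is just the column sums.
msg_into_x1 = pass_message(psi_01, [1.0, 1.0, 1.0])

# Summing out x_1: multiply the message into psi_12, then sum over x_1.
msg_into_x2 = pass_message(psi_12, msg_into_x1)

print(msg_into_x1)
print(msg_into_x2)
```

Each call costs k^2 multiplications, and neither table needs to be normalized.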
So that's a sum over a product: we take a sum over x_1, a sum in this direction, over the product of this matrix with this list, with this vector.
Let's go back to our slide. Maybe at this point it's already natural to think about what we want to call this object. We just computed this array, this list, this vector, whatever we want to call it without raising too many intuitions in you, that we multiply with this table. It is something that comes out of the factor psi_{0,1}, so it's clearly a message being sent from the factor psi_{0,1} into the next variable x_1. In the next operation we take the factor psi_{1,2}, multiply with that message and sum out, and then we can keep doing that all the way until we reach x_i. Now what happens at x_i? We're not yet done, right; we have to sum over many more variables, all the variables that come after x_i. But what we certainly have now is a message, recursively constructed, that is being sent from the beginning of the chain into our variable x_i, and that message is a vector or a list or an array of k entries containing positive numbers.
Now we can turn our heads and look to the other end of the chain and see if we can simplify the summation over the variables in the remaining part of the chain as well. Of course we can, because there's a similar situation here. If we look all the way to the very end of the chain, to x_n, then the sum over x_n requires us to sum only over entries of one of the potential terms, one table of size k by k containing numbers that depend on x_n. We can do that to reduce this two-dimensional array to a one-dimensional array and get a message that is being sent from the factor connecting x_{n-1} and x_n into x_{n-1}. It's a function of only x_{n-1}, because we've summed out x_n. And now we can recursively keep doing this, just as we did in the first part of the chain, until we
reach x_i again. In that last step of the backward recursion we're summing out the values of x_{i+1} to get a function, so a vector of length k, that depends on the values of x_i. Now we have two messages coming in, and notice that each of these messages is a sum over a product, a sum over the product of a table and a vector in this case. We can use those two messages to construct the marginal distribution over x_i, because they contain all the necessary parts to do so. They are just two lists of length k, so we just multiply them together element-wise, and that's easy. They contain positive numbers. The only problem left is that these numbers don't necessarily sum to one yet, because we don't know the normalization constant. But notice that to make this a probability distribution over x_i, the only thing we need to do is to sum over these k possible states and divide by that sum. That's a one-dimensional operation, which we can do only at this final point where we compute our marginal, rather than trying to construct the much more complicated normalization constant for the probability distribution over all the n states, which could potentially be exponentially hard. So now we're done: we have our marginal, constructed locally, and all we needed to do so is this recursive sequence of constructing these messages locally.
Now how expensive is it to construct such a message? Well, it requires us to do this operation, which means we have to multiply this array of length k with each of the k columns in this matrix, right? Each of these multiplications costs k operations, and we need to do that k times, so that's k^2. Then we have to do this k^2 operation n times to go through the entire chain. So the cost of this whole process is n times k^2, which is of course much cheaper than the general case: if we didn't know that we had this
chain structure, we would have to pay a cost of k^n, which is radically more expensive. So we have an algorithm. I mean, that's not surprising for chains, because we've already encountered it in previous lectures, but it's still nice to see again that we're just paying a cost that is linear in the length of the chain for the construction of this marginal. We have also seen that we can construct the normalization constant at the very end, so that's nice, we don't need to do a big sum: we can construct a normalization constant Z for our marginal once we have constructed our messages. And we've also noticed that to construct our messages we have to construct sums over products of terms, and that's where the name of the algorithm comes from. So what we've just constructed, in a sort of phenomenological way, by going through this example, is the first instance of the algorithm that will be called the sum-product algorithm, or the message passing algorithm, or the forward-backward algorithm, for inference of the marginal state in a Markov chain like this, in a hidden Markov model, so a model with latent discrete states; latent meaning that there are probability distributions over the discrete states of the individual variables. Maybe this isn't a great slide, but we can use it as an opportunity to take a quick breath before we move on to the next question.
One thing to notice here is that, on a very abstract level, the algebraic structure we've used to make this all work is the fact that we could move these sums through this entire expression. Why are we allowed to do that? Because multiplication distributes over sums: the sum of a times b and a times c is a times the sum of b and c. Now we can use the same algebraic structure, if it shows up in other operations, to construct similar algorithms. So any other operation that also has this distributive nature can be used in this
kind of recursive fashion, and we're going to use that now to construct another interesting quantity of such a joint probability distribution: not the marginal, but the most likely state. We've already seen that the most likely state plays an important role in statistical analysis, and it's an interesting point estimate, of course even for the Bayesian, so you might really want to know what the most likely state is in your chain. Now you might think: why do I even need to do that? I already have all the marginal distributions, so why don't I just take the maximum of each marginal distribution? That's easy to do, because it can be done in linear time; I just go through all the marginal distributions and look for the maximum. Unfortunately that's not the right thing to do, because the maximum of the joint is not necessarily equal to the joint of the maxima. To show you what I mean by that, let me write down an example that's again due to Chris Bishop's book; it's doing double duty, or lots of duty, at the moment in this part of the lecture. Consider a joint probability distribution over variables x_1 and x_2, and let's say they are binary variables, so they only have two possible values each, and we need a 2 by 2 joint probability table. Let's say they have joint probabilities 0.3, 0.4, 0.3 and 0. Notice that that's a probability distribution because the entries sum to 1, right: 0.3 plus 0.3 is 0.6, plus 0.4 is 1, so there has to be a 0 over here. Now if we compute the marginal distributions over these individual variables, so we sum out the respective other variable, then we get a marginal distribution over x_1 that is equal to 0.7 and 0.3, and a marginal distribution over x_2 that is equal to 0.6 and 0.4. So the most likely joint state has probability 0.4, and it lies here, but the most likely marginal state has a
probability of 0.7 for x_1 and 0.6 for x_2, and that would correspond to this state here, which isn't the most likely one. So if we really want to know the most likely path a dynamical system took through its time series, then we need to construct it in a global fashion and keep track of the whole thing. So let me wipe that out, and then we can do this. It will give rise to an algorithm that isn't the sum-product algorithm but the max-product algorithm, or, as we'll see, we could also call it the max-sum algorithm, for a reason that I'll show you in a moment. That algorithm also has important historical uses.
Okay, let me go through one more slide. Let's say we now want to construct our most probable state, and actually we need two things: we want to know the value of that most probable state, so how likely it is, but we also want to know what that state actually is. So we want to know the max and the argmax, if you like. Everything is as before: we are still thinking of our chain-structured graph with discrete states, k of them, and here is our corresponding factorization, the joint probability distribution factorized into individual factors, with an unknown normalization constant. If we want to know the probability of the most likely state, then we have to take the maximum over all the variables from x_0 to x_n. Now notice that the maximum also has a distributive nature: for nonnegative a, the maximum of ab and ac is a times the maximum of b and c, and in fact the maximum of a plus b and a plus c is also a plus the maximum of b and c. So we can again move the maxima through this entire expression, all the way to the point where the variable we're taking the maximum over shows up for the first time. So let's start at the very beginning. We want to know the most likely path through the time series, all
the way to the end. We take our maximum over all the variables, and we start with the final one, with x_n. That one only shows up at the very right-hand side of this graph, so we can move its maximum all the way through and take the maximum over x_n of this factor, this table of size k by k over the variables x_{n-1} and x_n. Let me write that down once again, maybe in this color. We have our two variables x_n and x_{n-1}, and there is a joint factor psi_{n-1,n}. It's a table of size k by k; for simplicity's sake we have four of these states. Here is x_n and this is x_{n-1}. What we need to do, as we just had on the slide, is take the maximum over x_n of this quantity. There are numbers in this table, and we want to know what their maximum is as a function of x_{n-1}, where the maximum is over x_n, right? So x_n goes from 1 to k, and we want to know what the maximal value is depending on what x_{n-1} is. In this column there is one entry, let's say it's this one, which is the largest; in this column there's one that's the largest; in this column there's one that's the largest; and in this column there's one that's the largest. If we store these k numbers, we can store them again in an array type object of length k. Let me draw this below. So if we take the maximum, we get another array; it's not a bivariate array, it's a univariate array of length k where we store those numbers. Let's say that number might be 0.3, that number is 0.4, then 0.1 and 0.001, and we store those numbers in here: 0.3, 0.4, 0.1, 0.001. Let me check that this works: yes, it's still feasible for that to be part of a whole probability distribution, 0.7, 0.8, and then a lot more stuff. Now notice that by storing those numbers I don't actually store which of these entries was the argmax. I
only store the max, not the argmax. So if I want to be able to recover the argmax afterwards, that is, the entry that maximized the probability, and that's often what I want, because I want to know what the most likely path through the system was, then I also have to store the indices of the argmax. So here the index is two, here it's one, then three, and two again, and I can store these. Of course to do that I need another array, but that array is again of linear size, right; it's just a list that contains the indices of the argmax. So here that would be 2, 1, 3, 2, if we're using one-based indexing. Fine.
Okay, so now we have this object, and we can think about what our next step is going to be. Now we go to the potential for x_{n-2} and x_{n-1}, so there's another table like this. Here we go: we have the table for x_{n-2} and x_{n-1}, and they again go from 1 to k and from 1 to k. We have the result of the preceding step in our recursion, which is the maximal value for each value of x_{n-1}, and we need to multiply this list of k entries with every single row of this matrix. That gives us new numbers. We also have from the previous step this list of indices. So here we have 2, 1, 3, 2 and 0.3, 0.4, 0.1, 0.001, right? So now we multiply this message coming out of the previous step with all of these individual rows, and again we get numbers, and now we want the maximum over x_{n-1} of the result. I did this badly by changing the order of these, so let me just fix it: this is x_{n-1} and this is x_{n-2}, and then of course this list comes in from the side, from here. So this is our message coming backwards, and this is the index of the argmax, and we're going to need a
name for these; I'll introduce it in a moment. It's clearly another data structure of an interesting type: it's basically a list of links, right? So here are our 0.3, 0.4, 0.1, 0.001 and 2, 1, 3, 2, and we multiply this object with each column of this table. Out come a bunch of numbers, and now we take the maximum over x_{n-1}; that's our next maximization. So again, maybe here is the largest entry, and there is the largest entry, and there, and here is the largest entry in each of the columns. We can again store the values that are in there. By the way, notice that this doesn't even have to be normalized, because we don't assume that it's normalized; it just has to be positive. So maybe there's a 0.1 here, and a 1.1, and a 2.2, and a 0.001. Then we take the maximum, so we store these, 0.1, 1.1, 2.2 and 0.001, and we again store the indices of the maxima; here the indices are 3, 2, 1 and 3. And now we can keep going recursively.
Now what should we do? Maybe a first thing to notice is that at the very end, at the final step of this iteration, we are finally going to be able to compute the probability of the most likely state, because at the end we're left with a final such list and we just take the maximum over it, and then we know the probability of the most probable path of all of them. But we will also be able to reconstruct the most likely path. Why? Because at the end we have one index for the location that is the maximum, and we can use it to recurse back through this graph in a linear fashion. So maybe at the very end we can go here and say: okay, 2.2 is the maximum, it's the largest value, and 2.2 corresponds to index one. And now we can go back and say: this one was the largest, so that means in the previous step, for x_{n-1}, the largest value must have been the second one, so the second one is this one, and we can keep going recursively back through like this. This process requires us to go first from the end to the beginning of the chain, and then again from the beginning back to the end.
This additional data structure that we need in order to store the argmax rather than just the max has a historical name: it's called a trellis, because it has historically been drawn in this kind of fashion. This is a representation of the chain, and at each location of the chain you write down the list of the individual states, and then draw lines connecting the corresponding candidates for the argmax as you go backwards from 3 to 0. Once you're at the end and you know which of them actually is the largest one, you can go back through this data structure to construct the most likely path. So as you go backwards you create hypotheses for the most likely path, and as you go forward you know which one actually is the most likely path. It is called a trellis, which is an English word for something that in German is called a Spalier: something you put in your garden, a rigid grid on which roses and other climbing plants can work their way up and hold on to. You just have to visually think of this object as rotated, so that your roses are growing either from 0 to 3 or from 3 to 0, whichever you like. Maybe you can see them grow up from the end to the top, and at the top, once you see which one is actually the largest, you can follow its path back down into the soil and know which path it is. So that's actually an algorithm, right? This algorithm, for the case of Markov chain models with discrete states, is called the Viterbi algorithm.
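The two passes described above, maximizing backwards while recording back-pointers and then following the pointers forwards, can be sketched as follows. This is a minimal plain-Python illustration, not the code behind the slides: the function name `max_product_chain`, the 0-based state indices, and the example potentials are all assumptions made for this sketch.

```python
def max_product_chain(factors):
    """Most probable joint state of a chain with pairwise potentials.

    factors[t][a][b] is the (unnormalized) potential between x_t = a and
    x_{t+1} = b. Returns (best_value, path): the largest unnormalized
    product and the maximizing states, using 0-based state indices.
    """
    k = len(factors[0])
    message = [1.0] * k   # trivial message arriving at the end of the chain
    trellis = []          # one list of back-pointers per factor
    # Backward pass: maximize out x_n, then x_{n-1}, and so on.
    for psi in reversed(factors):
        new_message, pointers = [], []
        for a in range(k):
            scores = [psi[a][b] * message[b] for b in range(k)]
            best = max(range(k), key=scores.__getitem__)
            new_message.append(scores[best])   # the max
            pointers.append(best)              # the argmax
        message = new_message
        trellis.append(pointers)
    # Forward pass: pick the best first state, then follow the pointers.
    state = max(range(k), key=message.__getitem__)
    best_value = message[state]
    path = [state]
    for pointers in reversed(trellis):
        state = pointers[state]
        path.append(state)
    return best_value, path


# Hypothetical chain with four binary variables and three factors.
example_factors = [
    [[0.5, 1.0], [2.0, 0.1]],
    [[1.0, 3.0], [0.2, 1.0]],
    [[0.3, 0.4], [0.3, 0.0]],
]
best_value, best_path = max_product_chain(example_factors)
print(best_path)  # [1, 0, 1, 0]
```

The trellis holds one list of k indices per factor, so its memory cost is linear in the chain length, and the whole computation again costs n times k^2 operations.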
It has many names; Viterbi is maybe the one that is most commonly used. It's due to an Italian-American, again an electrical engineer; here you see the background of this part of the field in electrical and computer engineering. He's also one of the co-founders of Qualcomm, if you want to look him up. However, there are many other people in other communities who have invented this algorithm over and over again, and there are different names connected to it, so we maybe shouldn't be too focused on that particular name. Why? Because it's a very common kind of problem. You're observing a dynamical system that evolves over time, and you're only observing it with noise, with uncertainty. Maybe you're observing someone communicate over a channel and you want to recover what they've said; maybe you're listening in on a conversation. That's also one of the original applications of these kinds of models, for military-intelligence-style uncovering of communications over potentially encrypted channels.
From our point of view, that of abstract automated inference in graphical models, because we are soon going to move beyond chain graphs to more general graphs, this algorithm could maybe be called, less loaded historically, the max-product algorithm, because we're constructing our individual messages by taking the maximum over the product of, in this case, a table, a matrix, and a row-wise multiple of the incoming message. So it's a product of matrix and vector, and then we take the max over it. However, we could just as well call it the max-sum algorithm. Why? Because, as we've now seen several times in the course of this lecture, we can also take the logarithm of our probabilities, and then many things get easier and a product turns into a sum. What do I mean by that? This is what I have on the slide here. Instead of thinking about the maximum of a probability distribution, we could think of the logarithm of the maximum of the probability
distribution. Because the logarithm is a monotonic transformation, we can move it through this entire expression: we take the log of p(x), which turns the product of individual factors into a sum of logarithms of those factors. Now we take the maximum of that logarithm, which is a maximum of a sum, and that's why our algorithm is going to be the max-sum algorithm, if you like; everything else stays as before. Maybe I should write that down somewhere; I need to wipe that out. The properties we use here are really that the maximum of ab and ac is equal to a times the maximum of b and c, or that the maximum of a plus b and a plus c is equal to a plus the maximum of b and c. And for the sum-product algorithm we use the fact that the sum of ab and ac is equal to a times the sum of b and c. That's really the algebraic structure that drives this entire kind of process.
So we can encode this inference process for the most probable path through a dynamical system in this, again, linear-time algorithm, which is encoded in a kind of pseudocode in this little picture. It starts with an initial incoming message that is essentially zero, in the logarithmic formulation, and then we go forward through the chain. You can also do this backward through the chain; it doesn't really matter in which order we do it, either from left to right or from right to left. We first take the maximum over just this one factor. That gives us a function of x_1 that we can multiply again with the table associated with the factor psi_{1,2}, and then we keep doing that recursively until we are at the end, building up our trellis. At the very end we can move back through the trellis, following the candidates for most likely paths, to construct what actually was the argmax of this probability distribution.
With this we're almost at the
end of this week's lecture. It's nice; it looks like one more lecture, but we are slightly under one hour and thirty. Before I show you the summary slide, maybe it's a good idea to prime your heads for what we're going to do in the next lecture, which is the generalization of this process of message passing away from chain graphs to a more general class of graphs. Maybe we can already think at this point about what kind of graphs these are going to be, what kind of structure we actually need for this to work. To do so I'll go back (it's a bit stupid that I wiped it out, but I'll draw it again) and think about how we can generalize this process I just drew, with these matrices and the summation, towards more general data structures. In this chain example, what we did is take these bivariate, matrix-like objects, which we have because the potentials in our chain graph are bivariate objects; they are functions of two inputs, x_1 and x_2, or x_n and x_{n-1}, whatever. Then we took the sum or the maximum along one of these directions to get out a function that only depends on a subsequent part of the graph. So we took, in particular, the sum along this direction and got out an object that is a vector, which can then be sent into another variable. Now maybe you've already thought, while we were doing this, about how you would generalize this to a factor of more than two variables. Let's say we are thinking about a graph in which there is a factor connecting three such variables to each other, and then there is a remainder of the graph that's on the left-hand side here, going on in this direction. What we want is to be able to pass messages along this direction, and to do so we will need to marginalize out more than one variable. Of course you can imagine here an array that is, well, a multivariate
array, right, something that has more than two dimensions, for a potential. So that's x1, x2, x3, and we have a psi (I should write it next to the factor, actually), psi of x1, x2, x3. If that's our factor, then to get a message that goes into here, and let's say we do sum-product rather than max-product or max-sum, we will need to sum out something along this kind of object. And what are going to be the incoming bits? The incoming bits are also going to be messages, where each incoming message is again a one-dimensional object, typically speaking. So we multiply each of these one-dimensional objects with its corresponding rows or columns, and out comes something that is again of this size, whatever the number of inputs to the factor is. Then we sum out, which is obviously going to be more expensive (in this case it would be k cubed rather than k squared), to get out another one-dimensional object, which we can then send on as a message. So it's really already clear to you that it's not the chainness of our graph that we are really using here, but it's going to be another kind of structure that actually makes it possible to do this kind of process. Maybe I shouldn't tell you at this point what this is; we'll do so in the next lecture.
With that, let's do our summary slide. So today I first reflected a little bit more on directed and undirected graphs, and we noticed that these graphs, although interesting as design tools, are perhaps not the right language to encode automated inference in a computer program. To do so, you really need to know the functional form of the probability distribution, or at least of the factors that you're dealing with, and that means that maybe you should think of a graphical language which has an explicit role for the functional relationships between variables. Factor graphs are that language. They provide a tool to directly represent an
entire computation in a formal language, not by drawing little boxes but by assuming that those little boxes are actually assigned to a concrete program, a concrete function that is realized on a computer. Both undirected and directed graphs, as we've used them before, can be mapped onto factor graphs; we saw how to do that. Once you have such a graph with its associated functional forms, we have seen at least one case in which certain relevant computations for probabilistic inference can be carried out in an efficient way: in particular the computation of a marginal distribution, and the computation of the most likely path through a time series in this case, or the most likely state and the location, the identity, of that most likely state. In the chain case, that efficient way turned out to be linear in the number of steps along the chain. We did so by passing local messages, or actually just by computing terms that are local, and you can think of these terms as messages being sent along the graph. Notice that we can use this notion of message passing both to compute marginals and to compute maxima and most likely states. What we really use in this process is the fact that these computations are distributive, that both the sum and the maximum operation are distributive in nature.
So what we're going to do in the next lecture is to generalize this process to a larger class of graphs beyond chain graphs, and then we can use that in the remainder of the lecture as our tool set for efficient inference, or for the implementation of efficient inference algorithms in structured graphical models. That's enough for today, though. Thank you very much for your time.
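As a companion to the two computations described in this closing part, here is a minimal Python sketch, not code from the lecture. It assumes discrete variables with k states each and nonnegative potentials stored as plain nested lists. The function names `chain_message` and `factor_to_variable_message` and all numbers are hypothetical, chosen only for illustration. The `combine` argument switches between sum-product (`sum`) and max-product (`max`), which is possible because multiplication by nonnegative numbers distributes over both operations.

```python
from itertools import product


def chain_message(psis, combine=sum):
    """Pass a message left to right along a chain of pairwise potentials.

    psis is a list of k x k nested lists; psi[a][b] couples a variable
    in state a with its right-hand neighbour in state b.
    """
    k = len(psis[0])
    msg = [1] * k  # trivial message entering the first variable
    for psi in psis:
        # k * k work per step, so the cost is linear in the chain length
        msg = [combine(psi[a][b] * msg[a] for a in range(k)) for b in range(k)]
    return msg


def factor_to_variable_message(psi, m1, m2, combine=sum):
    """Message from a three-variable factor psi(x1, x2, x3) to x3.

    psi is a k x k x k nested list; m1 and m2 are the length-k
    incoming messages arriving from x1 and x2.
    """
    k = len(m1)
    out = []
    for x3 in range(k):
        # k * k terms for each of the k output entries: k**3 in total
        terms = [psi[x1][x2][x3] * m1[x1] * m2[x2]
                 for x1, x2 in product(range(k), repeat=2)]
        out.append(combine(terms))
    return out


# Hypothetical toy potentials with k = 2 states per variable.
psi_a = [[1, 2], [3, 4]]
psi_b = [[5, 6], [7, 8]]
psi3 = [[[1 + x1 + 2 * x2 + 4 * x3 for x3 in range(2)]
         for x2 in range(2)] for x1 in range(2)]

print(chain_message([psi_a, psi_b]))                      # [62, 72]
print(chain_message([psi_a, psi_b], combine=max))         # [28, 32]
print(factor_to_variable_message(psi3, [1, 1], [1, 1]))   # [10, 26]
```

The chain result can be checked against the brute-force sum or maximum over all joint states, which gives the same numbers at exponential rather than linear cost. The three-variable factor only changes the local step: two incoming one-dimensional messages are multiplied into the three-dimensional array before summing out.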