Hello and welcome to probabilistic machine learning, lecture number 18. This entire course, of which we've now finished about two thirds, is about providing you with a toolset, a set of ideas, some of them quite old, some rather new, and endowing you with the ability to build your own structured probabilistic reasoning algorithms or systems. That means the ability to write down uncertain statements about quantities, situations, and things you'd like to know, and then refine them with data from the real world, in a process called inference, to create improved predictions and to increase and refine knowledge about unknown quantities in the world. I've shown you this slide, this toolbox, several times already. Maybe so far we've moved through it a little too quickly, so let's spend some time and see what we've collected so far. We started with the foundational observation that probability theory, a reasoning system that rests on two elementary rules, the sum rule and the product rule, and, arising from them as a corollary, Bayes' theorem, provides a reasoning system that extends propositional logic by distributing truth across a space of hypotheses rather than committing to an individual one. Only in this way is it possible to really perform inference. Almost immediately, in lecture two, we discovered that there is a high price to pay for this extension of our reasoning abilities, and it is a computational one: keeping track of an entire space of hypotheses, rather than committing to an individual hypothesis, causes a potentially exponential increase in computational cost. So more or less the entire rest of the lecture course, up to and including today, has been about finding ways of dealing with this computational complexity, ways that are almost orthogonal to each other.
Right at the very beginning of the course, in lecture two, we discovered that one way to reduce the computational cost of inference is to use conditional independence structure: the observation that certain variables, if they are known, separate parts of the space of latent, unknown hypotheses from each other, so that the computational steps we need to perform become much easier, because they factorize into individual parts. We saw that there is a graphical way of visualizing this kind of structure in our generative processes; these are called graphical models. This is nice because conditional independence separates the computation over several variables into separate parts, if we can use it. But in lecture three we saw that even if we separate our problems as far as we can, the remaining variables we have to deal with may be continuous valued, and we saw how to extend the idea of probabilistic reasoning to continuous variables using the idea of a probability density function. If we do so, the remaining computations we have to perform, even if they are low dimensional, are still integrals, continuous summations if you like, and these can still be intractably hard, even when made as low dimensional as we can, which might still be quite high dimensional. So we need ways of dealing with these integration problems in our probabilistic models, and we saw several different ways of doing so; in fact we're not yet done, there will be more to come. The first one is the idea of sampling, of Monte Carlo methods. We encountered an entire family of these algorithms, introduced very early in the course, so that we would have one concrete algorithm we could actually use for some applications. Monte Carlo methods create samples: random numbers which, at least approximately, and in the infinite limit asymptotically, acquire the correct density.
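As a concrete illustration of this recap (my own sketch, not an example from the lecture): a Monte Carlo estimate of an expectation, here the second moment of a standard normal, obtained by averaging a function of the samples. All names and numbers are chosen purely for illustration.

```python
import random

# Monte Carlo sketch: estimate E[x^2] under a standard normal N(0, 1)
# by averaging over random samples. The true value of this integral is 1,
# and the sample average converges to it asymptotically.
random.seed(0)                      # fixed seed for reproducibility
n_samples = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
mc_estimate = sum(x * x for x in samples) / n_samples  # approximates E[x^2] = 1
```

The same pattern, replacing an intractable continuous sum by a finite sum over samples, underlies all of the Monte Carlo estimators from earlier in the course.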
These samples are distributed in the way described by the probability density function, and therefore allow us, in a tractable fashion, by summing over individual samples rather than over an infinite continuous set of probability densities, to construct estimates that capture the structure of a probability distribution, for example to compute moments and expected values. That's one way of dealing with the integrals. Another one, quite different in nature but addressing the same issue, is to construct our models such that all the integrals remain tractable, so that we simply know how to do them. We saw, and this took a significant part of the course, that there is one very elegant framework that allows us to use this idea in a really powerful way, by mapping integrals onto linear algebra: the Gaussian framework. If all variables in our reasoning problem are jointly Gaussian distributed and related to each other by linear maps, meaning they are real numbers connected through linear maps, then everything we have to do for inference amounts to linear algebra problems.
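To make the "inference is linear algebra" point concrete, here is a minimal scalar sketch of my own (not from the lecture): a Gaussian prior combined with one noisy linear observation, where the posterior follows in closed form from a few arithmetic operations. In higher dimensions these scalars become matrix-vector products and linear solves. The variances and the observation are made-up values.

```python
# Scalar sketch of Gaussian inference as linear algebra:
# prior x ~ N(0, s0_sq), observation y = x + eps with eps ~ N(0, sn_sq).
# The posterior over x is again Gaussian, with mean and variance given
# in closed form (here scalar; matrices in general).
s0_sq, sn_sq = 4.0, 1.0          # prior and noise variances (assumed values)
y = 2.5                          # observed value (assumed)
gain = s0_sq / (s0_sq + sn_sq)   # scalar analogue of the Kalman gain
post_mean = gain * y             # posterior mean
post_var = (1 - gain) * s0_sq    # posterior variance
```

The design point is that no integral is ever computed explicitly: closure of the Gaussian family under linear maps and conditioning does all the work.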
Concretely, that means multiplying matrices and vectors and solving linear systems of equations. We saw that this idea is actually very powerful: it allows us to build a large class of the algorithms that we now describe as machine learning. In particular, it allows us to address many supervised machine learning problems. We spent some time on continuous-valued, or real-valued, regression and saw that we can learn functions that map from an input to an output space using the Gaussian framework, by describing nonlinear functions through a set of features and then learning the weights for these features. This essentially defines a neural net, if you like. How do we learn the features if we don't want to fix them? Well, one way is to step outside, or maybe to the very boundary of, the probabilistic framework: still use probability distributions, but only consider their maximum, doing maximum likelihood or type-II maximum likelihood inference to learn the features. This gives rise to the idea of deep learning. An alternative that is in some sense orthogonal to this idea is, instead of adding layers to a network and learning them at maximum likelihood, to increase the number of features in one layer towards the infinite limit and not learn them at all, but just place a distribution over them and make the layer infinitely wide. This gives rise to the idea of a kernel machine and its associated probabilistic model, a Gaussian process. We toyed around with these models a little, saw how to apply them to regression problems, also to regression problems of particular structure, in particular time series models, which give rise to notions like filtering, smoothing, and stochastic differential equations. And then we returned to an itch that needed scratching, which is that sometimes the variables involved in our reasoning problem aren't actually real valued; in particular, it might be that in a supervised machine learning problem the output
variables, the quantities we get to observe as the targets, are not real valued. We discovered one way of dealing with this issue, which is to stay in the Gaussian framework as much as possible, and whenever something fundamentally isn't Gaussian, to shoehorn it into a Gaussian: we approximate non-Gaussian distributions with a Gaussian. This led to the idea of the Laplace approximation, which allowed us to retain the Gaussian framework and extend it, at least in an approximate fashion, to the tasks of classification and multiclass classification, and even beyond that, through the idea of a generalized linear model, to output variables of relatively general type. So this already gives us a pretty powerful framework for supervised machine learning. But of course it's a little underwhelming to have to use the Gaussian framework exclusively and extend it through approximations, and one might wonder whether we're really restricted to just the Gaussian family of probability distributions, whether there aren't other distributions that also allow closed-form, tractable inference if we are willing to consider them. In fact, we saw that this question led us, in lecture 15, to the idea of exponential family distributions. These are probability distributions which essentially leverage one known integral. In the Gaussian case it's the Gaussian integral, but other exponential families leverage other tractable integrals, to provide tractable, closed-form inference on certain types of variables relative to each other. Here, closed form means that an integral, which is generally intractable, is replaced with computing a gradient; not really an optimization but a differentiation, where an integral is replaced with a derivative of a log normalization constant. These exponential families provide, in some sense, data types.
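The "integral replaced by a derivative" statement can be checked numerically in a few lines. This is my own sketch, using the Bernoulli family as the example: the mean of the sufficient statistic equals the derivative of the log normalizer with respect to the natural parameter.

```python
import math

# For an exponential family, E[sufficient statistic] = d/d(eta) log Z(eta).
# Bernoulli with natural parameter eta: log Z(eta) = log(1 + exp(eta)),
# and its derivative is the sigmoid of eta, which is exactly the mean.
def log_Z(eta):
    return math.log(1.0 + math.exp(eta))

eta = 0.7                        # an arbitrary natural parameter value
h = 1e-6
numeric_mean = (log_Z(eta + h) - log_Z(eta - h)) / (2 * h)  # finite difference
closed_form_mean = 1.0 / (1.0 + math.exp(-eta))             # sigmoid(eta)
```

So an expectation, in principle a sum or integral, is obtained by differentiating a single scalar function; that is the computational trick the exponential families package up.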
This is also why they ended up in our toolbox: exponential families provide data types for certain specific variables, and pairs and sets of variables, like strictly positive variables, scale variables, variables on the simplex (probability distributions), and so on. They allow closed-form inference in very specific but frequent and practically relevant inference problems. That's cool, but it's also relatively restricted, because it's very specific to certain kinds of combinations of variables. What if we have, or want to write down, a more general relationship between variables, in the way that we usually write computer programs? Well, that's where we currently are in the course, and the question we really want to answer. In the previous two lectures we returned to the idea of a graphical model, which we had already encountered in lecture two, back then specifically for conditional independence. We saw that there is actually more than one way of graphically representing a probability distribution. If you have a generative model, if you can write a probability distribution in terms of conditional distributions, then you can use a directed graph to directly, visually represent this probability distribution, and this allows us to infer conditional independence structure purely from the graph. It's a little tedious because it requires the notion of d-separation, but it's possible. And if we don't have a generative factorization, so if we can't write a probability distribution in terms of conditional distributions, but only in terms of factors, interaction potentials between variables as they would be defined in physics, then instead we can use the more restricted, somewhat more simplistic notion of an undirected graph, a Markov random field, which still allows us, in an even easier way, to extract conditional
independence structure from the graph. The downside is, of course, that because we started out not knowing the joint distribution, only interaction terms between variables, we still don't know what the joint probability distribution is: in general, inferring the normalization constant of these distributions can be hard, and then we need to use all sorts of approximations, including the ones we've already discussed, to make computations tractable. In the last lecture we saw that there is a way to make these graphical notations a little more expressive, and also to bring them closer to the kind of formalism you would use on a computer rather than on a whiteboard, by giving an explicit role to functions: by creating what's called a factor graph, a graphical representation that provides an explicit place in the visualization for functional relationships between variables, and really makes functions first-class citizens. What we saw in the last lecture, lecture number 17, was a first observation that when we write down probability distributions in this way, then, at least in specific cases, and in the last lecture it was a very specific case, that of a Markov chain with discrete states, we are able to compute certain interesting aspects of a joint probability distribution, namely the marginal distribution of each individual state in the time series, and also the most likely path through the joint, the most likely configuration of the joint distribution, in linear time, using an algorithm that passes messages, conceptually, between variables along the graph, from nodes on the graph to their neighbors. In particular, it's possible to pass messages from the front to the back and from the back to the front of this chain, and in doing so compute, in linear time, all the marginals and the most likely state of the joint distribution, using an
algorithm, or two flavors of the same algorithm, that we call the sum-product algorithm and the max-sum or max-product algorithm. And you saw that these have a lot of historical connections; actually, we didn't really see this, I just told you that they do. But we were very specific about this setting of a Markov chain with discrete states. So what we are going to do today is extend this idea to the full flavor of the sum-product algorithm, which is also connected to terms like belief propagation and message passing, to more general graphs. In doing so we will also deal with continuous variables and with observations on the graph, with conditioning on observations, to do actual inference. Before we get to that point, maybe you're wondering why we're doing all of this. The last two or three lectures certainly have been a little abstract; I didn't show you concrete examples of these tools being used, and this is quite deliberate, because I want to do an extended example, a large example, on an almost real-world problem that uses all of the ideas we've developed in the entire lecture course. That example will take a little time to introduce, though, because the model is really nontrivial, so doing it will take several lectures, and that's why I haven't started yet. We will actually do it starting from the very next lecture onwards. But up until then I wanted to make sure that you actually have the tools we need, so that we can then employ them in our example. If I go back to the rough plan: we're now at lecture number 18, and from the next lecture on I will start to build up an example, which we will then return to, once we have added more tools to our toolbox, in the final few lectures of this course, to actually apply the ideas that I'm introducing here today. That's why we're relatively theoretical today, and maybe this is a bit of an experiment.
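Before moving on to trees, the chain-structured sum-product algorithm recapped above can be sketched in a few lines. This is my own illustration with made-up numbers: forward and backward messages yield all marginals of a discrete Markov chain in linear time, which we verify against brute-force summation over all paths.

```python
from itertools import product

# A binary Markov chain p(x1..x4) = prior(x1) * prod_t trans(x_t, x_{t+1}).
prior = [0.6, 0.4]                      # p(x1), made-up numbers
trans = [[0.7, 0.3], [0.2, 0.8]]        # trans[i][j] = p(x_{t+1}=j | x_t=i)
T = 4

# Forward messages: alpha[t][j] sums out the chain before time t.
alpha = [prior[:]]
for t in range(1, T):
    alpha.append([sum(alpha[-1][i] * trans[i][j] for i in range(2))
                  for j in range(2)])

# Backward messages: beta[t][i] sums out the chain after time t.
beta = [[1.0, 1.0]]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(trans[i][j] * beta[0][j] for j in range(2))
                    for i in range(2)])

# Each marginal is the normalized product of the two incoming messages.
marginals = []
for t in range(T):
    unnorm = [alpha[t][s] * beta[t][s] for s in range(2)]
    z = sum(unnorm)
    marginals.append([u / z for u in unnorm])

# Brute-force check: sum the joint over all 2^T paths (exponential cost).
def joint(path):
    p = prior[path[0]]
    for a, b in zip(path, path[1:]):
        p *= trans[a][b]
    return p

brute = [[0.0, 0.0] for _ in range(T)]
for path in product(range(2), repeat=T):
    p = joint(path)
    for t, s in enumerate(path):
        brute[t][s] += p
```

The message-passing version costs O(T) while the brute-force sum costs O(2^T); that gap is the whole point of the algorithm.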
In other lectures I've tried to keep the examples quite close to the point where we introduce the corresponding concepts. That's nice because it allows you to directly see how we use an idea, but it also restricts the set of examples we can look at, because they have to be quite simple so that we can do them in a few slides. From the next lecture onwards I want to introduce a larger problem, a larger example, and that requires me to first do a little bit of theory. So let's do that now, and starting from this slide let's see how we can generalize the idea of message passing to graphs that are not chain-structured. The kind of graphs we are going to consider are trees, and there's a reason why we consider trees. It's not just because they are the next easiest thing to consider after a chain; in fact, they are fundamentally exactly the right class of graphs on which to think about message passing. But before I can explain why this is the case, let me first define what exactly a tree is. Many of you will have seen graphs before, so you already know what a tree is, and if you haven't, it's a very intuitive thing to define. Let's first consider undirected graphs, like this one up here. This is in fact a tree. Maybe you can check this visually first.
This is a tree, this is a directed tree, and this is also a kind of tree, one we will have to give a special name to. Why? Well, first of all, this does look like a tree: trees are plants that grow leaves on branches, and the branches of a tree don't grow together again, at least usually they don't, and that's exactly what this definition captures. Here's the formal statement: for undirected graphs, a tree is a graph in which there is one and only one path between any pair of nodes. That means all nodes are connected to each other, but between any pair of nodes there is only ever one path, and such graphs in fact have no loops. So this is easy for undirected graphs, and you can see that this is clearly a tree, even though it maybe doesn't look like your intuition of what a tree looks like; clearly there's only ever one path from one node to another, where a path means that we're not turning around and walking back along an edge. For directed graphs things are a little more tricky because of the directions, but essentially this works as you would expect: a directed graph is a tree if there is only one node which has no parent, in this case this variable, called the root, and all other nodes have exactly one parent. That's this kind of situation. Now, the nice part is that because every single node has only one parent, if we think of this graph as an undirected graph, if we turn it into one, then it remains a tree. Why? Remember what we need to do to turn directed graphs into undirected graphs: we have to marry the parents, to moralize, that is, to connect parent nodes; and because every node here has only one parent, we basically don't have to do that.
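As a small aside (my own helper, not part of the lecture material): the "one and only one path between any pair of nodes" definition is, for undirected graphs, equivalent to being connected with exactly |V| - 1 edges, which is easy to test in code.

```python
# An undirected graph is a tree iff it is connected and has exactly
# |V| - 1 edges; equivalently, there is one and only one path between
# any pair of nodes, and there are no loops.
def is_tree(num_nodes, edges):
    if len(edges) != num_nodes - 1:
        return False                       # wrong edge count: loop or disconnect
    adj = {v: [] for v in range(num_nodes)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, stack = {0}, [0]                 # depth-first search from node 0
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == num_nodes          # connected + |V|-1 edges => tree
```

For example, a star over four nodes is a tree, while a triangle is not.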
Dropping all of the arrows from the edges then gives the undirected form, which is clearly still a tree. Now, there are also graphs like this one, which is the directed version of this. In fact, we will still be able to build our algorithm on this kind of graph. Directed graphs of this kind are called polytrees: directed graphs in which every pair of nodes is connected by one and only one path, which is exactly the definition we used for undirected graphs. If you turned such a graph into an undirected graph, you would have to marry the parents; that means you would have to introduce a connection up here, and then of course this undirected graph would not be a tree anymore. And that's one of the reasons to consider factor graphs rather than undirected graphs: if you build a factor graph out of this, the graph will still remain a tree. Let's think about what we need to do to turn all of these into factor graphs. I'll show them to you in a second, but maybe I don't need to yet, so that you can first think about them yourself. How would we turn this graph into a factor graph? Well, the cliques in this graph are just all the individual pairs of nodes, right?
So we know that there will be individual factors in our factorization of this Markov random field, four of them, each pairwise with the center node, and that means we can just add little function factors on each of the edges. For these graphs, because they are fan-out structures, we will likewise have individual factors on each of the edges, here as well, and then an additional factor that goes into the parent, because in our factorization of this graph in terms of a generative process there has to be a prior for this variable, and then conditional probability distributions for all of the lower-level variables, conditioned on their one single parent. For this graph the situation is ever so slightly more complicated: we will have individual factors here and there, and then we need prior terms, marginal probability distributions, for these two parents. This is what that's going to look like, and you can see that, written as a factor graph, this graph remains a tree, even though, if it were turned into a Markov random field, it would become a non-tree-like graph. Now, why do we consider trees, exactly trees, and not more general classes of graphs for message passing?
Well, if you remember what we did in the previous lecture, we need to be able to send messages, and sending a message basically means that we incorporate knowledge about an entire part of the graph, subsume it into that message, and send it to the variable the message is addressed to. What this means, and it will become clearer once we go through the actual derivation, is that we need to be able to separate one part of the tree entirely from another part of the tree, and combine all of the information that we want to send from one part to the other into one message. Now, if our graph isn't a tree, in particular if it has loops, then it's much more difficult to think about what kind of data structure we need in our messages, because information might also arrive at a subsequent node through other paths. In fact, it turns out that to handle this in a general fashion, we essentially always have to reconstruct our graph so that it becomes a tree: we have to keep track of additional information about potential other messages being sent around, in a way that more or less ends up with us turning the graph into a tree. We'll see how exactly that works in a moment, but this is the motivation for looking at trees. So tree-structured graphs are the fundamental class of graphs on which we construct our message-passing algorithm. That's not because trees are ever so slightly more complicated than chains, and we did a chain example first and now do a tree example; trees are actually the most general class we can look at, and any more general graph, any probability distribution whose graph isn't a tree, will require us to do manipulations on the graph until we essentially have a tree back. All right, so let's get
started. We will now do more or less exactly the same thing as in the previous lecture for chain graphs, but for tree-structured graphs. We begin again by setting the scene and stating exactly what we're going to assume. Let's say we have a tree-structured graph; that means we have a joint probability distribution over a set of variables x1 to xn, and the associated probability distribution p(x) can be represented by a factor graph that has tree structure. If the corresponding graph is undirected, or directed and a polytree, then we just turn it into a factor graph first, and we know how to do this. We will further assume, just as in the previous lecture, that all variables are discrete. That's a relatively mild assumption: it's easy to state what you would do with continuous variables, you just replace all of the sums we're going to encounter with integrals, and then we still do sum-product; you just have to think of the sums as integrals. In practice this is a little more tedious, and it will restrict the kind of probability distributions we can deal with, because of course you can't do arbitrary integrals. That's a practical problem to deal with, but from the conceptual perspective of what the algorithm is going to be, it's fine to think of sums and integrals as the same thing, and for simplicity we will just assume that everything is discrete. Now, what are we going to do? We're going to construct the same quantity that we constructed in the previous lecture, which is the marginal of one particular variable x.
Let's pick out one individual x from the collection, the bold x of all of them, and just call it x, without an index. Otherwise we could call it x_i, but there are going to be a lot of indices floating around in a moment, so let's just call it x. By the way, the exposition I'm going to go through comes from Chris Bishop's book, in fact chapter 8.4, if you'd like to look it up there. To be precise, we are going to construct an algorithm that initially computes just the marginal of this one variable x, but in fact the same algorithm can be used to efficiently construct all marginals in one go, and constructing all of the marginals is essentially as expensive as constructing a single marginal. We'll see that later on. The key insight we're going to leverage, and this is really the one thing, right at the beginning, that allows us to do all of this, is that the graph is a tree, and therefore we can write the joint probability distribution p(x) as the factorization provided by the graph. By that factorization I mean that we can think of this graph as containing individual terms, and I will call those terms capital F_s, where s is an index that runs over the neighbors of x. The set of neighbors of x is the set of all factor nodes (because the graph is bipartite, these can only be factor nodes) that share an edge with x. Along each of these neighbors, we can think of the terms that are connected to that factor by paths leading from it deeper into the graph, and we will call this entire subgraph, the one connected to x through one of its neighboring factor nodes, capital F_s, where s is the index that runs over the neighboring factors. The key thing is that we are allowed to do this because
the the graph is a tree well or more precisely the key insight is that these subgraphs f s do not overlap Because it's a tree because there is no possibility For a hidden connection somewhere further away between those neighborhoods. That's what we are going to leverage And that's actually all so once we've gotten this the rest is going to be relatively straightforward So let me go through this once more slowly Because the graph is a tree we can think of the neighbors of x these are factor nodes as entries into subgraphs That are disjoint from each other And now what we're going to do is we're going to try and construct the marginal distribution over x itself So to do so we just okay Let me see this again. What is the marginal just to be clear the marginal is a sum over all possible states of all the variables in the graph that are not the variable we care about and I will use this notation for it So that's a bold x Without little x so we need the sum of all states of all the variables in the graph that aren't the one variable be care about now because sums and products commute We can take the summation which we need for our marginalization inside of the product term and Because these individual subgraphs are disjoint from each other there will be certain like like each each of the subgraphs contains a set of nodes xs Which are disjoint from each other under the subgraphs and Therefore the sum now breaks up into sub sums into individual sums that only depend on Or that only contain or only indexed over Variables that are in that subgraph Now what we're going to do and that's really just a notational trick But it sort of gives rise to the entire idea of message passing is we're just going to call this object a Message because what is this? 
It's a function that depends only on the variable x that we care about, for which we are trying to construct a marginal. In particular, it's not a function of all the other variables in that subgraph anymore, because we have summed them out. Of course, summing them out might potentially be expensive, but maybe it isn't; and once we've done it, we are left with an expression that is only a function of x. In the picture from the last lecture, for the chain graphs, you can think of this object, which happens to be a summed-out multivariate array, as a vector, where this vector contains positive numbers, not necessarily a probability distribution, but positive numbers, unnormalized probabilities, assigned to each possible value of our variable little x. We will call this object a message: the message mu that is sent from the factor f_s, the entry into this subgraph, to our variable x. Because we are now talking about a tree rather than a chain, there might be several, not just two, but potentially a larger number of these messages coming into our x from all the individual neighbors, and that's exactly the situation we face.
The marginal over x, we've now shown, is a product of the incoming messages from the neighboring factors. By incoming messages we mean, for each neighbor, a vector, or a function in the continuous case, containing positive numbers amounting to unnormalized probabilities for every possible value that x can take, or in the continuous case a non-negative-valued function over the domain of x. Maybe it's also useful to keep a tighter connection to what we did in the previous lecture. In the last lecture, when we spoke about chains, I constructed the algorithm by starting a recursion from the other end. We said: let's look at the two ends of the chain, the beginning and the end, and when we now do our summation for the marginalization, we notice that there is one term in the summation that is particularly easy, the one at the very end of the chain. So we first sum out the final state of the chain, then the penultimate state, then the state two steps from the end, and so on; and at the end of this recursion we reach the variable we're trying to compute a marginal over, and what arrives there turns out to be a message. In this derivation here we start the other way around: we immediately begin by writing down the messages coming into our node, and then we expand them, which is what we're going to do on the next slide, recursively into subgraphs further and further out. The reason we need to do it this way, or at least why it's more convenient here, is that we have a tree graph.
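The statement that the marginal of x is the product of all incoming factor-to-variable messages looks like this in code (a sketch of mine, with made-up message values):

```python
# The (unnormalized) marginal of a discrete variable x is the elementwise
# product of all messages arriving from its neighboring factors.
# These message values are invented purely for illustration.
msg_from_f1 = [0.5, 1.0, 0.25]   # mu_{f1 -> x}(x) for x in {0, 1, 2}
msg_from_f2 = [0.2, 0.2, 0.8]    # mu_{f2 -> x}(x)

unnorm = [a * b for a, b in zip(msg_from_f1, msg_from_f2)]
Z = sum(unnorm)                  # normalization constant
marginal = [u / Z for u in unnorm]
```

With more neighbors the product simply gains more terms, one per incoming message.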
In a tree there are no two natural opposing ends, one beginning and one end, from which to make the construction; there are potentially many ends, and these ends are called the leaves of the tree. Our recursion will end in a leaf, and we know that there is such a leaf because the graph is a tree. So let's continue our recursion from here. We're going to descend into the graph: we look at one particular factor node f_s, which is among the neighbors of the variable x, and think about what is contained in this cloud, in this subgraph that we called capital F_s. Well, what it contains is an entire remaining set of factor and variable nodes, which we will now separate into the local variables, where by local I mean direct neighbors of the factor f_s, and the other stuff. The other stuff is connected, disjointly, because the graph is a tree, to the individual variables x_i. So let's call these variables x_1 to x_M, each individual one being an x_i, and there are M of them in total. This subgraph can then be written as the local factor, the term contributed to the entire graph by this particular factor node, times all the remaining bits, which sit in subgraphs that can be indexed from 1 to M. These subgraphs themselves contain variables, one of which is special in each case because it's a direct neighbor of our factor f; those are the x_1 to x_M. (There's a bug on this slide: this should be just x_M. I'm sorry, I'll fix that.) And then every individual subgraph, in addition to these direct neighbors, can of course contain more variables, unless it's a leaf; but if it's not a leaf, it can contain many more further variables, and those variables are, again, disjoint sets.
If a variable shows up in one subgraph, it does not show up in any other subgraph, because the graph is a tree. Let's give names to these sets of variables: call them bold X_{s,1} to X_{s,M}, the s indicating that they hang off the factor f_s. Now we know the structure of this thing, so let's see what we need to do to compute our factor message, the object from the previous slide that we subsumed into a message sent from the factor f_s to x. What is it? It is a sum, over all of the variables in that subgraph, of the terms in that subgraph. We now know what that term is, so let's plug it in and do the entire sum. We need to sum out all the variables in the subgraph, but we have just seen that these variables have structure. There is the set of local direct neighbors, which we have to sum over entirely; this sum has to stay on the outside of the factorization, because these variables actually show up in the local factor f_s and therefore have to be summed out at this level. All the other variables, of which there can potentially be many, sit in the subgraphs G_1 to G_M. Again, they are disjoint sets, so we can take the sum over these other variables and move it inside the product over the individual subgraphs, and then each subgraph has its own separate sum, which can be done inside that subgraph. In doing so we are closing our induction loop: we notice that this structure is itself again of the type of a message. In the discrete case you can think of it as a multivariate array in which we have summed out all but one dimension.
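In the discrete case, this factor-to-variable step can be written down directly as array operations. The following is a minimal sketch (the function name, the example factor table, and the use of NumPy are my own, not from the lecture):

```python
import numpy as np

def factor_to_variable_message(factor, incoming, target_axis):
    """mu_{f -> x}: multiply the factor table by all incoming
    variable-to-factor messages, then sum out every axis except
    the one belonging to the receiving variable."""
    result = np.asarray(factor, dtype=float).copy()
    for axis, msg in incoming.items():
        # broadcast each 1-d message along its axis and multiply it in
        shape = [1] * result.ndim
        shape[axis] = len(msg)
        result = result * np.asarray(msg).reshape(shape)
    other_axes = tuple(a for a in range(result.ndim) if a != target_axis)
    return result.sum(axis=other_axes)

# tiny example: a bivariate factor f(x, y), x and y binary,
# with a unit message coming in from y (e.g. y is a leaf)
f = np.array([[0.3, 0.3],
              [0.4, 0.0]])
mu_y_to_f = np.ones(2)
mu_f_to_x = factor_to_variable_message(f, {1: mu_y_to_f}, target_axis=0)
print(mu_f_to_x)  # unnormalized message for x: [0.6, 0.4]
```

The result is a one-dimensional array of non-negative numbers over the values of x, exactly the kind of object the lecture calls a message.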
So we are left with a one-dimensional array, or in the continuous case a univariate function, containing non-negative numbers, that is, unnormalized probabilities, for the one particular variable x_i. These individual one-dimensional arrays now get multiplied with our factor f, which is a multivariate array, a multi-dimensional table. The one-dimensional objects we can think of as messages again, and they are messages sent from a variable node to a factor node, rather than from a factor node to a variable node as we have on the outside here. The only distinction between messages from factors to variables and messages from variables to factors is that the former contain this additional term, the factor itself. Now, the interesting thing that gives the algorithm its name is this algebraic structure: the messages sent from factors to variables happen to be a sum over a product of individual terms, where most of the terms are just one-dimensional arrays and one of them is a multivariate array. That is why this is called the sum-product algorithm. Okay, so now we are almost done. One minor thing you might be worried about, before we can close the induction loop, is that these messages from variables to factors might have a different type, might look different.
They might harbor some problems that we do not have for messages that come from factors to variables. So let's be clear about this and check that the messages sent from variable nodes into factor nodes are computable as well; in many ways they are very similar to the messages sent from factors to variables. To do so, we expand one of the subgraphs G_l that is connected to one of the variables x_i. What kind of subgraph do we have here? Because we have a bipartite tree, a factor-graph tree, this subgraph again consists of a product over individual factors, each of which is connected to its own subtree, and each of which is also connected to x_i. So when we compute the message sent from x_i into the factor f_s that we care about at this level of the recursion, we need to sum out all the other variables that are not x_i; that is from the previous slide. We can do so because we know that this subgraph has this factored structure. We plug it back in, a direct copy from above, and can therefore exchange the product and the sums, and the sums split up into sub-sums for the individual factors connected to x_i. Maybe the right way to phrase this, to be a little more careful: within the set of variables X_{s,i}, those that are connected to x_i through these various factors, there are again subsets of variables, each of which is disjoint from the others and is connected to x_i only through the factor f_l.
Let's give a name to these subsets and call them X_{i,l}, where l goes from 1 to capital L. Then we can turn this big sum over many variables into a product of smaller sums, each over a smaller subset of variables, such that the subsets are disjoint and jointly make up the entire set. Now we notice, and here is really where we close our recursion, that we have exactly the same situation as two slides ago: we are constructing a message (we should not call it a marginal, because it is unnormalized) sent from a factor node, in this case f_l, into the variable x_i. This message is a function: in the discrete case an array of non-negative numbers, in the continuous case a function with non-negative values of x_i. It is a one-dimensional object, or actually not necessarily one-dimensional, but of the dimensionality of x_i; x_i might of course itself be a variable with more than one dimension, and then the message has the same dimensionality as x_i. So to compute these variable-to-factor messages, we take the product over all incoming factor-to-variable messages, and those messages are themselves sums. Whether you want to think of "sum-product" as referring to this outer sum over a product, or more fundamentally to the one inside, does not really matter, because they all amount to the same thing. Okay, that is our recursion. Where does this recursion end? What happens at the very end of this expansion?
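The variable-to-factor step just derived is even simpler than the factor-to-variable one: in the discrete case it is just an elementwise product of the incoming arrays. A small sketch (names and example values are made up):

```python
import numpy as np

def variable_to_factor_message(incoming):
    """mu_{x -> f}: elementwise product of all messages arriving at x
    from its neighboring factors other than the recipient f."""
    msg = np.ones_like(incoming[0], dtype=float)
    for m in incoming:
        msg = msg * m
    return msg

# two factor-to-variable messages arriving at a binary variable
mu = variable_to_factor_message([np.array([0.6, 0.4]),
                                 np.array([0.5, 0.5])])
print(mu)  # [0.3, 0.2]
```

Note that with a single incoming message the product just passes it through, which matches the chain case from the previous lecture.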
At some point we are going to reach a part of this graph where, if we expand further, we are not going to get anything more; we are just left with an end of the graph. This end could be either a factor node or a variable node, depending on where we are in the graph. So we need to define the messages sent from such leaf nodes, for the case that the leaves are variables and for the case that they are factors. To do so, we basically have to ask what entries we need in order to keep the metaphor of a message alive, and the definition arises very naturally. Let's say we reach a leaf of the tree that is a variable node, and let's check what we need to do to compute messages from variables to factors. On a previous slide we said: take the product of all incoming factor-to-variable messages. Well, there are no incoming factor-to-variable messages; there is just nothing coming in. And that means there is nothing else to sum out either. Notice that there is of course still the variable x itself, which we are going to hand on, but there are no other variables: factor-to-variable messages are sums over all the other variables in the subgraph, and there are no other variables in the subgraph, so there is no sum to do here. Now let's think about what we are going to do with that message: we are going to plug it into our factor-to-variable message, and to get factor-to-variable messages we evaluate a factor and multiply with the incoming messages. So the incoming message at the recursion's tail end should be something that does not mess up this process.
So ideally, what do we want to multiply with here? Maybe that is a moment where you want to think for yourself; you can stop the video if you like. We want to multiply with a one, with a unit, because one is the unit element of the product, and then it is not going to mess up the further summation. So let's define it that way and say: the message coming in from leaf variables into factors is just a unit, a one, where unit means an array of the size of the domain of x that contains just ones everywhere. Now, what about messages from factors that are leaves into variables? Here again we check what a factor-to-variable message looks like: it is the sum over the product of incoming messages and the local factor. The product of incoming messages is empty, because there is no message coming in, so we basically just have to say what we mean by a sum over an empty set of variables and a product over an empty set of messages. Of course, we mean that they have no effect.
We just hand on the function itself: we define the incoming factor-to-variable message from a leaf factor to be the factor itself, without anything else. With that, we have a recursive algorithm that allows us to compute marginals in tree-structured factor graphs, and it works like this; this is a summary of what we have just constructed. Say we want to compute the marginal of one individual variable, just one. Then we take that variable and treat it as the root of the tree. We are allowed to do that because in a tree-structured graph you can pick any variable and make it the root: you can mentally picture grabbing that variable, pulling the graph up by it, and letting everything else dangle from it; the result is clearly a tree. So we treat that variable as the root and then move away from it all the way to the leaves, that is, we expand the tree until we reach leaves. At all of the leaf nodes, simultaneously or one after the other, it does not really matter, we initialize the incoming messages as our initial messages: these are the factors themselves if the leaf nodes are factors, or unit messages, functions that contain just ones, if the leaf nodes are variables. Once we have done that for all the leaves, we start passing messages inwards, from the leaves towards the root. We can do that because, at least one layer inwards, all the messages we need require only leaf nodes, and then we recursively construct more and more of the quantities needed for further messages as we move inwards. Those messages always have the same form. If we are constructing a message from a factor to a variable node, we construct the sum over the product of terms, where the sum is over all the variables in the factor that are not the variable we are sending to. When I say all the variables in the factor, I mean all the variables connected to the factor by an edge, all the neighbors of the factor; we sum over all of them except the one variable to which we are sending the message. The product is over the factor itself and all the other incoming messages, which have been recursively constructed: messages that come from the other variables into our factor. We can think of these as one-dimensional arrays, or rather arrays of whatever dimension x has, and of the factor as a multivariate array that contains the functional relationship between all of these variables. If we are constructing variable-to-factor messages, we just take the product over all of the incoming messages, where incoming means that they come from the factors connected to the variable other than the one factor to which we are sending. Eventually, in this process, we reach the root. The root is a variable node, so we have lots of incoming factor-to-variable messages, and there we just take the product of all of these messages and normalize. That normalization is now a cheap operation, because we only have to normalize over the one local variable x: an integral or a sum over an object of the dimensionality of x, so maybe a one-dimensional or at least a very low-dimensional object. In doing so we have reached the marginal of this one variable x. But what if we wanted to know the marginals of all of the variables?
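The single-root procedure just described can be made concrete on the smallest interesting tree, a chain of three binary variables, with x3 chosen as the root. The factor tables below are invented for the example; a brute-force sum over the full joint serves as a check:

```python
import numpy as np

# chain  x1 -- f12 -- x2 -- f23 -- x3 ;  we pick x3 as the root
f12 = np.array([[0.9, 0.1],
                [0.2, 0.8]])   # f12[x1, x2]
f23 = np.array([[0.7, 0.3],
                [0.4, 0.6]])   # f23[x2, x3]

# initialize at the leaf x1 with a unit message, then pass inwards
mu_x1_f12 = np.ones(2)
mu_f12_x2 = (f12 * mu_x1_f12[:, None]).sum(axis=0)   # sum out x1
mu_x2_f23 = mu_f12_x2                                # only one incoming message
mu_f23_x3 = (f23 * mu_x2_f23[:, None]).sum(axis=0)   # sum out x2

# at the root, normalize locally
p_x3 = mu_f23_x3 / mu_f23_x3.sum()
print(p_x3)

# brute-force check against the full joint p(x1, x2, x3)
joint = f12[:, :, None] * f23[None, :, :]
assert np.allclose(p_x3, joint.sum(axis=(0, 1)) / joint.sum())
```

Each message only ever touches one factor table at a time, which is where the linear scaling in the number of variables comes from.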
Now, here is a very nice aspect of this algorithm: it is actually possible to construct the marginals of all variables in one go. You may have noticed that, as we ran this process for our one variable x_i, we were actually constructing all the messages necessary to compute marginals everywhere. Once the process has finished, we essentially have all the messages we need, except that we have only been sending messages in one direction; we now have to send them in the other direction as well. So here is the algorithm again, and this is the proper sum-product algorithm for the computation of the marginals of all variables in a joint graph. Choose any x_i as the root; as on the previous slide, we basically just choose one at random and treat it as the root. Then begin as before: start at the leaf nodes, initialize all the messages, and send messages from the leaves to the root. Once we have reached the root, pass messages from the root back towards the leaves. We can do so because at this point every variable has all the messages necessary to send messages downwards. Why? In the upwards pass from the leaves to the root, the only neighbor a node did not yet have a message from was what you could call its parent in the tree. As we now pass from the root towards the leaves, we begin from the top parent of all variables, the root, and move downwards, so at every step we have all the necessary messages left to send to the child nodes. Once all nodes have received all messages from all their neighbors, each can take the product over the messages coming in from its neighboring factor nodes and normalize locally, and then we have marginals over all the variables you might want marginals over. This algorithm therefore constructs the marginals over all variables in a tree-structured graph
in a time that is linear in the number of variables in the graph. We start at the leaves and move all the way to the root, passing messages upwards, so this step takes as many computational steps as there are variables in the graph; then we pass back down again, which is the same kind of operation and takes just as long. So we have spent two operations, each linear in the size of the tree, and therefore we have inference in linear time. With this we are basically done; that is the key result of this lecture. We still have a little bit of time, and there are a few open questions I would like to address before we summarize and then move on to a few variants and generalizations. Before we do that, let me point out a few things that have not been addressed by this construction yet but are essentially implicit. The first thing to note is that we do not actually need to formally define a separate kind of message from variables to factors. You might have noticed during the derivation that it is almost a little ad hoc to think of this as a special step, because there is no individual term here that depends only on a variable: the next thing that shows up in this expansion is another factor. So it is also possible to define this exact same algorithm exclusively on factors. Associated with this is the idea that you can even define factor graphs entirely without variable nodes.
You can just write graphs that only contain factors. This is the so-called Forney notation of factor graphs, which I have not shown because I do not want to confuse you even further, given that we have already had two other kinds of graphical models. So, to restate what I just said vaguely: you can subsume the notion of a message being passed from a variable to a factor, which is essentially already contained in the definition of the messages sent from factors to variables, by directly defining messages that are sent from factors to factors. They are of the same type: really just sums over products of the factor and all the incoming factor-to-factor messages. To define them you need to do more or less nothing; the only real difference is that the product is not over the neighboring variables anymore but over the neighboring factors directly. This is just a minor observation on the side, and not particularly relevant for the further constructions we are going to do. What is much more relevant is a question that I have so far swept under the carpet: what happens if we actually observe something? The whole point of probabilistic inference is that you want to condition a probability distribution on observations; you want to do Bayesian inference. All the graphs I have shown you so far contained variables that are unobserved, that have probability distributions assigned to them. That means we do not yet know how to include data in our computation. What do we do if we want to add an observed variable?
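To anticipate the answer in code: for a discrete variable, observing it will amount to sending an indicator message, so that the sum over its values collapses to an evaluation at the observed value. A toy sketch (the factor values are invented):

```python
import numpy as np

# a single bivariate factor f(x1, x2), both variables binary
f = np.array([[0.3, 0.3],
              [0.4, 0.0]])

# observing x2 = 0: the clamped variable sends an indicator message,
# so the sum over x2 collapses to an evaluation at x2 = 0
delta = np.array([1.0, 0.0])
mu_f_x1 = (f * delta[None, :]).sum(axis=1)
posterior = mu_f_x1 / mu_f_x1.sum()   # p(x1 | x2 = 0), normalized locally
print(posterior)

# the same thing without message passing: condition the joint directly
assert np.allclose(posterior, f[:, 0] / f[:, 0].sum())
```

The local normalization at the end is exactly the cheap per-variable normalization from the marginal computation.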
Well, it turns out that in this message-passing view, at least as long as you want to construct marginals, conditioning on observed variables is actually surprisingly easy. The reason is that you can think of an observed variable as being described by a probability distribution in which all the mass sits in one of the possible values, in one of the buckets. In the discrete case that is relatively easy: if a node, or possibly multiple nodes, in your graph is observed, you just set the corresponding probability distribution to zero everywhere except at the one value it takes, and set that value to one; for a continuous variable you need to use a slightly more careful construction. The way to encode this in the message-passing viewpoint is one of two things. Either you introduce an extra factor attached to that variable, which has the form of a Kronecker delta or a Dirac delta, depending on whether the variable is discrete or continuous, and which simply says: this variable has just this particular value. Or, more directly, and that is of course what you visually want to do, you just color in that variable, fill it with a dark color, and in doing so decide that when we compute the sum in the next factor-to-variable message, we simply set that variable to whatever value we are clamping it to, rather than summing it out. That is essentially what the delta message says: when you do that sum, don't do the sum, and instead just set me to the value I am supposed to have. Now, why can we use this clamping, this setting of a variable to a particular value, for inference? We can do so because the conditional distribution, the posterior distribution, that you might
be interested in is essentially a joint distribution where something is set to a particular value. That is actually what this slide already says, but maybe for clarity let me write it down. Bayes' theorem: say we have two sets of variables in our graph, one called x_h and the other x_o, h for hidden and o for observed. What we typically are interested in is the probability of the hidden variables given the observed ones. By Bayes' theorem, this is given by the joint distribution of the observed and hidden variables divided by a normalization constant. When we compute marginals, we do not really have to worry about these normalization constants, beyond the fact that we compute them locally for the marginals. And obviously we will never need a marginal over the observed quantities, because we know what they are; the only thing we care about is the marginals of the hidden variables, and for those we can write the posterior, up to normalization, as the joint distribution evaluated at the particular observed values. That is exactly what happens when we clamp a variable to a particular value: we just set it in our marginal computation, and then the corresponding marginal probability for the hidden variables that we arrive at through the sum-product algorithm actually evaluates the posterior marginal distribution, up to a normalization that we can do locally for the hidden variables. So the sum-product algorithm, or more generally message passing, can also be used to construct marginals in the inference setting, in graphs that are conditioned on observed variables, and we will see an example of that later. Now, the final question I want to address is: what do you do if your
graph isn't a tree? In general we could of course encounter settings with graphs that just are not trees. The answer is going to be a little disappointing, because it is simply: you somehow turn the graph into a tree. How do you do that? Notice that all the derivations we have done so far, and I have mentioned this along the way every now and then, work even if the variables at the variable nodes are not univariate. No one is saying that x has to be a one-dimensional object, a variable with only scalar values. It can be vector-valued, and then the sums are just over all possible instantiations of that vector: if it is discrete, that is a multivariate array to be summed out; if it is continuous, a multivariate function to be integrated out. We can use this ability of our algorithm to deal with multivariate variables to solve the problem of graphs that are not trees: we take our graph, whatever it is, and (let me actually show you the correct slide for this) subsume the nodes that cause the graph not to be a tree into a larger group of variables, such that the resulting graph is a tree, and then we can just run sum-product again. Now, of course, the question that immediately arises is how you actually do that in practice.
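Done by hand, the merging looks like this: in a small loopy graph over three binary variables with pairwise factors, we can fuse x2 and x3 into one four-state meta-variable, after which the graph is a trivial tree and sum-product applies. The factor tables are invented for illustration:

```python
import numpy as np

# a loopy graph: pairwise factors on (x1,x2), (x2,x3) and (x1,x3),
# all variables binary (unnormalized factor tables, invented)
f12 = np.array([[2., 1.], [1., 3.]])
f23 = np.array([[1., 2.], [2., 1.]])
f13 = np.array([[3., 1.], [1., 4.]])

# merge x2 and x3 into one meta-variable y with 2 * 2 = 4 states;
# the three factors collapse into a single factor g(x1, y), and the
# resulting graph  x1 -- g -- y  is a (trivial) tree
g = (f12[:, :, None] * f23[None, :, :] * f13[:, None, :]).reshape(2, 4)

# sum-product on the tree: y is a leaf, so it sends a unit message
mu_g_x1 = (g * np.ones(4)[None, :]).sum(axis=1)
p_x1 = mu_g_x1 / mu_g_x1.sum()
print(p_x1)

# brute-force check on the original loopy joint
joint = f12[:, :, None] * f23[None, :, :] * f13[:, None, :]
assert np.allclose(p_x1, joint.sum(axis=(1, 2)) / joint.sum())
```

The price is visible in the shapes: the meta-variable's domain is the product of its members' domains, which is exactly the exponential cost mentioned below.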
If you are constructing your graph yourself, you can take care to do this, but ideally you want an algorithm that does it automatically for you, and that is the kind of algorithm I do not want to talk about in this lecture, because it is a little too tedious. It is actually the subject of the paper by Stefan Lauritzen and David Spiegelhalter from 1988 that I mentioned in the previous lecture, and the algorithm is called the junction tree algorithm. It is the typical kind of method you have to build when dealing with graphs, and this is a lecture about probabilistic machine learning, not about graphs, so I am only going to hint at the fact that such an algorithm exists. It involves the kind of steps you might expect: you take your graph and find loops in it, in particular chordless loops larger than three variables, that is, the smallest loops with more than three variables; you fully connect the corresponding sets of variables in each such loop so that they form a clique; then you build the maximal cliques of this graph and try to connect those maximal cliques to create a new factor graph. Doing so is a little bit tricky.
It requires a bit of care to do it in the right order. That is the idea behind the junction tree algorithm, and if you want to look it up, you are free to do so. The main insight is that the reason we do all this is pretty straightforward: we do it so that we end up with a tree, and then we can do message passing, sum-product inference, again. So if you encounter a graph that is not a tree, you subsume variables into larger, higher-dimensional meta-variables such that the resulting graph is a tree again, and then you just apply sum-product. With that we are now finally at the end of the section on the sum-product algorithm, and the lower half of this slide basically constitutes a gray summary slide. If we are performing probabilistic inference in a joint probability distribution that can be described by a factor graph, and that factor graph is a tree, then we can construct the marginal, even the conditional marginal, conditioned on some observations, over any variable in the tree in linear time, linear in the size of the tree. The individual local computations are still exponential in the dimensionality of the local variables; more precisely, they are exponential in the dimensionality of the factors in the graph. Remember the example of the Markov chain, where we had to spend quadratic time to integrate out or sum out bivariate relationships: if you have a factor that contains k variables, you typically have to spend time exponential in k to do this local computation. But in terms of the number of variable nodes in the graph, inference is still linear. If your graph is not a tree, you can make it a tree by subsuming variables into constructed meta-variables, and that of course potentially raises the computational cost significantly, because of the exponential dependence on the dimensionality of each variable. That is a bad thing, so let's try to find
tree-structured graphs as much as possible. We have also noticed that we can use this framework even if we want to condition on variables in our inference process. In the sum-product algorithm we now have an algorithm that constructs marginals of variables in tree-structured graphs in linear time. Now, marginals are great, and they are useful for many applications, but of course they are not the answer to every possible question you would ask of an inference problem. In general, to answer any inference problem we still need the joint distribution, and computing the joint distribution, even in such a tree-structured graph, is still potentially exponentially hard. One way to generalize the sum-product algorithm at least a little, and this is just a side remark that is not actually on this slide, concerns the marginal distribution over several variables in the graph, a joint marginal if you like. If those variables are direct neighbors of one particular factor, this is actually relatively easy: you essentially just combine them into one variable and add that factor as a factor directly connected to that joint variable. That is an easy generalization of the message-passing process; maybe I will put it on an exercise sheet, or not, we will see. But in general, computing the full joint is still expensive. So if we want to get more mileage out of this message-passing paradigm, we have to think about other quantities.
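The side remark about joint marginals over a factor's neighborhood can be made concrete: instead of summing the factor against its incoming messages, we only multiply, and the resulting table is the (unnormalized) joint marginal of the factor's neighbors. A sketch on a three-variable chain (factor tables invented):

```python
import numpy as np

# the chain  x1 -- f12 -- x2 -- f23 -- x3
f12 = np.array([[0.9, 0.1], [0.2, 0.8]])
f23 = np.array([[0.7, 0.3], [0.4, 0.6]])

# messages arriving at the factor f23 from its two neighbors
mu_x2_f23 = (f12 * np.ones(2)[:, None]).sum(axis=0)  # from the x1 side
mu_x3_f23 = np.ones(2)                               # x3 is a leaf

# joint marginal over f23's neighborhood: multiply, but don't sum
p23 = f23 * mu_x2_f23[:, None] * mu_x3_f23[None, :]
p23 = p23 / p23.sum()
print(p23)   # p(x2, x3), a 2x2 table

# brute-force check against the full joint
joint = f12[:, :, None] * f23[None, :, :]
assert np.allclose(p23, joint.sum(axis=0) / joint.sum())
```

The only change relative to a single-variable marginal is the omitted summation, which is why this generalization comes almost for free.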
As I have already pointed out, essentially in the previous lecture's example on Markov chains, another interesting quantity we can compute in a message-passing way, other than marginals, is the maximum a posteriori solution: the configuration of values of the variables in the graph that maximizes the joint probability distribution. As it turns out, and we will do this now, basically by combining what we did today with what we did in the previous lecture, this maximally probable configuration can also be computed in linear time, using a variant of the sum-product algorithm known as the max-product or max-sum algorithm. Just to remind you of something we already mentioned in the previous lecture: we cannot get the most probable configuration by first computing the marginals with sum-product and then finding the maximum of each marginal and assigning that as the most likely joint state, because the location of the joint maximum, and I actually have a slide for that, might not be at the intersection of the marginal maxima. Here is the example that I already did on the whiteboard in the previous lecture, just to remind you briefly: the joint distribution of a bivariate set of binary variables, so two binary variables.
Let's say this is their joint distribution; I wrote it down in the previous lecture already, as an example from Chris Bishop, with probabilities 0.3, 0.3, 0.4, and 0.0. Obviously the state with probability 0.4 is the most likely configuration, but if you compute marginals, the most likely state appears to be a different one, which evidently is not the most likely joint state. However, as it turns out, and it will not be surprising because we already did the example in the previous lecture, we can generalize the process of computing marginals by sum-product to computing maxima. The reason we can do so is that the maximum has the same distributive algebraic property that the summation operation has, and that is really the trick that makes our algorithm work, the trick that allows us to distribute the computation across the tree-structured graph representing our overall joint probability distribution. As in the previous lecture, we noted already that finding the argmax, the configuration that maximizes the probability, requires an additional data structure, the trellis, and that data structure is essentially of the same complexity to keep track of as the message passing itself, so it does not break the fact that our algorithm runs in linear time. That will be true here again, and we are now going to construct the tree-structured generalization of the Viterbi algorithm that I introduced in the previous lecture. Doing so is actually relatively straightforward, so we can do it at the end of this lecture, basically by replacing all the sum operations in the derivation of the sum-product algorithm with max operations, yielding the max-product or max-sum algorithm. So here we go.
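The counterexample is easy to check numerically. Below is one arrangement of the four probabilities 0.3, 0.3, 0.4, 0.0 (the exact placement in the table is my own; the lecture and Bishop's book may order the entries differently):

```python
import numpy as np

# the four probabilities 0.3, 0.3, 0.4, 0.0 arranged as a 2x2 joint
p = np.array([[0.3, 0.4],
              [0.3, 0.0]])   # p[x, y]

px = p.sum(axis=1)   # marginal of x: [0.7, 0.3]
py = p.sum(axis=0)   # marginal of y: [0.6, 0.4]

# maximizing each marginal separately suggests the state (x=0, y=0) ...
print(px.argmax(), py.argmax())
# ... but that state has probability only 0.3, while the joint maximum
# sits at (x=0, y=1) with probability 0.4
print(np.unravel_index(p.argmax(), p.shape))
```

So the componentwise argmax of the marginals and the argmax of the joint genuinely disagree, which is why a dedicated max-product pass is needed.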
This is essentially the slide I had a few slides ago; let me go back and show you that slide for sum-product. Now, all we need to do to get max-product is to replace all summation operations, and there is a summation operation here in particular, and actually only here: we replace that sum with max, and that's it. Let me go forward again: this is the same slide as before, and only the sum has been replaced by a maximum. That is the entire change, because maximization rather than summation is now the operation we want to perform, and we can do essentially the same derivation as for the sum-product algorithm. I am not going to do it, because the only difference is that the word "sum" is replaced by the word "max"; you can check that for yourself if you like. The reason this all works is that the max operation also exchanges with the product, because of distributivity, just as the sum does. However, there is one additional challenge, as before: once we compute the maximum, we also need to keep track of the identity of the state that actually achieves it. For that we have to construct a multivariate generalization of the trellis.
So maybe let me draw this picture again that I had on the pad in the previous lecture. In the preceding lecture we spoke about bivariate factors, because we were talking about chains, and chains involve bivariate factors: factors of two entries, x1 and x2. When we locally find the maximum, this maximum is a function of the variable that we are not maximizing out (or summing out). So if we find the max along this dimension, then we find the value that maximizes each of the columns of this matrix and store it, say in a max message. In addition, back then we also stored the identity of the maximizer, the argmax, and that identity is basically an index that says which entry is the maximizer. In the multivariate case, the only change is that we have to keep track of these maxima across more than one dimension. So now there is, along each of these dimensions individually, a point that maximizes, and there are corresponding entries for the identity of that maximizer, which are also one-dimensional, and we have to store them alongside. Sorry, this is actually wrong as drawn: the factor of course is multi-dimensional, but we can simply keep multiple lists, one per dimension, to identify where the maximum is. Then we run the algorithm as before on our tree: we pick any variable, call it the root, find all the leaves corresponding to that root, and move inwards from the leaves to the root, passing along messages that now contain two different entities. One of them is a message you might call mu (actually, I should have called these mu and phi) that stores the value of the maximum, which is a function of xj; the other stores the identity of the maximizer. Storing those identities is, of course, an object that is locally of larger size.
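For the bivariate chain case, the two stored entities look like this in a short sketch (the factor values here are made up for illustration): maximizing out one variable leaves a message over the other, and the argmax indices form the trellis entries used later for backtracking.

```python
import numpy as np

# A hypothetical bivariate factor f(x1, x2) written as a matrix:
# rows index x1, columns index x2 (values chosen only for illustration).
F = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.2, 0.3]])

# Maximizing out x1 yields a message that is still a function of x2 ...
mu = F.max(axis=0)       # value of the maximum, per value of x2
# ... and alongside it we store the identity of the maximizer (the
# trellis entry): for each value of x2, which x1 achieved the maximum.
phi = F.argmax(axis=0)

print(mu)   # [0.6 0.7 0.3]
print(phi)  # [1 0 1]
```

In the multivariate case the same idea applies per dimension: one list of maximizer identities for each neighboring variable of the factor.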
Its size is linear in the number of variables connected to the factor, but that still means we can do this message passing in linear time and store these objects. Once we reach the root, we know the actual value of the maximum, and we can pass back down from the root. As we pass back down (this process is called backtracking), we identify which values of the variables actually maximize the probability. This algorithm is called max-product in the literature. We could also construct an algorithm called max-sum by noting that we can just as well work with logarithms of our probabilities. We can still identify the maxima, because the logarithm is a monotonic transformation, so it doesn't change the location of the maximum; the value of course changes, but only by taking the logarithm of it. The only thing that changes then is that all factors f now have a logarithm in front, and all products turn into sums. There is also a minor change in the initialization at the leaves. For max-product at the leaves everything is still fine, because we're still doing products, and the unit element of the product is still one, so we initialize our variable-to-factor messages as one. But when we take logarithms, products turn into sums, and the unit element of summation is zero, not one. So we have to take the logarithm of one, basically, and store a zero message. Other than that, everything is exactly as before, and we can use this algorithm.
This algorithm, which is extremely closely related to sum-product, lets us construct most likely configurations of potentially very high-dimensional joint probability distributions, as long as the joint is represented as a factor graph with tree structure. Maybe just as a fun thing to point out at the end, for those of you who have taken an optimization class: you might have heard of an idea called dynamic programming, an optimization method that separates a complicated optimization problem, such as a control problem, into a local computation and then a subsequent optimization, such that you can move backward from a time horizon to take optimal decisions. It turns out this algorithm is extremely closely related: the max-sum algorithm is essentially a dynamic programming algorithm. If you don't know what dynamic programming is, you don't need to understand this at all. The main difference between this algorithm and dynamic programming, at least conceptually, is that there is no control variable here: we are just finding the most probable path rather than the optimal path under some control, and we are not taking optimization steps in the sense of choosing control inputs; we are just figuring out the solution to an optimization problem directly. That this connection exists is indicated by the fact that our max-sum algorithm involves this passing backward, from the root, of the identities of optimal paths. This has exactly that structure (actually, I should point here at the messages moving backward; note there is a phi missing, this should be a phi rather than a mu), and this equation is actually a Hamilton-Jacobi-Bellman equation, or rather just a Bellman equation, because it has no integral here; it's not a differential equation.
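To see the dynamic-programming structure in one place, here is a minimal sketch of max-sum on a chain with backtracking. The function name and interface are my own for illustration: unary and pairwise factors are given in log space, the forward pass pushes max messages toward the root (the last variable) while recording maximizer identities, and the backward pass recovers the argmax configuration, exactly the Bellman-style recursion just described.

```python
import numpy as np

def max_sum_chain(log_unary, log_pair):
    """Most likely configuration of a chain-structured factor graph.

    log_unary: list of T vectors, log_unary[t][i] = log f_t(x_t = i)
    log_pair:  list of T-1 matrices, log_pair[t][i, j] = log f(x_t = i, x_{t+1} = j)

    A sketch of max-sum: in log space products become sums, and the
    leaf message is initialized to zero (the log of one).
    """
    T = len(log_unary)
    mu = np.asarray(log_unary[0], dtype=float)  # leaf: zero message added
    back = []                                   # trellis of maximizer identities
    for t in range(1, T):
        # scores[i, j] = incoming message + log pairwise + log unary of x_t = j
        scores = mu[:, None] + log_pair[t - 1] + np.asarray(log_unary[t])[None, :]
        back.append(scores.argmax(axis=0))      # best predecessor per state
        mu = scores.max(axis=0)                 # forward (inward) max message
    # Backtracking: pass the optimal identities back down from the root.
    x = [int(mu.argmax())]
    for phi in reversed(back):
        x.append(int(phi[x[-1]]))
    return list(reversed(x)), float(mu.max())

# The two-variable joint from earlier, encoded as a single pairwise factor
# (a tiny value stands in for the zero entry to keep the logarithm finite):
logP = np.log(np.array([[0.3, 0.4],
                        [0.3, 1e-12]]))
config, logp = max_sum_chain([np.zeros(2), np.zeros(2)], [logP])
print(config, np.exp(logp))  # [0, 1] 0.4
```

The forward maximization plus backward identity-passing is exactly the shape of a discrete Bellman recursion, just without a control input to choose.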
It's just a discrete optimization equation. If you've heard about Bellman equations before, then maybe it's useful for you to make this connection; if you haven't, you can simply ignore what I said in the past two minutes. With this, we're at the end of today's lecture. What we have seen today is that in joint probability distributions represented by factor graphs with tree structure, inference on the marginals, and in fact also on the most likely configuration, can be performed in linear time in the size of the graph, so linear in the number of variables in the graph, by a class of algorithms that perform message passing along the graph. If these algorithms construct the marginals, they are called sum-product; if they construct the most likely configuration, you might call them max-product or max-sum, depending on whether you are working with probabilities or with logarithms, that is, with log probabilities. Both of these algorithms rest fundamentally on the fact that the max and sum operations adhere to a distributive law, and therefore we can shift them through the factorization encoded by the factor graph, such that they create local computations that can then be passed along the graph as we move up and down. Now, you may have noticed that this lecture and the one preceding it, maybe even the one before that as well, were relatively theoretical in nature. Maybe the honest answer for why this is, is that the algorithms I've introduced here today, this message-passing idea, even though it is a very powerful idea, on its own has relatively severe limitations. One way to phrase that is that it really only works for discrete distributions. Now, that's not quite true.
I already mentioned today that if you have continuous variables in your graph, you can just replace the corresponding sums with integrals, and then formally everything stays the same, as long as the integrals exist and can be computed. But of course in practice that's not quite true, because solving an integral in general is intractable. The only real hope of solving an integral exactly is if it happens to have a particular form we know how to solve. That's usually where exponential family distributions come in: exponential families can to some degree help with this kind of issue, because they provide conjugate combinations of prior and likelihood under which inference remains tractable. But as we've seen in the lecture on exponential families, these are really only available for very specific combinations of prior and likelihood, and for very specific kinds of data. In real-world applications, we will often encounter more or less arbitrary combinations of different kinds of variables that interact with each other, and if we really want a formal framework that allows us to do inference in such general models, then at some point we basically have to give up the hope of a totally formal framework and allow ourselves to use approximations. This will then often lead to a somewhat hotchpotch combination of different kinds of approximations, like Monte Carlo, like Laplace approximations, but also other approximate frameworks that we are going to encounter in the remaining lectures until the end of term, combined with the idea of message passing, with sum-product, but in an approximate fashion, to really get something that works in practice. We are going to begin this process from the next lecture onwards. For today, though, we're at the end. Thank you very much for your time.