Hello and welcome to Probabilistic Machine Learning, lecture number 16. We began this course with the observation that probability theory is a tool to extend truth values from the discrete values true and false to a continuum, which allows us to distribute truth across a space of hypotheses and thereby extend propositional logic to reasoning under uncertainty. Already in lecture two we noticed that this process comes at potentially high computational cost, because it requires us to keep track not just of one single hypothesis which we deem true, but of an entire, potentially combinatorially large, space of hypotheses which we have to simultaneously track and assign truth values to. And ever since then, the course has been about developing computational tools to deal with this computational complexity. We used Monte Carlo and Markov chain Monte Carlo methods to produce random numbers, samples, which allow us to compute integrals approximately, and therefore expected values, marginals, and evidences over probability distributions, in a discretized and approximate fashion. We then quickly realized that there is an extremely powerful framework, the Gaussian framework, which is particularly amenable to settings in which the variables we reason about are linearly related to each other: if you assign Gaussian distributions to linearly related variables, then all the inference boils down to linear algebra. That framework, it turned out, is actually quite a powerful one, which can be used to build probabilistic learning machines for supervised problems, in particular first for regression, so for learning functions that map from a general space to the real line. This can be done in a parameterized fashion, and we saw that this is connected to the notion of deep learning if you want to do maximum likelihood inference on the parameters of the representation, on the features of this function. But it can also be extended, in some sense in the opposite fashion, from a deep representation with finitely many degrees of freedom to an essentially infinitely wide representation that is associated with the notion of a Gaussian process, and that provides a non-parametric, in some sense infinite-dimensional, representation of unknown functions. We also saw, apart from various interesting theoretical properties of these models, that they can be extended to particular kinds of conditional independence, in particular for time series, and can be extended approximately to settings in which the observations aren't real-valued, but can be seen as some kind of transformation of an underlying latent real-valued function. This gave rise first to the notion of classification for binary outputs, multi-class classification for discrete outputs, and generalized linear models for functions that map from a general input domain to a relatively general output domain. To do so, we already had to use some form of approximate inference, in particular the Laplace approximation. Now, in the previous lecture, we found a generalization of the Gaussian framework in a perhaps surprising direction, which provided a framework for learning probability distributions rather than functions. This is connected to the notion of an exponential family, which in some sense is a parameterized representation of a family of probability distributions such that inference in these models is particularly computationally efficient: we can do maximum likelihood inference, maximum
a posteriori inference, or even full Bayesian inference. Today I'd like to return to the very beginning of the lecture course, to lecture number two, to what we observed there, and then see how the initial, sort of simple, observations we made there can be extended into a more general framework, which we will then use for the rest of the course to build ever more powerful, quite structured probabilistic models. So let's recap what we did in lecture number two. Back then I showed you an example from a book by Judea Pearl. It was a very simple example, essentially a very low-dimensional discrete inference problem that could be done without continuous variables. It's connected to a story about a guy who lives in a place that has both earthquakes and burglaries, and one day gets a call in his office that the alarm in his house is ringing, and now has to reason about whether the cause for this alarm is an earthquake or a burglary or actually something else. Then later he gets information from the radio that there has actually been an earthquake, and we discovered that this kind of observation has an interesting effect on the joint probability distribution over earthquake and burglary. Maybe the more fundamental insight back then was that joint probability distributions over such a set of variables, in this case four binary variables, have in general combinatorially many degrees of freedom. So if you have four binary variables, you have 2^4 - 1 = 15 degrees of freedom, because the 16th possible state has a probability that is given by one minus the probabilities of all the other states. And 15 of course is not so much, but if you have more than four variables, that number evidently quickly grows exponentially. Now, such joint probability distributions over four variables a, e, b and r can in general, using the product rule of probability theory, be decomposed into terms that could be called generative terms, not causally generative, but generative in a probabilistic sense: the probability for the first variable given all the other ones, times the second variable given the two remaining ones, times the third variable given the final one, times the final variable. The order of variables in here doesn't matter; every possible ordering of these variables allows such a factorization, because it's a fundamental property of probability theory. However, there are certain factorizations, certain representations of this joint distribution in terms of such a factorization, in which the representation becomes easier, because you can use domain knowledge about the generative structure to reduce the computational complexity. We know that the probability for the alarm to ring doesn't depend on announcements on the radio, because it's actually caused by earthquake or burglary; we know that the probability for the radio to give an announcement has nothing to do with a burglary in the house of our house owner, but only with the earthquake; and we know that the probability for an earthquake to happen has nothing to do with a burglary taking place, at least under this model. Since that's true, these individual terms become easier: the probability for the earthquake now consists of just a single number, rather than two numbers (the probability for that earthquake if there is or there isn't a burglary). Therefore such factorizations can potentially drastically reduce the number of degrees of freedom, the variables we have to keep track of, the number of states we have to simultaneously consider in our inference problem.
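As a small sketch of this counting argument (not lecture code; the parameter counts follow the alarm example, but the bookkeeping is my own illustration):

```python
# A minimal sketch (not from the lecture): counting the parameters needed to
# specify a joint distribution over four binary variables, with and without
# the conditional-independence structure of the alarm example.

# Full joint table over (B, E, A, R): 2**4 states, minus 1 for normalization.
full_dof = 2**4 - 1  # 15 free numbers

# Factored model p(B) p(E) p(A | B, E) p(R | E):
#   p(B): 1 number, p(E): 1 number,
#   p(A | B, E): one number per parent configuration -> 2**2 = 4,
#   p(R | E): 2.
factored_dof = 1 + 1 + 2**2 + 2**1  # = 8

print(full_dof, factored_dof)  # 15 vs. 8; the gap grows exponentially with more variables
```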
At the same time, back then, I also introduced, sort of more as an aside (and today we'll spend a bit more time on it), this graphical representation of the problem, which, I said back then, is called a directed graphical model, or a Bayesian network. It's a graphical representation of such generative structure that can be directly created from such a joint distribution. So if you have such a factorization, either this one or this one, in particular this one, then you can create such a graph by first taking the set of all variables, here a, e, b and r, and creating one circle for each of these variables; then looking at the factorization, and for each term in the factorization drawing an arrow from the right-hand side of the factor to the left-hand side of the factor. So here an arrow from e to r, and here arrows from e and b to a. And then, finally, when we actually do inference, there's an additional, sort of syntactic, trick: we fill in all the variables that we get to observe, which makes it easier for us to parse the graph. In lecture two we already thought a little bit about these kinds of directed graphs and what kind of structure they can represent. We did this relatively briefly, and today we'll spend more time on them. In particular, we observed that, because of the product rule (because every joint probability distribution can be written in terms of these conditional distributions, and the order of these variables doesn't matter), every joint probability distribution over a set of variables can of course be represented by such a directed acyclic graph, a directed graphical model, a Bayesian network. However, because the order of these variables doesn't matter, the direction of the arrows in such a graph also doesn't matter, and that, together with the fact that every joint probability distribution can be represented in this way, means that this fact isn't particularly helpful. The fact that every probability distribution can be represented as a graph just means that this graphical representation is sort of powerful. But to make it useful, you have to find a factorization in which the graph isn't densely connected, because only then does it actually encode conditional independence information that is useful for inference.
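The mechanical recipe just described (one circle per variable, one arrow per right-hand-side variable of each factor) is easy to automate. A small sketch, with the factor list mirroring p(A | B, E) p(R | E) p(E) p(B) from the example; the representation as (child, parents) pairs is my own choice for illustration:

```python
# Given a factorization as a list of (child, parents) pairs,
# the directed edges of the graphical model follow directly.
factors = [("A", ["B", "E"]), ("R", ["E"]), ("E", []), ("B", [])]

edges = [(parent, child) for child, parents in factors for parent in parents]
print(edges)  # [('B', 'A'), ('E', 'A'), ('E', 'R')]
```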
We also noticed back then, quite quickly, that this notion of directed acyclic graphs, or directed graphical models, or Bayesian networks, has some aspects that you might consider a conceptual flaw. In particular: not every conditional independence structure of a joint probability distribution can be jointly represented in a single graph. Back then I did an example, which is actually (I forgot to say back then) due to Stefan Harmeling, who is at the University of Düsseldorf. It works like this. There are two coins that we throw, each of which is fair, so it has a 50% chance of landing heads or tails. And there is a third object, a third variable: a bell that gets rung whenever the two coins have parity, so whenever they show the same face, heads or tails. This is the corresponding conditional probability table, and we saw back then (I'm not going to redo the derivations) that this conditional probability table can be represented through three different factorizations, because it actually has three different kinds of conditional or marginal independence. The face of the first coin is independent of the other coin when you compute the marginal and integrate out the bell; the face of the other coin is independent of the face of the first coin when you marginalize out the bell; and the probability to hear the bell is actually independent of an individual coin if you marginalize out the other coin. At least, that's true if both coins are fair, if they both have a 50% chance of showing heads or tails. These three different factorizations, though, correspond to three different graphs, and as we saw back then (you can look at the video again if you want to), these three different graphs each do not encode all three of these independence statements. So clearly these directed graphs, although maybe helpful, and maybe at least beautiful to look at, are not perfect: they do not encode all conditional independence structures you might want to encode.
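The three pairwise independences of the coins-and-bell example can be verified by brute enumeration. A quick sketch of my own, not lecture code; the encoding (0/1 for tails/heads, b = 1 when the coins agree) is an assumption for illustration:

```python
# Two fair coins c1, c2 and a bell b that rings exactly when they agree.
# Each pair of variables is marginally independent, yet the three are
# jointly dependent.
from itertools import product

states = [(c1, c2, int(c1 == c2)) for c1, c2 in product([0, 1], repeat=2)]
p = {s: 0.25 for s in states}  # each coin combination has probability 1/4

def marginal(idx_vals):
    """Probability that the variables at the given indices take the given values."""
    return sum(pr for s, pr in p.items()
               if all(s[i] == v for i, v in idx_vals.items()))

# Pairwise marginal independence: p(x, y) == p(x) p(y) for all three pairs.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for vi, vj in product([0, 1], repeat=2):
        joint = marginal({i: vi, j: vj})
        assert abs(joint - marginal({i: vi}) * marginal({j: vj})) < 1e-12

# But jointly dependent: knowing c1 and b pins down c2 completely.
print(marginal({0: 1, 1: 1, 2: 1}) / marginal({0: 1, 2: 1}))  # p(c2=1 | c1=1, b=1) = 1.0
```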
So today I would like to return to this topic of directed graphical models. In lecture two we introduced them in a rather ad hoc way, because we needed to talk about conditional independence to understand computational complexity, so that in the following lectures I could talk about conditional independence; but I didn't actually formally study these graphical models in any sort of detail. Today we'll begin to do that. What we will do is, first of all, on the next slide, I will introduce a little bit of extended syntax; I'll make this language of directed graphical models a little bit more expressive and powerful. Then we will look a bit more in detail at conditional independence and how to read it off, or whether it's even possible in general to read it off, from a directed graph. Then, later in the lecture, I will introduce a second framework, which is also a graphical way to write joint probability distributions, but a slightly different kind of encoding. It's almost like a separate programming language, under which certain operations are actually easier than under the directed graphical model framework, but which also has its own downsides. Then we'll do a little bit of theory, on a high level, about the representational power of these graphs. And in the next lecture, I'll introduce a third framework for representing joint probability distributions, which again has its own strengths and weaknesses. The reason I do all of this is that in the remainder of this course I will want to use these graphical representations, because we will now move to more complicated, structured probabilistic models, in which computational aspects become important and in which conditional independence plays a major role. To do so, we will need to use these graphical models; or actually we don't have to, but it's really useful to have them, because they are an easy representation of structure, and I think they should be in your toolbox as a mental aid for writing down generative structure and conditional independence structure in your model, so that you can think a bit more abstractly about what you're building, and then maybe sometimes directly read off efficient algorithms from the structure of the graph. The first thing I want to do is to extend the notation of directed graphical models a little bit. This is really just for convenience; what I'm going to do is essentially a bit of syntactic sugar, which is going to be helpful later on when we build expressive models, and I don't really know where else to introduce it in the course, so I might as well do it here, even though it's a little bit random, so that the next time I start writing notation like this, at least you know what I'm referring to. These are standard notations which are widely used by people who use graphical models to design probabilistic models. The first thing I'm going to introduce is a notion called a plate. A plate is a rectangle like this, and what a plate represents is a copy of all the variables that are inside of it. Each plate is a rectangle that has a little number at the bottom, and that number means that the indices in here (in this case the indices are i) run from one all the way to n, and there are n copies of the contents of this plate. So this graph here on the left-hand side corresponds to this graph, if you so far only look at the filled-in, the full-circle, variables, because y_i is copied n times. The second thing I'm going to introduce is something called a hyperparameter node; those are these small black circles. These correspond essentially to observed variables which don't play a crucial role in the rest of the model, if you like, and they're typically used to denote hyperparameters. You can think of them basically as observed nodes, observed nodes on which we typically condition in a straightforward, easy way, and which
we don't have to worry about so much. So this graph here on the right-hand side represents the graph we need to encode the structure of our parametric Gaussian regression algorithm that we've used in the past. Here, you remember that we inferred a latent function that is supposed to explain a data set of observations of a supervised problem, with inputs x and outputs y, by assuming that there is an i.i.d., so conditionally independent, evaluation of the function value at location x_i, with noise sigma. Conditional independence here means that if you knew what the latent function is, then the individual observations are independent: the individual observations are measurements of the true function, made independently with Gaussian noise. And we put a Gaussian prior over the weights of this function. So for the corresponding graph, of course, let's write down the set of all variables, y and w, draw circles around them, and then draw arrows going from the right-hand side of a conditional distribution to the left-hand side. Here there's a prior, which has no right-hand side, so there's a w; and then, since the conditional distributions for y given w are all independent, there are individual arrows pointing from w to all of the y's, and no further arrows between the y's. Under this new, extended syntax, we can make this graph more expressive by introducing all the variables that are part of the model (we could even introduce a variable for phi as well, which I haven't done here) and writing them down. This can sometimes be really helpful, because then you know why your variables enter and what their roles are, and it might help you parse your code. And then I use this plate here to say that there are n copies of these. I should probably say that this kind of notation, even though reasonably standard, is also not universally popular. Some people, for example, think that for certain applications it's not a good idea to draw these plates, because they complicate your graph; they might hide complicated structure inside. However, later on we will see models where it's very difficult to get away without drawing a plate, because things otherwise get really, really complicated and the graph is very difficult to parse. Okay, so that was it; I've introduced a bunch of notation, and it was easy to do so because we haven't really used it yet. So now don't be surprised if I start using hyperparameter nodes and plates to represent models.
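As a sketch of the generative process this plate diagram encodes (feature map, sizes and hyperparameter values here are illustrative assumptions, not the lecture's exact choices): w is drawn once, then each y_i is an independent noisy observation of the latent function f(x) = phi(x)^T w.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 20, 0.1                      # hyperparameter nodes: the small black circles
x = np.linspace(-3, 3, n)               # observed inputs x_i
phi = np.stack([np.ones_like(x), x, x**2], axis=1)   # assumed feature map phi(x)

w = rng.multivariate_normal(np.zeros(3), np.eye(3))  # prior: w ~ N(0, I), drawn once
y = phi @ w + sigma * rng.standard_normal(n)         # plate: y_i | w i.i.d. Gaussian
```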
The next thing I want to do is to talk a little bit about conditional independence structure, and what it actually takes to read it off. In lecture number two we already encountered this table of conditional independence structure, with three atomic independence statements. I basically wrote down the first non-trivial set of graphs, which turns out to be the set of graphs that have three nodes. A graph with a single node is totally trivial: it's just a probability distribution. A graph with two nodes is also trivial, because it's either disconnected, in which case you have two independent variables, or it's connected, and then the two variables are just dependent on each other, and there is no conditional independence structure. But if you have three variables, then there are three different graphs you can write down, as you can see here. There is the chain graph, which we have by now seen used in Markovian, time-series-structured models. There is what you might call a fan-out graph, where you have one parent that creates two child nodes. And thirdly, there is what you might call a fan-in, or collider, structure (sometimes also called a v-structure), where there are two parents that create one child node. Now notice that all three of these graphs have the same order of variables, so really the only difference between these graphs is the direction of the arrows. That's maybe obvious when I say it, but it's also important to remember that what makes these graphs interesting, what creates the structure and the encoding of conditional independence in here, is the fact that these arrows have a direction. Otherwise, if you leave out the direction, then you can't encode, at least not in this framework, this kind of conditional independence. We'll get back to that in a few moments. In lecture number two we already went through this manually (I actually did it on the whiteboard) and showed that the factorization implied by this graphical notation implies certain conditional independence structure. What I mean by that is that the first graph is a graphical representation of the factorization p(a, b, c) = p(c | b) p(b | a) p(a); the second graph means that the joint distribution is p(a, b, c) = p(a | b) p(c | b) p(b); and the third graph means that the joint distribution is p(a, b, c) = p(b | a, c) p(c) p(a), where c is not conditioned on a. Once you have this particular structure, you can explicitly go through and show that each factorization implies a particular conditional independence, by marginalizing out or conditioning on particular variables. And we saw back then that the chain structure implies that a and c become independent of each other when we condition on b. So what we are seeing here is some kind of blockage, right?
You can mentally think of it like this: if you condition on b, so if we fill in the variable b and make it black, then this chain becomes blocked, and c becomes independent of a given b. But in general, when we marginalize over b, a and c are actually dependent on each other under this model. In the second case, the fan-out, we saw (again, I'm not going to redo the computations, but if you want to see them, check out lecture number two) that a is independent of c given b, similarly to the chain graph, but if you marginalize out b, then a and c become dependent. And the third graph is, in some sense, different: under this model, a and c are independent in the marginal, but they become dependent when we condition on b. In general, this feature of probability theory is called explaining away, and we observed it in the example with the burglary and the alarm: once you get the information that your alarm is ringing, you suddenly have dependence between the two explanations for the alarm, burglary and earthquake, because they can both create this observation, but it's quite unlikely that they are both true at the same time.
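Here is a numeric illustration of explaining away in the collider a -> b <- c (all numbers invented for the sketch): a and c are marginally independent, but conditioning on b couples them, and additionally observing c "explains away" a.

```python
from itertools import product

p_a, p_c = 0.1, 0.1                      # marginal probabilities of the two parents
def p_b_given(a, c):                     # the child fires if either parent is active
    return 0.95 if (a or c) else 0.01

joint = {(a, b, c): (p_a if a else 1 - p_a)
                    * (p_c if c else 1 - p_c)
                    * (p_b_given(a, c) if b else 1 - p_b_given(a, c))
         for a, b, c in product([0, 1], repeat=3)}

def cond(a_val, **given):
    """p(a = a_val | given), computed by enumeration over the joint table."""
    num = sum(pr for (a, b, c), pr in joint.items()
              if a == a_val and all({"b": b, "c": c}[k] == v for k, v in given.items()))
    den = sum(pr for (a, b, c), pr in joint.items()
              if all({"b": b, "c": c}[k] == v for k, v in given.items()))
    return num / den

print(cond(1))              # p(a=1)            = 0.10, the prior
print(cond(1, b=1))         # p(a=1 | b=1)      ~ 0.50, raised by the alarm
print(cond(1, b=1, c=1))    # p(a=1 | b=1, c=1) ~ 0.10, explained away by c
```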
Now, what I didn't address back then, and what we should talk about now, is of course the fact that these are very simple graphs: they contain just three variables. The natural question you might have is, what happens if the graph has more than three variables? What if it has four, five, or a higher number of nodes? How do I then reason about conditional independence in that graph? Well, it turns out that making this connection formal is possible, but not particularly straightforward, and it requires the definition of a notion called d-separation, which is due to Judea Pearl. I'm going to present it in a way that I've taken from the textbook by Chris Bishop, which I've mentioned in previous lectures as well. Actually, significant parts of the presentation in this lecture are, in their genesis, due to Judea Pearl and also other people like Steffen Lauritzen and David Spiegelhalter, but in their presentation due to the book by Chris Bishop. To get a feeling for why it's non-trivial to think about how to read off conditional independence, let's look at these three examples again, which we just had on the previous slide. Here are our three graphs again; I've actually changed the order of the variables, but other than that they are the same graphs. Notice that on the previous slide we just discussed that for these two graphs, the chain and the fan-out, if we condition on this variable, which I now call c, then the two other parts of the graph become independent of each other; and when we don't condition on it, they are dependent on each other. However, for the third kind of graph, where the arrows point inwards, the situation is in some sense reversed: when we condition on it, these two things become dependent, even though they are marginally independent of each other. So clearly we cannot just think about blockage in this graph without considering the direction of the arrows; we have to be careful to find rules that formalize the role of the direction of the arrows in making things conditionally independent of each other. And this is exactly encoded in this notion called d-separation, where d stands for directed separation, by Judea Pearl. I'm just going to read it off, because it's a little bit tedious. Consider a general directed acyclic graph, that's our graphical model, with three non-intersecting sets of nodes A, B and C. Notice that we're talking about sets of nodes, not individual nodes, so you can think up here about sets of nodes as well. And, importantly, the union of these can be smaller than the complete graph. To ascertain whether the sets A and B are conditionally independent given C, we need to consider all possible paths, where a path means a way of traveling along the graph if you ignore the direction of the arrows, from any node in A to any node in B, and define a concept called blockage. Such a path through the graph is considered blocked if it includes a node such that either the arrows on the path meet head-to-tail or tail-to-tail at the node and the node is in C, or the arrows meet head-to-head at the node and neither the node nor any of its descendants is in C. Now, there is a theorem, which is again due to Judea Pearl, that says: if all paths between A and B are blocked, then A is said to be d-separated from B by C, and if that's the case, then A and B are conditionally independent of each other given C. To see how tricky this theorem really is, I've drawn here on my whiteboard a little graph, which is also taken from the book by Chris Bishop, and it's a great example of how complicated the situation is. Here is a graph that contains in total five variables. Here are a and b; these are the ones we're going to be thinking about, and they are part of a joint graph that also involves the variables f, e and c. So let's think about the independence, or conditional independence, between a and b under conditioning on various sets of this graph. Let's first assume that we're not conditioning on anything. The theorem says: to ascertain whether a and b are independent of each other given some set, in this case the empty set, consider all possible paths between a and b. There's only one possible path in this graph.
It goes from a down up and down again So we now according to the theorem have to consider the variables Along this path. So that's E and F to think about whether a and B are independent of each other and to do so we have to check for both variables whether how the how the the the arrows along this graph meet at these variables and Then check whether that node or any of its descendences in C. So let's first check F So F is a tail-to-tail node But the node is not in C. So there is no Information that we are conditioning on on F. So we're not conditioning on F. So The first statement of De-separation does not apply here. Let's look at E. E is a head-to-head node and But the node and none of its descendants are in C. Ah, okay So that is the case of de-separation. That's the statement number two or the second way of getting de-separation So the fact that there is this variable here means that a and B are conditionally independent of each other Well conditional on nothing, right? So they are independent By themselves and this is just so if you think about what what this graph is supposed to imply in terms of in the sense of a Generative model you can probably convince yourself that that's true, right? So a is Involved in the generation of E but B only depends on F and Because that there's no path pointing upward here and we're not conditioning on any of these pieces of information There's no reason why we know anything about B if you know something about a So now let's assume that we observe the variable C. So that's supposed to be a filled-in node So now let's check again So what they are just said about F still holds but if you now think about E then E is now a node at which the arrows meet head-to-head But Not like itself, but is not in C, but one of its descendants C is in Capital C the set of variables we are conditioning on So therefore This path is now let's say unblocked It's not blocked anymore because the second de-separation criteria doesn't apply anymore and we can actually not generally assume anymore that a and B are Conditionally independent and conditioned on C and again intuitively what's happening here is that by observing C We are potentially learning something about E Of course, that's not it's not guaranteed that that's the case, but it could be the case and that's all we're looking for so Once we know something about E A and F become explained away They are conditionally dependent because it's one of these collider structures So now C might tell us something about the relationship between A and F So therefore by learning something about F A would learn something about F and then therefore naturally also about B Okay, so if you're if you condition on C then A and B become Potentially conditionally dependent on each other and as a final case if you condition on F then F is still a tail-to-tail node, but it's now in C so therefore it provides blockage and A and B are conditionally independent of each other given F again this is not surprising because F is directly generating B and F is not involved in the generation of A. So by learning something about A. We don't learn anything about B So this example has maybe shown how tricky this notion of D separation is. 
We really have to look very carefully at the graph and think about these complicated rules of head-to-head and tail-to-tail and head-to-tail nodes, and whether they are in the conditioning set or not. So reading off conditional independence from a directed graph is not exactly an easy exercise if you want to do it by hand; you have to really stare at the graph and apply the rules of d-separation. Nevertheless, you can imagine that these rules are sufficiently formal to allow for automated processes that check for conditional independence. One particularly helpful concept for writing such an algorithm, one that tests whether two variables are independent or conditionally independent, is the set of all nodes you need to condition on to totally separate one variable from the rest of the graph; so, which other nodes you need to know such that a particular node in the graph becomes completely independent from everything else in the graph. That notion is called a Markov blanket, and it looks like this. Let me first just give you the statement, and then we can think about why it's true. The Markov blanket of a particular node x_i (that's our variable here in the graph) is the set of all parents, children and co-parents of x_i. When you condition on that set, x_i becomes independent of the rest of the graph, so conditionally independent of the rest of the graph. So why does that set look like it does? Why does it consist of all parents, children and co-parents of x_i? To see this, let's do a little bit of simple math. Let's say we have a set of variables, bold x; that's a vector of lots of variables, potentially more than we have here in the picture, one big graph, and together they form one joint probability distribution. Now let's say we consider this one variable x_i that we care about, and we condition on everything else. Conditioning on a variable, if it has an effect on x_i, means that x_i and x_j are not independent of each other. So which set of variables do we have to condition on so that the rest of the variables actually don't matter anymore, even if you condition on them? To find out, let's write down this conditional distribution. The conditional distribution is essentially Bayes' theorem, right: it's the joint divided by the evidence, the normalization constant, for which you have to integrate over x_i, so p(x_i | x_{j != i}) = p(x) / ∫ p(x) dx_i. Now we use the graph. Whatever the graph structure is (this is not necessarily just this graph, but basically any arbitrary graph), we know by definition of the graph that we can write this joint probability distribution in the factorized form represented by the graph, p(x) = ∏_k p(x_k | pa(x_k)), where the factors are the individual terms that you can read off from the graph.
So for every variable x_k there is a set of parents, pa(x_k), which corresponds to the set of variables that have arrows pointing towards this variable, and those we can write down here. These are easy to find, because this is a directed acyclic graph: for every variable there's an obvious set of parents that can just be read off from the graph by looking at the arrows. Together this gives the joint probability distribution. Of course, there will be some variables in here which don't have parents, and this notation is supposed to include those: the set of parents can be empty. So, for example, this variable would have no parents. Now we have incorporated the information that is encoded in the graph, and we just have to think about where x_i actually shows up in here. What are the terms in this expression where x_i shows up? Well, x_i can show up in two different parts of such a factor, either on the left-hand side or on the right-hand side. If it shows up on the left-hand side, then it's a child, and if it shows up on the right-hand side, then it's a parent. So what are the terms that contain x_i? They are exactly the terms that make up the Markov blanket: x_i can either be on the left-hand side, in which case the term contains its parents, or it can be on the right-hand side, in which case the variable on the left-hand side is one of its children, and the other variables on the right-hand side are co-parents. Those are the terms that we cannot get rid of in this integral: when we do the integral in the denominator, we cannot move the integral through these terms, because x_i is somewhere in there, either on the left- or on the right-hand side. For all the other terms that don't involve x_i, the integral passes through, and we can take these terms outside. Then they are the same terms in the numerator and the denominator of this fraction, and therefore they cancel, and we're left with just this expression. So no matter whether we are conditioning on variables or not, if they are outside of the Markov blanket, they don't affect the marginal distribution over x_i, assuming that we are also conditioning on the Markov blanket. So, again, the Markov blanket is obviously a useful concept, a formal notion that can also be encoded and taught to a computer if you like. But it's again a little bit tedious, a little bit annoying: to think about one variable x_i, if you want to check whether it is conditionally independent of a set of variables given everything else, then that set of variables doesn't just have to contain all the nodes which are connected to x_i by a line, right, an arrow pointing in one or the other direction; we also have to check, for all of the children of x_i, what their co-parents are, and whether we are conditioning on those as well. This property, taken together with that of d-separation, maybe gives you a feeling for why directed graphs are perhaps not the ideal tool if the concept you're interested in is conditional independence. They are a great tool, though, if you just want to write down a generative model.
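A small sketch computing the Markov blanket of a node from the edge list of a directed graphical model, collecting exactly the three groups named above: parents, children, and co-parents (other parents of the node's children). The graph reused here is the five-variable example; the function itself is my own illustration:

```python
def markov_blanket(edges, node):
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    coparents = {u for u, v in edges if v in children and u != node}
    return parents | children | coparents

edges = [("a", "e"), ("f", "e"), ("f", "b"), ("e", "c")]
print(markov_blanket(edges, "a"))  # {'e', 'f'}: no parents, child e, co-parent f
```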
So with that, we're at a grey slide. Directed graphical models are one kind of model (in a moment I'm going to introduce a second one) which directly reflects the factorization of a probability distribution; you can read it off. I mean, if you have a factorization of a probability distribution, so something that looks like this, where the joint probability distribution over x, a and b can be written like this, then you can directly translate it into a graph: you write down all the variables a, b and x, draw circles around them, and then draw arrows, one for every term in the factorization, pointing from the right-hand side of the factor to the left-hand side of the factor. This is really convenient, because there is this direct map between factorization and graph. But there's another concept that we care about, which was in many ways the main reason why I initially introduced these graphs, and that is conditional independence, because it has a strong computational effect. Now, conditional independence can actually be read off from the graph, but with a few caveats. The first caveat is maybe the easier one, more of a nuisance: reading off the conditional independence encoded by the graph requires using the notion of d-separation, and/or the Markov blanket, which are sufficiently formal to be encoded in an algorithm, but are also a little bit tedious to use by hand. So if you want to use these graphs to write them on a blackboard or a whiteboard, and then think about the conditional independence implied by the graph, that can be a little bit tedious, because you have to think about notions of head-to-head and tail-to-tail and head-to-tail nodes, and whether sets of children are contained in the set you're conditioning on, and sets of parents and co-parents. You can still do that, but it's not ideal for a notation that's supposed to be used by hand as well. But there's a bigger problem, which is that not every conditional independence structure of a given probability distribution can be read off from a graph. Each graph encodes certain sets of conditional independences, but there are joint probability distributions whose conditional independence structure cannot be simultaneously represented in a single graph. So at this point we see that directed graphs are a useful tool, something you might want to use, given that you have a factorization representation of your generative model, to write it down, to think about it, to analyze it, maybe even to identify interesting computational aspects. But they are not perfect. Another question that arises, of course, is whether there is another way to write graphs that is in some other sense beneficial; ideally, of course, it would fix all of these problems, and we'll see how hard it is to do so. Now, what you might be asking yourself is: maybe we can just turn the definition around. Directed graphs were defined from the factorization property: you take a joint distribution, you have a factorization, and you can read off the graph; and we saw that conditional independence is then difficult to read off from that graph. So maybe we can do things the other way around: can we define the graph through its conditional independence directly, and then worry about factorization later? Well, it turns out that that's possible. And before I show you how it works, let's look at this graph again and think about what made thinking about conditional independence so tricky in this graph.
It was the fact that we had to keep track of this phenomenon of explaining away, and explaining away arises if you make observations of variables which sit at these kinds of collider structures, so variables that are the children of several parents. So really, the problem here is that these graphs have these arrows. Because if they didn't have arrows, then we wouldn't have to separate between collider and fan-out and chain structures; they would all just be the same, right, if you didn't have directions on these arrows. That's exactly the idea that leads to what's called an undirected graphical model. Undirected graphical models are also known as Markov random fields, and they arise, at least for us, from the idea: let's just write down a graph structure that directly reflects conditional independence. So here's a formal definition. An undirected graph, that's just another graph: a collection of vertices, nodes, right, circles, and edges between them. Now, you can of course write down such a graph, and we will call such a graph a Markov random field if we decide to interpret the edges as implying conditional independence in the following sense. Such a graph is called a Markov random field if, for any subsets A and B within the set of vertices and any separating set S, which is also a set of vertices, such that every path between A and B has to pass through that set, A and B become conditionally independent when conditioned on S. This is called the global Markov property. So what does that look like? Here is such a graph; this is again a graph that I've taken from Chris Bishop's book, even though I've rearranged the nodes a little bit. Notice that this graph doesn't have arrows anymore; it just has edges. It's an undirected graph. So far it's just a set of vertices and edges. But if I say this is a Markov random field, if I treat this as a Markov random field, then this means that the symbols inside the circles are interpreted as random variables, so as things that we assign probability distributions over, and the edges are supposed to mean that when we condition on a separating set, two sets become independent of each other. So let's in particular consider the sets A and B: here are three nodes, and here are two nodes. Then this set, four and five, is a separating set, because every path between A and B that you can try has to pass through the separating set. You can immediately notice that one nice thing about this kind of formulation is that you can make all sorts of fun theoretical statements, or analyses. For example, you can ask: is there another separating set between A and B? Normally I ask these questions in the lecture and then people can have a discussion with me; now you'll have to have that discussion with yourself. For example, you might notice that yes, we could include node one in the separating set, and then it would be a larger separating set, but that seems unnecessary, right? Separating sets should be as small as possible. So, is there a smaller separating set than S that separates these two groups from each other? Well, no: if we remove five, then there is a path that connects them; the same if we take one and four, or one and five; and if we remove four, there is of course a connecting path as well. You can also ask questions about: given this separating set, what is the largest set of variables that are separated from each other? Well, it's not just A and B, because we can include one in here, and then we would have three sets of variables that are separated from each other.
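The global Markov property check is mechanically simple: S separates A from B exactly when removing S disconnects every path between them. A sketch with an invented graph that mirrors the structure of this discussion (three nodes on one side, two on the other, a two-node separating set), not Bishop's exact figure:

```python
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 5), (4, 6), (5, 6), (6, 7)])

def separates(G, A, B, S):
    """True iff S blocks every path between A and B in the undirected graph G."""
    H = G.copy()
    H.remove_nodes_from(S)
    return not any(nx.has_path(H, a, b) for a in A for b in B)

print(separates(G, {1, 2, 3}, {6, 7}, {4, 5}))  # True: every path passes 4 or 5
print(separates(G, {1, 2, 3}, {6, 7}, {4}))     # False: the path 3-5-6 is still open
print(separates(G, {1, 2, 3}, {6, 7}, {5}))     # False: the path 2-4-6 is still open
```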
This global Markov property implies a simpler, weaker statement that's called the pairwise Markov property: any two nodes u and v that don't share an edge are conditionally independent given all other variables. Of course: if they don't share an edge, then to get from one to the other you have to pass through another variable, and if you condition on that variable, then clearly that path is blocked. That means that for these undirected graphs, for Markov random fields, the definition of a Markov blanket is much, much easier than in the directed setting. The Markov blanket for a Markov random field is literally just the set of neighbors of x_i, where neighbors are all variables which share an edge with the particular variable. When you condition on this blanket, x_i becomes independent of the entire rest of the graph. So notice how easy it was to define Markov random fields, and that's maybe one of the biggest appeals of this representation: it can be defined in essentially just one line, if I didn't have to make this block here so small. So they're easy to define, and, as we already saw, by their very definition, just by writing down what a Markov random field is, we've already ensured that reading off conditional independence structure is extremely easy in Markov random fields. And that was maybe the quickest grey slide we've earned ourselves in this lecture course so far: just two slides to define what a Markov random field is, and then immediately see that the Markov blanket is really easy to see. So here we have a representation that is obviously more powerful, or more useful, than directed graphs if all we care about is conditional independence structure. Now remember that for directed graphs we defined the notation the other way round: we first defined it from the factorization property, so we took the joint distribution, which factorizes in certain ways, and then used that to define the graph, and then reading off conditional independence was hard. Here, now, we've devised a notation, Markov random fields, in which reading off conditional independence is easy. But to do so, we dropped the direction of the arrows, so it's going to be tricky to think about the joint distribution. To rephrase: in directed graphs I can directly read off a factorization; if you give me a directed graph, I can take it and read off at least one factorization of the joint probability distribution, but then I have to pay the price that reading off conditional independence structure is a little bit tedious; not impossible, but a bit tedious. In undirected graphs, I can read off conditional independence structure directly; that's the great part of this definition. But what about factorization?
Well, it turns out that reading off a factorization is much harder, because I don't have the directions on the graph anymore, and therefore I can't directly separate the graph into individual parts which have a left- and a right-hand side. So if you give me an undirected graph, what could the factors look like? To answer that question, let me go back two slides and remind you of the pairwise Markov property: any two nodes in the graph that don't share an edge are conditionally independent given all the other variables. Being conditionally independent means that their conditional distributions factorize, and that tells us something about factorization. Here we go again: any two nodes that aren't connected by an edge have to be conditionally independent given the rest of the graph; thus the joint has to factorize at least in the following way. If we think about the probability distribution over two particular variables x_i and x_j that don't share an edge, conditioned on the entire rest of the graph (by this notation I mean the set of all nodes in the graph that doesn't contain the two nodes i and j), then that distribution has to factorize into two terms: p(x_i, x_j | rest) = p(x_i | rest) p(x_j | rest). Therefore, nodes that don't share an edge can't be in the same factor. Right? Because if they don't share an edge, then their joint conditional distribution looks like this, and that means the variable x_j shows up neither on the left-hand side of the factor for x_i, nor on its right-hand side. So for any two nodes that don't share an edge, there have to be two separate factors such that each node individually is part of only one of these factors. So what kind of factors in our joint probability distribution does that leave us with? Normally, at this point, I ask people to think about this, and maybe you should for a second as well. Remember that we're trying to make as many factors as possible: factorization is a good thing, and if you have many different factors, that simplifies computation. So our goal should be to try to make the factors small, to make the sets of variables in the factors small, so that we have as many factors as possible. So how small can we make a factor? Let's look at this graph down here, and imagine that we consider these two variables, let's say one and two. These share an edge, so they have to be in a factor together, right? Because the statement up here only licenses separate factors for nodes that are not connected by an edge; if two nodes are connected by an edge, then of course the two variables are dependent on each other, even given everything else, because they directly affect each other. So maybe we could wonder whether we could make a factor that just contains nodes one and two. But notice that there's also the variable x3, which happens to be connected both to x1 and to x2, so it has to be in the same factor, because it directly affects the two as well. So what kind of factors does that leave us with? Well,
it leaves us with factors that, at the very least, consist of sets of nodes such that, within those sets, all variables are densely connected with each other. Such sets are called cliques. In a graph, a clique is a subset of the vertices such that there exists an edge between all pairs of nodes in the subset; a maximal clique is a clique such that it is impossible to include any other node from the graph without it ceasing to be a clique. So if you look at our graph here again, you can ask yourself for a moment: what are the cliques of this graph? Maybe the first thing you will notice is that all pairwise sets of connected nodes are cliques, as I just pointed out. But those are not maximal cliques, because in this particular graph, if you take a pair, you can always add a third variable and make it a larger clique. In particular (by the way, this is again a graph from Chris Bishop), the variables two, three and four are a maximal clique. Why are they maximal? Well, because if you add one, then one is not connected to four; the set is then not fully connected, and therefore not a clique anymore. So that means we can make a factor out of two, three and four. Why is one not in there? Well, because there is a separating set, two and three: if you condition on two and three, then one becomes independent of four, and that means that in our factorization there has to be a factor which contains only the variables one, two and three, so that, when we condition on two and three, one becomes independent of four. Another thing to notice is that this red thing I've outlined here is of course not the only maximal clique; there is another one, which includes x1, x2 and x3. So our factorization has to contain two factors (not "at least": exactly these two): one which includes x1, x2 and x3, and one which includes x2, x3 and x4. It might contain additional factors with only two and three, but of course that doesn't really help us for further factorization, because there is also a factor with two, three and four. On the subsequent slides, I'm going to use the letter capital C to denote the set of all maximal cliques of the graph. By the argument we've just gone through, any distribution p(x) which is represented by a Markov random field G can be written as a factorization over all the cliques, and therefore also just over all the maximal cliques: as I just said, it doesn't really help us to include individual factors for smaller cliques, because they don't simplify the computation further; there will be a maximal clique which dominates the computational cost, and we can therefore just look at the maximal cliques, because any clique is part of at least one maximal clique.
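The clique structure of this four-node example can be found mechanically; networkx's find_cliques enumerates exactly the maximal cliques (the edge list below encodes the Bishop example as described above):

```python
import networkx as nx

G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)])
print(sorted(map(sorted, nx.find_cliques(G))))  # [[1, 2, 3], [2, 3, 4]]
```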
So by this argument, we now know that if you give me an undirected graph, a Markov random field, then I can go through this graph, find all the maximal cliques, and I know that, whatever the joint probability distribution over all the variables x is, it has to be possible to write that joint probability distribution in this form: as a product of individual terms (let's call them terms for a moment; they are actually called potential functions), such that each term only depends on variables that are within one particular maximal clique, and then we multiply over all such maximal cliques. And then there is this constant that I had to write in front. Why is that? Well, this has to do with the fact that I only really know that these are functions of this form. Remember that for directed graphs, I could read off from the graph that the individual terms in the factorization are conditional distributions, and I knew which variable in the set of inputs to such a function played the role of a right-hand side, so a variable that we condition on in the conditional distribution, and which variable played the role of a child, so the left-hand side of the conditional distribution. I knew that because the graph had arrows, directed edges: we can just look at whether an arrow ends at a variable or begins at a variable. If a variable is involved in a factor through an arrow that begins at the variable, then it has to be on the right-hand side of that factor; and if it's involved in a factor through an arrow that ends at the variable, then it has to be on the left-hand side, it has to be a child. Now, in undirected graphs we don't have arrows, so we don't know whether a variable is on the left- or the right-hand side of a conditional distribution. But remember that conditional distributions are only probability distributions of their left-hand side, right? p(x | y) is only a probability distribution of x, not of y. We've spoken several times about the fact that likelihoods are not probability distributions of their right-hand side; if you remember, the very first time we did linear regression, I plotted the likelihood factors in weight space, and we saw that those likelihoods are not probability distributions as a function of their right-hand side, which back then were the weights. So, even though the graph tells us that it has to be possible to factorize the joint probability distribution in this way, we don't know that these potential functions are probability distributions of any particular variable in here, or of all of them together, because some of the variables in this clique might play the role, in this potential function, of a variable to condition on, and others might play the role of a variable that we actually define a probability over. We simply know that these are functions which have to be non-negative, because together they define a probability distribution. And for simplicity (and that's actually a really strong simplification, which I will talk about two slides from now), we will assume that they are all strictly positive. If you remember the Gaussian examples we've done so far, they are actually all of that form. Of course, you can also think of other situations where there is a choice of the variables x_C such that a potential becomes exactly zero, but we'll just ignore those, because they drastically complicate the analysis. So, because we don't know which variables are left- and right-hand sides of conditional probability distributions, we also don't know the normalization constant. That's again different from the directed graphs, and it's the reason why I need to include this constant Z in here. The left-hand side is a
In a directed graph, if you tell me what the factors are, then because I know the ordering of the variables within them, I directly know that I have a factorization of the probability distribution p(x) on the left-hand side according to the product rule, and I am done. For the undirected graph I do not, and that is a problem. Why? Because people use these graphs precisely to say: I want to write down a joint probability distribution which has the following factorization property. That is why an undirected graph is interesting; it directly allows you to encode conditional independence structure. So they write down terms, factors, potentials, that enter into a function like this. But because we only specify these potentials as local terms over neighbouring variables, we do not know what the global normalization constant is going to be: we never write down a joint generative model that is, in general, directly a probability distribution. These normalization constants, which in this framework are also called partition functions, therefore have to be computed, and that can potentially be very hard, because we need to normalize over the entire expression. A variable can be a member of several maximal cliques, as we saw in this example: x2, for instance, is a member of both the red maximal clique and the other one containing x1, x2 and x3. So we cannot necessarily push the integral inside any one factor of the factorization, and we essentially have to integrate over the entire space. That is extremely hard; it is combinatorially hard, as we have already discussed several times. Now, I said we are going to assume that the potential terms are strictly larger than zero, even though of course they do not have to be just from being a Markov random field, and you might have wondered why. To cut to the chase, the simple answer is that if ψ is strictly larger than zero, then we can think of each individual potential function in this factorization as the exponential of something, because the exponential function maps the real line to the strictly positive real numbers. Then p(x) is a product of exponential functions, which is the exponential of a sum. That idea is connected to one of the origins of Markov random fields, namely physics, and with the names of the Austrian physicist Ludwig Boltzmann and the American physicist Josiah Willard Gibbs. Markov random fields actually arose in physics as models for the statistics of thermodynamical systems, built from interaction terms between individual particles, which you can think of as potentials: a potential defines the interaction term, which is also why these terms are called potential functions. So if we assume our potential functions are strictly positive, we can think of each of them as the exponential of some term, and that term traditionally in physics takes the form of the exponential of minus some energy, because systems try to minimize their energy: states become particularly likely if they have low energy.
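To make the cost of the partition function concrete, here is a minimal sketch for a binary version of the example graph; the two potential functions are hypothetical, strictly positive choices of my own, not anything from the lecture:

```python
import itertools
from math import exp

# Hypothetical strictly positive potentials on the two maximal cliques
# {x1, x2, x3} and {x2, x3, x4} of the example graph.
def psi_123(x1, x2, x3):
    return exp(0.5 * x1 * x2 + 0.3 * x2 * x3 - 0.2 * x1 * x3)

def psi_234(x2, x3, x4):
    return exp(0.4 * x2 * x4 - 0.1 * x3 * x4)

# The partition function requires a sum over ALL joint configurations:
# 2^n terms for n binary variables. This is the combinatorial cost.
Z = sum(
    psi_123(x1, x2, x3) * psi_234(x2, x3, x4)
    for x1, x2, x3, x4 in itertools.product([0, 1], repeat=4)
)

def p(x1, x2, x3, x4):
    """Normalized joint probability of one configuration."""
    return psi_123(x1, x2, x3) * psi_234(x2, x3, x4) / Z

# The normalized probabilities sum to one, as they must.
assert abs(sum(p(*x) for x in itertools.product([0, 1], repeat=4)) - 1.0) < 1e-12
```

At n = 4 this sum is trivial; at n = 100 binary variables the same sum has 2^100 terms, which is exactly the hardness being pointed out above.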
So if our Markov random field contains only strictly positive potential functions, then we can write the whole distribution as the exponential of a sum over the individual components. Why a sum? Because the distribution is the product of the potential functions and each potential is an exponential, so the total distribution is a product of exponentials, that is, the exponential of a sum, together with a normalization constant, which we can add here. (Again, there is a minus sign missing on the slide; this should read minus log Z, sorry about that.) And now we might as well introduce individual scaling factors w_C. You could either think of them as all being one, if that makes things easier, or as having a value given by some kind of count: in physics this is often a count of how many particles occupy a particular energy level, if there are discretely many energy levels, a kind of state sum. Such distributions are historically called Boltzmann distributions, and the associated probability measure is called a Gibbs measure. A Gibbs measure has exactly this kind of form, where E is called the energy. Why am I showing you this? I could do a detour into physics here and point out that Markov random fields are historically connected to it, but what is maybe more interesting is the connection to our previous lecture. This connection provides a bit of a teachable moment: notice that this function really is an exponential function, so any such Gibbs measure (and I am already claiming any Markov random field, though of course I have not argued that yet) is a form of exponential family. It is the exponential of a sum over individual energies, where the energies, the logarithms of the potential functions, are, if you like, our sufficient statistics, multiplied by weights, which are the natural parameters, minus a log normalization constant. That sounds exciting, because we just had this last lecture where we saw that we can learn probability distributions as exponential families: if you get draws from a distribution that is an exponential family, you can learn it using maximum-likelihood-type inference, and doing so is easy because it reduces to computing the gradient of the log normalization constant. Ah, but here is our problem: I just said on the previous slide that Z is exactly the tricky thing to compute. So in a sense, Boltzmann distributions are the opposite side of the power of exponential-family distributions. They are of the same algebraic form; they are exponential families. But whereas in the previous lecture I argued that if you know the log normalization constant, then computing expected values of the sufficient statistics is particularly easy, because all you have to do is compute the gradient of the log normalization constant, here we see the flip side of that coin: if you do not know the log normalization constant, then to learn it you have to compute the expected values of all these sufficient statistics, all the energies, under the distribution. And that is tricky to do, because it is a very high-dimensional integral, in particular because these variables, to repeat what I said on the previous slide, come from the cliques, and the cliques might be overlapping.
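In symbols, the teachable moment reads as follows; this is again my reconstruction, with φ_C := −E_C playing the role of the sufficient statistics and the w_C that of the natural parameters:

```latex
p(x \mid w)
\;=\; \frac{1}{Z(w)} \exp\!\Big( -\sum_{C \in \mathcal{C}} w_C \, E_C(x_C) \Big)
\;=\; \exp\!\Big( \sum_{C \in \mathcal{C}} w_C \, \phi_C(x_C) \;-\; \log Z(w) \Big),
\qquad
\nabla_w \log Z(w) \;=\; \mathbb{E}_{p(x \mid w)}\!\big[ \phi(x) \big] .
```

Reading the identity on the right from left to right is the easy direction from the previous lecture; reading it from right to left is the hard direction we face here, because fitting the w_C by maximum likelihood requires exactly these expectations, and hence Z.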
So in general you might have to integrate over the entire space to compute this log normalization constant. Now, what I have only hinted at so far, in a very hand-wavy sense, namely that we can think of these Markov random fields with their cliques as Boltzmann distributions, that is, as exponential-family distributions or Gibbs measures, actually exists as a formal statement with a name: the Hammersley-Clifford theorem. I am not going to do the proof, because it is a really non-trivial statement, as you can probably guess from the fact that it carries such a complicated name associated with two different people. But it is essentially a formalization of the very hand-wavy argument I just made: if you have potentials that are strictly larger than zero, then you can think of them as exponentials of energy functions. It turns out there is a more formal, direct connection between the two, and here it is. Consider the set of all strictly positive distributions defined over an undirected graph G. Then the set of those distributions consistent with the conditional independences that can be read off from the graph, when interpreted as a Markov random field, is equal to the set of distributions that can be expressed as a Gibbs measure with the factorization I showed earlier. Let me show you that again. This wraps several steps into one, so let us go slowly. First, consider the subset of all distributions consistent with the conditional independences that can be read off from G. That is the direction we have built up over the last few minutes: we started from Markov blankets, from there moved to cliques, then thought of the potentials as defined on cliques, and observed that if they are strictly positive, they are evidently Boltzmann distributions, Gibbs measures. The theorem says you can also go in the other direction: the set of all probability measures that can be expressed as a Gibbs measure with the factorization in (*) is exactly equal to the set of all distributions consistent with the conditional independences that can be read off from the graph. In that sense, Markov random fields, the graphical model, and the thermodynamical framework of writing down a Gibbs measure connect a symbolic and a mathematical language: drawing undirected graphs on the one hand and writing exponential functions on the other. Okay, so that is an abstract statement.
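Stated compactly, and glossing over measure-theoretic details, the theorem as paraphrased above reads (this is a restatement, not a proof):

```latex
\textbf{Hammersley--Clifford:}\quad
\text{for strictly positive } p,\qquad
p \text{ satisfies the Markov properties of } G
\;\Longleftrightarrow\;
p(x) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}(G)} \psi_C(x_C)
\;\;\text{ for some } \psi_C > 0 .
```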
There is actually also a much more direct way to construct Markov random fields, undirected graphical models, in the specific case of Gaussian distributions, and that is the bit I would like to end with in this lecture. Just to remind you of the content of lecture number six, I think it was, on the first properties of Gaussian distributions (it might actually have been five): there we did the "humble Gaussian" analysis that David MacKay produced a while ago, and studied the meaning of the individual terms in covariance matrices of Gaussian distributions. We saw that if you have a joint distribution over random variables that is jointly Gaussian, with a mean and a covariance matrix, then you can read off marginal and conditional independence statements quite directly from the covariance matrix Sigma. This is not even on the slide, but let me reiterate it: if there is a zero in an off-diagonal element of the covariance matrix at entry (i, j), then the variables xi and xj are marginally independent, that is, independent when integrating out all other variables. The perhaps more exciting, more intricate statement concerns a zero in the inverse covariance matrix: if you consider the matrix that is the inverse of this (hopefully positive definite) matrix Sigma, and there is a zero at its (i, j) off-diagonal element, then that zero means xi and xj are conditionally independent when conditioned on all the other variables. By the way, this also allows us to make more general statements about conditional independence given a subset of the other variables; maybe convince yourself at this point how you would do that, since marginalization is so easy in the Gaussian framework. To check whether two variables xi and xj are conditionally independent given any specific subset of the other variables within the vector x, first compute the marginal over the variables you want to keep, namely xi, xj and the conditioning subset: to do that, you just select the corresponding sub-block of the covariance matrix and of the mean, which is very easy for Gaussians. Then invert that reduced covariance matrix and check whether there is a zero at the off-diagonal entry for the two variables of interest; if there is, then these two variables are conditionally independent given only the subset that remains in the marginal. Why does this work? It works because the inverse of a sub-matrix is not the sub-matrix of the inverse: inversion is a nonlinear operation that does not commute with subset selection, that is, with marginalization in the Gaussian framework.
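Here is a minimal numerical sketch of both readings, using a hypothetical three-variable Gaussian chain x1 - x2 - x3; the tridiagonal precision matrix is my own illustrative choice, not from the lecture:

```python
import numpy as np

# Precision (inverse covariance) of a Gaussian chain x1 -- x2 -- x3.
# Its (1,3) entry is zero: x1 and x3 are conditionally independent
# given x2.
Lam = np.array([[ 2., -1.,  0.],
                [-1.,  2., -1.],
                [ 0., -1.,  2.]])
Sigma = np.linalg.inv(Lam)

print(np.isclose(Lam[0, 2], 0.0))    # True:  x1 indep. of x3 given x2
print(np.isclose(Sigma[0, 2], 0.0))  # False: x1 and x3 are *marginally* dependent

# Conditional independence given a subset: first marginalize by
# selecting the sub-covariance, then invert, then check the entry.
Sigma_13 = Sigma[np.ix_([0, 2], [0, 2])]   # marginal over (x1, x3)
Lam_13 = np.linalg.inv(Sigma_13)
print(np.isclose(Lam_13[0, 1], 0.0))  # False: dependent once x2 is integrated out

# Reading off the Markov random field: draw an edge (i, j) wherever
# the precision matrix has a non-zero off-diagonal entry.
edges = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if not np.isclose(Lam[i, j], 0.0)]
print(edges)  # [(0, 1), (1, 2)]: the chain graph
```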
So let us say we have already marginalized over all the variables we do not want to condition on, and we just want to see whether xi and xj are conditionally independent given everything else. Then we can read that off directly from the corresponding off-diagonal element of the inverse covariance matrix. And if we can read off this property, we can also directly write down the graph, because, as we have seen on previous slides, if two variables are not conditionally independent when conditioned on everything else, they have to share an edge. Why? Let me go back and show you the Markov blanket again: because of this property of Markov random fields, their Markov blanket is given by all the direct neighbours; more precisely, it is the pairwise Markov property down here that matters. If we have conditioned on all other variables in the set and the two variables are still not independent, then this statement must be false, and therefore they have to share an edge; that is the only explanation. So if you have a Gaussian distribution, you can write down the associated Markov random field quite directly: draw a node for every entry of x, which is just a bunch of circles; take the covariance matrix of the Gaussian and invert it; check all entries of this inverse; and for every non-zero off-diagonal entry (i, j), draw an edge between xi and xj. That directly gives us our graph. And with that we are actually at the end of today's lecture. Today's lecture was preparation, fixing notation that we are going to use in subsequent lectures. We first returned to the notion of directed graphical models, Bayesian networks, which were already introduced abstractly, or maybe very briefly, in lecture number two. We saw that these graphs can be constructed directly from factorizations of joint probability distributions: if you have a factorization, you can turn it into a directed graphical model by drawing a circle for every variable in the model and an arrow for every term in the factorization, where the right-hand side of the factor is the beginning of the arrow and the left-hand side of the factor is the end of the arrow. So that is a generative model. This is a beautiful language, but we saw that one somewhat tedious aspect is the following: one reason why you want to use these graphical models is to infer conditional independence structure, and doing so is possible in a formal sense using the notion of d-separation, but it is a little tedious, because it requires looking not just at the parents and children of variables but also at co-parents, due to the phenomenon of explaining away. So, at least for the sake of argument, you might wonder: is there a variant of graphical models, another graphical notation, which does not have this complication, and from which we can directly read off conditional independence structure?
Now it turns out that this is actually possible, and it is realized in the notion of undirected graphical models, also known as Markov random fields, which are more or less defined through exactly that property: you draw a graph such that, when you condition on a subset of the nodes, variables that are separated by that subset become conditionally independent. That means you can directly read off conditional independence from such graphs. However, as we have now discovered, there is a massive downside, which is that reading off the joint distribution is now much harder to do, and requires a potentially combinatorially expensive computation, at least to get the normalization constant right. Given that, you might of course wonder why anyone would want to use a Markov random field over a directed graphical model, if it is that much harder to compute with. Well, to understand where these two models come from, also historically, think about what you need to know to write down each of them. For a directed graph, you need to know the joint probability distribution over all the variables involved; that is a full generative model, and then you can write down the directed graph. At least, that is the natural thing to write down, because you have these terms that are conditional probability distributions, so they naturally lend themselves to directed connections with a left- and a right-hand side. To write down a Markov random field, all you need to know are the potential functions. And if you are in a situation where you know the potential functions, then you can write down the graph by creating a variable node for every variable in your model and then, for each individual potential function, drawing a clique, that is, densely connecting all the variables that show up in that potential function. That of course gives you an idea of where these models come from. Historically, this is an idea from physics, where you know what the potentials are: they are given by interaction terms, between particles for example, or moving bodies. Of course, you can also decide to simplify those potentials, remove certain interaction terms, and make certain simplifications, and those choices then directly inform your graphical model. Directed graphs, on the other hand, have a history in statistics, or are at least popular in statistics, where you have a full generative model: you have made assumptions about how all the variables relate to each other, not just a small subset of particles, and what you then want to know is everything you can know about all of these variables given a subset of them. You want to do inference. For example, you might have a description of a medical decision process that involves various symptoms, treatments, and the outcomes of certain diagnostic tools, to inform what kind of disease a patient is suffering from. So directed graphs are in some sense, at least for our purposes, maybe the more powerful model: if we have a generative model, we can write it down directly as a graph, and because the graph has direction, it is more expressive; in particular, it allows us to read off conditional independence more or less in an automated fashion from the graph.
For an undirected graph, even though we can read off the conditional independence structure directly, reading off the joint probability distribution is a much harder process. The typical application of directed graphs today is twofold; people use them in basically two different modes. One is as a mental tool: you write something down on a whiteboard or blackboard to study what a distribution looks like, and we are actually going to do this in subsequent lectures. The other, perhaps more promising and exciting direction, which has also recently been gaining momentum, is that once you have a graph, you can do automated inference: you can automatically extract conditional independence structure and use it to inform the algorithms you might run to do inference. This is one key concept at the heart of the developing domain of probabilistic programming. What we need to do in the next lectures is refine this graphical language a little further, so that both the abstract mathematical work of writing on a whiteboard and the automated process of operating on such graphs become more flexible and powerful. Maybe we will also talk a little more about the connections between these emerging frameworks of graphical models and how to map from one to the other, and that will give rise to an actually quite powerful generic algorithm that directly takes account of the structure in a graph to produce very efficient, in some sense automated, inference in graphical models. But that is for later. For now, thank you very much for your time; I hope you have enjoyed the lecture, and I am looking forward to seeing you in the next one.