Hello and welcome to lecture number 20 of Probabilistic Machine Learning. It's 2021 and I am taking over from my alter ego from last year. I've also updated the overview slide a little bit, so that it now actually reflects what we're going to do for the next few weeks. We are in the final third of the course now, where I've introduced this running example that we're going to use several times over the coming weeks. A little bit of advance warning: every now and then we'll have to flip away from this example, just to build up speed and collect new ideas for something we're going to use when we return to it. This running example is supposed to follow along the process that you might follow if you're actually designing a machine learning algorithm yourself. As a high-level structure for this process, in the last lecture I already proposed a three-step process. You first collect the data, and last week we decided for this example that we're going to use data that collects the words of the State of the Union addresses of American presidents over the past quarter of a millennium, basically. Once you have the data, you then build a model, which in the probabilistic machine learning perspective is a generative model: a joint probability distribution over the data and a bunch of latent variables which describe something we care about in this data set. The third step is going to be designing an algorithm which actually solves the computational task of Bayesian inference. Today we're going to step back into the second phase, building the model, which we already began last lecture, which you might have just watched. Once we've completed this step, we will move to the third phase, actually designing an algorithm; in fact I'm even going to show you how it works. This second step, building the model, of course also has several sub-steps, and last lecture we did maybe the first part of building a model, which is a very high-level thing. We first identified what the task actually is that we're trying to solve, and we convinced ourselves that maybe what we're trying to do is a form of dimensionality reduction. We have this data set, a corpus of documents, where each document consists of a collection of words from a vocabulary, which are used by the speaker several times. So there are counts in this matrix of documents and words, and we decided that we're only going to use the counts, so we are discarding the order in which the words are spoken. We treat this data set as a bag of words. Maybe this is already a first modeling decision. We then convinced ourselves that our goal, when we say we would like to extract something about the structure of this data set — the topics of history, or at least North American history, over the past 200-odd years — is a form of dimensionality reduction. That means we would like to write this dense matrix in some kind of low-dimensional representation. I'm deliberately not saying low-rank, because we already convinced ourselves last lecture that what we're trying to do here is not just write this matrix as a low-rank decomposition in terms of two other matrices. We're looking for something that is motivated by that structure but is a little bit more complicated. What we would like to do is describe every document, out of the total D documents, in terms of a low number of topics.
Capital K topics, from one to capital K, and the topics themselves should be possible to represent by some form of distribution over the frequency of the words in each document. Now, I ended the last lecture talking about mixtures. Actually, we first talked about low-dimensional representations in terms of matrix decompositions. For the singular value decomposition we found that it's not really a good idea, because then we would get dense matrices here, and that has all sorts of disadvantages that we discussed last time. Then I brought up the idea of a mixture model as an alternative, a sparser representation. But we found that the mixture model is maybe too extreme. It's too sparse, because a mixture model — and I have a picture for this on the left side of this slide — is a representation of a data set where each row in our matrix, each document, is assigned to one and only one mixture component. So the assumption is that every document is generated by picking exactly one particular topic. This is what the data set might look like: say we look at the first 12 documents and we decide that there are 10 topics. Then this would be the matrix capital Pi from the previous slide, the Pi that represents the assignment of documents to topics. In a mixture model, this would be a binary matrix where every row contains one and only one non-zero entry. Now, that isn't quite what we have in mind when we talk about topics. Of course, when a president steps up to the lectern — you've probably seen some of these State of the Union addresses — they wander through a whole set of different topics that are currently important to the nation they are addressing. So a document won't actually consist of exclusively one topic, or talk about exclusively one topic. What we're looking for instead is something that is actually called a topic model, and I'll get to that several times today; in fact we'll talk about it at length. The idea is that every document should be assigned to a few topics. So this matrix here should not be binary, with every row having only one entry equal to one. Instead it should contain a probability distribution in every row. The entries of each row should sum to one, such that a document can be thought of as a collection of topics that the speaker is trying to talk about, where every topic contributes a little bit to the document, sometimes more, sometimes less, but they jointly make up the document. And then there is an additional requirement: of course we would like this matrix to still be somehow sparse, as this picture also suggests. Every document shouldn't be about one topic, but it also shouldn't be about all topics. The topics should be sparsely assigned: every row in the matrix should contain a few non-zero entries — not just one, but also a lot of entries that are actually zero, because not every president is going to talk about every topic. And there's a similar kind of requirement for what the topics should be. This object Theta in this representation of the topics, which relates each topic to the words it's about — this object, this matrix — should actually also be of this sparse nature, for a slightly different reason. Because what's a topic? Well, if you're talking about a topic, that means that you're using certain words.
We're not always using just one word to talk about one topic, but we're also not using all words to describe a particular topic. So if we are thinking about what kind of probability distribution we might want to use to represent this object Theta, then we certainly shouldn't be using a dense distribution like a Gaussian, which assigns dense — and maybe even negative — counts to how often a word should be used in a topic. That would be a weird representation. Instead we would like to say that talking about a particular topic means that we are using certain words, and some of these words we use quite frequently when we talk about that topic, but we're probably not going to use every single word for a single topic. These preliminary observations allow us to now do the second and third step in the process of building a model. We've just understood for ourselves what we would like to do, and now our goal is going to be to translate these ideas — these insights into how we want to represent our data set — into a generative model, which involves writing down a graphical model and then assigning prior distributions that encode the sparsity assumptions we are making. And here's the graphical model that we're going to use. Our data set, our corpus, consists of the words w_di spoken in speech number d, document d, at location i, and we would like to represent those words in terms of the following generative process. First of all, there are D such documents, so there is a big plate around them, an instance for every single document, and then the words in each document go from the first word to the very final one, the length of the document, I_d. We're going to assume that every single word in every single document is assigned to exactly one topic, and for that we're going to introduce a variable c_di that identifies the topic. There are various ways in which one could encode this assignment to one topic. For notational and representational convenience, I'm going to use what's called a one-hot encoding: c_di is a vector of K entries, a binary vector where exactly one of the entries is one and all the other ones are zero. This will be useful in a way that you're going to see in a moment. Now we have to say what the topics are, and then how the documents are generated from the topics. For a topic, we're going to use this variable theta; these are the entries of the matrix capital Theta that we've had on previous slides, and as suggested by the visualization two slides ago, we're going to assume that these thetas provide probability distributions. So the vector theta_k, the k-th topic, is a vector of length V, the number of words in the vocabulary, and this vector contains numbers between zero and one that sum to one. It represents a probability distribution over words when we are in a topic. So if a particular word is supposed to come from a particular topic, then the probability to get this particular word from the vocabulary is given by theta_k,w_di. And how do we create documents? Well, documents are in our model a distribution over topics. That means that every document d can be represented by a row of the big matrix capital Pi, which is itself a probability distribution.
So pi_d is a vector of length K, the number of topics, where the entries are between zero and one and sum to one, such that to generate words from a document, we draw each word by first drawing the topic of the word — an instance from the probability distribution pi_d — and then, once we have drawn it, drawing the word from the probability distribution that corresponds to the topic this word is supposed to come from. That is the graphical representation. What does this look like in math? Well, this is here below. We draw the topics of all the words by, for every document and for every word, drawing from the probability distribution given by the entries of the row pi_d of the matrix capital Pi. Notice that there are three different products here. The third one is for notational convenience; it's because we're using a one-hot encoding. The other two amount to an independence assumption: we assume that the words in every document are drawn independently of all the other documents, and that every single word is drawn independently of all the other words when conditioned on — given — the probability distribution of every single document. This is the bag-of-words assumption. And how do we draw words once we have the topic assignment? We just draw a word from the distribution over the vocabulary given by theta. So what we've just done is identify the types of the variables that show up in this — let's call it a program, a probabilistic program in fact. We've also identified their relationships to each other, how one generates the other, and that's step 2.2 in designing a generative process, in building a model. We do this using the language of graphical models — partly because it's sometimes useful to draw a picture like this, but also because once we've written down the graphical model, as you know from the corresponding lectures, we can immediately read off conditional independence structure. For example, we can look at this graph and see that there is a collider structure here and here, which tells us that, first of all, if we marginalize out pi and theta, the words are not independent of each other. They are independent of each other, though, if we know what their topic assignments are and if we know what the topics are. So our model is not trivial. It's not assuming that every single word is generated completely independently of all the other ones. In fact, it assumes that there is a latent structure which relates the words to each other, and that structure is given by the topics. Conversely, we also see that when we condition on the words — which of course we are going to do, because the words are the data — then the left and the right-hand side of this model become dependent on each other. That's an instance of explaining away. We're just noting this here in passing, and we can already guess that it's going to create a bit of a hurdle for us when we're actually implementing our algorithm; we will have to deal with the fact that all the latent variables, everything we care about, depend on each other. What we haven't done yet is the third step in our battle plan for how to build the model, and that is to assign conditional distributions to the latent variables — that means prior distributions.
That's what we need to do now — partly because everything has to have a prior, but also because it's actually useful: putting in prior distributions allows us to encode the ideas we have about what these distributions should look like, the fact that they should be sparse. We've already realized that we shouldn't be putting a discrete distribution as a prior on pi_d, as you would in a mixture model, and we shouldn't be putting a dense distribution, like a Gaussian, onto theta, for various reasons. Instead we should put prior distributions on both of these variables that allow us to encode the fact that each of these vectors — this one of length V and this one of length K — should be a probability distribution, so its entries should sum to 1, and it should be sparse. They are discrete distributions, containing numbers between 0 and 1 that sum to 1, and the conjugate prior, the exponential family that provides the conjugate prior for discrete distributions, is the Dirichlet distribution that I've already introduced in the lecture on exponential families. The Dirichlet distribution looks like this, or also like this; this is the normalization constant in front, which is called the beta function, and these prior distributions — over pi in this case — are given by a product over the individual entries of the vector, each raised to a power, where the power is provided by the natural parameter of this exponential family, which is a pseudo-count. We already encountered Dirichlet distributions very early in the lecture course, in lecture number two, when we inferred probability distributions — the example of inferring how many people are wearing glasses. That was the bivariate version of the Dirichlet distribution, the beta distribution, and we realized that these entries alpha_k amount to pseudo-counts. So Dirichlet distributions are the conjugate prior for discrete distributions, but they have another nice aspect — they have various great properties, but the one that is useful for us in this application is that they can actually encode sparsity. Here is a plot of three different Dirichlet priors over the simplex. This is a prior for a three-dimensional distribution, pi_1, pi_2, pi_3. Because these are supposed to be probability distributions, the third variable is always one minus the sum of the other two; that means the values of this distribution have to lie in the two-dimensional plane represented by this triangle, which is called the simplex. What you see here are samples from this distribution for three different values of the hyperparameter alpha. Here alpha is 0.01, you can see that up here; here it's 0.1, and here it's 0.5. I've drawn a hundred samples each, and you can see that for values of alpha less than one, this distribution has more mass on the corners of the simplex, and on its edges, than it has in the center. You can think of it as a kind of trough with high mass in the corners. Even for 0.5 we get samples that tend to lie on the sides, but for even smaller values of alpha, most of the samples lie right in the corners or on the edges, and that means that the probability distributions we will be drawing from this prior will be exactly of the kind I showed you before.
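If you want to see this sparsity effect yourself, here is a minimal sketch using NumPy's Dirichlet sampler (the 1e-3 threshold for calling an entry "almost zero" is just an illustrative choice of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 100 samples from a symmetric 3-dimensional Dirichlet for several
# concentration parameters. For alpha < 1 the mass concentrates in the
# corners and on the edges of the simplex, so samples are nearly sparse.
for alpha in [0.01, 0.1, 0.5]:
    samples = rng.dirichlet(alpha * np.ones(3), size=100)
    frac_tiny = (samples < 1e-3).mean()
    print(f"alpha={alpha}: fraction of entries below 1e-3 = {frac_tiny:.2f}")
```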
Back on the slide, these are actually draws as well; it's another way of representing these samples. You can see that they are mostly 0 — white in this plot means 0 — but they do contain a few entries that are non-zero, and which sum to 1, of course. So there isn't just one but two good reasons to use Dirichlet distributions as our prior over both the document-topic distributions pi_d and the topic-word distributions theta_k. The first is that Dirichlet distributions are the conjugate prior for the discrete probability distributions we're trying to infer. So in some sense they are the standard choice, the standard data type we want to use for these kinds of variables, as priors for discrete distributions. The second is that they actually allow us to encode something we want to say about our collection of document-topic and topic-word distributions, which is that they are sparse. We get that by a particular choice of parameter — something less than 1 for these alphas, or more precisely by choosing vectors alpha whose entries are less than 1 — to encode sparsity. And if we make this choice, then we have effectively written down our generative model. Here it is. I've now extended the graph by adding the parameters of the two Dirichlet distributions that we're going to use as priors for pi and theta. I'll call them alpha_d and beta_k: alpha_d for the priors over the document-topic distributions, and beta_k for the priors over the topic-word distributions. And here is now our generative model. The word generative suggests that it's a representation, a model for our data, which describes our assumptions about how the data may have been generated — or, a little more hands-on, which describes a program, a computer program known as a probabilistic program, that we could use to generate the data. It's literally a very explicit process that could be used to generate something that looks like the data set we have. This is the beauty of the probabilistic approach: we are very explicit about this, and we're basically writing down a formal computer program for generating something that looks like our data. We can read off what this program is by looking at the graph and following the arrows from the top, the outside of the graph. In this case these are the two — well, you can think of them as the roots of the graph — and we start from both of them and then draw random numbers going forward. This is what's described below in equations. To draw our data set W, we begin from the outside of the graph, the set of nodes that has no parents. In particular, here is one such set. We draw the topic-word distributions — the rows of the matrix capital Theta — by drawing every row independently (that's represented by this product here and by this plate in the graph) from a Dirichlet distribution with parameter beta_k for the topic-word distribution theta_k. Now we have this part of the graph filled out, and we can turn our attention to the left-hand side: for every document in the corpus, we first draw the variable pi_d, the distribution over the topics of this particular document, independently for every single document, from a Dirichlet distribution with parameter alpha_d.
The independence is again encoded by the product term here in the equation and by the outer plate in this graph. Once we have pi_d and theta_k, we can turn our attention to the inner plate, and to get a word we first decide which topic it's going to come from: we draw the topic of the individual word w_di by drawing a single one to put into one of the K entries of the vector c_di, and we do that by drawing from a discrete distribution. Drawing from a discrete distribution is a very simple thing to do, right? There are just K different categories — topics — and we pick one of them with probabilities given by pi_d. Unfortunately, the corresponding equation looks quite complicated, but it just encodes this very simple thing. Using our one-hot encoding, we can say the probability that c_dik equals one is just given by pi_dk. Now that we know which topic the word comes from, we can finally draw the actual word — that is this black circle here — and for that we need both theta and c_di. Since we know which topic the word comes from, we have identified a particular vector theta_k, the one whose indicator in c_di is 1, and we can now draw the word using this identified topic. This is a distribution over words, similar to the line above, and we can use it to draw an individual word. If we do this, then we create something that looks like our corpus, in the sense that it gives us a ragged array with D rows, where row d is of length I_d, containing the identities of the individual words. Notice that you can implement this as a forward pass through this graph, as an actual piece of code. In fact, generating data like this will be part of your homework this week. The idea of probabilistic inference, or probabilistic programming in general, is that through this process we've also effectively identified a way of figuring out all this structure, and that way is given by Bayesian inference. This is the beauty of this kind of language: writing down this generative process identifies everything we have to do afterwards. That's also why, in papers in which such models are introduced, these kinds of lines that you see here, these equations, are typically at the heart of the paper. Writing down these four lines, or accompanying them with this graph, identifies uniquely and exactly what we're trying to do, how we're going to go about building our model, and then also performing inference in it. Here, of course, I can now tell you that this model is not something I just came up with off the top of my head. In fact, it's a relatively popular probabilistic model. It's called latent Dirichlet allocation, and it has this name because it uses two kinds of latent distributions, over pi and theta, that are both Dirichlet. This is a little unfortunate in the sense that it's usually abbreviated LDA, which overloads the abbreviation for linear discriminant analysis, a very basic machine learning algorithm many of you will be familiar with. This model class is associated with several people. One who is particularly associated with it is David Blei, currently a professor at Columbia University, who invented this model, to my knowledge, as part of his PhD, together with his co-authors Andrew Ng and Michael Jordan.
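To make this forward pass concrete, here is a minimal sketch of the generative program in NumPy (variable names are my own; the homework templates may look different):

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_corpus(alpha, beta, doc_lengths):
    """Sketch of the LDA generative process.

    alpha:       (D, K) Dirichlet parameters of the document-topic priors
    beta:        (K, V) Dirichlet parameters of the topic-word priors
    doc_lengths: list with I_d, the number of words in each document
    """
    D, K = alpha.shape
    V = beta.shape[1]
    theta = np.stack([rng.dirichlet(beta[k]) for k in range(K)])  # topic-word dists
    pi = np.stack([rng.dirichlet(alpha[d]) for d in range(D)])    # doc-topic dists
    corpus = []
    for d in range(D):
        # for every word: first draw its topic c_di, then the word w_di
        c = rng.choice(K, size=doc_lengths[d], p=pi[d])
        w = np.array([rng.choice(V, p=theta[k]) for k in c])
        corpus.append(w)
    return corpus, pi, theta

# a tiny corpus: 5 documents of 50 words each, 3 topics, 20-word vocabulary
docs, pi, theta = generate_corpus(alpha=0.1 * np.ones((5, 3)),
                                  beta=0.1 * np.ones((3, 20)),
                                  doc_lengths=[50] * 5)
```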
Writing down this generative model and the corresponding graph completes the second phase in our journey from a data set to a software solution: building the model. We first identified the quantities and data structures we care about. We thought about what we're trying to do, and realized that we want a low-dimensional representation — some form of structured dimensionality reduction for our data set of word counts. We made the decision to represent our data set as a bag of words. We then designed a generative process, a graphical model, by writing down the relationships between the variables. We said there will be something called theta and pi, representing topic-word and document-topic distributions, and we assume that we draw from these distributions to generate the words. And we finally assigned priors to these variables in an almost mechanistic process, by just looking at the data types and using the standard priors for them: for discrete distributions, the standard prior is the Dirichlet distribution, because it's the conjugate prior, the conjugate exponential family to the discrete distribution. We immediately saw that this actually allows us to encode something we wanted to encode about our model in the first place, which is sparsity — sparsity of the document-topic distributions (every document only talks about a few topics) and sparsity of the topic-word distributions (every topic is only about a few words). We also realized that this is, of course, a model that has been invented before, which is good, because this is a didactic example. In practice, you might end up with a model that is related to something you've seen before, but not necessarily something anyone has written a paper about. That's fine. This is, again, a strength of probabilistic modeling: you get to write down what your prior assumptions are. And when you do this in practice, you will invariably make some form of compromise between really writing down exactly what you'd like to say and shoehorning this knowledge into a structured representation that you can actually work with. For example, here we made the bag-of-words assumption — we introduced conditional independence between the words given the latent representation — even though maybe we don't really believe that that's how the data set was generated. And, of course, we decided to use Dirichlet distributions, even though there are some minor aspects of the model that we can't directly represent, at least in the simple form that we've written down. For example, there are certain words that will actually show up in every single topic, or in many topics — words that are not quite stop words, but not really content words either, like "employee", maybe. We can't really represent this with the Dirichlet distributions we have yet. One could fix this by hand, or we can just leave it out and hope that the algorithm is going to work regardless. What we're going to do now, in the final, third step, is to actually build an algorithm that we can then implement. Before we do that, maybe you want to stop the video and take a quick break. Okay, I will assume you've taken your break now, and we can step forward to the final leg of our journey: designing the algorithm.
The slide actually already tells us what we're supposed to do, from a very high-level point of view. We're supposed to look at the graphical model we just built and find structure in it that we can use to build an efficient algorithm. Ideally, we find some structure in this graph, in this representation, in the equations, that allows us to pull out one of the standard methods from our toolbox and apply it directly to the problem. Then we run some unit tests, some sanity checks, to see whether the algorithm actually works, and finally we come back and try to make an efficient implementation. Today we'll certainly not get to the efficient implementation. Instead, we'll find that this final part, the design of the algorithm, is more of an art than a science. Building the software solution, writing the algorithm — that's very much the part that the human does in machine learning, and it requires expert knowledge to do well. So let's see what we can do with our model. Here's our graph again, and for once I've left out those big equations. Our goal is going to be to compute a posterior distribution over the latent quantities. These are all the variables in the graph that are not filled in, all the bits that are white: the variables collected in the array C, the array Pi and the array Theta. Arguably, we don't even care that much about C, but that's a secondary issue. To compute this posterior, this conditional distribution for the latent quantities given the data, we have to do Bayesian inference. This is actually a nice thing. We don't have to think hard about what the computational task is. In a frequentist framework, the generative model is not actually part of the framework, and we would have to jump in directly and invent some algorithm. In the probabilistic framework, once we've written down our joint distribution over all the variables — the data and the latent quantities — the inference task is not up for debate. It's just Bayesian inference. The posterior is given by prior times likelihood divided by the evidence; or alternatively, using the product rule, by the joint divided by the marginal for the data, the evidence term. So, what is this joint that we have to work with? Let's write it down — and this is where things quickly get tedious. Advance warning: you're going to see a lot of math now. But that, unfortunately, is exactly what we have to do to build efficient algorithms. In general, of course, we can factorize this joint distribution using the product rule in an arbitrary way. But the graph gives us a particular factorization, one that exposes interesting conditional independences. It says that the joint distribution over c, pi, theta and w — all the variables in our model — is given by the probability for pi given alpha, times the probability for c given pi, times the probability for theta given beta, times the probability for w given its direct parents, c and theta. This is what the graph tells us. In fact, the graph tells us a little bit more. It tells us that there are these plates around the variables, which imply further factorization. The probability distribution over the document-topic distributions pi_d factorizes over individual documents. Likewise, the topic assignments for each word factorize over both individual documents and individual words, when given pi.
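Written out — a reconstruction of the slide's formula from the definitions in the lecture — the factorized joint reads:

```latex
p(c, \pi, \theta, w \mid \alpha, \beta)
  = \Big[\prod_{d=1}^{D} p(\pi_d \mid \alpha_d) \prod_{i=1}^{I_d} p(c_{di} \mid \pi_d)\Big]
    \Big[\prod_{k=1}^{K} p(\theta_k \mid \beta_k)\Big]
    \prod_{d=1}^{D} \prod_{i=1}^{I_d} p(w_{di} \mid \theta, c_{di})
```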
The probabilities for the actual words given the topic assignments also fully factorize over every single word and every single document, and the probability distributions over the individual topic-word distributions also factorize — this is this plate. So there are three different plates here, each of which induces one factorization: a factorization over d, a factorization over i, and a factorization over k. What the graph doesn't tell us, but we know from our generative model, is that we've decided to represent these abstract probability distributions with concrete choices. Early on in our modeling, we stated that we want the probabilities for the topics given pi to be discrete distributions, and the words to be discrete samples from the topic-word distribution identified by the word's topic. And we then decided — in a somewhat knee-jerk reaction, because these are discrete distributions and the conjugate prior for discrete distributions is the Dirichlet distribution — to use Dirichlet distributions as priors for those parameters pi and theta. So we just plugged in some Dirichlet distributions. Why did we do that? Well, we can now see why we might have decided to do things this way: when we write down our joint model, there are certain interesting ways of rewriting it, which are down to our use of exponential family distributions — specifically, the Dirichlet family's conjugacy to the discrete distribution. To see this, let's focus on the left-hand side of this expression. Notice that pi only shows up in those terms and theta only shows up in those terms, so we can look at the blue and the green parts separately. Let's look first at the blue part and ignore the green part for a bit; the green part is going to be almost entirely analogous. Here we notice that if we plug in what a Dirichlet prior actually is, things look complicated but actually become simpler. So here I've plugged in the definition of the Dirichlet distribution from the previous slide: it's just this normalization constant times a product over the individual pi terms. Now we can look at this and see that there are terms over d with brackets around them — each bracket contains one term per document d — so we might as well rearrange the brackets into one product over d with a single bracket outside. Inside, there's the normalization constant for each document, and then a bunch of terms over the individual topics k from 1 to capital K, with a factor pi_dk raised to some exponent. That's actually why we introduced the one-hot encoding for c: with this way of writing c, these terms appear very obviously. The only annoying thing is this product over I_d here, so we have to find a notation that lets us deal with it easily. So let's do that. From the previous slide to this one, I've really just picked out these blue bits and copied them here again — this is the expression you've just seen, nothing has changed. Now, to make our life easy, let's find a way of getting rid of this product over i, by introducing a notation that will turn out to be very convenient, not just for this lecture but for several lectures to come. I will introduce a count array.
I'll call that array n, little n for counts. It is indexed by three indices d, k, v — speaking indices for the document, the topic, and the word identity v in the vocabulary — and I'll use it to count how often we have assigned (or our algorithm thinks we should assign) the word v from the vocabulary to topic k when it occurs in document d. So this is what this notation means: count how many i's there are in this product over i such that the word at location i is the word v and it's assigned to topic k. Because this object will be used in various ways, it will often be convenient to collapse or slice this array. I'll use the usual colon notation — similar to what you're used to from NumPy, or previously from MATLAB or other, older languages — for a slice through this array, and I'll use a dot to collapse, that is, to sum out, individual dimensions. So n_dk·, for example, means: how often have we assigned any word, no matter which one, to topic k in document d. There's a corresponding version n_·kv — how often have we assigned the word v to topic k in any of the documents — and n_·k·: how often have we assigned anything, anywhere, to topic k. If we use this notation, it basically wraps up, gets rid of, this product over i, and we see that everything here simplifies a lot — which is arguably expected, because we are using the conjugate prior for our discrete distribution, so things are supposed to get easier. We have two brackets, both containing a product over d and over k, so we can merge them into one big bracket with a product over d and k. The normalization constant of the priors just stays, and then we have a term that factorizes over k: pi_dk raised to the power of the prior parameter plus the count of how often we've assigned any word in document d to topic k. This is not a Dirichlet distribution, because a Dirichlet distribution would be normalized: if we integrate this expression over pi, the result is not one. Instead, it's basically an evidence term.
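In code, this count array and its collapsed versions might look as follows (a sketch of the notation, assuming ragged lists `docs` and `C` holding the words and their topic assignments as integers; not the homework template):

```python
import numpy as np

def topic_counts(docs, C, K, V):
    """n[d, k, v]: how often word v in document d is assigned to topic k.
    docs[d][i] is the word id w_di; C[d][i] is the topic of word i in doc d."""
    D = len(docs)
    n = np.zeros((D, K, V), dtype=int)
    for d in range(D):
        np.add.at(n, (d, np.asarray(C[d]), np.asarray(docs[d])), 1)
    return n

# The collapsed versions used in the math:
# n.sum(axis=2)      -> n_dk· : words in document d assigned to topic k
# n.sum(axis=0)      -> n_·kv : word v assigned to topic k in any document
# n.sum(axis=(0, 2)) -> n_·k· : anything, anywhere, assigned to topic k
```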
Similarly, we get an analogous expression for the right-hand side. I've returned now to our larger, overarching joint — the joint distribution over all four kinds of variables — and we've just realized that the left-hand side of the big expression simplifies as shown. On the right-hand side there's a very similar situation: we have a product over k here, but in the likelihood also products over d and I_d. Thankfully, the notation n_dkv, which I've just copied down here again, allows us to simplify this too, and to think of it as individual factors indexed by k, where we count up how often we encounter, in any document and at any location, the word v assigned to topic k — that's n_·kv. So we see that our joint distribution already simplifies into two bigger chunks, one of which only depends on pi and the other only on theta. Where is c in this expression? Well, c is hidden away in these counts n, up here and up here. Does this help us in inference? At first sight you might think no, because if we want to compute the posterior for everything — for c, pi and theta given the words w — we have to write down this joint, which we can now easily do as an explicit expression, but then we have to normalize by a normalization constant, which is an integral of this expression over c, pi and theta. If you now look at the bit we found here — you might have to stop the video and stare at these expressions for a while — you might realize that it's not all that hard to do an integral over pi, because that gives us something like a Dirichlet integral expression, and it also might not be hard to do an integral over theta; but to sum out c is going to be difficult, because c is hiding away in those counts n. Now, this is of course annoying, but it's a typical situation in probabilistic modeling: variables start to depend on each other when we observe data. In fact, we really shouldn't have been surprised that this issue showed up, because we could have read it off from the graph. We could have just looked at it and seen that this is a collider structure: when we condition on the data, the latent quantities up here and here become conditionally dependent on each other. The graph basically already told us to expect that things would be hard. Could we have gotten around this? Not really, because the kind of model we wanted to write down, this low-dimensional representation, forces us to write down a graph like this. So is there something in the graph that we can look at and hope for salvation, for simplification? Well, humor me for a moment. This is the sort of thing you might do if you're standing in front of a whiteboard, discussing with your colleagues how to get things to work: you stare at this for a bit and do something that almost sounds a bit childish but actually turns out to be a good idea, which is to ask: what if we imagined, for a moment, that we had some of the variables that we don't actually have — could we then do something about the other variables? If you look at the graph, you can maybe see for yourself that if we had pi and we had theta, then we would have the Markov blanket of C, and therefore the individual topic assignments in C should become conditionally independent of each other.
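For reference, the collapsed form of the joint that we just derived — reconstructed here in the counts notation, with c hiding inside the counts n — is:

```latex
p(c, \pi, \theta, w)
  = \prod_{d=1}^{D} \frac{1}{B(\alpha_d)} \prod_{k=1}^{K}
      \pi_{dk}^{\,\alpha_{dk} + n_{dk\cdot} - 1}
    \;\cdot\;
    \prod_{k=1}^{K} \frac{1}{B(\beta_k)} \prod_{v=1}^{V}
      \theta_{kv}^{\,\beta_{kv} + n_{\cdot kv} - 1}
```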
Let's see if that claim about the Markov blanket is actually true. For that, I've written things down properly on the slides. Imagine we had pi and theta; then the posterior for C given theta, pi and w — let's call it a posterior, even though it's more like a conditional distribution — actually becomes an easy object. To see that, let us first think about what this posterior looks like; here I've written down again the expression we had on the previous slide. We just use mechanical operations of probability theory: the conditional for C given theta, pi and w is the joint over everything, divided by the marginal in which we sum out C, just the bit we want the conditional for. So what is that? Let's look at our joint for c, pi, theta and w again. On the previous slide I wrote it in this collapsed form — it's actually called the collapsed form — which is very nice and compact, but here it's more useful to go back to the representation where the c's show up explicitly, rather than the counts n. This is the expression we had two slides ago; I've just expanded the brackets again. Now we can look at this expression and check where c actually shows up, and we notice that c only shows up in these two — arguably likelihood — terms, while out here, in these bits, there is no c; these are just the priors for theta and pi. So when we compute the conditional for C given theta, pi and w, we have these outer terms and these inner terms in the numerator, and in the denominator the same expression, but summed over all possible values of C. Because the outer terms do not depend on C, we can take them outside of the sum, and they cancel above and below the division line; we're left with only the inner expressions. Now we can stare at those a little and notice that these two brackets both contain products over d and products over i, so we can merge them into one big bracket with a product over d and i; let's put that in front. Products show up both above and below, and the sum is over the individual c_dik. Because of this independence, the sum can go inside, and the only thing we have to be careful with is k, because k shows up in the actual terms inside. We're left with an expression that has a product over d and i both in the numerator and in the denominator, so we can take that outside of the quotient. Inside, we have in the numerator a product over k of the individual terms pi_dk times theta_k,w_di, raised to the power of c_dik, divided by a normalization constant that is literally just the sum over all possible values of the topic assignment — let's call that summation variable k-prime. It's sometimes hard to immediately see, just by looking at such expressions, that they are easy to compute. But if you think about what you have to do here — and spoiler alert, you're going to have to do this in your homework anyway — this is a completely separate computation for each factor of the product, and the only thing left inside is an individual sum, for each document and each word, over the values of these terms. If you had pi and theta, these would just be floating-point numbers, and the sum inside is a sum over capital K entries, the number of topics, which is usually a small number — that's the whole insight of our low-dimensional representation.
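Explicitly, the result of this derivation — reconstructed from the steps above — is:

```latex
p(c_{dik} = 1 \mid \pi, \theta, w)
  = \frac{\pi_{dk}\,\theta_{k, w_{di}}}
         {\sum_{k'=1}^{K} \pi_{dk'}\,\theta_{k', w_{di}}}
```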
So per word this is a small sum, and we have to do it completely independently for all words in all documents — something that should be easily parallelizable. It's the kind of operation that should be relatively easy with tensor arithmetic; in fact, it is. So maybe this observation gives us a glimmer of hope that there is something we can actually do: if we had theta and pi and w, then the individual word-topic assignments are possible to infer. This form of conditional independence we can read off from the graph, because pi, w and theta form the Markov blanket of C — remember that the Markov blanket is the set of all parents, co-parents and children of a node. Now, what about theta? Here the graph unfortunately doesn't help us, because even if we assume that we condition on C, we still have a collider structure: the observed variable w sits in there, and if we condition on it, we still have two arrows pointing into the same node. So we can't conclude, just from the graph, that these two variables might become independent. But maybe, if we look back at the expression we just constructed for our posterior — or actually, sorry, for our joint — you might already have noticed, from the suggestive form I chose to write it in, that we can already write the joint in terms of two different factors, one of which evidently contains only pi and the other only theta. C shows up in both of these terms, but that's something we can condition on. Why do these terms show up? They show up because we chose exponential family distributions. They are not something that arises from the graph structure we decided to impose, but from the exponential family structure we decided to use — from the fact that the exponential families we chose as priors are conjugate to the likelihood terms. If we had C, which we don't, then the posterior for theta and pi would actually be easy, because the posterior for theta and pi given C and w is given by this expression: the joint we have up here, divided by the marginal for C and w, which is the integral of the joint over pi and theta. We mentioned a few slides ago that this should be easy to do. Why? Because these two terms separate completely: theta only shows up here, pi only shows up here. So if we want to do these integrals, we can think of them separately, as an integral of this term over pi times an integral of this term over theta. And what is this term? It contains a constant that doesn't depend on pi, which we take outside the integral. The rest separates into individual terms over d — well, actually not quite over k, because the integral is over the simplex: over d it separates easily, but over k we have to do the integral jointly, because we integrate over the space of all pi_dk that sum to one over k. But that integral is exactly the normalization constant of a Dirichlet distribution — the normalizing factor of an expression like this — and that normalizing factor is just the beta function, evaluated at the vector with entries alpha_dk + n_dk·. If this sounds complicated, let me just rephrase it another way, because what I've just done is explain, for the I-don't-know-how-many-th time, in yet another way, that the Dirichlet prior is conjugate to the discrete likelihood.
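Spelled out — again a reconstruction in the counts notation, where alpha_d + n_d:· denotes the vector with entries alpha_dk + n_dk·, and beta_k + n_·k: the vector with entries beta_kv + n_·kv — the conditional is:

```latex
p(\theta, \pi \mid c, w)
  = \prod_{d=1}^{D} \mathrm{Dir}\!\big(\pi_d;\; \alpha_d + n_{d:\cdot}\big)
    \prod_{k=1}^{K} \mathrm{Dir}\!\big(\theta_k;\; \beta_k + n_{\cdot k:}\big)
```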
So the posterior arising from these observations — if we know what C is, the conditional distribution for theta and pi given C and w — is literally just a product over a bunch of completely independent Dirichlet distributions: an individual Dirichlet distribution for every single document, where the parameters are the parameters of the prior plus the number of times we've observed, in document d, any word assigned to each of the different topics. And likewise a product over completely independent Dirichlet — let's call them — posteriors for the topic-word distributions, whose parameters are given by the prior parameters plus the counts that keep track of how often, in any document, we've observed each individual word assigned to topic k. So what we've just realized is that, while we cannot compute the joint posterior over C, pi and theta, we can compute these two conditionals — you might call them semi-posteriors: the distribution for theta and pi given C and w, and the distribution for C given theta, pi and w. And even though this might not seem like a solution yet, let's open up our toolbox; maybe we can find something in there that we can use. Remember that over the course of this term we've populated a toolbox of ideas, and the main goal of the exercise we're going through is to demonstrate how to use this toolbox in practice, to give you some hands-on experience with what it's like to actually use these kinds of ideas. We've already used quite a few of them in the modeling part, step 2 in our cooking recipe for building probabilistic models: we used graphical models, exponential families and conjugate priors as ways of building models that are expressive, efficient to implement, and tend to have good factorization properties. We saw just now that doing this doesn't necessarily give us something trivial to work with — it doesn't solve every single problem — but it has solved a few issues for us; a lot of the math simplified quite conveniently. Now that we're building our algorithm, we're on the right-hand side of the toolbox: modeling was step 2, computation is step 3. We can look into our toolbox and think about what options we have available to work with the model we've given ourselves, the representation of the data we'd like to use. If we mentally go through this list, you might realize that already the first entry — Monte Carlo inference, sampling — contains a sub-algorithm that allows us to address this inference problem, and that sampling algorithm is Gibbs sampling. You will remember that in lecture 4 I introduced one specific variant of the Metropolis-Hastings base case of Monte Carlo methods, called Gibbs sampling, which is applicable to models where you can analytically sample one subset of the variables given all the others — or actually, where you can analytically sample any of the individual variables when conditioned on all the other variables. Back then I had to write this down as the somewhat abstract piece of math that you currently see on the slide, a copy from the corresponding slide of that lecture.
But now it's going to be quite concrete for our setting, because here we are directly, explicitly, in a setting where Gibbs sampling is not just applicable, it's also supposed to work well: we have a model involving latent variables — in our case called C, Pi and Theta — and we've just realized that when we condition on C, we have closed-form expressions, Dirichlet posteriors, for theta and pi; and when we condition on theta and pi, we have closed-form posteriors over C, where closed-form means discrete probability distributions that can be computed efficiently. So we can iterate back and forth, and we don't even need a big outer loop that iterates over all of the very many individual variables pi_d, c_di and theta_k; instead, things factorize when we condition on certain subsets of the variables. When we condition on pi and theta, we have a large set of completely independent distributions over the c_di, which we can draw independently of each other, in parallel; and if we condition on C, we have a set of completely independent Dirichlet posteriors over pi and theta, which we can also draw from in parallel. So we can implement a Gibbs sampler: a big for-loop that, at least on paper, keeps iterating forever — of course, in practice we'll stop it after a while — and in every single iteration does the following. This is not code yet, because you get to write the code, but it's math from which you can basically read off the code. One iteration of the loop goes: given a current instance of C, and of course the data w, draw the thetas and pis from the posterior distribution over theta and pi given C and w. We just saw on the previous slides that this posterior factorizes into two separate parts — a bunch of Dirichlet distributions for the theta_k, with their parameters, and a bunch of Dirichlet distributions for the pi_d, one for every single document, with their parameters — and we can do that in parallel, because the product term here means all of these distributions are completely separate from each other. Then, having drawn theta and pi, we turn to the other side of the model and draw C given theta and pi. Drawing that involves drawing discrete random numbers, so we can use a library for it, but it's actually a very elementary operation: we construct discrete distributions — where each individual discrete distribution can be computed entirely in parallel over all the documents and all the words in every document, with just a sum inside over all possible topics that the word could be assigned to, given theta and pi — and draw from this normalized probability distribution. This is comparably easy to implement, because there are libraries, in NumPy for example, but also in other, more advanced libraries, for drawing random numbers from Dirichlet distributions and of course for drawing a choice from a discrete distribution. And because of the factorization — factorization means products here — all of these computations can be done in parallel. You can imagine that this is the kind of thing you can make particularly efficient by operating on arrays and writing vectorized code. So how is this actually going to work in practice? Well, you're going to find out, because it's your homework to implement this algorithm on a concrete problem.
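As a rough outline of what one such iteration might look like in NumPy — a sketch of the two conditional draws derived above, not the homework template, and not vectorized as aggressively as it could be:

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_step(docs, C, alpha, beta):
    """One sweep of the (uncollapsed) Gibbs sampler.
    docs[d][i] = w_di, C[d][i] = current topic of word i in document d;
    alpha: (D, K) and beta: (K, V) are the Dirichlet prior parameters."""
    D, K = alpha.shape
    V = beta.shape[1]

    # counts n_dk· and n_·kv from the current topic assignments C
    n_dk = np.zeros((D, K), dtype=int)
    n_kv = np.zeros((K, V), dtype=int)
    for d in range(D):
        np.add.at(n_dk[d], C[d], 1)
        np.add.at(n_kv, (C[d], docs[d]), 1)

    # 1) draw pi, theta | C, w from independent Dirichlet posteriors
    pi = np.stack([rng.dirichlet(alpha[d] + n_dk[d]) for d in range(D)])
    theta = np.stack([rng.dirichlet(beta[k] + n_kv[k]) for k in range(K)])

    # 2) draw C | pi, theta, w from independent discrete distributions,
    #    p(c_dik = 1) proportional to pi_dk * theta_k,w_di
    for d in range(D):
        p = pi[d][None, :] * theta[:, docs[d]].T   # shape (I_d, K)
        p /= p.sum(axis=1, keepdims=True)
        C[d] = np.array([rng.choice(K, p=row) for row in p])
    return C, pi, theta
```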
Now, when you encounter such a problem in the wild for the first time — when you're working with a big data set like our State of the Union addresses — it can be useful practice, good software development practice, to build a toy problem first, one where you know what the ground truth is, to build a form of unit test for your model and your algorithm, to see whether it actually works. This is a little harder to do in machine learning than in other areas of computer science, because the data has a statistical nature, so it's not so straightforward to write a simple rule-based unit test. Instead, we tend to build toy data sets for which we know the structure. Here is such a data set, which I did not invent myself; I've taken it from another very important paper on latent Dirichlet allocation, which we will discuss again later, because it introduces another fun algorithm that we'll come back to in a few weeks. It was written by Griffiths and Steyvers and published in the Proceedings of the National Academy of Sciences, PNAS, a little after the original paper by Blei, Ng and Jordan. It's a bit of a meta paper, and it helped popularize latent Dirichlet allocation, because the authors actually applied the algorithm to a collection of papers — the corpus of PNAS itself. A fun idea. In that paper they also introduced the toy model that I'm going to use to showcase how the algorithm works. So we are deviating now from the State of the Union addresses, as you would do when implementing the algorithm in the wild for the first time, and using this toy model instead. It consists of a thousand documents, where each document is designed such that you can think of it as an image. The way they do this is to make every single document a hundred words long, where the words come from a vocabulary of size 25 — 25 being significant because it's the square of 5, so we can think of each document as an image of size 5 by 5. Every individual document looks like one of the little images you currently see on the slide. These images are created by drawing, for every single document, from a Dirichlet distribution over topics; they actually choose the parameter to be 1, which is a bit questionable because it means the documents are not really sparse anymore, but this actually makes things a bit more interesting, because the documents now look a bit corrupted — and maybe, just by staring at the images, you can already see what the topics are. The topics, the matrix theta, are predetermined, designed by hand by the authors, and they look like this: the individual topics are just bars, either vertical or horizontal. This is convenient because it allows us to use our human brain's best capability, image processing, to see whether the algorithm works. It's much more convenient for bug-fixing than an actual corpus of words like our State of the Union addresses, because it's very hard to look at those data sets and get a feeling for whether the model finds interesting topics or not. Here we can just look at the output of the Gibbs sampler and check whether it learns those topics over time. So what are we going to do? We're going to take this data set — an array of 1000 documents times 100 words, where each word comes from a vocabulary of 25 different words — and run Gibbs sampling on it.
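For reference, here is a minimal sketch of how one might generate such toy data (my reconstruction from the description in the lecture; the details may differ from Griffiths and Steyvers' exact setup):

```python
import numpy as np

rng = np.random.default_rng(3)

# K = 10 bar topics over a vocabulary of V = 25 words, viewed as 5x5 images
K, side = 10, 5
V = side * side
theta = np.zeros((K, side, side))
for j in range(side):
    theta[j, j, :] = 1.0         # horizontal bar topics
    theta[side + j, :, j] = 1.0  # vertical bar topics
theta = theta.reshape(K, V)
theta /= theta.sum(axis=1, keepdims=True)

# 1000 documents of 100 words each; document-topic weights drawn from a
# Dirichlet with alpha = 1, i.e. uniform over the simplex, as in the lecture
D, I = 1000, 100
pi = rng.dirichlet(np.ones(K), size=D)
# the per-word distribution of document d is the mixture pi[d] @ theta
docs = np.stack([rng.choice(V, size=I, p=pi[d] @ theta) for d in range(D)])
```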
Gibbs sampling is actually not the algorithm the original authors used; they used a collapsed version of Gibbs sampling, a more efficient way to do Gibbs sampling that we will come back to later. So how would we do this with our first idea for inference, with our Gibbs sampler? Well, we would write a for loop that does exactly this: a piece of Python code that iterates and performs this sampling step over and over again (a rough sketch of such a loop is shown below). As we do this we can, at every single time step, look at the current sample for theta, check what it looks like, and check whether it resembles those topics. You can maybe imagine that the main challenge in doing this is to write efficient Python code: not just to call libraries that draw from a Dirichlet distribution and from a discrete distribution, but also to construct the parameters of these distributions such that they can be vectorized and processed in parallel efficiently. If you do that right, and to do that you will probably have to watch this video or look at the slides another time, follow what we actually did, and then implement it in Python using our templates, you will get an algorithm that converges over time.

What I am going to show you here are 200 iterations, each redrawing all of theta, all of pi, and all of c, and every 10 iterations I am showing you the current sample for theta, the topic-word probabilities. We initialize the sampler randomly; you can see that these initial topics do not look like anything, they are basically just noise. As the sampler moves forward, you can see that within 200 samples it converges comparatively quickly to the right topics. Why are those the right topics? You might notice that they are not ordered: the horizontal and vertical bars are mixed up. Of course this is an aspect of the model; there is absolutely no reason for the model to prefer a certain order among the topics. The topics are exchangeable, so the model is invariant under permutations of these topics, but each individual topic nearly corresponds to one of the original ones. There is still a little bit of variation, a few stray counts in here, so it is not exactly right yet, but if you ran the sampler for a little longer it would converge to something very good.

So let me go back and do this slowly again, step by step. You will notice that this sampler very quickly converges on some of the topics; you can see that some of these, this one and this one and this one maybe, are already pretty good representations of the topics, and then there are also some, like topic 6 and topic 4, where the sampler is not in a good state yet. This is a visualization of the situation we looked at in lecture 4 as well: Markov chain Monte Carlo methods can take a while to burn in, to reach a good representation, to get into the characteristic set, the typical set, of the probability distribution. This sampler here is still in burn-in at iteration 40, it is not quite in the typical set yet, but if we let it keep going for a bit, you can see in particular that topic 1, topic 4, and topic 6 start interacting with each other.
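The for loop mentioned above might look roughly like the following uncollapsed Gibbs sweep, reusing K, V, D, N, and docs from the data-generation sketch. The hyperparameters alpha and beta and all variable names are my own assumptions for illustration, not the lecture's template code, and the inner resampling of c is deliberately left unvectorized to mark where the real efficiency work lies.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 1.0, 1.0                                   # symmetric Dirichlet hyperparameters (assumed)

pi = rng.dirichlet(np.ones(K) * alpha, size=D)           # (D, K) document-topic weights
theta = rng.dirichlet(np.ones(V) * beta, size=K)         # (K, V) topic-word probabilities
c = rng.integers(K, size=(D, N))                         # topic assignment of every word

for it in range(200):
    # 1) resample every word's topic assignment c_{d,n} given pi, theta and the observed word
    for d in range(D):
        p = pi[d][None, :] * theta[:, docs[d]].T         # (N, K) unnormalized posteriors
        p /= p.sum(axis=1, keepdims=True)
        c[d] = [rng.choice(K, p=p_n) for p_n in p]       # vectorizing this step is the real exercise

    # 2) resample pi_d given c from its Dirichlet posterior
    counts_dk = np.stack([np.bincount(c[d], minlength=K) for d in range(D)])
    pi = np.vstack([rng.dirichlet(alpha + counts_dk[d]) for d in range(D)])

    # 3) resample theta_k given c and the words from its Dirichlet posterior
    counts_kv = np.zeros((K, V))
    np.add.at(counts_kv, (c.ravel(), docs.ravel()), 1)
    theta = np.vstack([rng.dirichlet(beta + counts_kv[k]) for k in range(K)])

    if it % 10 == 0:
        print(it, "reshape theta to (K, 5, 5) and plot it to inspect the bars")
```

Every tenth iterate of theta, reshaped into 5-by-5 images, is exactly what the animation on the slide shows.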
As more and more samples come in, the sampler eventually breaks the symmetry between these three topics and hones in on one of them. If the sampler keeps running, you will notice that it has honed in on one mode of this distribution, one local region, because it does not exchange the identities of the topics with each other. Every other permutation of those topics would also lie in the typical set, but this Markov chain Monte Carlo method will take a very, very long time indeed to incidentally permute some of those topics with each other. This is again one of the issues with Markov chain Monte Carlo methods: even though they asymptotically draw from the full joint distribution, they typically spend a very long time in one of the modes of the distribution. Nevertheless, we can see that this algorithm works, in the sense that after a finite amount of computation, 200 iterations, it finds a good representation. This is the sort of result we want: the sampler started from this noisy dataset, which looks like this, and our algorithm has figured out that the dataset should be represented in this particular low-dimensional way. You can then look at these topics and think about what they mean; in the case of our State of the Union addresses, those will be selections of words that co-occur and hopefully have semantic meaning.

You can imagine that doing this right, building this algorithm such that these 200 iterations do not take all night, is a bit of an exercise in software design, and that is why we are doing this in the final part of the lecture course. As tedious as what we have done today may seem, as boring as it may have been to go through all this algebra, this elaborate arithmetic, these data types, conjugate priors, and Markov chain Monte Carlo algorithms, this really is the strength of the probabilistic framework. When people criticize Bayesian reasoning, they often bring up the fact that you have to write down the prior. Bayesians often turn this argument around and say that this is really the strength of probabilistic modeling: you get to write down everything you know very explicitly, and in doing so you incorporate all your prior knowledge, which can speed up inference, so that you can learn things that are structured, interesting, and explicit. Well, that is nice to say, but it also means we have to be willing to actually do it in practice. What we did today was exactly that: an example of how to really write down a set of prior assumptions and a generative model in order to infer a genuinely nontrivial amount of structure. Notice how hard it would be to do something like this with a deep learning framework, because it would be very difficult to encode explicitly what we mean by sparsity or by the relationships between these variables. Also notice that when we do this, we of course have to pay a price: we are immediately faced with a challenging algorithmic question. The true posterior has a complicated structure, and computing a tractable approximation to it is not trivial. However, there are many different ways of computing an approximation to the posterior. We saw one today, which is to use Markov chain Monte Carlo methods and draw samples from the distribution, and this can be done particularly efficiently if we pay attention to conditional independence structures.
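Because the topics are only recovered up to a permutation, as discussed above, a quantitative check against the ground truth has to match topics first. The lecture itself just inspects the 5-by-5 images by eye, but here is a small sketch of one way to do the matching; it reuses theta_true and theta from the sketches above and assumes SciPy is available.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# L1 distance between every ground-truth topic and every sampled topic,
# then the best one-to-one matching; a small total error means the sampler
# has recovered the bars up to a relabelling of the topics.
cost = np.array([[np.abs(t - s).sum() for s in theta] for t in theta_true])
row_ind, col_ind = linear_sum_assignment(cost)
print("total L1 error after optimal relabelling:", cost[row_ind, col_ind].sum())
```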
All of this is certainly not trivial; it is not something you can just learn in an afternoon, which is why you have to get a master's degree to do it. But it is a very powerful framework, and it does follow some structure. We followed this rough cooking recipe today. We collect the data; of course, that is something everyone has to do. Then we build a model, using very much the probabilistic framework as our guide: we write down a generative model, a joint probability distribution over all the variables, both the data we get to observe and all the latent structure. We have to invent the latent structure we care about and write down a representation in terms of quantities that are meaningful to us. That may sound like just inventing stuff, but of course that is what a computer programmer does all the time: we come up with variables that are meaningful and useful for representing what we would like to do. Then we design a graphical model that describes how these variables are related to each other, and we choose data types and therefore distributions, exponential family distributions with convenient algebraic properties; typically they are conjugate priors for the variables we get to observe. Having written down our generative model and being hopefully reasonably happy with it, we can start to think about the computation. For that we first look at the graph again, try to read off conditional independence structure, and see whether there is something in there that makes things easy for us. Then we try algorithms, tools from our toolbox; today we specifically focused on Markov chain Monte Carlo. We try out experiments, typically with toy models first, to see whether the algorithm is implemented correctly. Today we looked at Monte Carlo; over the coming weeks we will first encounter another kind of inference, variational inference, the last tool missing from our toolbox, and then also come back to the other tools we already have, such as maximum-likelihood inference and Laplace approximations, to test the algorithm and scale it up to our real application domain. And in the end, of course, we would like to make the model even more explicit, improve its performance, or add new functionality to represent the things we really care about. All of this is much harder to do in a non-probabilistic framework than in a probabilistic one. You get to really write down the relationships between your variables, but that requires thinking about your code in a way that is, at least currently, still significantly more challenging than standard deep learning. That is also why people who know how to do this are in high demand on the job market and are, at least currently, able to command quite enjoyable salaries. With that, thank you very much for your attention today. I am looking forward to seeing you in the next lecture.