Hello, and welcome to lecture number 21 of Probabilistic Machine Learning. Today I want to tie up a thought from last lecture and then begin a bit of a conceptual detour that will introduce, or get us to consider, a new class of models that, although they are not immediately useful for our example at hand, will allow us to add one final methodological tool to our toolbox, which will be the last one for this course. In the last two lectures, we have encountered a running example that we will keep using for almost the rest of the entire course, with the exception of the short detour we'll do today. And this example gives us an opportunity to explore, or to collect some hands-on experience with, how you build a customized, specific machine learning solution to a concrete problem. I've pointed out several times, and I'll do so again now, that one can think about this process as a three-step process: collecting data, then building a model for it, and then finally building a computational solution, an algorithm on the computer. All three of these steps involve human labor, although of a different kind. We began with collecting our data. This is now sufficiently far away from us that we don't have to discuss it too much anymore, but the key point there was to focus on collecting as much metadata, as much surrounding information as possible about the dataset. Why? Because machine learning isn't a black box, and because building a model requires prior knowledge, and we get that prior knowledge from exactly this kind of metadata. On the other hand, though, as much as every single new problem, of course, is unique, a problem of its own, and requires a customized solution, there are also some recurrent patterns, just like there are patterns in software design, that we can use to efficiently build models. And we've collected many of those already over the course of the lecture so far in our toolbox: modeling tricks on the left-hand side, and computational, algorithmic tricks on the right-hand side. For our concrete example problem, which is a collection of documents containing speeches of American presidents, we began our thought process about the model by realizing that the overall task, the high-level task we're trying to solve, is one of dimensionality reduction. We want to take this sort of array-like structure of documents that contain words and represent them in a low-dimensional fashion, to say every single document captures a few topics. Our first decision to this end was to discard the order of the words and assume that we can represent the documents in terms of counts of how often each word from a vocabulary of size V shows up in every single document. And we also convinced ourselves that we shouldn't think of this process of dimensionality reduction in the linear algebra sense, but that we require a little bit more structure in our model. This was the beginning of the real modeling process, and we've already worked on it over the course of two lectures. We've maybe agreed together, although you didn't have much of a say in it because we have this unfortunate situation that I have to convey these ideas to you through a YouTube video.
But I hope I've convinced you that an interesting model to consider, at least, although of course not the correct one, because there isn't really a correct one for this dataset, is that of so-called latent Dirichlet allocation, which assumes that both the documents and the topics are sparse, but not binary, in the sense that every document is representative of a few topics, not all of them. And we've decided to model this by putting a Dirichlet prior over the document-topic distribution. So every document can be represented as a probability distribution over topics that tends to be sparse, in the sense that the non-zero components of this probability distribution are only a few. And similarly, the topics themselves are probability distributions over words in the vocabulary, which are also sparse. So they're also Dirichlet-distributed, or our prior over them is also a Dirichlet prior, which means that a topic is a probability distribution over words such that few words show up frequently in those topics and most words tend not to show up. We can use this structure then to produce the documents by drawing, for every document, first a document-topic distribution, and then, for every single word in the document, a topic for the word from that distribution, and then creating the word, conceptually, by drawing from the corresponding topic we've just identified. We wrote down this generative model in this fashion in a graphical model, and this is useful because, A, it helps us to visualize the process. We can think about those arrows, although not in a causal sense, in a generative sense, so we can pass from the outside to the inside of this graph, down to the words. And when we do that, we've written down a generative process, so a way to actually create words for these document corpora. At that point, we turned our attention to algorithmic design. So now that we have our model, we can study it and consider, in a more mathematical form, what it looks like, try to read off from the graph conditional independence structure, or maybe also get conditional independence structure from the mathematical representations, and then hopefully use the insights that we might gain from this process to get an idea for an algorithm. We went through this admittedly tedious process last lecture, and it involved these complicated expressions: writing down the joint distribution as an equation and staring at it, trying to find some structure. We noticed that what may have seemed at first like an ad hoc, kind of knee-jerk thing, which is the idea to use exponential family distributions as the standard types for priors of variables that show up in a model, has actually paid off, because we can nicely simplify these expressions and arrive at a relatively compact representation of the joint distribution. We also looked at the graph and found that we can get various interesting conditional independence structures. In particular, we noticed that if we condition on theta and pi, then we can read off from the graph that the individual words' topic assignments are of a relatively simple form. They factorize over the documents, and even over the words, and we can assign, for every single word in every single document, in a local computation, a probability distribution over its topic; this conditional is written out as a small formula below.
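This is a sketch in slightly informal notation, assuming, as in the plate diagram, that $\pi_d$ is the topic distribution of document $d$, $\theta_k$ the word distribution of topic $k$, $w_{di}$ the identity of word $i$ in document $d$, and $c_{di}$ its topic assignment:

$$
p(c_{di} = k \mid \pi, \theta, w) \;\propto\; \pi_{dk}\,\theta_{k,\,w_{di}},
\qquad
p(C \mid \pi, \theta, w) \;=\; \prod_{d}\prod_{i} p(c_{di} \mid \pi_d, \theta, w_{di}).
$$

Normalizing over the K possible topics is a cheap, purely local computation for each word.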
And then there is something that's not easy to read off from the graph, but which arises explicitly from our choice of priors, the Dirichlet priors for the two probability-distribution-valued variables theta and pi: if we condition, vice versa, on the word-topic assignments C, then pi and theta actually neatly factorize. The joint conditional distribution for theta and pi turns into a product of a bunch of Dirichlet distributions, which also factorize over the document-topic distributions and the topic-word distributions, one for each document and one for each topic. This insight, that the distribution nicely factorizes, led us to consider a relatively generic algorithm from our toolbox: Gibbs sampling. So maybe at this point we're talking about algorithms, but arguably there is a crossover from the modeling phase to the algorithmic phase, where we use these insights into conditional independence structure to construct actual algorithms. This Gibbs sampler is a for loop that keeps iterating, conceptually forever, although of course in practice we're going to stop it at some point, by drawing iteratively, well actually jointly, the variables theta and pi, because when you condition on C they are independent of each other, from the corresponding posteriors, which happen to be Dirichlet distributions, and then drawing the word-topic assignments from the discrete distributions given by their conditionals. If you've done the exercises this week, then you've actually implemented this algorithm, and you may have noticed that it requires a little bit of work to get right, a final step of conceptualization, of thinking about your code, how to vectorize it efficiently, what kind of tools to use to neatly structure the code and make it run fast, but it's an algorithm that actually works. It works on the toy problem that we considered last week, and it also works on the actual dataset. Gibbs sampling thus provides us a first working implementation of our machine learning solution for the problem we've set ourselves. It provides an algorithm that realizes our model in practice and performs Bayesian inference. It's often important to have such first implementations, such algorithms that work, because you can look at their output, you can check whether it makes conceptual sense, whether the model is able to find the kind of structure, in principle, that we're looking for, and of course you can show it to, I don't know, whoever is paying your salary, to point out that you've found something that actually works. However, of course, as soon as you've implemented this method, you probably begin to wonder: well, is this actually the most efficient way of solving this problem? It's a very ad hoc kind of thing: we just looked at the structure, we discovered some conditional independence structure, we realized that there are these groups of variables called theta and pi and C that are conditionally independent of each other, so we can use Gibbs sampling, right? A minimal sketch of this sampling loop follows below.
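Here is a minimal sketch of that loop, assuming NumPy, documents given as integer index arrays, and Dirichlet prior vectors alpha and beta; the function name and bookkeeping are hypothetical, and the exercise implementation will differ in how it vectorizes these steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda(docs, K, V, alpha, beta, n_iters=200):
    """Plain (uncollapsed) Gibbs sampler for LDA -- a sketch, not the exercise solution.

    docs  : list of 1-D integer arrays; docs[d][i] is the vocabulary index of word i in document d
    K, V  : number of topics / vocabulary size
    alpha : (K,) Dirichlet prior on document-topic distributions
    beta  : (V,) Dirichlet prior on topic-word distributions
    """
    D = len(docs)
    # random initial topic assignment for every word
    z = [rng.integers(K, size=len(doc)) for doc in docs]

    for _ in range(n_iters):
        # count how often topic k occurs in document d, and word v in topic k
        n_dk = np.zeros((D, K))
        m_kv = np.zeros((K, V))
        for d, doc in enumerate(docs):
            np.add.at(n_dk[d], z[d], 1)
            np.add.at(m_kv, (z[d], doc), 1)

        # step 1: draw pi and theta from their (conditionally independent) Dirichlet posteriors
        pi = np.array([rng.dirichlet(alpha + n_dk[d]) for d in range(D)])
        theta = np.array([rng.dirichlet(beta + m_kv[k]) for k in range(K)])

        # step 2: draw every word's topic assignment from its discrete conditional
        for d, doc in enumerate(docs):
            p = pi[d][None, :] * theta[:, doc].T          # (len(doc), K), unnormalized
            p /= p.sum(axis=1, keepdims=True)
            z[d] = np.array([rng.choice(K, p=p_i) for p_i in p])

    return z, pi, theta
```

The two lines drawing pi and theta from their Dirichlet posteriors are exactly the two lines we will get rid of in a moment.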
Maybe even while you were implementing this method you realized that it may not be the most efficient thing to do, because what's actually happening here is that we are sampling C given theta and pi, and then sampling theta and pi given C, and of course we notice that theta and pi depend on C, and vice versa C depends on theta and pi, so one might be wondering, or one might be worried, that the sampler is perhaps not performing the most efficient kind of sampling process, because it has to sort of pull itself up by its own bootstraps. This is a typical situation in probabilistic modeling: we first come up with a simple solution that works, and then we begin to think about efficiency. So for the next few minutes, let's think about one first thing you could do to make this algorithm more efficient. And, spoiler alert, it's going to require looking at the formulas, at this complicated math, again, even though we might by now really have had enough of it, and thinking about it even harder, using insights we gained earlier about the structure of the tools we're using, to find ways in which we can make our algorithm more efficient. To gain such an insight, one thing we could do is to look again at what's actually happening here. We've assigned conjugate priors, Dirichlet priors, to the discrete likelihoods that are at the heart of our graphical model, in the innermost plate, and what the Gibbs sampler is doing is repeatedly updating a conjugate posterior, a Dirichlet posterior, arising from conjugate prior inference with this discrete likelihood, and it keeps doing that over and over and over again. So a very vague idea that might form in your mind is: do we even have to do this explicitly in code? Can't we somehow get rid of those variables theta and pi, while keeping around the fact that they mediate the distribution over the C's? It'll turn out that that's actually possible. And to understand why, let's maybe go back to the lecture notes for exponential family distributions. Here is the central slide from lecture number 15 on exponential family distributions. This is a summary slide that shows that for exponential family distributions, or likelihoods if you like, so in our case this is a discrete distribution, there is always a conjugate prior which is of exponential family form, and whether we can use it or not depends on whether we are able to evaluate its log partition function, or its partition function, F. So if we know how to do this integral, then we can define a conjugate prior for this likelihood, which gives a posterior distribution that is of the same form as the prior. This here is the generic form of an exponential family. And we noticed, as a side aspect which back then may have seemed not so consequential, that a part of this conjugate prior inference is that we can also compute the evidence that shows up in the corresponding application of Bayes' theorem. So this is our prior, this is the likelihood, and this the corresponding evidence, the normalization constant. Well, what is that? It's of the form: let's integrate out prior times likelihood against the latent quantity w. Prior times likelihood is another expression, up to constants, that is given by an exponential of a linear form that we know how to integrate over, right?
That's this: the expression we get up here is the sufficient statistics plus the prior natural parameters, times w, minus a count variable times the log partition function of the likelihood, integrated against w. This here is a constant. We know what that is. It's just F of alpha and nu. And this up here is of the exact same form as F. And so our evidence is actually just a ratio between two normalization constants of an exponential family, both of which we can compute. We just have to update with the sufficient statistics of the observation and update the count to get our normalization constant. So integrating out the variable from the conjugate prior, in a conjugate situation, is possible in closed form. In our concrete case, this generic structure involves discrete distributions and Dirichlet distributions. So for us, the likelihood is the probability for the word-topic assignment C, given the latent document-topic distribution. And we've decided to use as a prior for that the Dirichlet prior, because the Dirichlet is the conjugate for this discrete distribution. And the normalization constant of the Dirichlet is the beta integral, which we've now used several times. And so far, for our Gibbs sampling, we've used the fact that this gives rise to a Dirichlet posterior. But it also gives us a marginal for the word-topic assignments, which is given by integrating out pi in the joint of C and pi. And that marginal is given by a ratio between two beta integrals: the normalization constant of the prior in the denominator, and the normalization constant of prior times likelihood in the numerator. So maybe we're actually able to integrate out pi and theta in our process. Let's see if we can actually do that. So apologies, here's a huge big slide of math, but it just reuses structure that we've seen before. Here I've copied into the first row the most compact expression we had so far for the joint distribution in our model over C, pi, theta, and w. C, just to remind you again, are the word-topic assignments, pi are the document-topic probabilities, theta are the topic-word probabilities, and w are the identities of the actual words. So we've managed to reduce this expression to this kind of nice and compact representation, which is a big product between terms that involve pi and C, where C is hidden away in those counts, and another expression that involves only theta and C, again hidden away in the counts. So we already noticed, and used when we implemented Gibbs sampling, that the expression we have here is, well, this here is the normalization constant of a Dirichlet, the beta integral of alpha_d, and the same here for beta_k. And here in the back, we have essentially a Dirichlet distribution, up to the normalization constant, which is the integral over this expression with respect to pi. Well, that normalization constant is the beta integral of alpha_d plus the count variables over the topics k, right? So that's a vector of how often any word in this document has been assigned to each of the individual topics. So we can write this expression up here exactly in this form. So this is really just a rewriting: this is a Dirichlet distribution up to its normalization constant, so if I write a Dirichlet distribution, I have to multiply with the normalization constant to get the same expression. And the same thing happens over here. It's exactly the same situation. The only difference is that we get a different kind of count variable.
So now alpha is replaced by beta, and instead of counting how often we've seen every individual topic assigned in document d, we now instead sum out the documents and ask: how often have we seen each of the individual words in topic k, no matter which document? Now we see that this is exactly of the kind of form where we can integrate out pi and theta. So if we go from the joint to the marginal over C and W, then we have to integrate out pi and theta here, and that's trivial, because this is a probability distribution over the pi_d and this is a probability distribution over the theta_k. There are no other places where pi and theta show up. So the integral over this is just one for every single document and one for every single topic. So the marginal is given by the product of these two expressions: schematically, a product over documents of ratios of beta integrals, $B(\alpha_d + n_d)/B(\alpha_d)$, times a product over topics of ratios of beta integrals, $B(\beta_k + m_k)/B(\beta_k)$, where $n_d$ and $m_k$ are the count vectors just described. To make more sense of this, in the line below I've just plugged back in the definitions of what beta integrals actually are: beta functions, beta integrals, are ratios of gamma functions. Now, this may seem like a really complicated expression, and not necessarily one that helps us speed up our computations any further, in particular if you remember that previously we were able to use fast libraries that just give us access to draws from Dirichlet distributions. But if we stare at this expression a little bit longer, and look kind of deep into it, it's actually possible to extract a very interesting structure from it. So let's consider what this joint distribution over the unknown word-topic assignments C and the known data, the words W, actually looks like when we focus on a particular word. So let's consider the word that is at location i in document d. Now that word contributes, so first of all, the C's, just to remind you, only contribute to those counts N, here, here, here and here. So just from that we see that our word in document d at location i contributes in, well, four different places here on the screen, which actually are part of more than four factors. So that's down here, where there's an N here and there and here: four different places. Now, first of all, since we are considering document d, there's only one term in this product over documents that we actually have to think about. And there's also only one term in the product over v here that matters, namely the one where little v is equal to the identity of our word w_{di}. The second thing we notice is that if we are considering which topic to assign to this word in document d at location i, then this expression down here actually won't be affected by our decision to assign this word to a particular topic. Why? Well, this expression down here involves a sum over all possible topics. So if we change the assignment of our word in document d at location i from one topic, let's say topic two, to topic three, then this expression down here doesn't change: in that sum, there's just one term where we subtract one and another term where we add one. So overall, there won't be a change. So this is a part that we don't have to worry about when we consider which topic to assign to the word at location i in document d. So we're left with only three locations that actually matter, this one, this one, and this one: three terms. Each of these terms is a gamma function. We've now encountered gamma functions several times, in our exercises and also across the lectures. To remind you, and this is up here, gamma functions are a continuation, a continuous interpolation, of the factorial function.
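For reference, this is Euler's integral definition of the gamma function, together with the interpolation property we are about to use:

$$
\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,\mathrm{d}t, \qquad \Gamma(x+1) = x\,\Gamma(x), \qquad \Gamma(n+1) = n!\ \text{ for integer } n \ge 0.
$$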
They were invented, or first defined, by Euler in an exchange of letters with Goldbach, and because they are this interpolant, they have this property: the gamma function of any x plus one, where x is a real number, is equal to x times the gamma function of x. We can make use of this here to think about how our choice of topic assignment for word i in document d actually affects the joint probability, p(C, W). Remember, in the end we want to do posterior inference on C given W. So, well, maybe not in the end, but as an intermediate step, we want to do posterior inference for C given W, and that will require us to write down this joint, p(C, W), which we can think of as a prior for C times the likelihood for W given C, and then divide by the normalization constant. And we can think of that in terms of individual terms over the individual words. So for every individual word, we will get an expression that is equal to gamma of alpha_dk plus the count that we get when we assign word i in document d to topic k. Well, that's the count that arises from all the other words that are already in there, plus exactly one in the one topic k that we've assigned our word to. So when we add a plus one here, we can write that, using this property of the gamma function, as alpha_dk plus the count that we get from all the other words, times the gamma function of that same argument. That "minus one" I'm going to write in this form: what this is supposed to be is the count we get in document d, in topic k, when we sum up all the words except for the one word at location i in document d, that is, the "without i" count, $n_{dk}^{\neg i}$. And that's of course a number: it's the number that we get from all the other words, not the one we're considering. And then that number gets multiplied with a gamma function of that number. Now, that gamma function is also going to show up in our normalization constant, which we get in Bayes' theorem, right, when we integrate out the assignment of this particular word. And so it'll cancel between the numerator and the denominator. And a similar thing happens for the other two terms. So here, back here, let's look at that first. We have a gamma function over beta_kv plus the number of times, in any document, we've assigned word v to the particular topic k. In this big product, only the term where our word v is equal to w_{di} actually matters. And then that factor here will be beta_{k,w_{di}}, because v is equal to w_{di}, plus the count that we get when we assign our word to topic k, minus one. And that is exactly the count we would have gotten if we didn't consider our word at document d, location i, from all the other words, evaluated at the index w_{di}; and then again a gamma function, which shows up both in the numerator and the denominator, so it cancels out. And then finally, the most important, the most complicated expression: it's this one down here, of which we get an inverse, so there has to be a minus one here. It's again a gamma function, over a sum over all words of how often we've assigned them to topic k. So when we assign our word, whatever its identity might be, to topic k, we're going to get one extra term in here, which we can take outside of the gamma function just like before. And this expression is left.
That's the number of times, in any document, we've assigned words to topic k, if we don't count our current word under consideration and sum over all the word identities. The product of these three terms is proportional to our posterior distribution for assigning word i in document d to topic k. Proportional to, because it's a discrete distribution, and so to make it an actual discrete distribution, we have to normalize. We normalize by just taking all these terms and summing over their values; we can also let our computer take care of that. Why is this insight interesting? Well, because it gives us an efficient way to speed up Gibbs sampling. Because if we do it like this, then we notice that we can sample the word-topic assignments without ever touching a distribution over pi and theta. This is useful in two ways. It's useful, A, and this is the simple thing, because it simplifies our life: it takes away two complicated lines of code in our implementation that the computer has to go through to draw, in a Gibbs fashion, assignments to theta and pi. And the second advantage is that by taking out the effect of theta and pi, by marginalizing over them, we might actually allow our sampler to converge more efficiently, because it doesn't have to take the detour of mediating the effect of reassigning all the words to different topics first through our distributions over theta and pi, and only from there back to the words. This idea is called collapsed Gibbs sampling. And it uses this joint over C and W, which is also the marginal of the joint over C, pi, theta and W when we integrate out theta and pi, to run a much simpler loop than the one we considered last week: a loop that only reassigns the topic identities of the individual words in all the documents and all the locations. One way to do that would be like this; this is like pseudo code, and you can probably think for yourself of a way of doing this that is a little bit more efficient. This is arguably the original Gibbs sampling algorithm for this model: the one that was introduced by Tom Griffiths and Mark Steyvers in the paper that I've already cited last lecture, Finding Scientific Topics, in the Proceedings of the National Academy of Sciences. As I said back then, this wasn't the first implementation of LDA inference, latent Dirichlet allocation inference. It's actually arguably the second one, but it's the first one using a Markov chain Monte Carlo method, this kind of collapsed Gibbs sampling. The original way of doing inference in latent Dirichlet allocation, which is due to David Blei and his co-authors, uses a different kind of inference, which we are going to work our way towards over the next two or three lectures. And we've actually already begun making our way towards it with what we just did in the past ten minutes. We just realized that to do Gibbs sampling, we don't have to keep drawing from the distributions over theta and pi; instead we can directly draw only the assignments of topics to the words, from their marginal distributions. By the way, of course, in the end our interest lies not in those word-topic assignments. At some point, we are of course interested in the distributions over the topics for each document and over the words for each topic. How do we do that?
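Before answering that, here is a minimal sketch of the collapsed update we just derived. The conditional being sampled is, up to normalization and in slightly informal notation, $p(c_{di}=k \mid C^{\neg di}, W) \propto (\alpha_{dk} + n_{dk}^{\neg di})\,\frac{\beta_{k,w_{di}} + m_{k,w_{di}}^{\neg di}}{\sum_v (\beta_{kv} + m_{kv}^{\neg di})}$. The code is a sketch only, assuming NumPy, documents as integer index arrays, and, for simplicity, that all documents share one prior vector alpha (length K) and all topics share one prior vector beta (length V); the names of the count arrays are hypothetical, and real implementations mostly differ in how cleverly they maintain these counts.

```python
import numpy as np

rng = np.random.default_rng(0)

def collapsed_gibbs_lda(docs, K, V, alpha, beta, n_iters=200):
    """Collapsed Gibbs sampler for LDA (sketch): pi and theta are integrated out;
    only the word-topic assignments z are resampled, using incrementally updated counts."""
    D = len(docs)
    z = [rng.integers(K, size=len(doc)) for doc in docs]

    # count arrays: topic counts per document, word counts per topic, total counts per topic
    n_dk = np.zeros((D, K))
    m_kv = np.zeros((K, V))
    m_k = np.zeros(K)
    for d, doc in enumerate(docs):
        np.add.at(n_dk[d], z[d], 1)
        np.add.at(m_kv, (z[d], doc), 1)
        np.add.at(m_k, z[d], 1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k_old = z[d][i]
                # remove the current word from all counts (the "without i" counts)
                n_dk[d, k_old] -= 1
                m_kv[k_old, v] -= 1
                m_k[k_old] -= 1

                # discrete conditional over the K topics, up to normalization
                p = (alpha + n_dk[d]) * (beta[v] + m_kv[:, v]) / (beta.sum() + m_k)
                p /= p.sum()

                # resample the word's topic and put it back into the counts
                k_new = rng.choice(K, p=p)
                z[d][i] = k_new
                n_dk[d, k_new] += 1
                m_kv[k_new, v] += 1
                m_k[k_new] += 1

    return z, n_dk, m_kv
```

After this loop has run for a while, the counts n_dk and m_kv contain everything needed to recover beliefs about pi and theta, which is exactly the question we turn to now.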
Well, we can do that after our Gibbs sampler has sampled for a while, by going back and realizing that we can draw from, or even better, directly evaluate, the posterior distribution for theta and pi given C and W. So if we have draws of C from our Gibbs sampler, we can always evaluate the completely factorizing posterior, and that can happen completely offline. We don't have to keep doing it while our Gibbs sampler is running. So what's so nice about this collapsed kind of Gibbs sampling, apart from the fact that it might speed up our code, is that it has a very neat conceptual interpretation. Maybe one way to get our hands on this interpretation is to go back and look at our posterior, the implied posterior for theta and pi, again conditioned on C, and also the posterior on C given theta and pi that we discussed last week. So first of all, let's notice that if we have an assignment for the individual values of C, so for the word-topic assignments C, then we get a factorizing posterior over theta and pi that is given by this Dirichlet posterior. We've already discussed this several times, but maybe one new aspect to point out here is that if we look at the parameters of these individual Dirichlet distributions, then they are these pseudo-counts: they consist of the sums of the prior values, alpha, and the counts that arise from our assignments of topics to the words, C. And we actually deliberately decided to use small values for alpha, values less than one. We motivated this by the fact that it creates a sparse Dirichlet distribution that tends to put most of the mass on sparse representations of the document-topic distributions and the topic-word distributions. So both the entries in alpha and the entries in beta are less than one, which means that for almost all documents and almost all topics, and likewise for almost all topics and almost all words, the entries in these parameter vectors are dominated by the counts N. So what does this mean conceptually? It means that our assignments of, or our beliefs about, the topics of each document and the shape of each topic are dominated by the values of the individual words. Our priors are very vague; they are kind of hidden away, dominated by the data. So what really makes this model is the assignment of the words to individual topics. But where does this assignment of the words to the individual topics actually come from? Well, it comes from the conditional distribution for the individual word-topic assignments, given those latent quantities theta and pi. And we realized that those factorize over all documents and all words. So it's like every single word has its own little distribution that is given by this individual probability. So there's this latent thing called pi and theta, which tells the word what its topic assignment is. And that latent thing happens itself to depend, basically entirely, on the assignments of topics to the individual words in turn. So every individual C depends entirely on theta and pi, and theta and pi vice versa depend almost entirely on the C's. Now, what we just realized in the last twenty minutes is that we can integrate out this effect of theta and pi, this latent thing, the latent force that actually creates the assignments, and instead think about the entire dynamics of this model exclusively in terms of what the assignments of topics to the individual words are.
And that means that every single word is now not affected anymore by this theta and pi, because theta and pi don't exist anymore in this view. Instead, it's just affected by the topic values of all the other words. You can think of this, and that's actually the analogy that leads to the interpretation of this way of doing inference, as somewhat similar to the behavior of a gas. Let's think of a free gas in some kind of volume, consisting of a bunch of particles. Now, even if you're not a physicist, you know that the dynamics of free gases are created by the individual particles bumping into each other in a statistical fashion and pushing each other around. So the trajectory of an individual particle in a gas is created entirely by its interactions with other particles, whose trajectories in turn are also created entirely by their interactions with the other particles in the gas. This is similar to the situation here, where the dynamics of the topic-word assignments, the dynamics of the C's, are created in this model almost entirely by all the other words' topic assignments. So everyone interacts with everyone, but the contribution of every individual word in every individual document to any other word in another document is minute, because it's mediated only through those counts. Now, in this physical analogy, the non-statistical, deterministic view of the behavior of these particles in the gas is that each particle moves in the potential created by all the other particles. So every particle in a gas has its potential around it, whatever that may be, we don't actually have to go into details, and the sum of all of those potentials together creates, for any one additional particle, the potential that it moves in. In our case, those potentials correspond to the assignments to our topic-word and document-topic distributions, theta and pi, respectively. And the statistical viewpoint on this thermodynamic setting in physics is called the mean field. You can think of all the different particles together creating a potential, and that potential is the thing that determines the behavior of every individual particle. But because they are creating it together, you can also think of all these potentials as, at some point, averaging each other out. And we can integrate out what those latent potentials are and think of a mean field, an average contribution that all the other particles make on one individual particle through their typical distribution, which in our case is given by the values of theta and pi. So what this collapsed Gibbs sampler does is then often described as operating on this mean field: directly operating on a collapsed-out distribution over the individual word-topic assignments that takes into account the values of all the other words' topic assignments. At this point, this idea of a mean field may feel like a bit of a tangent, but it's actually a wonderful segue, in a way that isn't entirely clear yet, because we will kind of rediscover this notion in about two lectures, when we consider a different way of doing approximate inference where this mean field idea is much more front and center. To get to this point, we now have to take a bit of a detour. So I invite you to put what we just learned aside for a moment and turn our attention for a bit to a different model class.
So when I introduced topic models, I argued that we want to have a model like the one here on the right, which assigns every document not just to one topic, but to several topics: a probability distribution over topics assigned to each individual document. I also showed this plot here on the left back then and said this would arguably be the simpler thing to do: a totally binary, sparse assignment that assumes that every document belongs to one and only one topic. Such models are called mixture models, and it'll become clear in the next few minutes why. Mixture models are usually used not for the kind of dataset that we're looking at, these corpora of documents, because of course a document is, at least typically, about several different topics. They are usually used to separate different, mutually exclusive types of objects from each other. We actually encountered such a dataset way earlier in the class, when we spoke about classification. I showed you this dataset, which is actually data points from the Fashion-MNIST dataset, which contains images of items of clothing. Here in blue is one type of clothing, I believe it was a sweater, and in green another type of clothing, maybe a pair of pants or something. And we argued back then that you can solve these problems in various different ways. One of them is to represent these two different groups as two different distributions, so that every object is assigned to exactly one group. This is one such dataset. Here, in this case, we have labels available for the dataset: we know that the green stuff is one type of clothing and the blue is another type of clothing. Now, of course, in reality, when we want to do something like this, a typical reason to employ a mixture model is that you don't have labels. Here's another dataset that is often used, and that we will use now as an example for this kind of modeling: a popular one from the 1990s. It records, for the very famous geyser called Old Faithful in Yellowstone National Park in the US, which produces a very reliable, as the name suggests, set of eruptions several times a day, two quantities about each eruption. One is the duration of the eruption, and the other one is the waiting time in between eruptions. You can clearly see that there's a kind of relationship between them: long eruptions tend to follow a longer waiting period. And you could maybe argue that there is some further structure to this probability distribution that arguably allows us to describe this dataset in terms of two clusters of eruptions. We have long eruptions that are about four and a half minutes long and occur after a waiting time of about 90 minutes, and we have short eruptions, of about two minutes, that occur after about 45 or 50 minutes of waiting time. Mixture models are algorithms that automate this process of taking a dataset like this, of unlabeled data, and returning something like this: an assignment of these data points to two different clusters, two different classes. When I use the word classes, of course, this immediately brings up an association with the concept of classification that we discussed earlier in the lecture. Back then we looked at very similar datasets; this was one that I showed back then.
Datasets where we are given training data that consists of dots scattered around the space, belonging to two different classes. And they don't just have to look like this. We also looked at more complicated situations like this one, where there is no easy way of just drawing a line between the two classes and separating them, but instead we just have a space that is somehow covered with input points that belong to two different classes. Now, what we're discussing here, this clustering problem, is different from that setting in two ways. It's different, on the one hand, in that it's an unsupervised problem. In classification, we were given the labels of these individual points for the training set. So the training set consisted of, let's say, this group and this group of blue and red points, and the fact that they are blue and red is part of the training data. That's one difference: here in clustering there are no labels. But there's also another difference, and that's the question we're trying to answer. In classification, the task was: given that we've seen this training dataset, if I now give you a new point, let's say over here, or, to make it a little bit more obvious, over here, then please predict what the class of this point is. In clustering, the task is not so much, given a new point, to predict its class, but more abstractly to say something about the classes, to describe what the classes look like. One particular way of doing that is to create more instances of the class. This is called generative modeling. So maybe to take a high-level view of machine learning: this second half of the lecture, and also what we'll do next lecture, completes our list of foundational types of machine learning problems that we are describing with the probabilistic language in this course. For a large part of the lecture course so far, we have talked about supervised machine learning. Supervised machine learning problems, as most of you will know, are tasks where we're given pairs of inputs and outputs to train with, and we are then tasked with producing the output at a new test input location. Depending on what data types the input and the output are, there are different names for these different types of supervised machine learning problems. Regression is the word for supervised machine learning tasks where the output space is real-valued, either univariate or multivariate; in the latter case it's multi-output regression. So: learning functions that map from a more or less arbitrary input space to a real vector space. By the way, we noticed that the input space really doesn't matter, because as long as we can define a kernel or feature functions over it, the structure of the input space is masked anyway. Classification, which we discussed next, is the setting where we observe discrete labels, binary or multi-class labels, but we observe them during training; that's why it's supervised. And then there's a generalization of that, which we discussed in the context of generalized linear models, where the output is pretty much anything, any space that can be connected, or maybe even made isomorphic, to the real vector space by some transformation: for example, predicting strings and text and graphs and proteins and so on. And we also had special names for certain special input types. So for example, if the input space is univariate real, we were talking about a time series.
The problems we're looking at now are not of the supervised type but of the unsupervised type. In fact, the topic modeling we've been discussing up to now is also a form of unsupervised machine learning. Unsupervised means that we're only given a collection of data points that are from one space; they are samples from some potentially complicated probability distribution, from some generative process. And our task is, well, what exactly is our task? The fundamental task, arguably, is generative modeling. That is: make more of those data points, draw additional points from this distribution that we don't get to see, the one that produced these samples. But a task that is closely connected to this, as we will see, is that of clustering. So, what I just described: given a bunch of data points, give me some labels for subgroups of this data. That means: assign a class, a cluster identity, to each individual x_i. An important question to ask at this point, one that I usually discuss in a more dialogue-like fashion with students, is why one would even want to do clustering. This is one of the most frequently misunderstood concepts. There is a natural tendency to assume that the point of clustering is to find some structure, some representation, some labels that are useful for understanding what goes on in the dataset. And this is not a totally bad idea. In fact, when we do topic modeling, we also try to do that: we're trying to find structure that represents the text in the form of, not clusters, but topics. However, when we do that, we always have to be very careful, and we will notice that this is the case for our topic models as well, not to over-interpret the result of this clustering. Because, as we will see, it strongly depends on the generative model, the description of the underlying data. What is perhaps a more realistic task to solve is the more general form: let's make more of this data, produce additional samples. And one way to do that is to subdivide the data into clusters, describe these clusters, and then use these clusters as a representation of the structure of the probability distribution to produce more. The reason I'm mentioning this aspect so explicitly is that the most basic and fundamental, or maybe most classic, algorithms for clustering don't actually always allow us to do this, to draw more samples from the clusters. The traditional, the oldest way of doing clustering is K-means clustering. It's an algorithm you've probably come across if you've had an undergraduate course in machine learning. I'm nevertheless going to remind you of it and reintroduce it, to refresh your memory. It's an algorithm that was invented, well, it's a bit unclear by whom, because it's such a generic rule that maybe there are several people who can claim to have invented it. But a good candidate for its invention is Hugo Steinhaus, an Austrian-Polish Jewish mathematician who unfortunately had a very difficult and often sad life, because his life overlapped with the Second World War and the so-called Third Reich. And it's an algorithm that works as follows. We are trying to construct the eponymous K clusters in our data. We assume that we know what K is. What we do is initialize a bunch of points in the space of the data, so in X space, at random locations around the data, and those are going to become the representatives of the clusters.
Notice that we already see that, because they are just points, they are not going to allow us to produce more data points. Once we have initialized these random locations, we now iterate by doing two steps, one after the other, back and forth. The first one is that, for every datum, we measure the distance between the datum and all the cluster representatives and assign each datum to one of the clusters. We can represent that in notation in two different ways. One is to assign a particular cluster to each datum by just saying that the nearest mean becomes the cluster identity. Or we could, for each pair of cluster and datum, assign a binary variable that says this datum belongs to this cluster or doesn't, and these are called responsibilities of the clusters, for reasons that will become clear later. So this first step means assigning data to clusters, and then, in the second step, we update the representation of the clusters by putting each cluster representative at the mean of that cluster. That's why it's called a mean. So we're computing, and this is where this notation is particularly convenient, the sum over all the elements that are in the cluster, computing their mean, and setting the representative of the cluster to that mean. And then there's a final line here at the bottom that says: let's keep doing this, let's iterate on these two steps, assign and update, assign and update, over and over, until the algorithm doesn't change anything anymore. That happens once there is no more change in the responsibilities, once no cluster identities are being changed anymore, and then the algorithm stops. One could wonder whether it always stops; we'll get to that in a moment. Here is a representation of this in code that's a little bit more compact, so it doesn't run over the rim of the slide. It's a very simple algorithm: we initialize at random, and then we iterate in a while loop by finding cluster identities, that is, assigning each datum to the cluster center that is nearest to it, and then updating the cluster centers by setting each one's location to the mean of the group of points assigned to it. Here's a way of writing this in a very compact notation: R is the matrix of responsibilities, so it's of size number of clusters by number of data points. We can multiply that with the data to select the points that are in each class, and then divide by the size of each class. You can maybe convince yourself that this gives, for each class, a vector that is its new representative, its mean. Now, what does this algorithm look like in practice? Here is the Old Faithful dataset that I introduced before. This is a figure that is recreated from a similar figure in Chris Bishop's textbook. The code for this was written by Ankatrin Schaakamp, a previous scribe for this lecture course. In gray, you see the Old Faithful dataset, and we initialize, let's say at random, our clusters: there's the red cluster and the golden cluster, at randomly chosen locations, if you like. And now the algorithm iterates by first assigning responsibilities: all the points that are nearer to the red point than to the golden one are assigned the red identity, and all the points that are nearer to the golden one are assigned the golden identity.
If you compute the means of these clouds of data points, then the red point moves down here, and the golden point moves up here, because there are way more golden points here than there. Now, we update the cluster responsibilities, so all the points closer to the red point become red, all the ones closer to the golden point become golden. We update again, and as it turns out, the algorithm has now actually converged, because after this update there will be no more change in the responsibilities of the clusters for individual data points. So this is the five-minute introduction to K-means, which is maybe enough because, as I said, you've probably heard about K-means in an undergraduate introduction to machine learning. It's about as elementary as the nearest-neighbor classifier or linear classifiers like linear discriminant analysis. And the reason I introduce it here again is not that I want to talk about K-means per se, but because K-means is going to be the beginning of a thought process that leads us to the final tool in our toolbox. To get to that point, we first have to understand that K-means has significant, well, you could call them drawbacks, but actually they are more like pathologies. There are certain things it just doesn't do well, or doesn't do at all, and by trying to fix them, we will arrive at the class of algorithms, the generic form of computation, that we're actually looking for. I already pointed out the first problem with K-means, which is that it's not a generative model. It doesn't allow us to produce more of the data. It just tells us that there's this cluster, it's this group of points, we call them a cluster, and that's it. It doesn't allow us to produce more data points. Now, maybe this is still useful; it's similar to how we are going to use our topics in our topic model, so it's not entirely a bad idea, but it's a very limited form of functionality. In particular, it doesn't explain to us why the clusters it finds are the clusters it finds. That makes it very hard to interpret the result of this algorithm, to understand what kind of structure it has actually found. Another problem with K-means is that it has parameters that are difficult to set. So here is another example dataset, at the top. This is a dataset that was maybe hand-constructed by David MacKay for his textbook on Information Theory, Inference, and Learning Algorithms; a great book, by the way, as I've pointed out several times already. This dataset was again recreated from the original image by the same scribe. And if we run K-means on this dataset with two classes, then, if we're lucky, it actually works quite well. So down here is K-means running in four steps on this dataset with two randomly initialized means, which happen to be quite close to each other over here. That's just a random coincidence, but the algorithm actually recovers from this maybe unlucky initial point by assigning all of these points to the red cluster and all of these to the black cluster, updating, reassigning responsibilities, updating again, and then converging. So now we see that this is maybe what you would look for: this over here is a group of data points that we could call one cluster, and this is another group. But maybe, if you look at this image up here, the one that isn't labeled by classes, you disagree that there are actually two classes, two clusters, in this dataset. Maybe it's actually three, right?
One, two, three, maybe it's four. Maybe there's a fourth one here. Now, we can see this dataset, we can look at it, so maybe we can convince ourselves that we shouldn't use two classes or clusters. But, A, even now it's difficult to decide how many clusters we should actually use. And one of the reasons for that is that K-means is not a generative model, so it doesn't actually explain to us what it even considers a cluster. So we kind of have to guess, we have to impose our own assumption of what a cluster should be. And that also means we have to set the number of clusters, because that's a free parameter of the algorithm, and in particular in high-dimensional problems that can be very hard. It's also not just hard, it actually has an effect on K-means that we can see in this picture here, which is also modeled on a similar one in David MacKay's book, though it isn't exactly identical. So here you see two different runs of the algorithm, the top row and the bottom row, that are created by using four clusters initialized at random means. So here are one, two, three, four initial means, and here are one, two, three, four initial means, randomly chosen. If we run the algorithm until convergence, we get two very different kinds of clusterings. Both of these clusterings are maybe not what we had in mind when we initialized with four different clusters. In this case, the left-hand side of the dataset, even though it's a relatively compact cluster, is split into three, and this thing on the right becomes one big, loose cluster. And here, we split the left-hand side in two and the right-hand side in two; arguably a better clustering would be to put a small cluster over here. Again, we don't really know why, because the algorithm doesn't tell us what a cluster actually is. So a first pathology of K-means is that, even though it's named after this eponymous K, the number of clusters, there isn't an obvious way to set the number of clusters. Fixing this issue will be one of our tasks. Another pathology, which is less obvious, is that because K-means doesn't describe what a cluster actually is, it just identifies it with something that is close, it can behave pathologically for clusters that have certain shapes. Here are two examples, again from the textbook of David MacKay: two different datasets, here and there. In this dataset, there is one very compact cluster and one arguably larger cluster. If you run K-means on this dataset, then it finds what is maybe a very bad clustering: it creates a cluster up here and then some cluster down here that, for our visual analysis, kind of joins a part of the large cluster and the small cluster together. And here's a dataset that consists of arguably two clusters that are a little bit elongated. If we run K-means on this dataset, then it splits these clusters exactly along the wrong direction. So there's one class up here and one down here, with a dividing line that goes almost perfectly horizontally through the dataset. These observations maybe explain why K-means is considered more of a textbook algorithm than an actual practical tool for contemporary machine learning. It has several pathologies. It has free parameters that we, at least at the moment, don't know how to set, and it can find clusterings that go against our intuition for what the right clusters should be.
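To make the loop we've been describing concrete, here is a minimal NumPy sketch of K-means. This is a sketch under simplifying assumptions (means initialized at randomly chosen data points, no handling of empty clusters or ties), not the reference implementation from the lecture slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def k_means(X, K, max_iters=100):
    """Minimal K-means sketch. X is (N, d); returns responsibilities R (K, N) and means M (K, d)."""
    N, d = X.shape
    # initialize the K means at random locations (here: K randomly chosen data points)
    M = X[rng.choice(N, size=K, replace=False)].copy()
    R = np.zeros((K, N))

    for _ in range(max_iters):
        # assignment step: each datum is claimed by the nearest mean
        dists = ((X[None, :, :] - M[:, None, :]) ** 2).sum(-1)   # (K, N) squared distances
        R_new = np.zeros_like(R)
        R_new[dists.argmin(axis=0), np.arange(N)] = 1.0

        if np.array_equal(R_new, R):          # no responsibility changed -> converged
            break
        R = R_new

        # update step: move each mean to the average of the points assigned to it
        M = (R @ X) / R.sum(axis=1, keepdims=True)

    return R, M
```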
An underlying reason for these pathologies is that the way K-means creates these clusters is, at least at this point in the course, not entirely clear to us. It just produces these points by iterating back and forth. So an intermediate goal for us will be to fix K-means. And, this being a course on probabilistic machine learning, of course we're going to do this by finding an interpretation for what K-means does in terms of probabilistic modeling. We'll do that in the next lecture. But what we'll do today, to get to the point of constructing our probabilistic interpretation of K-means, which will lead us to improvements that fix these issues, is to look at the way K-means is classically analyzed and see if we can already find some hints in there for how we might want to think about K-means from a probabilistic perspective. And we can find such a hint in the classic, not actually statistical, analysis of the convergence behavior of K-means. That is connected to another mathematician who lived about a hundred years ago, actually before Steinhaus: Aleksandr Lyapunov, a Russian mathematician who was particularly interested in the theory of control and the stability of dynamical systems. He is famous for a result, stated here as well, which we are going to use to derive an interpretation of K-means. And this insight is a really beautiful one: you can think of many algorithms, even though they are not phrased in this way, if they are convergent, as optimization algorithms. You can do so whenever you can find a function, called a Lyapunov function, that is decreased by the algorithm in every single step, and that function is positive, or actually, more generally, has a lower bound. This kind of insight is valuable because it allows us to take algorithms that are motivated in a very different fashion, like K-means, and interpret them as minimizing a risk function or loss function. As you know, many algorithms in machine learning, in particular in statistical machine learning, are phrased as minimizing some form of, typically, empirical or regularized empirical risk. Traditionally, Lyapunov's theorem, or the existence of a Lyapunov function, isn't actually used to create interpretations of algorithms. It is more typically used to show convergence of dynamical systems, or of algorithms that take several steps, because if you have such a function that is decreased in every step, then, since the function has a lower bound, it's clear that the algorithm eventually has to terminate: it won't change the function anymore. This is the case for K-means, as we'll see in a moment. But notice that this doesn't mean that the algorithm always converges to the same point. It just converges to a point where that function has a local minimum. We have already seen in practice that this is an issue for K-means: every time you re-initialize K-means, it might end up in a different final configuration, a local minimum of the Lyapunov function. So what is that function for K-means? Well, it's given by this function here on the slide. I've copied over the code again from before, it's exactly the same thing. And I'll argue that a Lyapunov function that is minimized in every step of the algorithm is given by the following function of the parameters of the algorithm, the responsibilities and the locations of the means.
It is the sum over all data points, i from one to N, and over all clusters, little k from one to capital K, the total number of clusters, of the responsibility of each cluster for each datum, remember that those are binary, one-hot variables, times the squared L2 distance of the datum from the mean:

$$J(R, m) = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik}\, \lVert x_i - m_k \rVert_2^2 .$$

To see that this function is decreased in every step, we look at the lines of the algorithm. We find that row four of this code, which finds the responsibilities, evidently minimizes this function. Why? Because it cycles through all the terms in k for each datum and finds the one value of r to switch on for which this number is smallest. So that means, if you think about this entry of the outer sum: for every single datum, we can consider all possible such terms, the distances to all the individual cluster means, find the one where this term is smallest, and switch that one on. That evidently at most reduces the value of this function, unless it doesn't change R at all. And the second step, well, the second step naturally minimizes this function because it finds an m such that all these distances are smallest on average. We can also do this properly, by computing the gradient of this function with respect to the individual means m_k. If you do that, then clearly the inner sum falls away, because only one term in that sum depends on the corresponding m_k, and we're left with the outer sum; the inner derivative brings down a factor of two and a minus one from the inside, and we get this expression, $\nabla_{m_k} J = -2 \sum_i r_{ik}\,(x_i - m_k)$. If we set that to zero, then we can convince ourselves that evidently we have to set m_k to exactly what this algorithm does; that's this line here. We can also compute the second derivative, the Hessian, and find that it's actually a diagonal matrix that has this expression on its diagonal. Remember that the r's are all binary variables, so this is a number that is larger than zero. So this problem has a positive definite Hessian; it's therefore a convex optimization problem, and we've actually found a minimum of this expression in step five, as a function of m. So every individual iteration of this algorithm reduces this particular loss function. Now, the classic analysis at this point goes on to point out that, by the existence of this Lyapunov function, we now know that K-means is convergent: it will always reach a stable point where this function has a local minimum. And actually, for us, this is maybe not so crucial. The more important insight will come from the exact form of this function. I'm not going to talk about this in this lecture; instead, we will use it as the starting point for the next lecture. But it is maybe very convenient to end on this point, because you can take the time, while you switch from one video to the next, to think about what the existence of this risk, the fact that K-means can be interpreted as an optimization algorithm that minimizes this risk functional in terms of R and m, might tell us about a probabilistic interpretation of K-means. And then maybe that even hints at what we can do to fix the pathologies of K-means that we've just identified. We'll talk about that in the next lecture. For now, let me summarize this second half of the lecture by saying that there's this algorithm called K-means, which is a simple algorithm for clustering. It always finds a stable clustering by minimizing this L2 risk functional.
However, it has some pathologies. It cannot deal with oddly shaped clusters, and it doesn't have an immediate way of identifying the number of clusters. Fixing this will be our goal in the next lecture. We'll do that by finding a probabilistic interpretation of the algorithm. And that probabilistic interpretation will not just lead us to ways of improving K-means to fix these pathologies; it will actually also give us a practical example that leads us to another class of approximate inference algorithms for probabilistic models, which will complete our probabilistic modeling and algorithmic toolbox for the purposes of this lecture course. I'm looking forward to seeing you in the next lecture. Thank you very much for your attention.