Hello and welcome to probabilistic machine learning, lecture number 19. We're now entering the final phase of this class, and we're going to change tack a little bit today. Over the last 18 lectures, we've seen that probabilities and probability theory provide the elementary mathematical tools to extend propositional logic to reasoning under uncertainty, to extracting information from data that isn't perfectly informative. And the way to do so is to distribute truth over a space of hypotheses rather than to assign a binary truth value to an individual statement. We already saw in lecture number one, actually in lecture number two, that this powerful philosophical idea comes at significant computational cost, because we now have to keep track not of the truth value of an individual statement, but of a distribution of truth values over an entire space of hypotheses, and doing so can potentially be exponentially hard in the size of the hypothesis space. To address this issue, over the last 18 lectures we've developed various mathematical tools and ideas, both to represent these probability distributions in an efficient, useful, expressive way, and also to manipulate them computationally in an efficient and tractable kind of way. We began by observing in lecture number two that conditional independence structure allows us potentially to separate parts of the reasoning process from each other, so that we can deal with them in a computationally separate fashion and thereby reduce computational complexity. We then encountered Monte Carlo methods, which allow us to compute interesting properties of probability distributions in a reasonably general and reasonably, although not perfectly satisfying, tractable computational way. Then we encountered Gaussian distributions — actually, I should probably show you this slide, right? — which are a way to represent continuous-valued probability distributions that allows particularly efficient computational manipulation, because they essentially map the rules of probability theory — the sum and product rules, and therefore Bayes' theorem for inference — onto linear algebra. And we learned that when we combine these Gaussian distributions with an efficient representation of functions in terms of either a finite or infinite basis, we can use the Gaussian algebra, basically the Gaussian framework of reasoning, to learn functions and thereby solve supervised machine learning problems. This is particularly clean if the function maps from an arbitrary input space to the real line, if it's a real-valued function. But we saw that we can even twist the framework a little bit and make it work in settings where the output variable — the observed, dependent quantity — is not real-valued but maybe binary-valued or strictly positive-valued, or has some other kind of structure that can be represented through a link function from a latent real-valued function to the output space. A key computational trick that we leveraged in this setting is the Laplace approximation: a very lightweight, somewhat flawed and perhaps dangerous (because it's local) but computationally very efficient approximation with which to build tractable inference algorithms using Gaussian distributions.
Then we saw that there are other probability distributions, in particular those that can be represented as an exponential family, that also provide interesting parametrisations — ones that capture interesting aspects of probability distributions over non-real-valued spaces in general, while retaining a certain amount of computational tractability. And then, in the most recent lectures, we returned to this question of conditional independence and studied in a little more detail a graphical representation — actually three different ways of graphically representing joint probability distributions and their properties, in particular conditional independence and generative structure. That led to a semi-automatic framework for computing quantities like marginal distributions and most likely states (joint maxima) of joint probability distributions, through an algorithm that amounts to, or can be interpreted as, passing messages between variables in a graph. Through this journey, we've now collected not just some fundamental mathematical insights but also reasonably practical things: in particular, a collection of models and modeling ideas that allow us to describe data sets and their underlying latent structure, to talk about them, but also to phrase, in a mathematical fashion, concrete quantitative models that describe structure you might be interested in. And we've also collected various mathematical and computational ideas that allow us to actually infer latent quantities from observations — to actually build learning machines in a probabilistic fashion that can run in, well, polynomial time rather than the potentially combinatorial cost of general probabilistic inference. And building these models and the associated computational tools was the majority of the work, actually. We could write down the fundamental ideas in lecture number one, but actually making them practical, making them usable in a real-world setting, required us to come up with these models and computational tools, and every single time we did so, we almost immediately encountered not flaws but problems, challenges inherent in these models and computational tools. Because whenever you want to do something concretely, you have to make certain choices — be it in the prior or the likelihood or in the computational tool itself — that have flaws, that remove information you might otherwise actually want to model, and these are just prices you have to pay for actual algorithms that you can actually use in practice. So we're now at a point where we could start to very rapidly list a much larger number of modeling ideas, and also some more extended computational ideas. The last 20 to 30 years of study in the machine learning community and its connected fields have given rise to a large zoo of models that can be used to describe a large number of individual data sets, and one way to end this course would be a rapid-fire list of these model classes: we could go through one model class per lecture until the end of the lecture course. I don't want to do that, though, because I think you're not going to be able to remember all of these models, and you might forget about them right away again. It's also sometimes difficult to motivate why some of them are still particularly important or not. So instead, what I would like to do, together with you, is to pick one concrete example application and try to really build a model and a solution to a particular inference task.
In the process of doing so, we will encounter a few of these other models that are related to it, or we will encounter them on the margins while we think about how to build our model — in the same way that you might encounter such ideas while you're thinking about your own data set in the future. We will also encounter one more key computational tool that I haven't yet introduced, which is variational inference, and we'll do that from next week onwards. So what kind of data set should we use to work with for something like six lectures, three weeks? Clearly it should be something that we can actually work with in this reasonably limited time. It can't be everything. It can't be a super-hard big-data problem, but it also shouldn't be something totally boring. So how about we set ourselves the task of trying to build a model of history — build a learning algorithm that learns high-level currents of the history of mankind, maybe the more recent history of mankind. That sounds like a bold goal, right? And it's the kind of goal that machine learning is famous for: these somewhat overhyped big ideas that, in practice, always turn out to be a little bit less cool than you thought they were based on the title. That's exactly going to be the case here as well. Of course, if someone gives you a task like this, or if you set yourself a task like this — some kind of big and bold high-level task — you're very quickly going to realize that if you want to actually realize it in practice, then you'll necessarily have to restrict yourself a little bit as you become more detailed about what you really want to do. In particular, if you want to build a machine learning solution, the very first thing you might want to think about is what kind of data you actually have available. And that doesn't just mean what kind of data has been collected by anyone anywhere in the world over the history of time, because that data is not usually accessible to us in a way that we can quickly turn into a Jupyter notebook and just get started with. Instead, we're going to have to find a concrete data set that is available to us, that is easy to parse, and that nevertheless contains non-privileged information. In doing so, of course, we'll have to pay a price in generality, and we'll have to limit what we can actually do with that data set. The data set I've picked for this year and for this part of the lecture course is the collection of speeches given once a year by American presidents: the State of the Union Addresses. These speeches have been given since 1790, and that's actually the main reason why I want to use them. It's a little bit disappointing, of course, to use speeches given by American presidents. This is going to give a certain bias to our model — actually quite a strong bias — because we're basically going to have to deal with American history rather than all of history. This is maybe a bit disappointing because I'm giving this talk in Germany, so it would be nice to use a data set of recorded speeches of German politicians, and in fact it might be possible to do something like that: if you're interested in a master's thesis on this topic, maybe let me know, and then we might be able to change that in future years. For this year, though, we're going to have to focus on this data set. Maybe let me just first introduce you to it, and then we can talk about what its properties are.
So, Article Two of the US Constitution states that the president "shall from time to time give to the Congress information of the State of the Union, and recommend to their consideration such measures as he shall judge necessary and expedient." Notice that the Constitution apparently thinks that the president is always a man. This stipulation — the fact that it has been in the Constitution of the United States of America — means that this kind of presentation or speech has been given every single year since 1790 (actually, in one year there were two of these addresses), and that means we have a continuous set of data that goes back over 200 years, roughly 230 years. That's nice, because it gives us a summary of American history, maybe, and it's going to be a high-level kind of summary, because the speech is given once a year by the president. So that's maybe good: it gives us a good high-level view on history. Actually, these speeches haven't always been speeches. What you see here on the right is the address given by George Washington, the very first one. It wasn't a speech; it was actually a letter. And in fact, these addresses have been letters on and off over the course of history, and they have only been consistently delivered as speeches — as concrete verbal speeches given to Congress — to my knowledge, since 1982. So now they look, of course, like this. They've also been on the radio since 1923, and on television since 1947; they were moved from the mornings to the evenings in 1965, and have been broadcast live on the web since 2002. These are the kinds of meta-information about this data set that we're going to cast away for the moment. We might return to them at a later point, at the very end, when we build the fine-grained details of our model, because of course you could imagine that this kind of structure is actually useful for understanding this data set. Maybe presidents talk differently if they actually have to speak rather than write a letter, and maybe they change their tone if they know that the nation is listening live, rather than just talking to their, well, what you might call their peers in Congress. We also know that these speeches are given by individual presidents. We know the names of these presidents. We know that these people are usually in office for something like four or eight years, and that, of course, the first time they speak, they might give a slightly different speech than in later years, because they set out their agenda rather than having to explain what they've just done. Okay, so that's the kind of data set we're going to be dealing with, and I think this is, of course, an imperfect — very much imperfect — but reasonable kind of data set to work with. It's not trivially large. You can find it entirely online as ASCII text, so it's easy to load and work with. We're going to provide you with this data set on the exercise sheet, so you get to actually use it yourself, and you can even read the speeches if you like. And this data set is more or less complete: it covers the entire period from 1790 to 2019 and, of course, encodes some kind of structure of history. So the task we might set ourselves here is to extract the high-level currents of history from these documents. The documents are just a collection of words — actually, an ordered collection of words, strings collected in ASCII files. And of course, many of these words don't really carry meaning.
They're not important for the historical information encoded in these talks. And the words that do encode information also don't encode this information individually, by themselves, but by their relation to other words in the same texts — by the semantic meaning that is encoded in them. That's the kind of information you and I will try to extract together from these texts. We can already begin to think about what this might mean, and I'm only going to flash this here once; then we're going to think about it a little bit more in a few minutes. This is clearly a dimensionality reduction task. It's a question of extracting a little bit of information from a lot of words. And it's also clearly an unsupervised task, because, at least for our purposes, there isn't anyone who has gone through these speeches and told us that this and this and this word are important, that they only show up here and no longer show up there. There is no labeling available to us, at least for the purposes of this task. So it's not a supervised learning problem like the regression problems we've encountered in previous parts of the lecture course. It's an unsupervised problem where we only have samples, which are the words — and they have all sorts of structure, of course. We know who has given the speeches. We know when they were given. We know which word belongs to which speech, and even where it shows up in the speech. But there is no label that directly says this word is important and this word isn't. So that's what we're going to try to do. And to do so, of course, we have to think about all sorts of models that might represent and encode what we actually want from this data set. And then, once we have decided on a model, we have to think about how to computationally realize it — how to build an algorithm that actually performs inference in this model, that assigns probability distributions to the latent quantities, the structure that we envision in our model. Before we do that, I should put a little disclaimer here, which I think is important. We're still in a probabilistic machine learning class; this is not a class on natural language processing. So my goal here is not going to be to introduce you to the state of the art in natural language processing, and we're not going to try to build a model that can compete with the best in natural language processing. Nevertheless, my goal will be to build a model that captures as much structure as we can in the six lectures in which we're going to deal with this data set. The reason I say this is because I want you to understand what to expect from the next few weeks. Next to the primary mathematical and algorithmic content of this lecture course, one of the messages between the lines — a core message between the lines — is that I hope to help empower you to work with your own data sets in the future. Whether it's in research, in academia, in science, or in industry, you are going to encounter your own data sets and your own modeling challenges. And when you do, I hope that this course will help enable you to build your own models and your own algorithmic solutions for them. For these, I use the admittedly somewhat corny word "craftware". By craftware I mean — similar to craft beer — customized, hand-built, artisanal software, if you like, that addresses specific data sets in an efficient and effective manner.
Whenever you encounter a data set in the wild and you start thinking about how to extract structure from it, you immediately encounter a kind of trade-off, where every single concrete step you take to build a model forces you to remove, to ignore, certain structure. And that's always a bit painful, because it feels like you're doing a disservice to the data and to whoever you're trying to extract structure for. But you also have to do so, because otherwise you can't possibly build a tractable algorithm. So the more you know about how to build structured models in a customized way, and the more time you can spend on thinking about how to speed things up and make the code efficient, the closer you will come to this goal. It is always going to be a little bit disappointing, because you will always have to ignore details. But if you're willing to invest thought and time, and if whoever pays you to do so allows you to do that, then you can usually extract more information — more interesting information — than with a standard solution. And you can often do so in a more efficient way than if you used a standard solution. That's maybe actually an interesting aspect of machine learning, compared to other parts of computer science: in machine learning, it's not always true that the toolboxes that might be available are the most efficient — even computationally most efficient — implementation. They are often quite efficient, but they are not perfect. And they are also often restrictive, maybe in an overly strict sense, in what you can do with them. So software toolboxes are cool, and they are super useful, and the fact that so many of them have arisen over the past decade or so has massively propelled the field of machine learning forward. But they also still come with a relatively restrictive flavor. They only allow you to do certain things. And if you want to do something that isn't quite what the designers of the software library envisioned, then you basically can't do it — you have to implement the code yourself. So please don't be surprised if, over the next few weeks, I sometimes spend time on aspects of a model that might seem tedious to you. Maybe they even are tedious — but I want to send the message that sometimes it's necessary to do the tedious work, and that it can pay off to do so, because it allows you to do things that you couldn't otherwise do, and to do them in a more efficient fashion. All right, so with that, let's end the Sunday sermon and actually start working. Maybe the first thing you do when you get a data set — as I already said in previous lectures — is to try to extract as much meta-information as possible about the data set. You try to sit down with whoever gave you the data and figure out as much as you can possibly know about the surroundings of the data, the genesis of the data. We've already just done that: in this case, the data comes from the internet, basically, from the public domain, and in those reflections on the history and the way in which these speeches are given, I basically provided a little bit of this kind of meta-information. Now that we have the data on our desk, let's look at a few more quantitative aspects of the data, try to get a handle on the actual raw data as it sits on our disk, and in the process give a few names to things so that we can talk about them more efficiently. First of all, what's the size of the data?
It's actually not a particularly big data set. There are 231 documents in it — the speeches ranging from 1790 to 2019. That's not 230, because there were two speeches given in 1961. I will call the number of documents capital D; that number will show up a lot over the coming weeks. Now, each of these documents is a collection of words that were spoken one after the other, in sequence. So you can think of the data set as a ragged array, where each entry, each row of the array, has a varying length. I will call the length of the individual row d of the array I_d, where d ranges from 1 to capital D, and it's going to be on the order of a few thousand words. You can check for yourself, when you get the data set with your exercises, what the exact size is. It varies, of course, from one speech to the next, but it's usually on the order of a few thousand words. And these words, of course, are not arbitrarily varied. There is a relatively limited number of words people use when they speak English, even relatively eloquent people like most US presidents. So we will typically restrict the vocabulary to a very precise set, and depending on how we do that, we might end up with something from a few hundred to a few tens of thousands of words. I will call the number of words that we consider capital V. So, once we do the analysis, our data set is very quickly not going to consist of strings, right? It's going to be an array where each entry contains an index — an integer number that indexes the word in a dictionary. Now, I said before that this isn't a class on natural language processing, but a little bit of low-level language processing we have to do anyway. One problem with modeling text is that a lot of spoken language is relatively redundant: it contains lots of filler words like "and", "to", "do", "have" and "had". All of these arguably don't really carry much semantic meaning, at least not for the high-level task that we're aiming at. So we should remove those words. These words are often called stop words. Conveniently, there are tools that do all of this automatically for you — this is one place where a software toolbox surely is a really great thing to use, and you're going to get to do so yourself in the tutorials that go alongside this lecture. Once we remove these, the vocabulary is going to shrink, and we still get to choose how large we want to make the vocabulary; you'll typically see that the size of the vocabulary we consider actually affects the model. That's an inconvenient aspect that we might have to look at again in a few days or weeks. But at this point already — remember, I just said in the Sunday sermon that we're going to have to make painful reductions in complexity and throw out structure — here comes the very first, very painful piece of structure that I'm going to throw out to make things easy to think about, and this is partly to make many different model classes accessible: I am going to ignore the order of the words. I'm going to pretend that every document is just a jumbled-up collection of words, and this idea is often called a bag of words. Obviously, you realize immediately that this is not necessarily a good thing, because when two words show up next to each other, that can convey a different meaning than when they are far apart from each other. And in fact, in some cases, even the order of two words next to each other can carry meaning.
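As an aside — the lecture itself doesn't show this code, so take it as a minimal sketch under stated assumptions — the preprocessing just described (tokenizing, removing stop words, restricting the vocabulary, counting) might look roughly like this; the directory name and the choice of 500 vocabulary words are hypothetical:

```python
# Minimal preprocessing sketch (file names and paths are hypothetical).
# Turns the raw speeches into the D x V matrix of word counts described above.
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer

# One ASCII file per State of the Union address, e.g. "1790_Washington.txt".
docs = [p.read_text() for p in sorted(Path("sotu/").glob("*.txt"))]

vectorizer = CountVectorizer(
    stop_words="english",   # drop filler words like "and", "to", "have"
    max_features=500,       # restrict the vocabulary to the V most frequent words
)
X_counts = vectorizer.fit_transform(docs)        # sparse count matrix of shape (D, V)
vocabulary = vectorizer.get_feature_names_out()  # index -> word dictionary
print(X_counts.shape)                            # expected: (231, 500)
```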
Nevertheless, we're going to make this bag-of-words assumption, because we hope to extract some high-level structure, and because it's very convenient. In the final lecture on this topic, we might actually return to this idea for a moment and think about what we can do to soften the effect of this somewhat drastic choice a little bit. One nice aspect of treating the texts as bags of words is that it allows a very convenient data structure: we can rearrange this data set from a ragged array of documents into a matrix of counts, and that's going to look like this. So this is our data set if we treat it as a bag of words. What you see here is a matrix of size D by V — remember, D is the number of documents and V is the size of the vocabulary. For the purposes of this picture, at least, I've chosen the vocabulary size quite small, the top 500 words, because that makes it easier to make a plot like this; it's maybe not necessarily the correct number of words for our modeling purposes, but it's easy to look at this way. What you see encoded as color in this matrix is the frequency of individual words in each document. So what this matrix contains are integer counts of how often an individual word shows up in the particular speech that corresponds to the row of the matrix. It's nice to look at the data set like this, because you can already see all sorts of interesting structure that might help you decide how to model this data set down the line. For example, you can see that the number of words, of course, varies over time. Here is a particularly verbose speech; here is one as well. You can maybe even look at this data set and guess which of these speeches were spoken and which of them were written; maybe you can even identify certain presidents who were particularly laconic and didn't talk all that much. And then you can also see that the frequency of the words varies a lot. There are some words that are extremely frequent and other words that aren't as frequent — and this is already after removal of stop words. Now, one thing you might want to do, of course, is to check what kind of words those are and why they are so frequent. I will leave that to you, because you'll actually get the data set. We, in the meantime, can start thinking about what you might want to do with this data set to reveal structure in it. And this is one of those points where this term's online teaching is maybe not particularly elegant, because ideally we should now be together in a room and have a conversation about what we should try to do with this data set. We can't do that today, so maybe you want to stop the video for a moment and think for yourself about how you would find structure in this data set — how you would extract semantic information from this matrix of counts. Maybe the first thing you thought of is almost a bit of a knee-jerk reaction, because you're seeing a matrix and you're thinking: low-rank decomposition. Maybe we can write this matrix in terms of a low-rank expansion, like this. So here is our data set again. By the way, I will call the data set X — I haven't actually said that yet. So here is a big bold X. This is what we get to see, and we might want to model it as some kind of outer product of matrices. Eventually, the factors might be other objects than matrices, but let's just think of them as matrices for now.
And that means maybe we want to represent this matrix as a low-rank expression, where the D documents are written in terms of capital K topics — I will use capital K for this latent dimensionality — and the words are written in terms of the K topics as well. So you can think of two matrices, Q and U, where U is a projection from the space of vocabulary words onto topics. The columns of the matrix U (which are the rows of U transpose) might contain numbers that provide weights for how much a particular word contributes to, or is part of, some topic, and the columns of the matrix Q may contain numbers that describe how much a document consists of a particular topic. If you thought of low-rank matrix decomposition, maybe you're feeling bad because it seems like the only algorithm you know is PCA — but you shouldn't feel bad about it. Trying PCA on a data set is actually maybe a good idea in any case. In fact, Neil Lawrence recently tweeted that "have you tried PCA on the data set?" is maybe the data scientist's version of "have you tried unplugging it and plugging it in again?". It's maybe not a bad idea. Of course, we will quickly see that PCA is not going to be the answer to this particular modeling task — if it were, then we would be finished in 15 minutes and it wouldn't be a major exercise. But maybe it's a good thing to start with anyway, and to think about how far we can get with an extremely simple linear algebra algorithm. Now, I know that many of you — maybe almost all of you — have seen some form of introduction to PCA before, even if it's just by myself in the lecture on data literacy that you might have taken last year. So I'm not going to do the entire derivation in detail again. What I'm going to do instead is use this opportunity — as I will for other models as well — to point out that PCA has a probabilistic interpretation. That actually requires me to do a little bit of the derivation again; or, actually, I'll quickly rush through the whole derivation of PCA in the classical fashion, and then we can spend a little bit of time thinking about whether PCA also has a probabilistic interpretation. So, in the unlikely case that you haven't had a class on dimensionality reduction or on PCA, let me very briefly do a re-derivation of it, so that everyone is on the same page. PCA is a dimensionality reduction and encoding technique. It's the base case — the linear, quadratic base case — of a larger class of dimensionality reduction techniques, which find an encoding from a high-dimensional to a low-dimensional space. In our case, the high-dimensional space is the space of words in the vocabulary, and the low-dimensional space is the latent space of these topics. And there is also a decoding: we motivate the encoding by saying we want the sequence of encoding and decoding — the reconstruction of a particular data set — to be somehow good, where "somehow" is according to some loss function. This might be done for various reasons. In our case, we want to do it to "find structure" — and you can see that "find structure" is already in quotation marks, because PCA maybe isn't really meant to find structure. It's meant to reduce complexity, right, to approximate a matrix in a low-dimensional fashion, and we will find, of course, that the structure we find using this approach maybe isn't actually all that exciting.
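To fix notation for what was just described verbally (the equations are my reconstruction of the slides), the ansatz is a factorization

$$
X \approx Q\,U^{\top}, \qquad X \in \mathbb{R}^{D \times V},\quad Q \in \mathbb{R}^{D \times K},\quad U \in \mathbb{R}^{V \times K},\quad K \ll \min(D, V),
$$

where $Q_{dk}$ describes how much document $d$ consists of topic $k$, and $U_{vk}$ describes how much word $v$ contributes to topic $k$. In the encoder–decoder picture, both the encoder $\phi: \mathbb{R}^{V} \to \mathbb{R}^{K}$ and the decoder $\psi: \mathbb{R}^{K} \to \mathbb{R}^{V}$ are linear maps, chosen so that the reconstruction loss $\sum_{d} \lVert x_d - \psi(\phi(x_d)) \rVert^2$ is small.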
Within this quite general — and potentially quite complicated — class of encoding/decoding (reconstruction) dimensionality reduction techniques, PCA is the grandfather: the easy, basic, ancient solution, which consists of choosing the encoding and decoding to be linear functions — so both phi and psi are linear — and the loss to be a quadratic function, because quadratic functions and linear functions go well together. Of course, that still means we have to choose what the encoding and decoding actually are — not just any old matrices. The classic way to derive PCA contains, well, maybe two steps. The first one is to write down the data in some representation, in some form of basis, and see what the right encoding is if you know what the basis is; the second is to choose the basis. So let's first do the first bit, which is maybe the trivial thing. Here's a very quick linear algebra refresher, just to get your mind going again. Here's our data set again. I've now introduced it several times; it contains real numbers. By the way, one observation on the side: some of you might already have noticed that our data set isn't actually real-valued. Really, this R should be an N, for the natural numbers, because we have integers in our counts — and we will get back to that. It's actually an interesting aspect: a more strictly typed representation would warn us that this is a dangerous thing to do. If we had a very clean, functional language to do data science in, and we wanted to do a matrix decomposition on a matrix of counts, it would tell us: maybe you shouldn't be doing that, because these are integers, and the algorithm you're going to use assumes that they are real numbers. But we're working in Python, which isn't a perfect tool, and it hides this kind of interesting type information from us, by being very agnostic about types. So for the moment we'll pretend that we haven't noticed, and then we'll come back to it at the end of this lecture. Let's pretend for a moment that our data is actually real-valued; then this is a matrix of real numbers. And here, finally, is our quick linear algebra refresher — sorry for the detour. Linear vector spaces can be covered by bases. So let's say the collection of vectors u_i is a basis. "Basis" means, for the purposes of the next few slides, that these vectors are orthonormal to each other — that's what this equation is saying — and that there are sufficiently many of them to span the entire space we are thinking about. This means that every vector can be represented in this basis: take the vector, project it onto the basis vectors, and represent it in terms of the basis with the resulting coefficients. By the way, this can also be written in matrix form, like this.
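In symbols (again reconstructing the slide), orthonormality and the basis expansion read

$$
u_i^{\top} u_j = \delta_{ij}, \qquad
x = \sum_{i=1}^{V} \left( u_i^{\top} x \right) u_i = U U^{\top} x \quad \text{for every } x \in \mathbb{R}^{V},
$$

where $U = (u_1, \dots, u_V)$ collects the full basis, so that $U U^{\top} = I$.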
When I write a bold x_d, what I mean is a vector that represents one row of our data matrix X. This is something that maybe confuses some people: I'm now pretending that this is a vector — a column-like object — when it's actually a row of the matrix X; it's one document, containing the counts of the words for this particular document. This is a frequent issue in data science, because the customary format is that data comes in rows — in a database or in a spreadsheet, traditionally — so the first index is the datum; but if we want to do linear algebra, we want to think of individual data as vectors. Well, that's just what we have to live with. So what we are going to do is try to approximate this datum x_d in some basis, and by "approximate" I mean that we'll choose a basis and then only use a certain part of it, a low-dimensional sub-part: let's say we use capital K out of the V degrees of freedom to represent the data, and then we want to embed that representation in the high-dimensional space. How do we do this? Here's the classic way. We say that our approximation x̃_d of the true x_d is going to be an expansion in the basis — actually, a representation in the basis by some numbers a. That's a vector a_d containing elements a_dk; there are K of these elements, and they are coefficients for the basis vectors u_k. In the remaining V − K dimensions, we are just going to apply a shift for the entire data set: one vector b, which is the same for every datum — notice that it isn't indexed by d — and which just shifts the entire data set around in the high-dimensional space. Now, if you are comfortable with your linear algebra, you can guess what the values for a and b should be to optimally embed this datum in the high-dimensional space. But if you can't, or if you are particularly cautious, then let's do the derivation, because one thing we need, in order to figure out what the optimal embedding actually is, is a decision — a statement — about what "optimal" means. The usual choice is to say that optimality means optimal in a least-squares sense: we would like to find values for a and b such that this x̃_d is as close as possible to x_d in a quadratic sense. I'm pointing that out because maybe, if you're paying close attention, you can notice that these choices are not entirely uncontroversial or natural — canonical, maybe, but it's not like there is only one unique way to do this kind of encoding; the answer only becomes unique because we've made a conscious choice to use a quadratic loss. Here's how this works — you've probably seen this derivation before, so I can do it relatively quickly, even though it's a bit tedious. We decide that we want our embedding to be optimal in the quadratic sense. That means we want to minimize the norm of the difference between the embedding and the original datum, and we want to do that for every single datum in the same way — so on average we want to be good. (That's actually another choice: we're weighting every single datum in the same way.) If you now plug in the definition from the previous slide of what x̃_d is — that's here — then we can ask what the optimal values for a and b are. So let's take the derivative of this expression with respect to one individual entry of our approximation, a_dl. This is just differentiation of multivariate objects. First of all, we notice that the individual d shows up in only one particular term of the first sum, so that sum drops and we're left with only one such term. And then this is a sum over squared elements of the vector, so we now do the chain rule.
For each element in the sum, we take the derivative: the two comes down to the front, and we take the inner derivative. There's only one entry in this sum over k where a_dl shows up — the entry where k and l are the same — and the inner derivative of that is minus u_lv, by which I mean the v-th entry of the l-th basis vector. Within the sum, this expression now amounts to an inner product between two vectors — this is a vector, and this is a vector. Now we use what we know about these basis vectors, which is that they are all orthonormal. That means this sum here evaluates to zero, because j runs over different entries — entries larger than l — so the inner product between these two vectors is zero. Up here, we just get a multiplication with the corresponding basis vector, and here in front we use orthonormality to get rid of the sum and are left with a single expression; the twos cancel. If you want to find the extremum of this expression, there's a unique choice: a_dl should be set to the inner product between x_d and u_l — that's why it's written down here. And the same thing works for b. The minor difference with b is that b applies to every datum, because we're shifting the whole data set, so the sum over d doesn't drop out of this derivative. Other than that, we can again apply the chain rule in here — the two comes down, now an inner product — and the inner derivative applies to this part of the expression. We use orthonormality again: this bit here remains in front, but now there's still a sum over d, because it hasn't gone away. This part evaluates to zero, because the inner product between any two of these basis vectors is zero (l is now larger than k), and here we're left with one element of the sum, which just gives us b_l. The two cancels, we can rearrange, and we find what the entries of b should be set to. Well, okay: these u_l do not depend on d, so we can take them outside, and what we have here is basically a sum over all the data divided by D — an empirical average, the empirical mean of the data set. Let's call it x̄; you can find it defined up here. Then we can write b_l as the projection of that data mean onto the individual basis vector u_l. So, of course, you've seen this derivation before, and you know that to find the optimal encoding you have to project onto the individual basis vectors — for b, because it's the shift, you do this with the mean of the data set, and for a, the individual encodings, you do it with the individual data. But notice that the way we do this is not entirely unique — it's only unique because we've chosen a quadratic loss. Isn't that maybe interesting, if we try to think of it from a probabilistic perspective?
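Collecting the first part of the derivation in formulas (my reconstruction of the slides), the ansatz and the loss are

$$
\tilde{x}_d = \sum_{k=1}^{K} a_{dk}\, u_k + \sum_{j=K+1}^{V} b_j\, u_j, \qquad
J = \frac{1}{D} \sum_{d=1}^{D} \lVert x_d - \tilde{x}_d \rVert^2,
$$

and setting the derivatives to zero gives

$$
\frac{\partial J}{\partial a_{dl}} = 0 \;\Rightarrow\; a_{dl} = u_l^{\top} x_d,
\qquad
\frac{\partial J}{\partial b_l} = 0 \;\Rightarrow\; b_l = u_l^{\top} \bar{x},
\qquad
\bar{x} := \frac{1}{D} \sum_{d=1}^{D} x_d.
$$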
Now, the other bit, the second part of the derivation — once you've decided how you will encode, once you have the basis — is to choose the basis: to say what the basis should be in which we do the expansion. The way to do that — and again it's a little bit of tedious math, though actually all relatively easy — is basically to find out that what we're doing here is related to the sample covariance matrix. To see that, consider this term from the previous slide, the difference between the encoding and the datum. We plug in what we just found out about our encoding — that's in here — and we write the data in the basis from the first slide. Then this first sum we can split into two parts: first a sum from 1 to K, and then from K+1 to V, so that we have terms similar to those over here. We notice that the first two parts are identical, of course, because we are expanding in that basis — the only difference between these two expressions is the name of the summation variable, the index l or k; everything else is the same — so they cancel. And back here, there is a difference, because in one case we are projecting the individual datum onto the basis, and in the other we project the mean onto the basis. So that's really the price we pay for our embedding: it's the difference between the individual datum and the mean. And the mean, interestingly, arises from minimizing quadratic distances, right? If we weren't minimizing quadratic distances, a different estimator would show up here — not x̄ but something else. I'll leave it to you to think about what kind of estimator might show up there. So these are the only two parts that are left. Now, this sum runs over the same entries — from K+1 to V — which means we can take the sum out here, use the associative law to rearrange the terms, and notice that we can now think of our loss J — the squared norm of these expressions, after plugging in what we just found — as... well, what's the easiest way to say this? I'll just tell you; I'm giving away the trick. We can think of it as a trace: the projection of the sample covariance matrix onto the basis vectors u_j. So S is this matrix; it's the average of the outer products of the deviations of the individual data from the sample mean. And what J does is essentially project this matrix onto the basis we've chosen and then sum up those values — the trace of the matrix that arises from this — for only the basis vectors in which we are *not* approximating: the ones that are not part of our embedding, the ones from K+1 to V. And then the actually mathematically tricky bit, which we're not going to do because it's a little bit tedious, is to think about the optimal choice of basis — the basis that minimizes this expression. Intuitively, it's maybe clear what you should do. If you think of the eigenvalue decomposition of S — a diagonalization of S — then the way to minimize this expression is to find an orthonormal set of vectors u equal to the eigenvectors of S, and then order those vectors in descending order by the size of the eigenvalues, so that this expression is a sum over as small a set of numbers as possible: the smallest eigenvalues. So, to minimize our reconstruction error and maximize the quality of our embedding, we embed along the eigendirections of the data set's covariance matrix with the largest eigenvalues — the largest variances. Again, this isn't really a proof, but it's maybe intuitive, right? If you think of your data set, it has some principal axes, and we're going to embed along those axes that are the largest and remove all the small dimensions, which you might think of as noise. Now, even though this seems like a mechanical process, we've made interesting choices along the way: we've decided to use linear functions, and we've decided to use a quadratic loss.
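In formulas (again my reconstruction), plugging the optimal coefficients back in leaves

$$
J = \sum_{j=K+1}^{V} u_j^{\top} S\, u_j,
\qquad
S := \frac{1}{D} \sum_{d=1}^{D} (x_d - \bar{x})(x_d - \bar{x})^{\top},
$$

which is minimized over orthonormal bases by choosing the $u_j$ as eigenvectors of $S$, i.e. $S u_j = \lambda_j u_j$, ordered so that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_V$; the discarded directions then contribute $J = \sum_{j=K+1}^{V} \lambda_j$.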
And quadratic losses have all sorts of interesting structure — you already know from previous parts of this lecture that they are connected to certain kinds of probabilistic models. If you're not catching my drift yet, you will in a second. If we choose this basis, then we can express what the reconstruction error is: it's going to be the sum over the smallest eigenvalues, from K+1 to V, of the covariance matrix of our data. There's an alternative way to represent this computation, which is to think of the singular vectors of the data matrix after we remove the data mean. If we first center the data set — compute the data set's mean and then subtract it from every row — then we can compute the singular value decomposition of this centered data set, and, by definition, the right singular vectors — the rows of this matrix U^T, of which we take the top K, together with the top K entries of Sigma — are exactly the vectors onto which we are projecting. The entries of Sigma are the square roots of the eigenvalues of the data covariance matrix S (up to the normalization by D), which exist because S, as you can see, is a positive semi-definite matrix, so the square roots of its eigenvalues exist. So, yes, you've seen this derivation before — but have you ever thought about what this quadratic approximation, this quadratic loss function, actually encodes? I know that some of you have, because you actually asked me this in a lecture last term, which is partly why this is in this lecture now. But maybe some of you haven't. By now, at this point in this course, we know that these kinds of empirical risk functions that are being minimized can often be interpreted as a negative log-likelihood. And if it's a quadratic loss, and if it's a sum, then this corresponds to a product in the likelihood — an independence assumption — over exponentials of squares, and that's a Gaussian likelihood function. Okay, so here is this expression again. We can think of J as a negative log-likelihood function, at least up to scaling and shifting, because shifting by something that doesn't depend on x doesn't move the location of the minimum, and neither does scaling. So our likelihood function is presumably the exponential of that: it should be a product over terms that are individually Gaussian, and they have some kind of scale, some kind of variance, which isn't present in the expression for J — because we are only minimizing it, it doesn't really matter what that scale is. But once we write down our likelihood in a probabilistic fashion, we are forced to give a name to this quantity and call it the variance. This is the way in which probabilistic analysis forces people to explicitly state their assumptions.
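Stated as a formula (my paraphrase of the argument): up to shifts and scalings that do not move the minimum,

$$
J \;\propto\; -\log \prod_{d=1}^{D} \mathcal{N}\!\left(x_d;\, \tilde{x}_d,\, \sigma^2 I \right)
\;=\; \frac{1}{2\sigma^2} \sum_{d=1}^{D} \lVert x_d - \tilde{x}_d \rVert^2 + \text{const.},
$$

so minimizing the quadratic reconstruction error is maximum likelihood estimation under an i.i.d. Gaussian observation model, whose variance $\sigma^2$ the classic derivation never has to name.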
We are actually not the first people to make this kind of observation — of course not. It was maybe first made by Tipping and Bishop in the late 90s, and there's a parallel publication from about the same time by the wonderful Sam Roweis, who unfortunately is no longer with us. The exposition we're going to go through is mostly due to the way Tipping and Bishop did it; it's taken largely from Chris Bishop's book. Unfortunately, these authors use an ever so slightly different way — not actually different notation, but a different way — to capture the embedding from the one I just used, so there's going to be a minor difference between their exposition and the one on the previous slides, reflected in the shift parameter and how it's represented: in the previous exposition, the shift applies only to the part of the data set that is not being embedded; in this formulation, the shift applies to the entire data set. So if you're confused, maybe check whether that's the source of your confusion. It's a very minor thing, though — it just simplifies the exposition. Just from this structure — just from observing that the likelihood hinted at, identified by the loss function we are minimizing, is a Gaussian likelihood — we can already make some statements about what the corresponding generative model is going to look like, which I've represented as a graphical model over here. First of all, there is a product here, so there will be a factorization — a plate in the graph — over the D copies of the individual documents x_d. We also know that there is a parameter sigma squared in this likelihood, which isn't part of the classic derivation of PCA but has to show up once we write down a probability distribution, so it should show up in our graph somewhere, over here. And then there are bits in the derivation of PCA that aren't yet reflected in this likelihood: we know that x̃_d will have to be a linear, low-dimensional embedding — we'll have to be able to write x̃_d as some matrix V times some low-dimensional variable a_d, up to a shift — and that will somehow show up in our likelihood. So we know that both the variables V and a_d, and also a shift, will appear in this graph. What we don't know yet is how to represent the corresponding variables — which of them should be probabilistic variables and which should be parameters — and the other thing we don't know yet is how exactly they should be distributed: what their priors should be, whether they even have priors. Notice that PCA doesn't really have priors as such; it's just a maximum likelihood type of formulation. So we can imagine that we probably don't need particular prior structure to reconstruct PCA; we only need, somewhere, a little more vaguely, to encode the fact that we want this embedding to be in terms of some independent, low-dimensional representation. So we will say that a has a distribution that is, plausibly, Gaussian — and it actually has to be Gaussian, because of a thing we'll see in a moment — with zero mean and isotropic, scalar covariance in its K-dimensional space. This choice is basically arbitrary: if we choose a different covariance here, it will just affect the implicit definition of the embedding V itself; if we don't want that to happen, we just choose this covariance to be one. And we assume that this shift epsilon — this measurement noise, right, it clearly has an interpretation as measurement noise, because we're thinking of the true data as the embedding plus a shift plus some error — will have to have a Gaussian distribution. Because if it has a Gaussian distribution, and if a also has a Gaussian distribution, then that motivates our Gaussian likelihood. Why? Because under these choices, these two together define a generative model for x_d: x_d is defined through a and epsilon, which are both probabilistic variables, so we can integrate them both out to get a marginal distribution over x.
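In symbols (a reconstruction of the probabilistic PCA model in the Tipping–Bishop style just described):

$$
a_d \sim \mathcal{N}(0, I_K), \qquad
\varepsilon_d \sim \mathcal{N}(0, \sigma^2 I_V), \qquad
x_d = V a_d + \mu + \varepsilon_d,
$$

so that

$$
p(x_d \mid a_d) = \mathcal{N}\!\left(x_d;\, V a_d + \mu,\, \sigma^2 I\right), \qquad
p(x_d) = \int p(x_d \mid a_d)\, p(a_d)\, \mathrm{d}a_d = \mathcal{N}\!\left(x_d;\, \mu,\, C\right), \quad C := V V^{\top} + \sigma^2 I.
$$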
This marginal over x is a Gaussian around mu — around the shift — with a covariance that, as you can basically read off from this definition, is V V transpose plus sigma squared times the unit matrix. And the conditional probability of x given x̃ is going to be the likelihood we just wanted to construct, with a sigma squared in it. So the marginal distribution, if you're not conditioning on an x̃, is a Gaussian with a covariance matrix C that is independent of the individual data. Now we can check whether this derivation gets us back to something we know from the non-probabilistic interpretation of PCA, by doing maximum likelihood — because that's what PCA did: it was minimizing the quadratic loss, hence maximizing the Gaussian likelihood. Consider the Gaussian likelihood for this marginal over x. The log marginal likelihood is, of course, the logarithm of this expression, so it's a big sum over individual terms, many of which are constant — we take those out of the sum — and there's then a remaining quadratic form, which involves C. Now we can first check how to maximize this expression with respect to mu. Mu only shows up back here, not in the front, and you don't even have to compute gradients to see that, obviously, the optimal choice for this quadratic function is to set mu equal to the normalized sum over the individual x_d — that's our data set mean. Once we plug that in, this expression simplifies to (x_d − x̄) transpose times C inverse times (x_d − x̄), summed over d, which corresponds to a trace over the matrix (x_d − x̄)(x_d − x̄) transpose times C inverse. We can rearrange — this is basically a big sum, and we can just rearrange the terms in it — and that is nice, because the average of (x_d − x̄)(x_d − x̄) transpose is equal to a matrix we already had on previous slides: our sample covariance matrix S. So now we are making a closer connection to the classic derivation again: we can think of the log marginal likelihood as a term that involves the quantities we had on previous slides, both x̄ and S. And now the only question left, to figure out what V should be under this probabilistic interpretation, is to optimize this expression with respect to C, to find the maximum likelihood estimate for the embedding. As it turns out, that derivation is a little bit tedious, so I'll save you the time and just tell you that Chris Bishop did it for us, and found that the maximum likelihood solution for V is essentially the one you would get from PCA: the first K principal eigenvectors of the data set, which I'll call U as on previous slides, multiplied with, essentially, the eigenvalues — Lambda contains the eigenvalues of the data covariance matrix S, corrected by sigma squared, which is due to the fact that we are now thinking of mu as a global shift of the data set rather than a shift only in the embedding space — and then an arbitrary rotation R, which can be chosen any way you want, because it basically just corresponds to a rotation of the latent representation a, which is spherically symmetric anyway. So any arbitrary rotation is just going to change the interpretation of the individual terms, but it doesn't change the outcome of the embedding. This is sort of arbitrary, and a natural choice is to just set it to the unit matrix.
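Written out (Bishop's result, with notation adapted to this lecture's symbols), the maximum likelihood estimates are

$$
\mu_{\text{ML}} = \bar{x}, \qquad
V_{\text{ML}} = U_K \left( \Lambda_K - \sigma^2 I \right)^{1/2} R, \qquad
\sigma^2_{\text{ML}} = \frac{1}{V - K} \sum_{j=K+1}^{V} \lambda_j,
$$

where $U_K$ holds the top $K$ eigenvectors of $S$, $\Lambda_K$ the corresponding eigenvalues, and $R$ is an arbitrary $K \times K$ orthogonal matrix. The Gaussian posterior over the latent embedding mentioned below is then, with $M := V^{\top} V + \sigma^2 I$,

$$
p(a_d \mid x_d) = \mathcal{N}\!\left(a_d;\; M^{-1} V^{\top} (x_d - \mu),\; \sigma^2 M^{-1} \right).
$$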
That also gives a corresponding value for sigma squared: it's the average of the smallest eigenvalues, the V − K smallest eigenvalues. And that gives an interpretation to PCA that we maybe couldn't read off so easily from the non-probabilistic derivation: PCA assumes that the data comes from a low-dimensional latent Gaussian variable a_d, which has been mapped through this matrix V — essentially through the matrix U, actually — and that latent variable has scales given by these eigenvalues; they are, essentially, the variances of the latent variables. The remaining bit, the bit that's not encoded by this latent variable, is noise: according to PCA, it's standard Gaussian noise with a variance given by the average value of the smallest eigenvalues of the data set. Once we have this kind of generative model, by the way, we can compute a Gaussian posterior on the embedding, which — up to the minor correction that there is a sigma showing up in here — is essentially the reconstruction you would get from PCA. And this representation of PCA as a probabilistic algorithm actually provides some interesting insights about how PCA should be used and whether it's a good model. It even allows extensions — which we're not going to do today — for example to figure out how many components there should actually be, i.e. what K, the dimensionality of the latent embedding, should be. It even allows an approximate treatment of learning the embedding in a probabilistic fashion: there's a way to extend this graph to make V a full random variable in here; this is called Bayesian principal component analysis. But we're not going to do that today, because we have a task to solve. So, just in case you were wondering what this was all about: I've used this opportunity, as we make our first analysis of our data set, to pass by PCA again and point out that simple statistical algorithms like PCA often have probabilistic interpretations. Here, this interpretation helps us understand what the implicit assumptions in PCA are, and it even allows an extension towards algorithms that can be more powerful than PCA. But we want to do an analysis of our data set, so let's now take PCA and see whether it brings us forward when we're trying to find topics — latent, low-dimensional structure — in these speeches of American presidents. To do so, here's the kind of code we need. Of course, PCA is an absolutely straightforward algorithm, essentially just one line, so I had to copy a few more lines in here to make it a little more interesting. What I did is: I took our data set and used the tokenizer and the toolboxes that we're using — that you are going to be using in the exercises — to extract the word counts. That gives us a matrix X_count, which is the matrix we've looked at before. And now, all we need to do to do PCA is to compute, in a single line, a singular value decomposition of the data set. That's it — it's a single line. Now, if you have taken my course on data literacy before, then you will probably notice that there's something a little bit wrong with that line. If you want to think about it for yourself, stop the video for a second. Once you've done that, I can tell you: what I'm not doing here is subtracting the mean of the data set.
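The lecture shows slides rather than a notebook, but the step just described might look roughly like this — a sketch reusing the hypothetical `X_counts` and `vocabulary` from the preprocessing snippet above:

```python
# A sketch of the "single line" analysis: a rank-K truncated SVD of the raw
# count matrix, *without* subtracting the mean (see the discussion below).
import numpy as np
from scipy.sparse.linalg import svds

K = 10  # number of latent components ("topics")
# svds returns the top-K singular triplets: X_counts ≈ Q @ np.diag(sigma) @ Vt,
# with Q of shape (D, K), sigma of shape (K,), and Vt of shape (K, V).
Q, sigma, Vt = svds(X_counts.asfptype(), k=K)

# For each latent direction, list the ten words with the largest absolute
# weight -- the "top words per topic" lists discussed next.
for k in range(K):
    top = np.argsort(np.abs(Vt[k]))[::-1][:10]
    print(f"component {k}:", [vocabulary[i] for i in top])
```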
Now, if you have taken my course on data literacy before, then you will probably notice that there's something a little bit wrong with that line. If you want to think about it for yourself, stop the video for a second, and then we can talk about it. I can tell you that what I'm not doing here is subtracting the mean of the data set. Why do I not do that? Well, because our matrix contains counts: it's already a very asymmetric matrix, in the sense that it contains only non-negative numbers, and in fact it's a sparse matrix, because many words are simply not used in individual speeches. If I subtracted the mean, that would break this nice structure of the matrix. Now of course, when I use this SVD here, then strictly speaking I'm not actually doing PCA anymore, because PCA assumes that we first subtract the mean. In fact, I'm using a different algorithm, which comes from the natural language processing community and is called latent semantic indexing; it is exactly this, PCA without subtracting the mean. You can try for yourself what happens when you do subtract the mean; the answer is actually not all that dissimilar from what we're doing here. It's a little bit more expensive, because the matrix is not sparse anymore, but for such small matrices you won't actually notice the difference. By the way, I'm not showing you the Python code; we're not opening the Jupyter notebooks in this lecture, even though this is supposed to be an applied part of the lecture course, because it's going to be your job in the exercises to do that, and you're going to have the fun of playing with the actual code. So here is the output of this kind of analysis. The code on the previous slide produces these matrices: this is U, these are the singular values in between, and this is V. U is going to be our embedding; this is the matrix that we might want to use. And if you take the rows of the topic-to-word matrix and sort them by their absolute values, taking the most important words if you like, then you're going to get a list like this.
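In code, producing such lists is itself only a couple of lines. A sketch, continuing the hypothetical pipeline above; whether the word-loading vectors live in the rows of Vt or the columns of U depends on the orientation of the count matrix, so take the exact indexing with a grain of salt.

```python
import numpy as np

words = np.array(vectorizer.get_feature_names_out())

# rows of Vt map topics to word loadings; sort each row by absolute value
for i, topic in enumerate(Vt):
    top = np.argsort(np.abs(topic))[::-1][:10]   # indices of the 10 largest |loadings|
    print(f"topic {i}: {', '.join(words[top])}")
```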
These are maybe representations of the top ten topics of this data set. They contain collections of words like "tonight, fight, taxes, faith, century, today, enemy, fellow" (maybe this has something to do with war, I don't know), "we, year, program, world, new, work, need, help, America" (maybe this sounds like some kind of social policy), and "dollar, war, program, fiscal, year, expenditure, million" (okay, that's a bit weird, a mixture of war and taxation), and then a few more of these. What can you see? Well, maybe if you stare at these lists for long enough, you give interpretations to them, just as I just did: this is something about more modern history, this is something about social policy, and this is a weird mixture of war and taxation. That's actually a dangerous thing to do, and we're doing it here deliberately, to show you how easy it is to look at output like this and read all sorts of interpretations into it. Over the next few weeks you will see many more of these kinds of representations of topics, because they are maybe the most natural way to represent a topic: just say which words correspond to it. But I want you to use this output as a cautionary example of how easy it is to let your own brain introduce structure into a data set, because this is actually a very bad representation of this data set. Why? First of all, let's just note that it's not surprising that there are lots of political words in here: these are political speeches, so the fact that words like "world" and "war" and "fiscal" show up is totally expected, because that's just what presidents talk about. Secondly, let's think about what this model actually does. Let's look at these matrices once again. This matrix here contains real numbers, which provide a low-rank approximation of our full count matrix, and those real numbers are dense: there's a number at every single position. Every word, according to this model, belongs to every topic in a certain way, more or less. And what is particularly weird is that some of these entries are negative. That's just the nature of orthonormal bases: if they contain some positive numbers, they also have to contain negative numbers. The singular value decomposition that we just computed simply does its job: it produces the best, in the least-squares sense, low-rank approximation to our matrix X. You can think of these matrices as mapping from documents to topics and from topics to words, but these matrices are dense, and maybe we don't want our topics to be dense. They contain numbers that are very difficult to interpret. First of all, they are real-valued, so to interpret them we need to give a meaning to that real value; it's some kind of weighting, and if that number is large, what does that mean? Does it mean that the topic consists entirely of this word, or that this word contributes strongly to this topic? And even weirder, those principal vectors contain negative numbers. What does it mean for a word to contribute negatively to a topic? Maybe it means that if that word is present, then the topic is less present in the document; that might be the intuitive interpretation, but it's actually wrong. If you have a very large negative entry in a principal component, then that word is in fact strongly represented in that component, because the overall sign of an eigenvector is arbitrary; only the magnitudes carry meaning. So the fact that there are negative numbers in this representation is really weird, and it's an example of how overly relying on black-box solutions, on pre-packaged, pre-fabricated models that come from some toolbox, is dangerous: the corresponding variables have no clear physical interpretation. If we want to move beyond this, we have to start thinking a little bit more about our data set. Nevertheless, it's interesting to see that we can do this kind of approximation with our data set, and it gives a baseline that we can later compare against when we build better models.
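Concretely, that baseline is just the truncated SVD multiplied back together. Continuing the hypothetical sketch from above, and relying on the Eckart-Young fact that this is the best rank-k approximation in the least-squares sense:

```python
import numpy as np

# best rank-k approximation to the count matrix, in the least-squares sense
X_approx = U @ np.diag(s) @ Vt
```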
How do we build better models? Well, this is one of those moments where, in an in-person lecture together with you in a lecture hall, we would have a conversation about what to do to improve this very basic model. You're not here, so I'll have to do that myself. What I propose is that we go back to this representation of our data set that we've looked at before, this dimensionality-reduction visualization of what we're trying to do, and think about the parts of PCA that we don't like and that we might want to change to get a better algorithm. One thing I don't like about this low-dimensional representation is not even so much that it consists of matrices, but the kind of numbers in those matrices, and that they are not constrained in the right way. So maybe a first thing to fix, and it's not going to be the last thing we should fix, is the entries, the way we think about what a topic actually is for a document. Under PCA, each document is represented by its embedding in terms of the topics; those are these a vectors, and they are real numbers: they can be positive or negative, and they don't have to sum to anything. They're just loadings. Now, really, a topic representation should be something a bit more sparse: a document isn't about all topics at the same time, it's about the specific things its author wants to talk about. A particularly strict way to encode that would be to restrict the matrix that contains the loadings for the documents to binary numbers, zero and one, and to constrain it such that there is only ever a single one in each row, which means that each document is assigned to exactly one topic.
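Just to make that constraint concrete, here is a hypothetical assignment matrix of this form; the sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 231, 10                                    # documents, topics (illustrative sizes)

Z = np.zeros((N, K))
Z[np.arange(N), rng.integers(K, size=N)] = 1.0    # exactly one 1 per row: one topic per document
```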
the words in this document and then one could say that maybe depending on which kind of like distribution one belongs to this could be one class of topics that belongs to one class of documents that belongs to one topic and this could be another class of documents that belongs to another topic of course the problem we have is that we don't have colors to these pixels we don't have to these dots we don't have preassigned labels so what we would need is an algorithm that takes such a data set and invents labels to these dots colors for the dots that are somehow explaining the structure of this data set one very prominent class of such models is a mixture model in fact it's usually a Gaussian mixture model and these models tend to be introduced with more simple data sets low dimensional data sets like this one this is a very famous one it's a recordings of the duration and intermediate break between eruptions of a famous geyser in Yellowstone National Park in the US it's called Old Faithful it's a geyser that erupts roughly every hour and then actually spouts water for about 2-5 minutes and there's a famous data set that was collected well actually the data there has been recorded for a long time already by the National Park Service in the US and there was a data set released about them in the 1990s that looks like this and you can clearly see that this kind of data has structure there are basically two clusters here and these Gaussian mixture ones are clustering algorithms algorithms that basically take such a data set and invent an explanation for it by saying well there are probably two different ways in which the data set is created there are two different clusters one and the other and I can assign labels to these individual data points and push them into one cluster or the other the way they work is that they assume that this data set is generated in a generative fashion so there is some kind of well it's a generative assumption but actually it has a causal interpretation as well even though we should be careful to assign causal interpretations to generative models we can still use them if they are meaningful so the explanation is that what happens is that some causal process first decides which of the clusters will we belong to and then once we know what cluster it is it draws from a distribution for each cluster separately so this cluster up here has a distribution that looks like this there's some kind of blob over here and this cluster is some kind of blob of data over there so if I would tell you which label as which cluster identity belongs to an individual datum you can then produce a much more precise kind of generative distribution for each datum here is a mathematical representation of this and here is a graph for it the mathematical representation says that every datum is produced independently if you know what the class label is and it's produced in the following fashion we first draw a class label let's call it ZD so Z is a discrete variable so it's a vector of length K that contains exactly one as an entry and all the other ones are zero and we draw this label from a probability distribution let's call it pi so that's a bunch of numbers K numbers that sum to one and that are larger then at a non-negative larger or equal to zero and then once you know what the class label is you can then draw conditionally on the class label datum the X's from a particular distribution and that distribution might have parameters let's call them suggestively mu and sigma because in a typical 
So here is the explicit way of turning this structure into a probability distribution. We first draw the class label; to do that, we draw the binary vector z_d, with entries z_dk, from a multinomial distribution. Then, once we know the class, we draw the actual x_d from a Gaussian distribution with a particular mean and a particular covariance. Here is the corresponding graphical model. The parameters of the model are the cluster probabilities pi, a vector of length K, and the cluster parameters, of which there are K sets. The generative model is: first draw the identity of the cluster, the class, and then draw the data point from the cluster identified by z_d. Now, we're not going to talk much about how these algorithms work, because they're actually not ideal for our data set. Why? They're going in the right direction: they induce sparsity, and they fix the issue that we would like this matrix here to be sparse. But Gaussian mixture models don't fix the other issue, which is that the topics shouldn't really be real-valued: words arrive as integer counts in our data set. Here is our data set again; I've already shown you this matrix up here. What I've done now is create scatter plots that are basically two-dimensional cuts through this matrix, plotting the frequencies of two different words against each other: here are the frequencies of the words "freedom" and "consideration", and here "government" and "consideration". These are the frequencies with which these words occur in our data set, and there are, what was it, roughly 231 dots in here.
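The two scatter plots could be reproduced along these lines, again continuing the hypothetical sketch from above; I plot raw counts here for simplicity, and the column-lookup dictionary is my own helper.

```python
import matplotlib.pyplot as plt

counts = X_count.toarray()                     # dense copy of the (documents x words) counts
col = {w: j for j, w in enumerate(vectorizer.get_feature_names_out())}

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, word in zip(axes, ["freedom", "government"]):
    ax.scatter(counts[:, col["consideration"]], counts[:, col[word]])
    ax.set_xlabel("consideration")
    ax.set_ylabel(word)
plt.show()
```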
You can see, first of all, that the points lie on a grid, at integer values, of course, because these are integer counts. And you can also see that, at least visually, this doesn't look like a weighted sum of individual Gaussian distributions. That alone is maybe not so much of a problem, because the Old Faithful data set doesn't perfectly look like a weighted sum of Gaussians either, though here the situation is more extreme. What's more worrying is that we really have this discrete structure, individual counts, and if we want a good model, we should choose a base distribution for the mixture components that can reflect this discrete nature. This brings us back to something we did earlier in the lecture: the insight that variables should be assigned, according to their type, a probability distribution that captures the properties of that type as well as possible while still providing tractable inference. Those are exponential family distributions, rather than always Gaussians for everything. Gaussians are maybe the correct exponential family for real-valued numbers, but here our observations really are counts, so maybe we should represent topics by a distribution over counts, a probability distribution. And that's actually the final slide of today's lecture. We maybe want these topic-word distributions to be actual distributions, rather than a map from words to topics that contains real numbers, because then we can think of a generative process for our documents that says: if I tell you what topic this document is talking about, I can then draw the counts for this matrix by drawing the words with the right frequencies. Maybe that's what a topic is: a certain frequency with which words show up. At least in terms of a bag-of-words model, where we've already thrown out all grammatical structure from our sentences, maybe what's left as semantic structure is how frequent individual words are, and those frequencies might be what the topics represent. If we do so, then our matrix of topic-word weights, which I now call pi (I've switched notation to make explicit that these are now generative objects of the right data type for creating counts), contains probability distributions, and the idea of a mixture model would amount to saying that the document assignments are draws from a discrete probability distribution, where an individual one showing up says: this document has been drawn from this topic. But if we're already at that point, then maybe we can be a little more general and say that a document's topic assignment is not represented by an individual draw; it is itself represented by a probability distribution, which means that an individual document should be able to contain several topics at once. It should talk about several topics, because of course that's what the American presidents do when they go up and talk to Congress: they talk for a few minutes about one thing, then for a few minutes about another thing, and for a few minutes about a third thing, and then they are done. So a document is really a collection of topics, where each topic makes up a certain fraction of the overall volume of the speech, and those fractions should be represented by probabilities: numbers between zero and one that sum to one, fractions of the overall time.
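As a minimal sketch of this generative idea, here it is under the strict one-topic-per-document assumption; the names, the sizes, and the Dirichlet draw (used only to manufacture some probability vectors for the illustration) are all my own choices, not the lecture's model.

```python
import numpy as np

rng = np.random.default_rng(0)

K, W = 10, 5000                                  # topics, vocabulary size (illustrative)
pi_topics = rng.dirichlet(np.ones(W), size=K)    # each topic: a probability vector over words

def sample_document(n_words, topic):
    """If a document is 'about' exactly one topic, its word-count vector is
    a multinomial draw from that topic's word distribution."""
    return rng.multinomial(n_words, pi_topics[topic])

counts = sample_document(2000, topic=3)          # non-negative integer counts, as in our data
```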
That's the kind of model we want, and we are going to build that model in the next lecture; we're going to end on this. What we've done today is this: we've collected an interesting data set, arguably at least, and we've started to look at it by first applying a very simple toolbox solution, PCA. That gave us an answer, but that answer is unsatisfying, because it comes in the wrong data type: real numbers, and linear maps from the space of documents and words to a low-dimensional space. The reason why this is so, we could read off by building a probabilistic interpretation of principal component analysis, or rather of latent semantic indexing, its related concept in natural language processing. We saw that PCA can be interpreted as assigning a Gaussian belief over latent variables, with a linear relationship between those latent variables and the observable space. And this is perhaps not what we want when we talk about topic distributions, because topics are not points in a real vector space; they are discrete objects, discrete choices to talk about a particular thing, rather than a smooth space in which we move around. Also, the topics themselves are not just linear, weighted combinations of words with positive or negative weights; no, they are reflected in the text by the frequency with which individual words show up, in a discrete fashion. So the right data type for this kind of generative process is a discrete probability distribution, not a continuous-support distribution like the Gaussian. We would like to use not a Gaussian but a discrete distribution to generate the words, and we might also want to use a discrete distribution to generate the topics for documents, so that each document can consist of a few individual topics, not all of them at once, so that the speakers can talk about different things and, in particular, can also not talk about something at all by assigning zero probability to it. The challenge that remains, and that we are going to address in the next lecture, is to phrase these observations more precisely in a probabilistic generative model and then, and that's the bigger problem, to come up with an algorithm to actually do inference in this kind of model, because the algorithmic question, as I've now said several times, is always the real challenge when we build probabilistic models. We will deal with all of these issues in the next lecture. I'm looking forward to seeing you there, maybe right now.