Hello, and welcome to lecture number 25, the penultimate content lecture of Probabilistic Machine Learning. In the last lecture, we completed our machine learning toolbox, both for modeling and computation, with the addition of variational inference, a quite advanced and efficient approximate inference technique that strikes a balance between the lightweight nature of Laplace approximations, which are optimization-based methods of approximate inference built on a completely local approximation of the log posterior around the mode, and Monte Carlo methods on the other hand, which are algorithms based on the use of random numbers that are asymptotically exact, so they actually compute the exact posterior, but only in the infinite limit, not as a finite-time optimization process. Variational inference strikes a balance between those two by providing an approximation based on an optimization framework: it's an algorithm that converges in finite time, but converges to an approximation that is a full probability distribution, one that also tries to approximate the posterior in a global fashion, not just locally around the mode, but globally in the sense that it's trying to minimize KL divergence. We saw that we can achieve this not by explicitly minimizing the KL divergence between the approximating distribution and the posterior distribution, but by instead maximizing another quantity called the ELBO, the evidence lower bound, which, as the name suggests, is a lower bound on the evidence, a constant, where the gap between this lower bound and the constant is given by the KL divergence; so by maximizing the bound, we are minimizing KL divergence. And then we discovered something truly marvelous, which is that in some specific cases it is possible to do this maximization of the ELBO iteratively, in closed-form iterative steps, without having to explicitly impose a functional form on the approximation q. Instead, we only impose a factorization: we only require the approximation to have certain factorization properties, and then we know that the optimal approximation with this factorization property has the property that its logarithm, as a function of the variable in that factor, is given by the expected value of the log joint distribution under all the other approximating factors (up to a constant). Because this is a function over this variable, not just a number, it provides the entire form of the approximation, and we saw that in some cases that functional form is actually tractable, so we can compute it iteratively, thereby increase the ELBO, and find a KL-minimizing approximation. We did this for our topic model, the model we've now been considering for several weeks: a model that assigns a topic structure, a low-rank structure, to the words that occur in a corpus of documents, and in the last lecture we went through the derivations for the corresponding variational bound. This model uses cunningly chosen parametric distributions for all the conditional terms in the joint distribution: exponential family priors for the latent quantities, and corresponding conjugate exponential family distributions for the lower-level latent quantities.
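As a compact reminder of the relationship described here (writing $q$ for the approximation and $z$ for all latent variables; this is just a restatement, not a new result):

$$
\log p(x) \;=\; \underbrace{\mathbb{E}_{q(z)}\big[\log p(x,z) - \log q(z)\big]}_{\text{ELBO }\,\mathcal{L}(q)} \;+\; \mathrm{KL}\big[q(z)\,\|\,p(z\mid x)\big],
\qquad
\log q_j^{*}(z_j) \;=\; \mathbb{E}_{q(z_{\neg j})}\big[\log p(x,z)\big] + \text{const.}
$$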
We found that in this model we can actually do this, that we can find these closed-form iterative update rules, which we discovered lead to a factorizing, in fact fully factorizing, approximation for all the latent parameters in this model: the document-topic distributions, the topic-word distributions, and the word-topic assignments all factorize completely, with a Dirichlet distribution form assigned to both of the latent low-rank representation variables pi and theta, and a completely factorizing discrete distribution assigned to the word-topic assignments. Maybe, because it went by so fast in the last lecture, you have wondered for yourself: why does this actually work so well? There's this beautiful effect that we're just imposing a bit of factorization; the only thing we actually really imposed was that we wanted an approximation that separates the c's from the thetas and pis. That's the only thing we put in. But then, automatically, we found that there's a further factorization resulting from the approximate form, so that the distribution that minimizes KL divergence under the restriction that it should factorize between c and theta and pi happens to be a fully factorizing distribution: independent across all documents and all words for the word-topic assignments, independent across all topics for the topic-word distributions, and independent across all documents for the document-topic distributions. And we also noticed that the approximating distributions had a tractable closed form: they are of exponential family form, Dirichlet distributions for pi and theta, and discrete distributions for c. Maybe you've wondered for yourself why this is. That is one of the first things I want to do in this lecture together with you: to get a bit more insight into where this comes from. I said at various points in the previous lecture that it's a good design principle, a good rule of thumb, that when we design generative models we try to use exponential family distributions, in particular pairs of conjugate prior and likelihood, because it makes your life easy, and the fact that variational inference became particularly tractable and closed-form in this setting is maybe an indication of why this is such a good idea. But we can be a little bit more formal about this. It actually turns out that there is a good reason why we can expect both induced factorization and closed forms for our variational approximation. So here is an abstract way of looking at what we've been doing. We've been considering constructing a variational approximation to a generative model for some data x that involves two different sets of variables: a somewhat more immediate latent variable z and a latent quantity eta. In our topic model, z is equal to, let me go back, z is equal to c, the word-topic assignments, and the other parameters eta correspond to this stuff out here, to pi and theta. Maybe let's first consider mostly pi; for theta the argument works basically the same way, it's only a minor variation. Then what we have in our generative model is that, first of all, the joint for the data x and the latent quantity z, conditioned on eta, is an exponential family distribution.
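In symbols, a hedged restatement of this abstract setup (the notation $T$, $h$, $A$, $\nu$, $n_0$ is chosen here and may differ from the slides): the likelihood is an exponential family, and its conjugate prior is itself an exponential family in $\eta$,

$$
p(x_n, z_n \mid \eta) = h(x_n, z_n)\,\exp\!\big(\eta^{\top} T(x_n, z_n) - A(\eta)\big),
\qquad
p(\eta) \propto \exp\!\big(\nu^{\top}\eta - n_0\, A(\eta)\big).
$$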
So in our case it's a discrete distribution over the words once we condition on the assigned topic z, and we decided to use the conjugate prior for this discrete distribution, which is given by a bunch of products of Dirichlet distributions; or, in the more abstract form, the conjugate prior for this exponential family is another exponential family. Remember from the lecture on exponential families that all exponential families have conjugate priors that are themselves exponential family distributions. In this case the likelihood is discrete and the conjugate prior is Dirichlet, but any other such combination would also work. And then, what we imposed when we did variational approximate inference is that we wanted the posterior distribution over z and eta, or in our topic model over c and pi, to factorize between those two types of variables. Now what you can find is that in such situations it's actually generally to be expected that (a) it was always clear that we get induced factorization, and (b) we can often expect to find tractable approximations, which are certainly in the exponential family. To see that, let's just consider what we did. We constructed a variational bound; we're doing this for, I think, the sixth time in a row now, so hopefully by now it has become second nature to you. To get the approximate distribution on z, we have to compute the expected value of the log of the joint under the other approximation, so under q of eta. The log of the joint, we see it up here, right? Well, to be very precise, the joint here of course would involve a p of eta as well, the prior, but p of eta is a term that doesn't depend on z, so when we are constructing our approximation for z we can put that term into the constant, just to be clear. So the log of this joint is the log of the prior plus the log of the likelihood, and the log of the likelihood we can read off from up here: it's just a sum, because of the i.i.d. assumption, over those terms inside of the exponential, and because it's an exponential family there is a linear relationship between eta and the sufficient statistics of this exponential family. So therefore, when we're computing our expected value over eta, eta will only show up in here, and we'll get an induced factorization: we now have a sum of individual terms for the individual z_n, so we know that our q of z is a product of individual terms, one for each z_n. But not only that: we also see that the approximating distribution for each z_n is itself of the exponential family form that we see in our likelihood, in our joint distribution, and the only thing that changes is the value of the natural parameters, and therefore of course also the value of the log normalization constant; and that value is set to the expected value of the natural parameters under our q of eta, rather than, well, some point estimate as in EM, or some other distribution in the exact case.
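In symbols, the update just walked through, using the abstract setup above (a summary; the prior term and the log-normalizer $A(\eta)$ are absorbed into the constant because they do not depend on $z$):

$$
\log q^{*}(z) = \mathbb{E}_{q(\eta)}\Big[\log p(\eta) + \sum_n \log p(x_n, z_n \mid \eta)\Big] + \text{const}
= \sum_n \Big(\log h(x_n, z_n) + \mathbb{E}_{q(\eta)}[\eta]^{\top} T(x_n, z_n)\Big) + \text{const},
$$

so $q^{*}(z)$ factorizes over $n$, and each factor is in the same exponential family as the likelihood, with the natural parameter replaced by its expectation $\mathbb{E}_{q(\eta)}[\eta]$.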
So here we see that for z we get induced factorization and an exponential family form. What happens for q of eta? Just to be sure, it's a similar situation: for q of eta we have to take the expected value of the log joint under the approximation for z. Again, the prior doesn't depend on z, so we can take the expected value inside and we get the expected value of the log joint. Let's look at this again: where does z actually show up? z only shows up in the sufficient statistics, and that means our approximation for eta will again be of exactly the right exponential family form. Why? Because the prior and the likelihood are conjugate to each other, and the only thing that changes under this approximating operation is that we are now computing expected values of sufficient statistics; but that doesn't change the fact that we'll end up with a linear function in eta, so an exponential family. That means our approximate distribution on eta will also be of exponential family form. So, long story short: if you're constructing variational approximations, even, or in particular, if you're constructing variational approximations, use conjugate exponential family priors wherever you can. They are not just useful because they sometimes allow closed-form inference; they are also a useful tool to keep around, a useful habit to get into, even if you want to do approximate inference afterwards, because they tend to make approximate inference easier. Having addressed this issue, I want to tie up another loose end in our algorithmic toolbox, another question you might have about the relationship between the algorithms in the toolbox. I said when I introduced variational inference that it tends to be the more elegant, more high-performance, more powerful algorithm when compared to Monte Carlo methods, because Monte Carlo methods tend to converge relatively slowly and require significant computational resources, since they are in theory only correct in the infinite limit, whereas variational approximations are optimization methods that just construct a probabilistic approximation in finite time. However, maybe you remember that a few lectures ago, for the specific case of our topic model, I introduced a way of speeding up Gibbs sampling inference in topic models, which is given by the collapsed Gibbs sampler. Back then we first started by doing Gibbs sampling: we found that there's one possible way of doing inference in this model, which is to iterate between sampling from the conditional distributions for pi and theta given c and w, and then sampling from the conditional distribution for c given pi and theta and w. That was an algorithm you implemented in your homework, and it works, but it's still comparably inefficient, because we have to keep iterating back and forth between those two sets of variables. And then I pointed out in the lecture that, because of the structure of this distribution, which is given by Dirichlet priors times discrete likelihoods both for theta and pi, this is a case of observing data drawn from a conjugate prior and likelihood model, so we can actually marginalize over the latent quantities theta and pi and get a marginal distribution just for c. That distribution has closed form: it just consists of a bunch of terms that are ratios between the normalization constants of Dirichlets, beta functions. Now this seemed a little bit useless at first, because c isn't actually the variable we care about; it's the variables theta and
pi that we're really concerned with, so being able to marginalize them out seems a bit backwards, because we are interested in them; we don't want to marginalize them out. But this property of the model was nevertheless helpful for Monte Carlo, because we could now draw directly from the conditional distribution for c, from the posterior for individual terms in c given w. That can be done particularly efficiently in Gibbs sampling, both in terms of implementation and in terms of the properties of the Gibbs sampler, because this means that the mediation provided by the intermediate quantities theta and pi, which actually make the c's conditionally independent, is directly included in the Gibbs sampling loop. So we don't have to do what we now realize is an artificial separation between going through all the c's and only then updating theta and pi; instead, by updating one c after the other, we immediately propagate the effect of that change onto all the other c's through those aggregated quantities, pseudo-counts, that relate them to each other. So after all it seemed like maybe Gibbs sampling can be made quite efficient: it's maybe not much fun to implement this algorithm, but it can actually be quite efficient. So now that we've seen variational inference, and the variational bound that we constructed so far was on all three sets of variables, theta and pi and c, you might wonder: which of the two is now actually the better algorithm? For Gibbs sampling we can do collapsed inference, and the variational algorithm is an optimization method; those are two seemingly orthogonal things that are both potentially helpful, so which of them actually is the better answer? Now, funnily enough, it turns out it's actually possible to do collapsed inference in a variational scheme as well, and what I'd like to do over the next 15 minutes or so is present to you how this is done. Advance warning: this is not straightforward at all. It was actually a paper in itself; just this idea, for this particular topic model, was itself a NeurIPS paper, a few years after the original topic model paper came out, and it's certainly not something that I would expect you, if you're building a probabilistic model, especially not for the first few times, to just think of. The main message of these few minutes in this lecture is a high-level one, which is that if you really want to get your probabilistic inference to work well, then there are sometimes smart mathematical tricks, but those will always come from the ingenuity, from the creativity, of the human; well, I say always, at least for now they have to come from the ingenuity of the human writing the code, and smart people can make machine learning work really well even today, while people who are relying on toolboxes will have a hard time doing that. That's good news for you, because if you are the expert who knows about this stuff, that makes you very valuable to, well, whoever is paying the bills. So you don't have to feel like the derivations I'm going to do are stuff that you could have come up with yourself; you don't even have to understand all the details. Fair warning: if you want to do the homework this week, you will actually need to understand it, because we're going to implement this, but you only need to understand the final result. So here is how it works. In the collapsed Gibbs sampling framework we collapsed out theta and pi; let's see if we can do that here as well, in the variational bound. The story goes like this.
This is actually pretty much exactly the story from this paper. In the original variational bound we constructed last week for our topic model, we decided that we want to impose, as a factorization on our joint model over theta and pi and c, a separation between the c's and the pis and thetas, and then we actually found that this is exactly the same as additionally imposing a factorization over all the c's, because as soon as we factorize between c and (theta, pi), the distributions on c, and in fact also on theta and pi, but this is not important now, factorize themselves through induced factorization. So it's not actually an additional assumption to impose this kind of factorization; and by the way, the original paper by David Blei, Andrew Ng, and Michael Jordan on latent Dirichlet allocation even used a fully factorizing distribution from the start, a full mean-field approximation. So now the idea is going to be: what if, instead of this factorization, we impose a strictly weaker factorization, which is that we allow for an approximation on theta and pi that is actually conditional on c? It's a conditional distribution given c, so it is allowed to depend on c, rather than being a fully independent factorization. If this sounds odd to you, it's probably because you might be thinking: well, is that even much of an assumption anymore? We can write the posterior distribution over c and theta and pi given w, of course, using the product rule, as a posterior over theta and pi given c and w, times the posterior over c given w. That expression looks a lot like what we are assuming here, and in fact it's going to turn out to be the same thing. So when we minimize the KL divergence between the approximating distribution and the full posterior, we plug in this assumed factorization; here I'm going to make a notational simplification, I'll write capital C again for the fully factorized approximation, but soon we'll come back to the assumption that this actually fully factorizes, it's just a simpler way so that I don't have to write so many product terms. The important bit is, and here you might want to stop the video for a bit and stare at this equation: because of the properties of the log, if you impose the structure of our factorized approximation, we actually get two separate terms, one for the KL divergence between our approximating distribution on theta and pi given c and the corresponding posterior, taken in expectation under q of c, plus the KL divergence between q of c, whatever it might be, and the posterior on c. We rearrange to find that we're computing two different KL divergences: one between our approximating distribution on theta and pi and the posterior on theta and pi given c, and one between q of c and the posterior on c. Now, the perfect way to minimize the first term would be to set those two arguments to be equal, to set this approximating distribution to the posterior for theta and pi given c; and in fact, why don't we do that? Because we actually know from two lectures ago that we can compute this posterior: it's of tractable form, it's just a bunch of Dirichlet distributions for pi and a bunch of Dirichlet distributions for theta, with parameters that are incremented by those pseudo-counts n_dkv that arise from c. So we can just do that, and then this part of the gap between the ELBO and the true evidence will just be zero. That's the perfect thing to do: it tightens our bound, and the only thing left is the KL divergence between q of c and the posterior p of c given w.
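To make the split explicit, this is the decomposition being described, under the assumed factorization $q(\theta,\pi,c \mid w) = q(\theta,\pi \mid c)\,q(c)$ (a restatement of the step above):

$$
\mathrm{KL}\big[q(\theta,\pi\mid c)\,q(c)\,\big\|\,p(\theta,\pi,c\mid w)\big]
= \mathbb{E}_{q(c)}\Big[\mathrm{KL}\big[q(\theta,\pi\mid c)\,\big\|\,p(\theta,\pi\mid c,w)\big]\Big]
+ \mathrm{KL}\big[q(c)\,\big\|\,p(c\mid w)\big],
$$

and setting $q(\theta,\pi \mid c) = p(\theta,\pi \mid c, w)$ makes the first term exactly zero.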
So by constructing only an approximation on c, and setting the approximate distribution for pi and theta to the true posterior, we are also doing a form of collapsing: we're effectively collapsing out theta and pi. We'll just construct an approximation on c, and if you then ask us afterwards for a distribution on theta and pi, well, we'll just set it to the posterior distribution on theta and pi given c that arises. Okay, so that means all that is left to do is to construct a variational bound purely on the posterior of c given w; we could call that the marginal posterior of c given w. Now, why didn't we do that before? That seems like a pretty smart thing to do, because this is an easier term, right? We don't have to deal with theta and pi. Well, it'll turn out that it's not entirely straightforward, because the terms that arise are a little bit more complicated; but let's boldly go forward and just see how far we get, and when we get stuck we can start thinking. So let's do this. On the previous slide, for collapsed Gibbs sampling, I already wrote down the marginal distribution for c and w, the marginal joint, which was this ratio of a bunch of normalization constants of Dirichlet distributions, which can be written in terms of gamma functions. And now we have decided that there are basically no thetas and pis anymore, just c's and w's, and it's just a, well, maybe arguably a bit awkward, bunch of terms that involve c's. So remember that the c's are collapsed into these counts n_dkv, which count how often, through the assignments c, we have counted in document d a word with vocabulary identity v in topic k, and we can collapse out some of those dimensions, denoted by dots. To construct a variational bound on this, we again write down the ELBO; this is the same quantity as before, and we're going to maximize this ELBO to minimize the KL divergence between the resulting factorizing distribution on c and the true posterior for c given w. Now remember that the posterior on c given w was not straightforward to write down, so we can't expect this to be particularly easy: we're not going to simply set q of c to the posterior on c given w. The posterior for c given theta and pi and w was easy to write down, but this marginal posterior is not going to be. Nevertheless, by doing this we can expect to gain something, because on the previous slide, by collapsing out theta and pi, we have made strictly fewer factorization assumptions than previously for the factorizing variational bound, and so this ELBO will be tighter than the one we constructed in the previous lecture; we would therefore hope that our approximating distribution q, which has a strictly lower KL divergence to the true posterior than the sum of those two KL divergences from before, gives a better bound. So we know how to construct this bound in principle: we have to find the individual terms, the individual factors in the distribution over the c_di, that's the one-hot topic assignment of word i in document d. We do that by constructing the expected value of the log joint under the assignments of all the other variables. Well, let's see how far we get with that. We have our joint up here; we can take its log and see if we can construct expected values.
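For reference, the collapsed joint referred to here has the standard form (a sketch assuming symmetric scalar hyperparameters $\alpha$ and $\beta$; $n_{dk\cdot}$, $n_{\cdot kv}$, $n_{\cdot k\cdot}$, $n_{d\cdot\cdot}$ are the counts with the dotted dimensions summed out, and the exact indexing on the slides may differ):

$$
p(w, c \mid \alpha, \beta)
= \prod_{d}\Bigg[\frac{\Gamma(K\alpha)}{\Gamma(K\alpha + n_{d\cdot\cdot})}\prod_{k}\frac{\Gamma(\alpha + n_{dk\cdot})}{\Gamma(\alpha)}\Bigg]
\;\prod_{k}\Bigg[\frac{\Gamma(V\beta)}{\Gamma(V\beta + n_{\cdot k\cdot})}\prod_{v}\frac{\Gamma(\beta + n_{\cdot kv})}{\Gamma(\beta)}\Bigg].
$$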
One first fun thing that's going to be a little bit different from the previous time we constructed a variational bound: previously we had to construct a variational bound on c by computing an expected value of the log joint under all the other factors, and that gave us some leeway, in the sense that we could first think about this approximating distribution on c, find its form, and then deal with the fact that we didn't yet know the explicit functional form of the other approximating distribution. Here it's c's everywhere, basically, so we maybe have to guess a little bit more what kind of distribution we're looking for. But thankfully we're in luck, because those c_di are discrete variables: they are hard assignments of each word at location i in document d to a particular topic. Because they are discrete values, we can write down the most general possible probability distribution over those finitely many discrete values in terms of a discrete distribution. So we know that our distribution over c_di will be a product over a bunch of probabilities for the individual entry k to be the one that we set to one, such that those gamma_dik are, for each fixed value of d and i, a probability vector: the sum over the entries in k is one. That's going to be useful, because we know that we can write down the q over all the other c_di as, again, bunches of gamma_dik entries: if we have this three-dimensional array gamma_dik, then we have the approximating distribution, we just have to normalize along k. We'll use this property in a moment. One more thing, just to point it out in passing, I've now said this several times: the gamma function is an interpolant of the factorial function, so in particular it has this property for integer values n, that it can be written as a product over these shifted individual terms, which also means we can write its log as a sum over the logs of such terms. So here, written down again, is the joint; no change so far. Now let's see if we can construct this approximate distribution, whose logarithm is given, as a function of c_di, by the expected value of the log joint under all the other approximating distributions, for the c without the di; that's the notation I'm going to use, capital C without di, meaning the topic assignments of all the other words, if we explicitly pick out the one at location di. Now for that, we can look at this expression up here and check where c_di actually shows up. Well, it doesn't show up directly in this term, or in this term, or in this term, or in this term; there are four locations where the n's show up, and the n's are aggregates of the c's: it's here, here, here, and here. So we're really expecting to see four terms in our variational bound. However, if we look a little bit closer, you may be able to convince yourself that in this particular case, down here, what we need to compute is the sum over k of n_dk·. So if you consider the word at location i in document d, then if we change its assignment, if we decide that its topic is not, let's say, topic one but topic two, we're really just shifting an extra one around from one term in this sum over k to another term: for k equal to one we're subtracting one, and for k equal to two we're adding one, so overall we're not actually going to change the value of this quantity.
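For reference, the gamma-function property used here, and again below when most of the $\Gamma$ terms get dragged into the constant (stated for real $x > 0$ and integer $n \ge 0$):

$$
\Gamma(x + n) = \Gamma(x)\prod_{j=0}^{n-1}(x + j)
\qquad\Longrightarrow\qquad
\log\Gamma(x + n) = \log\Gamma(x) + \sum_{j=0}^{n-1}\log(x + j).
$$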
So the only bits we have to care about are this one, this one, and this one: we need to compute, under the assignments of all the other c's, the expected value of the log of this expression, plus the log of this expression, minus the log of this expression, up to constants. And now, finally, we're almost done; we're turning our attention to this bit that I just ran over: there's the functional property of the log gamma function that it's given by a sum over individual terms, and if we consider what happens when we change the assignment of word i in document d from one topic to the next, then in this sum only the last term, the one for n minus one, is actually going to be there or not be there, because we're increasing n by one or we're not. All the other terms will always be there, no matter whether we change this assignment up by one or not, so we can put them into the constant, we can drag them outside. If this is confusing, maybe this is a moment to stop the video and stare at those expressions for a bit, and then maybe you can convince yourself that that's the case. So we can actually get rid of those gammas, and we're only left with the innermost expressions, and this may seem like a quantity that we can actually work with: we just have to compute expected values of the log of some pseudo-counts. Those pseudo-counts come from discrete distributions, so we should be hopeful that we can compute expected values; we did something similar in the Gaussian mixture model, where we had a similar term in which we had to compute expected values of counts under a discrete distribution. The only difference we have now is that we have log counts, log n, rather than just n. That should be straightforward, no? Well, actually... imagine yourself doing this, not Yee Whye Teh trying to work this out together with Max Welling when they wrote this paper, but you yourself: you arrive at this point, you followed all the rules, the game plan for how to construct a variational bound, you decided, okay, I'm going to collapse out those variables that I actually don't need, there's a quantity that I can write down, that has an explicit form, this joint over c and w, so I just have to construct an approximate variational bound on the c's. You crank the handle, you just do what you're supposed to do to construct a variational bound, and now you're left with this expression. Beautiful, that shouldn't be that hard to compute: what's the expected value, under a discrete distribution, of the log of the counts? It's a log of a sum, how hard can it be? Now, if you look up on Wikipedia what that is, you actually at first don't find much of a statement about it, and you have to dig a little bit deeper, and then you'll find out that there is actually an expression for the expected value of the log of counts under discrete distributions over the individual variables that make up the counts, and it is given by this, where these are factorials, and factorials are already a bit complicated to compute; but much worse than that, it actually involves a double sum. So before I even tell you what these special functions in the expression are, you don't even need to know about them to see the much more problematic issue, which is that there is a double sum here over
quantities that are of the order of the number of points in the count. So computing this bit will, in general, be quadratically expensive in the count that we're dealing with. So it might actually be that the original authors of latent Dirichlet allocation, David Blei, Andrew Ng, and Michael Jordan, and I don't want to accuse them of anything, actually thought about this opportunity when they first derived their variational bound; maybe David Blei was sitting in some library trying to figure out what, in god's name, is the expected value of the log of a sum of a bunch of independent discrete random variables, only to arrive at this expression and to find: wow, okay, that's really not worth it, let's use our fully factorized variational bound. And so it's perhaps not surprising that it took three years: the collapsed Gibbs sampler was published in 2004, and it took three years for someone to come up with a solution that actually allowed the variational bound to be collapsed as well. It was these two people, Yee Whye Teh, who is by now a professor at the University of Oxford, and Max Welling at the University of Amsterdam, together with David Newman, who came up with a way to address this issue, and that way is really one of street-fighting mathematics. It's a case of just not giving up in the face of mathematical and numerical complication, pushing on, and being willing to cut a few corners to get something good out. I remember that I was at NeurIPS 2007 myself, it was actually my first NeurIPS conference, and I met Yee Whye there when he was presenting this paper, and I remember that this was the kind of work that was actually happening in the machine learning community at this time. By now our field has, at least for a moment, become so hectic and rapid that people barely have time anymore to construct these well-thought-out, good approximations, but maybe it's about time to start thinking about them again, because they really are what makes algorithms perform really well, what saves computational resources, and therefore energy, human time, and CO2. So how does their approximation work? Here's the expression again that we have to deal with: we want a discrete distribution with parameters gamma_dik which are given, up to normalization, by the exponential of the expected value of a log of counts. Yee Whye and Max Welling arrived at several insights. The first one is that they noticed that here we have a bunch of counts over random variables which, under the approximation, are actually independent of each other, and each single one of them is a discrete distribution, so their sum has a binomial distribution. You've maybe by now encountered the binomial distribution many times; if you've been in my data literacy class, you've seen it for sure. It's the distribution over the sum of independent Bernoulli random variables, each of which has its own probability of being one or zero. This distribution looks like this, and if the counts are sufficiently large, it looks very much like a Gaussian distribution, and that's no accident, because it's then the sum of independent random variables, so the central limit theorem applies, and we get an approximately Gaussian distribution with a mean that is given by the expected value of that sum, so the sum over the individual probabilities, and a variance that is given by the sum over the individual probabilities times one minus the individual probabilities. So we can construct the mean and the variance of the individual counts, of these n's in here.
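In symbols (a restatement; the $p_j$ below are the relevant entries of the $\gamma$ array, and the sum runs over the other word positions that could contribute to the count):

$$
n = \sum_j b_j,\quad b_j \sim \text{Bernoulli}(p_j)\ \text{independent}
\;\;\Rightarrow\;\;
\mathbb{E}[n] = \sum_j p_j,
\qquad
\mathrm{Var}[n] = \sum_j p_j(1 - p_j),
$$

and for sufficiently large counts, $n$ is approximately $\mathcal{N}\big(\mathbb{E}[n], \mathrm{Var}[n]\big)$.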
We've actually already used this in our Gaussian mixture model. What we can't yet construct are the expected values of the logs of those counts, but what we just saw, by realizing that we're looking at sums of Bernoulli variables, is that these random variables are nearly Gaussian distributed, so they are well described in terms of their mean and their variance. So the next insight that these two people had, actually three, I forgot about David Newman, but he doesn't have a picture online, so I had to live with the faces of these two people; the next thing they realized, and this is arguably something that you can do after your first year of undergraduate academic training in a mathematical field: we're approximating a nonlinear function at a location that is relatively well specified by a mean and a variance, so what do we do? Well, we do a Taylor expansion. Let me show you that. We need to compute the logarithm of something whose expected value and variance we know, so let's do a Taylor expansion of this quantity around the expected value of this quantity; for the moment I've dropped all the indices so that you can see more of what's going on. The Taylor expansion is going to be: the zeroth-order term is just the value of this function at the expected value; the first-order term is the difference between the actual value and the expected value, times the first derivative at the expected value, where the derivative of the log is just one over the argument and the inner derivative is just one; plus one half times the squared distance times the second derivative, where the second derivative is minus one over the argument squared and the inner derivative is still one. So we can write our random variable, the log of alpha plus n, approximately, up to second-order terms, as this constant plus this linear function plus this quadratic function. If you take the expected value of this approximation of our random variable, then here there's no further expectation to take, so we just get the term back; here, the expected value of this distance is just zero, because the expected value of n is exactly this, so this term cancels out; and the second term contains the expected value of the squared distance between n and its expected value, which is known as the variance of this random variable. So here we have an expression that tells us how to compute, approximately, the expected value of the log of a bunch of sums of individual independent random variables, by approximating them with a Gaussian and taking a second-order Taylor expansion. What are those means and variances? Well, we can keep track of them by computing the means and variances of the individual Bernoulli random variables and summing them up, because under the approximation they are independent of each other. So we just sum up the individual values of gamma, that's something we've already done for the Gaussian mixture model, and for the variances we sum up probability times one minus probability. And with that, we've actually cracked the problem, we've found the solution: we can now construct our variational bound, which is a discrete distribution over the individual words' topic assignments in each document. It's given by the exponential, up to normalization, of the expected value of this thing, which we've just decided to approximate, so each term in here is approximated by two terms: the log of the expected value, which is like dragging the expected value inside of the log, plus a correction.
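Written out with the indices dropped, as on the slide (a sketch; $n$ stands for the relevant leave-one-out count and $\alpha$ for the corresponding pseudo-count):

$$
\log(\alpha + n) \approx \log\big(\alpha + \mathbb{E}[n]\big) + \frac{n - \mathbb{E}[n]}{\alpha + \mathbb{E}[n]} - \frac{\big(n - \mathbb{E}[n]\big)^2}{2\big(\alpha + \mathbb{E}[n]\big)^2},
\qquad\text{so}\qquad
\mathbb{E}\big[\log(\alpha + n)\big] \approx \log\big(\alpha + \mathbb{E}[n]\big) - \frac{\mathrm{Var}[n]}{2\big(\alpha + \mathbb{E}[n]\big)^2}.
$$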
Of course, dragging the expected value inside the log alone is incorrect, so we correct for the error we make, up to second order, by computing the additional term involving the variance. If we plug this in here, and this happens at three different locations, here, here, and here, then we now have an explicit form for our discrete distribution: it's given, up to normalization, by e to the log of..., so exp and log cancel, so just this term, which is the bit for the pi side of the model, times this term, divided by this term; these are the individual bits of the log that come from the exponential of the log of a closed-form value; and then we have to correct with the variances, which enter as an extra factor, the exponential of minus or plus a bunch of variance terms, and those are going to be the quantities we need to track to update our variational bound. So here is the collapsed variational inference algorithm. At each iteration of the loop, we don't do anything about pi and theta; we just update the topic assignments of every single word and every single topic by computing what we might call a leave-one-out statistic: we consider n_dk· without this one individual word, we consider its mean and its variance, compute this distribution, and then directly assign, with normalization, a new value for gamma_dik; and that new value for gamma_dik of course allows us to recompute the n's. Now, there are a few things to notice here. The first thing is that, of course, to do this we have to store not just the mean but also the variance of these counts, so a good set of variables to keep around in memory are those expected values and variances of the n_dkv, collapsed in various different ways; these are these six different objects, which are all matrices or vectors. Just to remind you, what is the value of the variance? Well, if you update gamma by some new number, then you just add that number into here, and you add that number times one minus that number into here. Okay, that's the first thing. I also said "up to normalization": in the end we have to divide by the sum of this quantity over k, so that it's not proportional anymore, but equal. The second thing to notice is this: these assignments are not random numbers. Notice that we're not drawing from a probability distribution, the way we did for Gibbs sampling; instead we just compute those numbers and then directly store them. And also, in this expression that we're computing here, i, the location of the word in the document, and this is something that you don't see immediately from this complicated expression, but if you stare at it for a bit you can convince yourself of it, i doesn't play an explicit role: i only enters through the identity of the word at location i in document d, through w_di, and that w_di just enters as an index, it just tells us which index to look at. So that means that if the same word shows up several times in a document, then those same words in the same document will have the exact same value gamma_dik, because we're not drawing random numbers, we're computing probabilities for them, and those probabilities don't depend on the individual location i, only on the identity of the word at that location.
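To make the bookkeeping concrete, here is a minimal, unvectorized Python sketch of one sweep of such a collapsed update. This is not the lecture's reference implementation: the variable names (gamma, E_ndk, V_ndk, E_nkv, V_nkv, E_nk, V_nk), the symmetric scalar hyperparameters alpha and beta, and the exact form of the correction factors are my reconstruction of the collapsed variational scheme sketched above, so treat it as something to compare against the slides rather than as ground truth.

```python
import numpy as np

def cvb_sweep(X, gamma, alpha, beta):
    """One sweep of a collapsed variational update for LDA (sketch).

    X     : (D, V) integer array of word counts per document.
    gamma : (D, V, K) array; gamma[d, v] is the topic responsibility vector
            shared by all occurrences of word v in document d.
    alpha, beta : scalar Dirichlet pseudo-counts for pi and theta.
    """
    D, V = X.shape
    cnt = X[:, :, None]                                     # broadcastable counts

    # Expected counts and their variances under the factorized approximation.
    E_ndk = (cnt * gamma).sum(axis=1)                       # (D, K) doc-topic
    V_ndk = (cnt * gamma * (1.0 - gamma)).sum(axis=1)       # (D, K)
    E_nkv = (cnt * gamma).sum(axis=0).T                     # (K, V) topic-word
    V_nkv = (cnt * gamma * (1.0 - gamma)).sum(axis=0).T     # (K, V)
    E_nk = E_nkv.sum(axis=1)                                # (K,)  topic totals
    V_nk = V_nkv.sum(axis=1)                                # (K,)

    for d in range(D):
        for v in np.flatnonzero(X[d]):
            g_old = gamma[d, v].copy()
            # Leave-one-out statistics: remove one token's contribution.
            a = alpha + E_ndk[d] - g_old
            b = beta + E_nkv[:, v] - g_old
            c = V * beta + E_nk - g_old
            va = V_ndk[d] - g_old * (1.0 - g_old)
            vb = V_nkv[:, v] - g_old * (1.0 - g_old)
            vc = V_nk - g_old * (1.0 - g_old)
            # Zeroth-order factors times the second-order variance corrections.
            g_new = a * b / c * np.exp(-va / (2 * a**2)
                                       - vb / (2 * b**2)
                                       + vc / (2 * c**2))
            g_new /= g_new.sum()
            # Apply the same update to all X[d, v] tokens of this word type
            # (the simplification discussed above: gamma depends only on d, v, k).
            n = X[d, v]
            dE = n * (g_new - g_old)
            dV = n * (g_new * (1 - g_new) - g_old * (1 - g_old))
            E_ndk[d] += dE; V_ndk[d] += dV
            E_nkv[:, v] += dE; V_nkv[:, v] += dV
            E_nk += dE; V_nk += dV
            gamma[d, v] = g_new
    return gamma
```

In such a sketch, gamma would typically be initialized with small random positive values normalized along the topic axis, and the sweep repeated until the ELBO stops rising.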
So that means we can store the values for gamma_dik with memory complexity on the order of the number of documents times K times V, the size of the vocabulary, not the length of the individual documents. This is really convenient, because from a memory perspective, for long documents where each word tends to show up several times, this can be a more compact representation; and it's also convenient from a computational perspective, because we can loop over these instances of v rather than over the individual instances of the word. We gain this because we are not performing sampling: we're not assigning an explicit instantiation of the topic assignment to every single word; we're just considering, in expectation, on average, in the mean-field sense, what the topic assignments for words of this type in this document are. So, long story short, it's possible to implement a collapsed version of variational inference in latent Dirichlet allocation. It consists of keeping around these sufficient statistics and recomputing those quantities based on these statistics, over and over and over again. You see that this is a bit of a tedious expression, but if you think of how hard it is to write this line in Python, clearly it's not super complicated, right? It's just a bunch of sums and inverses and exponentials; it's not a hard function. You may also realize that it's yet another challenge, one that you may have to face if you're doing the homework this week, to implement this in, for example, Python in an efficient way, because we'll have to compute these leave-one-out statistics and then update, ideally, all of these in a vectorized fashion, rather than with a big for-loop that goes over all the d's and all the i's; okay, certainly we loop over the d's, but over the v's rather than over the individual i's. Nevertheless, it shows what can be done if you persevere in the face of adversity. In machine learning, building a simple solution, one that you can question and look at and play around with, is often not so hard; but if you want to build a really good solution, one that works well, which in this case means one that has a very small gap between the ELBO and the evidence, and one that works fast, that iterates quickly and converges quickly, then you have to be willing to do some street-fighting mathematics, to jump over a few hoops and approximate a few things, sometimes in a bit of a gung-ho fashion, to just do some second-order expansions on some variables that aren't actually Gaussian but are approximated as Gaussian, to get an algorithm that actually works well. No one can expect you to do this right away, off the bat, at this point in your career, but being willing to do this, to really try and get algorithms to work really well, is a great skill set, and it clearly requires all that complicated math that you did in your undergraduate courses and maybe at that point thought was a little bit too hard. That was the first part of this lecture. Now we'll turn our attention to the modeling side, away from the algorithm, at least for a moment, to see if we can make our model, which now has such an elaborate algorithmic side, more expressive, to reflect better what our data is about. That's another part of the job of the human in the machine learning loop. But before that, maybe it's a good time for you to take a quick break, and then we'll continue. So, if you followed along with the homework of the past few weeks and implemented the latent Dirichlet allocation topic model for the State of
the Union dataset, then at this point you should be able to run your code on this dataset, and you get an output that might look like this. This is a presentation of the topic distributions for the documents, the documents being speeches given every year, so we can think of them as a time series. Each color is one topic, and of course they sum to one, because they provide a probability distribution. What we're looking at here is the mean prediction, the average topic distribution under the variational approximate posterior on pi. What you can see is that there is some structure emerging: there's some topic of, you know, the olden days, and there's something that comes up around 1820, which is maybe a bit weird, because there isn't anything particularly massive happening around this time in American history; the American Civil War would be somewhere more around here. And then there is a new topic arriving in the 20th century, which is maybe understandable; there clearly seem to be some kind of spikes around, well, the 1940s, and then something afterwards, and then some new topics arriving in the late 20th century and the 21st century. So, okay, maybe this is somehow structured, but maybe you agree with me that after all the work we've spent on this model, all this complicated implementation, this is a little bit underwhelming. It's not a particularly beautiful structure; there are also a lot of spikes up and down in this model, it's not a smooth kind of development, and arguably that's not what we're looking for. This is a typical situation when you're implementing probabilistic models: you're describing some aspect of the data that you think is particularly prominent and hope that it answers all your questions, but then, as you actually start working with your data, even though the model is kind of working, you also notice that it's not quite good enough to do the job you're aiming for. What we would be hoping for, we never said it out loud, but maybe what we're hoping for, is some kind of topic structure that reveals some, you know, latent structure in history. So when this happens in practice, you go back to your model and wonder where you could improve, where you could add more information. Now, what we're worrying about, the problem here, seems to be that over time, that means across the document corpus, the distribution of topics doesn't reflect the structure that we were expecting a priori. So we would like to add this a priori knowledge, that there's some smoothness in our dataset, into our model. Here is our model again, with the joint distribution once as a graphical model and once as a mathematical expression. The object we are looking to change is on the left-hand side of this graphical model: this is the distribution over the topics, and what we would like to say is that the topic distributions of the documents are not i.i.d., they're not completely independent over the document corpus. So what we'd like to do is to change something about our a priori assumptions about alpha. This is interesting, because so far I haven't even spoken about alpha at all; we've just set it to a constant. So if we want to fix this issue, we have to think about how we would even adapt alpha. Before we do that, let's maybe just note that we can actually do this, because our document corpus has structure. What we have access to is this kind of stuff; this is the dataset that you by now
probably have on your hard drive, if you've done the homework. These are the individual texts of the speeches, and maybe you've looked at them before on your hard drive and seen that they have this labeling: every document comes, if with nothing else, with the name of the president who gave the speech and the year they gave it in. What this is is metadata. It's information about the structure of the corpus that we're currently not using, and we'd like to add it to our model. Now, how would we do this? To keep things simple for a moment, let's first think about how we would adapt alpha in the model that we currently have, and then we're going to think about how we would change the model to explicitly include this kind of information. So how would we adapt alpha in the model that we currently have? Here is our latent Dirichlet allocation model again. If we wanted to adapt alpha, or maybe beta as well, those two hyperparameters that so far I've just kind of dropped off to the side; by the way, I should say something here: I've actually always, when I showed you this model, indexed alpha by the document d and beta by the topic k, meaning that in principle we could accept a varying topic distribution for each document a priori, and a varying prior for the topic-word distributions in every topic, and these alphas could be vectors, right, because the parameter vector of a Dirichlet distribution is a vector; it's just that so far, to keep things simple, I've assumed that alpha and beta are just two scalars. But there's really no need in the model to do this; we could certainly work with varying vectors. But let's assume for a moment that we just use scalars. Even if they were scalars, how would we adapt them, how would we set them? We shouldn't just set them to, you know, 0.1, or one over the number of topics, sorry, one over the number of words or one over the number of topics, in here to get sparsity; maybe we want to adapt them. The natural framework to do that in is maximum likelihood, or maximum a posteriori estimation, and this is another case of a model where we have a structure quite similar to what we encountered in the Gaussian mixture model, and what we did there is that we used EM. The structure we're using is that we have a joint distribution where we can't quite write down the likelihood itself, the thing we care about, that's the probability distribution for the data given only those parameters; instead we have these latent parameters which we can't easily integrate out. So in the Gaussian mixture model we introduced a latent variable z, which allowed us to compute the posterior of that variable, and then we could iterate between constructing some approximation, setting it to the posterior and thereby tightening a bound, and then improving the ELBO as a stand-in for the evidence. And I said in the EM lecture that we can use this idea of variational inference, which we're using here already, also to do hyperparameter adaptation. That's exactly what we can do here; this is a typical situation in which to use EM. Remember that EM is in our toolbox: if you're encountering a model where you have some complicated probabilistic inference inside, and at the end you want to do maximum likelihood on the parameters, then if you have an approximating distribution on the latent parameters, you can use this framework, ideally in particular if you've
actually found a distribution that minimizes, sorry, maximizes the ELBO, and thus minimizes the KL divergence. So how does this work? Just to remind you, this is the thing we want to maximize, and this is the lower bound for it, the ELBO; it's given by this expression. And remember that when we computed our variational approximation, whether in the collapsed or the uncollapsed form, we actually constructed this ELBO; we spoke about the fact that you can compute it, it has a somewhat annoyingly complicated form, but it's possible to compute, and for purely variational inference we don't actually need to compute it, we just have to compute the updates to the variational approximations. But I said back then that it's useful to implement the ELBO anyway, if only to watch it rise, so that you can check that the algorithm is actually converging. And if you want to do EM, we actually need the ELBO, because we're going to use it as a surrogate for the evidence, the likelihood for alpha and beta, which we can't compute in closed form. So once we have the ELBO, we know that there is this thing we want to maximize; actually, maybe we want to maximize the posterior, the log posterior being equal to the log likelihood plus the log prior minus a constant that doesn't depend on alpha and beta, and we then know, because the ELBO is a lower bound on the log likelihood, that, oh, that equation is the wrong way around, let me just fix that, right, the log posterior is lower-bounded by the ELBO plus whatever the prior might be, and we'll talk about the prior in a moment. So if we now want to estimate alpha and beta, what we can do is try to maximize those two expressions, and thereby we won't necessarily maximize the log posterior, but we'll probably get something that is quite close to the optimal choice. Why? Well, one way of thinking about this is that if you've reached the maximum of a lower bound, you're probably close to the maximum of the actual quantity it bounds. Another way of thinking about this is that if you can compute the gradient of the full log posterior, then that gradient is equal to the gradient of the two expressions that we have written down here, plus the gradient of the gap between the ELBO and the evidence, the log likelihood, and that gap is the KL divergence; but if you are at the maximum of the ELBO, then the KL divergence is as small as it can be, close to zero, and therefore its gradient is probably also almost zero. Okay, so this is a framework for optimizing the hyperparameters of such probabilistic models, and we can use it to find good estimates for alpha and beta. All we have to do is take the ELBO, here it is again from a previous lecture, this function that we can compute under the approximation, because we have factorizing variational approximations, Dirichlets on theta and pi and discrete distributions on c, under which this log evidence has a tractable expected value.
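As a compact summary of the estimation scheme just described (writing $\mathcal{L}(q;\alpha,\beta)$ for the ELBO and $W$ for the corpus; a restatement, not a new result):

$$
\log p(\alpha, \beta \mid W) = \log p(W \mid \alpha, \beta) + \log p(\alpha, \beta) - \log p(W)
\;\ge\; \mathcal{L}(q; \alpha, \beta) + \log p(\alpha, \beta) - \log p(W),
$$

where the gap is $\mathrm{KL}\big[q \,\|\, p(\theta, \pi, c \mid W, \alpha, \beta)\big]$; so with $q$ at or near the maximum of the ELBO, maximizing the right-hand side over $\alpha$ and $\beta$ is a good surrogate for maximizing the log posterior.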
Now what's left for us to do is to think about, let me go back, this prior: what do we actually know about the structure of, in particular, alpha? We're basically going to ignore beta for the rest of the lecture, not because it's not possible to tune beta as well, it's just a little bit less interesting; in this particular dataset it's very natural to think about structural prior information on the documents, and maybe not so much on the topics. Of course, you could think of potentially injecting some external information about what the topics should be, maybe there are some words that you want to pull out for individual topics, but that's a little bit dodgy to do, maybe, so I'll just ignore it; we'll focus more on the structure of the corpus. So we now just have to think about a way to get a prior on alpha, and then we know algorithmically what we're going to do to fit the model. Now, what is that prior going to be? Well, it's going to use this metadata, the identity of the president and the year in which they gave their State of the Union address. That metadata basically provides features for each document in the corpus: it provides us with information about each individual document and places it in a latent space that is indexed by time and the identity of the president giving the speech. So what you can see here is really the power of probabilistic modeling: if we want to add some information in a structured fashion, we don't have to rely on some black-box idea, like, I don't know, some random deep neural network that we just dragged out of somewhere; we can really just extend the probabilistic model by adding the variables that we think actually explain the stuff we're interested in. What's left for us to do is to think about how we get these features into alpha, maybe you can think about that for yourself for a moment as well, and then run an optimizer on the EM bound to optimize for alpha. Before we do that, I want to point out an interesting aspect of this from a software development perspective. Let's return to the question of whether, or to which degree, you want to use packaged software solutions. For latent Dirichlet allocation there are implementations available online; one of them that is quite popular is in scikit-learn. I have the code for latent Dirichlet allocation in scikit-learn here on the screen, let me make this a little bit larger; this is, actually I should go up, right, the GitHub repo for scikit-learn, for latent Dirichlet allocation; the variational inference here is actually online latent Dirichlet allocation, because that's the only variant that's implemented there. Let me go back to the actual code, here it is, and if you read the implementation, you'll see that the parameters it takes in are the number of topics that you'd like, the default is 10, which is actually what we are using for our dataset so far as well, and then the document-topic prior, which is known in our code as alpha, and the topic-word prior, which we call beta and which is here called eta. And notice that scikit-learn expects you to provide that prior as a float, as a scalar: there is no way in scikit-learn to add the kind of structure that we are trying to add here. And this is typical of the trade-off you're striking if you're using packaged solutions: if you just want to try out latent Dirichlet allocation on a dataset, it's great that scikit-learn exists, you can just whip it out, you don't have to implement anything, you just call the code and it provides some LDA estimate for you.
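For concreteness, roughly what that out-of-the-box usage looks like (a sketch with a tiny hypothetical corpus standing in for the speeches; note that the two priors can only be scalar floats):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus standing in for the State of the Union speeches.
docs = [
    "the union is strong and the economy is growing",
    "we must defend the nation and its allies",
    "taxes jobs and the federal budget",
]
X = CountVectorizer().fit_transform(docs)      # document-term count matrix

lda = LatentDirichletAllocation(
    n_components=10,        # number of topics (the scikit-learn default)
    doc_topic_prior=0.1,    # alpha in our notation -- only a single float allowed
    topic_word_prior=0.1,   # beta in our notation (called eta here) -- also a float
    random_state=0,
)
doc_topics = lda.fit_transform(X)  # per-document topic proportions
```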
At that point the only two options available to you are that you take this code, which is thankfully open source, and try to hack it so that it does what you want (but then you'll suddenly face all sorts of other issues; for example, this particular implementation is built for streaming, online inference in large corpora of documents, and there it's really not so straightforward to see how you could do what we're doing), or you implement the code completely yourself, as we've done in our homework. So while it's certainly painful to implement algorithms yourself and think about how to make them efficient, how to tune them and how to make them run fast, doing so empowers you to extend the model in any way you want, because you have control over the code. Use toolboxes wisely: do use them to try out ideas, do use them to get inspiration for models you might want to use, but once you really want to solve a probabilistic modeling task, you have to build what I call craftware, highly tuned, well-built machine learning solutions. That is what separates the machine learning expert engineer from, let's say, a data scientist.

So how do we include this structural information about the temporal structure and the author structure of our data set in our model? Notice that what we're doing here is trying to learn a smooth function of topic distributions across some input domain, where we only get to observe information about the value of the topics that has been transformed in a quite complicated fashion. This is a supervised machine learning problem in which we're trying to learn a latent function; it's a variant of regression, a variant of a generalized linear model. So the right prior to use for alpha over the document corpus is a Gaussian process. How do we add a Gaussian process to this model? Here is an updated version of our graphical model. You already know the entire right-hand side; nothing has changed there. (This, by the way, is a strength of this framework: we just add the extra bits of information that we want to include.) We assume that every document now really and truly has a different prior, and that prior is defined by the parameters of a Dirichlet distribution, a vector of length equal to the number of topics for each document. That vector is informed by the features, the metadata: the identity of the president giving the speech and the year. And we assume that these alpha vectors change smoothly across time, and maybe in a discrete fashion from one president to the next; we encode that with a Gaussian process prior. There are two little challenges here. The first is that a Gaussian process prior requires a kernel; we'll talk about that kernel in a moment, because it is the powerful lever we can use to encode all the information we have. The second, minor complication is that the alpha parameter vector of a Dirichlet distribution has to be a valid vector of (pseudo-)counts, so it has to be positive, it can't be negative. We can easily achieve that by adding a link function between some latent function, let's call it f for lack of a better character, and the actual alpha_d. One simple link function we can use here, because we only need positivity, is the exponential function.
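Just to make the link function concrete before we write down the full generative model, here is a tiny sketch with made-up numbers: exponentiating an unconstrained latent vector yields a valid Dirichlet parameter vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical latent function value f_d for one document, one entry per
# topic; it is unconstrained and may well be negative ...
f_d = np.array([-1.2, 0.3, 2.0, -0.5])

# ... but after the exponential link it is strictly positive, so it can act as
# the parameter vector alpha_d of a Dirichlet distribution.
alpha_d = np.exp(f_d)
pi_d = rng.dirichlet(alpha_d)   # valid draw of topic proportions for document d
print(alpha_d, pi_d)
```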
So I'll update our generative model by saying that this document collection, the words w in this corpus, were created by first drawing a latent function f that represents the topic distributions across time and history, then evaluating that function at the individual locations in time and for the individual presidents who are the speakers, taking the exponential of that function so we get something non-negative, and using that as the parameter vector of a Dirichlet distribution. After that, everything is exactly as before: we draw pi_d from the Dirichlet distribution, we draw the topic word distributions from another Dirichlet prior as before, then draw for every word in every document the topic assignment, and draw the word from the corresponding word distribution of that topic.

The beauty of this approach is that all we have changed is really just the prior on alpha, or rather on log alpha, which we call f. That means the change to our algorithm has nothing to do with the variational bound; we can keep using the code we've written for the variational bound. Well, okay, one minor complication: we do need access to an actual alpha and an actual pi, so we can't use the collapsed variational bound on pi. We can collapse out theta (if you want, you can think for yourself about how that would work, collapsing out theta but not pi), but we do need pi explicitly, because alpha has to enter our variational approximation in a form in which we can explicitly talk about optimizing its parameters. Other than that, we could even reuse the old code for the uncollapsed variational bound, and the only thing we now need to change is the prior for the parameters alpha and beta. For beta we're not going to use a prior; we'll keep using what we used before. And so, as I said before, we can then do EM, where we approximate the log posterior for alpha and beta by replacing the likelihood with the elbow and adding the prior, and the prior for us here is going to be a Gaussian process on log alpha. The logarithm of that Gaussian process prior, up to constants that don't matter for the optimization, is given by minus one half times a quadratic form in f, where f is log alpha: it is -1/2 f^T K^(-1) f, with K the kernel Gram matrix. If we add such a term, which has the obvious gradient -K^(-1) f with respect to f, then our optimization algorithm is going to make sure that the alpha parameters change smoothly across the corpus.
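Here is a small sketch of that extra term and its gradient, assuming a zero-mean GP and a hypothetical three-document Gram matrix; during the optimization this would simply be added to the elbow and to its gradient when optimizing f = log alpha:

```python
import numpy as np

def gp_log_prior(f, K):
    """Zero-mean GP log-prior up to additive constants: -0.5 * f^T K^{-1} f."""
    # Solve K v = f rather than forming K^{-1} explicitly (cheaper, more stable).
    v = np.linalg.solve(K, f)
    return -0.5 * f @ v

def gp_log_prior_grad(f, K):
    """Gradient of the log-prior with respect to f: -K^{-1} f."""
    return -np.linalg.solve(K, f)

# Toy check: a hypothetical 3x3 kernel Gram matrix over three documents and
# f = log(alpha) for one topic.
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
f = np.log(np.array([0.8, 1.3, 0.9]))
print(gp_log_prior(f, K), gp_log_prior_grad(f, K))
```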
Now what's left to do is to decide which kernel to use. Thankfully, we've spent so much time with Gaussian processes earlier in the course that we can quickly think about what kind of kernel to choose; let me just show it to you. One option is a kernel that enforces smoothness across time but also allows for distinct changes of topic from one president to the next. I do this by choosing a particular kernel across years, so x here is the year assigned to each document. This is a rational quadratic kernel, a variant of the squared exponential kernel; we could also use a Gaussian kernel, the squared exponential slash RBF kernel, whichever name you want to assign to it, but just for good measure, to have something again slightly different, I'm using a rational quadratic kernel. It is a variant of the Gaussian in the sense that it's a scale mixture over Gaussian kernels, so the curves that come out are smooth, but they have length scales that vary a little bit. And I multiply it by a discrete kernel that says the topics of two speeches are only related by this kernel, so they are only smooth as defined by this kernel, if the president has not changed from one speech to the next; if two speeches do not share a president, then they have a smaller covariance, so there is the opportunity for small shifts in topic from one president to the next, although we still assume that there is an overarching arc of history that forces certain topics into the hands and mouths of presidents.

How do I choose the parameters of this kernel? Since we're so high up in the hierarchy of the model, I really just set things to fixed values: theta, the output scale, to 5; the length scale to 10 years; the smoothness parameter to 0.5, which is just a typical value (infinity would mean we get back to the Gaussian kernel, while something very small gives a little bit of non-smoothness); and gamma to 0.9, which means that even if the president changes we still expect some smoothness. I've created samples from this distribution, as you can see here. These are draws of pi, of document topic distributions, from this whole generative process: taking a Gaussian process sample, taking the exponential, and drawing a topic distribution from the corresponding Dirichlet. You can look at those samples and see that they maybe correspond to the kind of structure we're looking for: at every individual point in time there is a certain sparseness, so one topic dominates, and there is also a smooth change of the topics across time, not an extreme up and down. Again, this is a strength of the probabilistic approach: we can generate these samples and use them to guide our intuition for how the model should be chosen.
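Here is a sketch of how such a composite kernel and prior samples could be constructed; the speech years, the president names and the one-independent-GP-per-topic simplification are assumptions made purely for this illustration:

```python
import numpy as np

def rational_quadratic(x1, x2, theta=5.0, ell=10.0, smoothness=0.5):
    """Rational quadratic kernel over years: a scale mixture of Gaussian
    kernels; as the smoothness parameter goes to infinity it recovers the
    squared exponential / RBF kernel."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return theta**2 * (1.0 + d2 / (2.0 * smoothness * ell**2)) ** (-smoothness)

def president_kernel(p1, p2, gamma=0.9):
    """Discrete kernel on the speaker: full covariance within one presidency,
    reduced covariance gamma when the president has changed."""
    same = p1[:, None] == p2[None, :]
    return np.where(same, 1.0, gamma)

# Hypothetical metadata for a handful of speeches: year and president.
years = np.array([1897.0, 1898.0, 1901.0, 1902.0])
presidents = np.array(["McKinley", "McKinley", "Roosevelt", "Roosevelt"])

K = rational_quadratic(years, years) * president_kernel(presidents, presidents)
K += 1e-6 * np.eye(len(years))          # jitter for numerical stability

rng = np.random.default_rng(0)
n_topics = 10
# One independent GP sample per topic across the documents, exponentiated to
# give positive Dirichlet parameters, then one draw of pi per document.
F = rng.multivariate_normal(np.zeros(len(years)), K, size=n_topics)  # (topics, docs)
alpha = np.exp(F).T                                                  # (docs, topics)
pi = np.vstack([rng.dirichlet(a) for a in alpha])  # smoothly varying topic proportions
```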
If you then implement this entire algorithm and run it on the corpus of State of the Union addresses, you might get out, and I can finally show you a result, a plot like this. This plot actually isn't created by the particular code you've been developing; it's from a paper that I myself wrote, by now nearly 10 years ago, together with David Stern, Ralf Herbrich and Thore Graepel, on these kinds of smoothness assumptions in topic models. Back then we used a slightly different way of enforcing smoothness, with a somewhat more elaborate algorithm, but the differences don't really matter for the purposes of this exercise; you might well get a very similar plot. Here we've printed the three most prominent words of each topic, prominent meaning that their probability is particularly elevated in this topic relative to all other topics across time. What you can see (well, clearly we've spent a little bit of time making this plot look nice) is that there are individual topics emerging and then vanishing again across time in a semi-smooth fashion, represented by words that maybe do have a historical sense. For example, there is a pretty prominent topic that includes "war" and shows up around the First and the Second World War. There is a topic of worries about work and labor and good work that shows up in between the wars, during the Great Depression. There is post-war talk about world peace. There is a big topic emerging in the 20th century about the American people, which maybe is actually historically relevant, because before the 20th century perhaps there wasn't even a concept of the American people as a separate people. And there is an early big topic about the development of America as an early, emerging country.

A somewhat surprising topic, this blue thing up here, is a special topic that very prominently includes the words "war" and "Spain". Well, if you know American history to a degree that I have to admit I don't, then you might know that the spike here, around the late 1800s, around 1897, co-occurs with the Spanish-American War, an involvement of the US with Cuba during its war of independence. Here's a sentence from William McKinley's State of the Union address in 1897 that clearly talks about this topic very prominently and brings the words Spain, Cuba and war to the forefront. Another maybe interesting bit is this bright green topic up here, which (maybe you can't read it) says energy and oil: a topic of sustainability, oil and energy crises that emerges only in the 20th century and has a spike in the late 1970s, co-occurring with the 1979 oil crisis. Here's the corresponding sentence from Jimmy Carter's State of the Union address in 1980; you can read it for yourself, but there are statements about oil supplies from the Middle East, and you can imagine that this is a special topic that showed up during that time.

So by taking a model that we could basically have taken off the shelf, one that is even available in a standard toolbox, but implementing it efficiently ourselves and then changing the way we optimize its hyperparameters to enforce smoothness and to explicitly encode prior information we have about our corpus, we can extract information that arguably has some interesting structure. It's not a causal model, for sure, but it's a way of looking at the data that can raise awareness of structure in this corpus. And if you didn't know as much about the topics of these documents as we do here, because they are about world history, we could perhaps use a model like this to find interesting structure in corpora that discuss works that are a little more complicated, like academic texts, for example. Now at this point you could stop and say this is a nice model, or you could ask yourself whether there is more to do, whether we could further improve, and of course you will always find further opportunities to refine a model. For example, if you actually look at the documents, you'll notice that they are typically structured in paragraphs. We could use those paragraphs to separate the corpus into, let's call them, documentlets or doclets, little sub-documents, which tend to have a topic structure such that every sub-document is often just about one single topic, while together they still share an overarching topic structure, because you might talk about one topic and then come back to it over the course of one document. Maybe you can think for yourself how you would include such a structure in the model.
As a hint, let me just tell you that with this approach we already have all the right tools available; the only thing we would need to change is the kernel, and you can think for yourself how you would do that. I'm not going to pursue that further; instead, I'm going to use this to end the presentation, and with it this part of the course.

Over the past few lectures, using the topic model as an example, we've studied what it means to build genuine solutions to machine learning problems from the probabilistic perspective. Doing so involves, like all machine learning, three ingredients: a data set, a model, and an algorithm. The data comes from outside; it enters your world as the programmer, and your first task is to try to get as much information as possible about it from whoever is the source of the data. In particular, this means collecting as much meta information as possible, which will then inform your prior. For the model, as you build it, you slowly shift your attention from trying to describe what you know about the data as precisely as possible to increasingly also trying to find a representation of what you're trying to achieve that is mathematically and computationally convenient. A few general guidelines for this process: try to write down a generative model in terms of a graphical model, already thinking about conditional independence structure, and use exponential family distributions and conjugate pairs of prior and likelihood as much as possible, in the hope that they will simplify your life on the computational side. Then, as a final step, building the algorithm is a process that is highly focused on computational efficiency, for which you now have a whole tool set available at the end of this course: from basic generic algorithms like Markov chain Monte Carlo, which work for basically any problem but can be computationally quite intensive, yet are also relatively easy to implement; through simple but potentially dangerous approaches like maximum likelihood and Laplace approximations; to elaborate tools like variational inference and EM, which require quite a lot of derivations, a lot of writing math on pieces of paper and then implementing really non-trivial algorithms, but can provide solutions that are very robust, very performant and computationally efficient.

With this we've reached the end of this section of the course, and basically the end of the entire lecture course. There will be one more lecture before the revision, about what you do when you actually have to take decisions based on probabilistic estimates. But for now, I hope you've taken away from this process the impression that while building machine learning solutions using the probabilistic framework can be challenging and requires a lot of thought from the human, it is also a very powerful process that allows us, or specifically you, if you're well trained in it, to build customized, highly performant solutions that can at least be properly understood, much in contrast to some black-box models that arise from other parts of machine learning. For today, I thank you very much for your attention, and I'll see you in the next lecture.