Hello and welcome to probabilistic machine learning, lecture number 26, and therefore the final lecture of this course. This is going to be a different lecture from the ones you've seen before: not a lecture that brings new content, but one that revises the content we've covered over this course. It might seem a bit wasteful to spend one out of 26 lectures revising material — I've also spent parts of all the preceding lectures revising material — but one out of 26, that's about 4%, is maybe an amount of time worth investing to gain an overview of all the stuff we've done, rather than drilling down further and doing even more math. The last 26 lectures, which you see on this slide, have tried to span an entire toolbox of knowledge that hopefully empowers you to build learning algorithms based on the notion of probability theory — algorithms that hopefully also work for you in realistic practical settings. We began this course with foundations in the form of probability theory. These foundations take up a significant part of the slide I'm using to show our entire toolbox, which I've now shown several times, and that's not by accident: in probabilistic machine learning, the theoretical foundations, even though they are surprisingly simple and can be summarized in two or three equations, form something like a rule book — a mechanism that cannot really be questioned and that fundamentally explains how to perform inference on latent quantities from observations of data. We began this process in lecture number one, essentially by writing down the axioms of probability theory. In the interest of time we're not going to repeat the entire formal definition, but we quickly saw that they boil down to essentially two fundamental rules: the sum rule and the product rule. The sum rule tells us how to get rid of variables in our reasoning process that we don't want to make statements about. If you think that the process you're interested in depends on various quantities which you don't know, then you can get rid of those quantities when making statements about others by summing over all possible values they could take, multiplied with the probability for them to take each value. The second rule is the product rule, which tells us how to construct statements about one variable we don't know, given observations of another variable, by defining the notion of a conditional probability distribution P(A | B). A more or less direct corollary of these two statements together is the theorem of Thomas Bayes — well, named after Thomas Bayes, even though he didn't actually introduce it — which makes a connection between the posterior distribution, what you know about a quantity given that you have observed data, and a generative, explanatory process for this data. The posterior is given by the prior probability for this value of the variable, times the likelihood, divided by the evidence. Since we've now done 25 lectures on probabilistic machine learning, this connection should be very natural to you, and I don't have to remind you of its fundamental structure: the numerator in Bayes' theorem provides the probability for one single explanation, one hypothesis for the data, and the denominator normalizes, standardizes this probability by comparing this one explanation to all possible explanations for the data.
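For reference, here are the two rules and their corollary in symbols — a minimal formal recap of exactly what was just said, with A and B standing for generic propositions:

```latex
% Sum rule: remove a nuisance variable by summing over its possible values
P(A) = \sum_{B} P(A, B)
% Product rule: defines the conditional distribution P(A | B)
P(A, B) = P(A \mid B)\, P(B)
% Corollary -- Bayes' theorem: posterior = prior x likelihood / evidence
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{\sum_{A'} P(B \mid A')\, P(A')}
```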
What's maybe by now more obvious is that this kind of structure up here provides a generative way of writing down your beliefs about how the data were generated. And when I say generated, what I mean conceptually is that the role of the prior, which is often criticized in philosophical debates about Bayesian inference, is actually maybe not as crucial as that of the likelihood function. Both of these quantities can and have to be questioned by you, the designer of the algorithm and the model, to build a reliable algorithm. If you worry about the role of the prior in Bayesian inference, then you should be even more worried about the role of the likelihood; it's just that the likelihood is so fundamental that no one dares question it, even though it's often the more prominent way in which assumptions enter the thought process. We saw already in lecture one that this process of distributing truth across a space of hypotheses allows us to extend propositional logic to what you might call plausible reasoning: the process of making statements about unknown quantities in terms of confidence, rather than in terms of commitments to individual concrete logical statements that have to be either true or false. And this is the real core idea of probabilistic modeling: instead of committing to an individual statement, we distribute truth over a whole space of hypotheses, keep track of all possible hypotheses, and weigh them relative to each other according to their probability under the data. This is also the key difference to the statistical formulation of machine learning. If you want to make statements about quantities that are not directly known — and that's maybe the entire idea of machine learning, formally speaking: to empower computers to work beyond Boolean statements that are either true or false, to make statements about quantities that aren't perfectly identified by the data — then you have basically two options. You can follow the statistical approach, which is to say: I'm still going to commit to one statement, I'm just going to make one particular prediction, and then I have to analyze that prediction and make statements about why and in which sense it is a good one. Does it converge efficiently? Is it guaranteed to eventually converge to the true value as the number of data points increases, and at which rate does it do that?
Or you follow the probabilistic framework, in which we never actually commit to a unique statement; instead we distribute truth over the entire space of hypotheses and then refine this weighting in the light of data, so that we become more and more confident, but never infinitely confident. Philosophically speaking, I personally find this latter approach more pleasing, and it seems more powerful: in particular, it gives access to a notion of uncertainty — to the width of the distribution, the amount of explanations that are left, and how concentrated truth is across that space of explanations. However, there's also a significant price we pay for this notion of uncertainty, and that is the fact that keeping track of an entire space of hypotheses is exponentially more expensive than keeping track of a single hypothesis. In lecture number two I pointed this out with a simple example: say we have 26 binary variables which are either true or false. Committing to one statement about their truth values requires us to store 26 bits. But if we want to keep track of a probability distribution over these 26 variables, then we have to consider all 2^26, which is something like 67 million, possible realizations of these hypotheses. That's a combinatorially large space, and so probabilistic inference is fundamentally a very challenging computational task. That's why the largest part of building probabilistic machine learning models and algorithms consists of using tricks, both on the modeling side and on the algorithmic side, to reach tractable algorithms — something you can actually implement on a computer in polynomial time. We saw in lecture two that one of the key ideas one can use to this end is that of conditional independence, which essentially says that certain parts of the reasoning process separate from each other, either under the prior (that is, conditioned on nothing) or when conditioned on particular parts of the data. An example I did back then is the famous example by Judea Pearl, about being informed of an alarm ringing at home and then later finding out that there was also an earthquake. We saw that this reasoning process, which involves four binary variables, can be naively encoded by using the product rule to write down the joint distribution in terms of a bunch of factors which are trivially true under the product rule; that requires us to use 15 parameters. But when we use relevant domain knowledge — when we use the fact that the alarm ringing has nothing to do with a radio announcement, or that radio announcements are independent of burglaries happening — then this joint probability distribution over the four binary variables can be encoded in only eight numbers that we have to store. Of course, the difference between 15 and 8 is not much, but we're only talking about four binary variables here, and later on we saw much more complicated situations. Since this concept is so powerful and important, we then encountered a graphical language to represent — or maybe not so much to represent as to help us think about — this kind of conditional independence structure: the directed graphical model, which we began to add to our toolbox. This was the beginning of the process of stocking up our toolbox.
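To spell out the arithmetic of the alarm example (the variable names B for burglary, E for earthquake, A for alarm, and R for radio announcement are my shorthand here):

```latex
% Full joint over four binary variables: 2^4 - 1 = 15 free parameters.
% With the conditional independence assumptions of the example:
P(B, E, A, R) = P(B)\, P(E)\, P(A \mid B, E)\, P(R \mid E)
% Parameter count: 1 + 1 + 4 + 2 = 8 numbers to store, instead of 15.
```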
With that toolbox, we can build a powerful set of tools that we can walk around with and do probabilistic machine learning. How do directed graphical models work? They are actually a relatively simple visual aid for writing down joint probability distributions when you have access to the generative model, to the conditional distributions that make up the joint. You just write down a circle for every variable that shows up in your reasoning process, you blacken the ones that are observed, and you draw arrows — directed connections in this graph from one node to the other — by considering the individual terms in the factorization and drawing an arrow from each variable on the right-hand side of a conditional distribution to the variable on its left-hand side. These kinds of graphs are in some sense a universal language, in the sense that every joint probability distribution can be written down as such a directed acyclic graph. However, this fact in itself is not all that useful, because even a joint distribution that doesn't factorize — which has no further simplifying structure — can be written as such a directed graph. That doesn't really help us: in that situation the graph carries almost no meaning, because under the product rule you can rearrange these terms as you like. The graph only becomes interesting when we have access to non-trivial conditional independence structure, because then the graph becomes non-dense, and we can try to read off conditional independence structure from it. We can do this using — I have to go forward a little bit here — the notion of d-separation, which we already encountered in the atomic structures that arise from graphs with just three variables. It is encoded by a relatively complicated set of rules (which I'm not going to read out again) that can also be thought of as encoding the notion of a Markov blanket, which we can reason about when we think about conditional independence. However, we already saw back then that, while this is of course a useful property to have, directed graphs are a limited language to some degree: this representation of directed edges between variables cannot necessarily encode every conditional independence structure of a particular joint probability distribution in one single graph. An example we did back then — one originally constructed by Stefan Harmeling, which I forgot to say at the time, so I should really say it now; it's in the top corner up here — is that of two coins that are thrown, and when they show the same face, a bell is rung. We saw that this extremely simple generative process, measuring parity, has conditional independence structure represented by three different factorizations, each of which corresponds to a different directed graph — and none of these graphs encodes all three of these conditional independence structures; every single one of them only encodes a subset. So what this means is that for directed graphs there are probability distributions where one single graph can only encode a certain amount of the conditional independence structure. In fact, this became apparent also in the final example that we did over the past few weeks, the latent Dirichlet allocation topic model, where certain kinds of conditional independence structure could be read off the graph, but others, which were arguably at least as important, were impossible to read off from the simple graph.
Is it possible to solve this problem in a simple way, by increasing the expressivity of this formal language of graphical models? Well, we briefly touched on the idea of undirected graphs, which are another tool to represent joint probability distributions, using only undirected edges. We saw that this notion, which is historically actually older than directed graphs, has its own strengths. One strength in particular is that you can read off conditional independence more or less directly from the graph, simply by checking whether conditioning on a certain set of variables blocks all possible paths from one set of variables to another; if that's the case, then these two sets of variables become conditionally independent. However, we also saw that this visual representation is in many ways less powerful than directed graphs. In particular, because these graphs do not have directions on their edges, they do not encode which variable is on the left-hand side of a conditional probability distribution and which is on the right. That's important, because conditional probability distributions are only probability distributions over their left-hand side, not over their right-hand side. And that means that if someone just gives you such an undirected graph, then you cannot read off the joint probability distribution including its normalization. You can only read off that the joint has to have a factorization structure that can be read off from the graph — basically by finding all the maximal cliques of the graph, which is actually not an entirely trivial process, but a possible one, and writing the corresponding factors down together. But because we don't know what structure these factors have — in particular, we don't know whether the variables that play a role in a factor show up on the right- or the left-hand side of a conditional probability distribution — we don't know, in general, what the normalization constant of this joint probability distribution is. Now, I've gotten ahead of ourselves a little bit in the flow of this lecture course; this is material that we did in the second half of the course. It fits in really well here, though, because the key value of this visual representation of distributions, at least for the purposes of this course, is to identify conditional independence and to have a visual way of writing down generative models, such that we can think about them more easily on a whiteboard. There are also meaningful use cases for graphical models in a more formal, automated way of reasoning about probability distributions, which we did not touch upon all that much in the course; they are connected to the idea of probabilistic programming. With this look ahead, though, let's go back almost to the beginning of the lecture course, to lecture number three, essentially, about continuous probability distributions. Back then we realized that all of the derivations we had done so far were only meaningful for random variables which are discrete, and that we have to be a little bit more careful when we work with continuous random variables or continuous latent variables in our reasoning process. Actually, it turns out that this is one of those things where, yes, one has to be careful once to make sure all the definitions work, but once we've done that, things become quite natural.
The careful bit we had to be specific about is that the notion of a sigma-algebra, although possible to define on continuous domains in a general way through the power set, is a bit dangerous to define that way, and that it makes sense to define a sigma-algebra with a meaningful interpretation in terms of volumes spanned by sets. It turns out that one quite natural way to do that is to borrow the idea of a topology, a natural way of measuring volume on a continuous domain. This is connected to the idea of the Borel sigma-algebra, the sigma-algebra induced by the standard topology on a hypothesis space. In particular, for the one space we always care about, the real vector space, the Borel sigma-algebra is the one induced by the natural Euclidean notion of similarity, closeness, and volume in the multivariate vector space. We saw that not all, but many, many probability distributions have an associated object called the probability density function, which is a very natural, tractable way to represent how truth is distributed across the hypothesis space. When such PDFs exist, they are a very natural object to work with, mostly because the rules of probabilistic reasoning — the product rule and the sum rule, and therefore also Bayes' theorem — transfer to these probability density functions. That's not true for their cumulative versions, the cumulative distribution functions, but it is true for PDFs, and that's the reason why we used probability density functions as the natural object of interest for the entire rest of the course. There's only one caveat that one really has to keep in mind when working with PDFs over continuous domains, and that's the fact that the dx at the end of the integral — the base measure against which we're integrating — is relative to the definition of the probability density function. We have to say how we measure volume in a space to get the PDF. If we want to change the way we measure volume, if we want to do a transformation of random variables from one into the other, then we have to use a transformation rule. It's actually quite a natural one, but it's important not to forget about it, and it's maybe a little bit tedious to write down in general: multiply with the Jacobian of the inverse transform to get from one probability density function to the other when we transform from one continuous variable to another. This is essentially all the machinery, all the theory, we need to do probabilistic reasoning. So yes, this took the first 20 minutes of this lecture, because it's such a fundamental mechanism we use when building probabilistic machine learning models, but it's also only the beginning of the process of actually writing something that can be used in practice. It's the foundation, the mechanism we're going to use not just for this course but forever. But when we actually want to do this on a computer, we have to find ways of performing the associated necessary computations — marginalization, conditioning, computing expected values of random variables under distributions, moments of probability distributions, and so on — in a more or less general way.
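As a quick reminder, the transformation rule just mentioned, in symbols — a standard statement, written here for an invertible map y = g(x):

```latex
p_y(y) = p_x\!\bigl(g^{-1}(y)\bigr)\,\bigl|\det J_{g^{-1}}(y)\bigr|,
\qquad
J_{g^{-1}}(y) = \frac{\partial g^{-1}(y)}{\partial y}
```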
One very general way, which we already encountered very early on, in lecture number 3, is that of Monte Carlo methods: algorithms that perform integration, the core operation of probabilistic reasoning, by replacing integrals with finite sums, where the elements of the sum are evaluated at locations that are randomly drawn from a probability distribution. So a Monte Carlo method — we'll get to Markov in a second — is an algorithm that computes such an integral, which is the elementary computational challenge in working with probabilistic models. It replaces an integral of a function against a probability distribution, so f(x) p(x) dx, which is also sometimes written as f(x) dp(x), with a sum over evaluations of the function at locations x_i, where the x_i are drawn from the probability distribution. We saw that this is actually a great idea: the sum is evaluated at random locations, so it is itself a random number, and that random number has, in some sense, good statistical properties. In particular, it's an unbiased estimator, which means its expected value is equal to the quantity we're trying to compute, and its variance drops with the number of samples. It actually drops at a rate that can be shown to be the optimal rate for unbiased estimators, although it can also be seen as a not particularly exciting rate: the variance drops like one over N, and since the variance is the expected squared error, the expected error drops like one over the square root of the number of samples N. Fine, though: that gives us a general way to compute integrals against probability distributions. The only thing we need to be able to do is to draw random numbers from a probability distribution. In lecture number three we initially saw that this is possible, if you have access to uniformly distributed or otherwise standard-distributed random variables, by transforming their distributions in some way; but this transformation is not always straightforward for general probability distributions — it's only easy, or even possible, for some basic distributions. So we have to find more powerful algorithms to perform this drawing process. A first issue we had to solve was to come up with algorithms that work even if you only have access to an unnormalized probability distribution, so if you don't know what the normalization constant is. That was actually not all that hard. It turned out to be much harder, though, to draw general random numbers, even from an unnormalized probability distribution, if that distribution has no simple formula — in particular, if it has more than just a handful of dimensions. And I just want to mention again, as we move past this topic, that we also noticed in passing that even the entire philosophical idea of randomness is a little bit ill-defined: one can debate whether randomness even exists, in particular whether it exists in this kind of computational setting, because of the way we usually generate random numbers on a computer. However, if we cast those philosophical doubts aside, then we're still faced with these computational questions of how to even draw random numbers. We know that once we have random numbers we can use them to compute Monte Carlo estimates, but where do we get our random numbers from? Well, we encountered a really powerful notion for this: the idea of Markov chain Monte Carlo methods.
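Before recapping MCMC, here is the plain Monte Carlo estimator just described, as a minimal sketch — the function names are mine, and the example deliberately uses a distribution where we know the true answer:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    """Unbiased Monte Carlo estimate of E[f(x)] under p, given i.i.d. draws from p."""
    x = sampler(n)        # n samples from p
    return np.mean(f(x))  # variance decays like 1/n, expected error like 1/sqrt(n)

# Illustration: E[x^2] under N(0, 1) is exactly 1.
for n in (10, 1000, 100000):
    print(n, mc_estimate(lambda x: x**2, rng.standard_normal, n))
```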
Markov chain Monte Carlo methods are algorithms that do not produce independent samples, but instead perform a random walk of some form across the input space of our probability distribution p. We design that random walk such that, asymptotically, when we take the entire sequence and scramble it at random, we get back asymptotically independent draws from the probability distribution. The basic idea for such a mechanism is represented by the Metropolis-Hastings algorithm, which probably wasn't invented by Metropolis or Hastings — and I said this wrong back then: it was actually the married couple Rosenbluth, rather than the Rosenbluth brothers as I claimed, which is obviously wrong; they were husband and wife, as I've been informed by some of you. But more importantly, how does this algorithm actually work? Well, it uses a proposal distribution, a local way of creating random numbers, to explore the space: it checks whether it wants to take a certain step, and then, with a certain probability, either goes there or stays at the current point and adds one more copy of it to the set of samples. This basic, simple form of Metropolis-Hastings turns out to be, at least if you implement it naively, particularly inefficient, because it creates random-walk behavior. Remember this picture which I showed back then: black dots are i.i.d. samples, red dots are the sequence of Metropolis-Hastings steps, and you can see that this process takes a very long time to mix across the entire space. To fix this issue, we encountered various advanced variants of Metropolis-Hastings, which can be motivated from the Metropolis-Hastings framework but use other tricks to speed up mixing. One key variant is the idea of Gibbs sampling, which is axis-aligned exact sampling: an algorithm we can use for probability distributions whose conditionals along the axes of the problem are available in closed form. We actually used this algorithm later in the course for our latent Dirichlet allocation model. Another one, which is arguably even more elegant and powerful, is that of Hamiltonian Monte Carlo: a general algorithm that requires access to gradients — an access we usually have these days, at least for continuously differentiable probability distributions — to build a dynamical system that models the behavior of a physical object moving with mass and momentum through the space we're trying to sample from, and uses this idea to reduce the random-walk behavior to some degree. These were the first algorithms we encountered, in lectures 4 and 5 already, and I introduced them back then to provide us with a first general class of methods we could use in the exercises, and across the course, as a tool to solve integrals. But Monte Carlo methods are not a fast way to solve integrals; in fact, they are maybe the slowest possible way to solve integrals, because of this one-over-square-root-of-N convergence rate. So they are really maybe the last resort — or maybe the very first thing you should try, before you find a faster, better algorithm. Another thing you could try, and maybe that's the very first thing you should try, is to find probability distributions in which you don't even need to sample, because you can compute the required integrals in closed form.
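Here is a minimal sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal, assuming nothing more than an unnormalized log-density — all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_hastings(log_p, x0, n_steps, step=0.5):
    """Random-walk Metropolis: needs only an unnormalized log-density log_p."""
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    samples = []
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal(x.shape)  # symmetric proposal
        lp_prop = log_p(prop)
        # Accept with probability min(1, p(prop)/p(x)); the unknown
        # normalization constant cancels in this ratio.
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples.append(x.copy())  # on rejection, the old point is repeated
    return np.array(samples)

# Illustration: sample a 2d Gaussian known only up to its normalization.
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), np.zeros(2), 5000)
print(chain.mean(axis=0))  # close to zero, but the steps are correlated
```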
The most important class of such closed-form models, which massively helped us populate our toolbox with new tools, is that of Gaussian distributions. Gaussian probability distributions are an extremely powerful tool, a sharp knife in our toolbox, because they map the elementary operations of probabilistic inference onto linear algebra. Because this is so important, I'll slow down here again a little bit and point out these properties once more. Gaussians are exponentials of quadratic functions, and because the sum of quadratic functions is another quadratic function, a cut through a quadratic function is a quadratic function, and a projection of a quadratic function is a quadratic function, Gaussians inherit these wonderful properties. That means, first, that the product of two Gaussian probability density functions is another Gaussian probability density function — note that this does not mean that the product of two Gaussian random variables is a Gaussian random variable. This holds because the sum of two quadratics is another quadratic. Second, linear projections of Gaussians are Gaussians: if you have a Gaussian random variable and you're interested in any linear projection of it, where A is some linear operator, then the projection still has a Gaussian distribution, whose parameters, mean and covariance, are very easy to construct by simple linear algebra operations. In particular, as a special case of this property, marginals of Gaussian distributions are also Gaussian distributions. If you have a big, complicated model over many variables — or even, as we saw a few lectures later, infinitely many random variables — then you can compute the marginal distribution. Remember that this is, in general, an extremely complicated operation involving an integral over all these latent quantities; here, that integral boils down to an extremely trivial statement: you just pick out the corresponding elements of the mean and the covariance matrix, and that's it. This is a wonderfully powerful part of Gaussian models, because it basically removes the most expensive, complicated part of probabilistic inference. However, it's maybe also a weakness of these models, because it means that if you have a joint distribution over infinitely many variables and you only care about a few of them, then what you end up doing is throwing away all those infinitely many other variables, because they do not affect your belief over the variables you care about. The fourth and final great property of Gaussian distributions, which arguably is also connected to the fact that linear projections of Gaussians are Gaussians, is that the conditional of a Gaussian distribution is another Gaussian distribution. So if you have a set of jointly Gaussian distributed variables, and you know the value of one of them, and you want to know what that tells you about the other variables, then computing this — which is arguably a posterior distribution — is easy, in the sense that this distribution is itself a Gaussian distribution, and its parameters, mean and covariance, can be computed using linear algebra operations. This involves solving linear problems, in here and in here. You can also think of this as inverting matrices, even though we saw in the flipped classroom that that's not quite the same thing. And of course these operations are not entirely trivial — they are arguably non-linear operations, because they involve inverses of matrices — so they are not linearly cheap, but they are feasible in polynomial time.
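To collect these closure properties in symbols — standard results, stated here for a joint Gaussian over two blocks a and b:

```latex
% Joint:
p(a, b) = \mathcal{N}\!\left(
  \begin{pmatrix} a \\ b \end{pmatrix};
  \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix},
  \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}
\right)
% Linear projection:  x ~ N(mu, Sigma)  =>  A x ~ N(A mu, A Sigma A^T)
% Marginal: just pick out the blocks
p(a) = \mathcal{N}(a;\ \mu_a,\ \Sigma_{aa})
% Conditional: linear algebra (solves against Sigma_bb)
p(a \mid b) = \mathcal{N}\!\bigl(a;\
  \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b),\
  \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\bigr)
```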
And of course one can think about simplifications of this process, approximations that further reduce the computational cost. In particular — because this is an instance of the sum rule and this is an instance of the product rule — the corollary of sum and product rule, Bayes' theorem, also becomes linear algebra when we use Gaussian distributions over variables that are linearly related to each other. Down here is the form of the posterior distribution in all its most general possible glory. If you have a Gaussian prior over a random variable x, and make affine Gaussian observations of it — meaning the observations y we get are, in general, linear projections of this variable plus a shift, with Gaussian noise around them — then the posterior distribution over any affine projection of x (so over x itself, and more generally over all linear transformations and shifts of x) is still Gaussian, with a mean and a covariance parameter that, although they look complicated in this most general expression, can be computed using only linear algebra operations. This makes Gaussians the elementary object of probabilistic inference, and it might explain why we spent quite some time in this course using Gaussian models to build powerful tools for machine learning. Arguably the most powerful use of this Gaussian framework is for the task of learning functions: functions that map from an input space to an output space, where we observe pairs of inputs and outputs. That's the most basic form of supervised machine learning, if the output space is real-valued. Because this process of regression, of learning functions, is so important, we spent quite some time in the course discussing it from various different angles — theoretical, conceptual, and practical — also trying to build connections to other lectures and other parts of machine learning. We began with a relatively simple observation: if you can write down a function as a linearly weighted sum of features, then you can use the Gaussian framework to learn that function from real-valued observations. So suppose the function f that we'd like to know — of which we will collect values y at locations x — can be written in this kind of form, or, more generally, as a sum of feature functions, which map from the input domain X to the real line, weighted by a bunch of weights w1, w2 and so on. If you can assume that the weights are Gaussian distributed, then you can use this Gaussian framework to learn such a function. Why? Because a Gaussian distribution over the w's amounts to a Gaussian distribution over the function: the function is a linear map of those weights. Here is one picture, out of many I showed you, of such an implied, induced distribution over function values. In this case I've used feature functions that are little bell-shaped bumps. I'm careful not to call them Gaussian bumps, even though they are Gaussian functions, because — as we saw, and this was maybe the biggest takeaway of the corresponding lecture — the shape of these features has barely any effect on the computational aspects of this process.
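In symbols: a Gaussian belief over the weights induces a Gaussian belief over the function, because the function is a linear map of the weights (φ collects the feature functions; the notation is meant to match the construction just described):

```latex
f(x) = \phi(x)^{\top} w, \qquad w \sim \mathcal{N}(\mu, \Sigma)
\;\Longrightarrow\;
\mathbb{E}[f(x)] = \phi(x)^{\top}\mu, \qquad
\operatorname{cov}\bigl(f(x), f(x')\bigr) = \phi(x)^{\top}\,\Sigma\,\phi(x')
```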
The choice of features only serves to shape the prior distribution — and of course, therefore, also the posterior distribution once we get some data. If we change the features, the belief, both under the prior and the posterior, looks very different. That's a great thing, because we can use this framework to build very powerful, expressive, flexible, general models, while the computational process doesn't actually change at all. We always assign Gaussian distributions to the weights, and we assume that we make Gaussian observations of a linear projection of those weights. This means that we get an implied Gaussian distribution over the function, and when we observe function values at particular points, up to Gaussian noise, we are observing a linear projection of the unknown weights. Because of the wonderful properties of Gaussian distributions, that means the posterior distributions over both the weights and the function itself are Gaussian distributions. Now, those expressions look complicated, and they are — you've implemented them several times if you did the exercises — but there are several important bits here, and maybe the most important one is that these expressions are Gaussian distributions. So they have parameters, mean and covariance, and those parameters can be computed from the prior and likelihood parameters using linear algebra computations: by multiplying vectors and matrices and by solving linear systems of equations — by inverting matrices, or, more generally, by using solvers for these linear systems. Another, maybe less obvious, property that we encountered is that it's actually possible to write these posteriors in various different forms. In particular — and this uses the so-called matrix inversion lemma, or the Schur complement form of matrix inversion — we can write these posteriors, for example over the weights but also over the function values, in a form where we have to solve a linear system of size number-of-observations by number-of-observations, or in another form which requires us to solve a linear system of size number-of-features by number-of-features. Obviously the latter is a good idea if you have more data points than features, and the former is a good idea if you have more features than observations. Before we start to think more about this, though, maybe a first question comes up — and it actually also came up first in our lecture course: I've introduced these feature functions phi(x); what do I do if I don't know what the right features are? I'm basically free to choose features in any way I like, so which ones should I choose?
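Before turning to that question, here is a minimal sketch of the weight-space posterior just described, in the features-by-features form — a sketch under simple assumptions (prior w ~ N(0, I), i.i.d. Gaussian observation noise), with all names illustrative:

```python
import numpy as np

def bayes_linear_regression(Phi, y, sigma2=0.1):
    """Posterior over weights w ~ N(0, I), given y = Phi @ w + Gaussian noise.

    Phi: (n, d) feature matrix, y: (n,) observations.
    This solves a d-by-d system; the equivalent n-by-n ("kernel") form,
    via the matrix inversion lemma, is preferable when d > n.
    """
    n, d = Phi.shape
    A = Phi.T @ Phi / sigma2 + np.eye(d)            # posterior precision (d x d)
    mean = np.linalg.solve(A, Phi.T @ y / sigma2)   # posterior mean of w
    cov = np.linalg.inv(A)                          # posterior covariance of w
    return mean, cov

# Illustration with bell-shaped (radial) features on a 1d input:
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=50)
centers = np.linspace(-3, 3, 10)
Phi = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)
y = np.sin(x) + 0.1 * rng.standard_normal(50)
mean, cov = bayes_linear_regression(Phi, y)
print(mean.shape, cov.shape)  # (10,), (10, 10)
```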
In the probabilistic framework, the answer to this question is provided elegantly by the probabilistic mechanism itself, in the form of what's called hierarchical Bayesian inference. Formally, this means: if we want to learn the function from data, but we also want to know the parameters theta that create our features, then we should in general perform Bayesian inference on those parameters as well. That means we should put a prior over theta and then use the likelihood for theta under the data to compute a posterior distribution over the parameters theta — we can index a set of features by parameterizing them in terms of some collection of parameters theta, and then try to infer those parameters. That's actually very nearly possible in the Gaussian framework, because this likelihood for the parameters theta — the conditional probability for the data y given the parameters theta — is an expression that can be written down in closed form: it's the normalization constant of Bayes' theorem. Because the product of two Gaussian probability density functions is another Gaussian probability density function up to normalization, where the normalization constant is itself of Gaussian form, that normalization constant — this thing in blue here, the evidence term — is something we can write down. It's a Gaussian distribution over the data, with a mean and covariance that do not contain function values, but do contain values of theta. So in principle we could use this quantity to do Bayesian inference: it's a likelihood for theta under the data y, and we could multiply it with a prior and try to compute the posterior. However, the way in which the parameters theta show up here is not, in general at least, one in which we can do simple Gaussian inference, because theta doesn't show up linearly in this expression. So in general we will not be able to write down a posterior distribution by multiplying this prior with the likelihood, because that posterior distribution will have what I sometimes call intractable form. Someone asked a while ago what I mean by intractable: it means that even though I can write down this function, I can't actually talk about its global shape in a simple way — for example, I cannot tell you what its mean or its variance is, because to do so I would have to compute an integral that I don't know how to compute. Nevertheless, we could look at this likelihood — or actually also at the posterior, because multiplying with a prior is easy — and at least try to find point estimates for theta, by finding (and that's one of the tools in our toolbox) the maximum a posteriori or maximum likelihood estimate for theta. If I only want to construct this point estimate, then it's enough to follow a kind of path that we've actually encountered several times over the lecture course — and maybe this is a good point to point it out: it's an instance of maximum likelihood inference. How do you do that? Well, first we notice that if we only want to maximize a probability distribution which happens to have this particular form, then we might as well maximize its logarithm, because the logarithm is a monotonic transformation: it doesn't change the location of the maximum, only its value. We might also just as well minimize the negative logarithm, because the location of the maximum of something is the location of the minimum of minus that something.
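For the Gaussian regression model discussed here, that negative logarithm takes a form worth writing out — a standard expression, stated under the assumption of a zero prior mean, with G_θ = Φ_θ Σ Φ_θᵀ + σ²I denoting the marginal covariance of the data under features parameterized by θ:

```latex
-\log p(y \mid \theta) =
\underbrace{\tfrac{1}{2}\, y^{\top} G_{\theta}^{-1}\, y}_{\text{data fit}}
\;+\;
\underbrace{\tfrac{1}{2}\, \log \det G_{\theta}}_{\text{Occam factor}}
\;+\; \tfrac{n}{2}\,\log 2\pi
```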
Taking this logarithm, and not minding writing a minus in front of it, first of all simplifies the expression, to the extent that we can talk about this function more precisely, because we get rid of the exponential in the Gaussian distribution. But it also provides a connection to other parts of machine learning. In particular, we notice that this maximum likelihood estimate amounts to minimizing a loss function, which happens to be a quadratic loss plus a term that can be thought of as a form of regularizer — which is interesting, because this is a likelihood function rather than a posterior, and that regularizer arises directly from the fact that we're computing a marginal distribution: we're integrating out the latent function, considering infinitely many possible realizations of the function. We call this model complexity penalty term the Occam factor, and we saw that it has some regularizing properties, even though it doesn't solve all of our regularization needs — which is maybe not surprising, because this is just a (log) likelihood function, not a posterior distribution. Writing this estimation problem in this way makes a connection to empirical risk minimization, but it also makes a connection to maybe another prominent part of machine learning, which is deep learning. We realized that what we've just done here — trying to find parameters for our features, trying to learn our features — is trying to learn a representation of our data, and learning representations is maybe the more general notion behind deep learning. So we can think of our Gaussian regression algorithm with parameterized features as a form of deep learning. Actually, maybe the most general concept behind deep learning is that of differentiable programming: being able to compute gradients of computer programs, to optimize the parameters of those programs. Our Gaussian regression framework is an instance of that as well. I don't have a slide for it here, but you'll remember that I showed in that lecture that we can compute the gradients of this Gaussian model using automatic differentiation. If you know about automatic differentiation, then that's not a surprise at all, but it's maybe important to point out that the idea of autodiff, which is really at the heart of the success of deep learning, is not limited to a particular architecture of deep neural networks; it's a general notion that can be used in probabilistic machine learning just as much as in empirical risk minimization settings. This insight gave us a connection to one important class of machine learning models, deep learning, and almost immediately afterwards we saw that there is another connection, to a very powerful framework. In this view, it is associated with not parameterizing finitely many features and learning their parameters through an optimization process, but instead using infinitely many features — making the network not deep, but infinitely wide. We did this derivation slowly, by beginning to observe that when we compute our posterior distribution over the function values, we have to compute a certain number of quantities that involve the features. But when we stare at these expressions for a while, we realize that the features never show up on their own — there is never a lonely phi in these representations. Instead, phi always either gets multiplied with a mean vector, or we have to compute inner products between feature functions, weighted by some prior covariance.
This led us to realize that maybe we can get away with increasing the number of features, even towards the infinite limit, if we are careful to ensure that these inner products can still be computed. In particular, there are actually two such inner products, one here and one there. One of them is quite general: because Gaussians stay Gaussians under affine shifts, we can think of it as just an affine shift of both the data and the posterior. We can shift by more or less anything — we could just think of some function, and it doesn't actually matter what that function is, because it's an arbitrary shift — so we can even make it independent of the feature functions. The more important part, the business end of this process, is the inner product between features; that's what we really have to get right. It turned out, through some derivations that I'm not going to repeat, that this inner product can sometimes be computed in finite time even if there is an infinite number of features involved. That's because this inner product is a sum, and there are certain sums that remain tractable when you take the infinite limit: series, and in particular integrals. Such an inner product is itself called a kernel, and we saw that you can use this notion of integration as an infinite sum to build kernels that have infinite degrees of freedom. This gave rise, first, to the definition of a kernel. Kernels are a notion that often confuses people — not for the fact that they exist (I think people understand that they are integrals which allow us to deal with infinitely many features at once), but maybe more for what they actually are. They are functions that can be evaluated on pairs of input points, or on pairs of collections of input points, to give rise to matrices. So a kernel is a function, but we use it as a sort of second-order function: you give it not an individual element of the input space, or two of them, but a collection of points, and that collection gives rise to a matrix — or two collections of points, which give rise to a rectangular matrix. If those matrices have the property that, when built over the same collection of points on the left- and the right-hand side — so if you build a square matrix in the formal way defined in the sentence here — the result is positive semi-definite for every collection of input points, then we speak of a Mercer, or positive definite, kernel. Those are the functions we need to use in our Gaussian inference framework, because (a) they correspond to inner products, and (b) they give rise to covariance matrices, to positive semi-definite matrices; otherwise we wouldn't be able to build Gaussian models over them. Gaussian models that arise from this implicit construction are called Gaussian process distributions. These are stochastic processes: a stochastic process is a potentially infinite collection of random variables such that every finite subset of those variables follows some pre-specified distribution — in the case of a Gaussian process, a Gaussian distribution. So a Gaussian process is a probability distribution over function values such that every finite restriction to a subset of finitely many function values is of Gaussian form: it has a Gaussian distribution, identified by a mean vector and a covariance matrix.
These are constructed, respectively, by evaluating the mean function at the input points, and by evaluating the kernel at every possible pair of input points in the collection X, building a square matrix out of these values — which is then positive semi-definite if we used an actual kernel. So how many kernels are there? Initially we encountered just a small collection of them, constructed by hand: the Gaussian kernel, also known as the squared exponential or radial basis function kernel; and then the Wiener process, which is a Gaussian process associated with a kernel given by the minimum function. We then realized that we can do a very closely related construction on the Wiener process, which essentially amounts to integrating over sample paths of this Wiener process, and arrive at another stochastic process — called the cubic spline kernel, maybe, or the integrated Wiener process — because it produces posterior mean functions that are cubic splines. And there are a few other kernels like this, such as the famous neural network kernel, which I didn't actually introduce all that much in the lecture, that can be constructed by hand. That might give the impression that there is only a very finite toolbox of kernels to use, but that's totally wrong, because it's actually possible to construct new kernels from old. It's possible to do so by scaling the outputs of kernels with a positive scalar; by scaling the inputs of kernels in a more or less arbitrary non-linear fashion, as long as you scale both inputs by the same function; by summing kernels; and even by multiplying kernels together. We saw that each of these operations is an element of an algebra of operations that we can perform on the Gaussian processes associated with these kernels, and together they give us a powerful modeling language with which we can build Gaussian process models. I then actually spent an entire lecture showing you how to use these notions — the parameters that arise, and this collection of kernels that one can put in a sub-toolbox, if you like — to build a relatively powerful language of regression algorithms that even allow us to do a form of mechanized scientific inference on data. I used very simple data, a time series of data points of my own body weight. I did that because it made it easy to tell a story and make a personal connection, but of course this is just an example of the kind of use cases probabilistic machine learning has when you use such powerful tools. It's really a language, and that's maybe the most powerful aspect of probabilistic modeling — one that is sometimes hard to talk about, because it's such a soft subject: it provides a language in which you can write down your prior beliefs in a very detailed and quantified way. All of the quantities have units of measure, they have natural choices, they often have actual tangible prior knowledge associated with them, and all of that knowledge can often be encoded into, for example, a regression algorithm like the Gaussian process model we used here. To be able to do so, you have to have a toolbox of your own in your mind for how to build such kernel models, such Gaussian process regression models — and the more kernels you know, and the more operations you know you can work with, the more powerful your own toolbox becomes.
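A minimal sketch of this kernel algebra — illustrative code, where each constructor returns a kernel as a function of two 1d point collections:

```python
import numpy as np

def rbf(ell=1.0):
    """Gaussian / squared-exponential kernel on 1d inputs."""
    return lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def wiener():
    """Wiener-process kernel k(a, b) = min(a, b), for inputs > 0."""
    return lambda a, b: np.minimum(a[:, None], b[None, :])

# New kernels from old: these operations preserve positive semi-definiteness.
def scale(k, c):      return lambda a, b: c * k(a, b)            # c > 0
def add(k1, k2):      return lambda a, b: k1(a, b) + k2(a, b)
def multiply(k1, k2): return lambda a, b: k1(a, b) * k2(a, b)
def warp(k, g):       return lambda a, b: k(g(a), g(b))          # same g on both inputs

x = np.linspace(0.1, 3, 5)
k = add(scale(rbf(0.5), 2.0), wiener())
K = k(x, x)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # Gram matrix is (numerically) psd
```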
That doesn't mean, though, that this toolbox is unique to the probabilistic inference framework. In fact, this Gaussian process regression corner of probabilistic machine learning is intricately connected to the other theoretical foundation of machine learning, the statistical machine learning framework. At this point in the lecture course I tried to build a bridge to the parallel course by my colleague Ulrike von Luxburg, to show that in certain cases — though not in all — the probabilistic framework and the statistical framework are extremely close to each other. In particular, you can think of Gaussian process regression as the probabilistic analog to least squares regression, or, more specifically, to kernel ridge regression. This is important because it allows us to do a very careful philosophical comparison of the notions used in statistical and probabilistic machine learning. The point estimate of the Bayesian, the posterior mean function, happens to be identical — and we could show this easily — to the kernel ridge estimate, which is motivated in the statistical framework as the minimizer of an empirical risk given by a square loss plus a regularizer, namely the norm of the interpolating function in the reproducing kernel Hilbert space associated with the kernel — the covariance function of our Gaussian process prior. So we can think of the mean function as a point estimate in a hypothesis space called the reproducing kernel Hilbert space. That's maybe not so surprising: it's a point estimate, so of course there's going to be a statistical interpretation of it. What's maybe more interesting is the connection regarding uncertainty — the role of uncertainty as an error estimate. It turns out that in this Gaussian framework, the posterior error estimate of the Bayesian — the expected square error, the posterior variance — is actually equal, up to an unknown constant, to a worst-case error estimate for the statistical machine learner: the maximum square deviation between the true function and the posterior mean function, under the assumption that the true function lies in the reproducing kernel Hilbert space and has bounded norm. We spoke about this connection for an entire lecture because it's interesting: it tells us that even though proponents of these two frameworks often try to separate them from each other, in certain cases they really are extremely close — so close that one might think they are essentially the same. But there are certain subtle differences. One of them is that the Gaussian process hypothesis space is a little bit larger: it's like a completion of the reproducing kernel Hilbert space, and all of the samples lie on the shell that completes the reproducing kernel Hilbert space. That in itself is maybe more of a technical, gotcha kind of observation that doesn't matter all that much. But there's another difference between the frameworks, given by what we discussed a few minutes ago: the fact that in a Bayesian framework we can marginalize over the whole hypothesis space. We can only do that because this marginalization operation fundamentally requires a probability distribution to integrate against — otherwise you don't get a finite result for the integral. That marginalization operation allows us to do hyperparameter inference in the form we discussed a few slides ago, which is very hard to motivate from a non-probabilistic perspective, without appealing to probability distributions.
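The identity at the heart of this bridge, written out in standard notation — assuming a zero prior mean, with Gram matrix K = k(X, X), noise variance σ², and H_k the reproducing kernel Hilbert space of k:

```latex
m_{\text{post}}(x)
= k(x, X)\,\bigl(K + \sigma^{2} I\bigr)^{-1} y
= \operatorname*{arg\,min}_{f \in \mathcal{H}_k}
  \;\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^{2}
  \;+\; \sigma^{2}\,\lVert f \rVert_{\mathcal{H}_k}^{2}
```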
After this more philosophical, mathematical observation, we return to our toolbox. I've already added some additional tools a few slides ago without saying yet what we did with them, but I can tell you now, and remind you, that we introduced these additional ideas — we've now actually covered all of this on the left-hand side — as computational tricks on the right-hand side to expand our modeling language. Actually, we haven't covered all of it yet, so I should do this first. In the lecture so far, we've considered a base case: a general input domain over which we're trying to learn a real-valued function. We saw that this is a powerful modeling language, but also, in general, an expensive and somewhat restrictive one. It's expensive because the cost of Gaussian process inference grows cubically with the number of data points, effectively limiting the number of data points we can work with to a few thousand before we have to resort to computational tricks and approximations. And it's restrictive because we strictly require our output domain to be a real vector space — let's deal with that next. But first, we did a quick pass in this course to point out that there are certain settings, certain generative processes, which simplify inference and can even lower the cost of inference to linear in the number of data points. One of them is to use finitely many features. We've already seen that before: we can then use the Schur complement to do the inference in a space in which the cost is linear in the number of data points and cubic in the number of features. That is faster than inverting a full kernel matrix over all of the data points, which corresponds to the more powerful model that assumes that all of the function values relate to each other. Another way to save computational cost — connected to an entire area of research, even beyond machine learning, called signal processing — is to assume that we get a data set containing a sequence of observations with Markov structure. That means that subsequent observations are conditionally independent of all older ones when conditioned on their immediate predecessor. Such Markov chain structure, even without appealing to Gaussian distributions, gives rise to a simplification of the inference process. We saw, through an admittedly somewhat tedious derivation, that in such Markov-chain-structured models, inference on the entire set of latent states can be performed in linear time, by a process called filtering — a forward pass — and smoothing — a backward pass — along this graph. By the way, this algorithm is the base case, the simplest form, of the more general algorithm called belief propagation, or message passing, that we encountered later in the course. This particular process, so far, is abstract: it has linear time cost, but it involves, in each step, an integral that might still be intractable. There are linearly many of these integrals, and they are over local, low-dimensional spaces, but they are still integrals, so you have to find ways to perform them in closed form, or at least in a good, numerically approximate way. The easiest such case is if, again, all of the distributions involved are Gaussian and the relationships between the variables are linear. Then we end up with a pair of algorithms called, respectively, for the forward pass, the Kalman filter, and, for the backward pass, the Rauch-Tung-Striebel smoother.
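A minimal sketch of one Kalman filter step — the predict-update cycle for the linear-Gaussian model x_t = A x_{t-1} + q, y_t = H x_t + r, with illustrative names and a toy example:

```python
import numpy as np

def kalman_step(m, P, y, A, Q, H, R):
    """One filtering step for x_t = A x_{t-1} + N(0, Q), y_t = H x_t + N(0, R)."""
    # Predict: push the current belief through the linear dynamics.
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update: condition the Gaussian prediction on the new observation.
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new

# Illustration: track a 1d random walk from noisy measurements.
A = Q = H = R = np.eye(1)
m, P = np.zeros(1), np.eye(1)
for y in [1.1, 0.9, 1.4]:
    m, P = kalman_step(m, P, np.atleast_1d(y), A, Q, H, R)
print(m, P)  # posterior mean and covariance after three observations
```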
This simple form of filtering and smoothing in linear Gaussian models gives rise to maybe the most elementary of all machine learning models, filters, and in fact, as I pointed out in the classroom session associated with this lecture, even to the extremely simple algorithm of exponential moving averages, or rolling averages. So this is one way of lowering computational cost. It is not directly a way of expanding our modeling language, although by lowering computational cost we do expand the space of data sets we can deal with to essentially unbounded size: linear cost allows us to pass once through a data set, or even deal with infinite data streams that come in over time, because the inference cost per step does not rise. But we are still limited to real-valued output variables. What do we do in supervised machine learning problems where the output variables are not real-valued, and can therefore not be assumed to be Gaussian distributed? Well, we fiddled a little bit: we used approximations to construct approximate posteriors, using a Gaussian process prior over a latent function and transforming that function into the domain in which we see our data. The first and best-studied, most widely used example of such a setting is classification, where the observations are not real-valued as in the previous lectures, but discrete values from 1 to K, for some number of classes, such as labels on images saying that an image contains a certain object. In particular, this contains the special case of binary classification, where something is either in or out, either class 1 or class 0 (or class -1 or class +1). For this binary case we studied in a bit more detail how to construct a tractable algorithm, and that algorithm gave us a template for a more general computational method that we ended up using for a whole range of models. Here is that derivation for the binary classification case. It begins with the observation that we have to use a likelihood that is a squashing function from the latent space of the function to the data observation space; being a squashing function, it has to be some kind of sigmoid, for example the logistic function. That is a non-Gaussian likelihood, so when we multiply the Gaussian prior over function values with it, the posterior is not a Gaussian distribution. So we need an approximate method, if we want to continue to use the Gaussian framework, for turning this non-Gaussian distribution into a decent Gaussian approximation. One way to construct such an approximation is to find the mode of this distribution, which we might as well do in log space, in the logarithm of the probability distribution, and then, at that mode, do a second-order Taylor expansion. That is easy, because we already have the zeroth-order term, the value at the mode; at the mode there is no gradient, so there is no first-order term; and the second-order term involves the curvature of the log probability distribution, which provides a quantity we can use as an approximate posterior covariance of our Gaussian process model. Actually, not quite that quantity itself, but its negative inverse. This approximation is called the Laplace approximation.
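As a reminder of what that mode-finding looks like in code, here is a minimal, illustrative sketch of the Laplace approximation for Gaussian process binary classification with a logistic likelihood, using plain Newton steps on the latent function values. The data and kernel are made up, and the naive inversion of K is only acceptable in a toy example like this one:

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    return np.exp(-0.5 * (x[:, None] - x[None, :])**2 / lengthscale**2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up 1D training data with binary labels y in {0, 1}
x = np.array([-2.0, -1.0, 0.5, 1.5, 2.5])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

K = rbf_kernel(x) + 1e-6 * np.eye(len(x))   # GP prior covariance of f
K_inv = np.linalg.inv(K)                    # naive; avoid in real code

# Newton iteration for the mode of  log p(y|f) - 0.5 * f' K^{-1} f
f = np.zeros(len(x))
for _ in range(20):
    p = sigmoid(f)
    grad = (y - p) - K_inv @ f                # gradient of the log posterior
    W = np.diag(p * (1.0 - p))                # minus the likelihood Hessian
    f = f + np.linalg.solve(W + K_inv, grad)  # Newton step

# Laplace approximation: mean = mode, covariance = negative inverse Hessian
p = sigmoid(f)
W = np.diag(p * (1.0 - p))
mean, cov = f, np.linalg.inv(W + K_inv)
```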
We do not just use it, and indeed do not need to use it, only for binary classification. It is such a general tool that it can be applied to more or less any continuous probabilistic model in which we can find the mode, with the caveat that it is a local approximation, so it might be more or less arbitrarily wrong if we find a mode that is not a good representation of the region in which the distribution places its mass. But if we are careful and make sure that our model, roughly speaking, is unimodal and log-concave, and does not have some crazy shape that we cannot capture, then these approximations tend to be quite decent. So we can apply them to a larger class of models beyond Gaussian process classification: models in which we have a Gaussian prior on a latent function and make observations drawn from a probability distribution given by a squashing of that latent quantity. The squashing does not just have to map to the unit interval, as it did for binary classification; it could also map to the probability simplex for multiclass classification, or to any other domain. An example I used in the lecture is the squashing to positive values, which is a good model for data sets like the coronavirus case counts in Germany: these are curves that are strictly positive, with a wide dynamic range upwards but a hard lower bound at zero. For such models we saw that we can use a transformation; the natural one that suggests itself is the logarithmic transformation, which gives a neat Laplace approximation that captures some interesting aspects of the generative process of this problem. For example, it naturally ensures that the higher variance of measurements of small numbers is captured in the observation model, and we get small relative error bars for large counts and large relative error bars for small counts. But this is just one domain in which we can use these squashed, generalized-linear models, and more generally use the Laplace approximation for approximate inference. In fact, the Laplace approximation is such a powerful and lightweight tool that we can even use it in really heavyweight machine learning models such as deep neural networks. At that point in the course, and I am not going to go through the derivation again here, I pointed out that we can use the idea of Laplace approximations in deep neural networks by following the same kind of process: find the minimum of some posterior distribution, of some loss function, of some regularized empirical risk minimization problem; then compute the curvature of this loss at the minimum; and treat the location of the minimum as a mean, and the inverse of this curvature, the inverse Hessian of the risk, as the posterior covariance of a Gaussian model. Using this framework, we can assign Gaussian posterior distributions even to deep learning models, for example ReLU networks. And at that point I could not quite keep myself from pointing out recent work from my own research group showing that these approximations have good theoretical properties, for example guaranteeing a certain form of robustness for ReLU classification networks.
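As a sketch of that post-hoc recipe, here is the same idea on a deliberately tiny stand-in model, logistic regression, where the Hessian can be written in closed form; for an actual deep network one would instead obtain the curvature from an automatic-differentiation toolbox, typically via a Gauss-Newton or diagonal approximation. Everything here (data, step size, test point) is made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up 2D binary classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
lam = 1.0                         # L2 regularizer = Gaussian prior precision

# step 1: ordinary training (gradient descent on the regularized risk)
w = np.zeros(2)
for _ in range(500):
    g = X.T @ (sigmoid(X @ w) - y) + lam * w
    w -= 0.05 * g                 # small fixed step; fine for this toy problem

# step 2: post-hoc Laplace: curvature of the loss at the trained weights
p = sigmoid(X @ w)
H = X.T @ ((p * (1 - p))[:, None] * X) + lam * np.eye(2)  # loss Hessian
Sigma = np.linalg.inv(H)          # Gaussian posterior covariance over w

# moderated prediction at a test point (MacKay's probit approximation)
x_star = np.array([2.0, -1.0])
m, v = x_star @ w, x_star @ Sigma @ x_star
p_star = sigmoid(m / np.sqrt(1.0 + np.pi * v / 8.0))
```

The last two lines show the payoff: the Gaussian posterior over the weights moderates the prediction at the test point, rather than returning the raw, typically overconfident, point estimate.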
But what is maybe more important is that they are so flexible, flexible in two senses. First, they are very lightweight, so they can be computed post hoc: you first train your neural network, and then you just compute the Hessian, which is easy to do with standard toolboxes now, at the cost of essentially one more backprop pass. So it is not that expensive, and it can be done after training. And secondly, conceptually speaking (not computationally, but conceptually), it is an interesting observation that in this process we are approximating a non-Gaussian probability distribution with a Gaussian, where the non-Gaussianity does not just arise from a non-Gaussian likelihood: there was never actually a Gaussian prior to begin with, unless you make very specific choices in your neural network, which you do not have to do to apply this framework. So you can do approximate Gaussian inference even in models that are entirely non-Gaussian, where neither the prior nor the likelihood is Gaussian, as long as you are able to compute curvatures and find minima. Of course, that then also gives rise to more skepticism about these approximations; you have to be more careful when you apply them, and do the kind of theoretical analysis hinted at on this short slide. At this point we are more than three quarters into the course. We have reached the part with the slightly more exotic, slightly more advanced questions of probabilistic modeling, and we had already amassed a really powerful toolbox: we can build probabilistic machine learning models for more or less all supervised problems you can think of, with a more or less arbitrary input domain and a more or less arbitrary output domain, by building powerful latent Gaussian models and using the Laplace approximation to connect them to the observations. So maybe the natural next question is: what lies beyond supervised machine learning? What do we do if we have to do unsupervised machine learning? A first, admittedly restrictive, basic observation is that unsupervised machine learning is essentially learning a probability distribution. So maybe there are ways of doing what Gaussian processes do for regression, that is, learning a function, but instead for learning a probability distribution, at least in basic cases. We learned that the primary tool for this is given by exponential family distributions, which provide conjugate priors to likelihoods. A conjugate prior is a choice of probability distribution for a particular likelihood function such that the posterior distribution arising from it is of the same functional form: the same family as the prior, just with updated parameters. Naturally, this means that the prior has to be chosen relative to the likelihood. The natural language in which to construct such priors is that of exponential family distributions. An exponential family is a parameterized family of probability distributions, parameterized by some parameters w, which has the form of an exponential of an expression that is linear in (a transformation of) the parameters w, up to a normalization constant, which should be known so that computation actually becomes efficient. These exponential family distributions are available as conjugate priors to quite a broad class of observation models, of likelihood functions. I listed a few, and these provide essentially what amounts to a data type, or a collection of data types, for probabilistic inference.
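Written out in one common notation (the symbols phi, eta, Z, alpha, and nu below are generic placeholders, not notation fixed by the lecture), the exponential family form and its conjugate prior read:

```latex
% Exponential family likelihood with sufficient statistics \phi(x),
% natural parameters \eta(w), and normalizer Z(w):
p(x \mid w) = h(x)\, \exp\!\big( \phi(x)^\top \eta(w) - \log Z(w) \big)

% Conjugate prior: the same functional shape in w, with hyperparameters
% \alpha and \nu:
p(w \mid \alpha, \nu) = F(\alpha, \nu)\, \exp\!\big( \alpha^\top \eta(w) - \nu \log Z(w) \big)

% Posterior after observing x_1, ..., x_n: same family, updated parameters
p(w \mid x_{1:n}, \alpha, \nu)
  = p\Big(w \,\Big|\, \alpha + \sum_{i=1}^{n} \phi(x_i),\; \nu + n\Big)
```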
In other parts of computer science you are well used to data types, that is, to objects that provide interfaces that work well in certain settings, like integers and floats and strings. Exponential families are an analogous concept: they are natural partners for certain types of data. For discrete binary data there is the Beta prior; for rates, such as that of a Poisson likelihood, there is the Gamma prior; for covariances there are Wishart priors; for functions there are Gaussian priors; and so on. We saw in the lecture on exponential family distributions that using an exponential family prior that is conjugate to a particular likelihood function actually amounts to learning that likelihood function. The likelihood has a bunch of parameters which you might not know; if you put the conjugate prior over them, those parameters can be learned in a fully Bayesian fashion, giving a posterior that concentrates around the parameters of the generative process, assuming the data is actually generated from that likelihood. In fact, it is even possible to make a somewhat more general, more statistically flavored statement: if the data does not come from a distribution that is actually addressable in the space of likelihood functions parameterized by our parameter w, then over time we will still find a generative model for the data that lies in this class of likelihood functions and minimizes the Kullback-Leibler divergence, within the class, to the true data-generating distribution.
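The very simplest concrete instance of this data-type pattern is a Beta prior learning the rate of a Bernoulli likelihood; the observations below are made up, but the update is the general conjugate pattern of adding sufficient statistics to the hyperparameters:

```python
import numpy as np

# Bernoulli likelihood with unknown rate w, conjugate Beta(a, b) prior.
a, b = 1.0, 1.0                      # Beta(1, 1) = uniform prior on [0, 1]

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # made-up binary observations

# Conjugate update: add the sufficient statistics (counts) to the
# hyperparameters; the posterior is again a Beta distribution.
a_post = a + data.sum()              # number of ones observed
b_post = b + len(data) - data.sum()  # number of zeros observed

posterior_mean = a_post / (a_post + b_post)   # E[w | data]
posterior_var = (a_post * b_post) / ((a_post + b_post)**2 * (a_post + b_post + 1))
print(posterior_mean, posterior_var)
```

With more observations, this posterior concentrates around the true rate, which is exactly the convergence behavior described above.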
These exponential family distributions, by expanding the types of variables that we can address with a probability distribution under reasonably tractable operations, really completed the modeling part of our toolbox. By moving beyond Gaussian distributions, and therefore beyond real-valued variables, we now have something like a Lego brick set available, from which we can build models for more or less arbitrary distributions and arbitrary types of situations. What was not yet complete at this point of the course was the computational side of our toolbox. We rapidly noticed this as we moved to our application, the latent Dirichlet allocation problem, which is exactly of this Lego-brick style of model class: we simply plug exponential family distributions together to construct a generative model for our data. We noticed there that the immediate penalty we incur is that we face computational problems without a closed-form solution, and they are not always of the type that can be answered with a Laplace approximation or with Monte Carlo methods. Actually, Monte Carlo sometimes still works here; we tried that and saw that it kind of works, but it is often slow or expensive, or requires a lot of thought to implement efficiently. So we used this opportunity to introduce our final set of tools for the purposes of this course: first, an elegant tool to perform maximum likelihood inference in these kinds of models, that is, to learn hyperparameters or variables by computing maximum likelihood values for them; and secondly, an even more powerful tool that does essentially the same kind of thing, but in the space of probability distributions rather than point estimates. Let us do the first one first. This maximum likelihood inference algorithm for such general models is called expectation maximization, or EM. It is a relatively general algorithm that hinges on your being able to find a set of additional latent variables that simplify the computation, in the following sense: instead of trying to numerically optimize the marginal log-likelihood directly, we introduce a set of variables z which simplify the structure of the probability distribution by introducing some form of conditional independence. We then build an algorithm that iterates between two steps: computing the posterior over those latent quantities, which is often tractable under the current setting of the parameters, and taking the expected value of the log joint under this posterior (rather than the log of an expected value, we compute the expected value of the log, which is often much, much easier); and then maximizing this expected log joint distribution over the data and the latent quantities with respect to the parameters. To motivate why this is a good idea, we realized that computing and maximizing the expected log joint under the posterior over z actually amounts to maximizing a lower bound, which (ah, I notice I have even written it down slightly wrongly on the slide) is called the evidence lower bound, because it is a lower bound on the evidence of the model, the expression up here, for any choice of probability distribution q. The bound becomes tight when we choose q to be equal to the posterior distribution p(z | x). So EM can be motivated as an algorithm that iterates between raising the bound, which in general makes it non-tight, and closing the bound again by updating the approximating distribution q to the posterior distribution under the current choice of parameters. In fact, this observation, that we are really tightening and then optimizing a bound, motivates a larger and more powerful class of approximate inference algorithms called variational inference. The idea is that, instead of computing the true posterior over z given the data and the parameters, we introduce an approximating distribution q, impose some restrictions on it so that it becomes tractable, and then maximize the evidence lower bound. Another way of thinking about this process is that it amounts to minimizing the KL divergence between the approximating distribution and the true posterior. Of course we could drive the KL divergence to zero by setting the two equal to each other, but often we cannot, because the setting is too complicated. Our latent Dirichlet allocation model, if we treat theta and pi as variables rather than as parameters, is such a case: there is no closed form for the posterior distribution over all of the quantities c, theta, and pi together, only over subsets of them. We saw that we can address this issue by constructing an approximation that is not even necessarily of parametric form; we do not always have to insist that q be a member of a particular exponential family. Instead we can get away with just imposing factorization, requiring that the approximating distribution factorizes into separate parts, and then sometimes the inference actually becomes tractable. To minimize the KL divergence under this constraint, we saw, after a little bit of deliberation, that the process can be thought of as finding approximating distributions over the factorizing sub-parts of our model, by iterating across those sub-parts and, at each step, setting the logarithm of the approximating distribution for one sub-part to the expectation of the log joint under all of the other sub-parts.
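In symbols, using the generic x, z, and q from above (one common way of writing it), the bound behind both EM and variational inference, and the mean-field update just described, read:

```latex
% For any distribution q(z), the log evidence decomposes as
\log p(x \mid \theta)
  = \underbrace{\mathbb{E}_{q}\big[\log p(x, z \mid \theta) - \log q(z)\big]}_{\text{evidence lower bound (ELBO)}}
  + \underbrace{\mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big)}_{\ge 0}

% Since the KL term is nonnegative, the ELBO is a lower bound on the log
% evidence, tight exactly when q(z) = p(z | x, \theta).

% Mean-field variational inference: for q(z) = \prod_j q_j(z_j), the optimal
% update for one factor, holding all others fixed, is
\log q_j^{\ast}(z_j) = \mathbb{E}_{\prod_{i \neq j} q_i}\big[\log p(x, z)\big] + \mathrm{const.}
```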
This approach is called the mean-field approximation. Whatever the origin of the name, what it finds is a general approximation that is free-form, if you like, arguably a free-form optimization in the space of all probability distributions, under the only constraint that they factorize in this particular way. Of course this does not always work for every class of models, but it is a first smart thing to try before you impose a hard constraint on the parametric form of the distribution, which is the next thing you could try to get an even more tractable approximation. The nice thing is that you stay in control of how severe you want your approximation to be, by proceeding gradually: you first try to find the true posterior, which would maximize the bound and make it tight; if you cannot do that, you introduce factorization constraints and try to get a mean-field approximation; and if you cannot do that either, you finally constrain the approximating distribution to be a specific exponential family and just find the corresponding parameters. This very powerful framework of EM and variational inference completed our toolbox on the computational side as well. Strictly speaking, it is not true that this completes the toolbox; of course there are many more algorithms available in expert probabilistic machine learning. But this list already gives a very powerful set of ideas that you can use to build relatively general algorithms for a large class of data types, of data collections, of data that might reach you in the real world. And that was maybe the key goal of this probabilistic machine learning course: to empower you to build your own solutions for your own data set, rather than having to use the bog-standard deep learning model that you might find in a tutorial on some deep learning framework. Instead, the probabilistic language empowers you to really build customized models for your personal data modeling needs, as you might encounter them in science and academia or in an industrial setting. Towards the end of the course we also outlined a little how this process of building a solution for a particular data set actually works in practice. Partly this was meant to give you a concrete guardrail to walk along when you build your own solutions, your own products. But this slide, which I showed you several times, was also meant to highlight how the abstract philosophical notion of prior and likelihood, of probabilistic inference as an extension of propositional logic, survives into practice only to some degree. When we talk philosophically about probabilistic modeling, we often speak of encoding all of our prior knowledge faithfully in the prior and the likelihood and then performing probabilistic inference. But we realized, working through our concrete topic-modeling example, that this is not actually possible in practice. In reality your data has usually not been collected by yourself, and most of the time you do not know everything about it. So you need to talk to whoever collected the data, to collect meta information, to go beyond the raw numbers on your hard drive.
That meta information informs your choice of prior. But of course you cannot write down just any arbitrary prior, because that would make computation almost guaranteed to be intractable. Instead you have to use collections of model building blocks from the other side of the slide, from your collection of Lego bricks, to build your model: exponential family distributions, and Gaussian distributions using kernels and features, to represent your knowledge. Then, in a final step, you, the designer of the solution, have to move from building a model, which is sort of a philosophical exercise, to building a computational algorithm that actually performs its computations on a real-world system in finite, tractable time. Building these algorithms partly interacts with the modeling process, because you have to build a model in which inference is actually tractable. And it requires a lot of your own ingenuity to build efficient solutions: thinking about the structure of the model, perhaps using a graphical representation to gain insight into its conditional independence structure; doing derivations for variational bounds; or finding smart ways of implementing a Markov chain Monte Carlo method, so as to arrive at a tractable algorithm in the end. Machine learning, then, and in particular probabilistic machine learning, is a task performed by humans on a computer: to allow computers to refine models that distribute truth over a space of hypotheses in the light of data, by performing an often challenging and complicated continuous-valued computation that requires a lot of mathematical insight. When you get it right, though, it is an extremely powerful tool to extract knowledge about the world from data and make it useful, by mechanizing the process of scientific inference. And that is really what a large part of our world, and of our tasks as computer scientists, is about. So I am going to end this lecture, and this lecture course, with a quote by Pierre-Simon, Marquis de Laplace, maybe the actual founder of probabilistic machine learning, well, of probabilistic inference for sure: that life's most important problems, indeed almost all quantifiable problems, are really problems of probability. By collecting mental tools to mechanize this process of probabilistic inference, I hope you now feel a bit more empowered to build your own solutions for your personal, your business, your scientific, your academic needs. I hope that you have enjoyed this lecture course across the last 26 lectures, despite the occasional hiccup, and despite the huge challenge that this corona crisis has been. And I thank you very much for investing your time into watching these videos. Goodbye.