So, last week, the second lecture on Thursday was by Nathanael Bosch, to complete the section on time series. I, therefore, am not going to show you feedback; I haven't even looked at it, to be honest, because there's not much point in me discussing what Nathanael Bosch did and whether you liked what he did or not. Instead, I want to use the time to do a bit of a review of what we've done in the course so far. And the reason for this is the following. Normally, I do a summary lecture at the very end of the course, the very last time we meet. That is natural, of course, because by then we have all the content. But I realize that this also means you get the summary about four days before the exam, and that seems a bit short. If you want to prepare for the exam, you probably want to start slowly now, so maybe now is the right time to do a brief summary. The other advantage of doing it this way is that, arguably, what we've done so far up to this point of the course is the most core, most important stuff that I felt really has to be in a probabilistic machine learning class in 2023. What we're going to do from today onwards, after this summary, is a bit of a potpourri of different things that we could also have done. It's not going to be complete; there are many more things we could do, but for lack of time just can't. So I'll point in a few directions, and it'll sometimes be a bit deeper and sometimes a bit shallow and quick. I also don't quite know yet how long that's going to take, so maybe next Thursday I'll still want to use a little bit of time to talk about a few things. Long story short, now is a good time to do a summary of the structured part, and then we can just see how far we get with the later stuff. So, here's the summary. What have we done over the course of this term so far? This is a class on probabilistic machine learning, and we started by describing what probability actually means. We laid down the rules of probabilistic reasoning. Those involved writing down some axioms by Kolmogorov and then realizing that these axioms really just describe that we're going to measure volumes of spaces: we assign a finite volume to an initial set of hypotheses and then track how that finite volume changes if we either transform the variables or restrict the space in a particular sense, and we endow this restriction with the notion of inference, of learning. That led us to these basically four rules, or rather three rules and one initial assumption. The initial assumption is that truth has unit measure: we assume that there is a statement that is true within our hypothesis space, and we distribute this finite amount of truth over the entire space. We then manipulate this amount of truth by two rules, which tell us on the one hand how to get rid of one variable that we might not want to consider.
That's the sum rule, which says: if you want to know the probability of one out of many variables under consideration, then you sum or integrate out the values of all the other variables, and you're left with one probability distribution over this one variable, this one aspect of your problem. The second rule tells us not how to get rid of a variable but how to use one variable that you might be able to observe to reason about another one. That's called the product rule, and it says that to compute this object that we give a name to, a conditional probability distribution, we take the joint, the probability of both things being true, and divide by the probability of just one of them being true. We can combine these two rules, basically just plug one into the denominator of the other, and we're left with Bayes' theorem, which everyone has seen before, but now it's properly derived. I argued in lectures one, two and three that this is the fundamental mechanism for learning, for inference, for reasoning about quantities that you cannot directly observe from observations, from data, and that this is a universal mechanism that we can apply across all of science and all of computer science: whenever we encounter data and we'd like to reason about something we can't directly measure or observe, this is the right mechanism to use. We also immediately discovered that there's a problem with it, or actually two problems. The first one is that this paradigm, which allows us to keep track of many possible alternative hypotheses that might be true, also forces us to keep track of all of them; it doesn't just allow us to, it requires us to. And if you have several variables to keep track of, then the complexity of keeping track of all the possible values they could jointly take grows exponentially with the number of variables. Even if each variable only has binary values, if it's either one or zero, then for n variables we still have two to the n possible configurations that we need to keep track of at the same time. But actually, and that's the second problem, not only do we have this exponential blow-up, but typically we also want to keep track of variables that don't have binary values but continuous values, real numbers between minus and plus infinity, and of course there are uncountably infinitely many of those. So the base of this exponential complexity would already be an infinity; it's like an uncountable infinity to the number of variables we keep track of, and that seems completely useless. Well, it really is intractable, so we need mechanisms to phrase this entire process in a tractable fashion, and the entire rest of the lecture course was about how to find those mechanisms. We took a dive down through different types of models, different types of descriptions of probability distributions, all the way down to the algorithms we implement on computers, the low-level algorithms, the linear algebra itself, to realize this framework, this abstract thing, on an actual Turing machine like the one we have in front of us. The first step was to say: maybe we need to phrase those probability distributions over continuous variables, or in general continuous variables, in terms of some finite, tractable objects, some functions we can actually implement on a computer.
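To make the two rules and Bayes' theorem from above concrete, here is a minimal sketch on a tiny discrete joint distribution; the array and the variable names are made up for illustration and are not from the lecture code.

```python
import numpy as np

# A made-up joint distribution p(rain, wet) over two binary variables;
# rows index rain in {0, 1}, columns index wet in {0, 1}.
joint = np.array([[0.60, 0.10],   # p(rain=0, wet=0), p(rain=0, wet=1)
                  [0.05, 0.25]])  # p(rain=1, wet=0), p(rain=1, wet=1)
assert np.isclose(joint.sum(), 1.0)          # "truth has unit measure"

# Sum rule: marginalize out one variable.
p_rain = joint.sum(axis=1)                   # p(rain)
p_wet = joint.sum(axis=0)                    # p(wet)

# Product rule: conditional = joint / marginal.
p_wet_given_rain = joint / p_rain[:, None]   # p(wet | rain)

# Bayes' theorem: combine the two to reason "backwards".
p_rain_given_wet = joint[:, 1] / p_wet[1]    # p(rain | wet=1)
print(p_rain_given_wet)
```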
This led to a maybe surprisingly powerful but also tricky framework called exponential families. We looked at a type of probability distribution over a potentially continuous variable X which can be written in this form: in particular, it involves something that only depends on X, which is going to be easy to deal with, something that only depends on the parameters of the distribution at the end, and then a mixing term between them in an exponent, so an exponential of a function of X times a bunch of parameters. The important bit here, and it's a pattern that runs through the entire course, is that this is linear in the things that we have to deal with, the parameters: not in the thing that we get to observe or talk about, X, but in the thing that we need to describe it, w. All of these terms have names: this is called the sufficient statistics, this is called the natural parameters, this is called the log partition function, that's the base measure; they all have fancy names. And we discovered that probably all, maybe not all, but the large majority of the probability distributions you have encountered so far in courses on statistics and probability and machine learning and so on fit into this framework: the discrete distribution, the multinomial distribution, the Dirichlet distribution, the beta distribution, the Poisson distribution, the Gaussian distribution, the Wishart distribution, the beta and gamma and so on. They can all be written in this particular form. So why is this particular form so useful? Because it allows us to translate this abstract object, Bayes' theorem, into something you can actually do on a computer, through the insight that every exponential family has what we call a conjugate prior. For every exponential family, and this is, I realize, a complicated slide, so you'll have to look at it a little bit while you prepare for the exam, there is another exponential family which looks like this, constructed through some kind of algebraic argument, and which has the property that when you multiply it with this exponential family, so this is a distribution over x parameterized by w, that's a distribution over w parameterized by some other parameters alpha and nu, when we multiply these two together, as we have to in Bayes' theorem, then the resulting product is of the same algebraic form as this prior. It can be written in this form, with this structure, and we essentially have to sum up the sufficient statistics of the, well, let's call it the likelihood exponential family, and account for how many observations we have. And if we can do that, and being able to do this boils down to being able to evaluate this function F, the log partition function of the conjugate prior, then we can do everything we want to do with inference, or all the basic operations of inference: we can compute posteriors, we can predict future observations, that's the bit down here, and that's maybe all you need to know. You can reason about what you know about the things, the parameters, you would like to know, and you can predict next or future observations in x-space.
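A minimal sketch of this "sum up the sufficient statistics" update for one concrete case, a Bernoulli likelihood with its Beta conjugate prior written in pseudo-observation form; the function names and numbers are made up for illustration.

```python
import numpy as np

# Bernoulli in exponential-family form: p(x|w) = exp(phi(x) * w - log Z(w)),
# with sufficient statistic phi(x) = x and natural parameter w = log(p / (1 - p)).
def phi(x):
    return np.asarray(x, dtype=float)

# Conjugate (Beta) prior written in terms of pseudo-observations:
# alpha = sum of pseudo sufficient statistics, nu = number of pseudo observations.
alpha_prior, nu_prior = 2.0, 4.0     # corresponds to Beta(2, 2)

data = np.array([1, 0, 1, 1, 1, 0, 1])

# Bayesian inference = adding up sufficient statistics and observation counts.
alpha_post = alpha_prior + phi(data).sum()
nu_post = nu_prior + len(data)

# For the Beta family these map back to the usual shape parameters:
a, b = alpha_post, nu_post - alpha_post
print("posterior Beta(a, b):", a, b)
print("posterior mean of the success probability:", a / (a + b))
```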
So the question was: we could shift things around, we could decide, for example, to somehow move the base measure into the sufficient statistics and add a weight to that sufficient statistic that is just one, or we could reparameterize the w's. We could have a different set of parameters here, and actually that's quite common: in many of the distributions I just mentioned, people use other parameterizations in which w does not appear as a linear term but as something else, because it's convenient. And now the question is, doesn't this affect this kind of structure? If there is something else in here, doesn't this lead to a different F here? The answer is sort of yes and also sort of no, in the sense that, and actually it's probably best if you just try it out yourself, by changing in particular the parameterization of w we end up with a different conjugate prior. That conjugate prior is going to be parameterized in a different way, but at least asymptotically, in the limit of a large number of data points, this inference framework will converge to the same point, well, to the equivalent point in the new parameter space. So it's not a different description of the generative process of the data; the generative process of the data is fixed by the exponential family itself. If we decide to use those sufficient statistics and a particular base measure, no matter where we write that base measure, whether it's outside or in here with a parameter, then that describes how this learning algorithm will behave. But there's also a convenience aspect: under some parameterizations those integrals will be easier to write down. In general they will typically be equally feasible, because if you can do them in one parameterization you can do them in another parameterization by a change of variables, so it's not so much that you can't do the integral if you rephrase it, it's more that it's easier to see what the integral is if you rephrase it in the right way. And there's a third point, which gets me almost to my next slide: the main challenge with this framework, and actually also the power of this whole framework, is that it boils down the entire complexity of Bayesian inference, all of it, assuming someone gives you phi, to the problem of figuring out F, the log partition function. So back then I showed you some, I realize, maybe quite complicated code, which I'm still actually kind of excited about, because it shows that this entire process of what you could call classic statistics, you know, you write down some particular exponential family and then you realize that you can do cool Bayesian inference with conjugate priors, all of this boils down to effectively two steps. The first one is you choose the sufficient statistics phi, which is the core part, well, I mean phi, and then you need to know log Z, but basically the central part is phi, and that bit is describing what you believe about the world. Someone comes in, either you or someone else, and says: I believe this data x is generated through a particular distribution that I can parameterize in this way, so by picking a particular choice of w I can model how the data is distributed, I just don't know what w is. That's the modeling part, that's where the philosophy comes in. And then there's a second step which is totally mechanical if you like, so it's fixed once you've written down phi: then you need to know Z. Well, maybe you're lucky and you know Z, and then you know what the conjugate prior is, it's this thing, and now what you're left with, the computational part, is that you need to know what F is.
So what happened through most of human progress until, I don't know, the early 20th century or maybe even the late 20th century, is people just staring at the math on a piece of paper or a blackboard and coming up with a particular pair of phi and F. That's where all these old white dudes with their fancy distributions come from, Euler and Gauss and Dirichlet and Laplace and whatever they're all called: they just realized that there's a particular F they can evaluate. What we now have in 2023, or actually have had for a few decades, is these marvelous machines in front of us, which allow us to automate this process, make it much more powerful, and do this bit, this computing of F, in an approximate fashion. Back then in the lecture I introduced, for the first time, the tool that we kept using since then, called the Laplace approximation. The idea, and I'm going to get to another slide where I'll show it in more detail, but you'll remember, is that we expand the logarithm of this expression at the mode: we find the mode, that w star that maximizes this distribution, then linearize, or actually not linearize, we do a second-order Taylor expansion at the mode, and that gives us a Gaussian approximation to this distribution over w, for which we can compute this F in closed form, because we happen to know the Gaussian integral. And this is a partial answer, now coming back to your question: if you change the parameterization, if you choose a different choice of w, that indeed changes what this Taylor approximation at the mode is, and because the Laplace approximation is fundamentally local, it's just a Taylor expansion in this w space, it's not correctly transformed as a measure, so which Gaussian approximation you get actually changes if you change the parameterization. So if you do Laplace approximations, then what you believe about w will actually change with your parameterization. If you're not doing an approximate thing, then in some sense how you choose w doesn't matter, but if you're going to approximate, then it does matter, and you could think about good parameterizations for w, maybe even automatically chosen ones, which give really well-calibrated Laplace approximations. Incidentally, if anyone is interested in doing a master's thesis on this, let me know, I have some ideas for something to do. But now we've already talked about Gaussian distributions, and that's going to be our next step. We realized that there are all these different exponential families and they can be used for different purposes, but these purposes, the ones that we find in textbooks, tend to be relatively restricted; they're very specific to particular applications. They're very nice because these exponential families almost provide a standard library of distributions, like in programming languages, Python and the other languages, C as well, that come with standard libraries providing a bunch of functionality you might want to use, like integers and floats and so on. Similarly, exponential families provide a base case of Bayesian inference for very basic things: inferring individual probabilities with the beta distribution, inferring the value of a real number, just a single real number or a vector of real numbers, through the Gaussian distribution, inferring rates with Poisson distributions, and so on.
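A minimal sketch of this "standard library" view, assuming SciPy; the priors and data here are invented for illustration and are not the lecture's code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Inferring a probability: Bernoulli data, Beta prior -> Beta posterior.
flips = rng.binomial(1, 0.7, size=20)
p_posterior = stats.beta(1 + flips.sum(), 1 + (1 - flips).sum())

# Inferring a real number: Gaussian data with known noise, Gaussian prior -> Gaussian posterior.
noise_var, prior_mean, prior_var = 0.5**2, 0.0, 10.0
y = rng.normal(1.3, 0.5, size=15)
post_var = 1.0 / (1.0 / prior_var + len(y) / noise_var)
post_mean = post_var * (prior_mean / prior_var + y.sum() / noise_var)
mu_posterior = stats.norm(post_mean, np.sqrt(post_var))

# Inferring a rate: Poisson counts, Gamma prior -> Gamma posterior.
counts = rng.poisson(3.0, size=10)
rate_posterior = stats.gamma(a=2.0 + counts.sum(), scale=1.0 / (1.0 + len(counts)))

print(p_posterior.mean(), mu_posterior.mean(), rate_posterior.mean())
```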
But what we typically want to do these days is something more powerful: we want to learn functions that map from inputs to outputs, from data. And we realized that to do that, we have to dive even deeper into the hierarchy of how to implement computations on a computer and focus on one particular exponential family, the Gaussian one. Why? Because Gaussians, first of all, are the probability distribution, the exponential family, that is the base data type for vectors of real numbers; the conjugate prior, at least for their mean, is the Gaussian distribution itself, so there's a nice kind of closure where we don't have to go further; and, most importantly, they come with really convenient algebraic properties. So I showed this slide, which I've waved at a few more times since, which is another one of those slides that you really have to look at and think about for a while, and which should probably show up on your cheat sheets. It summarizes, in a mathematical way, what I phrased as a simple sentence, namely that Gaussian distributions map Bayesian inference onto linear algebra. If all the variables we care about in our reasoning process have Gaussian probability measures over their values, and their relationships with each other are linear, or affine in the most general sense, then all conditional and marginal probability distributions arising from the interaction of these variables, in particular also posteriors and evidences and so on, are all of Gaussian form, that's this big line down here, and the parameters that we need to compute to get this Gaussian form involve only linear algebra operations. That's this big fancy line down here of parameters for these Gaussian distributions, a mean and a covariance for any affine map of x, constructed from observing linear projections of x: all of these terms involve multiplications between matrices and other matrices in general, but sometimes also vectors, and then this fancy object in the middle, which we tend to write as the inverse of a matrix, but which actually means we're solving a linear system of equations. And this process of doing linear algebra is something that computers are really good at. They're really good at it because it involves summing and multiplying floating-point numbers, which these computers are good at, and it also involves algebraic structures that all of you are really good at, because you spent at least one semester, maybe two, of your studies learning everything there is to know about matrices and their properties and vectors. So we can use this mechanism, which now is really an algorithm that you can implement on a computer, to even learn functions. We discovered how to do that, we thought about it for quite some time, and then ended up with even more complicated math, so now we're quite deep down already in the computer, thinking about exactly what kind of computations we need to do, but we're not quite at the end of the dive yet. We realized that we can use this mechanism for learning a set, a vector, not just a set but a vector, of real-valued variables as a maybe not entirely satisfying but still quite powerful framework to learn functions, functions that map from an input x to a real-valued output y. We do that by deciding on a particular parameterization of the function: we choose a function, or actually a set of functions, phi, which you could call features or transformations, or maybe even link functions depending on where you're coming from, which take in an input x and compute a bunch of numbers, the evaluations of these phi of x for every individual feature, and then say the value of the function is a weighted sum over those features, weighted by w.
And we assume that we get to observe this function value with a little bit of Gaussian noise. Why? Because Gaussians are an exponential family whose conjugate prior for the mean is another Gaussian probability distribution. If we do that, then the resulting posterior distribution over the weights, over these parameters, because of the properties of Gaussians, is a Gaussian distribution over the weights, with a new mean vector and a new covariance matrix, which are annoyingly complicated expressions, but they are linear algebra expressions, so they are things we can implement in JAX or NumPy and just call. That distribution on the weight space also directly induces a distribution on the function space, on the output of f, by another linear projection: if we evaluate this f at some other x and we have a posterior over the weights, then we can marginalize out this distribution over the weights and get a distribution on the function space. We also realized, by the way, that we can write these distributions in two different ways: one where the matrix we need to invert is of size number of weights times number of weights, and another one which involves a matrix of size number of data points by number of data points. That was the first instance where we realized that sometimes, for the implementation, we really have to care about structuring our computational cost: if we have way more features than data, then it is better to use this form, and if we have way more data than features, it is better to use this form, because it saves time. We played quite a bit with this, and I really hope that you not just got the point of it but also enjoyed it a little bit: there is a lot of freedom in how to choose phi. We realized that we can take pretty much any set of functions phi acting on x, even discontinuous functions, even unbounded functions, really crazy choices of bases, and this framework will just always work, assuming that we take care to implement the linear algebra correctly, so that it can deal with situations in which this matrix, for example, is singular. And we had this quite elaborate piece of code with this Gaussian data type, or class, which had all these nice functionalities that you want from a Gaussian distribution: it can condition on observations, it can project onto other linear maps of the variable it is representing, it can compute log probability density functions and sample, and so on and so on. So this is actually, in some sense, quite a powerful framework, and I hope that you will not leave this course thinking, ah, that was just an intermediate step, I should never use that. Actually, this is maybe one of the most important machine learning algorithms out there. I keep being in talks and workshops, sometimes even at big conferences, where someone comes up with a really complicated deep learning procedure; I was at one last week, in fact the reason why I couldn't give the lecture myself was one instance of this, where someone gives a really complicated talk and then I sit there and think: this could really just be least-squares regression if you really wanted it to be. So one of the alarm bells to keep at the back of your head: if you go to some presentation and someone argues that they have a beautiful deep learning solution to some complicated problem, and they don't give clear evidence that this can't be done with simple least squares, then maybe you want to try it for yourself.
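A minimal sketch of that weight-space computation, assuming NumPy; the feature choice and the data are invented for illustration, and this is not the course's Gaussian class.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data from a noisy nonlinear function.
X = rng.uniform(-3, 3, size=30)
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)

# Features phi(x): here a handful of radial basis functions plus a bias.
centers = np.linspace(-3, 3, 9)
def features(x):
    return np.column_stack([np.ones_like(x)] +
                           [np.exp(-0.5 * (x - c)**2) for c in centers])

Phi = features(X)                      # (n_data, n_features)
sigma2, prior_var = 0.1**2, 4.0        # noise variance, prior weight variance

# Posterior over the weights: a Gaussian with these mean and covariance (weight-space form).
A = Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1]) / prior_var
S_w = np.linalg.inv(A)                 # posterior covariance of w
m_w = S_w @ Phi.T @ y / sigma2         # posterior mean of w

# Push the posterior through the features to get a posterior over function values.
x_test = np.linspace(-4, 4, 200)
Phi_test = features(x_test)
f_mean = Phi_test @ m_w
f_var = np.einsum("ij,jk,ik->i", Phi_test, S_w, Phi_test) + sigma2
print(f_mean[:3], np.sqrt(f_var[:3]))
```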
Why? Because these algorithms are very easy to implement, they're very easy to control, they're fully understood, it's just linear algebra, and they scale well: if you have finitely many weights, if you have, I don't know, 500 weights, then this matrix will never be larger than 500 by 500, so it's going to be very fast to use. And they produce uncertainty quantification, they have all these beautiful properties, they don't require stochastic gradient descent to work, there are no parameters to tune; they're actually a really cool tool. But we couldn't quite bring ourselves to be content with this model. We stared at this expression for a bit and realized that in these posteriors over the functions there are all these inner products; there are so many of them that it almost hurts the eye to look at the equation, so many inner products of phis with Sigmas that the pattern really jumps out at you. And this led to a very powerful observation, which is that what we actually need from our code isn't the ability to evaluate the feature functions, it's something more abstract: it's the ability to evaluate these inner products. If you have a piece of code that computes these inner products, no matter how it does that, then we can implement this algorithm. And this hints at a really powerful idea, which is that sometimes you might be able to do these inner products even if there are infinitely many features, because some sums can be done in closed form even if they have infinitely many terms. This leads to the idea of a Gaussian process. I'll just show this one slide for it, but of course there were several lectures on it. It's a framework for learning functions that has, arguably, an infinite amount of freedom, so that you can keep learning from one data point after the other. When I say an infinite amount of freedom, that's a bit of a dangerous statement to make, because in our first pedestrian, careful constructions of these covariance functions, these inner products also known as kernels, we realized that to make this work with actual features we not only have to increase the number of features to infinity, we also have to decrease the variance of each of the weights towards zero at the same time. So in some sense we have an infinite amount of freedom, but each degree of freedom is also infinitely small, so there is some price to pay. But that nevertheless means that these models are in some sense very powerful. In particular, we did a complicated lecture on theory in which I mentioned that these models can potentially learn any function, well, not any, but any function within a very large class, for example any continuous function, if you just give them enough data points. That doesn't mean that they will learn that function at a good rate, it might take a lot of data to reduce the error, but they are very flexible, they can learn any such function, and all we have to do to make this work is the linear algebra that was on the previous slide. So then we said, okay, linear algebra, how does this linear algebra actually work? And I realize that at this point this may have been a bit too much for you, I saw a lot of disappointment when I came up with slides like this, but maybe in hindsight, after the exam is over, you can appreciate that what we did here is actually useful.
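A minimal sketch of Gaussian process regression with one such closed-form inner product, the RBF kernel, assuming NumPy; the data and kernel parameters are made up, and the solve in the middle is exactly the linear algebra that the next paragraph opens up.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_kernel(a, b, lengthscale=1.0, outputscale=1.0):
    """k(a, b) = s^2 exp(-(a-b)^2 / (2 l^2)): an inner product of infinitely many features."""
    d = a[:, None] - b[None, :]
    return outputscale**2 * np.exp(-0.5 * (d / lengthscale)**2)

# Toy data.
X = rng.uniform(-3, 3, size=25)
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)
sigma2 = 0.1**2

# GP posterior at test points: everything is linear algebra on kernel matrices.
x_test = np.linspace(-4, 4, 200)
K = rbf_kernel(X, X) + sigma2 * np.eye(len(X))   # n x n Gram matrix plus noise
k_star = rbf_kernel(x_test, X)                   # cross-covariances
alpha = np.linalg.solve(K, y)                    # the linear system at the heart of it all
post_mean = k_star @ alpha
post_cov = rbf_kernel(x_test, x_test) - k_star @ np.linalg.solve(K, k_star.T)
print(post_mean[:3], np.sqrt(np.diag(post_cov)[:3]))
```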
What we did, and this is again just a placeholder slide, is we really tried to dive as deep as you possibly can in a class with a relatively theoretical approach, almost all the way to the silicon, not quite the chip itself, but down through the stack of algorithms, to ask what actually happens inside this linear algebra, what it actually does. And I think this is very useful and very important to do, because we tend to think of these algorithms as sort of primordial: they come from 1974 and they don't have to be questioned, they just do whatever they're supposed to do. But these algorithms, the Cholesky decomposition, the eigenvalue decomposition, conjugate gradients, were built for particular purposes, and in particular they were not generally built for Gaussian process regression; they were built for least-squares estimates, for example. And now we realize that these least-squares estimates are just one half of what we need: the least-squares estimate is the posterior mean of the Gaussian process, but there is this other thing, the posterior covariance, that we also want, the quantity that allows us to question the model, to draw from the posterior, to quantify uncertainty, and to optimize the model, and we'll talk about that in a moment. So when we opened up these tools for least squares, in particular first the Cholesky decomposition, we realized that it's actually possible to understand what these methods do. They contain a big for loop, this white part here, that goes through, effectively, the data set: it iteratively loads views on the data set and then computes the corresponding necessary parts of the kernel Gram matrix, the thing that is called K in this code up here. This K times S should be understood not just as an actual matrix-matrix product but as an abstract description that says: please compute this number z, which could be computed in some other way; for example, you might not actually want to build a matrix K at all, it might be some code that just computes the entries of K that are necessary. Then these yellow, mustard-colored bits are the process of loading data, and this here is some bookkeeping, and I say bookkeeping because these are cheap operations, they are all just linear cost in the size of S. And then we can use the result of this bookkeeping to enrich what the classic linear algebra methods do, to directly construct the two quantities that we need for Gaussian process inference: an estimate for the inverse of the matrix times the vector y, which is the thing we need to compute the point estimate, the mean, but also an estimate of the inverse of the matrix itself, which is the thing we need to compute the uncertainty. If we return this blue and greenish thing alongside a data structure that remembers how we've actually constructed it, then we have provided everything we need to do full Gaussian process inference: compute the mean and the covariance, and project that mean and covariance out onto arbitrary function evaluations. So these bits act on the training data, and the mustard-colored thing is for testing, for other points. And if we do that, then we don't actually need linear algebra anymore; this is our linear algebra now, it all boils down to this algorithm, and the main thing to think about is this bit up here: how do I actually load the data, which bits of the data should I load, the ones that I care about, and what does caring about actually mean?
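Seen from the outside, the two extreme policies named in the next paragraph correspond to two standard ways of solving the same system K v = y; a minimal sketch assuming SciPy, with a made-up kernel matrix.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.linalg import cg

rng = np.random.default_rng(3)

# A made-up positive definite "kernel matrix plus noise" and targets.
X = rng.uniform(-3, 3, size=100)
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2) + 1e-2 * np.eye(len(X))
y = np.sin(X)

# Route 1: Cholesky decomposition, one pass over the matrix, exact solve.
c, low = cho_factor(K)
v_chol = cho_solve((c, low), y)

# Route 2: conjugate gradients, an iterative method that only ever needs
# matrix-vector products K @ s, i.e. "views" on the data.
v_cg, info = cg(K, y)

print(np.max(np.abs(v_chol - v_cg)), info)  # info == 0 means CG converged
```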
In particular, we might want to decide to just go through the data points one by one, in an arbitrary order; that's the Cholesky decomposition. Or we could go through the data set by computing very informative linear projections, effectively projections along the eigenvectors of K. That process will be expensive, because this policy requires us to actually figure out what those directions are, but when we do it, the algorithm will converge as fast as it possibly could; it will have the optimal rate of convergence, so it will need as few views on the data as possible. This is maybe a way of constructing the most informative data set for this process, and that's called conjugate gradients. But these are just two points in the algorithmic landscape, and as you move forward after this course, into future semesters and into the years out there as a machine learning engineer, I predict that the problem of loading data efficiently is a concern that will come up sooner or later. For kernel machines, for Gaussian processes, you now know how to think about this; for deep neural networks, arguably no one really knows how to think about it yet, but there are smart people out there who are beginning to, and maybe you want to be among them. The process that we currently have, where you just randomly load some document from the internet and use it as a training point for your large language model, might not be the smartest thing to do, and I am guessing that in a few years we will think about it very differently, maybe based on these insights. So at that point we were really deep in the engine room of machine learning, at the very bottom, the stuff that goes onto the GPU, the algorithms that really do the heavy lifting. Then we pulled back a bit and said, okay, this is deep enough, the next thing down would be the compiler and just-in-time compilation and the assembly or whatever, let's stay away from that, let's go back out again. Because what we've constructed at this point is a way to learn functions that map from an arbitrary input X to a real vector Y, so multi-output, but still a real vector. And actually this seems very powerful on the input side: it can take in any X as long as you can find features that deal with it, so we can build models that map from spaces of graphs and languages and words to outputs. But the outputs have to be real-valued, and that seems really restrictive; maybe we want to predict something else, maybe we want to predict classes or words or proteins or something more structured, structured output prediction as it used to be called. So we started to think about this output problem, and to do that we first looked at a particular class of problems, classification, which actually provided a template that can be used more generally on pretty much arbitrary output types. It boils down to saying: we're going to stick with our framework of Gaussian process priors, so we'll assume that there is a latent function f over which we have this prior, but we will change the likelihood. We will say the observation Y can be of a general structure, it doesn't have to be a real vector, and we'll deal with the fact that it's not a real vector by inventing some link function sigma that links from the latent function f to whatever the output, the label, is. In particular, if the label is a binary number, plus or minus one, or zero or one, or whatever, left and right, green and yellow, then we can use a sigmoid function as the link function.
If it's a multi-class problem, we use the softmax; if it's something more structured, we use something else; for example, if the output is a rate, a number of counts that goes from zero to infinity, we use a Poisson distribution. Whatever, you just come up with your favorite transformation. This only raises one problem, which is that it breaks the beautiful algebraic structure that we've used so far, and we can't do closed-form inference anymore using just linear algebra. We weren't quite willing to give up on that, so we used a sledgehammer to make it be linear algebra again, and that sledgehammer turned out to be the Laplace approximation. So we find the mode of this posterior; finding this mode requires some numerical optimization, and I admittedly didn't really talk much about how those numerical methods work. Why? Because there's lots of opportunity here in Tübingen to learn about them: maybe you've been to the deep learning class by Professor Geiger, and I've talked about some simple stochastic optimization methods; if you really want the detailed view, next term there's going to be a lecture course by Professor Hein on optimization, and I'm pretty sure he'll talk about everything you can think of in optimization, certainly much more than I can. So if you really want to know how to do this bit, I've outsourced it to Matthias Hein next term; it's a beautiful piece of algorithmic thinking and a really rich literature to think about. But let's just say we have something that finds this f hat. Then, locally at f hat, we can do a second-order Taylor expansion of the log posterior distribution. That Taylor expansion involves a constant term, a linear term, which is hopefully zero if you're actually at the mode, and a quadratic term, and if the logarithm is a quadratic function, that means that the probability distribution itself, the exponential of this, is the exponential of a negative square, and that's a Gaussian distribution. So we construct an approximate Gaussian distribution which is centered at the point estimate and has an error estimate, which is actually more like a sensitivity map, given by the inverse of the Hessian of the loss function. After that, everything is linear algebra again. We even spent half a lecture on how to use this structure to build a particular optimization method, Newton's method, which is an extreme case, simultaneously very powerful but also, in its individual steps, quite expensive; it's maybe the optimization algorithm at the extreme end of expensive but extremely informative individual steps. We saw that doing this for a simple data set actually led to a massive drop in run time compared to gradient descent, so my main message there was: don't use gradient descent without questioning it. It might still be a useful thing to do quite often, but don't just use it without thinking about it. And this also gave us a road into deep learning. We said, well, at this point we just have a general output loss, which is just the logarithm of this likelihood, and we have a prior over the unknown function, which is something with a quadratic term, and these are all things that we can do in deep learning models as well, finding modes, computing Hessians. So that actually led to a framework to construct Gaussian process posterior distributions from pretty much arbitrary deep neural networks.
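A minimal sketch of that Laplace construction for binary GP classification, assuming NumPy and SciPy; the generic optimizer call and the toy data are illustrative choices, not the lecture's implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Toy binary classification data and an RBF kernel Gram matrix.
X = rng.uniform(-3, 3, size=40)
y = (X + 0.3 * rng.normal(size=X.shape) > 0).astype(float)   # labels in {0, 1}
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2) + 1e-6 * np.eye(len(X))
K_inv = np.linalg.inv(K)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# Negative log posterior of the latent function values f (up to a constant):
# - sum_i log p(y_i | f_i)  +  1/2 f^T K^{-1} f
def neg_log_post(f):
    return -np.sum(y * np.log(sigmoid(f)) + (1 - y) * np.log(1 - sigmoid(f))) \
           + 0.5 * f @ K_inv @ f

# Step 1: find the mode f_hat (a Newton iteration would also do).
f_hat = minimize(neg_log_post, np.zeros(len(X)), method="L-BFGS-B").x

# Step 2: second-order Taylor expansion at the mode gives a Gaussian approximation.
W = np.diag(sigmoid(f_hat) * (1 - sigmoid(f_hat)))   # Hessian of the negative log likelihood
post_cov = np.linalg.inv(K_inv + W)                  # Laplace covariance over f
print(f_hat[:3], np.sqrt(np.diag(post_cov)[:3]))
```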
So now we're quite close to where we are at the moment in the course, so I can speed up a little bit. We realized that we can take what we just did for Gaussian process priors with logistic likelihoods and apply the same framework to every deep neural network. When I say every, I mean every neural network for which, first, you can compute gradients, which is pretty much every neural network, because if you can't compute gradients you can't do deep learning, and therefore you can compute Hessians, at least assuming that everything is twice continuously differentiable; and secondly, for which this little l, the empirical risk, actually is the logarithm of a likelihood. That is a bit of a constraint, but most empirical risks actually fulfill it: the cross-entropy loss and the L2 loss, in particular also for multi-class classification, are logarithms of likelihoods, so we are fine. This works as before: we train the net, we find the optimum, at the optimum we compute the Hessian, we invert it, and we linearize the network. That's the new extra step that we now need to do, because we don't have a Gaussian process prior on f anymore, but only on the weights; this involves the Jacobian. And we're left with a Gaussian process that replaces the output of your neural network. It says: instead of just this trained neural network, I've now realized that this is a point estimate which I can think of as the mean of a Gaussian process, and it is surrounded by this sensitivity map that I call the Laplace tangent kernel. Of course this raises some computational questions: how expensive is it to do that, what kind of matrix do I actually want to build here for Psi, probably not the full Hessian for the entire network, because if it's a big network that's not tractable. But you did some homework and saw that there are lots of cool approximations for these curvature matrices that can actually be used in practice. So, having been down in the engine room, we've now moved out and walked over to what people do out there with the new, hyped-up deep learning stuff, and found that probabilistic reasoning applies there as well; we can make it basically conform with the stuff we've done so far. In the last two lectures we then opened up a third direction to go into, which was: now that we know how to do linear algebra and how to do deep learning, what about data sets that have other kinds of challenges, other kinds of structure that we need to think about? One of them is temporal structure. What if you have data that comes in as a continuous stream that potentially never ends? Then we can't use this paradigm anymore of a data set that I cut into batches and then do gradient descent on until I'm converged, and then I'm done, and then I can do linearization, Gaussian processes, whatever, with a break between training time and test time. It's a new kind of structure, and we found that there is this beautiful idea, actually already quite old, a hundred years old or so, or even older, which describes temporal structure in terms of a finite amount of memory that doesn't get constrained more and more by the data, but which actually evolves with the data across time. That's called a Markov chain. You've seen this graph now many times, with a bunch of latent variables that are called the state, which have a description for how they change over time, that's the chain up here, the chain that changes the states, and also a description for how we observe them at every point in time, those are these observation models down here.
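For the linear-Gaussian choice of those two right-hand sides, which the next paragraph identifies as the Kalman filter, one predict-update step looks roughly like this; a minimal NumPy sketch with made-up model matrices and data.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up linear-Gaussian state-space model:
# state transition x_t = A x_{t-1} + q,  q ~ N(0, Q)
# observation      y_t = H x_t + r,      r ~ N(0, R)
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # e.g. position plus velocity
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])               # we only observe the position
R = np.array([[0.25]])

m, P = np.zeros(2), np.eye(2)            # current belief over the state
ys = np.sin(0.1 * np.arange(50))[:, None] + 0.5 * rng.normal(size=(50, 1))

for y in ys:
    # Predict: push the belief through the dynamics (O(1) per step, never looks back).
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update: condition on the new observation (Gaussian inference = linear algebra).
    S = H @ P_pred @ H.T + R
    K_gain = P_pred @ H.T @ np.linalg.inv(S)
    m = m_pred + K_gain @ (y - H @ m_pred)
    P = P_pred - K_gain @ S @ K_gain.T

print("final state estimate:", m, "with marginal std:", np.sqrt(np.diag(P)))
```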
So with these two lines we can describe a class of models by picking different right-hand sides, which, as we saw, will create an algorithmic structure called filtering and smoothing that allows O(T) time inference for T observations; each individual local update is O(1), it doesn't require looking further ahead into the future, or back into the past, than the immediate neighbors. And if we again reuse all the stuff from deep down in the engine room, all the Gaussian distributions with linear algebra, then this algorithm simplifies to something very important called the Kalman filter and the RTS smoother, which is a base class for probabilistic inference and signal processing in dynamical systems. So with that, we now have a toolbox that you can use when you leave this lecture hall after term ends, to apply to a large class, not a universal class, but a large class of machine learning problems. It's fundamentally built around the mathematical insight of Bayes' theorem and the sum rule and the product rule. Whenever someone gives you an inference problem, no matter what, your first thought is Bayes' theorem: what do I need to do? I have to write down the probability of everything; all the variables that I get to observe and that I want to make a statement about have to go into a probability distribution. Then I have to start thinking about which parts will be computational challenges: what kind of algebraic forms do I give to these distributions p such that, on the one hand, memory cost and compute cost don't accumulate too badly as I get more and more data, and, on the other hand, the individual steps of conditioning actually remain tractable? We have now found various really cool ideas, both on the modeling side and the computational side, to deal with these challenges. A knee-jerk thing to do is to try to use Gaussian distributions as much as possible and linear relationships between them; we saw that these linear relationships can be a pretty rich language, because you're allowed to use feature functions and least-squares estimation. We can also use Markov chain models, Gauss-Markov models, for inference in settings where the data is infinite, where it keeps coming in as a stream. And we can use the combination of autodiff and linear algebra called the Laplace approximation to construct general algorithms that work on large classes of problems. But it's now time for a break, and after the break I will try to round off this toolbox a little bit by talking about what you do with the parameters of the model that really don't seem to fit into this setup. So let's continue at 12 past. I could have said that now we're done, well, I kind of did say that: now we have all these cool tools, all these model classes. But there are a few more elephants in the room, which I'll try to address, and I won't actually have time to do all of them in this course; there'll be some things that will have to fall off at the end, and maybe in the last lecture I'll just point in a few directions and say these are more things we could have done. One question that is actually quite interesting to think about is the following: on the last slide we had lots of different model classes, but within each of these model classes there's still a whole range, you can choose the model quite freely. If you choose a Gaussian process, we need to decide which kernel to use and what to set the parameters of this kernel to; remember the example we did in lecture 10, I think, with the CO2 curve, there were all these numbers we had to set.
And if you use a deep neural network, we had these plots and we saw that the uncertainty has all this complicated structure; maybe we would want to ask ourselves what kind of nonlinearity we actually want to use, how many layers the network should have, how wide these individual layers should be. These are all questions that the previous slides didn't actually provide answers to, because they are about parameterizations of the model, which, kind of by definition, we don't want to be probabilistic about. So when we did this example with the CO2 curve, I briefly rushed past a slide like this, and I realized that it was maybe a bit too fast, so I thought today we'll use another 20 minutes to be a bit slower and think about it. The abstract problem really is: we're going to do inference on some latent function f, this is the thing that we ultimately care about, we want to use it to make predictions about the world, that's the output we're interested in, the latent variable. What we have is training data, pairs of inputs and outputs, x and y, and of course we care about them because we have them, but we also don't need to worry very much about them, because they're just there, they sit on the hard drive. We don't question them; at least in probabilistic reasoning, the fundamental paradigm is that the data is just given, we literally say 'given'. But then there are additional degrees of freedom that we could tune, and I had this slide that says you could sort of think about data, then something called variables, and then things that are either parameters or hyperparameters. These are the things that you don't care about, if you like, but which you still need to set to make things work. Sometimes people call them nuisance parameters, things that just happen to be in your model because otherwise you can't write it down, but you don't actually care about them, you just want the whole thing to work, you just want to set them somehow. During my PhD I once met someone who called them Stradivarius factors, because you fiddle with them. So these are the bits that, well, you could turn the sentence around: instead of saying this is the bit that I don't know how to deal with, the stuff that I don't know how to deal with I just call a parameter. Because if I knew how to deal with it, one of two things would be true: either I would care about it, it's part of the description of the world, and then it should move into the f bit, it should become part of f; or I don't care about it, but I know how to deal with it, and knowing how to deal with it in probabilistic reasoning means integrating it out, using the sum rule to just get rid of it, and then you never have to think about it again. But some of these parameters we just don't know how to integrate out, because the corresponding integrals are very intractable, and then we have to set them somehow. So how do we do this? Well, it turns out that there is a fundamentally correct way to do this, which only works on paper, and that fundamentally correct thing to do is to stare at Bayes' theorem and realize that, if we are explicit about theta, then the normalization constant of Bayes' theorem, which I've so far often kind of waved away, can be used to answer exactly this question about theta. So far, when we encountered inference problems, I always said, well, what's the right way to deal with an inference problem?
Bayes' theorem. Okay, so up there, where the marker is, that's Bayes' theorem. We care about the unknown thing, the function f, over which we want to have a Gaussian process prior, and some data y, and for that we need to take a prior on f, multiply it with a likelihood for y given f, and then normalize by the evidence, p of y, that's the one without the f, where we've integrated out f. First of all, there's usually this problem that the normalization constant might be tricky, but if everything is Gaussian it's fine, it's just closed form, so we don't even talk about it, we just let it drop by the wayside; it's a Gaussian distribution, so it's normalized by construction, and this thing we don't actually have to write down because it's automatically dealt with by our Gaussian library. But if you look at this and now realize that there is a model in this whole thing, a model described by those parameters theta, then those theta show up everywhere, on the right, in red; they are always on the right-hand side of every conditional. The whole thing might depend on theta. Maybe you're lucky and theta only shows up in the likelihood, or only in the prior, but in general it's just everywhere, and it's also in the normalization constant. And now, this normalization constant, what actually is it? Well, it's a probability for the data y given theta, also given X, but X is given, it's literally given, we know what it is, so we don't have to worry about it, the inputs are just there. Theta we don't know. So what this is, is a likelihood for theta, and if we had it, and in Gaussian process models we actually do have it, we could use it in another instance of Bayes' theorem that says: take this object from up here, multiply it with a prior, normalize, and that's our posterior over theta. In principle, that's the correct answer to what we know about theta. But I say in principle, because first of all this object might not even be tractably computable; for Gaussian processes it actually is. And secondly, even if this is available, the thing down here probably isn't, because it's another integral of even more complicated structure, where every individual term in the integral, the integrand itself, is already an integral, there's an integral hidden up here as well, so it's like a two-level integral. But in principle this is what we'd like to do, and now we just see how far we get. By the way, this approach to fitting parameters has different names: it's sometimes called maximizing the marginal likelihood, sometimes it's called type-2 maximum likelihood, because maximum likelihood would be finding the f that maximizes this expression, and type-2 maximum likelihood is finding the theta that maximizes this expression where we've integrated out the f. It's also, maybe more importantly, historically called the evidence framework, due to a PhD thesis by David MacKay, who spent the entire first five years of his career dealing with this kind of question. And in some cases there is actually a closed-form answer to it. In Gaussian process regression, back in lecture 10, when we did this experiment with the CO2 curve, we realized that for Gaussian process models, if we specifically use a Gaussian process prior, then we can stare at this big slide with the Gaussian math that I had a few slides ago and discover that when we multiply the Gaussian process prior for f with a Gaussian likelihood for y given f, then we don't just construct a posterior. So here is our prior, for the weights in the case of a parametric function, or the prior for f, times a likelihood for y.
Then we don't just get the posterior, sorry, here it is, this one, which we have so far used for regression; we actually also get the normalization constant, the thing that goes in the denominator of Bayes' theorem. You could literally divide this over here and then you have Bayes' theorem. And it has an explicit form: it is an object that is, I would say, of Gaussian form, because we can write it in this form with a curly N, but as a function of theta it's not a Gaussian distribution at all, it's just something we can write down, and it depends on theta in a nonlinear fashion, because theta enters all the terms that are in there: it enters the features phi, for example, or, for Gaussian process models, it enters the mean function, the kernel, and maybe even the noise term. Now what we could do is come up with a nice prior for theta to multiply here and find a joint posterior, but this thing will have a very complicated algebraic form in theta. What we can do instead is find the mode of this object, this object which is a likelihood; we can also take the logarithm of it and find the mode of the log likelihood, and we could even add a prior to regularize further. There's actually a slide on this, which back then I went through a little bit too fast, so let's do this now. We look at this expression from the previous slide, that's the thing that was here, this green thing from over here, here it is again: this marginal likelihood is the likelihood times the prior, and then you integrate out the unknown function. This is possible for Gaussian distributions in closed form, and it gives us this term, which doesn't depend on f, times the posterior over f, which is a probability distribution, so if you integrate over it, it's just one. And the first term is the bit we now have to deal with. We do what we've done so far with the other maximum a posteriori type estimation problems: we might as well maximize the logarithm of this, because the logarithm is a monotonic transformation and so it doesn't change the location of the mode; we could also put a minus in front and minimize instead, because maximizing a function is the same as minimizing minus that function in terms of where the optimum lies. And then we just write out what this thing actually is, the logarithm of a Gaussian distribution, the log PDF that you can also find in our Python code. The reason to do that is to stare at it and think a bit about what we're actually doing when we optimize this object. Remember, the Gaussian is one over the square root of two pi to the d times the determinant of the covariance matrix, times e to the minus one half of a quadratic form; this is the logarithm of it, and now we've dragged in these thetas, these quantities that might affect our model, and we've just left them in everywhere. The first thing we see is that there is a constant term here at the end; it just involves how many data points we have and two pi, so this bit we really don't have to care about, it's constant, we can't change it by changing theta, it's just a number. Then there are two terms left. The first one actually measures how close the data is to the prediction under the prior, not the posterior but the prior, scaled by the inverse covariance under the prior.
This is, as I said back then a few times, how far the data is from what you predicted it to be under the prior, scaled by how uncertain you want to be about those variables; how surprised are you when you end up this far away from where you thought it would be, squared. So that's a number, and what do we want that number to be? We even did a homework exercise on this: maybe we don't want it to be zero, actually, because this is supposed to be a probability distribution, we want this term to be about one, or n in total for n observations. If you just minimize this expression naively, that actually sounds a bit dangerous, because we could just pick mu to be exactly equal to y, or we could set this variance to something very large. Of course, if you have a model for the mean that is so flexible that we can just make it equal to the data, okay, fine, then we're screwed, but even if this is a simple function, a constant function for example, we could set it to the mean of y, and then we just scale this by some really, really large variance, and then this whole term becomes very, very small. Nice, so theta could just be chosen such that this thing has very large variance. Oh, there's a bug in this slide, thanks for pointing it out: these two shouldn't both be there, just pick one of the two, I think both of them make sense on their own but not together, so either take this one or this one; this one would be for a Gaussian process model, where you just have a mean function, and for a parametric model you just have a bunch of features. Okay, that's a stupid bug, but the interesting bit is actually the matrix in the middle. Typically, for Gaussian process regression models, we don't even want a particularly strongly parameterized mean, we just want it to be maybe even zero, or just a constant. What we do care about is this kernel and its parameters; this is the thing that really matters, and this is where this extra term becomes relevant. If we just wanted to do type-1 maximum likelihood, if we just wanted to maximize this expression with respect to f without computing a posterior, so just this bit, this one Gaussian in here, then we could do that by making the variance very large, because then that term would just become very small. But since we've integrated out the posterior here, we get this extra term over here. And what is this thing? Well, it's the log determinant of, essentially, the kernel Gram matrix plus some noise, and the whole objective is supposed to be minimized, so if we make the variance very large, then we are adding something to the quantity that we want to be minimal. Remember that k of X, X, and k of X, X plus the noise term, are positive definite matrices, so this determinant is a positive number, because determinants are products of the eigenvalues, and if the eigenvalues are all positive then the product is a positive number. We then take the logarithm of it, which might be a negative number, but it's well defined, it's always possible to write down. And if you make the variance very large to make the data-fit term very small, then at some point this log determinant starts to dominate, and there will be an optimum to choose.
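A minimal sketch of exactly this trade-off, assuming NumPy and SciPy; the kernel, its parameters, and the data are invented for illustration. The two competing terms, the data-fit term and the log determinant, are spelled out in the comments.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=40)
y = np.sin(2 * X) + 0.1 * rng.normal(size=X.shape)
sigma2 = 0.1**2

def neg_log_marginal_likelihood(log_lengthscale):
    """- log p(y | theta) for a zero-mean GP with an RBF kernel."""
    ell = np.exp(log_lengthscale)
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / ell)**2) + sigma2 * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(K)
    data_fit = y @ np.linalg.solve(K, y)   # how far the data is from the (zero-mean) prior prediction
    complexity = logdet                    # the Occam / model-complexity term
    const = len(X) * np.log(2 * np.pi)     # does not depend on theta
    return 0.5 * (data_fit + complexity + const)

# Type-2 maximum likelihood: tune the hyperparameter against the evidence.
result = minimize_scalar(neg_log_marginal_likelihood, bounds=(-3, 3), method="bounded")
print("evidence-optimal length scale:", np.exp(result.x))
```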
This observation, that some kind of regularization emerges even without a prior on theta, is sometimes called a penalty for model complexity, or the Occam factor. You might have heard of Occam's razor; who has heard of Occam's razor? Everyone. So you've heard of this guy, maybe the oldest person to show up on slides in my lectures: William of Occam, a Catholic theologian of the High Middle Ages, born in Occam, died in Munich, and there is a street named after him there, the Occamstraße. Why? Because he had a really complicated life. But it's the 1200s, so I think for us it's pretty much impossible to understand what his life was like. He was one of those monks who argued, long before the Reformation, for poverty of the clerics, and he got into trouble with the Pope because he argued a bit too insistently that the Pope should be poor as well. That's why he had to leave Avignon, was excommunicated, and moved to Bavaria. Strange, and difficult to understand from our perspective.

He also wrote philosophical treatises. It seems that back then, in the High Middle Ages, there were lots of complicated flourishes in people's thinking, no math whatsoever, no order in thought, and he tried to clean things up a bit: he felt that just as people were becoming too decadent and had too much money, they had also become too complicated in their thinking, and he wanted everything to be clean and poor and restricted, stoic maybe even. In these treatises on philosophy, the actual quote is a bit involved and in Latin, he essentially said that you don't need to drag around several possible explanations for one thing; if you can find one simple explanation for it, then that should be enough. People over the nearly thousand years since then have translated this into the idea that we should have models that are simple, that don't have so many degrees of freedom.

In particular, you can think of situations in which we find one single explanation, a rank-one term, a single feature, that explains all of the data y perfectly. Then this is a rank-one matrix, its determinant is zero, and the logarithm of it is minus infinity, which is optimal. So if you can find a description of the data that doesn't even need noise, that is a wonderfully perfect explanation. Of course, in reality that is typically not going to be possible, unless we make the model too flexible, unless we choose theta such that we can always learn a rank-one decomposition of the data. And if we do that, then we shouldn't use this framework; we should instead put a prior on theta, to say that we don't want classes of models that can learn everything, that can fit every finite data set, because then they won't generalize. So, and this is actually the main point, this does not mean that you never need a prior on theta, and it does not mean that we should choose arbitrarily complicated models to describe what's going on. This term by itself does not fix the problem of overfitting; it just helps a little bit.
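A small numerical illustration of that rank-one argument (my own toy example, not from the slides): if a single feature explained the data without noise, the Gram matrix would be rank one, and the Occam term diverges as the noise goes to zero.

```python
import numpy as np

# Toy illustration: rank-one "kernel" matrix from a single arbitrary feature.
phi = np.linspace(-1.0, 1.0, 5)          # one feature evaluated at 5 inputs (made-up values)
K = np.outer(phi, phi)                   # rank-one Gram matrix

for sigma2 in [1.0, 1e-2, 1e-4, 1e-8]:
    _, logdet = np.linalg.slogdet(K + sigma2 * np.eye(5))
    print(f"sigma^2 = {sigma2:g}, log-det = {logdet:.2f}")   # decreases without bound
```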
Now, this Gaussian formalism, when we translate it to deep learning, can actually also be used for model learning there. What I've shown you so far are these slides from lecture 10, with the Gaussian process models, and you will remember that back then I had an actual piece of code that learned about 10 or 12 different parameters, length scales and output scales of different physical processes in the CO2 curve. Since then we've moved on; we've decided to use Gaussian distributions everywhere else too, in deep neural networks for example, and it turns out that you can use the exact same framework there as well: you just inject a Laplace approximation halfway through to make everything Gaussian again.

Here's how this works. Let's say we have a general p of y given f, for example a sigmoid likelihood, a logistic loss function, or a cross-entropy loss, and some prior over f. In particular it could be a Gaussian prior, but it works even if the prior is not Gaussian. Then we do Laplace: we find the mode of this expression with respect to f, call that mode f star, by maximizing the logarithm of the posterior, and then we do a local Taylor expansion, so we compute our Hessian again and get a local approximate quadratic expansion in log space. So p of y given theta is the integral over prior times likelihood, and prior times likelihood is, up to second order, e to this quadratic expression. Now we notice that f star is treated as a constant, so everything that depends only on f star can move outside of the integral: this bit does not depend on f, because f is clamped to f star, and here f is also fixed to f star, so we take those out of the integral, and of course we combine the exp and the log to get prior times likelihood again. What we are left with is an integral over a quadratic form, and that integral is just a Gaussian integral: a two pi raised to half the dimensionality shows up, and then, importantly, the log determinant of Psi. This is now an expression we can actually optimize for parameters: both f star and Psi depend on the parameters, we can tune them and just keep evaluating this "integral", which isn't actually an integral anymore; we have translated integration into differentiation, into second-order expansions.

As a side note, if p of f, the prior, is actually a Gaussian distribution, then we can simplify a little further: this bit becomes particularly easy, we are just left with the logarithm of the Gaussian and a quadratic form, and the entire expression can be written more succinctly like this, where the working piece is the log likelihood, minus a quadratic term, minus a combination of the Hessian of the loss and the Hessian of the prior, which is just K inverse; we briefly saw and used this matrix B in Gaussian process classification.

So this is actually a very general way of finding good models: give me any model, as long as it is twice continuously differentiable and I can think of the loss function as an actual log likelihood, I can always do this, I can just let it run. I need to move to the final slide. You can do this on pretty much any model you can write down, including a deep neural network. The only price you pay is the question of how you then tune this thing; well, with an optimizer somehow, so you need to compute a gradient of this expression with respect to the parameters you are trying to optimize. Those parameters might show up in K, in B, in W, in here as well, depending on what you want to optimize, and maybe the parameters are even discrete, in which case it's a bit more difficult to do this optimization.
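A minimal sketch of that recipe in code, with hypothetical names of my own (log_joint for the log of prior times likelihood, hess_log_joint for its Hessian): find the mode, expand to second order, and read off the Gaussian integral.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_log_evidence(log_joint, hess_log_joint, f_init):
    """Laplace approximation to log p(y | theta), the log of the integral of
    exp(log_joint(f)) over f.

    A sketch under the slide's assumptions: log_joint(f) = log p(y|f, theta)
    + log p(f|theta) is twice differentiable, hess_log_joint(f) is its Hessian.
    """
    res = minimize(lambda f: -log_joint(f), f_init)    # find the mode f*
    f_star = res.x
    D = f_star.size
    Psi = np.linalg.inv(-hess_log_joint(f_star))        # covariance of the local Gaussian
    _, logdet_Psi = np.linalg.slogdet(Psi)
    # log of: p(y|f*) p(f*) * (2 pi)^(D/2) * |Psi|^(1/2)
    return log_joint(f_star) + 0.5 * D * np.log(2.0 * np.pi) + 0.5 * logdet_Psi
```

Whatever hyperparameters theta sit inside log_joint (in the kernel, the noise, the network) can then be tuned by differentiating this scalar with respect to them.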
But it is a general framework. I don't want to end here, though; I actually want to introduce one more little algorithmic trick that will lead us into the last two or three lectures. It can only sometimes be applied, but when it can be applied it is very powerful. Oh, I just realized I forgot to upload the slides, and no one told me; okay, they'll be on ILIAS in five minutes. It's similar to when we did GP classification: there we first talked about gradient descent and said, okay, I really don't know how to optimize this thing, I'm just going to compute a gradient and follow it, and then later we realized that we can actually get the Hessian in nice closed form, numerically stable, so maybe we can do Newton optimization. It took us a bit of time to get there, but once we had it, it worked so much better than gradient descent, something like a thousand times faster, and it just converged all the time.

Here is a similar situation. In general we would like to compute the marginal evidence, or log evidence, and optimize it. What is this thing? Well, it's the logarithm of a big integral over a bunch of quantities. I should introduce what those quantities are: for this slide, I'm assuming there is a model with parameters theta, observed data y, and I've renamed the latent variables to z, because f is a bit too suggestive of regression, and for regression we now know how to do it; here it could be anything. This is the thing you would like to do: get rid of all the variables you don't know, and then optimize for theta. We realized we can do this in principle with Laplace, which is always possible, but then we are faced with a complicated optimization problem.

There is, however, a particular structure that turns out to be surprisingly useful, which we are going to study certainly on Thursday, for the entire next lecture, and maybe even longer, and which you also need for this week's homework. So I want to spend the last five minutes on these three lines to make sure you understand what the idea is. It is the following. Instead of trying to compute this thing, which you don't know how to compute in closed form, and instead of approximating it with a quadratic term, which is then just an approximation and might be wrong, see whether you can evaluate this other expression. At the moment this expression just falls from the sky; you have to believe me that it is an interesting thing to compute. That is the problem with this algorithm: everyone who has to teach it must decide either to spend two lectures on complicated theory that no one understands before getting to the actual algorithm, or to just show you an algorithm where you don't yet understand why it's useful, but can at least understand what the computation is. The latter is what I'll do now.

So what we are going to compute is an expected value, an integral of a function against a probability distribution: the expected value of the logarithm of the joint. Notice that the difference between up here and down here is that the logarithm is now inside the integral, not outside. That is a genuinely different thing, because the logarithm is a nonlinear function; you cannot just drag it in and out of the integral. But if you put it inside and then take the expected value against the posterior distribution that arises from the model for a particular, fixed choice of parameters theta star, something interesting happens.
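Written out in my own notation, the two objects being contrasted are the log evidence, with the logarithm outside the integral, and the expected log joint, with the logarithm inside:

$$\log p(y \mid \theta) = \log \int p(y, z \mid \theta)\, \mathrm{d}z
\qquad \text{versus} \qquad
q(\theta; \theta^\star) = \mathbb{E}_{p(z \mid y, \theta^\star)}\big[\log p(y, z \mid \theta)\big] = \int p(z \mid y, \theta^\star)\, \log p(y, z \mid \theta)\, \mathrm{d}z .$$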
In some cases, and we will see a few, these posteriors can actually be computed, and the logarithm of the joint is a relatively simple algebraic expression. For example, if the joint is Gaussian, then its logarithm is a quadratic function, and if the posterior is Gaussian, then the integral of a Gaussian against a quadratic form is a closed-form expression; it's another one of those things you can find on slides about Gaussians. Then what you do is maximize this expression with respect to theta: we keep theta star fixed and maximize with respect to theta, for example by computing a gradient with respect to theta and taking a step. And then you keep doing this: you update theta, which gives you a new theta star and thus a new expression of this form, which you again optimize in theta, and so on. So this process iterates between computing this q and maximizing it. This q is an expectation, and therefore the algorithm is called expectation maximization.

Again, who has heard of EM? Ah, okay, quite a few, good. So here is EM again; have you seen it written like this? Have you seen it in a particular application? Nobody knows? Okay, tell me in the feedback, otherwise I might tell you something next Thursday that you already know. So here is the algorithm again, just rewritten: keep alternating between computing this expectation and maximizing it.

Why is this a useful thing to do? Why would you even want to do this? We need two pieces of insight to understand that, and we'll only do the first one now, the more important one on Thursday. The first is to understand why this is even something you can do, and why it can be fun to do that kind of math; the other is why it is a useful thing to do, in which sense this is a correct algorithm to use, and that we'll need Thursday for. To answer the first question, why you can even do this and what it gives you, we'll look at Gauss-Markov models, and your homework will be exactly this, actually.
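To see the alternation on a concrete, runnable toy case before the Gauss-Markov setting, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture. This is my own illustrative example with made-up data, not the lecture's, though mixture models may come up on Thursday.

```python
import numpy as np

# z is the latent component label, theta = (weights w, means mu, variances var).
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])  # synthetic data

w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: posterior p(z | y, theta*) under the current theta* (the responsibilities)
    logp = -0.5 * ((y[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var)) + np.log(w)
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximize the expected log joint q(theta; theta*) in closed form
    Nk = r.sum(axis=0)
    w = Nk / y.size
    mu = (r * y[:, None]).sum(axis=0) / Nk
    var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / Nk

print(w, mu, var)   # should roughly recover the generating parameters
```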
Someone asked me during the break: in these LTI models, these linear time-invariant models where I do Kalman filtering and smoothing, how would I know how to set the parameters of the linear time-invariant system, these magic matrices called A and Q and H and R? Well, you set them with EM, and here is how. Remember that for our Gauss-Markov model we assumed the model factorizes into this nice chain: x at time t given x at time t minus 1, for all times t, that's this bit, initialized at t0, and then local observations, y at time t given x at time t. Theta now is exactly those matrices, exactly the things we need to describe our linear time-invariant system. So here they are, I've just plugged them in: everything is Gaussian, with an initial mean and covariance, then linear maps between the x's with Gaussian noise Q, and linear observations y with Gaussian noise R.

It is useful to have an algorithm that operates on the logarithm of this expression: the logarithm of a product of Gaussians is a big sum of log Gaussians, log Gaussians are quadratic functions, so this is now a sum of squares, and sums of squares are somehow easy. What's left to do, and that is actually your homework, so I'll just tell you what you need to do, is to take this expression and write it out explicitly as the logarithm of a Gaussian. There will be quadratic forms, something like x_t minus A x_{t-1}, transposed, times Q inverse, times the same thing again, and so on. Now, what the algorithm says to do for the maximization is to compute the expected value of this expression against the posterior over x. What is the posterior? It is this factorizing structure of Gaussian distributions, involving the smoother means and covariances that Nathanael told you about, the m^s and P^s; they are all Gaussian, with the m^s and P^s as means and covariances.

Then you need to know one trick, which is that for Gaussians it is possible to compute the expected value of a quadratic form against the Gaussian. Take a general quadratic form, an inner product of a linear map of x with some shift, integrated against a Gaussian with mean m and variance V; the result is the quadratic form evaluated at the mean, plus the trace of a linear map of the quadratic-form quantities against the variance V, written out below. You just write it down, it is just plugging everything in, and that gives you an expression that involves A and Q and H and R; we actually give you that expression, to make it easier. What's left for you to do is take the derivative with respect to A and Q and see what you get, and it will be an interesting update that lets you say, in closed form, what A and Q should be.
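For reference, the trick in a generic form, in my notation: for a Gaussian over x with mean m and covariance V,

$$\mathbb{E}_{x \sim \mathcal{N}(m, V)}\big[(Ax - b)^\top W (Ax - b)\big] = (Am - b)^\top W (Am - b) + \operatorname{tr}\!\big(A^\top W A\, V\big).$$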
So, with that, I'm at the end. Today we summarized a lot, and then we realized that one thing is still missing: we need to think about how to fit models. In general you would like to compute the evidence and maximize it, but computing the evidence can be hard. You can do Laplace, but Laplace is fundamentally an approximation; it is local, it is not the exact integral, so if it works, good, but if it doesn't work, we can't be sure why. So instead we looked at this other algorithm called EM, which most of you have already heard about and which we'll talk about again on Thursday. It gives an algorithm where the integrals we need to compute tend to be a bit easier; there are still integrals, but they tend to be easier, and in special cases, like linear time-invariant systems and also mixture models, which I might talk about on Thursday, this integral is closed form and gives very neat, efficient updates. So please leave feedback and tell me how much you know about EM, so that I have an idea of what to do on Thursday.