Welcome back to Probabilistic Machine Learning, lecture number six. Here is where we are in the course. In lecture one we saw that probabilistic reasoning is a foundation for, or an extension of, propositional logic that lets us reason with uncertainty. In lecture two we saw that doing so can be computationally very hard, and that to make it feasible we have to find conditional independence structure in our reasoning problems. In lecture three we extended this notion from discrete and binary distributions to continuous domains. In lectures four and five we saw how to actually perform computations in such continuous probabilistic models; one way to do so is to use random numbers, to draw samples with which we can then approximate integrals, because integrals are the core operation we have to solve when we do probabilistic inference. Today we begin the part of the course that is much more directly focused on actual applications and on solving concrete inference problems, and we will do so in a framework in which sampling, for a while, is not going to be so important anymore. To get there we have to talk about this man, Carl Friedrich Gauss, one of the most prominent mathematicians of all time. This is what German money used to look like when I was in high school: this is him, Carl Friedrich Gauss, and next to him you see, first of all, the skyline of Göttingen, where he taught, and in here a little curve that all of you have seen before. In fact there is a formula already printed over it that I can zoom into: this is the one-dimensional version of the probability distribution that is named after him, the Gaussian distribution, also known variously as the bell curve, the normal distribution, the central distribution, and so on. You can already tell from the fact that (a) this guy used to be on the money and (b) this curve has so many names that it is clearly important. Usually at this point in the course I ask people why this distribution is important and then wait for a moment, so maybe you can think about this for yourself; the right answer usually does come up. If you come from a mathematical background, maybe you have seen the Gaussian distribution introduced as the distribution that maximizes entropy given its first two moments, mean and variance, or maybe you have heard about the central limit theorem. I am afraid to say that neither of these statements is the core reason for the popularity of the Gaussian distribution; they are more of a post hoc motivation for its use. No, the real reason for the popularity of the Gaussian distribution is that it has beautiful, convenient mathematical properties. What we are going to do today, and also in the following lectures, is to first get to know these wonderful properties, that is today, and then make use of them to build real machine learning methods, in fact maybe the foundational set of methods that covers a large part of what people might want to do with machine learning. Gaussian distributions are going to be an absolute fixture of the rest of this course, so I should tell you in advance that when some somewhat tedious math shows up today, that is because we are going to use that math for the entire rest of the course. I urge you to pay attention even when it sometimes gets a little tedious and we just do a bit of arithmetic.
So this is the one-dimensional version of this curve, and it is described by this equation. It is a probability density function for a random variable called x, and the distribution has two parameters, typically called mu and sigma. We use a shorthand like this, a curly N for the normal distribution: it is a distribution over the random variable capital X, which takes values lowercase x on the real line, so for any real number x, and after the semicolon we write the symbols for the two parameters of the distribution. It happens to be the case, and I have not shown this, but I can just tell you, that mu is the location of the distribution (it is sometimes called the location parameter); it is also equal to the mean, the first moment of this distribution, and it also happens to be the mode, the point where the density reaches its highest value. For sigma we use the notation sigma squared, because sigma squared is the variance of the distribution, that is, the expected squared distance of draws from this distribution to the mean. We write a square because variances are expected squares, so we can talk about the square root of that expected square, which is sigma, the standard deviation of the distribution. Maybe more important is the algebraic form of this equation: it is the exponential of a quadratic expression, e to the minus a square, where the square is the distance to the mean, divided by two times sigma squared. The 2 is there for computational convenience; it could otherwise be absorbed into sigma, but many things become much easier if we leave that two around. In front of the exponential there is a normalization constant, which is one over sigma times the square root of two pi. The integral behind this constant is sometimes known as the Gaussian integral, even though it was not actually solved by Gauss; it was solved by Laplace, and I will show a quote about this later on.
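To make the definition concrete, here is a minimal sketch in Python (my own illustration, not part of the lecture materials; the function name and the example values of mu and sigma are arbitrary) that evaluates the density p(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)) and checks numerically that it normalizes and that its mode sits at mu.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(x; mu, sigma^2)."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

# sanity checks with arbitrary example parameters mu = 1.3, sigma = 0.7
xs = np.linspace(-10.0, 10.0, 20001)
pdf = gaussian_pdf(xs, mu=1.3, sigma=0.7)
print(pdf.sum() * (xs[1] - xs[0]))   # ~1.0: the density integrates to one
print(xs[np.argmax(pdf)])            # ~1.3: the mode coincides with mu
```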
So, a few interesting properties to begin with. This is a probability distribution, which means it integrates to one. The density is strictly larger than zero for any finite value x on the real line, so the distribution has full support on the real line. It is also clearly symmetric: if you exchange x and mu in this expression you get back the same thing, because the square is symmetric under this permutation. And maybe the most important way of thinking about the Gaussian is that it is the exponential of a square, so we could also write it as e to some polynomial that has a constant term, a linear term, and a quadratic term. Writing it this way requires other parameters; let's call them eta and tau. These also have names, although I am going to use them more rarely today; we will talk about them a little in the second half of the lecture. Tau is often called the precision, and eta, which is equal to the precision times mu, is often called the precision-adjusted mean. (You can also find a bug here on these slides, a mismatch between the definition with the square here and there; I will fix that in a later version of the slides.) Okay, so that is the univariate distribution. The univariate case is actually not so interesting, it is just much more intuitive; we will come back to the multivariate version, which is much more interesting, in a few minutes. Before we do so, though, let's see the first reason why this distribution might be so useful, which is that the Gaussian family of distributions, that is, the family parameterized by mu and sigma squared, is closed under multiplication. By that I mean the following. Say we want to do inference on a real number, something that takes a value between minus infinity and plus infinity, and we know it to be real. A typical assumption people make, and here we start to see what I mean by convenience, is to just assume that the prior is Gaussian and the likelihood is Gaussian. This is not necessarily true in practice, but it is very convenient for the following reason. Assume the thing we do not know, the random variable capital X which takes values little x, is Gaussian distributed a priori, and the probability of observing some measurement y given the true value x is also, as people would say, disturbed by Gaussian noise; scientists in the quantitative sciences typically draw something like this, a measurement as a circle with error bars to the left and right whose width is nu, or sometimes two nu, the width of that likelihood. Then the posterior, what we know about X having measured y, is given by Bayes' theorem, which is here, and to evaluate it we have to multiply these two distributions with each other and then normalize. Let me write that down; I need a pen. We are multiplying p(x), which is a Gaussian N(x; mu, sigma^2), with p(y | x), which is a Gaussian N(y; x, nu^2). To compute the product of the two, we take this times this, and each factor is an exponential: the prior contributes exp(-1/2 (x - mu)^2 / sigma^2) and the likelihood contributes exp(-1/2 (x - y)^2 / nu^2), where I have already used that we can exchange x and y inside the square. I have left out the normalization constants, because afterwards we are going to normalize this expression anyway: to do Bayesian inference we divide by the evidence, and any constant in front that does not depend on x shows up in the integral in the denominator as well, so it cancels between numerator and denominator. We really only have to care about the product of these two exponentials. The product of two exponentials is the exponential of the sum of their arguments, so what we have to care about is the sum of two quadratic polynomials in x, and of course that sum is again a polynomial of second order in x. So let's write that out, dropping the common factor of minus one half for the moment: we need (x - mu)^2 / sigma^2 + (x - y)^2 / nu^2. Expanding gives (x^2 - 2 x mu + mu^2) / sigma^2 + (x^2 - 2 x y + y^2) / nu^2. What we do now is high-school math; in English it is called completing the square, and if you are a German student you may know it as quadratische Ergänzung. We collect all the terms that are of order x squared, then all the terms that are of order x, and push everything else into stuff that does not depend on x; then we try to construct another square expression, and whatever we have to add to make that work goes into a constant that does not depend on x, which we can pull out and absorb into the normalization constant. Doing that, the terms in x squared are (1/sigma^2 + 1/nu^2) x^2, which is convenient; then there is a term with a 2 in front, namely -2 (mu/sigma^2 + y/nu^2) x; and then some other stuff that does not depend on x. If we want to write this as a square, the square has to be (1/sigma^2 + 1/nu^2) times (x minus something)^2: multiplying out, the coefficient of x^2 matches, and matching the linear term tells us that the "something" must be (1/sigma^2 + 1/nu^2)^(-1) (mu/sigma^2 + y/nu^2); whatever constant is left over does not depend on x. So, up to a constant, the exponent is -1/2 (1/sigma^2 + 1/nu^2) (x - (1/sigma^2 + 1/nu^2)^(-1) (mu/sigma^2 + y/nu^2))^2, and that is exactly of the form of this Gaussian. Going back to the slides: what we get is a posterior that is Gaussian, with a mean given by this slightly complicated expression I just wrote on the board, (mu/sigma^2 + y/nu^2) / (1/sigma^2 + 1/nu^2), and a new variance given by (1/sigma^2 + 1/nu^2)^(-1). Of course it also has a normalization constant, and you can work out for yourself what it is; it turns out that the normalization constant is itself the exponential of a square, because of this structure, and can therefore also be written as the evaluation of a Gaussian probability density function at locations that do not depend on x. So why is this cool? First of all, we can think a little about what these numbers mean. Once we have seen a data point, measured with a precision that is inversely proportional to nu squared, and we have a prior uncertainty that scales with sigma squared (notice I am using the word uncertainty for the variance, because the uncertainty is the width of the distribution), we get a new estimate of the same form: it is also a Gaussian distribution, and its new location is a weighted combination of the two estimates. We have an initial estimate mu and an observation y, each coming with error bars that scale like sigma and nu, and what we do is weigh each of them by its precision and then normalize, so that we get a new weighted estimate of the correct value, and a new variance that is the inverse of the sum of the inverses of the individual uncertainties. If you think of inverse variances as precisions, the new precision is the sum of the precisions of the prior and the measurement. So this here is the prior, the black line, the blue curve is the measurement, and our new estimate is more certain than either of these, because it combines information from both sources, weighing them by their individual errors; the more precise of the two, in this case the observation, gets weighted more and pulls the posterior closer to the observation. This extends to more than one observation, because after this one measurement we again have a Gaussian distribution. If we have many such i.i.d. measurements, the situation is similar to what we saw in the beta-distribution example of the proportion of people wearing glasses, except that now we are measuring not a probability but a real number: we have what is called a conjugate prior. The Gaussian prior multiplied by a Gaussian likelihood gives another Gaussian posterior over the quantity we care about, so as more and more measurements come in, we always have a Gaussian distribution as our current posterior estimate, and we can keep updating it by keeping track of a few numbers. A convenient way to do this is to store the inverse of the variance and the inverse of the variance times the mean, because then we can simply sum up the corresponding parameters of the likelihoods that come in.
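Here is a small sketch of this update in Python (my own illustration, not from the lecture; the measurement values and variances are made up), showing both the one-shot fusion formula and the sequential version that only accumulates the two sufficient statistics 1/sigma^2 and mu/sigma^2.

```python
import numpy as np

def fuse(mu, sigma2, y, nu2):
    """Posterior over x after observing y ~ N(x, nu2), with prior x ~ N(mu, sigma2)."""
    prec = 1.0 / sigma2 + 1.0 / nu2          # precisions add
    mean = (mu / sigma2 + y / nu2) / prec    # precision-weighted average
    return mean, 1.0 / prec

# sequential version: keep running sums of the sufficient statistics
mu0, sigma2_0 = 0.0, 4.0                              # example prior (assumed values)
prec, prec_mean = 1.0 / sigma2_0, mu0 / sigma2_0
for y, nu2 in [(1.2, 0.5), (0.9, 0.5), (1.5, 2.0)]:   # made-up measurements and noise variances
    prec += 1.0 / nu2
    prec_mean += y / nu2
posterior_mean, posterior_var = prec_mean / prec, 1.0 / prec
print(posterior_mean, posterior_var)
```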
These quantities are called sufficient statistics; I will tell you later in the course why, but it is already convenient to know that this is a name that will come up. This is nice. Why? Because we are doing Bayesian inference, we are keeping track of uncertainty, and yet we do not have to keep track of a very complicated high-dimensional integral. In the general case, if we just had some joint distribution of y and x, we would have to solve a complicated expression, an integral over lots and lots of complicated terms multiplied together. Instead of having to do that, all we need to do is sum a bunch of floating-point numbers, and you know that computers are good at summing floating-point numbers; it is about the fastest thing you can do. So this operation is massively more efficient than keeping track of the abstract notion of what Bayesian inference would tell us to do. Even though you might be unhappy with the choice of using Gaussian distributions for everything, because I have not really motivated them beyond their algebraic properties, I hope you will agree that it is absolutely beautiful, at least in the one-dimensional case, that we can keep track of uncertainty after an arbitrary number of observations simply by keeping around sums of real numbers, of floating-point numbers on a computer. This is similar to the situation with the binary observations, someone wearing glasses or not, which concerned a different kind of variable, a probability; now it is a real number, and the choice of the Gaussian distribution makes this kind of computation particularly easy. By the way, here is a short historical note. There is an argument about who actually came up with the Gaussian distribution; it is a bit complicated, and there are a lot of egos involved. Gauss was a very confident man, and his contemporaries, in particular Laplace and also Legendre, who was actually slightly before him, had big egos as well, and one could argue about who came up with this first. Gauss wrote a famous treatise about the motion of the planets, in Latin of course; this is just a German translation. In it he constructed a framework for building estimates of the trajectories of planets around the Sun, which are ellipses, intersections of cones, and in doing so he came up with a way of making measurements and combining them by summing their squares; if you like, he invented almost all of modern linear algebra, and maybe even inference, along the way. He even constructs a notion of uncertainty that is essentially this probability distribution; he does this in here, and you can tell from the text that he is really trying to keep the computations feasible for himself. Later on, for the multivariate version we will talk about in a moment, he actually has to solve systems of equations, so he comes up with the Gaussian elimination scheme, which is what we still use today to do linear algebra. Nevertheless, there is an interesting point here: at some place he needs to know the integral over this expression. Here it is, an exponential of a constant times a square; he uses a different notational style, and notice that Gauss does not yet have the little superscript 2 to denote squares, so instead he just writes delta times delta. He needs a normalization constant, so he needs the integral over this expression.
And then he actually admits himself that this integral was not solved by him, even though it is now called the Gaussian integral; in fact it was solved by Laplace, and he cites the correct source down here. But then, with this big ego, he has to go on a little and claim that he came up with it as well: he says, well, Laplace kind of came up with this, but it can actually be derived from a theorem he himself had in one of his earlier papers, before Laplace solved it; and actually Euler had come up with the solution to a variant of this integral; and he had devised the form of this integral himself a little earlier but it was too late to make it into the text, so he is happy to say this is Laplace's work. So, depending on which country you come from, you might call this a Laplacian integral if you like, a Gaussian integral, or an Eulerian integral if you are Swiss; it does not really matter. Interestingly, as a historical factoid, Laplace actually had to solve this integral because he wanted to evaluate the beta integral, which he did not know how to solve exactly, so he made an approximation, which we will talk about in a later lecture, and that approximation required him to solve this now so-called Gaussian integral. So basically, Laplace invented the Gaussian distribution to approximate a beta distribution. Now, of course, the one-dimensional version of this distribution is not particularly interesting, because typically we do not want to reason about a single quantity and then directly observe that quantity with noise. Much more frequently we have to deal with multiple variables that relate to each other, and for that we need a multivariate version of this distribution. Thankfully it exists, and it inherits all the beautiful properties of the univariate case; there is almost nothing the univariate Gaussian can do that the multivariate one cannot. However, this means we will have to deal with multivariate calculus, and in particular with a lot of linear algebra. So, fair warning: the rest of this lecture will be essentially all about linear algebra, and if you need to brush up on your linear algebra knowledge, then maybe after you have watched this lecture go back and look a few things up. The multivariate version of the Gaussian distribution is also going to be the exponential of a square, but now we need a multivariate square. The multivariate version of a quadratic function is known as a quadratic form, and it has this form here: we now assume that x and also the parameter mu are vectors, collections of real numbers of length n, and we construct a real number by taking the inner product of such vectors, scaled by a matrix which we will now call Sigma. This matrix Sigma replaces the variance sigma squared as its multivariate version. For this to work, we have to make sure that the quadratic form up here only ever produces nonnegative numbers, because otherwise e to the minus a negative number could become very large and we might not be able to normalize this expression to a probability density. So this matrix has to ensure that the quadratic form is always a number larger than or equal to zero, and matrices with the property that this kind of inner product, with any
vector applied from the left and the right, is always larger than or equal to zero, are called positive definite matrices. So we want this matrix to be symmetric positive definite: the term just means that the matrix is symmetric and that for arbitrary vectors v this inner product is always larger than or equal to zero. A minor note on notation, because I know there is always someone in the audience who is careful about this: precisely speaking, what I have written here is the definition of a positive semi-definite matrix, because there is an equals sign in it; to be strictly definite, the equals sign should not be there. In this part of the community, however, it has become customary to call this definition positive definite anyway, because the semi-definite qualifier is a notational annoyance and the distinction is easy to deal with. So when I say positive definite matrix, I mean something that fulfills this definition including the equals sign. Okay, this now gives us a way to define a probability distribution over multivariate quantities, elements of a real vector space. The distribution looks something like this; this is a two-dimensional plot, so it is some kind of cloud in this multivariate space. It has a center, given by the vector mu, which is still the mean of the distribution, and it has a shape, and the shape is given by, maybe you can think about this for a second if you do not know it, the eigenvectors and eigenvalues of this symmetric positive definite matrix Sigma. If you do not remember what an eigenvector or an eigenvalue of a matrix is, please look it up at this point. You can also think about which of these two directions corresponds to the large and which to the small eigenvalue of Sigma; this is always a little twisted around, because, to mirror the definition of the variance in the univariate case, the density contains the inverse of this matrix. By the way, if the matrix is only positive semi-definite and there are zero eigenvalues, then the inverse is a little tricky to define, but I will get to that a bit later; it is relatively straightforward to deal with, and there are notions in linear algebra that make all of this work out. Just as in the univariate case, this multivariate Gaussian distribution has a lot of nice properties, and they extend far beyond what is currently on this slide. This function integrates to one, so it is a probability density function; it is strictly larger than zero for any real vector x; and it is normalized by this expression here, which is the multivariate generalization of what we had in the univariate case. There is still a square root of two pi, but it is now taken to the power of the dimension of the input space, and this expression here is the determinant of the matrix Sigma. Again, if you do not remember what the determinant of a matrix is, please look it up; you can think of it as measuring, essentially, a kind of volume described by this function. The density is also still symmetric under exchanging the roles of the variable x and the parameter mu, because you can flip x and mu here, that happens on both sides, and clearly nothing changes.
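As a concrete reference, here is a small sketch (my own, not from the lecture; the example mean and covariance are arbitrary) of how one might evaluate this density numerically. The Cholesky factorization both checks positive definiteness (it fails otherwise) and gives the determinant cheaply.

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) for a symmetric positive definite Sigma."""
    d = len(mu)
    L = np.linalg.cholesky(Sigma)          # raises an error if Sigma is not positive definite
    z = np.linalg.solve(L, x - mu)         # whitened residual: z @ z = (x-mu)^T Sigma^{-1} (x-mu)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (z @ z + log_det + d * np.log(2.0 * np.pi))

# the shape of the density is governed by the eigenvectors/eigenvalues of Sigma
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])   # example covariance (assumed)
mu = np.array([1.0, -0.5])
vals, vecs = np.linalg.eigh(Sigma)           # axes and squared lengths of the ellipse
print(mvn_logpdf(np.zeros(2), mu, Sigma))
# cross-check, if scipy is available:
# from scipy.stats import multivariate_normal
# print(multivariate_normal(mu, Sigma).logpdf(np.zeros(2)))
```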
The density can also still be written as the exponential of a polynomial function. You have to be a bit careful now, because the polynomial is a multivariate polynomial; it involves a constant, a linear term, and a quadratic term, and the parameters of the linear and quadratic terms are still called the precision, for this matrix Lambda, which happens to be the inverse of Sigma, and, if you like, the precision-adjusted mean eta, which is the product of the precision and the mean; they appear here in the linear and the quadratic term in x. As a side note, this inner product can also be rewritten as a trace over another matrix. Okay, so the kind of derivation I just did for the univariate case on the whiteboard behind me carries over to the multivariate case. I am not going to do the derivation again, because it is a little tedious; if you want, you can do it for yourself. It is still true that the product of two Gaussian distributions is itself a Gaussian, and I have to be careful with what I say here: the product of two Gaussian probability density functions, these functions we call curly N, which are maps from x to the positive real numbers; if you multiply two of these, the result is another Gaussian density times a normalization constant. This is not the same as talking about the distribution of the product of two Gaussian random variables; if you are confused by that sentence, maybe stop the video here and think about it for a moment. What it means is that if we measure a Gaussian random variable twice, or multiple times, each time with a likelihood that is Gaussian, then the posterior will again be Gaussian, because the product of these two functions is the product of two exponentials of quadratic forms, so it is the exponential of the sum of two quadratic forms, and the sum of two quadratic forms is another quadratic form. Its parameters are given by the expressions on the slide: the new covariance, that is what this matrix C is called, the covariance matrix of the distribution that arises from taking the product of the two Gaussian PDFs, is given by the inverse of the sum of the inverses of the two covariances, which means the new precision is the sum of the precisions; and the new mean is a more complicated expression, namely the covariance of the resulting distribution times a precision-weighted sum of the means of the two input distributions. Why is this important? For the same reason it was important in the univariate case: to compute posterior distributions after Gaussian observations we only have to do operations that are easy on a computer. These operations are a little more tedious than what I showed you in the univariate case; the basic operation they involve is the product of a matrix with a vector, which a computer can do very efficiently, especially with GPUs, and they involve inverses of matrices, or sums of inverses, inverted. The inversion operation is expensive, and we might talk about this a little in the flipped classroom, but it is still an operation that a computer can solve efficiently.
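Here is a minimal sketch of these product formulas in Python (my own illustration, not lecture code; explicit matrix inverses are used only for readability, in practice one would use a solver or a Cholesky factorization).

```python
import numpy as np

def gaussian_product(m1, S1, m2, S2):
    """Mean and covariance of N(x; m, C) proportional to N(x; m1, S1) * N(x; m2, S2)."""
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)   # precisions of the two factors
    C = np.linalg.inv(P1 + P2)                       # precisions add
    m = C @ (P1 @ m1 + P2 @ m2)                      # precision-weighted means
    return m, C

# example with made-up parameters
m, C = gaussian_product(np.array([0.0, 0.0]), np.eye(2),
                        np.array([1.0, 2.0]), 0.5 * np.eye(2))
print(m, C, sep="\n")
```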
So here is a little picture of what I mean. If you take this Gaussian, call it a prior, and you multiply it with a likelihood that is itself Gaussian, then the product of these two probability density functions (which, again, is different from the distribution of the product of two Gaussian random numbers) is itself a Gaussian distribution: these two clouds of uncertainty, multiplied together and then normalized, again become a Gaussian. By the way, the normalization constant, which we have to compute to evaluate this integral, is itself an expression of Gaussian form; it is not a Gaussian PDF over x, because it is a function of the parameters rather than of x, but it is an expression that can be computed by evaluating a Gaussian probability density function. That, however, is not the only beautiful property of Gaussian distributions. There are many more, and they all boil down to the fact that quadratic functions are just really nice. It is not just the case that the sum of two quadratic functions is another quadratic function; it is also the case, for example, that the projection of a quadratic function onto a subspace is itself again a quadratic function. So suppose you have a random variable that is Gaussian distributed, with a mean and a covariance given by this red cloud, and you are interested in the projection of this random variable onto some other space defined by a linear map: consider a matrix A and the random variable given by A times z. Clearly this is a mapping of z, so it is itself a random variable, but it is a linear mapping. Then it turns out, and I am not going to show this, I am just going to tell you, that the distribution of this derived random variable is also Gaussian, with a mean that is very easy to compute, just the linear map applied to the mean, and a new variance or covariance given by applying the matrix from the left and the right, so multiplying Sigma from the left with A and from the right with A transpose. Here is a picture of that: if you think of this multivariate distribution as the distribution over z, and then consider the variable given by projecting this entire distribution onto this line, then out comes a Gaussian distribution whose mean is just the projection of the mean onto this space and whose variance is just the projection of the covariance onto this space. By the way, this does not just work for univariate projections, where A is a row vector; it also works for general linear maps, and we will make use of that a lot. This is another great property of Gaussian distributions, but there are more. In particular, there is a special case of this where we choose the projection A to be a unit row vector, with a one somewhere and zeros everywhere else. What we are essentially doing then is computing the marginal distribution of an individual variable: imagine you have a joint distribution over x and y; if you apply this A from the left to the joint vector, you just get x. So what the expression on the previous slide says is that if you have a joint Gaussian distribution over multiple variables x, y, z, and so on, and you care only about one of the marginal distributions of this joint, let's say over x, then you can read that marginal off directly.
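A quick way to convince yourself of this closure under linear maps is to check it numerically. This sketch (mine, with arbitrary example numbers) compares the closed-form mean A mu and covariance A Sigma A^T against a Monte Carlo estimate from samples of z.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
A = np.array([[1.0, 0.0, 0.0],      # first row: picks out the first entry (a marginal)
              [0.5, -1.0, 2.0]])    # second row: a generic linear combination

# closed form: A z ~ N(A mu, A Sigma A^T)
print(A @ mu)
print(A @ Sigma @ A.T)

# Monte Carlo check with samples of z
z = rng.multivariate_normal(mu, Sigma, size=200_000)
print((z @ A.T).mean(axis=0))
print(np.cov(z @ A.T, rowvar=False))
```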
That marginal is (a) itself a Gaussian and (b) of a very simple, accessible form: to get the mean of this distribution, just pick out the corresponding entry of the mean vector, and to get its covariance, just pick out the corresponding sub-matrix of the joint covariance matrix. This is amazing, because it means that if you have a model with a very high-dimensional representation over a large space, but you only care about a part of that space, then that is easy to do; you just pick out individual parts of the distribution. It is actually also maybe a dangerous property of Gaussian distributions, which we will have to deal with later on: another way to phrase it is that if you have a complicated model over many, many variables and you only care about one part of them, then having a Gaussian model corresponds to treating that sub-part of the joint as if you could just drop all the other variables. So that is property number one, and notice that it essentially gives us a kind of sum rule for Gaussian distributions: if you have a joint distribution over two variables and you care about only one of them, you have to compute a marginal, and for Gaussians this sum operation is very easy; it is not even linear algebra, it is just selecting from a list or an array. It is also true, and this is a different property, that not only projections of Gaussians but also cuts through Gaussians are Gaussians: if you have a quadratic function and you take any linear cut through that function, that cut is itself a quadratic function. Therefore, if you condition a Gaussian, which means cutting through the Gaussian at one point, let's say here, then the corresponding distribution is itself a Gaussian distribution. Here is the corresponding rule. It is a little tedious; it says that the conditional distribution for a variable x, given that a linear projection of x takes the value y, is itself a Gaussian. Why exactly this is the case takes a little time to derive, and I want to do more interesting, fun stuff with you, so I am just going to tell you that it is of this form: a Gaussian whose mean is given by the mean of the Gaussian marginal over x plus an annoyingly complicated expression that we will talk about in a minute. What you can already see is that it only involves matrices, vectors, and inverses of matrices, so these are linear algebra operations, and computers are good at linear algebra. This property therefore essentially defines a conditional distribution, or product rule, for Gaussian distributions: if you have a joint Gaussian distribution over several variables and you care about the conditional of one of them given the others, that conditional is Gaussian, and its parameters, mean and covariance, are given by quantities that can be computed using linear algebra.
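To make the marginalization and conditioning rules tangible, here is a small sketch in Python (my own; the index-based interface and the example numbers are just one possible way to write it) that conditions a joint Gaussian on observing some of its entries. Marginalization, by contrast, is pure selection of entries.

```python
import numpy as np

def condition(mu, Sigma, idx_obs, y):
    """p(x_rest | x_obs = y) for a joint Gaussian N(mu, Sigma)."""
    idx_rest = np.setdiff1d(np.arange(len(mu)), idx_obs)
    S_oo = Sigma[np.ix_(idx_obs, idx_obs)]
    S_ro = Sigma[np.ix_(idx_rest, idx_obs)]
    S_rr = Sigma[np.ix_(idx_rest, idx_rest)]
    gain = S_ro @ np.linalg.inv(S_oo)              # how strongly the observation informs the rest
    cond_mean = mu[idx_rest] + gain @ (y - mu[idx_obs])
    cond_cov = S_rr - gain @ S_ro.T
    return cond_mean, cond_cov

mu = np.zeros(3)
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
m, C = condition(mu, Sigma, idx_obs=np.array([2]), y=np.array([1.5]))
# the marginal of x0 alone is even simpler: N(mu[0], Sigma[0, 0]) -- pure selection
print(m, C, sep="\n")
```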
This can all be summarized as follows; I have written it down again, maybe a little more clearly. If you have a prior distribution over a variable x that is Gaussian, with some mean and covariance, and you make observations of another variable y whose conditional distribution is also Gaussian, with a mean that is a linear map of the quantity x you care about, or even an affine map, a linear map plus a shift, and with some covariance Lambda, then the rules of probability that are usually complicated, the sum and product rule and the posterior distribution that involve complicated integrals, for Gaussians actually reduce to linear algebra. The marginal distribution over y, which is the evidence term in Bayes' theorem, is itself of Gaussian form. Why is that? Because it is the product of two Gaussians: here is one Gaussian and there is another, and these are essentially both Gaussians over x, because Gaussians are symmetric under exchanging the variable and its mean parameter; their product is itself a Gaussian, and the normalization constant is also of Gaussian form, given by this expression here. Notice that these expressions involve just linear algebra, just matrix-vector products. And the posterior distribution is a Gaussian distribution over x with a new mean and covariance, here is the mean and here is the covariance, and that new mean and covariance are given by linear algebra operations. We will take a moment in a second to talk about exactly what these linear algebra operations do and what the terms in there actually mean, but before we do so it is more important to realize why this is cool: it maps otherwise complicated probabilistic inference onto linear algebra. We saw in lecture two that probabilistic inference can be combinatorially hard, exponentially hard in the number of variables in our model. But if you have jointly Gaussian models, then inference consists of linear algebra, and if you have done linear algebra before you know that the cost of linear algebra operations is polynomial rather than exponential; in fact the cost here is at most cubic in the number of variables in x or y. So when we use Gaussian distributions for everything, and of course we sometimes have to question whether we are allowed to do that, probabilistic inference, reasoning with uncertainty, boils down from integration, which is hard and combinatorially complex, to linear algebra, which is easy to implement and only polynomially hard. What exactly are the computations we have to do? If we have a variable x that we care about, with a Gaussian prior over it, and we make observations y that are linear maps of x up to Gaussian noise, then the posterior over x is a Gaussian with a mean given by the following expression. You take the prior mean, the mean we had before we got to see an observation, and then add a complicated expression that, if you stare at it for a while, separates into parts: here is a vector, y, the observation, minus what is essentially a prediction, namely the mean of the marginal distribution over y, that is, what the prior over x would predict y to be. The difference between the two is called the residual: how far what you actually observed is from what you would have predicted it to be. Then you scale it by this expression, which is called the Gram matrix; it shows up over here again, and it is actually the marginal covariance of y, the scale on which we expect y to lie, and we have to invert it.
So what we are essentially computing here is a corrected residual: we ask how far the measurement is from the prediction, on a scale defined by how far we expect the measurement to be from the prediction, and then we project back onto the space of x, where we actually care. This expression here is sometimes called the gain, because it multiplies the residual by a linear map. Computing this object: this part is easy, this part is easy; the hard part is this matrix inverse here. Matrix inverses are cubically expensive in the size of the matrix, and this is a square matrix, because A appears on the left and the right hand side, with size equal on each side to the length of y. The corresponding posterior covariance matrix, essentially the multivariate error bar on the estimate, is given by the prior covariance, let's call that covariance an uncertainty because that is what it is, the width of the distribution: what we know once we have seen y is what we knew before, and that prior uncertainty is then reduced by the measurement, by an expression involving an outer product, or an inner product depending on how you think about these quantities, of this matrix that we already had in the mean, and the inverse of the Gram matrix that we also had in the mean. One nice thing about this is that these quantities show up in both the mean and the covariance, so once we have computed this inverse, or figured out how to compute with it, this operation does not add computational cost on top of that one. This means that in the Gaussian framework we get uncertainty for free: once we have computed our point estimate, we get the uncertainty for free. Now, by the way, there is a really interesting reformulation of this. This is one important equation that we are going to use over and over again, and there is another form for it: the exact same distribution can also be written by rewriting this complicated expression in a different fashion, using another matrix, given by the inverse of the prior covariance plus A transpose times the precision of the likelihood times A, inverted, times what you could call the precision-weighted means, that is, the prior mean times the prior precision plus the (shifted) observation times the precision of the measurement, mapped back through A transpose. The corresponding posterior covariance is given by this expression, which we already have up here; it is the new posterior precision, inverted. Why is it useful to have both of these forms? Because x and y can be of different sizes. Notice that we observe a linear projection of x: it could be that y is a single real number and x is a long vector, or it could be that y is a long vector containing many, many measurements of x while x is a small thing, maybe x has only three entries but we have twenty measurements of it. Depending on the relative sizes of the vectors y and x, the matrices that show up in these expressions have different sizes: this matrix here is of size length-of-y squared, and this matrix here is of size length-of-x squared. So if x is larger than y, it makes more sense to use the first representation, because then we only have to invert the smaller matrix; but if x is smaller than y, so if you have many measurements of a smaller set of variables, then it is better to use the second representation, because then the matrix to be inverted is smaller.
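Here is a sketch of both forms side by side (my own illustration, not the lecture's notation; explicit inverses for readability, and the observation model is y = A x + Gaussian noise with covariance Lambda). Both functions return the same posterior; they differ only in whether the matrix being inverted has the size of y or the size of x.

```python
import numpy as np

def posterior_obs_form(mu, Sigma, A, Lambda, y):
    """Gain form: inverts a matrix of size len(y). Prior x ~ N(mu, Sigma), y | x ~ N(A x, Lambda)."""
    G = A @ Sigma @ A.T + Lambda                 # Gram matrix: marginal covariance of y
    gain = Sigma @ A.T @ np.linalg.inv(G)
    mean = mu + gain @ (y - A @ mu)              # prior mean + gain * residual
    cov = Sigma - gain @ A @ Sigma
    return mean, cov

def posterior_param_form(mu, Sigma, A, Lambda, y):
    """Precision form: inverts matrices of size len(x)."""
    P = np.linalg.inv(Sigma) + A.T @ np.linalg.inv(Lambda) @ A
    cov = np.linalg.inv(P)
    mean = cov @ (np.linalg.inv(Sigma) @ mu + A.T @ np.linalg.inv(Lambda) @ y)
    return mean, cov

# quick consistency check with arbitrary shapes: 2 latent variables, 3 observations
rng = np.random.default_rng(0)
mu, Sigma = np.zeros(2), np.eye(2)
A = rng.standard_normal((3, 2))
Lambda = 0.1 * np.eye(3)
y = rng.standard_normal(3)
m1, C1 = posterior_obs_form(mu, Sigma, A, Lambda, y)
m2, C2 = posterior_param_form(mu, Sigma, A, Lambda, y)
print(np.allclose(m1, m2), np.allclose(C1, C2))   # True True
```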
Because the computational cost of this entire operation is dominated by computing essentially this matrix inverse, it is important to think about which way around we do it. The idea that this posterior distribution, or rather its parameters, can be computed in two different ways is arguably due to one of the most prominent mathematicians of the twentieth century: his name is Issai Schur. He was born in what was then the Russian Empire and grew up in present-day Latvia, as a member of the German-Jewish community there. He moved to Berlin to study, and after his PhD, in which he already worked on linear algebra and essentially on the result we are using here, he moved to Bonn, where he became the successor of Hausdorff and was a professor for a few years. Schur is one example of the brilliant minds that Germany lost on the eve of the Second World War: he was forced to leave his academic position because he was of Jewish faith, had to move first to Switzerland, and was forced to give up a large part of his fortune, almost all of it, before emigrating in 1939 to Palestine, present-day Israel, where he soon died, at the relatively young age of 66, probably at least partly because of the psychological stress of having to emigrate. It is a sad story, but the result we are using here is absolutely beautiful, and it is only weakly represented by this particular formulation in terms of matrices; it is actually a more general statement that Schur provided for relatively general groups, but it is simplest to understand in this form. The fact that we can rewrite this expression like this is due to the fact that x and y have a joint Gaussian distribution, which has a joint covariance matrix, and the blocks of the inverse of that joint covariance matrix can be computed in part from each other. This can be written in the following form, as a statement known as the matrix inversion lemma; unfortunately we cannot call it Schur's lemma, because that name is already taken by something else. It says that we can write the inverse of a matrix with this kind of block structure, which is essentially any matrix, as long as the quantities involved have inverses wherever I use them, by first computing this object, which is named after Schur the Schur complement of the matrix, and then computing the individual blocks of the inverse in this particular form. You can see this kind of expression show up in here and in here, and you can use these to construct inverses of matrices. The same trick can also be used to compute the inverse of a matrix of this form: one matrix whose inverse you might already know, maybe the inverse of the prior covariance which we have previously computed, plus an expression that could be, for example, an outer product, where U and V might just be vectors, with a small matrix W inside. The inverse of this sum can be written as the known inverse minus a correction term that only requires inverting the much smaller matrix in the middle; if U and V are vectors, that inner matrix is just a scalar.
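Here is a quick numerical check of this matrix inversion lemma, the Woodbury identity, in Python (my own sketch with arbitrary example sizes); the point is that the right-hand side only ever inverts k-by-k matrices, while the full n-by-n inverse on the left is computed here purely to verify the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 3                                       # big matrix, low-rank update (example sizes)
A_inv = np.diag(1.0 / rng.uniform(1.0, 2.0, n))     # an inverse we "already know"
U, V = rng.standard_normal((n, k)), rng.standard_normal((k, n))
W = np.eye(k)

# Woodbury: (A + U W V)^{-1} = A^{-1} - A^{-1} U (W^{-1} + V A^{-1} U)^{-1} V A^{-1}
small = np.linalg.inv(np.linalg.inv(W) + V @ A_inv @ U)   # only a k x k inverse
lhs = np.linalg.inv(np.linalg.inv(A_inv) + U @ W @ V)     # expensive n x n inverse, for checking only
rhs = A_inv - A_inv @ U @ small @ V @ A_inv
print(np.allclose(lhs, rhs))                              # True
```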
The same kind of result can also be used to compute determinants of such matrices: this one is called the matrix inversion lemma, and this one the matrix determinant lemma, and the two basically relate to each other. I am showing you this partly because it is a really cool tool, and if you work with Gaussian distributions you are going to have to use it a lot, but also partly to highlight that all the computational challenges we now have to solve, if we use Gaussian distributions, are not integrals anymore; they are just linear algebra. Linear algebra problems are not necessarily trivial to solve, they require knowledge like this, but they can be solved with linear-algebraic expressions rather than potentially combinatorially hard or totally intractable integrals. Because of this, Gaussian distributions map all of probabilistic inference onto linear algebra. Products of Gaussians are Gaussians, which means that if you get two sources of information about a random variable and they are both of Gaussian form, then the resulting distribution is also Gaussian. Linear projections of Gaussians are Gaussians, which means that if you have a set of variables and want to reason about linear projections of them, and the original variables are Gaussian distributed, then the projected distributions are Gaussian; in particular, marginals of Gaussian distributions are Gaussians, because a marginal is a specific operation of this form. And conditionals of Gaussians, at least when conditioned on linear observations, are Gaussians. Because this is essentially the sum rule and this is essentially the product rule, and because the sum and product rule map onto linear algebra if everything is Gaussian and linear, inference in the probabilistic sense, Bayesian inference, computing a posterior using Bayes' theorem, also involves just linear algebra expressions. So if you have a variable that is Gaussian distributed under the prior and you make observations of linear projections of that variable, which in particular includes directly measuring the variable with measurements corrupted by what is called Gaussian noise, that is, distributed according to a Gaussian around a mean that is a linear projection of the variable we care about, then the posterior distribution, and actually every linear map of it, is again a Gaussian distribution. Those expressions are maybe tedious to write down, and at first sight they look really complicated, you have to stare at them for a while, but the important part is that all operations that show up here are just linear algebra; they are not complicated, intractable integrals. Therefore Gaussian distributions essentially map all of probabilistic inference onto linear algebra, which is something a computer can do very well, and this explains why Gaussian distributions are so prominent in probabilistic machine learning: they allow us to keep track of all these complicated objects in a very simple way, just by doing linear algebra. This slide is not gray, but you can think of it as a gray slide: take a break here and stare at these expressions for a bit. All right, for the remaining part of this lecture, the next thirty minutes or so, I would like to introduce two very basic
examples of the use of the Gaussian distribution for simple inference tasks. The goal of these examples is simultaneously to give you an insight into how this big Gaussian machinery, which I have produced in this abstract fashion, is actually used, and also to tie a connection between the Gaussian distributions we have been talking about today and what we found in lecture two, when we thought in an abstract way about conditional independence structure, then for discrete variables; now I would like to talk about continuous variables. These two examples come from a really nice text by David MacKay called The Humble Gaussian Distribution, which I have cited up here. The first example contains three variables. Consider the following situation: say we want to talk about the temperature in two different buildings and the temperature outside, and we might potentially measure the temperature in any of these three places. The two buildings might be quite close to each other, and evidently the temperature inside both buildings depends on the temperature outside; it does not have to be the same, though, because the two buildings are different, or the measurements are taken in different rooms that are affected by sunlight from the outside in different ways. So let's introduce three random variables, which we will call x1, x2 and x3, corresponding to the temperature in building one, outside, and in building two. How are we going to build a probabilistic model for this? We can write down a graphical model like this, intuitively, but this graph also actually encodes assumptions about the generative process producing the temperatures. Let's say there is an outside process that produces x2; it could be the Sun shining onto the ground. Given x2, we can reason about what x1 might be. Let's make the following simple linear assumption: x1 is a linear function of x2, where the weight w1 might have something to do with the thermal conductivity of the walls and how many windows the building has, so it is just some number times the outside temperature, plus a disturbance that depends on the building. In general, of course, you would like this function to be as complicated as necessary to capture every single thing you are currently thinking of, and later in the course we will come to models that can do that kind of thing, but today, for simplicity, let's assume (a) that there is a linear relationship between x2 and x1, and (b) that the disturbance relative to the value of x2 is a Gaussian random variable, just some form of uncertainty that we decide to capture by a Gaussian distribution. We will assume a similar relationship for the other building: the temperature in building two, which, confusingly, we call x3, is also a linear function of the outside temperature plus another disturbance, and this other disturbance has nothing to do with the disturbance in building one, because it is a different building, but we will also assume it can be modeled by a Gaussian distribution. Now what might the outside temperature x2 be? In principle we would have to use a complicated distribution that captures all the aspects of the outside world, but for simplicity let's just say it is a Gaussian random variable as well. So the probability distribution for
this outside temperature, the variable nu2, is given by a Gaussian distribution with some mean and variance. The assumption we have made here is essentially that these three noise variables are independent of each other; that is the assumption up here. We assume that the three disturbances, the outside temperature nu2, the disturbance nu1 in building one and the disturbance nu3 in building two, are three different processes that have nothing to do with each other, so they are independent. That does not mean, though, that x1, x2, x3 are independent of each other; it just means that nu1, nu2 and nu3 are. What is the corresponding distribution of the variables x1, x2, x3? They are random variables, and they are a linear map of nu1, nu2 and nu3. x2 is just nu2, so we can introduce a linear map A from nu to x whose second row says that x2 is 0 times nu1 plus 1 times nu2 plus 0 times nu3. What is x1? x1 is the disturbance nu1, so there is a 1 here, plus w1 times x2, which is itself nu2, so we can plug that in directly, otherwise we would have to construct a more complicated map, and then 0 times nu3, because there is no nu3 in x1. Finally, for x3 we have 0 times nu1, because there is no nu1 in this expression, then w3 times nu2, which is equal to x2, plus nu3, so there is a 1 here. Now we can use the property of the Gaussian distribution that I am going to summarize by the sentence "Gaussians are closed under linear maps", by which I mean, more precisely, this property up here, to derive the implied distribution over x1, x2 and x3, and it is this Gaussian distribution: because this is a Gaussian distribution and this is a linear map, x, which is a linear map of nu, is also a Gaussian random variable, with a mean given by applying A to the mean of nu and a covariance matrix Sigma that arises from applying A from the left and the right to the diagonal matrix with the three variances on its diagonal. If you do this, and you can do it by hand, you arrive at a fairly obvious mean, namely the mean of nu1 plus w1 times the mean of nu2, then the mean of nu2, and then w3 times the mean of nu2 plus the mean of nu3, and at this more complicated covariance matrix. This sometimes confuses people: I have decided to print only the upper triangular part of this matrix, because it is symmetric, so the parts down here are just the mirror images of these. I have learned over time that some people are confused by me writing matrices like this, and other people are confused by me writing symmetric matrices with all the entries doubled, because then there are so many numbers to look at; so on this slide I show only the upper triangular part, and on subsequent slides I will show the full matrix. Now, what can we read off from the structure of this matrix? Remember the graphical model we wrote down before: it implies that the joint distribution over x1, x2, x3 can be written as p(x2) times p(x1 given x2) times p(x3 given x2), and we know from lecture two that this kind of fan-out structure implies, first, that given x2, x1 and x3 should be independent of each other, and second, that when we marginalize out x2, the marginal over x1 and x3 should in general not be independent. Which of these properties can we read off from this distribution? Well, we have to look at the covariance matrix, which captures the structure of the relationships between these variables.
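Here is a small sketch of this construction in Python (my own, with made-up numbers for the weights and variances, and means set to zero): it builds the linear map A, forms the joint covariance A D A^T, and, anticipating the next point, also inverts the 3-by-3 covariance numerically to show that the precision matrix has a zero entry between x1 and x3, which encodes their conditional independence given x2.

```python
import numpy as np

w1, w3 = 0.8, 1.1                   # made-up wall/window coefficients
s1, s2, s3 = 0.5, 3.0, 0.4          # made-up variances of nu1, nu2 (outside), nu3

A = np.array([[1.0, w1, 0.0],       # x1 = nu1 + w1 * x2
              [0.0, 1.0, 0.0],      # x2 = nu2
              [0.0, w3, 1.0]])      # x3 = w3 * x2 + nu3
D = np.diag([s1, s2, s3])

Sigma = A @ D @ A.T                 # joint covariance of (x1, x2, x3)
print(Sigma)                        # entry [0, 2] is w1*w3*s2 != 0: x1 and x3 are marginally dependent
print(np.linalg.inv(Sigma))         # precision matrix: entry [0, 2] is ~0: x1 indep. of x3 given x2
```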
It's the only part of the Gaussian distribution where such structure can be stored. What you can read off first from this covariance matrix is the statement about the marginal over x1 and x3. Why? Because remember that one of the properties of the Gaussian distribution is that marginals of Gaussians are Gaussians, so to get a marginal distribution we just have to pick out a sub-matrix. The sub-matrix for the marginal distribution over x1 and x3 consists of this term, this term, the term down here and its symmetric counterpart. Notice this off-diagonal entry: on the diagonals we just see the marginal variances of the individual variables, that's fine, but the off-diagonal entry here is non-zero unless w1 or w3 is zero, that is, unless x1 or x3 doesn't actually depend on x2. So in general this is a non-zero number, and therefore these two variables depend on each other: if you know something about one of them, you know something about the other as well. The structure of this graphical model, though, also encodes another piece of information, which is that when we condition on x2, these two variables become independent of each other. We cannot read this off from this matrix directly, but there is another matrix we can construct from it where we can: one way would be to take this matrix and invert it, and look at the precision matrix of this Gaussian, which we actually need anyway when we evaluate this Gaussian distribution. I'm not going to do that, though, because inverting a three-by-three matrix is a little tedious. Instead I'm going to write down the generative model that arises from this structure. So here is our graph again, just copied over, and this is the same equation as before, moved over from the previous slide. Now for simplicity let's assume that we set the means to zero. This is purely so that the notation becomes shorter and we don't have to look at so many terms. If you do that, then you can think of nu1, nu2 and nu3 not as the actual temperatures outside and in the two buildings and the disturbances of the buildings, but as offsets from their means: nu2 is now not the outside temperature anymore, it's the deviation of the outside temperature from its mean, and so on. If we do that, things get simpler. So now we can write down the joint distribution for x1, x2 and x3 which is encoded in this graph — which is what I already said. If you can see the colors, I'm going to use color to keep the terms for the three individual quantities separate from each other; if you don't see the color, it doesn't matter, you'll still be able to follow along. This factorization amounts to saying that there are actually three different Gaussian distributions here: there is a Gaussian distribution for x2, which is given up here; then there is a Gaussian distribution for x1 given x2, which means that we can think of x1 minus w1 x2 as the independent Gaussian random variable nu1 with mean zero; and finally we can think of x3 minus w3 x2 as a random variable, called nu3, which is independent of all the other ones. Here are these three Gaussians, where I've already put them into the exponential.
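Continuing the little illustrative sketch from above, picking out the marginal over (x1, x3) is literally just indexing into the mean vector and covariance matrix; again, the numbers are assumed.

```python
# Marginals of Gaussians are Gaussians: picking out entries of the mean vector
# and a sub-matrix of the covariance gives the marginal over (x1, x3).
idx = [0, 2]                              # 0-based indices of x1 and x3
mean_13 = mean_x[idx]
cov_13 = cov_x[np.ix_(idx, idx)]
print(mean_13)
print(cov_13)   # off-diagonal w1*w3*sigma2^2 is non-zero: x1 and x3 are marginally dependent
```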
Of course, the product of three Gaussian distributions is the product of three exponential functions, each containing a square in the exponent, which is equal to the exponential of a sum of these individual squares. Here are the three squares: this is the term from x2, this is the term from x1 given x2, and this is the term from x3 given x2. Now x1, x2 and x3 actually show up jointly in these terms; it's not that this is a term only about x1, this one only about x2 and this one only about x3. No, there is an x2 in here and an x2 in here, because we are conditioning on x2. So let's rearrange these terms so that we collect all the bits with only x1, all the bits with only x2, all the bits with only x3, and all the mixture terms with x1 and x2, x3 and x2 and so on. In doing so we are implicitly inverting the covariance matrix of this joint distribution, because we are going to write this distribution as a joint Gaussian with an inner product in the exponent involving a matrix, and that matrix is the inverse of the covariance matrix, by definition of the Gaussian. So if we do that, we basically have to expand these squares and collect the individual terms. Here are all the bits with x2: there is one without any w, then a square in x2 with w1 from here and a square with w3 from here; then there is an individual term for x1; then there are the mixture terms, the one with x1 and x2 that arises from here and the one with x3 and x2 that arises from here; and a term only in x3. If you rearrange this, you can see that we can write the whole expression using this matrix. This is the precision matrix of our Gaussian, and it tells us that x1 and x3 become independent when conditioned on x2. There are two ways to see this. One is the mechanical, pedestrian way: imagine that x2 is set to a particular value — that's what conditioning means. That means that all the terms in this matrix along this cross are now fixed to certain numbers, and these terms also show up up here. So if you look at these expressions, everything here is now a constant, because we've set x2 to something, and in this term and in this term you can think of x2 as a constant as well. Notice that there is no term in this sum that contains an x1 and an x3 together: there is only a term in x1 times a constant, so this is only a function of x1, and this is only a function of x3, and they are separate from each other. Another way to see this is that there is a zero on the off-diagonal entry here, which means that if we keep x2 fixed, this expression factorizes into a product of two factors, one depending only on x1 and one depending only on x3 — and that's exactly what conditional independence means. So what we can tell from this is: to find marginal independence, we look at the marginal covariance, which we get by picking out parts of the covariance matrix; if there is a zero in the covariance matrix, then the two variables identified by the indices of that location are marginally independent of each other. Conditional independence can be read off, at least in part, from the precision matrix: a zero in the precision matrix implies that the two variables are independent of each other when conditioned on everything else in the model. Here we have only three variables, so this zero means these two variables become independent of each other when conditioned on that one remaining variable.
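In code, again continuing the illustrative sketch above, this structure is visible directly in the inverse covariance matrix:

```python
# Conditional independence structure lives in the precision (inverse covariance) matrix.
prec = np.linalg.inv(cov_x)
print(np.round(prec, 6))        # the (x1, x3) entry is (numerically) zero:
                                # x1 and x3 are independent once we condition on x2
assert abs(prec[0, 2]) < 1e-8
```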
If we had more variables and wanted to make statements about conditional independence between certain ones, we would first have to marginalize out all the other variables that we don't care about — how to do that with Gaussians you now know, and if you don't, think about it — and then do this inversion, essentially this three-by-three inversion. That was example number one, showing how we can read off conditional independence and marginal dependence from the covariance and precision matrices. Now let's do a second example, with another graphical model and another physical setting. Let's say we want to reason about the costs that go into the price of electricity. You're a consumer, you have a contract with a provider that produces electricity for you, like the utility in Tübingen, for example, the Stadtwerke, and let's say for the simplicity of the argument that this electricity is produced by burning gas in a gas turbine, and that there are essentially just two sources of cost that go into this process: one is the emission price per ton of CO2 that the utility has to pay, and the other is the price of the gas when they actually buy it on the market. Now one thing you might be interested in is: given that you know what you and other people are paying for electricity, very roughly, what might the utility's cost structure be — how much are they paying for gas and for the emissions? This is another graphical model, which we can again write in terms of three independent Gaussian random variables. We begin with an independent Gaussian random variable for the gas price, something we don't know; let's call it nu1 and give it a Gaussian distribution. Then there is a second variable, let's call it x3, which is the emission price, and that is again Gaussian distributed with a mean and a variance, and these two things are independent of each other — let's just say that's the case, which is maybe not so bad an assumption. Now our electricity price will consist of a weighted sum of these two quantities, plus a little bit of a disturbance that has something to do with the decisions of the utility about how they construct their price; so by seeing our particular electricity price, we might be a little above or below what that sum actually is, depending on how the utility sets its price. One thing to note at this point concerns the weights we are going to have here: our price is a weight times the gas price, plus another weight times the emission price, plus a disturbance. A few questions you could think about for yourself. First, if you're economically inclined, you might have been thinking, when I set up this situation, that of course the utility wants to make a profit, so what you are paying isn't actually the sum of these two costs; it's the sum of these two plus something else, namely the profit margin. That's true, and what it means is that the mean of nu2 is in general not going to be zero; it's going to be something positive — the margin. A second thing you might notice, if you're more physically inclined, is that w1 and w3 are not going to be one in general, because we are not just summing up these two prices: the two things are measured in quite different quantities. This we're going to
see in a moment: gas is sold on the international market in terms of so-called British thermal units and paid for in US dollars, and the emission price is in euros per ton of CO2, but a ton of CO2 doesn't directly translate into electricity. So we have to think about what the transformation process is from these quantities into ours. This is a typical situation in all reasoning with physical quantities: things have units of measure, and when we talk about such relationships we have to be careful that these scaling factors w1 and w3 actually capture the physical process. Okay, so much for the setup of the problem. Now we can do the same thing as before: we can talk about the variables x1, x2 and x3 and derive the joint Gaussian distribution that arises from this generative process. We do that by again using an independent Gaussian distribution over nu1, nu2, nu3 and then thinking of the correct linear map A that maps from nu to x. I haven't put that map here; I'm sure you can work it out for yourself — maybe stop the video for a moment. If you do that, and again use the rules for how Gaussian random variables transform under linear maps into new Gaussian random variables, then you end up with this distribution, where the mean of x is given by A times (mu1, mu2, mu3), with A the matrix I haven't written down, and there is a new covariance matrix given by this structure. Now we notice that there is a zero in the entry (1,3) of this covariance matrix, and you can think for yourself for a second what that means: it means that in the marginal distribution x1 and x3 are independent, that x1 and x3, if we marginalize out x2, become independent of each other. That's exactly what we wanted our model to encode, and we can see it down here: the marginal distribution over x1 and x3 is given by another application of this "Gaussians are closed under linear maps" property; to compute the marginal distribution we just pick out the entries mu1 and mu3 and the sub-matrix with sigma1 squared and sigma3 squared on the diagonal and zero off the diagonal, and we see that they are independent of each other. Now we know from lecture two that this graphical model, which represents this process, also has an interesting conditional dependence structure: even though x1 and x3 are independent in the marginal, once we condition on x2 they will in general become dependent on each other. Let's see if the Gaussian model can actually capture this kind of structure — and of course it can. So here is our graph again, copied over from before, and this is our structure again, copied over from the previous slide. Now we write down the joint distribution over x1, x2 and x3 using this structure, which amounts to saying: there is an independent variable called nu1 which directly gives x1; there is an independent variable nu3 which directly gives x3; and x2 is actually not an independent random variable, but there is an independent random variable called nu2, given by x2 minus w1 x1 minus w3 x3, and that is independent. So there is another term in here, which is p of x2 given x1 and x3, and that's what this graphical model actually represents. So this is our joint Gaussian again — oh, there's a square missing, I'll have to fix that later: there should be a square here and a square there, and a bracket around this with a square on it. Okay, so I've just plugged in these values.
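Here is a rough NumPy sketch of that construction. The prior means, variances and the margin are illustrative guesses, and the weights are rough unit-conversion factors of the kind worked out on the whiteboard further below (roughly 0.93/293 × 100 ≈ 0.32 cents per dollar-per-MMBtu and 0.001 × 100 = 0.1 cents per euro-per-ton); none of these are official figures.

```python
import numpy as np

# Assumed, illustrative numbers: x1 = gas price (USD/MMBtu), x3 = CO2 price (EUR/t),
# x2 = electricity price (cents/kWh). w1, w3 are rough unit-conversion weights; the
# mean of nu2 is a guessed margin so that the price lands near 30 cents/kWh.
w1, w3 = 0.32, 0.10
mu_nu  = np.array([4.5, 27.0, 17.5])      # means of nu1 (gas), nu2 (margin), nu3 (CO2)
var_nu = np.array([1.0, 4.0, 25.0])       # assumed prior variances

# x1 = nu1,  x2 = w1*nu1 + nu2 + w3*nu3,  x3 = nu3
A = np.array([[1.0, 0.0, 0.0],
              [w1,  1.0, w3 ],
              [0.0, 0.0, 1.0]])
mean_x = A @ mu_nu
cov_x  = A @ np.diag(var_nu) @ A.T
print(mean_x)    # the middle entry is the predicted electricity price
print(cov_x)     # the (x1, x3) entry is zero: gas and CO2 price are marginally independent
```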
This is just the explicit form of the Gaussian probability density function arising from multiplying the three Gaussians here together. In the next line things are correct; I've fixed my typos. Now we can go through and collect all the terms with x1 and bring those together: if we distribute these brackets, we get an x1 squared over sigma1 squared, and from here there will be a w1 squared x1 squared term that shows up here. Similarly for x2 there is a simple term that only comes from here — there is nothing else with a square in x2 — but a bunch of mixture terms with x1 and x2 and with x3 and x2, which show up here and there. Then there is another mixture term with x3 and x1, because there is also a term in here where they mix; that's over here. And then we have individual terms in x3 squared, one from the independent term and one from this conditional distribution over here. If you rearrange this expression into an inner product with a matrix inside, we are implicitly computing the inverse of the covariance matrix, and that inverse looks like this. Now we can see that in general there are non-zero entries everywhere in this matrix. What this means is that if you condition on any one of these variables, the other two variables are in general dependent on each other, unless the corresponding terms in this matrix happen to be zero, which corresponds to particular values of w1 and w3. In particular we can see that if you condition on x2, so if you keep x2 constant in this expression up here, then there is still a term where x1 and x3 depend on each other, and it contains this expression up here. There is another bug in here, annoyingly: this should be a plus. I'll fix this in the slides, because there is a plus here; here it is correct. There is a plus around this term, a positive number, which at first sight you might think means that these two quantities become positively correlated when conditioned on x2 — but it's actually the other way around. Think of this term: let's say we want to keep the probability constant; then if x1 gets larger, x3 has to become smaller, and if x3 gets larger, x1 has to become smaller, for the term to remain constant. So a positive term in this matrix corresponds to a negative correlation between the variables x1 and x3 when conditioned on x2. Let's see how this works out if we actually do Bayesian inference, to reproduce this structure as you would expect it. For that we could do two things. We could write down the form of the distribution over x1 and x3 if we just keep x2 fixed: that's essentially taking this function here and treating everything that contains an x2 as a constant. That means there will be a quadratic term in x1 — this is part of the function, this is a constant — a linear term in x1 that otherwise depends only on constants, then a quadratic term in x3, then a linear term in x3 with a constant, and then a more complicated term combining x1 and x3. Alternatively, we can go back and use the general rule from this slide down here and say: what we're doing here is Gaussian inference on the quantities x1 and x3 given the quantity x2, and this amounts to considering how to write the linear maps B and A that produce the corresponding variables.
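Numerically, continuing the sketch above, you can see both statements at once: the (x1, x3) entry of the precision matrix is positive, and the implied conditional covariance of (x1, x3) given x2 has a negative off-diagonal.

```python
# The precision matrix of the electricity model has non-zero entries everywhere;
# its (x1, x3) entry is +w1*w3/sigma2^2 > 0.
prec = np.linalg.inv(cov_x)
print(np.round(prec, 4))

# Conditioning on x2: the conditional covariance of (x1, x3) given x2 is the inverse
# of the corresponding 2x2 block of the precision matrix -- its off-diagonal is negative.
cond_cov_13 = np.linalg.inv(prec[np.ix_([0, 2], [0, 2])])
print(cond_cov_13)   # anti-correlation: a higher gas price explains away a higher CO2 price
```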
Here we have a quantity x that has three different entries, and y is just picking out one of them, so our A is going to be a row vector (0 1 0) that picks out x2, and then in general we measure y with some Gaussian noise. Afterwards we want to compute a posterior over just x1 and x3, so for that we need a B which picks those out — a selection matrix with rows (1 0 0) and (0 0 1) that picks out the entries 1 and 3. Then we can set b and c to zero, because we are not shifting anything, and what we get will tell us directly what the posterior is. So let's do that, which amounts to the same thing as taking this function, keeping all the terms with x2 fixed to a particular value, and rewriting until we get a Gaussian distribution. So here is the situation again. I've actually picked reasonably realistic numbers: if you look up online what the prices are for these corresponding quantities, you find that typical prices for a ton of CO2 are on the order of something between 10 and 25 euros per ton — which is arguably way too low to actually have an effect — and the price of gas, outside of corona times at least, is on the order of something like four to five US dollars per million British thermal units, which is the unit this quantity is actually traded in. So notice that these are very different physical quantities, and I've captured a rather vague assumption about their correct values in a Gaussian distribution. Now let's say we make an observation: we observe that we are paying something on the order of 30 cents per kilowatt hour for our electricity. We could first predict what value we expect for the price of electricity by computing the marginal distribution over x2 — that's what our model would suggest the price should be. We can get that by finding the right values to map from x1 and x3 to x2, and those will contain numbers that map from the physical quantities to the price we are paying. So maybe we can do that briefly on the whiteboard; I have some numbers here so that you get a feeling for how this actually works. It turns out that one million British thermal units (MMBtu, that's how it is abbreviated), if you burn them — gas burning is almost perfect in terms of heat — produce about 293 kilowatt hours of energy, and a dollar is something like 0.93 euros at the moment, I just looked that up. So w1 in our computation is the number we need to get from dollars per million British thermal units to euros per kilowatt hour, and that's going to be essentially 0.93 divided by 293 — euros per dollar divided by million British thermal units per kilowatt hour — which is roughly 0.003, a pretty small number; and if we actually want to talk about cents rather than euros, we multiply by a hundred, so we can drop two of these zeros. And w3 comes from the other quantity we talk about, the CO2 emission price: for that we need to know how much CO2 is produced when burning gas to make electricity, and you can find different numbers online, but it turns out that there is a very rough correspondence
between kilograms of CO2 produced when burning gas and kilowatt hours: very roughly, one kilogram of CO2 per kilowatt hour when burning gas, which is 0.001 tons of CO2 per kilowatt hour, and since that price is already in euros we don't have to convert much here, so w3 is just going to be 0.001, very roughly speaking. So now let's say we observe that we are paying 30 cents per kilowatt hour. Notice how these two numbers are comparable to each other — not the same, but comparable in size — which is why this linear function here has a slope that is not all that steep. When we observe our x2, x2 is a variable that lies on this rank-deficient subspace, essentially: we can only measure how far we are in this direction, because that's where our price is actually measured, while along this direction, if I can show it like this, the two quantities we care about, whose sum makes up our electricity price, can move. So if we now measure what our price actually is, up to some measurement noise — which might come from the fact that we don't really know how this relates to what they are paying, and also we don't really know how they make their price — then this corresponds to observing this kind of picture: observing the value, and getting a likelihood function. The likelihood for x2 given x1 and x3 is given by the Gaussian function we had on the previous slide; we now know what x2 is, but we want to know what x1 and x3 are. To do so, you can think about this in three different ways. One is pictorial: you can think of multiplying this prior Gaussian with this degenerate, rank-one subspace Gaussian — the likelihood is itself proportional to a Gaussian because of the symmetry of the Gaussian around its mean — and if you multiply these two Gaussians with each other, because the product of two Gaussians is another Gaussian, we get a posterior Gaussian distribution, and I can already show you that it looks like this; that's the posterior Gaussian we're going to get. Another way to think about this is to go back one slide, look at this expression here and treat it as a function of x1 and x3, where x2 is now a constant that we've observed; then you can rearrange and see that we get out a function that depends on x1 and x3, and that's our posterior. It's a little tedious, but it's a mechanical way to do it. A mathematical way that is a bit more formal is to go to our slide with the posterior forms and read off the posterior distribution. Which way you do it doesn't matter; you always end up with the same posterior. Here I've actually done the computation — you can look at it later — using the formula for the posterior distribution from the previous slides. The one thing you can see is that, given the observation, given the blue thing, our likelihood, the posterior distribution — the dark grey cloud you see here — is anti-correlated: the two variables x1 and x3, given the observation, become negatively correlated with each other. And that's what we saw before. It makes sense, because if you observe a certain price, then you can explain it in different ways by combining these two quantities, the two things that create the price, the gas price and the emission price; but if one goes up, the other one has to go down, of course.
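Here is a minimal sketch of that posterior computation, continuing the illustrative electricity sketch above; the 30 cents/kWh observation follows the lecture, while the observation-noise variance is an assumption.

```python
# Observe the electricity price y = x2 + noise and condition the joint Gaussian on it.
y, noise_var = 30.0, 1.0                 # 30 cents/kWh, assumed small observation noise
ia, ib = [0, 2], [1]                     # a = (x1, x3) = (gas, CO2), b = x2 = electricity price

S_aa = cov_x[np.ix_(ia, ia)]
S_ab = cov_x[np.ix_(ia, ib)]
S_bb = cov_x[np.ix_(ib, ib)] + noise_var

gain = S_ab @ np.linalg.inv(S_bb)        # standard Gaussian conditioning ("Kalman gain")
post_mean = mean_x[ia] + gain @ (np.array([y]) - mean_x[ib])
post_cov  = S_aa - gain @ S_ab.T
print(post_mean)    # updated beliefs about the gas and CO2 price
print(post_cov)     # negative off-diagonal: the two explanations trade off against each other
```

Running this, the posterior covariance has a negative off-diagonal entry, which is exactly the anti-correlated cloud described in the lecture.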
All right, so a non-zero term on the off-diagonal of the precision matrix, the inverse covariance matrix, corresponds to the opposite sign of correlation: a positive term corresponds to negative correlation, and a negative term corresponds to positive correlation. This is the end of today's lecture. Today I've introduced Gaussian distributions as a general tool for inference. We haven't done much practical stuff with this yet; if you're waiting for the applications, if you've been wondering how this course is going to turn into an applied, usable set of tools, wait no longer than the next lecture, where we're going to address this and then quickly construct really useful algorithms. Today I introduced Gaussian distributions as an algebraic tool that maps the complicated conceptual process of probabilistic inference onto linear algebra, and this works because Gaussians are exponentials of squares, and quadratic functions have really wonderful properties: the sum of two squares is another square, and therefore products of Gaussians are Gaussians; a cut through a quadratic function is a quadratic function, and therefore conditionals of Gaussians are Gaussians; a projection of a quadratic function is a quadratic function, and therefore marginals of Gaussians are Gaussians; and linear projections of Gaussian random variables are also Gaussian random variables. Therefore the sum and the product rule are, in a sense, reproduced: if you use the sum and product rule in linear Gaussian models, then we always stay within this Gaussian framework, and all the computations we have to do are just linear algebra. That's why, in the rest of the course, I'm going to say that Gaussians provide the linear algebra of inference, or that they map inference onto linear algebra. We did these two examples to give a concrete, very basic, hopefully convincing illustration of how the structure of conditional independence that we saw in lecture two maps onto the real-valued case with Gaussians. In particular, we saw that to find marginal independence we have to look at the entries of the covariance matrix, and to find conditional independence we have to look at the inverse covariance matrix, also known as the precision matrix, and at off-diagonal zero entries there. An off-diagonal entry in the precision matrix that is positive corresponds to negative correlation given all the other variables, and the other way around, and a zero there corresponds to conditional independence. In the covariance matrix, a zero on the off-diagonal corresponds to marginal independence of the two quantities. This lecture was maybe a lot of just algebra, but I find it really important, because Gaussians are going to be a key part of our toolbox. Here is our toolbox again, which I introduced a few lectures ago. This is the sort of stuff you carry around with you as a machine learning engineer as you go to work, and the Gaussians play an extremely important role in this toolbox. You know how every good craftsman, no matter what their trade is, what they are particularly working on, will always have in their toolbox, among all their specialist tools, a box of wrenches and screwdrivers that just apply to everything, because everyone is faced with having to loosen nuts and bolts and turn screws.
The Gaussians are going to be that part of your toolbox; they're actually a little toolbox of their own that you take out to do generic work. Whenever you don't know what else to use, you're going to use a wrench from inside your Gaussian toolbox, because Gaussian distributions map everything you then need to do onto linear algebra, which we're going to use as a very generic computational tool. Why? Because computers are extremely good at linear algebra. So I'm looking forward to seeing you in the next lecture, when we finally get to talk about even the simplest real-world applications. Thank you for your time.