Matteo, thank you — thank you everyone for sticking with it. Let me start sharing the screen. Oh, I see already quite a few chat messages — it's just hello, okay, just hello, but for a moment it looked like there were many questions before I'd even started. So I'll keep an eye on the chat and tell you when there are questions. Oh, that's very kind, thank you Matteo. Before we start, are there any questions regarding the last two lectures before we move on? Okay, if not, then what we're going to do today is the multivariate normal distribution, so we're moving from one-dimensional distributions to multi-dimensional ones. This is particularly relevant because, as you know, the course is called probabilistic modelling, and a fairly large class of models — essentially all linear models — can be understood in terms of multivariate Gaussian distributions, and the computations you need to do on these models to perform Bayesian inference are simply conditioning and marginalizing multivariate Gaussians, which is a particularly important task. The one thing I will use heavily today from last week's lecture is the change of variables formula. Recall that if you have a random variable x distributed according to p_x(x), and y = f(x) for a deterministic, invertible function f, then the density of y is the density of x evaluated at f⁻¹(y), times the derivative of f⁻¹ evaluated at y — equivalently, divided by |f′(f⁻¹(y))|. As someone observed last time, all I'm saying here is basically that p_y dy has to equal p_x dx, if we allow ourselves a simple, although not entirely rigorous, notation. A useful exercise for you: let x be distributed according to a Gaussian with mean zero and variance one, and show that the density of y = x² is a gamma distribution — in fact a special gamma called a chi-squared, with a particular shape parameter (one half), which identifies it as the square of a standard Gaussian. Another immediate application of the change of variables rule: if x is distributed according to a standard Gaussian, then y = a·x, where a is any real number, is also Gaussian, still with mean zero and with variance a². I saw there was a message in the chat, and I'm not sure how to get to the messages while I'm sharing the screen, so if you have a question please just unmute yourself and ask — it's by far the easiest thing. Oh, it's just another "good afternoon", good. So this last fact is a very simple consequence of the change of variable formula: you replace x by y/a, so the density becomes proportional to exp(−y²/(2a²)), and you also pick up a factor 1/√(a²) = 1/|a| in front of it, which is the derivative part.
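As a quick numerical sketch of these two facts — not part of the lecture, just an illustration assuming NumPy and SciPy, with an arbitrary seed, sample size, and scale a = 3 — the squared samples should be close (in Kolmogorov–Smirnov distance) to a chi-squared with one degree of freedom, and the rescaled samples should have empirical variance close to a²:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)          # x ~ N(0, 1)

# Exercise: y = x^2 follows a chi-squared with 1 degree of freedom,
# i.e. a Gamma distribution with shape 1/2 and scale 2.
y = x ** 2
ks = stats.kstest(y, stats.chi2(df=1).cdf)
print("KS distance of x^2 from chi2(1):", ks.statistic)      # close to 0

# Rescaling: y = a * x is N(0, a^2).
a = 3.0
print("empirical variance of a*x:", np.var(a * x), " expected:", a ** 2)
```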
Second fact, which I'll let you prove as an exercise — or you can take it on faith, but it's straightforward: let x₁ be distributed according to a Gaussian with mean m₁ and variance s₁², and x₂ according to another Gaussian with mean m₂ and variance s₂², independently; then y = x₁ + x₂ is also a Gaussian-distributed variable, with mean m₁ + m₂ and variance the sum of the variances. These are all well-known facts, of course: you can rescale a Gaussian and it remains a Gaussian; you can sum two independent Gaussians and you get a third Gaussian-distributed random variable. So now the question: assume we have d Gaussian variables, all distributed according to a standard Gaussian with mean zero and variance one, and all independent. I can gather these random variables into a vector in ℝᵈ, and because they're all independent it's trivial that the density in d dimensions of this random vector is the product p₁(x₁) · … · p_d(x_d). By a trivial calculation — each factor is an exponential of −x_i²/2 — I can rewrite it as (1/Z) exp(−‖x‖²/2). This is my simplest type of multivariate normal distribution: the one with zero mean and spherical, unit variance. I write it — using the same symbol N as for the one-dimensional Gaussian, but with a squiggle underneath for vectors — as N(0, I_d): the mean is a vector of zeros, and instead of a variance I have a matrix, the variance–covariance matrix, which here is the identity in d dimensions. Now, all multivariate normal Gaussians can be obtained from this basic one via linear — in fact affine — operations. If I define y = A x + m, where A is a d×d matrix and m is a parameter vector (a constant in ℝᵈ, not a random variable), then each entry of the y vector is a linear combination of the x's — Gaussian variables — plus a constant, so y is certainly going to be Gaussian distributed as well. The question is: what are its mean and its variance–covariance matrix? Moving to higher dimensions, instead of a single number for the mean and a single number for the variance, I need a vector for the mean and a matrix for the variance. I can calculate these very easily. The mean vector, by definition, is the expectation of the random variable y; y is A times the random vector x plus the constant vector m, and expectation is a linear operator, so this is A times the expectation of x, plus m (the expectation of a constant is the constant itself). The expectation of x is zero, because I started with the centred spherical Gaussian, so the mean is just m: the constant shift is the mean. Oh, I missed a question — okay. What about the covariance Σ? How do I calculate it? By definition of (co)variance, Σ is the expectation of the "square" minus the square of the expectation — and since we are dealing with vectors I have to take E[y yᵀ] minus E[y] E[y]ᵀ (sorry, I wrote x at first, I meant y). I see things flashing in the chat —
— ah, here it is: "Professor, can you show the formula for the variance–covariance matrix, the I_d of the x vector?" So I_d is just the identity in d dimensions: ones on the diagonal and one big zero everywhere else. And of course Σ is not the variance of x, it's the covariance of y. Now, the expectation of y is m, as we have already computed, so using the formula for y I get Σ = E[(A x + m)(A x + m)ᵀ] − m mᵀ. Let me work on this. Using the linearity of expectation, the quadratic term in x gives A E[x xᵀ] Aᵀ; then I get the cross terms A E[x] mᵀ and m E[x]ᵀ Aᵀ (the expectation of m is m itself); and finally m mᵀ − m mᵀ, so those two terms cancel. The cross terms also go away, because the expectation of x is zero — x is a centred spherical Gaussian. And E[x xᵀ] is the second moment of x, which coincides with its variance–covariance matrix because the mean is zero, so it is the identity in d dimensions. So we have that y is distributed according to a Gaussian with mean m and variance–covariance matrix A Aᵀ. Notice that this makes sense: the variance–covariance matrix of a Gaussian distribution needs to be a positive semi-definite matrix, and A Aᵀ is positive semi-definite because it is, essentially, the "square" of a matrix. This straight away tells you how you can, for example, sample general Gaussian vectors. If someone asks you to sample values y from a Gaussian with mean μ and covariance Σ, what you need to do is: sample x from N(0, I_d), i.e. sample each entry of a d-dimensional vector independently; then compute a matrix A such that A Aᵀ = Σ — there are many algorithms for computing such a square root of a matrix, but the most numerically stable one is the so-called Cholesky decomposition; and then compute y = A x + μ. In many cases — it happens quite frequently, actually — you might have to sample random variables from a specific distribution; if that distribution is a multivariate Gaussian with a particular correlation structure, all you need to do is compute the square root of the covariance matrix and then transform spherical Gaussians to obtain general multivariate Gaussian vectors. So, just to recall the steps very briefly: to introduce the multivariate Gaussian distribution we started from the spherical Gaussian, and we showed that you can obtain the general multivariate Gaussian — any Σ and any μ — by taking an affine transformation of a spherical Gaussian; in particular the Σ that we compute as A Aᵀ contains all the covariances between the variables, because all the correlations are induced by this linear transformation; and as a bonus this shows us how we can sample random vectors that have a particular correlation structure. Let's take a one-minute stop for questions, if there are any — unmuting yourself is the best solution, I think.
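To make the sampling recipe concrete, here is a small sketch, assuming NumPy, with made-up values for μ and Σ: compute the Cholesky factor A with A Aᵀ = Σ, draw spherical standard normals, and apply y = A x + μ; the empirical mean and covariance of the samples should then match μ and Σ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target mean and covariance (made-up values for illustration).
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# "Square root" of sigma: lower-triangular A with A @ A.T == sigma.
A = np.linalg.cholesky(sigma)

# Sample spherical standard normals and transform: y = A x + mu.
x = rng.standard_normal((100_000, 2))
y = x @ A.T + mu

print("empirical mean:      ", y.mean(axis=0))              # ~ mu
print("empirical covariance:\n", np.cov(y, rowvar=False))   # ~ sigma
```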
By the way, because of the change of variables formula, we also know what the density function of a multivariate Gaussian will be. I take my exponential bit first, with the −½: where I had x I now have to substitute A⁻¹(y − μ), and since Σ = A Aᵀ the quadratic form becomes (y − μ)ᵀ Σ⁻¹ (y − μ); and because of the derivative of the inverse transformation a determinant comes out in the normalization — a factor of the square root of the determinant of Σ in the denominator, on top of the usual (2π)^{d/2}. In short, p(y) = exp(−½ (y − μ)ᵀ Σ⁻¹ (y − μ)) / ((2π)^{d/2} √(det Σ)): this is the pdf of the multivariate Gaussian. One advantage of the derivation we've made is that it's immediately clear that the entries of the Σ variance–covariance matrix are the covariances — the second (central) moments — of the Gaussian. Now, if I want to marginalize — sorry, a question about the T: the T is not an exponent, it's a transpose. Yes, please? "Does Σ have to be symmetric?" Σ has to be symmetric, and it will be symmetric because it's A Aᵀ: when you transpose a product of matrices you transpose the individual matrices but also swap the order, so if you transpose A Aᵀ you get A Aᵀ back. "If it has to be symmetric, then why do we need that algorithm to find a square root? You said we need the square root of Σ — why do we need that?" Because to sample — I mean, normally you're given Σ: someone says "give me samples from the general Gaussian with this variance–covariance matrix Σ", and then you have to compute the square root of Σ. "What I'm asking is: to find A we use the square root of Σ, right? But the square root of A Aᵀ is not going to be A." The square root of A Aᵀ is going to be A, because the definition of square root of a matrix here is a matrix A such that A Aᵀ equals the given matrix. "Ah, okay, so it's this definition of square root — thanks." So, there are many things we may want to do with Gaussians, and a simple one is marginalizing them. Suppose we have a 3-d Gaussian, so our vector y is made up of three entries y₁, y₂, y₃, and is distributed according to a Gaussian with a certain mean μ and a certain variance–covariance matrix Σ. Now suppose I'm interested in a projection — call it π₁₂(y) — which is a vector ŷ consisting of only the first two components, so I'm marginalizing out the third component. What is the distribution of this vector? Obviously it's going to be a Gaussian, but how do we compute its mean and its variance–covariance matrix? It's actually very simple. What is the expectation of ŷ? I take the expectation of y₁ and the expectation of y₂, which is obviously (μ₁, μ₂) — without the squiggle underneath — that is, the projection π₁₂ applied to the μ vector. What about the variance–covariance matrix of ŷ? As we did before, we compute E[ŷ ŷᵀ] − E[ŷ] E[ŷ]ᵀ, and that's just going to be the top-left 2×2 block of the Σ matrix — Σ with just a chunk taken out of it — because entry by entry it is the little matrix of expectations
E[y_i y_j] − μ_i μ_j for i and j up to two. So ŷ, the projection onto the first two axes of y, is normally distributed with a mean that is the projection of the mean vector, and a variance–covariance matrix obtained by applying the projection to the two sides of Σ — which is equivalent to selecting the first two rows and the first two columns, the indices up to two. So, in multivariate Gaussians, marginalizing means removing entries. The other thing we might be interested in doing is conditioning. Conditioning is a bit — let's say we have a vector y, again three-dimensional, distributed according to a Gaussian with mean μ and variance–covariance matrix Σ, and now suppose you fix, say, y₃: what is the distribution of the remaining two entries y₁, y₂? "Excuse me, professor — I didn't catch the last thing you said: marginalizing is removing what entries?" You just take them away: if you're marginalizing the third entry, you take away the third entry from the mean vector, and you take away the third column and the third row from Σ. "Okay, I understand — it was just a bit illegible." You're right, it's not very legible; let me write it a little better. So, what about conditioning — yes? "Can you repeat what the projection is — what you wrote as π?" Projection just means that, mathematically, you can see it as projecting onto the hyperplane defined by y₃ = 0, parametrized by y₁ and y₂. "I understand the marginalization, but I don't understand why you used this notation — why is marginalizing just ignoring one entry?" That's a good question. Well, that is what marginalizing is: marginalizing means "regardless of" — here we're marginalizing y₃, so we're asking for the distribution of y₁ and y₂ regardless of what y₃ is doing, and what I need to do is essentially average y₃ out; that's how I introduced marginalization on Monday, by doing an average. What I'm showing you here — and that's why it's kind of interesting and not entirely trivial — is that doing that average in the Gaussian world is equivalent to just removing entries. y₁ and y₂ are random variables in their own right; they're Gaussian random variables because they're obtained from the big y = (y₁, y₂, y₃) by a linear operation, this projection. What is their distribution going to be? Since they are Gaussian we need their first and second moments, and to compute those we can either do an integral, which is a pain, or use the linear transformation on the moments, which is what I've done here. That's the message I really want to convey in this lecture, and particularly in this part: in the Gaussian world, because the distribution is entirely determined by the first and second moments, you can obtain the same results by computing moments, which can be much easier — as it is in this case — than doing integrals. I saw another question popping up: why is there a transpose on y₁? Well, the transpose of a scalar is itself, so let's just remove it for the sake of clarity. One more question? Okay, thanks. So that's not an entirely trivial thing, but because you're in the Gaussian world you can do the computation on the moments, and that is the same thing as marginalizing.
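As a small illustration of "marginalizing = selecting sub-blocks" — a sketch assuming NumPy, with made-up values for μ and Σ — the sub-block of Σ should match the empirical covariance of the first two coordinates of samples drawn from the full 3-d Gaussian:

```python
import numpy as np

# Illustrative 3-d Gaussian (values made up for the example).
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 2.0, 0.5],
                  [0.2, 0.5, 1.5]])

# Marginalizing out y3: keep the first two entries of mu and the
# top-left 2x2 block of sigma -- no integral required.
keep = [0, 1]
mu_marg = mu[keep]
sigma_marg = sigma[np.ix_(keep, keep)]

# Monte Carlo check: the first two coordinates of samples from the full
# Gaussian have this mean and covariance (up to sampling error).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, sigma, size=200_000)
print(mu_marg, samples[:, keep].mean(axis=0))
print(sigma_marg, "\n", np.cov(samples[:, keep], rowvar=False))
```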
Now, conditioning instead is a little bit more complicated. What is the distribution of y₁, y₂ given y₃? Let's look at it. It's going to be a Gaussian again — and why? How would we compute p(y₁, y₂ | y₃)? Let's write out the full gory details: we have a normalization constant, then exp(−½ (y − μ)ᵀ Σ⁻¹ (y − μ)). That's the full joint density, but here we fix y₃, so y₃ is a number now, no longer a random variable — that's how you condition: you stick the number into the formula. To write it out we partition Σ into blocks: Σ_{12,12}, Σ_{12,3}, its transpose Σ_{3,12}, and Σ_{3,3}. "Excuse me, could you specify what, for instance, Σ_{12,3} is?" Fair enough. Σ is a 3×3 matrix, and I'm treating the first two variables as one group and the third one as special, because the third one I'm fixing to a number; so I'm dividing the 3×3 matrix into a 2×2 block, which is the covariance matrix between the first two entries; a 2×1 vector, which is the covariance between y₁ and y₃ and between y₂ and y₃; its transpose; and the variance of y₃. "To get this form, did you directly use Bayes' theorem?" No — there is a good reason why Bayes is related, but you don't have to use Bayes' theorem here; you're just conditioning. "So for y₃ you just substitute its value in?" Absolutely, that's how you do conditioning: you substitute the value in. What is clear is that the result is still going to be a Gaussian, because if you stick a number in for y₃, the exponent remains a quadratic form in y₁, y₂, and the exponential of minus a quadratic form is still a Gaussian. The problem is that you have to invert this partitioned matrix, and this is one of the more painful calculations, particularly when you have more than three dimensions — you have to look up something called the partitioned inverse formula, a rather lengthy formula that gives you the blocks of the inverse in terms of the inverses of the individual blocks; it's also called the blockwise inverse. I don't want to go into it here, but look it up if you wish.
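Working that partitioned-inverse calculation through gives the standard conditional-Gaussian formulas, which are easy to code up. Here is a sketch, assuming NumPy, with made-up values for μ, Σ and the observed y₃: the conditional mean is μ₁₂ + Σ_{12,3} Σ₃₃⁻¹ (y₃ − μ₃) and the conditional covariance is Σ_{12,12} − Σ_{12,3} Σ₃₃⁻¹ Σ_{3,12}.

```python
import numpy as np

# Illustrative 3-d Gaussian and observed value of y3 (made-up numbers).
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 2.0, 0.5],
                  [0.2, 0.5, 1.5]])
y3 = 1.0

# Partition: a = (y1, y2) is kept, b = (y3,) is conditioned on.
a, b = [0, 1], [2]
S_aa = sigma[np.ix_(a, a)]
S_ab = sigma[np.ix_(a, b)]
S_bb = sigma[np.ix_(b, b)]

# Standard conditional-Gaussian formulas (what the partitioned-inverse
# calculation works out to):
#   mu_{a|b}    = mu_a + S_ab S_bb^{-1} (y_b - mu_b)
#   Sigma_{a|b} = S_aa - S_ab S_bb^{-1} S_ba
mu_cond = mu[a] + S_ab @ np.linalg.solve(S_bb, y3 - mu[b])
sigma_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)

print("conditional mean:      ", mu_cond)
print("conditional covariance:\n", sigma_cond)
```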
There is another thing: there are two ways in which you can define Gaussians. What we've done so far — when we started talking about marginalizing — is to start with a multivariate Gaussian specified by its mean and its variance–covariance matrix; and we've seen that in this case marginalizing is trivial, but conditioning is rather painful because you have to do these matrix inversions. The other way in which you can specify multivariate Gaussians is by giving equations, and this is more interesting. I could say, for example, that x is a two-dimensional Gaussian with mean μ_x and covariance Σ_x, and then that y = A x + ε — another random variable, y conditioned on x — where ε is Gaussian noise. What does it mean to have this equation? It just means I have a random variable x and I'm observing a linear transformation of x plus noise. Now, the pair (x, y) forms a four-dimensional vector, and it is also a four-dimensional random variable which, because the relationship is linear, is going to be Gaussian distributed. But what is its distribution going to be? The mean is very easy. Let's call this vector z, the concatenation of x and y; then E[z] is the concatenation of E[x] and E[y]. I can write it out straight away: E[x] = μ_x, and E[y] = A E[x] + E[ε], where the second term is zero, so E[y] = A μ_x. So the expectation of z is the concatenation of μ_x and A μ_x. What about the covariance Σ of this four-dimensional Gaussian, which is defined starting from a two-dimensional Gaussian and a linear equation? To compute it I need to do exactly the same computation as I did before for my definition of the multivariate Gaussian, and what I get is a block matrix with Σ_x in the top-left block, Σ_x Aᵀ and A Σ_x in the off-diagonal blocks, and A Σ_x Aᵀ plus the noise covariance in the bottom-right. Now, the reason I'm showing you this is that defining multivariate Gaussians as equations has a huge advantage in terms of conditionals, because I could also just write p(z) = p(y | x) p(x) — z is a four-dimensional thing made up of two two-dimensional things. p(x) I know: it is N(μ_x, Σ_x). And p(y | x) is trivial when the model is given in terms of equations, because it is exactly the important conditional: if I know x, what do I need to do to get y? I just multiply it by the matrix A and then add some Gaussian noise. So in this equation-centric world, p(y | x) is just a Gaussian with mean A x — because when I'm conditioning on x I'm sticking its value in, fixing it — and covariance σ² times the identity. Why do I insist on this conditioning? Because someone asked before whether we are already doing Bayes' theorem — well, we pretty much are. One of the simplest calculations we can do with Bayesian inference is to think of observing Gaussian random variables linearly. Suppose I now have my y — let's make it a scalar for a change, y = aᵀ x + ε — so I'm observing, with spherical Gaussian noise, a random variable x that gets multiplied by a vector a; or rather, let's keep it multi-dimensional, so I'm observing x through an observation process that multiplies by a matrix A, with noise ε. We have a very easy way of computing the conditional of y given x, and we have a prior over x, so the question now becomes: what is p(x | y), the posterior? To do this we apply Bayes' theorem: the posterior is proportional to the joint distribution, in which I then fix y equal to the observed constant, and this joint distribution I write as the product of the conditional — that is my equation — times my prior. "Excuse me — at first you wrote a-transpose, on the last page; I didn't get why it changed."
Why did it change? I changed it because I made a mistake — that happens to me as well, so don't worry; this is the correct thing that I meant to write. Let me see, there is another question: what about multiplicative noise? Multiplicative noise makes things a lot harder: with multiplicative noise you are no longer in a Gaussian-posterior world, so your posterior will no longer be Gaussian and you can't do these kinds of computations analytically in general. The world of Gaussians is limited to the linear world, where you can add and rescale by constants but not multiply random variables together. The simple reason is that if you have the product of two random variables, then when you put it into the Gaussian exponent and take the square, you get terms that are no longer quadratic in the variables. One more question after this — but let me finish this calculation, the first really basic calculation of Bayesian inference we're doing, at least in this course. Let's not worry about the normalization constants. We have the exponential of −½ times a sum in braces — the product of the two densities means the exponents just add — of (y − A x)ᵀ (y − A x) / σ², which is my conditional, the one implied by my equation, plus (x − μ_x)ᵀ Σ_x⁻¹ (x − μ_x), which comes from the prior. So this is p(x) and this is the p(y | x) implied by my equation. Now I want to know the probability of x given y; as I said, when you're conditioning on something you're fixing it, so what this means is that I look at this quadratic form and identify the quadratic terms and the linear terms in x. I know the result must equal, say, Z′ exp(−½ (x − m)ᵀ C⁻¹ (x − m)) — the most generic quadratic form in x — which corresponds to a Gaussian with mean m and variance–covariance matrix C, and these will be the posterior mean and the posterior variance–covariance matrix. How do I identify them? I compare terms. On the left, the quadratic term is xᵀ C⁻¹ x; this has to equal the quadratic terms on the right, where I have one coming from the prior, xᵀ Σ_x⁻¹ x, and one coming from the likelihood — from the observation: if I develop the square I get xᵀ Aᵀ A x / σ². So xᵀ C⁻¹ x = xᵀ (Aᵀ A / σ²) x + xᵀ Σ_x⁻¹ x, and the posterior covariance is C = (Aᵀ A / σ² + Σ_x⁻¹)⁻¹ — the observation term plus the inverse of the prior covariance, all to the minus one. Then I do the same thing with the linear terms, and then I will stop because I'm out of time. On the left I have xᵀ C⁻¹ m; on the right the linear terms are xᵀ Aᵀ y / σ² from the observation and xᵀ Σ_x⁻¹ μ_x from the prior. So m, the posterior mean, is going to be equal to C, the posterior covariance, times (Aᵀ y / σ² + Σ_x⁻¹ μ_x).
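Here is a small sketch of this posterior update, assuming NumPy and with made-up values for A, σ², μ_x, Σ_x and the observation y; it just evaluates C = (AᵀA/σ² + Σ_x⁻¹)⁻¹ and m = C(Aᵀy/σ² + Σ_x⁻¹ μ_x) as derived above. As the noise variance shrinks, the likelihood term dominates and the posterior mean moves towards what the observation implies — which is the point made just below.

```python
import numpy as np

# Made-up model: 2-d latent x with Gaussian prior, observed as y = A x + eps,
# with eps ~ N(0, noise_var * I).
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
noise_var = 0.1
mu_x = np.array([0.0, 1.0])
sigma_x = np.array([[1.0, 0.2],
                    [0.2, 1.0]])
y = np.array([0.7, 2.5])          # an arbitrary observed value

# Posterior covariance: C = (A^T A / sigma^2 + Sigma_x^{-1})^{-1}
prior_precision = np.linalg.inv(sigma_x)
C = np.linalg.inv(A.T @ A / noise_var + prior_precision)

# Posterior mean: m = C (A^T y / sigma^2 + Sigma_x^{-1} mu_x)
m = C @ (A.T @ y / noise_var + prior_precision @ mu_x)

print("posterior mean:      ", m)
print("posterior covariance:\n", C)
```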
One small observation: the smaller the observation noise σ² — the variance of the observation — the more the term coming from y dominates, so the closer our posterior mean will be to what the observation is telling us, which is very reassuring. So this is our first basic calculation: we have a Gaussian random variable x, observed with Gaussian noise, and we condition to obtain the posterior over x given our observation y. I think my lecture is now out of time; I've said most of the things I wanted to say, but if you have any questions please ask now, or write them down — you can email them to the secretary and they can be passed on to me — because I realize we've compressed quite a bit today. "I have a question: why is the projection of a Gaussian variable another Gaussian variable? I didn't get that very well." Because a projection is a linear operation, and every linear operation on a Gaussian gives back a Gaussian. "Okay, thank you." There's another thing in the chat — is that a question? Okay: has the reference for today's lecture been uploaded to the website, or will it be? I think if you look at the MacKay book, all of these things are there, in the chapter on probability distributions — around chapter 23, I think. "Excuse me, professor, are there maybe exercises, perhaps from that same book, that we could do to help us understand better?" I think if you read that book it will help you a lot, in all your life. "Can you repeat the reference name again, please?" It's on the website: Information Theory, Inference, and Learning Algorithms, by David MacKay. And yes, there are a lot of exercises in that book as well — very useful for getting a grasp of Bayesian inference and information theory; if you know that book very well, you can come and teach us. Okay, if there are no other questions, we'll take a short break of ten minutes and then go on with the second lecture, from Domenica Buetti. Okay, ciao everyone, thank you. Ciao Matteo — how are you? Good, and you? Let me stop recording, then you can gossip.