Okay. We can get started. I'm a little bit late as well; I've spent the last half hour just fixing bugs in my code. So let's start, as every Thursday, with feedback, which was actually very good — which is probably because only a total of 17 of you actually responded. This usually happens around where we are now, seven or eight lectures into the course: most of you just don't bother anymore. That's okay. I'll just assume that everyone who doesn't provide feedback is super happy with the course and doesn't have anything to complain about. Just don't then start complaining in the official evaluation once it comes around. There were very few questions about what you didn't understand. One of them, I have to admit, I don't understand myself, so I'm not sure how to answer it. I've read it three times and I still don't understand what the question is. That might be because you had to type it very quickly at the end. If you still have it after this lecture, then maybe just fill it in again in the feedback form. Yeah, let's not spend more time on this because there isn't that much to discuss. And instead, let's go back to this wonderfully ugly slide that I ended the last lecture with. So, a quick recap of the entire course. We realized at the very beginning that there is a mathematical framework for reasoning under uncertainty that keeps track of all possible hypotheses within a class of hypotheses by assigning probabilities to them. It has beautiful properties: it allows us to do inference, to reason about the truth given data. But it's sometimes computationally challenging, at least in its abstract form, because it requires us, first of all, to keep track of all the possible hypotheses — a number that grows exponentially with the number of variables we keep track of and linearly with the number of possible values that they can take, which is still a problem if you have continuous variables, because then that number is unbounded as well. So we then spent three lectures discovering first the mathematical foundations for how to reason about continuous variables with probability density functions, and then some algebraic structure called exponential families and their conjugate priors, which allow, in some sense, closed-form inference on interesting data types — on counts of observations, and on real-valued numbers using Gaussian distributions. And we saw that the Gaussian takes a very important, central place in this entire development. It's sometimes also called the central distribution, or the normal distribution, because of this. And last Monday we discussed the Gaussian for half the lecture and discovered that maybe the thing that makes Gaussian distributions really interesting is not so much the central limit theorem, which says that asymptotically, once a distribution becomes very narrow, you can locally approximate it with a Gaussian. The more important point is that Gaussian distributions in a sense map the entire process of inference onto linear algebra. And because computers are good at linear algebra, that's interesting: it allows us to rephrase inference in terms of linear algebra. So what I want to do today is turn this nasty piece of code — sorry, of math — into code, which might help clarify a few things as we pass by.
We can discuss how to do some of these things in code and then find — and that's hopefully going to be most of the lecture — a very first and very important, central application for this kind of structure. So I have a slide called code, but actually it's probably better if I quickly recap what's on this slide and then show you the code, because it's going to be difficult to swap back and forth between them. We discovered that a first nice property of Gaussian probability distributions is that if you multiply two of these probability density functions with each other — not the two random variables, not x and y, but two distributions over the same x with different parameters — then the resulting function is again of Gaussian form in x, with a normalization constant that is also tractable, and which can be computed by taking the parameters of the two incoming Gaussian distributions, means and covariances, and doing some operations on them that involve linear algebra: we invert matrices, sum them up, invert the sum, multiply inverted matrices with vectors, and so on. The second great property of Gaussians is that if you have a random variable that is distributed according to some multivariate Gaussian distribution, then any linear map — in fact, any affine map — of this variable is also Gaussian distributed, and the corresponding Gaussian distribution has parameters that are easy to compute: we just take the linear map that we've applied to the random variable and also apply it to the mean, and to the covariance from left and right. So the question was: what guarantees that the sum of these two precisions is invertible as well? That's a property of symmetric positive definite matrices: the sum of symmetric positive definite matrices is again symmetric positive definite, and the inverse of a symmetric positive definite matrix exists if the matrix is strictly positive definite. The linear-map property also directly implies a basic property of Gaussian distributions: their marginals can be computed in closed form by just picking out sub-arrays of the mean and the covariance. That's the special case where the linear map is just a selector map that contains ones and zeros in the right rows. And finally, a great property that's not so straightforward to see but requires a little bit of derivation: linear conditionals of Gaussian random variables are again Gaussian. That means if you have a joint distribution over two random variables that are jointly Gaussian distributed, and we condition on one of them and then compute the conditional distribution for the other, then that's also a Gaussian distribution, and it contains a bunch of terms that we will discuss later today. Because these two — conditioning and marginalizing — are the two central operations of probability theory, we can use them to construct everything else, in particular also to apply Bayes' theorem and to reason about one variable given observations of another variable that is linearly, or affinely, related to the object we care about, with a likelihood and a prior that are both Gaussian. So what does this mean in code, if you actually want to do it on a computer?
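Before the code, and just to keep them in one place, here are those four properties written out as the standard Gaussian identities (the notation on the lecture slide may differ slightly):

$$\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\,\mathcal{N}(a; b, A + B), \qquad C = (A^{-1} + B^{-1})^{-1}, \quad c = C(A^{-1}a + B^{-1}b),$$

$$p(x) = \mathcal{N}(x; \mu, \Sigma) \;\Rightarrow\; p(Ax + b) = \mathcal{N}(A\mu + b,\; A\Sigma A^\top),$$

and, for a joint distribution $p(x, y) = \mathcal{N}\!\left(\begin{pmatrix} x \\ y \end{pmatrix}; \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right)$, the marginal $p(x) = \mathcal{N}(x; \mu_x, \Sigma_{xx})$ and the conditional

$$p(x \mid y) = \mathcal{N}\!\left(x;\; \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right).$$

Everything the code that follows does is an implementation of these lines.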
Well, it means that there is this object called a Gaussian, which is parameterized by mean and covariance, and it has all these operations that we can do on it: we can multiply it with another Gaussian distribution, we can apply a linear map to it, and we can condition it on some observations. So here is code. Now the question is: can you read it in the back? Probably not — lots of shaking heads. Now, can you read it like this? Okay, good, because then I can show more. Maybe it's not so important what I import up here; I'll tell you when I use it. This is an early version of this code; we're going to expand it over the next two or three lectures. We're going to create an object called a Gaussian. Here I'm using a little Python trick called a data class. Who knows about data classes? Almost no one — interesting. There are still some Python features that not everyone knows of. So this is just a little flavor of a class that automatically creates its constructor and various comparator functions. It saves you having to write __init__ by just declaring the objects that define the class. So it's not particularly deep. This class is going to be called Gaussian, and it's parameterized by these two things, a mean and a covariance matrix. We already saw in the exponential family lecture that maybe we would like to use the precision and the precision-adjusted mean, because they are the natural parameters of the Gaussian distribution, but typically when we reason about random variables, we're going to be interested in the mean and the covariance. So that's maybe the base object we actually care about, the canonical parameters. And I'm not going to show you this yet, because we'll get to it, but basically I've already implemented various extra functions that simplify computations later on. And the one thing that is maybe useful to know already, because you might otherwise stumble over it, is this cached_property thing. Again, hands up: who's seen this before? Okay, just one, two people. So within functools, the functional-programming toolbox for Python, which is part of the base Python distribution, there's this decorator called cached_property for classes like this, which basically allows us to define an attribute that is only computed when it's needed, and is then cached so that you don't have to recompute it every time you need it. How are we going to use this? So now, in these few functions — no, in this one down here — I'm going to define these properties, these abilities of Gaussians. In particular, if you multiply two Gaussian random variables with the — sorry, no, let me go back; now I've made the mistake myself. If you multiply two probability density functions over a Gaussian random variable, then the resulting object is again of Gaussian form. And here I'm cheating a bit, because I'm leaving out the normalization constant. Maybe that's not good, and I haven't really decided yet whether that's the right functionality I want or not, but I've just implemented it this way. So if I give this Gaussian another Gaussian probability density function, then what it's going to do is take the precision of the other distribution and its own — the precision is the inverse of the covariance matrix — sum them, and compute the inverse of that, because that's what the equation says I have to do. And that's going to give me the new covariance. And the new mean will be the precision-adjusted mean of the other — its mean times its precision —
and its own, added together, multiplied with the new covariance matrix. And that creates a new Gaussian. For this, I obviously need a precision, and I need a precision-adjusted mean, so I've created corresponding things up here. The precision is the inverse of the covariance matrix. And the precision-adjusted mean — well, the precision-adjusted mean is the precision times the mean, from the left. And here comes the first little annoying aside, which many of you may have heard; if you haven't, then this is the lecture where you need to hear it. Computing matrix inverses isn't really a thing one should do. Numerically computing the actual inverse of a matrix is unstable and relatively expensive. That's why, if we can, we should try to avoid computing inverses as much as possible. Instead, a better thing to use is the Cholesky decomposition. And now it's 8:30, so I need to decide whether I need to talk about this or not. There will be another lecture later in the term where we can spend more time on the Cholesky decomposition, but I don't want you to just think: what's this thing? I don't know. So I'll give you a very quick hand-wavy overview. If you're solving a linear system of equations — matrix A times vector x equals vector b — you've all learned, probably in high school, but very certainly in your first-year undergraduate math class, how to do this. Does anyone remember? It's the thing where you don't really know how to put it into words, right? But some of you might start going through it in your head: I need to somehow take this row, and then subtract it from that one... Right, exactly that thing. This is called Gaussian elimination. You start with the first row, you divide by its first element so that there's a 1 up here, then you subtract the resulting line from all the rows below, and you get a 0 below here, and you keep going like this, and eventually you have an upper triangular matrix. And the operations you need to do in this procedure, you can store in a lower triangular matrix. That gives you two matrices: this A can be written as a lower triangular matrix called L, for lower, times an upper triangular matrix called U. And once you have this, if you apply the same operations in L to b, then you have a triangular system, and triangular systems are very easy to solve. They can be solved by something called back substitution: you start at the bottom, where you just have to divide by one number, and then you have the right solution for the last entry of x; then you plug that value into the second-to-last row, and now you again only have one number to solve for, and you put that in, and the next one, and so on. This operation is quadratically expensive. How expensive is it to compute the decomposition itself, if this thing has length N? Say it again — cubic. Why cubic? Because this is a matrix of size N by N, so we start by taking this row, which has length N, and subtracting it from all the other ones; there are N of those, so that's N squared — something of length N, done N times — and then we have to do that N times again to get to the triangular thing, so N cubed. So N cubed — polynomial, right? For a theoretical computer scientist that's all fine, but in practice, cubic is big if N is a large number, right?
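To make the quadratic cost of the triangular solve concrete, here is a minimal back-substitution sketch. This is my own illustration, not the lecture's code:

```python
import numpy as np

def back_substitution(U, b):
    """Solve U x = b for upper-triangular U in O(n^2) operations."""
    n = len(b)
    x = np.zeros(n)
    for i in reversed(range(n)):  # start at the bottom row
        # subtract the already-solved entries, then divide by the diagonal
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```

Each row costs one inner product of at most length n, and there are n rows — hence quadratic, compared to the cubic cost of producing the factorization in the first place.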
So this process gives us these two matrices. This is called the LU decomposition, and the person who came up with these two really very inspiring characters for lower and upper was Alan Turing. So this is the algorithmic implementation of Gaussian elimination, and it's called the LU decomposition. But the matrix we're working with here, this covariance matrix, is symmetric positive definite. So these two factors will essentially be the same: they're called L and L transpose. And if that's the case, then this algorithm can be made twice as fast, clearly, because you only have to store half of what you do. That's called the Cholesky decomposition. So this procedure, which you may also have encountered in your algorithms class, can be made very numerically stable. You know, subtracting and dividing floating-point numbers can be potentially dangerous — you learned about floating-point arithmetic in your first undergraduate year, in, I don't know, maybe your algorithms class, or maybe Informatik one or two. And in an operation like this, there's a lot of dividing and subtracting, so you need to be very, very careful. For example, one thing one can do is rearrange the order of the rows of these matrices, and the corresponding elements in x and b, such that the large ones come first and the small ones come last. This is called pivoting. And this is what happens inside your cool linear algebra library. That's why several professors might have told you not to implement your decompositions yourself, but instead call scipy.linalg-dot-whatever, because it's a handle to a beautiful piece of Fortran code that was written in 1978 by twenty-odd different PhD students who really spent time making it work. Okay. So when we compute mean times precision, it's therefore a good idea to use these Cholesky decompositions, because they're stable. So what I do is, on the one hand, I allow this class to actually access its own precision, if someone really wants to; it's basically a wrapper around inverse, and it carries this exact same problem. If you really want to know the inverse, do it, but maybe you don't want to. Instead, if you want to do something with the precision, it might be good to have an operation that allows you to multiply with the precision — an operator that just operates on whatever you give it. And for that, we first compute the Cholesky decomposition, which I do here. So here it is: L is one of these cached properties — actually, no, it's called L_factor, because there are two different functions in scipy, and now it gets a bit tedious. One is called cho_factor, and the other one is called cholesky. cholesky actually returns this matrix L; cho_factor returns a very efficient data structure that isn't actually a matrix — it just contains the numbers that allow you to solve efficiently. So the class creates this L_factor, and then whenever I want to multiply with the precision, I can do this: I can say, if you give me an x, here's a wrapper around multiplying it with the precision. It's just a way for this class to handle the fact that someone might need to multiply with the precision. So all of this together has allowed us to define multiplication of Gaussian probability density functions with each other. — Here, I should probably put a bracket around this? Yeah, you're right.
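For concreteness, the two scipy interfaces just mentioned might be used like this — a sketch under my own variable names, not the lecture's class:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, cholesky

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])               # an SPD covariance matrix

L = cholesky(Sigma, lower=True)              # the actual matrix L, with Sigma = L @ L.T
L_factor = cho_factor(Sigma, lower=True)     # opaque factorization object for fast solves

def precision_matmul(x):
    # multiply with Sigma^{-1} without ever forming the inverse
    return cho_solve(L_factor, x)

x = np.array([1.0, -1.0])
print(precision_matmul(x))                   # same result as np.linalg.solve(Sigma, x)
```

The point of the opaque cho_factor object is exactly what the lecture describes: factorize once at cubic cost, then solve against as many right-hand sides as you like at quadratic cost.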
See, this is the bit that I haven't tested yet, because we're not actually going to use it today. But thanks for paying attention. I don't remember whether I've actually tried this since I fiddled with the code, so maybe I've changed something now, yeah. More importantly, the central operation is that if you have a Gaussian random variable and we multiply it from the left with a linear map, then the resulting random variable is itself Gaussian distributed. And this we implement here. So if we take one of these Gaussian objects and multiply it from the left with a matrix — a JAX NumPy array, for example — then this Gaussian object will return another Gaussian object, which has a mean that is the original mean multiplied from the left with A, and a covariance that is the original covariance matrix multiplied from the left and right with A. And by the way, there's a special case of this where you can also add something to a Gaussian random variable. You can take a variable z and add five to it, right? Then the new Gaussian random variable is the old one with the mean shifted by five. That's so simple that it's not even on this slide; it's the affine special case. And I do that here with an __add__. But then there's a subtlety with adding: you could of course also add another Gaussian random variable, which would be pretty much the same thing, just that you also add the covariances. So we need two different adding functions: one for if you add a floating-point number or a vector, and one for if you add an object of type Gaussian. And this is tricky in Python, because Python is not statically typed, so it's super annoying. If this were a strictly typed language, the function could dispatch on the type of its input and, depending on the type, do one thing or the other. There is a really convoluted Python workaround for this — for those of you who care, and I know it's going to be like three people — called singledispatchmethod, where you first register one of the implementations, and then at the very bottom, outside of the object, you register the other one, where you say: oh, I can also add a Gaussian. Just so that you're not confused if you look at the code. If you didn't understand the last three minutes, it's fine: you can just add Gaussian random variables to each other, and the resulting thing is still Gaussian. The more important thing is this condition method. Here we need to look at this equation a little bit. If we have two jointly Gaussian random variables x and y, like up there in the row right above, and we condition on the value of one of them — or, more generally, if we have a Gaussian random variable and we condition on an observation of this form, with a Gaussian likelihood where the observation is linearly related, or affinely related (by affine I mean linear plus a shift), with some covariance — then the resulting probability distribution from Bayes' theorem is a Gaussian. I can leave out the B and the c for a moment: the new mean is given by the old mean plus $\Sigma A^\top (A \Sigma A^\top + \Lambda)^{-1} (y - A\mu)$ — the covariance of x with the observation, times the inverse of the marginal covariance of the observation, times y minus A applied to the mean. One way to interpret this is to say: you take the observation, and you check under the marginal what you would have expected the observation to be.
That's the prediction, the marginal prediction for y. You subtract them from each other, so you check the residual: how surprised are you by the observation? And to understand how surprised you are, you need to know how far you expected them to be from each other, given that they are random variables. How far you expect them to be from each other is what this matrix tells you: that's the marginal covariance of the observation. So when we apply its inverse from the left, that's like asking: how far am I from what I've observed, on a scale of how far I expected it to be from the mean? Together, that's what's sometimes called the innovation — the surprise caused by the observation. That's by how much we have to update. And then we scale it back onto how much this observation of y has to do with the thing we care about, x: that's the covariance between x and y. So: what did I expect to see? What did I actually see? How far did I expect these two things to be away from each other? And what does this observation have to do with the thing that I care about? And the posterior covariance arises from objects that I already pointed out last week, which are very similar; it involves the same terms. So let's implement this. This function takes an A, a y, and a Lambda that define the likelihood. A is the linear map, y is the observation, Lambda is the covariance of the observation — the noise on the observation, if you like. And then what it does is compute this object that is often called a Gram matrix, after a Danish mathematician, because it's an outer product of A with itself, A Sigma A transpose — such matrices are called Gram matrices — plus Lambda. And then it does linear algebra on it. So, I said you shouldn't be computing inverses; you should instead do this. This is what happens in the next line. We call one of these cool low-level Fortran libraries to compute a Cholesky factorization of the Gram matrix. We tell it that we want the lower thing — we want L, not an upper triangular matrix, because these routines sometimes aren't quite sure which one they return. And then we compute this mean, which is the old mean, plus the covariance between the thing we care about and the observation, times the solution of this linear problem: y minus A mu, multiplied from the left with the inverse of the Gram matrix. That's what this solve function computes for us. And this second line, this one, is going to be fast, because it's only quadratically expensive in the number of observations; and this one is slower, because it computes something of order O of n cubed. And that's nice, because we can now reuse this L to also compute the covariance, which requires an inverse of the same matrix. So we take the old covariance, minus the covariance between the thing we care about and the observation, times the solution to the problem: Gram matrix inverse times A Sigma. That's what we have here on the bottom right. This is the most complicated part of all of this, and we're going to talk about it a few more times over the next few weeks, so this has to suffice for now. The main point is: it's linear algebra. We have to be careful to do it right, so that it's stable and fast, and we reuse the work invested in solving linear problems. And we have now encapsulated it in our object called Gaussian. So from now on, for the rest of this lecture, we're not going to talk about this anymore.
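Written out as a standalone function, the update just described might look like this. A sketch: the lecture's class wraps the same computation as a method, and the variable names here are mine.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def condition(mu, Sigma, A, y, Lam):
    """Posterior over x given an observation y = A x + noise, noise ~ N(0, Lam)."""
    gram = A @ Sigma @ A.T + Lam             # marginal covariance of the observation
    cho = cho_factor(gram, lower=True)       # O(n^3), done once
    residual = y - A @ mu                    # the "innovation": surprise about the data
    mu_post = mu + Sigma @ A.T @ cho_solve(cho, residual)          # reuse factor: O(n^2)
    Sigma_post = Sigma - Sigma @ A.T @ cho_solve(cho, A @ Sigma)   # same factor again
    return mu_post, Sigma_post
```

Note how the expensive Cholesky factorization of the Gram matrix is computed once and reused for both the mean and the covariance update — that is the "reuse the work invested in solving linear problems" point.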
We've filed it away in our heads: it's this complicated thing, but the Gaussian can do it, and it's now hidden inside this object. Then we just add a few more helpful properties. For Gaussians, one thing you might care about is the diagonal of the covariance matrix, and then the square root of that. Why might we care about this? These are the marginal standard deviations of every individual element on the diagonal. So if you have a vector of variables that are jointly Gaussian distributed, then this object tells us, in some sense, the error bar on every single one of these elements, if we wanted to know them one by one. That'll be useful for plotting. The other thing we might want to do is draw samples from a Gaussian distribution. And here maybe we can spend five minutes talking about how to actually sample from a Gaussian distribution. You've probably not cared much about this — it's not really something that goes through your mind while you're sitting at your WG dinner table, and also because you might know that there are functions that do it for you efficiently. But for a moment, it can be useful to understand what it actually means to draw from a Gaussian. The most important thing to know is that it's as expensive as solving one of these big linear problems. This is very useful to know, because otherwise you might convince yourself that you could use Gaussian random variables in some cool way to speed up computation — and that doesn't work, because you need to do this. So, you had a question? Here we compute it, and here we use it. There is no B here — so this is maybe not good; I should have updated this slide. This is the most general form, where the object we care about is, again, a linear map of the thing we inferred. So this line does two things at once: it first calls condition, and then it multiplies from the left with B and shifts by c. Yes, we're doing this — where the relevant part is not even this line, it's this line — with B equal to 1 and c equal to 0. Every year I show this slide, and every year I come out of it thinking I should have done this slide differently so that it's not so confusing. But then I remove one thing and add another, and I realize the slide gets even fuller. So maybe I just leave that one row, because then I don't have to shift things around. The capital B is — so, everything that you don't find in the equation is just what it needs to be for this to work. Capital B is 1, little b is 0, little c is 0. The little b is a shift in the observation, and the little c is a shift of the prediction. OK, so how do we sample? I actually have a little demo for this. I don't know — I need to zoom out. First of all, sampling from a Gaussian always starts with random numbers distributed according to a standard Gaussian. The standard Gaussian is the one with mean 0 and identity covariance matrix. And, by the way, I'll have a slide later that shows where you can get the app if you want to play with it more. So we start with standard Gaussian random variables — those are these little dots here. And then you do two things: you transform them such that they have the right covariance matrix, and then you shift the transformed ones so that they have the right mean. Shifting is easy — you just add the mean, because the mean is a vector and it shifts. The transforming is the important bit, and there are various ways of doing it that all revolve around these objects.
So one way to do this is to use this L from the Cholesky decomposition, which you can sort of think of as a variant of a matrix square root. If A is symmetric positive definite, then we can write A as L L transpose, and that's sort of like a square root — not quite, because of the transpose, but whatever. There's also something called a matrix square root, which is literally a matrix M such that A can be written as M times M. That works as well; it's a very minor variant of this, just a little bit less useful numerically. And that's what I do here. So far, this thing has a simple covariance, so let's make the covariance a bit more complicated. For example, we can scale one dimension by some scale; that gives us an elongated Gaussian. And now, in black, you see the original samples, and in red, the finished ones. We could also add a mean, so we shift things around a bit, like this. Then we have, in black, the original distribution; in blue, the transformed one with the right scale; and in red, the shifted, transformed one. And we can also add a covariance — now it's rotated a little bit. And you can see that — or maybe you don't see it so well; maybe you want to change the plot, if you like; send us a pull request — we start with the original black distribution, you transform it in blue, and then you shift it in red. This transformation uses this L matrix up here, which now has this form. And you can think about what this linear map actually does: it takes the first element and scales it; then it takes the second element and scales it, and it takes something of the first element and adds it to the second. It's sort of a squish-and-shift, a shear, if you like. It's not a rotation — this matrix is just not a rotation. The alternative is to take this matrix A and compute its eigenvalue decomposition. Does anyone know something about the eigenvalues of symmetric matrices? Your linear algebra class is calling, from five years ago, or three years ago. Symmetric matrices have orthogonal eigenvectors, and their eigenvalues are real, and they exist. And if the matrix is positive definite, they are not just real, they are also positive. So we can write this as a matrix U, times a diagonal matrix D that contains real positive numbers, times — as you said — a matrix called U transpose, because it's a symmetric matrix; otherwise it would have to be a U inverse. And this allows us to compute a matrix square root, literally, because we can write this as U times D to the one-half, times D to the one-half times U transpose. Because the matrix is symmetric positive definite, there are positive numbers on the diagonal, so we can compute their square roots — otherwise they would become complex, right? And now this matrix is clearly the transpose of this matrix, and we have another type of square root. So you can do this as well; it's another way to create Gaussian random variables. Here's our eigenvalue decomposition, and here is our Gaussian distribution. And if I flip back and forth between the two different ways of computing this Gaussian, then maybe you can convince yourself that the circles are the same — they are the same distribution — but the samples look different, because these are just two different operations being applied to the samples. This one is a scaling of the axes followed by a rotation — that's what U is. And this one is, well, this triangular thing.
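Here is what the two transformation variants might look like side by side — a minimal sketch assuming NumPy, not the app's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

z = rng.standard_normal((1000, 2))          # standard Gaussian samples, mean 0, identity cov

# variant 1: Cholesky factor as "square root" (cheaper, numerically more stable)
L = np.linalg.cholesky(Sigma)
samples_chol = z @ L.T + mu                 # each row is now an N(mu, Sigma) sample

# variant 2: symmetric square root from the eigenvalue decomposition
d, U = np.linalg.eigh(Sigma)                # real, non-negative eigenvalues for SPD Sigma
M = U @ np.diag(np.sqrt(d)) @ U.T           # M @ M = Sigma
samples_eig = z @ M.T + mu                  # same distribution, different individual dots
```

Both sets of dots are valid draws from the same Gaussian; only the map applied to each individual z differs, which is exactly the flipping-back-and-forth effect in the demo.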
So the big takeaway is: to sample from a Gaussian distribution, you take standard Gaussian random variables, apply a transformation, and then a shift. And there are different transformations you can do. They have different properties — for example, this one is cheaper than this one: Cholesky decompositions are about three times faster than eigenvalue decompositions, and they are also more stable numerically. And they do different things, so you shouldn't expect the samples to look the same. Transforming samples is not the same as transforming a probability density function. But those are still valid samples from the Gaussian distribution, just as much as these are, even though the dots are in different places — they are equally valid draws from the same density. And so the final thing I need to tell you is how to actually draw random variables from a standard Gaussian distribution. And this is really the sort of thing where some very smart people have done really smart stuff for us. There's something called the Box-Muller transform — here's a picture from Wikipedia on how to do it — which uses standard uniform random variables. So there are low-level functions that produce uniform random numbers on the unit interval from 0 to 1 at floating-point precision. And to draw from a univariate Gaussian, you take two of these bits — sorry, not bits, two of these uniform random numbers; they are plotted in here. Then you take the first one, the first axis, to decide on a radius, through some nonlinear transformation that involves a logarithm, and the other one to decide on an angle, through a nonlinear transformation that involves, depending on the implementation, a sine or cosine or, I think, some other cool function. And if you take those two together, you get things that are jointly Gaussian distributed. So now you have two Gaussian random variables that are independent of each other. And if you only need one, you take the first one and use one of these cached properties to keep the other one around; if someone asks you for more, you just make more. And you can actually reuse the last one and add only one uniform random number to produce the next one, and they'll still be independent of each other. That's really cool: asymptotically, you only need one uniform random number to produce each Gaussian random variable. And that's not even the most efficient way to do this — it's just the historical one from the textbooks. There is an even faster one that is totally unintuitive, called the Ziggurat algorithm, and it was invented by George Marsaglia, the pope of randomness. So, all of this happens inside the piece of code that I showed you before, which does this. And you can tell it which method to use. By the way, this one uses yet another decomposition, called the singular value decomposition. Hands up, who's heard of the singular value decomposition? Everyone. And what does the singular value decomposition have to do with this one here? Yeah — for the case of symmetric positive definite matrices, the singular value decomposition is almost the same as the eigenvalue decomposition, up to a square somewhere. So it requires computing these matrices U, which is the expensive bit; it's just a slightly different algorithm, one that doesn't use the fact that this matrix is symmetric positive definite, which makes some things a bit more stable. Why?
Because I might call this function later on — now we're building a package that we can call many, many times — and I might call it on some Gaussians whose covariance matrices actually have zero eigenvalues. So there are some dimensions that don't get scaled at all, along which we don't draw. And then I want this method to still work and not throw annoying errors all the time. That's why I tell it to use the SVD, because that can deal with zero eigenvalues. Otherwise, you could give it eigenvalue decomposition or even Cholesky as the method, and it will take care of it. But as soon as you get very small eigenvalues — dimensions that have nearly zero variance — this thing is otherwise going to throw an error. Yeah? Why the transpose at the end? Because it's from the right? No, this is just a question of the shape of the array. I want the number of samples to be the first dimension of the array: if I draw 50 samples, I want an array to come out that is 50 by 2, in this case. So I rearrange the arrays so that the 50 is on the left, and that's just transposing things. Oh, sorry — no: you're scaling by the variance, and scaling means multiplying by its square root. It's like a one-dimensional Gaussian random variable that you'd like to scale: you would scale it with its standard deviation, not with its variance. And this is the multivariate version of that. OK, at this point I think it's a good idea to take a break, and we continue in five minutes, at 9:05. — So, why is all of this important? Those of you who don't care about how to draw from standard Gaussian distributions might now be thinking: how is this class going to teach me how to build large language models? This is completely irrelevant to those things, right? So let's see if we can make sense of this. And we will keep doing baby steps, because I don't want to lose you all. Let's say that in your machine learning engineer life, you encounter a supervised machine learning problem that looks like this. What you've done is you've measured some quantity y by setting some other quantity x to some values — you've done a sweep through x and measured y. This is the typical setting in the engineering sciences. Where does this emerge? For example, here is my former colleague Jeannette Bohg with another former colleague called Apollo, at the Max Planck Institute up on the hill. If you're building a robot, one of the things you might care about is how much torque each of the joints of the robot produces as a function of how much current you put in. That's one of the very basic things an engineer does when designing an electrical motor: how much torque do I get out if I apply a certain current to this robot? These functions tend to be of a very simple form — more current, more torque. And what that means is: there is a function that you would like to know. This function maps from x to y, and you don't know what it is; you've just measured it at a bunch of points. And because you're a good engineer, you've measured it with error, because you have a sensor. So there's a torque meter — like a little... OK, fifty years ago it would have been a spring on the robot that goes up a little bit if you put in some current, and you measure how much the spring deforms. And then you convince yourself: how much of an error bar do I have on this? I don't know, a millimeter.
Or, if you have a nice modern torque meter, you can turn it around and on the back it says plus-minus, I don't know, 0.05 Newton meters. So you know what the error is. And because you're a good scientist or engineer, you've added — where is it? there it is — you've added error bars to your plot. What do error bars actually mean? Gaussians. Whenever someone draws an error bar, they mean that there is a Gaussian likelihood being assumed somewhere. So this is up here: we assume that these measurements y arise by evaluating a function f at locations x, with an error that is given by some number squared — that's the squared error, sigma squared, for every individual measurement — times a unit matrix. That means we assume that the measurements are independent of each other in their error. This is a case of conditional independence: if I knew what the function f of x was, then the measurement errors would be independent of each other. Nice. So this is a Gaussian; we can do Gaussian inference. We just need to define a Gaussian over a function. But functions are infinite-dimensional objects. They come from these fancy spaces that the mathematicians told us about in our second undergraduate term. How am I going to put a function into this Gaussian? If I go back to my code: I said that the Gaussians have a mean and a covariance, which are arrays. They are not callable objects. Annoying. So what do we do? Well, it's 2023 and everyone has heard of machine learning. How do we describe functions with numbers on a computer? Neural network. Good — it's a good knee-jerk reaction. Neural network. OK. And no, no, I'm not trying to make fun of you; it's a good reaction. So what are neural networks? They are parameterized functions. That's all they are. OK. So, you've looked at this function — what does it look like? A cubic. OK. I don't know if anyone here has a physics or engineering undergraduate degree: what do you expect as the response of an electrical motor if you put in current? What kind of relationship will there be between torque and current? A sinusoid. Linear. Good. OK, there are still people who think in terms of linear functions. Yeah, it's just going to be a straight line: you put in more current, you get more torque; it's just a constant in front. And maybe there's a shift at zero. Sometimes you have creeping currents, so even if you set the current to zero, the robot actually slowly moves like this, because there's still a very small torque. So how do we write a linear function? We write f of x is w1 plus w2 times x. So there's a shift, which in high school you would have called the intercept of the function — Achsenabschnitt in German, I think. These days, of course, we call it a bias, because it's a neural network now, and this thing on the right is called a bias feature. I never liked "bias", because bias is a really loaded word — but whatever, it's just a constant; it's just one. And the other weight is what in English is called the gain, or the slope, of this linear function. So we can write f of x as one parameter we don't know, plus a second parameter we don't know, times x. And I need to think about when I switch to code and when not. So maybe what we notice here — because you've already had a deep learning class, I can't surprise you with this anymore — is that this is the simplest possible neural network you've ever encountered.
It's a single-layer, linear neural network. So the only thing that we haven't talked about yet — which some of you might already have on your minds — is that when I do deep learning, I need to define a loss function, right? There's no loss function here yet. OK, let's forget about the loss function for a moment; we'll get back to it. Think about it in terms of Bayesian inference. There are two numbers here, called w1 and w2, and we don't know them. So we're going to have to do inference on them. And the way we created this function was by writing a linear function like this: w1 plus w2 times x. One neat way to think about this is that there is effectively already a hidden layer in this neural network — it's just that we know the first layer. It's a function that takes x, the input, and returns two features of it. The first one is the identity feature: it just returns x. And the second one is the bias feature: it just returns one, whatever x is, all the time. And that's it. Now we can do Bayesian inference. We just put a prior over w1 and w2, and we notice that this here is a linear map of this unknown vector, w1 and w2. That map is linear in w — and that's what we care about, because w is the thing we care about. The thing multiplied from the left is a function of x, which happens to be a linear function as well, because it evaluates x; but that doesn't actually matter. What matters is that the thing we don't know, w, gets mapped linearly through some function. So actually, I'm going to show you an app for that, for which I should probably first show you a slide that tells you how to find the code. If you clone the git repo that you already have — if you've cloned it before, just pull from the repo now — and go to the seventh lecture, then you can find this thing. Where there's something wrong; I need to reload. Now it's better. And this is turning what we just had on the slide into a piece of code. On the right, you see this data set in black. And what I've done is just the code up here — I'll zoom in for this so that you can read it, and later on I'll zoom back out again. I've defined a Gaussian prior distribution — actually, let me show you a slide. I define a Gaussian prior distribution over these two unknown numbers, w1 and w2, which necessarily has a mean and a covariance. One simple thing to do is to set the mean to 0 and the covariance to the identity. That's this distribution, and I have it in the app as well; it's down here. So how do I do this in code? I say I want a Gaussian that has a mean given by some mean vector — which in the app I can actually shift around — and a covariance given by this nasty covariance matrix, which I define over here; it's just a long expression because it's an app and I need to put it somewhere. That defines my Gaussian prior. That's it. Now I have this object called prior. That prior over w, though, is automatically also a prior over the function f. Why? Because the function is a linear map of those weights w. And we knew — that's why we did all this work — that linear maps of Gaussian random variables are Gaussian random variables. So if we define this function f — sorry, phi — which in this case is just a function that takes x and returns x and 1, and multiply it from the left onto the mean, and from the left and the right onto the covariance, then this gives us a prior distribution over functions. How do I do this in the app?
I say: let's define a grid, so that I can make a plot, because I can't plot function objects. So I make a grid that goes from left to right, minus 5 to plus 5, in 100 steps. Then I evaluate this function called phi, which takes in x and returns one column of 1s and then a column of x's. And then I apply this phi from the left to this Gaussian distribution, the prior. And it returns a Gaussian distribution that has a mean and a covariance and all these other things, and we can sample from it, and so on — all the stuff that we care about. And I make a plot. So what have I plotted here? I've plotted the mean, in fat gray, which comes from this mapping and happens to be the zero function. Why? Because I've set the mean to zero. And then I draw three samples from this function by calling the sample function, which we discussed just before the break, and it produces three sampled functions: 1, 2, 3. And they actually correspond to these three points over here, to these three weight vectors. You can see that there are three samples in the weights — 1, 2, 3 — and each of these is one of these functions. Why? Because here is one weight vector where the intercept weight is a positive number and the gain weight is a negative number. So that's a function that is above zero at x equals zero, and it has a negative slope. That's this function. And then we have two that have positive slope — w2 is positive — and one of them crosses below zero and the other one crosses above zero. That's those two. So now comes the shaded area. Next question: where does the shaded area come from? That's already an interpretation of this as a space of possible functions. What I'm doing is — I said before that we can efficiently evaluate the square root of the diagonal of the covariance matrix, and that this gives a marginal error bar on each element of this Gaussian variable. That's what I did here: I take the standard deviation of each of these elements, and then just plot mean plus two standard deviations and mean minus two standard deviations, and shade the area in between with pyplot's fill_between — I fill between mean minus two and mean plus two standard deviations. The mean is the gray line in the middle; that's the zero line. I can make it a non-zero line by, for example — OK, I'm going to click here in a moment. What will happen? Let's check whether your intuition is right. If I shift the first element of the mean of the weights — the first weight is the intercept — you can maybe gesture with your hands: what do you think is going to happen? Yeah, exactly, you're doing it: it's going to shift the whole thing up and down. OK, let's do that. So we've moved in w1, and that means the function has moved up in a constant way. I can make it move up, and I can make it move down. What happens if I shift the other one? People go like this. Only one person goes: it's going to tilt, like this. Right — because the gain of the function will change. So I click here, and this thing becomes steeper; and if I click over here, it goes steeply down. And what happens if I change sigma 1, so I scale this first weight's variance? Oh, this is a good exercise; we should do this more often. Yeah, someone gestures: wider. But where? Everywhere? Is the whole gray thing just going to open up? Let's see. Only the middle, right? The narrow part becomes wider. Oh, and it goes back down. OK. What happens if I change the other one?
Well, the outer parts become wider, because the gain goes up and down. What happens if I introduce a covariance between the two? So if I say positive gain is correlated with positive slope — eh, sorry, positive intercept is correlated with positive slope. Let's try it. So now, functions that are above 0 at 0 will tend to have a positive slope. That's all this says. OK, fine. Whatever. So now what do we do? Bayesian inference. We have our data, and we can start loading data. So let's do that. I have — oops — I have a function down here where I can select individual data points to be added. Which one should we add first? Let's start from the left. So what I'm going to do is — we did all this work in the first part of this lecture to be able to say: these Gaussians take care of the annoying linear algebra. So I'm going to say: this likelihood — let me just go back; maybe I have a slide for that. Yeah, OK. This likelihood — I need to zoom out so that you can see better — effectively means that individual observations are made independently given the function. And that means I can compute a posterior distribution by calling the prior and conditioning it on the fact that I have some pairs x and y. This is called a supervised machine learning problem. And I compute the feature functions of x — remember, that's just 1 and x — and then I condition the distribution on x — sorry, the distribution on w, on the weights — on the fact that phi of x times w, plus Gaussian noise of variance sigma squared (a random number), is an observation called y. And then I don't think about what this does anymore. Inside, there is this fancy linear algebra happening, but we've encapsulated it away; it's fine. Afterwards, this gives me a posterior, and I can plot the prior exactly like the — sorry, I can plot the posterior exactly like the prior. It's just a Gaussian distribution. So after that, I don't think about it anymore. So let's put in the very first datum. This is the data point here on the left, the very first one; it happens to be down there. To this datum, there corresponds a likelihood, a Gaussian likelihood, and I plot it in blue up here. It's this very elongated blue thing. Why is it so elongated? Because we're observing one thing, a scalar Gaussian random variable, but we're doing inference on two Gaussian random variables, w1 and w2. So it has to be degenerate. And that's fine, because likelihoods need to be probability distributions over the data, but not over the thing we're trying to infer. It just looks like this. It's fine — it's still of Gaussian form; it's not a probability distribution over the weights, it's not normalizable, but who cares? It's just a likelihood. So we multiply the prior by the likelihood: the gray thing is the prior, the blue thing is the likelihood, and it's a product of two Gaussian probability density functions. So the resulting object is a probability density function — the conditional, the posterior. That's this red thing. And it tells us that the function probably looks like this; it's in red. What happens if I put in the second observation? The second observation is actually quite interesting, because it's down here: it's very noisy, and it's far away from what we might expect it to be. So let's plug that in. And you can see that the likelihood actually says: this is a very steep function that goes down.
But because we are multiplying by the prior — the red thing is already a Gaussian distribution, two different likelihood terms multiplied together with a prior — and because the prior regularizes, this algorithm still says the function is probably going to go up: I'm below 0, and the gray thing just doesn't allow us to have a function that goes down like this. If I wanted that to be possible, what would I need to change? Change the prior. So I could scale up the variance like crazy until it allows this in. Yes? The plotted likelihood — is it a log-likelihood? No — ah, good question. I'm not plotting the value of the likelihood in y. What I'm plotting is a function of y and x and w, actually, which is itself of Gaussian form. So at every value of x it has a mean and a standard deviation, and I'm plotting the mean and the standard deviation. There is sort of a z-axis to this plot, which is the actual value of the likelihood: at each of these points in x, you can think of a little Gaussian bell curve that sits there and protrudes out of the wall, and the height of that curve is the likelihood. But I can't plot that, so I use color. In later plots, I might actually have shadings, and then we'll see. OK, so now we put in more data. Second data point, third data point, fourth, fifth data point, seventh, ninth — well, we just put them all in, in whichever order; I don't care, I need them all in. OK, now all the data are in, and this is our posterior. Bayesian inference! Now we've learned a linear function, with uncertainty. We can draw from it — you can actually see samples in the background; they're not so easy to see because this distribution is so narrow. Yeah. Let's think again briefly about what happened. Here is the complicated algebra part that we hid away from ourselves in the code. We've defined this function: we have a data set that consists of pairs of x and y — in this case, x is real numbers and y is also real numbers. We compute a feature function of them. It's a neural network with a single layer that has a bias feature and a linear feature: it returns x and also 1. And I do this for all the individual observations, all the way to xn, and then I actually transpose this thing. So there's a notational subtlety on this slide: what the code actually produces is an array of shape n — the number of observations — by, let's call it d, the dimension of the feature function; or maybe f is better. f, the number of features. The number of features is 2: the feature that returns x and the feature that returns 1. And then I say that the function I care about, I pretend, is of the form f of x equals feature function transpose times weights, where there are two weights: one of them is the weight for the bias feature, and the other one is the weight for the gain feature. And that gives me, basically, observations of f of x up to Gaussian noise. And then I just condition. So how do I do conditioning? Well, I'm not going to show you that slide again, because it's so annoying. I'm going to step through this instead and spare you all of these math parts, because they're so complicated and annoying — and we've implemented them in code now, so we don't need to care about them anymore. So what we do is: we define a prior that takes in a mean and a covariance. We define our feature function, which is a function that takes in x and returns 1 and x, all the time.
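In case you want to reproduce this flow outside the app, the whole loop fits in a few lines. This is a sketch with made-up data and my own names; the lecture's Gaussian class wraps the same algebra behind a condition method, with a Cholesky solve instead of the plain solve used here:

```python
import numpy as np

def phi(x):
    """Feature function: a bias feature and the identity feature."""
    return np.stack([np.ones_like(x), x], axis=-1)      # shape (n, 2)

# Gaussian prior over the two weights
mu_w, Sigma_w = np.zeros(2), np.eye(2)

# made-up data and a known noise level (read off the sensor, as in the lecture)
X = np.array([-2.0, -1.0, 0.5, 2.0])
y = np.array([-1.9, -1.2, 0.3, 2.1])
sigma = 0.05

# posterior over w: condition on y = phi(X) @ w + noise
A = phi(X)
gram = A @ Sigma_w @ A.T + sigma**2 * np.eye(len(X))
mu_post = mu_w + Sigma_w @ A.T @ np.linalg.solve(gram, y - A @ mu_w)
Sigma_post = Sigma_w - Sigma_w @ A.T @ np.linalg.solve(gram, A @ Sigma_w)

# pushforward onto a plotting grid: mean and marginal two-sigma band
grid = np.linspace(-5, 5, 100)
Phi = phi(grid)
f_mean = Phi @ mu_post
f_std = np.sqrt(np.diag(Phi @ Sigma_post @ Phi.T))
```

The last three lines are the "Gaussians are closed under linear maps" property at work: the posterior over weights becomes a posterior over function values by multiplying with the feature matrix of the grid.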
We don't actually need the plotting grid unless we want to plot. Then we load some data. And here is where we do Bayesian inference: we just tell the prior, please condition yourself on having observed that phi of x, times the two variables that you are representing, equals an observation y, up to Gaussian noise of variance sigma squared, independent across observations when conditioned on the function. Just do it and return whatever comes out. And we call this the posterior. It's a posterior over this two-dimensional weight vector. And then we can do what we have up here to make plots in x. Because Gaussians are closed under linear maps, I can compute the features of some plotting grid and map both the prior and the posterior through them. Yes — how do I estimate sigma? I don't. I just know it. I've turned my measurement device on the robot around, and it said plus-minus 0.05 Newton meters. So sigma is 0.05. If you don't know it — we'll get to that. OK, so that was easy. Kind of. Well, after we'd spent the first 45 minutes writing this piece of Python code that does all the conditioning for us — all this linear algebra, the Cholesky decomposition and the eigenvalue decomposition and all this complicated code — now we can just call prior.condition on the observation. And the only thing we need is the data x and y, and this feature function phi, our neural network that I've defined. It still pains me to call this a neural network — for historical reasons, of course — but whatever, that's just what it's called now; it's fine. And then we just condition. So we're done, no? Supervised machine learning is solved. There's only one problem: most functions don't look like this. They look like this. So here's another function. And, well, what's up with that function? Someone actually calls it out: it's not linear. It's not a linear function. What do we do? I go back to the code. Add a bunch of new phis. So what he's saying is: look at this code. The only thing I actually need to change is this line, because this is still a function that goes from x to features, and I just need to replace this function with something else. So what other function could I use? Yes — we could use more weights. And I was actually curious to hear what would come out of this group in 2023, what your suggestions would be. Maybe someone would have said: deeper network, right? That's good. You're saying: just more weights. So where do we add the weights? Do we make our network wider? Let's do that first; let's think about deep later. Deep is somehow complicated, right? Wider is easier. Let's use more weights. How? x squared, x cubed — OK. So I've got an app for that. Here is our function; I've already put all the data in. And what I've now done is define something called polynomial features. That's a proper way of doing this, which says: you give me x, I return polynomials of order up to the number of features minus 1. So if I set this to 2, that's our linear thing from before. What it returns is x to the power of an arange over the number of features. And I actually divide it by — complicated, complicated — that's the factorial: I divide by the factorial of the order of the feature, to make it a little bit like a Fourier-type — sorry, a Taylor-type — feature. So we'll have 1, x, x squared over 2, x cubed over 3 factorial, x to the fourth over 4 factorial, and so on. Why? Because otherwise I can't plot these.
So I can plot these features. No? What the fuck? What's up? Yeah, I need to redraw. It's just so gray that you can barely see it — I think the contrast is too low. OK, not nice; there's a gray overlay that I should probably have changed. So there's a prior in the background that you can barely see — you can maybe just make out this little shape here, which we've had before. So I could condition on it, and it would give me this linear regression result. But clearly, this is not a good regression; it's not a linear function. Maybe you can think for yourself about why this is the wrong thing to do — take it outside afterwards and think about it, and we'll talk later about how to say precisely that it's wrong.

Like you said, we should now add more features. So I'll take the number of features and add a third feature. So now it's cubed, and that gives me this prior in the background and this posterior. Someone said more features — so fourth order, fifth order. You like this? Good, getting better, no? Is that a better model than linear? That's better. Is it good? No. So what do we do? Did you say the problem is that the uncertainty increases in extrapolation? Why is that a problem? That's good, no? Far away from the data, I'm not sure. But that's not the issue. The issue is that this looks like a bowl, and the data does this kink here. So what do we do? Oh, we could define piecewise polynomials. OK, I don't have those in my list. What else could we do? Use other features altogether. Very good. If I add more orders here — 16th order, say; by the way, in the back in gray you see all these polynomial features — you get something, but it's not good. And it extrapolates like crazy: these functions shoot down to something like minus 10⁶ and then come back up again.

So yes, other features altogether. What should the other features be? Sinusoids, someone calls out. Cosines, yeah. So here now I have a bunch of cosines in the back, and I can change their frequency and their output scale. And this is what this posterior looks like. So I can maybe scale them up a bit, make them broader. That doesn't really do much, no? Not interesting. Maybe I can change their frequency — make the length scale longer? No, that's not good. Let's make the length scale shorter. Oh, OK. Now — why does it look like this? Does someone know? Because they are all cosines. Cosines are symmetric around 0, all of them. And no matter what your data is, the space of functions you can span with cosines has this property: it has to be symmetric around 0. So this is the best thing we can do. OK, what can we fix? Can we combine all of them? Can we add sines and cosines? Yes, we can. So here now in the back you see both sines and cosines — up here, by the way, you see the features. The feature function now stacks cosines and sines, and then divides by the square root of the number of features so that things don't blow up. So I can make this a bit broader by scaling all the features up and down — that's like scaling the prior, basically. And I can change the length scale, make them a bit faster-moving. And I can add more frequencies, maybe 44 of these features. Huh, getting somewhere. That looks interesting.
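Here is one way such a stacked sine/cosine feature map could look, with the scalings just described — an output scale, a length scale, and the 1/√(num_features) normalization. The exact frequency grid of the lecture app is an assumption here:

```python
import numpy as np

def trig_features(x, num_features, lengthscale=1.0, outputscale=1.0):
    """Stack cosines and sines at increasing frequencies; divide by
    sqrt(num_features) so the prior variance stays fixed as features are added.
    Assumes num_features is even (half cosines, half sines)."""
    freqs = np.arange(1, num_features // 2 + 1) / lengthscale
    phases = x[:, None] * freqs                       # shape (n, num_features // 2)
    F = np.concatenate([np.cos(phases), np.sin(phases)], axis=1)
    return outputscale * F / np.sqrt(num_features)
```

Adding the sines is what breaks the symmetry around zero that the pure-cosine model was stuck with.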
Any other ideas? So the question is: could we have shifted and scaled the cosines? In a way, I've sort of done that, because cosines are periodic — by having all these frequencies, they overlap at various points. You could add a shift so that they don't equal one at zero but at some other point, but that would just shift this entire plot a tiny bit left and right, and once you are one period over, it repeats. So you're thinking of a neural network, right? We could have some features that shift things back and forth. Maybe that's a good idea. Maybe let's think of some features that we can put down localized in space. What kind of features would you like to use? You've had a deep learning class — what kind of features would you like? The one that I actually expected someone to shout out in the very first instance. Someone's already said it. Hello. You can use a ReLU, yeah. Even if you feel kind of dirty about it, right? ReLU, ReLU, ReLU. Okay, let's do it.

So here's a bunch of ReLUs — ReLU features that are very regularly spaced, from left to right. Actually, let's make them fewer so that you can see them. So here are eight ReLU features. The first one starts there and then moves up; the second one starts there and then moves up; and so on. They're distributed around — you only see six of them because the other ones are further up. Here they are; I've defined the ReLUs properly, as the maximum of zero and a linear function. By the way, does everyone know what a ReLU is? It's a function that is constant and then becomes linear — that's what we have up here. A rectified linear unit. And just to make sure, for those of you wondering what's going on, if you're lost: the only thing we've done is take the code from the previous app and replace φ. Done. And now we are playing with φ, trying different features. So I could take a lot of these features, regularly spaced, and maybe scale them up and down — that's like a joint weight for all of them, right? — so that they don't rise so steeply. There's something wrong with the prior plot somewhere, I'm not sure what. If I make them rise really steeply, then the posterior looks like this. Do you like this more than the flatter one? No, it's not so good, no? Somehow it would look nicer like this. Okay, uh-huh. What else? If you look at the code up here, this function takes in x and the number of features it's supposed to build. It constructs a range from zero to the number of features. For every even feature — if the index modulo two is zero — it adds a minus one, so that the ReLU points in the other direction; for every other one it doesn't, so the ReLU points in this direction. Then it subtracts one half, which is what centers the kinks, and it multiplies by a linear function, because that's what gives us a ReLU: we take a linear function and cut it off at some intercept, so it's zero up to there and linear afterwards. And then we divide by the number of features so that they become a bit flatter if you have more features — otherwise we would increase the height of the prior with more and more features.
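Here is a best-guess reconstruction of that ReLU feature function in code — alternating slope signs, regularly spaced kinks, and the 1/num_features scaling, as narrated above; the exact placement of the shifts in the lecture code may differ:

```python
import numpy as np

def relu_features(x, num_features):
    """Regularly spaced ReLU features max(0, +/-(x - shift)) / num_features."""
    i = np.arange(num_features)
    sign = np.where(i % 2 == 0, -1.0, 1.0)   # every even feature ramps in the opposite direction
    shift = i / num_features - 0.5           # spread the kinks across the input range
    return np.maximum(0.0, sign * (x[:, None] - shift)) / num_features
```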
We condition on the other random variable, called y, having a particular value. So there's a capital Y — a random variable — and that Y happens to have these values; it's a vector that contains these numbers. We know y — I mean, we know these black dots. You're saying... ah, so you're asking: do we know the data-generating process? No, I'm pretending that I know — I'm just saying it's Gaussian. But that's what everyone does. No, I do know y: y is the black dots. Someone has given them to me. In the Bayesian formalism, the data is always there; it's not something random. The data is just numbers that sit on your hard drive — you've loaded them, literally, with this piece of code, and now they're just there, values that the random variable has taken. And now we need to say what the likelihood is. That's the fundamental problem of statistics, whether Bayesian or frequentist: you just have to say what the likelihood is. And here I'm claiming it to be Gaussian, because that's what every engineer does when they draw an error bar — they just pretend it's Gaussian.

So, we're running a little bit out of time. I just want to play with this a little more and make one very important point. Maybe you like this plot; you could come up with other features. Does anyone want to suggest any other features? Sigmoids — you've seen them in here, maybe. So this is difficult to see; maybe I need to lower the number of features so that you can see them individually. What you see here in the back — barely, because somehow this projector makes the colors bad — is three sigmoid features. Sigmoids are like 1/(1 + e⁻ˣ): they start at zero, go up to one, and become flat again. And there are three of them: the first one starts there, the second one starts there, the third one starts there. It's really a shame that you can't see the prior, because it looks so nice. And I can scale them up or down. Now there are three of these, so if I make them wide enough, this thing is going to learn a function that first goes down a little bit, then goes up almost linearly, and then further up almost linearly. And of course I could make these sigmoids steeper by dividing the input by a factor, and then we basically just get kinks — little steps. Here are the features, by the way; I should have just pointed at those. They are one over one plus e to the minus the input, shifted by some location and divided by some length scale. And I could of course add lots and lots of those, 50 or so, and then — if I also make the length scale a bit longer — we get something nice and smooth. So sigmoid features can also be good. Here I'm using 1/(1 + e⁻ˣ); I could also use a tanh, whatever — it looks very similar, yeah?

This is a very good question. So here I can maybe briefly go back to this nasty math, because sometimes it does pay to look at the math. If we have a prior on the weights and this likelihood for the observations — or this likelihood here, sorry — and then condition on it, we get out the posterior distribution over the weights, which has this form: a Gaussian with this mean and this covariance.
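Written out in the notation used so far — prior w ∼ N(μ, Σ), feature matrix Φ ∈ ℝⁿˣᶠ, likelihood y = Φw + ε with ε ∼ N(0, σ²I) — this is the standard Gaussian-conditioning result (reconstructed here, not copied from the slide):

```latex
p(w \mid y) = \mathcal{N}\Big( w;\;
    \mu + \Sigma \Phi^{\top} \big( \Phi \Sigma \Phi^{\top} + \sigma^{2} I \big)^{-1} (y - \Phi \mu),\;
    \Sigma - \Sigma \Phi^{\top} \big( \Phi \Sigma \Phi^{\top} + \sigma^{2} I \big)^{-1} \Phi \Sigma \Big)
```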
And yes, this is a complicated expression, but the thing I want you to see is that the posterior covariance — this thing here over the weights, the expression that is being computed inside our Python class — does not contain y. y is here, in the mean, but it's not in the covariance. So what this means is that this machine learning algorithm we've just built, this very basic one, produces an uncertainty that does care about where it has observed, x, and about how it thinks what it has observed relates to the function — that's φ — but it doesn't care about the actual numbers it has seen when computing its uncertainty. That's just a property of Gaussian conditioning. And that's why, when you have data that doesn't fit at all into what you think you're seeing, you can make this thing confident or unconfident — it doesn't matter, right? It doesn't notice that the data is off. We will need to fix that, and we will.

And just so you can all play with this yourself a bit more — maybe on your screen you can see the prior properly — you can come up with all sorts of other fun features, for example step functions. They look ugly like this, but if you add lots and lots of them and scale them properly, they actually look kind of decent, right? So let me scale them nicely, maybe like this, and then we need a few less of them. They can make kind of nice, interesting structures; they're pretty much like pixels in an image. And various other ones — you could come up with all sorts of crazy features. Here's another kind of step function that starts at zero and goes to one.

But I'll go to the end, because I want to give you a chance to give some feedback as well. What we did today was, first of all, to construct Gaussian linear algebra in a piece of Python code that encapsulates all its complexities. We spent 45 minutes thinking about those complexities and why they are so important, with all these pictures here, and then we dropped them into the code and never looked at them again — well, not never, but not for the rest of this lecture. We'll think more about them later; for now they are encapsulated away. And then we realized that we can use this object we just built to learn functions, even nonlinear functions. And the thing that made this possible was to define a linear map from some parameter space — which you might call a weight space — to a function space, by multiplying with features of the input data x. And what we just discovered by playing around with this app is that we are totally free to choose any features we like. Any. If you don't like ReLUs, you can use GELUs. If you don't like periodic functions, you can use sigmoid functions. You can also use mixtures of them: five linear features, some polynomial features, five trigonometric features, and 180 step functions — mix and match and put them together in any way you like. You're not constrained by the layers that TensorFlow provides to you; you can write your own layers in JAX or in PyTorch or whatever you like, and they all fit into this Gaussian framework. So what we've built here — I can already give away the secret — is a linear neural network, a one-layer neural network with Gaussian weights. And we've seen that we can do closed-form Bayesian inference on it, and it produces a posterior that is actually interesting to look at. We'll start looking at it much more next Monday and then think about what we can do with this thing. Thank you very much.