Welcome to lecture seven of Probabilistic Machine Learning. Here is where we are in the course. In the first six lectures we saw that probabilistic reasoning is an extension of propositional logic that allows us to reason with uncertain statements. We saw in lecture two that doing so poses new computational challenges, because we now have to keep track of a potentially exponentially large set of hypotheses. In lecture three we saw that we can even extend these probabilities to continuous spaces, which in a way complicates the computational issue even further. In lecture four we encountered our first computational tools that provide actual methods to do inference even in continuous-valued, general probabilistic models — Monte Carlo methods — and in lecture five we extended them into a relatively generic tool set. In the last lecture I introduced another crucial tool for probabilistic inference: Gaussian distributions. Gaussian distributions, as we saw back then, map probabilistic inference onto linear algebra. Linear algebra is something computers are good at, so Gaussians are going to be a beautiful tool to use.

What I mean by mapping inference onto linear algebra is not just one single property of Gaussians; it's a whole list of such properties. Products of Gaussian probability density functions are, up to normalization, again Gaussian probability density functions, where the parameters of the result can be derived from the parameters of the factors using operations that amount to linear algebra — essentially inverting matrices and multiplying matrices and vectors. Marginals of Gaussian distributions are Gaussian distributions, which is convenient in particular because the marginalization operation is very straightforward: it amounts to just selecting subsets from a vector and a matrix. This operation therefore provides an implementation of the sum rule within the Gaussian family. Linear maps of Gaussians are Gaussian: if a Gaussian random variable is mapped through a linear map, we get out another Gaussian distribution whose parameters are just the original ones mapped, from one or both sides, through the linear map in question. As a consequence, conditionals of Gaussian distributions are also Gaussian: given an observation that is related to the quantity we care about in a jointly Gaussian, linear fashion, the posterior — the conditional distribution arising from this kind of observation — is also Gaussian. Because this provides, essentially, the product rule and the sum rule of probability theory, we can also construct Bayes' theorem and see that if all the variables in question are jointly Gaussian distributed, and all the observations we make of them are linear projections of them, then posterior distributions over these quantities, and over all their linear maps, are also Gaussian. The corresponding expressions can sometimes be a little tedious, and you have to stare at them for a while to see the structure in them, but the most important takeaway so far is that all of the operations you see here are just linear algebra: products of matrices and solutions of linear systems of equations, which in simple terms amounts to inverting matrices.
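To have them at hand below, here are those closure properties in symbols (the placeholders a, A, b, B, Λ are generic, following the notation of the last lecture):

```latex
% Products of Gaussian pdfs are (scaled) Gaussian pdfs:
\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B)
  = \mathcal{N}(x; c, C)\,\mathcal{N}(a; b, A + B),
\quad C = (A^{-1} + B^{-1})^{-1},\; c = C\,(A^{-1}a + B^{-1}b).

% Marginals are Gaussian (just select the sub-vector / sub-matrix):
p(x, y) = \mathcal{N}\!\left(\begin{bmatrix} x \\ y \end{bmatrix};
  \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix},
  \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right)
\;\Rightarrow\; p(x) = \mathcal{N}(x; \mu_x, \Sigma_{xx}).

% Linear maps of Gaussians are Gaussian:
p(x) = \mathcal{N}(x; \mu, \Sigma) \;\Rightarrow\;
p(Ax) = \mathcal{N}(Ax; A\mu, A\Sigma A^{\top}).

% Hence, for linear observations y = Ax + \varepsilon with
% \varepsilon \sim \mathcal{N}(0, \Lambda), the posterior is Gaussian:
p(x \mid y) = \mathcal{N}\!\left(x;\;
  \mu + \Sigma A^{\top}(A\Sigma A^{\top} + \Lambda)^{-1}(y - A\mu),\;
  \Sigma - \Sigma A^{\top}(A\Sigma A^{\top} + \Lambda)^{-1} A\Sigma\right).
```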
What we're going to do today is use this insight to construct our first concrete, still very basic, Bayesian machine learning algorithms that allow inference on non-trivial quantities that are actually important in practice. I will start with a very simple example, of course, and this will quickly evolve into a much more powerful framework; in the process we will have to look at these expressions a few more times in more detail, but we'll do so deliberately slowly, so that you get to see what's actually happening on the computer.

So what is the problem we're going to solve? It's the problem of learning a function — or, in fact, inferring a function. In probabilistic reasoning, the words "learning" and "inference" are often used interchangeably, even though other parts of the machine learning community don't do so; we will see over the course of this lecture why that might be. The problem we're going to start with is trying to figure out a function that explains this kind of dataset. We have an input x — in this case we assume x is a real number, but very quickly we will move to examples where x can be a vector in a real vector space — and this x maps to some other quantity which we get to observe with noise. So we collect pairs of x and y, the little crosses in this plot, and the question that arises is: if I give you new x's, can you predict the new y's? What this amounts to is saying that there is some functional relationship between x and y, and what I'd like to know is that functional relationship. This problem is of course known as supervised machine learning: we get inputs and targets, x's and y's.

Now, this is a particularly simple dataset. Normally in the lecture I'd ask you why it's a simple dataset, but this is one of those questions students often find too easy and then don't dare answer, because they feel they might look like a fool. The important aspect of this particular dataset is, of course, that it's quite well described by an underlying linear relationship — a straight line. Don't worry, we'll very quickly talk about functions that aren't linear, but this is still an interesting case. So where does this arise? You might think this is too trivial a problem, but it's not entirely impractical. Here is my former wonderful colleague Jeannette Bohg, back when she was still working in Tübingen with this robot; it's called Apollo. She's now working at Stanford, but the robot is actually still here in Tübingen. Robots are complicated machines, and doing something interesting with them requires expert knowledge, but there are also some low-level tasks in robotics that have quite simple structure, which is what I'd like to use as motivation for this dataset.
So let's say you've just bought a new robot, and one of the first questions you might ask while setting it up and testing it out is: if I send a certain amount of current through one of the motors that actuates the joints — say the electric motor that twists this arm — what happens? You can imagine that if you increase the current, the force applied to the arm increases, probably roughly linearly with the current; if you lower the current, the force goes down; and if you run the current in the other direction, the negative direction, then the arm tries to turn the other way, and the force applied to it again increases linearly. To figure out the relationship between the two, you might do a measurement like this: you attach some kind of force-measurement device to the arm, activate just this one joint, and run through various settings. Here x might be the current going in — it can flow in either direction — and what you measure is the force produced at the arm. You do a bunch of measurements like this, and of course, because you're a good scientist, you know your measurements are imprecise: the machine measuring the force is maybe not so good, maybe the whole geometric setup of the robot is not so great, so there's quite a lot of measurement noise. With apologies to all mechanical engineers looking at this — I'm sure in practice everything is much more complicated; it's just a motivational example.

So what you might want to know is: what is the linear function that goes through this kind of data? Maybe we can think about this briefly in terms of variables. We have our input x and our output y, and there is a functional relationship between them. Notice that we get to see both — that's why it's a supervised learning problem. The only thing we don't know is the function in between, this latent function. Now of course there are many, many possible functions; the space of all functions that map from x to y is incredibly large. So we're going to do something extremely restrictive: we're going to assume that this function can be described by exactly two numbers, the intercept and the gain of a linear function. We will assume f(x) is given by one number we don't know yet, plus another number we don't know yet, times x. That's clearly a linear function: w1 is called the intercept — the value of f at zero — and w2 is the gain, the slope of the linear function we're looking for. One particularly convenient way to write this function, which we're going to use a lot, uses a function I'll call phi — phi short for "features". This is a function that takes in an input and returns one or several functions of that input; in our case it returns, first, the constant one, and second, x itself. This seems like a trivial function, but it simplifies the notation, because it allows us to write f(x) as a linear map, not of x, but of the vector of the unknown numbers w1 and w2.
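In symbols, matching the slides' notation:

```latex
f(x) \;=\; w_1 + w_2\, x \;=\; \phi(x)^{\top} w,
\qquad
\phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix},
\quad
w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.
```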
So our learning problem now looks like this. Here's a graphical model, a directed graph. We take our input x and compute the features phi(x); in this case, the features are just 1 and x. We have decided to do this, so we know what these features are — there is nothing uncertain about them for us, because we claim to know that the function is linear. The only things we don't know are the intercept, the weight on the first feature, and the gain, the weight on the second feature. What we'd like to do is learn, or infer, these two quantities w1 and w2, and they are of course related to the output in a linear fashion.

We could assign a generic prior distribution over the bivariate real vector space spanned by w1 and w2, assume any kind of likelihood function connecting the function f(x) to the observations y, and do Bayesian inference — that's the generic framework. However, if we don't choose the prior and the likelihood in a convenient way, this inference will in general be quite expensive: we get a generic posterior distribution over w1 and w2, and it can be quite hard to compute, because it involves a complicated integral for the normalization constant, and so on. To simplify our task — because, let's be honest, this is a very simple task, so we can't afford to run some super-powerful supercomputer on it — we will make very strong simplifying assumptions: we will use Gaussian distributions for both the prior on our model and the likelihood. And because there is a linear relationship between the thing we care about, the w's, and the stuff we get to see, the noisy version of f, the posterior is going to be Gaussian as well.

So let's see if we can get this to work. We put a Gaussian prior on the weights — here is such a prior. I've decided to use a particularly simple one. Here is our bivariate space of the quantities we don't know yet, w1 and w2, with a Gaussian prior distribution over them. Being bivariate, it involves a mean — a vector of length two, the circular dot down here — and a covariance matrix, described by this matrix Sigma. The resulting pdf is shown as a green shading in the background, and these circles are contour lines of the Gaussian at one and two standard deviations from the mean. You can guess what I've chosen as Sigma: it's the unit matrix, so this is a standard Gaussian. That's just because I couldn't really think of anything else for the moment; let's see if it does something interesting, and then of course we can change it later on. I'm also showing three random samples drawn from this Gaussian distribution — these green dots. I'm deliberately going very slowly; you can stop the video at any point, because we are very quickly going to ramp this process up to much more complicated models, and it's good to think it through in this very basic setting first.

Now, this prior distribution on the weights of the function also implies a Gaussian prior distribution on the values of the function, and the reason for that is that the function we care about, f, is a linear function of — and here is the important point — the weights.
This function happens to be linear in x, but what's actually more important is that f is a linear map of the weights: there is the feature map phi, applied to the weight vector w. Because of this, the Gaussian distribution over the variable w implies a joint Gaussian distribution over all the function values f can take, anywhere. I'm showing you these function values in this plot here: a map from x to f(x). What you're seeing in this plot — again, I'll go slowly, because I'm going to use these plots a lot — are several different things. First of all, the mean in weight space, the green dot at (0, 0), which I've chosen to be zero, corresponds to a mean for all the function values: at any location x, the mean prediction for the function value f(x) is just phi(x) transposed times mu. Since mu is zero here, the mean of the function values is zero everywhere, and I show it as a solid green line in the plot. Secondly, the covariance over weight space implies a covariance between function values. If you take any two function values — say at x = 2 and x = −4 — then the covariance between the function value at location 2 and at −4 is the inner product between the feature vectors of those two points, (1, 2) and (1, −4), applied from the left and the right to this covariance matrix. And because the feature function involves x, what we get out actually depends on x. You can see the diagonal entries of this covariance matrix shown as the green "sausage of uncertainty" around the mean; notice that it gets smaller as we get to the origin and larger as we move away. Why is that? Let me write it out: the covariance between the function value at x and the function value at x', for any two x and x', is given by phi_x — I'm going to use this simplifying subscript notation — transposed, times Sigma, times phi_{x'}. Here phi_x is the vector (1, x), and I've chosen Sigma to be the unit matrix. Multiplying it out gives sigma_11 + sigma_12 x' + x sigma_21 + x sigma_22 x'. In particular, if x and x' are both zero, everything except the first term vanishes and we just get sigma_11, which happens to be 1 — that's this value here. And since Sigma is the unit matrix, the cross terms sigma_12 and sigma_21 are zero, so at x = x' = 6 we get 1 + 36 = 37 for the variance. What I'm plotting is of course not the variance but the standard deviation — its square root, because that has the right units of measure — so at x = 6 the sausage is about sqrt(37), roughly six, wide. I'm also showing, in blue, the values of these feature functions phi: there is the constant function, which is just one everywhere, and there is the linear function, which is just x.
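If you want to check these numbers, here is a tiny sketch (the names are mine, not the notebook's; phi here takes a single scalar input):

```python
import numpy as np

def phi(x):
    # feature vector (1, x) for a scalar input x, as defined above
    return np.array([1.0, x])

Sigma = np.eye(2)  # prior covariance of the weights: the unit matrix chosen here

def k(x, xp):
    # covariance of function values: phi_x^T Sigma phi_x'
    return phi(x) @ Sigma @ phi(xp)

print(k(0.0, 0.0))    # 1.0  -> standard deviation 1 at the origin
print(k(6.0, 6.0))    # 37.0 -> standard deviation ~6.08 at x = 6
print(k(2.0, -4.0))   # covariance between f(2) and f(-4): 1 + 2*(-4) = -7
```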
And then, finally, every single sample in weight space corresponds to a particular function in this hypothesis space over functions. Let's see if we can pick these out: these three green lines correspond directly to the functions associated with the three sampled weight vectors. You might want to stop the video, look at the three green dots, and think about which of them corresponds to which straight green line. If you've done that, let's do it together and check whether you were right. Here is a sample that is close to the origin and just above it, which means its intercept is roughly zero, and its slope is positive — that's this line: it goes almost through zero and has a small positive slope. Here is a sample that also has a small intercept, just below the origin, and a steep slope — that's this function. And here is a sample with a negative intercept, so it crosses below the origin, and a negative slope, so it goes down — that's this function.

Okay, now something I'm going to use as a didactic tool over the next one, two, or three lectures. Notice that this entire distribution is circularly symmetric around the origin, in the way described by the shape of these circles. Another way to think about this: if I took this plot and rotated it such that these shapes stayed circles, then every single sample under this rotation is an equally good sample — there is no reason to prefer this sample over one that is rotated a little. So I'm going to show you animations like this, in which samples — typically three of them — run around on circles; every single frame of the video then represents three equally good samples (a small numerical sketch of this rotation idea follows below). Of course, each of the three samples has a different probability: one is closer to the origin, running on a circle where each point has higher density, and some run further out. But along each circle, every frame of the video corresponds to an equally probable sample. I do this because soon I will have to stop showing you the weights — they will have more than two dimensions — and then I can only show you plots in function space, where these animations help you see the flexibility in the model. By the way, you can find the slides on ILIAS; the animations are in there, you just have to use Adobe Reader to see them, otherwise they won't play.
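To make the rotation argument concrete, here is a tiny sketch (my own illustration, not code from the lecture): rotating a sample from an isotropic Gaussian leaves its density unchanged, because the standard-normal density depends only on the norm of the sample, which rotations preserve.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.multivariate_normal(np.zeros(2), np.eye(2))  # one sample from the prior

theta = 0.3  # some rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
w_rot = R @ w  # the rotated sample is an equally probable draw

print(np.linalg.norm(w), np.linalg.norm(w_rot))  # identical up to rounding
```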
Before we go on to the inference, let me talk about the notation once more. People sometimes get confused by this — admittedly, it is a confusing notation — but it's extremely helpful, so let me take a few minutes to recap what we've just done: simplifying the process of thinking about a very large number of function values, and several possible weights explaining them, at the same time, using notation very much inspired by the beauty of linear algebra.

We're going to learn a function from a data set given by X and y. X is a collection of inputs; there is a reason I'm using a capital X here and not a vector x, because we're quickly going to find that the x's don't necessarily have to come from a vector space — they can just be a collection of inputs. The y are the observations, the outputs of the function, and we will assume these are real numbers. A whole data set, then, consists of a collection of N different x's and N different y's. Now I'm going to use the following notation. I introduce the feature function phi, which I already defined: it maps from the input domain X to a real vector of length F, where F is the number of features. In this first example the number of features is two: feature number one is the constant function 1, feature number two is the linear function x. And instead of writing phi(x), the notation you're typically used to for functions, I'm going to write phi with a subscript x, because this simplifies the notation a lot — when there are many x's flying around everywhere, the brackets make expressions very long. You can also take this as slightly suggestive: you can think of this feature function as a very, very long vector into which you could plug any value of x, so that a particular value of x indexes one location in this vector — or, rather, this matrix. If you take all possible values of x on the real line, that's an uncountably infinitely long vector; if you apply phi to it, you get an uncountably infinitely long matrix of height two, where every first entry is 1 and every second entry is the corresponding x. Doing it this way is a little bit shady, but it actually works really well.

Good. We can also do this for our data set: if we evaluate phi at the entire data set, that's phi of capital X rather than lowercase x, and it gives us an actual matrix of size number-of-features by number-of-data-points. So if we have, say, 10 data points, this is a matrix of size 2 by 10, because we've decided to use the two features 1 and x and there are 10 input points — a short and stout matrix. I'm also going to use the same kind of notation for the function value f evaluated at some x, to get rid of all the complicated brackets, and likewise for the function values at all of the input points, all of the training points. How do we get those function values? They form a vector: by our definition of the function, we apply phi(X) transposed to the weight vector. Here is our vector of weights, (w1, w2), which we don't know yet but are going to infer, and here is the matrix phi(X) from above; taking its transpose gives us the set of all the first features and all the second features — all ones here, and x1, x2, x3, and so on up to xN there. If you now take the matrix-vector product of these expressions, you get a vector of function values.
At each entry, this is the inner product between the feature vector of the particular location xi and the weights — which is just the function value at xi, for x1, x2, and so on up to xN. Okay, this might sound like a tedious notational exercise, but you will soon see that it simplifies what's going on enormously, and I'm going to start using it quickly in a relatively unquestioning way. I'm also sometimes going to mess a little with the notation and use capital Phi for these matrices; that just happens because I try to clean up my slides and sometimes forget to fix some of them.

So this gives us our notation, which maps this description of our function f(x) onto linear algebra. That means we can now do Gaussian inference on a function given data, because the data is linearly related to the aspect of the function we don't know, the weights w. To do so, we just do Bayesian inference: we write down a prior and a likelihood and then use Bayes' theorem — but because everything is Gaussian, the posterior can be computed using linear algebra. We start by putting down our prior; I've already done that before. It implies a prior on the function as well; the relationship between the two is just due to the properties of Gaussian distributions, because f is a linear map of the weights. Now we introduce a likelihood, which says: we are going to evaluate this function f at various points x, and we will get noisy measurements of the function at these locations. We apply a current to the motor in our robot, and we measure, with a little measurement noise, the force being produced by the joint. The observation likelihood we will assume to be Gaussian — again, this might be wrong in practice, but we assume it because it simplifies our computation. When we evaluate one of these data points, the function f is, by our assumption, given by a linear map of the weights, and we assume some measurement noise — an error bar on the y's. I'm also going to assume that all these measurements are independent of each other, so the covariance here is just a scalar, which I call sigma squared, times the unit matrix. This assumption is not strictly necessary — it's possible to work with a generic covariance — but it's a very typical assumption in practice: every time I measure, I make a different mistake, and the mistakes at time t and time t−1 are completely independent of each other, apart from both being Gaussian with variance sigma squared. So given the function, the observations are independent — and if you go back to the graphical model a few slides ago, you will notice that this conditional independence of the observations is true by construction in our model. Now we can compute a posterior: we multiply the prior with the likelihood — this distribution with this likelihood — renormalize, and we get a posterior on the weights. To do so, we just have to multiply one Gaussian with another Gaussian and divide by a normalization constant which also has Gaussian form, and we can look up what that gives on slides from the previous lecture.
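Collected in one place, the model we've just assumed is (writing Φ := phi_X for the F × N feature matrix of the training inputs):

```latex
% prior on the weights:
p(w) = \mathcal{N}(w;\, \mu, \Sigma)

% Gaussian likelihood with i.i.d. noise of variance \sigma^2:
p(y \mid w, X) = \mathcal{N}\!\left(y;\; \Phi^{\top} w,\; \sigma^{2} I\right)

% Bayes' theorem then gives a Gaussian posterior:
p(w \mid y, X) = \frac{p(y \mid w, X)\, p(w)}{p(y \mid X)}
```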
Now we get one of those complicated linear algebra expressions we spoke about in the previous lecture, and maybe now is the time to actually plug in the corresponding quantities. If you read out the expressions and insert the concrete values from our model, you can see that the posterior on the weights is a Gaussian whose mean is given by the prior mean on the weights — what we guessed the weights to be before we got to see the data — plus the prior covariance of the weights times the feature matrix of the training data, times a big matrix — the inner product of the feature vectors of the training data, weighted by the prior covariance of the weights, plus the measurement noise — inverted, times the observations y minus what we would have predicted for those observations, namely the features of the training data applied to the prior mean. What's happening here is: we compute a prediction for what we think we're going to see; we see what we actually get; we take the difference between the two — that's the residual; we scale it by the predictive covariance — the scale on which we expect things to vary; and then we map back out onto the quantities we actually care about, via the covariance between the observations and the weights. How do these observations relate to what we care about? They covary with it, because they're related in a Gaussian, linear fashion. And what is the posterior covariance on the weights? It's the prior covariance, minus essentially an outer product of the covariances between weights and data, scaled by the inverse covariance of the data — which says how much variance there is in the data, i.e. how informative evaluating it is — applied from the left and the right via how the data covaries with the stuff we actually care about.

As I pointed out in the previous lecture, it's possible to write the very same expression in a different way using the matrix inversion lemma, which is, at least in spirit, due to Issai Schur. Computationally, this second form is much, much better here: in the first form we have a matrix of size number-of-data-points by number-of-data-points, N by N, that we need to invert, at cost O(N³). In the second form the matrices are of size number-of-weights by number-of-weights, so 2 by 2 — and it's the same matrix appearing twice — and inverting a 2-by-2 matrix is much, much easier than inverting an N-by-N matrix. Of course, to build that matrix we need the inner products of the feature vectors with themselves, an operation linear in the number of observations N; so the cost is number-of-features-squared times number-of-observations, F²·N, and F here is two, so it's just 4N. Beautiful — that's an easy thing to do, and then we only have to invert a 2-by-2 matrix.
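In formulas, the two equivalent forms of the posterior read:

```latex
% "data-space" form: one N x N linear system, cost O(N^3)
p(w \mid y, X) = \mathcal{N}\big(w;\;
  \mu + \Sigma \Phi\, G^{-1} (y - \Phi^{\top}\mu),\;
  \Sigma - \Sigma \Phi\, G^{-1} \Phi^{\top} \Sigma \big),
\qquad G := \Phi^{\top} \Sigma \Phi + \sigma^{2} I .

% "weight-space" form via the matrix inversion lemma:
% one F x F linear system, cost O(F^3 + F^2 N)
p(w \mid y, X) = \mathcal{N}\big(w;\;
  \Lambda\,(\Sigma^{-1}\mu + \sigma^{-2} \Phi y),\;
  \Lambda \big),
\qquad \Lambda := (\Sigma^{-1} + \sigma^{-2} \Phi \Phi^{\top})^{-1}.
```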
Once we have this posterior on the weights, we also get a posterior on the function values, and getting there is a really straightforward operation. All we have to do is remember that the function f is just a linear map of the weights. So if we have a Gaussian posterior on the weights, given by this expression (or the one up here), then the posterior on the function values is given by applying the feature vector of whatever input point you want to test, from the left, to the mean — you can see this is exactly the expression from before plugged in again, multiplied from the left by the feature vector of the test point. And to get the covariance between two test points, we just multiply the expression up here (or the one down here) from the left and the right with the corresponding feature vectors. We're now going to look at how this actually works in practice for our concrete data set; but at this point it might be useful to briefly stare at this expression and make sure you understood what we just did in this derivation.

Not everyone likes looking at algebra in symbolic form, so if you're the kind of person who prefers a visual view of what's going on, let's do essentially the same derivation again, but in a more visual form. Here is our prior distribution again, defined on a previous slide: a prior over the weights, which directly corresponds to a prior over functions. I'll call these "weight space" and "function space" — notions that come from the wonderful book by Carl Rasmussen and Chris Williams on Gaussian processes for machine learning. Now we have our data set of measurements — the data is made up, of course, but let's say these are measurements of the torques or forces created by the motors in our robot — and we begin with our very first measurement. You can already see the entire remaining data set plotted here, but let's say we only have this first data point so far. This one measurement is associated with a likelihood term that we multiply into our prior. As a likelihood for this individual observation, it's a one-dimensional Gaussian on this particular function value; but mapped through the likelihood, it constrains the two weights: in the space of the two weights, this Gaussian corresponds to a degenerate Gaussian, shown by this blue line — that's the likelihood function. It's degenerate: it has only a rank-one covariance even though the space is bivariate, so you can see it being, well, degenerate along this direction. But that doesn't matter: the product of two Gaussians is another Gaussian, and as long as the posterior is full rank, everything's fine — and because our prior Gaussian has a full symmetric positive definite covariance, the posterior is still a full-rank Gaussian even though the likelihood is degenerate. So the product of this green distribution and this blue distribution is this red distribution, which is our posterior. You can see what the likelihood tells us: this one data point, written in terms of the weights, says that the correct function probably has a positive slope, because the point lies below zero; and as for the intercept, this likelihood says it could be anything.
But if the intercept is positive, then the function probably has to have a steeper positive slope, and if the intercept is negative, then the slope is probably a bit smaller — that means if the true function goes through here, it's probably a flatter function, and if it goes through there, it's probably a steeper one, according to this data point. Now let's say we get a second data point — let's not just pick the next one in line, but maybe this one. This data point is very close to the origin, so it provides relatively little information about the slope, and actually not even all that much about the intercept, which is why we get a relatively wide likelihood here. This is interesting: even though this Gaussian and that Gaussian have the exact same width — it's a measurement with the same measurement noise — the information content about the weights of these two data points is quite different. This data point is very close to the origin, where the prior is already relatively narrow anyway, so this additional data point provides very little information; it just cuts out a small part of the prior. The product of this prior and these two likelihoods is now this posterior distribution, which also corresponds to a posterior in function space — this red function. Now let's say we get one more datum, up here. What does this data point tell us? It tells us that the slope probably is positive, because we are quite far above the prior width in this region, and therefore the likelihood lies up here. Notice that the slope of this likelihood line is now the other way around, because we are on the right-hand side of zero: what this likelihood says is that the function probably has a positive intercept and a positive slope, and if it doesn't have a positive intercept, then it has to be even steeper — if the true line goes through below here, it has to be even steeper to pass through this data point, which is why we go up here, and vice versa. Now of course you can get many, many more data points. If we multiply in the likelihoods of every individual datum, we get an overlap of all these Gaussian likelihoods — it's a bit of a busy plot now, obviously — and what we're left with is this small posterior distribution, which is now quite narrow, because a product of so many different Gaussian distributions has to be quite concentrated. And it corresponds to a posterior in function space, which you see over here: this relatively narrowly constrained region describing linear functions. You can still see the animated samples — these red dots are moving around just like the green ones did — and we have contracted from the wide prior distribution to this narrow posterior distribution, which tells us quite a lot about what the true linear function is. We're not certain what it is, and in fact, given more and more data, we will never become infinitely certain; we just become more and more certain as the data accumulate. Okay — so this is what we're trying to achieve with these complicated computations; it's linear algebra that leads us to this conclusion. Here is the algebra again: a whole bunch of complicated expressions.
If you wanted to implement this in code, you'd have to look at these expressions and get a feeling for how to implement them efficiently. So what I'd like to do now is save you that time and do it with you on screen, in actual Python code. We're going to take this process, which here is written in nice curly mathematical symbols, and translate it into actual code — and I'm not going to type it out, I'm just going to show you the code. The takeaway I want you to get from this process is that Gaussian inference really boils down to explicit linear algebra operations, rather than a call to some complicated software library that acts as a big black box doing something magic. If you are a recent arrival to the machine learning world, you might think of machine learning as a field that lives inside very complicated toolboxes like PyTorch or TensorFlow or whatever they are all called — complicated black-box environments in which you have to use stochastic optimization methods like stochastic gradient descent and Adam and so on, and somehow you just have to hope that this shaky construction of software will work for you. By contrast, Gaussian inference consists of very elementary linear algebra operations. You still need certain toolboxes, because you don't want to implement linear algebra from scratch yourself, but these are very low-level operations.

So let's look at what this process looks like in code. Just to remind you: we're going to define a prior distribution on the weights and a feature function phi — together, these define a prior distribution on function space. Then we're going to get data, observed through a Gaussian likelihood, and use this data together with the prior to construct the Gaussian posterior on the weights, and therefore also on the functions. To do that, we just have to compute these quantities here, because Gaussians are entirely described by their two parameters, the mean and the covariance; and to compute them, we need to do a little bit of linear algebra — take vectors, subtract other vectors from them, solve linear systems of equations (which is not quite the same as inverting a matrix, but related to it), and map the solutions of these linear problems through other vectors and matrices.

Here's the code. I should say I have deliberately written it in a didactic way; it's not the most efficient, and maybe not the most beautiful, Python code. In particular, I'm going to show you scripts that do this computation; in practice, of course, you would implement this either in a more functional or a more object-oriented way, with methods that allow you to swap out certain parts of the model. I'm not doing that here because I want you to see how simple the structure of this computation really is. In particular, we are not going to import anything complicated up here — no TensorFlow or PyTorch or anything else. We are going to use NumPy, which provides basic linear algebra operations, and we need three important parts for our process. First, we need I/O, because we need to get the data in — fine. Then we're going to draw from a Gaussian distribution; for that I import multivariate_normal, which is a way to produce random numbers. In the flipped classroom we'll talk in more detail about what this algorithm — this little black box — actually does to draw its random numbers, because it's a fun story of its own.
And then we're going to do linear algebra, which NumPy mostly provides — and we'll need one specific kind of linear algebra operation: pre-computing decompositions of matrices, in particular of symmetric positive definite matrices, such that we can afterwards, in effect, multiply with inverse matrices. Explicit inversion is not something you do in practice; instead we solve linear systems using this operation. These decompositions are called Cholesky decompositions.

Now let's go back briefly to our math. What we need is a prior distribution over the weights, and a feature function, to construct from this prior over the weights a prior over functions. In the code, I begin with the feature function — here it is: I define the function we've called phi, which returns our feature vector, that is, a one and an x (or in this case, because the input is called a, a one and an a). Here's one perhaps convoluted way of implementing such a function; of course you could also just write a function that produces a NumPy array containing 1 and a. I do it this way because it vectorizes more nicely, and we'll see later on that we can easily extend it. What it produces: it takes in a and computes the zeroth power and the first power of a — because Python is zero-based, this (perhaps weirdly) means we raise a to the powers in range(2), which runs over 0 and 1. Okay, that's our feature function; that was pretty straightforward.

Next we need our prior on the weights, and for this I have the code here — you can look at it later on ILIAS, of course; I'll upload it. Here is the entire code in one cell. I hope you can read this on YouTube; you might have to increase the resolution, or otherwise open the Jupyter notebook directly. This code consists of exactly the process we wrote down in algebra before: first define the prior on the weights — that happens in these three or four lines — then the corresponding prior on function space — that's this part — then load the data, and finally construct the posterior — that's the last bit. So first, let's define the prior. For that I need a constant: I need to know how many features my feature function actually returns; call it F. I just evaluate the features at some particular input point — let's take zero, it doesn't matter — and check how many entries there are. Now I know I need weights for two features, and for those we need a mean vector and a covariance matrix. As in the examples I've shown you before, I set the mean vector to zero and the covariance matrix to the unit matrix — that's what's happening here. Okay, that was easy. (A compact sketch of both definitions follows below.)

Now that we have our prior on the weights, what's the corresponding prior on the functions? Notice that a function is a more abstract object — I could implement this in a more functional style, actually producing functions that can be evaluated anywhere — but I've noticed in the past that it's often easier for people to understand if we talk about a concrete realization of a function, something you actually evaluate at particular points. So what we're going to do in this Jupyter notebook is simply create a plot.
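Here is a minimal sketch of that feature function and weight prior, in the spirit of the notebook (the exact code on ILIAS may differ slightly; the names are my guesses):

```python
import numpy as np

def phi(a):
    # powers a**0 (constant feature) and a**1 (linear feature),
    # vectorized: one row of features per input value
    return np.power(np.asarray(a, dtype=float).reshape(-1, 1), np.arange(2))

F = phi(0.0).shape[1]   # number of features (2 here)
mu = np.zeros(F)        # prior mean of the weights
Sigma = np.eye(F)       # prior covariance of the weights (the unit matrix)
```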
How do you make a plot? Well, you need a plotting grid, and here I'm creating this grid: let's take a hundred points and distribute them evenly from minus eight to eight, so that we can make a nice little plot. Now let's compute the prior distribution on the function values at those 100 points — that's actually how I made the plots you've seen on the slides so far. (If you want to play with this code, I'll leave it to you to think of a more functional form that keeps the feature functions as functions and only evaluates them lazily, when they're actually needed.) So, what is the mean vector for the function values at all these 100 locations? It's the inner product between the feature matrix and the prior mean of the weights — that's this line here. And what's the covariance of the function values? It's the inner product, from the left and the right, of the feature matrices, weighted by the prior covariance Sigma. Notice that in this code I've actually defined phi to return a row vector rather than a column vector, so the transpose sits on the other side — it doesn't really matter, you just have to take care that everything fits together. One beautiful thing about linear algebra is that if you are ever so slightly careful, it does a kind of implicit type checking for you, by making sure the matrices fit each other when you evaluate them. Now we have the prior distribution over the function values; it has a mean and a covariance matrix. What can we do with it? Well, we can draw random numbers from this Gaussian distribution, and we can make plots. For the plot we actually need that sausage of uncertainty around the function values that you've seen on previous slides — I'll show it to you again so you know what I mean: this green region. Its boundary is the diagonal of the covariance matrix — in fact the square root of the diagonal, because it's a standard deviation. So we extract the diagonal from the covariance matrix, take its square root, and do a little bit of NumPy foo to make sure everything works; that's something we can plot later on. We can also draw random numbers — the lines I showed in animated fashion on the previous slides. (I'm not going to create the kind of samples you need for animations; that's a somewhat advanced question, and we can talk about it in the flipped classroom if you like — just bring it up.) To draw samples, I just tell NumPy to create random numbers from a Gaussian distribution with mean m and covariance kxx, and to please draw five of them, so we can make a nice plot. Okay — that was the prior.
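As a sketch, continuing from the definitions above (again, the variable names are my guesses at the notebook's):

```python
import numpy as np
from numpy.random import multivariate_normal

x = np.linspace(-8, 8, 100)      # plotting grid
phi_x = phi(x)                   # (100, F): one row of features per grid point

m = phi_x @ mu                   # prior mean of the function values (all zeros here)
kxx = phi_x @ Sigma @ phi_x.T    # prior covariance of the function values
stdev = np.sqrt(np.diag(kxx))    # the "sausage of uncertainty" around the mean

samples = multivariate_normal(m, kxx, size=5)  # five sample functions, shape (5, 100)
```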
Now we load the data. For that we do some I/O — I happen to have this data lying around in some Matlab format, but of course it doesn't matter; I just load it from somewhere. It's a supervised problem, so the data contains x's and y's, and it also comes with a noise level sigma — the standard deviation of the measurement noise. Notice how this parameter sigma is actually part of the data set. This is a typical situation: if you're an engineer, you take your measurements of how much force your motors are producing, then you look at whatever device you're using to measure the force or the torque, you turn it around, and somewhere on the back there's a piece of paper glued to it that says "measurement precision: plus or minus so-and-so many Newtons" — and that's exactly the sigma you plug in here. Providing the error is part of the model, and part of the modelling task: if you know what the error is — and maybe you do — then you have to plug it into the likelihood. We also need to know how many data points there are, so let's just count how many entries x has and call that capital N.

Now we have data, and now we can do inference — or learning, if you like. To do that, we need the quantities that show up in the equations on this slide: this matrix here, which is the inner product between the features of the data and themselves, weighted by the prior covariance, plus the noise. In the symbolic form we wrote the inverse of this matrix times a vector — and we're actually going to need the inverse of this matrix twice in the computation. Now, there are two issues with linear algebra on a computer. The first is that it's usually the wrong idea to actually compute the inverse of a matrix: matrix inversion is a numerically delicate process that is also relatively inefficient, and there are smarter ways, because we don't actually need the inverse itself — we only need it applied to a vector. So we should solve a linear system of equations instead, which is a better-behaved process. Secondly, we have to keep in mind that we need this twice — here and here — so we should pre-compute the complicated part of the operation and leave the simple part, applying the result to a vector, to later steps. To do so, I first decompose this matrix into a triangular factor — the Cholesky decomposition, which always exists for these symmetric positive definite matrices and can be computed in cubic time, relatively efficiently and in a numerically stable way. Once you have this triangular factor, you can apply it efficiently by back-substitution to both of these problems and solve the linear systems cheaply.

So let's do that. We first compute this matrix — it's called phi-transpose-Sigma-phi in the code — and we need the prior mean on the data, the object we're going to subtract from the data to get the residual. Then we compute the prior covariance of the data: the prior covariance of the latent function plus the variance of the noise — that's just how the math works out, as you saw on previous slides. Now comes the complicated computation — this line here, by far the most expensive bit of this entire cell: it pre-computes the Cholesky decomposition of this matrix, a triangular factor which we can then use to efficiently solve several linear systems. Once we have that factor, we can apply, from the right or from the left, the object k_xX — the inner product between the feature functions of the test points (the points where we are going to evaluate the function), the prior covariance of the weights, and the features of the data. Let's go back to our math: I've now computed this object, k_xX, and what we're going to do next is actually produce posterior predictions for the functions, rather than the weights.
We take the data, subtract the prior mean on the data that we just computed, solve this linear problem, and apply k_xX from the left-hand side — and maybe also add the mean, though in this case we don't have to, because the prior mean was zero, so that part vanishes. So we compute a posterior mean — here we go — which is the prior mean (which happens to be zero) plus A, the solution of the linear problem given by G-inverse times k_xX — let's go back: that's this part here, this entire expression including the bit outside the matrix — multiplied from the right by y minus the prior mean, which you see over here. Then we get a posterior covariance, given by the prior covariance minus k_xX times G-inverse times k_xX transposed, and that's exactly what we compute here — reusing the work we put into producing the decomposition of the matrix, which makes the implicit matrix inversion efficient. Once we have these two parameters of the posterior distribution, mean and covariance, we can for example draw random numbers from it, using again the multivariate normal routine with that mean and covariance — let's draw five samples — or we can plot the posterior sausage of uncertainty, by extracting the diagonal from this covariance matrix, taking its square root, and using that for plotting. If you look at this code on ILIAS and go a little further down, you'll find the code for creating the plots as well, but that's just a somewhat tedious bit of matplotlib foo: it reuses all the quantities we just computed and makes nice shiny pictures out of them. Other than that, that's all.

The important bit to take away here is that this is all just linear algebra — all elementary operations you can do with NumPy, with no calls to complicated software packages beyond standard linear algebra libraries. And those have been around for fifty years or so; they are not a particularly new invention. This is important because it shows how elementary Gaussian computations are: they are totally understandable, well structured, and numerically very efficient — running this piece of code takes much less than a second for this data set. All right — with this we are at a good point for you to maybe take a break and briefly think about what we've done. We've constructed a framework for inferring — learning — a function. That is of course a core operation of machine learning, a supervised learning problem, and we've seen how to solve an admittedly basic form of supervised learning using Gaussian inference as the mechanism that maps the abstract notion of probabilistic inference onto linear algebra.
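Before you take that break, here is a consolidated sketch of the posterior computation we just walked through, continuing from the earlier snippets (the variable names are mine; `X`, `y`, and `sigma` are assumed to have been loaded from the data file):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# assumes phi, mu, Sigma, phi_x, m, kxx from the sketches above,
# plus training inputs X, targets y, and noise standard deviation sigma
phi_X = phi(X)                                            # (N, F) features of the data
G = phi_X @ Sigma @ phi_X.T + sigma**2 * np.eye(len(y))   # prior covariance of the data
mean_y = phi_X @ mu                                       # prior mean of the data

cho = cho_factor(G)                 # Cholesky factor: the expensive O(N^3) step, done once
kxX = phi_x @ Sigma @ phi_X.T       # covariance between grid function values and data
A = cho_solve(cho, kxX.T).T         # k_xX G^{-1}, reusing the factor via back-substitution

mpost = m + A @ (y - mean_y)        # posterior mean of the function values
vpost = kxx - A @ kxX.T             # posterior covariance of the function values
stdev_post = np.sqrt(np.diag(vpost))                      # posterior "sausage"
post_samples = np.random.multivariate_normal(mpost, vpost, size=5)  # posterior draws
```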
Once you've taken your quick break, we can go on to ask the obvious next question, which has probably been on your mind for the past few minutes already: what happens if my data set looks like this? In practice, of course, very few functions are actually linear, and if you really have a straight line to learn, maybe you don't need to break out the complicated probabilistic machine learning toolbox — you can use some simple old idea. In fact, many of the things we talk about here are, to be honest, old ideas, sometimes several hundred years old. Gauss himself, in his original work on the orbits of planets, essentially solved exactly this problem: he made observations of the planets at various points in time and then fitted a function, in a way that amounts to computing the posterior mean of a Gaussian distribution. We'll actually talk a bit more about this later.

So let's say your robot doesn't work in this simple way; it exhibits a more complicated relationship between x and y, and maybe this is the data you've actually collected. You can tell, of course, that there is no straight line that goes through this data and explains it well. However, the situation is extremely similar to before: we still have an input and an output, they're still of the same type — both real-valued in this simple example — and there's still a latent function f. It's just that this particular linear form — f(x) equals one constant plus another constant times x — doesn't fit anymore. So the question to ask yourself is: what do we need to change in our framework to make everything work? Do we have to change the Gaussian inference bit? This is an important question to think about. Remember which properties of Gaussian distributions we used to make all of this work: we used the fact that Gaussians are closed under linear maps — that's the important part. And the way we used it was by constructing a linear function — but a function that is linear in two different ways. The function we looked at, f(x), happened to be linear in x, but it was also linear in w, because we can write it that way as well. The bit we actually used to make inference easy was not that the function is linear in x; it was that it is linear in w. So think for yourself: what could you do, now that your data set looks like this, to create another kind of model that can deal with this non-linear shape? Normally in a lecture I would ask for answers from the audience; you might want to stop the video here for a few seconds and just think about what you could change.

Maybe your answer was — and that's usually the answer that comes up — that we can just change the features. There is no real reason why we only allowed a constant and a linear feature; we could add more polynomial terms, quadratic or cubic terms. What do we have to do to make this work? Let me first show you a picture. Instead of using just a constant and a linear feature, I'm going to also use a quadratic and a cubic feature. That means I'm assuming that the function we're looking for is a cubic polynomial in x — and still linear in w. So there are now four unknown numbers: the weights for the constant, linear, quadratic, and cubic terms. I can't show you a picture of the weight space anymore, because it's now four-dimensional, but I can show you the corresponding prior over function space, and you can see various cubic polynomials wobbling around in this picture. The hypothesis space now looks rather different — it's non-linearly shaped. And if we compute the corresponding posterior under this cubic model, this is what we actually get out: a non-linear function. You can't really see at a glance that it's a cubic polynomial, but it is.

So the important bit is: what did we need to change in our code to make this work? Let's go back to our Python notebook — do we need to change anything in this cell? You might want to stop the video and think about it for a moment. The answer is no: this cell stays exactly as it is, because the only thing we changed is the features, and the features are defined up here. Now you might realize why I chose to write our feature function in this way: all I have to do to get a cubic model is change this 2 into a 4. That's it. Nothing else changes; the entire remaining code stays the same, and we get this kind of output.
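The one-line change in question, in terms of the earlier sketch (the `degree` parameter is my generalization of the notebook's hard-coded constant):

```python
import numpy as np

def phi(a, degree=4):
    # polynomial features 1, x, x**2, ..., x**(degree-1);
    # degree=2 recovers the straight-line model, degree=4 the cubic one,
    # degree=8 the seventh-order polynomial mentioned below
    return np.power(np.asarray(a, dtype=float).reshape(-1, 1), np.arange(degree))
```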
So, the important bit is: what did we need to change in our code to make this work? Let's go back to our Python notebook. Do we need to change anything in this code? You might want to stop the video and think about it for a moment. The answer is no: this cell stays exactly as it is, because the only thing we changed is the features, and the features are defined up here. Now you might realize why I chose to write our feature functions this way, because all I have to do to get a cubic model is change this two into a four. That's it. Nothing else changes, the entire remaining code stays the same, and we get this kind of output. Now, this actually isn't a particularly good model for this dataset, and you can see that because the data is not explained well at all by this model: the probability of getting to see this data under this model is very low. Think for yourself: where does this probability of the data being produced by the model actually show up in our Bayesian inference? It shows up in the evidence term, p(y), the denominator of Bayes' theorem, and we can compute it, because it also takes a Gaussian form. If you did that, you would see that it's actually extremely unlikely that this model produced this data.
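That evidence is itself a Gaussian density: p(y) is normal with the prior mean on the data and covariance G. A sketch of the computation, reusing the illustrative quantities from above:

```python
from scipy.stats import multivariate_normal

# log evidence log p(y) = log N(y; m_X, G); very negative for a bad model
# (recompute m_X and G with the current features first)
log_evidence = multivariate_normal(mean=m_X, cov=G).logpdf(y)
```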
But although the Gaussian framework provides a way to compute this, the Gaussian posterior doesn't really adapt to it. As you can see, the uncertainty of this model is very low in this region, even though the data is quite far away from the prediction. This is an important aspect of Gaussian inference, and it's caused, if you like, by the structure of the posterior. Here's the posterior over the functions and here's the posterior over the weights, written in various forms. Let's look at the posterior over the functions, that's maybe the fun one, and at this line in particular: here's the posterior mean, and here's the posterior covariance. Notice that the data y only shows up in the mean; there is no y in the posterior covariance. This means that the posterior uncertainty does not actually depend on the numbers you got to see, and that's why we get this overly confident region here. The Gaussian framework, at least in the way we've used it so far, does not adapt to data that is badly fitted by the model. So what do we need to fix to make this work? Maybe your answer is: let's just use more polynomials. Third order is perhaps just not enough to get this wiggly shape, so let's use eight terms, up to seventh-order polynomials. This corresponds to a more complicated, more adaptive model, and it learns this kind of function. Again, think for yourself: what do we have to change in our code to get to this point? We just replace the four with an eight; nothing else changes. Okay, this is what the posterior now looks like. Maybe this is still relatively dissatisfying, because the data is still not perfectly described, but perhaps more worryingly, you get this extreme divergent behavior on both ends of the dataset. So maybe this isn't the right thing to do yet. What else could we do? Normally, at this point in the lecture, it takes a few seconds for people to realize that there is nothing in the definition of this code that requires us to use features which are polynomials of x; in fact, any other function of x can work as a feature. Then I ask people which features they would like to use. It's actually my favorite part of this lecture, and it's rather annoying that you're not here to share this moment with me. People usually say: oh, you could use oscillating functions, Fourier features, cosines and sines. So let's use that. Here I've used a feature function that takes in x and returns the cosine of x, of 2x, of 3x, and so on up to 8x, and then the sine of x, of 2x, up to 8x. You can see the feature functions in this plot: these blue lines are all the cosines, and these are all the sines, eight of each.
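A sketch of what such a feature cell might look like (the helper name is illustrative):

```python
def phi_fourier(x, K=8):
    """Fourier features: cos(kx) and sin(kx) for k = 1..K (2K columns)."""
    k = np.arange(1, K + 1)
    return np.concatenate([np.cos(np.outer(x, k)),
                           np.sin(np.outer(x, k))], axis=1)
```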
Again, what did I need to change to make this code work? Let's go back to our Python code: I don't need to change anything down here; all of this stays exactly the same. The only thing that changes is this cell. There's a bunch of feature definitions here, and it's this one, the sines and cosines, that we use for this particular plot. So if you use this particular feature set, the posterior looks like this. Okay, that's still not great; it's a little too smooth in this region. The reason is that I've actually not used features of sufficiently high frequency. I'll leave it to you to play with the code in the notebook and change the frequencies so that the fit is a little better in this region. Since we're at zero here, you can also think for yourself about whether it's better to change the sine or the cosine features for this particular dataset. One nice thing about this posterior is that it's much better behaved in extrapolation: it's much smoother and doesn't diverge in the extreme fashion the polynomials did before. It's also perhaps not perfect, because one assumption encoded in these features is that the function is periodic: if we extended this plot much further to the left or the right, you would see this function start to oscillate back and forth. Maybe that's not what you want. Maybe you know something about your data that is periodic in this way, maybe you don't, and depending on whether you know that or not, you might or might not want to include this in your feature set. So normally at this point I ask: are you happy with cosine and sine features? What other features might you want to use? I very much encourage you to do this for yourself: stop the video and think about the kind of features you might want to use to model this dataset. One next idea people often come up with is pixels. If you come from a computer vision or graphics background, pixels may be a very natural thing for you: you could use little steppy functions to describe your function space. They give you a lot of flexibility, because step functions allow big jumps from one data point to the next. There's a bit of a technical issue, though: when I say I'm going to use step functions, we need to say exactly what kind of step functions we mean, because there are several different kinds, if you like. One kind is this one: step functions that start at plus one and then drop to minus one at some point, each function switching at a different location. Here's the first one that moves from plus one down to minus one, here's the next one, and so on. (There's a little bit of a slope in these lines, but that's just a plotting artifact; they are actually hard, discontinuous steps.) Every single one of them can be switched on independently of the others, but because they are all nonzero everywhere, either plus or minus one, they all overlap with each other, and you get this kind of hypothesis space that can create really steppy, blocky functions. If you apply this to the dataset, this is the posterior you get. Okay, maybe this looks a bit ugly, and you might find the hard steps annoying, but in many ways this is a very flexible model class, because it allows very strong steps in the function and can adapt to them without nasty effects elsewhere in the predictive range. It also extrapolates kind of nicely: of course there are these steps, but towards the left and the right you just get relatively flat, benign extrapolatory behavior. Maybe this is not so bad. There's another kind of step function: those that start at zero and then switch on at some point. Here I've used the Heaviside function notation, which is just a function that is zero while its input is less than zero and one once the input becomes positive. With such a model, all the features are zero here on the left-hand side, and therefore the hypothesis space forces all function values to be zero over here as well; as we move along to the right, more and more of these features switch on and add up together, so there is more and more room for the function values to grow. These kinds of technical details in the definition of the features really do matter: they directly affect the hypothesis space they create. If you don't see why I belabor this so much, keep it in mind for when we start talking about deep learning. If you've thought about deep learning before, you might appreciate that the choices of features are actually really important, and that this way of thinking about regression in a probabilistic fashion gives you an intuition for what's going on that is much, much harder to develop with deep neural networks. Okay, so this is our hypothesis space. If we get data and adapt the model to the data to compute a posterior, this is what it looks like. Notice that on the left-hand side this model has to return to zero. It has to, because there are simply no features over here that could explain anything other than zero, and then it climbs up towards the data around here.
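A sketch of the two step-feature families just described, with illustrative switching points:

```python
centers = np.linspace(-8, 8, 16)   # illustrative switching points

def phi_signstep(x):
    """Features that are +1 left of their center and drop to -1 after it."""
    return np.where(x[:, None] < centers[None, :], 1.0, -1.0)

def phi_heaviside(x):
    """Features that are 0 left of their center and switch on to 1."""
    return (x[:, None] >= centers[None, :]).astype(float)
```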
Now, at this point I again ask people what other models they can think of, and they usually come up with all sorts of other ideas. We can do a few here, and I very much encourage you to download the code and play around with it yourself to see what else you can implement. Here, for example, is a set of features that are little V-shaped functions: they come down from the left and then go back up to the right, shifted relative to each other in a linear fashion so that things are easy to implement. They create a hypothesis space of functions that are piecewise linear, because the features are piecewise linear, and that extrapolate linearly in both directions. If you use this kind of prior, you get this kind of posterior. This is actually quite nice: it looks like a good interpolation, it's very adaptive, and the functions we get are piecewise linear, so they are reasonably well behaved; they don't have the nasty steps we got from the pixel-like functions. You could also use all sorts of other weird, or rather exotic, choices of features. In certain applications in science, people have come up with very smart choices. Here, for example, are Legendre polynomials, an interesting set of polynomials defined on a bounded domain. Maybe you know, for some odd reason, that your data explicitly lives in this domain from minus 8 to plus 8, and that all the dynamics of the function are constrained within this region; then you could use such a basis, it produces a funny, rather beautiful hypothesis space, and if you now adapt your model to compute a posterior (if you do learning, or inference, which are essentially the same thing) this is the posterior you get out. Or maybe you could use what I call Eiffel Tower regression. I'm not actually sure this is exactly true, but I've read somewhere that in log space the Eiffel Tower is a triangle. Maybe that's wrong, but if it's true, it corresponds to this kind of feature set: here the phi(x) are exponentials of absolute distances of x from various centers, and I've spread these centers from minus 8 to plus 8, so there are 16 of these features, shown in blue here. Or maybe you can think of them as Mount Fuji shapes; perhaps I should call this Mount Fuji regression instead. This is the prior hypothesis space. You can see that the features create these little artifacts in the marginal variance, which goes up and down, up and down, and if you compute a posterior for this model, we get this kind of output. Now, you could complain about certain features of this model, but maybe you don't have to.
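Sketches of two of these feature sets, in the same illustrative style as before (reusing the hypothetical centers from above):

```python
def phi_v(x):
    """Piecewise-linear 'V' features: absolute distance from each center."""
    return np.abs(x[:, None] - centers[None, :])

def phi_eiffel(x, scale=1.0):
    """'Eiffel Tower' / Mount Fuji features: exp of negative absolute distance."""
    return np.exp(-np.abs(x[:, None] - centers[None, :]) / scale)
```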
At this point usually someone raises their hand and asks: but what about Gaussian features? Of course we can use those as well; I just do them at the very end, because I don't want people to think you can only do Gaussian regression with Gaussian features. So here is what people might call Gaussian features. They are also called square-exponential features, or radial basis functions, names chosen so you don't confuse them with the Gaussian nature of the inference. Here the feature functions are exponentials of squared distances, so they are essentially Gaussian bumps. But notice, just to remind you, that the entire framework uses Gaussian inference regardless of the shape of the features; it just happens that we've now decided to use these smooth little bell-shaped functions. One nice aspect of this choice is that the resulting functions become very smooth: the green lines you see animated here, the draws from the prior, are always very smooth functions, because they are sums of extremely smooth functions, these Gaussian bells. If you use these features, the posterior distribution looks like this, and many people find it aesthetically pleasing. I agree, it's a beautiful plot: the model creates these smooth lines that go through most of the data points, because the bumps just get shifted around in the right way, and it extrapolates in this kind of fashion. Now, whether you like this plot or not depends on which aspects you care about. Maybe you want to do extrapolation, and then this model is not particularly good, because if you go back one slide you'll see that over here I've actually stopped adding features, so to the right and to the left this model returns to zero; if you extrapolate far out, it will just predict zero values outside of this range. Maybe you like that, maybe you don't. Maybe you care about the interpolation, in which case you might be happy about the fact that this is a very smooth function that interpolates beautifully between the data points. Now, the question that is probably in the back of your mind at this point, if you're an analytic person, is: okay, I've now seen eight different feature sets, so what kind of features am I allowed to use? Here's the question to you: what is the set of feature functions we are allowed to use in this framework to do regression on a real-valued function? And the answer is: for all intents and purposes, there is no constraint on the feature set. Sometimes people then ask: but what about discontinuous features, am I allowed to use those? And the answer is: of course you are. Let me give you an example. Here are functions that are discontinuous, with hard steps from zero to one, and it's totally fine, not a problem at all. There really is no constraint on the feature functions; all you need is a function that maps from x to the real line. By the way, this of course means that x doesn't have to be a real number. We could come up with features that map from, say, a bivariate space, a two-dimensional real input space with coordinates x1 and x2, to the real line. For this example I've created a little egg carton of features: bell-shaped square-exponential features tiling the space, I think eight in each direction, so 64 of these features. All we need is a function that maps from x to the real line; if x is a bivariate real-valued input, fine, then we get a hypothesis space over bivariate functions.
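A sketch of these square-exponential features, in one dimension and as the two-dimensional "egg carton" version (names, grid size, and length scale all illustrative):

```python
def phi_rbf(x, ell=1.0):
    """1D square-exponential (RBF) features around the centers."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / ell) ** 2)

# 2D egg carton: tile the plane with bumps, eight per direction -> 64 features
cx, cy = np.meshgrid(np.linspace(-8, 8, 8), np.linspace(-8, 8, 8))
C = np.stack([cx.ravel(), cy.ravel()], axis=1)            # (64, 2) centers

def phi_rbf_2d(X2, ell=1.0):
    """X2 is an (n, 2) array of bivariate inputs; returns (n, 64) features."""
    sq_dist = ((X2[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / ell ** 2)
```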
And of course x doesn't just have to be two-dimensional either: it could be a hundred-dimensional, or ten-thousand-dimensional, or a million-dimensional. This doesn't really change anything about the process, because if we go back again to our code and think about where x actually shows up in this computation: essentially, it doesn't. It does show up as an input to phi, but once we compute our prior mean and covariance, x is already shielded within phi; beyond this point, and where the data comes in, we never actually use x explicitly. We always evaluate the features of x and then operate in the space spanned by those features. So x could be multivariate, but it could also be something more exotic. We'll talk about other models later on where x is maybe a string, or a graph, or an image, and images are essentially multivariate vectors anyway. But, just to drive home that point, x could also be a one-dimensional input that maps to a multivariate output. For example, our input space could be one-dimensional time, so x is now just t, and we could have two different outputs, one you might call x and one you might call y. We take our time t and produce two sets of features: every point in time is associated with two features, one that affects the x output and one that affects the y output. Since we are just learning weights for these features, the two sets of features together span a two-dimensional output space across time. Here's this three-dimensional plot, and if I draw from this Gaussian prior over the weights of all these features, we are essentially producing a random object in this three-dimensional space. If you look at it from above, if you turn this cube around, you see a curve. So of course we can use Gaussian regression not just to learn nonlinear functions from x to y, but also to learn curves, as long as we know at which point in time we are evaluating these curves.
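One way to realize that construction is to stack two copies of a feature set in a block-diagonal fashion, so that a single weight vector yields a whole curve. This is a sketch under that assumption, with all names and centers illustrative:

```python
def phi_curve(t):
    """Block-diagonal features for a 2D-output curve over 1D time."""
    c = np.linspace(0, 10, 8)                          # illustrative centers in time
    F = np.exp(-0.5 * (t[:, None] - c[None, :]) ** 2)  # (n, 8) bumps over time
    Z = np.zeros_like(F)
    return np.block([[F, Z],                           # rows for the x-output
                     [Z, F]])                          # rows for the y-output

t = np.linspace(0, 10, 100)
Phi_t = phi_curve(t)                                   # (200, 16) feature matrix
w = np.random.randn(Phi_t.shape[1])                    # draw from a unit Gaussian prior
curve = Phi_t @ w                    # first 100 entries: x(t); last 100: y(t)
```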
I'm showing you all of this partly as a sales pitch for this beautiful framework for doing regression on functions, which we are going to be discussing for several lectures from now on, but also to plant a bit of a worry in the back of your mind, which might be phrased as follows. What we've just discovered is that we can use any feature set to span a space of hypothesis functions, and if we change the feature set, we get very different priors over function spaces: function spaces like this, or like this, or like this, and these are all clearly very different. Maybe that's a good thing, because it means you are free to design a very powerful model for your inference task; maybe it's a problem, because it means you are forced to design a very specific hypothesis space for your inference task. In future lectures we'll have to think about how to deal with this worry, how to come up with ways of adapting these features automatically, or other ways of coping with having to choose features. But for today I'd like to end here and summarize what we've done. For the first time, we got to the point of doing a basic form of machine learning. We noticed that Gaussian distributions, which map probabilistic inference onto linear algebra, can be used to learn not just individual numbers but actual functions. We do so by describing the function space in terms of a finite number of numbers, the weights, and connecting those numbers to functions by writing the function as a linear map from the weights into some other space, a space spanned by features of the inputs x. Doing so allows us to learn even nonlinear functions of the input, as long as they are linear functions of the latent weights w. Such models are sometimes called general linear models, and we'll talk more about them in later lectures. We saw that implementing these methods is easy, in the sense that it doesn't require complicated software tools, just linear algebra. It's also fast, because it doesn't require nonlinear optimization, stochastic optimization, and so on; it can typically be done using just linear algebra, which is a particularly efficient operation on a computer. And we discovered that we can use a very rich set of features to create nonlinear hypothesis spaces over functions. This is maybe beautiful, first of all, because it allows you to create very specific, very powerful hypothesis spaces of functions, and in many applications you may actually know enough about the function you're looking for to design such explicit features, which then allow you to learn structure from your data; we'll do more examples of this kind of very structured inference in later lectures. Of course, it also raises the worry that having to design features is perhaps not ideal: if you want a generic learning algorithm, you need a way of dealing with this issue of having to choose features. We'll talk about several different ways of dealing with it in subsequent lectures. I hope to see you there again. Thank you very much for your time.