Welcome to probabilistic machine learning, lecture number nine. Here is where we are in the course. As always: we saw in lecture number one that probabilistic reasoning is a way to introduce uncertainty into the more restricted process of propositional logic by distributing truth over entire spaces of hypotheses. Doing so can be computationally hard, because it means keeping track of a potentially combinatorially large space, so we need various computational and modeling tricks to make this work in practice. One of them is to use conditional independence structure to simplify certain parts of the computation. Another is to use random numbers — Monte Carlo methods, samples drawn from probability distributions — to compute quantities like moments, in particular means and variances, and also evidences for probabilistic reasoning. A third tool in our toolbox is to specifically choose probability distributions that are amenable to certain operations. Maybe the most important relationship between variables is the linear one, expressed by matrices. We saw in the past two, actually three, lectures that the right probability distribution for this operation, the one that is particularly well adapted to linear relationships, is the Gaussian distribution. We learned in lecture number seven that we can use this Gaussian distribution not just to do inference on variables that are linearly related and jointly Gaussian distributed, but even to use this framework to learn nonlinear functions, and thereby to solve one of the key tasks of machine learning, which is supervised learning — in this case of real-valued functions, so that's regression. To do so, we distributed a set of features over an input domain, assigned weights to these features to create functions, and put Gaussian distributions over those weights. If we then make Gaussian observations, not of the individual features, but of the function, which is a linear combination of the features at various points, then we can compute posterior distributions over the function and also over the underlying weights. In the last lecture, we spoke about how to learn these features. That is, of course, one obvious challenge: you have to choose which features you are going to use. In the last lecture, we found one particular approach to this problem, which is to fit these features. This means that, while nominally keeping to the probabilistic framework, we essentially decide not to keep track of an entire space of hypotheses anymore, but just to optimize within the space of hypotheses, such that the probability of this particular choice of features is maximized. We made a connection from this idea to the vastly popular area of deep learning by noticing that we can relate this process — learning the weights in a Gaussian fashion and fitting the parameters of the features by maximum likelihood, type-II maximum likelihood, or maximum a posteriori — to learning a deep neural network. Well, deep: in this case, it's just a two-layer neural network, where one layer is integrated out and one layer is explicitly fitted. Of course, you could add more layers, and then it truly becomes a deep neural network; that's more of a parameterization issue. Today, we're going to look at another idea, which is, at least intuitively speaking, orthogonal to this approach. What we did in the last lecture was to say: we're going to use these individual features.
Let's say there are nine of them — which ones should they be so that the process works as well as possible? Another option is to say: maybe we can simply increase the number of features. If we do that, then things may become more expensive, and we will have to deal with the expense that comes from that, so let's think about what we can do about it. But if you increase the number of features, then the individual feature isn't that important anymore, and if you have a lot of features, the algorithm will somehow still pick out an interesting function, or span an interesting hypothesis space, even if the features are not specifically chosen to be particularly good. This will give rise to a beautiful conceptual framework that is still very popular in machine learning — maybe one of the two foundations, next to deep learning, in the modeling space — because it will turn out that it is actually possible to do this process with infinitely many features, and we'll see today how that works. To get to this point, I first need to clean up our notation a little bit from the previous lecture. Here is our posterior distribution over the function values given some observations y at locations X and the specific choice of features phi. This is what the posterior distribution looks like on paper if we assign a Gaussian distribution over the weights with mean mu and covariance matrix Sigma. By now you've seen these expressions several times, and maybe they've become less scary, even though they're still extremely long and tedious to look at. Before we look into complicated math, you can settle down and follow along without stressing your brain too much: the first thing we're going to do is just clean up our code a little bit. To do so, we look at the expressions in here and notice that certain types of structure show up quite a lot. If you look at where the features phi actually play a role in this expression, you see that there are two types of expressions that involve phi. One of them is an inner product between phi and mu, the mean of the weights. This one shows up here and here, with two different phis: in one case it's the features of the output, the prediction points, and in the other it's the features of the input — the test points; sorry, the training points. And then there's another expression, which is this one and this one, and this one, this one, this one, and this one — it shows up quite a lot. What this is is an inner product between one feature vector, the prior covariance of the weights, and another feature vector. And interestingly, there is no expression in here where phi shows up as a lonely phi, if you like — where there's just a feature vector lying around that isn't multiplied with anything. So we actually don't really need an explicit place in our code for the feature vector phi itself. Instead, we can define functions that encapsulate these two operations, and that will clean up our code, because then we can call these operations on the various combinations of little x and capital X. We'll give names to these two objects. This one will be called the mean function, because it's a function that involves the mean. And the other one is the inner product between two feature vectors weighted by the prior covariance of the weights.
And we can use two different names for this second object, because it is historically connected to different ideas. From the Gaussian perspective, this is a covariance function, because it describes the covariance between two function values at locations a and b, but it is much more prominently connected to the word kernel. Why is this a kernel? Well, the word kernel is overloaded in mathematics to mean many different things, but here it means that there is an inner operation that is performed a lot and that actually builds the core of our entire process — the kernel of our model, if you like. So what do I mean by that, and why is it helpful? Let's look at our code again. Here I have the code that we used in lecture number seven for regression, which I've just cleaned up a little bit. What you can see here is that I'm loading a bunch of Python libraries, and then, as in lecture number seven, we do parametric regression by first defining the features. Here I'm using these bell-shaped features — they look like a Gaussian curve, but they're not a Gaussian distribution, they're just bell-shaped, and it doesn't really matter: you saw that we could use all sorts of different features to do this kind of regression. Then we define the prior over the weights for these features; for that we need to know how many of these features there are — let's call that F — and then we create the prior mean and covariance, mu and Sigma. And then we had, in lecture number seven, this piece of Python code that just does parametric regression for us. It computes these quantities: the function-value means at the test points, the function-value covariances at the test points, function-value samples, and error bars — the sausage of uncertainty around the mean. It loads some data, then computes a posterior: posterior function-value means and posterior function-value covariances. It does that by solving a complicated linear equation, using Cholesky decompositions to make things efficient. And that gives us our posterior function-value means, posterior function-value covariances, posterior function-value samples, and posterior function-value error bars. Again, we see that there are all these expressions with inner products in here: wherever there is an @, so a matrix-vector product, there are these operations with phi. So here is an inner product phi Sigma phi, another phi Sigma phi, another inner product, and so on, up here. That's a little bit tedious, so let's clean up our code a bit: having defined the features above, we instead define this mean function and the kernel. These are both functions that take inputs, evaluate the features at those inputs — either one set of inputs or two — and then take the inner product of those feature matrices, in this case with the prior covariance or the prior mean of the weights. If we do so, then we can rewrite our code, and I've already done that here to simplify things later on. I've now actually defined a function which does this regression for us — Gaussian parametric regression, if you like, for the moment. It does the exact same thing; it's the same code, just cleaned up a little bit.
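To make this concrete, here is a minimal sketch of what such a cleaned-up implementation might look like. This is not the actual notebook code from the course; the names (`phi`, `m`, `k`, `gp_posterior`) and the specific constants are illustrative.

```python
import numpy as np

# Illustrative setup: Gaussian (bell-shaped) features and a Gaussian prior on the weights.
F, lam = 16, 1.0                     # number of features and their width
c = np.linspace(-8, 8, F)            # feature centers on the input domain
mu, Sigma = np.zeros(F), np.eye(F)   # prior mean and covariance of the weights

def phi(x):
    """Feature matrix: one row per input location, one column per feature."""
    return np.exp(-0.5 * ((np.asarray(x)[:, None] - c[None, :]) / lam) ** 2)

def m(x):
    """Mean function: phi(x)^T mu."""
    return phi(x) @ mu

def k(a, b):
    """Kernel / covariance function: phi(a)^T Sigma phi(b)."""
    return phi(a) @ Sigma @ phi(b).T

def gp_posterior(X, y, x_test, noise=0.1):
    """Posterior mean and covariance of the function at x_test, given data (X, y)."""
    G = k(X, X) + noise ** 2 * np.eye(len(X))    # Gram matrix plus observation noise
    L = np.linalg.cholesky(G)                    # Cholesky decomposition for efficiency
    A = np.linalg.solve(L.T, np.linalg.solve(L, y - m(X)))
    V = np.linalg.solve(L, k(X, x_test))
    post_mean = m(x_test) + k(x_test, X) @ A
    post_cov = k(x_test, x_test) - V.T @ V
    return post_mean, post_cov
```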
So first of all, I'm now creating a dictionary to return the output, so we can reuse this code over and over again, but I'm still computing the prior mean — except that instead of writing out the inner product, I'm just calling the mean function — and the prior covariance — instead of explicitly writing the inner product, I'm just using this abstraction called the kernel. That means I can still draw samples, I can still compute the prior error bars, I can still load data, and I can now also compute the posterior distribution by constructing the exact same quantities; wherever there used to be an inner product in the previous code, I've just replaced it with the mean function or the covariance function. Of course, this code is going to be on ILIAS, so you can check it out later on. And then here is still a bit of complicated plotting code that produces this kind of output. So let me run this so that we get to actually see what's happening. Let's just run that once. And here is the posterior distribution that we already know from a previous lecture: this smooth posterior distribution, which we get because I'm using these Gaussian features in here — these little Gaussian bumps — so they give quite smooth posteriors. Okay, so that's fine, and this is actually an easy spot for you to take a quick break. That was a nice little warm-up exercise. We've just seen that we can clean up our code by introducing simple objects called the mean function and the kernel function, which are really just encapsulations of these inner products of the feature vectors with the prior mean and covariance of the weights. This seems almost trivial, and actually it is, but it's going to empower us in a moment to do something absolutely amazing, which is to increase the number of features to infinity. To infinity, you say? How is that going to work? Well, now that we have this structure called a kernel — that's going to be the particularly interesting one; the mean function is going to be almost trivial to deal with — we can think a little bit about what exactly is happening inside of that kernel and see if we can use some cool mathematical tricks to empower this operation further. So let's look at this expression again. Let's say, for simplicity — it's not actually necessary, it just simplifies the argument — that our prior covariance is diagonal, so that the Sigma matrix is just a unit matrix times a constant. Here I've already chosen that constant in a very convenient way, because I'm going to use particular features that are distributed over our input domain from left to right. They have a rightmost end, let's call that c_max, and a leftmost end, let's call that c_min, and there are F of these features. But it doesn't really matter; I've just chosen this expression so that the remaining derivation becomes easier. And then there's a constant in front, which can be anything — and because that constant can be anything, you can also forget about these numbers for a moment; it just simplifies the argument. So let's say that our prior covariance is a diagonal matrix. Then what is this inner product? It's just a sum: a sum over the values of all the feature functions at the pair x_i and x_j, the two input locations. And x_i and x_j might individually be part of the test set or the training set, the plotting locations or whatever — they're just inputs.
The fact that x_i and x_j go in here is not going to be that interesting for the moment. The more interesting bit is this part up front. What we are computing is a sum. Now, sums are things that computers are good at, but they are also things for which mathematics has provided us with interesting tools. In particular, there are certain sums that remain tractable even as the number of entries in the sum goes to infinity, because they have a structure that allows us to write down an analytic form for the value of the sum regardless of how many entries it has. One example are convergent series, which you've probably seen in your first-year undergraduate math lectures. Another example, which you know even from high school, are integrals: infinite sums that have such a structure that you still know the value of the infinite sum — it's the area under some curve — without having to carry out an infinitely long sequence of summing up infinitely many, sometimes even uncountably infinitely many, individual segments. And we're going to use this idea, for specific choices of features, to allow our model — our neural network, if you like, for which we're doing Bayesian inference with Gaussian weights — to be infinitely large. We're going to add infinitely many features and in doing so make our neural network infinitely wide. And the only lever we're going to pull is that we're not going to put arbitrary features in there; we're going to specifically choose certain features and place them in a very regular fashion, and that regularity is going to allow us to deal with infinitely many features. So let's do that now, in a derivation that in this specific form is due to David MacKay, again. But it's a generic kind of derivation that you can do for various features, and in fact we will do it for various features today. Let's say that we choose this particular family of features, the Gaussian features that I've used on previous slides. Let me just go back so you see what I mean: it's this kind of situation. We have these little blue bumps in the background — those are our Gaussian features. In this plot there are 16 of them; they go from minus eight to plus eight. Each of these features has a location: in this case the location is five, in this case three, here it is minus two, right? And then they all have a certain width, which we just fix. And they are regularly distributed across this domain. We're going to keep this structure. And these features start, by the way, at a leftmost end, if you like, at minus eight, and they go to a rightmost end at plus eight. Okay, let's see what we can do with that. Here are our features again, now with a few more symbols. That's what the features look like: they are exponentials of negative squares. Each of these features has a location c_l — that's the numbers from minus eight to plus eight in the case of this plot. And they have a width lambda, which we keep constant. In the plot I just showed you, lambda was actually one, but of course it doesn't have to be. Now, let's see what this particular form gives us if we plug it into this generic expression up here for the inner product between features. So we'll just do that. Here is our expression.
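In symbols, this is the setup (a restatement of the slide; the factor of two in the exponent is one common convention for the feature width, and the diagonal prior carries the conveniently chosen constant mentioned above):

$$\phi_l(x) = \exp\!\left(-\frac{(x - c_l)^2}{2\lambda^2}\right), \qquad l = 1, \dots, F, \qquad c_l \in [c_{\min}, c_{\max}],$$

$$k(x_i, x_j) = \phi(x_i)^\top \Sigma\, \phi(x_j) = \frac{\sigma^2 (c_{\max} - c_{\min})}{F} \sum_{l=1}^{F} \phi_l(x_i)\, \phi_l(x_j).$$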
We've decided to use our diagonal covariance matrix. Again, we could of course use a different covariance matrix, but I've chosen a diagonal one because it makes this derivation very easy. Then what we get is this: from above, this term we just copy down, and here we plug in the values of these features. The only difference between the two features is that they are evaluated at different locations, one at x_i and the other at x_j. Now, what can we do with these features? Well, these are Gaussian features — exponentials of squares. They're not Gaussian probability distributions, and the fact that they have this Gaussian shape has nothing to do with the fact that we're doing Gaussian inference; it's just a choice of feature. But the product of two exponentials of squares is the exponential of a sum of squares, and a sum of squares is yet another square plus a remainder — you can complete the square and extract individual terms. The product of two Gaussians is another Gaussian times a normalization constant, and the same algebraic structure is something we can use here. So we can multiply these two exponentials; that gives the exponential of a sum, with two square terms at x_i and x_j, both containing c_l. We can rearrange these expressions to pull out the terms that depend on c and the terms that don't. The term that doesn't depend on c is this one. Now, that's nice, because l, the summation index, only shows up in c_l. So this bit up here doesn't depend on c_l at all; it's a constant that is the same for every single term in the sum. Let's take it out of the sum. Nice. Now we're left with a simpler term inside the sum, so we can think about what it is. It's a bunch of terms, F of them, and each one is the exponential of a negative square where we subtract c_l from a number — and once we fix x_i and x_j, that number is just a constant. So what we have here is a sum over individual Gaussian factors, if you like, that all depend on a different value c_l but share the same constant and the same width. Let's see what we can do with this particular object when we increase the number of entries in the sum. And to do so — this is the next slide — I've literally copied the very last line of the previous slide onto the next one, so there's no change here; you can just look at it again. Now we're going to mentally increase the number of features towards infinity. But we're not going to increase it in some arbitrary way; we're going to increase it in an extremely regular way. We'll add more and more of these features to increase their density, but we will keep them at regular distances, so they will still be on a regular grid. It's just that the grid becomes finer and finer as we add more and more of these points: the distance between them becomes smaller and smaller, but we still keep the boundaries on the right and the left — eight and minus eight, if you like. Then, if we do so, what is the number of features within a certain box of width delta c on this input domain? Well, it's the total number of features, capital F, which we are increasing, times the relative amount of volume in that small box over the entire box, right?
That relative amount of volume is the width of the small box, delta c, divided by the width of the whole box, c_max minus c_min. That's just how much of the entire measurable volume lies in delta c. And now, if we increase the number of features further — here I'm waving my hands a little bit, but I'm sure you're going to believe me — this sum turns into a Riemann sum, and in the limit into an integral. So as we increase F and decrease delta c, asymptotically we get an infinite sum, which is a Riemann integral over infinitely many feature functions, where each individual feature function is evaluated at c and has a shift given by this constant, one half of x_i plus x_j. Now, what is this integral back here? This is the actual power of integration: there are certain integrals that are simply known in closed form. This here is a Gaussian integral — the integral over the exponential of minus a square — and we know its value: it is the difference between two error functions. Or actually, if we just push c_max towards infinity and c_min towards minus infinity — that means we're assuming that the box we're putting our features over becomes arbitrarily large, so we have infinitely many features across the entire real line, lying infinitely densely packed — then this is an integral over Gaussian features across the whole real line, and that's just a standard Gaussian integral. We know what that is: it's the square root of two pi times lambda, and if lambda is one, it's just the square root of two pi, right? So what we're left with is the bunch of constants in front, which don't depend on c: that's sigma squared times this expression, times this integral, which is just the square root of two pi times lambda. Interesting — we've just managed to work with infinitely many features and yet still have a tractable expression for our inner product, which we call the kernel function. And it's given by this particular expression: square root of two pi times lambda, times sigma squared, times the exponential of minus (x_i minus x_j) squared over four lambda squared. Notice that there is no quantity left in here that relates to the number of features. In fact, when this function is used in practice, the constants sigma squared and lambda are typically redefined a little bit: lambda is rescaled so that there is a two rather than a four in the denominator, and sigma is redefined as sigma over the square root of two pi lambda, so that there's just a constant sigma squared up front. But that doesn't really matter; it's just a renaming of the variables. Actually, let me first show you this and let's see what has just happened here. Here is a pictorial view of what we've just done — if you don't like the math, you can look at a picture, and in a moment we'll look at the code. We started by putting finitely many features on here; in this plot there are eight at the moment. Now I've increased the number of features: here there are now, I think, 32 or so of these features. You can still see them; there are finitely many of them.
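For reference, here is the whole limiting argument of the last few paragraphs written as one chain (a sketch; the exact constant in front depends on the convention chosen for the feature width, and it is absorbed into the redefinition of sigma just mentioned anyway):

$$k(x_i, x_j) = \frac{\sigma^2 (c_{\max} - c_{\min})}{F} \sum_{l=1}^{F} \phi_l(x_i)\,\phi_l(x_j) \;\xrightarrow{\;F \to \infty\;}\; \sigma^2 \int_{c_{\min}}^{c_{\max}} \exp\!\left(-\frac{(x_i - c)^2 + (x_j - c)^2}{2\lambda^2}\right) \mathrm{d}c$$

$$= \sigma^2 \exp\!\left(-\frac{(x_i - x_j)^2}{4\lambda^2}\right) \int_{c_{\min}}^{c_{\max}} \exp\!\left(-\frac{\big(c - \tfrac{1}{2}(x_i + x_j)\big)^2}{\lambda^2}\right) \mathrm{d}c \;\xrightarrow{\;c_{\min} \to -\infty,\; c_{\max} \to \infty\;}\; \sigma^2 \sqrt{\pi}\,\lambda\, \exp\!\left(-\frac{(x_i - x_j)^2}{4\lambda^2}\right).$$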
And as we increase the number of features, they stay at regular locations, and they stay between minus eight and eight. Asymptotically, it is actually possible to use arbitrarily many of these features, and even to widen the box arbitrarily far to the left and the right. And we can still draw all the quantities that our code computes: we can still draw the prior mean function, we can draw this sausage of uncertainty, the error bars, we can draw samples — that's what they look like — and we can even compute posterior distributions. This is what they look like, even though we're using an infinitely wide neural network. What do we have to do to achieve that? Well, let's look at our code again. Here is our code from before; I haven't changed anything. Now I'm going to do a little bit of rearrangement — let me go back up; this part is just plotting, and maybe I can collapse the plotting bit so that we can see this. Okay, here is our code from before, our Gaussian parametric regression, if you like. And here I don't need to change anything, because we've already encapsulated all the necessary operations in m and k, the mean function and the kernel function. So all I need to do is redefine the kernel. For the mean function, I'm going to make things simple and just say it is the zero function, because we assume either that the prior mean of the weights is zero, or that whatever the prior mean of the weights is, I've subtracted some constant — which I'm allowed to do, because it's Gaussian inference where I could subtract some b which happens to have exactly that value. The tricky bit is going to be our kernel. Our kernel is now not going to explicitly take inner products between features anymore. Instead — and let me go back and show you this on the slide — it's directly going to be a function that takes x_i and x_j as inputs and returns the evaluation of a function that depends on x_i and x_j. And now we're going to build matrices out of these objects, for arbitrary pairs and sets of pairs of x_i and x_j. To do that, we need a little bit of Python-fu — those of you who like programming, proper computer scientists, are going to like this. We're going to define an abstract function which we just call the kernel. This kernel is a higher-order function, if you like, which takes in a particular function — in our case, the exponential of a square times a constant — and returns another function, here it is, which is itself a function of two inputs: it takes in a set of x_i's and a set of x_j's, evaluates the given function on every possible pair (i, j) from those sets, and builds a matrix out of that — a potentially rectangular or square matrix. That's the abstract operation, and in our derivation we ended up applying it to this particular function here: you take your two inputs, x_i and x_j — I call them a and b here — and return a constant, which I've set to nine here (you'll see why later), times the exponential of minus the squared distance between the two inputs, divided by two, divided by some width, which I call ell squared here.
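A minimal sketch of that construction — not the actual notebook code; the names `kernel`, `rbf`, `theta2`, and `ell` are illustrative:

```python
import numpy as np

def kernel(f):
    """Turn a scalar bivariate function f(a, b) into a function that, given two
    collections of inputs A and B, returns the len(A) x len(B) matrix of all
    pairwise evaluations f(A[i], B[j])."""
    def K(A, B):
        A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
        return f(A[:, None], B[None, :])
    return K

def rbf(a, b, theta2=9.0, ell=1.0):
    """The limiting kernel derived above: a constant times the exponential of
    minus the squared distance, divided by 2 * ell**2."""
    return theta2 * np.exp(-((a - b) ** 2) / (2 * ell ** 2))

m = lambda x: np.zeros(len(x))   # zero mean function
k = kernel(rbf)                  # drop-in replacement for the parametric kernel
```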
And actually that width can be set to anything, so I'm already using an abstract form that we can change later on. So let's build a kernel from that: we call our kernel function and say we want this function, which can be applied to arbitrary groups of inputs, and for each pair of inputs it just applies this particular function. Let's do that: I can run this code, and if I then rerun the plotting code up here — hang on — we can still do Gaussian inference on function values, and we get out this slightly differently shaped posterior, because we're now using infinitely many feature functions to do this. What I've just done here is define two objects that are important and in widespread use. One is called a kernel; the other, the object we've just constructed from it — which you might have been wondering about — is called a Gaussian process. These two words are big and fancy, and we're going to use them a lot over the course of the lecture, but you can think of them exactly in the way we just constructed them, and that's maybe more useful than the usual definitions you get to see, which are already cleaned up and made elegant so that they work in a general sense. So here is a proper definition, as if you had done this from the top down, by first defining all the objects. A kernel — in slightly more proper mathematical terminology, a Mercer kernel, sometimes also called a positive definite kernel — is a bivariate function that takes two inputs from an input domain (in our examples that input domain was always the real line, but of course it can be many other spaces) and returns a real number, such that — and here is the slightly weird-looking definition that I essentially just implemented in Python — for any finite collection of input points, let's call it capital X with entries x_1 to x_n, the square matrix that you construct by evaluating the function on all pairs x_i and x_j is positive semi-definite. In that case it is called a positive semi-definite kernel, or a Mercer kernel. The Python object I called kernel, by contrast, is a higher-order function: a function that takes in a function and returns another function. It takes in this function f — which in the example I just did is this exponential of a square — and returns another function, this thing that you can use to construct matrices. You give it a collection — actually two, possibly different, collections A and B — and it returns a rectangular matrix of size length of A by length of B, where for each possible pair from A and B we evaluate this function f and build the matrix out of the results. And here is, more or less, the proper Python code for doing that, as I just used it in our code. Just as a reminder: positive semi-definite — we talked about this before — means that the matrix has the property that when we multiply it from the left and the right with any arbitrary vector, we always get a number that is positive or at most zero. Other ways of describing such matrices: they are symmetric matrices with non-negative eigenvalues, or they can be written as outer products of a bunch of other vectors. Now, we didn't arrive at this kind of function in this form.
Instead, I chose a particular set of features, took inner products of these features, and used those to construct a covariance function for our Gaussian process. So I need to show you that this construction actually gives such a Mercer kernel, and here is a lemma that says so: k is a Mercer kernel if it can be written as an inner product, that is, as a sum over evaluations of some features phi at x and x prime. In fact, I'm going to allow for the more generic case — beyond what I just constructed with the Gaussian features — of this sum being an integral over some domain under some measure nu. You can think of what I just did as the straightforward case where this measure is simply the Lebesgue measure and I integrated from minus infinity to plus infinity over the product of the same feature function evaluated at two different inputs. Why is the lemma true? Well, let's look at the sum case. If you build such a matrix and multiply it from the left and the right with any vector v, you get a double sum, which separates into two sums of the same form — and if you look at them closely, you notice that these two sums are actually identical. So the whole expression is a sum over l of (sum over i of v_i phi_l(x_i)) squared, a sum of squares of real numbers, which is always greater than or equal to zero. Therefore, if we construct kernel functions in this form, we always get positive semi-definite Mercer kernels. It actually turns out that the converse, subject to a few technical constraints, is also true: any such Mercer kernel can be written in such a form. But that proof is much, much harder, so I'm not going to do it. And then we use this object, the Mercer kernel, to define another, probabilistic, concept, which we will call the Gaussian process. I've already constructed one for you, for the specific case of this particular kernel, but here is a proper definition. Let mu, which we will call the mean function, be any function — it really doesn't matter what that function is — and let k be such a Mercer kernel. Then a Gaussian process, which we will write like this — I'm going to reuse this notation, which strictly speaking is a little dangerous because we don't yet know whether this is actually a probability measure, but it turns out that the notation works — we write GP, for Gaussian process, over an object of type function, parameterized by these two objects. This is a probability distribution over a function that maps from the input domain X, whatever it might be, to the real line, such that — and this is the actual definition — if we take any finite restriction, as we did in our Python code both to make plots on regular plotting grids and for finite data sets, that is, the restriction of f to function values at any finite set of locations (here I'm using the notation I introduced two lectures ago, with subscripts instead of brackets, to denote function evaluations), then that restriction has a Gaussian distribution whose mean is given by evaluating the mean function at those points and whose covariance is given by evaluating the kernel at those points. Then we call this a Gaussian process. So this definition is actually somewhat backward.
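Written compactly, with the subscript notation for finite restrictions, the definition just given reads:

$$f \sim \mathcal{GP}(\mu, k) \quad :\Longleftrightarrow \quad f_X \sim \mathcal{N}\big(\mu_X,\; k_{XX}\big) \quad \text{for every finite } X = \{x_1, \dots, x_n\} \subset \mathbb{X},$$

where $[\mu_X]_i = \mu(x_i)$ and $[k_{XX}]_{ij} = k(x_i, x_j)$.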
It is backward in the sense that it just describes what I've done in this construction, where we noticed that we can construct covariance functions that allow us to take arbitrary sets of inputs — as long as they are finite — and always give back a positive semi-definite matrix. Therefore we can use that as a Gaussian distribution and do Gaussian regression — let's forget about the word parametric — using these objects. And now we just have to wonder: what does that mean? What are we actually doing here? Well, what we're doing is that we've written a computer program — that's literally what we've done — a computer program that takes arbitrary inputs, test set and training set, and no matter what those inputs are, as long as they are finite and within the domain we've specified, capital X, we always get out Gaussian distributions with which we can work. We always get out Gaussian prior means over the function values on the test set, we always get Gaussian prior covariances over that test set, we get Gaussian distributions over the data set, and therefore — because we have a linear Gaussian relationship between the data and the function — we get a Gaussian posterior over the function values f. Now, I've only done this construction for one specific choice of features, for one such Gaussian process, and you might be thinking: okay, that means there is just this one peculiar object, right? Where you take Gaussian features and distribute them over the real line, infinitely far and infinitely densely, and then you get this one specific instance of an infinitely wide neural network, and that kind of works. It turns out that this is actually not so specific. There is a much larger set of such Gaussian process models, and getting a sense of how large this set is, is what we're going to do for the rest of this lecture. To do that, we'll first look at a few more features and see if we can redo this kind of construction — the one I just did for the Gaussian features — with other features. Two lectures ago, in lecture seven, when we spoke about different features we could use, I also introduced step functions. I said there are two different kinds of step functions: step functions that go from minus one to plus one, and step functions that go from zero to one. Let's use the latter. Here are those feature functions again. In this picture there are four of them. Each starts at a certain point: the first one starts at minus eight, the second at minus four, the next at zero, and then at four. So there are four of these features, and they are given by Heaviside step functions — step functions that become one when their argument is larger than zero. What happens if we replace the Gaussian features in our construction with these Heaviside step functions? For that, let's actually do this on the blackboard — the whiteboard — for a change. Let's say that we have feature functions such that feature number l is given by the Heaviside step function theta of x minus c_l.
And of course we're again going to play the same game: we assume the prior covariance to be a diagonal matrix with a smartly chosen constant in front. Let's not think about that constant yet; we can choose it later to make things work out. The interesting object is going to be this inner product, phi of x_i transpose times phi of x_j. What is that? It's a sum over the individual features l — there are capital F of them — of Heaviside step functions: theta of x_i minus c_l times theta of x_j minus c_l. Okay, what is that expression? Well, these are functions that are either zero or one. So if one of them is zero, then the whole product is zero, and the entire term drops from the sum. The only case in which we actually get a contribution to the sum is if x_i is larger than c_l and x_j is also larger than c_l — that means if the smaller of the two, the minimum of x_i and x_j, is larger than c_l. So we have a sum over l of theta of the minimum of x_i and x_j, minus c_l. I'm going to give a name to this minimum, because writing out the minimum of x_i and x_j all the time is a bit tedious; I will just call it x bar. So now, the integral. We're going to do the same kind of construction as before: we'll increase the number of features, capital F, towards infinity until we are essentially talking about a Riemann sum. So we will eventually get an integral from some lower bound — let's call it c_0 — all the way up to some upper bound — let's call it c_max — over this expression, a single Heaviside step function, theta of x bar minus c, dc. And here we can do a second step with the Heaviside function: this expression is zero unless x bar is larger than c. Since we're integrating over c, that means c needs to be less than x bar, so we're really integrating from c_0 only up to x bar over just the constant one, plus an integral over zero from x bar to c_max. And what is that? Well, it's just x bar minus c_0. Let me show that to you on the slides as well. Here is a cleaner form of this derivation. Here's our feature, now properly defined as a Heaviside step function. Here's the constant that I've smartly chosen. And here's the derivation we just did, and we end up with a covariance function that is given by a constant times the minimum of x_i and x_j — that's our x bar — minus c_0. By the way, you can of course pull that c_0 inside and just think of it as the minimum of x_i minus c_0 and x_j minus c_0. Okay, what does this look like in terms of pictures? Here is our Gaussian regression model with step functions. I've now used a few more — not the four from before, but, I think, 16, one at each location. And now we can increase that number of features more and more, and you see that asymptotically we get an object that actually has a finite variance. Notice, by the way — this may be a good point to say this — that as we move back and forth between these plots, the scale of the y-axis stays constant; I'm not changing the scale of the plot.
And that means that even though we increased the number of features here, the variance of this process does not increase arbitrarily. That seems like a good thing, right? Because you don't want a model that is arbitrarily flexible, that has infinite variance. The way we achieved this in this construction is by dividing the variance by the number of features: as we increase the number of features, we also decrease this constant proportionally, and therefore we are left with a finite variance. This is good from a modeling perspective, because we want a model that isn't infinitely flexible — if it were, we would also be infinitely uncertain, and if we are infinitely uncertain, that might be a bad thing: we might not be able to learn anything. However, if you think about this from a deep learning perspective, it also means that as we increase the number of weights in our network towards infinity, the scale of the individual weights drops proportionally, so each individual feature only contributes an infinitesimally small amount to the overall model. And you might wonder whether that doesn't constrain the power of these models in some sense. Of course it does, and we'll find out how in the next lecture. Okay, so it turns out that there is this limiting process — here it is — and it has this covariance function. You might think that this is just an arbitrary one that I've constructed because I happened to pick these step features, which might be fun to do. But actually this is in many ways maybe the most fundamental Gaussian process; at least, it is historically the most important Gaussian process, the one that was studied first. Why? Because it's interesting in physics. You can think of the process that happens here as the kind of path you get by taking a particle and putting it at location zero at time c_0. Here I've set c_0 to be minus eight — this point. And now, as we go forward through, well, let's call it time, at every point in time this particle gets an infinitesimal kick, and that kick is distributed according to an independent Gaussian — because that's exactly the construction we have: at every single point in time there is a new feature being switched on, and it has a weight that is independent of all the previous ones and scales with a constant divided by the number of features. So each kick is infinitesimally small, but there are also infinitely many of them. And what this particle then does is a random walk across time, forward in time. That's what suspended particles in a gas or fluid actually do. This is called Brownian motion, and Albert Einstein — well, he didn't get the Nobel Prize for this, but one of the four papers he wrote during his annus mirabilis is on Brownian motion, on exactly this kind of stochastic process. In the same year he also wrote papers on the photoelectric effect and on special relativity. He ended up getting the Nobel Prize for the photoelectric effect, but this paper was probably just as important, and it is arguably one of the first proper mathematical descriptions of this stochastic process — the Wiener process, Brownian motion. And here you see an excerpt from his original paper.
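In modern notation, the density shown in that excerpt reads (a reconstruction from the description that follows; n is the number of particles and D the diffusion coefficient):

$$f(x, t) = \frac{n}{\sqrt{4\pi D t}}\, \exp\!\left(-\frac{x^2}{4 D t}\right),$$

a Gaussian over the location x whose variance, 2Dt, grows linearly in time.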
What you see here is a probability distribution over the location x of a particle — x is our output, this one here — at a time t. And what Einstein shows is that the probability density of this location, the probability to be at location x at time t, is given by a Gaussian: a prefactor that has something to do with the number of particles and the normalization constant of a Gaussian, which we already know, times a Gaussian expression over the location, scaled by a variance that grows linearly in time. And this D is a diffusion constant that has something to do with the physics, but other than that, this is exactly the kind of curve we see here. By the way, if you do regression with this kind of kernel, you get this kind of posterior, which is a nice regression model in some sense: it creates an interpolant between the data — that's the solid red line — which is piecewise linear. Why is it piecewise linear? Well, think about what this posterior over the regression function is. Let me go back and show you the corresponding slide. The posterior mean of this function at x is the prior mean, which we've set to zero — so this is a zero and this is a zero — plus the data here times a matrix that we get to invert, and then on the left-hand side we have this object, which contains evaluations of the kernel function: for the point little x, this is a row vector that evaluates the kernel between the test location and all the training locations, capital X, and all these kernel values are now of this minimum form. Assuming the test point is larger than the training location, each of them is a linear function, and a sum of linear functions is another linear function, which is why what you see here as the interpolant is always a piecewise linear function. And it has a kink at every data point, because at every data point a new step is being switched on. Regarding this connection to physics: I actually recommend that you have a look at this paper if you can read German — and of course there's an English translation as well. It might be a fun read for an afternoon or an evening, not because it gives you much insight into stochastic processes or Gaussian processes in particular, but because it gives you a feeling for how fundamental, seminal advances in physics were constructed just over 100 years ago, and maybe for why machine learning might be a natural continuation of this kind of work. Our field still works in the relatively concrete, not overly formal mathematical style that arguably Einstein was following as well — I'm not saying he was a stupid man at all, but he uses a relatively concrete approach to mathematics that we now also commonly use in our field. So, okay: this is our construction of the stochastic process that describes Brownian motion — the process that creates these paths, which are instances of Brownian motion — and it is called the Wiener process. It is a Gaussian process, and it gives rise to these interpolants, which are piecewise linear and sometimes called linear splines; in code, its kernel is the one-liner sketched below. Great, you might think.
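A minimal sketch — again not the notebook code — reusing the `kernel` wrapper from the earlier sketch; `theta2` and `c0` stand for the prior constant and the left end of the feature box:

```python
import numpy as np

def wiener(a, b, theta2=1.0, c0=-8.0):
    """Wiener-process (Brownian-motion) kernel, the limit of Heaviside step
    features: k(a, b) = theta2 * (min(a, b) - c0) for a, b >= c0."""
    return theta2 * (np.minimum(a, b) - c0)

k = kernel(wiener)   # drop-in replacement in the same regression routine
```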
But of course, you're a self-respecting machine learner, and you're not using features like step functions — nobody does that anymore. Everyone these days is using rectified linear units, and clearly those are somehow much more powerful, right? So surely we lose this wonderful connection — surely ReLUs can't be connected to any Gaussian process. Well, let's see what we can do about rectified linear units. These, too, are connected to a Gaussian process. For that, I need to briefly clean my whiteboard so that you can actually see what I'm going to be drawing. Okay. So let's say that instead of our piecewise constant step functions, we use piecewise linear functions as our features. Notice that with piecewise constant features, we got piecewise linear interpolants; now we're going to use piecewise linear features — let's see what kind of interpolants those give us. Let's say we use features phi_l of x given by theta of x minus c_l — the step function — times x minus c_l. That's a ReLU feature, right? A feature that looks like this, where here is c_l. Now let's take inner products of these features. We get phi of x_i transpose times phi of x_j, which is a sum over individual terms; let's write them down and sort them a little bit right away. There are going to be two Heaviside step functions — let's call the inputs x_i and x_j, as before — so the sum over l is theta of x_i minus c_l, times theta of x_j minus c_l, times x_i minus c_l, times x_j minus c_l. And just as before, just as for the step functions, here we have a product of two Heaviside step functions, and this product is zero if either of them is zero. So we can instead write the sum over l of theta of the minimum of x_i and x_j — which again I'm going to call x bar — minus c_l (because c_l is the same in both expressions), times that bit at the end. And we can already expand that a little bit to see what it's going to be: there will be a quadratic term in c_l, so a c_l squared, then minus a linear term in c_l that has x_i plus x_j in it, and then a, in quotation marks, "quadratic" term where you multiply x_i and x_j. Now, what happens if we increase the number of features? Here, since we are trying to make a connection to deep learning and rectified linear units, we have to be a little careful about how exactly we generalize. There are actually various different ways of placing these rectified linear units in a regular fashion over some domain. We're going to pick one of them: we put all these rectified linear units pointing in the same direction, from the left end to the right end. That's basically the analogue of what I did previously with the other features. But of course these ReLUs have this kind of asymmetry towards the right, so you could also define them the other way round: you could have another feature which points in the other direction and slopes up towards the left. And then maybe you could have both of these — a construction where you have two linear features, separate from each other, at different points. All of this is possible, and if you want, you can try it out for yourself.
I'm just going to assume — and these choices will give rise to different kernels, right? — that we have a regular set of these features that are switched on one after the other, and that they become asymptotically dense: we put more and more of them onto our grid and increase their number until we get a Riemann integral that goes from a left end, where the process starts and is zero, from c_0, all the way to c_max. And then we get this expression inside, integrated over c: theta of x bar minus c, times c squared minus c times (x_i plus x_j) plus x_i x_j, dc. Now we can do the exact same thing as before, for our linear splines, for the step functions, for Brownian motion: the integrand is only non-zero if this expression is larger than zero, which means c is less than x bar, so we only integrate up to x bar. And now I need to be careful that I have the space to write this down so you can still see it. What is that integral going to be? It's just a polynomial integral: one third c cubed, evaluated between c_0 and x bar — so that's one third of x bar cubed minus c_0 cubed — minus one half c squared, evaluated between c_0 and x bar, times x_i plus x_j, plus the constant term, which gives x bar minus c_0, times x_i x_j. I'm not sure you can still read all of that, but you can redo it for yourself. Now, this expression is typically not written down this way, but that is already our kernel. What people often do is rearrange this expression and end up with this form of the covariance function. I'll leave it to you to check that the expression I've just written down here is the same as the one I derived on the whiteboard — a little exercise in playing around with minimum expressions. And here, okay, I've used x_0 rather than c_0, but you can sort that out for yourself. This construction of a kernel that arises from summing up, or integrating up, an infinite number of rectified-linear-unit features is historically connected with a wonderful lady: Grace Wahba, an American statistician. Together with her PhD advisor she wrote a wonderful paper on these kinds of models, and later published a book by herself called Spline Models for Observational Data. It's a beautiful book that arose before the machine learning community proper existed, and it actually introduces a large number of these models in a relatively generic, general, almost universal form. So what kind of model is this going to give us? We've just done the math; here's a pictorial view again. Here are our features, as in the previous plots — initially four of them. Now we increase the number to more and more ReLU features, and you see that the stochastic process stays in the same range, because we are proportionally scaling the variance down, and asymptotically we get this kind of smooth process, with sample paths that are these smooth paths, and we can draw from the associated Gaussian process as well. This process also has various different names. One way to think about it is that it is an integral over the linear process from before; the limiting kernel itself is the short function sketched in code below. Why an integral? Let me go back to this slide.
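First, the code sketch (illustrative names again, reusing the `kernel` wrapper from before; `x0` is the point where the process starts, and the closed form is one common way of writing the rearranged expression from the slide):

```python
import numpy as np

def cubic_spline(a, b, theta2=1.0, x0=-8.0):
    """Integrated-Wiener / cubic-spline kernel, the limit of ReLU features:
    with u = a - x0, v = b - x0 and mn = min(u, v),
    k(a, b) = theta2 * (mn**3 / 3 + |a - b| * mn**2 / 2)."""
    u, v = a - x0, b - x0
    mn = np.minimum(u, v)
    return theta2 * (mn ** 3 / 3.0 + np.abs(a - b) * mn ** 2 / 2.0)

k = kernel(cubic_spline)   # once more a drop-in replacement in the same routine
```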
This rectified linear unit feature is actually an integral over a Heaviside step function, right? It's an integral that remains zero until you hit c_l and then becomes a linear function rather than a step function. That means we can think of the Gaussian process that arises from these features as arising from applying this integration operation — which is a linear operation, right? integrals are just sums — to the Gaussian process for Brownian motion. Therefore the associated process can also be constructed by applying this linear map to the Wiener process, and it is the so-called integrated Brownian motion, or integrated Wiener process, prior. That, however, is maybe not the most interesting connection from a machine learning perspective. Perhaps the more interesting connection is the shape of these interpolants here. This is the posterior that we get when we use this Gaussian process prior on this dataset. It starts at zero, of course, because our process has a leftmost point where it is zero, since all the features are switched off there; and then an infinite number of features with infinitely small variances gets added up. And you can see that the interpolant in the middle, this red line, is now a smooth function. It's not piecewise linear anymore; it's actually piecewise polynomial. Why? Because it's still a weighted sum of kernel functions, one centered at each input point, and those kernel functions, as you can see — and as you could also see from my construction here — are cubic polynomials. That means the interpolant is a sum of cubic polynomials, so it is itself a piecewise cubic polynomial. And it's the piecewise cubic function which, asymptotically as the error bars go to zero, actually goes through all the data points. There is essentially one such function, and it's called the cubic spline. So cubic spline interpolation is actually a specific form of Bayesian inference: Gaussian process regression with the integrated Wiener process kernel, which was maybe for the first time properly written down, or at least derived in this fashion, by Grace Wahba. And with that, we are at another summary slide of what we've done in the lecture so far. We've seen that these parametric Gaussian regression models can sometimes be extended, not by adding more layers to the neural network, but by extending the number of units in it towards an infinite limit. If we're lucky, we can use a particular choice of features and distribute these features in the right way such that, as the number of entries in the sum goes towards infinity, we approach an asymptotically regular regime in which we can do the sum in closed form as a Riemann integral. Then we arrive at a still entirely tractable model, which is called a non-parametric model, because we are not keeping track of individual features anymore, only of their interactions — the interactions within an infinite sum of such linear models — and this particular kind of non-parametric model is called a Gaussian process. Inference in these Gaussian process models remains tractable even though we are simultaneously tracking an infinite number of features.
Inference in these Gaussian process models remains tractable even though we are simultaneously tracking an infinite number of features. It remains tractable because we never actually talk about the individual features; we talk about the associated posterior over the function that we are trying to regress on. And, if we are quite honest, it also remains tractable because we increased the number of features while simultaneously and proportionally reducing the variance of the associated weights. A point to keep in mind is that we still need to do Gaussian inference. So we still need to invert that covariance matrix, and that is of course going to be cubic in the size of the data set, cubic in N, because we have to invert a matrix of size number of data points by number of data points. In previous lectures, I showed you how to do inference with parametric regression, which allowed a reformulation onto the weight space using the Schur complement, and that was actually faster: it came at a cost that is cubic in the number of features but linear in the number of data points. In Gaussian process regression we cannot do this anymore, because there is no weight space to talk about; or, if you wanted one, it would have to be this infinite-dimensional weight space, the space of infinitely many features, which of course is larger than any data set. So a downside of this approach is that, at least in principle and on paper, it is cubically expensive in the number of data points. Now, there are all sorts of smart approximations out there, which I don't have time to talk about today, that do allow reducing the complexity of this kind of inference using approximate methods. In principle, though, these methods are cubically expensive in the number of data points. Now, I did the initial construction by showing you how to construct one particular kernel, the RBF kernel, the radial basis function kernel, also called the square exponential or squared exponential kernel, or the Gaussian kernel. But we just saw in the last few minutes that we can do a very analogous construction using other kinds of features and other asymptotic limits of infinitely many features. We did this specifically with piecewise constant step functions, Heaviside functions. This gave rise to a very popular, very famous Gaussian process called the Wiener process, which has as its kernel this minimum kernel; it gives rise to posterior mean functions that are piecewise linear, and therefore this is called the linear spline. One thing to notice, and maybe I should make this more explicit: the posterior interpolant is piecewise linear, but the samples are not. The samples are actually very rough; it can even be shown that these samples are, almost surely, nowhere differentiable. But we'll talk about that in the next lecture. And then, just for fun, and maybe as a connection to contemporary deep learning, I pointed out that the infinitely wide limit of a ReLU regression network is actually also a Gaussian process, and it is associated with this kernel, at least under one particular construction of these features, which is called the cubic spline kernel. There are other ways of distributing the ReLU features across the input domain. They give rise to other forms of kernels, but all of these kernels are polynomials in the inputs, of cubic order. They give rise to slightly different shapes, but their interpolants still remain cubic splines, just different types of cubic splines.
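If you want to see the difference between the rough linear-spline samples and the smooth cubic-spline samples for yourself, here is a small sketch (again with names and a jitter constant of my own choosing) that draws prior sample paths from the Wiener kernel and from the cubic-spline kernel with c₀ = 0.

```python
import numpy as np

def k_wiener(a, b):
    # Wiener (Brownian motion) kernel: k(x, x') = min(x, x').
    return np.minimum(a[:, None], b[None, :])

def k_cubic_spline(a, b):
    # Integrated-Wiener / cubic-spline kernel derived above, with c0 = 0.
    ai, bj = a[:, None], b[None, :]
    m = np.minimum(ai, bj)
    return m**3 / 3.0 - m**2 / 2.0 * (ai + bj) + m * ai * bj

x = np.linspace(1e-3, 1.0, 100)
rng = np.random.default_rng(0)
for K in (k_wiener(x, x), k_cubic_spline(x, x)):
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))   # small jitter for numerical stability
    paths = L @ rng.standard_normal((len(x), 3))        # three prior sample paths per kernel
    # plot x against each column of `paths`: the min-kernel paths are rough,
    # the cubic-spline-kernel paths are visibly smooth
```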
So if you've worked with splines before, you might know that there are different cubic splines, natural ones and less natural ones. These are maybe the natural cubic splines, natural in the sense that they extrapolate outside of the data in a linear fashion. There are other such kernels, and over the history of our field various other kernels have been constructed in a fashion very similar to what I just did with these three examples. For example, there is a kernel called the neural network kernel, which was derived by Chris Williams in 1998 and is based on a construction that uses other kinds of features, sigmoidal features rather than piecewise linear ones, because that was a time when people weren't so keen on ReLU features. Now, it might seem a bit tedious to do this, and you might get the impression that for these kinds of constructions we always have to start with some kind of neural network structure, extend the number of features, and try to find a particular limit. That seems very constrained, right? It almost looks as if we are inheriting all the problems of deep learning by having to choose these feature sets and then manually manipulate them. But actually that is not true. It turns out that these four examples here are really just that: examples. The space of all Gaussian processes, spanned by the space of all kernels, is actually very large and continuous. That is because you can build new kernels from old ones, for example using these here as starting points and then manipulating these kernels to get new models. Doing so creates an extremely powerful modeling language, which we are going to work with over the next few lectures as well. I want to show you a few examples of how to use this modeling language to build very expressive probabilistic models which allow tractable inference and also allow you to encode a lot of structural information about your problem a priori, by including things you really tangibly, manifestly know about your problem. To explore and identify this space of kernels, we can remind ourselves that kernels are implicitly defined through the property that they give rise to positive definite matrices. So a kernel is a function such that, when you evaluate it on a bunch of points and build a matrix out of those evaluations, the matrix is positive definite. Therefore, we can think about all the operations you can do on positive definite matrices that give back positive definite matrices, and thereby identify the space of kernels. So what are these operations? Well, you can do the following four operations on kernels, and this means that kernels essentially span what is called a semi-ring. First, if you have a kernel already and you multiply it with a positive number, then you still get a kernel. Why is that? Well, you can show that in a one-liner: if you have a matrix such that any vector multiplied from the left and the right always gives a number larger than zero, then multiplying that number by a positive number of course still gives a positive number, right? And that is the same as multiplying every individual entry of this matrix by alpha. Another operation you can do is to take a kernel and map its inputs through some feature function, or, equivalently, think of its inputs as mapped through some feature function. Why does that work? Because, well, I didn't prove it, but I kind of insinuated that every kernel can be written in this inner-product form.
I only showed it the other way around. But let's say we have a kernel that can be written in this inner-product form. Then we know that it is positive definite, because our proof for that was: take a matrix that arises from taking the elements of some collection of inputs, evaluating the kernel for every possible pair of entries of the collection, and building a matrix out of these numbers; you can show that this matrix is positive definite by multiplying it from the left and the right with a vector and noticing that the resulting expression essentially amounts to a sum of squares of real numbers. Now, that remains a sum of squares if we first transform the input to phi through some other function that maps onto the same space, of course, so that phi still works. The proof still goes through, so clearly this is still a kernel. The next operation we can do is to take two kernels and sum them together, because the sum of two real numbers that are larger than zero is larger than zero. And finally, we can take two kernels and multiply them with each other. Here I don't mean the product of two matrices; you take the actual kernel functions and multiply them with each other pointwise. The resulting matrix, which is the so-called Hadamard product, the element-wise product of the individual matrices, happens to also be positive definite. This final statement is actually totally non-trivial, and it is not easy to show. It is known as the Schur product theorem, and if you want to understand how it works, you actually need to read a proper proof of it. Nevertheless, we can use all four of these properties to create new Gaussian process models from existing kernels, and that will give rise to our modeling language; we are going to use it in a later lecture to build a concrete model. Written out as code, these four operations look like the little sketch below.
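Here is that sketch: the four operations written as higher-order functions in numpy (my own illustrative naming and starting kernel; a sketch of the algebra, not a definitive implementation).

```python
import numpy as np

def rbf(a, b, ell=0.2):
    # Gaussian kernel used as a starting point.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell)**2)

def scale(k, alpha):
    # alpha * k is a kernel for alpha > 0 (rescales the output variance).
    return lambda a, b: alpha * k(a, b)

def warp(k, f):
    # k(f(.), f(.)) is a kernel for any map f into the kernel's input domain.
    return lambda a, b: k(f(a), f(b))

def add(k1, k2):
    # The sum of two kernels is a kernel.
    return lambda a, b: k1(a, b) + k2(a, b)

def multiply(k1, k2):
    # The element-wise (Hadamard) product of two kernels is a kernel (Schur product theorem).
    return lambda a, b: k1(a, b) * k2(a, b)

# Example: a smooth kernel on nonlinearly warped inputs, plus a single-quadratic-feature kernel.
quad = lambda a, b: (a[:, None]**2) * (b[None, :]**2)
k_combined = add(warp(rbf, lambda x: x**3), multiply(quad, rbf))
x = np.linspace(-1.0, 1.0, 50)
K = k_combined(x, x)   # a valid (positive semi-definite) Gram matrix
```

Each operation returns a new kernel function, so they can be nested arbitrarily; that compositionality is what turns this list of closure properties into a modeling language.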
Now I just want to give you an intuitive feeling for what these individual operations do. Let's first look at scaling the output, so scaling the kernel itself. What that amounts to is scaling this dimension of the process, if you like, the output dimension. Why? Here's our kernel; for all of these plots I'm using the Gaussian kernel, this smooth one given by this expression, but of course you could use any other kernel. Remember that the variance of this Gaussian process, these thin horizontal lines, the error bars, is given by the diagonal elements of the kernel Gram matrix, which are of course scaled by this number. So if we increase that number, the process becomes wider, scaled up in this dimension. This actually has a somewhat non-trivial effect on the posterior, which I'm plotting here on the right-hand side. If I go back and forth and you keep your attention focused on this part of the data, you see that the interpolant actually becomes steeper or less steep. The reason this happens is that this kernel Gram matrix also shows up in the covariance of the observations, where it gets added to the, in this case diagonal, covariance of the likelihood. By scaling up the prior, we are essentially telling the model to rely more on the data and less on the prior, so it adapts a little more aggressively to the data. So, right, we can scale the output, and that essentially scales the output variance and makes the model a little more flexible inside the data. The next operation I introduced was that we can transform the inputs of a kernel: we can map the inputs through some feature function. A particularly simple feature function is a linear rescaling, just taking the input and multiplying it by a number. Here I've multiplied all the inputs by one over five, so I've divided by five. What that does is take the inputs and effectively stretch them out by a factor of five, and the corresponding process is much smoother: the samples now oscillate on a scale that is stretched by a factor of five. That is also visible in the interpolant, which becomes a much smoother function. You can also do this the other way round: multiply the inputs by one over 0.5, so effectively by two, and that gives a process that is much, much rougher. It oscillates on a shorter length scale, and you can see what this does to our posterior distribution: it becomes more flexible inside the data domain and more conservative in extrapolation. But remember that, more generally, we didn't just say we are allowed to multiply the inputs by some number. We are allowed to map the inputs through an arbitrary feature function, and it doesn't have to be a linear one. So, for example, here I'm mapping the inputs through a cubic function before applying the kernel. What this means is that the process over here becomes much more flexible, because this is a region where the cubic feature is large, so locally the inputs get scrunched together. On the left-hand side this function is much, much flatter, so the process becomes more regular there as well. This is a way to produce flexibility in your regression model in very specific locations and remove it in other locations. So here we have a model that extrapolates very smoothly on the left-hand side and very conservatively on the right-hand side. I also said that we can add two kernels together to get a new one. For this particular plot, I've added two kernels. One is this Gaussian kernel that we've seen in the previous plots. The other one is a quite simple kind of kernel that is just an inner product of features, the kind of kernel we used in lecture seven and at the very beginning of this lecture, one that isn't an infinite sum but just a finite sum. Of course, that is also a kernel, because it fulfills all the properties of the definition of a kernel. The resulting model locally produces the smooth flexibility of the Gaussian kernel, but globally it adds the quadratic interpolation ability of this parametric model. You can see what this does to your model: it allows you to create global structure, to learn that this function might have some kind of bowl-shaped structure, and to add more aggressive extrapolation abilities. And finally, we can multiply two kernels together to get a new kernel. Notice that plus and times act a bit like a logical OR and AND on these interactions. The kernel defines the covariance, right, so in some sense the relationship between two variables in their marginal distribution.
So if the kernel is zero, that means the two variables are marginally independent, and if the kernel has a large value, then they are strongly dependent. Now, the sum of a large and a small number is a large number, but the product of a large and a small number is a small number. So the product of two kernels gives rise to a covariance structure in which, if two points are independent of each other under either one of the kernels, they are independent under the combined model, while the sum of two kernels gives rise to a structure in which, if two points in the input domain are correlated with each other under either one of the kernels, then they are correlated under the sum of the two kernels. Here I've taken a sum, we just talked about this, and here is a product. Now, just for fun, I'm using a different kind of kernel: there is just a single feature function, this quadratic function, and I'm multiplying it with this Gaussian kernel. What this gives us is a stochastic process that has the overall shape of this feature function, so it only allows large function values here on the far right-hand side, but locally produces the smooth behavior of this Gaussian kernel. This has of course just been a rough, quick tour, and we'll do an entire lecture, two lectures after this one, on how to use this kind of structure to create specific models for very specific applications. What we've already seen, at a high level, is that the space of Gaussian process models does not consist of isolated kernels that you have to construct by hand and find laboriously. No, you can take these individual starting-point kernels and create new kernels out of them by applying quite powerful analytic transformations that map from one kernel to another. You can scale the output, you can scale the input, you can add kernels, and you can multiply kernels with each other, to arrive at new Gaussian process models. In many ways this is great, because it means that the space of Gaussian processes is very large: you can build complicated models out of these kernels and use specific kernels to encode certain kinds of structure, like certain smoothness, certain length scales, certain output scales, and thereby calibrate our uncertainty. But of course it is also a challenge, because, just as in deep learning, or in feature learning, representation learning, we now have to make these decisions. What began as an exercise to get rid of all these parameters in our model has actually just introduced new parameters. If you know certain aspects of your data set, then you can use this modeling language to reflect them, to include these aspects in your model. But if you don't know them, then unfortunately you have to figure out a way to set them. Thankfully, and this is the final point I want to briefly make on the side before we talk in detail about how to do this in practice later, we can of course use the same ideas that we used in the last lecture to learn features in order to now learn kernels. Here is a slide that we essentially already used in the previous lecture. Remember that when we do Gaussian regression, we can construct not just a posterior over the function given some data and a bunch of features or feature parameters; we can also compute the model evidence, this term down here. And that model evidence, in the feature formulation at least, is itself of Gaussian form: it is a number that can be computed by evaluating a Gaussian PDF with a bunch of means and covariances, and those means and covariances depend on the features and therefore on their parameters. In the last lecture we spoke about how to use this quantity to learn representations, by maximizing this expression as a function of theta. Now notice that these expressions also contain exactly the quantities that we've been using to abstract away to non-parametric models: here is a mean function and here is a kernel function. So just as in the previous lecture, where we maximized this expression to learn how to set the parameters of individual features, we can use the exact same framework to learn the parameters that affect the entire population of infinitely many features. There's no more time today, or at least in this lecture, depending on when you watch it on YouTube, to show you how to do this in practice. I just wanted to point out that it is of course possible, and we will come back to it in a later lecture, when we do a concrete example that shows how to construct a Gaussian process model for a very specific application, to extract a specific set of information, even scientific facts, from real-world data.
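So that the idea doesn't remain entirely abstract, here is a hedged sketch of what evidence maximization can look like for a Gaussian process (illustrative code and parameter names of my own; in practice one would typically optimize the log evidence with gradients rather than the simple grid search shown here).

```python
import numpy as np

def rbf(a, b, ell, amp):
    return amp**2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell)**2)

def log_evidence(X, y, ell, amp, sigma=0.1):
    # Gaussian log marginal likelihood log p(y | X, ell, amp) for a zero-mean GP prior.
    K = rbf(X, X, ell, amp) + sigma**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                      # data-fit term
            - np.sum(np.log(np.diag(L)))          # complexity penalty: 0.5 * log det K
            - 0.5 * len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(20)

# Type-II maximum likelihood over the length scale, by brute force for illustration.
ells = np.linspace(0.05, 1.0, 20)
best_ell = max(ells, key=lambda ell: log_evidence(X, y, ell, amp=1.0))
```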
For today, we're at the end. We saw that there is another way to make parametric Gaussian regression models more powerful: not by tuning the individual features, which leads to the powerful feature parameterization language called deep learning, but instead by increasing the number of features, not in depth but in width, if you like, towards infinity, in the hope of creating a model that has infinitely many degrees of freedom. We saw that this is in fact possible, and that there were maybe two prices we had to pay for it. One was that we had to make very specific choices about which features we are considering and how they relate to each other in terms of structure: we had to distribute them in a certain regular fashion. The other price was that we had to reduce the variance of the individual features in this infinite set towards zero, otherwise we would have gotten an ill-specified model. Nevertheless, it is possible to do that, and it gives rise to a new class of models which we call Gaussian process models. These are also called non-parametric models because they hide the parameters, or, if you like, they have infinitely many parameters which are not an explicit part of the model; they are hidden inside the kernel, which keeps track of infinitely many features at the same time. Maybe a third, computational, price we have to pay is that there is now no weight space anymore, at least not a weight space of finite size that might even be smaller than the number of data points. We therefore have to do the inference in function-value space, which always requires the inversion of a matrix of size number of data points by number of data points, so it is cubic in the number of data points, at least in principle.
And even though we did the initial construction in terms of very specific families of features, we then realized that there is actually an entire space of such models, a very large and expressive space in fact, spanned by the various operations you can do on positive definite matrices to get back other positive definite matrices. We can take a kernel and multiply it with a positive number; in the corresponding Gaussian process model this corresponds to scaling the prior uncertainty of the model in the output space. We can take a kernel and, at first, just scale the inputs, which corresponds to rescaling the input space; but we realized that we don't necessarily have to use a linear rescaling here. We can actually apply an arbitrary transformation of the input domain, including a nonlinear one, to get a nonlinear rescaling of the input space, and still remain within the Gaussian process formal language. We can add two kernels to each other, and we can multiply them together, to get new kernels. A model that arises from the sum of two kernels implies that if two input locations are correlated under one of these models, then they are correlated under the sum of the models. The product is in some sense the opposite: it says that if two points are independent of each other under one of the models, so if that kernel is zero for them, then they are independent under the product model. Using this language we can create a very powerful formalism, a large class of interesting regression algorithms which, once the kernel is specified, perform regression in, well, cubic time in the number of data points, but using only linear algebra, which is still a very powerful framework. It is maybe a bit of a downside of this flexibility that even if you are not using it to construct your model and encode prior information, you still have to accept the fact that you could of course do so. By not doing it, you are implicitly still encoding a certain kind of prior information that you might not actually believe in. To address this issue, at least to some degree, we can use the same framework, type-II maximum likelihood, that we introduced in the previous lecture for parametric models, to learn the parameters of kernel models, of Gaussian process regression algorithms, and to tune these additional aspects of the model to the data as it arises. Gaussian process models are one of the most important parts of the modeling language of probabilistic machine learning, and so they are going to get their own place in our toolbox. I might have also written Gaussian processes here, but that might be confused with the Gaussian distributions, which is why I wrote kernel models here, kernels. They are so important, in fact, that we are going to spend several lectures talking about what to do with them. There will be one lecture that is more hands-on, thinking about how to use them on a concrete data set. There will be several lectures on how to extend them to data that doesn't quite fit into the Gaussian process language. And in fact, Gaussian process models are also one point where the connection to the statistical formalism for machine learning is particularly close, so I am going to use one more lecture to talk a little bit about the theoretical aspects of these kernel models and how they relate to the other way of thinking theoretically about machine learning, which is statistical machine learning. All of this, though, is going to happen in later lectures.
I hope that you enjoyed this lecture and I'm hoping to see you again in the next one. Thank you for your time.