So, let me redo in math what we did last lecture and see if this helps clear up some confusion. We started to look again at the parametric regression setup that we constructed coming in from exponential families, understanding that there is a notion of a conjugate prior that is particularly elegant with Gaussians, because the necessary computations are just linear algebra. And we saw that you can construct priors on function spaces, or rather on parametrized functions, by assuming that a function can be written through some mapping with feature functions, assigning a Gaussian distribution to the weights, and then marginalizing over the weights, using the fact that Gaussians are closed under linear maps: if you have a Gaussian random variable and apply a linear or affine map to it, the result is again a Gaussian random variable. Notice that this gives a distribution over values of the function at the input, dot or bullet, and the corresponding objects in here are these. So this implies a piece of computer code that predicts the value of the function at an arbitrary collection of points, in particular at individual locations, but also at linspaces, grids, or collections of evaluation points for the function, and that prediction is always a Gaussian distribution. It has a mean, and that mean is a vector we get by taking the mean vector of the weights and applying from the left the feature map evaluated at all the input points on our mesh; stacking those evaluations gives us a matrix, and we apply it from the left to the mean vector. And there is a covariance matrix: if you have n input points in X, that is an n-by-n matrix, because we have to construct a joint distribution over n locations. That matrix has this particular form: we take the covariance matrix over the weights, Sigma, which is of size number of features by number of features, or number of weights by number of weights, let's say F by F, and we apply from the left and the right this feature matrix, which is of size n by F. When we do this kind of inner product we get an n-by-n matrix out, and then we add the observation noise, because that is just part of the linear map. Yes? So the question is whether this is a marginal distribution. That's true, we are marginalizing out the weights, we're doing an integral here, and you could call this a prior, and there's a likelihood, by the way, not a posterior. If you marginalize over this, you get a marginal distribution over the function values. Now the question is: why should we do it like this? There are, I think, two questions going around. One is: what kind of object is this? Well, this is a prior itself, but it's a prior over the function now, because we've marginalized out the weights, so it's a different object; we've mapped it into function space in some sense. The other question is: why should we choose this particular set of features, why not something else? We'll talk about that today.
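Concretely, the marginal described here can be sketched in a few lines of numpy. The names phi, mu, Sigma, and sigma are the objects from the slides; the orientation of the feature matrix (n by F) is my assumption, not quoted from the lecture code:

```python
import numpy as np

def prior_marginal(phi, mu, Sigma, X, sigma=0.1):
    """Marginal Gaussian over function values at the points X, given
    Gaussian weights N(mu, Sigma) on a feature map phi (a sketch)."""
    PhiX = phi(X)                    # feature matrix, assumed shape (n, F)
    m = PhiX @ mu                    # mean vector, shape (n,)
    K = PhiX @ Sigma @ PhiX.T + sigma**2 * np.eye(len(X))  # (n, n) covariance
    return m, K
```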
Yes, okay, so this is a good question. There are these two dots; one is filled in, that's a bullet, and the other is a circle that's not filled in, and there's a question about what this notation actually means. This is actually the reason why I use these symbols, because it is a little bit tricky, and it has to do with broadcasting in array-centric programming. On the one hand, if you think of the dot as an individual number, an individual location, a point in X, then we can evaluate this function k for arbitrary pairs of inputs, and by doing that we construct a covariance between the function value at one input location and the other. To stress the fact that this function k has two inputs that don't need to be the same, I'm using these two symbols, filled circle and empty circle. But if you think of the circle as an array of input locations that our code will natively broadcast over, then the circle and the bullet are kind of the same thing: they're just two sets of inputs, or the same set of inputs, and then we get a covariance over arbitrary function values. So from a function perspective, when we implement this function k, it is a bivariate function that takes two inputs. If we think about the distribution of function values, then the function values live at collections of inputs, array-style objects, and then we will always broadcast over this function k and get big matrices that contain evaluations of this bivariate function at various pairs of points. Understanding this point is maybe the trickiest bit of all of this; that's also why these functions have a big name, they are called the kernel, which already stresses that they are very central to this concept. So we saw last week that the one thing we need from this kernel, the one observation, is: really we only need this m and this k object, right? The features can be hidden inside; they can be abstracted away. The m object, we noticed, just needs to be a function; we just need to be able to evaluate it wherever we want. That's the only thing that actually matters, so it's pretty flexible, we can put in whatever we like. But for this k object, the kernel, there is a constraint, because we need to make sure that the result is always a valid probability distribution, and you'll remember that mean vectors of Gaussians are just arbitrary vectors, but covariance matrices aren't just any matrices: they are symmetric positive definite, or actually positive definite, and maybe at some point in the tutorial we can talk about why the symmetry doesn't actually matter so much. Positive definite means, and here is the definition again at the bottom of this slide: for arbitrary vectors v, if you take such a matrix, in particular also this kernel matrix, and apply v from the left and the right, the number that comes out always has to be, well, for definite it has to be larger than zero, and for semi-definite it has to be non-negative, so larger than or equal to zero, and we are actually fine with larger than or equal to zero. So the question was: how can we make sure that such functions always have this property? How can we construct functions k such that they are always positive definite? One way to get to such a construction is what I did last Thursday.
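A small numerical illustration of both points, broadcasting the bivariate k over two input arrays and then checking the v-transpose-K-v definition; the stand-in kernel and the tolerance here are my own choices, not the lecture code:

```python
import numpy as np

def gram(k, A, B):
    """Broadcast a bivariate kernel over two sets of inputs ("bullet" and
    "circle"); returns a matrix of shape (len(A), len(B))."""
    return k(A[:, None], B[None, :])

k = lambda a, b: np.exp(-(a - b) ** 2 / 2.0)   # stand-in kernel for the demo
X = np.linspace(-3, 3, 50)
K = gram(k, X, X)

print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)   # eigenvalues non-negative
v = np.random.randn(50)
print(v @ K @ v >= 0.0)                           # v^T K v >= 0 for any v
```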
I wrote down a bunch of Gaussian features, little bell-shaped, square exponential features, and then we observed that if we increase the number of these features more and more, while also reducing their height proportionately to the number, actually to the square root of the number, then we get a converging sum, a series, an infinite sum that actually converges to a finite object, a function-valued object: not a finite number, but a finite function-valued object k, which we can evaluate at arbitrary points x_i and x_j, such that matrices constructed from such pairs, or broadcast out from such collections, are always positive definite. In some sense they mask an infinite sum over infinitely many features: the parameters are gone, the weights are gone, and we have an infinitely wide neural network that we're working with. This is also why such models are called non-parametric: the parameters are infinite, there are no more parameters to explicitly talk about. I also pointed out last Thursday that this construction, which may seem quite ad hoc, is actually sufficient in the sense that any function that is positive definite, so any function that follows the definition on the previous slide, can always be written as some kind of expansion, in terms of either a sum or an integral over either finitely or infinitely many such feature functions. In fact, every function that can be written like this is also positive definite; that direction is easy to see, the proof is even on this slide. The other way round is not easy to see at all, and it requires functional analysis. So this particular construction worked, but of course it just gave us one kernel; today we have to think about how many other kernels it can give us. But before we do that, let me first recap a little bit more. We gave a name to the object that arises: we said that if we do things this way, so if we encapsulate the computation of the covariance inside a function, so that our Gaussian object only calls that function once it's needed, we call this object a Gaussian process. "Process" is a word from statistics that always implies the construction of a potentially infinitely large collection of random variables, which is described by the properties of any finite restriction of this infinite collection. That is exactly what we have here: by definition, such a Gaussian process is, basically as a computer program, a probability distribution over a function, such that every finite restriction to function values has a valid Gaussian form. Yes, they are also stochastic processes. They are not point processes. There are other types of stochastic processes; a stochastic process is really just a collection of random variables such that any finite restriction of those random variables has some property, and in this case the property is that it's Gaussian distributed. That's why it's called a Gaussian process. We will see other stochastic processes later on, for example Markov processes. By the way, someone already used the word Markov process somewhere, I don't know, in the feedback or in the forum; this is not a Markov process. There will be Markov processes later in this course, but that is another thing, it's completely separate.
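If you want to see last Thursday's limit argument numerically, here is a hedged sketch: a fine grid of Gaussian bump features, with the sum scaled by the grid spacing, lands on a square exponential kernel. The closed-form constant lambda*sqrt(pi) and the resulting length scale follow from doing the Gaussian integral by hand; they are not quoted from the slides.

```python
import numpy as np

lam = 1.0                        # width of each bump feature
c = np.linspace(-20, 20, 4001)   # feature centres; dc plays the role of 1/F
dc = c[1] - c[0]

def phi(x):
    # feature vector at a single input x: one Gaussian bump per centre
    return np.exp(-(x - c) ** 2 / (2 * lam ** 2))

x1, x2 = 0.3, 1.1
k_sum = dc * phi(x1) @ phi(x2)                                   # finite-feature sum
k_lim = lam * np.sqrt(np.pi) * np.exp(-(x1 - x2) ** 2 / (4 * lam ** 2))  # integral limit
print(k_sum, k_lim)              # the two numbers agree to several digits
```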
So the one thing I didn't do on Thursday, because I didn't have time for it, is to talk about what to do with this. This is just a prior, right? It's just defining a distribution over function values. What we would like to do is machine learning: we'd like to give it some data, and then it should return a posterior distribution over function values. So, how do we do that? One straightforward way is to go back to the parametric setting, the setting with a bunch of features, and think again about what we actually did there. Here is the slide that I copied over from a previous lecture, just to remind you: if we assume that the function has this parametric form, with a Gaussian prior over the weights and a Gaussian likelihood, then the posterior over the weights is a Gaussian built from a bunch of linear algebra objects, and the associated posterior over the function values we can get through this detour through weight space. We can first think about the weights, finitely many weights, the weight vector of our finitely wide, shallow neural network with quadratic output loss, compute the posterior in weight space, and project back into function space, again using the fact that Gaussians are closed under affine maps. When we do that, we get a Gaussian posterior over the function values, which looks like this, and I'll clean this up a little bit: I'll take this equation and move it to the top. Now we can see that the posterior over the function values has a mean and a covariance, which are again basically functions of the inputs, and I've highlighted with the bullet and the circle where those inputs go. We can notice again that the main observation here is that there is never a lonely phi in these expressions: you'll never find a feature vector phi that isn't multiplied from the left or the right with something else. So we don't actually need to write down phi as such; we just need the objects that it gets multiplied with. In particular, we have one instance where phi is multiplied from the right, over here and over there, with the mean vector; we already know that we call this the mean function and that it's a bit boring. The interesting objects are these here and here, and also here, which are inner products between the features, our covariance matrix, and the other features. So again it's a big array multiplied with an array multiplied with an array, and that's just an instance of evaluating our kernel at other pairs of inputs. So if we have the kernel available, we can fill out these individual objects and write them like this.
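Written out, the step just described is the standard Gaussian conditioning result pushed through the affine map f = Phi w. In the notation of the slides, with Phi_X of size n by F, this is the textbook form (written from the derivation, not copied off the slide):

```latex
% weight-space posterior (standard Gaussian conditioning):
p(w \mid y) = \mathcal{N}\!\big(w;\; \mu + \Sigma\Phi_X^\top(\Phi_X\Sigma\Phi_X^\top + \sigma^2 I)^{-1}(y - \Phi_X\mu),\;
              \Sigma - \Sigma\Phi_X^\top(\Phi_X\Sigma\Phi_X^\top + \sigma^2 I)^{-1}\Phi_X\Sigma\big)

% pushed through f = \Phi w, with m = \Phi\mu and k = \Phi\Sigma\Phi^\top:
p(f_\bullet \mid y) = \mathcal{N}\!\big(f_\bullet;\; m_\bullet + k_{\bullet X}(k_{XX} + \sigma^2 I)^{-1}(y - m_X),\;
              k_{\bullet\circ} - k_{\bullet X}(k_{XX} + \sigma^2 I)^{-1}k_{X\circ}\big)
```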
So our posterior over the function values, which I'm now going to call a Gaussian process, because it's really a function that we can call, has a mean function. Why a function? Because it has these inputs, dot, and that makes it a function. And what does that function do? Well, it evaluates the prior mean, and then it takes the data and subtracts the predicted mean. So here's our story again: what did we observe, and what did we expect it to be? We expected it to be m_X, the mean evaluated at the input locations, at the training set. We construct the difference between the two; that's a kind of surprise, a residual: just the difference between what we thought it would be and what we got to see it be. And then we divide by, or rather we multiply from the left with, the inverse of this covariance matrix, which gives us a measure of surprise: the covariance matrix tells us how far we expected things to be from the mean, so if we standardize by it, we get some kind of standardized distance. How far is it from what we thought it would be, on the scale of what we thought it would be? This object is a vector, because it's a vector multiplied by a matrix; we call it alpha, and its entries are sometimes called the representer weights. Then we multiply it from the left with the kernel, which computes the covariance between the observations at the training inputs and any arbitrary input we would like, the bullet. And the posterior covariance is the prior covariance, the kernel, which is a function, minus an object that again looks a lot like what we've already seen: we take this matrix from the mean, its inverse, or maybe a Cholesky decomposition of it or some other matrix decomposition, a matrix square root, and multiply that from the left and the right with our kernel function. There are two things to notice here. By the way, we can rewrite this a little more compactly, like this; this is already the computational view. The first important thing is that we can do all of this in the data space, if you like. The computations we need to do are: take the data, construct this vector alpha, that's a finite vector, and construct this matrix, the Gram matrix, and its inverse, or its Cholesky decomposition, this L L-transpose. That's a finite, polynomial-time computation. Only then do we multiply from the left and right with these function-valued objects, the ones with a bullet or a circle inside, and that's the lazy evaluation part. So we can do the training of this shallow neural network in finite time, and then afterwards, only afterwards, start to think about where it should predict. That's a very useful property to have; there's some kind of lazy evaluation happening here. The other thing to notice is that this is, just by construction, another Gaussian process. Why? Because it's a Gaussian distribution over function values which has a mean function and a covariance function that we can instantiate. This actually shouldn't be surprising, because we know that Gaussians are an exponential family, and we have a likelihood that is also Gaussian, so the posterior is just the conjugate prior story again, over and over, but now we see it in a very concrete form.
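As a minimal sketch of this "finite training, lazy prediction" split, assuming one-dimensional inputs, a prior mean function m, and a kernel k that broadcasts elementwise over its two arguments (the function names here are mine, not the lecture code's):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def condition(m, k, X, y, sigma):
    """Finite part: Cholesky of the Gram matrix and the representer weights.
    Returns lazy posterior mean and covariance functions (a sketch)."""
    K = k(X[:, None], X[None, :]) + sigma**2 * np.eye(len(X))
    L = cho_factor(K, lower=True)          # K = L L^T, computed once
    alpha = cho_solve(L, y - m(X))         # representer weights

    def post_mean(x):                      # lazy: only called when predictions are asked for
        return m(x) + k(x[:, None], X[None, :]) @ alpha

    def post_cov(a, b):
        kaX = k(a[:, None], X[None, :])
        kXb = k(X[:, None], b[None, :])
        return k(a[:, None], b[None, :]) - kaX @ cho_solve(L, kXb)

    return post_mean, post_cov
```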
Yes? So, just to make sure I understand: the question is where we get these circles, the white and black dots, from. Okay, so maybe, in an abstract sense, this is a kind of confusion about functional programming: where do we get the input for the function from? Well, in a sense, we don't actually think about it yet; we're just constructing a function, and the inputs will come later. But if you are confused by this white dot being white, maybe it's easier for you to just think of a black dot every time I write a white dot, and then you're thinking in broadcast matrices, which is maybe actually the stronger view. If in your head you're already doing broadcasting, that's good. If in your head you want to think of a function that takes two inputs, where each input is of type X, then the white dot and the black dot make more sense. So how do we actually implement something like this? Well, it's just another Gaussian process, right? The prior we constructed had the form that I showed you last week: it's a Gaussian process. Let me zoom in a bit more; can you read it like this in the back as well? So priors are this new class called GaussianProcess, which is parameterized by two functions, mean and covariance, and it has this __call__ method, which does the lazy evaluation: it instantiates the actual evaluation, it triggers evaluating the functions. When we condition, we hand this thing a data set X and y and a likelihood, so we need to say what the sigma is, the noise on the evaluations, and it's supposed to return an object that is itself a Gaussian process. That means we're going to use inheritance: the thing we are constructing is a posterior Gaussian process, and posterior Gaussian processes are also Gaussian processes, so they reuse a lot of the stuff that is already implemented for Gaussian processes. And remember that Gaussian processes use Gaussians as their internal object, so we inherit all the tricks about sampling and the compositions of covariance matrices and so on directly from our Gaussian base class. So how does this conditioned Gaussian process look? It's implemented down here. Here is the constructor; it inherits the property that it's a Gaussian process. This is proper object-oriented programming; that's why you learned all of this in your second semester. We instantiate this class by handing it the prior; this thing has a field called prior, and the prior gives us the mean function and the covariance function, m and k, and we just take those over, so we already have mean and covariance functions because they are stored in the prior. Then it gets a data set, y and X, for which we just make sure they have the right shape; that's just asserting, making things the right way. We store the noise of the likelihood, and we must also make sure we initialize all the super classes, in particular the GaussianProcess and also the Gaussian class. And now we implement the two important things here, the posterior mean and the posterior covariance, and those are the only bits of code that we'll look at today; the rest is just plumbing, basically. So a posterior Gaussian process is a Gaussian process that has a different type of mean function, for which we start with the prior mean, which is this function-valued, non-parametric object, self.prior(x).mu, and notice that when we call prior(x), that instantiates evaluating this function.
So the moment someone hands you an X, this thing inside will call the prior at X; the prior at X will call the Gaussian base class and construct the Gaussian distribution with mean vectors and covariances, and suddenly we have finite objects and everything is tractable. Then it constructs this covariance matrix, the covariance between the black circle, which is this X, and the training data; the training data is stored inside of this trained shallow neural network. This indexing here just makes sure that this thing knows that it should broadcast: we will construct a matrix of size number of input points times number of training data points. Then we multiply with this thing called the representer weights, the linear algebra object called alpha from the previous slide. It got constructed up there; you can actually see what happens: we first construct the Cholesky decomposition of the covariance, and then we Cholesky-solve against the data minus the prior mean, minus whatever shift the likelihood had; that last part is just for generality. And what's the posterior covariance? A similar thing: it's the prior covariance, which is instantiated the moment we hand it an a and a b, it becomes a function, minus the covariance between the input points, where a is the black circle, and the training data. So now we have an array of size whatever a is, times the number of training points, and then a Cholesky solve, an internal computation with that matrix, times the prior kernel evaluated at the training data and the white circle. The white circle is called b here; so b is the white circle, a is the black circle, and that's it. Now we've defined our Gaussian process, and what we can now do is something like this: we can load a bunch of data points and define the objects we need for our Gaussian process. I've already done this last Thursday, so I've simplified things a bit now. We need a prior mean, which could be zero, but for generality I'm assuming it's a constant mean, and if I don't give it a constant, it's just zero; so it's a function that is just a straight line at height c, where c defaults to zero. And then I need this kernel, which is, by the construction from last Thursday, a function that takes two inputs, x and y, black circle and white circle, evaluates the square distance between x and y scaled by the length scale l, sums out the batch dimension, the last one, divides by two, puts a minus in front and an exponential around it. So it's just a Gaussian function: exp of minus the square distance divided by two, scaled by the length scale. Those are the two objects we need. Have I already run this? No, I haven't. Okay, now it's run, and now we construct; this is the actual piece of inference. We say which kernel we actually want to use; I'm just going to wrap this: it's this kernel, where we set the length scale to one. So this is currying, it's functional programming: we just fix the length scale argument so that we can hand the kernel to the Gaussian process. Yes, you can also write this kernel as a lambda: lambda a, b: se_kernel(a, b, l=1.0). Then I instantiate a prior, say this is a Gaussian process which uses the constant mean and this kernel, and then I say: please condition on this data set, which is observations y at locations x with noise sigma. Bam, done.
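A hedged stand-in for the two objects and the conditioning call as just described; the GaussianProcess class and its condition method are the lecture's API as narrated, so the exact signatures in the commented lines are assumptions:

```python
import functools
import numpy as np

def constant_mean(x, c=0.0):
    # a straight line at height c; zero if no c is given
    return np.full(len(x), c)

def se_kernel(a, b, l=1.0):
    # square exponential kernel: exp(-||a - b||^2 / (2 l^2)); the squared
    # distance is summed over the last (batch/feature) axis, so inputs are
    # expected to carry an explicit trailing feature dimension
    sq = np.sum(((np.asarray(a) - np.asarray(b)) / l) ** 2, axis=-1)
    return np.exp(-0.5 * sq)

k = functools.partial(se_kernel, l=1.0)       # currying: fix the length scale

# prior = GaussianProcess(constant_mean, k)   # class as described in the lecture notebook
# posterior = prior.condition(X, y, sigma)    # signature assumed, not verified
```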
Actually, not much has happened yet, nothing in fact; we've just constructed an abstract object which hasn't done any computations yet. That's why it is so fast: we just set up all the structures, there's been no training yet. Why? Because I haven't asked it to do anything. I'm also going to define a plot function, which is just, you know, nasty matplotlib; you can look at this later, but it has nothing to do with the math we're doing here, it's just a bit of fiddling around to make a nice plot. And then I can say: please plot the prior and the posterior, and here it is. The moment I call this function, it actually does something: it tells our objects that I want a prediction at all these points, and then it instantiates all these objects, constructs a Cholesky decomposition over the training data, stores it somewhere, and so on, and gives us this prediction. In the back, in black, you see the prior distribution, which comes from this Gaussian process, and in red you see the posterior that arises from this data set. Yes? So, wonderful question: how do we change this shape? Up until now it seems like this Gaussian kernel is just what we need to use, because that's the only thing we've been able to do so far, right? We had this one construction with these Gaussian features, the little bell-shaped features, and then we took the limit of ever more of these Gaussian features, and that's it, so it has to be this kernel. Maybe the one thing we could actually change is the length scale; let's set it to something else, say two, so those are slower, broader Gaussian features, and then this looks like this: it does more smoothing. Hmm, a bit boring. Or maybe we could set the length scale to 0.5, and then this looks like this; now it's a little more sensitive, it changes more rapidly. But somehow this is underwhelming, right? You've learned about deep learning and all these features and ReLUs and complicated setups; this seems like it's encapsulating a little bit too much. Now we're just stuck with this one kernel. Are there any other kernels? Is there anything else we could do? And yes, there is. So what we're going to do now is just a few more of these constructions, for a few minutes. Here's another one, which is pretty much exactly the same construction we did last Thursday, but with a different collection of feature functions. Let me maybe motivate this a bit. Whoops, sorry, wrong slides. So in lecture six, you will remember that I had various constructions for Gaussian priors over function values, and one that we used was the one with what I called switch features. What you see here, I need to zoom out a little bit: here in green you see 17 little step functions that start, as you can see in the code up there, from minus 8 to plus 8; there's a linspace over 17 of these little steps. They are zero all the way up to wherever their grid point is, and then they become one. So they are zero all the way from the left, and then they become one. The math way of writing down these features is this: feature i at x is the Heaviside step function (Heaviside with an i, not a y), which has the property that it's one whenever the input is larger than zero and zero everywhere else. That's these step functions.
They are just zero, and then at that point they become one, non-differentiably, just a big step. And we're going to use the same prior covariance over the features that I've used before, with a Kronecker delta, delta_ij, so the covariance between the features is just diagonal, to make things easy. That means that the inner product we need inside the kernel can be written as this single sum, and here I've already simplified something. So here's the important thing to think about: if you have one feature which is zero whenever x_i is less than c_l, and another feature which is zero whenever x_j is less than c_l, then in this inner product we have a product between two functions that are each either zero or one. These features run along at zero and then jump up to one, up here, where this is c_l and this is the x space. So if I multiply two of these, if I multiply theta(x_i minus c_l) with theta(x_j minus c_l), then what is this function? It's either zero or one, because both factors are either zero or one. When is it zero? It's zero if one of them is zero, and it's one if both of them are one. This is an AND function; that's what it is. It becomes one if both are one, and it's zero all the other times. So that means that in this term in the sum we get basically a single step function that becomes one if both inputs are larger than c_l, which in particular means that the minimum of the two inputs is larger than c_l. That's the tricky part of this derivation; if you get this bit, then the rest is easier. So if I multiply two of these... maybe a way to think about it is a second feature that comes in from the left and whose location is, well, actually no, that's confusing, I shouldn't write it like this, because then this is in feature space rather than in x space. So maybe it's just easier to think of this as a function of x_i and x_j: for this to be one, both terms have to be one, so the minimum of the two inputs has to be larger than c_l. Okay. Now we do the same hand-wavy thing again that I did last week, where I say: let's just assume that we have more and more of these features, and they are on a regular grid, which is a linspace in numpy, from c_max down to c_0; I'm calling it c_0 because it's going to have an interesting property in a moment. And I increase the number of features.
In this construction, I basically raise this slider, so I put more and more of these features, and you can see that there are more and more of them, on a regular grid. But because we are scaling by the number of features, the gray thing in the background doesn't become broader; there are just more and more of these features, the resolution increases, if you like, and nothing else. Asymptotically, this thing becomes an integral in the Riemann sense, where in each infinitesimal delta-c box the value of c is constant, because the box is infinitesimally wide, and we are summing over the values in the sum, and the sum becomes an integral. That's the other bit where you have to squint your eyes a little; after that it's mechanical, the rest is just stuff you did in your first undergraduate year of math. So now it's an integral over the step function: we integrate something from c_0 to c_max which is either 0 or 1. So when is it 1 and when is it 0? Here you have to bend your mind a little, but it's really not complicated. As c increases, for a while we keep seeing ones, because the minimum of x_i and x_j is larger than c, and then at some point c becomes larger than the minimum of x_i and x_j, and then this thing becomes 0. So our integral runs from c_0 until the minimum of x_i and x_j, and after that there is nothing more to integrate; and integrating the constant-one function is easy, the antiderivative is just c, and we evaluate it at the minimum and at c_0, and that's it. This is our kernel. What does this thing actually look like in code? I've implemented it here; this is the kernel we're going to use, and I'm going to tell you in a moment why it's called Wiener. It has two inputs, x_i and x_j, and it computes the minimum of x_i and x_j, subtracts a shift, which I call c_0 on the slides, and then puts a maximum around the whole thing, just for continuity's sake, so that this also works for multi-dimensional inputs.
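A hedged sketch of this kernel; the clamp at zero is my reading of the "maximum" mentioned here, and the default c0 = -8 is just the value used in the plots:

```python
import numpy as np

def wiener(xi, xj, c0=-8.0):
    # k(xi, xj) = min(xi, xj) - c0, clamped at zero so nothing happens left of c0
    return np.maximum(np.minimum(xi, xj) - c0, 0.0)

# On the diagonal k(x, x) = x - c0, so the prior standard deviation grows like
# sqrt(x - c0): the "increasing width" of the prior plots, i.e. Brownian motion.
```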
If we use this kernel, and now I've used a lambda notation instead of functools.partial, and do the same thing again, construct a Gaussian prior, condition on the data, call the posterior, we get this output. The prior in the background looks like this thing that increases in width, and it has these really wiggly functions inside, and the posterior looks like this. And it extrapolates as a constant on the right. Interesting; maybe this is a property you might want sometimes. It extrapolates, in some sense, much better than the one we had before, because it doesn't go back down to zero; it just stays wherever it went. And it has this starting point, which I set to minus eight because that puts it at the left end of the plot. We could also set it to minus five, and then we get this plot: whoo, there's nothing over here. We could set it to zero, and then there is nothing before zero. Why? Because the covariance is just zero, and if the covariance is zero, the function value has to be zero; there's nothing else to do. And this works because inside the Gaussian base class we use the sample function that uses the SVD, which doesn't complain about this; it just says, okay, there's nothing to sample from, so I'm going to sample zeros, done. Why is this an interesting process to think about? Here it is again, a plot of it. This is arguably the oldest Gaussian process, and it comes long before these little Gaussian bell-shaped features. Why? Because one way to think about what happens here is that we have these individual little step functions that get switched on. We start from the left end of the plot, and at some point, the point called c_0, the very first of these step functions switches on, and that step function is associated with a Gaussian weight, so it's like a stochastic, random Gaussian perturbation up or down. Our function will now take a step up or down by a value given by the scale of this feature times a Gaussian random variable. Then we take a step forward, by (c_max minus c_min) divided by the number of weights, and another step feature switches on, and it gives another Gaussian random perturbation up or down, and then another one, and another one. With every step on the x axis, our function gets a little kick up or down by a Gaussian random number, and in the limit of infinitesimal such steps we get the interesting behavior that the functions that come out of this are very irregular, because at every infinitesimal step there is an infinitesimal perturbation up or down. Overall, if you look at the math, the kernel, which is the covariance, is a linear function: if you set x_i equal to x_j, that's the diagonal of the covariance matrix, which is what we plot in these plots, actually the square root of it, the standard deviation. That diagonal grows linearly with x; there's just an x here, it's just x minus a constant. The standard deviation is the square root of that, so the standard deviation, as you can see, grows like a square root. The intuition for this: think of a particle that is moving through time, and at every time step it interacts with the other particles in the room, which give it a kick in a random direction. If this is one dimension, then the kick can only be up or down, so y is only one dimension, but it could be two or three dimensions as well; in this room there's a bunch of particles too, and they all move around, and approximately every infinitesimal time step they interact with another particle that bumps into them and gives them a random perturbation in some direction. And this is exactly what this guy used to construct a theory for the thermodynamics of free gases: Albert Einstein wrote a paper in 1905, and 1905, as you may know, is his annus mirabilis, in which he wrote three papers that were all Nobel-prize-worthy. This is one of them.
It's not the one he got the Nobel prize for, but he showed that the problem of so-called Brownian motion, the motion of free particles in a gas, can be described by thinking of infinitesimal steps in which the particle always gets a random Gaussian perturbation. He does a construction that is pretty much exactly what we just did, and then finds that the statistics of this path as a function of time t, so t is our x axis and his x axis is our y axis, grow with the square root of time. And that explains Brownian motion: you see particles move around in a stochastic, random-walk-type fashion, and as time moves on you see the particles move away from the origin, but at a distance that grows with the square root of time. So Brownian motion, or, for statisticians, the Wiener process, is the very first Gaussian process, and it was invented in 1905 by Albert Einstein, let's put it like that. It's a bit complicated, because many people had thought about this process before, and of course he also cites people in this paper, but pretty much that's the paper you should think of: 1905, "Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen". I'll do that as well, and then we take a break. About 85 years later, we get a wonderful contribution by this woman; she's called Grace Wahba. She had actually already started this work much earlier, during her PhD, with Kimeldorf, where they developed a statistical tool set out of these physical processes called Gaussian processes. And she wrote a book that grew out of her PhD work, I think it is called "Spline Models for Observational Data", in which she introduces another kernel; actually she introduces a whole lot of them, in a much more theoretical fashion than we do here, but we can think of it as a third possible construction of a kernel. We've had two so far: this square exponential kernel, very smooth, and this Wiener kernel, which is very, very rough. She introduces a third thing, which we can think of, from our perspective in 2023, as introducing ReLU features. She says: if you think of this step feature that we just looked at, you could integrate it from the left. And if you integrate a step function, you get a piecewise linear function that starts at zero; you integrate zero, zero, zero, up until the point where you get to the step, and then you get a linear function. We can do this in our code here as well; I call it relu, of course. Actually, I use these sort of symmetric ReLUs.
Let's make it simpler and use a one-sided ReLU that starts at one point, and then at every linspace step there is such a linear function coming up. This is a regular shallow neural network with ReLU features, if you like. If you prefer the physics interpretation, you can think of a particle with momentum in a gas: every infinitesimal time step our particle gets a kick from another particle, but because it has mass, if you leave it alone it just keeps drifting in that direction. That's why each individual kick gives a linear term, and we get more and more of these kicks, each of which moves the momentum of the particle around; they move the momentum, not the velocity and not the position of the particle directly. If we do this construction, then, okay, one observation is that this is like integrating the function, but we can keep our work simple and do the derivation just like we did on the previous slides. So we say we have a feature set again, and now the features are these ReLU features: they are not step functions anymore, they are ReLUs, so they are constant zero and then they become linear. Then we do the same thing over and over again: our covariance function, the kernel, will be of this type; that's what we're trying to compute. Again we assume that we scale by the width of the space and divide by the number of features. If we do this, we take an inner product over maxima, and now you have to think again: what does it take for this term to be non-zero? Well, both inputs have to be larger than c_l, and after that the function is linear. So a way to do this is to write it with a max and a min, and then multiply by the features, which become linear at the point where x_i and x_j are larger than c_l, and again take the limit of infinitesimal steps. Then the sum becomes an integral: we just replace c_l with c and integrate over c, and we notice that we have to do the integral from c_0 up to the minimum, just like on the previous slide. Everything is exactly the same; we're just integrating a different function now. It's not a constant function, it's a linear function, actually a product of two linear functions. The product of two linear functions is a quadratic function, and if you integrate a quadratic function, you get a cubic function, something raised to the power of three. If you do that, you get this kernel, which looks complicated, but it's something we can implement in a piece of code, and I've done it for you; actually not here, but in another example, which I call integrated Wiener, because that's what it is: the integrated Wiener process. It just does this; here is the complicated expression that you just saw on the slide. Let's call it integrated Wiener; Wiener, by the way, is named after Norbert Wiener. And it does this interesting thing that it produces a very smooth function that roughly interpolates between the data and extrapolates linearly upwards, not constant, but linear. Why is that? That's the last thing we'll do before the break. If you look at this expression, you see that it's a cubic function in x: this is clearly a polynomial in x, and there is a two here, and then another x, and then there is just a three here, so the whole thing is a cubic object in x.
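For reference, a sketch of that kernel in its standard integrated-Wiener / cubic-spline form; the 1/3 and 1/2 constants come from doing the integral, and the clamp at c0 mirrors the Wiener kernel above, so treat this as my reconstruction rather than the lecture's exact code:

```python
import numpy as np

def integrated_wiener(xi, xj, c0=-8.0):
    # k(xi, xj) = m^3 / 3 + |xi - xj| * m^2 / 2, with m = min(xi, xj) - c0
    m = np.maximum(np.minimum(xi, xj) - c0, 0.0)
    return m ** 3 / 3.0 + 0.5 * np.abs(xi - xj) * m ** 2
```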
Yes? Which one, this? That's just what the integral gives you, it just looks like this... ah, or this one? Okay, yeah, that's actually a bug; like this, okay. Sorry, good point. So it's just a third of the minimum cubed, plus one half times the absolute distance times the minimum squared. Now, if you think back to what the posterior actually is, where is it, here: you can see that the posterior mean is the prior mean plus a bunch of numbers that we compute with linear algebra, multiplied by the kernel. That means the posterior mean function is actually a sum over kernels: this thing here is a sum over the data, alpha_i times the kernel evaluated at x_i. Now, if each of these kernels is a cubic function, then the posterior mean is a sum over cubic functions, and what it's trying to do is interpolate between the data. So if you interpolate between data points with cubic polynomials, and you have n data points and n such cubic terms available, what is the interpolant, the sum over n cubic polynomials? It's a cubic spline: cubic functions that interpolate between the data. And that's why this is called a spline kernel. Splines used to be the things that technical draughtsmen used to draw curves, you know, Bézier curves and so on: little bendy metal rulers that they would put onto the drawing board and bend, and because a bent piece of metal takes a minimum-energy curve, called a spline, you could use it to draw a nice smooth line on a piece of paper. That's what these are; everything is connected. So now we can take a break; we're a little bit late for it, but let's do it anyway. I'll continue at 11:18. What we have now seen is that there isn't just one kernel. There isn't just the square exponential kernel; there's also the Wiener kernel, and there's also this cubic spline kernel. In a way this is maybe a relief, because this Gaussian process framework seemed very restrictive, right? But it's also a worry that you know from deep learning as well: now suddenly we have to choose. It would have been nice to say there is this one universal framework for learning functions, and it's called the square exponential kernel. And this is not a joke: for a significant duration, the machine learning community actually thought this, in a sense that we will talk about after the Lent break. And now we've discovered that there isn't just this one thing: there's the Wiener kernel, there's the integrated Wiener kernel. Are there more kernels? We should think at some point about whether one of them is better than another, but first we need to understand how many there actually are. Well, it turns out there are many more. Here is a very important type of kernel which, at least I, don't know how to easily construct from a feature construction in a general sense; well, I sort of have an idea for how, but it's not going to be nice. It's maybe an example of how sometimes mathematical insight is actually very useful. This is the so-called Matérn family of kernels.
It's named after, I think, a Swedish or Norwegian forestry expert who constructed them for the first time. It looks complicated, and it is. It is a function of two inputs, and it has two parameters called nu and l, where nu has to be positive and l also has to be positive. You can think of l as a length scale and nu as a smoothness parameter. There is a gamma function in here, which we already know, okay, good; then there is this rational function with nu in it; and then there is this thing called K_nu, which is the modified Bessel function of the second kind. Whenever something like that shows up, you're like, okay, I'm going to have to call a library. There's some fancy Wikipedia entry for it that tells you this thing is the solution of some differential equation, and God knows what. Actually, for nu essentially equal to an integer plus one half, this Bessel function has a particular explicit form that involves essentially factorials, so gamma functions show up again. It's just a bit of an inconvenient parameterization that Matérn chose, so we have to set nu to an integer plus one half, otherwise things don't line up. But if you do it this way, then the first few of these kernels look like this. The first one is this kernel: just the exponential of minus the absolute distance, not squared, just the absolute distance. So this kernel function looks a little bit like the Eiffel Tower; well, maybe the Eiffel Tower looks like this, and if the Eiffel Tower continued below the surface, then like this. There's an xkcd where Randall Munroe actually claims that the Eiffel Tower in log space looks like a triangle; that would be true if this is the case. This process also has an interpretation: it's the thing in physics corresponding to a Wiener process if the particle that moves around in the gas isn't actually free, but sits in a potential well with parabolic shape; there's the physicist nodding in the middle. If you have a particle that sits in a harmonic potential, then whenever it moves away from zero it gets pushed back towards zero, and that's what this thing does. We can look at our code: I have a thing here called matern one, and it has a parameter called l that I need to set, the length scale, and the length scale should probably be something like one. Then you see this in red: we again get functions that are very rough; they are non-differentiable almost everywhere, to make a mathematical statement. And you can almost see the shape of this kernel inside; I can make it a bit more obvious by scaling it up. I'll multiply it by basically 10 in the standard deviation, and you can see this thing going all the way to the data points and then retracting back towards zero. You can also make this very strict, by making the potential well act fast, and then the function will quickly return to zero.
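The first half-integer members of this family have simple closed forms; a hedged sketch using the standard textbook expressions, with r the absolute distance and l the length scale:

```python
import numpy as np

def matern12(xi, xj, l=1.0):
    # nu = 1/2: Ornstein-Uhlenbeck / exponential kernel, non-differentiable samples
    r = np.abs(xi - xj)
    return np.exp(-r / l)

def matern32(xi, xj, l=1.0):
    # nu = 3/2: once-differentiable samples
    r = np.sqrt(3.0) * np.abs(xi - xj) / l
    return (1.0 + r) * np.exp(-r)

def matern52(xi, xj, l=1.0):
    # nu = 5/2: twice-differentiable samples
    r = np.sqrt(5.0) * np.abs(xi - xj) / l
    return (1.0 + r + r ** 2 / 3.0) * np.exp(-r)
```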
Whoops, oh, that was the wrong length scale; like this, and it comes right back. Okay. And there are other versions of this, three halves, five halves, and so on, and the interesting thing is that for each step in the integer value of nu, or p here, the samples that come from this stochastic process become differentiable by one more order. The Ornstein-Uhlenbeck process is non-differentiable almost everywhere; for three halves we get functions that are differentiable almost everywhere, but not twice differentiable almost everywhere; for five halves we get twice differentiable but not three times differentiable functions, and so on. This is why this is a very interesting class of kernels: it can be used for analytical purposes, to construct sample spaces, hypothesis classes, with a very well-defined regularity, a certain number of derivatives available. And there is a limit case: if you take p to infinity, it turns out that through some beautiful math you can show that the resulting limit kernel is the square exponential kernel again. So in this ordering, you see it in black: Ornstein-Uhlenbeck, nu equal to one half, p equal to zero, then once differentiable, twice differentiable, and infinitely often differentiable; a whole family of kernels. There's another type of construction that I just want to show, to drive home the point that there are really a lot of kernels, and it's called the rational quadratic kernel. It's actually a construction we could do ourselves, because we've done exponential families. The square exponential kernel has this Gaussian shape, and we already know that there is a conjugate prior for the Gaussian called the Gamma distribution, which allows us to integrate out an uncertainty over the length scale of this kernel, essentially. If you assume a square exponential kernel whose length scale is Gamma distributed, and integrate it out, you get something like an infinite sum over Gaussian-shaped, square exponential kernels with all sorts of different length scales, and how much each length scale contributes to this integral depends on the parameters alpha and beta of this construction. If you set beta to 1 and integrate out, then alpha shows up in the integral, and you get this expression at the top here, which is this thing. You can think of it as a scale mixture of these smooth functions: they are infinitely often differentiable almost everywhere, but their length scale moves up and down; you see in the samples that sometimes they become quite smooth, and then suddenly they have a big bend up and down, and then they become smooth again.
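In its common parameterization, where alpha controls the mixture over length scales (this may differ from the alpha/beta parameterization on the slide), the rational quadratic kernel is a one-liner:

```python
import numpy as np

def rational_quadratic(xi, xj, l=1.0, alpha=1.0):
    # scale mixture of square exponential kernels over length scales
    r2 = (xi - xj) ** 2
    return (1.0 + r2 / (2.0 * alpha * l ** 2)) ** (-alpha)
```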
So I'm not saying you have to use this kernel everywhere, or that it's somehow better than the square exponential kernel or whatever; my point is that there are many such kernels. There is the square exponential kernel that we saw last Thursday, there's the Wiener kernel that we started half an hour ago, there's this cubic spline kernel, there's the Matérn family that includes models for particles with mass and momentum, and so on. And there are even more: for example, Chris Williams in 1998 published a kernel. There was a phase in the machine learning community when people were excitedly constructing kernels with lots of cool properties. For example, if you assume that you build a neural network, which in 1998 meant that there were sigmoidal link functions, tanh functions, flying around, and you assume that those tanh functions have some distribution around zero, because that's how you initialize your neural network, with Gaussian locations around zero, then you can actually integrate out this object, because that's an integral that Chris Williams found in a book, and it gives you a covariance function that looks like this: the arcsine of inner products, with square roots. You can implement that kernel yourself if you like, and back then people were excited about it: woohoo, we can have infinitely wide neural networks, maybe we never need to train neural networks again, we can just do linear algebra. And it felt very powerful, because linear algebra is really powerful. It felt like we never have to do gradient descent again, we don't have to simulate the brain, that's the stupid stuff that Yann LeCun and Geoff Hinton do, right? We can do kernels with infinitely many features, and everything is just linear algebra: no SGD, no Adam, just direct computation. Well, it's 2023, so you know that somehow that didn't quite work out, but we'll need to think about why; we'll do that after the Lent break. So, to close this part: we now have a collection of parameterized kernels. Which one should we choose? Well, before we can even decide that, if you're a mathematician, maybe you have more of an urge to think: what is the structure of this space? Can I use those kernels somehow as starting points, to index a space of kernels and build more? For that, we can go back and think about what we actually need from a kernel. The one thing we needed kernels to be was what? Someone can shout it out. Positive semi-definite: they have to be functions that, when evaluated on a collection of points, give a positive semi-definite matrix every time. So, positive definite matrices. I've learned that when I go over here, the camera can't see me anymore, and that means that when you watch the videos over the Lent break, you're going to say, ah, I couldn't see you there. So I'll just wipe the blackboard for a moment; while I wipe it, you can think for yourself about what a positive definite matrix was, in case you've forgotten. It's a matrix with the property that if we take any arbitrary vector and multiply it from the left and the right onto our matrix, the result is non-negative, for every possible choice of vector of whatever the size of our matrix is. Oh, and you're right, the matrix is n by n. So: what could I do with this matrix, easily? If I had found one kernel which has this property, what can I do to k without changing this property? Good, good, lots of ideas, okay, slowly, let's go again: who said something with a constant? Multiply by a constant, yes, okay, good. So we could multiply by a constant alpha from the outside; that's like multiplying by alpha on the inside, and nothing changes, as long as alpha is larger than zero. Interesting; what else? Linear combinations: we could take alpha times k plus beta times k-prime, because then, if you multiply from left and right, it's all linear algebra and everything just works, okay, good. What else? Those are the easy ones, right. Say again? Decompose k? In which sense, something like a Cholesky?
I think this is getting a little bit outside, but yes, maybe it's pointing in the right direction, towards one of the obvious ones. I've shown you the theorem: we had this theorem by Mercer that says that we can write any such kernel, in quotation marks, as a sum over i of phi_i(x) times phi_i(y), so the feature at x times the feature at y. So what we can do is: if we have any function of the input, and if it has an inverse, we can think of this as phi_i of some transformation, let's call it gamma, inverse of x, and then we're in another space, sort of. And if I apply the same gamma to both inputs, then that's still a kernel, because it's still of this form: a function applied to the inputs, and then the inner product of those feature functions. This is actually a result of Mercer's theorem directly, so it's not so straightforward; you need the theorem for it. And then there is something else: the fact that if you take two symmetric positive definite matrices and multiply them with each other pointwise, the Hadamard product, not the matrix-matrix product, so the element-wise product, star in numpy, star not at, then they remain positive definite. And that is completely not obvious; you can't just look at it and say, ah, that's true. This is called the Schur product theorem, and if you want to read a proof, there are actually two papers you can look up, and they are not short; it's complicated. So what this means is: we can take a kernel and multiply it by a constant that's larger than zero; we can transform the input arbitrarily, as long as we transform both inputs with the same function; we can take linear combinations of kernels; and we can multiply kernels as functions, that's what the Hadamard product is, we multiply the functions that make up the elements of the matrix, and they always remain positive definite. One way to summarize this is that the set of kernels forms a semiring under these operations.
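These closure properties can be written as tiny combinators that take kernels and return kernels; a sketch, with the positivity of the scale factors left to the caller:

```python
def scale(k, a):
    # a * k, valid for a > 0
    return lambda x, y: a * k(x, y)

def add(k1, k2):
    # linear combination (with positive weights) of two kernels
    return lambda x, y: k1(x, y) + k2(x, y)

def multiply(k1, k2):
    # element-wise (Hadamard) product; Schur product theorem keeps it a kernel
    return lambda x, y: k1(x, y) * k2(x, y)

def warp(k, gamma):
    # input transformation, applied identically to both arguments
    return lambda x, y: k(gamma(x), gamma(y))
```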
Multiplying by that constant just makes the output space broader. And this actually has a non-trivial effect. Why? Because there is noise on these observations — they have little error bars — so there is a trade-off between how broad the prior is and how noisy the observations are. You can see that the line here at the end now gets pulled right onto the data points; the model becomes a bit more flexible, if you like. This wouldn't happen if sigma were zero — but sigma equal to zero is also a bit of a pathological case, because then you would always interpolate the data exactly.

The other thing we can do is this business with the inputs: we can take the inputs and transform them by an arbitrary function. A simple choice is a linear function. So if I take the same prior again, now multiplied by 10, and I use a particular transformation of the inputs — dividing by l, which is a linear map — then that rescales the input space of the Gaussian process. So: scaling the kernel on the outside scales the output space, and scaling on the inside changes the length scale of the process. For small l, distances are measured on small scales, and so the covariance quickly returns to zero; for large length scales, distances are measured on large scales, and it takes a long time to return to zero.

Yes? So you're asking the obvious question in the room — which is good, it means you're basically ahead of what I'm trying to present, so it's not going too fast — namely: there are all these degrees of freedom now, how are we going to set them? Can we have a method for learning them? It seems annoying to have to go in by hand and say: oh, I thought that was just one kernel, but now I realize I have to scale the output and it changes how the method works, and now I realize I can transform the input and it changes how the model behaves — how should I set all of that? Okay, we'll get to that — not today, but on Thursday, and we'll spend most of Thursday talking about it.

Yes? The question is: if you choose a kernel that is not positive definite, what happens with our code? To be honest, it depends a bit on how it fails to be positive definite. If it returns zeros, our code is not going to complain; it will just happily draw zeros and add zeros everywhere — and we actually had a case of this here: on the left we have zero variance and nothing happens; that's still semi-definite, just not definite. If your kernel returns NaNs, hopefully you'll get error messages, but that of course depends on your implementation; if you go into our Gaussian base class and fiddle around with it, you can make it do other things. And now the interesting question is what happens if your kernel returns infs — you can think about that.
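To make those two knobs concrete, here is a minimal sketch with my own parameter names (not the lecture code): the output scale multiplies the kernel from the outside, the length scale rescales the inputs on the inside.

```python
import numpy as np

def rbf_kernel(x1, x2, output_scale=1.0, lengthscale=1.0):
    """k(x, x') = output_scale * exp(-0.5 * ((x - x') / lengthscale)^2)."""
    d = (x1[:, None] - x2[None, :]) / lengthscale   # scaling on the inside
    return output_scale * np.exp(-0.5 * d**2)       # scaling on the outside

x = np.linspace(0.0, 8.0, 120)
rng = np.random.default_rng(1)

# Larger output_scale -> prior samples with a wider output range;
# larger lengthscale  -> samples that wiggle more slowly.
for output_scale, lengthscale in [(1.0, 1.0), (10.0, 1.0), (1.0, 3.0)]:
    K = rbf_kernel(x, x, output_scale, lengthscale)
    f = rng.multivariate_normal(np.zeros_like(x), K + 1e-9 * np.eye(len(x)))
    print(f"scale={output_scale:5.1f}  l={lengthscale:3.1f}  sample std={f.std():.2f}")
```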
As for the kernel returning infs: I won't tell you — because returning infs might actually be useful; we might have some dimensions in which our model has infinite degrees of freedom.

But just to be clear: this linear map is just one very specific example, with a length scale; it's not the only thing you could do. You could, for example, scale the input by some exponential function relative to some point, and then you get something like this, where at the starting point the length scale of the function is — not infinite, sorry — equal to one, because exp of zero is one, and over here the length scale becomes very, very thin. Or, you know, figure it out yourself: try some other transformation that you like. You can use the code that I've put on ILIAS for today, the one I've shown you a few times — it allows arbitrary input transformations. And for those who are already one step ahead again: well, what could this transformation be? It might be a neural network that you choose — and even once you've trained it, there is still a Gaussian process floating around somewhere.

Another thing you can do, as we said, is linear combinations of kernels. Here is an example where I've combined two kernels, in full generality: one of them is what you might call parametric, just a finite sum over a bunch of terms, and the other is one of the standard kernels. Here I've taken the rational quadratic kernel, and I've added what we called polynomial feature functions in lecture six on parametric regression — the sum over 1, x, x squared and x cubed, each with a corresponding scaling in front. And you can see this in green in the background of the prior: there is this polynomial shape — a constant plus a linear term plus a quadratic term plus a cubic term — and on top of it we get another, added function that comes from the rational quadratic kernel, a kind of smooth wiggling up and down. This is a model for functions that globally behave like a polynomial but locally have some kind of wiggliness to them. Ah, and you can also play with the scales a bit — I think I have two different versions of this where I scale with different length scales.

So why would you care about something like this? Well, you may have heard of fields like scientific machine learning or interpretable machine learning, where you'd like to build a model that you actually understand — so that you know what it's learning — not a deep neural network with a billion weights where afterwards nobody can tell you what's going on, what it learned or didn't learn, whether it has learned a certain data point or not, and so on. Instead you'd like to be able to say: I know that this function is twice differentiable, I know that it has a scale of five at the origin, there is a global linear trend in it, and on top of that there is some small deviation which is differentiable, which happens on this length scale, and which is small compared to the global trend, and so on. On Thursday, that's exactly what we'll do: we'll go through an extended example and figure out, first, how we actually fit all the parameters of the kernel that we don't know, how we set the ones that we do know, and how we combine these abilities of kernels — output scaling, input scaling, linear combinations, and multiplication.
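Here is a minimal sketch of that kind of combination — my own parameter choices, not the ILIAS code: a parametric kernel built from the polynomial features 1, x, x², x³, added to a rational quadratic kernel.

```python
import numpy as np

def poly_kernel(x1, x2, weight_var=(1.0, 1.0, 0.1, 0.01)):
    """Parametric kernel: k(x, x') = sum_i var_i * phi_i(x) * phi_i(x'),
    with polynomial features phi = (1, x, x^2, x^3)."""
    Phi1 = np.vander(x1, N=4, increasing=True)   # columns: 1, x, x^2, x^3
    Phi2 = np.vander(x2, N=4, increasing=True)
    return Phi1 @ np.diag(weight_var) @ Phi2.T

def rational_quadratic(x1, x2, output_scale=1.0, lengthscale=1.0, alpha=1.0):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return output_scale * (1.0 + d2 / (2.0 * alpha * lengthscale**2)) ** (-alpha)

x = np.linspace(-2.0, 2.0, 150)
# Sum of kernels: a global polynomial trend plus local, smooth wiggles on top.
K = poly_kernel(x, x) + rational_quadratic(x, x, output_scale=0.2, lengthscale=0.3)

rng = np.random.default_rng(2)
samples = rng.multivariate_normal(np.zeros_like(x), K + 1e-9 * np.eye(len(x)), size=3)
print(samples.shape)  # (3, 150): three prior draws you could plot against x
```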
Multiplication, by the way, is exactly what this last example shows — the product, the final operation: we can multiply kernels with each other. That's why I was briefly confused; here it's a sum, there it's a product. What's the difference between a sum and a product? Well, a sum is like an OR: two function values have high covariance with each other if one of the terms in the sum is large. And a product is like an AND: two function values covary strongly with each other only if both terms in the product are large compared to zero.

That's the end. I'll leave out the part about how to learn the kernel; we'll do that on Thursday. What we found out today is that there isn't just one kernel. There are many, many different nonparametric models with infinite degrees of freedom — I'll get to you in a moment — and there aren't just five or ten of them, there are entire families of them, and they can be combined in an algebraic fashion: by multiplication with a scalar, by transforming the inputs, by multiplying the kernel functions themselves with each other, and by linear combinations — adding them together. They span an algebraic structure, a semi-ring, that can be used to construct entire, complicated model classes. And of course that raises two questions: on the one hand, does this mean these are very powerful models and maybe we don't need deep learning anymore? And on the other hand, how do we choose one of these models — how do we decide which one we're going to use, if there are so many degrees of freedom? Those are the two things we'll talk about over the next two lectures. But now there's a question.

Are there some kernels that you can't reach this way? So — this is a very interesting question. To answer it you have to look at how you construct the semi-ring: you need some kind of generating set from which you start applying these algebraic operations to try and reach kernels. So your question really is: is there a generating set that is in some sense maximal, such that all kernels are reached? And I'll tell you that I don't know. This is probably one of those questions that has more to do with theoretical computer science or functional analysis, and there's probably some very tricky, complicated answer somewhere about non-enumerability; if you make it complicated enough, it probably has something to do with the halting problem. What I can say — and we'll talk about this after the break — is that there are families of kernels that span very expressive function classes, in particular all continuous functions, where continuity can be measured in various ways: Lipschitz, epsilon-delta, and Hölder-type notions of smoothness can all be used to construct families of kernels that cover these entire function spaces. But how they cover them I haven't actually said yet, so we'll do that after the break. It's a very powerful class of models, and if you want to build languages that reach complicated model classes, then it's all about what you start with before you begin combining.

Okay, that's what I wanted to say. If there are no more questions, I'll see you on Thursday. Thank you very much.