Welcome back, everyone, from the break. I hope you had a relaxing vacation; it's an opportunity to think a bit about the content of the lecture so far. We're back today and we're starting, as always, with feedback, which was pretty good last time. That's not so surprising, because it was a pretty hands-on lecture with, in a way, not so much new content, just a recap of what we've done so far. The feedback was in general quite positive. I'm leaving out a lot of the positive comments, but some people said they particularly liked that we got to talk about a concrete application, a concrete question, and how to realize it in code. And then some people said that I'm again talking too loudly. So how is it now? Good. There were actually two people who said it was too loud again. I can always turn it a little further down, and if I get too loud, somehow let me know. Some people didn't like the visualizations, and there were several questions about the final step in the last lecture, about hyperparameter optimization. We will get back to that in the next one or two lectures, and then we'll actually talk about it for several lectures. Before we get to that point, there was also a question about how to interpret the uncertainty in the plots, and how the adaptation of hyperparameters actually helps to get the uncertainty calibrated. That, too, we will spend about half a lecture on, although not today. So the plan for the next few lectures is, in a way, that we've now reached the methodological toolbox that we need to cover the vast majority of contemporary machine learning, and we're going to, in some sense, spend a lot of time thinking about it. I could cheekily say we're going to slow down now and take our time to think about many different aspects of these models. It's probably not going to feel like slowing down, but we'll see. I want to start today with a lecture that tries to build a theoretically cleaner understanding of what we've been doing so far, in particular with a connection to the statistical machine learning class that many of you are in as well. What we've seen so far, as a quick recap, is that we can represent and realize the mechanism of probabilistic inference, which in general is quite abstract and maybe even intractable, on contemporary computers quite efficiently — in fact using just linear algebra — when we phrase everything in terms of Gaussian distributions over random variables that are linearly related to each other. So there are two important ingredients there: linearity and Gaussianity. Then we saw that we can use this computational mechanism to learn functional relationships between variables — supervised machine learning, learning functions. We do that by expressing functions first in a parametric form, through some mapping of the input x into some weight space, and then doing Gaussian regression on the weights. But we actually discovered that it's possible to consider an infinite set of weights at once, through a notion associated with the concept of a Gaussian process, which I introduced on the code side as a form of lazy, functional-style programming, where the Gaussian distributions are initially just represented by functions that get called whenever we actually need to instantiate a finite-dimensional Gaussian distribution.
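As a minimal sketch of what such a lazy object might look like — illustrative names, not the actual course code — the mean and covariance are stored as plain callables, and linear algebra only happens once the object is called at concrete inputs:

```python
import numpy as np

class GaussianProcess:
    """Lazy GP sketch: store callables; nothing is computed until the object is called."""

    def __init__(self, mean, cov):
        self.mean = mean  # mean function m(x), vectorized over inputs
        self.cov = cov    # covariance function k(x, x'), broadcasts to a matrix

    def __call__(self, x):
        """Instantiate the finite-dimensional Gaussian at the inputs x (shape [n])."""
        m = self.mean(x)                      # mean vector, shape [n]
        K = self.cov(x[:, None], x[None, :])  # covariance matrix, shape [n, n]
        return m, K

# Example: zero mean and a square-exponential covariance function.
gp = GaussianProcess(
    mean=lambda x: np.zeros_like(x),
    cov=lambda a, b: np.exp(-0.5 * (a - b) ** 2),
)
m, K = gp(np.linspace(0.0, 1.0, 5))  # only now does a concrete Gaussian exist
```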
These Gaussian process objects are defined through a mean function and a covariance function, which are the objects we work with, and I'll say more about them in a moment. One thing I did point out is that these covariance functions have the interesting property that they are positive definite, and the term for positive definite functions is that they are kernels. This is the point where I should ask you: in the statistical machine learning class, have you talked about kernels yet? Not yet. What are you currently doing in the stats ML class? I might tell Professor Hein afterwards what you said — I shouldn't have said that. Does anyone want to say what the title of the last lecture before the vacation was? I don't think you've had one yet. But you're operating already on a full-year basis. So, empirical risk minimization on convex problems — that sounds like you're getting close. Kernels are traditionally a concept that is quite central to statistical machine learning, and depending on what Professor Hein wants to do this term — he is, like me, adapting the content all the time — he might spend quite some time talking about kernel machines. We'll see whether he does or not. But you have probably all heard that there is this thing called kernel machines. Hands up: who has heard of kernel machines in machine learning? That's more than half, but not much more than half. So there is, traditionally, a big part of machine learning here — in fact, up until about, let's say, 2007 or so, from the late 90s for roughly a decade, machine learning was pretty much dominated by kernel machines. And if I use the word kernel, it's natural to ask whether there is a connection to them. So what I want to do today is try to get a cleaner understanding, from the mathematical side, of what we've been doing so far with Gaussian processes, and how they relate to the frequentist methods called kernel machines. A few questions you could ask in this context: these Gaussian processes seem to be quite abstractly defined — lazy objects given through mean functions and covariance functions — so do they even define something unique? If you're a mathematical person, that's a natural question. Do they define one probability measure? Is it uniquely identified? Is this a probability distribution over functions, and what kind of spaces of functions is it a distribution over? These kernels — since you haven't heard about them in statistical machine learning yet — what are they, positive definite functions? Can I think of them as generalizations of matrices? And then, are they related to kernel machines? Actually, that last question I should have removed, because I'll do it in a later lecture, but it's a natural question to ask already: if I use an infinitely wide neural network, so if I take infinitely many features, does that mean I can somehow build an infinitely powerful learning machine? The answer to that is actually quite intricate, because it's sort of both yes and no. So what I'm hoping for — and I'm going high-level, slowly — is that you can leave the lecture today with the following four answers slightly more concretely in your head. The first one is that Gaussian processes do indeed define a probability distribution over spaces of functions, with the caveat that the space of functions we have to consider when we think about them this way is very vague.
It's so vague that it's very difficult to even do much with it. But because the objects that define a Gaussian process are kernel functions, we can study that object — the kernel function that sits inside the Gaussian process — to get a more concrete idea of what the distribution actually looks like. In particular, we'll do that by drawing connections to the frequentist framework of statistical machine learning and the kernel machines there, identifying some quite close connections between the two that will help us understand what Bayesians and frequentists are actually on about, but also identifying some very subtle differences between them, in the sense that the objects we study in the end are not actually the same ones, although they allow interpretations to be transferred between each other. But again, that fourth point I should have removed five minutes before the lecture, because I decided not to do it today. So what's going to come now for the next 90 — well, actually 70-odd — minutes is a lot of math, and I know that not everyone likes lectures that have a lot of math. Today's lecture has pretty much no pictures; the last lecture was lots of pictures and visualizations, and today there are almost none. So, advance apologies that there will be a lot of theorems flashing by, and you typically won't be able to get all the details, but I kind of have to show you the theorems to explain how the math works. Your mental goal for today shouldn't be to get everything we're going to talk about in detail, or to remember all of it for the exam — that can't work — but to get a gist of what the mathematicians are trying to tell you when you squint your eyes at these theorems. That will lead to these four — actually just three — high-level answers I had on the previous slide, which will form a foundation we can operate on in later lectures, and which allows us to use a few big words while making sure we understand what we're doing. And for those of you — I know there are some — who always like to have even more math: if you'd like to know more, you can come to the front during the break and at the end and ask me, and also Marvin Pförtner, who is sitting here in the front. He's here as my fact checker and linter, so that when I say something wrong he can say "not quite", and we'll see. And if you really want to know many more details, you can read any of these three papers, or even books, and there are even more. Okay, so the question is whether Rasmussen and Williams is outdated on this. No, it's not outdated, it's just not very theory-heavy. If you look at that book, there's a chapter at the end — I think it's chapter eight or so — that has some connections and equivalences to other frameworks, but it's relatively high-level; that chapter is, I think, written by Chris Williams, but it's not as deep as these presentations. By the way, someone asked at some point whether one can reach all possible kernels from a set of kernels by combining them — wherever you are, I can give you a sort-of answer in the break or afterwards; I just don't want to do it here, because it's going to confuse people even more. But if you'd like to hear something about non-computable kernels, ask us at the end. So, let's take a more careful look at what these Gaussian processes actually are.
So your first question might be: this abstract object, what actually is it? Here is a refined definition of a Gaussian process that is pretty much the same as the one I've already used in previous lectures, but now a little more specific. A Gaussian process is a random variable f over an index set — or actually, it's a family of real-valued random variables, which we can write like this, and I'll explain what this notation means in a moment — on a common probability space, such that every finite combination of function values, of random variables evaluated at finitely many locations, follows a multivariate Gaussian distribution. What this notation introduces is a second object: so far we've talked about f(x) as a random object, and in this definition we're a bit more concrete about the fact that there is some randomness being injected. You can think of this distribution in particular through two objects. First, the so-called mean function, which we can define as the function that gives us, for any input point, the expected value of this stochastic process. And secondly, the covariance function, which is the function that computes, for every pair of inputs, the covariance between the function values at those points. So that's a definition that doesn't use the word kernel at all; it's a definition from the probabilistic perspective: here is an object that defines a probability distribution through this kind of construction, and in particular it defines the randomness, the stochasticity, the probabilistic aspect, quite concretely. You can think of these objects f(x) as random variables that take in randomness from some underlying probability space through this mapping with the ω object: you draw ωs first from some base space, and then somehow make sure that the resulting f(x, ω) are Gaussian distributed. And that's actually exactly what we do in the code. The Python code you've used so far very directly constructs exactly this kind of object: we instantiate the mean and covariance by evaluating our lazy objects, by forcing them to evaluate at a particular set of points, and then we effectively take a set of random bits, transform them so that they give standard Gaussian random variables — through the Box–Muller transform or the ziggurat algorithm or whatever — and then multiply them from the right onto the Cholesky factor of the covariance matrix and add the mean. So clearly this is an object that depends both on where you evaluate and on what kind of random thing you put in. You could also think of this ω as something like a jax.random key object: there is an underlying source of randomness that constructs these objects. The definition here is one of a probability distribution with certain properties: it has a mean and a covariance function, and that's the formal definition of a Gaussian process. Every Gaussian process has a mean and a covariance function, of course, because every Gaussian has a mean and a covariance, and so if you instantiate at any finite set of points and get a Gaussian distribution, you can instantiate these objects.
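A minimal sketch of exactly this sampling step — illustrative, assuming a square-exponential covariance, not the exact exercise code:

```python
import numpy as np

rng = np.random.default_rng(0)           # source of randomness, playing the role of omega
x = np.linspace(0.0, 1.0, 50)            # finite set of evaluation points

m = np.zeros_like(x)                                # mean function evaluated at x
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # covariance function evaluated at x

L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))   # small jitter for numerical stability
omega = rng.standard_normal((len(x), 3))            # standard Gaussian draws
samples = m[:, None] + L @ omega                    # three joint samples of f at x
```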
But, as I already slipped in during the presentation, every one of these covariance functions is also a kernel. A kernel is a function that constructs matrices that are always symmetric positive semi-definite, and there's a proof for it here — you can look at it afterwards; it pretty much just says that covariance matrices are positive semi-definite matrices, and therefore any covariance function is a positive definite function. And if you call positive definite functions kernels — if you use the word kernel for short — then every covariance function is a kernel. But it turns out the statement also works the other way around, and that's much less obvious: every kernel, every positive definite function, defines a Gaussian process. You can think of every kernel as a covariance function, in the specific sense of this lemma up here, which comes from an introductory book from 2014. It says: for every function m (or μ) and every positive definite kernel k, there is a Gaussian process with m as its mean function and k as its covariance function. This is a statement about the existence of Gaussian processes as well-defined objects: we are allowed to take this functional-style approach to constructing distributions over some space of functions because of this construction. The proof of this is actually not so straightforward, and it relies on a — in my opinion somewhat weird — theorem called the Kolmogorov extension theorem. It took me a long time — Marvin has tried to convince me for a long time that it's very useful — to understand why it's useful; this is also the first time I'm doing these slides in this lecture, so I had to think about them myself. The Kolmogorov extension theorem really says that we are allowed to do what we do in our code. Here's the theorem below; I'm going to read it out and then we can think about it together. It's quite old, actually. Consider a non-empty index set I, and assume we're given a bunch of, let's say, Borel spaces — measurable spaces — for every possible i in the index set. Now consider the set of all finite non-empty subsets of I; that's the thing you can actually call on the computer, the object the call function can operate on. Then any so-called projective family of probability measures — I'll say what that means in a moment — uniquely, and that's the operative word, defines a probability measure on the measurable space given by the product space over all elements of the index set. How? Through this kind of operation, which also defines what a projective family of probability measures is: it's a collection of probability measures for which you can obtain, from a joint probability measure, the measure on a subset of the indices, by pulling the probability measure backwards through an operator that just selects points in the index set. You can tell that I'm being a bit vague here, because this operation is so elementary: it really just says that if I can pick finitely many objects from this index set I, and can do so in some sense meaningfully, then this works. And this "meaningfully" can actually be defined a little more concretely; there are variants of it.
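One standard way to write down these consistency requirements — a paraphrase in textbook notation, not necessarily the slide's exact formulation — is the following, for the finite-dimensional marginals of the process:

```latex
% Consistency conditions on the finite-dimensional marginals P_{x_1,\dots,x_n}
% (paraphrased form of the Kolmogorov extension theorem).
\begin{align}
% permutation consistency: relabelling the points relabels the marginal
P_{x_{\pi(1)},\dots,x_{\pi(n)}}\bigl(A_{\pi(1)}\times\dots\times A_{\pi(n)}\bigr)
  &= P_{x_1,\dots,x_n}\bigl(A_1\times\dots\times A_n\bigr)
  \quad\text{for every permutation }\pi,\\
% marginalization consistency: integrating out a point drops it from the index
P_{x_1,\dots,x_n,x_{n+1}}\bigl(A_1\times\dots\times A_n\times\mathbb{R}\bigr)
  &= P_{x_1,\dots,x_n}\bigl(A_1\times\dots\times A_n\bigr).
\end{align}
```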
If you check the Wikipedia entry for the Kolmogorov extension theorem, you'll find a statement along exactly these lines — a bit more concrete: it says what you need to be able to do when selecting from this index set I, namely that the resulting measures behave consistently under permutations and under selecting parts and ignoring the rest. This is an extremely weak kind of requirement: the only thing we need to be able to do on these spaces is select finitely many input points, and that is true for pretty much any space you would want to define this over — in particular also uncountable spaces like the real line. That's why this is a powerful statement. What this theorem really says is that you can define these objects called Gaussian processes if you can define the operation "call" that we have in our code. If the X space we're operating over allows us to call on x, and call returns an array, and arrays have all the properties we need — we can permute their entries, we can drop parts of them, in particular we can slice into them — then we're good: we can define a stochastic process on this space and call it a Gaussian process. And that's kind of cool, because it means this is a very powerful statement; it's a very general concept we can use on crazily complicated spaces. It's also, in a sense, a problem — this is on the next slide — because this property is itself so weak that it's very difficult to do anything with it. Here's what the slide says: if you think of the construction the other way around, you think of a function f that isn't yet instantiated but always carries, in some sense, its random key. You can think of this as a sample path, as a function that hasn't yet been created but that we can now call. This is a well-defined construction — as you can probably tell if you've looked at our code, because that's essentially what we do — and it means these objects actually exist, in the sense that we can write code that draws from them. In this sense, f(·, ω) is a well-defined random variable, and that's why we can think of Gaussian processes as probability distributions over a space of functions; that's the kind of object that gets studied in stochastic analysis, from a probabilistic perspective. The only problem is that this construction — which so far hasn't used the covariance function at all; it just says there is an object called a covariance function that defines the covariance of this process — defines a distribution on a space that is extremely unstructured: the product space, with the so-called product sigma-algebra, on R to the input domain. In particular, it's not a space with a norm, so we can't measure distances between functions; it's just a very, very general space, and it contains all sorts of crazy functions — for example the function that is zero everywhere except at one real number, where it is one. That's a function that is extremely unstructured, and this is not the kind of space we typically want to operate in. So if you want to ask analytic questions — like how quickly can I learn a function, or are the sample paths this thing produces somehow well structured, are they continuous, are they differentiable, do they have certain upper or lower bounds in expectation, and so on — then we need more structure to work with.
That's why most of the analysis for Gaussian processes actually uses additional structure in the object, and that structure has to come from somewhere: it comes from the mean function and, in particular, from the covariance function. If you want to study how a particular Gaussian process behaves, you need to study the covariance function, the kernel; that's where a lot of the structure is going to come from. But there was a question — back to slide six. Ah, so the question is whether I can say again why this f(x, ω) has this particular form. Actually, it doesn't necessarily have to have this particular form. The definition just says we want an object that produces functions with this property: every finite combination of function values is multivariate Gaussian. You can construct such an object in very different ways. In particular, if you are given a μ and a k, you could also say: whenever you give me an index set X, I'm going to construct k_XX, then I'm going to construct a singular value decomposition of it, which consists of two matrices, U and D let's call them, and then I compute μ_X plus U times the square root of D times ω. That's another construction for Gaussian random variables. Or, if your ω isn't actually Gaussian distributed yet — it's just uniform random variables between 0 and 1 — then you first apply a transformation to make them Gaussian, and so on. So this equation is really just supposed to say: here is a concrete way of taking a bunch of random numbers and turning them into jointly Gaussian random numbers with this particular mean and this particular covariance. Is this what you were getting at? I think, maybe for most of you, the best way to take something away from this slide is that the code we wrote really is precisely the definition of these objects. Because what does it do?
It has a sample function which says: give me the mean and covariance function, by constructing the object first. And you could actually construct the sample function such that, at instantiation of the Gaussian process object, we hand over a JAX random key object. That key is related to this ω — it's not exactly equal to ω, but it's the thing that makes ω, the source of the randomness. And then you have this program, which produces random function values by, if I give it an index set x, calling these two functions to construct these two objects — this is an array and this is a matrix — doing some linear algebra on them, and calling the source of randomness to produce the samples. It's actually also a nice aspect of JAX that it makes very concrete where this randomness comes from: you can fix the key, and then you'll always get the same function back, and if you change the key, you get a different function back.
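A sketch of such a sample function with an explicit JAX random key — the function and variable names are illustrative assumptions, not the course code:

```python
import jax.numpy as jnp
from jax import random

def sample(key, mean_fn, cov_fn, x, num_samples=1):
    """Draw joint samples of f at the inputs x, with the key as the source of randomness."""
    m = mean_fn(x)                                   # mean vector
    K = cov_fn(x[:, None], x[None, :])               # covariance matrix
    return random.multivariate_normal(
        key, m, K + 1e-9 * jnp.eye(x.shape[0]), shape=(num_samples,)
    )

x = jnp.linspace(0.0, 1.0, 20)
mean_fn = lambda x: jnp.zeros_like(x)
cov_fn = lambda a, b: jnp.exp(-0.5 * (a - b) ** 2)

f1 = sample(random.PRNGKey(0), mean_fn, cov_fn, x)   # fixed key: reproducible draw
f2 = sample(random.PRNGKey(0), mean_fn, cov_fn, x)   # same key  -> same function values
f3 = sample(random.PRNGKey(1), mean_fn, cov_fn, x)   # new key   -> a different function
```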
The important thing so far — here's the summary slide — is that GPs are indeed probability distributions over function spaces. They are function-valued objects, function-valued random variables, and they are well defined in this sense: the construction with a mean and a covariance function uniquely identifies one probability distribution over a space of functions. There's also a downside to this very probabilistic approach, which is that the space of functions it is defined over is pretty much unstructured; if you haven't yet talked about the kernel and the mean, that doesn't allow us to say much about the functions we're going to draw. So if we want to think about what properties this Gaussian process object has, we need to look at the kernel. We already realized that the mean function isn't so important — we can always just subtract it and draw from the zero-mean GP — but we do have to look at the kernel as the defining object, and look into its structure, to get a better feeling for what kind of learning machine we've actually built: what it actually does, what kind of samples it produces, what kind of functions it can learn, and so on. So we have to think about these kernels again, and since you haven't looked at them yet in statistical machine learning, that's actually maybe good — I can give you the very intuitive introduction to them, and I have the advantage that I don't have to teach the statistical machine learning class, so I can ride right over these concepts and leave the nasty proofs to Professor Hein when he actually gets to them; I'll send him an email and remind him to do that. First of all, the most important thing about these kernels, again, is that they are functions of two inputs which we use to construct arrays — two-dimensional tensors, also known as matrices — by putting in two sets of inputs and broadcasting onto a matrix. These matrices have the property that if you broadcast an input set with itself, you get a square matrix which is symmetric positive semi-definite. Matrices are objects we all know and have studied for a long time: in your first-year undergraduate math lectures you learned all about matrices and linear algebra, and you learned that matrices can be represented by their eigenvalue decomposition — or, more generally, their singular value decomposition, but for the matrices we care about, their eigendecomposition. In particular, for symmetric positive definite matrices like the ones we talk about here, we know that the eigenvectors are orthogonal to each other — someone already pointed that out a while ago — and that the eigenvalues are all non-negative: if the matrix is positive definite they are all larger than zero, and if it is semi-definite they are all larger than or equal to zero. So any such matrix can be written in this way — maybe I should have called it V Λ V transpose, I can't wipe that out at the moment — with a matrix V containing the eigenvectors, which are orthogonal to each other (that's why we can write a transpose here rather than an inverse), and a diagonal matrix containing the non-negative eigenvalues. A natural question you could ask is: now that we have these functions called kernels, for which we write k_xx, is this a little bit like that kind of representation? Can we think of this notation like indexing into an infinite-dimensional array? That's also why I used the suggestive notation with the subscript, because it looks a little bit like a matrix. So is this actually allowed — can we do something like this? Do you understand what I mean? Not really? So: can I think of this like what we actually do in the JAX NumPy code, with a lot of slicing and Nones and ellipsis operators that you were all confused about during your exercises, I know? It looks a bit like there is a function that spans an infinite-dimensional matrix, and we're calling into it to get its elements. And in fact it turns out that's true in some sense, but also not entirely. It's true in a technical sense realized through the following theorem by James Mercer, a British mathematician who studied exactly these bivariate functions. There is a concept for these functions corresponding to the concept of an eigenvector of a matrix. An eigenvector is a vector v such that, when you multiply the matrix with the vector, you get a scaled version of the vector, where the scaling constant is called the eigenvalue. For kernel functions there is a corresponding thing called an eigenfunction, which has the property that when you integrate the kernel against this function, you get back a scaled version of the function, and the scaling constant you could call an eigenvalue with respect to this eigenfunction. There's a question — whether, for symmetric positive definite kernels, we should be thinking of complex-valued eigenfunctions and eigenvalues here. This is the general case; for symmetric positive definite kernels these λs will be real-valued, but you can define eigenfunctions more generally for such operators. And it turns out — this is Mercer's theorem — that just as we can take any symmetric positive definite matrix and write it in terms of its eigenvectors and eigenvalues, you can also write any positive definite kernel as a sum — "sum" is important, because it runs over a countable set of indices — over something like an outer product of the eigenfunctions, weighted by the eigenvalues. So this really is — I can even match the indices: k(a, b) equals the sum over i of φ_i(a) times λ_i times φ_i(b), because the middle factor is diagonal — exactly the same kind of construction. So in this sense you can actually think of a kernel as an infinite-dimensional matrix that we're indexing into, which has a basis formed by these eigenfunctions.
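Written side by side, with the indices matched as in the lecture, the analogy reads as follows (the measure ν below is the base measure discussed next):

```latex
\begin{align}
A_{ab} &= \sum_{i} v_i(a)\,\lambda_i\,v_i(b)
  && \text{(eigendecomposition } A = V\Lambda V^\top \text{ of an SPD matrix)}\\
k(a,b) &= \sum_{i=1}^{\infty} \phi_i(a)\,\lambda_i\,\phi_i(b),
  \qquad \int k(a,b)\,\phi_i(b)\,\mathrm{d}\nu(b) = \lambda_i\,\phi_i(a)
  && \text{(Mercer expansion, w.r.t.\ a measure } \nu\text{)}
\end{align}
```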
But there's a little caveat, which is that there is really only one way, up to isomorphism, of counting over the natural numbers. If you have a vector that contains elements at finitely many locations, then it's clear what's happening here — I've actually left out the sum; it's a sum over i — because you can't really do anything other than permute the way you count. But for real-valued or more general input spaces X, we have to say how we integrate, what the summation actually does, and that's why we need a measure over the space. That measure introduces all sorts of complications in a non-trivial fashion: in particular, it could be a probability measure, and we know that there are many different probability measures even on the real line, and for different measures you get different eigenfunctions. So what the eigenfunctions actually are depends on how you measure. That's annoying, but it's an unavoidable problem in spaces with less structure than the natural numbers. So in the sense of this theorem we can use kernels as infinite-dimensional matrices — we're sort of allowed to do that — but we have to be a bit careful, because if you want a concrete representation of these infinite-dimensional objects, if you want to instantiate them in terms of some basis, then one way to get that basis is to use the eigenfunctions, but that basis is only relative to some base measure that we need to define or construct. An actual question you might have, which I'm not going to answer: what are the eigenfunctions, what do they look like? I've shown you these kernels — the square-exponential kernel, the Matérn family, all of these different kernels — and you could ask what their eigenfunctions look like, can you show some examples? I'm not going to do that. In some specific cases the eigenfunctions are actually known, for example for the Matérn family, but in general it's not so straightforward to figure out what they are. The fact that the eigenfunctions exist helps us think about what kind of objects we're talking about, but we typically won't be able to actually write them down. Instead we can ask: why do we even want to think about kernels in terms of matrices, or infinite-dimensional matrices? Because we want to think of Gaussian processes in terms of Gaussian distributions; we want to get from this function-valued object, the Gaussian process, to some more tangible representation over finite-dimensional index sets that we can do analysis on. In particular, we might want to ask: what kind of space of functions do these kernels span, and does that space have something to do with the Gaussian process we're operating with? Maybe not a super concrete question if you just want to do machine learning and train large language models, but since this is a lecture called probabilistic machine learning, we really do have to answer it, to get a sense of what probability distributions we actually work with. So what we've seen so far, as a quick summary: Gaussian processes are indeed distributions over function spaces, and they are directly identified by the kernel that we choose. We need to look at the kernel to understand their structure, because the bare fact that they are Gaussian processes is too vague. And we've just discovered that these kernels actually provide some interesting structure: for example, we can think of them, with some caveats, as infinite-dimensional matrices. Matrices span spaces, and when I write down a matrix A you could ask: what is its basis, what are its eigenvectors, what are its eigenvalues, and therefore what is the image space of A — what kind of vectors can I reach with A?
Similarly, for a kernel, you could ask: what is the space of functions I can reach through this kernel, that I can represent with this kernel? And then a natural question is what that has to do with Gaussian processes, which so far we've defined as probability distributions. I will do that after the break. We'll take a five-minute break now and I'll continue at 11:05; if you have any questions about the details, you can come to the front. — A minor correction here: my linter has already helped me discover one thing I said wrong in the first part of the lecture. I said this object called a projective family of probability measures is defined through this operation, but I have to be a bit more careful. A projective family of probability measures is a family of finite-dimensional measures, indexed by the finite subsets — that's the subscript on the P — on which you can do this projection operation. What the Kolmogorov extension theorem then says is that there is a projective limit: a measure on the full space for which you can do this, even if the index set is uncountable. But really, what this theorem says is: if you can implement a kernel on your computer, on a Turing machine, then it defines a Gaussian process. That's pretty much what this theorem says, because what is a kernel? A kernel is a function that takes inputs from whatever the space is, even if it's a very exotic space, and returns an array, and arrays, as objects in programming languages, have exactly the properties we need: you can take the elements of an array and permute them, and then permute the corresponding dimensions of the distribution; and you can slice into the array, so you can take subsets of it. Those are exactly the two operations we need. When you can do that, you can define a Gaussian process. So what this theorem says is: if I define a Gaussian process with mean function μ and covariance function k, then there is a space — it's on the next slide — called R to the X, which is admittedly a weird notation, but that space is well defined, and the samples that this process draws will lie somewhere in that space. But you would like to have something more concrete. You'd like to be able to say: well, maybe there is a space in here called, I don't know, C², the space of all twice continuously differentiable functions — is the sample going to be in there as well, or outside? Well, that depends on ω. But I still would like to point out this picture: what this construction, the probabilistic construction, provides is just the statement that there is a meaningful space within which all the samples lie. They exist, they are well constructed, and they are in there. But if you want to analyze this object, we need to dig a bit deeper into the kernel to understand where the samples actually lie. What we're going to do next is discover that you can use the fact that these are kernels, and draw on insights from statistical machine learning, to get a kind of inner bound on where the samples might lie — what "inner bound" means we're going to see in a moment; it's like approximating the space from the inside, if you like. And here the high-level takeaway is that Gaussian process regression is just one particular way of talking about, utilizing and studying a conceptual framework that has been discussed for pretty much the entire history of quantitative thinking, of mathematics.
It's related to notions that you may have heard of in other lectures — probably none of you, actually: has anyone heard of kriging? It's a historical term that comes from the geosciences, from our colleagues down the street in the geoscience department. If you take a lecture on how to build smooth maps, you might learn about kriging, which is pretty much the same concept, just a little less generally defined. Maybe most importantly, it's related to kernel ridge regression, which is a concept I'm pretty sure you're going to hear about from Professor Hein. It's also sometimes called Wiener–Kolmogorov prediction, for historical reasons, and it's linear least-squares regression, as we already noticed. Just to make that connection clear again: if we consider f restricted concretely to a particular set of inputs — if we ignore for a moment the theoretical complexity of instantiating the GP on a concrete set of inputs and imagine we've already done that — then we can think of what we do as Gaussian regression. We have some prior over function values, which is Gaussian, and a likelihood which is Gaussian and has a linear relationship to the function values. We do Bayesian inference: we multiply the prior by the likelihood and normalize by the evidence — we already know what the evidence looks like — and that gives us this Gaussian process posterior, which has a mean function and a covariance function. Now, we can also think of this mean function as the mode of the distribution, because Gaussians — we all know what Gaussians look like, they're these bell-shaped things — conveniently have their mode equal to their mean. So we could also find the posterior mean, if we wanted to, by maximizing the posterior distribution, which means minimizing minus the logarithm of that distribution. And what's the logarithm of a Gaussian? Gaussians are exponentials of negative squares, so if you take the logarithm, the exponential goes away, and if you take the minus, we're left with a square — and this is the loss function we are minimizing. So we can think of the point estimate that comes out of this object, the mean, as the minimum of this loss function. And what is this loss? It's the square distance between the predicted function values and the observed function values, weighted by the label noise, the observation uncertainty, plus the negative log prior, which is the square distance between the predicted function values and the mean function as measured by the kernel covariance matrix k_XX — here is the definition for it: an inner product weighted by the inverse of the kernel matrix. So what we are computing here is the function f_X, which we then call the posterior mean: the function that minimizes the square distance between the predictions and the observations. It finds the least-squares solution, and that kind of idea has been studied for a long time, because it's tractable. Probably the first person — yes? So the question is: what is the little x and what is the capital X? The simple answer is that there might well be some variations of little x and capital X on this slide; I think the slide is correct if you think of a dividing line between these two equations.
Here, on the left-hand side, we have one kind of object: you notice it's just the expected value of f at the training points, capital X. So far I haven't said anything yet about other x's — that is obviously what we do for general Gaussian process prediction, but if for the moment we only think of the training points capital X, then we're just finding the estimate that minimizes the square distances, a concept that has been studied for a long time. Arguably the first person who studied it is Adrien-Marie Legendre, a French mathematician of whom we apparently don't have a proper painting; there's just this weird satirical sketch made by one of his friends during a debate, so he might have looked like this. He wrote in 1805, in French, about this method, which "is quite simple and quite general" and "consists of finding the minimum of the sum of the squares of the errors", obtained "from equations with simple coefficients which are quite easy to derive". He even says further down that he just put these derivations in an appendix, because they're not so complicated. Why? Because they're linear algebra: you can construct them analytically, with methodical, algorithmic rearrangements of the rows and columns of a matrix. So, depending on where your patriotic allegiance lies: some people say the French invented the least-squares method, and Legendre seems to have published it first. For us Germans, Gauss did essentially the same thing; he published it later, but he claims to have worked on it earlier — you can pick whichever one you like. Here he writes, in German, that the most probable system of values of the unknown quantities is the one for which the sum of the squares of the differences between the observed and the computed values is smallest. That's pretty precisely what we're doing: the most probable answer, the one with the highest probability, is the one that minimizes the square distance between the function values and the observed values. So far, Gaussian process regression — well, Gaussian regression, ignoring the process bit for a moment — is just least squares: constructing the estimate that minimizes the square distance to the observations. There are even functions called least squares in NumPy that do that for you. This is classical linear algebra, and now the question is: is that all it is, or is there something additional to it? Yes — so the question is why we even write down an optimization problem; is it because we can't have a closed form for it? No, we do have a closed form — the line above gives you the closed-form answer; it just happens to also be the solution of an optimization problem. We do do optimization, it's just linear: computing this matrix inverse and multiplying a vector from the right is optimization, just the most basic version of it. So the mean happens to be the solution of an optimization problem, the answer to which we can find using linear algebra. What this says is: it's empirical risk minimization that we're doing — it just happens to be of a form where the solution is available in closed form, because it's a quadratic loss function with a quadratic regularizer; the sum of two squares is another square, and the minimum of a quadratic can be computed with linear algebra.
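In formulas — using the notation of the slides, with training inputs X, observations y, noise variance σ², prior mean m_X and kernel matrix k_XX — the loss and its closed-form minimizer read:

```latex
\begin{align}
\mathcal{L}(f_X) &= \tfrac{1}{2\sigma^2}\,\|\,y - f_X\,\|^2
  \;+\; \tfrac{1}{2}\,(f_X - m_X)^\top k_{XX}^{-1}\,(f_X - m_X),\\
\hat f_X \;=\; \arg\min_{f_X}\,\mathcal{L}(f_X)
  &= m_X \;+\; k_{XX}\,\bigl(k_{XX} + \sigma^2 I\bigr)^{-1}\,(y - m_X),
\end{align}
```

which is exactly the Gaussian posterior mean at the training points.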
But now, what about the predictions at other points, at little x — how do they come about? You remember that the posterior mean of this Gaussian process — it's on the slide, this thing up here — is a function; I also sometimes write it with this bullet notation, for little x: bullet, and here bullet and circle, if you like. This object can be evaluated anywhere outside the training points, and that is mediated by this object called the kernel k. So what is the space that these objects lie in? We just saw that kernels can be thought of as matrices, with some caveats, and matrices span spaces. So is there a space of functions spanned by this object? Well, there is, and since Matthias Hein has not yet shown it to you, I can mention it: there's a thing called the reproducing kernel Hilbert space. That's a space that is reproduced by our kernel — I'll tell you in a moment what that means; actually, no, I can't tell you yet, because I have to first tell you the other part — and it's a Hilbert space. Hands up, who has heard of a Hilbert space before? Very good, almost everyone. A Hilbert space is a space that has an inner product and therefore a norm. Hilbert spaces are wonderful spaces, because you can measure distances in them, and we want to be able to do that — for example because we want to be able to say that our machine learns. What does it mean to learn? It means getting closer to the truth, and to get closer you need to be able to measure distance. So this is a very interesting candidate space, because it allows us to say something about the distance between an estimate and a true function, since it has an inner product. How is it created? This is the general definition of such a reproducing kernel Hilbert space, often called RKHS because that just rolls off the tongue. It's very abstract and therefore maybe not quite graspable, but in a moment we'll see other ways of representing it that are maybe more tangible. An RKHS, a reproducing kernel Hilbert space, is a Hilbert space of functions — so it has an inner product defined on functions — with a so-called reproducing kernel k such that, first of all, when you fix one of the two inputs of that function k to a particular point x from the space big X underneath, you still have a univariate function left — functional-style programming again: fix one input, and a univariate function remains — and that univariate function has to be in the space. That's the first requirement. And secondly, you can select a function value f(x) through the operation below, and that operation is the thing called reproducing, or representing, a function — that's where the word comes from. It says: compute the value of f at a particular point by taking the inner product — which you can, because it's a Hilbert space, it has an inner product — of the function with the kernel.
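Compactly, the two defining requirements are:

```latex
\begin{align}
&\text{(i)}\quad k(\cdot,\,x) \in \mathcal{H}_k
  \quad\text{for every } x \in \mathbb{X},\\
&\text{(ii)}\quad f(x) = \bigl\langle\, f,\; k(\cdot,\,x) \,\bigr\rangle_{\mathcal{H}_k}
  \quad\text{for every } f \in \mathcal{H}_k \text{ and } x \in \mathbb{X}
  \qquad\text{(the reproducing property).}
\end{align}
```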
At first this is a very abstract thing to look at. What has helped me in the past is the intuition that the kernel is a little bit like an identity. For matrices there is the identity matrix, which has the property that if you multiply a vector onto it from the right-hand side, then the i-th entry of the resulting vector is equal to the i-th entry of the original vector. There is a sum hidden in here, over j — the sum over j of I_ij times v_j gives back v_i — and here instead we have an inner product, which is not necessarily a sum but something related to it: the inner product of the kernel, with one input fixed at i, with the function v is the function v evaluated at i. That's what the kernel is: it acts a bit like a unit element. We haven't actually used this property for anything yet — the definition still works without us using it — so let's come back later to whether we really need it. Okay, the important thing is that there is a theorem that says every kernel has exactly one RKHS and every RKHS has exactly one reproducing kernel: there is a one-to-one map between kernels and RKHSs. So whenever you see a kernel, you can think of an RKHS, and in a statistical machine learning lecture, once you get to it, you'll probably do it this way: you directly identify the kernel with its RKHS, and all the analysis takes place in this space. That is a very abstract representation, so a question you could ask is: what do functions that lie in that space look like? One way to represent them is through the so-called kernel map representation. That's a more concrete way to construct the space, which says: you can think of the space of all functions that can be written as finite sums — is it even true that it has to be finite? No, it just has to be countable — so any countable sum of weighted evaluations of the kernel. Here we go. This is, again, a little bit like the identity matrix: the identity matrix has a basis consisting of the unit vectors, so you can write any vector as a linear combination of the unit vectors, the e_i. Similarly, you can write any function in the RKHS as a linear combination of copies of the kernel with one input fixed; that's a way of representing the entire space. And because it's a Hilbert space, you also need to say what the inner product actually is — well, here it is: a sum over the coefficients of the two functions, scaled by the values of the kernel at those points. And this is interesting, because it allows us to draw a connection to what we've done so far, to Gaussian process regression. Why?
Because this says: functions that can be written as weighted sums of kernels are functions in the RKHS, and there is an object in our Gaussian process regression framework which is exactly of that form. Can someone guess what it is, without thinking? The covariance matrix — maybe a little bit too vague. It's the posterior mean. The estimate that our algorithm returns is the kernel function, evaluated at the input and at all the training points, times a matrix inverse, times the vector y. Matrix times vector is just a vector, so we can think of this mean function, for any finite data set, as a finite sum over kernel functions, evaluated at an arbitrary input and the training points, with weights given by the solution to this least-squares problem. So you can think of the posterior mean in Gaussian process regression — here's the one image in this entire lecture, this thick red line in the middle — as a weighted sum of kernel functions centred, or evaluated, at the training points. Here, for this data set, you see the training points below; at each point there's a little dashed line that goes down and gives us one kernel function — the kernel here is the square-exponential kernel, the Gaussian kernel, for this picture — and we just take these kernels and weigh them by whatever the result of this least-squares operation is. Some of the numbers are negative, some are positive; you plug them in and you get this weird family of weighted kernel functions, and if you sum them up, you get exactly the red line in the middle. So the kernel ridge estimate — the kernel ridge regression estimate, the posterior mean function — actually is an object in the reproducing kernel Hilbert space.
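As a small numerical sketch of this picture (toy data and an illustrative square-exponential kernel, not the exercise code): the representer weights come from one linear solve, and the posterior mean at any test input is the corresponding weighted sum of kernels centred at the training points.

```python
import numpy as np

def k(a, b, ell=0.3):
    """Square-exponential kernel, broadcasting two 1-D input arrays to a matrix."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

# toy training data
X = np.array([0.1, 0.4, 0.5, 0.8])
y = np.sin(2 * np.pi * X) + 0.05 * np.random.default_rng(0).normal(size=X.shape)
sigma2 = 0.05 ** 2

# representer weights: alpha = (k_XX + sigma^2 I)^{-1} y
alpha = np.linalg.solve(k(X, X) + sigma2 * np.eye(len(X)), y)

# posterior mean at test inputs = weighted sum of kernels centred at the training points
x_test = np.linspace(0.0, 1.0, 200)
mean = k(x_test, X) @ alpha
```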
Well — I just said "the kernel ridge estimate", but I haven't introduced what that is yet. There is a corresponding object in statistical learning, which you will encounter pretty soon, the so-called kernel ridge estimate, and it is exactly this object. In the statistical machine learning literature it's motivated differently, but it's the object you already know, and it's motivated by saying: we minimize the so-called regularized L2 loss, where the L2 loss is this — something we already had, the square loss between observations and predictions — plus what's called the RKHS norm of f. So instead of minimizing the quadratic loss at exactly the training points, we can also minimize the loss over a space of functions, and it turns out — I'll leave the proof to Professor Hein — that you can do this minimization in closed form, because of an interesting property called the representer theorem, which allows you to represent the solution of this function-valued optimization problem in a finite-dimensional space. The object you get out has this form, and we know it already: it's the posterior mean of our Gaussian process regressor. So what this means — this is the new line down here — is that every kernel is associated with this object called the reproducing kernel Hilbert space, a space of functions that is in some sense spanned by the kernel, and we've just noticed that one object we work with, the posterior mean function of our Gaussian process regressor, happens to lie in that space and can be motivated from an empirical risk minimization perspective as the L2-regularized least-squares estimate. That might lead to a kind of knee-jerk reaction: ah, so why do we do GP regression in the first place? Why go to the probabilistic machine learning class and not just to the statistical machine learning class — that would be half as much work, much easier this term? Is it maybe the case that GP regression is just kernel ridge regression, the exact same thing — why do we need this connection at all? To answer this question, we have to think about two additional objects. Remember that our Gaussian process algorithm produced at least two outputs: a mean function and a covariance function — a posterior covariance function trained on the data — and then we used these two objects to draw from a Gaussian distribution. We need to make a connection for the other two as well, the covariance function and the samples. Let's first do the covariance function. I realize we're about 50 minutes from the end, so you may have some trouble following, but I'll do it slowly, because for me this is actually the pinnacle of this lecture. It turns out that there is a statistical interpretation for this object called the posterior covariance function — this thing that we use in Gaussian process regression, which we could have arrived at in a different way in the statistical machine learning class. If I were the one teaching statistical machine learning, I would talk about this object; I'm not sure Professor Hein will. If he does, great; if he doesn't, you can ask a snarky question about it. So, one thing you might care about if you're a frequentist — just remember how frequentist analysis works: we span a space of hypotheses, we say this is the space of hypotheses we're willing to consider, now we get some data, we would like to find an estimate within that space of hypotheses which minimizes some risk, and then we can study that risk minimizer and check whether it has some good properties, of convergence or whatever. So, we've found that minimizer: it's called the kernel ridge estimate, or, for us in this lecture, the Gaussian process posterior mean. There is a function f — we don't know it — and it lies in this big space called R to the X, and we now have some kind of smaller space called the RKHS of k. We want to find the function which is closest to f as measured by the empirical risk, and we found that thing: it's called m_X, the posterior mean of our Gaussian process. Now, what you'd like to know is how far away we are, so that we can measure how quickly we converge. To do that, you could do a worst-case analysis, because that's what frequentists like to do: you could say, how far could this estimate — let's call it m_X — be from f at worst? What's the worst that could possibly happen? Well, one problem with this is that we're in a function space, so we can construct lots of functions in that space, and they can be arbitrarily far away from m. So, to make the analysis meaningful, we have to say something additional about f: we have to assume that it has a finite norm in the RKHS. Without loss of generality, the analysis then usually goes: let's assume the norm of f is at most one, because if it's more than that, we can just multiply by a constant, and then everything we do holds up to an unknown constant. Okay, so we want to find the supremum over all such functions — oh, this picture I drew here is wrong, stupid; actually, it's good that I made this mistake, because it lets me point out what the analysis actually does.
We consider the Hilbert space reproduced by k, we say there is a function f in there, we find our m(x) from the data, and we assume that f actually is in that space because it has bounded norm; that's the definition of it being in there. What we would maybe like to do is say: f could be anywhere, how close can we get? But that's going to be very tricky, because, as I said at the beginning of the lecture, that big space really has no structure we can use to measure distances. To answer a question like that you would need to impose additional structure on f, you would need to say that it lies in some other space, so we can't do it here. We'll leave that question for another time and first assume that f is actually in the Hilbert space; then we can measure this distance. You start with the prior mean m0, and then our Gaussian process regressor, or the kernel ridge regressor, traces some kind of path as we get more and more data, and hopefully it moves towards f. That's something we would like to achieve.

So we want this supremum of the normed distance. First we plug in what m(x) actually is. This is the definition of m(x) from the previous slide: it's defined through evaluating f at a bunch of points, our data. Let's assume we've measured the data without noise, for simplicity, so y_i is equal to f evaluated at location x_i; then the posterior mean is just this expression, written the other way around, assuming a prior mean of zero. Do you follow so far? Two people are nodding, many are not anymore. So this is from the previous slide: if y is f evaluated at the training points, and we collect the weights into a vector w, then this is our posterior mean, just plugged in.

Now we use the reproducing property, the defining property of the reproducing kernel Hilbert space, which says that we can write the evaluation of f at any location, in particular at x and also at the x_i, as an inner product of f with a kernel function. So we replace every instance of f evaluated at a particular point with a kernel evaluation paired with f; that's the reproducing property of the kernel from two slides ago. Then we use the fact that the supremum of such a square, this inner product between two objects, the left-hand side, call it a, and the right-hand side, f, call it b, is reached when the two sides are aligned with each other. There's a formal way of proving this, the Cauchy-Schwarz inequality, but I'm not going to do it; you can maybe convince yourself that the way to maximize the inner product between two vectors, when one of them, here f, has length one, is to take it proportional to the other. So if we set f, on the right-hand side, equal to the left-hand side, to find the worst possible error we could have, then we get the squared norm of that object. And now we can plug in the reproducing property again. Why? What's the norm squared? It's the inner product of this thing with itself, and whenever we take inner products of such functions, the ones with a dot in them, we can use the reproducing property of the RKHS and get back actual values of the kernel. Now we have an expression that doesn't contain weird function norms anymore, just actual numbers, and the rest is simple algebra: we plug back in what w actually is, we rearrange, and we find an object that we already know: the posterior covariance.
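(Written out compactly, the chain of steps just described, under the stated assumptions: noise-free data, zero prior mean, unit RKHS norm of f. Here K is the kernel Gram matrix on the training inputs X, and the weights are written as w = K^{-1} k(X, x), so that m(x) = sum_i w_i f(x_i).)

$$
\begin{aligned}
\sup_{\|f\|_{\mathcal H}\le 1}\bigl(f(x)-m(x)\bigr)^2
&=\sup_{\|f\|_{\mathcal H}\le 1}\Bigl\langle f,\;k(x,\cdot)-\sum_i w_i\,k(x_i,\cdot)\Bigr\rangle_{\mathcal H}^{2}
&&\text{(reproducing property)}\\
&=\Bigl\|\,k(x,\cdot)-\sum_i w_i\,k(x_i,\cdot)\Bigr\|_{\mathcal H}^{2}
&&\text{(Cauchy-Schwarz, tight for aligned } f\text{)}\\
&=k(x,x)-2\,k(x,X)\,w+w^{\top}K\,w
&&\text{(reproducing property again)}\\
&=k(x,x)-k(x,X)\,K^{-1}\,k(X,x),
\end{aligned}
$$

which is exactly the posterior variance of the noise-free Gaussian process regressor.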
So you can write a theorem that says: if we use a Gaussian process prior over an unknown function and assume that we make noise-free observations, we just measure the function wherever we train, then this posterior variance, the expected squared error that the Gaussian process returns, the object we've used in previous lectures to quantify error and to draw uncertainty bars around function values (actually the square of it, but never mind), can be identified exactly with the worst-case approximation error (that should be a minus, not an equals, between the estimate, the mean, and the true function) at any location x, not just the training points, under the assumption that f lies in the reproducing kernel Hilbert space and has bounded norm.

This, I think, is quite interesting. Why? Because it gives us an intuition for how frequentist analysis relates to Bayesian analysis. You may have heard people outside this lecture hall saying that frequentist analysis is more general: it just assumes a space of hypotheses, and then we typically do worst-case analysis, we take the estimate which minimizes the risk and ask how bad it could possibly be, and we get very general statements which don't require priors. In particular, you can construct this least-squares estimate without ever talking about Gaussians anywhere. You could say: there is a function which I evaluate at a bunch of points, I regularize with the RKHS norm, this gives me the kernel ridge regression estimate which minimizes the regularized square loss, and then I can make a statement that says: no matter what f is, as long as it lies in the RKHS and has bounded norm, the error is bounded by this number. And, equivalently, the Bayesian could come in and say: I never need to talk about the RKHS at all. I just define my Gaussian process regression algorithm, which defines a prior over one function space, a prior measure over a function space, well-defined, as we saw earlier today, through a mean function and a covariance function. If I set the mean function to zero and use a likelihood that happens to be this noise-free observation model, a limit case of the Gaussian likelihood, then I get a posterior mean function, which is my expected function value, and a posterior covariance function, which in particular gives me expected squared distances, so average-case errors, between function values and the estimate. And it's given by, lo and behold, exactly the same object. The thing the frequentist analysis would call the worst-case error now happens to be the average-case error of the Bayesian.

So it seems like the Bayesian made more assumptions, because of this prior business, right, Gaussians, ooh, random numbers; but actually they ended up with a more cautious error estimate, an average-case estimate which can be interpreted as a worst-case estimate under the other form of analysis. The main thing I want you to take away from this, and here's the summary, the final slide, actually not the final slide, almost the final statement: the posterior covariance function, the average squared error of the Bayesian, is a worst-case squared error estimate in the RKHS, and you will hear about the RKHS, I am sure, in the statistical machine learning class. So when you hear people talk about worst-case and average-case analysis, it's a dangerous assumption that the worst-case analysis is always more conservative than the average-case analysis, because it depends on the assumptions underlying each framework.
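(A small numerical sanity check of this identity, again with everything made up for illustration: the squared-exponential kernel, its length scale, the training points, and the particular f, which is constructed as a finite sum of kernels and rescaled to unit RKHS norm. The check is that the pointwise error of the noise-free posterior mean never exceeds the posterior standard deviation.)

```python
import numpy as np

def k(a, b, ell=0.5):
    # squared-exponential kernel, same assumed choice as above
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / ell**2)

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, 8))       # training inputs, observed noise-free
xs = np.linspace(-3, 3, 200)             # test inputs

# a "true" f inside the RKHS: a finite sum of kernels, rescaled to unit RKHS norm
Z = rng.uniform(-3, 3, 5)
c = rng.standard_normal(5)
c /= np.sqrt(c @ k(Z, Z) @ c)            # now ||f||_H = 1
f = lambda t: k(t, Z) @ c

K = k(X, X) + 1e-10 * np.eye(len(X))     # tiny jitter for numerical stability
m = k(xs, X) @ np.linalg.solve(K, f(X))  # posterior mean on noise-free data
var = k(xs, xs).diagonal() - np.einsum(
    "ij,jk,ik->i", k(xs, X), np.linalg.inv(K), k(xs, X)
)                                        # posterior variance

# worst-case bound: |f(x) - m(x)| <= posterior std everywhere (up to numerical noise)
print(np.max(np.abs(f(xs) - m) - np.sqrt(np.clip(var, 0.0, None))))
```

The printed number should be at most zero (up to floating-point noise), for any f constructed this way with unit norm.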
And what has happened here is that this frequentist analysis made a pretty strong assumption, namely that f lies in the RKHS, and under that assumption you can give a worst-case error estimate. But we started the lecture by saying the GP is only about this very big space R^X, so we haven't actually assumed that the function lies in the reproducing kernel Hilbert space; we just use this kernel, which happens to be our covariance function, and we do regression with it. And when we do that, the expected error to any function drawn from this process happens to be exactly the same algebraic quantity. That maybe gives you an intuition that the GP sample space, the space the GP draws from, is probably not that space, because if it were, our average-case estimate would also be a worst-case estimate, since it is a worst-case estimate in the RKHS. So we kind of know, or at least feel, that the GP is probably a broader distribution, somewhere between the RKHS and the full space.

I'm going to close the lecture by pointing out that this is in fact the case, using three theorems that happen to be true but are not entirely obvious. The first one is, and we'll leave out the proof, I'll just say something: there's another way of thinking about the RKHS, related to the one we just saw. A few slides ago I showed you the view of the RKHS as the space of functions that are weighted sums of kernel evaluations. But think back: earlier in the lecture I said you can also think of the kernel as a countably infinite sum over eigenfunctions. You can combine these two statements and say: you can think of these functions as weighted sums of eigenfunctions. In particular, the RKHS is also the space of all weighted sums of eigenfunctions with the property that the sum of their squared coefficients, weighted by the inverse eigenvalues, is finite. And then we can also compute the inner product in a maybe more direct way, because this representation gives us the coefficients directly.

And then it turns out that there is actually a way of writing draws from a Gaussian process in terms of the eigenfunctions. It's called the Karhunen-Loève expansion, after Kari Karhunen, who was Finnish, and Michel Loève, who was French. It says that you can draw from a Gaussian process, or rather you can think of draws from a Gaussian process, because you can't really do it, it's an infinite process, by taking the set of all eigenfunctions, weighing each one by the square root of its eigenvalue, and drawing standard Gaussian random variables i.i.d., so you draw coefficients from independent N(0, 1) Gaussians. This is the infinite-dimensional version of the thing I wiped out, or that was on a previous slide: one way to draw from a Gaussian process on a finite-dimensional domain is to build the covariance matrix, take a singular value or eigenvalue decomposition, which is what this expansion does, multiply from the left with standard Gaussian random variables, and add the mean, posterior or prior, whichever mean you have. So this is a way of thinking about draws from a Gaussian process. You can't actually do it in practice, of course, because the set of eigenfunctions is in general countably infinite, so you can't implement it as a finite-time algorithm, but you can write it down like this and use it for analysis.
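(A finite-dimensional sketch of this idea, with an assumed grid, kernel, and truncation level: on a grid, the eigendecomposition of the covariance matrix plays the role of the eigenfunctions and eigenvalues, and a truncated Karhunen-Loève-style draw is a weighted sum of eigenvectors with i.i.d. standard normal coefficients.)

```python
import numpy as np

def k(a, b, ell=0.5):
    # squared-exponential kernel, same assumed choice as before
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / ell**2)

x = np.linspace(-3, 3, 400)          # fine grid standing in for the domain
K = k(x, x)

# eigendecomposition of the covariance matrix: the discrete analogue of the
# eigenfunctions and eigenvalues in the Karhunen-Loeve expansion
lam, phi = np.linalg.eigh(K)
lam, phi = lam[::-1], phi[:, ::-1]   # sort eigenpairs in descending order

r = 30                               # truncation level (the true expansion is infinite)
rng = np.random.default_rng(2)
z = rng.standard_normal(r)           # i.i.d. N(0, 1) coefficients

# truncated Karhunen-Loeve-style draw: sum_i sqrt(lambda_i) * phi_i * z_i
sample = phi[:, :r] @ (np.sqrt(np.clip(lam[:r], 0.0, None)) * z)
```

As r grows, the covariance of `sample` approaches K, which is why the finite-dimensional recipe (matrix square root times standard normals, plus the mean) works.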
Actually, I can put one more thing in this tiny bit of space below. If you draw your functions like this and then compute the expected RKHS norm of these objects, and you go back to the previous slide, you see that we can compute norms in this representation: that's the norm, that's the inner product, the norm of course being the inner product of f with itself. If we do that, we get the expected sum of squares of the coefficients. The draws are i.i.d., that's the assumption up here, they are independent of each other, so we can take the expectation inside the sum, and they are standard Gaussians, so the expected square is just the variance, which is 1. So we get an infinite sum over 1, which is not a series that converges. So draws from the Gaussian process are simply not in the RKHS; they live in a space that is broader than this space.
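(The same computation in symbols, using the eigenfunction form of the RKHS norm and the Karhunen-Loève coefficients c_i = sqrt(lambda_i) z_i with z_i i.i.d. standard normal:)

$$
\mathbb{E}\bigl[\|f\|_{\mathcal H}^{2}\bigr]
=\mathbb{E}\Bigl[\sum_{i}\frac{c_i^{2}}{\lambda_i}\Bigr]
=\sum_{i}\frac{\lambda_i\,\mathbb{E}[z_i^{2}]}{\lambda_i}
=\sum_{i}1=\infty ,
$$

so a Karhunen-Loève draw has infinite expected RKHS norm, which is the sense in which the samples fall outside the RKHS.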
And I think I will leave it at that and try to summarize what we've done. Today was an abstract lecture; no worries, next Monday (by the way, there's no lecture on Thursday, public holiday) we will return to something a bit more tangible, still not many pictures, but more tangible. Today we had to lay some concrete foundations.

We saw that these objects called Gaussian processes are well defined. Even though we define them over functions, they are well defined as soon as we can implement them, I'll say, on a Turing machine. So if you can write a piece of Python code that implements a kernel on your input domain, where implementing a kernel means writing a function that returns arrays, and arrays are things you can slice into and whose elements you can permute, then you can uniquely define a Gaussian process on that space. That's good, but it doesn't by itself provide many analytic tools for thinking about what the sample paths of this distribution over functions will actually look like. So if you want to do some kind of analysis, for example to convince yourself that this is a learning machine, that its estimates converge towards some true value, you sometimes need to, let's say, hallucinate additional structure: you need to find some additional structure to think with, and that structure will have to come from the kernel. So you stare at the kernel, and you realize that kernels can be written, relative to some base measure, as a kind of infinite-dimensional matrix, and also that they span interesting spaces called reproducing kernel Hilbert spaces.

A knee-jerk reaction might be to study the reproducing kernel Hilbert space, and that is not a stupid idea; it's just not the complete answer. It's not a stupid idea because it allows us to draw a connection to kernel machines: it turns out that there is a corresponding algorithm, called kernel ridge regression, which produces point estimates exactly equal to the posterior mean of the Gaussian process, and it can be made to produce worst-case error estimates if you really want to. That's not something usually done in kernel ridge regression, but you can construct them the way we did, and then you discover that the worst-case approximation error, for functions inside the reproducing kernel Hilbert space with bounded norm, is exactly equal to the other object our code returns, the posterior covariance. So in this sense posterior covariances are uncertainty estimates: they are worst-case error estimates inside the RKHS, but they are also, by definition, average-case error estimates over the sample space of the GP, and that sample space is very patently not the RKHS.

That's the final observation we made, and it is why we have a course on statistical machine learning and one on probabilistic machine learning: one in which you learn to think in terms of losses, minimizing loss functions and then analyzing worst-case errors, and one in which we think in terms of distributions, hypothesis spaces from which we draw random variables, and then computing the remaining measure on the hypothesis space conditioned on the observations, the conditional distribution; that's called Bayesian inference. And it turns out that in the particular case of least squares, or Gaussian regression, these two are very closely linked: they produce the same point estimate and the same algebraic objects as their error estimates. It's just that we interpret them very differently, one time as a worst-case error, one time as an average-case error, and the underlying objects are different: in one case we just return a point estimate, called the ridge regression estimate, and in the other case we return an entire distribution, called the Gaussian process posterior, and that distribution is broader even than the space underlying the worst-case error estimate. Okay, I'll leave it at that. Next Monday Marvin will be here, because I'm gone next week, to talk about other aspects of Gaussian process regression. Thank you.