Hello and welcome to Probabilistic Machine Learning, lecture number 10. In the past few lectures, namely since lecture 6, we have been talking in quite some detail about Gaussian distributions. We found that Gaussian distributions directly link linear algebra and probabilistic inference; to put it another way, they can drastically simplify probabilistic inference, which is otherwise potentially exponentially, combinatorially hard, by turning the computations required for probabilistic inference into linear algebra, whenever all variables are jointly Gaussian distributed and linearly related. We saw that this nice aspect of Gaussian distributions allows us not just to reason about individual variables, but even to learn, quite generally, real-valued functions mapping from an arbitrary input domain to the real line. That is a powerful framework. We started by phrasing the functions we are trying to learn as finite sums over individual features. Then we saw that we can actually learn these features, because there is such a big choice among them: if we fix a certain parameterized family of features, we can adapt them to data in a process that has quite a lot to do with deep learning, and which is so far only weakly motivated by probabilistic inference, in as much as it just maximizes the posterior probability of the feature choices under the model. We will return to that in later lectures. Then, in the last lecture, we saw that there is another way to make these Gaussian parametric regression models more powerful: not to adapt a finite set of features to the data, but instead to expand the set of features, pushing an increasingly wide neural network towards infinitely many features and learning all of them at once. This is possible through a process called Gaussian process regression, which is empowered by an internal computation called the kernel: it sums over infinitely many features in closed form, in a finite-time operation.

What we are going to do today is a little different from previous lectures, in so far as we will, not take a detour exactly, but slow down for a moment to appreciate the beauty of this kernel idea and think about it in a bit more detail. This also means that today's lecture is going to be relatively theoretical and quite mathematical, and I know that is not to the liking of the entire audience. Human minds work differently: some people need pictures and geometric intuition to understand a problem; other people need code, an algorithmic, procedural description of how something works in a formal language; and yet another group needs mathematical abstractions, symbols, to understand what is going on. Today's lecture is for that final, third group; there are other lectures in the course which cater more to the former two groups. What I am going to try to do today is think with you about the kinds of questions that might have arisen during the last lecture, when I introduced Gaussian processes and, before that, kernels. While watching that lecture, you may have had questions in mind that boil down to the following three.

The first one: what is this kernel object that was just introduced, with a certain notation attached to it? How should I think about these kernels; are they just some infinitely large matrices? Well, let's see. In a second step, I want to make a connection to statistical machine learning, because some of you might be taking the parallel course by my colleague Ulrike von Luxburg, or maybe you have a statistical background and are coming to this probabilistic view from another direction. You may be wondering: I have heard about kernel machines, of course; I know about support vector machines and other statistical machine learning algorithms motivated from the kernel perspective. How do these relate to this Gaussian process business; are they somehow separate or not? We will talk about that for the majority of this lecture. And a final question: if Gaussian processes, or kernel regression algorithms, correspond in some sense to an infinitely wide neural network, and I have heard that neural networks are universal function approximators, are these infinitely wide networks universal learning machines? Can they learn any function, and what would that even mean? We will talk about that a little as well.

Let's start with the first question: what is this kernel actually, and is it an infinitely large matrix? A short warning first: because we only have one lecture and time is finite, I will have to significantly simplify some of the statements I make today. I want to tease out some important insights, and for those I will often leave out certain technical aspects that are not important for the intuition, but which are of course important if you want to really do mathematics with them. So apologies to the strict mathematicians among you if I am a little hand-wavy here and there. If you don't like that, you are very much invited to read more detailed introductions: there is, for example, a very precise book by Steinwart and Christmann, titled Support Vector Machines, which has a very beautiful introduction to kernels as well. And if you are specifically interested in the connection between Gaussian processes and kernel machines, there is a paper written primarily by Motonobu Kanagawa and Bharath Sriperumbudur; I am actually a co-author on this paper, which arose from a Dagstuhl workshop a long time ago, with Dino Sejdinovic as well, where we tried to make the connection more precise. That paper has been in review for over two years; we are still working on the final version, and a new, even better version will come out in a few months. There you can really find the detailed explanations of how certain concepts in the kernel world connect to Gaussian processes.

Now, though, let's think about kernels, and whether they actually are infinitely large matrices. To motivate why we have to talk about this at all: I keep noticing that a certain subset of the audience always has a really hard time understanding what a kernel is. So let me be precise again about my notation. I write objects like k_AB, and what I mean is a matrix that arises in the following sense. There is a set A = {a_1, ..., a_n}, a collection of entries from some space X, and another collection B = {b_1, ..., b_m}, also a subset of the same space. Then there is a function k that maps pairs of such entries to the reals, and k_AB is the matrix in R^(n x m) that arises by evaluating k on every possible pair of entries from A and B: its entry (i, j) is [k_AB]_ij = k(a_i, b_j).
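To make this notation concrete, here is a minimal sketch (my own illustration; the function names are not from the lecture) that builds such a matrix for a Gaussian kernel, and checks a property we will need in a moment, namely that on one and the same set, the matrix k_AA of a Mercer kernel is symmetric positive semidefinite:

```python
import numpy as np

def gaussian_kernel(a, b, lengthscale=1.0):
    """Gaussian (RBF) kernel k(a, b) = exp(-(a - b)^2 / (2 l^2))."""
    return np.exp(-((a - b) ** 2) / (2 * lengthscale ** 2))

def gram_matrix(k, A, B):
    """The matrix k_AB with entries [k_AB]_ij = k(a_i, b_j)."""
    return np.array([[k(a, b) for b in B] for a in A])

A = np.array([-1.0, 0.0, 0.5, 2.0])        # n = 4 entries from X = R
B = np.array([0.3, 1.7])                   # m = 2 entries from X
K_AB = gram_matrix(gaussian_kernel, A, B)  # a 4 x 2 matrix

# On a single set, the Gram matrix of a Mercer kernel is symmetric
# positive semidefinite: its eigenvalues are real and non-negative.
K_AA = gram_matrix(gaussian_kernel, A, A)
assert np.allclose(K_AA, K_AA.T)
assert np.all(np.linalg.eigvalsh(K_AA) >= -1e-12)
```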
In particular, this way of writing is suggestive of a matrix: we can also write k_ab = k(a, b), which is just a scalar, the evaluation of the function k at the locations a and b. That looks a lot like having a matrix with an entry (i, j), which is also just a real number. So maybe this object is like a computer program that lets us evaluate subsets of a potentially infinitely large matrix. Wouldn't that be interesting? Why would it be interesting? Not just because we use this notation; if it were merely a convenient way to write down a function, that would not mean anything yet. But maybe there is more to it: if you can think of a kernel as a large matrix, then some of the concepts for matrices might translate over to this world.

What kind of properties might we be interested in? There is, of course, a beautiful, large theory describing the space of matrices in a very powerful language, called linear algebra, and it contains statements about objects called eigenvalues and eigenvectors. Here is a quick refresher: for square matrices A, there are vectors v such that when you multiply the matrix with the vector, you get back the same vector, scaled by a scalar: A v = lambda v. Those vectors are called eigenvectors, and the numbers lambda are called eigenvalues. These vectors are interesting for various reasons, one of them applying at least to symmetric matrices. And our kernels actually produce symmetric matrices: the Mercer kernels we talk about have the property that the matrices k_AA you construct on one and the same set are symmetric positive definite. In particular, convince yourself, as a short mental exercise, that this means the function k itself has to be symmetric. Symmetric positive definite matrices have the property that the eigenvectors span the matrix, in the following sense: the eigenvectors are real vectors, they form a basis of the image (and in fact of the pre-image), and the matrix can be written as a sum of outer products of the eigenvectors, scaled by the eigenvalues. The eigenvalues are non-negative real numbers, and the eigenvectors are orthogonal to each other; they can be made orthonormal by scaling.

So one question you could ask is: is there a similar statement for kernels? Are there special concepts for kernels that allow us to span these symmetric positive definite objects in some sense, and what is their role? It turns out there are such objects, and they are a generalization of the notion of an eigenvector, called an eigenfunction. Our kernels map from the Cartesian product of a large space with itself to the reals, so on that large space we need some kind of infinitely long vectors, and those are functions.

It turns out that there is such a concept, with a few caveats, and it is due to the English mathematician James Mercer, who lived at the turn of the nineteenth to the twentieth century and taught in Cambridge (he was born, I think, just around Liverpool). His result is in many ways the mathematical basis for kernel machines, and by extension also for Gaussian processes. So far this is just a definition: consider objects phi, which we will call eigenfunctions, with the property that if you integrate the kernel, a bivariate function, against phi with respect to some measure nu,

$$ \int k(x, \tilde x)\, \phi(\tilde x)\, \mathrm{d}\nu(\tilde x) \;=\; \lambda\, \phi(x), $$

then you get back the function evaluated at the other entry x, scaled by a scalar lambda, which can then be called an eigenvalue. Notice that this is quite similar to the definition of an eigenvector: there, we have a sum over the entries j and get back the entry i of the vector v; here, we have an integral over the entry x-tilde and get back the entry x of phi. You can imagine that for this to work you need all sorts of technical constraints (for example, the whole space has to be measurable, otherwise you cannot even talk about this integral, and so on); this is one of those points where I leave out technical aspects.

Mercer showed that in the specific case where X with nu is a finite measure space and the kernel is of this Mercer type, so exactly the kind of kernels we have been talking about, symmetric positive definite kernels, kernels which create symmetric positive definite matrices, such eigenvalues and eigenfunctions with respect to the measure nu actually exist. There is a countable set of these eigenfunctions and eigenvalues, which can be used to span the kernel in the following sense: the eigenvalues are non-negative and real, the eigenfunctions are orthogonal and can be made orthonormal by scaling, and the series

$$ k(x, \tilde x) \;=\; \sum_{i} \lambda_i\, \phi_i(x)\, \phi_i(\tilde x) $$

converges absolutely and uniformly, nu-squared almost everywhere. That means that, up to essentially a set of measure zero in the input domain, you can write the kernel by taking the outer product, if you like, of the kernel's eigenfunctions and scaling with the eigenvalues. So in this sense, kernels really are infinitely large matrices.

This statement also already shows the caveats. The first, and maybe most prominent one, is that all of these statements are relative to some measure nu. This is perhaps not so surprising, because we are talking about an object defined on a continuous domain, and on such continuous domains you have to say what it means to integrate. Depending on how you think about your domain, you will get different eigenfunctions. For matrices this is not a problem, because the natural numbers come with a natural way of counting from zero towards large numbers; for continuous domains there is no such thing, and you have to say how you distribute volume over your space even to find out what the eigenfunctions are. And if you change that measure, you typically get quite different eigenfunctions out.
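You can see this measure dependence numerically. The following sketch (my own illustration, not part of the lecture) discretizes the integral operator with Monte Carlo samples from nu, which is essentially the standard Nyström approximation: the eigenvalues of the scaled Gram matrix K/m converge to the operator's eigenvalues, so changing nu changes the estimated spectrum.

```python
import numpy as np

def mercer_eigs_nystrom(k, nu_samples):
    """Approximate the eigenvalues of the integral operator
    f -> integral of k(., x') f(x') dnu(x'), using samples from nu.
    """
    X = nu_samples
    K = k(X[:, None], X[None, :])         # Gram matrix on the samples
    lam = np.linalg.eigvalsh(K / len(X))  # ascending, real, >= 0 (up to rounding)
    return lam[::-1]                      # largest first

k = lambda a, b: np.exp(-((a - b) ** 2) / 2.0)
rng = np.random.default_rng(0)

# Same kernel, two different measures nu -> noticeably different spectra:
print(mercer_eigs_nystrom(k, rng.normal(size=500))[:5])
print(mercer_eigs_nystrom(k, rng.uniform(-5.0, 5.0, size=500))[:5])
```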
Another question, of course, is whether it is even useful to think about these eigenfunctions if we do not know what they are; the statement is not constructive, it just says they exist. For matrices, we have algorithms that construct eigenvectors, and those algorithms are in general cubically expensive in the size of the square matrix. In the continuous world where kernels live, we cannot use these algorithms: we are on an uncountably large domain, where you cannot just perform the operations that the eigenvector-construction algorithms of linear algebra would use. So you might be wondering what this statement actually buys us if we do not even know what the eigenfunctions are. And indeed, for general kernels it is usually not possible to just guess the eigenfunctions. If we could, that would be wonderful, because we could drop the computational cost of Gaussian process regression from cubic in the size of the data to linear in the size of the data, since we could directly write down the inverse of the kernel Gram matrix.

It turns out that this is actually possible for certain specific kernels. There are very specific families of kernels for which it is possible to explicitly compute the eigenfunctions, or at least something very closely related to them, and the most famous, maybe actually the only real, example is due to this man, Salomon Bochner. He was born in what is today Poland, was educated in Berlin, and habilitated in Munich, where, during his habilitation work, he created the result I am going to present in a second. He then had to emigrate to the US due to the rise of the Third Reich, became a professor in Princeton, and took American citizenship; he lived there for the rest of his life. He introduced the theorem that is now named after him, Bochner's theorem, which states, at least from our perspective: if the kernel is a function not of its two inputs separately, but only of the difference between the two inputs (such kernels are called stationary), so if you can write the kernel of a and b as a function of the single value a minus b, then you can think of this kernel as the Fourier transform of a probability measure. Here is the statement: consider a complex-valued function k on R^d. It is the covariance function of a weakly stationary, mean-square continuous, complex-valued random process (which is what our stationary Gaussian process is) if, and only if, it is the Fourier transform of a probability measure. That is, you can write the kernel of the difference tau as the Fourier transform of some measure mu, and mu is a probability measure. In symbols:
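Written out (suppressing some of the technicalities of complex-valued processes; sign and scaling conventions for the exponent vary, this is one common choice):

$$ k(a, b) \;=\; \kappa(a - b), \qquad \kappa(\tau) \;=\; \int_{\mathbb{R}^d} e^{\,i\,\omega^{\top}\tau}\,\mathrm{d}\mu(\omega), $$

with $\mu$ a probability measure over the frequencies $\omega$.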
This almost means that you can see the result of Mercer's theorem in this representation: if you take tau to be a minus b, then you can write the expression as

$$ k(a, b) \;=\; \int_{\mathbb{R}^d} e^{\,i\,\omega^{\top} a}\; \overline{e^{\,i\,\omega^{\top} b}}\;\mathrm{d}\mu(\omega), $$

and you can sort of see an outer product of orthonormal basis functions in it. This is not entirely true, because Mercer's theorem talks about a countable set, and this is not a countable set, so there are a few caveats; but it is almost like saying that the eigenfunctions of a stationary kernel are these Fourier basis functions. Remember Euler's formula: e^(-ix) = cos(x) - i sin(x).

This statement, which is quite abstract and very powerful, can actually be used to derive concrete algorithms, and here is how that works; this is where you get from formal mathematics to machine learning. For specific kernels, you can compute this Fourier transform explicitly. For the Gaussian kernel this is particularly easy, because the Fourier transform of a Gaussian is again a Gaussian: if you have a Gaussian kernel, then you know it is the Fourier transform of a Gaussian measure, and the corresponding probability measure over the frequencies factorizes over the individual frequency dimensions. So what you can do is draw random frequencies from this Gaussian measure and use them to construct a Monte Carlo estimate of the posterior mean of a Gaussian process regressor. This idea has various names; it is called random Fourier features, and people have used the phrase 'random kitchen sinks' for it. It is due to Ali Rahimi and Ben Recht, or at least it was formalized and published by them at NeurIPS in 2008. This is the step, about a hundred years forward, from a pure mathematical description of a complicated positive definite object to a concrete algorithm that uses insights about kernels, and it can drastically reduce computational cost, at least in low-dimensional input spaces, here specifically with the Gaussian kernel for Gaussian process regression.
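Here is a minimal sketch of that idea for a one-dimensional Gaussian kernel (my own illustration; the variable names and the check at the end are not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
ell, D = 1.0, 500              # lengthscale and number of random features

def rff(x, W, b):
    """Random Fourier features: phi(x)^T phi(y) approximates k(x - y)."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

# For the Gaussian kernel, Bochner's measure is itself Gaussian, so the
# random frequencies are drawn from N(0, 1/ell^2) in each dimension.
d = 1
W = rng.normal(scale=1.0 / ell, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

x, y = np.array([0.3]), np.array([1.1])
k_exact = np.exp(-np.sum((x - y) ** 2) / (2 * ell ** 2))
k_approx = rff(x, W, b) @ rff(y, W, b)
print(k_exact, k_approx)       # close for large D

# With explicit features phi, regression costs O(n D^2) instead of
# O(n^3): solve (Phi^T Phi + sigma^2 I) w = Phi^T y in feature space.
```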
So what we have just seen is that you can think about kernels as infinitely large matrices, with a few caveats: they have something called eigenfunctions, which are only defined relative to some measure, and eigenvalues, which are non-negative. Usually you cannot guess what the eigenfunctions are; in some specific cases you actually can; and even when you cannot, the insight is still useful, because it often allows us to do mathematical analysis of the associated machine learning machinery. That is what we are going to do for the rest of this lecture.

For the rest of this lecture, I want to step outside of the probabilistic framework a little and look at it from the outside, if you like. The reason is that Gaussian process regression in particular is one very interesting point where the probabilistic formalism is very closely connected to other ways of thinking about, let's call it scientific inference, extracting information from data. That is also why you may have seen the kind of computation we needed for Gaussian process inference in various other contexts before. Depending on which community you come from and what you studied before you took this course, you might have heard of essentially the algorithm I have been describing as Gaussian process regression under various other names. It is known, for example, as kriging, a term mostly due to Georges Matheron, although Danie Krige was a geoscientist in South Africa who came up with this very similar, arguably the same, framework for prospecting for gold, quite a while ago; I am not going to name a year, because I would probably get it wrong. You may also have heard of kernel ridge regression if you have taken a statistical machine learning class, or of other forms of statistical inference. If you come from the signal processing or control perspective, you might have heard of Wiener-Kolmogorov prediction, or indeed of linear least-squares regression. That final term is probably the best description of what is really going on here, because, as I have said before, Gaussian process regression is very closely related to finding the minima of quadratic functions; it is just that when we do probabilistic inference, we keep track of the entire function, which in this case is an exponential of a quadratic function, rather than just its maximum.

To understand where this connection comes from (this is maybe the most fundamental and easiest-to-understand part of today's lecture), let's look at the posterior of a Gaussian process again. More specifically, let's talk about the posterior over the function values at some location x, given the data y. By Bayes' theorem, this posterior is the prior times the likelihood, divided by the evidence, as we know. We have chosen a prior that is a Gaussian process, and a likelihood that is also Gaussian, connecting to the function values by evaluating at particular locations X; the evidence then has a form that can be computed by evaluating the PDF of a Gaussian. We computed all of this in the past few lectures, specifically the last one; it is just a variation of things we have seen before. Now, for simplicity, let's focus for a moment on the function values f_X that the true latent function obtains at the training locations X, the locations where we have data, and consider the maximum of this posterior distribution. This posterior is a Gaussian distribution (not a Gaussian process anymore, because it is a finite number of function values): a Gaussian over f_X with a particular mean and variance. The maximum of this probability density function is, of course, exactly at the mean; that is just a property of Gaussian distributions.

We can also think about this maximum in another way. The location where the PDF obtains its maximum is the location of the minimum of minus that function, and in particular of the minimum of minus the logarithm of that function, because the logarithm is a monotonic transformation and therefore does not shift the location of the optimum. If you take the logarithm of the posterior, you get the log of the likelihood plus the log of the prior, minus the logarithm of the denominator. The denominator does not depend on f_X, so it is just a constant; it does not shift the location of the minimum, and we can forget about it. That is one of the nice things about statistical estimation: you do not have to deal with the complicated normalization integral. What we are left with is the logarithm of the likelihood plus the logarithm of the prior. And what are those? Gaussians are exponentials of negative squares, so if you take their logarithm and then the negative of that, you just get squares. So the mean of the Gaussian process posterior, the point estimate if you like of our Gaussian process algorithm, the thick line I keep plotting in the middle of my plots, is given by the minimum of a sum of two quadratic functions: an empirical risk, which, assuming independent noise, is a sum over (y_i - f(x_i))^2, a square loss; plus a regularizer, which penalizes the distance of the estimate from the prior mean function (which we often set to zero), weighted by the inverse of the kernel Gram matrix at the training locations. So what this essentially is, is a regularized least-squares problem. Written out:
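(A sketch with prior mean zero; if the prior mean is non-zero, shift everything by it:)

$$ \bar f_X \;=\; \arg\min_{f_X \in \mathbb{R}^n} \;\; \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - [f_X]_i\bigr)^2 \;+\; \frac{1}{2}\, f_X^{\top} K_{XX}^{-1}\, f_X , $$

and setting the gradient to zero gives exactly the Gaussian process posterior mean at the training locations, $\bar f_X = K_{XX}\,(K_{XX} + \sigma^2 I)^{-1}\, y$.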
Least-squares estimation is a fundamental part of scientific inference, and the reason is that it is so easy to compute, for exactly the same reason we like Gaussians in probabilistic reasoning: quadratic functions are so nice to work with. Cuts through quadratic functions are quadratic functions, projections of quadratic functions are quadratic functions, and linear maps of quadratic functions are quadratic functions, just as the very same statements apply to Gaussian distributions. This is really why the idea of estimating quantities by assuming quadratic losses, or equivalently Gaussian priors and likelihoods everywhere, has been around in science for a very long time. In fact, there can be a debate about who invented it, and the two good contenders, depending on your national allegiance, are Gauss and the French mathematician Legendre, who arguably came up with it first. Here is an original text by Legendre from 1805. By the way, there is apparently no proper picture of Legendre available: you can find various pictures online of a man called Legendre, but that is probably a different person who also goes by that name, not the mathematician. There seems to be only one picture available, drawn by the artist Julien-Léopold Boilly, who apparently drew it as a kind of caricature of various mathematicians; Legendre must have been a fun guy, if this picture is anywhere close to his personality. Legendre, already in 1805, wrote about the problem everyone was trying to solve back then, the determination of the orbits around the Sun of various objects (in Legendre's case not planets but comets), and he introduces the methodology he has been discussing, which he calls the méthode des moindres carrés, the method of smallest squares. A few years later, Gauss writes about the very same method, in his case, as far as I know, to estimate the trajectory of a planet around the Sun, essentially the same problem. He is looking for the estimate of which he writes that the most probable system of values of the unknown quantities p, q, r, s, etc. is the one in which the sum of the squares of the differences between the observed and the computed values of these functions is smallest. The smallest sum of squares: here we go again.

Now, it might be that Gauss was not entirely aware of Legendre's work, or maybe he was and did not want to talk about it; it does not matter. When scientists started thinking about how to estimate unknown objects, in this case trajectories of objects around the Sun, they started building estimates that minimize a squared distance. And interestingly, the text in which Gauss introduces the probability distribution we now call the Gaussian distribution already talks about probabilities quite explicitly. Long story short: Gaussian estimation is closely related to least-squares estimation, and whether you come from a community that thinks in terms of least squares, or from one that thinks in terms of Gaussian distributions, does not matter much; the two are very closely related to each other, and it makes no sense to try to separate them and claim they are very different things. What we are going to do today is tease apart where exactly the minor, technical, specific differences lie between Gaussian estimation and least-squares estimation, in particular, of course, in the context of the most powerful framework in this setting, kernel machines, where we have an infinite set of degrees of freedom available.

Many of you will be taking, either online or here in Tübingen, the parallel course by my colleague Professor von Luxburg on statistical machine learning, where you will have seen, or will get to see, slides like these when she discusses kernel machines. Maybe you have not taken that course, but another class on statistical machine learning, statistical estimation, or the theory of machine learning; sooner or later, you will come across this notion of kernel machines. It is usually first introduced coming from the direction of support vector machines, and then moves towards regression; here we take the other direction, starting with regression and going to classification later. Because there is this parallel course, I am not going to try to reproduce it. It is by design, here in Tübingen, that these two courses take place in parallel and are both compulsory for the students of the machine learning master's program: we want to deliberately guide you towards thinking about the theory of machine learning from both of these directions. In other courses, you might not even get to see both sides; in many parts of our community as a whole, there is still the impression that the theory of machine learning is a struggle for supremacy between the probabilistic, Bayesian viewpoint and the statistical, frequentist viewpoint. We typically believe that this is an anachronistic view, and that it is better to combine the strengths of both viewpoints by getting you to think about machine learning from both perspectives. So even though we try to keep the courses separate, every now and then, and today is one of those points, we make specific connections between the two, by explaining how the language of one course is to be understood in the context of the other.

Whether in Professor von Luxburg's class or any other course on the theory of machine learning, if you have learned about kernel machines, you will have come across the notion of a reproducing kernel Hilbert space. It is usually introduced in an abstract form (Professor von Luxburg, I believe, does not do it this way; she does it the other way around), but eventually people arrive at this very abstract definition of what a reproducing kernel Hilbert space is. It is usually abbreviated RKHS, which is quicker to say, so I will often say RKHS as well. An RKHS is a Hilbert space of functions; a Hilbert space is a space endowed with an inner product, a symmetric, positive definite bilinear form. This Hilbert space of functions contains a specific object, the kernel k. First, the kernel is an element of that Hilbert space: if you take this kernel, which as we know is a bivariate object, and consider it as a function of just one of its inputs, say the left one, fixing the right one, then no matter what the right-hand input is, this function k(., x) is always in the Hilbert space. Second, and this is the intriguing property, called the reproducing property: every function value of every function in the Hilbert space can be written as an inner product of that function with the kernel, seen as a univariate function, evaluated at x on the right-hand side. If you want to know what f(x) is, you take the entire object f in the Hilbert space, apply the inner product with k(., x), which is an element of the Hilbert space, and you get out f(x). It is like picking out an individual function value. In symbols, the two defining properties are:
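(Writing H for the RKHS and X for the input domain:)

$$ k(\cdot, x) \in \mathcal{H} \;\;\text{for all } x \in \mathbb{X}, \qquad \langle f,\, k(\cdot, x) \rangle_{\mathcal{H}} \;=\; f(x) \;\;\text{for all } f \in \mathcal{H},\; x \in \mathbb{X}. $$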
It turns out that there is a one-to-one map between kernels and such Hilbert spaces, and the beautiful thing about having the parallel course is that I do not have to do proofs for any of this; I can just claim it is true, and if you want to hear the proof, you can ask in the statistical machine learning class. Now, this is the object of interest in the analysis of kernel machines in statistical machine learning, so let's use it to see how it connects to what we have been doing so far: posterior mean functions, posterior covariance functions, and samples of Gaussian processes. In doing so, we will see that here, in the domain of least squares, or Gaussian regression, or kernel ridge regression, or kriging, whichever word you want to use for this concept, these two schools of thought, probabilistic and statistical inference, are actually extremely close to each other.

To see that, we first need to understand what this RKHS actually is from our perspective, and for that it is helpful not to use the abstract definition, but an alternative one. There are actually several different representations of the RKHS; the abstract one above is the reproducing representation, if you like. There is also the kernel map representation, which I am not going to prove, because Professor von Luxburg, I believe, does the proof in her lecture. It says that you can also think of the RKHS as the Hilbert space of all functions that can be written as countable sums of kernel evaluations at various locations x_i, weighted by real weights alpha_i, that is, f = sum_i alpha_i k(., x_i), endowed with the inner product given, for f = sum_i alpha_i k(., x_i) and g = sum_j beta_j k(., x_j'), by the double sum over the weights and the kernel values, <f, g> = sum_ij alpha_i beta_j k(x_i, x_j').

Why is that interesting? Because we have an object in our Gaussian process regression algorithms that looks a lot like this: the posterior mean. If you assume that the prior mean is zero (the prior mean is not that important; if it is non-zero, you can just shift everything by it), then the posterior mean looks exactly like this. To be a little more precise: assume we use a Gaussian process prior over functions f with a zero mean function and a covariance function given by a kernel, and the generic Gaussian likelihood, which assumes that the data y amount to evaluating the function f at locations X and adding Gaussian noise. Then the posterior mean is, as we have seen in previous lectures, m(x) = k_xX (K_XX + sigma^2 I)^{-1} y. What is that object? You take the data and multiply with the inverse of a matrix, so you solve a linear problem, and that essentially gives a bunch of weights: a vector times a matrix is another vector, so you can think of alpha = (K_XX + sigma^2 I)^{-1} y as weights attached to a collection of kernels. The little row vector k_xX contains the kernel evaluated at x and at all the individual data points x_1, x_2, up to x_n, so we can write the posterior mean as m(x) = sum_i alpha_i k(x, x_i), and clearly this function is exactly of the form above. What this means, from our perspective, is that we can think of the reproducing kernel Hilbert space as the space spanned by posterior mean functions of Gaussian process regressors.

I have tried to visualize this in this particular picture. I first ran Gaussian process regression, using a Gaussian kernel because that gives a nice picture, on a data set you have seen before: a bunch of evaluations made, we assume, with Gaussian observation noise. The Gaussian process prior with the Gaussian kernel gives the posterior Gaussian process distribution that I am indicating with the shaded region, where the intensity of red is the marginal density of the Gaussian posterior; in the middle, you can see the posterior mean function. That posterior mean is, at least from the statistical perspective, a point estimate: our best guess for what the true function is. We can think of this object, and only this object, as lying inside the reproducing kernel Hilbert space, because it can be constructed by placing a kernel at each individual datum (the Gaussian bumps you can see as the many black lines in the background) and weighting each of these kernels by a coefficient from the vector on the previous slide, alpha = (K_XX + sigma^2 I)^{-1} y.

So we stack the data into a vector y, construct this matrix, invert it, apply it to the vector, and that gives a bunch of weights, which can of course be positive or negative; I have plotted the weighted kernels as dashed lines. If you sum up all these individual weighted kernels, you get the red line. Notice that these bumps do not peak exactly at the locations of the data: it is not as if each datum is surrounded by a kernel that passes through that datum. That would be something like classic kernel (smoothing) regression, a different form of regression, not kernel ridge regression. Instead, we compute these weights, which can be positive and negative depending on where the data points actually lie; that is what the matrix inversion does for us. When people do statistical analyses of these algorithms, which you might call kernel ridge regression or Gaussian process regression or something else, they talk about the space in which this red line lives: how powerful that space is, whether it covers certain hypothesis classes, whether the estimate converges towards the true function assuming the true function lies within that space, or maybe it does not lie within that space, and so on.

So that is one very interesting and intriguingly close connection between the frequentist and the Bayesian world, if you want to use these loaded terms: when the Bayesian computes a posterior mean, they are actually computing a kernel ridge estimate, and therefore all the analysis of kernel ridge regression estimates in terms of reproducing kernel Hilbert spaces applies to this object, the posterior mean. You can see that as a criticism of the Bayesian viewpoint (why do we need it, if we can have the statistical one?), or as a criticism of the statistical viewpoint (why should we use it, if we already have the Bayesian one?); which of the two is closer to your heart depends on your own biases.

To reiterate, here is the formal statement. Consider our Gaussian process model, with a Gaussian process prior and a Gaussian likelihood. From the probabilistic perspective, we arrive at various objects, one of them a point estimate called the posterior mean, our best guess for what the true function might be, given by the expression above. It turns out that this object can also be thought of as the minimizer, that is, the function within the RKHS which minimizes a regularized empirical least-squares loss, where the regularizer is given by the RKHS norm of the function. A proper proof of this goes beyond what I did a few slides ago with just the function values at the training locations; it requires a little bit more, but again, I do not need to do it, because it is in Professor von Luxburg's lecture. It is actually just a three-line proof, using the representer theorem and the Cauchy-Schwarz inequality, but we are not going to do it here, because I do not want to reconstruct proofs that are done in the other lecture.
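As a small numerical sanity check (my own sketch, with made-up data, not the lecture's): the posterior mean really is the kernel-weighted sum from the picture, and the same weights make the gradient of the regularized least-squares objective vanish.

```python
import numpy as np

rng = np.random.default_rng(2)
k = lambda a, b: np.exp(-((a - b) ** 2) / 2.0)   # Gaussian kernel
sigma2 = 0.1                                     # noise variance

X = rng.uniform(-3, 3, size=8)                   # training inputs
y = np.sin(X) + rng.normal(scale=np.sqrt(sigma2), size=8)

K = k(X[:, None], X[None, :])                    # Gram matrix K_XX
alpha = np.linalg.solve(K + sigma2 * np.eye(8), y)   # the weights

def posterior_mean(x):
    """GP posterior mean = kernel ridge estimate: sum_i alpha_i k(x, x_i)."""
    return k(x, X) @ alpha

# The same alpha minimizes |y - K a|^2 / (2 sigma^2) + a^T K a / 2,
# the regularized least-squares objective: its gradient vanishes.
grad = K @ (K @ alpha - y) / sigma2 + K @ alpha
assert np.allclose(grad, 0.0, atol=1e-8)
print(posterior_mean(0.5))
```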
Having seen that, you might now think that the frequentist and the Bayesian viewpoint are just the same; so why do they fight each other so much? Well, notice that when we talk about the posterior distribution of a Gaussian process regressor, we are not just talking about the posterior mean; we are talking about an entire probability distribution, which has not just a location but also a width, in fact more than that: an entire PDF, parameterized by the mean and the variance. That variance is a quantity very dear to the heart of people who buy into the probabilistic framework, because it quantifies uncertainty, and uncertainty, or the ability to quantify it, is often seen as the main selling point of the probabilistic framework: by keeping track of a remaining volume of hypotheses, we can be uncertain about the unknown quantity. So the question we have to answer next is: is there a statistical interpretation of the second object in the Gaussian process posterior, the posterior variance? That quantity should have the nature of an error estimate: in the language of Gaussian process regression, the posterior variance is the expected square distance between the true function we are trying to estimate and the posterior mean, which we have just identified with an element of the RKHS. It turns out, and I am going to do this proof while telling you about it, that the posterior variance can also be thought of as a worst-case bound on the distance between the true function and the posterior mean we just computed, taken within the RKHS, assuming that the norm of the true function is bounded. Let's see what I mean by that in detail.

First of all, let's simplify a bit and assume, just for the simplicity of the argument, that we make measurements without noise: the Bayesian and the frequentist agree that when we observe the data y, we actually get to see the true function values. It is possible to do the same derivation for the noisy case; it just becomes much more tedious, so I am not going to do it. Now we are going to try to connect the minds of the Bayesian and the frequentist by computing one quantity that has an interpretation on both sides, starting with the frequentist side. A quantity you might be interested in is how far the true function value f(x) is from the point estimate you have just constructed. We are not going to compute an expected value, because there is no probability measure to talk about in the statistical viewpoint. Instead, we say: the hypothesis space is the reproducing kernel Hilbert space, the RKHS; let's just assume that the true function is in the RKHS. Notice that we are making no probabilistic assumptions; we just use this hypothesis space, and the only assumption we make is that the RKHS norm of the true function is bounded, it is finite. In particular, we could say it is bounded by one, so its norm is less than or equal to one. (If it is less than or equal to some other constant, you can just scale whichever number comes out by that constant.) So we consider the supremum, the largest possible value among all functions in the RKHS with bounded norm, of the squared distance (f(x) - m(x))^2.

First, let's plug in what these quantities actually are. Here is f(x); we leave that. And we plug in the expression for the posterior mean, which is the same object for the frequentist and for the Bayesian; we have just agreed that it is the kernel ridge estimate, equivalently the posterior mean of a Gaussian process. Here I have rewritten the expression a little; I have basically turned it around. We have assumed that sigma is now zero, because we measure without noise, and y is therefore equal to the function values themselves; we just evaluate the true function, and let's say everyone agrees on that. And here is our weight vector w. This time I compute the weights the other way around: on the previous slide, I took y and mapped it through the matrix inverse and called that the weights; now it is more convenient to wrap the matrix inverse around the kernel vector instead, w = K_XX^{-1} k_Xx, so that the posterior mean is m(x) = sum_i w_i f(x_i). It is just a way to encapsulate these numbers so I do not have to write them all the time; we will come back to it later.

As a first step, we use the reproducing property of the reproducing kernel Hilbert space, which means we can write the function values f(x_i), for all i, and f(x) as inner products of the unknown function f with the kernel: f(x_i) = <f, k(., x_i)> and f(x) = <f, k(., x)>. That means we can rewrite the whole expression as a squared inner product of the unknown function f with a combination of kernels, and in that combination, on the left-hand side of the inner product, all instances of f are gone; we are left only with kernels. Now let's look at this supremum and think about how to find the supremum of this inner product. This is a standard argument, made very widely, so I am not going to waste too much time on it; it is based on the Cauchy-Schwarz inequality. If you want to maximize an inner product between two objects in terms of the second object, you just have to set the second object proportional to the first one. Why? Here is the Cauchy-Schwarz inequality, in case you do not know it: in a space with an inner product, the absolute value of the inner product of two elements a and b is bounded above by the product of their norms, |<a, b>| <= ||a|| ||b||; it is closely related to the triangle inequality. Now, we have an expression of the form <a, f>: our a is the combination of kernels on the left-hand side, our b is the function f, and we want to maximize the expression over functions f of norm one. Consider the function b = a / ||a||; evidently, this function has norm one, which is good, because we take the supremum over functions with bounded norm. By Cauchy-Schwarz, the right-hand side is then just ||a||, and if we compute the inner product of a with a / ||a||, we get <a, a> / ||a|| = ||a||^2 / ||a|| = ||a||, by the definition of the norm. So the bound is attained with equality, and that is clearly the largest number we can expect to get; we should choose our f to be that combination of kernels divided by its norm.

If we plug this in, notice the square on the outside: what we get is the square of the norm squared, divided by the norm squared, so we are just left with the norm squared. A little fiddly, but it is really just bookkeeping of squares. So the value of the supremum is ||k(., x) - sum_i w_i k(., x_i)||^2. Now we plug in the reproducing property again: this squared norm is the inner product of this combination of kernels with itself. Using the bilinearity of the inner product, we can take the individual terms of the sum apart; with two different summation indices on the left and the right, we get a double sum over i and j of w_i w_j times the inner product of the kernel at x_i with the kernel at x_j, and since all kernel functions are elements of the RKHS, that inner product is, by the reproducing property, just k(x_i, x_j). Then we get the mixed terms: minus two times a single sum over i of w_i times k(x, x_i). And finally, a 'quadratic' term, in quotation marks: the inner product of k(., x) with itself, which is just k(x, x). Now all we have to do is plug in the definition of w, which involves a bunch of kernel Gram matrix inverses that cancel out nicely (you can do that for yourself), and we arrive at exactly the expression that is the posterior variance of the Bayesian viewpoint.

So what we have just shown is that the expected square error between the posterior mean and the true function under the Bayesian, probabilistic perspective, which is given by the posterior variance, assuming no noise, is exactly equal to the worst-case error under the statistical viewpoint, for a bounded-norm element of the RKHS. In compact form:
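(Noise-free case, with $w = K_{XX}^{-1} k_{Xx}$, as above:)

$$ \sup_{\|f\|_{\mathcal H}\le 1} \bigl(f(x) - m(x)\bigr)^2 \;=\; \Bigl\|\, k(\cdot,x) - \sum_{i} w_i\, k(\cdot,x_i) \Bigr\|_{\mathcal H}^{2} \;=\; k(x,x) - k_{xX}\, K_{XX}^{-1}\, k_{Xx} \, , $$

and the right-hand side is exactly the noise-free Gaussian process posterior variance at $x$.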
This is a really interesting connection, which I encourage you to think about, because it gives a very quantified view of the philosophical differences between worst-case and average-case error estimation in the statistical and probabilistic viewpoints, respectively. If you ever have the chance to see a Bayesian and a frequentist argue with each other, you will often see them make emotionally loaded arguments about worst-case and average-case estimates. You will hear sentences like 'frequentists do not make assumptions, they just do analysis', or you will hear Bayesians claiming that they can make well-calibrated error estimates that the frequentists cannot. Statements like the one we just found put both of these sides into question: what you see here is one quantity that both sides can agree on, motivated in very different ways, yet happening to be the exact same value. And interestingly, what is a worst-case estimate on one side, a term that suggests a conservative estimate, is an expected error from the probabilistic perspective, which is in a sense a weaker kind of statement. If you like, the Bayesian actually expects a larger error than the frequentist: for the frequentist, this number is the worst it could possibly be, while for the Bayesian it is just the expected value, and quite often the actual error comes out larger than that. Now, of course, the frequentist could say: no, no, I have this unknown number, the norm of the true function, and of course I could scale by it; then this is just a rate of contraction, not the actual error. That is totally true, but then you have to say what that bound actually is, and to estimate it, at some point the two sides will have to talk to each other. The reason we did this is to see that these two philosophical viewpoints are often not at odds with each other, and even though they make statements motivated in very different ways, they may end up with the same kinds of concepts.

What I do not want you to take away from this discussion, however, is that the Bayesian and the frequentist, the probabilistic and the statistical perspective, are the same, even on this particular problem of regression on real-valued functions. To make that clear, let's talk about a third object that is part of the code we wrote for Gaussian process regression and that we have not yet analyzed. We have already spoken about the mean, a point estimate that is the same on both sides, and about the variance, a worst-case estimate on one side and an average-case estimate on the other. The third objects are the samples that you can draw from our Gaussian process posterior, and it turns out that the samples are one point where the probabilistic and the statistical perspective on this problem of regression differ, in a very subtle way. I can make it short: the answer is that the samples drawn from the Gaussian process posterior are not elements of the reproducing kernel Hilbert space.

To be able to see that, I have to very quickly flash a few theoretical results at you. The first one is that there is a third way to write the reproducing kernel Hilbert space. I have already shown you the abstract definition in terms of the reproducing property, and then the kernel map representation, in which you write the elements of the RKHS as sums of kernels. There is also a representation in terms of the eigenfunctions of the kernel; that is actually one of the reasons why I introduced eigenfunctions in the first place, at the beginning of this lecture. It is called the Mercer representation, and without much complicated to-do (I am not going to do the proof, but you can do it yourself if you stop the video here and look a little at the two lines at the bottom), it is possible to write the RKHS as the set of all functions f that can be written as countable sums over the eigenfunctions, scaled by the corresponding eigenvalues, such that they have bounded norm. For that, we need to define what the norm is, or rather what the inner product is, otherwise it is not a Hilbert space, and in this representation the inner product is given by the sum over the products of the weights of the individual components. That is a very elegant way of writing down the inner product: you just expand in the standardized eigenbasis, if you like, standardized by the size of the eigenvalues, and then take the weights. In symbols (a sketch, suppressing the measure-theoretic fine print):
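(Here the $\phi_i$ and $\lambda_i$ are the Mercer eigenfunctions and eigenvalues from the beginning of the lecture:)

$$ \mathcal H_k \;=\; \Bigl\{\, f = \sum_i \alpha_i\, \lambda_i^{1/2}\, \phi_i \;\Bigm|\; \|f\|_{\mathcal H_k}^2 = \sum_i \alpha_i^2 < \infty \Bigr\}, \qquad \Bigl\langle \sum_i \alpha_i \lambda_i^{1/2} \phi_i,\; \sum_i \beta_i \lambda_i^{1/2} \phi_i \Bigr\rangle_{\mathcal H_k} \;=\; \sum_i \alpha_i\, \beta_i \, . $$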
That was theoretical result number one, 'proved', very simplistically, down here. Statement number two, which I am not going to prove, is known as the Karhunen-Loève expansion. It states that a draw from a Gaussian process can also be written, not in the way we have been producing them so far (computing explicit representations with kernel Gram matrices, then computing their Cholesky decompositions and mapping random numbers through them), but by using essentially the results from before: you take a countable set of random numbers alpha_i, drawn i.i.d. from a standard Gaussian distribution, and compute the countable expansion f = sum_i alpha_i lambda_i^{1/2} phi_i over the basis spanned by the standardized eigenfunctions, scaled by the random numbers you have just drawn. This is essentially the non-parametric, infinite-dimensional version of one way to draw from a Gaussian distribution: you take standard Gaussian random variables, scale by the square root of the eigenvalues, and map through the eigenvectors, which here are eigenfunctions.

Now we can use this together with the previous result to compute the RKHS norm of such a random draw from the GP. Notice that the draw has exactly the form you would expect of a function in the RKHS; we can think of it as the Mercer expansion of an RKHS function, so it sounds as if a draw from a Gaussian process should lie in the RKHS. But let's compute its norm, or, to be more precise, the expected value of its squared norm, which is the expected value of the inner product of this function with itself. We can look up on the previous slide what that is: we sum up the products of the individual weights, and here they are the same on both sides, alpha_i times alpha_i, so we get a sum over alpha_i squared. Because we drew the alpha_i independently from standard Gaussians, we can move the expectation inside the sum (this is a simplified proof, obviously), and since E[alpha_i^2] = 1, we are left with a countably infinite sum over ones. That is obviously an unbounded number; the expected squared norm is not finite, this part of the definition is not fulfilled, and therefore these draws are not elements of the RKHS.

Maybe this is a point where you should actually stop the video for a few seconds and appreciate what that means: draws from a Gaussian process have, in some sense, the same form as RKHS functions, because they can be expanded using this Karhunen-Loève expansion, but the coefficients that show up are so large that these functions actually lie outside of the RKHS; they do not have bounded norm.
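A toy numerical illustration of this (my own; the eigenvalues lambda_i = 2^(-i) and the cosine eigenfunctions, which are orthonormal under the uniform measure on [0, 1], are made up for the demo and do not belong to any particular kernel from the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000                                  # truncation level
i = np.arange(1, N + 1)
lam = 2.0 ** (-i)                         # summable eigenvalues
alpha = rng.normal(size=N)                # iid N(0,1) coefficients

def phi(x):
    """Orthonormal cosine basis on [0, 1] w.r.t. the uniform measure."""
    return np.sqrt(2.0) * np.cos(i * np.pi * x)

# Karhunen-Loeve draw: f(x) = sum_i alpha_i sqrt(lam_i) phi_i(x).
f = lambda x: np.sum(alpha * np.sqrt(lam) * phi(x))
print(f(0.37))                            # converges: sum of lam_i is finite

# But the squared RKHS norm of the draw is sum_i alpha_i^2, which
# grows without bound as the truncation level increases:
print(np.cumsum(alpha ** 2)[[9, 99, 999, 1999]])   # roughly 10, 100, 1000, 2000
```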
The answer I can give comes from results by various people; this particular form is taken from the paper I just mentioned, written by Motonobu Kanagawa and others. It is a simplified version of a more general theorem by Ingo Steinwart from 2017, which you can look up yourself, and which in turn goes back to a statement by Driscoll from 1973. What it says is that there is a way to adapt the RKHS such that samples from a Gaussian process do lie inside that space. There is another reproducing kernel Hilbert space, not the one we used to define our Gaussian process; let's call it H_k^theta. It is called the theta-power of the RKHS, and you get it by taking the current RKHS and slightly adapting the eigenvalues, making them decay ever so slightly more slowly. This amounts to expanding the RKHS, because you now also cover functions that previously had too large a norm, and the theorem says that draws from the GP associated with your original kernel lie almost surely inside that expanded RKHS.

The whole question is how much you have to expand the RKHS. It turns out that for certain very smooth RKHSs, for example the one associated with the Gaussian kernel, the required expansion is infinitesimal: if you have drawn from a Gaussian-kernel RKHS with a certain length scale lambda, the expansion essentially amounts to adjusting that length scale, so you can increase the size of the RKHS infinitesimally and capture all of the samples from the original Gaussian process. Unfortunately, this is not true for more general kernels; there are other families, in particular kernels that span Sobolev spaces, which require a finite expansion of the RKHS to capture all of the samples. All of this is very technical and detailed, and maybe you zoned out for the past thirty seconds; that is okay. Before I summarize, here is the construction once more, in compressed form.
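This is a schematic only (theta is my notation for the exponent, and the summability condition is my rough recollection of the Steinwart-type statement; the precise conditions are in the papers):

```latex
% The theta-power of the RKHS, 0 < theta < 1 (simplified):
\mathcal{H}_k^{\theta} \;=\; \Big\{\, f = \textstyle\sum_i a_i \lambda_i^{\theta/2}\,\phi_i \;:\; \textstyle\sum_i a_i^2 < \infty \,\Big\},
\qquad
\mathcal{H}_k = \mathcal{H}_k^{1} \subset \mathcal{H}_k^{\theta},
% and, roughly: a draw f ~ GP(0, k) lies in H_k^theta almost surely
% whenever \sum_i \lambda_i^{1-\theta} < \infty.
```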
To summarize the whole comparison: when we talk about regression on real-valued functions, the statistical and the probabilistic viewpoints on machine learning are quite close together; they even overlap at certain points. In particular, what the Bayesian would call the posterior mean estimate, a point estimate, also has an interpretation on the statistical side: it is the L2-regularized least-squares estimate, the kernel ridge estimate, and it is an element of the reproducing kernel Hilbert space. What the Bayesian would call the expected squared error, the posterior variance, equals what the statistical machine learner would call the worst-case error inside the RKHS, under the assumption that the norm is bounded. And then there is a third object, the samples from the Gaussian process, and that object causes a bit of a headache, because it lies outside of the hypothesis space considered by the statistical perspective: it lies outside of the RKHS. However, it lies outside in a benign kind of way, because it is contained in another reproducing kernel Hilbert space that is only ever so slightly larger. So when we talk about samples, there is a fundamental difference between the probabilistic and the statistical perspective, but it can be healed by considering a slightly expanded space that captures all of the samples.

Having done this philosophical comparison between the statistical and the probabilistic perspective on real-valued regression, I want to end the lecture with a question that is of interest to both sides: how powerful are these nonparametric learning machines, actually? We arrived at these algorithms, call them Gaussian process regression or kernel ridge regression or any other name, at least in my lecture via the idea of taking a neural network and making it infinitely wide, keeping track of infinitely many features. We did that with the hope of building a very powerful model, one that keeps track of infinite degrees of freedom. There was a bit of a worry, though: when we constructed our kernels in that very pedestrian way in the previous lecture, we not only had to extend the number of features to infinity, we also had to shrink their individual variances towards zero in a proportional fashion, so you might worry whether this did any harm. You have seen many pictures showing that this framework learns fairly generic functions quite well, and that it retains representational power to adapt to more and more data points. Maybe it can adapt to arbitrarily many data points, and in doing so learn any function. Is that true?

Here as well, the past, thanks to the hard work of many mathematical analysts, has given us a surprisingly detailed picture, which is one of the reasons why many theoretically motivated people are more excited about kernel methods than about deep learning: deep learning so far does not have this deep a theoretical understanding. It turns out there are certain kernels, called universal kernels, with the property that their RKHS, the space of functions they can approximate if you like, the space of functions representable by posterior means addressable through some data set, lies dense in the space of all continuous functions. "Lies dense" literally means: there is a norm on the space of continuous functions, and within any epsilon ball around any particular continuous function there is an element of the RKHS reproduced by this kernel. One kernel with this property is the Gaussian kernel we have been using so far. By the way, do not take my constant use of the Gaussian kernel as motivation to use it for concrete applications of Gaussian process regression; in concrete examples you will see me use other kernels. It is just fun to use the Gaussian kernel because it produces very smooth functions and has interesting properties, so it gets studied a lot, but you will now see in an experiment that it has downsides as well: it is a very smooth kernel and enforces very strong smoothness assumptions. You will also hear about universal kernels in von Luxburg's lecture, so I am not going to introduce them further. I will just say that the existence of universal kernels gives rise to a hope, and is even used as an argument, for a much stronger claim, which I will turn to next. In symbols, universality says:
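(This is the standard denseness statement, written in my notation, for a compact input domain and the sup-norm on continuous functions:)

```latex
% k universal on a compact domain X  <=>  H_k is dense in (C(X), sup-norm):
\forall\, g \in C(\mathcal{X}),\ \forall\, \varepsilon > 0 \;\; \exists\, f \in \mathcal{H}_k :
\quad \sup_{x \in \mathcal{X}} \big|\, g(x) - f(x) \,\big| \;<\; \varepsilon .
```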
The claim, then, is that kernel machines and Gaussian process regressors are universal learning machines, in the sense that they can learn any function. Technically that is true: with an arbitrarily large data set, the posterior mean can address arbitrary points in the RKHS and therefore get arbitrarily close to any continuous function, and that could be taken to mean that you can learn any function. The problem with this, and maybe you can think about what the problem is before I show it to you, is that this kind of statement does not include a rate. It is a statement about feasibility, not about how quickly you approach the function.

To give you a feeling for why that is a problem, I have constructed a little experiment. Here is a function we would like to learn: the black line in the background. I created this function, and maybe I should not tell you how; it is just a continuous function, in fact a smooth one, infinitely often differentiable. Well, to be honest, I created it by drawing from a Gaussian process with a different kernel, the so-called rational quadratic kernel. You do not know what that is, and you do not have to; what matters is that it is not the Gaussian kernel. The rational quadratic kernel is an interesting case because it produces functions that are in many ways similar to those you would get from a GP with the Gaussian kernel: they are infinitely often differentiable, so fully smooth, and they have a typical length scale, but their behaviour varies a little around that length scale. I am now going to try to learn this function with a Gaussian process regressor built from the Gaussian kernel with the same length scale. So the black line is a draw from a GP with a rational quadratic kernel of length scale one, and we try to learn it with a GP regressor using a Gaussian kernel of length scale one. The Gaussian kernel is a universal kernel, so the statements about universal kernels apply, and we would expect this function to be easy to learn for this algorithm.

Now let's say I get my first datum, over here. I apply the Gaussian process regression framework, I get a Gaussian process posterior, and everything looks fine so far. I get more data points; here is my second datum, still fine. At five evaluations you can see the posterior mean adapting to the shape of this unknown function, and the posterior contracting around it at exactly the points where it should: it becomes certain in regions where it has seen data and remains uncertain in regions where it has not. This looks like a very well-calibrated posterior distribution. At ten evaluations, everything still looks good. But at twenty evaluations, something suddenly goes wrong. The entire posterior begins to deviate from the true function, and this is bad in two ways: first, the posterior mean is moving away, so the estimation error is in some sense getting large; but worse, the posterior uncertainty is also contracting way too fast.
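If you want to reproduce the flavour of this experiment yourself, here is a minimal sketch; the kernel parameters, the grid, and all names are my guesses for illustration, not the exact settings behind the plots in the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.linspace(-5.0, 5.0, 400)

def rq(a, b, scale=1.0, alpha=1.0):
    # Rational quadratic kernel: generates the "true" function.
    return (1.0 + (a[:, None] - b[None, :]) ** 2 / (2 * alpha * scale ** 2)) ** (-alpha)

def rbf(a, b, scale=1.0):
    # Gaussian (RBF) kernel: used by the (misspecified) regressor.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / scale ** 2)

# The unknown truth: one draw from a GP with the rational quadratic kernel.
L = np.linalg.cholesky(rq(xs, xs) + 1e-8 * np.eye(xs.size))
f_true = L @ rng.standard_normal(xs.size)

noise = 1e-6  # tiny observation noise, mostly for numerical stability
for n in (1, 2, 5, 10, 20, 50):
    X = rng.uniform(-5.0, 5.0, n)
    y = np.interp(X, xs, f_true)                 # noise-free evaluations of the truth
    w = np.linalg.solve(rbf(X, X) + noise * np.eye(n), y)
    mean = rbf(xs, X) @ w                        # GP posterior mean, Gaussian kernel
    print(f"n = {n:3d}   max |f_true - mean| = {np.max(np.abs(f_true - mean)):.3f}")
```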
So the algorithm now believes it knows the function, even though it does not know it at all: scaled by the posterior standard deviation, the true function is very far away from the posterior mean. If I keep going with more data points, the situation gets worse and worse; we get strong oscillations, the posterior mean bends far away from the true function, and the surrounding uncertainty drops rapidly. That is very bad. It seems like universality does not actually work. Well, that is not true: the statements about universal kernels still apply, and this algorithm, as it accumulates more data points, will contract towards the true function; it just does so in a very erratic, nasty way. What you see in this plot is the error of the approximation, computed numerically to high accuracy, as a function of the number of evaluations, on a log-log scale. If it were converging efficiently, you might expect, say, a rate of one over the square root of the number of samples, the stochastic, Monte Carlo-type convergence indicated by the golden lines in the background; these are straight lines, of course, because it is a log-log plot. What you actually see is the red curve, the true behaviour of this regressor, and it is much, much flatter than that polynomial convergence. In fact there is a rate for this convergence: there is a theorem in a paper by van der Vaart and van Zanten from 2011 showing that this particular combination of kernels, rational quadratic and Gaussian, converges at a logarithmic rate. You are computer scientists, so you know that logarithmic rates are essentially no convergence at all: you need an exponentially large number of data points to reduce the error at a linear rate. That is not good. So this shows that rates are important: just knowing that you can learn a function means nothing if you do not know how many data points you will need.

Maybe an intuition for this, a picture that does not work for everyone, so if you do not like it, just forget it: you can think of using a particular kernel to learn a function as deciding to use a particular basis of a specific hypothesis space to represent an unknown object that is not necessarily in that space. In this case, because I used different kernels to generate the true function and to learn it, the true function lies in a different RKHS; by the Moore-Aronszajn theorem, RKHSs are uniquely associated with kernels. What is happening here is that we are talking about an object outside of the RKHS and trying to approximate it from within the RKHS. That is conceptually similar, if you like this comparison, to trying to represent an irrational number like pi, the circle number, in terms of rational numbers. You can do that in various ways. Suppose I want to tell you about pi and you do not know pi; I could do that in various different ways, but I will have to use rational numbers, because we do not really have a basis of irrational numbers.
The way we usually do this is to use the basis spanned by the decimal fractions: I can tell you about pi by telling you it is 3.141 and so on, and in doing so I am essentially assigning the weights three, one, four, one to the basis elements 1, 1/10, 1/100, 1/1000. If I do that, I get a linear convergence rate in logarithmic space; of course, in base ten I am expanding exactly in that basis, so with every additional digit I provide, the error drops by a factor of ten on average. There are other ways of representing pi in other bases of rational numbers. For example, there is the so-called Gregory-Leibniz formula, which represents pi in the basis spanned by the odd fractions, 1/1, 1/3, 1/5, 1/7, and so on; the weights are just fours with alternating signs. That is a beautiful formula. Unfortunately, if we agree on the basis of odd fractions and I just keep telling you "four, minus four, plus four, minus four", that is a very inefficient way of encoding pi, because the convergence of this sequence towards pi is much, much slower; you see this as the blue line down here. There are even worse ways of representing pi, ways which may have some theoretical relevance but encode pi really badly, for example this Nilakantha series. And there are really good ways of representing pi: for a while there was a competition to find as many digits of pi as possible, and one group very active in this race was the Chudnovsky brothers, who came up with extremely efficient representations, essentially also a series expansion in terms of a sequence of somewhat complicated rational numbers. Their series converges extremely fast, so fast, in fact, that I can show you the very first point, and the second point is already outside of floating-point range; I cannot even plot it anymore.

So if we want to talk about pi, this choice of basis matters enormously, and there is a similar situation in machine learning: if you want to learn a certain thing, there is usually a particular way of representing the learning problem that allows extremely fast convergence, and there are other representations that give weaker, slower convergence. In the probabilistic perspective, these correspond to different generative models, to different priors and likelihoods. There are, of course, also choices of priors and likelihoods that do not work at all; for pi, those could be the fractions of even numbers, 1/2, 1/4, 1/6, and so on, which would not work at all. So: if you use the wrong prior, you cannot learn anything; if you use a maximally powerful prior, you may be able to learn everything, but you do not know at which rate; and if you choose a particularly smart prior, you can converge very efficiently. The little sketch below makes two of these rates concrete.
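A minimal sketch comparing partial sums of two of these representations (the Chudnovsky series is omitted since, as said, its second term already leaves ordinary floating point behind; function names are mine):

```python
import math

def decimal_digits(n):
    # Truncate pi to n decimal digits: expansion in the basis 1, 1/10, 1/100, ...
    return math.floor(math.pi * 10 ** n) / 10 ** n

def gregory_leibniz(n):
    # pi = 4/1 - 4/3 + 4/5 - ... : weights of +-4 on the basis 1, 1/3, 1/5, ...
    return sum((-1) ** k * 4.0 / (2 * k + 1) for k in range(n))

for n in (1, 2, 5, 10):
    print(f"n = {n:2d}   decimal error = {abs(math.pi - decimal_digits(n)):.1e}   "
          f"Gregory-Leibniz error = {abs(math.pi - gregory_leibniz(n)):.1e}")
```

The decimal error drops by a factor of ten per term, while the Gregory-Leibniz error is still of order one tenth after ten terms.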
It turns out there is a corresponding type of statement for Gaussian process regression, which I want to end on. It is a very technical, extreme example of what is possible in the analysis of kernel machines, and it is also due to the paper by van der Vaart and van Zanten that I already mentioned. I am not going to read out the entire thing, but you can read it yourself. What the statement says is that if you use a different choice of prior, not the Gaussian kernel but a kernel that spans a Sobolev space, that is, a space of hypothesis functions of finite smoothness, then it is actually possible for the Gaussian process regressor to learn any sufficiently smooth, finitely often differentiable function from a different Sobolev space at a rate that is polynomial in the number of samples, not logarithmic. In particular, this means that if you manage to find a prior hypothesis space whose elements match the smoothness of the true function, and that smoothness is sufficiently high, then you get a convergence rate that is essentially one over n, except for a correction involving the dimensionality of the input space, and even that correction is suppressed by the smoothness. So it is possible to learn smooth functions; you just have to use the right priors.
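Written out, the flavour of the rate is the following (my simplified paraphrase of the van der Vaart-van Zanten-type result, up to log factors and regularity conditions; beta is the matched smoothness, d the input dimension):

```latex
% Posterior contraction for a prior whose smoothness matches a beta-smooth
% truth on a d-dimensional input domain (up to log factors):
\varepsilon_n \;\asymp\; n^{-\beta/(2\beta + d)},
\qquad\text{i.e. the squared error scales as } n^{-2\beta/(2\beta + d)} \;\to\; n^{-1}
\quad\text{as } \beta/d \to \infty .
```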
Your takeaways from this should be two things. First, do not default to the Gaussian kernel; it is too smooth. Rather use somewhat rougher kernels, and you will see me use some of them in other lectures. Second, this is an example of the kind of theoretical power associated with the notion of kernel machines; the fact that such statements are possible in learning theory is one of the reasons why people are still excited about kernel machines and the associated notion of Gaussian process regression, while deep learning has so far not reached this level of precision in statements about what these machines can actually learn. Of course, we all hope that this will change soon.

With that, I am at the end. Today was a bit of an introspective, almost philosophical, mathematical lecture in which we tried to get a closer handle on the notion of Gaussian process regression and its connection to statistical machine learning and the concepts therein. We found, first of all, that kernels are really interesting objects: you can think of them, with a few caveats, as essentially infinitely large matrices. That also means that Gaussian process regression, because it uses kernels, is connected to the theoretical edifice constructed by statistical machine learning for kernel machines, and in fact, for real-valued regression, so for Gaussian process regression, the connection between the two sides is extremely close. There is a corresponding concept in statistical machine learning called kernel ridge regression, and kernel ridge regression constructs a point estimate, the kernel ridge estimate, which happens to be exactly equal to the posterior mean of Gaussian process regression. Gaussian process regression produces additional objects as well, in particular an error estimate, the posterior variance, and that posterior variance happens to be identified with a worst-case error bound in the statistical formulation. Gaussian process regressors can also draw posterior samples; these are not quite in the RKHS, but they lie, in some sense, just outside its boundary, in a slightly expanded version of the space, even though for some kernels that expansion has to be finite. And finally, we looked beyond this philosophical confluence of statistical and probabilistic machine learning to address a problem both sides should be interested in: how powerful these kernel methods actually are. We saw, and this is maybe the most important message, that there is a very deep theoretical understanding of these algorithms, which shows that yes, Gaussian process regression models can learn every function; however, they do not necessarily learn it at a good rate. That rate can be logarithmically bad, or it can be polynomially good, if you like even with high-order polynomials, provided you use the right kernels. So if you want to learn a complicated function, you had better use a powerful kernel, but do not expect it to converge magically fast. With that, we are at the end. Thank you very much for your time.