All right, well, let's get started. So good morning, everyone. Another Monday morning, welcome. And I will go straight to feedback from last week, because there was quite a lot of it. So to my relief, despite it being a very algorithmic, low-level lecture, most of you actually liked, well, those of you who gave feedback actually liked the lecture last Thursday. And you also largely liked the speed. There was even one person who thought it was too slow, and three who thought it was too fast. But even of those who thought it was too fast, one person very much liked it, and another person kind of liked it. So I think that's a decent way to do algorithms. Two people actually wrote that it's still a little bit fast on the math. And one of you said it would have been better to do Cholesky with a simple example. Could have done that. But a lot of others also quite liked the way that we sort of told this as a story. Someone even suggested I should do a podcast. If I ever have time for that in the future, I might think about it. And then there were quite a lot of detailed questions about the content of the lecture. And I want to take five minutes or so to briefly address those. So, the questions, I'll just go through them. They're in no particular order, I just picked them out. So the first question was: is it a problem that I haven't had a deep learning lecture before? Actually, two of you, I think, wrote something to this effect. No, it's not. Don't worry about it. I don't even know what actually is in the deep learning lecture of Professor Geiger. I will make my own kind of very brief introduction. And also, deep learning isn't conceptually that complicated. It fits quite well into what we've done so far and what we will be doing over the next two lectures. So it will organically grow on you. Don't worry about it. Don't stop this class because you haven't had a deep learning class. Why didn't we do GPs without Cholesky in the first place? So I'm assuming that this question relates to the lecture done by Marvin Pförtner, lecture 12, on training GPs with stochastic gradient descent. I will get to that several times today. But the very short answer is, just to remind you, we did GPs with Cholesky first. So we spent the first 11 lectures, pretty much from lecture four to lecture 11, on Gaussian inference using the Cholesky decomposition. And then there was only one lecture using SGD. And the point of that lecture was to point out that we can train GPs with SGD. But then when we do that, we lose the ability to quantify uncertainty, at least at first sight. And that's important, because there is an important class of algorithms called deep learning out there that are trained by variants of SGD. And we will need to think about what we can do to add uncertainty to them. Next question. You said that, and here is, I think, a misunderstanding. You said that a normal GP allows a data set of roughly 20,000 while Cholesky allows a data set of a million. How does it? No. So what I said was closed-form Cholesky decomposition. So by that, I mean so-called direct linear solves. So these are methods you just call, and then you wait, and then they give you the exact answer. And you can't stop them in between, because you don't know what the output is. That's how Cholesky works, at least so far. Those scale to something like 20,000. And 20,000 is very much a rough ballpark number. It's sort of correct for a computer like the one that's right in front of me.
You can try yourself what your machine can do. Maybe it's 10,000. Maybe it's 30,000. It depends on how much RAM you have, what kind of library you use, and so on. And then, iterative methods. So methods that repeatedly multiply with the matrix, and then you can stop them at some point, and they give an estimate. Those scale to the order of a million data points. Why? Because, roughly speaking, each iteration of these methods requires one multiplication with the matrix. So that's O of n squared. So a million by a million, you know, that's 10 to the 12. Your machine does something like a few times 10 to the 9 operations per second. Gigahertz, that's what your CPU does. So if you're willing to wait an hour or so, like many people do to train a deep net as well, then you can run these iterative linear algebra methods. And those iterative methods include Cholesky when treated as an iterative method, as we did in the last lecture. But usually, people don't think about it this way. But they also include conjugate gradients, which is this variant that I had at the end, this Lanczos-process-driven version that loads projections of the data set. I'll talk about those in a moment. And these are methods that are, in some sense, anytime. So after each iteration, you can just stop them and look at what the estimate is. And the estimate will be better than in the previous iteration, but it won't necessarily be perfect. And that's much closer to what we know from deep learning. And we'll see an example of that today. Then someone asked a very interesting question that I really haven't addressed so far. So GPs, at least in the examples we've done so far, are for regression on continuous numerical data, so learning functions that map from R to R, from the reals to the reals, or from a multivariate vector to a potentially multivariate output. But what if I have other types of data? What if I have text as input? Or something else? Or classification? So classification is a different kind of output. What do I do then? Different types of output we'll do from today onwards. We'll spend the entire lecture today and the next one on Thursday, and then two more weeks, on what happens if the output of the function is not a real number, not a vector. What if the input is not a real number? Well, there I just want to briefly point out that in all of the slides that I've shown you so far, I've defined the kernel to be something that maps from the input space times itself to a real number. And I always used this sort of bold font X to really just say: whatever. This is the important restriction: we need to build matrices that contain floating point numbers, but the input really doesn't matter. Why? Because there's a feature function inside. So remember that we can think of kernels as some inner product of the inputs through some feature function. So if you can write down a feature function that takes in strings and does something with them, you can build a kernel. So for example, you could write a kernel between two words in a text in the generic form of any kernel, a sum over s of w_s times phi_s of x times phi_s of x prime, as an inner product over a bunch of features, where feature number s could be the number of times substring s shows up in the word x, and where the sum goes over all substrings, over all possible ones. And you know that there are finitely many of them for a finitely long string, right? So it's sort of a classic computer science algorithm problem.
And then you could weigh them by some positive number w_s. It could be whatever weight, and thereby you build interesting kernels. So for example, you could do this only for characters. So you literally just take the, whatever, 26 or 128 characters that you want to go through in some kind of ASCII table and just count how often each character shows up in each word. That's called a bag-of-characters kernel. Or you do it for a bunch of words in a dictionary. Then this is called a bag-of-words kernel. These are sort of the classic ways of doing computational linguistics, natural language processing. These days, you could have some smart neural network that takes the string input and maps it into some vector space. So word2vec. Now you have some kind of two-dimensional or multi-dimensional space in which all of the words lie. And now two words, two tokens, are two points in that space. Then you can take an inner product between these two vectors. That's another kernel. Define that as your kernel. And of course, the sky is the limit. You can do whatever you want. Yes? So the question is, if I have some weird input feature, do I have to make sure that the kernel is continuous, symmetric, positive definite, and so on? Well, technically speaking, no, in the sense that if you construct a kernel like this, so if the output of your neural network is this phi, then this is always a positive definite function by construction, because it's an outer product or inner product of two vectors. So as long as these weights are positive, this is positive definite. And you're sort of mathematically fine. It defines a meaningful Gaussian process. Whether it's a good one or not, that's another matter. So I can't say whether that solves your problem or not. Maybe your inputs aren't words. Maybe they are graphs. So let's say you're working in a biomedical application. You have some proteins, you want to know whether one folds like the other, whether one leads to some reaction in some biological system. Maybe it's a crystal, and you want to do regression on the temperature at which it becomes superconductive, to do materials science. These are all applications that these methods have been applied to. And you can imagine that if you do this on graphs, so if x and x prime come from the space of graphs, then it might be a tricky thing to compare two graphs. So another way to think about these kernels is that they compute covariances, similarities between two objects. So the kernel is also the covariance between the function we care about at input x and the function we care about at input x prime. So it says something about how similar things are to each other. So if they are graphs, you kind of have to measure the similarity of graphs in some sense. And those of you who studied computer science as undergraduates, you know that graphs are these very important objects that can be very difficult to deal with. There are all these NP-complete problems in Karp's 21 list, a list of 21 NP-complete problems. Something like half of them are about graphs, right? There is exact cover and all these other ones, and clique cover and so on. So many of the ways of comparing graphs to each other are based on these difficult-to-compute quantities. So, you know, can I make these two graphs equal? Some kind of covering problem? To which degree are they equal?
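Just to make the bag-of-characters idea concrete before we get to graphs, here is a minimal sketch of such a kernel. This is not code from the lecture; the alphabet and the uniform weights are placeholder choices.

```python
import numpy as np
from collections import Counter

def char_counts(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Feature map phi(word): how often each character of the alphabet shows up."""
    counts = Counter(word.lower())
    return np.array([counts[c] for c in alphabet], dtype=float)

def bag_of_characters_kernel(x, x_prime, weights=None):
    """k(x, x') = sum_s w_s * phi_s(x) * phi_s(x'), a weighted inner product of counts."""
    phi_x, phi_xp = char_counts(x), char_counts(x_prime)
    if weights is None:
        weights = np.ones_like(phi_x)  # uniform weights w_s = 1
    return float(np.sum(weights * phi_x * phi_xp))

print(bag_of_characters_kernel("sweater", "sweaters"))  # similar words -> large value
print(bag_of_characters_kernel("sweater", "ox"))        # dissimilar words -> small value
```

Because it is an inner product of weighted feature counts, a function like this is positive semi-definite by construction, which is exactly the point made above.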
And therefore, kernels on graphs have been a long-studied field where people have spent a lot of time coming up with good approximations, because, as you may also know, for many of these graph problems there are good approximations that are not NP-complete. One famous example of that is the so-called Weisfeiler-Lehman kernel, which was rediscovered by the wonderful Nino Shervashidze, a Georgian mathematician, I think, during her time here in Tübingen. It's one of the most widely used kernels out there. It was published in, I think, let me just check, actually, I looked it up before the lecture: 2011. Yeah, so long story short, you can do regression on really weird input spaces. Strings, graphs, documents, computer programs, whatever, and produce output. And in a very concrete sense, a lot of the applications of machine learning that are currently discussed, on text, on language, but also on images, boil down to very similar kinds of functionality. Of course, they don't use Gaussian process regression or classification. They use deep architectures, but in the end, they pretty much do the same thing. Next question. One more slide with quick questions. One question was, well, can't we just do PCA on the transposed data set to reduce N? So, I think in the form that this question is posed, I would not recommend doing that, because for most of you, when you hear PCA, you're thinking of a setting where we reduce the number of features of the data set. So, your bachelor-level PCA thing, I know that when I talk about PCA, everyone's like, oh, PCA, okay. You have a data set that is of size N by P, N data points, each of which has P features. So, these could be images that have P pixels, and there are N of them. Then PCA is usually a method for reducing P, the number of features. That's changing the X in this kernel. It's not changing the size of the kernel Gram matrix that we have to deal with. Now, your question, actually, I think might assume the other thing, where you transpose this and do PCA on N. I just wouldn't call that PCA. It's just not PCA, it's something else. And what it is, actually, is very close to what we did last Thursday, namely trying to compute something close to the eigenvectors of the kernel Gram matrix. So, we take this inner product of a transformation of this data, so we compute phi of this thing, phi of X, right? So, this is X, then we compute phi of X, which is still of size N, but with some other number of features, where the number of features could be infinite. And then we take the inner product of this thing with itself to construct KXX, which is phi transpose phi, which is a matrix that is square and of size N by N. And now we're trying to construct a low-dimensional approximation to this. Ideally, I mean, the best low-rank approximation to such a matrix in the Frobenius sense is the expansion in terms of the eigenvectors sorted by the size of the eigenvalues. And that's actually what we'd like to do. It's just that guessing the eigenvectors of a matrix is not exactly easy. So, finding all of the eigenvectors is, again, cubic in the size of the matrix, and in fact, there is a larger constant hiding in the O. Computing eigenvectors is more expensive than computing Cholesky decompositions.
And so, we thus use the Lanczos process, which, Daniel, is the next question, which is a way of iteratively constructing an approximation that can be shown to be, in some very technical sense, quite close to the eigenvalue decomposition. One way to think about this is that it produces projections phi such that this matrix becomes tri-diagonal. It only contains O of N numbers, literally something like two N, N plus N minus one numbers. And the question was, actually, these two questions are the same, sort of: what role does the residual play, and what does it have to do with Gram-Schmidt? So, I said something dangerous about Gram-Schmidt in the last lecture, and said, ah, you know, the Lanczos process is a bit like Gram-Schmidt. And Gram-Schmidt is an algorithm that produces a basis of a vector space by taking a set of vectors, N of them, and then from them constructing an orthogonal, or even orthonormal, set of N vectors. And we seem to have started with just one in this algorithm. So here it is again, which I showed you on Thursday. So this is my rewriting of iterative linear algebra methods, the ones that are in this low numerical layer that we keep calling when we do Gaussian process regression, which repeatedly construct a projection, a direction, actually, sort of a vector, and then they multiply the kernel Gram matrix with that vector. Those are sort of the operative procedures. This is the business end, that's where the operation happens. This is also the expensive part; multiplying with the matrix is expensive. And then there are these white lines, which are O of N, they are just inner products, so they're not expensive, and they amount to some kind of bookkeeping to construct interesting objects, which we can then use to build approximations both for the inverse of the kernel Gram matrix, called C_i in the i-th iteration, and for the solution of K inverse times y, which we call alpha; these represent the weights in the GP. And the question was, well, what I said at the end of the last lecture was: there are these initial guesses. We could pick, we could choose those somehow. And you can choose your initial direction, so the very first s_i here, to be the initial residual of the linear problem. So the residual is: we're trying to solve K times something unknown, some, what should I call it? Actually alpha, right, the optimal alpha, equals y, y hat. Then if you start with an initial alpha zero, an initial C zero, then you can compute K alpha zero minus y hat. And that's a kind of gradient of a problem that I'll talk about in a moment. And if you start the Lanczos process with that gradient, it's called conjugate gradients, and it produces these interesting directions. And what does this have to do with Gram-Schmidt? Well, as this algorithm runs, we keep correcting the residual and the residual drops, and those sequences of residuals produce a set of vectors which you can orthogonalize with Gram-Schmidt. And that process is called conjugate gradients. It's a bit annoying that it's called conjugate gradients, it should really be called orthogonal gradients, because it produces orthogonal gradients. Unfortunately, that's just not what Hestenes and Stiefel called it when they introduced it in 1952. And therefore it's called conjugate gradients.
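To make the "anytime" flavor of these methods concrete, here is a rough sketch of a plain conjugate gradient loop for K alpha = y, assuming a symmetric positive definite K. This is my own illustration, not the pseudocode from the slides: one matrix-vector product per iteration, the rest is cheap bookkeeping, and you can stop after any iteration and use the current alpha as your estimate.

```python
import numpy as np

def conjugate_gradients(K, y, num_iters=10, alpha0=None):
    """Minimal CG for K @ alpha = y (K symmetric positive definite).
    Each iteration costs one multiplication with K; stopping early gives an estimate."""
    alpha = np.zeros_like(y) if alpha0 is None else alpha0.copy()
    r = y - K @ alpha          # residual = negative gradient of 1/2 a^T K a - y^T a
    d = r.copy()               # first search direction = initial residual
    for _ in range(num_iters):
        Kd = K @ d             # the expensive O(n^2) step: load a projection of the data
        step = (r @ r) / (d @ Kd)
        alpha = alpha + step * d
        r_new = r - step * Kd
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d   # new direction, conjugate (K-orthogonal) to the previous ones
        r = r_new
    return alpha

# tiny usage example on a random SPD matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
K = A @ A.T + 50 * np.eye(50)
y = rng.standard_normal(50)
print(np.linalg.norm(K @ conjugate_gradients(K, y, num_iters=25) - y))  # residual shrinks with more iterations
```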
Okay, so that means, just as a recap from last Thursday, that all of the linear algebra that happens, that we so far sort of blindly called in Gaussian process regression, is a part of the inference process. It is actively selecting which linear projections of the data set to load. A special case of linear projections is to just go through the data one by one. And then it keeps track of how these projections act on the kernel Gram matrix, to get essentially a sequence of projected views on the data set. Linear algebra is data loading and bookkeeping. And for Gaussian process regression, they really are the same thing. They are directly related to each other. And we've now seen that these numerical methods that do Gaussian process regression really are little learning machines themselves, because they decide to load particular parts of the data set in some order that is smartly chosen to extract information as quickly as possible. In particular, if in this algorithm we had on the last slide, we choose as the directions some order, some pivoted order, so a reordering of the data set. So if we decide to first look at data point number 12 and then data point number eight and then data point number 36 for some reason, because we decided that that sequence is important, then this amounts to what's called the pivoted Cholesky decomposition, where pivoting is this rearrangement. And if we choose the initial direction to be the initial gradient and then use the Lanczos process, or Gram-Schmidt basically, to orthogonalize the resulting gradients, that's called, well, actually it's called preconditioned conjugate gradients if we also multiply by C zero here, by the initial guess of what the inverse of the matrix might be. And that sort of already raises this observation that maybe these inputs that we give to our algorithm, let me go one slide back, up there, this optional argument C zero, alpha zero, those actually amount to some initial guess of what the solution is. They amount to a prior on the problem. A prior on the linear algebra part of the problem. And that might raise questions for you about what that means for the algorithms, and I can just briefly point out that this is basically what my entire research career has been about. It turns out that these estimates are just an example of priors on a numerical quantity. And it's even possible to build a Bayesian interpretation of this. You can think of this algorithm as doing maximum a posteriori estimation on two quantities called C zero and alpha zero, using least-squares estimation. So there's actually a Gaussian prior on this matrix and this vector that can be made consistent with each other. And then we can really do sort of full uncertainty over the computation. And this is called probabilistic numerics, and there are books about it. But that's all I'm gonna say about this in this lecture. And now we're already deep into the time, but I also want to briefly address one last thing, which leads us into the main content of today's lecture, which is classification. Marvin Pförtner said something about thinking about the computation of gradients in terms of some sensitivity. I will do this a few more times over the next two lectures, so you don't have to, like, photographically memorize it now, but as he pointed out, you can think of this matrix inverse actually as itself being a derivative.
You can just observe this by staring at the equations for long enough: it's a sort of gradient, or maybe a total derivative, of this expression with respect to Y. Obviously, because this is a linear operation in Y, Y just shows up here as a directly linear term. So the thing in front of it is just the derivative of it. And that's sort of interesting, because this means this algorithm that we wrote down a few slides ago, that computes this sequence of C_i's, actually is a method that you can think of as a differentiation method. It computes the total derivative of a function with respect to its inputs. And that function is called solve. While gradient descent, which you had in lecture 12, and where you observed that you can't use it to be uncertain, can be thought of as computing a partial derivative of this function with respect to Y. Why? Because, let me actually move the slide back up again. If you think about this object, this is the gradient that we're trying to follow. And you think of gradient descent as an algorithm that just keeps computing, for every alpha_i, this thing, let's call it R_i. This. And then it takes a step: alpha_{i+1} is alpha_i plus some learning rate eta times minus R_i. Then if you think of this function R and take the derivative with respect to Y, there is no K in there, right? Because Y doesn't show up multiplied by K. If you take the derivative with respect to Y, you just get a minus one. So when we use gradient descent, we don't get access to this thing, to this total derivative that we want. We actually compute a partial derivative, and partial derivatives, as you may remember from your analysis classes, are not the same as total derivatives. So we're losing this information about how things relate to each other by using gradient descent, because it's a first-order method. And I will get back to this today, because today we'll finally move away from regression problems and on to classification. So what is classification? We'll approach this problem through a few steps. Classification problems are problems like this. So here's a bunch of data points, and you've seen those in your undergraduate machine learning classes, I think. So who has seen a data set like this before? Just to check. Yes, everyone, very good, really everyone, cool. So there are two classes here, red and empty, or red and blue. We want to know which one's which. So we want, in the end, to be able to say for a new input point in this plane: what is the class? And so this is clearly a problem of this type: we have input X, X is X1, X2, and an output which is different from the real line. Maybe it's a binary output, zero or one, or maybe it's a probability output, sort of something between zero and one, yes? So the question is: Marvin said something about how we can't compute this thing up there because it's not directly accessible, but you can compute this direct gradient. Wouldn't that be the best thing to do? No, the best thing to do is the algorithm on the previous slide, right? Because it's just loading the data more efficiently. And we will construct an intuition for why this is over the rest of today. Let me just get there. So if you have a data set like this, what's your answer? What would you do? I mean, this is constructed to be as easy as possible. A line, yeah. So some people already gestured like this, and some people have said line, yeah, you can just put a line through this thing.
This is called a linearly separable data set, and yeah, that's easy, right? So here we could have an output that is zero or one. But of course, not every data set looks like this. Some data sets look like this, where, so you tell me what the problem is now. Just one word. Overlap. There's an overlap between the data. So there's some blue in the red and some red in the blue, annoying. So this might sort of lead us to think, ha, maybe we need a function that doesn't output zero or one, but which outputs some shading, something between zero and one, right? A continuous kind of output. And then we might think of this as a probability for the class label. So if it's zero, the white part, the probability for the red label is zero. And if it's one in the dark green corner over there, the probability for the red label might be one. And in between, here, the probability might be 50%, right? We don't really know what the label should be. So this is the first observation: we might need functions that don't produce binary outputs, but produce continuous-valued outputs. And we can think of this as a generative model for this data set, actually. So to produce a label at some point, we produce this function that lies between zero and one, and that's a probability. And then we toss a coin to decide whether the actual label, the red or blue bit, is one or the other, with probability given by that function. So the next observation you could have is that not every data set looks like this nice clump of points, but many data sets might look more like this. So now we have things interspersed with each other. There's just some red and some blue, but they are not in one nice linearly separable sort of space. So what do we do now? Again, map to different coordinates. Yeah, so that's already sort of the mathematical trick, but maybe intuitively we just want to have a function that can vary nonlinearly in space. So the mapping to different coordinates is the idea that we could learn such a function using a Gaussian process, maybe, because we can map into some big feature space in which things are nearly linear. That's the kernel trick. Mapping into a really high-dimensional space where this function actually happens to be a linear function. And that's actually what we're going to do today. And then there are also data sets that actually look like this. Can you see what the problem is here? So it's a little bit like the one I showed you before. Green and blue overlap. So there are some regions where we probably want to have, like, 50% probability, but here's an observation sort of for you to think about. What should the label be up here? And can you say why? Yeah, so okay, so that's true if we have no data there, but in the previous problems, we sort of also had no data in some regions, and it felt a bit better probably to do this. Yeah, so you said that they look independent of each other. So maybe what I would say is that it seems like the two classes have very different distributions. Right, there's this green stuff which looks like this long thin line. And then there's the blue stuff which looks more like a ball. And so without going into much more detail on this, I just want to sort of observe, as we pass along, that such situations arise in so-called anti-causal settings. So this is when the label, the thing we're trying to predict, green or blue, the color, produces the input causally.
So if the X is actually made by Y. Because then, for each label, we can have a different distribution. And actually most settings, well, not most, but many settings in machine learning are like this. And what we often do is we just ignore it. We just don't talk about it. Because we can clearly address this problem by just saying, well, we're just still gonna do what we had on the previous slides. And in the middle there, between everything, we just do 50%, we just don't know, 50-50, whatever, the label is somewhere in the middle. But it's still important to keep that in the back of our heads, because we will often have this situation that you go out here. And then if someone asks you out there what the label is, the correct answer isn't even "I don't know". It's more like: well, it seems to be neither, because I kind of know that the data looks like this, and this doesn't look like any of the data I've seen. This is out of distribution, if you like. So, but when I say causal, it's sort of tempting to think of, oh, X causes Y, Y causes X. What does this actually mean? So I can tell you how I made this data set. You can see two highlighted points here, there and there. And what I've done is actually I created this data set by taking something that's higher-dimensional. I'm not gonna tell you how many dimensions it has yet. And I actually did PCA. So I projected it down to two dimensions so that I can make a plot. And here's the entire data set, or two of its classes, on this two-dimensional plane. And there are these two points. And actually, if I show you those two points in their full glory, with their 768 dimensions, then they look like this. So they're images of clothing, there's this fashion-MNIST data set, which is a variant of the MNIST data set. And this is, I think, just as we pass by this causal-setting observation, interesting to think about. So as you leave the lecture afterwards, think about why this is a setting where the label causes the image. Because at first it may seem, ah, yeah, sure, right? Something looks like a pair of trousers because it is a pair of trousers. And something looks like a sweater because it is a sweater. But isn't it also the other way around, that we call something a sweater because it looks like this? Isn't it sort of an affordance? Because you can put your legs in it, it's a pair of trousers, and because you can put your neck through it, it's a pullover. So this isn't always so straightforward. Causality is a bit complicated. But we're not gonna talk about causality now. Instead, we'll just note this on the side as we pass by, that this is potentially a problem. But then we will do what we have written down here anyway. We will do something like this. We take the input space, we construct some function that creates this shading, and we'll just find out how we do this. This is called discriminative learning. Just to remind you, you may have heard this term before, as opposed to generative learning. So generative learning would be: let's produce more of the green stuff. Let's sample points in this space that look like the green stuff. While discriminative learning is this: I give you an input point and you return a label. You predict what the label is going to be. So, so far, we've done regression, which is a supervised machine learning problem.
So one where we have inputs and outputs, X and Y, inputs and labels, where the inputs can be anything and the outputs we have assumed to be real numbers, or maybe multiple real numbers, which is just multiple copies of the same thing, and then that's multivariate regression. But we've always assumed that the thing we're trying to learn is actually a real number between minus infinity and plus infinity. The problems we're now going to address are still supervised, so they still have inputs and outputs, inputs and labels, but the labels are not real numbers anymore. Instead, the labels will in general be discrete. So they will be an element of a subset of the integers. One to D, or C, or K, I might fiddle a bit with the notation as we go along. So just a bunch of classes. First class, second class, fourth class, whatever. Sweaters, pants, t-shirts, or animal, vegetable, mineral, and so on. And what we're trying to learn now is a function that maps from the input space, X, where X is still anything, to the simplex, which is the space of non-negative vectors in this d-dimensional space that sum to one. So these are probabilities for class labels, where we want to be able to ask this thing: I'm at this input point, here's an input image, tell me what is in the image, right? And then it might say with 20% probability it's a car and with 80% probability it's a human being. And of course, you've all seen these applications before. That's exactly what we want to do. The reason I'm showing you things in this way is that we can now check how much we've learned in the previous lectures. Let's see why this is the next thing we do. So which of the stuff we've done so far can we keep using, and what has just been invalidated by this change in the problem? So we used to learn functions that map from any input space to the real numbers. And now we need to learn functions that map from any input space to the simplex. And so maybe you're gonna think: we had, so far, two things, sort of. We had this kernel business and all the linear algebra story, and then we had this Gaussian business, uncertainty with probability distributions that are Gaussian in shape. So the former we might still be able to do, right? Functions from one space to the other still sounds like a lot of the stuff we've done; minimizing objective functions and so on might be possible to use. But what doesn't seem right anymore is the use of Gaussian distributions. Why? Because in your head, Gaussians are these bells, right? And their support is on the real line. They are probability distributions over real numbers. A Gaussian probability density function, clearly the ventilation is off, not sure what to do about that. Gaussian probability density functions look something like this. And this space down here, that's the real line. So we can't use those distributions anymore. At least not without thinking more carefully about them. But that is sort of the output space, right? So we have learned these functions that go from some input space X, and then at each point in X, they produce these distributions that we've drawn something like this, right? Where if you cut through the space like this, you get one of these Gaussian distributions, one of these. So we have to change this output bit, but the input might still work. So the way to do this is to say: we're going to need to change the likelihood in our model. We're going to assume that the observations we make are discrete.
And unfortunately, for discrete quantities, there are lots and lots of ways of writing them, right? Because integers, you know, in particular a finite subset of the integers, you can think about in multiple ways. And for the moment, for the next two lectures, what I'm going to use is this signed integer notation. So I'm going to say the labels Y are binary. They're either plus or minus one, in or out. Good or bad, red or green, or whatever. We don't even talk about what we are encoding with this. So let's hope it's not something socially sensitive. It's just, I don't know, the one type of flower or the other, right? Minus one, plus one, because otherwise minus and plus sounds bad. That is of course not going to work once we have more than two classes; we will do that next week. There's still time to think about it. If you just have two outputs, then they could be zero or one, or they could be minus one and plus one. It really doesn't matter. We just need to say how we create them. And we will create them with this discrete distribution, which is actually a binary distribution. So it suffices to learn one function. That's why we do it this way: a function that maps from the input space to a probability for a coin to come up heads or tails. And yeah, so maybe just from a probabilistic perspective, this is a likelihood, right? So what we need is a posterior over pi of x. So this is nearly like what we've done so far in regression. It's almost the same thing. We're going to learn a function, a function with an index x, that maps from the input to a probability over the labels. And the only problem we have, really, is that that function now has to be constrained to lie between zero and one, because it's supposed to be a probability. And that's it. That's the only thing we're going to struggle with today. So how do we do that? And now those of you who have taken that deep learning class can probably think about it, or an undergraduate machine learning class. I'm gonna give you 20 seconds to think about it so that everyone has some time. And then someone can tell me the answer. So raise your hand if you have an idea of how we might model a function that lies between zero and one. Okay, then you, okay, you've raised your hand but you don't look at me, so what's your proposal? Yeah, the logistic, very good. That's almost exactly it. It's not just anything, it is the logistic function, because of course that's what everyone uses. So maybe more generally, the idea is to use a sigmoid. So a sigmoid is a function that looks like an S, sigma being S, right? A function like this black thing here that looks like a really stretched-out S, that maps from the reals to the zero-one simplex. So let's first think about this conceptually. In lecture three, when I spoke about continuous-valued random variables and the notion of a random variable, we spoke about the fact that we can construct distributions over pretty much arbitrary spaces by starting out with distributions on some other base space and then mapping through a function, and the output of that function is then called a random variable. And then we usually don't talk about the original spaces, because you only care about the random variable. So we said, whatever, the original space is something, because you can just change the function and then have some different input space. Here's an example of this, where the input space is the real line and the output space is zero-one.
It's in some sense sort of almost the inverse of how we draw Gaussian random variables, right? So remember that the algorithms that draw standard Gaussian random variables actually take uniform random numbers between zero and one and transform them into Gaussians. Here it's sort of the opposite. If the input variable is Gaussian distributed, like these three, here are three different Gaussian distributions in red and blue and green, if you can see the colors, then if you map the density, so these lines here at equal distance, through this function, then they sort of get squished at the edges of the distribution and stay roughly the same in the middle around 0.5. And we get these three different distributions that look more like things on the simplex, on zero-one. And yes, the most popular function for this is this function, one over one plus exp of minus f, the logistic function, which is called logistic because it's the solution of some Bernoulli equation that describes processes that have finite resources and therefore have to saturate, like this, logistically. It can also be written as an integral over some interesting probability distribution. Actually, all sigmoids can be written as cumulative density functions of some probability density function, but whatever. That's just the fact that they are monotonic, because you are integrating up a positive thing. So if they are differentiable, you get out something that amounts to a probability density function. Why? Because the derivative is positive, they are monotonically rising, and it integrates to one because this thing goes from zero to one. So it's a probability density function. It has all sorts of nice properties that we're going to use over the next few slides: it's sort of anti-symmetric around zero, it has an inverse that you can directly compute, which looks like this, and it has a derivative that looks like this, which we also know. There are other sigmoids, but this is the one we're going to use because everyone uses it. Yeah. Okay, so before we go into the break, I just want to give you an idea of what we're going to do, with a little app once again, just to make clear what our plan is going to be. So we have a framework called Gaussian process regression that produces random functions that map from any input space to the real line, in red. The outputs of those are called f of x, function values f. In deep learning, you've learned about those being called logits. Why? Because they are the thing that goes into the logistic. You get them by taking the outputs and applying the inverse logistic. If we take these functions and push them through the sigmoid, the logistic, we get this blue thing, which is a function that maps from the input x to the simplex zero-one, to the bounded domain between zero and one. And I actually get this plot by taking these three red lines and mapping them through the sigmoid function. I get the thick line in the middle by taking the thick prior mean of the Gaussian process and mapping it through this. So zero maps to one half. And I get the shading by doing what we learned in lecture three: I take the density of the Gaussian at each point and then map that density through the transformation. So it's up there, the first line in math. So I need to multiply the value of the density at each point that maps to one up there with the Jacobian, to get this squishing of the lines. And that's what gives this thing up there. And then we make labels for our input points at each input location.
Here I have taken five input locations. By doing what? Well, by just drawing binary random numbers with probability pi. So what I've done here is I've just drawn, at each location, a uniform random number between zero and one, those are the green dots. And then if the function at that point is above the green dot, then let's call that a positive label. And if the function at that point is below the green dot, we call that a negative label. So this is actually a surprisingly rich language that we can play with. For example, I'm going to ask you what happens if I move the mean of this Gaussian process upwards? You know that we can change the prior mean function in any way we like. If I move this red line up, what's going to happen? Okay, yeah, so now say. The blue part should stay invariant, is your guess? Ah, interesting, what's your guess? Yeah, so we're going to move this entire distribution into the domain, if I can go back to the plot with the sigmoid, right? This whole thing is going to shift to the right. And that means we end up further up towards one in the map. So if I do this, and I move it up, now my machine has to work a little bit. You can see that the density kind of gets squished up against the one, and it becomes much more likely that we get positive labels. And actually in this case, all of the labels are one, and now the plotting, which I didn't take enough care of, is messed up a little bit and you don't see a dot at zero anymore. The other way around, if I set it further down, then everything is quite unlikely, right? Then we get lots and lots of zero labels. So this is a way of creating a prior probability for one class or the other. What happens if I change the output scale of this thing? So I make this distribution narrower or wider. That's a bit harder to think about. It's a bit less intuitive. I'll just show you. So if I make it much wider, then the functions will typically be either at zero or at one. So the probabilities will be, in some sense, harder. There'll be less local noise on the labels. If I keep drawing from one of these, if I draw one sample, one blue line, and then keep it, then that means at each point, there's a hard probability of zero or one to get a positive or negative label. And if I do it the other way around, then we get very close to 50%. Pretty much all the time, there's just a 50-50 chance of being above or below. And if I change the length scale of this kernel, so I make these functions more or less wiggly, then we can make the ups and downs between the labels more or less extreme. So if I make the length scale short, everything becomes almost independent of everything else. So at each of these locations, we get completely different predictions. Or if I make it very long, then everything is kind of the same. So for each of these blue lines, for each of these sampled functions, you get almost always the same kind of output. So I encourage you to play with this if you like, because it really gives an intuition for what's going on. In particular, there's this interesting business with these samples: here I draw these blue lines, and each of them is one function, which defines a probability distribution over the labels. So we have an initial probability distribution that we draw functions from, and then each of these functions is itself a probability distribution. And that's really confusing the first time you see it, so it might be good to play with this app a little bit to get a sense of what that is.
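If you want to reproduce roughly what the app does, here is a minimal sketch of the generative story just described: sample one latent function from a GP prior on a grid, push it through the logistic sigmoid, and toss a coin at a few input locations. This is my own illustration; the RBF kernel and all parameter values are placeholder choices, not the ones used in the app.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(a, b, output_scale=1.0, length_scale=1.0):
    """Simple RBF kernel; output_scale and length_scale are the knobs discussed above."""
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return output_scale**2 * np.exp(-0.5 * sq_dists / length_scale**2)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# draw one latent function f ~ GP(mean, k) on a grid, push it through the sigmoid
x = np.linspace(-3, 3, 200)
mean = 0.0 * x                                   # shift this up or down to bias one class
K = rbf_kernel(x, x, output_scale=2.0, length_scale=0.7)
f = rng.multivariate_normal(mean, K + 1e-9 * np.eye(len(x)))
pi = sigmoid(f)                                  # class probability at every x

# at a few input locations, draw a uniform number u and compare: u < pi(x) -> label +1
x_idx = rng.choice(len(x), size=5, replace=False)
u = rng.uniform(size=5)                          # the "green dots"
labels = np.where(u < pi[x_idx], +1, -1)
print(list(zip(np.round(x[x_idx], 2), np.round(pi[x_idx], 2), labels)))
```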
Now we take a break, and at quarter past, we will talk about how to actually implement this. Very good question, actually, during the break. So to come back to this problem very briefly, and not for too long, because I need to move on. Maybe it was actually confusing to show you this plot and say this is sort of a difficult situation. So let me just clearly say one thing again. If your problem looks like this, and of course you typically won't know, because it's not like you can look at the data all the time. Well, you can do low-dimensional projections, but your real data might have 768 dimensions, whatever. Then we can still apply what we are talking about today. Everything we do today can be applied to problems like this. But maybe, like me, you have a bit of an uneasy feeling in doing this. Because one way to think about this is: if you're in here, what would we like our model to put out? We'd like it to put out something like, well, it's one of the data. So it's probably either a pair of pants or a sweater, I just don't know which one, because it's right in the middle. While up there, we kind of want it to output something else, not 50-50, but "this is just not data". It's a very different, qualitatively different kind of output. And we will need to think about why this is a problem or not. And actually in the community, there is largely an approach of "I learned to stop worrying and love discriminative classification", because you can just apply it anyway. And in fact, maybe you've heard about semi-supervised learning, which is a setting in which you have some of these data points labeled and the other ones unlabeled. There are theorems that show that semi-supervised learning only works in these anti-causal settings. Which, if you think about it, kind of makes sense. But everything we do today and in the next few lectures can be applied to these problems. And in fact, it is applied to these problems. Because all of computer vision is like that, right? When I give you an image and say, does it contain a car? That's exactly like this. It's exactly this problem. And people do logistic regression all the time. So we've now realized that we can probably keep using our prior, called the Gaussian process, because we now know and like it. We have studied it a lot. We've learned about kernels. We've learned about the spaces that they span. We've learned that we can do some interesting linear algebra on the matrices that they produce. And we can use them to describe models that create functions that map from an input space to an output space; it's just that so far the output space had to be the real line. And now we're in a setting where the output has to be a binary label, actually a probability for the binary label, as we said a moment ago. And we've just decided, in the five minutes before the break, that we're going to tackle this by taking the framework that we have, which produces functions that live on the real line, and then we only tack on a final bit, a final layer, that maps from a real-valued function to probabilities, to zero-one functions. And an interesting function to use for this is this logistic function, which here I'm only calling sigma, for sigmoid function, which produces these probabilities, yes. So, very good, the question is: what kind of model is this? So first let's observe that it's a likelihood.
So what this thing does is it writes down a probability for the label, Y, given the latent function F, where the latent function F we can assume to come from a Gaussian process. And now your question is, so in the previous lectures, we had a Gaussian noise model, where we sort of described this likelihood as saying: there is the actual function, and when I measure it with a sensor, that sensor has some Gaussian noise. And then we can do regression. And maybe back then this Gaussian noise already felt a bit, whatever, like a convenient class of noise models. And now this is our new noise model. It says someone took this latent function F, and then they pushed it through this sigmoid function, and then they tossed a coin. So maybe let me just show you this picture again. That is literally what happens here, right? They took this function, they pushed it through the sigmoid, and then they tossed a coin, that's this green thing. And then depending on whether the function is above or below, you get an output above or below, zero or one, or plus one, minus one, whatever. What kind of model is this? Well, it's a Bernoulli-type model. And it's maybe just as awkward as the Gaussian noise model was before. And of course, it also contains, actually, in a mathematical sense, the essence of this problem with anti-causal modeling. So this is not actually what happens, right? It's not like there is the world that produces images on a camera, on a pixel, and then there is a function somehow that takes the pixels of an image and maps them onto some real-valued space, and then it gets squished through a softmax function or a sigmoid. This is, by the way, the binary softmax, right? And then someone tosses a coin to decide whether that's an elephant or a car. That's not how the world works at all. It's not a generative description of the world. And yet, that's what we're going to use. And maybe what's most important to point out here is this: at this point, probably the whole Bayesian thing feels really weird, right? Where Bayesian is this philosophy of, you know, the world is generated in this sense and then you can compute probabilities. But the probabilistic machinery doesn't fall apart because of that. Probability theory is just about measuring remaining volume in a space when you condition on something. So what we're going to do now is we're going to go through this process backwards. So if someone gives you, at various input points, labels, so these are these black things up there, and says, that's a car, that's an elephant, that's a car, that's a baby, that's an airplane, and so on, we want our algorithm to use this to basically think backwards through this process and ask: if we have assigned this probability measure in this way, what does this tell us about the remaining degrees of freedom in this problem? And, in my opinion, the right way to think about this is really just as some kind of sensitivity of the model. Which parts are constrained, and how are they constrained, by the fact that we make binary observations? And not in a deep philosophical sense of "that's how the world works", because it evidently isn't how the world works at all. Completely not. And the only reason really we have for doing it is that it seems to work really well. So it works better than our human vision system, so it might be useful to do. And in fact, we are essentially done. We have everything we need, because we've now written down a generative model.
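For reference, one compact way to write down the generative model that was just described, assuming the plus/minus-one label encoding and the logistic sigmoid from earlier, is the following; the last line uses the antisymmetry sigma of minus z equals one minus sigma of z.

```latex
% Generative model with labels y_i \in \{-1, +1\}:
f \sim \mathcal{GP}(m, k), \qquad
p(y_i = +1 \mid f(x_i)) = \sigma\big(f(x_i)\big) = \frac{1}{1 + \exp(-f(x_i))},
% and, using \sigma(-z) = 1 - \sigma(z), both label values combine into
p(y_i \mid f(x_i)) = \sigma\big(y_i \, f(x_i)\big).
```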
So in lectures two and three and four, I said the way to do Bayesian inference is: you write down the generative model, that's this, a generative model is a prior and a likelihood. There's the prior and the likelihood. And then you just do Bayesian inference. So you multiply the prior by the likelihood and you divide by the evidence. So can we do that? Well, we sort of can. It just doesn't quite neatly work out. It's not so beautiful. So if you multiply a Gaussian prior over the function with this likelihood at each of the points, this sigmoid, this logistic function, then we can't do what we did so far. So in the Gaussian setting so far, we said: ah, if you think about this in log space, then this is a sum of the logarithm of a Gaussian and the logarithm of a Gaussian, and logarithms of Gaussians are quadratic functions, and sums of quadratic functions are quadratic functions, nice. So the posterior is going to be Gaussian, conjugate prior. It's a pair of conjugate prior and likelihood, and there's gonna be a nice posterior, wonderful. But now this is not the case anymore. And so, in a way, a door opens up to say, well, what do you want to do? So one option is to jettison this entire framework of regression that we've now spent so much time on and to say, well, maybe there is a conjugate family of priors for these sigmoid likelihoods. And there are many people who've thought about this for a really long time, and it's just really painful. So there is no nice algebra to discover, or at least not so much, there's a little bit. Sometimes you find some crumbs, some weird distribution, some kind of Poisson processes, whatever, but it never quite works out. It never becomes as beautiful and linear-algebra-like as the Gaussian setting. And so instead, what people say is: well, we just have to live with the fact that that doesn't quite work, and we need to find approximations. So let's think about this once more from a pictorial perspective. Here is the prior, and what I've done here in this picture is take this Gaussian process from the app, this one down here. And this is, as you know, actually an infinite-dimensional probability distribution over a function value at every input point. So now imagine I only cared about two of these function values, the one at zero and the one at minus two. Then these are two Gaussian random variables, because that's what the Gaussian process is. It's a potentially infinite set of random variables which, when restricted to a finite subset of them, always gives a Gaussian distribution. So let's say I care about the value at zero and at minus two. These are two function values that are related to each other because they are jointly Gaussian random variables that covary. So there's this Gaussian distribution, this gray thing in the middle, which has this covariance shape, right? So it has a non-trivial covariance. It's not a circle, it's this cigar-shaped thing, right? So what we'd like to do is to say: ah, we now make an observation at f one. So we're going to multiply with the likelihood at f one, and the likelihood only depends on f one, because that was our observation so far, or our assumption. And then we want to do inference. We want to figure out what f two is, actually. So we want to multiply this thing with a likelihood. So that's a function that only depends on f one. It'll be sort of axis-aligned, and then we see what that does to the posterior when we multiply the two things with each other.
And the only problem is that so far the likelihood was a Gaussian. So we could multiply with a Gaussian and get out a Gaussian posterior. But now the likelihood is this sigmoid function. It's this thing. So if you observe the label at that point to be positive, so we observe Y to be plus one, then that means sigmoid of f one should be a large number for this to be likely. So if f one is a very large number, it's quite likely to observe this plus-one label, because then the probability is one and we get to see the label one with high probability. If f one is a very small number, if it's minus five, then it would have been very unlikely to observe a positive label, because the sigmoid of that is essentially zero. So that's what the likelihood is. Remember that likelihoods are functions of the variable on the right-hand side of the conditioning bar, so they are not probability distributions. They're just the thing that makes the conditional distribution for the output. So if we multiply this gray thing and this blue thing, we get this thing, which is not a Gaussian. It's just something. It's just annoying is what it is. And at this point, I mean, this was an observation that was made like a hundred years ago or so by various people; practically minded people joined the fray and said, yeah, okay, so I can't write down a closed-form expression for this. I don't know how to do this on a piece of paper, because of course, when they did this, they didn't have computers, so they couldn't make plots like this. But what I can do is write down an algorithm that finds the point that is somewhere in the middle here. This thing, the mode of this distribution, it's just an optimization problem. I'll just find the point where this red thing is the highest. And then what do I do? Laplace approximation, wonderful. So I'm going to construct something like this. So in red, in the background, you see the equipotential lines of this thing. So these are just the same thing, they're just now printed differently so that you can see what's going on. And in black, we see the Laplace approximation, which is constructed by finding the mode and then evaluating the curvature at that point. And that gives us a quadratic function, which we can think of as an almost-Gaussian posterior. Actually, it is, I mean, the black thing is then the log of a Gaussian, or is a Gaussian, actually, if you plot it like this. But it's not the red thing, of course, right? It's an approximation to it. But maybe it's not totally stupid, right? It captures something about it. It captures the mode, it captures the shape. It captures how things relate to each other. So here's the math for this again. What we're going to do is we will take this distribution, which emerges by constructing prior times likelihood. And actually, for Laplace it doesn't really matter how this emerges, it's just a distribution. And then we do what we've now done several times, also in the exponential family lectures: we find the mode of, typically, actually, its logarithm, because the logarithm is a monotonic transformation, so it doesn't change the location of the mode, and it makes things easier to work with, because probability distributions are numbers between zero and one, so they're usually sort of annoying numerically. So we take the logarithm, and then we have full support. We find the mode. At the mode, the gradient is zero. So we can do a Taylor expansion in the distance to the mode.
A Taylor expansion, as you know, consists of a constant term — the value of the function at the mode — plus a linear term involving the gradient, but the gradient is zero, so we don't need that term; plus a quadratic term involving the Hessian, the second derivative of the function at the mode, multiplied from the left and the right in a quadratic form with the distance to the mode; plus higher-order corrections. We ignore the higher-order corrections and treat this as our distribution — and if that is the logarithm of something, then its exponential is a Gaussian. So this is the picture and this is the math for it: we find the mode, the black dot in the middle, we evaluate the Hessian, and that gives us a quadratic function. That's the idea, and now we step back down into the hell of the mathematics, right? For Gaussian process regression — we've done this for a few lectures now, and Marvin also told you about it — there is the pictorial, user-facing layer: we have data and labels and so on. Then there's the modeling layer, which we've just done: we've decided to use a Gaussian process prior and the sigmoid link function. And now we step down into the algorithmic layer: we need to find this mode. How are we going to do that? This is the bit that the machine learning engineer has to deal with, while the data scientist can sit outside and say: give me a toolbox, I'll deal with it, all right? So what actually happens? Today, before we end, we will only care about the first part — the dot here in the middle, the point estimate — and how to find it. Then on Thursday we'll talk about the other part, the curvature around it, which gives the uncertainty. This dot in the middle, the point estimate, is associated in statistics with the notion of logistic regression. If you've heard that term before, that's what it is: logistic regression. And the thing around it that gives uncertainty is called Gaussian process classification, because there we actually use the Gaussian process nature of the problem. So what does that mean in terms of math? We'll look at this a few more times, also on Thursday; now we do it for the first time. We want to find the mode of the posterior distribution. The first thing I need to tell you is that once we have found this mode, we can actually do prediction anywhere — and that's why, from a probabilistic perspective at least, we use the framework of Gaussian process regression. If we believe the Laplace approximation, that is, if we believe it really gives a Gaussian posterior on f, then afterwards we can reuse all the knowledge we have from Gaussian process regression as a shortcut to save ourselves time. We can even reuse our code. If we find the mode and a Hessian around it, then we can construct this approximate posterior on f at the training points — there are finitely many of them. They act like labels in the regression world: a translation from the labels in plus-minus-one space into real-valued pseudo-observations called f-hat, with noise called Sigma-hat. And then the posterior at any other input point is going to be — well, you can stare at the complicated math, or you can just look at it and say: ah, this is exactly like doing Gaussian process regression on observations that are now called f-hat.
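(Written out — in one common notation, which may differ slightly from the symbols on the slides — the Laplace step and the resulting regression-style prediction look roughly like this:)

```latex
\log p(f \mid y) \;\approx\; \log p(\hat f \mid y)
   \;-\; \tfrac{1}{2}\,(f - \hat f)^\top \Psi\,(f - \hat f),
\qquad
\Psi := -\nabla\nabla_f \log p(f \mid y)\big|_{f = \hat f}
\quad\Longrightarrow\quad
q(f) = \mathcal{N}\!\bigl(\hat f,\; \Psi^{-1}\bigr),

\mu(x) \;\approx\; m(x) \;+\; k(x, X)\,\bigl(K_{XX} + \hat\Sigma\bigr)^{-1}\bigl(\hat f - m(X)\bigr).
```

Here f-hat and Sigma-hat play the role of the pseudo-observations and their noise, so the second line is just the usual GP regression posterior mean.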
These f-hat are treated as observations of the function values f(X) with noise Sigma-hat, and that just gives me a Gaussian process posterior, which I can evaluate, and it looks exactly like the stuff you've seen before. Here I've rewritten the inverse a little, but it's the same thing you know — well, you don't necessarily love it yet, but you've learned to use it. The posterior mean is the prior mean plus kernel matrix times Gram matrix inverse times residual — aha, exactly the sort of thing we've used so far. So that's nice: we can reuse our linear algebra and our nice Python code and do everything with it. Okay, I'll talk about the other part on Thursday; I'll leave that for now. And no — f is in real space and the observations are binary, so there is no simple way to just take the observations and directly map them to an f-hat. We need to find them, and how do we find them? (I just realized I should actually jump to slide 19, and we'll talk about the other one on Thursday.) How do we find them? Through optimization. We write down the log posterior. The log posterior is log prior plus log likelihood minus log normalization constant, but the normalization constant doesn't depend on f, so we can ignore it — it's just a constant. The log likelihood is the logarithm of our likelihood; remember, the likelihood is this logistic function, so its logarithm is this: the logarithm of one is zero, so we get zero minus the logarithm of one plus e to the minus whatever the input is. And the logarithm of the prior is the logarithm of a Gaussian, so it's a quadratic form in the kernel Gram matrix, plus a normalization constant. We can compute the gradient of that — I actually have it here on the slide, this is what the gradient is — but maybe we can do without it for the moment and just let the computer do it for us. We have these cool autodiff libraries, so let's see if they can do it for us, because I just want to show you this, and then we can do the rest on Thursday. So what I've done here — and I've put it on ILIAS — is I've created a bit of a sandbox to play with this kind of problem. We don't need to look at the stuff up here: as usual, I load a bunch of Python imports. Then I make a data set, and here I've chosen a very simple one because I want something that's easy to produce and easy to think about. You've probably seen it before: it's called the two-moons data set. It's a bit silly, really, but whatever. There's a simple function in scikit-learn that just makes it, so I didn't have to download anything; I just generate it. There are two classes here and they're completely invented — just red and green, whatever. So it's not a good practical example, but it's easy to work with. You're going to deal with MNIST, actually, this week already in the exercises, and we'll probably do it on Thursday as well. And now here's the actually interesting bit. What we're going to do is write down the log prior and the log likelihood, sum them, compute their gradient, and let an optimizer take care of the optimization. That will look a lot like writing down a deep model, if you've taken a deep learning class — maybe you can see the similarities. We will write down the model, which is our Gaussian process prior.
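(For reference — written out in one standard notation, the slide's symbols may differ — the objective and the gradient mentioned above are roughly:)

```latex
\log p(f \mid y) \;=\; \sum_{i} \log \sigma\!\bigl(y_i f_i\bigr)
   \;-\; \tfrac{1}{2}\,(f - m)^\top K_{XX}^{-1}\,(f - m) \;+\; \text{const},
\qquad
\log \sigma(z) \;=\; -\log\!\bigl(1 + e^{-z}\bigr),

\nabla_f \log p(f \mid y) \;=\;
   \underbrace{\bigl[\,y_i\,\sigma(-y_i f_i)\,\bigr]_i}_{\nabla_f \log p(y \mid f)}
   \;-\; K_{XX}^{-1}\,(f - m).
```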
Gaussian processes have a mean function and a kernel function, so we define those two. Here the mean function is going to be a constant function, which I typically set to zero — just a flat function at zero. And for the covariance, I've decided to use the Matérn 5/2 kernel, the third of the Matérn kernels, the one that's sort of wiggly-smooth, which we've used in some of the examples. Then I need this likelihood function, which in deep learning, or in logistic regression, is sometimes called a link function. And it's the sigmoid, this function. Actually, I wouldn't need to define it — it's available in jax.nn — but it's fun to write down anyway, so I just wrote it down. Okay, so now this defines a prior. It doesn't define an inference procedure yet, but it defines a prior. And in particular, what you can do with priors is draw from them. So I've defined a Gaussian process — here is the line that produces it — and it uses our Gaussian library, which you've now used extensively for several weeks. It's instantiated with this constant mean, set to zero, so it's a zero mean, and a Matérn-type kernel (the 5/2 one) with length scale 0.7; you can play with that in the code if you like. And it makes a nice picture. Why? Because I can actually draw from it. So now I've drawn a sample by saying: dear prior, please evaluate yourself on a mesh grid so I can make a plot — mesh grids are just plotting grids — and then draw a sample using the JAX random key I handed in from above (it's instantiated at the beginning); only one sample, please. And then please make a plot, and the plot looks like this. What you see in the back is a sample from the prior pushed through a softmax — actually, I should say: it plots the contours of the sample with the sigmoid applied to it. I'll keep mixing up sigmoid and softmax because they're essentially the same thing; the sigmoid is the one-dimensional, two-class version of the softmax, that's all it is. Then it reshapes the result so it works for the plotting script, and the plot looks like this. You can tell that, of course, this prior sample does not know about the data yet. So it's just wrong, in some sense, right? It predicts the red class where there is green class and so on, but that's fine — it's just a prior sample. But it does have roughly the right shapes: the length scales on which it varies look a little bit like the length scales on which the data set varies. In particular, it's not all red or all green; it moves up and down from red to green on scales into which the data set might plausibly fit. Good, and now we just want to do training. The only thing we want to do today is train the point estimate — find the mode of the log posterior, which will then become the mean of the Laplace approximation. On Thursday we'll build an actual GP around it. For that, I write down the log likelihood, which I could also just have called log-sigmoid, but I like writing down an actual likelihood and log likelihood, so I do that here. This is a numerically stable way of implementing the logarithm of the sigmoid. The sigmoid is one over one plus e to the minus x; the logarithm of that is what this JAX expression computes. We take our inputs — I've called them y_signed, because I assume them to be plus or minus one, not zero and one; otherwise I would have to implement this differently.
And then I take plus or minus one times the value of f, with a minus in front, and compute the thing that is in the line above: the logarithm of this expression is the logarithm of one minus the logarithm of the stuff below, and since the logarithm of one is zero, that's zero minus the logarithm of one plus e to the minus y_signed times f. JAX has a convenient library function called logaddexp, which computes the logarithm of a sum of exponentials of its inputs, so I call it with zero and minus y_signed times f as inputs, to get the logarithm of one plus e to the minus y_signed times f. That's the numerically stable way of doing it, and it's a bit annoying that we have to do it this way, but well, that's how it is. There's also a function called logsumexp, which does much the same thing but treats its inputs differently; logaddexp is nicer here because it takes the exponents directly, which keeps things numerically stable. Why y_signed — why plus-minus one rather than zero and one? Because the sigmoid is this point-symmetric function around zero, so it's convenient to use plus-minus one: then we can just multiply by y and get the right answer. For negative samples, samples labelled minus one, we multiply by minus one, and the sigmoid that used to look like this now looks like this, because it's symmetric around the origin. Okay, now I tell this Gaussian process prior to instantiate itself on the training point set. There are, I think, a hundred and something — how many did I create? I asked it to make a hundred samples, so there are a hundred of them — and it creates this distribution. This is when the functional object collapses into a hundred-dimensional Gaussian distribution with a covariance matrix that's a hundred by a hundred. And now I define the thing that I'm going to optimize. It's called loss, in foreshadowing, because that's what it's called in deep learning. It's the logarithm of a negative posterior — sorry, the negative logarithm of a posterior; I always get these two mixed up. We take the log likelihood plus the log prior — that's something we want to maximize — and instead we put a minus in front and minimize the result. That's just the convention in optimization, to minimize things; we could just as well have left out the minus and maximized, which is just a different optimization routine that adds gradients rather than subtracts them. Okay, and now let's put ourselves in the shoes of someone who has just taken a deep learning class and nothing else — maybe that's you, I don't know, not everyone. If you've learned how to optimize a function and you have access to an autodiff library, what's the algorithm you want to use? Gradient descent. So let's do that. I set the step size to 10 to the minus three, because that's apparently what you're supposed to do in gradient descent, and I tell it to do a thousand steps — in deep learning, steps are called epochs if they go through the whole data set, and of course each step here goes through the entire data set, because it's just a hundred points. Then I write down this function, the optimizer. It takes in the current value of our f at those hundred points, and then, conveniently, I'm not going to do any complicated math by hand — I'm just going to call jax.value_and_grad.
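(Put together, the setup might look roughly like the following sketch. This is not the notebook's actual code: the lecture's own Gaussian library is replaced by a hand-written Matérn-5/2 kernel, and the length scale, noise level, jitter, and random seed are illustrative.)

```python
import jax
import jax.numpy as jnp
from sklearn.datasets import make_moons

# Toy data: two moons, with the {0, 1} labels mapped to {-1, +1}.
X_np, y01 = make_moons(n_samples=100, noise=0.1, random_state=0)
X = jnp.asarray(X_np)
y_signed = jnp.asarray(2.0 * y01 - 1.0)

def matern52(A, B, lengthscale=0.7, variance=1.0):
    # Matérn-5/2 kernel between two sets of points (rows of A and B).
    sq_dists = jnp.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    r = jnp.sqrt(5.0) * jnp.sqrt(jnp.maximum(sq_dists, 1e-12)) / lengthscale
    return variance * (1.0 + r + r**2 / 3.0) * jnp.exp(-r)

# Zero prior mean; Gram matrix on the training inputs, with a little jitter.
K = matern52(X, X) + 1e-6 * jnp.eye(X.shape[0])

def log_sigmoid(y, f):
    # Numerically stable log sigmoid(y * f) = -log(1 + exp(-y * f)).
    return -jnp.logaddexp(0.0, -y * f)

def loss(f):
    # Negative log posterior, up to constants:
    #   -sum_i log sigmoid(y_i f_i) + 1/2 f^T K^{-1} f.
    log_lik = jnp.sum(log_sigmoid(y_signed, f))
    log_prior = -0.5 * f @ jnp.linalg.solve(K, f)
    return -(log_lik + log_prior)
```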
jax.value_and_grad is the function that returns both the value and the gradient of a function; here that function is the loss, evaluated at f as its input, and then we take a step of gradient descent. Gradient descent is: f, the current location, minus step size times gradient. And then it just does that: we initialize at zero, the prior mean, and then for a bunch of steps, a thousand of them, we run this. Let me do that. Here it runs — a progress bar, a thousand steps — and here's the output, and... ah. That just happens with gradient descent, right? So let's go down — step size ten to the minus four. Okay, now it goes off, but that's not far enough down; no, we want it to be flat. Okay, so maybe a bit faster again... there we go. Okay, I need to do more steps. Ten to the minus four, and let's do 10,000 steps. This is going to take forever — you're going to need to leave, right? So what do we do, just wait? We go back to our manager and ask for a GPU — can we get this faster, please? Actually, there is something we can do without asking for a GPU. This is the thing that is slow, the function that takes forever to run, and we're just going to sit and watch it finish. What could we do to make it faster? Just the naive thing, without thinking too much about it. Computing the gradient manually? That's already thinking too much about it. What's the naive thing we can always do? Well, we can just type jit up here, and then see what happens. Ooh, wow, way faster. Okay, now we can actually afford to do 20,000 steps, right? Let's do that. Ah, nice. So jit is just-in-time compilation; it's often a good idea to do this. Now, it's not flat yet, though, right? So maybe I should do 50,000. Getting there — let's do 100,000. Maybe 200,000, can I do that? Yeah, okay — close to zero, right? Converged, great. We just had to play a little bit with the parameters back and forth. Okay, now we have an estimate, f_trained, and you can just plot it; that's what I'm going to do, and then leave it there. The only thing we still need is a prediction on the plotting grid, and for that we use the fact that what we've found here are effectively the training labels of a Gaussian process regressor. So we can post hoc compute just the prior covariance, the K_XX, without the likelihood, and compute its Cholesky factorization. It's just 100 by 100, so that's not going to be a problem — just run it, done. And we construct sort of half a GP, which is just a mean function; I'll leave the kernel part for Thursday. So I'm just producing this posterior mean function, which looks like the stuff you've seen before: it's just k(x, X) times a Cholesky-based solve applied to the trained values. Evaluate it and make a plot — it needs a bit of time because it plots on a fine grid. Why does it take so long? What are we waiting for? Ah, there. Okay, this looks good, no? We've learned a function that is green where the green bits are and red where the red bits are, and that's called logistic regression. And for now, we can be happy with that.
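(Continuing the sketch from above — again, the step size, step count, and helper names are illustrative, not the notebook's actual values:)

```python
from jax.scipy.linalg import cho_factor, cho_solve

step_size = 1e-4
n_steps = 200_000

@jax.jit  # just-in-time compile the update; this is what made the demo fast
def gd_step(f):
    value, grad = jax.value_and_grad(loss)(f)
    return f - step_size * grad, value

f = jnp.zeros(X.shape[0])  # initialise at the prior mean (zero)
for _ in range(n_steps):
    f, value = gd_step(f)
f_trained = f

def predict_mean(X_star):
    # Posterior-mean prediction on test points X_star (shape (m, 2)),
    # reusing the GP-regression form: mu(x) = k(x, X) K_XX^{-1} f_trained.
    c, low = cho_factor(K)                  # Cholesky of the 100x100 Gram matrix
    alpha = cho_solve((c, low), f_trained)  # K^{-1} f_trained
    return matern52(X_star, X) @ alpha

# Class probabilities on a plotting grid would then be
#   p_star = jax.nn.sigmoid(predict_mean(X_star))
```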
And clearly, it felt a little bit like deep learning, but motivated from a probabilistic perspective. So on Thursday — whoops — we will return to this picture. We've just realized that we can solve classification problems, so-called discriminative classification problems, where you just get the input and have to predict the label, by taking the Gaussian process framework, salvaging as much of it as we possibly can, keeping all the algorithms around, and just changing the likelihood. The only price we have to pay is that, when we do this, we lose the elegant linear algebra form, and we then have to do gradient descent, at least for now, to find the mode. And on Thursday, we'll think again about this gradient descent business and why it was so painful to go through all of this. All right, thank you.