Good morning everyone. Welcome back to another week in late June. Very happy that you're all here. Today we're going to move, as you can see in the title, to deep learning, but of course we start with feedback, which was relatively decent last time. You roughly seem to like the lecture, and I should say that the evaluation of this course is also in, but I don't want to discuss it today because it just doesn't quite fit time-wise. I will either do that on Thursday or next Tuesday; I'll put aside a few minutes to talk about the evaluation. Here's the detailed feedback. Two of you wrote that it's still too fast, that there's still too much math, and that we're flipping back and forth too much between different layers of communication. I will try to change that a bit today; let's see whether it works out. Some people also seem to like me jumping around, so, whoever you are (I'm not sure whether this was meant earnestly or ironically), thanks a lot for pointing out that I wave my hands around a lot. If it actually helps someone, that's good.

Someone asked, and this is a very good question: if we're constructing uncertainty in this really weird way, by first making modeling assumptions that are wrong, then finding out what we need to compute, then realizing we can't actually compute it and need to approximate it, what is the uncertainty actually worth in the end? My answer is that it's worth about as much as the point estimate, but it's a completely separate, additional thing that you get on top. There are two extremes that both don't work, and I want you to lie between them. One is the naive view that Bayesian reasoning is perfect, and therefore all the uncertainty that falls out of Bayes' theorem is a perfect quantification of what might be wrong about the world: exact uncertainty. That's only true in the mathematical sense if you know the generative model, if you know that the data is actually drawn from some joint distribution with some latent variable and you actually observe that data, and that's pretty much never true in the world. It's an ideal we can aspire to but will never actually reach. So if you perfectly believe in your uncertainty, you're typically wrong. The other extreme, which is actually much closer to most of ML practice at the moment, is: here is a point estimate, take it or leave it; I've made this point estimate, and if it's wrong, it's not my problem. That's also dangerous. In a way it's easier for the person making the prediction, because you never have to rationalize why your prediction is wrong; when it doesn't work, you can just say, ah, it's some machine learning thing, it's a bit weird, it doesn't quite work all the time. But of course that's unsatisfying as well. So what the kind of uncertainty we construct in this course gives you is an additional sense for the sensitivity of your model to the data: which parts of the model have been affected by which data and which have not, and how much might therefore change if you see future data. These are all useful functionalities to have, I would say. You should make use of them, and you should think about how much they cost to construct, but you also shouldn't go around telling everyone that this is the perfect error bar that will always be correct.
For me, I sometimes make a distinction between Bayesian reasoning and probabilistic reasoning, which is a bit imperfect because these words tend to be used interchangeably. But for me, Bayesian is the philosophical thing: you learn from data and then you're uncertain, in the sense of an error bar. Probabilistic reasoning just takes care to measure things correctly, and then expresses the measures that fall out of the computation, and you can analyze those measures and think about whether you trust them or not.

The other question was: what are these objects with these indices? It's actually good that this question only comes up now, in lecture 16, because in previous instances of this course it came up much earlier; clearly my notation was even more confusing then. So I've started using these bullets, and I know they were confusing for some of you, but maybe they helped a bit, because they sort of avoided this problem. Very simply, what I'm doing with this notation is saying there is a function, and usually we write functions with brackets, that's the typical notation, but computationally, in particular when we do Gaussian process inference, we effectively treat functions like vectors. We think of an infinitely long list of function values that you can index into. That's actually what we do in the functional style of programming for Gaussian processes: we provide these functions called the mean function and the covariance function, and then we instantiate them by calling them on an evaluation grid, literally an array, by slicing into it, and I use this shorthand notation to write that. That's maybe the first source of confusion. The second one is that in these problems we tend to have training data, a vector of inputs that I call capital X (and I know capital X is dangerous), and then there is little x, which are just points where we evaluate, and sometimes there is also x_i, which is one particular one of these. Because this lowercase/uppercase business is confusing, I've started using the bullet to mean: we're evaluating at some point which we don't know yet. So it expresses that something is a function of an argument we haven't supplied yet, in a functional-programming way, a currying style. It's just this other bit that we haven't evaluated yet, and we leave it flying around. I know this is confusing, but pretty much all the other notations that I've tried, or that I know, also aren't good. We could use a and b, but then what's the connection between a and x? We could use x-star as the test point, but star sounds like something optimal, something optimized, and today we will have lots of stars on things that have been optimized, so that would be confusing as well. So, you know, I flip back and forth between the notations.
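To make the bullet notation concrete, here is a minimal sketch (helper names chosen for illustration, not the course's actual code) of the two ideas just described: instantiating a covariance function on a grid like array slicing, and leaving one argument open, currying-style:

```python
import jax.numpy as jnp

def k(a, b):
    # an example covariance function; a, b are arrays of shape (n, d) and (m, d)
    return jnp.exp(-0.5 * jnp.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1))

X = jnp.linspace(-3, 3, 5)[:, None]   # "capital X": the training inputs
kXX = k(X, X)                          # k(X, X): instantiated on a grid, like slicing
k_dot = lambda b: k(X, b)              # k(X, •): the bullet, one argument left open
```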
So, now that we're turning towards a new step in the course: what have we actually done so far? Here's a quick overview of the flow of the course on just one slide, so obviously it's very condensed. We started out, in the first three lectures, with me arguing and you observing that probabilistic inference is the correct language for reasoning under uncertainty. It's the right way to distribute truth: instead of assigning, in the Aristotelian sense, a binary value (something is true or false), we distribute truth over several possible hypotheses, each of which is a little bit true, because we don't know yet which of them is the right answer. And we realized that the main thing to do here is not just to use measures, but also to make sure that those measures sum to one; there is a finite amount of truth. That's what makes it work, and it's also what makes it hard, because now we have to keep track of all the possible hypotheses and how they interact with each other. If you do this in abstract form, there is no tractable concrete realization, because the cost is exponential in the number of variables we keep track of, with the base given by the number of values each variable can take: for binary variables, the complexity of Bayesian or probabilistic inference is $2^d$ in the number of variables $d$, and if each variable can take $k$ possible values, it's $k^d$. That's really bad if you have continuous variables, because then $k$ goes to infinity, and then this is intractable; it's even worse than exponential, it's just not computable. So instead we use parametric formalizations: we construct particular families of probability distributions over which inference is tractable, and we found that one very useful framework for this is exponential families. These are probability distributions whose log probability density function is a linear function of some parameters, and we saw that they in some sense simplify inference, because they map inference under particular likelihoods to this notion of conjugate inference, where we sum up sufficient statistics of the observations. This becomes particularly interesting for one of these exponential families, the Gaussian one, the one where the sufficient statistics are polynomials, the first two moments of the data set, because then inference turns into linear algebra, and that's very good because computers are good at linear algebra. Exponential families were lectures four and five and half of six, and since then, all the way to now, we've basically exclusively used Gaussian probability distributions for all our inference needs. That might sound like a big chunk of the course, but linear algebra is also a big chunk of what computers do, so that's kind of useful. In fact, we saw that we can do really powerful stuff with this framework. In particular, we can learn functions that map from an input domain X to the real line, through a little trick that is of course called linear regression, which you've seen before: we invent a function φ, a representation, that takes in the input x, takes care of it, wraps it, and computes a bunch of feature values, real numbers. So it maps from the input domain to a real vector space, and then we describe the function as a sum of these features weighted by some parameters, some weights w. The vector notation is just a clean, modern linear-algebra way of writing that sum of products of feature values and weights. And we saw that this is quite a powerful language: we can learn nonlinear functions with it, because while this is a linear function in w, it's a nonlinear function in x. So we can learn functions that map in various complicated ways, including on really complicated input spaces: multivariate input spaces, structured input spaces. We can even learn structured outputs, which we'll get to in a moment. So this is a really powerful language for functions.
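In symbols, as a sketch (the notation here is chosen to match the course's general conventions; the slides may differ in detail), the model just described is

$$ f_w(x) \;=\; \sum_{i=1}^{n} w_i\,\phi_i(x) \;=\; \phi(x)^\top w, \qquad \phi:\mathbb{X}\to\mathbb{R}^n, \qquad w \sim \mathcal{N}(\mu,\Sigma), $$

which is linear in the weights $w$ but, through $\phi$, nonlinear in the input $x$.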
Then we made an interesting observation: when we do Gaussian inference on these types of functions, we never actually encounter a lonely feature vector lying around. Instead, the only thing we really have to compute, the only thing that actually matters, is the inner product between two feature vectors, weighted by some covariance matrix Σ. And that led us to the observation that maybe we can get away without these features, by replacing this sum with an implicit sum, some other way of efficiently evaluating this object without the detour of computing the features. In particular, it's even possible to do this with infinitely many features, because this is basically a series expansion: a sum $\sum_{i,j} \phi_i(a)\,\Sigma_{ij}\,\phi_j(b)$ over $i$ and $j$, and if Σ is a diagonal matrix, just a sum over $\phi_i(a)\,\phi_i(b)$ with different inputs. Sometimes it's possible to evaluate such a series in closed form even if it has infinitely many terms, and that's called a kernel; or rather, that's a special case of a kernel, this whole object is always called a kernel. So we realized that we can learn quite expressive functions in expressive spaces with this framework, called Gaussian process regression. It also has a neat connection to concepts in computer science like functional programming, which we exploited in the code. And then, in the last two lectures, we realized that not all regression problems are of this type. The setup is already quite general in terms of the input (x could be pretty much anything), but the output has to be the real line. So in the last two lectures we asked: what if the outputs are, for example, discrete, binary, or integer values, as in classification? Then we can squeeze the framework a little and still make it work by changing the likelihood. We no longer do conjugate inference (a Gaussian prior times a non-Gaussian likelihood is not in general a Gaussian posterior), but we saw that we can approximate the posterior with various neat linear-algebra approximations, like the Laplace approximation, which is by the way just one example, but the most straightforward one, of the approximate-inference paradigm. So that's a big chunk of a classic machine learning course in probabilistic reasoning, and if this course had taken place a few years ago (actually, it did take place a few years ago), then at this point we would usually move on to various other cool things that probabilists have come up with over the years: what about unsupervised machine learning, what about other structured types of inference, can we build models for those? Maybe we'll get to that in a more compact form at the end of this class. But I really need to address, at this point, the elephant in the room, which is that a large part of machine learning now isn't actually phrased in this way at all. If you follow social media, you could get the impression that, I don't know, 80 or 90 percent of machine learning these days is deep learning. Because you're younger than me, I should point out that this has not always been the case. Historically, the field has evolved along at least two different viewpoints, maybe three.
When I did my PhD a while ago, there were people who worked with kernels, and they called themselves statistical learning theorists. They wrote lots of math and did all the stuff we've done here before; they talked about reproducing kernel Hilbert spaces and convergence rates and learning rates and so on. Then there were the Bayesians, largely associated with physics, who built these structured probabilistic models, observing which things can be multiplied with each other such that you can still do tractable computations. And then there were the so-called connectionists (they had an email list that I was on for a while as well) who built these deep learning tools. Actually, they weren't called deep learning then; that wasn't a word. It was just called neural networks, or connectionist models, or learned representations, depending on who you talked to. Everyone else looked at them with a little bit of disdain. That's a bit of a trope, but it's also not entirely untrue, because these models tended to be really difficult to use. When I tried training a network myself at some point during my PhD, it just did not work; I had no idea how I could ever get it to work. In fact, I had to send an email to the original author, who happened to be in the same town as me at the time, walk over, visit him, and have him sit down with me for two hours to tell me why this network didn't train. He then told me about all the bugs I had introduced, which of course no one had ever told me were a problem, because they were not in the literature. So there was this trope that there were literally only two groups in the world that could train a deep net, Geoff Hinton's and Yann LeCun's, and maybe, to some weird degree, this is still true.

Some people told me in the feedback that you haven't yet had a deep learning class. Okay, the other way around, who dares: raise your hand if you have had a deep learning class. Ah, that is pretty much everyone. So for those five of you who didn't raise your hand, we'll do a 15-minute introduction today, maybe 20 minutes, to what deep learning actually is, and for everyone else, let's see whether it matches what Professor Geiger told you in his class.

So what is a deep neural network? I'm sure you have lots and lots of different pictures in your head now, of perceptrons and ConvNets and ResNets and LSTMs and transformers and all this other stuff. For the purposes of this course, for what I'm trying to get at conceptually, we'll keep it very, very general. I will say: a deep neural network is a function that maps from an input to an output, where the input is something arbitrary and the output is a multivariate real vector. And that function has the property that it has two inputs; but "input" is a dangerous word here, so let me say it takes in two sets of variables. If you think of it as a program, two things go in between the open brackets. One is the input x, the thing the function takes in and maps onto something. The other is a set of variables that I will specifically call parameters, which change the behavior of this function. And that's actually it. If you just write it like this, there's not really a need to call it a deep neural network; it's just a function. Maybe that's important to point out sometimes: this is what these things are. It's really just a language to write parameterized functions.
Now, the typical setup that makes it deep is that these functions are usually written as some kind of hierarchical, recursive construction. You think of a function that takes the original input and then successively applies a combination of a linear transformation, through the parameters, and a nonlinear transformation, through something that is in some sense hard-coded. I will call this nonlinearity σ, which is common notation and of course reminiscent of a sigmoid, but it doesn't have to be a sigmoid; it can be any other nonlinearity. Then we apply another linear transformation and another nonlinear transformation, and so on, all the way to the top, where capital L is the number of layers of the network; roughly, $f(x;\theta) = W_L\,\sigma\big(W_{L-1}\,\sigma(\cdots \sigma(W_1 x + b_1)\cdots) + b_{L-1}\big) + b_L$. At the top we actually apply not just a linear but an affine transformation: we multiply with a matrix and then add a constant vector. This is often conceptually separated into weights and biases; the W's are the weights and the b's are the biases. You saw in my regression lecture that I'm actually not sure myself how useful this distinction is, because conceptually we could just add one more feature to the output, one unit that is constant one, and fold the biases into the weights. So it's not a super big deal, but it can sometimes be numerically convenient to separate them, because they tend to have different scales. Okay, this is it; that's what a deep neural network is. Ever since this structure emerged, people have been super excited about it, and we'll talk today a little bit about why.

But first let me show you some code, which we're also going to use in the next lecture, to make this quite concrete. I've taken the standard deep learning tutorial for JAX and adapted it a little so that it fits better with what we're trying to do. Here's the code, which you can also find on ILIAS. There's the usual boilerplate at the start where I just load a bunch of things; the important parts are that we're going to use JAX, and that we also load an optimizer. Everything else is as before. The first thing we need is to define what a network is, and this is it: a neural network is this recursive application of affine transformations to an input. We take an input, call it the activations, and then recursively compute activations times weights, plus biases, followed by a nonlinearity; here I've chosen a particular nonlinearity. And this, by the way, is the typical thing: you get to change the weights, but you don't really get to change the nonlinearity, although changing the weights of course effectively affects the nonlinearity too, because you're changing its inputs. But yes, there's this distinction. And that's it, actually: that's the deep neural network, done. Of course there are lots and lots of other ones, but this is my neural network, and I like it, and I'm going to use it.
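The lecture's notebook isn't reproduced here, but a minimal sketch of a forward pass of this kind, assuming params is a list of (W, b) tuples and a ReLU as the hard-coded nonlinearity, could look like this:

```python
import jax.numpy as jnp

def network(params, x):
    # params: list of (W, b) tuples, one per layer
    activations = x
    for W, b in params[:-1]:
        activations = jnp.maximum(0.0, W @ activations + b)  # affine map, then ReLU
    W, b = params[-1]
    return W @ activations + b  # top layer: affine only, no nonlinearity
```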
Maybe on Thursday I'll even show you some more advanced code. I think it's sometimes useful, if you get the chance, to look at production-quality modern deep architectures, because we tend to think of them as these super arcane, complicated things, but they tend to be just a few lines of Python. If any of you has looked at the Llama code, for example, which is maybe an example of a modern large deep learning architecture that is publicly visible and not hidden behind corporate walls, you will find that it's just 300 lines of Python. It's not super complicated.

Okay. Once we have this, we can do something with this network; there is a function called predict that I won't talk about now, but in a moment, which just takes the outputs and transforms them into classification labels, because we need them to be probabilities. And then we can make a prediction. Well, actually, we can't yet, because I haven't told you what the parameters are; we need to instantiate the parameters first to be able to call this function. This is done by what's typically called an initializer in deep learning lingo, which looks like this. It's a function that takes in the architecture, in this case just a simple list of how wide each of the layers is, a JAX random key, because we want to keep things deterministic, and a scale parameter, a simple one in this case. It goes through all the layers, produces a weight and a bias for each layer, and returns them. So now we can tell this function what kind of architecture we want. By the way, I need to actually run this code, otherwise it won't do anything. And for that we need to set some parameters; these are global parameters, so I set them in capitals. We're going to do binary classification on a two-dimensional input space (the inputs are the two-dimensional dots in the plot), and we map onto the first layer, which is 128 units wide, then onto the second layer, which is 64 units, and then onto the output, which, because I'm doing binary classification, is univariate. Now we set these parameters, and here come some magic numbers that we just have to believe; I played around with them, and somehow they are right. Then we can create the parameters; this is now a list that contains one, two, three sets of weights and biases, and I can push it into the predict function to make a prediction, here, like that. After initialization, we see that the network can make a prediction. What I'm plotting in red and green is a data set, the one I've used in the previous lectures as well, now with a test set in thin dots, and in the background is the initial state of the network, before training. It produces a prediction, which of course is wrong, because it hasn't learned anything yet, but you can see it has some interesting structure: there are red and green parts lying around. So this is a deep neural network. Does this fit with what you've seen in your deep learning class so far? Good.

Okay, now we need to talk about how these things actually learn. How do you train a deep neural network, how do you learn a representation, to use the connectionists' language? Well, you train it by minimizing some loss function, and in mathematical language that's often called empirical risk minimization. The "empirical" means that there is an actual data set in here: it involves summing over a bunch of data, not some integral over a probability distribution. The "risk" means that there is a function here that gets minimized; risk is always something you want to minimize, as opposed to reward, which you want to maximize. And sometimes there is a regularizer, in which case it's called regularized empirical risk minimization, and people make a big spiel out of it being regularized.
What this means is that we're going to find a set of parameters, a set of weights and biases, such that this function is minimized. The function has the property that it's a sum over many individual terms, where each term depends on one datum, one pair of input and output, and on the entire set of parameters, weights and biases, through this function that I've called the deep net. Then there's an extra term, the regularizer, which does not depend on the data. That regularizer could be there or not; it could be zero, or constant, or it could be something that depends on θ. I'm just putting it in for full generality. I'm not saying you have to have a regularizer, otherwise it's not deep learning, or otherwise what we're going to do doesn't work; it's just that, in general, there might be a term that doesn't depend on the data.

So what are the typical choices people use for this little ℓ? This little ℓ has so far not shown up in my code, so we're going to have to plug it in. Typical choices for classification, and this is where you'll probably nod and think "I've seen this many times before", are the logistic loss, or its multivariate version, the cross-entropy. These words overlap a bit: there is a binary cross-entropy, and there is also a multivariate logistic loss, sometimes called the log-softmax, and they're all the same kind of thing, functions that happen to look like this. Maybe the main thing to notice is that these are functions of two inputs. We tend to think of them as the correct label from the data and the predicted label, but the predicted label is just a function of the weights and biases: it's the output of the neural network, transformed through some nonlinearity, maybe. So there are two things going in. If you look, for example, at the binary logistic loss, it looks like this, which we can recognize as the negative logarithm of a Bernoulli distribution for the label: the probability that y equals one, raised to an indicator of whether y is one, times one minus that probability, raised to an indicator of y not being one. Should I write this down? No? Okay, good. So this is a (negative) log probability distribution, and the same holds for the cross-entropy: it's the log of a multinomial probability distribution, a discrete distribution over multiple possible outputs. For regression, the other common type of loss is the square loss, which is one half times the square distance between prediction and observation, and we can also think of that as a log probability distribution, our famous Gaussian one that we've used so far. Why? Because Gaussian distributions are exponentials of a square, so their logarithm is just a square; in particular, this loss corresponds to a Gaussian with standard deviation one. Yes? Should this be the negative of the log-softmax? Probably, yes. So my main point on this slide is: the loss functions people use are negative logarithms of probability distributions. That's the connection, of course, that we'll make use of.
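In symbols, a sketch of the correspondence just described (notation chosen here rather than copied from the slides):

$$ \hat\theta \;=\; \arg\min_\theta \;\sum_{n=1}^{N} \ell\big(y_n, f(x_n;\theta)\big) \;+\; r(\theta), $$

$$ \ell_{\text{logistic}}(y,f) \;=\; -\log\Big[\sigma(f)^{\mathbb{I}[y=1]}\big(1-\sigma(f)\big)^{\mathbb{I}[y\neq 1]}\Big] \;=\; -\log p(y\mid f), \qquad \ell_{\text{square}}(y,f) \;=\; \tfrac12\,(y-f)^2 \;=\; -\log\mathcal{N}(y;f,1) + \text{const.} $$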
And what are the regularizers people use? They are priors. But what are the actual choices, what do people actually use? Have you ever used a regularizer in deep net training? Yeah, an L2 norm. Yes: the most widely used regularizer is the L2 norm, sometimes called weight decay, or weight cost, or L2 weight cost. It looks like this, and it's obviously the negative logarithm of a Gaussian prior on the weights, a standard Gaussian prior, mean zero, covariance the identity, with some parameter in front. There is a bit of annoying business with that parameter, because the loss tends to be unscaled, there's no number in front of it, so we have to think of the λ as a ratio between prior and likelihood. So what this means, and I've made this case a few times, but maybe it's good to do it properly once, is that empirical risk minimization is maximum a posteriori (MAP) estimation: when you train a deep neural network, you're trying to find the mode of a posterior distribution over the weights. And by the way, there are other choices. For example, sometimes people use dropout as the prior: you can think of dropout as sort of an L2 regularizer with a diagonal matrix in the middle, something like θᵀSθ, where S is a stochastic diagonal matrix of binary zeros and ones, with some bits dropped out every time; or you can think of it as an approximation of some expected value. There is lots and lots of theory trying to interpret what dropout actually is. Or people sometimes use L1 regularizers, called the lasso, to get sparse solutions; that's also a prior, just a log-Laplace distribution. And there are lots of other choices that all amount, in some sense, to prior assumptions.

Yes? So the question is: λ trades off between prior and likelihood, and in Bayes' theorem this tradeoff happens automatically, so why do we need to set it here? I wouldn't say it happens automatically in Bayes' theorem. If you think back, and we will in a moment, to our regression framework, we had an explicit number in the likelihood and an explicit number in the prior: the likelihood had a scale, an error bar, and the prior had a width, which we interpreted as some kind of model flexibility or model scale. Here, λ is just the ratio between the two. Why do we only need the ratio? Because we're just finding the mode. If we wanted the shape of the posterior, we would need to assign an explicit value to each of them, because that changes the overall shape of the problem. This is actually a good point: if you just do minimization, there are some things you don't have to worry about, like these overall scales; you only need ratios, signal-to-noise ratios, rather than signal and noise separately. And that will sometimes bite us when we try to do Bayesian inference in deep learning, because then these ratios will actually matter, and we will suddenly have to set them.

Okay, so how do you do this in practice? Let me show you my code as well, so that everyone is on the same page. We define the terms for the loss. The empirical risk is a function of two inputs, in this case the weights and biases of the network and a training batch of inputs and outputs. We take the batch, which contains inputs and outputs, evaluate the deep network (that's our f of θ and x), which gives us predictions, sort of ŷ, and then we compute the binary cross-entropy from those predictions; and then you have to make sure that the labels are on the right scale, so that they are plus and minus one.
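A minimal sketch of such a loss, assuming ±1 labels, the network function from the earlier sketch, and an illustrative LAMBDA constant (the lecture's actual notebook may differ in details):

```python
import jax
import jax.numpy as jnp

LAMBDA = 0.0  # regularizer weight; zero mirrors switching the regularizer off

def loss(params, batch):
    inputs, targets = batch                                   # targets assumed in {-1, +1}
    logits = jax.vmap(lambda x: network(params, x))(inputs).squeeze()
    nll = -jnp.mean(jax.nn.log_sigmoid(targets * logits))     # binary cross-entropy
    l2 = sum(jnp.sum(W ** 2) + jnp.sum(b ** 2) for W, b in params)
    return nll + 0.5 * LAMBDA * l2   # negative log-posterior, up to constants
```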
Then we need the regularizer term, which here is just going to be one half times the L2 norm. And actually, for training, I've decided to just switch it off. Maybe this is annoying, I'm literally adding something that is zero, but it's a good point to make, because people actually do this in practice: there are quite a few people in deep learning who don't use weight regularizers at all. It also depends a lot on the problem; this problem is quite simple, so we don't have to worry so much about large weights, and we can just remove the regularizer. And that's quite important, because, again, Bayesian inference is not about the prior, it's not about having a prior at all; it's just that some spaces require a prior, and in this case we're typically going to be fine. During training, as you may know, people often also compute the accuracy, which is not actually the loss; it's just another number they like to look at: how many labels the network predicted correctly.

Now, one thing I want to point out: if we think of this here as our log posterior, then this is our log likelihood. Previously, to make the plot, I produced this predict function: up where I defined the network, I made a predict function that takes the inputs x, maps them through the deep net f, so now we have f of x and θ, and then essentially puts a sigmoid around it to turn the outputs into binary prediction probabilities. In fact, it doesn't return the sigmoid, it returns the log of the sigmoid. So you may notice, if you think about it a bit, that this function is pretty much the same thing as the empirical risk. It does the same thing; it's actually part of it. The only thing it doesn't do is multiply with the actual labels to get a risk. Now, it's a totally minor code-design thing that it's convenient to have these as separate functions, predict and risk. But it has led to parts of the community thinking that there are two different phases: there is training, and then there is inference; learning and inference. Learning is when you train the network, and then afterwards, when you give it an input, that's called inference. But it's really just predict, and the two call the same function. Why? Because learning is inference; that's what the probabilistic framework says. There were even workshops at NeurIPS, at least a while ago, called "Inferning", on interactions between inference and learning: what can we do between inference and learning? I don't think this is a helpful metaphor. It's important to realize that the bit we do during training is not so different from the bit we do during prediction; it's pretty much the same computation, including the output of the network. The fact that a sigmoid is added is not really part of the risk; it's just a likelihood.

Okay, so where am I with my code? Right, I made the plot, so I can call this. What people then do is set up some sort of elaborate training scheme. That scheme consists of building a function that loads the data from disk, called a data loader or a data stream. In this case, because the data set is so trivial, it's very straightforward: we take the data, permute it randomly (which here we don't even have to do, because the data is already randomly permuted), then go through these random permutations, sample without replacement, and just return training batches.
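A minimal sketch of such a data stream, a plain Python generator over a toy data set; the names here are chosen for illustration, not taken from the lecture notebook:

```python
import numpy.random as npr

def data_stream(inputs, targets, batch_size, seed=0):
    # yields minibatches forever, reshuffling after each pass through the data
    rng = npr.RandomState(seed)
    n = inputs.shape[0]
    while True:
        perm = rng.permutation(n)                    # random permutation of the data
        for i in range(n // batch_size):
            idx = perm[i * batch_size:(i + 1) * batch_size]
            yield inputs[idx], targets[idx]          # sampled without replacement
```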
What this means is that when we then call the optimizer afterwards (and we'll talk about this a bit more), we replace this sum with a smaller sum over just a part of the data set. That's nice if the data set is big, because then the computation is fast, but it also becomes stochastic; we'll have to deal with that. So then we create this stream. For people who do large-scale deep learning, this bit, those eight lines up there, is a big part of the engineering that makes things work, because the computation typically tends to be I/O-bound: getting data from disk is actually harder than operating on it on a GPU, so you have to really think about how to do this efficiently. Here, for us, it's trivial. And then we use an optimizer, and these two lines hide an entire zoo of things you could do. You may know that by now there are something like 200 different optimizers that random grad students have come up with for training deep neural networks. The two most popular ones are stochastic gradient descent, SGD, the original one from the 1950s, and Adam, which just happened, by social dynamics, to become the dominant optimization rule. It's not necessarily much better than anything else; it's just the one everyone uses. And then there are 150 variants of it, NAdam, AdaMax, AdamW, and also RMSprop and heavy ball and Nesterov and all these others, and you could spend an entire course just listing them, but I'm not going to do that. Most of these optimizers are of a Markovian type: they iterate over an update step where they keep some state, we give them the current training batch, and they evaluate the loss and its gradient and then do something with that gradient and value (actually mostly just the gradient) to update the parameters. In the case of gradient descent, the update is trivial: you literally just subtract the gradient times a constant called the learning rate. Then we let this run, and in this case it runs pretty fast, because I'm working on a tiny little data set, and you can see that the test accuracy and train accuracy have gone up all the way to one. So now we're done, and we can ask the network to make predictions, call the log likelihood a few more times. This is, by the way, trivially fast here. Remember, last week we did this with Gaussian process classification on the same data set, and there the prediction took a second, or maybe a few seconds; so this is a bit different. And we get this output. So this was training: the loss goes down, nearly all the way to zero. This is a loss that is strictly non-negative, so the best we could get is loss zero, in contrast to the log likelihood we had last week, which could go below zero. We can also plot the accuracy on test and train set; that's what everyone always does, because that's what people actually care about, weirdly. They go up, call them what you want. And now we have a prediction, and you can see that it sort of works: it's green where the data set is green, it's red where the data set is red. So fundamentally, as something you might want to do, it works.
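A sketch of the training loop just described, here written with optax as the optimizer library (an assumption; any JAX optimizer with an init/update pair looks similar), and with train_inputs, train_targets, and num_steps as hypothetical placeholders:

```python
import jax
import optax

optimizer = optax.sgd(learning_rate=1e-3)      # or optax.adam(...)
opt_state = optimizer.init(params)
batches = data_stream(train_inputs, train_targets, batch_size=32)

for step in range(num_steps):
    batch = next(batches)
    value, grads = jax.value_and_grad(loss)(params, batch)   # loss and its gradient
    updates, opt_state = optimizer.update(grads, opt_state)
    params = optax.apply_updates(params, updates)            # for SGD: θ ← θ − η ∇L
```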
And you can immediately see that there are lots and lots of problems with this thing. If you mentally go back to lecture six or so, when I did regression, when I came up with feature functions and we did linear regression and learned all these weird functions with nonlinearities and kinks, some of you asked: why should I use this set of features, wouldn't some other set of features be better? You see the exact same thing reproduced here; it's the same kind of pathology. We've made a hard choice of language to parameterize things in, in this case these ReLU nonlinearities, and we see them reflected in this classification output. Another problem, which we'll need to talk about next week: see all this green down here and all this red up there? That's very high confidence. The scale of this color map is probability zero to one for the label, so what this network says is: up there, at (4, −4), I am absolutely 100 percent certain that the correct label is red, and down here I am absolutely 100 percent certain that the correct label is green. That maybe doesn't seem so nice, right? It's not necessarily something we want.

Okay, there is a question: where in my code are the gradients computed? That happens along the way, in this tiny little bit here: value_and_grad is a function in JAX that takes in a function, here the loss, and computes both its value and its gradient. And maybe this is a good point on which to end the first half of this lecture: the fact that I can write this tiny little line is probably the most important reason why deep learning is a thing these days. When I had the aforementioned conversation during my PhD, these things didn't exist. We wrote our code in Java, or in Matlab. I know everyone has to snicker when I say Matlab, but it used to be the thing; everyone wrote code in Matlab. And you just wrote down the architecture, that was one piece of code, called predict, and then you wrote a second function, called gradient, and you had to write the whole thing by hand. There was no way to reuse anything, except maybe copying and pasting some code. Then you ran it, and it didn't produce the right gradient, and you had to check, and go back and forth, and find that there was a minus missing somewhere, or a two you had forgotten in your derivation of the gradient. That made progress very slow. What has happened over the last decade, maybe more than a decade now, is that the software stack has developed a lot, so that we can now do things like this. This level of abstraction has allowed people to build really complicated models. Things like transformers would have been an absolute nightmare to write in Matlab from scratch. Maybe someone could actually have done it, but then no one would have been able to copy it unless they had released their code, and even then it would have been very difficult to change anything, because there was no level of abstraction. But computer science came to the rescue, pointed out that you can do certain things automatically and abstract away all sorts of structure, and that's why we have this explosion of complicated models now, like transformers and all the others. With that, I'm going to take a break; we will continue at 11:06.
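As a tiny illustration of that one line: a toy function, not from the lecture, just to show what jax.value_and_grad gives you for free:

```python
import jax
import jax.numpy as jnp

def toy_loss(theta):
    return jnp.sum((theta - 1.0) ** 2)

value, grad = jax.value_and_grad(toy_loss)(jnp.zeros(3))
# value == 3.0, grad == [-2., -2., -2.]; no hand-derived gradient code anywhere
```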
So, there was a question during the break that I want to briefly address. I pointed at this code and said that inference and learning are the same thing, and that it's therefore dangerous to use the word inference at test time rather than for learning, or the other way around; maybe I did this too quickly and it was a bit confusing. The question is: why is the word inference wrong for this function? In the deep learning world there is a typical nomenclature that says there is learning and there is inference. Learning is the bit to do with your training set, where you change the weights; inference is afterwards, when you've stored your set of weights, someone gives you a new input, and you make a prediction. That's called inference. What I'm saying is that using the word inference for this second, latter process is maybe an unfortunate choice of words. Why? Because we use the term Bayesian inference for figuring out things we don't know, latent quantities, and training the deep neural network is an instance of inference. In that sense, learning is inference: I just showed you a slide showing that what this training process does is find the mode of a posterior distribution. That's part of Bayesian inference, because we could afterwards do something else to construct an approximate full posterior. So really my point is that the process of taking an input after training and making a prediction should just not be called inference; we should not use the word inference for it.

Another way of seeing why this is a problem is this code here: lines 12 to 22 are really just a partial evaluation of this function. I could implement predict by taking this empirical risk and wrapping a functools.partial around it, such that it only produces this part but not that part, evaluates this function, and leaves open this bit as a new input. That's what predict is; it's the exact same function. If you think like a software engineer, you might say it's bad software design practice to have this function and then, on top, have this thing (whoops, where is it? here) which is the same function, just partially evaluated. It's dangerous, right? If I made a change to this line, because somehow I decided I want a different prediction function, it wouldn't automatically change the empirical risk. That's not good; we should build things so that one change somewhere gets propagated everywhere else. So that's mostly my gripe. It helps conceptually to realize that what we do, all the time, is evaluate a probability distribution. In Bayesian machine learning, evaluating a probability distribution is the elementary operation, and inference is the application of Bayes' theorem, where you get some kind of inverse probability; that's what learning actually is. You had a question? So you're touching again on exactly the same issue that also exists in frequentist statistics: there is seemingly a separation between estimating an unknown quantity and predicting an unknown quantity, but both are always just evaluations of probability distributions. Sometimes they are forward probability distributions and sometimes backward ones; sometimes you apply Bayes' theorem and sometimes you already have the thing. But they're just probability distributions, they're the same, and that's maybe why we have a probabilistic machine learning class: to point out these structural similarities, so that you can save some RAM in your brain and only have to learn one of these concepts.
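The partial-evaluation point, as a sketch with hypothetical function names, assuming the ±1-label loss from before, just to show the software-design relationship:

```python
from functools import partial
import jax
import jax.numpy as jnp

def log_likelihood(params, inputs, targets):
    # per-example log p(y | f(x; θ)): the summands of the (negative) empirical risk
    logits = jax.vmap(lambda x: network(params, x))(inputs).squeeze()
    return jax.nn.log_sigmoid(targets * logits)

def empirical_risk(params, batch):
    inputs, targets = batch
    return -jnp.mean(log_likelihood(params, inputs, targets))

# "predict" is the same computation with the targets argument fixed to +1:
predict = partial(log_likelihood, targets=jnp.array(1.0))
# predict(params, inputs) == log p(y = +1 | f(x; θ)); not a separate concept
```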
Now, what I want to do for the remainder of today's lecture is to make the connection to the stuff we've done so far very explicit. You've probably already realized it by now, but maybe it's good to really carefully do this once and for all. In our course so far, here in this room, we started with regression (regression, or supervised learning, by the way, being the same problem that much of deep learning addresses) using these linear models. I've already spoken about them: we decide that our function f, which seems to have the same structure as the deep learning f, has this particular form. We choose some features and then multiply with some weights. And I already mentioned in passing when we did this, though it obviously confused a few people, that this is like constructing a single-layer neural network: take the inputs, apply some features, some representation of the data, and then multiply with some weights to get a prediction. So this is a special case of the function I showed you, called network, where there is no hierarchy: just take the input, apply the nonlinearity, multiply with the weights. It's literally (sorry, I'm going to scroll around in code again) this function, but we comment out this bit: we don't do multiple layers, we just take the last layer, multiply the weights, add the biases, done. That's the same thing. Of course, you could also think of it the other way around: if someone gives me a neural network, I can call everything up to the last layer φ, and then the last layer's weights are my w. Same thing. Yes? So your question is about this ongoing, by now probably settled, debate that raged in the machine learning community for quite a while: is it actually a good idea to learn φ, to learn the features, or, as the community says, to learn the representations? There's a conference called ICLR, the International Conference on Learning Representations, created in the early 2010s (2012 or 2013 or so was the first time ICLR happened), which came about because the people who wanted to learn φ felt they weren't allowed to publish their papers at NeurIPS, back then the main conference of machine learning. By now, and this is one of those famous xkcd-meme situations, instead of two conferences we simply have three, ICLR and ICML and NeurIPS, because everyone does this now and learns representations. But of course the question remains whether this is actually more powerful than Gaussian process regression, and we'll talk about that in ten minutes. Let me first point out again: because we use φ, we get to pick features for arbitrary inputs. We never need to deal with x directly, φ takes care of x, and that's why we can apply this framework to a very broad class of inputs. We can define features on real vectors, of course, but we can also define features on strings, for language, including programming languages. We can define features on graphs, to do material science and biomedical science and learn properties of molecules and genes and proteins and so on. We could even write features for functions: the input x could itself be a function, and then what we are learning is an operator, something that operates on a function. This is a topic that is currently quite hot in machine learning, in particular in machine learning for the physical sciences, because it allows constructing simulation methods: you give me a differential equation, I tell you the solution of the differential equation. That's a map from a function to a function, an operator.
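To make the "features on arbitrary domains" point concrete, here is a toy sketch (entirely illustrative, not from the course) of a feature map on strings that plugs into the same linear-in-the-weights model:

```python
import jax.numpy as jnp

def phi(s: str) -> jnp.ndarray:
    # toy feature map from strings to R^27: normalized letter counts plus string length
    counts = jnp.array([float(s.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"])
    return jnp.append(counts / max(len(s), 1), float(len(s)))

def f(s: str, w: jnp.ndarray) -> jnp.ndarray:
    return phi(s) @ w   # still linear in w, even though the input is a string

w = jnp.zeros(27)
print(f("probabilistic machine learning", w))  # 0.0
```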
And you can go even crazier and apply this feature notion to pretty much arbitrary concepts you can think of, including Gödel numbers of Turing machines, or programs for the universal Turing machine, and then learn properties of Turing machines, if you like. Of course you can't learn everything, because certain things are not computable, but you can still set up the framework. So, how did we train this class of regression algorithms? We decided to do Bayesian inference. The language we used was: let's make the prior assumption that the weights are Gaussian distributed, and that the likelihood is Gaussian around this linear function, which is a linear map of θ. Why? Because Gaussians are closed under all linear maps, so Gaussian inference under this linear connection is closed-form linear algebra. Then we apply Bayes' theorem, prior times likelihood divided by the evidence, and observe that this particular choice of prior and likelihood gives a nice posterior: a Gaussian posterior with closed-form expressions for the posterior mean and the posterior covariance, which happen to look like this; I've had them on a few slides already. And we realized that we can compute those with linear algebra and do all sorts of cool, interesting things. That's the classic story, the Bayesian-inference perspective on this class of algorithms. And as we now know from the previous few slides, and because I've pointed it out multiple times over the course, there is an equivalent formulation in terms of empirical risk minimization. You can think of the same problem, with prior and likelihood, as finding the mode of a Gaussian posterior, which is equal to its mean. Why is this posterior Gaussian? Because, in the logarithm, it's the sum of a square regularizer and a square loss, and the sum of two quadratic polynomials is a quadratic polynomial; therefore the posterior is the exponential of a (negative) quadratic. The mode of this posterior, the mean of this Gaussian distribution, is what we find by minimizing the negative logarithm of the posterior, which happens to be that sum of two squares, and we can rearrange and recover the expression from the previous slide literally by minimizing this function. So Gaussian parametric regression, least-squares regression, is single-layer, shallow learning: learning a shallow neural network with a square loss and a square regularizer. It's a special case, with the additional property, by the way, that the curvature of this empirical risk function, the negative log posterior, is equal to the inverse of the posterior covariance. I'm tempted to say "at the mode", but it's true without that qualification, because it's a quadratic function, and quadratic functions have a constant Hessian everywhere, so we don't even need to talk about where we evaluate the Hessian. Maybe this is good; well, maybe it's bad, actually, so let's talk about that first: it's a special case of deep learning, and deep learning seems to be the more powerful thing, which is true. But it's also good, because we realize that we can compute these two quantities, posterior mean and covariance, in closed form. Linear algebra, very powerful: no free parameters, no learning rate, no Adam, no SGD, nothing. It's just a Cholesky decomposition, done.
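A sketch of that closed-form computation (conventions chosen for this sketch: feature matrix Phi of shape n_features × N, prior w ~ N(0, Σ), observation noise σ²; the slides' notation may differ):

```python
import jax.numpy as jnp
import jax.scipy.linalg as jsl

def posterior(Phi, y, Sigma, sigma2):
    # Gram matrix of the observations: ΦᵀΣΦ + σ²I, shape (N, N)
    G = Phi.T @ Sigma @ Phi + sigma2 * jnp.eye(Phi.shape[1])
    L = jnp.linalg.cholesky(G)                    # "it's just Cholesky"
    A = jsl.cho_solve((L, True), Phi.T @ Sigma)   # G⁻¹ ΦᵀΣ
    mean = A.T @ y                                # ΣΦ G⁻¹ y (prior mean zero)
    cov = Sigma - Sigma @ Phi @ A                 # Σ − ΣΦ G⁻¹ ΦᵀΣ
    return mean, cov
```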
Then, what we did when we did Gaussian process regression was effectively to look at this network and say: what if we add more and more and more weights, until we can somehow do the linear algebra in closed form without ever looking at the weights? We ended up with a framework motivated from the Bayesian perspective: there is this functional object called a Gaussian process which, if you multiply it with a likelihood via Bayes' theorem, gives us a Gaussian process posterior, which contains these two functions, the posterior mean function and the posterior covariance function. And then there was that lecture on the theory of Gaussian processes and kernels, where you all kind of went "whatever", in which I tried to point out that this is effectively like training an infinitely wide neural network. But because it's functional analysis, it's not quite so simple. We don't just get to say, ah, we make infinitely wide networks and everything just works; a few subtleties arise. For example, it's not so straightforward to even write down what the risk is, because now we have this infinite-dimensional object, a function, in here, and if we want to write an equation like this, we first need to define what the regularizer actually is. And we have to do that; there is no way around having a regularizer now, because there are infinitely many of these f's: infinitely many possible choices of functions all drive the data-fit term to zero, and that would be boring. It would be boring also because it would not allow generalization; it would not tell us at all how to make a prediction at any other point. To mediate between training points and test points, we need to write down what this object is, the L2 regularizer in function space. It turns out that what it is is the so-called ridge regularizer, or, more tangibly, the norm of the function we're trying to find, measured in this weird mathematical space called the reproducing kernel Hilbert space associated with the kernel. There is lots of complicated theory around that, which also gives an intuition for what the posterior covariance now is. It's sort of like a (log-)Hessian, but it's not actually a Hessian, because this function is a bit too complicated for that. What it is, in some sense, as I pointed out in that lecture, is a worst-case estimate for the deviation of the true function from the estimated function, assuming the true function lies in this weird space, the reproducing kernel Hilbert space. We also saw in passing that there is a theorem saying you can write any such function as a sum over countably many feature functions; how exactly they arise is a bit tricky and complicated, and I spoke about it at length in that lecture. But the main point of that insight is: yes, indeed, we can think of Gaussian process regression as an infinitely wide neural network, countably infinitely wide, where the individual units are those things called eigenfunctions of the kernel.
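In symbols, the function-space version of the regularized risk just described, stated here as a standard sketch rather than as the slide's exact formula:

$$ \hat f \;=\; \arg\min_{f \in \mathcal{H}_k} \; \sum_{n=1}^{N} \big(y_n - f(x_n)\big)^2 \;+\; \lambda\,\|f\|_{\mathcal{H}_k}^2, $$

where $\mathcal{H}_k$ is the reproducing kernel Hilbert space of the kernel $k$; without the $\|f\|_{\mathcal{H}_k}^2$ term, infinitely many choices of $f$ drive the first sum to zero.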
Then, in the last two lectures, I said: okay, what about classification? If we need to predict labels that are not real numbers, we just take this likelihood, the logistic function. By the way, there are other likelihoods, but this is the one that I like to use, and that everyone else also likes to use; if you're wondering why, check out this week's homework, it might give you an idea of why this is an interesting thing to do. And then we realized that using this likelihood is exactly equivalent to using the binary cross-entropy term, the logistic loss, that I used in the code I showed you in the first half of this lecture. So it's really the same thing: quadratic regularizer plus cross-entropy loss is exactly what we've been doing. Now I briefly need to point out that most classification problems are of course not binary, not just one label versus another, but have multiple labels. In computer vision, your standard task is: is this image a car, a human, a baby, a banana, whatever? That's a multiclass classification problem. And here I need to point out that, for your sake, I've decided to only do binary classification, and I'll use five minutes to tell you why, because of course we could do everything we've done so far, in a probabilistic version, on multiclass problems as well. How would that work? The idea is that the last layer of our neural network, the output, is not a single scalar but k numbers, where k is the number of classes we're trying to predict, and then we take a softmax over those classes. For that we have to turn the vector of weights into a matrix, a rectangular matrix that maps from the last layer to the top one. In our code, in the definition of the network, this is relatively easy: it involves putting something other than a one in this bit of the code, and then we just have to make sure that the data also has the right shape. I left some comments in there where we do this, so when I make a data set here, there's some code left in; you can of course take data and bring it into this shape. Then we need to come up with a probabilistic formulation of what it means to learn multiple output functions. In the Bayesian machine learning world this goes under the name of multi-output Gaussian processes, and the formalism is that you write a covariance function, a kernel, that defines the covariance between the c-th output of the function at location a and the d-th output of the function at location b. So that's now a function that takes in four inputs rather than two, but of course you can imagine a reshape where each tuple of two becomes one input: on the left-hand side you have the input (a, c) and on the right-hand side the input (b, d). And then everything becomes a complete array-centric-programming nightmare: there are lots and lots of inputs, and we need to reshape the right way in and out. The piece of Python code that I've used for Gaussian process inference so far actually allows this to work, but it's not exactly nice, because even more dimensions go into the input of the kernel; it's a bit painful. You saw Marvin do a variant of this in his tutorial, if you were there. And there are lots of interesting structures here. For example, one thing people tend to do is assume that the covariance factorizes between inputs and outputs: one kernel between the inputs and one kernel between the outputs.
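A sketch of that factorization, with hypothetical kernel names; the separable structure makes the joint covariance matrix a Kronecker product:

```python
import jax.numpy as jnp

def k_in(A, B):
    # an example kernel between input locations, shapes (n, d) and (m, d)
    return jnp.exp(-0.5 * jnp.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1))

X = jnp.linspace(-3.0, 3.0, 10)[:, None]   # some input locations
C_out = jnp.eye(3) + 0.1                   # example covariance between 3 outputs
# k((a, c), (b, d)) = C_out[c, d] * k_in(a, b): a separable multi-output kernel
K = jnp.kron(C_out, k_in(X, X))            # 30×30 joint covariance, Kronecker structure
```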
Then we need to come up with a probabilistic formulation of what it means to learn multiple output functions. In the Bayesian machine learning world this goes under the name of multi-output Gaussian processes. The formalism is that you write a covariance function, a kernel, that defines the covariance between the c-th output of the function at location a and the d-th output at location b. That is now a function of four inputs rather than two, but of course you can imagine a reshape in which each tuple becomes one input: the left-hand argument is (a, c) and the right-hand argument is (b, d). And then everything becomes a complete array-centric programming nightmare: there are lots and lots of inputs, and we have to reshape the right way in and out. The piece of Python code I've used for Gaussian process inference so far actually allows this to work, but it's not exactly nice, because even more dimensions go into the input of the kernel; it's a bit painful. You saw Marvin do a variant of this in his tutorial, if you were there.

There are lots of interesting structures around this. For example, one thing people tend to do is assume that the covariance factorizes: one kernel between the inputs and one kernel between the outputs, so that the covariance between output c at location a and output d at location b is k_in(a, b) times k_out(c, d). That is like assuming the weights for the input layer and for the output layer are generated separately from each other; or, for a single-layer network, that there is one generative process for the rows and one for the columns of the output weight matrix, and you take the outer product of the two. This has all sorts of interesting algebraic consequences: you get nice structure in the covariance matrix, called Kronecker structure, which speeds up the computation.
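Here is a sketch of that factorization in code; the kernel, the sizes, and the output covariance are placeholders, not anything from the course material:

```python
import jax.numpy as jnp

def k_in(A, B, ell=1.0):
    # square-exponential kernel between input arrays of shape (n, d) and (m, d)
    sq = jnp.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-0.5 * sq / ell ** 2)

n, d, k = 50, 2, 3                                      # hypothetical sizes
X = jnp.zeros((n, d))                                   # placeholder training inputs
K_inputs = k_in(X, X)                                   # (n, n): between locations
K_outputs = 0.9 * jnp.ones((k, k)) + 0.1 * jnp.eye(k)   # (k, k): between outputs

# cov(f_c(a), f_d(b)) = k_in(a, b) * K_outputs[c, d]  =>  Kronecker structure
K_joint = jnp.kron(K_outputs, K_inputs)                 # (n k, n k) joint Gram matrix
```

Solving with the joint matrix then reduces to solves with the two much smaller factors, which is where the speed-up comes from.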
There are lots of cool little tricks like this, but if I had done all of that in the course, it would just have confused you like crazy. What I want you to take away is that multiclass classification is not a problem from the probabilistic perspective. It's not that we couldn't do it; it's just even more tedious than what we've done so far, so we leave it as an exercise to the reader. If you really want to do multiclass classification with GPs, look at this slide again, stare at it for a bit, think about what it means, and then you can actually do it. I actually thought of giving this as a homework exercise this week, but I decided against it. So please don't come away thinking this is impossible; it's just that we made some hard choices in the design of the code you use for your homework, and adding this now would be painful. That's also why I'll keep showing you binary classification, and why we'll keep asking you in the homework to work on binary MNIST. I know it feels a bit toyish to take MNIST and reduce it to binary MNIST, but it's really just so you don't have to write even more code; we want to keep it simple for you.

Okay, so now comes this question: which one is actually better? We spent four lectures talking about Gaussian processes, we just realized that Gaussian processes are infinitely wide neural networks, and now I've shown you deep learning; and of course you've all had deep learning classes before, so you know there is the choice of going deep as well. We can do deep learning, and we can do infinitely wide learning, non-parametric is a nice word for it. Which is better? Why pick one over the other? Someone asked last week: this is a problem for me in practice; if you give me a data set, every course I go to advocates for some other type of learning, deep learning, non-parametric learning, parametric learning, probabilistic learning, statistical learning, which one do I actually pick? Part of the story is that we just found out they are all really closely connected, and what I actually want you to take away at the end of this course is how to map between the different concepts and combine them whichever way you like. But if, for the sake of argument, we stick to one of them for a moment, then we can talk about the weaknesses and strengths of each framework and think about why each might be better or worse in some settings.

The classic argument for kernel machines, these infinitely wide networks, goes as follows. At the time kernel learning was all the rage, around the early 2000s, when three quarters of NeurIPS was kernel papers and GP papers, people were trying to make arguments to the connectionists for why this was the right thing to do, and one of them goes like this. I spoke about reproducing kernel Hilbert spaces; by the way, Matthias Hein covers RKHSs in his class as well, I see some nodding. These spaces span the hypotheses we can reach with kernel ridge regression or Gaussian process regression as posterior mean estimates: any posterior mean that comes from a Gaussian process regressor or a kernel ridge estimator lies in the reproducing kernel Hilbert space. So one question you could ask is how powerful these spaces are: can they approximate everything? And indeed they can. There are theorems saying that certain kernels, for example the square-exponential kernel, also known as the Gaussian kernel or the radial basis function (RBF) kernel, which is infamous because people use it so much, there are entire workshops in Oberwolfach just on radial basis functions, have the property that their reproducing kernel Hilbert space lies dense in the space of continuous functions. Just as the rational numbers lie dense in the real numbers, this function space lies dense in the space of all continuous functions: for any continuous function, and pretty much any interesting function is at least piecewise continuous, you can find an element of the RKHS of this kernel arbitrarily close to it, where "arbitrarily close" is measured in a norm that has to be defined correctly. That led to the argument that such kernels, called characteristic or universal kernels, "should be enough for everyone", to borrow the line attributed to Bill Gates: we don't need deep learning at all, because these are universal function approximators, they can learn any function. Maybe you've heard sentences like this before.
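For reference, the kernel in question has the standard form below, with an output scale theta and a length scale ell; the universality statement then says that for every continuous function f on a compact domain and every epsilon, there is an element of this kernel's RKHS whose sup-norm distance to f is below epsilon. It promises existence, not a rate.

```latex
k(a, b) \;=\; \theta^2 \exp\!\left( -\,\frac{\lVert a - b \rVert^2}{2 \ell^2} \right)
```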
The problem with this is that it is a non-constructive statement. It only says that such a function exists: here is the space of all continuous functions, your function lies in it, and the RKHS is dotted densely all around this space, so there is also an element arbitrarily close to your function. What it does not tell you is, if you start from the prior and then take data, how long it will take to get there.

So here's a picture I constructed a long time ago as an example. In black, in the background, is one particular continuous function; in fact it is more than continuous, it is infinitely often differentiable, a very smooth function, so it definitely lies within this space. In red is the Gaussian process posterior that comes from the RBF kernel. The posterior mean function, the solid red line, is an element of the RKHS; the stuff around it is not, which is why I'm not plotting it as such. Now we condition: first data point, second data point, third, fourth, fifth, and you can see that everything works; we're learning very well, we get closer and closer to the true function. At 10 evaluations the mean is beginning to approximate the black function, especially where we have more data. Then we go from 10 to 20 evaluations, and then to 50, and this is what happens: totally crazy deviations. The error also goes down to something like 10^-20 near the data points; I've just cut the plot off. The theorem says that eventually we'll get there, because we keep constraining the function space, and there has to be a red function arbitrarily close to the black one. So let's keep going, it has to work. And in fact it kind of does: everywhere there is a data point we get ever closer, but we pay a price in these nasty deviations around the data.

In fact, if you plot the RMSE, the average squared error between the estimated function and the true function, it looks like this: up to a few thousand evaluations, the red line is the convergence toward the truth, and in orange I plot a straight reference line. This is a log-log plot; does anyone remember from your data literacy class what a straight line in a log-log plot is? It's a power law. In this case the power law is the number of function evaluations raised to the minus one half: square-root convergence. That is about the slowest rate you would normally settle for; it's the convergence rate of a Monte Carlo estimator, and it usually doesn't get much worse than that. You would like a polynomial rate of decent order, two, three, four, five, maybe even exponential convergence. But you don't get that here; what you get is something slower than any polynomial. You can see the red curve asymptotically getting flatter and flatter, and that is real: there is a theorem saying that for functions like the one I carefully constructed here, the convergence rate is logarithmic in the number of evaluations. And since you're computer scientists, you know what logarithms look like; as a convergence rate, logarithmic is uselessly slow.
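To make the "straight line in a log-log plot" point concrete, here is a tiny, purely illustrative check with synthetic numbers (not the data from the slide): a power law N^(-1/2) shows up as a line of slope -1/2 after taking logarithms of both axes.

```python
import jax.numpy as jnp

# synthetic error curve following the Monte Carlo rate, RMSE ~ N^(-1/2)
N = jnp.array([10.0, 100.0, 1000.0, 10000.0])
rmse = 3.0 * N ** -0.5

# in log-log coordinates a power law is a straight line; its slope is the exponent
slope, intercept = jnp.polyfit(jnp.log(N), jnp.log(rmse), 1)
print(slope)   # ~ -0.5
```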
So the problem here is that the language in which we describe the function we're trying to learn is universal, it can describe everything, but we haven't thought about how long it takes to describe any particular thing, what the length of the string is that we need to write down to approximate it. You can have universal languages that still make it very hard to describe certain things. Results like this raised the question of what these function spaces are actually good for.

What happened then was that, a little later, people started proving rather complicated learning-rate theorems. This one is from van der Vaart and van Zanten, from 2011, just one example of the learning-rate-theoretic results for Gaussian processes and kernel machines from that time; there were a lot of papers like it back then, which you can read or not read. What it says is that if you match the reproducing kernel Hilbert space well to the function you're trying to learn, you can actually get very good rates; but if there is no good match, your convergence rate can be very, very bad, and you're not going to learn anything, because logarithmic rates are useless, you might as well not learn. It also says that if you look carefully at the expression and get things right, the convergence rates can be very good: if you're trying to learn a function that happens to be three times continuously differentiable, and you use exactly the right kernel, your convergence rate can indeed be polynomial of decent order for a function of that type.

What this boils down to is that if you know enough about the problem, you can build very good kernel methods, and the question becomes: what are the problems where these models actually work really well? Here's a short answer, and I'll have more in a moment. One setting in which such models, pure kernel methods that do not learn representations, are really good is problem classes where you know a lot about the problem, very precisely, and you need to carefully build an algorithm that works really well. My favorite go-to example is simulation. If your task is to solve a differential equation, Schrödinger's equation, maybe the Navier-Stokes equations, if you want to predict the weather or infer the properties of some material, then you need to solve a differential equation; and you have written that equation down, so you know exactly what's in it. It says something like: the second derivative of this function with respect to this variable, plus the first derivative with respect to that variable times the second derivative, and so on. Then you know that the function you're looking for has exactly this many derivatives, because otherwise there is no meaningful solution to the differential equation. But you also cannot assume additional derivatives, because then you would be overly constraining the solution space, and you might miss interesting parts of the problem by imposing too much regularity.

Doing this right is very difficult with deep neural networks, because with the language we have available there, linear maps and sigmoids, it's really tricky to write down a hypothesis class that constructs exactly the right type of function. Think about it: how would you build a neural network that can represent, through its weights, functions that are exactly three times differentiable? You would need a link function that is exactly three times differentiable. You can build those, they're not impossible, but then you have finitely many of them, not arbitrarily many, so you have to be really careful about how many you use, and so on. Your usual link functions are either ReLU, which is non-differentiable at a point and gives you a piecewise-linear model class, or something like tanh or SiLU, which are smooth, so infinitely often differentiable. And this is a real problem in practice: deep-learning-based simulation methods often don't work well on such problems. For settings like this, Gaussian process models are really powerful, because you can carefully choose a kernel with exactly the right regularity: it puts no assumptions on the problem beyond the ones you absolutely have to make.
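One concrete kernel family with exactly this tunable regularity, not named explicitly here but the standard example, is the Matérn class; K_nu below is the modified Bessel function of the second kind:

```latex
k_\nu(a, b) \;=\; \theta^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)}
\left( \frac{\sqrt{2\nu}\,\lVert a - b \rVert}{\ell} \right)^{\!\nu}
K_\nu\!\left( \frac{\sqrt{2\nu}\,\lVert a - b \rVert}{\ell} \right)
```

Samples from the associated Gaussian process are, in the mean-square sense, differentiable exactly ceil(nu) - 1 times, so nu = 7/2 matches the "exactly three times differentiable" example above; and as nu goes to infinity, the Matérn kernel recovers the square-exponential kernel with its infinitely smooth samples.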
But then there are other applications, like computer vision and natural language processing, the two most prominent applications of deep learning at the moment, where we know very little about the problem; in fact so little that we have a hard time even writing down what function class we're looking for. And those are exactly the settings where deep learning has excelled so far. They are the ones where people spent decades trying to write good features by hand: what is a good feature set for natural images? There used to be arguments about Gabor filters and those kinds of things, and maybe what the world has realized is that it's better to parameterize in terms of lots of parameters, in a deep fashion, so that you can learn the representation, and with it good statistics of the natural world.

One quick question, because I want to finish in five minutes. (A question from the audience.) No: my prediction is not that, once we learn more about computer vision, we're going to use Gaussian processes for computer vision. My prediction is that for applications that have a lot of structure, shallow models are more interesting, because you can design them by hand. And those are not boring applications: simulating physical systems is not boring; simulating the climate or the cosmos is not a boring problem. For such problems kernel machines are really interesting; they're not the only solution, but they're very interesting, and any deep solution will have to inherit some of their structure. For problems that have inherently little known structure, like images of the natural world or human natural language, we don't even have to look for kernel solutions, even though people tried for quite some time; maybe at this point, in 2023, we can stop looking for them, unless those kernels happen to be constructed from a deep neural network, and we'll talk about how to do that in the next lecture.

But I don't want to end here; I want to show you two more slides that summarize my view from a more algorithmic perspective. Let's look at deep learning and Gaussian processes and think about what people actually like and dislike about them. We'll find that the answers are complementary and interact with each other; it's not simply that one works and the other doesn't, it's more subtle than that.

So, deep learning. Here again is some math and some things to talk about. I think what people really like about deep learning is that you can think of training as O(1), and I always wave my hands around like this because we've already talked about it: the way O(1) training emerges in deep learning is that you can construct gradients on a subset of the data, called a batch; that's a stochastic gradient, and you can train the model using those stochastic gradients, which means that at no point do you have to wait for a full pass through the data set. In particular, you can train these models on an effectively infinite data set and never worry about its size. However, we already observed when we did parametric regression from the Gaussian perspective that this property is not unique to deep learning; you just have to be a bit more careful about how you do it. In particular, it is absolutely possible to train a parametric Gaussian regression model with stochastic gradient descent. There is no need to think of this as a uniquely deep-learning thing.
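As a minimal sketch of that point (placeholder model and shapes, not the course code), here is a parametric least-squares model trained with exactly the same batched loop you would use for a deep net; the cost of one step does not depend on the size of the full data set.

```python
import jax
import jax.numpy as jnp

def loss(w, X_batch, y_batch):
    # square loss of a linear (parametric) model on one batch
    return jnp.mean((X_batch @ w - y_batch) ** 2)

@jax.jit
def sgd_step(w, X_batch, y_batch, lr=1e-2):
    # one stochastic gradient step; O(1) in the size of the full data set
    return w - lr * jax.grad(loss)(w, X_batch, y_batch)
```

Because each step touches only the current batch, the same loop runs unchanged on a data stream of unbounded length.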
The other thing people really like is what I will call array structure. This is actually a big topic in the theory of software engineering for deep learning; I don't know whether that community even exists, maybe it's just 20 people. What I mean is this: if you look at the gradient of the loss with respect to the weights, you can think of an array that contains a sum over the training data along one axis, and all the weights and biases of the network along the other. Computing a gradient expands this object, an array living in weight space and in data space; there is some kind of array hiding in an abstract space that would give you the gradient on your whole data set. When we compute a batch gradient, we compute some of the columns of this matrix, the individual per-example gradients, and then sum over them: a kind of map-reduce. That structure is a very rich language to think in, and you've actually encountered it already in the JAX code I showed you; all those sometimes tedious slicing operations into arrays do pretty much this. What people like about this is that arrays are very structured objects. Mapping operations over arrays admit all sorts of efficient speed-ups, because the compiler gets structure that is more powerful than a for-loop. You can do what's called sharding, farming parts of the computation out to different machines across a data center and collecting the results again. You can do batching: you can decide, and reason mathematically about, which of the columns you actually want to load and compute. And you can think about the reduction itself: one of the things my group has worked on, for example, is whether we always need the plain sum over this array, or whether we can compute weighted sums, or apply non-linear transformations element-wise before summing, to construct interesting estimators. All of this becomes available when you think of the computation from an array-centric perspective. For Gaussian process models you can do the same thing, but it's more hidden; you have to really know what you're doing, and that's why I think this view has been more powerful in deep learning.
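Here is a sketch of that array view in JAX, with the same placeholder linear model as above: vmap materializes the per-example columns of the abstract gradient array, and the reduction over them is a separate, swappable step.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # per-example square loss of a linear model; x is one input row, y one label
    return (x @ w - y) ** 2

# map: one gradient per example, i.e. the columns of the weights-by-data array
per_example_grads = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))

def batch_gradient(w, X_batch, y_batch):
    G = per_example_grads(w, X_batch, y_batch)   # shape (batch, n_weights)
    return jnp.sum(G, axis=0)                    # reduce: plain sum; weighted sums work too
```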
Two more things people like about neural networks, not entirely seriously. One is that "neural networks" just sounds really cool: something to do with the brain, somehow big. It sounds a bit silly, but that really is how they were sold to the community. Geoff Hinton went on for twenty years about how the brain works; well, nowadays it's a bit different, but he used to keep telling these stories about trying to figure out how the brain works, and people just like that; it gets people excited, in particular PhD students, which is what matters. The other thing, which is sort of an afterthought but actually nice, is that once you've trained the model, you can throw away the data. This maybe used to seem unimportant, but the last half year has shown how important it is: none of us knows how GPT works, because we have access neither to the data nor to the weights. If you train a deep neural network, you can afterwards forget about the data, keep the weights behind locked doors behind an API, and never show them to the world; that's really convenient. For Gaussian processes there is the annoying property that the data sort of is the model: they're non-parametric, every datum is explicitly part of the model, so if you release the model, you pretty much release the data.

What people don't like about deep learning is that training is really fiddly, and I will keep going on about this in later lectures; you've already seen some examples. You have to make so many choices to make it work, and quite often some of these choices are extremely sensitive: get one slightly wrong and it doesn't work; or it works, but you don't know why; or you change something, it starts working better than before, and you don't understand why. That's really bad compared to the setting of least-squares regression, where it's just linear algebra, and, well, not everyone, but there are people who really understand how linear algebra works, so they can write little black-box algorithms that just work. Another thing we'll get to in a later lecture is that once you've trained the model and thrown away your data set, it is, at least at first sight, not straightforward what to do when new data comes in. And that is a common setting in applications: you have a trained model deployed to the world, internet-facing, people are using it, you collect new data as it arrives, and there is a shift in the data set. Say you're Spotify: you've trained a recommender for your users on the greatest music of the 70s, 80s, 90s and today, and now it's 2023 and people want to listen to completely different music, so your model is trained on something it no longer represents. That's a different timescale, of course, but you see the point. You can collect new data, you have access to the stream of users, but what do you do with it? Do you go back to the old data, mash everything together, let some SGD run and hope it works? That's a dangerous thing to do, and I've already pointed out some weird properties of these models, this very high confidence in regions where they really shouldn't be confident, which is dangerous because it allows adversarial attacks.

What people like about Gaussian processes complements this view. They are very structured models that you can train with linear algebra; if you get more data, we know exactly what to do, you did it as a homework; and there is beautiful, deep theory for them.
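For reference, and using the bullet notation from earlier lectures for not-yet-evaluated arguments (this is the standard form of the result, reconstructed rather than copied from a slide), the linear algebra in question produces the posterior mean and covariance

```latex
\mu(x_\circ) \;=\; m(x_\circ) \;+\; k(x_\circ, X)\,\bigl( k(X,X) + \sigma^2 I \bigr)^{-1} \bigl( y - m(X) \bigr),

\Sigma(x_\circ, x_\bullet) \;=\; k(x_\circ, x_\bullet) \;-\; k(x_\circ, X)\,\bigl( k(X,X) + \sigma^2 I \bigr)^{-1} k(X, x_\bullet).
```

A new datum appends one row and column to the Gram matrix k(X,X), so the Cholesky factor can be updated in O(N^2) rather than recomputed in O(N^3); that is the sense in which "we know exactly what to do" with more data.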
What people don't like is that training them is expensive, and we will need to fix that. So in the next lecture I'm going to take today's insight, which is that the connection between deep learning and everything we've done so far is actually quite close, in the sense that we can think of the models we've trained as shallow, potentially infinitely wide neural networks, trained with a square loss, or, in the case of classification, with exactly the loss that people doing deep learning already use, plus a quadratic regularizer, and think about what that buys us: whether it allows us to connect these two model classes very closely to each other, so that fields that have historically evolved completely separately from each other, Bayesian learning, deep learning, and kernel learning or statistical learning theory, can be combined again, at least in your heads, into one conceptual framework that you can use for everything. And with that I'm done for today. Thank you very much for your time, and please leave feedback.