What do we actually do with this uncertainty that we have been constructing for deep neural networks over the past three lectures? Let me address that by first doing a quick recap with this slide that I keep pasting into lectures. We started the lecture course with a very structured approach to uncertainty: precise, analytic computations, including Gaussian models. Then we realized that those are in some sense restricted, because we can essentially only use them to learn real-valued functions, which is already a powerful framework, but sometimes we want to learn something more complicated. So we came up with approximate techniques, and then we realized that we can translate those approximations to deep neural networks, where the distribution we are dealing with really is not Gaussian at all anymore, because the likelihood is an extremely nonlinear function of the underlying parameters, the weights of the network. My proposal for how to do that involves four steps. First, we notice that training a deep neural network means minimizing a loss that has the algebraic structure of a negative log posterior, made up of a prior and a likelihood, so the trained weights, the mode of that posterior, are a point estimate that we can think of as a maximum a posteriori (MAP) estimate. Second, to get a sense of the geometry of the whole distribution, we do a local Taylor expansion around the mode: a constant term plus a quadratic term, since the linear term vanishes because the gradient is (approximately) zero at the mode. We then identify the trained weights with the mean, and the inverse curvature of the loss at the mode, the inverse Hessian, with a covariance over the weights. Third, we push this weight-space estimate forward into prediction space by linearizing the network in its parameters, a constant term plus a linear term involving the Jacobian, which is one backward pass. Fourth, we use these pieces to construct from the deep network a Gaussian process measure on the network's output, whose mean function is the trained neural network itself, the thing people usually use, so it is still there, not lost and not approximated, plus an additional object, a covariance function or kernel, which is the inner product of the Jacobians at the inputs with this inverse Hessian. That is one way to construct uncertainty on a deep neural network; there are many others. There are several things I like about it, which we will talk about today, and there are also things one could worry about, chief among them computational cost.
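Before getting to the cost, here is the push-forward written out as a minimal sketch. It assumes you already have the trained network, a function `jac` returning the Jacobian of the output with respect to the weights at the trained point, and the Laplace covariance `Sigma` (the inverse Hessian of the loss); all of these names are placeholders, not the lecture's actual code.

```python
import numpy as np

# Linearised-Laplace predictive for the Gaussian process described above:
# mean function = the trained network, kernel = Jacobian inner product with Sigma.
def laplace_predictive(x1, x2, f_trained, jac, Sigma):
    mean = f_trained(x1)                  # the point prediction is kept untouched
    cov = jac(x1) @ Sigma @ jac(x2).T     # "tangent kernel" k(x1, x2) = J(x1) Sigma J(x2)^T
    return mean, cov
```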
I had a slide on this in the last lecture to point out that it is of course a challenge to construct this covariance matrix, this inverse Hessian on weight space: if there are D weights, the matrix has size D squared, and to compute it we have to evaluate the big sum inside the loss, so constructing it costs on the order of N, the number of data points, times D squared. That is a big object, especially for very large models like contemporary large language models with several hundred billion parameters, where D squared is simply not feasible, and on top of that we have to invert the matrix to get a covariance, which in the worst case is D cubed. When we did Gaussian process regression we already saw that the cubic cost is often an overstatement and there are ways to speed things up, and we encounter some of them here as well. We could do something quite crude and just use the diagonal of the Hessian; that is certainly cheap, because evaluating those D numbers costs about as much as one gradient of the network if you implement it right, but of course it ignores all the covariance structure, or rather all the co-precision structure, between the weights, which can lead to a significant drop in the quality of the uncertainty. We could also look only at a particular layer of the network, and you are doing exactly that in this week's exercise, which was unfortunately delayed because a bug in the code had to be fixed; my apologies for that. There will still be an exercise sheet next week, but since you get bonus points for the exercises I do not want to simply scrap half of it, so you can decide for yourself whether to do it, and this time I can honestly say it will not be too hard, because I took care to simplify it a bit. A more interesting idea is to use the algebraic structure of the network to decide which parts of the covariance between the weights to represent. If you think of the network as an object with layers and nonlinearities, maybe there are particular interactions between parts of the network that you are most interested in. The networks we have looked at so far are multi-layer perceptrons, direct feed-forward connections of relatively simple layers, and there you could track the covariance blocks for the weights within one layer; since each layer has many inputs and many outputs, the weights form a matrix, and you could track only the terms that describe the covariance between all the units in the layer below and one unit in the layer above, and between all the units in the layer above and one unit below, a kind of fan-in, fan-out structure. Those have been used to construct Kronecker-factored approximate curvature estimates, KFAC, which have become popular in recent years, in particular for uses of curvature in optimization during training, because they are reasonably efficient to evaluate.
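To make the cost argument and the crudest of these approximations concrete, here is a toy sketch in PyTorch. The tiny network, the data, and the regularizer are all invented for illustration; `torch.autograd.functional.hessian` builds the O(N·D²) object explicitly, which is exactly what you cannot afford for a large model.

```python
import torch
from torch.autograd.functional import hessian

# Toy problem: 1-hidden-layer ReLU MLP, squared loss, all weights in one flat vector.
torch.manual_seed(0)
X, y = torch.randn(20, 2), torch.randn(20, 1)
D_in, H = 2, 8
n_params = D_in * H + H                    # hidden + output weights (biases omitted)

def loss(theta):
    W1 = theta[:D_in * H].reshape(D_in, H)
    w2 = theta[D_in * H:].reshape(H, 1)
    f = torch.relu(X @ W1) @ w2
    return 0.5 * ((f - y) ** 2).sum() + 0.5 * 1e-2 * (theta ** 2).sum()  # likelihood + Gaussian prior

theta_star = torch.randn(n_params)         # pretend this is the trained MAP estimate
H_full = hessian(loss, theta_star)         # the D x D object, built from a sum over the data
H_diag = torch.diag(torch.diag(H_full))    # diagonal approximation: D numbers, no cross terms

Sigma_full = torch.linalg.inv(H_full + 1e-6 * torch.eye(n_params))  # Laplace covariance
Sigma_diag = torch.linalg.inv(H_diag + 1e-6 * torch.eye(n_params))
print(H_full.shape, Sigma_diag.diagonal()[:5])
```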
There is a "but", though. One reason not to be too excited about these layer-wise factorizations is what we saw when we looked at the eigenvectors in the code: for our small network the Hessian showed very little layer-wise structure. Maybe for very small networks you would start to see it, but for large networks most of the interactions actually happen across layers. What the Hessian does tend to be is quite low-rank: most of its eigenvalues are very small and only a few stand out as large, but the corresponding eigenvectors, at least in our simple case, were spread across the entire network rather than having one eigenvector per layer. That low-rank structure is something we can use, and in your theory homework this week, which was at least on time, you get to see a first idea, a bit of theory, for how to construct these low-rank objects. It hinges on the insight that the Hessian of the entire loss decomposes into a term built from the Hessian of the empirical risk itself, the little ℓ in this slide, sandwiched between Jacobians of the network f, plus a second term that involves the gradient of ℓ with respect to f and the Hessian of the network itself; schematically, the Hessian of the loss is a sum of Jacobian-transpose times Hessian-of-ℓ times Jacobian terms, plus residual-gradient times Hessian-of-f terms. That second term can be shown to be approximately zero at the end of training, because the residual gradients of ℓ with respect to f have essentially vanished by then. I am not writing the derivation on the board because it is a homework exercise, but when you work through it you will find that you get exactly this kind of low-rank structure, and maybe that is what we saw in the code. So this is a way to construct uncertainty on deep neural networks, and we saw in the code experiments that it is a somewhat tricky business. It is not as clean as the traditional, textbook-style uses of probability theory; it requires some fiddling. It also has structure one could be excited about, because it points to structure in the network, but also structure one could be annoyed by, because it may not reflect exactly the kind of uncertainty you would like to have. The uncertainty produced by these Laplace approximations does not look as nice and structured as we have become used to from Gaussian process models, and, as I pointed out before, that is not primarily the fault of Laplace, of the approximate nature of the approach, but of the fact that deep neural networks are just really complicated, over-parameterized, structured models; just as their point predictions can be complicated and surprising, the uncertainty this process assigns around them can be difficult to make sense of. That raises the question, which one of you also posted in the feedback: why should we care? If this is not a neat, clean, reliable error estimate, why would I even want it? Today I want to show you that there are really good practical reasons to construct this additional object, which I have been calling the Laplace tangent kernel; to my knowledge there is still no agreed-upon name for it in the literature, so that is just my name for it. Those uses tend to be functional, by which I mean they are not just about assigning a raw error bar to each prediction and returning it to the user, where you immediately run into the problem one of you asked about:
how do I even explain to a user what this error bar actually means? The uses I have in mind are more interesting, in the sense of turning deep neural networks into actual pieces of reliable software. In other parts of computer science we are completely used to computers doing what they are told: nobody would use, say, an SQL database and accept that it only works 80% of the time; you just know it works, because it is a reliable piece of software. With deep neural networks we have somehow accepted that they are fiddly, and I think it is even a bit dangerous that we sometimes give them slack by thinking of them as intelligent systems: "oh, an intelligent system is sometimes wrong; a dumb system is always right, because it only does something dumb." That is maybe not entirely wrong as a way to think about it, but it is an annoying state of affairs for an important part of our technology stack. Most machine learning methods from a few decades ago work reasonably reliably; they are quite well understood and we know how to design them. For deep neural networks there are aspects that remain really difficult to work with, and I have listed a few of them here: they show genuinely pathological behavior in a sense I will show in a moment; training them is hard; and once you have trained a model it is really difficult to change it afterwards. I want to talk mostly about the first and the last point today and argue that these, at least in my opinion, are not fundamental flaws of deep learning that come from its nonlinear nature, but consequences of treating networks as point estimates; by adding uncertainty, a probability measure over the weights, we can actually heal these problems and start to make deep networks a more reliable piece of technology. So what do I mean by pathology? I want to pick out one particular example. It is not a universal pathology that all neural networks have, but it is a concrete, technically precise one that explains why it can be wrong to use point estimates. Some of you may have seen this in the Mathematics of Machine Learning lecture, but I will present it a bit differently today. It is a result by Professor Hein, who teaches the statistical machine learning class, published in 2019 with some of his students, I believe at CVPR, in which he showed, as a theorem, so as a fundamental property, that ReLU classification networks are pathologically overconfident far away from the data. What do I mean by that? I will first wave my hands around a bit, and then we will read the theorem and talk about what it actually means. A ReLU deep neural network is, for the purposes of today's discussion, a multi-layer perceptron: a recursive structure consisting of last-layer weights times a nonlinearity of the weights of the layer before, and so on all the way down to a nonlinearity of the first-layer weights times the input, potentially plus biases; the biases are not important for this argument.
The nonlinearities here are ReLU functions: the rectified linear unit is zero if its argument is negative and equal to the argument if it is positive. We already noticed in a previous lecture that this kind of recursive structure with ReLUs is inherited through the network: these functions are piecewise linear, they are zero and then become linear, and a piecewise linear function of a piecewise linear function is again piecewise linear, just with more pieces. In the networks we have trained you have actually seen this: you have seen these straight lines lying in the input space. Here is our trained network from the Jupyter notebook I showed you last time. Because the network has only finitely many of these features, there are finitely many weights in here, there are only finitely many linear pieces, finitely many of these polytopes into which the input space is dissected. Now, as you move away from the training data in some arbitrary direction, unless you happen to move exactly along one of the directions on which one of the ReLU kinks lies, and there are only finitely many of those, so that is a set of measure zero and a random direction hits it with probability zero, you eventually cross a last boundary, after which you will never cross another one again: in that direction you are now inside a single linear region. So if you plot the outputs along such a ray, call the coordinate along it x tilde, then far enough out the outputs of the network are linear functions of x tilde. But this is a classification network: for multi-class classification we do not have just one output, we have several, and we take the softmax over them, so the last-layer weights form a matrix, the output is multivariate, and each component is a linear function along the ray. To predict the class at a particular point we take the softmax of these functions; the largest one becomes the dominant label, and the further it is above the others, the more dominant it becomes. As we move outward, those linear functions move further and further apart, because that is just what linear functions do, so the gap between whichever one is largest and all the others keeps growing, the softmax concentrates more and more on one class, and eventually, this is what the theorem says, the predicted probability for one of the classes gets arbitrarily close to one. The theorem does not tell you which class; it could be any of them, and of course it is a different one depending on the direction you move in.
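You can see this saturation numerically without any training at all. The sketch below builds a made-up random two-layer ReLU network with three outputs, walks along a fixed ray away from the origin, and prints the largest softmax probability; every number in it is invented purely to illustrate the geometry.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up, untrained 2-layer ReLU network with 3 output classes.  Far from the
# origin each output is exactly linear along a fixed ray, so the gaps between the
# outputs grow without bound and the softmax saturates on one class.
W1, b1 = rng.normal(size=(32, 2)), rng.normal(size=32)
W2, b2 = rng.normal(size=(3, 32)), rng.normal(size=3)

def logits(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

direction = rng.normal(size=2)
direction /= np.linalg.norm(direction)
for t in [1, 10, 100, 1000]:
    p = softmax(logits(t * direction))
    print(t, p.max().round(4))   # the winning class probability creeps toward 1.0
```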
Clearly, in this picture, if we move upwards it is going to be the red class, and if we move downwards the green class, but that does not matter: far away from the data, this classifier becomes arbitrarily confident in one of the classes. This is a fundamental property of ReLU classification networks; it is always true, no matter how we train them, as long as they have finitely many weights. There was a good question: for most classification problems the input space is arguably bounded. If you think of your MNIST toy problem, each pixel has a value between 0 and 1, or between 0 and 255 depending on the encoding, so you basically have a hypercube of 784 numbers between 0 and 1. In that setting the statement may seem less striking, but high-dimensional spaces are quite unintuitive and pretty empty, so "far away from the data" can easily mean within that cube, within the domain the data lives in. You can show examples of that: you have all heard of adversarial examples, but you can also think of simple out-of-distribution examples, images that really do not look like a digit but still lie inside the box, and these classifiers will be very confident that they are one particular class. That is just not something we want: we want a classifier to be uncertain far away from the data, to say "I do not know what the answer is; I only have this little bit of training data here, why would I know the correct class label somewhere up in that corner?" It turns out that this problem really is a pathology arising from the fact that we use point estimates, and it is healed the moment we assign a Gaussian distribution to the weight space. Any Gaussian distribution: it does not even have to come from a Laplace approximation, it can be pretty much any measure, although you may want it to have other properties as well; for this particular problem the point is simply that right now there is no measure at all. I will show you how, but there was a question just now: is this only for ReLU networks? Yes, this precise technical argument is for ReLU networks, but there is a deeper message in the proof about how to fix it, which is that point estimates are the problem. The issue is not that it is a ReLU network, or that we assign a Gaussian measure, or that we use a Laplace approximation; it is that we pretend the point estimate is the correct output when we use networks in the standard way. So here is how to fix them. We have spoken, admittedly without formalizing it properly on a slide, though you have done it in exercises and we did it for Gaussian process classification, about how to compute an expected class label for a softmax output, or in our simplified binary case a logistic output, under a Gaussian measure on the input of the softmax or logistic function. First we do what we have done so far, essentially Laplace. Far away from the data we have the situation from before: the function we are modeling is a linear function of the input that depends somehow on the weights.
Now, if you have a Gaussian measure on the weights, for example from a Laplace approximation or from any other way of constructing one (you could also run some sampling scheme and compute a covariance matrix from the samples), then we can apply the push-forward trick from a few slides ago and construct this Gaussian-process, linearized, tangent-kernel type of object. Far from the data we can think of f as linear in x, and therefore the Jacobian also involves an x at the end; that means the resulting GP has a mean function that is linear in x, and a covariance that is in some sense quadratic in x: one x in the mean, and in the variance one x here and one x there, with some complicated algebraic object in the middle, just some matrix, which we can ignore for what comes next. Now, to make a prediction under this uncertain representation of the latent function, we might compute the expected value of the logistic transformation, the binary case of the softmax, of f. That integral has no closed form when f is Gaussian; we already encountered this in Gaussian process classification, and we said there are various tricks to approximate it, for example simple Monte Carlo in the output space. But there is one simple approximation that suffices for the argument we want to make: approximate the expected value of the sigmoid by the sigmoid of a kind of signal-to-noise ratio, the mean divided by the square root of one plus pi over eight times the variance. There are some pis and eights and ones in there, but essentially it is the mean divided by a standard deviation. Now, if the mean is a linear function of x and the variance is a quadratic function of x, those two roughly cancel: as x becomes large, the ratio approaches a finite constant, and the sigmoid of a finite number is some number strictly between zero and one. So this problem is, in that sense, solved. Here is a picture of what I mean. In binary classification we have a single linear output (in multi-class we would have several). Down here is our ReLU network, and you can imagine a tiny data set sitting somewhere inside, where all the training data is; there the model has learned something complicated, the ReLU function goes up and down and is bendy, but as you move far away it is just a linear function, the red line. If you take the logistic transformation of this linear function you get the red curve up here, which of course is just a sigmoid, the sigmoid of a linear function. The green band around it is the standard deviation, the square root of the variance, and the variance grows quadratically with distance.
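Here is that approximation as a tiny numerical experiment. The coefficients a, b, c below are invented; all that matters is that the mean is linear and the variance quadratic in the distance x, so the predicted probability saturates at a constant rather than running off to one.

```python
import numpy as np

# E[sigmoid(f)] for f ~ N(m, v), using the approximation sigmoid(m / sqrt(1 + pi*v/8)).
def expected_sigmoid(m, v):
    return 1.0 / (1.0 + np.exp(-m / np.sqrt(1.0 + np.pi * v / 8.0)))

a, b, c = 0.3, 0.3, 0.5            # invented coefficients: m(x) linear, v(x) quadratic
for x in [0.0, 1.0, 10.0, 100.0, 1e4]:
    m, v = a * x + b, c * x**2 + 1.0
    print(x, round(float(expected_sigmoid(m, v)), 3))
# saturates near sigmoid(a / sqrt(pi*c/8)) ~ 0.66 instead of approaching 1
```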
So the standard deviation grows linearly with distance, and you can see that, in some sense, the ratio of probability mass below and above zero stays constant: the distribution just gets broader and broader, we become more and more uncertain, but the fraction of mass on either side of zero stays the same. If you take the sigmoid of that, the green line above, you get something that saturates at a constant far away from the data. It just says: I do not know, my prediction for class one is, in this case, something like 60 percent. So we can really nail down the culprit for this pathological behavior of this class of deep networks: it is that we use only the red line, and that problem arises because we pretend to be certain about the weights. We do not actually believe that; the framework we use simply pretends it, because it is a point estimate. As soon as we acknowledge that there is some uncertainty about this latent object called the weights, we have to take care to translate that uncertainty into the output space, and that translation is what fixes the problem: no matter how uncertain we are about the weights, the push-forward onto the prediction space makes sure that whatever finite amount of weight-space uncertainty there is gets scaled correctly with the output, and then everything is fixed. Well, I say everything; of course not really everything. We are still in a regime where, as we go far from the data, the model is finitely confident in one particular class; it still says one class is more likely than the other. If you care about accuracy, for instance, if you just take the most likely label and use that, then far from the data we will still always get the same prediction. But maybe that is a good first step, and we can think about what we would need to do to bring this green line back down to one half, if that is what we actually want; we will do that after the break. Before that, are there any questions on this? Yes, several. First: remember that the Jacobian is the Jacobian with respect to theta, so we take a derivative with respect to theta and the x just stays there. Then, if I understand the question correctly: if we are very uncertain about some of the weights, can we use that to shrink the network? That question is not so directly related to these two slides, but it is an interesting one: are there weights in the network that are not identified by the data at all? If there are directions in theta space that are essentially unaffected by the loss, so you could move the whole network along such a direction and the loss would stay exactly the same, then maybe we should be able to remove those directions. This idea has been brought up many times over the decades, and it has sometimes been given the rather grandiose name of neurosurgery, since you go in and remove bits from the artificial brain and then check whether it still works.
What I can say in a lecture like this is that it is an interesting idea, but it tends not to work straight out of the box if you do it naively. If you take your network, do the Laplace approximation, pick all the weights with really high uncertainty and just drop them, the network typically degrades in quality, because those weights still have a finite value and therefore affect the higher layers, in particular if they are not in the last layer. What one often sees there is the weakness of Laplace as a local approximation: there can be complicated, very nonlinear downstream effects that are not captured by this curvature approximation. So this is perhaps an example of a question that Laplace does not answer particularly well, precisely because it is a local approximation. What it does allow us to do, as one particular example, is to fix the kind of pathological behavior we just discussed, so that if you move away from the training data you get a prediction that at least recognizes, in some sense, that you are far from the data. Okay. Do you feel you understood this part? Because to me this is actually among the strongest arguments for probabilistic thinking about deep learning in the first place. We can have a lot of concerns about how to do it correctly; as I have said a few times, Laplace approximations are one particular way, and there are other people with differing opinions on how to do this. There are Monte Carlo algorithms for this purpose, variational approximations, and various other tricks; there are even propagations of interval bounds across layers, algebraic objects that, in particular for ReLU networks, exploit the piecewise linear structure to bound how much the function can change as you push it through the nonlinearities. There are various other approaches we do not have time for here, but they all come down to eventually constructing a probability measure over the weights, and the real power lies in the push-forward, in translating between the two domains, which has a genuinely non-trivial effect on the predictions. You can see that in this picture: in the region around zero you get this surprisingly complicated, nonlinear quality to the effect of the approximation. The bottom plot looks really simple, but the top plot shows that it is a non-trivial thing to do. Now let us take a break and continue at five past nine. ... I made a correction over the break: someone pointed out, correctly, that I made it a bit too easy for myself with this expression, which may have been confusing, so let us go through it more slowly; I should have thought for a second before jumping into this slide. The argument again: if we are far away from the data, say the data is centered at zero and x is large, then the network has the form up there, a linear function in x; this is not an approximation, it is exactly true once x is large enough. Now we approximate this network by setting theta to theta star, calling that the mean of our distribution, and linearizing in theta.
So we take the derivative of this expression with respect to theta, not with respect to x, and that is why the x stays. Of course the derivative changes the a: we get a different function of theta, still linear in x; call it a prime, it is just some Jacobian, and for the argument it does not actually matter what that Jacobian is, it is simply another function of theta evaluated at theta star. In the middle there is the weight covariance matrix, but the important bit is that there is an x on the left and an x on the right, because taking the derivative with respect to theta does not remove the x, in general (unless it happens to become zero afterwards, which it does not in general). That leads to this expression. Another question was why this green curve looks the way it does, how you get it: I literally just plot the expression, the logistic function of the mean of a Gaussian divided by the square root of one plus pi over eight times the variance of the Gaussian. For this plot, since it is just meant to show the shape, I took a linspace in x, used m times x for the mean, and one plus pi over eight times s squared times x squared inside the square root for the variance, and that gives this perhaps surprisingly nonlinear shape, this little bump that goes up, comes back down, and asymptotically becomes a constant. For the purposes of this argument, the flow of the proof is: assume we use a Gaussian measure on the weights, assume we linearize the network, and assume we use this approximation to the integral; yes, there are several such steps one after the other, but the point is not that this is exactly the right thing to do. The point is that this feature of point-trained deep neural networks, that they are arbitrarily confident far away from the data, has something to do with the point estimate. If you used a different approximation, no linearization, some other push-forward, this particular argument would not hold in this precise form, but what would still happen is that a probability measure on the weights gets pushed forward through the network, the uncertainty at an output point depends on that point, and if the function is linear far from the data, then the pushed-forward measure's spread grows linearly with x. It will not be a Gaussian measure in general, but its spread grows. And of course the green line is itself an approximation of the integral; if you did the integral exactly you would get a slightly different shape, but it would still saturate at a constant.
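To write the asymptotic argument down once, here is a sketch in my own notation, for a scalar output along a one-dimensional ray at distance t from the data; the symbols a and s are shorthand for "some slope" and "some scale" and are not taken from the slides:

\begin{align}
m(t) &= a\,t + \text{const}, \qquad v(t) = s^2 t^2 + \text{lower-order terms},\\
\mathbb{E}\big[\sigma(f(t))\big] &\approx \sigma\!\left(\frac{m(t)}{\sqrt{1 + \tfrac{\pi}{8}\,v(t)}}\right) \;\xrightarrow{\;t\to\infty\;}\; \sigma\!\left(\frac{a}{s\sqrt{\pi/8}}\right) \in (0,1).
\end{align}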
As was also pointed out, though, this only fixes one particular pathology, one that may not even matter so much in practice, because real data may lie in a bounded domain, and it is perhaps still not exactly what we want, because one class remains more likely than the others. Could we also fix that: could we somehow build a network that becomes arbitrarily uncertain as we move arbitrarily far from the data? That was one question. Another question comes back to this weird thing about classification: the prediction of the network is a probability for the class label, and now we are uncertain about that probability. The green line up here is the expected value of the label under the probability distribution over the weights. We could instead construct an entire distribution on the output space, over possible probabilities for the labels; I thought about doing that for this plot but did not want to spend too much time on it. So one way to answer the question is to say that the green line is really just a point estimate for the probability of the classes, the two labels in this case, a point estimate that correctly takes into account that we are uncertain about the weights, by marginalizing the weights out. You can think for yourself about how you would construct a distribution over this probability, and you can imagine roughly what it looks like: here in the middle, near the data, the network is typically quite confident about that probability, so the distribution is a bump sitting quite close to the green line; as we move far away, it becomes a broad, spread-out distribution, possibly a bimodal one with mass piling up near zero and one. And you can actually plot that measure yourself, because you can get it directly from the Gaussian: you can compute the probability density of sigma of f, if f is Gaussian distributed, using the change-of-variables Jacobian of this transformation, which for the logistic sigmoid is just probability times one minus probability. That gives the right shape; it is something you can plot, but it is not something you can reduce to a single number with a simple exact analytic computation, which is why we use approximations like the one above that get close to the correct value of the integral.
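If you want to plot that measure, the change of variables looks roughly like this; the means and variances below are made up, just to contrast the near-data and far-from-data regimes.

```python
import numpy as np

# Distribution over the predicted class probability u = sigmoid(f) for f ~ N(m, v),
# via the change-of-variables formula mentioned above (the Jacobian of the logistic
# sigmoid is u * (1 - u)).
def prob_density(u, m, v):
    logit = np.log(u / (1.0 - u))
    gauss = np.exp(-0.5 * (logit - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)
    return gauss / (u * (1.0 - u))

u = np.linspace(1e-3, 1 - 1e-3, 5)
print(prob_density(u, m=0.5, v=0.5))   # near the data: peaked around sigmoid(m)
print(prob_density(u, m=3.0, v=50.0))  # far away: mass piles up near u = 0 and u = 1
```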
So how do we fix the asymptotic uncertainty? Another thing I would like is this: around zero, where the data is, we are rightly somewhat confident, but I would like the model to become arbitrarily uncertain as we move away. We have a paper on this as well, and I do not want to just plug the paper, but I do want to highlight the kind of thinking you can put into deep models once you think about a probability measure on the weight space. You can really stare at this simple plot, which is what we did when we started work on that paper, and ask: if I want the green curve to come back down to one half, if I want the property that far from the training data every class label gets the same probability, one half for binary classes, one over c for c classes, a maximum-entropy prediction, what do I need? Going back to the approximation, and keeping it for simplicity's sake, if I want the term inside the sigmoid to go to zero, then either the mean has to become constant while the variance keeps growing, which is not going to happen with a ReLU network; we would have to change the nonlinearity, change the architecture of the network. Maybe that is a thing to do; we could use a nonlinearity that somehow saturates. The obvious names are taken (there is already a SiLU), so we could call it SatReLU or something, a saturating ReLU, and give it a maximum value or let it become ever flatter. But that seems like overkill; people like using ReLUs, so why do that? Alternatively, maybe we can do something with the variance, since people are not making much use of it anyway: we would need the variance to grow faster than quadratically in x. So we started thinking about that. Could we add some x-dependent term to the variance? That seems a bit odd if we want the variance to arise from the linearization, from our nice Laplace-and-Jacobian construction. After a while it occurred to us that the problem is not really that we need different features; the problem is, fundamentally, that these networks only have finitely many features. When you train a deep neural network, remember the code we went through, at the very top of the Jupyter notebook you write down the architecture you are going to use for this experiment, and it is finite, it has finitely many weights; then in comes the data, you train those finitely many weights, and then you make predictions on new data. We have become so accustomed to this process that it seems like the only way to think about learning: write down the model, the data arrives, do some training, and afterwards you can predict. But imagine you were in a setting, which is maybe not so unrealistic, where the world can always produce more data. You could always go out and get more images of the world and learn more classes; let us say there are finitely many classes for the simplicity of this argument, but there are infinitely many images you could train on. As the world produces more and more data, would you keep the network constant? Probably not, because it has finite capacity and the world has, in some sense, infinite complexity, so you would like the model to have infinite flexibility, even though for any finite data set you will only ever need a finite piece of the weight space. If you are a computer scientist, it is a little like a Turing machine: you want an infinite tape, even though you are only ever going to use finitely much of it; you never want to run out of memory. There should be something similar in neural networks: if you get more and more data, you probably want to increase the capacity of the network somehow. For a finite data set it seems like this should not matter, since we will only ever use a finite network anyway, and that is true for the point estimate, but it is not true for the uncertainty. How would you do this in practice? Think back to Gaussian processes: ideally you would like to keep around an infinite supply of weights and features, just waiting to be trained, so that the entire input space, the entire x space here, is covered with ReLU features.
They would just be everywhere, an infinite supply of them reaching further and further out as you leave the data. The only problem with that is that we would then have to keep thinking about them during training. So maybe we can come up with a way of keeping in mind that there are potentially infinitely many features lying around, features we would like to use if we had more data, and they should somehow affect our uncertainty, because they say: there are all these weights out there that I know nothing about yet; as you go further from the data they all start contributing, but they have never been trained, so we do not know what they are. Imagine a ReLU network with infinitely many such features lying around in the outer parts of the training domain, but constructed so that they all point away from the data, in the sense that the kink of each ReLU feature is oriented so that the feature is exactly zero on the training data. Then during training you would never care about them: they are all zero at the training points, so they all have zero gradient, and you could train the whole network without ever thinking about the fact that they exist. Only when someone asks you about the uncertainty do you have to remember that they are there: as you move further away from the data, more and more of them switch on, and you have to add their uncertainty. That is what this picture is supposed to show: imagine a tiny data set in here, we have zoomed out so the data set fits within one pixel on this screen, and further out there are all these features waiting to be trained, never touched so far, because they are all zero where the data lies. As you move out, you keep adding them, and you get growing uncertainty. That is a way to construct an uncertainty measure that grows super-linearly with x: every single feature contributes a term that is quadratic in x in the variance, linear in x in the standard deviation, and the sum over them keeps adding more, so overall the variance grows faster than quadratically and the standard deviation faster than linearly, which gives asymptotically full uncertainty. That is the idea. What does this have to do with uncertainty specifically? It is exactly the kind of effect that only shows up when you think about probability measures: it does not show up in the point estimate at all. And it is useful here to make the connection to Gaussian processes again, to even think of the model as a Gaussian process, because for GPs we have by now gotten used to having these two objects, a mean function and a covariance function, and the covariance function, the kernel, can have an effect far away from the data that is simply not visible in the mean function; it is an additional quantity to talk about. That may all sound rather pedestrian, though: I am just imagining a grid of features lying around, and as I pass more of them I get more and more uncertainty. Do we have to count how many features we have walked past as we move away from the data?
We could sort of do that, but then we would get discrete jumps in the uncertainty, and it all seems a bit awkward to do with explicit features. At this point we can remember that we have seen a picture like this before, at least the right half of it, when we did kernel methods and Gaussian process models, and realized that there is a way to keep track of infinitely many such ReLU features at finite cost. Back then I showed you this slide, copied again here, from Grace Wahba, who wrote this wonderful book, Spline Models for Observational Data; among the many things she shows in that book, it is a really nice book, I have it on my shelf, is that if you have such features, here are your ReLU features, and a covariance that is a sum over them, then you can take an infinite limit: divide by the number of features in front and let the number of terms in the sum go to infinity. That is exactly a way to construct the process I was pointing at with the picture: more and more features, each becoming weaker and weaker, so that at each infinitesimal point an infinitesimal additional amount of weight is added, and you end up with the covariance function, the kernel, that you already derived back then, which has this form: the kernel between two inputs is a constant (the symbol here is not a sigmoid, it is just a number; I copied the slide over) times the minimum of the two inputs cubed over three, plus the absolute distance between them times the minimum squared over two. The important bit is that this expression is cubic in x: evidently cubic here, and cubic here as well, because there is a square and another x. That is exactly what we want: an uncertainty, a variance, that grows faster than quadratically as we move away from the data.
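Written out as code, the kernel from the slide looks roughly like this; the form with the 1/3 and 1/2 factors is the standard Wahba cubic-spline kernel, and theta is an invented output scale.

```python
import numpy as np

# Cubic-spline kernel for scalar, non-negative inputs: the infinite limit of a sum
# of ReLU features whose kinks are spread over the positive half-line.
def cubic_spline_kernel(x, x_prime, theta=1.0):
    m = np.minimum(x, x_prime)
    return theta**2 * (m**3 / 3.0 + np.abs(x - x_prime) * m**2 / 2.0)

for x in [0.1, 1.0, 10.0, 100.0]:
    print(x, cubic_spline_kernel(x, x))   # marginal variance k(x, x) = theta^2 x^3 / 3 grows cubically
```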
Now you can make an argument that this carries through correctly if you do it with a Gaussian process model, and I could spend five minutes on the algebra, but actually we have already done the part that matters; the rest is just showing in more abstract notation that it works. So let us quickly check the important bit again. Imagine your tiny data set down here, and lots and lots of features that are all zero, or nearly zero, at the training data; then they contribute zero, or nearly zero, to the prediction at the data, so we can train our deep network without taking into account that all these features are waiting to be trained outside. When we then make a prediction, what we have to do is basically measure the distance from zero, say it is five, multiply that by some count of how many features we add per unit step, say five per step in this picture, so twenty-five features are switched on at that point, and add a corresponding term, which grows cubically with the distance, to our variance. And that is a post-hoc process: we can do it to an already trained neural network, afterwards, and get calibrated uncertainty far away from the data. Now you can do the same thing with infinitely many features, and then the argument goes like this. We have trained, or want to train, a deep neural network f on the data, but we actually posit that the true latent function generating this classification problem is a sum of two functions, call it f tilde: the finite neural network with its finitely many parameters, plus a Gaussian process function, a draw from a GP with zero mean function and, as its covariance, the kernel from the previous slide, this complicated expression down here, at least for inputs on the positive axis; for high-dimensional inputs you have to think about how to do this in multiple dimensions, and you create the double trumpet from the previous slide by adding kernels pointing in each direction of the input space. So we have two contributions, and we are going to make two claims. The first is that we can train this combined model by effectively just training the deep neural network as we normally would, and the contribution from this f hat is going to be arbitrarily small. The second is that afterwards we can make predictions, far from the training data or in fact anywhere in the input space, by essentially evaluating the trained network and adding this prior Gaussian process, which is easy to work with as a prior, without ever training it. For that we assume that all the data lie very close to zero on the scale on which this kernel measures inputs, which amounts to choosing a length scale for the kernel that is large compared to the size of the data set, or to rescaling the data, whichever you prefer. We know how to train such models: find the most likely f; if this were a Gaussian process it would be linear algebra, and for a deep neural network we just train it. The corresponding posterior on f tilde, on the sum of the two, has this expression, which you may remember from the CO2 example, where I did the same kind of source separation into different terms: the posterior mean on the sum is the prediction of the trained network, plus a correction term that involves the covariance, under this kernel, between the test point and the training points, times the inverse of a matrix C, times the residual, the observed training labels y minus the network's predictions at the training inputs. The matrix C is the covariance of all the training data under the model, that is, the covariance you would get from a Gaussian process whose kernel is the sum of our Laplace tangent kernel and this prior kernel k, plus observation noise sigma; there are lots of sigmas on these slides, I am sorry, they all mean different things.
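In symbols, and in my own transcription of what the slide describes (so treat the exact notation as an assumption rather than the lecture's), the predictive mean and the matrix C are

\begin{align}
\mathbb{E}\big[\tilde f(x_\star)\,\big|\,y\big] &= f_{\theta_\star}(x_\star) + k(x_\star, X)\, C^{-1}\big(y - f_{\theta_\star}(X)\big),\\
C &= \underbrace{J_X \Sigma J_X^{\top}}_{\text{Laplace tangent kernel}} + k(X, X) + \sigma^2 I .
\end{align}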
Now, the first claim was that we can train this thing without actually training the GP. For that we look at this expression and notice that it pretty much boils down to the observation that, if the network is really well trained, it achieves basically zero training loss, so the residual y minus f(X) is essentially zero, and then there is no training contribution to the GP part of the model. That is a bit simplified; the actual lemma in the paper is more precise, because we have to make sure this term is small on the scale defined by C inverse, but we can show that as well, and that is the next step: this matrix C, which consists of those three terms, is large compared to the factor we multiply with, the covariance under this kernel between the test point and the training data, because all the training data are assumed to be very close to zero, so this prior kernel evaluated at the data is very small, while the other terms in the sum are finite. That is pretty much all it comes down to; roughly, the inverse of a sum is smaller than the sum of the inverses. For matrices you make this precise by bounding the largest eigenvalue from above, and you find that the relevant term can be shown to be small. What this somewhat involved piece of mathematics buys us is permission to say: train the deep neural network as before, and then, post hoc, add to the Laplace tangent kernel the kernel that models the fact that there are potentially infinitely many more ReLU features waiting to be trained as we move away. Then the picture looks like this. This is the setup we have used in the IPython notebook so far; I will not show the code because it would be tedious, but it really is just: train the network as before, which gives this prediction, including uncertainty from the Laplace tangent kernel; this here is the posterior mean. With the Laplace tangent kernel alone it would still not become fully unconfident far away, so we add, literally post hoc, when you evaluate the posterior variance, the Laplace tangent kernel plus a quantity that is cubic in the distance from the origin; you really just add this function to the GP kernel. That gives you this uncertainty on the latent function and this predictive distribution, and you can see that the confidence rises for a bit and then starts dropping off; if you zoomed out of this image arbitrarily far, you would get a completely white image, everything at one half, everything equally likely. So in practice this is a very simple thing to do: you train your network, at prediction time you turn it into a Gaussian process via the Laplace approximation, and then you add a kernel. But that addition is well motivated: it is motivated by the fact that, with finite training data, you should keep in mind that there are potentially infinitely many untrained weights that have never been touched, and all infinitely many of them can be modeled together at constant cost. The extra term costs essentially nothing at test time; you really just measure the distance from the origin. Here I even did a little extra work and measured the distance in the eigendecomposition of the covariance matrix of the data, so I basically ran PCA on this data set first.
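A minimal sketch of that test-time recipe, with placeholder names (`jac_x`, `Sigma`, `X_train`, and `theta` are all stand-ins, not the notebook's code):

```python
import numpy as np

# Post-hoc predictive variance: linearised-Laplace term plus one cubic-spline term
# per principal direction of the data, evaluated at a single test point x.
def posthoc_variance(jac_x, Sigma, x, X_train, theta=1.0):
    laplace_var = jac_x @ Sigma @ jac_x.T            # Laplace tangent kernel k_lin(x, x)
    mu = X_train.mean(axis=0)
    # distance from the data, measured in the eigenbasis (PCA) of the data covariance
    evals, evecs = np.linalg.eigh(np.cov(X_train.T))
    d = np.abs(evecs.T @ (x - mu))                   # one distance per principal direction
    spline_var = theta**2 * np.sum(d**3) / 3.0       # one cubic kernel per direction
    return laplace_var + spline_var
```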
That is simple in this case, because the data are just two-dimensional; it gives me a principled coordinate system for the data, and those are the dimensions along which I add the kernels, so it really is just measuring how far we are from the training data. That way you also get finite but high uncertainty within a bounded domain when the data sit somewhere inside that domain. And that is a property you might want from your classifier. If you are building, say, a medical product that goes out to a clinician and predicts disease or no disease, you want it to have the property that if someone shows it an image of a car it says: wrong, not in the training data, out of distribution, I simply do not know what this is supposed to be. That is not what a standard deep neural network currently does; it just returns a class: I have been trained on two classes, it must be one of them, this one, with very high confidence. And the problem is just that we are not taking care to add weights, and their uncertainty, correctly. So let us finally look at another problem, which is continuing to train deep neural networks. This is another very common issue in industrial applications. Say you are working for a company and you have built a classifier that gives your customers on the internet some recommendation: a retail company recommending products, a music provider suggesting songs you might like, a streaming service suggesting the next video. It is a very common situation, and one I keep hearing about from people working in industry, that you have to retrain: you collect an initial training set from your customers, train your deep neural network on it, it works beautifully, you deploy it on your mission-critical system, and then the world changes. It works for the first half year, then someone from the product group comes back and says performance is degrading, because the customers have changed, sometimes even because of the system itself: your predictions change customer behavior and push them into a corner the model no longer covers well. So now your job is to update the model, to retrain. The example here is based on these two papers, by Kirkpatrick et al. and Witte et al., among various others; this kind of problem is studied in a sub-community of machine learning under the name continual learning, or sometimes lifelong learning, and this is a simple benchmark typically used in those papers to represent the problem. It is built on MNIST, of course it has to be. We first train a classifier on MNIST, that is the first task, and we get this performance, about 98 percent accuracy. Then we get new tasks, which behind the scenes are copies of MNIST where the inputs have been permuted: we start with images that look like digits, one, two, three and so on, and then we take the pixels and rearrange them, a pixel from down here goes up there, another one over there, so that the image ends up looking like some QR code, basically, completely wild.
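The benchmark construction is simple enough to sketch; X below is a stand-in for an array of flattened MNIST images, and nothing here is the lecture's actual code.

```python
import numpy as np

# Permuted-MNIST tasks: each new "task" applies one fixed random permutation to the
# 784 pixels of every image.  X has shape (n_images, 784); task 0 is plain MNIST.
def make_permuted_task(X, seed):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])   # one permutation, shared by all images of this task
    return X[:, perm]

# tasks 1, 2, ... look like noise to us, but are statistically just as learnable for an MLP
```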
So we start with these images that look like digits, one, two, three, up to nine, but now we take those pixels and just rearrange them in the image: a pixel from down here goes up there, one goes over there, and now it looks basically like some QR code, completely wild. For a neural network that task is just as hard as before, because it is in fact the exact same thing, just a permutation of the inputs. For us humans it doesn't work, because our vision system has already been trained, we look for smooth things, but for a fresh network it is still possible to learn this, and you could say: these are just more digits, it's just a different way of writing a five, it happens to look like this, please learn that as well.

Now what you could of course do is just keep training, and that is what people often do: you keep your stochastic gradient descent running with the new data. There's a new dataset coming in basically every day, all your customer interactions get stored somewhere in some data lake, you extract them and do a bit of SGD on this new data. If you do this, you get a degradation in performance. What I show in this plot is: each black line is one of these datasets, and what we assume happens in this experiment is that after this point we drop the original MNIST dataset and train only on the next dataset for a few epochs. That next dataset now gets trained to 98% accuracy, of course, because it's basically the same dataset, just permuted, but while we do that, the previous dataset is classified much, much worse, because the model unlearns what it previously learned, and so the average performance on those two is just halfway between them, here. The next day a third dataset comes in, we start training on it, that dataset gets trained to 98% accuracy, but the other two degrade and our average performance is really bad. As we keep doing this over time, all the old datasets are forgotten, the new ones are always learned, and the average performance goes down. In some settings this might actually be what you want, maybe because the old data really is outdated, but you might also want your model to still be able to do the old tasks, maybe because the new data coming in is just more data, not a replacement.

Of course another thing you could do is store everything in a growing dataset on a big hard drive and then always train on this whole thing, but that would mean that every week your task gets harder and you have to train for longer, because each epoch gets longer and longer. That's not a viable strategy; maybe for a while you can ask your manager to buy new hard drives and more GPUs and go wild, but eventually that's not a smart thing to do anymore, and it also feels really wasteful. Another thing that people do, in particular in the reinforcement learning community, is something called replay, which you may have heard of: you train on a bounded amount of training data, replacing part of it with new data each time, and you also inject some of your old training data from memory into training. People actually do this in reinforcement learning, they replay previous episodes for example. There's a very recent paper that I'm linking here which isn't even reviewed yet, it was just submitted to TMLR a week ago. I can't click on it in this app, but you can click on it in the slides and you'll see it.
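Before moving on, here is roughly what the permuted-MNIST tasks in this experiment look like in code. This is a generic sketch, not the lecture's notebook: it assumes the images arrive as flat vectors of 784 pixels, and the variable names are made up for illustration.

    import numpy as np

    def make_permuted_task(images, seed):
        """Apply one fixed pixel permutation to every image in the dataset.

        The task is exactly as learnable as the original for a fresh
        network, but the images look like noise (or a QR code) to us."""
        rng = np.random.default_rng(seed)
        perm = rng.permutation(images.shape[1])   # one permutation per task
        return images[:, perm]

    # Example: a sequence of tasks from one base dataset; labels are unchanged.
    # tasks = [(make_permuted_task(base_images, seed=t), base_labels)
    #          for t in range(5)]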
I'm not trying to diss the authors at all, I'm just saying people are doing this today; a week ago people published a preprint of this type, so somehow this seems okay. Maybe this seems normal to you if you've done a deep learning class, but if you've previously trained Gaussian processes it should seem weird. Because if you forget about deep neural networks for a moment and go back to our Gaussian processes: if I told you that you're going to need to train on two datasets y1 and y2, well, we know how to do that. We first compute the posterior on whatever the parameters are given y1, and then we do Bayesian inference: we multiply this posterior from y1 with the likelihood for y2 and divide by the evidence, and we get a joint posterior. This is the correct application of Bayes' theorem if we assume, as written over here, that the two datasets are independent of each other when conditioned on the underlying model. That is the typical setting in regression and classification: we assume the labels are independent given the latent function. If they're not, you have to update those equations a little, and you can think about how to do that yourself. So if we were actually doing Gaussian process regression, we know how to condition on one datum, then the next one, then the next one, without the model forgetting anything; we just keep updating the posterior.

How do we do this with a Gaussian model? Think of our Laplace framework, although for Gaussians the Laplace approximation is of course the exact answer, so there is nothing more to it. The negative logarithm of this posterior is just the negative log-likelihood of the new data minus the log posterior from the previous dataset, up to a constant. People sometimes say yesterday's posterior is today's prior: we learned on dataset y1 yesterday, we now have a posterior given y1, and that becomes the prior for y2. If we look at this as an empirical risk minimization problem, making the connection to exactly how we've done deep learning so far, we have an answer to our question of how to do this. It says: train your deep neural network as you would on the new data, but change the regularizer; turn it into an L2 loss that is centred on the previously trained network from the previous day, the previous sub-dataset, with a covariance given by yesterday's covariance, the Laplace approximation from the previous problem. That is clearly something we can do, because we already have our training framework set up with an empirical risk and a weight cost; so far the weight cost was maybe a plain L2 penalty, and now it's going to be an L2 penalty, shifted and rescaled. Let's try that.

People actually do this. There is an older, much simpler approach, sometimes called regularized training or diffusion training or something like this, where you just train your deep neural network but add a weight cost that says: at task i we want the trained weights to be close in L2 norm to the trained weights at task i minus one, the previous one, so the network does not move too fast in weight space.
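To make this recipe concrete, here is the update written out explicitly. The notation is mine, chosen to match the verbal description rather than copied from the slides: \theta are the weights, \theta_1^* and \Psi_1 are yesterday's Laplace mean and Hessian.

    p(\theta \mid y_1, y_2) = \frac{p(y_2 \mid \theta)\, p(\theta \mid y_1)}{p(y_2 \mid y_1)},
        \quad \text{assuming } p(y_1, y_2 \mid \theta) = p(y_1 \mid \theta)\, p(y_2 \mid \theta),

    -\log p(\theta \mid y_1, y_2) = -\log p(y_2 \mid \theta) - \log p(\theta \mid y_1) + \text{const.}

With yesterday's Laplace approximation p(\theta \mid y_1) \approx \mathcal{N}(\theta; \theta_1^*, \Psi_1^{-1}), this is exactly the shifted, rescaled weight cost:

    L(\theta) = -\log p(y_2 \mid \theta) + \tfrac{1}{2} (\theta - \theta_1^*)^\top \Psi_1 (\theta - \theta_1^*) + \text{const.},

and the simpler scalar regularization just described corresponds to the special case \Psi_1 = \lambda I, a plain quadratic penalty centred on yesterday's weights.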
If you do that, you get this plot here on the right. In grey in the background I've greyed out what I showed on the previous plot, because otherwise it gets too busy. In red we still have the average performance if we just keep training, and in blue we now have the average performance with this scalar regularization. What we see is that in black we again have the individual tasks, and we basically unlearn more slowly: the black curves still go down, we still forget, but we reduce the forgetting a little by telling the network to keep its weights close to the previous ones. We're also paying a price for this, which is difficult to see, but if you look closely at the top you can see that the end performance on the new data is actually lower than before, of course, because we've put the brakes on: while the network trains, we've held it back; it wants to run towards the new minimum and we keep saying no, stay a little bit close to the old problem. This is like a posterior where we approximate the Hessian by a scalar matrix, which is the crudest possible approximation, and now we've realized that this is what has been done here. It's another case of people already doing it; they just don't interpret it in a probabilistic fashion, and therefore they don't have to talk about the fact that it's a bad approximation.

Now we do something ever so slightly more precise, which is to actually use the posterior from the previous task, which involves not just the old weights but also the Hessian of the loss at the previous task. That Hessian, remember, is some kind of memory map; I think Emtiyaz Khan now calls it a memory map as well, so there are already two of us, we can start using that term. It is a representation of which parts of the weight space have already been nailed down by a previous task and which parts are still flexible, still movable, haven't been used yet. That is exactly what you would like from a normal computer, as opposed to some vague notion of how brains work: you have storage in there, some hard drive, some SSD, and there is a file directory, a kind of header on it, that says certain parts of this drive are already used, don't write there because I use them to store memory, and certain parts are not used yet, so you can write there. This is a relaxation of that kind of feature: it's non-binary, a more continuous representation of memory coverage. Large eigenvalues in Psi mean high curvature, which means the corresponding eigenvector in weight space is a linear combination of weights that is strongly constrained by the previous tasks. Those directions we do not want to move, because they are important for not changing the loss on the old tasks; that's literally what it is, the curvature of the loss, so if you move along them you change the loss a lot and misclassify the old data. Then there are other directions along which the old data doesn't say anything, which are completely unconstrained by that task. If you think of the MNIST example, you can imagine that input features that look like canonical digits, the inner oval, the zero, the eight, basically the segments of a classic segmented LCD display, should be fixed, while the pixels around them are pretty open, because if you permute the data, most of the permuted images will lie in that other part of the input, which you can now use to store more information.
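As a sketch of what this Laplace-regularized retraining can look like, here is a minimal PyTorch-style loop. It is not the code used for the plot (that builds on Agustinus Kristiadi's Laplace package); in particular it uses a diagonal approximation of yesterday's Hessian, and all names are illustrative.

    import torch

    def laplace_penalty(model, prev_params, prev_hess_diag):
        """0.5 * (theta - theta_prev)^T diag(Psi) (theta - theta_prev)."""
        penalty = 0.0
        for p, p_prev, h in zip(model.parameters(), prev_params, prev_hess_diag):
            penalty = penalty + 0.5 * (h * (p - p_prev) ** 2).sum()
        return penalty

    def train_task(model, loader, prev_params, prev_hess_diag, epochs=5, lr=1e-3):
        """Train on the new task while staying close to yesterday's posterior."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        nll = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss = nll(model(x), y)
                loss = loss + laplace_penalty(model, prev_params, prev_hess_diag)
                loss.backward()
                opt.step()
        return model

After convergence one would recompute (and accumulate) the Hessian at the new mode and store it, together with the new weights, as the prior for the next task; that step is omitted here.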
And that is exactly what happens in the experiment. The green curve shows this, and you see that it's pretty much constant. There is an annoying detail, that it starts lower here than before, and that is really a code problem: I took this from an old piece of code by Agustinus Kristiadi, whom I've cited a few times on previous slides, and he uses a different package to implement the Laplace approximation than for the other models, so they are not directly comparable. But you see that the curves are pretty much constant over time, and that the black lines in particular stay relatively flat, so the model essentially does not unlearn any of the previous tasks. And the cost of this is constant in time, so this is something you can set up once and then keep doing while your model is deployed: you collect data every day, use it as a new training task for a Laplace-regularized deep neural network, let it converge, maybe do some sanity checks that it has actually converged and nothing nasty happened, because that can still happen in deep learning, and then you store the Gaussian process object we've used so far, the trained neural network and the Laplace tangent kernel, and hand it on to the next time step. In doing so we've created a finite memory that gets moved across time but makes sure that the data is not forgotten.

And that's my answer to the question of how you should make use of uncertainty in deep neural networks: you use it to create functionality like this. Of course, sometimes the functionality really is just to show the uncertainty; for example, in some medical image classification applications, where we have collaborations with people at the university hospital, the application might really just be to say this image has this classification, disease or not disease, with this confidence. But that's actually rare; more often you want explainability, for example you want to look at these Jacobians, like we did on Monday, and say it's disease or not disease because of this particular feature, this particular part of the image, or because it's close to this training point in data space. And quite often you also want to create functionality: you want to be able to add backstops to your model to make sure it is uncertain when you're far away from the data, or, as in the final example, you want to be able to store data across time without an increase in computational cost, by avoiding retraining.

That's the end of the deep learning part of this lecture. Next Monday we will start talking about something that is a nice segue from this last example, namely how to deal with datasets that arrive continuously through time without an increase in computational complexity as we move along. At that point I'll end; please give feedback if you can, and I'll see you on Monday.