Alright, hi everyone. I'm Arvin, a PhD student in Philipp's group, and I work on partial differential equations and Gaussian processes. Today I'll tell you a bit about numerical linear algebra; this lecture commences what will be a relatively deep dive into numerical linear algebra for machine learning, and for Gaussian processes in particular. There are roughly two parts to today's lecture. First we'll take a step back, look at the bigger picture, and answer the question of why we should even care about numerical linear algebra in the first place, and we'll see some fundamental tasks in numerical linear algebra that are particularly important for machine learning. Then we'll become a bit more practical: we'll see that how you implement a mathematical operation matters quite a lot, that there are many mistakes you can make there, and we'll work through a practical example of how to implement Gaussian process regression properly.

Alright, let's get started. First of all, numerical linear algebra is absolutely central to machine learning; it's one of the most fundamental sets of tools in your machine learning toolbox, and you need it to implement these algorithms. You can probably already see that from the multitude of different areas on this slide — once you see them all together, you get a sense of how important the concept is. A quick note: I'll flash some examples at you now, and it's not important to understand all of them in detail; they're just to motivate what we'll be doing today.

Let's start with basic probability theory. Already in the probability density function of a multivariate normal distribution we see two quite important parts of today's lecture: the inverse of the covariance matrix, and its determinant. These are probably the two concepts we'll cover most today. A somewhat more practical example is the general linear model, which is essentially a linear hypothesis class of functions used in regression. If you attach a squared loss to it and solve the accompanying optimization problem in closed form, a matrix inverse pops up again — but not of just any matrix: this matrix has quite a lot of structure, in that it is an outer product of two matrices, it is reasonably large depending on the number of features you use, and it is symmetric and positive definite. We're going to talk a lot about exploiting structure to make numerical linear algebra algorithms fast and reliable, specifically in machine learning. On the slide you can see what this model produces on a data set using polynomial feature functions. Another well-known machine learning algorithm is principal component analysis, whose core is to compute a truncated eigendecomposition: a factorization of the data covariance matrix into three matrices with prescribed structure. Here we also see another very fundamental operation in numerical linear algebra and machine learning: the matrix-vector product.
Matrix-vector products seem almost too simple, but there is quite a lot of meat there to be discussed today. What PCA does is dimensionality reduction: it takes a high-dimensional data set and, in this case, finds a linear subspace — a lower-dimensional vector space — to embed the data in, which still captures a lot of the structure of the data. You can see it applied here to pictures of faces: viewing the pictures as vectors in a vector space, the images shown are the basis vectors of that lower-dimensional subspace, and the two faces are reconstructed with successively more elements from that subspace.

The main subject for today is arguably the Gaussian process model. We're going to talk about this in a lot more detail, but the quick version: it is a Bayesian nonparametric method for regression. We place a prior — a prior probability measure — over the thing we care about, which in this case is this Gaussian process; we observe some data, which are essentially evaluations of the unknown function corrupted by Gaussian noise; and then, in a Bayesian fashion, we condition this Gaussian process, this probability measure, on the observations to retrieve a posterior distribution, which in this case is also a Gaussian process with the mean and covariance given on the slide. The exact form is not important right now, but do note that there are inverses here too, and again quite a lot of structure in the matrix we're looking at. This so-called kernel Gram matrix is generated by a function: we have what you could call generative information about the matrix, a function that produces every single entry, which we can leverage to make our algorithms quicker, more memory-efficient, and more reliable. It is also symmetric and positive definite, just as in the multivariate Gaussian density case, so this is something we'll focus on quite a lot today. And even though we only show a Gaussian process here, very similar equations apply to a larger class of models called kernel methods or kernel machines; the most direct counterpart in statistical machine learning would be kernel ridge regression, if you've heard of that. So even though we talk about Gaussian processes a lot, these techniques apply to a larger class of models.

Something that has come into machine learning relatively recently, at least with larger attention, is differential equations. Here we're showing the solution of a particular partial differential equation which is linear, and linear partial differential equations can essentially be written as linear systems in infinite-dimensional vector spaces of functions. What plays the role of the matrix in that linear system is a differential operator: a linear map between vector spaces of functions. A quite common method to solve these equations is to take that linear operator equation and discretize the function spaces between which the operator maps — choose finite-dimensional subspaces and bases for them — which turns the linear operator equation into a finite-dimensional linear system.
That system is usually quite large, because essentially every node of one of the triangles in the mesh contributes an entry to the matrix, and it is again highly structured; we have generative information available because we know it comes from this infinite-dimensional linear system. This is something we can use to build specialized algorithms for this situation, and we may discuss it a bit more later, when we actually talk about differential equations.

Then, and I imagine everybody has seen variants of this, optimization also involves quite a lot of linear algebra. Essentially every iterative method for local optimization of a loss function ℓ, depending on parameters θ, has a form similar to the update on the slide, where d is a direction in which to step and M is a matrix that applies a correction to it. Think of d as, for example, a gradient, as in stochastic or plain gradient descent; M could then be, say, the inverse of the Hessian matrix, giving Newton's method, which converges a lot faster but is also a lot more expensive to evaluate per update. There is an important characteristic hidden in here for these systems in machine learning: since, specifically in deep learning, we evaluate the loss function on a batch instead of on the whole data set, there is some stochasticity, and it propagates onto both the gradient and the Hessian matrix. So in a lot of cases, linear systems and matrices in machine learning are actually noisy: we don't have access to the true matrix we want to solve with, only to a noisy estimate of it. Philipp already talked about this last lecture. It is something we'll have to deal with, although we're going to ignore it for the rest of today's lecture, but it is something to look out for. And Newton's method is just one choice here; if you instead use the inverse of the Fisher information matrix, you get natural gradient descent, and so on. So this is quite ubiquitous nowadays.

The last example is deep learning. There's already quite a lot of math on the slide, but just notice that there are matrix-vector products in the definition of a feed-forward neural network. If you compute gradients of such functions, what we call backpropagation is essentially a vector-Jacobian product computed in a very efficient way, so there is a matrix-vector product there too; similarly, the forward mode of automatic differentiation is a Jacobian-vector product. So autodiff has quite a lot to do with matrix-vector products. And if we move on to Bayesian deep learning, a very common tool for Bayesian neural networks is the so-called Laplace approximation: we optimize the loss function of the neural network until we find (or at least assume we find) a local minimum, which under some assumptions on the loss corresponds to finding the maximum a posteriori estimate, and then we approximate the posterior around that point with a Gaussian distribution. This approximation involves computing the Hessian of the loss with respect to the network's parameters at that maximum a posteriori point, and inverting it to obtain a covariance matrix.
You're going to hear a lot more about this in the lecture given by Agostinos, but you can already see that linear algebra will play a role there as well.

Alright. Now that you've seen all these methods — and again, it's not important to understand each of them — the thing to realize is that four fundamental operations appear everywhere, across a lot of different models. The first is matrix-vector multiplication, and usually we need a lot of them, so it should be reasonably efficient both in computation time and in memory; you've seen it in deep learning, in optimization (where we need the product of the inverse Hessian with the gradient), and we'll need it in kernel methods, mostly in the coming two lectures. The second — probably what you'd first think of when you hear "linear algebra" — is the solution of linear systems; we've seen that in the Gaussian density, and by extension in the Gaussian process case, in linear regression, and in optimization with Newton's method. To achieve this, and also to solve tasks like PCA, we'll need the third operation, matrix decompositions: factorizations of a matrix into matrices with some prescribed structure, for example a lower-triangular times an upper-triangular matrix — we'll look at that particular decomposition in much more detail later. There is also the singular value decomposition, which you can likewise use to solve the PCA problem, consisting of orthogonal (or unitary) matrices U and V and a diagonal matrix, and the QR decomposition, an orthogonal matrix times an upper-triangular matrix. We've seen the need for decompositions in PCA; for GP inference it will also turn out to be quite efficient to use one of them; and you'll need them for Kalman filtering, which you'll see in a couple of lectures. Finally, specifically for the Gaussian density and for GP regression, we'll need log-determinants. In the specific case of a symmetric positive definite matrix, the log of the determinant can be written as the trace of the matrix logarithm, and more generally it is sometimes interesting to estimate the trace of an arbitrary matrix function of a matrix — a matrix function being something like the matrix exponential. We'll need this specifically for GP hyperparameter optimization, which we'll also talk about today.

Alright, now that we've seen that linear algebra is clearly very important for machine learning, let's start with some practical aspects, beginning with efficient matrix-vector multiplication. I've already said that it matters quite a lot how you implement a numerical operation on a computer, and here is a small toy example to show that even with very simple tasks, where you'd think nothing could go wrong, quite a lot can go wrong. We are given a matrix A of shape n × k and a vector x ∈ ℝⁿ, where n is substantially larger than k, and the task of computing essentially the outer product of A with itself and then multiplying x by that product, i.e. A Aᵀ x. So far, so easy on the mathematical side. These sorts of products appear quite a lot — for example in parametric regression once you choose basis functions, and in PCA if you want to reconstruct your data in the lower-dimensional subspace.
So this is something you will come across in machine learning, and Python is reasonably nice for writing down mathematical expressions, so you can just write it down directly as it stands. And then we evaluate it — boom, it doesn't work. What's the problem here? It's a little hard to read the error, but does anybody have an idea? Exactly: since we didn't set any parentheses, Python has left-to-right operator precedence, all other things being equal, so it first computes the matrix A Aᵀ, which is an n × n matrix, and then multiplies x by that matrix. And that's a problem. Some back-of-the-napkin math: 10⁵ times 10⁵ entries times the byte width of the data type is about 80 gigabytes for that intermediate matrix, which is not reasonable. So what's the solution? Exactly — set the parentheses the other way, A (Aᵀ x). This simple fix turns it into a very quick implementation: you can see it executes in about 0.3 seconds, and I'm sure much of that is notebook overhead, whereas the other version was simply impossible to run. So it really matters a lot how you transform a formula, a mathematical expression, into an algorithm, and we're going to see more examples of that over the course of the lecture: the algorithm implementing a mathematical operation has a huge impact on performance, stability, and memory consumption.

Let's look at what's going on here in slightly more structural terms. A Aᵀ is what's called a low-rank matrix: by construction it has rank at most k. That goes in the direction of identifying what structure the matrix you're dealing with has. In a sense we've just learned that for low-rank matrices there is a more efficient way of multiplying a vector than forming the matrix explicitly and then multiplying with it. In terms of asymptotic notation, the naive ordering costs n² instead of nk in time and space; if n is large — think of a large regression data set, or PCA on a two-megapixel image — this will hit you quite substantially if you implement it the wrong way, and memory is what killed us here. This brings me to a bigger point: if you know something about the structure of your matrix, then generally speaking the algorithm can't know it, but you do — so you should tell the algorithm. Low-rank structure is actually an example of structure that, once you've computed A Aᵀ, is quite expensive to recover post hoc; so if you know you have a low-rank matrix, use methods specialized to low-rank matrices. That's the slightly tricky point: if you only have the complete matrix — the product of the two low-rank factors — it is quite costly to separate them again.
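To make the toy example concrete, here is a minimal sketch of the two implementations (the array sizes are illustrative assumptions, not the exact numbers from the lecture notebook):

```python
import numpy as np

n, k = 100_000, 5          # n >> k, as in the example
A = np.random.randn(n, k)
x = np.random.randn(n)

# Naive left-to-right evaluation: forms the n x n matrix A @ A.T first.
# For n = 1e5 and float64, that intermediate alone needs ~80 GB, so this
# will typically die with a MemoryError.
# y_bad = (A @ A.T) @ x

# Re-parenthesized version: only vectors of length k and n are ever formed,
# costing O(nk) time and O(n + k) extra memory.
y = A @ (A.T @ x)
```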
But if you write down your math and you know there is low-rank structure, don't just multiply the matrix out: if you already have the factors, use them. We'll see an example of how to do that a little later. And low-rank is just one type of structure; there are lots of different matrix structures, and the slide shows a very incomplete list just to illustrate what I mean. The base case is that you don't really know anything about the matrix — it's just a general matrix to you — and then, at least for matrix-vector products, you can't get away with anything cheaper than O(mn) in time and space asymptotically; the slide shows a pictographic heat map of what such a matrix's entries might look like. We've just seen low-rank matrices, and specifically the case of low-rank plus diagonal, which, unlike a purely low-rank matrix, can also be inverted (low-rank matrices are singular, so they don't have an inverse). Such a diagonal term pops up, for example, in regression with an L2 regularizer, where the regularizer weight appears as its scaling; there you get away with O(nk), where k is the rank of the low-rank term, i.e. the smaller dimension in the example we just saw. Sparse matrices scale with the number of non-zero entries, which makes sense — that is the scaling you get for matrix-vector products, and the space requirements also depend a lot on how you represent the sparse matrix — but we won't talk about sparse matrices much; just know that this is another type of structure we're talking about. Then there are kernel matrices, which we've already seen flash by in the GP regression example: they are structured in the sense that you have generative information (the kernel), and they are symmetric and positive definite. That isn't really useful for the time cost of a matrix-vector product — you don't get speedups in general — but you can get quite substantial memory savings. Think of n being a hundred thousand, as in the example we just had, with n the number of data points in your regression data set: then you store 2nd numbers instead of n², where d is the input dimension of the regressor, the function you want to learn. And for specific kernel functions you may already see that there can be even more structure in the matrix than it just being a kernel matrix. By the way, how do you actually store a kernel matrix such that you get this space complexity? Does anybody know? Storing only one triangle would still be n²/2 numbers, so n² after all. Exactly: you store the data points, and you know the kernel function that generates the entries of the matrix, so you evaluate the entries on the fly while you compute the matrix-vector product. Memory-wise that is obviously a lot cheaper — unless your input dimension d is much larger than the number of data points, in which case it no longer helps.
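Here is a minimal sketch of that idea — computing K v in row blocks so that only a small slice of the kernel matrix ever exists in memory. The RBF kernel and the block size are my own illustrative choices; dedicated libraries (for instance KeOps or GPyTorch) do this far more efficiently, including on GPUs.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """RBF kernel block k(X1, X2) of shape (len(X1), len(X2))."""
    sq = ((X1**2).sum(-1)[:, None] + (X2**2).sum(-1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-0.5 * sq / lengthscale**2)

def kernel_matvec(X, v, block=1024):
    """Compute K @ v with K_ij = k(x_i, x_j), never storing all of K."""
    n = X.shape[0]
    out = np.empty(n)
    for start in range(0, n, block):
        rows = rbf(X[start:start + block], X)     # only a (block x n) slice
        out[start:start + block] = rows @ v
    return out

X = np.random.randn(50_000, 2)    # n = 50,000 points in d = 2 dimensions
v = np.random.randn(50_000)
Kv = kernel_matvec(X, v)          # O(n) extra memory instead of O(n^2)
```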
There are actually libraries that do exactly this, and realize it very efficiently, using GPUs and writing very efficient CUDA kernels for this sort of computation, which enables you to run Gaussian processes on essentially hundreds of thousands of data points on a laptop. That still takes quite a lot of compute time in the general case, but at least you don't get memory overflows anymore. Finally, there are matrices with a more general functional form — think of autodiff graphs, essentially the Jacobian chain rule you apply when you backpropagate through a neural network. In general these can also be relatively unstructured, and you need the same time and space as in the general case, but with specific assumptions on the compute graph you may be able to compute these matrix-vector products a lot more quickly.

Alright, now to a concrete setting. We're going to start a line of methods here that apply, or usually apply, to Gaussian process regression, also in the next two lectures, so we'll focus on that for a bit; just recall that this also applies to other kinds of methods — notably kernel ridge regression, which you essentially get if you compute just the posterior mean of the Gaussian process — so you can transfer what you learn here to other algorithms. In a little more detail: our goal is to learn an unknown function f* from a d-dimensional input space, here over the reals — although it doesn't really matter what the input space is, as long as you have a kernel on it — to a scalar output (this can be extended, but let's keep it simple for now). Since this is a Bayesian regression method, we place a prior on the thing we don't know, the unknown function: a probability measure over functions, with a mean function μ and a kernel function, which is essentially the covariance of that Gaussian measure between the values of the unknown function at two points. Then we have a likelihood, which tells us how we obtained our data from the unknown function, how we observe it: in this case we simply add homoscedastic Gaussian noise to evaluations of the function, and we observe a number of these noisy function values y. If we apply Bayes' rule to this combination of prior and likelihood, we get a posterior measure — a posterior process — over the function, given that we observed these data; its moments, the posterior mean function and the posterior covariance function, are shown on the slide.
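For reference, the standard form of these posterior moments — written with kernel Gram matrix K = k(X, X), noise variance σ², and prior mean μ; this is the usual textbook expression, which may differ cosmetically from the slide — is:

```latex
\begin{aligned}
\mu_{\mathrm{post}}(x) &= \mu(x) + k(x, X)\,\bigl(k(X,X) + \sigma^2 I\bigr)^{-1}\,\bigl(y - \mu(X)\bigr),\\
k_{\mathrm{post}}(x, x') &= k(x, x') - k(x, X)\,\bigl(k(X,X) + \sigma^2 I\bigr)^{-1}\,k(X, x').
\end{aligned}
```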
This is roughly what it looks like: you get a mean estimate — under some assumptions essentially the same one you get from kernel ridge regression — and, plotted around it, the marginal credible interval, essentially twice the marginal standard deviation of the process, which quantifies how uncertain the regression estimate is about the unknown function you're looking for. In this case it works reasonably well: the true function is plotted as the dashed gray line, and since we take twice the standard deviation — roughly a 95% credible interval — the process does cover the true unknown function within that band. So the uncertainty quantification works really well here.

But there is a problem: we need to compute this inverse, and as I think we already discussed last lecture, that scales quite prohibitively. Can somebody recap what it costs to do this naively — you compute a kernel Gram matrix from n data points, with n quite large, and solve with it? O(n³), exactly. It's cubically expensive, and that's prohibitive; you already saw that even n² can be difficult for large data sets.

Another thing we need for Gaussian processes — I've somewhat hidden it here — is that the kernel usually has parameters: an output scale, which essentially controls how wide this credible band is; a length scale, which says how quickly the prior assumes the function to vary, i.e. how high the frequencies are; and so on. These are hyperparameters of the algorithm, so we need to do some sort of hyperparameter tuning, or solve a model-selection problem. The Bayesian way to do this is to compute what's called the log marginal likelihood: the probability of the observed, noise-corrupted data y given the hyperparameters we're looking for, and we maximize it, because we want our prior to explain the observed data set well. If we try to compute this, we can factorize the marginal likelihood via the joint distribution of the noisy observations and the latent function values, and then integrate over the function values, which gives exactly that term — don't worry if you didn't follow every step of the derivation. After some algebra you end up with the function on the slide, which is essentially a loss function we now need to maximize. There are two terms in it. One describes the model fit — how well the model actually approximates the data; it comes from the exponent of the Gaussian distribution, and if you look at it, it is just that quadratic term. The other comes from the normalization factor of the Gaussian: since it is a determinant, it measures a volume, and what it effectively measures here is the complexity of your prior — how many functions the hypothesis class posited by that Gaussian prior contains. These two terms are traded off against each other.
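Written out in its standard form — assuming a zero prior mean for brevity; this is the usual textbook expression rather than a verbatim copy of the slide — the log marginal likelihood is

```latex
\log p(y \mid X, \theta)
  = \underbrace{-\tfrac{1}{2}\, y^\top \bigl(K_\theta + \sigma^2 I\bigr)^{-1} y}_{\text{model fit}}
    \;\underbrace{-\tfrac{1}{2}\, \log\det\bigl(K_\theta + \sigma^2 I\bigr)}_{\text{complexity penalty}}
    \;-\; \tfrac{n}{2}\log 2\pi .
```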
To show an example: on one slide we have quite severe underfitting, over-generalizing, and you can see what these terms do. The model fits the data quite poorly — the red line is what fitting the data well would look like, and this model does not fit it — so the model-fit term is bad; but the complexity term is small, because the prior only contains very simple functions, so relative to the fit term it contributes little. That's one direction. The other slide shows quite substantial overfitting: there the model fit is, I'd guess, essentially perfect — it has all these bumps so it can go straight through every measurement — but the model behind it has a lot of variation, a lot of degrees of freedom, so it's quite complex; that complexity term is large, and its negative pulls the objective down. You need to find the trade-off between these two, and this is where the log-determinant shows up again.

So, to do Gaussian process regression we need the solution of a linear system — in fact not just one solve but m + 1 solves, where m is the number of points at which we want to evaluate the regressor afterwards. Do you see where that comes from? One solve is for the vector of observations: the Gram matrix times some vector, call it w, should equal y. And if we evaluate the posterior process at m points, then we need to evaluate the posterior covariance for those m points, which means m right-hand sides — for one test input x it's just one linear system, but for m of these points we have to solve it m times.

Now, I've been preaching that structure is very important, so what structure do we have here? It actually says it on the slide: the matrix is symmetric and positive definite in the most general case; there might be additional structure, which we should also leverage, because the system is huge and for extremely large data sets we usually can't solve it just like that; and there is generative information we can use to push down the memory consumption a little. So how do we do this — do you know any algorithms that might apply here? We already heard one last time: I believe somebody said Cholesky, and that's the way we'll go. But since Cholesky is really a special case of another algorithm, let's first talk about the more general class of matrix decompositions, and about one very important decomposition in particular, the so-called LU decomposition.

First, pretend that the system we're trying to solve has a lower-triangular matrix: only the entries on and below the main diagonal of the (square) matrix are non-zero. You probably already know an algorithm from school that handles this case, maybe not under this name: it's called forward substitution, and we're going to introduce it in a divide-and-conquer fashion.
So pretend you already know how to solve lower-triangular systems of size n − 1; divide and conquer means we'll show how to use that knowledge to solve an n × n system. Given a lower-triangular system, we decompose the matrix into four parts: the scalar λ in the lower-right corner, i.e. the last entry of the main diagonal; the column above it, which is zero because the matrix is lower triangular; the row vector l to its left; and the (n − 1) × (n − 1) lower-triangular block L' above and to the left of those. We decompose the vectors the same way: y, the solution of our system, and b, the right-hand side, each get their last entry split off, say γ and β. We want L y = b, so multiplying out the block decomposition gives, for the first n − 1 rows, L' y' + 0 · γ = b', i.e. L' y' = b'. We assumed we can already solve lower-triangular systems of dimension n − 1 — we have some magical algorithm for that, for now — and this block has dimension n − 1 because we split off the last row, so we can just solve it and figure out y'. Then we look at the last row of the system: lᵀ y' + λ γ = β. We now know y', and we know l, λ and β just from the decomposition, so the only unknown is γ, and rearranging gives γ = (β − lᵀ y') / λ. So, under the assumption that we can solve the smaller system, we can solve the full one — and we obtain the smaller solve by applying the same decomposition again, all the way down to the scalar case, where a 1 × 1 matrix is trivially lower triangular and we just divide; that is how you solve a one-dimensional linear system. Then we go back up, inserting each result into the expression above, and we have solved the system recursively by divide and conquer. This is one of those situations where you need to understand recursion to understand recursion, so maybe go over it again at home if it didn't click just now; it takes some getting used to.

How expensive is it? It's n²: at every step of the recursion we evaluate the formula for γ, whose dominating part is the inner product lᵀ y', which at the largest step needs n − 1 multiplications and n − 2 additions; adding these up over the steps of the recursion with the usual summation formula gives n² to leading order.
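A minimal sketch of this recursion, written as the equivalent iterative loop (my own implementation for illustration; in practice you would call scipy.linalg.solve_triangular):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L in O(n^2) operations."""
    n = L.shape[0]
    y = np.zeros(n)
    for i in range(n):
        # gamma_i = (beta_i - l_i^T y_{<i}) / lambda_i
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 1.0, 5.0]])
b = np.array([2.0, 5.0, 14.0])
y = forward_substitution(L, b)
assert np.allclose(L @ y, b)
```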
This "to leading order" notation means we forget everything of lower order than n²: if there is a linear term in there, we drop it, because it doesn't matter once n grows large. It is a little different from O-notation in that we do keep track of the constant on the leading term; that's the difference. You can do exactly the same thing when the matrix is upper triangular — you just work in the opposite direction, splitting off the first entry instead of the last and doing the same recursion. These two techniques are called forward and backward substitution, respectively. Notice also that n² is already cheaper than inverting a full matrix, which, as we heard, is in general O(n³).

But we don't have a lower-triangular system; that's the caveat. How do we get one? This is exactly where the LU decomposition comes in: we split the matrix into a product of a lower-triangular matrix and an upper-triangular matrix, which is possible for square matrices as long as they are invertible. The construction is very similar — again divide and conquer, a recursive definition. Assume we can already decompose an (n − 1) × (n − 1) matrix in this way; we have a magical algorithm for that. How do we decompose an n × n matrix, call it A₁, or A_i at a general step of the recursion, into a lower- and an upper-triangular factor? We split off a scalar α_i in the upper-left corner, a row vector to its right, a column vector b_i below it, and the remaining (n − 1) × (n − 1) block B_i. We do the same to what will become the lower- and upper-triangular factors. For the lower-triangular factor L_i: because it is lower triangular, the row vector to the right of its top-left entry is zero; we choose its diagonal entries to be one — we're free to do that, as we'll see in a moment — and below the top-left entry there is, for now, an arbitrary column vector l_i, with the remaining block L_{i+1} below and to the right. For the upper-triangular factor U_i, we take as its first row exactly the first row of A_i — we just copy it over, so α_i and the row vector u_iᵀ — and below that there is a zero column and the remaining upper-triangular block U_{i+1}, which will not simply equal B_i, because it has to be upper triangular. Now we want the decomposition A_i = L_i U_i to hold, so let's multiply it out. For the first row of the product we get one times the first row of U_i, and the rest is zeroed out, so the first row of L_i U_i is by construction the first row of A_i — good, that's what it needs to be. We just need to pay attention to the lower rows.
For the column vector l_i we find that α_i l_i must equal b_i for the matrix product to match, so rearranging, we need to choose l_i = b_i / α_i. For the lower-right block we get the outer product l_i u_iᵀ plus L_{i+1} U_{i+1}; we already know l_i and u_i, and the whole thing must equal B_i, so rearranging gives L_{i+1} U_{i+1} = B_i − l_i u_iᵀ. Now, we assumed we can already decompose an (n − 1) × (n − 1) matrix into lower- and upper-triangular factors — this is the recursion step — so if we factorize this remainder with our magical algorithm, we are done. And the magical algorithm is just this same routine, applied all the way down to the scalar case, where, since we set the diagonal entries of L to one, it is trivial: if A is 1 × 1, we take L = 1 and U equal to the single entry of A. This is what's called the LU decomposition.

Can you guess, or reason about, how much it costs to compute? It's n³ — and for the homework it will be important to see what the constant in front of that is, because there are some gains to be had there. Oh, and I actually forgot something: clearly something can go wrong here, because we divide by α_i and there is no constraint that keeps it from being zero. This α_i element is usually called a pivot, and in the case where it is zero we can simply swap rows of the matrix until we find a non-zero element to divide by. This is called partial pivoting in numerics speak, so if you read that somewhere, that's what it means. And as a heuristic for making the algorithm stable — robust to noise — you usually choose, among the remaining rows, the entry with the largest absolute value, so that you don't get huge error amplification from dividing by a small noisy value that then blows everything up. With that, you can actually show — and this construction is essentially a constructive proof — that the decomposition always exists if the matrix is invertible, which is nice.

For the computation time, we again look at the dominating part per step. The division l_i = b_i / α_i is only an O(n) operation, so the expensive part is the update B_i − l_i u_iᵀ; at the first step this block has size (n − 1) × (n − 1), and you need (n − 1)² multiplications for the outer product plus (n − 1)² subtractions, because that is how many numbers you have to update. Summing 2(n − 1)², 2(n − 2)², and so on over the recursion, using the summation formula for squares (with a little index shift), you arrive at ⅔ n³ to leading order. With some structure in the matrix you can push that down quite a lot, and you'll see by how much in the exercises.
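Here is a minimal sketch of this construction as an iterative loop with partial pivoting (my own illustrative implementation; library routines such as scipy.linalg.lu_factor are what you'd use in practice):

```python
import numpy as np

def lu_partial_pivoting(A):
    """Return permutation p and factors L (unit lower-triangular), U
    (upper-triangular) with A[p] = L @ U.  Costs roughly (2/3) n^3 flops."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    p = np.arange(n)
    L = np.eye(n)
    for i in range(n - 1):
        # Partial pivoting: bring the largest remaining |entry| of column i to the top.
        j = i + np.argmax(np.abs(A[i:, i]))
        A[[i, j]], p[[i, j]] = A[[j, i]], p[[j, i]]
        L[[i, j], :i] = L[[j, i], :i]
        # l_i = b_i / alpha_i
        L[i+1:, i] = A[i+1:, i] / A[i, i]
        # Schur-complement update: B_i - l_i u_i^T
        A[i+1:, i:] -= np.outer(L[i+1:, i], A[i, i:])
    U = np.triu(A)
    return p, L, U

A = np.array([[1., 2., 2.], [4., 4., 2.], [4., 6., 4.]])
p, L, U = lu_partial_pivoting(A)
assert np.allclose(A[p], L @ U)
```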
So that's the LU decomposition. Maybe some of you noticed that the strategy is essentially to subtract multiples of the row u_iᵀ from all rows below it to produce the upper-triangular factor — and that is exactly what some of you may know as Gaussian elimination. So this is, in a sense, an efficient and, as we'll see, clever way of organizing Gaussian elimination to solve linear systems. Since we chose the diagonal entries of L to be one, we don't actually need to store them — we know they're one — so we can save some memory, and this is how you implement this kind of algorithm in practice: you compute the LU factorization in place, inside the array in which you hand A to the algorithm, because at each step you only touch the trailing block of the matrix, and L and U together fit exactly into one matrix.

Let me show you with an example how that looks. I'll keep it small so we can actually do it in this lecture: the matrix with rows (1, 2, 2), (4, 4, 2), (4, 6, 4). Our goal is to end up with a matrix whose three strictly lower-triangular entries contain the non-trivial part of L — L without the ones on the diagonal, which are there implicitly — and whose upper-triangular part contains U, keeping everything in the same memory buffer. First step: we keep the first row, because it will also be the first row of U, so (1, 2, 2) stays as it is. Next we compute l₁, which is just the column b₁ below the pivot — here (4, 4) — divided by the pivot α₁ = 1, so it stays (4, 4) as well. Now we've computed almost every quantity at this stage of the recursion; we just need to compute B₁ − l₁ u₁ᵀ and then apply the same algorithm to that. Notice that we don't need those entries of A anymore once we've computed the update, and we have exactly the right amount of space to write it in, so we compute it in place: in the second row the trailing entries become 4 − 4·2 = −4 and 2 − 4·2 = −6, and in the third row 6 − 4·2 = −2 and 4 − 4·2 = −4. That completes the i = 1 step of the algorithm; the next step is to recurse, i.e. apply the same algorithm to the remaining 2 × 2 block.
With that we already terminate, because we reach the scalar case. We keep the first row of the remaining block — we don't touch that part of the buffer in this step — so (−4, −6) becomes the next row of U, with α₂ = −4 as the pivot and −6 as the remaining entry of u₂. We divide the entry below the pivot by the pivot, −2 / −4 = 0.5, and the last remaining entry becomes b₂ − l₂ u₂ᵀ = −4 − 0.5 · (−6) = −4 + 3 = −1. Now the upper-triangular entries of the buffer hold the U we needed to compute, and the three entries below the diagonal hold all of L except its implicit unit diagonal. This is how the LU decomposition is computed, and it is also what you get if you call lu_factor in SciPy. I've left out the pivoting here; we didn't actually need it in this example, but it would be implemented by swapping rows when required, and it amounts to the decomposition acquiring an additional matrix P to the left of L and U — a permutation matrix that represents those row swaps, because swapping rows is a linear operation, so you can write it as a matrix.

Alright, so now we know this decomposition — what do we do with it? We know how to solve triangular linear systems, so since A = L U by construction, instead of A x = b we write L U x = b, which is really two linear systems: first we solve L y = b for some y, where y stands for U x, and then we solve the second system U x = y. Luckily both of these are base cases we already know — forward and backward substitution, respectively — so we can compute them. And what's nice is that once we've computed the LU decomposition of a particular matrix A, we can very efficiently solve multiple linear systems with multiple right-hand sides for that same A: we only pay the ⅔ n³ once to compute the decomposition, and every additional linear system then costs just 2n², one forward and one backward substitution. If you're wondering about the permutation matrix P: the inverse of a permutation matrix is just its transpose — it's orthogonal — and applying it costs linear time because it is extremely sparse, so it doesn't change anything here. So for k solves with the same system matrix A — which is exactly the situation in Gaussian process regression, where we have the same Gram matrix and solve m + 1 systems with it — we pay cubically once and then 2kn² for the k solves. That's a very efficient way of solving multiple linear systems with the same matrix; you could say we are amortizing the cost.
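A minimal sketch of this amortization with SciPy, using the 3 × 3 matrix from the worked example (the right-hand sides are made up for illustration):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[1., 2., 2.],
              [4., 4., 2.],
              [4., 6., 4.]])

# Pay the O(2/3 n^3) factorization cost once.
lu, piv = lu_factor(A)     # L and U packed into one array, plus pivot indices

# Every additional solve with the same A now costs only O(2 n^2).
b1 = np.array([1., 0., 0.])
b2 = np.array([0., 1., 2.])
x1 = lu_solve((lu, piv), b1)
x2 = lu_solve((lu, piv), b2)

assert np.allclose(A @ x1, b1) and np.allclose(A @ x2, b2)
```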
There is essentially one task left: computing the log-determinant. Does anybody know how much it costs to compute the determinant of a matrix directly, without knowing any structure? Exactly — it's also cubically expensive, as much as solving the system, because the naive way of doing it is Gaussian elimination too. But it turns out you can also get it from an LU decomposition: det(A) = det(LU), determinants distribute over products, so it's the product of the two determinants, and the determinant of a triangular matrix — lower or upper, it holds for both — is just the product of its diagonal entries; you can check that via Laplace expansion, or again via Gaussian elimination. So given an LU decomposition, we can compute the determinant in linear time. In other words, we only need to compute one expensive, cubically-priced object, and we can then solve all the other problems from it: it's a very efficient way of representing a matrix so that you can do all sorts of interesting linear algebra tasks with it. Any questions so far?

Now, I've been hammering on the point that we need to use structure for our algorithms to be efficient, and there is structure we haven't used yet: our kernel Gram matrix is symmetric and positive definite. There is a specialized version of the LU decomposition, called the Cholesky decomposition, which uses exactly that fact, and uses it to make the algorithm faster again — how much faster, you'll see in the exercises. The construction is essentially the same: instead of putting ones on the diagonal of L, we use the same value in both factors, which, since the product has to reproduce the diagonal, means a square root appears there; and because the matrix is symmetric we use the same factor on the left and on the right, A = L Lᵀ, encoding the symmetry in the decomposition itself. Other than that you get essentially the same formulas. The fact that you can always take this square root, and that you don't need any pivoting, comes from strict positive definiteness: the inner products with the unit vectors are by definition strictly positive, and those are exactly the entries in question, so the algorithm always goes through. And by giving the algorithm you have essentially also given a proof that the Cholesky decomposition exists and is unique for every strictly positive definite matrix A. This is of course what we should use — as already teased, it's faster, and by how much you'll see — and it is the go-to tool for solving GP regression, at least at small scales; it inherits all the nice properties of LU, and some of them in an even nicer form.
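A minimal sketch of that construction, mirroring the LU sketch above (again my own illustrative implementation; in practice use numpy.linalg.cholesky or scipy.linalg.cho_factor):

```python
import numpy as np

def cholesky(A):
    """Lower-triangular L with A = L @ L.T for symmetric positive definite A.
    Same recursion as LU, but a single factor, with square roots on the
    diagonal instead of the unit-diagonal convention -- and no pivoting needed."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for i in range(n):
        pivot = np.sqrt(A[i, i])             # strictly positive for s.p.d. A
        L[i, i] = pivot
        L[i+1:, i] = A[i+1:, i] / pivot
        # Schur-complement update of the trailing block
        A[i+1:, i+1:] -= np.outer(L[i+1:, i], L[i+1:, i])
    return L

A = np.array([[4., 2.], [2., 3.]])
L = cholesky(A)
assert np.allclose(L @ L.T, A)
```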
So, here are the posterior moments I've been flashing at you for quite a while; now we're actually going to compute them. As I said, we use the Cholesky decomposition: we compute the Cholesky factorization of the kernel Gram matrix — the kernel matrix on the slide looks like this, and its Cholesky factorization like that; it is of course the same factor, once transposed. Then we substitute L Lᵀ in place of the Gram matrix, and since matrices flip order when you pull an inverse into a product, (L Lᵀ)⁻¹ = L⁻ᵀ L⁻¹. That means that to obtain the posterior mean you solve one system by forward substitution — L is lower triangular — and one by backward substitution, because Lᵀ is upper triangular; and you do essentially the same thing for the covariance. There is something to note about these Cholesky factors, by the way: how would you describe the structure you can see in the Cholesky factor of this covariance matrix? It's a bit difficult to judge because I didn't put a color scale on the plot, but it looks as though the first few columns already approximate the matrix quite well — in other words, there may be a good low-rank approximation hiding in there, and I'll show you later in what sense that is actually true. Any questions? So that's what you get from it — I already showed you that plot earlier, and now you can see the result of our work. And specifically for the log-determinant, which you need inside an optimization loop to find the optimal hyperparameters of the model (so you need it quite often), you can use the formula I gave you earlier: the log-determinant is the log of the product over the diagonals of the two factors, you pull the logarithm into the product, and since it's the same factor twice you only need to sum the logs of one diagonal and double it.
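Putting the pieces together, here is a minimal end-to-end sketch of exact GP regression and the log marginal likelihood via a Cholesky factorization (zero prior mean and an RBF kernel assumed for brevity; cho_factor/cho_solve wrap exactly the forward and backward substitutions just described):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf(X1, X2, lengthscale=1.0, outputscale=1.0):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return outputscale * np.exp(-0.5 * sq / lengthscale**2)

def gp_fit_predict(X, y, Xtest, noise=1e-2):
    n = X.shape[0]
    K = rbf(X, X) + noise * np.eye(n)
    L, lower = cho_factor(K, lower=True)          # one O(n^3) factorization

    alpha = cho_solve((L, lower), y)              # (K + sigma^2 I)^{-1} y
    Kxs = rbf(Xtest, X)
    mean = Kxs @ alpha                            # posterior mean
    V = cho_solve((L, lower), Kxs.T)              # m extra solves, O(m n^2)
    cov = rbf(Xtest, Xtest) - Kxs @ V             # posterior covariance

    # Log marginal likelihood: model fit + complexity penalty + constant.
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)
    return mean, cov, lml

X = np.random.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(30)
Xtest = np.linspace(-3, 3, 100)[:, None]
mean, cov, lml = gp_fit_predict(X, y, Xtest)
```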
We still have a problem, though. We already saw on the earlier slides that if our data set has 100,000 points, the kernel matrix is 100,000 × 100,000, and that is not going to fit in memory. So what do we do? It doesn't mean we can't use Gaussian processes in these cases; we just need to take more care in how we use linear algebra to solve these problems. I'm going to show you two different — not necessarily contrary — approaches, and Jonathan will continue in the next two lectures with much more detailed descriptions of methods you can actually use here.

The first is structure: we can use structure in the kernels we employ to invert the system much more quickly, and this usually boils down to choosing specific kernels that are amenable to linear algebra. One example is simply using a parametric prior. Instead of a Gaussian process — which in a sense has infinitely many features, or as many features as you have training data points — you choose a finite number p of features, with p reasonably small compared to the data set, and use the corresponding kernels. What this amounts to is placing a multivariate Gaussian prior on some weights — essentially coordinates in a finite-dimensional function space — which combine the different feature functions linearly. Applying the rules of linear Gaussian inference to this construction, you get the mean and kernel function shown on the slide; this is the GP analogue of linear regression, in the same way the general GP corresponds to kernel ridge regression. You obtain a posterior of the same form as before, but now the kernel Gram matrix looks like Φ_X Σ Φ_Xᵀ plus the noise term. What structure does that have? Remember that p is a lot smaller than n, so Σ is p × p and the training-set feature matrix Φ_X is n × p: the product is a tall matrix times a small square matrix times a wide matrix. Exactly — it's low-rank, plus a diagonal. So here is a real-world example of where low-rank-plus-diagonal matrices appear in a machine learning algorithm. We only know so far how to multiply with such a matrix efficiently; we still need to figure out how to invert it efficiently.

Before we get to that, one possible choice of features: here I generated a data set from a sinusoid with a specific frequency, corrupted by some Gaussian noise, and if I know there is a sinusoidal or Fourier structure in my data, I can just use a finite Fourier basis — I believe this uses roughly the first ten Fourier features, sines and cosines of increasing frequency. You get a reasonably good fit from this finite model precisely because you know something about your data set, about its generative structure: if you know the data come from a periodic latent function, you can choose kernels that reflect that knowledge on the modelling side and are at the same time extremely efficient computationally. There is a fusion of modelling and computation here, which already hints at one of the core messages of these lectures.

Alright, so how do we actually invert that system? It's still an n × n system, so if you just threw LU at it, it would be n³ and we would gain nothing. But as I said: if you have the factors, don't just multiply them together and hand the result to a generic solver that knows nothing about the structure. A tool you can use here is the so-called matrix inversion lemma. I realize it's quite a bit of math, but hang on — it's reasonably simple to understand what's going on. The lemma says that if you take a matrix A and perturb it by a low-rank matrix — a disturbance confined to a small number of directions — then the inverse of the perturbed matrix is the inverse of A minus a correction of the same low-rank form. The key thing to notice is that the only new matrix that actually has to be inverted is p × p (if C, which plays the role of Σ here, is p × p), and p is a lot smaller than n. Once we have computed Vᵀ A⁻¹ U, which is an O(np²) operation — you can check that — and assuming we know how to invert A efficiently (inverting a diagonal matrix is trivial: it's the same diagonal matrix with each entry inverted), everything else is cheap.
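A minimal sketch of a Woodbury-style solve for the low-rank-plus-diagonal case (my own illustration; in the notation of the lemma, A = diag(d), U = V = Φ and C = Σ):

```python
import numpy as np

def woodbury_solve(d, U, C, b):
    """Solve (diag(d) + U C U^T) x = b in O(n p^2 + p^3) instead of O(n^3).

    Uses (A + U C U^T)^{-1} = A^{-1} - A^{-1} U (C^{-1} + U^T A^{-1} U)^{-1} U^T A^{-1}
    with A = diag(d), whose inverse is applied entrywise.
    """
    Ainv_b = b / d                                 # A^{-1} b
    Ainv_U = U / d[:, None]                        # A^{-1} U, still only n x p
    inner = np.linalg.inv(C) + U.T @ Ainv_U        # the p x p matrix to invert
    return Ainv_b - Ainv_U @ np.linalg.solve(inner, U.T @ Ainv_b)

n, p = 2_000, 10
U = np.random.randn(n, p)          # feature matrix Phi
C = np.eye(p)                      # prior weight covariance Sigma
d = 0.1 * np.ones(n)               # diagonal noise term sigma^2
b = np.random.randn(n)

x = woodbury_solve(d, U, C, b)
K = np.diag(d) + U @ C @ U.T       # dense version, formed only to check
assert np.allclose(K @ x, b)
```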
So we can evaluate this whole expression very efficiently, in this case in O(np^2): because p is a lot smaller than n, we would have np^2 plus p^3, and the p^3 term just vanishes into the cost of computing the inner matrix, which makes this feasible. What's quite nice about this is that we also never need to form the whole n by n matrix in memory, so we don't have the same memory cost to invert the system; we don't even need to form the matrix.

This is just one way of using structure. There are many different approaches, many specific kernels which lead to kernel Gram matrices that are very amenable to linear algebra operations. We're actually going to hear about a particular one later in this lecture, which is Kalman filtering: for very specific kernels you can express Gaussian inference as Kalman filtering and compute all of this not just in O(np^2) but in time linear in the size of the data set. Jonatan is going to tell you a lot more about this later. There are all sorts of other linear algebra tricks, so if you're interested, just ask me, but I'm not going to go over them right now.

Ah, and actually we also need to compute the determinant, right, in order to do hyperparameter optimization. But there is an analog of the matrix inversion lemma for determinants, called the matrix determinant lemma, which has essentially the same cost, so we're good there too: the determinant of a diagonal matrix is trivial, this term is again O(np^2) to compute, and the remaining pieces are p^3, so it's the same runtime. Regarding names, by the way: maybe you've also heard of the Sherman-Morrison-Woodbury formula; that is exactly the same thing.

The other general approach to scaling GP regression to larger data sets is what I call here approximate inversion; you're going to hear the proper term in the next lectures, but for now this should suffice. This goes back to the point we made earlier, that maybe the Cholesky factors are already essentially a low-rank approximation of the matrix. You can write the Cholesky algorithm down as an iterative algorithm, which I've done here, and you can see it involves essentially the same objects as before: there is what was previously B_i minus l_i u_i^T in the LU decomposition. An interesting thing to note is that, in a way, you collect information about your system matrix A by multiplying it from the right with a normalized unit vector, and since we iterate over the system from 1 to n, what Cholesky does is process the data set represented in the kernel matrix in an iterative fashion, in the given order, the order in which you hand it to the algorithm. And since we add rows to the Cholesky factor, you've already seen that: when we did LU, we first added the first column to the L factor, and in the next iteration the next ones, so we iteratively construct a low-rank approximation to these two factors, which in this case are the same factor. We actually get access to this low-rank approximation of the matrix on the fly, so we don't need to compute the whole Cholesky factorization; we can just stop the algorithm early and we already have a low-rank approximation of the matrix.
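A minimal sketch of this early-stopped Cholesky idea, under the assumption that A is a dense symmetric positive definite kernel matrix held in memory; the function name and the truncation rank k are illustrative, not from the lecture:

```python
import numpy as np

def partial_cholesky(A, k):
    """Run only k iterations of the outer-product form of the Cholesky
    decomposition and stop early.  Returns L of shape (n, k) such that
    L @ L.T is a rank-k approximation of A (exact on the first k rows/columns)."""
    A = np.array(A, dtype=float)                  # work on a copy
    n = A.shape[0]
    L = np.zeros((n, k))
    for i in range(k):
        L[i:, i] = A[i:, i] / np.sqrt(A[i, i])    # i-th column of the factor
        A[i:, i:] -= np.outer(L[i:, i], L[i:, i]) # downdate: remove what is now explained
    return L
```

Running it with k = n recovers the full factor; stopping earlier gives exactly the kind of truncated factors we look at next.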
How good is it? That depends, right, it depends on the kernel matrix. But we can use this general scheme to stop the algorithm, and I've actually done that, so I can show you what the approximate matrices look like. If we take a dense kernel matrix, with some structure that I've given it here, and terminate the Cholesky decomposition after one iteration, so this is just an outer product of two vectors, essentially the first column l right here, then we get this, and it's quite a poor approximation, I would say. Now if we add a few more vectors, so that we have a rank-four approximation, it gets a little better, and this blob up here extends a little further down, because that's how we process the data points: the rows here correspond to the data points, and we approximate the data set better data point by data point. Adding another five gets a little better again, and it just creeps all the way down until, this is actually a 30 by 30 matrix, we would reach the true Cholesky factor at L_30.

So arguably what you could call the convergence of this approximation is relatively poor, but that is because we process the data points in a linear order. Given the length scale of the kernel, one data point already contains quite a lot of information about closely adjacent data points, so the information gain from processing the next data point in the next iteration is relatively low, and we never explore the rest of the space in which we're trying to learn a function. What you can do instead is choose the order in which Cholesky deals with the data points, the order in which it approximates the kernel matrix. I've done that here: I just multiplied the Gram matrix from both sides with a permutation matrix, which is what's called full pivoting, so instead of only swapping rows you swap rows and columns simultaneously, and the result is called a pivoted Cholesky decomposition, essentially the analog of the pivoted LU decomposition we discussed earlier. If you then compute this low-rank approximation, after only four terms you get this, and it looks close to perfect, right, it's very close to the full kernel matrix. So here we can actually see that there arguably is some low-rank structure in the matrix, which still needs to be corrected a little for the matrix to be full rank, which we need if we actually want to invert it. But the point is that maybe we don't need all those Cholesky iterations if we choose the order of the data points in a clever way.

You could say that this ordering is something like an active learning problem, because after one iteration of Cholesky you already know what your predictor is for the data points, and you know how much of the matrix you have explored, so from that information you can, in an online fashion, select which data point to observe, that is, which one to process in the next Cholesky iteration. And these permutation matrices P can also be constructed in an online fashion, so this is actually possible to realize in code.
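A minimal sketch of how such a pivoted, early-stopped Cholesky might look, exploiting the generative structure: the matrix entries are produced by a kernel function, so only the diagonal and the k selected rows are ever evaluated and the full n by n matrix is never formed. The greedy rule of picking the largest residual diagonal entry, and the names kernel_fn, X, and k, are illustrative assumptions rather than the lecture's exact scheme:

```python
import numpy as np

def pivoted_partial_cholesky(kernel_fn, X, k):
    """Greedily pivoted, early-stopped Cholesky of K[i, j] = kernel_fn(X[i], X[j]).
    Returns L of shape (n, k) with L @ L.T ~= K, plus the chosen pivot order."""
    n = len(X)
    d = np.array([kernel_fn(x, x) for x in X])      # residual diagonal of K
    L = np.zeros((n, k))
    pivots = []
    for i in range(k):
        j = int(np.argmax(d))                       # pivot: largest remaining residual variance
        pivots.append(j)
        row = np.array([kernel_fn(X[j], x) for x in X])        # j-th row of K, generated on demand
        L[:, i] = (row - L[:, :i] @ L[j, :i]) / np.sqrt(d[j])  # new factor column
        d -= L[:, i] ** 2                           # downdate the residual diagonal
        d[j] = 0.0                                  # pivot j is now represented exactly
    return L, pivots
```

Stopping after k much smaller than n steps costs O(n k^2) time and O(n k) memory, and picking the pivot adaptively is exactly the kind of online, active-learning-flavoured ordering just described.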
There was a question: you're reordering them, so if you ran the algorithm to completion, you would still deal with those rows, just later? Yes, essentially. So by terminating Cholesky early while using a permutation matrix, you're actually skipping rows, but that's exactly the point: if you've processed the first row, you don't want to process the second row right away, because you already have a lot of information about what's happening in the second row. That's essentially the point of this whole business.

Yes, very good question. The question was whether the optimal permutation is simply the one that covers as much of the space as possible, so essentially an even covering. The answer is that it depends on the kernel, and essentially on whether you have a uniform grid or scattered data. The size of the initial covering, the size of these blobs here, depends on the length scale of the kernel, on how wide you assume the influence of the kernel function to be, how strongly it correlates nearby points. With a shorter length scale the blobs become smaller, and then you need a denser covering to get this initial result. So it depends on the kernel, but again: generative structure. We know a lot about the generative structure, so our linear algebra algorithms should be informed about the generative structure of the matrix they're working with, which classical linear algebra algorithms arguably are not; they're just given an array of numbers, and they can't know this unless you tell them. And that is also the wider point here: maybe, because in machine learning we're looking at very specific problems, we should design our own algorithms that leverage this information. Maybe there are classical algorithms to deal with it, maybe not, but we shouldn't just follow the canon of "given an arbitrary matrix, apply this method", because we can learn a lot from the structure.

Another question: here you're obtaining the factors iteratively, going from the upper-left corner, but for approximation there already is a decomposition that solves an optimization problem, namely the singular value decomposition. Why should the Cholesky factors, in the order they come, be a good approximation of the matrix? The best factors would be the ones given by the singular value decomposition. With this heuristic, as I understood it, you're trying to use the spatial structure to get at that information, but that has to be some kind of approximate singular value decomposition. Okay, let me try to rephrase that: the question is whether what we're doing here is essentially approximating what a truncated SVD does, which would be the optimal thing to do.
In this case we would actually be trying to compute an eigendecomposition, because we have a square, in fact symmetric, matrix, and in that case the two are for all intents and purposes the same. And yes, in some sense the eigendecomposition is the optimal thing to do here, but note that these vectors aren't necessarily orthogonal, which they would need to be for an eigendecomposition. So yes, you're essentially trying to find the eigendirections, because those are optimal in some metric that I don't really want to go into right now, but you're not taking into account that you would need to orthogonalize these vectors to actually get the eigendecomposition. That is what prevents you from finding the actual eigenvectors, because those are orthogonal, they form an orthogonal basis of the space. Any other questions?

Actually, one thing I also haven't mentioned yet: there is a third thing we need to be able to do to use Gaussian processes in practice. We usually don't only care about the mean and the variance, or the standard deviation, of the process; we usually also want samples from it, and for that you need an object called a matrix square root. The Cholesky decomposition is a matrix square root, because it is essentially the same matrix, once plainly and once transposed, multiplied with itself to reproduce the original matrix, and that is what you need for sampling; you can look that up in the probabilistic machine learning class on YouTube or in any textbook. And this low-rank approximation could be useful here too: even though you can't directly invert a low-rank matrix, you can at least use it for approximate sampling, which is then also a lot faster.

All right, with that I actually come to an end. Just a quick summary: we've seen that numerical linear algebra algorithms are fundamental to machine learning; they pop up in almost every machine learning algorithm. And why should we care about numerics specifically as machine learners? Well, we've seen that these algorithms can be implemented much faster if you take structure into account, and that you need to be very careful when implementing them in practice. We mostly dealt with four tasks: efficient matrix-vector multiplication, which will also be important for the next two lectures, so keep that in mind; the solution of linear systems, of which we've seen plenty of examples over the course of the lecture; performing that and other tasks by computing a matrix decomposition, where we learned about LU and Cholesky; and the computation of log determinants and matrix traces, which is important for hyperparameter optimization problems in Bayesian machine learning. We've seen plenty of examples of how to use structure, and that structure is really important to get fast algorithms for machine learning, and we will continue to see how this structure affects Gaussian process regression in particular. Thank you. Let's make sure