So today is the final lecture of Numerics of Machine Learning. Not everyone is here, which is not a problem, because I'm not going to do anything new; I'll just rehash the content of the lecture. I'm guessing you're here because you want to prepare for the exam on Monday, and of course you want to hear what's in the exam. We'll talk about that at the end. What I'll do is rush very quickly through pretty much the entire course from mid-October until now. Not every single lecture, but I have slides from pretty much all of them. Conveniently, I have a few of the lecturers in front of me as well, so if there are really tough questions, I can hand them over to you. And I want to use this as an opportunity. You may wonder whether what I show you is relevant for the exam. Everything is relevant for the exam. But I didn't make the slides with the exam in mind; in fact, I made the slides before the exam. So what I want to do is use this opportunity to take a step back and think about what we've actually done this term. I think that's sometimes really useful, especially for a course like this one, which was so heterogeneous, with so many different lecturers. So maybe it's useful for you as well to hear what I think you've actually learned. Let's see.

At a high level, we started out, maybe in lecture number one, with me pointing out why we even need a lecture on numerical methods in machine learning: because machine learning actually is numerical computation. The stuff that happens inside the machine when it learns is the solution of a numerical problem. That's in contrast to what you might call classic AI, which is the kind of algorithm you learned about in algorithms and data structures. In machine learning, we have to solve mathematical problems that do not have a closed-form solution, and they are pretty much all of the problems that classic numerical analysis studies. Not in the order of the lecture, but: integration, for Bayesian inference. Whenever you want to measure how much volume is left in a space of hypotheses, that is Bayesian inference, and you need to solve an integration problem. Optimization: if you want to train a deep neural network, or any other model that constructs a point estimate by minimizing some regularized empirical risk, you need to solve an optimization problem. Differential equations: these actually show up in very interesting places. On the one hand, whenever you have an agent that interacts with a data source and needs to decide what to do next (you might call that reinforcement learning, but it also shows up in situations like robotics), you need to predict what will happen next. You need an internal model of the world, as simple as it might be, to predict what is going to happen so that you can decide what to do. That's a simulation problem. But there are also applications in science where laws of nature are represented by differential equations; those were actually the examples we mostly spoke about. Then you need to solve those equations and use them as another source of information about what you actually know and how much you can learn from data. And then there's linear algebra, because linear algebra is just everywhere, in particular in all of the above: solving Gaussian integrals is a linear algebra problem, and solving quadratic optimization problems is a linear algebra problem.
And solving linear differential equations is a linear algebra problem. So we went through all of these problems, not in that order, but we started with linear algebra. Here, for me, the high-level takeaway, if I try to compress it into two sentences, is maybe not something that was spelled out at the beginning of the lecture, but it sort of transpired: when you do linear algebra, the doing, the verb in that sentence, is something that happens on your processing unit, CPU or GPU. It doesn't happen on the disk. But your data is on the disk. So when you think about how complex it is to do linear algebra on something that involves a data set, you actually have to think about what you have read from disk and processed on the chip, rather than how much data you have sitting on the disk. This may not be obvious at first, but it is relevant. Why? Because reading through the data set once costs time linear in the size of the data set, and picking out individual bits of the data set costs less than linear time. So all the stuff that's expensive in linear algebra has nothing to do with reading data; it has to do with what you do, on the chip, with the numbers you've read from disk. And that's relevant because we don't always have to load the entire data set.

The base case for this in machine learning is Gaussian process regression, which did double duty in this course: first as a template machine learning algorithm, one of the simplest possible supervised learning algorithms, and also as the model class from which we constructed numerical algorithms for the entire rest of the course. Here is the slide again from lecture number two, which was held by Marvin and prepared by Jonathan. These are the equations you then got to see over and over again. The relevant point is that to compute this predictive distribution, which is a distribution, so it has a mean and a covariance, two objects, we need to solve the linear system of equations in here: this expression that we typically write as an inverse of a matrix times a vector. And maybe one of the first things we realized is that this notation hides a lot of interesting structure. It's the mathematician's nice way of writing down "this is the thing we need", without telling you how you actually compute it. This is really important in the context of other machine learning algorithms like deep learning, because I think historically there has been an interesting situation: because it's possible to write down an equation like this, everyone thinks that's what you need to do to do Gaussian process regression, and therefore Gaussian process regression is expensive and intractable, because, well, the textbook says solving such problems is cubically expensive in the size of this square matrix. For deep learning, we can't write down such an equation. You can only say that we want whatever weights minimize the empirical risk, and therefore people don't even start to think about how complex it is to train a deep neural network. We got to that towards the end of the course. We also noticed that this problem shows up twice, here and there: the same matrix is being inverted each time, but it is multiplied with a different vector each time.
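To make that concrete, here is a minimal numpy sketch, my own illustration rather than code from the lecture, of Gaussian process regression in which the "inverse times vector" is implemented as one matrix factorization that is reused for both the posterior mean and the posterior covariance:

```python
# A minimal sketch of GP regression with an RBF kernel: the posterior mean and
# covariance are computed by solving (K + sigma^2 I) v = rhs via one Cholesky
# factorization instead of ever forming an explicit inverse.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf(a, b, lengthscale=1.0):
    # k(x, x') = exp(-|x - x'|^2 / (2 l^2)) for 1-D inputs
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X, y, Xstar, sigma=0.1):
    K = rbf(X, X) + sigma**2 * np.eye(len(X))   # Gram matrix of the training data
    kstar = rbf(X, Xstar)                        # cross-covariances k(X, x*)
    L = cho_factor(K)                            # the one O(n^3) factorization ...
    mean = kstar.T @ cho_solve(L, y)             # ... reused for the mean ...
    cov = rbf(Xstar, Xstar) - kstar.T @ cho_solve(L, kstar)  # ... and the covariance
    return mean, cov

X = np.linspace(0, 5, 20)
y = np.sin(X) + 0.1 * np.random.default_rng(0).standard_normal(20)
m, C = gp_posterior(X, y, np.linspace(0, 5, 100))
```

The factorization is the expensive, cubic part; everything downstream of it is comparatively cheap.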
In this lecture number two, and in the following lecture number three, Marvin and Jonathan told you that there are many different algorithms for solving such linear systems of equations. There is the classic algorithm you all learned about in probabilistic machine learning, the Cholesky decomposition, which is usually presented as a box that just runs until it is done. But actually, as we learned, you can think of this algorithm as an iterative procedure: something that takes a big matrix, starts operating on the rows and columns of that matrix, loading them one by one, and constructs numbers from them in steps that are linearly expensive. Each individual step costs on the order of the length of one of these columns, at least initially, plus a correction step that is quadratic in the number of steps you've already done. So it's roughly linear times quadratic, and because you run it until the end, that is cubic: in the end the number of steps is also n, and you get n cubed. What's interesting, and that was the big reveal of this slide, is that you can think of this algorithm not just as constructing a matrix factorization at the end. You can actually stop at every single step of the procedure and return an estimate for the inverse of the matrix we're trying to decompose. That's this matrix C_i that shows up here. We already saw some discussion of this on the forum after the lecture, which was a very good sign; it seems like nobody was surprised by it anymore: okay, it's constructing an inverse. But that is not how you usually see Cholesky presented. So this was already new in this lecture.

And why is this interesting? At first, you might imagine it means you can just stop the algorithm early at some point and use this approximation. Then of course you have to worry about whether this approximation is actually good. In that lecture, you see I'm doing a little bit of PowerPoint karaoke, we saw that if you move through the data in some order and the data is actually ordered, then you're going to get a good approximation in some areas and a bad approximation in other areas, because you haven't read that data yet. You can also go through the data in a random order, and then you may end up with a pretty decent approximation of the posterior distribution of the Gaussian process. One way to think about this is that it's essentially like forgetting that the rest of the data set exists: you just keep loading bits of the data set from disk and don't care about the others. Clearly that is going to be at most linearly expensive in the number of data points on the disk (you might even be able to do it cheaper), times a quadratic term in the number of things you've already looked at, to keep track of how they interact with each other. That's maybe a bit boring, because why would you leave most of your data on disk? Maybe you want to do something a little smarter than that. But it's interesting that it's actually possible.
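As a small illustration of the "stop early and return an inverse estimate" view, here is a sketch under my own naming, with the assumption that the actions are simply the first i unit vectors (the Cholesky-like ordering), so that the estimate after i steps is S_i (S_i^T K S_i)^{-1} S_i^T:

```python
# A sketch of the early-stopping view: after i column-ordered steps you can
# already return a usable estimate of K^{-1}, namely the inverse of the leading
# i x i block, zero-padded to full size.  Names and setup are my own illustration.
import numpy as np

def partial_inverse_estimate(K, i):
    S = np.eye(K.shape[0])[:, :i]             # first i unit vectors as "actions"
    inner = S.T @ K @ S                        # the leading i x i block of K
    return S @ np.linalg.solve(inner, S.T)     # zero-padded inverse of that block

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
K = A @ A.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x_full = np.linalg.solve(K, b)
for i in (5, 20, 50):
    x_i = partial_inverse_estimate(K, i) @ b   # early-stopped solution estimate
    print(i, np.linalg.norm(x_i - x_full))     # at i = 50 this matches the full solve
```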
And then there's maybe a real insight, which I put onto a black slide so that we can talk about it, because it sparked some discussion on the forum. I wonder whether I can actually make it clear, because it's a bit deeper than it sounds at first. When you go through your probabilistic machine learning class, you learn that to compute this posterior you need to solve these two problems: there's this one and there's this one. Those are two different objects; one is called the posterior mean, the other the posterior covariance. Then you learn: oh, if you have the Cholesky decomposition of this matrix in the middle, this black thing, then you can solve both of them efficiently, because the Cholesky factorization tells you how to efficiently solve such linear problems. So in a sense you only need to do one thing: compute the Cholesky decomposition, and then you're done. At some later point, which you might not even reach in a probabilistic machine learning class, someone might come in and say: ah, but Cholesky is expensive, you can't do this with a big data set. So instead, for a large data set, you're going to use an iterative linear solver like conjugate gradients, which we discussed in lecture number three.

But conjugate gradients is not a matrix decomposition, at least not the way it's presented in a textbook. It's an optimization method that solves quadratic, least-squares problems like these. So if you use such a method, it may seem like there are two different optimization problems here: there is A times x1 equals b1, and there is A times x2 equals b2. So you'd have to run the algorithm twice, to solve two different linear systems of equations. In fact, if there are many points where you want to evaluate the posterior, you might have to run it many times, once per right-hand side. But actually, conjugate gradients is just like Cholesky: it's a method that estimates the inverse of the matrix just as much as it estimates the solution of the particular linear system it is run on. It's just that the properties of Cholesky are particularly nice when you talk about the solution of the one linear problem it's actually trying to solve. So we could use conjugate gradients to do both of these things at once: run it on one of the problems, naively at least, and use the numbers it constructs along the way to also estimate an inverse, which we can then use to estimate the posterior uncertainty.

And when we do that, there's an additional structure that's really interesting to think about: this algorithm then does two things at the same time. It learns from the data, which leaves you with a finite uncertainty at the end about what the true function is. And it also learns what the inverse of the matrix on this finite data set is. Together, if you stop early, these tell you how uncertain you should actually be about the entire problem. The error on the function you're trying to estimate, that's the dashed line here, the gap between the solid black line and the dashed line, comes from two different sources of uncertainty. One is that there's only a finite data set, which doesn't tell us what the true function is. The other is that there's only a finite number of computations, fewer than the entire data set maybe, which don't yet tell us what the full solution on the full data set would be.
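To make the claim that conjugate gradients also estimates an inverse concrete, here is a small numpy sketch, again my own illustration: the rank-one terms accumulated along the search directions form a matrix C_k with x_k = C_k b (for a zero initial guess), and that same C_k can then be applied, approximately, to other right-hand sides, with the caveat that it is only accurate on the subspace the solver has explored so far.

```python
# Conjugate gradients, with the implicit inverse estimate made explicit:
# C_k = sum_j d_j d_j^T / (d_j^T A d_j), and with x_0 = 0 we have x_k = C_k b.
import numpy as np

def cg_with_inverse_estimate(A, b, k):
    n = len(b)
    x, r = np.zeros(n), b.copy()
    d = r.copy()
    C = np.zeros((n, n))                 # running estimate of A^{-1}
    for _ in range(k):
        Ad = A @ d
        dAd = d @ Ad
        C += np.outer(d, d) / dAd        # rank-1 update of the inverse estimate
        alpha = (r @ r) / dAd
        x += alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return x, C

rng = np.random.default_rng(1)
Q = rng.standard_normal((100, 100))
A = Q @ Q.T + 100 * np.eye(100)
b1, b2 = rng.standard_normal(100), rng.standard_normal(100)
x1, C = cg_with_inverse_estimate(A, b1, k=30)
print(np.linalg.norm(x1 - C @ b1))       # the solver's own solution is C @ b1
x2_approx = C @ b2                       # reuse for a second right-hand side
```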
And funnily enough, those two can both be tracked at the same time, using an algorithm that looks pretty much like Cholesky, a generalization of Cholesky. It still keeps track of an inverse of the matrix, but it is a bit more general in how it assumes the observations, the projections of the matrix, to be constructed: they come from some policy in here. If that policy is the naive one that just loads the data in one after the other, as locations in input space, then you get something like Cholesky. And if you use a smart policy, you might converge much faster. There is a price to be paid for that smart policy, though. If you want an algorithm that very quickly figures out how to be good on all of the data, it needs to take all the data into account when it plans which directions it's going to walk into. And that means each iteration of this method will be quadratically expensive in the data set size, rather than linearly expensive as for Cholesky. This doesn't matter in the limit, because once you've solved the entire problem, everything is still n cubed: k iterations at cost n squared each, for k equal to n, give n cubed; and the Cholesky-style version with total cost n times k squared is, for k equal to n, asymptotically also n cubed.

[A question from the audience about the sum of the two uncertainties tracked by the algorithm.] That's a good point. I said you can track both uncertainties, but we can't track them individually, one and then the other. We can only track their sum, which is the sum of this green and the blue region. If you want to know which part is which, you need to do a computation that's cubically expensive again. But arguably, the sum is what you want: there's a data set on your disk, you haven't looked at the whole thing yet, how uncertain are you? That's the answer to that question. If you want to know how uncertain you will be once you've looked at the whole thing, then yes, you have to do an expensive computation, because you'll have to look at the entire data set to figure out what it's going to say.

So the high-level takeaway from the linear algebra part is that how expensive this base case of Bayesian inference, Gaussian process regression, actually is, is much more subtle than what you might have learned from a textbook. Instead of saying "solving linear problems is cubically expensive", which is a worst-case statement, we found some rather subtle gradations, which go something like these four points. If you decide to never look at the entire data set, then trivially you're not going to have to pay for it: if your data loader only loads K data points, you can do whatever you want with them afterwards, and computing the estimate costs O(K cubed). And if you use that subset to construct your posterior uncertainty along with your posterior mean, everything is fine.
The only thing you've done is that you haven't looked at the whole data set yet. Now, if you decide to load the entire data set but only do K iterations on it, you can construct an algorithm like Cholesky, which gives an estimate at a cost linear in the data set size and quadratic in the number of iterations, with an associated, meaningful uncertainty that keeps track of how much of the information in the data set you've absorbed so far. It's a bit more than if you had loaded only K data points, because it kind of knows where the other data points are, but it hasn't kept track of how they interact with each other yet. If you want an algorithm that is very efficient at loading the right kind of data from disk, so that it converges fast, then each iteration is going to be more expensive, because in every one of these steps you have to look at all the other n data points to decide which of them to load and how much weight to give them. That's an algorithm that is quadratically expensive per iteration, for K iterations. And finally, of course, you can also do full Cholesky. You don't care about any of this; you just say: load the whole thing, do whatever you need to do, tell me the right answer. Then you pay cubic cost in the number of data points, and afterwards you'll still be finitely uncertain, because you only have a finite data set.

So why make this strong decision to always look at the entire data set if you're going to be finitely uncertain afterwards anyway? The size of the data set is often quite independent of the problem you're trying to solve. Not completely independent, but not directly related either. It's not as if some magic oracle tells you exactly the 863 data points you need to solve the task at hand; it just says, here are my 10,000 data points, go ahead. So when you're operating on a data set of non-trivial size, it makes sense to think about how you operate on that data set: what you load, in which order, and how much time and compute you invest in each individual iteration. That gets us to the next setting, where in some sense we operate on an infinite-dimensional data set, one where we get to decide how often we want to evaluate and then have to be careful about what we do with that information. That's simulation, where we initially spent three lectures on systems that evolve through time, focusing on the time domain. Those are, although this is cutting it a bit short, usually associated with ordinary differential equations. A simple way to phrase it is that ordinary differential equations are differential equations where only the time dimension is explicitly written down in the equation.

For such systems that evolve through time, Jonathan Schmidt came in first, in lecture number five, and told us that, just as Gaussian processes are the inference mechanism for linear problems, problems where the data relates to the unknown function in a linear fashion, for time series there is also an algorithm for linear time-dependent or time-invariant problems. That means problems where the evolution of the state through time is a linear function of previous values of the state, observations are only linear transformations of that state, and observations have Gaussian noise. That's a special case of Gaussian process inference, and this algorithm is called Kalman filtering and smoothing.
It's actually, if you like, two algorithms, but in a way it's also just one algorithm that implements message passing on a chain graph like this, forward and backward. This kind of algorithm (do I have a slide for it? yes) is very easy to write down, and in particular it is also linearly expensive in the number of time steps. So if you're observing a system that evolves according to this linear Gaussian relationship (by the way, there's a limit case of this called a linear stochastic differential equation, which is what this is leading to), then there is an algorithm that can incorporate linear Gaussian observations of that system in time linear in the number of observations. It's not linear in the number of elements in the state space, the number of things you're tracking through time; in that it's actually at most cubic, because it involves this inverse of a matrix here. So everything we just heard about inverses of matrices in the linear algebra part also applies here; you can combine everything you've learned about linear algebra in here. You can open up this box, say "ah, there's an inverse of a matrix written down here", and then open up a second box for linear algebra and apply all of that machinery in there. That's an interesting aside: this happens in numerical computation all the time. We build the solution to a harder problem from solutions to simpler problems by nesting these types of algorithms inside each other. So if you know about the lower levels of this computational hierarchy, that's useful, because you can use it to speed things up higher up the hierarchy.

Okay, so that's the filter, the forward pass. Intuitively, what this algorithm does is start at the left end of this chain with an initial belief over what this latent state is, and then work its way forward through time towards the end. Whenever it encounters one of these data points, it can incorporate it; if it doesn't encounter one, you don't even have to do that step, but you can, and then you get an explicit representation of what the latent state looks like, with probabilistic uncertainty. Once it reaches the end, we switch from this algorithm to this one, which is called the smoother. That's a bookkeeping algorithm that, in a sense, informs all the earlier variables in the chain about the observations made in their future. That's why the code for the smoother looks simpler: there is no new data to account for. The data has already been added, and now we just have to do the bookkeeping to make sure everything is consistent.
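Here is a compact sketch of that forward filter and backward smoother, a minimal illustration of the message-passing structure rather than the lecture's reference implementation:

```python
# Kalman filter / Rauch-Tung-Striebel smoother for a linear Gaussian state-space
# model x_{t+1} = A x_t + q,  y_t = H x_t + r.
import numpy as np

def kalman_filter_smoother(A, Q, H, R, m0, P0, ys):
    ms, Ps = [], []
    m, P = m0, P0
    for y in ys:                                  # forward pass: predict, then update
        m_pred, P_pred = A @ m, A @ P @ A.T + Q
        S = H @ P_pred @ H.T + R                  # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain (the "matrix inverse" box)
        m = m_pred + K @ (y - H @ m_pred)
        P = P_pred - K @ S @ K.T
        ms.append(m), Ps.append(P)
    for t in range(len(ys) - 2, -1, -1):          # backward pass: bookkeeping, no new data
        m_pred, P_pred = A @ ms[t], A @ Ps[t] @ A.T + Q
        G = Ps[t] @ A.T @ np.linalg.inv(P_pred)   # smoother gain
        ms[t] = ms[t] + G @ (ms[t + 1] - m_pred)
        Ps[t] = Ps[t] + G @ (Ps[t + 1] - P_pred) @ G.T
    return ms, Ps
```

The forward loop is predict-then-update; the backward loop touches no data, it only redistributes what the filter has already learned.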
Okay, so that's the fast algorithm. Jonathan then ended by pointing out that you can apply this class of algorithms also to settings where the observations are not a linear Gaussian transformation of the state space, and where the step from one state to the state at the next point in time is not a linear Gaussian map but a nonlinear function. The algorithm for that, called the extended Kalman filter, is the pedestrian approach: it's not quite a Kalman filter, because things are not linear, so, well, let's just make them linear. You do Taylor approximations wherever you need them. You pretend that the functions you care about are linear by taking the nonlinear function, computing its value and its gradient, or rather its Jacobian, and then treating it like a linear function everywhere. That goes both for the dynamics, this f function, and for the observations, this h function. This is an algorithm that has been around for a very long time. It's been studied in control engineering, systems engineering, signal processing, whatever the fields are called, as a very fast algorithm; the original rocket science, maybe.

What we did with it in the next lecture, lecture number seven, by Nathanael, was to observe that we can use this machinery to include information about nonlinear relationships between the components of the state space. In particular, if we take a state space that consists of the values of a function and its derivatives (first derivative, second derivative, third derivative), then we can use this mechanism to include information about nonlinear functions that relate those derivatives to each other. And those are called differential equations. And that's it, actually. It turns out there is an algorithm that looks like a filter, works like a filter, and performs like a filter, and that solves differential equations. What I've put up here is actually a slide from Nathanael's lecture number seven. You may have noticed that I skipped lecture number six, which was the interlude where Nathanael told you how classic ODE solvers work. Those methods are beautiful pieces of arcane code, written in the eighties, that perform extremely well and are very robust, and they are totally closed boxes that no one can touch anymore, because they contain these magic numbers, Butcher tableaus, which define the weights given to individual evaluations made at particular points, numbers no one is allowed to change because then the proofs don't work anymore. And in fact, Nathanael only mentioned this in passing, for these classic methods the proofs really are super complicated. For Runge-Kutta methods in particular, which are the most cherished form of ODE solver around, there are even pure-mathematics studies of the structure of this algorithmic landscape: why certain kinds of solutions exist, why there are sometimes gaps in convergence order, and so on. So they are really rigid; there was essentially nothing you could change.

This framework here, by contrast, is super flexible. It's a very computer-science-y way of addressing simulation: it's just a Kalman filter. We can decide where we want to evaluate and what we want to evaluate, we can linearize everything, and we can use the tools of modern differentiable programming, I'm not even going to say machine learning: if we have access to gradients, we can use them. The algorithm is set up in the following way. We first write down a particular linear Gaussian system, a Markov chain of states, which says that the states evolve according to some simple linear stochastic differential equation that can be handled in closed form with Kalman filtering and smoothing. And then we include knowledge about the differential equation by saying: at various points in time, I observe that some algebraic relationship between the states holds. In particular, if I say that the difference between the first derivative of x at time t and some function evaluated at the zeroth derivative of x at time t is zero, then that encodes an ordinary differential equation.
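Here is a sketch of just that measurement step, the information-operator update of a probabilistic ODE solver, written in the extended-Kalman-filter style. The prediction step that propagates the linear SDE prior between time points is omitted, and the scalar example and names are my own:

```python
# One "information operator" update: the state is (x, x'), and we "observe" that
# the residual g(state) = x' - f(x) equals zero, linearizing g with its Jacobian.
import numpy as np

def ode_information_update(m, P, f, df):
    # measurement function g(m) = m[1] - f(m[0]) and its Jacobian [-f'(m[0]), 1]
    residual = m[1] - f(m[0])
    J = np.array([[-df(m[0]), 1.0]])
    S = J @ P @ J.T                      # no observation noise: the ODE holds exactly
    K = P @ J.T / S                      # Kalman gain
    m_new = m - (K * residual).ravel()   # update toward zero residual
    P_new = P - K @ J @ P
    return m_new, P_new

# logistic growth x' = x (1 - x) as the right-hand side
f  = lambda x: x * (1 - x)
df = lambda x: 1 - 2 * x
m, P = np.array([0.2, 0.0]), np.eye(2)   # prior belief over (x(t), x'(t))
m, P = ode_information_update(m, P, f, df)
```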
But I could actually include any implicit algebraic equation here. I could just as well write g of x equals zero, for whatever g. As long as g is sufficiently regular, differentiable and so on, I can put anything in. So not just ordinary differential equations, but also algebraic equations, or continuous symmetries: we could say that if you move along some direction in x-space, things do not change. This can be used to include continuous group symmetries, Hamiltonians, this kind of stuff. Or partial differential equations, which I'm going to flash at you in two slides. But we can also include all sorts of other information. We could, for example, include the fact that we have observed the system at various points in time. If I tell you that there is a dynamical system that follows some law of nature given by an ordinary differential equation, and some Hamiltonian that is conserved, some Lie group symmetries that hold across time, and also that I've measured that system at various points in time, so I know what kind of path it took, or maybe I know something about where it started and where it ended, then I can include all of this in the same algorithmic language. And that's kind of neat, because it means I don't have to know about a very complicated toolbox with all these different algorithms, Runge-Kutta and geometric integrators and whatever they're all called, and which one to call in which setting. I can just use a Kalman filter and be done.

In particular, if there are parts of the state space, something in x that doesn't actually show up in this equation because the equation doesn't use it, then I can use the other types of observations to inform myself, or rather the algorithm, about what those values might be. We saw an example of this in the COVID setup: if there is some latent force in your dynamical system, like a contact rate that you don't know, then observing the trajectory of the system alongside the differential equation can tell you what that latent force was. It doesn't have to, of course; it depends on what the differential equation actually is and how the latent force interacts with the trajectory you got to observe. There might be subspaces that aren't identified by those observations, but up to nonlinear corrections the filter will take care of this: it will just notice that the Jacobian has degrees of freedom that aren't identified by the observations, and you simply remain uncertain about them. That's the value of uncertainty in computation: it is maybe okay to do things that don't have a full solution, as long as you know that you haven't found the full solution.

Yeah, that is actually already everything I wanted to say. Yes? [Question from the audience.] I don't know, I actually don't know. I haven't even looked at the corresponding question in the exam, so I don't know. Maybe, maybe not. It's good that I didn't look at it before. Nathanael is keeping a perfect poker face at the moment, of course. Good. But what I will say, notwithstanding the exam, is that I'm not claiming nobody should know about the Runge-Kutta methods anymore. I do think, though, that we're maybe entering a point in time where you don't need to know everything about all these simulation methods anymore. It's also a bit weird, right?
Whichever stack you're working in, if you use the simulation packages, you have the choice between twenty different simulation algorithms by now. They are typically all motivated by being able to do one thing particularly well: keeping some Hamiltonian constant, or working well on a particular class of differential equations that are stiff in some sense, or scaling particularly well to some setting. And not all of this is as relevant anymore. If you think of a simulation method as essentially a filter, then you're able to do many things for which you would otherwise need a specialist method, and to combine them with each other. In fact, with a few caveats, you are also able to do things that don't have a full solution, that are not fully identified by the information available. Of course, that doesn't always work, and it is sometimes a bit dangerous, because we are still working in a linearized Gaussian regime, and there might well be disturbances and issues in your dynamical system that aren't captured by this linearization. Sure, and then that's potentially a problem.

What's maybe also a takeaway at this point is that here we really, very explicitly, think about information as the central object of computation. What these methods do is manage information. We call these terms information operators: the term in the algorithm that updates your knowledge about what the dynamical system does is called an information operator, because there isn't really a difference between information that comes from the disk, or from a sensor attached to the computer, and information that comes from the programmer, who has written it down as an algebraic equation that goes into the mechanism. Or maybe some other form of information, the classic idea of a prior, say: maybe you have a scientist who just tells you, well, the solution should look something like this. Then you can encode that in the language of information operators, something that would otherwise be very difficult to write down in math. Whether the information comes from the chip or from the disk doesn't really matter. It's just a number, and the number doesn't come with a stamp attached saying "I was collected from this particular source". It just says: this is how much information you have about the system, and that's encoded in the information operator. Also, by the way, a fun and nice aspect is that the information operator acts as a kind of interface between the user and the designer of the algorithm. If you just want to use one of these filtering-based simulation methods and you want to include information, the only thing you have to provide are these encapsulated objects called information operators. You don't have to write the filter and the smoother yourself; that can be done by whoever designs the algorithm.

And then we had a maybe very painful lecture to point out that you can also use this to solve partial differential equations. I'm not going to redo everything Marvin did; instead I'll tell the high-level story again. Solving partial differential equations is essentially the same thing. We typically write partial differential equations in a form, I think Marvin did this as well, is it somewhere on here?
It looks something like D[v](x, t) = f(v(x, t)), where D is a differential operator that contains lots and lots of partial derivatives, and v is the solution we care about. It's a function, and it depends on space and time, because now we're thinking of spatiotemporal systems. A weather forecast looks like this: there are all these forces hidden in here, things we know, like densities of air and velocities at an initial point and so on, and we know that they somehow relate to each other. But really, this is just another one of these equations of the same shape as before. It's just that instead of a state x that evolves through time, we have this function v, which depends on time and space. So we have to be a bit more careful about how we do the computation, and a filter does not immediately solve all of our problems anymore. One simple thing to do is to put a grid on space and then run a filter across time, but if that grid has many points, then the big inverse in the Kalman filter still shows up, still matters, and still makes everything slow.

What Marvin pointed out was that in the particular case where D is a linear operator, so where we have a linear partial differential equation, this can be solved as an instance of Gaussian process regression. In particular, in the simplest form, where the right-hand side is linear as well, we just observe linear projections of v under various operators, and then things are fine: it's just Gaussian process inference. Except that you may be really worried, because this is a statement about functions, and functions are somehow not the same as vectors. They are potentially dangerous, because function spaces can be very nasty, and not everything you've learned in linear algebra applies in function spaces, so you've learned to be very careful about this. That's why Marvin had this very complicated slide that essentially contained the license, the model card, that says you are allowed to think about functions in this setting as if they were really long vectors with infinitely many entries, with some caveats that come with a complicated theorem that Marvin had on his slides. Beyond that, you can think of solving such differential equations as just doing Gaussian process inference with a really interesting observation model.

And what do you do if things are not linear? Well, there isn't really a full answer for that yet. It's also not on this slide, but you can probably imagine that it's going to be a little bit like the ODE case: you just linearize everything as much as you need. And then you can do the same thing as for ordinary differential equations. You write down a generative model for functions, a Gaussian process at the top, and then inform it about various sources of information. In particular, you can inform it about the fact that there is some partial differential equation that relates the values of this function to each other, in a linear fashion, according to some linear operator. Then you may also know something about the boundary values of this function: that its derivatives at some points are zero, or that the value of the function at some point, minus some shift, is zero.
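As a small worked example of this "linear differential equation as GP regression" idea, here is a sketch, with my own choice of problem and kernel: a GP prior on u is conditioned on collocation observations of u'' and on the boundary values, for the one-dimensional problem u''(x) = f(x) with u(0) = u(1) = 0.

```python
# GP "collocation": condition on u''(x_i) = f(x_i) and u(0) = u(1) = 0, using an
# RBF kernel and its derivatives as the cross-covariances under the operator.
import numpy as np

l = 0.3                                                    # RBF lengthscale
k    = lambda x, y: np.exp(-0.5 * (x - y) ** 2 / l**2)
d2k  = lambda x, y: k(x, y) * ((x - y) ** 2 / l**4 - 1 / l**2)               # d^2/dx^2 k
d2d2 = lambda x, y: k(x, y) * ((x - y) ** 4 / l**8 - 6 * (x - y) ** 2 / l**6 + 3 / l**4)

Xc = np.linspace(0, 1, 12)[:, None]                        # collocation points
Xb = np.array([[0.0], [1.0]])                              # boundary points
rhs = -np.pi**2 * np.sin(np.pi * Xc[:, 0])                 # u'' = f, here f = -pi^2 sin(pi x)
obs = np.concatenate([rhs, np.zeros(2)])                   # ... plus u(0) = u(1) = 0

G = np.block([[d2d2(Xc, Xc.T), d2k(Xc, Xb.T)],
              [d2k(Xb, Xc.T),  k(Xb, Xb.T)]]) + 1e-6 * np.eye(len(obs))  # joint Gram matrix

Xs = np.linspace(0, 1, 101)[:, None]                       # where we want the solution
cross = np.hstack([d2k(Xs, Xc.T), k(Xs, Xb.T)])            # cov(u(x*), observations)
u_mean = cross @ np.linalg.solve(G, obs)                   # posterior mean of u
print(np.max(np.abs(u_mean - np.sin(np.pi * Xs[:, 0]))))   # compare to the true solution sin(pi x)
```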
On top of that, you might also measure this system at various points with a sensor that records what's actually going on. And then all of these pieces just get combined, because they are all information operators, and they all show up in one big Gaussian process inference scheme that builds a big Gram matrix out of kernel matrices with operators applied to them from the left and the right. You end up with a large least-squares linear algebra problem, and then you go back to lectures two and three and redo linear algebra. In the end you get an uncertainty-quantified representation of what we know about this dynamical system that we've measured at various points. So in the future, whenever someone comes up with differential equations in machine learning, you don't have to be scared of them anymore; they are just one more source of information. Maybe for your generation this isn't even a thing anymore, but I distinctly remember that until not so long ago, differential equations were really not a thing in machine learning. Everything was just observing a function, lots and lots of regression and classification problems. And now everyone's talking about diffusion and partial differential equations and neural operators and neural ordinary differential equations. But as you can see, they are really just a step from observing a function to observing a projection of the function. And the fact that this projection is sometimes nonlinear shouldn't be a problem, because we also often observe functions in a nonlinear fashion. Classification, bog-standard, boring undergraduate classification, is already a nonlinear observation of a function. Logistic regression is a nonlinear observation of a function.

Okay. At this point we were close to Christmas, and we moved to integration. Here I'm just going to say three things, because I gave that lecture myself, so I don't have to repeat myself too much. We started out, in lecture number nine, with what people normally do when they hear "integration" in probabilistic inference: Monte Carlo. Monte Carlo is a beautiful set of algorithms that turn a deterministic inference problem into a stochastic, statistical problem, by trying to compute such integrals by drawing random numbers from the probability distribution described by this p, and then evaluating a sum rather than an integral. The nice thing about this approach is that it has some interesting statistical properties: it gives an unbiased estimator, and that unbiased estimator works on pretty much any integration problem, well, on any Lebesgue-integrable function, but basically on any integral. So that's nice: it's an algorithm that works for everything. But it also has a cost, which is that the estimate converges slowly. As a function of the number of evaluations of the integrand, you get a convergence rate for the expected error that is at best one over the square root of the number of samples. So the rate at which the estimate of the integral converges to the true value of the integral is one over the square root of the number of samples. And in fact, that's a best-case statement; depending on which algorithm you use, it might be much worse, both because the constant in front might depend rather nastily on the problem, and, maybe more importantly, because the algorithms we actually use in practice, Markov chain Monte Carlo, don't achieve this rate at finite sample sizes; they only achieve it asymptotically.
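A tiny numerical illustration of that rate, my own example:

```python
# The Monte Carlo rate: the error of the sample-mean estimator of E_p[f(x)]
# shrinks roughly like 1/sqrt(N), independent of the dimension of x.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sum(x**2, axis=-1)          # integrand; true E[f] = d for x ~ N(0, I_d)
d = 10
for N in (10**2, 10**4, 10**6):
    x = rng.standard_normal((N, d))          # samples from p = N(0, I)
    est = f(x).mean()
    print(N, abs(est - d))                   # error roughly follows 1/sqrt(N)
```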
Then, in lecture number ten, just before Christmas, we looked at the polar opposite, an entirely different kind of approach. Instead of saying "I don't know anything about the problem, let's just draw random numbers and hope it all averages out", we spend as much effort as we want on modelling the integrand really well, and then learn what its integral is from carefully chosen evaluations of the integrand. This is the idea of Bayesian quadrature. We start out, and by now this is natural to you, with a Gaussian process prior on the integrand, which induces a Gaussian marginal distribution over the integral. If you condition this distribution on evaluations of the function at various points, you get an estimate not just for the integrand, but also for the integral. One thing we had to do to make this work is find kernels for this Gaussian process that are integrable in closed form, so that we can lift, if you like, the problem of computing an integral of some function to the problem of integrating the representation of that function in some reproducing kernel Hilbert space, represented by evaluations of the function. This constrains us a bit, because we now need to be able to compute the integrals over the kernel, this one and this one. But beyond that, it's an algorithm that allows us to integrate. And we saw that in some settings, in particular low-dimensional problems, this class of algorithms can perform really, really well. It can converge much faster than Monte Carlo: polynomially fast, not square-root fast. And there are even special classes of algorithms that converge much faster than that, super-polynomially fast. Admittedly, this mostly works for low-dimensional problems, and scaling it to high-dimensional problems is tricky; scaling in the sense of building algorithms that still converge fast in high dimensions. It's not so hard to build versions of these algorithms that work as well as Monte Carlo in high dimensions, but then again, why would you do that? You might as well just sum up a bunch of random variables.

So the main takeaway at this point in the course was not so much "if you need to integrate something, use Bayesian quadrature". It was the more philosophical observation that an algorithm that works for everything is typically going to work badly on almost everything. You're paying a price for generality: algorithms that work on a large class of problems tend not to work particularly well on any individual instance from that class. Whereas if you build an algorithm that works only for a very small class of problems, it might work really well on each individual member of that class, but of course it might break badly outside of it. And this is, of course, reminiscent of Bayesian inference again: if you have a prior that puts a lot of mass on a small, constrained space, then you don't need much data to do inference in that space. But if the prior assumptions are wrong, if you're actually facing a problem that isn't in your hypothesis class, then all bets are off; the algorithm might fail completely.
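To make the Bayesian quadrature mechanics concrete, here is a sketch with my own choices of kernel and integrand. The point is only that the kernel can be integrated in closed form, so the integral estimate is the kernel-mean vector times the usual solve against the Gram matrix; the posterior variance, which needs the double integral of the kernel, is omitted.

```python
# Bayesian quadrature sketch for I = integral of f over [0, 1] with an RBF-kernel
# GP prior on the integrand.  z(x) below is the closed-form kernel mean
# \int_0^1 k(x, y) dy; the estimate is z^T K^{-1} f(X).
import numpy as np
from scipy.special import erf

l = 0.2
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / l**2)
z = lambda x: l * np.sqrt(np.pi / 2) * (erf((1 - x) / (np.sqrt(2) * l)) + erf(x / (np.sqrt(2) * l)))

f = lambda x: np.exp(np.sin(3 * x))               # an integrand without a closed form
X = np.linspace(0, 1, 12)                         # evaluation nodes (here just a grid)
K = k(X, X) + 1e-8 * np.eye(len(X))
estimate = z(X) @ np.linalg.solve(K, f(X))
print(estimate)                                   # compare against a fine Riemann sum of f
```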
So what does this mean for computation? It means that if you think about the social aspect of computation, if you get your algorithms from a theorist who builds algorithms for a living, then the person you're buying your algorithm from is incentivized by the academic system to build methods that work on a large class of problems. Few people in the world get to write a paper about how to solve problem X, and only problem X, and then never have to solve anything again; typically they get to write papers about how to solve this whole class of problems with this beautiful algorithm. In practice, though, if you're the one actually building a solution for a commercially, scientifically, or socially valuable problem, you only have that one problem to solve. So you can sometimes increase the performance of your method, if you know what you're doing, by building a customized method that works well on your problem, but only up to the point where you put in things you really know about your problem, not things you invented that don't actually hold.

At that point we had a Christmas break, and then we came back for a change of pace, turning to numerical problems in contemporary machine learning that arguably do not have a beautiful solution yet, but which are so important, in fact they are driving the entire community, that we need to talk about them anyway. And here our message was, I think, quite different from before. Before Christmas it was all "look at all the stuff we know, we can do this and this and this". Afterwards it was more "look, this is something that nobody really knows how to do, so maybe it's interesting for you to think about, and also to understand why it is so hard". Not just because we in this room don't know how to do it, but because basically no one out there really knows how to do it. Training deep neural networks is the prime example. I'm not going to show all the slides again. Frank came in the week after Christmas and showed you this list of optimizers, 120 of them, and argued that pretty much all of them work about equally well, with minor variations, and that you might as well use Adam to train your deep neural network, to solve optimization problems of the type we'll discuss for a few minutes. That might sound at first (actually, let me go back one slide) like this problem is solved: there's Adam, everyone just uses Adam, it's fine. But actually these algorithms are fundamentally frustrating. They're really unpleasant to use. They need babysitting, they need hyperparameter tuning, they don't always work; you need to initialize the network well, you need to figure out what the algorithm is doing while it trains, and you even need to decide when to stop it.

This is strikingly different from the situation optimization researchers used to study not so long ago. When I went to university, optimization lectures and tutorials consisted of people telling you that there is this algorithm, you just download it and it works, and it's perfect. I once sat in a tutorial at a machine learning summer school by Stephen Boyd, one of the grand éminences grises of optimization, who spoke about convex optimization, and at several points in his tutorial he did this running gag: here is this method, I can solve this problem with it, and would you like to know how it works?
Well, that's none of your business, he kept saying. Because you don't need to know; these things are packaged now, you just have to write down the problem in the correct formulation and then it's solved. There are these perfect algorithms that are super fast and solve every problem. That's what optimization used to be like: hit the button and it works. And it's not like this anymore at all. Instead, we have the polar opposite in machine learning now: engineering teams of a hundred people working on large language models, having weekly meetings to decide whether they want to lower the learning rate a bit, or change the batch size, or roll back to a safe state from last week because in the meantime the gradients have exploded. So we're at a really interesting point in time: these methods don't really work. Well, they work kind of; they train something, and it's good enough to get public attention and lots and lots of money. But it's really not an efficient use of resources.

So why is this? You could think of three possible reasons. One might be that these problems are really large: large language models, billions of parameters. But I don't think that's a strong argument, because optimization was already dealing with million-dimensional problems thirty or forty years ago, and there is not that much fundamental mathematical difference between a million dimensions and a billion dimensions; and chips have become bigger and smarter and faster. So that's not really the problem. These problems are also not convex, and the classic theory of optimization mostly applies to convex problems, fine. But then again, in the seventies people were already applying quasi-Newton methods to non-convex problems and got them to work. They didn't have guarantees, but they worked. So that's also not really the problem. No, the problem is stochasticity. And with that, we're right back at the center of this course: uncertainty in computation.

What actually happens when you train a deep neural network is that you're not minimizing the population risk, because you only have finite data anyway. But you're also not minimizing the empirical risk, the second line here (where is the pointer? gone; okay, I'll point with the mouse), which is what you would do if you had finite data but infinite compute. You don't have infinite compute; you have finite compute as well. So instead, you're actually evaluating the mini-batch gradient, this object that looks like a distribution of a bunch of arrows around a batch mean. And that drastically reduces the performance of optimization algorithms, because it introduces very significant noise, noise that is in fact larger than the signal. This picture is actually a little misleading, because it's in two dimensions: the typical batch-average gradient doesn't even point in the right direction. The inner product between the second line and the third line isn't even larger than zero, typically speaking. So we can't think of this as a small perturbation. It's not the sort of thing you study in numerical analysis, where you say: here is an algorithm that works perfectly if everything is exact, now let's imagine a small error epsilon and do some forward or backward error analysis to figure out what the resulting disturbance is going to be.
No, because here the difference between the thing you want to compute and the thing you get to compute is not small compared to the thing you want to compute. You really just get access to a source of random bits that have a little bit of structure, and the name of the game is to get as much information as possible out of those random variables. Frank made several points in his lecture, but one I particularly want to emphasize is towards this end here: we don't need to despair at this setting. In fact, it offers lots of interesting new avenues to try. Why? Because we actually get access not just to this red arrow in the picture over there, but also to all the golden arrows, because they are computed anyway when you compute your batch. This object here is fully computed and fully accessible during training. The typical deep learning libraries only give you back the sum, the mean gradient, but if you really want to, you can get access to all of these individual gradients. And suddenly you have a much richer source of information available: an array rather than a vector, an object that you can study and think about and do all kinds of interesting things with. You can also compute gradients of this object, so curvature estimates, using automatic differentiation, just as we used the gradient of the loss to construct this object in the first place; we can keep going deeper with automatic differentiation.

Then Lukas Tatzel came in, in lecture number twelve, and said: yes, you can actually make use of such curvature estimates to build interesting algorithms, like the classic Newton-type optimization methods. You just have to be careful, because we're now in the deep learning setting where all three of the problems I mentioned before apply. First, the model is very large, so these curvature matrices are even larger, quadratically larger, so they are huge, and we can never hope to look at the entire Hessian; we have to be smart about which bits of it we compute. Second, the problem is not convex, so the Hessian isn't positive definite, and we need to work around that, for example by working with other matrices that are guaranteed to be positive definite, like the generalized Gauss-Newton, the Fisher, and so on. And third, everything is still stochastic, so naive second-order methods don't work. That's where we are at the moment as a community, not just in this lecture hall, but pretty much across the entire world: there isn't a conclusive answer yet on how to train deep neural networks. Personally, my prediction is that it's going to be super exciting when such an answer emerges, and you will know it when it happens, because it's going to make training deep neural networks unbelievably faster, orders of magnitude change in performance, and that is going to change how we do machine learning. Let's see when it happens.

Meanwhile, I don't want to leave you hanging with "something will come up eventually". We can also use these quantities I just hinted at for things other than training. We don't just need to train deep neural networks; we also need to do other things with them. Once they are trained, for example, we want to know how much they actually know, which parts of the weight space are constrained and which parts are not. What does the model know, and what does it not know?
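Picking up the earlier point about the richer information inside a batch, here is a toy numpy sketch, with a linear model and squared loss of my own choosing, of the per-example gradients that the batch mean usually hides, and of a generalized-Gauss-Newton-style curvature estimate built from the same pieces:

```python
# For f(x) = w^T x with squared loss, the gradient of 0.5 r_i^2 is r_i x_i per example.
# Deep learning frameworks compute these internally and return only their mean;
# here everything is written out by hand.
import numpy as np

rng = np.random.default_rng(0)
n, dim = 256, 5
X = rng.standard_normal((n, dim))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(n)
w = np.zeros(dim)

batch = rng.choice(n, size=32, replace=False)
residuals = X[batch] @ w - y[batch]                 # per-example residuals r_i
per_example_grads = residuals[:, None] * X[batch]   # one gradient per example (the "golden arrows")
mean_grad = per_example_grads.mean(axis=0)          # what SGD / Adam actually see
grad_cov = np.cov(per_example_grads.T)              # the spread around the mean: the gradient noise
ggn = X[batch].T @ X[batch] / len(batch)            # Gauss-Newton curvature J^T J on the batch
```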
How uncertain is it? I did one lecture, well, I delivered it, but Agustinus wrote it, to show that we can use these quantities, these curvature estimates, to construct uncertainty estimates for deep neural networks in a very lightweight way, using the idea of a Laplace approximation. I didn't even bring a slide for Laplace, but you all remember the idea: you train your deep neural network in whichever way you like until you've found one point in weight space that is your best guess for what the weights should be. That's when your optimizer gets switched off; it never actually terminates by itself, you switch it off. Then, at that point in weight space, you compute some approximation of the curvature of the loss function with respect to the weights of the network. That tells you in which directions the loss is flat across the training data and in which directions it is steep. If I move in this direction in weight space, a lot of the training data will be misclassified, so I should stop; if I move in that direction, nothing changes on the training data, so I might as well move there. And that gives a notion of uncertainty: this dimension is uncertain, that dimension is very confident. We translate this into a Gaussian distribution by taking this curvature matrix, taking minus it where needed, and inverting it; that gives a Gaussian covariance matrix. That covariance matrix we can plug into, for example, simple linearizations of the model to get the predictive uncertainty you see over there. This functionality can be used for various different use cases. In particular, it can be combined with a linearization of the network in weight space to get a mechanism that essentially turns any deep neural network, approximately, into a Gaussian process posterior, a parametric Gaussian regression posterior. Approximately, but in a totally structured fashion, using only linear algebra and automatic differentiation, no fancy, complicated, expensive Monte Carlo, and in a way that can be done after training without messing up the performance of the network.

I'll largely leave it at this. Maybe one thing I want to say here: once we reach this slide, we've also traversed to the other end of the connection between inference and computation. I said at various points across the course, and also today, that you can think of computation as inference, and if you're doing inference, you might as well do Bayesian inference. That may sound like "being Bayesian is the right thing to do", for whatever philosophical reason. But actually these results show that it's a bit more subtle. Being probabilistic is important, and being probabilistic, for me, just means measuring volumes in hypothesis space: figuring out how many degrees of freedom are left, and understanding which parts of your model are actually constrained by whatever information you have, data or computational information or mechanistic information, and which parts are not, because quantifying uncertainty is important. But whether you need to do this perfectly, by applying Bayes' theorem and computing a correct posterior measure, depends on the setting. In computation, you often don't want to be fully Bayesian, because then it's intractable. So we do Gaussian inference.
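A minimal sketch of that Laplace recipe on a toy logistic-regression "network", with all names and settings my own: find a weight estimate, compute the curvature there, invert it to get a Gaussian covariance, and push that through a linearization of the model.

```python
# Laplace approximation on a toy model: MAP estimate, curvature at the MAP,
# inverse curvature as posterior covariance, linearized predictive variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.standard_normal(200) > 0).astype(float)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

w, prior_prec = np.zeros(2), 1.0
for _ in range(100):                                   # plain gradient descent toward a MAP estimate
    p = sigmoid(X @ w)
    w -= 0.1 * (X.T @ (p - y) / len(y) + prior_prec * w / len(y))

p = sigmoid(X @ w)
H = X.T @ (X * (p * (1 - p))[:, None]) + prior_prec * np.eye(2)   # curvature at the estimate
Sigma = np.linalg.inv(H)                               # Laplace covariance = inverse curvature

x_star = np.array([1.0, 1.0])
mean_logit = x_star @ w
var_logit = x_star @ Sigma @ x_star                    # linearized predictive variance of the logit
```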
And in deep learning, similarly, we also don't want to be fully Bayesian, because that would be very expensive. Instead, we first find fast solutions and then try to make them as good as possible. I think that's generally a good approach, whether we're talking about numerical computation or deep learning: if you insist on the perfect thing that is computationally intractable, no one will care. But if you do something that adds only a tiny bit of computational cost and gives you something you otherwise wouldn't have, that's value.

With that, I'm done with the main content of what I wanted to show you today. Here are the big high-level takeaways. Computation is inference: all of computation is managing information, manipulating it on a chip, whether the information is data loaded from the disk or comes from a human designer who writes down a law of nature that can be evaluated at various points. Whether the numbers come from the disk or are provided as information operators, the algorithm has to decide what to do with them. So numerical computations are active agents, acting in interaction with their data source. This is very clear when you solve a differential equation: the algorithm has to decide where to discretize the problem and how many computations to perform, so it has to actively decide what it does. But with data it's the same thing. If you do Gaussian process regression on a very large data set, or deep learning on a very large data set, it pays to think about how much of that data you actually want to use, how much you want to load, and how much computation you want to spend on it. If we take this connection seriously, we can build new, data-centric ways of doing computation that, if done right, may be more flexible, easier to use, and easier to generalize to different settings. They provide design patterns, like these information operators, Gaussian process regression, Kalman filtering and smoothing, and Laplace approximations, that can be used in very flexible ways to include information from various sources. And in the end, all of this is numerical computation. Machine learning, for the person who builds the learning machine, is numerical computation. So if you know how numerical algorithms work, both the classic ones and our reinterpretations of them, that allows you to be a better machine learning engineer. And if you don't, you're stuck with the tools someone else makes for you.

All right. I hope you'll still use the chance to give some feedback; this is the final feedback form, so if you want to write anything about the entire course, you can do so. Scan the QR code now, because I have a few more things to show and discuss, and of course we have to talk about the exam. But at this point, that was the content of the lecture.