Welcome everyone. I'm very happy that you're all here in this sizable group, especially for a specialist lecture that isn't even mandatory for anyone. And I'm really looking forward to doing this introductory lecture for this numerics of machine learning class. So before we get started, I first have to tell you why you should care about numerical algorithms, and then we can start talking about what they actually are. So maybe we can close some of the windows, not all of them; we have to find some balance between noise and oxygen, we'll see. And later in the year, heating is going to be a consideration as well. So if you ask the media, if you read about AI in the media, you'll learn that machine learning algorithms are computer programs that take data — so statements about the world, numbers that are listed somewhere on a piece of paper — and they turn them into models that you can use to predict something about the world, or maybe even to act in the world if you're really doing AI. So interestingly, the first thing is the input and the second thing is the output. But what actually happens in between is the stuff that you do, because you're the computer scientists, the machine learning engineers. So what's the thing that actually happens inside of the machine? That's a numerical computation. That's the bit that the computer does when it learns. And in contrast to classic AI, rule-based AI, contemporary machine learning algorithms use numerical algorithms as the primitives for these computations. What do I mean by that? Classic AI algorithms are algorithms like maybe nearest neighbor or alpha-beta search — algorithms that just run, find an answer, and then you know that they've done the right thing and you're done. Numerical algorithms are algorithms — and I can already tell you what they are, we're going to talk about them for the whole lecture today — they are linear algebra methods, simulation methods, integration methods, and optimization methods. All of these are methods that estimate a mathematical quantity that doesn't have a closed-form solution, at least not one that is tractable in reasonable time. And that's maybe a subtle distinction, but I'm actually going to make it a little bit formal and then I'll tell you why. So it's not a proper mathematical definition, but it's sort of a vague idea of what the difference is between what I might call an atomic algorithm, something you might have used in a classic AI solution, and a numerical algorithm. Put simply, an atomic operation is something that can't go wrong, and a numerical algorithm is something that can go wrong in subtle ways. So here's an attempt to define something. Assume you have some function that you care about, that you'd like to compute. It maps from some input space to the real numbers, because that's the only thing computers can actually do, and this function is kind of intractable. It's complicated, right? Now, you might call an algorithm a map that's implemented on the computer, on a Turing machine, that takes an input and computes an output, such that we hope that this output is an approximation to the true value of this function. I'm going to call an algorithm a numerical method if there are points in the input domain such that the algorithm accepts them, it runs, it decides, it terminates, it puts out an output, and that output is wrong.
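As a hedged way of writing that vague idea down in symbols (my shorthand here, not a formal textbook definition):

f : X \to \mathbb{R} \quad \text{(the intractable quantity we care about)}

a : X \to \mathbb{R}, \quad a(x) \approx f(x) \quad \text{(the algorithm, implemented on the machine)}

a \text{ is a numerical method if } \exists\, x \in X : \; |a(x) - f(x)| > \varepsilon .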
Here, the error epsilon is some machine precision error, where machine precision is up to your definition of what you mean by machine precision. So in short, an atomic algorithm is something that just always works, and a numerical algorithm is one that can go wrong. So why is this important? It's important because if you're using a tool that can break, you need to worry about it. If you're using something that always works, you don't have to spend brain cycles on it. You just call it and you're done. But if it's an algorithm that can go wrong, you are now in charge of making sure that it actually works. And the fact that these algorithms are central to machine learning means that you, as the machine learning person, have to think about it. So here are some examples. Atomic operations are the stuff that comes from the GNU libc, like the exponential function, the sine function, but also more obscure things like the complement of the error function, or the gamma function, or the log-gamma function, stuff like this. And they tend to come as plain functions: they're in scipy.special, or even in base numpy, or even in the math library of Python itself, so they're straightforward. Numerical algorithms, on the other hand, are methods that tend to have their own library, like scipy.sparse.linalg or scipy.optimize, and there's stuff in there, there are several different subroutines, and that's already your problem: there are several different choices to make, they all apparently do the same thing, and you need to figure out which one it is. So what do these algorithms actually look like? I want to give you a feeling for what we are talking about. Here is an example of an algorithm, it's on the Internet, this is a prototypical numerical algorithm, and here are a few things to notice about it. First of all, can someone guess what kind of programming language this is? It looks like this. It's not MATLAB, but it's the language that MATLAB was inspired by. It's FORTRAN. It's actually FORTRAN, I think, 95, which you can tell because of the weird indentation; I think in 77 you can't indent this way. So it's old. This algorithm was written in 1993, maybe a little bit before that, by a Swiss guy in Geneva, and it implements an algorithm that was invented by the two guys that are cited here at the top. They're called Dormand and Prince. In, I think, 1975 — I might be wrong about this, though — so it's a while ago, and it actually is an instance of a class of numerical algorithms called Runge-Kutta methods, which were invented in 1905 by two German mathematicians, Runge and Kutta. Hands up, who has heard of Runge-Kutta methods before? I'm already trying to gauge a little bit what the audience is like. That's about a third of you. So this is the first thing to note about numerical algorithms: they are old. The people who write them are not machine learning engineers. They use algorithms that were invented over a hundred years ago, and they implement them very carefully. Implementing them carefully means that this algorithm has a very precise interface. There are a lot of things you have to provide to it, so you need to know what you're doing if you call this algorithm — I'm going to tell you in a moment what it actually is that you have to provide here — and they have a really intricate structure. So if you look through this code — obviously there's no time to actually read it — but if you've ever looked at code like this before, you'll notice that it does some really fancy stuff.
For example, early on, there are some numbers being added up — I had a nice example somewhere; here, it prepares entry points — and then at the very end, there's a bunch of numbers. There they are. And they're important. So maybe, if you've implemented mathematical algorithms before, someone has shown you a piece of code like this and told you exactly what you need to do: these are the numbers that you have to use, and they are magic. You cannot change anything. If you go in and change one of those numbers — the minus 56 over here — if you change that to minus 50, the algorithm is not going to work anymore. So that may sound like a silly thing to observe, but it actually has a very tangible effect. It means that if this method doesn't do exactly what you want it to do, you're out of luck. There's nothing you can change about it; the numerical analysis professor has told you: this is it. That's the algorithm. Don't touch it. It was invented in 1975. Worked fine so far. So I'm showing you this to give you a sense of what the kind of space was that machine learning engineers encountered when the field emerged. Machine learning is much younger than these methods. The first international conference on machine learning was in the 80s; maybe the field kind of emerged in the early 2000s as a proper subfield of computer science. And whenever people encountered these numerical tasks within the new models they were trying to train, they inevitably noticed that there was already a method for it out there. It had been invented by other fields — by economists, computational physicists, mathematicians, and so on and so on — and it already did what it was supposed to do. So that was really convenient, because it meant people could just use the old code. They could take this fancy old code that was clearly very complicated to develop, and just use it and not touch it. It's convenient because it's fast; it saves you a lot of time. But it's also a little bit tricky, because it means that if the task we are trying to solve isn't exactly what this method is trying to do, we're out of luck. And I'm going to show you over the next few minutes examples of how this can go wrong. So this maybe gives an intuitive picture of the kind of algorithms we're talking about. Now we should think about what these numerical algorithms actually are that we need within machine learning, why they are a little bit different from the classic ones, and why we should care about how they work. And I'm going to use this opportunity to do double duty and also give you basically an overview of the content of the lecture course that we're going to cover over the course of this term. So everything I'm showing you is a little sneak preview of stuff that's going to come. The first class of numerical algorithms we'll look at are linear algebra methods. So where does linear algebra show up? It's obviously the most elementary thing ever; you learned about it in high school. Here is where — actually I should have asked you, but maybe it's too easy, right? — here is where linear algebra shows up in machine learning: in the elementary algorithms, the base layer of everything else, basically. So hands up — and this is again me trying to understand what the audience is — who has taken the probabilistic machine learning class already, last year, by Jacob Macke? That's over half, but also not much more than half. Who has not taken it?
Okay, so it's probably half-half. So if you've taken the class, then you've heard about Gaussian process regression as a base case of Bayesian inference. It involves this thing up here: learning a function — that's this red thing — from some observations — which are this black stuff — interpolating between those points, with uncertainty, as you can see here. So you do that, Bayesian inference, by multiplying a prior with a likelihood. The prior is a Gaussian process, and if the likelihood is independent Gaussian — so i.i.d. observations of function values, over here — then if you multiply those two, you get back a Gaussian process posterior. If you haven't heard about this before, it doesn't matter, you don't have to understand it. If you've heard about it before, it's pretty obvious, right? And then you've learned that this object is defined by two objects again, right? Two functions: a posterior mean function and a posterior covariance function. Those define — this m of x is this thick red line here in the middle, for those who don't know, and this v of x is a bivariate function that is basically the source of these samples here, these golden lines. So these objects — you've written them down in your class, if you've taken it — look like this. The posterior mean involves taking the data y, so that's a vector, and multiplying it with — at least that's the way you see it on the slides — the inverse of a matrix, and that matrix arises by filling its entries with evaluations of a function called the kernel. So these are kernel methods; another way of thinking about these algorithms is that they are kernel methods. Then you invert that matrix, you multiply it with another matrix that also arises from filling up a matrix with values of a function, and that's your function. That's what this thing looks like. That's this red curve over here. So that's the stuff you see on the slides when you go into a probabilistic machine learning class. It's not the stuff you do when you do your homework in that class. So, anyone who's taken the class already can tell me what the knee-jerk reaction is once you implement it. Yes? Use the Cholesky decomposition, right? Professor Macke probably told you: if you ever write numpy.linalg.inv, like, invert a matrix, it's already a bug, right? You should never ever have a reason to compute the inverse of a matrix. Instead you should use something called the Cholesky decomposition. So — I've lost focus on my slides. So the code probably looks like this. This is the elementary piece of Python code you would use. You import some fancy stuff called cho_factor and cho_solve from scipy, and then you fill up this matrix, you build this matrix that we need to operate on, and then you call this fancy method. And this fancy method returns something which is called the Cholesky factorization, and you can use that to compute this object that we care about. So you can use this to solve this issue here, by solving inverse times y, basically: find the vector a such that G times a is y, and then you multiply from the left. So that's fine. So this piece of code is again one of these instances of this sort of fancy, fancy code. So this is the Cholesky code in scipy.linalg. If you have a look at this and you try to figure out what's going on, what is it actually doing? Let's go: cho_factor, okay, that's nice. This is what we're looking for. What does it do? It calls _cholesky; that was somewhere up here. What does that do? It calls — oh, it does this.
So it calls a function that does all the work. It's called potrf — does anyone know what that stands for? Okay, and that's exactly the problem. That's it. If you read through the rest of the code, it's just all input checking. There's nothing else happening in this Python code. It just calls this thing called potrf, and that's the thing. So, has anyone heard of LAPACK or BLAS? What this is: it calls one of these elementary LAPACK routines underneath, sitting on top of BLAS operations — actually it's not elementary, it's level 3. So what this is: PO means positive definite, because this matrix is positive definite, and TRF means it computes a triangular factorization. And where does it do that? Well, you can try for yourself to hunt for the piece of code that does this. You won't find it. It's Fortran code that's hidden away in a library that's already installed on your machine. So you can't even look at the source code on GitHub, at least not in this Git repo. So again, we're in a situation where we can't really think about what to do about this algorithm. So why is this a problem? What's the problem with kernel machines? Maybe you've heard that they are not particularly common in contemporary machine learning anymore. Why? Yes, they don't scale well to large amounts of data. So this matrix here is of size n by n; it's of size number of data points by number of data points. How expensive is it to compute this Cholesky decomposition? Just shout it out, someone already said it. n cubed. So why n cubed? Does anyone remember their high school days of solving systems of equations? It doesn't actually matter how Cholesky works — you're going to learn about that next week. What you can think of is the stuff you did in high school. So you write down — this is probably not going to work, but let me just wave my hands around — you write down the equation you're trying to solve: G times a is y. And then you multiply from the left with a matrix until you have, on the left-hand side, a unit matrix, and on the right-hand side you have the solution. So to do that, you start with the first row of your matrix, which is n long, and you want to have a one and then lots of zeros. So you divide by the first entry and subtract that line from all the lines in the matrix. The line has length n, and there are n rows that you have to subtract it from; that costs n squared. And then you have to do that for every row. So that's n cubed. So how many data points do we have in a reasonable data set these days? A million, maybe? It's not a really big data set, but it's reasonable. So a million is 10 to the 6. 10 to the 6 cubed is 10 to the 18. How many seconds does a year have? 10 to the 7 — pi times 10 to the 7. How many operations does your computer do per second? 10 to the 9, maybe 10 to the 10, depending on how beefy your GPU is. So let's say 10 to the 9. 9 plus 7 is 16. It's 100 years. So we're not going to do this. So people say kernel methods don't work. They don't scale, actually, as you said. But they don't scale because you call this function, and that's cubically expensive, and it's one of these methods that just runs, and you have to wait for it to finish, and before it's finished it doesn't give you anything. So in lectures number 1, 2 and 3, from Marvin and Jonathan, who are sitting at the back over there, you're going to learn that there are actually other linear algebra algorithms. And those are the kind of algorithms you actually need to understand in order to use them, otherwise they won't work for you. You have to babysit them a little bit. They are not like classic algorithms.
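To make both ends of this concrete, here is a hedged little sketch — the kernel, data, and tolerances are made up, and plain conjugate gradients is only a stand-in for the more refined iterative methods the lectures will actually cover. First the knee-jerk exact solve via the Cholesky factorization, then the iterative alternative that stops at a tolerance instead of paying the full cubic cost:

import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.linalg import cg

# made-up kernel and data set, just to have something to solve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = np.sin(X) + 0.1 * rng.standard_normal(200)
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)   # RBF kernel
G = k(X, X) + 0.1 * np.eye(200)       # kernel Gram matrix plus noise, symmetric positive definite

# the knee-jerk reaction: exact solve via the Cholesky decomposition, O(n^3)
chol = cho_factor(G)                   # ends up in LAPACK's potrf
alpha = cho_solve(chol, y)             # solve G alpha = y
m = k(np.linspace(0, 10, 50), X) @ alpha   # posterior mean at some test points

# an iterative alternative: conjugate gradients, each iteration is one matrix-vector product,
# and you can stop early if a rough estimate is good enough
alpha_cg, info = cg(G, y, maxiter=100)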
But if you do, they can speed up computation massively. For example, if you know something about your data set — if you know that it has structure — you can use that to make the computation way faster. In the most extreme cases, even exponentially faster. Not always, but sometimes. And if you're happy with the algorithm not returning the perfect solution but just an approximation, the algorithm isn't even cubic to begin with. It's already faster. And you don't have to wait 100 years for your method to train. Which is really not a surprise, because you may have learned in your theory class that what this algorithm actually does is solve a least-squares problem. So it minimizes a function, just like deep learning. It minimizes a very simple function, actually — a quadratic one. Why should it be harder to minimize a simple function than to minimize a complicated nonlinear function like a deep neural network? So the argument that kernel machines are too expensive stops way short of the truth. It's much more tricky, actually, much more detailed. And if you know why this is, you can actually build solutions with Gaussian process regression — with all the nice properties of Gaussian process regression — that scale to very large data sets. That was instance number one: linear algebra. Let's do setting number two: simulation. So, simulation. And this is the moment where anyone in the room who has taken a numerical analysis class, who is studying or used to study math in a previous life or still does — it's different people, interestingly, with actually some overlap — will have to hold their nose, because here's my definition of a simulation algorithm, and it's quite vague. I'm going to call a simulation method an algorithm that simulates the trajectory of a dynamical system through time. So in a sort of abstract sense, this involves equations like this, which are statements about a curve x of t — a multivariate curve, actually, so x might be a vector — that don't actually have x explicitly as a free variable anywhere in the equation. It's an implicit equation, right? So there's something wrapped around it. On the left-hand side there may be something we might call a differential operator, a nonlinear function of this curve. This includes the special case where the operator on the left is simply the derivative operator with respect to time; then this is called an ordinary differential equation. Or it may be a more complicated thing, where the operator on the left is a potentially nonlinear differential operator that also involves partial derivatives of x with respect to its arguments; then this is called a partial differential equation. And there's an even more general setting, where there's just one expression on the left-hand side that operates on the curve and its derivatives, and on the right-hand side it just says: equal to zero. That's called a differential algebraic equation. So methods that solve these problems, that estimate x, you may call simulation methods, and they show up in machine learning in two actually quite different, almost orthogonal, domains. On the one hand, they show up whenever we build acting agents. AI is also about building machines — a robot, or a self-driving car, or a reinforcement learning agent that plays a game — that have to be able to predict what's going to happen next, so that they can change their behavior to create a beneficial future, to create a trajectory that they are happy about.
You could call that control or reinforcement learning or robotics or whatever, it's all the same thing. These methods need to predict the future, so they need to predict the behavior of dynamical systems. And then there's another setting that's currently emerging as a really big trend within machine learning, called scientific machine learning, which uses simulation tools to encode physical knowledge about the world: differential equations. Physics and chemistry and most natural sciences encode knowledge about the world in laws of nature, and these laws tend to be differential equations, partial differential equations typically: Schrödinger's equation, Black-Scholes for economics, Navier-Stokes for climate models, and so on. So if you want a machine learning algorithm that makes use of this physical insight, then you need to be able to solve differential equations, and I'm going to give you a very simple example of such a setting to tell you why you need to know how these algorithms actually work. So, by the way, what are the solvers for an ODE, for a setting like this? It's this algorithm that I showed you before; that's actually the prototypical method for solving these kinds of problems. So here's an example from our own work. This is from a paper that was published at NeurIPS a year ago by Jonathan Schmidt, who's also sitting there in the back. This is a data set that everyone knows. Maybe you haven't seen this plot before, but you know what this is about, right? These are COVID cases in Germany over the first one and a half years of the pandemic. So you may remember that back then, the thing that everyone cared about, that you went to bed looking at before falling asleep, was: how does this curve continue? Does it keep going up? Is it coming down? What's the next thing that's going to happen, right? So this is a prediction problem. It's exactly the thing that machine learning is made for. It's even a very simple prediction problem — it's just about 500 points on a one-dimensional axis. It doesn't get any easier than that, right? If you have AI systems that can invent images like the stuff on the right, then surely you can look at that line and predict what's going to happen next, right? So if you are a machine learning expert, not a numerical mathematician, your first approach might be: I'm going to use one of my fancy machine learning algorithms. What's your favorite machine learning algorithm? A deep neural network, right? Let's use a ReLU network to regress on this function and predict what the next thing is. This doesn't work. And actually, if you've learned about ReLU networks, you may know that they extrapolate linearly, so it's not a surprise that this looks like this. It doesn't have to go to zero, it could also go up; it's just a bit of an accident with this data set and where I stopped. It could also have been a linear extrapolation upwards. It doesn't matter. It's all equally useless, right? You could use some fancy Gaussian process with a cool kernel. It's not going to work either. It just produces some line, maybe an exponential increase, maybe a straight line, whatever, right? None of this is useful. The structure behind this data set, the structure that causes it and that we need to use to predict, is that people are getting infected with a virus, right? So there are differential equations that model this system. The simplest ones, actually the ones that everyone uses, even though in somewhat more advanced versions, are these SIR models.
Hands up, who has heard of SIR before? Ooh, okay, so I'll tell you. It's a very, very basic idea. This is actually the model that even the most fancy simulations used, the ones that were looked at all over the world; just that in the fancy versions it's not just three lines, it's like a thousand lines, but they do the same thing, which is that we separate society into three groups: people who haven't had the disease yet, people who currently have it, and people who have already had it. We call them susceptible, infected, and recovered — SIR — and you move from one group to the next. So if you're still susceptible, you're in the top group, and then there's a chance of you getting infected. If there are many infectious people and a lot of susceptible people, then a lot of people get infected and move over to the infected group. So these two terms are the same. P is the size of the population; it's just a constant. And then once you have the virus, you can infect people. Now you're in the lucky group that gets to stay at home, and you move out of this group by recovering with a certain rate. So if many people are infectious, many people will recover, and if no one is infectious anymore, nothing happens anymore. It's one of these ordinary differential equations. So this is exactly the kind of thing I can apply this fancy algorithm to, the one that is up here. And actually, I can tell you what I need to do. I need to encode this function, these three lines of differential equation; I need to tell it where to start the integration, what the initial values of all three variables are, up until where I want it to solve, what the end point of the simulation is, and then a couple of tolerances — and it should actually work (there's a little sketch of that call right after this). So this is exactly what the numerical mathematician told us to do. It's just — do you notice something about this equation that I haven't told you yet? There are two numbers in there. They're called beta and gamma. What are they? You don't know. So I set beta to 0.1 and gamma to 0.1, and I let it run, and it looks like this. That's useless. So what is the problem? You can play around with this beta and gamma. Actually, I did; I moved them up and down a little bit so that the curve looks like this. But no matter how you choose beta and gamma, you'll always get one hump coming back down again. Never two or three. Why? Because the contact — this beta is called the contact rate — changed through time. And we all remember, because we had to stay at home to reduce beta. That was the entire point. People went home, then they came back again, people didn't quite follow, there was another wave, and so on and so on. All this dynamic has to go into beta. So how do we get this equation, though, into this algorithm that's supposed to do this simulation? Here is why it's bad that some applied mathematician in 1905 came up with this setting. It's called an initial value problem. It assumes that you know what the differential equation is, and we don't. But we have something that could tell us what the differential equation is. We have this red line. Notice how this red line has nothing to do with the algorithm we use to solve the problem. There's no way to put it in, because this fancy algorithm doesn't have an entry for data. So what could we do? Well, we could make beta time-dependent. But how are we going to do that? Let's make it the output of a neural network, because we're machine learning engineers, so that's the best idea we have. So that's a parameterized function.
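Here is a hedged sketch of that plain initial-value-problem recipe with fixed beta and gamma — the population, initial values, and time span are made up, and scipy's default RK45 solver is a Dormand-Prince-type method, the same family as the Fortran code from before:

import numpy as np
from scipy.integrate import solve_ivp

P = 83e6                   # population size (illustrative)
beta, gamma = 0.1, 0.1     # the two magic numbers, guessed

def sir(t, x):
    S, I, R = x
    dS = -beta * S * I / P
    dI = beta * S * I / P - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

x0 = [P - 1000.0, 1000.0, 0.0]   # made-up initial values for S, I, R
sol = solve_ivp(sir, t_span=(0.0, 500.0), y0=x0,
                method="RK45", rtol=1e-6, atol=1e-9)   # right-hand side, start, end, tolerances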
It has a bunch of parameters, theta, and we're going to optimize through the ODE solver. There's just a problem: this code doesn't have derivatives. There's no backward. It's Fortran code from 1980 — what? — 1993. People didn't know about automatic differentiation back then. There's no autodiff. Thank God for Google: someone implemented this algorithm for us in JAX. It was, I think, originally Jake VanderPlas, this guy, or Matt Johnson, the main developer of JAX. They've actually, as you can see, committed to this repository maybe two days ago. So here is their ODE solver, jax.experimental.ode. What kind of ODE solver is it? Oh! It's exactly the same method. It's now implemented in fancy JAX. And it still has — let me go and find it — it still has those numbers in there. Right? So the first one is a 1 over 5. Let me see. The first one is a 1 over 5, or 0.2. I'm sure Matt and Jake actually went through the original Fortran code and copied in the exact correct values. If you change one of those numbers, the algorithm doesn't work anymore. Right? So even though we now have fancy contemporary machine learning toolboxes, we haven't changed anything about these algorithms yet. So why is this a problem? Well, the problem is that here we define this Runge-Kutta step. That's a single step of this algorithm. Step? What? It does steps? Yeah, it does steps. So where is the main algorithm? I'll scroll a little bit, I hope that I'll find it. There we go. This is the magic line where the steps are performed. So this is JAX, or actually, yeah, there's a scan. That's sort of a funny way of hiding a for loop. This algorithm actually does a for loop inside. So when we call gradient descent to fit this curve that we originally had onto our data, we're actually calling a for loop, and every computer scientist knows that nested for loops are potentially a dangerous thing. Right? This approach is called simulation-based inference or approximate Bayesian computation, and it's actually a meaningful way of solving these kinds of problems. It's actually the current way of solving these kinds of problems. And some of you here in the room are working on these methods, so you know very well how to use them, in even more fancy versions. It's a bit silly to use it on a simple problem like this; you would actually use it on a much more challenging problem. Right? So let's say that this inner for loop actually evaluates this function f that we don't know, many times over. And then once it has evaluated that many times over, it steps outside, returns the gradient, does a gradient descent step, and then starts again, calling f many, many times over. And you'll learn exactly how to do that differently from Jonathan and Nathanael, who are going to teach lectures number, what, 4, 5 and 6. And what they will tell you is that you can build your own algorithm, a method that constructs this fancy list of numbers inside of the Fortran code in a procedural way and adapts them, maybe in light of this data. And the way this is going to work is — and I'm only spoiling it a little bit — that they're going to tell you that you can start with something you could think of as a spine in your algorithm, a spine that you can hang operators onto, and that spine is going to be a Markov chain. Anyone heard of Markov chains before? Nod vigorously if you have. Not everyone has, so you'll learn about Markov chains — very good that not everyone's nodding, because Jonathan is going to give an entire lecture on inference in Markov chains.
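Before the Markov-chain story, just to pin down mechanically what "optimizing through the ODE solver" means: here is a hedged sketch in JAX. The parameterization of beta, the loss, and all the numbers are made up (in the lecture's setting, beta would be a small neural network), and the actual approach in the paper is the filtering one described next, not this outer-loop gradient descent:

import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint   # the Dormand-Prince solver discussed above

P = 83e6   # population size, made up for the sketch

def beta_t(t, theta):
    # hypothetical parameterization of the time-varying contact rate
    return jax.nn.softplus(theta[0] + theta[1] * jnp.sin(theta[2] * t))

def sir_rhs(x, t, theta):
    S, I, R = x
    b, gamma = beta_t(t, theta), 0.1
    return jnp.array([-b * S * I / P, b * S * I / P - gamma * I, gamma * I])

def loss(theta, t_obs, I_obs, x0):
    xs = odeint(sir_rhs, x0, t_obs, theta)     # the hidden for loop (scan) lives in here
    return jnp.mean((xs[:, 1] - I_obs) ** 2)   # fit the infected counts to the red line

grad_loss = jax.grad(loss)   # every call re-runs the whole simulation, then differentiates through it

# made-up data and initial state, just so the sketch runs
t_obs = jnp.linspace(0.0, 500.0, 100)
I_obs = jnp.zeros(100)
x0 = jnp.array([P - 1000.0, 1000.0, 0.0])
theta = jnp.array([0.1, 0.1, 0.01])
g = grad_loss(theta, t_obs, I_obs, x0)   # one gradient-descent step would use this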
In that lecture you'll see that you can think about the states that you're trying to simulate as a sequence of probability distributions, and that you can then condition these probability distributions on something we'll call information operators, which provide information, right? One kind of information could be that there is an ordinary differential equation and it holds. So some states in this big X will be derivatives, and we'll find out that the value of this derivative is given by the value of some function f evaluated at the curve X. But you can also plug in other operators — like, for example, data: you know that you have observed data that tells you the curve goes through some points. And you can plug those in within the same for loop. And you can use this extra data to inform the algorithm about stuff that you don't know yet, like our contact rate beta. And you can do that all in a single solve, forward through time, without calling a for loop over and over and over again in an outer loop. And you'll get a plot like this, which at the same time infers the contact rate through time in a non-parametric fashion, with a Gaussian process posterior, and predicts into the future. And they'll tell you how to actually do this. And the result is obviously going to be an algorithm that runs potentially much faster and, maybe more importantly, is way more flexible than this primordial piece of code that has been handed down to us from, what, 1975 or 1905, depending on who you ask. And we need to do that, because these days we have data. When mathematicians built these numerical algorithms, they didn't have data. In 1905, data consisted of some guy outside walking around with a notebook, writing down some numbers, coming back into the lab, drawing a curve through them on plotting paper, and saying: what is the law of nature that we're going to use? There were no computers, no hard drives. But these days machine learning has created a data-centric view on computation, and the algorithms that we use, even at fancy Google, are still 100 years old, so it's time to change them. You're going to find out from Jonathan and Nathanael, and then also afterwards from Marvin again, whom you've seen on a previous slide, that you can build these algorithmic spines that can operate on different information operators; that they can actually provide algorithms that have the same computational complexity and efficiency as the classic algorithms if you run them on the classic setting, but create new functionality in the form of various information operators; and that, if you're really careful about them, once you know how they work, you can even use them to solve partial differential equations and even differential algebraic equations. Okay, that was simulation. So we've had linear algebra and simulation. Now, right before Christmas, in the two weeks before everyone goes home — actually one week where everyone's still here and then one week where everyone's at home — we're going to talk about integration, and you're going to see me doing that part, which actually has a reason.
The reason is I don't have a PhD student working on integration at the moment, and that's because I currently don't know how to make it any better; that's why I'll have to do the lecture. And so in integration I'm not actually going to introduce you to particularly fancy new algorithms like in the previous two chapters; instead, we're going to talk a little bit about what actually happens during computation and get a good feeling for the most elementary kinds of operations, and we'll also cook some food. I'll briefly tell you why integration is important for machine learning, though: it's actually an elementary operation of Bayesian inference. For the probabilistic side of machine learning, the elementary operation is this thing up here, computing a posterior distribution — or you could call it a conditional distribution — and you do that by taking a joint distribution and dividing by a marginal, and marginals are integrals. So if you want to do Bayesian inference in a generic fashion, you need to be able to solve these kinds of integrals. They're typically of this kind of form, at least in the naive way of thinking about them. And spoiler alert: the simple story is going to be that the standard way to do this in Bayesian inference is called Markov chain Monte Carlo, and it's actually the worst of all possible algorithms; it's sadly also the only really proper, generally applicable algorithm, and we'll find out why. But we'll realize that there are special settings, in particular low-dimensional integrals or very structured integrals, that also have really nice properties, and they'll tell us something about the relationship between the good old classic algorithms that I've shown you before and maybe a new way of thinking about computation. And then in January, after the Christmas break, we'll move to what's arguably the most central numerical operation of machine learning, which is — anyone want to guess? Someone? Matrix multiplication? Matrix multiplication for me would be a lower layer, a part of linear algebra; we'll talk about matrix multiplication next week. Optimization. The foundational operation of machine learning — or maybe the basic template of machine learning — is empirical risk minimization. This involves computing a value theta of some parameters that minimizes some loss function, where the loss function happens to have the property that it's a big sum over many terms, and each term in the sum depends on all the parameters but only one datum in your data set. There are various ways of thinking about these problems. You could call them exchangeable models or empirical risk minimization, and what the loss function actually is depends on your model, but the prototypical one is a deep neural network, where theta are the weights of the network and the y_i are a placeholder for the data, which could be inputs and outputs for a supervised problem, not just outputs. And why are these algorithms actually so important? Because they are the basis for something that I think is synonymous with machine learning now, but that some people actually think of as maybe a more general class of algorithms: differentiable programs. So, programs for which you can hope to compute the gradient of this function by applying a gradient operator, and if your programming language is rich enough — I think after 2007 or so — you can compute this gradient automatically and apply it to the individual terms in the sum.
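In hedged shorthand (my notation, not necessarily what's on the slides), the two templates just mentioned are:

p(z \mid y) = \frac{p(y, z)}{p(y)}, \qquad p(y) = \int p(y, z)\, dz ,

so the marginal in the denominator is an integral; and

L(\theta) = \sum_{i=1}^{n} \ell(\theta, y_i), \qquad \theta^\ast = \arg\min_\theta L(\theta), \qquad \nabla_\theta L(\theta) = \sum_{i=1}^{n} \nabla_\theta \ell(\theta, y_i),

the empirical-risk-minimization template, where each term touches all of theta but only one datum y_i.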
So algorithms for this purpose are called optimization algorithms, and just as for the other domains we spoke about before, there are classic methods for them, and they are stored in scipy.optimize. So what would such algorithms be? What's your favorite optimizer? Adam. What about the mathematicians that raised their hand before, who had math classes? Adam? BFGS — I was hoping for someone to say that. So if you go into scipy.optimize — actually, maybe there is now an SGD in there, I don't know — I think in scipy.optimize there are methods called minimize, and they call algorithms like BFGS. Who has used BFGS before, like actually written code that calls BFGS? Okay. Did you like it? Ah, okay. So it's interesting, because BFGS is a wonderful algorithm: it doesn't have a learning rate, it stops on its own, you never have to stop it, it just stops when it's done, you can start it from pretty much anywhere, it just converges to a minimum, and it works on pretty much any differentiable problem. But somehow, when I ask a machine learning class in 2022 what your favorite optimizer is, nobody uses BFGS anymore. So why is this? This is actually an interesting case, because in the previous two or three chapters I showed you old algorithms that are sort of being used without reflection, right, where someone just takes the code from 1984, applies it in 2022, and ports it line by line from Fortran to JAX. In optimization we're much further: there, machine learning has noticed that the old algorithms don't work and has built new methods, but in doing so we lost a few things. So if you've trained a deep network before, you know that you have to choose a learning rate, that you have to watch this method run, you have to restart it 50 times and see which one works best, you have to decay the learning rate after a while, do all sorts of weird fancy tricks to make it stable, and then you have to figure out when you're done — the thing doesn't even stop on its own, you have to wait and then you have to say: stop, let's see what it looks like. So why is this? Well, it's because no one actually does this thing that is up here. No one actually minimizes the empirical risk. What you do instead is you realize that it's 2022 and n is big, right? n is a billion data points, maybe, and the cache of your GPU can hold four of these terms in L, right, because you have to load the y's, and they're images, and they're big; and so you have 8 GPUs and you can hold 32 or so. So what you do instead, when your optimizer asks for a gradient of the thing, is you don't actually compute the gradient. Instead — I've lost focus again — instead you draw a batch. You ask your machine — actually there's a fancy piece of code for it called the data loader — for a random sample from your data set, from the disk, and then you compute a much, much smaller sum over a lot fewer terms, maybe 4, maybe 16, 32, 256, something like this. And if you do that right, that's actually a good idea, because if you manage to draw these data points i.i.d., so independent and identically distributed, from your data set, then this is an unbiased estimator of the thing you want. It's a random variable now, but its expectation is the same as the thing you're interested in. And in fact, it's a sum of i.i.d. random variables, so if m is big, it's almost Gaussian distributed. Now, annoyingly, m isn't actually big, it's 4, right? But if it were big, it would be Gaussian distributed. So in our computation — inside of the computation in the GPU, when we ask it to compute a number — we're not getting back the number we actually want, we're getting back a random variable that is distributed around the thing we care about, the loss function.
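Again in hedged shorthand: with a batch B of m indices drawn independently and uniformly from the data set,

\hat g_B(\theta) = \frac{1}{m} \sum_{i \in B} \nabla_\theta \ell(\theta, y_i), \qquad \mathbb{E}_B\big[\hat g_B(\theta)\big] = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell(\theta, y_i) \propto \nabla_\theta L(\theta),

and for large m, by the central limit theorem, approximately

\hat g_B(\theta) \sim \mathcal{N}\big(\bar g(\theta), \Sigma(\theta)/m\big),

with a mean we actually want and a covariance describing the noise that the batching introduces.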
It's distributed according to some complicated probability distribution, which is, if we are lucky, a Gaussian distribution. And Gaussian distributions have two parameters: they have a mean and a standard deviation — or a covariance, actually. So here's the thing we want to know. We don't know it, though we can compute this thing, that's our gradient. But there is something else, which is the noise around this observation, the covariance. So, if you've trained a deep neural network before — and all of you have, I guess; I'm not even going to ask, because it's going to be embarrassing for the three people who haven't — you're probably calling something like this. This is the torch code for it; maybe you're using something else. So you define — you load your data, this is the bit that does the random batching, right; then you define your model and your loss function, and then you call the fancy thing that computes the gradient, and then you can ask for the gradient in the parameters. That's this part of this likelihood, of this computation. But what about this part? Where is it? Have you ever thought about why this is not part of param? It's odd, no? There's a probability distribution in our computation, it's a stochastic computation; probability measures have statistics, not just a mean. Why does torch not tell you the variance? Say again? It doesn't know. So why does it not know? What would we need? Is it expensive, maybe? Maybe it's too expensive; maybe the smart people at Facebook writing torch have decided that this would be too difficult to do, so they've left it out. How expensive would it be to compute a variance? Well, here is a thought. This is the thing we're currently computing, here's our gradient. We do this over a batch, so there's actually an array in memory, kind of, right: there's a long gradient, and then there are m of those, one for each element in the batch, right? And they've been computed, because clearly we can do that — that's what your computer does, the fancy backprop pass, that's the expensive thing that you use your GPU for. You might as well just use those numbers in the array and square them element-wise, right? That will give you a non-standardized second-moment estimate — but the mean is also just an estimate; I mean, these are both estimates of the same type, right? So why not do that? It's not expensive: you just take numbers that are already in your RAM anyway, or your VRAM or whatever, or your cache, you square them — squaring is not expensive — and then you sum them up and you return back that number, and that gives you the diagonal of this covariance matrix. But still, it would be good, no, to have an error bar on each gradient element? So why do we not do that? Here's my hypothesis of why we don't do that: it's because we still think like numerical analysts in 1905. We think optimization is about computing a gradient, and then that gradient goes into a function called, you know, BFGS update, or gradient descent, or whatever — it's like a one-line contractive operator that uses the current gradient and applies it to the weights, and that's it. That's what optimization is, right? That's how all the optimizers work. But they ignore the fact that they don't actually have access to the number they're trying to compute; they only have access to a random variable. So the fact that it's a random variable should show up, should be part of your deliberations when you build the algorithm.
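As a hedged back-of-the-envelope sketch of that argument — pretend the per-example gradients for one batch are already stacked in an array, which is essentially what the backprop pass over the batch produces anyway:

import numpy as np

# hypothetical array of shape (m, d): one flattened gradient per element of the mini-batch
m, d = 32, 1000
per_example_grads = np.random.randn(m, d)   # stand-in for the real backprop output

g_mean = per_example_grads.mean(axis=0)                  # the usual mini-batch gradient
g_second_moment = (per_example_grads ** 2).mean(axis=0)  # just element-wise squares, same cost
g_var = g_second_moment - g_mean ** 2                    # diagonal of the covariance estimate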
So we actually built a piece of code — I mean, I say we; Felix and Fred wrote a piece of code in 2019, with a paper at ICLR in 2020 — to show that you can compute the variance. And you just change your code a little bit: you import this cool Python package, BackPACK for PyTorch, you extend everything, and then you say: do the backward pass, but give me the variance. And here it comes, bam. So we can do that, we've done it now, and now we can start thinking about what to do with it. Well, that was 2020, it's two years since then; it's a bit of a tall order to, you know, solve deep learning, basically, so we don't quite have the answer to it yet, but we're going to tell you a little bit about it. After Christmas you're going to hear lectures by Frank, who is at the back behind the camera, and Lukas, who is sitting in the best seat of the house, and Agustinus, who is — is he actually here? no, he's already gone — who is probably going to do the lecture from Toronto, where he's going to move next, and Julia, who is at the back of the room, about how to think about optimization — or actually, maybe not how to think about optimization, but instead how to think about training a deep neural network, and not even trying to use the word optimization anymore, because it's misleading. We're not actually trying to find the minimum of a function; we're trying to use random variables that we can probe, to find a point that generalizes well and gives a well-trained deep neural network. They're going to tell you that uncertainty arising from this randomness in the computation is extremely important — it's actually maybe not even randomness, because randomness sounds like you can't control it, and you're sampling from your data set, so you have a lot of control over it — and that this is very core to the issue of how to train deep neural networks. They'll tell you that you can compute various observables, and that you can use those to understand better what goes on inside of your deep neural network, and therefore to build better training setups for your deep neural network. And yeah, spoiler alert: they're not going to tell you exactly how to do it so that it always works, because at the moment no one in the world knows. I know that for a fact, because I personally am of the opinion that very few people in the world know more than Frank about how to train a deep neural network, and he doesn't know either, I think. But they'll also point out that you can use these observables, the ones you may want to use for training, to add new functionality to your deep neural network. For example, you can add uncertainty to the output; you can make it a Bayesian deep neural network without using the fancy Markov chain Monte Carlo algorithms that are so expensive, but instead do that in a way that basically doesn't add any cost — it just provides uncertainty without pain. And then there are still going to be a few things that you can't do in this way, for which you have to do the laborious bit of building an outer loop and optimizing your hyperparameters, but even that is something you can think about properly, and Julia is going to tell you how to do that in the very last content lecture at the end of the term. So that's the outlook. Yes? — You're getting a different population; the question is that, if the optimizer moves while it runs, you get a different distribution every time you take a step. Yes, and you have to deal with that. But you can still call one of these lines every time; the method you're building has access to the gradient and the variance, and actually it has access to every single value available — you can use every row of this array, every term in the sum, and operate on those.
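Just to close the loop on the BackPACK call from a few minutes ago, here's a hedged sketch of what that extra line looks like; the toy model, loss, and data are made up, and the exact API details live in the BackPACK documentation:

import torch
from backpack import backpack, extend
from backpack.extensions import Variance

# made-up toy model and data, just to show the shape of the call
model = extend(torch.nn.Linear(10, 1))
lossfunc = extend(torch.nn.MSELoss())

X, y = torch.randn(32, 10), torch.randn(32, 1)
loss = lossfunc(model(X), y)

with backpack(Variance()):
    loss.backward()   # the usual backward pass, plus the extra bookkeeping

for param in model.parameters():
    print(param.grad.shape, param.variance.shape)   # each gradient element now comes with an error bar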
And then you can do even more things: you can compute second derivatives of this in certain directions, and so on. So that's the detailed story — those are some sneak previews of what you can expect over the coming term. And maybe you've noticed that there is a bit of a theme running through all of the stories I've now told you. I said at the beginning that a numerical method is a method that estimates something, such that it can go wrong, and maybe you know this from your own experience with machine learning algorithms: methods that estimate something can go wrong. I've told you that linear algebra methods can be much faster if you're willing to accept that they are imprecise, if you're willing to accept a rough estimate — maybe you know something similar from the analysis of machine learning algorithms, where you can get away with a smaller data set if you're willing to accept imprecision in the estimate. I've told you that simulation methods can be realized in the form of Bayesian filters conditioned on information operators, which doesn't sound like a numerical algorithm at all; that actually sounds like a machine learning method. And now I've just told you that when we do deep learning — or actually, when we do differentiable programming on large data sets — the computation involves a probability distribution, which I've written down as a likelihood, in your computation. Likelihoods are also not really terms that show up in a numerical analysis class, but they do show up in a machine learning class. So actually it seems like these numerical algorithms that we use as the tools to train machine learning algorithms are machine learning algorithms themselves, and there is a real, deep philosophical insight behind that. Numerical algorithms of the type that I've defined are algorithms that estimate an unknown quantity — you could call that a latent variable, let's call it z — like the location of a minimum, the value of an integral, or the curve that solves a simulation problem. And they estimate this latent quantity, which you can't directly write down, by evaluating tractable observations: values of the integrand at various points, matrix-vector products, or these random variables that come out of GPUs when you evaluate a gradient on a batch. So they estimate a latent quantity from data, and that's exactly what inference is. They are performing inference; they are estimating a latent quantity z given that they have observed the result of some computation c. The only difference to Bayesian inference as we know it in machine learning is that they collect their data not from the hard drive, by reading data points from a data loader, but by calling the processing unit. So they use the computer as a data source, but in the end it's just numbers. And in the example of simulation you've even seen that we can use those numbers very tangibly, coming into the estimation process from two sides: data and compute. So it's possible to think of numerical algorithms as learning machines, and actually, if you think of them as tools, that kind of makes sense. If you think of the tools that an engineer uses — a mechanical engineer, to build mechanical machines — they are also machines: a power drill is a machine, a CNC lathe is also a machine, but you can use it to build more machines. So if you're a good engineer, you don't just know the machine you're building, you also know the tool, the machine you use to build the machine.
So if you want to be a good machine learning engineer and build big fancy machines that do fancy stuff, like make this image of tools in the background, you should probably know what the tools are that you use to build these machines, in which sense they are machines themselves, and how you should set them. This is the idea behind an even bigger mathematical step, which is to say: well, if you can already think of what a numerical algorithm does as Bayesian inference, why don't we just build numerical algorithms from the ground up, phrased as Bayesian inference, as numerical algorithms that perform this operation here in the middle? These methods are called probabilistic numerical algorithms, and they are algorithms that take a probability distribution describing their task, then use the CPU or the GPU as a data source to refine their estimate of what the solution to that numerical task is, and return a posterior distribution over that solution, such that the posterior distribution hopefully has good properties: it's ideally centered around the true solution, with a spread that has something to do with how far the center is from the true solution. So they quantify uncertainty. These algorithms are discussed not just in this lecture, but also in a book, which you don't have to buy at all. It came out a few months ago; if you want to, you can get a free PDF under this URL, but you don't even have to read the book to come to this class. The book is not going to be the foundation of this — I mean, the stuff we're going to discuss is in this book, but we're not going to refer to individual chapters of the book at all; you can just look at them if you want to. This is mostly just a, you know, sneaky way of introducing this book. What you're going to learn in this class — and the overarching theme from the first to the last lecture — is that when you think of computation as machine learning, you can do cool stuff with computation, and that cool stuff is the same cool stuff you can also do with your learning machines. For example: combining different sources of information — empirical data from the disk, and computational or mechanistic, so physical, knowledge, and data from compute — and investing resources either into collecting data or into computing more numbers, which is also resource-intensive, to find a good solution. You can add uncertainty to the result of a computation and use that, for example, to do these trade-offs between how much data and how much compute you invest. And you can do what in Bayesian inference you would call type-2 maximum likelihood or evidence maximization: you can try and find parameters of your computation — hyperparameters, like learning rates — by some kind of outer inference loop around the inner inference loop; that's also Bayesian inference. So this class, as you may have noticed by now, is definitely not your average numerical analysis class. We're not in the math department here; you are rarely going to see proofs in these slides, and we're not going to talk about numerical algorithms in the language that you may have encountered in a numerical analysis class. And that is on purpose. It's not because we don't know how to do it, or because we're not mathematicians — well, also that — it's because we think that if you're doing machine learning and you want to understand how the machine works inside, you need to think about the machinery inside like a learning machine. So it makes sense to think in the language of machine learning. It's also really convenient for you, because that's the sort of mental toolbox you already have, so you might as well use it to build new algorithms.
It is a computer science class, so you're going to write a lot of code, and I'm going to tell you about that in a moment. But maybe most importantly, it is a very idiosyncratic class. In fact, I think this is the very first instance of a dedicated numerics of machine learning class in the whole world. I've looked around a little bit, I've tried to Google for it, and I couldn't find anything other than my own book. So this is an opportunity for you to be there right at the moment when these ideas kind of emerge, as something we could try out, as one more way to improve learning machines — not to solve everything, but to contribute meaningfully to how our algorithmic landscape evolves. That means you're going to learn what actually happens inside of the box. You'll look at these old algorithms and try to understand how they actually work. Sometimes that means looking at old algorithms, not inventing new ones — just understanding what they do, because that's important. But then, once you know, you'll learn how to change them: how to move away from algorithms that contain magic numbers, and instead compute those magic numbers on the fly while the algorithm runs, and adapt them, for example, to data, so that you build new methods that play more nicely with their surroundings. That's important because people out there, for example in the industrial world, have now realized that machine learning algorithms tend to be a bit difficult to employ in a bigger pipeline, and if we want them to play nicely with surrounding infrastructure, they need to have these kinds of properties. And last but not least, expert knowledge in numerical algorithms, or in the algorithmic side of machine learning, is maybe one of the highest levels of expertise for machine learning engineers. So if you want to be a well-employable engineer, or even researcher, it's a good idea to know exactly how learning machines work on the inside. I've also already mentioned — and at this point we're sort of slowly moving towards admin — that I'm actually not going to teach most of the lectures. Instead, they're going to be taught by PhD students from my group, these eight people, who are all here in the room except for Agustinus. And that's also on purpose. I'm actually going to be here as well most of the time, except, I think, once or twice, and it's not because I wanted to take some time off — well, maybe — but actually because I think that for many of the settings we'll talk about, these people are really the experts on these algorithms, because they've spent, typically, several years of their lives already really thinking about the minute details of these algorithms. So, for example, I think Jonathan is at the moment one of the maybe ten people in the world who best understand how to solve these least-squares problems; we've actually been at workshops together where five of those ten other people were around, I think. And Nathanael has been through all the ways in which this can go wrong, and he'll tell you about them — but he's still doing it, so it can't be all that bad. And Frank is in a collaboration with people at Google Brain, trying to figure out if there's any way to train deep networks better, and if all the people he's in constant video chats with over there don't know how to make it better than he does, then I don't know who would. So I think we have the right people here to really tell you exactly what's going on, and you can ask them the hardest questions you can think of, and if they don't know the answer, well, then that's a good guess that this might be an interesting research topic to work on.
And last but not least, expert knowledge in numerical algorithms, in the algorithmic side of machine learning, is maybe one of the highest levels of expertise a machine learning engineer can have. So if you want to be a highly employable engineer, or even a researcher, it's a good idea to know exactly how learning machines work on the inside.

I've also already mentioned, and at this point we're slowly moving towards admin, that I'm actually not going to teach most of the lectures. Instead, they're going to be taught by PhD students from my group, these eight people who are all here in the room except for Agustinus, and that's on purpose. I'm actually going to be here as well most of the time, except for I think once or twice, and it's not because I wanted to take some time off, well, maybe, but actually because I think that for many of the settings we'll talk about, these people really are the experts on these algorithms. They've typically spent several years of their lives thinking about the minute details of these algorithms. For example, I think Jonathan is at the moment one of the maybe ten people in the world who best understand how to solve these least-squares problems; we've actually been at workshops together where five of those ten other people were around, I think. And Nathanael has been simulating the world and knows all the ways in which this can go wrong, and he'll tell you about them, but he's still doing it, so it can't be all that bad. And Frank is in a collaboration with people at Google Brain, trying to figure out if there's any way to train deep networks better, and if the people he's in constant video chats with over there don't know how to do it better than he does, then I don't know who would. So I think we have the right people here to really tell you exactly what's going on, and you can ask them the hardest questions you can think of. If they don't know the answer, then, well, that's a good guess that this might be an interesting research topic to work on. So I hope that you're going to be excited about getting lectures from real experts, not a professor.

So that's the content bit. My proposal is that I briefly use a few minutes to tell you about the admin part, which of course has to be part of a first intro lecture, and then there's still time for questions if you want. Maybe you can use that time to let questions bubble up in your head.

The first question, of course, is how this is going to work: what is the day-to-day life in this lecture going to be? We already know we have lectures every week on Thursday, and there's going to be a tutorial every week on Friday. There's no tutorial this week, and also no tutorial next week. The tutorials will be here in this room on Friday at noon. We'll hand out exercise sheets starting next week, and every week the exercise sheet will consist of two parts.

The first part is just a service to you. I know that when a class is taught for the first time, in a format like this one, there's always the question: what's going to be in the exam? I want to know what the exam is like, there are no old exams to look at, please tell me what's in the exam. So to answer those questions, at the top of each sheet there will be a question that could actually be in the exam. Of course we're not going to ask that exact question in the exam, but questions like it. You don't have to hand in that question at all, and in fact we're not going to correct it if you submit a solution, but you can ask about it in the tutorial. The example question obviously shouldn't take you more than 10 minutes to solve, because the exam is 90 minutes long and there are going to be 10 or so of these questions.

The second part is the one that's actually the work: the other part of the exercise is coding. Every week you'll get a Jupyter notebook. I've learned, I didn't actually realize this before, that Jupyter stands for Julia, Python, and R, so there'll be sometimes Python code and sometimes Julia code for you to complete, and typically those questions will consist of at least two parts. You'll submit your solutions, by the way, by exporting the notebook as a PDF with the outputs and just uploading the PDF, because otherwise we'll have dependency hell with everyone submitting from different environments.

Those exercises will only be graded in a binary way: for each question, the tutor, who is going to be the PhD student who gave the corresponding lecture, is just going to make a tick mark if it's okay as a solution, or a cross if it's not a good solution. To get a tick mark you have to make an earnest attempt to solve the problem. You don't have to solve it perfectly, but it's not okay to just say "ah, I tried, it didn't work", right?
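One more practical note on the PDF export I mentioned a moment ago: there are several ways to do it, and any of them is fine. If you happen to have nbconvert, pandoc and a LaTeX installation around, one possibility is a small script like this; the file name here is only a placeholder, not a prescribed naming scheme.

from nbconvert import PDFExporter

# Render the executed notebook (with its outputs) to a single PDF for upload.
# "sheet01.ipynb" is a placeholder for whichever sheet you're handing in.
pdf_bytes, _resources = PDFExporter().from_filename("sheet01.ipynb")
with open("sheet01.pdf", "wb") as f:
    f.write(pdf_bytes)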
Every time you get a tick mark, you get a point in the exam. The exam will have 100 points, so if you get all of the exercises correct, you have 26 points, because there are 13 sheets, and that's about halfway to passing the exam. You also have to get, and that's actually a typo on the slide, I'm going to correct it afterwards, 5 exercise sheets, so just over a third of all the exercise sheets, graded as sufficient to be admitted to the exam. So basically you have to do some coding, otherwise you can't go to the exam. That's important because we think it's important for you to code while you're learning about this stuff, and you can only really learn about algorithms if you actually use them yourself.

Then there will be an exam, of course. The first exam will be on the 13th of February, in here, and the second exam on the 31st of March next year, also in here. You don't have to take the first exam if you don't want to; you can just come and do the second exam. However, if you don't take the first exam, then come for the second exam and fail it, don't come afterwards and ask us for a third attempt, right? We're not going to teach this class again a year from now, so if you fail this class completely this term, it's going to be very difficult to retake it; you're going to have to wait a while. So I recommend that you take the first exam, and if you fail the first exam, you can still take the second one. But of course I would hope that no one fails the first exam, and then it's all good.

Okay, then some simple things I probably don't have to point out, but you've already noticed: we're recording this lecture, there's a person with a camera in the back. If you're coming into this room, you're consenting to being recorded. Don't worry about it though, we're not actually recording the audience. I've had a look at the cutout before, it's only me, so you can't really be seen, but if you ask a question you may be just about audible on this microphone. We're not going to put the recordings on Ilias or anywhere, or on YouTube, right after the lecture. We're going to do that at some point, after we've cleaned up the recordings and made them into a bit of a structured thing; we also want to think about how to release them on the YouTube channel. So this recording isn't there for you to follow along during the lecture, because we want you to be physically here and ask questions. I've learned from experience that once we do this virtually, people don't come anymore: the room is suddenly empty, but there are 40 people on Zoom, you just talk to a wall, there are no real questions, and it's precisely the people who should ask questions who don't. So let's be in a room together.

The final point I want to make is that this is a specialist lecture; it's not exactly foundational, standard starting-out knowledge for machine learning. So if you're interested in doing a master's thesis, or maybe even a PhD, on numerical algorithms for machine learning, or on data-centric computation if you like, then it may be a good idea to take this class and to let us know, once you're a little bit accustomed to the content and have seen what we do, that you're interested. There's interesting stuff to work on in every single lecture, and we're currently looking for both master's students and PhD students, to work both on applications, for example climate and weather forecasting, and on algorithmic solutions, for example integrating simulation and deep learning in a more holistic way than is currently being done by, say, neural ODEs and other kinds of algorithms, and even on new software engineering tools for deep learning; you'll hear in a later lecture by Frank and Lukas what that actually means.

With that, I'm at the end. Let me briefly summarize, and then there's time for questions. I've tried to convince you that numerical algorithms are really important for machine learning. They are what drives the learning machine; they are the engines of machine learning. So if you want your engine to run well, you need to understand it. And you're lucky, because it turns out that this engine, the tool that you use to build a learning machine, actually is a learning machine itself: it estimates an unknown quantity from observations, and that means you can phrase it in the language of Bayesian inference, which is really the language of machine learning. By embracing this viewpoint and taking it seriously, you can learn how to build machine learning solutions that work faster, that are more reliable, that don't break down as easily, or at least know when they do, and that are easier to use, because they provide information to the user that other methods can't.
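Just to make that claim a little more tangible before I wrap up, here is a tiny, deliberately naive sketch of what it can mean for a numerical method to be a learning machine: a few lines of Bayesian quadrature, where we put a Gaussian process prior on the integrand, condition on a handful of function evaluations, and get back not just an estimate of the integral but also an uncertainty about it. The kernel, its length scale, the grid-based approximation of the kernel integrals, and the test integrand are all just illustrative choices for this sketch, not the constructions we'll derive properly later in the course.

import numpy as np

def rbf(x, xp, lengthscale=0.5):
    """Squared-exponential covariance between two sets of points."""
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / lengthscale ** 2)

def bayesian_quadrature(f, a, b, nodes, lengthscale=0.5, grid_size=2000):
    """Estimate the integral of f over [a, b] from evaluations at the given nodes,
    returning a point estimate together with a standard deviation."""
    x = np.asarray(nodes, dtype=float)
    y = f(x)
    K = rbf(x, x, lengthscale) + 1e-10 * np.eye(len(x))   # jitter for numerical stability

    # Kernel integrals, approximated crudely on a fine grid to keep the sketch short
    # (for this kernel they also have closed forms in terms of the error function).
    grid = np.linspace(a, b, grid_size)
    dx = (b - a) / (grid_size - 1)
    z = rbf(grid, x, lengthscale).sum(axis=0) * dx         # z_i is roughly the integral of k(s, x_i) over [a, b]
    kk = rbf(grid, grid, lengthscale).sum() * dx * dx      # roughly the double integral of k over [a, b] x [a, b]

    mean = z @ np.linalg.solve(K, y)             # posterior mean of the integral
    var = kk - z @ np.linalg.solve(K, z)         # posterior variance of the integral
    return mean, np.sqrt(max(var, 0.0))

# quick check: the integral of sin(x) over [0, pi] is exactly 2
est, std = bayesian_quadrature(np.sin, 0.0, np.pi, nodes=np.linspace(0.2, np.pi - 0.2, 9))
print(f"estimate {est:.4f} +/- {std:.4f}, truth 2.0000")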
To get there, we're not going to ignore classic numerical algorithms. We're not going to throw them out and never think about them again; that would be bad, it would be throwing away over a hundred years of very hard work by mathematicians and other people. But we're also not going to slavishly follow the viewpoints of classic numerical analysis. Instead, we're going to adopt the viewpoint of machine learning and think about these algorithms as learning machines.

So I hope you want to join us on this ride. It's the very first time that a class like this is taught in this way, and it's also an unusual setup, with these nine different people teaching it. So I think it's going to be an exciting experiment, and I hope that most of you will want to stay. Thanks.