I think so. Is there any signal picked up? It's too low. All right, yes. This will be a crash course on optimization. The main goal of this lecture is to familiarize ourselves with the concepts used in optimization, so that when we are done, we are ready to go out, read the documentation of these packages, and understand what they are doing. I'll do the first part, focusing on the basic concepts, and Gregor will develop these concepts further and show some of the algorithms in action.

So, if we look at optimization, a lot of real-life problems can be rewritten in terms of optimization, and they share some characteristics. For example, finding the cheapest airfare from your home institution to Berkeley: we are looking for the airfare that is the cheapest. Or solving for the ground state of a protein, a molecule: we are looking for a ground state, something at the lowest energy. Or optimizing an investment portfolio: we are maximizing the profit, looking for the most money. Or training a neural network to classify images: we want to minimize the error rate. Data compression, model fitting, rejecting noise from measurements: all of these are optimization problems, because in each case we are looking for something that is the cheapest, the lowest, the most, and so on.

Mathematically, we can write these optimization problems in a normal form, and they all share this structure. The variable x is a vector of variables; we also call them unknowns or parameters, and they all refer to the same thing. And we have an objective function, a scalar function, sometimes also called the cost or the penalty. The function is defined on the vector x, and x lives in some domain, which can be real, like R^n here, or discrete, like the natural numbers. We want to minimize the objective function f over the domain, subject to a set of constraints, and there are two types of constraints. Type one is c_i(x) = 0; that selects a subspace from the space where we are solving for the minimizer. The other type is c_i(x) >= 0; that cuts out a region from the space. So the first one reduces the dimension, and the second one cuts out a region. This is the normal form of optimization. Every optimization problem, if we spend a little bit of time on it, can be rewritten this way, and that's the only kind of problem we can solve.

Within the normal form, there are several types of optimization problems, and we'll look at a few examples; there are many of them, and you can combine the terms and come up with new types. The first type is what we call linear programming, which basically means the objective function is a linear combination of the variables, the parameters. These linear problems are usually easier to understand and interpret: if we jack up one of the parameters, the cost goes up; if we lower it, the cost goes down, or something like that. It's very intuitive.
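(To make the linear case concrete, here is a minimal SciPy sketch; the cost vector, the single equality constraint, and the non-negativity bounds are toy values invented for illustration, not taken from the lecture.)

```python
import numpy as np
from scipy.optimize import linprog

# Normal form with everything linear: minimize c . x
# subject to c_1(x) = x1 + x2 + x3 - 1 = 0 and x_i >= 0.
c = np.array([1.0, 2.0, 3.0])        # made-up per-unit costs
A_eq = np.array([[1.0, 1.0, 1.0]])   # the equality constraint selects a subspace
b_eq = np.array([1.0])
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3)
print(res.x, res.fun)  # all weight goes to the cheapest variable: [1, 0, 0], cost 1.0
```

The equality constraint selects a subspace and the bounds cut out a region, exactly as in the normal form above.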
This type is often used in fields that are not traditionally highly quantitative, for example economics or the social sciences. People build linear models for very complex systems, hope they work, and solve those systems. What I want to say is that we should watch out, though: some nonlinear problems can be rewritten so that they become linear again. For example, if the objective in x1 and x2 is the sum of x1 and the square of x2, and we do a variable transformation replacing x2 squared with t, then the problem is linear again. I think these are called generalized linear problems.

The theme in what you're writing is that x is going to be the thing that you're varying? Yes, it's the variable; x is always the thing you vary. And I know you were completely clear about that and I'm not complaining. I'm just saying that for many people in this room, x is not the thing you vary, and if they saw a sum of c_i x_i, they would think the c_i's are the things you're varying. That's right. I'm just pointing out, in my role as translator, that for the mathematician, x is the thing you're varying. Is your background in math, by the way? No, but my reference book was in math. Yeah, good. I just wanted to emphasize that to keep us on track. Yes, yes. Here c stands for the parameters; x is the unknown that we are solving for. If you recall elementary school: x is the unknown. That's the convention the mathematicians are following.

Another big class of optimization problems is discrete problems. If any of the parameters are limited to an integer domain, the problem is discrete. For discrete problems, the structure of the domain is usually hard to describe, because you hop from point to point, and how does the objective function change when you hop from one point to another by a large distance? That's usually hard to describe, and therefore we have to use some sort of brute-force search over the entire domain to find a solution. But these problems are usually very interesting, because they look like puzzles. A typical problem people solve in computer science when they start programming is the eight-queens puzzle: place eight queens on a chessboard so as to minimize the number of queens sharing the same row, column, or diagonal. We know the minimal solution has zero collisions. You can write a computer program to search all of the possible configurations and find the solutions. Or an even more interesting problem, Sudoku: following the Sudoku rules, find a solution where every group of nine numbers is unique in the given pattern. Even the recently popular computer game 2048 is an optimization problem: given the sequence of numbers produced on the board, we optimize the actions we take, up, down, left, right. The actions we take are the variables, the x, and the cost function is the largest number standing on the board; we want to maximize that number by tweaking our actions for any given input. Yes, there was a particular robot that can consistently reach 16384. Consistently, yes. But the parameter space of this problem is really, really big, right?
You can imagine: at every step we have four options, and there are thousands of steps. It's a really, really hard problem. We have to use heuristics, rules you roughly want to follow, like keeping the largest number in a corner, to help with the brute-force search, but eventually it reduces to brute-force search. So this is typical of discrete problems.

Another type, more benign, is continuous optimization. If all of the parameters are real numbers, the problem is continuous. One special case: if some of the variables are complex numbers, the problem is still continuous, because each complex number is just two real numbers. In optimization it's very typical to replace a complex number with two real numbers. These continuous problems are benign even though the domain looks denser; we have many more choices now that we are not limited to integers, but because the objective is defined on a continuous domain, we can make some assumptions about the neighborhood of any point or region we have already explored. We have Taylor's theorem; we can expand the function; we have the gradient, the Hessian, all of these, and we can make very useful predictions. With this information we can prune the domain (the word in computer science is culling), removing regions that are obviously wrong, and shrink and shrink all the way to the minimum. Based on this idea, people have developed very powerful iterative algorithms that find solutions in very few iterations. If you think about it, the domain is really huge, but in 10 or 20 steps we are already very close to the minimum.

A special type of continuous optimization is the least-squares problem, similar to regression. In a least-squares problem the objective is written as a sum of squares, and we want to minimize it: least squares, because we are finding the minimum of a sum of squares. Because of this special square structure, there is a special class of methods for it: the Gauss-Newton method, and an extension of it named after two people, Levenberg and Marquardt. It's a very similar algorithm, but it's only for least-squares problems. Sometimes these least-squares problems get really, really big, and there are papers and active researchers trying to exploit the sparseness of the problem. Sparseness means each function h_i depends only on a small subset of the parameters; then the problem is sparse, and there are automatic tools to deal with the sparseness. An extension of the least-squares problem is to add a lasso constraint: we require the sum of the absolute values of the parameters to be smaller than some number. This leads to the parameter-reduction methods in machine learning and related fields. All of these things are related.
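(As a concrete least-squares sketch, a toy exponential fit invented for illustration, not the lecturer's example: SciPy's least_squares takes the vector of residuals h_i rather than the scalar sum of squares, and method="lm" selects a Levenberg-Marquardt implementation.)

```python
import numpy as np
from scipy.optimize import least_squares

# Fit y ~ a * exp(b * t) by minimizing the sum of squared residuals h_i(x).
t = np.linspace(0, 1, 40)
rng = np.random.default_rng(2)
y = 2.0 * np.exp(-1.5 * t) + 0.05 * rng.standard_normal(t.size)

def residuals(x):
    # x = (a, b); least_squares wants the residual vector h(x),
    # not the scalar sum of squares
    a, b = x
    return a * np.exp(b * t) - y

# The default method 'trf' is a trust-region variant; method='lm' is the
# classic Levenberg-Marquardt implementation (MINPACK).
fit = least_squares(residuals, x0=[1.0, 0.0], method="lm")
print(fit.x)   # should land close to (2.0, -1.5)
```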
The last interesting type I want to talk about is stochastic optimization: some of the parameters are random variables (strictly speaking that's not quite the right way to put it, but it's roughly true). In that case, we have to optimize the expectation of the objective, because the objective itself is essentially a random variable. What I want to say is that this is different from stochastic gradient descent. In stochastic optimization, the problem is stochastic; stochastic gradient descent is an algorithm, an algorithm that uses some randomness to help us find the solution quicker. These two things are different: we have stochastic problems, and we also have stochastic algorithms. Many problems are stochastic. For example, if we want to optimize the profit of a stock portfolio, the performance of each stock is random. For these stochastic problems you usually have to go to the literature and look up dedicated algorithms.

So let's go back to the main focus, which is unconstrained optimization of a smooth function; think of something like a ground-state energy, though that is not exactly one of those least-squares problems. There are several types of minima. I've drawn an example; the red line here is the function. Obviously there is one global minimum, almost at the center left, and then there are several local minima. The one on the right is particularly interesting: it's a flat plateau. That type of minimum is not a strict minimum, because for any minimum on the plateau, there are many minima of the same value in its neighborhood; it's not strict because there are equal values in the neighborhood. There are also types of non-minima: there are maxima, and on the left I have also shown a stationary point, a saddle point, where the derivative is zero but the function is neither minimized nor maximized. These distinctions are also important.

Now, some frequently used names in optimization, a few attributes of the objective function that we use all the time. One is the objective function f itself; it's a scalar. Then we take the derivative; because x is a vector, the derivative is a vector, and it's called the gradient. In some packages it's also called the Jacobian; in SciPy, for example, it's called the Jacobian, and I don't really know why. What matters most, if you want a direction to go in, is this guy, the gradient. The second-order derivative is the Hessian. As you'd imagine, the more derivatives you take, the more dimensions the quantity acquires, so the Hessian is a matrix. And watch out: in the literature, the symbol H often stands for the inverse of the Hessian, so there is a minus-one twist there.

If we have all three of these, the function is smooth up to second order; it has a continuous Hessian. And if the function is that well behaved, all the way down to the Hessian, we can derive from Taylor's theorem necessary and sufficient conditions for minimizers. Necessary: if a point x* is a local minimizer, then we can assert that the gradient of the function there is zero, and the Hessian is positive semi-definite. Positive semi-definite.
Yes, the Hessian has to be positive semi-definite. And we can go the other way around, the sufficient condition: if we know that at a point x* the gradient is zero and the Hessian is positive definite, then we can prove that x* is a strict local minimizer. It's local, not global. So this looks useful, but what if I want a global minimizer? It turns out that if the function is even better behaved, all local stationary points are global minimizers. These well-behaved functions are called convex functions, so we have to bring up the concept of convexity. The mathematics is in the references, but the idea is very simple. Pick two points x and y in the domain of the function, and draw the straight line between the function values at x and y: the dashed line. The number alpha controls which point between x and y we are looking at; that's the blue line, which intersects both the function (the red line) and the dashed line. If the intersection with the red line, the function value, is always below the dashed line, the straight chord between x and y, then the function is convex. If this is true for any pair x, y and any alpha between 0 and 1, the function is convex. For convex functions, any stationary point (meaning the gradient is zero) is a global minimizer. This is a very important attribute, and usually, when we translate a physical problem into an objective function, we assume the function is convex without actually proving it, and just hope it works. That's one of the things we usually do: assume convexity, apply the algorithm, find a bunch of stationary points, and check whether those points are minimizers, to validate whether the initial convexity assumption was correct. We don't usually go out and literally prove it.

So, we solve these problems with algorithms, and an optimization algorithm does nothing but produce a sequence: a sequence of x, a sequence of parameter vectors. Yes, that would be great. No, I was just going to say that the linear least-squares problem I did on the first day is a very good example of a strictly convex problem, because the chi-squared is literally a paraboloid in the parameter space, so it has this property strictly, everywhere, and that's why, when you do the linear algebra, you get the exact global solution. And I guess the L1, the lasso that you mentioned, is also convex. But there aren't that many convex functions known, I think. All right, so we have a few examples of convex functions: squares, and the lasso.

So, the algorithms produce a sequence such that the limit of the gradient along the sequence of parameters is zero. In practice we don't really go to k equals infinity; for a practical algorithm, we truncate the sequence and say, okay, the value at the hundredth iteration is good enough as an approximation of the minimum. But mathematically, when we prove an algorithm's correctness, we show the behavior of the gradient as k goes to infinity. The general structure of an algorithm is really, really simple; all of these iterative algorithms follow the same structure. (I think I'm missing a vertical line between the two lines on the slide; I forgot to fix it.)
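(In pseudocode-ish Python, that generic structure looks roughly like this. This is only a sketch: the proposal and convergence test are passed in as functions, standing in for whatever a given algorithm actually does; nothing here is a real SciPy routine.)

```python
import numpy as np

def minimize_iteratively(propose_step, converged, x0, max_iter=100):
    """Generic shape of an iterative optimizer: propose, test, repeat.
    Returns (x, code): code 0 means converged, 1 means iteration limit hit."""
    x = x0
    for _ in range(max_iter):
        x_new = x + propose_step(x)   # evaluations plus linear algebra
        if converged(x, x_new):       # the "didn't move much" test
            return x_new, 0
        x = x_new
    return x, 1  # couldn't converge: report an error code

# Toy usage: gradient-descent proposals on f(x) = |x|^2.
x, code = minimize_iteratively(
    propose_step=lambda x: -0.1 * 2 * x,                      # -step * grad f
    converged=lambda a, b: np.linalg.norm(b - a) < 1e-8,
    x0=np.array([1.0, -2.0]))
print(x, code)   # x near [0, 0], code 0
```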
So, in any step of the iteration, the first thing we do is propose a new position, a new x. This usually involves multiple evaluations of the objective, the gradient (the Jacobian), or the Hessian matrix, and then some linear-algebra black magic gives us a new proposal. Then we check whether the sequence has converged, whether it has moved a lot or not; if the sequence didn't move much, we say the problem has converged and we halt the algorithm. Any algorithm has to halt in the end. If not, we go back and make a new proposal based on what we found out about the previous one. That's the general structure. Sometimes the algorithm can't find a solution and would keep looping and looping, so we have to put a limit on the number of iterations, stop there, and report an error. That's why most root-finding or optimization routines return an error code saying "I couldn't find a solution", or an error code of zero, which means it found an optimal solution and converged.

Since we're talking about a sequence, there is an order of convergence: how fast the sequence approaches the true minimum. There are several orders; I've written the names down, getting faster from left to right. The first one doesn't really have a name; I call it "blind", which basically means the algorithm doesn't know anything, it's just trying. It's the slowest. A little faster than blind is linear order, then quasi-Newton, and the fastest we usually see is quadratic. The order of convergence is tightly related to the amount of information we make use of during the optimization: if we use more information about f, or make more assumptions about f, you can imagine we arrive at faster convergence.

The blind class of algorithms makes use of only the objective function itself. That's why it's blind: it doesn't really know which direction to go next. Algorithms in this class include the simplex algorithm and an extension of it, the Nelder-Mead algorithm; Nelder-Mead is implemented in SciPy as well. The basic idea is to take a region, a simplex, a shape in the domain, and shrink it and move it such that we always believe we bracket a minimum. A bit faster is linear convergence; these algorithms make use of the gradient, just the gradient itself. Algorithms with linear convergence are gradient descent and, a better version, conjugate gradient. A bit better than linear are the so-called quasi-Newton methods, between linear and quadratic. These algorithms approximate the Hessian using information from the gradient; they assume the Hessian is continuous, which is why that works. Algorithms in this class are faster than linear, but they don't require you to construct the full Hessian matrix, which is usually expensive. One algorithm here is BFGS, which Gregor is going to talk about. Another one, favored by some authors but not something I see used by many people, is SR-1, the symmetric rank-1 algorithm.
These quasi-Newton methods share the same feature: they produce an approximate Hessian, similar to the true Hessian, built from gradients, and make use of it. The fastest algorithms we'll talk about are the quadratic ones, which make use of the actual Hessian to find the minimum. Newton-conjugate-gradient is one of them; it's a line-search algorithm. Two others, dogleg and trust-region Newton-conjugate-gradient, use trust-region methods. We'll talk about this later. Yes? Yes, yes. So, the order of convergence only roughly tells you the number of iterations; as we go down the list and use more information about the function, each iteration gets more expensive. That's where it becomes something of an art, with rules of thumb about which algorithm to use. And you can imagine the sweet spot is somewhere in the middle: either linear or quasi-Newton, which is what people use in practice. I also want to mention that the order of convergence is a property of the algorithm, but sometimes the problem itself is very hard; the condition number measures the difficulty of the problem. If the problem is hard, no algorithm helps much: even the fastest algorithm will converge slowly on a harder problem. The limiting case of a hard problem is something like 2048. That's a hard problem, right?

So here are a few examples; I hope it's fairly readable. I ran four different algorithms in SciPy, representative of the different orders, trying to minimize the Rosenbrock function, which is a standard benchmark for optimization. The top left panel is the blind algorithm, Nelder-Mead. You see it arrives at the minimum, but it's very noisy. Each cross on the left panel, in the contour plot, corresponds to an evaluation of the function; the solid line connects the iteration points. On the right panel of the figure, the noisy blue wiggle is the objective as a function of the number of function evaluations. You see it decreases as the algorithm proceeds, but slowly. Eventually, to get a satisfying solution, Nelder-Mead needed 110 function evaluations.

And we can go an order higher: conjugate gradient (CG) is the panel below. You see it's a lot more efficient; the objective drops faster, and it did 53 function evaluations of the objective. But the reduction in function evaluations comes at a cost: now there is also njev, the SciPy name for the number of gradient evaluations, and it did 53 gradient evaluations too. So it does more computation per step, but takes fewer steps to reach the minimum. The top right is BFGS, the quasi-Newton method; I used a green line to mark the trajectory. The objective reduces more smoothly than with CG, indicating the algorithm is more efficient, and the number of evaluations is actually lower: it did 32 evaluations of the objective and also 32 evaluations of the gradient.
So, basically, it used fewer iterations than conjugate gradient, and it arrived at the solution faster. The second-order algorithm is the last panel, bottom right: the trust-region Newton-conjugate-gradient method. It used 21 objective evaluations and found the minimum. It's very efficient, but at the cost of doing 19 evaluations of the Hessian, and as David Hogg mentioned, evaluating the Hessian is always expensive. So the algorithm itself is actually slower, even though, measured in iterations, it's faster: fewer iterations, but a slower algorithm. I was just looking at the BFGS plot and noticed that within two iterations it's actually really close to the minimum; after that it's just getting finer and finer. So that's also something to notice, I guess. I'd like to make another comment: these algorithms are implemented in SciPy, and there are two things. One is the algorithm itself, the theory behind the process, and the other is the implementation. So when you compare these four, we have to keep the implementation side in mind; if you use different codes, you may get different results, right? Yes. So comparing implementations of, say, Nelder-Mead in different packages is a comparison between codes of the same algorithm, which is also important. Yes, yes, that's true. Yes, I do have the code, but I have yet to upload it somewhere.

Talking about SciPy: one of the goals of SciPy is to have an implementation of every popular algorithm. It doesn't have to be the fastest; it has to be there so people can use it, see it, and play with it. People also make commercial packages for optimization and throw in all kinds of black magic to make them faster for special types of problems, and you have no idea what they are doing, because the source is closed; and even if they gave you the source, you probably wouldn't know what they are doing anyway. So the SciPy algorithms here are just a reference. The comparison is just making the point about the trade-off between the cost of more function evaluations and expensive evaluations of the Hessian.
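(The comparison is easy to reproduce. Below is a minimal sketch along those lines; the starting point is an assumption, the classic (-1.2, 1), since the lecture doesn't say which one was used, so the exact evaluation counts may differ from the figures. nfev, njev, and nhev are SciPy's counters for objective, gradient, and Hessian evaluations.)

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([-1.2, 1.0])   # assumed starting point, not from the lecture
for method in ["Nelder-Mead", "CG", "BFGS", "trust-ncg"]:
    res = minimize(rosen, x0, method=method,
                   jac=rosen_der if method != "Nelder-Mead" else None,
                   hess=rosen_hess if method == "trust-ncg" else None)
    print(method, res.x, res.nfev,
          getattr(res, "njev", "-"),   # gradient evaluations, where tracked
          getattr(res, "nhev", "-"))   # Hessian evaluations, where tracked
```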
So, there are different ways of proposing new positions along the sequence we are trying to converge, the new x_k, and there are two flavors of these algorithms: line search and trust region. The first time I saw "trust region" in the Ceres documentation, I decided it was something I didn't understand and wasn't going to use. So I think it's important for us to know what a trust region is, because Ceres is actually a good library. Line search is simpler and easier, and it's usually the first example in any optimization textbook. The idea is simple: we first propose a direction, then along that proposed direction we find a step size that reduces the function, or its gradient. The step size doesn't have to be perfect. As long as the direction is good, the step size only needs to do two things: first, provide a sufficient decrease in the objective; second, provide a sufficient decrease in the gradient. There is a theorem about this; the conditions are called the Wolfe conditions. You can look them up on Wikipedia if you're really interested; they're kind of long. But that's the basic idea: if the step size is guaranteed to be good enough by these two criteria, the algorithm will always find a solution. The art here is more in the proposing of the direction. Sometimes we can make use of the Hessian for the direction: if we want to go second order or quasi-Newton, we propose the direction using the Hessian matrix. But the limitation is that the Hessian has to be positive definite, because we solve for the direction from the Hessian, and if you want to solve for a direction from some matrix, it had better be positive definite. If the Hessian is not positive definite, we need to make some compromises. Therefore line-search algorithms are really useful for linear algorithms, which don't use the Hessian, and for quasi-Newton algorithms, which use an approximation to the Hessian: because it's an approximation, not the real thing, we can require it to be positive definite by construction, and then use line search. So much for line search.

The more complicated flavor is the trust-region algorithms. The idea behind a trust region is this: we want to find a region in the domain of the function such that, within this region, a known form, usually a quadratic form, is a good approximation of the function. Then, within this region, we find the minimum of the known form, move the region there, and develop a new region. Every time, we are trying to cover the function with a piece that we know. That's the idea behind the trust region. If you think about it, this is most natural for second-order, quadratic methods, because the known form is quadratic: we approximate the function with a quadratic form, so we'd better know the function up to second order. And also, because we are approximating the function with a known form, the Hessian of the function doesn't have to be positive definite; as long as the function has a Hessian, we can write down the quadratic approximation. Similar to line search, the minimum we find in the region doesn't have to be the perfect minimum. There is a special point called the Cauchy point, and as long as the point we propose performs better than the Cauchy point, the algorithm will converge. One thing about the Cauchy point, though: it's not a particularly good point. The name is Cauchy, and yet it's not particularly good. That's because it turns out to always lie along the direction of steepest descent. So if a trust-region algorithm picks the Cauchy point at every step, it's no faster than gradient descent: it becomes a first-order, linearly convergent algorithm. Doing better than the Cauchy point is what the dogleg proposal does, but the dogleg actually requires a reference direction, and therefore a positive definite Hessian. The algorithm I showed, trust-region Newton-conjugate-gradient, does not require a positive definite Hessian. So trust region is most natural for quadratically convergent algorithms, and that's why in Ceres it shows up in that context. If we want to go quadratic, this is the way to go.
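(To make the line-search half of this concrete, here is a minimal backtracking sketch. Note it enforces only the sufficient-decrease, Armijo, half of the Wolfe conditions; a real implementation would also check the curvature condition on the gradient.)

```python
import numpy as np

def backtracking_line_search(f, grad, x, p, alpha=1.0, c=1e-4, shrink=0.5):
    """Shrink the step along direction p until f decreases 'sufficiently'."""
    fx, slope = f(x), grad(x).dot(p)   # slope < 0 for a descent direction
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= shrink
    return alpha
```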
So, it's time for a short summary of my part. We looked at a few types of common optimization problems: some linear, some discrete and funny, some continuous, some stochastic. We mostly use optimization on smooth objectives, and there are powerful algorithms for doing that. We looked at the definition of minimizers and at the role of convexity: convexity guarantees that stationary points become global minimizers. We also looked at several orders of convergence (linear, quasi-Newton, quadratic), and at the difference between line search and trust region: line search is good for linear and quasi-Newton methods, and trust region is good for second-order, quadratic methods.

How am I doing in time? Let's look at this. For further reading, I recommend a book: Numerical Optimization, by Jorge Nocedal and Stephen Wright. I have the ISBN here, but you can also find it on Amazon or in your library. There is also the Ceres users manual, at ceres-solver.org; it has a tutorial covering roughly similar content. And the SciPy lecture notes on optimization actually provide a table of when to use which algorithm; the lecture itself is slightly outdated because it uses the old API, but the mathematics there still applies. And a popular library, ALGLIB, also has discussions of when to use which algorithm, of conditioning, and of fancier, more advanced topics, because there is a lot of black magic in ALGLIB already.

So, I think my part is done. I was thinking about talking about automatic differentiation; are people interested? Okay, well, I just want to go through an example, because this algorithm operates on a graph and I don't really know how best to describe it otherwise. So let's look at an example of how a computer can find the gradient of this function f. The function f is simple: take the sum of the squares of x1, x2, x3, then a square root. The first thing we do in automatic differentiation is convert the function to a computational graph. Each input variable x1, x2, x3 is first squared, and the results become temporary variables. It's very helpful to name these temporary variables; for example, in the popular library Dask, all intermediate variables are given names generated from the operator and the operands, so you can cache them and do good things. I'll call them t1, t2, t3; these are the squares. Now, the add operator between t1 and t2 only takes two numbers (you can add a to b), so we actually need two add operations. The first combines the squares of x1 and x2, that is t1 and t2, to form another temporary, t4. Then t4 is added to t3 to become the sum of squares, t5. Finally, the square root is applied to t5 to get the objective function. That's evaluating the objective f(x): starting with any numbers x1, x2, x3, we substitute them in and move forward through the graph to find the objective. This is not yet the derivative or the gradient. So what happens if we want the gradient? It turns out there are two ways of doing it. The first way is like how we evaluated the objective.
We go forward in the graph, from left to right. The second way is to go backward, starting from the last evaluation, the one that gave us the objective, and gradually going back. The reason we can do either is that taking derivatives is simply applying the chain rule, and with the chain rule you can go left to right or right to left; it gives the same answer. In practice the two are slightly different; we'll look at that later.

If we go forward: at any stage of the sweep, we maintain the assertion that everything behind the frontier has been replaced by (augmented with) its gradient. So the first thing we do is take x1 and attach its gradient, which we know is just (1, 0, 0), and we do the same for x2 and x3. Then we move the frontier to t1, t2, t3. We know the gradient of t1 with respect to x1; we just apply the chain rule. We know the gradients of t1, t2, and t3, because we know the relation between each t and x. Then we keep moving the frontier forward: we find the gradient of t4, because t4 is a combination of t1 and t2 and we can apply the chain rule there, and so on, all the way down to f, where we find the gradient of f(x). So we hold the assertion: behind the frontier, the gradients are known; beyond the frontier, we solve for them gradually, until the frontier reaches f(x). That's one way of navigating the graph: going forward.

Going backward, the frontier starts from the objective, and the assertion is also different: when the frontier moves from f(x) back to t5, we assert that we know the gradient of f with respect to the temporary variable t5. And we just keep going. From t5, we move in two directions, since the add operator takes two inputs: we move the frontier to t4 and t3 and apply the chain rule, and once we arrive there, we know the gradient of f with respect to t4 and t3. We keep moving the frontier backward through the computation until it arrives at the free variables x1, x2, x3, and our assertion becomes: we know the gradient of f with respect to x1, x2, and x3. We have found the gradient. That's the backward algorithm: the frontier, meaning the region where we know the gradient, moves from the objective to the free variables.

This might be obvious, but why is this useful? Well, there was a lot of interest in automatic differentiation in this workshop, so in that sense it's massively useful. The second reason: when we write down a model, if we were perfect humans, we could write down the derivatives by hand and implement them; thousands of terms, millions of terms, in principle we could code them all up and put them in the program. However, when the model is really, really complex, this quickly becomes intractable. And the task is so mechanical that there's no reason we should be doing it; the computer can do it for us. That's why automatic differentiation is appealing, and I feel it has recently been gaining a lot of popularity.
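(Here is what the forward sweep looks like in a few lines of Python for this exact graph: a minimal hand-rolled sketch that carries a (value, gradient) pair through each node and applies the chain rule, with no auto-diff library involved.)

```python
import numpy as np

# Forward sweep through the graph for f(x) = sqrt(x1^2 + x2^2 + x3^2):
# every node carries (value, gradient), and each operation updates both.
def f_and_grad(x1, x2, x3):
    # Free variables: the gradient of x_i is the i-th unit vector.
    x = [(x1, np.array([1.0, 0.0, 0.0])),
         (x2, np.array([0.0, 1.0, 0.0])),
         (x3, np.array([0.0, 0.0, 1.0]))]
    # t1, t2, t3 = x_i^2 ;  d(v^2) = 2 v dv
    t = [(v * v, 2.0 * v * g) for v, g in x]
    # t4 = t1 + t2, t5 = t4 + t3 ; addition just adds the gradients
    t4 = (t[0][0] + t[1][0], t[0][1] + t[1][1])
    t5 = (t4[0] + t[2][0], t4[1] + t[2][1])
    # f = sqrt(t5) ;  d(sqrt(v)) = dv / (2 sqrt(v))
    fval = np.sqrt(t5[0])
    return fval, t5[1] / (2.0 * fval)

val, grad = f_and_grad(1.0, 2.0, 2.0)
print(val, grad)   # 3.0 and x / |x| = [1/3, 2/3, 2/3]
```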
I don't really know exactly why, but it's getting very popular recently. BFGS is very popular for big models, in optimization; my machine-learning friends all say, just use BFGS and forget about it. But BFGS is only a fast algorithm if you have derivatives. So I'm always like, oh no, I don't have derivatives; if BFGS takes numerical derivatives itself, it becomes slow again, and unstable, and you have to tune parameters to make it work right. So by far the best way to use BFGS is with analytic derivatives. And I'm always like, oh, but then I need a gradient for my function, and they're like, why are you worried about that? You just turn on auto-diff. And it really is true; that really is a true statement. So I think this totally changes things: optimization used to be very hard, and now it isn't, because you can auto-differentiate and run BFGS. I have a simple example, not of that, but of implementing auto-diff in a few lines for a very simple test problem, just to give a better idea of how it works in practice, that I can show later. It's very short, though.

Well, my last slide before switching to Gregor; I think I've occupied too much of his time. The diagram I drew on the previous slide can get out of hand when the problem is huge. If we have billions of degrees of freedom, billions of parameters, then the graph will have billions of nodes and billions of edges. Computers are designed to do this, but it's hard; it's a Facebook-sized problem when you have a graph of billions of nodes and billions of edges. But actually, most of the functions we deal with in science are not that messy; they have structure in this graph. One way to make use of this structure is to represent variables as vectors. The previous function, if we represent x as a vector, becomes a single line: you take the elementwise power of x, do a sum reduction, and take a square root. It's very simple. This idea is behind packages like TensorFlow: you represent your model with a bunch of vector and tensor operations, and then the auto-diff becomes actually tractable, and you can deal with it. And when you write the problem down this way, it becomes sort of obvious that going forward we have to store the sensitivity matrix: t1 is now a vector, and its sensitivity is essentially a Jacobian of the temporary variable with respect to the free variables x, and that can grow out of hand if the temporary variables are huge. Going backward, because we always start from a scalar, we only need to store a sensitivity vector: at any step of propagating the frontier, we store only the gradient of a scalar with respect to a bunch of variables. That's a smaller memory requirement; storing the gradient of a scalar is easier than storing the gradient of a vector. So that's the advantage of going backward when we have these high-dimensional, vectorized operations. And that's my last comment on the topic. So we'll swap; let's take a five-minute break.

Can everybody hear me? Yeah. All right. Hi, everyone. I'm Gregor. I'm going to continue on the excellent introduction to optimization that you just heard.
I'll give a little bit more detail on two of the algorithms that are commonly used, that I've been using in the past few months: conjugate gradient and LBFGS. And I'll show you a couple of examples of how we use them in our actual daily work, which I think is really, really cool; hopefully you'll appreciate that. Before I get started, I just want to put this in context and remind us of the little discussion we had on Monday with David regarding Bayesian versus frequentist, sampling versus optimization. It is true that optimization is typically used by frequentists, and Bayesians usually sample. But I find the Bayesian way of thinking really useful, and most problems we should be addressing and thinking about as Bayesians. However, every once in a while, for the reasons we discussed, such as computational efficiency, we are either not really able to sample and get a full posterior, or we're not even interested in the full posterior, and we want some kind of estimator. For these things, we really need to be using optimization. And in machine learning this is used a lot, even though we might not notice it directly when we apply machine-learning algorithms: there is typically an optimizer hiding behind any machine-learning algorithm, tuning your neural-network parameters or whatever else you're doing.

So let's look at some more details. I'll start with a simple problem, the favorite problem of the week, just a linear problem, written down in many dimensions. Suppose we want to solve the system of linear equations A x = b. A is a matrix, b is a vector, and x is the vector we are trying to solve for; like David mentioned earlier, x is the unknown. That's the notation, and this is the problem. You might ask: what does this have to do with optimization? This is just linear algebra; we all know how to take the inverse of a matrix, and there are excellent packages for doing that. So we could invert the matrix, apply it to b, and find x. Why don't we do that, and why do we want to think of this in terms of optimization? Before I move on, I'm going to impose the condition that A is a symmetric positive definite matrix, and little n is my dimensionality. Why do I do this? A little hint from earlier is convexity: A will eventually play the role of a Hessian, and if we want to arrive at a solution, it had better be positive definite.

So what does this have to do with optimization? We can write this problem in terms of a function phi, namely phi(x) = (1/2) x^T A x - b^T x, and we can try to find the minimum of phi. Here is the proof that finding that minimum is equivalent to solving A x = b: the gradient of phi is equal to A x - b, which I'm going to call the residual for the rest of the lecture. So if you find the minimum, the gradient is zero at that point, and you've found the solution of A x = b. So why minimize? First of all, you can do the linear algebra if your matrix is not huge: if your dimensionality is around 1,000, say, you can invert the matrix really easily; it's all good. But matrix inversion is expensive. The simple algorithms are n cubed; there are smarter ways that get it to about n to the 2.3-something, but that's still more expensive than n squared. So if you have a million-dimensional parameter space, then that's hopeless: you are not able to invert that matrix in any realistic time.
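(On a small problem, where the linear algebra is easy, the equivalence is simple to check. A sketch, with a toy matrix constructed to be symmetric positive definite; none of these numbers come from the lecture.)

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)   # symmetric positive definite by construction
b = rng.standard_normal(50)

phi  = lambda x: 0.5 * x @ A @ x - b @ x   # grad phi = A x - b, the residual
gphi = lambda x: A @ x - b

x_opt = minimize(phi, np.zeros(50), jac=gphi, method="CG").x
x_lin = np.linalg.solve(A, b)
print(np.allclose(x_opt, x_lin, atol=1e-5))   # the two solutions agree
```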
In fact, you might not even be able to store the whole matrix in memory. There are ways, of course, with sparse matrices, to deal with these kinds of things, but these are the situations where linear algebra might fail and you might want to think about optimization. And the examples I'll show you, the problems I've been working on, are very high dimensional: millions, going up to billions, of dimensions. Linear algebra there would be hopeless.

So, a quick introduction to the conjugate gradient method. I'm going to give some math and some theorems, and I'm not going to prove them, obviously. The definition first: a set of vectors p is called conjugate with respect to A if p_i^T A p_j = 0 for i not equal to j, where A is, again, a symmetric positive definite matrix. If you have a set of conjugate directions, then, here is a theorem without a proof: you can minimize your function successively along these directions p. You first find the minimum along direction p0, fix that, then move on to direction p1, p2, and so on, and you are guaranteed to arrive at your solution, in at most n iterations, where n is the dimensionality of your problem. So each step basically becomes a one-dimensional optimization problem: you have to find this alpha_k, the coefficient of p_k, at each step, and for the linear problem you can actually solve for it exactly.

Let's have a quick look at an example from the same book. Say A is diagonal; the conjugate directions can then be the coordinate axes, basically. I should also mention that the choice of conjugate directions is not unique; there are multiple choices, and you just need one set of conjugate directions. In this case, if this is your function, you are able to find the minimum by first minimizing along the first coordinate and then along the second coordinate, and you arrive at the minimum, guaranteed; this is a two-dimensional problem. However, if you try that on a non-diagonal problem, you first minimize along the x-axis, then y, then x, then y, and you may never arrive at the minimum point. Okay?

Another useful property, which again I'm not going to prove: the residual at each iteration is orthogonal to all the previous conjugate directions. It basically means that at iteration k you have already found the minimum within the linear subspace spanned by the conjugate directions used so far, plus your starting point. Okay? And this is really, really useful, because you can construct the next conjugate direction from the residual at the current point and the previous conjugate direction, just by taking a linear combination of the two with an appropriate coefficient beta. You find this beta by imposing the condition that p_k and p_{k-1} have to be conjugate with respect to A; you can write down a linear-algebra formula for that, and you'll find your beta. And this is excellent, actually. Why is it so important? Because you can find your next conjugate direction depending only on the previous conjugate direction, so you don't even need to store all the conjugate directions you have found so far.
You only need to remember the last direction you minimized along. This brings us to the algorithm; I'm not going to go through it line by line, but basically: you start at any point x0 (if you can find a good starting point, great; if not, it's still okay), calculate the residual at that point, and set your first conjugate direction to the negative of the residual. Then, while the residual is not equal to zero, you repeat the loop. And I should comment here that exact zero is never really achievable in practice, because of numerics; it will never arrive exactly at zero. So instead, in practice, we impose the condition that the norm of the residual be smaller than a given threshold epsilon. At each iteration, you first minimize along your current conjugate direction: you find your alpha_k and move to a new point. Then you calculate your residual, and from the residual you find the coefficient beta_k to form your next conjugate direction. And this is, as I said, guaranteed to converge in at most n steps, where n is the number of dimensions.

Of course, that's not good enough if you're dealing with a million-dimensional problem; you don't want it to take a million iterations. So there are some useful properties about convergence, which I'm again not going to prove. If A has r distinct eigenvalues, you are guaranteed to converge in at most r iterations; that's one property. Another property gives the distance from the minimum at a given iteration k as a function of the eigenvalues and how far they are from the smallest eigenvalue; I'll show you this in the next plot. And there is a third property giving the rate of convergence as a function of what's called the Euclidean condition number, which is just the ratio of the largest to the smallest eigenvalue. The smaller this condition number is, the closer it is to 1, the faster you converge, the closer you get to your minimum at a given iteration. So if your matrix has a largest and a smallest eigenvalue that are close to each other, that's a good thing.

Here is a quick example. In this case, you have m large eigenvalues, and then n minus m eigenvalues clustered around 1. After about m, or m plus 1, iterations you'll actually be very close to your minimum, because of the property above; you can show that. And, as I said, if you have r distinct eigenvalues, you arrive in exactly r iterations, and there is another property: if the eigenvalues are clustered, so you have roughly r clusters of eigenvalues, then again you approximately solve your problem in r iterations. So that's all good. But you might say: what if I just have a matrix that doesn't have any of these nice properties? What do we do then? Well, then you can do what's called preconditioning. You transform your variables linearly to another set of variables, and the matrix you're dealing with becomes a transformed one; if this new matrix has those good properties, the method converges faster. C in this case is called the preconditioner. And the good thing is that you don't actually have to do this transformation explicitly; you can just modify your algorithm slightly to include the preconditioner. So here is the preconditioned algorithm; first, below is a sketch of the plain version just described.
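(A minimal NumPy sketch of the plain, unpreconditioned loop. A enters only through a function that applies it to a vector, so the matrix never has to be stored explicitly.)

```python
import numpy as np

def conjugate_gradient(A_mul, b, x0, eps=1e-10, max_iter=None):
    """Plain CG for A x = b. A_mul(v) must return A @ v."""
    x = x0.copy()
    r = A_mul(x) - b          # residual
    p = -r                    # first direction: steepest descent
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) < eps:   # practical stand-in for "r equals zero"
            break
        Ap = A_mul(p)
        alpha = r.dot(r) / p.dot(Ap)  # exact 1-D minimization along p
        x = x + alpha * p
        r_new = r + alpha * Ap
        beta = r_new.dot(r_new) / r.dot(r)   # makes the next p conjugate to p
        p = -r_new + beta * p
        r = r_new
    return x
```

A_mul can be as simple as `lambda v: A @ v` for an explicit matrix, or an FFT-based operator of the kind mentioned below.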
I'm not going to go into more detail about how to precondition. I'll just say that all those properties I mentioned before are the things you're trying to achieve with preconditioning: either try to get your eigenvalues clustered, or try to make the Euclidean condition number as small as possible. It usually depends on the problem you're trying to solve, and on some actual knowledge about the problem, whether you can find a good preconditioner. No, you just need to be able to perform this operation, the matrix-vector product, basically. Right, exactly. And the problems we're actually solving are huge dimensional; we are never able to store these matrices in memory, and even the matrix-vector multiplication is usually not done directly, so it's not an n-squared operation. Sometimes it's a fast Fourier transform, which is n log n, even though it stands in for a matrix-times-vector multiplication. Okay? So, yeah, that's your preconditioned algorithm.

In one slide, I want to mention that so far I've been looking at a linear problem, but the conjugate gradient method can be generalized to nonlinear problems, and the generalization is usually not that hard. Now, instead of the quadratic function phi to minimize, we have an arbitrary nonlinear function f, and two things have to change. First of all, the minimization at each step, finding the alpha that minimizes along a given conjugate direction: for the linear problem you could do it exactly, but for a nonlinear problem that becomes harder, so that step gets replaced with a line search. We talked about line search earlier: as long as your line search satisfies the Wolfe conditions, which guarantee sufficient decrease in the function and sufficient decrease in the gradient, it'll work. The other thing is the residual: you don't have Ax minus b anymore, and instead you use the gradient of the function you're trying to minimize. The rest is basically the same. I'm not going to prove why this works or how well it works; we'll look at a couple of examples later to see whether it works well. But it's good to remember that conjugate gradient can be applied to nonlinear problems as well.

Next is BFGS, named after the four guys who invented the algorithm: Broyden, Fletcher, Goldfarb, and Shanno. This is a quasi-Newton method, which means we're actually going to use some second-order information, the Hessian, but we're not going to calculate the Hessian, and we're not going to store it. We're going to somehow approximate the inverse Hessian, and the approximation is built using the previous iterations of the algorithm: we store the points we visit and the gradients, and use that information to approximate the Hessian. There is another version of this called limited-memory BFGS, or LBFGS, which stores only the last m iterations, say the last 10, and in practice this usually works almost as well as BFGS. It's limited memory because you're not storing all of the iterations. You might think: well, if I have all the memory I want, maybe I shouldn't bother with LBFGS, because I can store them all.
But in reality, limited memory might actually be a good thing. If you have just a quadratic problem and your Hessian is not changing at all, then sure: the more iterations you can save, the better you'll approximate your Hessian. But if your function is highly nonlinear, and the Hessian is changing a lot from iteration to iteration, then including information from a thousand iterations ago in your Hessian approximation might not actually be that good a thing, because back there it was a completely different Hessian. So you might be better off with LBFGS anyway.

How does the algorithm work? First of all, we expand the function f to second order and forget about the rest. Say at an iteration I'm trying to minimize this second-order approximation of the full nonlinear function f. If I had the gradient and the Hessian, denoted B, I could find the minimum exactly by applying the inverse Hessian to the gradient. In reality, since the function is more complex than this (it has more terms), and since the Hessian will be approximate, we only take this as a direction: we won't jump to that exact point, but we'll say, okay, this is the direction we're going to search in, we perform a line search, we find the coefficient alpha_k, and that determines how far we move in that direction. Then we repeat: find the new direction and keep going. In some sense this is similar to conjugate gradient, but the direction to move in is determined using the approximate inverse Hessian.

How is that done? Let's denote the inverse Hessian by H. There is basically one condition I want to satisfy: if I'm going from x_k to x_{k+1}, and the gradient changes from grad f_k to grad f_{k+1}, I want the Hessian applied to the difference of the points to be equal to the difference of the gradients. That's just writing down that the second derivative should be the derivative of the first derivative, with the points being your iterates. If you write the same thing in terms of the inverse Hessian, the condition becomes the corresponding one for H, and that's the condition I want to satisfy. Obviously, many choices satisfy it; the choice we make is, in some sense, the one that stays as close to the previous iteration as possible, under some metric, subject to these conditions, and the new Hessian approximation obviously has to be symmetric. You find that minimum, and that's your new Hessian approximation; you can solve the minimization, and it becomes the update formula on the slide. You don't need to derive this equation, but I will mention again that you never store H_k or H_{k+1}. All you store is s_k and y_k, which are just the difference between your current iterate and the previous one, and the same thing for the gradients. At each iteration you basically apply this multiplication with vectors over and over, for all the stored iterations, until you get down to H0, which is usually chosen to be the identity matrix. So the take-home message: H is never stored; only the previous iterations are stored, and using them, you're still able to calculate an approximation of the inverse Hessian applied to your gradient.
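(That calculation is the standard "two-loop recursion". Here is a minimal sketch of it, following the textbook version in Nocedal and Wright; s_hist and y_hist are assumed to hold the last m differences s_k = x_{k+1} - x_k and y_k = grad f_{k+1} - grad f_k as NumPy arrays.)

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """Approximate H_k @ grad (inverse-Hessian times gradient) from the
    last m (s, y) pairs, without ever forming a matrix."""
    q = grad.copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_hist, y_hist)]
    alphas = np.zeros(len(s_hist))
    # First loop: newest pair to oldest.
    for i in reversed(range(len(s_hist))):
        alphas[i] = rhos[i] * s_hist[i].dot(q)
        q -= alphas[i] * y_hist[i]
    # H0 is a scaled identity, gamma * I (a common choice of starting matrix).
    gamma = (s_hist[-1].dot(y_hist[-1]) / y_hist[-1].dot(y_hist[-1])) if s_hist else 1.0
    r = gamma * q
    # Second loop: oldest pair to newest.
    for i in range(len(s_hist)):
        beta = rhos[i] * y_hist[i].dot(r)
        r += (alphas[i] - beta) * s_hist[i]
    return -r  # quasi-Newton search direction
```

The returned direction is then handed to a line search, exactly as described above.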
Okay, let's see how much I want to talk about this. I'll just briefly mention one kind of problem that we've tried to solve. These are, again, linear problems: your data is equal to some linear function of your signal plus some noise. We've seen this before with the line fitting; this just generalizes it to many dimensions. Big S denotes your signal covariance matrix, big N your noise covariance matrix, and big C the data covariance matrix. Let's not worry about this part for now, since I don't have time to go into it. One thing you want to do is, given the data, estimate the signal. All you measure is the data; you don't know the noise and you don't know the signal, but you know the magnitude of the noise, and the signal statistics are determined by your model through the signal covariance matrix. So you want to estimate, in some sense, the actual underlying signal that produced that data. Obviously there are many choices, so you want the best choice of the signal s given the data d. There are a couple of ways to do this. One is called the minimum-variance estimator; in practice this is called a Wiener filter. What you do is say: my estimator of s will be a linear function of my data, and under that condition I want it to have minimum variance. Then you can prove that this expression gives you that property. If you look at it, there's a bunch of linear algebra applied again: the inverse data covariance matrix, this transformation matrix R, and then your signal covariance matrix. That's how you calculate the Wiener filter. One important property: if you're dealing with Gaussian fields, this is equivalent to the maximum-probability estimator. What that means, thinking like a Bayesian, is that given your data, you write down the probability of the signal given the data, and you maximize that probability over all signals to find the maximum-probability signal. These two are equivalent. So why am I writing this down? These kinds of problems are usually solved using linear algebra, but when the dimensionality gets very large, you can replace the linear algebra with an optimization problem: just find the signal s that maximizes the probability, and you're solving the same problem. So linear algebra can be replaced with optimization. I also wanted to briefly mention Fisher matrix estimation for these problems, and the power spectrum. I'm not going to define the Fisher matrix; we can talk about it in more detail afterwards if you're interested. Usually there are linear algebra formulas to calculate it, and from that a power spectrum estimator. If you're familiar with CMB analysis, this is done all the time, and they get a power spectrum estimator that way. Again, you can write down the linear algebra, and again you can do the same thing with optimization methods; we've applied this, and I'll show you some examples. If you have a very high-dimensional problem and you don't want to deal with matrix inversions and so on, optimization will give you the answer just as well.
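For concreteness, a small dense-matrix sketch of the Wiener filter just described, assuming the linear model d = R s + n with the slide's S, N, C notation; the toy sizes and covariances are placeholders, not values from the talk. At high dimension you would replace the solve with an optimizer, as discussed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_d = 50, 80
R = rng.standard_normal((n_d, n_s))        # linear response: signal -> data
S = np.eye(n_s)                            # signal covariance (toy choice)
N = 0.1 * np.eye(n_d)                      # noise covariance (toy choice)

# Simulate d = R s + n.
s_true = rng.multivariate_normal(np.zeros(n_s), S)
d = R @ s_true + rng.multivariate_normal(np.zeros(n_d), N)

C = R @ S @ R.T + N                        # data covariance
s_hat = S @ R.T @ np.linalg.solve(C, d)    # minimum-variance (Wiener) estimate
```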
So in the last few minutes I have, I'll show you a couple of examples we've been working on where optimization methods work. One example is weak lensing, and I should warn from the start that the picture here shows strong lensing. The reason is that weak lensing is, by definition, weak: if you just try to draw a picture of it, you can't really see anything interesting. Gravitational lensing, as everybody is familiar, is that when you look at a galaxy far away and the light passes by a large amount of mass, the shape gets distorted. In this picture all these galaxies are distorted, and you get this circle around the big concentration of mass. The point is that gravitational lensing traces dark matter, which is the dominant matter in the universe. We can't see dark matter because it's dark, but it interacts gravitationally, and we do see the distorted galaxy shapes. What we want to do is infer from those where the dark matter is. This is a really powerful probe of the dark matter distribution, and that's the problem we're going to solve. Our observable, our data, will be what we call the shear fields, the distortions of the galaxy shapes that we measure, and the thing we're trying to find is the dark matter behind them that's making those distortions. A little bit of notation: we introduce the convergence field kappa, which is essentially the integrated density field, or the projected density field: if you look in a given direction, how much dark matter there is up to whatever source you're measuring. This can be converted into two shear fields, gamma 1 and gamma 2, like that. What I want you to notice is that this conversion is linear. So our data are the shear fields in real space, our signal is the convergence field in Fourier space, and that's what we're trying to estimate. Going from signal to data is a linear operation: you apply this, and then you do a Fourier transform. And of course the shear field also has noise. So we're doing some simulations. Given some power spectrum, we simulate a projected density field; I'll use density and convergence interchangeably. This is your kappa. I'm starting with a small example, a 64 by 64 grid, and this has a reason: 64 by 64 is about 4,000 unknowns, and I can actually solve that with linear algebra, it's not very high-dimensional. So I want to see whether my method gives the same answer as the linear algebra on this problem. You construct the density, then the two shear fields; you can see the structure in the shear fields, up-and-down and diagonal. You add some noise to the shear fields, and then you mask out some regions. This is a toy example, so I've masked out the center and the sides, and the noise I've added is not homogeneous. So this is your observed data, and this is what you're trying to reconstruct. Okay? Let's look at some results. This is the reconstructed density field using optimization, LBFGS, and this is the same thing you obtain if you solve the problem using linear algebra, and they do agree. I've also calculated the Fisher matrix with both methods, and they agree as well. This plot shows the iterations of the optimizer and how close it gets to the minimum; it's a log scale, I should mention. Within 11 iterations you're already within 10 to the minus 2 of your minimum. So the method works.
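As an aside before scaling up: the linear kappa-to-shear operation described above is, on the flat sky, the Kaiser-Squires relation. Here is a minimal sketch, assuming a periodic square grid; this is a simplification, not the actual pipeline from the talk.

```python
import numpy as np

def kappa_to_shear(kappa):
    """Convert a convergence map to the two shear maps in Fourier space:
    gamma1 = (l1^2 - l2^2)/l^2 * kappa,  gamma2 = 2 l1 l2 / l^2 * kappa.
    A purely linear operation, as emphasized above."""
    n = kappa.shape[0]
    l1 = np.fft.fftfreq(n)[:, None]
    l2 = np.fft.fftfreq(n)[None, :]
    l_sq = l1**2 + l2**2
    l_sq[0, 0] = 1.0      # the l = 0 mode is undefined; its numerator is 0 anyway
    kappa_l = np.fft.fft2(kappa)
    gamma1 = np.fft.ifft2((l1**2 - l2**2) / l_sq * kappa_l).real
    gamma2 = np.fft.ifft2(2.0 * l1 * l2 / l_sq * kappa_l).real
    return gamma1, gamma2
```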
Then we apply the same thing to a huge problem. Now we make the grid a million-dimensional, 1,000 by 1,000. Same thing: density field, shear, and we mask out some regions. This is a more realistic mask for weak lensing: wherever there are bright sources that you can't see through, you mask that out. I've also added inhomogeneous noise: in this region the noise is higher, in the green region the noise is lower. These are the two shear fields you end up with, and you're trying to reconstruct the density field. So I'm going to show you how LBFGS works, a very quick movie. My first iteration assumes everything is zero; I'm not putting any information in. The movie just goes iteration after iteration, so let's get started. You see it first reconstructs the regions where the noise is low very quickly, and then eventually it gets similar to the true field. The chi-squared is still decreasing, but you can't see it in the picture anymore. And then, after seventy-something iterations, we are there. Comparing the two algorithms on this problem, LBFGS and conjugate gradient, you can see that they perform very similarly. A second example; I'm kind of out of time, so I'll be very quick about this. Now we're looking at a nonlinear problem: nonlinear structure formation in the universe. We're basically trying to reconstruct the initial conditions of the universe. You start with some linear density field, and then under gravitational collapse you get your nonlinear structure formation; you add noise, and this is the structure we see today. This is what we want to invert, and it's a very nonlinear problem. I want to show you that the methods I just discussed actually work for the nonlinear problem as well. So again, a similar movie for the reconstruction. I start from zero, no information. And I should say this is a 128-cubed box, so it's about 2 million dimensions: this is optimization in 2 million dimensions. Let's see how that works. It's trying to find all the structures, and after about 30 iterations it's basically there. The chi-squared keeps decreasing, but you can't see that anymore. Yeah, so basically the parameters are the initial conditions, in this case all the Fourier modes: if you have 2 million pixels, you have 2 million Fourier modes. So yeah, thank you. Yes, I actually have some backup slides on the power spectrum, and we do reconstruct the power spectrum as well; I just didn't put them in here because of time, but I'll show you a couple of backup slides because I want to show you them anyway. This is the linear power spectrum reconstructed from the data. The black curve is what we reconstruct. Let's see what is what: the red one is the truth we put in. We actually start from a fiducial power spectrum, because you have to start with something fiducial, and here the fiducial power spectrum is green. I don't know if you can see it, but we didn't put the wiggles, the baryon acoustic oscillations, into the fiducial, and we reconstructed them anyway. So if you divide by the fiducial one, you reconstruct the linear power spectrum, and you reconstruct the baryon acoustic oscillations, with error bars. We can talk more about that later if you're interested. But yes, eventually the goal is to reconstruct the linear power spectrum of the universe from the nonlinear data that we see.
And why that's cool is that the nonlinear data also contains higher-order statistics, but if your primordial universe is Gaussian and you can go back and reconstruct the linear power spectrum, then you're basically reconstructing all the information there is. Yes. Yeah, so the green line here shows the feature you would see in the nonlinear case, and the red one is the linear one. The green one decays, but we're reconstructing the red one. Right, so right now we compute the derivatives by hand: we derive them and then implement them. But we're working on autodiff, either using a library or our own thing, to do it automatically. And we have grad student Shorag, who's here. So yeah, that's how we calculate the derivatives. Just to show you that in the nonlinear case, again comparing LBFGS and conjugate gradient, I would say LBFGS performs a lot better than conjugate gradient for nonlinear problems. Yeah, that's a good point; it's a similar ratio: conjugate gradient takes roughly twice as long as LBFGS. But yes, that's a better way to compare the two. Where am I? I'll just leave you with this slide, a few packages you might use for optimization. There is SciPy; I'm not going to talk through all of them. The last one is written by me, and we're using that one because of the requirements we have: first of all, very high dimensionality, and we have to have our own way of distributing the problem over different nodes. I should say that with all the different packages, I can't just say: if you have an optimization problem, choose this one or that one. They all perform well, they all have their own properties, and some are good at some things. That being said, I should also say that clearly the last one is the best and everybody should be using it. Thank you very much. Questions? Right, or not just that problem, any problem. But yeah, the autodiff library here is not fully implemented yet; we're working hard on it, so hopefully that'll show up within the next few weeks. So, yeah. Okay. I'm still blown away on the science side, but I will discuss that with you afterwards. On the optimization side, one package I would add to that list is IBM CPLEX. CPLEX is usually used for things like scheduling problems, but it's actually an outrageously powerful optimizer. It costs about $50,000 per seat per year, but they give it away for free to academics. It's extreme, and it has a lot of shit in there. So if you have problems that don't have a completely continuous form, where you can't take derivatives, or you have some integer parameters, problems that are more heterogeneous than these, then IBM CPLEX is a remarkable machine. But don't buy yourself a home license. Thank you. There's a scikit-learn cheat sheet that asks you questions: is your problem like this? Well, then go use this. Maybe we could do the same for optimizers, because when I look at this list, I need help: which optimizer should I go and look at first, given various features of my problem? Would that be difficult to do this afternoon? That might be possible. I'm not really an expert on all of these, but if there are people who have been using any of them, maybe we should get together and put together a list. That would be good, yeah. Any more questions? All right, let's take a 10 to 15 minute break.
Come back around 10:50 or 10:55 for Dan. Oh, yes, thank you very, very much. Hello, we'll take a minute to start. All right, hello everyone. For those of you I haven't met yet, I'm Dan Foreman-Mackey. I'm a postdoc at the University of Washington, and my day job is working on exoplanets, but I like to spend most of my time hacking, so Astro Hack Week is my favorite week of the year. This one's been no exception, so it's really nice to be here. I was assigned the topic of sampling, and I interpreted that as MCMC in practice, Markov chain Monte Carlo in practice, so that's what I'm going to do. I'm just going to write on the board today and give a little lecture about the nuts and bolts of how this class of methods works and why you might want to use them. Then, depending on whether we have time, and the time until lunch is slowly disappearing, I also made a worksheet that I put into the Astro Hack Week GitHub repository under day four, sampling, that goes through implementing your own sampler and things like that. At the end, if we have time, it might be fun to try that out, for those of you who haven't written your own sampler before. Now, how many people here have actually used Markov chain Monte Carlo in their research? All right, yeah, that's what I expected. It's become very popular in astronomy, which is probably one of the reasons it's relevant to talk about today. There are other methods of sampling, but I'm not really going to talk about those here. Okay, so let's get started. To start with, it's useful to have an example to illustrate some of the things we're going to talk about later, and I'm going to use the fitting-a-line example that we've used constantly through this entire week. Except I'm going to make it slightly more complicated, because as we've seen several times, if you're just doing linear regression, fitting a line to data with Gaussian uncertainties, you can compute a lot of things analytically. So I'm going to make it a little more complicated, and I'll show you that right here. The basic story is that we're fitting a line to data: we have some data points with their uncertainties, and we have to write down a likelihood function and a prior probability distribution, which we heard a lot about yesterday. What I'm going to do is use a likelihood function where we allow the uncertainties on all of these data points to be underestimated. We're going to fit for the uncertainties and marginalize out our lack of knowledge about them. The basic idea is that you might be able to estimate some parts of the noise in your data but not all of them, and it can be useful to have a fudge factor to take that into account. What that likelihood function looks like is the log of the probability of our y's. Phil doesn't like it when I draw an arrow, but I'm going to do it anyway. In this case we have three parameters instead of just two, and I'm going to call the third one s. It looks like the usual thing, except we're saying that our uncertainties are underestimated, so we add this s parameter in quadrature with the measured uncertainty we're given. That's our new parameter. And unfortunately this isn't quite enough.
We need another term, given by this, and the reason is that it comes from the normalization of the Gaussian: we're assuming the uncertainties are Gaussian, and there's that one over square root of two pi sigma squared in the Gaussian equation, and that's where this comes from. Qualitatively, the reason you need this term is that otherwise you could make the free parameter s arbitrarily large and your fit would just get better and better, which wouldn't be very useful. So this is the likelihood function we're going to use as our example. The reason it's nice to write this down is that you can't solve this one using linear algebra. Now that we have this nonlinear parameter s, if you want to fit for the best value of s, you have to put this into an optimizer, and we just had a very nice session learning about the different options there. So if you wanted to fit for the best values of all of these parameters, you would use an optimizer. And you should definitely use an optimizer, not sampling, if all you care about is the best values of these parameters under some definition, like the maximum likelihood or the maximum a posteriori. Sampling becomes relevant when, for the abstract of our paper, we don't just want the best number; we want some uncertainty on our fits. To get those uncertainties in the Bayesian framework, when you have a nonlinear, complicated problem like this, sampling is often the easiest way. I'll get into the details of why, but this is an example where you might want to sample. Okay, so that's our motivating example. Now I just want to bring up a few terms from yesterday so that we're all on the same page; it was sort of a crazy night last night, so it's useful. A little reminder about the terms that enter into a Bayesian analysis. We had our likelihood; I'll try to write bigger, since I realize it's hard to see. The likelihood is the probability of the data given our model parameters, and that's the terminology I'm going to use, where D is the data and theta are the model parameters. This is the thing we're used to writing down; our physics goes into it most of the time. Then there's the prior, which is the prior probability of theta without any data. These go together to give you the posterior probability, which is the thing you probably care about: the probability of the model parameters conditioned on the data. That's going to come up a lot as we go forward. It's equal to the prior times the likelihood, with a normalization factor often called the evidence integral, and we'll come back to that in a minute. So these are the pieces that go into our analysis as we go forward. Now, one thing I'd like to point out is why sampling is generally discussed in a Bayesian framework: you have to draw samples from a probability distribution, and what we're interested in is the distribution of parameters that are consistent with our data. There are only two probability distributions here that are distributions over parameters: the prior and the posterior. The likelihood is not a probability distribution over parameters.
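A minimal sketch of that likelihood, assuming the model is a straight line m x plus b; the function and variable names here are illustrative, not from the talk.

```python
import numpy as np

def log_likelihood(theta, x, y, yerr):
    """Log-likelihood for the line fit with underestimated error bars:
    theta = (m, b, s), model y = m*x + b, and the fudge parameter s is
    added in quadrature to the quoted uncertainties. The log(2*pi*var)
    term is the Gaussian normalization that keeps s from running away."""
    m, b, s = theta
    var = yerr**2 + s**2
    return -0.5 * np.sum((y - (m * x + b))**2 / var + np.log(2.0 * np.pi * var))
```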
Since the likelihood is not a probability distribution over parameters, you can't draw samples of parameters from it. Does that make sense? You can draw samples from a likelihood function, but those will be samples of data, not samples of parameters. So you could generate new synthetic data sets, but you can't generate new sets of parameters drawn from that probability distribution. That means that when we're sampling, we're always going to be using posterior probabilities, not just likelihoods. You could put a likelihood function into an MCMC algorithm, but the things you'd be drawing would not be parameters; it's just not possible. What happens if you try? If you put a likelihood function into an MCMC algorithm and say that you're drawing samples of parameters from it, that means you're implicitly choosing some form of the prior, and it's not good to choose it implicitly, because then you don't really know what effect that's going to have on your results. Implicitly, you're choosing an unnormalized uniform prior when you do that. But we'll get to the details of that in just a minute. That's a useful point, Daniela, thanks. So we have these components: we have our likelihood function that includes our physics, we have our prior that includes any prior knowledge, and we can put them together with this normalization constant that I'm sweeping under the rug for now. Now what? I said that what we wanted was uncertainties on our parameters, or something vaguely like that. Really, what you want is either to find the best-fit values, the best thetas, for which you'd use an optimizer, or to do integrals. That might sound slightly counterintuitive, but I'll try to convince you that anything you would ever want to do with a distribution like this is to do integrals, and those integrals will have a form that looks something like this. Let me just write the integral first; that'll be more useful. You want to integrate some function of theta, where theta are your parameters, weighted by the posterior probability distribution. Anything you would ever want to do, any result you'd want to report, can be written as an integral. For example, if you wanted the mean value of a parameter, for f of theta you would put in just theta, and the integral would give you the mean of your posterior. If you wanted an uncertainty on that, you could put in theta minus the mean theta, squared, and that would give you the variance of your posterior. Other things you might want, like a median of your samples, can also be written as integrals: just an integral until you have half of the mass in that parameter, things like that. And one other integral you might want to do is a marginalization: if you have multiple parameters and you only care about the distribution of one of them, you want to integrate out the other parameters. That's also an integral. So there's this whole class of integrals that you want to do with probability distributions, and that's the goal of sampling.
Now, the reason sampling helps you is something that in Canada we learn in primary school: this integral can be approximated by a sum that looks like this. It's just the function evaluated at a bunch of samples theta n, where the theta n are draws from the posterior probability distribution. So if we can draw samples from the posterior, we can evaluate this integral very easily, just as a sum. And that's preferable to the alternatives. For a 1-D problem, it might be fine to put it into a quadrature rule and do the integral numerically that way, but as soon as you go to a higher number of dimensions, that quickly becomes intractable. Having samples allows you to do the integral very, very easily; if you have samples from your posterior probability, you can compute all the integrals you might want, simply as sums. So that's the goal. Are there any questions at this point? Yeah. Well, I don't think it's... I mean, one thing you might want to do is optimize the posterior probability: if all you care about is one representative model that someone might use, then finding the best posterior value is probably useful. Yeah. I like that terminology. I think not everyone would agree with that argument, but I'm comfortable with saying it here. Anything else? All right. So now what we're left with is that we need some way of generating these samples. If we're in the very lucky situation where our posterior probability is simple, a Gaussian or something like that, then you can just draw samples from it directly: open up numpy.random, generate samples, and you're done; you can do all of these integrals. But in most real-world problems, including this one right here, we can't do that. There's no analytic way of generating random draws from the posterior probability distribution. That's where Markov chain Monte Carlo comes in, and I want to talk a little bit about the standard methods of MCMC and how you might implement them. The posterior density is something you can always evaluate: we write these probability expressions with likelihood functions and prior probability density functions, and the posterior is something you can always evaluate; it's just hard to sample from. Well, that's not quite true, because of this term right here, the one I keep ignoring. The thing we can evaluate is the prior times the likelihood, in most cases. In general, this normalization constant is difficult to evaluate, for the same reasons it's difficult to sample from this probability distribution, and that'll come back in just a second. But the basic idea is that for a given set of parameters, we can evaluate these two functions, in principle, at some cost. All right, so here's the basic idea, which I'm just going to sketch on the board. Let's say we had two parameters, theta one and theta two, and let's say that magically I know their posterior probability looks like this: some banana, something like that, right there. I don't know how to sample from it, but I know I can evaluate it, and it looks like that.
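To make the integrals-as-sums point concrete before moving on to how the samples get generated, here is a tiny sketch, with stand-in Gaussian "samples"; in practice the array would come from your sampler.

```python
import numpy as np

# Stand-in posterior draws, shape (N, ndim); in practice, from your sampler.
samples = np.random.default_rng(1).normal(size=(10_000, 3))

theta1 = samples[:, 0]                    # marginalizing = ignoring the other columns
mean_t1 = np.mean(theta1)                 # f(theta) = theta_1
var_t1 = np.mean((theta1 - mean_t1)**2)   # f(theta) = (theta_1 - mean)^2
median_t1 = np.median(theta1)             # the half-mass point
```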
So here's how a standard MCMC works, the most basic Metropolis MCMC, which is the simplest method. You start by choosing some point in parameter space, say over here; it's a pretty bad choice, but we'll come back to that in a moment. Then you propose some new point based on that point, randomly selected in a way that I'll describe momentarily. Let's say we propose this point right here. For that proposal, the probability is getting higher, so in that case we'll definitely accept it: we want to go towards higher probability. If we were using a greedy optimizer, that's always the decision we would make: if it gets better, we take it; if it doesn't get better, we don't. But for MCMC, we work our way up the hill, and then once we get here, we don't want to just find the best value; we also want to sample the tails. So sometimes a proposal is made where the probability actually gets smaller, and in that case we sometimes accept it anyway. The probability with which we accept that proposal is given by this; let's call it R. The proposal is accepted with probability given by this ratio, where theta prime is the proposed location and theta t is the current location: the ratio of the probability at the proposed location to the probability at the current location. You can see that if the probability goes up, we always accept the point. If it goes down, there's some probability: if it goes down a lot, we're very unlikely to accept it, and if it goes down just a little bit, we're quite likely to accept it. But this brings us to the thing Phil brought up, which is that it's hard to evaluate this normalization constant Z. And it's really nice that we have a ratio right here: we only ever have to compute ratios of posterior probabilities, which means the normalization cancels. So we never have to evaluate it in order to do MCMC. Really, that's the main benefit of MCMC: we can simulate samples without ever computing the normalization of the posterior probability. In this particular example, for this choice of how we propose new points, this R is not tunable; I'll get into the details of that in just a moment. For the simplest case, this is the acceptance probability you need to use to be guaranteed a correct sampling from your posterior. If you put some factor here, which might be one way you'd try to tune it, you're not going to get a correct sampling; you'll get some different distribution that you might be able to compute, but no, it always has to look like this. Now, one thing I haven't explained is how we propose our points. At each step in our chain, we have to propose a new point, chosen randomly based on the current location. The simplest way to do that is to draw a little Gaussian around the current point and sample from that Gaussian, because Gaussians are pretty easy to sample from.
And the basic idea is that if you're in a good part of parameter space, somewhere nearby is probably also going to be pretty good. So you draw a point from that Gaussian and then compute this acceptance probability. Just for completeness, it's worth adding one thing. Let's call this proposal distribution Q of theta given theta t: it takes the current location and proposes a new location. To make things completely correct, you actually have to include a term over here that looks like this. But you'll see that since we're using a symmetric Gaussian, this would cancel, and the term would just be one; that's why I didn't need to include it in the first case. But if you were using something more complicated, if you had a better way of making proposals, for example, you might have to compute this term, and that's how you would adjust the acceptance probability. Yeah, I mean, you can. So it's actually a semicolon that I've written; I recognize that it's hard to see when you're not wearing your glasses. But that's a standard way of writing it, in the sense that theta t is like a parameter of the proposal, not a variable, exactly. Either way would be fine. It's definitely not a comma, though, so yeah, agreed. Yeah, that's right. Well, suppose you had some reason to know that over here was better: maybe you had some auxiliary information, like an ensemble of samples, suggesting there was more density in that direction. Then you might want to encourage your samples to move towards the high-density region, and in that case the proposal would be asymmetric. Well, no: if you compute this term, then it corrects for that effect, right? That's exactly why this term enters when you don't have a symmetric proposal density: you need to account for that fact. Absolutely. And part of the reason is, I'm not going to go through any proofs, but there are proofs that if you use a particular form of MCMC, then given an infinite number of samples, you will converge to a correct sampling from your probability distribution. The proofs involve things like detailed balance, and they're actually very subtle; I don't feel qualified to talk about them in detail. So that's one of the reasons you have to be a little careful if you're going to implement something crazy. Yeah. If you look at each individual proposal you make, it's not very clear what you're improving, but in the end you get a very nice distribution that follows the target. Right, that's a good point; I think that's the beauty of this algorithm. It sounds like a mess, but in the end you get something really, really nice. That's right, yeah. And just the concrete point that the product at the end of this is a list of samples. That's where the chain comes in. It's not just the final sample or something like that that matters; it's the list, the ensemble of all the samples you've generated through these many iterations of this procedure. All right, leading question alert: you said you'd accept the new sample with that probability, either one or that ratio. What if the poor new proposed sample doesn't meet that test and you reject it? Yeah, that's a good point, Phil.
So if you reject the sample, then you put the old sample back into the chain: you get a second realization of the exact same location in your chain. And that tends to happen a lot. Is that insane? Yeah, it does seem insane, because especially if your likelihood function is very expensive to evaluate, you've just thrown away all that information. You ran some integral for 20 minutes to compute your likelihood function, and then you throw away the fact that you ever even saw that point and you're never allowed to look at it again. That seems insane, and it is one of the reasons why... Well, yeah, I'm overstating it, I agree. Oh, yeah, absolutely. But the fact that you can't remember anything about the value of the likelihood you computed at that other point is the bit that seems counterintuitive, right? But no, I agree: there is information in the fact that you have doubles of the same sample, of course, and you have to keep them. Yeah, we're going to get to that in just a second. Okay, so the comment was that if your sampler is bad, you're going to see lots of repeats of the same sample; and even if your sampler is good, in a real problem you're probably going to see lots of repeats as well. Yeah. The place you can tune is the Gaussian, the parameters of the Gaussian proposal. Perfect, that's exactly where I was about to go. No, thank you for leading us into that. And I was worried originally, like, oh, that's going to blur your likelihood if it's too big, but it doesn't, because of this acceptance criterion. But it will also change the acceptance fraction, I guess. Exactly. Beautiful. Okay, so this all sounds pretty nice so far, but the point is that this proposal distribution Q has tuning parameters in it, right? In our two-dimensional problem it has something like three tuning parameters, because it's a Gaussian: it could be a Gaussian that looks like this, or one that looks like this, and you can imagine all the different combinations. Just to demonstrate what happens, let's say we choose a really bad proposal, a tiny little Gaussian that looks like that. What's going to happen is we'll run for thousands of steps, it will kind of walk around over here, and it will slowly go towards the right answer. If we ran forever, it would provably draw samples from the probability distribution we want. But we can't run forever in the real world; we need to someday write our paper, and so this is just a waste of time. The other extreme: let's say we use a huge proposal, and our first step gets us over here. That's really good: now we've dealt with this mess at the beginning. But then the next proposal we make is going to be way over here, which has essentially zero probability, so it's never going to get accepted. We reject it, we run for thousands of steps, and we just have a thousand copies of this point. So you can see that at the extreme ends, the performance of your sampler on any real-world problem is going to be very sensitive to the settings of your proposal distribution. What it comes down to, at the end of the day, is tuning those parameters, and that's what I wanted to talk about.
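Here is a minimal random-walk Metropolis sketch pulling together the pieces just described: the unnormalized log posterior, the Gaussian proposal with its tunable scale, and the repeat-on-rejection behavior. The names and structure are illustrative, not from the talk.

```python
import numpy as np

def metropolis(log_post, theta0, n_steps, scale, seed=0):
    """Minimal random-walk Metropolis. `log_post` returns the log of
    prior times likelihood -- the evidence Z cancels in the acceptance
    ratio, so it is never needed. `scale` is the Gaussian proposal
    width, the tuning parameter discussed above."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        prop = theta + scale * rng.standard_normal(theta.shape)
        lp_prop = log_post(prop)
        # Accept with probability min(1, p(prop)/p(current)), in log space.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[i] = theta      # on rejection, the old point is repeated
    return chain
```

For the line-fitting example, `log_post` would be the log prior plus the `log_likelihood` sketched earlier.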
Oh, and the other point is that this gets worse as the number of dimensions grows. I've been drawing a two-dimensional problem here, with something like three tuning parameters; our motivating example had three parameters, and so it would have six tuning parameters; and as we go up in dimension, the number of tuning parameters grows quadratically, which quickly becomes intractable. By the time we have a five-dimensional problem, we're done tuning by hand. So I want to talk about tuning a little bit. Basically, our goal in the standard Metropolis method, or really in general, is to have a proposal distribution that looks like the distribution we're trying to sample. If we knew exactly what the posterior probability was, then a great proposal distribution would be to sample from the posterior itself, and we'd be done; but of course the whole point is that we can't do that. If you can find one that's pretty close, though, that does a good job. On purpose, I drew this as not a Gaussian, because that means the Gaussian proposal that would be best up here is different from the one that would be best down here, so you'd want different tuning parameters in different places. At the end of the day, you have to settle for something that's good enough. Now, there are lots of different methods for tuning parameters and things like that, but what you really need is some measurement of how well your sampler is doing, so that you can change your parameters and say: look, my sampler is doing a better job than it was before. That measurement is called the integrated autocorrelation time, and I'm going to explain where it comes from right now. So imagine, I'm just going to erase all of this here, I think we saw a few trace plots yesterday, so I'll draw one of those up here to give us a sense. After we initialize, the chain goes like this: for one parameter, theta one, as a function of time in our sampler, it moves around the space, sampling like that. One thing we can do is take this chain and look at its autocorrelation function. That's going to look something like this, where tau is the time lag, if we have a good sampler. The reason it starts high is that at a lag of one, the points are going to be very correlated, because you're proposing each new position based on the previous one. But as you go to larger lags, the hope is that by the end of your chain you'll have forgotten where you started; that's why you'd hope the autocorrelation function falls off to zero at longer and longer lags. And of course, if it fell off sooner, that would be even better, because it means the chain forgets faster. So what you want to do is condense this into a single number, something like the number of steps required to forget the location where you started. Yeah: why does it sometimes go below zero, anticorrelated?
That's probably because I'm drawing what the numerical estimate of an autocorrelation function for a sampler would look like. I guess, yeah, I don't know; that's a good question. Yeah. Sure, yeah. In detail, this is what they do look like. Yeah, that's right. Absolutely. I think that sounds believable. Yeah, I'm going to talk about that in a second. But what I was going to say is: there's this quantity, and it's terrible notation because it's often just called tau, but let's call it tau int, the integrated autocorrelation time, which is basically the integral under this function. The reason that's relevant: remember, I said we were computing these integrals that looked like this, and what we're interested in is the uncertainty on that integral, our Monte Carlo uncertainty from computing it numerically. You'll remember that if you had completely independent samples, the variance of that estimate would scale like one over N as you get more and more samples, so the uncertainty would scale like one over root N. But since we don't have completely independent samples, what you get looks like this instead, and the integrated autocorrelation time shows up in it. What that means is that you really just have an effective number of samples, given by N over the integrated autocorrelation time. That's where my original statement came from: it's something like the number of steps you need to forget your previous location. So at the end of the day, if you run a thousand steps with an autocorrelation time of ten, you actually only have a hundred effective samples, because the rest of them are not independent. So what you want... yeah, go ahead. No, no, please. How do you estimate that autocorrelation time? Is it when the autocorrelation gets below a certain value? Sure; let me come back to that in just a minute. All I wanted to say is that your sampler is better if it has a shorter integrated autocorrelation time. But one subtlety of that statement: you'll notice that this is all one-dimensional. What that means is that you have a different autocorrelation time for every integral you do: it's the chain of values of f of theta that matters. At each proposal you also have a value of f of theta, and it's that one-dimensional chain that enters this computation. I should probably put an f subscript here, because it depends on the function you're integrating. So yeah, great, I agree. And that f could be something simple, like just theta one as I plotted before, but it could also be something more complicated. So really, the autocorrelation time that matters is the autocorrelation time of the quantity you care about, the number you're going to put in the abstract, and these can sometimes be wildly different. What you want is the sampler that gives you the shortest autocorrelation time, so that you do the fewest calculations to get the most effective samples.
One comment about how we compute the autocorrelation time and the autocorrelation function: it's all a circular argument, right? You have to run your chain, then compute the autocorrelation function of your chain, then try to estimate this number from a finite number of samples. You can imagine that if you only have a small number of samples, and you haven't run long enough to actually sample the full probability distribution, you'll get an incorrect value for the autocorrelation function, and therefore also for the autocorrelation time. At the end of the day, there are a lot of different heuristic methods for doing this, and in the worksheet that I put online, I link to a document with the best description I know of good methods that have been tested empirically. I'd recommend just looking at that; I don't think going into the details here would be useful. Basically, you end up with an iterative method, where you make sure your results aren't changing very much as you run for longer. Yeah? The procedure you just described sounds a lot like the engineering work people do in machine learning, right? Markov chain Monte Carlo now sounds like some kind of regression problem where you're trying to estimate a posterior PDF, but you're still estimating it, and you have control parameters in your setup that you have to make decisions about before you settle on your estimate. Yeah. I don't think anyone actually tunes their MCMC by running it with a particular set of parameters, computing the autocorrelation time, taking the derivative of that, and moving in some direction. But ultimately, that is the procedure you'd want: somehow minimize the autocorrelation time. Choose the parameters that minimize your score, exactly. I don't think you'd even need to do that, because there's no real risk of overfitting. In practice, there are lots of other heuristic tuning methods, but for most reasonably simple problems, if you're using a sampler like this, trying some different values by hand for your particular problem and looking at the autocorrelation times you get out tends to be fairly reliable. I was going to list some of the others. I've been talking about tuning, but you can see this is also related to the question of convergence, a term that comes up a lot when we talk about MCMC: how do you know you've run long enough to have a good enough sampling for your problem? And of course, strictly, that's impossible: you'd have to run for an infinite number of steps to satisfy any of the proofs, so any decision to stop is necessarily based on heuristics. But setting a goal, a target for the number of effective samples you need to get the precision you want on this integral, is the most justified way I can think of to do it. It does tend to be quite computationally expensive, though, because you have to run for a long time to get a reliable autocorrelation time.
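Here is a sketch of one way to estimate the integrated autocorrelation time, assuming an FFT-based autocorrelation function and a Sokal-style automated truncation window; the exact heuristic in the note linked from the worksheet may differ in its details.

```python
import numpy as np

def autocorr_time(x, c=5.0):
    """Estimate the integrated autocorrelation time of a 1-D chain `x`
    (the chain of f(theta) values you care about): FFT-based
    autocorrelation plus an automated truncation window M >= c * tau."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    f = np.fft.fft(x, n=2 * n)                 # zero-pad to avoid wraparound
    acf = np.fft.ifft(f * np.conj(f))[:n].real
    acf /= acf[0]                              # normalize so acf[0] = 1
    taus = 1.0 + 2.0 * np.cumsum(acf[1:])      # tau estimate vs. window size
    for m, tau in enumerate(taus, start=1):
        if m >= c * tau:                       # stop before the noisy tail
            return tau
    return taus[-1]

# Effective sample size: n_eff = len(chain) / autocorr_time(f_of_theta_chain)
```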
So there are other heuristic methods. Looking at the acceptance fraction is one: the point is that if your proposal is bad, you're either going to accept essentially every proposal or almost none at all, and you want to be somewhere in the middle; there are various arguments about why you might want to be around about a quarter. That's a proxy people use, but it's not rigorously justified. Then there are other things, like the Gelman-Rubin statistic, where you run multiple chains and compare the variance between the chains with the variance within a chain; that's another heuristic you can use to try to quantify how converged you are. And it's very important to do this. I wanted to spend a while on this point, because with MCMC it really matters that you can actually quantify how well you're doing. It's not good enough to just look at the traces and say, oh, this looks good enough. There have been various results in the literature demonstrating that something can look all right in a trace plot while not satisfying any of the convergence criteria: if you compute an autocorrelation time, it's clear that you're not converged. So it's really important to do something like this any time you run an MCMC, and an estimate of the integrated autocorrelation time should probably be included somewhere in your paper, as a statement that I know I've converged, or I'm happy with the number of samples I have, because here's the integrated autocorrelation time and here's how many samples I have. Something like that. Let's see, what did I want to do here? Sure: the expensive part is computing the autocorrelation function, and you can use fast Fourier transforms for that, and you should always do it that way instead of computing it directly. Once you have the autocorrelation function, there are nice procedures that are super fast for estimating the autocorrelation time from it. That's right. Yeah, I recently spent a lot of quality time computing autocorrelation times, and I found it very instructive to try it out and compare to the standard packages, because there are a lot of decisions that go into any procedure for doing it; but yeah, there are definitely methods out there. The suggestion, in the note I linked from the worksheet, is that as a minimum you want your chain to be at least 10 autocorrelation times long, given the estimate you're getting, and even that is small, absolutely. If you don't have something like that, you're really in trouble. So, the thing I was going to say: I demonstrated the Metropolis method here, the simplest possible MCMC, and for lots of problems it's probably sufficient, but there are all these questions about tuning and other things. So there are lots of other methods people have developed, and they all have their pluses and minuses. Can anyone suggest another MCMC method, and a place where you might want to use it, where it would be useful? So you all wrote your own Metropolis samplers and use those? Yeah, so there's a class of ensemble methods, and one of those would be emcee.
And do you have any thoughts about why you might want to use an ensemble method, what the benefits are? Yeah, exactly. So with these ensemble methods, the basic idea is that if you have a lot of samplers all running in parallel, they can know about each other and use that information to spend more time sampling the high-density parts of parameter space and waste less time elsewhere. And the thing that's really nice is that there are few tuning parameters. That tends to work best on well-behaved, smallish problems: if you have, say, a 10-dimensional problem that's fairly well-behaved, not super multimodal or anything like that. Does anyone have another recommendation? Yeah: in high dimensions, the other extreme is Hamiltonian Monte Carlo, HMC, where the upside is that it's really good in very high-dimensional spaces, but the downside is that you need the gradient. Yeah, exactly; that's a really good point. A Hamiltonian method is exactly the same as a Metropolis method, except the proposal is much more complicated. It actually does a numerical integration of the point moving through the potential energy given by the log probability. So you have to be able to compute derivatives of your probability function with respect to the parameters in order to do the leapfrog integration that's normally used there. The proposal is very expensive, but if you do it right, you accept every proposal; it always gives you a good suggestion. That tends to help a lot in large numbers of dimensions, where the standard Metropolis methods and ensemble methods start to break down and just give you bad proposals all the time. So that's a good suggestion. A standard package is Stan, which uses Hamiltonian methods, but the one downside to Stan is that it supports a limited set of functions: if your likelihood function is some black box that does numerical integrals, looks things up in a table, interpolates, and does all these other things, then Stan is going to struggle with it. I don't know a good implementation that works with black boxes. Cosmo++? Okay, Cosmo++ for an implementation that works with black boxes. You'd think, with autodiff now existing, someone would be working on this. Yeah, so in principle you can do it with Stan: they have a C++ API with autodiff, so if you can write your likelihood function in C++, in principle you can just plug it in, but as far as I understand there's no documentation for that, so I wouldn't recommend it.
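Since the leapfrog integration was just mentioned: here is a minimal sketch of that proposal step, assuming the "potential energy" is minus the log posterior, so the force is the gradient of the log posterior. This is the textbook scheme, not code from any particular package.

```python
import numpy as np

def leapfrog(grad_log_post, theta, p, eps, n_steps):
    """Leapfrog integration of the Hamiltonian dynamics inside HMC.
    The potential is -log p(theta), so the force on the point is the
    gradient of the log posterior -- which is why HMC needs derivatives
    of your probability function."""
    theta, p = theta.copy(), p.copy()
    p = p + 0.5 * eps * grad_log_post(theta)     # initial half-step in momentum
    for _ in range(n_steps - 1):
        theta = theta + eps * p                  # full step in position
        p = p + eps * grad_log_post(theta)       # full step in momentum
    theta = theta + eps * p
    p = p + 0.5 * eps * grad_log_post(theta)     # final half-step in momentum
    return theta, p
```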
Anything else? Yeah, Gibbs sampling; that's another good one. Do you have a suggestion for why you might want to use it, and a word or two about it? So in some steps you sample some of the parameters and freeze the others, and in other steps you sample the other parameters and freeze the first set, and you alternate. I think people often use it in hierarchical problems, where some parameters affect all the data points but other parameters affect only some of them: you take a step in the global parameters, then do lots of work on the local parameters independently, then come back to the global parameters, something like that. I have to admit I haven't actually ever used it myself. Well, it is useful, because, as you mentioned before, it would be ideal if you could sample from the posterior directly, and Gibbs kind of does that by sampling from the conditionals: you fix a point, make a slice, and then sample along the resulting conditional distribution. Of course, you could sample from the conditional using Metropolis or something like that, right? So it's better if you can compute the conditional. In principle you could still use Gibbs with the same machinery, but the real benefit comes when there's structure in the problem, when for some parameters you can sample directly from the conditional. I think that's the key thing: the direct sampling. If you can formulate the problem so that for some parameters you just call a random number generator and draw a sample, that's the real winner. Yeah, hierarchical models. In the end, the way they analyze CMB data like Planck is with these hierarchical models, with blocks of parameters where the conditional probabilities are simple. Using Gibbs, you sample these five or six very large blocks of parameters, such as foregrounds, maps of the CMB, systematics, instrumental effects, noise, et cetera; some of the conditionals are analytic, others have to be sampled, and it's actually a very successful example of Gibbs sampling. Yeah, good. Anyone else? I have one more. Oh, Ben, yeah. You mentioned multimodal posteriors, and I don't think anything listed there is particularly good at that kind of thing, but if you have some kind of tempering algorithm, to soften the prior-times-likelihood, that can sample pretty well from multimodal posteriors. Yeah, that's a really good point. Like parallel tempering or something? Yeah, for example. So the last one I wanted to bring up is nested sampling, which is quite related to tempering in certain ways. With nested sampling and tempering methods you can also do the one thing I didn't talk about: that normalization constant, the evidence integral. That's the one integral you can't do with the standard methods discussed here, the Metropolis methods and most of these top ones. If you do want to evaluate it, you can use tempering or nested sampling, but that's a topic for a different day. Also, some nested sampling algorithms can work better on multimodal problems, for much the same reasons tempering does. Should we talk about the difference between Markov chain sampling methods and non-Markov-chain methods, like importance sampling? Oh yeah, that's a good point. At the beginning I asserted that I was just going to focus on Markov chain methods, but there are others: simple Monte Carlo, importance sampling, things like that. A lot of what we talked about here wouldn't necessarily apply to those, and they would let you compute evidence integrals and so on as well, but I didn't want to get into that here today. Okay, so then I think we're probably finished, right? Apparently lunch is a little delayed, so if you have more to say... Sure, okay, I have one last thing I wanted to talk about. These things always take longer than you expect.
I have one last thing that I wanted to talk about. These things always take longer than you expect. So: you've chosen and implemented your MCMC method, you've run it, you've looked at convergence, you've computed autocorrelation times, things like that. Now you need to display your results, and that's an important discussion. You need to interpret your results too, of course. So just before we wrap up, I wanted to talk about a few different ways you might display your results.

The first thing to say, I just erased it, but as we heard yesterday: you should always look at the trace plots. Those don't necessarily go into your paper, but they're very useful for debugging. Always plot them, and if something crazy is going on, figure out what's happening. That's just the chain as a function of time, in whatever parameter or collection of parameters you care about.

Another useful one is a scatter-plot matrix, or corner plot, where, that drawing's terrible, basically you make a matrix whose rows and columns are theta 1, theta 2, theta 3, and so on, and you plot the parameters against each other. On the diagonal you plot a histogram of your samples in one parameter, and off the diagonal you plot the 2D histogram between, say, theta 1 and theta 2. This is really useful: you get all of the different pairwise projections, so you can see whether you have covariances and other interesting features in the space.

Another relevant point is that when you make a histogram, you're actually integrating. Say we want to compute the value of one bin of the histogram. That's evaluated by doing an integral where the function is a top hat: zero everywhere except in that bin. What that means is that if you compute this using samples, by just histogramming the values in one parameter, you get a representation of the marginalized likelihood... sorry, of the marginalized posterior. It's the posterior probability of that one parameter, with the effects of, and your uncertainty about, all the other parameters marginalized out. The same is true in two dimensions. So all the panels in a plot like that are marginalized probability distributions, which is very useful: if there are just two parameters you care about, you can throw away all the other columns of your samples and you have the marginalized distributions for each of those. Does that make sense? Okay.

The last sort of plot, and you should always do this one, is to look at your predictive distribution in the data space. If we have our line data, you take samples from your posterior, say a hundred of them, and just plot the line each one predicts in that space. It's going to look like a bundle of lines, and you get a representation of what's happening in a space you understand. A feature in parameter space might be hard to interpret, but in data space you might quickly see that you're overfitting to one outlier, say, and everything is getting pulled down. So that's an amazingly useful diagnostic plot, and it's also useful for your paper. That's one you should always do.
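Here is a minimal matplotlib sketch of those last two plots for the two-parameter line example; the data and the "posterior" samples are faked just so the script runs end to end, so treat it as the shape of the code rather than a real analysis.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)

    # Fake data and fake posterior draws for a line y = m*x + b, purely for
    # illustration; in practice `samples` comes out of your MCMC run.
    x = np.sort(rng.uniform(0, 10, 20))
    y = 1.5 * x + 2.0 + rng.normal(0, 1, x.size)
    samples = rng.multivariate_normal([1.5, 2.0],
                                      [[0.01, -0.04], [-0.04, 0.25]], 2000)
    labels = ["m", "b"]
    ndim = samples.shape[1]

    # Scatter-plot matrix: 1D marginal histograms on the diagonal, 2D marginal
    # histograms below it. Every panel is a marginalized posterior.
    fig, axes = plt.subplots(ndim, ndim, figsize=(6, 6))
    for i in range(ndim):
        for j in range(ndim):
            if i == j:
                axes[i, j].hist(samples[:, i], bins=40, histtype="step")
            elif j < i:
                axes[i, j].hist2d(samples[:, j], samples[:, i], bins=40)
            else:
                axes[i, j].axis("off")
        axes[-1, i].set_xlabel(labels[i])

    # Posterior predictive plot: the lines implied by 100 random posterior
    # draws, overplotted on the data.
    fig2, ax = plt.subplots()
    ax.plot(x, y, ".k")
    xg = np.linspace(x.min(), x.max(), 100)
    for m, b in samples[rng.choice(len(samples), size=100)]:
        ax.plot(xg, m * xg + b, color="C0", alpha=0.1)
    ax.set_xlabel("x"); ax.set_ylabel("y")
    plt.show()

The corner package automates the scatter-plot matrix part of this for any number of parameters, which is worth using once you have more than a couple of dimensions.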
Yeah. Say I take the mean of all of my samples in each dimension and put those values in my paper. Is there anything else I should do to make sure other people can use my results, and is there anything really bad about doing that? So, as we discussed yesterday, as long as you say what you did to compute those numbers, you're fine. It's very useful to propagate samples: if you can save the samples you've computed, along with the value of the log likelihood and the log prior evaluated at each of them, that's a very useful object and the best representation to pass along. But yeah, you can choose to compute means, variances, quantiles, medians, anything like that; it's fine as long as you explain exactly what those numbers mean (a small sketch of one common convention is at the end of this transcript). Because if a paper says, you know, 50.1 plus 0.3 minus 0.01, I have a hard time knowing exactly what that means, or using that value, unless you explain exactly how you computed those numbers. Sure. But there's at least a chance.

So I think that's all I wanted to talk about. Oh yeah, one last thing. I just want to advertise that by the end of the day today, David Hogg and I are going to have a completed manuscript, with tutorials on how to do MCMC, that we're submitting to the arXiv. It's going to be a bit of a hard afternoon. But if you want, we're writing it in the open on GitHub, of course, so you can just clone the repository and type make, and in principle that should run all of the examples, create the figures, and compile the LaTeX document. There's lots more information in there, much more than you could ever possibly want, but it might come in handy if you're doing some MCMC yourselves. So I just wanted to advertise that, and hopefully it'll be on the arXiv eventually.

So Dan, I was going to ask if there's a good reference you'd recommend on some techniques like HMC or Gibbs, but did you just answer that question? In our paper we don't actually go into the details of the specific implementations of these other methods. For HMC, the classic book chapter by Radford Neal is where I learned about it, and I don't know if I have anything better. I don't think there's an astronomy-specific document for that. Actually, there is a book; I can find the reference later, but it's a collection of chapters by different authors on different algorithms, and it goes into the math more. It has some nice historical perspective on the development of the methods as well. I know Xiao-Li Meng is one of the co-editors of that volume. That sounds believable, yeah. I'm sure, yeah. Okay, great. Yeah, good.

Any other questions? If anyone's interested in doing MCMC, is new to this, and wants to implement a method or something like that, the worksheet I put online might be a good starting place. And if you have any feedback or run into any problems, please don't hesitate to talk to me this afternoon, or email me or whatever; my contact information is very easy to find, and I try to respond to emails, though not always super fast. Okay, let's thank Dan again. Thank you, Dan. And I think the pizza may have just arrived for lunch. Do we want to do any announcements before lunch? Okay, all right.
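Finally, the sketch promised above of one common convention for turning samples into the numbers you quote: the median with the 16th and 84th percentiles as asymmetric error bars, roughly plus or minus one sigma if the marginal posterior happened to be Gaussian. The fake samples are just there to make it run; whatever convention you pick, the point stands that you say so explicitly in the paper.

    import numpy as np

    def summarize(samples_1d):
        # Median with 16th/84th-percentile error bars, computed directly from
        # the marginal posterior samples of one parameter.
        lo, med, hi = np.percentile(samples_1d, [16, 50, 84])
        return med, hi - med, med - lo      # value, +err, -err

    rng = np.random.default_rng(1)          # fake marginal samples
    val, plus, minus = summarize(rng.normal(50.1, 0.2, 10000))
    print(f"{val:.2f} +{plus:.2f} -{minus:.2f}")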