All right, so apparently I speak very fast. So the way to slow me down, and actually the way to get something out of this lecture or this series of lectures, is to ask questions. None of this is intuitive until you've looked at it for 40 years, and then it becomes very easy. But the problem is that by then you might have forgotten, as I sometimes did when preparing these lectures, how it came about in the first place. Why do I know this? So I really urge you: you will get nothing out of what I'm saying unless you actually ask questions. And very few of us are native English speakers. So please, if I speak too fast, just tell me to slow down. It's not a problem. The problem is if you do not understand something that I said, because that will confuse you for the rest of this: everything that we're going to talk about is connected to everything else. So don't think, oh, I didn't understand that, that's OK, I'll understand something else. No, everything is connected to everything else. And the only reason I can work on these things is because everything is connected. Just so you know, my background is not in statistical physics, not in data science, and not in biology. I'm a string theorist. I did string theory for almost 20 years. I've only been doing biology for 22 or 23 years. So I really urge you to ask questions. Don't be shy. It's not a problem. If you see me standing around, come up and ask anything. I really want you to get something out of what I'm saying. Clear? Yes? And what we cover totally depends on what you understand. We can cover as little or as much as you want, but I want you to understand whatever it is that we do cover. Yes? Clear? OK. So what are we talking about here? We're going to talk about the inverse problem. What is the inverse problem?
Someone gives you a bunch of data and says: make a predictive model, so that if I have some other observation, I can say what the consequence will be. That's the whole point of inverse modeling. You see something, and you want to make a model that can predict. That's the whole point of data science. You are not just trying to collect data and store it in big piles, though I have to tell you that experimental biologists love collecting data. They're not so happy actually analyzing it. But they love collecting data and putting it in databases, which is good for me, just saying. So a basic principle in modeling data is to say: OK, we observed certain events, and we see the frequencies at which they happen. And what we're trying to do is find a probability distribution underlying the events that we actually observe. That's the basic problem in data science. Someone gives you a bunch of data, and you have to come up with a probability distribution from which that data could have been drawn. In whatever context in data science, this is what you're trying to do: find a probability distribution from observations. And the basic principle for this is called maximum likelihood. What is that? There are these probabilities that we're trying to learn, and these are the observed frequencies. The probabilities are normalized, and we want to maximize the likelihood; that's why it's called maximum likelihood. So we impose the normalization constraint with a Lagrange multiplier. Everyone understands how Lagrange multipliers work? Yes? So what's the idea? It's very simple calculus. If you maximize with respect to lambda, you get back the constraint: the sum of the probabilities equals one. That's what you're trying to enforce, right? That's the Lagrange multiplier's job.
And when you maximize with respect to the probabilities, what you find is that the ratio of each frequency to its probability is a constant, right? So what's that telling you? That the probabilities are proportional to the frequencies that were observed. And when you solve the constraint, you learn that lambda equals the sum of the fi; since the sum of the pi is one and the frequencies are normalized, lambda has to be one. So you find that the observed frequencies are equal to the probabilities of the events you observed, right? Now, what did you learn? You learned absolutely nothing. This is a trivial solution. It's saying that the probability of observing an event is just the frequency at which I already observed it. So if you see a new event, one that never occurred in the data, you have no predictive probability for it, OK? So what we want to do is compress all this data into a model that can actually extrapolate, that can predict events that were not observed. That's what we're trying to do with modeling: find something that lets us extrapolate from the events that we did observe to events that we did not. So that's the first place where you're making a choice. You're looking at a family of models, and you're asking, in this family of models, which is the most predictive one? You're trying to find the best model, yes? Is that clear? That's the basic problem that we're trying to solve here, OK? So, as I just said, we need models for calculating probabilities. And let me point out an interesting thing that we'll see over and over. The log of the likelihood is just the sum of the frequencies multiplied by the logs of the probabilities, right? And by adding and subtracting fi log fi, we can write it in terms of this quantity right here: fi log pi over fi, right? Yes?
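A tiny numerical sketch of this point (the data and names here are made up for illustration): with no model family at all, the maximum-likelihood answer is just the empirical frequencies, and any other normalized distribution scores a strictly lower likelihood.

```python
import math
from collections import Counter

def empirical_frequencies(events):
    """Observed frequencies f_i from a list of events."""
    counts = Counter(events)
    n = len(events)
    return {e: c / n for e, c in counts.items()}

def log_likelihood(freqs, probs):
    """sum_i f_i log p_i, the per-observation log-likelihood."""
    return sum(f * math.log(probs[e]) for e, f in freqs.items())

data = list("AAABBC")            # six made-up observations
f = empirical_frequencies(data)  # {'A': 1/2, 'B': 1/3, 'C': 1/6}

# With no constraints, maximum likelihood gives p_i = f_i: any other
# normalized distribution scores strictly worse.
p_other = {'A': 0.4, 'B': 0.4, 'C': 0.2}
assert log_likelihood(f, f) > log_likelihood(f, p_other)
```

This is exactly the "trivial solution": the model has memorized the data and assigns zero probability to anything unseen, which is why we restrict to a model family.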
And this thing is going to come up over and over, so you might as well memorize it; there's no way out of remembering this particular combination, OK? And this other piece is a constant: just the observed frequencies times the logs of the observed frequencies, summed over the observed events. It does nothing for us, so you might as well ignore it. This part is what all of this is going to be about, so we want to focus on maximizing this particular combination, OK? Good. So now, why did we bring this up? It's because of this class of functions: convex functions. What's a convex function? It's a function with the property that if you draw a straight line between any two points on the curve, the function always lies below that line. No matter which two points you pick, the function lies below the chord, right? That's a convex function. It doesn't need to be differentiable; it turns out that a convex function is differentiable almost everywhere, and it's certainly continuous. But this is the basic point: there's a straight line, and the function lies below it, no matter where you pick those two points, OK? Yes? (Sorry, a brief struggle with the clicker here; we'll sort it out tomorrow, maybe. That's what comes with age, you learn things slowly.) All right, and down here, that's just the general form of the same inequality. If you have a bunch of points, then the function evaluated at the weighted combination of those points is less than or equal to the weighted combination of the function values at the points, right? Yes?
And this is going to be useful to us because the assumption here is that the sum of the ti equals 1, and the ti for us are going to be frequencies or probabilities, OK? That's how we're going to use this, many, many times. The ti are non-negative and they add up to 1, so they're pretty much probabilities. Yes? All right. The general form is called Jensen's inequality, though other people found it first. The basic form that is used: a convex function evaluated at the expectation of a random variable is always less than or equal to the expectation of the function evaluated on the random variable, OK? Maybe I can do red and green. Is that clear? Anyone who doesn't understand this? The expectation here is just that weighted sum with the ti that we had. And really, the proof follows from the finite case: you approximate a continuous distribution of the random variable, then use induction and limits, and you can prove this. So it's no big deal. Some notation: I'll write E of O, where O is any observable, and what I mean by that is the sum over all events of the probability of the event times the value of the observable at that event, right? Yes? So that's our notation. OK, that combination I was asking you to memorize: the negative of it is called the Kullback–Leibler divergence. And guess what, I started telling you about convex functions because minus logarithm is a convex function. What else is a convex function? Someone? The quadratic function, yes, our most favorite convex function. The exponential function is a convex function. Minus log is a convex function. So maximizing the combination I asked you to remember is the same as minimizing this, right?
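The weighted form of Jensen's inequality just described can be checked numerically. A minimal sketch, with made-up weights ti (summing to 1) and points xi: for any convex f, the gap E[f(X)] − f(E[X]) is non-negative.

```python
import math

def jensen_gap(f, t, x):
    """E[f(X)] - f(E[X]) for weights t_i (non-negative, summing to 1)
    and points x_i. Non-negative whenever f is convex."""
    mean = sum(ti * xi for ti, xi in zip(t, x))
    return sum(ti * f(xi) for ti, xi in zip(t, x)) - f(mean)

t = [0.2, 0.5, 0.3]   # weights playing the role of the t_i (probabilities)
x = [1.0, 2.0, 4.0]   # values of the random variable

assert jensen_gap(lambda u: u * u, t, x) >= 0.0         # quadratic is convex
assert jensen_gap(lambda u: -math.log(u), t, x) >= 0.0  # minus log is convex
assert jensen_gap(math.exp, t, x) >= 0.0                # exponential is convex
```

The minus-log case is the one the lecture leans on: it is exactly the convexity fact used below to prove the divergence is non-negative.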
Convex functions we want to minimize; concave functions we want to maximize, right? And the negative of a convex function is a concave function and vice versa. So this is the Kullback–Leibler divergence, and it is a measure of how close two probability distributions are. Yes? And the interesting thing is, I say it's a measure of how close, but it's not a distance measure. Why is it not a distance measure? Someone? Right, it's not a symmetric function, OK? And there's no triangle inequality that goes with the Kullback–Leibler divergence. It's just a measure of how close two probability distributions are. The way I've written it is in terms of a discrete set of probabilities; when you have a continuous probability distribution, there's a slightly more subtle form. But that's the form: the sum of fi log fi divided by pi. Now, it's good to understand exactly why we write it this way, because some of these frequencies could be 0, right? The assumption that goes into this is that any event with a non-zero frequency must have a non-zero pi, but not vice versa. It's not symmetric: the pi are always supposed to have broader support than the fi. So some of the fi can be 0, and fi log fi is still 0, so that's perfectly well-defined. But you can't do it the other way: you can't have fi non-zero where pi is 0, OK? Yes? That's how I remember which way this form goes: the pi in the denominator must always be non-zero. OK. And you can show that this is convex very simply, and this, again, is a proof that is good to almost memorize. It's very simple, and it just follows from the convexity of minus log. So you say: we have the sum of fi log fi over pi, which is minus the sum of fi log pi over fi; by Jensen this is greater than or equal to minus log of the sum of fi times pi over fi. That's this part, right?
And then we use the other side of the convexity inequality, and we get this. The fi's cancel, and we're left with minus log of the sum of the pi. And why isn't that 0? Because these pi are only summed over those events which had a non-zero frequency. The sum of pi over all possible events is equal to 1, it's a probability, but this sum is only over the events that were actually observed. So this is always greater than or equal to minus log of something at most 1, which is greater than or equal to 0, OK? Yes? I know, I'm sure many of you have seen this before, but I want to make sure that we're all on the same page. I'm going over very basic things, but I want everyone to understand, OK? Good. Any questions? Are we good? OK, let's go on then. So we've decided that we want to minimize this Kullback–Leibler divergence, and it has this very nice form. And we need some sort of model for probabilities. And this is where statistical physics enters the picture. Why does statistical physics have anything to do with probabilities? Because basically all this stuff about entropies and so on that we're going to talk about over and over, Mr. Boltzmann is the one who figured it all out. And he was a statistical physicist; he was the first statistical physicist, if you like. So what Boltzmann said is that a physical system with a fixed internal energy will occupy all possible states that it can access. It has fixed energy, so it can only go to certain states, right? But every one of the states that it can access at that energy, it will access with equal probability. And this is encoded by saying that the system is trying to maximize its entropy. This is something that Shannon, the founder of information theory, found independently: that minus the sum of pi log pi is information, OK? So, suppose we have n possible events: what probability distribution maximizes the entropy? Yes, the uniform one.
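The three properties just argued (zero only when the distributions agree, non-negativity via Jensen, and the lack of symmetry) can be sketched in a few lines. A minimal implementation, with made-up distributions:

```python
import math

def kl(f, p):
    """D(f || p) = sum_i f_i log(f_i / p_i).
    Terms with f_i = 0 contribute 0 (0 log 0 = 0); we require
    p_i > 0 wherever f_i > 0, but not the other way around."""
    return sum(fi * math.log(fi / pi) for fi, pi in zip(f, p) if fi > 0)

f = [0.5, 0.5, 0.0]     # an unobserved event (f_i = 0) is fine
p = [0.25, 0.25, 0.5]   # p must be non-zero wherever f is non-zero

assert kl(f, f) == 0.0  # zero exactly when the distributions agree
assert kl(f, p) > 0.0   # Gibbs' inequality: D >= 0

# Not a distance: D(q || r) != D(r || q) in general
q, r = [0.5, 0.5], [0.9, 0.1]
assert abs(kl(q, r) - kl(r, q)) > 1e-6
```

Note that `kl(p, f)` in this example would be ill-defined: p puts weight on the third event where f is zero, which is exactly the support condition from the lecture.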
Every single probability has to be equal. If there are n events, each term is minus 1 over n times log of 1 over n, and there are n of them, so the entropy ends up being exactly log n, OK? That's the most entropy you can get for n possible events: equal probability for every event. Yes? Now, a fixed internal energy is kind of hard to work with, right? So what we do is a Legendre transform. We want to exchange the ensemble with fixed energy for an ensemble with fixed temperature, so that by changing the temperature, you can tune the internal energy to whatever you want, OK? Yes? That's what we're doing here: a Legendre transform. This is the internal energy: every state has a probability, every state has an energy, and the average of the energy over all states is called the internal energy. And we're doing a Legendre transform from the fixed internal energy of this ensemble to a fixed temperature, or rather, beta here is the inverse temperature. Every time I say temperature, please think of it as inverse temperature, OK? And this other term, what is it enforcing? Just that the probabilities have to add up to 1, right? OK? Yes? So now we're going to maximize entropy subject to these two constraints, with beta and lambda added as multipliers: beta enforces the constraint that the average energy is the internal energy U, and lambda enforces the normalization of the probabilities. So we maximize this. Well, that's what Newton and Leibniz told us: take derivatives and set them equal to 0. So we do that systematically, for all the probabilities and for both multipliers. Varying with respect to beta, we find, as we wanted, that the average energy is the internal energy U. Varying with respect to lambda, the probabilities are normalized. And the ds dpi equations give us this relation.
And with very little algebra, we see that the pi have to be of this form, where Z is the sum over all events of e to the minus beta Ei, OK? This is absolutely fundamental, so if there's any doubt in anyone's mind, talk to me now. Are we clear about what's going on here? Yes? We exchanged the internal energy for the temperature, and as a bonus, we derived what's called the Boltzmann distribution: e to the minus beta Ei, divided by the sum Z. That's the probability associated with a state of a given energy. So what do we learn? As the energy goes higher, the probability of that state goes lower, right? Yes? Clear, everyone? As E goes up, this form of the probabilities says that the probability goes down. So this is where data science comes in, OK? The idea now is that we've mapped our problem into a statistical physics problem: we're trying to find a relationship between our observed frequencies and some energies that we're associating with the system, and the idea is the higher the energy, the less likely the event. Clear? Yes? That is the basic idea here. And so that would be absolutely great; is that all we need? Well, we could always write pi equals e to the log pi. So what did we learn? What's the big deal? We could always write it in this form. The big deal is that there are many, many calculations we want to do which require that log pi have a very specific structure, OK? Because everyone sees that pi equals e to the log pi is an identity, right? I did nothing. But when I demand that log pi be linear in the parameters that we're trying to determine, theta, now we're actually saying something, OK? We're saying something about what kinds of models we can make in the statistical physics framework. We can't just make any old model.
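The Boltzmann distribution just derived can be written down directly. A short sketch, with made-up energy levels: higher energy means lower probability, and varying beta tunes the internal energy U, which is the point of the Legendre transform.

```python
import math

def boltzmann(energies, beta):
    """p_i = exp(-beta * E_i) / Z with Z = sum_j exp(-beta * E_j)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

energies = [0.0, 1.0, 2.0]       # made-up energy levels
p = boltzmann(energies, beta=1.0)

assert abs(sum(p) - 1.0) < 1e-12   # normalized
assert p[0] > p[1] > p[2]          # higher energy, lower probability

# The internal energy U = sum_i p_i E_i is tuned by varying beta:
u_cold = sum(pi * e for pi, e in zip(boltzmann(energies, 2.0), energies))
u_hot = sum(pi * e for pi, e in zip(boltzmann(energies, 0.5), energies))
assert u_cold < u_hot   # larger beta (lower temperature) -> lower U
```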
We can only make models where it makes sense to talk about probabilities taking this form: the log of the probability is a linear parameter times some observable, OK? In terms of statistics, these are the exponential family of distributions, where the probabilities are proportional to the exponential of something linear in the parameters times the observables, yes? Yes, question? Can beta be negative? Why not? We won't ever use the fact that beta is positive. Beta is just a parameter, and you'll see that nothing we do depends on beta being positive or negative. In fact, beta won't even occur in what we talk about. The only thing we're going to use is this form, that rare events have higher energy. You're saying that if beta is negative, then higher energy gives a higher probability. Yes, I know. But what I'm trying to say is, in data science, we're not actually talking about real energies. We're talking about fictitious energies that we're mapping to probabilities. We're not talking about real energies like a gas of molecules and whatnot; we're just talking about formal parameters and observables associated to those formal parameters. That's a good question, though, because this is important to understand: we're setting up a completely fake statistical physics, OK? In data science, all we're using are the intuitions and, more importantly, the mathematics of statistical physics. We're not actually using the physics of statistical physics. We're using the mathematics, which is the mathematics of convex functions, OK? Yes? Is that OK? You said that the form of the exponential is important? Yes, the form of that exponential is important, not the sign of beta. Not the sign of beta, not at all. Great question. It is fake statistical physics, but it gives us convex functions, so we're happy, OK?
And I'll show you many places where it's important that it's linear in these parameters, OK? That is what matters for us, not the sign of theta, which is just like beta. That it's linear in the parameters is the crucial thing, and I'll show you on the next slide exactly why. OK, so how restricted are we? Actually, the exponential family contains many, many of the distributions that you know and love and work with all the time. There's a list; I took it from Wikipedia, and I didn't even write all of them down. The normal distribution, beta, Poisson, exponential, Bernoulli: all are exponential family distributions, OK? Here's the general form. There has to be some function only of x, the value of the random variable; there has to be some function only of the parameters; and the key point is that in the exponential there is a term linear in the parameters times a function of the random variable. So if you can get the probability, where theta is the parameter of the distribution, to take this form, it's an exponential family distribution, OK? Yes? Can you give an example of something that isn't an exponential family distribution? Well, the normal distribution is a Gaussian distribution; Gauss was the first to come up with it. So what else? A discrete distribution? Oh, no. Actually, some discrete distributions are very easily put into this form; probably all of them can be. But a Lorentzian distribution, anything that has a heavy tail, cannot be put into this form, OK? So there are definitely distributions out there that are not in the exponential family, but it is quite a large collection of distributions. All right, so here are some examples. Suppose the random variable just takes the values minus 1 and 1, right?
Then the probability of sigma, with one parameter theta, is proportional to the exponential of theta times sigma. Where does this denominator come from? Someone tell me. It's just the normalization of the probabilities. You guys know too much physics; don't think of partition functions, it's just e to the minus theta plus e to the plus theta, OK? But yes, it is a partition function; it's just the normalization of probabilities. All right? OK, so if we have two variables sigma 0 and sigma 1, each taking values 0 and 1, what are all the possible distributions you can write down in the exponential family for these discrete events? There, I wrote them all down. And the denominator is, again, a sum over every possible configuration of the sigmas: this term is when both are 0, this is when only sigma 0 is non-zero, this is when only sigma 1 is non-zero, and this is when both are non-zero. So that's all the states. And this form I'd like you to remember: theta is a vector. There are three different parameters here, and there are three separate things I'll call observables: sigma 0, sigma 1, and the product, right? Yes? So we'll see this form over and over: the dot product of a vector of parameters with a vector of observables, yes? OK. And these are discrete random variables, and we wrote an exponential family distribution for them. A fun one is the Poisson distribution, which is usually written like this. Lambda is the rate of the Poisson process, and the random variable is a non-negative integer. But you can see this is an exponential family distribution by just rewriting it in terms of log lambda, right? Yes? And this brings up a very important point. We always want to work with natural parameters. By natural parameter, I mean something where you can write the exponent as the parameter times an observable, OK?
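The two-binary-variable example can be sketched directly. A minimal version, with made-up parameter values: theta is a three-component vector, the observables are (sigma0, sigma1, sigma0·sigma1), and the denominator sums over all four states.

```python
import math
from itertools import product

def prob(sigma, theta):
    """p(sigma) proportional to exp(theta . O(sigma)), with observables
    O = (sigma_0, sigma_1, sigma_0 * sigma_1) and sigma_i in {0, 1}."""
    def obs(s):
        return (s[0], s[1], s[0] * s[1])

    def weight(s):
        return math.exp(sum(t * o for t, o in zip(theta, obs(s))))

    # partition function: sum over all four configurations
    z = sum(weight(s) for s in product((0, 1), repeat=2))
    return weight(sigma) / z

theta = (0.3, -0.2, 1.0)   # made-up parameter vector
total = sum(prob(s, theta) for s in product((0, 1), repeat=2))
assert abs(total - 1.0) < 1e-12   # the denominator normalizes the family
```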
So when you write it like this, lambda is not the natural parameter of the Poisson distribution; the natural one is log lambda, because then you can write it in the exponential family form, yes? So we always want to work with natural parameters, and within a slide or two I'll show you why. And this is where we learn the value of natural parameters, OK? Log Z is a generating functional of correlations. What do I mean by that? Any expectation value that I want to evaluate of any of these observables (we introduced the notation for expectation values before), I can get by just differentiating with respect to the corresponding parameter. That's what makes them natural parameters. I take d by d theta for any of the natural parameters in an exponential family distribution, acting on log Z, the log of the partition function, and I get the expectation value of that observable, which, as we discussed, is the sum over events of the value of the observable times the probability of the event, right? Yes? We're clear about this, everyone? This is why we like natural parameters: differentiating with respect to natural parameters gives us the expectation values of their conjugate observables, OK? Yes? OK. Actually, there are even more important things. For instance, look at the second derivatives of the log of the partition function. This is a whole matrix, because there's a whole bunch of these parameters, theta 1, theta 2, and so on. So if you do the algebra: I take this d by d theta i expression and apply d by d theta j to it. What am I going to get? When d by d theta j acts on this, it can hit the numerator, right?
That gives me this term, Oi Oj, right? Yes? But there's also a theta j dependence in the denominator, and when I do that differentiation, I get a negative sign times this product of expectations, right? Yes? I know, very basic stuff, just to get on the same page, OK? And this particular combination I'm going to call a connected correlation function, just as a matter of notation. And this matrix Cij, this particular combination, is a very lovely matrix. Because if you take any vector of numbers vi and you compute this double sum, vi vj Cij, which in matrix notation is v transpose C v, just a number, that number is always greater than or equal to 0. Is that obvious to anyone? Let me step through it just to be clear. When I contract vi with the first factor, viOi is v dot O, and doing the same with vj gives v dot O again: that's this part, OK? Yes? If I contract vi with this other term, I get the expectation of v dot O, right there, and the same with vj. So if you multiply it out, you'll see there's the expectation of v dot O times v dot O, then minus 2 times the product of expectations, and then plus the product of expectations again. Put it together and you get the expectation of the square of v dot O minus its mean, and that's greater than or equal to 0. Why? Because it's the expectation of a square, so it's always greater than or equal to 0. What does this prove for us? Why do I care about this? You're nodding, so I want to know. Come on, guys, you're nodding; that means you understood. So tell me, why do I care that this is greater than or equal to 0? Yes, that's exactly right.
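Both claims being built up here — that derivatives of log Z give expectation values, and that v·Cv ≥ 0 for the connected correlation matrix — can be checked numerically on the two-binary-variable model from earlier (the parameter values below are made up):

```python
import math
from itertools import product

# Two binary variables with observables O = (s0, s1, s0*s1),
# p(s) proportional to exp(theta . O(s)).
states = list(product((0, 1), repeat=2))

def obs(s):
    return (s[0], s[1], s[0] * s[1])

def log_z(theta):
    return math.log(sum(
        math.exp(sum(t * o for t, o in zip(theta, obs(s)))) for s in states))

def moments(theta):
    p = {s: math.exp(sum(t * o for t, o in zip(theta, obs(s))) - log_z(theta))
         for s in states}
    mean = [sum(p[s] * obs(s)[i] for s in states) for i in range(3)]
    # connected correlations C_ij = <O_i O_j> - <O_i><O_j>
    c = [[sum(p[s] * obs(s)[i] * obs(s)[j] for s in states) - mean[i] * mean[j]
          for j in range(3)] for i in range(3)]
    return mean, c

theta = (0.3, -0.2, 1.0)   # made-up parameters
mean, c = moments(theta)

# d log Z / d theta_0 = <O_0>, checked by a central finite difference
eps = 1e-6
grad0 = (log_z((theta[0] + eps,) + theta[1:]) -
         log_z((theta[0] - eps,) + theta[1:])) / (2 * eps)
assert abs(grad0 - mean[0]) < 1e-6

# v^T C v = Var(v . O) >= 0 for any v: log Z is convex in theta
v = (1.0, -2.0, 0.5)
quad = sum(v[i] * c[i][j] * v[j] for i in range(3) for j in range(3))
assert quad >= 0.0
```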
This is what shows that log Z is a convex function of the theta i, OK? That is absolutely crucial. Now is it clear why I love natural parameters? It's because it's in terms of those parameters that the log partition function is a convex function, OK? Yes? OK. So correlation functions are moments: I showed you that d by d theta gives me an expectation value, and if I do d by d theta any number of times, I get expectations of products. Those are the moments of the probability distribution, OK? So the correlation functions are moments. Connected correlation functions are the things you generate when you take derivatives of the log of the partition function, like we were doing on the past slide. When you take derivatives of the log, you always get these subtractions, and that's what generates connected correlation functions, yes? And we want to work with connected correlation functions because the subtractions get rid of the disconnected stuff: the expectation of one set of observables times the expectation of another set, where those observables are not interacting, OK? That's really how to think about the subtraction: it takes out the pieces where the observables' values have nothing to do with each other. OK. So we did cover this; let's go on. All right. So we go back to our Kullback–Leibler divergence, except we now have f, and we have our form for p: our p is e to the theta dot O over Z, right? Yes? So if I write the Kullback–Leibler divergence in terms of f and our exponential form for p, this is what I get. Remember, the f are observed frequencies. The p is a theoretical quantity.
We're trying to figure out what the theta parameters are; that's the exercise that data science is about. Yes? So when you expand the Boltzmann form of the probabilities, you get two terms: this one from the exponential part, and this one from the Z that was in the denominator normalizing the probabilities, right? Yes? We're clear about this? All I did was take that log p and evaluate it, and that's what gives us these two terms, yes? OK? Now, if I differentiate this Kullback–Leibler divergence with respect to theta, this term doesn't matter, because it has no theta dependence. Those are the observed frequencies; they don't care about what model we want. The observed frequencies know nothing about our models, just like, you know, boulders will continue rolling down hills whether or not we believe in Newton's equations, right? And then when we differentiate with respect to theta, we get two terms: differentiating log Z with respect to theta gives this term, and differentiating the explicit theta gives this one. Now, what is the difference between this term and this term? I know I wrote it down here, but I want someone to explain to me what it is that I wrote down there. Yes, that's exactly right. This one is the theory: it says what the expectation value should be according to the model. And this one is what the observed frequencies tell us the expectation value of that particular observable is. So if we were doing gradient descent, which is one way to try to figure out what the theta should be (remember, we're trying to minimize this Kullback–Leibler divergence), then this is what we'd use to do gradient descent, OK? You know what gradient descent is.
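As a toy sketch of this moment-matching gradient descent, here is a case small enough that both expectation values are exact: a single ±1 spin, with a made-up target mean standing in for the observed frequencies. The gradient is precisely (model expectation − observed expectation), as derived above.

```python
import math

# One +/-1 spin: p(sigma) = exp(theta * sigma) / (2 * cosh(theta)),
# so the model expectation is <sigma> = tanh(theta).
# The gradient of D(f || p) with respect to theta is
# <sigma>_model - <sigma>_data, and we descend along its negative.
target_mean = 0.4      # hypothetical observed <sigma>, made up for illustration
theta, lr = 0.0, 0.5   # starting parameter and learning rate

for _ in range(200):
    model_mean = math.tanh(theta)
    theta -= lr * (model_mean - target_mean)

# Descent stops once the model reproduces the observed expectation value,
# at theta = atanh(target_mean).
assert abs(math.tanh(theta) - target_mean) < 1e-8
assert abs(theta - math.atanh(target_mean)) < 1e-6
```

In this toy case tanh(theta) replaces the intractable sum over states; for realistic models that model expectation is exactly the part that has to be estimated by Monte Carlo, as discussed next.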
If you're trying to minimize a function, you calculate its gradient and take a step in the direction opposite to it; that's going downhill. So what you see here is that it's the difference between the two expectation values, the observed expectation value versus the theoretical expectation value, that drives the gradient descent for theta, right? And it obviously makes sense that if your model actually reproduces the observed expectation values, then the model parameters don't change anymore: the gradient descent stops, and you've found the parameters. OK, that's great. But it turns out this expectation value is very hard to compute. Why is the model expectation value hard to compute? Yes: we are never, ever able to calculate partition functions except in very trivial cases, OK? They involve combinatorially large sums over states, and we can't do those sums. So it's a very nice, clean expression, isn't it beautiful? I really like it. But you can't actually calculate anything with it, OK? What you end up doing is spending a lot of time running Monte Carlo for that theoretical part. The expectation value over the experimental frequencies is really easy: there are only so many observations, even in this day of big data and whatnot. That part is really easy to calculate. It's the other part, the theoretical model part, that is hard, OK? So maybe in lecture three, we will discuss a smarter way to set up this calculation. But in principle, I want you to remember this: the theoretical expectation value is very hard to compute, OK? All right. So now we're going to switch gears a bit (that's why I spoke louder, no?), and we're going to talk about large deviations, OK? Big words: large deviations. What is large deviations theory about?
So tell me, what is the law of large numbers? Someone tell me, what is the law of large numbers? The law of large numbers is a statement that if I observe a certain random variable, right? I observe n trials of the random variable, right? So I get x1, x2, up to xn: random values, right? Numbers. And I take the mean of those n random numbers, right? So 1 over n times the sum of the xi, the observed n random variables, right? As n becomes large, that mean of the sum of random variables is going to tend to the mean of the underlying distribution. That's what the law of large numbers is, OK? Yes, is that clear? That's what the law of large numbers is. It's very simple. It just says that the mean of your observed sample, as the sample becomes bigger and bigger, will tend to the mean of the distribution, OK? Yes? OK. The central limit theorem. Someone tell me, what's the central limit theorem? Sorry? And so the theorem says that the mean of the sample will follow a Gaussian distribution. That's right. If I do the same n trials over and over, the distribution of those means, of the sums, will follow a normal distribution, OK? So I will actually discuss a short proof of it. So the key thing to understand about the central limit theorem is that it's telling you how the distribution of the sums of random variables, right? How they're going to look in the main part of the distribution of the sums, OK? So it's really good at describing what the main part is, OK? It's not so great at describing rare values of that mean, OK? So roughly speaking, the central limit theorem applies to what you might call moderate deviations of that mean, OK? On the order of square root of n, where n is the number of things, OK? So it's telling you about moderate deviations, OK? So what is interesting for us is the probability of rare events, which is what large deviations theory is going to be all about, OK?
Trying to get better handles on the rare-event probabilities, OK? So it's going beyond, of course, the law of large numbers. It's going beyond the central limit theorem. We're actually trying to find out something more about what the tails of the distribution look like, OK? That's what large deviations is about. And then there's what is called in the literature level-two large deviations, where you're asking a sort of meta question, right? You're asking, what is the probability of the whole empirical PDF, the empirical probability distribution function that you got from the sample, right? How far is it from the true distribution, OK? For the whole PDF, OK? That's called level-two large deviations, OK? So here's a proof of the law of large numbers, OK? So for a random variable with a finite expected value and a finite non-zero variance, those are the assumptions, right? So a finite non-zero variance, so you can't have things with heavy tails, no probability distributions with heavy tails, right? Finite variance, OK? Then the probability of your random variable, the mean 1 over n times the sum of the xi, being more than k standard deviations away from the mean, right, is less than or equal to 1 over k squared, OK? So this actually gives a proof of the law of large numbers, because it says exactly how the mean is going to converge, right? Yes? And I want to take you through this proof, because it's really very, very simple, OK? It's conceptually very, very simple. So here's the variance, by definition, right? Sigma squared is the expectation of x minus mu squared, right? And you break up this expectation into a part where x is less than k sigma away from the mean, and a part where x is more than k sigma away from the mean, right? Yes? OK? And so that's what I did here, right? The expectation of x minus mu squared when x minus mu is less than or equal to k sigma, right? Times, of course, the probability that x minus mu is actually less than or equal to k sigma, right?
Plus the expectation of x minus mu squared when x minus mu is more than k sigma away, right? And times the probability, right? Yes? OK. In the first term, we want a lower bound. So we say, OK, this first piece, the expectation of x minus mu squared, is certainly greater than or equal to 0, so I can replace it by 0, right? Yes? And then I'm left with only this other term. But by definition, if x minus mu is bigger than k sigma, then the expectation of x minus mu squared is bigger than (k sigma) squared, right? So again, that is consistent with getting a lower bound, right? Yes? And so we get (k sigma) squared times the probability that x minus mu is greater than k sigma, right? OK? Now you divide both sides by (k sigma) squared, right? And you'll get this, 1 over k squared on this side. Sigma squared divided by k squared sigma squared, right? And so you get this upper bound, 1 over k squared, on the probability that x minus mu is more than k sigma away, right? OK? Yes? OK, good. OK, so now we're going to talk about the central limit theorem. So here is the normal distribution with mean mu and variance sigma squared, right? Yes? That's your standard canonical Gaussian distribution, right? And it's good to know that the theorem is not going to hold for every single possible random variable, OK? So you should keep that in mind. It doesn't always hold. Again, the problem usually in all these things is with heavy-tailed distributions, right? That's where it's not necessarily going to hold, OK? And that's where I was saying that the central limit theorem is telling you about moderate deviations, OK? It doesn't really hold when there are heavy-tailed distributions, OK? So here's the precise definition. The limit of yn, which is defined as 1 over root n, that's what I was telling you, the root n, right? 1 over root n times the sum of the xi, is distributed as the normal distribution with mean 0 and variance sigma squared.
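As an aside, the 1 over k squared bound we just derived can be checked exactly for a simple case. This is my own example, using a Uniform(0, 1) variable, where the tail probability has a closed form, so no sampling noise is involved.

```python
import math

# Exact check of the Chebyshev bound P(|X - mu| > k*sigma) <= 1/k^2
# for a Uniform(0, 1) variable.
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)   # mean and standard deviation

def tail_prob(k):
    # For the uniform density, P(|X - mu| > k*sigma) = 1 - 2*k*sigma,
    # clipped at zero once k*sigma exceeds the half-width 1/2.
    return max(0.0, 1.0 - 2.0 * k * sigma)

for k in (1.2, 1.5, 2.0, 3.0):
    print(k, tail_prob(k), 1.0 / k**2)   # exact tail vs. 1/k^2 bound
```

For k below 1 the bound exceeds 1 and is trivially true, which is why the check only uses k above 1.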
Here we subtracted the mean, so we get the expectation of xi equal to 0, and the variance is this, right? Sigma squared, yes? Now, you see, this part, it's not uniform convergence, OK? What does that mean? It's convergence in distribution, which means that if you give me an interval on the x-axis, like between a and b, right? So for a fixed a and b, the probability, under the distribution of yn, of being between a and b is going to converge to the Gaussian probability of being between a and b, OK? In the limit, that's what it means to converge in distribution. It's not point-wise convergence. You fix a little interval a, b, and in the limit of large n, the distribution will converge on that interval, OK? That's what it means, convergence in distribution, right? It's not point-wise, and the reason it's not point-wise is because it doesn't really hold in the tails of the distribution. The tails converge much more slowly to the Gaussian form, OK? The main part converges quickly. The tails converge more slowly, OK? So you can't say it's converging point-wise. It's converging only if you specify an interval. It'll converge on that interval, OK? I know these are finer subtleties, et cetera, that I'm pointing out. But you don't want to be bitten by an assumption when you're doing some analysis and you didn't think about whether the assumptions you're making are actually true, OK? So it's good to at least hear them once, the assumptions that go into these things, all right? All right, OK. So what's behind the central limit theorem? It's actually very simple to prove. If you just calculate: if you have one random variable with mean mu1 and variance sigma1 squared, and another with mean mu2 and variance sigma2 squared, then the sum of the random variables, well, this I'm going to leave as an exercise, OK? The way to do this is by taking the Fourier transform of both distributions, multiplying the Fourier transforms (the convolution becomes a product), and then doing the inverse transform, OK? That's how you derive this.
And so what you'll see is that the distribution of x1 plus x2 has mean mu1 plus mu2 and variance sigma1 squared plus sigma2 squared, OK? Yes? And related to this, if you rescale the random variable by a constant c, right? Then the mean gets rescaled, and the standard deviation gets rescaled, OK? Yes? That's really all you need to prove the central limit theorem, OK? These two properties. So why is that the only thing that you need? Because if you define yn to be 1 over root n times this, right, OK? This 1 over root n is going to take the place of this c, OK? So, a little bit of notation and setup to get straight. I already said this, well, I said it implicitly. Remember how the partition function is a moment generating function, OK? For these natural-parameter exponential-family distributions, right? Yes? So there's the moment generating function. Why do we call it this? Because when we differentiate with respect to t, what you get is powers of x, right? All the moments of x, with more and more derivatives in t. Yes? Right, OK? So if you expand this formally in powers of t, then you'll get coefficients that are the expectations of the powers of x, which are the moments. So there's a very simple identity: you just plug x plus y in there, and you'll see that the moment generating function of x plus y is the moment generating function of x times the moment generating function of y, right? OK, yes? Similarly, if you plug this in, the moment generating function of cx is, well, you put a c in there, and then you have t times cx, and then you move the c over to the t, and you get the moment generating function of x, evaluated at ct, right? OK, yep, just simple stuff. Now what does that say about the log of the moment generating function, which is called the cumulant generating function? Cumulant is the statisticians' name for what we were calling connected correlation functions.
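The two moment-generating-function identities just stated can be verified by brute force. Here is a sketch using a Bernoulli variable, a toy example of my own; the parameter values p, t, c are arbitrary choices.

```python
import math

# Check M_{X+Y}(t) = M_X(t) * M_Y(t) for independent X, Y,
# and M_{cX}(t) = M_X(c*t), for a Bernoulli(p) variable.
p, t, c = 0.3, 0.7, 2.0

def m_bern(t):
    # E[e^{tX}] = (1 - p)*e^{t*0} + p*e^{t*1}
    return (1 - p) + p * math.exp(t)

# Enumerate the four joint outcomes of two independent Bernoullis:
m_sum = sum((p if x1 else 1 - p) * (p if x2 else 1 - p)
            * math.exp(t * (x1 + x2))
            for x1 in (0, 1) for x2 in (0, 1))

# The rescaled variable cX takes values 0 and c:
m_scaled = (1 - p) + p * math.exp(t * c)

print(m_sum, m_bern(t) ** 2)      # product rule for independent sums
print(m_scaled, m_bern(c * t))    # rescaling rule
```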
If you take derivatives of the log of the partition function, you're calculating connected correlation functions. If you're a statistician, you're calculating cumulants. That's it. Cumulants are connected correlation functions, OK? And obviously, there's a log here, so that identity we had, that the moment generating function of x plus y is a product, becomes a sum when you take the log, right? A sum of the cumulant generating function of x and the cumulant generating function of y, both evaluated at t, yes? And similarly for rescaling x by c, right? Now, if we've subtracted the mean value, why is this? What is this saying? The cumulant generating function evaluated at t equals 0 equals 0. What is that saying? Yeah, that's exactly right. This is just the log of this, right? At t equals 0, M of x is the expectation of 1, right, which is the normalization of the probability, right? So k of x at t equals 0 being equal to 0 is just saying that probabilities are normalized, OK? k prime of x at t equals 0 is just the expectation of x. k double prime of x at t equals 0 is just our old friend, the connected correlation function of x squared, right? And that's the variance of x. The mth derivative at t equals 0 is called the mth cumulant, right? And now, based on the scaling right here, OK? And what I said about root n, the mth cumulant of yn, with that scaling of 1 over root n, right, gets this inverse power of n raised to the power m over 2. Remember, there are n copies of x inside yn, right, each with a factor of 1 over root n, right? So when you take the mth cumulant of that yn, you get a factor of n divided by n to the m over 2, OK? And what happens if m is greater than 2? What that's telling you is that the mth cumulant of yn, if m is greater than 2, is going to tend to 0, right? In other words, the distribution of yn will just have a mean and a variance. That's it. What kind of distribution has only a mean and a variance? It's a Gaussian distribution, OK? Right?
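That cumulant scaling can be made concrete. As a sketch, here is the third cumulant of the standardized sum of Bernoulli variables; the closed form for the third Bernoulli cumulant is standard, and the value of p is an arbitrary choice of mine.

```python
# The n^(1 - m/2) cumulant scaling, illustrated for m = 3 with
# Bernoulli(p) variables.
p = 0.3
kappa3 = p * (1 - p) * (1 - 2 * p)   # third cumulant of one Bernoulli(p)

# Cumulants add over independent copies and pick up c^m under rescaling
# by c, so for y_n = (1/sqrt(n)) * sum_i (x_i - p):
#   kappa_3(y_n) = n * kappa3 / n^(3/2) = kappa3 * n^(1 - 3/2)
vals = [kappa3 * n ** (1 - 3 / 2) for n in (10, 100, 1000, 10000)]
print(vals)   # shrinks like 1/sqrt(n): only mean and variance survive
```

Every cumulant beyond the second dies off the same way, faster for larger m, which is the whole content of the Gaussian limit.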
The assumption here is that every one of the cumulants has a uniform bound. So those cumulants cannot grow. Again, this is telling you that you cannot have heavy tails in the distribution, OK? These cumulants have to be uniformly bounded, and that's when we get a central limit theorem, yes? OK, good. Now, OK, so that's a lot of theoretical stuff. A few examples, just to be very clear about what's going on here. There's a Bernoulli variable with probability p. What does that mean? It means it takes value 1 with probability p and value 0 with probability 1 minus p, right? It's a discrete variable, 0 and 1. Its moment generating function, which is M Bernoulli at t, right? OK, so there's probability 1 minus p, right, of taking the value 0. Well, e raised to the power t times 0 is 1. So that's why there's no t dependence here, right? On the other hand, when the random variable x is equal to 1, you get e to the t, right? And that happens with probability p, right? And then we take the log. And so when we do this, we get tp plus t squared over 2 times p times 1 minus p, plus blah, blah, blah, OK? But it's better to just leave it in this form, because what's a binomial variable? A binomial variable is n copies of a Bernoulli variable, right? n copies, that means it's a product, right? That's how moment generating functions work. You get products of the moment generating functions, right? So that's why this log has the factor of n times the Bernoulli log, right? Yes? And if you keep it in this form, or rather you bring all the p's together, then you write it as 1 plus p times e to the t minus 1, yes? Then you'll see that this n log Bernoulli, as n goes to infinity with np held fixed, gets this factor of np in front of e to the t minus 1, OK? This is really very, very simple algebra. Just bring the p's together, and in this limit you'll see this, OK? Now the Poisson distribution, right? It's this. So the moment generating function of the Poisson distribution, see, it's a sum over all possible k, right?
All possible non-negative integers. And you get e to the tk for the entry at k, right? And then you take the log. Well, if you have e to the tk times lambda to the k, you combine them, you get lambda times e to the t, all to the power k, right? And that sum is just the exponential of lambda times e to the t. And when you take the log, what you'll get is minus lambda, which is from this term, plus lambda e to the t, OK? And the only reason I bring this in is because you know that if you take the binomial distribution and you take the limit as n goes to infinity with n times p held fixed, right? That does become the Poisson distribution, OK? So this is just exhibiting that, OK? So, there are the examples. OK, now, so how are we doing? Are we saturating? Are we still following what's going on? Any questions? Yes, please. [A question about the convergence in distribution.] Yes. No, no, there's not a clean statement like that, OK? The reason we say it's convergence in distribution is that you pick your interval a to b. And then you ask, how large does n have to be for the central limit distribution to be within epsilon of the true probability between a and b, right? And you can find an n large enough, right? But you can't do it for all values of a and b at the same time, because the n that you need will be different depending on where you are, OK? So that's why the convergence does work over the whole domain, but not uniformly, OK? So you have to pick the interval a, b first. Then you find an n that will depend on a and b. And it will converge, OK? Yeah, good question, OK? So now, we have these bits that take values 0 and 1, right? And we want to calculate the probability distribution of the mean of these bits, OK? So if we have just two bits, then this mean is just going to be 0, a half, or 1, right? But if you have n bits, then this value starts to be almost continuous, right? If n is large, right? OK? So we want to calculate the probability that Rn is about R, right? OK? So how do we do that?
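Going back to the Poisson limit for a second, it can be checked numerically. This is a sketch of my own; lambda and t are arbitrary values I picked.

```python
import math

# Numerical check that the binomial cumulant generating function,
# n * log(1 + p*(e^t - 1)), tends to the Poisson one, lambda*(e^t - 1),
# as n -> infinity with n*p = lambda held fixed.
lam, t = 2.0, 0.5
target = lam * (math.exp(t) - 1.0)        # Poisson CGF at t

diffs = []
for n in (10, 100, 1000, 100000):
    p = lam / n                            # hold n*p = lambda fixed
    k_binom = n * math.log(1.0 + p * (math.exp(t) - 1.0))
    diffs.append(abs(k_binom - target))
print(diffs)   # shrinks toward zero as n grows
```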
Well, we sum over all these binary vectors, n-bit binary vectors. Each of them has probability 2 to the minus n, right? Because there are 2 to the n binary vectors. So the probability of any given binary vector is 2 to the minus n, right? OK? So for a fixed value of R, we have to sum over all such binary vectors, each of them with the same probability, right? Yes? OK? Well, there's that probability, and this is just a combinatorial factor. Rn of them have to be 1. And 1 minus R times n of them have to be 0, right? That's how we'll get the mean to be R, right? Because they're all going to be 0 or 1. So we're summing over all binary vectors, and of the n bits that are there, Rn of them have to be 1. That's how we'll get R. OK? Now we do the Stirling approximation, right? n factorial is about n to the n times e to the minus n. And I didn't put in the square root of 2 pi n and so on, because that's more than we'll need. OK? Same thing applied to Rn factorial, and I get this. Every place there's an n, you put an Rn. OK? I put these together with the form that we had right here. So I do the Stirling approximation on this, on this, and on this. Yes? And we put it together, and there we have our first large deviation result. The probability that Rn is equal to R is the exponential of minus n times this function I of R. I of R is this: log 2 plus R log R plus 1 minus R times log of 1 minus R. I am telling you, this is a Kullback-Leibler divergence. Can someone tell me why? Why is this a Kullback-Leibler divergence? Sorry? p_i is a half. Yes, exactly. So p_i is a half, right? So if I were to write it in the Kullback-Leibler form, f_i log of f_i divided by p_i, right? I would get divided by one half, right? So in the denominator, inside the log, there's a half, which gives minus log of a half, which is log 2, right? And the coefficients R and 1 minus R add up to 1, so you get one full log 2.
So this log 2 really belongs as R divided by a half, and 1 minus R divided by a half, OK? That's why it's a Kullback-Leibler divergence, all right? Now, what you notice is that I of R is convex. You can check it's convex, OK? And it has a unique zero at R equals a half, OK? I of R is called a rate function, OK? That's just terminology used in large deviations. It's called a rate function, OK? And here's a slightly more general form, where the probabilities are not just a half, OK? Initially, I assumed that the probabilities were a half, but now suppose the probabilities are alpha and 1 minus alpha, OK? Then this is the form you get, where instead of a half in the denominator, you get alpha and 1 minus alpha, OK? And this is just showing you, numerically, what I is going to look like. As I said, Rn only takes certain values. If you have, say, n equals 5, then it'll only take the values 0, one fifth, two fifths, three fifths, four fifths, and five fifths, right? So I looks like this. But as n becomes large, you see it becomes a continuous, smooth function, and it has a unique zero, right? So it's a convex function, and it has a unique zero, OK? OK. This I will go over very quickly, because we have just been talking about the normal distribution a lot. So this is actually included in what I said about moments of the normal distribution, right? M of x1 plus x2, right? So the probability that yn, the 1 over n sum of the xi, is equal to this. So this is our second large deviation result: the large deviation rate function in this case is s minus mu, squared, over 2 sigma squared, just this part right here. Why don't I include this part? Because large deviations, by definition, what we're calling large deviations, is only the leading exponential term, which comes in the form e to the minus n times the rate function, OK? We're not talking about sub-leading terms. If you were to write this out, you would see a log n, right?
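To see the rate function I of R doing its job, compare the exact probability of a fair-coin mean against e to the minus n times I of R. The values of n and R here are arbitrary choices of mine; the small residual gap is exactly the half-log-n sub-leading piece.

```python
import math

def rate(r):
    # I(r) = log 2 + r*log r + (1-r)*log(1-r): the KL divergence of the
    # frequencies (r, 1-r) from the fair-coin probabilities (1/2, 1/2)
    return math.log(2) + r * math.log(r) + (1 - r) * math.log(1 - r)

# Exact probability that the mean of n fair bits equals r, versus the
# large-deviation estimate exp(-n * I(r)).
n, r = 200, 0.7
k = int(r * n)                            # number of bits equal to 1
exact = math.comb(n, k) * 0.5 ** n        # exact binomial probability
estimate = -math.log(exact) / n           # should approach I(r)
print(estimate, rate(r))                  # gap is the O(log n / n) piece
```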
A half log n for the square root. That's sub-leading relative to this rapid exponential decay, right? So the rate function is only talking about the leading exponential decay, right? Not the sub-leading terms. That's why we don't worry about this term, OK? OK. And those are just some examples. I want to end today with this. If you saw the last two examples, you might be thinking, oh, what do I need to do? For the Gaussian large deviation result, basically, what did I do? I took x minus mu, and I replaced it with the mean of the sum, and that's it. Why do I have to do a lot of calculation? Is that all it takes? So this last example is telling you that the Gaussian is a very special distribution. So suppose you have exponentially distributed random variables, right? So mu is the mean value. It's exponentially distributed. Mu is greater than 0, right? Yes. It's an exponential distribution, perfectly nicely normalized. So you calculate the probability that the mean of n exponentially distributed variables takes the value s, right? So you'll take the product of these probabilities, right? And that's how you'll get the sum of the xi, right? That's the s, right? And that's why you get this, right? OK. But there's also this integral, right? Because you have n different integrations, right? And the constraint is that the sum of the xi has to add up to ns, right? So that 1 over n times the sum of the xi comes out to s, right? So you have to do these integrals, n of these integrals, right? Now, I leave it as a homework exercise for you, but this integral has quite non-trivial dependence on s, OK? Think of it this way. It's a combinatorial sum over this integration domain, right? That cares about how far you can go in s. Like, if x1 takes a certain value, then the sum of x2, x3, x4, all the rest of them, is constrained to be the total minus x1, right? And if you say x1 plus x2 takes a certain value, then the rest of them are constrained, right? Yes?
So when you do this integral, it's actually a very nice little integral to do once you set it up, OK? You can do it by induction if you like. It's really not hard to do. But if you do it, what you'll see is that this comes out to be n times s, raised to the power n, divided by n factorial, OK? So if you then use Stirling's approximation, you will end up with a rate function that looks like this. And I want you to understand that this part is coming from the entropy that is encoded in this integral, OK? That was constrained by s, OK? So that's why large deviations is not a triviality. It actually is telling you about fluctuations, OK? Yes? So that's the note on which I'd like to end. Yeah, the rest can wait for tomorrow, OK? So was that too fast, too slow, bored out of your skulls? What? Come on, give me some feedback. Otherwise, you have to sit through three hours and 45 minutes more of this. So I'd like it to be something that you enjoy. So please, give me some feedback. Really, I can take it. I'm a big boy, OK? Sorry? It was very good? You're sure? Yeah? OK, then we'll continue tomorrow. So we have a 15-minute break. We have some refreshments outside. We resume at 4:05, OK?
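As a closing check on that last example: the sum of n exponentials is a Gamma variable, so the density of the mean is known in closed form, and minus its log over n should approach the rate function s over mu minus 1 minus log of s over mu. The specific mu, n, s values below are my own arbitrary choices.

```python
import math

def rate_exp(s, mu):
    # Large-deviation rate function for the mean of Exp(mu) variables:
    # I(s) = s/mu - 1 - log(s/mu); convex, with its unique zero at s = mu
    return s / mu - 1.0 - math.log(s / mu)

# The sum of n Exp(mu) variables is Gamma(n, mu), so the density of the
# mean at s is f(s) = n * (n*s)^(n-1) * exp(-n*s/mu) / (Gamma(n) * mu^n).
mu, n, s = 1.0, 500, 2.0
log_f = (math.log(n) + (n - 1) * math.log(n * s) - n * s / mu
         - math.lgamma(n) - n * math.log(mu))
print(-log_f / n, rate_exp(s, mu))   # agree up to O(log n / n) corrections
```

The log s and half-log-n pieces left over are exactly the sub-leading terms that the rate function, by definition, ignores.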