Are you enjoying the school? I hope so; I'd like you to ask a lot of questions. So shall we start our second lecture? All right. It's very easy for him to say "ask more questions," right? All right. So just to remind you where we were yesterday: we did independent, identically distributed normal variables, and we showed, well, that we can do this integral, and we showed that it comes out in this form. And now, something very, very important here. Remember, what we're looking for is the coefficient of this factor of n in the exponential, right? Whatever is the coefficient of this factor of n in the exponential, with the e to the minus n, that's the large deviation rate, right? So in this case, the large deviation rate came out to be (s minus mu) squared divided by two sigma squared, which just happens to be the same as the exponent of the normal distribution itself, right? So that's one example, and then we did the case of the identically distributed exponential variables, and there we saw that doing the integral gave us this additional factor, right? So it's not quite as simple as just taking this and substituting s for x, yes? Now, if you had to do this distribution by distribution, it would be kind of tedious. So of course, people have derived ways to calculate the rate function much more directly and much more quickly. That's what we're going to get to now. But just in general, so we're on the same page: when we talk about a rate function, or large deviations, what we're talking about is some sequence of random variables A_n, and what we're looking for is the asymptotic probability, in this exponential form, as n goes to infinity. The probability becomes exponentially small as n goes to infinity, but what is this function I, okay? I is not going to depend on n; I is n-independent, right? All the n dependence is in this explicit factor of n in the exponent, yes? So that's, in general, what we mean by a large deviation function, okay? We have a family of random variables, and if we ask, oh, do the random variables take values in some set B? That's the mathematical formulation. Then there's some number I_B, which depends only on B, not on n, such that this limit exists. And I_B is the rate for this set of values B, okay? That's the precise and general mathematical statement. Okay, so now, just so we also see the connection: in the last couple of examples, this rate function, you see, sat inside a density, right? So to connect to this B, think of B as an interval, say from one to two, right? You ask: is the value of the random variable in this interval from one to two? And you integrate the density that we had over that interval, and that's the probability that something is going to be in one to two; in the infinitesimal case, in s to s plus ds, okay? So now, the most important, or most convenient, way in which we'll calculate rate functions is called the Gärtner-Ellis theorem. And what does this say? It's actually a very simple result. I won't try to prove it, okay? It's not hard to prove, but I won't prove it. I'll just give you some intuitive idea for why it's actually true, okay? Suppose we found a rate function, right?
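For the record, here is the definition just stated, written out in standard notation (a reconstruction, since the slides themselves aren't reproduced in this transcript):

```latex
% Large deviation principle for a sequence of random variables A_n:
% an n-independent rate I_B such that
\lim_{n\to\infty} -\frac{1}{n}\,\log P(A_n \in B) \;=\; I_B .
% In density form, with rate function I(s):
P\big(A_n \in [1,2]\big) \;\approx\; \int_1^2 e^{-n\,I(s)}\,ds ,
\qquad
P\big(A_n \in [s, s+ds]\big) \;\approx\; e^{-n\,I(s)}\,ds .
```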
So that means, as n goes to infinity, we can approximately say that the probability of Y_n having a value between a and a plus da is going to look like e to the minus n I of a, times da, right? That's by the definition of the rate function. So we're saying: suppose we found a rate function; then we can evaluate this expectation of an exponential, right? We just write e to the t Y_n, that's this part, and the probability was this part, right? So if we found a rate function, then this expectation of an exponential... what is this? Does anyone remember? We came across this several times yesterday. This is the moment generating function, the generating function of all the moments, right? And the log of it was the cumulant generating function, right? You remember this? Then what it's saying is that the moment generating function of Y_n takes this exponential form. Now we want to get this factor of minus n out, right? So we say: okay, suppose we scale t to be equal to k times n, yes? Any puzzled looks, anyone confused? It's all clear? It took me a few hours, when I first learned this, to understand what was going on, so that's great, but we'll see. Okay, so we say t equals kn; that's the definition of k, okay? And then, when we plug it in, what we see is e to the n times (ka minus I of a), right? We factored the n out, yes? By defining t to be k times n, for k some real number, we factored this n out, yes? Now why did we do that? Well, because we're interested in the limit where n becomes large, right? What is going to happen when n becomes large? Any guesses? What's going to happen to this integral? If you were evaluating this integral, and someone gave you this form and said this n is going to become large, what is this integral going to look like? Can you give me an approximate value? Right: we're going to look for, not quite the stationary point, but the maximum. The maximum of ka minus I of a, the supremum of ka minus I of a, is going to determine the value of this integral entirely as n goes to infinity, right? And that's called the Laplace approximation, or the saddle-point approximation, right? So what this is saying, and this is the content of the Gärtner-Ellis theorem, is that as n goes to infinity, the moment generating function is equal to the exponential of n times the maximum value, the supremum, over all values of a, of ka minus I of a, right? And now we take logs on both sides, and we see lambda of k. Remember what k is? We had a moment generating function parameter t, and we replaced it with k times n. So lambda of k is the rescaled cumulant generating function: it's the limit, as n goes to infinity, of one over n times the log of this, and that's equal to the supremum over all values of a of ka minus I of a. So what is this telling us? It's telling us that the rescaled cumulant generating function is the Legendre-Fenchel transform of the rate function, right? So we have a direct way of calculating the rate function: it's that function whose Legendre-Fenchel transform is the cumulant generating function. Yes, no, maybe? I'm just saying: you have an integral of e raised to a large parameter times something; then that integral is approximated by the value exactly at the maximum of that exponent, right? And then we take logs on both sides and we get this. And this form you should recognize: it is the Legendre transform.
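Written out, the heuristic chain of steps just described is (my transcription of the blackboard computation):

```latex
% Heuristic behind Gartner-Ellis: assume P(Y_n \in da) ~ e^{-n I(a)} da.
E\!\left[e^{tY_n}\right]
\;\approx\; \int e^{ta}\,e^{-n I(a)}\,da
\;\overset{t = kn}{=}\; \int e^{\,n[ka - I(a)]}\,da
\;\asymp\; e^{\,n \sup_a [ka - I(a)]} \quad (n \to \infty),
% so, taking (1/n) log of both sides:
\lambda(k) \;:=\; \lim_{n\to\infty}\frac{1}{n}\log E\!\left[e^{nkY_n}\right]
\;=\; \sup_a\,[\,ka - I(a)\,].
```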
So what is that telling us? Maybe you don't recognize it in that form, because if you're physicists, you've always been told that Legendre transforms look like this: there's some function of t, there's another function of s, and F and G are related by Legendre transform if they satisfy this, okay? Right? But that's not the most general form. The reason it's not the most general form is that this version only applies when F and G are differentiable, okay? This one applies all the time. There's no assumption of differentiability here at all, okay? Right? So this is actually the correct, or most general, form of the transform: the Legendre-Fenchel transform. And I know I harped on convex functions last time. Why do we care about convex functions? It's because the Legendre transform is a duality on convex functions. You start from a convex function, you do the Legendre transform, you get another convex function. And then you do it over again, and you get back your original function F of t, okay? This round trip is only true if F and G are convex, okay? And I'm going to show you what happens if one of them is not convex: if you do this twice over, what are you going to get? We'll discuss that, okay? But I just want you to understand that the usual physics Legendre transform is just a special case of this, okay? So someone tell me: if I tell you F and G are differentiable, how do we go from here to this? This is like the first thing you learn after you learn how derivatives are defined. What's the first use of a derivative? Have we forgotten calculus? No, you're not old enough to have forgotten calculus. So how do we use derivatives, once we learn how to differentiate? Derivatives tell us about maxima and minima, extrema, right? Where the derivative vanishes, that's an extremum, right? So if I want to find the supremum, and I tell you F of t is differentiable, with s just a parameter: you differentiate this and you get s minus F prime of t, right? And at an extremum, s minus F prime of t is equal to zero, right? Yes? So then you can plug it back in, and you get this form. That's all, if you assume that F and G are differentiable, okay? Okay, so geometrically, what is the Legendre transform doing? Well, here's your original cumulant function, lambda of k. You look at the slope of this function, right? It's a of k at some value of k, okay? And conversely, k of a, the inverse function, is the slope of the rate function at the value a. So a of k is the slope of lambda of k, and k of a is the slope of the rate function, okay? That's the general geometric idea of how a Legendre transform works, okay? Now, from this picture, can someone tell me why we need convexity at all? Sorry? Yeah, but why does this picture give you a one-to-one correspondence? Sorry, close, close, close, but not exactly. Yes: the reason is that for a convex function, the slope is non-decreasing, right? In other words, we want to go from a of k to k of a, right? So that function had better be invertible, right? And the slope starts out really negative for a convex function, and then it goes over to positive, right? But it never turns around. It starts negative and it goes positive, right? Yes? So it's a reparameterization; in other words, you can invert it. It's because it starts negative, goes positive, and doesn't turn back, okay? That's the reason it's a one-to-one correspondence.
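In symbols, the two forms and the passage between them (a reconstruction in the lecture's notation):

```latex
% Legendre-Fenchel transform: no differentiability assumed.
g(s) \;=\; \sup_t\,[\,st - f(t)\,].
% If f is differentiable, the supremum sits at a stationary point:
\frac{d}{dt}\,[\,st - f(t)\,] = 0 \;\Rightarrow\; s = f'(t),
\qquad g(s) = s\,t(s) - f\big(t(s)\big),
% which is the familiar "physics" Legendre transform. For the lecture's
% pair: a(k) = \lambda'(k), and inversely k(a) = I'(a).
```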
And if you had a bump in here, then the slope would go down, it would go up, then it would go through zero, and then go down again. And then it's not invertible, right? So the Legendre transform is a duality only because these functions are convex. That's why the Legendre transform, in this geometric sense, is very, very closely related to convexity of functions. Yes? Okay? So I talked about most of this, but I do want you to notice one thing about Legendre transforms which is perhaps not emphasized in our first courses: suppose t and s are vectors, okay? You're not children anymore, so I'm not putting little vector signs over them, okay? If t and s are vectors, then the gradient of the s variables with respect to the t variables is the second-derivative matrix, the Hessian matrix of F, right? And vice versa: the gradient of the t variables with respect to s is the Hessian of G, right? The matrix, right? But by the chain rule, that means that this second-derivative matrix of G is the inverse of the second-derivative matrix of F with respect to t, right? As matrices, this one is the inverse of that one, because dt/ds is the inverse of ds/dt, right? Multivariable calculus, okay? So how do we use this? If you have a random variable x, what we want is the cumulant generating function; I think we talked about all this. Yeah: the large deviation rate function is a function of the expectation value, and it is naturally the Legendre transform of the cumulant generating function. So what if the rate function is not convex? Because look, the cumulant generating function, as we showed last time, is always convex, right? I went on and on about how you take the second derivative and you get the second cumulant, and it's always positive, right? So the cumulant generating function is always convex. But the rate function, as defined abstractly, you know, the probability has to go as e to the minus n times the rate function: that has no necessity to be convex, okay? So the method we're talking about, the Gärtner-Ellis theorem, is only useful when the rate function is also convex, okay? It will not give you the correct rate function if the rate function is not convex, because the Legendre transform of the cumulant, which is guaranteed to be convex, is always going to be convex. Whether it's the rate function or not is up for debate, but the cumulant is always convex, and the Legendre transform of the cumulant is always convex, okay? The other side, whether the Legendre transform of the cumulant is actually the rate function: that is not guaranteed, okay? All right. (Why is the cumulant convex? It's the log of a moment generating function in its natural parameters, right? I showed that if you take the second derivatives of the cumulant generating function, you get a positive definite matrix. And that's, if you like, one of the definitions of convex functions, when it's true at almost every point and so on, okay?) Okay. So what happens if we have a non-convex rate function? What happens, as I was saying, is you come down here, the slope is negative, negative, negative, then the slope goes through zero, right? Right here. So once it goes through zero, remember, in that definition with the supremum, ks minus I, right? What is going to happen is that the supremum is going to stay stuck at this point, okay?
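As an aside, the multivariable chain-rule fact from a moment ago, in symbols (a reconstruction):

```latex
% Dual pair: s(t) = \nabla f(t) and t(s) = \nabla g(s). Then
\frac{\partial s}{\partial t} = \nabla^2 f(t), \qquad
\frac{\partial t}{\partial s} = \nabla^2 g(s),
% and since t(s(t)) = t, the chain rule gives, as matrices,
\nabla^2 g(s) \;=\; \big[\nabla^2 f\big(t(s)\big)\big]^{-1}.
```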
So there's some slope here, and the supremum stays stuck at this point: k stays stuck at this slope while s moves across, from here all the way to here. When the slope reaches the same value again, right here, it starts moving again. So if this is the rate function, the cumulant is going to look like this. The cumulant comes down, and it stays stuck here while s goes all the way from here to here, okay? So this is a non-differentiable point of the cumulant. No one guaranteed us that the cumulant was always going to be differentiable, right? No one told us that. So the cumulant is not differentiable, and the Legendre-Fenchel transform of the cumulant, which you can calculate straightforwardly, looks like this: it goes down, then there's a straight line between the two points, and then it goes up again. So what happened? The Legendre transform of this cumulant function, the one that had the singularity, turns out to be a convex function, right? It's called the convex envelope of the original function, which was not convex, okay? So the true rate function is this, okay? But what the Legendre transform is going to give you is its convexification, okay? Yes? So you can tell when the Legendre transform is not giving you the exact rate function, okay? Because if the convex function that you get out of the Legendre-Fenchel transform has a straight-line segment, right? That's when you know there was something funky about the original rate function, okay? Yes? All right? Okay. So now, here is... actually, this is not a historical talk. Cramér's theorem: Cramér's paper of 1938, I believe, is the first place where people actually started talking, I mean, where Cramér started talking about rate functions and large deviations, okay? Now, I've presented this ahistorically: I first gave you an argument for the Gärtner-Ellis theorem, and then I tell you, oh, by the way, Cramér's theorem is a consequence of the Gärtner-Ellis theorem, okay? Historically, it didn't happen that way. Mr. Cramér proved his theorem, and then Gärtner and Ellis came up with their theorem many years later, okay? But Cramér's theorem makes it even simpler to calculate rate functions. What Cramér said was: suppose we have independent, identically distributed random variables, okay? So what we're trying to do is calculate this, right? And the cumulant generating function is by definition this. Yes, are we clear? That's the definition, right? Now, if these are independent and identically distributed, then the exponential factorizes into a product of exponentials, right? It's the exponential of a sum over the x_i, so it's a product of exponentials of each individual x_i, right? And they're independent, identically distributed random variables, so the expectation of the product is the product of the expectations, and there are n copies of the same factor, right? We all remember how logs of powers work: the factor of n cancels the one over n. So what Cramér tells us is even more direct. There's your cumulant generating function for a single copy of the random variable, right? And that's it. This is how you calculate the cumulant generating function, and then you can straightforwardly calculate the rate function as well, as the Legendre-Fenchel transform of just this cumulant generating function for a single copy of x, okay?
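Here is a minimal numerical sketch of that convexification (my own illustration, not from the lecture; the double-well I below is a hypothetical choice, picked to match the mixture example that comes up again in a few minutes):

```python
# A minimal numerical sketch (my own illustration, not from the lecture):
# take a non-convex "rate function" I, compute its Legendre-Fenchel
# transform lambda(k) = sup_s [k*s - I(s)], then transform back.
# The double transform returns the convex envelope of I: a straight
# (here flat) bridge across the non-convex region, I itself outside.
import numpy as np

s = np.linspace(-3.0, 3.0, 2001)
I = np.minimum((s - 1.0) ** 2 / 2.0, (s + 1.0) ** 2 / 2.0)  # two wells, non-convex

k = np.linspace(-3.0, 3.0, 2001)
lam = np.max(k[:, None] * s[None, :] - I[None, :], axis=1)   # lambda(k) on the grid

I_env = np.max(s[:, None] * k[None, :] - lam[None, :], axis=1)  # double transform

print(I_env[np.abs(s) <= 1.0].max())              # ~0: flat bridge on [-1, 1]
print(np.abs(I_env - I)[np.abs(s) > 1.2].max())   # small: agreement outside
```

The double transform comes back flat on the bridge between the two minima and agrees with I outside: exactly the straight-line signature just described.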
So you didn't have to do all that work: if they're independent, identically distributed random variables, and all the differentiability you're assuming goes through, then you can calculate the rate function directly from the cumulant generating function of a single copy of x, okay? That makes life very much simpler. Let me show you why. Okay, so this time we're going to use Cramér's theorem to calculate what we calculated for normal variables, where we said we had to use, you know, convolutions and so on to work out what happens when you have sums of normal variables, right? But with Cramér's theorem, we don't have to calculate convolutions at all. We just take one copy of the random variable, the Gaussian random variable; we can do this integral, even I can do this integral, and you take the log and you get this: mu k plus sigma squared k squared over two. That's it, right? Now we need to do a Legendre transform of this. Lambda of k is certainly differentiable, right? And so it tells us that d by dk of lambda of k, which is supposed to be s as a function of k, right, is just mu plus sigma squared k. So we can invert this: k as a function of s is (s minus mu) over sigma squared, right? Just basic algebra, that's it, thanks to Mr. Cramér. So then the rate function, which is the Legendre transform (in this case everything is differentiable, so it really is the ordinary Legendre transform): we just put this value of k of s in here, and with a little bit of algebra you get it back. No convolutions or anything, right? Just in a straightforward way, from the cumulant generating function, the Legendre transform gives you the rate function, okay? Now, a calculation that I did not do for you in detail was that integral for the exponential distribution, right? Because it's a little messy; you've got to do all these n-dimensional integrals and watch out with the induction and this and that, okay? But with Cramér's theorem, it is trivial to calculate. Cramér tells us that lambda of k is the log of the expectation of the exponential of kx. Well, for an exponential distribution, if I multiply the density by e to the kx, it's totally straightforward to calculate the integral, and it comes out to be minus log of one minus k mu, okay? And then we again solve for s of k by differentiating the cumulant generating function, and you get mu over one minus k mu. Now we invert it again, just like before, right? So we get k as a function of s, and this value of k of s we plug into the definition of the rate function as the Legendre transform, right? And with a little bit of algebra, you get exactly the measure correction that I did not calculate for you, okay? Yes? So Cramér's theorem, and the Gärtner-Ellis theorem that comes with it, make it much simpler to calculate these rate functions. Yes? Okay? You don't need a whole lot of magic. You can do it with a single copy of the random variable, just by doing the Legendre transform of the cumulant generating function of a single random variable. Yes? Questions? It's all clear? All right? Okay. So it's sort of obvious that everything we did works for vector-valued random variables, okay? But if you can do it for vector-valued random variables, then with a little bit of, you know... are there any mathematicians here? There are not many mathematicians here, so: without worrying too much about convergence and so on, which, to be honest, once you start doing infinite-dimensional limits you do have to worry about, but without worrying about it right now, look at this.
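Both worked examples, collected in one place (reconstructed from the steps just described; the exponential density is taken with mean mu):

```latex
% Gaussian, X ~ N(mu, sigma^2):
\lambda(k) = \log E\,e^{kX} = \mu k + \tfrac12 \sigma^2 k^2,
\qquad s(k) = \lambda'(k) = \mu + \sigma^2 k
\;\Rightarrow\; k(s) = \frac{s - \mu}{\sigma^2},
\qquad I(s) = k(s)\,s - \lambda\big(k(s)\big) = \frac{(s-\mu)^2}{2\sigma^2}.
% Exponential, p(x) = \mu^{-1} e^{-x/\mu} for x \ge 0:
\lambda(k) = -\log(1 - k\mu) \quad (k < 1/\mu),
\qquad s(k) = \frac{\mu}{1 - k\mu}
\;\Rightarrow\; k(s) = \frac{1}{\mu} - \frac{1}{s},
\qquad I(s) = \frac{s}{\mu} - 1 - \log\frac{s}{\mu}.
```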
Suppose I give you a sample of n copies of a random variable, right? That's x_i, or sigma_i, right? These are your observations. Then I can define a probability density, right? L_n of x, which is just one over n times a sum over this set of delta functions, one at each observation, right? Yes? So now think of it this way: L_n of x is itself a random variable. It's a density, but so what? It's a density that happens to be a random variable in the space of densities, right? And why is it a random variable? Well, it's a function of this vector of sigma_i, which is a vector random variable, right? And a random variable which is a function of a random variable is still a random variable, okay? So now it's a function that happens to be a random variable, and we can still calculate the cumulant generating function, where k now happens to be a function of x. Before, k was a single number, or a vector of numbers, right? Now the random variable happens to be L_n of x, and if you plug this definition of L_n in here, what you'll see is that what you get is the log of the expectation of the exponential of k evaluated at x. So now k is a function. It's the dual of this random variable, right? This was a function, so k is a function, okay? And why am I telling you this? Well, if you now calculate the rate function for this, just by applying Gärtner-Ellis again, okay? Nothing else, just Gärtner-Ellis, okay? What you'll see is that the rate function for any density mu, where rho is the true density, is given in this form, okay? You've seen this before: this is the Kullback-Leibler divergence, okay? Everything in this theory of large deviations will sooner or later go back to some version of the Kullback-Leibler divergence, okay? Yes? So what I've just argued for you is that this same way of computing the rate function applies if you have vectors of random variables, or a function which is a random variable, okay? It's the same technique, nothing different, okay? Now, one interesting thing to note here is that mu is the value of the random variable that we're asking the probability of, okay? Rho happens to be the true density. Many of the applications of this are ones where we want to find out what rho is, okay? And what this is telling us is: the rate function, which tells us how far our empirical density mu is from the true density, is given in this form. But how do you calculate this if you don't know what rho is? We're trying to figure out what rho is, right? So you can't even calculate this unless you know what rho is, okay? So actually, for data science, what is interesting for us is something called an inverse Sanov theorem, okay? In other words: there's a theoretical guess for a density, and there's empirical data that we've observed, right? What we'd like to know is how close this theoretical guess for a density is to the true density, right? So I'm not going to talk about Bayes' theorem much here, but Bayes' theorem is this basic point of probability theory, right, applied to this kind of inferential statistics, right? Yes, we all agree with this. If you look at what I wrote down: rho is our model for what the true density could be, right? L_n is our observed empirical density, right?
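This is Sanov's theorem; in symbols (my transcription of the statement just described):

```latex
% Sanov's theorem, for i.i.d. draws x_i from the true density rho:
L_n(x) = \frac{1}{n}\sum_{i=1}^{n}\delta(x - x_i),
\qquad
P\big(L_n \approx \mu\big) \;\asymp\; e^{-n\,I[\mu]},
\qquad
I[\mu] = D_{\mathrm{KL}}(\mu\|\rho) = \int dx\,\mu(x)\,\log\frac{\mu(x)}{\rho(x)},
% obtained from Gartner-Ellis with a function-valued dual variable k(x):
\lambda[k] = \log \int dx\,\rho(x)\,e^{k(x)}.
```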
So, the probability of L_n being observed if rho was the correct density: that's what Sanov's theorem gave us, right? What we want is to invert it, to see how likely it is that rho was the correct underlying density, right? So rho you should think of, in the Bayesian sense, as the model. Sanov's theorem tells us that this factor, the likelihood, is the exponential of minus n times the rate function of L_n, the empirical observation, with respect to rho, our model for the true density, right? And the denominator is just that, integrated over all possible guesses for the true density, the models for the true density, okay? So this is how you can invert it. I don't think I'm going to get time in this set of lectures to show you a real application of this, but this is how you put Sanov's theorem to use: you have some family of guesses for where the true density might be, those are your models for the true density, and you use Bayes' theorem to go from the rate function of your empirical density to the probability of one of your theoretical densities being the true density, okay? So that's just basic Bayes' theorem. Yes, please. How do we know the probability of our model? That's a great question. I could give a whole lecture right now. This, as usual, is the sticking point in any Bayesian analysis, okay? How did you know what the probability of your model was, right? So Bayesians, like me, mumble words like, you know, oh, we'll try to be as unbiased as possible, okay? No, I'm telling you, this is a school, we're letting all the dirty laundry out: there is no true way of finding P of rho, okay? The only thing you can say is that with enough data, the dependence on P of rho will get eliminated, okay? However, if the true rho doesn't even happen to be in the family that you're testing, it's not as if you'll suddenly, magically pop out of that family and find the true density, okay? So all this is doing is telling you, given a space of models that you're testing, which is the best model in your space of models, okay? So yeah, great question, but sorry: no Bayesian analysis is going to tell you what P of rho is. It's just not happening, all right? So, we talked about the Gärtner-Ellis theorem, and I showed you where it arises, as the Laplace approximation to the integral, right? But if you follow that argument: if I take any function f of my random variable, okay, and I plug that in, I can calculate a rate for that function, okay? In just the same way: I take the supremum of f of a minus the rate function, and that gives me lambda as a function of f, okay? Before, we had f of a equal to k times a, right? A linear function, right? I'm just saying the same thing holds if you take f of a to be any bounded continuous function, okay? It doesn't have to be that linear function k times a that we usually use, okay? Yes? It's useful to have this; maybe next lecture we'll use it, okay? This is Varadhan's theorem, and it's very, very general. Actually, Varadhan's theorem is not just a trivial generalization: it applies in cases where the Laplace saddle-point argument that we gave doesn't apply, okay? It's much more general. It implies the Laplace approximation, but it's much more general, okay? All right.
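Put together, the inversion and the generalization just described look like this (a schematic reconstruction):

```latex
% "Inverse Sanov" via Bayes' theorem (schematic):
P(\rho \mid L_n)
= \frac{P(L_n \mid \rho)\,P(\rho)}{\int \mathcal{D}\rho'\;P(L_n \mid \rho')\,P(\rho')}
\;\asymp\;
\frac{e^{-n D_{\mathrm{KL}}(L_n\|\rho)}\,P(\rho)}
     {\int \mathcal{D}\rho'\;e^{-n D_{\mathrm{KL}}(L_n\|\rho')}\,P(\rho')}.
% Varadhan's theorem, for bounded continuous f:
\lim_{n\to\infty}\frac{1}{n}\log E\!\left[e^{\,n f(Y_n)}\right]
= \sup_a\,\big[\,f(a) - I(a)\,\big].
```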
Maybe we'll skip this, but one thing I want to say, okay? Suppose we have some function of a random variable, okay? So A_n is some sequence of random variables for which we know the rate function, okay? Now I ask you: what's the rate function for some function of it, like tanh of that random variable, okay? Just for example, okay? How do we find this? This is called the contraction principle, and what it says is: if you know the rate function of A_n, then you can easily find the rate function of B_n, in this way: the rate function of B at b is the least value of the rate function over all values of a such that the function h of a is equal to b. So let me parse that carefully. Little b is some value of this new random variable, which is some function of the old random variable, right? What this is saying is that the rate at this value b is the lowest rate you'd get among all the a values that get mapped by h into that value b. In other words, these are all rare events, but the least rare of all the ways that you could get to b is what determines the rate function of b from the rate function of a, okay? And this is a very general point: if you're looking at very rare events, and you look at some more complicated function of those rare events, the probability of this more complicated function is always dominated by the most probable way that you could produce it, out of all the different ways, okay? So it's the least improbable route, right? This is a slightly subtle point, so just to be concrete: if there are 15 different ways that I can get to b from a values, right? And I ask you, what's the asymptotic probability that I'll see that value of b? Well, I go through my 15 different values of a that all lead to that same value of b, right? And I ask which of those a values has the lowest rate, that is, the highest probability, right? And that's the value that's going to determine the rate of my value of b: the one with the lowest rate, the least improbable one, yes, okay? All right, so now we do a few counterexamples, just so we don't get overconfident about how we use rate functions and these transforms. So this is a Pareto density. You take the beta parameter to be greater than three, and C is just a normalizing constant, right? So this is a perfectly normalizable p of x, okay? Yes? x is on the whole real line, from minus infinity to infinity, okay? The variance of x is also finite if beta is greater than three. You can calculate it explicitly, right? It's non-zero and finite, okay? Yes? But what is I of s? Well, if we want to use Cramér's theorem, we try to calculate the cumulant generating function, right? What is the cumulant generating function for this? It involves multiplying this p of x by e to the kx and then integrating from minus infinity to infinity. That integral is infinite, except at k equals zero, right? So in other words, the cumulant generating function is not going to help you calculate a rate function, okay? Or rather, the cumulant generating function is telling you that the rate function is just identically zero, okay? So here's a perfectly innocuous density; would you have thought that it would lead to any pathology? Not really. But I want you to note that this is an example of a heavy-tailed distribution, right? For which it's simply not true that the cumulant generating function exists, okay?
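In symbols, the contraction principle and the heavy-tailed counterexample (the specific Pareto-type form below is my guess at the slide's density, so treat it as schematic):

```latex
% Contraction principle, for B_n = h(A_n):
I_B(b) \;=\; \inf\,\{\, I_A(a) \;:\; h(a) = b \,\}.
% Heavy-tailed counterexample (schematic Pareto-type density, beta > 3):
p(x) = \frac{C}{(1 + |x|)^{\beta}},
\qquad
E\,e^{kX} = \int_{-\infty}^{\infty} p(x)\,e^{kx}\,dx = \infty
\;\;\text{for all } k \neq 0,
% so lambda(k) is finite only at k = 0, and
I(s) = \sup_k\,[\,ks - \lambda(k)\,] = 0 \quad \text{for every } s.
```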
The cumulant generating function does not exist for this density, and so the recipe gives a rate function identically equal to zero, okay? That's counterexample number one. Now, let's do a more fun one. Suppose the random variable is like a stuck coin, right? It's either one, one, one, one, one, one, or minus one, minus one, minus one, minus one, okay? So if I'm calculating the mean value of this set of random variables, it's going to be either one or minus one, with probability one half each, right? So the probability density of Y_n, which is the mean value of the random variables I observed, is one half at y equals plus one and one half at y equals minus one, right? Does everyone understand what the random variable is? Either the sequence is one, one, one, one for n copies, or it's minus one, minus one, up to n copies. It certainly is a random variable, okay? And we can calculate its mean value, and that's this distribution. So what is I of s? Okay, we can calculate the cumulant quite easily. You get e to the nk from y equals plus one, and e to the minus nk from y equals minus one, each with a factor of one half in front, right? So the cumulant generating function is the limit, as n goes to infinity, of one over n times log cosh of nk, since one half of e to the nk plus e to the minus nk is cosh of nk, right? And what is that limit? It's just the absolute value function, okay? That's clear, right? Because as n goes to infinity, if k is positive you keep the e to the nk, and if k is negative you keep the e to the minus nk, so the log in either case gives you n times the absolute value of k, and the one over n leaves absolute value of k, right? Yes? So there's our cumulant generating function, perfectly nice. What's the problem with it? It's not differentiable at k equals zero, right? It's perfectly fine at other values of k, but at k equals zero it's not differentiable. Now, if you look at this probability density, you can read off the rate function right away. What's the rate function? It's zero at y equals plus one, zero at y equals minus one, and infinite everywhere else. Why is it infinite everywhere else? Remember, the probability is e to the minus n times the rate function, right? The only values for which the rate is going to be zero are y equals one and y equals minus one; every other value is never observed, so the rate function there has to be infinite, because the probability is e to the minus n times the rate function, right? Yes? So the observed values are s equals plus or minus one, and at all other values I of s is infinite; they're never observed, right? Okay, now I hope you can see that if I give you this I of s, zero at these two values and infinite everywhere else, that's not quite a convex function, right? It's infinite, it has zeros, it's infinite again. Not a convex function. Now, the Legendre transform of this absolute value of k is not hard to calculate, very easy to calculate. For s between minus one and one, the Legendre transform gives you I equal to zero. Why is that? Because look here: it's ks minus absolute value of k, and for s in this interval the supremum over k is attained at k equals zero, which forces I to be equal to zero, right?
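The stuck-coin computation, collected end to end, including the case taken up next (a transcription of the steps being described):

```latex
P(Y_n = +1) = P(Y_n = -1) = \tfrac12,
\qquad
\lambda(k) = \lim_{n\to\infty}\frac1n\log\Big(\tfrac12 e^{nk} + \tfrac12 e^{-nk}\Big)
= \lim_{n\to\infty}\frac1n\log\cosh(nk) = |k|.
% True rate function vs. the Legendre-Fenchel transform of |k|:
I(s) = \begin{cases} 0 & s = \pm 1, \\ +\infty & \text{otherwise,} \end{cases}
\qquad
\sup_k\,[\,ks - |k|\,] = \begin{cases} 0 & |s| \le 1, \\ +\infty & |s| > 1, \end{cases}
% i.e. the transform returns the convex hull of the true, non-convex I.
```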
But if s is bigger than one or less than minus one, then this ks term dominates, and you're supposed to take the supremum over k: k can go to minus infinity, you get I equals infinity; k can go to plus infinity, you get I equals infinity, right? So the function that you get is infinite outside this interval, and in the interval it's equal to zero. That's it. So I'm just trying to point out that these are very simple densities where you can calculate everything as explicitly as you want, right? And what you see is that it's not so hard to come across examples where I is infinite, not differentiable, has an interval where it's zero, and so on, okay? All right, let's do another example, being even more explicit, all right? So this time we take the random variable Y_n to be a random variable Z, which is either plus one or minus one, yes, plus the mean of a bunch of identically distributed x_i, okay? So Y_n is a perfectly nice random variable, right? And what I want to know is: what is the rate function for Y_n, okay? The way to do this is to condition on Z being plus one or minus one, right? Then we can calculate the probability that Y_n will be some value s, right? Because if I tell you that Z is plus one, then that fixes what the mean of the x_i has to be, and if I tell you Z is minus one, that tells you the other value, right? So for Z equals plus or minus one, you get two different exponentials for the probability, okay? Very simple, okay? So what's the probability that Y_n is s? What you get is e to the minus n I plus of s, multiplied by the probability that Z was plus one, which gives you a factor of one half, plus e to the minus n I minus of s times the probability that Z equals minus one, which gives you another factor of one half. So the rate function here is the minimum of I plus of s and I minus of s. Remember again: the least improbable way that you could get to that value of s. Depending on the value of Z, it's either I plus or I minus, okay? But now, you can calculate the cumulant generating function directly from the definition, not hard. The cumulant generating function is absolute value of k plus k squared over two, okay? And you can do the Legendre-Fenchel transform of this, and again you'll see that the Legendre-Fenchel transform is flat, zero, in this interval; that's this function here, the Legendre-Fenchel transform of this cumulant generating function, okay? The actual rate function looks like this: you see the quadratic parabola for z equals minus one, the quadratic parabola for z equals plus one, and there's the kink at k equals zero, okay? So there's a non-convex rate function, a non-differentiable cumulant generating function, and as explicit a convexification of it as you could want, okay? Are we clear? I'm just showing you that you can do these examples quite explicitly, right? And see exactly where things break down, okay? The true rate function can definitely be non-convex, okay? It's just the cumulant that will never be non-convex. However, the cumulant can have places where it's not differentiable, right? And where it's not differentiable is where the pathologies are, right? Yes? We're clear about this, all right? Okay. Now, suppose we consider sequences, say omega one, omega two, dot, dot, dot, where the omega_i could be random, identically distributed, or even a Markov chain.
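With Gaussian x_i of unit variance, the example works out like this (a reconstruction; the Gaussian choice is my assumption, made to match the parabolas just described):

```latex
Y_n = Z + \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad Z = \pm 1 \text{ with probability } \tfrac12 \text{ each},
\qquad x_i \sim N(0,1) \text{ i.i.d.}
% Conditioning on Z gives I_\pm(s) = (s \mp 1)^2/2, so the true rate is
I(s) = \min\!\Big(\tfrac12 (s-1)^2,\; \tfrac12 (s+1)^2\Big)
\quad\text{(non-convex)},
% while the cumulant and its Legendre-Fenchel transform give the envelope:
\lambda(k) = \lim_{n\to\infty}\frac1n\log\Big(\tfrac12 e^{nk + nk^2/2} + \tfrac12 e^{-nk + nk^2/2}\Big)
= |k| + \frac{k^2}{2},
\qquad
\sup_k\,[\,ks - \lambda(k)\,]
= \begin{cases} 0 & |s| \le 1, \\ \tfrac12(|s| - 1)^2 & |s| > 1. \end{cases}
```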
I won't go into the Markov chain part, but now think of it this way: the probability of a sequence of length n is itself a random variable in this case, right? Omega is a random variable, and the probability of that random variable is itself a random variable; it depends on what omega turned out to be, right? Yes? So I define A_n of omega, which is minus one over n log p_n of omega, and A_n of omega is itself a random variable. Now I want to know the rate of this random variable, okay? And this is not hard to calculate. We can get the cumulant generating function by the Gärtner-Ellis theorem; it's exactly this form, right? Notice: p_n to the minus k, the expectation value of that. The expectation value of p_n to the minus k is the sum over sequences of p_n times p_n to the minus k, so you get p_n of omega to the power one minus k. The minus k is coming from this explicit factor, and the extra power of one is just how you calculate an expectation value: you're summing over all possible sequences of n random variables, weighted by their probability, and averaging p_n of omega to the minus k, right? This is the Gärtner-Ellis theorem, just explicitly plugged in. So now, if they're independent identically distributed random variables, then the sum over omega, which are what? Just copies, right? They're independent, identically distributed random variables, so this sum over sequence probabilities factorizes into a product over omega one's probability, omega two's probability, omega three's probability, and so on, right? And so what you end up with is a sum over the values that each omega_i could take. They don't have to be integers or anything; omega_i could live in any space, this sum. So the cumulant generating function is just the sum, over every value that the random variable omega_i could take, of the probability of that value raised to the power one minus k. So there's an explicit cumulant generating function. It might seem like a very complicated thing to do: take a vector of random variables, calculate the probability of that vector, say that the log of that probability is itself a random variable, and ask for the rate function of that random variable, okay? In the earlier case, I remember, it was the expectation of the exponential of t Y_n, but here it doesn't look exponential. Oh, it is an exponential. The random variable is defined as A_n of omega, which is minus one over n log p_n of omega. So if I take e to the n k A_n of omega, that is exactly e to the minus k log p_n, and the exponential of the log gives you p_n to the minus k, okay? So yeah, this is exactly the same Gärtner-Ellis form, okay? Yes, good question; I skipped a step that I couldn't put up, okay? Now, you might think I'm just making this stuff up for the heck of it, to fill the time, but this is actually an extremely important result, okay? A lot of information theory uses this argument, okay? So let me explain why. Suppose the rate function that we calculate, by doing the Legendre-Fenchel transform of the cumulant generating function I showed you, is convex and has a unique minimum, a star, call it, okay? Then what that says is that the limit of this expectation value of A_n is itself a star, right? That is by definition of the rate function of A_n, right?
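Explicitly, the computation just narrated (my transcription):

```latex
A_n(\omega) = -\frac{1}{n}\log p_n(\omega),
\qquad
E\!\left[e^{nkA_n}\right] = E\!\left[p_n^{-k}\right]
= \sum_{\omega} p_n(\omega)\,p_n(\omega)^{-k}
= \sum_{\omega} p_n(\omega)^{1-k},
% and for i.i.d. omega_i the sum over sequences factorizes, so
\lambda(k) = \lim_{n\to\infty}\frac1n\log\sum_{\omega} p_n(\omega)^{1-k}
= \log \sum_{x} p(x)^{1-k},
% the last sum running over the values x a single omega_i can take.
```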
But it's actually telling you something stronger: that A_n itself, in that limit, goes to a star, okay? A_n was minus one over n times the log of the probability p_n of omega, right? What does that mean? It says that the mean Boltzmann-Gibbs-Shannon entropy per symbol of that sequence is going to tend to a star. These are extremely significant quantities; that's why they come up all the time, right? That limit is called the entropy rate, or the Kolmogorov-Sinai entropy, of that random sequence of independently generated random variables, okay? So this is how we connect to the Kolmogorov-Sinai entropy: for cases where the rate function of the random variable minus one over n log of the sequence probability is convex and has a unique minimum, that minimum value is the Kolmogorov-Sinai entropy, okay? Now this is even more fun, because if you believe this, then you ask: what's the probability of a sequence? It says that most of the probability is concentrated in sequences whose probability p_n of omega is approximately e to the minus n times a star, right? That's just the definition of I of a again. But if I tell you that most of the probability is in sequences like this, then I can ask you: how many such sequences are there? This is called the typical set of this process. And one over this probability tells you that the number of such sequences is about e to the n times a star, right? If most of the probability is in these sequences, and each of those sequences has probability e to the minus n a star, then the number of such sequences has to be about e to the n a star, right? This is extremely important in computation and information theory, when you ask roughly how many such signals there are, given the typical entropy of the signal-generating process, okay? So this is very general and very important, even though we came at it from a kind of abstract point of view, okay? Sometimes abstraction is a useful thing: it lets you see clearly how to calculate something where, a priori, you'd have a very hard time understanding where such a number would come from, okay? All right. Okay, now in the last few minutes, we'll talk about equilibrium statistical physics, all right? We are not being historical here, right? This is not a historical survey; I'm trying to explain it in a roughly logical way. So the reason I talked about those sequences of random variables is that microstates are an example of such a vector of random variables, okay? Each omega_i corresponds to, say, the position and velocity of a particle. You have n total particles; that's a microstate of your gas, right? Yes, that's a microstate, okay? What's a macrostate? A macrostate is some function of this microstate, right? The macrostates that are interesting to us are those that have nice limits as n goes to infinity, as the number of particles goes to infinity, right? The a priori distribution on the space of microstates is some probability density, right? The mean energy per particle, for instance, is an example of a macrostate variable: it's the energy of the whole microstate, per particle, right? And the thermodynamic limit corresponds to the n goes to infinity limit, which is what we've been doing over and over in our calculations of rate functions, right? Okay?
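Before the statistical-physics thread continues, here is a minimal numerical sketch of that typical-set concentration (my own illustration, not from the lecture; the biased-coin parameters are arbitrary):

```python
# A minimal numerical sketch (my own illustration, not from the lecture):
# for an i.i.d. biased coin, A_n = -(1/n) log p_n(omega) concentrates at
# the entropy rate a* = H(p); "typical" sequences number about e^{n a*}.
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.2, 10_000, 2000

a_star = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # entropy rate, in nats

x = rng.random((trials, n)) < p          # `trials` sequences of length n
ones = x.sum(axis=1)
log_pn = ones * np.log(p) + (n - ones) * np.log(1 - p)
A_n = -log_pn / n                        # per-sequence value of A_n

print(a_star)                 # ~0.5004
print(A_n.mean(), A_n.std())  # mean ~ a*, spread shrinking like 1/sqrt(n)
```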
So that rate function is going to be related to the kinds of thermodynamic potentials that summarize all the information about all the microstates of the system, right? Any questions about this? Yes? No? This is some of the deepest stuff in physics, right? Boltzmann, Einstein: they're the ones who came up with this, and I've presented it to you in this very simplified way, as if it were obvious. It's not obvious, okay? It took some very, very brilliant people to come up with this way of thinking about it, okay? What is entropy? Okay? Assuming that there's a uniform measure on the space of all microstates, suppose the energy per particle is in a little interval u to u plus du, right? Then the entropy is the log of the volume, in microstate space, of microstates whose energy per particle h_n lies in this little region u to u plus du, right? That's the entropy, okay? And the rate function is defined as usual, exactly as minus one over n times the log of the probability that h_n is in this little interval u to u plus du, yes? Here I'm assuming that we have a uniform measure, and lambda is the volume of a single particle: the phase-space volume of a single particle, if you like. So the rate function is just minus the entropy, up to this volume factor, a log of lambda term, which is not really interesting. This is what is on Boltzmann's gravestone, okay? The rate function is minus the entropy, up to this log-of-volume stuff, which is not important, okay? Why do I say the rate function is minus the entropy? Rate functions are convex, right? Except when they're not, but rate functions are mostly convex. The entropy, on the other hand, is a concave function, okay? The entropy is concave, so the rate function is written as minus the entropy. So if I define a partition function in this way, introducing this parameter beta, right? Remember what we did with rescaling the cumulant: you write beta as n times k, right? And what you see then is that lambda of k is in this form, with beta as minus k, and this is the Gärtner-Ellis form of the cumulant, right? So this tells you that the Massieu potential, and the Massieu potential is minus lambda, right, is a concave function of beta. Another way to say it: the Massieu potential is minus the free energy, okay? The reason I wrote it like this is that people may or may not include the beta in the free energy, right? If you include the beta in the free energy, then you get concavity; if you don't include it, then you get convexity, okay? So this is just a matter of definition, of what you call the free energy, whether you include a factor of beta in it or not, okay? In any case, the important point here is that this is how everything we've been talking about in large deviation theory is related to equilibrium statistical physics. And these are our usual dualities: phi of beta, which, remember, is the Massieu potential. You see an infimum here, not a supremum. Why? Because we're talking about concave functions, right? If I put minus signs on both sides, then I would have a supremum, right? But this is the usual convention: the entropy is a concave function, the free energy is a convex function.
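Schematically, the dictionary being described (a reconstruction with guessed normalizations, since the slides aren't reproduced here; s(u) is the entropy per particle and phi the Massieu potential):

```latex
% s(u): entropy per particle; Omega_n(u): microstate volume at energy
% per particle u; phi: Massieu potential. Normalizations are my guess.
s(u) = \lim_{n\to\infty}\frac{1}{n}\log \Omega_n(u),
\qquad I(u) = -\,s(u) + \text{const},
Z_n(\beta) = \int d\omega\; e^{-n\beta\,h_n(\omega)}
\;\asymp\; e^{\,n\,\sup_u [\,s(u) - \beta u\,]},
\qquad
\varphi(\beta) = \lim_{n\to\infty}\frac{1}{n}\log Z_n(\beta)
= -\inf_u\,[\,\beta u - s(u)\,] = -\,\beta f(\beta).
```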
We are working with the Massieu potential, which happens to be minus the free energy with the usual definitions, okay? Yes? And that is what I have for you today. Oh boy, that was good timing, all right. Okay, questions. Come on, this cannot have been that trivial. If it was, I have a whole class full of Einsteins here. By the way, Einstein's theory of fluctuations actually comes from this formulation in terms of large deviation functions, okay? It tells you about the probabilities of observing little fluctuations about the macroscopic fixed value. That's all contained in the rate function, okay? Yes, question. Thank you for the lecture. I have a question: the assumption that we made was that the rate function is convex, but... No, no, no, we didn't make that assumption. We showed that the cumulant is always convex. If you use the Gärtner-Ellis theorem and say, oh, the Legendre transform of the cumulant is the rate function, then you'll get a convex function; it just may happen not to be the rate function, okay? So let's be clear about that. The cumulant will always be convex, right? And if you do the Legendre transform of a convex function, you again get a convex function; that's how the Legendre transform works, right? But no one guarantees that it is the actual rate function. Remember how I derived the relationship between the cumulant and the rate function? What I assumed was: suppose there is a rate function; then the cumulant function is the Legendre transform of the rate function, okay? That's what I assumed, right? I never showed you any reason to think that the Legendre transform of the cumulant function is always the rate function. I never showed you any argument like that. Then, in the discussion after that, the connection you made between the rate function and information theory... Yes. Okay, why don't you finish your question? In that part, should we assume the convexity of the rate function, or not? Yes, there you are definitely assuming the convexity of the rate function. Thank you. So the idea of typical sequences and so on may not hold unless the rate function is convex and has a unique minimum, okay? Yes? Then my final question: for real-world data, like MNIST, would you say the rate function of that data is convex or not? I want to ask about that. Ah, you can't ask about the rate function of data. You have to tell me: what is your random variable, for which you want to derive a rate function? Data doesn't come with a rate function. You look at the data, you come up with a random variable associated with the data for which you want to find a rate function, and then we can do that. But you can't just give me data and say, find me a rate function. You have to tell me what random variable you want the rate function for. And that random variable would be some function of, say, the pixel values of one of the MNIST images, for instance, right? So you'd have to come up with some quantity that is calculated from the pixel values of an MNIST image, right? And then I could tell you whether there is or is not a rate function for that random variable. Yeah. Thank you for the answer. Yeah. On the previous page. Yes. One more. One more. Yes. You said that beta is something like n times k. No, no, no, that was wrong. I know I said it; that's not true. Beta is.
In this case, beta is minus k, yeah. Minus k. Yeah. The reason I say that is, if you write it as e to the minus beta h, this is e to the minus beta n times one over n times h, right? So this is the little h_n, right? There's already a factor of n in there. And then shouldn't it be k times the little h_n? I'm a bit confused. In fact, if you look at the next slide, where I write phi and little s, it always comes with little h, okay? The reason I don't rescale this is that the factor of n is coming from little h, right? I started with this, then I write it like this, and this has the little h_n, right? Yes, and that's why the rescaled cumulant function has just beta and h_n, yes? Yes. Are you sure? Yes, okay. Just before the equilibrium part, I did not understand why the entropy approaches a star. Ah, no, that's an assumption. Suppose I of a is convex and has a unique minimum at a star. It's an assumption. Even granting the assumption, I'm not able to get the third statement. Well, A_n is just minus one over n log of p_n in that case, and if its rate function has a unique minimum at a star, then A_n concentrates at a star. And the sum of the probabilities over all sequences is one. No, no, no, there was a one over n in the definition: minus one over n log p_n, okay? Yeah. Regarding these typical sequences. Yes. Well, it's not one, it's a whole class, but yeah. Yeah, but there was this Bayes' theorem, there was P of rho. Are this typical-sequence idea and that rho connected, in the sense that for realistic data, if the data is very large, we expect rho to be typical? I mean, whatever rho we see would be the most typical rho. No, wait: rho was a probability density, a probability distribution. How do we observe that? Your data is always a draw from that probability distribution. So what you're talking about is Sanov's theorem: if you have some set of observations, then there's an empirical density, and you can ask how close that empirical density is to the true density. That's the kind of question you can ask, right? But just because there are typical sequences, and most of the data happens to fall in the set of typical sequences, right? That doesn't mean that you can get the true density from this. You know why? Because there can be atypical sequences, right? For which you're really not capturing all the probability, right? Almost surely, the sequence that you observe in the n goes to infinity limit will be one of the typical sequences, right? But this "almost surely" can leave out a lot of stuff, right? It's only in the n goes to infinity limit that it's almost sure, right? So I would just be careful about assuming that my empirical density is going to be close to the true density. We hope, and that's what Sanov's theorem is saying: the rate function is given by this Kullback-Leibler divergence between mu, your empirical density, and rho, the true density, right? What is the rate function saying? That if mu is not exactly rho, then its probability is going down exponentially, right? So that's as close as you can get. I mean, the most you can say is that you are probably very, very close, right? If you are not close, then as n goes to infinity, that probability will go to zero, right? Because that's what we know: in the exact n goes to infinity limit, if there is a law of large numbers, you will end up at the true mean, right? That's the best you can do, okay? Yeah? Okay, oh, shall we stop here? Let's take a break.