Yes, so Professor Joe told me that you were complaining I didn't go fast enough yesterday, so I'll try to speed it up today. I'm covering more material too.

All right. Today we want to talk about, how should I put it, new ways to think about very old problems. The first thing I want to tell you about is importance sampling. What does that mean? Let's do a little example. Our problem is something we do all the time: we want to estimate the expectation value of some observable over a random variable X, and we have some probability distribution that weights the importance of any given value of X. That's p(x). A lot of what we talked about in the first three lectures was how to determine p(x), the probability distribution or probability density for the random variable X, and we probe it with many different observables. We do this all the time: we calculate expectation values, over some probability density, of the value of some observable.

Now, what if this observable is only non-zero where the density is very small? When I write this symbol, the calligraphic I, what I mean is the function that is one where the observable is non-zero and zero everywhere else; it's shorthand for the Heaviside, or indicator, function. Suppose this expectation value is less than 10^-8; in other words, the probability density of all the points where the observable is non-zero is very, very small. Then if we wanted to estimate the expectation value, we would have to draw more than 10^8 samples just to get even one example where the observable is non-zero. So it is very inefficient to sample in regions where the probability density is very small. The whole idea of importance sampling is to distort the density, to replace this p(x) with something where we know exactly how we modified it, but where the events where the observable happens to be non-zero get a higher probability in the distorted density.

So let me give you an example of how this works, of what I concretely mean by this. We want to estimate this gamma; I is some function that is non-zero in some range of the values that the random variable takes. We draw n samples from this density, so we are doing some sort of Monte Carlo estimation of this number gamma, and our estimate gamma_n is the average: one over n times the sum of I evaluated at each x_i. Very simple. This estimate has variance gamma times (one minus gamma) divided by n; also very standard.

Now suppose we want a confidence interval for our estimate of gamma. This is where we use the central limit theorem: we are sampling the same thing many, many times. So if someone says "I want alpha, the level of the confidence interval, to be 0.01," you look up the z table, which tabulates the area under the Gaussian, at alpha over two, and the interval is gamma_n plus or minus z at alpha over two times the square root of the variance. Also very standard.
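For reference, here is the estimator and confidence interval just described, written out (with the estimate plugged in for the unknown gamma in the variance, as is standard practice):

```latex
\hat{\gamma}_n = \frac{1}{n}\sum_{i=1}^{n}\mathcal{I}(x_i),\qquad
\operatorname{Var}[\hat{\gamma}_n] = \frac{\gamma(1-\gamma)}{n},\qquad
\gamma \in \hat{\gamma}_n \pm z_{\alpha/2}\sqrt{\frac{\hat{\gamma}_n(1-\hat{\gamma}_n)}{n}} .
```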
For example, if alpha is 0.01, asking the confidence interval for gamma to be smaller than 0.1 times our estimate means requiring 2.576 times the square root of the variance to be less than 0.1 times gamma_n; this number 2.576 is the z value we look up for alpha equal to 0.01. So what value of n do we need? I gave you the formula for the variance; there's the n, so we solve for n, and we find that n needs to be about (2.576 divided by 0.1) squared times (one minus gamma) over gamma, where the (one minus gamma) over gamma comes from the variance divided by gamma squared. So if gamma is very small, say ten to the minus six, that means we need almost seven times ten to the eight samples to estimate this very, very small number. That is the real problem with trying to estimate a quantity that is only non-zero where the event is very, very rare. That's what we're up against.

So how do we do this? The basic idea, as I said, is that we are going to distort p(x) into another density, and what we're going to talk about right now is how to choose this distorting density. Going from here to here, you see that I multiply and divide by some density q(x). I can do that for any density q(x) that is non-zero over the range of values that x takes. And now the idea is that we change our observable: instead of the indicator function, which is non-zero only in some region, the observable becomes p(x) divided by q(x) times the original observable. So instead of E_p, the expectation with respect to the measure p, we are now evaluating the expectation with respect to the measure q, of a different observable. Very simple idea. Is it clear what we're doing? We're just multiplying and dividing by q, and then we're going to try to be clever in how we pick q.

Intuitively, we want to pick q(x) so that the values of x where I is non-zero are not rare in this distribution. That's the idea: if we sample with respect to q, the values of x where our original observable is non-zero are not rare. Let me draw it; that will make it a little clearer. Suppose our initial distribution looks like so, and we want to evaluate something way out in the tail: the observable is greater than zero only here. If we draw from where the density is high, the indicator is zero there, so those samples are totally useless for estimating this. So what we're going to do is distort this: this was our original density p, and our new density q might look more like this. Now the region where our observable is non-zero is much more likely to occur if we sample with respect to q. And we compensate for the fact that we changed the density to q by changing the observable to p over q times the observable. The samples we need are no longer rare under q, and we compensate for the distortion by multiplying by this ratio. The ratio can be small, in fact it will be small, because p is small in this region and q is large, but the sampling will give us these values much more frequently.
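Here is a minimal numerical sketch of the idea in Python. The specifics are assumptions of mine, not from the lecture: p is a standard normal, the rare event is X > 4, and the distorting density q is a normal shifted so that the rare region sits in its bulk.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000
threshold = 4.0                 # gamma = P(X > 4) is about 3.2e-5 for a standard normal

# Naive Monte Carlo: almost no samples land in the tail, so the estimate is very noisy.
x = rng.standard_normal(n)
naive = np.mean(x > threshold)

# Importance sampling: draw from q = N(threshold, 1), which makes the rare region typical,
# and reweight every sample by the likelihood ratio p(y)/q(y).
y = rng.normal(loc=threshold, scale=1.0, size=n)
weights = stats.norm.pdf(y) / stats.norm.pdf(y, loc=threshold, scale=1.0)
is_estimate = np.mean(weights * (y > threshold))

print(naive, is_estimate, 1.0 - stats.norm.cdf(threshold))   # compare with the exact tail
```

Both estimators are unbiased; the point is that the reweighted one reaches a usable relative error with orders of magnitude fewer samples.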
That's really all there is to it: a very simple idea that turns out to be very powerful, as I'll try to show you.

So now what we're going to talk about is how to pick q in some sort of, how should I say, rational way. Intuitively, as I tried to explain, we want q(x) to be large where our observable is greater than zero, or not equal to zero. But any q will lead to the same expectation if you draw enough samples. So what's different about different q's? How would we go about picking one, if with enough draws any q ends up giving the same expectation value? Right: we want some way to quantify how quickly the estimate converges, and the way you quantify that is the variance of the distorted observable. We calculate what the variance of this distorted observable is. With a little bit of algebra (I just plug the distorted observable in here; remember this is the indicator function, so I squared equals I), the expectation with respect to the new density q of the distorted observable squared, which is q times (p(x) over q(x)) squared times I, turns out to be the expectation with respect to our original density p of the distorted observable. So what we're calculating is the variance, and what we want to do is find the q that minimizes this variance. That's going to be our criterion for which q to pick.

Now I have some bad news. If you actually try to rationally derive which q you should use, the answer is going to be: pick a q that's built from this unknown density p. That's not useful. So in practice it's more a matter of looking at the data and finding a q that will minimize the variance. You might pick some parameterized family of q densities and see where you find the lowest variance.

Simple idea. Did we get it? We modified the observable and we modified the original density in such a way that the expectation is the same, but the sampling is much more efficient, and then we try to figure out how to distort by making the variance of our estimate as low as possible.

Will this always work? Well, no. There is a little bit of art in how you do the importance sampling. So here's a counterexample, if you like. Suppose p(x) is an exponential density, the exponential distribution we covered in lecture one, and suppose you want to estimate the probability that x is larger than some fixed y. Let's try distorting with another exponential distribution, with some other parameter mu. If you do this little calculation, what you'll see is lambda squared over mu times (two lambda minus mu), and then an exponential factor in which (two lambda minus mu) appears, which makes it clear that mu cannot be bigger than two lambda. Lambda is unknown, but mu cannot be bigger than two lambda, because otherwise this integral diverges.
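One way to reproduce the formula on the slide, assuming the observable is the indicator of {X > y}, p(x) = λ e^(-λx), and the distorting density is q(x) = μ e^(-μx): the second moment of the weighted observable under q is

```latex
\mathbb{E}_q\!\left[\left(\frac{p(X)}{q(X)}\,\mathcal{I}_{\{X>y\}}\right)^{2}\right]
= \int_{y}^{\infty}\frac{p(x)^{2}}{q(x)}\,dx
= \frac{\lambda^{2}}{\mu\,(2\lambda-\mu)}\;e^{-(2\lambda-\mu)\,y},
\qquad 0 < \mu < 2\lambda ,
```

and the integral diverges for μ ≥ 2λ, which is where the constraint below comes from.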
So that's an upper limit to how much we can distort, to how we pick q: mu has to be less than two lambda, otherwise the variance is infinite. Similarly, mu can't go to zero. In that limit the density becomes very flat, hardly a distortion in one sense, or rather a very drastic one, and the variance diverges again there. So in this parameterized family that I picked there is no parameter value at which the variance is comfortably minimized: I said we try to minimize the variance, but you first have to do a sanity check that you picked a family in which the variance is actually bounded. Statisticians spend a lot of time proving that certain classes of these distorting densities really do give you bounded variance everywhere in the space of distorted densities.

Okay, now let's apply this importance-sampling, measure-redefinition idea to actually prove Cramér's theorem. You remember that way back I stated Cramér's theorem as a consequence of the Gärtner-Ellis theorem, and I never proved the Gärtner-Ellis theorem; I just motivated it. So there I cheated, because I said we'll use Gärtner-Ellis to prove Cramér's theorem. Now I want to give you a direct proof of Cramér's theorem, using this idea of importance sampling, the measure redefinition. We're going to go through this slowly, so please don't let your eyes glaze over at too many equations.

Here is the probability that the sample average, the sample mean, is greater than some value mu plus epsilon. I'm basically going to redo Chebyshev's inequality in this measure-distorting way, so just follow along. I can rewrite this probability as the probability that the sum of all the variables minus n times mu is bigger than n epsilon; in other words, I multiplied the argument of the probability by n everywhere and brought the mu over to the left-hand side. Now, if you remember our first proof of Chebyshev's inequality, we said that since we are looking at the event where this quantity is greater than n epsilon, the expectation of the quantity squared, divided by n squared epsilon squared, bounds this probability. The numerator is just n times sigma squared: if X is a random variable with mean mu and standard deviation sigma, then this is n independent copies of X minus mu, so its second moment is n sigma squared. The denominator is n squared epsilon squared. So the bound is sigma squared over n epsilon squared, which is just the statement of Chebyshev's inequality: the probability that the sample mean exceeds mu by more than epsilon is at most this. We stated it the other way around earlier, but it is the same bound.
And you can do the same argument if you change the sign, for mu minus epsilon; the same kind of argument. Now look at this: the same argument goes through if we put in an arbitrary non-negative increasing function Psi. If Psi is any non-negative increasing function, then this probability is less than or equal to the probability that Psi of the left-hand side is bigger than Psi of the right-hand side, simply because Psi is non-negative and increasing. What examples do you know of non-negative increasing functions that we are very fond of? More generally, we're very fond of the derivatives of convex functions. So keep in mind that this Psi will turn out to be the derivative of a convex function at some point. All right: x squared was the function that we used a moment ago; here we take Psi to be any non-negative increasing function and we get this bound. The same argument, just stated more generally.

Now take Psi to be the exponential function. Why do I do this? Because, remember, we were distorting the density. In the large-deviations literature this is called tilting the measure, just in case you ever read a paper on large deviations and see the phrase "tilting the measure": that is what they're doing, multiplying by e to the lambda s. For this choice of Psi, we just evaluate the bound, and what we get is that one over n times the log of this probability is less than or equal to minus lambda s plus one over n times the sum over i of the log of the expectation of the exponential. That's all we did. I may have left out a log here; I did leave out a log here. There should be a log right after the first one over n; then you'll see what I mean, because you have Psi of n s in the denominator, and Psi is e to the lambda s, which is where that minus lambda s comes from. So, sorry, there is a log between the first one over n and the P.

Do you recognize the cumulant generating function? Anyone? There are n copies of the cumulant generating function right here: the expectation of e to the lambda x is the moment generating function, and the log of that is the cumulant generating function. So what we have is n separate copies, and this tells us that the left-hand side, minus one over n times the log of the probability, is greater than or equal to the supremum over lambda of lambda s minus the cumulant generating function. Hopefully that reminds you of something from way back in prehistory, like Monday or something: that is the Legendre transform. And this is exactly what the importance-sampling idea is trying to get at. Again, there's a log missing here as well; so there's a log here and a log here. All right, that gives us one side of the bound.
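With the missing logs restored, the chain of steps just described is the standard exponential (Chernoff-type) bound; for i.i.d. samples with cumulant generating function K(λ) = log E[e^(λX)], and λ ≥ 0 for the upper-tail event,

```latex
\Pr\!\Bigl(\tfrac{1}{n}\textstyle\sum_{i=1}^{n}X_i \ge s\Bigr)
\;\le\; e^{-n\lambda s}\,\mathbb{E}\!\left[e^{\lambda\sum_i X_i}\right]
\;=\; e^{-n\left(\lambda s - K(\lambda)\right)}
\quad\Longrightarrow\quad
-\frac{1}{n}\log\Pr\!\Bigl(\tfrac{1}{n}\textstyle\sum_i X_i \ge s\Bigr)
\;\ge\; \sup_{\lambda\ge 0}\bigl(\lambda s - K(\lambda)\bigr).
```

The supremum on the right is the Legendre transform of K evaluated at s, which is the rate function appearing in Cramér's theorem.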
So let's do the other bound. Now, again, you see the distortion: our new measure q(x) is p(x) times e to the lambda x divided by M_X(lambda). Why do we put in this M_X(lambda)? It is just a number; it's not a random variable, it's not a function of our random variable, it's just a function of lambda, the moment generating function. Why do we put it in the denominator? Because then, if we calculate the expectation value of 1 in this measure q, the measure q is still normalized. So this is a concrete example of where this tilting of the measure, this importance-sampling change of density, is actually going to be used.

So now we do the same calculation, and this time I left the log where it was supposed to be. Let's not go through the algebra in great detail, but basically you do a little of the same kind of algebra and you end up with this; the same sort of arguments give it to you. That's just taking the log and noting that this part is just this part. But when we take the log of the expectation of this part, right here, the Heaviside indicator function, that's where we have to be a bit careful. That I'll do in detail now.

So are we clear about what's going on here? We did one side of the inequality, bounding the rate function in terms of lambda s minus the cumulant generating function. Now we're doing the other side of the inequality, so that at the end we can say it's true because the rate function is both less than and greater than this function; we're looking at it from both sides. In the statistics literature, what they actually prove is that there is an upper bound for the rate function and a lower bound for the rate function, because that's the general case: you get an upper bound and a lower bound and they are not coincident. There is a supremum on one side and an infimum on the other, and that is the precise mathematical way of formulating Cramér's theorem, or anything to do with rate functions. If you look at a mathematical paper on rate functions, it will always talk about the lower bound and the upper bound for the rate function. But we're physicists and data scientists, so we don't care.

All right. Now, as I said, I'm going to go through this carefully, because this is where we see exactly where the Legendre transform comes in. We have this form, and we'd like to say that this is the only part that matters as we take the limit epsilon goes to zero. That means we have to show that the other term (on the right for me, on the left for you) actually goes to zero. That's not obvious. So there's that term, and we want to show that, in the limit epsilon goes to zero, one over n times the log of the probability that the sample mean is greater than s but bounded above by s plus epsilon goes to zero.

So now let's look at some properties of the cumulant generating function. First of all, its derivative with respect to lambda, if you write it in terms of the tilted measure q, is the expectation of x in the distorted measure; in our original measure it is the expectation of x times e to the lambda x divided by the moment generating function at lambda. On the other hand, essentially by definition, at lambda equals zero the derivative is just the true mean. So what happens as lambda goes to infinity? What's the value of the derivative of the cumulant as lambda goes to infinity? This is the general form; tell me what's going to happen as lambda goes to infinity, make a guess. Look, it's just a probability density, right?
Imagine the probability density. If I distort the measure, giving more and more weight to the highest values of x where the density is non-zero, then as lambda goes to infinity the expectation just goes to that highest value; the tilted measure is concentrated only on that highest value. So as lambda goes to infinity, the derivative of the cumulant goes to the largest value in the support of x.

So now: it starts at mu and goes to the maximum of the support. Therefore, for any value of s between the mean and the maximum of the support, there is some value of lambda, which is a function of s, such that the derivative of the cumulant generating function evaluated at that lambda of s is s plus epsilon over two. Why do I say s plus epsilon over two? Because then the event we care about sits right in the center: under the tilted measure, the probability that the sample mean lands between s and s plus epsilon is not going to zero, it is going to one, because I picked lambda of s so that the tilted mean sits in the middle of that interval. The log of that probability goes to zero, and that is independent of epsilon. It's a little bit subtle, but it's important; you want to be careful with these things, because the rate function is not just any old smooth function. It can have places where it is actually infinite. What does it mean when a rate function has an infinite value? Come on, I won't bite your head off, just guess. Right: the probability is zero at that value. The rate function had better be infinite there, because, remember, e to the minus n times the rate function is the probability. So if the rate function happens to be infinite, the probability is exactly zero: the sample mean is never going to reach that value.

So it's important to show this, and what you see here now is the other side of the inequality: lambda of s times s, where lambda is now a function of s, minus K evaluated at lambda of s, and that is exactly the Legendre transform. So the rate function as a function of s is equal to the Legendre transform of the cumulant generating function. That is an honest proof of Cramér's theorem.

Are there any questions? Because now we start to do funky stuff. Yes, please. [A question from the audience.] All right, so here's a line, and suppose p(x) looks like this; this is the original p(x). Now, we are multiplying this by e to the lambda x; that was our q(x). So for different values of lambda, this will start to look like this, but it can never extend past the highest value of x at which p(x) is non-zero. As lambda gets larger and larger, q(x) becomes more and more like this. Ultimately, as lambda goes to infinity, remember that q(x) is normalized, its integral is one, so the only thing it can become is a delta function exactly at the largest value of the support of p. That's what I was trying to explain: as you distort to higher and higher values of lambda, the area under the q(x) curve is still one, but it gets more and more concentrated at the maximum end.
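Here is a tiny numerical illustration of that picture, with specifics that are mine rather than the lecture's: a discrete density on {0, ..., 5}. The tilted mean, which equals K'(λ), starts at the ordinary mean and climbs toward the top of the support as λ grows.

```python
import numpy as np

# A toy density on a finite support {0, 1, ..., 5}; any bounded density would do.
x = np.arange(6)
p = np.array([0.30, 0.30, 0.20, 0.10, 0.07, 0.03])

def tilted_mean(lam):
    """E_q[X] under the tilted measure q(x) = p(x) e^{lam x} / M(lam),
    which is the same thing as K'(lam)."""
    w = p * np.exp(lam * x)
    return np.sum(x * w) / np.sum(w)

for lam in [0.0, 1.0, 3.0, 10.0]:
    print(lam, tilted_mean(lam))
# lam = 0 reproduces the ordinary mean; as lam grows the tilted mean climbs
# toward max(support) = 5, exactly the delta-function picture drawn on the board.
```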
So that's where that limit comes from. Okay, so now we're going to use this probability tilting in funky ways to make the maximum-likelihood maximization that we've mentioned several times basically trivial, or at least very easy to do. Let me take a little detour to explain the motivation for what we're going to do.

Suppose you have a single spin, and it takes values plus one and minus one. That's it. We can form a partition sum. There is a certain probability that the value plus one is taken and a certain probability that the value minus one is taken; if it's plus one we have this probability and this likelihood, and when it's minus one we have that likelihood. Where does the one half come from? That's one over the number of configurations. And if you like, you can think of epsilon as an inverse temperature, but really all it is doing is parameterizing a probability in a certain way. A probability distribution on a variable that only takes the values plus or minus one is just two numbers, and the two numbers have to add up to one; you can parameterize them any way you want.

Now what I'm going to do is write this exponential, which always takes the form e to the minus epsilon times some number, as one minus (one minus the exponential). That we write as one plus delta_ij, where delta_ij is this whole thing after the first one: delta_ij is minus (one minus e to the minus epsilon v_ij). What's the motivation for writing it like this? Imagine epsilon is very small. Then e to the minus epsilon v_ij is pretty close to one, roughly one minus epsilon v_ij, so this whole delta term, for small epsilon, is of order epsilon times v_ij. And why write it as plus delta_ij? Just so that when we expand everything out we don't have to worry about minus signs all the time; we only plug in the value of delta_ij at the very end.

Writing it like this, you see that I didn't make any approximation at all. I just rewrote the sum over configurations as a sum over configurations of the product of (one plus delta_ij) factors. No approximations whatsoever. And now we can expand this product. Again, it's just algebra: it's a multinomial, and we can certainly multiply it out; if it's only one or two terms even I can do it, and you guys are young, you can do more algebra. So this is a sum. We take the one out, and we get that Z is one plus one over N times the sum over configurations of, first, the terms linear in delta_ij, there they are, then the terms quadratic in delta_ij, there they are, and so on. This is called the Mayer cluster expansion. Sometimes it's called the strong coupling expansion; that's typically in particle physics or lattice gauge theory. In statistical physics it's usually called the high-temperature expansion, because epsilon is one over the temperature, so when the temperature is large, epsilon is small. But we're not really doing statistical physics.
We're just taking this sum over probabilities and expanding it in different ways.

So now we're going to do this very, very carefully for a single spin taking values plus or minus one. The partition function is Z equals one half times (e to the minus epsilon w plus e to the minus epsilon times minus w); the first term comes from sigma equals plus one and the second from sigma equals minus one. When we write it in terms of the delta variables we introduced, it looks like Z equals one half times ((one plus delta_plus) plus (one plus delta_minus)), and we can expand it out: one plus a half of (delta_plus plus delta_minus). For epsilon small we can expand this, and we get that Z is one, minus epsilon over two times (w minus w), oops, that's zero, plus epsilon squared over four times (w squared plus (minus w) squared). So we end up with Z equals one plus epsilon squared over two times w squared, plus dot dot dot, where everything onward is higher order in epsilon. We're cool with this? The take-home message here is basically that Z in this high-temperature, or small-epsilon, limit is quite trivial to calculate. That's all there is to it.

Just to belabor this point, I'm going to do two interacting spins now. Then we'll do three, four, one at a time... okay, let's just do two interacting spins. So now the number of configurations is four: two spins, each taking values plus or minus one. And we can have three different non-trivial interactions, observables so to say, that we might put in: sigma one and sigma two, that's two of them right there, and the product sigma one times sigma two. So Z looks like this, and we can be more organized about this expansion. We can write e to the w sigma in terms of cosh of w sigma plus sinh of w sigma. I didn't make any approximation; this is just algebra. But now notice that cosh is an even function, so whether sigma is plus one or minus one, cosh of w sigma is just cosh of w, independent of sigma. Similarly, sinh of w sigma is an odd function, so I can drag that sigma out of the sinh, and I get cosh of w plus sigma times sinh of w. I know you must be thinking he's gone nuts, he's going so slowly, what is he doing, but bear with me; I want to be sure everyone understands what we did here. So now we factor out cosh of w and we get cosh of w times (one plus sigma tanh of w). We all agree with this? Good. So now the cosh terms don't depend on sigma at all, so they can come out of the partition sum; every term has a cosh factor, and for every parameter w1, w2, w3 there is a cosh that comes out.

Yes, please? [A question from the audience.] Ah: there is w1 sigma 1 plus w2 sigma 2; I use the Einstein summation convention, so when I write w_i sigma_i, I mean w1 sigma 1 plus w2 sigma 2. And this one is a new variable, a separate parameter w3; it's not w cubed, sorry, bad notation. It's a separate new w parameter multiplying sigma 1 sigma 2. It's not a higher power, no, it's a totally separate parameter.
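The identity just used, written out for a single spin σ = ±1, together with the single-spin partition function it immediately gives:

```latex
e^{w\sigma} = \cosh(w\sigma) + \sinh(w\sigma) = \cosh(w)\bigl(1 + \sigma\tanh(w)\bigr),
\qquad
Z = \tfrac12\bigl(e^{\epsilon w} + e^{-\epsilon w}\bigr) = \cosh(\epsilon w)
= 1 + \frac{\epsilon^{2}w^{2}}{2} + O(\epsilon^{4}).
```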
So there are three parameters: w1, w2, and w3. I just applied this to each exponential separately: I write the exponential of the sum as a product of three exponentials, and then each exponential I write as cosh times (one plus sigma tanh), and so I get this expansion. And you notice that this first part does not depend on the configurations at all; it's only this part that depends on the configurations, so that's the only part we ever need to sum. What's the only kind of term that could survive? Remember that sigma 1 independently takes values plus or minus one, so any term with a single sigma 1 multiplies something independent of sigma 1 and cancels when we sum over configurations; any term with a single sigma 2 cancels the same way; and the terms proportional to sigma 1 sigma 2 also cancel. So even with interacting spins, Z is very simple to calculate to this order.

Okay, so now we go back to lecture one, where I showed you that when you take the Kullback-Leibler divergence between the observed frequency distribution and the probability distribution whose parameters we are trying to figure out, its derivative with respect to a parameter is the model expectation value minus the empirical expectation value of the corresponding observable. We did this way back when. And we said that this model expectation value is the derivative of the log of the partition function, which is why it is difficult to calculate: the partition function is difficult to calculate. So was I lying then, or am I lying now? I just showed you the partition function is trivial to calculate in the high-temperature limit. So what's going on? No: even if I had an infinite number of spins, in the high-temperature limit that whole configuration sum just becomes one, to leading order. So I was not lying then: what is hard is the full partition sum at any temperature, not just at high temperature. That is very hard to calculate.

Now, I want to find these parameters theta_i, or w_i, but we have no idea of the magnitude of the interactions of the spins. So what do I even mean by going to a high-temperature expansion? I have no idea what w is; how can I talk about high temperature? This is where we use our favorite new trick: tilting the measure. So let me be very concrete about what we're doing here.

We have interactions; call them o_i. We have some set of random variables, spins taking values plus or minus one, n of them, and their interactions, in other words the observables that determine the probability of observing a certain configuration, are given by these operators, these spin products: sigma_i, sigma_i sigma_j for i less than j, and you can have cubic and quartic pieces and as many higher-order terms as you want. There is one parameter for each of these observables, so using the Einstein summation convention I write e to the w_i o_i, where o_i depends on the configuration sigma. So the probability of observing a specific configuration, given the parameters w_i, is parameterized like so. We have seen this before: e to the w_i o_i, with o_i a function of sigma, and the all-important normalization of the probability is the partition function, right there.
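Before moving on, here is a quick numerical check of the two-spin bookkeeping from above, with made-up coupling values. The one product term that survives the configuration sum is third order in the couplings, so to the order kept in the lecture Z is just four times the product of coshes.

```python
import numpy as np
from itertools import product

w1, w2, w3 = 0.3, -0.2, 0.1   # hypothetical couplings for sigma1, sigma2, sigma1*sigma2

# Brute-force partition sum over the four configurations of two +/-1 spins.
Z_exact = sum(np.exp(w1*s1 + w2*s2 + w3*s1*s2)
              for s1, s2 in product([-1, 1], repeat=2))

# Factorized form: each exp(w*O) = cosh(w)(1 + O*tanh(w)); after summing over
# configurations, only the term in which every spin appears an even number of
# times survives, so Z = 4 cosh(w1) cosh(w2) cosh(w3) (1 + tanh(w1) tanh(w2) tanh(w3)).
Z_factored = 4 * np.cosh(w1) * np.cosh(w2) * np.cosh(w3) \
               * (1 + np.tanh(w1) * np.tanh(w2) * np.tanh(w3))

print(Z_exact, Z_factored)    # identical; for small couplings Z is close to 4*prod(cosh)
```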
We have seen this parameterization before. And if we maximize the likelihood, what we get is some w-star at which the solution is that the probability of observing the configurations at w-star equals the frequency of the observed configurations. This was from lecture one. And that frequency is, of course, just the number of observations of a configuration divided by the total number of observations. So where's the temperature? The basic idea here is going to be this: imagine there is some sort of potential in which these spins are interacting. This is the actual potential, and what we're going to do is gradually flatten this potential. The whole trick is how to flatten the potential while maintaining the solution of the maximum-likelihood equations, namely that the observed frequencies of the observables are equal to the expectations under the measure we are determining. We have to maintain that the empirical expectation of each observable equals the model expectation of that observable, but try to make the model side easy to calculate.

Again, repeating: we have the Kullback-Leibler divergence; these are the observed frequencies, there is the model probability. Gradient descent tells us that we should change w_i, the parameters we're trying to determine, by some learning rate times the expectation of this observable in the empirical distribution minus its expectation in the theoretical distribution. And whether it's the empirical distribution or the model, the value of such an expectation is a sum over all configurations of that distribution evaluated at the configuration times o_i at that configuration. Why can I write "all configurations" even for the empirical distribution? Because the empirical distribution is zero except at the observed configurations.

All right, the small-epsilon limit. I showed you explicitly, for the two-spin case and the one-spin case, that the only thing that matters is that product of coshes at the beginning. So if I scale the whole thing, replacing the couplings w_i by epsilon times w_i, the expansion of the partition function is one plus epsilon squared over two times the sum of the w_i squared. So in this limit, if I define p_epsilon of sigma to be this form right here, then the expectation value of o_i is just epsilon times w_i. That's it. And what did I do here? There's the p of sigma that we had before; I multiplied it by p of sigma raised to the power epsilon minus one, so, doing the algebra, this is p of sigma raised to the power epsilon. And this denominator is just normalizing it, because p of sigma raised to the power epsilon is no longer normalized: if the original p of sigma was normalized and I raise it to some power epsilon, it is no longer normalized, so we have to renormalize it, otherwise none of this makes any sense. So p_epsilon of sigma is not just taking p and raising it to some power; you also have to fix the normalization. With this denominator it is still normalized. Good. So how is this useful?
After all, we want to ensure that the empirical expectation of each observable equals the model expectation; we don't want to solve a different equation. We want the solution of the equation we solve to be exactly the same as the solution of the maximum-likelihood equation. Well, actually, all we want is that the solution be the same. So we are also going to redefine the empirical frequencies, in exactly the same way as we modified the theoretical distribution: multiply them by p of the configuration raised to the power epsilon minus one (ignore the stray parenthesis on the slide). And we also have to normalize the reweighted observations, just the way we normalized p_epsilon. Now, if we minimize the Kullback-Leibler divergence between this normalized distribution and this normalized distribution, it still gives you the answer that you want. If you follow the steps that we did, remember there is a Lagrange multiplier that enforces normalization; follow the steps one by one and you get the same answer.

So what was the benefit? The benefit is that gradient descent now says that delta w_i is proportional to the empirical expectation in this distorted distribution minus the model expectation in the high-temperature limit; in other words, it's just the empirical expectation in the distorted distribution minus epsilon w_i. That's all there is to it. No partition function; or rather, we have already calculated the partition function, and the model expectation just happens to be epsilon w_i. When Professor Joe and I and our friend Ty first did this, we were really wondering whether it was actually going to work.

So why is this useful? Well, first of all, if I set epsilon equal to one, it says that w_i is equal to the empirical expectation value of the observable. That is called the Hopfield solution: John Hopfield said that that's the way you determine the parameters of an interacting set of spins. It's his idea of associative memory and Hebbian learning: the w_i are equal to the empirical expectations of the observables o_i. If we think of it as a differential equation, then for epsilon not equal to one this minus epsilon w_i term acts like a damping term, like a regularization. So epsilon equal to one is, in some sense, overdamped; it just stops right there.
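For concreteness, here is a minimal sketch in Python of the iteration as I read it from the lecture; the function names, the choice of pairwise observables, and the toy data are assumptions of mine, not the authors' code. The reweighting uses p(sigma)^(epsilon - 1) proportional to exp((epsilon - 1) times w dot o(sigma)), since the unknown normalization cancels once the weights are normalized.

```python
import numpy as np

def erasure_machine_fit(samples, observables, eps=0.5, lr=0.1, n_iter=200):
    """Sketch of the update described in the lecture (not the authors' published code).

    samples     : (N, m) array of +/-1 spin configurations (the data).
    observables : function mapping an (N, m) array to an (N, K) array of o_i(sigma) values.
    Model: p(sigma) proportional to exp(sum_i w_i o_i(sigma)).  In the small-eps limit
    the model expectation of o_i is eps * w_i, so no partition function is ever needed.
    """
    O = observables(samples)              # (N, K) observable values on the data
    w = np.zeros(O.shape[1])
    for _ in range(n_iter):
        # Reweight the empirical distribution by p(sigma)^(eps-1); the unknown
        # factor Z^(eps-1) is a constant and drops out after normalization.
        logu = (eps - 1.0) * (O @ w)
        u = np.exp(logu - logu.max())     # subtract the max for numerical stability
        f_eps = u / u.sum()               # reweighted empirical frequencies
        emp = f_eps @ O                   # <o_i> under the reweighted data
        w += lr * (emp - eps * w)         # gradient step; the model side is just eps*w_i
    return w

def pair_observables(S):
    """Spins and all distinct pair products as the observables o_i (a common choice)."""
    N, m = S.shape
    pairs = [S[:, a] * S[:, b] for a in range(m) for b in range(a + 1, m)]
    return np.column_stack([S] + pairs)

# Toy usage with random +/-1 data (purely illustrative, not the lecture's datasets):
rng = np.random.default_rng(1)
S = rng.choice([-1, 1], size=(500, 5))
print(erasure_machine_fit(S, pair_observables, eps=0.5).round(3))
```

With eps = 1 the weights stay uniform and the fixed point of the update is w_i equal to the plain empirical expectation of o_i, which is the Hopfield prescription mentioned above.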
Does this work? Well, I wouldn't really be talking about it if it didn't. So let's try different values of epsilon. This is the mean squared error; this is synthetic data, which is why I can talk about mean squared error at all; with real data you never get to talk about mean squared error. Epsilon equal to 0.1 doesn't actually get to the right values; epsilon equal to 0.8 again does not get to as low an error as you could; epsilon equal to 0.5 is lower than both. So somewhere in between there is an optimal value. This was plotted with respect to iterations, how many iterations it takes to get to the optimal value, and if you look at the mean squared error as a function of epsilon, you can see that it goes down and then starts to go up again, whether you have 10,000 data points or 5,000 data points. This is for m equal to 20 spins. Those curves were already on the last slide; the Kullback-Leibler divergence looks like so: this is epsilon 0.8, this is 0.5, and this is 0.1. The Kullback-Leibler divergence as a function of epsilon keeps going down, but we don't really care about that.

What's interesting for us, and what took us a long time to figure out, is exactly when to stop the iteration and what the best value of epsilon to use is, because this is an iterative solution. It turns out the number of iterations is not that hard to figure out, but the value of epsilon to use is hard to figure out. What we eventually figured out is that if you look at the internal energy, so to say, that internal energy is maximized at the value of epsilon that gives you the best fit. It's a pretty flat curve, so it's not very sensitive. Epsilon too small would be an underdamped system, epsilon very large an overdamped system, and either extreme gets you worse performance; but somewhere in the middle, if you just follow the energy, you do well. And I say follow the energy because that is something you can actually calculate from data; you don't need to know what the right answer is.

What's nice about this is that it is very cheap, and it can go to very large system sizes relative to any other way that I know. This is pseudo-likelihood estimation, and that's maximum likelihood. You can use exact maximum likelihood up to about 20 spins; maybe if you go to a cluster you can do a few more spins, but not that many more, because you have to calculate the partition function to do exact maximum likelihood. The pseudo-likelihood method is basically doing a Monte Carlo to figure out the denominator that we didn't want to calculate. Our method, which we call the erasure machine... why do we call it the erasure machine? Because it basically works by erasing the potential: we distort it to the point where it is an almost flat potential, and in the process of that distortion you can figure out the interactions. Look at it this way: you have to know what the interactions were before you can erase them, so in the process of erasing the interactions we learn what they were.

A few more slides. Weak coupling, strong coupling, sample size: it all works pretty well. Now, this is for m equal to 40, where you can't really use exact maximum likelihood, but you can do pseudo-likelihood. If you look at the accuracy, at weak coupling we do pretty well on most things, and at strong coupling we do very well. And that's now at a hundred interacting spins. The Hopfield solution still works there, because it is a very simple solution, w_i equals the expectation value; the Hopfield solution always works, even for very large systems, but not very well. The other two methods you can't even apply for very large systems. This is a hundred spins: the Hopfield solution works, but it doesn't give you the actual couplings; our solution gives pretty decent couplings by comparison. But just think what the state space is, the set of configurations for a hundred spins: two raised to the power one hundred. You are not sampling that exhaustively, no matter how large a data set you get. I mean, this goes up to 40,000 samples of a hundred spins; you're never going to reach two to the hundred.
So this is with a very small fraction of the total configuration space, which is all you can ever hope to get at. And because I work in biology, it is important for me to be able to get some estimate of couplings in very complex systems where there is just not enough data.

And that's the last slide, really: Hopfield, Boltzmann, and our erasure machine. This is the inferred interactions against the actual interactions, for a hundred spins; the red is our interactions and the black is what the Hopfield machine gets. Computation time: there's the exact Boltzmann calculation, and there's our erasure machine. I think that's at system size 24; that's as high as I could push the exact calculation. That's a factor of, on the order of, ten to the four difference.

So I'll end with what every data-science talk has to either start with or end with: MNIST images. What we did was ask whether we can use this kind of Boltzmann machine to remove noise from noisy images, if we have some training set. How do we use it? The idea is that each pixel represents a spin, so if you have, say, the digit eight, you train the w_ij to learn what the interactions should be for the digit eight. Now if I give it a distorted image with noise, the question is: by flipping spins, can we figure out what the correct pixel values should be for that digit? That's denoising, if you like, using a Boltzmann machine. And it does reasonably well; it doesn't do perfectly, but it does reasonably well at recovering the denoised image. So thank you for your attention; hopefully this was amazing. If you have any questions, talk to me or email me and I will respond. Thank you.

[Question] If I understood your procedure right, and maybe I didn't, you're reweighting all the configurations using this epsilon so that effectively almost everything you observe has almost equal frequency. Is that right? [Answer] That's right, yes. [Question] Then I guess I missed the intuition for why you're not killed by sampling noise on the rare configurations. If I reweight the observed frequencies, I would expect much more noise on the tails, and by reweighting I thought you would be killed by that noise, because you're boosting errors as well as signal. But somehow that doesn't seem to be the case in your experiments, so what am I missing? [Answer] No, because the model configuration that you're matching to is exact: that minus epsilon w, which is replacing the theoretical expectation value, is absolutely exact in that limit. And what happens is that we update w, and before the next iteration the observed frequencies are adjusted to the new w. This is a little bit subtle, so let me explain. You see this p of sigma? That's p of sigma as it should be. If you look at this p of sigma raised to the power epsilon minus one, inside the reweighted observed frequencies, that is from the previous iteration. So when we do the maximum likelihood, the only variation in p comes from the log p term, the F log p; the F_epsilon is not being varied. It's just the p term in the maximum likelihood that is being varied. Then, at the next iteration, before we update, we recompute the F with the new value of p. So at every step the potential is getting flattened, but it knows exactly how much it has been flattened, and the partition function is being calculated precisely.
So it's like importance sampling, if you like, but the tails of the distribution are handled by the fact that the theoretical-model part of the gradient, the partition function piece, is exact. I mean, we even use this (we haven't published this yet) to estimate the partition function itself, using basically this idea; if you follow it through, it actually gives you an estimate for the partition function. Even the naivest version misses something in the rare events, but for the majority it comes pretty close to the exact partition function. So what we are working on is improving the tail estimates. But you do not get killed by the variance, at least not in any experiment that we've done, simply because it is very strongly constrained: when you look at this term, that is actually a very strong constraint on what model it's trying to fit. Maybe this will help a little: if I were to add higher-order epsilon terms, this gets much worse. It's really this very strict form that is only true in the extreme epsilon-goes-to-zero limit; adding even the epsilon-squared term makes it worse, guaranteed. So it is very strongly determined by this form. Good question.

[Question] At the beginning of your talk you mentioned that you need to find Q, the tilting transformation, in such a way that you minimize the variance... [Answer] No, that was just to motivate how you can use different probability densities to get at what you're trying to get at. This is not quite importance sampling, because what is the Q in this case? What I'm doing is basically telling you what p divided by q, what the new q, is: I'm telling you that it is this very strongly determined strong-coupling limit, and then I'm asking what p should be. I'm distorting in a very, very specific way, a very rigid way. And it's p that you're trying to determine in this case; in the importance-sampling case p is given. Here we don't know what p is, but at every step of the iteration you have a guess for what p is. So you distort the F, which is a known distribution, with this hypothetical p distribution, and then you update the p distribution and match it to this F times p to the epsilon minus one. So you're distorting sort of both. It's not exactly importance sampling; it is, how should I say, an imaginative way to use importance sampling without keeping it exactly. Yes, okay.

Okay, let's stop here, and let's thank the speaker again for his nice talk.