All right, we've done a fair amount of theory, so let's work through some examples now. Everything I'm going to talk about is work done with Junghyo Jo and others.

Let's start with something very simple. We observe some stochastic system: a vector of spins taking values +1 and −1, and we watch them flip up and down probabilistically as time goes on, time step one, time step two, and so on. What we're trying to figure out is: what are the interactions between the spins that make them do this stochastic flipping?

So suppose first there's only one spin. Its state at the next time point is stochastically determined by its state at the current time. I wrote down a very general transition probability: the probability that σ(t+1) takes the value ρ (remember, ρ is just +1 or −1) is e^{ρ(b + wσ(t))} divided by 2cosh(b + wσ(t)). Our job is to figure out b and w; those are the parameters of the model.

We can look at the mean value and the covariance over a period of observation. The mean over some period is what you get by summing up all the +1s and −1s; you get a number m between −1 and +1. We can also look at the covariance between σ(t+1) and σ(t). Now, if we expect some sort of mean-field self-consistency, then we plug m into the expectation value, and that's where the tanh comes from: if I multiply this probability by ρ and sum over ρ = ±1, I get e^{+(b+wσ(t))} minus e^{−(b+wσ(t))} over the normalization, which is just the tanh. So for self-consistency we might think that m = tanh(b + wm). That's just the mean-field self-consistency we've seen in the Ising model many times.

We can be more sophisticated and say that the correlation between time t and time t+1 should be determined in the same way: the correlation between σ(t+1) and σ(t) should equal the correlation between tanh(b + wσ(t)) and σ(t). So that's the intuition. How do we derive these things? We maximize the likelihood of the observed time series: maximum likelihood, maximizing the probability of σ(t+1) given σ(t) over the whole series. (This isn't quite a likelihood, but close enough.) Differentiating the log-likelihood with respect to b gives one stationarity condition, and differentiating with respect to w gives another; in each case the tanh appears because it's the model expectation value.
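(The slide equations aren't in the transcript; the following is a plausible reconstruction, in the notation above, of the one-spin model and the likelihood conditions being described.)

```latex
% One-spin transition probability and its expectation (reconstruction):
P\big(\sigma(t{+}1)=\rho \,\big|\, \sigma(t)\big)
  = \frac{e^{\rho\,[b + w\,\sigma(t)]}}{2\cosh\!\big(b + w\,\sigma(t)\big)},
  \qquad \rho = \pm 1,
\qquad
\big\langle \sigma(t{+}1) \big\rangle = \tanh\!\big(b + w\,\sigma(t)\big)
  \;\Rightarrow\; m = \tanh(b + w\,m).

% Log-likelihood of the observed series and its stationarity conditions:
\mathcal{L}(b,w) = \sum_t \Big[ \sigma(t{+}1)\,[b + w\,\sigma(t)]
  - \log 2\cosh\!\big(b + w\,\sigma(t)\big) \Big],
\quad
\frac{\partial \mathcal{L}}{\partial b}
  = \sum_t \big[\sigma(t{+}1) - \tanh(b + w\,\sigma(t))\big] = 0,
\quad
\frac{\partial \mathcal{L}}{\partial w}
  = \sum_t \big[\sigma(t{+}1) - \tanh(b + w\,\sigma(t))\big]\,\sigma(t) = 0 .
```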
If I set both of these equal to zero (we want to maximize the likelihood), those equations are exactly the intuitive mean-field equations we wrote down. But we can do better. If I multiply the first equation by m and subtract it from the second, then (∂_w − m∂_b) acting on the log-likelihood gives the sum over all time of [σ(t+1) − tanh(b + wσ(t))] δσ(t), where δσ(t) is just σ(t) minus the mean-field value m. Now expand the tanh, assuming the deviation from the mean-field value is small: tanh(x + δ) ≈ tanh(x) + (1 − tanh²(x))δ. That's the nice thing about tanh: you differentiate it and you get basically a bump function. If I plug this expansion into the equation and set the result to zero, I can solve explicitly for w. So the naive mean-field approximation gives w explicitly in terms of the mean-field expectation value and the observed correlation functions (there's a small numerical check of this formula below).

That was expanding the tanh to first order. If you expand the tanh to second order, you get a better approximation: the Thouless-Anderson-Palmer approximation. If you instead treat the fluctuations in the field felt by σ(t+1) as following some sort of Gaussian distribution, you get the so-called exact mean-field approximation. We'll do something different, but I want you to remember what we ended up with: w written in terms of the mean field and a ratio of covariances. That's where these things are coming from.

Now, that was one spin; next we take a whole vector of spins. There will be indices and matrices where before we had scalars divided by scalars, but conceptually it's really all the same thing. So now we have an N-spin model, and the stochastic update rule is really very simple: the expected value of σ_i(t+1) is governed by a linear combination of the spins at time t, which we'll call the local field. The interacting picture is that a bunch of interacting spins jointly determine each spin's state at the next time step, with different weights: if we label the spins 0, 1, 2, 3, then the effect on spin 0 of spin 2 is W_02, and the local field is the sum over all the spins at the previous time, H_i(t) = Σ_j W_ij σ_j(t). That's how the spins at time t influence the state of each spin at time t+1. Instead of that single number w, we now have to determine the whole matrix W_ij, but conceptually it's the same thing. And again the model expectation value turns out to be the tanh of the local field: the model expectation of σ_i(t+1) is tanh(H_i(t)).
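Before moving on, here is a quick numerical check of the one-spin formula just derived. This is a minimal sketch of my own, not code from the lecture; the parameter values are arbitrary, and weak couplings are chosen so the first-order tanh expansion is accurate.

```python
# Simulate the one-spin model with known (b, w) and recover them from the
# naive mean-field formula  w ~ <dσ(t+1) dσ(t)> / [(1 - m^2) <dσ(t)^2>],
# then  b ~ atanh(m) - w m  from the self-consistency m = tanh(b + w m).
import numpy as np

rng = np.random.default_rng(0)
b_true, w_true, T = 0.1, 0.2, 100_000

sigma = np.empty(T)
sigma[0] = 1.0
for t in range(T - 1):
    p_up = 0.5 * (1.0 + np.tanh(b_true + w_true * sigma[t]))  # P(σ(t+1) = +1)
    sigma[t + 1] = 1.0 if rng.random() < p_up else -1.0

m = sigma.mean()
ds = sigma - m                              # δσ(t)
lagged = np.mean(ds[1:] * ds[:-1])          # <δσ(t+1) δσ(t)>
equal = np.mean(ds * ds)                    # <δσ(t)^2>, equals 1 - m^2
w_est = lagged / ((1.0 - m * m) * equal)
b_est = np.arctanh(m) - w_est * m
print(f"w: true {w_true}, estimated {w_est:.3f}")
print(f"b: true {b_true}, estimated {b_est:.3f}")
```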
All right. Now, tanh always has absolute value at most 1. So the ratio of the model expectation of σ_i(t+1) to the actual observed value σ_i(t+1) always has magnitude at most 1. What that means is: if we could somehow update the local field by the inverse of this ratio, the updated local field would give a model expectation closer to the observed σ_i(t+1) than the previous iteration's value. So we're iteratively trying to improve our estimate of W_ij, and our way of improving it is by altering the local field felt by each spin. Remember, tanh is a very nice function: it saturates. If I make the local field larger in magnitude, it pushes the value of tanh further out, closer and closer to the desired value ±1. That's why I say it should improve things. On the other hand, what if you got the sign wrong? I'm only comparing absolute values here. And there's another point: if I went time step by time step, I could make each step perfect, but W_ij is a global thing; it applies to all time steps. The W_ij you really want to end up with is a consensus best W_ij: you can optimize each single time step perfectly happily, but the same W_ij has to apply to every time step. So that's the intuition for why we want to update the local field like this: push H in such a way that we saturate the tanh as much as possible. And just as we first wrote down the mean-field equations intuitively and then derived them by maximizing the likelihood, now we want to derive an update of this form.

So we're going to derive that update rule using all the fun stuff we've learned about large deviations; it wasn't all useless, hopefully. The moment generating function is going to look like this: there's J and there's β. No matter what I say, don't think of β as an inverse temperature; it's just another parameter. And J is a vector of parameters. What we learned is that log Z is always a convex function of any parameter that occurs linearly in the exponential; I think I harangued about that enough times over the last two lectures that you'll remember it: log Z is a convex function of J and β. We know and love the moment generating function, and we like the cumulant generating function, log Z, even more. So define this convex free energy F as log Z. As we've discussed, the derivative of F with respect to J_j is the expectation value of spin j, and the derivative of F with respect to β is the expectation of this quantity H_i. I haven't told you what H_i is: it's some function of the σ's, here a linear function, though it doesn't even have to be linear. We've done very abstract things with these objects, and now everything is concrete: there are spins, there are parameters coupling to the spins, and we can derive our usual moments and expectation values.
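(Again as a reading aid: a plausible reconstruction of the generating function being described, an empirical sum over the L observed time steps with J and β entering linearly, and H_i(t) = Σ_j W_ij σ_j(t) in the linear case.)

```latex
% Moment generating function over the observed configurations (reconstruction):
Z_i(J,\beta) = \sum_{t=1}^{L} \exp\!\Big( \sum_j J_j\,\sigma_j(t) + \beta\,H_i(t) \Big),
\qquad
F_i(J,\beta) = \log Z_i(J,\beta),

% both J and beta appear linearly in the exponent, so F_i is convex, and
\frac{\partial F_i}{\partial J_j} = \langle \sigma_j \rangle \equiv M_j,
\qquad
\frac{\partial F_i}{\partial \beta} = \langle H_i \rangle .
```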
One thing I do want you to remember: we are never assuming there's some sort of equilibrium distribution. I said nothing whatsoever about a probability distribution for the values σ can take. We're just summing over observed configurations with this weight I put in, a weight that's linear in the parameters J and β. I didn't assume anything was in equilibrium, or that the spins evolve in some reversible or irreversible way, nothing at all. We just have some observations.

Now, the free energy and the partition function I wrote down are very explicitly differentiable, so I can use the simpler form of the Legendre-Fenchel transform. Nothing here can produce a non-differentiable point: every data set we ever observe is finite, and no non-analyticity is possible in a sum over a finite number of terms. All sharp phase transitions in physics happen in the infinite-volume limit; there are no sharp phase transitions at finite volume, because there can be no non-analyticity with a finite number of terms. So we can differentiate to our heart's content, and we can define the dual free energy G_i, the Legendre transform of the original free energy F_i; both F_i and G_i are convex functions.

Why do we want to do this transformation? The point is that J is a parameter, not an observable, but the expectation value is observable. So we're going from J, a parameter we're trying to determine, to M, an expectation value we can actually measure. That's the reason for shifting from J to M. It's just like a spin system in a magnetic field: you can either talk about the applied field or about the magnetization; you can either talk about the energy or about the temperature. They're dual variables, and in this case, for our convenience, M is more convenient than J.

Note that if I set β = 0, then G_i is just the negative of the entropy, so minimizing the free energy G_i is exactly the same as maximizing the entropy S_i. The Legendre duality gives dG_i/dM_j = J_j, just as dF_i/dJ_j was M_j, and dG_i/dβ = −dF_i/dβ = −⟨H_i⟩. Where does that come from? Just from the fact that G + F = J·M: differentiate that with respect to β and you get dG_i/dβ = −dF_i/dβ. And a little notation: after the Legendre transform, J is a function of M, so when we write the expectation ⟨H_i⟩ it's a function of J, which is itself a function of M.
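(A reconstruction of the dual pair being used, in the differentiable case:)

```latex
% Legendre(-Fenchel) dual of F_i, differentiable case (reconstruction):
G_i(M,\beta) = J\cdot M - F_i(J,\beta),
\qquad J = J(M,\beta) \ \text{solving}\ \partial F_i/\partial J = M,

\frac{\partial G_i}{\partial M_j} = J_j,
\qquad
\frac{\partial G_i}{\partial \beta} = -\frac{\partial F_i}{\partial \beta}
  = -\langle H_i \rangle .

% At beta = 0 the weight is uniform over the observed configurations and,
% as stated in the lecture, G_i reduces to minus the entropy: minimizing
% G_i is the same as maximizing S_i.
```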
Now suppose we expand G in a Taylor series about its minimum at β = 0; in other words, we have some observed expectation value at β = 0 and we expand around it. As I said, everything is analytic and perfectly differentiable, so we can do that. There's the Taylor expansion; M* is just the expectation value of σ at β = 0, and we expand to second order. Where's the linear term? It vanishes; it vanishes because we're expanding about the minimum. If I take this expansion and differentiate with respect to β term by term (anything with a star means it's evaluated at M*), the derivative acts on these terms and produces the variation of M* with respect to β.

What's the big picture here? We're going to differentiate with respect to M and with respect to β. We have the quadratic Taylor expansion on one side and just the definition of G on the other. Those mixed derivatives commute, so we evaluate them in two different ways and set them equal. That's really the whole idea; the only assumption is that the derivatives commute, which is a very innocent assumption.

So: if we look at this term and differentiate with respect to β, dM/dβ is the connected correlation function. Remember, we talked about connected correlation functions: if I differentiate with respect to parameters that appear linearly in the free energy, I get connected correlation functions; we discussed this ad nauseam. And the second derivative, I'll remind you, is the inverse of the covariance matrix: I emphasized from the very beginning that the second derivative of the Legendre dual is the inverse of the second derivative of the original function. This thing is (d²F/dJ²)^{−1}.

Now, if I match terms on both sides, I get exactly the intuitive equation I wanted to derive: an explicit update rule for how the parameters W get updated, based purely on the minimization of the free energy. It's a little bit of algebra (you'll get the lecture notes), but I want you to understand the big picture of why we're doing what we're doing: we matched d/dβ of dG/dM against d/dM of dG/dβ, with the Taylor expansion on one side and the definition on the other; that was it. Did we maximize anything? No, we didn't maximize anything; all we did was look at the minimization of the free energy. And I said H could be any function: you can take H with linear terms in the spins, with quadratic terms, and so on, as far as you want; this systematically gives you higher and higher order contributions to the update δH.
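(The matching step, as I reconstruct it from the description; the final line is the update rule referred to throughout, with H_i the rescaled local field introduced earlier.)

```latex
% The two evaluations of the commuting mixed derivative (reconstruction):
\frac{\partial M_j}{\partial \beta}
  = \frac{\partial^2 F_i}{\partial J_j\,\partial \beta}
  = \langle \delta\sigma_j\,\delta H_i \rangle
  \quad\text{(connected correlation)},
\qquad
\frac{\partial^2 G_i}{\partial M\,\partial M}
  = \Big(\frac{\partial^2 F_i}{\partial J\,\partial J}\Big)^{-1}
  = C^{-1},
\quad C_{jk} = \langle \delta\sigma_j\,\delta\sigma_k \rangle .

% Matching these against the quadratic Taylor expansion of G_i yields the
% explicit update for row i of the couplings:
W_{ij} \;\leftarrow\; \sum_k \langle \delta H_i\,\delta\sigma_k \rangle\,(C^{-1})_{kj} .
```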
So what does this give us? I showed you how to do the iteration, but with any finite data set you have to know when to stop. In all of data science you do not want to overfit, so usually what people do is split the data into a training set, a testing set, a validation set, and whatnot. But I didn't show you any place where I minimized some sort of discrepancy or cost function; you didn't see me minimize anything, and I didn't do any train/test splits. So how do I know when to stop this iterative update of W_ij?

Look, what we want is to make our prediction, the expectation value, as close to the observed value as possible; that's the whole name of this game: minimize the discrepancy between the observed value and our expectation value. Because σ_i² is always 1 (σ_i is +1 or −1), we can write the discrepancy as D_i = Σ_t (1 − ⟨σ_i(t+1)⟩/σ_i(t+1))², and as we discussed, the ratio in the second term always has magnitude at most 1, so every term in this discrepancy is non-negative. So what we do is monitor D_i as we iterate, and stop the iteration when D_i starts to increase. And this is quite independent of the actual iteration we do: I never differentiated this quantity and said I'm going to minimize it. It's a completely independent thought that there is this discrepancy we'd like to make as small as possible; we never, ever used it to derive how we update W. Isn't that magic? I just want to be absolutely clear: we did not use this in the updating.

So now let me show you what happens. This is the Sherrington-Kirkpatrick model: we set up a synthetic data set where the W_ij have zero mean and variance g²/N, where N is the number of spins. Suppose we have lots of data, in other words the number of time points is on the order of N². Where does N² come from, and why is it important? Because you're trying to determine a whole matrix W_ij, which has N² independent entries; so if you observe on the order of N² time steps, that should be sufficient data to determine the whole matrix. But if you observe only some fraction of that number of time steps, it isn't really enough data to determine the matrix very well.

What you observe is: when there is lots of data, our iteration gets the values of the W_ij pretty well, but you'll notice that once you get to a certain iteration, you can keep iterating and the mean square error isn't getting any lower; the other curve, the light one, is the discrepancy. On the other hand, when there isn't enough data, the discrepancy and the MSE actually start to increase as you overfit: as you overfit, the accuracy of your predictions gets worse. So we really need some way to know when to stop. The mean square error would be nice, but we don't know the true couplings.
We don't know the real answer, so we can't use the mean square error to decide when to stop iterating. However, we do know the discrepancy: we observe the spins and we have our predictions, so we can calculate D and stop the iteration when D reaches its minimum. And this is what you get if you stop when D reaches its minimum.

Here I'm comparing five methods for determining the W_ij. Our method, FEM (free energy minimization), is in black; maximum likelihood is in red; exact mean field is in green; the Thouless-Anderson-Palmer equations are in blue; and naive mean field is in sort of pinkish. You'll see that as you go to very little data (the x-axis is L/N²), our method does better than the other methods we checked. Why do I keep harping on very little data? Because I work in biology. Biological systems are extremely complicated, and I don't care what anyone says about big data: there is never enough data to fit the complexity of the models. So our effort is to do the best we can with the inadequate data we have, and it's important for us not to waste any of it by separating the data into a testing set, a validation set, and a training set. It is actually important to use all the data. That's the reason I keep harping on how to stop the iteration without using any of the data for validation or testing.

So, to summarize the algorithm (there's a code sketch below): this is our local field; I told you only about a linear dependence on the previous time step. You initialize with a random W; you update the local field; the exact W_ij update is the inverse correlation function times this expectation value, which you now know how to compute; and you repeat these steps until the discrepancy starts to increase. You do this independently for every spin.

So that's the algorithm. But that was the Sherrington-Kirkpatrick model, which is convenient because you can do all sorts of nice expectation values with Gaussian random variables, especially random matrices with Gaussian entries. The problem is that real data rarely has distributions that look like that. I was asked about the asymptotics of the algorithm: the problem is I never get to see the asymptotics, because there's always a finite amount of data, and I don't know the coupling distributions, so I can't derive nice central limit theorems for how the couplings behave in the asymptotic regime of infinite data.
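For concreteness, here is a minimal sketch of the per-spin loop just summarized. This is my reconstruction from the description, not the authors' code; the function name and defaults are mine.

```python
# FEM-style inference sketch: rescale the local field multiplicatively toward
# the observed spin, refit the coupling row via the inverse covariance, and
# stop each spin's iteration when its discrepancy D_i starts to increase.
import numpy as np

def fem_infer(sigma, max_iter=100, seed=0):
    """sigma: (T, N) array of +-1 observations. Returns an estimated W (N, N)."""
    rng = np.random.default_rng(seed)
    T, N = sigma.shape
    s_prev, s_next = sigma[:-1], sigma[1:]           # σ(t) and σ(t+1)
    ds = s_prev - s_prev.mean(axis=0)                # δσ(t)
    C_inv = np.linalg.inv(ds.T @ ds / len(ds))       # inverse equal-time covariance
    W = np.zeros((N, N))
    for i in range(N):                               # each spin independently
        w = rng.normal(scale=1.0 / np.sqrt(N), size=N)   # random initialization
        d_best = np.inf
        for _ in range(max_iter):
            H = s_prev @ w                           # local field H_i(t)
            model = np.tanh(H)                       # model <σ_i(t+1)>
            D = np.mean((1.0 - model * s_next[:, i]) ** 2)  # discrepancy D_i
            if D >= d_best:                          # stop when D starts to rise
                break
            d_best = D
            W[i] = w                                 # keep the best row so far
            safe = np.where(np.abs(model) < 1e-6,    # guard against tanh(H) near 0
                            np.sign(model + 1e-12) * 1e-6, model)
            H_new = H * s_next[:, i] / safe          # multiplicative rescale of H
            dH = H_new - H_new.mean()
            w = C_inv @ (ds.T @ dH / len(ds))        # refit row i: C^-1 <δσ δH>
    return W
```

Note that the discrepancy is only monitored, never differentiated: it plays no role in the update itself, exactly as described above.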
For example, suppose we generate a synthetic coupling matrix whose histogram of couplings is very far from Gaussian (it's these spikes, with the couplings becoming weaker as you head out) and simulate it as an Ising model; this is the raster of time series you get. Nothing very predictable here. So our question is: if I give you some small amount of this data, how well can you infer the correct W_ij? These are three runs with more and more data, and you'll see that from this kind of time series you can still infer the coupling matrix pretty well.

Now this one is even more interesting. There's a baby food company called Gerber, and this is the photograph of the first child with autism whose photograph was used as the Gerber Baby of the year. We took that photograph and converted it to a coupling matrix W_ij; the histogram of couplings looks like this. You can simulate it, and this is what the raster plot looks like. Now I give you some little bit of this raster and ask: how well can you reconstruct the couplings? And this is the answer: as you get more and more data, you can reproduce the photograph better and better.

(To the question:) Sure. The Sherrington-Kirkpatrick model assumes the W_ij are normally distributed random variables with variance g²/N. I didn't want to use that kind of really nice coupling matrix, so I said: let's come up with weird couplings. What should W_ij be? Basically, pick a random matrix, but not too random, because actual data in the real world has pretty strong correlations. So I'm showing the histogram of coupling strengths (how many entries have each coupling strength) and the coupling matrix represented as a picture. Then I simulated it as a stochastic Ising model, and this is the kind of time series you get. Now I give you some portion of this time series and ask what the coupling matrix is; the whole game is how close we can get to the actual coupling matrix. And I'm just telling you: as you get more and more data, you come pretty close, and all you're observing is time series like this. For the second example, all we did was take the photograph and use the pixel grayscale values (0 to 255) as the coupling strengths; that's the time series we generated, and from this time series we're trying to reproduce the photograph. That's all.

As I said, H doesn't have to be linear in the σ's: you can have quadratic and higher terms, and this is just showing you that if you have linear and quadratic terms, you can still reproduce them.

Now, since we want to do big data, the question is how efficient this is, because somewhere there has to be a catch. But here's the fun thing. Look at the update I showed you: it's a multiplicative update. It's not "ΔW equals this"; I'm telling you W_new is this factor times the old W. So it's not taking a little step; it's not doing some kind of gradient descent trying to minimize anything, because first of all we never wrote a cost function to minimize. If you imagine some sort of valley you're trying to descend into, this update isn't creeping down the slope; it's bouncing around.
And there are benefits to this bouncing around: if you did maximum likelihood and tried to minimize, you would need many more iterations than our update takes. The computation time for maximum likelihood estimation is roughly a hundred times bigger than for our multiplicative update.

All right. I like real data, and so far I was showing you fake data: I simulated a time series and reproduced the couplings, so what? (To the question: yes, I agree with you; the important point is that I never actually minimized a cost function.)

So here's real data. People took a salamander and measured, with electrodes, the spiking of neurons in the salamander retina while the retina was shown a movie of a fish. Salamanders eat small fish, so the retina was getting excited; this is the neuronal spiking in the salamander retina. We inferred the network of interactions among the neurons, and this is the predicted neuronal spiking versus the actual neuronal spiking. I don't want to go into too much detail, but this is the inference accuracy; and this is the inference accuracy for the large-W_ij-based spiking, meaning you look only at the large coupling strengths.

More data, a more fun data set if you like: currency fluctuations, which are kind of hard to model, and there's a lot of money to be made with them. We took something like 11 currencies and got data from, I believe, the Bank of Italy, whose website had data about currency rates. We inferred the interaction matrix in different time periods. And what you'll observe is kind of interesting; this is why you want to be able to do things with very little data. If you use all the data for the 17 years 2000 to 2017, these are the correlations you get from the actual data; but if you look at two-year chunks, you notice much stronger correlations. So if you can infer couplings with very little data, that's actually more useful, because the couplings themselves could be modulated on longer timescales; economies go up and economies go down. It's important to get bigger correlations out of small amounts of data. And to show you, this is what the data looks like if you binarize it; binarize just means you convert it to ±1 depending on whether the currency went up or down.

Then Tai proceeded to ask: can you make a profit with our predictions? These are plots of cumulative profit. First, the black curve is just trading every day, depending on whether your signal said to buy or sell each currency. That was the cumulative profit. Then he went further.
The discrepancy we're calculating actually tells us whether we're doing a good job fitting. So he set up a trading strategy where you only make a trade if the discrepancy measure says you're fitting well. And it's kind of cool: if you do that, there are smaller fluctuations. You end up at roughly the same spot, but without the big drawdowns; it's a little more monotonic when you use the information about whether you're fitting well to decide whether you're going to trade. And the profit per transaction is higher if you use the discrepancy measure to tell you when your model might be working correctly. So that's that.

Any questions before I proceed? (Yes: the currencies are actually continuous numbers, but we converted them: did the currency go up or down relative to a base currency, I forget which one we took, the euro or the dollar. So the whole continuous fluctuation gets converted to a binary variable, ±1.)

Okay, now we want to do hidden variables. Why hidden variables? Because, as I said, I work in biology, and in no biological system, living or even dead, can you actually observe everything that's going on. So it's very important to know whether there are hidden variables affecting the dynamics you're actually observing. And here's the issue with hidden variable models: if you allow me any set of hidden variables with any type of interactions among them, then I can fit anything as well as I want. It's not a well-defined problem if you say "there are hidden variables, use any interactions you want." So in any hidden variable model you have to have some restriction on what sorts of interactions are allowed, and then you can ask how many hidden variables you need. There are two separate issues here: (a) you have to restrict the form of the interactions the hidden variables have, and (b) you have to determine how many hidden variables you need. It's totally useless to say "I'm going to allow as many hidden variables as I want"; that's not a predictive model. So these are the two things we have to do: find the interactions of the observed and the hidden variables, and find the number of hidden variables.

So what we did was use the same algorithm I described to make synthetic data, again from the Sherrington-Kirkpatrick model, and we asked: suppose you observe only 60% of the spins; what's the error in your predictions? You can see there's quite a bit of error if you only observe 60%: this is the predicted coupling matrix and this is its error; this is the actual coupling matrix, and this is the raster simulated from it. On the other hand, if you use hidden variables, this is the predicted coupling matrix and this is the error: with hidden variables you really make much better predictions. But there are all sorts of subtleties, so I want to go over what you have to do to get this kind of result.
This is just showing, for naive mean field, the Thouless-Anderson-Palmer equations, exact mean field, maximum likelihood, and free energy minimization, how well you reproduce the observed-to-observed couplings, hidden-to-observed couplings, observed-to-hidden couplings, and hidden-to-hidden couplings. All of these have to be inferred.

(To the question:) No, I'm going to show you how to determine the number of hidden variables. As I said, that's essential: if someone says they can solve hidden variables but cannot tell you how many hidden variables, that's not useful. The hard part here was exactly figuring out how to determine the number of hidden variables; that's the hundred-dollar question, or the thousand-dollar question.

So we recover the predicted interactions versus the actual interactions reasonably well. And note the structure: this is an expectation-maximization algorithm, with two steps. The maximization step is where we use free energy minimization; the expectation step is the same as in any other EM algorithm. The point is this: if you can find the couplings faster, that's when you can afford a hidden variable model, because the hidden variables require a lot more computation. You never observed them, so you have to sample them in the expectation step; there's a lot of sampling involved. So it's very important here to be efficient if you want to determine the hidden variables.

Now, the question of how many hidden variables. This slide is just a summary of the algorithm; let me tell you the answer and then try to motivate it. Remember we defined a discrepancy D: basically the squared deviation of our prediction from the actual observed value, summed over time. When you have hidden variables, what's their discrepancy? You don't know what the hidden variable states were, so you can't use the discrepancy of the hidden variables to decide how many to use, or where to stop the iteration. So the measure we came up with is: calculate the discrepancy of the visible variables, then scale it up to take into account how many hidden variables there were. The factor is (N_h + N_v)/N_v: in the numerator, hidden plus observed, the total number of variables; in the denominator, the visible variables. The intuition (there's no derivation I can give you) is simply that the discrepancy in the hidden variables cannot be less than the discrepancy in the visible variables. Does that make sense? If there's uncertainty in what you observed, then the inferred hidden variables can't carry less uncertainty; that's just error propagation, if you like. So this is the measure we used, and the question is whether minimizing this scaled discrepancy picks out the right number of hidden variables (a sketch of the selection loop follows below).
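A sketch of that selection rule. Here `fit_with_hidden` is a hypothetical stand-in for the full EM fit, assumed to return the discrepancy computed on the visible spins only:

```python
# Choose the number of hidden variables by minimizing the scaled discrepancy
# D_scaled = D_obs * (N_h + N_v) / N_v over candidate values of N_h.
def choose_n_hidden(sigma_obs, candidates, fit_with_hidden):
    """sigma_obs: (T, N_vis) array of observed +-1 spins."""
    n_vis = sigma_obs.shape[1]
    scores = {}
    for n_h in candidates:
        d_obs = fit_with_hidden(sigma_obs, n_h)      # D on visible spins only
        scores[n_h] = d_obs * (n_h + n_vis) / n_vis  # scale up for hidden spins
    return min(scores, key=scores.get)               # argmin over N_h
```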
So: the mean square error, which, remember, you never have for real data, but this is synthetic data, with 10, 20, 30, 40 hidden variables. The red line is the scaled discrepancy of the observed data, and you can see that it picks out the right number of hidden variables if you look at the minimum of the scaled discrepancy. So that's the whole algorithm: use the scaled discrepancy to figure out the right number of hidden variables. Now, I have to tell you that this is synthetic data, where the hidden variable number is very clear. The reason for working with real data is that, as I'll show you, it's not that clear there: you can still see the measure go down, and there's roughly a range of hidden variables that you need, but it won't be quite as clean as this.

By this time Tai had moved on to stock market data; I guess currencies didn't hold his interest that much. He picked 25 large US companies and got data for their stock price fluctuations; this is the raster of what it looks like. Again he binarized, according to whether the stock price increased or decreased; that's all he's trying to predict: is the stock price going to increase or decrease? We don't know whether there are hidden variables in this data. He computed the W_ij and so on, and the number of hidden variables comes out to be between five and six, something like that; in this other case it comes out to be about four. Notice these are different time intervals: this is 2014 to 2016, this is 2016 to 2018. Here are the predicted covariances versus the actual covariances. Then he did simulated trading: with hidden variables, without hidden variables, and with no inference at all, that is, random trades. You have to have some baseline; you cannot just say "oh, I did great." And look: even the no-inference trading does great. Why? Because there are certain periods where the whole stock market is going up. It's not that you've become smart; the whole stock market is going up, so you can do all the trades you want and you'll make money. You're not a genius; the whole stock market is going up. Even without any inference, there were periods between 2000 and 2018 where you made money. But if you make a model with the hidden variables inferred, it does a little bit better. And again, he used the discrepancy to come up with a strategy for when to trade. That's important: you should know when your model fits, so you know whether you can trust the prediction. This is just showing trading on selected days versus trading every day, profit per transaction, and so on.

Salamanders and fish again. In this case he took some of the neurons and hid them: this is real data, so you can take some of the variables, say "suppose I assume I didn't observe them," and then try to reproduce what you hid.
And again, he could show you can determine the number: he took four strongly connected neurons and kept them aside, so that the algorithm didn't know what they were, and you can roughly predict how many of the strong variables there were. If you include them, you get a slightly better prediction than if you didn't; these are the four he left out.

This next part I'll go over quickly. You can't have anyone talk about data without showing you MNIST; there's a law or something. What we're trying to do is take the MNIST digits and ask: are there hidden variables in predicting the pixels? This is not time series data; it's about the interactions between the pixels that make up the digits. In this case he used something not quite in the same framework as before: each digit determines the state of some number of hidden variables, but the hidden variables are constrained so that exactly one of them is 1 and the rest are 0. So this becomes a classification or clustering problem, a framework where only one hidden variable is active at any time. In that case, the discrepancy came up with a number of hidden variables on the order of 60. Now that's interesting, because if the hidden variables are constrained to have only one active at a time, that's basically a clustering problem, and what he found was that the number of clusters predicted was about 60. You could say: what do you mean, 60 clusters? There are only 10 different types of digits. But it turns out, and I have no explanation for this, that most of the digits occurred in about six clusters each. So there are your 60 clusters, and you'll see slight differences between the shapes of the digits in each cluster.

Okay, the last topic I want to talk about. The hidden variables so far were of this type: these are your observed variables, these are your hidden variables. You saw all of the former and none of the latter. What I'm going to talk about now is: what if there are missing values? Suppose you don't know certain entries, and they're different for different time steps; how many are missing differs from step to step. So it's not that there's a block of hidden variables you never see; there are scattered missing values. This happens a lot in clinical data, where a patient didn't show up for a test, or the test didn't go right, so there are missing values. The question is: can you fill in these missing values? It's a hidden variable problem, but much more irregular than the hidden variable problem we were just talking about.

We still use the EM algorithm, and in this case I will explain how we update, but we do need a different stopping criterion, because there's no simple scaling argument.
Given how the missing values are randomly distributed through the data, you can't say "there are this many hidden variables, so I'm going to scale up the discrepancy in this way." There is no fixed number of hidden variables; the data just has missing values.

The key point in Sanghwan Lee and Professor Jo's work with me is that the discrepancy on the missing data should be equal to the discrepancy on the observed data. Why is that? Because when you're filling in unknown values, you can't fill them in any better than you're predicting the observed values. It's again a sort of uncertainty propagation: there's no way you can predict what you didn't see better than you predict what you did see. I'll show you how that tells you when to stop iterating.

Now let me explain the update. We solve this linear equation as a logistic regression, which maximizes the total likelihood; as was pointed out earlier, if you solve it directly, you are maximizing the total likelihood. So for missing data the stochastic update is: first assign random values to the missing entries; then find W_ij; then stochastically update the missing values. This is the part that's somewhat tedious, done entry by entry: you ask what happens to the likelihood if you switch the sign of one of the imputed values (does the likelihood go up or down?) and you stochastically reassign the value you guessed according to the ratio of the probabilities of the positive and negative signs.

Here's the point, because this might be confusing. Take these spins, and suppose this one was missing. I don't know its value, but I've assigned it some value, done the iteration, and I have a new coupling matrix W_ij. Now I ask: if I flip that missing value, does the likelihood of the step after and the step before, the transitions into and out of that time step, increase or decrease? You look at this ratio of likelihoods: if it's high, you have a higher chance of updating to that value; if it's low, a higher chance of updating to the opposite value. So it's a stochastic update, based only on the step before and the step after the given time step.
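A minimal sketch of one such stochastic imputation step, as I reconstruct it. Under the kinetic model, only the transition into time t and the transitions out of it depend on the entry σ_j(t); the function name is mine.

```python
# Gibbs-style resampling of a single missing entry σ_j(t) from the ratio of
# likelihoods with the entry set to +1 versus -1.
import numpy as np

def resample_missing(sigma, W, t, j, rng):
    """sigma: (T, N) array of +-1 with a current guess at entry (t, j)."""
    def loglik(s):
        sig_t = sigma[t].copy()
        sig_t[j] = s
        ll = 0.0
        if t > 0:                                # transition (t-1) -> t, spin j only
            h = W[j] @ sigma[t - 1]
            ll += s * h - np.logaddexp(h, -h)    # logaddexp(h, -h) = log 2cosh(h)
        if t + 1 < len(sigma):                   # transitions t -> (t+1), all spins
            h = W @ sig_t
            ll += np.sum(sigma[t + 1] * h - np.logaddexp(h, -h))
        return ll
    lp, lm = loglik(+1.0), loglik(-1.0)
    p_plus = 1.0 / (1.0 + np.exp(lm - lp))       # from the likelihood ratio
    sigma[t, j] = 1.0 if rng.random() < p_plus else -1.0
```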
So, for synthetic data: here are the discrepancy on the observed data and the discrepancy on the missing data, and you'll notice the number of iterations where they cross is five, which is also roughly where the mean square error in the predictions is lowest. This is when 30% of the data is missing, this when 50%, this when 70%. And notice that if I keep iterating beyond the correct number of iterations, I end up overfitting: in this case you can see we should have stopped at around 20 iterations, roughly where the observed and missing discrepancies crossed, but if I keep going I end up here, having clearly overfitted the data. The couplings have become stronger than they need to be. So it's very important to know when to stop.

Comparing mean-field expectation maximization with the stochastic update I described, an interesting thing to look at is how well you predict the correlation that's actually observed, the correlation between the spin at time t+1 and the spin at time t; that's something you can measure in the data set. Inferred W_ij versus true W_ij is a nice theoretical measure, but you can never observe it: you don't know the true W_ij. That plot exists only for synthetic data; for real data you can never draw it. But you can observe D_ij, and this shows that restored D_ij versus true D_ij is pretty good. As you'll see, when 70% of the data is missing, there's a lot more uncertainty in how you restore the data. This is a restoration accuracy measure, mean-field EM versus our stochastic update, and this is the slope of basically the same thing as you get more and more missing data; obviously, if 80% of the data is missing, nobody is going to do very well.

You can do the same thing for very strong and very weak couplings; let's just go over it quickly. It's a measure showing that if you have very weak couplings, it's almost impossible to fit, and if you have very strong couplings, it takes a lot of iterations and it's not so clear where the observed and missing discrepancies cross. This is all synthetic data, so don't take too seriously how well or badly you do.

It is important to always worry about how much data you needed. Remember, L is the number of time steps you observed and N is the number of spins, so L/N² is a good measure of how large the data is relative to the complexity of the model. Notice that our criterion stopped the iteration at about six iterations, but the lowest RMSE was down here, at about three; and the mean square error is not something you can get for real data. So there may be a better way to stop the iterations. We don't know what it is, but ideally, for very little data, we'd want to stop around three, and our measure stops around six. When there's lots of data, anybody can do it: you don't really worry about the number of iterations, because once you get past ten or twelve iterations, it's not going to get any worse. Really the only check of an algorithm is what happens with very little data, in hard inference circumstances.

The last thing I'll show you is neuronal spiking; this is now real data again. The collective statistic is P(K), the probability of K simultaneous spikes: you have N neurons spiking, and for each time step you count how many of them spiked together at the same time. You compute K(t) for every time step and histogram it to get P(K).
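(A small sketch, mine, of that collective statistic:)

```python
# Compute the empirical P(K) of simultaneous spikes from a (T, N) raster
# of +-1 values, where +1 marks a spike.
import numpy as np

def p_of_k(raster):
    k = (raster == 1).sum(axis=1)                    # K(t): spikes at time t
    counts = np.bincount(k, minlength=raster.shape[1] + 1)
    return counts / counts.sum()                     # empirical P(K)
```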
So: there's the original data P(K), the black crosses; there's our algorithm, the blue circles; there's EQ-EM, which stands for equilibrium expectation maximization; there's mean field; and there's just taking the frequency. What you'll see is that the collective statistic is best reproduced by our stochastic update. That's pretty much it. What's also interesting: there's the actual time series; there's the masked time series, where the blue indicates what was not observed; and there's the restored time series.

Here's an interesting thing about restoration accuracy. Guess what does best on restoration accuracy? This frequency thing. What is "frequency"? Just: look at a neuron, see how frequently it spikes, take its most frequent value, and fill in all the missing values with that most frequent value. That has the highest restoration accuracy. Why? Because a lot of the neurons are silent most of the time. But that same frequency fill-in does terribly on the collective statistic, and that's natural, because collective statistics are not just a question of whether each neuron is usually silent; they're about whether the neurons spike at the same time.

And that is it for today. Questions?

(On whether this is really statistical physics:) As we discussed after the first lecture, I'm just using the mathematical framework of statistical physics. I never, ever used anything to do with equilibrium. There's not a single equation I wrote down that assumed anything other than convexity and natural parameters for the operators I introduced in the exponential. Never will you see anything on my slides that uses anything other than the mathematical framework of statistical physics. When I say "the mathematical framework," I'm talking exactly about a sum of exponentials with linear parameters multiplying operators; that sum has all the properties I need, convexity in particular, and that's pretty much all that's used: natural parameters in an exponential sum. When you say "equilibrium," you have in mind something dynamical. But, unlike all the stuff you heard from Professor Lee, for instance, who was actually using the time dependence of what was observed, I didn't use any of that, not anywhere. I can go over all the equations with you: never do I use anything to do with the concepts of equilibrium statistical physics. The formal free energy I defined is a convex function; it's analytic in this particular case, so it has a minimum, and I expand in a Taylor series about that minimum. Why shouldn't I? This is a good question, and I want you to understand: I'm a physicist, and I assure you this has absolutely nothing to do with statistical physics other than the mathematical framework. You can call it whatever you want. I could call it widget minimization: as long as the widgets have the mathematical properties I want, I could define the widget function and a dual widget function given by the Legendre transform of the widget function.
Everything would still be true. You have this cognitive dissonance because you know statistical physics and you think "he's doing garbage, this is not statistical physics." I totally understand your point of view; I'm just trying to tell you that all we used was the mathematics, and not any of the physics. Yes, that's absolutely right. Why did we come at it this way? Because Professor Jo is a physicist and I'm a physicist; we used to work in physics departments. To be honest, I learned about large deviations much later, after we had already done this work, and then I said: oh look, large deviations, they're doing the same things. Because I'm old, I actually learned this material from reading Julian Schwinger's papers. Schwinger basically invented much of modern quantum field theory in the 1940s, and he introduced all these concepts: introduce a source, notice what happens when you do a duality transformation on the source, and so on. So I came at it from the point of view of quantum field theory, and only later discovered that large deviations provides the mathematical justification for all of it. I totally understand the confusion; maybe I should start calling it the widget function.

(On slowly varying couplings:) If I understand the question correctly: if the couplings are being modulated on a longer time scale, then being able to infer things reliably from little data on short intervals lets you learn the modulation of the couplings. But if the couplings themselves are changing on the time scale of the data you need to infer them, then I'm not sure you can do it. Did I understand your question right? Yes, and I would say the same for the currencies. These are macroeconomic trends: different countries have different levels of inflation and other things, and it takes a while for that to percolate. Trade relations between countries don't change overnight; they change over the course of years, with geopolitical events and whatnot. So if you're able to infer with small amounts of data, you can update your model more frequently. That's the take-home message: being able to do things with little data is very useful.

(On equilibrium again:) No, in fact, maybe I should have emphasized that these are all asymmetric couplings. This is very far from equilibrium, because if you had equilibrium, W_ij would have to be symmetric. In fact it's a very interesting thing that these W_ij come out asymmetric: never did I write down an Ising model with symmetric interactions. Good question.

Any other questions? Yes, please. You mean: what if it wasn't a time series? Is that the question?
I talked mostly about time series, but we've done this kind of inference without time too; not in this set of papers, but my postdoc and I did it for protein sequences, for instance. There's no time there: you just have observed protein sequences, and you're trying to predict which residue at which position is interacting with which other residue, in an alignment of different sequences from the same family of proteins. Is that what you have in mind?

(On whether the hidden variables were actually used:) What I was trying to show was that the stock market trading used the hidden variables we inferred; that's, if you like, a restored time series with the hidden variables. You get much better predictions if you take the hidden variables into account. And that was the P(K) histogram I showed: you get much better predictions if you figure out what the missing values in the neuronal spiking were. Yes, you get much better accuracy in prediction.

(On time resolution:) That's a very good question. What we did is bin the spikes: we take a time interval, and if there was a spike in that bin (in this five seconds, say) we call it +1; if there were no spikes, we call it −1. No, there's no good biological reason for that at all, because there's no way your neurons are marching along step by step by step. It's a practical matter, because with the way we're updating, we're not making independent stochastic simulations of each neuron in continuous time. And in any case, we've been doing work with MEG data, which has much finer timescale resolution: MEG data of the human resting-state brain. Even there, the experimental resolution is something like 2000 Hz; it's still finite. That so-called continuous MEG time series, on 168 or 240 sensors, is still handed to us as a bunch of numbers at specific times. This idea that we have continuous data: any data you're given to analyze is never continuous. So the binning isn't something I imposed; the data is given to me as numbers at specific times.

(On MNIST:) For the MNIST data there's no time series, so there we actually assumed W was symmetric; that's what we did.

(On continuous values:) For instance, in the MEG data we're modeling, which I'm not going to talk about, we are actually combining this kind of binary variation with figuring out the magnitude. The intuition is this, and you should think about it a little. Suppose you have a stochastic time series. If you can predict when the derivative crosses zero, whether the derivative is positive or negative, if you can predict that discrete event with some confidence, then you probably understand the whole time series reasonably well. If you can predict stochastically whether the derivative is going to be positive or negative, you can probably predict the whole continuous time series reasonably well.
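(A sketch, mine, of the binarization described for the currency and MEG examples: keep only the sign of each increment.)

```python
# Convert a continuous series sampled at discrete times into +-1 values
# according to whether it went up or down at each step.
import numpy as np

def binarize(x):
    """x: 1-D array of a continuous signal. Returns +-1 per increment."""
    steps = np.sign(np.diff(x))
    steps[steps == 0] = 1.0        # tie-break flat steps (an arbitrary choice)
    return steps
```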
And that actually turns out to be true: even if you just try to predict the change in the sign of the derivative, you can end up figuring out the interactions that determine the whole thing. In fact, when we observed that we were predicting the currencies correctly, even though these are continuous variables and we were just binarizing them, that's what led me to think that predicting when the derivative changes sign might be enough information to predict the whole time series, and it turns out to be pretty close to the truth. There's still quite a bit more work to do; well, not me, but my postdocs. Any other questions? All right.