Okay, welcome everyone. It's a pleasure to have with us today Gašper Tkačik from IST Austria, who is going to deliver a set of three lectures mixing blackboard and slides. So please, Gašper. All right, thanks. Thanks for the invitation, Antonio, it's really a pleasure to be here. I like coming here — I'm originally from Slovenia, and I travelled down yesterday from Vienna on the train. I noticed that after years in which there was no really convenient connection between Vienna and Trieste, we are now back to the old monarchy line — exactly. I think on the Austrian stretch they made it a bit faster, but it still takes nine hours, so I was surprised that when we were already in Villa Opicina there was still an hour until arrival in Trieste: they need to reverse the train and do the whole manoeuvre. But it was very nice and romantic. All right. What I would like to spend these three lectures on is maximum entropy models. I'm a physicist by training, but I mainly work at the interface between physics and various types of problems in biology, which include neuroscience problems but also cellular signalling problems. The way I imagined delivering these three lectures is to combine, today, a bit of background on maximum entropy models on the blackboard, just to get everyone on the same page, and then to show applications of these ideas to data, on slides of course. I would also like to highlight certain connections; indeed, the main aim of my lectures, while introducing these models, is to highlight how this type of modelling — maximum entropy modelling — emerges in very different fields of science. Depending on your background you might recognize the following words. If you come more from the machine-learning side, we will be talking about a class of generative models, and you might recognize terms such as Boltzmann machines or restricted Boltzmann machines. If you come mainly from the statistics side, we are in some sense talking about distributions from the exponential family, for which there is a well-defined set of sufficient statistics. And if your background is physics or statistical physics, the Boltzmann distribution will obviously be one example; in the last 15 years or so there has been a resurgence of interest in maximum entropy models, and you might read papers about inverse statistical mechanics, the inverse Ising model, and so on. All of these concepts pertain to a common body of mathematics — or physics, if you want. So I would like to illustrate what we are talking about, and in particular focus on three types of uses of these models. They are related but not the same. The first use, which is what I'm going to talk about today, I'll illustrate on spiking data from the retina, and I'll describe what that is. So the first use — maybe I'll sketch this down; we're talking about maximum entropy models — is to infer distributions from limited data. Okay.
And again, I'll say more about this, but let me just put down the structure of the lectures. The second use, which I hope to get to tomorrow, will be to use maximum entropy models as a framework for building null models from data, against which you can do hypothesis testing — I'll try to illustrate what I mean by that. This is related to the first use, but in practice it is somewhat different: a framework for building null models for doing hypothesis tests. The third application is the most recent one: to construct what we call optimality priors for Bayesian inference. If you have dealt with Bayesian inference of, say, parameters from data, you know that you typically have to specify some prior over the parameters, and there has been a lot of discussion about what that prior should look like. This last bit will be our contribution: a rather non-trivial set of priors that we call optimality priors, and, as you will see, they have the form of maximum entropy distributions — that's the connection. These priors can be used to link — that's why my lectures have "optimality theories" in the title — a particular optimality theory that you have about a system with the finite amount of data you have collected about that system, and therefore to test that optimality theory. All right. To give credit where it's due: if we want to trace back the origin, or one of the origins, of these maximum entropy models, one looks back to the original work of Jaynes, which is by now quite far in the past — 1957, Physical Review, volume 106. The title is "Information theory and statistical mechanics"; if you haven't read it, it is really worth reading — it's a pretty short paper, and if nothing else, read the introduction. The question taken up there is whether one can think of statistical mechanics not as a particular theory derived from a mechanistic picture of a system, its states, microstates, and so on, but as a genuinely statistical theory about systems — a theory that you have to build with the limited amount of information you might have. That's the viewpoint, and I actually like to quote from the abstract because it's really great. It says — and that is how the paper begins — that information theory "provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum-entropy estimate. It is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information." And it goes on, already in the abstract, to say that it is actually possible to make a sharp distinction between viewing statistical mechanics as this statistical theory, which has nothing in particular to do with the mechanistic underpinnings, and the mechanistic aspect of statistical mechanics; the two of course connect very nicely and productively, but you can discuss them separately.
This is good to have in mind; that is the viewpoint we are going to take today. So let me now get a bit more detailed about maximum entropy models. We start with the first point: constructing probability distributions from limited data. To set it up, let me start with a set of samples that I might have about the system. I denote them x_1, x_2, ..., where the subscript indexes the samples, and suppose I have T of them, the number of samples in my data set. In principle these vectors can live in some high-dimensional continuous space — that brings up some difficulties I might discuss at the end — so for now let me limit myself to discrete samples. You can think of them as binary vectors, but they could also live in some non-binary, still discrete domain. Let me fix the notation: my samples x_t are discrete, and each component of x can take q levels; for q = 2 these would be binary vectors. Now, we are given these data, and what we want is to build a distribution p(x) — a generative distribution from which I can imagine the samples I observed to have been drawn. I want a generative model because, once I construct such a distribution, I can draw new samples that would be, if you want, similar: representative draws from this distribution p(x). There is, of course, a regime of this problem which is trivial: if my number of samples is so large that I can simply oversample the distribution and count — these are discrete objects — that is, if the number of samples is much larger than roughly q^n, then I can just build the empirical estimate p̂(x), which is simple counting: p̂(x) = (1/T) Σ_{t=1}^{T} 1[x_t = x], one over T times a sum over all samples of an indicator function. But this is usually not the regime we are in. We are in the regime where we do not have enough samples to just count, so we need to come up with something else; usually we are in the curse-of-dimensionality regime. (A tiny numerical sketch of this counting estimate, and of how quickly it breaks down, is shown below.)
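A minimal sketch — my own illustration, not from the lecture — of the "just count" estimate p̂(x) = (1/T) Σ_t 1[x_t = x], and of why it breaks down: the number of distinct binary patterns grows as 2^n while T stays fixed. All numbers here are arbitrary.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
T, n = 1000, 10                       # T samples of n binary units
X = (rng.random((T, n)) < 0.3).astype(int)

counts = Counter(map(tuple, X))       # empirical pattern counts
p_hat = {pattern: c / T for pattern, c in counts.items()}

print(f"{len(p_hat)} distinct patterns seen out of 2**{n} = {2**n} possible")
# For n = 20 the same T would cover a vanishing fraction of the 2**20 states.
```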
I'll show you examples from neural recordings, just to have a sense of the numbers. Think of each sample as a snapshot of activity: I can record from n neurons in the brain or in the retina, and in each little slice of time each neuron either spikes or does not, so there is a binary representation of their activity. If I record from only 10 neurons, their joint activity patterns live in a space of 2^10, about a thousand states, and in a typical neuroscience experiment I can actually collect many more than a thousand samples, so in that regime we are fine. But obviously we now record from many more than 10 neurons: as soon as you record from 20 you would need many more than a million samples, and we now routinely record from thousands or tens of thousands of neurons. Experiment durations have definitely not scaled the way the state space scales — we still typically record for a few hours, and that is already very good. In those few hours you cannot collect a number of samples that keeps up with the exponential growth of the state space, so typically we are in the regime where you simply cannot sample. So we need some way of regularizing this distribution — of using the samples to constrain something about it while regularizing it. To see what good ideas might look like, recall that the same problem arises even for continuous distributions. This is just an aside: if x is a 1D continuous variable and I want to talk about its PDF, I typically collect some samples — let me represent them by little crosses — and if from these samples I want to construct an estimate of the distribution that generated them, I am implicitly or explicitly assuming, in the continuous case, that the distribution is in some way smooth. Because if it is not smooth, these points can simply be seen as delta functions placed exactly where the samples lie, and that would be a terrible estimate of the underlying distribution — totally overfit to the particular set of samples I observed. So in the continuous case we impose smoothness in some way; there are many methods, say kernel density estimation. The question is whether in these discrete, high-dimensional spaces there is a similar notion of smoothness. As you will see, one way this smoothness comes in is through the maximum entropy prescription that I will now outline, and this closely follows the reasoning in the Jaynes paper. So what is the maxent approach? It proceeds in steps. The first thing you do is use your finite data set of samples to estimate M functions, or statistics, of the data, which in this framework are also called constraints; M is some chosen number, it could be one or more. I will first write it and then explain: I write ⟨f_μ(x)⟩_data = (1/T) Σ_{t=1}^{T} f_μ(x_t). So I have some functions f_μ that I can evaluate over my data, enumerated by μ running from 1 to M, and the notation just represents an empirical average of these functions over the observed data set. These functions can be anything — the maximum entropy framework does not tell you what they should be.
They can be very simple — they can be expectation values of individual components of x, for instance; I'll give you examples later — but in general they are simply constraints, observables, depending on where you come from: things that you can estimate over the finite data set you observed, and whose error bars you can discuss. Once you have done that, we look for a probability distribution p that will exactly reproduce these constraints — that matches them exactly. There are many distributions that can match some finite set of constraints, so we have to select one of them, a particular one that encapsulates some notion of smoothness. Jaynes' idea is that the one distribution you choose, among all those matching the constraints, is the one that is at the same time, as he says, most noncommittal: the most random one, the one with maximum entropy. That is the reasoning. So we will find the distribution p, which I'll denote p̂, that matches all of these constraints exactly while maximizing the entropy. The entropy here is just the Shannon entropy, S[p̂] = −Σ_x p̂(x) log p̂(x); the base of the logarithm doesn't matter. We maximize this entropy — this will be our maxent model — subject to the condition that for every constraint function f_μ, the expectation value in the model, ⟨f_μ⟩_p̂ = Σ_x p̂(x) f_μ(x), equals what has been estimated over the data, ⟨f_μ⟩_data, for each μ. This is a very straightforward idea, and you can view it as a constrained optimization problem: you maximize the entropy while satisfying all these constraints, and also satisfying the constraint that the distribution is normalized. It is not terribly complicated to turn this into a variational principle and find out what form the maxent distribution has to have. Before I write it down, you can see that this satisfies what Jaynes was alluding to: I have some limited information about my distribution — these constraints — and I choose, among the distributions satisfying them, one that is in some sense most random. If there are no constraints, and these are discrete distributions, the distribution that maximizes entropy is the uniform distribution; in some sense you can think of a uniform discrete distribution as being maximally smooth — it has no structure, it is flat over the whole state space. As you add constraints, you have to deform the distribution away from uniform in order to match them, while still trying to keep it as flat, as close to uniform, as possible. So let us turn this into a constrained optimization problem for the distribution p. It has the term to be maximized, the entropy −Σ_x p(x) log p(x), and then the constraints, so this will be a variational problem. First, p needs to normalize to one.
To achieve that I put in a Lagrange multiplier, call it Λ, and impose Σ_x p(x) = 1 — that is the normalization constraint that Λ will enforce. Then there are the M other constraints, the expectation values: each comes with a Lagrange multiplier, which I denote g_μ, multiplying the requirement that Σ_x p(x) f_μ(x) match the value estimated from the data. That data value is just a constant, so whether or not I write it in, it does not change the variational problem. Now take the variation with respect to p(x) and set it to zero; something very easy emerges. The constraint terms are linear in p, so differentiating them just leaves the Lagrange multipliers times the constraint functions. The p log p term is the only slightly non-trivial one: differentiating the first factor leaves log p, and the second piece is p times 1/p, which is just a constant — you have done this many times, I'm sure. Setting the variation to zero and doing one line of algebra to rewrite it nicely, you get that maximum entropy models always have the following form: p̂(x) = (1/Z) exp( Σ_μ g_μ f_μ(x) ). There is a normalization constant Z, your partition function — it emerges because you imposed normalization, and I just renamed the constant into something you are more used to from statistical physics — and the exponent contains a linear combination of the constraint functions f_μ, each equipped with its Lagrange multiplier g_μ. That was the trivial part of the problem: all maxent models have this form. The non-trivial part is that you have to set these g_μ — M unknown numbers — to the specific values such that the constraints above are met, such that the starred equations hold. You can view that as M equations for the g_μ, and unfortunately they are nonlinear, because the g_μ also sit inside the partition function. You can write these constraints in the form you are used to from statistical physics: ∂ log Z / ∂ g_μ is the expectation value of f_μ in the model, and this has to equal the value of f_μ evaluated over the data. That is what we have to solve. For some choices of constraints this is easy, but in general it is not, because, as in statistical physics, you have to know the partition function. (A small numerical check of this ∂ log Z / ∂ g_μ identity is sketched below.) Good.
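A small numerical check — my own sketch, not the lecturer's code — of the identity ∂ log Z / ∂ g_μ = ⟨f_μ⟩_model for the maxent form p(x) = exp(Σ_μ g_μ f_μ(x)) / Z, using exact enumeration over 3 binary units; the constraint functions and multiplier values are arbitrary examples.

```python
import itertools
import numpy as np

states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)  # all 2**3 states

def features(x):
    # example constraint functions: the three means and one pairwise product
    return np.array([x[0], x[1], x[2], x[0] * x[1]])

F = np.array([features(s) for s in states])          # shape (8, 4)
g = np.array([0.3, -0.5, 0.1, 0.8])                  # arbitrary Lagrange multipliers

def log_Z(g):
    return np.log(np.exp(F @ g).sum())

p = np.exp(F @ g - log_Z(g))                         # exact model probabilities
model_expect = p @ F                                  # <f_mu>_model

# finite-difference gradient of log Z with respect to each g_mu
eps = 1e-6
grad = np.array([(log_Z(g + eps * np.eye(4)[m]) - log_Z(g - eps * np.eye(4)[m])) / (2 * eps)
                 for m in range(4)])

print(np.allclose(grad, model_expect, atol=1e-6))     # True
```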
If you read the Jaynes paper, there is a slight twist in how we apply it. Jaynes, in the original paper, was talking about knowing some abstract quantity — an expectation value of f — about the distribution, and asking what statistical mechanics would look like under that constraint. What does not come out so clearly from his paper is that in our case these constraint functions are not something we make up: they are empirically evaluated over the data; they come from a finite data set. But that is still fine — I think all the arguments he makes carry over to this context. Let me just check that I didn't forget anything. Okay. Now you can see the connection to statistics that I was alluding to: this is an exponential family of distributions, with sufficient statistics f_μ and, if you want, conjugate parameters g_μ. So you can alternatively view maximum entropy models as nothing but probabilistic models for data, and ask what the plain-vanilla prescription for inferring, or learning, these parameters would be. Let me do that: if we were to employ maximum-likelihood learning to fit the unknown g_μ, what would that look like? One way to view the problem is to solve that nonlinear set of equations; the other way to view the same problem is to infer the g_μ by maximizing the likelihood, or the log-likelihood. Given my data D = {x_1, ..., x_T}, and assuming all samples are independent — the data are i.i.d., so the likelihood is a product across samples — the log-likelihood of the model, written explicitly as a distribution over data that depends on the parameters, is log L(g) = Σ_{t=1}^{T} log p(x_t | g). If I unpack this, it is pretty nice: taking the log of p, the log cancels the exponential, and for each sample I get minus log Z plus Σ_μ g_μ f_μ(x_t). So log L(g) = −T log Z + Σ_t Σ_μ g_μ f_μ(x_t). From this log-likelihood one can derive a learning rule, because I want to maximize it with respect to the parameters g_μ, and the derivative is pretty nice. Why? The log Z term is a constant here — it does not depend on the individual samples — so it comes out of the sum. In the second term, you recognize the sum over all samples of the function f_μ, which is just T times its empirical expectation over the data. And the derivative of log Z with respect to g_μ is precisely the model expectation we had above. So, hoping I did not make a sign mistake, ∂ log L / ∂ g_μ = T ( ⟨f_μ⟩_data − ⟨f_μ⟩_model ), where T is the number of samples: the difference between the expectation in the data and the expectation in the model.
From this you can derive, or propose, a learning rule for the couplings g_μ: an iterative rule that climbs the log-likelihood gradient. At iteration q+1 you take what you have and move up the gradient a little, with some learning rate α: g_μ^(q+1) = g_μ^(q) + α ( ⟨f_μ⟩_data − ⟨f_μ⟩_{p̂^(q)} ) — hopefully with the correct sign, and if not, one flips it. You can see that this particular learning rule has the correct fixed point: you stop updating the couplings when, in your model, the expectations are exactly as in your data, which is what we require; α here is some small learning rate. Just to remind you, the data term ⟨f_μ⟩_data is just a number you estimated once over the data, so that part is easy. The first term is the expectation of the constraints under the current coupling parameters, and that is usually the difficult part: you have this exponential model and you have to compute expectation values given the current parameters at every learning step. But once you have them, you take the difference and nudge the parameters a little. This is, of course, not the optimal learning rule — there are improvements on it — it is the plain-vanilla, simplest learning rule you can write down that has the correct fixed point. (A self-contained sketch of this update loop, for a system small enough to enumerate exactly, is shown below.) The problem of maximizing entropy subject to constraints is convex, so there is a guarantee — there is theoretical work on this that you can look up. Because it is convex there is a single solution, so you can simply initialize all the g_μ at zero and start from there. The problem is not that it has many optima for the g_μ — there is only one set of solutions — the difficulty is that the model expectations are not easy to compute, as I'll try to show you on concrete examples. That bit is hard; the rest — climbing the log-likelihood and so on — is not. So up to here we have introduced the maximum entropy model; it has this exponential form, and you can fit it to data by solving those equations, by this type of iteration, or by something fancier — there are multiple ways to go about it. But let me be very clear about one point: a maximum entropy model is not a single model; it is a framework, a family of models that depends on which constraints you choose, and the choice of constraints is the subject-specific part. When you fit your distribution, which set of statistics do you ask it to reproduce from the data? That depends on the application. You could of course, if you are very patient, do formal model selection: build a model with this set of constraints versus that set, compare them, and ask which set of constraints is most informative or generalizes better, and so on — one can do all of that.
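Here is a self-contained sketch (my own, under the assumption that the system is small enough for exact enumeration, which replaces the hard expectation step) of the plain-vanilla update g_μ ← g_μ + α(⟨f_μ⟩_data − ⟨f_μ⟩_model); the fake data set and constants are placeholders.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 4
states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def features(x):
    # constraints: all single-unit means and all pairwise products
    pairs = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([x, pairs])

F = np.array([features(s) for s in states])           # shape (2**n, M)

# stand-in "data": T samples of n binary units (any binary data would do here)
X = (rng.random((5000, n)) < rng.random(n)).astype(float)
f_data = np.array([features(x) for x in X]).mean(axis=0)

g = np.zeros(F.shape[1])                               # start from the uniform model
alpha = 0.5
for _ in range(2000):
    logp = F @ g
    p = np.exp(logp - logp.max()); p /= p.sum()        # exact model distribution
    g += alpha * (f_data - p @ F)                      # g += alpha * (<f>_data - <f>_model)

p = np.exp(F @ g); p /= p.sum()
print(np.max(np.abs(f_data - p @ F)))                  # near zero: constraints are matched
```

For large n the exact enumeration is impossible, which is exactly where the hard part discussed below comes in.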
But typically the choice of constraints is dictated by the specific problem you are addressing, and I'll try to push this a little further to illustrate what I mean. Okay, good. So let me now give a more concrete example in which I will choose the constraints — both for illustrative purposes and because we are already moving towards maxent models that can be applied to data. In particular, I want to motivate that for the discrete distributions we are discussing there is a privileged, special ladder of distributions — of approximations — that is interesting to discuss irrespective of your data set. This will serve both to construct this ladder of approximating maxent distributions and to make things a little more concrete. We start as follows. I'll label the first one by zero: there is a P_0(x), the trivial case, the maxent distribution with no constraints. Let me develop this for x being binary vectors; it generalizes to discrete vectors, but let me do binary here. We already said that the maxent distribution with no constraints is the uniform distribution, so P_0(x) = 1/2^n, where n is the dimensionality of the vectors. That is trivial, and it obviously has nothing to do with the data. Now let me build the next distribution in the hierarchy: a maxent distribution that matches the data in all single-element marginals. That is, it constrains the mean value — the empirical average, I should say — of each component x_i, i = 1, ..., n, to whatever it is in the data, which I'll denote m_i: the data average of every x_i. That is a simple distribution, and we know what form it must have: P_1(x) = (1/Z_1) exp( Σ_{i=1}^{n} h_i x_i ), where, in the usual physics notation and to link it to the Ising model you might know, I now write the Lagrange multipliers as h_i. By what we derived before, this is the maxent distribution with these single-site constraints: each x_i enters the exponent with a linear coefficient. And this distribution factorizes over the x_i, so I can write it as a product of terms (1/Z_1i) e^{h_i x_i}. This is the case that is still completely trivial to solve, because I can get the formula for h_i analytically: for binary 0/1 vectors, ⟨x_i⟩ = e^{h_i} / (e^{h_i} + 1), and setting this equal to the data value m_i gives h_i = log( m_i / (1 − m_i) ) — I always forget what goes on top and what on the bottom, but that is it. (A one-line numerical check is below.)
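A one-line check, my own sketch, of the closed-form solution h_i = log(m_i / (1 − m_i)) for the independent maxent model P_1(x) = Π_i e^{h_i x_i} / (1 + e^{h_i}); the means are made-up numbers.

```python
import numpy as np

m = np.array([0.2, 0.5, 0.9])               # empirical means of three binary units
h = np.log(m / (1 - m))                     # the Lagrange multipliers / "fields"

model_means = np.exp(h) / (1 + np.exp(h))   # <x_i> under P_1
print(np.allclose(model_means, m))          # True
```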
Right, so the maximum entropy distribution consistent with the single-point marginals is the trivial factorizable distribution written above, whose parameters are directly computable from the data — easy; it is just the independent model, if you want. The first non-trivial one, the one that is actually not that easy to compute, is the next order of approximation. I hope you see what we are constructing here: a set of distributions that match the data first in the mean values, and at the next level not only in the mean values but also in the pairwise correlations. So after P_0 and P_1, the next model is P_2(x): the maxent distribution that constrains both the empirical average of each x_i, as before, and the average ⟨x_i x_j⟩ in the data, which I'll denote C_ij. It does not matter whether I constrain the connected or the non-connected correlation, because the means are already constrained correctly, so I can use either. This model will reproduce exactly what the previous one reproduced, and in addition it will reproduce the measured pairwise correlations. Since we derived the general form, we also know what the solution looks like: P_2(x) = (1/Z_2) exp( Σ_i h_i x_i + Σ_{i<j} J_ij x_i x_j ). In the generic maxent notation all of these Lagrange multipliers were called g_μ; I am now writing them in the form most familiar from statistical physics: the h_i are the terms set to match the means, and the J_ij are the parameters set to exactly reproduce the measured correlation functions. Just to be clear: the values of these h_i will not be the same as the h_i of the independent model, because this is a nonlinear problem — once I put in the J_ij, I also have to re-tune the h_i so that the observables come out right; numerically they are not the same. So here is where we are: we had the uniform distribution, then what we call the independent or factorizable approximation, and this one is the pairwise, or Ising-like, model — I'll say why — which is also called a Boltzmann machine. In general, if I give you the means and the covariances, which can be arbitrary numbers measured from the data, there is no closed-form solution for the h and the J. There are approximate schemes for computing them, but in statistical physics we usually solve the forward problem, not this inverse problem: we set the magnetic field, an external control parameter; we specify some structure on the J_ij — say nearest-neighbour interactions on a square lattice; and then we turn the crank of statistical mechanics, compute the partition function or the free energy, and from it the observables. Now we are going the other way: we are given the observables and need to find these parameters. And in particular, unless nature is very kind to us, the inferred set of J_ij and h_i will not have any nice form, as I'll show you — it will not be that all spins feel the same field, or that couplings only exist between nearest neighbours. That could happen, but typically it does not.
This means that the hard part of fitting is precisely that already the forward Ising problem is difficult in general — computing the partition function — and here you have to do it inside the learning loop. As you update the parameters with the learning rule, given the current h and J you have to be able to compute the means m and the correlations C. For that you either need the partition function, which you do not have, or you can do Monte Carlo sampling given the parameters to get approximate observables. That you can always do, of course, but then the estimates come with error bars, the learning becomes stochastic, and so on. All of this has been done, and it can be done to a reasonable extent. (A rough sketch of this Monte Carlo-based "Boltzmann learning" loop is given below.) Okay. So again, P_2 is the most random distribution consistent with the first- and second-order marginals, and you can see that we can just keep marching down this hierarchy: we can also constrain the third-order correlation functions C_ijk, and the model gains terms K_ijk x_i x_j x_k in the exponent, and so on. At some point, because these are discrete distributions — say n binary units — you march all the way down to constraining correlation functions of order n, the last order available in a finite system. At that point, by definition, your maxent model is exactly the empirical distribution of the data, because you have matched all the marginals from first to highest order, and with all the marginals specified the distribution is fully specified. So there is the notion that we typically truncate the ladder at some low order and hope it describes the data; in principle the ladder continues all the way up to P_n(x), some crazy distribution with all n-fold product terms in it and lots of coupling constants — I won't even write it down. You might think that we have now somehow cheated the curse of dimensionality: at the beginning I said we have finite data and cannot hope to get the joint distribution by just sampling, then I introduced this maxent hierarchy, so you might say, fine, I march down the hierarchy, constrain all the marginal terms up to order n, and I have described the full data distribution — how can that be? The problem, of course, is that these distributions constrain different statistics of the data, and if your data are limited, your empirical estimates of those statistics get worse and worse. Maybe the means are easy to estimate; maybe you still have enough data to estimate the covariance matrix; but with a finite number of samples your estimates of higher-order statistics become unreliable, and this sequence of distributions will start overfitting, as always, in those higher-order terms, which now depend on the finite sampling. But, as I said, what is nice about this modelling framework compared with generic maximum-likelihood inference of high-dimensional models is that you know the sufficient statistics before you even build the model.
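A rough sketch of the Boltzmann-learning loop mentioned above: the model expectations are estimated by single-spin-flip Metropolis Monte Carlo inside the learning iteration. This is my own illustration — the target statistics, sweep counts, and learning rate are made-up placeholders, not a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def mc_moments(h, J, n_sweeps=2000, n_burn=500):
    """Estimate <x_i> and <x_i x_j> under p(x) ~ exp(h.x + sum_{i<j} J_ij x_i x_j)
    by Metropolis sampling; J is kept upper-triangular with zero diagonal."""
    n = len(h)
    x = rng.integers(0, 2, n).astype(float)
    m = np.zeros(n); C = np.zeros((n, n)); kept = 0
    for sweep in range(n_burn + n_sweeps):
        for i in range(n):
            field = h[i] + (J[i, :] + J[:, i]) @ x       # local field on spin i
            d_logp = field * ((1 - x[i]) - x[i])         # change in log-probability if flipped
            if np.log(rng.random()) < d_logp:
                x[i] = 1 - x[i]
        if sweep >= n_burn:
            m += x; C += np.outer(x, x); kept += 1
    return m / kept, C / kept

# target statistics "measured from data" (made-up but realizable numbers)
m_data = np.array([0.4, 0.6, 0.5])
C_data = np.array([[0.40, 0.30, 0.20],
                   [0.30, 0.60, 0.35],
                   [0.20, 0.35, 0.50]])

h = np.zeros(3); J = np.zeros((3, 3)); alpha = 0.1
for it in range(100):
    m_model, C_model = mc_moments(h, J)
    h += alpha * (m_data - m_model)                      # match the means
    J += alpha * np.triu(C_data - C_model, k=1)          # match the pairwise correlations

print(np.round(m_model, 2), np.round(np.triu(C_model, 1), 2))  # approaches the targets
```

Because the Monte Carlo estimates are noisy, the updates are stochastic, exactly as described above; in practice one would anneal the learning rate or average over iterations.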
So you can estimate these correlation functions empirically, put error bars on them, and know upfront how reliable each statistical estimate is; that then informs what you should or should not include in your model — at least as a guideline. Okay. This hierarchy has another very nice property — I'll say this and then stop with the blackboard for today and show some things on the slides. Each of these constructions — the uniform distribution, the independent model, the pairwise model, the third-order model, and so on — has, at least in principle, an entropy associated with it. The entropy of the uniform distribution over n binary spins is just n bits. You can also compute the entropy of the independent model — I won't write the formula, but since it factorizes you just add up the independent entropies of the spins. There is an entropy of the pairwise model, which is obviously not easy to compute because you need the partition function for it, but it exists. So this ladder of models defines a sequence of entropies: S_0, the n bits of the uniform model; S_1, the entropy of the independent model; S_2, the entropy of the pairwise approximation; S_3; and so on, all the way to S_N, the entropy of the final distribution matching all the marginal constraints, which is then the entropy of the empirical distribution of the data. Because each next model imposes all the constraints of the previous one plus new ones, and because in the maxent framework adding a constraint can only lower the entropy — you impose extra statistical structure — or leave it unchanged if the constraint is void, there are inequalities here: S_0 ≥ S_1 ≥ S_2 ≥ ... ≥ S_N. In particular, this fact allows you to decompose an interesting quantity in a very nice way — I'd like to write it close by, so I'll just erase this pairwise formula. Let me do one more step first. When you go from the uniform model to the model constrained by the marginals, the entropy drops from S_0 to S_1 because of the marginal constraints. When you go from the independent to the pairwise-coupled system, there is another drop; that difference, S_1 − S_2, is called the connected information of order two. Going from the pairwise model to the next one there is an additional drop, strictly because you added third-order constraints, and so on. These are the differences between consecutive entropies, and what they allow you to do is to decompose a quantity called the multi-information — I'll define it in a second —
the multi-information, into a sequence of non-negative contributions, these connected informations, all of them greater than or equal to zero. The decomposition is unique because the hierarchy of maxent models is unique, and it is a decomposition of the total statistical correlation structure in the data, order by order: how much is due to pairwise interactions, how much to third order, and so on. For this to make sense I need to tell you what I is — what the multi-information is. Here I rely a little on background: if this makes sense to you, great; if not, it is not a catastrophe. The multi-information can be viewed as the Kullback-Leibler divergence between the full distribution of the data and the independent model P_1(x), in which the spins are independent, non-interacting: I = D_KL( P_data ‖ P_1 ). The first argument is the joint distribution, with all the correlations inside; the second only matches the first-order marginals and has no interactions. D_KL is an information-theoretic, non-negative measure, in bits, of how much statistical structure you gain in going from the independent model to the full joint, and what I am saying is that this quantity can be decomposed into order-by-order contributions in a unique way: how much is due to pairs, to triplets, to quadruplets, and so on. That is often empirically useful, because you can construct these maxent approximations and ask how much is explained by pairs, how much remains that is not explained by pairs or by third-order interactions, and so on. By the way, for those of you who know this: if x were just a two-dimensional vector — just for fun — and I wrote D_KL( p(x, y) ‖ p(x) p(y) ), where x and y are now scalar variables and p(x), p(y) are the marginal distributions (the independent model factorizes into the product of the marginals), you would recognize that I is just the mutual information between x and y. So you can see the multi-information as a multivariate generalization of the mutual information, for those of you already familiar with it. It is a measure of the total amount of statistical structure. All right, let me make a few concluding remarks and then see how much time is left to show you things on the slides — we go until 12. Yes — a question: if you look at this backwards, is this the data-processing inequality at work? I don't think it is directly the DPI. The data-processing inequality that I am familiar with is always phrased in terms of mutual informations along a chain of stochastic variables: when there are dependencies between them, the mutual information between the endpoints is bounded by the intermediate ones. The way this decomposition was shown, I don't think it was by recasting it as the DPI — maybe it is possible to do it that way, I just don't know.
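To make the decomposition's starting point concrete, here is a small sketch (my own, with made-up toy data) of the multi-information I = D_KL(P_data ‖ P_1) = S_1 − S_full, computed in bits for a system small enough that exact counting is possible. The further split into connected informations I^(2), I^(3), ... would require the fitted pairwise, triplet, ... maxent models from the earlier sketches.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
T, n = 50_000, 3
z = rng.random(T) < 0.5                                   # a shared hidden cause -> correlations
X = np.array([(rng.random(T) < np.where(z, 0.8, 0.2)) for _ in range(n)]).T.astype(int)

# entropy of the empirical joint distribution (T >> 2**n, so counting is fine here)
counts = np.array(list(Counter(map(tuple, X)).values()), dtype=float)
p = counts / T
S_full = -(p * np.log2(p)).sum()

# entropy of the independent (first-order maxent) model
m = X.mean(axis=0)
S1 = -(m * np.log2(m) + (1 - m) * np.log2(1 - m)).sum()

print(f"multi-information I = S1 - S_full = {S1 - S_full:.3f} bits")
```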
If you are interested in these informational decompositions, there is a PRL paper by Schneidman et al., titled, I think, "Network information and connected correlations" — just if you want the reference; I'm happy to give it. Okay. Let me now spend a few moments, without writing much down, motivating why one would build maxent models, now that you know a little bit about them — what good can come out of it. Maximum entropy models do provide some insight, and I want to say what that is, but of course they also have minuses, so I'd like to discuss a little what you can get from them and what their limitations are. First, let me say that maximum entropy models are by no means the most expressive type of model, or even close — and since this is also a machine-learning school, that is worth stressing. If you want very good performance on test data from a generative model over a discrete state space, you will often not choose a maxent model; even the pairwise models are actually difficult to fit. There is a variant of these models, designed to be much more expressive, that you might know: the restricted Boltzmann machine — "Boltzmann machine" being the name for the pairwise maxent approximation. So why use maxent models at all? I don't think the answer is "because they are expressive." Let me sketch what a restricted Boltzmann machine is, because it connects directly to the Boltzmann machines you just saw — it has always struck me as funny that it is called "restricted," even though its expressive power is actually greater than that of the model you saw. It is a probabilistic model; let me draw a graphical diagram. These are my x units, the binary samples I was talking about, and I postulate that there exist some variables y that I will not see. The lines I am drawing are dependency lines; I'll write out the mathematical form. The y's are again discrete vectors, of some dimensionality of their own, and I consider joint distributions between x and y of the pairwise form: p(x, y) ∝ exp( Σ_i h_i x_i + Σ_j v_j y_j + Σ_{ij} J_ij x_i y_j ). So I have a bias, like a field, h on every one of the x's; a field v on every one of the y's; and a coupling matrix J, which is only permitted to couple x's to y's — those are the lines — not y to y or x to x; the couplings only go across the two layers. The y layer is the latent, unseen layer, and the x's are the visible units — the data that I sample. That is the full structure of the probabilistic model, but what I actually fit to data is of course its marginal, p(x) = Σ_y p(x, y), summing over all the latent variables.
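A minimal sketch of the restricted Boltzmann machine just described: p(x, y) ∝ exp(h·x + v·y + x·J·y) with no couplings within a layer. The conditional independence across layers gives the standard block Gibbs sampler shown here; all parameter values are arbitrary placeholders of my own, not fitted to anything.

```python
import numpy as np

rng = np.random.default_rng(4)
n_vis, n_hid = 6, 3
h = rng.normal(0, 0.1, n_vis)           # biases on visible units x
v = rng.normal(0, 0.1, n_hid)           # biases on hidden units y
J = rng.normal(0, 0.3, (n_vis, n_hid))  # couplings only across the two layers

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sample(n_steps=200):
    x = rng.integers(0, 2, n_vis).astype(float)
    for _ in range(n_steps):
        # p(y_j=1|x) and p(x_i=1|y) factorize thanks to the bipartite structure
        y = (rng.random(n_hid) < sigmoid(v + x @ J)).astype(float)
        x = (rng.random(n_vis) < sigmoid(h + J @ y)).astype(float)
    return x

samples = np.array([gibbs_sample() for _ in range(5)])
print(samples)   # draws of the visible layer, i.e. roughly from the marginal p(x)
```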
Now, if my data set D is, as before, a set of observed such vectors, this is a probabilistic model, so I can try to learn the h, the v, and the J; I won't talk about how. What I want to say is that this model class, first of all, has full expressive power: in principle it can model any probability distribution, if you allow enough latent variables — you can convince yourself that with at worst 2^n latent variables you can reproduce the probability of any pattern. So, as in deep learning, this is a universal approximator, a universal modelling tool for discrete distributions. It also has, because of the conditional independence between the two layers — the x's are independent of each other conditioned on the y's, and vice versa — some very nice learning algorithms, and so on. We have played around with this and applied it to neural data; it works great on some data sets, it has better performance than the pairwise model, etc. So you might ask why I am talking about maxent models at all — why not just do that, since for many data sets it is actually better? The answer is the following. One part is that this is, in some sense, a black-box model: it is actually hard to understand what it has learned. Maxent models, limited as they are, have explicit sufficient statistics, and that gives you interpretability, or can give it — that is what I want to show you on data, and I'll hint at what some of these interpretations are. One of them is this: a maxent model of the pairwise class is a generative model that can produce samples from a high-dimensional distribution, yet all that went into it is the pairwise structure of the data, nothing more. So you can ask how much of the complexity of the data is accounted for by the pairwise structure alone — that is the separation that maxent allows you to make, or pairwise plus three-point, or whatever you want. So interpretability is one thing. The second is that there are all sorts of links between pairwise models, other maxent constructions, and statistical physics. One has to be careful about saying exactly why, but a lot of machinery from statistical physics carries over — for instance, how to actually compute the entropy of maxent models; we can do it because of developments in statistical physics — and all sorts of tools, fancy Monte Carlo sampling schemes and so on, are at our disposal. So the tools are there and the interpretability is there. The third reason is that there is a number of success cases for maxent models, despite their limitations: cases where the correlation structure of the data looks very complicated, but when you fit the pairwise Ising model and extract the Ising-like couplings J_ij, you realize that the couplings are much simpler than the observed correlations. The classic example: you have three nodes A, B, C.
These could be three neurons, or three positions in a protein — it doesn't matter — and you can measure the correlations between them: some correlation between A and C, some between A and B, and of course some between B and C, all nonzero numbers you can estimate from the data. You then construct your pairwise maxent model and you might infer — I'm making the numbers up — that there is some nonzero interaction J_AB between A and B, some nonzero interaction J_BC between B and C, but that the interaction between A and C is consistent with zero. The underlying set of interactions is much sparser than the correlation structure. You would see the same thing if you played around with the nearest-neighbour Ising model on a lattice: of all the J_ij, only the nearest-neighbour ones are nonzero, but if you compute the correlations across spins on the lattice, the nearest-neighbour correlations are nonzero, the next-nearest-neighbour correlations are also nonzero, the correlations keep extending outwards, and close to the critical point they reach essentially arbitrarily far. So correlations are nonzero everywhere, while the interactions can be just nearest-neighbour. The hope is that for some data sets this is what happens, and indeed there has been much success — not from our group, but you might be familiar with it — when people align protein sequences. You have a whole set of sequences of the same protein from extant, related organisms, and what you do is build essentially this type of pairwise model. Of course protein sequences are not binary — there are 20 amino acids, 20 letters at each position — but they are still discrete, so it is a Potts-like model rather than an Ising-like model. From the correlations across sequences you learn that certain pairs of positions in the protein carry a nonzero coupling, and that coupling is a much stronger indicator of an actual physical contact between those positions in the 3D protein structure than any direct correlation measure. This is called DCA, direct coupling analysis, and versions of it have seen much success; the underlying reason is exactly this phenomenon.
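A toy version of the A-B-C story, my own sketch: generate data from a chain in which A and C interact only through B, fit the pairwise maxent model by exact enumeration (as in the earlier learning-rule sketch), and check that the inferred J_AC is near zero even though the measured correlation between A and C is not. The probabilities and constants are arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
T = 100_000
B = (rng.random(T) < 0.5).astype(float)
A = (rng.random(T) < np.where(B == 1, 0.8, 0.2)).astype(float)   # A depends only on B
C = (rng.random(T) < np.where(B == 1, 0.8, 0.2)).astype(float)   # C depends only on B
X = np.column_stack([A, B, C])

# data statistics: three means and three pairwise products (AB, BC, AC)
f_data = np.array([X[:, 0].mean(), X[:, 1].mean(), X[:, 2].mean(),
                   (X[:, 0] * X[:, 1]).mean(),
                   (X[:, 1] * X[:, 2]).mean(),
                   (X[:, 0] * X[:, 2]).mean()])

def features(x):
    return np.array([x[0], x[1], x[2], x[0] * x[1], x[1] * x[2], x[0] * x[2]])

states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
F = np.array([features(s) for s in states])

g = np.zeros(6)
for _ in range(5000):                          # exact gradient ascent, as before
    p = np.exp(F @ g); p /= p.sum()
    g += 0.5 * (f_data - p @ F)

print("corr(A,C) =", round(f_data[5] - f_data[0] * f_data[2], 3))   # clearly nonzero
print("J_AB, J_BC, J_AC =", np.round(g[3:], 2))
# J_AC comes out near zero (up to sampling noise): the inferred interaction
# structure is sparser than the observed correlation structure.
```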
The next point I would like to make is a warning. Maximum entropy models resemble thermodynamic equilibrium models a lot, and you might be fooled into thinking they are exactly the same thing. The maximum entropy formalism can be applied to any stationary distribution — to samples from any stationary process; they do not need to be generated by a physical process at equilibrium. Equilibrium statistical mechanics is a subset of maxent models, but the converse does not hold: you can have maxent distributions for things that are not at equilibrium, even though the model looks like an Ising model. Note also that in the models we infer there is no separate notion of temperature — there is no beta standing out in front. So the math is almost the same, but one has to be careful about interpretation: some formulas still hold, but you cannot necessarily interpret things as an energy or a temperature; they are energy-like, temperature-like, and so on. There are other such points that I hope you will see from what I show you. For instance, there are assumptions that in statistical physics we take as natural and never question: when we write down various Ising models we often think of the thermodynamic limit, of how the J's need to scale as n goes to infinity, and so on. Many of these things are not warranted when you look at finite data sets. There does not need to be a thermodynamic limit if we study networks of at most a few hundred neurons, or amino-acid sequences a few hundred positions long; notions of typicality and so on need not apply either. It is neat that these models are also theoretically interesting, but be careful about what you carry over from physics. As I said, all of these constructions carry over easily to discrete non-binary distributions; the continuous case is hard. For continuous distributions you can define maximum entropy models, but certain things become more vague: the entropy of a continuous PDF is already a weird object, and if you think of a continuous distribution on an infinite domain for which you constrain the first, second, and third moments, it is not even normalizable. So there are other kinds of issues, which we will not have time to discuss, and I think that case has been worked out less. The last bit: there is also a generalization from what I explained to samples that are not independent. So far we always thought of the data as a collection of i.i.d. samples drawn from the distribution p(x) we are looking for. If t is actually a time index and the samples are correlated because they are generated by some underlying time-series process, you can still build this type of model, but you will be neglecting the time correlations. They can be included: you build distributions over time sequences of x, and then you also need to constrain cross-time correlations, not just cross-component correlations. Just to give you a reference, this has been introduced and people work on it under the name of maximum caliber; you can see it as a superset of this type of maxent models. All right, maybe I stop here; if there are any questions I'll take them now, and if not I'll give you, as a setup for tomorrow, a somewhat more relaxed intro on the slides, moving towards an application to a neural system — what the data are and what we try to achieve. Any questions? Meanwhile I'll share my screen. Yes — about this multi-information: could you, for example, look at the trend — if you cut it at different orders, I_2, then I_2 plus I_3 — and look at how the curve behaves to decide
Okay, if the curve basically stops changing, you say: this level of correlations is enough to describe my data, and there's nothing further. So I think this is at least empirically how it is often done: you look at how the multi-information saturates as you include constraints order by order, whether there is some particular knee, and so on. Of course, in reality it's a bit more nuanced, because you're limited by finite data, so at some point you also have to be careful about how things generalize, but indeed you could do what you're suggesting. Okay, thanks. Thank you. All right, there's a question in the chat. Wait, I didn't follow; so the question is whether maxent models do well or poorly with dependent data. The answer is that it really depends. You can always fit a maxent model on correlated data, but by construction it will not capture anything about the temporal correlations, as long as you don't also start constraining it with across-time correlations, in which case it becomes maximum caliber. How well that does is an empirical question; whether you need to include just short-term correlations or more depends on what's in the data, whether it has long-term or short-term correlation structure, and so on. But in terms of the framework, almost everything carries over; you just need to think about a higher-dimensional object: not p over the vector x, but p over a sequence of vectors. So it's more of a nuisance to deal with, but it's not new math. Sure. So, I've seen maxent models used when you have observations in genomics, where you have millions of cells, but then there is the issue that you're observing an average with added noise from the measurement process. Can you do maxent when you actually have an additional noise process added on top of the constraints? What do you think about that? I think... I mean, you can; the question is what you learn, right. Let me put it like this: I've thought less about what happens when you have experimental noise added, simply because we were not faced with it in many cases in what I'm going to show. What you can do with careful maxent constructions, richer than what I showed, is start partitioning the variability: you can separate different sources of fluctuation, because neurons are exposed to external stimuli that fluctuate, versus correlations that arise within the circuit. So you can design maxent models that tease these apart, which means they are not only constraining the total correlation, because from that you cannot know whether it's due to the stimulus or to intrinsic circuitry; instead you ask the models to constrain correlations conditional on other variables that you have control over. I don't know whether you can then fold experimental noise into that process; maybe you can fold in extra knowledge if you have calibration experiments, say. But I haven't seen this done, so I don't know how to do it. Thanks.
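Coming back to the dependent-data question for a moment, here is a small sketch, mine rather than anything from the lecture, of how "p over a sequence of vectors" enters in practice: stack consecutive time bins into one longer binary word, and the time-lagged correlations become ordinary pairwise constraints on the stacked vector. The toy data and the stacking depth are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                     # number of binary units
X = rng.integers(0, 2, size=(10000, N))   # toy data: one binary word per time bin

L = 2                                     # how many consecutive bins to stack
# stack bins t and t+1 into a single word of length N*L
X_seq = np.hstack([X[t:len(X) - L + 1 + t] for t in range(L)])

# constraints for the time-extended (maximum caliber style) model: means and
# pairwise correlations of the stacked word; the off-diagonal N x N block
# holds the time-lagged correlations
mu = X_seq.mean(axis=0)
C = (X_seq.T @ X_seq) / len(X_seq)
lagged = C[:N, N:]                        # <x_i(t) x_j(t+1)> constraints
print(lagged.shape)                       # (5, 5)
```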
Okay, so let me just lay out the introduction, and then we'll cut at the appropriate place. What I would like to do, and this is work that is by now reasonably old, about 10 years, while the other examples will be newer, is to introduce you to neural population coding, and in particular to retinal experiments, and then show how the maximum entropy models that you have seen can be applied to those data sets: in particular, how one chooses the constraints and what insights they give us about neural coding. Just two or three slides about the biological system. This is a cross-section through a vertebrate retina: light stimulates the photoreceptors, which are neurons that transduce light into electrical currents. These signals filter through some circuitry that is very stereotyped and has been well studied, and for us is not that important, but at the end of this cascade are these guys, the retinal ganglion cells. These are spiking neurons, so they have discrete outputs: they are either silent or they spike. Their axons are bundled together in the optic nerve that goes to the brain, so basically all that our brain gets about the visual world are the signals in this nerve. Everything else is reconstructed by your visual cortex from the signals that come from the retina. So you can view this as a transformation device, from light to spiking activity. In this system we can therefore ask very nice questions about what's known as population coding, which is one of the basic questions in neuroscience: how a population of neurons, the responses of all of these cells together, encodes the stimuli. That's particularly interesting when the stimuli are naturalistic, very rich, with lots of features in them. Of course you can also show the retina very simple stimuli, individual dots, or lines, or bars, or whatever, but the retina is presumably as complicated as it is because it needs to transduce signals about a rich visual world, so one can study that in an experimental setup. This is from my time at Princeton, when Olivier was there; he's an experimentalist who is now in Paris at the Institute of Vision, and while he was there they developed this sort of micro-electrode array recording device. These little dots are electrodes, sometimes spaced 30 microns apart, sometimes 60 microns; here there are about 250 of them, I think. What they do is dissect the retina from the animal: they take the eye out, then the retina out, and it's a little piece of tissue, roughly a square millimeter, that you can work with; in this case it's from a salamander, though it can be done in other species as well. You press the retina down onto this array, and if you keep it happy, perfuse it and so on, for salamander it will actually survive for six to eight hours, and you can record from it while you project any stimulus on top. These little electrodes will then pick up the activity; here, these are the electrodes, and these are the cell bodies of the retinal ganglion cells themselves sitting on top of the array.
So the electrodes pick up the electrical activity, and if you do some signal processing you can isolate from the signals when each of the neurons spiked. So you get this high-dimensional recording that I'll show you next. But importantly, the retina is retinotopic. The retina is big, but the piece that's over the array is the piece that looks into a small angle of the visual world, so everything that happens in that part of the world is encoded by some neurons that you can record. And in this preparation it's very nice, because most of the neurons that are looking into that part of the visual world can be recorded. If you go to the cortex, that's not the case: even if you can record tens of thousands of neurons, you know there are millions of others that you are not seeing in your experiment. In this particular case, you are actually recording from the majority of the neurons that are looking at the stimulus. And so, yeah, this is an old movie; maybe you have seen it, because I keep playing it, I like it. This is the stimulus that was actually shown to the retina. It's a salamander, so it likes looking at watery things, so it's fish swimming around. The movie will play, and for each one of the recorded neurons you can determine, by a separate calibration experiment, which part of the visual world it is sensitive to. Here is an example neuron that is looking at this particular part; that's the center of its receptive field, if you're familiar with that term. And because Olivier records not just from one but from many neurons, if you put all of them on top of each other, this is what they cover. So, as I said, this sub-population covers one piece of the visual world, which is that piece right there. To interpret the movie: the movie will play, and every time a neuron fires, its receptive-field ellipse will appear for a brief moment, right there where it looks. So let me see if I get this going. It's looking here; not much is happening, so there's not much activity, and then a fish will cross, and as it crosses you see that the moving stuff excites the neurons, and so on. This is just a little clip, but it gives you a glimpse of how this looks, and you can repeat the same movie many times and do all sorts of tricks with it. What comes out of this experiment is a set of samples in time. These are the neurons: each line is the response of one neuron, and where it's dark the neuron made a spike. If you zoom in, you can represent this raster, as it is called, as a collection of binary words, where one means a neuron spiked and zero means it didn't. This is discretizing time into 20 millisecond bins; there is a long discussion in neurophysiology about why 20 milliseconds, there is a reason for it, connected with neural integration times, correlation functions and so on, and I'm happy to discuss it, but it's not that important for now. Let me just say that the precision with which these neurons spike is somewhat below 20 milliseconds, so they will time their spikes reproducibly on natural stimuli to within roughly this bin.
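Since this binning step comes up again tomorrow, here is a minimal sketch, my own rather than the lab's actual pipeline, of how spike times turn into those 20 ms binary words; the function name and the seconds-based units are assumptions made for the example.

```python
import numpy as np

def spikes_to_words(spike_times, t_start, t_stop, bin_ms=20.0):
    """spike_times: one array of spike times (in seconds) per neuron.
    Returns a (num_bins, num_neurons) array of 0/1 'words'."""
    dt = bin_ms / 1000.0
    edges = np.arange(t_start, t_stop + dt, dt)
    words = np.zeros((len(edges) - 1, len(spike_times)), dtype=np.uint8)
    for i, times in enumerate(spike_times):
        counts, _ = np.histogram(times, bins=edges)
        words[:, i] = counts > 0          # 1 if the neuron spiked at least once in the bin
    return words

# toy usage: three neurons, one second of recording
toy = [np.array([0.013, 0.250, 0.251]), np.array([0.500]), np.array([])]
print(spikes_to_words(toy, 0.0, 1.0).shape)   # (50, 3)
```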
So once you have this raster, you can ask many questions; I'll just motivate them and then we'll wrap up for today. The simplest question is about the vocabulary: if you think of these binary combinations as the symbols that are being sent to the cortex, you can ask about the probability distribution over these words. And here there is an obvious caveat: these are correlated in time, this is a time process, they are driven by the stimuli, so asking just about their distribution as if they were independent is missing a lot of structure. But it turns out it's still a non-trivial and interesting problem. It's a little bit like taking a book and thinking about the distribution of words in the book while ignoring that words come in sequences: just looking at which words are used with what frequency. Can we learn something from that? So you can try to take these samples and build models for that distribution, which is what I'll show you with maxent. But of course there are all sorts of other questions. For instance, you can take the stimulus and repeat the same stimulus over and over again, so there are now many repeats. Then the neurons will not respond in exactly the same way, because there is noise in the system, even if the stimulus is exactly the same. And then you can talk about the distribution of the words conditional on time, so within the same time bin the stimulus was exactly the same, and all of these responses must mean the same thing to the brain. You can condition either on time or directly on the stimulus, so you can build this type of conditional model, and maxent can be used for that as well, but I won't show it. Or you can do the inverse problem: given that I observed a certain sequence of spikes, can I reconstruct the stimulus? That's called decoding. Many groups have done this, ours included; it's basically asking, can I read the neural code: if I only listen to the neurons, can I reconstruct the full movie, or how well, or which features of the movie can I reconstruct? A lot of work with modern machine learning has also been done on this decoding problem. Okay, so this is where I'll stop for today. What we'll talk about tomorrow is this problem: we'll take these samples and do the maxent construction. This is high-dimensional, we have hundreds of neurons, and there are obviously not enough samples to estimate the distribution directly, so we'll use the maximum entropy construction to build models of this distribution, try to learn things from them, and showcase a little of what you can learn. So I'll stop here so I don't go over time. Thank you. See you tomorrow.
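As a small appendix to that last point, a back-of-the-envelope sketch, mine and with arbitrary numbers, of why direct estimation of the word distribution is hopeless for hundreds of neurons: the number of possible binary words grows as 2^N and quickly dwarfs any realistic number of 20 ms samples, which is exactly why a maximum entropy construction with a small set of constrained moments is used instead.

```python
n_neurons = 100
recording_hours = 2.0                                  # arbitrary example values
bin_ms = 20.0

n_samples = recording_hours * 3600 * 1000 / bin_ms     # number of 20 ms words recorded
n_words = 2.0 ** n_neurons                             # number of possible binary words

print(f"samples: {n_samples:.0f}")                     # 360000
print(f"possible words: {n_words:.2e}")                # ~1.27e+30
print(f"samples per possible word: {n_samples / n_words:.2e}")
# the empirical histogram over words cannot be populated; a pairwise maxent model
# only needs the N means and N(N-1)/2 correlations instead
```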