OK, so let's start. I'll pass this around. Thank you. So before we start applying maximum entropy modeling to concrete examples, I just wanted to say a few more things about it from the formal point of view. The first thing is that I wanted to give you two other simple examples of maximum entropy, and one of them you actually know very well. So imagine you have a system for which you can define an energy function, given by some Hamiltonian H(x). Remember, x is the collective state of my variables. Now imagine that you want a distribution of maximum entropy that is consistent with a given mean value of the energy, which we call E, computed with that Hamiltonian. Then you know from the formula I derived yesterday that the distribution should take an exponential form, p(x) proportional to exp(-beta H(x)), where beta is the Lagrange multiplier conjugate to the energy. And if you look at this, it looks very familiar: it is just the Boltzmann distribution. So in other words, the Boltzmann distribution is the maximum entropy distribution that is consistent with a given value for the mean energy of the system. In fact, that's the example from physics that we are most familiar with. Now another example, a simple one, really just for illustration. Let's say x is a real random variable, and we want to put a constraint on its mean value, which we measure experimentally, and also on its variance. Remember, the variance is the second moment minus the mean squared, so putting a constraint on the mean and the variance is the same as putting a constraint on the first and second moments. Again, I use the formula from yesterday: I write an exponential of lambda x plus mu x squared, so an exponential form involving the first moment, the second moment, and a normalization. But you know that you can rewrite this in the following manner.
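Written out explicitly, the constrained maximization described here (using the notation above) is:

```latex
\max_{p}\ \Big[-\sum_x p(x)\,\ln p(x)\Big]
\quad\text{subject to}\quad
\sum_x p(x)\,H(x) = E,\qquad \sum_x p(x) = 1,
\qquad\Longrightarrow\qquad
p(x) = \frac{e^{-\beta H(x)}}{Z},\qquad Z = \sum_x e^{-\beta H(x)},
```

where beta is the Lagrange multiplier enforcing the energy constraint, tuned so that the mean of H under p equals E.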
So what I'm writing here is just a change of parameters. And if I want to be really careful, I can rewrite this form as that form, and there's a simple relationship between x naught and sigma on the one hand and my lambda and my mu on the other hand. There is also a term of order zero to balance out, so that you're only left with terms of order one and two. So let me be explicit so that there's no possible confusion. I hope I'm not getting this wrong, but it's not that important. And then I can calculate the normalization coefficient, because now I simply recognize a Gaussian distribution. So again, you can view the Gaussian distribution as the distribution of maximum entropy that satisfies constraints on the mean and the variance of a continuous real variable. And there are many other examples like this, where the simplest distribution you can write down just happens to be that of maximum entropy. Yes? [Inaudible question.] No, so let me re-explain this one more time. The idea here is that you have Lagrange multipliers that are conjugate to the quantities whose averages you want to fix. So when you do this exercise, if you want to find the distribution that's consistent with a given value of the mean energy, you need to tune your beta. You need to tune the temperature so that you reach that desired mean energy. Again, it's an inverse problem, if you like. Usually what you do is that you fix the temperature, and from the temperature you calculate the mean energy. Here we want to do the opposite: we fix the mean energy and we back out the temperature. But of course, there is a relationship between energy and temperature, and if you can calculate the mean energy for each temperature, you find, I don't know, something like this curve. Then you can do the inversion simply by taking the energy and reading off the temperature. This is more general; I mean, here there's only a single parameter.
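To make the single-parameter inversion concrete, here is a minimal sketch in Python. The two-level system and the function names are my own illustration, not from the lecture: given a target mean energy, bisection on beta works because the Boltzmann mean energy is monotonically decreasing in beta.

```python
import numpy as np

def mean_energy(beta, energies):
    """Mean energy <E> under the Boltzmann distribution at inverse temperature beta."""
    weights = np.exp(-beta * energies)
    return float(np.sum(energies * weights) / np.sum(weights))

def invert_for_beta(target_energy, energies, lo=1e-6, hi=100.0, tol=1e-10):
    """Back out beta from a desired mean energy by bisection.
    <E>(beta) decreases monotonically with beta, so bisection converges."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_energy(mid, energies) > target_energy:
            lo = mid  # mean energy too high: need larger beta (colder)
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Two-level system with energies 0 and 1: <E> = 1/(1 + e^beta),
# so a target of 0.25 should give beta = ln 3
E = np.array([0.0, 1.0])
beta = invert_for_beta(0.25, E)
```

The same bisection idea fails in the multi-parameter case of the lecture, which is why a gradient method is used there instead.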
But here what you want to do is pick lambda and mu so that you satisfy your constraints. So what I'm saying is the following: I have this lambda and mu, and there's an analytical solution for lambda and mu as a function of the mean and the variance. To see that, I first do a change of variables, if you like, where I replace lambda and mu by x0 and sigma squared. This is just a change of variables. Once I have done it, I recognize that this distribution is a Gaussian as a function of these new parameters, which are simple functions of lambda and mu. I forgot to say something important, maybe, which is that the mean of x under this distribution is simply x0, since it is Gaussian, and the variance, again using Gaussian integration rules, is sigma squared. So in a way, once I have rewritten the parameters as functions of x0 and sigma squared, I've already solved the problem, because it just so happens that x0 and sigma squared are the mean and the variance of this distribution. So here the inversion is kind of trivial: I measure the mean, I measure the variance, and then if I want to turn these into lambda and mu, I just use these formulas. This is a simple, analytically solvable case. So in the previous lectures, we talked quite a bit about maximum likelihood, and here I introduced a new concept, which is maximum entropy. What's the relationship between these two? In fact, there is a strong one: they are almost equivalent. It's just a difference in the formulation of the problem. So when we did maximum entropy, we said that the distribution will take this exponential form, the exponential of a linear combination of my observables. I forget what notation I used. And then we say, OK, but you have to tune, and this is what I just said now, you have to tune your lambdas so that you satisfy the constraints.
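For the record, the change of parameters the board sketch refers to can be written as:

```latex
p(x) \;\propto\; e^{\lambda x + \mu x^2}
\;=\; \frac{1}{\sqrt{2\pi\sigma^2}}\,
      \exp\!\left(-\frac{(x - x_0)^2}{2\sigma^2}\right),
\qquad
\mu = -\frac{1}{2\sigma^2},
\qquad
\lambda = \frac{x_0}{\sigma^2},
```

where the leftover constant, minus x_0 squared over 2 sigma squared, is absorbed into the normalization (the "term of order zero"), and Gaussian integration then gives mean x_0 and variance sigma squared.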
So "tune" or "fit" means choosing the ensemble of the lambdas such that the average of each observable under the model is equal to its empirical average, which, sorry, maybe I call K. So this would be my empirical average, and I want my model to match it. So you have this inverse problem to solve. But now let's assume instead that we know the model will take this form. We don't invoke maximum entropy; we just posit that we want the model to take this form. For instance, if you have a spin system and you know that the interactions are pairwise, and you know that it satisfies the Boltzmann distribution, then you know it will take this exponential form, according to Boltzmann's law. And then we want to look for the parameters that best explain the data, and the natural tool for that is maximum likelihood. So what we do is write the likelihood. It's the probability, remember, of the data given the model parameters. And here I assume that all my samples are independent of each other, so they're each an independent realization of my random process, which means I can write this likelihood as a product, over all the data points that I saw, of the probability of that particular data point given the parameters of my model. And the parameters of my model here are simply these Lagrange multipliers: once I've written down this form, the only thing left for me to choose is the value of the Lagrange multipliers. Each factor is simply given by this form; it is just the probability of drawing this particular sample from my distribution, the probability of that particular configuration x. So I write my log likelihood, which is very often what I'm ultimately interested in, because it's additive in my samples. It's a sum over my samples of the log of this, so I will have a sum of minus lambda times the observables, and for each sample I also have minus log Z. So what if I apply maximum likelihood to this?
Maximum likelihood says that I'm looking for the set of parameters, in this case the set of my lambdas, that maximizes this log likelihood. So what do I do when I want to maximize a quantity? I take the derivative and I ask that the derivative is equal to 0 as a function of the parameter. So in practice, I just take my log likelihood and take the derivative, let's say, with respect to lambda a. If I do this, I get two terms: the first one is minus the sum over my samples of the observable O_a evaluated at x_m, and the other one is M times d log Z over d lambda a. But here you see that, up to a factor of M, you recognize precisely the empirical average, which I will call the average according to the data. That's the definition of the empirical average. And when I rewrite things this way, I just have minus M times the average in the data, minus M times d log Z over d lambda a. OK, so I still have this derivative of log Z, the normalization coefficient, to calculate. But this is something that in stat mech we're quite familiar with: if we take derivatives of the free energy with respect to the different parameters of the model, we end up with the averages of the conjugate variables. This is a general feature of the Boltzmann distribution, and it's very easy to prove. If I write d log Z over d lambda a, first of all I can say that this is 1 over Z times dZ over d lambda a. And what is Z? Z is the sum over all possible configurations of my Boltzmann weight. So when I take the derivative dZ over d lambda a, I get a minus sign, and I simply get this sum. Putting this together, what I recognize is simply the definition of the average of the observable O_a: I take the sum over all configurations of my quantity O_a with the Boltzmann weight. I can put the Z here, if you like, and I recognize my p(x): the sum of p(x) times the quantity is the average of the quantity. And there's a minus sign, which I forgot.
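Putting the pieces of this derivation together in one place, with M samples and p(x) = exp(-sum_a lambda_a O_a(x))/Z:

```latex
\frac{1}{M}\,\ln\mathcal{L}
   = -\sum_a \lambda_a\,\langle O_a\rangle_{\mathrm{data}} \;-\; \ln Z,
\qquad
\frac{\partial \ln Z}{\partial \lambda_a}
   = -\langle O_a\rangle_{\mathrm{model}},
\qquad
\frac{1}{M}\,\frac{\partial \ln\mathcal{L}}{\partial \lambda_a}
   = \langle O_a\rangle_{\mathrm{model}} \;-\; \langle O_a\rangle_{\mathrm{data}},
```

so the gradient vanishes exactly when the model averages equal the data averages, which is the maximum entropy constraint.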
And here there's a plus sign, sorry, a minus sign. So at the end of the day, the model average goes here, and this is the data average. You can see that this derivative, which I want to set to 0 at lambda star, is simply proportional to the difference between the data and the model averages. So I reach the maximum exactly when this difference is equal to 0. At the maximum likelihood value of the parameters, there is equality between data and model averages, which is exactly the constraint I had for maximum entropy. So in other words, if you start from this model form and you maximize the likelihood with respect to these parameters, you end up with the maximum entropy distribution. The two things are really one and the same; it's just that the justification is different. In one case, you start from the observables and say, I want something of maximum entropy; that gives you this form, and then you need to solve for the multipliers. In the other, you say, I want this microscopic form of the distribution, and then you still need to solve, by maximum likelihood. But at the end of the day, the models are the same, and the values of the parameters will be the same. OK, so now I wanted to spend the rest of the lecture talking about some applications of this idea. And by the way, I should give credit: all these maximum entropy ideas were developed by Jaynes in 1957. But only recently has there been interest in applying them to biological systems, and as far as I can tell, one of the most striking recent examples is the application to neural networks, specifically retinal neural networks. So we already talked about the retina, and you already know how it is structured, but let me just remind you. We talked a lot about photoreceptors. This is where light comes in. Some output current comes out of these photoreceptors, which is then transmitted to bipolar cells, which themselves transmit a signal to ganglion cells.
And the ganglion cells, I remind you, are the cells whose axons form the optic nerve, so they are the cells whose activity is sent to the rest of the brain. All that the brain knows about the visual world is contained in the activity of ganglion cells. You also have other layers, of amacrine and horizontal cells, which function as horizontal relays, in the sense that they make these different cells communicate with each other laterally. But one interesting question is: from the moment you get the stimulus here to the moment you get this output activity, how can you interpret that output activity? Because this is the only thing the brain sees, it's interesting to understand how the neural code, and by code I mean the input-output relationship between the stimulus and this output activity, is structured. So you can rephrase the problem as follows. You have an image, and you can record the activity of these output cells, the ganglion cells. These cells don't communicate exactly the way photoreceptors do. Photoreceptors, remember, output a continuous current. Ganglion cells function like most neural cells: they fire action potentials, also called spikes, which are stereotyped bursts of electrical activity. So when you record these cells, you can summarize the activity of each cell as a function of time: for each cell, whenever there's been a spike, you put a small dot in the diagram. This is called a spike raster, and it's all you need to know about the activity of these cells: which cell spiked and when. So a general question people have is: what's the relationship between this image, an avocado, let's say, or a movie, and the neural activity coming out of these cells? And for the past 15 years or so, it's been possible to actually record many of these cells simultaneously.
This is done using multi-electrode arrays. The way it works is that the retina is extracted from the dead animal. I mean, the animal is killed, then the eye is taken out and dissected, but the retina is still alive, still working; the neural network is still working. One literally scrapes the retina out from the bottom of the eye and flattens it against a glass plate on which small electrodes have been microfabricated, which allow you to record the activity of these cells. And you can do this while projecting light onto the retina itself, so you can actually do this while showing any image you want, and then record what comes out in terms of spikes. There's a lot of technology involved in this. This is a recent paper where they explain how they do it, from the experimental side to the signal processing side. But all you need to remember, really, is that there's a way of presenting any image you want and recording the activity of many cells. By many cells, I mean you can record up to about 200 cells at the moment with this kind of technology, in a dense patch. And of course this would be done on retinas of vertebrates: it could be the tiger salamander, which is an amphibian, or guinea pigs, rats, or even monkeys. In all these organisms there are of course many more than 200 cells. But if you take these 200 cells that are recorded from, shown here in fluorescent green, with the positions of the electrodes in yellow, you are basically focusing on a very small patch of the retina, so a very small patch of the visual field. And in this very small patch of the visual field, you want to know how cells collectively encode information. Do you have any questions about the way this works? This is the biology part, but OK. And I just want to point out that the retina does a lot of stuff. It's not simply a pixel map.
It's not like you send light somewhere and you just see spikes at exactly the position of the ganglion cells where you sent the light. The retina actually performs some processing. It's really a piece of the brain, so it's already doing computations. And part of the difficulty is that there's no good, satisfactory model of how this input-output relationship works. So how do we get from this to maximum entropy? The question I was interested in is understanding the collective activity of the neurons. To build the collective state, the first thing that these people did in 2006, this is from a paper by Schneidman et al. in 2006, the citation will come later, is to take the spike raster. So for each cell on the y-axis, you have a black dot whenever there's a spike. You cut time into many small bins, say of size 20 milliseconds, and for each bin you ask whether a particular cell spiked during that time bin or not. If it did, you put a 1; if it didn't, you put a 0. You repeat this for each time bin and you end up with a binary pattern. So for each time bin you have a binary word: this vertical column is what comes out of the retina at that particular moment, and you can view it as a binary word, one for each time bin. That's how you describe the collective activity. Or, if you're more like a physicist and prefer a spin notation, you can put a minus 1, a spin down, if there was no spike, and call that a silence, and a plus 1 if there was a spike. These are just two different conventions. So now the collective activity of your network is just a configuration of plus 1s and minus 1s. I change my notation from x to sigma to make it look more like a spin.
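As a concrete illustration of this binning step, here is a minimal sketch. The function names and the toy spike list are my own, not from the paper:

```python
import numpy as np

def spike_words(spike_times, n_cells, t_end, bin_width=0.02):
    """Binarize a spike raster: one row per 20 ms time bin (by default),
    where entry i is 1 if cell i spiked in that bin and 0 otherwise."""
    n_bins = int(np.ceil(t_end / bin_width))
    words = np.zeros((n_bins, n_cells), dtype=int)
    for cell, t in spike_times:
        words[int(t // bin_width), cell] = 1
    return words

def to_spins(words):
    """Switch from the 0/1 convention to the physicist's -1/+1 (spin) convention."""
    return 2 * words - 1

# Toy raster: (cell index, spike time in seconds)
spikes = [(0, 0.005), (2, 0.015), (1, 0.030)]
words = spike_words(spikes, n_cells=3, t_end=0.04)  # two 20 ms bins
```

With these toy spikes, the first bin contains spikes from cells 0 and 2, and the second bin a spike from cell 1, so the two binary words are [1, 0, 1] and [0, 1, 0].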
So this is where we can use maximum entropy, because what we want is a model for the probability distribution of these spike words, as I would call them. Spike words are basically the words of the code used by the retina to encode visual information. The first thing I want to know is the statistics of these words. I have all my possible words here, in the 0-1 convention, and I just want to know their statistics. If I wanted to do this naively, I would have to record many, many, many words, do their statistics, and look at the frequency of each possible word. The problem is that there are 2 to the power n possible words, and when n is of the order of even 100, you can see how that starts to be a problem: 2 to the 100 is about 10 to the 30. That's way too many. There's no experiment in the world where I can really sample all these possible words. So I need to make simplifications, and this is where maximum entropy comes in. The first thing you can ask is: what if I simply constrain maximum entropy by what we call the mean firing rates? The mean firing rate constraint is basically fixing the average of sigma i: for each i, I want my model to reproduce the probability that in each time bin, neuron i spikes. If neuron i has a spike rate r i, then with probability r i times delta t I will have a plus 1, and with probability 1 minus r i delta t I will have a minus 1, so the average reads like this. Here delta t is the width of my time bin. So constraining the spike rates is exactly the same thing as constraining these mean values. And the maximum entropy distribution in that case is the one we derived yesterday: a model, essentially, where each spin is independent, so each neuron is independent of the others. Yeah, it's a good question. So here, for these experiments, what they showed to the retina was natural movies.
People went into the woods with a camera to film things moving around, the idea being that in order to understand what dictionary of words the retina uses, you need to show it natural stimuli. So it's a movie; it's changing in time. [Question: so is this the average response of the retina to these stimuli?] I mean, calling it an average is not quite right, because it's really the statistics of the responses. You don't take an average; you look at the statistics of all possible responses to these natural stimuli. There's no average, except in the following sense: let's say you write some text, and you want to know the frequency of each of the words you're using. You can't really say you're taking an average; you're just looking at the statistics of your words. For instance, the word "the" will be the most used, then "that", et cetera. That's what you want to know here: the dictionary of words that the retina is using. And there's no semantics, and in this analysis there's also no relationship to the stimulus. Of course, when you write language, the context of what you mean and the grammatical structure will influence the frequency of your words, but you forget about that; you just want to know the statistics of the words at the end of the day. So it turns out that if you write down this kind of model, what you can do is take a small subnetwork, let's say n equals 10. In that case, 2 to the n is not too large, it's 1024, so you can actually record the frequency of each possible word. You can calculate the empirical frequency, which I call p data: the number of times you saw a particular word divided by the total number of time bins.
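The empirical word statistics and the independent-model prediction they are compared against can be sketched like this (toy data and function names of my own, for illustration only):

```python
import numpy as np
from collections import Counter

def empirical_word_freqs(words):
    """p_data(word): fraction of time bins in which each binary word occurred."""
    counts = Counter(map(tuple, words))
    return {w: c / len(words) for w, c in counts.items()}

def independent_word_prob(word, rates):
    """p_indep(word) = product over cells of p_i(word_i), built only from
    each cell's marginal spike probability (the independent max-ent model)."""
    p = 1.0
    for wi, ri in zip(word, rates):
        p *= ri if wi == 1 else 1.0 - ri
    return p

# Toy data for two cells that tend to spike together
words = np.array([[1, 1], [0, 0], [1, 1], [0, 0]])
rates = words.mean(axis=0)                      # marginal spike probabilities
p_data = empirical_word_freqs(words)            # p_data[(1, 1)] = 0.5
p_ind = independent_word_prob((1, 1), rates)    # 0.5 * 0.5 = 0.25
```

In this toy example the word (1, 1) occurs half the time but the independent model assigns it probability 0.25, exactly the kind of underestimation of frequent multi-spike words described next.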
Then what you can do is compare this against the prediction of the independent model, which I call p independent. When you make this comparison, with the identity line drawn and in log scale, you get a lot of scatter. Each point here is one of the 1024 possibilities, and you get many cases where a spike word is actually quite frequent but is completely underestimated by the model. This is a clear sign of correlations between the neurons: the neurons do not act independently of each other. What happens is that some tend to spike when others also spike, and this makes the probability of multi-spike patterns, of spike words, higher than you would predict by just taking a product over the neurons, as you do here. So to fix this, and this was the work of Schneidman et al. in 2006 in Nature, they said: OK, what if we now also put a constraint on the pairwise terms? In that case, the distribution takes the form I already showed yesterday, of a disordered Ising model. Yesterday you actually saw the ferromagnetic Ising model, where all the J's are constant and all the h's are constant; this is the fully disordered Ising model, and the h's and J's are your Lagrange multipliers. What you need to do now is adjust them so that your model agrees with the data for both the means and the correlations of your spikes. And this, in general, is a very difficult problem. We don't know how to solve it analytically in general. We know how to solve it when the Jij's are constant on a regular two-dimensional lattice; already in 3D we don't know what to do, and we absolutely have no idea when the Jij's are arbitrary. There are many ways of solving the problem numerically; one of them is to use Monte Carlo. But in general, on the technical side, remember, the task is the following: you are given these observables.
And there's a one-to-one mapping between these observables and your parameters. So let's say that you have a way of solving the direct problem. The direct problem is this one: given the parameters, calculate the correlation functions. As I said, that's already a hard problem, and in this case the only way to do it exactly, or at least in a way that is proven to converge, is Monte Carlo. You have this model, it's like an Ising spin model, and you just simulate it using a Metropolis algorithm; from that, you can calculate your correlation functions. But the problem is that that's not what you want to do. What you know are these averages, from the data, and you want to infer the h's and J's. That's the inverse problem. The generic way of doing this is a gradient ascent algorithm, and the reason you can do this is precisely the reformulation in terms of maximum likelihood. Remember, when we wrote down the derivative of the likelihood, we found this expression. You want to maximize this function, and here you have the value of its gradient. The simplest algorithm there is in optimization is called gradient descent; in this case it would be gradient ascent, because you want to maximize the likelihood, so let me call it gradient ascent. What you do is take your parameters, the lambdas, and update them proportionally to the gradient of the function you're trying to optimize, with some step size epsilon. In multidimensional optimization, you're sitting at some point, your gradient points one way, so you move a bit in that direction; then the gradient points another way, you move a bit in that direction, et cetera, until you reach the maximum.
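For small n one can even skip the Monte Carlo and enumerate all 2^n states to solve the direct problem exactly. Here is a minimal gradient-ascent sketch along those lines; it is my own illustration of the scheme just described, not the authors' code:

```python
import itertools
import numpy as np

def model_averages(h, J):
    """Direct problem, solved exactly for small n: <sigma_i> and <sigma_i sigma_j>
    under p(s) ~ exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j),
    with J symmetric and zero on the diagonal."""
    n = len(h)
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    log_w = states @ h + 0.5 * np.einsum('ki,ij,kj->k', states, J, states)
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    m = p @ states                                   # model means <sigma_i>
    C = np.einsum('k,ki,kj->ij', p, states, states)  # model correlations
    return m, C

def fit_ising(m_data, C_data, steps=3000, eps=0.1):
    """Inverse problem by gradient ascent on the log-likelihood:
    push the model averages toward the data averages."""
    n = len(m_data)
    h, J = np.zeros(n), np.zeros((n, n))
    for _ in range(steps):
        m, C = model_averages(h, J)
        h += eps * (m_data - m)
        dJ = eps * (C_data - C)
        np.fill_diagonal(dJ, 0.0)  # <sigma_i^2> = 1 always; nothing to fit there
        J += dJ
    return h, J

# Toy spike data in the +-1 convention (all four words occur, so the fit is well posed)
spins = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1], [1, 1]])
m_data = spins.mean(axis=0)
C_data = (spins.T @ spins) / len(spins)
h, J = fit_ising(m_data, C_data)
```

For realistic n the exhaustive sum over states is replaced by Monte Carlo estimates of the model averages, but the update rule is the same.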
And the nice thing here is that if you can solve the direct problem, you can calculate these model averages, using that same Monte Carlo or whatever your favorite technique is for computing mean observables, compare them to the data, and then all you have to do is update your parameters according to this rule, which compares data to model. In practice, this is what they did for this problem, and this is the kind of result one gets. You get the full matrix Jij of interactions between your neurons, for instance, and you can see that there is very little structure in it. You have positive and negative terms, positive and negative interactions, so upon first inspection it really looks like a spin glass, because you could have frustration. And on this graph I just want to focus on this panel. It's the same kind of comparison I was showing on the board: in green you have the prediction of the independent model, and in red the prediction of what here is called the Ising model, what I've been calling the pairwise model, for small groups of 10 neurons. You can see that while the independent model, as I was showing here, performs very poorly, the red points fall very close to the identity line. And I think we should have a short break now of five minutes. Yes? I have a question. Yes? [Partly inaudible question about comparing the two models and how one should choose which observables to constrain.] So you're asking, I think, about what kind of observables you should pick. And that's a very difficult problem, and I don't really have a general solution for it.
In this particular example, the simplest thing you could think of at the beginning was this: you assume that the neurons fire independently, so you start with a model like this. And then you need to look at quantities that were not fitted by the model and see whether they're well predicted. Here the case is fairly simple, because in fact you maybe didn't even need maximum entropy to start with: you can actually construct the empirical distribution directly, which in general you can't. So in that case you can just compare. But this panel maybe serves to illustrate the point, the leftmost one. This is a prediction of something that's not fitted by either of the two models: the distribution of the total number of spikes, which would be this sum in the plus-minus-one convention, the total number of spikes in a given time window. If you think about it, the probability distribution of this quantity is of very high order; it's not fitted by either this model or that model. So it's independent, if you like, of what you tried to fit, and you can see whether you're doing well on it. And you see that the independent model really fails to explain the probability of spike words with many spikes, while the pairwise model fits it fairly well. So you basically have to proceed by trial and error. You first start with the simplest thing, then you see whether you can predict things you didn't fit, typically global observables such as this one, the total number of spikes. And if that fails, you need to add more observables to constrain. Then the choice of exactly what you add is a thorny question.
And in fact, Matteo Marsili, at the back, has been working on this quite a lot, trying to understand how you should pick the set of observables, or in other words, what set of terms you should add to your exponential family to get the best possible model at a minimal cost in model complexity. And you said you had a second question? [Question about the difference between constraining just the averages and also the covariances between the variables.] Yeah, this is what this is. But here we go the other way around. We say that we want to impose the average of sigma; sigma is x, it's just a different notation for the same thing. If you impose just the means of the sigmas, you end up with a model of independent variables, and if you also impose the pairwise averages, you end up with a model of Ising type. And just to remind you: this form here is simply independent variables, but here you derived that independence from the maximum entropy principle. If you apply maximum entropy while constraining just the average values, you end up with something independent. And that's not so surprising, because maximizing entropy is maximizing randomness, and independent variables are more random than correlated ones. So this is another motivation for maximum entropy: you're not supposed to add more correlations or more interactions than you need, and maximum entropy is the way to express this mathematically. OK, other questions? OK, so we're going to have a five minute break, but really five minutes, because then we only have 20 minutes left. Let's start again after the break. So let me move on to another example. I'll skip that; you can ask me about it. Well, OK, so I'll skip that. I just want to say that once you have a model of this form, like an Ising model or something like it, you have defined a Hamiltonian for your system.
You may want to study it from the physics point of view. For instance, you may want to see whether there's a phase transition in this problem, this kind of thing. This is the kind of thing my collaborators and I did. One thing you can do is calculate the specific heat of these models that were inferred directly from the data, and what you find is that the specific heat has a peak as a function of temperature. I don't want to stay on this too long, because I want to talk about another application, which is even more intuitive than the neurons: flocking. Flocking is something you may have witnessed yourself; one of the best examples is birds. When you have a big group of starlings, for instance, as was captured here from the Termini train station in Rome, you see that they fly in a very coordinated manner, all in the same direction, and you see this beautiful motion. And it's not just birds; fish do the same thing, in that case it's called schooling, not flocking. Even sheep do the same, and then it's called neither flocking nor schooling but herding. But it also happens at much smaller scales. Even cells do it. These are cells in an epithelial sheet, and here you see their velocity fields; you can see that they show very coordinated motion. These are cells from your skin, let's say. It even happens in bacteria, like the swimming E. coli I was talking about, propelled by their motors: when they're in big groups like this, they also tend to move in a coordinated manner. So it happens everywhere. And there's a group in Rome who specialize in studying this phenomenon. In particular, what they've managed to do is take precise 3D pictures of big flocks of starlings, again from the same place; they set up their cameras on the roof of the Rome train station.
They set up basically three cameras. But the general idea is to have at least two cameras to be able to do stereo photography: you can reconstruct the third coordinate by looking at the difference between your two pictures. This was done by Andrea Cavagna and Irene Giardina and their group in Rome. And there's actually a difficult part to this, because if you know that this bird is the same as this bird, then by stereometry it's very easy to reconstruct the z coordinate. You don't understand that? You're shaking your head. OK, so when you take a photo, you take a 2D picture. So what if you want a 3D picture? It's like seeing in 3D: you can, because you have two eyes. You reconstruct the z coordinate from the slight difference in angle between the two images projected on your two eyes. You can do the same thing with two cameras: by looking at the difference in angle, you can reconstruct z with a simple formula. The problem is that when you look at a flock of birds, there's nothing that looks more like a bird than another bird in the same flock. So while this one is very easy to match, this one is probably this one, and this one is probably this one, when you're right in the middle, knowing that this one is actually this one starts becoming a bit problematic. They used fancy algorithms to solve this matching problem, matching the birds across the two images. But at the end of the day, what they get is a three-dimensional reconstruction of the positions of all the birds. And if you take two pictures very close in time, like a tenth of a second apart, you can even reconstruct the velocities. So you get a big flock, like you see here, a flock of about 1,000 birds, and you get all the velocities, which are represented by arrows. And what you can see is that they really all fly in the same direction. It's very polarized.
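As a parenthetical on the stereo formula just mentioned: with two cameras a known baseline apart, depth is recovered from the pixel disparity between the two images. A minimal sketch, with purely hypothetical numbers:

```python
# Illustrative sketch of stereo triangulation (all numbers hypothetical):
# two cameras of focal length f (in pixels), a baseline B apart; a point
# seen with pixel disparity d between the two images sits at depth
#     z = f * B / d.
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Depth in meters from focal length (px), baseline (m), disparity (px)."""
    return f_px * baseline_m / disparity_px

z_far = depth_from_disparity(f_px=5000.0, baseline_m=25.0, disparity_px=10.0)
z_near = depth_from_disparity(f_px=5000.0, baseline_m=25.0, disparity_px=50.0)
assert z_near < z_far   # larger disparity means a closer bird
```

The hard part, as said above, is not this formula but deciding which bird in one image corresponds to which bird in the other.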
So you can measure polarization. This is a measure of polarization: it's equal to 1 if they all have exactly the same orientation of flight. It's about 95% in that picture; in some others, it's more like 98% or 99%. But it's not just that. If you look at the fluctuations around the mean, so you look at the motion in the center-of-mass frame of the entire flock, which is the same thing as looking at the difference between the orientation of each bird and the mean orientation, this is what's represented on the right here, then you can see that even these fluctuations are correlated over a sizable length scale. You can see that there are essentially two domains, and the length scale here is of the same order of magnitude as the size of the entire flock. You can make this more quantitative by calculating a normalized correlation function between the orientations of two birds as a function of their distance. You can see that this decays, of course, and at some point it crosses zero; this is what you would call the correlation length. So there is a correlation length of about 10 meters in this case. But what's striking is that if you look at different flocks of different sizes, and you plot this correlation length xi as a function of the flock size, you see it follows a linear relationship. What this suggests is that the correlations are scale-free. What does that mean? When you have a correlation length that's of the order of the size of the system, it means that there's no other natural scale in the system, or that the natural scale of the system is much larger than the system size itself. So it's like a finite-size scaling analysis showing that there's no natural correlation scale in this system. This was shown by Cavagna and Giardina in 2010.
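The two quantities just described, polarization and the fluctuations around the mean orientation, can be sketched on synthetic data (this toy data is an assumption for illustration, not the Rome measurements):

```python
import numpy as np

# Toy sketch: polarization is the modulus of the mean flight direction;
# the fluctuations are each bird's deviation from that mean direction.
rng = np.random.default_rng(0)

# synthetic, nearly aligned velocities (NOT real flock data)
v = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.1, size=(1000, 3))
s = v / np.linalg.norm(v, axis=1, keepdims=True)   # unit flight directions

polarization = np.linalg.norm(s.mean(axis=0))      # 1.0 for perfect alignment
ds = s - s.mean(axis=0)                            # fluctuations around the mean

assert 0.9 < polarization <= 1.0                   # highly polarized, as in the data
```

Measuring how the fluctuations `ds` of two birds co-vary as a function of their distance is what gives the correlation function and its zero crossing, the correlation length.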
And the conclusion they drew from this was that the order you observe, the strong polarization you see in these flocks, must be self-organized, because of these long-range correlations. If instead you had centralized command, where everybody is trying to do the same thing because they all get the same stimulus, let's say, or because one bird is telling them what to do, then they would all fly in the same direction, but if you looked at the fluctuations, those would be independent of each other. They're all doing the same thing, but each with its own small error. Here, the fact that their errors are correlated with each other suggests that there's some self-organized or emergent behavior. So can we try and understand this, and can we try and understand it using maximum entropy? Here, we'll follow essentially the same strategy as for the neurons. I characterize bird i by its orientation of flight, which is simply its velocity normalized. What I want is a model where I'm constraining the pairwise correlations between these directions. So here, I put an arrow because it's really an arrow in three dimensions. It's not the same arrow as I had before; before, the arrow was over x because x was a vector of n dimensions. Here, each bird's direction is a three-dimensional vector. So if I apply my rule of maximum entropy, what I get is a model that looks like this. Again, it looks a bit like a spin model. In fact, you can view this as some sort of potentially disordered Heisenberg model. And here you can really see why this could be a model of alignment if all my J's are positive: if all my J's are positive, it is as if I had a Hamiltonian, my Hamiltonian here being minus what's in the exponential.
If all my J's are positive, that means that each bird is trying to do the same as the others, because I get a negative energy contribution whenever two spins are aligned. It's exactly what happens in a ferromagnet: two spins that interact with each other want to point in the same direction. Now, once we define things this way, we run into a problem, which is that we cannot actually measure these averages. Because what we typically have in this problem is one snapshot at a time; we see one configuration of the flock. So we cannot do these averages. That's a problem for maximum entropy, because you need to start from these means to be able to fit the parameters. To circumvent this problem, we assume, and this is quite different from what we assumed for neurons, spatial translation invariance. What this means is that we think the rule of interaction of the birds with each other doesn't depend on where they are in the flock. We assume translational symmetry, in other words. And to impose this translational symmetry, we simply make an ansatz for the form of the Jij. We say that Jij can take two values: either it's a constant J, if j is one of the first nc neighbors of i — so this would be i here in the middle, and here I would have nc equal to 6, so the bird interacts with its six nearest neighbors — and otherwise, there's simply no interaction. So this way I really define a ferromagnet this time. I say that two spins, or directions of flight, interact if they are neighbors — this is the usual notation for neighbors — and by neighbors I really mean this: j is one of the nc nearest neighbors of i. You see, I started from n times (n minus 1) over 2, so of the order of n squared parameters, and now I'm down to two parameters: J and nc. This is the interaction strength, and this is the interaction range.
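The two-parameter ansatz above can be written down in a few lines. This is a hedged sketch on toy positions and directions (J, nc and the data are illustrative assumptions, and the neighbor relation is used in its directed form):

```python
import numpy as np

# Sketch of the ansatz: J_ij = J if j is among the nc nearest neighbours of i,
# 0 otherwise; the Hamiltonian is H = -J * sum over neighbour pairs s_i · s_j.
def nc_neighbors(x, nc):
    """Indices of the nc nearest neighbours of each bird (positions x)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a bird is not its own neighbour
    return np.argsort(d, axis=1)[:, :nc]

def flock_energy(x, s, J, nc):
    """Alignment energy; factor 1/2 so each pair is counted once."""
    nbr = nc_neighbors(x, nc)
    e = 0.0
    for i in range(len(x)):
        e -= 0.5 * J * np.sum(s[nbr[i]] @ s[i])
    return e

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=(50, 3))        # toy bird positions
s_aligned = np.tile([1.0, 0.0, 0.0], (50, 1))   # perfectly polarized flock
s_rand = rng.normal(size=(50, 3))
s_rand /= np.linalg.norm(s_rand, axis=1, keepdims=True)

# an aligned flock has lower energy (higher probability) than a disordered one
assert flock_energy(x, s_aligned, 1.0, 6) < flock_energy(x, s_rand, 1.0, 6)
```

With positive J the aligned configuration is favored, which is exactly the ferromagnetic behavior described above.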
So then the game is the following: I have my data, and I will simply use maximum likelihood to find these two parameters given the data. I just look at one snapshot, I write down the probability of getting this particular snapshot given J and nc, and I maximize it with respect to J and nc. So I take the data collected on the roof in Rome, where I have all my velocities, and given this form of the model, I just optimize over these two quantities. And the first finding is what I said before: once you have a model, you want to check that it correctly predicts observables that you didn't fit directly. Here, one that was not fit directly is the entire correlation function as a function of the radius. And you can see that it fits very well: the maximum entropy prediction, which would be simply this form, agrees very well with the measured one. Now, that may seem to contradict what I just said, because I said I constrained the correlation functions in the beginning, and now I'm telling you that this is an independent validation of my fit. The reason is that once I've reduced my parameters in this manner, I can show that this is equivalent to a maximum entropy model constrained by a local index of correlation, which is just a single empirical mean. So really, when I'm doing this fit, I'm constraining the value of a local correlation index. And the interaction range I find from this fitting procedure is about 1 meter. So in a way, my fitting procedure makes sure that these points here, within 1 meter, reproduce the data well, but all the rest of the curve could have gone wrong, essentially. So it's really an independent validation of the model. One can also look at higher-order correlations. These were clearly not fit by the model, and they agree fairly well too.
And the kind of question one can answer with this kind of analysis is something biologists were interested in. One of them was whether the interaction range follows a so-called metric or topological rule. Let me explain what that means. Metric would mean that each bird interacts with the other birds within some fixed radius rc around it. The consequence is that if the flock densifies, and flocks can take on different densities, then the number of interacting partners should increase, since the interaction range is a fixed radius. The alternative hypothesis people had was that the rule is topological. Topological means that it doesn't care about absolute distances: each bird just takes a fixed number of neighbors, irrespective of the density. So this is the same flock as here, but see, here it keeps six interacting partners, whereas in this one the number increases. A proxy for the density is the mean distance between two birds. In the topological case, the number of interacting partners is constant irrespective of this density; the curve would be flat. In the metric case, these two quantities would be directly related. So what do you think birds do, metric or topological? I hear both. Do you have something to say in favor of metric? That is, metric would mean that if more birds come close, then they take more of them into account. OK. So it's not obvious what they do. It's also not obvious what they should do. People have argued that the topological rule offers more stability and robustness to the flock: if the flock expands a little bit and you have a metric rule, some birds will lose neighbors, and in that case they may not align as well, they may be lost to the flock, and you get a breaking up of the flock.
Well, I mean, the size scaling is not really, it's really a density scaling, right? But size scaling, as long as the interaction range is finite, is not really a problem; the Hamiltonian will be extensive, et cetera, and so on and so forth. OK, the answer is topological, as you can actually plot. You can analyze many different flocks, which have many different densities, and then you can look at nc to the one-third power as a function of the density proxy. And you see it's flat. It also does not depend on the size of the flock. You might be worried that it would depend on the size of the flock, because remember, the correlation length depends on the size of the flock; in fact, it depends linearly on the size of the flock. So here, just to come back to that point: you get a long correlation range that scales linearly with the size of the flock from a purely local interaction range that doesn't. This is exactly a prototypical example of emergent behavior in physics: you have local interactions, and it doesn't really matter how local they are or what the system size is, and from that you get global order, no matter what the scale is. This is exactly what you would predict from the ferromagnetic Heisenberg model: you get this long-range order. What else do I want to say about this? Maybe just this. OK. So far, I just talked about the orientation. The orientation is the velocity normalized, so its modulus is equal to 1. But in fact, the fact that you get a correlation length that scales with the flock size is also true if you look at the fluctuations of the modulus of the speed vi. And that's a bit more surprising, actually. Let me explain why. When you think about the orientation, the orientation of the entire flock reflects a natural symmetry of the system, right?
Of course, you have to forget the fact that there's gravity, so maybe up and down is not the same as east, west, north, south. But otherwise, you can reasonably assume that there's an invariance in the direction: there's no preferred direction of flight. And when you have a symmetry like this, you may know that Goldstone's theorem predicts that you get scale-free fluctuations. So maybe, in fact, this is something that would come out of any sort of model you could write down in this physics spirit. But for the modulus of the speed, it's not clear why you should get scale-free fluctuations. Because while the orientation is arbitrary, there being no preferred direction of flight, the speed cannot be arbitrary. It cannot go out of bounds: you cannot observe a bird flying at arbitrary speed, for instance. The bird needs to control its speed somehow, also for physical reasons. So to try and understand this, we wrote down a maximum entropy model constrained by something slightly different. Instead of taking this, we constrain the difference of the velocities, but this time not just the orientations, the actual velocities. And we need to add two things: one, the mean velocity, and two, the second moment of the velocity. If you do this, you can show, using the same maximum entropy technique as always, that this is the form of the model you end up with. Here, nij means that i and j are neighbors: nij is equal to 1 if they are, and 0 otherwise. We usually call this an adjacency matrix; that's what defines the network of interactions. So you get this kind of model. And I won't show it now, but one can show that, in some approximation, you can break this down into two parts: one part that is equivalent to this Heisenberg model, and one part that is completely specific to the modulus of the speed.
So in other words, the orientation and the modulus of the speed decouple from each other. That's only true in the approximation where these quantities here are small. But OK, once we write this, we can interpret the terms in the following manner. This one is a coordination between neighbors: each bird tries to have a velocity that's as close as possible to its neighbors'. Remember, this is the energy, so each bird is trying to minimize the difference with its neighbors. And this one says that each bird is trying to control its own speed: it's just a harmonic potential around a preferred speed v0. You may not recognize it at first sight, but this is very similar to Landau's model, the Ginzburg-Landau theory that was used to describe superconductivity. In Landau's model, the Hamiltonian is written in this form, probably something you've seen in your studies. I put a J here; usually it's normalized away. This is a very generic model where the order parameter phi, which depends on the position x, has this gradient contribution, a smoothing contribution telling you that the order parameter at two neighboring points should be similar: there's a penalty for large gradients. Then you have this g phi squared term. And then, as you know, there's a lambda phi to the fourth term next. The idea of this model is that the point where g crosses 0 is a critical point. In general, this is a phenomenological model, so you can express g as a function of the control parameters of your system, for instance the temperature. What happens when g goes to 0 is that the lambda phi-fourth term takes over, and this is where you get a second-order phase transition. So here, you see, you have a very similar structure. Think of phi as v, and think of x as the position of your bird. Instead of having a continuous medium, you have birds, which are individual entities.
But they're related by a network of interactions, and you can view this network of interactions as your lattice. When the lattice spacing goes to 0, or, said otherwise, when you look at your flock of birds from a large distance, it looks continuous. So you can see that this term here is very similar to this J term here: it's like the gradient term. And here you have this g term. There's no lambda phi-fourth term here; other than that, it's the same kind of model. And what happens is that you would have a critical point if g goes to 0. When g is negative is when you really need that quartic term, otherwise your theory is not normalizable; but as long as g is positive, you don't need the lambda. So what we can do is, again, take the data and fit the parameters we have. We actually have four now: J, g, v0, and nc. We can do this, again, using maximum likelihood. If we do this, we find these values, and we look at the quantity g; what's really important is its ratio to the other parameters. We find that it's indeed very small. So it looks very much like the system is close to a critical point. And maybe we shouldn't be surprised, because I said the correlations were scale-free, and if we just had this kind of theory with g away from 0, so with a sizable positive g, there's no way you could get these long-range correlation functions. You can only get long-range correlation functions close to the critical point. From the biological point of view, what's interesting is that this g is the parameter that controls, remember, I said this was a harmonic potential, how flat this harmonic potential for individual speed control is. And if I find that g is very small, that means that individual speed control is very weak. There's a very flat valley, and in principle, the birds could fly at almost any speed they want.
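The structure of the speed part of this model can be sketched as follows. This is a hedged illustration in my own notation, with coefficients and the toy interaction network chosen only for illustration, not the fitted form from the data:

```python
import numpy as np

# Hedged sketch (my notation, illustrative coefficients): a speed energy with
# a neighbour-coordination term plus a harmonic individual speed-control term
# of stiffness g around the preferred speed v0. Small g = weak individual control.
def speed_energy(v, nbr_pairs, J, g, v0):
    coord = sum((v[i] - v[j]) ** 2 for i, j in nbr_pairs)  # coordinate with neighbours
    control = np.sum((v - v0) ** 2)                         # individual speed control
    return 0.25 * J * coord + g * control

pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]   # toy interaction network (a chain)
v_fast = np.full(5, 12.0)                   # whole flock drifts above v0 = 10

# a collective drift in speed costs little energy when g is small,
# a lot when individual control is strong
assert speed_energy(v_fast, pairs, 1.0, 0.01, 10.0) < speed_energy(v_fast, pairs, 1.0, 1.0, 10.0)
```

Note how a uniform drift leaves the coordination term at zero: only the g term resists it, which is why a small fitted g means the flock's common speed is barely pinned down.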
But if you look at the actual fluctuations of the speed, they're of the order of 7%, which is very small. And the reason they're small is that even though the speed control of each individual bird is weak, they all listen to their neighbors, which themselves listen to their neighbors, and so on and so forth. So you get an emergent behavior where, even though every bird has very little individual control, they all try to do the same as the others, and they leverage this small control to end up with a very tight collective control of the speed. Why is that interesting from a biological point of view? Because they could have achieved even tighter control by having a large g, a strong individual control. But if they did that, the flock would be less flexible. The idea is that when you're close to the critical point, you're also more sensitive: if you apply an external perturbation to the flock, say a predator comes in, the flock will respond faster. Whereas if every bird has very strong individual control, then even when there's a perturbation, they all try to keep to their set value, and they won't be very reactive. So this was the idea of the Goldstone modes. This is just to summarize; this is a picture taken from Wikipedia or the internet or whatever. The idea is that typically, in this kind of model, you have a rotational symmetry. That's the direction here called pi: it's basically the transverse part of the velocity. And because there's no preferred direction, this is a free direction, with no resistance in terms of energy. The s part, sorry, that's the wrong notation, here this would be the v part, the absolute velocity, on the other hand, is up against this harmonic potential. So Goldstone's theorem tells you that you should get scale-free fluctuations in the rotational direction.
But you don't necessarily get scale-free fluctuations in the radial direction, unless g, and therefore the flatness of this Mexican hat, goes to zero. And let me, to finish, give a third example, because there's been a popular application over the past eight years or so. It's about sequence modeling. I'll just briefly gloss over it, so don't worry too much about it. The idea is the following, just a very quick summary. So the problem is this. As you know, proteins are made of amino acids, so you can describe a protein simply by its sequence of amino acids. And in nature you sometimes find very similar proteins in different organisms, or sometimes even different versions in the same organism, with very similar functions or very similar structures. People have collected these examples and tried to align them to each other. This is the kind of thing they get: each row here is a protein sequence, and the positions have been aligned with respect to each other, so that the positions along a given column are homologous from one protein to the next. And the thing to notice is that at each position, you have quite a bit of variability: there are possible variations in the choice of amino acid. However, all these proteins fold in pretty much the same manner. So the question people asked themselves is: how do we characterize statistically the allowed variability in the composition of these proteins? A key idea is that, of course, sometimes two amino acids can perform more or less the same function. But another key idea is that these variations are correlated across different positions. The reason is the following: sometimes during evolution, an amino acid mutated somewhere, and this could only be accommodated by mutating another amino acid at a different position.
To see why that's the case, you have to think of it as some sort of lock-and-key behavior. Let's say that these are your two positions, and physically, in the protein, these two positions are close to each other, nicely packed together, so that this amino acid here in purple is nicely complementary to this amino acid here in blue. But now imagine that you mutate it to this amino acid here in orange: instead of having this triangular shape, it now has a circular shape. So they don't really match anymore. It's like you changed the key but not the lock. From the evolutionary point of view, these proteins won't function very well; they'll be cleared out by evolution, because the organisms that carry them will be selected against. So that's not good: these have low fitness. However, if a second mutation occurs, so that the blue one becomes the green one, and the green one again has a shape nicely complementary to the orange one, then this becomes viable again, this becomes fit again. So in this situation, you see that this purple one can turn into the orange one, provided that this second mutation also occurs. These are called compensatory mutations. And the consequence is that when you look at many, many proteins that evolved and have pretty much the same function, you should see correlations between the amino acid composition at these two positions. So what people have proposed is to write down a maximum entropy model where the observables that are constrained are basically the pairwise marginals: the pairwise composition of amino acids at two positions i and j.
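The pairwise marginals just mentioned are simple to compute from an alignment. A minimal sketch on a hypothetical toy alignment (the sequences below are made up for illustration):

```python
from collections import Counter

# Toy alignment (hypothetical sequences): the constrained observables are the
# single-site frequencies f_i(a) and the pairwise frequencies f_ij(a, b) of
# amino acids a, b at aligned positions i, j.
msa = ["ACDA", "ACDG", "AKEA", "AKEG"]   # 4 toy sequences, 4 aligned positions

def f_i(msa, i):
    """Frequency of each amino acid at column i of the alignment."""
    counts = Counter(seq[i] for seq in msa)
    return {a: n / len(msa) for a, n in counts.items()}

def f_ij(msa, i, j):
    """Joint frequency of amino-acid pairs at columns i and j."""
    counts = Counter((seq[i], seq[j]) for seq in msa)
    return {ab: n / len(msa) for ab, n in counts.items()}

# columns 1 and 2 covary perfectly in this toy: C always pairs with D, K with E,
# as two compensatory positions would
assert f_ij(msa, 1, 2) == {("C", "D"): 0.5, ("K", "E"): 0.5}
```

In the real application each column can take 21 states (20 amino acids plus the gap), but the counting is the same.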
And if you use all these observables, the kind of distribution you end up with looks, again, like a statistical physics model, where now si, instead of taking the values plus 1 or minus 1, or 0 and 1, et cetera, can take one of 20 values, the 20 values that amino acids can take in a protein. Maybe you've heard about the Potts model; it's one of these models where, instead of having a spin, you can have q different colors. So here it's exactly that: you have q equal to 20, in fact 21, because you also have alignment gaps. So you can have 21 different colors, and you can have all the possible interaction terms; it's a disordered Potts model. And the reason people did this is because they wanted to know, in the structure, which amino acid interacts with which amino acid. You don't necessarily know this unless you have structural information about the protein, which in many cases you don't. So what they wanted was a way to predict who is interacting with whom by just looking at which of the J's here were large. If you have a large value of this interaction parameter J, then it's quite likely that the two amino acids are actually close in the physical structure. Remember, a protein sequence, this is what's called the primary structure, is a linear structure. But a protein folds, meaning that it makes turns, and you can have two positions in a protein that are far from each other in the primary structure, so on the linear sequence, but close in 3D. You don't know that by just looking at the sequence; you want to predict it. And this offers a way of predicting it. This is what people did, and using this maximum entropy method, they could actually predict very well, here in red, the contacts between different amino acids along the sequence.
An important point of their analysis was that if, instead of examining the interactions, you examined the correlations, these pairwise marginals, then you would do significantly worse in predicting these physical contacts. And the reason, again from physics, shouldn't surprise you: if you have three nodes that interact in this way, A with B and B with C, but with no direct interaction between A and C, then A and C may actually be distant in the physical structure. However, since A interacts with B, which interacts with C, A and C may still be strongly correlated. So if you just looked at the correlations, you might be led to wrongly conclude that A and C are close to each other, whereas they actually interact only through B. By doing this kind of analysis, you basically disentangle the interaction network from the correlations, so that you get the J's directly, and those are the correct measure of contacts. OK, so I'm done. Thank you for your attention. Tomorrow we have the exam. It's multiple choice, designed to be very easy if you followed the course and tried to do the problems. There are 20 questions.
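As a final numerical aside, the A-B-C chain argument above can be checked directly: with couplings only between A-B and B-C, the A-C correlation is still large even though the direct A-C interaction is zero.

```python
import numpy as np
from itertools import product

# Tiny check of the chain argument: three ±1 spins under p ∝ exp(J(ab + bc)),
# so A couples to B and B to C, but J_AC = 0.
J = 1.0
states = list(product([-1, 1], repeat=3))
weights = np.array([np.exp(J * (a * b + b * c)) for a, b, c in states])
p = weights / weights.sum()

corr_ac = sum(pi * a * c for pi, (a, b, c) in zip(p, states))
# for a chain this equals tanh(J)^2 ≈ 0.58 at J = 1: strongly correlated,
# yet not directly interacting
assert abs(corr_ac - np.tanh(J) ** 2) < 1e-12
```

This is exactly why reading off the inferred J's, rather than the raw correlations, is the right way to predict physical contacts.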