I thank the organizer for giving me this opportunity to tell you a little about something we have been doing recently. It is certainly related to physics and chemistry, though not so much to real biochemistry and chemistry; hopefully it will still be interesting. In this talk I want to argue that there is a thermodynamic structure in data: data coming from a system, with a few assumptions, but as minimal as possible. Ultimately, then, this is a theory about data. About the data itself we assume only that it is sufficiently large, and I will give a more precise definition later. One of the most important things I want to argue is that the notion so treasured in physics, namely energy, can ultimately arise in this context.

Let me bring you back to the origin of thermodynamic energy, in particular the internal energy concept, which was established around 1840 to 1850. I want to call everybody's attention to the fact that at that time the internal energy concept was not yet connected to the atomic understanding of matter and the kinetic-plus-potential-energy formulation that we now take almost for granted. The concepts of energy and entropy were then expanded by Gibbs and Helmholtz; note the years, 1870 to 1882, quite a few years later. The most important new idea in their work was the concept of free energy, together with the chemical potential. But we now know that the Gibbs free energy and the chemical potential can be fully developed from another point of view, phenomenologically, from chemical kinetics in terms of rate equations, without any reference to classical, Newtonian mechanics, in fact without any notion of mechanical energy at all. In that sense free energy, Gibbs free energy, and the chemical potential, and, as I will ultimately argue, internal energy as well, can arise in a statistical description of certain systems. What I just said "we now know" is encapsulated in three papers that came out about five years ago. But that is not the content of my talk; it was just our motivation. And I don't need to convince anyone here, because this meeting is on stochastic thermodynamics, and one of the fundamental insights of this field is how much the mathematical theory of probability can help us understand the concept of entropy, and of course, what most of us here are particularly interested in, entropy production. Today, however, I want to bring us all back to entropy itself.

More interestingly, if you immerse yourself in biochemical studies, especially studies of equilibrium and of kinetics, you realize that the whole scientific practice goes, in rough description, something like this. You first classify the molecules of interest into species. Then you basically do counting; in real measurements you measure concentrations. From those we calculate equilibrium constants all the time. In the last step we convert the equilibrium constant into a thermodynamic energetics, usually a Gibbs energy, and if we have the luxury we also change the temperature. Scientific, physical, and chemical discussion starts from there. I want to argue that this whole practice can literally be lifted as a paradigm for studying any kind of data.
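For reference, the textbook relations behind that last step, converting an equilibrium constant into a Gibbs energy and then exploiting the temperature dependence, are (standard chemistry, recalled here only as a reminder):

\[
\Delta G^\circ = -RT\,\ln K_{\mathrm{eq}},
\qquad
\frac{\partial \ln K_{\mathrm{eq}}}{\partial (1/T)} = -\frac{\Delta H^\circ}{R},
\]

the second, van 't Hoff, relation being what changing the temperature gives access to.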
The central message of today's talk is that the concept of internal energy can be derived through statistical concepts, very much in the spirit of P. W. Anderson's idea of how an emergent phenomenon is defined, here emerging from data. We coined the phrase "data ad infinitum": we visualize, we hypothesize, we idealize that data collection can go on and on forever, until there are infinitely many data points. Of course, in reality you can only collect finitely many. I like to lift a paragraph out of Anderson's famous 1972 paper; pay particular attention to the highlighted passage. He said that, starting from the fundamental laws and a computer, we would have to do two impossible things, solve a problem with infinitely many bodies and then apply the result to a finite system, before we synthesize the behavior. In my case today the fundamental law is the theory of probability, and I do a little more mathematics, limit theorems, so I don't need a computer. Solving the problem with infinitely many bodies is exactly what I am going to do: let N, the amount of data, go to infinity, and then apply the result to a finite system. Of course, applied science ultimately needs to bring these idealized results back to reality; the good news is that we have plenty of experience telling us that the infinite-N result, sometimes even something like N of thirty or fifteen, can be sufficient to understand the real system. The behavior synthesized in this way is what I call the emergent phenomenon.

So first, let us assume we have a certain type of recurrent data. The word recurrent is deliberately left a little vague here; it means you see something again and again and again. We assume we can collect this data forever, and that eventually essentially every statistical pattern in the data becomes stabilized; that is what we mean by the data being reproducible. This includes the assumption that if you do the experiment again, starting from the first data point in another setting, the data will give you exactly the same statistical distributions of the quantities you are interested in, if you collect all the way to infinity. I would argue that this assumption on the data is fundamentally what physicists, especially statistical physicists, have always asked for under the name ergodicity. But note that I put this assumption on the data; I do not put it on the system itself. That makes a huge difference, and if I have time we will go into the subtlety. Basically, I don't care what the system is; I care that ergodic data is already in hand.

For simplicity, all my discussion today will assume a finite state space with only K possible states. To most of you this seems far too restrictive an assumption, but I want to assure you that it can be generalized to countably infinite and even continuous state spaces; there is enough mathematics to support the generalization of what we talk about. There are subtleties, and there are certainly challenges, but for simplicity of presentation I will restrict myself to a finite state space. Furthermore, with the finite state space, and now I am starting to talk about the assumptions of my model for the data, the model I use today is specifically an i.i.d. sample. The i.i.d. case can be generalized to Markov processes; as many of you know, in recent years our field has been very interested in level-2.5 large deviation theory, and so on. The theory in connection with Markov chains is summarized, or at least touched upon, in the paper shown here, but I am not going to talk about that either.
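To make the assumption of ergodic, recurrent data on a finite state space concrete, here is a minimal Python sketch; it is my own illustration of the setting, not part of the talk, and the particular probability vector is arbitrary. It draws an i.i.d. sample on K states and shows the empirical frequencies stabilizing as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4                                  # finite state space {0, ..., K-1}
p = np.array([0.1, 0.2, 0.3, 0.4])     # underlying model, unknown to the observer

for N in (10**2, 10**4, 10**6):
    data = rng.choice(K, size=N, p=p)          # i.i.d. "recurrent" data
    nu = np.bincount(data, minlength=K) / N    # empirical frequency nu_k = n_k / N
    print(N, nu)                               # nu approaches p as N grows
```

The point is only that the statistical pattern of the data, here the empirical frequency, stabilizes; nothing about the mechanism that generated the data is used.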
OK, so let me review what we know as large deviation theory, which is a statistical description of infinitely large stationary data. The quantity I am particularly interested in is the empirical frequency. Since I only have K possible states, you can go through the data, count the occurrences of each state, and divide by the total number of observations; in other words, a normalized empirical frequency for each particular state k, with lowercase k running from 1 to capital K. When N, the total amount of data, goes to infinity, this empirical frequency converges to what we understand as the probability for the system, and I use a Dirac delta function here to indicate that convergence. Of course, this convergence just tells you that every other possible frequency, not equal to p, has probability zero in the limit. But being zero does not tell us enough; what we would really like to know is how big this infinitesimal is. Large deviation theory tells us we can do better by quantifying this infinitesimal in the form of an exponential in minus N, with N going to infinity, times a rate function defined over all possible frequencies. Every frequency has a value of this function, and the value tells us the rate at which the probability of all those sequences whose frequency is not equal to p goes to zero. If you do the computation, that rate function turns out to be the KL divergence of information theory, the standard rate function of large deviation theory. Here I particularly want to call your attention to the structure of this function: here is the empirical data, and here is our model. So far the model comes from the theory of probability, so I call it a Kolmogorov model, a model based on the theory of probability. Later I will suggest an alternative way to understand the same function.

Now, with this definition of the rate function as a convergence rate, we all know the following mathematical result very well. If I have two terms, each growing exponentially in N with rate constants a and b, then in the asymptotic limit only one term dominates, with a rate c that is simply the bigger of the two, and you can obtain c by taking the logarithm, dividing by N, and taking the limit. Therefore, the very definition of the rate function on this slide gives you a variational principle whenever you are dealing with multiple terms. I want to say that the fundamental origin of entropy maximization is not some statistical argument; it is not that we want a fairer model, or that because we don't know something everything should be treated as equal. It is fundamentally a consequence of the very definition of this rate function, which we now identify as entropy.
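In symbols, the two facts used above read as follows (standard notation for Sanov's theorem and the maximum-term principle; my transcription, not the speaker's slides), where \(\nu^{(N)}\) is the empirical frequency from N observations:

\[
\Pr\{\nu^{(N)} \approx \nu\} \asymp e^{-N\,I(\nu)},
\qquad
I(\nu) = D_{\mathrm{KL}}(\nu\,\|\,p) = \sum_{k=1}^{K} \nu_k \ln\frac{\nu_k}{p_k},
\]

and, for the variational structure,

\[
\lim_{N\to\infty} \frac{1}{N}\ln\!\big(e^{Na} + e^{Nb}\big) = \max(a,b),
\]

so that when several exponential terms compete, only the one with the largest exponent survives, which for probabilities of the form \(e^{-N I}\) means the smallest rate function.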
Now let me introduce an alternative narrative that ultimately leads to the same KL divergence. In this alternative narrative we start with one sequence of data; it is recurrent, and we assume it is, by my earlier definition, ergodic, and it can be extended to infinity. Note that here I have only one particular sequence. That is a very different question from the one large deviation theory asks, which concerns all possible sequences with a certain frequency. So now, given this one sequence of data, I am interested in its probability as the sequence gets longer and longer and N gets bigger and bigger: what is its probability, and how does it change? For my problem this is very easy, because the state space is discrete and, under the i.i.d. assumption, you keep multiplying the probabilities of the successive states. Ultimately this probability is just a large product of the independent probabilities of the individual terms. If you rewrite it as an exponential, you pick up the probability of state k times the number of terms in state k, normalized by N. This quantity, minus the sum over k of ν_k log p_k, is the rate at which the probability of the sequence vanishes. For most sequences, all except one type, this literally says that the probability goes to zero really fast; and yet, at the end, the data has no randomness, the data is certain, because this is the ergodic hypothesis I imposed on the data. I therefore coin the term "vanishing randomness" of a particular sequence. It has the general form of a sum over k of ν_k, the frequency in the data, times the logarithm of p_k, your model. Note that this is about one particular observation with big data.

I would like to start my discussion from this quantity. The first thing we ask is not an exercise in the theory of probability; it comes from our intuition: if I now consider all possible different models p, which one is a better model for the given set of data? That turns out to be an important principle in statistics established by R. A. Fisher, and I will follow that idea, widely known as the maximum likelihood principle: we look for the probabilistic model such that the data sits exactly at the peak of this likelihood. This is not an argument about big probability versus small probability; this is so important that Fisher deliberately called the quantity the likelihood, to avoid using the word probability. Now, what do we realize? A simple exercise shows that the entropy we know widely from Shannon is intimately related to this maximum value. Taking the maximum, the resulting quantity, the sum over k of ν_k log ν_k (sorry, there is a misprint here, the index should be k), is a property of the data alone, independent of the model, because I have already searched over all possible models and found the best one. The i.i.d. assumption, however, is still built in. With this result it is very obvious that the best model, compared with any other model, differs by exactly a KL divergence. Pay attention, though: the roles are now flipped. The model p is the variable I am interested in, because I am comparing all different kinds of models, while the data ν is the fixed quantity in the function. It is exactly the same mathematical functional form, but it has a very different meaning now.
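Written out, the chain of identities just described is, under the i.i.d. assumption (my notation):

\[
\Pr\{x_1,\dots,x_N\} = \prod_{i=1}^{N} p_{x_i}
= \exp\!\Big(N \sum_{k=1}^{K} \nu_k \ln p_k\Big),
\]

so the rate of vanishing randomness is \(-\sum_k \nu_k \ln p_k\). Maximizing the likelihood over the model gives \(p^{*}=\nu\), with maximal exponent \(\sum_k \nu_k \ln \nu_k\), i.e. minus the Shannon entropy of the data, and the gap between the best model and any other model \(p\) is

\[
\sum_k \nu_k \ln \nu_k - \sum_k \nu_k \ln p_k = D_{\mathrm{KL}}(\nu\,\|\,p),
\]

the same functional form as before, but now read as a function of the model \(p\) with the data \(\nu\) held fixed.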
Here comes the interesting and important question. The derivation I just showed you gives the Shannon entropy, after I put in a minus sign. Why should it equal the Shannon entropy derived from counting multiplicity? That is the original derivation, the theoretical outline of Shannon: take a long sequence of symbols, assume the empirical frequencies are given by the set ν, and ask how many sequences have this frequency. That is purely a counting of the number of sequences. Well, the answer turns out to be well known to all of us, through a treasured piece of mathematics, the Kolmogorov zero-one law and the typicality it implies: the collection of all i.i.d. sequences whose empirical frequencies match the probability has asymptotic probability one at large N. Therefore the number of sequences in this collection should be exactly the reciprocal of the probability of each such sequence; hence this equation. This just shows that, in the limit, the idea of typicality makes the two routes agree: counting the number of sequences, and finding the best model for a single sequence. And as I will show you, this latter discussion has interesting consequences.

Now, one tantalizing analogy. It is just an analogy, but it should start us thinking about something more interesting along the way. The very relation I already showed you looks remarkably like what we work with all the time in statistical thermodynamics, off by kT, if we allow ourselves to use kT simply as a scale, a unit for energy: free energy equals mean internal energy minus entropy. Note that in my derivation I started with the term in blue, which I called the rate of vanishing randomness. That gives us the very first glimpse of a possibility of connecting what physicists call energy, introduced through classical mechanics, with a purely statistical perspective on data.

Now let us do even more modeling, in terms of Bayesian statistics. Let me remind you again: my state space is discrete, from 1 to capital K. To build a Bayesian inference scheme, we first need a class of all possible models and a probability on that space of models; the model is what I call θ here. After I put a probability on all possible models, each data point x gives me its probability under each model, and then, using Bayes' rule, I get a posterior. When the next data point comes I do this again. I would like to apply this idea to my discrete-state data, one point at a time, under the i.i.d. assumption. In this case my model is the complete probability vector; I make no further assumption about the probabilities, I only keep the i.i.d. structure. So we consider the set of probability vectors in the probability simplex and start with a prior. After one data point, the prior is literally just multiplied by the particular component of each candidate vector, depending on which data point arrived, and we get the first posterior; then we iterate this all the way to infinity. When you do that, you realize that the sum over states of the frequency of seeing that state times the log of its probability shows up again, this time in the Bayesian iteration. Now, the most interesting thing is that we always need to renormalize by the denominator of Bayes' rule to complete the inference. The denominator is a large integration of the numerator, and it turns out that evaluating it in the limit of N going to infinity is exactly the same mathematics as the maximum-term method, the Laplace method, or the Darwin-Fowler method of statistical mechanics. The answer is that you end up keeping exactly the largest term in the denominator. Thus the convergence rate of the Bayesian sequence of posteriors in function space is again of the KL divergence form, except that it is reversed.
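The analogy and the Bayesian iteration can be written compactly as follows (my reconstruction in standard notation, with kT set to one):

\[
D_{\mathrm{KL}}(\nu\,\|\,p)
= \Big(-\sum_k \nu_k \ln p_k\Big) - \Big(-\sum_k \nu_k \ln \nu_k\Big),
\]

that is, a free-energy-like quantity equals the rate of vanishing randomness, playing the role of a mean energy, minus the Shannon entropy of the data, in the same shape as F = U - TS. For the Bayesian iteration on the simplex, after N i.i.d. observations with empirical frequency ν, the posterior over candidate probability vectors q is

\[
\pi_N(q) \propto \pi_0(q)\,\prod_k q_k^{\,N\nu_k}
= \pi_0(q)\, e^{\,N\sum_k \nu_k \ln q_k},
\]

and normalizing it at large N by the maximum-term (Laplace) method singles out q = ν.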
Again, this time ν is the data, and the p here, I want to emphasize, now has a very different story, narrative, and origin: it comes from the Bayesian logic of the discussion. Therefore I emphasize this by calling it a Bayesian model. But the really interesting and important thing is that we arrive at the KL divergence again, which basically says that large deviation theory has a mirror-image way of being articulated. This result has been called an inverse Sanov theorem. Interestingly, this line of discussion already has a very long history in statistics, where it is called the Bayesian consistency problem: since we have a sequence of data which itself is convergent, one asks whether the Bayesian approach will ultimately lead to a result consistent with that sequence of data. The original work was started by Doob as early as 1949, and many other people followed, but the most intriguing fact in this history is due to Freedman, one of the leaders of the field, who in 1965 found, and here I quote from Wikipedia, the rather disappointing result that when sampling from a countably infinite population, i.e. not the finite state space I discussed, the Bayesian procedure fails almost everywhere. Basically, the nice story I gave you no longer exists in any nice form when you deal with even a countably infinite state space, and of course that implies the same for a continuous state space. It is this result that literally killed the nice story I presented to you. However, I believe that as scientists we should keep the finite-state result and recognize the fundamental idea of statistical physics: we need to know which limit to take first and which to take second. It is pretty obvious that we should always assume the data can only take finitely many values, because after all, on any finite interval from a to b, a data point has only finite resolution, a finite number of digits, which literally cuts the continuous interval into a finite partition. So the finite-state result is a significant result; everything depends on taking the limits in the appropriate order.

Now, the empirical frequency for a whole set of data is, if the state space is large, only a thought experiment. What we are ultimately interested in are empirical mean values of a few real-valued observables, and there is a very clear way to get the large deviation rate function for an observable from the KL divergence. This is where we start to recognize a significant portion of statistical thermodynamics. Let us ask: if I have a random variable g, and I am interested in its empirical mean value over N samples, what is the probability that it falls in a particular interval? Large deviation theory says it again has an asymptotic expression, exponential in N, with a function that gives the convergence rate of the empirical mean value. The more important thing is that this rate function for the empirical mean value is, exactly by the principle of maximum entropy I illustrated earlier, a constrained optimization problem: minimize the KL divergence subject to the constraint that the frequency, multiplied by the random variable and summed, gives exactly that empirical mean value. Frequencies that do not satisfy the constraint cannot produce that empirical mean value, so we don't have to think about them; among all ν that satisfy the constraint we look for the one with the greatest entropy, that is, the smallest rate function.
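In formulas, this is the standard contraction of the KL rate function onto an observable, with g_k the value of the observable in state k and \(X_1,\dots,X_N\) the i.i.d. observations (my notation):

\[
\Pr\Big\{\tfrac{1}{N}\sum_{i=1}^{N} g(X_i) \approx x\Big\} \asymp e^{-N\,I_g(x)},
\qquad
I_g(x) = \min\Big\{ D_{\mathrm{KL}}(\nu\,\|\,p) : \sum_k \nu_k\, g_k = x \Big\}.
\]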
Now, let us recall how we do a constrained optimization problem in calculus: we use a Lagrange multiplier. When we do that, we realize it is always a saddle-point problem; we first write a sup. So let us introduce a Lagrange multiplier for the constraint, take the sup over the multiplier and then the inf over the original variable, and this joint optimization ultimately gives us the constrained optimization. Now, here is something that appears as soon as we do constrained optimization: a few steps immediately show that you can define the Legendre transform of the function you are interested in, transform it, and transform back. This is a consequence, as all the mathematicians in this field know, of Lagrangian duality; it turns into a Legendre transform. That suggests we should carry out a Legendre transform, or more precisely a Legendre-Fenchel transform, of the KL divergence; I will come back to this later. So let us take the KL divergence and do the Legendre-Fenchel transform. Written in compact mathematical form, the transform trades the variable ν itself for a new conjugate quantity μ and performs a minimization. When we do this minimization, clearly we just take the derivative with respect to ν, which immediately tells us that μ is given by the derivative of the function in the green box with respect to ν; that is why μ must be the derivative, which leads back to our traditional understanding of the Legendre transform. Now, a few more steps of this math. (Just to note, we are already into the question time; okay, let me finish very soon.) This immediately gives you what I think is the most important insight, which we really should teach in every statistical thermodynamics class: minus the log of the partition function. In other words, the Legendre-Fenchel transform of the KL divergence has a very authentic role in statistical thermodynamics, because it is the free energy we study all the time; and in that case the conjugate quantity μ is the total mechanical energy, the Hamiltonian, divided by kT. You can also go backwards: take this log partition function, the free energy, do the transform once more, and of course you get back the entropy; we know that, we do it all the time. Combining the two gives an inequality which, if I write it out, many of us will recognize as the Fenchel-Young inequality. Note that it holds for an arbitrary data frequency ν and an arbitrary energetic model μ, and the equality sign holds exactly when the two are related through the Legendre-Fenchel transform, that is, at the minimum. I would like to argue that this inequality is ultimately a time-independent statement, nothing to do with time, of the second-law kind; many people, including Elliott Lieb, have argued forcefully that we should not think of the fundamental second law, as originally formulated by Clausius and others, as a time-dependent problem.

Now let us do a quick comparison, two minutes. If I rearrange the equation I just showed you, writing out each term explicitly, you realize that the probability p and the energy parameter μ always come hand in hand, and everything else is the data. That suggests that a mechanical, energetic description of the data and a probabilistic, statistical description of the data are actually equivalent, because we know that in statistical mechanics we in effect assume the p is one for every state, that is, uniform, and attribute the entire description of the differing frequencies to the energy. This, of course, is Boltzmann's law.
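Spelled out with one common sign convention (my reconstruction; the slides may use a slightly different convention), the transform and the inequality read

\[
-\ln \sum_k p_k\,e^{-\mu_k}
= \min_{\nu}\Big\{ \sum_k \nu_k\,\mu_k + D_{\mathrm{KL}}(\nu\,\|\,p) \Big\},
\qquad
D_{\mathrm{KL}}(\nu\,\|\,p)
= \max_{\mu}\Big\{ -\ln \sum_k p_k\,e^{-\mu_k} - \sum_k \nu_k\,\mu_k \Big\},
\]

and dropping the optimizations gives the Fenchel-Young-type inequality, valid for an arbitrary data frequency \(\nu\) and an arbitrary energy model \(\mu\),

\[
\sum_k \nu_k\,\mu_k + D_{\mathrm{KL}}(\nu\,\|\,p) \;\ge\; -\ln \sum_k p_k\,e^{-\mu_k},
\]

with equality exactly when \(\nu_k = p_k e^{-\mu_k}\big/\sum_j p_j e^{-\mu_j}\), which is Boltzmann's law. As a small illustration, here is a minimal Python sketch (my own, assuming the finite-state i.i.d. setting and a uniform reference model) of fitting an energy model to data frequencies by shrinking this gap, the same use of the inequality that comes up again in the discussion at the end.

```python
import numpy as np

def gap(mu, nu, p):
    """Fenchel-Young gap: nu.mu + D(nu||p) + ln sum_k p_k exp(-mu_k); always >= 0."""
    kl = np.sum(nu * np.log(nu / p))
    return nu @ mu + kl + np.log(np.sum(p * np.exp(-mu)))

nu = np.array([0.05, 0.15, 0.30, 0.50])   # hypothetical data frequencies
p = np.full(4, 0.25)                      # uniform reference model

mu = np.zeros(4)                          # initial energy parameters
for _ in range(2000):
    boltz = p * np.exp(-mu)
    boltz /= boltz.sum()                  # Boltzmann distribution implied by mu
    mu -= 0.5 * (nu - boltz)              # gradient step on the gap; the gap shrinks toward 0

print("fitted energies (up to a constant):", mu - mu.min())
print("implied Boltzmann distribution    :", p * np.exp(-mu) / np.sum(p * np.exp(-mu)))
print("remaining gap                     :", gap(mu, nu, p))
```

At convergence the implied Boltzmann distribution matches \(\nu\) and the gap vanishes, which is the equality case of the inequality above.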
So basically, in conclusion: while large deviation theory provides the mathematical foundation of the entropy concept and its variational principle, in applications I would say we are not studying rare events per se, because we are not sampling data out of a given, known probability. Rather, it is about a set of data which we assume does have a well-defined frequency, and a model which we call p. The system and the measurement can be anything, but the data has already converged, it is ergodic; it contains the truth, which we should model and understand. Of course you have a choice of model for the data: you can use i.i.d., you can use Markovian, you can even assume it is deterministic, in which case you need to prove that there is an invariant probability; and the Bayesian iteration over random data is just a functional form of this modeling. Some of these ideas are contained in this review article, and I close my talk with this picture, which says there are two ways to understand the data. Starting with a prior, the data, if you assume a probabilistic model, gives you an entropy function; the mechanical, physical understanding instead introduces an energy to bias the statistics, which we call Boltzmann's law. In the end, both the energetic description and the statistical description are supposed to lead to a posterior understanding of the data, provided the data is ergodic. Thank you very much.

Thank you for the interesting talk. Are there any questions, please? We have time for questions. Maybe I'll ask one, which might be naive: say you play with examples involving this inequality. Are there any features you see, for instance, depending on whether you are closer to saturating the inequality or not? Any way to compare data sets which, say, show the same amount of deviation from the inequality?

Yes. It turns out this inequality is exactly what people in chemistry, David Chandler and others, work with: there is a theory called free energy perturbation, in which they try to find the right energy function through simulation or through a certain kind of data. In that context they are literally using this inequality, making the gap smaller and smaller, which is supposed to give a better and better model. So the key thing here is this gap, and in that context the quantity is called a non-equilibrium free energy. That gives us the bold suggestion that equilibrium is actually a relation between the data and a model. This is now completely outside the classical physics idea of what equilibrium is.

Okay, thank you. We have a question in the chat. Manvendra, if you can, you can unmute yourself and ask the question; if not, I'll read it.

I just wanted to ask this: can you explain the flow chart which you showed at the end of the talk, with some example?

So, we already know that when you have data, what we want to do is build a probabilistic model, and in that sense we don't really need to entertain any idea or concept of energy. However, there is an important difference between describing the data in terms of a probability and this way of putting our model in the exponent, in terms of an energy. And that is actually related to the idea talked about by Professor Ito earlier: the energy lives in the tangent space of the data.
The data lives in something called the probability simplex, but the energy lives in the tangent space, and there is a nice little geometrical understanding of the relation between the two. More importantly, the probability simplex itself is not even a linear space; it really should be understood as a manifold. In that sense the tangent space has a nice linear structure, which we believe is the origin of our being able to talk about energetics in additive terms, and that is very much the power that ultimately lets us do mechanics well. Imagine this: if three-body energetics and force fields could not be obtained from two-body ones, summed together in some reasonable way, mechanics would be useless. Okay.