All right. And so now, on to another talk. Eric DeGiuli will be talking about the physics of generative grammars. Eric, the floor is yours. And do you want me to give you a comment when you are getting to any particular moment in time, or do you think you're good to go? If you just gesture in the video then I'll try to catch that. Okay, will do. Okay, thanks. So thanks, David, for the introduction. I can't see myself. Let's see. Can you see my video? Oh, no. You either. One second. There we go. Okay. So thanks, David, for the invitation to speak. First, let me preface my talk by saying that I'm coming from a background in glassy systems: I was studying glass and related things and got interested in computation in a broad sense, which led me to what I'll tell you about today. So there's a bit of a difference in philosophy, I think, from some of the work being done in stochastic thermodynamics, but I'll try to motivate my approach as I go. Okay. So the motivation for me to start thinking about grammars and language was that the way we understand complexity in physics, in models that we can really solve completely, is mainly, at least colored by my background, through the spin glass. In that model, complexity is essentially synonymous with having many metastable states; there are many properties that go along with having many metastable states, and that's essentially what we think of as complexity. At a technical level, when we have systems like the spin glass that are complex, this complexity is signaled by so-called replica symmetry breaking in what is essentially an equilibrium partition function. So there's a whole formalism that goes along with that, which has been applied to glasses, to random ecosystems (the Lotka-Volterra model and related models), to the Hopfield model of associative memory, to neural networks, and so on. That was a program that started in the 80s and has continued to the present day. And the question I asked myself was: is this the relevant paradigm to understand complexity in language, language in a broad sense? So the languages could be human languages, computer languages, but also languages of the biological type: the genetic language, the protein language, and so on. And the formalism that I'm going to use to think about this is that of generative grammars. This is a formalism that was developed to quantify syntactic structure in language, starting from Chomsky in the 50s. The main idea is that behind every language there is a set of rules that governs its syntactic structure. I think it's easiest to present by way of an example: on the left, you see a sentence in English, the bear walked into the cave. According to this Chomskyan paradigm, associated with that sentence is a unique tree, and that tree is hidden; the structure of that tree encodes the syntactic structure of the sentence. More precisely, on the nodes of the tree are variables, the hidden symbols, and what they correspond to for human language, for phrase structure, are abstract categories. So you can see things like noun, and then higher-level things like noun phrase and verb phrase. And the way it goes is that there are supposed to be rules that tell you, for a given language, say English, that a sentence can be composed of a noun phrase and a verb phrase, in that order, so that would be one rule. Can you see my cursor as I move it? One rule. Yes, we can. Yep.
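As a minimal aside (not from the talk), here is one way to write down the kind of phrase-structure tree just described, in Python; the small rule inventory implied here (S → NP VP, NP → Det N, VP → V PP, PP → P NP) is an illustrative assumption, not necessarily the rules on the slide.

```python
# A sketch of the parse tree of "the bear walked into the cave" as nested tuples.
# The first entry of each tuple is a hidden symbol (abstract category); bare
# strings are the observable words at the leaves.
tree = ("S",
        ("NP", ("Det", "the"), ("N", "bear")),
        ("VP", ("V", "walked"),
               ("PP", ("P", "into"),
                      ("NP", ("Det", "the"), ("N", "cave")))))

def leaves(node):
    """Collect the observable words at the leaves of a derivation tree."""
    if isinstance(node, str):           # a terminal (observable) word
        return [node]
    _, *children = node                 # first entry is the hidden symbol
    return [w for child in children for w in leaves(child)]

print(" ".join(leaves(tree)))           # -> the bear walked into the cave
```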
And so there will be, according to the Chomskyan paradigm, a set of rules for English that tell you how to form all well-formed sentences. And the key point is that a grammar typically generates an infinite set of sentences, so it's a highly compact way of quantifying structure. From the point of view of computing, information processing, and so on, it's something that should be very efficient, if it's appropriate to the problem at hand. Okay, now let me be a little more general; that was one example, but the whole framework is quite broad, and in fact it even goes back to Pāṇini, who was studying Sanskrit a long time ago. A bit more generally, we can think of a grammar as a set of string-rewriting rules. There are always these two sets of symbols, the hidden ones and the observable ones. In general, we begin with a start symbol and are allowed to apply the rules repeatedly until we get only observables. So for example, a simple grammar is one that has these three rules and only these three rules. We can apply the rules in any order, starting from S. So for example, here is what is called a derivation, written out just as a line of text: S becomes SS, becomes this, and so on, and the end string is aabbab. If instead of a I write a left parenthesis, and instead of b I write a right parenthesis, then you can see this is equivalent to this, and you can easily convince yourself that what this grammar does is generate all well-formed strings of parentheses, which is clearly an infinite set (a small sketch of this grammar appears after this paragraph). So it's a very compact way to generate this infinite set, and in particular an infinite set that's not so trivial: well-formed strings of parentheses are equivalent to trees, so this is actually generating all possible trees. In general, we call the language the set of observable strings generated by a grammar. Okay. The whole formalism is quite vast, and depending on what you allow for the grammar, there are different classes of possibilities that can result. There's the so-called Chomsky hierarchy that classifies those. Starting from the simplest types of grammars, which are the regular grammars, one gets up to the context-free grammars and then more complicated ones: context-sensitive, recursively enumerable. There are different ways to think about these different types of grammar, and I'll show you them graphically later, but from the point of view of computing you can ask, for example: given a class of languages, how complex a computer do I need to be able to parse strings from that language? Parsing means taking the string of observables, the text, and building the derivation, which could be a tree or some other structure. For a regular language, what you need is a finite-state automaton to parse it; a finite-state automaton is a computer that goes through finitely many different states and has no memory, it just reads a symbol and changes its internal state. For a context-free grammar, which are the ones that generate trees as derivations, you need an automaton that has a stack memory: a memory that you can keep adding things onto, but you can only pull from the top. The more complicated languages require memory that is addressable anywhere, and eventually infinite memory. Okay.
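A minimal sketch of that three-rule grammar (S → SS, S → aSb, S → ab), again not the speaker's code: it expands S at random, with a depth cutoff assumed here just so derivations stay finite, and checks that the result is well formed when a is read as "(" and b as ")".

```python
import random

RULES = [["S", "S"], ["a", "S", "b"], ["a", "b"]]   # S -> SS | aSb | ab

def derive(symbol="S", max_depth=8):
    """Apply rules repeatedly until only observable symbols remain."""
    if symbol != "S":
        return [symbol]                              # already an observable
    rule = RULES[2] if max_depth == 0 else random.choice(RULES)
    return [tok for part in rule for tok in derive(part, max_depth - 1)]

def well_formed(string):
    """Check balanced parentheses, reading 'a' as '(' and 'b' as ')'."""
    depth = 0
    for c in string:
        depth += 1 if c == "a" else -1
        if depth < 0:
            return False
    return depth == 0

s = "".join(derive())
print(s, well_formed(s))                             # e.g. aabbab True
```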
And the way that I think is actually easiest to understand the distinction between the different classes of languages is by considering the structure of their derivations. For a regular grammar, the structure is essentially linear: the hidden symbols are going through a Markov process, and as they go through that Markov process they emit observable symbols, so this is accordingly called a hidden Markov model (a toy sketch of one appears after this paragraph). While these seem almost trivial, they are quite useful: if you call grep on your Linux machine, what grep takes as an argument is a regular expression, i.e. a specification of a regular language, so grep can read regular languages. When these things are made stochastic, they are used quite a lot in bioinformatics and also in neuroscience, as I'll talk about a bit later. The next level up in the Chomsky hierarchy are the context-free grammars, which generate trees as their derivation structures, and this is the class that has played the most important role historically in linguistics for phrase structure, but also in computer science. Since this meeting is about computing: another way to think about grammars, derivations and so on is that when you give your code to a compiler, what the compiler is doing is building this derivation structure, which is the thing that is natural to translate into machine code. So when a compiler is compiling your code, what it's basically doing is building one of these trees. Okay, and then there are more complicated grammars, and their derivation structures will not be trees anymore, they'll be more complicated graphs with loops, but let me just skip over that for now. Okay, so these grammars have been studied since the 60s, quite intensively in the 60s and 70s and a bit less so now, and there are certain classical questions one can ask. As a forward problem, you can ask: if I have a given grammar, what's the language that is produced, the set of all strings that can be produced? Or one can ask the parsing problem: given a grammar, what is the best algorithm to parse text? There are multiple algorithms in general. And a bit more generally, as an inverse problem, you could be given a lot of text and asked what grammar could produce it. And although much is known, and there are algorithms to do all these things, and in some cases, for example for regular grammars, we know everything more or less from the algorithmic side, for more complicated grammars things are still being studied to get better and better algorithms and so on. But little is known about the typical case, so there are more physics-type questions one can ask. For example, for the forward problem: how complex is a typical language? That's the kind of thing we would like to measure using quantities from information theory, entropy and so on. For parsing, one can ask about computing resources, like how much heat is produced to parse text from a certain grammar. And for inverse problems: suppose we get some alien signal, how much text would we need to decipher it? That's a question you would have to answer assuming some kind of typicality of the language. Today, what I'll focus on is just this first question here, how complex is a typical language, and I'm going to focus on context-free grammars, which generate trees.
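A toy sketch of a stochastic regular grammar viewed as a hidden Markov model: the hidden symbols follow a Markov chain and each one emits an observable symbol. The two-state transition and emission matrices below are made-up numbers, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = np.array([[0.7, 0.3],          # P(next hidden state | current hidden state)
              [0.4, 0.6]])
E = np.array([[0.9, 0.1],          # P(observable symbol | hidden state)
              [0.2, 0.8]])
alphabet = ["a", "b"]

def sample(length, state=0):
    """Run the hidden Markov chain, emitting one observable symbol per step."""
    out = []
    for _ in range(length):
        out.append(alphabet[rng.choice(2, p=E[state])])
        state = rng.choice(2, p=T[state])
    return "".join(out)

print(sample(20))
```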
The reason is that, okay, they are the most important class for human languages and computer languages, but also they are the simplest class that can produce text with long-range correlations, so it's quite interesting from the point of view of physics. Okay, so you can have in mind a tree like the one I drew already. Here is the way I'm going to define my model: suppose that every rule A → BC is associated with some energy ε_{A→BC} that has to be paid every time we use it. Then any derivation, an entire tree like this, will have an energy associated with it, obtained just by adding up the energies of all the rules that are used: E(σ) = Σ_{A→BC} π_{A→BC}(σ) ε_{A→BC}, where π_{A→BC}(σ) is the number of times the rule A → BC is used in the derivation specified by σ. So σ is encoding all the hidden symbols on the graph. And I should say that I'll be considering context-free grammars, and for context-free grammars it's enough to have rules for the branching of the trees, the rules in the interior of the trees. There's also a whole parallel set of things happening at the leaves; I'm mostly not going to write that, just for brevity, but you should keep in mind that for everything happening in the interior there are corresponding terms at the surface. Okay, so this is a stochastic formulation: for every possible derivation I assign some energy, the cheaper rules are the ones that tend to get used more frequently, and overall the derivations that have a lower energy are the ones that are more grammatical in this framework. A deterministic limit is obtained by sending some, or many, of these ε_{A→BC} to infinity, such that those particular branchings never happen. And my model is going to be a talking parrot, or, even more precisely, an equilibrium talking parrot: a device that is sampling sentences of some typical length from a given grammar, in contact with a heat bath, so I have a whole equilibrium setup here. This is not the way I originally presented the model; at the time I was not really thinking of a specific physical instantiation, but what I did is totally equivalent to considering this. The point is that the parrot is just talking; there's no real input, it's just sampling from the distribution defined by the grammar it has. And so we have the whole apparatus of equilibrium statistical mechanics that can be applied, and we can ask, for example, what is the typical energy, which is the typical cost of producing a sentence of a certain length from this grammar, and so on. Now, the interesting regime is when we are constructing long text, which could be one long sentence or many long sentences together. In principle, observables like this expected energy depend on all the details of the grammar, but as elsewhere in physics we expect universality to hold. More precisely, we expect that the right observables will be self-averaging, in the sense that they don't depend on all the details of the grammar in the limit where the text length goes to infinity. What that motivates is an ensemble approach: instead of trying to compute things for one grammar, which is hard, we can try to compute things over an ensemble of grammars, which then defines what I call the random language model. And the simplest model for my ensemble of grammars is just to let these energies be Gaussian i.i.d.
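A rough sketch of the talking-parrot sampler under the assumptions just stated: N hidden symbols, i.i.d. Gaussian rule energies ε_{A→BC}, and branching probabilities proportional to exp(-β ε). It is not the paper's exact construction: the leaf (emission) rules are omitted, and the tree is simply grown to a fixed depth rather than having its typical size tuned through the mean energy.

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta, depth = 4, 1.0, 5
eps = rng.normal(size=(N, N, N))        # rule energies eps[A, B, C] for A -> B C
W = np.exp(-beta * eps)                 # Boltzmann weight of each rule

def expand(A, d):
    """Expand hidden symbol A for d levels; return (hidden leaves, total energy)."""
    if d == 0:
        return [A], 0.0
    p = W[A].ravel() / W[A].sum()       # P(B, C | A) proportional to exp(-beta * eps)
    B, C = divmod(rng.choice(N * N, p=p), N)
    left, e_left = expand(B, d - 1)
    right, e_right = expand(C, d - 1)
    return left + right, eps[A, B, C] + e_left + e_right

symbols, energy = expand(0, depth)
print(len(symbols), "leaf symbols, derivation energy =", round(float(energy), 2))
```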
Then there's a certain mean value V̄ and some fluctuations quantified by this quantity ε. It turns out that for context-free grammars this V̄ just controls the size of the trees, and I'm going to tune it to get large trees, because I want to consider this limit of large trees. So this ε will be the essential parameter controlling the variance of the rule energies. The way to think about it is that when ε is large, this variance is small: it means that basically all derivations will have the same energy, so there's no discriminating factor between what's grammatical and what's ungrammatical; the parrot will just be babbling, in that sense. If ε is small, different rules have strongly different energies, which means that the different derivations will have very different probabilities, and that allows syntactic rules to be followed very rigidly. Okay, now, as usual when we have any kind of equilibrium problem, what matters in the Boltzmann factors are not energies per se but energies times the inverse temperature β. So the dimensionless control parameters are β times this V̄ and also β² over ε. Again, V̄ is just controlling the size of the trees, so the essential parameter is this one here. Lowering the physical temperature is increasing β, which is equivalent to lowering ε, so T and ε play the same role, which is why I called ε a temperature in my paper. For historical reasons, I chose to fix β = 1 and use ε as the control parameter, but you can just think of it as controlling the physical temperature, if you prefer. So there are two essential control parameters: the first is ε, which I just mentioned, and there's also N, the number of hidden symbols, which is some measure of the maximal complexity of the language that you can have. What I'm showing you now are numerical results for the Shannon entropy of the derivations; more precisely, this is the Shannon entropy of the interior part of the derivations, the interior of the trees. There's a variety of curves here because I've done numerics at different values of N, different numbers of hidden symbols. Numerically I'm measuring the entropy of phrases of length k: k = 1 is just the entropy of single words, k = 2 is for pairs of words, also called bigrams, k = 3 is three words in a row, and so on. The main point is that all of these entropies are near their maximal value above some certain temperature, which means that the derivations are just uniform random noise. But as I lower the temperature below some critical value, they all start to drop quite precipitously, and as I increase k they drop more strongly. What's really of interest is when k goes to infinity, but that's numerically challenging to compute; we expect it to drop quite a lot here. Now, this Shannon entropy of the interior of the trees is not exactly the thing we care about most; we care about the actual language that's produced, and that has its own Shannon entropy. The story is similar: it's less dramatic, but these entropies are flat above this characteristic temperature and then they start to drop, and they drop more for longer phrases.
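A minimal sketch of the k-gram Shannon entropy estimate used for these curves, computed from any sampled symbol sequence; dividing by the maximum value, k log N for N symbols, gives curves that sit near one in the high-temperature regime, as in the plots.

```python
from collections import Counter
from math import log

def kgram_entropy(seq, k):
    """Shannon entropy (in nats) of the empirical distribution of k-grams in seq."""
    counts = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

seq = "abababbbabaaab" * 50        # stand-in for text sampled from a grammar
for k in (1, 2, 3):
    print(k, round(kgram_entropy(seq, k), 3))
```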
So, overall, what we're seeing is the emergence of structure, and I call it deep structure, just meaning that it corresponds to the interior of the trees, at some characteristic temperature. One can understand where this scaling comes from by asking, for a given length of text, what is the Boltzmann entropy of trees, i.e. how many trees there are for a given length of text; that's something you can easily compute. You can also ask, for a given length, what is the typical energy at a certain temperature. What you find is that this critical temperature is the one where the fluctuations in energy are of the same order as the entropy, so it's where those two things compete with each other. In other words, at higher temperatures entropy wins: it doesn't matter that some phrases are more grammatical than others, that's basically irrelevant compared to entropy, so we're just sampling everything essentially uniformly. Below this temperature, energy starts to win, and the parrot starts to say the grammatical things, even though there are far fewer of them. One can also ask whether this is a real transition: is there some symmetry that's broken at this transition, and so on. And indeed, a permutation symmetry seems to be broken at this temperature. What that means is the following: I have a bunch of hidden symbols, things like noun and verb and so on, but in the high-temperature regime, even though those are different symbols, if they are used equivalently then there's no actual functional difference between them; the fact that I call one noun and one verb doesn't mean anything until they start to actually be used in distinct ways. And that's what happens at this transition point. In other words, the permutation symmetry is spontaneously broken at this critical point, and if this is a real phase transition associated with the breaking of a symmetry, there should be some order parameter that is zero above the transition and nonzero below it. One can define, in analogy with spin glasses, a spin-glass-type order parameter that is suited to trees, and it is indeed small above this point and grows below it. Now, one last important thing on this slide: I motivated the ensemble approach by saying that, even though in the real world what we care about is one grammar, if we look at large text we expect that not all details of the grammar actually matter, and that many observables should be self-averaging. You see on this slide that I've shown error bars on the observables. Those error bars are not the errors in the measurements; they are the magnitude of fluctuations over different grammars at those parameters. You see, for example, for the entropies, as I increase k those bars get smaller, or as I increase N the bars get smaller. That suggests that in either of those limits, N going to infinity or k going to infinity, the fluctuations of the entropy go to zero, which means that the quantity is self-averaging. In other words, at a given ε and N, the entropy is uniquely fixed, independent of the grammar. That's useful because it means, for example, that if I measure the entropy and I know the number of hidden symbols, then I can infer the temperature of that grammar just by going across on this curve. And likewise for q: the error bars get smaller as I increase N. Okay.
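A heuristic two-replica overlap in the same spirit as the spin-glass-type order parameter just mentioned (not necessarily the paper's exact definition): sample two independent derivations on the same tree topology and measure how often their hidden labels agree at corresponding nodes, minus the 1/N baseline of purely random labels.

```python
import numpy as np

def replica_overlap(labels1, labels2, N):
    """Fraction of nodes where two derivations carry the same hidden symbol, minus 1/N."""
    agree = np.mean(np.asarray(labels1) == np.asarray(labels2))
    return float(agree - 1.0 / N)

# Toy usage with made-up hidden-label sequences standing in for the tree nodes:
N, M = 4, 100_000
rng = np.random.default_rng(2)
q_high = replica_overlap(rng.integers(0, N, M), rng.integers(0, N, M), N)   # ~ 0
p = [0.85, 0.05, 0.05, 0.05]                                                # strongly biased symbol usage
q_low = replica_overlap(rng.choice(N, M, p=p), rng.choice(N, M, p=p), N)    # > 0
print(round(q_high, 3), round(q_low, 3))
```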
This vindicates the ensemble approach by showing that, at least for these observables, things appear to be self-averaging, although one could do this much more carefully with more statistics and so on. Okay, and overall this suggests a simple picture of learning. It was argued by Chomsky, the so-called poverty-of-the-stimulus argument, that the child must know something about what it's learning; if it didn't know anything, it couldn't learn anything at all. So let's suppose the child only knows that it's learning a context-free grammar, presumably due to some hardware constraints; there's a discussion about that in the neuroscience literature. The child initially doesn't know anything specific about the grammar, since it has to be able to learn any language, so it would be starting in the high-temperature regime, where all the weights are equivalent. Then, as the child imitates its parents, its temperature necessarily decreases: it's tuning the knobs of the grammar to make some things more likely than others, which will necessarily make it move to the left in the diagram I showed. What the model suggests will happen is that the output will look close to noise for a while; even though the temperature is changing, it doesn't really affect the output. Then quite suddenly you cross this transition point and start to produce more grammatical sentences. And this is what most parents will tell you is observed: it's very hard to detect changes in a child's syntactic ability until about two to two and a half years of age, when their ability starts to increase quite dramatically. Okay, this is a very qualitative picture, and one would like to be more quantitative; the key problem is that there's not that much data on real humans learning. There is a study by Ricard Solé and collaborators, for example: they showed that you can build certain graphs from human data, and there's a clustering coefficient that is small below this syntactic transition and large above it. We could show that in our model the same thing happens, so the transitions seem to be analogous, but again the data is quite noisy, so it's hard to draw definitive conclusions from it (a rough sketch of this kind of graph measure appears after this paragraph). Okay, now, going back to my initial motivation for this problem: I initially hoped that this model would be solvable in the sense of stat mech, that we could compute the partition function, understand the phase structure and so on. This is probably too ambitious, although one can develop theory to compute some things. For example, one can develop theory to predict this order-parameter-like quantity q at high temperature. The black curves here are the theoretical prediction, without fitting parameters; down here there is just a sampling artifact that's totally understood, but they track the data very well. Work is ongoing in the lower-temperature regime to understand what happens there. And then, okay, when one faces a model that's not solvable, the natural thing is to look for simpler models, but it's hard to make a simpler model that has the same phenomenon: you can make simpler models, but they have no transition, they just produce noise all the time. Basically you need a large dynamic range to get anything interesting, to get low entropy, but that makes the model hard to solve.
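As promised above, a rough sketch of a word-adjacency-graph clustering coefficient; the published syntax-network studies use a more careful construction, and the toy utterances here are invented purely to illustrate the quantity.

```python
import networkx as nx

def adjacency_graph(utterances):
    """Link words that appear next to each other in the utterances."""
    G = nx.Graph()
    for sentence in utterances:
        words = sentence.split()
        G.add_edges_from(zip(words, words[1:]))
    return G

early = ["want milk", "more milk", "want ball"]                    # toy two-word stage
later = ["I want the red ball", "the ball is under the table",
         "I want more milk in the cup"]                            # toy later stage
for label, data in [("early", early), ("later", later)]:
    print(label, round(nx.average_clustering(adjacency_graph(data)), 3))
```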
And then, okay, I focused on this forward problem of just asking how complex languages are, but there are the other problems that I mentioned; for example the inverse problem of alien-language inference is quite interesting, and actually somewhat doable. I've worked on that sporadically but haven't published it yet. Again, when something's too complicated you should go back to a simpler model, so I also looked at the regular-grammar case, which I initially thought would be trivial but turns out to be pretty interesting. You can write down a very analogous model for regular grammars, which in the stochastic case are hidden Markov models, and it has a similar transition: the entropy drops, not quite so dramatically, but it still drops at a characteristic ε. What I like about this case is that you can understand what's going on from random matrix theory; not totally quantitatively, but you can locate the transition point and so on. Moreover, for the case of Markov chains, there is a measure of complexity, the predictive information from Bialek and others, and this seems to peak near the transition point, which is pretty interesting (a small sketch of this quantity for a Markov chain appears after this paragraph). What this predictive information measures is: if I know the past, how well can I predict the future, in the sense of how much information there is about the future to predict. The point is that when ε is large, everything is just noise and there's nothing to predict, whereas when ε is really small, the dynamics is very deterministic, so again there's not that much to predict, even though it's quite predictable. It's in the intermediate regime, where the dynamics is sufficiently complex, that there's a lot to predict. Okay. There's more data out there for Markov models than for context-free ones, so in particular we had a look at fMRI data that had previously been modeled as a hidden Markov model by neuroscientists. We measured the various things in our theory, and what's interesting is that we showed that, over a variety of human subjects, around 800 of them, their inferred ε was very close to this critical value, and moreover we could quantitatively predict the values of the entropy and other quantities from that. So it's a null model, but it turns out to work quite well when you have Markov models. What we're doing now on this model is trying to understand the explicit effect of matrix asymmetry, to connect with things studied in stochastic thermodynamics like entropy production. Okay. And again, going more towards stochastic thermodynamics: my approach was essentially an equilibrium approach, I was just thinking about typical properties of grammars, but one can try to make it dynamic by choosing a direction of time. For regular grammars that's already built in, it's obvious, but for CFGs, context-free grammars, there are different ways to make time go: you can have it going from the root of the tree downwards, or from left to right, and so on. So there are different ways to set up the problem, and the choice should probably be informed by some particular application that is actually running grammars. But that would allow a stronger connection to stochastic thermodynamics.
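A minimal sketch of the predictive information for a stationary first-order Markov chain, as referenced above: because the future depends on the past only through the present state, the past-future mutual information reduces to I_pred = Σ_{ij} π_i T_ij log(T_ij / π_j), where π is the stationary distribution. The transition matrices below are made-up toy examples, not anything fit to the fMRI data.

```python
import numpy as np

def predictive_information(T):
    """Past-future mutual information of a stationary Markov chain with transition matrix T."""
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()                                  # stationary distribution
    mask = T > 0
    return float(np.sum((pi[:, None] * T)[mask] * np.log((T / pi[None, :])[mask])))

T_noisy = np.full((3, 3), 1 / 3)                        # "large epsilon": pure noise, nothing to predict
T_mixed = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])                   # intermediate regime
print(round(predictive_information(T_noisy), 3), round(predictive_information(T_mixed), 3))
```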
So let me conclude. Generative grammars are a formalism to encode syntactic structure in a very broad sense; I talked about them here for the structure of languages of strings, but there are also things called shape grammars and so on, which I'm not so familiar with. Context-free grammars in particular are a simple model with quite nontrivial properties: they can have long-range correlations and so on. And once one has an ensemble of grammars, that defines what I call the random language model, which seems to have a spin-glass-like transition; there's a bit of debate about whether it's a true transition or not, but in any case the numerical results are quite clear. The stat mech problem, the analytical side of it, is not trivial and not totally intractable, somewhere in between. This thread of work I started when I was a postdoc in Paris, so let me thank my colleagues there, Rémi Monasson, Jorge Kurchan, Francesco Zamponi and others, and also Giorgio, who I was talking to. With that, thank you. Okay, thank you very much. There was one thread going on in the chat; I think it's mostly been resolved. Are there any other questions that people have? Peter Verner? Anybody else? If not... oops, we have several. Okay, you guys could make it a little bit quick then, given that we're already a little bit over time. Peter, you were first up. Peter, go ahead. We can't hear you. Do you want to go while Peter is trying to connect? So, thank you very much, very nice. So, you consider the context-free grammar. Do you not consider probabilistic context-free grammars, where you associate a probability to every rule? I didn't catch that exactly; I think the question is, do I consider deterministic context-free grammars or probabilistic context-free grammars? Because for the deterministic case the energy would be zero or infinite, like a kind of hard-core potential. The energies take real values, so it's stochastic; when the range of energies is very wide it becomes closer to the deterministic case, but it's on a continuum. And just to ask about this scaling: you have ε with a log-squared-N factor on the x axis, and you have the entropy over log N on the y axis. Why do we have these logs, and how does the scaling depend on the number of hidden symbols? For the y axis, the scaling is quite natural because there's a maximum value that a Shannon entropy can take over N variables, so it's scaled so that it's near one at high temperature. The scaling I did here, I just wanted to collapse the different data, so it's more or less an empirical scaling, but you can derive it by asking about the typical size of the energy and so on. I don't have any intuition for why it's log squared, but you can just do the math and see that it comes out; it's not complicated. I don't have any deep reason why it's log squared. Okay, thank you very much. We can neither see nor hear you, so he's put his question in the chat; if you could just answer it very quickly, Eric, and then we'll move on to Michael. The question is: why is the partition function difficult to compute, Eric? It's because, for a given grammar, it's kind of like a spin glass: you have an energy for each rule, so figuring out all the possible configurations that have low energy is not so simple to do. It looks more or less like a spin glass.
There are some differences, but the structure of the problem is similar to a spin glass, a spin glass on a tree in fact. So at this point, I think any further questions we can take offline, of course. I'm particularly intrigued by the stochastic thermodynamics angle of making this a dynamic model rather than an equilibrium one. But anyway, at this point, let's thank Eric and move on to the concluding talk.