Well, let's get started on time today. So I hope some of you looked at the suggested reading. The various suggested readings are mostly just to give some sense of the main issues, which we talked about on Tuesday, but also maybe hints at some of the more interesting philosophical implications of thinking about what structure and pattern are, and what randomness is. We've been studying that. So last quarter there was another article by Stanisław Lem, the science fiction writer, also in The New Yorker, having to do with just the notion of probability. It's very tongue-in-cheek. I don't know if you remember it or read it at the time, but it's a very tongue-in-cheek critique of probability theory, and that's also up on the Computational Mechanics reader page. The piece I was suggesting for today is also from The New Yorker. It's not technically complicated; it should be a fun read. It's mostly a biographical essay about his life, but he's sort of asking himself a question. He was very knowledgeable about statistics and probability theory and philosophy of science, in particular information theory. He really liked information theory. If you go back and read his other books, he was quite prolific. He passed away about three or four years ago, I think. There is this thread running through all of his stories: chance, probability. But here, this essay is a short autobiography, asking how much of his career, his life, his circumstances were due to his intention, good advice, that sort of thing, and how much was just chance. So he weaves that through this recounting of his writing career, his early life, and so on. It's a fun read. I like his writing, so I urge you to read that. Also, there are a few other things, again mostly quick survey kinds of reading. There's last year's Nature Physics review, which covers some of the technical background to the topics for the spring.
Again, I'm not suggesting, at this point, reading it in detail, but at least a quick scan would be helpful; the first half or so sets things up. Also, there's an older review, "Computational Mechanics: Pattern and Prediction, Structure and Simplicity", and there are two introductory sections that try to motivate things more philosophically. So just sections one and two, three or four pages, to get some sense of why we're doing what we're doing. And then there's a newsy kind of thing from New Scientist that talks about some of the work we did years ago. Now, this lecture is actually probably the most important lecture of both quarters. I've been kind of setting things up. Tuesday was this sort of critique: first, reminding ourselves how wonderful information theory is, and then my setting it up as a straw man and criticizing it. It doesn't take us to the next level of directly measuring structure, of saying what structure is. That's what the spring class is oriented towards. And just to give you some sense of how it's going to unfold, of course the vocabulary will become clear as we go through the lectures. So today we're going to talk about this notion of state. This is kind of a deconstructionist look at a central concept in science. We all feel like we have some notion of what state is. We're going to revisit that and recast things in terms of looking at some process: what are the effective states? And then can we actually get the dynamic, the equations of motion, out of that? I'm going to give you the answer today. It's a little bit technical, not so much in terms of theorem-proof, but just in the introduction of mathematical ideas. It won't seem completely clear or workable yet; that's the goal, or the burden, of the following lectures: to unpack the consequences. So I'll try to motivate what these effective states are and how we can find them, as much as I can, and then set up the definitions.
And then there'll be results that follow in the further lectures. And there's a particular representation that gets singled out. We keep talking about, oh, we have a process generated by a hidden Markov model or a Markov chain or whatever. But those are chosen models. The question is, how do we go from a process's behavior, the definition of the process in terms of its behaviors and their probabilities, to figuring out what the representation should be? Does the data, does the process itself, tell us how it should be represented? The answer is going to be this representation. So I will define it at the end of the lecture, and we're going to study it. It's going to come back practically and give us some new algorithms for calculating all those information quantities we were talking about at the end of last quarter, some very efficient algorithms, new sets of information quantities, an enriched notion of pattern and structure, even of what we mean by theory. Then we'll go into measures of complexity, which will connect back up to the information theory. There are some really interesting properties. As soon as we can start talking about the intrinsic representation for a process, some interesting new statistical properties pop up, certain intrinsic notions of irreversibility. In physics, we always think: well, for the microscopic equations of motion, the underlying physics, if you change time to minus time, you get the same behavior. And then you go, now wait a second, I just broke an egg, and that's clearly not reversible. So what's the issue here? There's another kind of irreversibility that's related to that physical notion of reversibility, and some of the processes we've been looking at already have this property of being statistically irreversible. So we'll talk about which processes are reversible: I look at the process in forward time or reverse time, and they're the same.
And there are other properties, now made explicit using this representation of a process, that are not statistically the same from forward to backward time. We'll also talk about how hidden processes are. We've been talking about Markov chains and hidden Markov chains. Well, exactly how hidden are the states from us? If I just give you one of these internal Markov chains with labels on the edges, the transition matrices, how much do the observed sequences reveal of the internal state information? How long a sequence do I need to look at to begin to see what the internal structure is? That's this notion of crypticity. We'll make some parallels here with notions from cryptography; this discussion is sort of scientist-as-cryptanalyst. Namely, nature gives us all the encoded information and we have to decode it as we build our theories. Once we have a preferred representation, and I'll have to convince you of this, since it's not obvious that there should be such a thing, there's actually a very natural way to think about what information means. Now, Shannon, in introducing his uncertainty measure of information, this P log P form, concentrates on the amount of information and never tells us what information means. So there'll be a lecture or two where we actually go back and answer that question. On the engineering side, his approach is extremely well motivated. We don't care, shouldn't care: the hardware and the internet shouldn't care whether you're searching with Google or you're shopping or whatever. It should just do its thing. It shouldn't care about the content. But as a scientist, when you build models, you'd like to understand what the behaviors mean vis-a-vis your understanding, vis-a-vis the current model, and maybe even intrinsically. So we're actually gonna talk about, in a sense, an objective theory of subjectivity.
Typically, we approach modeling acknowledging, if we're honest with ourselves, that the choices we make and what we currently understand depend on how we've been trained, and are in some sense subjective. And then there's this mystery: how do we ever learn something about nature out there? So I'll try to convince you that there's a way of thinking about this that depends on this intrinsic representation we extract from the system itself, and it gives us sort of a natural notion of semantics and interpretation of the meaning, the content, the information content of measurements. Then we'll go back; there are some more technical things we need to do to understand some new algorithms, new kinds of information, new algorithms to calculate these information measures. We'll get back to those information diagrams we talked about before. And then there are various kinds of applications. Directional computational mechanics: we talked a little bit about these interesting statistical properties that are not the same in forward and backward time. Well, there's a whole calculus underneath this that we're gonna unpack, and it actually gives us some new calculational tools. It turns out that even for a simple thing like the mutual information between the past and the future, the excess entropy, which is very symmetric, since mutual information is symmetric in its two variables, to calculate that quantity you actually have to look at how a process generates information in forward time and in reverse time. So this is very odd. The practical benefit is that we then have some new algorithms.
That lets us have, in fact, not just algorithms, I should say, but analytical techniques, even closed-form calculations, for a lot of these information quantities that before we were getting just in terms of the block entropy and taking derivatives, which is a very empirical way. Some of the homework problems or the exam problems were cranking through the block entropy at different word lengths, looking at its rate of change for the entropy rate, and at how it converges to the linear asymptote E plus h_mu L, and so on. That was all computed right from the word distribution, right from the block entropy. There's a more direct way to do this that comes out of stepping back and looking at the time series in both the forward and reverse directions. And that again enriches our notion of what kinds of information there are, hopefully tying some things together rather than just increasing the length of this list of different kinds of information in processes. And then things shift; it will depend on time and interest, and by that point we'll be working on some projects. There are different topics, different areas that we've applied things to. So the first half of the quarter is theory, basic questions, and then we'll start shifting into doing some projects. We might have some lectures on complex materials: how is it that materials store and process information just in terms of their structure? There's another interesting, very current topic called information thermodynamics, and this starts to address the issue I brought up on Tuesday, and also in the very first lecture of the winter quarter: how are information and energy related? There's some very recent work in the last ten years, very exciting work, extending equilibrium thermodynamics to what are called non-equilibrium steady states, and we can now make connections to the processes we've been looking at.
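The block-entropy procedure described above, cranking through word statistics at increasing lengths and taking a discrete derivative for the entropy rate, can be sketched in a few lines. This is my own minimal illustration, not code from the course; the estimator h(L) = H(L) - H(L-1) is the standard finite-L approximation to the entropy rate.

```python
# Sketch (my own illustration): estimate block entropies H(L) for a
# symbol sequence and the entropy-rate estimate h(L) = H(L) - H(L-1),
# the discrete "derivative" of the block entropy curve.
from collections import Counter
from math import log2

def block_entropy(seq, L):
    """Shannon entropy (bits) of the empirical length-L word distribution."""
    words = Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))
    total = sum(words.values())
    return -sum((c / total) * log2(c / total) for c in words.values())

# Period-2 process: ...010101...  H(L) saturates at 1 bit, so h(L) -> 0
# and the excess entropy E is about 1 bit.
seq = [0, 1] * 5000
for L in range(1, 5):
    H = block_entropy(seq, L)
    h = H - block_entropy(seq, L - 1) if L > 1 else H
    print(L, round(H, 3), round(h, 3))
```

For the period-two sequence the block entropy stops growing after L = 1, so the entropy-rate estimate drops to zero, exactly the "converges to the linear asymptote" behavior described above.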
All these chaotic systems are overtly non-equilibrium systems, so we haven't been able to directly apply a thermodynamic description to them yet. Remember we talked a little bit about a class of models back in the dynamics section of winter quarter, cellular automata? Well, there's an information-theoretic and structural interpretation of how these spatially extended cellular automata form patterns and form little particles, of how the particles store information, and of how, when they collide, they're actually doing a kind of logical operation. So you have spatially extended computation. But again, these are all optional topics and we can do this on the fly. I have, you know, another dozen or so lectures we can select from depending upon people's interest. Rate distortion theory is interesting; it was actually introduced by Shannon in his original communication theory. How do you make good approximations of a process? The problem he was first concerned about: imagine I have some process where the measurements are real numbers, but I want to transmit them across a binary channel. What's the best way to discretize a continuous number? Rate distortion theory is kind of an optimization: you trade off the complexity of the code book that you build against the error rate, and there's a systematic way of doing that. This looks extremely close to statistical mechanics. In fact, a lot of the history of the theory in the 20th century was this interesting interplay between fundamental physics and statistical mechanics on one side and statistical inference, machine learning now, on the other, and there are a lot of interesting overlaps in concepts. Yeah. I don't know if it was covered during the first quarter, but are we going to cover database theory and query theory at all, like the amount of computation?
No, I mean, for that kind of thing there are classes that go into computational complexity, right? Is there some information-theoretic version of that theory? Oh sure, there are information-theoretic versions of computational complexity based on this concept called Kolmogorov complexity, which maybe I should have put in here. I typically don't cover this area, Kolmogorov complexity. Kolmogorov was probably one of the best Russian mathematicians, which is actually saying a lot. He was tracking the development of the Turing machine and more fundamental notions of computation, discrete computation, and was wondering if there wasn't a way to give an algorithmic theory for probability. So that's where he starts. It turns out that that whole area of Kolmogorov complexity, for our purposes, is information theory. People still work on this and talk about it, but it's not terribly helpful for us, because it's easier to work with information-theoretic quantities like the entropy rate. But when I say computational here, I will introduce a different notion of computation, what I call intrinsic computation. There it's not so much that I'm designing a device to be a database and looking at the average number of queries, or the length of queries and that sort of thing, to get a result. Rather, it's looking at some physical or biological system and trying to figure out how it's storing and processing information on its own terms, not necessarily in a way useful to me. That doesn't mean these ideas wouldn't apply to useful computation, but we won't have time to really get there. There are interesting applications of all this discussion of intrinsic computation and information processing to physical and other kinds of systems that go through phase transitions.
So, in fact, what we're gonna be talking about in the next four or five weeks is sort of the level-zero theory of these ideas. In fact, there's a hierarchy. Has anyone studied computation theory? It's very much like the Chomsky hierarchy in the theory of computation; there's a hierarchical notion of these effective states. So that's a nice, fun topic, and you can actually calculate these things. A number of systems go through phase transitions. The one that we've already been studying is the logistic map: as we vary the r parameter, it goes through period-doubling bifurcations, and then finally, after it goes through periods two to the n, the period diverges and it becomes chaotic. The range of correlations in the behavior goes to infinity at this phase transition, at the onset of chaos, and there's a way of actually characterizing in closed form what that information processing is. We've been talking about finite-state Markov chains and finite-state hidden Markov models; this involves infinite-state Markov chains, in a sense, or even continuum-state processes. So it's really fun and interesting, and a lot of it you can calculate. And if there are quantum enthusiasts, there's also a version of this developed for quantum mechanics, simple few-qubit quantum systems. And we've been applying some of these ideas to look at biological evolution. One question that's interested me for a long time is, you know, there's Darwin's theory of natural selection, but then there's also this complementary view that way back at some early time there was no biology, right? We just had these soups of simple molecules, so how did something like natural selection ever emerge? There are ways of talking about the population dynamics of replicating individuals that have structure, and as they get structured they form communities that are themselves structured, and that starts to be the substrate out of which you can start to explain the emergence of natural selection in prebiotic, prechemical, almost prephysical evolution.
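The period-doubling route to chaos mentioned above is easy to see numerically. A quick sketch (my own, with arbitrarily chosen r values and tolerances): iterate the logistic map x -> r x (1 - x) past its transient and count the distinct long-run values.

```python
# Sketch (my own check, not course code): count the distinct long-run
# values of the logistic map x -> r*x*(1-x) at a few r settings along
# the period-doubling route.
def attractor_size(r, n_transient=2000, n_sample=200):
    x = 0.4  # arbitrary initial condition in (0, 1)
    for _ in range(n_transient):          # let transients die out
        x = r * x * (1 - x)
    seen = set()
    for _ in range(n_sample):             # sample the attractor
        x = r * x * (1 - x)
        seen.add(round(x, 6))             # bin to 6 decimals
    return len(seen)

for r in (2.8, 3.2, 3.5, 3.9):
    # fixed point, period 2, period 4, then chaos (many distinct values)
    print(r, attractor_size(r))
```

The count goes 1, 2, 4 through the bifurcations, and at r = 3.9, past the onset of chaos, nearly every sample is distinct.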
Anyway, keep in mind there are a number of different options; maybe about halfway through, once you get started on projects, I'll come back and we'll take a vote on which topics we wanna do. But first things first, okay? So back to the course flow chart. Here's this learning channel. Here's nature, in principle sort of unknowable, because it's typically observed through some kind of sensorium or instrument, so we don't really get to put our heads inside the black box. We end up with a time series of measurements, and the question is, how much can we understand about what's going on inside the black or gray box in terms of the measurement sequences? Information theory gives one class of answers to that. Once you're willing to extend the notions in elementary information theory, as presented in Cover and Thomas, to deal with processes that can have arbitrarily long-range correlations, then there are a couple of answers, right? Excess entropy, transient information, entropy rate, ephemeral information, bound information, all those different things we talked about. But it does not deal with this. So we really haven't completed the analogy between the scientific learning-and-discovery process and Shannon's communication channel till we actually talk about us over here, the receivers, who are moving through some world or doing an experiment, trying to understand, from this indirect, impoverished view, what is really going on inside the box. How do we build models, in short? And we're just going to break that question down into two simple questions. Looking at this data, what are the states? And once we have the states, or imagine some kind of state space, what's the dynamic over that space? So does the data tell us what the states are, how the behavior should be represented, and then what the equations of motion are?
In a sense, reconstruct the state space and then pull out the theory. There'll be hints of wilder speculations about automating scientific inference. Okay, so I said this was sort of the most important lecture. Before we get into the mathematical definitions, what I'm doing is maybe kind of philosophically trying to strong-arm you into a particular view of the issues. So what we're going to do is play a game. This is the interactive part of the lecture: the prediction game. Very simple rules. I'm gonna give you some data stream, okay? Then I want you to give me what you predict the future's gonna be. And then, after you make your prediction, I'm gonna ask you to give me a kind of state-based model that would summarize what you think the process is, how it's structured. Okay, so process number one. So I'm gonna give you some data. This is the past. Here's the data: one, one, one, one, one, one, one, one, one, one, one, one, one. Okay, so you kind of smile. Right, in fact, it's kind of awkward: he keeps saying ones more than he should, because I got it already, I know what's going on, quick, let me make a prediction. So don't be shy. What's your prediction? Excellent, I like that. That's assertive, great. Some people actually hesitate to say anything. It's like, oh, come on. I had to drag "one, one, one" out of people. Anyway, yes, okay, good. Now, there are a lot of practical issues. You could say, oh, the process is in a bad mood; actually, after the hundred-and-first one, it's gonna get stuck at zero. But to a good approximation, given that data, all ones is a perfectly justified, accurate prediction. Okay, now slightly more subtle, but not too much harder: what model? If I have states and transitions, what state-based model would do this? How many states? Right, exactly, very good, okay, good. So there we go, good. So now we're on board. Now we're syncing up, okay?
So just a single state; this is a simple process, and on every transition it just emits a one. Okay, good, with probability one. Okay, now, the second example. So there's your data. Now, for this second process, one thing that's predictable is that no one smiles, unlike the first example. You go, oh, look, zero, one, zero, two zeros, three zeros, oh well, hmm. One, one, two ones. Well, this is slightly unfair. But actually, if you look at it, most of the words of length one, zero and one, of length two, of length three, occur. In fact, I really should give you more data, because when I ask you to make a prediction, you're still scratching your head; this is really too short a sample to see any kind of regularity. Actually, one fair answer is to just give it back to me: that's what I got, that's what it's gonna do. In fact, for a long period of time, that's how weather forecasting was done. I'm not kidding; it's called the method of analogs. It's only in the last twenty years that we've been using supercomputer simulations. What was done in the past is people would look back through all the historical records, try to find a week in the past that matched what happened in the last week, and then look at what happened in the next few days. And that was the forecast. Okay, so in this case, imagine I gave you more data. The punchline is that as you look at more and more data, you start to realize that basically all words of all lengths occur, okay? And words of equal length occur with equal probability. So basically you can give me anything, and that would be a fair prediction. Something like: I just flip a coin. Okay, so now, more important is: what's the model in this case? We've somehow concluded that all words of all lengths occur with equal probability. So what's the model? Two states. Two states, okay.
And deterministic: it could be, you go one way on a zero, the other way on a one. So what do the two states mean? A one state. Okay, we have a one state, we have a zero state, okay. One state. Do I hear three states? No one wants three states. Well, that's interesting. Okay. One state. One state, okay. Zero. Zero. Okay. So, right, that's an interesting point. It is possible to write down a two-state model of this, so that's a fine answer. This one, a single state, is also a perfectly good answer: the fair coin, if I were to label transition probabilities here, 50-50. This one's smaller. And this is gonna be one of the key properties we're gonna be looking for in this preferred representation I was talking about: minimal models. At least at this point, you could say, well, we're gonna invoke Occam's razor. You give me sixteen models that are consistent with the data, the finite sample, and I'm gonna pick the smallest, because I'm lazy, or my computer can't run the big ones, or something. And you can see this is minimal: I can't remove a state and still have it be a model; I can't remove an edge and have it still be a consistent model that predicts the data the way it did. So, okay, good. Interesting point here: notice that the completely predictable process, all ones, and the completely unpredictable process both have simple, single-state models. Okay, last example, process three. One zero, one zero, one zero, one zero, one zero, one zero, one zero, one zero, one zero. There was like half a second longer before some of you started to smile. Why? Because it's, look, I got it, right? So, what's your prediction? Right, one zero, one zero, one zero, okay. Very good, right, okay. And you know, as I'm saying this, immediately we go, okay, there's a little template, a word of length two, and that's just repeated. Now, the question is, what's your model? What's your state-based model? Two states, right? One and zero. One and zero, right? One state, one zero.
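The word-statistics claims above are easy to check numerically. A quick sketch (my own, not course code): count the length-3 words in simulated fair-coin data and in the period-two data, and check that the coin supports all words with near-equal probability while the period-two process supports only two.

```python
# Sketch (my own check): word statistics for the fair-coin and
# period-2 example processes.
import random
from collections import Counter

def word_counts(seq, L):
    """Empirical counts of all length-L words in the sequence."""
    return Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))

coin = [random.randint(0, 1) for _ in range(100000)]
period2 = [1, 0] * 1000

print(len(word_counts(coin, 3)))      # all 8 length-3 words appear
print(len(word_counts(period2, 3)))   # only the two phases of 101010...
freqs = word_counts(coin, 3)
total = sum(freqs.values())
# Each of the 8 words has probability close to 1/8.
print(all(abs(c / total - 1 / 8) < 0.01 for c in freqs.values()))
```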
Right, okay, so that's a good point. We have different choices of representation here. You could just have a single state that had a word it put out. But what we're doing for now is just putting single symbols on the transitions. Okay, so in that case it would be a two-state model. So, here you go, huh? Three states, what's going on here? Okay, now I'm gonna point out something that was also part of the models in the last two examples. They were just one state, and that one state had these two concentric circles. This is called the start state, and it represents your state of knowledge before you've made a measurement. One way to think about it: you took a bunch of data, you built your model, and then you went off on your Tahitian vacation; when you came back, the experiment was still running, and you have to make a measurement to see where it is. It's kind of the issue of synchronization. Before, it didn't make any difference: for all ones, or for a completely fair coin, the states you visited, the recurrent states, and the start state coincided. Here it's different. Before I jump in, I don't know whether I'm gonna see a zero or a one. In fact, my guess would be that they occur 50-50. If I just look at this data with a window of length one, it looks like a fair coin, right? Zeros and ones occur with equal likelihood. So that's what this start state represents. However, as soon as I make that measurement, say it was a zero, from that point forward I know exactly what the template should be. It should be one zero: if I saw a one, then the next thing will be a zero, and it becomes completely predictable. So what this start state, this so-called transient state, does is tell us how we come to synchronize. Remember before, when we were talking about the transient information? That was the area between the block entropy and the linear asymptote E plus h_mu L.
Now we have almost a mechanistic explanation for how we came to know: in this simple case, we have to make at least one measurement, and then we know whether we're in the odd or the even phase of this period-two oscillation. Not having made a measurement, we're completely uncertain; Shannon would say there's one bit of uncertainty. But as soon as I make a measurement, there's only one transition leaving each state, and the future's completely predictable. So now the structure of the model and the states is telling us not only something about the long-term structure, what this template is, zero one repeated and completely predictable, but also how we come to know that. Yeah? I just wanna know, in this particular example, the labels zero and one... Sorry, yeah, it could be A's and B's, right? It could be A's and B's, right. But then you can start to say: whatever I get first, I call it A. Yeah, yeah. And so then it's splitting, right? Whatever you see first, I'm gonna call that a one, and the next thing I see a zero; I just don't know which is which, right? Because your label "one" is arbitrary, right? Oh, sure, sure, it could be high voltage and low voltage or something like that, sure. But then you're gonna predict high-low, high-low voltage, so it doesn't matter what the label is. It's like conventional current, right? It's probably not positive charge flowing; typically the electrons go the opposite way. We just kind of arbitrarily said, I'm gonna call whatever I see first a one, whatever it actually is. Well, then all the model's gonna capture is that the next symbol will be the other one. Right, that's all, right? Yeah, yeah. How we name these symbols, even how we name the states, at least for now, is going to be arbitrary. It's a choice of alphabet. We know what the alphabet size is ahead of time, but I can use A's and B's and C's, upper or lower case.
Oh, okay, wait, now I see what you're getting at. Yes, yes, well, yes, you do. I mean, you're right. What you're kind of arguing is: okay, let's forget this new mechanism; I like my simpler model here, and now it doesn't matter, because I have this device with the additional cognitive ability that there's a rule, and you said it: the first thing is gonna be called a one. So this transient state explicitly encodes that step. Part of this is being absolutely explicit about the representation. We're not gonna allow ourselves any extra hidden functionality anywhere, just like we were saying before. Yes, you could have one state that emits the word one-zero, but we're gonna be absolutely prosaic: one symbol per transition. Yeah. Your alphabet could be one, zero, three... Yeah, right. Well, yeah, we'll get into some issues like that. Is that a decimal three, or is it just a symbol like C? Oftentimes when you're doing these analyses, there are all these additional unspoken assumptions you bring into the analysis that can actually add structure or throw it away. So we're gonna have this discipline of being very prosaic here. Yeah, Tom? By including these extra transient states in our model, is it possible to not talk at all about state distributions? Because wouldn't an equivalent way be to say we start off with a distribution over the states, 50-50 on the one and zero states? Excellent point, yes. So can we, if we include enough states, just consider deterministic transitions? Okay, so a couple of answers to that. The first will be an elaboration of what I just said to Quinn: then you're gonna need some mechanism that deals with the state probabilities, and I want you to represent that. We're gonna do this very literal computational accounting of every component in the model.
Now, what we're gonna see, in fact, is that these transient states describe how, when you put initial distributions down onto the recurrent states, they come to equilibrium. So in fact, they'll be the explicit mechanistic answer for how distributions of our processes settle down and become stationary. Yeah, so, okay. Well, this is the entire point of the lecture. What you were doing intuitively, maybe with a little bit of explanation here and there, is the thing I wanna capture formally. We were building these models. We talked about minimality, about how well they predict and how that predictability changes over time, and about how structures in the model capture not only the long-term organization of the process but also how we come to know it. So now the goal is to try to formalize this a bit. What's the goal here? Well, as we just started out, I called it the prediction game: the goal is to predict. So how do we predict? We use information from the past. We can't use information from the future; we don't have it yet. But the question is, what information are we gonna use? What information did you use when you were thinking about the predicted sequence and the model in the examples? Right, some kind of pattern. So we need to be explicit about that, rather than just scratching our heads. Or we could, you know, stick each of you in a box, and then Hewlett-Packard could sell you as the pattern discoverer, right? We'd like to automate this, like we talked about on Tuesday. Can we be explicit enough about what a pattern is? In the periodic case, that was easy. The fair-coin case, well, that was a little bit problematic, and it's kind of a hard example even for intuition, because I had to give you more data, and it's tedious to figure out that a sample really is a fair coin.
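The period-two machine with its transient start state, as discussed above, can be sketched as a small simulator. This is my own construction under stated assumptions: a start state S from which 0 and 1 each occur with probability one half, and two recurrent states that the observer synchronizes to after a single measurement.

```python
# Sketch (my own, not the lecture's diagram): the period-2 process as a
# three-state machine with a transient start state S. After one observed
# symbol the observer has synchronized to recurrent state A or B and the
# future is completely predictable.
import random

TRANSITIONS = {
    'S': [('A', 0, 0.5), ('B', 1, 0.5)],  # (next state, emitted symbol, prob)
    'A': [('B', 1, 1.0)],                 # just emitted a 0, so emit 1 next
    'B': [('A', 0, 1.0)],                 # just emitted a 1, so emit 0 next
}

def generate(n, state='S'):
    """Emit n symbols, starting from the transient start state."""
    out = []
    for _ in range(n):
        r, acc = random.random(), 0.0
        for nxt, sym, p in TRANSITIONS[state]:
            acc += p
            if r < acc:
                state, emitted = nxt, sym
                break
        out.append(emitted)
    return out

seq = generate(10)
# Only the first symbol is uncertain; thereafter the sequence strictly
# alternates, which is the synchronization story told above.
print(seq)
```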
If you go to traditional mathematical statistics textbooks, they will tell you how to infer a fair coin from a series of coin flips, except they assume it's a coin. They assume what the pattern is, that it has no pattern, and then the calculations you do are just estimating the bias, the frequency of heads and tails, if you will. That's not our problem. Our problem is we don't even know if it's a coin. We wanna know what the structure is, what the states are. Okay, so this is "what information?" I put it in quotes. Is it Shannon information? Or some other, you know, completely different thing? So we wanna somehow find the effective states, as we just did with those examples. And then of course, once we've made some guesses at states, it's usually not too hard to go the next step and think about what the transition structure or the dynamic is over them. But again, you know, we're just sort of using these words, information, states. We're trying to be honest here. The states are sort of hidden from us. We might have hundreds of states and we just have binary measurements. How are binary measurement sequences related to 100 internal states? Yeah, it's just, again, very prosaic. All we have are sequences of measurements. We'll assume we know what the measurement alphabet is, and ask how these symbols reflect the internal set of states and the dynamic. So here's the main idea presented graphically, and then we'll get into the mathematical formalism. So what are these effective states? So now I want you to imagine that you've been observing a process for an arbitrarily long time, infinite time. In fact, you can load all of this. You have an extremely powerful computer and you're well-versed in your machine learning algorithms. You can crunch as much as you want, and then you stop at some time T. You finish your sampling and you make the best prediction you can make. Well, if the process is slightly stochastic, there'll be some alternatives.
So here I just kind of represented it. I could have seen a one and then a one and then a zero, and then a one or a zero; a one, one, one, one; or whatever. So I stop at this point. If the process is slightly stochastic, there is a tree of future sequences I can see. So that's the basic setup, and I'm allowing you to use whatever forecaster you'd like. You can go to the Google Prediction API, throw all the machine learning algorithms at the data I've given you, and do the best you can. Okay, and now you make one more measurement and then you do this again. And in this case, what happens is that the future looks different. I just now measured a zero, and it turns out that if I've seen a zero, then I predict that I'm gonna see a one, but then after that, I can see zero or one; it kind of branches out again. Okay, so what we're gonna say about the generating process is that it's in different effective states when the futures look different. So here, whatever effective state the process is in at time t, it's not the same as the state it's in at time t plus one, because the futures have a different shape. There are obviously more options up here than down here, for example. The predictions are different. Now imagine I do this again. I make another measurement. Now I see a one. And then you kind of notice, doing your visual pattern recognition, that at this point, at time t plus two, the range of future possibilities that I see, zero, one here, zero, one, zero, one, and so on, the way it branches out, is the same as at time t. So then we're gonna say that the generating process is in the same effective state when the shape of the future is the same. And by shape, you'll see, what we mean is not only which sequences I'm predicting to occur, but also the probabilities with which they occur.
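Comparing the "shapes of the future" after different pasts can be sketched empirically. Below is a minimal Python illustration with made-up function names, using a period-two process where the morphs are deterministic and easy to see:

```python
from collections import Counter, defaultdict

def future_morphs(seq, k=2, l=2):
    """Estimate the 'shape of the future': for each length-k past,
    the empirical distribution over the next l symbols."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq) - l + 1):
        counts[seq[i - k:i]][seq[i:i + l]] += 1
    return {past: {fut: n / sum(c.values()) for fut, n in c.items()}
            for past, c in counts.items()}

# Period-two process: the futures after "01" and after "10" have
# different (here, deterministic) shapes, so these are different
# effective states.
morphs = future_morphs("01" * 500)
print(morphs["01"])  # {'01': 1.0}
print(morphs["10"])  # {'10': 1.0}
```

For a stochastic process the entries would be genuine probabilities rather than all 1.0, and two pasts would count as the same effective state only when their distributions agree.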
Okay, so that's the main setup. That actually kind of nails down, maybe a little indirectly, this notion of effective state. Effective for what? Well, for prediction, right? So now we can refine our previous goal with two points. First, find the states that are effective for prediction. Moreover, and this is an almost obvious thing to note, different pasts or histories that lead to the same predictions are equivalent, as long as our goal is prediction. In other words, let's be lazy. Let's not make distinctions between particular pasts that lead to the same shape of the future. Somehow, those pasts have the same information that we need to do prediction. That's why we say the process is in the same state. These are sort of predictive states. So we're not gonna make distinctions between histories that lead to the same predictions, yeah. Does this depend on the algorithm that you're using to make those predictions? It does, except at this point, we're just gonna do the formal mathematics and not worry about implementations. So this is gonna be true of the best algorithms. Yeah, there are certainly ways of doing this badly. But we actually have to develop a theory of bad models too, before we can tell you what good models are and pick one out in particular. Yeah. Yes, yes, yeah. So we're still, I'm just kind of slowly motivating things and recasting the original questions, right? So now I need to be more formal about what I mean by prediction, histories being equivalent, and so on. Okay, so effective for what? Well, for prediction. Well, what do we mean by prediction? Pretty simple idea here. It's just a mapping from pasts to the future. In particular, from a particular past to this prediction, which is a distribution, a predicted distribution or forecast of the future behaviors and their probabilities. Okay, so notation again: we'll have some process here, a set of observed symbols, past and future.
Looking at blocks again, lower case means a particular realization, upper case are the random variables, right? So just a little useful terminology here. Just the way that graphic was set up, the things we're focusing on are these distributions over futures conditioned on particular pasts. So I call that the future morph, the shape of the future. And these are the things that I was showing on that graphic. These are the objects, and we were comparing these objects, right? So a very direct notion of prediction is: if I've seen a particular past, then I wanna produce one of these future morphs, because from this distribution over futures, I'll make my predictions. By the script L there, do you mean you're really predicting out to a given finite length? Yeah, yeah. If you remember, last time when we were talking about some of the information identities, I was being a little bit cavalier in writing these arrows, which mean semi-infinite, and then I occasionally would show you what you had to do to prove things. You have to drop down to finite length and then at the end take limits and so on. So yeah, here I'll just go a finite distance into the future. The theorems we're gonna prove, by the way, in this whole framework, will take L to infinity. I have a question about the refined goal. Wouldn't it make more sense to use as much data as possible from the past? Sure. Because it would strengthen the prediction. Right, okay. So the refined goal is: I wanna predict as much, or as accurately, as I can about the future, namely estimate this future morph, using as little, I didn't say information, I just said as little from the past as possible. You'll see how this comes in. So this is gonna be our notion of minimality kind of coming in, and it's gonna fall out naturally. Christina?
So, you can answer this later if you want, but I'm trying to think about the parallel in the study that I did last year on temporal morphology, thinking about the state spaces, the predictions, and the transitions between them, versus actual spatial dimensions and the patterns that we see that are spatial patterns. So information patterns and the system behind those as described, versus the actual spatial patterns. And I'm trying to figure out what's different about what you're saying in each of those domains. For now, forget space. No, seriously. Okay, right. So actually in the winter, I did go over some cellular automata, setting us up so we can explicitly do this construction for spatially extended time-dependent systems. And you'll see there's a modification we have to do. Instead of just using these one-dimensional pasts, because it's a one-dimensional problem with time series or purely temporal processes, what you do is you actually use a light cone in spacetime. It complicates the notation, so we should get comfortable with this first, and then you'll see it kind of fall out. But yes, there is a nice extension to that. Okay, so again, right. So we want to predict as much about the future, this guy, the future morph, using as little of the past as possible, okay? In other words, we're not making distinctions. Remember, we kind of agreed to be lazy. If we have two different pasts, seen at different times, and they lead to the same future morph, then they're essentially predictively equivalent. We said the process was in the same effective state when that happened. Okay, so now what I'm doing here is sort of peeling layers of the onion, trying to refine the original question, which was, oh, just give me the states and dynamic, intuitively, to be more and more formally precise about what we're trying to do here. So how are we gonna frame this refined goal?
Okay, well, obviously we're dealing with a particular space, at least one space: the set of all pasts. So here's my artist's rendering of the space of pasts, this oval here, and I have these different pasts, s prime, s double prime, and so on. Each point in this space is a possible past generated by the process we're looking at. Okay, that space is quite big if it's the fair coin process. If it's that period-one process, this space of pasts has one member, the all-ones sequence. Okay, now we're sort of comparing pasts in terms of their predictive ability. So we need to think about actually grouping pasts together. The main working hypothesis here is that histories that lead to the same predictions are equivalent, and that induces a structure on the space of pasts. So here we have our space of all possible pasts for a given process, and I make some guess. It's a two-state model or a three-state model or whatever. That effectively is a grouping: it considers certain pasts to be equivalent, and I put them all into these partition elements here. So here I've taken the space of pasts and I have five different sets, or classes, and all the histories in here, the points in here, lead to the same prediction. That's what I'm trying to draw graphically here. And in addition, I want this set of partition elements to exactly cover the space, so that it is a partition of it. Namely, if I union them all up, I get all of the pasts that we've seen, and they don't overlap. So it really is a partition in the mathematical sense. Okay, so we're trying to formalize this intuitive idea that histories leading to the same predictions are equivalent, and that this induces some kind of partition structure. You know, if I just willy-nilly guess that some process has got 17 states, then I'll somehow have 17 partition elements, 17 groupings. Under those assumptions, certain histories will lead to the same prediction.
Okay, so now we're going to talk about this mapping, again giving some more structure to the space. We have this map from histories to partition elements. So I'm going to call that eta, and I can plug in a particular history, a past, and it returns which partition element it's in. So, you know, if I have a history that's over here, it's going to return R2. Basically, eta just returns the name of the partition cell, the group, that that particular history was in. We will also think about eta applied to the whole set of pasts, now thinking of this as the semi-infinite collection of random variables, so that the output itself, which cell I'm in, which effective state I'm in, is a random variable, okay? And I can even take the data that we have, we're given the process and its word distributions, so we have the probabilities of the histories, and I can sum those up for all the histories in a given element to talk about the probability that a given history I've seen is in this or that partition element, right? So we go from the raw data down to this coarse-graining of the space, and I can talk about the probability of the partition elements. Well, there are some special cases here. One is what you might call the null model: I make no assumptions, and there's just one partition element; I make no distinctions, okay? In other words, there's just this single partition element; the entire space of the observed histories, I just say, is one element. So that's just saying it's got one effective state, and we just saw two examples of that. The all-ones process, well, its space of histories is trivial, that's one point, and there's only one partition of that, okay? In the fair coin case, this is gonna be the space of all semi-infinite binary sequences, a huge space, but I'm just gonna lump them all together.
And so the picture behind that is, we just have one partition element, and that's what that fair coin single state corresponded to, that effective state. The question, of course, is what future morph do we have, given, kind of abusing notation here, that I'm assuming this null model. What does the distribution over futures look like? And you can just convince yourself that if you sum over all of the histories in the space, then you end up with the unconditioned distribution over futures. In other words, I'm making no assumptions, therefore my prediction of the future is just the future distribution itself, not conditioned on anything. That's sort of defining what I mean by null here. The other extreme is what I call "every history is precious." I denote this R infinity, and basically it just says every history is a state. This is, you know, given some data, you're not making any effort at modeling; someone says, what's your model, and you hand the basket of data back to them and say, that's my model. You've not really learned anything, but it is a model, right? So every point here is one of these partition elements. Each past is a state. And now there's a really great benefit of this. If every past is a state, then the future morphs are simply what you get from the process conditioned on that particular corresponding past. So that's a really good model in terms of prediction. It just doesn't teach us much. We haven't learned anything. We don't know how the process is really structured, but there you go. You can do great predictions, because I have all the data sitting there and I just look up and see what's gonna happen in the future, the method of analogs. Yeah? I'm just wondering, maybe you can speak to it, like people doing data mining and whatnot, a lot of them talk about not having to use scientific theory. Oh, that's an interesting debate.
Yes, right. In fact, it was probably about four years ago now. It was even a feature article in Wired, a featured essay by Chris Anderson, the then editor-in-chief of Wired: "The End of Theory." And a friend of mine, Peter Norvig, who is director of research at Google, actually published an article in Communications of the ACM with his colleague Halevy talking about the unreasonable effectiveness of data. And so it's generated a whole argument, and Peter likes to say that their translation engine is linguist-free. They didn't invite a single theorist, a linguistic theorist, in to help with the translation engine. It's all a very straightforward data mining algorithm, and in particular, the more translated corpora they put into it, the better and better it does. And the claim is, the translation is just using data that shows how language is used in the wild, how it actually is used. Forget all your theory stuff. So yeah, we should come back to that. We'll have a number of opportunities where I have some issues with that view, partly personal, because I'm a theorist, but partly also in defense of scientific understanding. Yeah, so here you go. This might be a graphical representation of the Google translation engine. I'll send this to Peter. I should do that. Just because they're not using theorists doesn't mean that, in the process of doing machine learning at Google, they're not doing theory. Yes, exactly. Right, they're not doing nothing. Yes, they're doing something. So we've had long debates about this. Okay, so these are kind of the two extremes, now that we've set up the space of histories. And now we have this very explicit approach to thinking about predictions and the partitions of equivalent histories they induce, right? So that's what this construction is.
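The mapping eta and the distribution it induces over partition elements can be sketched directly. The three-cell partition below is an arbitrary guess with hypothetical cell names, not the causal states:

```python
from collections import Counter

# An arbitrary guess at a three-cell partition of length-2 binary pasts
# (hypothetical cell names R1, R2, R3 -- not yet the causal states):
eta = {"00": "R1", "01": "R1", "10": "R2", "11": "R3"}

def state_distribution(seq, eta, k=2):
    """Coarse-grain the empirical history distribution through eta
    to get Pr(R), the probability of each partition element."""
    cells = Counter(eta[seq[i - k:i]] for i in range(k, len(seq) + 1))
    total = sum(cells.values())
    return {r: n / total for r, n in cells.items()}

# For the period-two process, pasts "01" and "10" each occur about half
# the time, so R1 and R2 each get probability ~0.5 and R3 never occurs.
print(state_distribution("01" * 500, eta))
```

The point of the sketch is only the coarse-graining step: the raw probabilities of histories are summed up cell by cell to give probabilities of partition elements.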
We'd also like to have some way of measuring how effective these things are. So the idea here is, at this point, I'm allowing us to just pick our favorite partition. Now, when we went through the prediction game, everyone sort of came up with the right answer. We had some prejudice about what this is. We didn't pick a 100-state model for the fair coin, although, like I said, you could have given me the data back, and I actually couldn't have argued too much against that, at least to start with. So we have this sort of picture. We have these degrees of freedom: whatever this partition is, we can choose it. And then we'd like to know, once we make that choice, how well do we do at predicting? And so we're gonna use our Shannon measures; it's like a block entropy measure conditioned on model choice, right? So think of the conditioning here as on model choice: oh, it's a three-state model, it's a four-state model, right? Or we can do this as a single-step thing, and we have what would be our estimate of the entropy rate, the uncertainty in the next symbol given some choice of model. Just like before, when we were doing length-L history approximations of the entropy rate and looked at how that converged, I can also condition on how I'm partitioning up the space of histories. And that induces some amount of uncertainty that we can quantify using either this entropy rate approximation or the block entropy, yeah. Note that this isn't yet a quantity we have the goal of reducing. Not yet, stay tuned. This is just monitoring prediction. Having accounted for all the information that we brought in, we have natural measures to do this. We can quantify it, right? I mean, if we know what the real entropy rate is, then we can compare this estimate, based on our model assumptions, to that, and any difference would be how poorly we were doing. It's not too hard to develop bounds on this prediction error.
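A rough sketch of monitoring prediction this way, restricted to single-step (next-symbol) uncertainty, with function names of my own choosing. The two model choices shown are the null model and the every-history-a-state model from earlier:

```python
import math
from collections import Counter, defaultdict

def next_symbol_entropy(seq, assign, k=1):
    """H[next symbol | state], where `assign` maps a length-k past
    to a model state (a partition cell)."""
    joint = defaultdict(Counter)
    for i in range(k, len(seq)):
        joint[assign(seq[i - k:i])][seq[i]] += 1
    n = sum(sum(c.values()) for c in joint.values())
    h = 0.0
    for c in joint.values():
        t = sum(c.values())
        for cnt in c.values():
            h -= (cnt / n) * math.log2(cnt / t)
    return h

seq = "01" * 500  # period-two process: true entropy rate is 0 bits/symbol

# Null model: every past lumped into one state; uncertainty saturates
# near the upper bound log2 |A| = 1 bit.
h_null = next_symbol_entropy(seq, lambda past: "R0")
# Every history its own state: achieves the true entropy rate, 0 bits.
h_full = next_symbol_entropy(seq, lambda past: past)
print(round(h_null, 2), round(h_full, 2))  # ~1.0 and 0.0
```

Any other choice of partition lands between these two extremes, which is exactly the bound discussed next.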
It's always got to be less than log of the alphabet size. If I make no assumptions at all, I don't even know the statistics; we saturate the upper bound. All we're assuming is that we know the alphabet size, nothing else; in effect, we assume the symbols occur with equal likelihood. Then there's the other extreme case, where all of the histories are states. Individually, we can predict as best we can, and therefore we'll get the entropy rate out, the true entropy rate of the process, in that case. I mean, the real issue is how we go between these two things: way too big, and trivially small. Now we need to develop some way of comparing models. So there are certain limits on the prediction. We're gonna assume some particular model, and we're looking at the uncertainty over blocks. Well, we know by definition that there's some function that represents the partitioning of the space of histories; that's eta. That's what R is: R is basically a function of pasts. And now we go back to last quarter, and if you remember the data processing inequality: when we had a Markov chain, variable x goes to variable y goes to variable z, then if I tell you what variable y is, it shields the two end variables, so their conditional mutual information is zero. So we actually have a chain like that here. This random variable R, this partition, is a function of the pasts. I should say R is this function eta, and then eta is a function of the past. So we actually have a Markov chain. So you can use that to convince yourself that this model-based uncertainty is always greater than, or maybe equal to if you get lucky, the uncertainty about the future when you use the pasts themselves. Actually, that's a somewhat tortured way of saying you can't do any better than just using the pasts. Don't we want the future S on the right of that diagram? Oh, actually, I didn't do the future, did I? Oh wait, no, pasts, right? Pasts here? Yeah. Did I say it wrong? I mean, maybe I said it wrong.
Yeah, so R is this function eta, by definition. Eta in turn is a function of the pasts. Maybe we don't have the future in that diagram anyway. Right, sorry. Wouldn't we want it on there? No, no, no. All I'm trying to argue here is that this entropy is greater than or equal to this entropy. And that just goes back to the data processing inequality discussion. We showed that if you condition on a function of a random variable, the best that function can do is be the identity; otherwise, it's gonna throw something away. And if it throws something away from the variable you're conditioning on, all that can do is increase the uncertainty. So that's the data processing inequality that leads to that short step. So again, the intuitive thing: models can do no better than to use the histories themselves. But like we were sort of saying, that's sort of unsatisfying. Or, to recast this the other way, I divide by L and take limits: the entropy rate from any assumed model can be no better than the true entropy rate, however random the process actually is. You can't get any better than that. Or, another way of saying this: if I take the R-infinity model, I get the entropy rate back. Then we have our notion of redundancy here. We call it prescience. So again, what I'm trying to do here is develop some quantitative measures of how good a guess R is for a process. If you remember the way we were using redundancy before, we define the prescience of a selected model as just log of the alphabet size minus the induced entropy rate for that model. That's, for example, how compressible the process would look if we assumed this model R. In the case of the null model, the induced rate was just log of the alphabet size, so there's no prescience. In other words, the guess R isn't giving us any leverage in describing the process. So that's what prescience is.
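As a worked instance of that definition (the function name here is mine, not standard):

```python
import math

def prescience(alphabet_size, model_entropy_rate):
    """Prescience of a chosen model R: log2 |A| minus the entropy rate
    that R induces. Zero means R gives no leverage (the null model)."""
    return math.log2(alphabet_size) - model_entropy_rate

# Binary alphabet. The null model induces the full 1 bit of uncertainty:
print(prescience(2, 1.0))   # 0.0 -- no prescience
# A model achieving a true rate of 0.25 bits per symbol:
print(prescience(2, 0.25))  # 0.75
```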
And this, of course, is upper bounded by the total predictability, if you remember, G. Because in that case, using R infinity, we actually get the entropy rate, and the difference is the total predictability: the difference between the raw information, log of the alphabet size, and the true intrinsic randomness, which is the compressibility. So, okay, that actually gives us another incremental step in refining our target. Now we wanna find states such that the entropy rate of the subjectively chosen model is the real entropy rate. And that will be tantamount to showing that an R that does this is an optimal predictor. So this is just focusing on the predictive aspect. But of course, we already have an answer to this: just use every history as a state. That's just too verbose in most cases, except for the all-ones case. So there's another criterion here. And for that, we need to augment our focus on the predictive efficiency of these models, and how we measure it using these information quantities, by talking now literally about the size of the model. To do that, we're gonna look at the distribution that's induced over these chosen partition elements, okay, and look at the Shannon information in that. We're gonna call that the statistical complexity of the selected model R, okay? And there are standard interpretations of this number. H is just a Shannon entropy, so this is the uncertainty in the state. If you don't know which partition element you're in, and I tell you, oh, you're in three, you're in two, your average surprise is this number here. It's also the size of the optimal code, but we'll come back to that. The easiest way to think about it in the current discussion is that this statistical complexity is really something like the size, or the log of the number of states, right?
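That size interpretation can be checked directly; a minimal sketch, with a function name of my own:

```python
import math

def statistical_complexity(state_probs):
    """Shannon entropy (bits) of the distribution over the model's
    states -- the average surprise in learning which state you're in."""
    return -sum(p * math.log2(p) for p in state_probs.values() if p > 0)

# Uniform over four states: C = log2(4) = 2 bits, the log of model size.
c_uniform = statistical_complexity({"A": .25, "B": .25, "C": .25, "D": .25})
# A skewed distribution over the same four states is 'smaller':
c_skewed = statistical_complexity({"A": .85, "B": .05, "C": .05, "D": .05})
print(c_uniform, c_skewed)  # 2.0 and something smaller
```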
If the states were uniform, remember, that P log P formula just becomes log of the number of events. So for now, think of statistical complexity as just the size of the model. Now we have a way of saying this model's bigger than that model, and it's a number, right? It's a scalar. There are other interpretations of statistical complexity we'll come back to. It's the amount of memory induced if the process actually was structured according to model R. Okay, so this is kind of the final statement of the goals. The first was: can we find effective states that give good predictions? Well, not just good, but optimal. And we came up with this criterion: whatever this guess is at partitioning the space of histories into groups of predictively equivalent histories, we want it to get the entropy rate right. We can do no better than that; that's the intrinsic randomness. The second part is: can we find the smallest partition, namely the partition with the smallest number of cells in it? And we measure that by plugging in the partition, calculating the probability distribution over partition cells, looking at the Shannon information, and then minimizing that. So we now actually have a cost function. It's been reduced down to an optimization problem. Search over the space of models that we're guessing, which are these partitions of the space of histories. Every time you make a guess, calculate these numbers, and move in a direction in the space of models so that the effective entropy rate moves closer to the true entropy rate while minimizing the model size. So I have these two numbers, and now you should imagine I have my base space of all these possible models R, where each one is a partition of the space of histories, but just think: a space of models.
And now I have a surface over that space given by these two numbers, and I optimize one while minimizing the other: make it an optimal predictor of minimal size. Okay, so, to sum up, a model is a partition of the space of histories. And now I want you to shift from the space of histories, one step more abstract, to imagine we have a space of models. All I mean is, here's some space, my rendering of the space of all possible models of stationary processes, and every point here is a particular model, AKA a partition over the space of histories; but forget that for now, we just have a space of models. And we get to make these different choices. And then hopefully somewhere here is the truth. Well, actually, we'll talk a little later, around mid-quarter, about when the truth is over here, outside the space. That's called out-of-class modeling. Almost all of mathematical statistics does in-class modeling. So anyway, we'll assume the truth is accessible in the space we're looking in. And, you know, we can have competing models. So we call these rival models, R1 and R2, and at least at first cut they're treated perfectly democratically. We live in a democracy: you can be R1, you can be R2, fine, until further testing. Okay, so we call this Occam's Pool. We have all these models that are reasonable candidates. A familiar one would be what we've been doing. In fact, you could think of almost all the information theory that we were doing last quarter this way. We had one kind of model in mind, but we never really said it that way. We were looking at sequences. Why can't sequences be effective states? Sure, right? Those are the histogram models. They were the word distributions, or collections of words. So you can imagine here, all last quarter we had this series: thinking about words of length one, words of length two, words of length three. We have these histograms and we estimate their probabilities. That's a model.
It's actually kind of a block Markov model: I have sequences of length 10, and what happens when I run out of a length-10 block? I put down another one, sampled from my bag of length-10 sequences and their probabilities, okay? And then presumably, in the limit of infinite L, we end up down here at the truth. There are various convergence properties about how the length-L approximation of the entropy rate goes to the true entropy rate. And that's here, maybe a little more explicitly laid out, in the space of models. So this isn't a strange space; we've been doing this all along. Histograms are models too. So when people talk about model-free statistics, there's always a model somehow. Okay, so now let's finally start answering some of these questions. We're gonna make a commitment here to a particular kind of effective state. I talked about R being this arbitrary choice, different points in the space of rival models, but then we came up with ways of monitoring their predictive performance and their size, and I said that we should think of that as a cost function. Optimal prediction is the goal, with minimal size. So what we're gonna do is define a certain kind of state, called the causal state, and we'll prove it actually saturates those bounds, that it reaches the optima we're looking for, for this objective function: optimal predictor of minimal size. Okay, so causal states: they're the sets of pasts that have the same morph. What I'm gonna do here, and it sounds like I'm almost restating what we said before, is choose the partition of the space of histories such that two histories are equivalent when they have the same future morph attached to them. Before, just to draw the contrast, I was allowing you to partition the space of histories in any crazy way, one that in some cases even violated this, or was more elaborate. Okay, but so this is the main idea.
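An empirical sketch of this construction, under several assumptions: the data comes from the golden mean process (a zero is always followed by a one; after a one, either symbol with equal probability), morphs are truncated to the next symbol only, and a tolerance absorbs finite-data fluctuations. Function names are mine:

```python
import random
from collections import Counter, defaultdict

def golden_mean(n, seed=0):
    """Sample the golden mean process: a 0 is always followed by a 1;
    after a 1, emit 0 or 1 with equal probability."""
    rng = random.Random(seed)
    out, last = [], "1"
    for _ in range(n):
        last = "1" if last == "0" else rng.choice("01")
        out.append(last)
    return "".join(out)

def causal_states(seq, k=2, tol=0.05):
    """Empirical predictive equivalence: merge length-k pasts whose
    next-symbol distributions agree within tol."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        counts[seq[i - k:i]][seq[i]] += 1
    dists = {p: {s: n / sum(c.values()) for s, n in c.items()}
             for p, c in counts.items()}
    groups = []  # pairs of (representative morph, set of equivalent pasts)
    for past, d in sorted(dists.items()):
        for rep, members in groups:
            if all(abs(d.get(s, 0) - rep.get(s, 0)) < tol for s in "01"):
                members.add(past)
                break
        else:
            groups.append((d, {past}))
    return [members for _, members in groups]

found = causal_states(golden_mean(20000))
print(found)
```

The two groups recovered, {"01", "11"} and {"10"}, correspond to "last symbol was a one" and "last symbol was a zero": the two causal states of this process.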
Again, we're not distinguishing histories that lead to the same predictions. That would be a waste of time and storage, so why do it? Okay, so the main mathematical object: we have this predictive equivalence relation, which I'll sometimes refer to as the causal equivalence relation. What we do is identify two histories, prime and double prime here, if and only if the future distributions conditioned on seeing the two different histories are equal. So I'm just formally defining this equivalence relation, tilde, to implement the idea: don't make distinctions between histories in your model that don't improve your predictive ability. They're redundant; throw them away. Okay, so this is the main idea here. On the one hand, we'll see it's a nice mathematical concept. We can take models and calculate these groupings of histories analytically, but there are also empirical implementations of this equivalence relation, so you can look at raw data and pull things out. Okay, so the claim is that this predictive specification, which determines the causal states, is an equivalence relation. Well, what's an equivalence relation? Just to review what that means: first of all, it's a relation. A relation on some space is simply a statement about pairs of elements in that space; it's a grouping of them. For example, on the integers, the pairs where one integer is twice the size of the other form a relation. Now, the claim is that the predictive equivalence relation is, in addition, an equivalence relation, which comes with three more properties. Namely, tilde is reflexive: every history is equivalent to itself; well, it leads to the same prediction, so that's kind of trivial. Symmetric: if history s prime is equivalent to s double prime, then s double prime is equivalent to s prime; nothing deep here. Maybe more interesting: transitive.
That is, if history prime is equivalent to history double prime, and double prime is equivalent to history triple prime, then history prime is equivalent to history triple prime. They're in the same partition cell. Notation: we use square brackets to represent an entire class. I can put in a particular past, and when I want to talk about the set of all histories equivalent to that particular one, I put square brackets around it. So it's all the other histories equivalent to the given one. And then there's a nice, compact, sort of group-theoretic notation: I take the raw set of histories and apply the equivalence relation, like a division, or a modulo; then I get a new space, pasts mod tilde, which consists of the equivalence classes themselves. So I'm going from raw sequences to these classes. The predictive equivalence relation induces a partition, in the sense that the equivalence classes aren't empty, unioning them up gives all of the observed pasts, and they don't overlap. And it's not too hard; each of these statements is a two-line derivation once you start from the causal equivalence relation. In fact, with this kind of mathematics it sometimes looks like it's not even clear what there is to prove; it seems obvious. But the proofs are in the paper CMPPSS, the Computational Mechanics paper on the readings page. Okay, so now we have these causal states: groups of pasts with the same morph. And I've introduced this script notation for the causal states. There are different ways to think about these causal states; they have three different components, I should say.
One: a particular causal state is the set of pasts in its partition cell, the set of all histories equivalent to it. We can also think of it this way: forgetting about the pasts, going from the raw space of sequences and modding out by the equivalence relation, we end up with a set. Well, a set of sets, right? Each individual element is a group of pasts, but there's a collection of these things, the partition elements. And the way to think about these is that, in a sense, these are their names. So this is state one, state zero, state seven, and so on. We have sets of pasts, and they have names, okay? And, just using the previous results, unioning these sets gives us the full space, and they don't overlap. Now, just like we did with our subjectively chosen R models, we have a mapping from a particular past to a causal state; it's called epsilon. It maps from the space of histories to the space of causal states. Or you can think of it as a function: I put in a particular past and it returns the equivalence class. Or, more simply, it just returns the name: put in this particular past, and it says state seven. Easier, okay. And again, just for notational purposes, if I put in the uppercase here, that's the random variable, the semi-infinite past. And that means I can also think of the causal state as a random variable: what state am I in now? That's the random variable there. There's, of course, another piece associated with a given causal state, and that's its morph, right? So there's a set of pasts, it's got a name, state seven, and there's also the prediction it's making about the future.
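The epsilon map is just a lookup from a past to the name of its class. A minimal sketch, assuming the causal states are already available as a list of sets of (finite) pasts; `make_epsilon` is a hypothetical helper name, and the state "names" are simply list indices:

```python
def make_epsilon(causal_states):
    """Build the epsilon map: past -> causal-state name (here, an index)."""
    lookup = {past: name
              for name, members in enumerate(causal_states)
              for past in members}
    return lambda past: lookup[past]

# Two hypothetical causal states for the period-2 process:
epsilon = make_epsilon([{"01"}, {"10"}])
```

So `epsilon("01")` returns the name of the first state and `epsilon("10")` the name of the second, exactly the "put in a past, get back a name" picture above.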
If the process is in that state, the prediction it's making is given by the future distribution conditioned on, now that we have a particular model, the calligraphic S, the causal state, okay? So what is this probability? Well, this future distribution is the same as the one conditioned on any particular past in the equivalence class. We already know these things, because we collected the pasts together; we built the equivalence classes based on knowing what that future distribution is. So we can take any past in a causal state to give us its future morph. It's all equivalent, a bit of a tautology. Now just a quick hint, an allusion to some of the technical difficulties here, where sometimes you have to roll up your sleeves. In the discussion so far I've been using finite futures, so there's not much of a problem there, but I was conditioning on infinite-length pasts. When you implement this, or even in some of the proofs, you'll see that sometimes we build up the causal states going only a finite distance into the past, or build up the future morph going only a finite distance into the future. There's a way to do it, but notice, the notation gets a little crazy. So I'll present things as an overview first, and then, as necessary, we'll roll up our sleeves and take some limits. You'll see in the next couple of weeks. [Student question, partly inaudible: whether, going far enough back, the finite-length construction converges for some processes and not for others.] Yes, right, I'm just hinting at that. We're gonna stub our toes a couple of times; things will balloon out when we're trying to prove something, and then we'll clean up the mess and go back to our intuitions. Yeah, in fact, it's still sort of a research topic.
We're working on what the minimal L and K are, for a given process, that you need to go out to in order to find the states you would get with the semi-infinite past and future. And it's a little bit messy. [Student question: could there be, in some sense, chaotic behavior in what you're learning?] In the learning algorithm? [In terms of what the state of knowledge itself is doing.] Oh, sure, we'll get to that. When we start talking about semantics and the like, you'll see that the state of knowledge varies: as you make successive measurements, the state of knowledge you have about the process can vary all over the place. So, okay. We've actually answered the first part: when we're talking about modeling, completing the learning channel, really seeing what the modeler at the end of this learning channel is supposed to be doing, we have the effective states. Or that's the claim. I haven't actually proved anything to you; mostly what I've been doing is motivating things and giving you definitions. So I still have to tell you why you should care. Maybe it's plausible, but you shouldn't really be convinced at this point. It's just plausible. Okay. But let me finish this up real quickly. Once we have the states, it's not too hard to get the transition structure. That'll be kind of the end of it, and we'll have arrived at this so-called preferred representation. And then the following lectures address: what did we do today, why do I care, what properties does the set of causal states actually capture? And I actually have to prove that to you. So, okay, quickly. Assuming we have the set of causal states, pasts modulo the predictive equivalence relation, what's the state-to-state transition structure? Imagine we have some past S prime.
Well, because we have the epsilon function, the functional version of the partition of the space of histories, we know we're in, say, causal state I. Okay. Then we make a transition: we observe a symbol, just one. And now we have a new history, right? We append the new symbol to S prime, giving a new history, S double prime, here. We plug it into the epsilon function, and it tells us what causal state we're in now; call that J. So that's it: on symbol s, we go from causal state I to causal state J. And we just go through this for each state: where do I go on zero, where do I go on one? We use the epsilon function. So, in a sense, the transition dynamic is already implicit in the epsilon function. Another way to think about this is an operation we sometimes call causal state filtering. Imagine I have some series of measurements or symbols like this. I can stop at each moment in time, and there'll be some semi-infinite history I saw up to time minus three, some semi-infinite history up to time minus two, and so on. At each moment, I can take that semi-infinite history and plug it into the epsilon function, and that tells me which causal state I'm in. So causal state filtering goes from the raw data to the internal state transition structure: we now have a process over causal states. In some sense, the epsilon function is extracting the internal state dynamics for us this way. So we have this new process, the causal-state process. Well, we'd also like to have transition matrices here. So we look at the probability that, if I'm in causal state I, I generate symbol s and go to state J. Well, I know how to do that: if I'm in causal state I, there'll be some particular exemplar in its equivalence class that leads to it.
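Causal state filtering itself is a one-liner once you have an epsilon map. A minimal sketch, assuming a finite past length K stands in for the semi-infinite histories; `filter_states` and the hand-built two-state epsilon map are hypothetical, chosen to match the period-2 example:

```python
def filter_states(data, epsilon, K):
    """Causal-state filtering: replace each length-K past with its state name."""
    return [epsilon(data[i - K:i]) for i in range(K, len(data) + 1)]

# Hand-built finite-past epsilon map for the period-2 process:
epsilon = {"01": "A", "10": "B"}.get
states = filter_states("010101", epsilon, K=2)
```

The output is the new process over causal states: for the period-2 data it simply alternates A, B, A, B, ...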
And then the state I'm going to: I take one of the pasts in there, append s to it, plug that into the epsilon function, and get the next state. So I can directly calculate this transition probability using things I already have. Okay, so we end up with, and this should look very familiar, and this is why way back in week three or four we described hidden Markov models with symbol-labeled state-to-state transition matrices. That's kind of a long-range correlation in the course, but this is why we're using that representation, why in these hidden Markov models the symbols are on the edges. There were state-labeled versions that we didn't use; it's an equivalent representation, and we avoided it mostly to match up with this. Okay, so that's kind of it in terms of defining the main representation. The claim is that there is an intrinsic representation we can get from the definition of a process, in terms of just its raw behavior. From it, I essentially outlined a procedure to get this set of equivalence classes, the causal states, and then very quickly I told you how to get the symbol-labeled transition matrices. It's a kind of hidden Markov model, but with these causal states inside, rather than arbitrary Markov states. So that's the end result: the process telling us how it should be represented. We're not imposing that it should be a seven-state Markov chain or something like that; it tells us. Given the statistics, the raw behavior, we get it. And here is an example that should look very, very familiar: the even process, right? Ones come in pairs, with maybe any number of zeros in between. We'll talk about it and actually reconstruct this from even-process data, so we understand where the states come from.
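Estimating those symbol-labeled transition matrices from a filtered state sequence is mechanical: count which (symbol, next-state) pairs leave each state, then normalize. A minimal sketch, assuming the same hypothetical finite-past setup for the period-2 process as before; all names here are illustrative:

```python
from collections import defaultdict

# Hypothetical finite-past setup for the period-2 process:
data = "01" * 50
K = 2
epsilon = {"01": "A", "10": "B"}.get
state_seq = [epsilon(data[i - K:i]) for i in range(K, len(data) + 1)]

# Count (symbol, next-state) pairs out of each state, then normalize,
# giving estimates of P(emit s and go to J | in I).
counts = defaultdict(lambda: defaultdict(int))
for t in range(len(state_seq) - 1):
    i, j = state_seq[t], state_seq[t + 1]
    s = data[K + t]                  # the symbol emitted on this transition
    counts[i][(s, j)] += 1
T = {i: {key: n / sum(outs.values()) for key, n in outs.items()}
     for i, outs in counts.items()}
```

For the period-2 data this recovers the deterministic machine: from A you emit 0 and go to B with probability 1, from B you emit 1 and go back to A with probability 1.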
But just generally, there's always a unique start state; there can be a set of transient states, possibly with cycles that you rattle around in for a long time. But as soon as you leave the transient states, you go into the recurrent states and stay there forever. And usually the epsilon machines will have a single recurrent component like this; there are other cases, and we'll come across a few, with multiple recurrent components. So this emphasizes our discussion of the period-two process: lambda is the null symbol. If you haven't made any measurements, your history is just the null word. The start state up here means I haven't made any measurements yet; it's the equivalence class of that null past. So S zero is the unique start state: the equivalence class of not having seen anything yet, the entire space of pasts. The other way to think about it, in terms of distributions, which we'll get into a little more: we can think of this as a machine that lets probability distributions relax. You start out with all the probability in the start state. So that's another way to think about what the start state is. And then the transition structure tells you how this delta-function probability on one state spreads out and finally relaxes down to the asymptotic stationary distribution on the recurrent states. So let me finish up here. Right, essentially right: the transient states capture either how the process relaxes to equilibrium or, if you know the model, how the observer comes to know which recurrent state the process is in. And the recurrent states capture the long-term, asymptotic stationary statistics, if there's a single component. Quickly, okay, I'm gonna shift gears real fast here. The homeworks and labs are now gonna start using this Computational Mechanics in Python package.
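That relaxation picture is easy to see numerically. A minimal sketch, assuming the even process's two recurrent states, call them A and B, with the usual fair-coin transitions (from A: emit 0 and stay with probability 1/2, emit 1 and go to B with probability 1/2; from B: emit 1 and return to A with probability 1); the matrix and helper below are illustrative, not any course code:

```python
# Even process recurrent-state transition matrix (rows sum to 1):
T = [[0.5, 0.5],     # A -> A on "0", A -> B on "1"
     [1.0, 0.0]]     # B -> A on "1"

def step(p, T):
    """One step of the state-distribution dynamics: p -> p . T."""
    n = len(T)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

p = [1.0, 0.0]            # delta-function distribution on the start state
for _ in range(100):      # let the distribution relax
    p = step(p, T)
# p is now (numerically) the asymptotic stationary distribution (2/3, 1/3)
```

The delta function on one state spreads out over a few steps and settles onto the stationary distribution, which is exactly the "machine that lets probability distributions relax" picture.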
So a lot of the particulars about epsilon machines, and ways of calculating these other information measures, are built in; we really haven't used them yet in the Sage system. So here's just a quick overview of its components, and we'll be going through these week by week. There's a sub-module for probability theory that lets you deal with different kinds of conditional, marginal, and joint distributions. There's an information theory module, which we've used a little before: mutual information, conditional entropies, block entropies, excess entropy, ways of plotting things. And then there's a whole range of routines for working with epsilon machines, these optimal representations of stationary stochastic processes. Some of it's numerical, some of it's symbolic and closed-form. So as the weeks go on, in the exercises you'll be doing a little more Python programming, and the Sage environment's pretty nice. We're not writing large Python programs; these are always a few lines, six or ten lines, nothing terribly big, and we'll provide helper code. If you're interested in learning more about Python, you can look at these slides, which are online; there's another course I teach with some Python tutorials, mostly oriented towards doing computational science, as opposed to using Python for web programming or spreadsheets and boring things like that. People use those to get started with Python. So I'll end with that, almost on time.