Okay, well, let's get started. So today we actually have some work to do. On Tuesday I gave you an outline of one of the methods for calculating the epsilon machine given a description of a process in terms of its word distribution. Hopefully that's clear. There are two problems on the current homework, which are to complete the probabilistic reconstruction for the golden mean process and for the even process. If you noticed, I made a little note on the homepage: I went back and fixed the topological reconstruction for the golden mean process. The slides were wrong. At the last minute I copied the wrong machine over. It made no sense, but it's now corrected, so you can go back and look at that. The online lecture will have the wrong slides, but the PDF and the HTML for the slides are now corrected, so you can step through that. You'd probably want to read that as you get ready to do those problems. Okay, but today I want to address the question of why you should care. Why should one care about this set of causal states and the transition dynamic? In motivating the construction, in talking about how we start out with prediction and how that leads us, interestingly enough, to this kind of structural view of a process, I kept hinting at why this is important. But we mostly focused first on the formalism and then on the algorithmic side of it, which kind of brackets the idea. So hopefully there's at least some sense of the underpinnings of how it works concretely. But the real question is: okay, imagine you're successful and you actually have one of these things in your hand. What does it tell you about the process? That's what today is about, and there's really some work to do. Namely, there are a number of results, basically theorems, propositions, corollaries, and whatnot, and proof sketches of how they work. The longer, more detailed proofs are in the Computational Mechanics article.
Today I will give what I'm going to call proof sketches. Think back to Cover and Thomas, when you were reading through and they introduced various properties of information gain. The kind of proofs that the Cover and Thomas book gives are at this higher level, not detailed, not very technical, but at least they're constructive and give you an intuition. So that's my goal today: to establish the main properties of the epsilon machine. So here we are, back at the learning channel. Finally, after many weeks, we're really addressing what a modeler should do, or an intelligent agent, or a piece of neural tissue that, in principle, helps an organism survive. What should it be doing? We're answering this in principle; practical issues will come later. There are two basic questions, which really came from how we introduced dynamical systems. We're looking for a dynamical system. What are the states, given this impoverished view, this inaccurate instrument's reporting of what's going on in the hidden box? And then what are the equations of motion, what is the dynamic over these states? The claim from the past two lectures is that these are now answered. What are these effective states? They're the causal states. What are they effective for? They are effective for prediction. What are the equations of motion? Well, it's really the causal-state-to-causal-state transition structure. The net result is this epsilon machine: a set of states and a set of transition matrices, symbol-labeled transition matrices, and those form a kind of hidden Markov model that has a number of properties that we'll establish today. And then, once we get the properties established, you say, okay, I'll give you these properties. What does that tell me about how a process is organized and how random it is? And also, what other things can I calculate if I have this epsilon machine? The answers will be relatively sweeping.
Okay, so just to recall, the epsilon machine is this set of causal states and a set of transition matrices, measurement-symbol-labeled transition matrices. The first real lecture was trying to motivate this equivalence relation, the predictive equivalence relation, that induces the causal states. Again, we group histories together when, conditioned on those particular histories, the view of the future is the same. In other words, don't make distinctions between particular pasts if they lead you to the same view of the future. You can think of this as either incredibly obvious or profound. Certainly in terms of the consequences, it's a surprisingly powerful idea, and I'll try to convince you of that today. So we have this set of causal states. Algebraically, we have this space of histories and we mod out by this equivalence relation, and we end up with this set of causal states. There's always a unique start state. Then, once we have the causal states, we can use causal state filtering. At every moment in time, we do have a history. We can use the epsilon function to look up, for the particular history we've seen, what causal state we're in, and then we can see what the state-to-state transition structure is. So from the original data we can pull out these transition matrices over the causal states. Causal states have several things attached to them. They are a set of histories, an equivalence class of histories. They have names: 0, 1, 2, 3, or Alice, Bob, Charlie. It doesn't matter. And there's the future morph: each causal state is making some statement about what the future is going to look like. It, in principle, determines the past-conditioned future distribution, and then these transition matrices. So there's this overall process, I should say procedure, which we did on Tuesday. We can start with some description of a process.
This can be obtained any number of ways: by observation, or by starting with, say, a physical or biological model and figuring out what behaviors occur and their distribution. We then calculate these future morphs, search for them on the parse tree, and we end up with the causal states and the transition structure. There's always this unique start state. Lambda means the null symbol: I haven't made a measurement yet, so I don't know what causal state the process is in. Then I start making measurements, just like for the period-two process. Initially, it's a fair coin, because 0's and 1's occur with equal probability, and then I have to make a measurement to see the odd or even phase of the period-two oscillation. There's a probabilistic way of thinking about it, where I describe the modeler's view of what's going on, or my ignorance. I don't know what state the process is in, and that corresponds to putting all the state probability on this start state. Then we watch, step by step, how this flows down into the recurrent states and settles out. In the generic case, we're going to have transient states that we visit for some period of time, and then we'll make some transition down to a set of recurrent states. Typically there's just one recurrent component, and then we rattle around in the recurrent states, asymptotically in time. Now, this picture is helpful, because it looks like what we were talking about before, in the winter quarter, in terms of Markov chains. As presented to you here, it is a hidden Markov chain, an edge-labeled hidden Markov chain. Actually, the mathematics we've introduced is much more powerful than this, and it depends on the nature of the process. In particular, things don't have to be finite state. It's okay to think about this finite-state Markov chain or hidden Markov chain. That's fine, to get started.
But in fact, there are a number of cases we're going to analyze where you actually have a countable infinity of causal states. This is just one example, and in fact, this is the epsilon machine for something we've already studied and done calculations for, like you started to do in homework ten: the simple nonunifilar source. It turns out the simple nonunifilar source has a countable infinity of causal states. You start to get some sense of that, mostly because the reconstruction algorithm tops out pretty quickly if you're doing it by hand. So there are other ways to get to this thing. But even more interestingly, or maybe more of a challenge, there can even be some sort of fractal set of causal states, some partial continuum of causal states. Now, this isn't obvious, but I just want to let you know that the definitions in the framework we've already set up can, for some class of processes, actually induce this. The examples I'm showing you here, whether it's that simple nonunifilar source, or this one, or even this one with a continuum of causal states, are all processes generated from randomly selected hidden Markov chains that are nonunifilar. In some sense, converting to the epsilon machine, as I'll prove to you today, is changing that model to a unifilar model, and the consequence is actually generating this distribution. Just to make this a little bit plausible: this is actually a three-state hidden Markov model. It's nonunifilar. Its states are A, B, and C. The causal states, if you think of yourself as the observer, describe your best prediction of the distribution over these states given the observed symbols, a particular distribution over states. When we have just a finite set of causal states, then typically, if you see a sufficiently long word, you'll always end up with a delta function distribution on one particular state.
That was the problem of synchronization we talked about. It turns out that for these nonunifilar Markov chain processes, this problem of synchronization is much more problematic. You never end up knowing the state with certainty. So this is actually a simplex. It's a distribution over three states. Every point in here is a combination of numbers between zero and one that add to one. The simple finite-state case I first showed would be that, eventually, after you've seen this or that word, you know what state you're in. There's a kind of hopping around. But in the general case, we have to think about very complicated sets of causal states. So, interesting, even for a finite-memory process. These are called the mixed states. We'll come back to that in a couple of weeks. But just so we don't think that all we're working with are, as my colleagues in computer science would think, probabilistic finite-state machines: no, we're not doing that. These are rather rich dynamical systems. Excuse me. Okay. So now let's start thinking about what this epsilon machine representation of a process is. Actually, sometimes I'll use the word presentation, partly because there could be many models of a given process. So sometimes, in the literature, we use the word presentation rather than representation. Okay. So in what sense is an epsilon machine a model of a process? Well, maybe we started with the process. We calculate the causal states and transition structure. We now have this particular kind of hidden Markov model. So one question would be, what does it describe about the process? One way to think about using it is as a generator. I now have a model; in a sense, I put it into a simulator and see what the word probabilities are. Since we started with the process and we're claiming it's a good model of the process, it's intuitive that the epsilon machine generates the word distribution.
It's supposed to model that, right? So in what sense is an epsilon machine a model of a process's word distribution? Again, the word distribution is the distribution of length-one words, length-two words, words of all lengths, and the machine is supposed to somehow reproduce that. So let's look at a particular word of length L. We have a series of measurements here, of length L. The rule of using an epsilon machine is that we always start in the start state. So I'm going to put unit probability on the start state and then look at the transition structure. The way to think about this: I have my epsilon machine, and I put the probability distribution all on the start state. I don't know what's going on. It's the unique start state of ignorance about what the process is doing. Then you make a series of measurements, and a particular word takes you through a particular path in the machine. Starting with this initial distribution, which is trivial, just a delta function on the start state, I follow the transitions. I take the probability of going from the start state to the next state given that I saw the first symbol, times the transition probability of going from that state to its successor state on seeing the next symbol. I just follow a path through, and the whole time I'm multiplying through a series of transition probabilities. I get down to the last symbol. That product is the probability of the word that's assigned by that epsilon machine. So in short, there's a rather direct way of calculating. If we have the full epsilon machine with its unique start state, it's really just this telescoping product along the path that the sequence takes. Now, the way I'm describing this assumes something I'm going to prove very shortly. Namely, it assumes unifilarity. Remember, unifilarity means that if I'm in a state and I take a transition that's labeled by the symbol I just saw, I go along a unique transition to a unique next state.
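To make the telescoping product concrete, here is a minimal sketch for the golden mean process. It assumes the fair-coin version of that process (from recurrent state A, emit 0 or 1 with probability 1/2; from B, emit 1 with certainty), and the start-state probabilities 2/3 and 1/3 are worked out from that assumption rather than taken from the slides. The names `GOLDEN_MEAN` and `word_probability` are illustrative only.

```python
from fractions import Fraction as F

# machine[state][symbol] = (transition probability, next state).
# Unifilar: each (state, symbol) pair has at most one entry.
# Assumed golden mean machine (no consecutive 0s), with a transient start state.
GOLDEN_MEAN = {
    "start": {"1": (F(2, 3), "A"), "0": (F(1, 3), "B")},
    "A": {"1": (F(1, 2), "A"), "0": (F(1, 2), "B")},
    "B": {"1": (F(1, 1), "A")},  # a 0 must be followed by a 1
}

def word_probability(machine, word, start="start"):
    """Multiply transition probabilities along the single path the word takes."""
    prob = F(1)
    state = start
    for symbol in word:
        edge = machine[state].get(symbol)
        if edge is None:  # disallowed transition: the word never occurs
            return F(0)
        p, state = edge
        prob *= p
    return prob

print(word_probability(GOLDEN_MEAN, "101"))  # 2/3 * 1/2 * 1
print(word_probability(GOLDEN_MEAN, "00"))   # forbidden word
```

Because the machine is unifilar, the loop visits one state per symbol, so the cost is linear in the word length.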
That was the problem with nonunifilar models, that this might be branching. I might go from A to both B and C on a zero, and then this calculation would branch out. I'd have to be following this increasingly large number of alternative paths that would be consistent with the observed word. We talked through that with the simple nonunifilar source and so on. Here, when I write this out, I'm actually assuming this unifilarity property. So I'll prove that to you: this new presentation of a process is unifilar. In that case there's just one path through the machine for every word, up to some technical provisos. For example, if I say zero, there can be a number of transitions on a zero. That's why, in this first way I'm describing how the epsilon machine produces the word distribution, we're starting from this unique start state. In that case there's a single path through the machine. Yeah, in a sense you start off synchronized. By definition there is this unique start state, which we'll have to establish. And then, since every individual transition is unifilar, there's just one alternative, and therefore I'm going to follow one path. Okay. So in a sense unifilarity has this practical computational complexity advantage: calculating the word distribution is linear in the length of the words. With nonunifilarity I had this potentially exponential increase in the set of alternatives I'd have to track, and that would be computationally more intensive. Now there's another way to think about this, which contrasts with using the unique start state. You can also forget the transient states, assume I've just got the recurrent component, and then calculate the asymptotic state visitation probability. So this alternative approach to calculating the word distribution from a machine is to assume I start in any of the recurrent causal states with the given asymptotic state probability, and then I see the given word I'm interested in from that state.
Of course there might be some forbidden transitions, so a given word doesn't necessarily follow from every state, and you just keep track of that. The way we do this is, again, to go back to the transition matrix that describes the causal-state-to-causal-state Markov chain itself and calculate its left eigenvector, normalized in probability. That gives me the asymptotic probability of visiting the various causal states. And then, and I'm using bra-ket notation from physics here, but this is just a matrix product: we have this eigenvector, think of that as a row vector, times this symbol-labeled transition matrix, and then I add things up. I start in any state with this given probability, that flows one step through the machine, and I add up the resulting probabilities to get the probability of a given symbol. That's just for one step. It extends to two steps with the same idea: the row vector times the product of the symbol-labeled transition matrix for the first symbol and the one for the second symbol, and then I add them up with that column vector of ones. So in general this is how we calculate the word probabilities if we want to use this method: I don't know what state I'm in, I assume the asymptotic state probability, and then I just calculate. If I had ten states, I might have ten paths to follow, but typically things get pruned off very quickly in terms of what the allowed start states are for a given sequence. So for any word, we just think of this as one matrix; we do this matrix multiplication out, following the symbols and choosing which transition matrix to use. Okay, so those are the two ways to calculate the word distribution. Yeah? [Audience question:] I guess, whether any of the other ways of generating an epsilon machine would give you just three, just three... I mean, the method we wound up using to generate the epsilon machine always gives you the start state, so we wouldn't need this method, but are there other ones that... [Lecturer:] Oh, yeah, yeah. Right, yes. Okay.
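The stationary-distribution method just described can be sketched as follows, again for the recurrent part of the golden mean machine under the fair-coin assumption. The matrices `T`, the state ordering (A, B), and the helper names are illustrative. The left eigenvector is approximated here by power iteration, which is an assumption of this sketch and relies on the recurrent chain being aperiodic; an eigen-solver would do equally well.

```python
# Symbol-labeled transition matrices over recurrent states (A, B).
# T[s][i][j] = Pr(emit s and go to state j | in state i). Values are assumed.
T = {
    "0": [[0.0, 0.5], [0.0, 0.0]],
    "1": [[0.5, 0.0], [1.0, 0.0]],
}

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def stationary(T, iters=200):
    """Approximate the left eigenvector (eigenvalue 1) of the state-to-state matrix."""
    n = len(next(iter(T.values())))
    total = [[sum(M[i][j] for M in T.values()) for j in range(n)] for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iters):  # power iteration; assumes an aperiodic chain
        pi = vec_mat(pi, total)
    return pi

def word_probability_stationary(T, word):
    """Compute pi * T[w1] * ... * T[wL] * 1 (sum the final row vector)."""
    v = stationary(T)
    for symbol in word:
        v = vec_mat(v, T[symbol])
    return sum(v)

print(stationary(T))
print(word_probability_stationary(T, "101"))
```

For this small machine the result agrees with the start-state calculation, since the start state's outgoing probabilities are themselves built from the asymptotic state distribution.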
Good. Right, so why have two techniques? Why not just use this first thing? There are a number of reasons, which we'll run across, but just to give you some idea of how complicated these things can be, which hints back to those examples I just gave of a fractal continuum of causal states. First of all, there are some methods that basically just estimate the recurrent causal states. For example, I did mention this one thing, causal state splitting reconstruction. Its main ansatz is to assume you have an IID process, a single causal state, and then as you collect data you reach a kind of statistical threshold: oh, I see no consecutive zeroes, if it was a golden mean process. That gives you a statistical justification for adding a state that corresponds to remembering that restriction: if I see a zero, I must see a one. So it starts off that way, and it really focuses on just the recurrent states. The subtree reconstruction that we went through on Tuesday, since we're starting at the top tree node, and that's where the transient states are going to be if they're anywhere, tends to include all of that. Now, practically, sometimes even a process with a finite number of causal states can have an infinite number of transient states. In that case maybe I don't want to calculate them. It's just easier for me to calculate the asymptotic state distribution on a seven-by-seven matrix, if I have seven causal states, and just do this. We'll talk about when that happens. So yeah, there are reasons. Okay, but really the main goal is to establish a number of properties, at the end of which you're supposed to go, oh, I see why we should do this, as opposed to all my hand-waving motivations before. So we're going to talk about causal shielding: how the causal states actually render the observed past and future conditionally independent. As I already hinted, we have to show that the epsilon machine is
actually a unifilar hidden Markov model; that the causal states have a kind of Markovian property of summarizing the past; and, maybe more operationally, that they're optimal predictors. In fact, there can be other optimal predictors, and the epsilon machine is one of those, but it's the optimal predictor of minimal size. And if you come up with your own favorite model and say, here, it's an optimal predictor and it's of minimal size, then basically you and I are just disagreeing over what we call the states: Alice, Bob, and Charlie, or 0, 1, and 2. Essentially this presentation is unique, and hence this set of properties strongly suggests this is what you should be doing in any modeling. Now, again, we've hinted at and discussed very vaguely that with finite data there are all sorts of issues. What we're trying to establish here is in principle: if we have a good model of the statistics of a process, what is the best presentation or representation? And we're extracting that from the data, getting the number of states and the transitions from the process itself, not imposing them, as is typically done in a lot of hidden Markov model data analysis and modeling. Okay, so we just have this series of results to get through. First thing: causal shielding. What do I mean? I mean the past and future are independent if I give you the causal state. The past is independent of the future if I know the causal state. In other words, you're trying to predict something about the future, and either you can remember the particular past that's taken you to this present time, or I can say, oh, you're in state D. Somehow those are equivalent kinds of information. That shouldn't be too surprising, because the causal states are made of pasts. Okay, so how are we going to do this? Again, we'll think of this process as described by its infinite chain of random variables, with the past and the future. And then what do I mean, what's the probabilistic definition of this conditional
independence? So think of this as: if I had the probability of X and Y given Z, then I would say X and Y are conditionally independent given Z if this joint conditional distribution factors into the product of the two marginal conditional distributions. In this case, the distribution over the process's past and future given the causal state is the product of the distribution over the past given the causal state times the distribution over the future given the causal state. The way we say this is that if you know what causal state you're in, it shields the past from the future. Well, or the future from the past; it's symmetric. It's very much like what the states of a Markov chain do, but for hidden processes. Okay, so to prove this we're going to use the properties that define the epsilon machine, and then we also build up for the other properties, using shielding and unifilarity and so on. What we want to do is take this joint conditional distribution and split it, just by applying a probability identity. We're taking the probability of X and Y given Z and factoring that into the probability of Y given X and Z times the probability of X given Z. That's just a probability identity; these are equal. So what I have to argue to establish the property is that this dependence on the past disappears, in other words, that this factor here really depends only on the causal state. The idea is pretty simple. I'm in a causal state, and histories lead to the causal state, so in a sense they're redundant information. So that's the goal, and it's not too hard. Okay, so what we're going to do is imagine we had some particular past, and we stick that into our epsilon function, which tells us what causal state we're in. Now the first thing I'm going to do here is take this left-hand side and unpack it a little bit. So the distribution we're really talking about
is the distribution of the future given that we saw a particular past, S prime, and that we're in some causal state, and maybe the exemplar of that causal state's equivalence class is some other past. Okay, but now we wouldn't be in this causal state if S prime and that other past were different in this sense; they're essentially the same information. So in fact this probability is the same thing as just looking at the future given this particular past. What I'm doing is basically picking the particular past I've seen, this realization S prime, to be another exemplar in this equivalence class. They all lead to the same distribution. Okay, so that's one way. However, with this probability here I can also go back. I'm looking at the future conditioned on the particular past, but then I can also apply the epsilon function to that past to find out what causal state I'm in. This is just using the causal equivalence relation: these two distributions are the same. I can condition either on the past or on the causal state whose equivalence class contains that past. So this ends up establishing that this probability here is just this one: the uncertainty in the future conditioned on the particular causal state doesn't depend on the history. I can just drop it. The other way I was saying it is that knowing the causal state you're in and knowing what history has led to it are redundant kinds of information, so I can just factor that out. So we get this factor. Okay, as a proof sketch, it hints at the sort of manipulations you have to go through to establish what is a relatively intuitive property: namely that pasts and causal states, which as they're constructed are functions of the past, are the same kinds of information, at least as far as predicting the future distribution goes. Okay, so that gives us the causal shielding idea. Good. It gives us some interpretation of
what these causal states are doing, a kind of operational property of the causal states. Okay, so now something maybe a little more surprising. The claim is that epsilon machines are unifilar. If you remember, unifilar means that if I'm in a given causal state at time t and I see a symbol, there is at most one successor state that I go to. So state A goes to state B or C, but it goes to B on a zero and C on a one. I can't go from A to both B and C observing a zero. Okay, so what does this mean? Now we have to say what this property is more formally. We assume we're in state i and then we observe some symbol s. What does it mean that we see at most one successor causal state? It means that if I'm in this causal state, I pick one of its histories. Then, if I see s, that gives me a new history at the next time step, and the claim is that it is in the successor state j. Basically there are two different cases. If there is a next causal state, then we have to show that all the other successor states have zero transition probability; I can't get there. Or it could be the case that I'm in, say, state A, and on a zero that's a disallowed transition. In that case we're just going to set the transition probability to zero. Okay, so that's the setup. Really this boils down to assuming that we've seen two different histories and that they're in the same equivalence class. Then, having seen history s or history s prime, at the next step we see the symbol s. So we assume we're going to see symbol s, and the claim is that the new history, the past s followed by symbol s, is equivalent to s prime followed by symbol s. At the previous time step they're in the same equivalence class, and you go one step forward having seen the same symbol, and you're in the same equivalence class. Okay, so to do this we need a little bit of notation here. So after we've seen some
particular past, we look to the future, and that's a set of sequences. Rather than this whole set, what I'm going to do is talk about the set of sequences that follow if, say, I saw a zero. The idea here is that s is a particular realization, 0 or 1, say, and then F is all the sequences you could see from that point; it's called the follower set. The other way to say this is that this set of sequences sF is all the future sequences I could see that are prepended by having observed s. Okay, so what's the consequence of assuming we start with two histories that are in the same equivalence class? I'm just going to rewrite the causal equivalence relation. What that says is that, based on little s and s prime, I look to the future and their distributions are the same. So I've just written that down. Let's focus on this up here. This is actually a joint distribution over next symbol, next symbol, next symbol, and so on. I split those random variables into two pieces: this next symbol and then everything else, all the sequences that follow. So I'm just rewriting the notation so that I'm thinking of this not as one joint random variable but as two aggregate random variables: a single-symbol random variable and then the sequences that follow after it. Same thing here when we're conditioning on s prime. Okay, now we just apply a probability identity: the probability of X and Y given Z factors into the probability of Y given Z times the probability of X given Y and Z. The notation gets a little bit messy here, but it's just a probability identity applied on the left-hand side to get this and on the right-hand side to get that, and I've written out the product. Over here I'm choosing to pull out S1, so the future is starting one step after the next symbol, and I'm conditioning on the next symbol being s and having seen that particular history, little s, times the probability of seeing the
next symbol, the individual symbol, given that it followed that particular history. Same thing over here, except all that's different is that I'm conditioning on this different history, s prime. Okay, so that's just applying an identity. But it turns out we assumed that, in going one step forward after both of these histories, s and s prime, we saw the symbol s. That means this factor and this factor here are one, by assumption; we saw them. I also say stationarity here, because when I saw s and s prime they could have been separated by some time, so I have to assume the probabilities aren't changing. That's stationarity: conditional probabilities and word probabilities don't change based on the origin of time. Okay, so those factors are one, I just drop them out, and I end up with this. All we're doing here, after I've gotten rid of this factor of one, is packing things back together. I saw this particular history s, and after that I saw the symbol s, so I now have a new history at the next time, which is that past plus the new symbol. Then we have the follower set of this symbol, the sequences I see after one measurement. And these are the same whether I started with the history s or s prime; these future distributions are the same. Well, that's this criterion here, namely that this history and this history are in the same class, because conditioned on those histories I have the same distribution over futures. [Audience exchange:] It doesn't really matter, but it seems like you didn't even have to assume that that's one, because there are two probabilities that have to be equal. We assumed them. In a sense we assumed them. I guess that's true. But the fact that these probabilities are equal, this probably is from them being in the same class, those histories. Right, yeah. Yeah, except we don't know... Right, right, so there's a case here. Right, in some sense I'm dividing by them. Yeah, so there's a case here: what happens if it was disallowed?
Well, I was kind of assuming that it did occur, but you're right, and in the paper you'll see you have to deal with all the dotting of the i's for that kind of thing. So the shielding I find a little bit intuitive, given how the causal states are constructed. This unifilarity is a little bit surprising. It's a pretty powerful property. This is the reason I was emphasizing it through the winter: it leads to very computationally useful things, in that an observed sequence corresponds to one path over the internal states, and so on. We needed it to calculate entropy rates, and there are all sorts of consequences. So it's interesting how, just by assuming this predictive equivalence relation, we're getting a nontrivial property like that. So again, why do we care about unifilarity? Well, there's this more or less one-to-one mapping between the internal state sequences and the observed symbol sequences. That's nice. We can calculate properties of internal state sequences, which are just Markov chains, to make statements about the observed sequences. In particular, the most immediate one is to calculate the entropy rate. If you remember the formula we used, the state-averaged branching uncertainty, that assumes unifilarity. It only works with unifilarity. So it's critically dependent on having some model of your process. Even if you just want to figure out how random the thing is by calculating the entropy rate, you need this. So the bold claim is this: there's a long history, since the beginning of information theory, of giving models of stochastic processes and asking, how do I calculate, or can I calculate, the entropy rate? However you're calculating it, basically those methods are all equivalent to somehow finding these causal states that have this unifilar property. You can't get away from this. You might call it something else, but somehow you have to find these causal states, to
somehow be using this predictive equivalence relation. In some methods it might seem rather implicit.
But okay, what about the causal state process itself? It turns out that it's a first-order Markov process. Going from data, I get causal states: with the causal filtering, I can take each observed history and turn it into its causal state. Taking that causal state at different times, I now have this stochastic process over the causal states. Well, a priori that could be a really complicated process, but it turns out it's first-order Markov. The causal states really are capturing a lot of the historical information in the present moment. So what do we mean by first-order Markov? Remember, in a general process the next variable can depend on an arbitrarily long past. If it's first-order Markov, all that's relevant is the value of the variable at the previous time step. So again, you can look at this as the causal states really summarizing their pasts in this way.
Okay, so for the sketch of the proof we'll just do it assuming that the current variable depends on the previous two steps, and we want to show that it really depends on only one previous step. Then you could, by induction, get up to longer histories. So what we're going to do here is look at this probability, the probability of the state at time t given where we are, the states at t-1 and t-2. I'm just going to add some notation here. We'll assume this random variable, the next state I could be in, is in some subset of the causal states; I can be in state D and go to E and F. So this is again just a statement about the internal chain, but now I'm going to switch back to the labels on the transitions that took me to those two states. Why? Because that's where we just established some properties. In a lot of these proofs we make this move, going from the internal Markov chain up to some property on the observed sequences and back down again, or the other way around. So that's kind of the trick. So
what we're doing is going from just talking about the internal state process to its proxy up at the level of the symbols that label the transitions that took us to the states in M. But we just proved that, up at the level of the observed sequences, conditioned on the causal states, they shield, so that we don't have to depend on this. So we're borrowing shielding, which we proved up at the observed-sequence level, to show that we can get rid of the state two steps before. And then we just pop back down and change what we're concentrating on, from the transitions and the symbols that label them back to the states. So causal shielding at the observed level leads to this order-one Markov property over the causal states themselves. Okay, so things are simplifying here.
Probably the most important property, or the first important property, is that the epsilon machines are optimal predictors. Now what do we mean by that? I want you to think back to this sort of formal space of all possible candidate alternative models, rival models. The epsilon machine is in there somewhere, but there are all these other choices I could make. So we're going to go back and start talking about how an arbitrary choice of model induces a partition over the space of histories. The idea here is to establish optimality. What we mean is that the uncertainty in the future given the causal states is the lowest compared to any other alternative. If I condition on anything else, typically your uncertainty will be higher.
This actually doesn't take too long, so let's just do some rewriting here. Let's focus on this guy. Conditioning on the causal states, we have this uncertainty over the future L steps ahead. Well, like we've argued before, I can either think that I know what causal state I'm in, or I can just pick one of the exemplars in its equivalence class of histories. So this uncertainty given the causal states is the same thing as conditioning on the infinite past associated
with that causal state. That's fine. But we know that any rival model is going to be some function of the past. Remember, that was that eta thing when we first introduced this notion of rival models. And it's sort of a version of the data processing inequality, where basically if you take some function of the event you're conditioning on, the best that function can do is preserve the information; typically it will throw away information that was useful. And if you throw away information that you used, in this case, to predict something, that can only mean your predictions will be worse, that your uncertainty in the future is larger. So there you have it. Because the rivals are sort of arbitrary functions of the past, they can do no better than remembering the past. But the pasts are well summarized by the causal states; in fact, the causal states are equivalent to having the pasts. So all the other models must be predictors that are no better, and typically worse.
A couple of quick observations. Remember we had this measure of how random the process appears based on different rival models. Well, you can show that if we choose the causal states we actually end up with the process's entropy rate. That's not surprising, because knowing the causal state is as good as knowing the past of the process; they're just good, concise summaries of it. So how do we do this? The definition of the entropy rate for a given model class R is just, now I'm going to look at the block entropy conditioned on the model. Well, in this case I'm assuming the causal states. So this is just a definition of the entropy rate in terms of the block entropy growth rate. I can move from knowing which causal state I'm in to just focusing on the histories that are associated with each causal state. Now here I have this future L steps ahead, and I want to think about that as L individual random variables, so this is actually a joint distribution: the next symbol, the one after that, going into the future, conditioned on this past. Well, it turns out that I
can factor that. From last quarter, there is a conditional entropy chain rule. We can factor that out and then shift time, so that, conditioned on different histories, I'm only predicting one symbol ahead. So you have this joint distribution and you can factor it into single-time-step predictions. This is actually L separate terms here: from this joint distribution over L symbols we end up with a sum of L single-symbol uncertainties. Then we have different histories that are conditioned on, based on whether we're predicting two steps going forward or not, and we can shift those in time, by stationarity, to get L times the single-symbol uncertainty conditioned on the past. And that's just the entropy rate.
An immediate corollary of the previous thing, now that we have this sort of optimal case: the epsilon machine gives you the entropy rate of a process. Any alternative rival model perhaps does as well, but typically would do worse; it'll assign a higher entropy rate. Combining the previous proposition and this lemma, we get this. So not only is it true for predicting L steps ahead, it's also the case that epsilon machines are optimal for getting the entropy rate of a process.
There's also a rather direct corollary. Remember we had this notion of prescience: if I choose a model, how descriptive or predictive is it? We measure that in terms of redundancy. You can show that the epsilon machines are maximally prescient: they capture the most of the future compared to any other rival. For the rival models we had this measure, redundancy, being the difference between the log of the alphabet size and the entropy rate that our rival model induced. Well, in the case of using the causal states, we just showed that this is the entropy rate of the process, and that was our total predictability, the predictability gain summed up. So basically, and this is almost a narrative rewording, the epsilon machine is more prescient about the future, more predictive about the future,
than any rival, and that just follows from what we just established: namely, the entropy rates are larger, therefore that difference is smaller, for any rival.
So in terms of goals, that's pretty good. After doing all this work, you calculate the causal states and transition structure and you have these optimal predictors. One way to think about this gets back to the very first lecture on information theory, where there was this mysterious quantity: what is information? I mentioned that my favorite definition was due to my mentor Gregory Bateson, an early cybernetician. He defined information as a difference that makes a difference. Here we can see the causal states capturing just that: the causal states contain every difference in the past that makes a difference in predicting the future. That's exactly how that predictive equivalence relation is constructed. We don't make distinctions between pasts that are predictively equivalent. What are we trying to do? We're trying to predict the future. So it actually nicely encapsulates a concrete version of his kind of informal definition of information. Another way to say this, and we'll come back and I'll introduce this notion of sufficient statistics, is that the causal states are sufficient statistics. In short, anything you want to calculate about a process can be calculated from the epsilon machine. But I'll prove that to you.
Okay, so at this point we've been talking about this entire space of models. Somewhere in there was the epsilon machine, the causal states, and we could be anywhere out here; we're just picking these things. Just in terms of prediction, we showed that the epsilon machine is the best, but it turns out there can be other models that are equally predictive. We call those the prescient rivals. So you should imagine the space of all these models we could choose, all these possible partitions of the space of histories. There's now a subspace where all these basically get
the entropy rate; they ascribe the same apparent randomness to the process. So the definition of these is: choices of model where the future is equally uncertain compared to using the causal states themselves. Now, where we're going with this is to develop another optimality criterion for the causal states. We want to show that they're the smallest set that's predictive in this sense. But we need to do a little work first. We have to go back to the space of histories and compare how rival models and the causal states partition up the space of histories. There's a little bit of internal structure here.
So the result we want to establish now is that for these equally predictive rivals, the way they partition up the space of histories, those subsets or classes, are refinements of the causal state partition of pasts. We have two cases. Let me describe it graphically. Here is the space of pasts, and I put in here with the solid lines the partition induced by these five causal states. Then there are two cases for the rivals: either the set of histories associated with a rival state is completely contained in one of the causal states, or the set of histories lies across different causal-state equivalence cells in the space of histories, mixing them together. If, in this case here, one of the partition elements is completely contained in or equal to one of the causal states, then by definition they make the same prediction. So there's no real difference here; the conditional distribution is the same. That's fine. The more interesting case is when the rival partition cell contains parts of some number of causal state cells. Just in terms of the vocabulary, this is not a refinement; it's a mixture of the partition elements. So the future predictions we would make, the future morph conditioned on R2 in this case, is going to be some mixture of the future morphs associated with S5, S4, and
S3. We're not going to say exactly what; we're just throwing these things together. So we'll just write that out formally: given R2 here, the predictions we're going to make on the future, sort of on average, will be hopping around with some mixture of the future morphs associated with the contained causal states, with some coefficients here that do the right normalization to make these probabilities a mixture. The problem is that if you mix distributions you increase the entropy. In other words, you make worse predictions. Again, R2 is ignoring distinctions between pasts that lead to different predictions, therefore you can't do any better, or you do worse, I should say. The way you show this is a simple information inequality: the entropy of a mixture of distributions is bounded below by the weighted sum of the entropies of the individual distributions. So you get worse predictions with the rival, which is the contradiction, because we started out assuming that we had prescient rivals that made the same predictions over the future, with the same uncertainty about the future. Contradiction. Therefore this second case can't happen. This one can; that's fine, they agree.
So the conclusion, sort of shown graphically here, is that whatever alternative model you want to give me, if you're predicting at the same error rate, assigning the same uncertainty to the future, then the induced partition over the space of histories has to be a refinement: either wholly contained, or at least the boundaries of the cells have to respect the boundaries of the causal state cells, like that. So it's kind of obvious there can be more of them, and certainly no fewer, because then it would start to mix the cells of the causal state partition.
Okay, so now with that kind of graphical picture of what's going on as we compare different models, we can establish the second sort of optimality property of epsilon
machines, namely that they're the smallest of all the prescient rivals. Of all the models that would give you the entropy rate, or predict optimally, the epsilon machine is the smallest. Well, what do I mean by small? I could talk about numbers of states, like I was just doing with the refinement picture, but our measure of model size is the amount of state information. So we're going to use the statistical complexity of your rival model, your prescient rival, and compare that to the amount of state information in the causal states. The claim is that the best you can do is equal this, and typically your choice of alternative predictive model will be a larger model in this sense, meaning you have to use more state information to do the predictions.
Okay, so again a proof sketch. The prescient rivals are refinements. To say that previous picture formally: if I know which rival partition cell I'm in, then because of this refinement property I know which larger causal state cell I'm in. So there's some function, whatever it is, that will map me; because these are a more refined picture, there'll always be a map from the rival partition cell to the causal state that contains it. So there's such a G. However, we have this basic information inequality here. If I have a random variable that is uncertain to some degree, with some entropy, and I take a function of that random variable, all this function can do is either be the identity or throw information away: confuse things, lump events together, therefore contract the distribution and make it less entropic. So the entropy of a random variable is always greater than or equal to the entropy of some function of the random variable. Well, that's exactly what we have here. C mu is just the Shannon information in the causal state distribution. We just argued that the causal states are a function of the rivals, and then applying this inequality it means that
this state entropy has to be less than or equal to the state entropy of the rival. The rival is more refined; there are more elements. A simple way to think about this: we're using p log p here for the Shannon information in these distributions. Imagine it was uniform; then we're really just counting states. Whenever the events are equally likely, p log p turns into just the log of the number of events. That's why I say, as a first cut, it's easy to think about C mu as just being model size. But this is more general; we're talking also about the distribution over the states. So the epsilon machine's size, the statistical complexity or the state information, is less than or equal to that of any other equally predictive model. So within that subspace, these are all the equally predictive prescient rivals, and this one, the epsilon machine, is the smallest.
One of the consequences of this, and we'll come back to talk more about it, is that the statistical complexity measures the amount of historical information that the process stores. We'll come back and talk about operational interpretations of these different entropy measures, but I just want to presage this a little bit. The key point I want to make is that this kind of interpretation wouldn't be true if we were using models, representations, that weren't minimal. Just like the case of that rival partition, I can start making all of these subsets and make an arbitrarily large model that's equally predictive. But then I can't take the number of states to be the measure of memory, because, first of all, it's arbitrary in that case. So here we have this criterion; we derived that the causal states are the smallest in number that still do optimal prediction, and that's a unique property. That allows us to interpret properties of the model as properties of the process. It's removing one kind of dimension of
subjectivity, that I could just make arbitrarily large models, even of a fair coin. I could take, like, a hundred-state model, as long as the branching transitions were all 50-50. But there's no sense in which the fair coin has a hundred states of memory. By definition, the past is independent of the future. So this is sort of implementing that, or is a reflection of that.
Well, you could say, okay, you showed me that within the prescient rivals, the equally predictive alternative models, the epsilon machine is the smallest, but there could be other small predictive machines too. Why not? It turns out that in fact the epsilon machines are unique, and this takes a little more work to show. The claim is that a prescient rival of the same size, the same statistical complexity, so it's equally predictive and of the same size, is essentially the same thing as the epsilon machine, up to you and I disagreeing over what we call the states: Alice, Bob, and Charlie; 0, 1, and 2; A, B, C. So the idea is, if we have a prescient rival and we're assuming that it has the same state information, the same statistical complexity, as the epsilon machine, then necessarily the state sets are equivalent.
So how are we going to do that? The first thing follows directly from what we have already established: there will be some G, since these are refinements. Whatever rival cell I'm in, I always know what the enclosing causal state cell is, so there's some function G. But the question is actually going the other way around: is there some F such that if I tell you what causal state you're in, you know what rival cell you're in? One way to do this is not so much to do a construction over the space of histories but just to look at this measure of uncertainty. We're going to claim that there is such a function if we can prove that this entropy is zero: the uncertainty in which rival cell I'm in, the rival state, given I know what causal
state. So what we're going to do is look at the mutual information between the causal states, now thinking of these as a random variable, we're kind of hopping around between them, and the rival effective states. Just look at the mutual information, and then we're going to expand this in two different ways and compare. The first way, if you remember the definition of mutual information, it's the uncertainty in the first variable minus the uncertainty in the first variable given the second variable. So it's the uncertainty in the causal states, well, that's C mu of the epsilon machine, minus the uncertainty in the causal states if we knew what the prescient rival cells were. I can also expand this in the complementary way, pulling out the other marginal distribution from this joint: the uncertainty in the rival states minus the uncertainty in the rival states given the causal states. Those are just information identities.
Well, we just argued that there is no uncertainty: if I know what the rival cell is, I know what causal state I'm in. There's a function; it's determined. So that term is zero, and we just have this on the left-hand side now. Then on the right-hand side, okay, we just rewrite this, getting rid of that term. So now we have that the statistical complexity of the epsilon machine is these two terms. But then we assumed that the state informations were the same; we assumed they are the same size informationally. So this and this are the same, therefore this term must be zero. So the uncertainty in the rival states given the causal states is zero. Again, it's not constructive, I'm not giving you what the F is, but at this information level we've shown that this function from causal states to rival states must exist. Again, the prescient rival's partition is a refinement, but then we assumed they're informationally the same size. Well, they somehow have to match each other. We showed that this function exists; in fact, the
original function here, the map from rival states to causal states, is really F inverse. So they're the same thing. We can disagree over vocabulary, but that's it: prescient rivals of the same size are isomorphic. That's the point.
So there are some kind of messy border cases here? Yeah, yeah, right. If you roll up your sleeves and do the measure theory, it gets kind of grungy, actually. That's why I'm calling all these things proof sketches, at least to get some idea. We can talk about the properties, give this kind of high-level view of why it's interesting, and then, you know, when you really get interested, yeah, we'll get into some of these later on, once we have the basic ideas down. This is just the quickest of road maps.
Okay, I think this is the final property, namely that if we look at the causal state process and any prescient rival, equally predictive, the induced stochasticity is smallest with the epsilon machine. In some sense the causal states attribute the least random internal process. The way we're going to say that is, now remember this is an order-one Markov chain, so we've captured everything over the causal states, you can show that this state-by-state uncertainty, it's like an entropy rate, is always less than or equal to that of any other predictive rival.
Okay, now this one is an example of where whiteboards would be much better. It's a long proof, so I'll try to get through this, but we'll hop back and forth a little bit. What we're going to do brings in a lot of these information identities. So: epsilon machines have minimal state stochasticity. We're going to start out here making some observations; we have to do some groundwork before we actually get down to just talking about the rival state process and the causal state process, and we're going to be moving from internal processes, the state process, to observations and back again. So the first thing we're going
to do is focus on the uncertainty in the next symbol and next state given the previous state, for the epsilon machine. It's just an information identity. I'm thinking about this as just some joint distribution, the uncertainty in X and Y given Z, and I apply this information identity. I have a choice, but I pull Y out first here: the uncertainty in Y given Z plus the uncertainty in X given Y and Z. So I just rewrite this conditional joint uncertainty into the uncertainty in the next symbol given the previous state, plus the uncertainty in the next state given the next symbol and previous state. But by unifilarity that second term is zero. If I'm starting in a state and I know what symbol I'm going to see, by unifilarity I know uniquely what state I'm going to go to. Well, there could be no next state, but I would know that. So this is zero. Okay, so we just showed that this is equal to that.
Then we can do the same thing for this predictive rival. We're looking at the same kind of thing, next symbol and next state given previous state, and I pull out the uncertainty in the next symbol plus the uncertainty in the next state given the symbol and the previous state. I don't know that this is zero, but it's a non-negative number, so I'll just throw it away. So I get this inequality here, that this is going to be less than or equal to that. However, I know that since this is a prescient rival, and I can go from the prescient state to a causal state, this uncertainty for the next symbol is the same as using the causal state to predict it. But then I just showed on the previous slide that this single-step symbol uncertainty is also equal to the joint uncertainty of next state and symbol; again, these are redundant because of unifilarity.
Okay, so now we can also expand this joint conditional uncertainty in a different, complementary way, where I don't bring out the
symbol first here, but I pull out the uncertainty in the next state given the previous state, and then I have the uncertainty in the next symbol given the previous state and next state: two states, an edge, and this is the symbol on the edge. Now I can put these two things together to show that, since these are equal and I just showed this is lower bounded by this quantity, this right-hand side is greater than or equal to this. What I'm trying to do here is pull out terms that look like this, because I just want to make bounds on the rival internal state process and the causal state internal process. I don't find that overly informative; it's one of those proofs where you have to see the final result and then go back and read it again.
Okay, so I can expand the right-hand side of this guy again, like we did before, and then rearrange, so that now what I have on the left-hand side are the two terms we were interested in: this is essentially the entropy rate of the rival internal state process, and this is the entropy rate of the causal state Markov chain. Now we have this difference here, and that's lower bounded by this other difference. Given the past and current causal state, this is the uncertainty in what symbol I'm going to see. You should think about state, state, and then this variable is labeling the edge between them, the transition between them. And then same thing here: I've got these two rival states and there's a symbol between them. However, we know that we can go from rival states to causal states through this function G. That induces a way of seeing that pairs of causal states are actually a function, G prime, just the extension of G to pairs, of pairs of rival states. So that means, again, it's like that kind of data processing inequality. I'm trying to compare these two terms here. The uncertainty in the next symbol given this compound knowledge of the previous state and current state, that's actually a function of, I should say, the causal previous state and next
state, which causal states they are, that's a function of this, and therefore it increases the uncertainty in the next symbol. I'm just using this inequality here: the uncertainty in X given Y is less than or equal to the uncertainty in X given some function of Y. All that G can do here is throw useful information away from Y, and that will make X look more uncertain. That's all I've done here, because the causal state pair is a function of this detailed information I had. But then that means, given this inequality, this is non-negative. I just showed that this single-step uncertainty is at least as large as this uncertainty, therefore this difference is non-negative, and therefore the internal state entropy rate of any rival is at least as large as the randomness, the internal state entropy rate, of the causal states. Like I said, this would be better written out on three boards so I could go back and forth and point at things.
So the epsilon machine does not ascribe any extra randomness to a process. You could do that; it's a hidden process, after all. You could have things that are, at the observation level, equally predictive but have a much more complicated internal mechanism. So this, in some sense, is showing that the mechanism is least stochastic compared to alternatives. Okay, well, that's pretty good. That's sort of all the hard work here. Maybe just some other connections and then we'll finish up.
There is an interesting discussion in Elements of Information Theory, in Cover and Thomas, about the relationship between information theory and statistics, and they recast a lot of the ideas in mathematical statistics in information-theoretic terms. One of them is this historical notion of a sufficient statistic. Just a little background on what this means: we have some random variable X and we assume it's distributed according to some given distribution that maybe has some parameters, denoted theta. So if X is Gaussian
distributed, then theta would be the mean and standard deviation. So imagine the distribution for X has some parameters. Then we have this function here, which I denoted T, maybe a bad choice; it's some function of this. Think of, oh, I'm going to calculate the running average of the samples of the Gaussian process to get the mean. So the idea is you have a statistic, which is a function of samples, that helps you estimate the parameters of the distribution. There's another function which, through samples, would give you the standard deviation. The idea of a sufficient statistic is that this function of X contains all of the information you need to estimate theta. Or, said this way, since when we compare things we use mutual information, the information-theoretic recasting of this concept of sufficient statistic is that the mutual information between the random variable and the parameter you're interested in is the same as the mutual information between this function you're calculating of X and the parameter.
Yeah, so is it also true that X to T of X to theta would be a Markov chain?
Exactly, yes. Right, so that's another way of casting what a sufficient statistic is, exactly. And then there's the notion of a minimal sufficient statistic, not so much talking about size like we were just talking about, but the idea is that this sufficient statistic is minimal in that it is a function of every other sufficient statistic you could calculate. So the punch line here is that the epsilon machine is a minimal sufficient statistic for a process. In other words, we can calculate anything we want.
The sketch here is, first of all, that maximal prescience gives sufficiency: namely, the mutual information between the word distribution and the causal states is the same thing as the mutual information between the word distribution and the past. That's essentially just rewriting that prescience in a different form, as a mutual information. But it turns out every prescient rival is a sufficient statistic; anything that does optimal prediction is essentially a sufficient statistic. But then we just established that in the space of all the alternative optimal predictors the epsilon machine is the smallest, just because the rival states are refinements. So basically it's a minimal sufficient statistic. And the punch line, or the lesson, is very simple: you can calculate every statistic, every property you want, of a process from its epsilon machine. A lot of this is actually just recasting things we already talked about with the other properties, but at least it makes a connection to this other discipline of statistics.
But just to summarize: we showed that the epsilon machine is an optimal predictor; it has the lowest prediction error of any rival, unconstrained. Of the optimal predictors, it's the one of smallest size. And then if you come up and say, oh, I've got an optimal predictor of the same size, it's essentially equivalent to the epsilon machine. So it's a unique model of a process and reproduces all of its statistics. And the causal shielding gives you the sort of
operational sense of what the causal states do. It's very much like a generalized notion of Markovianness, of summarizing the past. So now, to look ahead a little bit, I want to start using these things to do things. I mean, for two weeks now I was setting up the framework, trying to argue that we have answers to the questions that the information theory discussion of winter quarter certainly brought up: how do we do modeling, and can the data tell us? And the claim, now more or less established with these proofs, is that a process can tell you how it should be represented, and there's a way of optimally doing that. And there are these ancillary properties: actually, if you even want to calculate how random it is, the entropy rate, you have to use these causal states somehow, the epsilon machine. So now we're looking forward to using the epsilon machine to start talking about dynamical systems, or processes generally, and how they store and process information. So the next lectures have a sort of theme, namely thinking about the natural world (physical, biological, chemical, it could be the social world if you like, the neurobiological world) in terms of how it stores and processes information, and we're going to use the epsilon machine to do this. So there's this notion of intrinsic computation. Not, you know, running Microsoft Excel on a waterfall; I don't mean that. I mean the waterfall on its own terms, or the neurobiological system on its own terms, or the spin system on its own terms. How much of the past does the process store? Where in the system's state space is that information stored? What's the architectural structure of how that information is stored? And then how does that stored information get used to produce future behavior? So, in short: how much of the past does a process store? Well, I already said it. The claim, and we'll have to show this in applications, is that the statistical complexity is the amount of history, an information-theoretic
measure of the amount of history a process stores in its state information. What's the architecture of the information stored? Well, we have to look a little more at the structure of the epsilon machine, but the answer to this question is the epsilon machine; that is the architecture. Well, I might want to pull that back: if the epsilon machine is coming from a symbolic dynamics description through generating partitions, I'd pull it back to the dynamical system. So there's a little bit of work there. But the short answer is the architecture: the number of states and the actual set of transitions. That is how the information is stored. And then how is that stored information used to produce future behavior? Well, we have kind of a proxy for that, the entropy rate: how much information is generated by a process. But there are other aspects to this that we'll get to. There are other ways of thinking about it that we already talked about, the bound information and the ephemeral information decomposition of the entropy rate, and we'll be able to pull all of that out of the epsilon machine. The Computational Mechanics in Python package in the Sage browser has all that stuff built in, so we'll be able to calculate these quantities, talk about how to estimate them, and get a much more detailed picture of how processes store information, process it, and generate it. And in particular, using the structural picture we have, namely the causal states and their transition organization, we can refine our notions of different kinds of information processing. That will be it until Tuesday, unless you have questions.
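
As a small illustration of the two quantities named above, here is a minimal sketch, not the course package, of how the statistical complexity and the entropy rate might be read off an epsilon machine's symbol-labeled transition matrices. It assumes the standard two-state golden mean machine (state A emits 0 or 1 with probability 1/2 each, state B always emits 1), takes the statistical complexity to be the Shannon entropy of the stationary causal-state distribution, and takes the entropy rate to be the state-averaged entropy of the next symbol. The names `T`, `stationary`, `statistical_complexity`, and `entropy_rate` are hypothetical, chosen only for this example.

```python
from math import log2

# Assumed example: golden mean process (no two consecutive 0s).
# T[x][i][j] = Pr(emit symbol x and move to state j | currently in state i).
# State index 0 is "A", state index 1 is "B".
T = {
    "0": [[0.0, 0.5], [0.0, 0.0]],
    "1": [[0.5, 0.0], [1.0, 0.0]],
}


def stationary(T, iters=200):
    """Stationary state distribution by power iteration (assumes an ergodic machine)."""
    n = len(next(iter(T.values())))
    # State-to-state transition matrix: sum the labeled matrices over symbols.
    P = [[sum(T[x][i][j] for x in T) for j in range(n)] for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * log2(p) for p in probs if p > 0)


def statistical_complexity(T):
    """Entropy of the stationary causal-state distribution, in bits."""
    return entropy(stationary(T))


def entropy_rate(T):
    """State-averaged entropy of the next symbol, in bits per symbol.

    This formula relies on the machine being unifilar (state plus symbol
    determines the next state), which holds for epsilon machines.
    """
    pi = stationary(T)
    return sum(
        pi[i] * entropy([sum(T[x][i]) for x in T]) for i in range(len(pi))
    )


if __name__ == "__main__":
    print("C_mu =", statistical_complexity(T))
    print("h_mu =", entropy_rate(T))
```

For this machine the stationary distribution is (2/3, 1/3), so the sketch gives a statistical complexity of about 0.918 bits and an entropy rate of 2/3 bit per symbol. Power iteration is used only to keep the example dependency-free; an eigenvector solve would serve equally well.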