Today's goal is to do a little bit of review, but also to show some very useful consequences of using mixed state presentations, how to calculate with them, and what their relationship is to the epsilon machine. So, just to step back a little: the original idea weeks ago was that we were interested in doing optimal prediction, and we came up with this causal or predictive equivalence relation that had us group histories together. We called those partitions of the space of pasts causal states, and they were really the conditions of knowledge, knowledge of the past, that we needed to do optimal prediction. What we were setting up in the last lecture was thinking of these mixed states as a kind of proxy for causal states. There's a relationship we'll have to talk about here: they are in some sense one step more abstract than the causal states; they are distributions over the causal states. So by focusing on these mixtures of causal states we get a kind of meta-causal state, and we'll show what the relationship is between the mixed states and the causal states. And this gives us a whole hierarchy of these states of uncertainty. Okay, so I'll review some of the notation from last lecture, where we finally defined the mixed state presentation, and we also have to talk about the dynamic; we sort of went through that in an example. Mostly I'm going to focus on how to calculate the mixed state presentation: you give me any hidden Markov model, and we can just go through this procedure to calculate the mixed states, and we get a new dynamical system, a new model. The question is, how is that related to the epsilon machine? There is some kind of parallel between these mixed states and causal states; what is it? And then, practically, I'm going to go back and talk a little about the computational complexity of estimating block entropies and word probabilities. At the beginning of the Tuesday lecture, I noted that there are clever ways, or at least non-bonehead ways, of calculating word probabilities so you don't have exponential blow-ups. But it turns out the blow-up comes back to haunt us when we try to calculate block entropies, and mixed states will save us there too. And then, as another application of all this setup, we'll give a nice closed-form expression for the synchronization information. Synchronization was the process of making some observations and coming to know what the state is. Well, the mixed state framework is set up to talk exactly about how state distributions go forward, no surprise, so we can write down a nice way of calculating the synchronization information. And then next week we'll talk about using mixed state presentations to discuss the temporal asymmetry, the statistical asymmetry, of processes, which we'd already introduced by way of example, but as an observation rather than something you can calculate with. So, okay, just going over a couple of the slides from Tuesday. The mixed states are the state distributions that are induced by seeing a word. We have some word of length L, and the notation for the mixed state is mu, some measure over states at time t; having seen nothing, that's how we start off, it's a state distribution at time t. So what does this mean? Well, it's a slightly confusing notation.
What's the probability, now thinking of the mixed state as a random variable, that it is one or another of the presentation states we started with? That's just the interpretation: it's the probability of being in the states, the state distribution, having seen the word and having started with a given state distribution. Okay, so in this way of thinking we treat the mixed states as random variables, and we'll need that for some of the proofs of the efficient algorithms later on. Now you might say: I was thinking of these as points on a simplex, as state distributions; how are they random variables? Well, we're conditioning on an event and on this random variable, so in this notation we're thinking of it as a random variable. Of course, typically we take time zero and, as we did in the example, the initial state distribution to be the asymptotic state distribution. But we could do other things: maybe we have some other information about what the starting state distribution is, or we're interested in the step-by-step evolution. And as we saw, you can deviate from the asymptotic state distribution based on what you've seen. Paul? So the probability there, it's not the probability that the distribution is a delta function, but it's the weight of a certain state in that distribution? Right, yes, right. Yeah, this is a little confusing; there are two different ways of thinking about this. One is: this is a random variable, it's the temperature, and what's the probability that it's 10 degrees. The other is: you're thinking of it as a vector of numbers, a state distribution, and then you look at a component. You might be thinking that this mu is (0, 0, 0, 1, 0, 0), the delta function on that state; that's not what I'm trying to convey with this notation. In fact, it's just better to think of it this way: I'm not going to condition on which state it's in, but just think of the mixed state as this probability here. Maybe that's the less confusing, more direct way of expressing it. We just want to think of this as a vector in some simplex, in some finite-dimensional vector space. So, starting at time t with this mixed state, we see some word of length L. The question is, what's the mixed state having seen that word? Again, just the definition: the state distribution having seen w, starting with mu_t. Well, that's a conditional distribution, and we can rewrite it using the definition of the conditional as the joint distribution over a marginal. I can pull the word out in front and look at the probability that I'm in this particular state at time t plus L, having seen, having produced, the word w, and then divide by this marginal here, which gives me the conditional. And of course I'm carrying through the conditioning on what our initial distribution was. Well, we know how to calculate these things: the numerator is just pushing the initial distribution forward using T^(w), and the denominator is just normalizing that over all the states we could get to; we marginalize, sum out, all the states we could end up in. That's the probability of seeing that word starting with this initial state distribution. So that's how we push these things forward.
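To make that concrete, here is a minimal sketch in Python with numpy. The function names and the dict-of-matrices representation of the labeled transition matrices T^(x) are my own choices for illustration, not anything from the lecture.

```python
import numpy as np

def update_mixed_state(mu, T_x):
    """Push a state distribution mu forward through the labeled transition
    matrix T_x for one observed symbol x.  Returns (P(x | mu), new mixed
    state), or (0.0, None) if the symbol is disallowed from mu."""
    v = mu @ T_x          # unnormalized pushforward: mu T^(x)
    p = float(v.sum())    # P(x | states distributed as mu)
    return (p, v / p) if p > 0 else (0.0, None)

def word_probability(mu, word, T):
    """P(w | mu), computed by scanning the word once and updating the mixed
    state symbol by symbol; linear in the word length."""
    p_total = 1.0
    for x in word:
        p, mu = update_mixed_state(mu, T[x])
        if mu is None:
            return 0.0
        p_total *= p
    return p_total
```

The running product of the per-symbol normalizers is exactly the marginal in the denominator above, so the word probability comes out for free as we update.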
We have these mixed states; we've seen some word, we push it forward, we get this partial distribution, and then we normalize and get the updated state distribution. Okay, so again, the interpretation is just the uncertainty in the state given a word, and now we're being very careful to specify the starting distribution, which we didn't make so explicit in previous weeks. We can track our uncertainty quantitatively by looking at the state entropy having seen a word; when that vanishes, we know the state with probability one, and that means we're on one of the vertices of the mixed state simplex. Now, this gives a somewhat more formal way of thinking about the mixed states. The mixed states that have zero entropy are the basis vectors of some space. Rather than a two- or three- or four-dimensional simplex, we can imagine arbitrary dimension and use this as a criterion for finding the basis vectors. We call these the pure states: they say you're in state A, B, or C. Then we think of arbitrary mixed states, these state distributions, as mixtures over the pure states. So the pure states span this vector space, and a point in the middle of it is just an arbitrary mixed state. I think most of this is clear enough from the examples, where we just think of points hopping around on the simplex. Now, just to draw the parallel: remember the original introduction of the predictive equivalence relation for epsilon machines, the equivalence relation where we group two pasts together when, conditioned on those particular pasts, the distribution over futures is the same. What we're doing now is using the mixed states as a proxy for the past. Why do we use states at all? Because they summarize the past; that was a property we noted for the causal states. I'm drawing this parallel because there are some differences here. Here we can think of another equivalence relation, over mixed states: we say that two words are equivalent under the mixed state construction when the state distributions they induce are the same. Now, what might be the connection? Well, these are words that we've seen, so w and w prime are sort of like pasts, and we're looking at the equivalence of the state uncertainty, these mixed states. If we know the internal state distribution, we can predict ahead; that's how we can develop the future distribution here. So it's formulated slightly differently, but there's a parallel: we're going to use these mixed states as a proxy for carrying around all these partitions of the past. It'll be more efficient that way, and we'll also get some interesting insight. Okay, so that's what mixed states are, with a little bit of contrast with causal states: not yet the same thing, but similarly motivated and potentially equivalent. Okay, so that's the state space; what about the dynamic? Well, there is a very natural dynamic over the words: I've seen some word, I observe another symbol, and I have a new word, just the concatenation of the word and the new symbol. So there's a dynamic over mixed states that's induced by this: we have a word, we add on a symbol, we get a new word, and then we can just go look at what happens to the state distribution.
We just apply the mixed state calculation to look at the state uncertainty induced by that word. And then the same thing for the next word, ws: we can look at its mixed state. And then we think about observing the symbol s as a mapping from the previous mixed state to the new one. So this mixed state dynamic is unifilar: these maps are unique. Given the previous mixed state and the symbol we add on, we always end up with a unique next mixed state; the concatenation is unifilar in that sense. Same thing down here: we have this mixed state induced by w, and we go to a unique next mixed state when we add on s and update. So one consequence of this: first of all, it says that the mixed state presentation is nice. It carries along many of the properties we had before for unifilar presentations, or for the epsilon machine. In particular, we can calculate things like entropy rates using presentations or models that are unifilar. In addition, this construction of building the mixed states, which I'll go through in an example shortly, is a way of taking a potentially non-unifilar presentation and unifilarizing it. So you claim: oh, the process has seven states and this transition structure. And I go: well, that's non-unifilar, I can't calculate the entropy rate. So I just go through and calculate the mixed states, and I end up with a new model that is unifilar, and I can plug that into, say, the entropy rate formula that assumes unifilarity. So that's a nice thing: this restriction of unifilarity that seemed to come with the epsilon machines is no longer a restriction, because now we have a way of converting general hidden Markov models to something unifilar. Okay, so how do we calculate these presentations? I'll go through the example again; it's a little more abstract than the particular case we did on Tuesday. And then I'll come back and give a definition of what the mixed state presentations are. Okay, so we start out at time t, not having observed anything. That's drawn as two concentric circles, so you should think of it as a kind of start state, actually kind of like the top node in the parse tree. And then we push this forward. So we were just talking about going from mu_t to mu_(t+1) having seen a zero, and this is our formula for updating the previous mixed state to the new one. And the transition probability is: since we're seeing a zero, we update the mixed state with T^(0) and then sum over all possible states we could go to having seen a zero. That's essentially just the probability of seeing a zero starting with that state distribution. We have the state distribution, and we just look at: what's the probability I'm in this state and see a zero, I'm in that state and see a zero, and then we sum those up. So that's what this is, and I'm writing it down as if it's an operator. Same thing for the other symbol; we go through and check all possibilities. That was for a zero; then, what's the probability of seeing a one starting in that state distribution? Again, it's just this probability, and then we update the mixed state, but now using T^(1), and normalize. Okay. So we just keep doing this, for all possible words. We're just treeing this out.
This should look like the parse tree from the epsilon machine reconstruction. We do it again: now I'm in the mixed state at time t plus one, having seen the zero; I can see another zero here, and I just push it forward. Same expression here, and this is the conditional probability of seeing a zero having seen a zero. And then we go to the mixed state at time t plus two, which is the way I'm writing it down here; it's as if we've updated directly from t to t plus two, having seen the length-two word zero-zero. Same thing here, zero-one, this path: you now have a mixed state that goes from the original down to here, having seen zero-one. And then one-zero and one-one. Okay. So we just calculate this out. There might be disallowed transitions, so some of these branches won't be there; they're at zero probability. Also, as we're doing this, and we saw this in the example, seeing one word or seeing another word can lead to the same state distribution. This is where we're going to use the mixed state equivalence relation. It's important because, as we're treeing this out, looking at all the words, when we see the same state distribution it means that from that point going forward we're going to see the same symbols with the same probabilities, the same future morph. Okay. So how about an example: the even process. Since we did this sort of thing by hand with the parse tree and subtrees, this is a nice contrast, so you can see what's similar and what's different when we do the mixed state calculation. It's essentially the same thing, again slightly more abstract and, in a way, more powerful. Okay, the even process: two states, and we'll start with pi, the asymptotic state distribution. Okay, so if we see a square down here, we compute pi times T^(square) and then normalize it. The transition probability right here is just the probability of seeing a square starting in state distribution pi. And then we update, so we get mu_1 having seen the square; it's that ratio. Same thing with the triangle. Okay. And if you actually plug in the numbers here, you see things we already know: the asymptotic probability of seeing a square is one-third and of seeing a triangle is two-thirds, or, for a zero and a one, one-third and two-thirds. And if you go through the calculation of the new mixed state, you notice that on a square all the probability is in state A. That makes sense; you can just read it off here. We just came right back; that's all that can happen. Even if I started in B... well, actually, I can't do that; sorry, that wouldn't happen. If we're in this distribution, there's no way to get from B to A on a square; A can go to itself on a square, but B is just isolated. Okay, right. So we end up with this delta function, and now we've synchronized: square is a synchronizing word for the even process. But come down here, where we just did that calculation: if we see a triangle, we go from the two-thirds, one-third distribution to half and half, equally uncertain. So we keep going. From both states we can see a zero or a one, so we look at a zero and a one from those two mixed states. Down here, if I'm in A and I see a square, I go back to A; you can do the calculation if you want, and you get this delta function with all the probability on A. And notice that if I was in this one-zero, state-A mixed state and I see a triangle, I know that I'm in B. Well, you also noticed that this presentation was unifilar.
So actually, from this point forward, now that I have all these pure states, I'm always going to be in a pure state. So I've synchronized here; I'm kind of done, in a way. But come over here. We're now 50-50, and then on a square I can only end up back in A, so now I'm synchronized on that branch. But if I saw a triangle, well, we've done this calculation several times now: we had half and half, and you push them forward with probability one-half and probability one, and we end up back at two-thirds, one-third. So notice that this mixed state is not a delta function, but it is the same as the one we started with, pi. So I don't really have to do any more calculation here; I can just link this back up to here, because any sequence following from this one is the same, with the same probabilities, as one following from here. So I'll just link that back up. In addition, this mixed state is this mixed state, and, I should have said this before, this mixed state was this mixed state. So this one loops back to itself, and this one actually goes over here. I've just connected those things back up using the mixed state equivalence relation. This is like identifying subtrees, except it's being done over mixed states, which is much more compact in a way; we just have a vector of probabilities to compare. So there's just one mixed state left to look at, and we need to know its followers; we have to look at a zero and a one from it. Well, that's state B, and you can almost anticipate what it's going to look like. What happens on a zero... sorry, what happens on a triangle, what happens on a square, from there? Well, on a square: there's no square leaving B, therefore that transition is disallowed. I just put in (0, 0) here as the mixed state; it's a nonsensical mixed state, not a distribution. However, on a triangle I can go back to A, like that, and that mixed state is this mixed state over here. So the square branch is disallowed, we throw it out, and we connect the triangle branch back up. And this should look kind of familiar. Almost familiar. Let me redraw it; now it should look really familiar. So what did we do? Well, one way to look at it is that we did absolutely nothing, because I already started with the minimal, epsilon machine presentation of the even process. However, in the process we picked up two transient states. So these two are transient: as soon as you see a square, you come down to these two mixed states and rattle around. Those are delta functions, so they really are in one-to-one correspondence with the original states. But it's sort of interesting to look at how these transient states came about. This was pi, our start state, two-thirds, one-third. And then if I see a triangle, I go down to this state where I'm equally likely to be in A or B. And it's also telling us that when we're in this state, it's actually relatively likely we're going to see another triangle and come back, which makes sense, because of the restriction that after seeing a triangle you must see another triangle; there's a little extra transition weight here. So as long as we're seeing triangles, we just rattle around in here, oscillating between two states of uncertainty about the state, between two-thirds, one-third and 50-50, less than one bit and one bit, back and forth as long as we're seeing triangles.
And then finally, with probability one, looking at a sufficiently long sequence, we will take the other transition, see a square, and then we're synchronized. So this is a nice procedure: if you give me a model, a presentation, it's a way to calculate what looks to be the epsilon machine. It needn't be, it turns out; there are cases where it doesn't minimize the number of states, but at least the result is guaranteed to be unifilar. Okay, so the example tells us what's going on behind the scenes; now I want to talk a little more generally about how we convert from general presentations. So imagine we have some process, and you pick some presentation of it. We'll have some alphabet here, some set of states, which I'll denote with V, and then a dynamic, which will be a set of transition matrices. And what we mean is that the transition matrices hold the conditional transition probabilities: if I know what state I'm in, what the probability of the next symbol and the next state is. Okay, so now we want the mixed state presentation of that. We define the mixed state presentation of a given model together with a specification of how we start the model out; we have to specify the state distribution, and again we can take pi if we want. And we denote the construction by calligraphic U: we plug the given model and the starting mixed state into this operator, and we end up with a mixed state presentation that has the same alphabet, some set of states (well, mixed states), some dynamic, again over the mixed states, and an initial distribution. Okay, so what are the mixed states? We just look at all possible words, like we were doing in the even-process case, plug in all possible words, and calculate the mixed states we get from the chosen initial distribution. Just tree that out; that will be the set of states. Then there's the induced dynamic, which I showed in the commuting diagram before: some operator that takes the previous mixed state to the connected mixed state, having seen a symbol, or, extending that to words, some transition matrix over words. And what we mean by this is that there is some mapping from the current mixed state to the next mixed state on a symbol: the probability that, if we started in this mixed state, we're going to see the symbol and end up in this next mixed state, eta prime. So now I'm going to introduce a little bit of a trick. What do we mean by this probability up here, and why am I writing it that way? What I mean is that we have a dynamic that moves us forward, so there really is only one eta prime. But I'm going to think about this as an operator that considers all possible next states and in fact is just a delta function on the one you get taken to. Oops, I changed primes here. So this is starting with mu_t, having selected that; eta is the next one we're going to. And this probability is zero everywhere except for the mixed state you actually get taken to. That's all I'm doing here. So this update gets defined recursively: we can go from some mixed state at some time, having seen a word of length L, and a new symbol, to the mixed state where we just look at the longer word wx. The construction is unifilar.
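Putting the pieces together, here is a sketch of the construction as a breadth-first expansion over symbols with merging of equal mixed states. It reuses the update idea from the earlier sketch; the function name, the data layout, and the use of '0' for square and '1' for triangle are my own choices. Note that the construction need not terminate for every hidden Markov model (some presentations induce infinitely many mixed states), which is what the max_states guard is for.

```python
import numpy as np

def mixed_state_presentation(T, mu0, tol=1e-10, max_states=1000):
    """Tree out all words from the start distribution mu0, merging mixed
    states that coincide (up to tol).  T is a dict {symbol: labeled
    transition matrix}.  Returns the list of mixed states and the unifilar
    transitions {(i, symbol): (P(symbol | eta_i), j)}."""
    states = [np.asarray(mu0, dtype=float)]
    transitions = {}
    frontier = [0]
    while frontier and len(states) < max_states:
        next_frontier = []
        for i in frontier:
            for x, T_x in T.items():
                v = states[i] @ T_x
                p = float(v.sum())
                if p <= tol:            # disallowed transition from this mixed state
                    continue
                eta = v / p
                # mixed-state equivalence relation: have we seen this distribution?
                for j, known in enumerate(states):
                    if np.allclose(eta, known, atol=tol):
                        break
                else:
                    states.append(eta)
                    j = len(states) - 1
                    next_frontier.append(j)
                transitions[(i, x)] = (p, j)
        frontier = next_frontier
    return states, transitions

# Even process: state A emits '0' (square) and stays, or emits '1' (triangle)
# and goes to B; B must emit '1' and return to A.
T_even = {'0': np.array([[0.5, 0.0], [0.0, 0.0]]),
          '1': np.array([[0.0, 0.5], [1.0, 0.0]])}
pi = np.array([2/3, 1/3])
states, trans = mixed_state_presentation(T_even, pi)
# states: [2/3, 1/3], [1, 0], [1/2, 1/2], [0, 1], i.e. pi and the triangle
# transient plus the two pure (recurrent) states, as in the diagram.
```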
Again, the states of this MSP, the mixed state presentation, are mixed states over the states of the original presentation we were working with. And we can think about mixed states of mixed states: the MSP is itself a hidden Markov model, so I can do the construction again. It turns out that this tops out if you use the right start states. What I'm doing here is applying U to the mixed state presentation of M, but I choose the start state to put all the probability on the original starting state distribution. And then, although the names have changed, the result is isomorphic to the original mixed state presentation of M. And then, again, the final important point is that I can now think of this operator U as a way of converting non-unifilar presentations to unifilar ones. So if you didn't use, for the starting state of your mixed-state-of-mixed-states presentation, that distribution concentrated on the original start... yes, right... so if you didn't use that, then would it be something else? Right, yes. It would be a different machine. It's related, but typically the first thing that changes is the set of transient states you calculate, and if it's a synchronizing presentation, then it will eventually capture the same recurrent piece. But yes, we're not going to push too much in that direction; it'll come up a little next week when we reverse time. But okay, maybe another example, to illustrate the role of these transient states and also just to be concrete about what an alternative presentation for a process could look like, since we're used to using the epsilon machine, which is optimal. So here, this is a period-two process, and here's a presentation that generates it: square, triangle, square, triangle, square, triangle. That's period two. But what I've done is elaborate the states, so now I have four presentation states. One is welcome to do this; it seems slightly perverse, but it could happen. And it does illustrate that sometimes you have very complicated models and you don't know whether you have the optimal, the smallest, set of states. Okay, so let's just go through and calculate the mixed state presentation of this four-state presentation of period two. We're going to start out with pi for this, so we put a quarter of the probability on each of the states; that's the start mixed state. And then we just do the calculation: push it forward if we see a square, push it forward if we see a triangle. Also notice that if I've seen a square, look back here, there are just two states I could have gone to, B or D; and if I saw a triangle, I can only have gone to A or C. So I come down here to a mixed state that leaves me equally uncertain, after I've seen a square, as to whether I'm in B or D; I can't discern that. Same thing with a triangle: I come down here and I'm equally confused about A or C; I just know I'm not in B or D. And then as I see a square my uncertainty hops: I'm uncertain about these two; I see a triangle, and now I'm uncertain about these two, then these two, and so on. So no matter how many observations I make, I'm always hopping around, and I never know exactly what state I'm in. So this is a presentation of the process that I cannot synchronize to, and the mixed states, the internal state distributions, tell me that.
Because otherwise we'd have state uncertainty zero, but we don't. Given that we've assumed this model, we can't synchronize to it. So the mixed states are of two kinds, just as for epsilon machines. We have transient mixed states, which are mixed states that, at least eventually, you never revisit, and then there are the recurrent states, which you can rattle around in forever, visiting them infinitely often. One useful thing about the mixed state calculation, as I said, is that it tells us that the original choice of model is not synchronizing: you'll never know exactly what presentation state you're in. Or, the other way we say that, this particular model is not exactly synchronizing; there's no finite word that will let you synchronize. I mean, if there were such a word, and the presentation were unifilar, then you would sync. There's a nice little exercise here: if the presentation is unifilar (this one is; I could have started with a non-unifilar presentation or something like that) and if it is exactly synchronizing, which this one is not, then you can show that once you're synced, you're always synced. In other words, if there's some word such that the state uncertainty goes to zero, then all the allowed following words also lead to these delta-function distributions, and so the state uncertainty stays zero. We can do this again. So call this machine M. When we calculate it, notice that the recurrent part has two states, which should be a little intuitive: it's a period-two process. It does bring up this question, though: when I apply this operator and construct the mixed states from an arbitrary hidden Markov model, am I calculating the minimal number of states? It turns out it doesn't always do this, for subtle reasons. But we can do this again. So now I've got these three mixed states; I call this machine M, and I can calculate M prime as the mixed state presentation of that. We'll start off here by assuming we're in the asymptotic state distribution of M, which is 50-50 over its recurrent states. And in that case, you can calculate this machine M prime, which is actually isomorphic to the original M. So it's very interesting: two applications of this bring us back to the same model. That doesn't happen all the time, but there are certain conditions under which it does happen, like, for example, starting the mixed state at pi. And now notice that when we look at the mixed state distributions in these new mixed states, they're delta functions. So we know that this model is synchronizable; this one wasn't, but this one is, and so on. So it sort of comes back to itself. Now it's interesting to ask: what are these? These are distributions over these three states, and those states were distributions over the original four states. So what is this start state here? Well, it was pi for M, for the two recurrent states of M. But how is it related back to the original presentation? It's pretty straightforward: we have this distribution, so we have half a contribution from this state and half a contribution from that one, which means we should add up half of each of the two recurrent mixed states, like this, and, lo and behold, that is the original distribution over the four states. So that's just unpacking the hierarchy; you can always take it back down.
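For contrast, the same sketch applied to the four-state presentation shows the non-synchronization directly. The lecture doesn't spell out the matrices, so the ones below are my own elaboration, chosen to be consistent with the description above (on a square you can only land in B or D, on a triangle only in A or C); it reuses mixed_state_presentation from the earlier sketch.

```python
import numpy as np

# Hypothetical four-state elaboration of the period-two process:
# A -> B on square, B -> C on triangle, C -> D on square, D -> A on triangle.
T_p2 = {'0': np.array([[0., 1., 0., 0.],    # square transitions
                       [0., 0., 0., 0.],
                       [0., 0., 0., 1.],
                       [0., 0., 0., 0.]]),
        '1': np.array([[0., 0., 0., 0.],    # triangle transitions
                       [0., 0., 1., 0.],
                       [0., 0., 0., 0.],
                       [1., 0., 0., 0.]])}
pi4 = np.full(4, 0.25)
states4, _ = mixed_state_presentation(T_p2, pi4)
# states4: [1/4, 1/4, 1/4, 1/4], then [0, 1/2, 0, 1/2] and [1/2, 0, 1/2, 0]
# alternating forever.  No delta function ever appears, so this presentation
# never exactly synchronizes; its two recurrent mixed states are the
# recurrent part of the machine M described above.
```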
So, the relationship to the epsilon machine. In general, although the mixed state presentation is unifilar, it's not the epsilon machine, because it can fail to be minimal. There are cases, and we're still thinking about the exact criteria you need to add in, or to detect ahead of time, for the MSP of a presentation to be the epsilon machine; it's often not minimal. In fact, Ryan gave me a nice, simple example, which maybe I'll put on the homework. But we can still say this: because it is unifilar, and it's a presentation of the process, it's a prescient rival; it's equally predictive. And that means, in particular, that however it's partitioning up the histories, that partition is a refinement of the causal state partition. So there is a relation. Yeah? I was going to comment that if anyone wants to work on exactly when the MSP is the epsilon machine, that could be their class project; there's no group on that yet. Okay, good. Yeah, okay, right. So there's a research project here: exactly when, or under what conditions, will the mixed state presentation of a presentation be the epsilon machine. We have some good hints about how to prove this, but it's not proven yet. And we have some simple examples, and again I'll put one on the homework, where it doesn't minimize but still gives you useful states, still prescient rival states. Now, if you give me the causal state partition and calculate the mixed state presentation, you will get the epsilon machine, but that's kind of cheating. And in a sense, remember the epsilon function, where you plugged in histories and got out the causal state you're in: if, when you calculate the mixed state presentation of a machine, the result is the epsilon machine, then in fact the mixed states are the causal states. So we're really close; there's just a little extra criterion needed. Now, as the previous example with the four-state presentation let us see, the mixed states actually give us some information about synchronization, about extra properties of the original presentation that you couldn't necessarily conclude if I just gave you the period-two process and you built the epsilon machine. So it lets you, in some sense, analyze presentations for how verbose they are, how overly large, as in the four-state period-two case. We'll talk in the next couple of weeks about how to measure things like that: how redundant models can be, how many extra components they have. You don't need a four-state presentation for a period-two process; that's sort of obvious, but there are more general cases where there are extra components and it's not obvious. You need some quantitative way of measuring that, and the mixed state presentation gives us a handle on it. So, just to summarize the discussion of how mixed state presentations relate to epsilon machines: imagine we're given some hidden Markov model that generates a process. We could go to the process and, just from the word distribution it produces, get the epsilon machine using the predictive equivalence relation. Or there's a more direct way: we start with M, calculate its mixed state presentation, the dynamic, and then we simply minimize it. There are well-known state-merging algorithms that will do this, and then the result will be the epsilon machine.
It might be nice to do this all in one go, to find some modified mixed state operator that just did that, rather than tack on an extra algorithm that goes through and compares the sequences that come from states. Yeah. Yes, right, that does work. So there are a number of these, the Hopcroft, Ullman, and Brzozowski style algorithms, different ways of minimizing; they will be very familiar from automata theory, although automata theory is talking about non-probabilistic machines. But there are analogs of that minimization, state merging, finding equivalent states, for probabilistic machines. So that's what I mean here: we minimize. Okay. So, concrete consequences of all this. I've been hinting at why it's useful, but there's nothing like asking: what can I calculate now? And it actually leads to a lot of quantities we can calculate in closed form. The first thing is to think about the computational complexity of calculating the block entropy. We have a number of different information measures that all rely, directly or indirectly, on the block entropy: the block entropy itself; the estimates of the entropy rate, which are differences between block entropies; and E, which depends on the block entropy because it's basically the offset of the linear asymptote. So if you can calculate this one, you immediately get a handle on those and other things too. But, of course, one would like to go rather directly from the epsilon machine: we can get the entropy rate, if I have the epsilon machine, just from the state-averaged transition uncertainty. So there's another way of doing that. But let's just think about the things we can get from the block entropy. And just to remind us, since it's been a number of weeks: here's the random-random-XOR process, and this is the calculation of its block entropy going out to length-15 words, with our nice linear asymptote. We're trying to fit that straight line to the blue curve, the block entropy, and it's doing okay, but I don't know: is the random-random-XOR process taking a long time to settle down to statistical equilibrium here? It's not exactly clear where the straight line should sit. So is there some better way to do this? And just to remind ourselves, we talked on Tuesday about different ways of calculating word probabilities. The explicit way: we just sum over all possible paths that produce a given word, say 010. But there's an exponential number of those paths, growing with the word length, in that sum. Instead, and this is even what motivated thinking about mixed states on Tuesday, we can get a linear algorithm for calculating word probabilities if we just update the state distribution incrementally with each symbol that we see as we parse through the word, and that ends up being linear in the word length. So that's great: with an exponential dependence like the path sum, you're not going to go out to length-50 words; that's just inaccessible. However, when we go to calculate the block entropy, there's a little issue: we have to get the probabilities.
We can calculate each probability in linear time in the word length, but now we have an exponential number of words, a number that grows exponentially with L. So when we calculate this thing there's an exponential number of terms, and the computational complexity is exponential in L, with the base set by the alphabet size; not the number of states anymore, but the alphabet size, even if we use the efficient word-probability calculation. So it seems like we're kind of screwed: the problem for the block entropy is still exponential in L. Why would we care about that? Well, in fact the random-random-XOR process converges extremely slowly. Here's a length-L approximation of E, just looking at terms going out, for the random-random-XOR; it's about 2.5. We will get to a way to calculate this exactly, and this is the exact value for random-random-XOR, 2.5 bits. But what we're showing here is this curve, this approximator, coming out here: we have to go out to length-50 words. That's pretty bad. So even relatively simple processes (there are only five recurrent states here) can have informational properties that require looking over very long words, and in this case the entropy rate of the random-random-XOR is relatively high, so we're seeing pretty close to all 2^50 words here. You could say: wait just a second, how can you show me a quantitative plot of that? I'm actually showing you an estimate that goes out to length-50 words. How is that possible? It should be impossible. So that's what I want to talk about: how we can actually look at the convergence properties and even calculate this approximation that far out. At the end of next week we'll talk about how to get the exact value directly from the epsilon machine, but it takes some work just to get this quantity. So random-random-XOR, and many other randomly chosen hidden Markov models, have very slow convergence properties, so we need to do something else. I'm using this plot to tell you how slowly it converges, but that begs the question, because in fact we used the efficient algorithm to get this far out and see how slowly it converges. So how are we going to do this? A few observations. When we calculate the block entropy, call this the implicit way we were thinking about it before, with the notation from the lectures in the winter quarter. The entropy of length-one words is just the entropy of a single variable. For length-two words, a block of two symbols, we have a joint entropy, and we can of course factor it into a conditional entropy, the uncertainty in the next symbol given the previous one, plus the uncertainty in the previous one: H(2) = H[X_2 | X_1] + H(1). Well, that second term is just H(1), so we can get to H(2) by remembering H(1) and just calculating this one-step conditional uncertainty. Ditto for H(3): we have a joint entropy over three variables, and I can break that into the uncertainty in the third symbol given the previous two, plus the block entropy over two symbols, H(3) = H[X_3 | X_1 X_2] + H(2), and we have that second number already if we remembered it. So there's a telescoping way of calculating block entropies where we just remember the previous block entropy; whatever the algorithm is, it starts from length one and goes on up, and then all we have to do at each step is calculate the uncertainty in the next symbol given some past.
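Anticipating where this is going (the next paragraphs show that this conditional term can be computed from the distribution over mixed states at each depth, rather than from a sum over all length L minus one words), here is a minimal sketch of the resulting telescoping computation; the function name and the dictionary bookkeeping of mixed states are my own choices.

```python
import numpy as np

def block_entropies(T, pi, L_max, tol=1e-12):
    """H(1)..H(L_max) via the telescoping identity
    H(L) = H(L-1) + H[X_L | X_1 ... X_(L-1)], with the conditional term
    computed from the distribution over mixed states at each level rather
    than from all length-(L-1) words."""
    # level maps a mixed state (stored as a rounded tuple) to its probability
    level = {tuple(pi): 1.0}
    H, total = [], 0.0
    for _ in range(L_max):
        h_step, next_level = 0.0, {}
        for eta_key, p_eta in level.items():
            eta = np.array(eta_key)
            for T_x in T.values():
                v = eta @ T_x
                p_x = float(v.sum())             # P(x | eta)
                if p_x <= tol:
                    continue
                h_step -= p_eta * p_x * np.log2(p_x)
                key = tuple(np.round(v / p_x, 12))
                next_level[key] = next_level.get(key, 0.0) + p_eta * p_x
        total += h_step                          # H(L) = H(L-1) + h_step
        H.append(total)
        level = next_level
    return H
```

Each pass touches only the mixed states reachable at that depth, so for a process with a finite mixed state presentation the cost per additional symbol is bounded and the whole computation is linear in L.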
So this is the quantity that, if we can calculate it efficiently, lets us go out to long word lengths; we get around the exponential dependence. And it shouldn't be any surprise that we're going to use the mixed states as proxies for those long words. We won't actually condition on long words; as we go forward, L equals one, two, three, we just keep pushing the current state distribution forward, and then we replace this term with the uncertainty in the next symbol given the immediately preceding mixed state. The mixed state calculation goes fast, so that's the main idea. To do that, we of course have to be a little more explicit about what we're conditioning on. Here I've just rewritten the previous argument for the telescoping calculation and put in the fact that we were assuming pi as the starting state distribution; that's what we meant all along. The same sort of factorizations occur, and these block entropies on the right-hand side are just: what's the uncertainty in the next symbol given pi as the start distribution; what's the uncertainty in the next two symbols, a word of length two, given that start distribution, which is H(2); and so on. So really what we have to focus on is this uncertainty in a single symbol given the previous word, explicitly stating what the start distribution is. But again, now that I've put this in here, what we're going to do is move it forward: for L equal one, the mixed states; for L equal two; and so on. This is what we're interested in calculating efficiently with mixed states. So the main result is that that one-step uncertainty term can be rewritten in terms of these conditional random variables, basically the mixed state at time step L conditioned on your starting mixed state. And we know how to push those forward linearly, so then we'll be in great shape. It's a pretty straightforward idea: we just keep pushing the state distribution forward, and then we only have to look one step ahead, at the transitions from those mixed states, to calculate this conditional entropy; then we're home free. So, to write it out explicitly: given this result, the block entropy is obtained by adding this term in incrementally as we go to longer and longer words. We just have a single-step conditional block entropy to calculate, and the result is an algorithm that's linear in L. That's how we could even show you that previous random-random-XOR estimate of E. And why is this happening? Well, state distributions are great ways of summarizing the past; that's the whole lesson here, or the whole trick. Okay, so now, since I did state it as a theorem, we need to prove it, and I'll step through it; it's not too bad. There are a couple of properties we need to establish. Remember, before, we had this way of moving the mixed state forward one step on seeing a symbol, that vector-matrix multiply in a sort of funny form. I'm going to replace that form with a delta function, but I mean the same thing. It seems highly redundant, because we're pushing the distribution forward with a deterministic function, so there's only one next value we're going to get, and every other mixed state has probability zero; we're not going to go to that mixed state. So, stay tuned: I'm just taking those two cases and summarizing them compactly with this delta function, which is one when the
eta I'm looking at here is the mixed state pushed forward, and otherwise the probability of that mixed state is zero. Okay. So the first result is just to remind us how we go back and forth between the mixed states as random variables and the presentation state distributions; it's just a step we use later on in the proof of the theorem. How do we establish it? Well, we have the probability of x given mu. I can think of that as a marginal distribution, where I sum over, marginalize out, all possible next mixed states. So this is just unpacking it, adding in what seems to be an arbitrary variable here; in other words, I'm taking this joint distribution and marginalizing out eta to get just the symbol in front. Well, I just wrote down what that joint is: it's given by this delta function here, and then this sums out to one. So really the statement is that the probability of x given the mixed state is the same thing as the probability of x given the states distributed according to the mixed state. Rather straightforward. Or think of it this way: the probability of seeing x given pi is the same thing as seeing x given that the initial states were distributed as pi. So again, this is the notation where we shift between talking about a state distribution and a random variable. The other thing we need is to look at the uncertainty in pushing the mixed state forward: what's the probability of seeing this or that mixed state given that we started with a particular mixed state? Same move as before: we think of this quantity as the marginal, now over x, of this joint conditional distribution. We're using this delta function here to map things in, and then what we do is move back from the state random variable distributed according to the mixed state to the mixed state random variable being equal to that particular mixed state. Now, the way to think about this is: what we're trying to figure out is how I go from mixed state to mixed state, but of course the mixed states are calculated, induced, by the symbols that you see. So what we're showing here is that the contribution to the probability of mixed state eta comes from all of the symbols that lead to that mixed state. Here I am at eta, I've got mu_t, the previous mixed state, and there are various symbols coming in that take probability from the various states and contribute it to eta. That's what this delta function is doing, sort of indirectly, because I've rewritten this in terms of the uncertainty in the next symbol: grabbing the probabilities of seeing those symbols and adding them up when those symbols lead to eta, namely the next mixed state. Okay. And we can generalize this from one step to L steps: I start at mu at time zero and end up with eta at time step L. It's the same kind of trick, using the previous result. It may be easiest to see for L equal to two. I want to take mu_0 and push it two steps forward; well, now I have two mixed-state random variables, one at time step one and one at time step two, eta and psi, going forward from mu_0, and I just unpack this using the same basic identity, the same construction as before. The net result is that you can write the probability of being two steps ahead, in eta, by summing up the probability of seeing all possible words of length two and what they contribute to that mixed state. Okay, now for the longest series of steps, the actual proof of the theorem that we have
this efficient way of using mixed states as proxies for pasts when calculating the block entropy. So again, this is the term that we're adding up at each L to get the block entropy; in some ways it's like h_mu(L), the two-point slope approximation of the entropy rate. Okay, so we're going to start with some initial state distribution, having seen some past, and we want to show that we can basically replace this with an update over mixed states. Now, when we write out this kind of conditional entropy, what we mean is that we're averaging over the probability of what we're conditioning on; here I'm just keeping track of which initial distribution we have. So this is just the definition of the conditional entropy: I'm averaging, for each particular past, the uncertainty in the next symbol, weighted by the probability that, having started at mu_0, I produced that word, and then I sum that up. It's the history-averaged uncertainty in the next symbol. Then, based on the previous results, rather than thinking about the random variable of the states being distributed according to some distribution, I can think of the mixed-state vector as being equal to this mixed state. And over here, rather than keeping these two variables, the word and the initial state distribution, what I'm going to do is push the initial distribution forward with that word and then just talk about the states at the next time step. This is maybe the most important move: there's a new variable here, and crucially it's the state just before the symbol you're interested in. We just move that over to here to simplify things. Now, as the previous lemmas essentially showed, we can shift from talking about the states of the presentation to the mixed states over the presentation; it seems like just a notational change, but it's a semantic shift. Then, and these proofs are sort of unsatisfying, I make a kind of null move: I stick in this delta function here, which doesn't really add anything, and then make some identifications. I'm going to sum over etas such that they're equal to this immediately preceding mixed state. That's fine; I didn't do anything, since there's no other eta dependence in here. But what I can do then is replace this immediately preceding mixed state with eta; that's just a simple identity using the delta function. Now, of course, the magic happens when you swap the order of summation. This factor here depends on eta but doesn't depend on w, and this one doesn't depend on w either, so I pull them out front. So I'm summing over these preceding mixed states: I have this single-step uncertainty in the next symbol given the preceding mixed state, I pull the sum over w through, and now I just look at this probability. But this is the thing we were just looking at, so we know what that sum over w is: I can replace it with the joint distribution over the preceding mixed state and having seen the word, starting from mu_0. And, as we showed, that is simply the uncertainty in the mixed state we get at time step L given that we started at mu_0 at time zero; I'm just marginalizing this out. Okay, and that is what we were looking for. So now we're looking at the uncertainty in the next symbol given the immediately
preceding mixed state, times the probability of seeing that mixed state given that we started with mu_0. It's slightly telescoped here, but that was the goal: we don't have to think about the history, or even the states, explicitly. Instead, and this is what's packed in here, we just push mu_0 forward L steps as we read in the different words. Yeah, Paul? So I'm trying to understand how one would go about the calculation, given these last steps. Are we imagining looking at the mixed state presentation, and at any time you look at the mass in each node, and the entropy is basically the mass times p log p, so you're summing over all of the edges out of each node and weighting by the mass? Yeah, exactly, it's the same. Right: if I give you the epsilon machine, what you do is the state-averaged branching uncertainty, using the asymptotic state distribution. Well, this is a state-averaged symbol uncertainty, and it's a branching uncertainty because the symbols label the edges and it's unifilar, so that's not a problem. So it's basically the same idea, except that as we go from one word length to the next we just have this incremental update. For each possible symbol we can add on to the previous length L minus one word, we calculate a new mixed state: the one I'd reach on a zero, the one I'd reach on a one. Now I have two new mixed states, I use them together with H(L minus one) to get H(L), and I just keep adding on. The point is always to break this down so that, as you're increasing L, you're just updating the state information, the mixed state information, and then the calculation at that point is kind of trivial, because the mixed states have summarized all the probabilistic information you need from the past, and I just take one step ahead, just like for h_mu. So you don't have to calculate any particular word probability; you're just pushing the MSP forward. Yes. How you actually implement it has some more subtleties; you roll up your sleeves and do some clever coding. So let me just finish up here and talk about the synchronization information. In fact, the whole paradigm turns on this idea of an observer trying to figure out what the internal hidden state is, and there's this quantity we defined last quarter, when we were talking about information theory: the integrated amount of state uncertainty as I look at histories of length zero, one, two, three, until I get synchronized. And just to write out what we mean by this kind of conditional entropy: fix L, and we sum, over all the length-L histories, the state uncertainty for each word of length L. Okay, but these quantities are what we were just working with: we had a history and we just wanted to look at the mixed state given that history. So it should, in principle, be pretty easy to calculate: basically just go through all the words and calculate the mixed states they induce. So we can get these probabilities, which would go into that H function there, and then calculate not so much the conditional entropy as just the state uncertainty, because that is the conditional uncertainty of the states given what you've seen. So it's really compact, and you can simplify things by keeping track of equivalent mixed states and so on.
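Written as code, that prescription is the same level-by-level pushforward as before, now accumulating the entropy of each mixed state weighted by its probability. A minimal sketch, in which I take the sum to start from the uncertainty in the initial distribution itself and stop once the remaining contribution is negligible; those conventions are my assumptions, not something fixed in the lecture.

```python
import numpy as np

def synchronization_information(T, pi, L_max=500, tol=1e-12):
    """Accumulate, over history lengths L = 0, 1, 2, ..., the average state
    uncertainty of the mixed state reached after L symbols, by propagating
    the distribution over mixed states.  Pure (delta-function) mixed states
    contribute nothing, so only the transient mixed states matter."""
    def entropy(p):
        p = p[p > tol]
        return float(-(p * np.log2(p)).sum())

    level = {tuple(pi): 1.0}
    S = 0.0
    for _ in range(L_max):
        contribution = sum(p_eta * entropy(np.array(eta))
                           for eta, p_eta in level.items())
        if contribution <= tol:
            break                      # synchronized: nothing left to add
        S += contribution
        next_level = {}
        for eta_key, p_eta in level.items():
            eta = np.array(eta_key)
            for T_x in T.values():
                v = eta @ T_x
                p_x = float(v.sum())
                if p_x <= tol:
                    continue
                key = tuple(np.round(v / p_x, 12))
                next_level[key] = next_level.get(key, 0.0) + p_eta * p_x
        level = next_level
    return S
```

For an exactly synchronizing presentation the per-level contribution decays and the sum converges; for the non-synchronizing four-state presentation above it would not, which is another way of reading off that property.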
Okay, so I think this is the last proof-like thing, and it's really just a calculation. Again, the goal is to rewrite this history-dependent quantity in terms of just updating mixed states. So first we take the definition, being explicit about the dependence on the initial state distribution: we average this conditional entropy over all possible history instances, and this is just the entropy of the mixed states. I do this little trick again of introducing a delta function over eta, a kind of dummy variable that just fixes eta to be the mixed state that got pushed forward, and I replace that H here with the eta, and then swap the orders of summation so the H of eta comes out here. So this is over all possible mixed states we could see (we're only going to see some small number of them), and we're basically adding up their state uncertainties. And then we've got this piece, which we've worked with before: given that we have an initial mixed state, what's the probability we're going to see this word? Well, we can swap that around and just use the mixed state instead, by summing up the words that contribute to that mixed state; we proved that before. So we're just adding up the mixed-state uncertainties over those mixed states that actually occur, weighted by the probability of their occurrence. It's all very similar manipulation. So the result is that, rather than use the original definition, we can just do this calculation over the mixed states, the mixed-state-weighted state uncertainties: at each L we compute the mixed state entropies and weight them by the probability of being in those states. Now, what's interesting about this is that the mixed states you run across here might at some point turn into the ones associated with the epsilon machine. Those correspond to the synchronizing words, and they don't contribute, of course, because their state uncertainty is zero. So in this calculation it's really the transient mixed states we're using. Yeah. So we can now write the synchronization information directly in terms of the mixed state presentation for the recurrent epsilon machine. So, just to summarize. What did we gain? Well, at first we were just looking at how to thoughtfully calculate word probabilities, and we noticed that it's better to propagate, to push the state information forward, rather than to consider the space of all possible paths. In some sense it's the benefit of having a little model in mind of what the structure of the process is; in this case we have the states, and we looked at state distributions. A very practical consequence: we can now calculate the block entropy linearly in the word length. We did have to suffer a bit through talking about these conditional random variables, but hopefully the motivation in terms of the simplex, and updating on the simplex, made it clear why we had to do that; the proofs rely on it, at least the way I can think about them now. So there's a little bit of bending over backwards in terms of formalism, but the benefits are worth it. And we're now also looking at, rather than always focusing on the asymptotic state distributions that lead to a stationary process, all the possible different state distributions and how
they get updated on a machine, so we don't have to think only about the stationary case. Okay, so that's that. Any other questions?