Okay, let's go ahead and get started. There's a new lab that will help with the homework. Homework 13 is out now, and there's this new lab which shows how to call up machines and generate data from them, using Pythonic notation here to just print a hundred of the symbols. Then there's some wrapper helper code: you pass in the data, the morph length, the future distance you want to go, and your tree depth. This uses what's called subtree-merging reconstruction, which is the main reconstruction method I introduced; there are others, and we'll hear about those from Chris. It returns a machine and, lo and behold, it's very close, up to numerical fluctuations in the transition probabilities, to the original.

Now, there are many ways to break this or to get silly answers, for example generating a sample of length 3. So for these exercises, the hint is: it doesn't hurt to just throw lots of data at it for now. The point is to see the reconstruction process and go through it, and not worry so much about data fluctuations. The lecture that Chris Strelioff is going to give on Bayesian inference will bring us closer to dealing with finite-data-size fluctuations.

Then there's this handy little process-table function: you pass it a machine, or actually a list of machines, and you say give me the entropy rate, the statistical complexity, the excess entropy E, using character strings that identify those quantities, and it plots a nice little table. Here's an example with pseudo-real data from the logistic map. Again, there's a little helper function: you pass in a logistic map parameter value and the amount of data you want, and look at some iterates of the map. There's a way of taking the iterates on the interval and partitioning them, another helper function; this isn't deep, sophisticated code. Then we get our symbolic-dynamics sequence out. Now it's just binary data.
We can plug that back into the machine-inference code. Again, it has parameters, and there are ways of setting these so that you get crazy, wild things, so stick close to the suggested values for now. Again, lots of iterates help. In this particular case the parameter was r = 4, the value at which we proved that the map is Markovian. So, surprise surprise, we get the fair coin back out, or the fair coin up to statistical fluctuations, and you can print out the now-very-familiar properties for that model.

Okay, that's just to give you some idea, and also maybe to hint a bit at what the components of a project might be: take a dynamical system, look at some behavior, partition it, and then do some kind of informational or computational mechanics analysis. That's the architecture of a straightforward project.

So let's get started here; today we have some real work to do. Most of that code, certainly the process-table command that calculated all the information properties when you passed in an epsilon machine, rests on a lot of technology built into the Computational Mechanics in Python library. Today's lecture and Thursday's lecture are going to give you some sense of how things go, and I'll try to indicate the problems that confront you if you want to calculate these informational quantities from an epsilon machine, or just in general: how do you estimate them from a given
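The lab's helper functions aren't reproduced here, so this is a minimal standalone sketch (the function name and defaults are my own, not the lab's) of the pipeline just described: iterate the logistic map, binarize with the generating partition at one-half, and read off the symbolic sequence.

```python
def logistic_symbols(r, n, x0=0.3, transient=100):
    """Iterate the logistic map x -> r*x*(1-x) and binarize each iterate
    with the partition at x = 1/2: symbol 0 for the left half-interval,
    symbol 1 for the right."""
    x = x0
    for _ in range(transient):      # discard transient iterates
        x = r * x * (1.0 - x)
    symbols = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        symbols.append(0 if x < 0.5 else 1)
    return symbols

# At r = 4 the map is Markov under this partition, so the symbol
# sequence should look like a fair coin up to fluctuations.
s = logistic_symbols(4.0, 20000)
print(sum(s) / len(s))   # close to 0.5
```

This is the whole architecture in miniature: dynamical system, partition, symbol sequence; the reconstructed machine from these symbols should be the fair coin up to statistical fluctuations.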
presentation of a process? So far we've been discussing the epsilon machine, in particular focusing on the predictive equivalence relation, the causal-state equivalence relation, which gave us these partitions of the space of histories, the so-called causal states. As we emphasized when we talked about the intrinsic semantics that a model like that gives, these causal states are conditions of knowledge about the past that let you do optimal prediction of the future.

Well, that's just the tip of the iceberg in this computational mechanics setting. What I'm going to introduce today actually telescopes out into a hierarchy of states of optimal prediction. The language will largely focus on different kinds of models or presentations, different kinds of states, but in fact it's a quite general idea. In other words, the original predictive equivalence relation induces conditions of knowledge for optimal prediction, but there are also conditions of conditions of conditions of knowledge for prediction; it actually telescopes out. We're only going to go up one level; that's all that's really necessary for most of the things we're dealing with, until we get to infinite-state processes, and then it becomes unavoidable.

So there's some mathematical heavy lifting to do to introduce this. I'm hoping the gods of clarity will smile on me as I explain some of the technical difficulties, but there will be examples. In fact, as a heads-up, think back to the Tahitian vacation example: we were reading the newspaper, getting forecasts, and trying to infer what the state of the weather was on Tahiti before we hopped on the plane and packed our bags. One way to think about this is through the notion of synchronization.
We've been talking about synchronization; it's a key idea here and helps organize a lot of the constructions. We have some observations; there can be a setting in which we have a model but don't know what the current state is, and we're trying to guess it. That's the overall mathematical setting.

First, a little bit of review. Today that's mostly notational, but it's a little subtle. I'm going to talk about conditional probabilities and also conditional random variables. We need to condition on things like what the current state distribution is, or what the state distribution was when the process started, so I have to be more explicit about that. I hope it ends up being intuitive; we'll dive down to a level of notational detail that I hope will asymptotically be helpful.

Then, with that behind us, we'll start talking about a new concept of state, so-called mixed states, and the induced models, mixed-state presentations, and we'll go through some examples. Then on Thursday we get to have more fun: we actually get to see the benefit of using this mixed-state presentation. Not only are we inducing a new concept of state; it has a number of direct consequences, in particular for how we can very efficiently calculate various complexity measures. I'll talk in some detail about fast ways of calculating block entropies, and I'll revisit the synchronization information to rewrite things in a very efficient way. This will give you some sense of what's inside the Computational Mechanics in Python package.

In particular, there's a problem here.
If you have some kind of stochastic process and you're looking at words of length L, then if the process has even slightly positive entropy rate, the number of sequences you can see of length L grows exponentially in L. That's an exponential number of numbers you have to keep track of, and it makes estimating word distributions quite problematic, especially if you're trying to do it literally from data: take a window of length L, sweep it through, and just count things up. In many ways, one of the punch lines here is going to be that this is why we build models: we need more efficient ways of estimating quantities than the literal frequency counting of words. So we'll see how this mixed-state presentation leads to a very efficient way of calculating block entropies, and then other quantities that depend on the block entropies.

Next week we're going to get back to a question we brought up last week, when we were trying to talk about processes in reverse. We didn't really know how to think about it other than to state the intuitive idea: we're scanning the random variables in the opposite direction. The question came up: if I have an epsilon machine that produces a process, well, I can certainly produce the process, scan it in the opposite direction, and then build the epsilon machine for that. But is there some way of getting the reverse epsilon machine from the forward one directly? It turns out this mixed-state presentation is key. So next week we'll talk about using mixed-state presentations to reverse a process. That turns out to be interesting in its own right, and it also leads to efficient calculation of a whole other batch of complexity measures, including, somewhat surprisingly, the time-symmetric quantity, the excess entropy. Anyway, I'm giving a heads-up here because, well, you'll see why.
Okay, so, review: the golden mean process. No consecutive zeros; if you see a one, the next symbol is a fair coin flip, zero or one. Its epsilon machine has two states, and we have this binary alphabet, all very familiar. We have the matrix that gives the transitions on the zero symbol and the matrix that gives all the transitions on the one symbol. There's one transition here, from A to B with probability one-half on symbol zero. And here we have two transitions, from A and from B, into A: if I'm in B, with probability one I go to A; if I'm in A, with probability one-half I come back to A. This should all be very familiar by now.

If we're interested in the internal state process, we just component-wise sum these symbol-labeled transition matrices and get the internal Markov chain transition matrix. If we solve the eigenvalue equation, doing left multiplication here and normalizing pi to a probability distribution, we get the asymptotic state probabilities. Okay, all very familiar. Most of the time we spend in state A.

Yes, Chris? [Question about the alphabet.] Sure, yeah, arbitrary. I'll turn up the volume here for your speaker. Right now I'd say a finite measurement alphabet is easy to handle. We did talk about some infinite-state processes, and this can generalize a little bit; extending all of this to, say, a continuous output variable, that's the research frontier. But I'm trying to be concrete here with the golden mean.

Now think of the golden mean epsilon machine as a generator. You might have noticed that on the edge labels I sometimes put the transition probability on the left with the output symbol on the right, or vice versa. There are actually two modes of using these epsilon machines. One is this mode, transition probability then output symbol: we think of it as a generator.
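The review above can be written out concretely. Here's a small numpy sketch of the golden mean symbol-labeled matrices, their component-wise sum, and the left-eigenvector calculation for the asymptotic state probabilities.

```python
import numpy as np

# Golden mean process: symbol-labeled transition matrices over states (A, B).
# T0[i, j] = Pr(emit 0 and go to state j | in state i), likewise T1.
T0 = np.array([[0.0, 0.5],
               [0.0, 0.0]])   # A --0, p=1/2--> B; no zeros out of B
T1 = np.array([[0.5, 0.0],
               [1.0, 0.0]])   # A --1, p=1/2--> A; B --1, p=1--> A

T = T0 + T1                    # internal Markov chain transition matrix

# Left eigenvector of T with eigenvalue 1, normalized in probability:
# the asymptotic state distribution pi.
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
print(pi)   # approximately [2/3, 1/3]
```

Normalizing by the sum both fixes the eigenvector's arbitrary sign and makes it a probability distribution, confirming that most of the time is spent in state A.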
I have the model and I want to generate realizations. The other way around, we think of it as a recognizer: it reads in zeros and ones, and the machine calculates path probabilities. A minor difference.

So here's the question: what words does the machine produce, and with what probabilities? We calculate the word distribution this way: we start with the asymptotic invariant distribution, we take the product of the T0 and T1 matrices extended to words, and then we sum things up with this column vector of ones, and that gives us the probability of a word. Simple case: what's the probability of seeing a zero? We have two-thirds, one-third for the state probabilities, times T0, and then we sum things up. We start with two-thirds, one-third, and then we see a zero. We're in state A with probability two-thirds and we see a zero with probability one-half, so that's a contribution of one-third. We're in state B with probability one-third, but from B we can't see a zero. So the probability of seeing a zero is one-third. Same thing for the probability of a one: it's just pi times T1. I'm in A with probability two-thirds and can see a one with probability one-half, so that's a one-third contribution; I'm in state B with probability one-third and will see a one with probability one, so that contributes another third. So the probability of seeing a one is two-thirds, and so on.

Okay, I'm going a little pedantically here, but this is how we calculate these things, and it already shows you what I said before: if I wanted to calculate the probabilities of all the length-ten words, I'd have two to the ten, about a thousand, of these calculations to do.

Okay. Now, some notation from probability theory. We're going to have a discrete alphabet; we're just thinking of processes.
We have a random variable at time t, capital X sub t; an instance of it is some value in the alphabet at time t, written lowercase. The way we think about this is that the random variable X_t is distributed according to this distribution; in the Cover and Thomas notation, the tilde simply says the random variable is distributed as this distribution. We're going to use this a lot.

Okay, we have different types of random variables. Quantitative: age, voltage, temperature, whatever; averaging makes sense, so I can calculate the average age of people in the room or the average voltage in a circuit over an hour in my office. And categorical: names, colors. We still talk about, say, the probability of it being sunny, but does it make sense to calculate the expectation value of the weather? It's sunny in California, it's rainy in Tokyo, and therefore the expectation value is... fog on Earth? It somehow doesn't make sense for categorical variables. Expectation value of colors? Well, if you turn the categorical variable red into an RGB triplet or an HSV triplet, then we can start talking about an average vector in color space; but until you make it quantitative, expectation values are a bit puzzling for categorical random variables.

For processes we have our observed process, this giant joint distribution over the bi-infinite chain of random variables. Typically we assume that all the marginals of this are time independent, that the probabilities are time independent.
The process is therefore stationary. At a minimum, the epsilon machine is some sort of convenient representation of the process, but it's just one; there are alternatives, and we choose among the alternatives based on how useful they are. Given the epsilon machine we can at least calculate the word distributions using that simple formula, a vector-matrix multiply.

Now, we've been focusing on just the observed process, and we need to be more careful. What's really going on, especially if we imagine we have the model as a little generator producing the zeros and ones that we observe, is that the states are changing too. So the observed process, the zeros and ones, the words over whatever alphabet, is really a marginal distribution of the machine process, which is pairs: internal state, output symbol; internal state, output symbol; and so on. We project onto just the measurement symbols when we talk about the process, but what the states are is obviously important too.

So when we've been writing down something like this, it's been implied that we use the asymptotic invariant distribution, and the benefit of that is that these word probabilities are then stationary: they're the same anywhere in time. That doesn't mean we always have to do that. We could use other state distributions, and the only cost is that the word probabilities may then depend on time, that is, the observed process is nonstationary. But we can certainly do that. Anyway, the punchline is: we're going to be a little more careful when we write down the word distribution like this. We're going to write out explicitly that it is an explicit average over state probabilities.
So: what's the probability, given that I started in a given state, of seeing a word? Each state either does or doesn't produce W; we add up those probabilities over states, and that's what we mean by this: the machine or process could have generated the word starting from any state. Okay, so that's just review.

Now, some computational things to note. Say we have our state set. As I noted before, the number of words we have to deal with grows exponentially with the length, and that's burdensome. Is there a more efficient way to calculate these word distributions when we evaluate this expression? The answer is yes.

The straightforward but inefficient way is this. Here we're asking for the probability of seeing the word abc, and again, just applying the formula, that's pi T_a T_b T_c, and then we sum up with the unit vector. If you write it out explicitly, you have to sum over all possible state paths that could have produced abc. So in this sum, each of the state variables ranges over the number of states, and the longer the word, the more sets of states we have to sum over, so the number of paths grows exponentially. That's bad. So this first method, just multiplying things out term by term, calculates the probability of the word by enumerating all possible paths, and
It's calculating the probability this word over all possible paths and That's grows exponentially in the state size So that's not so great However, if you just a little bit of careful thought and probably if one was programming this up just the very fact It's so inefficient would lead you to think about grouping things in a different way So the second way which is completely equivalent to the first is that we every time we look at a new symbol from the word We update and since summarize what's happened in the past by just pushing the state distribution forward with that single matrix Well, that's a vector times a matrix. That's nice. And then I have now I have a new State distribution that I just go state distribution and I update it with in this case the B matrix new state distribution Okay, I have now Three matrix multiplies If it was length L, I have L matrix multiplies Therefore the second way of doing it is actually linear in the length We're actually using the states to carry what we need from the past. We don't just throw all the Pads out there and figure out which are The probability that they assigned to ABC we do this nested Updating of the internal state distribution Again, it's just like the synchronization problem. Oh, sunny rain and you're updating your distribution of the internal states of the Dehesion weather system Right, so now we have this vector Here and then the sigmouth component of that is the probability that you produce W and ended up in state sigma And then we just keep pushing that forward so Don't bad Good So rather than having your calculations are poop out on you. It's short word lengths. You can push this quite a lot further Okay, so now Really terrible colors Right so conditional probabilities the way we've been thinking about them Right, what's what's the weather on Tahiti if I read in the paper that it rained yesterday So we think of conditional probabilities been conditioning on events Well, we have to modify that a bit. 
Why? Because what I was just talking about is conditioning on knowledge of the state, and that's a distribution. Actually, we're going to think of that as a random variable. But that's okay: conditional distributions are still distributions, and we know how to work with them. We just need some notation to deal with conditional random variables.

Here's what we mean by that, and I hope the notation brings it across. We have this two-letter alphabet {a, b} and a random variable X, and I'm going to define a new random variable that represents the distribution of X, namely the probabilities that X0 is a or X0 is b. What I mean is that there's a sort of hidden variable: I'm wondering how a and b are being produced depending on what the current state is. So this is going to be the probability that X0 is a or b conditioned on a particular state. In a way, this notation will let me occasionally hide, hopefully not obscure, the fact that we're depending on this condition.

Also, there's an ambiguity here when we look at entropy notation. The entropy of this new random variable is conditioned on the value that the state takes, and that's not the same thing as the uncertainty in X0 given the previous state, because in that entropy we average over all possible states. So this is a random variable whose value describes the probability of seeing X given a particular condition that is fixed.

Here's one example. We have our machine process: state, symbol, state, symbol, state, symbol, and so on, going forward in time. I define a new random variable, call it J0, that's distributed according to the probability of X0 conditioned on S0 being A. So this random variable is distributed according to the conditional probability of X0 given that S0 takes a particular value.
Imagine this is produced by the even process, and I've changed things here: instead of zeros and ones we have triangles and squares. That's the even process with triangles. So what's the probability that J0 equals a triangle? Just by the way we've defined it, the probability that J0 is a triangle is the probability that X0 is a triangle given that S0 was state A. Given that we're in state A, what's the probability of seeing a triangle? One-half. And that's not the same thing as just asking for the probability that X0 equals a triangle; that is this product here.

[Question.] No, it's defined this way; I'm defining it. I could just as well make, say, Q0 with S0 = B, and I'd have yet another new random variable. Right, so this is a random variable described not by a distribution but by a conditional distribution; I'm calling it a conditional random variable. It's related to the event set of X, but now that event set is being partitioned by this other state variable.

Down here, this probability, just "when is X0 a triangle": I'm in A two-thirds of the time and see a triangle half the time; I'm in B one-third of the time and see a triangle with probability one. That's two-thirds times one-half, which is one-third, plus one-third times one, and I end up with two-thirds, just like before. Okay, so that's straightforward, and these are the constructions we'll use. Maybe it'll be more intuitive when we get to the examples.

Another example would be defining a conditional random variable F2: the probability of being in a state at time 2 given that back at time zero I started in sigma-0. So what's the probability of that?
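The J0 calculation for the even process can be checked numerically. A small sketch, encoding the state-conditional symbol probabilities and the stationary distribution by hand:

```python
# Even process epsilon machine over symbols (square, triangle),
# states (A, B): from A, emit square or triangle with probability 1/2
# each (triangle leads to B); from B, emit triangle with probability 1.
emit = {'A': {'square': 0.5, 'triangle': 0.5},
        'B': {'square': 0.0, 'triangle': 1.0}}
pi = {'A': 2/3, 'B': 1/3}

# Conditional random variable J0 ~ Pr(X0 | S0 = A): fixing the state
# turns it into an ordinary distribution over symbols.
p_J0_triangle = emit['A']['triangle']
print(p_J0_triangle)    # 0.5

# The unconditioned marginal instead averages over the state distribution.
p_X0_triangle = sum(pi[s] * emit[s]['triangle'] for s in pi)
print(p_X0_triangle)    # 2/3, i.e. (2/3)(1/2) + (1/3)(1)
```

The two numbers differ, one-half versus two-thirds, which is exactly the distinction being drawn between the conditional random variable and the marginal.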
F2 describes the probability of ending up here, or I should say my uncertainty about ending up here, given that I started over there in a particular state, call it A or B. Okay, so now I can condition on this. The point is that I need to be able to talk about telescoping conditions.

So: what's the probability that X2 equals a particular value, triangle or square, given that F2 was sigma-2? What that means, if we just unpack it: I'll rewrite F2 explicitly as this conditional random variable, and saying that F2 equals sigma-2 means this first variable takes on that value. So now I have the conditional random variable with S2 equal to sigma-2: I end up in sigma-2 having started in sigma-0. That's fine; we can just drop the conditioning bar here, because these are just events; it's the same thing as the joint event that S2 was sigma-2 and S0 was sigma-0. We just unpacked it. And then we can simplify, because we're using causal states, and shielding means I only have to worry about the immediately preceding state.

So this is just an example calculation: I'm interested in this condition up here, and later on these will turn into distributions over states like this. In this particular case, the uncertainty in X2 given that F2 was sigma-2, this value up here, is the same as asking about X2 given that the state was actually sigma-2. So this F2 variable collapses down, because of shielding, to be equivalent to asking about the state at time 2. Okay, so, just to draw a contrast here.
We just established this probability identity, and also, down at the level of information, the uncertainty conditioned on either F2 or S2 is the same. However, when we talk about the uncertainty in X2 given F2 without fixing it, that means we're averaging over the realizations of F2; this conditional entropy is just the standard definition. But now I can unpack this a little: this is the conditional random variable over S2 and S0, fixed at sigma-2 and sigma-0, and that's not the same thing as the uncertainty in X2 given S2, where we would only condition on the value of S2 and would have suppressed the condition built into F2. So these two quantities needn't be the same, because F2 has this built-in, I guess hidden, condition.

Okay, a final example. G2 is the probability of, or our uncertainty in, S2 given that X2 had a particular value. So, given that we saw this observed symbol, we induce a conditional distribution over the previous state we could have been in, and now I'll ask a question about the probability of the next state; I'm just making these up to show the different constructions. So that's the probability of S3 given G2, with these fixed. Since G2 is being fixed at sigma-2, this first variable is being set to sigma-2. So I rewrite this again: the probability of S3 given this conditional random variable, broken out into the separate events; it's now a joint event. And you can even note, down here, that with S2 fixed and X2 fixed, if this were an epsilon machine, then the probability of going to sigma-3 at time 3 would be zero or one, depending on whether that transition is allowed or disallowed.
Okay Same thing with these, you know, there's the entropic version of this the uncertainty in in s3 given that g2 This conditional random variable the sigma 2 again. That's just simply this Conditional distribution that would be zero if the machine was in the feeler I Generally again just applying the straight definition if I was interested in the uncertainty in s3 Given g2 without it being fixed then I'm averaging over the the conditions realizations This conditional entropy Right the uncertainty in s3 given the different Realizations or possible values of g2 Again this unpack this again. We have this conditional random variable. It's probability We got this above here this number condition on this joint event to sigma 2 x2 and that's generally not the same thing as in certainty in s3 given The state distribution s2 and that we fixed x2 to be a particular value Again, it's because of We have this this hidden condition here. Oh, so right so so these nested conditions I have a certain syntax and meaning to them But you just sort of unpack them till you just have everything be an event and then we're back into the realm of sort of normal probabilities So what does this generally look like if we have this conditional random variable a prime probably of a given that? 
B was little b, then we can talk about the uncertainty in some third event C given that A-prime was equal to a. What we mean by that is that we've fixed this conditional distribution so that the thing being conditioned takes a particular value. This is now just a single number, as it were, and conditioning the random variable on this conditional random variable turns out to be just the uncertainty in C given a and b.

And that's different from not fixing the conditional random variable. Rewriting the definition of A-prime, the uncertainty in C given the conditional event, A given little b: writing that out, we're now averaging over this random variable, although the one component is fixed at b, and then we average over the values that A can take in the conditional random variable, weighting the conditional uncertainty in C accordingly. A third form of this is comparing with the conditional entropy of C given A and B, the joint event; that's different again from the case where we haven't fixed A. What we mean in the definition of this conditional uncertainty is that the random variables that are not fixed are the things we sum over, the things we use to weight the conditional entropy.

So now, getting back to where we're going to use this conditional-random-variable notation: we're interested in the word distribution, and now we're being careful to be clear that it depends on the state distribution as we produce the word. Before, we were just conditioning on events; now we're conditioning on these state distributions when we calculate word probabilities and entropies, and we have to make the dependence on the state explicit. I guess it's natural enough if we think of this as a generator. And if you look at this
expression, it sort of looks like you're doing a state average of something; it looks like an expectation value. So we could think of this as an instance of a random variable, or, if you like, conflate it and think of it like an event. It's like a temperature: it's a number, and we're averaging this number over the states.

Okay, there's a funny semantic shift here. To make it explicit, imagine we define a random variable, Z say, distributed as this conditional distribution. The values this random variable takes are the probabilities that the word is w if you started in A or started in B; they could be two-thirds and one-third, and those would be the values of Z. Then the probability of Z taking a particular value, two-thirds or one-third, is just that probability itself, and it depends only on what state you visit. Everything driving it here is what state you're in: if you're in state A I have this probability; if I'm in state B I have that probability. So in fact this does look like an average: if I average Z over its distribution, averaging over the different values it can take, weighted by the probability that those values occur, I end up back with the original expression.

So this is a way of thinking about how we calculate these word probabilities conditioned on the state: it's as if we're averaging these numbers over these events, over the state probabilities. But, I don't know, maybe this is overly tedious, flogging a dead horse. We could just, by definition, agree to start conditioning on the state distribution here, so that these are averages, over the state probabilities, of the numbers we're interested in.
I want to know from which states I could produce the word, with what probability, and then we weight those by the state probabilities. Okay, the other thing we're going to do is think back to updating the state distribution as we calculate the word distributions. We're going to think of the random variable as actually being a vector of probabilities over the states.

So, for example, for the word probabilities: we can ask, what's the probability of producing a word given that the states, starting at time t, are distributed according to the asymptotic, stationary measure? That's just this formula here. Now imagine we didn't sum it up: it's actually a vector of probabilities, and we can think of it as the random variable we were just averaging over the states. But we can also talk about other distributions besides pi. Imagine it's the uniform distribution. Then the probability of seeing these words starting from the uniform distribution is just the sum, since all the states are equally likely, with a factor of one over the number of states, of these conditional probabilities of having produced the word. Or, in general, we can now talk about the probability of producing words when the states are given by some arbitrary initial state distribution, and we just generalize our expression: left-multiply T^w by mu and then sum up.

So this notation is trying to make explicit what has been implicit before. Right: what's the probability that X takes on the value little x given this state? This singles out the random variable, or we can also be clear about how it's distributed; we can put the distribution in there. So these are probabilities: this is the probability that X1 equals x. That's a number.
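The generalization to an arbitrary initial state distribution mu can be sketched directly (again for the golden mean machine; the helper name `word_prob_mu` is mine):

```python
import numpy as np

T0 = np.array([[0.0, 0.5], [0.0, 0.0]])
T1 = np.array([[0.5, 0.0], [1.0, 0.0]])
Ts = {0: T0, 1: T1}
one = np.ones(2)

def word_prob_mu(word, mu):
    """Pr_mu(word) = mu . T^{w} . 1 for any initial state distribution mu."""
    v = np.asarray(mu, dtype=float)
    for sym in word:
        v = v @ Ts[sym]   # push mu forward through the word's matrices
    return v @ one

pi = np.array([2/3, 1/3])       # stationary start
uniform = np.array([0.5, 0.5])  # uniform start: 1/|S| per state

print(word_prob_mu([0], pi))       # 1/3
print(word_prob_mu([0], uniform))  # 1/4: only state A can emit a zero
```

Starting from pi gives the stationary word probabilities; starting from any other mu gives time-dependent ones, which is exactly the cost mentioned earlier for nonstationary starts.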
It's a number, okay. And we can also work with distributions the same way: we can condition on S1, with S1 given by a particular distribution, and then this is a set of numbers, all the possible probabilities that X can take over its realizations.

It gets a little dicey, though, when we're working with entropies; the notation is a little ambiguous. Typically, what we mean by the uncertainty of the next symbol given the current state is a state average of the uncertainty in the next symbol for each state: the state-averaged branching uncertainty, a conditional entropy. In simplified form, you always put the joint distribution out in front of the log, and the argument of the log is the particular kind of entropy you're looking at, in this case a symbol-given-state conditional entropy. But when we write that same notation down, we could also have meant, thinking now about these conditioned random variables, the entropy of the conditioned random variable: a p-log-p expression applied to the variable X1 = x conditioned on S1 in both places. If you work this through, you see there's an ambiguity. Let's unpack it here.
In the first expression we're averaging over x and also over the states: we pull out the state probability and multiply by that weight the conditional probability of x given sigma. Then we do the same thing with the second expression, but notice that there it turns into the joint distribution in both places, and that's a different expression than the first one. So there are ways, in this new notation, that we have to be careful: we can't unpack things in certain ways; this second way doesn't quite work.

The first reading is the average entropy of the variable conditioned on the state events. The second, down here, is the entropy of the variable for different S1 distributions, and it leads to a confusing alternative: in fact, it's just the entropy of the marginal distribution over x. So we have to be careful with that.

What we're going to do is this: when we write this down, we'll mean the first thing, the average entropy of X1 conditioned on the state events. And when we want to talk about averaging over particular distributions, we'll explicitly put in how the conditioning random variable is distributed. The latter thing, which in the last example reduced to the entropy of the first variable alone, was not what we intended; with the new notation we can be explicit about what we're conditioning on, and of course when mu equals pi, the asymptotic invariant measure, we recover the quantities we were using before.

The important thing is that now we can start looking at how uncertainty changes if we start with different state distributions. In particular, we're going to be updating these distributions as we go along, so we need to be very explicit about which
state distributions we're conditioning on; maybe that's the main point of this whole tortured notational description.

So, for example, the odd process: it generates one, or three, or any odd number of ones, and then a zero; the odd analog of the even process. The asymptotic distribution is two-fifths, two-fifths, one-fifth over A, B, C. We can ask: what's the probability that at time three we see a zero, given that at time three we were in state A and the initial state distribution was pi? What does that mean, given how we've developed this? First of all, saying the random variable is distributed according to pi means we're going to average over it using those weights, times the state probabilities. But, as in a previous example, because of causal shielding this conditional probability of seeing a zero at time three doesn't depend on what the start-state distribution was; it depends only on the immediately preceding state, so we can drop that. We're still averaging, carrying that state distribution forward, but since the probability doesn't depend on the start state sigma, we pull it out, and then we're just summing the distribution over sigma, which is one. So we end up showing that in this case we forget the initial distribution; the answer depends only on the previous state: at time three, in state A, the probability of seeing a zero is one half.

But we can choose a different state distribution and ask a similar question. Imagine mu is defined to be the uniform probability over the states A, B, C, and we want to know the probability of the next symbol at time three, given that at time one the state probabilities were distributed as mu.
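Returning for a second to the two entropy readings from a moment ago, a small sketch makes the ambiguity tangible. This uses the even-process branching probabilities as an assumed example; the two quantities genuinely differ.

```python
import numpy as np

def H(p):
    """Shannon entropy in bits, with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Branching distributions Pr(x | sigma) for the even process (assumed example).
p_x_given_s = np.array([[0.5, 0.5],   # state A: fair branching
                        [0.0, 1.0]])  # state B: emits a 1 for sure
pi = np.array([2/3, 1/3])

# Reading 1 -- the intended one: state-averaged branching uncertainty H[X|S].
h_cond = sum(pi[s] * H(p_x_given_s[s]) for s in range(len(pi)))

# Reading 2 -- entropy of the state-averaged distribution; this is just H[X],
# the entropy of the marginal over x, a different (larger) number.
h_marg = H(pi @ p_x_given_s)
```

Here `h_cond` comes out to 2/3 of a bit, while `h_marg` is the entropy of the marginal (1/3, 2/3), roughly 0.918 bits: averaging the entropies is not the entropy of the average.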
Okay. Well, we know how to calculate this: take mu, push it forward three time steps, using in the last step the symbol-labeled transition matrix for the symbol we're asking about, and then sum up. In the first two steps we can just use T, because all we really need to do is update the state distribution; we're not asking anything about the observed symbols there. So this probability is the same whether we say S1 is distributed as mu, or S2 is distributed as mu T (pushed forward one step), or S3 is distributed as mu T T (pushed forward another step). The last form is easier to work with: I go out to step three, say "that's my current state distribution," and look one step ahead, rather than extending the calculation over three time steps. The only thing that's really important is the relative time difference. Here I specified mu at time one and looked two steps later at the observed symbol; I could have specified mu at time two and looked at time four for X. We can shift that way. But you can't just shift the random variable itself forward when mu is not the asymptotic state distribution. When mu is pi, operating on it with T just gives pi back, so it's stationary; otherwise you can't freely shift the variable around.

So what are the shorthands we have here? If I write down the probability of seeing little x at time t conditioned on the state at time t, that's a set of numbers, one for each state: each is the probability of seeing little x from that state.
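The push-forward step can be sketched in a couple of lines. This again assumes the even process; the point is that pushing with the internal transition matrix T updates the state distribution without reference to symbols, and pi is the unique distribution that shifting leaves alone.

```python
import numpy as np

# Internal state-to-state matrix T = sum over symbols x of T^(x), here for
# the even process (an assumed running example); one step forward is mu @ T.
T = np.array([[0.5, 0.5],
              [1.0, 0.0]])
pi = np.array([2/3, 1/3])
uniform = np.array([0.5, 0.5])

# Only the relative offset matters: specifying mu at time 1 and asking about
# time 3 is the same as specifying mu @ T @ T at time 3.
mu_pushed = uniform @ T @ T
```

Pushing pi forward returns pi (it is the fixed point), while the uniform distribution moves on every step, which is why a non-stationary mu cannot simply be slid along in time.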
Whereas if I don't fix the conditioning random variable, this is a larger set of numbers: the probability of seeing little x for each pair of symbol little x and state sigma. When we talk about what these numbers mean averaged over the state probabilities, that's the expected probability of seeing x. And if we don't fix the conditioned variable at all, what we mean is the expected distribution: up here we had one number for each state, while down here we have a distribution for each state we could be in, and averaging those over the states gives a distribution over x. So there are four different things: sets of conditional probabilities, a conditional probability, an expected probability, and an expected distribution. Hopefully, as we go forward, context will remove any ambiguity. Okay, enough of that.

Here's an example process, or I should say a presentation that generates a process; it happens to be the epsilon machine for that process. There's a fair coin flip from A, and then, no matter what, we come from B or C back to A. It's almost like a period-two thing, except that when we go from C back to A we flip a coin and generate a zero or a one, while coming back from B we generate a zero with probability one.

The way to think about the problem is that we know the model. We took our data in the lab, but then we left for lunch while the experiment was still running. We come back, and we don't know where it is.
So we have the model, and given the model we can calculate our expectations. If we've been away for a while, the process should be in statistical equilibrium, so this vector here is pi, the asymptotic state distribution: one half, one quarter, one quarter. The states are colored in red to indicate the probability magnitudes. This is the state of affairs when you walk back into the lab, having seen nothing yet: a half, a quarter, and a quarter are our expectations if we assume equilibrium.

One helpful way of thinking about this is to look at the time evolution of the state distributions. This particular state distribution is normalized over three states, three variables, so it lies on a simplex: all the probabilities sum to one, lie between zero and one, and fill out this triangle, a two-dimensional surface. The initial state distribution is over here: not much of B, not much of C, and about half of A; a half, a quarter, a quarter.

So now what we'll do is read in a word and talk through how our expectations of which hidden state the process is in get updated. The first step: we see a one. Looking at the model, we know there's no way to get to B on a one, so no probability flows over there. We started with probability one half at A and one quarter at C. On a one, we take half of A's probability, a quarter, and move it over to C; and seeing a one takes half of C's quarter, an eighth, and moves it over to A. So we end up with this partial distribution: a quarter and an eighth.
When you normalize, the probability of state C becomes two-thirds and the probability of state A one-third: an eighth plus a quarter is three-eighths, and dividing that partial distribution by three-eighths gives one third and two thirds. Our mixed state is updated. The mixed state now lies along a sub-simplex, namely the one-dimensional simplex between C and A, because there's no probability of being in B; we're along this edge of the simplex. So we update our mixed state in the simplex, or we can think of it as a distribution on the states.

We see another one, and we get a new mixed state. We had two-thirds at C and one-third at A; half of the one-third is a sixth, and half of the two-thirds is a third, so we have a third and a sixth, which normalizes to two-thirds and one-third again. The red is the current mixed state having seen one-one. Seeing another one just happens to swap these probabilities, but we have to go through the same calculation as before, and we revisit this same mixed state.

Now we see a zero. What happens is that half of whatever probability mass was at A moves over to B, and half of whatever was at C moves over to A. Let me get the numbers right: two-thirds at A, half of that is a third, ends up at B; one-third at C, half of that is a sixth, ends up at A. A third and a sixth again normalize, having seen the zero, to two-thirds and one-third.

Now we see another zero, and having seen that, our expectation is 50-50 for A and B. You go through the calculation again, and now we're down here: we can't be in C.
So now we're in the sub-simplex between A and B, at exactly 50-50. On the next symbol (I'm just choosing allowed symbols as I like, to illustrate things), we see a one. Now A and B both had positive probability, but nothing leaves B on a one, so B doesn't participate; we had this probability up at A, and we take half of it down to C. Since C is then the only state with probability, normalizing gives what you'd call a delta function on C: the mixed state moves up to one of the vertices.

So this gives the basic visual picture of what it means to finally have synchronized: having seen that sequence, starting from state ignorance, we now know exactly what state we're in. We call this a synchronizing word. There are words that will take you there, if you have the right kind of presentation; synchronization is a property of the model itself, and we'll talk about some examples that don't synchronize. And because this particular presentation is unifilar (it happens to be the epsilon machine, the minimal unifilar presentation), as I read more allowed symbols I stay in the state of being synchronized; I never lose it again. That's one of the nice properties of unifilar presentations. The dynamic on the simplex is now just hopping between the vertices, stochastically, as I choose among the allowed transitions: over to B after I see a zero, back to A if I see another zero, and so on. We're hopping around stochastically on the simplex of state distributions.

So that's the main idea. I made an assumption that we start in the asymptotic state distribution pi, and we're just trying to answer the question: how do we update our knowledge of the state?
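The synchronization walk-through above can be sketched numerically. The three-state machine is reconstructed from the description (A flips a fair coin, with 0 going to B and 1 going to C; B returns to A emitting a 0 with probability one; C returns to A emitting a fair 0 or 1); the exact symbol assignments are my inference from the walk-through, so treat them as an assumption.

```python
import numpy as np

# Labeled matrices T[x][i, j] = Pr(emit x, go to j | in i), states (A, B, C).
T = {
    0: np.array([[0.0, 0.5, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.5, 0.0, 0.0]]),
    1: np.array([[0.0, 0.0, 0.5],
                 [0.0, 0.0, 0.0],
                 [0.5, 0.0, 0.0]]),
}
pi = np.array([0.5, 0.25, 0.25])   # stationary distribution over (A, B, C)

def update(mu, x):
    """One mixed-state step: push mu through T^(x), then renormalize."""
    v = mu @ T[x]
    return v / v.sum()

mu = pi.copy()
for x in [1, 1, 0, 0, 1]:          # the word read in the walk-through
    mu = update(mu, x)
# mu is now the delta-function distribution on C: we have synchronized.
```

The first update reproduces the (1/3, 0, 2/3) mixed state from the lecture, and the full word drives the distribution to the vertex at C.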
Okay, maybe that's more intuitive than all the conditional-random-variable machinery, but that machinery is what you actually use to calculate these things. So what are these mixed states? In the simplex we had these various dots; now we're thinking of each as a random variable, or as a point in the simplex that's hopping around: two different representations describing the evolution of our knowledge, or uncertainty, about the internal state as we make measurements. We now have an induced dynamic over a new kind of state space; really, a state-distribution space, the simplex. That's what we want to track with these mixed states, and that's basically what they are: state distributions induced by having seen a word.

So let's look at words of length L. Our notation will be: at time t, having seen the null word lambda, we're in the mixed state mu. We'll build up this more compact notation for the mixed state at time t having seen some symbol or word. So this is where we start at time t: some state distribution mu, having seen nothing; that's what the lambda means. Again, this is a random variable; well, it's a state distribution; well, it's a random variable. I think we're comfortable with that now, but there is some formal trickery here. So: what's the probability, at time t plus L,
of being in state sigma, given that we observed word w? Writing this in more familiar notation, and this is where all the previous formal rigmarole pays off, what we mean is the probability that the state at time t plus L is sigma, given that we saw word w in the block from time t to t plus L, having started with the initial distribution, the mixed state of having seen nothing. This is the general way we're going to push distributions forward based on observed sequences.

So these mixed states really are random variables; they're hopping around, driven by a random variable naming the word you're observing, and they're conditioned on two things: the word and the initial mixed state. Typically, what we've been doing is setting the time to zero and taking the initial distribution to be the asymptotic distribution, but as we just saw, there are other questions, where we use different mus and push them forward.

The simplex picture is based on thinking of the mixed states, these distributions, as vectors in some vector space: with three states, it was a two-dimensional sheet. And we can see how to calculate these state distributions by simply writing out what we had before: the state we're in at time t plus L, given that we saw a length-L word and started with this distribution. That's a conditional distribution.
I can write it as a joint distribution over a marginal; that's just a probability identity. And we know how to calculate both of these from our familiar formulas. So this is how we directly calculate, starting from a given mixed state at time t and going forward; it's basically what I was talking through in the example. The last normalization step is the denominator: the numerator is a kind of partial distribution, where I've just pushed the state probabilities around on the model, and then I make it a normalized distribution. So now I know how to push mixed states forward. What we've been doing, of course, is taking the asymptotic distribution pi at time zero, which gives this simpler expression: starting from the asymptotic state distribution, we can calculate the mixed state having seen a word of length L this way.

For the even process, to focus in a little, we essentially did almost the same example numerically, but now we have a two-state process over triangles and squares: the even-triangle process. We start the initial mixed state off at the asymptotic state distribution, two-thirds at A and one-third at B, and the question is: what's the mixed state at time step one, having seen a triangle? That's the state distribution conditioned on having seen that past, namely a triangle, while being explicit that the initial state distribution at time zero is pi. So we push pi forward using T-triangle and normalize.
When you push it forward: we have two-thirds and one-third. Having seen a triangle, half of the two-thirds, a third, moves over; and the one-third moves with probability one, so we take all of it over to the other state. So we get a third and a third: that's the partial distribution vector. Summing it up gives two-thirds, we use that to normalize, and we get a half and a half. Initially we started out with the assumption that it's been running for a long time, it's in statistical equilibrium, most likely in state A; as soon as I see a triangle, I conclude it's equally likely to be in either state. So my state uncertainty can change over time, until, in this case, we hit a synchronizing word, at which point I'd get the delta function and there'd be no more state uncertainty.

So we're now thinking about a new kind of dynamical system over these state spaces, these simplicial state spaces. They're just vector spaces; the vectors are probability distributions, normalized, and we think of them as living on a simplex. If we had two states, there'd be a one-dimensional simplex: the probability of A versus the probability of B, a one-dimensional space along which those probabilities lie between zero and one and sum to one. We have three states.
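The triangle update just computed takes only a few lines. Triangle is encoded here as 1 and square as 0, which is my encoding, not the lecture's.

```python
import numpy as np

# Even process: T[x][i, j] = Pr(emit x, go to j | in i), states = (A, B).
T = {
    0: np.array([[0.5, 0.0], [0.0, 0.0]]),   # square: A -> A
    1: np.array([[0.0, 0.5], [1.0, 0.0]]),   # triangle: A -> B, or B -> A
}
pi = np.array([2/3, 1/3])

v = pi @ T[1]          # partial distribution after seeing one triangle
mu1 = v / v.sum()      # normalize: the new mixed state
```

The partial vector comes out as (1/3, 1/3), its sum is two-thirds, and normalizing gives the (1/2, 1/2) mixed state from the lecture.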
That's the surface we were talking about before. Over the surface there are notable points: the vertices are what I've called the delta-function distributions, where we know exactly what state we're in; the center point is always the state of maximum uncertainty; and all the other points, including the boundaries, are allowed. Arbitrary distributions over three states are just points here. But as we saw before, when we synchronize, or even before we synchronize, we can synchronize to a subspace: distributions where you know you're not in some state, which collapses the dimension you're currently operating in.

So what are these things? They're just the uncertainty in the state given some starting distribution and a word you've observed. We can look at the mixed-state entropy as a way of detecting whether we're in some state with probability one; this is our informational criterion for being synchronized. When the state distribution is all zeros except for a single one, we know what state we're in, so we think of the mixed states with zero entropy, the vertices of the simplex, as the so-called pure states. All the other possible state distributions are spanned by those pure-state vectors, and hence are called mixtures of the pure states: we can write an arbitrary state distribution in terms of the pure states. So it's very handy to look at this geometrically.

Also, and this gets a little into the induced dynamic over this state space: the induced dynamic is actually many-to-one. In particular, different words can lead to the same mixed state, like we saw before.
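The informational criterion for synchronization just described is simple to state as code: the mixed-state entropy vanishes exactly on the pure (delta-function) states. A minimal sketch, with names of my choosing:

```python
import numpy as np

def mixed_state_entropy(mu):
    """H[mu] in bits; zero exactly when mu is a pure (delta-function) state."""
    p = np.asarray(mu, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def is_synchronized(mu, tol=1e-12):
    """Synchronized when the mixed-state entropy vanishes (up to tolerance)."""
    return mixed_state_entropy(mu) < tol
```

A delta function on any vertex passes the test; any genuine mixture, like the equilibrium distribution of the lab example, fails it.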
We had two-thirds and one-third, and after one step, seeing a triangle, we had two-thirds and one-third here; if I see two more triangles, it just permutes and permutes again. So either the one-triangle word or the three-triangle word leads to that mixed state. Under this mapping on the simplex, going forward, a given mixed state can have any number, a finite number, of pre-images.

So the basic idea is that there's a certain natural dynamic over words, namely concatenation: I go from a word and append something to make a new word. And we now have an equivalence relation between words: where they end up when we look on the simplex. Two different words can end up at the same mixed state, and we think of the mixed state as a state of state-uncertainty; that's the hierarchical picture here. Then there's the dynamic induced over words: we have concatenation, and for each word we can ask, given a specified initial mixed state, what mixed state do I go to? A word induces a mixed state, and the way this diagram commutes, there is a mapping from mixed state to mixed state on each symbol. Interestingly, this representation of the mixed-state dynamic is unifilar, which means, again, that given the mixed state and the symbol, I go to a unique next state. There may be many pre-images, but I always know exactly what state I'm going to, given the previous one and the symbol.

Okay, so that was a lot of work. [Question from the audience.] Yes, yes; in fact, I wasn't really distinguishing recurrent mixed states from transient mixed states, which we will next time. Yes, good leading question.
Yeah, right. Exactly. So what we've done here: in all the previous examples I'd been talking as if we had a probabilistic finite-state machine, and we talked about being in this state or that state; well, sometimes we're not certain, and that came up a bunch. Now we've shifted up to this other picture, where we have a dynamic over these mixed states and we're tracking, dynamically, how our uncertainty about the states shifts based on what we've observed. There are analogs of being synchronized, of being in a recurrent state of the epsilon machine, at the vertices of the simplex, where we have those delta-function mixed states. And the corresponding questions we had before for the epsilon machine, about transient states and recurrent states, come up here too. We'll actually see, for example, that we can work with the transient mixed states quite naturally, and it leads to a nice, simple, closed-form expression for the synchronization information: the total uncertainty accumulated as we synchronize. That comes from working with just the transient mixed states. Along the way, and this harkens back to when we were talking about the computational complexity of calculating the word distribution, the mixed states also give us a very efficient way of calculating block entropies and so on.

Okay, so that was a lot, or dense, I should say. I'll finish unless there are questions. We'll come back Thursday; it will be more application-driven, and you'll see some of this, and why I had to bend over backwards to worry about the notation.