So let's get started. The first thing I wanted to point out is this, so that you don't spend a whole lot of time fussing with graphics coding. I provide a little lab here, just some code snippets, for the first part of today's homework, homework 10. It asks you to work through the four-variable information diagram and draw things. Well, the drawing part is not hard; it's just drawing ellipses and having them overlap the right way. So what I did is put some of that code up here for you, and you can go grab this little lab. When you're plotting in Sage, we mostly use this Python library — there are a couple of alternatives, but you use matplotlib. It looks very much like Matlab, or so I'm told. Anyway, for the information diagrams you have to plot a bunch of ellipses. So here's some code; actually, what I did is steal one of the examples from the matplotlib.org site, and, no surprise, there are ellipse functions already built in. So most of this exercise is just looking at documentation and fussing around. Unfortunately, I happen to like this kind of fussing too much. You would hunt around on the website and find Ellipse; this is how you set it up. This code is slightly augmented from the example, just to run more directly and simply here. But then, seeing as how I like to fuss, I modified that. The original just throws up several hundred ellipses; I modified it so that there are just four, so that you have at least the background diagram. This is essentially the four-variable diagram I showed in the lecture notes. Four variables, so there are two to the four minus one atoms the way we do this. And you have to check these things — just putting up four ellipses and making them overlap somehow is not right; there's a particular way they have to do that. So I built that in here, including some code that lets you label the different atoms; that's part of the exercise. And so, let's see: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen. Two to the four minus one atoms. Each one of those is a particular information measure, some sort of joint or conditional mutual information. So if this is H[Y] here, this is H[X], H[Z], and H[W]; where they all overlap, this is the four-way mutual information, I[X; Y; Z; W]. This is the variable Y where I've hacked out everything: so this is H[Y] — the uncertainty of Y — given... I use an exclamation point because my font rendering here turns the conditioning bar into a dash, so the exclamation point means "bar": conditioned on X, Z, and W, which means I take out those pieces from this Y, I take out everything shared with the other variables. So this piece right up here is H[Y | X, Z, W], and so on. The whole point of this is that these different information measures — when we first started talking about information theory last quarter, we noticed this parallel between set theory, how different subsets of events relate to each other, and these measures. Of course, the measures are nice because they're scalars; we don't have to carry around a probability distribution, they're single numbers. So this describes the set-theoretic structure over four different random variables, their event structure, and also the measures on top of it. Those are the Shannon information measures. And then the second part of homework ten problem one — part D or something like that — is to show which of these atoms are zero because X, Y, Z, and W form a Markov chain.
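If it helps to see the shape of that helper code, here is a minimal sketch of the ellipse-drawing idea using matplotlib's Ellipse patch. The centers, sizes, and angles below are illustrative placeholders of my own, not the values from the posted lab, and you would still need to nudge them until all fifteen atoms actually appear:

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

# Four overlapping ellipses for an X, Y, Z, W information diagram.
# The geometry here is illustrative; adjust centers/angles until all
# 2^4 - 1 = 15 atoms are visible.
fig, ax = plt.subplots(figsize=(6, 6))
params = [  # (center, width, height, angle, label)
    ((-0.5,  0.0), 3.0, 1.6,  45, "X"),
    (( 0.5,  0.0), 3.0, 1.6, 135, "Y"),
    ((-0.5, -0.5), 3.0, 1.6,  45, "Z"),
    (( 0.5, -0.5), 3.0, 1.6, 135, "W"),
]
for center, w, h, angle, label in params:
    ax.add_patch(Ellipse(xy=center, width=w, height=h, angle=angle,
                         fill=False, lw=2))
    ax.annotate(label, xy=center, ha="center")

ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)
ax.set_aspect("equal")
ax.axis("off")
plt.show()
```

And for the Markov-chain part just mentioned, the shielding property can be stated compactly; this is the standard pair of conditional independences, not the worked answer of which atoms vanish:

```latex
% Shielding in the Markov chain X -> Y -> Z -> W:
% Y screens X off from everything downstream, and Z screens W off
% from everything upstream.
\begin{align}
  I[X ; Z, W \mid Y] &= 0, \\
  I[X, Y ; W \mid Z] &= 0.
\end{align}
% Part of the homework is working out which diagram atoms these
% constraints force to zero.
```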
In a Markov chain, the intervening variables shield, which means that various mutual informations are zero — the correlation is broken by the shielding — and that means these various atoms go to zero. In the lecture notes, I talked about how that works for a three-variable Markov chain: I put the diagram down and then showed which atoms are zero. And the graphical trick here is basically that you take these ellipses and figure out — you have to do it by hand — which of these are zero. There's a graphical information diagram for Markov chains in the lecture notes for three variables. But basically what you do graphically is take these ellipses, turn them all vertical, set them down here, and then clip them to the plotting frame. In fact, maybe I should just show it to you there. What I showed you in the lecture notes was a three-variable X, Y, Z version of this; you just add on to it. All I did graphically — the graphical programming — is modify basically five lines in the full four-variable graphic, so it's not hard. You just orient these things downward, I shift them down, and then the plotting package clips them to here. So I get rid of all the intersections down here, and then you can go through and label the different atoms — the atoms that are still positive after you've assumed this shielding property in the four-variable Markov chain. Okay, so just a little bit of helper code there, not too complicated. I just didn't want you to spend a whole bunch of time getting lost in the documentation pages and doing graphics programming rather than thinking about the information theory. So there you go. Okay, so today I want to bracket the Thursday lecture. The Thursday lecture was maybe philosophy-heavy and definition-heavy. Like I said, it was the most important one, but the importance really remains to be demonstrated — take my word for it, that was the biggest lecture. What I want to do is first review where we got to, and then go through a bunch of examples of what I call epsilon-machine reconstruction: how do you go from the specification of a process — the specification of the word distribution — to discovering what the hidden states and transition structure are? That's what we did with the prediction game, but we all did it intuitively. Today I want to show you there's nothing intuitive about it; it's completely mechanistic. There's a way of going from the process specification, the word distribution, to finding the hidden states — in some sense, finding the intrinsic representation. So that's today. It might even be kind of a short lecture, depending on how straightforward the examples are that I present. By way of review: the end point was this thing I call the epsilon machine — for historical purposes they're called epsilon machines. It's a particular kind of hidden Markov model that is unifilar; we know what that is. But I have to convince you of that, because what we did on Thursday is I said what we're really interested in is just prediction, and I went through a long series of constructions and steps in an argument that ended up with this thing here: this predictive, or causal, equivalence relation. And it was motivated by — well, maybe I over-explained it on Thursday — a really simple idea. We're trying to predict this process, and we make this assumption, or ansatz, that the effective states of the process are groups of histories, each one of which leads to the same prediction.
So in other words, we don't make distinctions between histories if, having seen them, they lead us to the same prediction. Are we going to get more into prediction later? Yes. Right. Maybe I'm wrong there. No, no, that's a fair point — it typically is. So basically the idea is that your partition would be dependent on your prediction. Well, okay, that was the way the Thursday lecture started out: I allowed us to make any assumption. "All I'm interested in from the golden mean process is predicting the number of ones" — that's a particular task I set myself, and it's obviously a subjective choice. There are good and bad ways of doing that, and that was our candidate scheme R, which, as soon as I state it, effectively induces a partition of the space of histories: it groups them in certain ways. What I still have to prove to you is that this predictive equivalence relation is the way — with a capital T-H-E — to do optimal prediction. Today we're going to do examples, so we get some idea of the power and the consequences of assuming this relation. But then I have to come back Thursday, maybe next week, and prove to you that the resulting representation, the epsilon machine, is an optimal predictor, that it's of minimal size, and that it lets us calculate basically everything we're interested in information-theoretically, or even come up with good prediction algorithms. It's a pretty outrageous claim: namely that this thing — again, given the specification of the process; that part is up to you, you've got an experiment or a mathematical model and you have to come up with a word distribution — but once you give me that, the rest just follows. I apply this equivalence relation and things happen. I'll show you mechanically how things happen, how we discover the hidden states. But it all starts from here, and I have to prove the properties I was arguing for on Thursday: that this actually does prediction in an optimal sense. Last Thursday my notion of prediction was very general: all I wanted back were these future morphs, just distributions over futures. And in fact that's how this is constructed, so I'll just state it again. We have our space of histories, all these different histories, and we group two different histories into the same class when, conditioned on those particular histories — s and s-prime, lowercase meaning realizations; we've seen these two histories — we do the best we can to predict the future, and those future distributions are the same. Then we say the process is in the same causal state. It's a bit of a mouthful, but it really is as simple as: don't make distinctions between histories that are predictively equivalent. Why make a distinction? You can if you want — you'll just end up with more partition elements and a larger model — but you don't have to. And then basically everything is going to follow from this, though I will have to prove that to you, with a series of, I think, constructive proofs. Okay, so this is the base, and all the motivation last Thursday was to have this make some sense. But it's a pretty simple idea. Okay, so then, in terms of terminology: we start with the space of pasts, and we develop — I should say first the causal states; these are sets of pasts that are equivalent under this equivalence relation.
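Written out in the standard computational-mechanics notation (the over-arrows denote the semi-infinite past and future; this is just the relation described above, restated symbolically):

```latex
% Predictive (causal) equivalence over histories: two pasts are lumped
% together exactly when they induce the same distribution over futures.
\begin{equation}
  \overleftarrow{s} \;\sim_{\epsilon}\; \overleftarrow{s}\,'
  \quad\Longleftrightarrow\quad
  \Pr\bigl(\overrightarrow{S} \mid \overleftarrow{S} = \overleftarrow{s}\bigr)
  \;=\;
  \Pr\bigl(\overrightarrow{S} \mid \overleftarrow{S} = \overleftarrow{s}\,'\bigr).
\end{equation}
% The causal states are the equivalence classes of this relation, and the
% epsilon map sends each history to the class containing it.
```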
And then we also have this series of future morphs: given that I know what causal state I'm in, I have the distribution over all the future sequences that could occur. And we think of the causal states as these groups of histories. A causal state has several things attached to it. One is that it's just a set of histories, and different sets of histories lead to different predictions. There's the whole set, which we can write compactly as the original space we're starting with, and then, in algebraic notation, we mod out by this equivalence relation; the resulting set is a set of these states, or sets of histories. Okay. There's another way to describe the induced partition over the space of pasts: with this epsilon map, which is just a simple lookup function. When I plug in a particular history, it returns the set, or causal state — or, if you like, the name: state seven. So we have a functional representation of this. And then of course, like I said, attached to each state is some view of the future. Yeah — is there a reason why down below you have the L on top? Oh. I mean, are we going to get into why you're considering only finite lengths and what you're considering? Yes. In fact, I'll kind of do that today with the examples. Here I was just grabbing things to make a summary slide, so maybe it was a little bit, yeah. Right. You certainly do, practically; you'll see that today when we do this reconstruction process. Okay, so then we went from the raw set of histories, grouped them together so they're predictively equivalent, and then there's the set of morphs attached to each state — the set of predictions, the different predictions we make. Once we have those states, we can then go through and look at: if I'm in state seven and I see a one, what's the probability I go to state 12? I can go through that. I call that causal state filtering. If I have a series of measurements, at each moment in time I can stop; I've seen some history, I apply the epsilon function, and it says, oh, you're in state seven. I've seen something else, I have a new history — oh, you're in state 12, and so on. So I go from a raw data string, applying the epsilon function, to this causal filtering. Then I have a process over causal states, and I can figure out what the transition probabilities are between those states. Okay, so that's, again, the epsilon machine: a set of states and some set of transitions. So it's a kind of Markov — hidden Markov — model. We have a measurement alphabet, and then this set of states — they're different things — and then this transition structure over it. And I have to tell you what the properties of this are. For example, we have to prove that this model, this particular equivalence-relation-induced model, actually is unifilar. That's not obvious; in fact, it becomes an interesting property to deal with when you're doing actual reconstruction estimation. There's a unique start state, and the way to think about that is: I haven't made any measurements yet. Formally, it's the equivalence class — remember, for equivalence classes we use square brackets — of having seen nothing yet. The other way you can think about it is that the start state corresponds to starting with all the probability in this state here. And this example we'll come back to again: we do the even process — 1, 1, any number of zeros, 1, 1 — and we have recurrent states that are induced.
You rattle around here for a while, but as soon as you transition out of that set you never go back; and then there will be a recurrent component where, asymptotically, after a long time, we just keep rattling around in here. So recurrent states get induced by the equivalence relation too, and we think of them in terms of states. So: four causal states here. All the edges are labeled with a symbol and a transition probability — all of that from the word distribution. The number of states, which transitions there are, how they're labeled, and what the transition probabilities are: that's all calculated from the word distribution. I'll show you how to do that in various cases. Okay, so that was the end result. Again, I have to prove various properties about it. But let's first just think about — and doing some examples helps us think about — what this predictive equivalence relation means, and also what kinds of properties we're learning. So I call any process — any procedure, I should say — that goes from a specification of the word distribution of a process to an epsilon machine by applying the equivalence relation: I call that reconstruction. A lot of the time that's analytical. We'll maybe go through some examples from statistical mechanics, various kinds of spin systems: you write down a Hamiltonian that describes the interaction between the spins, and then you can derive how many causal states there are. You can look at systems going through phase transitions and talk about critical exponents and all that, if you're familiar with statistical physics and critical phenomena. So there's an analytical approach; I still call it reconstruction, but it's the analytical calculation of the causal states and transition structure. This is maybe the vocabulary I use the most, although half the examples will be analytical — I'll talk about it as if it were some sort of finite-sample estimation of a machine, and we'll talk about finite-sample fluctuations. Today, though, we're going to assume we have the exact description of the word distribution. So it's up to you to come up with the word distribution, and then we turn the crank. And there are a number of different algorithms at this point — different ways of implementing the estimation of the causal states and transition structure. On the one hand, Thursday was the mathematical theory behind this; next Thursday and next Tuesday will be more proofs in the mathematical case, where we're assuming exact word distributions, an exact description of the process. And then, whenever you look at real data — finite samples, with noise and all these other limiting properties — different algorithms, different implementations of the mathematical ideas, take different forms and make different assumptions about the data and the source. We'll talk about that. Today I'm going to give you kind of a cartoon version of what's called subtree reconstruction, or I could almost call it morph reconstruction; "subtree" here refers to the tree of futures. These methods apply to both temporal data and also space-time data. So, if there's a request, we'll go back to the cellular automata case and modify the causal equivalence relation to apply to space-time data — to patches in space-time: not just time, where we have histories, but space-time, where we have light cones of dependence. I can show you how to extend it there. Causal state splitting — well, how to say this? Subtree reconstruction: it's like every data point is a possible state, and then I group things together.
Sometimes you call it subtree merging; that's the opposite one. There's another algorithm, called causal-state-splitting reconstruction, where you assume the data coming to you is an IID process with no memory, and as you look at more data you look for statistical justification for adding more states — inferring more causal states. So there we split states. Here, every data point — every word — starts out as a separate state, and we merge: a high-complexity model gets smaller and smaller in subtree merging. Causal state splitting starts with a single biased coin, or multinomial process, and splits and builds from below; the model gets more complex. So empirically these kind of bracket the truth: they should converge in machine size — model size — to the truth, one from above and one from below. Spectral reconstruction is just a different kind of thing: it goes from a power spectrum — a frequency spectrum — to an epsilon machine. We've been using it to study the structure of complex materials using diffraction spectra, X-ray diffraction spectra. There's another approach we call optimal causal inference. It's related to the method called the information bottleneck, and it's more related to what Shannon introduced as rate-distortion theory. It's a nice way of looking at how model complexity trades off against a desired approximation level: I might have a thousand-state model, but I don't want to work with that; I'm willing to give up 5% prediction error if the result is five states — that's a huge win. So that's called optimal causal inference; again, it applies to time or space-time data. And then more recently we're working on something we call enumerative Bayesian inference. This is sort of the most straightforward: we have a way of going through and exactly enumerating all of the epsilon machines up to some number of states. Right now — this is kind of an algorithmic challenge — we're up to eight causal states; there's something like 44 billion of them. We actually ran a machine and calculated all these things. For the Bayesian method, we'll maybe have Chris Strelay, my postdoc, come in and talk about Bayesian inference generally, but also about how you can apply it to figure out, from a given sample, which of the candidate machines in this enumerated library is the best fit. It's a very direct application of what's called Bayesian inference. Anyway, the point here is that there are many, many different ways of doing this, and we can pick some of them up later on as time allows. But for now I just want to give you a flavor of it using the subtree-merging approach. So what I'll do is go over the steps, and then we'll go through some examples explicitly. Okay, so we're going to start with the word distribution. This is the input to subtree-merging reconstruction: you have to give me this, and it has to be accurate — we're going to assume that. What we're going to do is form something called a parse tree; it's basically all the words of length D laid out on a tree. Then we use that data structure to form estimates of, or approximations of, the future morphs conditioned on different pasts. Once we figure out the number of distinct future morphs, those are going to be in one-to-one correspondence with the causal states. We then go back and look at which nodes in this parse tree make different predictions, and basically name them. We can then get the state-to-state transitions from that, and then we're done.
So the number of statistically distinct morphs — that gives us the causal states. And then we can go back and get the state-to-state transition structure from the tree. In this particular algorithm we have three parameters, like all algorithms. There's another one, of course, which would just be how long the data sample is, but I'm assuming you're going to give me the exact word distribution. So we have D, the depth of this parse tree; the number of steps we're looking into the future; and the number of steps we're conditioning on in the past. Three parameters. Okay, so how does this work? Again, keep in your mind what we're trying to do: the basis is just a direct implementation of the causal equivalence relation. We're comparing future morphs, so we have to make some choice about how we're going to do that comparison — over what length of futures and pasts. Okay, so here's an example. If I have a sample — a string of length M, like this — what I'm going to do is first lay out a binary tree, for a binary alphabet, to some depth. In this case I'm choosing D equal to 5; that's a parameter. Then I'm going to look at all the words of length 5 here. The number of instances of those in a data sample of length M is M minus D (M minus D plus one, to be exact); M tends to be large, so the difference doesn't matter. And in this particular way of doing the reconstruction, the history lengths we're going to use are 0, 1, 2, or 3 — basically any possible length. So we put down our tree here. The top tree node is the start node; you'll see what that means in just a second. And what we're going to do is just go through: we have our window of D equal to 5, and we sweep it through the data. For each length-5 word we see, we put in a path; if we never see a particular word, we basically take that path out. So if I see 0, 1, 0, 1, 0, I put in 0, 1, 0, 1, 0. Move one step forward; I have a new length-5 word; put that in, starting at the top tree node. So here: 1, 0, 1, 0, 1 — put in 1, 0, 1, 0, 1. Then 0, 1, 0, 1, 1 — put in 0, 1, 0, 1, 1; that was similar to a previous word except for the last symbol, so just the last little leaf was added. And as we're doing this, there's a little counter sitting at each node, even the start node: every time we hit a node, we increment it. What we're doing, in a sense, is just keeping track of the number of words that lead to a given node. So if this says 13 after I'm done, that means I've seen the word 0 thirteen times. Just that simple. All we're doing here is building, call it, a tree or hierarchical representation of the word distribution. These are the words of length 1 — 0 and 1. The words of length 2 and their counts are down here — namely 00, 01, 10, 11 — and so on; words of length 3, words of length 4. That's all we're doing; this is very straightforward. I hope you're thinking, "I can code that up." And then, of course, if we had lots of data and were feeling confident, we could also estimate the probability of the word 0, 0 by just taking the node count and dividing it by the total number of samples we had. That's our empirical, frequentist estimate of each word; we can replace the counts with the probabilities. Again, I'm assuming we have the exact word distribution, so I can do that directly.
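To make the bookkeeping concrete, here is a minimal sketch of that parse-tree step in Python. It assumes a binary string and stores the tree as a dictionary keyed by words; the function and variable names are mine, not from any posted code, and with an exact word distribution you would fill in the probabilities directly rather than estimating them from counts:

```python
from collections import defaultdict

def parse_tree_counts(data: str, depth: int) -> dict[str, int]:
    """Count every prefix (tree node) of every length-`depth` window.

    Node '' is the top (start) node; node '01' holds the number of
    windows that begin with 0 then 1, and so on.
    """
    counts = defaultdict(int)
    for i in range(len(data) - depth + 1):
        word = data[i:i + depth]
        for l in range(depth + 1):          # '', w[:1], ..., w[:depth]
            counts[word[:l]] += 1
    return dict(counts)

def node_probabilities(counts: dict[str, int]) -> dict[str, float]:
    """Frequentist estimate Pr(word) = count(word) / count('')."""
    total = counts[""]
    return {w: c / total for w, c in counts.items()}

# Example with a short illustrative sample string.
counts = parse_tree_counts("0101101011010110", depth=5)
probs = node_probabilities(counts)
```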
But you can see here that if you actually were to do this step by step with finite data, there would be some issue about how good an estimate of the word probability is. I'm assuming we have it exactly, so that's easy to work with. So here, now, the first step: we've gone from the process description, the word distribution, to this hierarchical picture of the word distribution, where each node, in a sense, is associated with the probability of the word that leads to that node. The probability of 0, 1 in that example: we have this path, the word 0, 1 leads to here, and I store the probability there. In other words, each one of these nodes is just some marginal of this length-D word distribution. So we just build that up; that's the first step. Now the second step is that we actually want to figure out the conditional transition probabilities between nodes on the tree. In other words, if I see word W, I have that probability — I just look up W — and then I see a new symbol, and now I have a new word. So I see W, and I see a new symbol, and that takes me to a new tree node. But what I'm interested in is the relative probability: if I'm in node N, what's the probability of going to node N-prime and seeing S? So I'm taking these absolute word probabilities in the tree structure and recalculating them to be local transition probabilities on the tree. What I'm interested in is the probability of going from N to N-prime, and that's basically just the ratio of the word probabilities — the words that lead to N-prime and to N. Or, the other way to think about it: since W-prime is WS — seen W, then see S — this ratio is just the conditional probability of seeing a zero or a one given the word, the history word. I've seen some word; what's the probability of seeing a one or seeing a zero? So that's how we calculate it from the previous tree; it's simple. Before, we had the probability of W and the probability of W-prime, and we're just taking the ratio of those probabilities, and that is then a node-conditioned transition probability. So if you go through and do that — and this is just an example; we'll go through something like this in just a bit, but just to show you what the next step is — what I've done is change the absolute-word-probability tree into a tree that is labeled with these node-to-node transition probabilities. I've just gone through and calculated the ratio, word here over word here, and so on, all the way down. So every link between two nodes now carries a symbol, zero or one, and a probability of seeing it; and I've just filled in the numbers to illustrate that the next step is to find subtrees. What I'm going to do is look at pasts of length one and future morphs of length two — in other words, going two steps ahead. So I'm trying to find all the conditional distributions over the next two symbols given I've seen one symbol in the past. Here's one example calculation, highlighted in red and green: red is the history, so I'm going to ask what the probability of the next two symbols is — there are four of those; I can see 00, 01, 10, or 11 — given that I saw 0, specifically. And the point here is to think about what this looks like on the tree.
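The node-to-node step is just that ratio applied to every edge of the tree. A small sketch, continuing from the node probabilities above (again, the names are mine):

```python
def edge_probabilities(probs: dict[str, float],
                       alphabet=("0", "1")) -> dict[tuple[str, str], float]:
    """Pr(next symbol s | word w) = Pr(ws) / Pr(w), for every tree edge."""
    edges = {}
    for w, p_w in probs.items():
        for s in alphabet:
            ws = w + s
            if ws in probs and p_w > 0:
                edges[(w, s)] = probs[ws] / p_w
    return edges

edges = edge_probabilities(probs)
```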
I've seen this one, so I'm in this tree node, and then, given that, I want to calculate the probability of seeing 00, 01, 10, and 11. Well, how do I do that? It's pretty simple. Given that I'm here, the probability of seeing 0, 0 is just the product of those two transition probabilities: four ninths. So given that I saw 0, my prediction that I'm going to see 0, 0 is four ninths, 0, 1 would be two ninths, and so on; I just wrote these out here. So this is one of the morphs: conditioned on a past of length one, it gives me a certain view of what's going to happen in the future, two steps ahead. These are the morphs. Okay, so the intermediate punch line is that we were after these future distributions, and those are just subtrees somewhere on this big parse tree; we're trying to find all the probabilistically distinct subtrees, calculated this way. Yeah? So is it necessarily true that L plus K has to be less than D? Well, okay, the problem is that you'll bottom out down here, right. So yes, there are some trade-offs; there are constraints. In fact, I said before that we have these three parameters for subtree-merging reconstruction, and in fact they're related to each other. And exactly how you handle that in the finite-data case takes a little more work, which maybe we'll have some time to talk about. But here I just want you to think graphically. I've made this move from what seemed to be a very formal equivalence-relation definition to something very concrete: we're just looking at subtrees, and it's just simple transformations of the original word distribution into these node-to-node transition probabilities. We just go calculate all of these, and I've made the assumption, just to keep things simple, that we're going two steps into the future and looking one step into the past. Yeah? If our data is not binary — it's not 0s and 1s — can you use the same process, where the tree just looks graphically thicker? Yes, right. The branching here: if it were a five-letter alphabet, I'd have five links coming out — A, B, C, D, E — which means these trees, if you think of them as a data structure, a practical limitation, can get out of control. But that's fine. I'm understanding the challenges in my project better. Yes, good — ah, yes, as soon as you have a larger alphabet. Again, another case: let's look over here. I'm conditioning on a past of length one, and then I look at the probabilities of 00, 01, 10, 11 — those are slightly different numbers now. Okay, so we're calculating this morph, and if you look at it — here is the future two steps ahead, and we have these probabilities attached to those four sequences — notice that these four probabilities, the ones I get if I've seen a 1, are different from these. In other words, the two morphs are different: this morph is different from this morph. They're different predictions, in the way I'm using the word prediction. Okay, so let's just assume that we've done this, and we realize that even if I condition on length-two words, all I see are these same two morphs: two steps ahead I have these two different distributions, and so I say we have these two different morphs. So now I'm thinking of the probability distribution over futures as a kind of signature for the predictions, and what I'm assuming, to get through this quick tour, is that that's all there is in the tree.
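The comparison itself can be done mechanically. Here is a minimal sketch of the morph step under the same assumptions (length-one pasts, length-two futures, a binary alphabet, and a tree deep enough that no product runs off the bottom); the tolerance parameter is my own addition:

```python
from itertools import product

def morph(edges, history, look_ahead=2, alphabet=("0", "1")):
    """Distribution over the next `look_ahead` symbols, given `history`."""
    dist = {}
    for future in product(alphabet, repeat=look_ahead):
        p, node = 1.0, history
        for s in future:
            p *= edges.get((node, s), 0.0)   # missing edge => probability 0
            node += s
        dist["".join(future)] = p
    return dist

def group_by_morph(edges, histories, tol=1e-9):
    """Lump histories whose morphs agree (within tol) -- the causal states."""
    states = []                              # list of (representative morph, members)
    for h in histories:
        m = morph(edges, h)
        for rep, members in states:
            if all(abs(m[f] - rep[f]) < tol for f in m):
                members.append(h)
                break
        else:
            states.append((m, [h]))
    return states
```

For the slide's example you would call something like `group_by_morph(edges, ["0", "1"])` and expect two distinct morphs back.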
Even if the original tree were depth 100, let's just assume that that's it: I went through, and these are the only depth-two subtrees that I saw — the only two morphs — conditioned on histories of length 0, 1, 2, 3, 4. That's it. First conclusion: the process that made the word distribution, the parse tree, has two causal states. So I'm going from what their distributions are to just talking about their names at this point. Then what I can do is go back into the tree, go to each tree node, look two steps ahead, and ask: is that morph A or B? And I put that name for the causal state up there. Because if I haven't seen anything — that's the top tree node — my prediction two steps ahead looks like this; well, that's A, that's the subtree A. If I come down, if I've seen a zero — that's the example I gave you — then hanging beneath it there's a depth-two tree that's the B morph. If I saw a one, I have the A morph; if I saw one, one, I have the A morph down here; and so on. I just go through and relabel the tree nodes with the subtree name for the morph hanging beneath them. Well, that's handy, yeah? So did it just work out that way with the top node? Yes, yes, yes. In this example — we're going to go through all the basic cases shortly; in fact I will over-explain it and make it over-explicit. But I'm trying to make it, rather than seem abstract, completely mechanical at this point; I'm telling you all the little stages. These slides are explicit enough that in principle, after the lecture, you could just go code it up, if you wanted to. So now I have this relabeled parse tree; I have the causal states, with their names up there. What are the state-to-state transitions? I just read them off: A goes to A on a one with probability one half, A goes to B on symbol zero with probability one half, and so on. So that's the result: we end up with these two states and the transition structure. A goes to A on symbol one with probability one half; A goes to B on symbol zero with probability one half; B goes to itself on symbol zero with probability two thirds; and B goes back to A with probability one third and generates a one. So that's it, real quick — I made a bunch of assumptions and didn't really start with a given data string, but I wanted to lay out the steps so that when we go through the particular examples they're clear. Again: given the correct word distribution, you build the parse tree, calculate the node-to-node relative transition probabilities, and calculate the morphs — they're easy to see now, they're just the distinct subtrees. Those are synonymous with the causal states. I relabel the tree, and I can read off the state-to-state transition structure: the resulting epsilon machine. I'm making a number of assumptions so that this works, the most important of which is that the word distribution is correct. So for the rest of the lecture I just want to work through examples. Part of that is to get back to the prediction game we played intuitively and show you that there's no intuition involved in this. In fact, remember, the period-two case was a little bit bizarre, a little bit surprising — so why?
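Before we get to why, one last piece of bookkeeping that gets reused in every example below: once you have the labeled transition matrices read off the tree, the asymptotic state distribution, the entropy rate, and the statistical complexity all follow mechanically. Here is a minimal sketch, written around the two-state machine just described; the function and variable names are mine, and the same helper applies to the period-one, fair-coin, biased-coin, period-two, and golden-mean machines later just by swapping in their matrices:

```python
import numpy as np

def machine_statistics(T_by_symbol):
    """Stationary distribution, entropy rate h_mu, and statistical complexity
    C_mu for a machine given as {symbol: labeled transition matrix}
    (rows = from-state, columns = to-state)."""
    T = sum(T_by_symbol.values())                 # state-to-state matrix
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()                            # left eigenvector, eigenvalue 1
    h_mu = 0.0                                    # average branching uncertainty
    for i, p_state in enumerate(pi):
        for Ts in T_by_symbol.values():
            for p in Ts[i]:
                if p > 0:
                    h_mu -= p_state * p * np.log2(p)
    C_mu = -sum(p * np.log2(p) for p in pi if p > 0)   # H[state distribution]
    return pi, h_mu, C_mu

# The two states read off the relabeled tree above (order: A, B).
T0 = np.array([[0.0, 1/2],      # A --0--> B with probability 1/2
               [0.0, 2/3]])     # B --0--> B with probability 2/3
T1 = np.array([[1/2, 0.0],      # A --1--> A with probability 1/2
               [1/3, 0.0]])     # B --1--> A with probability 1/3
pi, h_mu, C_mu = machine_statistics({"0": T0, "1": T1})
```

For this particular machine the stationary distribution should come out to roughly (0.4, 0.6), the entropy rate to roughly 0.95 bits per symbol, and the statistical complexity to roughly 0.97 bits.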
Well, we'll explain that. And then, for some of our favorite process generators — imagine that we have the word distribution for the golden mean process — we'll show that the generator we're familiar with, the one we assumed, is actually entailed by it. Same thing for the even process, although what I'm going to do there is just what I call topological reconstruction, not doing the probabilities; and the homework assigned today, due next week, has you go through the probabilistic calculation of the morphs — and some interesting things happen. Okay, so back to period one. This is the boring example. Yeah, sure, you can — I can imagine, like when I'm feeling particularly cranky or something, processes where every different history leads to a different prediction of the future. Yes, right. You'll see in the upcoming lectures when this sort of thing holds true, under what mathematical assumptions. For example, if we assume stationarity and a kind of finite-memory property, then we end up with a finite number of causal states. There will be other cases where we have an infinite number — an infinite-memory process with long-range correlations — and we'll see that corresponds to a countable infinity of causal states. In those cases in particular, as you look at longer and longer sequences you get more and more states; you keep discovering new things. But there's a way we can deal with that, using a technique called the renormalization group, that lets us bootstrap up to infinite-memory models from finite-memory assumptions. So here, today: just the simplest base cases. And there are any number of ways this can fail — finite data, the window's not big enough. Really, the real exercise is: I give you a bunch of data and I don't tell you anything — well, I guess there's a homework that does that — and then you have to discover for yourself what's going on, and also come up with some narrative description of what the property is. Typically you don't know: is this an infinite-complexity process or not? So there are these different ways we have of approaching that, which we can now make systematic. But let's go deal with the simple cases, dispatch them, and move on to more complicated, interesting things. Okay, so the period-one process: it's just all ones. Again, let's step through. I'm going to make a parse tree of depth five, which means I have this window of length five, and every time I see a word of length five I build up the tree: I put in a path that corresponds to it. So: one, one, one, one, one — okay. Shift over here — okay, next word, five ones: one, one, one, one, one — and so on. So the parse tree is just this. In fact, I jumped ahead and even calculated the relative node-to-node transition probabilities: with probability one, if I've seen nothing, I'm going to see a one; if I saw a one, I'm going to see a one; if I saw one, one, I'm going to see a one; and so on. So the space of histories of this process, which I drew as a set, is actually just one point: all ones. Okay, so now, how many morphs are there — depth-two subtree shapes? One, right. Here I look two steps ahead; I've got that; I'll put it over here: one, one. Come down here, look ahead: the same thing I saw before. Same thing. So it's just one — there's just one morph of depth two. Conclusion: there's one causal state. Call it A. I go back and label the tree with the A's, and A goes to A on symbol one with probability one. Okay. We can exactly write down what the future morphs are for all future lengths L, conditioned on basically any past, and we end up with the epsilon machine that's just one state. It is the start state. That's the other thing I should tell you here: whichever causal state is associated with the top tree node, that is the unique start state of the epsilon machine, and I denote it with the concentric circles here. Okay, so A goes to A on symbol one with probability one. We have these utterly trivial one-by-one symbol-labeled transition matrices. There's no uncertainty as to what state we're in. What's the asymptotic distribution over the states? One — again, kind of trivial. If you remember how we calculated the entropy rate for unifilar hidden Markov models: I go to each state — in this case state A, with probability one — I look to the future, and what's my branching uncertainty? Zero. So the information version of that is that the entropy rate is zero. For the statistical complexity — last Thursday we talked about the size of the model — well, here we've got one state, and we apply p log p to the state distribution. There is one event, A, that's completely certain; there's no information in that, so the statistical complexity is zero bits. Okay, so this is genuinely flogging a dead horse at this point: it's all ones; even the first time we did the prediction game it was obvious. But still, it's an important base case. And here's the other important base case: remember the fair coin. Well, that was the previous sample, but imagine I give you a long enough sample, or I just tell you it's a fair coin: a uniform distribution over words of any length. I build the parse tree of depth 5, and I'm jumping ahead here because it should be obvious what the node-to-node transition probabilities are: from every tree node I get 50-50 generation of 0 and 1. Okay, so I do that, and then the question is: how many probabilistically distinct morphs of depth 2 are there? One — exactly. So far these are not difficult; stay tuned. There's one. You just go here, look down here, and say: oh, 00, 01, 10, 11 — all four of those sequences have probability one quarter; and I reach the same conclusion down here. No matter what history I condition on, it's making the same prediction: all length-2 sequences have the same probability. So there's just one morph; again, call it A. Now the space of histories is actually the set of all binary strings — semi-infinite binary strings — a huge space. But we can still write down exactly, in closed form, what the future morph is conditioned on any sequence, going L steps into the future: it's always the uniform distribution over length-L binary sequences. Okay, next step: we've concluded there's just one morph; I go back to the tree and label all the tree nodes that have that full binary tree of depth 2 hanging beneath them — here, here, here, everywhere. Then I can read off the transition probabilities: A goes to A on a 0 with probability one half; A goes to A on a 1 with probability one half; and so on. So, just as we concluded before, we have a single state A, and the transition structure is 0 with probability one half, 1 with probability one half. Now, when we played the prediction game last Thursday there was a little bit of debate — maybe it should have been a two-state machine. Well, there is a two-state version of this: I could just call it A and A-prime and label the previous tree in some way.
But this is the minimal one. Again, what I meant by minimal was rather simple: I can't remove this, or this, or anything, and have it still properly describe a process — and it still captures the fair coin. And we didn't have to assume minimality here. Some of you have some familiarity with machine learning, and in fact a lot of that entire discipline is interested in not overfitting, in complexity costs, in choosing the smallest model — basically different algorithmic implementations of Occam's razor: don't multiply explanation beyond necessity. We didn't assume it; it fell out, it's entailed. Wouldn't we most likely get something more complicated in the first two to six, right? Yes, absolutely right — and we'll talk about finite-sample fluctuations, even for a fair coin; even for a fair coin especially. It's maybe almost the most interesting case, in a way; it's kind of the base case. So next lecture, or the one after that, I will prove that the equivalence relation leads to minimal models. It's not obvious, if I just give you the relation, that the result is minimal; once I prove it to you, hopefully it will be obvious, but not right now, so I'm just pointing it out. In this case I could have had, say, four states with equal 0-1 branching from each state, and that would still generate a fair coin. This procedure, because we're doing these equivalence classes, picked out just one state; it gave us the minimal model. So we get minimality for free, in a sense; it's not an additional assumption. The only assumption we're making is that we're trying to do prediction, and what's interesting is that trying to do prediction leads us to minimal models, minimal structures. Okay, so: states — the set of sequences associated with the causal state is just all binary sequences. Trivial symbol-labeled transition matrices: one-by-one matrices, each with probability one half. Causal state distribution: we have only one state, so we're always there. But now notice: if we go to each state and look at the branching uncertainty to calculate the Shannon entropy rate, the source entropy rate, we have maximal uncertainty. However, we know the process is always in the one state — don't tell me that, I know that; there's no surprise — therefore the statistical complexity is zero. Or, said differently, the log of one state is zero. So we're going to come back to this: we talked a lot about degrees of randomness with the entropy rate, but we have to think a little more about what the statistical complexity means. Last Thursday I said, think of it roughly as model size; it's kind of the uniformity of the distribution over states — if you have one state, then it's just zero. It's actually related to the amount of memory in the process, but that's something else I have to prove to you. Now let's just tweak things a little bit. We didn't do this on Thursday in the prediction game, but we did do it way back when we first started talking about word distributions and sequences: the biased coin. That one was peculiar because it's a simple generator — you can just imagine generating ones with probability two thirds and zeros with probability one third — and yet it ended up with that really complicated word distribution, that fractal probability-amplitude word distribution. Okay, here I'm just showing you the whole mosaic of word distributions in tree form — nothing different, just a different graphical representation. So our bias is two thirds. We go down — and again I'm jumping ahead in how to fill out the tree — and we see zero with probability one third and one with probability two thirds, and so on. I calculate the relative transition probabilities, and what I see is that hanging beneath each tree node is basically the same subtree of depth two. Whether it's the top tree node or down here, you see the same thing: 0, 0 always has probability one ninth, and so on. So how many morphs are there? Just one, of depth two; hence there's one causal state. Same big space of histories — all sequences occur; what's changed from the fair coin is just this complicated set of probability amplitudes attached to each sequence. We can write exactly what the morph is conditioned on any history: we have this binomial distribution over futures, so that's a nice closed form. So: a single state. We go back to the tree and label each tree node with the subtree beneath it — well, that's all A's, as we sort of agreed already. So we end up with a single state, and then a transition on the one with probability two thirds and a transition on the zero with probability one third. Simple one-by-one transition matrices, and the trivial — this is a simple process — asymptotic invariant distribution: just one. Now the branching uncertainty is the binary entropy function of two thirds — that's the bias — which we see at the single state. But the statistical complexity is still zero: we're always in state A; there's no state information per se. So this is slightly more predictable than the fair coin; the entropy rate is less than one. Okay, so now for the puzzling case from the prediction game: the period-two process. So, 0 1 0 1 0 1 ... We go through: the word of length five is 0 1 0 1 0, so I put in 0 1 0 1 0. Shift over here; I have a new word, 1 0 1 0 1; put in 1 0 1 0 1. Shift again: 0 1 0 1 0; put in 0 1 0 1 0. Okay, and then it repeats after that, down to depth five. Now, this is the fun question: how many distinct morphs are there of depth two? Two? We have a vote for two. Three? That was kind of tentative — be bold. Right, exactly right. So here at depth two: if I'm here, the knee goes this way; if I'm here, the knee goes that way. So in fact there are three. And this explains, back when we did the prediction game, why I insisted that there could be these other kinds of states. Certainly, once you forget any initial part — conditioned on a sufficiently long history — I'm only going to see futures like this. So this thing, this particular morph at the start state, the top tree node, is how I figure out what phase the period-two sequence is in: I have to measure a zero or a one first. Okay, so we can write out explicitly what the space of histories is: it's basically just two points — the history that ends in zero and the history that ends in one. I can just look at what sequences occur. Conditioned on having seen nothing — the top tree node — I can see two futures; if I see a zero, there's only one future I can see; if I see a one, there's only one future I can see. So I can write all that out and then calculate the probabilities from the tree. If I haven't seen anything yet — lambda, meaning no measurement — zero or one looks like a fair coin; that's my prediction; I don't know what the phase is, it could be either. However, if I've seen a zero, I know I'm going to see a one and not a zero; and vice versa — if a one is my past, I know I'm not going to see a one, I'm going to see a zero. So I go back and label all the transition probabilities, and then I go put the causal state names at the tree nodes that have the corresponding
morph hanging beneath them. So I have S0 here, because it's this morph with both knees, and that's the only place it occurs; and then I have S2 here and S1 here, S2 here and S1 here, all the way down. So we know that S2 goes to S1 on a one, and S1 goes to S2 on a zero; however, S0 can go to either S2 or S1, on a zero or a one, with fair probability. And that was the answer I gave when we did the prediction game: if I know what the phase is, then I can predict exactly; but getting started, before I've made any measurement, I have to see whether it's in the zero phase or the one phase, and go forward from that point. So if I haven't made any measurements, my uncertainty is the highest — it looks like a fair coin, zero or one will occur with equal probability — but as soon as I make a measurement, I can start to predict; in fact, predict exactly. That's one of the useful things you extract from laying out all of the causal states, both transient and recurrent: the transient states tell you how you come to do optimal prediction. I have to make a measurement first, and from then on the entropy rate is zero; the uncertainty is zero. You can write out, like I said before, the different histories associated with each of the three causal states, and now I have these three-by-three symbol-labeled transition matrices — sparse ones. The causal state distribution: asymptotically, all the probability leaks out of here. I can imagine that if I haven't made a measurement, I assume I'm in the start state; then that splits out 50-50, and then it just rattles around. So this is my asymptotic state distribution: start state, zero probability; then S1 and S2 have equal probability. The entropy rate is an asymptotic quantity, so after I've seen an arbitrarily long history, I'm in either S1 or S2; I look to the future, and there's just one transition possible — so the entropy rate is zero; it's completely predictable. Now, for the first time in this series of examples, I have two states, and there's some information: I can tell you it's in the even phase or the odd phase of its cycle, and that's informative to you if you don't know what state it's in. We have two events, equally likely, so there's one bit of state information — one bit of statistical complexity. What does that information mean? It's the amount of information in the phase. Yeah? What happens to the zero when you're computing the statistical complexity — the top state? I don't know how to phrase the question. Well, if you're doing p log p — right, we're doing p log p over this asymptotic state distribution; that's the way I defined it. Now, you could say: wait a second, I'm actually very interested, for my application, in how I come to know the asymptotic state distribution. Or there are other questions: if I start with all the probability up here, how does it actually relax onto the asymptotic state distribution? That's a question about conditioning on finite-length histories — maybe length zero, even — and how that relaxes. So there should be some question in your mind about whether putting all the probability up here initially and watching it flow down might be related to that transient information we were talking about when we looked at the block entropy and how it approached the asymptote E plus h_mu L, the linear asymptote. The transient states become important for questions like that. These are just time-asymptotic quantities for now. But there's more structure here: in fact, if I just tell you h_mu is zero and C_mu is equal
to one, that's much more informative. I mean, it's kind of trivial, but we pulled this out purely mechanistically; we weren't guessing, we just turned the crank. It's a period-two process, and there's a certain way you synchronize to it. Imagine it were a period-three process — 1 0 1, 1 0 1, 1 0 1. You can do the same thing, and there'd be a cycle of three states with, it turns out, two transient states that tell you, as you measure zeros and ones, how you come to know which of the three phases the process is in. Or period seven. There's actually much more structural information here: the architecture of the machine is really telling you how the process is organized. Okay, so now, mixtures of these are kind of more interesting. What I want to do is talk about the golden mean process. Remember, that one's easy: the golden mean process generates all binary sequences except that a zero can't follow a zero — that's the only restriction; that's the irreducible forbidden word, 0 0. And what I'm going to talk through here is not the probabilistic morphs; we're just going to look at which sequences occur, in the pasts and in the future morphs. I call that topological reconstruction: I forget all of the word probabilities and just look at which words occur and which don't, and that just means certain paths get put into the parse tree and others don't — we're not putting probabilities on it yet. So 0 0 can't occur anywhere — anywhere; it can never be produced. Okay, I'll just jump right ahead; I'm dropping all the probabilistic part of the argument just to get through it — it's not hard, it's just to simplify, because one of the homeworks is to do the probabilistic reconstruction. Okay, so imagine we try to argue why the golden mean process has this particular tree structure. Well, the only restriction is 0 0, so I can see ones, and I can see a one then a zero, but if I see a zero I must see a one — if I see a zero, I must see a one. So you can tell that every time I've seen a zero, anywhere in the tree, I cut off — I prune — that part of the tree. It's the same thing I pointed out when we were looking at the mosaic of word distributions: 0 0 is excluded at length two, and then that has a cascading effect on all the longer-length word distributions, where subsets are taken out; and, arguing in the limit of infinite sequences, it's actually a Cantor set of sequences that gets removed. So that's one way of looking at it on this tree: one restriction has this infinite cascade of restrictions further down the tree, for longer words that are not allowed to contain 0 0. Okay, so now the fun stuff. Now we're going to look at morphs of depth two. So how many distinct morphs are there? Three? Very fast — good. Where are they? Right, we always start at the top. So the top is going to branch, maybe like a coin flip, and then there's a restriction here — this one here, this guy is out. Okay, so this one here: I see a one, and then I can see a coin flip — this one here, this one here, zero or one. So this guy — isn't this guy the same as this guy? Right. Okay, there's at least this guy. And if I look here, I have this one — so that's a second one I haven't seen. If I look over here, well, that's a branch and then a restriction, so this is the same as this guy up here. Did I get this confused? I might have copied the wrong thing. Oh right, we have 0 1, 1 0, and 1 1. Yeah — no, no — okay, what I actually did here: I tried to dumb this down from the probabilistic version, and it does,
actually. Right — so I kind of jumped ahead here, and this is some bizarre mixture with the probabilistic reconstruction; this is half the answer to the homework exercise. Okay, right — so in fact I should have written this out. What happens if you put in the probabilities? At the start, the top tree node, if I look just one step ahead, I'm basically just looking at the probability of seeing a zero or the probability of seeing a one, and that happens to be two thirds and one third in this case. Whereas if I condition on seeing a one, then it turns out that this guy — which has the same shape as this upper one — is actually 50-50 in this case. So I should go rewrite that: in that case, if it's really probabilistic, you get three morphs, where this should be two thirds, one third here, and it's different from B because this is 50-50, if you work out the node-to-node transition probabilities like I said. Obviously C is just a different shape, but A and B are probabilistically distinct. Yeah — interesting. Yeah, I should just drop that; for the probabilistic reconstruction I should not be lazy, I should put the transition probabilities on there. The net result, if you do the probabilistic reconstruction, is this — that's what I just said: there are three. We have the causal state that's associated with the top tree node, where that first branching on zero and one is two thirds, one third; and then there's B: once we've seen a one, going forward, B is a fair coin flip on zero and one; and then, since we saw a zero, leaving C we have to see a one with probability one, because there are no consecutive zeros. Yeah, sorry, I kind of jumped the gun on the topological reconstruction. So then you end up with this. This is the answer to one of the homeworks, and what you're supposed to do is fill it out correctly. What I should do is just have an A and a B with these two, and then you will see that there are actually three if you look at the probabilistically distinct future morphs. Right, okay. And then the transition structure you get would be like that: this would be the start state, and then it would just be a zero, a one, and a one, like that. I'll go clean that up so it actually is the topological reconstruction. But anyway, this is what you should get with your probabilistic reconstruction; that's the hope — the target for the homework is this — and then you have to calculate those morphs, just like I was doing in the previous examples, along with the node-to-node transition probabilities. So, continuing on: I've got this hidden Markov model, and I can ask for the asymptotic state distribution. Well, it's two thirds, one third here. State A is a transient state — purely transient; after one step I never see it again. So I'm in state B with probability two thirds, where I see a fair branching — that's one bit of uncertainty, but it only happens two thirds of the time, so that's two thirds of a bit. And I'm in state C with probability one third, but there's no transition uncertainty there, because I'm definitely going to see a one. So the net result is that the entropy rate is two thirds of a bit per time step: only two thirds of the time do I see a fair coin flip, from B. Now the statistical complexity is this kind of mixture of things: I have this distribution, so — not writing out the number — it's just the binary entropy function of two thirds, two events with bias two thirds. That's the state information. Let's see if I get this topological reconstruction right. Okay, so the even process —
Let's see if I get this topological reconstruction right. Okay, so the even process. It's a little hard to describe; maybe that means it's more complex. The even process generates binary sequences in which ones occur in blocks of even length bounded by zeros, and every time I've seen a pair of ones the next symbol is a zero or a one with fair probability. That's the narrative description of the even process.

So what I've done is put the words that occur, ignoring the probabilities, into a depth-five parse tree. To depth one I see zeros and ones; to depth two I see all length-two words; at depth three there's only one forbidden word: a zero, an odd number of ones, and a zero, namely 010. We don't see another irreducible forbidden word until we get down here: a zero, three ones, and a zero, 01110, which is forbidden. At every odd length there's a new irreducible forbidden word, a zero, an odd number of ones, and a zero. So new restrictions keep coming in, and each restriction at a shorter length does some cascading pruning: any longer word containing it is also disallowed.

Okay, so now, this takes a little more pondering, but I'm pretty sure I got this topological reconstruction right, done to depth two. There's a bit of an issue here: we're starting to see structure over longer futures, so you might imagine a parse tree of depth six would be better. There's a trade-off, and I'm choosing depths so I can put it on a graphic we can actually see. So here's the parse tree again. How many distinct depth-two subtrees are there? Let's go through it. The way I'd write the algorithm: I go to each tree node, look at the depth-two subtree hanging below it, that's its signature, store it, check whether I've seen it before, and if not, add it to my list; then I go to the next node and do the same thing. In this case you're just looking at which words of length two occur below each node. At the top tree node, two steps ahead, I see all four length-two binary words. Here there's a restriction, I don't see one-zero, so I set that aside as different. Come down here, and that's the full depth-two binary tree again, which I already saw above.

Ignoring probabilities is the point here; sometimes we do account for the start node being a different state, and that was my mistake on the previous example. I was trying to show you where the probabilistically distinct things were, and in trying to give you minimal helpful information for the exercise I gave you too much: I did half of it for you. I'll fix that in the slide. The distinction in the previous example was a probabilistic one: A and B would have been the same if I had only looked topologically, at which sequences occur in the future. So here I basically do pattern matching, and it gets a little tedious. Notice if I'm here I must see a one and only then do I get a branch, so that's a new signature. You go through exhaustively, keep checking, keep checking. Then there's a decision criterion for when to stop; in this case I know I can stop because I defined the process we're looking at. The net result for the topological reconstruction is that there are three distinct subtrees. If you know what the process is, there are ways of calculating how far you have to look into the future and into the past to see all of the topologically and probabilistically distinct causal states; here I'm just choosing the depths so it works out.
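Just to make the signature bookkeeping concrete, here is a small sketch of that check in Python. It is my own illustration; the function names and the particular depths are choices I made, and a finite future depth is only a stand-in for looking arbitrarily far ahead.

```python
from itertools import product

def allowed(word):
    """True iff a binary word contains none of the even process's
    irreducible forbidden factors: a 0, an odd run of 1s, then a 0."""
    i, n = 0, len(word)
    while i < n:
        if word[i] == '0':
            j = i + 1
            while j < n and word[j] == '1':
                j += 1
            if j < n and word[j] == '0' and (j - i - 1) % 2 == 1:
                return False        # closing 0 after an odd run of 1s
            i = j
        else:
            i += 1
    return True

def words(length):
    return [''.join(bits) for bits in product('01', repeat=length)]

def topological_states(past_depth, future_depth):
    """Group allowed histories by which futures can follow them: the
    finite-depth version of collecting distinct subtree signatures."""
    groups = {}
    for h in words(past_depth):
        if allowed(h):
            signature = frozenset(f for f in words(future_depth) if allowed(h + f))
            groups.setdefault(signature, []).append(h)
    return groups

# With these depths the even process yields three signatures, matching
# the three topologically distinct subtrees found on the parse tree.
for signature, histories in topological_states(4, 2).items():
    print(sorted(signature), '<-', histories)
```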
Why is A written twice, up at the start tree node and again further down? Yeah, right, that's true, it's probably a little confusing, and maybe I shouldn't double the label. I'm doing that to remind myself that this subtree not only occurs further down on the tree, it's also associated with the top tree node, which means it's the start causal state. I don't mean the top occurs elsewhere; the top can only be at the top, but the same subtree also occurs elsewhere, like down here or down here. Okay, so at some point you start to realize why this is worth programming up: do enough examples and it gets tedious. It's still interesting to do a few by hand, but you don't have to do it for long; it is a procedure, after all.

Okay, so now we go back to the tree and label the tree nodes with the subtrees hanging beneath them. A here at the start tree node is the full binary tree, and it's the full binary tree again and again along here. But if I see a zero, I end up with this binary subtree with that one branch missing, because of the odd-length restriction, and so on; the B's are down here, and C is the subtree where I must see a one and only then can it branch. I've gone through and labeled all of that. In fact I even labeled down here, because I was actually looking at a depth-six tree; that's a case where I had to look a little bit further to get it to work out. From just one step below a tree node I can't tell that it's an A, so I worked that out but didn't display it, because the picture gets very busy.

Okay, so what does this tell us topologically? A goes to A on a 1, A goes to A on a 1 again along that left edge, and A goes to B on a 0; B goes to B on a 0 and B goes to C on a 1; C goes to B on a 1, and so on. Notice that most of the tree nodes are B's and C's; the only place I see A nodes is on the far left, while I'm seeing ones, and as soon as I step out of that I fall back into the B and C causal states. That's hinting that A is going to be a transient state that can map to itself. So we have the three causal states, and you can go through, like I just did, and make a list of which state transitions are allowed on a 0 and which on a 1. You end up with this three-state picture, and, I'm kind of jumping ahead here, ignore the transition probabilities for now, but basically you end up with these three states. Here's A, the start state; it goes to itself on a 1; as soon as I see a 0 I drop down to B, which was the only way of transiting out of that long run of ones into the main part of the tree. From then on, B goes to C on a 1, from C I must see a 1 and come back to B, and B goes to itself on a 0.

Now, I'm putting transition probabilities on here as if it were a stochastic process, and that's not really justified topologically. Sometimes, when we just have a machine without transition probabilities, we call it the topological machine, and as a default assumption we give it fair-coin branching; if you want to calculate a property like the statistical complexity you have to put some transition probability on there, and fair branching is a kind of null assumption. Another way to do it, even though this is only the topological picture and it only describes which sequences are generated, is to take a sequence generated by the actual even process, run it through this machine, and calculate empirically what those probabilities should be. That would be some kind of approximation.
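That empirical route is easy to sketch. The following is my own illustration, not course code: a toy generator for the even process and the three-state topological machine written out as a transition table, both as I understood them from the description above.

```python
import random
from collections import Counter

def generate_even_process(n, seed=0):
    """Toy even-process generator: in the 'even' state flip a fair coin;
    after an odd number of 1s since the last 0, the next symbol must be 1."""
    rng, out, state = random.Random(seed), [], 'even'
    for _ in range(n):
        if state == 'even':
            s = rng.choice('01')
            state = 'odd' if s == '1' else 'even'
        else:
            s, state = '1', 'even'
        out.append(s)
    return ''.join(out)

# Three-state topological machine from the reconstruction above.
step = {('A', '1'): 'A', ('A', '0'): 'B',
        ('B', '0'): 'B', ('B', '1'): 'C',
        ('C', '1'): 'B'}

# Run generated data through the machine and count emissions per state.
counts, state = Counter(), 'A'
for s in generate_even_process(100_000):
    counts[(state, s)] += 1
    state = step[(state, s)]

for st in 'ABC':
    total = counts[(st, '0')] + counts[(st, '1')]
    if total:
        print(st, {sym: counts[(st, sym)] / total for sym in '01'})
```

B should come out near a fair coin and C all ones; A, being transient, is only visited during the opening run of ones, which is one hint that counting on the topological machine does not tell the whole story.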
It turns out, and this is sort of the punchline that maybe makes the distinction between topological reconstruction and the full probabilistic epsilon machine clear, that for the even process there are four probabilistically distinct causal states. The full probabilistic reconstruction shows that the previous single transient state, the one that looped to itself, kind of splits: there's some modulation of the future probabilities that you have to keep track of. So I now have a loop in the transient part, which means I can stay there as long as I keep seeing ones; of course the probability of that goes down exponentially fast, and eventually I see a zero and leak into the two recurrent states, like this. That should look familiar: it's the way we've been thinking about the even process as a generator, with those transition probabilities.

So you can go calculate these things out: four-by-four transition matrices. The entropy rate is easy. Write down the asymptotic state probabilities: C is seen with probability two thirds and D with probability one third. When I'm in C I have a fair coin flip, so two thirds of the time I see one bit of uncertainty; when I'm in D I'm definitely going to see a one, so that adds nothing. So, just like the golden mean process, the entropy rate is two thirds of a bit per time step, and it has the same statistical complexity too: two states, C and D, with probabilities two thirds and one third, and that's the state information.

Okay, so the homework is to elaborate on the topological reconstruction here and write out the probabilistically distinct morphs, and as a guide you should be getting this as the end result; it's not too bad. You might find it handy, when doing these calculations by hand, to have the binary parse tree in front of you, so I have some PDFs of parse-tree paper and morph paper over here that you can download, print out, and try it on by hand first. It saves drawing the branching tree, which, if you do it by hand, just gets lopsided. So, a reference for that.

Okay, so that's it. These examples show that, again, it's a procedure: we just go through it and we get to discover the number of states and the transition structure of a process, starting from, and assuming we're given, the word distribution. In a practical application this breaks into two steps: some statistical technique takes you from finite data to a good estimate of the word distribution, and once you have that, you turn this crank to figure out how many states there are and what the transition structure is.
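As a rough picture of those two steps, here is a sketch of the crank in Python. It is my own illustration only: the depths and the tolerance are arbitrary choices, and a real reconstruction would replace the tolerance with a proper statistical test for whether two morphs differ.

```python
from collections import Counter
from itertools import product

def words(length):
    return [''.join(b) for b in product('01', repeat=length)]

def empirical_word_distribution(data, max_len):
    """Step one: estimate word probabilities from a finite data stream by
    sliding a window over it (max_len must cover history plus future)."""
    counts = {L: Counter(data[i:i + L] for i in range(len(data) - L + 1))
              for L in range(1, max_len + 1)}
    totals = {L: sum(c.values()) for L, c in counts.items()}
    return lambda w: counts[len(w)][w] / totals[len(w)]

def probabilistic_morphs(prob, past_len, future_len, tol=0.05):
    """Step two: turn the crank. Group histories whose conditional future
    distributions (their morphs) agree to within tol."""
    groups = []                          # (representative morph, histories)
    for h in words(past_len):
        p_h = prob(h)
        if p_h == 0:
            continue
        morph = {f: prob(h + f) / p_h for f in words(future_len)}
        for rep, members in groups:
            if max(abs(morph[f] - rep[f]) for f in morph) < tol:
                members.append(h)
                break
        else:
            groups.append((morph, [h]))
    return groups
```

Feeding it a long sample, for instance prob = empirical_word_distribution(generate_even_process(200_000), 6) using the toy generator sketched earlier and then probabilistic_morphs(prob, 4, 2), groups the length-four histories by their estimated morphs; the grouping is only as good as the word-distribution estimate, which is exactly the statistical step mentioned above.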