For today, it's mostly structural interpretation. We have all the calculational tools we need, and it's just a matter of drawing out consequences, some of which are interesting and puzzling, and some of which we'll say just because we can say them; maybe they'll be interesting in the future. OK, so I first wanted to talk more directly about scanning in different directions. The theme is going to be: now we get to choose. You could be scanning in the reverse direction, stop at some time, and decide, oh, I want to turn around and scan the other way. Now you might think, wait a second, if I scan in this direction I already know the past I just saw, so if I turn around there's no uncertainty. So our notion of turning around means we're going to forget a little bit and just look probabilistically at what would follow. But I'm going to organize the lecture around the question of calculating the excess entropy from the epsilon machine. Oddly enough, even though the excess entropy is this, quote, superficial mutual information of the observed process, making no assumptions about the internal mechanism, it turns out there's really no way to calculate it unless you use the epsilon machine. In particular, we want a closed-form solution, and I'm going to give that to you today. The net result is that pretty much everything, even the novel statistics and some of the things Ryan's going to talk about next week, cryptic order, Markov order, and so on, we can calculate from the epsilon machine. Before, we had proved it was a minimal sufficient statistic, so in that sense, in a very non-constructive way, you can calculate everything from it; the epsilon machine generates the process. But what we want is: given the states and transition structure, give me a closed-form solution for the various quantities, like the excess entropy. Then we'll pull things together in terms of a bidirectional information diagram, and we'll start talking about what we call the bidirectional machine, a sort of unified, time-agnostic way of thinking about processes. OK, so bidirectionality. We have these two sets of states, forward causal states and reverse causal states, the forward machine and the reverse machine, and both have this property of causal shielding: the future and the past are conditionally independent given the forward causal states. The same is essentially true for the reverse process: the past and the future, or future and past if you're in the forward direction, are conditionally independent given the reverse causal states. So this justifies interpreting the causal states, in either direction, as shielding one half of the lattice from the other half. But it's curious that they both do this equally well, although the processes are of course slightly different, since you're scanning in one direction or the other. So we'd like to unpack this. There seems to be a similarity here, but they're also capturing different information, particularly in the cases where a process is causally irreversible and the statistical complexities, or even the numbers of causal states, aren't the same in the two scan directions. OK, so again, I'm going to frame this in terms of calculating the excess entropy, because that's a very useful thing, and fortunately we'll have a very simple theorem for it. So what we want to do is calculate the excess entropy from the epsilon machine.
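As a quick recap in code of what we can already get directly from an epsilon machine, here's a minimal sketch: the stationary state distribution is the left eigenvector of the summed labeled transition matrices, the statistical complexity is its Shannon entropy, and the entropy rate is the state-averaged branching uncertainty. The two-state machine and its numbers are placeholders, not one of the lecture's examples.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical two-state epsilon machine given as labeled transition matrices:
# T[x][i, j] = Pr(emit symbol x and go to state j | currently in state i).
T = {
    "0": np.array([[0.5, 0.0],
                   [0.0, 0.0]]),
    "1": np.array([[0.0, 0.5],
                   [1.0, 0.0]]),
}

# State-to-state transition matrix and its stationary distribution pi
# (left eigenvector with eigenvalue 1).
T_total = sum(T.values())
evals, evecs = np.linalg.eig(T_total.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

# Statistical complexity: entropy of the stationary state distribution.
C_mu = entropy(pi)

# Entropy rate: state-averaged uncertainty over (symbol, next state) branches.
n = len(pi)
H_mu = sum(pi[i] * entropy([T[x][i, j] for x in T for j in range(n)])
           for i in range(n))

print(C_mu, H_mu)
```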
We could get the entropy rate directly, and we could get the statistical complexity directly, just by calculating eigenvectors and the state-averaged branching uncertainty, but we didn't know how to get the excess entropy. And if you remember, the excess entropy has this information-theoretic interpretation of a process as a communication channel. It's the mutual information between the past and the future, so you can ask, what's the shared information between the past and the future? It's as if, at the present moment, some mechanism, like the epsilon machine or some presentation, is in some state, and that state is communicating information from the past to the future. The excess entropy, being a mutual information, is a bit like the channel capacity we studied in the winter quarter: mutual information between channel input and channel output. We can think of any process this way, any physical or biological process. At each moment in time, the physical system or mechanism we're looking at is in some state that forgets some past information but remembers some of it, and somehow that stored information gets communicated to produce future behavior. So this is a very general picture of a dynamical system, a stochastic process, what have you. But it's not really the channel capacity. Remember, for Shannon the channel capacity was a characterization of the device itself. He was working in a more engineering setting: I want to build a device that communicates information over electrical wires, or over the air with radio waves. The channel capacity was the maximum mutual information over all possible inputs to the channel. Here, though, we have a process and it generates its own history, so the input to the channel is fixed; that's just how it behaves. So think of this more as the channel utilization, how much of the system is being used as it generates its own behavior. It's a somewhat circular idea. But the main question is, how do we get this from a process? Well, it's defined this way, but that's a pain because we're dealing with semi-infinite random variables. That's not practical. It's a nice intuitive definition, but how are we going to calculate it? We're going to do it in terms of the epsilon machine. So this is our main result, incredibly simple to state, and maybe at this point, I hope, kind of obvious: the excess entropy is the mutual information between the forward and reverse causal states. The intuitive part should be clear; here's the proof sketch. I have the mutual information between the past and the future. Well, I have a past, and I can apply my epsilon-plus function; that tells me which forward causal state I've been led to, and that forward causal state captures everything relevant about the past for predicting the future. Ditto on the other side: I apply the epsilon-minus function to the future, scanning in the opposite direction, and it takes me to the current reverse causal state, which captures everything I need about the future to retrodict the past. So they're perfectly good proxies for the past and the future, and we go from semi-infinite random variables to a finite set of random variables. This is good. Now, about the proof sketch: one of my students and I just last year wrote out the actual proof.
It takes about ten pages to do this correctly if you do all the limits, past and future, length L and length K. So this step, although it's completely intuitive, is a little tricky. But anyway, that's the idea; there's nothing more. Unfortunately, the proof is much longer than the idea deserves. But there you have it. There's also an interesting interpretation. Going back to Shannon's picture of communication channels, where we have a mutual information, this excess entropy is the effective transmission capacity between the forward and reverse processes. Maybe slightly strange, slightly intriguing: somehow our process is communicating information between its forward and reverse states, however those are stored. Practically, of course, to get this we need what we just talked about: we need to understand the relationship between the forward and reverse states, in the general case their joint distribution. OK. So I want to delve a little more into how we track the relationship between the forward and reverse states. That will set us up to talk about the more general presentation. We have the mixed-state presentation starting from some epsilon machine: we reverse it, we normalize things, and then from the normalized reversed machine we calculate the mixed states. As I just emphasized, we have to minimize over the mixed states to get the causal states, but that gives us the reverse epsilon machine. And as I pointed out, in doing this we are essentially tracking the relationship between forward and reverse states. The reverse states are mixed states over the forward states; the forward states, in turn, are mixed states over the reverse states when we go the other way. So we have two maps between them; call them the switching maps. There are the forward causal states; mixed states over them live in a simplex whose dimension is set by m, the number of recurrent causal states in the forward epsilon machine. And we have these mixed-state distributions. The subscript here doesn't mean time; it just indexes the value: the probability of being in state A, the probability of being in state B, and so on. Same thing for the reverse-state simplex, although we could have a different number of causal states in the reverse machine, so that would be a simplex of a different dimension. Again, mixed states are just points in that simplex. Then we have the forward map: given a mixed state for one machine, which mixed state am I in for the other? That map is just this conditional distribution. Actually, I said that backwards: the forward map goes from reverse time to forward time, so from the reverse mixed states to the forward mixed states, and it's just this distribution here. If I fix a reverse causal state, say I'm on one of the vertices of the reverse simplex, then in general I get some mixture over forward states. And the same for the reverse map. So these are the conditional distributions we just talked about, and we can think of them as mappings between the two simplices. We essentially calculate them along with the mixed states; we just have to track a little more carefully what we're doing and be comfortable with this interpretation of the mixed states as switching maps.
What this allows us to do is go through all the things we've been calculating and replace the semi-infinite futures and pasts with reverse and forward causal states, depending on what we're doing. So we can recast pretty much everything in terms of machine states, and there are even some new quantifiers. That's the benefit of formalizing this notion of a switching map: it captures the relationship between the forward and reverse processes. So let me review a few things for the random noisy copy example, and then I want to work out the switching map as a way of expressing the joint distribution between forward and reverse causal states. This is the forward epsilon machine: flip a coin and come back to A, except when coming back from C there's another coin, with bias q. Calculate its asymptotic state distribution. What we did last lecture was calculate the reverse machine, which also happened to have three states. Again, calculate everything out; the reason I'm choosing this example is that you can do all these calculations in closed form, in terms of the original p and q coin biases on the forward machine. So this is our pi for the reverse states. On Tuesday we talked through calculating the mixed states for the reverse machine, D, E, and F; we calculated these distributions. Now I want us to think about this conditional distribution as a matrix: the mixed states I put up a few slides ago become the rows. This matrix is the forward switching map. It maps reverse causal states, computed from the future, to distributions over forward causal states, computed from the past. And then, since I can calculate the asymptotic distribution of the reverse causal states, I multiply that against this matrix and get the joint distribution over forward and reverse causal states, which is also a matrix. That's what I did here. It tells us, as we move across the joint process lattice, the probability of being in D and A, which never happens, or D and C, which always happens, or sometimes there's a mixture. And what I mean by the state distribution times the conditional distribution is a component-wise calculation: that's what these entries are, computed component-wise. Ditto for the reverse switching map; it's the same calculation, except I would take the probability of the reverse causal states given the forward states times the asymptotic probabilities of the forward states. The two joints are transposes of each other, so maybe that's the easier way to do it. Anyway, without going through the algebra, we can calculate these things, so we know how the forward and reverse processes relate to each other. I can then go back if I want. Actually, the way I did this calculation, I started from the joint and divided by the forward-state asymptotic probabilities to get the conditional distribution of reverse states given forward states. And, in fact, in doing this I didn't normalize. Sorry. So this is not normalized; maybe I should put a funny sign right there: proportional to. When you normalize, you get the actual conditional distribution. We want each row to be a distribution, whereas in the intermediate calculation, the matrix I just showed you, the rows weren't normalized. The point is that this is all very explicit. Remember, this is for the random noisy copy process: the forward machine had two coin biases, p and q, and we've carried those all the way through.
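Here's a minimal sketch of that bookkeeping, assuming the forward switching map is given as a matrix of conditionals Pr(forward state | reverse state) along with the reverse machine's stationary distribution. The state labels follow the slides (D, E, F reverse; A, B, C forward), but the numbers are placeholders, not the actual closed-form p, q expressions.

```python
import numpy as np

# Forward switching map: row r is Pr(S+ = column | S- = row r).
# Placeholder values; only the general zero/nonzero pattern echoes the slides.
P_plus_given_minus = np.array([
    [0.0, 0.0, 1.0],   # from reverse state D: certainly forward state C
    [1.0, 0.0, 0.0],   # from reverse state E: a single forward state (placeholder: A)
    [0.0, 0.4, 0.6],   # from reverse state F: a mixture over B and C
])
pi_minus = np.array([0.25, 0.35, 0.40])   # asymptotic reverse-state distribution

# Joint distribution over (reverse state, forward state): the state distribution
# times the conditional distribution, computed component-wise.
joint = pi_minus[:, None] * P_plus_given_minus

# Forward-state marginal, and the reverse switching map Pr(S- | S+)
# obtained by normalizing each row of the transposed joint.
pi_plus = joint.sum(axis=0)
P_minus_given_plus = (joint / pi_plus).T

print(joint)
print(P_minus_given_plus)
```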
So the calculation is a little tedious, but not too bad. And the benefit is that you can actually write down, for this whole family of machines parameterized by p and q, an analytical form for the excess entropy. What used to be a kind of mysterious thing, a lot of trouble to get at, we now have ways of calculating. So, to show you one way of doing this, we use our theorem: the mutual information between the past and the future, relying on this joint distribution. I pick one identity for expanding it: the statistical complexity of the forward machine minus the uncertainty in the forward states given the reverse states. That's just a definition of mutual information. Well, I know what the first term is; we calculated that. And if you remember, there was only one row in the conditional distribution matrix with any uncertainty: in D and E I knew exactly which forward state I was in; it was only in reverse state F that I was uncertain over whether it was state B or C. So there's only one term to calculate here, that state uncertainty times the probability of being in state F. That's pretty straightforward. Just plugging things in: I know what that probability is, I know what that entropy is, and you can directly write it out. Very handy. Now, the other thing is that we can go back and calculate the crypticities. Remember, the crypticities are the difference between the observed information, E, and the information in the mechanism, the state information, C mu. We first encountered this in the information diagrams: we had past and future, and it was the mystery wedge, this odd thing. It said, I've seen some future, and I'm still uncertain about my present causal state, which is a slightly odd thing. Well, who cares anymore, right? I have a future, and I don't want to deal with a semi-infinite random variable, so I use epsilon-minus, which returns the reverse causal state. And then this looks very familiar: it's simply the conditional uncertainty in the forward causal states given the reverse causal states. Same thing the other way: we now have a reverse crypticity, which is the uncertainty in the current reverse causal state given the past. If this gets confusing, think of the joint process lattice and which blocks of variables these things are asking about. Same move: it's just the uncertainty in the reverse causal states given the forward causal states, which we can now calculate. So we can do that in closed form for the random noisy copy. The forward crypticity was basically the second term in the mutual information, which we just calculated: it was just reverse state F that gave us uncertainty, over forward states B and C, so we write that term out. But we can also do it in the other direction for the reverse crypticity. Because the state complexities change depending on which way you scan, processes can be more or less cryptic depending on your scan direction. In this case, if you remember the joint forward-reverse matrix, there was just one uncertainty in that direction too: if I was in forward state C, I was uncertain over reverse states, D or F. Plug in, and you get that. So now we have explicit expressions for the crypticities. And we're starting to see how the crypticities play some role in causal irreversibility, although we had examples showing they weren't the only determinant. We can explore that a little.
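To make that bookkeeping concrete, here's a small sketch that starts from a joint forward/reverse state distribution (placeholder numbers, for instance the joint built in the previous sketch) and computes the excess entropy as a mutual information along with both crypticities.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Placeholder joint distribution Pr(S- = row, S+ = column).
joint = np.array([
    [0.00, 0.00, 0.25],
    [0.35, 0.00, 0.00],
    [0.00, 0.16, 0.24],
])

pi_plus = joint.sum(axis=0)     # forward-state marginal
pi_minus = joint.sum(axis=1)    # reverse-state marginal

C_mu_plus = entropy(pi_plus)    # forward statistical complexity
C_mu_minus = entropy(pi_minus)  # reverse statistical complexity

# Excess entropy as the mutual information I[S+; S-].
E = C_mu_plus + C_mu_minus - entropy(joint)

# Crypticities: the gap between stored and observed information,
# chi+ = H[S+ | S-] = C_mu_plus - E, and likewise chi- = C_mu_minus - E.
chi_plus = C_mu_plus - E
chi_minus = C_mu_minus - E

print(E, chi_plus, chi_minus)
```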
Remember, the causal irreversibility was just the difference between the stored information, the statistical complexity, in the forward direction and in the reverse. Well, using these identities, the crypticities are just the differences between the state complexities and E, so I can plug those in here. The symmetric part, E, falls out of course, because I'm taking a difference. And what we see is that it's the difference between the crypticity terms that controls the causal irreversibility. So now it's clear what role that's playing. And, just to flog the dead horse for the random noisy copy, we just calculated both of those terms, because they were the crypticities; so take their difference. We have an explicit form as we vary p and q in the original process, the forward specification of it. We get some processes that are causally reversible, some irreversible, forward cryptic, reverse cryptic, both, and so on. And again, maybe the more important thing conceptually, we now have a different interpretation of the excess entropy. Rather than the left-half/right-half, or past-future, mutual information, it's this information flow between the forward and reverse processes. In other words, if you were God, looking out at the lattice of past and future, you could imagine the forward and reverse processes talking to each other. The practical result is that we can directly, in closed form, calculate these quantities from an epsilon machine. Interestingly, we have to use the reverse epsilon machine to help. So basically anything we want to calculate, we can now get analytically, and there's a set of algorithms. There are a few gotchas, which Ryan might talk about next week: there are cases where, when you reverse a finite causal-state process, you can get an infinite number of reverse states. That's kind of curious, but that's for next week. Even in that case, this specification gives us a way of doing good, systematic approximations, even if one of the sets, or for that matter both sets, were infinite; it just takes a little more work. So, to summarize all this graphically, we have our information diagram, where every atom in the information-measure picture can be interpreted. We have our future and our past here. The excess entropy we've now rewritten as the mutual information between the forward and reverse states. We have the state complexity, which we calculate directly from the forward machine, or from the reverse machine. That state information also has to overlap with and capture E. And the differences between C mu and E, the crypticities, the forward crypticity and the reverse crypticity, are governed by these state-conditional distributions. So we know what those wedges are in terms of the epsilon machine. Before, we always had these semi-infinite variables sitting around, so we were still relying on the original explicit specification of the process. And then remember our argument that once we condition on the causal states, in this case the forward causal states, we're doing optimal prediction over the future. That means the red and gray wedges here scale exactly linearly: as word lengths get longer and longer, the future is foliated into slices of size H mu. So that's those two pieces together.
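Before moving on, here's a compact summary of the identities we've been using, in my own notation (E for the excess entropy, chi-plus and chi-minus for the crypticities, Xi for the causal irreversibility):

```latex
\begin{aligned}
  E &= I[\mathcal{S}^{+};\,\mathcal{S}^{-}] \\
  \chi^{+} &= C_{\mu}^{+} - E = H[\mathcal{S}^{+} \mid \mathcal{S}^{-}], \qquad
  \chi^{-} = C_{\mu}^{-} - E = H[\mathcal{S}^{-} \mid \mathcal{S}^{+}] \\
  \Xi &= C_{\mu}^{+} - C_{\mu}^{-} = \chi^{+} - \chi^{-}
\end{aligned}
```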
The same thing happens in reverse: when we have the reverse causal states and we're retrodicting the past, conditioning on the reverse causal states, we can do optimal retrodiction over the yellow-green region again, so that's foliated this way too. So that's pretty much everything we need to know. In the original process information diagram we've replaced everything with machine states, which in many cases are finite and which we can work with, so we can calculate these things. OK, but this is all sort of skirting the subject of what's really going on. It's very handy, and it motivates thinking about the reverse process. But there's another way, maybe one step more abstract, of trying to unify things: rather than one machine that generates the forward process and another machine that generates the reverse process, think about a single machine that can generate both. Another way to think about it: you're in the giant joint process lattice, with forward and reverse states and observed sequences, and you're walking along; you can stop at each moment and try to predict, change your mind, start walking back, and try to retrodict. I want to give a compact expression to that; it will capture everything, and it also leads to some interesting observations. OK, so we were just re-expressing these various complexity measures in terms of the forward and reverse machines almost separately, although we did need to understand their relationship. But the only thing we used about the relationship between the forward and reverse processes was this snapshot: the present moment's forward causal state and the present moment's reverse causal state. We're not talking about blocks of these, or about relating, say, the reverse causal state at time three to the forward causal state at time ten. It's just a snapshot, a projection down to the instantaneous moment. That let us rewrite all the things we're interested in, but it doesn't capture everything about the relationship between the forward and reverse processes, only what we needed for the measures we've been considering. So here's what we really want. I've re-represented things a little: here's the observed sequence, with the forward causal states up here and the reverse causal states down here. What do I mean? If I'm here at time zero, I've seen this particular past, and that leads me to this forward causal state. If I was scanning in the opposite direction and came down to x zero, that would lead me to S-minus at time zero. So futures lead to reverse causal states, and pasts lead to forward causal states. What we're going to do is imagine walking around on this lattice. I'm in this state, I generate that symbol and that state, I stop here, and then I have a change of heart and ask, which reverse causal state am I in? So I hop over there, and then I can walk back. Well, that sounds like a mechanistic process, so I should be able to write down some state-based machine that lets me choose which direction I'm scanning in the lattice. The way to think about this is that there's an additional little machine sitting around, what I call a path automaton, that tells me to take a particular path. Think of it as a little program that says: increment my counter by one, by one, by one; oh, now hop over, now decrement the counter. I'm using that counter to index into my random variables.
So I can take different paths through the lattice. The question is, is there some representation of the process in terms of the symbols and both sets of states, so that I could take any path? That's this bidirectional machine. Any question I had about being in this state, or that state, or seeing this symbol and that symbol, would be captured. OK, so this is what we call the bidirectional machine. It allows forward and reverse moves. The machines up to now, in contrast, were always moving forward or always moving in reverse; their path automaton was simple, all pluses or all minuses. Now you can ask, what is this strange object? Well, we can give it a formal definition, very much like the original predictive equivalence relation, so this shouldn't be too surprising at this point. We define an equivalence relation, twiddle-plus-minus, whose equivalence classes are over bi-infinite strings, pasts and futures together. Let's see if I can say this correctly: I have a particular realization, with a past and a future, and in its equivalence class are all the other realizations whose pasts lead to the same forward causal state as the original and whose futures lead to the same reverse causal state. OK, this is turning into a mouthful. But the simple way to think about it: instead of partitioning pasts for forward machines, or futures for reverse machines, we're now partitioning the entire bi-infinite space. Maybe the only trick is how we associate things: pasts with pasts, futures with futures, with each epsilon function taking the corresponding piece, epsilon-plus returning forward states from pasts and epsilon-minus returning reverse states from futures. Question from the audience: is that a different equivalence relation for each state you're in, or one equivalence relation for all states? Right, so what we're going to say is that we develop this partition over the bi-infinite sequences, and the partition cells will be the, quote, states of the bidirectional machine. At this point, just following the parallels we've had before, it kind of makes sense; then you think about it at the next level and it gets confusing, and hopefully after that it gets clearer again. But it's formally the same thing we've done before; what it means is obviously different in the particulars. We're just taking the bi-infinite joint process and building equivalence classes out of this equivalence relation. Now, these bidirectional states, and I'm not going to call them causal states, there are some subtleties here, are a subset of the product of the forward and reverse states. There can be redundancies, information that's shared, and therefore the bidirectional machine can be more concise than giving the two separate machines. That's an interesting thing. In fact, you could have a forward machine with three states and a reverse machine with three states, but when you write the bidirectional machine down, they share so much information that it's actually smaller.
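To pin down the definition just given, here's one way to write it, using the epsilon-plus and epsilon-minus maps from earlier lectures (the arrow notation for pasts, futures, and bi-infinite realizations is mine):

```latex
\overleftrightarrow{x} \;\sim^{\pm}\; \overleftrightarrow{x}'
\quad\Longleftrightarrow\quad
\epsilon^{+}\!\left(\overleftarrow{x}\right) = \epsilon^{+}\!\left(\overleftarrow{x}'\right)
\;\;\text{and}\;\;
\epsilon^{-}\!\left(\overrightarrow{x}\right) = \epsilon^{-}\!\left(\overrightarrow{x}'\right)
```

The equivalence classes are the bidirectional states, and, as noted above, they sit inside the product of the forward and reverse state sets.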
A smaller bidirectional machine is telling you there's a more compact description, less stored information than you might have thought: it's not just the forward machine's states combined with the reverse machine's states, which would be the full product. So here's our bidirectional machine: the bidirectional states, plus some transition structure we can pull out by shifting around. Graphically, we're associating a given bidirectional state with a pair of states: there's some bi-infinite realization with a past and a future, we break it up, plug the pieces into epsilon-plus and epsilon-minus respectively, and we get the two states out. That's the bidirectional state we're in. So the bidirectional states are pairs; we think of them as pairs. Here's our process: if we follow the light blue past, we end up in the forward causal state S-plus at time zero, and following the future we end up in the reverse causal state at time zero. The bidirectional states are these pairs, and then the transition dynamic is the allowed pair-to-pair transitions, whichever way we go. So, some examples. We did the even process, which, despite its non-Markovian strangeness, turned out to be a really simple example, because the forward and reverse machines are the same. Again, we can write down everything we want. There's some product state distribution over the pairs, and using those things we can calculate the forward and reverse quantities; it's causally reversible. You can write out the switching maps, whose details I didn't include here, but they're basically identities. If I'm generating the forward process and I land in state A, I know that if I look down and ask about the reverse process, I'd be in state C. So they're just identities; there's not a lot of non-trivial interaction between the forward and reverse machines. They're essentially the same. Then we can calculate the joint switching map, or use the two switching maps, to put together the bidirectional machine. I'm not including the details, but here's the result. In the bidirectional machine, each state is a pair of a forward and a reverse state, and the way it operates is that we get to choose which direction we take. So the edges are labeled by your choice of direction, then a probability, then a symbol. For instance, I'm in state (A, C) and I decide to go forward: then with probability 1 minus p I'm going to see a 1. That's how you read these. Or, if I was going to go in reverse, then with probability p I would see a 0. So this lets me make the choice: forward, forward, forward, stop, reverse, reverse, reverse, and so on. Just to write out what sequences correspond to the state pairs: if you remember how the even process works, state A in the forward machine was the state you went to after seeing an even number of ones. State C in the reverse machine is the same kind of thing, except for futures: the ones you see before the next zero come in pairs. And those collapse down into the same equivalence class. Same thing for B and D, with odd numbers of ones. That's how it operates. Here's an example realization you can track through. Again, pasts lead to the forward causal states and futures lead to the reverse causal states.
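As a sketch of how to read those edge labels (direction choice, probability, symbol), here's the even-process bidirectional machine as a small lookup table, with a walker that chooses its scan direction at each step. The transition list is my reading of the diagram just described, so treat it as illustrative rather than definitive.

```python
import random

p = 0.5   # the coin bias in state A; any 0 < p < 1 works

# (direction, current pair state) -> list of (probability, symbol, next pair state).
machine = {
    ("+", "AC"): [(p, "0", "AC"), (1 - p, "1", "BD")],
    ("-", "AC"): [(p, "0", "AC"), (1 - p, "1", "BD")],
    ("+", "BD"): [(1.0, "1", "AC")],
    ("-", "BD"): [(1.0, "1", "AC")],
}

def step(state, direction):
    """Pick an outgoing edge at random, given the chosen scan direction."""
    edges = machine[(direction, state)]
    weights = [prob for prob, _, _ in edges]
    prob, symbol, nxt = random.choices(edges, weights=weights)[0]
    return symbol, nxt

# Walk the joint lattice: three steps forward, then turn around for two reverse steps.
state = "AC"
for direction in ["+", "+", "+", "-", "-"]:
    symbol, state = step(state, direction)
    print(direction, symbol, state)
```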
You can then check, against my graphics here, that these pairs of states behave the right way. (A, C) going forward generates a one, and what's not shown is that that happens with probability 1 minus p. (A, C) going in reverse generates a zero, with probability p, and so on. So in some ways this is nothing new: you could just say, oh, I have a larger alphabet and I'm doing a predictive equivalence relation over it. Stepping back a bit: the crypticities are zero, C mu is equal to E; the state information equals the excess entropy in both directions. It's a non-Markovian process, right, because the evenness property has infinite range; it's one of these so-called sofic systems. But it's very simple. Remember the definition of causal irreversibility? This process is causally reversible. And it's just not cryptic: the state and observed information are the same. So that's a simple base case, good to know about. What about the golden mean process? We did that calculation; I think the homework asked you to do this one, so I'll just give you the results. The forward and reverse machines are the same, so it seems like the even process again. Calculate the various state distributions; it's looking similar in a way. Of course, the golden mean process is easier to state: no consecutive zeros. All the information measures are the same as for the even process: H mu is the same, C mu has the same form, and it's causally reversible. But now, when you calculate the switching maps, there's a bit of ambiguity. So hidden in the golden mean process is some subtlety, some more structure. If you work it out, when you're in reverse state C you don't know which of the two forward states you're in, and likewise, when you're in forward state A you don't know which of the two reverse states you're in. So, unlike the even process, which is simpler in this sense, there's more ambiguity for the golden mean process, a loss of information when you turn around and go back along the lattice. And when you calculate it out, using these two conditional distributions to work through what the bidirectional machine is, it actually has three states, not two. For the even process we had two bidirectional states; now we have three. You can see how they're different when you go back to the lattice example and check. The processes look the same whether you use the forward machine or the reverse machine, and (A, C) corresponds to having ended on a one coming from the past and, coming from the future, ending on a one as well. The other two pairs are symmetric with each other: past ending in a one and future in a zero, or vice versa, past ending in a zero and future in a one. And you hop between (A, D) and (B, C) because of that. You can track through and show how this works. If I'm in (A, D) and I go forward, then with probability one I'm going to see a zero. That corresponds to moving into the condition where I must see a one next, so I go to (B, C). And from (B, C), if I move forward, I must see a one. Now, what's curious is that, the way the calculation works out, there are two forward moves from there: one forward move takes me to (A, D) and one to (A, C).
But in either case I'm going to see a one: if I go to (A, D), I see the one with probability 1 minus p; if I go to (A, C), I see it with probability p. So the bidirectional machine is non-unifilar; it's not an epsilon machine. That's a strange fact we're now working with, and we're still trying to understand it. You can also go back from (A, D) in the reverse direction: I'm definitely going to see a one, but the states I end up in I visit with probability 1 minus p or p, and so on. OK, so you might work through the lattice diagram here to check that I've done the calculations for the bidirectional machine correctly. So there's this extra level of interaction between the forward and reverse processes, if you will, in the golden mean process that wasn't present in the even process. And there's a measurable consequence: for the golden mean process, the excess entropy comes out as the difference between the statistical complexity and the entropy rate. We know what those are, so we can write them out. For the even process they were equal; we didn't have this extra term. Now, the derivation of this extra term for the golden mean process doesn't really explain why H mu is sitting there. You just calculate the expression and then notice that this quantity is H mu. Why would the entropy rate be what separates the stored information from the excess entropy? Anyway, it's a subshift of finite type, namely there's just one irreducible forbidden word, the pair of consecutive zeros. It's simple. It's reversible, like the even process, but it's also not so simple, because it's cryptic: there is this difference between C mu and E, which in this case happens to be the entropy rate, strangely enough. But at this point that just looks like a coincidence. Another example I mentioned results for before is the random insertion process: flip a coin; if it comes up 1, just copy a 1 out; if it comes up 0, flip another coin, insert that, and then come back on a 1. So it's randomly inserting extra bits. We can do this one in closed form too. Put the two device parameters p and q in and carry them through. Calculate the reverse machine, which, as we noticed last time, has four states, and calculate its asymptotic state distribution. In this case it's obviously causally irreversible, because the machines aren't the same. Given the asymptotic state distributions, we can calculate the forward and reverse stored information, and we can also calculate the causal irreversibility; it's just the difference, and basically just this term pops out. OK. And also the switching maps for it. Now these matrices are of different dimensions, four by three and three by four. For one of the reverse states there's ambiguity over two of the forward states, whereas for the reverse switching map there are two forward states that are ambiguous as to which reverse state they go with. From these we can build the joint distribution, which we'd want if, for example, we were calculating E in a different way than before. And then, obviously without including all of the steps, you can work out the bidirectional machine in this case. It's some sort of complicated mess; it got too messy to keep p and q as variables.
So I just set them to one half, fair coin biases. We end up with these four states, so there's a lot more internal structure than we might have guessed; then again, one might have thought, well, I had three forward states and four reverse states, so the bidirectional machine could be the product of those, upper bounded by twelve states. It isn't. That means there's a lot of redundancy in the information captured by the forward and reverse states. I suppose that's clear when you look at the switching maps: there are rows where there's an identity between some of the states, a certainty in the switching map. Pushing forward, we can calculate the forward crypticity as the conditional uncertainty in the forward states given the reverse states, and the same for the reverse crypticity. If we calculate the crypticities that way, it gives us a direct way of getting the excess entropy as a difference: choose either the forward or the reverse quantities, say the forward stored information minus the forward crypticity. So we have a closed-form expression for the excess entropy. One of the interesting things is that, since it's in closed form, you can rather quickly explore the whole family of processes as you vary p and vary q. Basically, this one family illustrates all the combinations: non-cryptic (where the crypticity is zero), reversible, semi-cryptic, and so on. What I'm showing here is called a stacked plot: the red piece is the forward crypticity, E is in green, and then the reverse crypticity. In the outside diagrams I'm varying p with q fixed, or varying q with p fixed, depending on which side of the parameter space we're looking at. And they show all these different combinations: crypticity essentially zero here, everything is C mu equals E; here we have a lot of reverse crypticity, but it doesn't look very hidden in the forward direction, and so on. A full panoply. Same thing down here: reverse cryptic but not forward cryptic, which is what I mean by semi-cryptic, and so on. One could do this study by generating realizations, calculating block entropies, and estimating, but it's arbitrarily faster with the closed-form expressions. So basically all the combinations can happen. Now, if we had just ended with the forward and reverse machines, one might think there could be any possible relationship between forward and reverse states, just some naive product over the two state spaces. But the bidirectional machine isn't that; it captures the redundancy between the forward and reverse processes. It comes at a cost, though, maybe: it's not necessarily minimal. One could try to minimize it. Just as for the epsilon machines, minimality is very important, because we used it to justify the number of states, or the state complexity, the p log p over the states, as a measure of memory in the process, as the stored information. The Shannon information in the bidirectional states, I don't quite have an interpretation for yet; we'll talk about it in a second. We can't show that these states are minimal. Possibly there's a modification of the definition or construction where you get that, and then you could use the state information as a measure of memory. Non-unifilarity is another problem.
You can't calculate its entropy rate directly. So you might say, well, why not try to unifilarize it? An interesting project to try. So it's not an epsilon machine, but it is a presentation of the forward and reverse processes; it generates the process in both directions. So there's still some basic work to do on this bidirectional representation. You can project it to get the forward process and the reverse process, but when you project, you might then have to minimize again to get the causal states. Also, this exercise shows some of the oddities of prediction. The forward states, the predictive states, are better retrodictors than they are predictors; they're cleaved out of the past, off by an amount of crypticity. And the same thing the other way: the retrodictive states are better predictors than the predictive states, by the forward crypticity. So that's a curious thing about interpreting the forward and reverse states. And there are open questions; this is right up at the research frontier. I showed you just the recurrent bidirectional states; I didn't talk about the transient ones. That's a bit of an issue: what do they mean, how do things relax into this asymptotic bidirectional picture? It would also be nice to have a minimal presentation of the forward-and-reverse process, and one that's unifilar. So we've got more work to do, but at least we have a presentation of the forward-and-reverse process. Now, let me quickly talk about some consequences, some ways of capturing or measuring what the bidirectional machine captures. We can talk about the Shannon information in the bidirectional states. Somehow this should be related to the information you need to store to optimally predict and retrodict. I'd say that's true in some example cases, but not in general; we still have to think about it, and, as I was just saying, it depends on coming up with a minimal presentation. We can also recast the excess entropy. It's this very time-symmetric quantity; it's not sensitive to time irreversibility. But now we can re-express it, unpacking the mutual information as the sum of the marginals minus the joint: the forward complexity plus the reverse complexity minus this joint complexity, which has a simple information-diagram representation. Here's E, and then we have the crypticity in the yellow part; the yellow as a whole is the forward statistical complexity, blue is the reverse statistical complexity, and the wedges are the reverse and forward crypticities. Basically, what this says is that we can think about E as this overlap. Set-theoretically, we add the yellow area and the blue area, but that overcounts the overlap; the joint entropy over the states counts each atom once, so we subtract that off and we're left with the overlap, having stopped double counting it. That's why it's E. There's also a kind of strange limit here, which I'm still pondering. The only time the bidirectional complexity really is just the sum of the forward and reverse complexities is when there's no shared information: graphically, there's no E, the two are completely separate. That's odd. But if you remember, those would be the highly cryptic processes, where basically there's no available information.
That would be some sort of IID process: no correlation, no structure, or maybe you have a bad measuring instrument or something, and the internal processes are just completely separate. It's curious; I have to think about what this actually means, because basically nothing is reconstructable in that case. We have other bounds. Before, back when we only talked about forward processes, the statistical complexity, the stored information, was always an upper bound on E. Now you can show, since E is symmetric, that it has to be less than each of the complexities, forward and reverse. So that's a tighter bound on E. Interesting observation; it may be helpful in a calculation. The bidirectional state information is always bounded above by the sum of the statistical complexities of the forward and reverse machines, so in that sense the bidirectional machine is more efficient. Although I think this interpretation does depend on justifying the state information of the bidirectional states as the minimal information you need to predict and retrodict. Going the other way, the bidirectional complexity upper-bounds both the forward and reverse complexities. This is all just a matter of trying to get some sense of what this object is: what are the constraints on the bidirectional machine? From the information diagram we were just talking about, we can think of the bidirectional state complexity as E, the shared information, plus this bidirectional crypticity, defined to be the sum of the two outside wedges. But that is the sum of the state-conditional uncertainties: the uncertainty in the forward states given the reverse states, plus the uncertainty in the reverse states given the forward states. The crypticity generally is a measure of the gap between the stored information and what's observed. But this is an interesting expression, because if you remember back to when we first introduced information measures in the winter, I noted that for any two random variables X and Y, the sum of their conditional entropies, the uncertainty in X given Y plus the uncertainty in Y given X, is actually a metric; it gives us a genuine distance measure. So that's curious: this bidirectional crypticity is really a distance between the forward and reverse state processes. So maybe there's a little more structure in the problem we're working on: we actually have a metric, which is nice; we're in a space with a well-defined notion of distance. But generally, of course, this crypticity is just a measure of information accessibility, of how hidden a process is. Final bounds; as I said, we're just drawing out consequences, trying to see what we can understand about this bidirectional machine. The bidirectional crypticity is always less than the bidirectional state information. A truly cryptic process, and here I'm just calling on our previous discussion of cryptic processes, would be one where E equals zero, which would make the state information equal to the bidirectional crypticity. But that's the peculiar limit I just talked about: something with zero excess entropy, zero shared past-future mutual information, is some kind of IID process, and essentially there's no information you can extract that would let you look inside and discover the hidden states. So it's a singular limit.
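Going back to the distance measure mentioned a moment ago, here's a minimal sketch of that metric, d(X, Y) = H[X|Y] + H[Y|X], computed from a joint distribution. Applied to a forward/reverse state joint (placeholder numbers again), the value is the bidirectional crypticity, chi-plus plus chi-minus.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero entries."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_distance(joint):
    """d(X, Y) = H[X|Y] + H[Y|X] from a joint distribution with X on rows, Y on columns."""
    H_joint = entropy(joint)
    H_x = entropy(joint.sum(axis=1))   # row-variable marginal
    H_y = entropy(joint.sum(axis=0))   # column-variable marginal
    # H[X|Y] = H[X,Y] - H[Y]  and  H[Y|X] = H[X,Y] - H[X]
    return (H_joint - H_y) + (H_joint - H_x)

# Placeholder joint over (reverse state, forward state); its info_distance is
# the bidirectional crypticity chi+ + chi- for that joint.
joint = np.array([
    [0.00, 0.00, 0.25],
    [0.35, 0.00, 0.00],
    [0.00, 0.16, 0.24],
])
print(info_distance(joint))
```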
In that singular limit, there's basically nothing you can learn about a process's structure from measurements, which would be like encrypting a message and sending it in a way that's never reconstructable. What use is that? I don't know. But anyway, it's a limit. So that pretty much exhausts the questions we have about the bidirectional machine. We end up with a mechanism that describes the forward and reverse processes together; it's a presentation of them. It's nice to be able to drive around the lattice any way we want, choosing whether we go forward or backward; it captures all the possible questions we might have. There are some calculational issues, and it would be nice if it were minimal and unifilar, but we end up with this time-agnostic picture of a process. It's pretty much the most complete presentation you could have of a process: it captures all of its properties, including its statistical irreversibility. So that's it for today.