Yeah, so this is stuff that Jim and I are actively working on. It has been coded recently, so you'll be the first to get a go at trying to use it. It's related to work I did for my PhD, which was more on inference of Markov chains of various orders, applied to coarse-grained dynamical systems. Here we're trying to push it to something applicable to computational mechanics, where the states are potentially hidden. One of the big issues is how you get at that when the observed data is not reflecting, in a direct way, the internal states of the machine that might be the correct one. So I'm going to take a slightly different tack. My goals are very similar to what you've been hearing previously, but from a very different viewpoint in a certain sense. Hopefully some of the language I'll use is very similar and might be confusing; I'll try to be very clear about what I mean, and if it's not clear, then ask.

The basic overview: today I'm going to talk about general goals and my view of statistical inference, what we're trying to do, and in particular a Bayesian approach to the problem. I'm going to do a couple of examples. One is a biased coin, just inferring the transition probabilities, and the next moves on to epsilon-machine-like structures: unifilar hidden Markov models. Not all hidden Markov models, just unifilar ones, and that actually turns out to be very important. I'll do an example of something called the even-odd process, which is like the even process that you've seen but slightly more complicated. All of today we're going to be focused on the case where you have data and you have a known structure.
You just want to infer transition probabilities. What I mean by known structure is: you know there are a certain number of states, you know that going from state A to state B emits a one, but you don't know with what probability. So it's going to be figuring out what those transition probabilities are, and in the case of something with hidden states, what the actual hidden-state dynamics were: what paths did it take through the machine? We actually need to figure that out to infer the transition probabilities.

The next lecture is going to be on how you then differentiate, given a certain data set, between structure A, structure B, and structure C. The approach I have been working on uses Bayesian methods for model comparison, and work by Ben Johnson, who had a way to enumerate unifilar hidden Markov models, or topological epsilon machines in this case, which are more restricted than all unifilar hidden Markov models. But in general the idea is: just try everything. Enumerate all possible structures, and then for a given data set ask which is the most probable, these kinds of things. Of course this is limited by computational power, but actually I think you can get quite far doing that. There I'll be doing one example in particular, the even-odd process, but in that case I will not assume a known structure. I'll say: give me the data, then throw thousands of different topologies at it, and see if we can actually figure out the right one. Then example four is going to be a survey of processes that you already know: golden mean, even, and the SNS, which in this case is actually out of class because it's nonunifilar. This will be more the kind of stuff that's going into the paper we're working on, so it'll be fancier plots, and you can get a sense of how all this works. I'll wrap up with some complications, motivated by the SNS. How do you think about out-of-class things, things that aren't unifilar hidden Markov models? Also, in real data things might be non-stationary; how might you approach that kind of thing? So, gotcha kinds of things.

Partially, the way I've done this is that there's code in CMPy that does all of these things. The slides have Python code in them, and it's actively run and put on the slides, and the lab that you have on Sage goes through all of the examples exactly, so you can play around with this stuff afterwards. That's really a big part of the goal: play around with this afterwards.

Okay, so that's the outline; let's get started. Partly I've already been describing this, but I'll just introduce some of the notation and how we're dividing things up. The first level, again, is today: you're given a set of data, which I'll generally write as D. We're going to infer parameters, for which one often uses theta, and this can be one or more parameters; it's very non-specific. I put these little subscript i's as a reminder that we've chosen a particular model i. That's all it's meant to mean: we're using this for a particular structure. This could be, say, that we've assumed a Gaussian; that's our model M_i. In this case it'll soon be a particular topology, and again M_i is the model itself; a particular model has a certain set of parameters.

Okay, and so today there are two goals. One is to provide a point estimate of the parameters: the idea is you say the transition probabilities are 20% this way, 80% that way, so you get a particular number. But the fundamental idea is that everything is always uncertain, so that's only half of the story. The other part is to quantify the uncertainty in the estimate. So your estimate of a transition probability might be 10%, but the value might be anywhere between, say, 3% and 15% with 95% probability, what in Bayesian lingo is called a credible interval. We'll see what the possible range of values for this parameter is, given that we had a thousand symbols, this kind of thing.

That will be for today, and then we'll go into this comparing of many models. Again it will be a single data set, but now we'll consider a whole bunch of models from some set. The examples from Thursday will be topological epsilon machines with one, two, three, four, or five states. It turns out there are around 38,000 such topologies, and we'll be using this a lot; I'll describe it next time. The numbers are quite impressive, but you can actually throw a lot of computer power at this and get some interesting results. The other thing is that you will always get some results, or at least you'll get some topologies that are consistent with the data, which is kind of a surprising thing, at least for me. I didn't expect this, but stay tuned for next lecture: it turns out that even for things like the even process, when you throw this library at it, there will be hundreds or thousands of topologies that say the even process could have come from this, and we'll explain why that's the case. And here we get this extra level: what is the uncertainty in the model structure?
So we end up with probabilities of a particular model structure given the data and the set that we've looked at. But the most likely one may only have 5% probability, in which case you're very uncertain what the actual structure is; or it could have 99.99999% probability, and those are very, very different situations. So we'll also discuss how to think about this, because what I think a lot of people want to do is say: here's my data, give me back a single model with very specific transition probabilities. I'm going to argue that sometimes that will be okay and other times it will be disastrously wrong, so you just need to be careful, particularly about what this distribution over models given the data is.

And then the final thing, because you've spent a lot of time learning about computational mechanics and information theory: we want to estimate some of the quantities you learned about, C_mu and h_mu. These are all functions of transition probabilities, which with finite data are uncertain, so your estimates of h_mu and C_mu and everything else are going to be uncertain as well. So I'm going to argue for a sampling approach to estimating averages of these quantities, and again credible intervals for these quantities, which are a function of how much data you have, what models you're using, and all that. So that's the big overview; let's get into it.

For this first part I'm going to be building up the theory and introducing notation, and I think we've gone through most of the notation here. Our goal at this first level is to come up with this thing we call the posterior distribution, which is basically the probability density of the parameters given the data and the model. This really is a probability density over the transition probabilities: it's something that you can integrate over, and it will integrate to one. This is part of why we can sample from it. But how do we get there?
We have to build up things you're probably familiar with: a likelihood, and something called a prior, which is often more mysterious if you haven't done Bayesian things. One important thing to keep in mind is that this prior we have to choose will affect what we get for the posterior. So how you set the prior is actually important: you can set it very weakly and have it barely affect what the posterior says, or you can set it very, very strongly and reach very erroneous conclusions. Part of the goal will be to figure out what your prior says, how much the data is contributing, and to do all this in a reasonable way.

These are the basic elements at this first level, and I'll go through all of them very specifically for the biased coin example. The first is the probability of the data given the parameter settings and the model, and this is very similar to the kinds of things you have already been calculating: given the transition probabilities, what is the probability that you saw 0 0 0 1 1? The difference in this context is that we write down that probability, but we really don't know what the transition probabilities are. We can write it down and say it's p_0 to the power of the number of zeros in the word, times p_1 to the power of the number of ones, but we don't know what the transition probabilities themselves are.

Then the other part is the prior, and this encodes prior assumptions: expert knowledge, or restrictions. If you're actually applying this to a physical system or a social system, there might be reasons to say you think these parameters should lie between this and that. The way the CMPy code is set up, and the way I generally approach things, is to set everything very diffusely and just let the data drive it, but this is something to consider, and I don't think it's necessarily a bad thing to use if you really have the grounding for it. The other way to think about it is that if you have two data sets, your prior for the second data set could actually be the posterior from the first, so it can be something informed by data: you can add chunks of data and keep updating.

Then this quantity here is called the evidence, and it turns out we're going to see it over and over again. It looks like just a normalization constant in Bayes' theorem, but it's the probability of the data given the model, so it's very close to the likelihood; what you end up doing is averaging over the uncertainty in your knowledge of the parameters. So it's the probability of the data given just the model, taking into account that we are uncertain about the model parameters, and this will allow us to actually do model comparison at various levels. Terms like this will come up over and over as we do model comparison: first for transition probabilities, then for inferring the start state, and then for looking at different model topologies. So just a heads up to pay attention to it.

All three of these things go in: basically the likelihood and the prior determine the evidence, and then you renormalize and get a posterior, which is very much like a prior but now conditioned on the data; we've integrated the data and the prior assumptions. So the basics of Bayes' theorem: the posterior is what we want; we have our prior, the likelihood, and then this evidence term. Like I said, the evidence term looks like a normalization constant, and its form depends on the parameters theta_i, which again is a loose term for whatever our parameters of interest are. For transition probabilities these are continuous, and the evidence will be an integral of the product of the likelihood and the prior; whereas if the parameter is discrete, for instance trying to figure out what state our machine started in, it will be a sum where we iterate over the discrete choices: it could have been in start state A, B, C, or D before it generated the data we saw. So this is a very general form for Bayes' theorem, and the normalization just depends on what types of parameters we're dealing with.

So, the biased coin. All of this stuff, when it looks typeset like this, is actual Python code in CMPy, so when you go to Sage and look at the notebooks, all of it will pretty much be verbatim there, and you can play around with it and get a sense of it. I chose the biased coin because it seems like a trivial example, but it lets us get used to inference. The way I defined it in CMPy was to make a string and pass that string to CMPy's machine-from-string constructor, so I made it into a recurrent epsilon machine within CMPy. So this is just like the even process or the RRXOR process: it's something with a state and edges. Okay, so I'm going to use this extra notation that looks extraneous when I'm doing the biased coin, but that's because it carries over to all of these other things. I also chose a biased coin rather than a fair coin because, without any data, the default setting of the prior will on average give the fair coin, and I wanted to be clear that we're not just getting back the prior. So we have something with a 10% probability of producing a zero, a 90% probability of producing a one, and it always goes back to the same state. The key to inferring probabilities is really going to be: how many times did we take this edge, versus how many times were we in the state? This carries over to the even process too: how many times were we in state A and emitted a zero and came back, versus how many times were we in state A and emitted a one and went to state B? It's exactly the same thing. So this can be seen as having basically a binomial distribution at each state.
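Collecting the pieces just described, in the lecture's notation of data D, parameters theta_i, and model M_i, Bayes' theorem reads:

```latex
P(\theta_i \mid D, M_i)
  = \frac{P(D \mid \theta_i, M_i)\, P(\theta_i \mid M_i)}{P(D \mid M_i)},
\qquad
P(D \mid M_i)
  = \int P(D \mid \theta_i, M_i)\, P(\theta_i \mid M_i)\, d\theta_i .
```

For discrete unknowns, such as the start state s, the evidence integral becomes a sum over the choices: P(D | M_i) = sum_s P(D | s, M_i) P(s | M_i).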
This completely carries over; you just end up getting products of these per-state distributions. So it's a good place to start, and less trivial than it might seem at first, but it lets you do everything. So let's assume we have this model class and some data, and write down the likelihood. This is very familiar to what you've already done: if we looked at the data, the number of times we saw a zero from state q and the number of times we saw a one, we can basically just count from D. (The q is not really important here; the notation is overly specific on purpose.) It's the number of times we saw a zero in the data set D, the number of times we saw a one in D, and that's also the number of times we traveled on those edges. Again, we're assuming we don't know what these probabilities are; we're going to infer them. But we can write the likelihood down in principle, with the counts fixed from the data. If you wanted to do maximum likelihood and estimate the probabilities that way, you could just treat this as a function of the model parameters and maximize it; often you take the log and maximize that, and you get the maximum likelihood estimate, which is just the number of times you saw a zero from q over the number of times you visited q. Very straightforward. But we're going to add to this and use the Bayesian machinery.

Again, just being specific about the notation: M_i is the assumed single-state binary machine, the unknown parameters are the two transition probabilities, which of course are constrained to sum to one, and the data is these two counts. As I've been saying, think of them as edge counts, because that will be important later.

All right, so then we get to the prior, the next element of Bayes' theorem. It turns out that for this particular binomial, or multinomial, form (it doesn't matter that we have two letters in our alphabet; it could be three or ten, so this is completely general) there is what's called a conjugate prior. For binary alphabets you can think of the likelihood as a binomial distribution, and the prior that's conjugate to it is a beta distribution. If you have more than two letters, there's a multinomial version, and the prior is a Dirichlet. What it means to be a conjugate prior is that if we assume a prior of this form, our posterior will also be of this form: we'll end up with a posterior that looks exactly like the prior, but where the alphas are now alphas plus the data we've seen.

That actually helps in thinking about these alphas. The first thing is that this is actually a probability density over the parameters, and here you can see the restriction that they have to sum to one, so it actually lives on the simplex. You have this factor here, the probabilities raised to the power of alpha minus one, and the alphas are parameters that you set: parameters for the prior. In practice, and in the way CMPy handles all this, these are set to one, and what ends up happening is that the transition probabilities are uniform over the simplex. So your expectation is going to be one over the alphabet size, but that's only the expectation.
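Stepping back to the likelihood for a moment: the maximum-likelihood estimate described above is just normalized edge counts. Here is a minimal standalone sketch (plain Python, not the CMPy API) for the single-state biased coin:

```python
from collections import Counter

def mle_transition_probs(data):
    """Maximum-likelihood estimate for the single-state (biased coin)
    machine: each symbol probability is its edge count over the total
    number of state visits, n(x|q) / n(q)."""
    counts = Counter(data)
    n = len(data)
    return {symbol: c / n for symbol, c in counts.items()}

# A sequence with 3 zeros and 7 ones: n(0|q) = 3, n(1|q) = 7, n(q) = 10.
probs = mle_transition_probs("0011011111")
# → {'0': 0.3, '1': 0.7}
```

With a long data stream this estimate is fine; the Bayesian machinery below matters most when the counts are small.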
You can basically see any value between zero and one, subject to the constraint that they sum to one, and we'll sample from the prior to show you that this is what's going on, which is good for understanding. Then this term here is actually just a normalization, so you can literally integrate over these probabilities and get one. Okay, and I guess the last thing is this term here, the sum over each of the edges: there's an alpha parameter for the zero edge and one for the one edge, and you can think of these as artificial counts. So by setting them to one, we've basically given one count to this edge and one count to that edge. Sometimes people who don't like to do Bayesian things will call this smoothing, but for small data sets it really is important: you might get a maximum likelihood estimate of zero probability for something, but you don't really believe that, and this is partially what the prior is doing. We also have a counterpart to the number of times we've seen a particular state, which is the sum over all outgoing edges, a value of two in this case; of course this depends on the size of the alphabet and how many edges a state has, these kinds of things.

The way to think about how much the prior is saying about what you're inferring, versus the data, is to compare these edge-prior values to the number of times you've actually seen the edge. If you've never seen any traversals of an edge, your estimate is completely prior-based. Whereas if you see some traversals, it's basically the count for the zero edge versus the alpha for the zero edge. So if the number of times you've seen a zero is a thousand and you've set alpha to one, the data is driving it, and there will be essentially no difference between the maximum likelihood and the posterior estimates. It does become important when the data is small, and that's where I think priors being sensible, and knowing what you're setting, really matters. So I'll argue that it's a good thing to do, and that you can understand what the prior is doing relative to the data.

[Student:] Where are these from?

So, in mathematical statistics texts, like Wilks', the Dirichlet and multinomial distributions are well known.

[Student follow-up.]

Well, okay, I'll give you a sense of what it is. The way to think about it is that it's a probability density over P(1|q) and P(0|q), constraining them to sum to one, and it's constructed such that when you integrate over these probabilities, subject to that constraint, you get back a normalized distribution with total probability one. So it's just a probability density, a probability density over probabilities, which is kind of a weird thing, but it comes up. It comes up in machine learning a lot, because you're often trying to infer probabilities, so multinomials and binomials come up, and their counterparts the betas and Dirichlets.

So another question is: do I have to choose the conjugate prior? No, I could choose any prior I wanted. I think this is a sensible one, and it's very flexible in terms of settings: I can set the alphas for the zero edge and the one edge and change the shape of my prior. When I set these to one, it's flat over the simplex; but if I set each of them to a thousand, my prior mean would still be 50/50, but it would be sharply peaked, and it would take huge amounts of data to overcome that. The advantage of this kind of prior is that, because it's conjugate, I'll end up with something that is also a beta or Dirichlet, and so things like averages, and all moments, of these things are analytically computable. That's really useful: I don't have to do any sort of numerics to figure them out; I can just write them down and use them.

[Student:] Ultimately you're just looking for two numbers out of this, right?

Yeah, or really one. A lot of ink to get two numbers out! Well, okay, to paraphrase what I think you're saying: you're asking whether the motivation for using this particular form is that it allows you to make interpretive statements about the values assigned to the prior. Yes, they're easy to understand. And also, I thought about doing an example where you have an alphabet of size 20 and you're given 10 samples. Bayesian methods using this machinery would still have sensible things to say about that: it would give you means, it would say everything is completely uncertain, dominated by the prior, but it would still make a sensible statement. Whereas maximum likelihood will give you a whole bunch of things as being zero, which of course is not true. But there are ways of estimating uncertainty there too, so I'm not being completely fair. Other questions?

[Student:] Do you get back a distribution over two numbers, or...?

No, you get back a distribution over these parameters, the P(x|q)'s, versus assuming, say, a fair coin. Yes. I think this will be clearer once we get through the examples, but it is exactly a distribution over these probabilities.
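The "all moments are analytically computable" point can be illustrated with the prior mean alone. A minimal sketch (plain Python, treating the alphas as the artificial pseudo-counts described above; not the CMPy API):

```python
def dirichlet_mean(alphas):
    """Mean of a Dirichlet (or beta, for two symbols) prior: each
    component is alpha_x / sum(alphas). With all alphas equal to 1
    (one artificial count per edge), the expectation is uniform:
    1 / alphabet_size."""
    total = sum(alphas.values())
    return {x: a / total for x, a in alphas.items()}

# Flat prior over a binary alphabet: one artificial count per edge.
flat = dirichlet_mean({"0": 1.0, "1": 1.0})        # → {'0': 0.5, '1': 0.5}

# A sharply peaked prior: a thousand artificial counts per edge still
# has mean 50/50, but would take huge amounts of data to overcome.
sharp = dirichlet_mean({"0": 1000.0, "1": 1000.0})  # → {'0': 0.5, '1': 0.5}
```

The two priors have identical means but very different spreads; the spread is what the sampling demonstrations below make visible.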
Yeah, so the P(x|q) factor isn't actually influencing the probability there, because the exponent, alpha minus one, is zero; right, it makes it uniform in this case. All right, I think this will become clearer once we go through the example and do some sampling.

So in CMPy, the way you set these things up is through InferEM, which assumes you have a particular model topology. In this case I feed it the biased coin machine I defined, with no data, so it represents this distribution over the probabilities: the prior. And analytically I know the expectations, the averages with respect to that distribution: the expected probability of seeing a zero given q is the alpha count for taking the zero edge out of q over the total alpha count for q, and there are similar expressions for all edges if you had many, many states. So this pattern holds through all of them.

Okay, and now we get to what I think will clear up some of these questions. What I'm going to do is derive the uncertainty that comes from the prior, and this is how you actually do it in CMPy. Basically I'm going to generate 2,000 samples from the prior: I'm going to sample probabilities from the prior and ask what the average is, and what quantiles give me 95 percent. So here I have an empty list, and then I go through a loop 2,000 times. This call says generate a sample, and what I actually get back is a start state, which doesn't mean much now but will when we get to the next example, and a particular machine: a machine with particular values for P(0|q) and P(1|q). Because this is just the prior, these will vary anywhere between zero and one, subject to the constraint that they sum to one. Okay, so I extract the probabilities, add them to the list, and then this last part just finds the average of the samples and the quantiles, so that 95% of the samples are between the two values. What we end up getting is an expectation of 0.49 (of course, how good this is depends on how many samples I drew; with 10,000 it would be better), and the credible interval is 0.027 to 0.972, which with lots and lots of samples would really be 0.025 and 0.975. So basically this is saying that our information about P(0|q) is completely uncertain: it's anything between zero and one.
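The prior-sampling loop just described can be reproduced without CMPy, since the flat beta prior Beta(1, 1) is just the uniform distribution. A standalone sketch (plain Python; the 2,000-sample count mirrors the lecture's loop):

```python
import random

random.seed(0)

# Sample P(0|q) from the flat Beta(1, 1) prior and summarize,
# mirroring the 2,000-sample InferEM loop from the lecture.
samples = sorted(random.betavariate(1, 1) for _ in range(2000))

mean = sum(samples) / len(samples)
lo = samples[int(0.025 * len(samples))]   # 2.5% quantile
hi = samples[int(0.975 * len(samples))]   # 97.5% quantile

# mean ≈ 0.5 and the 95% credible interval ≈ [0.025, 0.975]:
# the prior alone says P(0|q) could be almost anything in [0, 1].
```

With a finite number of samples the summaries wobble around the analytic values, just as in the lecture's 0.49 and [0.027, 0.972].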
We're going to change what we know about the probabilities And to do that we actually have to do this integral So and this is another reason for choosing the conjugate prior is that you can do this integral in many many cases you cannot do this analytically and So I mean in some cases they'll call this like the partition function problem So this ends up being very much like a partition function from statistical statistical physics But in this case we can actually and so this is actually the expression For the probability of the data given that so we have a whole bunch of these gamma functions, which are n minus 1 factorial And so the thing to notice here is that basically we have This term which was the normalization From the prior and then we have something that looks very similar, but just flipped over Which now has both parameters from the prior and The data so we have the number of times q is seen and the number times qx was seen and this will end up going into the denominator base theorem This will cancel out with the term from our prior and this term will flip over and we'll end up with something that again is just a beta distribution So again base theorem and then the actual analytic form for The posterior which now is conditioned on the data and so In terms of just looking at it squiddy-eyed it looks exactly the same as the prior right And that's the whole point of the conjugate prior. There's the same non normalization term There's the same requirement that Probability sum to zero or sum to one sorry and then you have everywhere instead of just alpha from the prior You have alpha plus data alpha plus data. 
Okay, and so this is another reason why we can think about these alphas as being sort of fake counts, and think about how big alpha_q is relative to n_q and how much each determines the probabilities. And if we do the expectations again, we can do them analytically. Before, the expectations of the two transition probabilities without data just had the alphas; now we've added the data counts in, and of course these are constructed so that they still sum to one.

The way you do this in CMPy is to generate some data. In this case I'm going to generate just 200 symbols, so I'm not giving it a lot of data, but I'm giving it some, and I feed the structure and the data to the InferEM class. Given that both of those are passed to the class, it knows this is a posterior distribution rather than a prior, because you're giving it data, and basically what it does is go and count the edges.

And so we can do the same thing: let's sample from the posterior. This is exactly like what we did with the prior, but now the samples use the posterior we just generated. So again we have a distribution over probabilities, taking into account our prior settings and our data, and this distribution will be different from the prior because it's been modified by the data. But the idea is exactly the same: each time through the loop we sample a machine and a start node, extract the probability, and then we look for the average and the 95% credible interval, which could be whatever level you want. And so here you get an expectation of 0.103, and if you remember the way I defined the machine, the true value was 0.1, so not so bad. But for this amount of data you wouldn't expect it to always be that accurate, and that's reflected by the fact that our 95% credible interval is between 0.06 and 0.15. So there's actually still a fairly large uncertainty, and we can plot this and get a sense of it.

I'm trying to make this really, really practical, so that if you have a data set you're interested in, you can just go in and blast away with this. Basically all I'm doing here is histograms: I do a histogram of the samples from the prior, a histogram of the samples from the posterior, and then I plot them both, and they should reflect the things we were just talking about. The blue is the prior, and basically you can see that it's at one all the way across zero to one; the reason it's not exactly uniform is that it's 2,000 samples representing a uniform distribution. These are the samples from the posterior, and the true value is 0.1. You can take the average with respect to this distribution as a point estimate; there's also something called maximum a posteriori, where you maximize instead, but I tend to just use the mean. And then the breadth of this thing is meaningful: it's how certain you are.

And then I think this is where we start getting to the things you've been thinking about in this class: now we have a model, we have some sense of what the transition probabilities are and how uncertain we are, and we want to do things like estimate h_mu and C_mu and all these things. How would we do that?
Well, we have distributions over these things, so we really want to take means and have credible intervals for all of them. We're going to do exactly the same thing we did for the transition probabilities, but for any function of them you want. And unfortunately it's not as simple as taking the analytic posterior mean of the transition probabilities and plugging it into p log p; that doesn't work. Well, with lots and lots of data it's fine, but in general it's best to at least think about it. The idea is that you sample a setting of the transition probabilities from your prior or from your posterior and then evaluate whatever function of those parameters you want. So this here is a particular setting of the transition probabilities; we plug that into the H_mu formula, we get a sample of H_mu, and we repeat this over and over again. So it's just a numerical average, a numerical sampling of these functions. And again we'll do it for the entropy rate, for the prior and for the posterior, just to look at what you might get. So, for the prior: when we sample, we're again just getting a machine and the node, but because it's a machine I can actually just call its entropy rate method. I get that value and add it to my list, and at the end of this loop I have a list of entropy-rate estimates. I find the average and I find the credible intervals (not confidence intervals, credible intervals), and we end up getting 0.72, but with anything between 0.078 and 0.99. So, quite uncertain. We'll plot these in a few slides to see what they look like. And one thing this turns out to demonstrate is that a uniform prior over the transition probabilities does not translate into a uniform prior over the entropy rate. So if your primary goal is estimation of entropy
rate, that's something to at least consider and think about. We do the same thing for the posterior. I won't go through it other than to point out that we get a different value for the posterior with the 200 symbols, and a smaller interval: it's collapsed a bit. Then we do our histograms, and it looks like this. This is the entropy rate, between zero and one. If you uniformly sample the probability of zeros and ones, you end up getting something that's very, very peaked near one. And this is actually, I think, related to something you've looked at: probability densities for iteration of the logistic map, which has this unimodal shape. The entropy curve is nearly flat near probability one half, the slope there is less than one, and the map from probability to entropy is two-to-one, so the two sides of the interval get folded together and the density gets squeezed: a uniform distribution over probabilities ends up looking like this when you pass it through the entropy-rate function. Okay, and then this is what the posterior looks like. The true value for this was 0.46, which is somewhere in here. So it's overlapping, but it's still fairly uncertain. I think that's the first example: a very simple model, a prior, a posterior, point estimates, but with uncertainty, and of course these uncertainties reflect how much data there is. If this were 2,000 symbols rather than 200, this would be much more sharply peaked. The other thing that's useful to think about, in particular for functions of these transition probabilities: if such a function is the critical thing you're trying to look at, it may be important to understand how the prior you set for the transition probabilities affects what it says about the entropy rate. I mean, it turns out that 200
symbols is enough to change it from this to this, so maybe it's not a concern depending on how much data you have, but it's worth considering. All right, so now on to the meat of the issue, which is dealing with the things you've been studying: epsilon machines and hidden Markov models. I'm going to start off with a couple of definitions, just to ground what we're doing relative to what you've already seen, and I'm going to restrict what kind of hidden Markov models we're looking at. So, definitions, and I think these should all be familiar to you. A finite-state edge-labeled hidden Markov model consists of a finite set of hidden states, a finite alphabet, and then a set of transition matrices, one for each output symbol. If you want the state-to-state transition matrix, you can sum all of these. This definition does not say the model has to be unifilar (I'll define unifilarity in a moment), but we will restrict to that case. It's an object very much like the golden mean or even process that you've already been dealing with. Okay. A finite-state epsilon machine has additional restrictions on it. It is a finite-state edge-labeled hidden Markov model, but it requires unifilarity: just as a reminder, for each state and each symbol there is at most one outgoing edge with that output symbol. This turns out to be absolutely critical for what I'm doing, and it's why I can do it analytically, because I can then assume a start state, trace through the machine, and actually count edges. What I'm going to be talking about can't be done for general hidden Markov models, where there can be multiple edges producing a zero going out of a state; then you can't do this uniquely. So we're going to require unifilarity, and then for an epsilon machine
we want the states to be probabilistically distinct: for every pair of states K and J there is some word whose probability is different depending on which of the two you start in. For the inference itself I'm not necessarily concerned about whether the model satisfies this and is actually an epsilon machine. That's for later: you can test and see whether what you get is something that actually obeys it. It could be just a unifilar hidden Markov model; it doesn't have to be minimal. You can give it any unifilar structure and it will still work. So unifilarity I require, and whether it's an epsilon machine or not is something to be tested and looked at afterwards. Okay, so how do we actually do the edge counting? We've seen the fair coin model, where we were thinking about counting edges and counting state transitions; now how do we handle hidden states? The basic idea here is: we assume this particular structure, we have some data that starts with this particular series of symbols, and we ask what edges were used, how many times, and what states were visited, how many times. Really, really simple. Unifilarity saves us, because we can just test it out and say: well, okay, if we assume that the start state is A (let's see if I can do this correctly), we start in state A, we do a one, and the next symbol is a zero. So the idea of the counting is that we have a start state that sits before the first symbol, and then we look at what comes next. I assume start state A, I see a one, I go here. The next symbol is a zero, and there's no edge with a zero here. So I know this absolutely cannot be the start state for this data. It's really, really simple, not complicated at all. But if we use B, well, we can do B.
We see a one, we see a zero, we see a one, we see a one. So if we assume start state B, we know exactly what path was taken through here, and we know how many times we saw B and how many times we traveled this edge versus this edge. So basically we're going to add an extra level of inference, in that we're going to do inference many times: assume this start state, then do inference of the transition probabilities if possible. In this first case it's not possible, because the likelihood of the data is exactly zero; it can't happen. Whereas in this case it's non-zero, so we would do the inference just for this one. For larger machines this ends up being something you do for each state, each of the five states in a five-state machine, say. Is that all clear? So it's really simple; how to handle it builds on what we've done already, but I think the basic idea of how we get the counts is really important. And again, it's critical that these are unifilar structures; we couldn't do this counting if they weren't unifilar. So if you feed the SNS to my code, it will say that it's non-unifilar and just throw an exception, because it can't count this way. So now we're going to do inference, but we're going to assume a model structure and an assumed start state. You could almost think of this as a bunch of different models: a structure with assumed start state A is one model, and the same structure with start state B is another model. And then we're going to do model comparison to choose between them. As I've been saying, you have to do this for every start state in the machine, at least try it. And there are some subtle issues that come in with what we're doing here, in particular when we get to the next lecture, which deals with a library of candidate topologies. Some of these are periodic structures: you have five states and it just goes around the five states, and all the probabilities are one by definition.
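The path tracing and edge counting just described can be sketched directly. This is a minimal illustration, not the CMPy code: the machine is a plain dict mapping (state, symbol) to the next state, which is exactly the representation unifilarity licenses, and the example topology is the golden mean process rather than the machine on the slide.

```python
from collections import Counter

def is_unifilar(edges):
    """edges: (state, symbol, next_state) triples. Unifilar means
    each (state, symbol) pair has at most one outgoing edge."""
    seen = set()
    for state, symbol, _ in edges:
        if (state, symbol) in seen:
            return False
        seen.add((state, symbol))
    return True

def trace_counts(machine, start_state, data):
    """Follow the unique path from start_state; return edge counts,
    or None if some symbol has no edge (likelihood exactly zero)."""
    counts = Counter()
    state = start_state
    for sym in data:
        if (state, sym) not in machine:
            return None  # this start state is ruled out
        counts[(state, sym)] += 1
        state = machine[(state, sym)]
    return counts

# golden mean process: A emits 0 (stay) or 1 (go to B); B must emit 0
edges = [("A", "0", "A"), ("A", "1", "B"), ("B", "0", "A")]
assert is_unifilar(edges)
machine = {(s, x): t for s, x, t in edges}

print(trace_counts(machine, "A", "101"))  # a valid path: counts returned
print(trace_counts(machine, "B", "101"))  # B has no 1-edge, so None
```

Feeding a non-unifilar edge set, say adding a second 0-edge out of B, makes `is_unifilar` return False, which is the condition under which the counting (and this whole approach) breaks down.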
There's nothing to infer there: the likelihood is either one or zero. A typical machine is some combination, where some states have a single outgoing edge with probability one and some have two or three edges, depending on the alphabet. So I actually end up having to divide the states of the machine into subsets, and the relevant subset is the states with more than one outgoing edge. There have to be at least two edges going out for there to be anything to infer, namely the probability of taking edge one versus edge two; the other states I don't worry about at all. So when I write theta here, I'm looking at just the subset of states with more than one outgoing edge, and I'm only inferring those. The states with one outgoing edge are still important; they act basically like a filter, because their probabilities are one or zero: every time the path hits such a state, the symbol has to go a particular way. All right, so how do we do that? This is why we write the likelihood this way. We end up with something that looks very much like our biased coin from before, but now we take a product over every state and every output symbol: the probability of each output symbol given the state, raised to the number of times we've seen that state-symbol pair, given the assumed start state A or B. Or the likelihood can be zero, as in one of the cases I just showed, where the path is simply not possible. So you get these two forms. We get edge counts and state counts, and I'll use this little bullet to denote summing over all outgoing edges. And yes, the likelihood can be zero. So now the prior. This was part of the motivation for doing the simple biased coin: if you take away this product and the extra indexing, we end up with something that just looks like the Beta
or Dirichlet distribution for the biased coin, but now we have a product over each of those S-star states, the ones with more than one outgoing edge. The fact that these are just products reflects that there's independence conditioned on the states. But for each such state you have a distribution over its outgoing transition probabilities; they have to sum to one, and each factor has to be normalized. So however many states you have with more than one outgoing edge, you get a product of that many Beta or Dirichlet distributions. And again we have these alpha parameters, which are just like what we had for the biased coin: they're like putting in an artificial count for seeing that edge. These are set to one by default in CMPy, which gives you a uniform distribution over the simplex conditioned on that particular state. So you end up with a bunch of uniform distributions over transition probabilities, conditioned on each state, and the product of all of those things. Okay.
Okay, so that's what I was saying here. And we again have an analytic expression for the transition probabilities given the prior: it's just alpha for that state and symbol divided by the alphas for that state summed over all symbols. I was tempted to write it as one over the size of the alphabet, but that's not always true: you could have a state where the alphabet has four letters but only two edges going out, so it might not be one over the alphabet size. And for the things you're not inferring, the topology of the machine fixes certain probabilities. I probably shouldn't write those as expectations with respect to the prior, but the point to emphasize is that some probabilities are just zero or one by definition of the topology, because there's a single edge going out of the state. All right. And again, the beauty of conjugacy: this is the evidence term, the normalization or partition function, and you can actually calculate it. We have the term from the prior and the term from the posterior, which is flipped: here we just have the alphas, here we have the alphas plus the actual counts, and these factors end up canceling out. We end up with a posterior that looks just like the prior but with the alphas replaced by alpha plus n. So you have these things here. Okay, and as you would expect, we know analytically what the posterior expectations are: again it's just the prior part plus the data part. And always keep in the back of your mind that all of this only makes sense if there was actually a valid path through the machine, given the model and the start state. Otherwise, basically, the likelihood is zero, the evidence is zero, and the posterior is just not defined. All right. So what about that annoying start state?
One thing to keep in mind is that, because of unifilarity, when we infer the start state we've actually inferred the whole hidden-state path. If you know the start state, and it's a unifilar machine, and you have observed data, then there's either a unique path through the machine or there's no path at all. And that's only because it's unifilar. So the start state seems annoying, but depending on what you're applying this to, it might actually be quite interesting: if the hidden states are something important, then knowing which way the process went through them is valuable. But to be able to compare topologies, we have to ignore the things we're uncertain about, like the start state and the transition probabilities; we have to integrate out this uncertainty, and that's what we're going to do now. So we're going to apply Bayes' theorem again, at a different level. The denominator of Bayes' theorem, when we were inferring the transition probabilities, was this normalization term: the probability of the data given the start state and the model, with the uncertainty in the transition probabilities averaged over. And the thing I want you to notice is that this looks very much like a likelihood. In the very first case we had the probability of D given the transition probabilities and the model; this is now the probability of the data given a start state and the model, with the transition-probability uncertainty integrated out. We're going to treat it as a likelihood and build another level of Bayes' theorem on top of it. How do we do that?
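Concretely, this second level of Bayes' theorem just normalizes the per-start-state evidences against a prior over start states. A minimal sketch, working in log space for numerical safety and assuming the uniform start-state prior discussed next (the function name and example numbers are mine):

```python
import math

def start_state_posterior(log_evidences):
    """log_evidences: {state: log P(D | state, topology)}, with -inf
    for start states whose path through the machine is impossible.
    Returns the posterior over start states under a uniform prior
    (a uniform prior cancels out of the normalized ratio)."""
    finite = {s: v for s, v in log_evidences.items()
              if v != float("-inf")}
    if not finite:
        raise ValueError("no start state yields a valid path")
    m = max(finite.values())            # subtract the max before exp
    w = {s: math.exp(v - m) for s, v in finite.items()}
    z = sum(w.values())
    post = {s: 0.0 for s in log_evidences}
    post.update({s: wi / z for s, wi in w.items()})
    return post

# e.g. paths from A and D both possible, differing only in whether
# the first step used a probability-0.9 edge or a probability-1 edge
post = start_state_posterior({
    "A": math.log(0.9), "B": float("-inf"),
    "C": float("-inf"), "D": math.log(1.0),
})
```

With those made-up numbers, D ends up slightly more probable than A (about 0.53 versus 0.47), which mirrors the even-odd example coming up, where two surviving start states differ only by a 0.9 edge versus a probability-one edge.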
There's a cost: every time you apply Bayes' theorem, you have to make some prior statement about the thing you're trying to infer. So here's the piece that came from inferring the transition probabilities, and what we want is the probability of the start state given the data and the model topology. But we have to introduce a probability of the start state given the model, which is our prior statement about it, and then we get this term in the denominator, which sums the numerator over all possible start states. Okay, and by default in CMPy this prior ends up being set to one over the number of states. I know that people have asked why I don't just set it to the asymptotic state distribution, and the point is that we don't know the transition probabilities. You have to think of this as a topology about whose transition probabilities we are completely uncertain. It actually turns out that the asymptotic distribution will often be reflected in this term anyway, so it's backwards from the way you're used to thinking about it. The idea here is: we have a structure, we don't know what state the machine was in when it started generating data, and we just want to figure out which one it was. We start off with the prior assumption that all of them are equally likely, and then we use this term, which carries the influence of the data, to determine which one is most likely. Okay, so now an example to ground all this, and hopefully it will all be clear: the even-odd process. Again, this is CMPy stuff, so I write a string up here. And I've made the even-odd process a little bit strange: if you just ask for even-odd from the CMPy machines you'll end up getting 50-50 here and 50-50 here, and I've changed that deliberately, because otherwise the prior expectations would look exactly
like the machine, and that I think is confusing. So I'm going to be generating the data with a machine where these are the actual probabilities; this is what I'm really using. Okay. So I create the actual machine, and this thing is just printing it out here; in Sage, I guess, you just do a normal draw. Another thing to look at: here I can demonstrate the difference between the set of all states, A, B, C, D, and the ones that actually have more than one outgoing edge, which are just A and C. This one has two outgoing edges and this one has two outgoing edges, whereas D's single edge is probability one by definition, and B's is one by definition. So I don't have to infer anything about those; they might make the likelihood for a given start state zero, but I don't have to infer their transition probabilities. Okay. So again, I'm going to go through pretty much the same steps I did with the biased coin, but I want to be a little more careful, to demonstrate figuring out what the start state is and which edges are actually being traversed. So here, using the exact same code, I feed it the EO machine; that gives me a prior, because I didn't give it any data. On the machine I'm actually going to set the current node to A, so it's going to be in state A, and the data is going to start generating from here.
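Generating data while tracking the hidden state, as the symbols iterator does there, can be sketched like this. To keep it short I use the golden mean process rather than the even-odd machine, and the API (a transition dict plus per-state emission probabilities) is an illustration of the idea, not CMPy's interface.

```python
import random

def generate(transitions, emit_probs, start, n, seed=0):
    """Generate n symbols from an edge-labeled unifilar HMM while
    updating the hidden state; return (symbols, final_state).
    transitions: {(state, symbol): next_state}
    emit_probs:  {state: {symbol: probability}}"""
    rng = random.Random(seed)
    state, out = start, []
    for _ in range(n):
        symbols = list(emit_probs[state])
        weights = [emit_probs[state][s] for s in symbols]
        sym = rng.choices(symbols, weights=weights)[0]
        out.append(sym)
        state = transitions[(state, sym)]
    return out, state

# golden mean: from A emit 1 w.p. 0.5 (go to B) else 0 (stay in A);
# B always emits 0 and returns to A, so "11" can never occur
transitions = {("A", "0"): "A", ("A", "1"): "B", ("B", "0"): "A"}
emit_probs = {"A": {"0": 0.5, "1": 0.5}, "B": {"0": 1.0}}
data, last_state = generate(transitions, emit_probs, "A", 200)
```

Returning the final state alongside the symbols is the point of iterating symbol by symbol: it lets you check, as in the lecture, what state the machine ended in after the 200 symbols.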
We're going to see what the actual code gives us back. And for the data I'm going to generate, it turns out that this symbols iterator is what you want: if you actually want the machine to update its internal state while it's generating the data, you use this rather than the plain symbols call. Each time it generates a symbol I add it to the data, and this EO machine updates its internal state, so I can print it out at the end and say: after the two hundred symbols I generated here, it was in state A again. Okay. Then we're also going to create a posterior, the same as the prior but now giving it the machine and the data, and again this just prints out what the last state was. This code is basically trying to say what the probabilities of each of the start states are, and what the state path is. This first one is for the prior, where there's no data, so it just doesn't have any information about those things; I'm just giving you an idea of what to expect when you do this. And on the Sage worksheet there's a method called summary string which, given a prior or a posterior, puts out a whole bunch of information. You'll be able to see that there; it just doesn't fit nicely on a slide. So this is to give you an idea of what's going on, and the point here is that the probability of each of the start states is, a priori, equal.
So it's one fourth, one fourth, one fourth, one fourth: we have no information about the start state. There's no information about the last state either, so a lot of the CMPy code will just give back None, or empty lists, for a prior, because there's no information to be provided for these things. Now if we do this for the posterior, we get something different. The probability of start state A, which we know was the true one because we set it, is actually slightly less than that of start state D. But then we can print out the state path: if the start state was A, it started in A and then went C, B, A, and so on. Start state D is also possible given the observed data, and it goes D, C, and so on. One interesting thing that's very common in this inference is that the first two states are different but everything else is the same, and the last state is the same. So the two paths, in terms of their edge counts, are actually very similar. That's not always the case, but it often is. And if we go back to the machine, we can get a sense of why A and D might both be plausible. One thing is that this edge has high probability, so the data almost certainly went this way, a one and then a zero, or something like that. And D can also emit a one; both of them have high probabilities. So the data is almost certainly consistent with one-zero or one-one, and you just can't tell the difference between the two. And the relative probabilities of the two start states end up reflecting the fact that both paths are possible, but this one goes through a transition probability of 0.9 while this one goes through a probability of one. So they're almost the same, but not quite. Given lots and lots of data you would never be able to tell, unless I told you, whether it was A or D. But in each case, for the posterior, we've ruled out state B, we've ruled out state C, and
we have state A and state D as possible, and we can infer transition probabilities and everything else from them. So here's very compact code: all on one slide, we're going to get samples of H_mu and C_mu from both the prior and the posterior, and I'm going to look at plots of them for this particular process to give you a sense of it. I'm not going to worry about the transition probabilities here, just the functions. So again we do 2,000 samples, and each time through, for the prior we generate a sample, and from that sample we get H_mu and C_mu and add them to their respective lists for the prior, because we used the prior here. Here we have samples from the posterior: we get the entropy rate and the statistical complexity and add them to their respective lists. So this creates 2,000 samples of H_mu and C_mu for the prior and the posterior, for this even-odd process. What do we get? Histogram plotting. The first thing is the entropy rate, and it's the same kind of picture, where blue is the prior. One thing that's interesting, and we'll see this again when we look at structure, is that the fair or biased coin could have entropy rates anywhere between zero and one, but once you start having structure and restrictions, very often the maximum entropy rate is much less than one, even for a binary alphabet. That's what we see here: what happens if you just sample uniformly from the transition probabilities?
You end up getting a distribution that looks like this, and again it's not uniform even though our distribution over the transition probabilities was uniform. Then the posterior looks like this, and the true value is 0.43. So again, it captures it. And this is a little deceiving; we'll see that it's not always like this, even when the data comes from a known source and we're inferring probabilities for that known source. The next step will be to actually infer the structure and then do all of these things on top of it. But what comes next depends critically on everything we're doing today, because all of this underlies the model comparison at the higher levels, and all of the uncertainty we have here, in the start state and the transition probabilities, is still there when you do model comparison. So you do want to propagate it if possible. Okay, and then for C_mu, same kind of thing: C_mu for the prior can be between one and two bits, and the posterior ends up here, which I think is pretty good relative to the true value. And again, what I'm not doing here, but what you could do, is take those lists of samples, find a mean, find credible intervals using quantiles, all those kinds of things; it doesn't have to be just a histogram. You can actually get a number for the mean of the entropy rate with respect to the prior; you're probably more interested in the mean with respect to the posterior, but it's good to think about both, just to see what's happening. Okay, and then the very last thing for today concerns how each of those samples was being taken.
So maybe if I go back to here: when I was doing these things, I was sampling one machine from the prior and getting its entropy rate. That machine has particular transition probabilities, so I was getting a specific entropy rate and a specific C_mu for that particular setting of the transition probabilities. These things are correlated, and they reflect the structure of the even-odd process. So if I plot H_mu and C_mu in the same plane, I'm getting the joint density over H_mu and C_mu, for the prior or the posterior. That's what this last slide is, and you'll see a lot of these next time, because they're quite interesting. So this is H_mu versus C_mu. The blue is the prior, sampling uniformly over transition probabilities, and you can see there's a kind of structure here, which reflects moving one transition probability between zero and one; it depends on which edge, and you move through all of these. This is again 2,000 samples; if you want, do 5,000 or more, it's all a matter of computer time. And then the green is what you get if you condition the transition probabilities on the data you've observed. So this is how restricted you are, given that you've seen 200 symbols from the even-odd process. And notice I wasn't even concerned with what the start state was: I just took the samples, and each time the start state was potentially different (it would have to be A or D), but it was giving me these values. And it's giving me the right values: 1.84 is the right C_mu, and 0.438 the right H_mu. If you give it 2,000 symbols instead, you end up getting something that's even more sharply peaked. And that's my last slide.
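That joint sampling of (H_mu, C_mu) can be sketched end to end for a small unifilar machine. As before I use the golden mean process instead of even-odd so the functionals have simple closed forms; the formulas below (stationary distribution pi_A = 1/(1+p), H_mu = pi_A * H(p), C_mu = H(pi_A)) are standard for that process, and the sampling loop mirrors what the slides do.

```python
import math
import random

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def golden_mean_functionals(p):
    """Entropy rate and statistical complexity (bits) of the golden
    mean process with P(emit 1 | state A) = p. State B is
    deterministic, so only state A contributes to the entropy rate."""
    pi_a = 1.0 / (1.0 + p)              # stationary weight on A
    h_mu = pi_a * binary_entropy(p)
    c_mu = binary_entropy(pi_a)         # entropy of (pi_A, pi_B)
    return h_mu, c_mu

def joint_samples(a1, a0, n_samples=2000, seed=0):
    """Sample p from Beta(a1, a0), either the prior (1, 1) or a
    conjugate posterior (1 + n1, 1 + n0), and push each draw through
    the functionals, giving correlated (h_mu, c_mu) pairs."""
    rng = random.Random(seed)
    return [golden_mean_functionals(rng.betavariate(a1, a0))
            for _ in range(n_samples)]

prior_cloud = joint_samples(1, 1)                  # the "blue" cloud
posterior_cloud = joint_samples(1 + 60, 1 + 140)   # a "green" cloud
```

The counts 60 and 140 are made-up stand-ins for observed edge counts; scattering the two lists of pairs in the (H_mu, C_mu) plane reproduces the prior-versus-posterior cloud picture on the final slide.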