All right, so this is lecture number two on the Bayesian inference of epsilon-machines work that I'm doing here with Jim as a postdoc. I'm going to start off with an overview of what we did last time, which went from the basics of statistical inference to a very quick introduction to Bayesian inference, and I'll mention a few points on that again to try to give everyone a better sense of what exactly is happening there. Then we went through a couple of examples: a biased coin, inferring transition probabilities, and then a unifilar HMM, where we looked at the even-odd process in particular. We used the known topology -- we knew what the edges were and what symbols they output -- so all we were doing was inferring the transition probabilities. One important part of that was that we also had to assume a start state, because if we assume a start state in a unifilar HMM then there's a unique path through it, and that allowed us to count edges. That was the step that let us connect very traditional statistical inference to something where the states are hidden: it works because we assumed start states.

So that was all for fixed topology, and now we're going to move on: we have one data set and a whole bunch of different topologies, so how do we decide which one is best? That's today. Part of what we'll be doing is using an enumeration algorithm by Ben Johnson, one of Jim's grad students, for enumerating topological epsilon-machines in particular, so that will be one type of candidate structure we can look at. We'll do an example again with the even-odd process, but this time we won't assume that we know the structure; we'll figure it out from this library. Then I'll show some plots, with more data and better ways of looking at things, for the golden mean, the even process, and the simple nonunifilar source (SNS). Of course we can't directly infer the SNS because it's not unifilar, but we can still use unifilar representations to estimate things, and that's part of the interest: it's an out-of-class example in terms of inference. I'll end with a more general out-of-class example and one with a non-stationary process -- cautionary tales about what you might expect, and how much you should worry, when you deal with real data and you don't know what it's coming from.

Again, I'll be doing coding examples, and there is a Sage worksheet up there that does all of these things, so for most of the coding examples you can go and play around. There are two notebooks, one for the previous lecture and a new one that does all of this material, so pretty much all of the computation is available for you to play with.

So this is the one-slide review of last time. Talking with Jim afterwards, one thing he wanted me to emphasize is this idea in Bayesian statistical inference of starting off with a prior distribution and moving to a posterior distribution. In the last lecture we did this on two levels. In the first one we were inferring transition probabilities, so we had a distribution over transition probabilities, assuming a start state and assuming a topology, but it wasn't informed by the data.
We wrote down a likelihood, and the combination of those two things, reweighted, gives an updated distribution over parameters that now takes the data into account. In particular, because these parameters are transition probabilities -- continuous parameters -- we had a special form called a conjugate prior. Having a conjugate prior meant that if we start with a beta distribution, or a Dirichlet distribution, we end up with a beta or Dirichlet distribution on the posterior side as well; the two have the same form. It's always this kind of thing: you start with a prior distribution and you move to a posterior distribution.

The same pattern appeared at the extra level, where, after having the transition probabilities, we wanted to know what the actual start state was. We weren't given that, so we wanted to figure it out from the data. It's the exact same idea: now we have a probability for each start state, and our prior was to assume each start state equally probable, so with five states each gets one fifth a priori. Then we had the evidence that came from the lower level, so the connections between the levels tell us how likely each start state is given the data. There's always this process of going from a prior distribution to a posterior distribution, and there's some model telling us how to evaluate different parameter settings. That's the very high-level picture: updating distributions based on data.

The particulars of the mathematics depend on what model you're looking at. We're looking at hidden Markov models and Markov-chain-like things, but you could do the exact same kind of thing for inferring the mean and variance of a Gaussian distribution, where you'd have priors and posteriors and the mechanics would be very similar. Hopefully that's a little more clear, but I really encourage you, if this is novel to you, to look at Wikipedia for the Dirichlet distribution, look at Bayes' theorem, and just grind through the mathematics of one example by hand. For some of these things, in particular when you have conjugate priors like we do here, you can actually calculate everything: take a prior distribution, modify it with a likelihood, and work it out. It's a useful exercise; it's not always fun, but it's useful and it's convincing, especially when the material is unfamiliar.
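As a minimal worked example of that by-hand exercise, take the biased coin from last time, with unknown probability $p$ of emitting a one. With a $\mathrm{Beta}(\alpha_1, \alpha_0)$ prior and data containing $n_1$ ones and $n_0$ zeros,

$$
P(p \mid D) \;\propto\; \underbrace{p^{\,n_1}(1-p)^{\,n_0}}_{\text{likelihood}} \;\times\; \underbrace{p^{\,\alpha_1-1}(1-p)^{\,\alpha_0-1}}_{\text{prior}} \;=\; p^{\,\alpha_1+n_1-1}(1-p)^{\,\alpha_0+n_0-1},
$$

which is again a Beta distribution, $\mathrm{Beta}(\alpha_1+n_1,\ \alpha_0+n_0)$: the prior counts just get incremented by the observed counts. The Dirichlet case, for a full row of transition probabilities out of a state, works the same way component by component.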
Okay, so we're going to infer structure now, and one thing I wanted to do is contrast what I'm going to be doing with what is more common and what you're more likely to have encountered. Speaking very generally -- and I'm sure there are exceptions -- when people say they're inferring hidden Markov models, they're usually assuming some fixed topology: a certain number of states, a certain number of connections, outputs of certain types. They're often considering non-unifilar topologies, so there isn't a unique path through the machine given a start state, and as a result you end up with a whole class of algorithms that numerically optimize the transition probabilities to agree with the observed data. These are the expectation-maximization algorithms, like Baum-Welch, and this comes up in statistics all the time when there's some hidden-state quantity you're trying to infer. There you really have to optimize over all possible paths; there isn't a unique path for an observed data set, there are many, many paths, hence the numerical optimization. But the main point is that they consider non-unifilar topologies and they usually assume a single, fixed topology.

What we're going to do differently, because we're interested in epsilon-machines and epsilon-machine-like things, is restrict to unifilar hidden Markov models -- we only consider those -- and use model comparison to infer topology. We'll have a whole set of candidates; in most of the examples we'll be looking at on the order of 1,400 topologies and asking, given a data set, which of these is most likely. We'll get a distribution, so we can take into account the uncertainty in structure in the same way we took into account the uncertainty in the start state and the transition probabilities last time. The sampling we did to estimate entropy rates and C_mu, to get a mean and credible intervals, we can do the same thing over structure. So potentially this is a different way of thinking about structural inference, but at another level it's a very conventional Bayesian application of Bayes' ideas: it's basically just model comparison on various levels.
The next question is what set of models we should use. I'm going to use one set in particular, but it doesn't have to be this one; it's certainly a good one, especially if we're interested in epsilon-machines, and it's a good place to start. The only requirement is that the candidates be unifilar HMMs, so they can be fed into this machinery. We're going to look at topological epsilon-machines, and I'll go through the definitions in a minute, but the motivation is that there was hard work by Ben Johnson, in the paper here, to find a fast way to enumerate every one-state, two-state, three-state, four-state machine for a given alphabet size, which is actually not a trivial thing to do. So we'll take advantage of that and be fairly brute-force about it: we'll look at everything we can up to a certain number of states, and that will be our set of candidate topologies. Really we're limited only by time and computational resources. As you'll see, the number of available topologies grows very fast with the number of states, so you can only do this within limits; in the long run a more creative way to choose sets of models would be a good thing, but this is a good place to start and is quite interesting already.

Some definitions, a couple of which we saw last time, to get us all on the same page. A finite-state, edge-labeled hidden Markov model has a set of states, a finite alphabet, and a set of symbol-labeled transition matrices -- stuff you're all familiar with from the class. If you want the state-to-state transition matrix, you sum over the symbols and get a T that's independent of the output symbols. We refine this to a finite-state epsilon-machine by the additional restriction that the machine be unifilar: for each state and each symbol there is at most one outgoing edge from state sigma_i that carries output symbol x. For the inference this is really critical; it's the reason I'm able to do that edge counting, and why what I'm doing applies to epsilon-machine structures and not to hidden Markov models in general. And then, for an epsilon-machine, each pair of states has to be probabilistically distinct: there's some word that has different probabilities from each of the two states.

Something we didn't talk about last time is topological epsilon-machines, and part of how these are enumerated is by looking at the topologies in a very particular way. The definition is that a topological epsilon-machine is a finite-state epsilon-machine whose transition probabilities, for each state, are equal on all outgoing edges. I gave two examples here of how this restricts what topologies you'll see in the set. In the first example, this state has one edge going out this way and another going out that way, with different symbols, so it's unifilar, and the probabilities are set equal, one half and one half, so every outgoing edge of that state is equally probable. This other state has only one outgoing edge, so it's set to probability one.
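In symbols, collecting the definitions just given, with $\mathcal{A}$ the alphabet and $\sigma_i, \sigma_j$ states:

$$
T^{(x)}_{ij} \;=\; \Pr(x,\ \sigma_j \mid \sigma_i), \qquad T \;=\; \sum_{x \in \mathcal{A}} T^{(x)} .
$$

Unifilarity says that for each state $\sigma_i$ and each symbol $x$ there is at most one $j$ with $T^{(x)}_{ij} > 0$; the topological condition says that every nonzero transition probability out of $\sigma_i$ equals $1/k_i$, where $k_i$ is the number of outgoing edges from $\sigma_i$.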
So this first one would be a valid topological epsilon-machine, whereas this second one would not, because we have one half, one half going out here and one half, one half going out there, and those two states are not probabilistically distinct. Even though in the inference we're going to ignore the given transition probabilities and infer them, the set of topological epsilon-machines, the way the enumeration algorithm works, is built with this test, and that's important to take into account. For example, if you have a process with support on all words and you're using topological epsilon-machines to infer it, the only machine in the set with full support is the single-state machine emitting zero and one, for a binary alphabet, so the inference will collapse down to that. There is nothing in the set that looks like a general first-order Markov chain; if you want that, you have to add it. Again, we don't have to use this set, but it's what we'll use for the first examples. It's a very structured set.

Even in this restricted set the numbers grow quickly. This is part of one of the tables from Ben's paper, and it gives you a sense: this is the number of states -- one, two, three, four -- this is the number of edges, and the number up here is the total number of topologies with one state, two states, three states, four states, broken out by edge count; for example, with three states and three edges there are two machines, and with three states and four edges there are 22. These are full-alphabet counts, meaning the machines must produce both a zero and a one, so, for example, the machine with a single edge that only ever outputs a one is not in here. So in terms of sheer numbers of candidate topologies, this is what you get for topological epsilon-machines.

(Question: you said each machine has to emit both a zero and a one?) Yes, that's what being a full-alphabet topology means for these counts. For example, there's only one one-state machine: the one that emits a zero and a one and returns to the same state. There isn't a machine whose only output is a one. In this case I'm doing binary examples -- it doesn't have to be binary -- but if the alphabet had, say, three letters, a full-alphabet machine would still have to be able to put out all three symbols. (And the number two in the table?) Sorry, the two just means alphabet size two. And yes, this gets much worse as the alphabet size gets bigger. Right -- there are all these restrictions: you have to have probabilistically distinct states while all outgoing edges are equally probable, and that constrains what you can have in the set.
Okay, so that's the set. One more thing I want to bring out, without belaboring it, is the way we're thinking about epsilon-machines here. Most of what you've done has been the history formulation of epsilon-machines: a process determines the epsilon-machine structure through an equivalence relation, grouping histories as equivalent when they have the same future morphs. There's another way to think about this that's been employed by Nick Travers, who's finishing his PhD now, in a whole series of papers related to synchronization and, in fact, to the equivalence between these two viewpoints. The other viewpoint is a generator formulation of epsilon-machines, and it turns things around: the epsilon-machine defines the process that can be produced by the topology. In a certain sense that's more the way to think about the inference we're doing here. We're going to infer a bunch of generator epsilon-machines that are consistent with the data we've seen; we don't apply an equivalence relation over histories or anything like that. It's much more in the style of the generator way of thinking about epsilon-machines.

So how do we do it? It's Bayes' theorem again, and even though the symbols change, the basic idea should be familiar by now. We're going from a prior to a posterior, so we have to specify a prior. We choose our set of candidate models, which I denote script M -- we'll use topological epsilon-machines, but you can choose whatever set you want -- and we need a prior probability that M_j, a particular member of that set, is the model. Then there's a likelihood-like term, the probability of the data given a particular topology, and here we're reusing work from earlier: the probability of the data given a single topology is exactly what we calculated in the previous lecture, and its value doesn't depend on the set we're considering; we write it this way just so the patterns look similar. The idea, again, is to update the prior to a posterior using these weights, and the normalization term is basically the numerator summed over all models in the set. Just like every instance of Bayes' theorem where the thing you're inferring ranges over a discrete set of objects, the normalization is a sum. The only time we did something more complicated was for the transition probabilities, where the parameters are continuous and we had to do integrals over multi-dimensional simplices. This part is straightforward and very similar to calculations we've already done: you have a probability and you renormalize it given some context.
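Written out, the update just described is

$$
\Pr(M_i \mid D, \mathcal{M}) \;=\; \frac{\Pr(D \mid M_i)\,\Pr(M_i \mid \mathcal{M})}{\sum_{M_j \in \mathcal{M}} \Pr(D \mid M_j)\,\Pr(M_j \mid \mathcal{M})},
$$

where $\mathcal{M}$ is the chosen set of candidate topologies, $\Pr(M_i \mid \mathcal{M})$ is the prior over that set, and $\Pr(D \mid M_i)$ is the probability of the data given a single topology, exactly the quantity we computed in the previous lecture.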
Okay, so what kind of prior will we choose? This is just one prior; I think it's fairly effective, but I could be convinced there are other ways to do this. We'll use a simple prior with a single parameter, beta, and an exponential penalty for some function f of the model topology: P(M_i) ∝ exp(−β f(M_i)). The default in CMPy, and in the code you'll be able to play with, is that f(M_i) just counts the number of states, so there's an exponential penalty on the number of states, weighted by the beta value you set. You could make f the number of edges, or something else. We can also set beta to zero, and then all topologies are equally likely a priori, although there are still some Occam's-razor effects at the lower levels, from the number of transition probabilities and states, that naturally prefer one topology over another.

This penalty is an extra ingredient, and for topological epsilon-machines it actually matters for very small data sets. For example, there are periodic processes in the set -- say a three-state machine that just emits zero, one, one. That's a special case, because we're not really inferring transition probabilities for it: they're all probability one by definition, so its likelihood is either one or zero. Even if the data isn't really from a periodic process, that machine can pop out and look very probable for small data sets, and the penalty can remove that effect: it penalizes assigning a three-state periodic process to something when you only have fifty symbols' worth of data. This becomes less important as you get more data, which is often the case for priors: they matter much more with small data and much less with lots of data. So this is one way to do it, and the idea, again, is to start with the simplest model you can and let the data push you to something that actually has high likelihood.

All right, so what do we do? (Question: instead of doing an enumeration, could you just set up one really big machine and manually prune edges, say wherever the counts are zero -- is there a benefit to enumerating topologies over that?) There is some power to the enumeration. One thing we'll find, for all of the examples -- we'll get to the golden mean, the even process, and the SNS -- is that when we use this enumeration technique there are thousands of unifilar representations that accept the data, and what you end up with, for the golden mean and the even process, is the correct topology with really high probability and the other candidates with much lower probability. For the SNS, which isn't in the class at all, what ends up representing it is less clear-cut.
I do think there's some power to looking at the set explicitly. When you put a prior over, say, six states all connected to all, you run into the issue of the statistical significance of never having seen a transition. The net result is that it's very hard for those inference algorithms to actually turn an edge completely off; there's a bias towards keeping a small epsilon of probability on a transition if your prior allows it to be there. So the trade-off is that here we step through topologies, explicitly testing settings where transition probabilities are zero, which focuses on the structural aspect -- the restricted transitions. You're looking at differently structured processes, as opposed to one big multinomial Markov chain over all those states with small transition probabilities. Practically, the inference has a very hard time pushing probabilities to zero with finite data, essentially by design, and that pushes you towards the unrestricted processes; it's a strong bias towards unstructured models. That's not to say you couldn't start with some generic machine to motivate a set -- any creative way to come up with a constructive set of candidates is good -- but the transition probabilities in a generic machine will never be weeded down to exactly zero; they only approach it, because of the form of the posterior.

In terms of cost, all the examples here run in about thirty seconds on a laptop, but if you really think there's a ten-state machine in your data, then enumeration becomes hard; there's a trade-off at some point.

Were there other questions? One was about approaches where you take a huge data matrix and try to block-diagonalize it to analyze the structures that are left -- I can't really comment on that one, I don't know that example -- and about community structure, the various competing concepts and algorithms for clustering the nodes of a network. Sure; that kind of thing again suffers from not being able to prune out what really shouldn't be there. The approach Chris is describing is very complementary to this and is more focused on seeing which structures survive, by which we mean turning this edge completely off, or that one. If you start with data from the golden mean or even process but use a four-state, all-connected machine, it's very hard to see that it's really a two-state machine, because you never see the transitions go exactly to zero; you just won't get it, whereas this way we pick it up very quickly. And I should say there's always this trade-off: all of these algorithms, given infinite data, will do the right thing. In a crude sense, a lot of what we're discussing is data efficiency -- using the data efficiently so you can get more interesting results from smaller data sets. All right, these are all good questions. So, what do we do with this posterior once we have it?
Again, we have a probability for each topology, given the data and whatever set we've chosen, and it's normalized. There are a variety of things we can do with that. The way I would tend to use it, though it's more computationally intensive, is to use the whole posterior over all models, and I'll go through pseudocode for this. We can quantify the uncertainty in structure, as well as the uncertainty in start state and transition probabilities that we quantified in the previous lecture for an individual topology; then, when we estimate things like h_mu and C_mu, we're taking into account uncertainty in structure, start state, and transition probabilities all at once.

Another way, which can be okay and sometimes is not -- I'll show examples where it isn't; the SNS data is one where it's dangerous -- is to take the single topology that is most probable in the posterior. If that topology carries 99.9999 percent of the posterior within the set, you're probably fine, and the remaining levels will still reflect their uncertainty; but if the most probable topology is not actually very likely within the set you're considering, then you're not capturing all the information you have, and the substitution may not be good. It's worth at least considering the full set if you can get away with it.

That motivates two pseudocode ways of sampling from these posteriors. If you want n_s samples, you: sample a topology from the posterior over your set; given that topology, sample a start state; given that start state, sample a set of transition probabilities, which gives you, effectively, an epsilon-machine; and then evaluate whatever functions you like on it -- h_mu, C_mu, and so on (an individual transition probability doesn't make sense as a function here, because it's attached to one particular topology). Then you can estimate whatever quantity you want, and each of these posterior distributions comes from the various levels: the lower ones from the previous lecture, the topology level from today.

The other way is to use just the single maximum a posteriori topology: there's no longer any sampling over topology, but you can still sample start states and transition probabilities to capture some of the uncertainty and evaluate your functions. Even less refined, you could take averages over these things and end up with one topology with averaged transition probabilities, averaged over the uncertainty in start states. There is a method in CMPy to do that, but be careful if you use it; a word to the wise, it's at least good to think about whether doing that is reasonable.

(Questions about the sampling.) No -- this is a sample of the function. These are all distributions: just as I might sample a number from a Gaussian distribution, here I'm sampling from the posterior on the previous slide. That's part of the point: the posterior over topologies is a distribution and it's normalized, so I can sample topologies from it, and that's what I'm doing.
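Here is a minimal sketch of that first sampling scheme in Python. The `posterior` object and its three `sample_*` methods are hypothetical stand-ins for whatever the inference code actually provides; the structure of the loop is the point.

```python
def posterior_samples(posterior, n_samples, func):
    """Draw n_samples of func(machine), propagating uncertainty at every level.

    `posterior` is a hypothetical object wrapping the model-comparison results;
    `func` is any function of a fully specified machine, e.g. one that returns
    h_mu or C_mu.
    """
    values = []
    for _ in range(n_samples):
        topology = posterior.sample_topology()                # P(M_i | D, set)
        start = posterior.sample_start_state(topology)        # P(start | M_i, D)
        machine = posterior.sample_machine(topology, start)   # transition probabilities
        values.append(func(machine))
    return values

# Usage sketch (entropy_rate is likewise a hypothetical method name):
# hmu_samples = posterior_samples(post, 5000, lambda m: m.entropy_rate())
```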
So here I'm literally sampling: give me a topology at random, drawn according to how probable it is; then, given that randomly drawn topology, sample a start state. That start-state sample depends on the calculations we did about how likely the different start states are given our particular data set. There is one single data set here, and all of these posterior distributions have been shaped by that single data set; the samples are taken from the posterior distribution at each level.

And this is general: whether you want to estimate the entropy rate, or C_mu, or any quantity related to epsilon-machines, the function is something you evaluate on an epsilon-machine, and you need to sample all of these levels to get a particular setting of the transition probabilities. So it samples structure, then start state, then transition probabilities; at that point you have a Mealy HMM, or recurrent epsilon-machine, in CMPy, and you can say, give me the entropy rate. I'll give an example of this, which is exactly the next thing.

Here's a practical example with something we've already seen: we're going to infer the structure of the even-odd process, as a straightforward application of what we've been talking about. I declare the even-odd process with slightly strange transition probabilities -- 0.1 and 0.9, and 0.3 and 0.7 -- just to make sure it doesn't look like the prior. I do that with a string, use CMPy to make an instance of it, and this is what it looks like. This is what will generate my single time series, which I then feed to the algorithm and say: infer the structure, estimate the entropy rate and C_mu, those kinds of things. This is exactly what you get to play with in the second lab on Sage.

To do the inference we have to do a couple of things. (How big a data set? I used five thousand symbols in this case. And yes, literally all of this is executed in the worksheet; all the code is shown.)

First I'll look just at the prior over these structures, without even looking at data. We import the Bayesian inference module, and this library-generator function is an interface to Ben's algorithm for enumerating structures. The command says: the alphabet size is two, so give me all binary machines with one, two, three, or four states. That gets assigned to the model set, and I feed it to the model-comparison object. Last time we used InferEM, which is for a single topology; this is a different class, and instead of a single topology we give it a whole set of topologies. Then I set beta, the penalty for structure size, and the verbose=True tag spits out a little summary as it runs. Because I didn't feed it any data -- notice I didn't generate any data, I didn't give it anything --
this is just the prior over these model structures. That's what the summary says: beta is four, there are 1,474 candidate machines, and they're all still possible, because there's no data to eliminate any of them; a priori they're all possible.

(Question: how do you choose beta?) What you set beta to influences, for example, whether you approach estimates of C_mu, as a function of data length, from above or from below. This particular value, beta equals four, is good for binary machines for approaching C_mu from below, and that matters because of how the number of candidate machines grows with size: there are many, many four-state machines, so lots of candidates with high C_mu, and if you want to start from below you need a higher penalty. In the way I'm applying it here, beta is just something you set, and your inference is conditioned on that. If you wanted another level, you could put a prior over beta and let the data tell you which beta is best; this could be done endlessly.

Okay, so no data yet. Here we pull out -- I'm calling it "MAP," which usually means maximum a posteriori, though in this case it's actually a priori because we haven't given any data -- the most likely machine given only the prior I've set. That's what this particular method does: it uses the prior I declared, pulls out the most likely topology, averages over the uncertainty in the start state (which for a single-state machine doesn't mean much), and gives the posterior means of the transition probabilities. What comes back is the single-state machine with fifty-fifty probabilities, and that's because, with a penalty on structure size, it's the most likely machine a priori. When we feed in data, this prior distribution is updated to a posterior distribution, and when we call the same methods the answers will differ, because the data restricts what we can get.

So that's what we do next: we actually generate the data. I generate 5,000 symbols and then set up my posterior; it looks very much like setting up the prior -- again binary machines with one to four states -- except that now I feed the data in, with beta equal to four and verbose set to true. When we run this, the output says: I considered all 1,474 machines, but only 175 of them are possible. What that means is that, for each machine not included, every start state in that topology was tried, the data was traced through, and at some point the machine arrived at a node that had no outgoing edge for the required symbol. So there are 175 options for what could be viable topologies.
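Putting that workflow together, roughly as described above -- the import path, class name, and method names here are my guesses reconstructed from the lecture, not a verified CMPy API, so treat this purely as a sketch:

```python
# Sketch only: names follow the lecture's description and may not match CMPy exactly.
from cmpy.inference import library_generator, ModelComparisonEM   # assumed names

models = library_generator(alphabet_size=2, max_states=4)   # Ben's enumeration: 1,474 topologies
prior = ModelComparisonEM(models, beta=4, verbose=True)      # no data: just the prior
print(prior.map_machine())                                   # the single-state, 50/50 machine

data = true_machine.generate(5000)                           # hypothetical call: 5,000 even-odd symbols
posterior = ModelComparisonEM(models, data=data, beta=4, verbose=True)
print(posterior.map_machine())                               # only 175 of 1,474 topologies remain possible
```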
So when we want a single representation again -- I'll also look at estimating C_mu and h_mu in more general form -- we can get it: the maximum a posteriori topology, averaged over the uncertainty in start state, with the posterior means of the transition probabilities. It gives you something to look at, and in this case, even with 5,000 symbols, it's actually quite good. This is the original data source, and this is what CMPy pops back at you if you just say draw this thing. It's deceivingly good: 5,000 data points is a fair amount of data, but for that amount it would not be unusual to see 0.32 here and 0.68 there, so there would be some fluctuations. Again, there are tools here to sample over all of that and say it's this particular value with this amount of uncertainty, if you want to go to that level. The point is that it captured the correct topology -- we all agree it's the same topology; the layout difference is just CMPy's default drawing, I didn't change any settings.

What if we want to do something closer to what we did last lecture, estimating C_mu and h_mu, but now using the full set of topologies rather than one? This code, if you look back at the previous lecture, looks exactly like what we did for a single known topology, except that our prior and posterior are now over whole sets of topologies; the same methods are used in similar ways. What's different is that we use one loop to sample both from the prior and from the posterior, and each time through we estimate h_mu and C_mu for the prior and h_mu and C_mu for the posterior, adding them to lists. But the prior is now over models and the posterior is now over models, so the machine that gets spit out each time through can be different, and will be different some of the time; when you calculate C_mu and h_mu they can change in ways they can't for a single fixed topology. The generate-sample call does the first algorithm I listed: sample from the posterior over model topologies, then a start state, then a set of transition probabilities, and that's what's returned (the node returned is the start state that was sampled, if you want it). We do all of these at the same time, and then we can look at the results. Some of this is hopefully useful code for people's projects: we feed the samples to histograms, the prior samples in blue and the posterior in green, and print them in the Sage worksheet.
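Summarizing those sample lists is ordinary sample statistics; here's a small sketch using numpy, where `post_hmu` and `post_cmu` are placeholder names for the lists produced by a loop like the one above:

```python
import numpy as np

def summarize(samples, level=0.95):
    """Posterior mean plus an equal-tailed credible interval from raw samples."""
    tail = 50 * (1 - level)
    lo, hi = np.percentile(samples, [tail, 100 - tail])
    return np.mean(samples), (lo, hi)

hmu_mean, hmu_ci = summarize(post_hmu)   # e.g. mean entropy rate with a 95% interval
cmu_mean, cmu_ci = summarize(post_cmu)
```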
In the Sage worksheet I've also computed the means from the samples and the credible intervals, just the way we did before; I don't have them in these slides, but they're in the worksheet, and they give you a mean entropy rate, a mean C_mu, and credible intervals as error bars. If we just plot the samples, the blue is the prior and the green is the posterior, and the true value, for the machine we put in at the beginning, is h_mu = 0.438; it does a good job of capturing that. One thing to notice from the prior is that, among these structured processes, the only topological epsilon-machine that can produce really high entropy rates is the single-state, two-edge machine; most of the others can't get that high, and we saw last time, looking at the even-odd process, that the maximum entropy rate you can get is somewhere around 0.6. So the broad a priori spread in entropy rate is a consequence of the single-state machine being in the set, and the fact that the posterior is peaked like this means the single-state machine is no longer very likely once the data has been fed in.

That's the entropy rate; I'll go through the rest a bit quicker. The same thing for C_mu: we take the samples created a couple of slides ago and look at those. This one is harder to see, and I'll show it another way, but for C_mu, because of the prior -- we know it pulled out the single-state machine as most probable -- this big peak is the single-state machine, with zero C_mu; this peak is log 2, the two-state machines; there's a little blip at log base 2 of 3, which you can see better in plots I'll show later. There's a decaying structure that's a function of how we set beta and what set of models we chose. But the posterior pops out the topology that is actually good, in the statistical sense, after seeing the data, and the true value is C_mu = 1.84. Again this ends up being a good approximation, partly because it gets the right topology and the transition probabilities are pinned down well enough that C_mu comes out well.

We can also look at C_mu and h_mu together, which I think is a useful thing to do: for each sample we plot the C_mu and h_mu of that particular sampled machine. The blue is the prior, and the green -- this little spot -- is the posterior. This whole line here is the single-state machine with different transition probabilities: when you sample from the prior, you most likely get the single-state machine with some set of transition probabilities, so most of these samples are single-state. The line at C_mu equal to one is the two-state machines, with different settings of their transition probabilities. This is only two thousand samples, so to really fill this in you'd need more. If I set beta to zero, you would see not much down here and a huge cloud a priori up here; that's part of how you decide what beta to set.
The approach I'm taking is that I want structure to prove itself through the data, and that's basically what the beta setting does. And the green is what you get once the data is applied: lots of topologies are ruled out, we have probabilities for each remaining topology and for its start states and transition probabilities, we sample all of those, and most likely it's just sampling this one structure most of the time, over and over. That's not always the case, and we'll see an example in a second, but here it gets an entropy rate of 0.43 and a C_mu of about 1.8, so the samples are focusing in on the right h_mu, C_mu combination.

(Question: how different would this look under different choices?) My guess is it would look really similar; it's partly a function of how much data you give the posterior. Since we fed it 5,000 symbols, my guess is this topology would still be very likely a posteriori and the posterior would look very similar; if I had only fed 200 symbols to the algorithm, the prior and posterior would be closer and much more diffuse. It's all a function of which models you choose, how you set your prior, and how much data you provide: the more data, the more the posterior can differ from the prior. That's part of why I like visualizing both the prior and the posterior -- to get a sense of whether it's capturing the kinds of things I want to capture.

Okay, so now an example with some of the other processes -- without the code, just figures; this is going into the paper Jim and I are working on. What I did was take a single time series for each process, of length 2^17, and analyze substrings of length 2^i for i = 0, 1, 2, ..., so length 1, length 2, length 4, and so on. I have this long time series, but I look only at the first symbol, then the first two, then the first four, each time pushing those symbols into the code and seeing what it gives back, then again for length 8, to get a sense of the convergence; that also gets at how the posterior changes with more data, reflecting the earlier question. In this case I'm using all one- to five-state binary topological epsilon-machines, which brings the set up to 36,600 candidates, I'm using beta equal to four again, and for each length L, when I look at a substring, I sample 50,000 times to estimate h_mu and C_mu, so I get a good picture of the actual posterior and prior. Then we just look at the patterns of how this converges for the different processes.

All right, so the first thing is the prior, and this figure shows what happens with different betas: I'm feeding no data, but changing beta from zero to two to four. These are the samples in C_mu-h_mu space, and then, for example, if you project onto C_mu and marginalize,
this is what the densities look like for those samples, and if you project the other way, this is the density of h_mu. One thing to note is that I'm plotting the log of one plus the probability, because some of these are quite different in scale and the plots are otherwise hard to read; some of the differences are actually bigger than they appear, but you want to see certain patterns. One is that for blue, which we already know is a big penalty for structure, a priori we have a strong preference for C_mu equal to zero -- the single-state machine -- with peaks at the two-state and three-state machines. If we decrease beta, you still see a small peak at zero and peaks at the two- and three-state values, but much more of the mass, in terms of the average structure in C_mu, is up here; and with beta equal to zero, no penalty at all, it's really pushed up that way, partly just because there are so many machines with that many states (yes, that's really what's going on). So these differences are partly a function of the set we chose -- we're being enumerative and including everything -- so we want to be careful with beta and understand what its behavior is.

The h_mu projection is less unusual. With beta equal to four we again get the single-state machine, so you can get the high entropy rates, because the single-state machine is there and very probable; with less penalty you get this more peaked shape: if you randomly sample all one- to five-state topological epsilon-machines with some penalty, you get this peak of entropy rates, with the uncertainty taken into account. (And yes, the prior is an exponential: e to the minus beta times the number of states; in this case the function is just the number of states.) So you end up with these two patterns, where the difference between beta equal to zero and two is not dramatic, but with beta equal to four you really see the single-state machine come out in the entropy rates. That gives you a sense of what happens without any data, before anything is restricted by actual observations.

So let's feed it golden mean process data, with beta equal to four. Maybe I'll go back quickly: what we're comparing against is the prior, the blue curves -- strongly peaked here, with a little mass up there. The previous figure was just to show how the prior changes with different beta values; for the actual examples with data I'm fixing beta equal to four. Here L = 1 is black, so basically we get the prior back; L = 64 is brown; and L = 16,384 is blue. The general pattern to see is that, going from black to brown to blue, you're converging in h_mu-C_mu space, and these dashed lines are the correct values
for these quantities. By 16,384 symbols you have, in this case, certainly identified the golden mean structure, and you're estimating C_mu and h_mu with quite a bit of accuracy. That shouldn't be too surprising, because the topology is in the set we're considering, so it should do this.

We can look more generally at the convergence. What I'm doing here is looking at each of those substrings, sampling again, and plotting the density as a function of length. Take C_mu in particular: this is a slice as you go from very small amounts of data to very large amounts. The vertical axis is the posterior distribution, the dashed line is the true value, and there's a little gray line, which you may or may not be able to see winding in here, which is the posterior mean of C_mu. The fact that I used beta equal to four means the estimates start very small and then converge to the right value from below. But it also shows that, for sample sizes that are not that large, this is a multi-peaked distribution; it's not at all clear-cut for C_mu, there's ambiguity in the structure. For the entropy rate, again there's a broad distribution at very small data lengths, and as you get more data it converges to the right value; what you see at the beginning is basically the prior, in both of these plots of h_mu and C_mu. So this is an example of a very nicely converging case: at 2^10, right here, only about a thousand symbols, we've already converged pretty well.

(Question: is there a difference between having one uninterrupted stream of data and multiple shorter strings?) Right, it definitely makes a difference. With one long string I basically only have to worry about inferring the start state and the hidden-state path once. If I'm applying this to data set one and data set two, with some unknown amount of time and dynamics between them, I'd basically have to do the same thing over again for each. One thing I could do is run the first inference and make that posterior my prior for the next data set, but I'd still be inferring start states and all of that. You would still get the benefit of increased amounts of data, but yes, here this is just one string of data; you do have to think about these things. It's a good question.

All right, so the even process. One thing to say is that we're going in order of increasing complexity: for the golden mean, the states really can correspond to something observable, a first-order Markov chain with a restriction; you can know which state you're in, there's no ambiguity,
because a one goes into this state and a zero goes into that one, so this state can be identified with having just seen a zero and that one with having just seen a one. The even process ends up having an extra level of complexity, because the states no longer correspond to single symbols: whole groups of histories correspond to each state, and there's no one-to-one correspondence. In principle this is much more complicated, and in previous work that Jim and I did, using first-, second-, third-, fourth-order Markov chains to infer things like this, you would see that you needed higher and higher order Markov chains to approximate the even process as you gave it more data, because it can't be captured by a finite-order Markov chain, whereas the golden mean can be captured fairly effectively. But now we're using a model class that can capture it.

Here's the same kind of coloring as for the golden mean, and this actually ends up looking really similar to the golden mean in all of its patterns. Again the true values are the dashed lines, this is the entropy rate and this is C_mu, and again we have basically the prior, then 64 symbols, then 16,384. I'd guess the details of the bumps and structure differ slightly, because different sets of topologies are being ruled out, but overall there's strong convergence, just as before. Again, this model topology is in the set we're considering, so we'd expect it to converge, and it does by 16,384 symbols. The convergence plots of C_mu and h_mu show good convergence. One thing I like about these is that I'm using a single data set and just looking at substrings: I didn't average over golden mean behavior or even process behavior, I generated 2^17 symbols, chunked them, and did the analysis on the chunks. So you get a sense of the convergence, and you could do this with a real data set -- you don't have to know what it is.

(Question: why does the behavior early on look more complicated here, in the lower plot, for h_mu?) I don't know exactly; a little, yes. I'll get to the number of accepting topologies shortly, but the details of exactly how topologies are being weeded out are something we're still looking into. Certainly there will be differences: they're different languages, the support is different, so the convergence is potentially non-trivial. In some ways it's amazing they're as similar as they are.

Now the last one, an example of something out of class: the SNS, the simple nonunifilar source, is non-unifilar, so I could not include it in the candidate set and feed it to my algorithm -- you can't follow unique state paths through it; it would get to a point and not know where to go. That's where the other algorithms, the expectation-maximization ones, do their optimization over paths. (What does "simple" mean here? Nothing deep: it's one of the standard example HMMs we use as a data source to see how different methods behave.) But this one is interesting for us precisely because it is not in the set of models we're using. So what does this thing do?
In this case you can actually calculate the entropy rate, so this dashed line is the true entropy rate of the SNS, and, perhaps surprisingly, the inference does get the entropy rate. That seems to be a pattern in a lot of cases: even though the data source is not among the models in the set we're using, the entropy rate is still captured. Of course it can't capture the structure, and there we see something extra complicated: even for the large data sets -- and this persists all the way out, as we'll see in a second -- the C_mu posterior really has two peaks. When you look at the details, there are quite a few topologies, each with on the order of five or eight percent probability, a set of three clustered together here and another set there. Exactly why that happens -- partly it's because the source is out of class -- but it's certainly non-trivial behavior, and we really see it when we look at the convergence: the behavior is much more complicated.

(Comment: you could create two epsilon-machines and then build another process that samples between the two.) Right, you could certainly elaborate the model class; I'll do an example at the end of the lecture where I do exactly that, taking a data set from one process and appending data from another process to the end, to see what you get. In this case, though, the source is just completely outside the model class. The entropy rate actually doesn't look too different, in terms of convergence, from the even and golden mean cases; where it's really different -- and the more you look at the specifics, the more different it is -- is that you get this complicated, increasing trend in C_mu as a function of data length, and you end up with something very multi-peaked. And again, even though it looks like two peaks, it's not just two models; quite a few models make this up.

One interesting question is whether things like this are typical of out-of-class examples. For cases where out-of-class means I have a ten-state unifilar HMM and I'm only using one to eight states, I'm not sure you'd get this kind of signal. So out-of-class won't necessarily produce these complicated features, and I'm not sure this is a reliable signature that out-of-class is happening; that's not a simple thing to figure out. In this case, though, it gives very complicated behavior. One thing that goes toward answering some of these questions is just looking at the number of machines that accept the data as a function of data length.
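The acceptance test itself is simple for unifilar machines: try every start state and follow the data symbol by symbol, rejecting a start state the moment the required edge is missing. A small self-contained sketch -- the dictionary encoding of a topology here is just for illustration:

```python
def accepts(transitions, data):
    """Does this unifilar topology accept `data` from at least one start state?

    `transitions` maps (state, symbol) -> next_state; a missing key means the
    edge does not exist, so the machine gets stuck there.
    """
    states = {s for (s, _) in transitions} | set(transitions.values())
    for start in states:
        state, ok = start, True
        for symbol in data:
            if (state, symbol) not in transitions:
                ok = False
                break
            state = transitions[(state, symbol)]
        if ok:
            return True
    return False

# Counting machines that accept prefixes of increasing length, as in the plot
# (all_topologies is a hypothetical list of candidate topologies):
# counts = [sum(accepts(t, data[:2**i]) for t in all_topologies) for i in range(18)]
```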
This starts, at the beginning, with all 36,600 machines, and then for each subsample we ask: given data of length one, or four, or 2^4, how many machines have a path for some start state? That is, there is some start state from which the machine accepts the data, rather than coming to a node that lacks the symbol it needs to produce and getting stuck. The common pattern, for the golden mean, the even process, and the SNS, is that in the beginning of course everything accepts, then there's a bend downward, and by about 2^6 you hit a lower bound where the support stabilizes. The golden mean and the SNS end up exactly the same: 6,225 of the 36,600 machines accept SNS data all the way out to 2^17 symbols, whereas for the even process there are fewer. Part of this is that, for the golden mean and the even process, it's easy to see how you could take a three-state machine and bury the two-state even process inside it, so the data just rattles around in those two states, plus further elaborations of that; it's less clear how the SNS does it, but the fact that the supports of these two are similar means this is driven not so much by the probabilities as by the support of what's going on. That's where Ryan's interest in automata and such will come in handy, in figuring out what's common about the languages of these machines relative to the actual data source.

(Question: how many of these are just full-support machines?) Well, that's the interesting thing: because they're topological, the only full-support machine is the single-state, two-edge one; that's the only coin flip in there. Everything else has some restriction, so it's much less trivial than you might think: these accepting machines are, in a certain sense, structurally full. Interesting, and to be investigated.

Okay, so I wanted to finish with a couple of fun examples -- again these are in the Sage notebooks -- of things you might run into that could be problematic. One is that the true model topology, as we've discussed, might not be in the set you're considering; that's certainly true for the SNS, but I'll also do an example where we take the even-odd process, which is a four-state machine, and only allow one- to three-state machines. What do you get back? That's an interesting thing to think about. The second is: what if the process is non-stationary -- the transition probabilities, or even the structure, change over time, or over space for spatial data -- what does the resulting inference tell you? One general point: when you're dealing with real data you want, as much as possible, to have secondary checks. If you have measurements from a real physical system and you come up with these models, you want to check your predictions -- what does your model say about the physical system -- and make sure it all makes sense.
Okay, so then I wanted to finish with a couple of fun examples, and again these are in the Sage notebooks, of things that you might run into. I just came up with a couple of situations that might be problematic. One is that the true model topology, as we've already discussed, might not be in the set you're considering. That's certainly true for SNS, but I'll do an example where we take the even-odd process, which is a four-state machine, and only allow one- to three-state machines. What do you get back? That's an interesting thing to think about. The other is: what if the process is non-stationary, where the transition probabilities or even the structure change over time, or over space if this is spatial data? What does the resulting inference tell you? And I guess one general point is that when you get to dealing with real data, you want, as much as possible, secondary checks. If you have measurements from a real physical system and you come up with these models, you want to check your predictions: what does your model say about the physical system, and does it all make sense? There may be ways that don't come out of the statistical modeling to rule out particular kinds of models, and those can then be implemented as reductions of your set of candidate models or as restrictions in the priors you're setting, all those kinds of things. But here are just two examples of what could happen.

The first is simply not enough states in the candidate set. The code is the same as before, except now we only have one-, two-, and three-state topologies, and we give it the even-odd data we generated before, exactly the same data, but now we don't have the correct topology in the set. Otherwise we're using beta equal to four again. When we run it, there are only 86 topologies with one to three states, and ten of them are possible given the data. Again we pull out the most likely one, and then average over start state and transition probabilities, and you end up with something that looks like this. So here is the topology that generated the data, with the right probabilities, next to the best model among the one- to three-state machines. And, I was playing around with the notebook this morning, when you look at it this inferred machine has about 99 percent posterior probability within the set we considered. That doesn't mean it's the right model; even when the posterior probability is really high, it just means it's the most probable in the set you've considered, given the data you've provided. That's the lesson I'm trying to push here.
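For a sense of what the computation behind those posterior numbers looks like, here is a toy re-implementation of the evidence calculation over a small hand-written candidate set. The two topologies, the dict encoding, and the assumption that beta enters as a symmetric Dirichlet hyperparameter on each state's outgoing edge probabilities are all mine for illustration; this is a sketch of the idea, not the notebook's actual code and not Ben Johnson's enumerated library.

```python
import math

# Two hand-written candidate topologies (edge probabilities are integrated out,
# so a topology is just state -> {symbol: next_state}).  Illustrative only.
CANDIDATES = {
    "golden-mean-like": {"A": {"0": "A", "1": "B"}, "B": {"0": "A"}},
    "even-like":        {"A": {"0": "A", "1": "B"}, "B": {"1": "A"}},
}

def log_evidence(topology, start, data, beta=4.0):
    """log P(data | topology, start) with a symmetric Dirichlet(beta) prior on
    each state's outgoing edge probabilities (Dirichlet-multinomial form)."""
    counts = {s: {x: 0 for x in edges} for s, edges in topology.items()}
    state = start
    for symbol in data:
        if symbol not in topology[state]:
            return float("-inf")      # this start state cannot produce the data
        counts[state][symbol] += 1
        state = topology[state][symbol]
    logp = 0.0
    for edge_counts in counts.values():
        k, n = len(edge_counts), sum(edge_counts.values())
        logp += math.lgamma(k * beta) - math.lgamma(k * beta + n)
        for c in edge_counts.values():
            logp += math.lgamma(beta + c) - math.lgamma(beta)
    return logp

def posterior(data, beta=4.0):
    """Uniform prior over (topology, start state); normalize the evidences."""
    logs = {(name, s): log_evidence(top, s, data, beta)
            for name, top in CANDIDATES.items() for s in top}
    best = max(logs.values())
    w = {k: math.exp(v - best) for k, v in logs.items() if v != float("-inf")}
    z = sum(w.values())
    return {k: v / z for k, v in w.items()}

# Data with 1s only in even-length blocks: the even-like topology should win.
print(posterior("011011011110"))
```

With the full library you would run the same calculation over every enumerated topology and every start state, which is exactly why restricting the library to one- to three-state machines can hand back a confident-looking answer even though the true four-state machine was never on the table.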
Someone asks how that compares to the posterior of the correct topology when it is included in the set. I think that the correct topology, in that set, also comes out high, but I think less convincingly, partly because there are more options with that many states. It is high, and you can check in the notebooks, it's really easy to print out, but I'm not sure it was 99-odd percent for that one as well. There are certain elements of this inferred machine that are averaging over aspects of the true one. You can see the 0.3 and the 1 and the 0.7, so you can see it capturing particular parts of the process, but somehow it's averaging over other parts, and certainly data that was generated by the true machine could be generated by this one, probably with similar probabilities. Still, it's interesting to see how different they are. Someone points out that they're pretty close: look at B feeding into A, there's a loop there, and if you get rid of that, skip that step, it comes back to something very like the true machine. Yeah, they should be really, really similar. So this is the best that the set of models we gave it could do, but if you include the correct topology, the inference will give that back. How probable this three-state machine ends up being in that case, I don't know, actually, but you could try it; it's really easy and will take two seconds of computer time. It ends up doing this kind of averaging: it's describing the statistics in some average sense, and I think that will also show up in the next example. And inevitably it has to do something like that if it doesn't have the right structure. The fact that we get this structure back just means there was a path, for some start state, for data that comes from the true machine to go through this one. If you want, you can go in and look at which start states were possible and what their probabilities were, and all those kinds of things. There are a lot of interesting things to be thought about here.

Someone asks: if you assumed, say, a ternary alphabet, would the algorithm run faster, and would you be able to capture this with a three-state machine? Well, your string is still a binary string, so you could do all kinds of weird things with this. I could take a binary time series and use a library generated for an alphabet of size three, and there would be lots of machines that accept it; they would just only ever use the zero and one edges and never touch the rest. I think that's the kind of situation where, when you already know what the answer is, it's easy to engineer after the fact how you could do better. But part of the point I'm making here is that if we really didn't know that this was the true source, and for whatever reason we could only use one- to three-state machines, say we're just limited by our computing power, this is what would come back, and it really would be the most probable. So it's just a word of warning. I know some of the postdocs like to call this Bayesian magic, but I don't think it's magic. It's really straightforward and very empirical: it does the best it can with the data you've given and the set of models you've allowed.

Then there's a question about the simple nonunifilar source still capturing Hμ: is that just general, because Hμ can be calculated from word statistics and is less model dependent in some sense, so that Hμ, and things that don't depend on the actual states, are more reliable? Very likely, yes. I can't answer with definiteness, but I think that's very likely. Then it really depends on the quantity, definitely. And on the question of how you compare between machines: there are a variety of ways of doing that. My most naive thought would be to look at the distribution of words, the support first, so look at the support of all future words of length ten, and then there would be a finer comparison of the probabilities, so you could look at some sort of relative entropy, and then maybe others, I don't know.
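On that naive word-distribution idea, here is one rough way it could look: compute the stationary distribution over length-L words for two machines, compare the supports, and then compare the probabilities with a relative entropy. The machines, the dict encoding, the word length, and the direction of the divergence are my own choices for illustration, not anything from the notebooks.

```python
import numpy as np
from itertools import product

def word_distribution(machine, L):
    """P(w) over length-L words, with the start state drawn from the stationary
    distribution.  Encoding: machine[state][symbol] = (next_state, probability)."""
    states = sorted(machine)
    idx = {s: i for i, s in enumerate(states)}
    T = np.zeros((len(states), len(states)))
    for s, edges in machine.items():
        for _, (nxt, p) in edges.items():
            T[idx[s], idx[nxt]] += p
    pi = np.full(len(states), 1.0 / len(states))
    for _ in range(500):                      # power iteration to stationarity
        pi = pi @ T
    dist = {}
    for word in product("01", repeat=L):
        total = 0.0
        for s in states:
            p, state = pi[idx[s]], s
            for x in word:
                if x not in machine[state]:
                    p = 0.0
                    break
                nxt, edge_p = machine[state][x]
                p *= edge_p
                state = nxt
            total += p
        if total > 0:
            dist["".join(word)] = total
    return dist

def kl_bits(p, q):
    """D(p || q) in bits; infinite if p gives weight to a word q cannot produce."""
    return sum(pw * np.log2(pw / q[w]) if w in q else float("inf")
               for w, pw in p.items())

gm   = {"A": {"0": ("A", 0.5), "1": ("B", 0.5)}, "B": {"0": ("A", 1.0)}}
even = {"A": {"0": ("A", 0.5), "1": ("B", 0.5)}, "B": {"1": ("A", 1.0)}}
p, q = word_distribution(gm, 4), word_distribution(even, 4)
print(len(p), len(q))   # sizes of the two length-4 word supports
print(kl_bits(p, q))    # inf: golden mean emits words like "0100" that even cannot
```

The infinite divergence here is just the support mismatch showing up, which is why looking at the supports first, and only then at the probabilities, is the natural order.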
An extra challenging thing is that you could think about doing this in the inference context: if you know the data source and you have an inferred machine, you can measure this, but when you don't have the true thing, it's unclear, and you have to come up with some sort of secondary test.

So let's see, not that much more time. Just a quick last example of something that I'm calling non-stationary, a very trivial version but interesting, I think, to go through. All I do for this one is create the golden mean topology and the even topology, and then the first 4,000 symbols are the golden mean and the second 6,000 symbols are the even process, just tacked right on. So the data changes topology midstream. What you would typically do when you have a data set is take all the data at once and see what you get, so in this case I used one- to three-state machines (you could use one to four, or one to five), gave it that data, and again used beta equal to four, as we've already seen. In this case 86 machines are possible, but only three accept the data. Again we pull out the best one and average over start-state and transition-probability uncertainty, and we get something that looks like this. The one interesting thing is that this machine does accept that data set, there is a path through it, and you can definitely see elements of both processes in it; it definitely has the even process in there.

This is actually the last slide, but one thing this motivates is the following. The underlying assumption in this model is that the structure is static: the transition probabilities are not changing as a function of position in your time series. That may not be true for what you're dealing with. One way to get around that would be to chunk up the data and look at the first 4,000 symbols, the next 4,000, the next 4,000, and so on. I've done this, and what you get in that case is golden mean, golden mean, golden mean; when you hit the chunk that overlaps the change point you get this mixed machine back; and then you get to the even part and it becomes even, even, even. There is then the question of what happens if those changes are really quick, which is much more complicated, but you can at least make these kinds of tests. In the notebook I also print out the hidden-state paths as a function of position in the time series, so you can see that state zero only occurs in the golden mean part of the data set, as far as I can tell, and then the machine basically rattles around in the other states. So there's a different use of the states depending on where you are in the time series. So, yeah, fun things to think about.
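And, for completeness, a minimal sketch of how a concatenated data set like the one in this last example can be built and chunked for the windowed check. The sampler, the particular transition probabilities, and the chunk size are my own stand-ins; each chunk would then be fed through the same inference you run on the full series.

```python
import random

# machine[state][symbol] = (next_state, probability), as in the earlier sketches.
gm   = {"A": {"0": ("A", 0.5), "1": ("B", 0.5)}, "B": {"0": ("A", 1.0)}}
even = {"A": {"0": ("A", 0.5), "1": ("B", 0.5)}, "B": {"1": ("A", 1.0)}}

def sample(machine, n, seed=None):
    """Sample n symbols from a unifilar HMM, starting from a fixed state."""
    rng = random.Random(seed)
    state = sorted(machine)[0]
    out = []
    for _ in range(n):
        symbols = list(machine[state])
        weights = [machine[state][x][1] for x in symbols]
        x = rng.choices(symbols, weights=weights)[0]
        out.append(x)
        state = machine[state][x][0]
    return "".join(out)

# 4,000 golden-mean symbols with 6,000 even-process symbols tacked on the end.
data = sample(gm, 4000, seed=1) + sample(even, 6000, seed=2)

# Windowed check: infer on each chunk separately; the winning topology should
# switch from golden-mean-like to even-like somewhere around symbol 4,000.
chunks = [data[i:i + 2000] for i in range(0, len(data), 2000)]
print([len(c) for c in chunks])
```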