Welcome back. I'm going to finish up your conceptual introduction to Markov chain Monte Carlo today. We're going to look at a couple of examples of fitting models in Stan where the model behaves quite badly, and Stan gives us very clear indications of that, which, as I said last time, is one of the best features of Stan as a Markov chain package. When things go wrong, it is very obvious. That's indispensable: you want the alarm. So I'm going to do a couple of examples of that. If I have time, I'll also do an example of a case where nothing goes wrong, but I want to show you something new. Now that we're using Markov chain Monte Carlo in these models, this opens up a lot more options for the kinds of models we can fit, because we're no longer leaning on the quadratic approximation, so we get some freedom to do other stuff. I want to show you an example with a choice of prior that we couldn't have used before. It gives us different kinds of shrinkage or regularization. And then I want to spend the rest of the time today talking about maximum entropy, which will be our conceptual introduction to next week, when we start on generalized linear models. It will be a set of principles that lets us make intelligent choices there. So when I left you, we were about to start with an example of a wild chain. Here's the setup. Imagine two observations. There are two data points. They have values minus one and one. That's the data, some anonymous variable. And our goal is very simple: we'd like to estimate the mean and standard deviation of these data, of the process that generated them. Let's stick with linear regression, because it's all we have so far, and set it up in map2stan. First I just make the vector y of data, then define the model. What I want you to notice here is that there are no priors, right? There are no priors in the model definition, and that means they're flat: the same constant value from minus infinity to positive infinity. That's what flat means. That's what it means in map, and that's what it means in map2stan as well. If you omit the priors, they are by implication completely flat. They're always there; if you don't tell it to do something else, they are by implication flat. So we're going to estimate alpha and sigma. No surprises in how to do it. We run it. I want you to look at the precis output and ask yourself what has gone on in these numbers. This is completely predictable, or at least after I explain what's happened, you'll understand exactly, but clearly something's wrong, right? The data are minus one and one. The mean is zero, right? You know the mean. So mu should be near zero. You could do this by eye; you don't need Bayesian updating to tell you that the mean of minus one and one is zero. So you could have done that ahead of time. Maybe you couldn't get the whole posterior distribution, which is the thing you'd get from this, but you should know where the posterior mean is. It should be near zero. It is not 404,584,496, which is what this particular posterior mean is. And the standard deviation is equally monstrous, right? And therefore the highest posterior density interval goes from minus seven million, is that what that is? Minus seven million to positive seven million? More than that: seventy million, sorry. So something's wrong, right? You shouldn't publish these results; that's my advice. Unless it's an Elsevier journal, then they deserve it. Sorry, some of you know my opinion about Elsevier.
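Here's a minimal sketch of the kind of model being described, using the rethinking package; the object names and sampling settings are illustrative, not the exact code from the slides:

```r
library(rethinking)

y <- c(-1, 1)  # the two observations

# No priors stated for alpha or sigma, so both are implicitly flat
m_wild <- map2stan(
    alist(
        y ~ dnorm(mu, sigma),
        mu <- alpha
    ),
    data = list(y = y),
    start = list(alpha = 0, sigma = 1),
    chains = 2, iter = 4000
)

precis(m_wild)  # monstrous means, tiny n_eff, Rhat well above 1
```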
Even if you look just at the marginal posterior summary, it gives you a warning. The number of effective samples is only 18 here. We drew 10,000 samples; that's what 1e4 means. And only 18, effectively. So something's wrong. And Rhat is 1.1, substantially above one. That's a big deal, a big difference: it's not converging. What has gone wrong here? Let me show you what the trace plots look like for this model. And you should run this yourself. Obviously your trace plots won't be identical, but they'll look something like this. There are two chains here, one plotted in blue and the other in orange. They're not doing the same thing. In particular, one of these chains takes these wild tours out to very extreme numbers, far from zero. Look at the scale: those are, you know, minus 5 times 10 to the eighth, right? These are pretty far out. It's taking these really long jogs, wandering around. That's for the mean, alpha, up there. And then sigma, down here, can't go below zero, and that's what this baseline is. But it takes these occasional tours, long excursions up to really high values, right? This is 2 times 10 to the 11th. So that's where we're getting the 70 million interval from. What is going on here? What's going on is that Stan is trying to do what you told it to do. You defined the model, and in that model there are flat priors, and there's not a lot of data, not a lot of information in the likelihood. Therefore, a lot of probability mass is way out at extreme values, because you started with a prior that says anything is possible, including infinity. That's what flat means. Flat is flat forever, and you can quote me on that. So when the likelihood is not dominating, a Markov chain sampler has got to sample out there at those extreme values to accomplish the mission you have given it. Its mission is to show you the whole posterior distribution, and it's trying to do that, which means it has to take these occasional jogs out to extreme places. In this case, actually, there's a worse problem: this ends up being an improper posterior distribution, because it doesn't integrate to a finite constant, because of the flat priors. But it won't be hard to fix. All you've got to do is a little bit of regularization. So let me try to summarize what's going on here. The problem is the flat priors; flat is flat forever. There's not a lot of data, although you know where the mean is. As a consequence, most of the posterior probability is out in the extreme tails of the posterior distribution, and the Hamiltonian Monte Carlo system is cruising out there in this very flat space. The hockey puck keeps going. And in the King Monty example from Tuesday, the car just keeps driving. It never reaches a point where it needs to turn around, because there's almost no curvature out there. It's flat. And as I said, in this case the prior is actually improper, which is something you don't always have to avoid, but it's a good thing to avoid, because it will sometimes lead to problems like this. This is easy to fix. Let's just add some weakly informative priors. And this is a case where I can show you how weak these are. I've been vaguely saying "weak" for a long time. Like, what does weak mean, Richard? Well, weak is always relative to the likelihood. And we've got a case here where the likelihood is extremely weak itself. There are only two data points, right?
And I'm asking you to estimate the mean and standard deviation of the Gaussian process generating them. Let's start with the intercept centered on 1, which is the wrong answer. That's why I'm choosing it, right? Because the true mean is centered on 0 for these data. We'll give it a standard deviation of 10. And then we're going to put a Cauchy, which is a weakly regularizing prior for a scale parameter like sigma, pushed up against 0, the minimum bound, with a scale of 1. I'll show you in a couple of slides what those priors look like and what the posterior looks like in comparison to them. Before I do that, let me show you the code. Exactly as you'd expect: map2stan model fitting looks exactly like map model fitting now, because we're fitting the same kinds of models; we're just conditioning through a different algorithm. Put those priors in. Remember, that Cauchy distribution is implicitly a half-Cauchy, because sigma can't have any probability mass below 0. It's got to be positive; it's a scale parameter. Stan is smart enough to figure that out: it will truncate the prior for you. And now look at these marginal posteriors. The posterior mean for alpha is close enough to 0, right? It's basically 0, a little bit above; that's basically Monte Carlo error, maybe a little bit of drift from the prior, but almost none. And now the effective number of samples is high and Rhat has converged to 1. Sigma is at a good value too, posterior mean right around 2. Let me show you the trace plots. Left is bad. If your trace plots ever look like the left, you've got a problem. That is not what you want. Those two chains are not converging to the same part of probability space, right? Over here on the right, the two chains are plotted in two colors, orange and blue, and they end up in the same stationary place. That's what you want to see. Does that make sense? Right is good, left is bad. And notice sigma is taking these occasional jogs up, but now it's only going up to about 60, rarely. The Cauchy tail is thick, so there is some probability mass way out there, but very little. This is King Monty's car taking the occasional excursion out to visit, you know, someone who lives in the woods. And then they spend an uncomfortable week, the King and his entourage, with somebody in the woods or something. Then they drive back to the city and spend the rest of the time, you know, hanging out with lobbyists and stuff. So this is good. This is a good-looking chain. This is what you want to see. It's doing the job right. And we can contrast the prior with the posterior; these are the sorts of plots you know how to make. You know how to plot density estimates for the posterior. You know how to sample from priors, too, because you know the prior, right? You defined it, so you can simulate from the prior directly as well. And in your homework for this week, I'm going to have you use Stan to simulate from priors without any likelihood in the model, so that you can do this for arbitrarily complicated priors, too. This is a good idea because, by contrasting the prior with the posterior, you can see visually what the model has learned from the data. So on the left we're looking at alpha. The prior was centered on 1, and the standard deviation of 10 gives you this. The posterior distribution is very peaked around zero, right over zero. It's moved over. Even with only two data points, that prior has had almost no effect at all on inference.
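As a sketch, the weakly informative version might look like this; again the names and iteration settings are mine, not the slide's:

```r
m_weak <- map2stan(
    alist(
        y ~ dnorm(mu, sigma),
        mu <- alpha,
        alpha ~ dnorm(1, 10),    # deliberately centered on the "wrong" value 1
        sigma ~ dcauchy(0, 1)    # implicitly half-Cauchy: sigma is bounded below at 0
    ),
    data = list(y = y),
    start = list(alpha = 0, sigma = 1),
    chains = 2, iter = 4000
)

precis(m_weak)   # alpha near 0, healthy n_eff, Rhat at 1
plot(m_weak)     # trace plots: stationary, well-mixed chains
```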
There's only the slightest ghost of it left in the quantitative estimates here. And likewise for sigma: the prior is that dashed curve. That's what the Cauchy(0, 1) looks like on the positive reals. The posterior distribution is in blue. The mean is right around two, the median just a little bit below that, which is about the right estimate; go calculate the standard deviation of minus one and one and you'll see what I mean. This has done its job right. These priors have been overwhelmed, and there are only two data points. That's how weak these are. Incredibly weak. But the priors are essential in this case for actually getting the Markov chain to work. That's the lesson here. You can't do these things without priors anymore. You can get away with it with map: for almost all the models we've done so far in this course with map, you could leave the priors off and get the same answer, right? I've just been training you, like a dog salivating, up to this point, so that when I ring the bell of a model, you think about priors. And that's a good habit to get into. But this is the case where it really buys us something. Now, of course, regularization is the other big deal. Flat priors are never the best priors, because we expect overfitting and we can damp that down a bit. So here's a case where even if the priors don't do any real regularization, they just make estimation possible. This is a case where you could calculate the posterior analytically, so you could check against that too. Okay, I should ask: are there questions about this? Yeah, Katrina? [Question.] So, let me try to repeat it back to you, and you tell me if I've got it. The half-Cauchy prior here has got a lot more probability up against zero; zero is the most plausible value according to it. And that's true, and this does cause some shrinkage towards zero. But not much. Most of the probability is in the tail of the Cauchy, which is why we're using a Cauchy here. The Cauchy tail goes on for a very long time, and there's a lot of probability mass in it. So actually most of the probability mass is way out there. Think of it this way. Cauchys are deceptive when you look at them, because they have no mean. Let me say that again: they have no mean. But wait, there's an average, right? No, there isn't. There is no average in a Cauchy process. And I think I've explained before why this happens. If you sample from a Cauchy distribution over and over again, and you just keep calculating the mean as the data accumulate, it never settles down. Take the first data point from a Cauchy: the running mean is just that value. After the second data point, it's the average of the first two. After the third, the average of the first three, and so on. Just compute that running mean and keep going. It will never converge to a stationary value. It will just keep drifting, forever and ever. That's the magic of a Cauchy distribution, and why it's really an uninformative prior, in a sense. Even though, yeah, there's more mass at low values and it causes some shrinkage towards zero, it's very little. We're going to value that shrinkage a bit later when we do multilevel models, though, because then we're going to want some shrinkage; it will actually be conservative, because it will push the varying effects towards having no effect. Does that make sense?
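You can see this drifting-forever behavior with a few lines of R; the scale of 5 and the seed here are arbitrary choices for the demonstration:

```r
# Running mean of Cauchy samples: it never converges,
# because the Cauchy distribution has no mean
set.seed(1)
y_cauchy <- rcauchy(1e4, location = 0, scale = 5)
running_mean <- cumsum(y_cauchy) / seq_along(y_cauchy)
plot(running_mean, type = "l", xlab = "sample", ylab = "running mean")
```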
So I'm punting a bit here; I wasn't planning to go into the magic of the Cauchy. The same is true for the variance, by the way: it has no variance, and you're like, how can a distribution not have a variance? Well, welcome to Bayes. This is where we live. The Cauchy distribution is very popular among Bayesian statisticians, because it's no sweat to work with it in Bayesian inference, but in classical frequentist statistics it's a nightmare, because it has no sufficient statistics. So you're just stuck; you can't do much with it in the classical framework. So sometimes we use it just to annoy anti-Bayesians. But that's not my purpose here, actually. There's a big literature on it; in simulation tests it has really good properties, both for model identification and for regularization. That said, an exponential here would work equally well. You can use an exponential; it works great, too. Okay. Are there other questions about this? No? Okay. Another example that's related. Let's look at a case where we deliberately make a model with an unidentified pair of parameters. This is like the left-leg, right-leg thing, which you will reprise in your homework this week. So we define a model up here where the mean of a normal distribution is a sum of two intercepts, alpha 1 and alpha 2. Since there are an infinite number of combinations of those two parameters that make the same sum, you're going to get a long ridge. Those parameters are going to be highly correlated in the posterior: a long ridge of plausible values, all of which kind of hug the sum that is the MAP estimate. You remember this from way back, weeks ago, before the course wore you down in the meantime. So, same idea. Put it in here. I'm going to put a weakly regularizing prior, the same Cauchy prior, on sigma, but I'm going to leave priors off the intercepts, to prove the point about how hard it is to identify their values. And while I go through this example, you're thinking: I would never do this. You think you wouldn't. But I occasionally forget and do this. As your models get more complicated, you can make mistakes and slip up and do this without realizing it. It's pretty easy in complicated model structures to accidentally create situations where there is a sum of parameters that cannot be identified uniquely. This will happen by accident, so you want to be able to see what's going on. There are other cases where you just get strong correlations between parameters, and you need some sort of regularization to help you get estimates and help the chain work. So I want to show you what this looks like when it goes awry. Again, the precis output gives you a good indication that something's up. Those means look weird, right? They're pretty big, pretty far from zero. And notice that they kind of sum to zero. That makes sense here because of the data; I've simulated it up there: 100 values with a mean of zero and standard deviation one. So the posterior mean should be about zero, and the sum of a1 and a2 here is about zero. That's what's going on with the massive standard deviations; this is the classic symptom of unidentified parameters. One effective sample, right? And Rhat is 4, which is bad. I told you to worry about 1.1; so what about 4? Yeah, again: don't publish this. Not even in an Elsevier journal. Sorry, Elsevier, but you deserve it. So what's going on? Well, you might have a hint by now.
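A sketch of the deliberately broken model, under the same assumptions as before about names and settings (the seed is arbitrary):

```r
set.seed(41)
y <- rnorm(100, mean = 0, sd = 1)

# Only the sum a1 + a2 is identified by the data
m_unid <- map2stan(
    alist(
        y ~ dnorm(mu, sigma),
        mu <- a1 + a2,
        sigma ~ dcauchy(0, 1)   # weak prior on sigma, none on the intercepts
    ),
    data = list(y = y),
    start = list(a1 = 0, a2 = 0, sigma = 1),
    chains = 2, iter = 4000
)

precis(m_unid)  # huge means and SDs for a1 and a2, n_eff near 1, Rhat around 4
```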
It's the flat priors again, and a lot of probability mass way out in the tails at implausible values. And of course, this is a case where the parameters are non-identified, and we've collapsed back to the non-Bayesian case, where without any prior information there are an infinite number of sums. What happens with a Markov chain when you do this is that you get random walks, basically, through the posterior distribution. Everything's so flat that the calculations break down. So, what I want to show you: you're looking at the trace plot, and there are two chains here in different colors. They're just doing kind of Brownian motion, definitely not converging, right? This is why Rhat is 4, because Rhat is about the chains collapsing to the same stationary spot in probability space. Sigma mixes well only during adaptation, right? Sometimes this will happen. And then, boom, as soon as you get out of adaptation, it collapses too, because now it's dependent upon the nonsense happening with the other parameters; it's the joint probability that matters. So it messes up every parameter in this case. That won't always be true. Sometimes some of your traces are fine while others are broken, but still, you shouldn't trust things like that. So this is another bad chain. Now, here's the thing I've been cautioning about before. If you use BUGS or JAGS, you can have perfectly good chains that still look like this. One of the reasons I don't like to use those packages so much anymore, and sometimes I have to, because there are models you can fit in those that you can't fit in Stan, is that Gibbs sampling and Metropolis-Hastings can devolve to look kind of like a random walk even when they're behaving well. So you don't get this alarm like you do here. Hamiltonian Monte Carlo should not look like this when it's working. This is an alarm bell. If anybody here has used JAGS in R: it's a great package, but it uses a combination of Gibbs sampling and slice sampling and some other samplers, and even when those are working fine, they can look like Brownian motion like this. And you're going to have to take like 500,000 samples and thin to every 50th, and some of you have done this madness, or we've all had colleagues who have. That's one of the reasons. It's a great package, but it doesn't have alarms like this. When Hamiltonian Monte Carlo does this, you know something's wrong and you need to fix it. Usually putting a little bit more information in the priors is enough. Sometimes you really have to fix the model otherwise. So here you can get this to behave reasonably well. The best thing, of course, what you really should do, is remove one of these parameters. Let's get that out of the way first: don't run this model, that's my advice. But even if you do, or you feel like you have to, prior information helps you identify things. Technically speaking, everything is identifiable, because you can have prior information on the distribution of one of these parameters versus the other. In this case we'll just give them the same weakly regularizing priors. Run it again. Now it's much better. Standard deviations are smaller. They still sum to zero. But now the chains converge. You get a bunch of effective samples, and the trace plots look great. One thing you'll notice when you start playing with these yourself, and I encourage you to run all the code in the chapter: run the bad version of this and the good version of this. The bad version runs slow, too.
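The repaired version, as a sketch, just adds the same weakly regularizing prior to each intercept:

```r
m_ident <- map2stan(
    alist(
        y ~ dnorm(mu, sigma),
        mu <- a1 + a2,
        a1 ~ dnorm(0, 10),      # same weakly regularizing prior on each intercept
        a2 ~ dnorm(0, 10),
        sigma ~ dcauchy(0, 1)
    ),
    data = list(y = y),
    start = list(a1 = 0, a2 = 0, sigma = 1),
    chains = 2, iter = 4000
)

precis(m_ident)  # smaller SDs, plenty of effective samples, Rhat at 1
```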
A lot slower than the good version. That's one of the symptoms: when you've done something wrong, Stan really chugs. When you have flat priors, the car keeps flying out toward infinity, and those simulations take a while. So one of the symptoms that something's wrong is that it goes really slow, abnormally slow. The caveat here is that Stan is slowest during adaptation, because it's trying a bunch of experiments. So don't freak out if it's slow during adaptation. Once it breaks into real sampling mode, it can move at a much faster clip. So it isn't like a lot of the other sampling strategies that run at the same rate throughout the chain; it gets better as it picks up speed. Does this make sense, what's going on here? All right. One thing you can do: you can extract the samples from this model, and you can re-identify the actual intercept by summing together the posterior samples, and see that we have estimated the actual mean of the data. The actual mean of the data is 0.0075, and the posterior mean for alpha is exactly the same thing. But you've got to sum together the posterior distributions of the two parameters. Make sense? You with me? This is not the most exciting, difficult lesson in the world, but you have to get in there. The whole purpose of this course is to do good applied statistics in a scientific context. You have to interact with the machine, and you can't pretend that there's just math and magic going on in there making everything come out right. You shouldn't trust your machine. So don't trust your machine. Maybe you do trust your field assistants, in which case: stop. Don't trust yourself either. The universe is out to get you, and the universe is hostile to inference, so it's amazing we've learned anything at all, right? Yeah, question. [Student: An issue I've had going through this chapter is that the first few time steps look like a random walk, and then it converges almost immediately, but the plot gets compressed, because the actual convergence region is small compared to those big early jumps.] Yeah, nothing that happens during adaptation affects inference. Those samples don't get used. When you extract samples, like on this slide, the adaptation steps are never in there. That's the gray part of the trace plot. None of that ever appears in your posterior samples, and it doesn't appear in the precis table or anything. The jumping around is Stan trying to figure out the contours of the multi-dimensional hockey rink in Hilbert space, right? Hilbert space is the n-dimensional generalization of Euclidean space, and that's where we do statistics most of the time. So yeah, that's what it's doing. It's doing experiments, so it's perfectly normal, if that makes sense. It's a good question; I should have talked about that earlier. Okay, I want to get to maximum entropy, so let's do this last example and then we'll get there. Let me show you two models that have different priors. Both of these priors are perfectly sensible, but they represent different kinds of prior information, different kinds of skepticism about regression coefficients, and both of them are very commonly used as regularizing priors. So I want to expose you to them, and show you that, now that you have the Markov chain sampler, you can use any old prior you like, for any reason, as long as it's not flat, and sometimes you'll want to.
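Before moving on, here's a sketch of the re-identification step mentioned a moment ago, assuming the refit model object from the earlier sketch:

```r
# Re-identify the intercept by summing the two parameters' samples
post <- extract.samples(m_ident)
alpha <- post$a1 + post$a2
mean(alpha)   # should match the sample mean of y almost exactly
mean(y)
```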
I'm going to do this example in the context of the old foxes data. You remember the foxes, urban foxes, right? Good band name. And we predict their weight using food and group size. You've seen these models before. The only thing that's different about these two models is that the one on the left uses the regular Gaussian priors we've been using so far to do shrinkage, or regularization, on the regression coefficients, and on the right I've replaced them with something called Laplace priors, named after our hero Laplace, one of the founders of Bayesian inference. And what does a Laplace prior look like? A Laplace prior is also called a double exponential. I've pictured one at the top here. The one I put in is centered at zero with a scale of one. It declines faster than a Gaussian; I have a slide coming up where you get to compare them much more easily. Why would you want to use a prior like this? It amounts to a different kind of skepticism about the sizes of regression effects. Notice that we've got a really sharp peak of prior probability at zero, and it declines much more rapidly as you move away from zero than a Gaussian distribution does. Gaussian distributions have this little shelf around zero, which is sort of like saying things near zero are all pretty plausible; then it starts to decline slowly, and by the time you get to the tail, things are highly implausible. What the Gaussian prior essentially does is create fairly uniform shrinkage at all distances, if your likelihood is Gaussian as well. The Laplace prior doesn't, because the Laplace prior instead says things really close to zero are probably actually zero: we expect a ton of stuff that's basically zero. And if something is really big and important, there'll be almost no shrinkage at all, in fact, because the tails flatten out fast, since they decline exponentially, and by the time you're way out there, there's almost no effect, unlike with the Gaussian. It's maybe hard to see that, but let me show you what it looks like when it happens. To put it in map2stan, there's this dlaplace, which is the Laplace density, and it looks just like that. It's a double exponential; you can think of it that way. It really is just two exponentials, in both directions. So, just to show you the trace plots. Everybody makes mistakes, and the trace plot is not everything, but it's the first thing to check. These both look healthy, right? This is what you want to see. I was talking to someone this morning who had been running a bunch of models in MCMCglmm, and he was looking at one of my manuscripts where in the appendix we have trace plots like this, and he said: when I read your appendix, that's when I decided I was going to start using Stan, because I had never seen trace plots like that before. You should thank the Stan team. They've done a tremendous amount of testing on this code. There are more lines of test code in the Stan project than there are lines of actual functioning code. That's how rigorous they are about making sure this stuff works. Really important, really reliable software. And you get beautiful Hamiltonian traces like this, which, yeah, you may have to be a nerd like me to see the aesthetics in these things, but you get warm fuzzies from nice stationary trace plots like that. Okay. So let me show you what happens. Against the foxes data: the Gaussian prior model is on the top, the Laplace on the bottom, and we're only looking at the beta coefficient for food, the average food available, predicting the foxes' weight.
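A sketch of the pair of models being compared; the foxes data ships with the rethinking package, and I'm assuming its dlaplace density here, with prior scales that are illustrative rather than read off the slide:

```r
data(foxes)
d <- foxes

# Gaussian priors on the coefficients
m_gauss <- map2stan(
    alist(
        weight ~ dnorm(mu, sigma),
        mu <- a + bf*avgfood + bg*groupsize,
        a ~ dnorm(0, 10),
        bf ~ dnorm(0, 1),
        bg ~ dnorm(0, 1),
        sigma ~ dcauchy(0, 1)
    ),
    data = d, chains = 2
)

# Same model, but Laplace (double-exponential) priors on the coefficients
m_laplace <- map2stan(
    alist(
        weight ~ dnorm(mu, sigma),
        mu <- a + bf*avgfood + bg*groupsize,
        a ~ dnorm(0, 10),
        bf ~ dlaplace(0, 1),
        bg ~ dlaplace(0, 1),
        sigma ~ dcauchy(0, 1)
    ),
    data = d, chains = 2
)

plot(m_gauss)    # both sets of trace plots should look healthy
plot(m_laplace)
```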
The priors are shown with the thin lines, so at the top we've got that old Gaussian prior centered on zero. It's at the same scale as the Laplace prior, and the posterior distribution is shown in blue, and you may remember that you've got a regression coefficient right around two. It's been shrunk a little bit from that by the action of the prior. Now, I've drawn this red line to help you compare the locations of the two posterior distributions. The Laplace prior, you saw it before, looks like a tent: there's a tent pole right in the middle and it's just draping down. Notice that the posterior distribution has a higher posterior mean when you use the Laplace prior. The reason is there was less regularization: this is a pretty big effect, and it's further out in the tail of the Laplace prior than it is in the Gaussian prior. The Gaussian prior has sucked it in, and the Gaussian prior will suck things in no matter how far out they are. The Laplace prior leaves big, powerful effects alone. But when you get close to zero, and that's what I'm going to show you next, it doesn't leave them alone, and this relationship switches. So the lesson here is going to be, before I show you the contrast: Gaussian priors regularize at all effect sizes. When effects are big, Gaussian priors are more conservative than Laplace priors, because if something's really powerful, the Laplace prior leaves it out there. But the opposite happens when the effect is weak: when the beta coefficient is estimated near zero, the Laplace prior is more conservative. Let me show you what happens. I'll put it up here. I randomly sampled half the fox data, so now we have less power and the likelihood gets flatter. Now watch what happens: we get more information from the prior. The relationship switches; now the posterior median for the Laplace model is to the left of the red line. The estimate has gone down, of course, because there's less information in the likelihood, so the priors have had more influence. But now the Laplace prior creates a posterior distribution which, as you can see, is not Gaussian. It's very pleasingly non-Gaussian, which is something you can get now with Markov chain sampling. And it has sucked the estimate over so that the posterior median is almost exactly zero. People in machine learning especially use Laplace priors for this reason. The idea is that you have prior knowledge that you expect, or only care about, big effects. If you only care about predictors that are strongly associated with the outcome, Laplace priors are a great way to find them, because everything is weakly associated with the outcome, always, in every data set. Nothing ever has a beta coefficient of exactly zero. Even if it's zero in the truth that simulated the data, you'll never estimate it to be exactly zero. There's always some association. The Laplace prior says things close to zero are probably actually zero, and it tends to suck them right up to it. So regularization is strong for small effects and very weak for big effects. You don't have to understand all the details of this, but I wanted to expose you to it at some point. This also has an effect on WAIC, through the effective number of parameters in these models; think about how that matters here, too. map2stan will also calculate WAIC by default when these models come out, and so we can look at pD, the effective... that should really be called pWAIC.
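In code, a sketch of pulling these numbers out, assuming the two fox models from the earlier sketch; compare() reports WAIC, pWAIC, and model weights:

```r
WAIC(m_gauss)                # WAIC along with pWAIC, the effective parameter count
WAIC(m_laplace)
compare(m_gauss, m_laplace)  # side-by-side table; expect nearly identical WAIC
```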
Anyway, it's the effective number of parameters for each model. We're going to go down this table, look at the effective number of parameters, and then at what happens in WAIC at the end. Laplace priors are more flexible as long as the effects are big, that is, unless the likelihood is concentrated near zero. The consequence is that the effective number of parameters is typically greater when you use Laplace priors than when you use Gaussian priors, and we see that in the first two models, which we fit with all the data. About three effective parameters for the Gaussian model, which is less than the actual number, because we've done some regularization, and almost four for the Laplace model. That's because the likelihood ended up in a region where the Laplace prior is really flat, and so it was able to slide around flexibly. You get more variation in the posterior distribution out there, less regularization; the model is more flexible, and it can encode the sample better with the Laplace priors. But the Laplace prior regularizes aggressively when the effect gets close to zero, and that's the bottom two models I'm showing here: now they have effectively the same flexibility, the same number of effective parameters. The effective number of parameters is a function of the data and the model and the interaction of them, because it's measuring, as I've joked before, the squishiness of the posterior distribution. It's a measure, theoretically, of the overfitting risk that arises from the flexibility of the model to fit the sample. There can be complex interactions like this, and I wanted to show you an example because, I'm being honest with you, this is how it is; this is the reality of model fitting. So Laplace fits the sample better when we use all the data, because it's more flexible, but notice that the WAIC values are almost identical. You should never get excited about a difference of 0.7 or whatever that is. Never, ever get excited about that; it's less than one data point. The Laplace models fit the sample better, but they're expected to overfit more, because pWAIC is bigger, and so in WAIC they come out basically the same: the two aspects of overfitting. With the Gaussian prior, one way you can think about this is that you expect many small effects near zero, all with similar prior probability. Whereas with the Laplace prior, you can think of it as expecting a few important effects and lots of things that are basically zero. So if you get really strong evidence that some predictor is strongly associated with the outcome, Laplace priors leave it alone, and that's why they're popular in machine learning. Questions? I know this is bound to feel a little mystical; I'm just trying to broaden your horizons here, and you'll see these things come up; the Laplace prior is used a ton in machine learning. [The question was: when would you use this?] That's an open question; you need some domain knowledge. I've tried, with the bottom two bullet points on this slide, to give you an idea. Do you expect lots of small things to cause the outcome? Or do you instead have some prior theory which tells you only one or two of these things are actually associated with the outcome, and the rest are just pretenders? That second case gives you the Laplace prior. That's the best I can do by way of horoscope advice, I think. But the usual way to think about it is as a prior that says we expect, or only care about, things that are strongly associated with the outcome; Laplace priors will give you that subset of predictors.
But if instead, like in biology and the social sciences, things are caused by a million things and you care about them all... well, it depends. All models are false, so discovering the truth should probably not be our mission. It's a question of what you're pragmatically going to do with the model once you get it. If you're looking for the most important thing, Laplace priors could make sense, even if the background theory says lots of things matter. That's a common situation: in marketing research, for example, Laplace priors are reasonably popular, because marketers are only looking for the things they can use to best manipulate us; those are the things they can get contracts on. Sounds dark. There's darkness that way. Anyway, we're not done yet, but we're at the end of the material relevant to doing your homework for this week. So: the homework's already up, please take a look. There are three problems. The first two are practice with getting used to Markov chains; I've designed a hopefully interesting and educational pair of problems to get you used to running Stan, and to force you to actually install it. I know some of you haven't yet, right? So do so. And in the third problem, you're going to analyze the Judgment of Princeton wine data that I introduced last week, and have fun with it. I've left it open-ended, because you guys are rock stars; you're going to do a great job with it. But it's an interaction problem: you're going to practice interactions and lots of other things with it, and try to figure out what's going on. Understand the mind of a Belgian judge; that's your goal. [Student: Can we get New Jersey wine?] That's a good question. Does anybody know if we can get New Jersey wine in California? It seems like I could be fired for even uttering that sentence. I bet New Jersey wine is fine. At least once you're drunk, I bet it's great. If someone can find some, let me know. I'll buy a box, and after the class ends we can have a tasting session or something. I'm an anthropologist; I have to consume anything someone puts in front of me. It's a disciplinary obligation. Roasted grubs, anything you like. All right, let's shift gears now. I'll tell you what this is once I get through the intro. Imagine there are buckets on the ground, arranged so they're equally distant from you, and you're standing in front of a pile of 100 pebbles. I know there are only 11 pebbles pictured here, but I got tired of copying and pasting; imagine there are 100, a large number of pebbles. And you're going to stand there and throw the pebbles into the buckets, and since this is a thought experiment, I can assert that every throw lands in a bucket, and that any particular pebble is equally likely to land in any of the buckets. I'll repeat that: each pebble has an equal chance of landing in any of the five buckets. We're going to throw them all, all 100 pebbles, into the buckets. Throw them all in there. And what could happen? There could be a bunch of different distributions of pebbles across buckets. We're going to count them up. Lots of things could happen. We could get all 100 pebbles in the same bucket, right?
Unlikely, but it could happen. If we do this exercise enough times, it will happen eventually; the sun will probably go supernova before it happens, but it might happen. And this could happen with any of the buckets: there are five ways to get all the pebbles into one particular bucket. Much more likely is some mixture across the buckets. Here's an example: you get 5, 22, 12, 37, 24 in the buckets, respectively. And a very large number of different distributions are possible, a huge number. We could calculate it, in fact, from the combinatorics. We're not going to do that, but we are going to be interested in the combinatorics of this problem. Why? Because this is the foundation of statistics. Think back to week one, when we drew marbles from a bag. Remember the whole premise behind Bayesian inference, according to our theory: we counted all the ways the data could arise according to our assumptions, absent any other information, and that let us rank the relative plausibilities of the different conjectures that could have produced the data. The conjecture that has the most ways to produce the data is the most plausible, and posterior distributions simply rank parameter values, combinations of parameters, that way. Combinations of parameters are conjectures about the machine producing the data, and the posterior distribution, probability theory, is just summing up possibilities. It's a convenient calculus for summing over all the stuff in the garden of forking data. So this is another thought experiment along the same lines; let me keep going with it so you can see where we're headed. In general, we're going to use the symbols n1, n2, n3, n4, n5 for the numbers of pebbles that land in each bucket, and then any distribution of pebbles across the buckets can be characterized with these numbers. You with me? There are a bunch of different distributions, but each of them can be characterized completely by specific values assigned to n1 through n5. And for any one of those distributions, there's an easy way to count up how many ways it could happen. What I mean is: think about the distribution I just had up there, I'll go back to it. There are a bunch of ways to get this distribution, because the individual identities of the pebbles don't matter, right? I numbered the pebbles on the first slide, so imagine they're still numbered. I could take one pebble from bucket one and put it in bucket two, and take one pebble from bucket two and put it in bucket one; it's still the same distribution, but it's another way to get it. See that? There will be a large number of different arrangements of the pebbles. The pebbles are individual unique snowflakes, just like the marbles in the bag before, right? If all you're doing is counting their colors, then there are different ways to get the same data. Here, all we're doing is counting the numbers of pebbles in each bucket, so there are a bunch of different ways to get the same distribution. You with me? I appreciate that this is very abstract, and Rashad is looking at me like: okay, I'm with you, but you'd better get to the point soon, Richard. That's fine, that's a reasonable reaction, but we've got a ways to go through the woods still. Okay. So, we can calculate this with combinatorics. At some point in secondary school you learned the following, right?
And then you promptly blacked it out. But this is the multinomial coefficient, sometimes called the multiplicity. W here is the number of ways to get the distribution defined by n1 through n5, where capital N is the sum of them all, 100 in this case. So W = N! divided by the product of the factorials: 100! / (n1! n2! n3! n4! n5!). That's the number of distinct arrangements of the pebbles that will give you the same distribution. You with me? So that's what we want: the number of ways. It's often called the multiplicity in combinatorics, the multiplicity of the distribution. So let's think about different distributions, and to make it cognitively easy, let's imagine you have only 10 pebbles. That will keep the numbers under control; otherwise my slides aren't big enough to hold the number of ways you can get these distributions. Let's start with a very unlikely distribution: all 10 pebbles land in bucket 3. If you were trying to do this, you could probably do it, I believe in you, but remember, we defined by assertion for the thought experiment that the throwing is random across the buckets, equal chance. There's exactly one way for this to happen, right? If all 10 pebbles land in there, they can't be in any of the other buckets, so there's one unique way for this to happen. That's intuitive, right? You didn't need combinatorics to tell you that. Agreed? Yeah? Now let's imagine taking two pebbles from that bucket and putting them in buckets 2 and 4. So now, before I reveal the answer, just think to yourselves, or you're free to blurt it out if you're really good at combinatorics: how many ways are there to get this distribution? Think about it for a second. There was exactly one before; how many for this? If you're good at factorials, you've already got the answer, you're chugging through it in your head; I can see Trina doing this, right? Think about it for a second. The answer is 90. There are 90 ways to get this, because, well, there are only 10 pebbles, but there are all these swaps: we can swap the pebbles in buckets 2 and 4 with each other, we can take either one of those and swap it with a pebble in bucket 3, and so on. There are 90 such arrangements that produce it. This is the wonder of combinatorics. Again, I'm in the nerdy aesthetic zone of this stuff, but let's keep going with this thought experiment. Let's take two more pebbles out of bucket 3 and put them in buckets 2 and 4. [Student question about whether the individual identities of the pebbles within a bucket matter.] The counting is over arrangements of the individually numbered pebbles, but the distribution only cares about the counts in each bucket; many distinct arrangements of numbered pebbles give the same counts. And we could shift this distribution over, so the pebbles were in buckets 1, 2 and 3 instead, and there would also be 90 ways to do that. Does that help?
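You can check these counts directly; here's a small sketch, computing the multiplicity on the log scale so the factorials don't overflow (the bucket counts below are the ones from the lecture's examples):

```r
# Multiplicity W = N! / (n1! * n2! * ... * n5!), computed on the log scale
log_multiplicity <- function(n) lfactorial(sum(n)) - sum(lfactorial(n))

exp(log_multiplicity(c(0, 0, 10, 0, 0)))  # all 10 in bucket 3: 1 way
exp(log_multiplicity(c(0, 1, 8, 1, 0)))   # move two pebbles out: 90 ways
# ...and the later distributions in the lecture can be checked the same way
```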
Okay, we take two more pebbles out of the middle bucket and put them in buckets 2 and 4. How many ways to do this? Same idea. Think about it; we went from 1 to 90 with a simple rearrangement of just a couple of pebbles. The answer is 1,260 ways. This is the wonder of combinatorics, the wonder of factorials: they escalate pretty rapidly. Lots of ways, lots of arrangements, huge numbers of ways. Let's keep going, because I am going someplace, trust me; you're going to learn something from this, hopefully something you'll like. Now let's again take two more out of the middle and put them on the ends. This is the flattest distribution we've seen so far. How many ways to get this? Again, I ask you to inspect your intuition. We went from 1 to 90 to 1,260. What do you think this is going to be? Twenty times more? Someone bid twenty times more; we should have a raffle here. There's no reason your intuition should be good at this, by the way; this is not a human skill. The answer is: twenty times is pretty close. 37,800 ways to get this distribution. And if we took yet two more from the middle and put them on the ends, we'd get a perfectly flat distribution, which is the next distribution we're going to think about. There are 113,400 ways to get that distribution, and this is the distribution with the greatest number of unique arrangements of pebbles that can produce it: the distribution with the greatest multiplicity, the greatest number of ways it can be realized, of all of them. So now I ask you: we've done this experiment, and you haven't counted what's in the buckets yet. I want you to bet on a distribution. What distribution do you bet on? You bet on this one, because you have no other information. This is your best bet: the number of ways you can get this distribution dwarfs every other distribution that isn't nearly identical to it; the rest have orders of magnitude fewer ways to arise. We went from 37,800 to 113,400, and as the number of pebbles goes up, the distances between these distributions get bigger and bigger as well. I first did this calculation for 100 pebbles and thought: oh my god, these numbers are too big to think about. And I wanted to do this guessing game too, so with 10 pebbles you're still within the realm of numbers people can hold in their heads. It escalates because of the combinatorics. So again, absent any other information: Ted has just thrown a thousand pebbles into buckets, and he's very skilled at this, and I ask you all to bet, given no other information, on which distribution has arisen. Bet on the one that can arise the most ways. Now, here's the interesting thing about this, though: all of the unique arrangements of pebbles are equally likely. I'll say that again: every unique arrangement of the individually identified pebbles is equally likely, because every pebble is independent of every other pebble. There are a bunch of distributions that can arise from this random throwing process, but some of the macrostates, that is, the distributions, can be made by many, many more microstates, that is, arrangements of the pebbles, than others. And those are the ones we bet on in probability theory. That's the whole essence of probability theory. It really is that stupid. It really is, and it's also that awesome. Given the information you have available, you define how the tossing works, right? And given that, you count up all the ways particular macrostates in the system could arise.
The ones that can arise through the most unique arrangements of the particles in the system are the ones you bet on. This is also how thermodynamics works in physics, exactly the same idea. There is a huge, uncountably large number of ways that the velocities of the molecules in a gas can be arranged, but the most plausible arrangement of them, the distribution of those velocities, turns out to be Gaussian. You can appeal to some mathematical thing like the central limit theorem, and that's perfectly legitimate, but betting on it empirically is just this, because it could be anything. The reason you see Gaussian distributions in nature, and I'm going to get to this a little bit later before you go, is that there are many, many more ways for real physical processes to end up in a Gaussian distribution than almost anything else, and that's why they appear. [Student: So if you had 20 million pebbles and you threw them into this thing, they would be almost exactly equally distributed across the buckets; how do you get a Gaussian out of this?] Hang on, I think eight slides from now we're there. But are you with me for the moment? This is like a revival meeting for probability theory, is what it is; I'm going to ask for a hallelujah in a moment. All right, maximum entropy: this is what we're doing, maximum entropy, for those of you who've caught on. So now, here's the trick with W. We defined it from combinatorics; it's the number of distinct arrangements of the individual pebbles in the system that give you the same distribution. And here's a trick that will relate this to some stuff you already know. It's convenient to work with the logarithm of this thing, because it's a really big number. So take the natural log, which makes it much more convenient to work with; then we're working with orders of magnitude of the counts instead, since the log of a count is the order of magnitude of that count. And we divide by N, to normalize for the number of particles. This thing, for large N, even for N around a thousand, is approximately: (1/N) log W ≈ −Σ (n_i/N) log(n_i/N). There's a box in the book where I show you how to derive this, if you're interested, and you all have the algebra skills to do it. This may look vaguely like something you've seen before, but in case you don't quite see it yet: n_i over N is the proportion of pebbles in each bucket, and this is a distribution. A set of proportions of events of each particular identity, what is that? That's a probability distribution, because they all sum to 1. And this is entropy. Entropy, all it is, is (1/N) times the log multiplicity of a distribution. That's all it is. It's a measure of the number of unique micro-arrangements of the system that produce that distribution, and distributions with lots of different ways to be produced, big W's, have big entropy. And big entropy is what we bet on. Does that make sense? This is where information theory comes in, and entropy is the foundation of Bayesian inference, at least in one of the philosophies that gets you there. There are like six others that also work, but I like this one, because it's just logical, and you don't have to pretend it always has the answers; it's uniquely, logically determined by the information you put in.
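As a quick numerical check, here's a sketch showing that the normalized log multiplicity approaches the entropy formula as N grows; the proportions are just the tent-shaped pebble example from before:

```r
# (1/N) * log(W) approaches -sum(p * log(p)) as N grows
log_multiplicity <- function(n) lfactorial(sum(n)) - sum(lfactorial(n))

p <- c(1, 2, 4, 2, 1) / 10              # bucket proportions
for (N in c(10, 100, 1000, 10000)) {
    n <- p * N                          # bucket counts at this N
    cat(N, ":", log_multiplicity(n) / N, "\n")
}
-sum(p * log(p))                        # the entropy these values converge toward
```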
This perspective on Bayesian statistics is largely, not entirely, but largely, due to Edwin Jaynes, pictured there as a young man, when he was a naval officer; he was later a professor of physics. It's his book, published posthumously in 2003, shown up there at the top. And Jaynes defined this thing that we now call the maximum entropy principle, or MaxEnt: the distribution with the largest entropy is the distribution that is most consistent with our stated assumptions and assumes nothing else. Why? Because it can happen the largest number of ways, according to our assumptions. If you bet on any other distribution, that implies that in your intuition you have other information that you have not put into the model, and therefore you should attend to that and think about putting it into the model. And then you'll get another maximum entropy distribution, and the one with the greatest entropy is the one to bet on. Now, what are you going to do with this? Well, I'll tell you in a moment, but does this make a little bit of sense right now? This is not intuitive at all, and one thing about that: intuition is just intuition. It's a guide to ideas, but it's not a way to test whether something's right. So if this doesn't feel intuitive to you, you're just normal. But I'm going to try, with some examples, to help you understand what's going on here. What we're going to use this for is a way to construct likelihood distributions, given knowledge about the outcome variable of interest. That's what we're going to start doing next week, and I'll give you some examples before you go today. But classically, and this is one of the ways Jaynes originally used it, it was used to construct prior distributions. If you have some information about a parameter before you've seen data, usually in the form of constraints, and I'll show you what that means in a second, you can choose a probability distribution that contains only that information and no other information. And what will that distribution be? It'll be the one with the highest entropy consistent with the constraints. We've only done unconstrained pebble tossing; we're going to do constrained pebble tossing in a moment, so hang on. And we'll do constrained pebble tossing for observations, too: we use this to construct likelihoods. So what does that mean? Before we see the actual data, we still know things about it. We know, for example, that it may be constrained to positive numbers, because it's a distance or a duration. That's information. So if you want to choose a likelihood, a distribution for the data, before you see the actual values, which is a good idea, you can use entropy to do it. And what that guarantees is that only the information you put in will be embodied in the distribution you choose, and nothing else; it is consistent with your assumptions, that is, the constraints. And it turns out, as a special case, that all of Bayesian updating can be rediscovered this way, as a special case of entropy maximization. If you have a flat prior, the posterior distribution is the distribution with the largest entropy consistent with the data as a constraint. I'll say that again: if you have a flat prior, then you get something called minimum cross-entropy; the posterior distribution has the minimum cross-entropy. What does that mean?
It means it's the distribution most like the prior that you can get while still being consistent with the new information you've added. And that's what Bayesian updating does. It's just counting, just like this: counting pebbles, little imaginary quanta of probability, of possibility, ways that you can draw marbles from a bag. That's the unglamorous basis of probability theory: just this, counting up stuff, and then launching rockets based upon the advice you get from it. I think this is pretty cool. Bayesian inference is a special case of entropy maximization, not the reverse. Entropy maximization doesn't need data as input: you can have constraints on the moments of a distribution, and that's information too. Maximum entropy, or more broadly minimum cross-entropy, is a much larger domain of logical inference; Bayesian updating is just the special case where you're after a posterior distribution. That said, we're going to keep working with posterior distributions like we always have, but hopefully this helps you understand something more. You with me so far? I've got more examples. Let's take an example of putting in some constraints. Well, we'll get there; first we'll start with the uniform, and I'll show you what happens there. So, ye olde information entropy. You enjoyed this, right? Information entropy, Shannon, all that. What I've tried to show you is that this expression, really all it is, is an order-of-magnitude representation of the number of unique micro-arrangements that can give you the distribution. That's what large entropy means. Entropy is a measure of the multiplicity of the distribution, and the distribution with the maximum entropy has the largest multiplicity of all the distributions that are possible. So we could ask what kind of distribution will maximize information entropy, and it turns out the answer, if you don't state any other constraints, is the flattest one; in general, it's the flattest distribution still consistent with the constraints. If all the p's are equal, entropy is maximized. I'll show you some examples of this in a second. This is something you can prove analytically, but I think I can give you the intuition for it on the next slide, with some pictures. The reason is that the flat distribution can happen the most ways. Think about the buckets and the pebbles: as the distribution got more and more even, the multiplicity went up; there are more and more micro-arrangements of pebbles that produce the same distribution across buckets. And once you're perfectly even, if you started stacking up pebbles on the end again, the entropy would start going down; the multiplicity would go down. So the flat distribution has the highest entropy, always. But sometimes you're not allowed to have a flat distribution; we're going to deal with that in a couple of slides. But what if you are? The easiest case to think about is the case where the only constraint is that you're in an interval. Say all we know about some variable is that it's constrained to an interval of values, with a as the minimum and b as the maximum. This defines what's called the uniform distribution, which you've been using as a prior for sigma for a long time now. In this case, for any continuous probability distribution bounded between a and b, the distribution with maximum entropy is a straight flat line: the uniform distribution has maximum entropy. The only information that goes into it is the bounds, and it's then the distribution with the most unique ways to arise, given a process that produces continuous values in this interval.
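Here's a small sketch of the flatness claim for the discrete case; the particular skewed and peaked distributions are arbitrary examples of mine:

```r
# Shannon entropy H(p) = -sum(p * log(p)) for a discrete distribution
entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))

entropy(rep(1/5, 5))                       # flat: log(5) ~ 1.61, the maximum
entropy(c(0.05, 0.10, 0.15, 0.30, 0.40))   # skewed: lower
entropy(c(0.02, 0.08, 0.80, 0.08, 0.02))   # peaked: lower still
```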
Just to show you some examples at the bottom: this diagonal one has an entropy of minus 0.19. Entropies can be negative, by the way, for continuous distributions. That's something we haven't dealt with before, but it's fine, no big deal; bigger is still better. The flat one has an entropy of zero, and this curved one has a negative entropy as well. You can't do better than zero in this interval. I encourage you to try it; it won't work. And I think this is intuitive: I haven't given you any information to tell you the distribution is skewed, or where the mean is. You don't know anything about that. So if you bet on it having any of that information in it, you're cheating. If you want to state the constraints, and only the constraints, and add nothing else, entropy demands that you choose the flat distribution, because it has the highest entropy, the largest number of ways to happen, the only distribution perfectly consistent with the constraints you stated at the start. Now you may look at this and say, you know, I don't really think it's uniform. Well, that means you know something more. Now you want to add a constraint. So let's do that next: let's add constraints to the distribution. Again, think about our pebbles. Put the buckets back down at the bottom, and imagine we have a lot more of them, something like 17 buckets here, numbered minus 8 to 8. We could have a million buckets and line them up. What are these? They're parameter values, or observable data values, an outcome variable. Now let's impose a constraint as we distribute the pebbles. We throw them so they have an equal chance of landing in any particular bucket, but when we count up the pebbles in each bucket, we compute the variance across buckets, and that variance must equal 1. If it doesn't, we empty all the buckets out and do it again. So there's going to be a lot of pebble tossing. Thankfully we have computers; better yet, we have calculus and combinatorics to do this for us, because you don't want to actually do this. I think I make a joke in the book that even if we employed the entire country of Poland throwing pebbles for a millennium, we probably couldn't do this empirically; the multiplicities are really hostile to the empirical approach. Now, the question is what would happen. This is like saying there's a subset of all the distributions of pebbles that can arise, and in that subset the variance equals 1. That is, we take the proportion of pebbles in each bucket, each bucket has a value, and we compute the variance of the distribution across those values. It's just like computing the variance of a posterior distribution, the same kind of operation; for a prior, it's just an operation on the random variable. The subset of pebble arrangements that has a variance of 1: those are the only ones we're considering. We've thrown all the others away. Now, among the distributions that remain, the ones with variance 1, which has the highest multiplicity? That's the one we should bet on, because that's the one that can happen the most ways consistent with the constraint we've stated. So if you have prior information that the variance is equal to some finite value, which is what we're interested in here, what is the distribution you get?
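To feel how hostile the rejection scheme is, here's a little sketch (mine, not the lecture's code) of one round of even pebble tossing into buckets valued minus 8 to 8, followed by the variance check. Even tossing lands you at a variance near 24, nowhere near 1, so you'd be emptying buckets essentially forever, which is exactly why we lean on the calculus instead.

    # One round of even pebble tossing and the variance check (a sketch)
    v <- -8:8                                        # bucket values
    toss <- function(n_pebbles = 1e4) {
      hits <- sample(seq_along(v), n_pebbles, replace = TRUE)
      p    <- tabulate(hits, nbins = length(v)) / n_pebbles
      m    <- sum(p * v)                             # mean across buckets
      sum(p * (v - m)^2)                             # variance across buckets
    }
    toss()   # about 24 under even tossing; the variance-1 subset is so small
             # that naive rejection sampling would almost never accept a toss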
I can tell from the eyes of some of you in the audience that you know the answer to this already; I think I've said it before in class. The answer is a Gaussian distribution. It won't be perfectly Gaussian here, because the buckets aren't perfectly continuous, but it's close enough for government work, and nature is quantized anyway; real numbers are a convenience of the mathematics. So here we revisit the Gaussian. The constraints behind the Gaussian are that it's an unbounded, real-valued variable, anything from minus infinity to positive infinity, although almost all the probability ends up concentrated in a pretty narrow interval, and that it has a finite variance. Sometimes you know the variance; but even if you don't, all you need to know is that there is some finite variance. Then what distribution maximizes entropy, which means it can arise the largest number of ways? The answer is the Gaussian. Just like on the soccer field with the coin flipping back in the beginning of Chapter 4: inevitably, that collection of sums converged to a Gaussian. Inevitably. Was that magic? No. It happened because there are all these wandering paths on the soccer field, each with its own little special-snowflake history, and the collective of them is moving towards some distribution, and as you run the experiment long enough, they converge on the distribution that can happen vastly more ways than any other. Those distributions all look Gaussian. They may not be exactly Gaussian, but they're so close you can't tell the difference. And that's because there are vastly, vastly more ways to get something bell-shaped by adding together random values than any other shape. That's also why gases have a distribution of velocities that's Gaussian. There's nothing magic about it. Whatever does happen, the distribution that's realized, determined by the real, deterministic physics, is bound to be Gaussian, because it's almost impossible for it to end up as anything else: there are orders and orders of magnitude more ways for the collective to be Gaussian than anything else. We could do better if we knew more, but isn't it great that, knowing as little as that, you can predict what the collective looks like? That's the value of maximum entropy. It gives us a principled, logical way to make that prediction without knowing the real physics of gas molecules. So physically, what's going on is that when you add up fluctuations from processes, the distribution of the sums converges to Gaussian. Again, not magic; there are just vastly more ways to realize the Gaussian shape than any other shape. And the Gaussian turns out to be, as you might suspect from this fact, the flattest distribution possible with the given variance. You can't squish it or stretch it in any way and increase the entropy while maintaining the same variance. Among all continuous probability distributions with the same variance, you can't change the shape from a Gaussian and increase the entropy. And it's because of this physical fact that there are just many, many more microarrangements of the coin flips in the pasts of the people on the soccer field.
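You can check the flattest-given-variance claim numerically with the same buckets. This is my own sketch, not the book's code: put a Gaussian, a Laplace, and a Student-t shape on the buckets, each scaled so its variance is approximately 1, and compare the discrete entropies; the Gaussian comes out on top.

    # Entropies of different shapes with (approximately) the same variance
    v <- -8:8
    norm_p  <- function(w) w / sum(w)
    entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

    gauss   <- norm_p( dnorm(v, 0, 1) )              # variance 1 by construction
    laplace <- norm_p( exp(-abs(v) / (1/sqrt(2))) )  # Laplace scale for variance 1
    student <- norm_p( dt(v / sqrt(3/5), df = 5) )   # t(5) rescaled to variance 1

    entropy(gauss); entropy(laplace); entropy(student)  # the Gaussian is largest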
So, what I've shown in the upper right, and I give you the expressions to generate this in the book: the Gaussian distribution is in blue, the bell curve in there, and the other three distributions, in black, are what are called generalized Gaussian distributions. They have different exponents and normalizing constants that stretch them, so you can get everything from something that looks like a Laplace distribution, that really peaked one, at one extreme, all the way down to this thing that's really flat in the middle. I don't know what to call that; it looks like a Bundt cake pan to me. Maybe people just buy those from the freezer section now, but it looks like a Bundt cake pan. All of these distributions have the same variance. They do. But the flattest one, the one that distributes the probability most evenly across all the possible parameter values while still having that variance, is the blue one, the Gaussian. Is that a little bit intuitive? It might be hard to see, because, again, there's nothing actually intuitive about combinatorics unless you're Paul Erdős or something like that. But at the bottom I show that you can plot across the shape parameter that adjusts among those distributions up there, the generalized Gaussian family. When it's two, you're at the quadratic shape, which is the Gaussian; remember the square in the Gaussian density function. And on the vertical axis I'm plotting the entropy of the distribution. All of these have exactly the same variance, and as you adjust the shape, entropy is maximized at exactly the Gaussian distribution, as I said. What's the significance of that? It says that, given the stated assumptions, this distribution has the largest number of ways it can arise. How many more? Vastly, vastly more. Remember how fast the multiplicities went up? The reason gas molecules always arrange their velocities in the Gaussian distribution is that the multiplicity for that shape is vastly bigger than for every other collective shape. That's what's incredible about it. Jaynes called this the entropy concentration theorem, and it's really amazing how fast the combinatorics grow for large numbers of, say, gas molecules or coin flips, things like that. But there's no magic involved at all; it's just multiplicities.
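That bottom plot is easy to reproduce from the known closed forms for the generalized normal family, which is what I'm assuming the slide's family is: fix the variance at 1, sweep the shape exponent beta, and compute the entropy. It peaks at beta = 2, the Gaussian.

    # Entropy of the generalized normal across shape beta, variance fixed at 1.
    # Standard closed forms: variance = alpha^2 * gamma(3/beta) / gamma(1/beta),
    # entropy = 1/beta + log( 2 * alpha * gamma(1/beta) / beta )
    beta  <- seq(1, 4, length.out = 200)
    alpha <- sqrt( gamma(1/beta) / gamma(3/beta) )   # scale giving variance 1
    H     <- 1/beta + log( 2 * alpha * gamma(1/beta) / beta )

    plot(beta, H, type = "l", xlab = "shape (beta)", ylab = "entropy")
    abline(v = 2, lty = 2)     # the Gaussian case
    beta[which.max(H)]         # very close to 2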
Let's do another example; there's just enough time before you go. Let's think about the binomial distribution, which we'll start working with in earnest next week, when we start doing logistic regression for real, and you're going to love it, because then we can get beyond Gaussian land and have some fun. So, reconsider the binomial. It is also a maximum entropy distribution, like the Gaussian, but for a different set of constraints. So again: buckets, tens of thousands of pebbles, and pebble throwing. (Sorry, Ted, you just sit there; that's why I keep referencing you. It's a bad habit; I should go over to the other side. Maybe I should use Paul, because I'm paying him, so he can't resist.) We're throwing pebbles into the buckets again, and we're going to reject some distributions depending on the constraints we impose. The constraints behind the binomial are that only binary outcomes are observable, and that what we observe are sums of binary outcomes, like numbers of blue marbles out of the bag: the binomial process. So if we've got any process where only binary outcomes are possible, and the expected value across trials is the same, another way to say that is the probability of a 1 or a 0 is constant, whatever it is, across trials, then the distribution that maximizes entropy is the binomial. Other distributions can be the right distribution, but if those constraints are the only things you know about the outcome variable, which often is true, then the only logically consistent choice for a likelihood function, and I'll reprise this on Tuesday next week, is the binomial. Any other choice of likelihood function implies additional constraints that you haven't noticed, and they will affect the inference, but you won't know what they are. So this is also a way to police your assumptions, and it's a great thing to do. The nice thing about this approach is that it recovers all of the most commonly used likelihood distributions in statistics; it's just a different way to justify them. Usually they're justified because someone told you so. Why use a binomial distribution for logistic regression? Because that's what it is; because I told you so. I'm trying to do better than that. I'm trying to say it's principled on logic: if all you want to use is the information you stated at the beginning of the problem, then maximum entropy tells you the only logically consistent assumption for a distribution. You may not like it once you realize what it is, but that means you knew something else and hadn't realized it, right? So let me give you some examples with the binomial; I've got just enough time to do this justice, I think. First, let's get back to the flatness issue, with an expected value as a constraint. You can get a flat binomial distribution. What you want to think about here is marbles again: w for white marble, b for blue marble. We're going to pull two marbles out of the bag, so there are four possible events: white-white, blue-white, white-blue, and blue-blue. And those are different events, right? Different orders. You may not care about order, but that's something you impose later, if you don't think order matters. So: distribution A is even, pictured here, the flat distribution. B is uneven; it's got more probability in the tails, as shown here. C is the flip of B, and D has more variance across the possible outcomes. All of these are possible; they can all happen when you draw marbles from the bag. But A is the one with the highest entropy, because it's the flattest; in fact, it's perfectly flat. And yet the expected value across the two draws is equal in all of these distributions: if you compute the expected value of A, B, C, and D, they're all the same, expecting one blue marble on average in two draws. I believe this is an exercise for the student, actually, and I show you the code in the book, so you should go run it. They all have the same average count of blue marbles across the two trials, guaranteed, the same expected value. But one of them has the highest entropy, and that happens to be the flat one, which can be realized vastly more ways than the others under a random process. So you bet on that one, because that's the only thing consistent with your assumptions, or rather, the thing most consistent with your assumptions. Makes some sense? Not guaranteed to work. Then let's look at something a little more interesting before I let you go.
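Here's the kind of check that exercise asks for, sketched in R. The four distributions are my stand-ins for the ones in the figure, so treat the exact numbers as illustrative; the point is that all four share an expected value of 1 blue marble, and the flat one, A, has the highest entropy.

    # Four distributions over the outcomes {ww, bw, wb, bb},
    # all with an expected count of 1 blue marble in two draws
    p <- list(
      A = c(1/4, 1/4, 1/4, 1/4),
      B = c(2/6, 1/6, 1/6, 2/6),
      C = c(1/6, 2/6, 2/6, 1/6),
      D = c(1/8, 4/8, 2/8, 1/8)
    )
    blue <- c(0, 1, 1, 2)                    # blue-marble count of each outcome
    sapply(p, function(q) sum(q * blue))     # expected values: all exactly 1
    sapply(p, function(q) -sum(q * log(q)))  # entropies: A, the flat one, wins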
Now let's use a different expected value. With two draws from the bag, an expected value of one blue marble gives you the flat distribution; one means even, right? The probability of blue is one half when the expected value is one. But imagine the probability of a blue marble was something else, say 0.7, because there were more blue marbles, and someone had told you this; it was your prior information from the factory that 70% of the marbles are blue. Now the expected value is 1.4. This is a constraint, and we can solve for the maximum entropy distribution. I'm leaving aside a lot of the math here, because these problems usually involve using something called a Lagrangian to solve for a distribution that maximizes a function. Some of you have had advanced calculus courses; you may remember variational analysis, where you solve for functions that maximize a function, instead of just using calculus to solve for values of a variable that maximize a function. You can solve for whole distributions that maximize a function. Sounds fun, right? And it is great fun, and lots of important problems in biology are variational problems like that. Maximum entropy is one of them, because we're solving for the distribution that maximizes the entropy function, and there is a way to do that, and it usually involves a Lagrangian. But we're not going to do that here. There are boxes in the chapter where I give you another route to justifying these, proving they are maximum entropy distributions, so if you're interested, go take a look; if you honestly don't care, you can leave that aside. So let me show you, in an informal way, what happens. What I've done here, and I give you all the code for this in the book, is simulate a very large number, I think 100,000 or so, of distributions over two draws of marbles that have expected value 1.4. That's the only constraint: binary outcomes, two trials, and we reject all the distributions that don't have expected value 1.4. You get a huge number of them, and the plot on the left shows the density of those distributions across their entropies: for each realized distribution consistent with the constraints, you can compute its entropy. Then I show you four examples from that. The first one, A, which has the highest entropy observed, is actually exactly the binomial distribution you would calculate from the binomial formula with expected value 1.4, that is, if you plug in p = 0.7 and two trials. And that distribution has the biggest entropy of all the random ones simulated here. Why? Because the binomial is the maximum entropy distribution for these constraints; there's a box that proves it in the book. Then here are some of the others, descending in entropy down the curve. Notice there are lots of distributions really similar to A that also have high entropy; this is part of what we call entropy concentration: lots of neighboring distributions are also really likely, with high multiplicity and high entropy. B looks like this. It's less even; that's what I want you to see. As we descend in entropy, the distributions get less even. Distribution A, the binomial, is the most even assignment across the different outcome possibilities that is consistent with the stated constraint of the expected value being 1.4. C is even less even. They get increasingly more beautiful as they get uneven, I think; they're nice. But as we descend in entropy, unfortunately, they get unlikely. And then D assigns almost zero probability to one outcome, and that's why it has such low entropy. Does this make some sense?
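Here's how I'd reconstruct that simulation; the book's code may differ in its details, so take this as a sketch. Draw random distributions over the four outcomes, constrained so the expected count of blue is 1.4, compute each one's entropy, and compare the best against the binomial answer at p = 0.7.

    # Random distributions over {ww, bw, wb, bb} with expected value 1.4
    sim_p <- function(G = 1.4) {
      x  <- runif(3)
      # choose the fourth weight so the normalized expected value equals G
      x4 <- (G * sum(x) - x[2] - x[3]) / (2 - G)
      p  <- c(x, x4) / sum(c(x, x4))
      -sum(p * log(p))                     # entropy of this distribution
    }
    H <- replicate(1e5, sim_p())
    plot(density(H))                       # entropies pile up near the maximum
    max(H)

    # The binomial with two trials and p = 0.7, spread over the four outcomes
    p_binom <- c(0.3^2, 0.7 * 0.3, 0.3 * 0.7, 0.7^2)
    -sum(p_binom * log(p_binom))           # about 1.22, matching max(H)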
All I'm trying to give you is an intuition for why we use these common distributions in statistics. The answer is that they're the distributions consistent with the stated constraints, and only that; under those stated constraints, random processes will generate these distributions vastly more ways than any other distribution. That's why we use them, not because they're necessarily right in any sense. It all depends upon the stated constraints, and your stated constraints could be wrong. So there's no magic involved here, but if you want to do this logically, then we appeal to maximum entropy: the distribution most consistent with the stated information. And when you come back next week, we're going to use this to justify generalized linear models. We'll connect our linear model, remember, our geocentric model of how things are associated with an outcome, to other kinds of outcome variables that have constraints, like discrete outcomes, or outcomes that can't be below zero. Lots of data is like that, and we need to choose likelihood functions in those cases. What can we appeal to? Maximum entropy. So with that, I'm going to leave you to go off and do your homework with Markov chains, and I'll see you on Tuesday.