So I thought I'd make just a couple of logistical announcements. This is the second-to-last class; we have one more remaining, and that'll be next Monday. We decided not to give you a homework on the last lecture's material (hopefully you'll just figure it out on your own), so the homework you get at the end of today's class will be the last one. The second thing is the final project, which as you know is due May 10th. We had two ideas for this, as you'll remember from the first lecture. One was to build a substantial framework for doing something in the realm of your own research, and what we'd like is for you to draw on the lessons of at least two of the lectures from the semester. Maybe you're interested in building a GUI and doing some fitting of data; you'll learn about fitting data today. Maybe you're interacting with an instrument and you want to save data to some big database. This should be something that involves roughly a class and a half's worth of homework in thinking and coding. What we ask is that you vet it with us: if you haven't already, send us a few lines in an email saying what you're interested in doing, what sort of application you have in mind, and what the scope is. If you say, "I want to write a new GUI toolkit," we'll say no. If you say, "I want to make a new widget that shows a picture of a dog and quits when you click the quit button," we'll also say no. It's the geometric mean somewhere between those two that you should be aiming for. And because this course has been about teaching you Python tools that will let you do your own work more efficiently, what we really want is for you to build something you're actually going to use in your daily research life. If you're having trouble coming up with ideas, talk to us and we'll be happy to pitch a few.
That's one branch of possibilities, and I think it's the lion's share of what people in the course said they wanted to do. The other possibility is to contribute some major piece of functionality to an open-source framework. So, for instance, if you've been really bugged that there's some really cool thing in MATLAB that's just awesome and doesn't exist in matplotlib, you can say, I want to build this. Now, this isn't just writing code that lives on top of matplotlib; part of your grade will be how you interact with the community that you're going to wind up giving this code to. You'll have to get on the mailing list, ask what people think about the idea, and get feedback; then you code something up, maybe clone the repository and branch something off, and at some point you request to get merged back into the main branch. You're welcome to do that, and it would be really awesome if you want to get into it. As you know from the lectures by Fernando, Paul, and Stefan, there's a huge open-source initiative at Berkeley, so you'll also get a lot of help on how to actually do this, and I'm sure they'd be more than willing to walk you through the process. So let us know. To put some specificity on it, it would be good if by next Wednesday, not this Wednesday but the following one, you have sent us an email and more or less iterated with us on what your final project is going to be, so that you have time to work on it. And of course all of us are willing to answer questions, so if you have a specific question from one of the lectures you could say, hey Joey, I didn't understand Markov chain Monte Carlo, can you tell me how to do this better, or can I show you some code, and we can all help as need be. Are there any questions? Yeah.
I mean, it would be good to be able to do a demo, and if we can just run it by ourselves, that's great; that's what we like to do with the homework. But if it's something that really needs access to a special database, or it's really a workflow that only you, or maybe just your research group, would be interested in having, then in principle you could do a screen capture with a voiceover: here's what my functionality is, and here's what I'm doing. Obviously you'll still have to send us all your code, and if we don't understand what's happening under the hood, or we can't reproduce enough of it, then maybe we'll ask to sit down with you and look over your shoulder. That would be totally fine, yeah. In the end it would be nice to think about an envelope around this project where you actually have a setup.py file using distutils, a README, and subdirectories, so that you could start this as a project on GitHub. Maybe it stays private for a while, but eventually you might want to release it to a wider community. That would be great. Okay, so that's it; I'll leave it to Berian. Okay, thanks, Josh. All right, so this lecture is about Python for scientific inference, and Joey and I will present the material. It's a continuation of what we spoke about in lecture four, if you cast your minds back, when we discussed statsmodels, machine learning, and the scikits, but it goes down a different path, particularly the path of Bayesian inference, which is something we're going to talk about quite a bit today.
So the overview for the lecture is as follows. I'm going to start out by introducing tools for symbolic computation in Python. There's the SymPy package, which is what I'll talk about in detail, and there's also a separate, very large project called Sage, which is independent of SymPy but built on Python; it's basically a standalone Python-based environment with its own notebook functionality, and it has existed for more than five years now. I won't be talking so much about that; we're going to focus on SymPy, but we'll talk generally about what symbolic computation frameworks do and how progress is being made on them in Python. After that we'll talk about Bayesian inference in general. One piece of feedback Joey and I had after the earlier lecture was that maybe we covered some of the theoretical material on stats and machine learning a bit briskly, so we're going to spend more time on the statistical ideas while introducing these topics in Python. We'll both try to gauge, based on how you're getting along in the lecture and the breakout exercises, whether we're going too slowly or too fast, but please do ask questions. Okay, and that leads up to a discussion of Bayesian inference packages in Python, particularly emcee and PyMC, which Joey will be talking about at the end. Okay, so that's the overview for today. So let's start by discussing symbolic computation in Python, and in particular the SymPy package. If you have a Python or IPython session or a notebook open, please start by typing import sympy. If you get an ImportError, you might need to install SymPy, but hopefully you won't.
Okay, so SymPy is a symbolic mathematics project. It's still at an early stage of its development, but it's extremely promising, and it allows for symbolic computation in the style of Mathematica or Maple, if you're familiar with those tools. In fact, let me have a show of hands: has anyone worked extensively with Maple or Mathematica before, or even non-extensively? So you're all generally familiar with what those are and how they differ from something like MATLAB; MATLAB is a numerical toolbox, whereas these are symbolic computation systems. Okay, so the home page and the reference pages are pretty well filled out, so there's a lot of information there that is extremely readable and extremely useful, and during the short breakout we'll have after this symbolic section we'll probably be using some of that material. Okay, so as I've put on the slide here, it's okay to think of SymPy as sort of Mathematica for Python. With the advent of the IPython notebook there's now an extremely good interface for doing symbolic computation, with cells where you can type in little bits of code and execute them, and the modules SymPy provides, like integration and geometry (this is just a sample), linear algebra, statistics, also ODE solving and tensor algebra, are really pretty robust. So there's a lot you can do with this package, and we'll go through some of it. In addition to the doc pages, the GitHub repository has actually got loads of information in it, and the wiki on the GitHub page has a great deal of interesting material that isn't covered so much in the other documentation. Yeah, in particular there are these three pages if you're coming to SymPy from other languages or packages, like Mathematica, Maple, MATLAB, or a bunch of others that exist; these are ways you can see what's different and what's the same.
So symbolic computation and computer algebra systems have existed since the 1960s, and initially they were developed primarily within the theoretical physics and AI communities, until around the 1980s when Maple and Mathematica took off; since then those have been independent commercial companies aiming to provide really standalone software packages. So those packages are extremely useful, as you know if you've used them before, but many problems in symbolic computation remain unsolved, and a lot of the problems that have been posed have been solved on paper but not implemented. Integration is a great example of that; I'll talk about it in a little bit. There are journals devoted to symbolic computation and mathematical software, and there are two entire arXiv subcategories devoted to these topics, one within mathematics and one within computer science, so it's an extremely active area. There's a lot of active development going on right now and a lot of really interesting open questions as well. So let's talk about what SymPy aims to provide. It aims to provide representations of mathematical objects, and everything starts out from a very abstract class, Basic, from which all other mathematical objects, like symbols such as x, matrices, or functions, are derived. So if you have a very broad view of mathematics, you can think of these very general objects that exist abstractly within mathematics, and then the things that we commonly encounter in science or in mathematics, like functions and so on, are derived from those. So one difference between SymPy and regular Python use is that all variables which are going to be used symbolically have to be declared as such in advance. For instance, if you want to use x in an equation, you would write x = Symbol('x'), as we'll see in a moment, and then you could say y = x**2 or something like that.
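To make that declaration step concrete, here's a minimal sketch (the polynomial is just an illustrative choice, not from the slides):

```python
import sympy

# Symbolic variables must be declared before they can be used.
x = sympy.Symbol('x')
y = x**2 + 2*x + 1

# Once declared, expressions in x can be manipulated symbolically.
factored = sympy.factor(y)
print(factored)  # (x + 1)**2
```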
From the command line, however (not from within the Python interpreter), you can run isympy, which opens a Python or IPython session with a number of things pre-imported. As you can see, there's basically a lot pre-imported; SymPy is a big package. It also pre-declares symbols: some represent general pronumerals, some represent integers, and some represent functions. These are just the conventions that we use. The sympy.core package is what provides all of this abstract mathematical framework in which symbolic computation takes place. The actual modules for tasks like integration or linear algebra are separate within SymPy, and this page gives a list; I haven't put the list down because it's quite long. There are a lot of modules already available and a lot more that could be written, but we're going to use some of them. Let's talk about integration briefly. The question for symbolic computation regarding integration is: if I give the computer a function, how is it supposed to return an integral, an antiderivative, of that function? The ideas for solving this question go back at least to Liouville in the 19th century, and it was only around the middle of the 20th century that ideas like the Risch algorithm were fully spelled out. The Risch algorithm is basically a decision procedure for working out whether a given expression has an elementary antiderivative, and as a byproduct of that decision, if there is an antiderivative, it returns it. That's how symbolic integration works in basically every symbolic computation package. So, the full Risch algorithm has never been implemented; it's extremely complicated.
Mathematica is a fairly complete implementation; I'm sure you've used the online integrator or Integrate within Mathematica before, and I'm sure it has a lot of inbuilt proprietary techniques for integrating. Within SymPy, a restricted implementation called the Risch-Norman algorithm is what's used. So it's fast, but it doesn't cover everything. Axiom, which is a different computer algebra system that I don't have a lot of experience with, is I think the only system that implements Risch, maybe not completely, but further than any other system, and that's in active development. So this is still an open problem. So here's an example. If you've run isympy, you could type this command in, for instance, and it would return the integral of x over x squared plus 2x plus 1 with respect to x. Normally one would have to declare x as a symbol in advance, but if you've loaded isympy, then x has already been declared as a symbol. If you type that in, hopefully you will get output that looks like that. So that's an example of integration. Does anyone have any general questions about integration? We're going to be doing more of it shortly. Sorry, I didn't catch that. Does it mean that if you implemented Risch completely and it doesn't find an antiderivative, it's proven that there is no elementary antiderivative? Yes, that is the full Risch algorithm. So for a function like x to the x, it would say: this does not have an elementary antiderivative. Whereas if you put that into SymPy, well, I don't know what happens, but it won't be pretty. At best it will just sit there running for a while, I think. Try it. Okay, all right. So yes, that's graceful behavior: it must have decided that it can't integrate that, and it returns an unevaluated expression showing the integral. But I know from experience it is possible, and not difficult, to give SymPy an expression to integrate where it can't decide and just sits there running.
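In current SymPy syntax, the integral just described looks like the sketch below (the sanity check by differentiation is my addition, not from the slide):

```python
import sympy

x = sympy.Symbol('x')
expr = x / (x**2 + 2*x + 1)

# Symbolic antiderivative; internally SymPy relies on restricted
# Risch-style heuristics for examples like this one.
antideriv = sympy.integrate(expr, x)

# Sanity check: differentiating the result should recover the integrand.
residual = sympy.simplify(sympy.diff(antideriv, x) - expr)
print(residual)  # 0
```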
But things will develop; it will get better. Okay, so SymPy also provides linear algebra and matrix computation. This is kind of complementary to what NumPy provides: NumPy is a great numerical array framework, and there are also matrices implemented in NumPy, but we tend not to use them very often. This is a bit more fully featured, I guess. So typing in this command, for instance, will generate a matrix, and you can perform matrix multiplication. This is a three-by-three matrix of those numbers, and you can perform matrix multiplication by typing m * n. And there you go; it will print out something like this. I have a comment on the notation: note that Matrix, of course, is a function, so it has parentheses, but then there are some sub-parentheses here. These could also be square brackets; it doesn't make a difference. I think the inner parentheses are just a SymPy convenience, because it looks nicer when you don't have lots of square brackets. I don't think that would work, for instance, if you were declaring a NumPy array that way. Is that matrix m mutable? I don't know the answer to that. I presume... yeah, no, they are. They are. Yeah, absolutely. So it's weird to declare it like a two-dimensional array. Exactly, yeah. You shouldn't be able to... no, exactly. Formally, if you're thinking of it as a list of lists, then it should be square brackets, and that's sort of what it is, but maybe it's just easier to keep track of; I get lost declaring two-dimensional arrays using square brackets, but it's useful to be able to do that. And of course, anything that works with numbers... but numbers aren't the point of SymPy; it's the symbols that are the purpose. So if you declare, for instance, that x and y are symbols and put them in a matrix, you can compute the determinant of that matrix,
or you can compute any property of that matrix, and it will return an expression involving those symbols. So that's the determinant; you might also ask for the inverse or the Cholesky decomposition or anything else. Another useful routine is the .subs method, which lets you substitute a number for all instances of some variable. So if you've declared m as this matrix, you could type m.subs(x, 3), and that would replace all instances of x in the matrix with 3. This is useful if you're trying to build up generic functions based on these symbolic objects. I should say we're going to do a bit more of actually playing with these functions in the notebook just after I finish these slides; I just wanted to cover a few of the packages which are going to be useful for us throughout this discussion of inference. So, SymPy also provides sympy.statistics, a module for working with statistical distributions. Its goal in the long term is to provide a very full range of abstract structures representing probability distributions. At the moment only two such probability distributions are implemented, the univariate normal distribution and the uniform distribution, but there is an underlying API in sympy.statistics.distributions which allows you to declare more general PDFs of your own devising; you could make an exponential PDF that way. Sorry, when I say PDF, I mean probability density function. This is one of those questions I need to ask: is everyone comfortable with the idea of probability distributions and so on? If you're not, chat to me during the breakout and we'll make sure you are. Sure: here's a uniform distribution, here's a normal distribution, x and p(x). So a probability distribution, for our purposes, is some function that integrates from minus infinity to infinity to give one. All right.
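Before moving on to distributions, the matrix operations just described (symbolic determinant and .subs) can be pulled into one minimal sketch; the small 2x2 matrix here is my own illustrative choice:

```python
import sympy

x, y = sympy.symbols('x y')

# A symbolic 2x2 matrix; any matrix property returns an expression.
m = sympy.Matrix([[x, y],
                  [y, x]])
det = m.det()            # x**2 - y**2

# .subs replaces every instance of a symbol with a value.
m_at_3 = m.subs(x, 3)    # Matrix([[3, y], [y, 3]])
```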
So in SymPy, if you want to declare, for instance, a particular normal distribution, you run N = Normal(mu, sigma), and that defines the mean and standard deviation of the normal distribution. So this one would be a normal distribution with width parameter sigma of one. And if you want to evaluate the PDF at a particular value, you go ahead and do that, and it returns this. Go on. Ah, okay. Sorry, thanks. Yeah, so from sympy import * will not import everything; the statistics module isn't included, so you need import sympy.statistics as something, or from sympy.statistics import *. I get concerned about that; it's dangerous to do from X import *, but it happens, like many things in life. Okay. Sorry, Paul, did you want to add something? No, you're agreeing; you're coughing your agreement with me. Yeah. Hopefully that's fairly straightforward. So this has returned an exact expression for the PDF, but if you want a numerical value, the .evalf method will return a numerical expression to some number of significant figures; it can also accept an argument, which is the number of significant figures you would like. And symbolic computation can of course be used within these. So now, if you define x as a symbol and ask for the CDF of the normal distribution as a function of a general symbol x, you'll get some ASCII art that looks like this. I'm going to discuss printing in a moment. Thank you, Chris. For the viewers at home, Chris was expressing his displeasure at the way the material looks in the output, so we're going to look at a couple of ways of improving it. If you're working in the terminal, which I think many of you will be, you generally won't get plain ASCII art; you'll get something that looks kind of similar but with Unicode bracket characters. So that's a little bit better, but not great.
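A note for anyone following along later: in recent SymPy versions this machinery lives in sympy.stats rather than sympy.statistics, and random variables take a name as their first argument. With that caveat, the steps above look roughly like this sketch:

```python
import sympy
from sympy.stats import Normal, density, cdf

x = sympy.Symbol('x')

# A standard normal random variable (mean 0, sigma 1).
N = Normal('N', 0, 1)

pdf_at_2 = density(N)(2)      # exact expression, equal to exp(-2)/sqrt(2*pi)
approx = pdf_at_2.evalf(5)    # numerical value to 5 significant figures
symbolic_cdf = cdf(N)(x)      # an erf-based symbolic expression in x
```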
If you're working in the IPython notebook, which I will encourage you to do in a moment, you have some other options as well. All right, and so here I was just evaluating the CDF at a number of different values. Can anyone guess what this is? What is oo? It's how you represent infinity symbolically: inf is numerical infinity, and oo is symbolic infinity. So yes, if you evaluate the CDF at those limits, you'll get numbers, and of course you can evaluate the CDF anywhere. The letter o, twice, yeah. Okay, output and rendering. I'll switch to the notebook after this; I think this is the last slide before the breakout. So if you're in the terminal, then Unicode rendering is available, either by default or by using pprint; pprint on a symbolic expression. You could try that now: type pprint and then some expression. It may not look significantly different. You can also get a variety of different outputs. For instance, if you run latex() on a symbolic expression, it'll return the LaTeX source for that expression. That could be useful, depending on how good you are at LaTeX; typing this generally takes me longer than typing this, but over time one could become quite proficient. Also, Python printing is useful, particularly if you're like me and you always forget to declare things as symbols in advance. So if I want to do this, maybe I can even run latex() on that, and that would save me even more time. So those are two output options. Okay, and then, significantly, within the IPython notebook, the %load_ext sympyprinting magic produces rendered LaTeX output. So let's have a look at that now, in addition to some other things. Is that view okay? Okay. Can I do full screen or something? Sorry, this is Firefox, so I have no idea how to...
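For instance, the two terminal output options just mentioned can be tried like this (the integral is just an arbitrary expression to render):

```python
import sympy

x = sympy.Symbol('x')
expr = sympy.Integral(sympy.sqrt(1/x), x)

# Unicode/ASCII pretty-printing for terminal sessions.
sympy.pprint(expr)

# LaTeX source for the same expression, ready to paste into a document.
tex = sympy.latex(expr)
print(tex)
```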
It's for your benefit, Josh. All right. Anyway, I think everyone can see it, right? All right, there we go. Okay, so some things to try out in SymPy. This expression, for instance, will not return what you think, or maybe it is what you think: it returns zero, because this is integer division of one half plus one third, just using Python 2's default integer division. You can import division from __future__. So, oh man, I don't even know if this is going to work, but I'll try. And this actually turns the division operator into what you would expect for symbolic computation, and you can get the old Python division operator back by doing this, I think. Okay. Alternatively, you can declare a half and a third as Rational expressions, which is what they are. And of course, using floating-point dots to declare things in natural Python works straightforwardly as well. I mean, to be clear, the __future__ here is Python 3. Yeah, that's the aim. So, a number of expressions for symbolic computation: you want to take the limit of sin x over x as x goes to zero, for example; it gives one, which is correct. The derivative of the error function with respect to x is a Gaussian. And then some integral expressions, which we've seen already. So in the notebook, the default without loading any extensions is just these text representations, but if you use the %load_ext sympyprinting magic, then you get rendered LaTeX output, which is nice. So this is the inverse of m, and it renders it; it's a bit small, unfortunately, but it looks very nice. And similarly with the Cholesky decomposition of that matrix. Is that okay? Cool. All right. So this is a short breakout on symbolic computation; I only envisage spending 10 to 15 minutes. After that, we'll talk about inference generally, and there'll be a longer breakout after that section.
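The division and calculus examples above, gathered into one runnable sketch (the __future__ import only matters on Python 2, where bare 1/2 + 1/3 really does give 0):

```python
from __future__ import division  # no-op on Python 3; fixes 1/2 on Python 2
import sympy

x = sympy.Symbol('x')

# Exact rational arithmetic instead of integer or float division.
s = sympy.Rational(1, 2) + sympy.Rational(1, 3)   # 5/6

lim = sympy.limit(sympy.sin(x) / x, x, 0)          # 1
deriv = sympy.diff(sympy.erf(x), x)                # a Gaussian: 2*exp(-x**2)/sqrt(pi)
```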
So the purpose of this breakout is just to get you started playing with some of the tools in the symbolic computation toolbox. This question is deliberately open-ended, which the next breakout question will not be: think of a way to implement the bivariate normal distribution in SymPy. At the moment you've got the univariate normal distribution; for the bivariate normal distribution you can look up the Wikipedia page to get an example, but I'll give an overview here. A bivariate normal distribution is a two-dimensional distribution. Where a one-dimensional normal distribution has a scalar mean, here the mean mu is a vector, and the width, I'm going to try to draw this better, is an ellipse. So here are x and y, and you can imagine this being a one-sigma contour. Has everyone seen something like this before? You know what I mean when I say bivariate normal distribution? Cool. So the mean is some (x0, y0), and the width is specified by a 2x2 matrix: you have widths in the x and y dimensions, but you also have the possibility of covariance between x and y, so you need a matrix to represent the covariance, and maybe a matrix to represent mu. Come up with some way to implement some facet of the bivariate normal distribution using as much SymPy as you can. If you ask questions like what functions are available to do such-and-such, tab completion in IPython and the module references online will go a long way toward answers. Okay, let's spend about 10 to 15 minutes on this. Okay? So let's move on to the second part of the lecture today, which is an overview of Bayesian inference. This is a tutorial on inference, and it will lead up to Joey's final discussion of how these inference ideas are implemented in different ways in Python packages. Yeah, again, to calibrate: how many people are aware of the distinction between frequentist and Bayesian inference, and how many people...
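As one possible starting point for the breakout (a sketch, not the intended solution; the symbol names are my own), you can write the bivariate normal PDF directly from the covariance matrix using SymPy matrices:

```python
import sympy

x, y, x0, y0, rho = sympy.symbols('x y x0 y0 rho')
sx, sy = sympy.symbols('sigma_x sigma_y', positive=True)

v = sympy.Matrix([x - x0, y - y0])          # deviation from the mean (x0, y0)
cov = sympy.Matrix([[sx**2, rho*sx*sy],
                    [rho*sx*sy, sy**2]])    # 2x2 covariance matrix

# PDF: normalization from det(cov), quadratic form from cov.inv().
norm = 1 / (2 * sympy.pi * sympy.sqrt(cov.det()))
pdf = norm * sympy.exp(-(v.T * cov.inv() * v)[0, 0] / 2)
```

With rho set to zero and unit widths, this should reduce to the product of two independent standard normals, which is a useful self-check.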
I don't think anyone in existence can claim to fully comprehend that distinction, but maybe some of us are only aware of it while others are familiar with the importance of the difference. Okay, so, yeah, the purpose of this discussion is to highlight different ways of approaching inference in science. As scientists, the role of inference, this is my characterization anyway, is to be able to draw conclusions from noisy data, or uncertain data; data that are random variates is what's really meant there. Okay? And so, historically, approaches to inference can be divided into two camps, termed frequentist and Bayesian, and I have a star here to remind me that of course there are more than two, but no one ever talks about the others. The one-line description of the difference between these two views is that the frequentist interpretation of probability is expressed in terms of repeated trials, while Bayesians interpret probability as a degree of belief. Now, that's easy to say, but it's not easy to comprehend exactly what that distinction means and what it leads to, and we'll talk about what it means in a moment. Yes, there's some additional reading material if people are interested in the history of statistical inference in court cases; Bayesian inference and its use in court cases is controversial at present, but statistical inference there is very important. Okay. Anyway, that was just an aside. So, the methodological implications of the distinction between frequentist and Bayesian statistics are profound and subtle. The notion of hypothesis testing is largely drawn from frequentist statistics, where propositions are to be evaluated as either true or false, perhaps with some quantified chance of being wrong, which is what a p-value, for instance, provides. Okay.
On the other hand, concepts that we'll discuss shortly, like the likelihood (I'm not going to talk about the evidence so much), arise particularly within Bayesian statistics, and it's important to understand what these tools are and what they allow us to do when we approach the problem of: here is a scientific data set, I want to draw some quantitative conclusions from it. It's possible to make a lot of the distinction between frequentist and Bayesian statistics; it's an interesting philosophical topic. My impression, and I'm not deeply within the statistics community, is that it's not considered a major division in statistics today. It was during much of the 20th century: frequentism was the prevailing viewpoint in scientific discourse, and it's only recently that Bayesian statistics have become very widely used. So for scientists who remember that division, I suppose there's still some rancor, but it was before my time, and I think as scientists we have to be pretty pragmatic. Nevertheless, the application of inference within a scientific field is often surprisingly one-sided. Within my own field, cosmology, Bayesian inference is standard. In particle physics, I understand frequentist statistics are the norm. And in astronomy as a whole, my impression is that it's more mixed, a mixture of both. I wanted to ask whether people had impressions, based on their own fields, of whether one approach dominates. So, pragmatically, how does one tell which type of inference a given analysis is using, apart from it carrying the label? Yeah, I mean, if you're concerned with things like prior and posterior distributions for parameters when you approach inference problems; I think the biggest tell is that a Bayesian analysis has a distribution over the parameters themselves.
Okay, so where will it actually make a difference in how you deal with your data? We will see; in most cases it does, practically, and there are cases where it won't. I'm just trying to figure out which one my community is. The computer science community, Paul? Bayesians won a World Series in the last one. Okay. So, to think about approaches to statistical inference within science: ultimately, the distinction between these two camps is not of the highest practical importance for us. The kinds of questions that matter to scientists are things like: I have these data with some error bars that maybe I trust, maybe I don't, and I want to publish in a journal; what do I do in between? Or more specifically, how do I fit a model to data? How do I decide which of two models is better? Or, most specifically, which is what we're going to focus on: how do I take a NumPy array and find maximum-likelihood parameters for some model, and an associated covariance matrix? Okay. So let's unpack this statement. The goal of this lecture is to provide you with tools for inference with real data in Python. We're interested in packages that interface with NumPy, to allow you to use all of the fantastic tools that are available in Python. We want ways to return optimized best-fit parameters: you want to find some numbers you can quote and say, these are the values of the parameters. And you also want some idea about the uncertainty, and a way of characterizing the uncertainty in those parameters. Okay, so that's the goal. Does anyone have any questions? Yeah. What it will change is the way you answer the question and what things you think are important. In general, the numerical results you get should not be different if you've done things correctly, but the things you emphasize, the sorts of conclusions, and the way you interpret the statistics from your data will differ greatly. Okay.
So, if you cast your minds back earlier in the course, we talked about a number of packages that touch on these things, maybe not so explicitly — statsmodels, and scikit-learn for machine learning in Python. I just wanted to mention those. My impression is that these are more frequentist tools, to give you some indication. The two tools we're going to talk about today, PyMC and emcee, are very Bayesian. Okay. So I wanted to come up with one slide to describe all of Bayesian inference. What is that slide? When embarking on an experiment, we almost always have some prior expectation about the outcome. It could be a prejudice, and when people have a negative view of Bayesianism, this is what they emphasize: they might say science should be objective, and if two people come to an experiment with two different prejudices, two different starting viewpoints, with Bayesianism they'll get two different answers, and that's bad. I don't agree with that. I think that science is not perfectly objective anyway. Sure — you have to choose your likelihood in either case, and the model they choose is different, or the penalization parameter. Right. So Joey is making the point that even within frequentism you have to make these sorts of subjective decisions. The power of Bayesianism, perhaps, is that it makes explicit the assumptions that we are putting in. Okay. So, Bayesian inference is the process by which we update our expectations for parameter values — the distribution of the parameters — to take account of new data that we've received. Okay. And so our prior expectation for a parameter isn't always necessarily a prejudice. It could be the result of a previous experiment: last year an experiment was done, and the mass of the neutrino was found to be this distribution of values with some uncertainty. Now we can do a better experiment that returns values that are more precise.
And so Bayesian inference, and inference generally, allows us to update prior information into a posterior. And the single equation that does that, which you should read right to left, is Bayes' theorem: p(theta | d, I) is proportional to p(d | theta, I) times p(theta | I). Here I represents your prior information; I'm going to mostly not include it explicitly from here on, but it's important to understand that it's there. So the last factor is your prior: the probability of the parameter set given your previous information. And what you want to get is the left-hand side: the parameter set given this new data you have, with the prior information, of course, still incorporated. And the way you get there is using this expression. The first factor is the likelihood: you have to evaluate the probability of getting your data given a particular parameter set. We're going to talk about how to evaluate this — for the next ten slides or so I'm going to be talking about this. But the point of this equation is just to make clear exactly what you're doing in Bayesian inference. As a scientist, maybe you want to derive a particular parameter value. You have something to begin with, you have new data, you're going to get a posterior, okay? So that's Bayesian inference in my view. Does anyone have any questions or comments about that? One thing I will add is that I put a proportionality sign here. The constant of proportionality is the evidence, p(d | I), which from a formal standpoint has to be included — this is Bayes' theorem, Bayes' rule. But from a practical standpoint I never worry about that, because the posterior has to be a probability distribution. What I end up with is, for instance, a vector of numbers representing the probability distribution for theta, and then I just divide the vector by its sum so that it's normalized to one. So that, I mean, that's...
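To make that normalize-by-the-sum step concrete, here is a minimal sketch on a one-dimensional grid. All the numbers are hypothetical: a made-up Gaussian prior from "last year's experiment" and a made-up Gaussian likelihood from the new data.

```python
import numpy as np

# Hypothetical 1-D example: infer a parameter theta on a grid,
# normalizing the posterior by its sum as described above.
theta = np.linspace(-5, 5, 1001)

# Prior: a previous experiment suggested theta ~ N(1, 2^2) (made-up numbers).
prior = np.exp(-0.5 * ((theta - 1.0) / 2.0) ** 2)

# Likelihood: the new data prefer theta ~ N(0, 1) (also made up).
likelihood = np.exp(-0.5 * (theta - 0.0) ** 2)

# Posterior ∝ likelihood × prior; dividing by the sum normalizes it to one,
# which sidesteps computing the evidence explicitly.
posterior = likelihood * prior
posterior /= posterior.sum()
```

The posterior peak lands between the prior mean and the likelihood peak, weighted toward the tighter of the two, which is exactly the "updating" the equation describes.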
Yes — it's the same expression on the top, integrated over the space of theta. Right, so then it doesn't depend on theta. Cool. Except if you want to do model comparison — then it's important, yeah. And that's one of the benefits of MCMC: you don't have to worry about these things. Okay. So let's talk about what the likelihood is and how we evaluate it given a data set. Much of the practical work we do when analyzing a data set involves inferring parameter values from the likelihood distribution, written L(theta), sometimes written L(theta | d). That notation is confusing, because it looks like the probability distribution for theta given the data, when it's actually the other way around: the probability of the data given theta. I try to avoid that notation because it confuses me. Okay. So, given a data set, how do I evaluate how likely a particular choice of parameters is given that data? That's what we're going to answer. And the way we do that is by understanding the following: data points are secretly probability distributions. You may have seen this before — a data point with an error bar. If you look at it from the side, though, it's secretly a normal distribution: the dot represents its mean and the bar represents the one-sigma width. Okay. So a data point is a normal distribution; maybe you have had that realization before. And so if you have a data set, a bunch of data points, they represent a bunch of probability distributions. And these are independent data points: I've got x, y pairs, and I've assumed they're independent here. And here I've remapped x and y so you can see the distributions. Okay. So let's talk about how we evaluate the likelihood for independent data points, and after this we'll talk about covariant data, which are much more commonly encountered. The likelihood for independent data points is just the individual likelihoods multiplied together.
A product like that is generally a very large or very small number, very awkward, so it's common to work with the log-likelihood: if you take the log of this expression, the product turns into a sum, which is much easier to work with. So we're going to be working with the log-likelihood. So imagine you have five data points here, and your model is a straight line; the gradient of that line is a free parameter, and you're going to try to infer the distribution of that parameter. Okay. And each of these data points, remember, is a normal distribution. So I draw my line through, and then for each point I compute the probability of the model value at that point under that data point's distribution. I'll give an explicit example of this in a moment. And that, for a particular parameter choice, is the likelihood. Does anyone have — sorry? Thanks. Yeah, of course, I will point that out. Yes, there's a typo here in the last line: this should be a sum of logs, not the log of a product. So in particular, if our probability distributions are Gaussians — I've written a Gaussian here, and I have not normalized it properly, but I just want to focus on what's in the exponent — then the log will be this expression, which is commonly encountered in expressing the goodness of fit of a model given a data set. In astronomy we call this the chi-squared, because if the data are normally distributed this quantity is chi-squared distributed; but that terminology is, I think, mostly an astronomy habit, and it isn't universal nomenclature. So in general, in evaluating the likelihood, you take your set of data, y_i, and you have a model, expressed here as f — for example, the straight line from before — which is a function of x but also of the parameter, the gradient, for instance.
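As a concrete sketch of that expression — a hypothetical straight-line model with a single gradient parameter and made-up data — the log-likelihood is, up to an additive constant, minus one half of the chi-squared sum:

```python
import numpy as np

# For independent Gaussian data points (y_i, sigma_i) and a model f(x; theta),
# the log-likelihood is -0.5 * sum(((y_i - f(x_i; theta)) / sigma_i)**2),
# up to an additive normalization constant.
def log_likelihood(theta, x, y, sigma):
    model = theta * x          # hypothetical straight-line model, f(x) = theta * x
    return -0.5 * np.sum(((y - model) / sigma) ** 2)

# Made-up data scattered around a line of gradient 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
sigma = np.full(5, 0.2)

# The true gradient (1.0) gives a higher log-likelihood than a bad guess (2.0).
print(log_likelihood(1.0, x, y, sigma) > log_likelihood(2.0, x, y, sigma))  # True
```

Evaluating this function over a range of theta values traces out the likelihood curve for the gradient.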
And I evaluate the likelihood of the parameter value by taking the data minus the model value at each of the points where I have a data point, dividing by the error, squaring, and summing all of that. Has everyone carried out a computation like that before with a data set? Maybe we don't think of it as a likelihood evaluation, but that's what it is. Okay. So here's an example. Here are a bunch of data points, and I've drawn the line from which the data were generated through them, but we're going to pretend we don't know exactly what that line is. We're going to say it's some a1 x squared plus a2, and the goal is to infer what a1 and a2 are — or rather, to infer the probability distribution for a1 and a2 given the data. Okay. So, as I mentioned on the previous slide, we have a model f of x given a1 and a2, and I write down my log-likelihood. So I code up a function in Python, for instance, which evaluates the log-likelihood, and then I can express the likelihood in that fashion. And if I do that, I'll get a plot that looks something like this. This shows me which region of parameter space (a1, a2) is best for describing the data; in this case there's a peak around here. You can also see that the two parameters are covariant: very often you will find that the two parameters are not independent of one another — there's probably some covariance. Do you want to say something? Yeah, I will. So here is a joint distribution for a1 and a2. Now, this is just the likelihood; what I want is the posterior. So maybe I have a prior distribution on what a1 and a2 are — maybe someone did an experiment last year and they found something a bit broader, with broader error bars. Now I use this new information: I update the prior to the posterior, and I have a posterior probability distribution. This is a joint distribution for a1 and a2.
Now, if you're interested in a1 and a2 and the relationship between them, that joint distribution is as compact a statement of the results as you can come up with. But very often you might only be interested in a1 — maybe a1 represents the mass and a2 is some nuisance parameter that you're not interested in. In that case you should marginalize over the second parameter: integrate with respect to a2, and what you'll end up with is the marginal distribution — for instance, a one-dimensional distribution. So if I marginalize over a2 — and if this is an array of values, which is what I've done: I made a grid of (a1, a2) values and computed the likelihood — then what I do is just sum over the uninteresting axis, and I end up with a vector which represents the marginal likelihood. And this would be, for instance, a result you quote: here is a1, its value is 0.7 plus or minus the one-sigma value. And it's good to quote things like that in papers, because then people can say so-and-so found this, which is much better than last year's results, but of course they didn't account for systematics, et cetera. Josh, did you want to add anything to that? Exactly — that distribution generally is not Gaussian, and we have rules of thumb for what we mean when we say one-sigma. Right. So the implicit statement here is that if the distribution of a1 is normal, then it can be completely characterized by its mean and its standard deviation. But the probability distribution might not be Gaussian — that happens very commonly — in which case you should plot the full distribution in your paper; that's useful information for people. What you can do, though, if you want to quote a one-sigma error bar people are familiar with, is quote the plus-or-minus values that enclose the central two-thirds or so of the distribution — whatever would be the one-sigma region for a Gaussian. But in general, if your probability distribution is not Gaussian, you can't condense it like that.
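The grid-then-sum marginalization described above can be sketched like this; the model, the data, and the grid ranges are all made up for illustration:

```python
import numpy as np

# Hypothetical sketch: evaluate the log-likelihood on a grid of (a1, a2)
# values, exponentiate, then marginalize over the uninteresting axis (a2)
# by summing, as described above.
def log_like(a1, a2, x, y, sigma):
    model = a1 * x**2 + a2
    return -0.5 * np.sum(((y - model) / sigma) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 0.7 * x**2 + 0.3 + rng.normal(0, 0.05, x.size)  # true a1=0.7, a2=0.3
sigma = np.full(x.size, 0.05)

a1_grid = np.linspace(0, 1.5, 100)
a2_grid = np.linspace(0, 1.0, 80)
logL = np.array([[log_like(a1, a2, x, y, sigma) for a2 in a2_grid]
                 for a1 in a1_grid])

like = np.exp(logL - logL.max())      # subtract the max for numerical stability
marginal_a1 = like.sum(axis=1)        # marginalize over the a2 axis
marginal_a1 /= marginal_a1.sum()      # normalize to one
a1_best = a1_grid[np.argmax(marginal_a1)]
```

`marginal_a1` is the one-dimensional vector you would plot or summarize with a mean and one-sigma interval, and `a1_best` should land near the true value of 0.7.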
Is that what you're going to talk about? Yeah. I think it's going to be a highlight. Cool. By contrast, in the frequentist picture, everything up there is different: what you get is not a joint probability distribution over the parameters. In frequentist statistics you're simply not allowed to put a probability distribution on parameter space. So what you can do is find the maximum likelihood value of the parameters — the reddest area on that slide — and look at the contours around it. And there are ways, for instance inverting the information matrix, that can give you confidence intervals, meaning: with 95% confidence, the true value of (a1, a2) falls within the interval. But that doesn't have the same interpretation as a joint probability distribution over the parameters. Right. Yeah. And actually — not that you should go around wondering, am I talking to a frequentist or a Bayesian right now — but the tell is this: the frequentist can't say "the probability that a2 is such-and-such"; there's no probability distribution over the parameter, just a statement about how confident you are. Yeah. So you always think of it in terms of repeated trials: if I got another data set drawn from the exact same distribution as this one, and I did that a thousand times, 95% of those confidence intervals should contain the true value. Cool. And one other thing to point out — we have a nice tidy example here, but you can have a likelihood distribution which has multiple peaks. Yeah — then the notion of "what's the value of my parameter" doesn't really make sense. Right. You could say there's a 95% chance it's in this peak or this peak or that peak. Yeah, this is a great point that's worth repeating. This is a very simple case; you could end up with a joint probability distribution, for instance, that is a banana. We're going to see some of those in a minute.
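The repeated-trials interpretation that came up in that exchange can be checked directly by simulation. This sketch, with made-up parameters, builds the textbook 95% interval for a normal mean many times and counts how often it covers the truth:

```python
import numpy as np

# Repeated-trials sketch: estimate the mean of a normal distribution many
# times and count how often the standard 95% confidence interval
# (sample mean ± 1.96 * sigma / sqrt(n)) covers the true value.
rng = np.random.default_rng(4)
true_mu, sigma, n = 5.0, 2.0, 50   # hypothetical truth and known noise level
trials = 5000

hits = 0
for _ in range(trials):
    data = rng.normal(true_mu, sigma, n)
    half_width = 1.96 * sigma / np.sqrt(n)
    if abs(data.mean() - true_mu) < half_width:
        hits += 1

coverage = hits / trials   # should come out close to 0.95
```

Each individual interval either contains the true mean or it doesn't; the 95% is a statement about the procedure across repetitions, not a probability distribution over the parameter.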
But there are even more difficult-to-handle distributions, where you don't have one single peak. To give a one-dimensional example — where is the eraser, over here? — a favorite example for cosmologists is photometric redshift. If you are evaluating the redshift of some distant object based only on limited photometric information, you very commonly end up with a distribution for the redshift that has more than one peak, because galaxies are complicated objects. And so you can say with reasonably high confidence that it's not in some large range of values, and that it's here or here; but you have to be clear in expressing that it could be very close by or it could be very far away, but not in between. So dealing with multimodal distributions is a difficult challenge. Now, the case that I described a moment ago was when your data points are uncorrelated. Commonly you might have a data set where there's covariance between the data, and you need to account for this properly when evaluating the likelihood. The way I think about this: here is a very limited data set with three data points. If the data points are covariant, you can think of the data set as a whole as one sample from a three-dimensional multivariate normal distribution. Okay? And if I picked another point from that distribution, the data set would be slightly different. So the data set you have is a sample from that three-dimensional Gaussian distribution, and evaluating the probability is then done using the probability expression for the multivariate normal distribution that we looked at in the breakout earlier.
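A sketch of that multivariate-normal likelihood evaluation, with a hypothetical three-point data set; note that a diagonal covariance reproduces the independent-data sum from earlier:

```python
import numpy as np

# Covariant case: the residual vector is contracted with the inverse
# covariance matrix instead of dividing by per-point sigmas:
#   log L = -0.5 * delta^T C^{-1} delta  (up to a constant).
def log_likelihood_cov(y, model, cov):
    delta = y - model
    return -0.5 * delta @ np.linalg.inv(cov) @ delta

# Made-up three-point data set and model values.
y = np.array([1.2, 2.1, 2.9])
model = np.array([1.0, 2.0, 3.0])

# With a diagonal covariance, this reduces to the sum of (delta/sigma)^2 terms.
sigma = np.array([0.1, 0.2, 0.1])
diag_cov = np.diag(sigma**2)
independent = -0.5 * np.sum(((y - model) / sigma) ** 2)
print(np.isclose(log_likelihood_cov(y, model, diag_cov), independent))  # True
```

Off-diagonal terms in `cov` are what encode the correlations between data points; the diagonal check is just a sanity test that the general expression contains the independent case.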
So the way you perform the likelihood evaluation with covariant data — versus the uncorrelated case a moment ago — is that you now have a 3-by-3, or in general n-by-n, covariance matrix representing the covariance between the data points, rather than individual error bars sigma. And you still have the data values, a vector of points, and your vector of model values given a parameter choice. So you form the vector of separations, data minus model, and then contract it with the inverse covariance matrix. When the data points are uncorrelated — that is, when the covariance matrix is diagonal — this expression reduces to the one we saw a couple of slides ago, y minus f of x_i divided by sigma, because the sigmas are just on the diagonal. Should I go over that again on the board? Or is that something folks are familiar with — we can talk about it afterwards. So this is just the general expression for evaluating the goodness of fit, or the likelihood, when you have covariant data and an expression for their covariance. To give an example: here is a result from the WiggleZ Galaxy Redshift Survey, published last year. What we're inferring here are a couple of cosmological parameters: the matter density of the universe and the equation of state parameter of dark energy. I don't have time to describe what these are and how they affect the data set, unfortunately. But we are inferring two cosmological parameters from a measurement made with the galaxy distribution. This axis is spatial scale, and this is a measure of the clustering of galaxies as a function of spatial scale: on very short scales galaxies are very clustered; on larger scales they're basically unclustered. This bump is a very interesting feature that sadly I do not have time to talk about. The answer to this question — the answer we gave to the referee — is that on these scales the data points are extremely covariant, so this bump is not terribly significant.
It just happens that one point is shifted down a little bit, and it's so covariant that it drags everything down. Now, you could argue the same could be true for the bump as well. Happily, the covariance is slightly less at shorter scales, and also this is just one redshift slice: there are actually four separate, independent data sets that all show the same thing. Actually, the best thing would be for me to put up the consolidated data set — they all look pretty much the same, and if I combine all of them and stack the signal, it looks better. Or, if I wanted to use the best-looking data set, I could use the results from another galaxy redshift survey, which were published only a few weeks ago and are really nice. So the baryon acoustic oscillation signal is significant — that should be your take-home from today. Okay. But to do a likelihood analysis: this red line is, say, a best-fitting prediction for clustering as a function of scale, based on choices for those two parameters I mentioned, the matter density and the equation of state of dark energy. So given this data set, we want to evaluate the likelihood distribution for those two parameters. And happily we have a covariance matrix telling us how covariant these points are at different scales. Sorry — how covariant they are, thank you. In fact, this is a correlation matrix, so it's normalized so that the diagonal means perfectly correlated, and the off-diagonal terms basically go down to zero. So it's a relatively simple covariance matrix, but you can also see it's not diagonal, so it needs to be taken into account: if you just used y minus the model divided by sigma, you would not get the right answer. Okay. So the likelihood evaluation is carried out using the expression from the previous slide: we invert the covariance matrix, and a given choice of parameters predicts the model.
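On the inversion step: forming the explicit inverse works, but a common, more numerically stable alternative (a sketch, not necessarily what was done for the WiggleZ analysis) is to solve the linear system instead:

```python
import numpy as np

# Computing chi^2 = delta^T C^{-1} delta two ways: with an explicit
# matrix inverse, and by solving C x = delta, which avoids forming the
# inverse and is generally better conditioned.
rng = np.random.default_rng(1)
n = 36
A = rng.normal(size=(n, n))
cov = A @ A.T + n * np.eye(n)     # hypothetical well-conditioned covariance matrix
delta = rng.normal(size=n)        # hypothetical data-minus-model residual vector

chi2_inv = delta @ np.linalg.inv(cov) @ delta
chi2_solve = delta @ np.linalg.solve(cov, delta)
print(np.isclose(chi2_inv, chi2_solve))  # True
```

For repeated evaluations with a fixed covariance matrix (as in a parameter scan), it is also reasonable to factor or invert the matrix once outside the loop.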
We take the difference of the model from the data, and here are some contours — one and two sigma, probably — for the probability distributions of the matter density parameter and the equation of state parameter. Pardon me — the blue is just this experiment, the galaxy distribution, and you can see the contours are centered on a particular value; it's not a Gaussian, it's got some degeneracy, some correlation between the parameters, but nevertheless it's what we would expect as the output of the likelihood evaluation. There are two other, quite different experiments: the red is based on measurements of supernovae, and the black is based on measurements of the microwave background. So in cosmology we have three almost independent means of evaluating parameters like this, and we are able to stack the posterior likelihood evaluations. If we stack all three of them together, we end up with something like this, which is relatively compact — and that's more or less the state of the art in cosmology.
Yeah. You can in effect do it — that's a nice thing about Bayesian statistics: you just put it in the prior and it works out. And maybe I'm overstepping a little bit here, but my understanding is that your prior can forbid the matter density from being negative, whereas the frequentist has no such mechanism — it just effectively scrunches the resulting distribution to make sure it's positive, even though it doesn't make sense to have negative matter. The Bayesian approach makes it explicit: we're doing the same optimization, but if you want a flat prior that goes from zero up to infinity, that's fine — it really is not allowed to be less than zero. Yeah, I think for the interpretation, thinking of these things in terms of probability distributions makes it very easy. Okay. So that's all of cosmology. No, I joke — there are more than two parameters, and what cosmologists do is publish representations of these joint distributions for the many cosmological parameters that govern the behavior of objects in the universe. Okay, so that I think is the end of what I wanted to talk about — the breakout is the next slide. I've given examples involving one or two dimensions — one when I'm drawing on the blackboard, or two on the slides — but very commonly you will be dealing with models that have more than two dimensions.
If you're like me, two is about the limit of my patience if I'm evaluating the likelihood using a grid of parameter values, so I want cleverer methods, and what Joey is going to talk about is those cleverer methods. There are also things you can do without MCMC to characterize particular aspects of the probability distribution. If you have a good reason for thinking that your probability distribution is unimodal — or, better still, that it's multivariate normal — then you can characterize it fairly easily using an optimization routine like fmin: you write a Python function that returns the likelihood value given a set of parameters, and then you run that through fmin or fmin_bfgs or something, and it will return the maximum likelihood parameter estimates. You can also get estimates of the covariance: if you're assuming the parameter distribution is multivariate normal, then you can return estimates of the covariance between the parameters. That's a common task, but it isn't always the case — in fact it isn't very often the case — that you have something you're definitely sure is multivariate normal, in which case you need other methods to characterize the likelihood distribution, and that's what we'll talk about after the breakout. Okay, so here's the breakout; it will take 20 to 30 minutes — well, we'll see how we get on. I've provided a data set, which hopefully you can find on bSpace: a three-column text file, x, y, and the error bar on y. These are not covariant data, so treat them as independent. Write a Python model function of that form, and write another function to evaluate the likelihood given a choice of parameter values. In general this is a two-parameter problem — I've suggested a model that has two parameters, a and b — but to begin with, just assume that b is zero and do the one-dimensional, one-parameter case. And so compute the likelihood for a range of
parameter values, find the maximum likelihood either by eye or using an optimization routine, and maybe plot the distribution. After you've done that, imagine that I now tell you that a previous experiment found the probability distribution of a to be a normal distribution with this mean and this width; then plot how you update that prior with your new information to give a posterior — you're just multiplying things together. And then there's an extension: you've done it with just one parameter, but you could do it with two, so plot the joint distribution of a and b, and maybe the marginal distribution of a — maybe b is an uninteresting parameter that you need to marginalize over. Okay, so that is the breakout. So let's go through, quickly, not the solution to the exact problem I gave, but an example of doing this in research. This is taken from my own research, where I'm evaluating the distribution for a cosmological parameter — the density parameter that we mentioned earlier — based on some data that I've measured from a galaxy redshift survey. Okay. Most of this notebook, which I can post if people are interested, is about building up the model. In the breakout problem the model is very simple; very often in cosmology, and in other sciences I'm sure, the model is not simple, so most of the notebook is generating the model. But the actual likelihood evaluation is relatively straightforward. Here I have loaded the data set; it has 36 data points, those are their values, and this is the covariance matrix. It's not trivial — it has a funny shape because it's a funny combination of different data sets, so some are covariant and some are not. I'm just demonstrating how I would implement a solution to this problem. Sorry — what do the data represent? Okay, it's something a bit different to that, but it could be that, for instance. Okay, but what I want to highlight is that at an abstract level, what you are doing is extremely generic:
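That generic recipe — data, errors, a model function, a log-likelihood, an optimizer — can be sketched with scipy's fmin as mentioned on the earlier slide; the model and all numbers here are hypothetical:

```python
import numpy as np
from scipy.optimize import fmin

# Sketch: minimize the negative log-likelihood with fmin to get
# maximum-likelihood parameter estimates for a two-parameter toy model.
def neg_log_like(params, x, y, sigma):
    a, b = params
    model = a * x + b              # hypothetical straight-line model
    return 0.5 * np.sum(((y - model) / sigma) ** 2)

x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0                  # noiseless toy data, true (a, b) = (2, 1)
sigma = np.ones_like(x)

best = fmin(neg_log_like, x0=[1.0, 0.0], args=(x, y, sigma), disp=False)
```

For the covariance estimates mentioned above, `scipy.optimize.curve_fit` (a wrapper around the leastsq machinery) returns both the best-fit parameters and an estimated parameter covariance matrix in one call.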
there is a data set, there is a covariance matrix, there is some model. Then I write a function, the log-likelihood function — yes, thank you — which takes in the data, the covariance matrix, and the parameter estimates, and returns a log-likelihood value. So in this case it takes a two-parameter vector, the matter density and the equation of state parameter; for those two parameters it can generate a model — that's this vector — compare it to the data, and then, with the covariance matrix stored as a NumPy array, evaluate the log-likelihood. Let's look at the commented line above, because it's clearer; the other line is just a numerical trick, which I won't talk about — instead of taking delta times the inverse covariance matrix times delta transpose, it returns the same thing in a more numerically stable way. Okay, so that defines the likelihood, and hopefully several of you wrote something similar in spirit to that. Once you have that, you can evaluate your likelihood in a number of different ways — we'll be talking about MCMC in a moment — or, because it's simple here, just use a grid of values. So I define a grid of values for one of the parameters and evaluate the likelihood, and I get a plot. It's not quite Gaussian — it has a funny bump, which is interesting — and from this I can estimate the mean value and the width of the distribution. Oh yeah — again, exactly — no, it's not normalization, it's because the model is not a terribly good fit to the data. That's something that has to be grappled with; it happens. Actually, several of the data points are extremely deviant from the model: it's a good fit to three quarters of the data and not to the other quarter — or the model is bad; you can interpret that two ways. Okay, so I guess the only other thing to comment on here is estimating the width of the distribution. Finding the peak is done by eye or using fmin; here I've
estimated the width just by noting that if you compare the values of the log-likelihood at several different points, you can estimate the value of sigma — and that works in higher dimensions as well, so this is a common way of estimating the covariance matrix. And as I discussed with some people, there are routines within scipy.optimize — the leastsq routine — that can return an estimate of the covariance matrix as well as the likelihood peak. Does anyone have any questions or comments about my research? Can I use your machine? Okay. So we're going to continue the discussion of Bayesian inference — thanks to Berian for a nice introduction, it makes my discussion a little bit easier — and please ask a lot of questions, especially those of you who aren't experienced with Bayesian inference or these types of statistical methods. So, as we discussed before, the goal of Bayesian inference is to produce the posterior probability distribution over the parameters of interest. In Bayesian inference we treat the parameters as random variables that have distributions, and that's really nice, because if we knew the posterior distribution perfectly, estimating the parameters and their uncertainties would be extremely easy: if you have the whole distribution, you know its median, you know its central 95% region, and all statistical inferences are really kind of trivial once you have the posterior. Oppose this to frequentist statistics, where we treat the parameters as fixed and the data as random, and you're always comparing a fixed parameter with a new realization of your data, trying to come up with confidence intervals or doing hypothesis tests and things like that. So in a way I find thinking about Bayesian statistics a lot more intuitive. I think a lot of the differences between fields are based on how people conceptualize their data and their parameters. For instance, in cosmology I think there's often a lot
of prior information about your parameters, and those assumed distributions are really powerful in trying to constrain the posterior — the distribution of the parameter values given the data. So, as Berian said, our main tool in Bayesian inference is Bayes' theorem, where we have the prior distribution of our parameters, given some modeling assumptions or prior information, which is updated via the likelihood — the probability of the data given the parameters — to give us our posterior, the updated probability of our parameters given the data. Right, so like I said before, contrast this with classical statistics, where the parameters are fixed, the data are random, and all statistical inferences are about hypothetical repeated experiments. Frequentists like to talk about coverage, which is basically: if you were to repeat the experiment many times, what percent of the time would the true parameter actually be covered by your confidence interval? For frequentists, that's the thing you're after, whereas in Bayesian statistics you're worried about the distribution of your parameters given your data. So for Bayesian inference, really the challenge is finding the posterior distribution. Once you have your posterior, all inference is trivial: you have the entire distribution of your parameters — presumably the parameters that you're interested in — so you can take the mean, you can take the standard deviation, you can look at the central 95% credible interval. But getting the posterior is the challenge. So the simplest case: say we have some likelihood, the probability of data given parameters. The simplest approach is to assume what's called a conjugate prior, a prior that gives us a closed-form solution for the posterior distribution. So for example, if our likelihood is normal, or Gaussian — here we've assumed that we know the variance, so this is fixed and known — then our parameter of interest here is mu, so this is
just shorthand for the probability of the data given the parameter — that's our likelihood. If we assume a normal prior on mu — if we assume that mu is drawn from some normal distribution with hyperparameters mu-naught and sigma-naught squared — then there's actually a closed form for our posterior: it's simply this normal distribution, and you see that both the prior hyperparameters (the prior distribution of mu) and the data come into play in updating our knowledge about mu. It's really natural: it's a weighted average between your prior belief and what the data tell you. And as you collect more data, the terms with the x's and the n's in them start to dominate over your prior, so in the limit that you collect an infinite amount of data you essentially converge to what frequentist statistics — the maximum likelihood — would tell you; but if you have very little data, your prior belief takes over. So you're always walking between those two regimes. These conjugate forms exist for a lot of different likelihoods: if you just go into Wikipedia and type in "conjugate prior", there are dozens and dozens of likelihoods that have nice, simple conjugates. So this is the simplest form imaginable. Right — that's certainly a good point: why restrict yourself to a normal prior? A normal prior is a very particular type of distribution. Certainly — the question is, why pick anything at all? There's some sort of knowledge in the background, and there's actually a field of statistics called elicitation of priors, which asks: based on all the knowledge I have before seeing the data, what is my best guess, and how do I write a distribution down for it? So there's a whole subfield about questioning the experts to try to come up with the best prior possible. But typically, whenever you do Bayesian statistics, you want to check the assumed prior; if you don't do that, then you don't
know if your prior is driving the results or the data. So posterior predictive checks and sensitivity to the prior are really important things to look at. Okay, so for more complex and realistic models, conjugate priors don't exist, or you might not want to assume one. In that setting, exact computation of the posterior distribution is impossible and we have to use sampling techniques: instead of writing down the functional form of our posterior distribution, we try to draw a sample from the posterior, and with that sample we can do all of our inferences: we can take the mean of the sample with respect to each parameter, credible intervals, whatever. Okay, so there are lots of different MCMC techniques out there. One of the simpler ones is called Gibbs sampling, where we assume some conjugacy in our likelihood and prior; specifically, say our entire parameter set is theta_1 through theta_k: given some subset of the parameters, there is conjugacy, so you can write down the posterior of that subset given the rest of the parameters plus the data, and likewise for the other subset. You basically iterate back and forth, sampling one subset of the parameters given the other subset and vice versa, and you keep iterating until you converge. This is probably the simplest MCMC sampler out there, and it doesn't have to be just two subsets: it can be any number of subsets and any number of iterations. It's nice, it converges quite quickly, and there are lots of problems where you can actually write this down, but it usually takes a lot of math and making sure your terms are correct. The pain is not in the running of the algorithm but in writing down the exact updating formulae for
doing the sampling. The more commonly used MCMC is Metropolis-Hastings, and variants of it. Here you basically take a random walk through your parameter space. At each step there's a proposal density: you start somewhere in your parameter space, propose a new point via the proposal density, and then for that proposed parameter value you compute the likelihood and the prior, so your posterior is just the product of those two, and you compare it to where you were. If the posterior probability has gone up, you take that step; if it hasn't, you take that step only with a certain probability. And this is proven to converge to the posterior: if you take this Markov chain Monte Carlo random walk through your parameter space, then under fairly general conditions you will converge to the true posterior distribution of your parameters. I'll show this in the breakout session, but maybe the banana example will work here. Assume this is our two-dimensional parameter space, and this is the actual posterior, the probability of the parameters given x. What a Metropolis-Hastings algorithm will do is start somewhere; it's actually the same with Gibbs sampling, but Gibbs sampling knows the exact form of p(theta_1 | theta_2, x) and p(theta_2 | theta_1, x), so you can sample directly from those conditionals, alternating back and forth. You quickly converge to the region in here where there's a lot of density, and then you just kind of randomly walk around that area and sample from the posterior. There's always a burn-in phase: you start in some random part of the parameter space and random-walk until you're actually truly capturing that density, and you typically throw away that burn-in stage. Okay, so in Metropolis-Hastings you typically take Gaussian proposal
density distributions. So here we'd have theta_0, our initial guess of the parameters; we draw another parameter value from the proposal density and check the posterior density here versus here. If it's greater, we move there; if it's less, we probably won't, and little by little we converge to the posterior. The proposal is something you can set manually; typically you just use a Gaussian. Oh, sorry: the proposal density tells you, given your current parameter value, where we're going to jump next. It's not trying to find the maximum; the proposal is just where we go next. And there's actually a tuning parameter in all of this: how far are we going to jump? Are we going to try to jump way over here, or very close by? You want to tune that to get an appropriate balance of rejected and accepted steps. What was brought up is a good point, though: Metropolis-Hastings is not just a hill-climbing algorithm. Many algorithms will just go to the highest point and stay there, whereas Metropolis-Hastings can go downhill: it doesn't just go to the maximum of the posterior, it can come down, go here, come back, and in that way it samples the entire density rather than just focusing on the maximum. How are we getting the posterior? You can compute it: there's a closed form for the likelihood and a closed form for the prior; what there's not is a closed form for the posterior. At any given parameter value you can perfectly compute the prior density and perfectly compute the likelihood; that's kind of the whole point: we can't write down a form for the posterior, but we can sample from it. Is it sampling the posterior from the very start? Once the chain converges, yes; at your initial point, no.
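To make that accept/reject step concrete, here's a minimal random-walk Metropolis sketch in plain Python. This is my own toy, not PyMC code: the one-dimensional unnormalized log-posterior, the function names, and the settings are all illustrative assumptions.

```python
import math
import random

def log_post(theta):
    # Toy unnormalized log-posterior: a Gaussian centred at 3.
    # In a real problem this would be log-prior + log-likelihood.
    return -0.5 * (theta - 3.0) ** 2

def metropolis(log_post, theta0, n_iter, step=1.0, seed=42):
    """Random-walk Metropolis with a Gaussian proposal of width `step`."""
    rng = random.Random(seed)
    theta = theta0
    lp = log_post(theta)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)      # propose a jump
        lp_new = log_post(proposal)
        # Accept with probability min(1, post_new / post_old);
        # uphill moves are always taken, downhill moves sometimes.
        if lp_new >= lp or rng.random() < math.exp(lp_new - lp):
            theta, lp = proposal, lp_new
        samples.append(theta)                        # keep current value either way
    return samples

chain = metropolis(log_post, theta0=-10.0, n_iter=20000)
kept = chain[2000:]                                  # throw away burn-in
print(sum(kept) / len(kept))                         # should end up near 3
```

Note how the comparison is done on log densities: the ratio of posteriors becomes a difference of logs, which avoids underflow for realistic likelihoods.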
Right, but once you converge, there are properties of Markov chains: if the chain is ergodic, then once you've converged you stay sampling from that exact density. The acceptance probability is determined by the ratio of your posteriors: if you have one parameter value and another, you compute the posterior density at each of those points, and the ratio of the new posterior over the old posterior is the probability that you take the step. If that ratio is greater than one, you always take the step; if it's less than one, you take the step with that probability. Okay, good; any other questions on these sampling techniques? Okay. So the nice thing is that with Python, and particularly PyMC, we don't have to think too hard about constructing these samplers: they have really nice implementations, and it's pretty straightforward to put in really complicated probability models and do posterior sampling from them. We can do hierarchical models, which we'll see in the homework. There are also convergence diagnostics: you want to know if your chain has actually converged to a steady state, the actual posterior, and there's a handful of diagnostics included with PyMC. Also tuning: with Metropolis-Hastings there's a tuning parameter, the width of your proposal density, and PyMC has automatic routines to get an appropriate rejection ratio. There are rules of thumb: you don't want to accept every jump, because that means you're not jumping far enough, but you don't want to reject too many, because that means you're jumping too far. There's a sweet spot you want to be in to make the MCMC converge faster. And there's really nice documentation, so go ahead and look at their website. Okay, so let's see how we can build a model with PyMC. I've taken this data set of coal mining disasters in the UK: a time series where every year from 1850 to 1960 they recorded how many coal
mining disasters there were, and we want to use this data. It seems that the rate is pretty high up to some point and pretty low after that, so it looks like at some point there was a switch, maybe around here, from not-very-safe mining conditions to safer ones. So let's try to estimate when that change occurred, and maybe the magnitude of that change, in disasters per year. We can write that down as a statistical model. Our likelihood here is Poisson. Poisson is a counting likelihood, a discrete likelihood: it gives you the probability that some count, 0, 1, 2, 3, out to infinity, occurs within a time window, and it's parameterized by a rate, which here is the number of disasters per year. So if we draw from a Poisson with that rate, we get how many disasters occur in a given year. Now we model the rate function as one value before time s and some other value after time s, meaning the rate was one thing until some time s, and then the rate changed to something else. So if we knew the parameters e, l and s, we would know exactly when that change occurred and what its magnitude was. In Bayesian inference we have probability distributions over our parameters, so we need to specify priors: before seeing any data, what's our prior belief about the values of the parameters? Here for s we've chosen a discrete uniform distribution, which basically says there's equal probability of the change falling anywhere from 1850 to 1960. That seems sensible: before seeing the data, we don't want to assume it happened in one place more than another. Then for e and l, the two rates, we've chosen exponential priors; the exponential has nice properties for modeling rates, since it's constrained to be positive, and these are
hyperparameters, r_e and r_l: depending on what values you put in there, we can give more or less weight to lower or higher values. Any questions about this model? So this is our statistical model: definitely not Gaussian. Okay, so how do we code up such a model in PyMC? There's a file on bSpace called disastermodel.py which has this code. Within PyMC there are lots of different distributions; we have DiscreteUniform, so to instantiate a random variable s, our parameter for the change time, it's as easy as this one line of code: pymc.DiscreteUniform with 's', the name of the variable, and then a lower and upper bound. Here it's 0 to 110, plus 1850, or whatever; it's not too important. Same with e and l, which we've chosen to be Exponential, and here we've fixed beta, the parameter of the exponential, to be 1. Going back here, this is beta; there they're written as r_e and r_l, and we're saying r_e and r_l are both 1, which is a relatively uninformative prior on e and l. It looks something like this: it extends out to around 8 or 10, and it's basically telling us our rate is probably more likely to be lower, but it could be pretty high; it's relatively uninformative. And we'll see that even though our prior belief is that e and l follow the same distribution, our posterior will show that they're in fact very different. Okay, and then we also have this variable r, which is e at times less than s and l at times greater than s, and the way we instantiate that in PyMC is with this pymc.deterministic decorator. That essentially means that when we do the sampling, PyMC is not going to sample r, because r is completely determined by the other parameters: it's not random; if we know s, e and l, then r is just a simple function of them. Okay, and finally we have our likelihood: again a Poisson likelihood, where D is our data, so it's just a vector of the number
of accidents per year, and the mean of that Poisson is r, our rate. PyMC doesn't really make a distinction between likelihoods and priors, except that if you say observed=True, it means we've actually observed data following this distribution, and we have to put the value of that data here: disasters_array is just the data array. Would you have to use a more complex model? Basically, at the heart of it, it wouldn't be Poisson, but it wouldn't be normal either, because our counts are discrete. You would use a different likelihood, because Poisson assumes that the mean and the variance are the same, but there are other distributions out there you could use; basically the whole model would be different, with probably a few more parameters. Okay, so how do we fit this model? By fit I mean: how do we sample the posterior for all of these parameters of interest? It's pretty easy. This is saved in some .py file; we import that, we import MCMC from PyMC, and then we instantiate this MCMC object, M. So let's do that and see how much cool stuff we can do. How's that look? Okay, so here we're just instantiating the model: we've imported the disaster model, and here's our MCMC object; that seems to work. PyMC also has this really cool graphing function where we can visualize our model as a graphical model, and I'll tell you what that means in a second, but if I just run this line of code with the model, it saves a .png file, or any type of file you want, and then I read the file back in to plot it. So this is our model. Any questions? What's the difference between an ellipse and a triangle? So L, E and S are parameters; the ellipse means there's a prior distribution over that parameter. R is related to them through their values, but R is not random: it's determined completely by L, E and S, which is why it's a triangle, I guess,
with L, E and S pointing in. Then, based on what R is, R produces D through this mu, which is the mean of the Poisson, and D is shaded in because we've observed it: it's actual data. There are some really cool examples in the PyMC documentation of these crazy graphical models, and anyone who's doing really hardcore Bayesian modeling is thinking about all of this in terms of graphical models. It's a good way of keeping track of what's random, what's not random, what's observed, how things relate to one another, and it's a good way to check that your model is actually right. Okay, so when we instantiate this disaster model, it just starts with some arbitrary values of each of the parameters, but we can draw samples from the posterior using M.sample; M again was our MCMC object. Here we're using the Metropolis-Hastings algorithm, and I've said: take 10,000 samples, throw away the first thousand because I'm assuming it won't converge immediately, and this thinning parameter means you don't keep every single sample, you throw away every second one. That's to try to minimize the autocorrelation in your samples: because you're taking these very small jumps in parameter space, there's going to be quite high correlation between one value and the next. There are people who argue that doesn't matter, and some who say it matters a lot; it's typically safer, yeah, exactly, to throw a few away, especially when, as here, the likelihood calls are basically instantaneous. You can also use isample, which does interactive sampling, where you can stop and restart the sampler and do other things mid-sample; if you're doing something really computationally intensive, you might want to stop it, realize your model is wrong, change things, and then go back. Okay, so that ran pretty quickly. We can look at summary statistics from that run, and we get a lot of information: here we have each of our
parameters, e, l, and r, which is of course this giant vector; this is all the information on r and s, and we get things like the mean, how many samples we took, all the quantiles, the standard deviation, all the useful stuff. We can plot what's called the trace of the sampler, which is just a plot of all the samples of the parameter s as a function of the MCMC iteration, and we see that within the first 1500 iterations we certainly were not converged: we're all over the place, and then at some point we converge and stay in a steady state for the rest of the chain. This tells me we need to burn in longer: we need to burn in until we're pretty confident we've reached the actual posterior distribution. Before this point we certainly have not reached that distribution; we're lost in the parameter space, looking for the right posterior, and at some point we latch onto it, and by the properties of the MCMC we're good: we're going to stay there. Any questions? So I need to go back up and burn in for longer, say 4000, and it takes a little while. Okay, so now let's see; ah, perfect. So here are samples of s as a function of MCMC iteration; if you want to draw inferences about s, you basically collapse all these samples onto the y axis, and that gives you a distribution over s. Okay, so I've done this here for l, one of the other parameters: I just take the histogram of all the samples of l, and on top of it I've plotted the posterior mean as well as the 2.5 and 97.5 percent quantiles, which give us a 95% credible set for the true value of l. This is what it looks like, and the conclusion I can state is that with 95% probability, l is between 0.7 and 1.2 disasters per year, with a mean of 0.93 disasters per year. I wonder what counts as a disaster; there could be other things, yeah: maybe our model is not rich enough to include all the effects. Okay, so within PyMC we also have our
own plotting functionality, Matplot. This Matplot lives in PyMC, and Matplot.plot will just plot all of these good things for each of the parameters. So here is the trace of e and the posterior distribution of e, and the same for s and for l. s is discrete, s is a year, so I think that's why you see this kind of weird pattern; that's a good question, I'm not sure. Yes, you can plot that; that's actually my next plot. This is plotting e versus s, and we see a little bit of anti-correlation, which, if you look back at the original data, I think makes sense. Here we're in the range of around 3.25 pre-switch disasters per year, and here's the change point, in years past 1850. As the change point gets later, e is actually smaller, which kind of makes sense, because the rate does seem to be getting lower with the year. Okay, so these are all the inferences we can do: take posterior means and credible intervals. Here our posterior mean for the year of the change point is 1890; the credible interval goes from 1886 to 1896; note that it's not symmetric around 1890. Here I want to look at the posterior mean of the magnitude of the change, so how many disasters per year are we preventing after that point: 2.12 is the mean, and the credible set is roughly between 1.5 and 2.75, so that's a significant number of disasters that we're preventing. So this was the original data, and one really neat thing about Bayesian inference is that we can draw realizations of our model from our MCMC sample: each parameter vector in MCMC space gives us a full model. Here I've just taken the first 250 MCMC samples and plotted all the models; these are draws from the posterior distribution of our models, and it gives you a really nice idea about what is going on in this model.
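The summaries I keep quoting, a posterior mean plus the 2.5% and 97.5% quantiles, are easy to compute yourself from any chain of draws, with or without PyMC's helpers. A sketch in plain Python; the fake draws and all names here are mine, standing in for a converged chain:

```python
import random

def credible_summary(samples, lo=0.025, hi=0.975):
    """Posterior mean and central credible interval from MCMC draws.

    Uses simple empirical quantiles: sort the draws and index in.
    """
    s = sorted(samples)
    n = len(s)
    mean = sum(s) / n
    return mean, s[int(lo * n)], s[int(hi * n)]

# Fake "posterior draws" standing in for a real chain (e.g. for l):
rng = random.Random(0)
draws = [rng.gauss(0.93, 0.12) for _ in range(10000)]

mean, low, high = credible_summary(draws)
print("mean %.2f, 95%% credible set (%.2f, %.2f)" % (mean, low, high))
```

Collapsing the trace onto the y axis, as described above, is exactly this: you forget the iteration index and treat the kept draws as a sample from the posterior.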
In this parameter space, here there's quite a lot of spread in the e parameter, and here there's much less, and you can see all the correlations: if the change point happens later, then, I guess, it's lower in this one. Okay, I just wanted to touch on a couple of other things; any questions about any of that? Okay, so diagnostics are a really important part of doing MCMC: we want to be sure that we've actually converged to the true posterior. The first check I would suggest is starting multiple chains from different starting values and ensuring that they do indeed converge to the same place, and there are statistics commonly used in the literature to test convergence of multiple chains, as well as more formal methods; in the notebook I think I showed the result of one of those. And then finally, goodness of fit: posterior predictive checks, which test not only whether we've converged but also whether the model is good enough to actually reproduce the data. If you look in the PyMC manual, there's a lot about all these techniques. Okay, so I think PyMC is pretty great, but there's a lot of other MCMC and Bayesian inference code out there in Python. Some of it I think is not as good as PyMC, but it might have some functionality or some models that are not included in PyMC, which is pretty recent. There are lots of different samplers out there; Metropolis-Hastings is one, and there are different versions of it in PyMC, but there are other samplers that try to do more efficient sampling of really complicated posterior distributions, multimodal distributions, so there's lots of code out there to look at. In R there are a lot of things too; some of those have really particular models included that might be difficult to code yourself, but it's usually pretty application-specific. Okay, so I guess I'll introduce the homework now. We'll be modeling batting averages for baseball players, and
this is kind of a neat example of how Bayesian fitting can give you a pretty reasonable answer where the typical maximum likelihood approach is just a terrible estimate. So the setup: for those who don't know, the batting average of a hitter is the total number of hits that the hitter has made divided by the number of times at bat (and cricket is probably different, I don't know). So here the number of hits is x_i, the number of at-bats is n_i, and our task is to estimate a hitter's true batting average, which I call mu_i, over the entire season from the first month of data. We get to see the hitter's performance for one month, and we want to estimate what the batting average at the end of the year will be, assuming that hitter has the same aptitude throughout the entire season. A good way of modeling that is with this likelihood: the number of hits given some mu_i is binomial, where mu_i is the probability of success at any given at-bat, so the probability of a hit during a single at-bat, and n_i is the number of at-bats. The implicit assumptions of the binomial are that mu_i is unchanging and that the different at-bats are independent, so it's a relatively simple model; maybe it's not realistic, but it works relatively well. In the data files you'll find n_i, the number of at-bats, and x_i, the number of hits, for the month of April and then for the entire season, and instantiating this in PyMC is quite simple. Your task is, I think, three things. First, find the maximum likelihood estimate of mu_i for each player, given the April stats only, using the data in la2001april.txt. The hint is that it takes a simple closed form: it's the sample mean, which is obviously a terrible estimate, because if you see someone get five hits in ten at-bats, you can't assume they're going to bat that well the whole season; we have prior information about how well people
typically bat. Yes, in the data sets; no, in the world; I'll tell you in part b. In part b we'll find the posterior mean of each mu_i. Each player has a mu_i, each player has an aptitude, but now we assume a beta prior over the mu_i's, so that each mu_i follows some distribution, and that beta prior encodes our prior knowledge about baseball batting averages, with alpha and beta controlling the mean and the variance. A beta distribution is restricted to between 0 and 1; it's a continuous distribution, but it can take on different means and variances based on the values of alpha and beta. So just look at Wikipedia to figure out the appropriate alpha and beta parameters to get a mean batting average around 0.25 with a standard deviation (sorry, this should say standard deviation) on the order of 0.05; typically most people bat between 0.2 and 0.3. So in part b we're basically modeling this parameter mu_i for each player as following a beta distribution with fixed parameters. In part c we take that a step further and put a prior on each of the hyperparameters alpha and beta: one level up, we now have alpha, given two other parameters r and s, following a gamma distribution, and likewise beta given u and v, and here I don't give you values of r, s, u and v, so just pick some and see what happens. This is a hierarchical Bayesian model, and we can think of it in terms of the graphical model: instead of having a mu_i pointing to each player's data, we go one level up and have alpha and beta pointing to the mu_i's, and then r, s and u, v pointing to alpha and beta. It's hierarchical because we now have several layers of parameters and hyperparameters, each with its own distribution. Okay, questions? Okay, so in terms of conjugacy, actually I shouldn't tell you this, but part b is conjugate: there's a closed-form solution to the posterior given a beta prior, if that makes sense. If you know the beta prior and you have a binomial likelihood, then you have the closed-form solution for your
posterior. That's not the case for part c: with gamma priors on alpha and beta, there's no closed-form solution; if you wrote it all out in math, it wouldn't get you to any usable distribution, so you'll certainly have to use MCMC there. For part b you can use MCMC, but you don't have to: you can figure out what the posterior is analytically. So the point of the assignment is to plot each estimate against the full-season batting average for each player: the MLE, the posterior mean from part b, and the posterior mean from part c, and you'll see which ones are better or worse estimates. Here I'm treating the full-season batting average as standing in for the true mu_i of each player. Then, additionally, report your posterior means for alpha and beta in part c. The neat thing is that whereas in part b you have to fix alpha and beta based on your knowledge that the mean is 0.25 and the standard deviation is 0.05, in part c we can actually infer the posterior for alpha and beta without having to choose values for them. Okay, any questions?
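Going back to the normal-normal conjugate example from the start of the lecture: the closed-form posterior update really is just a precision-weighted average of the prior mean and the sample mean. A sketch, with made-up function and variable names:

```python
def normal_posterior(data, sigma2, mu0, sigma0_2):
    """Posterior for the mean mu of a normal likelihood with KNOWN
    variance sigma2, under a Normal(mu0, sigma0_2) prior on mu.

    Precisions (1/variance) add: prior precision + n * data precision.
    The posterior mean is the precision-weighted average of mu0 and x-bar,
    so more data pulls the answer toward the sample mean.
    """
    n = len(data)
    xbar = sum(data) / n
    prec = 1.0 / sigma0_2 + n / sigma2               # posterior precision
    post_var = 1.0 / prec
    post_mean = post_var * (mu0 / sigma0_2 + n * xbar / sigma2)
    return post_mean, post_var

# A vague prior and a little data: the data term already dominates.
mean, var = normal_posterior([2.9, 3.1, 3.0, 2.8, 3.2], sigma2=1.0,
                             mu0=0.0, sigma0_2=100.0)
print(mean, var)   # mean close to the sample mean 3.0, variance near 1/5
```

Shrinking sigma0_2 toward zero pins the answer at the prior mean; growing it recovers the maximum likelihood answer, which is the "walking between two regimes" point from the lecture.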
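As a hint toward part b of the homework (without doing it for you): matching a beta prior to a target mean and standard deviation, and the conjugate beta-binomial update, can be sketched like this. The helper names and the 5-hits-in-10 example are mine; the 0.25 mean and 0.05 standard deviation are the targets from the assignment.

```python
def beta_from_moments(mean, sd):
    """Solve for Beta(alpha, beta) hyperparameters from a target mean and sd.

    For Beta(a, b): mean = a/(a+b), var = mean*(1-mean)/(a+b+1).
    """
    var = sd ** 2
    total = mean * (1.0 - mean) / var - 1.0          # this is a + b
    return mean * total, (1.0 - mean) * total

def posterior_mean(alpha, beta, hits, at_bats):
    """Beta prior + binomial likelihood is conjugate:
    the posterior is Beta(alpha + hits, beta + at_bats - hits),
    whose mean is (alpha + hits) / (alpha + beta + at_bats)."""
    return (alpha + hits) / (alpha + beta + at_bats)

a, b = beta_from_moments(0.25, 0.05)   # prior belief about batting averages
print(a, b)                            # roughly 18.5 and 55.5

# A hot April start: 5 hits in 10 at-bats.  The MLE says 0.500,
# but the posterior mean is shrunk heavily toward the prior:
print(posterior_mean(a, b, hits=5, at_bats=10))   # about 0.28
```

With zero at-bats the posterior mean is just the prior mean, 0.25, and as at-bats accumulate it moves toward the player's raw average, which is exactly the shrinkage behavior the assignment is meant to demonstrate.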