Well, I want to start by thanking Kyle. Kyle has done so much to get this together. I'm really excited about the week, and I'm really excited about being here, and I'm really thankful to... I know it's a lot of work, and when people are in the junior stages of their careers, this is a big commitment, and I really appreciate it. I'm gonna talk about very, very first things in statistics, just to set the stage. I'm not really gonna say things that are necessarily relevant to your plans, but I want to just say some things that get us talking about statistics and data analysis, and hopefully I can key into things that other people can talk about as the week goes on. I should say, one of the things Phil is gonna say is that in the hack sessions, we're gonna talk about what our expertise is. My expertise is in computational data analysis, and if you're doing a computational data analysis, I'm happy to consult and give advice and help with things. There are a lot of problems we've done, and we know a lot about doing computational data analysis in astronomy, and I'm here to help with that. That's one of the reasons I want to be here, and my goal is to influence new projects. In fact, I like these hack weeks because they're an opportunity to influence actual projects that are going on, rather than waiting till they're over and criticizing them too late. Okay, so I'm gonna talk about the simplest possible data analysis problem that I know. Maybe I could go a little simpler, but probably not. You have some data points in a 2D scatter plot like this, with error bars, and you want to fit a line to those data points. So I'm literally just gonna talk about that problem. In fact, I'm not even really gonna solve that problem, I'm just gonna talk about it. Now, I have written a 50-page paper about this problem, which is the worst waste of your time imaginable, and I don't recommend you read it. However, if you are in a situation where you're having some issues with fitting a model to data, there are a lot of different sections in that paper, and I can help guide you into that paper where something might be useful. It opens with some very simple comments, some of which I'm gonna make now, and it has comments about what you do about outlier data points, and what you do if you have errors in both directions, and what you do if your model has higher complexity or whatever. So there's intrinsic scatter, there are various issues that are discussed there, but let's just talk about the dumbest thing here, which is just: we wanna figure out what line to put through these data. I did it, boom! Okay, next, I think Jake's up now. No, you can do that, actually, that's one of the things you can do, of course, but let's, I don't wanna be too flippant. It is actually true that if your data are very good, you often can just read it off the plot, and sometimes, in some cases, when you have a strong prior on what it should be, say you're asking, maybe these are very good measurements of stellar temperatures, and these are very bad measurements of stellar temperatures, you can just check, does the slope look like it's unity? You know, and that's the kind of thing you can often do without doing any computation at all. Good, so what do we do usually in this problem?
So the way we usually solve this problem, or the way I was taught to solve this problem when I was an undergrad, is you say that, well, there's a true model here, which is that y equals mx plus b. This is a description of the expectation for my data, and then there's some scatter of the points around that, and what I'm gonna do is try to find the values of these parameters; this is a parameter, and this is a parameter, this is the parameter we might call the slope, and this is the parameter we'd call the intercept. I try to find the values of those parameters that somehow minimize the scatter, right? And that's sort of what I was taught to do in lab class or whatever in physics. So the way we usually do that, now I'm gonna jump forward to like maybe the third semester of lab class, is we compute a statistic which we call chi-squared, which looks like this. Let me just be a little bit more specific: this is data point n, this data point n has a value yn, and it has a value xn, and there are capital N of these data points. By the way, if you're a statistician or you have a stats background, I'm using n and N differently than a statistician usually does, and I apologize for that, but I don't care that much. For a statistician, little n is the size of your sample, and capital N is the size of your population. Like, 18 people were in your study, and there's 300 million in the United States. So if I call this my model for the data, then my belief about this data point is that it's really here, and it was offset by some amount, and so really I can think of there as being a prediction, let's call it a prediction yn tilde, which is m times xn plus b, which is our prediction for the nth data point. So I put a tilde on it to imply that it's a prediction; the true value is just the plain old y. And so the chi-squared statistic that people usually use looks like this: chi-squared is, whoops, I think I'll need a sum, it's a sum over the N data points of yn minus the prediction, squared, over some error estimate squared. And notice this is the statistic; we usually try to minimize this, maximizing this is not a good idea. We try to minimize this thing, and what are we trying to do? We're trying to reduce the difference, we're trying to find the values of m and b that reduce the difference between the true data and the data predictions. But we're weighting the different data differently: data with large error bars get low weights, and data with small error bars get high weights, right? And that's how we do this fitting of a line. Does that seem familiar to most people? Anybody wanna ask anything at this point? Okay, cause we're about to get a little crazy. Before we get crazy, one thing that a physicist loves about this thing is dimensional analysis; if you think about it dimensionally, like, if these y's have units, these sigma n's also have units, so this thing is a dimensionless number, which is pretty sweet. So that's one thing I love about this object: it's a dimensionless object which tells you about the goodness of fit. Unlike mean squared error or the sum of the absolute values of the deviations, there are lots of other statistics you could write down. This statistic is a dimensionless statistic, which is very nice. The other thing is it has a connection to geometry, which we're gonna see in a minute: it looks like a metric distance.
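To make that concrete, here's a minimal sketch of that statistic in Python; the array names x, y, and sigma_y are hypothetical, standing in for the data values and their assumed-known noise levels:

```python
import numpy as np

def chi_squared(m, b, x, y, sigma_y):
    """Chi-squared for the straight-line model, with data arrays x and y
    and (assumed known) Gaussian noise standard deviations sigma_y."""
    y_pred = m * x + b                       # the prediction "y_n tilde"
    return np.sum((y - y_pred) ** 2 / sigma_y ** 2)
```

The weighting is right there in the denominator: points with large sigma_y contribute almost nothing to the sum.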
This looks like a Euclidean distance between the model and the data, cause it's a sum of squared distances; like, you could imagine, in N-dimensional space, what's the distance between two points? This looks like a Euclidean distance. That's gonna come up in a second. So physicists love this. Okay, so now my question is, if you do this, what assumptions are you making? Implicitly, you must be making assumptions. What are those assumptions? Yes, hit me. Ah, you were somehow assuming something, I kind of, I don't wanna go too far down that road, because, like, people are not transparent. Can I steal another one of these whiteboards? Does anybody know? Nobody cares, right? If I just take one of these whiteboards, no one's gonna, is somebody gonna get upset with me? That's amusing. The backside of that one's blank. Awesome. Okay, cause I wanna write my assumptions down. I really wanna write my assumptions down, cause of the point I'm about to make. Steering these things from the back is not a good idea, by the way, people, just in case you wanna know for the future. Sorry. I apologize. I should have thought of this in advance. Thank you. I really wanna write down these assumptions because one of the points I'm gonna make is that whenever you're doing a computational data analysis, or any data analysis at all, you want to be very clear about what your assumptions are, and you wanna be clear that the method you're using meets those assumptions. So we're gonna try, and I hope it becomes a theme of the things we talk about this week, that we make sure we're very clear what our assumptions are and then clear that our methods are meeting those assumptions. And then if we can't meet the assumptions, instead of just hacking something, of course it's Astro Hack Week so I can't really complain about hacking, we try to back off the assumptions. Our hack should be to somehow relax the assumptions rather than say, oh, whatever, I can't do the right thing so I'm just gonna put a plus two in here in my code. You know what I'm saying? Okay, good. So one of the assumptions that's being made is some kind of Gaussianity assumption. So let's see: assumptions. Can you say a little bit more about that Gaussian assumption? What am I assuming? Good, very good. That was very nicely stated, that each point is somehow sampling away from its true value with some kind of Gaussian. There's another assumption under there. There's a very nice, so there's another kind of assumption going on, which is that somehow we know these variances. These are somehow true. So how does the Gaussian assumption come in here? Why is there a Gaussian assumption? It's because this is a term, if you write down the Gaussian model for generating the data and you take the log of that model, this is negative two times the log of the value of the Gaussian distribution. So this is related to the Gaussian distribution directly. So implicitly it assumes the data are somehow generated by a Gaussian. So that's a crazy thing, first of all. But second of all, it assumes that you know the variances of those Gaussians, that these errors are correct. And of course, these errors are never correct, for very deep reasons. You never know these correctly. But you are somehow assuming that you do. So it's both that you're assuming some Gaussianity and you're also assuming known error variances. I should really call them noise variances, because when it's in your data, I think of it as being noise. Your data don't have errors, they have noise. Yes, Phil?
Exactly, so sometimes you would write this: plus noise. I often write this, in fact, because it's kind of a statement of my likelihood function, that there's a mean behavior and then there's noise. But as you say, what that would generalize to here would be plus sigma n times a unit-variance draw. This is like something that's drawn from a unit-variance, zero-mean Gaussian, and it's multiplied up by the error and added in. Oh, you're right, it's that. Yeah, I can't put it in the top equation because I don't have subscript n's, which I need. I know, that's a problem, which I apologize for. Good, I'm not going to use that, but I often do write things like that in papers. Good, that would be another way to clearly state this Gaussian assumption, and if you do make that Gaussian assumption, this is what you get: if you write down the likelihood and take the log of it, this term appears. Okay, good, what else am I assuming? Oh yeah, go ahead. Ah, I'm assuming independence. This is a very important assumption. I'm assuming independence, why independence? Because I'm summing a separate term for every data point; every data point gets its own term here, and none of the data points refer to any of the other data points, which means somehow the noise is added independently to every data point. So somehow there's an independence thing going on in here that this sum represents. Now if the data weren't independent, what would this turn into? It would be some kind of matrix; this would become an inner product of two vectors, we're gonna see that inner product in a second. It would become some kind of inner product of vectors, which would be a Euclidean distance under a more complicated metric, if you think about it in this geometric way, which will come up again later. So physicists still love that, it's still a very nice physics thing, but it doesn't look nearly as simple, because you have to switch up to linear algebra. You have to switch up to linear algebra if the points become non-independent. Okay, what else are we assuming? There are lots more assumptions, people, yes. Ah, very good. Wow, you guys are rocking it. Negligible X errors, or X noise, I should call it, because it's noise in the data. Meaning we only have to worry about whether the Y values meet the predictions. We never ask whether the X values meet the predictions. If you want the X values to also meet the predictions, things get a lot more complicated. You don't have nearly as simple a solution. In fact, there's a panoply of solutions that involve additional choices if you think that the X values also are noisy, which, by the way, is generic. When have you ever taken data where you knew this axis exactly? One example I know is when we do Kepler data: the time from that spacecraft clock really does have negligible uncertainty. But that's a very rare case. Good, what else are we assuming? Say again? Right, we're assuming somehow, this formulation is sort of assuming, that the straight line is in fact the right model. We're somehow assuming that the data really are drawn from a straight line. Or there's some kind of model appropriateness assumption, which is a little hard to state. We're not quite assuming, I think, that these data are exactly drawn from the line, but we're assuming something very close to that. What else? I got lots more.
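As a concrete version of that "plus sigma n times a unit-variance draw" statement, here's a short sketch that generates fake data under exactly these assumptions; the true parameter values, sample size, and noise levels are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true parameters and observation setup.
m_true, b_true = 1.3, 4.0
N = 20
x = np.sort(rng.uniform(0, 10, size=N))    # x values, assumed noise-free
sigma_y = rng.uniform(0.5, 2.0, size=N)    # known, independent noise levels

# y_n = m * x_n + b + sigma_n * (zero-mean, unit-variance Gaussian draw)
y = m_true * x + b_true + sigma_y * rng.standard_normal(N)
```

Faking data like this is also a cheap end-to-end test: any fitter you write should recover m_true and b_true to within the quoted uncertainties.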
So, very related to this one, we're assuming that there's no intrinsic width to this distribution. We're assuming that if you had less and less noise, these would converge onto the line, which is also related to the model appropriateness, so you could say: and zero scatter. Like, we're assuming that there would be no scatter in the model predictions if we had no noise in the data, yeah. Jake, yeah, you. Good, we're coming back to that. So there's an additional assumption, which is kind of a meta assumption, which is that all we need is an estimator. We're trying to get an estimator. So: an estimator is good enough. Sorry, I wrote that where Jake can't see it. But anyway, you know I wrote it there. We're assuming that all you need is to know the slope and the intercept. You don't need to know somehow a more complicated set of information, like the full posterior information about the slope and the intercept, or the full shape of the likelihood function. We only care about minimizing the scatter and getting a single answer out. Okay, and so one of the things that's gonna happen on Wednesday is there's gonna be a discussion of Bayesian inference, which will presumably attack this and also talk about how you obtain more valuable information than just an estimator. I wanna talk for a little bit about the fact that it is an estimator. But first I wanna ask, does anybody else wanna put in any more assumptions that we're making here? Ah, so you're somehow making the assumption, there's an additional assumption, which is related to this estimator assumption, that there is no prior information. Of value, of value, about, and this is too long, but about B and M. So we don't even bother writing that down, but there's an assumption that you really know nothing about B and M in a certain sense. And of course you can't really assume that you know nothing. There's no way to assume you know nothing. But you're somehow assuming that all that matters is this likelihood. There's no additional information you want to add in there, like that it's much more likely that the slope is one than any other value, or that if the slope got very large it would mean the universe was only 10,000 years old and we wouldn't be here, or whatever. There might be other things you know that come in, and you're somehow assuming that there is no such information, and that's probably false. It's rarely true. Good, great, that's a great set of assumptions. Boris, you gonna throw one in? Say again? Ah, great, related to this point of known variances, you're also assuming, and I'll write it down: no bad data. Oh look, you can see my board down there. I mean, there, there. It's hard to point there. Good, yeah, you're assuming that there's no bad data, that every data point is the same, in the sense that the only difference between the data points is that they have different variances. Right, good, and it's also really good that that comes in here too, because it might be that the data were actually generated by a mixture of two lines, one like this and one like this, and some of the data came from one line and some came from the other, absolutely. Now, if you really state all the assumptions, eventually we'll get into, like, that your computer calculates the multiply operation correctly and things like that, and, you know, that you aren't living in a simulation. Actually, that's irrelevant; you should still fit this way even in a simulation.
But, so, you can't really state all your assumptions, but I think it is extremely valuable when we're doing data analyses to try to write down the assumptions as a bulleted list, and in fact one of the things in my group we're trying to do is systematize a little bit the way we write scientific papers. We're trying to write scientific papers in a way where we explicitly state our assumptions as a bulleted list, show that our methods work under those assumptions, apply the methods to the real data, and then discuss what would have changed if we had made different assumptions, which is usually that the code would have taken the age of the universe to run, because usually we're making assumptions like this because we're trying to be tractable. One of the things that's lovely about this problem is it's extremely tractable. And so if you can live with these assumptions, it's a great way to live, because you don't burn tons of trees and create tons of greenhouse gas emissions when you do your data analysis. Okay, good, now let me just solve this problem, and then I'm basically done; I'm just gonna point people to ridiculously long papers that I've written. Okay, good, let's solve this problem. I'm just gonna erase my objective function, and we'll come right back to it in a second. So here we go. So in computational data analysis, the most important thing is linear algebra, and I can't say this enough. Students come to me and ask, what do I need? I would like to work in your group, what do I need to work in your group? And I say, the first thing you need to do is become a good writer, because the most important thing in science is writing, and then the second thing is linear algebra, and then after that we'll talk about physics. And the reason is that any facility with linear algebra directly translates into a facility with data analysis, and almost all data analysis problems can be cast in a way that looks like linear algebra or contains linear algebra operations. So in this case, how do we cast this into a linear algebra operation? The first thing we do is we think of the data set, it's N data points, but instead of thinking of it as N individual data points, we're gonna think of y as an N vector. So we're gonna think of all these N data points as the N components of an N vector. And for me, vectors will always be column vectors. Good, so the data we're gonna think of as a vector, and then the model that we wrote down is this linear model; we're gonna fit it like a line. So we can write that again as a linear algebra operation. So I'm now gonna write it as a linear algebra operation. We're gonna say that this N vector is some matrix times some other vector, plus noise. And this is gonna be the vector, the column vector, which contains m and b, our two parameters. Okay, so we're gonna write this as just a pure linear algebra operation. So this is the slope and the intercept of the line, which I've formed into a two vector. This is the N vector, the horrifying N vector y1, y2, y3, da, da, da, down to y capital N. Good, everyone with me? This is an N vector, this is a two vector. So what does this have to be? This has to be an N by two, right? Which will go like that to make that work, right? Linear algebra, ring a bell? I cannot believe I did not learn how to multiply matrices till I was a junior in college. What the hell is wrong with the world? Anyway, if you're not familiar with multiplying matrices, this week is a great week to learn about it.
I really learned it late, and we should teach it early, I mean, to elementary school people. So what does this look like under the hood? What it looks like under the hood is: it's x1, x2, x3, da, da, da, down to x capital N, and then one, one, one, da, da, da, down to one. Do you see why that's the right thing? Because then when I do this multiply, I get mx1 plus b, mx2 plus b, mx3 plus b; those are my predictions for the y's. You see? It's gorgeous. And this thing in the literature is often called the design matrix. Design, oh look, you can read it up there. Matrix, if I write large enough. That's called the design matrix. And of course they can get much bigger. And this model is a very simple linear model. If this was a non-linear model, this would be a matrix of derivatives. You see? And instead of getting to the answer, we'd be making a move, because we'd be going on some curvy path or some crazy stuff. The nice thing in this linear model is we're gonna be able to just solve everything straight up. But you can think of this as the derivatives. This is a vector and this is a vector. What's the general linear relationship between two vectors? It's a tensor here. That A is the rectangular tensor that is like a derivative, dy dx. You can think of it as like dy dx. Okay, good. Now I've overused the symbol x here, which is annoying, but I don't really know what to call this thing. I could have called it theta. Maybe I'll do that, just to cause everyone to cross out. And for the people who are upset about crossing out in their notes, we'll help them through that later. My notes are all wrong now. They're written over there. Okay, good. Good, let's do that. We'll call it theta, because it's the parameters, theta. Good, any questions at this point? Okay, now, the magic of least square fitting. If your objective is chi-squared, then we can write chi-squared in an incredibly simple way now. And from a linear algebra perspective, there's a very simple way to write down chi-squared, and I'm now gonna cross out our assumptions. Everyone remembers our assumptions. That's good, I'm glad. We can write chi-squared as, well, so this is our data. These are our data. This is our sort of model for the data, and then there's noise that has been added to the model. So this is like our mean prediction for the data. So the differences between y and this mean prediction are like our residuals. So we can construct this vector, which is y minus A theta, right? That is now a column vector, a column vector of differences between the prediction and the data. And I can do this crazy operation, where this C inverse is the insane tensor which has one over sigma one squared, one over sigma two squared, one over sigma three squared, dot, dot, dot, one over sigma N squared down the diagonal, and zeros everywhere else. Do people see how that is exactly chi-squared? See, it's a difference between the data and the prediction, times a difference between the data and the prediction, multiplied by one over sigma squared. Do you see how this is exactly chi-squared? Oh, you see, that answers the independence question, very good, a great suggestion from Daniela. This is much more general than our earlier statement, because you don't have to put zeros in here. And by the way, if you're ever writing code and you do have zeros in here, don't ever make this thing. From an implementation point of view, right? You never wanna make a matrix that's almost entirely zero.
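In code, that design matrix is just the x values next to a column of ones; a sketch, reusing the hypothetical x array from above:

```python
import numpy as np

# Design matrix A: a column of x values and a column of ones, so that
# A @ theta, with theta = [m, b], gives m * x_n + b for every data point n.
A = np.column_stack([x, np.ones_like(x)])   # shape (N, 2)
```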
So really, just don't do that. And we can help. One of my jobs in this world is to help people not make matrices that are almost entirely zero. It's one of the things I do for a living. So if you have some matrices that are almost entirely zero and you're running out of RAM, come chat with me. That's part of my job here. Okay, good. But the nice thing, so this has many good properties. There are so many beautiful properties of this. Now is the point where we go until about two in the afternoon and I'm still talking about the properties of this. Because look at how beautiful this is. First of all, general relativity. That's a crazy thing to say. But in general relativity, or in special relativity, the important thing is the metric. Remember metric distances between points, which are like squared distances. I don't know if that means anything to people in this room, but this is exactly the way you write a metric distance in special relativity or general relativity. This is like an object that happens in physics or geometry. Those are geometric theories. This is a geometric object, which says: what's the distance between my data and my model? So that's one set of, there's a whole set of beautiful things there. There's another whole set of beautiful things from the fact that it's very easy to take the derivative of this. You probably haven't done calculus with linear algebra operators, and for a long time I hadn't either. But it turns out you can take a very simple derivative of this, and you can just find the optimum, right? Because the optimum is where the derivative is gonna vanish. And I'm just gonna write down the answer now. The answer is this, and it's so beautiful. The best value, I'm gonna call it best, or I could have called it min: the theta that minimizes chi-squared is just (A transpose C inverse A) inverse, times A transpose C inverse y. And I just wanna talk about this object for a second and then I'm done. So, what are the properties of this? What is this thing? So first of all, what is this here? This is A transpose. A transpose is a two by N. C inverse is an N by N. A is an N by two. So this is a two by two positive definite matrix. This thing's a two by two. Two by two positive definite, which means it's like a shear tensor or whatever, a thing that has shear, but it's just a little two by two. Now what's this thing? Well, this is an N by two, or rather this is a two by N. This is N by N. And this is an N vector. So this takes the data, this is the data we're throwing in here, and this thing projects it down to just a two vector. So this thing is a two vector. And this is what happens when you turn on, like, linear least squares. This is what's happening under the hood. In fact, this is so beautiful that whenever people are doing linear least squares, I always advise them strongly to just write the linear algebra themselves and not use a package, because it forces you to confront the structure of the problem, and you can look inside and see whether things make sense and whether your matrices are well conditioned, and you can do all sorts of checks. But what's going on under the hood is very beautiful. This operation is projecting the data down to just two statistics of the data, and this is shearing them into your answer. So it's like a projection and a shear, and you get your answer. And that is just magic. And that's why linear least square fitting is such a good idea, because it's super fast. It is closed form. It's linear algebra. Oh my God.
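Taking the advice to write the linear algebra yourself, here's a minimal sketch of that closed-form solution with the hypothetical arrays from above. Because the noise is assumed independent, C inverse is diagonal, so it's kept as a vector of weights rather than built as a matrix of mostly zeros:

```python
import numpy as np

w = 1.0 / sigma_y ** 2             # the diagonal of C inverse, as a vector

ATCinvA = A.T @ (w[:, None] * A)   # the 2x2 matrix   A^T C^{-1} A
ATCinvy = A.T @ (w * y)            # the 2-vector     A^T C^{-1} y

# Solving the little 2x2 linear system is better conditioned than forming
# the explicit matrix inverse.
theta_best = np.linalg.solve(ATCinvA, ATCinvy)   # [m_best, b_best]
```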
Exactly. It means there's that temptation, and you can play that two ways. You can say, well, it's tempting to make those assumptions because you get this; but of course that's also what you gain for making those assumptions. So it's both the case that you get drawn to make assumptions that you shouldn't because this is so easy, but it's also the case that you get the benefit if you can make the assumptions. Good. Now I guess the one thing that's sitting on the table, I don't know, how much time do I have? Are you gonna cut me off in a second? I have time, okay, good. I wanna say a few words about what Jake brought up, which is this question of the estimator. What does this produce? This thing produces an estimator. It's an estimator for M and B, right? What does this thing contain in it? It's a little column vector that contains M and B, the best M and the best B. Okay, so Jake was suspicious that you'd ever want an estimator. And rightly so. In fact, I've devoted my life to making us not use estimators when we can afford not to. So let's talk a little bit about the estimator. I hesitate to erase this because it's so beautiful. But here it goes. Okay, let's talk about estimators. So there are a lot of people in this room who have drunk the Kool-Aid and are full-on Bayesians. And in fact, astronomy is the land of the Bayesian these days. And I am partially responsible for that, only very slightly partially. I produced a lot of Kool-Aid, let me just put it that way. I try not to drink it, however, because there are many times when all you want is an estimator. So let's ask a slightly easier question: when are we happy with only an estimator? So the point that's gonna come through when we talk about Bayes and we talk about sampling is that really you want to propagate the full noise in your data into your answers. You wanna have a full description. You also wanna bring in all your prior beliefs. And so you want to produce a full posterior sampling that fully describes your posterior beliefs about your parameters given your data. We're going there. But sometimes you don't want to go there. What are those times? Anybody here who has written a piece of code that estimates something? Okay, I saw Adrian put up his hand. Why did you just estimate it, and why didn't you produce your full posterior beliefs? Ah, sometimes you need to make a measurement many, many times. Trying to think of a good example of that. Oh, I know. Foreman-Mackey and I wrote a paper recently about finding single transits in Kepler data. We had to perform an enormous number of hypothesis tests in that: literally for every possible location of a transit, at any time, in any star, in a huge part of the Kepler data, we had to ask yes or no. So we just needed a yes-no estimator, because we couldn't do a full Bayesian inference for every possible hypothesis that you could make about the planet. We just could not afford it. So sometimes it's just computationally impossible. Now, I shouldn't really say impossible. I'll put it in quotes, because of course it might be that there would be some clever ways. And in fact, we did some clever tricks to get our estimator very close to a Bayesian evidence calculation, but we did not do a Bayesian evidence calculation. It was computationally impossible, so we tried to build an estimator that was very close to a Bayesian evidence calculation. Good. What's another case in which all you want is an estimator? Good. I'm gonna take that one and then I'll take you.
So what if you just wanna initialize, and this is going to be, you might wanna initialize some kind of inference. So you might wanna do a non-Bayesian estimator to get yourself started on a Bayesian inference. That's very important. In fact, we strongly advise that in hard problems. This is gonna be a sub-point of a more general point which is about to come up. But yes, Daniel, what were you gonna say? Good. It's possible that there are two cases in which you might want an estimator, based on how bad your posterior PDF is. One is if your posterior PDF is really bad; then fully dealing with the posterior is computationally impossible, and it comes back to this point. But the other case is when your posterior PDF is very simple. If your data are highly informative, much more informative than your prior beliefs, and basically there's negligible error on your estimate, your estimate's just as good as any possible answer. You can sample all the hell you like, but in fact it only moves within 0.001% of a certain M and B value. You're good enough with just the estimator. So there's both the case of the computationally impossible Bayes, your posterior is just too hard to deal with, so you're just, I can't deal. But there's also the data-very-informative case. In fact, there are proofs that as your data become informative, the Bayesian inference and the asymptotically efficient estimator from frequentism actually converge. And so provably, when your data are very informative, you're wasting your time sampling. Good, anybody else wanna say? Good, I think that is kind of the mix of these two. You know something about this, but you don't wanna compute it, and your data are informative enough that you know you have a single mode. So I think that sort of falls in the same two categories, but exactly. And we do have cases like that. In fact, when you're doing linear least square fitting, you know there's a single mode. So, yes, yeah. Good, you might have taken all those XY points, but the only thing you're going to care about for your future inferences is this slope. The slope is the only thing that matters. You might find that you don't need to carry forward all your data. You just need to carry forward the slopes. You basically are transforming your data, or compressing your data, from a set of scatter plots to a set of slopes, and just passing the slopes forward. That's a data compression. It's a lossy data compression, but it is a data compression. And often you can afford the loss, or maybe you can even have proofs that you don't lose anything. You might be able to prove that you have a sufficient statistic or something. I think you used the word summary statistics. You can say it creates summary statistics. Or it compresses data. And I should say lossy. It's a lossy compression of the data, because of course you're throwing away information when you do that operation. This operation is explicitly a projector. You're throwing away data. So this is truly lossy. Good. Good. I'm going to come back to that if I still have time. Kyle, when does, like, the big foot fall on me? The anvil? And Jake needs like three and a half hours. Okay, good.
I just want to get back to what's the generalization of this, and then I want to come back to what Phil just said, which is that if the assumptions we wrote down are correct, you actually don't need anything more than the estimator and some kind of uncertainty on that estimator, which is sitting on the board over there, which we'll see in a second. So that's very related to this case, but really there are some situations in which the estimator and the uncertainty are truly sufficient to carry forward your posterior beliefs. So sometimes, and I'm going to put it in parens because it's so rarely true, but sometimes you have sufficiency, in the sense that you can do this operation and then that's sufficient to carry your information forward. Okay, and under the assumptions we wrote down, that turns out to be true in this problem. Okay, I want to generalize this point, this initialize point. Often when you're doing a least square fit, it is just to make a decision of some kind. It's just a stepping stone to do something else. So for instance, we'll plot some beautiful scatter plot in a paper, and a referee will say, what's the slope of that relation? And we will say, we don't care, because it has no meaning or interest to us or anyone else, but we don't write that to the referee. We fit a line to it and say, oh, the slope is this, and we add that to the paper. Because we don't care about the answer, we just need something to write in the paper that is not wrong. Why go through a huge computational effort just to put in this one little thing? We just need to make a decision about what number to write in the paper. We could have eyeballed it, but it would be better to do a computation that we can justify. Let's just do that. We'll put that in the paper. Another thing is, sometimes you have to decide go, no-go. We're about to turn on LSST and Gaia, and there'll be these alerts. And you have to decide, if you have a telescope, am I gonna go follow up this supernova that's been announced? You have to make a go, no-go decision. You can compute all the posterior probabilities you like, but at the end of the day, you actually have to decide whether you go or not go. You can't have a superposition of you going and not going. You can't, like, sample over your telescopes and have some of them follow it up and some of them not. You actually have to just decide whether the telescope's following it up or not. Once you're making that decision, a lot of things come in there, and in the end you're always producing an estimator. Decisions are always estimators. There are no probabilistic decisions. Well, at least not in my world. It depends what you think about many worlds in quantum mechanics, but there's no way you can partially do something. You either do it or you don't do it. And so sometimes you need to do decision-making. Now, of course, there's a whole theory of doing decision-making within Bayes. You could do fully Bayesian decision-making, but it requires knowing things about your utility that are very hard to know, almost unknowable. And so often a very well-designed estimator beats any kind of Bayesian decision theory you can pull off. Of course, I've written papers where we try to do the proper Bayesian decision theory. Astrometry.net, our thing that recognizes images, does a proper Bayesian decision theory under the hood, but it's rare that you can write down your proper decision theory thing.
And so if you're just making a decision, and that decision is not life or death, maybe you just want an estimator. Of course, if it is life or death, then you really want an estimator. But anyway, yes, yes. That's right. Good, good. So Daniel asked a great question, which I'll repeat so it's on the audio stream. If I computed an estimator here, that's a lossy transmission of the information in the data. If I did the full posterior, would it be a complete transmission of all the information in the data? Oh my God, that's a great question. And that is like the whole subject of Bayesian inference. And the true, like, priests of Bayesian inference say that the reason you must be Bayesian is that's the only information-preserving thing you can do. It's the only thing that carries the information through. However, Bayesian inference only carries the information through in the context that your Bayesian assumptions are correct. So if your model's inappropriate, it will not pass forward all the relevant information. It will only do so if your model's appropriate. And of course, for deep reasons, your model is never appropriate. However, the reference on that is David MacKay's book. David MacKay has this beautiful book called Information Theory, Inference, and Learning Algorithms. I think it's 2003. That book is, by far, in my mind, the best description of Bayesian inference and its connection to information theory and propagation of noise. It's beautiful, and he even gives you, the book's too big, but he gives you tracks through the book to take depending on your interests. And so you can look at the tracks and take the right track. It's beautiful. David MacKay died this year, which is a terrible loss to our field. Okay, finally, let me say one last thing. There are other reasons you might want estimators, by the way, but those are my top choices there. Well, I guess one thing I should say about that is it's very important in data analysis to be pragmatic. You wanna write down what your assumptions are, but it's also important to be pragmatic. If you go off the fully Bayesian deep end in everything you do, you will spend your entire life writing your first paper. And, you know, I'm totally down with that, but it's not a good career choice. So we wanna be pragmatic when we do these things and think about what matters and what doesn't matter. The noise. So the other miracle of this operation here is that this two by two matrix has a very important property, which is that this two by two matrix is the covariance matrix for the answer. So theta best is this thing, M best, B best, and this crazy thing, (A transpose C inverse A) inverse, is sigma M squared and sigma B squared on the diagonal, and sigma MB in the two off-diagonal slots. I don't know whether to put squares on those or not; there's a notational issue there. It is the covariance matrix for these parameters, and under the assumptions that we made at the beginning, the posterior, the Bayesian posterior, is actually Gaussian with this mean and this variance tensor. So if our assumptions are correct, then not only does this answer, in closed form, give us the best fit line, it also fully describes the posterior probability in the problem. Yes. When I said, you know, that we have no prior beliefs, and I was like, you can't have no prior beliefs: it's true, you can't have no prior beliefs.
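A sketch of reading those uncertainties off the fit, reusing the hypothetical ATCinvA matrix from the earlier snippet:

```python
import numpy as np

# Under the assumptions on the board, (A^T C^{-1} A)^{-1} is the covariance
# of the estimator: sigma_m^2 and sigma_b^2 on the diagonal, and the
# slope-intercept covariance in the off-diagonal slots.
cov_theta = np.linalg.inv(ATCinvA)
sigma_m, sigma_b = np.sqrt(np.diag(cov_theta))
```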
In the limit that the prior is much broader than the posterior, this is the perfect description of the posterior. It doesn't require much more about the prior, except that it be much broader. So either a very broad Gaussian, or, it doesn't have to be improper. I mean, technically, well, that's a deep philosophical question I do not want to answer. But yeah, under the assumption that the prior is very broad and quasi-flat over the region where you are writing down your posterior, this is the posterior. This will fully describe the posterior. So that is a gorgeous thing about this model. That's only true of this model, that's not generally true. It's true in this linear, Gaussian, zero-scatter, appropriate model. Yada yada, all the assumptions we wrote down. Then that becomes true. However, so what is this object? This object has horrifying characteristics. You see this, it purports to be the uncertainty. But notice that this object makes no reference to the data whatsoever, no data. See, the data aren't in here, there's no y in here. How can we know the uncertainty in our answer without looking at the data? Yes, Phil? Not only that, that it comes from a straight line. We had a lot of assumptions. Under those assumptions, it is true that this will be the uncertainty, but it's a little complicated where this comes from. So the way I think about this: fundamentally, it's true that if all of our assumptions are true, then it really is true that the uncertainties on our slope and intercept don't depend on the data. It's just, that's the way it is. Do the math, people, that's it. But that's a real warning flag for me. And I would say a couple of things. One thing is, I would say that what this really computes is not the uncertainty on the intercept and slope. What this really computes is the best possible uncertainty on the intercept and slope. It doesn't tell you what your uncertainties are. It tells you what they would be if your data were perfectly drawn from this model, which they almost certainly are not. So the way I sometimes write it is: this is the Cramér-Rao bound. And the Cramér-Rao bound has a totally incomprehensible Wikipedia page, but the Cramér-Rao bound is the limit on how well you can measure something given noisy data in the best possible circumstances. And this is the best possible circumstances; there, this would be true. Since we don't live in a world that is generated from the best possible circumstances, not even close, I strongly recommend using either jackknife or bootstrap to get uncertainties on the slope and the intercept, because these are uncertainties that do make reference to the data, and do account, at least partially, for the fact that the data are not perfectly generated by your model. I'm not gonna talk about these now. These have good Wikipedia pages, and they're very easy to implement, and they give you an empirical estimate of your uncertainties, which accounts at least partially for the fact that the data are not drawn from your model. Although, there's no truly conservative error bar methodology that you can just turn on. Yes, Phil. Yep. That's right. Ah, good. Good, let me say a word about that. So Phil points out that the error propagation formulas we learned as youths, and some of you are youths, made no reference to the data. They were just: you take an error and you transform it by some derivatives or something, squared. Take squared errors and transform them by derivatives.
That's exactly what this is doing. This is the inverse variance matrix, and remember, A is the derivative of the model expectation with respect to the parameters. So this is like the squared derivative multiplied by the inverse squared error, and that gives you the inverse squared error on the parameters you care about. So this is exactly that error propagation formula you learned in lab class back when you were an undergrad. It is the inverse variances of your data, taken through squared derivatives, inverted, and that tells you how well you should do on the new parameters. And so this is exactly that standard error propagation, and again, the right way to think about error propagation is that it doesn't tell you what your uncertainties are. It tells you what your uncertainties could be if you lived in the best of all possible worlds, which I think is a much better way to think about the uncertainties so computed. Phil, good: the word model. So Phil asks, what's the definition of the word model? And I have a very strong position on this. In my view, a model is a probability distribution for the data, parameterized by parameters you care about, and with prior probability distributions over the parameters you don't care about at least, and maybe the parameters you care about as well. But a model for me is something that generates the data, one where you could turn a crank and produce data sets. That's a model for me. It has to be something that generates the data. So when an astronomer says, oh, I have a model of stellar atmospheres, for me that's only a component of the model. It's the component that produces the spectrum of the star, but you then have to add a noise model to that to write down a probability for the data on the star. So the stellar data are generated by a mean model, which is like all that physics that you really desperately tried not to learn in that class, plus a noise model, which is like what a spectrograph does when it observes a star. And so for me, a model is never purely something that refers to physics equations. It also has to refer to hardware, because it's always the case that your data, am I visible on the board? Yes, that your data are produced by physics and produced by hardware, which also includes things like detectors and stuff like that. And the hardware is probably also governed by physics at the end of the day. Or God, I hope so. God, I hope so. So your data are being generated, and so one of the things, of course, that I love about data analysis is that data analysis is where theory and engineering meet. Theory and engineering meet in data analysis. This is why it's such a great field to work in, and why we're so lucky to be astronomers and able to work right here. But it means that if you've only worked here, and you've never worked here, in my view you don't have a model. So a theorist who says, I have a model of stars, I'm like, no, you don't, actually. Because a model of stars would also include photon counts, calibration errors, resolution variations, CCD artifacts, cosmic rays, and all the things that come into here. Does that address your point, Phil? Good, are there questions? Yes, hit me. What these two methods do is produce empirical measures of how precisely you are measuring these parameters.
So in bootstrap, what you do is you throw all your data into a bag, and you draw random data out of the bag and refit, and then throw them all back in, draw random data out, refit, throw them back in, draw out, refit, and you see how your fits vary depending on which data points come out of the bag. And that gives you a sense of whether or not the variation is as big or as small as you expect given this Cramér-Rao bound. So it's an empirical way of checking your error bars using the data. That's correct, that's correct. So it doesn't make your methods self-consistent. It just provides an empirical estimate for your error bars. You can then go back and try to reverse engineer and make a self-consistent story, but once you're going down that path, you should turn on Bayes and parameterize, and you should become a Bayesian. Phil? Yeah, the genius thing, as Phil just said, and I'm only repeating things because of the audio stream, as Phil just said, the thing that's genius about these is they produce new data sets that have the same statistical properties, in the limit, as your data set. So in the limit of large amounts of data, these are ways of generating new data sets. They are of course covariant with your data set, because they share data points, but you can still explore that covariance and assess whether it's the right size given your beliefs about the noise and your beliefs about the model. And there are variations of jackknife that can be used to test the models. There are things like cross-validation that are related to jackknife that test your models. So in fact, I'm a very big believer these days in using jackknife in general. Whenever you're doing an estimator, you should be jackknifing it, in my view. Jackknife is where you drop some of the data and refit, and look at the distribution of answers as you drop data. It's very valuable for examining your model. Other questions, comments? Yes? I prefer it to bootstrap because it gives you more detailed information about your residuals. Because you're dropping individual data points or chunks of the data, you can ask: which chunks of the data are most anomalous? Which chunks of the data are asking you for a more complex model? Which chunks of the data are really setting the slope, and which chunks are really setting the intercept? By going through this jackknife operation, if you look at the outputs of the jackknife, you really get a sense of which data points are doing what. Whereas with the bootstrap, you kind of just get gross statistical properties, unless you do a lot of combinatoric experiments in here. Here, because you systematically drop data, you learn systematically what the issues are with your data. And one of the things that's kind of funny is, you know, there's all this theory, I have the theory of estimators here. I've said a little bit about it. There's all this theory about Bayes, but there's absolutely no theory about model construction in the first place. Like, there's no statistical theory of how we construct models. And we construct models basically by investigating the data and how the models do against the data points. And that's how we figure out that we need to change or adjust our models. Yes. Yeah. Yeah. More or less, yeah. I mean, jackknife has a lot of different properties. One of the properties is it gives you error estimates. Another is, if you look at the left-out data, or the predictions for the left-out data, it does model testing, and it also gives you a sense of when things really go to hell.
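Here's a minimal sketch of both procedures, reusing the hypothetical data arrays from earlier; the number of bootstrap resamplings is an arbitrary choice:

```python
import numpy as np

def fit_line(x, y, sigma_y):
    """Weighted least-squares line fit; returns [m, b]."""
    A = np.column_stack([x, np.ones_like(x)])
    w = 1.0 / sigma_y ** 2
    return np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))

rng = np.random.default_rng(0)
N = len(x)

# Bootstrap: draw N points with replacement from the bag, refit, repeat.
boot = np.array([fit_line(x[idx], y[idx], sigma_y[idx])
                 for idx in (rng.integers(0, N, size=N) for _ in range(1000))])
print("bootstrap scatter of [m, b]:", boot.std(axis=0))

# Jackknife: leave one point out at a time and refit; the spread of these
# leave-one-out fits gets the standard (N - 1) variance inflation factor.
jack = np.array([fit_line(np.delete(x, i), np.delete(y, i),
                          np.delete(sigma_y, i)) for i in range(N)])
print("jackknife scatter of [m, b]:", np.sqrt(N - 1) * jack.std(axis=0))
```

Comparing either scatter to the sigma_m and sigma_b read off the covariance matrix is exactly the empirical check described above.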
If dropping a few data points causes your model to just go to hell, you don't have enough data. You just need to take more data, or there's something really wrong with your data or your model. But yeah, it's general. One of the reasons these are good things to be doing is they just force you to play with the data and the model. And as you play with the data and the model, you learn about issues. I mean, fundamentally, you know, there are all these assumptions we started with. We had all those assumptions. That shows you data analysis is a subjective pursuit. We make assumptions based on our intuitions and based on our willingness to do computation. That means that data analysis is fundamentally a subjective pursuit. The more love and kindness you give to your data, the more playing around with your data, the better will be your assumptions and your intuitions about it, and the better will be your subjective inferences. That is, I think, a critical idea of all data analysis. Many times. Good, good. I mean, Daniela makes a nice point, that if you love your data too much, you might end up getting to know it too well, and then overfitting, and trimming out data you don't like, and you can go down the garden of forking paths and end up really getting wrong results. And of course a lot of people are talking in the social sciences, and people should probably be talking in the astronomical sciences too, about the fact that experiments are not reproducing. When people redo experiments, they don't get the same answers. What's going on? And some of that might be that people kind of love their data too much. And some of that might be happening in astronomy too, of course. Yes. Good, so Phil asked, but doesn't jackknife protect you from overfitting, and from loving your data too much, and making sort of posterior decisions that look like prior decisions, and so on. And that is true, but there are even more extreme versions. If astronomy really wanted to move into a mode in which we were really careful about our data analyses, whenever we took a survey, like, you know, the Gaia survey, it would give us one Nth of its data. We would do exploratory data analysis. We'd get to love those data. We would register hypotheses about the full data, and then we would only run our analyses on the held-out data after we've registered hypotheses. I would fully support setting something up like that for astronomy. That could change our life. That is very related to jackknife, because it involves holding out data, doing analysis, and then asking what's true about these data. So it is true that jackknife is in a space of methods that are very conservative data analysis methods. And if you go down that path, you get to extremely conservative methods, and in fact David Donoho here at Stanford is a big proponent of us moving to very conservative data analysis. I say here at Stanford because, to a New Yorker, you know, Stanford and Berkeley are in the same place. I know how offensive that is. I'm sorry. You know, you gotta say something offensive when you're a New Yorker. I've been so nice. Good, are there other comments or questions? Let's drink coffee. Let's take a 15-minute break and come back at 10:45. Awesome. Was I on the audio stream? Are you in here? Is it on? There we go. It takes a little bit of time to, yeah. So everyone eating snacks can hear me admonishing you to rejoin us. Okay, cool, let's get started again. So my name is Jake.
I'm a data scientist at the University of Washington and the eScience Institute. It's kind of UW's version of this space that Berkeley has, and I did my PhD in astronomy, and I've sort of been an astronomer in the years since then and spent a lot of my time on other stuff as well. So I want to thank David Hogg for such a great starting intro to some of this stuff. And I want to try to build on some of this and give us the vocabulary and the way of thinking about these sorts of models that will help prepare us for the sessions that are coming later in the week. I think we're going to be talking about machine learning tomorrow. We're going to be talking about Bayesian inference in depth on Wednesday, and then going into some more computational things later in the week. One thing that I remember is, even through grad school, I always had this notion that there was the Bayesian versus frequentist way of doing things, and I never really understood exactly what that was. I just kind of smiled and nodded anytime someone made the distinction. So the first thing I want to do is give you a firm sense of what the difference between Bayesianism and frequentism is. And then from there, dive in a little bit to how we should think about probability, because probability is fundamental to some of these models, both in machine learning tomorrow and in Bayesian inference. And we'll see how far we get in that. So anyway, fundamentally, the difference between frequentism and Bayesianism comes down to just a definition of probability. When people talk about the difference between the frequentist and Bayesian approach, it's really just this philosophical definition of probability. So a frequentist will say that probability is intimately related to long-term repetitions. So the classic example, which I'm actually surprised hasn't come up yet, is flipping a coin, right? So you flip a coin, and when you say the probability of flipping a coin is 50%, what does that mean? To a frequentist, what that means is that if you flip the coin a thousand times, you'll get heads about 500-ish times. If you flip the coin a million times, you'll get about 500,000, right? So it's this notion that if you repeat some task over and over, the probability is intimately related to the results that you get after all those repetitions. And to contrast that, the Bayesian view of things is that the probability is a degree of belief. So if I say, as a Bayesian, that the coin that I have in my hands, my imaginary coin, has a 50% chance of getting heads, what I'm saying is that I have no other information that would tell me whether this should be heads or tails, so the results are equally probable. Does anyone have complaints with that? I wanna make sure that, yeah. That's a great question. So what he said is, if you have a machine that kind of flips the coin in the same way, in an evacuated environment, and does the same thing every time, so that you would get heads every time, how does that change things? Yeah, that's true. So I think the way that I've heard that addressed before is that when you're making these probabilistic statements, it's fundamentally about things that have some sort of uncertainty to them. And you can imagine doing all the physics and taking into account the torques and the forces and everything that's happening to a coin as it flies through the air and hits the ground, and predicting exactly what the heads or the tails are gonna be.
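Here's a tiny sketch of that frequentist reading, with arbitrary flip counts; the probability comes out as the long-run fraction of heads:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frequentist reading: probability ~ the long-run frequency of heads.
for n_flips in (1_000, 1_000_000):
    heads = rng.random(n_flips) < 0.5        # one simulated fair flip per entry
    print(n_flips, "flips:", heads.mean())   # fraction settles toward 0.5
```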
And to contrast that, the Bayesian view is that probability is a degree of belief. So if I say, as a Bayesian, that the coin I have in my hands, my imaginary coin, has a 50% chance of coming up heads, what I'm saying is that I have no other information that would tell me whether this should be heads or tails, so the two results are equally probable. Does anyone have complaints with that? I wanna make sure that, yeah. That's a great question. What he said is: if you have a machine that flips the coin the same way every time, in an evacuated environment, so that you would get heads every time, how does that change things? Yeah, that's true. The way that I've heard that addressed before is that when you're making these probabilistic statements, it's fundamentally about things that have some sort of uncertainty to them. You can imagine doing all the physics, taking into account the torques and the forces and everything that's happening to the coin as it flies through the air and hits the ground, and predicting exactly whether it's gonna be heads or tails. But then you're not really in the realm of probability anymore. Do you have a better way of saying that? Yeah. Oh, okay. I thought, see, I had you pegged as a frequentist. So fundamentally, this is really the philosophical difference between the two. Is it all right for me to move on from Hogg's point? I don't know that I have a better way to talk about that. Yeah, Phil? It's good that you don't bet. Yeah, that's a good way of putting it. If you fundamentally think of it as a thousand flips of which about 500 will be heads, that's a frequentist. If you fundamentally think of it as odds, like I have even odds of getting this, I'm gonna bet your dollar to my dollar that it'll be heads, that's a Bayesian way of thinking about it. So this might seem small, but it really, really is fundamental. Here's one reason why. I'm sure there are a few cosmologists in the room, so say you're trying to constrain Omega matter with an experiment. In a Bayesian sense, we can write something like the probability of Omega matter, right? Because this expresses uncertainty about what the value of Omega matter is, and we can condition that on data or theory or whatever we want; the point is, we can talk about a probability. In frequentism, this expression, the probability of the value of Omega matter, is meaningless, right? Because there is only one Omega matter. There's only one value. You can't repeat something over and over and get a different true value for Omega matter. So what I'm trying to say here is that this philosophical difference between frequentism and Bayesianism ends up constraining the types of problems we can write down and the ways we can express the science we're interested in. Now, this gets a little muddied because you can talk about observed or derived values of Omega matter from your data, which will change depending on noise in your data and things like that. But here I'm talking about the value itself, the fundamental value of Omega matter. So for example, this also means that if we go back to the straight line, with the slope and the intercept, a Bayesian can ask: what is the probability of my theta, the slope and intercept, given my data? Whereas for the frequentist, if you try to write P of theta given data, that doesn't make sense, right? Because these model parameters are fixed values, and you can't have multiple experiments where those fixed values take on different values. So this is where the distinction comes in, and we'll see some of the consequences of where it comes up. Any questions about that? Any clarifications from the other experts in the room? Yeah, Phil. Yeah, yeah, maybe. Yeah. So, pragmatism. That's a good point, because it is sort of fundamental uncertainty of knowledge versus this pragmatic thing. And if you talk to someone who is a pure frequentist, like Phil Stark here at Berkeley, who is always interesting to have discussions with about this: he, for example, doesn't really care about degree of belief. He thinks this whole degree-of-belief thing is wishy-washy and ill-defined and not really what you want to compute, because what you're really interested in is, when I come up with an estimate, I wanna be able to say, if I keep doing that experiment over and over, how many times I'm gonna get it right.
You know, fundamentally he's saying that these long-run repetitions are what we wanna know about. But for whatever reason, in astronomy and physics, even when Bayesianism was in ill repute in the middle of the 20th century, it was held strong by a core of physicists who thought this is how we should look at the world. So it really does come down to a philosophical question about how you think of the questions you're trying to answer and how you think those answers should be constructed. It's a little bit weird, but I'm hoping that with this you have a framework for how to think about that. So in Bayesianism, and I'm gonna erase this right now, in the Bayesian method, I wanna show you a little bit of what's going on here. What we're saying is that we want to compute, fundamentally, the probability of science. I'm borrowing a little bit from a presentation that Dan Foreman-Mackey gave a little while ago. So we wanna compute the probability of science, right? Where "science" is whatever parameters describe the model that we think describes the universe. Your science might be: what are Omega matter and w? Or your science might be: what are the orbital characteristics of this planet around a given star? Something like that. So we're fundamentally expressing our science, our knowledge of the world, in terms of a probability. But we can't just write that; we need something to inform our science. So we put in this little bar here and say the probability of the science given the data. And this is a conditional probability. We're saying the probability of some term given the value of something that we've observed. How many of you have seen conditional probabilities before? Okay, so most people are familiar with this. Just to define it real quick: the conditional probability of A given B is the joint probability of A and B divided by the probability of B. And the nice way to think about this is as a Venn diagram, where this is the probability of A happening out here, this is the probability of B happening out here, and this is the joint probability that they'll both happen. You can think of it as: the probability of A is that I flip heads on my first coin toss, B is that I flip heads on my second coin toss, and the joint probability of them both happening is that I get heads on both. So this might be 0.5 out here and 0.5 out here, and for independent trials, it's 0.25 in the middle. That's what joint probability is: both of the events we're interested in happening at the same time. And then the conditional probability is just saying: let's only think about the space where B happens, right here. And then it's a simple division. We divide the probability of the thing we're interested in, both A and B happening, by the probability of the space we've restricted to, which is where B happens. So it's P of A and B divided by P of B, and you get the conditional probability there. So that was a quick aside on that.
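A quick simulation of that two-coin picture (a sketch; the trial count is arbitrary) shows the definition P(A|B) = P(A, B) / P(B) in action:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# A: heads on the first toss; B: heads on the second toss.
A = rng.integers(0, 2, size=n).astype(bool)
B = rng.integers(0, 2, size=n).astype(bool)

p_B = B.mean()            # ~0.5
p_AB = (A & B).mean()     # ~0.25 for independent tosses

# The definition of conditional probability: P(A|B) = P(A, B) / P(B).
print(p_AB / p_B)         # ~0.5: knowing B tells us nothing about A here
```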
So anyway, in Bayesianism we have this probability of science given data. What else are we missing in here? What else goes in here? Well, I'll just go ahead, since no one has any ideas. We can condition the science on the data, but often we aren't conditioning our science just on the things we measure, right? We're conditioning our science on some sort of background information. This background information might be something along the lines of: relativity is correct. Or the background might be: we're working in an FRW metric. Or the background might be: Kepler's laws are correct, and we assume that planets orbit stars according to those. Yeah, so this is a more complete description. And Hogg mentioned some nuisance parameters earlier. So what we actually have in practice is the probability of the science and the nuisance parameters, given the data and the background info. This is what we end up computing in Bayesian analysis: the probability of all the things that affect our model. The science might be Omega matter; the nuisance parameters might be some of these power spectrum slopes that we don't really care about for this analysis, but that affect our model; and then there's the data that we measured, and some sort of background info that tells us how our model is put together. So this is what Bayesian analysis is looking at. Often, just because it gets really long, we'll abbreviate this and say something like theta s, theta n given d for the data and I for this background info. And sometimes you'll even see people leave out this background info entirely, because it's implied the whole time that your model is built on assumptions. But I find it useful to leave it in there, just to remind you that everything you're doing is conditioned on assumptions that are either stated or unstated. And yeah, this is what we're trying to find with Bayesianism. This is the answer that we're looking for, often. And what did I wanna say about that? Yeah, let's stick with that. Okay, so when we come to Bayesian analysis later, on Wednesday, this is what we're gonna be dealing with: trying to find the probability of these science parameters given the data and the background information that we have. Does that all make sense? Am I writing this high enough for everyone to see? I guess you can see it on top. The interesting thing here, though, is that if we look at the definition of frequentism, long-run repetitions, this is fundamentally different from what frequentists are trying to find. Frequentists, instead of trying to find a probability, are trying to find what David was talking about earlier: this estimator theta hat, right? Or maybe theta s hat. It's some sort of point estimate that tells us what value of our science parameters we expect. So frequentism is going for an estimator, and Bayes is going for this probability of theta s given data. That is the fundamental difference between the two. And just to frame this a little bit: the stuff that you'll see tomorrow, the machine learning, tends to be more frequentist oriented, not entirely, and most of that is for computational considerations. So if you're doing a machine learning model, you tend to be working in this frequentist realm, and the stuff on Wednesday, the Bayesian inference, tends to be in this realm right here. So, any questions about that distinction? Yeah, I'd like to get to that, so I'll get there. Yeah, oh yeah, that's another thing. The question was: how do we interpret confidence intervals versus the Bayesian credible regions, or whatever different vocabulary there is for that? Sure, why don't we talk about that right now? This is really interesting.
So the question is: when you're arriving at these answers, how do you interpret uncertainty within both of these approaches? I'm gonna erase this right now and just write frequentism and Bayesianism at the top. I wasn't gonna talk about this, so hopefully it makes sense as I dive into it. So frequentism has the estimator theta hat, and Bayesianism has this P of theta. I haven't defined what this is yet, but it's known as the posterior probability. The frequentists are trying to find the estimator theta hat; the Bayesians are trying to find the posterior probability. So let's say, just for the sake of example, since we used it earlier, that our data are x and y, we have some points with error bars, right, and we're trying to fit a line with a slope of m and an intercept of b. So our theta here is the slope and the intercept. One useful way to characterize uncertainty is to draw this region of the parameter space of m and b and say maybe theta hat is right here. And in the Bayesian view, we have a similar thing: we have m and b, but our posterior is something that looks like this, maybe. And I haven't labeled the axes, so this is correct; the relationship between m and b is correct, I always forget which way it's supposed to go. So fundamentally, the frequentist approach is giving you an estimator theta hat, and the Bayesian approach is giving you this extended probability distribution. As far as uncertainty goes, I'm gonna start with the Bayesian approach, because I think it's much easier to wrap your head around. The Bayesian approach says that this probability distribution right here is the uncertainty. If you remember back to when we defined what probability is: probability is a measure of degree of belief, a measure of uncertainty about a value. So this posterior, in Bayesianism, is the uncertainty. And you can decide what you wanna do with that, right? You can say something like: I don't care about the b value, so I might marginalize over it, just integrate it out, and then I have some distribution in the slope; maybe I approximate it as a Gaussian and choose the 95% cutoff or something like that. That's how you go about getting an error bar in the Bayesian sense: you just use the posterior as a probability distribution. Yeah, yeah, exactly. What Phil said is that the statement you can make here, if you draw something like a 95% contour around the outside, is: I believe with 95% certainty that the true values of the slope and intercept lie inside that region right there. This is the kind of thing you see in those banana diagrams in cosmology papers, the ones with the curvy contours. That's the kind of statement they're usually making, with a Bayesian analysis: I believe with 95% confidence that the values are within this outer contour.
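A minimal sketch of that whole recipe on a brute-force grid (flat priors on m and b are an assumption here, and all the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: a line with slope 2 and intercept 1, plus Gaussian errors.
x = np.linspace(0, 10, 15)
sigma = 1.0
y = 2.0 * x + 1.0 + rng.normal(0, sigma, x.size)

# Posterior on a grid over (m, b), assuming flat priors.
m_grid = np.linspace(1.5, 2.5, 301)
b_grid = np.linspace(-1.0, 3.0, 301)
M, B = np.meshgrid(m_grid, b_grid, indexing="ij")
loglike = -0.5 * np.sum(
    (y - (M[..., None] * x + B[..., None]))**2 / sigma**2, axis=-1)
post = np.exp(loglike - loglike.max())       # unnormalized posterior

# "I don't care about b": marginalize it out by summing over that axis.
post_m = post.sum(axis=1)
post_m /= post_m.sum()

# 95% credible interval for the slope, straight from the cumulative sum.
cdf = np.cumsum(post_m)
lo, hi = np.interp([0.025, 0.975], cdf, m_grid)
print(f"slope m in [{lo:.3f}, {hi:.3f}] with 95% posterior probability")
```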
Now, frequentism. It gets a little more messy, because remember, we don't have any extended information in the estimator itself. So we have to think back a little bit about how we got this estimator. Remember, this estimator here, I'm gonna draw it in green. Can you guys see that all right? Probably not, let's do blue again. This estimator here is some function, I'm just gonna write a big F, of our data, and of some assumptions about the model, so let's encode those with the I right there. So our estimator is some function that we compute from the data. Where does the uncertainty come from? In frequentism, our only sense of uncertainty comes from the fact that this estimate changes every time we do the experiment. Our assumptions probably aren't changing, so it's the data: we assume the data are something that changes every time we run the experiment. You can think of it like a supernova survey: every time you look at the sky for a week and look for supernovae, you get a different set of supernovae. You get a different set of data, and what's in that set of data drives the variation in your parameters. Right, so what happens is that if you run the experiment again, you get an estimate right here, and another right there, and one right here. And if you imagine doing this over and over, eventually you get a nice spread of measured parameter values, and if you carry it on to infinity, you can imagine you get some contoured distribution like this. So this is what frequentists mean by uncertainty: if you do the experiment a very, very large number of times, you end up with some distribution of results of your estimator, and the errors are the size of that distribution right there. So what's the problem with this? I mean, I shouldn't say it's a problem, but does anyone see where this might be difficult? Yeah. Yeah, you only have one measurement. I mean, we only have one LHC, right? We're not gonna be able to build a million different LHCs to get this distribution. So you have to make some assumptions about the estimator that you've gotten. For example, your estimator might be right out here; this might just be the data you happened to get, right? Or your data might give you an estimator that's right here, or right there. The trick in frequentism is that, generally given a single estimator, you have to try to infer what this whole distribution would be if you were to repeat your experiment a million times. That's the real trick in frequentism. And what ends up happening, and I erased the stuff that Hogg wrote, the actual equations for this, is you do something along the lines of error propagation. What you can generally do in frequentism is estimate the size of these ellipses, if you will. So, whatever it was that Hogg wrote, I think it's A transpose C inverse A or something like that. Is that the uncertainty? What's that? All inverted, that's right: the whole thing inverted. What that tells you is basically the size and orientation of this ellipse.
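The repeated-experiment picture is easy to simulate (a sketch with a toy line fit; the loop stands in for re-running the survey). The scatter of least-squares estimates across many fake experiments should match that A transpose C inverse A, all inverted:

```python
import numpy as np

rng = np.random.default_rng(3)

x = np.linspace(0, 10, 15)
sigma = 1.0
A = np.vstack([x, np.ones_like(x)]).T        # design matrix for y = m x + b
C_inv = np.eye(x.size) / sigma**2            # inverse data covariance

# Analytic frequentist uncertainty: [A^T C^-1 A]^-1.
cov_analytic = np.linalg.inv(A.T @ C_inv @ A)

# "Run the experiment" many times and watch the estimator scatter.
m_true, b_true = 2.0, 1.0
estimates = []
for _ in range(5000):
    y = m_true * x + b_true + rng.normal(0, sigma, x.size)
    theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    estimates.append(theta_hat)

cov_empirical = np.cov(np.array(estimates).T)
print(cov_analytic)      # size and orientation of the ellipse...
print(cov_empirical)     # ...matches the scatter of repeated estimates
```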
But the problem in frequentism is that you're gonna plop that ellipse down right on your best estimator. I keep saying "the problem"; I don't want it to seem like I'm saying frequentism is wrong and Bayesianism is right, because they're both useful approaches. But the difficulty here, and I sort of brushed over the difficulties in Bayesianism, is that you don't know where to plop that ellipse. So let's say this happens to be your best estimator given the random draw of data you got. You're gonna draw an ellipse like this, and this is gonna be your uncertainty. I'm gonna get there. Yeah. Yeah, I'm gonna get there, and part of it is that I have to stand up here and remember what the statement is. I know I've made the statement before. So this is what frequentism is doing. Oh yeah, this is what I wanted to say as well: these ellipses right here, you can think of them as kind of an estimator of sigma, right? Because this is also some function, let's call it g, of the data and the information. So the size of the ellipse is something that you estimate, and it's subject to the same kind of variance as, oh, wait a second, okay. In general the size of the ellipse depends on the data, but in this particular case it doesn't depend on the data, so we can just treat it as fixed; that's good. So the statement you're making is this: every time I repeat this experiment, I will plop down an estimator and draw an ellipse around it. And if I do it again, I'm gonna plop down an estimator somewhere else and draw an ellipse around it, and if I do it again, I'll plop down another estimator and draw another ellipse. Let's say that this point right here in the middle is the true value, which you don't know but are trying to constrain. What you're saying in frequentism is that 95% of the time that I run my experiment and perform this frequentist procedure, the ellipse will contain the true value. So this is a little bit subtle. Back to Bayesianism, which is the one that's easy to understand: there you say, I'm gonna draw this ellipse, and there's a 95% chance that the true value lies somewhere inside it. In frequentism you say: I'm gonna draw this ellipse, and 95% of similarly drawn ellipses will contain the true value. So that's the statement that you wanted. Does that make sense to you, Phil? It really is, yeah. And in practice it often is subtle, and it's difficult to find situations where it really makes that much of a difference, in this simple model with no prior and so on. But philosophically, what you're saying is: remember, frequentism is about repeated trials, and the particular confidence interval you construct, this particular ellipse, doesn't mean anything so much as the recipe, the procedure, you used to create it. Whereas Bayesians will instill some sort of fundamental meaning into this one ellipse that you create. Does this help anyone at all? Yeah. Okay, so that's what you're looking at in the difference between confidence intervals, which are a frequentist thing, and what's usually called a credible region or something like that, which is a Bayesian thing. That's the difference. And the thing to be aware of is that, yeah, I guess I won't say any more about that. Yeah, well, it comes back exactly to what Hogg was saying earlier: often the frequentist approach is really, really convenient. You can do it really quickly, because you're finding this point estimate and then performing some procedure that finds the confidence bounds around it. And if you're actually in a situation where you're repeating something over and over, and you want to make true probabilistic statements about how your procedure will perform in the future, then it's a useful way to go. Not always.
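That coverage statement can be checked by brute force as well (a sketch using a Gaussian mean instead of a line, just to keep it short; 1.96 is the usual 95% normal quantile):

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma, n = 5.0, 2.0, 25
n_trials = 10_000

hits = 0
for _ in range(n_trials):
    data = rng.normal(mu_true, sigma, n)
    mu_hat = data.mean()
    se = sigma / np.sqrt(n)
    # 95% confidence interval for the mean (sigma known): mu_hat +/- 1.96 se.
    if mu_hat - 1.96 * se <= mu_true <= mu_hat + 1.96 * se:
        hits += 1

# The 95% belongs to the procedure, not to any single interval.
print(hits / n_trials)    # ~0.95
```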
So now that I've thoroughly confused everyone, we have a little bit of time, and I wanna move on to talking a little more about what we know about probability, and then deriving Bayes' rule, which some of you have probably seen before. From there we'll be in a good position to think about the types of models that we're gonna run in the next few days. Does that sound good? So yeah: multiple ellipses around multiple estimates, versus one ellipse. That's what it comes down to. So I will put in my little plug here; this is my soapbox that I get on sometimes. I think that fundamentally, the way we as scientists think about data is Bayesian, right? We think about making a statement about model parameters and saying there's a 95% chance that those model parameters fall within the region I've constructed. So I think one of the reasons Bayesianism appeals to the astronomy community is that it maps really well onto how we think about the world. But there are statisticians who don't like that, for reasons that I'll get into. Okay, so we have this statement; this is what we're trying to find as a Bayesian. Now, there are a couple of fundamental probability facts that are useful to know. I already showed you this: P of A given B, the conditional probability, is equal to P of A and B divided by P of B. This is the conditional probability, and this is called the joint probability, right? And here's our nice little Venn diagram: here's A, here's B, and this intersection is A comma B. And the probability of A given B, did I write this the right way? Yeah, the probability of A given B is this little sliver divided by that bigger region right there. The other thing that's useful to think about when we're talking about probabilities, let's see, I should number these. One thing that's useful to know about probabilities is normalization. Probabilities are normalized, which means that if we integrate the probability of A times dA, that should equal one. Right, so just staring at this as physicists, you automatically learn something useful about probabilities, right? Because this dA has units of whatever A is. So what units does the probability of A have? It's gotta be the inverse of whatever A has, right? If A is in meters, if we're trying to measure the length of a pole, then the probability of A has units of one over meters. That's because these things we're usually talking about here are actually probability densities. We're saying that if this is the size of the pole we're looking at, and here's the probability of that size, it's some distribution right here. And to get the probability of a size in some little range right there, we multiply the probability density by this little dA right here, this little width, and that gives us the actual probability. So probabilities are always normalized. Does that make sense? It's useful to think about this, because then we can say things like: the probability of A given B has units of one over A, right? The probability of A and B has units of one over A times one over B. The probability of B has units of one over B. So if we take all of this, one over A, one over AB, one over B, and divide it all through, the B's cancel out and the units match, so you're really happy, right?
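Here's that density picture in a few lines (a sketch; the pole length is Gaussian by assumption, and the numbers are made up):

```python
import numpy as np

# A Gaussian probability density for the length of a pole, in meters.
a = np.linspace(-10.0, 10.0, 100_001)     # meters
da = a[1] - a[0]
mu, s = 2.0, 0.5
p = np.exp(-0.5 * ((a - mu) / s)**2) / (s * np.sqrt(2 * np.pi))  # units: 1/m

# Normalization: the integral of p(a) da is dimensionless and equals one.
print((p * da).sum())                     # ~1.0

# The probability of one small range is density times width: p(a) * da.
print(p[np.argmin(np.abs(a - mu))] * da)  # a small, dimensionless number
```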
When you start getting into really complicated manipulations of probabilities, this fundamental unit matching is a nice way to confirm to yourself that you've actually got something correct, that you're not losing terms in there. A couple of years ago I sat in Dave Hogg's office writing these insane, messy statements involving multiplying like 30 different probabilities together, because we were trying to show something, and this kind of unit bookkeeping helped a lot, because I could always count things up, realize I was missing a term, and go back and check what I was missing. Yeah, it was glorious. This was our Hurricane Sandy adventure. We never ended up writing a paper. Oh, right, right, yeah. So you should look up Hogg's arXiv-only papers, because they're generally pretty useful, and this is one of them, I assume. Okay, so we have, let's see: normalization, joint, conditional. We have the normalization; we already talked about the joint probabilities, P of A and B, and what those mean; and we got to the conditional probabilities already, P of A given B. And this normalization thing actually holds for all of these as well. So take this joint probability, P of A and B, and, I'm gonna erase these over here, actually, let's start with this. We know that if it's normalized, you have to be able to do the double integral of P of A and B, dA dB, and that has to be one, right? That's our definition of normalized probabilities, extended to something a little bit bigger. But now let me put a little parenthesis in there: this is telling us that the integral of something times dB equals one. So what is that something? What is the function that, integrated over all B, equals one? It's the probability of B, right? So what that tells us, and let me do my whiteboard management a bit better, is that this inner term is just the probability of B. This is something really fundamental and really useful: the probability of B is equal to the integral of the probability of A and B, integrated over dA. This right here is known as marginalization. So these are your fundamental pieces of probability theory, and they're useful when you go into Bayesian analysis and into analyzing models and things like this.
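Numerically, marginalization is just summing the joint over the axis you don't care about (a sketch on a discretized toy joint, here a correlated 2-D Gaussian):

```python
import numpy as np

# A discretized joint density p(a, b): a correlated 2-D Gaussian.
a = np.linspace(-5, 5, 401)
b = np.linspace(-5, 5, 401)
da, db = a[1] - a[0], b[1] - b[0]
A, B = np.meshgrid(a, b, indexing="ij")
rho = 0.6
joint = np.exp(-0.5 * (A**2 - 2 * rho * A * B + B**2) / (1 - rho**2))
joint /= joint.sum() * da * db            # normalize on the grid

# Marginalization: p(b) = integral of p(a, b) da -> sum over the a axis.
p_b = joint.sum(axis=0) * da

# The marginal is itself a normalized probability density.
print((p_b * db).sum())                   # ~1.0
```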
And if you want a really entertaining, well, entertaining for the right person, first-principles read on these, there's a book by Jaynes, who was one of the giants of Bayesian theory in the 20th century. I'd call it his manifesto, but it has another title: Probability Theory: The Logic of Science. The first chapter essentially starts from ground zero, saying: let's assume we know nothing about the universe. And it derives the concept of probability as a measure of human knowledge and shows that it meets all these criteria right here and has all these properties. So if you really wanna geek out about deriving probability from axioms, it's really nice. And the cool thing is that he actually argues that probability is the optimal measure of degree of certainty about the world. So it puts a good foundation under the Bayesian philosophy of what probability means and the fact that it follows all these rules. You can also derive all these same properties by assuming that probability is a frequency of outcomes over an infinite number of trials, so the frequentist method is founded on solid axioms as well. I haven't figured out whether it's an accident that the results all agree or whether it's something deeper than that, but I'm sure there are philosophers out there who have thought about these things. But anyway, we have this marginalization, so let's look at a couple of things it tells us. Let's say we have our Bayesian result; I'm gonna erase this up here too, though I wanna keep those properties of probability. This one was: the integral of P of A dA equals one. So say we've gone and computed our model, and we find P of theta s and theta n given our data and our information. We're not really interested in these nuisance parameters; we don't want them in our answer, right? Well, we can get rid of them really easily. We just go back to this idea of marginalization and say that we can express P of theta s given data and information as the integral of this term up here, times d theta n. So what we're saying is: if we have a model that depends on some nuisance parameters, say you're a cosmologist measuring the equation of state of dark energy, but you have these tricky little things you don't know about the intrinsic properties of the supernovae that enter your model, then as a Bayesian you can just integrate over all possible values of those intrinsic supernova properties, and out comes your science answer, right? So this is the marginalization thing, and it's one really nice feature of Bayesian analysis: it gives you a natural, well-founded way to get rid of these nuisance parameters. I know there are good ways of dealing with nuisance parameters in frequentism too, but I've never been entirely clear on what they are, maybe because I'm too much of a Bayesian fanboy. I've made statements to that effect before and people have pushed back. The profile likelihood, mm, yeah, oh, interesting, yeah. I know the LHC people do similar things: they're fundamentally producing frequentist results, but they have thousands of nuisance parameters to handle. So, as an aside, what did you call that? Profile likelihood. Profile likelihood. Yeah, and theta s are the science parameters that you care about, theta n are what we call the nuisance parameters over here, the ones we don't really care about. Yep, yes, that's right: the probability of the parameters governing our model.
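For what it's worth, the two approaches to nuisance parameters can be put side by side on a grid: the Bayesian marginalizes (integrates) the nuisance direction out, while the profile likelihood the audience mentioned maximizes over it instead. A sketch on a made-up Gaussian surface, where the two happen to nearly coincide:

```python
import numpy as np

# A toy unnormalized surface over a science parameter and a nuisance parameter.
theta_s = np.linspace(-3, 3, 301)
theta_n = np.linspace(-6, 6, 601)
S, N = np.meshgrid(theta_s, theta_n, indexing="ij")
L = np.exp(-0.5 * S**2 - 0.25 * (N - 0.8 * S)**2)

# Bayesian: marginalize the nuisance parameter (integrate it out).
marginal = L.sum(axis=1)
marginal /= marginal.max()

# Profile likelihood: maximize over the nuisance parameter instead.
profile = L.max(axis=1)
profile /= profile.max()

# For this Gaussian toy the curves nearly agree; for skewed or multimodal
# surfaces, marginalizing and profiling can give visibly different answers.
print(np.abs(marginal - profile).max())   # small here
```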
And we can do other things with this as well. How much time do we have? We're pretty good. Okay, so let's take a look at this. Actually, I'm gonna summarize this right now as p of theta given data and I, just to make it a little bit simpler, knowing that if we want to, we can split theta into parameters we care about and parameters we don't care about and do this sort of marginalization. So this is the thing that we want to compute. But what we actually know, what we actually can compute given our model and our data, if we have a model in the sense Hogg described earlier, where a model is a probabilistic generating function for our data given some parameters, is actually p of data given theta. That's what we can compute right here: p of data given theta. So if theta is the slope and intercept of a line, we can draw that line and ask about the random scatter of points around it, and generate that data. Yeah, Phil. Yeah, you do have to know the I, thanks. So the model that we have is p of data given theta and I, and we can actually calculate that. Given some data with error bars, we lost our data, I'm drawing here, given that data and some model, we can ask what the probability is that these data were drawn from that model, right? We just ask how likely each data point is given the model, and then we multiply those probabilities together: we turn the single-point probabilities into a joint probability, p of A times p of B for independent events. So the point is that this is something we can compute, and it's very similar to the likelihood we saw earlier when we asked how well those points fit the line. That's something I probably won't get into right now, but you can actually start from this probabilistic perspective and derive the likelihood that David wrote up earlier.
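That multiply-the-per-point-probabilities recipe is only a few lines of code (a sketch assuming independent Gaussian errors; done in log space, where the product becomes a sum):

```python
import numpy as np

def log_likelihood(theta, x, y, sigma):
    """log p(data | theta, I) for a straight line with known Gaussian errors.

    Each point is independently Gaussian about the line, so the joint
    probability is the product of per-point probabilities: a sum in log.
    """
    m, b = theta
    model = m * x + b
    return np.sum(-0.5 * ((y - model) / sigma)**2
                  - np.log(sigma * np.sqrt(2 * np.pi)))

# Toy data drawn from a line with slope 2 and intercept 1.
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 20)
sigma = np.full_like(x, 0.8)
y = 2.0 * x + 1.0 + rng.normal(0, sigma)

print(log_likelihood((2.0, 1.0), x, y, sigma))   # good line: higher value
print(log_likelihood((0.0, 5.0), x, y, sigma))   # bad line: much lower value
```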
So the question is how we go from this statement right here to the thing we actually care about, which is that right there. And I'm gonna erase this list right here. Actually, I shouldn't erase that list, because those are the properties we're gonna use to do this; I'm gonna erase this instead. So we have the thing we care about, p of theta given data and I, and the thing we can compute, p of data given theta and I. Let's express the first one in terms of the conditional probability we wrote before, p of A given B equals p of A and B divided by p of B: p of theta given data and I is the same as p of theta and data, the joint probability, given I, over p of data given I. Here we've said theta is A and data is B. So we have that statement, and then the thing we can compute, p of data given theta and I, we can write the same way: p of data comma theta given I, divided by p of theta given I. And now, if you're into algebra and canceling things, you should start to get really excited, right? Because this joint probability, p of theta and data together, is the same thing as p of data and theta together; the order doesn't matter when you're just talking about the joint probability of the two. So we can set this equal to this, multiply this side by that side and that side by this side, and what we get out is this statement: p of theta given data and information, which is what we're interested in, equals p of data given theta and information, times p of theta given information, over p of data given information. This right here is Bayes' rule. You've probably seen it before, but the interesting thing about Bayes' rule is that it's not anything controversial; this statement about probabilities is not controversial at all. A frequentist would derive the same statement, because all you're doing is taking the conditional probability, flipping it, and using algebra to put the two together. What is a little bit controversial here, in terms of data analysis, is what we're doing with the parameters, right? As a Bayesian, we're putting theta in the first slot right here: we're saying there's a probability distribution over our model parameters. From the example I used before, it's a probability distribution over the actual value of Omega matter, which a frequentist would say does not exist, right? That probability distribution is meaningless to them. But what we're saying as Bayesians is that we can take this and express it in terms of these other quantities. So I'm going to erase this right here, and as a last little thing I wanna tell you exactly what these terms are. The first term, P of theta given data and information, is known as the posterior. This is the posterior probability that we're interested in finding. This is our science, right? It's the probability of the model parameters, and it encodes all the knowledge we have about them: given our information, given our assumption that the model accurately reflects the universe, this is an encoding of our statement of knowledge about the universe. And yeah, did someone have a question? Okay. On the other side, we have P of data given theta and I. This is flipping it around: it's saying how likely it is to get our data given a particular model. You draw a candidate model on there and ask how likely it is that your data came from it, and we can calculate that the same way we did in the frequentist setting. I told you I'd get to the likelihood as the point of contact between frequentism and Bayesianism; maybe we can save that for later, because it's almost lunchtime. But you'll have to trust me for now that the likelihood David showed us earlier, up here, when fitting a line to data, is exactly this term; it's related to the probability of the data given the model. So I'm gonna call this one the likelihood. And then right here we have P of theta given I, which is known as the prior. At this point, if you think about what that means, you see one of the reasons frequentists complain so much about Bayesian analysis. What the prior does is encode what we knew about the model, about our scientific result, before we gathered any data. So to do this correctly, we have to make a correct mathematical statement of what we knew about the world before we looked at it. That's where Bayesianism gets hard: this prior. But it can also be nice, because the prior is a way to encode any knowledge we had before we ran our particular experiment. For example, going back to cosmology: maybe you're observing supernovae, going after the equation of state of dark energy, but we already know some stuff about that; we have WMAP and we have the Sloan BAO results. So you can mathematically encode what you know from WMAP and Sloan into this prior, and use it to figure out the posterior given both your data and what you knew before you measured the data.
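Putting the pieces together on a grid makes the roles concrete (a sketch: a one-parameter slope fit with the intercept assumed known, and a made-up earlier experiment standing in for the WMAP-style prior):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data from y = 2 x + 1, with the intercept b = 1 assumed known here.
x = np.linspace(0, 10, 10)
sigma = 2.0
y = 2.0 * x + 1.0 + rng.normal(0, sigma, x.size)

m = np.linspace(0, 4, 2001)               # grid over the slope
dm = m[1] - m[0]

# Likelihood p(D | m, I): Gaussian scatter about the candidate line.
loglike = -0.5 * np.sum((y - (m[:, None] * x + 1.0))**2 / sigma**2, axis=1)
like = np.exp(loglike - loglike.max())

# Prior p(m | I): pretend an earlier experiment gave m = 1.8 +/- 0.5.
prior = np.exp(-0.5 * ((m - 1.8) / 0.5)**2)

# Bayes' rule: posterior = likelihood * prior / evidence.
evidence = (like * prior).sum() * dm      # p(D | I), as a grid integral
posterior = like * prior / evidence

print(m[np.argmax(posterior)])            # posterior peak for the slope
```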
So the prior is kind of a blessing and a curse, right? It can be hard to specify what you knew about the world before you looked at it, but it can also be useful as a way to encode what you learned from earlier experiments. And then the last one, this P of data given I. My favorite name for this comes via Hogg and his group: I like calling it the fully marginalized likelihood. It's such a good name because it describes exactly what this is: if you take the likelihood and integrate it out over theta, marginalize it fully, you get this. But also the acronym, FML, is suitable to how you feel if you have to compute it. We'll get into a little more of what each of these terms means when we go into Bayesianism on Wednesday. But the broad brush is that in model fitting, this FML is basically a normalization constant. You can think about applying that normalization property right here to the posterior: there's a free constant in there, and that constant is your fully marginalized likelihood. So if you're just trying to figure out where the posterior peaks, and the spread of the posterior, you can treat this as a normalization constant. What that leaves you with is the likelihood, which we know how to compute, and the prior, which is some mathematical statement of what we knew about the world before we gathered our data. And that's what goes into Bayesian analysis: to compute the posterior, you take your likelihood, you take your prior, and you get out results. Up to this point this has all been very, very abstract; I know we're gonna see some actual applications of it on Wednesday, so I'm gonna stop right here so that I don't take the thunder out of Wednesday's talk. But hopefully, if this is the first time you've seen these terms, they'll have gelled a little in your head by the time we come back to them on Wednesday, and we'll be able to scoot through the initial parts. Any questions or comments? Yeah. Yeah, that's a good point. Where's that? Oh yeah, yeah. We can make this true everywhere if we write P of B given A in there; so yes, you can only multiply P of A and P of B to get the joint if A and B are independent. Thanks for that. Yep, yeah. Other comments? Yeah. Mm-hmm. Mm-hmm. Yeah, yeah. So, in case someone needs the microphone: the statement is that both frequentists and Bayesians use the likelihood; it's fundamental in the Bayesian approach, and it's fundamental in the frequentist approach we saw earlier. So it's a point of agreement. All of us have the job of specifying the likelihood correctly, and then we're all in the same position; we go off from there, and we can understand each other's analyses, frequentist and Bayesian, through that point of contact of the likelihood. One thing I learned last week, from Brendan Brewer, is that it's helpful to think of the sampling distribution as the prior over data sets. You see how it's a probability of D given an asserted theta and background information. Mm-hmm.
So you can write that down, or code it up, before you take any data. And you can generate simulated data by sampling from that PDF before you take any data. So it's your prior PDF for data sets. In fact, a good hack for Astro Hack Week is often to make a toy version of the problem you're really interested in, generate data from the sampling distribution, and then try to infer the parameters you asserted when you drew those data sets, as a way of testing your inference machinery. Yeah, definitely. And I think we'll get into some of that on Wednesday. I hope we will. All right, I think it's just about time to break for lunch, but before that happens, we're gonna bring up Hackmaster Phil to say some words about what this afternoon and the rest of the week are gonna look like. All right, let's thank Jake. Thank you.
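To make that last suggestion concrete, here is a minimal sketch of the test Phil describes: assert parameters, draw a toy data set from the sampling distribution, and check that the inference machinery (plain least squares stands in for it here; all values are illustrative) recovers them within the expected errors:

```python
import numpy as np

rng = np.random.default_rng(7)

# 1. Assert parameters and draw fake data from the sampling distribution.
m_true, b_true, sigma = 1.3, -0.4, 0.7
x = np.sort(rng.uniform(0, 10, 30))
y = m_true * x + b_true + rng.normal(0, sigma, x.size)

# 2. Run the inference machinery (here, plain least squares).
A = np.vstack([x, np.ones_like(x)]).T
(m_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

# 3. Check that the asserted values come back within the expected error.
cov = sigma**2 * np.linalg.inv(A.T @ A)
print(f"m: {m_hat:.2f} +/- {np.sqrt(cov[0, 0]):.2f}  (true {m_true})")
print(f"b: {b_hat:.2f} +/- {np.sqrt(cov[1, 1]):.2f}  (true {b_true})")
```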