 I'm really glad that the organizers invited me here, and I'd like to thank them for that. I haven't been to Berkeley before, this is my first time, and it's nice and warm, and it's good to talk about Bayesian analysis. My title is Bayesian Data Analysis, and I modified it from Bayesian Statistics because I'm really going to talk about analysis, mostly, because this is what we do. Statisticians worry about the proper statistical properties of mathematical expressions and definitions. We, as astronomers, use the tools that they developed, and these tools help us make the inference about our problem, scientific inference. So I'm an astronomer, and I work with the Chandra X-ray Observatory, which I started working on 20 years ago. This is also the time when we tried to define the challenges in the new data which were coming. So 20 years ago was 1996. We were thinking about the future of X-ray astronomy and X-ray data. We didn't know what would come, but we predicted that there would be some problems. At that time, my boss suggested that I go to a statistics department, talk to statisticians, and ask them how to deal with the problems and challenges which we faced. This is where the CHASC astrostatistics collaboration formed. And it's been 20 years. So let's see if this works. I don't know if I can zoom this, but let's see. We can do that. So we have a web page. My first meeting was with Don Rubin, who was the chair of the statistics department at the time. He suggested that a new professor, David van Dyk, start collaborating with us, because he was interested in new problems. So this is where David van Dyk, his students, me, and Alanna Connors sat down and discussed how to approach X-ray analysis in a Bayesian way. We didn't have a clue what Bayesian was, or maybe Alanna actually did; she introduced me to Bayesian problems. 
But at that time, it was something new, and possibly something which would help us analyze the data. So this is the history of our statistics collaboration. If you go to the very bottom, this is the 1997 paper in which we defined the problem. It was presented at the Penn State Statistical Challenges in Modern Astronomy meeting, which Eric Feigelson and Jogesh Babu have organized over the last 25 or 30 years. In the '90s, this was the place where you could learn about new things in statistics. This is where the first Bayesian statistical analysis paper, or manuscript, was published, by Tom Loredo; the first such paper for astronomy, I should say. In those 1992 conference proceedings there were Bayesian analysis definitions for astronomers. And then our short paper was basically about the challenges we would face. OK, so 20 years later, these are the conferences, these are the things we do, and some software. We have meetings with statisticians, we work with students, and we develop algorithms. They take a long time. If you start with your new problem and you come to statisticians, they want to understand every detail about that problem. They want to know where the data comes from, what you know about the data, and what you really want to get. They ask you all these questions, so you need to be prepared to answer them. And then, during the process of thinking about the method, you realize that not all the questions they asked were answered correctly, or you were confused. So it was a big experience for me, an evolution in my knowledge of statistics, which is great, because now I'm standing here. Twenty years ago, the probability that I would come and give you this lecture was less than 0.001, probably even less. At that time, nobody talked about Bayesian statistics, and people didn't care. Now, of course, the situation is very different. We'll come back to that later. 
But first, here are some statistics of publications in ApJ. I just searched ADS for the word "Bayesian" in ApJ publications, and you see the number of papers which say "Bayesian analysis" shooting up. I'm not sure if they all actually do it. And then MCMC, the Markov chain Monte Carlo method, in refereed papers also increases. So you can look at that growth of usage, of application of Bayesian analysis in astronomy, and it's great. The main problem is that maybe not everybody understands the details of the processing of the data, the details of the applications. But we are getting there, I think, which is very nice. I was thinking about telling you where to find the information, because, of course, I'm not a statistician. I work with statisticians, but I'm not a statistician. I can call myself an astrostatistician, but I'm more astro than statistician. So for references, I put three books here. The first one is popular reading, from Nate Silver. It is a great introduction to probabilistic thinking for us, because he describes ordinary events in our world and applies the Bayesian rules to explain them. There are examples from poker games, from elections, from history. One interesting example he gives is about discovery and the usage of technology. He says that the big breakthrough in our knowledge, in our applications, comes with a big event, a big discovery. And he gives the example of Gutenberg's invention of printing. When you think about books before Gutenberg, books were very expensive; knowledge was not generally accessible or searchable. After Gutenberg's printing press, the number of books people could read started to increase, and knowledge started to circulate. This is something which brought evolution, revolution, to our world, because people started being more knowledgeable, and the world started to change. 
Things shifted around, and the whole industrial revolution came in the 17th and 18th centuries. Since then, things have grown very, very quickly. So when you look at the internet now, which is a sort of similar case to sharing information in the 15th century, we have a lot of information, a lot of data, and we're trying to figure out what to do with it. So we are maybe at the edge of a revolution. Even in astronomy, when you look at these graphs, with our papers on Bayesian analysis going up, maybe that's something which reflects this change which is coming. The main reference for Bayesian data analysis is this red book by Andrew Gelman. It's a very big book, the textbook for statisticians. And as somebody told me, it is very hard reading. But it gives you all the details about Bayesian thinking and definitions. If you read the first four chapters, maybe six, you will have coverage of everything you may need to know in order to understand Bayesian analysis. But it is hard reading, and it may take you not a month but a few years. It's really good, though. This is the third edition; I was reading the second edition 15 years ago, and that was hard reading. And I won't tell you that I understand everything written in that book either. For astronomers, there is a very short, small book which you can probably read in a week or two: Wall and Jenkins, Practical Statistics for Astronomers. It's very nice, gives you a lot of examples from astronomy, and it is a mixture of the standard classical approach and Bayesian analysis. I like that book; it gives you a quick introduction. It covers a lot of the things we need to know the definitions of. So if you hear a word, this is where you can find its description. And it has examples, and you can work through those examples, which can help you. Very good. So, topics. 
I listed all the topics here which cover the standard analysis. I'm not sure we will get to all of them, but I wanted to have them listed. I structured this lecture into three components: it starts with the components of Bayesian analysis, then talks about models and inference, and then model checking and model selection. You had some discussion of probability on Monday; this is just to come back to it. Informally, probability quantifies randomness; that's basically what it is. And statistics uses probability to make scientific inferences. That is the connection between statistics and probability. We, as scientists, have to use statistics in order to make scientific inference. If we don't use statistics, then we're just having a casual conversation with somebody. We have to have some measure of our inference, and statistics provides us with this measure. We also have to have a way to decide about certain things, certain models, or certain views of the world, and this is where statistics comes in and helps us. Monday's discussion about the Bayesian and frequentist approaches gave you an introduction to the differences. So I just put here that Bayesian probability quantifies the degree of belief that an event occurs, while the frequentist understanding of probability is the frequency of an event occurring in the limit of an infinite number of trials, which, in astronomy, would be very hard to achieve. Our source, observed by the X-ray telescope once, may be observed a second time, but the state of that source can vary; we know that X-ray sources vary. So for us, it is very hard to repeat the experiment, and we usually think about our experiments in a Bayesian way most of the time. Then we have probability distributions, which are the foundation of our probability thinking. 
And this is an example of the decision process in which we would reject the hypothesis which gives you this distribution, plotted as a histogram. The probability that this event at 6.8 came as part of the model that describes the rest is very small, less than 1.6%. So we would say it's very unlikely that this event is related to whatever we were talking about, and based on that test we reject whatever the model told us. Probability distributions have a formal mathematical prescription. We have this continuous probability function, which is called the probability density function: the probability that a random variable takes a value between x and x + dx is p(x) dx, and the probability that x lies between two values is the integral of p(x) between them. So you know those things. All right. We talked about the joint, marginal, and conditional probability of x and y. Let's see if we can illustrate that a little bit, or maybe somebody can. So if I say this is x and y, and these are my scattered points, and I draw things like that: what is the joint probability of x and y on that plot? Anybody? Can you tell me? Yes, everything. Any point on this plane, if it's just two variables, any point on the plane tells us about the probability density. Of course, there is the density of the points, which illustrates that density; I put the contours, and there are more points here, fewer points there. So if a point is located here, p(x, y) will be higher than over here, for example. But anything on that plane is the joint probability. So what is the marginal probability of x on that plane? The integral is there, so you should be able to integrate. Can you plot that? Yes, exactly. That's the histogram. Very good. 
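As an aside, that kind of tail-probability decision is easy to sketch by simulation. Everything here is invented for illustration: the null distribution (an arbitrary chi-square with 3 degrees of freedom) is my stand-in for whatever model produced the histogram on the slide.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical null distribution for the test statistic: draws from
# an arbitrary model, here chi-square with 3 degrees of freedom.
null_samples = rng.chisquare(df=3, size=100_000)

observed = 6.8  # the observed value of the statistic

# Tail probability: fraction of simulated values at least as extreme.
p_value = np.mean(null_samples >= observed)
print(f"tail probability = {p_value:.4f}")
```

If that probability comes out very small, the event is unlikely under the model, and we would reject the model; that is exactly the decision logic being described.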
So I'm going to draw this, sort of a histogram, maybe there's a function like that, something like that. So what is the conditional probability? Yes. You fix x, right? You just take this little slice here, right? So there is a curve which will go in that direction somehow; I don't know quite how to plot it, but it will be somewhere like this. Yes, exactly. It's a little harder to illustrate, but it will be anything in that part of the graph, because we fixed one value. Now I wanted to say something about notation, because for us it is sometimes very hard to read the statistics books and also to talk to statisticians. For the known distributions, binomial, Poisson, normal, chi-square, and t, we typically think about the equation of the distribution. But when those equations are put into probability statements, they become very hard to read. So in complex statements the notation is simplified like this: if you see Bin(n, p), you know that this is the binomial distribution. And Poisson(λ), with the rate, it's obvious. The normal will be the Gaussian N(μ, σ) with mean μ and standard deviation σ, and then there are the chi-square and t distributions. There are gamma distributions; there are many distributions. I have some exercises for you, and as one exercise I suggest you plot some of them, or simulate data from those distributions. There are so many tools out there. When we write code, you usually go to the library and call, let's say, numpy.random, and you say numpy.random.normal or numpy.random.multivariate_normal. And you don't care about the equation at that point; you just care about the parameters of the distribution. So it's similar to the way statisticians think about these distributions, because they also think about the parameters in that context. 
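To make the board sketch concrete, here is a small simulation; the correlated Gaussian cloud is my own made-up stand-in for the scatter plot. The 2-D histogram plays the role of the joint density p(x, y), summing over y gives the marginal p(x), and a renormalized slice at fixed x gives the conditional p(y | x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D Gaussian scatter, standing in for the x-y cloud on the board.
mean = [0.0, 0.0]
cov = [[1.0, 0.7],
       [0.7, 1.0]]
x, y = rng.multivariate_normal(mean, cov, size=50_000).T

# Joint: a 2-D histogram approximates p(x, y) on the plane.
joint, xedges, yedges = np.histogram2d(x, y, bins=40, density=True)
dx = xedges[1] - xedges[0]
dy = yedges[1] - yedges[0]

# Marginal p(x): integrate the joint over y.
marginal_x = joint.sum(axis=1) * dy

# Conditional p(y | x near 1): take the slice at x = 1 and renormalize.
i = np.searchsorted(xedges, 1.0) - 1
conditional = joint[i] / (joint[i].sum() * dy)

# For this correlated Gaussian, E[y | x = 1] should be close to 0.7 * 1.
ycenters = 0.5 * (yedges[:-1] + yedges[1:])
cond_mean = (conditional * ycenters * dy).sum()
print("E[y | x near 1] ~", cond_mean)
```

Plotting `marginal_x` against the bin centers reproduces the histogram answer from the audience, and the slice is exactly the "fix x" answer for the conditional.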
Of course, they know the equations, and they deeply understand all the transformations and things that need to be done. But when they write things down, to simplify, they use this notation. So for us, if you code something, you just have to understand the properties of these distributions and then use them, which is nice. Let me see what else I have on this. OK, so there is more to the notation. We talked about probabilities. But say θ is a parameter of a model; it could be 1 parameter, it could be 10 or 20, often. You write θ ~ N(μ, σ²), and this tells you that the parameter is distributed as a Gaussian with a certain mean, for one parameter; if there are many, this would be a multivariate Gaussian with many μ's and many σ's, and some correlations between them. And then the probability of θ could be written as p(θ) = N(μ, σ²), something like that: θ is the random variable, and that's the probability of that variable, which could be normal. And then there is the probability of θ given μ and σ², where σ² is the variance, yes? Oh, that's a better one, OK. Yeah, sorry, my writing is bad, but if you can't even see it, it's even worse, right? I have to have those. So when we talk about the probability density, we put the normal here as well: it will be N(θ | μ, σ²), the density of θ given μ and σ. Oh, that's surprising, I didn't know. Yeah. All right, so these are the three main notations. This is the probability of the model parameter given the parameters of the distribution, right? And this tells us that it's normal. Why would you ever write the middle one? Well, if you think about a prior, this is what you would write, right? 
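In code, those notations map onto a draw and a density evaluation. A minimal sketch, where the μ and σ values are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)

# theta ~ N(mu, sigma^2): draw parameter values from the distribution.
mu, sigma = 2.0, 0.5
theta = rng.normal(loc=mu, scale=sigma, size=100_000)

# p(theta | mu, sigma^2): evaluate the Gaussian density at a point.
def normal_pdf(t, mu, sigma):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print("sample mean, close to mu:", theta.mean())
print("density at the mean:", normal_pdf(mu, mu, sigma))
```

The "~" statement is the sampling call; the "p(θ | μ, σ²)" statement is the density function; and p(θ) without the conditioning bar is shorthand for the same density once μ and σ are fixed.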
If you think about what you believe about the parameters, you can say: OK, my parameter, let's say the slope, has values distributed according to a certain distribution, which is normal with that mean and that range, right? You mean this one? Yeah, that's it, yes. So this is where, when you go and read some statistical text, you need to have cheat sheets on the side until you learn it. It's like learning a different language, right? You don't know all the words, and you learn. So actually, on our CHASC webpage, we created a dictionary for astronomers to understand statistics and for statisticians to understand astronomers, because when we started talking, we couldn't communicate clearly; our astronomy language is so bad, or was so bad, that they didn't understand what we were talking about. So we had to do that, which was a good exercise, but now I looked at the dictionary and it has a lot of missing spots, so we have to update it. It is on the CHASC webpage; it's called "stats jargon", go there. Okay, so let's go on. I was talking about distributions, and I wanted to suggest one exercise which you can do: try to compare the Poisson and Gaussian distributions, which we all know and use, because those are very common in astronomy. For low count numbers, we really have to use the proper Poisson. These graphs, one and then the next, show the difference. The black curves show the Gaussian distributions with mean two and mean five, and the dots illustrate the Poisson distribution with rate two and rate five. You can see that those are very different. So if you're dealing with a very small number of counts, and by counts I mean not only X-ray counts: if you have two galaxies in a population of galaxies, and you sample one or two or three or five, you need to think about the Poisson case. 
But if you go to higher numbers of counts, these distributions become closer and closer together. For 40 counts, a mean of 40 and a rate of 40 in the Poisson, the shapes are very similar. So you can do this and play with all the distributions. This is just Poisson versus Gaussian, but go to your random library and there are the beta and binomial distributions. Try different parameters, plot them, and get a feeling for how they look, because this gives you intuition about problems. When you think about your scientific question and the definitions which come into your description of the problem, if you have the intuition and you can see what would be applicable, then it's nicer. So practicing this is good. So now, Bayesian data analysis, right? What is it? There are three main components. Usually we set up the full probability model for the problem: you look at the joint probability distribution of everything, observable and unobservable. So what you know, what you can observe, what could be observed in the future, all the quantities of the problem you're going to face. Then you condition on the observed data to calculate and interpret the posterior distribution. And then, at the end, you evaluate the model: you do model checks, evaluate the fit, and consider the implications of the posterior distribution. One problem with Bayesian analysis, in the past, was that calculating the posterior distribution was very hard, because before computers you really had to do things analytically. And as we show later, approaching the analytical calculations is very hard in the Bayesian way, because you start getting into these strange functions, gamma distributions, beta distributions, and you have to integrate them somehow. So people didn't use the Bayesian formalism, mostly because they had problems with these applications. 
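That exercise can be done in a few lines without any plotting library. This sketch just compares the two functions at the mean; the rates are the ones from the slides, but the comparison at a single point is my own simplification of the plotted curves.

```python
from math import exp, factorial, sqrt, pi

def poisson_pmf(k, rate):
    return rate**k * exp(-rate) / factorial(k)

def gauss_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# Compare at the mean for small and large rates; the matching Gaussian
# uses mu = rate and sigma = sqrt(rate), the usual approximation.
for rate in (2, 5, 40):
    p = poisson_pmf(rate, rate)
    g = gauss_pdf(rate, rate, sqrt(rate))
    print(f"rate={rate:2d}  Poisson={p:.4f}  Gaussian={g:.4f}  ratio={p/g:.3f}")
```

The ratio creeps toward 1 as the rate grows, which is the convergence the slides illustrate; at rate 2 the two are visibly different, at rate 40 they nearly agree.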
A very important strength of Bayesian analysis is this last point: when you finish your calculations, you have the full understanding of the posterior distribution and the fit. You have basically learned what you could, given what you observed. You cannot learn more, because that's what you already put into your model. So the only thing you can do is ask: what do I need to do next? Is what I already observed really sufficient? Am I done, do I have my conclusion, and is it a solid conclusion? Or should I go and do more observations? This final point of the analysis is actually very important in Bayesian applications, and as you will see tomorrow in the sampling tutorial, there are tricks, or confusing places, where you have to understand what's happening when you do this. So these are the building blocks of Bayesian inference. You talked about the sampling distribution and the likelihood, which tells us how likely the data are given the parameters of our models. Then the priors: what do we know about these parameters? And then Bayes' theorem, which defines the posterior distribution. This is a simplification, of course, because there is a normalization factor, but we usually don't care about the normalization; it's one of those things you marginalize, integrate out, and forget about. These are the most important parts of Bayes' theorem which we need to understand and use. So let's talk about those. Oh no, I have a little bit more about Bayesian inference first. This is based on a statement in the red book; if you go to one of the sections, I don't remember exactly where, there is this trend and some additional points about Bayesian inference. The trend, as I already said, is that we combine all the information about the problem, and, this is important, we also input the uncertainties into those calculations. So it takes everything into account. Some additional points. 
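Those building blocks can be exercised numerically with a brute-force grid, which also shows why the normalization factor can be ignored until the very end. The data and the flat prior here are invented for illustration; the model is a single Poisson rate.

```python
import numpy as np

# Grid over the Poisson rate parameter theta.
theta = np.linspace(0.01, 15.0, 2000)
dtheta = theta[1] - theta[0]

# Made-up observed counts.
y = np.array([3, 5, 4, 6, 2])

# Log-likelihood: sum_i [ y_i log(theta) - theta ]; the y_i! terms are
# dropped because they do not depend on theta (pure normalization).
loglike = y.sum() * np.log(theta) - len(y) * theta

# Flat prior; Bayes: posterior proportional to likelihood times prior,
# normalized numerically at the end.
logprior = np.zeros_like(theta)
post = np.exp(loglike + logprior - (loglike + logprior).max())
post /= post.sum() * dtheta

post_mean = (theta * post).sum() * dtheta
print("posterior mean:", post_mean)  # analytic answer: (sum(y) + 1) / n = 4.2
```

Notice that two unnormalized ingredients go in, and one properly normalized posterior distribution, not a single number, comes out.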
Bayesian data analysis is based on models and their parameters. There is some new development on approximate Bayes and non-parametric Bayes right now, but these are new avenues in statistics; what we do in astronomy is apply the Bayesian framework to our analysis with models, sometimes with many parameters. And there are problems, of course, if you have many parameters in your model, because setting up the full probability model, the full probability statistical model I should say, with many parameters is very tricky. It's tricky to set up and also to do the full inference, because your posterior becomes very scary, right? With all the bumps everywhere. So you have to be willing to use these many parameters, and if you do, you have to think about structuring the problem into a hierarchy, where you can maybe separate certain parameters in the structure of your approach. So not saying: here is my full posterior, likelihood and prior with all the parameters in one line, because that may be very hard to handle. You think about a hierarchy of those parameters, and maybe you can calculate some of them first and then use them as a step in the next calculation. There is model checking; we'll talk about model checking. There is also an emphasis on inference in the form of distributions. When you get your result from a Bayesian analysis, what you get is the posterior distribution. It's not just the maximum likelihood, and this is the main difference from the point-estimate approach which is often found in frequentist methods, where you're interested in one value which characterizes your model. 
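That contrast between a point estimate and a full posterior can be seen in a few lines. The counts below are invented, and I use a flat prior on a Poisson rate purely for convenience, since its posterior has a known gamma form:

```python
import numpy as np

rng = np.random.default_rng(2)

y = np.array([2, 4, 3, 1, 5])  # made-up Poisson counts

# Frequentist point estimate: the maximum-likelihood rate is the sample mean.
mle = y.mean()

# Bayesian result: with a flat prior on the rate, the posterior is
# Gamma(sum(y) + 1, n); we summarize it by draws.
draws = rng.gamma(shape=y.sum() + 1, scale=1.0 / len(y), size=100_000)

print("point estimate (MLE):", mle)
print("posterior mean:", draws.mean())
print("68% credible interval:", np.percentile(draws, [16, 84]))
```

The point estimate is one number; the posterior gives the same central value plus the whole spread of rates consistent with the data, which is what you carry into the next step of an analysis.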
So, point estimate versus the full posterior distribution; this is important, and maybe I need to say something on the side right now. As you saw, we have a lot of work applying Bayesian methods in our problems and publications, and there is a critical point about propagating the posterior distribution, your result from the Bayesian analysis, into the publication. How do you publish what you learned? I think it's not solved, because we don't have standard data structures, FITS files, for handling the posterior distribution which is the result of your MCMC at the end, because that's what you do. You get all your samples; you have 100,000 draws from your posterior. Where do you put them? There is no place in ApJ or Monthly Notices for them. So I was thinking that this could be one of the additional breakout sessions: just brainstorm about the possibility of defining a way to publish the result of our analysis. Yesterday I was thinking: oh, this is a group of people who are passionate about computing, they really want to do this analysis, so let's give them this problem. It's a computational problem, or a database problem, or however you frame it, but I think it's important, because we have to have a way to propagate that information. Often you publish just the summaries of your results, so you see the histograms of the parameters, but this is only partial information. You can imagine that if you had the information about the full posterior when you read the paper, then you could take it and maybe apply it as part of your analysis in the next step. So it's not only to show the result, it's also to use it later on, which is important, I think. Yeah, the use of simulations we know; we have all the MCMC, so let's simulate. The use of probability models as tools for understanding and possibly improving our techniques, and this may not really involve Bayesian models; it can involve other things. 
So include as much information as possible, and then look at simulated data as a random sample: you can take such data and include it in your analysis, because it is a random sample from your posterior, so it's part of this analysis. And you need to be willing to design studies and inference which are robust to the model assumptions. Often people say: well, don't do Bayes, because you really skew your final result; you put all you know into this prior, and then what you get out is what you put into the prior. Well, if that happens, it means that your prior is wrong, your model is wrong, and your approach is wrong. So it's not that Bayes is wrong; it's what you did that is wrong, right? So you have to be careful about those things. All right. Oh, maybe we will do this one thing and then take a break. I feel like I'm talking too much, so we have to do some calculations. All right, the likelihood of one data point. If we have just one observation, let's say one star, and it comes from the normal distribution, then the likelihood of that observation is just the standard Gaussian equation. Do I need to write the normal? What I really wanted to do was to write the Poisson, so maybe we'll do the Poisson, and I'll leave you to work through the normal by yourself, because I think it's more intuitive. So, the Poisson example. We have a Poisson observation, and I have my notebook, because I don't think I can do all this derivation from my head; I use my cheat sheet. So we have one observation y from the Poisson distribution, θ is the model parameter, and the likelihood of that single observation is just the Poisson equation. 
So y here is, in this case, just one observation, and that's the Poisson equation for you: p(y | θ) = θ^y e^(−θ) / y!, where θ is the model parameter. In the case of the Poisson, it will be the rate of, let's say, counts coming towards you. Sometimes this rate parameter includes our exposure time, because you can define the Poisson in terms of arrival times and the rate; that goes into this θ. So for a single observation, the likelihood looks like this. If we have multiple observations, then we have y = (y₁, …, yₙ). Say we have a spectrum with 50 channels in X-rays, so we have 50 little Poisson arrivals. To define the likelihood now, you basically write the Poisson for the whole thing: p(y | θ) = ∏ᵢ₌₁ⁿ θ^(yᵢ) e^(−θ) / yᵢ!. So I multiply the probabilities for all the channels. This is actually easy right now, and it will become a little more complex later. One exercise which I suggest you also do, and I have suggested exercises in a separate form which I will put on the AstroHackWeek page, is to work through this multiplication. So we do some tricks, and I hope you do math, right? You can define the sufficient statistic t(y) = Σᵢ yᵢ, just the sum of the counts. When you do that, you can rewrite the likelihood as proportional to θ^(t(y)) e^(−nθ). You can derive this on paper; nobody does math in their notebooks anymore, you just use computers, but try expanding the product: collecting the terms gives θ^(Σ yᵢ) e^(−nθ). Now, the property of the exponents, right? 
So, this is something people did in high school. Natural log: θ^(t(y)) = e^(t(y) log θ), where log is always the natural log here; this is just a property of exponents. When you use this tool, you get something interesting, because you can rewrite the likelihood as e^(t(y) log θ − nθ), and that suggests how to define the prior, because you look for something of a similar form. So if you ask about a prior for the Poisson rate, you would write this equivalent form with its own parameters, and those parameters are called hyperparameters. So the likelihood has the form θ^(t(y)) e^(−nθ), that's our likelihood, right? And the prior looks similar: a prior on θ will typically have the form p(θ) ∝ θ^(α−1) e^(−βθ). This, if you go back to the distribution equations, is the gamma distribution. You can work through the equations yourself and look at the distribution definitions later. What's important is that the gamma distribution is so-called conjugate to the Poisson distribution, so it's often used as a prior for the Poisson problem, and it comes through this. Conjugate means it's in the same family of functions, right? So if you think about priors for your Poisson problem, you often get into gamma priors. And as astronomers we ask: why gamma priors? This is why, because these functions are very closely connected. 
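Both steps, the collapse of the product likelihood to θ^(t(y)) e^(−nθ) and the gamma-Poisson conjugate update, can be checked numerically rather than on paper. The counts and the Gamma(α, β) hyperparameters below are invented for the sketch:

```python
import numpy as np
from math import lgamma, log

rng = np.random.default_rng(3)

y = np.array([4, 6, 3, 5, 7])  # made-up counts
n, t = len(y), y.sum()         # t(y) = sum of counts, the sufficient statistic

# 1) The full Poisson log-likelihood differs from the reduced form
#    t(y) log(theta) - n*theta only by a theta-independent constant,
#    so differences between two rates agree exactly.
def loglike_full(theta):
    return sum(yi * log(theta) - theta - lgamma(yi + 1) for yi in y)

a, b = 2.0, 3.0
full_ratio = loglike_full(b) - loglike_full(a)
reduced_ratio = t * (log(b) - log(a)) - n * (b - a)
print(full_ratio, reduced_ratio)

# 2) Conjugacy: a Gamma(alpha, beta) prior, p(theta) ~ theta^(alpha-1) e^(-beta*theta),
#    gives a Gamma(alpha + t, beta + n) posterior, with no integration at all.
alpha, beta = 2.0, 1.0
alpha_post, beta_post = alpha + t, beta + n
print("posterior mean:", alpha_post / beta_post)

# Cross-check against draws from the posterior.
draws = rng.gamma(shape=alpha_post, scale=1.0 / beta_post, size=200_000)
print("mean of posterior draws:", draws.mean())
```

The update α → α + Σyᵢ, β → β + n is the whole payoff of conjugacy: the posterior stays in the gamma family and you only do arithmetic on the hyperparameters.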
And when you use these functions, especially in the past when there were no computers, the calculations could actually be done, right? Now, in the era of computers, we don't care as much about these distributions being conjugate, or about having a conjugate prior, because even when you have a prior which comes from a completely different part of mathematics, you can still do it, because you have a computer, right? So you can derive this yourself; that's a nice exercise. What else do I have here before we break? You see, I lost my computer. It's gone. I don't have it here, so now it's like I don't have anything on my computer. So I think this is a break, okay? We'll fix it, because my computer crashes sometimes, but maybe this is a good time to break. Ten minutes, right? Do you want to take two breaks? Yeah, I'll probably take two breaks: ten minutes now and then ten minutes later. I cannot talk for a long time; I need it, yeah. All right. Maybe, yeah, I mean, if we end up writing more. Okay, we got coffee, chocolate, now we go. Yeah, I guess that's the main reason to use it: in the past, if you had a conjugate prior, you could do your analytical calculations; now it's to speed things up and maybe to better understand your problem, because some of the priors may not be of that family, they could be different. But I don't know if astronomers care about conjugate priors, because we usually have some prior knowledge about our problem, and we apply that knowledge, and that's our prior, right? So when you set up the statistical model, then yes, you may talk to your colleague, a statistician, and I actually tell you, that's the best way to learn and develop your Bayesian modeling: find a statistician and work with them, because you either do statistics or you do astronomy. It's very hard to do both. 
As astrostatisticians we try to do both, but if you want to develop something new, it's always good to do it with an expert in the field, and statisticians are the experts in statistics; we are not. Yes, of course — and this is why you need to be there talking with them, because they can disappear into their mathematics and be very happy, but you have to check that it hasn't become pure mathematics, that it's still your problem. Otherwise it may be unreasonable. The other thing, which I was talking to Daniela about a few minutes ago, is taking logs. The first time I heard about taking a log when working on a statistics problem, I was surprised, because you work with an X-ray spectrum and you don't think about logs there. No — not the data, the log of the parameters. For the data we do use log flux: magnitudes are a very natural way for astronomers to measure the intensities of stars. But here it's the parameters: you take the log once, and if the result still doesn't look like a normal distribution, you take another one — log-normal, log-log-normal — and at the end you transform back to your parameters on the original scale, because that's what you were interested in. Taking a log turns something very skewed into something that looks much nicer: a distribution with a long tail, after a log, moves into a nicer place in the parameter space. So two things — conjugate priors and logs — and both are mostly for computation and for understanding the problem. It's also good to understand the vocabulary: when statisticians talk about conjugate priors, they mean priors in the same family as the posterior. So the exponential form of the Poisson leads you to the gamma distribution for the posterior. But how do you know which family? You don't know in advance —
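The effect of the log transform on a skewed parameter can be seen numerically. A small sketch — the lognormal sample here is just an illustrative stand-in for a skewed, strictly positive parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed, strictly positive "parameter" sample (e.g. a flux-like
# quantity); lognormal is just an illustrative stand-in.
flux = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

def skewness(x):
    """Sample skewness: third central moment over std cubed."""
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

skew_raw = skewness(flux)          # strongly right-skewed
skew_log = skewness(np.log(flux))  # roughly symmetric after the log
print(skew_raw, skew_log)
```

The raw sample has a long right tail and large skewness; after the log it is nearly symmetric, which is exactly why the transform helps both samplers and eyeballs.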
You don't know, but you can expect it, because you have the likelihood. So that's where we start. This slide shows the likelihood, written for X-ray astronomers in a very old way — I found this transparency and decided it already had some math on it, so it was good to show. We have the likelihood of the observed counts given the model parameters: a probability distribution over the data given the parameters. We did this on the board earlier. It's pretty general — why is it actually counting? — yes, because you measure stars and each has a certain magnitude. It's very general because that's the Poisson: when you think about a counting problem, you come up with this distribution. For stars, depending on what you want to do with them, it could be Poisson. But take the problem I put on the exercise list: say you observe a star cluster containing red giants and white dwarfs, and you want to learn about the population — the number of one type versus the other. That won't be Poisson; it will be binomial, because you have just two types of stars in that cluster. Poisson is one option among several. Now — oh, this slide is priors. So we talked about conjugate priors, and there is also the distinction between informative and non-informative priors. Actually, I didn't give you the beta-prior example: when you take the binomial likelihood and follow the same derivation, you end up with the beta distribution as the conjugate prior. Very nice — though the beta distribution is harder to build intuition for. Don't worry, it's mathematics, but one can do the exercise. And Wikipedia has a very large amount of information on these distributions now, which is good — you can go and look them up there.
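The binomial-beta pair works exactly like the Poisson-gamma one: a Beta(a, b) prior plus k successes in n trials gives a Beta(a + k, b + n − k) posterior. A toy version of the star-cluster example — the cluster size and true fraction are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical cluster: each star is a red giant with probability p,
# otherwise a white dwarf.
p_true = 0.3
n_stars = 200
k_red = rng.binomial(n_stars, p_true)

# Beta(a, b) prior on p; conjugacy gives Beta(a + k, b + n - k) posterior.
a, b = 1.0, 1.0                     # flat prior on [0, 1]
a_post = a + k_red
b_post = b + n_stars - k_red

posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)
```

Again the prior acts like pseudo-observations: Beta(1, 1) adds one fictitious star of each type to the tally.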
So, non-informative versus informative priors. Typically, if you don't know anything about the problem, you have no information. You say: I want to get the slope of my spectrum and I don't know what it is; maybe it's flat between minus 10 and 10. Not minus infinity to plus infinity — the slope is physically reasonable, somewhere between those two values — but within that range you give no information: the prior is just continuous and flat. If you do know something about that spectrum, because you observed it before and you know it's 0.5 — for the radio astronomers, the synchrotron spectrum, right, 0.5 — then you center on 0.5, perhaps between minus one and three, because you're looking at a radio spectrum. And the probability can be higher near the expected value, so the prior is not flat in this case; it could be tied to the electron distribution in your problem, for example. If you accelerate electrons, you get a certain slope, so looking at a slope of 10 doesn't make sense. You look only at what lies within the range of your understanding of the problem, with some distribution over it. So you have informative and non-informative priors, plus conjugate priors. I think for us astronomers, thinking about priors in general is not a problem, because we learn about our sources by reading papers and studying them. The problem comes when you have to assign a distribution to that prior knowledge: the selection typically defaults to normal functions or flat ones, very rarely gamma, beta, and the others — those are priors the mathematicians come up with. But I should say that this selection is very important, because the prior is a component of the full model, of the full posterior distribution. So it really matters that you select a prior that is good for your problem but doesn't over-specify it.
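The two spectral-slope priors just described can be written as log-densities (up to constants), the form a sampler actually consumes. The bounds, center, and width below are the hypothetical numbers from the lecture, not recommendations:

```python
import numpy as np

def log_prior_flat(slope, lo=-10.0, hi=10.0):
    """Non-informative: flat between two generous but finite bounds."""
    return 0.0 if lo < slope < hi else -np.inf

def log_prior_informative(slope, mu=0.5, sigma=0.5, lo=-1.0, hi=3.0):
    """Informative: Gaussian around the physically expected slope
    (e.g. 0.5 for a synchrotron spectrum), restricted to a plausible
    range. Normalization constants are dropped."""
    if not (lo < slope < hi):
        return -np.inf
    return -0.5 * ((slope - mu) / sigma) ** 2

print(log_prior_flat(0.5), log_prior_informative(0.5))
print(log_prior_flat(20.0))  # outside the bounds: zero prior probability
```

Returning minus infinity outside the bounds is how "a slope of 10 makes no sense" gets encoded: any proposal out there is rejected automatically.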
If your prior is too narrow, you really can force the solution into that part of the parameter space, which is not what you want to do. So you have to weigh these choices. Okay, statistical models — what do I have there? I think it's time we define some models for actual problems in astronomy. This is a very noisy X-ray spectrum that Mike Nowak showed me a year ago — I think Daniela is still working on that problem — but it's a good example of X-ray data: very high resolution, with some continuum fit to it. You have the continuum, the red line drawn through all the points, and then a lot of ups and downs — the residuals in the bottom panel, the difference between the continuum and the lines. In this problem, what you really want to know about are the lines present in the spectrum: whether any line is present at all, or whether it's all noise, and what type of lines are there. The problem is not trivial — talk to Daniela if you want to learn about the treatment of this problem, because it hasn't been solved — but it's a good example. To set up a statistical model in X-rays, you think about the physical source model. There is a continuum — some process emitting it, which I wrote as F with some parameters, defined on the energy scale — and there are the lines, written in this strange form, but these are Gaussian lines: some number of lines, each with a certain location and width. I think somebody here is working on a line problem, right? With Voigt profiles — okay. This was written in the paper and I just copied it; yes, that's the line shape. It could be one way or the other; it depends on how you define the problem.
And there are actually two things here, because if you don't know how many lines there are, you don't know K. So K could also be a parameter of your model. In this case it won't be, and I won't talk about that, but you genuinely may not know whether you have 10 lines or 100 lines. In the Voigt-profile case, I don't know if they use the number of lines as a parameter. Oh, okay — that's nice, very good; we will hear about that problem later on. So, in X-rays we have calibration files. The model we usually think about lives in physical space — flux, the intensity of the source you observe. In X-rays you also have calibration files, the instrument information. So here we have a source, a redistribution matrix, and an effective area in this sum, and there is also background. When we look at the observed counts — model counts, in this case — this is how we would calculate them. Similarly, for two-dimensional models, like a source in an image, you would include a PSF as the instrument component of your model. When you build the full model, you have to put all of these into one equation, and I think I'll have time to show you the 2D example later on. The likelihood for this Poisson model is here — we already said it's Poisson — and then the posterior distribution is defined as we did before. Here's what we didn't say yet: the posterior predictive distribution is what you get when you put the model on the right side and integrate over the parameters to get predicted counts — predicted data given the observed data. You take the likelihood and the posterior, integrate, and you get the posterior predictive distribution, which is important, as we'll see later, when we try to evaluate the posterior distribution and the models.
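The "redistribution matrix times effective area plus background" construction can be sketched numerically. Everything below — the grid sizes, the power-law flux, the toy ARF and RMF — is invented for illustration; real response files come from the calibration products:

```python
import numpy as np

rng = np.random.default_rng(1)

n_energy, n_channel = 100, 80
energies = np.linspace(0.5, 8.0, n_energy)           # keV, made-up grid

# Hypothetical power-law source flux and toy instrument pieces:
flux = 10.0 * energies ** -1.5                       # photons / keV (toy units)
arf = np.exp(-0.5 * ((energies - 2.0) / 2.0) ** 2)   # toy effective area
rmf = np.abs(rng.normal(size=(n_channel, n_energy)))
rmf /= rmf.sum(axis=0)                               # each energy column redistributes
background = np.full(n_channel, 0.5)                 # flat toy background

# Expected counts per channel: redistribute ARF-weighted flux, add background
model_counts = rmf @ (arf * flux) + background

# A simulated observation, and the Poisson log-likelihood (up to a constant)
observed = rng.poisson(model_counts)
log_like = np.sum(observed * np.log(model_counts) - model_counts)
print(log_like)
```

The posterior predictive step mentioned above is the `rng.poisson(model_counts)` line repeated for parameter values drawn from the posterior rather than a single fixed set.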
Those are the equations, but when you build the full model, this is an example of the structure — from our first paper with the statisticians, the van Dyk et al. 2001 paper, where we defined the Bayesian model for the X-ray case. We start with the emitted spectrum at the top, in counts per unit energy. The spectrum travels to the telescope and is observed — already absorbed, and also modified from its original shape by the effective area of the telescope, because not every photon at every energy that lands on the telescope makes it to the detector. This is why an X-ray spectrum has this very weird shape and looks nothing like an optical spectrum: there is always an instrumental shape in it — that's your effective area. The absorption is because the universe is not empty; you always absorb. Then there is the instrument response, which redistributes what arrives at the detector, and the detector measures counts. Because its energy resolution is not very good, it broadens the energy of a line: if a line was emitted, the detected line no longer has its original shape — it's broadened. The energies can also be mixed: in X-rays a photon of, say, six keV can be detected in channels that are typically sensitive to lower energies, four keV or even one keV. In Chandra we also have pile-up, which matters when the source is very bright: two photons can land at the same location and be read as one, so both the count and the energy of that count can be wrong. So the original spectrum is modified by all these effects, and then there can be a background component — not necessarily, though in this case there wasn't one. All these arrows point to loss of data.
And this is what statisticians, and people who work on models generally, care about, because you want to describe the process by which the loss is happening. At every level we lose data, and somehow that has to be incorporated in the whole process, because what we observe at the bottom is not what was emitted at the top. There are two lines there — maybe one of the lines was in the observed spectrum; it's just an illustration. Okay. In the first step — and typically when you do forward-folding analysis in X-rays, in XSPEC or Sherpa, using chi-square — you only worry about the physical parameters. In our first approach here, we assumed the parameters were the physical parameters, including absorption, and the instrument was fixed: you had the instrumental responses and they were fixed. That was 2001. During the last few years, though, we worked on an approach in which we don't know the instrument's properties exactly, so all the parameters describing the ARF and RMF become parameters of the model too. That implementation took a long time to figure out — how calibration uncertainty enters the Bayesian framework — because you can go to the calibration team and ask: what is your uncertainty on the ARF shape? I didn't initially think of this as a hard problem, because systematic uncertainties in X-rays are typically propagated — you add them in quadrature somewhere, arithmetically. But for Chandra responses you can't really do that, because the errors are nonlinear: if you perturb the ARF in the soft part of the spectrum, you also affect the detected spectrum at the high-energy tail. It's not linear; you cannot just shift the whole thing up and down.
When you look at the generated ARFs, they vary nonlinearly. So we described the variations of the ARF using PCA and parametrized the ARF variability with the PCA components — we needed eight components to describe the full properties of the ARF. That was good, because this parametrization allowed us to handle the uncertainty and include it in the analysis. It matters — but when you take it into account, as we did in simulations, it turns out the calibration uncertainty contributes little at low numbers of counts. If you have a very bright source with, say, 60,000 counts in your spectrum, as in X-ray binaries, that's where calibration uncertainty really broadens your parameter distributions, the uncertainties. Good point. The strength of this approach was that we could put this uncertainty into the framework — it's actually right here. In Bayesian analysis, after you define the model and set your priors, what you do is run simulations to find your posterior distribution, and it's as simple as this graph — okay, I have it on the next slide, a bigger version. We have the model and the prior on one side; we draw parameters; we compute the likelihood and the posterior probability given those parameters; and then we accept or reject and update the parameters. Depending on the accept-reject criterion, you get different chains — MCMC chains, which we'll talk about tomorrow. So you accept or reject, and loop back around. Inside the loop, when you compute the likelihood and posterior probability, you compare against the data, taking the data and calibration together — in that one block you do everything. So this is the schematic.
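The draw-compute-accept/reject loop just described is a random-walk Metropolis sampler. A minimal sketch on a toy problem — the Poisson data, flat prior, step size, and iteration count are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data: Poisson counts with an unknown rate theta
data = rng.poisson(3.0, size=40)

def log_posterior(theta):
    """Poisson log-likelihood (constant dropped) + flat prior on theta > 0."""
    if theta <= 0:
        return -np.inf
    return np.sum(data * np.log(theta) - theta)

# Random-walk Metropolis: propose, compare posteriors, accept or reject
n_iter, step = 5000, 0.3
chain = np.empty(n_iter)
theta = 1.0
lp = log_posterior(theta)
for i in range(n_iter):
    proposal = theta + step * rng.normal()
    lp_new = log_posterior(proposal)
    if np.log(rng.uniform()) < lp_new - lp:   # accept with prob min(1, ratio)
        theta, lp = proposal, lp_new
    chain[i] = theta                          # repeat old value on rejection

posterior_mean = chain[1000:].mean()          # discard burn-in
print(posterior_mean)
```

Note the rejected proposals still write the old value into the chain; that repetition is part of what makes the stationary distribution come out right.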
And when we applied calibration uncertainties, it was very natural to put them into this calibration box: instead of using one fixed value, we could vary that component. We moved the calibration into the draw-parameters step and connected them: you have data, you draw parameters from a model that includes the calibration, you compute the likelihood and posterior probability, and you get your posterior. With the calibration problem there were a few tricks, because when you take the PCA description as the calibration uncertainty — the distribution of calibration values — you also don't know whether that's everything you know about the calibration. So could you also learn about the calibration itself? Can you use the data to update the instrument's calibration parameters? This took a lot of discussion at the formal statistics level, because you start worrying about using the data twice. Can you actually do that in a formal way? It's possible, and we did it. Two papers came out of this — a full description of the problem — and two people graduated in statistics on it: one was a student, the other our postdoc. And yes, exactly — for us it was "so what's the problem, why can't we just do it?" No, no — it was Xiao-Li Meng and David van Dyk at the board discussing it for a long time. When you do something for the first time, you always have to ask whether it's the proper formal way of dealing with the problem: are you introducing anything strange, and how will it behave? PGM, right — yes, good, thanks. [To the next speaker:] In the afternoon, are you going to do this as a movie with things appearing one by one, or will it be static? — No, it will be a whiteboard drawing.
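Drawing a plausible ARF inside each MCMC iteration from a PCA representation can be sketched as follows. The toy "library" of ARF realizations below is fabricated; in the real analysis the replicates come from the calibration team, and the lecture's choice of eight components is used here only as a placeholder:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy library of ARF realizations; rows are replicate ARFs, columns
# are energy bins. (Real replicates come from calibration products.)
n_reps, n_energy = 500, 64
grid = np.linspace(0.0, 1.0, n_energy)
arfs = np.sin(np.pi * grid) + 0.01 * rng.normal(size=(n_reps, n_energy)).cumsum(axis=1)

# PCA via SVD of the mean-subtracted library
mean_arf = arfs.mean(axis=0)
u, s, vt = np.linalg.svd(arfs - mean_arf, full_matrices=False)
n_comp = 8
components = vt[:n_comp]               # principal directions in energy space
scales = s[:n_comp] / np.sqrt(n_reps)  # per-component standard deviations

# Inside an MCMC iteration, one plausible ARF is a random combination:
coeffs = rng.normal(size=n_comp)
arf_draw = mean_arf + (coeffs * scales) @ components
print(arf_draw.shape)
```

The low-dimensional coefficients can simply be redrawn each iteration (propagating the uncertainty), or treated as fitted parameters if you want the data to inform the calibration.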
Oh, whiteboard drawing — all right, I was imagining an animation where things pop up. No? Whiteboard animation, very good. Any more questions here? I should ask you more questions — are you asleep? All right, let's move forward. So we ran all these simulations; this is just one slide of what comes out of MCMC. I think everybody has seen this kind of plot before, because it's very popular now. There are two things to remember. First, you always have to look at the trace of your chains — always. This is a very well-behaved trace. When you do an MCMC simulation you have iterations: each pass through the loop counts one, two, three, four. And it's a Markov chain: the accept-reject criterion tells you how to move through the parameter space. It's a random walk, but one that only looks at the previous parameter value and accepts or rejects based on that — everything earlier is forgotten. Chain and trace: the trace is the value of the parameter as a function of iteration. If your model has five parameters, look at the traces for all five individually, because one can be very nicely behaved like this one — random flickering — and another can sit flat, not changing at all, which means your problem is not really being solved properly. I can bring examples up later; it's not in the slides. Tomorrow there will be more on MCMC, so I can prepare some very badly behaved chains. But this one is very good. When you look at your chains, you know right away whether the problem was set up properly; if you don't look, you have no idea what you did.
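Beyond eyeballing the trace, even a crude numerical check catches the "stuck chain" pathology. This sketch compares first- and second-half means of a chain; the two synthetic chains are invented to mimic a well-mixed trace and one stuck near its starting value:

```python
import numpy as np

rng = np.random.default_rng(9)

# A hypothetical well-mixed chain vs one stuck near its start
good_chain = rng.normal(1.25, 0.1, size=4000)
bad_chain = np.concatenate([np.full(2000, 0.2),
                            rng.normal(1.25, 0.1, size=2000)])

def split_mean_gap(chain):
    """Crude mixing check: gap between first- and second-half means,
    in units of the chain's overall standard deviation."""
    half = len(chain) // 2
    return abs(chain[:half].mean() - chain[half:].mean()) / chain.std()

print(split_mean_gap(good_chain), split_mean_gap(bad_chain))
```

A well-mixed chain gives a gap near zero; a chain that drifted or was stuck for half the run gives a gap of order one or more. Real diagnostics (Gelman-Rubin R-hat, effective sample size) refine this idea across multiple chains.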
So I think this is the lesson for every astronomer who wants to run MCMC: you have to look at your traces. If you don't, you just don't know your problem — even if you get a value at the end, you just don't know. And I can repeat this over and over, because I've seen things. So: look at the traces. Then, in this upper graph, I made a scatter plot of the two parameters in the problem, the slope and the absorption. As you can see, the scatter plot is squished and elongated, which indicates a correlation between these parameters — it's not nicely symmetric. And at the bottom is the PDF, the probability density for the photon slope, which is what we drew earlier: take the two-dimensional distribution and marginalize onto this one axis. This distribution is very nice — it looks like a Gaussian in this case, more or less. So the question was: how do I take this information and publish it? Typically we just say: the photon index is, what, 1.25; take the 68% or 90% interval, because this is a nice Gaussian shape, and report 1.25 plus or minus 0.6, or 0.45 in this case. That's what we publish. But what you want to show is your distribution, because it may not be nice — it may look like this instead, which happens. Let me draw a few cases. One could be like this: this is your parameter. Another case could be, say, like this. So what happened here? Say this is my 1.2 and that's 2 as I draw it. Yes — it could be something wrong with the prior, or it could be that the data tell me nothing in this region: the data provide no constraint below that value.
If you see this, it may look nice and perfect, but if you show it in the paper, a reader will know that the quality of your data wasn't as good as you might think, and there could be problems: if this is the only part of the parameter space you were able to constrain, it means you need more data or a different approach. In this other case it's something different: we have a peak — I'll get to this quickly later — but the distribution behaves very nicely on one side and then has a long tail. That means your data do not constrain the high values at all; the parameter could be anywhere out there. Now, about reporting a point estimate: if you do a maximum-likelihood analysis — set aside whether you use a prior — you maximize the likelihood and report the value at that location, so this gamma will be gamma-best, and then you say: maybe this much uncertainty, or maybe higher. That's the point estimate. But if you supply the full posterior distribution, the reader gets a much better understanding of what happened in your problem. Okay, we'd better move on — I have a few more slides on how to report your analysis later. Here's the other example: we're looking at the posterior probability, and in this case we worked with one line in that problem. The line location at 2.8 has a very strong peak — very strong evidence for a line there. The posterior probability density, zoomed into that region, shows the peak, which in this case is called the mode. If there is one mode in the distribution, that's the mode, but you can have a bimodal distribution, meaning a few modes — let me draw it differently, like this.
So you can have two modes, and your data may tell you that this one is higher, and perhaps more likely, than that one — but you want to know that there are two. It's possible that one mode is unphysical, and you can say: I don't care about this one, it simply cannot happen in my problem. Maybe that's okay — we are not perfect; if we were, everything would be solved by now. It could be that your parameter range went somewhere it shouldn't have: if your gamma comes out at 10 and you know gamma cannot go above two, then maybe. But there are reasons to explore the parameter space outside what you think is reasonable. Not even temporarily — because, as we heard yesterday, discoveries happen when something unexpected happens. Yes, exactly: there are mistakes that lead to discoveries too. If by mistake you did something and suddenly saw a peak somewhere else — I don't have an example now, but maybe that's where you discover your Higgs particle or something. Not that I think this is the way to discover it, but still. Okay. A bimodal or multimodal distribution is tricky because, first of all, it's very hard to use the maximum-likelihood approach, even with standard classical methods. In a multimodal parameter space, optimization methods based on gradients — or Simplex, or even Levenberg-Marquardt, which is just a reformulation of the problem — can land in local minima, and they may have trouble searching the full parameter space.
This is why the Bayesian approach with MCMC is beneficial: in astronomy we often deal with these problems. This example is very simple, but if you have 10 parameters, in that 10-dimensional space you will definitely find places with little modes. So, that was modes. You characterize the distribution by its modes, by the mean, by the standard deviation if it looks normal. You can also look at the joint posterior distribution — here a two-dimensional one. If you have two lines instead of one, and you look at the joint posterior for the locations of the two lines, it can look like this, as in our example: there's the dotted line, and then a smoothed one where you start seeing bumps. In this two-dimensional parameter space you can see the line locations are well constrained to certain regions — it's not a simple contour or scatter shape. Now, when you think about summarizing your distributions — summary statistics for a posterior — there are three examples here. Typically we look at the right one: the probability density as the solid line, and the histogram you get from the Monte Carlo draws. That's the normal one. Then we have the highest posterior density interval... oh, those two panels are the same — that's wrong; this is not a good example, it's not as dramatic as I expected. Anyway: the highest posterior density interval is where you look for the region containing 68% of the posterior area — the most probable region. And the left one is the equal-tail interval: if this were a Gaussian, you cut equal tails so the middle contains 68%.
In this case it's the equal-tail interval: everything inside the equal tails. I think there was a question — well, that's what you normally do when you report assuming a Gaussian distribution. The same, yes. I would just plot my distribution; in astronomy we typically work with equal tails, and that's what we report, even when it's not a Gaussian. Yes — that's the region. Confidence — well, it's a credible region, a credible region, right, which contains 90% or 68% of your posterior. For the highest-density credible region, you're looking at the distribution from the top down: the highest-density, and so the smallest, range. I don't know if I have that plot for two modes — or maybe here, there you go. This is the example with the line in the spectrum, and the difference. That is the case with the posterior, yes. And this is why you don't typically report that in astronomy — although for this problem the HPD was actually the right choice, because we wanted to know where the probability for the line locations is concentrated; it separates the regions and shows where they are, while the equal-tail interval would suggest the line could be anywhere, which was not the question. Yes — it depends on the problem, on the question you pose. And plot it, right. And I can repeat for everybody what Daniela said: in the paper, it is very important to write down exactly what you did. All right, that was another illustration. Okay, let's go to model checking — just watching the time. We have the posterior, we have everything, we had the model we really liked, and we've shown everything we learned about the model from our data.
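The equal-tail versus highest-posterior-density distinction is easy to compute from MCMC draws. A sketch on a deliberately skewed sample — the gamma draws stand in for a posterior chain:

```python
import numpy as np

rng = np.random.default_rng(3)

# Skewed "posterior" draws (a Gamma sample standing in for an MCMC chain)
draws = rng.gamma(2.0, 1.0, size=20_000)

# Equal-tail 68% credible interval: cut 16% from each tail
lo, hi = np.quantile(draws, [0.16, 0.84])

def hpd(samples, mass=0.68):
    """Highest posterior density interval: the shortest contiguous
    interval containing the requested mass of the sorted draws."""
    x = np.sort(samples)
    n_in = int(np.ceil(mass * len(x)))
    widths = x[n_in - 1:] - x[:len(x) - n_in + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + n_in - 1]

hpd_lo, hpd_hi = hpd(draws)
print((lo, hi), (hpd_lo, hpd_hi))
```

For a symmetric posterior the two intervals coincide; for a skewed one like this, the HPD interval is shorter and shifts toward the mode, which is exactly why it was the better summary for the line-location problem.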
Now we have to check whether our model is actually correct and where we went wrong. There are a few ways of doing this checking in the Bayesian setting — and no, I wouldn't even go there: you cannot just talk about reduced chi-square. Well — I should be careful: if you have a problem that can be described as normal, Gaussian, then the chi-square approach is fine, it's perfect, and you have the chi-square goodness-of-fit test. But there are two things here, and there's always confusion between the statistic used as an estimator and the chi-square distribution. I think this confuses people, because somewhere along the way someone called the statistic used to fit the data "chi-square," and that propagated — but it's just the estimator; the chi-square distribution is something different. Your goodness-of-fit test for normal data is based on the chi-square distribution. A goodness-of-fit test is, roughly, a check of how well you have specified the problem and how well you can predict future data from it. So you do the check: you take the posterior and calculate your posterior predictive data, then compare this predicted data to what you observed. Does the predicted data come anywhere close? If your predicted new stars land somewhere far away, it doesn't make sense — you solved something different; the model isn't applicable. You do this through simulations; I don't think there is an analytical way. There are information criteria, which we'll talk about later, but to do the check you typically explore the posterior with simulations. One thing is that you need a test — some test statistic to validate the model somehow.
And you describe the magnitude of the discrepancy between the model and the data. We can use a p-value, which we roughly understand from classical tests, and here is the Bayesian version: you have some test statistic — some measure of the difference between the replicated (predicted) data and the observed data — and you compute the probability that replicated data drawn from the posterior predictive distribution are as extreme as what you observed. You have to have some statistic to do this, and the likelihood or a likelihood ratio can be used to define it. This touches on hypothesis testing, because when you evaluate and check a model you want to accept or reject it — so it is a kind of hypothesis test of whether our model agrees with the data. So there is the p-value, the probability that the replicated data exceed the observed data under the posterior that we just defined — that's not quite the right way of saying it; we'll have an example. Yes, I have one on the next slide — this one, calculating the ppp-value, the posterior predictive p-value. What I started to say is that when we do the test, we must define the level — the probability at which we accept or reject — a priori. You cannot, after doing the test, say "oh, I didn't mean 0.01, I meant 0.5" or something like that. Before you start the exploration, you define the level alpha. Does it have to be the same as the p-value elsewhere? No, it doesn't. Depending on your problem — and this touches on a few things — traditionally we say a p-value threshold of 0.01 can be used to reject a model,
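The ppp-value calculation on the next slide amounts to: for each posterior draw, simulate replicated data and ask how often a chosen discrepancy statistic for the replicates is at least as extreme as for the observations. A sketch — the toy Poisson data, the conjugate posterior, and the choice of total counts as the statistic are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy observed counts, and posterior draws of the Poisson rate
# (conjugate Gamma posterior from a flat-ish Gamma(1, 1) prior)
observed = rng.poisson(5.0, size=30)
theta_draws = rng.gamma(1.0 + observed.sum(),
                        1.0 / (1.0 + len(observed)), size=2000)

def T(counts):
    """Discrepancy statistic: here simply the total counts."""
    return counts.sum()

# For each posterior draw, simulate replicated data; count how often
# the replicated statistic is at least as extreme as the observed one.
exceed = 0
for theta in theta_draws:
    replicated = rng.poisson(theta, size=len(observed))
    if T(replicated) >= T(observed):
        exceed += 1

ppp = exceed / len(theta_draws)
print(ppp)
```

A ppp-value near 0 or 1 flags tension between model and data for that statistic; values in the middle mean the model reproduces this aspect of the data. The statistic T should target the feature you care about (e.g. line strength), not just totals.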
If we get a value lower than that, say 0.01, one percent, right. But depending on the way you specify the problem, and whether or not you used some information from the data, your chosen p-value level could be way too high. You may need to go into the tail of the distribution and say that 0.0001 is the level at which I have to run my test, my check. If you set the level too high a priori, you may not get the right results. And I think this is the problem with p-values which has been discussed very intensely over the last year or two; there was a lot of discussion, maybe two years ago, about p-values being completely bad: don't use them, because you get the wrong information. But we don't have anything better, right? This view came mostly from statisticians, because there were results which were accepted at some significance level and then turned out not to be correct. So if you have a discovery and your p-value is wrong, you are hurt, because the discovery is not a discovery. There you have to be very careful. If you have a result which agrees with the ballpark number everybody expects, it's probably okay. So it all depends on the application, and I think people should be very careful when they see something new. You can talk about p-values, but you have to find some other measure to confirm your discovery. And some people should be more careful, maybe, because everybody jumps on a new result. I can tell you that one day it happened to me that I discovered something by eye. I looked at my Chandra data, saw something, and went home; I woke up the next morning and realized that what I saw was a jet in my source, and it was really, really exciting.
This was one of the longest X-ray jets we know of, in PKS 1127-145, and it was really exciting. It was a discovery because I saw it by eye, but I wasn't sure, because in Chandra you have the readout streak, and the readout streak can confuse you: instead of a long jet, you may have a streak which looks like a jet. And the jet wasn't known to anybody in the radio community. So I spent all day Sunday investigating; the readout streak was in a different direction than the jet, and the jet was there. We found a PhD thesis in which a radio astronomer had published an image, and then we went to the VLA and got good VLA data to confirm it. So it was fun, and I was surprised that such discoveries can still happen in the 21st century. Yes: it's worth doing the model check visually first, to get a feel for what's going on, looking at the residuals in various different ways, but in data space, comparing the predictions your model is making with the observations you actually have. That gives you a lot of guidance. I have a comment on this, because I work with a lot of data with correlated processes, the kind of thing you can model with a Gaussian process. If you look at that data, especially as a time series on long time scales, you often see patterns in the data that are actually pure noise, where your brain fools you into thinking there are patterns when there are none. So that's the flip side of this: it's good to look at the data, but looking at the data can also fool you into thinking you've discovered something when you've discovered nothing. Right, and you can turn data space into other useful summaries of the data.
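The comment about correlated noise fooling the eye is easy to demonstrate. The sketch below (plain NumPy; the AR(1) coefficient and the run-length statistic are illustrative choices) generates two signal-free series, one white and one strongly correlated, and measures the longest stretch above the mean, the kind of "trend" the eye latches onto.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two zero-mean "pure noise" series: white noise, and a strongly correlated
# AR(1) process (a discretized OU process, as often used for light curves).
white = rng.normal(size=n)
phi = 0.98
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + np.sqrt(1.0 - phi**2) * rng.normal()

def longest_run_above_mean(y):
    """Length of the longest consecutive stretch above the sample mean,
    a crude proxy for the 'trends' the eye tends to pick out."""
    best = cur = 0
    for above in (y > y.mean()):
        cur = cur + 1 if above else 0
        best = max(best, cur)
    return best
```

On typical draws the correlated series shows much longer excursions above its mean than the white one, even though neither contains any signal; a posterior predictive check with a statistic like this guards against reading such excursions as structure.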
So if you're confused by comparing the actual light curve with your predicted light curves, maybe you see it more clearly in some transform of that light curve. But it's still data space: you're still looking at data, or summaries of data, rather than parameters. Right, and that's what you do: you generate the posterior predictive data given your model. You look at the new data which you would collect in a future experiment, what you expect, and ask whether it agrees with your current data. You simulate from the sampling distribution and then compare these simulations. So this leads to this example, in which you do two things. One is to calculate the predicted value; you always calculate the posterior predictive value based on your model, the ppp-value. So this is our distribution of the test statistic, and the ppp-value for data from your model and data from some other models. In this particular case, on the right side: if you have your distribution as in this histogram, and your observed value, marked with the line, falls somewhere within the distribution, you would say that it agrees; this would tell you that there is an agreement. (The test on the bottom is a different thing, so don't go there now.) In the left diagram, the value is in the tail of the distribution, and if I get a value like this from my model, I would say it is very unlikely that this future observation comes from that model. So now we're getting to what is actually plotted on this graph.
So, you have a question, Phil? Some of you who are interested in Bayesian inference may already have an analysis set up where you have samples from a posterior PDF that you drew with MCMC or something. A good hack would be to write some code that enables you to do this kind of analysis: for each sample in your chain, you would make a replica data set, including noise, using your generative model, then visualize that in suitable ways, then design test statistics that capture some feature of the replica data and make these distributions. So this is a histogram where each point going into the histogram is an MCMC sample from your posterior for which you've made a mock data set and evaluated the test statistic on that replica. And the line is where the same test statistic, evaluated on your observed data, fell. So a great hack would be to build the machinery that lets you do that visual check, followed by a test statistic check, to allow you to play this kind of game. Should we take a break and come back to this example in ten minutes? Yeah, sounds good; back at 11:25. Can you hear me in the back? Okay, good. All right, so we come back to this test in a few slides; we start with this one, the cartoon. I think statisticians don't like this cartoon because it's not really correct in many ways, but it's the XKCD cartoon about Bayesians and frequentists. The neutrino detector measures whether the sun has gone nova; then it rolls two dice, and if they both come up six, it lies to us, otherwise it tells the truth. So let's try: detector, has the sun gone nova? The detector says yes. So you have the frequentist approach and the Bayesian approach, and this is where the controversy comes in. The frequentist says: the probability of this result happening by chance is one in 36.
Since the p-value is less than 0.05, I conclude that the sun has exploded. The Bayesian statistician says: bet you $50 it hasn't. There are many problems with this cartoon, and it illustrates the discussion between one camp and the other, which I feel is not so relevant anymore, since we're following the Bayesian way. But when you think about it, the first thing that comes to mind is the design of the experiment, which is very important. If you designed the experiment, you know the probability of the dice roll, and you basically know the result and the answer right away. And of course you're asking a question which is, by our knowledge, impossible to answer with that experiment. So the cartoon tells you a lot about planning a scientific experiment, which is actually pretty nice: you have to think about what you want to answer and how you are going to test that answer, and that's where you set up the statistical model. Somebody posted this on a door in our hallway; it was pretty fun. By the way, what is the probability of the sun exploding tomorrow? You can calculate that in a Bayesian way: you define a model and calculate it using everything you know about the sun. So, model selection; we're going back to our model selection. The classical tests: the chi-square goodness-of-fit test, which is based on the chi-square probability distribution and the degrees of freedom; the F test; and the likelihood ratio test. These were very common and very good in the classical sense, because the probabilities were tabulated. We didn't have computers.
So you go to the table, you read off the p-value, and you know whether you accept or reject. I was always confused by those tables. When I started my research we had to use them; there was nothing else, no computer to calculate probability distributions. So I'm happy we don't have to read the tables anymore: we can run the simulations. So, Bayesian model selection. You can start from Bayes' theorem and calculate the odds ratio: for the given models, what is the probability of obtaining one or the other? And the Bayes factor is the ratio of the probabilities of the data given model one and given model two. All right, so we go back to Bayes' theorem. We have the posterior, the likelihood of the data given the model, and the prior for the model; it's written in a different way here, so it's a little confusing, but we have all this, and we have to calculate the global likelihood, the evidence. We have to integrate over all of parameter space to do this, and the calculations can be very difficult. For Bayes factors you have to integrate over parameter space and over the priors, and when you do that numerically for an n-dimensional parameter space, it can be very hard and take a long time. That's why calculating Bayes factors hasn't been the primary way of selecting a model in astronomy, I think. Does anybody have a nice way to calculate the Bayes factor, to do these integrals? Nested sampling, right; that's true, nested sampling can do that. Does it calculate the Bayes factor for you, or do you have to write your own code?
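For a one-dimensional parameter the evidence integral can still be done by brute force, which makes the structure of the Bayes factor concrete. The toy below is all illustrative (a Gaussian mean with known unit noise, a flat prior on [-5, 5]) and compares a fixed-mean model against a free-mean model; in realistic n-dimensional problems this quadrature is exactly what becomes infeasible, which is where MCMC or nested sampling comes in.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.4, 1.0, size=30)        # toy data, noise sigma known = 1

def log_like(mu, y):
    """Gaussian log-likelihood with unit variance."""
    return -0.5 * np.sum((y - mu) ** 2) - 0.5 * len(y) * np.log(2 * np.pi)

# Model 1: mean fixed at 0, no free parameters -> evidence = likelihood.
logZ1 = log_like(0.0, y)

# Model 2: mean free with a flat prior on [-5, 5]; evidence by brute-force
# quadrature over the single parameter.
mu_grid = np.linspace(-5.0, 5.0, 2001)
prior = 1.0 / 10.0                       # flat prior density on the interval
ll = np.array([log_like(m, y) for m in mu_grid])
shift = ll.max()                         # log-sum-exp trick for stability
dx = mu_grid[1] - mu_grid[0]
logZ2 = shift + np.log(np.sum(np.exp(ll - shift) * prior) * dx)

bayes_factor = float(np.exp(logZ2 - logZ1))   # B21: free-mean vs. fixed-mean
```

Note how the prior width enters the evidence directly: widening the flat prior dilutes the evidence for the free-mean model even if its best fit is unchanged, which is one reason a Bayes factor is hard to interpret as a single universal number.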
The evidence comes out when you run it. Okay, that's part of it; you have to do it for the two models. So we did a comparison for the X-ray problem between Bayes factors and p-values at some point; it hasn't been finalized, I guess. But one thing to remember is that there is no magic number. With the Bayes factor, similar to the p-value, you cannot say: if my Bayes factor is 10 or 20, then this model is better than the other. It all depends on the problem again. So model selection with the Bayes factor is very non-intuitive, just because of that number. And the normalization of the problem is hard to do properly when you integrate. In our example, our test for evidence for the line, using p-values and Bayes factors: for ppp-values of less than 0.01, our Bayes factors were still of the order of 20, which would tell you something about the evidence for the line, if I remember that correctly. They were not close to one, which is how you tend to think about the odds ratio. And if the odds are one in ten or one in a hundred, is that the right way to think about it? In your problem, what would be the odds at which you accept or reject your model? It's the same kind of question as with p-values, just phrased in a different way, because you're not looking at a distribution, you're looking at a number. I'll skip this part; it's important, but it's just calculations. And there are also information criteria which are used: so you have Bayes factors, posterior predictive p-values, and information criteria, which are based on the deviance, if I pronounce this correctly; it's sort of the deviations, right?
So: the log predictive density of the data given a point estimate of the fitted model. Here you use a point estimate, for example the maximum likelihood, versus the predictive density if you use the whole model. One criterion is the AIC, the Akaike Information Criterion. It uses the log of the likelihood, which could be based on the chi-square, for example, so it can be applied to classical model selection as well: you add corrections based on the number of model parameters and the number of data points, and you calculate that value, and then you select the model. This is an example for different models with two parameters, p and q; we calculate the AIC while changing one of the parameters, and you can see that for p of four and five these models are basically giving you the same value, so they are equivalent. In this case, we know that the models which give very large values in our calculations are not describing the problem well. And this is also an example of adding parameters: p keeps adding more and more parameters to the model, and you can see that above four or five, as you add more parameters, you don't learn more about the model. In that case you use the model with the smallest number of parameters; you don't overparametrize the model. Okay, I wanted to go back to the ppp example; there it is. We come back to this example.
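As a concrete sketch of the recipe just described: with known Gaussian noise, -2 ln L reduces to the chi-square statistic up to a constant, so the AIC is chi-square plus twice the number of parameters. The polynomial-degree example below is a toy stand-in for the p, q model family on the slide; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 1.0, 40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=len(x))   # truth: a straight line

def aic_for_poly(x, y, p, sigma=0.1):
    """AIC = 2k - 2 ln L for a degree-p polynomial fit with known Gaussian
    noise sigma, where -2 ln L reduces to chi-square plus a constant."""
    coef = np.polyfit(x, y, p)
    resid = y - np.polyval(coef, x)
    chi2 = float(np.sum((resid / sigma) ** 2))
    k = p + 1                            # number of fitted parameters
    return 2 * k + chi2

aics = {p: aic_for_poly(x, y, p) for p in range(6)}
best = min(aics, key=aics.get)           # degree with the smallest AIC
```

Adding parameters beyond what the data support lowers chi-square only slightly while the 2k penalty keeps growing, mirroring the plateau described on the slide above p of four or five.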
All right, so I just showed the Akaike Information Criterion; there is also the Bayesian Information Criterion and the Deviance Information Criterion. All of these information criteria differ in the way they integrate over the posterior and how they describe the posterior, but they all compare the posterior predictive density against a plug-in estimate such as the maximum likelihood. So I need to talk about this example, because that's the test Daniela asked me to talk about. It is an example of a hypothesis test, a model selection test, where you observe an X-ray source and ask whether there is an emission line in that source. This is the Protassov et al. paper, published in 2002, and it took us a lot of time to work on. The referee asked us to give examples of bad usage of the likelihood ratio test and its statistic for this problem, and I remember Alanna Connors coming to our room with piles of printed-out ApJ papers, and we were putting them on the pile and testing; it was an interesting experiment to see how many people had used this test in a wrong way. The paper describes all the details of the problem and the approaches to it. The main issue is that to apply the likelihood ratio test, you have to understand the distribution of that test statistic, the probability distribution, and often people just use a table or something, right? You assume normality. This graph shows the likelihood ratio test statistic distribution in the histogram: a simulation of the distribution of the statistic under the null model, the model which we fit to the data without the line. The solid line corresponds to the chi-square distribution for the problem. So when you look at this graph, let me show the other part, yes.
What you see in this graph is that the chi-square distribution is very different from your likelihood ratio test distribution. So when you do the test, you do simulations to describe the distribution; you cannot just go and use the chi-square distribution, because it's not the same. What people did was use the likelihood ratio test or the F test assuming Gaussianity, assigning the values for accepting or rejecting the model based on that chi-square distribution, which was incorrect, because the actual likelihood ratio test statistic distribution looks very different. A, B, C are different models. Yes, this is the posterior predictive procedure; these are the steps. You can do these simulations in the classical approach or the Bayesian one; it depends on how you calculate your posterior and whether you use the prior in the procedure. You fit the data with your model, which contains a line or not, so you have two models. One is the null model: there is no line (or no source, if you're looking at two-dimensional data). The other one includes the line. You look at the likelihood ratio between the two. You obtain the distribution for the parameters of the null model, and then you simulate data from that simpler model with no line: you calculate a posterior predictive sample from it. This is where your simulations generate the predicted data, so you know what the distribution would look like if there were no line. Then, at each step of the simulations, you fit both the simple model and the model with the line. So you generate the likelihood ratio test distribution, because you do the test at every step.
So you not only simulate the sample, you also calculate the ratio, and then you build the probability density shown in the bottom histogram. The line on this histogram marks the likelihood ratio calculated from the observed data; the histogram is what we simulate in this whole scheme, and the line marks where our data, our experiment, fell. It tells you how far into the distribution you are. If you go back to this distribution, we have two cases here. In one test, our ppp-value based on this ratio was 0.018, somewhere in the tail, and in the other case it was within the distribution. These are the results of the simulations I just described and our test statistics, for one line and for the second line; they are shown here. So you might reject the simple model, the model with no line, for one of them, depending on the p-value level you chose, and for the other one you would accept the null model, the simple model. We have these steps written out in the handbook for X-ray astronomy: Keith Arnaud, Randall Smith, and myself published a book a few years ago, the Handbook of X-ray Astronomy. It has everything you want to know about X-ray techniques; the detectors, the calibration, and the statistics related to X-rays are all in that book, and that flow chart is there. All right, we go back to the examples; I need to wrap up quickly, but I have a few examples if people want to explore them later in the afternoon or tomorrow. You can go and look at the example of using the MCMC chains in Sherpa, the Python package for modeling and fitting, which includes MCMC.
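The calibration-by-simulation scheme just described can be sketched end to end for a toy Poisson spectrum. Everything here is illustrative: 20 flat-continuum bins, a possible emission line in one known bin, and closed-form maximum-likelihood fits standing in for a real spectral fitting engine.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "spectrum": Poisson counts in 20 bins, flat continuum, and we ask
# whether there is an emission line in one known bin.
nbins, line_bin = 20, 10
counts_obs = rng.poisson(5.0, size=nbins)

def loglike_null(counts):
    """Max log-likelihood of the null (flat continuum); MLE rate = mean."""
    mu = counts.mean()
    return float(np.sum(counts * np.log(mu) - mu))

def loglike_alt(counts):
    """Max log-likelihood of continuum + non-negative line in line_bin."""
    others = np.delete(counts, line_bin)
    c = others.mean()                    # continuum MLE from the other bins
    if counts[line_bin] <= c:            # line strength pinned at zero
        return loglike_null(counts)
    ll = float(np.sum(others * np.log(c) - c))
    ll += counts[line_bin] * np.log(counts[line_bin]) - counts[line_bin]
    return ll

def lrt(counts):
    return 2.0 * (loglike_alt(counts) - loglike_null(counts))

# Calibrate the statistic by simulating from the fitted null model and
# refitting both models to every fake data set.
mu_hat = counts_obs.mean()
lrt_sim = np.array([lrt(rng.poisson(mu_hat, size=nbins)) for _ in range(2000)])
ppp = float(np.mean(lrt_sim >= lrt(counts_obs)))
```

Because the line strength sits on the boundary of the parameter space (it cannot be negative), the simulated statistic is not chi-square distributed and much of its mass sits at zero, which is exactly why the reference distribution has to be simulated rather than read from a table.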
And there is a notebook, which I just pointed to, written by Doug Burke, one of our colleagues at Chandra. It runs through a two-dimensional analysis of simulated data and shows you how to do it, going on to explore the parameter space and the traces. This is the trace, and it looks pretty bad in that example; tomorrow you will learn how to characterize these traces and how to formally calculate the numbers which tell you whether or not there is something wrong with your trace. So you can try this. The histogram shows that there are a few modes, and you look at the full posterior and then graphs of the different parameters; there are many parameters in that problem, as you can see. I will put the links on our web pages so you will be able to access them. The other example is LIRA; oh, I think I crashed this. This is Bayesian low-count reconstruction of two-dimensional data. It assumes Poisson data, and it is an algorithm which has been developed slowly over the last ten years or so. The first algorithm was published in 2004 by Esch, one of our students within the collaboration. The code was put up two or three years ago, but it doesn't have a license, so I'm going to put a license on that page while I'm here. The code is in R, but the post-processing tools are in Python. What the algorithm does is include the point spread function and work on your image: you define your model for what's in the image, the baseline or null model, with, let's say, one source, the background, and the PSF. At the end you get MCMC images sampled from the resulting posterior, and you can look at those images and infer the possibility of extended structures there. And we used it recently.
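On characterizing traces formally: one standard number is the Gelman-Rubin R-hat, which compares within-chain and between-chain variance across several independent chains. Below is a minimal sketch with synthetic chains; packages such as ArviZ provide more careful split-chain versions of this diagnostic.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for a list of equal-length 1-D chains.
    Values close to 1 suggest the chains sampled the same distribution;
    much larger values flag poor mixing or chains stuck in separate modes."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * means.var(ddof=1)               # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_hat / W))

# Illustration: four well-mixed chains versus chains stuck in two modes.
rng = np.random.default_rng(5)
good = [rng.normal(0.0, 1.0, 1000) for _ in range(4)]
bad = [rng.normal(mu, 1.0, 1000) for mu in (0.0, 0.0, 5.0, 5.0)]
```

Run on the `good` chains, R-hat comes out near 1; the `bad` chains, which sit in two separate modes like the multimodal posterior in the notebook, give a value far above 1.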
There are still problems, statistical problems, related to interpreting the results of the MCMC simulations and detecting a structure which you suddenly find in the MCMC images. This is what we're working on right now: trying to understand whether what you see by eye in an MCMC result is actually a detection of some extra emission in the data. I had a picture, but it disappeared when my computer crashed. I can go into details if people are interested in image analysis and detecting structures in images, diffuse emission, jets, extra point sources; we can talk about it. And then there's the example of modeling a time delay. I guess we don't have a code to share right now, but this is an example of a hierarchical model, so maybe at one of these breakout sessions we can talk about the hierarchical model which we built for this time delay problem. It was Hyungsuk Tak, a statistics grad student, who was very excited when the time delay challenge paper for LSST came out. I showed him the paper, and I think it was Phil who sent us the first description of the paper and the data. Tak was jumping up and down very happily and said he was going to do it. That was about two years ago, and working through that problem basically became his PhD thesis. So: you see two light curves from a lensed quasar; one is one image and the other is the other image, and what you want to find is the time delay between these images. In a very simplistic approximation, you just assume some model which describes the light curves (we assumed an Ornstein-Uhlenbeck process), and then you look for the shift along your time axis and the shift in magnitude; these are the parameters of the model. So when you build the hierarchy, you have light curves which you model as an OU process, and the parameters of this OU process are not really interesting to you.
You just describe the variability, and what you really want to do is find the shift. So this is the hierarchical structure, where you start building your levels of parametrization for the hierarchy of the problem. We're not going into details here, but we can talk about it if you would like to discuss either of these problems while I'm here, and you can ask any questions you have. If you have a general question right now, you can ask; we have two minutes before lunch. If there are no questions, I think we are done. First, let's thank Aneta. Thank you. I don't have a summary slide; I could have one, but I don't. Any quick questions? Yes, this was three hours, right? I've been doing this for 20 years.