Very good. So welcome back everyone. Good morning, good evening, whatever the time is in your part of the world. I hope you all got the email from the organizers about the program, the schedule of the lectures. This week we will still be doing introductory material on simple probability distributions, but hopefully it will shed some light on where the whole emphasis of Bayesian inference goes. Before we get going, if there are any questions about yesterday's lecture, this is a good time. Okay, so someone didn't receive the email; it's worth taking that up with the ICTP organizers. Several people didn't receive it, apparently. Matteo, please make a note. Ah, it is on the website, not sent by email, so check the website. In any case, the program gives you a better idea of where we are going, and it also contains a list of possible references. Now, the definition of the Bayes factor is something we are still interested in, so let me share my screen and get the whiteboard back up. Bayes factor. The situation is that we want to compare two models. Model one: what do I mean by a model? It is really a set of variables X, Y1, Y2, and so on, together with a joint probability distribution over them. X, by definition, is the observable; the others are unobserved. The joint probability distribution specifies the relationship between all the variables in the model. So that is model one, M1. Then we have another model, M2, which crucially needs to share the observables: the X must be the same. But M2 might have a different set of unobserved variables, maybe more of them. Let's call the joint distribution P1 for model one and P2 for model two. That is the setup.
We want to decide, based on the data, on the observables X, whether model one is better than model two. How do we do that? [Student: when you write, can you leave the screen up a little longer? When you change slides you lose the last lines, the ones you write just before moving on.] Yes, sorry about that; I'll pause at the end of every slide from now on. So, again, the setup: to me a model is a collection of variables together with a joint probability over all of them, and the key thing is that the two models I want to compare must refer to the same set of observations. I want to say which model gives me a better description of those observations. M1 and M2 are entirely analogous, except that for clarity I gave different names to the unobserved variables of M2; of course I could have called them whatever I wanted. Now, the question is how to assess the fit of each of these models to the data. Within the model-one framework, I can compute the probability, under the model-one assumptions, of the observable. By definition this is the integral, or the sum if we have discrete variables, over all possible values of every other variable in model one, of the joint probability law implied by model one. What does this tell me? How probable the data X that I have observed are under the framework of model one. For model two I do exactly the same thing: I marginalize the unobserved variables, which I called Z_l. So for each of the two models,
I can use the joint probability to estimate the fit of the model by marginalizing out the unobserved variables. Notice that when I do this marginalization I am essentially averaging out the values of the Y variables, or the Z variables in the M2 case. I am asking: on average with respect to Y, what is the probability of seeing the X that I have seen? The two quantities are the evidence of the data for model one and the evidence of the data for model two, and the Bayes factor is the ratio of the two evidences, P1 over P2, assuming I have no reason to believe a priori that model one is more likely than model two. If for whatever reason I think the two models have different prior probabilities, then I have to correct this ratio by multiplying by the ratio of the prior model probabilities: if I believe model one is more likely before seeing any data, I skew the support of the data a bit towards model one. Notice that P2 could also be written as P(X | M2), because I have defined it as: given the assumptions of model two, what is the probability of X, obtained by marginalizing the probability law of model two. Equally, P1 is nothing but P(X | M1). From this you can see that the ratio, with the prior correction, is equivalent to the ratio of the posterior probabilities of the two models given X, because numerator and denominator are each proportional to the corresponding posterior, with the same proportionality constant. So it is a posterior odds ratio. Is the concept of the Bayes factor clearer now? Yes? Very good, excellent. Any more questions about yesterday? [Student: Professor, here the first step is a marginalization? — Yes, sorry for my earlier slip of the tongue. Student: And then we compare our value of P1(X) with our empirical data...]
[Student continues:] And then we can say that model one is better than model two if it matches the empirical data more closely? To some extent, yes: the probability of an observation under a model is a measure of how well the model fits the data. If the model were deterministic, with one fixed output, it would either fit the data exactly, perfect agreement, or not at all, probability zero. But stochastic, probabilistic models are not deterministic: they give a range of outputs with probabilities, and the higher the probability of what you actually observe, the better the fit of the model to the data. [Student: Thank you. Another student: Excuse me, Professor. Does it make sense to compute the Bayes factor if, for example, model two is a generalized version of model one?] Yes: the Bayes factor makes sense even for nested models, because the normalization property ensures that you are always comparing quantities on the same footing. And there was a question from Jacob in the chat: how do I assign a prior probability to a model? Here you end up in the domain of subjective Bayesianism, so someone needs to tell you that they have good reasons to believe one model is more likely than the other. If you are doing an application of data science to biology, for example, which is what I mostly do for a living, I would have a conversation with the biologists, and they might say: look, there is this body of empirical evidence that would support such-and-such. Empirical evidence is rarely conclusive in frontier research, but they would have an idea. It might be subjective; people have opinions in science as well.
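The evidence-and-Bayes-factor computation described above can be sketched numerically. The following toy example is my own illustration, not from the lecture: the data are 7 heads in 10 coin flips, model one is a fair coin (no unobserved variables), and model two has an unobserved bias z with a uniform prior over three values, which gets marginalized out exactly as in the evidence formula.

```python
# Toy Bayes factor: evidence = marginal probability of the data under each model.
from math import comb

def binom_lik(x, n, p):
    """Likelihood of x heads in n flips given coin bias p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

x, n = 7, 10  # observed data: 7 heads in 10 flips

# Evidence for model 1: no latent variables, so it is just the likelihood.
p1 = binom_lik(x, n, 0.5)

# Evidence for model 2: marginalize (sum) over the unobserved bias z,
# weighting each value by its (uniform) prior probability.
biases = [0.3, 0.5, 0.7]
p2 = sum(binom_lik(x, n, z) * (1 / len(biases)) for z in biases)

bayes_factor = p2 / p1  # > 1 means the data favor model 2
print(p1, p2, bayes_factor)
```

With these particular numbers the data mildly favor the more flexible model; multiplying by a ratio of model priors, as discussed above, would shift this further.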
And to some extent, the issue of the prior over models is only really relevant when you don't have much data. When you have a lot of data, the data speaks for itself and tends to dominate the Bayes factor computation, which then essentially becomes a likelihood-ratio test. Good. If there are no more questions on the Bayes factor, let's move on to today's topic. What we discussed yesterday were general properties of probabilities: the sum and marginalization rules and Bayes' theorem apply to probabilities as measures over spaces of events. But when we want to do calculations, which is what we must do if we want to build models, we have to make choices about the functional form of these probabilities. For this there is a set of useful probability distributions that come up time and again, and I think it is worth listing some of them: the distributions we work with every day and love. Today I'll introduce a few of them: the Poisson, the gamma, and, most important of all, the Gaussian. I'll also refresh your memory on something you may or may not know, but which is fairly fundamental: the change of variables rule for probability distributions, which is not quite the same as for standard functions, because distributions are not functions, they are measures. And if we get a little bit of time, I'll mention very briefly the concepts of conjugacy and the exponential family. Most of the Bayesian calculations with these classes of probability distributions I'll do tomorrow and on Monday next week. Today we'll focus on the Gaussian in one dimension, so we'll only do one-dimensional probability distributions. [A student says something still isn't clear.]
Can you tell me what is still unclear? Okay, so there is a little bit of uncertainty about what is clear and what is unclear; I don't know whether it is the handwriting or the Bayes factor itself. If you have further trouble with the Bayes factor, do check some of the references. [Student: the change of variables.] Change of variables? That doesn't matter — you'll learn everything as I talk about it now. And there are references on the website, I think, below the list of topics for the lectures. Obviously, the simplest kind of distribution is when you have only a finite number of possible outcomes, a discrete distribution. The event space is a finite set {x_i}, i = 1, ..., N, and the probability distribution is essentially just a collection of numbers p_1, ..., p_N such that p_i ≥ 0 for every i and the sum over i from 1 to N of the p_i equals one. So this is the simplest possible distribution: you can define it however you want, but it is just a collection of numbers summing to one, a point in the simplex. Things become a little more interesting when we have an infinite number of outcomes, and the first thing we meet is the Poisson distribution. The event space is the natural numbers together with zero, and now obviously we cannot enumerate an infinite number of probabilities, so we have to give a formula. The Poisson has one parameter, mu, and the probability of observing k events from a Poisson distribution with parameter mu is P(k | mu) = mu^k e^{-mu} / k!. This parameter mu is called the rate. I see there are lots and lots of questions in the chat.
So k is the number of events whose probability we are computing, and mu is the rate parameter of the Poisson distribution: the Poisson has one parameter, its rate. The rate has a very important interpretation: it is also the mean of the distribution. We know that for a random variable we can compute expectations, so we can ask: what is the expectation, under P(k | mu), of the random variable K itself? How do we compute expectations? We take the function whose expectation we want, evaluate it at each possible value of the random variable, and average, weighting by the probabilities. In other words, we take the sum over k from 0 to infinity of k · P(k | mu). Using the formula from before, P(k | mu) = mu^k e^{-mu} / k!, and since the k = 0 term vanishes, we can write this as the sum from k = 1 to infinity of k · mu^k e^{-mu} / k!. Now I rewrite mu^k as mu · mu^{k-1}, and I cancel the factor of k against the k! to leave (k-1)! in the denominator. So, taking the mu out, this equals mu times the sum over k from 1 to infinity of mu^{k-1} e^{-mu} / (k-1)!. Finally, instead of summing k from 1 to infinity with k-1 appearing everywhere, I relabel: let L = k - 1.
Then the expectation equals mu times the sum over L from 0 to infinity of mu^L e^{-mu} / L!. But that sum is just the Poisson distribution itself, summed over all possible events, so it equals one. And so I have proved that the mean of the Poisson distribution is mu, the rate parameter. I saw a number of questions in the chat, so this is probably a good moment to stop. Does someone want to unmute and ask? Let me stop sharing the screen for a second so I can actually see the questions. [Student: how does that sum become one?] Let me share the screen again. The summation is one because it is the Poisson distribution itself: we defined it like this, and it is properly normalized. The normalization follows from the series definition of the exponential: if you take the infinite sum of x^k / k! you get e^x, so the sum over L of mu^L e^{-mu} / L! is e^{mu} e^{-mu} = 1. Therefore, when I renamed the variable k - 1 as L, I got something that sums to one. There are a few more questions in the chat; let me take them in order. [Student: in the expectation, was there something in the numerator for normalization?] No, it is automatically normalized; if you wish, the factor that makes it normalized is the e^{-mu}, again because the sum from k = 0 of x^k / k! is e^x. Now, there were a couple of questions about where we would use the Poisson in practice. There are a number of processes where the Poisson distribution is used in practice.
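Before the examples, the derivation just given can be checked numerically: summing the pmf verifies the normalization discussed in the question, and summing k times the pmf recovers the rate. This is a small sketch of my own, truncating the infinite sum where the tail is negligible.

```python
# Numerical check: a Poisson with rate mu is normalized and has mean mu.
from math import exp, factorial

def poisson_pmf(k, mu):
    """P(k | mu) = mu^k e^{-mu} / k!"""
    return mu**k * exp(-mu) / factorial(k)

mu = 4.0
kmax = 100  # far into the tail for mu = 4, so truncation error is negligible
total = sum(poisson_pmf(k, mu) for k in range(kmax))       # should be ~1
mean = sum(k * poisson_pmf(k, mu) for k in range(kmax))    # should be ~mu
print(total, mean)
```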
In physics, for example, the Poisson is a distribution generally associated with rare events. If you have a detector of some radiation that counts how many particles arrive, people would typically use a Poisson distribution to model that count. In computer science, if you have a server that receives queries, people use something called a Poisson process, which just means that the number of queries the server receives in a certain amount of time is modeled as a Poisson distribution, with a rate that depends on the length of that time interval. It is, obviously, a random process, but typically the Poisson distribution is useful in situations where the counts are small, close to zero. Let me leave you another result as an exercise. We have proved that the mean of the Poisson distribution is mu; as an exercise, prove that the variance of the Poisson distribution is also mu. That is, show Var(K | mu) = E[K² | mu] − (E[K | mu])² = mu. How do you prove it? It is very similar to what we just did, except you have to go one step further back: you have a k², and you write it as k(k−1) + k; the k(k−1) term then cancels two factors of the factorial and you get a summation you can relabel exactly as before. Why is this important? Because it tells us what a typical Poisson distribution looks like. What does P(k | mu) look like? Well, suppose we have, for example — sorry, not k equals one — mu equals one.
Since it is a distribution over the integers, it is best plotted as a histogram. With mu = 1 you typically get something that decays very quickly. With mu = 10, just to give you an idea, you get something that is very unlikely to be zero, then increases, and starts declining at around 10; it is not exactly peaked at 10, but roughly there. If I take mu = 100, what I get looks essentially like a Gaussian. Why is that? Because the variance is equal to mu itself, and the square root of the variance, the standard deviation, is a measure of how far off the random variable typically is. So the greater mu is, the smaller the standard deviation becomes relative to the mean, and the relative error gets smaller and smaller. At mu = 10, a random Poisson number would typically land somewhere between, say, 6 and 14; at around mu = 100, I would expect it somewhere between, say, 85 and 115. So the relative error there is about 15%, while at mu = 10 it is something like 40%. And that is why the Poisson is not just a good distribution for random events — of course it is a distribution, so it describes random events — but specifically a good distribution for rare events: it only really makes sense to use it when the numbers involved are relatively small. There is a stream of questions in the chat. [Student: why does the range only include the natural numbers and zero?] That is simply how it is defined: it is a distribution over the non-negative integers.
I mean, you could, I suppose, extend the factorial to real numbers via the gamma function, but in practice the Poisson is always used for integer counts. In fact, one important property of the Poisson, which I will not prove but which is a consequence of an important theorem we'll mention in a moment, is that when mu becomes large, it tends to a Gaussian distribution. That is another reason people don't use it for large numbers: you might as well use a Gaussian. It is a consequence of the central limit theorem, which I imagine many of you have seen in one form or another. You see, data science, like many practical applications of science, is somewhere between a science and an art. Precisely when you should use a Poisson distribution is a difficult question. Generally, one good thing to do is to look at your data and see whether the distribution is strongly skewed. If it is symmetrical and the mean is fairly high, you might as well use a Gaussian rather than a Poisson. I'm not sure what "the key here" means as a question. [Student: I have a question. When we were talking about Bayes' rule, you said that as the number of observations grows, the prior probability matters less...] I think you are at risk of a confusion here, because k is not the number of observations; it is the value of a single observation of one random variable. If someone comes along and says the value from the counter is 1000, that is one Poisson observation, not 1000 observations. But there is a grain of truth in what you are saying: in many cases where we think about Poissons, we are really thinking of independent random arrivals.
And then the situation does become somewhat closer to having many observations, which is also why it becomes Gaussian. [Student: Thanks. Thank you. Another student: Professor, I've got one question. Since the Poisson distribution is for rare events, I just want to check whether I'm right or wrong. Astronomical data are most of the time big data, data with large values. Does that mean it is less likely to see Poisson distributions for astronomical events, whereas for rare events such as particle collisions it is more likely?] It depends on what type of astronomical experiment you are doing. It's true that astronomical data can be very big, and it is big because telescopes take many pictures of the universe: the data are very high dimensional and need to be compressed. But if, for example, you want to estimate the luminosity of a star, to get an idea of how far away it is, then you need to count the number of photons arriving at your receptor from that star, and that number is not big. If the star is our own sun, of course, that number is huge; but the typical application in astronomy is detectors counting photons, or some other form of radiation, from a very weak source far away, and the number of actual detections is often modeled as a Poisson-distributed variable. [Student: And in particle collisions, since those are rare events, it's more likely to see a Poisson distribution in such phenomena, right?] Yes: imagine you want to count the number of particles of a certain type coming out of collisions; that could well be modeled as a Poisson-distributed number. [Student: Thank you.] You're welcome.
So, that was the first of the three probability distributions I wanted to talk about, and perhaps the least important of the three. The other two are the gamma distribution and the Gaussian distribution. About the gamma distribution I'm going to say almost nothing for now. The event space, the set of values a gamma-distributed random variable can take, is x ≥ 0: the non-negative reals. The gamma is defined over a continuous space, so you cannot assign one probability to each event, because there is an uncountable set of events; you have to define a density. The gamma density depends on two parameters, k and theta, and it can be written in a couple of equivalent ways; in this parameterization it is p(x | k, theta) = x^{k−1} e^{−x/theta} / (Γ(k) θ^k). What does it look like? For k greater than one (k must in any case be greater than zero), it rises from zero, reaches a peak, and then decays exponentially; with x on the horizontal axis and p(x) on the vertical, it is a skewed bump. [From the chat: good examples? Yes — a Poisson distribution can also be used to model processes in cell biology, for example the production of RNAs from genes transcribed at a constant rate. And yes, there should be a k − 1 in the exponent of x; I have put the k − 1 there. Thanks.]
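As a quick sanity check of the density as written (my own aside, not from the lecture): it should integrate to one over the non-negative reals. A crude rectangle-rule integration on a fine grid confirms this; the tail beyond the grid is negligible for these parameter values.

```python
# Check that the gamma density x^{k-1} e^{-x/theta} / (Gamma(k) theta^k)
# integrates to ~1 over [0, infinity).
from math import exp, gamma

def gamma_pdf(x, k, theta):
    return x**(k - 1) * exp(-x / theta) / (gamma(k) * theta**k)

k, theta = 3.0, 2.0   # mean = k*theta = 6, so mass beyond x = 100 is negligible
h = 0.001
total = h * sum(gamma_pdf(i * h, k, theta) for i in range(1, 100_000))
print(total)  # close to 1
```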
The reason I am introducing the gamma distribution is that it plays an important role in Bayesian inference in connection with the Gaussian distribution, which is what I'll move to in the last ten minutes of the lecture. The Gaussian is the most famous of all distributions. The event space is the whole real line, and the probability density depends on two parameters, say m and sigma²: p(x | m, σ²) = (1 / √(2πσ²)) exp(−(x − m)² / (2σ²)), where the prefactor is the normalization constant. This distribution has the standard bell-shaped curve: its maximum, its mode, coincides with its mean m, and it has very light tails, because they decay as the exponential of the square of the distance from the mean. "Light tails" means there is very little probability mass far away from the mean. The parameter sigma² is the variance. As a further exercise for today, you can show that the expectation of X under p is m — which is almost trivial, because the density is symmetric about m, so you can just change variables — and that the expectation of X² is σ² + m², so that the variance of the Gaussian is indeed σ². The final thing I would like to show you today, before we have to end in about five minutes, is the change of variables. Suppose you have a random variable X with its own distribution, and you apply a deterministic function f to it: Y = f(X). The question is: what is p(Y)? Obviously, if X is random then Y is also random; but if X has a certain distribution, what is the distribution of Y? Suppose the function looks like the one on the whiteboard; it has to be a monotone function.
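Before the change of variables, a quick numerical confirmation of the exercise just stated (my own sketch): integrating x·p(x) and x²·p(x) against the Gaussian density recovers E[X] = m and E[X²] − m² = σ².

```python
# Check the Gaussian moments: E[X] = m, Var[X] = E[X^2] - E[X]^2 = sigma^2.
from math import exp, pi, sqrt

def gauss_pdf(x, m, s2):
    return exp(-0.5 * (x - m)**2 / s2) / sqrt(2 * pi * s2)

m, s2 = 1.0, 4.0
h = 0.001
# Grid m +/- 40, i.e. 20 standard deviations: tail mass is negligible.
xs = [m + (i - 40_000) * h for i in range(80_001)]
mean = h * sum(x * gauss_pdf(x, m, s2) for x in xs)
second = h * sum(x * x * gauss_pdf(x, m, s2) for x in xs)
print(mean, second - mean**2)  # close to m = 1 and sigma^2 = 4
```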
What is the probability of obtaining a value of Y in a given region? Let's think in terms of X. Here, say, I have very low probability of sampling an X. If I sample X's in a region where f is steep, they map to Y values that are fairly well spread out: there is quite a bit of distance between the various Y values. If I sample X's in a region where f is nearly flat, they map to Y values that become very compressed. So what is going on? The density of Y at a point y = f(x) is obtained by going back to the density of X at x = f⁻¹(y), the probability of sampling an X there, and then stretching or compressing it according to how steep f is. If f is nearly flat there, if the derivative of f at f⁻¹(y) is close to zero, I get a high density in Y, because I am compressing a lot; if the derivative of f at f⁻¹(y) is large, I am stretching things out. So I have to divide by f′ evaluated at f⁻¹(y): the rule is p_Y(y) = p_X(f⁻¹(y)) · |f′(f⁻¹(y))|⁻¹. This is the change of variables rule for probability densities: you do not just change the argument from x to y, you also multiply by the inverse of the derivative. That is about all I wanted to tell you today. This rule will be very useful when I introduce the multivariate Gaussian on Thursday, and when I explain the relationship between the Gaussian and the gamma probability distributions. [From the chat: why does f need to be monotonic?] It needs to be monotonic for this proof — more precisely, for the function to be invertible. If you want a non-monotonic function, you have to break it into monotonic pieces and add the contributions.
Yes — if you have a non-monotonic rather than a bijective function, you will potentially have disjoint subsets of the X space that map to the same position in the Y space, so the probability mass at a given position in Y will come from different parts of the X space. That is why the proof becomes a little more fiddly in the non-monotonic case; it can be done, but the formula changes a bit. [Student: what is the relation between f and f′?] f′ is simply the derivative of f with respect to x. You can find rigorous proofs of the change of variables formula on Wikipedia or in any real analysis book; what I gave here is an intuitive proof. The intuition is: if I sample values from the random variable X and transform them with the function f, how dense will they be in Y space? I think there may be one more question. [Student: Excuse me, can I ask a question? Maybe it's too soon, but since you introduced these three probability distributions: to be more specific about our research, if I work with biological systems such as protein-protein interactions, which one is the most suitable?] Often in biological systems you have discrete random variables, and they tend to involve low numbers, so the Poisson is very useful in biological systems. But of course the Gaussian and the gamma are also very relevant. And I see in the chat a request to comment on the relationships between the various distributions — I will, but not now; that is part of tomorrow's lecture, so you'll have to wait. Okay, thank you. Good.
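The intuitive rule above can also be tested empirically. The following sketch is my own illustration: take X ~ N(0,1) and the monotone map Y = exp(X), for which the rule gives p_Y(y) = p_X(ln y)/y (since the derivative of the inverse map ln y is 1/y), and check that the probability mass this density assigns to an interval matches the fraction of transformed samples landing there.

```python
# Monte Carlo check of the change of variables rule for Y = exp(X), X ~ N(0,1).
import random
from math import exp, log, pi, sqrt

def p_x(x):
    """Standard normal density."""
    return exp(-0.5 * x * x) / sqrt(2 * pi)

def p_y(y):
    """Density of Y = exp(X) via the rule: p_X(f^-1(y)) * |d/dy f^-1(y)|."""
    return p_x(log(y)) / y

# P(1 < Y < 2) by integrating p_y numerically (midpoint rule).
h = 0.0001
mass = h * sum(p_y(1 + (i + 0.5) * h) for i in range(10_000))

# The same probability estimated from transformed samples.
random.seed(0)
n = 200_000
hits = sum(1 for _ in range(n) if 1 < exp(random.gauss(0, 1)) < 2)
print(mass, hits / n)  # the two estimates should agree closely
```

This also matches the direct calculation P(0 < X < ln 2) under the original density, which is the point of the rule.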
So, if there are no more questions, I think this is a good point to stop, and we'll catch up on Thursday — you can ask questions about today's lecture on Thursday as well. [Student: Yes, one more. In the very last equation you wrote, you effectively imposed that the differential probability mass along y equals the differential mass along x, right? That p_Y(y) dy = p_X(x) dx?] Yes, essentially yes — that is exactly the content of the rule. Okay, thank you very much. So we are done for today, and tomorrow two new courses start with their first lectures. I wish you a nice afternoon, evening, or morning, and see you tomorrow.