So it's a pleasure to introduce Guido for this course on probabilistic modeling and Bayesian inference. Guido, it's up to you.

Excellent, thank you very much, Matteo. It's a great pleasure to be able to teach in this Spring College, and I'll start sharing the screen. On the format: if you have questions, do put them in the chat. I know it's a little difficult sometimes, because when I'm sharing the screen I may not see whether there is something in the chat, but as Matteo was suggesting, every 10 minutes I'll take a mini break and open the floor to questions in the chat. Hopefully that will be a good way to keep this as interactive as it can be, though of course not as interactive as if you were here in person. So I'll share my screen, share a whiteboard page, and minimize Matteo so that he doesn't cover up most of my whiteboard.

Excellent. So, probabilistic modeling and Bayesian inference: what is this all about? Well, this is a Spring College in complex systems, and you are probably very well aware that most complex systems are not fully reproducible; they exhibit some stochasticity. What do we mean by stochasticity? I mean that if we repeat the same experiment, in general we get a different result: the system is not deterministic. This might be for a variety of reasons. It might be because there is extreme sensitivity to initial conditions, as in a chaotic dynamical system, or it can be because there is intrinsic stochasticity, say in quantum mechanics, but also in systems of chemical kinetics with very low molecule numbers. If you don't get the same result every time you do the same experiment, the obvious strategy is to model the variability instead of modeling only the outcome directly, and what we'll be doing is formulating all our models in terms of probability distributions. Now, I assume that you have been exposed to at least a little of the foundations of probability calculus.
There are very different levels of mathematical rigor at which one can teach this; I'll take a somewhat practical but reasonably rigorous one. When we talk about probabilistic modeling, what we mean is that the outcome of the experiment, the variable that we eventually measure, is going to be a random variable, and we are going to model its probability distribution. The probability distribution is essentially what gives us a weight for each possible outcome. So a probability distribution must be non-negative for every possible outcome, and these outcomes, this index, might even form a non-denumerable index set if we're thinking about continuous random variables; and the sum of the probabilities over all possible events must be equal to one. So it defines a measure over the set of possible values taken by the random variable X. Let me write this out, just to give you an idea of the type of notational shortcut that I will employ regularly: the probability that the random variable X takes the value x_i, written p(X = x_i), must always be greater than or equal to zero, and it is possible to have events that never happen. I will systematically shorten p(X = x_i) to p(x_i). The thing that we will primarily be interested in is the situation where we have more than one random variable, and the simplest case is when we have two: a random variable X and a random variable Y. We will define a joint probability distribution over the pairs, and by this I mean p(x_i, y_i), the probability that X is equal to x_i and Y is equal to y_i. This is the joint probability. Sorry, sometimes I have difficulties with my pen; that's why I suddenly stop.
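As a tiny numerical aside (all numbers here are hypothetical), the two defining properties just stated, non-negative weights that sum to one, can be checked directly for a discrete distribution, and the same check applies to a joint distribution over two variables:

```python
import numpy as np

# A discrete random variable X with four possible outcomes and a
# hypothetical probability distribution over them.
p = np.array([0.1, 0.4, 0.2, 0.3])

assert (p >= 0).all()              # p(x_i) >= 0 for every outcome
assert np.isclose(p.sum(), 1.0)    # probabilities sum to one

# The same holds for a joint distribution p(X, Y), here a 4x2 table
# built, for illustration, from independent X and Y.
p_xy = np.outer(p, [0.5, 0.5])
assert (p_xy >= 0).all()
assert np.isclose(p_xy.sum(), 1.0)
```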
Another question that we may ask when we have more than one random variable is about the relevance of one random variable to the other: we may want to ask, what is the probability of x_i given y_i? If I know that Y has taken a particular value, how does this affect my belief about the variable X? This is called the conditional probability. The final ingredient among the basic things we need to know about probabilities is: what is the probability of X taking the value x_i, regardless of what Y is? This is called the marginal probability, and the reason for the name is rather neat. In the old days a lot of these probabilities were simply tables, where you would list the possible values of X and the possible values of Y, and in the last column, on the margins of the table, you would have p(Y): here would be the probability of Y taking its first value, regardless of the three values that X takes. So it's called the marginal probability because it was written at the margins, and it was obtained by summing the rows or the columns of the table. The final fundamental ingredient is the celebrated Bayes' theorem. Now, the Reverend Bayes was a student at my other institution: I'm a professor at SISSA, but I'm also a professor at the University of Edinburgh, where Bayes studied theology almost 300 years ago. He observed that a trivial consequence of the fundamental laws of probability allows you to say something really nontrivial.
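To make the table picture concrete, here is a small sketch in Python (the numbers are hypothetical): a joint distribution stored as a table, with the marginals obtained by summing across the table exactly as one would write them at its margins, and the conditional obtained by dividing the joint by the marginal:

```python
import numpy as np

# A joint distribution p(X, Y) as a table (hypothetical numbers):
# rows are the three values of X, columns the two values of Y.
p_xy = np.array([[0.10, 0.20],
                 [0.05, 0.25],
                 [0.15, 0.25]])

# Marginals: sum out the other variable; these are exactly the numbers
# one would write "at the margins" of the table.
p_x = p_xy.sum(axis=1)   # p(x_i) = sum_j p(x_i, y_j)
p_y = p_xy.sum(axis=0)   # p(y_j) = sum_i p(x_i, y_j)

# Conditional: p(x_i | y_j) = p(x_i, y_j) / p(y_j)
p_x_given_y = p_xy / p_y

# Bayes' theorem: p(y_j | x_i) = p(x_i | y_j) p(y_j) / p(x_i),
# which must agree with the direct definition p(x_i, y_j) / p(x_i).
p_y_given_x = p_x_given_y * p_y / p_x[:, None]
assert np.allclose(p_y_given_x, p_xy / p_x[:, None])
```

Each column of `p_x_given_y` sums to one, as a conditional distribution must.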
So the first law of probability we've already seen: it's the normalization property. The second law is a relationship between the marginal and the joint: p(x_i) is equal to the sum over all possible values y_j of p(x_i, y_j); that's the marginalization property. The third fundamental property is that the joint probability p(x_i, y_i) is equal to p(x_i | y_i) times p(y_i). So if I want to know the probability of observing x_i and y_i simultaneously, I can think of it like this: first let's observe Y and see y_i; what is the probability of this y_i? Then, given that I've observed y_i, what is the probability of x_i? Now, this is obviously symmetrical: it doesn't matter whether I choose to observe y_i before x_i, so we also get that the joint is p(y_i | x_i) times p(x_i). Bayes' theorem simply takes this equality and tells us that p(y_i | x_i) is p(x_i | y_i) times p(y_i) divided by p(x_i). This is Bayes' theorem.

So let's take the first mini break. I've done a very quick run through what I mean by stochastic modeling with random variables. The basic idea, when you talk about random variables, is that you shift from modeling the precise value of the variable, and how it depends on other variables, to modeling the distribution of the possible values of the variable of interest, and how it connects to other random variables of interest. And we've listed the basic properties of probabilities, including Bayes' theorem. So, are there any questions that people want to pose in the chat?

Can you please explain marginal probability again? Sure. The marginal probability is the probability of observing x_i regardless of what Y is, and it is obtained like this: I have a joint probability law, a probability distribution over the joint values of
the X variable and the Y variable, and what I need to do is sum out over all possible values of Y. Let's call it y_j, actually, instead of y_i, so that you're not confused: it's not the case that X must have the same number of possible outcomes as Y. X could take three values and Y could take seven values, or something like that; or in general they could both be continuous variables, or one of them could be continuous and the other discrete, and so on. So this is the definition of the marginal probability: the probability of X taking a certain value, regardless of the value of Y, and you obtain it by considering the probability of X being that value and Y being each of its possible values, and summing. Is this clear now?

Professor, yes: what is the difference between the marginal probability and the probability of x_i calculated individually? What's the difference between these two quantities? So, we're always thinking about a scenario where we have two random variables, say temperature and pressure, and the two things are linked. You want to know the probability of a certain temperature, regardless of what the pressure is. But they're interrelated. Exactly, so what you need to do is say: you have some probability distribution over the pressures, because you know your system, for example, and you sum the probability of the temperature being the value that you're interested in and the pressure being a certain value, across all possible values of pressure. Thank you.

Okay, I have a question. Yes, please ask. This Bayes' theorem, how is it different from the frequentist interpretation of probability? Okay, so that was going to be what we were going to talk about after the question session, so maybe it's a good time to move on. So there are
two broad schools of probability, of statistics really: frequentist versus Bayesian. This dichotomy is really about your interpretation of probability. If we're thinking in terms of measure theory and so on, the way I've introduced the concepts of probability, a probability gives me a measure over a set of possible events, and those rules have nothing to do with being Bayesian or being frequentist.

Now, the question that you're posing, I think, is... you can't read what I'm writing? Is that true for everyone? Well, you can't read it because I write in an incomprehensible way. I see, okay, that's a more serious problem. Yes, I think next time I'll use a blackboard at ICTP. So: frequentist versus Bayesian. What I've written doesn't really matter; it's just the title of the section. So far I've just stated the basic rules of probability, defined as a measure over a set of possible events: marginalization, the product rule, Bayes' theorem. These are all obviously true statements. The difference between frequentist and Bayesian is our interpretation of what probability means. In the frequentist view, and this is not something I was particularly going to get into, for frequentist statisticians the probability of X taking a certain value x_i is defined as a limit over an infinite number of experiments: the number of times X is equal to x_i, divided by the number of experiments, as the number of experiments goes to infinity. So you're supposed to have observed the world a lot of times, and then you can define the probability of a certain event as the limit of this frequency. However, there are many scenarios where we are interested in situations where this limiting approach is not even conceptually possible. For example, suppose Italy is going to play football against
England next month: what is the probability that Italy will win? That's a question that is well defined, and very often asked in ordinary life, but it doesn't make any sense in the frequentist world, because it's impossible for Italy to have played against England an infinite number of times, and of course even if they have played a large number of times, it will have been different teams, in different conditions, on different days. So for Bayesian statisticians the probability of X taking a value x_i is defined, in a somewhat hazy way, as the degree of belief in the outcome X = x_i. From the Bayesian perspective you're supposed to think that there is an expert, someone who knows something about the system you're interested in, who is able to quantify these probabilities, and then you can operate on them. How would you operate on them? Well, we have seen that Bayes' theorem tells us that the probability of Y taking a certain value, given that we've observed X taking a certain value, is given by p(x_i | y_i) times p(y_i) divided by p(x_i). So how do we interpret these terms? The expert has a prior belief about what Y should be, and the expert also has a model of how X is connected with Y, a belief about what X should be given that Y takes some value: this is called the likelihood, the likelihood model. So if I have a prior belief and I have a likelihood model, then Bayes' theorem tells me how to update my belief, and the result is called, in fact, the posterior belief: posterior because it comes after having observed your data, which in this case is an experiment on X. Good. Some people can understand the writing; that cheers me up. So these are the basics of probabilistic modeling and Bayesian inference. Probabilistic modeling means setting up models in terms of relationships between random variables. So to do probabilistic modeling you have to
start with some beliefs about at least some of the variables, the priors, and you have to start with some model that tells you how likely some observations would be given your prior belief: so you need the prior and the likelihood. A probabilistic model is a model of the probabilistic relationships between different random variables, and Bayesian inference is the procedure that allows you to combine the prior beliefs and the likelihood model in order to update your belief about the unobserved variables. Let's see a simple example of how this works explicitly. Suppose you have a variable x, and you have a very broad belief about this variable, and then you observe a variable y which is equal to x plus a little bit of noise. So if x is equal to three, then your y would be distributed around three: something that falls around there. Now suppose I observe y equal to, I don't know, five; so I have an observation here. Then, after having observed that, my belief about x will no longer be spread over all the real numbers; it will be much more concentrated. So this curve would be the prior, this would be the likelihood, this is y given x equals three, and this would be the posterior. Any questions? Let's have a wee pause. The important thing I've been telling you in these last 10 minutes or so is really the key concept of the course. To do probabilistic modeling, you have to formulate all your models in terms of random variables, and in terms of probabilistic relationships between those random variables; and Bayesian inference is important because it is the procedure that allows you to update your belief about unobserved variables based on observations. That's why it's so important and used in practice. Now, how
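As a numerical sketch of this example (the specific numbers are my own, hypothetical choices): a broad Gaussian prior over x, Gaussian noise, and an observation y = 5. For a Gaussian prior and Gaussian likelihood the posterior is available in closed form, and the same formula shows how the posterior narrows further once several independent observations are combined:

```python
import numpy as np

# Broad prior belief over x (hypothetical numbers).
m0, s0 = 0.0, 10.0        # prior mean and standard deviation
s = 1.0                   # noise standard deviation in y = x + noise
y = 5.0                   # the observed value

# Conjugate Gaussian update: the posterior over x is again Gaussian,
# with precision equal to the sum of prior and likelihood precisions.
post_prec = 1.0 / s0**2 + 1.0 / s**2
post_var = 1.0 / post_prec
post_mean = post_var * (m0 / s0**2 + y / s**2)
print(post_mean, np.sqrt(post_var))   # concentrated near 5, far narrower than the prior

# With N i.i.d. observations the likelihood enters N times, so the
# posterior concentrates around the data and the prior matters less.
ys_obs = np.array([5.1, 4.8, 5.3, 4.9, 5.2])   # hypothetical repeats
N = len(ys_obs)
post_var_n = 1.0 / (1.0 / s0**2 + N / s**2)
post_mean_n = post_var_n * (m0 / s0**2 + ys_obs.sum() / s**2)
print(post_mean_n, np.sqrt(post_var_n))
```

Note how the single-observation posterior mean sits just below 5: it is a compromise between the prior mean (0) and the observation, weighted by their precisions.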
do we get that chat? I see that there is a question; why don't you unmute yourself and ask it?

Sir, I have a question: is no probabilistic modeling possible without some prior belief? That is a good question indeed. Let's say it is possible to formulate likelihood models, but if you want to do Bayesian inference, then you have to combine the likelihood model with the prior. You could have, say, absolute certainty that your unobserved variable y is equal to, I don't know, five, and you could still have a stochastic outcome x; but then y would not be a random variable, so you would only have one random variable. If you have multiple random variables, then you have to have something that acts like a prior and something that acts like a likelihood in order to build a model. Is this sufficiently clear? I think so, professor.

Yes, please. In the denominator you wrote p(x_i); is that the marginal probability? Absolutely, and this quantity has a lot of names: it is the marginal, and it is also called the evidence. It's the probability of the observations regardless of what the unobserved variable y is, so it's the evidence that the observations provide for the model. Thank you.

I think there was one more question, maybe. Yes: did I get it right, did you say that it doesn't matter what my prior belief is, that if I have the right likelihood my answer would be something specific? No, I don't think I said that, and it would be wrong. Your prior belief is extremely important; as you see, to some extent prior and likelihood are exactly on the same footing. If I change my prior, my posterior will change. For example, in this simple example, if my prior, and I'll try to draw it dashed so that it's not confused with the rest, if my prior were that the random variable x lies all the way to the right, an alternative prior, let's say, then the posterior given this observation would also be shifted to the right. This observation was, in the
previous case, quite close to the prior mean, so the mode of your posterior would be shifted only a little with respect to the observation. In this case, where the prior is very much in disagreement with the likelihood, the posterior will be a compromise: it will be somewhere in between the prior and the likelihood. So the prior is absolutely important.

Professor, yes: can we plot the marginal probability on the graph, just like the rest? Well, this is a graph over x, so here we're looking at probabilities of x: the prior is a probability over x, itself a marginal probability if you wish, but it's the prior. The marginal probability of the observation would be the probability of y. If I have y equals five, regardless of what x is, I should plot it on the y axis, and it would be given by the integral of p(y | x) times p(x) dx; it would probably be something rather broad as well, I guess. Thank you.

I have a question: are we going to see in this course how to choose the prior? Or, even if we don't see it, is there any way to know whether the prior we choose is good or wrong? Okay, so that's two questions. The first is about this course, and the answer is: not really, we're not going to look at how you choose the prior. Now, I could open a very big discussion on how to choose priors; I won't, but I'll tell you a little. There are essentially two schools of Bayesians: the so-called objective Bayesians and the subjective Bayesians. The subjective Bayesians insist that the prior is something that an expert should specify, and it should be informative. So if I were doing probabilistic modeling, as I actually do for a job, in a scientific context, for example in applications to biology, I would choose the prior by either talking to a biologist or, for example, if I were looking at, I don't know, structural
data on proteins, I might use models of protein folding to obtain a prior distribution. If you're an objective Bayesian, instead, you're typically looking at situations where you don't think good knowledge is available, and there is a whole school among Bayesian statisticians on how to choose a prior that biases your posterior results as little as possible. The objective Bayesians say: look, we only want the data to speak, we don't want to impose our biases on the results, so we want to select priors that convey as little information as possible. The problem is that, as should be clear from Bayes' theorem itself, there is no universal uninformative prior: depending on what the actual likelihood model is, the uninformative prior will be different. So a lot of research goes into statements of the form: for this class of models, this is an uninformative prior. That was the first part of the question, and it's all I'm going to tell you about how to choose your priors; in general I'll assume that you have someone who can give you good information, or you are an expert yourself, maybe a condensed matter physicist who writes down priors for protein folding. The second part of your question is: are there good ways to decide whether a prior was good? I'm also not going to tell you much about this in the course, although I could say a little more; it falls into the so-called model selection set of questions. There are ways to assess whether one model gives a better fit to the data than another model, and there are also ways to assess, unconditionally let's say, how good the fit of one model is relative to another. So let's do that now: the key is comparing models. In your Bayesian setup, we have the probability of the unobserved variable Y taking a certain value given that we observed X equal to x_j, and the marginal probability in its denominator, the one we were talking about, is the key to deciding whether a
model is a good model. This is called the evidence, as well as the marginal likelihood, and it essentially answers the question: what is the probability of observing x_i, regardless of everything else? Now, if I have two different models, different either because the prior is different, or because the likelihood is different, or possibly because there are even different sets of random variables, this is the objective quantity that allows me to say how probable the data is given the model. So if I have two models, with prior probabilities over the models p(m1) and p(m2), because you may think a priori that one model is more probable than the other, then the Bayes factor comparison, strictly the posterior odds once the model priors are included, is the probability of the observations jointly with model one divided by the probability of the observations jointly with model two; and this is of course the same as the probability of the observation given that we're working in the framework of model one, times the prior for model one, divided by the probability of the observation given that we're working in the framework of model two, times the prior for model two. If this number is greater than one, it means that, regardless of all the other variables that have been marginalized out, the data is more supportive of model one than of model two. So Bayes factors are used for Bayesian decision making.

I saw a question flashing up, so maybe we can have that question. I also saw in the chat that many of you were asking me about references; I can give you a couple of references at the end of this lesson, or I can send them to Matteo and you can circulate them by email. But let's have the question, is it still there?

Hi sir, I also have a question. So, you're saying the models don't determine the priors, right? The priors are determined by the expert. Yes. And the likelihood model is the model, right, or is it something else? No, say that again, I didn't
get that. Like, you said p(x_i | y_i) in the last slide. Yes, that is the likelihood model. And so that is what we think of as the model? Yes. And what is p(m1), then? Is m1 how you compute p(x_i | y_i)? No. So now we are thinking that we have two models here, and a model means a collection of random variables and how they are related; of these random variables, X is observable. So in one model we might consider temperature and pressure, with a certain relationship between them; in another model we might consider temperature, pressure and volume, for example, so a richer model, but we always only observe temperature. So in model one we have a certain set of random variables and a certain set of relationships between them, and we observe x; in model two we have another set of random variables and another set of relationships, but we still observe x. What we do is compute the marginal distribution for x in the framework of model one, evaluated at the observation, and divide by the marginal distribution of x at the observation in model two. This is a number, not a distribution anymore: because we put in the observation, it becomes a number, so the comparison is a ratio of two numbers. And then maybe we have an expert again who tells us: look, the probability that you also need the volume is one third, while the probability that you don't need the volume is two thirds; and we put those numbers in as the model priors.

There was another question appearing; there was also a question in the chat asking whether there are any situations where the prior doesn't matter. Let me answer that very briefly, because it is a common situation that we will see. What about when you have many observations? Maybe you observe X = x_1 once, and then you repeat the experiment; let's say this is the first time you do the experiment, and you
repeat the experiment, and the second time you get x_2; the index counts which experiment it is. If these experiments are independent, then in any case you have a joint distribution over all of x_1 to x_N, if you've done N experiments. And suppose you're always doing the experiment in the same conditions, so that the variable y remains the same. If the experiments are independent and identically distributed, then the joint becomes a product from n = 1 to N of p(x_n | y), times p(y). So this is still my prior, and this is my likelihood for all the observations; but now, as you can see, the more observations I make, the more likelihood terms I get, so the contribution of the observations to the posterior will eventually dwarf the contribution of the prior. So it is true, to some extent, that the prior becomes irrelevant in the limit where the number of observations becomes infinite. But even that is not precise: if the prior says that a certain value of y has zero probability, then no amount of observations will shift that. However, it is common to say that in the limit of very large numbers of observations the prior becomes less important: when you have lots and lots of observations, the prior becomes less relevant for your posterior computation and the likelihood becomes dominant.

I saw a question flashing; please just unmute yourself, because I don't know how to find the chat with Zoom: it flashes and then goes away. Excuse me, professor. Yes, please. Back when you were talking about the Bayes factor, I mean the odds ratio, I have a question: could the odds ratio give us any intuition, any information, about the limits of the variables that we choose in our modeling? For example, when you were talking about having both variables pressure and temperature, we for sure in our
modeling could have some limits on each of these variables; could the odds ratio give us any intuition about what we should choose for the limits? Okay, so the Bayes factor could be one way to do that, but another thing one could do, and we're going to see it later in the course, is to consider the limits of your variables as additional parameters, or additional random variables. If you wanted to be fully Bayesian, you could obtain a posteriori the distributions over the lower bound and the upper bound. Or, as many people do because it is computationally convenient, you could take the evidence and say: this evidence depends on these additional parameters, so I will treat the lowest possible pressure and the highest possible pressure as parameters that I optimize; I'll optimize the marginal likelihood with respect to these parameters. Okay, thank you.

And one more question, please: do we prefer to have fewer variables in our modeling when we use Bayes' theorem, or does it not matter at all? Okay, there are several answers to this question. The purely theoretical answer is that it does not matter. In fact, one of the great advantages of Bayesian models is that, because you are doing this averaging when you compare models, you're not really at risk: when you compute the Bayes factor, you average out all the variables that you're not observing, in model one and in model two, and this averaging, because the probabilities are normalized, automatically accounts for model complexity, so in some sense it will not allow you to overfit. Now, obviously, the other answer to your question is practical: working with lots and lots of variables might be computationally more complicated, and also it could be, let's
say that, from the point of view of understanding what you're doing, it might be more complex. So to some extent there is always a good case for having simple models. Thank you. No problem.

Sorry, yes, please. Based on this Bayes factor: if R is greater than one, are you saying model one is preferred over model two, or are you not making any such comment? So yes: if R is greater than one, you would prefer model one compared to model two. Can you explain? I don't understand why you say that. Because, you see, what does this probability mean? It's better to look at the right-hand side than the left-hand side. This is a joint probability, but on the right-hand side you have: given the assumptions of model one, so the number of random variables, the prior distribution, the likelihood, what is the probability of the observations; and in the denominator: given the assumptions of model two, what is the probability of the observations. The idea is that the observations should be somewhat typical, and so if they are more likely under model one, then model one is a better explanation for the observations. That's why, if R is greater than one, I will prefer model one; I will say model one is more supported by the data. Okay, thanks. No problem.

There were some more questions flashing; were there more questions, or was that it? If there are, just unmute yourself and speak, that is the easiest thing. If not, let me write a roadmap for the course. Today we've covered the basic concepts of probability. Next time we're going to look at some explicit cases of probability distributions, first of all in one dimension, and then we will do multivariate probability distributions, in multiple dimensions. How do I find the chat? Well, anyway, once we have that... Professor, yes? If you stop sharing your screen, you can easily see the chat. Yes, I know that, but I still want to write,
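Coming back to the model comparison discussed above, here is a minimal sketch of the Bayes factor computation in Python. The two models, their priors, and the observation are all hypothetical; each evidence is obtained by marginalizing out the unobserved variable on a grid, exactly as in the marginalization rule, and the ratio then tells us which model the data supports:

```python
import numpy as np

def gauss(z, mu, sigma):
    # Gaussian density, used here both as prior p(y) and likelihood p(x | y).
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_obs = 5.0                          # the (hypothetical) observation
ys = np.linspace(-30.0, 30.0, 6001)  # grid over the unobserved variable y
dy = ys[1] - ys[0]

# Model 1: broad prior over y, observation model x = y + unit noise.
prior1 = gauss(ys, 0.0, 10.0)
ev1 = np.sum(gauss(x_obs, ys, 1.0) * prior1) * dy   # evidence p(x | m1)

# Model 2: narrow prior concentrated near zero, same observation model.
prior2 = gauss(ys, 0.0, 1.0)
ev2 = np.sum(gauss(x_obs, ys, 1.0) * prior2) * dy   # evidence p(x | m2)

p_m1, p_m2 = 0.5, 0.5                # prior probabilities over the models
bayes_factor = ev1 / ev2
posterior_odds = bayes_factor * p_m1 / p_m2
print(bayes_factor)                  # > 1 means the data favor model 1
```

Here model 1, with its broad prior, assigns much higher marginal probability to the observation x = 5 than model 2, whose narrow prior concentrates near zero, so the Bayes factor comes out well above one.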
so maybe I can stop sharing and then I can see. Yes, now I can see the chat. Okay, Matteo was suggesting the David MacKay book, which is freely available. Another good book that I recommend is Bayesian Reasoning and Machine Learning by David Barber, sorry, I didn't mean to send that privately, which is also freely available online. So, let's go back to the roadmap for the rest of the course. We'll do probability distributions in one dimension, and then we'll go on to multivariate distributions: this will mainly be the Gaussian distribution, the multivariate Gaussian, which is the fundamental tool. Then we'll spend some time talking about linear models: probabilistic PCA, which is a linear dimensionality reduction method, then Bayesian linear regression and basis function regression, and finally we'll move on to Gaussian processes. All of these models are in some sense linear, which means that you can actually perform Bayesian inference analytically and write down explicitly what the posterior distributions are. If we have more time towards the end, I will briefly explain some methods for performing Bayesian inference when the models are intractable, and by intractable I mean that you can't analytically compute the posterior. There are all sorts of problems in Bayesian inference: one of them, which we discussed quite a lot, is how to choose the prior, and another is how to choose the likelihood; but the toughest problem, computationally, is how to compute the evidence. Without the evidence you don't get normalized probability distributions and you can't compute the posterior. So much of the difficulty in Bayesian statistics and machine learning research lies in computing this evidence efficiently, because the evidence involves marginalizing out all the unobserved y variables, and if you
have lots of variables, or if they are continuous variables, you may have sums with lots of terms, or high-dimensional integrals, which are not feasible even with very strong computers. So I think that's it; I'm just over the one hour. Professor, can I ask about the material? Professor Marsili will be providing lecture notes and videos beforehand; will that be the case with your lectures? Will the notes, or maybe not the videos but at least the notes, be available somewhere beforehand? Well, I'll discuss how that works with Matteo. I could pre-record the lectures, and then we could just have questions during the lecture hour, if that worked. As for the notes, I've got stuff scribbled in this notebook, but I don't have them typed up. Maybe it would be more useful for the students to have the notes in advance, even just to get acquainted with what we'll be discussing during the lectures, to make them easier to follow, but I don't know. Someone is saying that a pre-recorded lecture won't be interactive, so I think I would prefer not to record the lectures, but I can try to give you some concrete pointers to the books beforehand. Professor, would you be providing us with a syllabus or course outline detailing the topics we'll discuss? Yes, I can write you a more detailed syllabus, but it would be something like a page of bullet points. All right, thank you. Okay, I think we're getting close to the end of the session; you will have time in the forthcoming lectures to ask questions. So thank you again, Guido. Sorry sir, I have just
a small question. Yes? Sir, the Bayes factor is a function of the outcome which you want to study, right? No, the Bayes factor is a function of the observations. Observations, sorry, yes. Typically in the Bayesian world you have some things that you can observe and some things that you can't observe, and the idea is to use the observations to learn something about the distribution of what you can't observe. So yes, the Bayes factor is a function of the observations, but the observations are not something that can be changed. Okay, thank you. Any other questions? Just one more, please. Professor, when we say the probability of x given y, x and y are two random variables, right? So these variables can be either independent, or x could be dependent on y. Say y is our observable, our independent variable, and x depends on y; if we don't know the formula and we can only observe y, how can we find the evidence, which is the probability of x? Exactly, and that is why the course is called probabilistic modeling: you have to make models. You may not know the formula, but what you generally do is build a model of x conditioned on y. All the models assume that there is a dependency; if the variables are independent, there's not much that can be said, it's just the prior. If x is independent of y, then there is nothing to be learned about y from observing x. Thank you. Okay, I think we can close our session here. Thank you again, Guido, and see you tomorrow at 2 p.m.
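The evidence computation discussed above, marginalizing an unobserved variable out of the joint distribution, can be sketched for a tiny discrete model. All the numbers here are invented for illustration: x is a single observed event and y is a hidden variable with two states.

```python
# Hypothetical toy model illustrating the evidence p(x) obtained by
# marginalizing out an unobserved discrete variable y:
#   p(x) = sum_y p(x | y) p(y)
# The probabilities below are made up for this example.

p_y = {"y1": 0.3, "y2": 0.7}          # prior over the hidden variable y
p_x_given_y = {"y1": 0.9, "y2": 0.2}  # likelihood of the observation x under each y

# Evidence: sum the joint p(x, y) = p(x | y) p(y) over all hidden states.
evidence = sum(p_x_given_y[y] * p_y[y] for y in p_y)

# Posterior over y via Bayes' rule, normalized by the evidence.
posterior = {y: p_x_given_y[y] * p_y[y] / evidence for y in p_y}

print(evidence)   # evidence = 0.9*0.3 + 0.2*0.7, i.e. about 0.41
print(posterior)
```

Here the sum has only two terms, but with many hidden variables it has exponentially many terms (or becomes a high-dimensional integral), which is exactly why computing the evidence is the computationally hard part of Bayesian inference.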
for tomorrow's lectures. Excuse me, Professor Marsili. Yes? You said that you upload the recorded lectures and we watch them in advance, or do we just look at the material? Yes, I mean, the idea is that you look at them in
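Since the Bayes factor came up twice in the questions above, here is a minimal numerical sketch of R = p(D | M1) / p(D | M2). The data and the two models are invented for illustration: model 1 is a fair coin, and model 2 treats the coin's bias as unknown with a uniform prior, so its evidence is obtained by marginalizing the bias out.

```python
import math

# Hypothetical data: n coin flips with k heads (numbers made up for the example).
n, k = 20, 14

# Model 1: fair coin, bias fixed at 0.5.
# Evidence = binomial likelihood of the data at bias 0.5.
evidence_m1 = math.comb(n, k) * 0.5**n

# Model 2: unknown bias with a uniform (Beta(1,1)) prior.
# Marginalizing the bias analytically gives C(n,k) * B(k+1, n-k+1),
# which for a uniform prior simplifies to 1 / (n + 1).
evidence_m2 = 1.0 / (n + 1)

# Bayes factor: R > 1 supports model 1, R < 1 supports model 2.
R = evidence_m1 / evidence_m2
print(f"p(D|M1) = {evidence_m1:.4f}, p(D|M2) = {evidence_m2:.4f}, R = {R:.3f}")
```

With 14 heads out of 20 the data sit away from the fair-coin prediction, so R comes out below one and the flexible model is preferred; with k near n/2 the fair-coin model would win instead, because its evidence concentrates on exactly that kind of data.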