Now it's my pleasure to introduce the first short course of our summer school, which will be given by Antonio Artés-Rodríguez, who will give an introduction to probabilistic machine learning. He is one of the experts on this topic. He is a full professor at the Universidad Carlos III de Madrid, where he specializes in signal theory and communications, and in particular in the information theory behind them, which reaches deep into probabilistic machine learning. He has worked on more than 100 research projects in this domain. He is not only a successful professor; he also founded a company, Evidence-Based Behavior, which develops healthcare solutions for psychiatric patients using AI. So he is the right person to introduce us to the field of probabilistic machine learning. Antonio, we are very happy to have you as a speaker here, and we are also very happy to have you as a longstanding member of our ITN, this one and the previous one. Thank you. The floor and the word are yours, Antonio.

Okay, thank you, Karsten, for your kind introduction. Well, good morning to all. It is a pleasure to be here at the summer school to provide a short overview of what probabilistic machine learning and graphical or latent variable models are. Okay, well, this short course will not cover only the latest advances in the field. Here you can see a tweet from some weeks ago, which reminds us that there has been a long journey to reach the results we now have at hand in probabilistic machine learning. So I will explain things that were proposed many years ago. Okay, well, in the end this short course, this lecture, is an in-depth reading of one journal article by David Blei, which I recommend all of you take a look at. It is entitled "Build, Compute, Critique, Repeat," and the structure of this talk will be based mostly on that article.
So the first objective of this lecture is to offer a basic and comprehensive overview of some latent variable models. I will put the emphasis on understanding the relationships among different models. I want you to understand, for example, what a Gaussian mixture has in common with a variational autoencoder, to say something. And I also want you to know that you have the freedom to play with models. Let's say your model, or your data, does not fit any of the existing models in the libraries — scikit-learn or whatever your preferred library is. You can build your own probabilistic model and design your own inference method for your particular research problem. So, having this in mind, the outline of the talk will be: first I will give you a short introduction to probabilistic machine learning; after that, models for discrete and continuous data; then latent variable models, which will be the biggest part. And while preparing the slides, I think I was a little bit optimistic, so I don't know whether I will get to the last topic, which is Markov models for modeling time series or sequential data. Okay, let's start. Well, this is an example of real-world data; it represents the data that I am working on almost every day. This is an example of what is called the digital footprint. The digital footprint of a person is all the data that we can gather passively from smartphones, wearable devices, and so on and so forth. So the first thing you can see is that your data is not as simple as you might expect, because you can have data that are natural numbers — count data, like the steps you take in a given day. Okay, well, I forgot to say that this is almost 180,000 days of data from different users.
So you have count data; you have real-valued data, more real-valued data; you can have binary data — this one basically tells you whether you have been at home or not; you have the number of locations you have visited, which is a natural number but very different in magnitude from the number of steps you take; you can have categorical data; and you can have other binary data as well. So the data is heterogeneous, and you have to learn from all of these data jointly. But if dealing with heterogeneous data were not enough, in real-world applications you have a lot of missing data. Here each black line represents a day of data, and what is not filled in — what is white — is the data that is missing. So, particularly in some types of data, you have a lot of missing data, and you have to learn with this, while most current models need the data to be complete — meaning you would have to perform some imputation. But here we want to attack this problem in a more principled, more formal way. This is why we propose to use probabilistic machine learning itself. Let's start with the problem setting and the basic definitions. The simplest setting is when you have some data that you observe, which can be composed of several types of data. You can treat this X as a realization of a random variable. This is your observation, and you can represent it in a graphical model as the name of the variable inside a circle, with the circle shaded gray; the gray shading means that you observe this variable. And this plate means that you have N such data points. This data is related to some unknown parameter theta through some likelihood, the probability distribution p(X | theta). So the first thing you need to establish in order to specify your model is the joint distribution p(X, theta), or, if you consider all the data, p(D, theta), okay?
This is the first one. Second, you have to define your hypothesis — let's say your prior distribution over the theta parameters of your model, okay? This is the first component needed to specify it. And you also have to define the likelihood; the likelihood is usually given by the physical phenomenon that generates the data. But remember that while the prior is a true probability distribution, the likelihood is not: the likelihood is a deterministic function of the parameter theta once you know the data, okay? What you want, given the prior and the likelihood, is to obtain the posterior distribution of the parameter theta after observing the data. If you apply Bayes' rule, you can decompose this into the joint distribution of the data and the parameter, divided by the marginal probability of the observations. And since the marginal distribution of the data does not depend on theta, it is just a constant; so the posterior distribution is simply proportional to the prior distribution — your hypothesis — times the likelihood of your data. So this is your primary goal: to obtain the posterior distribution over the unknown parameter. Another quantity you may want to know is, after observing the data, what is the probability distribution of a new, fresh data point? This is called the posterior predictive distribution, okay? And the last one is the marginal distribution of the data, known as the evidence or the marginal likelihood, which you obtain by integrating out — marginalizing over — all the possible values of the parameter theta. This is the quantity that is treated as a constant when obtaining the posterior distribution, but in some situations you need to know it, and in most cases this is one of the most difficult tasks in machine learning.
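The mechanics of Bayes' rule just described can be sketched numerically. This is a minimal illustration, not from the lecture: a hypothetical coin with unknown bias theta, a discrete grid of candidate theta values standing in for the continuous parameter, and made-up flip data. The evidence appears explicitly as the normalizing constant.

```python
# Sketch: Bayes' rule on a discrete grid of parameter values.
# Hypothetical setup: a coin with unknown bias theta; grid, prior,
# and data are illustrative choices, not from the lecture.

def grid_posterior(data, grid, prior):
    """Posterior over grid points: prior x likelihood, normalized by the evidence."""
    ones = sum(data)
    # Bernoulli likelihood of the data for each candidate theta.
    lik = [th ** ones * (1 - th) ** (len(data) - ones) for th in grid]
    unnorm = [p * l for p, l in zip(prior, lik)]
    evidence = sum(unnorm)                      # marginal likelihood p(D)
    return [u / evidence for u in unnorm], evidence

grid = [i / 100 for i in range(1, 100)]         # candidate theta values
prior = [1 / len(grid)] * len(grid)             # uniform prior
data = [1, 0, 1, 1, 0, 1, 1, 1]                 # 6 heads, 2 tails
post, evidence = grid_posterior(data, grid, prior)
# With a uniform prior the posterior peaks at the empirical frequency 6/8 = 0.75.
print(max(zip(post, grid))[1])
```

With a uniform prior the posterior mode coincides with the maximum-likelihood estimate, which makes the proportionality "posterior ∝ prior × likelihood" easy to check by eye.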
So you want to translate this model into machine learning problems. The thing is that obtaining the posterior distribution of the unknown parameter is the learning problem: you obtain the posterior distribution of theta. The predictive distribution is for when you want to predict what a new data point looks like; this is the prediction task. And the evidence helps you with the problem of model comparison: because you integrate out all the possible values of the model parameters to obtain the evidence, if you compute the evidence for different kinds of models you can compare them, and whichever gives you the highest evidence is the model that best fits your data, okay? Well, additionally, within this setting we can construct different kinds of models and make use of different frameworks. The first distinction: we can talk about generative models, in which you model both the marginal distribution of the parameter and the likelihood simultaneously — so everything here is a random variable. There are also discriminative models, in which you model the probability of the observation given the parameter, or some function of the observation; but here the difference is that theta does not need to be a random variable, because you will not model a probability distribution over theta. And with respect to frameworks, the one we are interested in is obtaining the full posterior distribution of the unknown parameters; this is called Bayesian estimation, or probabilistic machine learning. If you want to know not the full distribution but only a given quantity, that is usually called point estimation. Point estimation can be Bayesian point estimation or non-Bayesian estimation; the difference with respect to probabilistic machine learning is that here you are not interested in the full distribution, you want a specific value.
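The model-comparison use of the evidence can be sketched on a toy example. Assume two hypothetical models of coin flips: model A fixes theta at 0.5 (a fair coin, so the evidence is just 0.5 to the number of flips), and model B puts a uniform Beta(1, 1) prior on theta, whose marginal likelihood has the standard closed form in terms of the beta function. The data counts below are invented for illustration.

```python
import math

# Sketch: comparing two hypothetical models of coin flips by their evidence.
# Model A: fair coin, theta fixed at 0.5.  Model B: theta ~ Beta(1, 1).

def log_evidence_fair(heads, tails):
    return (heads + tails) * math.log(0.5)

def log_evidence_beta(heads, tails, a=1.0, b=1.0):
    # Beta-Bernoulli marginal likelihood: B(a + heads, b + tails) / B(a, b)
    def log_beta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return log_beta(a + heads, b + tails) - log_beta(a, b)

# 18 heads out of 20: the flexible model wins; 10 out of 20: the fair coin wins.
print(log_evidence_beta(18, 2) > log_evidence_fair(18, 2))    # True
print(log_evidence_beta(10, 10) > log_evidence_fair(10, 10))  # False
```

Note how the evidence automatically penalizes the more flexible model when the data are perfectly explained by the simpler one — the "highest evidence wins" rule from the lecture.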
You do not even need to assume any randomness in theta, as in maximum likelihood: there you can use a framework in which theta is not a random variable. We will be centered on probabilistic machine learning: we want the posterior distribution of the unknowns given the data. Okay, so let's go to the basic models we can make use of — let's start with the basic pieces of the puzzle with which to construct your model. The first thing you can encounter is binary data: your data is binary, all of it the same kind of information. So the first thing you have to do is establish your likelihood, your measurement model; here the most common is the Bernoulli/binomial model. You assume the data are realizations of a Bernoulli random variable, where theta is the probability of a one. Then you put a prior distribution on theta; in this case we need a distribution that gives you numbers between zero and one, so we choose a beta distribution, which takes this form, and we calculate, doing the math, what the posterior is. And the posterior is just another beta distribution with a different set of parameters. This is very important: such a prior is called a conjugate prior. And it is very important because it gives us a fast way of doing inference, because inference is just estimating or calculating a new set of parameters — the posterior distribution is of the same family, the same type, as the prior. So we call this a conjugate prior; it is a very convenient property. And in most cases, just for convenience, we will choose conjugate priors even if we know they are not the most suitable ones, or not the ones that fit your physical problem best. So in this case, performing the estimation, performing the inference, is just counting the number of ones and the number of zeros and adding them to the parameters of the prior, and you have the full posterior distribution.
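The conjugate update just described really is just counting. A minimal sketch, with an illustrative Beta(1, 1) prior:

```python
# Sketch of the conjugate Beta-Bernoulli update: inference reduces to
# counting ones and zeros and adding them to the prior parameters.
# The prior values a=1, b=1 are an illustrative choice.

def beta_bernoulli_posterior(data, a=1.0, b=1.0):
    """Posterior Beta(a + #ones, b + #zeros) from a Beta(a, b) prior."""
    ones = sum(data)
    zeros = len(data) - ones
    return a + ones, b + zeros

a_post, b_post = beta_bernoulli_posterior([1, 1, 0, 1, 0, 1, 1])  # 5 ones, 2 zeros
print(a_post, b_post)              # 6.0 3.0
print(a_post / (a_post + b_post))  # posterior mean, ~0.667
```

The returned pair fully specifies the posterior distribution — no optimization or sampling is needed.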
Well, in addition to binary data, you can extend this to categorical data. In this case we can make use of the Dirichlet-multinomial model. Obviously you can have more and more models, but this is one of the simplest and most common ones for categorical data. You define your likelihood; for categorical data it is a categorical (multinomial) distribution. Here you extend the beta prior to a Dirichlet distribution, which is a sort of multidimensional beta distribution, and you obtain the posterior distribution using this Dirichlet distribution as a prior. This is a conjugate prior, so your posterior also belongs to the Dirichlet family, with different parameters, okay? This is our second element. And the third element will be real-valued data; here we will make use of a multivariate Gaussian. As a likelihood, we assume that our data have been generated by a multivariate Gaussian defined by the mean vector mu and the covariance matrix sigma, okay? When you turn this into the likelihood for the whole data set, you can express it in a compact form involving the empirical mean and the empirical covariance matrix; this is just a rewriting of the likelihood. You have to establish a prior distribution over the parameters you want to infer, mu and sigma. And in order to obtain a posterior distribution of those parameters that belongs to the same family, one option is to use what is called the normal-inverse-Wishart distribution. Basically, you assume an inverse-Wishart distribution for your covariance matrix — a probability distribution whose realizations are positive semi-definite matrices, like a covariance matrix — and, given sigma, you put a normal prior distribution on the mean.
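Before completing the real-valued case, here is a sketch of the Dirichlet-multinomial update described above: exactly as in the Beta-Bernoulli case, inference is just counting, now with one count per category. The symmetric prior alpha = 1 per category is an illustrative choice.

```python
# Sketch of the conjugate Dirichlet-multinomial update for categorical data:
# the posterior parameters are prior pseudo-counts plus observed counts.

def dirichlet_posterior(data, num_categories, alpha=1.0):
    """Posterior Dirichlet parameters from a symmetric Dirichlet(alpha) prior."""
    counts = [0] * num_categories
    for x in data:
        counts[x] += 1
    return [alpha + c for c in counts]

# Observations take values in {0, 1, 2}.
post = dirichlet_posterior([0, 2, 2, 1, 2, 0], num_categories=3)
print(post)  # [3.0, 2.0, 4.0]
```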
So, after doing the math, you obtain that the posterior distribution is also a normal-inverse-Wishart with different parameters, involving the number of observations, the empirical mean, and the empirical covariance. So performing inference with these basic models is pretty easy: you have done the math — or someone else has done the math — and you only have to calculate a few parameters in order to specify the full posterior probability distribution over your unknowns. Obviously there are more distributions for binary, count, and real data, but this will be enough for now, okay? So with these pieces of the puzzle we can do different things, but let us revisit our objective: how can we apply these basic models to modeling our heterogeneous data? Well, the first thing you can do, unless you have other information, is to use a factorized model. If you have heterogeneous data and you don't have any other information, you can write the likelihood as a product of the individual likelihoods of each type of data, okay? This is the solution you have at hand using these simple models, but if you try to learn the joint distribution this way, the data types do not mix. They do not mix because each data type is learned independently: when you learn this, the problem is separable, so basically the solution is to do the inference for the binary data, the inference for the categorical data, and the inference for the continuous data, each on its own. And the robustness to missing data is also limited. If you assume that everything factorizes, you do not have any powerful tool at hand, because, well, how do you deal with the missing data? The most principled way is to marginalize over your missing data.
So if you do not observe one of the covariates — one particular variable of your data — what you can do is say: I still make use of the whole probability distribution of the data, and I marginalize over this variable; I integrate it out in order to perform the inference. This makes the problem tractable, so you do not need to do any kind of imputation; you can see the marginalization as a principled way of imputation. But the thing is, when you do the marginalization — when you compute the expectation with respect to the variables you are missing — you can decompose the joint probability distribution of X missing and X observed using the product rule, and when you take the expectation over the missing variables, this term can go out. But if every data type is factorized, the conditional of the missing data given the observations reduces to just the marginal distribution of the missing data, and when you take the expectation, those terms simply vanish. So when you use heterogeneous data with simple, factorized models, the only thing marginalization buys you with missing data is tractability: you can compute without doing any imputation, but your robustness to missingness is very limited, okay? So let's go to a more elaborate model: the latent variable model. Here, in addition to our observations and the unknown parameters, we have an unobserved variable; in this case we will start with a discrete variable Z, okay? The idea we are trying to capture is: if you want to build more complex models, you can assume that you observe only part of the data, and the part that you do not observe is the latent or hidden variable. In this first simple model, you assume that your unknown is just a discrete variable Z, so you have a distribution over this latent variable: the probability of Z being equal to k is pi_k, okay?
So now the likelihood is, in principle, different depending on the value of the hidden variable: the likelihood given Z = k is p_k(x | theta), okay? So you can construct the full likelihood as if you were observing the Z variable: when you compute p(X, Z | theta), you multiply the likelihood by the distribution over the hidden variable, and you have this joint distribution of the observed and hidden variables, which takes this form. But if you do not observe Z, one thing you can do is marginalize over the unobserved Z variable — and when you marginalize over Z, what you encounter is a mixture model, okay? Each one of those p_k belongs to one of the previous simple models, okay? And here your inference problem is to obtain the posterior distribution over all the unknowns: you want the posterior distribution of Z and theta given all the data. But there is one thing that is very difficult: this posterior is just the joint distribution of the observed variables, the hidden variables, and the parameters, divided by the evidence, and the difficult part of obtaining the posterior is exactly that normalizing term. The difficulty is not that you are not able to do the math and obtain the formulas; it is that you have to evaluate the probability distribution over all the possible values — by the way, there is a mistake on the slide: here there should be a Z instead of an X. You have to take the expectation over all the possible values of Z and all the possible values of theta, which is equivalent to obtaining the evidence. This is a hard computational problem. So when we introduce a discrete latent variable, we encounter the mixture model, and depending on the distribution you use as the base piece of the puzzle, you obtain different mixture models.
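The marginalization over z can be made concrete with a toy example. This is a sketch, not from the lecture: a two-component one-dimensional Gaussian mixture with illustrative weights and parameters, showing both the mixture density p(x) = Σ_k π_k p_k(x) and the posterior over z (the hard quantity, easy here because z takes only two values).

```python
import math

# Sketch: marginalizing the discrete latent z gives a mixture density.
# Weights and component parameters are illustrative choices.

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    # Sum over the possible values of the hidden variable z.
    return sum(w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

weights, mus, sigmas = [0.3, 0.7], [-2.0, 1.0], [0.5, 1.0]
print(mixture_pdf(0.0, weights, mus, sigmas))
# Posterior of z given x (Bayes' rule on the discrete latent variable):
resp0 = 0.3 * gauss_pdf(0.0, -2.0, 0.5) / mixture_pdf(0.0, weights, mus, sigmas)
print(resp0)  # probability that x = 0 came from the first component
```

With one latent variable per data point the sum has K terms; the computational hardness discussed above comes from the joint expectation over all N latent variables and the parameters.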
So remember that we are dealing with this kind of model. If your base distribution is a Gaussian, you obtain a mixture of Gaussians. If your base distribution is a Bernoulli, you obtain a mixture of Bernoullis. And if your base distribution is categorical, you obtain a mixture of categoricals. But most important here is that now you are able to define heterogeneous mixtures. In the heterogeneous case, given the hidden variable, the likelihood factorizes over all the data types, and you can make a mixture model in which one part is a Gaussian distribution for your real data, another part is a Bernoulli for your binary data, and another is a categorical for your categorical data. This is the kind of distribution we are interested in, okay? Well, but how can we do inference? As we saw, the difficulty is obtaining this expectation. In order to attack this, several proposals have been made in the literature. Since the inference problem is hard to solve exactly, we can adopt different schemes. One of the simplest is: you can perform a Taylor-series expansion of the unknown distribution and retain the first two terms; basically, this is equivalent to assuming a Gaussian posterior. This is called the Laplace approximation, and it was proposed many years ago, okay? It is a limited way of solving the problem, but still doable. The other thing you can do is obtain a Bayesian point estimate or a maximum likelihood estimate. For doing this, you have the classical EM algorithm, also proposed many years ago — that is when it was formalized, because the algorithmic idea is even older. We will enter later into the discussion of the EM and the variational approaches, okay? The other family of methods you can make use of is Markov chain Monte Carlo methods.
So in this kind of approximation, you are not able to obtain a parametric form of the posterior, but you are able to obtain samples from the posterior. One of the simplest methods of this kind is Gibbs sampling, which basically consists of iteratively sampling from the conditional probabilities of all the random variables, okay? It works in this way. You have all the z_i and all the model parameters. You write down the conditional probability of each z_i conditioned on the rest of the z's, all the observations, and all the model parameters; you do the same for all the n values of i, and the same for all the model parameters. Then you order all these conditionals and start with a random value for each variable. Using the first conditional, you draw a random sample from it and substitute the value of that variable with the new sample. Using this new sample, you form the next conditional distribution and sample from it, and you keep replacing all the samples iteratively. This guarantees that, in the end, when the algorithm converges, you are sampling from the posterior distribution of the parameters. Okay, so this is the solution you have at hand with MCMC methods, and the other is variational inference, based on optimization. Take into account that both the point-Bayesian EM algorithm and the variational inference algorithm transform the inference problem — which in the conjugate case was just the calculation of a few parameters — into an optimization problem. In the MCMC approximation, with Gibbs sampling, you are in some sense circumventing the optimization: you do not have to optimize anything, the only thing you have to do is generate samples. Okay, let's enter into the EM algorithm.
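The Gibbs recipe above can be sketched on the simplest possible example. This is an illustrative toy model, not from the lecture: a bivariate Gaussian with unit variances and correlation rho, where each full conditional is a one-dimensional Gaussian, so each "sample from the conditional" step is a single `gauss` call.

```python
import random

# Sketch of Gibbs sampling on a toy model: a bivariate Gaussian with
# correlation rho.  Model and rho value are illustrative choices.

def gibbs_bivariate_normal(rho, num_samples, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0                       # arbitrary initialization
    cond_std = (1 - rho ** 2) ** 0.5
    samples = []
    for _ in range(num_samples):
        # Sample each variable from its conditional given the current other one.
        x = rng.gauss(rho * y, cond_std)  # x | y ~ N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, cond_std)  # y | x ~ N(rho * x, 1 - rho^2)
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, num_samples=20000)
mean_x = sum(s[0] for s in samples) / len(samples)
corr = sum(s[0] * s[1] for s in samples) / len(samples)  # ~ rho for unit variances
print(mean_x, corr)
```

After convergence the chain's samples behave like draws from the joint distribution: the empirical mean is near zero and the empirical correlation is near rho, even though we only ever sampled one-dimensional conditionals.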
So, for solving this inference problem with the EM algorithm, the aim is to obtain the parameter theta at the maximum of the likelihood of your data, or at the maximum of the posterior distribution of the parameters — sorry, there is a mistake on the slide here. You can apply it both for maximum likelihood and for maximum a posteriori, but in the end you obtain only a point estimate: the only thing you obtain is a value for theta, not a full probability distribution. Okay, so you have the log likelihood of the data — the logarithm of the probability of all the data given the parameters — and you can write it as a marginalization over the hidden variable, okay? To derive the EM algorithm — to obtain the maximum of the log likelihood, or of the log posterior — we introduce an auxiliary distribution q(z) over the hidden variable to make the problem easier. So the log likelihood of your data can be treated as follows: we multiply and divide by this auxiliary function q(z), and, using Jensen's inequality to push the logarithm inside the expectation, we obtain a lower bound on this quantity, which involves the Kullback-Leibler divergence between your auxiliary distribution and the distribution over the hidden variable. So the problem is to find the q(z) that makes the best approximation: to maximize this negative KL divergence, that is, to find the q(z) that is closest to the posterior over the hidden variable given the observed data, okay? And for solving this problem, the EM algorithm performs a coordinate ascent, alternating two steps. In the E-step, we maximize this lower bound on the log likelihood with respect to q(z) while holding theta fixed: we obtain q at iteration t as the argmax of the lower bound, evaluated at the previous value of theta, okay?
And the solution of this maximization problem is just the posterior of z given the data and the previous value of the parameters. As in the MCMC methods, as in Gibbs sampling, you have to start with values for all the z_i and all the theta variables, okay? The other step is: once you have obtained the optimal q, you maximize with respect to theta — you maximize the lower bound with respect to theta while holding q(z) fixed, okay? If you do this kind of optimization, alternating between these steps until the algorithm converges — and the convergence of the algorithm is guaranteed: you can show that the log likelihood of your data does not decrease from iteration to iteration, okay? I want to make some comments here. In the first step, the E-step, you compute an expectation: you calculate the probability of z given all the data. This is an easy problem, okay? In the standard EM algorithm, the theta parameter is fully maximized at each iteration, okay? You can do this, but the algorithm converges even if this optimization is not a full one. Let's say that instead of finding the maximum analytically you perform gradient ascent: then at each iteration of the EM algorithm you do not obtain the best solution for the theta parameters, just a solution that increases the log likelihood — and with this new solution the algorithm still converges. So this framework is very flexible: you can apply here a full maximization or a partial optimization. The only thing you have to do is obtain a theta solution that provides a higher log likelihood, okay? And if you do this, convergence is also guaranteed.
But there is bad news: with all these kinds of methods, you cannot show that the solution you reach is the global optimum. The algorithm only converges to a local optimum, okay? So this is the EM algorithm, and this is how you do the math. I will not enter into the full detail, but: you have here the mixture of Gaussians, the log likelihood of the mixture of Gaussians, and this is how you estimate the probability of Z given the data and the previous value of the parameters. Basically, the quantity you need — the expected value — is the posterior probability of Z given the data and the parameters, the so-called responsibilities. And once you have this, you plug it into the maximization: you run your optimization for maximum likelihood, you maximize the likelihood, okay? So the procedure for a mixture model is: first you obtain these quantities, the posterior probabilities of the hidden variable; with them you re-estimate the model parameters; you substitute the new model parameters and recalculate the posteriors under the new model; and so on and so forth until the algorithm stabilizes, okay? And you should perform several random initializations in order to have — not a guarantee, but a better chance of reaching the global maximum, okay?
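The E-step/M-step loop just described can be sketched end to end for a two-component one-dimensional Gaussian mixture. This is a minimal illustration, not the lecture's slides: data generation, initialization, and the number of iterations are all arbitrary choices.

```python
import math, random

# Sketch of EM for a two-component 1-D Gaussian mixture:
# E-step computes responsibilities p(z | x, theta), M-step re-estimates theta.

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(data, pis, mus, sigmas, iters=50):
    for _ in range(iters):
        # E-step: responsibilities gamma[n][k] = p(z_n = k | x_n, theta).
        gamma = []
        for x in data:
            w = [p * gauss_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
            tot = sum(w)
            gamma.append([wi / tot for wi in w])
        # M-step: weighted maximum-likelihood updates of the parameters.
        Nk = [sum(g[k] for g in gamma) for k in range(2)]
        pis = [nk / len(data) for nk in Nk]
        mus = [sum(g[k] * x for g, x in zip(gamma, data)) / Nk[k] for k in range(2)]
        sigmas = [max(1e-3, math.sqrt(sum(g[k] * (x - mus[k]) ** 2
                  for g, x in zip(gamma, data)) / Nk[k])) for k in range(2)]
    return pis, mus, sigmas

rng = random.Random(0)
data = [rng.gauss(-3, 1) for _ in range(200)] + [rng.gauss(3, 1) for _ in range(200)]
pis, mus, sigmas = em_gmm(data, [0.5, 0.5], [-1.0, 1.0], [2.0, 2.0])
print(sorted(mus))  # close to the true means -3 and 3
```

On well-separated synthetic data a single initialization suffices; in practice, as noted above, one would rerun from several random initializations and keep the run with the highest log likelihood.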
If you want to change from maximum likelihood to maximum a posteriori, you retain the E-step — the E-step is the same for both estimation frameworks — but now you define priors over the parameters: you put a prior on the mixture weights, for which you can make use of a Dirichlet distribution, and you put a prior on the mean vector and the covariance matrix sigma — like before, a normal-inverse-Wishart. Doing the same thing, and using the same quantity — the posterior probability of the hidden variable given the model parameters — you update the model, and with this model you re-estimate the posterior probability of the hidden variable, and so on and so forth. So the basic algorithm is the same. This is one of the solutions you have at hand for solving your inference problem with latent variable models. The other solution is variational inference. The difference between variational inference and the classical EM algorithm is that with variational inference you do not want to obtain a point estimate; you want to obtain an approximation to a distribution — to the posterior distribution of the unknowns, let's say of the Z and theta variables, okay? The setting is in some ways similar to the EM algorithm, okay? But now, instead of having a point estimate, you put a family of functions: you propose a variational family, a family of distributions over Z and theta, indexed by variational free parameters nu. Changing nu changes the function — the variational probability distribution over Z and theta, okay? And what you want to do is, let's say, similar to the EM algorithm: in the EM algorithm, you minimize the KL divergence between your auxiliary function q(z) and the distribution of the hidden variable given the data and the parameters.
So you use the KL divergence as a measure of distance, to obtain the closest solution. In the case of variational inference, you have to fit the parameters nu in order to approximate the full posterior, okay? The solution to your problem is given by the parameters nu that provide the lowest KL divergence between the variational family and the true posterior, okay? Okay, but one way of solving it is to define a lower bound on the evidence: if we subtract from the log evidence the KL divergence that we want to minimize, we obtain a lower bound on the evidence, the logarithm of p(D). And the problem is equivalent: minimizing the KL divergence is equivalent to maximizing the evidence lower bound — the ELBO, for short, okay? So now the problem is to maximize this function, which can be expressed as the expectation under q of the joint distribution of the data and all the unknowns — the hidden variables and the parameters — minus the expectation under q of the log of the variational distribution itself, okay? This is the basic idea, but to apply it you have to fix three different things. The first is: what is your variational family? The simplest one is called the mean-field variational family. Among all the different proposals you can make for the variational function — for your approximation of the full posterior — one of the simplest is to assume that all the latent variables are independent. In this case, you have all the hidden variables z_i and the theta parameters — the theta parameters are the global parameters — so you have a q function for theta with parameters lambda, okay? And you assume independence among all the z_i and also with respect to theta. This gives you the mean-field variational family.
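As an aside, the claim that the ELBO lower-bounds the log evidence — with equality exactly when q is the true posterior — can be checked numerically in a model where the evidence has a closed form. This sketch uses the Beta-Bernoulli model from earlier; the data counts and the candidate q distributions are illustrative choices, and the expectations are approximated on a grid.

```python
import math

# Sketch: ELBO = E_q[log p(D, theta)] - E_q[log q(theta)] <= log p(D),
# checked on a Beta-Bernoulli model with a uniform Beta(1, 1) prior.

def log_beta_fn(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def elbo(heads, tails, qa, qb, a0=1.0, b0=1.0, grid_size=2000):
    """Grid estimate of the ELBO for the variational choice q = Beta(qa, qb)."""
    total, dt = 0.0, 1.0 / grid_size
    for i in range(1, grid_size):
        th = i * dt
        log_q = (qa - 1) * math.log(th) + (qb - 1) * math.log(1 - th) - log_beta_fn(qa, qb)
        log_joint = ((a0 - 1 + heads) * math.log(th)
                     + (b0 - 1 + tails) * math.log(1 - th) - log_beta_fn(a0, b0))
        total += math.exp(log_q) * (log_joint - log_q) * dt
    return total

heads, tails = 7, 3
log_evidence = log_beta_fn(1 + heads, 1 + tails) - log_beta_fn(1, 1)
print(elbo(heads, tails, 2.0, 2.0), log_evidence)   # mismatched q: strictly below
print(elbo(heads, tails, 8.0, 4.0), log_evidence)   # q = true posterior Beta(8, 4): equal
```

The gap between the ELBO and the log evidence is exactly the KL divergence being minimized, which is why maximizing the ELBO and minimizing the KL are the same problem.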
If you have additional information telling you that there is some dependence between theta and one of the Zi, you can construct your own variational family, but if you don't have such information at hand, the mean field is the classical choice, okay? So that is the first thing: how do you factorize the variational family? The second is this: you have decided that all the latent variables and parameters are independent, and now you have to fix the family of functions for each one, for this Q of theta or this Q of Z, okay? One sensible option is to use conditionally conjugate models, because in this setting the posterior will belong to the same family. And one family that is very flexible and that you can make use of is the exponential family, okay? The exponential family includes as particular cases the Gaussian, the Bernoulli, the categorical, et cetera, so it covers a lot of things. Okay, so once you have fixed how to factorize, what the structure of your variational family is and which distributions you make use of, you still have to solve the optimization problem. With simple models, you can use coordinate ascent, okay? Basically, you take the conditional expectation for each one of your families of unknowns while keeping the rest fixed. In this sense the algorithm is similar to the Gibbs sampler, but instead of generating samples you calculate expectations. And it is also similar to the EM algorithm, because you alternate: this could be the equivalent of the E-step in the EM algorithm, and this the equivalent of the M-step. So basically you take the expectation over Z and the expectation over theta alternately. In doing this, you are performing a coordinate ascent.
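A concrete coordinate-ascent example, in the classical conditionally conjugate setup (a univariate Gaussian with unknown mean and precision, mean-field q(mu)q(tau), closed-form updates as in Bishop's chapter 10). The hyperparameter names and defaults are my own:

```python
import numpy as np

def cavi_normal(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=20):
    """Coordinate-ascent mean-field VI for x_i ~ N(mu, 1/tau) with
    conjugate priors mu ~ N(mu0, 1/(lam0*tau)) and tau ~ Gamma(a0, b0).
    q(mu, tau) = q(mu) q(tau): q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n)."""
    n, xbar = len(x), np.mean(x)
    a_n = a0 + (n + 1) / 2                        # fixed by conjugacy
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)   # also fixed across iterations
    e_tau = a0 / b0                               # initial guess for E[tau]
    for _ in range(n_iter):
        lam_n = (lam0 + n) * e_tau                # update q(mu)
        # update q(tau): E_mu[sum (x_i - mu)^2] = sum (x_i - mu_n)^2 + n / lam_n
        e_sq = np.sum((x - mu_n) ** 2) + n / lam_n
        b_n = b0 + 0.5 * (e_sq + lam0 * ((mu_n - mu0) ** 2 + 1 / lam_n))
        e_tau = a_n / b_n                         # feed back into the next sweep
    return mu_n, lam_n, a_n, b_n
```

Each sweep updates one factor holding the other fixed, exactly the E-step/M-step-like alternation described above, and the ELBO is guaranteed not to decrease between sweeps.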
The procedure is exactly like the EM algorithm: you perform the expectation step, you perform this step, you substitute the family you obtained here by the newest one, and you iterate until you achieve convergence. The difference with the EM algorithm is that you are now working with parameters that define distributions, okay? In fact, you are computing a sequence of distributions. Remember that you cannot guarantee that this converges to the true posterior, and in general it does not. It is purely an approximate method, because the solution you provide is the Q function with a given set of parameters ν that approximates the full posterior. It is just an approximation; in general, you cannot achieve the true posterior. Okay, and you can apply this with different variational families, variational distributions and optimization methods. For example, once you have defined the variational family and the variational distributions, instead of the coordinate ascent you can use gradient ascent: you take the partial derivative of the ELBO with respect to the parameters ν and ascend step by step towards the solution. And if your data set is huge, instead of this you can use stochastic gradient ascent, okay? You can use your favorite optimization method here, and in most cases the most suitable choice depends on your data, the sample size, your model complexity, et cetera, et cetera. Okay, that is the general toolkit. Now let's go back to the original problem: what can we obtain from heterogeneous and missing data using this simple latent model with a discrete variable? So we define our heterogeneous data mixture model, which is this one, with this graphical model. We now have a hidden variable Z that factorizes the different data types. So, let's say, it is very easy.
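A minimal sketch of the stochastic gradient ascent variant mentioned above, on a deliberately tiny objective (maximizing a Gaussian log-likelihood in its mean, one mini-batch at a time); the function name, step size, and batch size are my own choices:

```python
import numpy as np

def sga_mean(x, lr=0.1, batch_size=32, n_epochs=50, seed=0):
    """Stochastic gradient ascent sketch: maximize the average Gaussian
    log-likelihood -(x_i - m)^2 / 2 over m, using mini-batch gradients."""
    rng = np.random.default_rng(seed)
    m = 0.0
    n = len(x)
    for _ in range(n_epochs):
        idx = rng.permutation(n)                  # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = x[idx[start:start + batch_size]]
            grad = np.mean(batch - m)             # unbiased estimate of the full gradient
            m += lr * grad                        # ascend
    return m
```

The mini-batch gradient is an unbiased estimate of the full-data gradient, which is what makes the stochastic variant valid; the same pattern scales to the ELBO objectives discussed here.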
Suppose one of the data types, for example the discrete part, is missing. It doesn't matter: because everything is factorized by the hidden variable Z, if you don't have this variable you don't do any kind of estimation, you marginalize, and the model remains the same but without that observation. So this makes the problem easy. Okay, note that even if the data types are factorized by the hidden variable within each component, they are no longer independent in the mixture model: they are connected through the latent variable, and when you integrate out, when you marginalize the hidden variable to obtain the mixture model, all the data types get mixed, okay? So in a favorable situation, you are able to learn your model and obtain a good mix between the different data types. But you still have some problems that are inherent to the data, and one of them is the likelihood scale problem. Basically, you want to obtain the best posterior approximation, but the contributions of the different data types to the overall model through the likelihood can be very different. If you have a binary or discrete variable, you know that its likelihood is something between zero and one, okay? And for binary or categorical variables with a limited number of categories, in general you don't go to the extreme cases; for example, a binary variable taking probability 10 to the minus 3 of being equal to one, because you don't have such an amount of data to be sure that that is the probability. So the likelihood will be between zero and one and will not go to the extremes. But the problem comes from the continuous part, because for the continuous part the likelihood is unbounded, okay?
So here you can have a likelihood of, let's say, 0.3, while there it could be 10 to the 3, one thousand. Or, on the other hand, the continuous likelihood can be extremely low: if your model is complex and your data is multi-dimensional, it is not rare that the likelihood of a continuous data sample becomes something in the order of 10 to the minus 3, 10 to the minus 4, something like this. So when you mix all of this, some data can dominate: the data with higher likelihood in absolute terms dominates, and the data with low likelihood values becomes irrelevant, okay? This is a general problem when dealing with heterogeneous data. There are several solutions for dealing with this kind of problem. One of them is to perform some sort of normalization: you can apply a deterministic transformation and a normalization. With a deterministic transformation, basically what you can do is transform all the discrete variables into continuous ones, and after that do some sort of normalization of the likelihoods so that all the data mixes well into the mixture model. Here you have a particular recent solution; well, this one is for a variational autoencoder, but the recipe is just the same. Another way of solving the likelihood scale problem is to transform the data and work, for example, on the distribution parameters instead of the data itself, because the distribution parameters are all of a similar order; but there are different solutions. Until you have your data and you first train the plain heterogeneous mixture model, you cannot tell whether one of the data types has become irrelevant. So you have to check, and if this situation occurs, you have several recipes to correct it, okay? So much for the heterogeneous data.
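The scale mismatch is easy to see numerically: a Bernoulli likelihood is a probability and can never exceed one, while a Gaussian density is not a probability and can be far above one (narrow variance) or vanishingly small (a point in the tail). A tiny illustration with numbers of my own choosing:

```python
import numpy as np

def bernoulli_pmf(x, p):
    """Probability mass of a Bernoulli outcome: always in [0, 1]."""
    return p if x == 1 else 1.0 - p

def gauss_pdf(x, mean, std):
    """Gaussian density: unbounded above, and tiny in the tails."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
```

For example, `gauss_pdf(0.0, 0.0, 0.01)` is roughly 40, while `gauss_pdf(5.0, 0.0, 1.0)` is on the order of 10 to the minus 6; mixed naively with Bernoulli terms near 0.5, either can dominate or vanish in the joint likelihood.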
The other thing you are dealing with is missing data. If you have a huge amount of missing data, you basically have two options at hand. One of them is marginalization: you marginalize with respect to all the unknowns, okay? That is one solution. The other is to establish a more complex model in which you modulate the probability of missingness with a binary variable, so you are modeling the missingness pattern, okay? There is a little bit of a design decision here. Basically, you add another variable that is also hidden, you estimate it as part of your model, and the observation is just the product of this binary variable and the true value, okay? This allows you to perform the inference and also to learn a little bit about the missingness pattern, because in some cases the missingness pattern itself can be informative, okay? With this piece of the puzzle, you can construct a more elaborate data model. Okay, so now let's transform Z from discrete to continuous. Let's transform the mixture model into a linear factor model. In this situation we obtain, for example, probabilistic PCA, probabilistic principal component analysis, okay? In this case, you transform Z from discrete to continuous, and here the same problem appears that appears whenever you transform a discrete variable into a continuous one. For a discrete variable, everything is specified by the probability mass function: you have those numbers and all the information is contained in them, okay? But for a continuous variable, you have to define a PDF, a probability density function, so you have to choose a family of functions, okay? And the transformation between one data type and the other is not always so simple, okay?
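Before moving on to continuous latent variables, the marginalization of missing entries mentioned above can be sketched for a diagonal-covariance Gaussian mixture: for such a model, marginalizing a missing dimension amounts to simply dropping it from the likelihood. NaN-as-missing and the function name are my own conventions:

```python
import numpy as np

def responsibilities_missing(x, pi, mu, var):
    """Posterior over mixture components for one sample x, with NaNs
    marking missing entries. Missing dimensions are marginalized out,
    which for independent (diagonal) Gaussians means omitting their terms."""
    obs = ~np.isnan(x)                         # mask of observed dimensions
    log_r = np.log(np.asarray(pi, dtype=float))
    for k in range(len(pi)):
        log_r[k] += np.sum(-0.5 * (np.log(2 * np.pi * var[k][obs])
                                   + (x[obs] - mu[k][obs]) ** 2 / var[k][obs]))
    log_r -= log_r.max()                       # stabilize
    r = np.exp(log_r)
    return r / r.sum()
```

With every entry missing, the responsibilities fall back to the prior weights pi, matching the remark that marginalization leaves the model the same, just without that observation.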
To maintain tractability, to keep inference doable, one thing you can do is use a linear factor model, okay? Here you are transforming the discrete variable into a continuous variable in a K-dimensional space, okay? The assumption, and this situation is very common, is that there is a low-dimensional subspace where all the variability of the data resides, okay? So what you are doing in going from X to Z is reducing the dimensionality of your data, and your problem is to find the linear subspace that contains most of the variability of your data, expressed as a continuous latent variable model, okay? You have to put a prior distribution over Z, over the latent space, and you can assume a very simple one: a normal distribution with zero mean and independent components, okay? The transformation between Z and X is governed by the model parameters: for each dimension of the hidden space you have a vector W, okay? To transform Z into X, you take the dot product between the components of Z and the W vectors, and each one generates a vector of dimension D, the dimension of the data, okay? You can arrange all of these into a matrix W, okay? You have to put a prior on your model parameters, and you can assume a Gaussian prior with zero mean and identity covariance matrix. And for the observations, your likelihood is a normal distribution with mean the product of W transpose times Z and, if you don't have other information, a covariance of independent noise components, all of them with the same variance. For solving this, you can use your preferred algorithm, for example the EM algorithm.
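A quick generative sketch of this model: sample z ~ N(0, I_K), map linearly into D dimensions, and add isotropic noise. A standard consequence, which can be checked empirically, is that the marginal covariance of x equals W^T W + sigma^2 I. The row-wise storage of W and the names are my own conventions:

```python
import numpy as np

def sample_ppca(W, sigma2, n, rng):
    """Draw n samples from probabilistic PCA: z ~ N(0, I_K),
    x = W^T z + eps with eps ~ N(0, sigma2 * I_D).
    W is stored row-wise with shape (K, D), so x = z @ W + eps."""
    K, D = W.shape
    z = rng.normal(size=(n, K))                            # latent coordinates
    eps = rng.normal(scale=np.sqrt(sigma2), size=(n, D))   # isotropic observation noise
    return z @ W + eps
```

Comparing the empirical covariance of a large sample against W^T W + sigma^2 I is a cheap sanity check when implementing the EM or variational updates for this model.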
The development is similar to the previous one. You have to do the math, but in the expectation step you basically take the posterior probability of Z given X and W, and after that you maximize over W given the values of the hidden variables and the observations, okay? Or you can use the variational method: you construct a mean-field or another kind of factorization for the variational function, and you apply your favorite optimization method, let's say coordinate ascent, gradient ascent, stochastic gradient ascent, whatever you like, okay? That is all there is to the linear factor model. But you can also introduce a more complex transformation than a simple linear combination of Z and W. With this, we get to a recent model that exhibits the same structure but with a deep neural network inside: the variational autoencoder. The variational autoencoder is nothing more and nothing less than a continuous latent variable model, but now the transformation between the hidden variable and the observation is no longer linear; it is governed by a deep neural network, okay? You have the same distribution for the hidden variable, and the distribution for the observation is the same, a normal one, but both the mean and the covariance matrix are no longer a linear combination; they are the output of a neural network that takes the Z vector as input, okay, with parameter vector theta. This is called the decoding network, okay? The network that goes from the low-dimensional manifold to the observations. We are no longer talking about subspaces, but manifolds, okay?
One of the key points is that the variational autoencoder is structurally exactly the same kind of model, so you can work with the same kinds of methods, et cetera. But the problem now is how to do the inference, how to fit all those theta parameters that determine the neural network, because the number of parameters can now be huge depending on the layers: you can have thousands and even millions of parameters. So the inference problem is now very difficult. For doing this, the variational autoencoder, whose name comes from the variational inference method, considers an approximation Q of Z to the posterior of Z. We are doing the same as before: we want to maximize the ELBO, or minimize the KL divergence between the posterior probability of the hidden variable and the variational family we are making use of, okay? So our objective, as in the previous variational methods, is to maximize the ELBO. First we need to select a variational family for Q of Z. Here, the family is defined by the inference network, or encoding network, okay? It is the network that defines your variational family, okay, because your Q is given by a distribution governed by mu, the output of the neural network, and sigma, which is also the output of a neural network, okay? And now the maximization of the ELBO takes this form, okay? These are the formulas. Recall that the prior probabilities of the hidden variables are still independent and identically distributed Gaussian variables, and Q is the variational family that includes the neural network. So once we have defined the ELBO, we no longer use coordinate ascent, because it is very difficult.
We attack the maximization of the ELBO directly using gradient ascent, okay? First, you have the parameters of both your neural networks, and you have to determine the gradient of this version of the ELBO with respect to your model parameters, okay? For solving it, you can use stochastic gradient ascent: you divide your data into pieces, into mini-batches, and estimate the gradient of those parameters on each mini-batch. The use of mini-batches is because, to determine a neural network with thousands of parameters, you need a lot of data, and it is convenient to proceed mini-batch by mini-batch, okay? In addition to the KL divergence, which is one of the terms, we also need an unbiased gradient estimate for the second term. For this we can use Monte Carlo: it is an expectation, so we can use a Monte Carlo sampling estimator and then take the gradient of that, okay? But the computational problem is still very hard if you collect a huge number of samples. All you need here is an approximation of this expectation, so you can use a simple one: a Monte Carlo sampling estimator with a single sample. Its variance is huge, okay, but the computation is very cheap, okay? So now you have the gradient estimator for one of the terms, the KL divergence, and an approximation for the expectation term of the ELBO, and you need to compute the gradient with respect to the network parameters, okay? This is still a hard problem, and we can make use of what is called the reparameterization trick: expressing a sample as a deterministic function of a noise variable, in such a way that the sample depends on the network parameters through that function.
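The reparameterization idea can be shown on a scalar toy problem: to differentiate E_{z ~ N(mu, sigma^2)}[f(z)] with respect to mu, write z = mu + sigma * eps with eps ~ N(0, 1) and differentiate through the sample. For f(z) = z^2 the true gradient is 2 * mu, so the estimator can be checked exactly. This toy objective and the names are mine, not the VAE's actual ELBO:

```python
import numpy as np

def reparam_grad_estimate(mu, sigma, n_samples, rng):
    """Pathwise (reparameterization) gradient sketch: each sample
    z = mu + sigma * eps contributes d(z^2)/d(mu) = 2 * z, an unbiased
    estimate of d/d(mu) E[z^2] = 2 * mu."""
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps        # sample depends deterministically on mu
    return np.mean(2 * z)       # average the per-sample pathwise derivatives
```

With `n_samples=1` the estimate is unbiased but high-variance and extremely cheap, exactly the trade-off of the single-sample Monte Carlo estimator described above.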
For the Gaussian distribution, it takes this form; in some sense it is similar to a Laplace approach. And when we put all of these together, we have a low-variance gradient estimator, okay? With all of this, once you have the gradient update, the only thing left is to put everything together and construct the algorithm. There are also additional models that we have not covered here. For example, there are the mixed membership models; one case is LDA, Latent Dirichlet Allocation. There is also matrix factorization. And there is a very important family of models that we have not covered, the Bayesian non-parametric models, in which you don't have to assume a fixed K: instead, you can assume that K is infinite while still being able to compute everything. Here you have nice overviews of those models. You can even have a mix between the mixture model and the continuous latent variable model: continuous and discrete latent variable models. Let's say: what do you want to obtain? With a mixture model, you obtain some sort of separation of your data into homogeneous groups, so it lets you organize the data. With a latent continuous model, you obtain a low-dimensional representation of your data. If your data has more structure, you can exploit it using both continuous and discrete latent variables: just as we had heterogeneous data with a mix of continuous and discrete observations, the latent variables can also be a mix of continuous and discrete. I will skip all the Markovian models, and I want to conclude with some short take-home messages.
Well, all the models we have seen here, and the ones we were not able to cover, provide you, unlike plain machine learning, not only an estimate of the parameters but a distribution over the parameters. So they provide an assessment of the uncertainty of your problem, and also the possibility of generating new samples, as in generative adversarial networks. But remember that you can use these basic elements, these pieces of the puzzle, to construct your own model, the model that best fits your own data, okay? Because perhaps your data doesn't fit any of the models you can encounter in the software libraries. And once you have fixed your data model, it is very important to take care of the inference method and to keep the complexity of your inference method at a level that is doable. Okay, thank you very much. This concludes my lecture, and if you have some questions, I'll be happy to answer. Thank you, Antonio, for this excellent overview and introduction. Now, I propose that we take questions alternately from Slido and from the room. I'll start with the Slido channel. One question was just asked. The comment is: excellent presentation. Are there any innate limitations of the ELBO, for example regarding posterior collapse? Have other regularization losses been explored for the VAE, Antonio? Well, yes. When you construct your ELBO, the most important thing is that the maximization is possible and doable. For example, if you look at the variational autoencoder, if you don't use the reparameterization trick, the computational problem will be huge. Also, for example, in some mixtures of continuous and discrete variables, the inference problem may become similar in complexity to estimating the evidence of the model.
So one of the critical parts is to keep the computational side of your inference method manageable. Manageable in terms of sample and data size, which can be handled using mini-batches and stochastic gradient optimization, but also in terms of the ability to reach the solution. There is another part here: first you have to reach a reasonable solution, and you have to check whether your solution is close to the optimum or not, so you watch how the ELBO, or its different terms, evolves while you are optimizing. You have to monitor everything in order to guarantee that you reach a reasonable solution and that this solution is feasible. In some cases you can even propose a very beautiful model for which you are unable to reach the solution, because your inference problem is very, very ill-conditioned. When you have an ill-conditioned problem, there is a lot of literature about regularization techniques to make it possible. So you have to iterate over your data, your model, your inference, until you achieve something that is reasonable. There will even be a workshop at the next NeurIPS dedicated to situations in which you have done all of this and still haven't achieved a reasonable solution. Thank you, Antonio. Then we have time for one more question from the network, and I see that Joe Barney has a question. Hi, first of all, thank you for the talk, it was really great. My question is: you mentioned earlier a few different methods for performing inference, Laplace approximation, variational inference, MCMC. Which criteria should I use to choose which one to employ, if I had, say, limited computational power or a low amount of data, something like that? Well, first of all, the Laplace approximation, except in very simple models,
does not provide a high-quality solution, because it is too simple: basically you fit a Gaussian to each one of the unknowns, okay? The MCMC methods, in some cases, are very slow to converge, and at the end you don't have any explicit representation of your posterior; the only thing you have are the samples. If you are fine with that, okay. And you have to solve additional problems, because, for example, if you have a discrete variable you have to cope with the label switching problem, although you still have some tricks, like marginalizing some of the variables, Rao-Blackwellization, et cetera. So if you are fine with samples, it could be fine. With respect to EM and the variational methods: EM only provides you a point estimate, okay? If you are fine with a point estimate, okay, and in some cases it can be simpler. If you want some information about the uncertainty or the sensitivity of your problem, you can go to the full probabilistic methods, either the variational or the MCMC ones. These are the basic tools, and you have to iterate and do a literature search to find a solution to the problems with your model. There is no universal recipe; the only recipe I can give you is: propose a model, do the inference, see if you are able to refine the model, refine the inference, and go on until you reach a level at which you are happy with the result. Thank you, Antonio. Thank you for answering the questions, and also for this great start into the summer school.