Thank you very much, and thanks a lot for the invitation. Thanks to everybody for being here, on your couch I hope. I'm neither a real computer scientist nor a real mathematician, so I'll try to speak to everyone. I've been interested in health for many years, and in brain imaging for many years, but recently I've been looking at other kinds of data. These other kinds of data have led to new kinds of problems, new challenges, and new collaborations. So I'll be talking about some work that came out of this, which is about supervised learning with missing values. This is work that was done with Julie Josse, Erwan Scornet, Marine Le Morvan, Nicolas Prost, and Thomas Moreau.

The problem of missing values is that in some data-analysis settings we only have partially observed examples. A very typical situation is in the social sciences, where data are acquired with questionnaires: you have people fill in the questions, and some people will not fill in the questionnaire fully. That's the first example. Another example is data integration: when integrating data from different databases or different tables, you might have missing correspondences across tables. Somebody may be in one table but not another; an entity such as a company may be in one table but not another. If you do the join to create the table you need for the analysis, you'll end up with missing values. And finally, you might have measurements that are simply not performed: for instance, due to the urgency of a given patient, you might not measure a patient's weight if that patient is bleeding a lot. So the problem of missing values is really ubiquitous in health and in the social sciences. Conversely, it's not very present in other kinds of data; when I was doing medical imaging, it was much less of a problem.

One thing I'd like to stress about the examples I just gave is that missingness is noise, but it is also a signal. If somebody is given a questionnaire and chooses not to answer some questions, there might be information in that: maybe that person doesn't know, or doesn't want to answer. If some measurements were not performed because the patients were too badly ill, the mere fact that those measurements were not performed tells us that the patient was badly ill. Conversely, the fact that some measurements were performed tells us something about what the doctors suspected. So this is the setting, and the question I'm interested in is: given such data, how do we build good predictive models?

The outline of my talk is as follows. First I'll explain the classical settings, both for supervised learning theory, which I suspect everyone here is familiar with, and for the classical missing-values framework, which has a long-running history in statistics. Then I'll discuss how we can adapt existing learning procedures by borrowing tools from the classical missing-values framework, and I'll show that there are a few new results here. Then I'll look at the seemingly simple problem of a linear data-generating mechanism, and I'll show that the optimal predictor is not linear.
And finally, I'll take a differentiable-programming point of view on this and introduce what is basically a neural-network architecture motivated by the previous work and by a specific theory that draws on the classical missing-values framework. If you have any questions, the chat is open on my second screen and I'm monitoring it, so please do not hesitate.

So let me first discuss the settings, starting with supervised learning theory and then the classical missing-values framework. This is based on a long paper we wrote, where we really made the effort of stating, in a consistent framework, the main prior art from the two bodies of literature, supervised learning and the missing-values framework.

Let's start with supervised learning theory, in a fairly classic setting. I'm given pairs (x, y), which I'll take as drawn i.i.d. from two spaces X and Y. My goal is to find a function from X to Y such that f(x) is close to y for some notion of distance. Typically I'll consider a loss that captures this notion of closeness and gives me a measure of error on the space Y. Then I'll be interested in questions such as: can I get close to what I'll call the Bayes predictor, which is the function that gives me the minimal expected loss? So what I'm really interested in is this expected loss, and in finding a function with a small expected loss. As a small comment, if I use certain losses on certain problems, I will estimate different statistical quantities; for instance, with the quadratic loss, the Bayes predictor is the conditional expectation of Y given X. That last statement is quite important to me, because there are many settings where people are not interested in prediction in the engineering sense, but they are definitely interested in conditional expectations. So what we have here is a toolset that enables us to estimate conditional expectations.

There is then a body of literature on procedures to estimate a function from training data. I'll call a learning procedure something that gives me an estimated function from a train set, where a train set is a finite collection of pairs (x, y). I'll say that a procedure is Bayes consistent if, asymptotically, its risk converges to that of the Bayes predictor. This is the notion of consistency that is common in statistics, but focused on the risk. It's important to keep in mind that this notion tells me nothing about any parameters of the estimated function; that function may not even be unique, and that is not important to me. My goal is really to minimize the expected loss. A very common procedure to tackle these problems is to minimize the empirical risk instead of the expected risk, possibly with some regularization, within a specific function class. The choice of the function class and of the procedure used to minimize this risk is basically the science and the art of supervised learning, which I'm sure you know well. So this is the setting for supervised learning, and once again I'm stressing that I'm really interested in controlling the expected loss and not much else. Now let's have a look at the classical results of the missing-values literature.
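To make the definitions above concrete, here is one way to write the expected risk, the Bayes predictor, and the squared-loss special case (my own notation, paraphrasing the slide):

```latex
\mathcal{R}(f) = \mathbb{E}\big[\ell\big(f(X), Y\big)\big],
\qquad
f^\star \in \operatorname*{arg\,min}_{f:\,\mathcal{X}\to\mathcal{Y}} \mathcal{R}(f),
\qquad
\ell(u, y) = (u - y)^2 \;\Rightarrow\; f^\star(x) = \mathbb{E}[Y \mid X = x].
```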
Here, let me be a bit precise with my notation, because the challenge with missing values is that the observations do not naturally live in a vector space: they have missing entries. In the classical framework, there is full data X that lives in a vector space, and a missingness indicator M, a binary mask of the same dimension as the full data, which tells you whether each feature was observed or not. What I really observe is the incomplete data, which lives in a space where each coordinate belongs to the union of my original space and a symbol I'll call NA, which tells me that I did not observe that entry. To give you an example realization: my underlying data is given by this vector, my mask tells me that I'm observing only a fraction of the entries, and so what is really given to my learning procedure is the following incomplete data. I can also introduce the notation X_obs and X_mis for the observed and missing parts; together they form the full data, but each of them contains only the observed or only the missing entries. The challenge, and I'll come back to this, is that I'm working in this space, which is a bit of an awkward space.

Now, a very important classical result of the missing-values literature, established by Rubin in 1976, is set in the framework of parametric likelihoods. The setting is the following. I assume there is a data-generating mechanism with a distribution parameterized by theta for the complete data X, and there is another random process, with another distribution, that generates the mask M. The goal in statistical inference is to estimate theta, the parameter of the complete-data generating process. If I write the full likelihood, then, given that I only have the observed data and not the missing data, I need to take an expectation over the missing-values mechanism, and for this I need the details of that mechanism. This is annoying, because it's a data-generating mechanism that I'm not interested in. We can do something else, maybe not more natural but easier, which is to ignore it completely: I just marginalize over the non-observed entries using the data-generating distribution of the complete data. The very important result by Donald Rubin tells us that in a certain setting it is legitimate to do this, that is, to ignore the missing-values mechanism. For this, I need to be in a situation known as missing at random. This is an ad hoc assumption which says that the probability of missingness does not depend on the non-observed values: if I have two vectors with the same observed values, whatever their non-observed values, the missingness mechanism gives them the same distribution of the mask.
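In symbols, and paraphrasing the slide with my own notation, the missing-at-random condition and the likelihood that ignores the mechanism read roughly:

```latex
\text{MAR:}\quad P(M = m \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(M = m \mid X_{\mathrm{obs}}),
\qquad
L_{\mathrm{ign}}(\theta \mid x_{\mathrm{obs}}) \;\propto\; \int p_\theta(x_{\mathrm{obs}}, x_{\mathrm{mis}})\,\mathrm{d}x_{\mathrm{mis}} .
```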
In this first setting, known as missing at random (MAR), maximizing the likelihood that ignores the missingness mechanism gives the same maximum-likelihood estimates for the parameters of the full-data model as maximizing the full likelihood. Basically, I can maximize the likelihood of the data-generating process that I am interested in, forgetting about the process that I'm not interested in, namely the missing-values mechanism. Now, this MAR assumption is a bit complicated, and there is a special case which is much easier to understand, missing completely at random (MCAR), which says that M is independent of X. I'll claim that these two assumptions are often not very realistic: very often, the missingness is related to the value it is, quote unquote, hiding. If I ask you how much money you make, you are more likely to answer if you are in the middle of the distribution than if you are at the upper or lower end. This brings us to the missing not at random (MNAR) situation, in which the mechanism is not ignorable, and inference is harder because we must explicitly model the mechanism.

Here are some intuitions. This is the complete data-generating process; the partially observed values are shown in white. Here I'm showing missing completely at random, where I'm basically dropping values at random, and here a censoring process, which is an extremely brutal process: I'm dropping values above a threshold. We can clearly see that the censoring process biases the distribution, whereas missing completely at random does not, so censoring leads to a much harder statistical-inference setting.

The missing-at-random assumption and the notion of ignorability have been used to derive many estimation procedures for missing-values settings. The most famous one is the expectation-maximization (EM) algorithm, which I believe was historically introduced for missing values. The idea is to optimize the likelihood that ignores the missing-values mechanism by alternating an expectation step, taking the expectation of the likelihood over the non-observed values, and a maximization step over the resulting expression. Another approach, often simpler in practice because the challenge with expectation maximization is that it requires coding a new routine for each new likelihood (not the end of the world, but still), is to use imputation. An imputation routine models the conditional distribution of the missing values given the observed ones, and uses it to create completed data, for instance by imputing with the conditional expectation. This basically emulates the expectation step in the likelihood that ignores the missingness mechanism. On the completed data we can then apply a standard routine from our favorite package to maximize the complete-data likelihood. We can do slightly better with multiple imputation in the second step, which samples several plausible imputations from the conditional distribution rather than taking a single value such as the conditional expectation.
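As an illustration of this imputation route (not the procedure used in the work presented here, just a minimal sketch with scikit-learn), conditional imputation and multiple imputation could look roughly like this, assuming missing entries are coded as np.nan:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy incomplete data: np.nan marks the unobserved entries.
X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 9.0]])

# Single imputation, approximating the conditional expectation of the
# missing entries given the observed ones.
X_completed = IterativeImputer(random_state=0).fit_transform(X)

# Multiple imputation: draw several plausible completions instead of one,
# then run the complete-data analysis on each and pool the results.
X_draws = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]
```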
Now, in prediction settings, those two procedures must be adapted to work on out-of-sample data, because the naive way of writing imputation or expectation maximization does not separate a fitting procedure from an inference procedure applied at test time. And we immediately hit one problem: the predictive model is applied on partially observed test data. If you give me test data, it is going to have holes in it, and I need a predictive model that works on data with holes in it.

So our setting is really a merge between the two frameworks I've presented. We focus on risks and not likelihoods, so the core results around the MAR assumption, the historical ones, do not immediately apply, and some of the dogmas may not carry over. There are missing values at test time, so we need a function that can predict on incomplete inputs. And there is a challenge: if we just take our textbook statistical-learning recipes, for instance empirical risk minimization, the function we need to learn lives in a semi-discrete space (and sorry for the typos in the notation on the slide, it doesn't quite make sense as written). This semi-discrete space poses problems simply because it is harder to optimize over, and we easily fall into combinatorial optimization problems.

Okay, so now I'll give a set of results that we introduced over the last few years, which basically bridge this gap, starting with a few results on how to adapt classical learning procedures to work with missing values. The first thing we can do is use imputation at test time. Suppose we're given f*, the Bayes predictor on the fully observed data; I'm not telling you how we got it, maybe we had a large, fully observed training set. What this expression says is that, using the expectation over the conditional distribution of the missing entries given the observed ones, I can build the Bayes-optimal predictor on partially observed data. To compute this expectation I can use multiple imputation by sampling: at test time, I sample multiple imputations and average the predictions. This holds provided I have the Bayes predictor on the fully observed data, which is a big if. One comment is that, in general, single imputation is not consistent: I cannot take a Bayes predictor for fully observed data and convert it into a Bayes predictor for partially observed data with a single imputation. By the way, nowhere am I saying that I will perform as well on the partially observed data as on the fully observed data; there will be a cost, a drop in performance. What I am saying is that I perform as well as possible given the missing data, and that distinction is important. People often want to perform as well on partially observed data as on full data; in general, this is not possible.
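Written out, and hedging on the exact conditions stated in the talk and paper, the test-time construction above is roughly:

```latex
\tilde f^\star(x_{\mathrm{obs}})
\;=\; \mathbb{E}\big[f^\star(X) \,\big|\, X_{\mathrm{obs}} = x_{\mathrm{obs}}\big]
\;\approx\; \frac{1}{K}\sum_{k=1}^{K} f^\star\big(x_{\mathrm{obs}}, x_{\mathrm{mis}}^{(k)}\big),
\qquad
x_{\mathrm{mis}}^{(k)} \sim P\big(X_{\mathrm{mis}} \mid X_{\mathrm{obs}} = x_{\mathrm{obs}}\big).
```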
Now, another procedure, which is a bit brutal, is to impute by a constant: each time I have a missing value, I replace it by a constant alpha, chosen however I wish, for instance as the mean on the train set. I do this over all my data, the train set and the test set. I'll assume a few things, which are basically regularity assumptions. First, I assume that the regression function, the link between X and Y, is sufficiently regular. Second, same idea, I assume that the missingness mechanism is also sufficiently regular; here we made the assumption that there is only one variable with missingness, and the important point is that the function giving the probability of missingness is continuous. Given these assumptions, we can show that the Bayes predictor after constant imputation equals the Bayes predictor on the original incomplete data almost everywhere: they are almost everywhere the same function. As a consequence, the procedure that does constant imputation followed by a universally consistent learner gives a predictor that is consistent almost everywhere. The "almost everywhere" is a technical detail, but the reason for it is that there can be collisions: if my data contains a feature value that collides with the imputation constant, then I am not consistent there. If my features are continuous and noisy this won't happen, but if my features are discrete and I've chosen alpha as a value that the feature takes, it will. This is, by the way, an immediate argument for choosing mean imputation rather than median imputation, because median imputation will create exactly such collisions.

This is quite interesting, because it tells us that imputing by a constant, for instance the mean, is not a stupid idea if I'm interested in prediction. The reason is that my learning procedure will capture this and compensate for it: I'm creating something abnormal in the distribution of the imputed data, but my learner is able to detect and compensate for it, because it is universally consistent. And this result stands in strong opposition to classical missing-values practice, which says that constant imputation is disastrous because it strongly distorts the distribution. The reason we depart from those good practices is that we are interested in different goals, namely risk minimization, and we use extremely non-parametric models, namely universally consistent learners.

So we can adapt supervised learning procedures, and this leads to different tradeoffs than classical statistical inference. This, to me, is quite important and I'd like to stress it: we have different goals, we have different tools, and hence we are not tied to the classical good practices; good imputation is not necessary, as I've shown you. In our paper we also looked at the risk of tree-based models such as random forests. These are interesting, and they are used a lot with missing values, because they can naturally optimize over semi-discrete spaces, just like they naturally handle categorical data, since they basically perform a greedy combinatorial optimization. And they are used a lot in practice on the kinds of data that have missing values.
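A minimal sketch of this "constant imputation plus flexible learner" recipe, with scikit-learn and a random forest standing in for the powerful learner (not the exact experimental setup of the paper):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=500)
X[rng.random(X.shape) < 0.2] = np.nan  # MCAR-style missingness

# Impute every hole with the train-set mean (the constant alpha),
# then let a flexible learner compensate for the distortion.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestRegressor(n_estimators=200, random_state=0),
)
model.fit(X, y)
# At test time, the same train-fitted constants fill the holes before predicting.
```

For the tree-based remark: some gradient-boosted tree implementations, for instance scikit-learn's HistGradientBoostingRegressor, handle np.nan natively and need no imputation step at all.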
Now I'd like to switch gears and consider a parametric setting, basically a setting with a linear data-generating mechanism. I'll show you that this setting is actually quite rich and has a few aspects that may be surprising. This is work that was done by Marine Le Morvan with Nicolas Prost, Erwan Scornet, Julie Josse and myself, and was published at AISTATS last year.

The setting is a linear generating mechanism: Y is a linear function of the fully observed, sorry, the complete data X. However, we observe Z, which is only partially observed: it is X masked by M (sorry, there is a change of notation compared to the earlier parts, I copy-pasted things too fast yesterday). The first thing we see is that the optimal predictor may not be linear. Let me introduce a very simple example: Y is the sum of X1 and X2 plus noise, and there is a nonlinear link between X2 and X1, an exponential function. When I observe only X1, the optimal predictor is obtained from this formula, and it is written as the sum of X1 and the exponential of X1. The nonlinearity is introduced by the link between the two variables: when I remove one variable, the best thing the model can do is use this link. So even with a linear generating mechanism, the Bayes predictor may not be linear.

So we need extra assumptions, and basically we assume that our covariates are Gaussian, using a classical assumption from the missing-values literature: X conditionally on M is Gaussian. That is, for a given missing-value pattern, there exist a mu and a sigma such that X follows a Gaussian distribution with that mu and sigma. These mu and sigma may depend on M or not; if they are independent of M, we are back in a missing-completely-at-random setting. In this setting, we can show that the optimal predictor is a polynomial in X and cross-products with M, and the problem is that this polynomial has 2 to the power d terms. The optimal predictor is a linear function of M, to which I add the products of M and X, and then the two-term combinations of M and the X's, each with its own coefficients. We immediately see the problem: 2 to the power d terms, a fairly complicated expression. Since it is a polynomial, I can fit it with a linear model on an expanded basis, and if I use ordinary least squares, then unsurprisingly the finite-sample risk is on the order of 2 to the power d over the number of samples. In other words, the sample complexity of this procedure scales with 2 to the power d: to guarantee a given estimation error, I need a number of samples on the order of 2 to the power d, which is related to the complexity of the expression. So this is bad news.

But I can twist the problem and consider it as follows: my optimal predictor is piecewise affine. For a given combination of missing features, it is an affine function. And a function that is piecewise affine is a function that I can learn with a multilayer perceptron with rectified linear units.
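One way to write this piecewise-affine form (my paraphrase of the structure described above, not the exact notation of the paper) is one affine function per missingness pattern:

```latex
f^\star(X, M) \;=\; \sum_{m \in \{0,1\}^d}
\Big( \big\langle \beta_m,\, X_{\mathrm{obs}(m)} \big\rangle + b_m \Big)\, \mathbf{1}\{M = m\},
```

hence the 2^d sets of coefficients mentioned above.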
Rectified linear units are piecewise affine. And what we can show is that we can apply a multilayer perceptron with ReLU non-linearities to the concatenated vector made of X imputed by zeros on the missing values, with M appended below it. For this procedure to be consistent, unfortunately, with a single hidden layer, which is the setting we studied, I need a width of 2 to the power d. The thing is, we can reduce this width and hopefully capture some structure in the Bayes predictor with the multilayer perceptron: heuristically, we control the model complexity by reducing the width.

We ran experiments; there is a lot of information here, so I'll go over it quickly. Our experiments show that in missing-completely-at-random settings, where imputation and EM should work well, they indeed do: imputation, in red, and EM, in green, give good predictions both in small-sample and large-sample settings. But constant imputation with a linear model does not work well; it is not terrible, but it is not consistent, as you can see here. What we are doing with these lines is varying the width of the MLP, the multilayer perceptron. By varying the width we explore different tradeoffs: with a lot of data we are better off with a very wide MLP, and with little data we are better off with a narrower one. Now we can do the same with more complex missing-values generating mechanisms, and then our procedure becomes interesting compared to the others, because it becomes extremely hard, and in fact impossible in terms of consistency, for the other procedures to capture the more complex missingness mechanisms. In particular, in non-ignorable settings, imputation and EM cannot work, whereas our MLP-based procedure works and converges. The interesting thing is that we are using a very flexible model, the multilayer perceptron: we know it is consistent if it is wide enough, but more importantly, because it is very flexible, it is robust to violations of the model.

To summarize this part: the linear predictor, even with constant imputation, and even if we optimize the constant, is not consistent. So if we go back to our previous result, which says that constant imputation works: yes, it works, but only with a very rich predictor. Even in linear settings, with a linear data-generating mechanism, a linear predictor with constant imputation is imperfect. I can do better by using a polynomial expansion of the mask, with combinations of the mask, but this incurs a large sample complexity. So I am better off replacing this with a multilayer perceptron, which is consistent, and which requires many hidden units if I have no assumption and no structure on my missingness mechanism, but adapts when there is more structure.
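A minimal sketch of the "MLP on the zero-imputed features concatenated with the mask" idea (my own toy illustration with scikit-learn's ReLU MLP, not the paper's experimental code):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def with_mask_features(X):
    """Zero-impute the missing entries and append the missingness mask."""
    mask = np.isnan(X).astype(float)
    return np.hstack([np.nan_to_num(X, nan=0.0), mask])

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = X.sum(axis=1) + 0.1 * rng.normal(size=2000)
X[rng.random(X.shape) < 0.3] = np.nan  # holes at train (and test) time

# One hidden ReLU layer; its width trades expressiveness for sample complexity.
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
mlp.fit(with_mask_features(X), y)
```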
Okay, one last part; I guess you're all very tired, and I only have intuitions here. Given everything we've seen, what we do now is craft a dedicated architecture to approximate the Bayes predictor. This paper was published at NeurIPS last year and is work that was led by Marine Le Morvan, with Erwan Scornet, Julie Josse and Thomas Moreau.

The intuition is the following; if you're very tired, just stick around for this intuition slide, the rest is technical details. Suppose the simple setting where my output Y is a linear function of two inputs, X1 and X2, and those two inputs are correlated. If one of the two is missing, the Bayes predictor will use the correlation between them, and hence needs to modify its coefficient on X1: if X2 is missing, I need a different coefficient for X1. The point is that there is a link between those coefficient values, and this link is driven by the correlation of X1 and X2; in general, with multivariate data, it is driven by the covariance of the data. Each time I have a different missing-value pattern, I need to adapt the coefficients of my linear model, but I make this modification to account for the covariance of my covariates. The challenge is that there are many missing-data patterns, 2 to the power d possible patterns, and if I need to learn all those coefficients independently, I fall back to the previous problem: many coefficients to learn, and it's hard. So we need to model the link between those coefficients.

For this, we assume that we are in a linear model and that the data is Gaussian with a given covariance. In missing-completely-at-random settings, we can write the optimal predictor as follows. Here we have a fairly standard linear model on the observed data, and here we have another term, which uses the covariance of the observed data and the cross-covariance between the missing and the observed data, to capture precisely this link between the coefficients. This term is really the important one; it comes from the Gaussian assumption on the data, and with it the covariance structure appears in the optimal predictor. We can go a bit further and look at other missing-values mechanisms, for instance Gaussian self-masking, where the data... okay, I'm out of time, so I'll skip over this.

What I can tell you very quickly is that we take those expressions and approximate them with a differentiable approximation, which basically unrolls a series that approximates the matrix inverse. If we do this, we can come up with a dedicated architecture that approximates the previous expressions well. The important thing is that this dedicated architecture works much better than MLPs, whether wide or deep, because it needs far fewer parameters to approximate well. In practice it predicts well in more settings, including missing-not-at-random settings, and it predicts better than expectation maximization or imputation when their assumptions are violated, in non-MAR settings, or in high dimension, because in high dimension those procedures struggle. I'll stop here and take a few questions. Thank you.
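As a side note, the "unrolling a series that approximates the inverse" step can be illustrated with a truncated Neumann series; here is a tiny numpy sketch of the numerical idea only, not the actual architecture:

```python
import numpy as np

def neumann_inverse(A, n_iter=10):
    """Truncated Neumann series S_{k+1} = I + (I - A) S_k, which converges to
    A^{-1} when the spectral radius of (I - A) is below 1."""
    identity = np.eye(A.shape[0])
    S = identity.copy()
    for _ in range(n_iter):
        S = identity + (identity - A) @ S
    return S

# A well-conditioned covariance-like matrix of the observed coordinates.
sigma_obs = np.array([[1.0, 0.3],
                      [0.3, 1.0]])
approx = neumann_inverse(sigma_obs, n_iter=15)
print(np.allclose(approx, np.linalg.inv(sigma_obs), atol=1e-4))  # True
```

Roughly speaking, a few such unrolled iterations stand in for the exact inverse appearing in the optimal-predictor formula, which is what keeps the number of parameters small.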
Thank you very much again for this nice talk. Are there any questions? Maybe I could start with one: could you talk a bit about applications? Is this applied to imputation of missing data, for instance in medical diagnosis, or in engineering? I know that modern data are often recorded on graphs or images; how can you deal with that?

It's a specific interest of mine to no longer look at those structured signals. Rather than an application, I'll tell you an anecdote. A year ago I was in Montreal, and I came back here because of the COVID situation, to work with many different people in the hospitals. The data that they have in large amounts is not imaging data. Imaging data is expensive, and my friends who are medical doctors tell me that imaging data comes too late: you make an image of someone when that person is already not doing well. The data that is there in huge amounts is very, quote unquote, stupid data, which has its own problems, and I could talk about this at length, but they are mostly database problems. There are missing values because a questionnaire was not filled in correctly, missing values because the join was wrong, missing values because somebody used a different convention. Here we really chose to focus on those settings. Now, if you want to do missing values on images, I guess there are two options. Either you have a few voxels or pixels missing in your image, in which case you should really use the image-processing literature and the structure of the image, with local filters, to do what is known as inpainting. Or you have a full image that is missing; but in that case, what you give to your predictive procedure is not the full image, it's a descriptor extracted from the image, and then we're back to the settings I've described here.

Okay, that's clear, thank you. I think we have another question. Yes, can you hear me? Yes. Thank you a lot for the very interesting talk; I'll try to ask my question quickly. Basically, the message I get from the summary slide and from your talk is that if you have a missing-not-at-random mechanism, you basically have 2 to the power d different models. So in principle you would theoretically have to learn each of them separately, one per missing pattern, potentially completely different from each other, but in practice you hope that close missing patterns share commonalities between their models. So my question is: have you compared to any approach based on something like multitask or transfer learning, where each pattern is a task, and where you have a very natural notion, the distance between missing patterns, to define a closeness relationship between these tasks?

I think that's a very good question. Implicitly, what we're doing at the end does this; not exactly the way you said it, but it can be seen as a multitask setting. You could definitely use other multitask learners. What we think is that many, though not all, situations are well approximated by the setting we describe, with its simple assumptions, and in that setting we do hope that our procedure is optimal.
We're currently improving quote-unquote details, which are not really details, of the procedure, and they make it much more robust to finite samples and even to non-linearities. With those improvements, we hope to move to fairly intensive benchmarking on real data, to test what happens when we are no longer in our toy settings; our assumption is that the toy settings are good local approximations of real settings. I think my time is up, so I'll stop my answer here. Yes, I think we will stop here. There is a question for you in the chat box; I would suggest that you answer it in the chat box, or perhaps reply to them directly later. So I think we will stop now, and I want to thank you again. Thank you.