Okay, so welcome to the 12th HMI Data, AI and Society Seminar. I'd like to start by acknowledging the traditional custodians of the land where we meet and pay my respects to our elders past, present and emerging. Today we've got Naman Goel, who is coming to us from Switzerland, from ETH Zürich. He completed his PhD there just this year, congratulations Naman, and he's doing a postdoc there at the moment. Naman is going to be talking about data missingness and algorithmic fairness. So Naman, if you'd like to start, you can go ahead and share your screen and we'll get cracking.

Okay, can you see the full screen, or do you also see the next slide? Okay, good. So thank you everyone for joining today; I know this is early Australian time. As was said, I'm going to talk about this recent work on the importance of modelling data missingness in algorithmic fairness. This work is still under review and not yet public anywhere, not even on arXiv. It is joint work with my collaborators from Microsoft, and with Alfonso, who is a master's student at EPFL.

As we all know, algorithms are now being used in many applications of societal importance, like criminal justice, hiring, university admissions, healthcare, loan approvals and so on. Since many of these algorithms are data driven, they can learn pre-existing social biases from the data, and when they are applied to take decisions in the future, they often take biased decisions. Algorithmic fairness has received a lot of research attention recently, from researchers coming from different fields. People have come up with a lot of definitions of fairness, of what fairness means for algorithmic decision making, and also techniques to satisfy those definitions for different algorithms.

A common approach for training fair machine learning classifiers looks something like this. The first thing we need to train any classifier is obviously a big training data set, where different rows contain observed outcomes for different feature values. For example, if we are training a classifier to make loan decisions, our data set may look something like this: on the left-hand side we see feature values, like prior credit history, salary and education sector, for different individuals, and on the right-hand side we see the observed outcome, whether the respective individual paid back the loan or not. In the second step, we very carefully think about and decide on an appropriate fairness metric for that particular application, for example demographic parity or equality of opportunity. We then pick a state-of-the-art fair machine learning algorithm and apply it to this data set to train a classifier with fairness constraints. Finally, we deploy this trained classifier to make future decisions. That is the current approach to training fair machine learning classifiers.

This seems like a good approach, right? But are there any problems with it? Well, there is one big problem, which is that there is no guarantee that this supposedly fair classifier, trained using a state-of-the-art fair machine learning algorithm, will actually take fair decisions in the real world. The reason for that is missingness in the training data. So what do I mean by missingness in training data?
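Before turning to that, here is a minimal sketch of the standard pipeline just described, using synthetic data and an off-the-shelf scikit-learn logistic regression. The features, numbers and the demographic-parity check are invented purely for illustration; they are not from the talk's own experiments.

```python
# Minimal sketch of the "standard" fair-training pipeline described above,
# on synthetic loan-style data.  Everything here is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
z = rng.integers(0, 2, n)                               # sensitive attribute
x = (0.5 * z + rng.normal(0, 1, n)).reshape(-1, 1)      # one feature, correlated with z
y = (x[:, 0] + rng.normal(0, 1, n) > 0.5).astype(int)   # observed outcome (loan repaid or not)

clf = LogisticRegression().fit(x, y)                    # train a classifier on the data
y_hat = clf.predict(x)

# Pick a fairness metric and check it on the training data, e.g. demographic parity:
dp_gap = abs(y_hat[z == 1].mean() - y_hat[z == 0].mean())
print(f"demographic parity gap on the training data: {dp_gap:.3f}")
```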
So, let's consider two individuals who applied for a loan with a bank in the past, and who had different feature values, say x1 and x2. The bank looked at the features of both individuals; for the first individual it approved the loan, and for the second individual it denied the loan. What happens after some time is that you observe the true outcome for the first individual, whether this person paid back the loan or not, and this feature and outcome pair becomes part of the training data set. For the second individual, the true outcome is never observed, so that feature and outcome pair goes completely missing from the training data.

If I can summarize that slide in one line, it is that training data, even if it contains objective ground truth for the outcome and infinitely many samples, is often one-sided due to some kind of systematic censoring by past decisions. There are obviously other kinds of issues in data that can cause bias; for example, you may not be able to measure the outcome at all, the outcome may be subjective, or there may be some kind of error in it. What I am considering here is the case where there is an objective ground truth, but there is some kind of missingness in the training data, so you do not observe everything, you only partially observe some of the records.

While most of the work in the fairness literature ignores this problem, there is some recent work that does consider it. I am not going to go into the details of all the related work for now; if there is time left towards the end, I will probably go through it. I can just summarize that in this work our focus is on the general identifiability of distributions from missing data, and on what that implies for fair machine learning algorithms.

Before we get started, let me also give a very quick introduction to causal graphs. Causal graphs are probabilistic graphical models that are used to encode assumptions about the data generation process. For example, if we want to represent the causal relationships between the variables gender, age and obesity, and we assume that both gender and age have a causal relation with obesity, then we can use a causal graph to represent that, where edges represent the causal relations between variables.

Now, when there is missingness in training data, Karthika Mohan and Judea Pearl recently proposed an elegant framework for also representing the missingness mechanism through causal graphs. Take the same example of gender, age and obesity, and assume that we observe age and gender completely for all individuals, but obesity has some missingness, so we do not observe the obesity variable for every individual. You can see this in the obesity-star column: for some records you actually observe whether the individual is obese or not, and for these two records you have missingness, so you do not observe the correct value. Then there is a separate variable that we add, called the missingness mechanism, R_O, which represents whether you observe the true value of obesity or not. Whenever R_O is zero, you observe the obesity variable, and whenever it is one, you do not. You can think of it as a kind of on/off switch.
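In table form, the observed variable O* and the missingness indicator R_O from this example can be written down directly. The sketch below uses invented values purely to illustrate the representation.

```python
# Sketch of the observed-variable / missingness-indicator representation just
# described, for the gender/age/obesity example.  All values are invented.
import numpy as np
import pandas as pd

obesity = np.array([1, 0, 1, 0, 1])      # true (partially unobserved) variable O
r_o     = np.array([0, 0, 1, 0, 1])      # missingness mechanism: 1 means "not observed"

# O* is what we actually get to see: equal to O when R_O = 0, missing otherwise.
obesity_star = np.where(r_o == 0, obesity.astype(float), np.nan)

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],  # fully observed
    "age":    [34, 51, 42, 29, 63],       # fully observed
    "O*":     obesity_star,
    "R_O":    r_o,
})
print(df)
```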
Now, when we want to represent this in the causal graph, the first part is the same: the gender, age and obesity variables. We then add a new variable O*, which represents the variable that we actually observe. It is the observed version of O: we do not observe O, we observe O*, and O* is affected by the missingness variable R_O and, obviously, by the original variable O. The first case is what we call missing completely at random (MCAR), because here R_O has no incoming edge from any other variable, so there is missingness in the obesity variable, but it is completely random. This is the classical terminology from statistics for missingness in data. The second case is missing at random (MAR), in which R_O has an incoming edge from one of the other, fully observed variables, say age: people of a certain age tend not to report their obesity, so age indirectly causes the missingness in the obesity variable. Finally, there is missing not at random (MNAR), where R_O has an incoming edge from the variable O itself, so the variable causes its own missingness: people with a certain obesity level tend not to report their data.

Here is some notation that I will be using in the rest of the talk. X denotes the observable, non-sensitive features of individuals. Z is the sensitive attribute. D is the past binary decision, which we say affects the missingness in the training data. Y is the true outcome that we observe after some time, for example whether somebody paid back the loan or not. U denotes unobserved features, that is, features that are not in the training data set but that, for example, a human may have used while making decisions: if a person came for an interview, the human may have observed how this person was dressed, what their hairstyle was, and so on. These things are not in the training data set, but they may have affected the past decision. Finally, Ŷ denotes the prediction of the classifier that we are going to train on this incomplete data.

Again, there are lots of fairness definitions in the literature, but I am just going to focus here on demographic parity and equality of opportunity. Demographic parity requires that the probability of the classifier giving a positive decision is independent of the sensitive attribute, that is, P(Ŷ = 1 | Z) is the same for all values of Z. Equality of opportunity requires that the probability of the classifier giving a positive decision is independent of the sensitive attribute, given that the true outcome is positive, that is, P(Ŷ = 1 | Y = 1, Z) is the same for all values of Z. Now let's say we are interested in training a classifier that satisfies this equality-of-opportunity (EOP) constraint. Many algorithms actually require estimating these EOP constraints from the training data while training the fair classifier. So how do we do that?
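Naively, both fairness quantities defined above are simple conditional frequencies that can be computed from a labelled data set. A minimal sketch follows, with invented toy arrays; the rest of the talk asks whether such estimates mean anything when the data is censored.

```python
# Naive estimation of the two fairness metrics defined above, assuming we had
# complete arrays y (true outcomes), z (sensitive attribute) and y_hat
# (classifier predictions).  The arrays below are invented toy values.
import numpy as np

y     = np.array([1, 0, 1, 1, 0, 1, 0, 1])
z     = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([1, 0, 1, 0, 0, 1, 0, 0])

def demographic_parity_gap(y_hat, z):
    # | P(Y_hat=1 | Z=1) - P(Y_hat=1 | Z=0) |
    return abs(y_hat[z == 1].mean() - y_hat[z == 0].mean())

def equality_of_opportunity_gap(y_hat, y, z):
    # | P(Y_hat=1 | Y=1, Z=1) - P(Y_hat=1 | Y=1, Z=0) |
    return abs(y_hat[(z == 1) & (y == 1)].mean() - y_hat[(z == 0) & (y == 1)].mean())

print("DP gap :", demographic_parity_gap(y_hat, z))
print("EOP gap:", equality_of_opportunity_gap(y_hat, y, z))
```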
Consider this simple graph, where X, the non-sensitive features, affect the true outcome Y, and the sensitive attribute Z affects the features. These causal graphs are just examples here; I know the causal graphs can be different, and you can have your own causal graphs and draw similar conclusions. What you do is train your classifier and make predictions on this training data set, which we call Ŷ, and then you compare Ŷ with Y to measure the accuracy of your classifier. You do the same kind of comparison to estimate whether you are meeting the equality-of-opportunity constraint or not.

But there is this problem with the training data set: you do not actually observe the full data set. There are past decisions D, which here I am again assuming were taken based on X and Z, and because of those past decisions you only partially observe the variables Z, X and Y. You observe Z*, X* and Y*, not Z, X and Y, and you can only evaluate your classifier on these selected training records: what you have is Ŷ*, not Ŷ.

So when we try to estimate the EOP constraint from this incomplete data, what we are actually estimating is P(Ŷ* | Y*, Z*), whereas what we wanted to estimate was P(Ŷ | Y, Z). By definition, the starred variables are the original variables conditioned on D = 1: whenever the past decision was positive you observe the record, and whenever D = 0 it is missing, so P(Ŷ* | Y*, Z*) = P(Ŷ | Y, Z, D = 1). This is not equal to the constraint we actually wanted to satisfy, P(Ŷ | Y, Z). The reason is that, in this graph, Ŷ is not conditionally independent of D given Y and Z. There is a technique for reading such conditional independences off a causal graph, called the d-separation criterion, also proposed by Judea Pearl in 1988. I have one slide here to explain how this is read from the causal graph, but I am going to skip it because it is a bit of a technical detail; if somebody wants me to explain it, I can go through it quickly.

So what we concluded from the previous slide is that when we have missing data, the constraints we estimate from our data are actually incorrect, and whatever guarantees we provide based on evaluating our algorithms on this incomplete training data are not going to hold in practice. The same is true if you want to estimate the demographic parity constraint: you want to estimate P(Ŷ | Z), but what you are estimating is P(Ŷ* | Z*), and you can show that this is not equal to the quantity you actually wanted, because again Ŷ is not conditionally independent of D given Z. Again, you apply the d-separation criterion to reach this conclusion from the causal graph.
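A small simulation makes the gap between the two quantities visible. The generative model and the decision rule below are invented for illustration, and the "classifier" is just a threshold on X.

```python
# Sketch of the estimation problem above: the equality-of-opportunity quantity
# computed on the censored (D=1) training records differs from the population value.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
z = rng.integers(0, 2, n)
x = rng.normal(loc=0.0, scale=1.0, size=n) + 0.8 * z
y = (x + rng.normal(0, 1, n) > 0).astype(int)    # true outcome
d = (x + 0.5 * z > 0.3).astype(int)              # past decision based on X and Z

y_hat = (x > 0.5).astype(int)                    # a classifier trained on this data

def tpr(mask):
    """P(Y_hat = 1 | Y = 1) within the given subset of the data."""
    m = mask & (y == 1)
    return y_hat[m].mean()

for g in (0, 1):
    pop  = tpr(z == g)                 # what we want:   P(Y_hat=1 | Y=1, Z=g)
    cens = tpr((z == g) & (d == 1))    # what we get from the observed records
    print(f"Z={g}: population TPR = {pop:.3f}, TPR on observed (D=1) records = {cens:.3f}")
# The two estimates disagree, so the EOP gap measured on censored data is misleading.
```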
This was actually shown empirically in a paper by Nathan Kallus and Angela Zhou at ICML 2018. They took the equal-opportunity classifier from the famous paper by Hardt et al., Equality of Opportunity in Supervised Learning, trained it on the stop-question-and-frisk data set from New York City, and then applied this fair classifier to the general NYC population. Because it was trained to be fair, the expectation was that it would be fair when applied to the general population. The observation, however, was that whereas only 11% of non-white, non-Hispanic innocents were wrongly targeted, up to 20% of white innocents, 16% of other innocents and 15% of black innocents were wrongly targeted, and this, by the way, was a "fair" classifier. So clearly, if we ignore this missingness issue while training the classifier, the guarantees are not going to hold.

That much was already shown in that paper. But the causal graph framework I am discussing allows you to reach more general results of a similar kind, for a wider class of algorithms that do not estimate these constraints naively from the training data but, for example, require certain distributions. If you read fairness papers, you will often see the assumption that the true risk scores of individuals are known. The risk scores are basically the distributions P(Y | X) or P(Y | X, Z). It is a common assumption while designing fair algorithms that there exists some method by which you can estimate these true distributions from the training data. Now we are going to see whether that is a good assumption or not: can you actually estimate them from the training data when there is missingness? If not, then whatever algorithm you have designed, its guarantees, again, are not going to hold.

Let's first consider the case of fully automated decisions: the past decisions that caused the missingness in the training data were fully automated, in the sense that they were completely based on observed features, and all those observed features are in your training data set. So D here depends only on X and Z, and the rest of the story is the same: those decisions caused the missingness. In this case you can show that the risk score P(Y | X) can actually be consistently estimated, even from the incomplete data, and P(Y | X, Z) can also be estimated correctly. But P(X), the distribution of the features themselves, cannot be estimated: it is non-recoverable, which means there exists no estimator that can recover this true distribution from your training data, no matter how many training samples you have. So if your algorithm assumes that this distribution is known, keep in mind that the algorithm is not going to work.

Now I keep the same graph, but add one edge between Z and Y, so I assume that the sensitive attribute also has a direct causal effect on the outcome. In this case, if we ask whether P(Y | X, Z) can be estimated, the answer is the same as before, and for P(X) the answer is also the same, but for P(Y | X) the answer changes: earlier it was easily recoverable and consistently estimated, but now a naive estimate is going to be wrong. Just because I added one edge in the graph, the conclusion changed.
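Here is a sketch of the fully automated case. The generative model is invented, and a stochastic decision rule is used so that every region of the feature space retains some chance of being observed; with that caveat, it shows the risk score being recovered from the censored records while the feature distribution is not.

```python
# Sketch of the recoverability claims above when past decisions depend only on
# observed variables (X, Z).  Synthetic data, for illustration only.
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda t: 1 / (1 + np.exp(-t))

n = 500_000
z = rng.integers(0, 2, n)
x = 0.5 * z + rng.normal(0, 1, n)
y = (rng.random(n) < sigmoid(1.5 * x - 0.5)).astype(int)      # Y depends on X only here
d = (rng.random(n) < sigmoid(2.0 * x + z - 0.5)).astype(int)  # decision: a function of (X, Z)

sel = d == 1
bin0 = np.abs(x) < 0.05   # a thin slice of the feature space around X = 0

# The risk score P(Y=1 | X ~ 0) is consistently estimated from the censored records:
print("P(Y=1 | X~0): full =", round(y[bin0].mean(), 3),
      " censored =", round(y[bin0 & sel].mean(), 3))
# ...but the feature distribution itself is not (here, its mean is shifted):
print("E[X]:         full =", round(x.mean(), 3),
      " censored =", round(x[sel].mean(), 3))
```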
Now let's consider the case of human decisions: the past decisions that caused the missingness in your training data were made by humans. To distinguish this from automated decision making, I am going to assume that humans also used unobservable features while making their decisions, as in the examples I gave, say what the person was wearing, or their hairstyle, and so on. So this is the same graph as before, but now I have added one node U, where U is an unobserved feature that may affect the observed features and may also affect the past decisions of the humans, and those decisions ultimately caused the missingness in the training data. If we reason about whether P(Y | X, Z) can be recovered, yes, it can be, easily. P(X) cannot be recovered, even in this case.

Now let's consider one more example of human decisions. This is the same case as before, but I change one edge: instead of the edge from U to X, let's make the edge point from U to Y. An example could be a human judge: when a defendant comes for a hearing, the judge can observe whether the defendant has come with their family or not. That feature can affect the judge's decision, and it can also affect whether this defendant goes on to commit a crime within two years or not. But this feature may not be recorded in your training data set, so I call it an unobserved feature. The rest of the story is the same: the decision causes missingness in the training data. For P(X), the conclusion is the same as before. But now we have a negative result: if humans were using these unobserved features, and an unobserved feature had an impact on both the decision and the outcome, then you cannot estimate the risk scores, no matter how many samples you have in your training data.

You can also use this common causal framework to model how machine-aided decisions are made. Here I take the same graph as before, but add one more node, D_A, which represents the decision of the algorithm. The algorithm looked at X and Z and came up with a decision D_A; the human looked at the algorithm's decision, and also at the features independently, and then came up with the final decision D, and that decision ultimately caused the missingness. If that was the case, you can show that things are not as difficult as in the previous case: P(X) is non-recoverable, but at least the risk score can be recovered. But if I now add the human characteristic of using unobserved features, so that in addition to looking at the algorithm's output the human also uses some unobserved feature, then we have a negative result again: you cannot consistently estimate the risk distribution.
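A sketch of this negative result follows. The variables and functional forms are invented, with U standing in for something like the courtroom impression in the judge example: it influences both the past decision and the outcome, but never appears in the training data.

```python
# Sketch of the negative result above: when an unobserved feature U influenced
# both the human's past decision D and the outcome Y, the risk score estimated
# from the censored records is biased, no matter how much data we have.
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda t: 1 / (1 + np.exp(-t))

n = 500_000
x = rng.normal(size=n)                                # observed feature
u = rng.normal(size=n)                                # unobserved feature
y = (rng.random(n) < sigmoid(x + u)).astype(int)      # outcome depends on X and U
d = (rng.random(n) < sigmoid(x + 2 * u)).astype(int)  # human decision also uses U

sel = d == 1
bin0 = np.abs(x) < 0.05
print("P(Y=1 | X~0): full population =", round(y[bin0].mean(), 3),
      "  observed (D=1) records only =", round(y[bin0 & sel].mean(), 3))
# The censored estimate is too optimistic: within any slice of X, the observed
# individuals are exactly those whose unobserved U made a positive decision likely.
```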
This is, in general, a problem, because when we talk about regulation, there is this notion of human oversight. I think I checked the Australian AI ethics framework, and it also has this notion of human oversight. The EU Commission framework also talks about human oversight, and in the GDPR there is Article 22, which says that individuals have the right not to be subject to completely automated decision making, so a human always has to be involved. On one hand, this human involvement is something we see as increasing trustworthiness; on the other hand, we have these negative results showing that if humans are involved in decision making, things get difficult when it comes to actually learning from the data that is being produced.

To summarize what I have discussed so far: estimating joint distributions of features is impossible in almost all cases of missingness caused by past decisions. Conditional distributions may be recoverable, depending on the nature of past decision making and the causal relationships between the variables. And missingness caused by automated decision making is relatively easier to handle than missingness caused by human or machine-aided decision making. I am not going to give more examples here, but we have more examples in the paper, where we show that even a small change in the causal graph, keeping everything the same but flipping the direction of one arrow, so that the features do not affect the outcome but the outcome affects the features, can change the conclusions. For example, if the features are test scores and Y is the true outcome, whether a candidate is qualified or not, then you may want to say that it is the candidate's skill that determines the test scores, and not the other way around. If you just flip the direction of that edge in the causal graph, you may reach very different conclusions about whether the distributions can be recovered or not. So all these conclusions are very sensitive to the precise causal structure of the missingness.

So far I have mostly given negative news, that the distributions are not recoverable, but there was actually also a lot of positive news: some of the distributions are recoverable, especially if the censoring is caused by fully automated decisions. As an example, we show how, in multi-stage decision making, you can use these results to design an algorithm that is actually fair in the real world, even when there is missingness in the training data. Consider a two-stage decision-making setting. In the first stage, the decision maker uses some features of the individuals, say X1, to decide whether to promote them to the next stage or not. In the second stage, the decision maker collects additional features, say X2, for the individuals who reached the second stage, and then the final decision is made, about whether the person is hired, or given a loan, and so on. In university admissions, for example, the first stage could use test scores, and in the second stage you may ask for letters of recommendation. Additional features are collected at different stages, and the second stage only sees the output of the first stage: you ask for recommendation letters only from the people who passed the first stage. So there is missingness caused by the first-stage decisions, and then the second-stage decisions cause further missingness based on both features.

In this setting, there are three distributions of interest. The first is the risk score P(Y | X1, X2), and we can show that it can be recovered from the training data, even with this missingness. The joint feature distribution P(X1, X2), as in the earlier cases, is still non-recoverable: you cannot recover it with any estimator. And for P(Y | X1) we show that the naive estimate is incorrect, but using a simple factorization technique you can actually recover the correct distribution from the training data.
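As an illustration of how such a factorization could work, here is a minimal sketch under the assumptions that the stage-one decision depends only on X1 (so X2 is independent of that decision given X1) and that decisions are stochastic, so every group retains some chance of being observed. The generative model and decision rules are invented, and the result in the paper is more general than this toy version.

```python
# Sketch of the factorisation idea: recovering P(Y=1 | X1) in the two-stage
# setting from censored records only.  All numbers and rules are invented.
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda t: 1 / (1 + np.exp(-t))

n = 1_000_000
x1 = rng.integers(0, 3, n)                                     # e.g., test-score band
x2 = (rng.random(n) < 0.2 + 0.25 * x1).astype(int)             # e.g., reference strength
y = (rng.random(n) < sigmoid(x1 + 2 * x2 - 2)).astype(int)     # true outcome

d1 = (rng.random(n) < sigmoid(2 * x1 - 2)).astype(int)         # stage 1 uses X1 only
d2 = d1 * (rng.random(n) < sigmoid(x1 + x2 - 1)).astype(int)   # stage 2 uses X1 and X2

target = 1
true_val = y[x1 == target].mean()             # P(Y=1 | X1=1): not observable in practice
naive = y[(x1 == target) & (d2 == 1)].mean()  # naive estimate on records where Y is observed

# Factorised estimate, using only recoverable pieces:
#   P(Y=1 | X1) = sum_x2  P(Y=1 | X1, x2, D2=1) * P(x2 | X1, D1=1)
recovered = sum(
    y[(x1 == target) & (x2 == v) & (d2 == 1)].mean()
    * (x2[(x1 == target) & (d1 == 1)] == v).mean()
    for v in (0, 1)
)
print(f"true {true_val:.3f}   naive {naive:.3f}   factorised {recovered:.3f}")
```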
So, now the question is: can we design a fair algorithm that uses only P(Y | X1, X2) and P(Y | X1), and not P(X1, X2)? And how does it compare to an algorithm that makes use of all three distributions? We propose a detail-free, decentralized and fair algorithm that solves the following optimization problem at every stage of the selection process: it maximizes the precision of the decisions taken at stage i, subject to a budget constraint at stage i, because at every stage you want to reduce the number of candidates who go to the next stage, and subject to a fairness constraint at every stage. The idea is that you can write both this optimization objective and the constraints using only the recoverable distributions, and you can avoid using the joint distribution P(X1, X2).

We compared this empirically with an algorithm that assumes that P(X1, X2) is somehow known to it, on three data sets, and the results are actually very comparable. On the x-axis we have the budget of the first stage, that is, what fraction of candidates go to the second stage, and on the y-axis you see the utility, which is basically the precision of the decisions taken at the final stage. The trend to notice is that there are two algorithms: the oracle algorithm, which uses the non-recoverable distributions, in the sense that it somehow assumes the correct distribution is given to it, and our algorithm, which does not use any non-recoverable distribution. As you can see, the utility difference is not significant. So you can design algorithms that avoid the use of these non-recoverable distributions, and their guarantees will actually hold in practice.

To conclude, the points I made were these. First, the probability estimates that we use in our supposedly fair algorithms may not be consistent due to data missingness, and therefore any fairness guarantee given at training time need not hold in practice. Second, depending on the causal mechanism of data missingness and the application, we can sometimes design algorithms to be fair even when trained with incomplete data, and we can probably do that without compromising the utility of the algorithm. The final point, which was very interesting for me, is that human involvement in decision making presents challenging research questions from the data censoring and missingness perspective. We always knew that humans make things complicated, but from the censoring and missingness perspective we now at least have formal results showing that it does make things harder. That's it from my side, I'll take any questions. Thank you.