So let me start with the introduction. So welcome everyone to the last TCS+ of the spring. And we're very happy to have Michael Kearns give a talk today. Before I get there, let me start with some information. So yes, as usual, you're all very welcome to ask questions. And I'd like to thank my fellow organizers, Thomas Vidick, Anindya De, Gautam Kamath, Ilya Razenshteyn, and Clément Canonne, who were all helping behind the scenes to make this possible. Let me maybe first quickly go around the table. We usually do that. Let's continue the tradition. So with us, we have Anindya today. Hi, Anindya. Clément Canonne with the group from Stanford. And we have Cristóbal from the Universidad Católica in Chile. I think we have four continents today. Thanks for adding South America. We have Erfan joining from Indiana and the group. Hello, everyone. Fang-Yi Yu from the University of Michigan and the group there. Hello. Govind from BU, from Boston University. Hello, everyone. We have Rafael Frongillo from Boulder and the group. Hello, everyone. Thomas Vidick with the group from Caltech. The Weizmann Institute just finished their Chinese food. Good to see you, everyone. And she's joining us from... actually, I don't know where from. But welcome anyway. So maybe you'll tell us later. OK, so let me just do this. Great. So we're very happy to welcome Michael Kearns, who will be the speaker today. Michael is a professor at the University of Pennsylvania, where he holds the National Center Chair in the Computer and Information Science Department. Michael received his PhD in computer science from Harvard under the supervision of Leslie Valiant. Before joining the University of Pennsylvania, he was the head of the AI department at AT&T Bell Labs. Michael is a fellow of the ACM and of the American Academy of Arts and Sciences. He has wide research interests, including machine learning, algorithmic game theory, and algorithmic trading. And today he's going to tell us about preventing fairness gerrymandering. Welcome, Michael. OK, thank you. Hopefully my audio is OK, or Odette will tell me if not. Thanks, everybody, for tuning in. My co-authors on this work are my faculty colleague here at Penn, Aaron Roth, our former grad student, Steven Wu, who's now doing a postdoc at Microsoft Research in New York and will join the Minnesota faculty in the fall, and our current graduate student, Seth Neel. So this is a machine learning talk, which will appear this summer at ICML. But a couple of the ICML reviewers complained that it was a theory paper. So that made me think that it was OK to talk about it here. And I'll spend most of the time talking about theory, but I'll talk a little bit about some experimental validation of the main algorithm towards the end. And so this is a talk about fairness in machine learning, and about a particular kind of stronger definition of fairness than the standard ones that have been around for some time. I'm going to skip the slides that I might normally put up that I think are probably unnecessary. There's even popular media attention and mainstream books these days devoted to the various ways in which machine learning can produce models and algorithms that are discriminatory in various ways, either, let's say, by racial characteristics or gender characteristics. And there's many ways that this can happen. And I'm not going to go into those now, or the importance of it societally.
But this is all part of the general trend of algorithmic decision making about important aspects of people's lives, like whether they get loans, what criminal sentence they receive, whether they get into the college of their choice. So algorithms, machine learning and data are being used to either make or guide those decisions on a large scale these days. And so there's a lot of recent interest in how to design algorithms that are more fair in these regards. If you're interested in this topic, let me give a shameless plug for a tutorial at the upcoming STOC that I'll be giving with Cynthia Dwork and Toniann Pitassi. It's actually on the workshops day. So somehow they managed to get it as a workshop, even though it's really a tutorial, but there'll be a few related talks there. And you can learn more about these topics there if you're interested. So let me go ahead and get started. So the high level question here is how can we design machine learning algorithms that output models that are fair in some sense? And of course, the first natural question to ask after I say that is what might I mean by fairness? By the way, I inherited these slides from Seth. There are probably too many of them, and they're probably a bit too dense for my taste. So I will be following the sometimes annoying style where there will be more on the slides than I actually intend to talk about, but hopefully I can guide people's attention to the right place. So typical notions of fairness in machine learning and statistics are what I would call statistical in nature. So you start off by sort of deciding who it is you're concerned about protecting, or what features you're trying to protect. So you might be concerned about discrimination based on race. You might be worried about discrimination based on gender or age. So you pick one or more groups, if you like, or features that you want to protect. And then most of the definitions that you'll see in the literature essentially ask that some statistical quantity be equal, or approximately equal, across the different groups or races. And so I'm giving a few examples here. Many of you are probably familiar with many of them. And on this particular slide, I'll use the kind of running example of an algorithm or model that's trying to decide who to give loans to, based on properties of that person and their loan application. So in something like statistical parity, let's say I care about fairness with respect to race. So in statistical parity, I would just ask that the rate at which I give loans to different groups or different races be approximately equal. So in other words, if I'm gonna give out loans to 30% of the white applicants, I also need to give out loans to approximately 30% of the black applicants, 30% of the Hispanic applicants, et cetera. So this definition of fairness is called statistical parity. It's statistical because I'm essentially promising that on average across the populations, the rates will be equal between different races, and it's parity because of course I'm asking for approximate equality. Notice that this definition is a little bit weird in this setting, right? Because in general, you might think that I would want to refer to whether the applicants were actually creditworthy, whether for instance they were gonna pay the loan back or not. And here I'm just asking that the rates be equal.
So in a statistical parity definition, if I'm giving out loans to 30% of one population, it has to be roughly 30% in the other populations, even if some populations might on average be more creditworthy than others. In contrast, similar definitions include things like equality of false positive or false negative rates, right? And these definitions do take into account, if you like, the Y value that you're trying to predict, which in this case would be whether somebody repays the loan or not. So for instance, in equality of false negatives in loaning, let's call it a false negative where I deny somebody a loan when in fact they would have repaid that loan. So they were creditworthy, but I denied them the loan. So in equality of false negatives, I would ask that that rate be roughly the same across racial groups, i.e. if I deny 10% of white creditworthy applicants a loan, I need to also deny approximately 10% of black creditworthy individuals loans also. And again, these definitions are statistical in the sense that if you're one of the creditworthy people that was denied a loan, your compensation by this definition is me basically telling you, oh, well, sorry about that, but maybe you should feel better because I also denied somebody from another race who was creditworthy a loan, to compensate for the mistake on you. So things like equality of false positives and false negatives are essentially talking about how an algorithm or model distributes its mistakes, because both false positives and false negatives by definition are types of mistakes. You're not asking that the model not make mistakes, but you're asking that those mistakes not be disproportionately concentrated on one particular racial group, for example. And there's other similar definitions that again are about kind of equalizing some statistical notion across different groups, like calibration. One of the interesting and perhaps disturbing things about fairness is that even very fundamental, natural definitions of fairness, like the ones I've listed on this slide, can actually be incompatible with each other. So to give one example, there are recent theorems that show that if you simultaneously want approximate equality of false positives, false negatives and calibration, there are impossibility results that basically say you cannot achieve that, except in kind of trivial, uninteresting situations. So not only, as we'll see, is there going to be some tension between predictive accuracy and different statistical notions of fairness, there is even tension between just definitions of fairness for their own sake. Okay, the particular type of fairness that we're interested in in this work is a stronger definition than the types that I gave on the last slide. And it's meant to interpolate between notions of group fairness and notions of individual fairness. And I don't wanna go too far into different types of definitions of individual fairness. All of the definitions I gave on the last slide were group fairness definitions, right? You pre-identify some collection of groups or attributes like race or gender that you want to protect or equalize across. And then you try to provide that guarantee. This is in contrast to definitions that would really give guarantees to every specific population and individual in a population.
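To make these statistical definitions concrete, here is a minimal sketch, in Python with numpy, of how one could measure the group-conditional false positive rates being discussed. The arrays and numbers are made up purely for illustration and are not from the talk or the paper.

```python
import numpy as np

def false_positive_rate(y_true, y_pred, mask=None):
    """FPR = Pr[prediction is 1 | true label is 0], optionally restricted to a subgroup mask."""
    if mask is None:
        mask = np.ones_like(y_true, dtype=bool)
    negatives = mask & (y_true == 0)          # individuals whose true label is 0
    if negatives.sum() == 0:
        return float("nan")                   # subgroup has no negative examples
    return y_pred[negatives].mean()           # fraction of those predicted 1

# Toy example: compare each group's FPR to the background rate.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 1])
race   = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # two groups, 0 and 1

overall = false_positive_rate(y_true, y_pred)
for g in (0, 1):
    fpr_g = false_positive_rate(y_true, y_pred, mask=(race == g))
    print(f"group {g}: FPR = {fpr_g:.2f}, disparity = {abs(fpr_g - overall):.2f}")
```

Equality of false positive rates asks that the printed disparities be (approximately) zero for every protected group.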
So you couldn't, for instance, as you can in the definitions on the last slide, kind of compensate for a false positive in one population by a false positive in another population, okay? Now, if you think about the logical extreme of individual fairness, it's kind of unachievable except under extremely strong and usually unachievable assumptions, because it's essentially asking that you never make mistakes, right? If I really look at the individual level, I can define a group to be one person, right? Well, then if I make a mistake on that person, then they're gonna have a false positive or a false negative rate of 1.0, and it'll be impossible for me to equalize that across all individuals or groups in the population unless I have a perfect model. And in general in machine learning, at least in the practice of machine learning, it's extremely rare that you can build a model that even approaches perfect prediction. And so this is why in general people settle for these statistical, coarse, group-level definitions. And what we're interested in in this talk is something that's in between these two extremes of absolute individual fairness and these very crude groups like race and gender. And the type of concern we have, or the type of problem that we're trying to address, is what we might call fairness gerrymandering, as per the title of the talk, in which you achieve group fairness kind of by discriminating on combinatorial subgroups of the attributes in question. So for example, a very natural thing might be that I want models that don't discriminate against disabled people, that don't discriminate by race, by gender or by age. And let's suppose I build a model that manages to kind of equalize, let's say, the false positive rates in some prediction problem across all four of those features separately, or marginally. That doesn't in general guarantee that I won't have discrimination on combinations of those same attributes. So I might be fair by disability, by race, by gender and age, but discover that in fact the false positive rate on disabled Hispanic women over the age of 50, which is a combination of those attributes, might be much higher than the background population rate. And there's no reason to expect that this won't happen under the standard fairness notions that I mentioned on the earlier slide, and I'll show you later that on actual data this kind of thing does routinely happen. And of course it's not surprising, because in general, in machine learning or any optimization problem, you shouldn't expect things that you didn't explicitly ask for in your objective function to magically appear in your solution. And again, I do encourage people to ask questions along the way if they have them. I know the medium is not the best for it, but please feel free to do so. Michael, there's a question actually. Somebody's asking, what about individual fairness for randomized algorithms? So I'm going to guess at what that question means. So in general, to achieve notions of individual fairness you definitely need randomization. So kind of randomization is necessary but not sufficient.
One way in which randomization is very important in achieving fairness definitions is for sort of exploration purposes like until you know what a good model is and you want to treat people fairly while you're gathering data to build your model randomization is sort of like if you look back at the earlier definitions that I gave all of them will be met by making randomized decisions by a lottery, right? So if I give loans out just randomly at a 30% rate I will meet all of those statistical definitions of fairness. It's also worth mentioning that all of those definitions can be satisfied by perfect predictions. So if I magically find a model which never makes a mistake then there are no false positives anywhere in the population and so for instance I'll be fair. And so one way in which randomization plays out in some fairness models is that you know kind of for some prefix of time you're essentially using a lottery to make decisions and then at some point you have enough data to make accurate predictions and then you can be fair for that other reason. Did that answer the question? So the, hi, it's Ron. So the question was more specific about your statement that individual fairness is hopeless unless you're perfectly accurate. So if you're, you know, if you allow randomness and you just toss coins every time and maybe is there hope of getting individual fairness without perfect accuracy? I mean, only in the sense that if for every individual in the population I kind of flipped a coin to decide whether to give them a loan or not, yes. But as soon as I have even, you know, if I allow, if I really ask that I enforce these definitions for each specific individual in a population then I can never kind of, you know, then I really sort of need a perfect model in order to be able to take the risk for instance of giving loans with different probabilities to different individuals. Okay. Okay. Okay, thanks. Let's see, where was I? So that, you know, this slide is sort of a cartoon of what we mean by fairness gerrymandering. So suppose there are two races, let's call them the blue race and the green race and two genders, males and females. And let's just suppose for the sake of argument that we have some resource to distribute. It's like, you know, free tickets to a concert. Okay. And we have enough tickets to give out to a limited number of individuals in the population. And let's, for the sake of an argument, imagine that as I've shown in this diagram that there's the same number in each one of these quadrants, there's the same number of blue males as there are green males and blue females and green females. So, you know, if I have, let's say a total of eight tickets to give out to this population or whatever, however many I have here, actually there should be one more green female circle. But you get the idea, right? I've kind of distributed these resources in a way that if I look at race marginally, the same number of blues and greens have received tickets. The same number of males and females have received tickets, but no blue females or green males have received tickets. And this is what we mean by gerrymandering because it's, you know, the metaphor here is that I've achieved some sort of aggregate fairness or marginal fairness, but only by, you know, kind of carving up the population in a more refined way in which I'm discriminating or favoring those finer grain subgroups. 
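To make the cartoon concrete, here is a tiny Python sketch of the gerrymandered allocation just described, with one representative individual per (race, gender) cell rather than the equal-sized quadrants on the slide. The encoding is mine, purely for illustration.

```python
import itertools
import numpy as np

# One individual per (race, gender) cell; give the resource only to
# blue males and green females, i.e. the gerrymandered allocation above.
race   = np.array([0, 0, 1, 1])   # 0 = blue, 1 = green
gender = np.array([0, 1, 0, 1])   # 0 = male, 1 = female
given  = np.array([1, 0, 0, 1])   # who received a ticket

# Marginal rates are equal ...
print(given[race == 0].mean(), given[race == 1].mean())      # 0.5 0.5
print(given[gender == 0].mean(), given[gender == 1].mean())  # 0.5 0.5

# ... but the intersectional subgroups are treated maximally unequally.
for r, g in itertools.product((0, 1), repeat=2):
    cell = (race == r) & (gender == g)
    print(f"race={r} gender={g}: rate = {given[cell].mean():.0f}")  # 1, 0, 0, 1
```

Both marginal checks pass at 50%, yet blue females and green males receive nothing at all.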
And so at a high level, what we're trying to do in this work is see how far we can go towards not just guaranteeing, let's say, fairness by race and gender separately, but also requiring fairness for these more refined subgroups. So let me kind of set up the formalism a little bit. So we're gonna imagine that there is a distribution, or sample if you like, of individuals. And these individuals are, as usual in machine learning, represented by feature vectors. And there's some value Y that we're trying to predict about them; in a loan application, it might be whether they're creditworthy or not, for example. And so these are sort of X, Y pairs, but here I'm dividing the features into two parts, X and X prime. The features in the vector X are the protected features. These are the features that I explicitly want to make some fairness guarantee about. So if you want to think concretely, think of these as the usual things like race, age, gender or disability. And X prime are sort of other things that I know about the individuals in the population that might be correlated with or related to the protected features, but are not the things I'm explicitly trying to protect. So maybe I'm trying to decide who to give loans to, and I want to protect by race and gender, and I also have some summaries of your social media activity that I bought from some third party vendor online. That might be in my features X prime, and it might be correlated with race and gender, but I'm not explicitly trying to protect on these unprotected features X prime. And I'm going to use capital D to denote an algorithm, or possibly a model, i.e. the output of a machine learning algorithm. There's some algorithm D that makes predictions or decisions based on whatever is known about an individual, their protected and their unprotected features. Now, we're then going to introduce kind of a very general notion of subgroups. So g of X I'm going to use to just denote the characteristic function of some subgroup of the population, defined by a function over the protected attributes. So to give a concrete example, following from what I've said so far, if X are features like race, gender, age, et cetera, then a particular subgroup might be, you know, disabled Hispanic women over age 50, right? So that g would be the indicator or characteristic function of that group, okay? And so g of X equals one just means that the individual represented by the tuple X, X prime, Y belongs to the subgroup g. And of course, this characteristic function is only looking at the protected features to decide group membership or not. And so throughout this talk, I want you to imagine that there's some class capital G that's a rich but limited class of subgroups over the protected features. So for instance, if we had binary features, G might be all conjunctions over those binary features, or it might be linear threshold functions, or what have you. Sort of some exponentially large, or, if we had continuous features, maybe even infinitely large class of subgroups, but with bounded complexity, not arbitrary. So they have some simple representation, okay? In particular, for generalization purposes, we will eventually need that this class have bounded VC dimension, for example. Okay, so what is the notion of fairness that we're interested in in this paper?
So it's basically the definitions that I gave you on an earlier slide, but now applied to all of the subgroups in this class of groups capital G. And so our paper focuses on two choices, statistical parity and false positive and false negative disparity. But just to kind of reduce complexity in the talk, let me just talk about approximate false positive fairness with respect to G, okay? So just to parse things here. Remember, G is a class of characteristic functions for groups defined only over the protected attributes X, okay? And we'll basically say that an algorithm making decisions about individuals given by tuples X, X prime, Y satisfies gamma false positive fairness with respect to this class of groups if for every group little g in capital G the following inequality holds. And so let's break this inequality down. Let me look at the second part first. So what is this expression? This is the probability that the algorithm outputs one on the individual given by X, X prime, conditioned on Y being zero, right? So this is a false positive. The true label of the individual is zero, but the algorithm has predicted or decided one, okay? So this first quantity is just the false positive rate of the algorithm D on the overall population or sample, okay? The second quantity here is the same thing, but now I've added the second condition that the individual be a member of the subgroup little g. So the second probability is the false positive rate on just the subgroup given by g, okay? And so at a high level, what we're asking for in this definition is that the absolute value of the difference of these two quantities, the background false positive rate versus the false positive rate on any subgroup in the class capital G, be small, okay? Now if you think about this from a statistical estimation standpoint, asking that that inequality hold for every group is asking too much, because some groups might just be, with respect to the distribution or the sample, infinitesimally small, okay? So in particular, if there is some group in the class capital G for which the probability that somebody's a member of the group and their true label is zero is arbitrarily small, then even in a large sample we will never witness an individual from that group whose true label is zero, okay? And we can't expect for instance to say anything about generalization, i.e. that this quantity in sample will be close to what it is out of sample for every group, unless we can observe such events. So our overall definition of fairness is that we want that for every group in capital G, the product of this kind of group weight term times the false positive disparity between that group and the background rate be less than a specified parameter gamma. Okay, so let me pause there and see if people have questions about this definition, because this is the definition I'll be using going forward throughout the talk. Yeah, I've got a small question about the definition. So here we want that guarantee to apply for any possible distribution on the examples? That's correct, yeah. So kind of in the spirit of PAC learning, we're asking for this in a distribution-free setting.
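For reference, here is a reconstruction in LaTeX of the inequality just described in words. The notation is mine, and the paper's exact formulation may differ in minor details.

```latex
% gamma false positive fairness of D with respect to the class G
% (reconstructed from the verbal description above): for every g in G,
\Pr\bigl[g(x)=1,\; y=0\bigr]\;\cdot\;
\Bigl|\;\Pr\bigl[D(x,x')=1 \mid y=0\bigr]
      \;-\;\Pr\bigl[D(x,x')=1 \mid y=0,\; g(x)=1\bigr]\;\Bigr|
\;\le\;\gamma .
```

The leading factor is the group weight term: groups that are tiny (or contain almost no true negatives) are allowed proportionally larger disparities, which is exactly the statistical-estimation concession discussed above.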
You could of course very reasonably consider the restriction of this to natural distributions, for some definition of natural, but from our perspective, we wanna go as far as we can without having to make distributional assumptions, and as you'll see, we get pretty far. Okay, now, from here forward in the talk, I'm going to talk about just sort of meeting this definition on a given sample of data drawn from an unknown distribution, okay? And the relationship between the goal of making sure that this condition holds for every group in G on a sample from a distribution, and having that be reflected in it holding approximately on the true distribution as well, is really kind of a uniform convergence or generalization issue. And without going into detail, for any kind of PAC or machine learning aficionado on the call, the way we handle this is entirely standard and there's nothing particularly novel about it. In particular, as long as the VC dimension of this class capital G is bounded, and the sample we take from the distribution is large compared to the VC dimension of G, then we'll know that satisfying this definition on the sample from the distribution means that we're going to approximately satisfy this definition with respect to the true distribution, okay? And so since that part of this work is entirely standard and not novel, I'm going to set it aside and just for concreteness from here on out talk about the problem of kind of meeting this guarantee on a fixed sample of data points drawn from an unknown distribution, okay? So armed with this definition of fairness, we immediately have two kind of interesting algorithmic or computational problems facing us. One is what I will call the auditing problem and the other is what I'll call the learning problem. So the auditing problem is simply determining whether a given algorithm or model meets this definition of fairness, okay? So what we imagine is that there's some decision-making algorithm or model D as before, right? That takes X and X prime as input and makes some decision, and what we can do is kind of push this algorithm's button many times and see the decisions it makes. All we really need to see, or get to see, are the protected features for each individual in the sample, the decision made by the algorithm about that individual, which might also have depended on the non-protected features, and the true label, right? For instance, whether we should have given them a loan or not, whether they were creditworthy. And so the auditing problem is: given access to samples of this form, decide whether this decision-making algorithm D meets the approximate fairness definition I gave on the last slide. And if it doesn't, then you should output a witness to it being unfair. You should output a g in the class G for which the approximate fairness guarantee is violated. And just to be clear, this is a trivial problem if the class G is small enough that you can enumerate it, right? If my class G, for instance, is just the traditional "I wanna be fair with respect to race, gender, age, disability, et cetera, separately," then I can just enumerate those and check each one of them. But when the class G is exponentially or infinitely large, then there's a real search problem here. There's a real computational problem to think about.
So that's the auditing problem. The learning problem would ask that we actually take data and... go ahead. I'm wondering why you don't need the X prime samples in this definition of the auditing problem? This is somewhat cosmetic from a theory standpoint, in that nothing I'm gonna say would change if you allowed the auditing algorithm access to X prime. So the theory will all remain the same. But in reality, right, if you think about the auditor being a regulator, the regulator may not know the proprietary features that the decision-making algorithm uses in making its decisions, right? So if I wanted to audit or regulate Google for some notion of racial fairness in the advertising they show, I might know, or I might be able to get them to tell me, what they know about the race of different users, but I surely can't ask them to tell me every proprietary feature and data source that they've developed to make their decisions, right? So that's why we're kind of separating X and X prime, and imagining that the auditor only has access to the declared sensitive or protected features and not everything that the proprietary algorithm might know. But from a theory standpoint, it doesn't really matter, in that we're gonna give a hardness result for auditing that will hold even if the algorithm gets to see X prime, or there is no X prime, like all of the features are protected. And then we're gonna give an algorithm for learning that also is sort of invariant to whether the auditor sees X prime or not. I've got a question with regard to the definition of auditing. It seems to be a very stringent thing to ask. Does it make sense to ask for some kind of gap, property-testing-like version, where you only have to output a badly violated g? Yeah, so actually I'm being lazy here. The true definition of course does allow that slack, right? So I'm being a little bit informal here. There is a window in which the algorithm can do anything, as you would have in property testing. So if you go look at the statement of the theorem, there's also an epsilon in it, right? And the algorithm, if D is not gamma plus epsilon fair, it has to find a witness. If it is gamma fair, it has to declare that it's fair. And if it's between gamma and gamma plus epsilon, the algorithm can do whatever it wants. And so then the learning problem of course is, on a distribution or on a sample, to find a classifier that is gamma fair with respect to G, i.e. to actually learn a D that will pass this auditing definition, that meets the fairness criteria, okay? And of course, as usual in learning theory or machine learning in general, we're gonna commit to some hypothesis or model class H that the learner is using to make decisions over the features X, X prime. And what we wanna do is find a classifier from this model class that satisfies this definition of fairness, okay? And so that's kind of a more challenging problem even than the auditing problem, because we're not just trying to detect unfairness, we're actually trying to generate a model that obeys fairness. Okay, so let's talk about some results. So the first theorem, which I'm gonna state informally; again, I invite you to go look at the paper for the details, including the good point which was just mentioned, about the natural definition allowing a bit of a gap for the auditing algorithm. But basically the first result is sort of bad news from a theory perspective.
So the result is that auditing an arbitrary decision-making algorithm for gamma fairness, and here SP is statistical parity and FP false positive fairness, but let's just talk about the false positive case; the techniques are very similar for all these different variants of fairness. So the computational problem of auditing an arbitrary algorithm for approximate false positive fairness with respect to a given class G of subgroups is computationally equivalent to the problem of weakly agnostically learning the class G itself. So in other words, this is basically drawing an equivalence between auditing and a learning problem in which you're essentially trying to learn functions from the class capital G. And I'm not gonna describe what weak agnostic learning is in detail, but if you're familiar with PAC learning, agnostic learning is PAC learning in which you make no assumptions about the process generating the examples. You only make a restriction on your hypothesis space, and weak agnostic learning is the variant of that in which you're not required to kind of get the optimal error in your model class, but only something slightly better than random guessing, okay? And so the first theorem says that this auditing problem is itself a type of learning problem. And the high-level intuition behind it is that, if you go back and look at this definition of fairness, if an algorithm is unfair, and therefore there's some subgroup little g on which it's unfair, then conditioning on membership in that subgroup is somehow changing the statistical properties of D, right? If you look back at that definition, kind of the background false positive rate and the subgroup false positive rate must be different. And so there's some sense in which this g that violates fairness must be weakly correlated with the decisions made by D, and you make that formal and you kind of get this reduction or equivalence with weak agnostic learning. Okay, so how should we interpret this result? Well, it's kind of a bad news, good news situation. From a theoretical standpoint, this is unambiguously bad news, because there's a not-small literature now on agnostic learning in theory, and almost all of it is negative, right? So we know that even for very simple structured classes like conjunctions of Boolean attributes or linear threshold functions, weak agnostic learning is computationally intractable. Again, in the worst case, over all distributions, okay? So these are kind of NP-hardness results, right? So in general, this reduction or equivalence is bad news from a theory standpoint because of all the negative results for agnostic learning. But it's good news in practice, because what we in the theory community call agnostic learning is what people in machine learning kind of just call machine learning, right? You can't make strong assumptions on the data generating process. All you can do is pick a restricted model class and try to fit models to the data in front of you in the hopes of making good predictions. And this is exactly what heuristics for machine learning do. So by this I mean neural networks, boosting, logistic regression, support vector machines. When I say heuristics, I mean their practical instantiations; some of these methods come, of course, with very strong and interesting theory attending them as well.
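To illustrate that practical upshot, here is a hedged sketch of what a heuristic auditor along these lines might look like, using off-the-shelf logistic regression over the protected features as a stand-in for the weak agnostic learning step. The function name and the choice of model are my assumptions for illustration, not the paper's implementation; note also that a logistic regression fit needs both decision values present among the negatives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def audit_fpr(x_protected, y_true, d_decisions, gamma):
    """Heuristically search for a subgroup (here: a halfspace over the
    protected features) whose false positive rate differs from the base rate."""
    neg = (y_true == 0)                       # only negatives matter for FPR
    base_fpr = d_decisions[neg].mean()

    # Fit a model over the protected features alone to predict D's decisions
    # on the negatives; any signal means some subgroup's FPR deviates.
    clf = LogisticRegression().fit(x_protected[neg], d_decisions[neg])
    in_group = clf.predict(x_protected).astype(bool)   # candidate subgroup g

    group_neg = in_group & neg
    if group_neg.sum() == 0:
        return None
    group_fpr = d_decisions[group_neg].mean()
    weight = group_neg.mean()                 # empirical Pr[g(x)=1, y=0]
    disparity = weight * abs(group_fpr - base_fpr)
    return (in_group, disparity) if disparity > gamma else None
```

The intuition from the reduction is visible here: if no classifier over the protected features can predict D's decisions on the negatives better than chance, then no subgroup in the class has a deviating false positive rate.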
But maybe another way of making the point is that somehow the fact that agnostic learning is hard from a theoretical perspective does not seem to have prevented machine learning from being successful in practice. And so we do have these heuristics that seem to work quite well under fairly broad circumstances, despite the worst-case intractability of these problems. And so this raises the possibility that the auditing problem could be solved in a practical way by these standard off-the-shelf machine learning heuristics or tools. And we're gonna kind of take that viewpoint, or hope, forward in considering the learning problem next, okay? Let's see, given the amount of time I have, I'm gonna try to do some online editing of these slides. So I wanna now move on to talk about the learning problem, i.e. the problem of, given data, trying to learn or find a model from some specified hypothesis class H that meets this strong subgroup fairness definition. And I've already talked about the equivalence between auditing and agnostic learning. And just from a technical perspective, instead of talking about agnostic learning going forward, it's slightly easier to talk about cost sensitive classification. So the inputs to a cost sensitive classification problem are a bunch of feature vectors, which again I'm here breaking into their protected and unprotected components. In a normal classification problem, you get X, Y data. I give you data of the form: here's a vector X, and it's a positive or a negative example. And the normal thing that you wanna do is predict the labels accurately. So in zero-one loss, if you predict the Y value correctly, you suffer zero loss. If you predict it incorrectly, you suffer loss one. Cost sensitive classification is just the simple generalization of that, where for each vector I have two costs. One is the cost of predicting zero and one is the cost of predicting one. So in the typical zero-one loss, one of these two costs would be zero and the other one would be one, corresponding to getting the right or wrong label. But more generally, these might be different values. There might be a cost of minus 0.7 for predicting zero and minus 0.29 for predicting one, or any other numbers that you want, okay? And so going forward, to talk about the learning problem, we're going to basically assume that we have cost sensitive classification oracles for both the subgroup class G and for the hypothesis space of the learner H, okay? So we're gonna assume that we have a black box that, given data of this form, can find the group in G which minimizes the cost sensitive classification problem for the given data set, and that we have one for the class H as well, okay? Now, as I say here at the bottom, the cost sensitive classification problem is actually known to be computationally equivalent to agnostic learning, but from a technical perspective it's a little bit easier for us to talk about this problem directly rather than just referring to agnostic learning, because it gives us a well-defined optimization and objective function that we're going to exploit mathematically in the theoretical development.
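As a concrete illustration of the oracle being assumed, here is a sketch of the standard reduction from cost sensitive classification to weighted binary classification. Logistic regression again stands in for whatever learning heuristic one prefers, and the names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def csc_oracle(features, cost_of_zero, cost_of_one):
    """Heuristic cost-sensitive classification 'oracle': reduce to weighted
    binary classification. The target label for each point is whichever
    prediction is cheaper; the weight is how much choosing wrong would cost."""
    labels = (cost_of_one < cost_of_zero).astype(int)   # cheaper prediction per point
    weights = np.abs(cost_of_zero - cost_of_one)        # regret for the other choice
    clf = LogisticRegression()
    clf.fit(features, labels, sample_weight=weights)
    return clf   # stands in for the (approximately) best hypothesis under the costs
```

With zero-one costs this collapses to ordinary classification, which is why the problem is equivalent to agnostic learning.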
Now, just to be clear here, as per the results I stated on the last page about the difficulty of agnostic learning, when going forward I assume that we have a cost sensitive classification oracle for these two classes G and H, in the worst case from a theory perspective I'm assuming I have an oracle for an NP-hard problem, right? And so a natural reaction would be, well, once you have that, can't you do anything you want by just encoding everything as an instance of some NP-hard problem? And the answer to that from a theory perspective is yes, but from a practical perspective, our goal is to actually get algorithms that use these cost sensitive classification oracles in a natural way, so that they could actually be implemented as an empirically viable algorithm. And that's what we end up doing in the experimental part. Okay, so let's see. Yeah, go ahead, question. It was actually a question from the audience. Jean asked, does the reduction preserve the distribution, as in, is the learning problem over the same distribution? Very good question. And the answer to that is yes. So if there's some particular distribution which is hard for cost sensitive classification, then it's also hard for agnostic learning. And the same comment applies to my earlier reduction between auditing and agnostic learning, and therefore cost sensitive classification. If you have a hard or easy distribution for agnostic learning, it will be hard or easy, respectively, for auditing. Right, I think the question was about the auditing part. Thanks. Actually, one more possibly stupid question. How come the complexity of D doesn't enter in the auditing problem? Because we're treating it as a black box algorithm. So you have access, so you can estimate the relevant probabilities. Yeah, exactly, exactly. Yeah, thanks. But by the way, it's an interesting question whether one might get more positive auditing results by assuming some very restricted form for the decision-making algorithm, i.e. that it is simple in some sense. Because after all, if we think about the decision-making algorithm as being the output of a learning algorithm, then usually it would come from a fairly simple model class, right? It turns out that doesn't really buy you a lot. And again, this is sort of imported from hardness results for agnostic learning, where, for example, not only is agnostically learning conjunctions hard, but even if I tell you that you're trying to learn a conjunction and the true function is, you know, a short DNF formula, you still have NP-hardness results. So you don't need the function that you're trying to learn to be arbitrarily complicated before you start to get hardness results for agnostic learning, unfortunately. Let's see. So given the time, I do wanna spend some amount of time showing you kind of cool experimental results. And so let me kind of try to summarize on the fly what we do on the learning side. So basically, let me kind of cut to the chase, and then I'll maybe flash through the slides and point out a couple of more detailed highlights quickly.
But what we end up with at the end is an algorithm that, assuming that we can solve cost sensitive classification problems both for classifiers in the learner's model or hypothesis class H and in the subgroup class capital G, will in a polynomial number of steps converge to the optimal fair randomized classifier in your model class. So the output of the algorithm is going to be a distribution over your model class H, but it'll be a sparse distribution, because it's gonna be the result of a polynomial time algorithm that's only adding one element to the support at each round. So it'll have a polynomial support size, even though the class H might be exponentially or infinitely large. And this randomized classifier, this mixture model if you like, will have the optimal error subject to gamma false positive fairness with respect to G, okay? So we're really gonna solve the optimization problem that one wants to solve, which is: find the most accurate model I can, subject to the fairness constraints given by the class capital G. And the design of the algorithm has kind of an appealing structure, which is that the algorithm essentially simulates a zero sum game between an auditor and a learner. So without going into the math yet, the way this game starts off is with the learning algorithm, at the first round, just finding the classifier in H that minimizes the error, entirely ignoring fairness, okay? So that's sort of the learner's first move. They pick a classifier little h in capital H that minimizes error. The auditor then audits that classifier and finds a subgroup on which approximate false positive fairness is violated, okay? And basically presents that to the learner and says, okay, your model is unfair to this subgroup in capital G, okay? The learner then essentially has to respond to that. And at a high level, what happens is that the subgroup that the auditor found that's being treated unfairly is added to the Lagrangian of the learning algorithm, okay? And now at round two, the learning algorithm has to minimize, instead of just the error, some mixture of the error and the unfairness inflicted on the group found by the auditor in the first round, okay? And so more formally, maybe I'll do a little bit of the math. This is just writing down the optimization problem that I mentioned, of minimizing the error subject to the fairness constraints. And then you basically set this up as a zero sum game between a primal and a dual player, right? So the primal player is the learner; their pure strategy space is the class H of hypotheses. The dual player is the auditor. They are basically determining the weights on subgroups in capital G, putting their weight on the group or groups that are being discriminated against most strongly by the learning algorithm. And you kind of use semi-standard techniques to turn your optimization problem into this Lagrangian formulation, which then becomes formally a zero sum game between the primal player and the dual player, right? And there's a lot of technical machinery you have to go through to make sure that strong duality holds and that this all works out at the end.
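Schematically, and with notation of my own choosing, the constrained problem and its Lagrangian look roughly as follows, where err(p) is the error of the randomized classifier p and phi_g(p) is the weighted false positive disparity of group g:

```latex
% Constrained problem (schematic): over distributions p on H,
\min_{p \,\in\, \Delta(H)} \ \mathrm{err}(p)
\quad\text{subject to}\quad
\phi_g(p) \le \gamma \ \ \text{for all } g \in G,

% and its Lagrangian, the payoff of the zero sum game between the
% learner (primal player, choosing p) and the auditor (dual player,
% choosing the multipliers lambda_g):
\mathcal{L}(p,\lambda) \;=\; \mathrm{err}(p)
  \;+\; \sum_{g \in G} \lambda_g \,\bigl(\phi_g(p) - \gamma\bigr).
```

The learner minimizes this payoff over p while the auditor maximizes it over the multipliers, which is exactly the dynamic described above: each violated subgroup the auditor finds gets weight in the learner's objective.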
But the challenge here of course is not writing this Lagrangian down and formalizing it as a primal-dual zero sum game; the challenge comes from the fact that both players have a pure strategy space that might be exponentially large or infinitely large, and we want a polynomial time approximation to the optimal solution, which, because we've set it up as a game, is in fact the minimax equilibrium, okay? And so the question is, how do you do that? And after we first posted this paper on arXiv, for a few months we didn't know how to provably solve that game efficiently. And we instead proposed, well, once you have this game-theoretic formulation, you can just, and now I'm kind of getting a little bit into the learning and games literature here, but you could just do fictitious play. Both players could play fictitious play, and what's known about that is that it will eventually converge, but possibly only asymptotically as far as we know, although it is conjectured that under rather general circumstances fictitious play would converge rapidly. But then, after kind of working on it for a few months and playing around with different formulations, we did manage to eventually get a polynomial time convergence result via no-regret dynamics. And something that's known from the no-regret learning literature is that if you have a zero sum game and one of the players plays a no-regret algorithm with respect to their strategy space, which in this case would be the hypothesis space of the learning algorithm, which again might be exponentially or infinitely large, if somehow you can manage to implement one of the players, let's say the learner, as a no-regret algorithm, and the other player simply plays best response at each round, then it's been known for a while now that this will converge in polynomial time to an approximate equilibrium of the game, and therefore an approximate solution to the original optimization problem of minimizing the error subject to the subgroup fairness constraints, okay? And again, now I'm kind of really getting into the weeds, but I know there's at least a few people on this call who will know what I'm talking about. We ended up figuring out how to implement a no-regret algorithm for the learner via follow the perturbed leader, right? So basically, to do that, you have to figure out a way of essentially linearizing the optimization problem that the learner faces, and you kind of do this by applying Sauer's lemma to the possible labelings of the data, right? So you have data in front of you, and the number of possible labelings of this data, which then get translated into cost sensitive classification problems, can be bounded as long as the VC dimension of both the hypothesis space H and the subgroup space capital G is bounded. And this is sort of an interesting use of Sauer's lemma, because we're not using it here for generalization purposes, which is the usual way, but to essentially get a small number of possible cost sensitive classification problems that the learner has to solve, and can therefore linearize at each step.
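Putting the pieces together at a very high level, the learner-auditor dynamics might be sketched as follows. This is a schematic of the best-response (fictitious-play-style) loop with both oracles left abstract; the function names are hypothetical rather than taken from the paper's code, and the paper's provable version replaces the learner's step with follow the perturbed leader, which is not shown.

```python
def fair_learn(data, gamma, rounds, learner_oracle, auditor_oracle):
    """Schematic learner-vs-auditor dynamics.
    learner_oracle: given the history of penalized subgroups, returns the h in H
    minimizing error plus the fairness penalties (a CSC call in the paper's setup).
    auditor_oracle: given the average of the models played so far, returns a
    subgroup in G violating gamma-fairness, or None (another CSC call)."""
    models, subgroups = [], []
    for _ in range(rounds):
        # Learner responds to the auditor's empirical history of subgroups.
        h = learner_oracle(data, subgroups)
        models.append(h)
        # Auditor responds to the uniform mixture of models played so far.
        g = auditor_oracle(data, models, gamma)
        if g is None:            # no violating subgroup: current mixture is fair
            break
        subgroups.append(g)
    return models                 # the randomized classifier: uniform over these
```

Note how the sparse support mentioned earlier arises naturally: each round adds at most one model to the mixture, so after polynomially many rounds the randomized classifier has polynomial support.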
So I know that was fast, but once you've figured this out, the actual follow the perturbed leader algorithm is rather simple, and involves kind of creating a series of cost sensitive classification problems based on the auditor's best responses so far and then adding noise to them in a very particular but simple way. Okay, and so when the dust settles on all of this, the main algorithmic theorem that we give, which I'll state informally here, is that given access to these cost sensitive classification oracles for the classes H and G, you get an algorithm that runs in polynomial time and outputs a randomized classifier over H that is gamma fair with respect to all groups in G. That should say fair, not free, on the slide. I heard some little ring tone, but hopefully everybody's still there. Yeah, I think we're all still here. So this is all well and good, and one question is, is this useful? Okay, so we've introduced this much stronger definition of fairness, and we've shown that under assumptions about these CSC oracles you can in fact get a polynomial time algorithm that will converge to the error-optimal approximately fair model. But this doesn't answer the question of whether on real data this is an appealing definition of fairness. So in other words, one definite concern would be: okay, so you can meet the stronger definition of fairness, but maybe the stronger definition of fairness is so strong that to achieve it, you essentially have to output very poor models from a predictive standpoint. In other words, maybe the fairness constraints are so severe that the approximately fair models won't be that much better than random guessing, for example. And this of course is a question that you can only answer with experiments and not with theory, right? Because it's a question about practice and about real data sets. And so we implemented this algorithm, but we didn't implement the algorithm that I just described. We actually implemented the earlier proposal I mentioned, using fictitious play. And there are kind of good reasons for this. One is the fact that in general fictitious play in practice seems to converge rather quickly, even though nobody can prove it in general yet. And that's also true in the experiments that I'm gonna describe to you. But furthermore, it's much simpler from an implementation and an experimental standpoint. So for those of you that know follow the perturbed leader, remember we're doing follow the perturbed leader over an exponentially large strategy space. So this means that the auditor, to best respond to the follow the perturbed leader learner, needs to actually sample from the follow the perturbed leader distribution at each step, which is induced by the noise injected by follow the perturbed leader. And this means that at a minimum, it's a randomized algorithm, right? And so you're injecting a source of variance into your experiments, because your algorithm is randomized. And so even on a fixed data set, you need to run things many, many times, and it's also computationally very expensive. So this is not to say that we might not eventually try implementing the theoretically best algorithm, but we started off with fictitious play, and it seems to do rather well already. And again, since I'm running out of time, let me just show what I think is sort of the most compelling summary of the experiments we've done so far. So we went out and got several publicly available data sets.
These are all binary classification problems in which fairness is a consideration, i.e. some subset of the features you could arguably want to protect with respect to subgroup fairness of the kind that we've defined here, right? And I'm not gonna go into the description of each of these data sets, but they vary in terms of the number of protected attributes, from something like 10 at the smallest to something like 20, right? So these are not small dimensional problems. If I have 20 attributes that I'm trying to protect, and I'm considering a rich class of functions over those attributes, then I've got a rather high dimensional class of subgroups that I'm trying to guarantee approximate fairness with respect to. And so what I'm showing you here is the result of running our algorithm to convergence on each of these data sets. And I'm essentially showing you the Pareto frontier of error versus unfairness. So I'm showing you the trade-off between error and unfairness. So let's take a look at this upper left one, which is the Communities and Crime data set. This is the one in which the number of protected features was the highest dimensional, okay? And so what you can see here is there's this appealing trade-off. Let's start with this point at the upper left here. This point at the upper left is essentially where the value gamma that we give to our algorithm, which remember determines how strongly we ask for approximate fairness, was some very large value, right? So essentially to the point where we're ignoring fairness entirely in the optimization, okay? And so you can see here that if I ignore fairness entirely, the error I can get is the smallest possible, and it's about, I don't know, an 11% misclassification rate. And at an 11% misclassification rate, the unfairness that I get, the worst violation of subgroup fairness that the auditor is able to find in that model, is, let's call it, about 0.25. I'm sorry, 0.025. And just to put some meaning on that, that could for instance be a subgroup that, you know, constitutes 10% of the population, on which there's a 25% disparity between the false positive rate on that group and the background population, okay? So that's pretty unfair, right? That's a big fraction of the population with a very high false positive disparity. At the other extreme, with the same algorithm, by just giving it a value of gamma that's zero, I can drive the unfairness to zero, but now my error rate will be well over 25%, okay? And in between, I can get in between. But the reason I think these results are very promising is that they provide the user, quote unquote, or the stakeholder, with the trade-off that they're faced with, right? This is a specific and numerical trade-off for each one of these datasets, about how much error you have to give up in order to get more fairness, or how much fairness you have to give up to lower your error. And of course, there are some sort of relatively obvious decisions. So one point is that for each one of these datasets, you have a curve like this, but they all look quite different. And so if you were really somebody who cared about one of these particular datasets and were trying to make decisions, you'd have to look at these curves and make policy choices.
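The frontier curves being discussed come from re-running the algorithm at different values of gamma. Schematically, and assuming the fair_learn sketch from earlier, one could trace such a curve as follows; the evaluation routine and the specific gamma range are hypothetical placeholders, not the paper's experimental settings.

```python
import numpy as np

def pareto_frontier(data, learner_oracle, auditor_oracle, evaluate, rounds=500):
    """Sweep gamma and record one (error, worst disparity) point per value.
    evaluate(models, data) should return that pair for the uniform mixture,
    e.g. by running the auditor once more on the held-out data."""
    points = []
    for gamma in np.linspace(0.0, 0.05, 11):
        models = fair_learn(data, gamma, rounds, learner_oracle, auditor_oracle)
        points.append(evaluate(models, data))
    return sorted(points)   # left end: low error, high unfairness; right end: the reverse
```

Plotting the returned points gives exactly the kind of error-versus-unfairness curve described above, one per data set.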
Some of them are easy, right? Like it would seem like a winning situation to go from error, let's say, 0.27 down to approximately 0.2 with almost no increase in unfairness. But then this curve starts to steepen quite a bit. And where you want to be on this curve probably depends on what your goal is and what's at stake. Maybe if this is a curve about the error-unfairness trade-off for what ads you see on Facebook, maybe nobody cares that much, but maybe if it's a criminal sentencing application, people care a lot. And so I'll kind of close by just saying that we have this very strong new notion of fairness. We have a kind of theoretically justified, rapidly converging algorithm that relies on kind of a black box oracle, or known heuristics, for solving standard machine learning problems. And from that algorithm, we can produce these kinds of numerical Pareto frontiers that specify the trade-off between these two criteria. And I had a cool movie to show about how this algorithm converges over time, but I'll skip all that and just stop and take any questions that people have. And thanks for listening in. Great, thank you, Michael. And we have some time for questions. And Michael, if you wish, you can turn off the screen share so you can actually see us. Unless you want to show the movie, which I wouldn't mind seeing. Yeah, I mean, while people are formulating questions, I can show the movie. The movie is kind of cool, let's see. So the movie just shows you the actual dynamics. You have these two players, right? One of them is trying essentially to minimize error. The other one's trying to find unfairness. And so at each point in this game between the auditor and the learner, the learner has some model. And that model has some error, which I'm gonna plot on the x-axis, and some unfairness, right? Some maximally violated subgroup. And that maximally violated subgroup has some, you know, unfairness, or false positive disparity. So I'm just gonna plot, in what a physicist might call phase space, the behavior of the algorithm. And so what's gonna happen is that, as I described, when this movie starts, you'll see a dot appear in the upper left here, corresponding to rather low error and high unfairness. And then this trajectory is gonna move over time, and it's color coded. So early points are dark blue, and towards the end they're red. And the dashed line here basically corresponds to the value of gamma that we gave as input to the algorithm. So we basically gave it here, I guess, 0.008. So the constrained optimization problem of the learner is to minimize the error subject to the constraint that there not be any subgroup in the class that violates false positive disparity of 0.008, okay? And so anytime the trajectory is below this dashed line, at a high level, the learner can just focus on minimizing error, although there is some inertia in the system, right? Because the Lagrangian is kind of growing over time. But anytime the trajectory is above this line, it means that the auditor is finding badly violated subgroups and adding them to the Lagrangian of the learner. And so let me start the movie. Okay, so the thing that's just kind of interesting about it is the complexity of the dynamics here, right?
You see this early period of crazy looping as the battle begins: the learner is trying to minimize error, but the auditor keeps finding these badly violated subgroups. Eventually you get to the point where the trajectory starts dancing around the dashed line. Anytime you're below that line, the learner can focus on minimizing error, but eventually that effort pushes the model back above the unfairness threshold, and then it has to address new subgroups. There's nothing super scientific to say about this diagram, other than that even though the algorithm has rather clean theory and a polynomial-time convergence theorem, the actual behavior is relatively complicated. There's a lot of back and forth: periods where the auditor, or unfairness, holds greater sway over the objective function, and others where error does.

Any questions on this or other aspects of the work? Any more questions? I see Raphael, yes.

What was the class G that you used in the practical examples?

Okay, good, yeah. The class G was linear threshold functions. In each of these datasets there are some categorical attributes and some numerical attributes. We one-hot encoded the categorical attributes, turning them into redundant binary features, and then G was just linear threshold functions over that encoding. The same was true for H: H was linear threshold functions as well. So the learner is learning a linear model, or more accurately a mixture of linear models, and the auditor is basically finding subgroups defined by hyperplanes in protected attribute space.

One interesting question in this regard, and one set of experiments we're still planning, is to fix the auditor's class, say to linear threshold functions, but greatly enrich the learner's model space, even allowing the learner a neural network of some depth. One thing you might hope, if you had enough data, is that this would improve the Pareto curves I was showing. In the same way that in standard learning a richer model class lets the learner find models with lower error, you might imagine that a richer model class, against a fixed auditing class, would let the learner find a better trade-off between error and unfairness.

Is there no danger that as the learner gets more sophisticated, it learns to cheat the auditor, since the auditor is still heuristic and not provable?

Yeah, there's always that risk, and all the curves I'm showing are subject to that caveat. The learner is making a best effort to find the best model it can with respect to the objective, and the auditor is making a best effort too. But it's possible, for instance, that on those curves I showed you there were worse subgroups than the auditor was able to find. So the danger you mention is, I think, a real one, at least in principle and maybe in practice.
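Here is a minimal sketch of the encoding and auditing classes just described, under stated assumptions: pandas' `get_dummies` for the one-hot encoding, and scikit-learn's `LogisticRegression` as a stand-in heuristic for learning a linear threshold subgroup over the protected attributes (the talk doesn't specify the exact heuristic used). The returned candidate subgroup could then be scored with the `subgroup_fp_violation` sketch above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def one_hot_protected(df, protected_cols):
    """One-hot encode the categorical protected attributes, as described
    above; numerical protected columns pass through get_dummies unchanged."""
    return pd.get_dummies(df[protected_cols]).to_numpy(dtype=float)

def linear_threshold_audit(Z, y_true, y_pred):
    """Heuristic auditor over linear threshold subgroups of the protected
    attributes Z. We fit a linear separator that tries to pick out the
    false positives among the true negatives; the learned halfspace is a
    candidate maximally violated subgroup. LogisticRegression is an
    illustrative choice of agnostic-learning heuristic, an assumption.
    """
    neg = y_true == 0                       # FP rates only involve true negatives
    # Assumes y_pred[neg] contains both 0s and 1s, i.e. some false positives.
    clf = LogisticRegression().fit(Z[neg], y_pred[neg])
    subgroup = clf.predict(Z).astype(bool)  # membership in the candidate halfspace
    return subgroup
```

The point of the hyperplane view is that the auditor is not restricted to axis-aligned groups like "race = X": any halfspace over the encoded protected attributes defines a subgroup it can flag.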
But then I think the empirical question is: are we still better off with the stronger definition of fairness than with only looking for fairness on a handful of marginal attributes independently? And I'm of the belief that we are better off with these stronger definitions.

Hello? Yep. I have a question. I noticed that statistical parity fairness doesn't require the labels for the data, right? That's right. Is there some payoff there, if you don't care about unlabeled sample complexity? Is there something more we can do because of that, or is it not that important?

From a theory perspective, it doesn't really change anything. The auditing problem remains equivalent to agnostic learning for the class G, and since we already end up with a polynomial-time, oracle-based result for false positive fairness, we don't know of any great theoretical benefit algorithmically for statistical parity either. We haven't really done experiments with statistical parity, just because I think it's a less interesting definition: it doesn't account for the fact that in many machine learning applications you're actually trying to predict something about people. Will they repay the loan? Will they succeed in college? Will they commit another violent crime if I let them go? So you really do want your fairness notion to be about making correct predictions, and not just about, say, giving away free tickets to a concert. Thanks.

More questions? And Michael, you can turn off the screen share so you can see us again. Let's see. Sorry, question? A question from Ron, yes.

Another question. This process you've described is for a single distribution, right? It's per-distribution. But the definition at the beginning was, in some sense, for all distributions. So would you have to run it for all distributions, or for some set of distributions, or at least understand them?

The audio is not great, wherever you're coming from or wherever I'm receiving it. But the equivalence between auditing and agnostic learning preserves the probability distribution in both directions: if one of the problems is hard on some particular distribution, then the other problem is hard on that same distribution, and if one of them is easy on a particular distribution, so is the other. In general, algorithmically, we would prefer algorithms that are distribution-free and make no distributional assumptions. That is the case for the algorithm I described, with the caveat that, of course, I'm assuming subroutines that solve cost-sensitive classification problems, and those problems might themselves be easier or harder depending on the distribution; we know they're hard in the worst case. Did that answer your question?

Okay, I guess I'll follow up a bit later, thanks. Okay, I'm happy to follow up by email or anything. Yeah, okay, thanks.

Any more questions? If not, let me take the broadcast offline and just remind everyone that this was the last talk of the spring. I'm hoping to see you back in September. Thanks again to Michael for this great talk. Everyone is welcome to stay; we'll just take this offline and turn off the recording.