Okay, my name is Seth Lazar, I'm the project leader of the Humanising Machine Intelligence project at the ANU. I'd like to welcome you to this rescheduled seminar, and I'd like to start by acknowledging that the ANU is on Ngunnawal lands; I'm recording from Ngarigo lands. I acknowledge the traditional custodians of country here and throughout Australia, and their continuing connection to culture, community, land, sea and sky, and I pay my respects to their elders past, present and emerging.

Today we're delighted to welcome Angela Zhou, who's coming to us from Cornell's campus in New York, where she's just defended her PhD in operations research. She's heading next to Berkeley for a postdoc at the Simons Institute for the Theory of Computing, and then on to USC for a tenure-track position. So congratulations, Angela, on defending your PhD. We're delighted to have you here to talk to us about credible algorithmic fairness. If you'd like to take the floor, it's all yours.

Thanks so much for the introduction, and thanks so much for the invitation to speak. I'm really excited to talk today about credible fairness, so let me unpack that term first. I'm going to start with an introduction to some emerging realizations about the problems with credibly assessing fairness. The key question is understanding how the data bias issues that we know to be widely present in the data sets and domains where algorithmic fairness matters most pose specific challenges to assessing disparities, or to remedying them by algorithmic interventions. That will be a broad overview of the interaction between data bias and algorithmic fairness. With that overview in place, I'll take a deeper dive into one specific data challenge: the practical problem that the protected attribute is often simply not available in the measured data set. There I'll talk about a paper on assessing algorithmic fairness using data combination, which looks at the proxy-variable practices commonly used by regulators and develops methods that expose the fundamental limits on where the disparities can lie without making further assumptions.

So, a very broad overview of a few high-level goals in algorithmic fairness. The first is simply to assess disparities: given a classifier or a regressor, we define some fairness metric and we want to measure the disparities of this model on the data set. This is typically taken for granted, but the bulk of this talk is going to illustrate some subtle but fundamental difficulties with our ability to do this in a statistical sense, in view of very common practical challenges. Of course, there is also a lot of research in algorithmic fairness on interventions to mitigate disparities, such as defining a fairness metric and then modifying the machine learning procedure to satisfy equality in that metric. But it should be clear that our ability to intervene algorithmically is premised on our ability to assess disparities in the first place. So what are some of the challenges to assessing disparities?
Well, in many domains where algorithmic fairness is of key concern, such as criminal justice, lending, and the provision of social services, there are broad-strokes critiques pointing out fundamental data bias issues with various elements of the data set. Broadly, the data set is comprised of outcomes Y, a protected attribute A, and covariate information about individuals, which I'll call Z (more commonly called X in other papers; I'll return to this in the formal problem setup). So on the one hand, we are widely aware of these data bias issues. On the other hand, the key conceptual concern is not only that there is data bias. At one level, a disparity is a difference between two measurements, and if we were taking the difference of two biased measurements where the bias was independently distributed, independent of everything else, perhaps everything would be statistically okay in the end. The conceptual challenge is that the measurement error associated with these data biases is correlated with the protected attribute itself, or with the covariates. That's a broad statement, but I'll instantiate it in a specific example next.

Because of this differential bias, the concern is that algorithmic interventions that optimize for fairness based on differentially biased estimates of disparities could further compound the distributional harms associated with the measurement error, and could therefore provide seemingly serious fairness guarantees that mask residual unfairness in the data. That is a very broad overview of why data bias should concern us even if we succeed at the program of algorithmic fairness itself. Of course, establishing each step in this argument requires specializing to a particular setting or algorithm; a previous paper on residual unfairness does this in the specific context of distribution shift. I won't talk about that here; I just want to give a broad meta-argument for why data bias matters for algorithmic fairness generically.

I also wanted to mention a recent paper that I'm excited about; it will be on arXiv in two days. Part of the argument of that paper is illustrating how COMPAS, a very common benchmark data set in algorithmic fairness, is actually an extremely poor choice of benchmark, despite its conceptual visibility in the discussion, because of these technical data bias issues. These critiques have been aired variously before, but the paper tries to take the data bias critiques, look at the algorithmic fairness literature, and instantiate the argument above in view of the specific issues in criminal justice.

So I want to talk a little about what it means for a measurement error mechanism to be correlated with the protected attribute. Again, there are these different components to a data set, and perhaps unsurprisingly, given the broader discussion about risk assessment instruments, all of these components are subject to mismeasurement and data bias. A main source of trouble is label bias.
The COMPAS data comes from a risk assessment instrument in criminal justice where the task, broadly speaking, is to predict an individual's probability of committing another crime (recidivism) or their failure to appear for a future court date. The problem is that the construct we actually care about predicting is closer to absconding from the legal system, which is called flight in the legal studies literature. The paper argues that the failure-to-appear label we measure in our data sets, whether or not a historical individual showed up for a court date, conflates willfully trying to escape the legal system with simply not appearing, and this is compounded by various other effects. For instance, there have been randomized controlled trials of text messaging systems that remind individuals of their court date, and these trials show statistically significant positive effects: if you text an individual to remind them to come to their court date, you see treatment effects on the order of around 13 percent, depending on the trial. So there is non-negligible improvement, which means some individuals are not appearing for reasons a reminder can fix. To summarize the criminal justice literature, non-appearance is correlated, via self-report, with transportation, work, and scheduling difficulties, and the argument is that these difficulties are in turn correlated with class inequality and possibly racial inequality. So this is a way in which we should be concerned about data bias, and specifically about data bias that is correlated with the very distributional concerns at issue.

I won't go into the same detail for the other components, but in this kind of criminal justice data the protected attribute is often mismeasured because it's based on a police officer's report or a visual assessment. The covariates are also noisy, due for example to prosecutorial discretion over the charges attached to an individual's record, among other mismeasurements. And finally, there are distributional issues because data sets are themselves downstream of historical decisions about inclusion or exclusion. That was a quick tour of all the ways the data can be biased in algorithmic fairness, and an argument that for the relevant data sets, these issues are all relevant at the same time.

Now I'm going to shift to just one issue, the unobserved protected class: drilling down into problems with A. This paper is motivated by practical issues that come up most saliently in lending, so I'll shift to the lending context to provide some background on how this discussion arises. Ultimately, the goal is to develop a methodology that tells you the fundamental limits on where the possible disparities can be in an incomplete-information setting, where I account for the fact that I don't measure the protected attribute, that I don't have race information, by bringing in auxiliary data. So suppose my main data set is at the Consumer Financial Protection Bureau (unfortunately, most of my examples will be in the US context): maybe I only have a data set of historical loans with individuals' incomes and locations, but without their race information.
On the other hand, in the quantitative social sciences there is a benefit: people are ultimately concerned with questions about disparities between people, so we have resources like the census, which can provide information about the population distribution of income and race. The key question is how we can use that kind of auxiliary information, which gives us proxy probabilities of group membership given covariates. The picture will not be completely optimistic; we can't achieve unbiased estimation, for example. So this paper takes a robustness approach, in which we are extremely agnostic about the underlying probabilistic structure. The goal is not estimation per se, but characterizing worst-case bounds on where the disparities can be, given essentially no assumptions. The idea is that this is a zeroth-order step that tells you about the informativeness of a proxy variable: either you will be able to draw a conclusion, or you will not. If you cannot, that suggests you should move to a different inferential procedure, or try to make and justify further assumptions. I'll walk through developing the method and how we provide these kinds of guarantees on where the disparities are.

In the context of this lending example, the main technical challenge arises when the protected attribute is multi-valued. So on the y-axis of this plot I have one pairwise disparity, the difference between white and Black rates of being given a loan, and on the x-axis a different pairwise disparity, the difference between white and Asian rates of receiving a loan. These rectangular sets are the guarantees provided by our method on where the disparities can be. The red star, in contrast, is the true ground-truth disparity, which we know because of the structure of this special data set. I'm going to explain how we're able to provide these restrictions at all, even though we haven't observed race in the data set that has the loan outcomes.

Just to provide some originating background: why don't we have race information in this data set when a financial regulator wants to audit for racial disparities? Well, it's actually illegal in the lending context to collect race information for non-mortgage loans, and not recording it can be a desirable operational constraint for similar reasons elsewhere; the issue also comes up in health insurance. Nonetheless, as a regulator, we still want to be able to audit for discrimination or disparities. So what does the Consumer Financial Protection Bureau actually do? They've taken recourse to proxy methods. The method that's been developed is called Bayesian Improved Surname Geocoding (BISG), where you estimate race based on an individual's zip code and surname frequencies. In previous work, Nathan Kallus and coauthors have shown that you actually can't achieve statistically unbiased estimation using some of the proxy methods that regulators use in practice.
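[To make the proxy construction concrete, here is a minimal sketch of a BISG-style Bayes' rule update. The tables and numbers are hypothetical placeholders, not real census figures, and the sketch assumes surname and geography are conditionally independent given race, a known simplification in such proxies.]

```python
import numpy as np

# Hypothetical marginal tables (illustrative numbers, not real data).
p_race = np.array([0.60, 0.13, 0.06])            # P(race = r), 3 groups
p_geo_given_race = {                              # P(zip = g | race = r)
    "10001": np.array([0.02, 0.05, 0.03]),
}
p_name_given_race = {                             # P(surname = s | race = r)
    "garcia": np.array([0.001, 0.002, 0.0005]),
}

def bisg_proxy(zip_code, surname):
    """Posterior P(race | zip, surname) via Bayes' rule, assuming
    zip and surname are conditionally independent given race."""
    unnorm = p_race * p_geo_given_race[zip_code] * p_name_given_race[surname]
    return unnorm / unnorm.sum()

print(bisg_proxy("10001", "garcia"))
```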
Using these kinds of proxy methods can be controversial, because it renders the project of litigating over these disparities, or over discrimination, vulnerable to critiques of the proxy method itself, which is conceptually somewhat independent. And this is not just academic quibbling: in the famous large settlement between Ally Financial and the CFPB, a lot of criticism was directed at this use of BISG, because the actual settlement checks that went out to individuals were based on thresholding this proxy probability. So this is a practically relevant problem.

I'll mention some related work, first on this methodology and then on other approaches to the unobserved protected class. I already mentioned previous work that looks at what regulators do in practice and establishes that it is statistically biased under some conditions. Our approach is motivated by developments in econometrics that provide general strategies for data combination: there is a data set with outcomes and covariates, the key idea is that auxiliary data is available in the social sciences, and we ask what we can gain by combining the data in the ways we can. The robustness ideas draw on partial identification. And in the specific area of proxy protected attributes, follow-up work from a different team of authors has actually used the robust set of disparities that this work provides: we focus on assessing disparities, but given the ability to robustly assess them, you can optimize these robust estimates and consider robust algorithmic interventions as well. I'll also emphasize that we take a very agnostic approach. Other papers consider additional structure, impose extra conditions, or allow a validation set or querying some individuals for their protected attribute; as you add more structure and information, you can see quite dramatic gains in informativeness. So this work sits at the agnostic end of that spectrum.

For the problem setup, Ŷ is a decision, such as a historical or algorithmic decision, perhaps associated with a classifier. I'll focus on binary classification for outcomes, with a multi-valued (discrete) protected attribute A. Z is the auxiliary information or covariates, usually called X in other papers, but we have this split structure: Z is the information used for proxy methods, such as an individual's income or location. The statistical formulation is that we assume an underlying joint distribution generating unknown data on the true outcomes, protected attribute, and covariate or auxiliary information. What we have access to as data scientists is just the data sets at the bottom of the slide: the main data set contains only information about loan defaults and income (sticking with the lending setting), while the auxiliary data set provides information only about race membership and income.
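[A minimal sketch of this data-combination setup, with synthetic data and illustrative variable names; the point is that only the two conditional models fit at the end are estimable:]

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two samples from the same population, never jointly observed (all data
# synthetic): the main set has outcomes Y and covariates Z but no race A;
# the auxiliary set has race A and covariates Z but no outcomes Y.
Z_main = rng.normal(size=(1000, 1)); Y_main = rng.integers(0, 2, 1000)
Z_aux  = rng.normal(size=(2000, 1)); A_aux  = rng.integers(0, 3, 2000)

mu_hat = LogisticRegression().fit(Z_main, Y_main)  # estimates P(Y = 1 | Z)
e_hat  = LogisticRegression().fit(Z_aux, A_aux)    # estimates P(A = a | Z)

# These two conditionals are all the method may use; the joint law of
# (Y, A) given Z is never identified from them alone.
print(mu_hat.predict_proba(Z_aux[:1]), e_hat.predict_proba(Z_main[:1]))
```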
This can be thought of as a setting where we care about assessing some functional of an unknown joint distribution, but we only ever observe censored marginals of that distribution. Okay, so what are the disparities of interest? The demographic disparity is built from a simple conditional probability, for example of the classifier label, or of the true outcome, conditioned on the protected attribute; so differences in the marginal rates of granting loans could be of interest. There are also classification-type disparities for the error metrics defined on the confusion matrix in binary classification. The true positive rate disparity is the difference across groups in the conditional probability of being offered the loan by the classifier, given that you would truly pay it back, and we could correspondingly define disparities for the other confusion matrix metrics. I'm going to focus on the true positive rate and demographic disparities, but the arguments extend directly to other classification or confusion matrix metrics as well.

Again, the key problem is that in order to assess disparities, we need to condition on, or count, records of outcomes jointly with the race membership of individuals. Since we don't observe A in our data, we can never estimate the joint probability of someone getting a loan and being of a certain race, because we never see these events jointly occur. So what can we estimate, and what will we use in our estimation procedure? From the main data set, we can estimate the conditional probability of receiving the loan, or of various outcomes, given the covariates; from the auxiliary data set, we can learn the probability of race membership given the auxiliary information. The goal is to obtain bounds, loosely speaking: the set of the most we can say about quantities like these disparities when we can only work with these conditional probability models. This is a different question from assuming that things are normally distributed, or running some latent variable model to combine the data sets; those would be more inferential approaches, and the point is that such approaches can only estimate disparities by making additional strong assumptions that may not be informed by the data.

An important idea here is identification. I'm going to use delta to denote disparity measures in general, subscripted by the specific type of disparity, and capital Theta to denote the partial identification set: the set of all disparities compatible with the observed data. In most settings in classical statistics, a parameter is identified if, from the observed data, there is a one-to-one mapping back to the parameter that generated the data under some inferential procedure such as maximum likelihood. The problem arises when multiple models or parameters are consistent with the observed data, perhaps because the model isn't set up correctly or, as in our case, because of an incomplete measurement situation: here, multiple disparities are compatible with the observed data.
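[For concreteness, here are the target quantities as they would be computed if A were observed, which is exactly what cannot be done on the main data; names are illustrative:]

```python
import numpy as np

def demographic_disparity(yhat, a, g0, g1):
    """P(Yhat = 1 | A = g0) - P(Yhat = 1 | A = g1), with A observed."""
    return yhat[a == g0].mean() - yhat[a == g1].mean()

def tpr_disparity(yhat, y, a, g0, g1):
    """Difference in P(Yhat = 1 | Y = 1, A = g) between groups g0 and g1."""
    return (yhat[(a == g0) & (y == 1)].mean()
            - yhat[(a == g1) & (y == 1)].mean())
```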
We simply haven't observed enough information to recover the probabilities we want to recover. In terms of the method, I'll illustrate it for the simpler case of a marginal, or demographic, disparity. If I want to estimate the probability of an individual getting a loan given that their race takes some generic value a, the key idea is to use Bayes' rule to rewrite this conditional probability, separate out the part we don't know, and re-express the estimand as an expectation of an unknown function over the main data set. Why is this helpful? Because we do have some a priori structural knowledge about this unknown quantity: at the very least, it is a probability, and it satisfies the properties of a probability distribution. That's also why I describe this as one end of the robustness spectrum. If the only information I use is that this unknown function, the probability that A = a given both the outcome and the covariate information, is a probability distribution, and I know nothing else about it, then I'm leaving a lot of structure on the table. On the other hand, I am always justified in using that information, because I don't have to appeal to anything else in order to incorporate it.

This unknown conditional probability depends on the outcomes, the covariates, and the protected attribute, which are never jointly observed, so I don't know it. But I do observe the so-called row sums and column sums of this unknown function, so I know that it lives in a linear constraint system: if I sum the unknown function against the probability of the outcome given the covariates, that sum must equal the row sum, which is another probability I can estimate from the other data set. These constraints are at least estimable even if the unobserved function itself is not. So the key idea is to optimize the estimate of the unknown disparity, the objective function here, over the range of the unobserved quantity, subject to the constraints that all probabilities satisfy: the unknown joint distribution integrates out to the known or estimable marginals, it satisfies the law of total probability, and probabilities are between zero and one. There is a classical literature on this: these are the Fréchet–Hoeffding bounds, related to copula theory, which bound any joint distribution in terms of its marginals (for a copula C, max(u + v - 1, 0) <= C(u, v) <= min(u, v)). What we do is consider more complicated estimands and discrete, multi-valued protected attributes, and that introduces more technical, computational, and statistical challenges relative to what was previously known.

As a summary of those extensions and challenges, here's a quick overview of the results. There are two dimensions along which we can add complexity. On the one hand, we can ask for a more complex estimand. The demographic disparity is pretty simple, essentially because we know the marginal race proportions in the entire population, or the population we're auditing. True-positive-rate-type disparities are more difficult, because we need to condition on information that involves both the outcomes and the protected attribute.
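[Before adding that complexity, here is a minimal sketch of the baseline optimization just described, for the demographic disparity with a binary protected attribute and a discretized covariate. All numerical inputs are synthetic stand-ins for the three estimable pieces named above, and the paper's actual estimators handle continuous covariates and sampling uncertainty.]

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
K = 20
p_z = np.full(K, 1 / K)            # P(Z = k)              (main data)
mu  = rng.uniform(0.2, 0.8, K)     # P(Yhat = 1 | Z = k)   (main data)
e   = rng.uniform(0.2, 0.8, K)     # P(A = a | Z = k)      (aux data)
p_a = p_z @ e                      # P(A = a), identified by combination

# Unknowns: h1[k] = P(A=a | Yhat=1, Z=k), h0[k] = P(A=a | Yhat=0, Z=k).
# "Row sum" constraint (law of total probability): mu*h1 + (1-mu)*h0 = e.
A_eq = np.hstack([np.diag(mu), np.diag(1 - mu)])

# Demographic disparity P(Yhat=1 | A=a) - P(Yhat=1 | A=not a) is linear
# in h1: sum_k p_z*mu*h1/p_a  -  sum_k p_z*mu*(1-h1)/(1-p_a).
c = np.r_[p_z * mu * (1 / p_a + 1 / (1 - p_a)), np.zeros(K)]
const = -(p_z @ mu) / (1 - p_a)

bnds = [(0, 1)] * (2 * K)
lo = linprog(c,  A_eq=A_eq, b_eq=e, bounds=bnds)   # min disparity
hi = linprog(-c, A_eq=A_eq, b_eq=e, bounds=bnds)   # max disparity
print(f"disparity bounded in [{lo.fun + const:.3f}, {-hi.fun + const:.3f}]")
```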
Essentially, as we go further down the table, we introduce more optimization complexity, because we also need to optimize over something in the denominator. On the other hand, we can make things more complex on the modeling side: we can add constraints to be less robust or agnostic, and we can handle multi-valued protected attributes, which is the most difficult case computationally. We can obtain closed-form bounds in some special cases; I won't go into too much detail, but one can simply think of optimizing over these sets. As I mentioned, the true positive rate disparity is more difficult because there is uncertainty, or ambiguity, in the denominator, and this is why measurement difficulties with the protected attribute differ from measurement difficulties with other covariates. One can also add further assumptions. The closed-form bound solutions, stated as conditional probabilities in terms of the outcome Y and the covariates, have a very non-smooth structure, so one can impose additional smoothness; these are all things you can do in the space of optimization.

The more difficult task is handling multiple values of the protected attribute. For this, we borrow an idea from convex analysis that has been used in the partial identification literature: we parameterize the set of disparities for groups A, B, and C, a three-dimensional set, by the pairwise disparities with respect to a reference class. That's why we end up with white versus Black on the y-axis and white versus Asian on the x-axis. The idea is that I can optimize various weightings of these pairwise disparities. If the weight vector rho were a one-hot vector, that would be equivalent to optimizing a single pairwise disparity; otherwise, I incur more computational complexity. The main idea is that computing the maximum of the weighted disparity lets me evaluate the support function of the convex hull of the disparity set. By sampling different weight vectors, I essentially sample different supporting hyperplanes of this convex set, and at the end I take the convex hull of the resulting points to obtain the convex hull of the set.

I also want to talk about other statistical difficulties with inference, and return to the case study. On the one hand, one can compute these kinds of bounds for pairwise disparities by considering one-versus-all disparities: collapse the protected attribute to a simple binary one, for example by combining race classes, and then use the computationally easier method. However, in this setting with the mortgage lending data, this isn't fully a substantive case study of how the Consumer Financial Protection Bureau does its disparity assessment, because we don't have access to the surname data; the proxy we use is the probability of race membership based only on an individual's income and a county fixed effect. As a result, our bounds are really quite large; they don't eliminate more than 50% of the volume, and this is true also of the convex hulls we can compute.
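[As an aside, a toy sketch of the support-function construction just described. The candidate cloud below is a synthetic stand-in for the disparity set; in the paper, each support point comes from solving the weighted optimization rather than an argmax over a finite cloud.]

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)

# Synthetic stand-in for the set of pairwise-disparity pairs
# (delta_white_black, delta_white_asian) compatible with the data.
candidates = rng.normal(size=(5000, 2)) @ np.array([[0.3, 0.1],
                                                    [0.0, 0.2]])

def support_point(rho):
    """Maximizer of the rho-weighted combination of pairwise disparities."""
    return candidates[np.argmax(candidates @ rho)]

# Sample supporting hyperplanes in many directions, then hull the argmaxes.
thetas = np.linspace(0, 2 * np.pi, 64, endpoint=False)
pts = np.array([support_point(np.array([np.cos(t), np.sin(t)]))
                for t in thetas])
hull = ConvexHull(pts)
print("hull vertices:", pts[hull.vertices])
```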
So the problem here is really that the proxy information is simply not very informative. On the other hand, I can look at a medical example: warfarin dosing. We can turn this into a classification problem by thresholding whether or not a doctor gave a patient the optimal dosage, because this data set records the optimal dosage, and we can then assess whether patients of different races indeed receive the optimal dosage on a doctor's first try (doctors adjust the dosage over time). In this setting we can leverage stronger proxy information: genetic biomarkers. Going from left to right in the figure, the proxy goes from the medications the patient has received, to genetic biomarkers, to the combination of both. When we combine the proxy information, with white-versus-Asian disparities on the x-axis, our method can guarantee that the disparity between whites and Asians is bracketed between 0.25 and 0.5 in favor of whites, strictly away from zero. So the method can still let you draw conclusions from incomplete data. Notably, though, I had to use genetic information to estimate race. So on the one hand, the method is more informative the more strongly correlated the proxy information is; on the other hand, proxy estimation this informative or accurate may be quite difficult to obtain in other settings. In terms of policy implications, there are really hard questions about the trade-offs between reporting and recording; the goal here is just to provide a zeroth-order check of how informative the proxy variable is, and of what to do next: improve the proxy's informativeness, or develop some other procedure. I'll stop there. The conclusion is that this paper tried to characterize some of the fundamental limits on what can be said about disparities, but more broadly, these data challenges are quite common and pervasive in exactly the data sets where we want to ensure algorithmic fairness in the first place. Thank you.

Thank you so much, Angela. Can I ask the folks on the panel to put your cameras on, so Angela can see that there are people on the other end? I don't have a button on my webinar interface to do the little clappy hands, so I'm going to do my clappy hands: that was a great talk, thank you so much. Okay, if you've got questions in the panel, just raise your hand; if you're in the audience and you'd like to put a question, put it into the Q&A and it will pop up for me, and since there are few enough of us, I'll enable you to talk so you can ask it yourself. Can I just ask you, Angela, as a starting-off question: you had the policy implications slide a moment back. Would you mind going back to it and telling us a little about the Airbnb example you have on that slide?

Yeah, that's something I glossed over. There are a lot of difficult questions to think about in terms of the trade-offs between recording this protected attribute information and the statistical contortions we have to go through if we don't record it but still want to audit disparities; that's the effect.
And I know that Miranda Bogen at Upturn also has a paper outlining some of these trade-offs. I was quite shocked; I don't know if I'm getting this completely right, but at least one paper reports that what the Federal Reserve does, as standard practice, is a probabilistic join with, I think, car registration information, and it has something like a 40% match rate. So even the regulators go through these contortions. With Airbnb, I'm not clear on all the details, but I believe they looked into this BISG approach as well. Their setting is quite different, though, because as a tech company they have finer-grained information about individuals, in particular a profile picture. So they're really looking at using that profile picture in a hybrid system, for example incorporating facial recognition or human assessors, partly because they're thinking about the effects of perceived race, and about disparities associated with race perception. They've also released a white paper (the project was in collaboration with Color Of Change) outlining some of the subtle implications, and the trade-offs we're willing to make to support the beneficial project of auditing disparities. That paper is really interesting for thinking and talking through these considerations.

Yeah. So one thought that immediately springs to mind, and you raised this at the end, is this notion of going through a lot of really complex statistics to access this information. I know there's a hiring and recruitment site in the US that has a training or evaluation set of customers for whom they've acquired the appropriate permission to know the protected attribute, and it's large, around 50,000 people. When they're trying to determine whether their model will have disparate impacts, they basically apply it to this set, explore its impacts, and then optimize to satisfy the various discrimination criteria. What do you think of those kinds of approaches? Are they optimal, with your method as an alternative for when we can't do that, or something else?

Yeah, I think that's a great point, and I wasn't aware of that example, so thank you. I do think a validation set is very practically achievable compared to changing the entire information architecture. And the truth is that by changing the information architecture you obtain a great deal of statistical improvement, whereas for these kinds of problems, even with infinite data, without some change to the information collection you could not hope to recover the disparities exactly. So I think a validation set is a great applied strategy; it's really not so hard to compute these conditional probabilities, and there's a lot of statistical gain in that type of approach. The only consideration is whether one wants to do finer-grained performance or error analysis. Thank you.

Okay, I've got a question from Kathy. Kathy, I'm going to enable your mic now; hopefully this works. Ah, there's no mic connected. Okay.
All right, so Kathy has asked: if inferences can be made about protected attributes from covariates statistically, what does this mean for protected attributes in a policy sense? Should we change how we think about protected attributes if they're not really protected statistically?

Thank you for the question; I think this is a great one, and I haven't really speculated on unpacking the implications of our ability to use these proxy probabilities in the first place, but it does speak to the correlation between protected attribute information and the covariate information we already observe. Not to speak for the whole algorithmic fairness community, but consider, for example, the legal protections against using protected attribute information in decisions. From a statistical point of view, given that there are correlations between the covariates and the protected attribute, one could work around those protections by leveraging that correlation: if one really wanted to impose disparate treatment, one could easily hide it in the complexity of the covariates, because of the correlation between this information and the protected attribute. That might be a somewhat contrived adversarial example, but it's one example of how we might change how we think about protected attributes, given that they are not necessarily protected statistically. This ability to do disparity assessment is a dramatic visual example of that, but fundamentally it's just Bayes' rule, and this type of correlation is present in other settings even when the protected attribute is observed. I don't have a great sense of how to turn that into a legal argument so that algorithmic fairness interventions might be legally allowed to use protected attribute information, but I think it's a really interesting thing to think about.

Yeah, absolutely. Again, I want to encourage folks to put their hands up to ask questions; I'll keep going in the absence of any hands. The next thing I was going to do was try to connect this up with some of the adversarial fairness work, which Chang Soon has written a paper on, so perhaps he'll mention that, and if not I'll come back to it. Chang Soon, go ahead please.

Angela, thanks for an amazing talk; I really liked it. I have a more technical question. You very quickly mentioned that in some of your analysis you impose smoothness; I think you use Lipschitzness. How should one interpret smoothness from the fairness side of things? Technically, yes, we can work that out, but looking at it as a fairness problem, what do this smoothness and disparity mean?

Yeah, that's a good question. I should clarify that the smoothness assumption is on the optimization variable, which corresponds to the probability of the protected attribute conditional on the outcome and the covariates. So it's an assumption that this optimization variable, this race-membership probability, is smooth in the underlying covariates. It is not informed by the data; this is really a domain-level assumption.
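[A small illustrative helper showing how such a constraint could enter the linear program sketched earlier, assuming a sorted scalar covariate grid; the function name and setup are hypothetical:]

```python
import numpy as np

def lipschitz_rows(z_vals, L):
    """Extra A_ub rows enforcing |h(z_i) - h(z_{i+1})| <= L * |z_i - z_{i+1}|
    on a sorted scalar covariate grid (adjacent pairs suffice there, by the
    triangle inequality). Apply per slice h(y, .) of the LP variable."""
    K = len(z_vals)
    rows, rhs = [], []
    for i in range(K - 1):
        gap = L * abs(z_vals[i + 1] - z_vals[i])
        row = np.zeros(K)
        row[i], row[i + 1] = 1.0, -1.0
        rows.append(row);  rhs.append(gap)    # h_i - h_{i+1} <= gap
        rows.append(-row); rhs.append(gap)    # h_{i+1} - h_i <= gap
    return np.array(rows), np.array(rhs)
```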
So it's an assumption that the probability of being a certain race cannot change too wildly, cannot be too non-smooth or non-Lipschitz, as I change, say, income. For example, this could be imposed on variables thought to have continuous relationships with the race variable. But it is an assumption about the underlying probability of race membership given covariate information, not directly about the disparities. On the other hand, it's true that I couldn't really give you a good data-driven scheme to motivate a particular Lipschitz constant; the suggestion would be to try different values of the smoothness parameter, and perhaps, at a domain level, to weigh the trade-offs between informativeness and the justifiability of a specific smoothness level.

No, thanks. And, sorry for the double-barrelled question, it implies two different things. One is the point I mentioned earlier, that maybe this smoothness assumption allows you to understand adversarial attacks on these questions. The other is that maybe, rather than trying different smoothness parameters, you could estimate Lipschitz constants empirically. I know it's fraught to estimate smoothness from an empirical distribution, but perhaps it's worth doing. That's all, thank you. Thanks.

Next question, from Pamela. Thanks, yeah. I was really interested in this multi-valued attributes issue, and I guess this isn't a super well-formed question, but I'm interested in the ways it could go wrong, the ways we could mess things up in dealing with this, and I just want to invite you to say more. One thing I'm particularly interested in is: how often do we have cases with many, many variables, where it would be far too computationally complex and we have to squash them? Are we usually dealing with just a handful of values, or realistically hundreds or even millions, where we're simplifying down to three and already baking in a bunch of assumptions? So, an invitation to say more, because this is really fascinating.

Yeah, that's a great question, thanks. I didn't really expand on that, but there is something being lost, there are trade-offs, in our approach to multi-valued protected attributes. The binary-valued case leverages the computationally easier setting of two classes, because we've redefined the disparity as one class versus everyone else. On the one hand, these bounds should be like projections of the entire set along a certain dimension; it's like slicing the set in three dimensions and reporting the slices. I don't have a good way to back this up right now, but one conjecture is that if your set is very sharp in a corner, which corresponds to a class with particularly bad disparities relative to the others, that may not be surfaced by comparing one versus all.
And perhaps this issue would be compounded in a setting with many protected classes, with disparities between small groups, so to speak. That's a very intuitive guess at what might happen. And of course, even this picture for the multi-valued setting is an approximation, because it reports the convex hull of the set: we were able to produce the picture by taking the convex hull, but the actual set could be non-convex. On the other hand, the motivation for doing this is that under the robustness-type approach, what we really care about is the worst-case disparity subject to this agnosticism, so although we lose something in the interior of the set, we can still report the various worst-case estimates of the disparity.

For the last part of your question, about what multi-valued should mean: this is my personal thinking, but there are a lot of different approaches one can take if you allow combinatorially many protected attributes; that's a statistically different regime to think in. For example, to return to the question about adversarial protected attributes, there is a different approach that adversarially reweights your data over so-called computationally identifiable subgroups: for any partition of the covariates you can compute with a classifier, you can call that a protected group, and you can provide robustness guarantees parallel to worst-case classifiers in the adversarial robustness literature. But you do lose some interpretability when it comes to descriptive statistics. My sense is that it's statistically less pleasant to report that, as I ranged over worst-case depth-10 decision trees, the disparity is worst for, say, Hispanic mothers under the age of 40 with a degree in library science (a deliberately bad example), whereas sticking with directly interpretable subgroups is nice for descriptive purposes. You can gain a lot computationally and statistically by being robust against computationally identifiable subgroups, but some interpretability is lost in saying that I've assured you the worst-case situation is some exotic subgroup and we are safe against it. Statistically it makes sense; from an interpretability point of view, there are some sacrifices.

Thanks so much, that's so interesting. Just building on that, and on Chang Soon's question as well: the obvious thought that comes to mind is the intersectionality analyses that are being done. They are intersectional, but only up to a point; they aim to identify socially relevant intersections rather than computationally generated ones.
But the thing I wanted to ask, related to some of Chang Soon's work and similar work by others, is about training classifiers such that on one side you optimize for whatever you're optimizing for, while on the other side an adversary tries to predict the protected attribute from the resulting classification, and you tweak those two objectives against each other so that the adversary can't predict the protected attribute. I'm curious how that relates to what you're doing. Is the resemblance just superficial, an artifact of my not knowing the underlying statistics? Or are there ways in which what you're talking about here could work in tandem with that adversarial fairness literature, such that it could actually be used to improve those kinds of classification models?

Yeah, that's a great connection. I've thought over the course of the project about the connection to conditional independence and so on. Off the top of my head, I don't have a great sense of whether it would help very much, because this method is really searching over the most pathological joint distributions that could exist between the covariates, the outcome, and the protected attribute, whereas my impression is that the training procedure for conditional-independence-regularized classifiers lives very much in the interior of the range of copulas, among copulas that are nearly independent, while this is really about optimizing over essentially completely correlated copulas. So there's a strange sense in which these point in opposite directions. That's true in the cases with closed-form solutions; in the more complex cases I mentioned, with multi-valued protected attributes and so on, it's no longer a linear program, so I can't say the solution sits at the completely correlated or anti-correlated copulas. There could be some interesting things there, but in the more complicated settings I'm not really sure which regions achieve the bounds. It's a great question, and I don't know whether there are ways to incorporate this into other techniques as well.

Well, hopefully something sparked there. All right. Angela, thanks so much for a great talk; it was really interesting. Congratulations on defending your PhD and on the position in California, which I'm sure will be lovely. From all of us here, we'd like to thank you, and thanks to the audience as well. Thank you.