Right, that's it, I think that's the transition. Hi, welcome everyone to the next session. My name's Chris Bailey, and I'll be introducing the next speaker. So we have with us here Ziad Obermeyer, an Associate Professor and the Blue Cross of California Distinguished Professor of Health Policy and Management at the Berkeley School of Public Health. Ziad trained as an emergency doctor and he still gets away as often as he can to a hospital in rural Arizona to do what he loves, working the ER. But these days, Ziad spends most of his time doing research and teaching at UC Berkeley. Inspired by his clinical work, he builds machine learning algorithms to help doctors make better decisions. He also studies where algorithms can go wrong, how they can scale up racial bias and how to fix them. He's received numerous awards from the National Academy of Medicine and the NIH, and he publishes in a wide range of journals. His work has been highly influential and is frequently cited in the public debate about algorithms as well as in federal and state regulatory guidance and civil investigations. So without any further ado, I shall pass you over to Ziad.

Thank you so much, Chris. And as I struggle to share my screen, I'll say that, as Chris mentioned, a lot of the work that I do is actually very optimistic about the role that algorithms can play in medicine. And I think that's founded in my belief that there are lots of very productive things that algorithms can and should do around risk prediction, around diagnostics. But at the same time, there are many ways that that optimistic vision can falter. And I think that's really important for us to keep in mind, because that's, I think, the single greatest threat to all of the gains that we can make using algorithms in medicine: letting them go wrong in these increasingly well-known ways. And so what I wanna do today is just walk through two case studies of where algorithms can go wrong and, more optimistically, where they can go right. So the first algorithm is gonna be the evil twin and the second one is going to be the good twin, the one that hopefully gives you a more positive vision of what I think algorithms can do. So along the way, I'm gonna try to highlight two common themes that I try to stay mindful of in a lot of my work. One is that so many of the ways that algorithms can go wrong come from the wrong target variable, training algorithms to predict the wrong thing, often a convenient and tempting proxy. And the second one is that a lot of the problems in algorithms come from underlying issues that I'm trying very hard to fix, and I'll tell you about that soon.

So let me start with the first case study. The setting here is one that I think will be familiar to all of you, and basically everyone who's working in health, which is that there's a small number of patients with very complex needs whose care gets fragmented and poor, who have high costs and who have bad outcomes. And so the way that almost all health systems in the country, and in a lot of the world, have started to get a handle on this is through high-risk care management. So the way I think about high-risk care management is basically like a VIP program for patients who really need help: extra primary care slots, home visits, medication refills, whatever you need, they will get it. But of course, all of that stuff is itself expensive, and so you can't do it for everyone. And so you need to target it to the people in your primary care population or your insured population who need it the most.
And that's where algorithms come in, of course, because this sounds like a really good use of algorithms. So we studied one piece of software made by one company. It's one of the biggest in this market. And by that company's estimates, the software is being used to help make medical decisions for about 70 million patients per year. If you look at the market estimates for this family of algorithms that all work in essentially the same way, it's the majority of the US population that's being screened through one of these algorithms every year. So one thing to keep in mind is that the scale of these algorithms already in the healthcare system is just enormous.

How these algorithms work is fundamentally by trying to find patients who are going to get sick. So we are taking care of this population today. There are some needles in that haystack that we really wanna know about now, because those people are going to get sick. And if we knew who they were today, we could target them with this high-risk care management set of interventions. And we could do two things. One is we could make them healthier and prevent all of these complications down the road. So better health for the patient. And of course, we keep them out of the ER and the hospital. We prevent them from seeing substandard quality doctors like me, and we drive down healthcare costs at the same time. And so everyone wins in this scenario. The way these algorithms work concretely is they say, okay, you've got this population of patients. I'm gonna look in my algorithmic crystal ball ahead a year, and I'm gonna predict, of all of these patients that look okay now, who is gonna cost us a lot of money in the next year, as a proxy for who's going to get sick. And that's gonna let you target help now. So you're gonna take the highest risk people and you're gonna fast track them into this population health management program. So we were working with one health system, and as we've worked with many others over the two years since this project, they all work in essentially the same way, although the details vary a little bit. Essentially, imagine the algorithm generates some distribution of risk. The top few percent, in the hospital we studied, got fast tracked into the high risk care management program. About the next half down got shown to their primary care doctors. So the primary care doctor got a list with all the patients, the algorithm score and some information. And the PCP got to decide if this patient should be enrolled in the high risk care management program, and everyone else just got screened out. So in a very concrete way, where you are in this distribution of algorithm scores determines how you're gonna get treated and your level of access to the program.

So we were interested in studying whether this algorithm was biased, and one really important thing about studying bias, I think, is that in order to study it, you need to define it in a very crisp way, and that definition needs to be rooted in how the algorithm is being used in this real world setting. So as I just told you, people who have the same algorithmic score are treated the same way. And since we're trying to allocate this program that helps them with their health needs, those people who have the same score should have the same needs. So that's our very crisp statement of what an unbiased algorithm would do. And in particular, if those people have the same needs, then the color of their skin shouldn't matter for their score or how they're treated. That is not what we found.
So let me show you this graph, and I'll just walk through the axes in detail because I'll show you another one just like it. On the X axis, the horizontal axis, we're ranking everyone by their algorithm risk score in percentiles. So very low at zero and very high at a hundred, and you can see a dotted black vertical line; to the right of that line is where the fast track into the program starts. So everyone above that gets fast tracked in. On the Y axis, I'm showing you one measure of their health needs over that next year. You can think about this as basically like a comorbidity score. So if they had an encounter for heart failure, for diabetes, for kidney failure, whatever, over that year, you get a plus one on the score. You just tally up all of these exacerbations of chronic conditions over that year as a measure of health. We looked at a lot of other measures and they all look the same way, but let me just show you this one. The two lines here show the averages for black patients on top in purple and white patients on the bottom in gold. And what you can see is that no matter where we are in that algorithm risk distribution, the purple line is above the gold line. And what that means is that the black patients are doing worse in terms of their realized health at any given algorithm score. So that is a disturbing finding.

It's not immediately obvious from this graph how big this bias is, so let me just give you one fact about that. And I'll start with a general fact about how we think about bias. If you look at that high priority population, shaded in purple: when we studied this program, that fast track was 18% black. And you might have looked at that number and compared it to the base rate of black patients in this primary care population, which is 12%. And you might have thought, wow, actually the algorithm is over-representing black patients in the high priority group by 50%. This looks great. This algorithm doesn't look biased at all, because it's over-representing black patients. If, instead of looking at the population rate, you judged needs, and said, okay, we're gonna actually give priority in this fast track to people with bad health, not the algorithm score, you would have ended up with a fast track that was almost half black. So even though this algorithm looks good if you compare the population rates, it's dramatically under-representing black patients relative to their health needs. And I think that's just a thing to keep in mind when we're thinking about whether an algorithm is biased: these kinds of simple measures are often very misleading.

So we of course wanted to understand what was going wrong in this algorithm. And a really important clue to that was where this algorithm was going right. So here's a graph that again shows patients ranked by their risk on the x-axis. But now on the y-axis, instead of showing you a measure of health, I'm showing you how much those patients cost. And what you can see here is that those lines are basically sitting right on top of each other. So for black patients and white patients alike, costs are increasing a lot in predicted risk. It's a log scale on the y-axis. All these graphs are done in R, by the way. I hope you guys noticed. So these lines are sitting right on top of each other and they're predicting costs very well. So this algorithm is actually doing a great job, and a fairly unbiased job, of predicting total healthcare costs for both black and white patients alike.
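To make that audit concrete, here is a minimal sketch of the check in R, under hypothetical column names: a data frame `df` with the vendor algorithm's `risk_score`, self-reported `race`, a next-year tally of chronic-condition flare-ups `active_conditions`, and next-year `total_cost`. These names, and the cutoff percentile, are my own illustrative assumptions, not the study's actual variables.

```r
library(dplyr)
library(ggplot2)

audit <- df %>%
  mutate(risk_pctile = ntile(risk_score, 100)) %>%
  group_by(race, risk_pctile) %>%
  summarise(
    mean_conditions = mean(active_conditions),
    mean_cost       = mean(total_cost),
    .groups = "drop"
  )

# Health at a given score: if the two lines separate here, patients with the
# same score do not have the same needs, which is the crisp definition of bias above.
ggplot(audit, aes(risk_pctile, mean_conditions, colour = race)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_vline(xintercept = 97, linetype = "dashed") +   # illustrative fast-track cutoff
  labs(x = "Algorithm risk percentile", y = "Chronic conditions (next year)")

# Cost at a given score: the same plot on the trained label will typically look
# well calibrated for both groups, which is exactly why this bias is easy to miss.
ggplot(audit, aes(risk_pctile, mean_cost, colour = race)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_y_log10() +
  labs(x = "Algorithm risk percentile", y = "Mean total cost (log scale)")
```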
So putting this together, the algorithm is biased for predicting health but unbiased for predicting cost. Why is that? Well, it's because black and white patients don't have the same correlation structure between their health and their costs. And that's for two main reasons. The first is that white patients have better access to the healthcare system. Now, this is an insured population, so this isn't even about insurance. But even within an insured population, transportation, poverty, job schedules, family support, all of these other things mean that, conditional on someone needing healthcare, white patients face lower barriers to actually accessing that healthcare and generating costs. The other problem, of course, is that our health system just treats black patients differently. So there are many, many studies that show that doctors are less likely to recommend invasive testing for heart disease, less likely to treat pain, all of these other things that, again, lead to lower costs being totaled up at the end of that year. So to summarize, conditional on someone's health, black patients are going to cost less, not because they don't need healthcare but because they don't get healthcare. And so when you train an algorithm to predict cost accurately, you are at the same time training it to predict health in a biased way. And I think that's the summary of the whole paper.

So let me try to distill that into lessons. It's really, really important, when we're building algorithms, when we're critically evaluating algorithms for use or for purchase, that we articulate exactly what the algorithm should be doing. What is the ideal target, the ideal piece of information that this algorithm should be providing to help me, as the decision maker, do my job and make the best decisions possible? And that's how we hold algorithms accountable. We articulate in very precise terms what the algorithm is supposed to be predicting. In this case, it was health needs. And then we compare that to what it is actually predicting, which is healthcare costs. And if those two are different, as they often are in subtle-seeming but really, really important ways, you get what we call label choice bias. So the bias that results from choosing the wrong label, predicting the wrong variable, often a convenient proxy for the underlying thing that we actually care about. Now, on a more optimistic note, I wanna point out that when you detect and articulate bias in this way, you're at the same time potentially giving yourself a roadmap for fixing the bias, because we've articulated that the algorithm is doing this thing wrong and here's how to do it better. And so in this case, what we did is we actually just cold-emailed the company that made this algorithm, and we told them that we had identified this problem, and they were very motivated to work with us to fix it. And so with their technical teams, we actually rebuilt the algorithm. We trained it to predict a basket of health outcomes, not just costs. And in so doing, we dramatically reduced the amount of bias in the resulting algorithm. So that was two years ago, and we were very lucky to get some attention for that article. And what we tried to do is turn that attention into collaborations with a bunch of health systems, insurers, tech companies, and regulators at the state and federal level. And that reinforced a lot of the lessons that we actually took from that original paper.
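As a sketch of what that label swap looks like in code: the same features pointed at a different target. The data frame `train`, its column names, and the use of a random forest here are all illustrative assumptions, not the actual rebuilt model.

```r
library(ranger)

# Hypothetical training frame: one row per patient, prior-year features,
# plus candidate labels measured over the following year.
label_cols <- c("total_cost", "active_conditions", "avoidable_er_visits")
features   <- setdiff(names(train), label_cols)
rhs        <- paste(features, collapse = " + ")

# Original, convenient label: next-year cost
fit_cost   <- ranger(as.formula(paste("total_cost ~", rhs)), data = train)

# Revised label: something closer to health itself, for example a tally of
# chronic-condition flare-ups, or a basket of such outcomes
fit_health <- ranger(as.formula(paste("active_conditions ~", rhs)), data = train)

# Both models can then go through the same audit as above, comparing realized
# health for black and white patients at the same predicted-risk percentile.
```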
So the bad news first: basically any time an organization approached us for a collaboration around a particular algorithm, or for a more global assessment of bias in all of the algorithms that they're using, we found it. So we've replicated this initial finding from one algorithm in several other algorithms that do population health resource allocation, all around that issue of predicting costs when they should have been predicting something more linked to a patient's health. We found it in a lot of clinical prediction tools, because think about what those clinical tools often predict. What do they say they're doing? They say they're predicting diabetes. What are they actually predicting? They're predicting an ICD code for diabetes, or a test done by a doctor for diabetes. And in so doing, they're leaving out a bunch of people who never get diagnosed, who are in many ways the people we care about the most in these clinical prediction tasks. It's not limited to clinical or population health tools. It's also these very detailed operational decisions. So for example, think about what a lot of your primary care clinics do in your health systems. They predict who's not gonna show up for an appointment. There are two types of people who don't show up for an appointment. There are the people who choose not to show up for the appointment. And then there are the people who can't show up for the appointment because of barriers to access, transportation, even sometimes getting too sick to actually go to the clinic. So now think about the fact that we're predicting who's not gonna show up, and we're reallocating those clinic slots to someone else who is gonna show up. For that person who wants to show up but can't, you're taking the slot away from someone who faces barriers to access and you're giving it to someone who doesn't face barriers to access. So it's exactly the wrong thing, which is a recurring theme in a lot of these things that we found.

Now again, the good news is that in all of these cases, in the same data frames that the original algorithms were trained on, there is a less biased label that we can use to retrain the algorithm, make it far less biased, and turn it from a tool that reinforces all of these awful disparities in our society and in our healthcare system into a tool that actually reallocates resources to people who need them. And all of that is by retraining the algorithm on a label that's less biased. And so I'll post this in the chat if I can later, but you can also just Google 'algorithmic bias playbook'. We tried to distill all of these lessons from our work and our collaborations with dozens of partners over the past couple of years into this playbook and make it very practical and useful and hopefully readable. There are a couple of jokes in there too, if you manage to make it through all the way.

So let me just distill all of this down into a couple of things. Getting the exact target for the algorithm right really matters. And I think that, you know, it's very tempting, when we've done all of the work to get all the data that we need into our data frame and we're looking at it, to just want to get to work. You think, yeah, I've got all this data and now I just want to build the model and predict this thing. And that decision is often made pretty expediently when it should be made very, very carefully, because that data frame is a portrait of the world as it is, not the world as it should be. We have a short article.
I'll also post this in the chat; it's called 'On the inequity of predicting A while hoping for B'. And this is a play on an old classic article in management, which is 'On the folly of rewarding A while hoping for B'. All of us are very familiar with this in health, because what do we reward in health in a fee for service model? We reward more care. What do we hope to get? Good care. Those two things are different, and algorithms work very, very similarly. We often train an algorithm to predict A but hope it's actually gonna predict B. We train it hoping it's gonna predict someone who has high healthcare needs, but actually we're just training it to predict cost. Many of these variables that you want, like someone's true healthcare needs, are missing from your data set. And when they're there, they're measured with a lot of bias. And so we just need to be really, really careful and deliberate when we're picking targets for those algorithms.

The second lesson I wanted to mention is that very simple heuristics about bias can be very misleading. I already showed you one about population representation. So the baseline population was 12% black. The fast track was 18% black. We might have concluded based on that that this algorithm was unbiased, and we would have been very wrong. Similarly, I think there's a tendency to equate bias with the presence or absence of race-based adjustments. Now, I can tell you that in this case, the algorithm that we studied explicitly did not contain a race-based adjustment, because the people who made it were very worried about bias when they built the algorithm. As you saw, the absence of race-based adjustments does not guarantee unbiasedness. Just like the presence of racial adjustments doesn't guarantee that something is biased, it all really depends on what that label is and what the use case is for the algorithm. And finally, this kind of bias from label choice can be very hard to catch. As I showed you, had you just taken the algorithm at face value and looked at its ability to predict cost, you would have found that the algorithm is basically unbiased for predicting cost, but that would have been catastrophically wrong, because the algorithm wasn't supposed to just predict cost. It wasn't a finance tool to be used in the hospital's accounting department. It was a tool to be used to allocate a health resource, and that's what allowed us to catch this bias. It was understanding what population health is supposed to do, understanding how these algorithms are used. And I wanted to flag that because the people in this virtual room, you guys have exactly the kind of knowledge to catch these problems. You guys are typically working within health systems, or just have deep knowledge of health, but you also know how algorithms work and how data frames work. And so that combination of things is exactly the kind of thing that you need to catch these problems. And I really hope that you start taking a careful look at the algorithms that you encounter in your work and elsewhere to try to find these kinds of problems, because the scale is huge and the potential for harm is also huge.

Okay, let me move on from the evil twin to the good twin. So I'll start with an observation that is sad but fairly obvious, which is that when you look at the distribution of pain, so look at surveys that ask people, over the past couple of days, have you been in severe pain? It is shocking how unequally distributed pain is, like everything else in the world.
And so poor patients, non-white patients, in the US and around the world, just report a lot more pain. And I just wanna pause there and acknowledge that fact, because we talk a lot about income inequality, but there's inequality in these daily experiences of people's lives that is truly awful and can reinforce all of the problems that we see around income inequality as well. And so in these surveys in the US, black patients have twice the prevalence of being in severe pain at any given point in time. That's a lot. So you might think, well, like everything else, painful medical conditions are just more common. And so you might just think this increased pain is the result of, for example, more arthritis or other things that cause pain, but it's actually not that simple. So there are lots of papers that do a variant on the following exercise, which is, let's take people with knee arthritis and then take two patients. And rather than just comparing who's in more pain and who's in less pain, let's actually condition on the way their knee looks. So let's take people whose knees look the same on the x-ray and let's compare their pain scores. And a very surprising finding is that black patients, lower income, lower education patients, they still report more pain even when their x-rays look the same to the radiologist. And so that's a mystery that's been in the medical literature for a long time. And the way the literature has largely squared this circle is to make the following observation. We've looked at the knees and we've determined that the knees look the same. So if the source of the pain is not in their knees, maybe it's in their heads. And I don't mean this in a bad way at all. There's a lot of really excellent research showing that if you take two patients, one under more stress than the other, the more stressed patient is gonna report more pain from the same physical stimulus. There are a lot of psychosomatic factors. There's just a lot of other things going on in the lives of poorer people, non-white people, that make them less able to cope with pain. So for all these reasons, it's very plausible that this is the case. Alternatively, the problem could be with doctors: doctors are under-treating some patients relative to the amount of pain medication, for example, that they should be getting.

So let me just walk through a concrete scenario to give you a real sense of how this might play out in practice. A patient walks into your office with knee pain and you refer her for an x-ray, after of course doing a very careful physical exam, because that's very important. And if the implication of this literature is taken seriously, what you'll find on average from that x-ray is that the pain that your black patient is reporting is not gonna be reflected, or is less likely to be reflected, in the disease severity on that x-ray report. So what's gonna happen is you're gonna say, well, the knee looks okay, so I'm gonna pursue some other things. I'm going to work on stress or work on other things, but I'm not gonna refer them to an orthopedist or do anything else that's specifically focused on the knee. Now notice that everything I've told you so far depends a lot on what we mean by disease severity. Like, the knees look the same. The disease severity is the same. All these statements are very, very dependent on measurement. And so how do we measure this? In the case of knee arthritis, well, let me show you the current state of the art.
It's someone looking at the knee and doing a very careful job of grading it according to accepted criteria. So what are these accepted criteria? Well, there are objective grading scales that go through every compartment of the knee and grade them. And the most commonly used scale was developed by Doctors Kellgren and Lawrence in 1957. That's the Kellgren-Lawrence grade. And when you go back to those original studies, you find that they were done by comparing, for example, coal miners to office workers in England in the 1950s. And in fact, in these original studies, in the methods or whatever that section was called back then, they don't even mention the race or sex composition of the population, because it was all the same. And so that might make you worried, and that might make you think, well, if that's the state of the medical knowledge and where these radiologist impressions and scores are coming from, maybe we could do better with an algorithm. And so if humans are missing something, we know that algorithms are now achieving very high performance on these kinds of tasks. So we might want to enlist an algorithm to help.

But here's the problem. When you look at all of these papers that are training convnets to look at x-rays, what do they do? They say, oh, we've achieved human level performance. But that's exactly what we don't want. We do not want human level performance. We want an algorithm that actually does better than these humans who might be making mistakes predictably or might be biased in other ways. And that is, I think, a deep problem in a lot of this literature: we say we're predicting atrial fibrillation, we say we're predicting arthritis, but in fact, what we're predicting is a doctor looking at a thing and telling us what the doctor sees. And those two things can be subtly but importantly different, in the same way that someone's health needs and their health costs can be different, too. Because one is an objective statement about someone's health and their physiology, and the other is filtered through layers and layers of bias and structural disadvantage that we might not want.

So again, to return to a common theme, we were interested in seeing whether we could find a better target for prediction. And so the standard ML playbook here, as I mentioned, is to get a bunch of x-rays and train the network to spit out what the radiologist would have said about this knee. And it turns out that there's another human that you might want to ask about the knee. And that's the patient. And that was the basic idea for our revised effort to train an algorithm to interpret x-rays: not learning from what the radiologist sees when she looks at the x-ray, but listening to the patient when she says, my pain is nine out of ten from this knee that might not have looked so bad to the radiologist. Now, this will not be news to any of you, but finding data to do this is not at all straightforward. It is pretty easy to find x-rays paired with a radiologist's interpretation of that x-ray, because that's sitting in every hospital's PACS. So it's easy to just pull that out and dump it. But as any of you who have tried to get patient-reported outcomes into your hospital's data warehouse know, that is still pretty rare and pretty hard to do. And it's very labor-intensive to do. And so it's a lot harder to find data sets that match those two things: the x-ray with the patient's experience of pain.
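Once you do have x-rays paired with patient-reported pain, the modeling change itself is small: you keep the images and swap the label. Here is a toy sketch using keras in R, where `xray_array` (preprocessed images), `klg` (radiologist grades), and `koos_pain` (patient-reported pain scores) are hypothetical objects, and the tiny architecture is a stand-in, not the network used in the actual study.

```r
library(keras)

build_model <- function() {
  keras_model_sequential() %>%
    layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                  input_shape = c(224, 224, 1)) %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
    layer_max_pooling_2d(pool_size = c(2, 2)) %>%
    layer_flatten() %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 1) %>%          # regression head
    compile(optimizer = "adam", loss = "mse")
}

# The standard playbook: learn what the radiologist sees
model_klg <- build_model()
model_klg %>% fit(xray_array, klg, epochs = 10, validation_split = 0.2)

# The revised target: learn what the patient reports
model_pain <- build_model()
model_pain %>% fit(xray_array, koos_pain, epochs = 10, validation_split = 0.2)
```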
So we were very, very lucky, because there's an NIH study of knee arthritis that we were able to plug into and just get the data. And once we had that data, which is something I'll come back to in a second, it's a very, very straightforward ML problem. Because then, instead of just running the old playbook of getting the x-rays and training a network to predict the radiologist, you get the x-rays and train the network to predict the patient's knee pain. And I want to make a point that I think is one of the coolest tricks of ML in this area, which is that if an algorithm can tell, given two knees, oh, this one's painful, this one's not painful, that means that there is signal for predicting pain in the knee. So we try to rule out some confounding factors, which is always a problem. But if you can predict pain from the pixels of the x-ray, then you can trace that pain to something in the knee that is showing up on the x-ray, and not something that is in society or in someone's head.

So let me show you a couple of summary statistics from that. Our sample is a longitudinal sample of people with knee pain across the US. It's a very diverse population. And if you just compare black and white patients' average levels of pain, the difference is enormous. So this is a counterintuitive scale where no pain is 100 and severe pain is less than 86. And the unconditional difference between black and white patients is about 11 points, so it's over halfway between no pain and severe pain, every day. So that's the baseline. And if you control for KLG, this is the radiologist score, you actually account for about 9% of that gap. So black patients, on average, have worse knee arthritis, and that's accounting for about 9% of this gap. But if you adjust for our algorithm's severity measure, our algorithm-predicted pain score, you actually get rid of about half of that gap, or five times more than the standard measure. And we see similar results for income and education. So the algorithm is finding things in that x-ray that are linked to pain. And it's things that are disproportionately affecting black patients in our sample, and disproportionately not making their way into the radiologist's severity judgment as manifested in the KLG score.

So there are a few things that could be going on here. The most important one, as a recent paper suggests, is that algorithms can actually reconstruct race from an x-ray, but they could also reconstruct body mass index or anything else that's correlated with pain. And one advantage of this dataset is that we can actually control for those things directly. So rather than saying, oh, the algorithm could be reconstructing some imperfect measure of race, we can actually just put race into the algorithm, or into the regression, directly. And we show that just including a perfect version of race, or at least the self-reported version of race, doesn't change the algorithm's ability to predict pain. So it's still predicting pain even when you tell the algorithm this person is black or white; it's predicting it with the same strength, and that correlation is about the same size. So we don't think any of these observed things are actually why the algorithm is able to read pain off of these x-rays. We cross-validate across study sites, so it's not picking up on artifacts that might come from some sites having more pain than others. And it's not even a better weighting of the radiologist's features.
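To make these checks concrete, here is a minimal regression sketch in R, assuming a hypothetical data frame `oai` with columns `koos_pain` (patient-reported pain), `alg_score` (the algorithm's severity measure), `klg` (the Kellgren-Lawrence grade), self-reported `race`, and a couple of illustrative radiologist sub-scores; the column names are mine, not the study's.

```r
# How much of the black-white pain gap does each severity measure explain?
m_raw <- lm(koos_pain ~ race, data = oai)                 # unconditional gap
m_klg <- lm(koos_pain ~ race + factor(klg), data = oai)   # gap given radiologist severity
m_alg <- lm(koos_pain ~ race + alg_score, data = oai)     # gap given algorithmic severity

# Is the algorithm just reconstructing race? If alg_score's coefficient barely
# moves when self-reported race is added, race reconstruction is not what
# drives its ability to predict pain.
coef(lm(koos_pain ~ alg_score, data = oai))["alg_score"]
coef(m_alg)["alg_score"]

# Is it just a better weighting of features the radiologist already grades?
# Regress the algorithm score on those features and see what is left over.
m_known <- lm(alg_score ~ factor(klg) + jsn_grade + osteophyte_grade, data = oai)
summary(m_known)$r.squared   # a low R-squared suggests the algorithm found new signal
```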
So if you just regress the algorithm pain score on all of the detailed radiologist reports, on every single compartment, you can't explain the algorithm score with things that we know about already. So the algorithm does seem to be finding new sources of signal in that x-ray, and a really important open question is what those things are, which we're trying to address with some of our collaborators, including Judy Gichoya at Emory.

So why is this so important, and not just a curiosity about pain scores? It's because the way we judge severity is the way we allocate a lot of things, including, most critically, knee replacement surgeries. And we know that there are huge disparities between black and white populations in rates of access to this. And I think the going hypothesis, of course, like so many other things, is that this is about income, insurance, access, things like that. But what if it's also the way we're reading x-rays that's systematically excluding people from getting access to these therapies? So we did the following simulation exercise. Let's take all of the patients with severe pain in our sample. Severe pain is part of the criteria, not just for referral to an orthopedic surgeon; in many settings, insurance companies actually use these criteria to determine who's eligible for surgery. So pain is half the problem, but of course you can't just replace people's knees because they're in pain; they have to have a knee problem. You don't wanna be replacing someone's knee for a problem that's not in their knee. So what we did is, among those people with severe pain, we simulated swapping out the radiologist's severity score for our algorithm's severity score, and giving the same number of knee replacements to people who were in pain and whose knees looked bad to the algorithm, not people who were in pain and whose knees looked bad to the radiologist. And if we did that, we would actually double the fraction of black patients' knees that were eligible for surgery. So this, again, is a pretty large amount of bias in these simulated guidelines, resulting from using the radiologist's opinion rather than something else.

So stepping back, I think this is going to be a really big research area, not the knee pain stuff specifically, but finding better proxy measures than human judgment. The whole reason we want algorithms in medicine is not to reproduce what humans are doing, including all of our errors and biases. We want algorithms to do better than humans. But if we wanna do that, then we need algorithms that are learning from nature, not learning just from humans. The problem is that data on these patient outcomes and experiences that we need to train the algorithms are siloed; they're locked up in health systems. And so this market is fundamentally broken. If you're lucky enough to be at a hospital that has x-rays linked to patient-reported outcomes, that's great. But think about how talent is distributed. What's the likelihood that the right person is gonna be in the right place with the right data? It's just very small, and I think it's been a huge barrier to this field becoming a real field. So I wanted to mention one nonprofit venture that I co-founded a couple of years ago called Nightingale Open Science. And to summarize the point of this exercise, it's basically just ImageNet for medicine.
It is a way for us to build relationships with health systems in the US and around the world, academic centers but also under-resourced county health systems, and invest in building up exactly these kinds of data sets: data sets of medical imaging linked to interesting patient outcomes and experiences, and curated around really critical unsolved medical problems like sudden cardiac death, like cancer metastasis, like pain. So in all these areas, I think algorithms applied to images and waveforms have huge potential, but it's just hard to get access to the data. So we create the data inside the health system. We actually have a lot of philanthropic money available, so if anyone's interested in building some data sets around knee pain, other kinds of pain, patient-reported outcomes, that would be great. Please, please get in touch with me. We've got a ton of money, and we just give it to people to build data sets. Those data sets stay in the health system, and the health system does whatever it wants. Researchers can run their own projects. We can pay for anything, research time, et cetera. And then in exchange we get a de-identified, HIPAA safe harbor version of that. And we put that on our cloud server and we make it available to nonprofit researchers at no cost, under a very lightweight DUA that looks basically like the MIMIC DUA. And so please get in touch with me if you are in the business of making data sets for your research; we can absolutely work together. Or if you're interested in getting access to this, we're starting to give access to kind of friends and family, which I consider all of you. And we're launching publicly at a NeurIPS workshop this December, I think it's December 14th, and I'll post links in the chat as well.

So let me close by saying that all of these examples have taught me that finding the right target for the algorithm, the label choice problem, is really central. It is a huge source of problems in health algorithms. And as I'm learning as I do more of this work outside of health, it's very similar in other fields. So in many, many settings, we're interested in predicting some underlying quantity, but we don't see that underlying quantity. We just see some convenient proxy. So in criminal justice, we might be interested in predicting someone's true likelihood of committing another crime, but we don't have that in our data set. We have a variable called arrests, or convictions. In finance, we're interested in creditworthiness: is someone gonna pay back their loan? But instead what we have is income. So in all of these settings, I think we face similar problems. And because we can actually get data, and all of you can get data, to explore some of these questions in health, I think a lot of other fields are actually learning a lot from health about what bias looks like, because it's much harder to access data on algorithms and predictions and outcomes in other fields. Whereas in health, all these algorithms are just sitting in the health system's data warehouse, where people like you can access them and study them and detect bias and other problems. So I think this is a major opportunity for all of you, and all of us, who are working at the intersection of health and data science. And I think that people like you who are bilingual, who are interested in health and understand what are important and interesting questions to ask, but who also have the capacity to do stuff with data, all of you I think are the future of this field, and I'd love to work with you.
So I'm looking forward to questions, hopefully we have time for some, and I will stop here.

Hello, sorry, I can't get my camera back on. Hang on a minute, I'm pressing the button.

I can hear you, Chris.

Yeah, okay. Well, you don't need to see my face. I don't think it's going to work, so not to worry. Anyway, thank you very much. That was absolutely fascinating. The chat has been very enthusiastic throughout as well. I love the idea of, well, in my team we're very interested in things like fairness and algorithmic fairness, but the idea of doing better than a human, I think, is a really sort of novel, brilliant one. So let's go to some questions in the chat. Let's start with this one. How do you deal with the flurry of fairness metrics in the literature? Isn't every model unfair under at least one definition?

Thanks for pointing that out. It's such a great question. I think there's a proliferation of fairness measures in the literature. And exactly as you suggest, there are also some very elegant proofs that unless very specific and unlikely conditions are met, you cannot simultaneously satisfy all of these things. So I just posted a link to that algorithmic bias playbook that I mentioned. And here's kind of how I think about it. Precisely because there are so many measures and you can't have it all, you need to work backwards from the real world use case of these algorithms. So what is the algorithm doing exactly? What is the decision the algorithm is trying to inform? And what is the ideal piece of information that you'd like to give that decision maker to help them make the decision better? And so you work backwards from that. I think we didn't know enough to articulate it in this way in our original study, but that's how we define bias. So we say, okay, for example, this algorithm is being used to allocate a resource that helps people with their health, and what we want is for that resource to go to people with high health needs. So that is the articulation of the ideal target. So now all we need to do is find some measure of health needs. And then we take the algorithm score, we look at people in our two categories, or five categories, or whatever categories you wanna investigate bias on, and you compare that ultimate ideal target that you're interested in for people at the same score. And I think that has a nice set of parallels to a lot of civil rights law, for example, that does the same kind of thing. In fact, there's one case, and this is a bit of a tangent, but it's kind of interesting. So there was a case in the 70s that went to the Supreme Court, and it was a jail that wanted to hire people to lift heavy stuff and do maintenance at the jail. And what they did is they set height and weight requirements. So you needed to be above a certain height or above a certain weight to be considered for this job. And what the Supreme Court ruled, as it went all the way up in the process, was that that was actually unconstitutional, because it discriminated against women. Now, that was ruled to be discriminatory because it was a proxy for the ultimate quantity of interest, which was someone's ability to actually do the job and move heavy objects. And so what the Supreme Court said is you can actually evaluate people on their ability to move around heavy stuff.
And that actually might lead to fewer women being hired, if women are less able to move around heavy stuff than men, but that is the ideal target for your hiring, and you're not allowed to use discriminatory proxies. So I think that's referenced in the playbook. I forget the actual name of the case, but I think there's a really nice parallel between that and what we're trying to do, which is articulate the ideal target and hold the algorithm accountable for that. And that's your way out of this morass of different fairness measures, where I think it's genuinely very confusing to figure out which to use. So: working backwards from the decision to the ideal target, and evaluating the algorithm based on that.

Great, thank you. Okay, so the next question is, are there published guidelines that could be used to analyze your algorithm for biases and mitigate them?

Yeah, so I think there's a ton of work in this space. And I'll also mention, because I think I saw this in the chat, there are also a lot of packages that are trying to build solutions that screen algorithms for bias. And so I put these all in the same category of, can we create some generic tools for assessing bias? And I think you can, to some extent. So when the bias results from a failure of generalization, where it's trained on one population and it's discriminatory on another population, that's something like the pulse oximetry example, where apparently the pulse oximeter was trained on a pretty homogeneous population and it misestimates true blood oxygen concentration for patients with darker skin. So that's a very straightforward failure of generalization. It was supposed to be predicting oxygen. It fails to predict oxygen. But contrast that to some of the cases that we were studying, where the algorithm was supposed to be predicting cost. If you just looked at cost and ran a package, you would have concluded that the algorithm was unbiased. So what do you need to do to find that bias? Well, you need to engage in this kind of semantic exercise of, what is the value system that this algorithm is trying to learn? What is the purpose of the algorithm? And the purpose is never in your dataset. The purpose is just something that you as a human, all of us as humans, need to figure out, and that's not in the data. And that's why I think generic guidelines can only go so far. They have their utility for doing these basic screens: the algorithm is supposed to be predicting A, and we can look to see how it's predicting A for these two groups and compare performance. But that's never going to find label choice bias. And as we've learned over the past couple of years, label choice bias is the biggest problem that's affecting all of these health algorithms. Not to say that there aren't other kinds of biases, but I think that is by far the most common thing we've found, and I think it's also the hardest to catch, for exactly this reason.

Right, we've got plenty of time, so I'm afraid we're going to keep grilling you for a while. There are plenty of people who are still quite enthusiastic. So there's another question that arose in the chat, actually, which is very interesting. Should unifying machine learning frameworks such as tidymodels have built-in support for fairness metrics?

Yeah, I mean, absolutely. I think the more ways in which we bring this issue to the top of mind, the better.
And the only caution I'd say is maybe the same one as I just mentioned, which is that we shouldn't be falsely reassured when algorithms pass those generic metrics with flying colors. Because again, if you just take the algorithm's value system as given and evaluate it on that, you might find that it's unbiased, but you might also find that it's just doing the wrong thing. And those kinds of generic checks that are built into packages are not going to find that. Not to say that they couldn't, but I think the way many of them are currently built, they are not going to find these problems. So for example, I think it would be fantastic if, in a lot of these packages, you had an option to not just ask, for the algorithm that's supposed to be predicting A, how does it do for A in terms of accuracy and in terms of disparities between protected groups? Let's also build in a field for, what is the true underlying quantity of interest? Is that in the data? What are some variables in this data frame that could give me some insight into that underlying measure? And now let's look at how the algorithm is doing on those. I think that would actually be very useful, but I think that's not the way many of these packages currently are. They just take what the algorithm is doing as given, and I think that can be very dangerous.

I've got two questions here that are slightly similar, so I'm going to try and push them together a bit. The first one is, what do you make of a recent paper showing you can predict race from images across a variety of data sets? And then a related question is, we hear stories about medical imaging studies going wrong because of data leakage, where the signal comes from a part of the image not related to the actual pathology. Was there anything you've done to specifically look at this kind of problem?

Yeah, so as I mentioned, I think these are two big potential confounds with any of this. So the paper you mentioned, I think right now it's a pre-print, but it seems very convincing. And I think that fits with a lot of other indications we have that, for example, you can read age and sex off of retinal images. And so it doesn't seem at all implausible that algorithms would be able to very accurately reconstruct someone's race from looking at their x-rays in a way that humans might not be able to. And so we were very worried about something like that in the paper that I mentioned on knee pain. And I think if you read through the paper, it gives a way to test for that. So basically you see if the algorithm is still able to predict the thing you're interested in, even when you control for, not the algorithm's reconstruction of race that might be doing the work of predicting the outcome, but actual race as reported by the patient, which is presumably the variable we're interested in the disparity in, not the algorithm's kind of noisy reconstruction of it. So if you control for that and the algorithm is still doing the same thing, the correlation structure is the same, then you can feel pretty good that that variable isn't how the algorithm is getting there. And so I think that's worth keeping in mind in a lot of these studies, but there are ways to deal with it. And in the same way, there are these well-known artifacts, like where the little R marker is placed, or how x-ray machines differ, that can easily allow an algorithm to know whether a patient is coming from site A or site B.
And so I think those are also very top of mind, or they should be top of mind, for anyone doing work in this field. I think in the same way that you can control for race and make sure that's not how the algorithm is getting its answer, you can also just make sure that the algorithm is predicting well across sites. So you hold out a site, and you make sure an algorithm trained on all of the other sites can still predict well on that held-out site. And so there are just ways to start crossing these problems off of your list.

Great. Okay. So I think we're going to wrap up soon. I think people are still wanting to ask you questions, though. So I'm being asked whether you're going to be at the birds of a feather session that's coming up later, and whereabouts you'll be.

Tragically, I was not able to. I'm teaching this semester, and of course teaching is wonderful, but it also makes life chaos. So unfortunately, my schedule is not the way I want it to be this semester, so I wasn't able to make it. But please email me, tweet at me. I'll also stay on in the chat for a little bit and just scroll through and make sure I didn't miss anything. But thank you guys so much. This was fantastic. Thanks for the great questions and all the interest, and I look forward to being in touch.

Great, thank you. So I think what's happening now... oh, hello, sorry, we're back. Yes, so there's a break now. The next talk is at 24 minutes past, and you can go to the Spatial Chat now; there's also a bit of feedback going on in the chat here. So either go to the Spatial Chat, answer the feedback questions in this chat, or we'll see you in the next talk at 24 minutes past. Thank you.