For the last piece of the week, I want to turn back to the question of what it is we're actually trying to optimize. I want to look at the different ways in which machine learning and deep learning inevitably have assumptions built into them, and these assumptions lead to biases, some good, some bad, but ones you should be aware of. Sometimes the biases come from the training data: if you learn statistical patterns on historical data, you'll replicate those statistical patterns, and however the data was labeled, you learn those labels. Remember the cobras, and the ships that learned not to head for the finish but to destroy the other ships: loss functions can lead to something other than what you thought you were really minimizing. Other biases come, quite mathematically, from the nature of the loss function being learned. We'll look at some of each of those in this last module.

So what do you want to optimize? Controversially, many police departments are using machine learning to decide, or to provide advice to judges on, who should get parole and for how long, or who should be imprisoned and how long they should be sentenced for. The state of Pennsylvania uses a random forest designed by a Penn professor, the criminologist Richard Berk. Not deep learning yet, but I would suspect they'll be using it soon. Well, why might the machine learning go wrong? Richard Berk knows how to run a random forest. What is it trying to do? We have someone's history, their criminal record and other information about them, and a label, and we predict whether they will be arrested or not. Surely that's a good thing to optimize: we can predict the probability of arrest. Yes, but what we really want to optimize is the probability of the person committing a crime again, which is not the same as their probability of being arrested. If the police tend to arrest African Americans more often than whites, relative to how many crimes each group commits, then what are you doing with this machine learning? You're perpetuating a racist tradition of arrests. It's not my role to comment too much on what the true utility function should be, but just note that this is incredibly common. There's a divergence between what you care about, will the person commit a crime, which you can't measure, and will they be arrested, which you can measure, and so you tend to optimize for the thing you can measure rather than the thing you care about.

I've seen many companies do stupid things like optimize their ads to maximize click-through, the probability of a click. They don't actually want clicks; they want to make money. They want to know how much you will spend and how much profit they will make from you, but lots of companies can't measure that, so they optimize a surrogate: whether you click or not. That often leads to unexpected consequences.

There's a whole field called algorithmic fairness; two Penn faculty, Michael Kearns and Aaron Roth, have written a whole book about it. The simple setting is that there are two groups, one called a protected class, often a minority that you care about, and the general population, and you'd like some fairness criterion to build into your loss function. You might want equal accuracy of prediction for the two groups.
If you're doing speech recognition, you might want it to work just as well for Americans raised in America as for Americans raised in India or Wales, who speak with different accents. So you might want equal accuracy for the two groups. You might want the same false positive rate, which is not necessarily the same thing: how many people do you incorrectly imprison, for how many people do you miss a cancer? You might want the same fraction labeled true, so that you hire the same fraction of people from the two groups. Lots of different loss functions. You might want a rule that says you simply can't use the attribute, that you can't use whether someone is male or female when deciding what ads to show. That sounds nice, but it turns out that even if you don't know whether someone is male or female, there are lots of indications of it: how much hair they have in their photo, for example, might be indicative. Not guaranteed, but indicative. So there are lots of criteria you might use to design your loss function, and simple math tells you that these criteria are incompatible: unless the two groups have identical base rates or your predictions are perfect, you cannot satisfy them all at once. You can't optimize everything. You need to optimize something, or some weighted combination, but you can't make everything as small as possible or everything as big as possible. There are always trade-offs. So you have to think about what's going on when you're designing your loss function.

Talking about using or not using criteria: Facebook historically allowed people to advertise, I'm looking for a worker in my warehouse, I want someone who is male, I want to show the ad to males between the ages of 20 and 40. It turns out that's illegal. In the US, you are not allowed to target advertising based on race or gender or age. Facebook was sued over this and stopped a couple of years ago; they no longer let you tick a box saying, I'm looking for Black males in their 20s. However, that information is of course still easily identifiable from what people post on Facebook. So we have prohibited overtly targeting men or women in employment ads or financial ads, but often that information is still implicit. So it's important to figure out, again, what you want in the loss function and put it into the loss function.

The other forms of bias are, in fact, very mathematical. They have to do with the respective error rates within the protected and non-protected classes, and with how often you label someone as one class versus the other. Now is the time to turn back once more to Colab, run the experiments, and see how the loss functions necessarily introduce certain forms of bias.
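Before you do, here is a minimal sketch in Python of the fairness criteria just listed: per-group accuracy, false positive rate, and fraction labeled positive. Everything in it, the group names, base rates, and error rates, is made up for illustration; it is not the course's Colab notebook. Even when the classifier treats both groups identically, the fraction labeled positive differs as soon as the base rates differ, which is exactly the kind of trade-off the math forces.

```python
# Sketch: per-group fairness metrics on synthetic data (all numbers assumed).
import numpy as np

rng = np.random.default_rng(0)

def group_metrics(y_true, y_pred):
    """Accuracy, false positive rate, and fraction labeled positive."""
    acc = np.mean(y_pred == y_true)
    negatives = y_true == 0
    fpr = np.mean(y_pred[negatives] == 1) if negatives.any() else 0.0
    pos_rate = np.mean(y_pred == 1)
    return acc, fpr, pos_rate

# Two synthetic groups with different base rates of the true label.
n = 10_000
y_a = rng.binomial(1, 0.3, n)   # protected group, 30% base rate (assumed)
y_b = rng.binomial(1, 0.5, n)   # general population, 50% base rate (assumed)

# One imperfect classifier applied identically to both groups:
# it flips each true label with probability 0.2.
pred_a = np.where(rng.random(n) < 0.2, 1 - y_a, y_a)
pred_b = np.where(rng.random(n) < 0.2, 1 - y_b, y_b)

for name, y, p in [("group A", y_a, pred_a), ("group B", y_b, pred_b)]:
    acc, fpr, pos = group_metrics(y, p)
    print(f"{name}: accuracy={acc:.3f}  FPR={fpr:.3f}  labeled positive={pos:.3f}")

# Accuracy and FPR come out roughly equal across groups, but the fraction
# labeled positive differs because the base rates differ; equalizing that
# fraction instead would force unequal error rates between the groups.
```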