My name is Abdul Majid Raja, and as you can see, my talk is about machine learning bias. A little bit about me: actually, Google can tell you more about me than I can, so you could have Googled me by now. In fact, it knows which apps I use, when I use them, and which ISP I am on, so I don't have much to add here. I am the organizer of the Bangalore R User Group, I open source at this link, and I blog at this link. I run a language-agnostic data science newsletter at nulldata.substack.com, and I also run a website called freshgrads.info, which helps fresh graduates read real stories.

Getting into the talk, here is what we are going to cover. First, we will try to recognize the problem: what exactly is the problem? Then we will see what machine learning bias is, then the definitions of fairness, and then interpretable machine learning, for which the buzzword these days is "explainable AI". After that we will look at some Python tools, and if time permits, a case study.

To begin with: if I told you that computers lie, you probably wouldn't believe me, right? Whenever there is a comparison between an accountant and a computer, we have always been told that computers never lie. But that is not the case. I don't know how well you can read this, but computers actually do lie. This is part of a project called ImageNet Roulette, created by two researchers to expose the underlying problems in one of the most popular machine learning datasets, ImageNet. ImageNet is one of the most used datasets behind a lot of pre-trained models, and look at what it has done to Obama. At this point you might think, "Abdul, you are an Indian, why haven't you put up an Indian politician?" Well, I just wanted to go back home alive in one piece, so I didn't put up any Indian politician.

The next problem sample is Google Translate. Google Translate is trying to improve, but here is what actually happens. There are two kinds of languages: ones with gender-specific words and ones with gender-neutral words. In English you say "she is a doctor", you translate it into Hungarian, which is a gender-neutral language, and when you translate it back into English it comes out as "he is a doctor". Yes, we know Hungarian is gender-neutral, but this happens because of a machine learning concept called word embeddings: in the text these models were traditionally trained on, "he" was always closer to "doctor" and "she" was closer to "nurse". That is how this happens.
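To make the word-embedding point concrete, here is a minimal sketch, not from the talk, of how that kind of bias can be probed. It assumes the gensim library and the publicly available "glove-wiki-gigaword-50" vectors.

```python
# Illustrative only: probe how close "he" / "she" sit to occupation words
# in a small pre-trained GloVe embedding (assumes gensim is installed).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

for word in ("doctor", "nurse", "engineer"):
    print(word,
          "he:", round(float(vectors.similarity("he", word)), 3),
          "she:", round(float(vectors.similarity("she", word)), 3))
```

If the similarities differ systematically by gender, downstream systems such as translation can inherit that skew.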
Now, you might think that I applied to Google, didn't get the job, and have been defaming them ever since. That is not the case, so let us look at one more Google example: Google Photos. When Google Photos launched its automatic labelling, it classified Black people as gorillas. It is not actually funny, because if you put yourself in that position, you realize that this is how the world treats you. It is an indication of the problem.

Now let us move from Sundar Pichai to Satya Nadella. Microsoft, with a big PR announcement, launched a bot called Tay. They wanted it to be a very cool millennial bot, and the biggest mistake they made was using Reddit data to train it; anyone who uses Reddit knows what Reddit is like. The bot turned out to be anti-Semitic, and Microsoft had to take it down, so it ended up as very bad press rather than positive press.

And then look at this picture: that lady stole a neighbour's bicycle to go somewhere, and she has been classified as high risk. You know what the reason is, right? A lot of papers have been written showing that criminal prediction algorithms are usually biased against Black people. So by now it is like "I smoke; if I smoke, I die" — you recognize the problem.

So what is it, exactly? Machine learning bias is nothing but this: whenever you see that an algorithm is not fair to an entire group of people in the population, you say there is machine learning bias. We try to define it with respect to fairness, and very ironically, we have no common consensus on a standard definition or framework for fairness, because it is still an emerging research area. That is why most of my content is theoretical. But machine learning researchers have taken some cues from the judiciary, where legal terms are used, and they define fairness in two ways: one is called group fairness, the other is called individual fairness.

For group fairness, you take the entire population and something called a sensitive variable. What is a sensitive variable? In legal terms, a sensitive or protected attribute is something you should never use: it identifies personal attributes such as your religion, your caste, or your race, and a judge should never use it as a dimension when giving a verdict. That is how the law treats it, and that is why it is called a protected or sensitive attribute. Group fairness says that when you split the population on this sensitive attribute and apply any statistical measure, both groups should get the same result. Individual fairness, instead of approaching the problem from the group perspective, drills down to the level of individual data points — here the data points are humans — and says that similar individuals should be treated similarly.
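Here is a minimal sketch, with made-up column names, of the group-fairness idea just described: split the scored population on the sensitive attribute and compare a simple statistical measure (the positive-prediction rate) across groups.

```python
import pandas as pd

def positive_rate_by_group(df, sensitive_col, prediction_col):
    """Share of favourable predictions for each value of the sensitive attribute."""
    return df.groupby(sensitive_col)[prediction_col].mean()

# Hypothetical scored data: 1 = favourable outcome (e.g. loan approved).
scored = pd.DataFrame({
    "gender":     ["F", "F", "F", "M", "M", "M"],
    "prediction": [0,    1,   0,   1,   1,   0],
})

rates = positive_rate_by_group(scored, "gender", "prediction")
print(rates)                           # under group fairness these should be roughly equal
print("gap:", rates.max() - rates.min())
```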
Now, a little bit about the causes. The first one is a skewed sample, which means the sample you have collected is highly skewed. Let me give you an example. Take India; right now we are in Chennai. Assume Chennai has had a police station since 1900, and Bangalore, where I come from, has had one only since 1990. Because there are so many records in Chennai since 1900 and far fewer in Bangalore, your data is highly skewed. When you train a machine learning algorithm on this, it will end up telling some visiting American, "don't go to Chennai, there is more crime there" — purely because more crime was recorded there. That is what happens with a skewed sample.

The second cause is a tainted example, which we saw at the start, where "he" is closer to "doctor" and "she" is closer to "nurse". That is not how it is supposed to be, but because our prejudice, our human bias, is fed into the data, the example itself has been tainted.

Then there are limited features. Imagine you have a dataset for India: you can get a good dataset for the North and a good dataset for the South, but you cannot get a good dataset for the Northeast. For minorities, for people who have not always been included, you often have limited features, and anyone who practices machine learning knows that with limited features your algorithm cannot learn enough about that particular group. So again you end up with a biased algorithm. Sample size disparity is like the example we just saw: because one group has fewer data points, you again get a disparity.

The final cause is the most important one: proxies. What happens in a lot of cases is that I am a responsible data scientist, so I decide not to include gender in my machine learning algorithm, or not to include race or people's background. But there are proxies, features that can stand in for those variables. Let me give you an example. Say you have to predict something in India and there is a column called language, and the dataset is about Tamil Nadu, because we are in Tamil Nadu, of course. People in Tamil Nadu predominantly speak Tamil, but there are other languages: people might speak Kannada or Malayalam or something else, because these are neighbouring states. Now you make language your protected variable. But even then, you can pretty much use longitude and latitude, or the PIN code, to identify the part of Tamil Nadu that is closer to Kerala — say Kanyakumari, where people speak more Malayalam than Tamil because it is on the border. These kinds of attributes can become proxies for the actual sensitive variable, and machine learning algorithms are good at learning exactly these things, so they can bring the bias right back in.
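A minimal sketch of a proxy check, on entirely synthetic data with made-up column names: if the remaining features can predict the protected attribute accurately, a proxy is almost certainly present.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 1000

# Synthetic "Tamil Nadu" data: districts nearer the Kerala border speak more Malayalam.
longitude = rng.uniform(76.2, 80.3, n)
language = np.where(longitude < 77.5, "Malayalam", "Tamil")       # protected attribute
features = pd.DataFrame({"longitude": longitude,
                         "latitude": rng.uniform(8.0, 13.5, n)})

# "language" was dropped from the model, but longitude can still recover it.
proxy_score = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                              features, language, cv=5).mean()
print("protected attribute recoverable with accuracy:", round(proxy_score, 2))
```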
Okay, good, so you have heard a talk about all these things; now what do we do? How do we mitigate it? When we talk about mitigating machine learning bias, all we are trying to do is improve fairness. It is like the glass-half-full problem: we try to raise the level in the glass so that the bias goes down. In a typical machine learning process you have three primary stages — if you follow the CRISP-DM data mining framework you will see a lot more stages, but at an abstract level there is pre-processing, training or optimization, and post-processing.

In the pre-processing stage, instead of using the sensitive variable or its proxies directly, there are certain mathematical methods (please do not ask me exactly what they are, I am not good with that) that you can use to create a new representation, say Z, from your features and the sensitive variable. To give a little intuition: when you plot something in an n-dimensional space and take something orthogonal to your existing plane, it is completely devoid of that information. Similarly, you create a representation that loses the sensitive variable's properties but still holds predictive value. That is one way to improve things.

In the training phase, you can use an age-old technique a lot of people already use to reduce overfitting: add a cost or regularization term that limits your algorithm's predictive freedom. You are not letting it do whatever it wants; you penalize it whenever it tries to misbehave.

The third option is post-processing, where you can add a threshold or a cap to say, "I do not want scores beyond this point; I would rather have a fair score than the highest possible accuracy." Another thing you can do in a post-processing scenario is balance the output. Say you have a dataset of Black and white people in the US, and at the end you see that the algorithm favours white people over Black people. You can add a coefficient or adjustment to bring both groups to the same level, which is something some people do. So those are the three stages.
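Here is a minimal sketch of the post-processing idea just mentioned: choose a separate decision threshold per group so the positive-decision rates come out roughly equal. The group names, score distributions, and target rate are all made up for illustration.

```python
import numpy as np

def threshold_for_rate(scores, target_rate):
    """Threshold that yields approximately the desired positive rate for one group."""
    return np.quantile(scores, 1.0 - target_rate)

# Synthetic model scores: group A tends to score higher than group B.
scores_a = np.random.default_rng(0).beta(5, 2, 500)
scores_b = np.random.default_rng(1).beta(2, 5, 500)

target_rate = 0.30                          # desired share of favourable decisions
thr_a = threshold_for_rate(scores_a, target_rate)
thr_b = threshold_for_rate(scores_b, target_rate)

print("group A rate:", round((scores_a >= thr_a).mean(), 2), "threshold:", round(thr_a, 2))
print("group B rate:", round((scores_b >= thr_b).mean(), 2), "threshold:", round(thr_b, 2))
```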
So we have been talking about machine learning bias, and we have seen how you can improve it a little. What is happening in the industry? The good thing is that people have actually started talking about machine learning bias: since 2016, researchers have started mentioning fairness in their ML papers, which means humanity is answering the call. This next chart is from Kaggle, the machine learning competition platform. In a 2018 survey, Kagglers were asked what they find most difficult about ensuring their algorithms are fair and unbiased. I am not worried about the first or second answers, but I am concerned about the third option: "I have never performed this task." That is genuinely concerning.

It does not matter whether you are a machine learning practitioner, an engineering manager, or in marketing or sales; you have to remember what happens when these algorithms enter social life. As long as an algorithm sits on someone's computer playing games, cards, or poker, fine, we are cool. But imagine that tomorrow this algorithm is deployed at your airport. Tomorrow it decides whether you get a loan. Tomorrow it decides whether you get health insurance. Tomorrow it decides whether you are labelled a criminal. In such a scenario, you simply cannot put anything into production without checking these basic things, so there has to be a universal attempt to understand them. Let us work toward ensuring that our algorithms are unbiased.

That is where interpretable machine learning comes into the picture; the buzzword for it is explainable AI. You do not just build something and deploy it; you try to understand what is happening inside it, whatever the algorithm is. Certain machine learning models are considered black-box models: a data scientist builds one, then just shares a LinkedIn or Facebook post saying "I got 97% accuracy" while having no clue what is going on inside it. Interpretable machine learning (IML) is where you try to make sense of it: humans should be able to understand what is happening behind the algorithm. That is IML.

There are a few advantages IML brings. One is fairness, which is of course what we have been talking about. Then privacy: only when you can see what is happening inside an algorithm do you know whether it violates someone's privacy. Ultimately you can collect as much data as you want; your organization can simply track whether you open LinkedIn and how many times you open it, and build an attrition model, and that model will definitely have high accuracy if you open LinkedIn, Naukri, and Indeed ten times a day, because that is how everyone looks for a job. But do you want to build a machine learning model that violates every privacy on the planet? Then reliability: people start trusting you. I would not trust my bank if it deployed some machine learning algorithm built by a consulting company for a heck of a lot of money, but I would trust it if I knew what goes on behind it. Then of course causality: you get to know what is driving the prediction. And the most important thing, as we said, is trust. When machine learning algorithms show up in social systems, say in government or the public sector, you need trust in the system for your society to be sustainable; without that trust, society breaks loose.

The very basic thing you can do as a machine learning practitioner is a variable importance or variable significance plot. It tells you which features drive your prediction positively or negatively, so you have some basic understanding of what is going on inside the algorithm you claim is a superstar. This, again from the same Kaggle survey, lists some of the techniques in sorted order of how often Kagglers actually use them: comparing predictions against actual results, feature importance, feature correlation, and printing out a decision tree. Say you are building a random forest and have no intuition about it: you can take one tree out of it and print it. (This is not an environmental talk, it is a machine learning talk, so yes, you take a tree and you print it.)

There are also a couple of tools that can make this a lot easier, and that is what we will look at. People have created packages that abstract away the entire complexity and just give you a function to call. There is a very famous package called LIME, and there is SHAP. As you can see in the example code, SHAP can give you explainability for the algorithm you have built in two lines, and it has functions for everything: your conventional machine learning algorithms, your ensemble algorithms, and your deep learning algorithms. There is ELI5 — everyone knows "explain like I'm five", and there is a package named after it. scikit-lego is a very interesting package I came across recently: they have even created two functions that help you build a fair logistic regression. Logistic regression is a machine learning algorithm for classification and for propensity scoring. They have built a fair classifier where you can explicitly include the sensitive variables, and the algorithm will try to exclude them and their proxies while it trains.
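As an illustration of the SHAP usage mentioned above, here is a sketch only; the dataset and model are placeholders (the public breast-cancer dataset and a gradient-boosting model), not anything specific from the talk.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# The "two lines" of explainability: per-feature contributions for every prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global summary of which features drive the model, and in which direction.
shap.summary_plot(shap_values, X)
```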
There is also a tool open sourced by Google called the What-If Tool, which supports TensorFlow and other deep learning models so you can probe them. And in a very recent announcement a couple of days back, PyTorch — Facebook's deep learning library and one of the most popular — released something called Captum, again focused on machine learning interpretability. So that is it for tools.

Let us look at a case study to understand how this plays out in industry. I vaguely gave you examples like a bank, or society, or someone frisking you at an airport, but let us go a little deeper with one case study: attrition prediction. Every company today wants to predict attrition, because when an employee leaves, it is a huge overhead to bring in new people. So the objective is simple: build a machine learning solution to predict employee attrition for the next quarter. The data you use is demographics, compensation, whether the person got a promotion, rewards and recognition — very simple data. Great: you build a model that is an ensemble of random forest and XGBoost, which is what a lot of people on Kaggle do, and your accuracy is acceptable, which means you are probably eligible for next quarter's award at your company. All good. Can we go ahead and productionize this model? No — I just watched Abdul's presentation, so I am not going to do that.

What we are going to do instead is a very simple variable importance plot. That plot now tells us that maternity leave is one of the most important variables driving attrition. Now, is this a matter of machine learning ethics, and why should we care? Because maternity leave applies only to people who are, first, female and, second, married. If you go ahead with this model, you will probably get your award next quarter; you can frame it on your company's wall if you like. But the downside is that the algorithm sitting in your HR department will never let a married woman into the company, because it will always reason: this married woman might get pregnant, which means she might take maternity leave, which means she might churn out of the system, so better not to hire her. The problem is not for me — no one is going to ask me in an interview, "Abdul, are you planning to get pregnant?" But every time a married woman shows up, this becomes a concern, on top of the questions people already ask: are you going to get married, are you going to relocate with your husband? This is already discouraging a lot of people from entering the workforce at a time when there is a huge effort to bring diversity and inclusion into the workforce, and an algorithm like this could eliminate an entire set of people without you even knowing it. Ultimately you would be like Hitler, bombing an entire gender, and you do not want to do that.

So what do you do? You make a simple call: I am not going to use maternity leave. I remove it from my feature list, treat it as a sensitive variable, and build the model again.
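A minimal sketch of the check described in this case study. The HR dataset here is entirely synthetic, with made-up column names; the label is deliberately wired to the maternity-leave column to mimic the biased situation, so only the workflow is the point.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "salary_band":     rng.integers(1, 6, n),
    "promotions":      rng.integers(0, 3, n),
    "awards":          rng.integers(0, 4, n),
    "maternity_leave": rng.integers(0, 2, n),        # the sensitive feature in question
})
y = ((X["maternity_leave"] == 1) | (rng.random(n) < 0.1)).astype(int)   # synthetic attrition label

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# The basic sanity check from the talk: which variables drive the prediction?
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)                                    # maternity_leave dominates here

# The simple mitigation chosen in the talk: drop it and accept the accuracy trade-off.
fair_model = RandomForestClassifier(n_estimators=300, random_state=0).fit(
    X.drop(columns=["maternity_leave"]), y)
```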
And what is the outcome? Yes, of course my model's accuracy is going to be lower, but at the end of the day I can sleep at night, because I have built a machine learning solution that does not have an obvious bias built into it — a bias that would have turned the algorithm against a group of people who are already not given a chance.

So what are the lessons learned? In this case, unlike the examples we saw earlier, there was no bias in the data itself: maternity leave genuinely was an important feature. The issue came up during model training and feature engineering, and that is where we had to catch it. There is always a trade-off between accuracy and responsible data science: when you remove a column, you will take some hit on your accuracy, your AUC score, or whatever your evaluation metric is. And yes, there are better techniques for the same problem — here we simply removed maternity leave, but as we saw earlier, you could instead create a new representation that retains some of its information. Finally, and most importantly, machine learning ethics really matters when you build solutions where the data points are human beings. It does not matter much when you use the Titanic dataset on Kaggle and try to predict whether someone will die on the Titanic, because everyone on the Titanic has already died. But it matters enormously when you build solutions for actual human beings like you and me. So ethics matters.

I had no clue about this topic until I went through all of these things, and I would definitely recommend one last talk, given at PyData London, called "Artificial Stupidity". It is a very entertaining and very useful talk. Finally, I want to leave you with this thought: it is actually very easy to be another cool data scientist, because every data scientist is cool these days. What is tough is to be a responsible, ethics-driven data scientist, and no one is going to tell you to do that. It is a call that only you can make, and that you have to make, so it is up to you. Thank you very much.

Yeah, it was a nice talk, thank you. Hello — we are currently building a machine learning model to predict whether someone is going to fall or not, in a senior home, and we are currently taking gender as a feature, whether they are female or male. In this case, should I be considering gender as a feature? Like in crime prediction, do I need to think about whether to use gender as a feature?

See, in your case it is of course sensible to use it, because a female over forty is highly likely to get osteoporosis or rheumatoid arthritis; there is bone deficiency. You are trying to predict falls, so it is okay. But when you want to build a solution that affects something being given to a person, like a job, or crime prediction, or insurance, then I think using gender is not appropriate.

Abdul, this is Chakri. From your first slide, the translation involving the gender-neutral language: how do you think we should handle such things? Because they define the line between artificial intelligence and human decisions.

So actually, Google has already started handling that. To give you a simple example: the day before yesterday I had a different language pair on that slide, and before the presentation I thought I should double-check it once.
As you can see, the day before yesterday I actually had Turkish and Finnish, and within one day Google had updated the translation. Google really is working on this, and I did not want to be embarrassed in front of the entire crowd, so I double-checked, and Hungarian was the only language I could find for this particular example. The point is that Google has a community initiative where community members validate translations: sometimes they show both gendered forms, or they show one and ask "is this right?", the way you validate things in Maps. So that is how they are already handling it.

Okay, so I wanted to ask: how do you identify the sensitive variable? Is it identified using machine learning itself, or using legal or other means?

No, it is just pure common sense in the domain you are building for. Like he asked: in his case, gender might help him detect falls, but in a different context gender is definitely a sensitive variable. It is daily life, social science.

Hello, hi. It was a nice talk, actually. I have a doubt, or maybe I just want your view. This is all about the bias we are talking about, but in most common practice we impute missing values, so as data scientists we are ourselves injecting some bias. What is your take on that? Because there is a whole suite of algorithms that handles the missing-data imputation part.

See, that may not come under bias, and there are advanced methods: people basically start by imputing a mean or a median, and they go all the way up to building a model whose output tells you what value to fill in. That kind of thing happens in a lot of contexts, but it may not count as bias. What you are saying is that it is not the real data, that you are filling in something you have learned. But we do not have any other choice, right? When the data is not there, you can either exclude it, or — like the example I gave, where you have no values for Northeast India but still want to include that group in the model — you sample and impute values.

So you are saying we should go with the missing-value imputation process because it is standard practice? Yes, because we do not have any other choice. Thank you, thank you.
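A minimal sketch of the two imputation approaches mentioned in that answer: start with a simple median fill, or use a model-based imputer that predicts each missing value from the other columns. The tiny DataFrame is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({"income": [40, 55, np.nan, 70],
                   "age":    [25, 32, 41, np.nan]})

# Simple baseline: fill each column with its median.
print(SimpleImputer(strategy="median").fit_transform(df))

# Model-based: each feature with missing values is predicted from the others.
print(IterativeImputer(random_state=0).fit_transform(df))
```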
Hi, Abdul. Wonderful talk, thank you. I just wanted to know: do generative models come under the ethics of machine learning?

You are talking about GANs, generative adversarial networks?

Yes. Generally, if you run a classifier for object detection and you have less data for one of the categories, what if we run a generative model to generate data for that category, and the generated data is pretty good? Should we consider that an acceptable practice or not?

See, again, I am not exactly sure — you are talking about generating synthetic data, right?

Yes, generating synthetic data, in the sense that using GANs we can generate images. Suppose we are trying to build a classifier for detecting animals, and for one particular animal we have a smaller number of images, so we use GANs to generate more. Is that a good practice or not?

That is very similar to what he asked about missing-value imputation, right? You have to do it; otherwise you will exclude that class. You do not have any other choice.

Hi, Abdul, I have a question. In real time, say I have trained my model with a few datasets, and after that I want to improve it based on where the model mispredicted or misclassified. I want to give feedback and keep improving it. How can I design that kind of architecture?

That is how a lot of recommendation systems, and model building in general, usually work once you move to production: you have a feedback loop coming back to the model. Say you have labelled data, cancer versus no cancer — or let us take spam versus not spam. In a lot of email clients, inside your spam folder they ask you whether a message is "not spam". When you say it is not spam, the algorithm knows it was a false positive. That data is fed back into training, and the model is improved after a certain iteration, say a monthly model refresh. That feedback loop is kept in most productionized solutions; there is a small sketch of the idea at the end of this section.

Yeah, I am here. A serious thank you for bringing these issues up; they are not mentioned often enough. Their visibility has increased a lot, but they are still not discussed often, so thank you for bringing them to so many people's attention and delivering it so well. My question is: if you have a recommendation system, there can often be a large bias that is hard to detect in the output. Recommendation systems say "here are my movie recommendations for you", but because you are female, or because you are married, they give you more traditionally "girly" recommendations, which may or may not work, but there is some bias in there. So how would you suggest detecting bias when the output is so large or so confusing that you cannot really see it? How do you parse the data?

See, again, in that case it is more a matter of business strategy than bias. Say a company sells exclusively female clothing and exclusively male clothing; of course they have to take into account whether a customer is female or male, and you cannot call that bias. It depends on the business context. What I am specifically talking about is when something happens in a social-science context, something government-related, or credit scoring. Take credit scoring: if your credit score is lower just because you are a female, then you can never get a loan; if you want to start a company, you will never get the loan, so you will never get a female entrepreneur, because her credit score was already low. I am talking about those kinds of social-science contexts. What you describe is more a business strategy question about female targeting and male targeting.
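To close, here is the small sketch referred to in the feedback-loop answer above: user corrections (for example "not spam" clicks) are stored and folded back into the training data on a periodic model refresh. The column names and the model choice are illustrative, not anything specific from the talk.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def refresh_model(train_df: pd.DataFrame, feedback_df: pd.DataFrame) -> LogisticRegression:
    """Retrain on the original data plus corrected labels collected in production."""
    combined = pd.concat([train_df, feedback_df], ignore_index=True)
    X = combined.drop(columns=["label"])
    y = combined["label"]
    return LogisticRegression(max_iter=1000).fit(X, y)

# In production this would run on a schedule (e.g. a monthly refresh), with
# feedback_df holding the examples users flagged as misclassified.
```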