Greetings, everyone. Welcome. My name is Meenakshi Kaushik and I'd like to introduce my friend and co-presenter, Neelima Mukiri. We both work at Cisco and we are responsible for the strategy and execution of two container platforms. We see increasing adoption of machine learning by our customers, and that is a reflection of the current world, where machine learning models are increasingly used to make critical decisions. For example, machine learning models are used for hiring and also for firing. Similarly, machine learning models are used to grant FDA approval of a COVID-19 vaccine and also to decide who gets an ICU bed. All of these have real-world implications, and obviously we want our models to be as close to perfect as possible. However, we increasingly see headlines like these. Since these decisions are made by a machine learning model, it is easy to measure them, and measurement leads to awareness, and awareness leads to change.

How do we define fairness and bias? As you can see from the popular definitions I took from the web, it is very subjective and it depends on the stakeholder. In this presentation, we will assume that the stakeholder has decided what their definition of fairness is, and we are going to look at the sources of unfair algorithmic bias, how to make the outcome fair, and then provide a conclusion.

How does bias creep in? This is a popular way to build a machine learning model and deploy it in production. Now let's take an example of a stakeholder, say a bank, which wants to decide who to give a loan to, and they decide to give loans to people who have a projected salary greater than 50K. So they hire a data scientist and give them census data to build a machine learning model. The data scientist builds a model and finds that it provides 80% accuracy. They come back to the bank, the bank accepts it, and the model is deployed in production. Users start interacting with the model, and it does exactly what the data scientist said: it is accurate 80% of the time. However, within a subgroup, females are disproportionately rejected for loans even though their salary is greater than 50K, while males are accepted for loans even though their salary is less than 50K. So even though the model is accurate over the overall population, it is not accurate within the subgroups. How does that happen? If the data were perfect and the machine learning model were perfect, it would be equally accurate within the subgroups as well. However, data is never perfect. So, as with any other algorithm, we do a little bit of exploration and debugging.

Fairness is an emerging and very popular topic, and there are many fairness toolkits available. We have looked at these four toolkits, and the libraries they provide are easily integrable into a machine learning pipeline running on Kubernetes. For data exploration I picked the What-If Tool because it provides interactive analysis and makes it easy to look at what-if conditions, but the other libraries are great as well, and we will see them as we go through other examples. So with that, let me go into my on-prem Kubernetes cluster where I have deployed Kubeflow. Let me start by giving a quick introduction of Kubeflow. Kubeflow is a popular machine learning lifecycle manager, and it is also used by many of our customers. It contains pretty much all the utilities you need for your machine learning tasks.
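Before we dive into the demo, here is a rough illustration of the subgroup gap from the loan example. This is a minimal sketch rather than code from our notebook: it assumes an already trained classifier `clf`, a test feature matrix `X_test`, binary labels `y_test` (1 meaning income above 50K), and a `sex_test` series holding the gender of each test row.

```python
# Minimal sketch: overall accuracy vs. per-gender accuracy and selection rate.
# `clf`, `X_test`, `y_test`, and `sex_test` are assumed to exist already.
import pandas as pd
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
print("overall accuracy:", accuracy_score(y_test, y_pred))

results = pd.DataFrame({"sex": sex_test, "y_true": y_test, "y_pred": y_pred})
for group, rows in results.groupby("sex"):
    acc = accuracy_score(rows["y_true"], rows["y_pred"])
    rate = rows["y_pred"].mean()  # fraction predicted to earn more than 50K
    print(f"{group}: accuracy={acc:.2f}, selection rate={rate:.2f}")
```

A model can score 80% overall here while showing very different accuracy and selection rates for the two groups, which is exactly the situation described above.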
For example, Kubeflow consists of a notebook server for notebook-based exploration, a hyperparameter tuning library, the ability to run experiments, and finally the ability to build pipelines. For running the widget library from the What-If Tool, I'm going to use my notebook server. One thing I want to point out is that I ran into dependency issues running the widget with the stock JupyterLab 3.1 images provided by Kubeflow. To get around that, I built my own notebook server image, and then I had no issues running the widget. So with that, let me go to my notebook server and look at how a data scientist would use this.

The data scientist would initially look at the data, which is the UCI census data from the 1990s. It has all the columns you would expect, from age to education to occupation, the number of hours per week, and whether the salary is greater than or less than 50K. They build a quick model and deploy it in production, as I mentioned, and now we are seeing gender differences even though the overall accuracy looks fine. So the first thing we do is start exploring the features. Since there are gender differences, we look at gender and see whether there is a discrepancy in the number of data points for each gender. We can see that there are far more male data points than female data points, which means the model generalizes better for the male population. That could be one place where bias creeps in. The second thing we normally do with data is feature engineering, especially for features that are non-uniform or skewed. We can see, for example, that capital gains is mostly zeros, with the remaining values spread from one end of the spectrum to the other, so we might have to do some feature engineering to change that, and that is another place where bias may creep in. Obviously there are many ways to pick the data we want to feature engineer in order to clean the data set. One way to look at that is counterfactuals. A counterfactual is the data point closest to the selected point that receives the opposite prediction, often differing in just one dimension. For example, if I pick a blue data point, you can see that there is a counterfactual here, and in this example most of the fields are almost the same; the field that shows a real difference is this one. That gives a helpful indicator of how we can feature engineer that data.

Now let's look at the models. As I mentioned, the model has an accuracy of over 80 percent, which is pretty good, but across genders it gives slightly different results, so we can now dig deeper into what the difference between the genders is. As we dig deeper, we can see that the female accuracy is higher than the male accuracy; however, the number of false positives for males is higher than for females. That is what we see in the real world: even though the real income is less than or equal to 50K, the model predicts that the income is greater than 50K for males. This is an easy way to look at where the bias is in your data set. We will go into how to change that in the rest of the presentation, but one quick way to look at it is the classification threshold.
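For reference, this is roughly what launching the What-If Tool widget inside the notebook looks like. It is a sketch under assumptions rather than the exact cells from our demo: `examples` is presumed to be a list of tf.Example protos built from the test rows, and `predict_fn` a function that takes a list of examples and returns class probabilities from the trained model.

```python
# Sketch: launching the What-If Tool widget from a Jupyter notebook cell.
# `examples` (list of tf.Example protos) and `predict_fn` are assumed to exist.
from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

config_builder = (
    WitConfigBuilder(examples)
    .set_custom_predict_fn(predict_fn)      # wraps the trained model
    .set_label_vocab(["<=50K", ">50K"])     # class names for the census task
)
WitWidget(config_builder, height=800)
```

The widget then provides the interactive slicing, counterfactual, and threshold views discussed here without any additional code.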
Coming back to the classification threshold: if we change it depending on how the stakeholder, in this case the bank, defines fairness, we can achieve that fairness. But there is a trade-off: when we change the threshold, the accuracy changes as well. For example, if we wanted demographic parity, you can see that changing the threshold achieves it, but there is a slight trade-off in accuracy, and similarly with the other options. So with that, let me go back to my slides. To recap, there can be many sources of bias.

So how can we mitigate bias in our existing framework? Let's take a quick look at that framework. The stakeholder now adds an additional fairness criterion, for example the bank deciding on demographic parity. We measure for bias in our pipelines, then make our pipelines fair, and finally provide the user with a specification under which our model will work. For example, a COVID-19 vaccine provides 95% efficacy if the two doses are taken 30 days apart. One thing I want to point out about this framework is that mitigating bias is like security: you want to shift left, that is, do it at the earliest possible stage. For example, if we could fix the world, we would not need any fairness criteria. If we are not able to fix the world but can make our data perfect, then we don't have to intervene later in the pipeline at the model stage.

With that, let's take a quick look at how we define equity and do measurement and bias detection. We took a quick look during the exploration phase, where instead of looking at the entire population we look at a subsection of the population, and even within that subsection we can go as deep as we want. For example, it could be parity between the blue and orange populations, or it could be a subsection within a population, for example the people who were denied a loan but would not have defaulted, or who were granted a loan but would default. That helps us pick an appropriate metric for fairness. With that, I'm going to hand it over to my friend Neelima so that she can walk you through the rest of the presentation.

Thank you, Meenakshi. Hello everyone. Let's now take a look at how we can improve fairness in your machine learning pipeline. Let's say you're trying to build a machine learning model to identify dogs and you've trained it with lots of cute pictures of golden retrievers. It's not going to do very well when you give it a different type of dog. So to be able to identify any kind of dog, you want to feed your machine learning pipeline a diverse data set. The first step in your machine learning pipeline is data collection, and you start to look for fairness right from there. Once you have the data you've collected, you can explore it to see if there's any bias in it. This is an example where we have two groups of people, the yellow group and the blue group, who are applying for a job. Our job selection process seems to have a bias towards the yellow group: we have selected more yellow people than blue people. Maybe our fairness metric as a society for this specific job is to have equal representation across groups. So we pick a fairness metric called demographic parity and we try to optimize for that.
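Measuring how far we are from that metric is only a few lines with a library like Fairlearn. A minimal sketch, assuming `y_true` and `y_pred` are 0/1 arrays for the selection decisions and `group` holds the blue/yellow membership of each person (all names are illustrative):

```python
# Sketch: quantifying demographic parity for the job-selection example.
from fairlearn.metrics import MetricFrame, demographic_parity_difference, selection_rate

# Selection rate (fraction selected) per group.
by_group = MetricFrame(
    metrics=selection_rate,
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=group,
)
print(by_group.by_group)

# Single number: 0 means perfect demographic parity.
print(demographic_parity_difference(y_true, y_pred, sensitive_features=group))
```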
One mechanism to do that is relabeling, where we rank all the people available within each subgroup, take the lowest-ranked selected person from the majority group and set them to unselected, and take the highest-ranked unselected person from the minority group and mark them as selected. By doing that, we see that we have balanced the selection between the blue group and the yellow group: we have equalized the outcomes irrespective of the demographic the person belongs to. There are many other preprocessing techniques, and depending on the situation and the data you may want to select one of them. For example, if you don't have a lot of data in one category but you do have data in another, you may want to do resampling to balance the data across the different demographics.

Once you have preprocessed your data, we get to the model training phase, where we can also try to improve fairness as we build the model. There are multiple mechanisms available here, like regularization, constrained optimization, and adversarial debiasing. When we try to stop a model from overfitting, that in itself is a form of debiasing. This is a step in your machine learning pipeline that can be parallelized very well. The open source packages available today for debiasing are not really optimized for distributed training on Kubeflow, and this is a place where we feel more work can be done. For example, constrained optimization is very similar to hyperparameter tuning. In hyperparameter tuning you take your learning process and change its parameters, like the step size for a gradient descent algorithm. In constrained optimization, however, you are not changing the inputs to your learning algorithm; you are adding constraints to the optimization problem in which you are trying to optimize for accuracy. The methods used here are very similar to hyperparameter tuning, so we feel this is a place where Kubeflow can be easily extended to support distributed fairness and bias reduction, and we would like to see work happening there.

Once you've trained a fair model, we get to the serving process. Before we serve, however, we always want to analyze whether any form of bias is still left over, and we can do debiasing even before we start serving, which is called post-processing. You can do this by changing thresholds. You can also keep fairness in mind while you're serving your model, by building interpretable models and publishing model explanations using things like model cards, which describe what your model does, what its limitations and biases are today, and how you want to see it improved.

Now let's get to an example. Here we are back in the Jupyter notebook, which is running in Kubeflow on an on-prem Kubernetes cluster. We are going to look at the UCI dataset again. We have the income across different categories plotted; here we have income across genders, and you can clearly see a higher proportion of high-income people among men than among women. You also have a difference in the number of data points available for females versus males. With this kind of imbalanced data you would expect your model to have a bias. Let's check that by building a model and evaluating the fairness metrics.
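As a rough sketch of that baseline step, here is what training the classifier and looking at per-gender metrics might look like with scikit-learn and Fairlearn's MetricFrame, used here as a programmatic stand-in for the dashboard shown in the demo. The data loading via `fetch_adult`, the hyperparameters, and all variable names are assumptions, not the exact notebook code.

```python
# Sketch: baseline decision tree on the census data plus per-gender metrics.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
from fairlearn.datasets import fetch_adult
from fairlearn.metrics import MetricFrame, selection_rate

data = fetch_adult(as_frame=True)            # UCI Adult census data
X = pd.get_dummies(data.data)                # one-hot encode categorical columns
y = (data.target == ">50K").astype(int)      # 1 = income above 50K
sex = data.data["sex"]                       # sensitive feature

X_train, X_test, y_train, y_test, sex_train, sex_test = train_test_split(
    X, y, sex, test_size=0.3, random_state=0
)

clf = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

mf = MetricFrame(
    metrics={"recall": recall_score, "selection_rate": selection_rate},
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=sex_test,
)
print(mf.by_group)       # recall and selection rate per gender
print(mf.difference())   # disparity between the two groups
```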
So we use a decision tree classifier and we train it on the data we have. We then use it to predict the income level on a test dataset, and we plot the fairness dashboard using Fairlearn. You can see that there is a pretty big disparity in recall across the two genders: a higher proportion of men get an over-prediction, that is, a prediction of higher income when they did not actually have it, compared to women. You also have a disparity in the absolute predictions: a much larger percentage of men get higher-income predictions than women, which is very intuitive given that our data was skewed towards a larger proportion of men having higher income than women.

Now let's try to apply the different debiasing steps. We start with preprocessing, where we use a correlation remover, which takes a sensitive feature and tries to remove the correlation with that sensitive feature across your dataset. Here we have specified gender as the sensitive feature, and once the data has been transformed to remove any correlation with it, we go ahead and fit the model again and predict based on this new model. Looking at the Fairlearn dashboard, we see a definite improvement in recall: a similar proportion of women get an over-prediction as men. However, the disparity in predictions hasn't really improved much; you still see many more men being given a higher-income classification than women. So preprocessing has helped, but not very much.

Let's see if we can do something while we are building the model. Here is something we can do during in-processing: we take a constrained optimizer. We've selected exponentiated gradient from Fairlearn, we specify the constraint it should try to achieve, which is demographic parity, and we pick the same model as before, run it through this constrained optimizer, and try to reduce the bias. Again, as mentioned before, this is where Kubeflow would be super helpful if we had the ability to distribute different runs of the constrained optimizer across different nodes in your Kubernetes cluster. But let's see how the performance is. We really don't see an improvement in recall; you still see a high disparity between women and men in terms of over-prediction versus under-prediction. However, what is interesting to note is that we now see a higher percentage of women getting a higher-income prediction than men. This does not intuitively make sense, given that the data was skewed in the other direction, but this is what the model had to do to reduce the disparity between the two genders, and that is very clearly reflected in the selection rate disparity, which is pretty close to zero. So yes, in-processing helped a little bit, but it also introduced a new disparity.

What about post-processing? Can we do better? We picked a very simple post-processing method called threshold optimization. Again, Fairlearn makes it very simple with a threshold optimizer, which takes as a constraint the type of fairness metric you are evaluating for, and here again we've selected demographic parity. We build the model and see how it performs. You can see that it has selected different thresholds for the two categories, 0 (female) and 1 (male), and the performance seems much better: the disparity in recall is under 2%.
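The post-processing step just described takes only a few lines with Fairlearn. This is a sketch, reusing the assumed variable names from the baseline sketch above (`clf`, `X_train`, `y_train`, `sex_train`, and the corresponding test splits):

```python
# Sketch: wrap the already-trained classifier in Fairlearn's ThresholdOptimizer,
# which picks per-group thresholds satisfying the demographic parity constraint.
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import recall_score

postprocessor = ThresholdOptimizer(
    estimator=clf,
    constraints="demographic_parity",
    prefit=True,                        # keep the fitted tree, don't retrain
    predict_method="predict_proba",     # threshold the predicted probabilities
)
postprocessor.fit(X_train, y_train, sensitive_features=sex_train)
y_pred_fair = postprocessor.predict(X_test, sensitive_features=sex_test)

mf = MetricFrame(
    metrics={"recall": recall_score, "selection_rate": selection_rate},
    y_true=y_test,
    y_pred=y_pred_fair,
    sensitive_features=sex_test,
)
print(mf.by_group)       # per-gender recall and selection rate after post-processing
print(mf.difference())   # the disparities should now be much smaller
```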
Coming back to the results: you have a similar proportion of over-prediction and under-prediction for males and females, and the disparity in predictions is very close to 0%, so you actually have an equal proportion of positive predictions for both. So instead of trying to fit a model that is fair given data that is not fair, it looks like setting different thresholds did the job better. Post-processing seems to do a better job in cases where your data has imbalance or bias built in.

Now, are we done? Not really. Once we've built a model, we want to understand how it works and why it is doing what it is doing. The simplest way to do that is to build interpretable models and look at their interpretations. In this case we have a decision tree classifier, and we are looking at the importance it gives to each of the features. Surprisingly, it selected marital status, and not sex, as the more important feature. So it's very much possible that if we had looked at the bias between single people and couples, we might have seen the data skewed much more than it is across the genders. So it is very important that at every step of your pipeline you look not only at the data, but also try to identify the different biases that are there. Instead of coming in with a preconception of "hey, there's a gender bias and we're trying to remove it" or "hey, there's an age bias and we're trying to remove it", explore the data, explore your model, and find out what kind of bias really exists, then understand and explain it, so that even when the biases are not easy to remove, you are still making it very transparent why the machine learning model is making a particular prediction. This is where machine learning can perform better than humans, because it is not always easy for us to explain why we are taking a specific decision, compared to a machine, which can be very explicit about "these are the parameters I've used to make a prediction of positive or negative."

With that, let's go back to the slides and summarize. Today we looked at different ways to improve fairness in your machine learning pipeline: starting with defining the problem, collecting the data, measuring bias at each step of your pipeline, improving and building a fair machine learning pipeline, and building interpretable models that users can understand. There are tools available today that make it easy to improve fairness at every stage of your machine learning pipeline. We think there is work that can be done to improve integration with Kubeflow as well as to make things more scalable within Kubeflow. With that, thank you.