Welcome, everyone, once again to the EuroPython conference. I'm Rashmi Nagpal, a software engineer by profession and a researcher by passion. Today I'll be sharing what I've learned about building and deploying fair and unbiased machine learning systems. So let's begin.

Here is the agenda for the talk. First, we'll go over what machine learning is and its related concepts, to bring all of us onto the same page. Then we'll look at the core black-box problem: why machine learning sometimes gives biased and unfair decisions. Further, we'll explore strategies for building fair, unbiased, and trustworthy models. On top of that, we'll walk through the various levels of the machine learning test pyramid. Towards the end, I'll share some strategies for addressing technical debt in machine learning systems, plus a bunch of resources for future reference. So without any further ado, let's begin.

Let's first understand the broader meaning of machine learning and its related domains. AI is a very broad term: it covers any program or technique that mimics human behaviour. Machine learning is a subset of AI; it focuses on developing algorithms that learn from a data set and its features and give you some output or data-driven decision. Deep learning, in turn, is a subset of machine learning that extracts patterns from the data set and gives you data-driven decisions. For example, handwriting recognition is an example of deep learning, and spam email detection is an example of machine learning.

OK, now that I've explained the basics, let me test the waters. I want you to take a minute and think: which of these faces are real? Please raise your hand if you think face A is the real one. Interesting, about five percent of the audience. Please raise your hand if you think face B is the real one. I see. OK, now please raise your hand if you think face C is the real one. Very nice. About 80% of the audience raised their hands for face A, B, or C. Well, the answer is that none of these faces are real. These people do not exist on planet Earth, to say the least; I'm not sure about extraterrestrial planets. Maybe someday, if I go to Andromeda, I'll find out.

This is an application of deep generative modeling, in which the algorithms are not just learning patterns from the data set but going a step beyond and synthesizing brand-new data instances based on those learned patterns. It's a complex but very powerful idea, and generative AI such as ChatGPT is also an application of deep generative modeling.

But wait a minute: what are the building blocks behind these algorithms? That's where neural networks come into the picture. This is the basic structure of a neural network: you're given a data set as input, you feed it into the system, which comprises weights, a weighted sum, biases, and an activation function, and it gives you some output. The output could be whatever use case you've trained, or are building, your machine learning model for.
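To make that concrete, here is a tiny sketch of what a single neuron in such a network computes. The numbers are made up purely for illustration; a real network stacks many of these units in layers and learns the weights and bias from data.

```python
import numpy as np

def sigmoid(z):
    # Activation function: squashes the weighted sum into a value between 0 and 1.
    return 1 / (1 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum of the inputs plus a bias, passed through the activation.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, 0.1, 0.9])   # toy input features
w = np.array([0.4, -0.2, 0.7])  # weights a trained network would have learned
b = 0.1                          # learned bias
print(neuron(x, w, b))           # a single confidence-like output between 0 and 1
```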
Let's understand this structure through an example. The use case here is, well, that's my Fluffy, my Maltese pet dog. You're given an image, and you need to determine which animal category that image belongs to. So you feed the image data as input, the system understands it as a combination of zeros and ones, and it gives you an output with a certain confidence. Here, the image belongs to the dog category with 82% confidence.

Now, what if we want to deploy this image recognition example for our end users? What will our machine learning pipeline look like? This infographic shows the end-to-end machine learning pipeline, or the machine learning system holistically, which comprises three main pillars or stages: the build stage, the deployment stage, and the monitoring stage. In the build stage you're not just ingesting the input data set; you're also doing pre-processing and post-processing, model training and model testing, and, further on, packaging and registering the model. In the deployment stage, your application is released to production, or maybe a dev environment, and made available to end users. That's a very high-level view of the machine learning system and the stages of the pipeline. But what if something goes wrong, especially in the build stage?

So let's get to the core problem of this entire talk, which I'll illustrate with examples in the next few slides. This first example shows a biased word embedding. You can see in the vector space that the affinity of "homemaker" is much closer to "woman" than to "man". A word embedding is basically a vector representation of words, or text, in a vector space, meaning that words which are closer in meaning end up closer together in the vector space. So does that imply that "homemaker" and "woman" are similar in meaning? Unfortunately, that's what the word2vec algorithm shows, at least. The representation displayed here comes from word2vec trained on a biased Google News data set. And we know the saying: garbage in, garbage out. If you have a biased data set, then of course your algorithm's output is going to be biased.
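If you want to poke at this yourself, here is roughly how you could reproduce that comparison with gensim and the same pretrained Google News vectors. This is only a sketch: it assumes you have gensim installed and are willing to download the fairly large pretrained model, and the exact similarity numbers will depend on that model.

```python
# Needs: pip install gensim (the pretrained vectors are a large one-time download).
import gensim.downloader as api

# word2vec vectors trained on Google News articles.
wv = api.load("word2vec-google-news-300")

# Cosine similarity between word vectors: higher means "closer in meaning".
print(wv.similarity("woman", "homemaker"))
print(wv.similarity("man", "homemaker"))
# You would expect the first number to come out noticeably higher,
# which is exactly the gender bias baked into the training data.
```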
OK, the next example is a biased response given by ChatGPT. I'll give you a couple of seconds to go through the prompt and the reply. The prompt says that the doctor yelled at the nurse because he was late, and asks who was late. Logically, of course, the nurse was the one who was late here, not the doctor. But the response given by ChatGPT is almost comical, at least to me: it decides that "he" was the one who was late, and that "he" refers back to the doctor, not to the nurse. Similarly, the next example shows the same kind of bias, where the model swaps the pronouns around. Unfortunately, we are the consumers of these technologies. Are we aware of the biases inherent in these systems? That's a question I really want you to ponder. Similarly, there's a biased response from ChatGPT that alters the sentence itself. You can take a couple of seconds to read it: the model explicitly says that the construction of the sentence is wrong, rewrites the sentence on its own, and then gives its answer based on that.

Along similar lines, we have biased credit scoring. Sometimes it goes so far that certain groups, particularly people of colour, don't get credit or can't get a loan sanctioned, because there is potential discrimination in the algorithms running behind it. These are examples of how machine learning can give you biased responses.

OK. Now, after pondering those examples and their biased and unfair outcomes, it's time for us to introspect: how did we get here? How can models output such discriminatory results or decisions? There are a bunch of reasons. First, algorithms trained on historically biased or discriminatory data sets can amplify or perpetuate that bias, which leads to unfair decisions, especially towards underrepresented sections of society. Second, a lack of diversity within the data set leads to skewed results; again, garbage in, garbage out. If the distribution of samples or class labels in your data set is skewed, you'll get biased results. Next is cognitive bias, which shows up when the individuals and teams building the system are not themselves aware of their biases while training, testing, or deploying any stage of the machine learning pipeline; it is so ingrained in our lives that it surfaces during training or once the model has already been deployed. Lastly, when your evaluation metrics are not aligned with the objective you're trying to achieve, the results will go haywire.

OK, so this leads me to the question: how can we actually build fair and unbiased models, and what strategies can we follow? I've listed a bunch of strategies here. The first is to collect diverse and representative data, because you want to ensure the training data is representative of, and diverse across, the users your machine learning pipeline is catering to. The next is the pre-processing and post-processing stages, where you can use techniques like data augmentation, feature selection, and so on, to balance out the data and reduce bias. Others are algorithmic fairness and explainability; I'll give a demo of a couple of these strategies to make them easier to grasp. With fairness techniques, you can use measures like demographic parity, equalized odds, or equal opportunity to ensure that the model you've trained treats all groups equally. And explainability: black-box models sometimes give results so haywire that you can't understand, or can barely debug, why the model is making such decisions.
Therefore, using explainability algorithms, and adding testing into your machine learning pipeline, helps ensure you don't have bottlenecks hiding in the system.

OK, so here is an application: loan classification. I'm using the German credit data set, which is publicly available. Given the data set, we want to identify whether people from different demographic groups will get a loan sanctioned or not. The data set has a bunch of features like status, credit history, employment duration, job, telephone number, gender, age, and so on. We define the features and the labels, and then we apply the SMOTE technique; I'll explain what SMOTE is as well. SMOTE, the Synthetic Minority Over-sampling Technique, is a statistical technique for balancing the class distribution: it synthesizes new samples of the minority class, which is under-represented in your data set, so that both class labels end up balanced. You can see that before applying SMOTE, the Counter shows the class 0 count as 569 and class 1 as 551, and after applying it you get an equal distribution. Using this kind of strategy helps balance out the class distribution. Then you split the data set in an 80/20 ratio, 80 percent for training and 20 percent for testing, and feed it into your model.

OK, so here I'm using an XGBoost model with some parameters. It's a tree-based classifier, and what we're doing is loan classification. We define a build pipeline in which we define our model and then do the training and testing. The parameters of the XGBoost classifier are things like the learning rate and a seed value, and they're set before you call the fit function. Once you've defined all the parameters, you instantiate the XGBoost classifier and fit it on the training data. For testing, you use the predict method of the XGBoost model and then evaluate it using accuracy as one evaluation metric. You're free to use other evaluation metrics as well: precision, recall, and F1 score are other metrics you can use to see how the model is performing.

But wait: now that we have built a model, let's find out whether it's fair enough or not. For that, we define a privileged and an unprivileged group based on our hypothesis. I defined these groups by doing the EDA first: the privileged group is men over the age of 18, and the unprivileged group is women over the age of 18. I'm using this split because, if you look at the data set, we have far fewer samples of women, and we want to ensure the model we're building isn't biased towards one gender over the other. So, having trained and evaluated the model, we then use a multi-objective optimization algorithm, because we want to optimize not just for accuracy but for fairness as a metric as well; we don't want to compromise on either of the two. After running the optimization, we calculate statistical parity, which is simply a fairness metric: it gives you the difference in the probability of a positive outcome between the privileged and the unprivileged group, and the smaller it is, the better. After running that, we get a statistical parity difference of 0.0039, which implies the model isn't discriminating against one gender versus the other. So that's how we build and test the model to ensure it's both unbiased and fair enough.
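To give a feel for the code, here is a minimal sketch of that build-and-check flow. It is not the exact notebook from the demo: it assumes the German credit features are already loaded into a DataFrame X with a numeric "age" column and a binary "gender" column (1 for men, 0 for women; these column names and encodings are my own placeholders) and the 0/1 loan label into y.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Balance the class labels by synthesizing minority-class samples.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))  # both classes should now have equal counts

# The usual 80/20 split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)

# A tree-based classifier for the loan decision.
model = XGBClassifier(learning_rate=0.1, n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# Fairness check: statistical parity difference between the privileged group
# (men over 18) and the unprivileged group (women over 18), as defined above.
privileged = (X_test["gender"] == 1) & (X_test["age"] >= 18)
unprivileged = (X_test["gender"] == 0) & (X_test["age"] >= 18)
parity_diff = (y_pred[privileged.values].mean()
               - y_pred[unprivileged.values].mean())
print("statistical parity difference:", parity_diff)  # closer to 0 is fairer
```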
So now it's time to understand the ML test pyramid and look at the various levels of testing. The first level is unit tests, which focus on testing the individual components of your ML system in isolation: the small functions and modules you've written, which are of course the building blocks of the system. Then there's integration testing, to see how the various components of the ML system interact with each other, and model testing, which is what I demoed before: using accuracy, precision, recall, and similar measures to check the effectiveness of your model. Then there's robustness and generalization: checking whether the model generalizes well to unseen data it hasn't been trained on, and whether it performs reliably in real-world scenarios. In deployment testing we make sure we have a smooth CI/CD pipeline and validate the model's performance in production. And the last level is ethical and fairness testing: for an ML system that's already deployed in production, or about to be, we check that it isn't compromising privacy or carrying algorithmic biases. Testing for these ethical considerations is vital, because we want to ensure the models are fair and unbiased.

So let's see some unit and model testing in action. Here I've added some unit tests for an ML system; it's a sentiment classification problem. Let me explain sentiment classification: given a text, you need to identify whether it's positive or negative. For example, "I'm very thankful to the organizers of EuroPython, and to all the lovely people here, for letting me share my learnings with you" is a positive sentiment. "I'm also very thankful to my mentor for standing up and cheering me on from the crowd" is also a positive sentiment. A negative sentiment could be, "I really don't like pineapple on a pizza." I'm not sure how many people here like it, but definitely not me. So that's a negative sentiment.

Once you've defined your sentiment classification model, you can add test cases that check whether the predictions actually match the expected output, and whether the system handles input that isn't English, say a sentence in Hebrew, instead of silently mis-scoring it. So you can write unit tests like that for your system.
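Here is a sketch of what such unit tests might look like with pytest. The module and function names (sentiment_model, predict_sentiment, is_english) are placeholders made up for illustration, not the actual demo code.

```python
import pytest

# Hypothetical wrapper around the trained model; names are placeholders.
from sentiment_model import predict_sentiment, is_english


def test_positive_sentence():
    assert predict_sentiment("I loved presenting at EuroPython!") == "positive"


def test_negative_sentence():
    assert predict_sentiment("I really don't like pineapple on a pizza.") == "negative"


def test_empty_input_is_rejected():
    # The wrapper is assumed to raise instead of guessing on empty input.
    with pytest.raises(ValueError):
        predict_sentiment("")


def test_non_english_input_is_flagged():
    # A Hebrew sentence should be detected and handled, not silently scored.
    assert not is_english("אני אוהב פיצה")
```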
OK, now for a very famous concept that I hope you'll all like: explainable AI. Let's see how we can plug in explainability so that our black-box models become explainable and interpretable. First of all, what is explainability? Explainability in machine learning means you can explain what the model is doing all the way from the input to the output; it makes the model much more transparent and goes a long way towards solving the black-box problem. Think of medical diagnosis, healthcare systems, or autonomous driving: the stakes are super high, and the decisions made by these ML systems can have a significant impact on people's lives. To build trust, ML systems need to be transparent and explainable, so that users can understand how the system works and why it's making certain decisions. Users also need to know who is responsible for the system and how they can be held accountable for its actions.

So let's see how we can plug in explainability while building the ML system. Here I'm taking the same problem statement of sentiment classification: given a sentence, decide whether it's positive or negative, with positive as one and negative as zero. This is a textual data set, and of course a machine learning system doesn't understand raw text, so you have to vectorize it into something numeric. Then, once we've created the model, I put in one test case to see how it works. I tested the sentence "So exciting presenting at the EuroPython conference in Prague", which is obviously a positive sentiment. Does the model also say it's positive? Let's see. It clearly said it's a positive sentiment, and it also highlighted the words that contributed towards that prediction, showing which words pushed it to positive. That's how you can plug in explainability while you're building your ML systems.
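Here is a rough sketch of that kind of explainability step using LIME and scikit-learn. The training sentences are made up just for illustration; the actual demo used its own data set and model.

```python
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I love this talk", "What a great conference",
         "This was terrible", "I hate waiting in long queues"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Vectorize the text and train a simple classifier on top of it.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
sentence = "So exciting presenting at the EuroPython conference in Prague"
exp = explainer.explain_instance(sentence, pipeline.predict_proba, num_features=5)

# Which words pushed the prediction towards "positive"?
print(exp.as_list())
```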
OK, so now here's an interesting number: why do only about 13% of machine learning models ever make it to production? Have you ever thought about that? Well, there are a lot of reasons why this happens. So let's first understand the technical debts that accrue while we're building, and then the strategies we can use to mitigate them.

The first is data debt. The issue here is low-quality data: as we say, the better your data, the better your model and its accuracy will be. If you have a low-quality or small data set, with biases or a lack of representativeness, you're working with a limited data set, which makes the model much harder to interpret and harder to update; and because it isn't well represented, it creates bias and, of course, inaccurate outcomes. Second is architecture debt, which happens when your architecture has grown so complex and disorganized over time, and is so poorly documented or poorly understood within the team, that even when you want to refine or update it, you can't. The next one is algorithmic debt: relying on outdated or suboptimal modeling approaches and making assumptions along the lines of "if this works, that's how we'll do it", without vetting them properly, creates a big bottleneck. When new methods emerge, the system needs updating and retraining, but that process becomes very difficult if the architecture or data pipelines were designed in an outdated fashion. And testing debt is when you're lacking tests across the board, from data processing to modeling to evaluation; that lack of tests is what leads to testing debt.

So let's see how we can address and mitigate such technical debts; these are the strategies I've listed here. First, review your ML pipeline: go through the entire pipeline, right from the beginning, including data collection, pre-processing, modeling, and deployment, and look for any manual steps, hacks, or shortcuts that were taken, because that helps identify potential technical debt; then address them by automating and streamlining the pipeline. The next, and most important, is the data itself: poor data quality is a major bottleneck, so look for missing values, biases, or label errors, and address them by improving the data collection pipeline. Third, examine your model's performance. It's possible it was trained on a data set that's badly outdated now, or that new models and techniques have emerged and your model no longer fits well, so check whether you need to update the algorithms behind it. Another is to set up comprehensive automated testing practices, unit tests, integration tests, and so on, and keep up with regular code updates; refactoring is one of the best ways, I feel, to avoid issues that could surface once the model is deployed in production, and to clean up repetitive code if it's already there. And of course, using explainability is really important too, because models are often opaque black boxes, which makes them hard to diagnose and hard to update.
Therefore it's better to use LIME or SHAP, or frameworks like them, to understand how the model is performing.

OK, so now for the conclusion, the closing remarks I want to leave you with from this talk. First of all, comprehensive testing is very important: it helps unlock the full potential of a machine learning system and ensures that we build, test, and deploy fair and reliable models. Second, plugging in explainability makes our models transparent, so we can understand why they take certain decisions. And lastly, having enough test cases and the right evaluation metrics helps promote fairness, accountability, and transparency. Hence it's important to monitor how the model performs over time, so that it keeps performing as expected. And here are a bunch of resources I want to leave you with for future reference.

Lastly, I want to leave you with some thought-provoking parting words of my own, which go like this: "Garbage in, garbage out, so let's sift through the biases and push them out. In the realm of fairness, one size never fits; it's a kaleidoscope of perspectives and intricate bits." So thank you very much. Thank you for being a warm and lovely audience, and of course to the organizers for giving me this amazing opportunity, and to my mentor for being here, cheering me on, and again being my source of inspiration. Thank you so much, everyone.

Thank you, Rashmi. Great talk. We have time for questions; please use the microphones. I will put mine in too.

Thank you for the presentation. Can you recommend some packages for fairness monitoring, for explaining models and for monitoring models?

Right. There is one well-known package, Fairlearn, by Microsoft, that you can definitely check out. And for explainability, LIME and SHAP are the two packages I've already shown: I explained one example with LIME, which you can definitely use on textual data sets, and SHAP is another package, or framework, you can use when you're building any kind of machine learning system.

Thanks, really interesting talk, and nice to see some solutions to this, because you don't always get that. My question: I work as a software engineer, so maybe less on the data science side, and I think quite often I'm going to have to start working with LLMs. I wondered if you have any tips on how I can try and make them fairer, because I don't get to control the model but I still have to work with it.

Right, OK. So the question is, when you're starting with LLMs, how can you also make them fairer?

Yeah, I guess from my perspective, I might not get to choose what the model is, but I'll still have to ship a product with it. So maybe it's impossible, but I'm wondering if I can do anything to try and help, because I really dislike things like the examples you showed about the pronouns.

Right, OK, in short: if you have a constraint on the model, if you don't get a choice and you're just told this is the model you have to use, there are still a lot of strategies you can use.
Check how the model performs across various data sets: whether it generalizes well enough and whether it actually performs well in real-world scenarios, because you're working in a simulated environment and you need to test it on real-world data. If it doesn't perform well, fairness techniques are of course there, and you can also use data augmentation techniques. At the same time, you need to understand how the model was trained. For example, the word2vec representation I showed earlier was trained on the Google News data set, which is biased. If you're using a pre-trained model that was trained on biased data, you can't fully remove that bias. If you're creating a model from scratch, then you can take those design choices into consideration. But if you're starting with LLMs, I would say choose LLMs that perform well on a diverse data set and that are close to the use case you're catering to.

Thanks, that's very helpful. No problem.

Hello, thanks for your talk. I wanted to ask about the last thing you explained, the design bias. For example, imagine you want to design a model that predicts success in schools, and one of your features is the postal code. There are going to be areas where, because they are poorer, people usually have more school failure. So a model trained on that would be biased, but in a sense correct: correct in that it won't make bad predictions, and mathematically the results won't be that bad. How would you redesign this?

OK, so the data set you're describing is a school data set, about who's going to finish school?

Yes, school success, who's likely to finish school, for example. It's a hypothetical case.

Right. I think that's an interesting question and a good point. All I can think of off the top of my head is whether it matches your hypothesis. First of all, before you even get to training, you clean the data set and you do hypothesis testing as part of the EDA, the exploratory data analysis. You get insights from the data set, and you form hypotheses about whether it fits that particular use case and whether it's diverse enough, whether it generalizes, whether you could use it on a different sub-population. If it doesn't satisfy the hypothesis testing during the EDA itself, then I definitely would not take that model ahead. And even if you do want to take it forward, you need to identify the various bottlenecks and write thorough test cases: possibly unit tests, or checks on the predictions once the model has been trained, things like accuracy, precision, and recall. You need to identify exactly which hypothesis you're trying to satisfy. For example, in the credit scoring case,
the hypotheses were around ethnicity and gender: I did not want to create a model that is biased against underrepresented sections of society, so that was one of the hypotheses the model had to revolve around. For your case as well, if the model performs well against the hypotheses that matter to you, then I think you can carry it forward; otherwise, there are various testing techniques you can apply. Possibly that answers your question.

Okay, yeah, thank you. No problem.

Any other questions? No? Thank you, that was me. If anyone has.