Hi, everyone. I'm Linda. Welcome to my talk in Fair Models We Trust, where I will be presenting my plugin for increasing the auditability of Moodle learning analytics. Now, if you didn't understand any of those words, please bear with me; I'll explain them. I'll first give you some context: what Moodle learning analytics are in the first place, what I mean by fairness and trust, and what auditing is. Then I'll talk about the problems I encountered when trying to audit Moodle learning analytics, present the solution that I found for these problems, and end with a conclusion, an outlook, et cetera.

So, first of all, what is learning analytics? It has popped up during the conference a little, but just to make sure we're all on the same page: learning analytics are software algorithms that are used to predict or detect unknown aspects of the learning process based on historical data and current behavior. So, basically, learning analytics are algorithms that look at the data of students, at how they behaved, and based on that try to make some analysis, some prediction.

Who has used learning analytics before? Anyone? Cool. And anyone used Moodle learning analytics? Yeah, okay, cool. So, this talk is especially interesting for you, I hope. For those of you who haven't used Moodle learning analytics, I'll just quickly show you what that is. Moodle provides some learning analytics capabilities. They have some classification models that, based on student behavior, try to make predictions about whether a student will succeed in a course. There are different things that these model configurations can predict, for instance, students at risk of dropping out. So, the model will try to predict whether a student might drop out of the course. This here is the overview of the Moodle learning analytics, kind of like the landing page of this functionality. We can see the different model configurations.
We could add new ones, and we get some basic information and also the ability to do stuff with these models. For instance, we can edit model configurations. This is an example of the editing page for the students-at-risk-of-dropping-out model. Here we can see some more details of the model, or hopefully you can read it, maybe not. But I can tell you that each model has a so-called target, which is what it's supposed to predict, and it has indicators, which in machine learning we also call features. Those are the data that is fed in, and based on that, the model makes predictions. For instance, here it could be whether a student has done any write action in the course, for instance, posted something in the forums.

It is important to know that Moodle offers only the configurations for these models. There are no pre-trained models offered by Moodle. Each model lives in its own Moodle instance, so we first have to train it on our Moodle instance before we can use it to make predictions for our classes. Once the model has been trained and is running in our instance, it will make predictions. For instance, here the model predicted that a student, Augustus, might drop out of a course. And we also know why, which is displayed at the bottom under indicators. We see, for instance, that Augustus has not done any write action in the course, so this is an indicator that they might drop out. Now, as a teacher, I can decide to send Augustus a message. And I can also give some feedback on the prediction, whether it was correct or maybe incorrect.

We can also evaluate those model configurations. For that, Moodle provides the so-called evaluation mode. What it does is take the model configuration, train, I think, ten models separately, and let each model make some predictions. And it does that with the data in my Moodle instance. That way I can get an estimate of how accurate my model might be.
We get two values returned from this evaluation mode: the accuracy or F1 score, and a standard deviation for how differently accurate the ten models were.

Now, this sounds really cool, I think. Moodle has some learning analytics; we can gain some really cool insights into our students. But also, if you've been to the panel this morning, we heard that AI does have some issues at times. That's why I linked some research from colleagues of mine. People have found out that even learning analytics are not always fair. They exhibit bias, and they are seldom trustworthy right from the start. My colleagues have researched dropout prediction models specifically, and they found that these models might not work equally well for different groups of users, especially for underrepresented groups. There might not be enough data to make any meaningful predictions for them, and so the models might be biased against those minority groups, as we call them.

So, that said, what can we do? We heard in the panel this morning many good ideas on how to counter biased AI, and I want to add one strategy to that, namely audits. Auditing means verifying that learning analytics, or any AI, is doing its job correctly, well, and in compliance with ethical values. We do this to find opportunities for improving the learning analytics, to assure its quality towards the stakeholders, and thereby to promote trust in and acceptance of this technology.

Now, that's all well and good, but how do we actually audit any learning analytics? I'm going to show you a little example. Let's take a look at the student dropout prediction model of Moodle. We can audit it following three steps. The first step is that we formulate claims. These claims are the principles that the AI or learning analytics is supposed to fulfill.
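For readers following along, the evaluation-mode idea (train several separate models on your own site's data, then report a mean accuracy and its spread) can be sketched like this. The features and labels here are made-up stand-ins, not real Moodle indicators:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins: 4 fake "indicator" columns and a fake dropout label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Ten separately trained and tested models, as in Moodle's evaluation mode
scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="accuracy")
print(f"estimated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Note that this only estimates how good a model *trained like this* might be; it says nothing about one specific trained model, which is exactly the limitation discussed below.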
For instance, here it might be: the dropout predictions do not show bias against minority groups. That's what we want. In the second step, we gather evidence to prove or disprove the claims that we formulated. For that, we can look at documentation, we can look at the source code, or we could look at the system logs that are produced by Moodle when we evaluate a model. In the third step, we validate our evidence and hopefully come to a conclusion on whether the claims are fulfilled or not. For this claim, we would need to check whether the predictions made by the model are equally accurate for both minority and majority groups.

So, that's the process, mainly, and that's what I attempted. But then I noticed: well, one does not simply audit Moodle learning analytics. I was faced with some hurdles. Let me tell you about them. The main problem lies in the type of evidence that I needed. To validate some claims, we need to conduct data-based tests. It's not enough to scan the documentation, to look at the source code, or even to look at the logs that are currently produced by Moodle. Instead, we need to run an experiment. We need to check whether there is any bias in the predictions by inputting some example data into a trained model. This example data should be realistic, and it should contain data from both majority and minority groups. Then, after receiving the predictions, we compare them and see whether the quality is just as good for both. That's the data-based test we should conduct here in order to validate our claim. But this is not currently possible with Moodle learning analytics, and there are three major reasons why. The first reason concerns the testing data.
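The data-based test just described, comparing prediction quality per group, boils down to something like this sketch (hypothetical data and group labels, just to show the idea):

```python
# Hypothetical example: (actual dropout, predicted dropout, group) per student
records = [
    (1, 1, "majority"),
    (0, 0, "majority"),
    (1, 1, "majority"),
    (0, 1, "majority"),
    (1, 0, "minority"),
    (0, 0, "minority"),
    (1, 1, "minority"),
    (0, 1, "minority"),
]

# Accuracy per group: share of predictions that match the truth
accuracy = {}
for group in ("majority", "minority"):
    rows = [(t, p) for t, p, g in records if g == group]
    accuracy[group] = sum(t == p for t, p in rows) / len(rows)

print(accuracy)  # similar values across groups would support the claim
```

In this made-up example the majority group gets 0.75 accuracy and the minority group only 0.5, which is exactly the kind of gap the audit is meant to surface.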
Currently, there's no suitable test data openly available that people can use to do data-based testing of Moodle learning analytics, if they don't already have their own production system running, along with the ability, and the right, to use that data for it. We also cannot simply mock the data, because the learning analytics models expect complex log data, sequential behavior data of students, and it's not easy to mock this at scale. Additionally, the Moodle evaluation mode only evaluates model configurations, not already trained models. So it only ever gives an estimate of how good my models might be; it doesn't tell me anything about the concrete models that I use on the site. The models that are created in evaluation mode are not persisted after the evaluation. And finally, another problem also concerns the evaluation mode: it does not make the raw predictions available, but just returns two metrics, which don't tell me enough to validate most claims, and especially not any claims that have to do with fairness, where I want to compare results by group.

So, these are the three challenges I faced. And as a software engineer, I turned to software as a solution. I developed a Moodle plugin called Lala, which is short for "Let's audit learning analytics". You can find it on GitHub. Here's what I did to solve those problems. Firstly, I enabled auditors to upload or select the data that should be used for auditing a model, for testing the model, for conducting this data-based test. Then I also made sure that we now clearly differentiate between a model configuration and a trained model, because this is not so clear in the Moodle evaluation mode. I also persist any models that are created by Lala, and I provide the raw predictions that are created by those models. But that's not all. I figured I could improve things even more.
I also provide more extensive evidence that people can download from Lala. Namely, I provide the complete model input that the model was using, that is, all the features and truth values that were created. I provide the split into test and training data, and also some data that is related to the model input. For instance, if I want to do a fairness audit like the one I described, I might be interested in some information about the users in order to identify minority groups. So providing some additional user data might be helpful, right?

Now you might think: but wait a second, what about the GDPR? Isn't that a lot of data that you let auditors download? And that is true. So I did think about privacy. I anonymized all the data, so it can be downloaded safely and we don't have to worry about the GDPR and deletion requests.

Apart from that, some features that I also implemented, or that I also want to highlight, are these. Lala ensures traceability: if any model configuration in Moodle is updated or deleted, it is still persisted in Lala, so we keep track of those things. We also enable third-party audits, for even more trustworthiness, by letting admins add new users and give them the role of auditor, so that they can only see and use Lala and nothing else. And I also worked on providing an example analysis of the evidence, because, okay, great, thanks to Lala we have a lot of evidence, but what do you do with it in order to actually find out about any biases, right?

Okay, so those are some features that Lala has. Now let me show you how to use Lala, how Lala can actually help you in an audit. Let's get back to our example audit. Our example claim was that the dropout predictions should not show any bias against minority groups. In the second step, we can gather additional evidence using Lala. And this is how we do it.
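As an illustration of the privacy step, one common approach is to replace real user IDs with salted hashes before export. This is a hedged sketch of the general idea, not Lala's actual anonymization code:

```python
import hashlib
import secrets

# Per-export salt; not stored alongside the exported data, so the
# tokens cannot be linked back to real IDs by a downloader.
SALT = secrets.token_hex(16)

def pseudonymize(user_id: str) -> str:
    """Map a real user id to a stable but unlinkable token."""
    return hashlib.sha256((SALT + user_id).encode()).hexdigest()[:12]

# Hypothetical evidence row before export
row = {"user_id": "augustus", "lang": "en", "prediction": 1}
row["user_id"] = pseudonymize(row["user_id"])
print(row)  # same user always maps to the same token within one export
```

The stable mapping matters for audits: the same user must keep the same token across the predictions file and the user-data file, or the two data sets could not be joined later.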
We first go to the Lala plugin page and select the model which we want to audit, for instance, config 1.1, which is for course dropout. We click on "create new version", as it's hopefully doing, yes, create new version. This will create a new version of this model automatically. What happens now is that we trigger the data collection for this model, a new model is trained with part of this data, the other part of the data is used to make some predictions, and the related data is gathered too. Yay, a new version has been created. Cool. It's a default 322. And if we don't want to do this automatically, if we want more control over the selection of data, what I've recently been implementing is the manual model creation mode, where you can select which contexts should be used for the data gathering, or upload your own model input data.

Either way, we now have some evidence that was gathered, and we can download it. Remember the claim: we want to check whether the predictions are equally accurate for different groups. For that, what we need are the predictions and some related data, namely the enrolments and user data. So this is what we download, and we can now use it to validate our claim, to check whether the predictions are equally accurate for both minority and majority groups.

Now, I will show you an example analysis. If you don't know Python, just ignore the code; I will tell you what it does. First we import the predictions as well as the related data, and we can have a look at them. The data set at the top shows what the predictions look like. It tells us, basically, for whom what was predicted and what the actual value is: what is the truth, what was the prediction, and for whom. And the second data set, at the bottom, which I shrank a little because there's normally quite a lot of user data, shows us which language the users set their Moodle interface to. For this analysis I chose language as the group indicator.
So I split the students into those who have set their interface to English and those who have set it to German. And here a little side note: I mocked some of this data, because there's not enough sample data available yet for doing this analysis. So do not take any of it as the truth. It is an example analysis to demonstrate how to use Lala, how to do this audit; it's not the truth about Moodle learning analytics.

So, yeah, I imported the evidence and had a first look at it. Next I do some complicated-looking data preparation: basically, I just join both data sets into one data set so I can continue to work with it. And now comes the interesting part. I use Fairlearn in combination with scikit-learn to make some calculations. I calculate the accuracy, based on the truth and the prediction, for the students separated by the language attribute. And we can have a look at the results. We see here the accuracy for the English-speaking users and for the German-speaking users, and it looks quite similar, I think. So that's nice. Again, this is fake data, but it looks nice. We can also have a look at the numbers if we like. We can calculate the difference or the ratio, and we see that the difference is quite small and the ratio is quite high. Great. So for our example audit we can conclude: the dropout prediction model does not show any bias against minority groups, at least for the language attribute.

Okay, I told you many things. Here's what you might want to remember. Like the panel said this morning, and like I also said, learning analytics models are not always fair, nor might they be trustworthy, so that's why we need to audit them. However, the auditing of Moodle learning analytics is currently hindered by the lack of data, by low traceability, and because the trained models and the predictions are not persisted in the evaluation mode.
However, I made a plugin, Lala, which persists and retrieves the evidence for the learning analytics models, including the raw predictions.

Okay, but I'm not done yet. There are still some challenges, some problems, so I'll give you a quick outlook. First of all, sadly, there's still no data openly available. We are working on making some data sets available for testing Lala, but it's not enough. Secondly, there are different machine learning implementations that can be used with Moodle, but Lala currently only uses the PHP logistic regression implementation; we're working on allowing it to use other implementations as well. Another thing: Lala still only evaluates model configurations, so when we create a new model version it will always train a new model. But we are almost at the point where we can skip the training and proceed directly to getting predictions from an already trained model. Then, privacy is great, but we do lose some potentially valuable information due to the anonymization process, so we want to experiment with somewhat more sophisticated anonymization algorithms to see whether we can lose less information. And lastly, Lala uses quite a lot of storage for all the evidence, and it's also a lot of work for the servers to train the models. So we want to work on that as well: to reduce the evidence sizes, to ask users which evidence should be collected at all, and also to enable command-line execution, or to make the whole training an ad hoc task, something that I learned about at the Dev Jam. I'll check that out. So thanks to my team from the Dev Jam; it was really great working on this plugin with them.

Yeah, that was the outlook, that was what I'm doing for fair and trustworthy learning analytics, and now I think maybe it's your turn to join the mission, if you like.
If you're working with learning analytics, I hope that you audit your models in order to increase the trustworthiness, and maybe the acceptance, of those cool learning analytics, and maybe Lala can play some part in helping you audit them. I hope that you will give me feedback and share your ideas if you use Lala, document any bugs, and maybe even join the development; some of you already have, during the Dev Jam. And a last and final plea: if you have any data, any cool data, to share, I hope you will share it, in an anonymized way, of course. Thank you for your attention, and I'm open to any questions.

Thanks for the talk. Super interesting and really impressive, so I'm definitely gonna try out the plugin. I just want to ask: one of the challenges here, I think, is to get some data, like selecting for the groups that might be discriminated against by the model. So, okay, you used language; that's a nice way of approximating, you know, the problem of selection. But have you had any other discussions about what we might do to allow for a more valid comparison between groups, or even how to test for intersectional effects, for example with two different group attributes?

So, that's a very good point that you raise. Of course, language is just one of many, many different attributes that can designate a group. There are many more, gender, or even intersectional groups, for instance female German speakers, and we should definitely look at those things. When auditing, what we do, or what we think of auditing as, is risk-aware: we should first think about the potential discrimination risks and, based on those risks, conduct our audit. While doing the risk analysis, we might find special groups that are very vulnerable in our context. For instance, in learning analytics and dropout prediction we could think that parents might be discriminated against, because they are maybe underrepresented, or they have
different learning behaviors that the model might not be able to cope with. So we should really take a risk-based approach here. There are different approaches, though. We could also say, okay, let's look at the data that we have and identify minority groups based on that data. Or we could look at the behavior data that we have of students and ask: well, what behavior is untypical, or can we form any groups based on the behavior that we see?