In machine learning, classification problems are very common. The input is a set of features, and the output is a value between 0 and 1 that we interpret as a probability. But in reality, these values aren't necessarily reflective of true probabilities. So how does that even happen? A primary reason is data imbalance. In problems like fraud detection, where we have very few positive samples, it makes sense to under-sample the negative class or overweight the positive class when training a model, so that the model can actually detect the positive instances. But as a result, the probabilities returned by the model may be skewed higher. To make sure the values reflect true probabilities, we need to calibrate the model, and we'll see this in code later on, too. Now, how do we actually calibrate a model? Well, we train a model as we normally would for a classification problem. We then pipe its output into a calibration model with one feature, and the corresponding label is the actual label from the original problem. There are two main calibration methods in wide use: Platt scaling and isotonic regression. Platt scaling basically fits a logistic regression to the model's outputs and is better suited for simpler cases, whereas isotonic regression fits a more flexible non-decreasing piecewise model and can capture more complex relationships. You can try either method and see how your model does, depending on the model and on how your data looks. With that primer out of the way, let's get into some code. Alrighty, so let's take a look. I opened up a notebook, installed scikit-learn, and imported a bunch of packages from scikit-learn. First of all, we have make_classification, which is used to create our classification dataset; I'm just going to be creating dummy data.
CalibratedClassifierCV actually performs the calibration behind the scenes that I discussed previously. calibration_curve is a good way to visualize whether, and how well, a model is calibrated. train_test_split is used to split your data into training and test sets. LogisticRegression is the main model that we're going to be using on our dummy data. roc_auc_score is a metric that quantifies how well the model is performing. brier_score_loss is related to the calibration curve: the calibration curve shows how well a model is calibrated visually, via a graph, whereas the Brier score loss summarizes how well a model is calibrated in a single number. And then we have a bunch of other common functions right over here. So let's take the first case, where we create a classification dataset with 10,000 samples, and it's balanced, meaning there are equal numbers of positive and negative labels. This dataset has 10 features, all of them significant, important features. In this cell I'm splitting it into train, dev, and test sets with an 80/10/10 split. We use the train set to train the model; the dev set (X_val and y_val) we'll use for calibrating the model; and the test set for actually testing the model and getting the results. And just looking at the distribution of the labels in the train set, they're pretty even: of the 10,000 samples, 8,000 go in the train set, 1,000 in the evaluation set, and 1,000 in the test set, and in the 8,000 we have about 4,000 of each class, which is about right, 50-50. Okay, let's first consider the uncalibrated model case. We just fit a logistic regression model on the training data and then make some predictions; y_pred will basically be a list of probabilities.
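For reference, the setup described above might look something like this. It's a sketch with my own variable names and random seed, not necessarily what's in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Balanced dummy dataset: 10,000 samples, 10 informative features
X, y = make_classification(
    n_samples=10_000, n_features=10, n_informative=10,
    n_redundant=0, weights=[0.5, 0.5], random_state=42,
)

# 80/10/10 split into train, dev (calibration), and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)
```

That gives 8,000 training samples and 1,000 each for the dev and test sets, with roughly 50% positives throughout.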
So if I look at the distribution of probabilities, you see that about half of them are below 47% and half are above 47%, which makes sense. In this case, the AUC is about 94.2%, a pretty good model; we'll roll with it. The Brier score loss is 0.0922. Mathematically, the Brier score is the average of the squared differences between the test labels and the predicted probabilities, so, as you can see, lower is better. And this little plot is the calibration curve. Like I said before, this plot shows how well a model is calibrated; ideally it should look very similar to y = x. Let me explain what the two axes mean. The x-axis, the predicted probability of the positive class, is the value of the label prediction probability, and the y-axis is what percentage of the samples at that probability actually have positive labels; ideally the two should be equal. A good way to understand what calibration_curve really does behind the scenes is to open up its implementation in the scikit-learn GitHub repo, which I have right over here. We passed in n_bins=10, and the default is 5. What it does is take the interval from 0 to 1 and segment it into 10 equal parts: 0 to 0.1 is one bin, 0.1 to 0.2 is another bin, and so on. Let me put that up right here; that happens on this line, 875, where we're creating equal-width bins. Then later, we take all 1,000 evaluation examples and put each into the bin where its predicted probability lies: if it lies between 0 and 0.1, it goes in the first bin; if it lies between 0.1 and 0.2, it goes in the second bin; and so on.
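The fit-and-score step just described could be sketched like this, including the "Brier score is just the mean squared difference" point computed by hand (again, my own variable names and seed, so the exact numbers will differ from the video's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=10, n_informative=10,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

auc = roc_auc_score(y_test, y_prob)
brier = brier_score_loss(y_test, y_prob)

# The Brier score is just the mean squared difference between the true
# 0/1 labels and the predicted probabilities -- lower is better
brier_manual = np.mean((y_test - y_prob) ** 2)
```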
And then what we do is compute what goes on the x- and y-axes. prob_true right here is basically the fraction of samples in each bin that belong to the positive class; we compute it for every single bin. And prob_pred is, for each bin, the average of the predicted probability values that were returned by the model and fell into that bin. Ideally these should be as close to each other as possible in every bin, which is why the curve should ideally be a straight line, right? And when we actually look here, it does look pretty straight; it's almost y = x, which means the values returned by this logistic regression over here are pretty good: they really are representative of probabilities, or pretty close to it. All right. So now that we've seen the uncalibrated model, what happens if we calibrate it on this balanced dataset? Basically, we take clf, the trained classifier, and pass it into CalibratedClassifierCV. What we're saying here is: hey, we've already pre-fit this model clf, so all we're going to do is apply an isotonic regression on top of it to calibrate the model. And how we're going to calibrate it is using the evaluation data, which is the other set of 1,000 examples. Then we make the predictions right here with predict_proba, and you can see that the distribution of the calibrated model's predictions is pretty similar to what we saw previously with the uncalibrated model right up here. Before, the median was about 0.47, and now it's about 0.5, which honestly isn't much of a difference. The AUC is similar too, and the Brier score is very comparable, 0.092, which I think is about the same as it was previously.
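The binning logic just described can be rebuilt by hand in a few lines. This is a rough re-implementation of the idea, not scikit-learn's exact source code, using synthetic probabilities and labels that are well calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Fake predicted probabilities, plus labels drawn so that the data is
# perfectly calibrated by construction
y_prob = rng.uniform(0.0, 1.0, size=1000)
y_true = (rng.uniform(0.0, 1.0, size=1000) < y_prob).astype(int)

# scikit-learn: split [0, 1] into 10 equal-width bins, then per bin report
# the fraction of positives (prob_true) and the mean predicted
# probability (prob_pred)
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

# Roughly the same thing by hand
edges = np.linspace(0.0, 1.0, 11)
bin_ids = np.digitize(y_prob, edges[1:-1])
manual_true = np.array([y_true[bin_ids == b].mean() for b in range(10)])
manual_pred = np.array([y_prob[bin_ids == b].mean() for b in range(10)])
```

For a well-calibrated model, prob_true and prob_pred track each other bin by bin, which is exactly the y = x shape on the plot.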
So basically, calibration doesn't really do too much here, and we still get a curve that's very similar to y = x. And by the way, I have to correct myself real quick. Earlier I think I said this curve was computed from the 1,000 samples of the evaluation set. That's wrong; it was actually computed from the 1,000 samples of the test set. And this one is also computed from the test set, because we calibrate the model using the evaluation set but make the predictions for the plot on the test set. I think I've repeated myself three times there, but that's okay as long as we all understand. So yeah, calibration didn't really do much here, because it's a well-balanced dataset and logistic regression is already pretty good at returning probabilities. Now, this next cell is kind of an extra: a not-so-recommended approach I've seen of using your training set to also calibrate your model. It's probably not the best approach, because you're training and calibrating on the same data, which can introduce bias, but I've seen it in certain tutorials out there, so I'm including it here anyway for reference. All right, moving on to the imbalanced dataset case. These are cases like fraud data, where you might have only a few cases of fraud but an abundance of normal transactions. Here I'm also creating 10,000 examples with 10 features, all of them significant, but of these 10,000, 1,000 are positive samples and the other 9,000 are negative. And right here I'm again doing a train/dev/test split, 80/10/10.
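The imbalanced setup just described might look like this sketch (my own names and seed; the weights argument is what skews the class ratio to roughly 1:9):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced "fraud-like" dataset: roughly 10% positives, 90% negatives
X, y = make_classification(
    n_samples=10_000, n_features=10, n_informative=10, n_redundant=0,
    weights=[0.9, 0.1], random_state=42,
)

# 80/10/10 split again, stratified so each split keeps the ~1:9 ratio
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```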
And you can see here, we still have roughly a 1:9 positive-to-negative ratio in each split, which agrees with the weights we've given, so it's representative of what we'd see with fraud data. So, like we did before for the balanced dataset case, let's look at what happens if we pass this into an uncalibrated model. We'll basically fit a logistic regression, but this time we're passing in a parameter, class_weight="balanced". What this does is that, because it's an imbalanced dataset, the positive examples are weighted about nine times more than the negative examples. This is done so that the model is better able to pick up on the positive examples, and it's pretty much a requirement here. So once we're done there, we fit the model and make predictions. We see the AUC is about 90%, pretty good, and the Brier score is 0.087; okay, that's fine. But now, when we describe the predictions of this uncalibrated model, we can see that about 50% of them are under 10%. This is just something to keep in mind, because we'll be comparing it to the calibrated model later and you'll see the difference in probabilities. Now, if we create the calibration curve on the test set, you can see it deviates a lot from y = x, from that straight diagonal line. When I look at this plot, I see that the model is really not that well calibrated, which means the probability values being returned in y_pred_df are actually not very representative of probabilities. So now it becomes pretty apparent: what do we do here? Let's try to calibrate the model. We have our classifier again, and we pass it to CalibratedClassifierCV.
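The class_weight="balanced" step could be sketched like this; the key observation is that the reweighting inflates the predicted probabilities well above the true positive rate (my own names and seed, so the exact numbers will differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=10, n_informative=10,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# class_weight="balanced" reweights each class inversely to its frequency,
# so the rare positives count roughly 9x as much during training
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:, 1]

# Ranking quality is fine, but the probabilities are skewed high: the
# average prediction sits well above the ~10% base rate
auc = roc_auc_score(y_test, y_prob)
median_prob = np.median(y_prob)
base_rate = y_test.mean()
```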
We calibrate the model with the evaluation dataset, and then we make predictions. Now we have an AUC that's not too different from before, but look at our Brier score: it's now 0.05, which is definitely better than the 0.08 we saw previously. And when you look at the predictions returned by our calibrated model, the median is now about 1.5%, whereas before the median was about 10%. So you can see that the probability values have decreased substantially compared to what they were in the uncalibrated case, which is what I hinted at back in the explanation before I showed all this code. These should now be more representative of true probabilities. Why is that the case? Well, if we look at this calibration curve now, you can see it's much closer to y = x, so these values really are closer to true probabilities. And this last cell is the same case I mentioned before, where training and calibration both happen on one set of data in one shebang. So that's about it for model calibration. An interesting place where you'd use this is anywhere you really need the predicted values to be representative of actual probabilities, for example in expectation problems, when you're finding the expected value over one of your features. I've actually illustrated this in detail in another video on expectations, which I think came out before this one, so if you want to see a cool application of probabilities and calibration, I suggest you check that video out. Other than that, I have some references down in the description below, or rather right here at the end of this notebook, and this code will be available on GitHub, also linked in the description below.
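The before-and-after comparison could be sketched as follows. Note that the video calibrates a pre-fit classifier on a separate dev set, whereas this sketch uses CalibratedClassifierCV's internal cross-validation form, which is portable across scikit-learn versions but illustrates the same effect:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=10, n_informative=10,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)

# Uncalibrated predictions for comparison
base = LogisticRegression(class_weight="balanced", max_iter=1000)
y_prob_raw = base.fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Isotonic calibration: each CV fold trains the base model and fits the
# isotonic mapping on the held-out part
cal = CalibratedClassifierCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    method="isotonic", cv=5)
cal.fit(X_train, y_train)
y_prob_cal = cal.predict_proba(X_test)[:, 1]

# Calibration pulls the inflated probabilities back down toward the true
# base rate, which shows up as a lower (better) Brier score
brier_raw = brier_score_loss(y_test, y_prob_raw)
brier_cal = brier_score_loss(y_test, y_prob_cal)
```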
So yeah, just please comment, like, subscribe, do everything you need to do to get the word out. I'm trying to grow a good channel here, so stay tuned, stay safe, and I'll see you later. Bye bye.