Hello, today we're going to talk about how to detect silent machine learning failures in models deployed to production, so you can treat this presentation as a kind of introduction to machine learning monitoring. My name is Wojtek Kuberski and I'm a co-founder of NannyML. NannyML is an open source Python library for machine learning monitoring, for detecting silent machine learning failure and for doing data drift detection.

So let's get started with the agenda. First we're going to talk about the two main reasons why machine learning models can fail: data drift and concept drift. We're going to define what they are and how they can potentially impact performance. Then we're going to talk about the most important thing, which is keeping track of the performance of your model. We're going to talk about why you need to do performance estimation, and why simply calculating performance is most often not possible: once you deploy your models to production you do not have access to targets, or at least you don't have full, immediate access to target data. And then we're going to talk about root cause analysis, the reasons why machine learning models can fail, and the ways to pinpoint what actually changed and what went wrong. So we're going to talk about data drift and concept drift detection.

Before we jump into that, let's set the stage with a very simple use case: loan default prediction, a typical use case in banks. It's a binary classification problem: will a person default or not? We take the credit scores and the customer information, and based on that we try to predict whether a person is going to default on a loan. Once we deploy this model to production, we want to know whether it is still performing well and whether the predictions are reliable, and if something goes wrong, we want to know that it went wrong and why. As our target we're going to use non-payment within one year, so we'll have to wait a year until the target is available. And as our technical metric we're going to use ROC AUC, which is a very typical metric for binary classification models.

So now let's start defining the two main ways that machine learning models can fail, the two main reasons, which are concept drift and data drift. Before we do that, we need to look at what we're actually trying to do when we train our machine learning models. Let's start with the true pattern that exists in reality. Here, as an example, you can see that there is some variable x, and as x increases, the chance that a given data point belongs to the negative class also increases in a sigmoidal fashion. So we have this true pattern that exists in reality, some relationship that might or might not be causal; it doesn't really matter. Then we sample from the population according to that pattern, because this pattern exists in reality, and that gives us our data. We split it into our training, validation and test sets, maybe we do some cross-validation, it doesn't really matter. This is the data we use to develop and test our model.
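To make that setup concrete, here is a minimal sketch (not from the talk's slides) of the three steps just described: a true sigmoidal pattern that exists in reality, a sampling step that produces the data, and a model fit on that sample. The distribution parameters and the choice of LogisticRegression are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def p_negative(x):
    # The "true pattern": as x increases, the chance of the negative class
    # rises in a sigmoidal fashion, so P(y = 1 | x) = 1 - sigmoid(x).
    return 1.0 / (1.0 + np.exp(-x))

# Sampling step: draw inputs from the population, then draw labels
# according to the true pattern.
x = rng.normal(loc=0.0, scale=2.0, size=5_000)
y = rng.binomial(n=1, p=1.0 - p_negative(x))

# The data we actually get to work with: split it and fit a model.
x_train, x_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=42
)
model = LogisticRegression().fit(x_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(x_test)[:, 1]))
```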
And now let's see what happens if there is data drift: how the true pattern, the sampling and the data are affected, and how performance might potentially be affected. In the case of data drift, or covariate shift as it's also known, the true pattern remains unchanged. The thing we're trying to capture, this pattern, has not changed at all; it is more or less exactly the same if we're dealing only with covariate shift. What changes is the sampling: we sample from this pattern in a slightly different way. So the distribution of the inputs changes, but the way the inputs relate to the target remains the same. To define it a bit more formally, data drift or covariate shift is a change in the joint model input distribution, P(X). Again, the distribution of x changes, but the pattern stays the same.

If covariate shift happens, it might or might not impact performance. One example where it will: imagine that the data drifts into a region where it's harder to distinguish between the positive and negative class, close to the real class boundary that separates the two. There, because of a bit of noise, it's going to be very hard for the model, or for anyone else, to tell whether a point should be in the negative or positive class, so we expect the performance of the model to drop. However, if the data drifts into a region where the model is even more certain of its predictions than before, we could even see an increase in model performance. Data drift is something that tends to happen quite often, so we should be able to somehow capture its impact on performance. But more on that later.

Now let's look at the second reason why performance might change, and this reason is concept drift. In that case what changes is the true pattern itself. The sigmoidal shape, the relationship between the feature x and the frequency of the positive and negative classes, becomes different. The pattern that our machine learning model learned is no longer the same; maybe it is now much more linear than sigmoidal. If we deal with pure concept drift, without covariate shift, our sampling does not change, but our data will look different, not from the input perspective but from the target distribution perspective. Again, let's define it more formally: concept drift is a change in the underlying concept, the pattern or mapping between the target and the model inputs, that is, in the probability of the target given the inputs, P(Y|X). To visualize it quickly: in the training data we have one true boundary (this is not about the learned boundary but about the true one), and in the production data we have something completely different. In that case the performance of the model will of course not be as good as it used to be, because the learned pattern is no longer the same as the real pattern. So concept drift will almost always impact performance, and the stronger the concept drift, the stronger its impact on performance.
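Here is a minimal sketch (again, not taken from the talk) that simulates both failure modes on synthetic data: a covariate shift that concentrates the inputs near the class boundary, and a concept drift that flattens the true pattern. The distributions and the 0.3 coefficient are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Reference situation: inputs from P(X), labels from the true pattern P(y|X).
x_ref = rng.normal(0.0, 2.0, 10_000)
y_ref = rng.binomial(1, sigmoid(x_ref))
model = LogisticRegression().fit(x_ref.reshape(-1, 1), y_ref)

def auc(x, y):
    return roc_auc_score(y, model.predict_proba(x.reshape(-1, 1))[:, 1])

# Covariate shift: P(X) changes (inputs concentrate near the class boundary
# at x = 0), but the true pattern P(y|X) stays exactly the same.
x_cov = rng.normal(0.0, 0.5, 10_000)
y_cov = rng.binomial(1, sigmoid(x_cov))

# Concept drift: P(X) stays the same, but P(y|X) itself changes
# (the relationship flattens, so x now carries less signal).
x_con = rng.normal(0.0, 2.0, 10_000)
y_con = rng.binomial(1, sigmoid(0.3 * x_con))

print("reference AUC:      ", auc(x_ref, y_ref))
print("covariate shift AUC:", auc(x_cov, y_cov))  # drops: harder region
print("concept drift AUC:  ", auc(x_con, y_con))  # drops: pattern changed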
I've mentioned performance quite a few times now, so let's talk about why we should really focus on it. First of all, just detecting data drift is not enough, because the mere existence of data drift is not a strong signal at all for a change in performance. Another reason is that performance is what we optimize in training, and we use it as a proxy for business impact, which is very important to us, because as data scientists our job is to maximize business impact with data, with machine learning use cases.

So now that we know the reasons why performance can degrade, and why performance is really important, let's talk about how we can actually monitor it. Ideally, we'd simply calculate the performance: take the ground truth, the target data, compare it with the predictions, and compute our F1, our ROC AUC, whatever metric you like. The problem is that most of the time we do not actually have access to target data, and the reasons are really threefold.

First, in prediction use cases, the use cases where we try to predict something in the future, the target data will be delayed, and it will be more delayed the further into the future we try to predict. If we're trying to predict default on a mortgage, we might actually define it from a business perspective as non-payment within three or five years. That means we'll only get our target data five years from now, so we'd be flying blind for five years, which is not an acceptable level of risk for most businesses, and especially not for use cases with a huge impact on the business, like credit loan default prediction.

Second, in some of these cases we do not have complete labels. To continue the credit scoring example: for every person we gave a loan to, that is, everyone we predicted would not default and so received a negative prediction, we eventually know whether they defaulted or not, so we can tell whether our negatives were true negatives or false negatives. However, if our model predicts that a person is going to default on a loan, so we have a positive prediction and no loan is given, we will never be able to tell whether they would have paid it back had we given it to them. So we do not have access to every single label, and we cannot reconstruct the confusion matrix. There are methods that deal with this, like reject inference, but they still do not provide a full picture of all the labels, all the targets.

Third, looking at it from a different perspective, consider automation use cases, where instead of predicting something in the future we're trying to automate some kind of menial labor done by humans. There, getting all the labels would defeat the purpose of the use case, because we'd have to redo every single prediction manually, and that just doesn't make any sense. Most of the time we get spot checks, where humans double-check the machine's predictions, but this typically covers around 1% of the data. So for 99% of our predictions, we will not have the labels.

In short, we do not have access to ground truth, and simply calculating performance is not going to be the answer for our monitoring use case most of the time. What we need to do instead is estimate performance.
And the way to do it is really to look at how the model evaluates its own confidence, and then transform that into expected performance. Before we go deeper, I'll just mention that this is an algorithm we developed in-house at NannyML, confidence-based performance estimation, and it's part of our open source package. What we're trying to do is capture the impact of data drift on performance. So we take the model scores, the predicted probabilities, make sure these predicted probabilities actually represent the probability that a given row, a given person, belongs to the positive or negative class, and then transform that into expected performance.

First, let's start with taking the model scores and making sure they actually represent probabilities. To do that, we need to calibrate them. Probability calibration is a technique where we fit another model that adjusts the model scores so that, if you bucket your data according to quantiles, say between 0% and 10% chance, or 0.0 to 0.1, and so on, the data in each bucket actually has the corresponding chance of belonging to the positive or negative class. To simplify, say you have 100 predictions where the model score is around 0.9. What we want to ensure is that 0.9 actually means that 90 of those 100 predictions will turn out to be positive. So what we expect from a well-calibrated probability is that it gives you the chance that a given data point is positive or negative.

Once we have that, the other thing we need is the threshold. In most binary classification use cases, at some point we need to threshold our scores: say we threshold at 0.5, and then everything above 0.5 becomes a positive prediction and everything below 0.5 a negative prediction. So now we take a data point and look at its calibrated probability, let's say 0.9, compare it to our threshold, let's say 0.5, and see that this will be a positive prediction. Because the probabilities are calibrated, we know there is a 90% chance that this is a true positive, that the prediction will actually turn out to be correct with 90% probability. So we construct a partial confusion matrix: we put 0.9 in the true positive cell, and then we take 0.1 and put it in the false positive cell, because there is a 1 minus 0.9, so 0.1, chance that the point is actually negative and the positive prediction is false. So we have 0.1 in the false positive cell, and we have this partial confusion matrix for just one data point. Then we look at all the data points in the period we want to analyze and compute performance for, say the last day or the last week, do the same for every one of them, and sum these partial confusion matrices. What we end up with is the expected confusion matrix according to the model itself.
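Here is a minimal sketch of that construction, written from the description above rather than from NannyML's actual implementation (the library's CBPE estimator does considerably more, including estimating ROC AUC). The example scores and the 0.5 threshold are illustrative.

```python
import numpy as np

def expected_confusion_matrix(calibrated_probs, threshold=0.5):
    """Expected confusion matrix from calibrated P(positive) scores.

    For each point with calibrated probability p:
      - if p >= threshold we predict positive: add p to TP and (1 - p) to FP;
      - otherwise we predict negative: add (1 - p) to TN and p to FN.
    Summing these partial matrices over a window gives the model's own
    expectation of its confusion matrix, with no labels needed.
    """
    p = np.asarray(calibrated_probs, dtype=float)
    pos = p >= threshold
    tp = p[pos].sum()
    fp = (1.0 - p[pos]).sum()
    tn = (1.0 - p[~pos]).sum()
    fn = p[~pos].sum()
    return tp, fp, tn, fn

# Example: calibrated scores from one analysis window (e.g. last week).
scores = np.array([0.9, 0.8, 0.95, 0.3, 0.1, 0.55, 0.2])
tp, fp, tn, fn = expected_confusion_matrix(scores)
print("expected precision:", tp / (tp + fp))
print("expected recall:   ", tp / (tp + fn))
```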
So the model effectively gives us its own estimate of how well it thinks it performs on a specific dataset, and this really takes into account the change in the model input distribution. Imagine that in the picture on the left, say our training data, we don't have many data points close to the class boundary, where the model is not confident, while on the right we see far more data points next to the class boundary. The model is less confident there, so the performance is expected to be lower. This algorithm, with this simple transformation, fully takes into account the impact of data drift on performance.

To give you a quick example of how it works in practice: we took the California housing dataset, which most of you are familiar with, thresholded the target to turn it into a binary classification problem for simplicity, and compared our estimated ROC AUC with the realized ROC AUC. As you can see, the two fit quite well.

So now we know how to estimate performance when we cannot calculate it. But we still don't know why performance decreases, because this method is not very interpretable. For that we need to go back to data drift detection and see what data drifted in a way that impacted performance, so we can figure out how to troubleshoot and resolve the problem.

So let's look at data drift detection, or covariate shift detection. There are two ways to do it. The first is univariate: we use things like the Kolmogorov-Smirnov test or the chi-square test to compare the distribution before and after the change. We look at our reference dataset, where we know everything is fine, and at our analysis dataset, where we'd like to know whether there is significant data drift, and we try to figure out whether a specific feature actually drifted. So we look at every feature separately, as in the sketch below. This technique is great for interpretability, because you will know exactly which features drifted. However, it has two big drawbacks. The first is that if you have a lot of features, say 100 or 200, you will get a lot of false positives, because things will just randomly change all the time, and that does not mean these changes actually impact performance. The second is that univariate drift detection methods do not take into account changes in the relationships between the model inputs: if the correlation between two features changes but the distribution of each feature, seen on its own, does not, we will not be able to capture that drift.
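A minimal sketch of the univariate approach, using SciPy's implementations of the two tests the talk mentions; the significance level, the numeric/categorical split and the function name are illustrative choices, not NannyML's API.

```python
import pandas as pd
from scipy import stats

def univariate_drift(reference: pd.DataFrame, analysis: pd.DataFrame, alpha=0.05):
    """Flag per-feature drift between a reference and an analysis dataset."""
    results = {}
    for col in reference.columns:
        if pd.api.types.is_numeric_dtype(reference[col]):
            # Continuous feature: Kolmogorov-Smirnov two-sample test.
            _, p_value = stats.ks_2samp(reference[col], analysis[col])
        else:
            # Categorical feature: chi-square test on the category counts.
            counts = pd.DataFrame({
                "reference": reference[col].value_counts(),
                "analysis": analysis[col].value_counts(),
            }).fillna(0)
            _, p_value, _, _ = stats.chi2_contingency(counts)
        results[col] = {"p_value": p_value, "drifted": p_value < alpha}
    return results
```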
Because of these two drawbacks, we normally turn to a somewhat more advanced method where we look at all the features at once, or at a subset of features at once: multivariate covariate shift detection, using an algorithm based on data reconstruction. What we do is compress the data with a dimensionality reduction method that learns the structure of the data, then apply the inverse transform to reconstruct the data, then compare all the points in the original data with the points in the reconstructed data and measure the distance between them. This distance tells us how good the compressor is: if the compressor were perfect, the original data and the reconstructed data would be exactly the same. And because the compressor learns the structure of the data, we train it on our reference dataset and then run the compression and decompression on our analysis dataset; the error between the original and reconstructed data tells us how strongly the structure of the data has changed.

So we plot this reconstruction error over time, and if the error increases, we have data drift. We should start by looking at all features at once, and then, if we see drift for a certain region, look at subsets of features. That gives us a quite comprehensive way of checking whether there is data drift, and a bit of interpretability, because we'll know which subset of features has drifted and how the relationships between features are changing. If the reconstruction error stays the same, all good: the structure of the data is likely very similar. And if the reconstruction error actually decreases, that means the real structure of the data is getting more and more similar to the learned structure, which is also data drift, just in the opposite direction.

I mentioned that we need some kind of encoding, some compression or dimensionality reduction method. We use PCA in the library, but in principle it can be any encoding that learns the internal structure of the data, reduces dimensionality, provides an inverse transformation, and provides a latent structure that maps in a stable way to the original space. This last requirement is really needed so that we can use the reconstruction error as a measure of the magnitude of the drift: the larger the reconstruction error, the stronger the data drift. The reconstruction error itself can be any metric that looks at distance; we simply use the mean Euclidean distance.

And again, a quick example of how this works in practice. Imagine you have points in blue, our reference dataset, for which we know everything is fine, and points in orange, for which we don't know what's going on. If we run simple univariate drift detection on both features, x and y, we will not see any changes. However, if we run PCA reconstruction error multivariate drift detection, we will see that the reconstruction error spikes very strongly after the distribution changes. So we are able to capture this change in the correlation between features even though the distribution of each individual feature does not change.
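Here is a minimal sketch of that idea using scikit-learn's PCA; NannyML's actual implementation adds scaling choices, chunking and thresholds, so this only shows the core mechanism. The synthetic data mirrors the blue/orange example above: per-feature distributions stay the same while the correlation breaks, and the variance-explained cutoff is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fit_reconstructor(reference, n_components=0.65):
    # Learn the structure of the reference data: scale, then PCA keeping
    # enough components to explain the given share of variance.
    scaler = StandardScaler().fit(reference)
    pca = PCA(n_components=n_components).fit(scaler.transform(reference))
    return scaler, pca

def reconstruction_error(scaler, pca, data):
    # Compress, inverse-transform, and take the mean Euclidean distance
    # between the original and reconstructed points.
    scaled = scaler.transform(data)
    reconstructed = pca.inverse_transform(pca.transform(scaled))
    return np.mean(np.linalg.norm(scaled - reconstructed, axis=1))

# Two correlated features in the reference set; in the analysis set the
# per-feature distributions are unchanged but the correlation is broken,
# so univariate tests see nothing while the reconstruction error spikes.
rng = np.random.default_rng(7)
base = rng.normal(size=5_000)
reference = np.column_stack([base, base + 0.1 * rng.normal(size=5_000)])
analysis = np.column_stack([rng.normal(size=5_000), rng.normal(size=5_000)])

scaler, pca = fit_reconstructor(reference)
print("reference error:", reconstruction_error(scaler, pca, reference))
print("analysis error: ", reconstruction_error(scaler, pca, analysis))
```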
So now let's quickly summarize. First and foremost, data drift and concept drift are the main reasons why performance can drop, and data drift does not always lead to a performance drop. That's why we'd like to monitor performance itself rather than just data drift. But production targets are often not available, so we cannot simply calculate performance; we need to estimate it without the target data, using the algorithm I explained, confidence-based performance estimation. And only then, if performance drops, should we go back to data drift detection to figure out what actually happened.

And that's really it. Thanks for listening, and feel free to check out our GitHub; as I mentioned, we're open source, and if you like what you see there, do give us a star. Also feel free to visit our website, and if you have any specific questions, you can ask them now in the Q&A session or add me on LinkedIn later. Thank you very much, that's it.