Hello, everyone. I am delighted to introduce Oliver Kopp, who is here to talk to us about protecting your machine learning against drift. Hello Oliver, and welcome. Hi, thank you for having me. Where are you calling in from today? London. Ah, not so far from me, I'm in Cambridge. All right, I'm very excited for this talk, so without any further ado I will let you take it away.

Great, thanks a lot, and thank you all for tuning in. I hope you've been enjoying the conference so far. I'm Oliver Kopp, an applied machine learning researcher at Seldon, which is a machine learning deployment and monitoring company. At Seldon we spend a lot of time thinking about how we can best enable our users to deploy their machine learning models in a manner that's responsible and robust, so they can be trusted to automate processes of real-world consequence. One way in which trust can be instilled is by having systems in place that can detect when the model is being asked to operate on data from a distribution that differs from the one on which it was trained. These systems are called drift detectors, and they'll be the focus of this talk. In the data science and research team at Seldon we develop and maintain an open-source Python library, alibi-detect, which provides implementations of state-of-the-art algorithms for drift detection as well as outlier detection and adversarial detection, but this presentation will focus solely on drift detection.

In particular, we will look at what drift is and why it pays to detect it; the different ways in which drift can manifest itself; how drift can be detected in a principled manner, where we're able to differentiate systemic change from natural fluctuations; and the anatomy of a drift detector and the various components of which they're made up. Hopefully along the way we can demystify some concepts such as online detectors, permutation tests and MMD tests, which might appear a little abstract or theoretical when you stumble across them in the alibi-detect docs, for example, but which, at least in the context of drift detection, are relatively intuitive. We'll finish up with a practical demonstration showing how you can use the alibi-detect Python library.

Okay, so let's dive in and set the scene into which we will introduce drift detectors. This is the standard supervised machine learning context, with which I'm sure many of you will be familiar. Suppose there's some quantity y that we'd like to use as part of some decision-making process, be that manual or automated. Sadly, we can't observe y, but we can instead observe some features that are related to y in some probabilistic manner. So what we do is fit a model to predict the target from the features, and then use the model's predictions in place of the unobservable quantity.

Here is a simple example. We've got a binary classification problem where we'd like to be able to predict a category, nought or one, given the features x1 and x2. To make this a little bit more concrete, we're going to think of each data point as representing a person: x1 is an indication of their economic conservatism, x2 an indication of their social conservatism, and the category is their voting intention, with the blue data points representing those intending to vote for the UK Conservative Party and the red data points those intending to vote for the UK Labour Party. You can see here that we've got two clusters of voters: one of particularly economically and socially conservative individuals, and another of economically and socially liberal individuals.
That's obviously a bit of a simplification, but it's the example we're going to roll with. So suppose we fit some model to predict voting intention from these features, and the decision boundary is given by this black dotted line. It might turn out that, say, on held-out training instances the model achieves some satisfactory level of accuracy, and we then wish to use it as part of some downstream application: a targeted ad campaign, for example, where we send voters ads depending on which way we predict they'll vote. The way this works is that we get an indication of how well we expect our model to perform on future data points by looking at how well it performs on held-out training instances, and that then gives us the comfort to use the model in deployment. However, this is only a good indication assuming the process underlying the data remains constant over time. There are various reasons why in practice this might not hold, and we'll come on to those in a moment.

So this sets up nicely what precisely drift is: it is simply when the process underlying the data that arrives during deployment differs from that which underlies the training data. This is problematic because, when such a change occurs, we can no longer expect the model's performance during deployment to match that observed on held-out training data. In effect we're flying blind, and we don't know how well our model is performing.

Many of you are probably thinking that you really don't care if our targeted ad campaign suffers, so before we proceed I'm going to introduce this example from medical imaging, which will hopefully better motivate the problem. Suppose that we would like to train a model to predict whether these tissue scans are benign or cancerous. For the training data we collect scans from a mixture of three different hospitals, we train our model, and we observe that on held-out training instances it achieves a classification accuracy of 93%. Suppose this is better than some human baseline, say 90%. We might then decide it's worth deploying our model to use on future patients. Once deployed, it might work well initially if it's being queried with scans from the same underlying distribution, for example if it's used on scans from the same hospitals. But then suppose a new hospital starts using the model, and the distribution underlying their scans is subtly different in some way: maybe due to a different demographic of patient, or, as it looks a bit like here, because the stain used in the microscopy is slightly different. When such a change occurs we can no longer rely on our model to perform at the level it did on held-out training instances, and it might end up misdiagnosing patients and putting them at unnecessary risk. So that's a very concrete example of why this drift detection functionality can be needed.

Okay, and by the process underlying the data we're of course referring to the joint probability distribution over the features and the target. This is useful because it helps us categorise the different ways in which drift can occur, and we do that by noting that the joint distribution can be decomposed in two different ways.
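To spell that decomposition out (this is just the standard product rule, written in my notation rather than taken from the slides):

```latex
p(\mathbf{x}, y) \;=\; p(y \mid \mathbf{x})\, p(\mathbf{x}) \;=\; p(\mathbf{x} \mid y)\, p(y)
```

Each type of drift described next corresponds to one of these factors changing while its partner stays fixed.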
Returning to our voting intention example, we'll now see what the different types of drift correspond to. So let's say this is what the undrifted data looks like, the training data. One way that drift can occur is with the distribution of the features remaining the same but their relationship to the target changing (this is what's usually called concept drift). Applied to our example, there might be some new tax policy that means voters no longer intend to vote along economic lines but purely along social lines, and then our model might end up misclassifying those individuals who are particularly economically conservative but socially liberal, or socially conservative and economically liberal, and the model's performance will suffer.

Another type of drift that can occur is what we call prior probability drift, where, given an individual's voting intention, their features keep the same underlying distribution, but the proportion of voters falling into each category has changed, as you can see here. When this happens, what was previously a suitable model decision boundary no longer looks suitable: the individuals falling close to the decision boundary are now overwhelmingly Conservative, whereas before it was 50-50, and therefore the model is going to misclassify a lot more individuals and its performance will suffer.

Another way in which drift can occur is what we call covariate drift, where the relationship between the target and the features remains the same but the distribution of the features changes. So suppose there's some change in society which means that individuals are pushed towards the conservative and liberal extremes, or alternatively they're brought together into more of a central cluster. The relationship between the features and the target has remained the same, and therefore it makes sense to keep the same model decision boundary, but it turns out that the model's performance could still change dramatically. If the clusters are pushed away from each other, the model might actually perform better, because fewer data points fall into the region in which the model is uncertain. If they're brought together, there's suddenly a much harder problem for the model to solve, it ends up misclassifying far more points, and the model's performance will drop. Types of drift that cause the model's performance to decrease are referred to as malicious drift.

You might be thinking that this looks like a relatively straightforward problem: as the data arrives we can just monitor some metric of model performance, and if it looks like it's decreasing we can say, well, malicious drift is occurring, let's take corrective action. But it's actually not that simple, because in practice the deployment data usually looks more like this, where we don't observe the labels. Our data doesn't arrive labelled, and this forces us to look for the change directly in the feature space. That is far more challenging, because the features can often be very high-dimensional and complex. In this example they're only 2D, so you can eyeball it and say there's clearly been drift here, or drift there; but if you instead imagine that these data points are collections of images or text, it's suddenly a much more challenging problem to know when a change has occurred. This is the unsupervised drift detection context, which is typically much harder.
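As a quick recap of the three types just walked through, in terms of which factor of the joint distribution moves (my shorthand; the talk describes them pictorially rather than writing them out):

```latex
\text{concept drift:} \quad p(y \mid \mathbf{x}) \text{ changes}, \; p(\mathbf{x}) \text{ fixed} \\
\text{prior probability drift:} \quad p(y) \text{ changes}, \; p(\mathbf{x} \mid y) \text{ fixed} \\
\text{covariate drift:} \quad p(\mathbf{x}) \text{ changes}, \; p(y \mid \mathbf{x}) \text{ fixed}
```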
So, given that we're looking for change in the feature space, how can we actually go about detecting drift? We don't expect the deployment data to look identical to the training data, because we expect there to be some natural fluctuations. For example, if this is our training data and this data arrived during deployment, we might look at it and say, well, the clusters do look like they're potentially slightly closer together. But is that a systemic change, or could it just have been a natural fluctuation? If instead this batch of data arrived, where it certainly looks like there's one big central cluster, we'd probably be quite comfortable identifying that with drift and taking some sort of corrective action. The problem is, where do we draw the line? How much deviation do we allow before we say, well, that's drift, we need to correct for it?

The way we can approach this in a principled manner is via statistical hypothesis testing, which I'll now quickly review for those of you who didn't take stats at school, or who are desperately trying to forget the experience. The way it works, leaving the drift detection context for a moment, is that before observing some data we specify a null hypothesis and an alternative hypothesis about the data-generating process. We also specify a test statistic which we expect to be small under the null hypothesis and large under the alternative hypothesis. We then observe the data and compute the actual value the test statistic takes, and we also compute what's called a p-value, which is the probability that such an extreme value of the test statistic would have been observed if the null hypothesis were true. The idea here is that a low p-value discredits the null hypothesis: it's like saying we could continue to believe the null hypothesis, but then we'd also have to believe that we just happened to observe some extremely unlikely data. So we normally specify a threshold, and if the p-value falls below that threshold we reject the null hypothesis. Even if the null hypothesis is true, that only occurs with probability equal to the threshold, so this allows us to reject the null hypothesis in a manner where the false positive rate is known.

Okay, so returning to the drift detection context, the way we apply this is by denoting by q0 the distribution underlying the training data on which our model is fit, and by q1 the distribution underlying a batch of deployment data. We take as the null hypothesis that the two distributions are the same, and as the alternative hypothesis that the two distributions differ in some way. We then try to specify a test statistic, a function of the training and deployment data, that we expect to be small if the two distributions are the same and large if they're different. We compute the p-value and flag drift if it falls below whatever false positive rate we desire. The hard part here is specifying that test statistic: a function of potentially high-dimensional, complex data that we expect to be small under no drift and large under drift. That's a difficult problem, and even if we are able to do it, how do we then associate it with a p-value that determines how extreme it is? This is the sort of problem we're looking to solve.
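To make that batch-comparison recipe concrete in the easy case of a single, low-dimensional feature, here's what a classical two-sample test looks like in code. The Kolmogorov-Smirnov test isn't something covered in this talk; it's just one standard choice for a univariate feature, used here to illustrate the statistic, p-value and threshold:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference (training) feature values and a batch of deployment values.
# Here the deployment batch is drawn from a slightly shifted distribution.
x_ref = rng.normal(loc=0.0, scale=1.0, size=500)
x_test = rng.normal(loc=0.3, scale=1.0, size=500)

# Two-sample Kolmogorov-Smirnov test: the statistic is small when the two
# samples look like they share a distribution, and large when they don't.
stat, p_value = ks_2samp(x_ref, x_test)

alpha = 0.05  # desired false positive rate
print(f"test statistic={stat:.3f}, p-value={p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: flag drift.")
else:
    print("No evidence of drift at this threshold.")
```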
And as if that weren't challenging enough, in practice we don't observe the deployment data as one big batch that we can compare to the reference data; it actually arrives sequentially over time, like this, and evolves. When drift occurs we'd like to detect it as soon as possible. The way we formalise this problem is by assuming that the data follows the pre-change distribution, the training data distribution, until some change point T*, after which it follows some alternative distribution. At each time step we'd like to perform a hypothesis test of whether or not drift has already occurred: every time step we ask, has drift occurred yet?

The desired properties of algorithms designed to tackle this problem are, firstly, that we'd like them to be fast to respond when drift does occur. We formalise this as the expected detection delay, which is the expected difference between the time T' at which we flag drift and the time T* at which drift actually occurs, and we'd like that expected detection delay to be small. However, we'd only like it to be small subject to a constraint of a known frequency of false detections when there has been no change. We can formalise this as the expected run time, which is the average time at which drift is detected when there hasn't been a change; we can think of the change point as being at infinity, because the change never occurs. You might have noticed that there's an inherent trade-off here: we'd like the detectors to have a low detection delay, but that requires a sensitivity which also makes the detector more susceptible to natural fluctuations that might trigger a false detection, and then the expected run time is also low, which we don't want. So we have to balance that trade-off.

This point, the ability to specify the frequency of false detections in the absence of change, is actually often overlooked. Often, algorithms only allow you to bound this expected run time, but that's not particularly useful, because the action we'd take in response to, say, a one-in-a-hundred event is entirely different from the action we'd take in response to a one-in-ten-thousand event. It's important to be able to know accurately how significant a detection is when it occurs, so that we can respond appropriately.
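Written out, the two quantities just described look like this (my notation: T' is the time at which the detector flags drift, T* the change point):

```latex
\text{EDD} \;=\; \mathbb{E}\!\left[\, T' - T^{*} \,\middle|\, \text{change at } T^{*} \right],
\qquad
\text{ERT} \;=\; \mathbb{E}\!\left[\, T' \,\middle|\, \text{no change } (T^{*} = \infty) \right]
```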
Okay, so you might have been wondering how we can apply this statistical hypothesis testing framework to data that's arriving sequentially. The way we do this is by collecting the deployment instances together into test windows, and these can then be compared to the reference data, or reference window, using test statistics in the way we described before. There are actually various strategies you can adopt for collecting the instances into windows, and the characteristics and properties of drift detectors often derive from how they approach this windowing, so we'll now describe some of the main strategies.

Consider the data arriving sequentially, so you've got t equals 1, 2, 3 and so on; these are the features plotted here. One way we can window the data is using disjoint windows, where we specify a window size of, say, 5, collect the first 5 instances together, perform a test, and consider whether or not the test statistic is above some threshold. If it is, we flag drift; if not, the window slides along to the next 5 instances, we perform another test, and the window progresses like that. An alternative is using overlapping windows, where again we fix a window size, but this time after performing a test we only slide the window along by 1 rather than by 5, so the window contains much of the same data; we perform another test, and it progresses like that.

A third way we can window the data, and algorithms that implicitly adopt this approach often don't formulate it as a windowing strategy, but I find it particularly useful to think of it this way, is by instantiating the window with just a single instance. Not only do we perform a test of whether or not there's evidence of drift, but we also consider whether or not we should allow the window to grow. If there's no evidence for drift, we reset the window to size 1, start again and perform another test. If there is a data point that's indicative of drift, but not so much, say, that it causes a detection, we keep it in the window and allow the window to grow; perform another test, more evidence, allow the window to grow again. This lets the window grow and accumulate evidence over time, which can be particularly useful when drift is slight, because we don't just cut the window short and move on; we allow it to accumulate if the evidence is there.

All these different windowing strategies call for different test statistics, and different mechanisms for determining the thresholds that the test statistics must remain below for drift not to be flagged. Each of the strategies also has its own pros and cons, which we'll dive into a little now. Firstly, the disjoint window detectors. Their distinguishing characteristic is that the tests are independent, because the windows are disjoint, and performed infrequently, because you only perform them every, say, 5, or a bit more realistically every, say, 500, instances that arrive. This is good because we can then take any test statistic and it's actually possible to accurately estimate the p-value that says how extreme that statistic is, because we can perform what's called a permutation test. This would be too expensive to perform at every time step, say for overlapping windows, but in this disjoint window framework we have time to perform them. Permutation tests are actually dead intuitive, so I'll quickly overview how they work. We take the reference window and the test window and we define a function that shuffles all of the data, such that we then have two new windows containing a mix of reference and test data points, completely shuffled together. We compute the value of the test statistic on a large number of these shuffled windows, and then we look at the value of the test statistic on the unshuffled windows relative to all the values on the shuffled windows, and see how extreme it is. If, for example, it sits at the 99th percentile, then it's in the top 1% of extremeness, and that gives us a p-value that says how extreme the test statistic was. The idea here is that if there's been no change, then it doesn't matter whether or not we shuffle the data, because it all has the same underlying distribution, and therefore we don't expect the unshuffled test statistic to lie at an extreme end. So if it does lie in the top 1%, that's a one-in-a-hundred event, for example, and that gives us the p-value.
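As a concrete illustration of that shuffling idea, here's a minimal permutation test in plain NumPy. The test statistic here is just a difference in means, purely for illustration; in practice you'd plug in something like the MMD we'll get to shortly:

```python
import numpy as np

def permutation_test(x_ref, x_test, statistic, n_permutations=1000, seed=0):
    """Estimate a p-value for statistic(x_ref, x_test) by shuffling the two windows together."""
    rng = np.random.default_rng(seed)
    observed = statistic(x_ref, x_test)
    pooled = np.concatenate([x_ref, x_test])
    n_ref = len(x_ref)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(pooled)                  # shuffle reference and test data together
        shuffled_stat = statistic(perm[:n_ref], perm[n_ref:])
        count += shuffled_stat >= observed              # how often a shuffled statistic is as extreme
    return (count + 1) / (n_permutations + 1)           # p-value (with the usual +1 correction)

# Toy test statistic: absolute difference in means.
diff_in_means = lambda a, b: abs(a.mean() - b.mean())

rng = np.random.default_rng(1)
x_ref = rng.normal(0.0, 1.0, size=200)
x_test = rng.normal(0.5, 1.0, size=200)   # drifted batch
print(permutation_test(x_ref, x_test, diff_in_means))
```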
Another key advantage of the disjoint window detectors is that we don't necessarily have to specify the test statistic, which, as mentioned before, can be a huge pain. We can instead use a portion of the data to learn the test statistic: we can train a function to be small under no change and large under change, and that completely gets around the problem of specifying the statistic. The cons of disjoint window detectors are that they're very sensitive to the choice of window size. If you use a small window size, there likely won't be enough evidence in any given window to allow drift to be detected; but if you use a very large window size, then we're only performing a test every, say, 500 or 1,000 instances, and the detection delay is necessarily going to be large, because we have to wait until the next test point regardless. That means the detectors are particularly slow to respond to severe drift: it doesn't matter how severely drifted the instances are, we still have to wait until the next test point before they can cause drift to be flagged.

Then, looking at overlapping window detectors, these are actually much trickier, because consecutive test statistics are highly correlated. Test statistics at successive time steps are functions of much the same data, and therefore they're going to be correlated, and this makes specifying the relationship between thresholds and expected run times much harder. Even if we can identify the threshold that the first test statistic exceeds with probability only, say, one in a thousand, then, given that we make it to the second time step, the probability that the second test statistic exceeds that threshold is some unknown value much rarer than one in a thousand, because we already know it didn't cause a detection at the first time step. So you have this correlation which makes things much trickier. This is something we've spent a lot of time researching at Seldon: how we can capture this relationship between thresholds and run times for the overlapping window detectors. It turns out there is a way in which we can estimate time-varying thresholds that allow a desired expected run time to be targeted. However, we then have additional constraints on the test statistic that we didn't have for the disjoint window detectors. It must be possible to update it incrementally, because we're performing tests every time step, and it's important that we don't have to compute it from scratch each time, because that would be too costly. Also, we can't learn the test statistic like we could for the disjoint window detectors; we have to be able to specify it at the outset, which, as I mentioned, is a bit of a pain. But if we can do that, these detectors are computationally light, because of the incremental manner in which they operate, and they can be very fast to respond to severe drift, because the severely drifted instances filter into the test window immediately. We don't have to wait until the next test point; we perform tests every time step, and therefore we're faster to respond when severe drift occurs.
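To pull the two batch-style strategies together, here's a rough skeleton of how the windowing bookkeeping differs in code. This is only a sketch: the test statistic is a placeholder and a fixed threshold stands in for the calibration discussed above.

```python
import numpy as np
from collections import deque

def mean_shift_stat(ref, window):
    """Placeholder test statistic: absolute difference in means."""
    return abs(np.mean(ref) - np.mean(window))

def disjoint_window_detector(stream, x_ref, window_size=5, threshold=0.5):
    """Test once per full, non-overlapping window of deployment data."""
    window = []
    for t, x in enumerate(stream):
        window.append(x)
        if len(window) == window_size:              # only test when a fresh window is full
            if mean_shift_stat(x_ref, window) > threshold:
                return t                            # time step at which drift is flagged
            window = []                             # discard and start the next disjoint window
    return None

def overlapping_window_detector(stream, x_ref, window_size=5, threshold=0.5):
    """Test at every time step once the window is full, sliding along by one instance."""
    window = deque(maxlen=window_size)              # the oldest instance drops out automatically
    for t, x in enumerate(stream):
        window.append(x)
        if len(window) == window_size and mean_shift_stat(x_ref, window) > threshold:
            return t
    return None
```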
Okay, a final way of windowing the data is these adaptive windows, and they introduce additional complexities. They typically work by accumulating some notion of outlyingness, at least the approaches for which the post-change distribution is unknown and which operate at fixed cost; this is how they normally work. You score each instance independently as it arrives, where it receives a negative score if it looks relatively inlying and a positive score if it looks outlying. Prior to a change, if we accumulate these scores we expect a downward trajectory that keeps decreasing; then, if there's a change, the data points start appearing more outlying and the trajectory starts to rise. The challenge is specifying the threshold for how much of a rise we allow before we detect drift, and that's again a very difficult problem: specifying the threshold so that it corresponds to a desired expected run time. We did spend some time looking at this at Seldon, because we thought these adaptive windows were very nice in the way they allow the test window to grow. However, it turns out that this idea of accumulating outlier scores is actually not great for drift detection; it doesn't result in low detection delays. That was a little surprising, because we thought it was quite an intuitive idea, but we're now going to look at why that's the case, because I think it helps illustrate the difference between drift detection and outlier detection.

Suppose our pre-change distribution is this isotropic Gaussian, with the brighter region in the middle representing high probability density and the darker regions lower probability density, and think of these pink points as representing our reference set, or training data points. Then suppose that during deployment we observe data that's much more tightly clustered, say these seven data points here, which we can eyeball and say come from a different, much more tightly clustered distribution: some sort of drift has occurred. The problem with the outlier-based methods is that, because they consider the points in isolation, they treat these points exactly the same as they treat these points, and individually they're probably not data points you'd have looked at and said, oh, there's been drift. By considering them in isolation we're not able to capture that, and that's why looking solely at how outlying data points are is insufficient for drift detection.

Okay, so having ruled out outlier-based test statistics as insufficient, how do we propose that test statistics should be formulated? The ones that work particularly well are those which estimate the distance between the underlying distributions. These distributions are unknown, but we aim to estimate the distance using the data that we observe. One way you might think of doing this is by using the data to model the two distributions and then plugging those models into some notion of distance between distributions; however, with the few data points we normally have in practice, this is an inefficient way to approach the problem. What we instead do is try to estimate some notion of distance between the distributions directly, and one such distance that appears particularly prominently in our library alibi-detect is the maximum mean discrepancy (MMD).
We find it particularly useful for various reasons, so we'll now quickly overview how it works. The reason it's useful is that it transforms the problem of specifying a distance between the underlying distributions into simply specifying a similarity between data points, which is a much simpler, more intuitive problem. Typically we do this by projecting the raw, unstructured, potentially high-dimensional data points onto lower-dimensional, more structured representations using some projection phi, and then evaluating some simple notion of similarity, such as an inner product. This is great because it's possible across all data modalities: for images the projection might be some pre-trained convolutional network, for text some pre-trained transformer, and so on. So this part is particularly straightforward.

The maximum mean discrepancy can then be defined in various ways, but in this context probably most simply as the expected similarity between reference instances, added to the expected similarity between deployment instances, take away two times the expected similarity between reference and deployment instances. This distance can be estimated from the data simply by taking the corresponding average similarities, which is particularly convenient: the distance can be estimated very simply. And we can use this estimate as our test statistic, because it satisfies the properties we want of a test statistic. If the two distributions are the same, then we expect the average similarity between reference instances to be the same as the average similarity between deployment instances, and the same as the average similarity between the two sets, so these terms all cancel out and the test statistic is low. But if there has been some change, then one or more of these terms will differ, they won't cancel out, and the statistic will be large. So this gives us a test statistic with the property of being low under no change and high under change, which is what we want. You'll often see it written like this, which looks a bit more intimidating, but it really is a very simple concept in terms of average similarities.

So if we return to the example which the outlier-based test statistics were insufficient to tackle: here we can see that the deployment instances are much more similar to each other, much closer together, than the reference instances are to each other or than the two sets are to each other, so when we add up these terms they don't cancel out and we get a relatively large test statistic value. That's great, but we then need to determine precisely how extreme that value is in order to associate a p-value with it and decide whether or not to flag drift, and the way we can do that is using the permutation test described before. On the shuffled reference and test windows, the similarities between the shuffled reference instances, between the shuffled deployment instances, and between the two are all the same in expectation, and on average they cancel out, so we get very small values. We therefore expect the MMD estimate to be much higher on the unshuffled version than on the shuffled versions; it ends up in, say, the top 1%, and we can use that to flag drift. And that's pretty much all of the theory I was planning to go through.
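Before leaving the theory, here's what that estimate looks like in code: a minimal NumPy sketch of the average-similarity formula using an RBF kernel as the similarity. This is my own illustration rather than alibi-detect's implementation, which is more careful about things like kernel bandwidth selection and unbiased estimation.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Similarity between every pair of rows in a and b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd_estimate(x_ref, x_test, sigma=1.0):
    """Average ref-ref similarity + average test-test similarity - 2 * average cross similarity."""
    k_rr = rbf_kernel(x_ref, x_ref, sigma).mean()
    k_tt = rbf_kernel(x_test, x_test, sigma).mean()
    k_rt = rbf_kernel(x_ref, x_test, sigma).mean()
    return k_rr + k_tt - 2 * k_rt

rng = np.random.default_rng(0)
x_ref = rng.normal(0, 1, size=(200, 2))        # reference data, spread out
x_test = rng.normal(0, 0.2, size=(50, 2))      # deployment batch, tightly clustered
print(mmd_estimate(x_ref, x_test))             # noticeably above zero under drift
```

This is exactly the sort of statistic you could plug into the permutation test sketched earlier to get a p-value.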
So in future, when you're trying to recall how drift detectors work, I highly recommend you think of this funky-looking fellow with the alibi-detect glasses, and remember that there are three main components that define a drift detector. There's the strategy for collecting the deployment instances into test windows; a test statistic that compares those test windows to the reference window, or training data, which might involve specifying some projection as described before; and additionally some strategy for setting the thresholds which those test statistics must remain below in order for drift not to be detected. It's important that these three components work together in a manner that allows the expected run time to be specified, so that when drift occurs we know precisely how significant it is and can take appropriate action. That's the dial we can tune, and as we do, under the hood the expected detection delay will vary, but that isn't what we directly manipulate; it's just a downstream effect. As I say, we've got various drift detectors implemented in our alibi-detect library, and as I've got time, I'll give you a quick demonstration of how the library can be used.

So, hopefully this will work. I've put together a pretty short Jupyter notebook to demonstrate how the library can be used. I won't have time to go through all of the code line by line, but I'll try to pick out the most relevant points. There's a load of imports here which aren't particularly important, but for the data we're going to return to the Camelyon17 medical imaging dataset described earlier, where we wanted to train a model to classify tissue scans as benign or cancerous, but during deployment we start getting instances from an alternative hospital with a slightly different distribution. This dataset is particularly convenient to use because of the WILDS Python library, put together by a team from Stanford, which allows you to load and manipulate the data particularly easily using its get-dataset functions. I just create a wrapper around this to convert the data into a stream, in order to simulate a live deployment environment. We then define a reference set of n equals 2,500 instances, and we also define two streams: one representing no change, so instances from the same distribution as the training data, and one stream from the drifted hospital, where there has been a change. This simplifies the problem slightly, in that it puts the change point at zero, but considering these streams and looking at how long the detectors run on each of them will let us investigate the average run time and the average detection delay, and see how they compare to the expected run time we desired. I'll quickly plot some instances from the no-change stream here; they look like this.

Now we define the drift detector. We're going to use a strategy of overlapping windows, and as the test statistic an estimator of the MMD that can be updated incrementally and is therefore suitable for the overlapping-window context. As I mentioned before, because these images are quite high-dimensional and unstructured, we first perform a projection down onto a lower-dimensional, more structured representation, in order for the kernel to be more meaningful and the MMD to be a useful test statistic.
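Before walking through those pieces, here's the shape of where we're heading, condensed into a sketch. The variable names and hyperparameters are illustrative (the encoder, reference set and stream are assumed to exist already, and you should check the alibi-detect docs for the exact signature of MMDDriftOnline), but the overall pattern is: wrap the trained encoder as a numpy-to-numpy preprocessing function, hand it to the online MMD detector along with the reference data, the desired expected run time and the window size, then feed instances in one at a time.

```python
import numpy as np
import torch
from alibi_detect.cd import MMDDriftOnline

# `encoder` is assumed to be the already-trained encoder half of the autoencoder.
def preprocess_fn(x: np.ndarray) -> np.ndarray:
    """Map raw images (numpy) to lower-dimensional representations (numpy)."""
    with torch.no_grad():
        z = encoder(torch.as_tensor(x))
    return z.cpu().numpy()

detector = MMDDriftOnline(
    x_ref,                   # reference instances, held out from encoder training
    ert=150,                 # desired expected run time under no change
    window_size=10,          # size of the overlapping test window (illustrative)
    backend='pytorch',
    preprocess_fn=preprocess_fn,
    n_bootstraps=2500,       # simulations used to configure the time-varying thresholds
)

# Simulated deployment: feed instances in one at a time until drift is flagged.
for t, x_t in enumerate(stream):
    pred = detector.predict(x_t)
    if pred['data']['is_drift']:
        print(f"Drift detected at time step {t}")
        break
```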
To do this we train an autoencoder that passes the images through a lower-dimensional representation such that they can be reconstructed, so that the lower-dimensional representation captures the structure of the images. Here is a convolutional autoencoder defined in PyTorch; you can see it being defined as a sequential PyTorch model, and we just pass the reference data to a PyTorch DataLoader. alibi-detect provides a function that lets you train these components particularly easily, so we train the autoencoder, and you can see the loss coming down here, such that it's now been trained. We then define a function that uses the autoencoder's encoder to map the images: alibi-detect expects functions mapping numpy arrays to numpy arrays, so we just wrap the encoder in a function that does a bit of conversion for us.

Having defined the projection we're going to use for our kernel, defining the detector is actually very straightforward. We simply specify the expected run time we desire, the size of the test windows we'd like to use, and the number of simulations we'd like to use to configure the thresholds, and then we import from alibi-detect what's called the online MMD drift detector: overlapping windows using the MMD. We can then look at the time at which the detector flags drift on the no-change stream, which should correspond to the expected run time. I'm only averaging over 15 runs here, so it's a very noisy estimate and it's not that close to the desired expected run time of 150, but if we averaged over more runs we would gradually converge on 150. This is the false positive behaviour: on average, in this small sample, it raises a false detection every 122 time steps or so. We can also look at the detector's test statistics and thresholds and see what was happening under the hood: the test statistics fluctuate around zero, but eventually there's a fluctuation large enough to send one over the threshold, and that happened at time step 163, as you can see here.

But when we apply the detector to the stream on which there has been a change, the detector then only runs for 14 time steps on average; this is an estimate of the expected detection delay. This is great, because it shows that the drift detector is working: it flags drift much more quickly when there has been a change than when there hasn't, which is obviously the property we'd like. In this case you can see that the test statistics started rising very quickly from the start and quickly exceeded the thresholds that the detector had calibrated. So that's pretty much all there is to it; I hope that was useful. I'll return to the slides. If you found this interesting, do check out our open-source Python library, which you can find here, and you can also follow us on Twitter, an account we set up this week, if you'd like to see updates.

Okay, it looks like there's been a problem. Okay, so it looks like I broke up a bit there. Where did I get to? If someone could help me out a little... I think I was pretty much wrapped up anyway. You can follow us on Twitter if you'd like to receive updates on our research and our Python libraries; personally I'm ocselden.io. I hope there wasn't too much of a disconnect there, but that's about all there is to it. Thank you for listening.

Thank you so much for that, Oliver. I was definitely thrown back to my stats A-level for a moment there. We do have a couple of questions from the chat, and we have just a minute to go through them. The first, and probably quickest to answer, is: is the alibi-detect example Jupyter notebook available somewhere for people to download?
I think I was talking about this when I dropped out. I'll probably tweet it from our Twitter account later, so people can work through it more slowly if they'd like. So yes, do give us a follow and I'll get that tweeted out.

All right, and a question from earlier, which was: given, as you said, that after deployment we don't often have access to the true y, can these methods only detect covariate drift? And, related to that, you also said that if only P(X) changes then the original model is still valid.

Yes, that's a good question. If the drift is purely concept drift, where only the relationship between the targets and the features has changed, then sure, we can't detect that by looking solely at the features. But normally, when a change occurs, more than one component changes, and therefore you're able to detect that a change has happened in general. And yes, the model is potentially still valid if only the feature distribution has changed, but, as I said, the performance might still have degraded, and it's important to be aware of that, because the model may no longer be suitable for the problem you're trying to solve. So although it may still have the best decision boundary given the context, it may no longer be suitable to deploy, for example. But yes, that's a good question.

All right, we've got a couple more coming in from the chat, and we are about to run into lunch. For your windowing strategies, have you tried using time-lagged approaches where the window size is ever-increasing?

Yes, so that sounds pretty similar to the adaptive windows I was describing; I'm not sure if they meant something slightly different. Those do allow the window to grow, and that's great, but it means the cost of operating these detectors can also grow, which isn't really something you want, and it also makes setting the thresholds much, much trickier. I was talking a lot about targeting an expected run time, and that's tricky to do using windows that grow. So there's that compromise of using overlapping windows of a fixed size, so that the desired run time can be targeted. There are pros and cons all over the place for these strategies.

It's always the way, isn't it? And one final question from the chat, which is: would you see an issue in using an embedding model for the distance which is very close to the model used for the predictions, for example using a base BERT for the embedding in the case of a fine-tuned BERT for the predictions?

Yes, so this is a good point: you shouldn't use for the projection a component that's been trained on the same reference instances that you give to the detector. If you do, the component sort of hugs those instances and puts them into a different region than it does unseen instances, and that causes a discrepancy which isn't due to a change in the underlying distribution; it's just because they've been seen before. So it is important that you don't train the component using the same data, and you'll see in the notebook that I actually split off a portion of the data to train the projection, separate from the split of the data given to the detector. That's a very good point and an important consideration to make.

I have a follow-up question about that: is there any rhyme or reason to how you would split the data set for those two partitions?

Yes, so again it's a compromise. The more data used to fit the projection, the more meaningful that projection will potentially be, and the better the test statistic will be.
But then the less data you have left for the detector, and therefore the less statistical power you have with which to detect drift. So there is that compromise, and I'd recommend using the smallest amount of data that you believe is sufficient to train whatever component you're trying to train. That does potentially require a bit of domain knowledge, but it's not something I'd worry too much about in terms of the exact split point.

All right, thank you so much for that answer. I don't have any more questions coming in from the chat, so I'll just say thank you very much for this talk, it was really, really interesting, and thank you so much for giving your time. Great, thanks a lot for having me. Enjoy the rest of the conference, everyone.