All right, I am back again. Next up, we have a talk from Alejandro, who's going to be talking to us about production machine learning monitoring, essentially how we deal with outliers, drift, and the overall statistical performance of the model. Great to have you here, Alejandro. A bit about Alejandro: he is the Chief Scientist at the Institute for Ethical AI and Machine Learning, and he is also the Director of Machine Learning at Seldon Technologies. We're really psyched to have you here, Alejandro. The stage is all yours.

Awesome, thank you very much. Today we're going to be delving into a really interesting topic: production machine learning, in the context of monitoring production machine learning systems. There's quite a lot of content to cover, so we're going to go through the areas at a fairly high level, but through this link, which will be available throughout the presentation, you can access the blog post that contains the code examples, the open-source frameworks, and the references, so you'll be able to test everything yourself in your own spare time.

A bit about myself: my name is Alejandro. I'm the Engineering Director at Seldon Technologies and Chief Scientist at the Institute for Ethical AI and Machine Learning. I'm also a member-at-large on the governing Council of the ACM. Seldon itself is an open-core framework for machine learning deployment, management, monitoring, and explainability; we have a broad range of open-source technologies as well as an enterprise product that builds on top of them. The Institute is a volunteer-led organization and research center based in the UK that focuses on developing resources that ensure the responsible development and operation of AI systems.

Now, let's dive into the topic itself. What we're going to cover today is the motivation for this work, some principles that you can adopt, some patterns that abstract best practices, and some hands-on examples. One thing to mention is that we're going to explore this not just from the perspective of how we can deploy and monitor a single model, but how we can enable this monitoring, and specifically advanced monitoring, at scale, with hundreds if not thousands of machine learning models. With that, as I mentioned, you can find the blog post with all of the resources, so please do check it out. If you have any feedback, the projects are open source, so you're able to contribute insights directly there.

Now, let's set the scene. We're all aware that production machine learning systems are hard. There's a lot of specialized hardware involved, there are complex dependency graphs, and there are compliance requirements that go beyond what software environments already require, now with the nuances of machine learning as well as the reproducibility requirements you need to have in place. From that perspective, it's important to consider not just how you can develop and deploy those machine learning models to have them in production, but how you can take a proactive, as opposed to reactive, stance on the performance of the model and address some of the issues that may arise, such as performance drift as well as changes in the distribution of the data that your production models see. The key context to remember throughout is the increasingly popular view that the lifecycle of the model actually begins once it's deployed, right?
It doesn't end when you finish training it; it's when you put it in production and it starts being consumed that the lifecycle really begins, and you need to make sure that you have the infrastructure and capabilities in place to meet the robustness requirements from a more proactive perspective: the ability to know when you need to retrain the model, the ability to know when you have to identify and deep-dive into a specific prediction, as well as the general health of the models themselves.

So what we're going to do today is see how this applies to a model specifically. We're going to deploy a single model as a microservice, and we're then going to see how this deployed model can expose metrics that can be used for performance, for statistical performance, for outlier detection, for drift detection, and for explainability.

We're going to take the hello world of machine learning, the CIFAR-10 classification task. This is basically a model that takes an image as input and predicts one of 10 classes. In this case, it would take the image of a truck and predict the last class, which is truck. That's basically what our model is going to do. Given that we're focusing on the production lifecycle of the model, we're not going to train a model from scratch, but the code is available if you want to do it; it takes a couple of hours. Instead, we're just going to fetch this ResNet32 CIFAR-10 model. It's basically a TensorFlow-based model that we're going to be able to leverage with Python. We run it with an image as input, and the output is going to be an array showing us the class of the image, which in this case is a truck, right?

The way that we're now going to productionize it is using this framework called Seldon Core. Seldon Core allows you to go from an artifact, so what we just saw, a model artifact, an exported binary like a pickled binary, or a Python, Java, or R code wrapper, into a fully fledged microservice that can be deployed in Kubernetes. It gives you a set of building blocks that allow you to connect components into complex structures, like having multi-armed bandits, outlier detectors, and feature transformations as separate reusable components that are themselves services, right? The way this is done is basically by taking that artifact that we had, or creating a class wrapper, which in this case is just a simple Python class with a predict method. Once we convert it into a full container using the Source-to-Image (s2i) command-line interface, we basically have a container that we can deploy and that will expose REST, gRPC, and Kafka endpoints that allow us to perform inference on anything we send. Whatever goes through this predict method is passed through the model and then returned to us, right? Very intuitive.

Now, the way that we're able to leverage Seldon Core is through the Kubernetes way, right? Defining your configuration schemas, or your manifests. In this case, we're able to leverage some of the pre-packaged model servers. It could be TensorFlow Serving, it could be Triton, it could be a scikit-learn model server; in this case we're using the TensorFlow server and we're providing that binary as a Google Cloud Storage bucket, right?
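To make the class-wrapper idea mentioned above a bit more concrete for readers following along, here is a minimal sketch of what such a Seldon Core Python wrapper could look like, assuming the ResNet32 CIFAR-10 model has been exported locally as a Keras SavedModel. The class name, model path, and preprocessing are illustrative assumptions, not the exact code from the blog post:

```python
# Model.py: a minimal Seldon Core Python wrapper (sketch).
import numpy as np
import tensorflow as tf


class Model:
    def __init__(self):
        # Assumed local path to the exported ResNet32 CIFAR-10 classifier.
        self._model = tf.keras.models.load_model("./cifar10_resnet32")

    def predict(self, X, features_names=None):
        # Seldon passes the request payload as an array; for CIFAR-10 that is
        # a batch of 32x32x3 images. The returned array of class probabilities
        # is what the microservice sends back to the caller.
        X = np.asarray(X, dtype=np.float32)
        return self._model.predict(X)
```

A class like this is what the s2i build would wrap into a container, and the resulting image exposes the predict method over the REST and gRPC endpoints without any further server code.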
So what we ended up doing was just exporting the artifact and uploading it into a Google Cloud Storage bucket, so that we can provide it in this declarative way to Seldon. Seldon is then able to take this and create the relevant containers, services, networking, et cetera, so this basically becomes a fully fledged microservice. Now we're able to take the same inputs that we were running against our local model and run them as a REST request, right? We send a request to the predict endpoint containing that input image, and our output is basically that class, which is a truck, right?

So we've deployed the model, but again, this is where the talk really starts, because the point of the talk is deploying the model and then monitoring it. What we're now going to start looking at is the monitoring pieces, and if you're interested in diving deeper, you'll be able to find a lot of relevant examples in the open-source Seldon Core repo. So please go ahead, and if you find an issue, you know, open an issue; that's the beauty of open source.

Let's have a look at the anatomy of production machine learning systems, because the lifecycle of a model goes from data to experimentation to deployment to monitoring, right? The way we can abstract it is by seeing that a data scientist, either through their Jupyter notebook or through an ETL pipeline, will be converting training data into trained model artifacts. That could be either the binary that we saw before or the container that we saw before. The next step is to productionize it, either through the manual CLI or by introducing a CI pipeline that automatically performs that deployment, which gives us those deployed services that ultimately have inference data passing through them, right? Any new data that you want to consume, you pass through the models.

But where are the metrics, right? The four key areas that we're going to delve into today (there are a few more, but these are the main ones) are: performance metrics and tracing, which are the same things you would want to monitor for any microservice; statistical performance, which covers machine-learning-specific metrics that you would want to monitor; drift detection and outlier detection, for data that you may want to flag as being out of distribution in one way or another; and finally, explainability in the context of production models.

Now, another reminder of the key premise: we're thinking about this not just from the perspective of how we can have monitoring for a single model, but asking what patterns will allow us to perform monitoring at scale, with hundreds of models and perhaps dozens of advanced monitoring components. Because you have to remember that if you deploy a model and then you add a drift detector, even though the drift detector is in a way reducing risk, since you now have a component that can flag issues, you're also introducing another machine learning component that equally requires monitoring in itself. The drift component will also be a service that may fail and may suffer from the same challenges that other machine learning components suffer from.
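Going back to the inference request described at the start of this section, here is a rough sketch of what it could look like in Python, assuming the Seldon v1 prediction protocol and a placeholder ingress host, namespace, and deployment name; adjust all three to your own cluster, as the exact URL and payload format depend on your setup:

```python
import numpy as np
import requests

# Placeholder URL; the real one depends on your ingress, namespace, and deployment name.
URL = ("http://<ingress-host>/seldon/<namespace>/"
       "cifar10-model/api/v1.0/predictions")

# One 32x32x3 image scaled to [0, 1]; random data stands in here for a real truck image.
image = np.random.rand(1, 32, 32, 3).tolist()

# Seldon's v1 protocol accepts the input as an ndarray under the "data" key.
response = requests.post(URL, json={"data": {"ndarray": image}})
probs = response.json()["data"]["ndarray"]

# Index 9 is "truck" in the usual CIFAR-10 label ordering.
print("predicted class:", int(np.argmax(probs)))
```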
So it's something to take into consideration, and we'll address it as we talk through the different patterns at a higher level.

Starting with performance monitoring: this is basically analogous to standard microservice monitoring. It's monitoring the performance of your running service, to identify potential bottlenecks and to debug and diagnose unexpected behavior of that service, and it covers the usual things you would imagine, like CPU percentage, et cetera, which we'll see in a bit. Seen from a more abstract perspective, once you deploy your model, one of the benefits of having an orchestration tool like Seldon Core is that not only does it provide you with the endpoints to consume, but out of the box it exposes metrics that can be scraped by something like Prometheus. So if your model suddenly has, say, a memory leak, you may start to see the memory increase up to the point where it crashes, and then you can go and check why that happened. Similarly, you can look at the performance of the model server itself, given that a lot of the work being done is CPU-intensive, right? If you have worked with Python before, you know of the infamous GIL. From that perspective, it's about how you optimize the internal configuration of your model server: minimizing, in this case, the number of threads, making sure that the number of workers is aligned to the number of cores (or cores times two plus one), and also making sure that you don't set it so high that you end up with CPU throttling. There's quite a lot to consider, but this performance monitoring is the same as, or analogous to, what you would expect to have for microservices. It covers things like latency, requests per second, CPU and memory utilization, number of successful and failed requests, et cetera. These are equally important and shouldn't be left out of your machine learning stack.

The second point is statistical monitoring. For statistical monitoring, we need to consider not just generic microservice metrics, because even though this is still a production software system, we're dealing with something much more nuanced, and the machine learning components will require further metrics that are very specialized. In this case, they may require the correct target labels, right? In your training environment, when you train your model, you have your data and your target labels, and you're able to train a model because you have all of that data there. However, when you deploy a model, you're not going to have those target labels, because it's inference data; it's unseen data points. What that means is that in order to calculate things like accuracy, precision, recall, or other more specific ones like RMSE and several others, you may need what we refer to as stateful metrics: being able to collect annotations that include the correct target label, which may be provided in an asynchronous way. So once you deploy a model and receive some requests, you may need somebody to actually go and annotate those requests.
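As a tiny illustration of the worker-sizing rule of thumb mentioned in the performance-monitoring part above ("cores times two plus one"), here is a sketch; treat the numbers as a starting point to benchmark against, not a fixed recommendation, since heavyweight models often want fewer workers and low thread counts:

```python
import multiprocessing

# Classic heuristic for CPU-bound Python servers: (2 * cores) + 1 workers.
# For heavy ML models you often want fewer workers than this, so always
# validate against your own latency and throughput targets.
cores = multiprocessing.cpu_count()
suggested_workers = (2 * cores) + 1
print(f"cores={cores}, suggested workers={suggested_workers}")
```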
Then you'll be able to start seeing some of those metrics, like accuracy, precision, and recall. Of course, there are others, like KL divergence, as well as some that don't require that feedback or correct label. But from that perspective, it does open up the challenge of how we enable this concept of feedback at scale.

The way that we've been able to address this with Seldon Core is that, similar to how you have a predict endpoint that performs inference on top of your model, we also enable a feedback endpoint. This feedback endpoint uses the unique request ID that you receive when you send an inference request, to allow other individuals to provide the correct labels. So you can do an inference request and get told, okay, this is request ID 55; somebody else later on can say, well, I'm now going to submit the correct label for ID 55, right?

From an architectural perspective, we're now able to abstract the concept of feedback as part of the models: the inference request, as it comes in, gets saved into, in this case, a key-value store, and when the actual feedback is provided, instead of being calculated inside the model microservice itself, the corrected label is forwarded to any other advanced monitoring components that want to do something with it. In this case, we have a component that happens to be listening for any feedback, and it takes that ID, fetches the original inference request, makes the comparison, and updates its current state, so that you're able to expose those specific stateful metrics through this advanced monitoring component, which we call a metric server.

Now, the metrics that are provided wouldn't just be the overall accuracy, because you may want further granularity, so that you can get things like the accuracy per class or per subclass, the precision and recall per subclass, et cetera. From that perspective, you may want to capture the true positives, true negatives, false positives, false negatives, and individual prediction probabilities, as well as other things. What a metric server looks like in practice is basically another abstraction that takes the truth and the response and calculates those particular metrics. From that, you can see the intuition behind abstracting the different servers and services, so that you don't have to add too much to the logic of your model itself and you don't affect its performance.

Now, in regards to outliers and drift, the principles and core concepts are similar. Outlier detection is about identifying potential anomalies in your data, and drift detection is about identifying a divergence in the data that you're seeing, or in the predictions. We're able to use another of our open-source frameworks, called Alibi Detect, which has a broad range of different techniques that you can try out yourself and, more importantly, deploy yourself. In this case, we show how to leverage an outlier detector that we train locally, following the same pattern as we did with the model: you train your model, or in this case your outlier detector, locally, you export it, and you deploy it.
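Going back to the feedback endpoint described above, here is a rough sketch of what sending a corrected label back could look like. The payload shape and the way the original request ID is attached vary across Seldon Core versions, so treat the endpoint path, the header name, and the payload below as assumptions to check against the Seldon Core docs rather than the definitive contract:

```python
import requests

# Placeholder feedback URL for the same deployment as before.
FEEDBACK_URL = ("http://<ingress-host>/seldon/<namespace>/"
                "cifar10-model/api/v1.0/feedback")

# Ground-truth label for the earlier prediction: class 9, i.e. truck.
feedback = {"truth": {"data": {"ndarray": [9]}}}

# Hypothetical: the request ID from the original prediction ("55" in the
# talk's example) is passed back so the metric server can match it up.
headers = {"Seldon-Puid": "55"}

resp = requests.post(FEEDBACK_URL, json=feedback, headers=headers)
print(resp.status_code)
```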
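And for the outlier detector that was just described as being trained locally, exported, and deployed, here is a condensed sketch using Alibi Detect's VAE-based outlier detector on CIFAR-10. The network architecture, threshold, and training settings are illustrative, and the exact API details (for example the saving utilities) can vary across alibi-detect versions:

```python
import tensorflow as tf
from alibi_detect.od import OutlierVAE
from alibi_detect.utils.saving import save_detector

# CIFAR-10 training images, scaled to [0, 1].
(X_train, _), _ = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype("float32") / 255.0

latent_dim = 1024
encoder_net = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(64, 4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(128, 4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(512, 4, strides=2, padding="same", activation="relu"),
])
decoder_net = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(latent_dim,)),
    tf.keras.layers.Dense(4 * 4 * 128),
    tf.keras.layers.Reshape((4, 4, 128)),
    tf.keras.layers.Conv2DTranspose(256, 4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="sigmoid"),
])

# Train locally (illustrative threshold and epoch count), then export the
# artifact so it can be deployed as its own monitoring microservice.
od = OutlierVAE(threshold=0.015, encoder_net=encoder_net,
                decoder_net=decoder_net, latent_dim=latent_dim, samples=2)
od.fit(X_train, epochs=30, verbose=True)

preds = od.predict(X_train[:16])          # dict with per-instance scores and flags
print(preds["data"]["is_outlier"])
save_detector(od, "./cifar10_outlier_detector")
```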
Right. From this perspective, every single request that is sent to your model is forwarded and can be captured by any advanced monitoring component. In this case, it's an outlier detector, which is able to flag whether a request is an outlier or not. And for drift detection we follow the same pattern. In this case it's a Kolmogorov-Smirnov (KS) drift detector: we're able to train it locally and then deploy it in the same manner, so that it follows the same pattern as the outlier detector, except that instead of acting on a single data point, it acts on a window of data. So every, let's say, 100 or 1,000 requests, it would calculate that data drift. That's basically one of the core design patterns of drift detection, and you have to take into consideration things like whether the setting is supervised or unsupervised, whether it's a classification or regression task, as well as other things, but you can try all of this in the hands-on code that we provide as part of that blog post.

Now, the last component is the explainability piece. You may have heard about explainability in the context of Jupyter notebooks, where you perform an explanation locally. But remember that what we're looking at here is how we achieve this at scale: how we enable these human-interpretable insights in a production environment. We're able to leverage the same kind of component; in this case it's our Alibi Explain library, which has a broad range of different techniques, like Kernel SHAP, counterfactuals, anchors, et cetera, all of which can be deployed. The only difference is that, similar to how the model receives a request and returns an inference response, the explainer receives a request, reverse-engineers the model to build an intuition, and then returns the actual explanation. There are some nuances to take into consideration when choosing your explainer, and again, you can delve into more detail in the repo as well as in the example: whether your explanation is local versus global; whether the technique is white-box versus black-box, meaning whether you actually need the model artifact as part of the explainer or whether you can interact with it as a separate service; as well as the data type, whether it's tabular, image, text, et cetera.

Now, as an intuition, this is what it would look like from a microservice perspective. You would be able to say: for my model that predicted that this is a truck, what are the specific reasons why it predicted it was a truck? In this case, using the anchors explainer, it tells you the anchors, the regions of the image, that contributed most to this image being predicted as a truck.

There is a side note about things that we're not going to cover in this talk, like adversarial detectors, which are another important topic and are also covered in the Alibi Detect library. The ensemble patterns that you can leverage are something we're not going to cover either, but you can go into more detail on your own: this is basically the ability to have these architectural patterns build on each other, so a model that has a metric server and a drift detector that acts on top of the metric server instead of just the model. This is basically just emphasizing the power that these architectural patterns provide, not just the abstraction, but also the flexibility that they introduce.
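For readers who want to see the drift pattern described above in code, here is a condensed sketch of a Kolmogorov-Smirnov drift detector with Alibi Detect. In practice the image data is usually first passed through a dimensionality-reduction preprocessing step (for example an untrained autoencoder), which is omitted here for brevity, and the noisy window below just stands in for real drifted inference data:

```python
import numpy as np
import tensorflow as tf
from alibi_detect.cd import KSDrift

# Reference data: a sample of the (scaled) CIFAR-10 training set.
(X_train, _), _ = tf.keras.datasets.cifar10.load_data()
X_ref = X_train[:1000].astype("float32") / 255.0

# Feature-wise two-sample KS test against the reference distribution.
cd = KSDrift(X_ref, p_val=0.05)

# In production this runs on a rolling window of, say, 100 or 1,000 requests;
# here a corrupted batch (added noise) simulates drifted data.
X_window = X_ref[:100] + np.random.normal(0, 0.2, size=(100, 32, 32, 3)).astype("float32")
preds = cd.predict(X_window)
print("drift detected:", bool(preds["data"]["is_drift"]))
```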
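And for the anchors explanation on the truck image described above, here is a rough sketch using Alibi Explain's AnchorImage explainer. The segmentation settings, the local model path, and the use of a local predict function (rather than a thin wrapper around the deployed model's REST endpoint) are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf
from alibi.explainers import AnchorImage

# The explainer only needs a prediction function; this could equally call the
# deployed model's REST endpoint instead of a locally loaded model.
model = tf.keras.models.load_model("./cifar10_resnet32")  # assumed local export
predict_fn = lambda x: model.predict(x)

explainer = AnchorImage(predict_fn, image_shape=(32, 32, 3),
                        segmentation_fn="slic",
                        segmentation_kwargs={"n_segments": 15, "compactness": 20})

image = np.random.rand(32, 32, 3).astype("float32")  # stand-in for a real truck image
explanation = explainer.explain(image, threshold=0.95)

# explanation.anchor holds the superpixels ("anchors") that were most
# responsible for the model predicting this image as a truck.
print(explanation.anchor.shape)
```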
So you need to take all of these things into consideration: alerting, what you want to do when drift is detected, defining the service level objectives or service level indicators, how you feed that into your promotion strategy, and being able to drill down into any of the results that you produce. But again, that would go well beyond the scope of this talk, and the scope of this talk is already very, very large. So I think that's a good point to stop on. Today we covered the motivations for machine learning monitoring, some principles for efficient approaches, and some scalable patterns that you can adopt that actually work at massive scale, as well as the hands-on example, which you can find in the code provided. With that, thank you everybody. If there's time for questions, I'll be able to answer them; if not, we'll cover them offline. So thank you everybody, and I hope you enjoy the conference.

All right, what a lovely talk. I think you were able to simplify a lot of concepts, and if someone like me could understand it, then I'm pretty sure a lot of participants were able to understand it. So again, thank you so much for the lovely talk. I'm right now looking to see if there are any questions. Someone's typing something, so maybe we give it a second. All right, there's one question. I'm just going to copy it and put it up. Oh, I can see it, I can see it. Yeah, I'm just going to put it up on the big screen if... Should I read it out loud? Yeah, so it is: what would you do if a model starts to drift but the model takes days to retrain? Would you keep the existing one in production until the new one is retrained?

Oh, I mean, that's a great question, and that's very much a corner... well, I wouldn't say a corner case, it's probably something super common as well. Unfortunately, when there's a time constraint, there's a time constraint, right? If it takes days to retrain and you've found out that there's drift, then there would be an assessment of what that drift really means. That's actually something our data science team is working on: the idea that perhaps not all drift is bad, the question of which drift is the kind that would have certain implications, and the interpretability of the drift that is flagged. But in your context, there's a lot to consider; it's basically what the risk is of keeping the model running, based on what that drift means in particular. So yeah, not a very black-and-white answer, but fortunately some of these libraries and some of these patterns should at least provide you with a solid foundation to tackle those challenges.

All right, I hope that answers the question of the person who asked it. And someone said that this is the best talk of the conference, so congratulations on that. Well, I really appreciate it. These are the best attendees of the conference. Well, I don't know. Fantastic. So in that case, if anyone has any questions, they can probably find you in the breakout room of Barrett, and thanks again for your talk. Thank you very much. All right.