I'm Fabio Grez, and I lead the machine learning operations team at Merantix Momentum, a machine learning solutions provider based in Berlin, Germany. My team provides an internal developer platform with opinionated services around the entire lifecycle of a machine learning model, from data annotation to continuous model training, model deployment, and monitoring in production. Before that I worked as a machine learning engineer on natural language processing and computer vision projects, and before that I did a PhD in theoretical astrophysics.

Today I want to talk about why so many ML models fail in production and how MLOps tries to solve this. I will talk about maturity models that assess the degree of automation in MLOps systems, starting with a very manual process and then looking at systems with gradually more automation for releasing new models into production. And then I will explain how we at Merantix Momentum built such automated continuous delivery for ML production systems using open source software.

So why is it that so many ML models fail in production? I want to argue that classical software typically gets better over time, because the engineering teams behind the software address bugs and release new versions with fewer of them, but that ML systems, in contrast, often die catastrophically. The reason is that models typically don't work well when the data distributions they are applied to at inference time differ too much from the training distributions. When the distributions in the real world change over time, the models can't handle that and their performance decays, which is known as concept drift.

Let's take conversion prediction in an e-commerce platform as an example. Say the task is to predict whether a customer will buy something when you show them an advertisement or a promotion for a specific product, in order to assess whether it's worth doing that. Let's say we have such a model, we run it in our cloud system, and the webshop uses it to decide whom to show these advertisements to. The site reliability engineer responsible for running the model would worry about things like latency, availability, horizontal scaling, and so on. But the data scientist maintaining the model has to worry about things beyond that, things that change in the real world. For example, when the economic situation changes, or when a new product is introduced to the product category we're looking at, customer purchase behavior changes, the model can no longer predict it properly, and it starts to produce worse predictions. And while you would never consider this a software change or an update, it is a breaking change, and the ML engineers need to be wary of that.

If you combine this decay that naturally occurs over time with the fact that most ML systems have pretty long release cycles, because there's a lot of manual one-off work involved and they often lack automation, then you are in a situation where your model suddenly doesn't work anymore but you can't quickly update it, and the system dies and doesn't bring any value anymore. I would argue that this is what very often happens, and why ML systems so often stop working properly. MLOps tries to solve this problem, and I want to talk about how it does that now. First of all, let's look at what MLOps is.
I like the definition presented by Kreuzberger et al. in their 2022 article "Machine Learning Operations (MLOps): Overview, Definition, Architecture". They did expert interviews and literature research to distill the standard practices in MLOps and to define the term, and they find that it is a paradigm, a development culture, and an engineering practice that aims at productionizing machine learning systems and bridging the gap between the ML development world and the operations world, very much the same thing that DevOps tries to do for classical software engineering. They explain that MLOps leverages technologies and methods like CI/CD, automation, and workflow orchestration, and that it aims for reproducibility by versioning not only the code but also the configuration of the training code, the resulting model, and the data the model was trained on. MLOps uses continuous training and evaluation to always have an up-to-date model version in production, it uses metadata tracking and logging so you can look up which experiments were run in the past, and it uses continuous monitoring in production to detect things like model decay. And when you notice these things, you close the feedback loop: you retrain your model and redeploy that version, so that you always have an up-to-date model running.

I will now talk about a maturity model that you can use to assess how mature your setup is in terms of how quickly you can bring a new model version into production. There are different models that try to accomplish this; I'm linking one from the Google Cloud Architecture Center here and another one from Azure. The models differ a little bit in the details, for example the one from Azure defines five levels of maturity and the one from Google defines three, but they are very similar, and the diagrams that I will show today are taken from the Google model. In that model there are three levels: a first level with basically only manual processes to bring a new version into production; then a level where the individual steps in the training pipeline, like data preparation, training, evaluation, and deployment, are automated, so you have a system that can regularly retrain and deploy models, but updating the system itself is still manual; and then level two, where even the updating of the training system is automated, and even the triggering of retraining of a new model version is automated. I will show all of that in detail.

So let's start with level zero, the manual process, and let's take again the conversion prediction example that I introduced in the beginning. Let's say the organization decides it wants to predict which customers to send promotions to and hires a group of data scientists to build such a model. The data scientists would typically take existing customer interaction data, do an exploratory data analysis, prepare an initial dataset, and build and evaluate an initial baseline model, for example using a set of Jupyter notebooks. After the evaluation they go back, change the dataset a little bit by including more features, train a different kind of model, and evaluate it again. They iterate like this until they have a model that solves the task.
They then upload that trained model into something called a model registry, which in the very simplest case could just be an S3 bucket, and then they ping a software engineer, for example in another team, to take that model artifact from the model registry and deploy it as a prediction service in some cloud system.

For a first prototype it's absolutely valid to do it this way, but when you want to continuously use such a system in production, there are multiple things here that could be done better. The first one is that the data scientists work siloed from the engineer who puts the model into production. If the data scientists train a new model and then ping this engineer to deploy it, maybe two days pass because the engineer is busy with something else, and that makes the release cycle slower. Now let's say the main data scientist developing this model leaves the company; there's no experiment tracking in place, so it's not a given that the data scientist replacing the first one is even able to reproduce the existing results, because there's no place to look up which experiments have been performed. That is also problematic. Another problem is that we only deploy a prediction service and have no way of monitoring what it's doing. We wouldn't even know if the model is decaying, and we do need monitoring in place to be able to detect that.

Now, to make the process more mature, we first have to automate these steps here, data preparation, model training, and evaluation, into one pipeline that just does all of that automatically. We also have to look at how the deployment process, in this case not of the model but of the training pipeline itself, can be automated, so that we don't have so many manual steps and this throw-over-the-wall to a different team every time we want to put a new model into production.

So we will end up with something like this. We will look at all the components in detail, so please bear with me, there is no need to understand everything at this point. We will look at how the data scientists perform experiments locally on their developer machines, how they get new versions of the workflows they developed into a staging or production environment, and what the training pipeline running in the cloud does in each of its steps and how you could implement them. We will look at how the model is then deployed in the cloud system, how it can be monitored, and how you can do the retriggering. In particular, we will also look at which open source technologies you could use to implement these things, because I always like to show this machine learning, AI and data landscape that you find when googling for which tools to use for your project. The point I want to make is that there's just a huge variety of tools, which makes it very difficult for practitioners to decide what to pick. Of course I don't have an opinion on most of these, but I do want to share with you which open source tools we use in production systems and which tools work well for us.

First of all, the things that I'm showing in this diagram need a lot of infrastructure to run on, and in my opinion Kubernetes is the obvious choice for that.
New tools for machine learning on Kubernetes come out every week. It's really a great ecosystem, and tools like Helm, the package manager for Kubernetes, make it very easy to install these things in a cluster. I think at this point in time it's the natural choice to orchestrate all these different components. I also want to recommend that you use something like GitOps to install things into your cluster: instead of just using kubectl to install something, you commit a manifest that describes your infrastructure into your git repository, and then a tool like Flux synchronizes the desired state in that git repository to the cluster and installs these things there. But that is just a side note.

Now let's look at how steps like data preparation, model training, and model evaluation can be orchestrated and automated into a workflow. There are of course tools to run Jupyter notebooks in production, and some organizations do that. I would argue against it, because notebooks are hard to version and hard to review in a PR. In my opinion they're great for quick prototyping, but once you have something that you like, you should turn the Jupyter notebook into a proper Python module, write unit tests for it, version it, and review it in a PR; I think that works much better than running Jupyter notebooks in production.

There is a range of tools that do workflow orchestration for machine learning; very popular ones are Kubeflow, Prefect, and Airflow, the last of which is not specific to machine learning. The one we decided on when we compared all of these is Flyte. I recommend you take a look at it, it's pretty cool. You can specify your workflow in Python and not in YAML, which is great for machine learning engineers and data scientists.

In this example I want to show you one feature that I really like about Flyte. When we develop deep learning models in particular, we typically want different resources for the different tasks in our training workflow. In a Flyte workflow you have so-called tasks and workflows; these are decorators that you decorate Python functions with. In this example I have a preprocessing task and a training task, and then I have a pipeline that first runs the preprocessing task and feeds its output into the training task. In my scenario I want to run the preprocessing of the dataset in a temporary Spark cluster, and Flyte will, for the duration of the preprocessing task, provision an ephemeral Spark cluster on Kubernetes using the Spark-on-Kubernetes operator. Right in the task definition, in Python, you can specify things like how much memory the driver should have and how many executors you want. It's very easy for ML engineers and data scientists to configure, because they can do it in Python and don't need to write YAML.
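As a rough sketch of what such a workflow can look like, assuming the flytekit API and its Spark plugin (flytekitplugins-spark); the Spark settings, resource values, and function bodies are placeholders:

```python
import pandas as pd
from flytekit import task, workflow, Resources
from flytekitplugins.spark import Spark  # requires the flytekitplugins-spark package


# Preprocessing runs in an ephemeral Spark cluster that Flyte provisions
# just for the duration of this task.
@task(
    task_config=Spark(
        spark_conf={
            "spark.driver.memory": "4g",        # placeholder values
            "spark.executor.memory": "8g",
            "spark.executor.instances": "4",
        }
    )
)
def preprocess(raw_data_path: str) -> pd.DataFrame:
    # In reality: read raw interaction data with Spark and engineer features.
    return pd.DataFrame({"feature": [0.1, 0.2], "converted": [0, 1]})


# Training runs in a normal pod with its own resource requests.
@task(requests=Resources(cpu="4", mem="16Gi", gpu="1"))
def train(features: pd.DataFrame) -> float:
    # In reality: train the conversion prediction model and return a validation metric.
    return float(len(features))


@workflow
def pipeline(raw_data_path: str) -> float:
    features = preprocess(raw_data_path=raw_data_path)
    return train(features=features)
```

Because Flyte turns each task into its own Kubernetes workload (here an ephemeral Spark application for preprocessing and a GPU pod for training), the two steps can request completely different resources within one pipeline.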
You can also distribute the training in a very easy way using Kubeflow's PyTorch training operator, and with Flyte it's very nice to set up: the only thing you need to do is specify `task_config=PyTorch(...)` in the task decorator, say how many worker pods you want and how many GPUs per worker, and then you get a torch distributed training session in the context of this training task. So you can first run data preprocessing in the temporary Spark cluster and then start, say, 20 workers with one GPU each to distribute the training, all in a single workflow and a few lines of Python code. I think that's a really cool feature.

When you want to run this workflow, you use a CLI called pyflyte, for example `pyflyte run workflow.py pipeline`, assuming that's the name of the script and the workflow. When you do that without the `--remote` argument, it just runs the Python script locally; if you do specify `--remote`, it will run it in the Kubernetes cluster for you.

Flyte also has a type system. That means the preprocessing task can, for example, return a pandas DataFrame and then pass that DataFrame to the training task. This looks like a plain Python script, so you might wonder why that is remarkable. But please remember that in the end this runs in a distributed system on Kubernetes, in different pods, maybe even on different machines. You can still return a pandas DataFrame here and pass it to the next task; several of the other frameworks I showed can't do that, and Flyte can, which is really cool.

It also has a caching mechanism. Let's say you work on the training task and iterate on it; you might not want to run the preprocessing every time you start the pipeline to debug it. With one line of code you can configure Flyte to cache the results of a task for as long as its input arguments don't change. Flyte versions the tasks and the workflows and also the artifacts produced by the tasks, which is really great for reproducibility: one year later you can go back and run the versions that you tagged last year, for example. The last point I want to mention is that it's battle-tested at massive scale. This is not some brand-new project; it was originally developed at Lyft and is used by companies like Spotify, and there are thousands of workflows running on Flyte, so it's a pretty robust system.

Okay, so now I want to talk about how our ML engineers use Flyte to do experiments locally on their dev machines, and how they get them into development, staging, or production environments in the cloud through CI/CD. The first thing they do is commit the workflow to a source code repository, which triggers a CI/CD pipeline that tests the workflow and the individual tasks and then registers the workflow version with the Flyte server. In development it looks like this: on their dev machine they check out a new feature branch and maybe change the way the model is trained. Then they run `pyflyte run workflow.py pipeline` to execute it locally, or, if they need more resources, they add `--remote` to run it in the cluster.
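Here is a hedged sketch of what the distributed training and caching features described above can look like, assuming flytekit with the flytekitplugins-kfpytorch plugin and the Kubeflow training operator installed in the cluster; the worker count, resources, and function bodies are placeholders:

```python
import pandas as pd
from flytekit import task, Resources
from flytekitplugins.kfpytorch import PyTorch  # Kubeflow PyTorch training operator plugin


# Cache the (expensive) preprocessing result as long as the inputs don't change.
@task(cache=True, cache_version="1.0")
def preprocess(raw_data_path: str) -> pd.DataFrame:
    # In reality: feature engineering; here just a stand-in frame.
    return pd.DataFrame({"feature": [0.1, 0.2], "converted": [0, 1]})


# Run training as a torch.distributed job with 20 worker pods, one GPU each.
@task(
    task_config=PyTorch(num_workers=20),
    requests=Resources(cpu="4", mem="16Gi", gpu="1"),
)
def train(features: pd.DataFrame) -> float:
    # Inside this task a torch.distributed process group is set up by the operator,
    # so a standard DistributedDataParallel training loop can run here.
    return 0.0  # placeholder validation metric
```

Run locally, `pyflyte run workflow.py pipeline` executes these tasks as plain Python; with `pyflyte run --remote workflow.py pipeline` they are submitted to the cluster instead.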
If they run it this way, Flyte runs the workflow in the so-called development domain. When they have a workflow that they are happy with, they make a pull request and another engineer reviews the change to the code. The CI/CD pipeline that is triggered by the pull request builds the Docker images that are used to execute these tasks on Kubernetes and that contain the requirements needed to train the model. We definitely run unit tests for the logic in the different tasks, preprocessing, training, validation, and so on, and maybe we run an end-to-end test of the workflow on a small dataset. Then we do the very same thing that we do in local development: we register or run the workflow using the image that we built, but now we also specify the domain argument, and depending on which CI/CD pipeline we're currently in, we register the workflow in a different domain. On a feature branch we register the workflow in the development domain, which is used for quick prototyping. When the workflow, after PR review, is merged into main, which we consider staging, we register it in the staging domain. And when we decide that this is something we actually want to put in production, we create a git tag or release on GitHub, which triggers a CI/CD pipeline that registers the workflow in the production domain. So we have a very clear separation between which workflow is just a quick experiment to test a new model configuration and which is the version of the workflow that actually runs in production.

Okay, so now we've talked about how our engineers develop locally and how they use a source code repository and CI/CD to register new workflow versions that can then run in the cloud. Now I want to look at what happens inside such an automated ML training pipeline, and I want to focus in particular on the data ingestion and preparation tasks, because here one box was added compared to the level zero manual process that we didn't have before: data validation. When you build a system that continuously and automatically trains and deploys new model versions, you want to make sure that the input data used to train your model is actually what you expect it to be, and that it doesn't change over time without you noticing, leaving your training silently broken. There are two types of changes you want to check for: so-called schema skews, which would be, for example, that a column is missing from your dataset or that the type of a column changes, and so-called value skews, for example that the distribution of the values in a column changes.

There are different tools for automated data validation; I recommend taking a look at Great Expectations and Pandera. In Great Expectations, for example, there is something called an expectation suite, which is a collection of expectations; you can consider them unit tests for your data. An expectation could be that you expect the percentage of null values in a column to be no larger than 5%, for example. Then, if your data changes and all of a sudden you find 80% null values in that column, you can be notified about it, because that might really influence what your model is doing.
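As a rough illustration, assuming the classic pandas-backed Great Expectations Dataset API (newer releases structure this differently); the column names and thresholds are made up for the conversion prediction example:

```python
import great_expectations as ge
import pandas as pd

# Pretend this is one batch of training data for the conversion model.
df = pd.DataFrame(
    {
        "interaction_type": ["click", "add_to_basket", "click", None],
        "price": [19.99, 5.49, 12.00, 7.30],
    }
)

batch = ge.from_pandas(df)

# "Unit tests for the data":
# at most 5% of interaction_type values may be null ...
null_check = batch.expect_column_values_to_not_be_null("interaction_type", mostly=0.95)

# ... and only known interaction types may appear.
set_check = batch.expect_column_values_to_be_in_set(
    "interaction_type", ["click", "add_to_basket", "purchase"]
)

# Each result reports whether the expectation passed; a failing validation
# in an automated training run can then trigger a notification.
print(null_check.success, set_check.success)
```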
For other checks, let's take another look at our e-commerce conversion prediction example. Say you have one column that contains the type of interaction the customer had on the website; maybe they put something into the shopping basket, maybe they clicked on a picture. You could, for example, have an expectation that asserts that all the different types of interactions are present in your dataset, and then if for whatever reason one of them is suddenly missing, you get a notification about that. Great Expectations comes with a dashboard that visualizes your validation runs, which is very handy, and we configured Great Expectations to notify us on Slack when a validation in an automated training run fails. We then click on a link to that dashboard, it's called Data Docs, and can very quickly see which of the expectations failed, and find, for example, that all of a sudden one of the columns has a much larger percentage of null values than we expected. Then a human can take a look at why that happened and how the training code maybe needs to be adapted.

I quickly want to mention data extraction and data preparation, but I don't want to go into a lot of detail, because there is a lot of tooling around it, I'm pretty sure practitioners have their own opinions about which tools to use for that, and I'm more focused on the tooling that goes around the model training itself. But I do want to make a little bit of shameless self-advertisement for a library that we built and open-sourced called Squirrel Core. Squirrel Core solves the problem that we didn't find a good open source library to load a lot of data for deep learning model training from a storage bucket in a cloud-agnostic and fast way, so we built Squirrel to do that. With Squirrel you can, for example, preprocess a dataset and store it as shards in S3 or GCS, and then, with a few lines of Python code, download these shards during your deep learning training, keep a shuffle buffer for the shards, apply runtime transformations, and fetch the shards in the background even before your model actually needs them, so that you don't have to wait with your next forward pass until the data has been downloaded, because that makes the training slow and the GPU idle. We want to keep the GPUs busy and download the data in the background. It has features for distributed training, so that when you train on many nodes every node can easily get a different subset of the data, which is really cool. It's also really fast: if you compare the number of samples that Squirrel can fetch from a storage bucket for the WikiText-103 dataset, you can see that Squirrel is several times faster than other comparable libraries with the same scope. There is also a data mesh feature, which means that teams in your organization can publish datasets in a decentralized way and other teams can consume them without one central location storing all the data. That's also a very cool feature. If these are problems that you have, feel free to take a look; it's fully open source.

I don't want to talk about model training and model evaluation themselves; I want to focus on everything that needs to be automated around them. But I do want to make the point that there should be something like a decision gate in your automated pipeline.
Because obviously you don't want to deploy every model to production. There needs to be a decision gate, in the simplest case just an if statement that checks whether a certain metric has been reached, and that stops the pipeline if the model isn't good enough.

What I do want to talk about in a bit more detail is model registries and ML metadata stores. For reproducibility it's very important that you log your experiments: log all the pipeline runs, log which version of the code was used to train a model, log the hyperparameters and the metrics that came out of them, and ultimately also log the model. MLflow is, I think, the canonical choice here when you want to use open source tools. MLflow Tracking has a dashboard that you can run locally on your computer just by running the CLI command `mlflow server`, and you can build a Docker image with the required dependencies and run the MLflow tracking server as a deployment in the Kubernetes cluster that hosts the rest of the infrastructure. When you do that, then in your Flyte model training workflow you just import mlflow and set the tracking URI to whichever endpoint the MLflow tracking server in the cluster listens on. Then, in your evaluate-model task for example, you start an MLflow run, log the parameters that you used to parameterize the training, log the metrics you observed as a result, and you get these plots showing the metrics over time. You can also log model checkpoints and register them in the Models tab, so if we take the conversion prediction use case and train it every day, then every day we register a new version of the model. These are not just files lying around somewhere; there is metadata attached to these models, like which run they came from, which parameters were used to train them, and so on.

Okay, so let's say we did that, we tracked everything, and the decision gate passed: we want to deploy the model. How do we deploy it? Again, there are many different ways you could do it, and since we already use MLflow, note that MLflow also has a feature for deploying models. We use Seldon Core, a library to deploy models on Kubernetes, very successfully in production. Seldon Core, to quote from the documentation, makes it easier and faster to deploy your machine learning models and experiments at scale on Kubernetes. If you haven't used Kubernetes much, please don't be confused here: this YAML file is called a manifest in Kubernetes and describes an object you want to create. The important lines are that you want to create an object of kind SeldonDeployment, which is a custom resource definition that is available in the cluster after you install Seldon. The other highlighted lines say that in this example I want to serve my model using the so-called Triton inference server by NVIDIA, an inference server optimized for serving deep learning models from TorchScript, ONNX checkpoints, and so on. And then the only thing you actually need to do is specify in which bucket it can find your model: wherever you stored it, you tell this inference server, and it will take it from there.
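To tie the decision gate and the tracking step together, here is a minimal, hedged sketch of what an evaluate-and-register step could look like with the MLflow Python API; the tracking URI, model name, threshold, and the assumption of a scikit-learn style model are all placeholders for illustration:

```python
from typing import Optional

import mlflow
import mlflow.sklearn

# Point the client at the MLflow tracking server running in the cluster (placeholder URI).
mlflow.set_tracking_uri("http://mlflow.mlflow.svc.cluster.local:5000")


def evaluate_and_register(model, params: dict, metrics: dict) -> Optional[str]:
    with mlflow.start_run() as run:
        mlflow.log_params(params)      # e.g. learning rate, number of trees, ...
        mlflow.log_metrics(metrics)    # e.g. {"precision": 0.83, "recall": 0.79}
        mlflow.sklearn.log_model(model, artifact_path="model")

        # Decision gate: only register (and later deploy) if the model is good enough.
        if metrics["precision"] < 0.8:
            return None

        model_details = mlflow.register_model(
            model_uri=f"runs:/{run.info.run_id}/model",
            name="conversion-prediction",
        )
        # model_details.source is the artifact-store URI that we can later put
        # into the SeldonDeployment's model URI field, as discussed next.
        return model_details.source
```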
Now, if we want to automate that deployment: remember that on the previous slide, where we looked at logging with MLflow, we had this `mlflow.register_model` line. It returns a model details object with a source attribute, which is exactly the URI in the bucket that is used as the artifact store. So by accessing the source attribute of the object we get when registering the model, we can fill in the model URI key in the SeldonDeployment manifest.

When we want to automate the deployment from our pipeline, we could do something like a deploy task that uses the Kubernetes Python API to create such an object in an imperative way: the pipeline just tells Kubernetes, hey, I have a deployment here configured by this manifest, please create it for me. That would work. What I don't like about it is that if your cluster crashes for whatever reason, or you recreate it, you have no way to redeploy that model other than re-running the pipeline, which isn't great. Our teams like to follow a GitOps approach, where basically everything deployed in the cluster is configured by some configuration file in a git repository, and something in the cluster continuously synchronizes the actual state of the infrastructure to what is specified in that repository. One way we achieve that in an automated pipeline is that the MLOps pipeline creates a commit, or even a PR if you want a human in the loop, in such an infrastructure repository, and then a CI/CD pipeline tests it.

Let's look at how this could work in a simple example. We add a deploy-model step to our Flyte model training pipeline, after the decision gate that decides whether we want to deploy the model. Then we prepare such a model manifest and fill in the model URI that we take from the MLflow Python API, and we use the GitHub Python API to commit that manifest file to some branch. From that branch we could, for example, create a pull request into staging, and in the body of the pull request we could include a summary of the performance metrics that we observed during training and maybe a link to the run in the MLflow tracking server, so that it's very easy for the human in the loop to figure out what happened and how good this run is.

When this pull request is created automatically, the CI/CD pipeline in the infrastructure repository runs. It could in theory build a Docker image to be used as a model server, if you don't want to use the NVIDIA Triton inference server, but that's optional. What you definitely want to do is test the model deployment: in that CI/CD pipeline, create a temporary Kubernetes cluster, start the model there, make sure it comes up, send example requests to it and make sure that something reasonable comes back, and also test non-functional requirements like inference speed. Maybe your data scientists and ML engineers changed the model training code and trained a different type of model that has better metrics but is 20 times slower; here you want to catch that, because you don't want to deploy it to production if it's not fast enough.
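Sketching that deploy step, heavily hedged: the repository name, file path, branch names, and the exact SeldonDeployment fields are placeholders, and I'm using the PyGithub library here as one way to talk to GitHub:

```python
import yaml
from github import Github  # PyGithub


def open_deployment_pr(model_uri: str, metrics: dict, run_url: str, token: str) -> None:
    # Render a minimal SeldonDeployment-style manifest with the model URI
    # we got from mlflow.register_model(...).source (field values are illustrative).
    manifest = {
        "apiVersion": "machinelearning.seldon.io/v1",
        "kind": "SeldonDeployment",
        "metadata": {"name": "conversion-prediction"},
        "spec": {
            "predictors": [
                {
                    "name": "default",
                    "graph": {
                        "name": "conversion-model",
                        "implementation": "TRITON_SERVER",
                        "modelUri": model_uri,
                    },
                    "replicas": 1,
                }
            ]
        },
    }

    repo = Github(token).get_repo("my-org/infrastructure")  # placeholder repo
    base = repo.get_branch("staging")
    branch = "deploy-conversion-model"
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base.commit.sha)
    repo.create_file(
        path="models/conversion-prediction.yaml",
        message="Deploy new conversion prediction model",
        content=yaml.safe_dump(manifest),
        branch=branch,
    )
    # The PR body carries the metrics summary and MLflow run link for the human in the loop.
    repo.create_pull(
        title="Deploy new conversion prediction model",
        body=f"Metrics: {metrics}\nMLflow run: {run_url}",
        head=branch,
        base="staging",
    )
```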
So, let's say the CI/CD pipeline checks are green. The human in the loop takes a look at the PR description that we added, maybe clicks on the link to the run overview in the MLflow tracking server, and decides that this is a good model to deploy. They merge the PR, and then some GitOps tool like Flux CD or Argo CD notices: there's a new model manifest in the infrastructure git repository that isn't in the cluster, we need to deploy that. With Seldon you can also do very cool things like canary releases, which means that you don't switch all the traffic to the new model version at once, but gradually shift a larger and larger percentage of the traffic over to it.

Okay, so we have a prediction service now, and now we want to monitor it. Seldon integrates very well with the Elastic stack and with Prometheus and Grafana, which all run very nicely on Kubernetes. In the SeldonDeployment manifest that we saw before, you can specify a so-called logger endpoint, and then whenever a prediction request is sent to the predictions API of the model, the Seldon deployment sends the prediction request to this logging endpoint, including the predictions that the model made. The models also expose a feedback API. In our conversion prediction use case, for example, the ground truth information of whether the customer actually did purchase the thing we recommended becomes available a few minutes later, asynchronously, and we can give this information to the model through the feedback API. What then happens is that a service called the request logger, which Seldon provides, logs all the requests, the predictions, and later the feedback into Elasticsearch. That looks, for example, like this: every prediction request creates a document with all the feature names and feature values, in this case not conversion prediction but the breast cancer dataset from scikit-learn, plus the class probabilities that the model predicted, and if we provide feedback we also see the feedback here.

You can do even cooler things, because there's another service that can be notified whenever feedback is received; it then retrieves the respective document from Elasticsearch, assesses whether the prediction was correct, and exposes that information as metrics to Prometheus. You can then use Grafana to build real-time dashboards that let you track over time how the precision and recall of your classifier are doing, and if you observe in the dashboard that the performance drops over time, you can automatically retrigger the training pipeline, which does the same thing again, trains a new model, deploys the new model version, and then that's the one running.

We've now looked at all these things and at the different open source tools that you can use to build them, but I want to take a step back and look at why we want to do all that. Remember that the main argument was that ML systems decay over time because the real world changes, and if you don't have an automated and quick way to bring new versions of your model into production, the system will eventually die. With a setup like this, you have a way of bringing new model versions into production quickly.
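As a rough sketch of what sending a prediction request and the later ground-truth feedback to such a Seldon endpoint can look like, assuming the Seldon v1 protocol; the host, path prefix, and payload shapes are assumptions and should be checked against your own deployment:

```python
import requests

# Placeholder URL: the actual path depends on namespace and deployment name.
BASE = "http://seldon.example.com/seldon/models/conversion-prediction/api/v1.0"

# Ask the deployed model for a prediction for one customer interaction.
features = [[0.3, 1.0, 12.5]]  # made-up feature vector
pred = requests.post(f"{BASE}/predictions", json={"data": {"ndarray": features}}).json()
print(pred)  # contains the class probabilities the model returned

# A few minutes later, once we know whether the customer actually converted,
# send the ground truth back through the feedback API so the request logger
# can join it with the logged prediction in Elasticsearch.
requests.post(
    f"{BASE}/feedback",
    json={"response": pred, "truth": {"data": {"ndarray": [[1]]}}},
)
```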
So you have an automated way of bringing new versions of the system that trains the model into production: you change the source code of the training pipeline, make a pull request, review it, have a CI/CD pipeline test it and deploy it, in this case to the Flyte server, so that there's always a current, registered version that might train a new model every day, for example, and regularly deploy a new model as a prediction service. The monitoring with Seldon then allows us either to retrigger an execution of the training pipeline, to train a new version of the model with the existing pipeline on new data, or it informs us that we need to go back to the beginning and change the code again. So we have an automated way of bringing a new training system into production through CI/CD, and we have a way to continuously deliver up-to-date models into production.

I want to end with a quote from Jez Humble and Dave Farley: continuous delivery is the ability to get changes of all types, including new features, configuration changes, bug fixes and experiments, into production, or into the hands of users, safely and quickly in a sustainable way. Of course this is super important for non-ML software as well, but I want to argue that it is really, really important that you also have a setup like this for ML systems if you want to maintain them in production, because it allows you to keep them up to date over time and prevents them from dying catastrophically.

So, I've reached the end. We talked about the reasons why so many models fail in production, we defined the term MLOps, we looked at a maturity model to assess the level of maturity of an MLOps system, and I showed you which open source tools we use to build such automated systems. I want to thank you for your time if you've stayed until here. If you want to build such systems yourself, we are always hiring: take a look at our website and at our open source work. Thank you again, and have a great day.