Hi, I'm Will Benton. I'm a product architect at NVIDIA, and I've been building machine learning systems on Kubernetes, and helping people build them in the real world, for a long time. Sophie?

Hi, I'm Sophie Watson. I'm a data scientist at Red Hat, and I've also been helping customers run their machine learning workflows on Kubernetes.

So machine learning systems are just distributed systems, and so obviously Kubernetes provides the primitives to make them manageable. This isn't a controversial opinion anymore; everybody agrees that Kubernetes is now the default choice for machine learning systems. However, machine learning systems introduce new challenges that conventional microservice applications don't face, or don't face to the same degree. The Kubernetes ecosystem addresses part of the problem, but it doesn't provide end-to-end solutions. You'll still need to build your own solutions and use your judgment. In this talk, we're going to show you what to look out for and how to get started.

We're going to start by introducing the basics of machine learning systems. Sophie and I have given another talk that covers the argument for machine learning systems on Kubernetes, which we'll link to at the end of this talk, so you can watch that and catch up if you need to level set. We're going to talk about the special challenges of reproducibility as they apply to machine learning systems, and why the reproducibility you have in your CI/CD pipelines may not be sufficient for your machine learning systems. Building machine learning systems involves a lot of iteration, so we want the tightest feedback loops possible: we want our practitioners to be able to work and iterate as quickly as possible, and we'll look at what we need to do to enable them to execute with really high velocity. Then, just like any complex system, machine learning systems have a lot of moving parts, and we'll look at how we can benefit from monitoring our models and pipelines in production. Finally, we'll bring it all together, showing you how these systems fit together, how they can benefit from Kubernetes, and where you need to make some decisions as you start your journey.

So let's start by talking about machine learning systems. Sophie and I have talked about these a lot in the past, and we've usually used a diagram like this to visualize the workflow that practitioners go through, really thinking about the human process of building a machine learning system, or a system that uses machine learning to provide business value, a system that learns from data. We have a bunch of tasks flowing into one another, but instead of focusing on those tasks right now, let's talk about the teams, the people who build these applications. We have data engineers, who take data sources, streaming data, data at rest, structured and unstructured data alike, and make these available and reliable at scale. We have data scientists, who turn the business problem we want to solve into a number, figure out how to find patterns in the data that the data engineers provide, figure out how to expose those patterns to machine learning algorithms that can exploit the structure of that data, and finally ensure that we're actually solving the problem that we want to solve.
And we also have application development teams and DevOps teams, who really focus on taking the prototypes that data scientists develop and the services that data engineers provide and turning them into a reliable, usable system that provides business value, one you might actually want to use. I want to especially call out this inner loop here, where a lot of the data science time is spent: feature engineering, which is figuring out how to expose the structure in data so that a machine learning algorithm can exploit it; model training and tuning, which is using computing techniques to exploit that structure and identify good trade-offs in characterizing input data; and model validation, which is verifying that the model has actually generalized from the data it's seen and that we could trust it in production. This inner loop involves a lot of iteration and experimentation, and both reproducibility and velocity are going to be really important here. The output of that inner loop is a machine learning model.

Now, the model gets a lot of popular attention. Even if you don't spend a lot of time thinking about machine learning, you can probably name some popular machine learning models, like BERT, or tools around them, like Hugging Face, that have gotten a lot of attention in the technology press. But models aren't really the hard part of machine learning systems. Models sit at the middle of complex distributed systems that have a lot of interconnected parts and a lot of bad engineering properties, because they're really complicated. I don't want to dive into this now in this talk, but if you haven't heard of the paper by D. Sculley and some colleagues at Google called Hidden Technical Debt in Machine Learning Systems, you should check it out. This paper has been read a lot, and cited probably even more than it's been read, but the argument is basically that machine learning techniques are relatively easy to develop, while machine learning systems are relatively hard to maintain.

So one of the problems we have with machine learning systems is ensuring reproducibility, and this comes up in a lot of forms that don't necessarily come up in your conventional cloud-native applications. The first is a problem we might have with feature engineering. Say you have a data pipeline that knows how to encode the colors you're observing as an input to a particular machine learning algorithm. Typically we'll encode things as vectors of numbers, and this pipeline encodes colors as vectors of two numbers: it encodes red as zero zero, blue as zero one, and yellow as one zero. This works really well as long as all the colors you see are red, blue, or yellow, but when you get a color you haven't seen before, your model doesn't know what to do with it. Not only does your model have no idea what orange might mean to the problem you want to solve, but before you even get to the model, your encoding has no way to represent orange, because you never saw it when you were training. Some of you here today are probably old enough to remember the Y2K problem, in which a lot of software had to be retooled because it used date formats with two-digit years, and so it would only work within a single century: once we went from 1999 to 2000, software might think it was 1900 again.
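As an aside, here is a minimal sketch of the color-encoding problem just described; the mapping and the example categories are hypothetical, and a real pipeline might use something like scikit-learn's OneHotEncoder rather than a hand-rolled dictionary.

```python
# A hand-rolled encoding that only knows about the categories seen at training time.
ENCODING = {
    "red":    [0, 0],
    "blue":   [0, 1],
    "yellow": [1, 0],
}

def encode_color(color: str) -> list[int]:
    try:
        return ENCODING[color]
    except KeyError:
        # The pipeline was built before "orange" appeared in the data,
        # so there is simply no defined feature vector for it.
        raise ValueError(f"no encoding for unseen category: {color!r}") from None

print(encode_color("blue"))    # [0, 1]
print(encode_color("orange"))  # raises ValueError: the model never even gets a chance to score it
```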
Machine learning systems have analogous problems every single day, which is why reproducibility is so important. We need to make it possible, and easy, to change one part of the system, even radically, and rebuild while keeping everything else equal. Similarly, the library stack you're using probably has a lot of dependencies: on userspace libraries outside your language ecosystem, like native libraries for linear algebra; on massive chunks of the Python or Java ecosystem, which have a lot of subtle interdependencies; and even on accelerator drivers installed on your host. If you need to change one of these libraries, for example to fix a security issue, provide some new functionality, or even just upgrade a transitive dependency that several things depend on, it might perturb your entire stack. So we need to be able to roll back destructive changes just as we would with regular software, but these problems are much more frequent when we're dealing with a lot of libraries, many of which are constantly changing.

Another issue is explainability. If you're in a regulated industry, you might have a machine learning model making decisions about whether or not to write someone a loan. This can be a tricky problem even if you get asked for an explanation right away, but it's one that practitioners in these industries plan for: they'll explain how they encoded the applicants as features, show which aspects of the application turned into inputs for the machine learning model, and show how the model evaluated them. This usually works out, but what if you need to explain a decision from a few years ago? If you don't have a way to reproduce that entire pipeline exactly as you executed it, you won't be able to explain your decision. In many organizations, data cleaning and even some feature engineering steps are manual processes, but you'll need total reproducibility and versioning across the entire pipeline if you're going to be able to answer these kinds of explainability demands.

Another concern that we all have to deal with is the GDPR. I say we all have to deal with it, even though it's an EU regulation, because any organization that deals with EU citizens is bound by it. Under the GDPR, users have a right to be forgotten, which means you need to be able to show them there's no trace of their data left in your system. This can mean retraining a model so that it no longer accounts for data you collected from them, which means you need to be able to reproduce the entire training pipeline and run it again.
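To make those reproducibility demands concrete, here is a minimal sketch, not any particular product, of the kind of record you might store alongside a trained model so the whole run can be rebuilt later, for an explainability request or a GDPR-driven retrain; the file names, manifest fields, and commit id are hypothetical.

```python
# Capture the exact environment, code revision, and data snapshot behind a training run.
import hashlib
import json
import subprocess

def training_manifest(data_path: str, git_rev: str) -> dict:
    # Exact library versions installed in the training image (assumes pip is available).
    dependencies = subprocess.run(
        ["pip", "freeze"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    # Fingerprint of the training data actually used for this run.
    with open(data_path, "rb") as f:
        data_digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "code_revision": git_rev,
        "dependencies": dependencies,
        "training_data_sha256": data_digest,
    }

# Toy data file so the sketch runs end to end; in practice this would be a versioned dataset.
with open("train.csv", "w") as f:
    f.write("color,label\nred,0\nblue,1\n")

manifest = training_manifest("train.csv", git_rev="abc1234")  # hypothetical commit id
print(json.dumps(manifest, indent=2))
```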
Now I'm going to turn it over to Sophie to talk about high-velocity data science.

Thanks, Will. So the data science world moves fast. It's still a new and growing industry, and that brings with it many challenges that Kubernetes just isn't equipped to handle out of the box. These challenges are really exacerbated by the way in which data scientists work. Data scientists are used to having total freedom and rarely having to share or publish their work. For them, the convenient thing is to self-manage an environment so they have total control; think of a software developer's computer in 1998. At the other end of the continuum, we've got an idealized setup where data scientists share resources and work on a foundation that's managed by IT. In practice, those managed setups aren't conducive to rapid prototyping and lack the flexibility that data scientists are used to.

Diving into the work of data scientists a bit further: as data scientists, we aren't focused on release engineering. When I wear my data science hat, my job is simply to find meaningful patterns in data, and left to my own devices, I'm going to do that in the most convenient way possible. Now, in order to put a machine learning pipeline into production, any code is going to have to meet certain standards, and at a bare minimum it needs to be reproducible. Reproducible code requires a reproducible environment. But that doesn't fit well with the fast-moving pace of data science, the appetite for new libraries, new packages, and new visualization techniques, and this, combined with a lack of release engineering expertise, means it's not uncommon for data scientists to end up working in a global environment that contains every package we've ever used.

To achieve reproducible, release-worthy artifacts, we first need to identify the minimal dependencies of the workload. Once those dependencies have been vetted by IT, they can go ahead and produce a certified image that satisfies the data scientist's requirements whilst adhering to company policy. But just identifying those libraries isn't always enough. Tools like Source-to-Image and Draft make it really easy to build these images rapidly, but as IT, you might be supporting six different data science teams, each of which wants to use a different, incompatible version of the same library. So unless you've got a really effective vetting process and software supply chain, your official data science images are going to be laughably out of date by the time they're released, pushing data scientists back towards that self-managed, ad hoc environment where everything is installed in one place.

The challenges around delivering machine learning workloads don't stop once you've got a nice development environment. That arrow taking us from a valid model to a production deployment is doing a lot of heavy lifting. In many organizations, the handoff from data science to application development is a manual one, where an application development team literally takes a stack of notebooks from a data science team, re-implements them as services, and puts them into production applications. Ideally, any model service would be generated just like we'd make a standard microservice: we want to use CI/CD to make a container image, eliminating any manual re-implementation and automating where possible. Relying on computers is generally better than relying on humans.

But these model microservices aren't the same as standard microservices. With a conventional cloud-native microservice, you'll have an API that accepts things that look like program domain objects. In this case, for example, we have an employee structure with an employee ID, an employee name, and a date range, requesting time off in an HR system, and the service returns a tuple with a status of OK. But with a machine learning model, our inputs and outputs are typically vectors of numbers that have no inherent meaning to the program, or to anyone other than the model. Here we have a model that takes a relatively large vector and returns two numbers, but we don't know anything about what any of these mean. Maybe we know that the model expects a representation of an image as input and returns a pair of probabilities: whether the image is likely a cat or likely a dog. In this case, we have an image that's likely a cat.
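To make that contrast concrete, here is a minimal sketch of the two request shapes; the field names, the 512-element vector length, and the class ordering are all hypothetical.

```python
# Conventional microservice: the payload is a program-domain object a human can read.
time_off_request = {
    "employee_id": 4217,
    "employee_name": "Jane Doe",
    "date_range": {"start": "2020-08-03", "end": "2020-08-07"},
}
time_off_response = {"status": "ok"}

# Model service: the payload is an opaque feature vector, and the response is a pair
# of numbers with no inherent meaning outside the model.
model_request = {"features": [0.12, 0.98, 0.03] + [0.0] * 509}   # 512 floats
model_response = {"probabilities": [0.91, 0.09]}                 # e.g. [P(cat), P(dog)]

print(len(model_request["features"]), model_response["probabilities"])
```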
Now, our model service ultimately needs to take raw data, transform it into feature vectors, and pass them through the model to make a prediction. But if the service itself expects a feature vector, then our API is just a vector of floating-point numbers, and there's really no way for us to know how we got to that vector: there are all sorts of techniques we might use to reduce large vectors to small vectors, or to reduce images or program domain objects into vectors. If we had a pipeline service instead, we could take in raw data, transform it into feature vectors, and make a prediction, like this. And with perceptual models, you have the additional problem of how you actually encoded an image, which is structured data that you could represent as a vector of numbers, into that vector of numbers, and that encoding might be something you want to capture as well.

So we've talked about model services and interacting with them, but where do they come from? There's not always a simple way to go from an exploratory data science environment to a production artifact, and you might end up rewriting or re-implementing code. If you're doing that, you're introducing possible errors and inconsistencies between training and deployment. We believe that CI/CD frameworks like Tekton and image-building frameworks like Source-to-Image are absolutely the right way to train the majority of models, and we've built a data science user experience atop these frameworks to enable practitioners to operationalize notebooks. The vast majority of real-world models can be trained in a short amount of time with relatively modest resources, and CI/CD builds are absolutely the right abstraction for these. But there are some challenges that conventional apps might not face. For example, your build pod might need more resources, time, memory, and GPUs in order to train a model, and you're going to have to account for this in your policies and identify a cutoff where it no longer makes sense to train models in builds. For some models that require tens of gigabytes of RAM and days of GPU time, and that produce enormous outputs, you may need to orchestrate training without using container builds. You'll need to decide where that trade-off is for your organization, but you should definitely use CI/CD first.
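Here is a minimal sketch of what such a pipeline artifact can look like, using scikit-learn's Pipeline: the feature engineering and the model are trained and shipped as one object, which is exactly the kind of artifact a CI/CD build can produce; the toy data, column names, and choice of estimator are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data standing in for whatever the data engineers provide.
raw = pd.DataFrame(
    {"color": ["red", "blue", "yellow", "blue"], "weight": [1.0, 2.5, 0.7, 3.1]}
)
labels = [0, 1, 0, 1]

pipeline = Pipeline([
    # Feature engineering lives inside the served artifact...
    ("features", ColumnTransformer(
        [("color", OneHotEncoder(handle_unknown="ignore"), ["color"])],
        remainder="passthrough",
    )),
    # ...alongside the model, so training and serving can't drift apart.
    ("model", LogisticRegression()),
])
pipeline.fit(raw, labels)

# At serving time the caller sends raw data, not a pre-computed feature vector.
print(pipeline.predict(pd.DataFrame({"color": ["red"], "weight": [1.2]})))
```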
So monitoring cloud-native applications is obviously important, but it becomes even more important when we're talking about managing machine learning systems. Here I've got a model scoring pipeline: it takes in data, transforms it, and makes predictions. We can check metrics about the predictions it made, and we can check the standard microservice performance metrics as well, to monitor it and set up alerts if something changes notably. But finding out that something has notably changed doesn't mean we know what caused the change or how to track it down. Changes in metrics recorded at the model level don't tell us exactly where in the workflow something has gone wrong, and there are things that can go wrong at every stage of this workflow. Suppose, for example, your machine learning model is detecting fraud: it's been trained to identify fraudulent credit card transactions. It might work for a while, but scammers get clever. They learn how to change their patterns to trick the system, bypassing the model by making it think that their fraudulent activity is actually legitimate.

Another example of where things can go wrong is in the feature engineering stage. Suppose you realize that two columns in your dataset were correlated. You don't need both of them, so you decide to get rid of one. Now suppose the data source changes: perhaps someone else realized that they don't need to be collecting both pieces of information, so they delete one of the fields, and they end up deleting the one you kept. What will happen then? When standard software breaks, it often fails to compile or returns an error message. But machine learning models don't fail in the same way: they'll carry on making predictions, but those predictions will be wrong more and more often.

Another example is in the medical industry. Suppose you trained an NLP model on medical journals back in January 2020. It might give you a word cloud that looks something like this. Now, you might have heard of this thing called the novel coronavirus, which has been having quite an impact on medical journals in the middle part of 2020, but no model trained last year or early this year will even know that it exists. So monitoring your models and identifying changes is all well and good, but it's not really sufficient to solve your problems. Obviously, the more places we monitor, the easier the behavior of the system is to observe, and that's a really important first step to making it possible to debug our systems when things go wrong. There are also some higher-level tools we can use to help us better understand our models and reduce the frequency of these errors. The Alibi libraries from the Seldon team, Alibi Detect and Alibi Explain, enable us to identify drift in our data and provide tools to give explainability insights into the model: why did it make a particular prediction? The Great Expectations library brings testing, documentation, and best practices from standard software engineering into data science and data engineering workflows, by simply allowing you to assert some great expectations for your data.
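Here is a minimal sketch of both kinds of check; it assumes the alibi-detect and Great Expectations APIs roughly as they looked around 2020 (KSDrift and the older pandas-style interface), so treat the exact calls as illustrative rather than authoritative, and the reference data, column names, and thresholds as hypothetical.

```python
import numpy as np
import pandas as pd

# 1. Drift detection: compare live feature vectors against a reference sample.
from alibi_detect.cd import KSDrift

x_ref = np.random.normal(0.0, 1.0, size=(500, 8))    # features seen at training time
x_live = np.random.normal(0.5, 1.0, size=(200, 8))   # shifted production traffic
drift = KSDrift(x_ref, p_val=0.05).predict(x_live)
print("drift detected:", drift["data"]["is_drift"])

# 2. Data expectations: fail loudly if an upstream field disappears or misbehaves.
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({"amount": [10.0, 25.0], "merchant": ["a", "b"]}))
print(df.expect_column_to_exist("merchant"))           # fails if the field is dropped upstream
print(df.expect_column_values_to_not_be_null("amount"))
```

If either check fails, that's a signal to alert a human before the model quietly starts making worse predictions.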
So Will, where does this all fit in the big picture?

Thanks, Sophie. Let's look at the big picture now. We've seen the machine learning workflow and how practitioners build machine learning systems, but we haven't seen what a machine learning system actually looks like in production, or how the individual parts that individual humans work on fit together. To get a sense for the territory, let's take a look at a map.

First, let's look at what data engineers do. Data engineers are responsible for making data at rest and data in motion, both structured and unstructured, accessible to machine learning systems. They thus need to manage multiple system surfaces to provide access to data and make it easier for machine learning workflows and systems to consume federated data from multiple sources. They may also need to support data versioning, to enable reproducing older versions of models for explainability purposes. Data scientists need to manage pipelines and experiments, potentially trying many combinations of feature engineering techniques and model training techniques, each with different parameters, to find the best one. They'll need an efficient environment to scale out reproducible research if they're going to be successful, and they also need to manage flows of structured and unstructured data through these components without a lot of brittle glue code that they have to develop themselves. Application developers will need a low-friction way to publish the efforts of data scientists responsibly and reproducibly, in a way that makes it possible to identify when models and pipelines are starting to go wrong.

These roles overlap in key places, too. Machine learning systems model some aspect of the real world, which is a complex and changing place. By incorporating machine-readable feedback about historical predictions, our systems can improve with time, identifying mispredictions they've made in the past. As practitioners learn more about the aspects of the world they're modeling, their insight will lead to improved techniques as well, but only if we can enable them to iterate with high velocity. Nearly every part of the system will produce metrics: reports for model validation, summary statistics and histograms for data sources, conventional metrics for conventional application components, and so on. But some components will consume metrics as well. As Sophie mentioned, drift detection is an important problem we need to address, and we can use the Kubernetes monitoring stack to support drift and outlier detection so we can identify problems with our models and pipelines proactively. If this all sounds like the things you're used to running on Kubernetes, great. But the devil is in the details, and in this talk we've tried to show you some places where this doesn't exactly map onto what you already know about cloud-native applications. Sophie?

Thanks, Will. Yeah, we started out by looking at the ways in which data scientists' workflows differ from those of software engineers, and we focused on the high-velocity field of data science and the lack of discipline around creating software, which really leads data scientists to manage their own non-reproducible environments. We talked about how there's a lot of heavy lifting done by one particular part of the workflow: the path from the developed model to the deployed model is not as straightforward as one would like, and we discussed how tools like Source-to-Image and good CI/CD practices can help ease this transition. Once the model is in production, we've still got unique challenges that machine learning microservices face and regular microservices do not, for example managing complex software stacks and managing pipeline APIs versus model APIs. Finally, we discussed how machine learning models can break in many ways, how important it is to monitor them in as many places as possible, and how tools like Alibi and Great Expectations will help give you insight into what's actually going on.

We're looking forward to your questions. Here are a couple of links, as promised, to learn more about why you should build machine learning systems on Kubernetes. The first is a talk we gave at OSCON last summer about how to provide a productive data science environment on top of the useful primitives that Kubernetes enables; it has demos of our Source-to-Image builders and pipelines for machine learning model services. The second is a paper about why machine learning systems are really a lot like distributed systems, in a way that enables them to benefit from Kubernetes; it introduces a concept called intelligent applications to make sense of machine learning systems in the cloud-native era. Thanks again for your time. Neither of us is speaking for our employers here, so the best way to reach us is through our Twitter handles, which are at the bottom of the slide.