Hey everyone, welcome to my talk. Over the next 20 minutes or so, I'm going to talk about MLOps. Specifically, I'm going to highlight how we automated various parts of a machine learning process using tools like Kubeflow Pipelines and Spinnaker, but first let me hide my camera so we can focus on the slides. A quick note on me. My name is Kevin Dela Rosa, and I'm a machine learning engineer at Snapchat. I primarily focus on computer vision problems to support the scan feature in the Snapchat app. That's my Twitter handle. Feel free to reach out to me there or on other channels like LinkedIn. So here's a quick overview of what we'll be covering. First, I'll give you a brief introduction to what I work on at Snap and where machine learning fits in. Then we'll talk through the key components of a production machine learning system. We'll touch on what MLOps is, and then for the remainder of the talk, we'll walk through the various steps involved in going from an experiment to continuous machine learning in production, closing out with an overview of what this looks like for my team. So let's get started. Snapchat is the fastest way to share a moment. At Snap, we contribute to human progress by empowering people to express themselves, live in the moment, learn about the world, and have fun together. One of the ways we allow folks to express themselves is through augmented reality experiences we call lenses. In the app, these lenses can be unlocked directly through a Snapcode, lens links, and Lens Explorer. Snapchatters can also discover these lenses using the scan feature. By long pressing the screen while the camera is active, scan brings up contextual lenses and tools that may be relevant to what the camera is seeing. In this first example, we see a drink-related lens come up after scanning a cocktail. In the second example, we see a lens that was built with a marker template being unlocked by scanning a poster.
Markers are images, like this poster, that can be recognized and tracked by the Snapchat camera to create immersive 3D experiences like the ones shown here. When creators build these augmented reality experiences in Lens Studio, they can specify scan triggers, which tell Snapchat what the lens should be related to in scan. For example, if you add the sky scan trigger when submitting an astronaut lens like this one, then when a Snapchatter points the camera at the sky and triggers scan, it may open the corresponding lens. And as previously mentioned, lenses built with the marker template can be unlocked in Snapchat directly through the camera. Snapchatters who encounter a marker in the wild can long press on the camera to initiate a scan and unlock the corresponding lens. Under the hood, scan is powered by machine learning. In both of these cases, when scan is triggered, we use machine learning approaches to automatically recommend and rank the most relevant lenses to show to the user. And in production, if you follow the MLOps methodology, you end up with a system like this, which is a diagram from Google Cloud. Some key components include continuous integration for building, testing, and packaging pipeline components; an automated machine learning pipeline for performing your data engineering, model training, and model evaluation workflows; a feature store to feed data into your ML pipeline; a prediction service powered by your actual machine learning model; a continuous delivery pipeline for deploying your machine learning pipeline as well as your prediction service; and lastly, some mechanism for monitoring performance. As we progress through this presentation, I'll provide more details on each of these steps and how you can ultimately get there based on our experience. But when everything is said and done, we all have to start from somewhere.
Machine learning is experimental by nature, and the starting point for many new machine learning applications is often experimental code in Jupyter notebooks like this one. So there's obviously a gap between having a proof-of-concept Jupyter notebook and that big diagram I showed you of a finely tuned and automated machine learning deployment. For this, we can use MLOps to guide us along the way. MLOps is a set of best practices to test, deploy, manage, and monitor ML models in real-world production. On the right-hand side, you see some of the core concepts to keep in mind while productionizing a machine learning system. To understand these, you first need to recognize that there are key differences between traditional software systems and machine learning deployments. For example, machine learning isn't just about code. Machine learning is really about the combination of code and data to produce artifacts called machine learning models. And in addition to code changes, one also has to be aware of data changes, as these impact the performance of your model and the reproducibility of your steps. Additionally, the ecosystem surrounding machine learning deployments has a lot of moving parts that are not always seen in traditional server deployments: for example, data pipelines, including ETLs as well as data labeling pipelines; feature stores, as previously mentioned; and your actual machine learning pipeline, which can often be highly customized for a specific use case. Now that's a lot of stuff to keep in mind. Instead of trying to do this in one big bang, you should adopt an incremental process to gradually increase the level of automation in your system as things mature. Generally speaking, these are the steps we used to reach continuous machine learning for things like unlocking lenses in scan.
I'm not going to touch on the machine learning algorithms here, but rather focus on the machine learning engineering and MLOps considerations, with the assumption that you have already done some experimentation on the ML modeling side. So let's walk through each of these steps and what you gain along the way. Firstly, you want to containerize all the things. In this step, you need to break up your experimental Jupyter notebooks into modular code and containerized programs. And once that is done, you add it to your source control system, e.g. GitHub. After this step, changes to your algorithm and transform code will now be tracked. You'll also have software that is easier to write unit tests for, since your code is more modular. And you'll more easily be able to reproduce things, because you Dockerized all the things. Step two. In this step, you want to automate your machine learning process. For this, we use Kubeflow Pipelines. For the uninitiated, this is what Kubeflow looks like. In Kubeflow, a pipeline is a graph describing a complete machine learning workflow. This graph specifies which components run and how they relate to each other, and each component or step in the pipeline launches one or more Kubernetes pods and acts like a function, in that it has a name, parameters, return values, and a body, i.e. Docker containers. In this scheme, data is passed between parent components and their children via serialized data. Using Kubeflow to represent your ML pipeline might result in a graph that looks something like this. First, you start off with some step that validates the input data to your pipeline. Then we pass off to some component that extracts features from the raw data or the feature store. Then we pass these features along to a component that actually trains a machine learning model.
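To make the component-as-function idea concrete, here's a minimal sketch in plain Python of the graph just described: validate, featurize, train. The function names and toy data are hypothetical; in the real setup each step would be a containerized Kubeflow Pipelines component running in its own pod and exchanging serialized data, rather than an in-process function call.

```python
import json

# Hypothetical stand-ins for the pipeline steps. In a real Kubeflow pipeline,
# each of these would be a containerized component; here they are plain
# functions so the DAG shape is easy to see.

def validate_input(rows):
    # Reject empty batches or rows missing the fields downstream steps need.
    assert rows, "no input data"
    assert all("label" in r and "text" in r for r in rows), "malformed row"
    return rows

def extract_features(rows):
    # Toy featurization: length of the text field.
    return [{"x": len(r["text"]), "y": r["label"]} for r in rows]

def train_model(features):
    # Toy "model": pick the median feature value as a decision threshold.
    xs = sorted(f["x"] for f in features)
    return {"threshold": xs[len(xs) // 2]}

def run_pipeline(rows):
    # The DAG: validate -> featurize -> train, mirroring the graph on the slide.
    return train_model(extract_features(validate_input(rows)))

model = run_pipeline([
    {"text": "hi", "label": 0},
    {"text": "a much longer example", "label": 1},
    {"text": "medium one", "label": 1},
])
print(json.dumps(model))
```

The point is the shape, not the toy logic: each step has clear inputs and outputs, which is what makes it straightforward to wrap as a container and wire into a pipeline graph.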
After the model is trained and produced, we can do some model evaluation, like calculating accuracy, mean average precision, recall, intersection over union, or whatever other metrics are relevant to your use case, over a validation set. Then lastly, you'll typically do some model validation against held-out test data. This can take the form of checking if the model performs at some minimally acceptable level of accuracy prior to uploading it to wherever you store or register your models. Over the course of step two, you now have an automated way to kick off a run of your training pipeline. To help enable reproducibility, you can store metadata about each run and model. You can track changes to your entire ML process by storing your KFP pipeline in source control. You end up testing more things, in the form of components like data and model validation, as well as unit tests over the model specification and integration tests over your entire ML pipeline. You can also log model performance for each run to understand if your model is degrading over time, or perhaps if you goofed up on the algorithm side. Step three: now that your components are containerized and you have a pipeline that automates the ML process, we can move towards continuous integration. In this step, you want to set up your favorite CI tool to help you build code and run unit and integration tests. On my team, we use a mix of Jenkins and Drone to accomplish CI. Either way, you want to have some release step that publishes artifacts like Docker images, which we publish to GCR, Google Container Registry, or if you're in AWS, that would be ECR. We compile versioned Kubeflow pipeline specification YAMLs, and we version Kubernetes configurations for the prediction service deployments.
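The model-validation gate described at the start of this step might look something like this sketch. `MIN_ACCURACY` and the in-memory "registry" list are illustrative assumptions standing in for whatever bar and model store you actually use.

```python
# Hypothetical model-validation gate: only "register" the model if it clears
# a minimum accuracy bar on held-out data.

MIN_ACCURACY = 0.90  # illustrative bar, not a recommendation

def accuracy(preds, labels):
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

def validate_and_register(preds, labels, registry):
    acc = accuracy(preds, labels)
    if acc < MIN_ACCURACY:
        # Fail the pipeline run instead of shipping a bad model.
        raise ValueError(f"accuracy {acc:.2f} below bar {MIN_ACCURACY}")
    # Stand-in for uploading to a real model store or registry.
    registry.append({"accuracy": acc})
    return acc

registry = []
acc = validate_and_register([1, 0, 1, 1, 0], [1, 0, 1, 1, 0], registry)
print(f"registered with accuracy {acc:.2f}")
```

Raising inside the component is the important bit: the pipeline run fails loudly, and nothing downstream ever sees a model that didn't clear the bar.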
In this step, you'll have unit and integration test automation at both the component and pipeline level, as well as the artifacts required to deploy a server or training pipeline, which will come into play in the next few steps. In step four, we start to talk about continuous delivery. In this step, you want to orchestrate the deployment of your ML pipeline using a continuous delivery system. When you reach this stage, you'll trigger automated deployments of the ML pipeline when your continuous integration system releases new artifacts. This will let you automatically train new models whenever the ML process changes. Additionally, if you use a CD system, you'll have repeatable deployments for your ML pipeline and also the ability to do things like roll back to previous versions straight from the UI. So for this step, we used Spinnaker. For the uninitiated, this is what Spinnaker looks like. Spinnaker is a cloud-native continuous delivery system which makes your deployments fast, safe, and repeatable. In Spinnaker, you define one or more pipelines to manage the deployments, which consist of a set of stages. A stage is an action for the Spinnaker pipeline to perform, like deploying a manifest, running a job, kicking off a sub-pipeline, resizing a server group, and so on and so forth. Deploying our ML pipeline might look something like this. Your initial configuration expects a set of artifacts as input. In our case, this would contain the Kubeflow pipeline YAML artifact and some configuration to help us pass in proper parameters when invoking the Kubeflow pipeline. Then next you might have some smoke test stage to do non-exhaustive tests against your new ML pipeline, but this is optional and really up to you. Next is the main part of this pipeline, the run-ML-pipeline stage, which does the actual deploying or submission of this pipeline.
For us, this looked like a small Kubernetes job that first submits a run of the ML pipeline to the Kubeflow cluster, waits for successful completion, then parses the output to produce a Spinnaker artifact that can be passed around in subsequent stages in the Spinnaker pipeline. Lastly, if you have a model server deployment Spinnaker pipeline, you could invoke that directly at the end of this pipeline, though at this point in your project lifecycle, you probably don't have that yet. More on this in step six. In step five, we achieve continuous training. To do this, you update your CD pipeline to trigger deployment of your ML pipeline either on some set schedule, like daily, weekly, or monthly, or upon availability of new data. The route you take here will depend on your use case and the availability of new data. We do this in Spinnaker by using built-in triggering mechanisms, like cron jobs and Pub/Sub message triggering. After completing this, you'll have automated the process of training new models upon the availability of new data. In step six, we achieve continuous delivery for the model server, or prediction service. We did this in Spinnaker by setting up a pipeline which specifies how to deploy the model server, triggered on changes to the server configuration via continuous integration or upon the availability of new model artifacts created by the continuous training Spinnaker pipeline discussed in the previous steps. After this stage is completed, you benefit from the continuous deployment of new model servers and also make gains with respect to reproducibility. The Spinnaker pipeline here should look similar to what you'd expect for a traditional server deployment, so I won't go into great detail. On this slide, you see a sample Spinnaker deployment that deploys a canary, waits for manual approval, then deploys to some production environment. Lastly, you'll want to invest in monitoring.
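That submit, wait, and parse job at the top of this slide could be sketched roughly like this. `FakeClient` is a hypothetical stand-in for the real Kubeflow Pipelines client, and the artifact dict is a simplified version of what a Spinnaker stage would actually consume.

```python
import json
import time

# Sketch of the small "runner" job: submit a run, poll until it reaches a
# terminal state, then emit an artifact for later Spinnaker stages to use.

class FakeClient:
    """Stand-in for a Kubeflow Pipelines client; method names are hypothetical."""
    def submit_run(self, pipeline_file, params):
        return "run-123"
    def get_run_status(self, run_id):
        return "Succeeded"
    def get_run_output(self, run_id):
        return {"model_uri": "gs://bucket/models/run-123"}

def run_and_emit_artifact(client, pipeline_file, params, poll_seconds=0):
    run_id = client.submit_run(pipeline_file, params)
    # Poll until the run succeeds or fails.
    while (status := client.get_run_status(run_id)) not in ("Succeeded", "Failed"):
        time.sleep(poll_seconds)
    if status == "Failed":
        raise RuntimeError(f"run {run_id} failed")
    # Parse the run output into a Spinnaker-style artifact dict that
    # downstream stages (e.g. model server deployment) can reference.
    output = client.get_run_output(run_id)
    return {"type": "gcs/object", "reference": output["model_uri"]}

artifact = run_and_emit_artifact(FakeClient(), "pipeline.yaml", {"epochs": 5})
print(json.dumps(artifact))
```

Emitting the trained model's location as an artifact is what lets the continuous-training pipeline hand off cleanly to the model server deployment pipeline in step six.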
In this step, you want to make sure that you have the proper telemetry in place for your server. Additionally, you'll need mechanisms to detect deviations in model performance. For the former, you have the usual suspects of latency, traffic, errors, and saturation. For your model performance, if it is hard to get direct feedback on your model's predictive accuracy, you can use proxy metrics that track the results of acting on your predictions. For example, here you might measure if there are shifts in click-through rates. After this step, you'll unlock the capability to determine if your model or ML process needs to be changed. So cool. Those were the seven steps we took to reach, quote unquote, production-grade machine learning. At this point, you will have achieved a new continuous thing: continuous machine learning. This diagram summarizes the journey we took, from ML experiments in Jupyter notebooks, to continuous integration using tools like Jenkins, continuous training using Kubeflow to automate your ML process, continuous deployment through Spinnaker, to monitoring metrics in your Kubernetes pods, which will then ultimately lead you back to doing more experiments and refining your process. Here are some useful links for those who want to learn more about Snapchat, MLOps in general, or the various tools I mentioned. I'll stay here for a few seconds for those who want to screenshot this for future reference. Okay. Anyway, thanks for listening. Again, I'm Kevin Dela Rosa, signing off. Have a great rest of your day. Thank you.
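To leave you with one concrete illustration of that monitoring step: a click-through-rate proxy check might look something like this sketch. The 20% relative-change alert threshold is purely an illustrative assumption.

```python
# Hypothetical proxy-metric check: flag the model when click-through rate
# shifts too far from a baseline window, since direct accuracy feedback
# may not be available in production.

def ctr(clicks, impressions):
    return clicks / impressions if impressions else 0.0

def ctr_drifted(baseline, current, max_rel_change=0.20):
    base = ctr(*baseline)
    cur = ctr(*current)
    # Relative change of current CTR against the baseline window.
    return abs(cur - base) / base > max_rel_change

# Baseline week: 500 clicks / 10,000 impressions (CTR 5%).
# Current week:  300 clicks / 10,000 impressions (CTR 3%), a 40% drop.
alert = ctr_drifted((500, 10_000), (300, 10_000))
print("investigate model or data" if alert else "ok")
```

A check like this wouldn't tell you *why* performance shifted, only that the model or the ML process deserves a look, which is exactly the capability this step is meant to unlock.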