Hello everyone. Welcome to my session. My name is Carl Weinmeister, and I'm going to talk to you about building a reproducible ML workflow with Kubeflow Pipelines. I've spent some time in the software engineering world as well as the machine learning world, and I know the challenges of not just building a machine learning project and getting that initial model to a high level of accuracy, but putting it into production. How do you make sure you get the same results every time? We're going to talk about an open source platform that lets you do that. So let's start with the agenda. We'll start with some of the challenges in building machine learning projects. We'll talk about the Kubeflow platform, and specifically the pipelines component of Kubeflow, which allows you to build an end-to-end reproducible machine learning pipeline. We'll get into a how-to, so you'll see some code; we'll look at a Dockerfile and get a flavor for what it takes to build a pipeline. And then finally we'll do a demo. It uses screenshots, but it walks through an actual pipeline and what that looks like in a real deployment. So again, this is not about building a model, and it's not an AI introduction; it's really about all the important things around building the model: how you manage it in a production setting, and how Kubeflow can help with that. So first, let's start with the idea that operationalizing machine learning can be very difficult. We'll talk through some specific issues for machine learning. I wouldn't necessarily say the initial build is super easy, either. In machine learning you often experience challenges with getting the right data and getting your model to the right level of accuracy, where you're solving the problem well without too many false positives or negatives. So that in itself is a challenge.
But it's the next level of challenge: how do you track all these different models you're experimenting with? How do you make sure the data being used for those models, coming from all these different upstream systems, is correct, so you don't have a garbage-in, garbage-out kind of issue? How are you tracking the accuracy of the models you have? How are you ensuring your models perform well for all of your users? There's a lot to think about once you get past the experiment stage and you're running a constant stream of new models, putting them into production, ensuring you don't have bugs in your model, if you will, or regressions where you start making very critical mistakes. How do you put some guardrails against that? That's what we're going to talk about today. So one of those challenges is a lack of continuous monitoring. Say you build a model and it's working fine, but slowly you start to see drift, where the model accuracy drops. You might run into this if you don't have a practice of continuous monitoring; in other words, once you've deployed a model, watching its accuracy and being able to take steps if that accuracy drops below a certain threshold. We don't want a crisis where everything's okay and then one day it's so obvious the model isn't working well that you've got to solve it in an emergency setting, all because monitoring wasn't in place. Another issue you can run into is what's called training-serving skew, where you may have code that's used for training, say to do feature extraction, where you're pulling machine learning features out of your data set, training on those features, and building a model. However, your serving code, the code used by your applications to do a prediction with that trained model, might be a little bit different.
Maybe there's code that's not being shared between those different parts of the process, and so a bug might be introduced where the data is transformed slightly differently coming in from one step versus the other. Maybe there's missing data, nulls, that kind of thing. So you have to be very careful to ensure that the one-time training process, where you're looking at all of your data and coming up with the model, and the runtime aspect, where you have a model and you're serving users or your application with it, keep their code in sync. That can be another source of issues. Another consideration is freshness. What we mean by that is really thinking about the business requirements of your model: how often does that data change? Here are a few examples to think about. On one extreme, you might have a model based on news that's constantly changing. On the other extreme, you might have a voice recognition model that's fairly stable and doesn't need to be retrained very often. And everywhere in between: maybe you have a model for an e-commerce site that recommends new products. If you think about it, that catalog of products may change over time, so it's useful to occasionally retrain that model to say, okay, some time has passed, I want to add new data to my model that's more relevant and current. So you want to think about a process for retraining, redeploying, and versioning that model. What we're seeing here is that AI is not a one-time thing. It's a process, a continuous improvement process, just like you might improve the code in a software application and version it again. We need to think the same way when it comes to machine learning. So we often think about data science as building a model: the fun part, where you have data, you're looking for an insight, and you're hoping to find that it classifies well or predicts the future well.
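One common guardrail against the training-serving skew described above, sketched here as a toy (the field names are hypothetical, not from the talk), is to define the feature extraction exactly once and call that same function from both the training job and the serving path, so the two can't drift apart:

```python
# features.py -- a single, shared feature-extraction function (hypothetical).
def extract_features(record: dict) -> list[float]:
    """Turn a raw record into the numeric features the model expects.
    Both training and serving call this exact function."""
    return [
        float(record.get("miles", 0.0)),
        float(record.get("fare", 0.0)),
        1.0 if record.get("payment_type") == "credit_card" else 0.0,
    ]

# Training path: build the feature matrix from historical records.
training_rows = [{"miles": 3.2, "fare": 12.5, "payment_type": "credit_card"}]
X_train = [extract_features(r) for r in training_rows]

# Serving path: the request handler reuses the same function, so any
# change to the feature logic automatically applies to both sides.
request = {"miles": 3.2, "fare": 12.5, "payment_type": "credit_card"}
assert extract_features(request) == X_train[0]
```

If the two paths instead carried separate copies of this logic, a change to one (say, a new encoding of payment type) would silently skew predictions.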
And it's super exciting to go through that process and get to that point. However, that's just a piece of the puzzle. Once you've done that, there are all the things you might need to think about in a production system, and we've talked about some of these so far. Ingesting the data; transforming the data, since the data coming in from other systems often isn't in the form a machine learning model needs in its raw state; validation on both sides of the model, the data coming in and the model outputs going out; and things like monitoring and logging of the model. So there are all kinds of considerations as you go through your machine learning process. We've talked about the process and the considerations; now let's look at it from a different angle, which is the stack, the architecture, and ensuring consistency. You can think about all the different layers in terms of your team members, your dev environment, and your production environment. They have to be consistent across the board, everything from the model to the tooling, the version of the framework you're using, GPUs, et cetera. If you think about it, the process we just talked about needs to operate the same way on these various stacks. Thinking about it like we would in a software development scenario: dev, test, and prod need to be aligned, and all team members need to use the same versions. If you're using Python and doing pip install, are you working from a common set of requirements? All of these things are important to ensure reproducibility of your machine learning system. So this is where Kubeflow comes into the picture. Kubeflow's mission statement, let's break it down into pieces here, starts with making it easy for everyone.
So it was designed for everyone, from beginner users and data scientists who don't necessarily need to get into all the infrastructure details of Kubernetes and Docker and can work in Python with reasonable defaults. But since it's open source, there is the ability to customize a lot of things in the stack. The other thing you see here is that it's about the whole life cycle: develop, deploy, and manage. So building the model, deploying it, and then what you do after it's live. Finally, I think the other key point here is: why Kubernetes? First, portability. We talked about having uniform infrastructure; if you abstract away the underlying architecture, you can port your processes between the cloud, on-prem, et cetera, all these different environments, and ensure they work the same. Another great thing about using Kubernetes is its distributed nature. Running jobs on clusters and in pods really helps scale out work, like a training job where you have a one-time large training set that might take all night to run. If you can run on a Kubernetes cluster, maybe you can speed that up dramatically, especially in the cloud with auto-scaling and auto-provisioning: horizontally scale out, finish your job, and then shrink the number of pods allocated for it. So it fits very nicely with the Kubernetes model. And one thing I have found useful: if you're trying to standardize your DevOps processes across your company, it's nice to have one stack, one set of best practices, and one learning curve for managing your software applications and your machine learning, putting them all on one common infrastructure, Kubernetes. So that's Kubeflow in a nutshell. Let's talk about some of its key capabilities. I'm going to work my way clockwise through this slide.
So Jupyter Notebooks is the IDE, if you will, the development environment for writing data science code; we'll see a little bit of that. There are training operators. An operator is a Kubernetes construct that allows you to do something; basically, it's a custom API on Kubernetes. There are training operators for a variety of machine learning frameworks that let you run training jobs and get a model as a result. There are workflow building and pipeline capabilities, which we're going to talk more about, that I think are really key to getting the same result every time. There are data management capabilities to version the data and ensure you're working with the same reproducible data set every time. There are a variety of other tools; if you've done some data science in the past, you might be familiar with a concept called hyperparameter tuning, where you might search across all the different possibilities of what learning rate or batch size to use. This is another capability that fits well with distributing load over Kubernetes, where you can do that search over your cluster to find the optimal parameters. Then you have metadata management. Maybe you want to track data about your model: what framework was used, what date it was created, and so on. So you can attach metadata to the different objects. And finally, serving: hosting the model in your cluster once it's built, providing a REST endpoint for you to use the model. So all of these different things, from end to end, can be accomplished with Kubeflow. To sum it up, we talked about portability and scalability. We're going to get more into composability: how do you build a reproducible pipeline? It cuts across all the different parts of the machine learning lifecycle. And finally, another key point is that it's designed for machine learning.
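The hyperparameter search just mentioned is the job of Kubeflow's tuning component, Katib, which parallelizes trials across the cluster. The sequential toy below just illustrates the idea of a grid search; the scoring function here is made up, standing in for a real training run:

```python
import itertools

# Hypothetical search space over learning rate and batch size.
search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
}

def evaluate(learning_rate: float, batch_size: int) -> float:
    """Stand-in for a real training run that returns validation accuracy.
    This toy function peaks at lr=0.01, batch_size=32."""
    return 1.0 - abs(learning_rate - 0.01) - abs(batch_size - 32) / 1000

# Grid search: try every combination, keep the best score.
best_params, best_score = None, float("-inf")
for lr, bs in itertools.product(*search_space.values()):
    score = evaluate(lr, bs)
    if score > best_score:
        best_params, best_score = (lr, bs), score
```

On a cluster, each `evaluate` call would be its own training job, which is why this workload maps so naturally onto Kubernetes pods.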
Part of that design for machine learning is the idea of using specialized chipsets like GPUs and TPUs to improve performance. You can mark various pods and set labels so that, say, on the part of your cluster with GPU-enabled VMs, you can send a workload to those GPUs. A lot of capability like that is built into the platform. From a design perspective, this was built to fit well with Kubernetes constructs. It's not some proprietary design pattern; it uses the Kubernetes APIs, if you will, and we'll look at that in a little more detail later. It really uses those Kubernetes design patterns, and it supports a variety of frameworks: TensorFlow, PyTorch, Scikit-learn, XGBoost, and many more. So it's very open in that way. Let's look at the components briefly. Any workload coming in goes through an ingress, as part of Kubernetes, and the work gets distributed to the right place in the cluster. You really have about three main areas here: the dashboard, a user interface where you can run your data science notebooks, write code, view your pipelines, et cetera; the operators, where the heavy lifting of running the training jobs actually happens; and then serving, which we talked about before, where you host the model in your Kubernetes environment. All right, so how do we get started? Here we see some YAML files; these are all in GitHub, by the way. These are just a few of the many examples available in GitHub, where you run a CLI command. It's similar to kubectl if you've used Kubernetes; for Kubeflow there's kfctl. What you're doing here is applying a YAML file, and that YAML file has the instructions to set up the cluster and the services for you and configure everything: permissions, firewall, et cetera. And you see there are a few configurations for various platforms.
And that's all it takes to get started with Kubeflow. We're going to walk through several different screenshots showing how to implement these capabilities, from development to training your model, serving it, and orchestration. Okay. From a development perspective, there's a concept of a notebook server. You would define various servers using a Dockerfile. Maybe you use the out-of-the-box Dockerfile that has, say, TensorFlow pre-installed and is configured for GPUs. Or you could say, at our enterprise, we have a different version of TensorFlow or some different dependencies we want to bake into our standard image. You can do that and add it to the list of notebook servers. Each notebook server can then be preconfigured with different settings for how much RAM and CPU to allocate, and then your data scientists can work in that environment. Okay. Now let's look at how training works. I'm going to get into probably more detail than most folks would typically need, but I think it helps illustrate what's happening under the hood. Just like we saw the installation YAML to set up the cluster, here's an example of how training works. This is a code snippet from the PyTorch training YAML; again, this is in the github.com/kubeflow repo if you want to take a deeper look at some point. It really just defines the configuration details: which Docker image to use for training, which file to run. And you can set things like what you see at the bottom, a limit saying it can use up to one GPU. So this is the descriptor for how the training job works. And then you'd use a standard Kubernetes command to launch that training job, as you see on the right side: kubectl, and then it launches that job.
And then from there, you can again use standard Kubernetes commands to monitor the job. Here we might look at that particular job, and you're going to see the log as it prints out the loss and how many epochs, how many passes through the data, it has performed. You see all of that in the standard way of looking at container logs. So that's training. Let's talk a little bit about serving. With serving, there is an abstraction layer where there are two different operations you can perform, and various frameworks will implement them slightly differently, but it's the same concept. There's a predict operation and an explain operation. Predict will predict, or infer: given something new that you give to the model, what result are you going to get back? And you have two different endpoints you can define: your default, and then it supports the concept of a canary deployment, which is a technique where you might say, I'm testing this out, it's in alpha, and I don't necessarily want to roll it out to everyone yet; I want to see how it performs first. So I'm going to create a split where 99% of my traffic goes to the default endpoint and 1% goes to the canary endpoint, and we'll see how it goes with that new model. Then over time you shift that split, and if things are going well, the canary model becomes the default model. So it's a nice capability for piloting new models. Once you're through that endpoint, you have the concept of a transformer. The transformer works by taking the raw data that's passed in for a prediction, where maybe you have to tweak it a little bit before it goes into your model. What I mean by that is, say your model is built in such a way that it expects:
A month integer, a day integer, and a year integer; let's just say those are the variables passed into the model, and the output is some prediction. But maybe that's extra work for the API consumers, and you simply want folks to be able to pass in a date object and not have to go through all of that. That's where the transformer might come in: it's a layer that does some preprocessing, and it might split that date into the components your model expects, and then you can use it from there. Most of the time you'd do predictions with that and get the result. In other cases, you might do an explanation, and explanation is a capability to ask: why did the model do what it did? What were the important factors in making that decision? That's an optional capability that's supported as well. So that's the serving component in the framework. Now let's move on to pipelines, and this is really going to be the focus of the talk for the remainder of the time. Pipelines allow you to build end-to-end workflows using the different components we've been talking about. So let's talk about how we might implement a pipeline. I'm going to go over a few code snippets; these again come from the repo of examples. This particular snippet is for GitHub issue summarization, which I think is a really cool demo. What it does is look at a GitHub issue's text and give you a summarization, maybe a one-liner that tells you what the issue was about. So it's this idea of text summarization, which is a natural language capability in machine learning. As for how you create a pipeline: there's a domain-specific language, basically a markup that your data scientists can add to their code to define the pipeline.
That's what you're seeing here: a name, a description, and then the pipeline itself, where you have parameters coming into it: how many training steps do I want, what is the directory for the data I'm going to use, and so on. This becomes the input to the pipeline you define here. Okay, so let's keep working through this. We defined a pipeline; now let's look at the steps of that pipeline, the components. The first piece you see here is importing the package for components and then pointing to a descriptor. There's a YAML file out there, in this case on GitHub, that defines the interface for this component. In this case the component is a simple example, a copy component: it takes data from a source to a target. The next step is to instantiate that component: you have its description, and now you're actually creating an instance of it. The next thing I want to show is how you create dependencies between components. How does Kubeflow Pipelines know which step comes after which? Say we defined a log-data component in a similar way; we could say that log data comes after the copy-data component, explicitly specifying the order. What you can also do, though, is let it infer the order; it's smart about this. If log data had, as one of its inputs, the output from copy data, Kubeflow would wire those components together in that order, so you'd ultimately end up with something like what we saw in the last screenshot. That's how it figures out the order of operations, either implicitly or explicitly, from how you define the pipeline. Okay, so now you've created your pipeline; how do you deploy it? There is a package to compile your pipeline, and what it does is take that Python code, turn it into a YAML file, and then zip up that YAML file.
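That implicit ordering can be illustrated outside of Kubeflow with a tiny dependency resolver (a toy, not the actual pipelines implementation): if one step consumes another step's output, it has to run later, which is exactly a topological sort of the step graph.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Each step maps to the set of steps whose outputs it consumes.
# (Hypothetical step names echoing the copy-data / log-data example.)
dependencies = {
    "copy_data": set(),            # consumes nothing from other steps
    "log_data": {"copy_data"},     # consumes copy_data's output
    "train": {"copy_data"},
    "evaluate": {"train"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```

Declaring the order explicitly, as with the `.after()` style the slide shows, amounts to adding an edge to this same graph by hand.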
Once the pipeline is compiled, you can upload it programmatically or via the UI we're seeing here, and then it's accessible in the dashboard. Your team can now use that pipeline, and it's parameterized: you can use the same pipeline in different contexts with different parameters, and this is what we're going to see in the demo. Okay, so we talked about the pipeline; now how do you use components? Do you have to write them all yourself? No, there are lots of reusable components written for various cloud providers as well as standard components. You see one example here on the right, a BigQuery component, for when I need to extract some data to use in my training job, and I've put a list on the left of all the different types of components that are already available. You can pass in the inputs and see what the outputs are; they're well documented. So it's a great place to start. I also wanted to add that if you happen to be using TensorFlow for development, TensorFlow Extended (TFX) is a great library of out-of-the-box components for doing machine learning ops. All of these components can be used in your workflow, and we're actually going to walk through them in the demo section at the end, so I won't go into too much detail right now. TFX can be used either through its own SDK or through the standard Kubeflow SDK I just showed. All right, let's briefly talk about how you would build your own custom component. I know there's a lot of information here, but I really want to illustrate the three steps that are required and hopefully show how it all fits together. First is writing your descriptor. That's what we have here: the name of the component, its description, its outputs, and then, a key part to link it all together, the link to the container image, so Kubeflow knows where it can actually execute this code. Here, for example, it points to the Google Container Registry.
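To make that concrete, here's a sketch of the kind of entry-point script a custom component's image might run; the argument names are made up, echoing the pipeline parameters mentioned earlier (data directory, number of steps), and the "work" is a stand-in for whatever the component actually does:

```python
import argparse
import pathlib

def run_component(argv=None):
    """Minimal component entry point: parse arguments, do the step's
    work, and write the result where downstream steps can pick it up."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", required=True)
    parser.add_argument("--train-steps", type=int, default=100)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args(argv)

    # The real action (training, copying, etc.) would happen here;
    # this toy just records what it was asked to do.
    result = f"trained on {args.data_dir} for {args.train_steps} steps"

    # Components typically write outputs to a file for the next step.
    pathlib.Path(args.output_path).write_text(result)
    return result

result = run_component(
    ["--data-dir", "/tmp/data", "--train-steps", "500",
     "--output-path", "/tmp/component_output.txt"]
)
```

The descriptor's container-image link plus the Dockerfile's entry point are what connect the pipeline definition to a script shaped like this one.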
So that's the description of your component in YAML. The next part is a Dockerfile, which describes what's in your image. You might start from a base image, say TensorFlow at a certain version with GPU support, update that image a little bit, install a couple more packages, and then specify the entry point: here is the code within the image that I'm actually going to run my component with. Okay. And then finally the code itself, the entry point into the image. Here's where your code might parse the different arguments, like you saw in our Python script before: the data directory, the number of steps, and so on. You parse those, then you actually run the code, performing some sort of action, and then you set the output. That's what's really happening under the hood. Okay. To summarize what we've seen so far: Kubeflow is a cloud-native, multi-cloud solution for ML. It allows you to build pipelines, and we're going to see a demo of that shortly. And if you have Kubernetes, you can run Kubeflow; it's a platform that runs nicely on top of Kubernetes. Kubeflow is an open community, with all kinds of ideas coming from many different places. If you want to find out more, here are some resources; kubeflow.org is probably the best place to start, and it has links to all the other resources. All right, let's spend a few minutes going over a demo. This is using screenshots, but just imagine I have my web browser running while going through this. I apologize if it's a little hard to read, so you might want to maximize your window if you haven't already. Here is a user interface for seeing your pipeline instances in the cloud; this happens to be on Google Cloud.
You see that there's a list of different instances, each pointing to a cluster and a version of Kubeflow Pipelines, with a link to open that pipeline's dashboard directly from the cloud. Say we clicked on that link and opened the dashboard. Here's what you would see: this getting-started page, and a set of options on the left side. Let's walk through those. Okay. Here is the list of pipelines you would see. If you recall, I showed how to build a pipeline, and here's where you'd see it if you uploaded one. In the top right corner there's a link to upload a pipeline; here's where you'd say, okay, I've built my pipeline, let me now make it available to the rest of my team. Okay. And so now what we're going to do is walk through an example with the Chicago taxi cab data set. This is a well-known data set, and it's what's called a classification problem: we're looking at the tips given to taxi cab drivers and trying to predict, based on miles driven and all these different factors, is the tip going to be greater or less than 20%? You can find out more about this model in the example, and I'll give you a link at the very end. Okay. Here's the graph you'd see if you wanted to look at this demo. And by the way, this is already built into Kubeflow Pipelines; it's one of the examples provided, so if you just fire up a Pipelines instance, you'll see this as one of the examples. This is the graph. It doesn't have any status because it's just the abstraction; it's not an actual run of the graph, it's a description of the steps, and we haven't actually run it yet. Okay. Now let's look at starting a run. You would fill out some information here; you can see the name and a description of my run.
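For context on what that taxi pipeline is predicting, the classification target can be sketched as a one-line labeling rule (a simplification of my own; the actual example's preprocessing is more involved):

```python
def tip_label(fare: float, tip: float) -> int:
    """Label a ride 1 if the tip exceeds 20% of the fare, else 0.
    This is the binary target the taxi model tries to predict."""
    return 1 if tip > 0.2 * fare else 0

# A generous tipper vs. a modest one on the same $10 fare.
labels = [tip_label(10.0, 2.5), tip_label(10.0, 1.0)]
```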
On that same run form, you can even specify things like: is this a one-off, or is this a recurring run that I want to run monthly, something like that. Let's go to the next slide. Here's where we see a list of your runs; you'd see an existing run and then a run that's in progress, with how many minutes it's taking. If you've used CI/CD tools in the past, it looks a little similar: how did my Jenkins job go, was it successful or not, seeing a status and things like that. So here's an in-progress execution. You see the steps that have been performed so far and what's still to be done in the graph. If I click on an individual step in the graph, I see on the right side a variety of things: the artifacts, the inputs, the outputs, things of that nature. So it's a way to really drill down into what's happening. My first step is called ExampleGen: it extracts from a CSV file into examples, or records. The next step here is StatisticsGen; by the way, these are the TFX, TensorFlow Extended, built-in components. With it I can get a picture of the data coming into my model, and on the next screenshot, if I zoom in a little, we see the features in my model. You get nice summary statistics: how many rows of a particular column were null, what were the mean, the min, the max, what's the data distribution. It lets you see at a glance what the data in your job looks like. The next thing is a component called SchemaGen. Here you can look at the data coming into your model; think of this as guardrails, a way to ensure that the data, whether a column is nullable or not, what the columns are, is the same every time. It can infer what the schema is from your training set. Now the next step, this is pretty cool.
This is the data validator, and it can use that inferred schema as new records come in, in this case the test set, the holdout set we're using to see how useful the model is. We're seeing a couple of anomalies. For one example, we see the payment type; again, it's a little hard to read, but on the last screen we saw there are about six different payment types: credit card, check, things of that nature. Here there was a new payment type that had never been seen before. So now, as we're evaluating this, we can say: okay, maybe our schema needs to be broadened a little bit and this is valid; or, this is outside the bounds, these records should be rejected, and there's some issue we need to resolve because we shouldn't be seeing that payment type. So it's a great way to validate the data coming into your model. Okay. There's a transform step, and I'm going to move a little quickly to make sure we have enough time for questions. The transform step, as we talked about, is where we can change the data if we need to. You can see now that here's the actual training. As your model is being built, if you want to see the TensorFlow logs, the loss going down over time, you can click on that; on any of these steps, you can see what's inside the container with the logging. Then there's a model analysis step where you can see a graph of how well your model is performing, see statistics sliced in different ways, and make sure it's working well for all of your users. And near the end is what's called the pusher step. You can have logic to say: if this model is better than my previous model, if it has a higher accuracy, it can be so-called blessed, a blessed model, and then pushed to production. So this is a way to have some rules around pushing only what you want into your production system.
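The schema-inference and anomaly-detection idea from those SchemaGen and validator steps can be sketched in a few lines (a toy of my own; TFX's schema also covers types, value ranges, and presence, not just categorical domains as here):

```python
def infer_schema(training_rows):
    """Infer the set of allowed values per column from the training set."""
    schema = {}
    for row in training_rows:
        for column, value in row.items():
            schema.setdefault(column, set()).add(value)
    return schema

def find_anomalies(schema, new_rows):
    """Flag values in new data that fall outside the inferred schema."""
    return [
        (column, value)
        for row in new_rows
        for column, value in row.items()
        if value not in schema.get(column, set())
    ]

train = [{"payment_type": "credit_card"}, {"payment_type": "cash"}]
test = [{"payment_type": "cash"}, {"payment_type": "prcard"}]
schema = infer_schema(train)
anomalies = find_anomalies(schema, test)
# "prcard" never appeared in training, so it gets flagged, mirroring
# the unseen-payment-type anomaly in the demo.
```

At that point it's a judgment call, just as in the demo: broaden the schema to accept the new value, or reject those records and investigate upstream.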
And then finally, to wrap it up: everything you've done here shows up in an artifact list. The data and the model point to a storage bucket where you can take a deeper look. You can see, for example, your model; here's where metadata management comes in, where you see the state of the model and other standard metadata, as well as custom metadata you can put in. There is a lineage explorer where you can see all the different steps and data and how they're linked together. If you want to try out this tutorial, here's a link to it: tensorflow.org, and then you go into TFX, TensorFlow Extended. It has a nice tutorial that runs in 45 to 60 minutes and uses the free tier, so there's no charge to try it. I think it's a great end-to-end example to get started with. Okay, so that was it. I have about a minute to answer questions; sorry there's not a ton of time, but what I will do is go to the Slack channel after this talk, and then we can get into more detail. Let's start with a quick deployment question; I think it answers a couple of the questions that were asked. What's the best way to deploy Kubeflow in a self-managed Kubernetes cluster? In the installation, there are basically two steps: one is to create the cluster for that cloud environment, and the second is to lay down the Kubeflow services. So you can decide, hey, I've got an existing cluster, and just use the YAML that installs the Kubeflow services themselves. That allows you to use this in an on-prem or cloud environment in different ways. So I think I'm out of time, unless the moderators tell me I have more. I'll wrap up here. I really appreciate everyone's time. Thank you so much. Again, I'm going to move to the Slack channel to answer all the questions. I'm glad it was a useful session for you all. Have a great rest of your day. Thank you.