Thank you so much for the opportunity to come and talk to the inaugural Kubernetes AI Day. I have to say, it is really astounding to see how far this community has progressed in the past few weeks and months and years. I couldn't be more impressed with how people are using Kubernetes for all their AI needs. As the slide says, I'm David Aronchick, and I'm here to talk about a new project in Kubernetes and Kubeflow called the SAME project, whose goal is to address the challenges of reproducible machine learning.

When I first got started looking at this problem, I realized one of the biggest problems we have is around vocabulary. What you see here are some basics that I just want to cover for this talk. When I refer to a program, that refers to everything: the data plus the algorithms plus the workflow. And when I refer to an experiment, that's a specific instantiation of a given program. Now I'm going to get this wrong, of course, but I hope that you can bear with me and keep me honest when I make a mistake.

So one of the things that we do really well in the ML industry is talk about how great ML is. And it's no joke. Here you can see a study of Google applying ML to its data centers. When they applied ML, they saved tons of energy and huge amounts of dollars, and you can see that in the graph here, where they turn it on and off, and how dramatic the impact is. Now that said, one thing we don't do well in the ML industry is talk about how hard ML is. And of course, it's not just because I'm bad at math. It's because the gap between that magical AI goodness and most folks is often an enormous amount of pain.

And the pain shows up like this: a model works great on someone's laptop or in a staging environment, but when it comes to moving it to production, it can often take many weeks or months to get it out there. That's oftentimes because the systems are so different and bespoke. So this leaves users with effectively two options. One, they can do it themselves, which involves setting it up from scratch, integrating with legacy systems, and eventually probably redoing the whole thing again because they need to migrate based on their business. Or two, they can adopt a proprietary solution, which is oftentimes a lot easier to get going. The first five minutes are fantastic. But over the next five years, you're basically doing all the things you did with do-it-yourself, setting it up from scratch, integrating with legacy systems, migrating on business needs, except now you're also locked in.

So haven't we heard this story before? The answer is yes. When containers first came about, there were a lot of proprietary solutions for orchestrating your containerized workloads across your infrastructure. And then came Kubernetes, an open source, very broadly developed project, which advanced the idea of cloud native applications. Cloud native applications are the idea that you can simply and declaratively roll out a deployment of even something complex, and the orchestrator takes care of rolling everything out for you. So what we really need to do in order to break out of where machine learning is today is invent cloud native machine learning. What does that look like? Well, I think it basically breaks down to three major concepts. The first is composability. Now, when we talk about machine learning today, we often talk about just building a model.
But data scientists and machine learning engineers know that building a model is often just a small subset of your overall machine learning pipeline, which runs from data ingestion and transformation, through all the various sweeps around how you actually train, to finally rolling it out and serving it, including monitoring and logging it when it's in production. And each one of these components is oftentimes a different tool or system that your team is already well aware of and knows how to use, so you're really not going to want to swap anything out. When we looked across Microsoft Research, we found over 159 different tools that people use to build out a sophisticated machine learning pipeline. And if you asked a machine learning engineer or a data scientist to go and change the tool they were using just for consistency, I think you'd have some very unhappy customers. The idea behind composability is: how do we allow data scientists and subject matter experts to pick exactly the tools they would like, without causing any difficulty in the way that we roll things out? That's composability.

Second is portability. I talked briefly about all the software components that are part of a pipeline, but there are oftentimes many more components when it comes to rolling out an individual solution. In this case, if I looked at all the things that I had to configure as part of my experimentation stack, I would probably have to do all of these and more, where that existing pipeline we had is just a small subset of them. And really, when it comes to configuring all this, you have to go all the way down the stack. Worse, you don't just have to do it in one environment; oftentimes you then have to go and configure brand new environments, one for training, one for your cloud or production deployment, and so on.

And then we get to scalability. A lot of times, scalability in machine learning is just about more machines and CPUs and disk and networking and so on, but true scalability goes beyond that. It's also about the number of programs being released, or scaling your teams by adding more skill sets: maybe you want a SWE, maybe you need a data scientist, an ML engineer, a data engineer, and so on and so forth. Or scaling teams where you have composite models, or teams across different organizations, or maybe even going outside of your organization entirely. At the end of the day, people just want more programs faster, but there are lots of ways to get at that, and what we need is a system that supports that kind of scale no matter which scale is important to you.

So that's what composability, portability, and scalability are. You know what's really good at these things? Containers and Kubernetes. Except if you want to use machine learning on Kubernetes, unfortunately, this means becoming an expert in a whole lot of things: containers, packaging, networking, storage, immutable deployments, GPUs, literally everything. And that's just too much for any data scientist to have to do when all she wants to do is roll out a model. So let's build something to give everyone cloud native ML and embrace those concepts. That's the SAME project. It stands for Self-Assembling Machine Learning Environment; the L is silent. And what we mean by that is: we're going to start with Kubernetes as the baseline construct, because it is available everywhere.
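As a quick reminder of what declarative means on that baseline, here is a perfectly ordinary Kubernetes Deployment, nothing SAME-specific: you declare the state you want, and the orchestrator converges the cluster to it. The image name is a placeholder.

```yaml
# A standard Kubernetes Deployment manifest: declare the desired state,
# and Kubernetes takes care of rolling everything out and keeping it there.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3                    # ask for three copies; the orchestrator maintains them
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: server
          image: registry.example.com/model-server:1.0   # placeholder image
          ports:
            - containerPort: 8080
```

The SAME idea is to extend that same declare-and-converge experience from single applications up to entire ML pipelines.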
And then you can declare a SAME and stamp out that SAME in all of those individual locations, where the core of the code doesn't change at all, but you're able to pick up environment variables that change the configuration of the way your SAME pipeline runs as you move from place to place. Big companies solve this by having a centralized dashboard that everyone logs into and a common orchestrator under the hood. But the problem is, that still hurts when it comes to collaboration. Oftentimes, even when you cross internal organizational boundaries, you might have completely different versions of these stacks. So what we need to do is build something that gives this to everyone, even those internal folks who may already have sophisticated internal pipelines.

The SAME project's goal is to build a portable ML program deployment model. It's got to have a low bar to entry. It must work on your local laptop and be compatible with your existing systems, but really have portability as a first class construct, allowing you to pick and choose the components that are important to your pipeline, and be built to share in open spaces. You shouldn't have to go and log into some central dashboard in order to have something work, even if it's an external service; it must work locally and detached as well as in the cloud. And I like to joke that our slogan is machine learning without the letter K, right? A data scientist should never have to know Kubernetes. They should never have to know Kubeflow, or Docker. Docker has a K in it. They shouldn't have to know Docker or containers. What we want to do is build an abstraction where they're able to operate in the languages they know and love, Python and R and so on and so forth, but are able to then migrate that to production in a really elegant way.

So here's how it might work. The reason I say might is because this is very early. We've had a lot of great feedback so far, but we would love to go even further and make better choices through the advice of the community. We start with this data scientist, and she says, oh, you know, here's a recommender out here in the world that I would love to use. The first thing she's going to do is install a very simple CLI. This CLI is available publicly, it's actually already available now, and it's very simple to install; it'll download that binary to her local laptop. We already have builds for Linux and Mac, and we're going to be adding Windows. After that, she clones the initial repo. Again, just using standard git functionality, she's going to clone it and have a local copy of that entire repo on her machine. After that, she's going to install a local Kubernetes. In this case, we're going to use k3s, a stripped-down version of Kubernetes that runs on local laptops. We're open to lots of other options, MiniKF or k3d or whatever works best, but we think the most important thing is that you have a local, lightweight version of Kubernetes that properly represents what it looks like when you get to production, so that you can start doing a lot of debugging locally. It goes through, does that installation, and also installs a small CRD, a custom resource definition, that will allow us to do very simple deployments of these pipelines to Kubernetes in the cloud.

Okay, so now she's ready to go. The first thing she's going to do is type same init. This initializes the Kubernetes cluster that she has locally and installs a bunch of core components: the dashboard and the pipeline services.
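Just to make that flow concrete, end to end it might look something like this on the command line. Treat the install URL and repository name as placeholders I'm making up for illustration; only the same init and same run subcommands appear in this talk.

```bash
# Sketch of the getting-started flow described above; URLs and repo names are placeholders.
curl -L https://get.sameproject.org | bash          # install the same CLI binary (hypothetical URL)
git clone https://github.com/trey-research/housing-prices.git   # clone the shared program (hypothetical repo)
cd housing-prices

# One-time step: the same installer sets up a local, lightweight Kubernetes such as k3s.

same init    # initialize the local cluster: dashboard, pipeline services, the SAME CRD
same run     # provision services, deploy the pipeline, and execute it with its default parameters
```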
The init step also installs a metadata store. Now, it doesn't have to do that locally; it could certainly use one in the cloud, if that's how you configure it. But again, our goal here is to help people get started by default. So if you have nothing, if you have no centralized infrastructure or storage in the cloud, this works great.

After that, she goes into that directory, and then she just types same run. And from that point, everything is taken care of for her. The first thing it's going to do is provision any hardware, if necessary. Now, if you're on your local machine, obviously, you're not going to provision new hardware; you can't make a laptop out of nothing. But it will mock that up and use your local GPU if that makes sense, as well as any premium disk. If you're in the cloud, this would use things like the Cluster API to roll everything out. It then goes through and deploys the services necessary. In this case, this particular pipeline might need Feast, might need TensorFlow, might need Katib, and it will roll those services out so they are ready for when the pipeline actually begins to execute. After that, it's going to go through and download data from an external URL. This is something we saw very commonly, where you might use an external CSV or something like that, and you'll want to make a local copy of it. We'll do that for you, if you'd like, or we'll connect to an external database, if that makes sense. It then deploys the pipeline to the cluster. And again, this pipeline is described in code using straight Python, and it deploys to that Kubeflow pipeline using the integrated tools that Kubeflow offers. In this case, it's a three-step pipeline: one feature store step that uses Feast, one training step that uses the TFJob CRD, and one hyperparameter tuning step. And for the final step, she's going to import the history that came from previous runs. In this case, the original authors of this pipeline decided to publish eight previous historical runs with all the associated metadata, and we're going to insert that into the metadata store that she's chosen to use here.

Now everything is up and running, and SAME continues on and actually executes this run for her. In this case, it's going to use the defaults that are set for that experiment, and it's going to go through and execute it. From that, it'll write out both the model, to her premium disk, and the run metadata, to her metadata store. If she wants to make a change, she can then rerun it, in this case adding some additional run parameters. By changing those run parameters, she kicks off a new run, in this case experiment nine, where again it will record both the artifacts and the results into her metadata store. And as you can see there, the numbering starts from eight and nine and not at zero, making comparisons to previous runs, even ones that she didn't run herself, much, much easier.

Okay, so that's a lot of architecture. Let's see this thing actually running. To set the context here, the first thing we have is a very simple model that was published by Trey Research's AI division. This is a housing model designed to explore different ways of predicting housing prices. A model like this might have been previously developed by someone internal to the company, and our data scientist wants to come along and try to reproduce it, to see if she's able to get as good or better results using updated data.
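Before the demo, here's a loose, hypothetical sketch of what pipeline code in that shape could look like. This is not the actual Trey Research code; it just mirrors the structure described next: two download steps, a training step with simple metrics recording at the bottom, and a single coordinating function.

```python
# Hypothetical sketch of a housing-prices pipeline in plain Python;
# names, URLs, and the model choice are illustrative only.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def download_features(url: str) -> pd.DataFrame:
    """First download step: fetch the housing features from an external CSV."""
    return pd.read_csv(url)

def download_targets(url: str) -> pd.DataFrame:
    """Second download step: fetch the sale prices."""
    return pd.read_csv(url)

def train(features: pd.DataFrame, targets: pd.DataFrame) -> None:
    """Training step; the print at the bottom is the simple metrics
    recording that the pipeline surfaces later."""
    model = LinearRegression().fit(features, targets)
    predictions = model.predict(features)
    print(f"mse={mean_squared_error(targets, predictions):.4f}")  # metrics recording

def root(features_url: str, targets_url: str) -> None:
    """The single coordinating function: two downloads, then training."""
    train(download_features(features_url), download_targets(targets_url))
```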
So as you can see, this is just very standard Python code that you would use in any other application, with the addition of just two things. The first is some code down here at the bottom that does metrics recording. You can see that here; it's just a very simple recording that we'll be using for output later. And then you have this single function here, which is how you coordinate everything together. In this case, we have two download steps before we do a training step on the results. Pretty straightforward stuff, but done using very standard Python.

So let's get started. The first thing we're going to do, like I said, is clone that repo to our local machine, which works fine. From there, we're going to go into that directory, and now we're ready to go. Next, we need to install the same binary. In this case, it's just a standard URL, using the very secure method of piping a shell script to bash. Now it's installed. And like I said earlier, we're going to install Kubernetes by using the same installer. Now, I hope you don't mind, I've sped this up a little for demo purposes. Normally this takes maybe about five minutes, but it is just a one-time thing in order to get up and running on my local machine. After I do that, I'm going to initialize that Kubernetes using same init, which understands that it just installed Kubernetes and that everything is configured, and it's going to go through and do that installation for me as well. So now we have both Kubernetes and Kubeflow up and running, with no complexity. I never had to type kubectl or touch a kubeconfig or anything like that. It just works. Again, forgive me for speeding this up just a little bit for demo purposes. Normally this also takes about five minutes, but again, this is something you only have to do the one time.

So now that we have that up and running, let's go and run it. I'm in my directory, I have a SAME file in that directory, and we're going to execute it. As you can see, it really is that quick. Once I hit execute, it goes and uploads that pipeline to my local deployment. I'm going to open the UI there, on my local deployment on port 8080. You can see here, this is real, live running code. And when I go and click into this UI, I'm able to see that housing prices pipeline that I just uploaded. There's the entire graph laid out, and it's already beginning to execute for me. When I click on that housing prices pipeline, I can see the run, and here's the graph that backs the run. And when I click over on the configuration, I can see the parameters used for the run, which just so happen to be the exact parameters that I entered in the same.yaml file, showing exactly how we can declaratively execute those runs from a simple manifest file.

Now, sometimes you're going to want to execute using nothing more than command line flags. This could be because you're not sure about the results, or you want to do things inline, for example during CI/CD. And SAME supports that as well. Here you can see we're going to specify a run parameter right there on the command line and re-execute the same pipeline we ran earlier. And as you can see from the results, the parameter automatically came through, and you can view it. So you have two runs going simultaneously, differing only by the parameters that you specified. Finally, we're going to show you how to do things declaratively as well. Let's pull up that SAME file again.
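The schema is still settling, but the file she pulls up might look roughly like this. Every section name and value here is an illustrative sketch, following only the sections described later in this talk, not the project's final format.

```yaml
# Hypothetical sketch of a SAME file; all fields and values are illustrative.
metadata:
  name: housing-prices
  version: 0.0.1
resources:
  gpu: false                 # hardware this program needs provisioned (or mocked locally)
environments:
  kubeflow: "1.3"            # version of Kubeflow required
  services:                  # services to roll out before the pipeline executes
    - feast
    - katib
pipeline:
  package: pipeline.py       # the Python file that anchors the entire workflow
datasets:
  - https://example.com/housing.csv   # external data to copy locally at run time
run:
  name: default
  parameters:
    units: 21                # edit this value and re-run to kick off a new run declaratively
```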
And from there, we're going to make a change in the file itself. When I go back and re-execute that pipeline, with no different command line flags, the system automatically picks up the changed SAME file and executes using those new parameters. So now I have a declarative way, via file changes, to execute new pipeline runs. And when we go over to that dashboard and check out the latest runs, we can see run number three executing there. When we click on run number three, we can look at its configuration and see that our new parameter has been picked up and is running properly. Okay, so we've executed from the defaults, we've executed from the command line, and now we've executed from a changed file.

Let's move this to production. Now, typically this might be a pretty dramatic move for data scientists. Oftentimes you have to set up your entire environment again in order to get it working properly. But in this case, we have built into SAME the ability to set up a vanilla Kubernetes cluster, whether it's on your local laptop or in the cloud. So we're going to switch from pointing at our local laptop to pointing at the cloud. This is the one time you do have to use kubectl, and we are looking at alternatives here. But once I've pointed at that cluster in the cloud, I rerun same init, and it automatically installs the Kubeflow components necessary for this particular pipeline run. And once that installation is done, it's ready to go.

So we're actually going to go and rerun those three things you just saw me do earlier. The first thing we're going to do is execute that pipeline run using our defaults. As you can see, I executed same run, and it worked exactly as I expect: when I go to the UI and hit refresh, I can see that my run completed exactly as expected. Now we're going to go back to the command line and re-execute again, using a command line flag rather than the file. Here we're being very leet with our command line flags. And when I go over to check out the runs, I can see that second run again executing on the Kubernetes cluster in the cloud, with my brand new leet run parameter. And finally, we're going to go and make a change to that SAME manifest, in this case from 21 to 44, in honor of Hank Aaron. Then we re-execute, again with no additional command line parameters; we just execute using that SAME file. And when we go over to the Kubeflow dashboard to check it out, we can see that third run also executing, again picking up the parameters we changed in the file. So you can see, again, three different ways to execute, each of which is declarative and is something you can use in production.

Now we're all set, and I want to commit my workload to the cloud. The first thing I'm going to do is start with a very simple git command. Performance looks good, so let's check it in. I execute it and push it up, and now, very handily, I am reminded to go check out GitHub. So let's do that. We pop over here to GitHub, and we can see our Trey Research AI division repo with the brand new SAME file that I just pushed up. Now we get to a very common scenario that most production machine learning environments need: they decide that everything looks good, and it's time to cut a tag and release that model to production. So in this case, we're going to draft a brand new release; in this case, it's 009.
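The release hook you're about to see might look something like the following GitHub Actions workflow. To be clear, this is a hypothetical sketch, not the SAME project's published action; the install URL and the run-parameter flag are my own placeholders.

```yaml
# Hypothetical release workflow: on a new release, install the same binary,
# run the pipeline, and attach the commit SHA as a run parameter for lineage.
name: release-model
on:
  release:
    types: [published]
jobs:
  train-and-tag:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install the same binary
        run: curl -L https://get.sameproject.org | bash    # placeholder URL
      - name: Execute the pipeline, tagged with the commit SHA
        run: same run --run-param commit_sha=${{ github.sha }}   # illustrative flag
```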
And when we execute that, we have it automatically set up, via a GitHub action, to use the same binary to download, install, and run the model. Once it's done running, it tags the run with that commit SHA, which we are able to push forward into any production environment that uses it. So forgive me for speeding this up just a little bit; it does take a little bit of time to run. This is a real model running in production on our Kubernetes cluster. And when it gets to the end here, you can see that right there, inline, we have attached the commit SHA as a run parameter. This is critical for all the things that you've heard about around lineage and tracking and so on and so forth. When we go back over to our Kubeflow running on our production cluster, we can see we have a brand new pipeline run by the name of release-009, and attached to it, you can see that same commit SHA from earlier.

Okay, so there you saw a full demo of the same binary and the SAME project. It included installation of the binary, installation of Kubernetes, installation of all the necessary pipeline components, and then running against two different environments, your local laptop and the cloud. And finally, it used those same tools and techniques in GitHub, with automated GitHub actions for CI/CD. While we are just getting started, I hope you can see the direction we're going, and how valuable it will be to have this kind of declarative workflow for machine learning engineers.

The thing that we're really anchored on is much less about the individual components and much more about the user experience. When we go out and talk to data scientists and ML engineers, the thing we hear most about is the number of new tools and techniques and systems that they have to understand, when they really just want to focus on portable, reproducible pipelines. So, from our perspective, we think that user experiences like the one you see here on this page are critical to driving machine learning forward. We want one line to install, one line to run a program, one line to change the way that program runs, maybe via a command line flag or via editing the SAME file, and then one line to export, so that you can share it with other folks in the community.

Now, I mentioned the SAME file several times. It is not designed to be particularly complicated. It's just designed to be declarative, as in the sketch you saw earlier: sections that describe metadata, the resources required, the version of Kubeflow or the services that you might require, the specific pipeline Python file that anchors your entire workflow, data sets that you might import, and then, of course, that last section, which is the run. We think this is critical, because by having the runtime parameters for your pipeline included in the SAME file, you now have a single document which is declarative and which you can run anywhere. And then, when you want to go and kick off a brand new pipeline, instead of making production changes to configuration, you come and edit this at the far left of your overall CI/CD pipeline, and you kick off an entire MLOps workload, as you saw there with me doing a full training run when I triggered a new release. We've interviewed a bunch of different people out there, and what we hear most of all is that data scientists and machine learning engineers want to focus on delivering value to customers, rather than on infrastructure.
And we hope that by making some opinionated choices here, we're able to help move the conversation forward and allow these data scientists and ML engineers to focus on what they really care about. Most of all, we think there's a real need in the community for what we call the center of the Venn diagram: something that is completely open source and openly governed, that isn't tied to any single company, commercial or not; that is based on a universal platform, where I can get the entire thing spun up on any cloud with a single credit card; and that has a declarative deployment, so that no matter where I go, the system understands how to instantiate the entire infrastructure, pipeline, and data necessary to run. That's where we think the SAME project sits.

If you would like to help out, we are just getting started. Obviously, you can go try it out yourself. We have a simple website, sameproject.org, where you can download and install, and you can come talk to us on GitHub or on our mailing list, in our Slack, all that kind of stuff. But most of all, we're so early that we really want to hear about your needs. If you're looking to figure out how to build reproducible ML pipelines and infrastructure, we would love to talk to you now, so that we make better decisions as we get going. And like I said, if you'd like, you can certainly come and join our repo and file issues, feature requests, bugs, whatever it might be, anything that helps make us better. Because in my experience, the only way solutions like this work is if they're designed with the practitioners in mind, where we're really anchoring on what your needs are rather than on what I'm seeing. Like I said, we are more than happy to take contributions. I will say it's a little bit early, and things are certainly moving around. But if you see a need for this in your environment and would like to collaborate as we try to build these things out to production, we would love to hear about it.

And with that, I would like to say thank you again. My name is David Aronchick; you can see all my contact information there. More than anything, we are really excited about this opportunity and would love to hear your feedback. So with that, thank you so much.