Hello, welcome to Kubernetes AI Day in Europe. This session is about a component registry for Kubeflow Pipelines. My name is Christian Kadner and I work for the Center for Open Source Data and AI Technologies at IBM. This is where we are based: the SVL, the Silicon Valley Laboratory, just south of San Jose, California.

This session will have two parts, a presentation and a demo. In the presentation, I will quickly talk about the AI lifecycle, introduce Kubeflow Pipelines, and point out some reusability challenges. I'll talk about pipeline components and the newly proposed component registry. That will lead me to the Machine Learning Exchange, where I will talk about some of the technologies we have integrated, and I will also quickly show Watson Studio Pipelines. In the demo, I will show what the Machine Learning Exchange is, what it can do, and also show the new template registry API. So let's dive right into it.

What are the stages of the AI lifecycle? Well, it's mostly about models and data sets. In a nutshell, we use data to build models in order to automate decisions. And each of those big steps has smaller steps: data has to be gathered and analyzed; models have to be trained, whether traditional machine learning models or deep learning models; and once they're trained, we want to deploy them and we have to maintain them.

Now, it's of course not that simple. If we zoom in just a bit, you see there are many more steps involved. I'm a software developer, not a data scientist or machine learning expert, so to me this is already way too complicated. The data preparation alone has multiple steps and stages where the data needs to be ingested, cleansed, analyzed, transformed, and validated, of course. Just before training, the data set has to be split into a training set and a test set. Then, when you're ready to train your model, you have to optimize the hyperparameters and validate the model. Once your model is finally trained, you want to deploy it on the cloud or even on edge devices. Then you run inference: you feed it data, and you get predictions out of it. All of that has to be monitored and logged. And if the predictions show anomalies over time, or you have some drift, these models have to be fine-tuned, and the whole process might have to start over again.

That is to say, the whole AI lifecycle has numerous challenges. There is a high number of steps that need to be performed. The process remains split among various teams. The artifacts created or consumed are oftentimes not shared; instead, they get recreated multiple times over. And then, of course, there are challenges with traceability and governance, risk management and lineage tracking, and metadata collection. So what's needed is a central catalog for all of those AI and ML assets that can be shared and reused across organizational boundaries. These data sets and models need to have quality checks, proper licenses, and lineage tracking, and all of that is required to speed up their lifecycle.

Now, let's start with Kubeflow Pipelines. What's the purpose of pipelines? Pipelines offer end-to-end orchestration of the machine learning workflow. They make experimentation easy, with trials and experiments.
You can manage your runs and group them into experiments. Pipelines are built with components, and those components are meant to be reused and shared. Different components fulfill different tasks, and some of those tasks are needed in various pipelines over and over again, so you can use components to build and rebuild many solutions.

Kubeflow Pipelines is built on top of an engine that schedules those ML workflows. By default, the execution engine for Kubeflow Pipelines is Argo. Here at IBM, we use Tekton as the execution engine. And of course there's a user interface that helps with tracking experiments, starting and running jobs, and following logs, as I will show later. There is also an SDK for the data scientists who are actually doing the machine learning work, and the SDK helps with those steps. The SDK is especially useful when data scientists want to work with their notebooks, which are typically based on Python, and the Kubeflow Pipelines SDK is also written in Python.

Now, pipelines are built out of individual steps, and the building blocks of pipelines are components. A component is really a self-contained piece of code. It usually performs one step in a pipeline, such as data preprocessing, data transformation, or model training. It can be thought of like a function: a function has a name, parameters, return values, and of course a function body. Components are containerized and run independently on Kubernetes. Components have input arguments and produce output values, so that those can be used by downstream components.

When you write a component, you have various options. You can either write the YAML spec directly, or you can compile the component from the Python DSL. In that component spec there is metadata, like the name and description; there is the interface, the interface being the input and output parameters with name, type, description, and a default value; and there is the actual business logic of the component, typically Python or shell scripts, together with the Docker container that the code runs in.

Kubeflow Pipelines has been around for a couple of years, and there is a rich component ecosystem already. There are numerous platforms and vendors contributing components. There are components for the various ML pipeline stages: data pre-processing, model validation, the training itself, model evaluation. There are components focused on MLOps kinds of tasks; for example, they can send notifications via SMS or Slack when model training is done. There are basic utilities to upload and download files from cloud storage, and there are entire pipelines in that ecosystem as well.

Now, there's a problem with the existing Kubeflow Pipelines ecosystem, and the problem is a result of that richness, of the sheer number of components that already exist. You can think of it as a bunch of Lego blocks like in the picture you see here: an unorganized heap of blocks. What we really want is an organized set of building blocks that can be used to build pipelines. So the challenges are in the authoring and the publishing of the components themselves. There is little documentation on how to author components, and there are multiple ways in which they can be authored.
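To make the component spec idea concrete, here is a minimal sketch (not from the talk) of the Python DSL route: a self-contained function compiled into a component YAML with the KFP v1 SDK. The function name and logic are illustrative placeholders; only the create_component_from_func call is the actual SDK mechanism.

```python
# Minimal sketch: authoring a KFP component from a Python function (KFP v1 SDK).
# split_dataset is a made-up example step, not a component from the talk.
from kfp import components

def split_dataset(input_path: str, train_fraction: float = 0.8) -> str:
    """One self-contained pipeline step: split a data set into train and test."""
    # ... the actual data-splitting logic would go here ...
    return input_path + ".train"

# Compiling the function produces the component's YAML spec
# (metadata, interface, implementation) in component.yaml.
split_op = components.create_component_from_func(
    split_dataset,
    base_image="python:3.9",
    output_component_file="component.yaml",
)
```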
As I mentioned before, there's a Python DSL that you can use to compile your components, or you can write the component spec in YAML directly. And in Kubeflow Pipelines v2, there's also the intermediate representation, which is a platform-independent representation of a pipeline. All of these different ways of authoring have no feature parity, and there's not a good way to document and publish these components.

There are similar problems with hosting. Right now, all of these components are hosted on GitHub, and that allows only minimal capabilities in terms of indexing, searching, categorization, and versioning. And oftentimes the components that are contributed are not very well maintained. Components might be important for a certain project, and then the developers move on and the project evolves, but often these components don't get maintained very well.

Now, what can help with that is a component registry. This is a proposal by the Kubeflow Pipelines team, and it's currently work in progress. You can find the details in the Kubeflow Pipelines repository in issue #7382, which I will link at the end of the presentation as well, and where you can find the design docs for the API and the SDK. That new protocol is meant to be implemented by third-party template registry servers. On the KFP side, there will be first-class integration into the SDK, meaning that data scientists can use the SDK to download components directly from a registry, discover different versions, and have a simplified way to search and find the components they need. The benefits also include that the component format will be unified into a YAML format, and there will be versioning and tagging, similar to Docker image tags. KFP will also provide credential management, and components may even be able to be run as a pipeline directly.

Here is a bit of terminology for that new component registry. There's the registry host, that's the server. Templates will be versioned, each of those versions can be tagged, and a collection of one template with all of its versions is called a package.

Here are a few examples of what that new registry protocol might look like. You see the API endpoints on the right side here. Most important are probably the download and the upload endpoints, and then the metadata API endpoints for packages and tags. On the left side, you can see two examples of what the REST API would feel like: how you would upload a component, given that you have the proper authorization, and how you would download it. The download really is just the host, then the name of the package, and the version or the tag of the component version. And in the SDK, it would look as simple as loading a component from a registry. Of course, the client has to connect to the registry first, and then you just specify your package and the tag or the version of the package that you want. Then there are methods to list all of the versions in a package or list all of the packages in the registry.

Now, what might such a registry look like, other than what you can query or download via the API? Here comes the Machine Learning Exchange. That is a fairly recent project created by IBM and hosted by the Linux Foundation AI & Data. The Machine Learning Exchange, or MLX for short, really is a catalog of machine learning assets bundled with an execution engine, the execution engine being Kubeflow Pipelines on Tekton.
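Before moving on to MLX: since the registry protocol and its SDK integration are still a proposal (kubeflow/pipelines issue #7382), the snippet below is only a rough sketch of what downloading a component from such a registry could look like today, using plain requests plus the existing KFP SDK. The host, package name, tag, and the shape of the download URL are assumptions for illustration, not the final API.

```python
# Illustrative sketch only: fetch a component's YAML from a hypothetical registry
# host at <host>/<package>/<version-or-tag>, then load it with the existing KFP SDK.
import requests
from kfp import components

REGISTRY_HOST = "https://registry.example.com"   # hypothetical registry host

def load_component_from_registry(package: str, tag: str = "latest"):
    """Download a component spec from the registry and turn it into a KFP op."""
    resp = requests.get(f"{REGISTRY_HOST}/{package}/{tag}")
    resp.raise_for_status()
    return components.load_component_from_text(resp.text)

# Hypothetical package name and tag.
echo_op = load_component_from_registry("echo-sample", tag="v2")
```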
So what is the Machine Learning Exchange? In the Machine Learning Exchange, you have various types of machine-learning-related assets: the pipelines and components, of course, but also models, curated data sets, and notebooks that can be used to train models using those data sets. For each of those assets, MLX can generate sample pipeline code. So when you upload a new model, MLX will generate the code around it, and then you can run it directly with Kubeflow Pipelines. That's especially useful if you have data sets that you want to use to train your models, and especially if you already have a notebook that makes use of that data set.

The engine underneath is Kubeflow Pipelines on Tekton. And we have various technologies integrated, like Datashim, KFServing (now KServe), the notebook components from Elyra, and data sets and models from the Model Asset eXchange and the Data Asset eXchange, two other projects by CODAIT. For the metadata, we have aligned with MLSpec as well as possible. And of course, MLX can be deployed on either OpenShift or Kubernetes.

Here is what the user interface looks like. Here are eight sample pipelines. You can filter them and find a pipeline that you want to look at in more detail. You can see the pipeline graph, the details with description, and the YAML; this is the KFP-Tekton YAML you see here. Then the pipeline can be launched directly from the MLX UI. You provide your parameters and click submit, and then you are presented with the Kubeflow Pipelines run graph. You can see the configuration and visualizations, the metadata; you can find details like the volume mounts, and you can follow the logs. And of course there are the pod specs, and you can inspect things in detail in Kubernetes directly.

We have a very similar experience for components. Most of these components here we make use of in the pipelines that we generate. This one here is a good example to show, because it's short and easy to demo in a GIF. Here's a view of our models, and very similarly, data sets and notebooks.

Now, notebooks are interesting here. You can preview a notebook with the notebook preview, as you can see here. And once you run a notebook, which you can do from MLX as well, we use the Elyra notebook-run component. It kicks off the run of the notebook using Papermill in a Kubernetes pod, and then, of course, you can follow the logs. At the end of the run, a new notebook is generated with all of the output cells populated, and you can download that directly. And of course the notebook does whatever it's supposed to do; in this case, it's training a model.

Here's a listing of the pipelines, components, models, data sets, and notebooks we currently have in the Machine Learning Exchange. This is an ever-growing list. These are also available in our read-only deployment on ml-exchange.org, which I will show as part of my demo.

Now, one of the technologies we integrated in MLX is Datashim, which we use to manage our data sets. With Datashim, we can easily create persistent volumes, and we can get data from S3, NFS, and other sources, basically reducing the amount of work that end users would have to do in order to get data sets and work with them in Kubernetes. For models, we have an integration with KFServing, or KServe, or we can also serve our models natively on Kubernetes.
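To make the Datashim integration a bit more concrete, here is a hedged sketch of creating such a data set volume from Python. The Dataset CRD group/version and field names follow the public Datashim examples as I recall them, and the namespace, bucket, endpoint, and resource name are made up, so treat this as illustrative only and verify against your Datashim installation.

```python
# Hedged sketch: create a Datashim "Dataset" custom resource so that Datashim
# provisions a PVC backed by an S3 bucket. All names and endpoints are placeholders.
from kubernetes import client, config

config.load_kube_config()

dataset = {
    "apiVersion": "com.ie.ibm.hpsys/v1alpha1",   # Datashim CRD group/version (assumed)
    "kind": "Dataset",
    "metadata": {"name": "codenet-langclass", "namespace": "mlx"},
    "spec": {
        "local": {
            "type": "COS",                        # S3-compatible object storage
            "endpoint": "https://s3.example.com",
            "bucket": "my-dataset-bucket",
            "accessKeyID": "...",                 # credentials elided
            "secretAccessKey": "...",
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="com.ie.ibm.hpsys", version="v1alpha1",
    namespace="mlx", plural="datasets", body=dataset,
)
# Datashim then creates a PVC named "codenet-langclass" that pods can mount.
```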
I mentioned that MLX is built on pipelines, specifically Kubeflow Pipelines, or rather Kubeflow Pipelines on Tekton, the difference being the execution engine: Kubeflow Pipelines is, by default, powered by Argo, and we have Tekton as the engine. And of course, it runs on OpenShift and Kubernetes. Now, one use of Kubeflow Pipelines in an IBM product is Watson Studio Pipelines, and here you can see that we have a canvas where you can drag individual components from a component palette and create your machine learning workflows that way. It's currently in open beta, and you can follow the links here to try it out.

With that, I will go to the demo part. Here I have an instance of the Machine Learning Exchange. I did a port forwarding to port 3000, so it looks like it's local, but it's not. You can see our landing page, and you can browse through the individual assets: data sets, models, pipelines, components, and notebooks.

Let's start with data sets. One of the most recent data sets we integrated is Project CodeNet. That is a large-scale data set with approximately 14 million code samples in 50 languages, each of them a solution or proposed solution for one of 4,000 coding problems. Since that data set is really large, we cut it down into a few smaller ones, and we selected one for a language classifier that I'm going to show in more detail.

Once you click on an asset, you get a more detailed overview. For data sets, you can download the original source files; you can see the license, the size of the set, and how it was created, and oftentimes follow links to the source. Each of our assets is defined by a YAML spec, similar to how components are specified in Kubeflow Pipelines. We have YAML specs for data sets, models, and notebooks, and they have information like the name and description, the version, where it's from, what the license is, and other related information. Like here, you can see that there's a notebook we have integrated that is related to this data set.

Once you've found a data set that you want to work with, you can create a persistent volume claim that can later be mounted. In this case, you choose what namespace you want to create that volume in and click submit, and then the Kubeflow Pipelines run graph comes up. That will usually take a while: images have to be downloaded and the container created. Once it's done, you can click on details, see what log messages there are, and, in this case, the ID of the created persistent volume.

Now we can use that ID and run the related notebook. For Project CodeNet, we have a language classification notebook, and again, you can see some details here: the YAML file that describes where to find the source of the notebook, what the Python requirements are, and what image this runs in. You can also preview the notebook code. We use the nbviewer component for that, and it comes in handy, especially when notebooks change over time and you get newer versions, and you want to be sure which version you're looking at. You can also see the output from the initial run, which can come in handy when you want to run it again and compare it with the most recent results. We can launch this.
We're going to put in the ID of the persistent volume claim we created, and the mount path, and click submit. Then the Elyra notebook component should start: it downloads the notebook from S3 storage, pulls the image, and uses Papermill to start the run. Once it's running, you can see the log files. It will start with all the pip installs and then go through all the training steps. Since that is a long-running process, I'm going to quickly go to a run I started earlier, click on the logs here, and at the very end you can see that a newly generated notebook was created, and it's available via S3 on MinIO. I downloaded that earlier, and it looks very similar to the notebook we just previewed. We go through all the cells, and at the end, in this particular notebook, we take our trained model and use a small test set to do some predictions. Since this is a language classification, we give it 10 samples each in 10 languages: C, C#, C++, JavaScript, and so on. Here we can see that for most of them the predictions are correct, except for C++, where we only have 9 out of 10 correct predictions. If we want to compare that to the initial run, when the notebook was first uploaded, we can go to the original notebook in the notebook source code viewer and scroll to the bottom, and see that in the original upload, two of the languages were not quite correct. In our latest run, we classified all of our C code samples correctly, so we improved.

That was the quick demo of the Machine Learning Exchange. You can find out more about the Machine Learning Exchange on our GitHub repository, and also find our read-only deployment at ml-exchange.org. There you can see all of the assets I just showed you during the demo, except that in that read-only deployment, we don't allow users from all over the world to kick off training jobs, so Kubeflow Pipelines is not integrated in that read-only deployment. On our GitHub repository, you will also find general information on how to use the Machine Learning Exchange, more information about all the integrated technologies, and the deployment options. We have a quick start that only requires Docker and Docker Compose, and you can run it locally, just the catalog, without Kubeflow Pipelines. If you want Kubeflow Pipelines, you can install it with Kubernetes in Docker, or if you have access to a Kubernetes or OpenShift cluster, you can deploy it onto your cluster.

Actually, let me quickly show the new template registry protocol here. Here you can see an API browser for all of the APIs of MLX, and we integrated that new template registry service. Here you can see all of the endpoints: you can list packages, upload new packages, delete packages, and find versions. And here you can see one of those endpoints, the one to list all packages. You can see all the components, in this case, that we have added, and for each of the components, you can also see the individual versions and the tags. So we go here, and for example, we have that echo sample component, and you can see all of the individual versions that we have for it. Now, we deleted a few in between, so now we have v4, v2, and latest. You can also access these directly, either via curl or Python requests or in the browser, and show the versions, show the tags, or directly download the component. The integration with the Kubeflow Pipelines SDK isn't ready yet, so that will be for a future talk.
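Since I just mentioned you can hit those endpoints with curl or Python requests, here is a rough sketch of what that might look like. The endpoint paths and the port-forwarded host below are placeholders standing in for the routes shown in the API browser (list packages, list versions, download a version), not the exact MLX routes; check the API browser for the real paths.

```python
# Rough, hypothetical sketch of querying the MLX template registry API with requests.
# Paths are placeholders; the host assumes a local port-forward like the one in the demo.
import requests

MLX_API = "http://localhost:3000"   # assumed port-forwarded MLX instance

# List all packages in the registry (placeholder path).
packages = requests.get(f"{MLX_API}/apis/v1alpha1/templates").json()

# List versions of one package and download a specific version (placeholder paths,
# using the echo sample component from the demo as the example package).
versions = requests.get(f"{MLX_API}/apis/v1alpha1/templates/echo-sample/versions").json()
component_yaml = requests.get(
    f"{MLX_API}/apis/v1alpha1/templates/echo-sample/versions/v2"
).text

print(component_yaml)
```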
Okay, let's quickly jump back to the presentation. There are a few links here that might be of interest. There's a link to the KFP template registry protocol, or at least to the issue on the Kubeflow Pipelines repo that has links to the RFE and the design docs. You have the links to the Machine Learning Exchange, the GitHub repository and the website, and you can reach out to us via Slack. For the Center for Open Source Data and AI Technologies, you can visit our website or find our articles on medium.com/codait. And if you're interested, we have four sessions from CODAIT at the KubeCon Europe co-located events, at AI Day, Edge Day, and KnativeCon, so feel free to check them out. With that, thank you, and see you next time.