Hello, everyone. Welcome to today's session on democratizing ML pipelines and bringing ML workflows to heterogeneous cloud-native ML platforms. My name is Tommy. I'm a senior software developer on the IBM Open Technology team, and I have my colleague, Yihong Wang, here to help me as well. Let's get into today's session. At a high level, an ML workflow has different phases. In the data phase, we have teams that need to pull data and analyze data. Then we have the data science team, who do machine learning and deep learning and create the models. And finally, we have the operations team, who deploy the models and maintain them. So we have a lot of different teams working on different parts of the ML lifecycle. And when we actually break it down into automation, with different teams responsible for different sections, it decomposes into a lot of smaller steps as well: data preparation, cleansing, ingestion, analysis, transformation. Each of these can belong to a different team and a different workflow. And when we need to connect all of these workflows together, there's real overhead: teams need to port their components onto different platforms with different technology stacks. That's why we want a heterogeneous pipeline that can run in different kinds of environments and, based on different requirements, still produce the same kind of information and metadata that can be shared among all of them.

When we look at how to do that on Kubernetes, there are two popular projects. One is Argo Workflows. Argo Workflows is a container-native workflow engine for orchestrating parallel jobs, and it implements the whole pipeline as a Kubernetes CRD. Argo Workflows provides a very nice UI, and every component is wrapped in a pod, so each step has a dedicated environment. It can also run as a DAG with sub-tasks, so it's very easy to create a very complex graph when you have a very complex machine learning workflow. The other option is Tekton Pipelines, which is what IBM and a lot of companies working with OpenShift use, because the Tekton Pipelines project also provides Kubernetes-style resources for declaring CI/CD-style pipelines. Tekton breaks execution into two different kinds of runs: you can have runs that are wrapped in a pod with a dedicated environment, but for common tasks that just need to run in the same environment, you can have a dedicated controller do a very fast job, say parameter passing or caching, without creating a dedicated pod environment. And why does running on OpenShift matter? Because OpenShift Pipelines, the project the OpenShift Container Platform created on top of Tekton, is the de facto CI/CD capability on OpenShift. It's certified by Red Hat with dedicated security scans, and it has an enterprise version available on OpenShift. So for enterprise users who want to run on OpenShift with a secure environment, this is the de facto pipeline they need to use. So we can see that users in different departments may have different pipeline systems on which they want to run their ML workflows.
So now, how can we connect all of them together? This is where a project called Kubeflow Pipelines comes in. Kubeflow Pipelines is a layer that runs on top of Argo Workflows and Tekton Pipelines and connects them. Kubeflow Pipelines supports any ML framework: say you want to use PyTorch for model training, you can do it all on Kubeflow. It also provides a very nice Python SDK and DSL for data science users to craft their pipelines. As we can see, a lot of CI/CD workflow environments don't really have a robust interface for composing pipelines, so Kubeflow Pipelines provides a very nice client DSL. The DSL gives you the ability to define inputs, outputs, and conditions, so you can craft a pipeline just like you would in a regular programming language. And I think the most needed feature is parallel loops. In machine learning workflows, a lot of problems need to be executed in parallel across different environments, say training models with different parameters, so a looping feature is very common on ML workflow platforms. A very nice optimization Kubeflow Pipelines provides is that it doesn't execute the pipeline itself; it leverages whatever platform you have. Say you have an Argo stack as your CI/CD system, you can leverage that. If you have an enterprise requirement to run OpenShift Pipelines, you can leverage OpenShift Pipelines to run all your actual workloads. On top of that, it provides good experiment and metadata tracking, so all the metadata can be shared between different platforms. And it provides garbage cleanup and caching, which the default workflow environments usually don't have. This is what Kubeflow Pipelines aims to provide.
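To make the DSL concrete, here is a minimal sketch of a pipeline that wires inputs and outputs together and fans training out over several hyperparameters with a parallel loop, assuming the kfp v2 SDK. The component names and parameter values are illustrative, not from the talk.

```python
# Minimal KFP DSL sketch: inputs/outputs plus a parallel loop.
# Assumes the kfp v2 SDK is installed (pip install kfp).
from kfp import dsl

@dsl.component
def preprocess(rows: int) -> str:
    # Stand-in pre-processing step; returns a dataset identifier.
    return f"dataset-with-{rows}-rows"

@dsl.component
def train(dataset: str, learning_rate: float) -> float:
    # Stand-in training step; returns a fake accuracy.
    print(f"training on {dataset} with lr={learning_rate}")
    return 0.9

@dsl.pipeline(name="demo-pipeline")
def demo_pipeline(rows: int = 1000):
    prep = preprocess(rows=rows)
    # ParallelFor fans the training step out over hyperparameters,
    # each iteration running in its own environment.
    with dsl.ParallelFor([0.1, 0.01, 0.001]) as lr:
        train(dataset=prep.output, learning_rate=lr)
```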
As we worked on Kubeflow Pipelines, especially on the Tekton version, in the early stage we created a very dedicated UI with version control for pipelines, displayed a very nice graph of each pipeline, did garbage collection, and built a dedicated DSL and API to reduce the number of calls, and the actual traffic, going to the Kubernetes API itself. And because we didn't want to intervene in the actual workflows on the backend, in the V1 version we implemented caching using pod mutations, which is actually one of the bottlenecks; we'll discuss how we mitigated that later on. Another benefit is experiment and run tracking, which gives data scientists a very good way to organize their runs and group them into different experiments as they create different models. There is also metadata tracking, where you can see how your inputs and outputs are passed between different components. The powerful part of metadata tracking is that it tells you what data a model was trained on, and it lets you compare different model runs and compute different metrics. And with the caching capability, you can carry over state from previous runs and reuse computed outputs, so you don't have to redo, say, the pre-processing step when you expect the result to be the same.

But with that earlier approach in Kubeflow Pipelines, caching via pod mutations, there was still a bottleneck: even cached steps need pods created for them, because we didn't want to intervene in the actual pipeline flow. So we saw a big bottleneck in pod creation and in the scheduler itself, because we could have thousands of pods that needed to be created, and even with caching, all those pods still had to be created. That creates a very big bottleneck for large-scale ML workflows. From a scheduling perspective, creating a new pod takes a lot of time. And I think the most important issue is the Kubernetes limitation where each gRPC call is limited to about four megabytes of data. Since we wrapped everything into one single graph, you start seeing bottlenecks in graph traversal and validation. This is what we needed to mitigate as well: breaking the graph down into multiple smaller ones on Kubernetes so it can scale to thousands or tens of thousands of tasks, because modern AI and ML workflows can actually grow to that scale. And now I'll hand over to Yihong to talk about what we improved in the new version of Kubeflow Pipelines.

Yeah, thanks, Tommy. Here we just have a high-level design for your reference; this is the Kubeflow Pipelines V1 design. Like Tommy mentioned earlier, the Argo implementation and the Tekton implementation run as two separate side-by-side projects, but that's for V1. If you are familiar with Kubeflow Pipelines, it has been out there for several years, and a lot of people have adopted it and use it. But meanwhile, the concepts of ML workflows and MLOps, and even the best practices for ML workflows, kept evolving. So it's actually unsurprising that Kubeflow Pipelines was at a pivot point where, in order to support the new features coming in and to have better integration among the internal components, a better design and a big change were inevitable. So about one or two years ago, the Kubeflow Pipelines team, kudos to the Google team, started to work on a new design, reviewed the design at their community meetings regularly, and then worked on the implementation on both Argo and Tekton. They mainly worked on the Argo side, and Tommy and I mainly worked on the Tekton side. And we finally got the new release out, I think one or two months ago. So I encourage you to try it out; we would like to get some feedback from you, and at the same time we will keep working on it and make it more complete and stable. The reason I bring up the V1 design here is that, like Tommy mentioned, we start with KFP. We provide the DSL so you can use Python to compose your ML pipeline, and you use the SDK to compile the Python program, which gives you the pipeline specification. And at that moment, for V1, if you were using Argo, it gave you the Workflow custom resource, which runs on Argo; and if you were using Tekton, it gave you the PipelineRun custom object. So they were totally different.
During the design phase for V2, we gathered all the lessons learned and the requirements from V1, and there were a lot of new features and requirements. I'll characterize them into the two main goals we wanted to achieve in V2. The first one: as you can see here, the pipeline spec we were using was different for each runtime. So we wanted to consolidate the pipeline spec into a uniform language that is detached from the underlying runtime. And also, the metadata Tommy mentioned earlier, the data we store in the metadata store and the way we use it, was not ideal, so we wanted to improve it and make it more efficient across the whole system. The second goal: you can see that we were very, very tightly tied to the underlying runtime engine to run those pipelines. So we wanted to decouple some of the ML execution from the underlying runtime engine. In that case, we get more control, and the ultimate goal is that you can easily bring Kubeflow Pipelines to other runtime engines if you want. So instead of the 10,000-foot perspective, I'll try to dive into the details of the new design in this session.

Let's start with the first one, the pipeline spec. In V1, like I mentioned, the pipeline spec was directly tied to the underlying runtime engine, so even the UI needed to understand two different languages, one from Argo and one from Tekton. And because that pipeline artifact runs directly on Argo or Tekton, we also had to build a lot of backend logic into the pipeline. So in V2 we introduced another spec, called the intermediate representation, or IR. We define the IR as the pipeline spec. It's actually easier for you to understand: it's still in YAML format, but when you compose a pipeline, like Tommy mentioned, you compose the DAG of your pipeline, and the IR uses more intuitive syntax to express that. It's also platform agnostic, because we use this IR for every runtime we support, and it's exchangeable, which I'll talk about later. The bonus part is the UI: right now the UI doesn't have to understand Argo, it doesn't have to understand Tekton; the only language it needs to know is the IR. And based on the IR, we're trying to promote it as kind of an open standard for defining an ML workflow, or even a single task. If you share this kind of IR, you can share it with the whole community, and you could even form a central repository to store these IRs. That opens up the possibility of building an ML component ecosystem for community users, and even for vendors: they can all use the IR to share components, or pieces of pipelines, through this kind of central repository, just like we store all our images on Docker Hub or Quay.io. If you use the IR to share components, and someone builds a central repository to store them, it would be perfect: in the ML world, we could easily exchange these IRs with each other.
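For reference, producing that IR from Python looks roughly like this with the kfp v2 SDK; the toy pipeline and the output file name here are illustrative.

```python
# Compile a pipeline function into the IR: a runtime-agnostic
# PipelineSpec serialized as YAML. Assumes the kfp v2 SDK.
from kfp import dsl, compiler

@dsl.component
def say_hello(name: str) -> str:
    return f"hello {name}"

@dsl.pipeline(name="hello-pipeline")
def hello_pipeline(name: str = "world"):
    say_hello(name=name)

compiler.Compiler().compile(
    pipeline_func=hello_pipeline,
    package_path="hello_pipeline.yaml",  # the IR artifact to submit/share
)
```

The same compiled file is what would be submitted to the pipeline service regardless of whether Argo or Tekton sits underneath, which is the exchangeability being described here.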
And furthermore, based on this idea, inside the KFP SDK we also provide a registry client that can talk to a compatible component registry, so you can pull components or pipelines on the fly while you are composing your ML flow. The ultimate goal is to promote a low-code/no-code approach to composing ML pipelines. You can imagine a platform that supports the component registry: it could automatically pull down components from the registry and show them, for example, on your right-hand side, so you can just drag and drop those components, link the pieces together, and compose your ML workflow. It would be perfect and easy, right? So that's the ultimate goal, and that's why we introduced the IR.

The next step: since we introduced the IR, you can see the design is a little bit different. It starts from here. Originally we had the artifact that comes from Argo or Tekton; now we replace it with the IR, and you submit the IR to the pipeline service. But one more component we need to introduce is what we call the backend compiler, because ultimately those pipelines still need to be executed on Argo or Tekton, right? So although we define the IR as the pipeline spec, inside the pipeline service we introduce the backend compiler, which you can think of as an interpreter: it interprets the IR into the runtime engine you are using right now. For example, if you are using Argo, it will, like I said, transform it into a Workflow; if you are using Tekton, it will transform it into a PipelineRun. With this kind of design, you can easily swap out the underlying runtime engine. But wait, we still need one more extra component to complete the whole lifecycle: the abstraction layer. Think about it: the pipeline service now needs to talk to different engines. So we came up with a plugin mechanism that helps the pipeline service talk to all the different underlying engines. We created an abstract interface, so that when the backend compiler takes a submitted IR and generates the underlying artifact that actually needs to run on your runtime engine, everything goes through this interface. The pipeline service won't touch, for example, the Argo controller directly; it won't touch the Tekton controller directly; it talks to this unified interface. So, like I mentioned, the mechanism we want to promote is: if you have your own runtime, you can just implement this interface, plug your runtime into the Kubeflow Pipelines system, and it can support your own runtime to run the pipeline.
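As a rough sketch of the component-sharing idea just described: the kfp SDK can already load a component definition from a URL, so a registry-backed flow would look something like this. The URL below is a placeholder, not a real registry endpoint, and the component's interface is assumed.

```python
# Pull a shared component definition while composing a pipeline.
# load_component_from_url is part of the kfp SDK; the URL is
# hypothetical and stands in for a central component registry.
from kfp import components, dsl

trainer = components.load_component_from_url(
    "https://example.com/registry/train/component.yaml"  # placeholder
)

@dsl.pipeline(name="reuse-pipeline")
def reuse_pipeline():
    # The loaded component is used like any locally defined one.
    trainer()
```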
So that's enough about the IR. If you remember, the second topic we wanted to tackle is ML Metadata. In V1, you can see in the V1 design that we have a component called the MLMD writer. It runs as an asynchronous Python program that actively monitors the Kubernetes pods created when you run the pipeline. So you can see it runs in this asynchronous fashion: it just keeps waiting, and when a task is finished, when the pod is finished, it grabs the input data and output data and stores them into the metadata store. That's not efficient, and in this design it's hard to reuse the metadata information inside each step of your pipeline. So in V2 we embed MLMD into the pipeline execution itself. As you can see here, every component, meaning every step of the pipeline, talks to MLMD directly, so we removed the MLMD writer. In this way, MLMD provides extra benefits for the whole pipeline. For example, you can use MLMD to support data passing, because every task's inputs and outputs are stored in MLMD immediately after the task finishes. So if you have another task that depends on that finished task, it can immediately talk to MLMD and grab that data. The other thing is that you can support caching: when you spin up another pipeline run, and the pipeline executes the same component with exactly the same parameters, you actually don't need to run that component again; you can just grab the results from the metadata store and show them to the user. And the other benefit is that in the UI, when we show the real-time pipeline run status, we can also get that status from MLMD directly.

Furthermore, looking at the components here, let me dive a little into how the components run and how they leverage MLMD. In V1, when we ran these components, we relied directly on the underlying engine: we just handed the task to Argo, or the task to Tekton, and it ran it for us, because MLMD ran asynchronously and collected the pod information separately. But here, we make the task itself push its inputs and outputs to MLMD directly. So we introduced a mechanism we call the smart runtime; it's a driver, an executor, and a publisher wrapper. In this fashion, when the task starts, it starts with the driver. The driver helps with the preparation steps: for example, if there is any parameter you want to get from an upstream task, it talks to MLMD, gets those values, substitutes them in as the real values, and puts them there for the executor. And the executor, as you can see, is just like a wrapper around the user's task: it runs the user's task, and at the end of the user's task, it publishes all the results into MLMD immediately. With this mechanism, like I mentioned, if you have any downstream task that depends on your upstream task, it can immediately pull that data from MLMD. So those are the mechanisms and the new implementation we brought into V2, and the goal is for V2 to be portable, able to support different kinds of runtime engines.
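To connect this to the SDK surface: a v2 component declares typed inputs and outputs, and in the design described above, the driver would resolve the input artifact from MLMD before the user code runs, while the publisher records the output artifact back into MLMD afterwards. A minimal sketch, assuming the kfp v2 SDK; the component is illustrative.

```python
# A KFP v2 component with typed artifacts. The user code only reads and
# writes local paths; the metadata bookkeeping (resolving inputs,
# publishing outputs to MLMD) happens in the driver/publisher wrapper,
# not in this function.
from kfp import dsl
from kfp.dsl import Input, Output, Dataset, Model

@dsl.component
def train_model(data: Input[Dataset], model: Output[Model]):
    with open(data.path) as f:
        rows = f.readlines()
    with open(model.path, "w") as f:
        f.write(f"model trained on {len(rows)} rows")
```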
So with that, I will hand back to Tommy; he will cover some of the performance improvements we did in V2.

Thank you, Yihong. So as you've learned, Yihong demonstrated the new way Kubeflow Pipelines is implemented, where all the Kubeflow Pipelines components are now represented in the intermediate representation. On the actual Kubernetes layer, you no longer have to store the whole pipeline in one single custom resource definition; we can break the pipeline down into smaller pieces. But I think one downside is that, as we implemented all these features and reached feature completeness, able to port the same pipeline spec between different backends for different users, we introduced the driver, publisher, and executor components. And because of those components, we now have a new, smaller layer of code that needs to be executed. The first optimization we did, on the Tekton side, was to introduce a custom task controller, so that the common tasks, just publishing and retrieving information, can be done in a shared controller, and we don't have to create an extra pod for them. On the Argo side, the community is working on the HTTP template to replicate the same approach. And the results are pretty promising: the runtime can be reduced significantly, which we'll show later on.

The second thing we wanted to improve is that, with the new driver and publisher tasks, the graph complexity gets very bad as it scales up. In each graph, when you connect the driver layer down to all the tasks you need to run, you have to connect it to all the root nodes; and for the publisher, you need to connect all the leaf nodes so it can publish all the information once the subgraph is finished. That introduces big graph complexity, and we needed to improve it. So the first thing we've done in the current state is to combine, at least at the task level, the publisher and the driver, merging them into one, and also give tasks the ability to run things that are not pods. In the original design, every user task was expected to be just a containerized pod. But in real-world scenarios, you actually have different training and pre-processing platforms that leverage Kubernetes resources. For example, you might want to use Spark jobs for pre-processing: you just want to create a Spark component, and you don't want to run your component in a pod as a thin client; you want to run a CRD on your Kubernetes platform. We're able to provide that capability in the new design, where a user component doesn't have to be a container; it can be a custom resource as well.
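The talk doesn't show the exact KFP mechanism for custom-resource components, but as a rough illustration of the underlying idea, here is what submitting a Spark job as a custom resource, rather than running it inside a component pod, looks like with the Kubernetes Python client. The SparkApplication group, version, and manifest follow the Spark operator's conventions and are assumptions for illustration, not the KFP v2 API itself.

```python
# Rough illustration only: a "component" that is a Kubernetes custom
# resource instead of a pod. Submits a hypothetical SparkApplication CR
# via the Kubernetes Python client; requires a Spark operator to be
# installed for the CR to actually do anything.
from kubernetes import client, config

config.load_kube_config()

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "preprocess-job", "namespace": "default"},
    "spec": {
        "type": "Python",
        "mainApplicationFile": "local:///opt/app/preprocess.py",
        # driver/executor sizing omitted for brevity
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="default",
    plural="sparkapplications",
    body=spark_app,
)
```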
So, just to recap at a high level what Kubeflow Pipelines V2 brings. With this new design, caching is handled by the new driver code, so we don't have to do pod mutations. That significantly reduces the number of pods we need to schedule and create on the Kubernetes cluster. And all the parameters and artifacts are uploaded by the driver and publisher themselves, so we can bypass some of the workflow-engine-level limitations. As we can see from a performance standpoint, at least on Tekton, Tekton has a limitation on parameter passing; and even on Argo, parameter passing is difficult at the workflow level, because the engine needs to watch for the user container to finish and then publish that information back to the workflow itself. Usually it needs to run a sidecar, or run some process right after the user component finishes, and that's what creates this kind of limitation. With the new Kubeflow Pipelines design, we created our own binary to push all the information to our dedicated metadata server. So no matter what platform you use, the metadata is shared, and the workflow engine itself doesn't have to handle big chunks of data. And finally, we need to integrate and merge all the information, because right now we're breaking a large pipeline into smaller pipelines on Kubernetes, and all of those need to be aggregated together. This is also what the new design brings: when your sub-component finishes, its information is pushed to Kubeflow Pipelines, and Kubeflow Pipelines aggregates a very nice graph for you, while behind the scenes all those smaller graphs execute individually on Kubernetes. So you reduce the impact of the size limit on a Kubernetes custom resource definition itself.

With this, I'm going to demo what it looks like with the new design, how Kubeflow Pipelines brings a very nice UI, and how it can pass metadata across different backends. In the Kubeflow Pipelines UI, you can see the multiple pipelines you've created. And for each pipeline, we've added the capability to create different versions. For example, in this case, by default Kubeflow Pipelines has caching enabled for all components, but say you don't want caching for the training step: I just pick the version where I made training run without cache, and you can configure that in the SDK as well. Once the compiler finishes the compilation into IR, you can see... the cursor is a bit problematic, give me one second. You can see, for example, that for pre-processing you always want caching set to true, because pre-processing always transforms the data into the same result, so it makes sense to keep it true. But as we can see, for training it should be false here. Sorry, the cursor is a little problematic. This one; let me make sure I'm in the right version. You can search. On the training side, caching should be set to false. That way, the training step itself is always executed, and you can make sure the algorithm, using the same data, always converges to the same results; so sometimes you want training's cache set to false. As you can see, once caching is enabled and all the components are cached, those components execute in less than one second. With all this information, if you have everything cached, you only have about a two-second overhead for traversing the whole graph, passing all the information, and displaying the metadata over here.

And let me show you an example run. I think this one might have been uploaded by mistake, but let's see. It will run these pipelines here. For a component that is cached, you can see the cached component completes immediately and gives you the results, because the results are stored in the centralized ML Metadata service. And then for the training step, where you always want to train, it just takes the cached components' outputs into the training process; as you can see from the log, it just starts the training. From a data science perspective, you always want to make sure your models converge to the same results using the same data.
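As mentioned, the same per-step caching choice from the demo can also be made in the SDK. A minimal sketch, assuming the kfp v2 SDK, where pre-processing stays cached and training always re-executes; the components are illustrative.

```python
# Per-task caching control in the KFP v2 DSL: preprocess is cached by
# default, while training opts out so it always re-runs.
from kfp import dsl

@dsl.component
def preprocess() -> str:
    return "clean-dataset"

@dsl.component
def train(dataset: str) -> str:
    return f"model trained on {dataset}"

@dsl.pipeline(name="cache-demo")
def cache_demo():
    prep = preprocess()                # cached by default
    t = train(dataset=prep.output)
    t.set_caching_options(False)       # always re-run the training step
```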
So in this case it makes sense to always train and produce a new model, to make sure that all the models, given the same data, converge to the same results. And all this metadata can be shared with different backends. As you can see, this can improve performance significantly. For example, if you run everything without caching, the pipeline takes about 30 seconds; but if you can at least cache the pre-processing, you reduce the runtime by about a third. And the more steps you can cache, the more you improve your overall pipeline runtime as well. And that concludes the demo itself.

I'm going to go over what's happening next with the Kubeflow Pipelines project. As we've seen, we've done a lot of improvements in terms of caching and breaking down the graph, and we're able to scale out significantly. But we can still see that the extra abstraction we introduced, the driver and publisher logic, makes graph traversal a little more complicated. So in the next phase, we want to embed the driver and publisher logic into the controller itself, and we're working with the different backend workflow communities to make this pluggable. This way, we can reduce the graph complexity of the new design while keeping all the nice features: uploading and downloading all the information and aggregating all the status into a centralized metadata service that can be shared between different backends. And the beauty of a centralized metadata service is that not only the current backends, Argo and Tekton, can use it; when you introduce a new workflow engine, say MLflow, or even a workflow engine that doesn't run on Kubernetes, you can easily leverage the new abstraction layer from Kubeflow Pipelines, introduce a new backend, and share the same metadata service between different teams. So this is the direction we're aiming for with Kubeflow Pipelines.

If you want to learn more about the projects, here are the links to the core projects themselves: we have Kubeflow Pipelines on Tekton, and if you want to join the community meetings, come check out the Kubeflow Slack. And at the end, our IBM watsonx products also run on Kubeflow Pipelines, leveraging it to orchestrate all the workflow components. So go check that out as well: there's interesting work using large language models, and you can create new workflows for your own models too. That's it. Thank you very much. Any questions? I think we're running over. Sure, I think we're running out of time, so if you have any questions, come find us after the session. Yeah. Thank you. Thank you.