Hello everyone, welcome to today's session on Embracing DevOps Practices for ML Pipeline Development. My name is Tommy Lee. I'm a software developer from IBM CODAIT, which stands for Center for Open Source Data and AI Technologies. We also have Yihong Wang, my teammate, who works on the same project at IBM. Today we'll go over our ML pipeline development and how we embrace DevOps practices on top of it.

First, the agenda. We'll go over some background on why we need an ML pipeline platform. Then we'll walk through our development lifecycle and some of the challenges we faced. At the end, we'll cover the CI/CD enablement we have built on top of our project.

To start with some background: we want to automate the machine learning lifecycle. At the beginning, we have a machine learning problem where we take some data, train a model on it, and get some outcomes. In reality, the actual ML workflow can span multiple teams. Data engineers handle data collection and pass it on to data scientists, who do data ingestion, data processing, and model training. Production engineers handle the training part on the cloud as well, then deploy the models so application developers can use them. And at the end of the day, feedback from the application developers flows back into how we collect and ingest data on the data science and data engineering side.

When we look closer at each team, each of their steps breaks down into multiple smaller steps. For data preparation, you do data cleansing, ingestion, analysis, transformation, validation, and splitting. The same applies to model creation, where you build a model, optimize it, validate it, and train it at scale. From the production engineering perspective, you have to deploy the model, serve it, set up monitoring, and then feed future improvements back into the data preparation steps. As you can see, each of these steps can be rerun and redone many times within a team, and this process gets very repetitive over time. This is why a pipeline system really benefits multiple teams: they can also share their knowledge with other teams across the organization.

This is why we're really interested in the Kubeflow Pipelines project. Kubeflow Pipelines is an open source project for building and deploying portable, scalable machine learning workflows on Kubernetes. Kubeflow Pipelines decouples the implementation of ML tasks, so it lets us package each task into a component and reuse it. It can work with any runtime framework and data type. And most importantly, it can attach full Kubernetes objects like volumes and secrets, which makes it very cloud native. From the data science perspective, you can use what Kubeflow Pipelines calls the Python DSL to create a sequence of steps. The steps can be defined with data dependencies on their inputs and outputs, so it creates the DAG (directed acyclic graph) you see on the right-hand side. Once you create the pipeline, you can also create a schedule to run it periodically or as a cron job. One of the beautiful things about Kubeflow Pipelines is that everything can be written with the Python SDK, where you only have to define a decorator for the pipeline and then define a list of parameters and steps.
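To make that concrete, here is a minimal sketch of what the Python DSL looks like. The component names, container images, and parameters (`preprocess_op`, `train_op`, `data_url`, the `example.com` images) are hypothetical, made up for illustration; the decorator and compiler come from the real `kfp` v1 SDK.

```python
import kfp
from kfp import dsl

def preprocess_op(data_url: str) -> dsl.ContainerOp:
    # Each step is a reusable component backed by a container image.
    return dsl.ContainerOp(
        name='preprocess',
        image='example.com/preprocess:latest',  # hypothetical image
        arguments=['--data-url', data_url],
        file_outputs={'clean_data': '/tmp/clean_data.csv'},
    )

def train_op(clean_data) -> dsl.ContainerOp:
    return dsl.ContainerOp(
        name='train',
        image='example.com/train:latest',  # hypothetical image
        arguments=['--data', clean_data],
    )

@dsl.pipeline(name='demo-pipeline', description='Preprocess, then train.')
def demo_pipeline(data_url: str = 'https://example.com/taxi.csv'):
    preprocess = preprocess_op(data_url)
    # Passing one step's output into the next creates the data
    # dependency that the compiler turns into an edge of the DAG.
    train_op(preprocess.outputs['clean_data'])

if __name__ == '__main__':
    # Compiles the decorated function into a workflow spec.
    kfp.compiler.Compiler().compile(demo_pipeline, 'demo_pipeline.yaml')
```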
It then compiles into the DAG you see in the right-hand corner. However, at the time we looked at it, Kubeflow Pipelines only supported one backend engine, Argo, and we ran into problems running it on non-Docker runtimes and on OpenShift. Also, at the time, a lot of IBM teams were already using Tekton as their DevOps engine. That's why we were interested in getting Kubeflow Pipelines to run on Tekton as well.

So here is what we have done: we built a compiler that takes the existing KFP SDK and compiles pipelines into Tekton pipelines. We can then pass those to the modified Kubeflow Pipelines API server that we built. In the background, the Kubeflow Pipelines API server still handles the job history and the scheduling for us, but the actual pipeline execution is handled by the Tekton engine on the given Kubernetes cluster. The benefit of doing this is that we have teams that optimize Tekton on IBM Cloud, and by doing it through Kubeflow Pipelines we can reuse all the pluggable components from the existing Kubeflow Pipelines community.

Now let's go over some background on the Tekton project. The Tekton Pipelines project provides Kubernetes-style resources for declaring CI/CD-style pipelines. Tekton introduces a few new CRDs, such as Task, Pipeline, TaskRun, and PipelineRun. A PipelineRun is defined as an execution of a Pipeline, where a Pipeline is a template for how all the Tasks are composed together. Then we have TaskRuns, which handle the execution of each individual Task. Tasks can be as simple as compiling code, running tests, or building and deploying images. When we map Tekton back to Kubernetes terminology, essentially each Task corresponds to one pod on Kubernetes, and every step inside a Task is an additional container inside that pod. These steps become very useful later on when we try to do metadata tracking in Kubeflow Pipelines.

Now I want to go over why we want to run ML workflows on top of Kubeflow Pipelines. The benefit is that Kubeflow Pipelines provides metadata tracking, where in this case the metadata describes artifacts. An artifact in an ML workflow can be data, a model, or just metrics. Having metadata tracking is very beneficial because we can find out which data a model was trained on and use that to compare against previous model runs. Further on, we can carry over state from previous models and reuse previously computed outputs as well.

This is what artifact tracking looks like in Kubeflow Pipelines. Here we're using the TFX taxi trip example, and after running the pipeline you see the artifacts shown above are generated. Each artifact has a few columns of metadata information: Pipeline, Name, ID, Type, URI, and Created At. Pipeline is the pipeline that generated the artifact. Name is the name of the artifact. ID is the unique ID for each artifact. Type, in this case, is the TFX type of each generated artifact. URI is the object path where the artifact is stored, and Created At is the timestamp of when the artifact was created. When you click on an individual artifact, you see more information below: the URI, the name, the pipeline that produced it, and the state of the artifact. Next to it you will see the Lineage Explorer, which looks like this. The Lineage Explorer lets you view the history and versions of your artifacts. Again, artifacts can be anything a task produces, for example models, data, and metrics.
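The compiler described here is available as the open source kfp-tekton SDK. As a minimal sketch, reusing the hypothetical `demo_pipeline` from the earlier snippet, switching backends is essentially a one-line change:

```python
from kfp_tekton.compiler import TektonCompiler

# Same Python DSL as before, but the output is a Tekton PipelineRun
# spec instead of an Argo workflow spec.
TektonCompiler().compile(demo_pipeline, 'demo_pipeline.yaml')
```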
So let's take a look at this example. At the beginning, one task produces a SchemaGen artifact, and once the schema task has produced it, downstream tasks consume that SchemaGen artifact to produce three more artifacts: Transform, ExampleValidator, and Trainer. Then you can see that the Transform artifact is in turn used by four other tasks, and so on; you can see more details when you click on each task. This is one of the reasons Kubeflow Pipelines is very useful for data scientists, and we really wanted to enable it on top of Tekton as well.

Because Tekton doesn't support this kind of artifact tracking by default, we created our own solution for running Tekton pipelines on Kubeflow Pipelines that lets us track the artifacts. What we have done is: when we generate a Tekton pipeline, we add artifact annotations to it. The Kubeflow Pipelines API server can then use those annotations to inject a post-processing step. In that post-processing step we copy the artifacts and push them to the Kubeflow Pipelines object store. Alongside that, we also have a metadata writer that tracks each pod and records where each artifact is stored and how we want to store it. That information is then stored in MLMD, the ML Metadata service in Kubeflow, which is backed by a relational DB. Once the information is in MLMD, the Kubeflow Pipelines UI can use the MLMD API to fetch it and display it graphically.

So with the new artifact tracking mechanism we introduced on top of Tekton, we're able to replicate the exact same Kubeflow Pipelines experience, for example the UI DAG, the workflow, and the metadata tracking, but everything is backed by Tekton YAML now. This is very beneficial because we can give data scientists the same experience while keeping Tekton as our core engine for running all the pipeline and DevOps services. Now I will pass it to Yihong to talk about how we applied DevOps practices to our Kubeflow Pipelines with Tekton project.

Hi everyone, my name is Yihong. Thanks, Tommy, for walking through the project we've been working on. Now that you know a thing or two about Kubeflow Pipelines with Tekton, let me share the journey of how we applied DevOps practices to enable CI/CD for the nightly build release. I will talk about the tools and services we used in just a bit.

We know that DevOps helps on various aspects, from development and releasing to production. So, first things first: the scope. We defined where and what we wanted to implement and evaluated whether the sizing and effort were feasible and tangible. Our minimum viable product is to run a set of end-to-end ML pipelines using Kubeflow Pipelines with Tekton. We'd like to achieve two goals here. The first is quality assurance: the CI for PRs and commits only runs the unit tests, so we'd like to use the nightly build to verify some end-to-end ML pipeline scenarios in order to ensure code quality. Secondly, we want to automate the nightly build release by publishing builds when the end-to-end scenarios pass. Once we had a clear scope, we broke it down into smaller tasks, including retrieving source code, compiling and building individual components, cluster configuration, deploying components and services, running predefined ML pipelines, verifying the results, publishing nightly builds, resource cleanup, and so on. Each task should be self-contained, and some tasks may depend on other tasks.
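As a rough illustration of the MLMD side, here is a minimal sketch using the open source `ml-metadata` Python client to list tracked artifacts. The host and port are assumptions for illustration; in a real Kubeflow deployment the UI talks to MLMD's in-cluster gRPC service:

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to the MLMD gRPC service that backs Kubeflow Pipelines
# (host/port here are hypothetical in-cluster values).
config = metadata_store_pb2.MetadataStoreClientConfig(
    host='metadata-grpc-service.kubeflow', port=8080)
store = metadata_store.MetadataStore(config)

# Every tracked artifact carries an ID, a type, and the URI of the
# object-store path where the post-processing step pushed it.
for artifact in store.get_artifacts():
    print(artifact.id, artifact.type_id, artifact.uri)
```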
For the Kube cluster, it takes time to spin up and tear down, so we decided to leave that out of our initial phase. Instead, we reuse a preconfigured Kubernetes cluster to run our CI/CD pipeline repeatedly; we will handle infrastructure as code later. Besides, different ML pipeline scenarios may need different operational tasks: some ML pipelines require a unique production rollout or canary deployment, and some may need their own configuration and application monitoring. This reminded us that flexibility and scalability are very important when designing the CI/CD pipeline. Those considerations also influenced the tools and services we chose for the implementation.

The next key role is the ML pipelines themselves. When applying CI/CD to ML pipeline development, there are some challenges to deal with. As I mentioned earlier, an ML pipeline has its own production rollout mechanisms, for example canary rollout with accuracy monitoring, or model serving with a microservice architecture. Combining that with the development cycle makes the CI/CD pipeline complicated and not easy to maintain or extend. On the other hand, ML pipeline scenarios may change as the project moves forward: they may need an extra step for data processing, or model merging after hyperparameter tuning. Not to mention that most ML pipelines are lengthy and have many dependencies. Based on these characteristics, we realized the CI/CD pipeline would be better off having features like: (a) managing sequential and parallel task execution within the pipeline, (b) easy integration with heterogeneous systems, and (c) pipeline monitoring. Based on our observations and experience, these features are influential in the adoption process.

We did some research according to the scope and the challenges I mentioned in the previous two slides, and found that an IBM Toolchain can fulfill most of our needs. A toolchain is a set of tool integrations that support development, deployment, and operations tasks. It also provides lots of tutorials with templates for various applications, for example microservices on Kubernetes, progressive rollout of your application in Kubernetes, and so on. With the Toolchain service, you can choose Tekton as the delivery pipeline engine for Kubernetes application development, and that's exactly what we need. Also, the Tekton task template catalog provides a lot of tasks that we can use to compose the CI/CD pipeline directly: for example, fetching the kubeconfig, kubectl context scripting, kubectl deployment, Docker build, Docker signing, Code Risk Analyzer scans, Git repo operations, posting a message to a Slack channel, or invoking a script to execute tests. There is also DevOps Insights integration to record build, test, and deploy records, and more. The key point here is that both the development and DevOps teams use the same language to compose the ML pipelines and the CI/CD pipelines. It helps both teams sync up on their needs and changes.

Here are the details of the CI/CD pipeline we set up for the end-to-end ML pipeline verification using the Toolchain service. Check the pipeline flow on the right-hand side. It starts with code retrieval, followed by the build process for each component and service. Then it triggers the unit tests, along with a Docker build and Docker push to an internal container registry. Then the deployment process starts: components and services are deployed to the Kube cluster. Next comes the ML pipeline execution as well as result verification. After cleaning up the resources, the CI/CD pipeline ends with the nightly build release.
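As a hedged sketch of what the "ML pipeline execution and result verification" step could look like inside such a CI/CD task, here is one way to drive it with the standard `kfp` client; the host URL, pipeline file, and run name are assumptions for illustration:

```python
import kfp

# Point the client at the freshly deployed KFP-with-Tekton API server
# (this in-cluster URL is hypothetical).
client = kfp.Client(host='http://ml-pipeline.kubeflow.svc:8888')

# Submit one of the predefined end-to-end ML pipelines.
run = client.create_run_from_pipeline_package(
    'e2e_taxi_pipeline.yaml', arguments={}, run_name='nightly-e2e')

# Block until the run finishes, then check its final status; a
# non-Succeeded status fails this CI/CD task, and therefore the
# nightly build as a whole.
result = client.wait_for_run_completion(run.run_id, timeout=3600)
status = result.run.status
if status.lower() != 'succeeded':
    raise SystemExit(f'E2E pipeline failed with status: {status}')
```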
Each box represents an individual CI/CD task. Overlapping boxes run concurrently. Each task waits for the previous tasks to finish, and a task can also use the execution results from previous tasks. In our CI/CD pipeline, if any task fails, the pipeline fails. Some tasks may encounter intermittent failures, so we set up a retry count to keep that kind of situation from ruining the whole pipeline execution and wasting time. Basically, you can leverage all the features from Tekton to compose your CI/CD pipeline; in our case, we only use a few of them. In between these tasks, we also insert Slack notification tasks to send progress updates to the Slack channel. Anyone interested in the CI/CD pipeline can easily get a status update by checking the Slack channel. Also, build, test, and deploy statuses are sent to the DevOps Insights service, which helps us analyze the overall pipeline execution.

As you can see, in our journey we leverage the curated task templates as a set of building blocks to program our CI/CD pipeline. Again, both the ML pipelines and the CI/CD pipelines use Tekton: they share the same syntax, use the same features, and both kinds of pipeline are stored in the same repository. Meanwhile, developers and DevOps engineers can check the pipeline execution status via the Slack channel. Build, test, and deploy statuses are published to the DevOps Insights service, which automatically provides basic quality and risk analysis information. From there, we can define specific policies to determine whether a build can be deployed to production or not. System integration becomes easy breezy. The nightly build release is fully automated as well.

Now, this is just the start of the journey. We still want to cover more aspects in our CI/CD pipeline, like infrastructure configuration and production rollout. On the right-hand side, I've highlighted the DevOps coverage we have so far. We believe that with the IBM Toolchain service and the Tekton delivery pipeline, we can accomplish more operations in no time. Fingers crossed. Now let me pass it back to Tommy. He will share some future experiments we'd like to adopt.

Thanks, Yihong, for explaining the CI/CD process. As you've seen, the DevOps process for our project is very complicated, so one of our goals is to simplify the CI/CD process. One of the experiments we have done here is trying to utilize the KFP SDK itself. Because we built a compiler that can convert Kubeflow Pipelines SDK code into a Tekton pipeline, we were wondering: can we bring those pipelines into the managed DevOps services that run a Tekton engine in the background? The result: it is possible for simple pipelines. However, we ran into some challenges when putting it into actual DevOps practice. Because the Kubeflow Pipelines DSL is mainly designed for ML pipelines, not all the features of Tekton and Argo are exposed in the DSL, and that creates problems. A lot of the existing Tekton community task templates, the catalogs used for building DevOps pipelines, don't map one-to-one to the KFP DSL, because not all the features are exposed there. That creates a very steep learning curve for existing Tekton developers: they cannot define their normal workflows, because some of the Tekton features they rely on are not exposed in the KFP DSL.
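To make that contrast concrete, here is a hedged sketch. Some pipeline-level features are exposed in the KFP DSL, for example the per-step retries Yihong mentioned for intermittent failures, and the compiler can presumably map those onto Tekton's task-level retries. But a generic Tekton catalog task has no direct KFP DSL hook. The component and image below are hypothetical:

```python
from kfp import dsl

@dsl.pipeline(name='retry-demo')
def retry_demo():
    flaky = dsl.ContainerOp(
        name='flaky-step',
        image='example.com/flaky:latest',  # hypothetical image
    )
    # Retries ARE exposed in the KFP DSL, so a Tekton backend can
    # translate them into task-level retries.
    flaky.set_retry(3)
    # By contrast, there is no DSL hook here for dropping in an
    # arbitrary Tekton catalog Task (e.g. a Slack-notification task);
    # that gap is the learning-curve problem described above.
```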
One of the lessons we learned from these challenges is that maybe in the future, when Tekton gets more mature, we could design a DSL specific to the Tekton platform, rather than relying on the Kubeflow Pipelines DSL, which carries a dependency on running on Kubeflow Pipelines. So, with this, we're going to end the session. Here are the resources for the projects we have built: you can find the GitHub links for Kubeflow Pipelines with Tekton, the Tekton project, and the DevOps services we used. You can also find the tutorials, the Toolchain catalog, and the Tekton task catalog we used for our DevOps services. Thank you very much.