OK, welcome to the talk, and thanks for joining our session. Today we are going to share the journey of how we used various open source projects to set up a metadata-driven ML platform. My name is Yi Hong Wang, and this is my coworker, Ted Chen. We work in the open source AI technology group at IBM Research. Both of us have been working on various open source projects for at least eight years. Most of the projects we work on recently are related to machine learning, like Kubeflow, TensorFlow, Ray, KubeRay, Flink, and the Training Operator, just to name a few. We are committed contributors and maintainers as well as active users. Based on our experience and explorations, we constructed an ML platform and would like to share it with you. The platform comprises many open source projects, so we will go through them one by one and provide the information you need to set up the platform. I hope this can help you seamlessly integrate ML into your business solutions.

Here is the agenda. We will cover two main topics, both very important for ML enablement. The first one is the open lakehouse. The open lakehouse has become a very trendy topic these past two years, and it is the foundation of the ML platform that we constructed. It combines the features of the data warehouse and the data lake and is used by many ML personas, for example data analysts, feature engineers, and data scientists. I will cover how to build an open lakehouse and its key components later. The second part is the metadata-driven ML pipeline. You are not able to adopt ML successfully without an ML pipeline framework, and we believe Kubeflow is a perfect solution for this. Kubeflow is a very big umbrella over a group of ML projects, with a built-in metadata-driven pipeline called Kubeflow Pipelines. Ted will share more information about it, along with other projects that you can easily integrate with Kubeflow, including Katib for hyperparameter tuning, the Training Operator for model training, KServe for model serving, et cetera. So without further ado, let's get started.

First things first, let's start with the ML lifecycle, the hot issue everyone is tackling right now. Here is a high-level view of the ML lifecycle. It can be divided into three pillars: data, model, and deployment. The lifecycle always starts from the data, and then you use the data to build the models that automate decisions. In the data phase, you acquire data and extract and analyze the useful features. In the model phase, you use state-of-the-art techniques to compose and train your model. And finally, in the deployment phase, you deploy the models and integrate them into your solutions or products.

Now let's dive into the details of the ML lifecycle to get a clear picture of it. If you search for "ML lifecycle", you will see a lot of different flows and diagrams; here is just one of them, but I believe all of them carry the same message: the end-to-end ML lifecycle is extremely complicated. Here you can still see the three phases I mentioned earlier: the data phase, the model phase, and the deployment phase. Each phase is constituted of multiple tasks and operations, as you can see here, and different tasks are conducted by different teams with specific expertise and skills.
For example, data engineers work on data ingestion, data transformation, and data splitting. Data scientists work on data analysis and feature extraction. ML practitioners work on model composing and model optimization. ML engineers work on model training at scale and model validation. And DevOps and software engineers work on model deployment and application integration. So obviously, having a platform that facilitates the entire ML pipeline and enables the various teams to work on their tasks becomes a must-have to succeed in ML enablement.

When you adopt ML, according to our experience, you will face the following challenges, which we categorize into two groups. The first group is data preparation, of course, because everything starts from the data. The data volume keeps growing and growing in the ML world, so you need to be able to retrieve and process data fast enough to meet your deadlines and achieve a faster time to market. The second concern is the data type: data can be structured, semi-structured, or unstructured. For most ML solutions, you need to deal with semi-structured and even unstructured data, a.k.a. schema-on-read, but you may also need to handle structured data from time to time. So we need a solution that can handle these different data types without complicated data movement and transformation processes. There is also flexibility and elasticity: data usually comes from multiple data sources, and there are different types of workloads in data preprocessing, for example interactive or ad hoc data queries versus batch processing. And of course, how much the platform costs is one of the important factors, as is whether the platform uses open formats or vendor-locked formats.

For the second group, the ML side, complexity is always one of the challenges. As we saw, the ML lifecycle is very complicated and requires coordination across multiple teams, and the more people work on the same system, the greater the likelihood of errors and the harder it is to debug. So being error-prone is another challenge. The entire ML lifecycle is also a lengthy process, so you need a system that is flexible enough to adjust resources based on the workload across the entire platform. Last but not least is automation: because the lifecycle is so lengthy, we definitely want automation mechanisms to speed it up, and those mechanisms need to be easily manageable and applicable to various kinds of tasks, because each ML task or operation requires different methodologies or techniques.

So our solution for handling the data preprocessing challenges I mentioned earlier is very obvious: it's the open lakehouse. The open lakehouse stitches together the features of the data warehouse and the data lake, and brings traditional data analytics and advanced functionality to ML scenarios. A lakehouse still uses object storage to store the different types of data, which keeps the cost low. One of the key components of the lakehouse is the processing engine, which provides the capability to connect to various data sources and run queries at scale. And as the name implies, an open lakehouse uses open standards and is formed solely from open source projects.
Another key component of the lakehouse is the table format. The table format forms an abstraction layer above the storage, which provides reliability and great performance while processing huge datasets, including ACID transactions, caching, automatic partitioning, time travel, snapshots, compaction, et cetera. I will cover this in just a bit.

Here is a high-level open lakehouse architecture, which comprises multiple layers. Starting from the bottom layer, you can have SQL databases, cloud object storage, HDFS, and more. Above the data storage are the open file formats, including Parquet, JSON, and Avro. As for the table format, there are several choices, including Apache Iceberg, Delta Lake, and Apache Hudi. Different table formats provide different features, and in our solution we use Apache Iceberg because we believe it provides more features that are suitable for ML usage; I will cover it in the next slides. Above the table format is the key component of the lakehouse, the processing engine, and here we have Trino, Presto, and Spark. Trino and Presto are SQL engines, and Spark is used more for batch processing. Then, on top of the processing engine, you can have a bunch of open source visualization or business intelligence tools that connect to your lakehouse to run queries, do data analytics, and build dashboards.

Like I mentioned earlier, the table format we choose is Apache Iceberg, because it is designed for handling petabyte-scale datasets and solving the consistency and performance issues of storing data on cloud storage. It was initiated by Netflix and donated to the open source community. It uses a series of metadata files to track data changes while maintaining the performance of data access. Each table change creates a new snapshot, which is linked with multiple metadata files, and an atomic switch between snapshots supports atomicity, consistency, isolation, and durability, a.k.a. ACID, which ensures data reliability and integrity even with multiple concurrent writers. It also provides other features: time travel, the ability to query the table based on its state at a certain point in history; partition evolution, which allows you to change the columns a table is partitioned by over time; and schema evolution, which allows you to add, rename, reorder, and delete columns.
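To make those features concrete, here is a minimal sketch of exercising schema evolution and snapshot history through Presto from Python. It assumes the presto-python-client package, an Iceberg catalog named `iceberg`, and a running coordinator; the host, schema, table, and column names are invented for illustration, and the `"$snapshots"` metadata table is how the Iceberg connector exposes snapshot history.

```python
# A minimal sketch, assuming the presto-python-client package and an
# Iceberg catalog named "iceberg"; names below are invented.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator",  # hypothetical service name
    port=8080,
    user="demo",
    catalog="iceberg",
    schema="demo_db",
)
cur = conn.cursor()

# Schema evolution: columns can be added or renamed without rewriting
# the underlying data files.
cur.execute("ALTER TABLE events ADD COLUMN country VARCHAR")
cur.fetchall()  # the client is lazy; fetching drives the statement to completion

# Time travel starts from the snapshot history Iceberg keeps per table;
# the Iceberg connector exposes it through the "$snapshots" metadata table.
cur.execute('SELECT snapshot_id, committed_at FROM "events$snapshots"')
for snapshot_id, committed_at in cur.fetchall():
    print(snapshot_id, committed_at)
```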
As for the processing engine, we use Presto. Some people call it the Presto database, but it is actually a SQL engine: an open source, distributed SQL engine that can query large datasets from various data sources, ranging from relational databases to NoSQL, real-time, and even streaming data sources. Presto connects to those data sources using a flexible plugin mechanism called connectors, and it queries the data where it resides, without moving the data around. Presto was created for interactive and ad hoc data analytics at the very beginning, so it leverages a distributed architecture, in-memory processing, query optimization, hierarchical caching, and a lot of other features to run queries with the lowest latency, whether you are dealing with petabytes of data or real-time streaming analytics. Presto ensures that your queries run with optimal latency, enabling you to make informed decisions quickly. Presto runs SQL queries and complies with the ANSI SQL standard, and in the data analytics world most users already know how to write SQL, so Presto is easy to pick up: you don't have to learn a new language. Last but not least, Presto is an open source project with a big community, used by many companies including Meta, which originally invented it, as well as ByteDance, Uber, IBM, and more.

And for the very top layer I mentioned earlier, the visualization and BI tools: because Presto provides JDBC and REST APIs, you can easily set up those BI tools to connect to a Presto cluster and retrieve data. The Presto community also provides Python and JavaScript SDKs for integration, so you can easily integrate other tools using those SDKs. For us, we set up Apache Zeppelin, a data visualization tool, which uses the JDBC connector to connect to Presto. We also set up a Jupyter notebook, which uses the Python SDK to connect to Presto. And on the right-hand side here is another visualization tool, Apache Superset. It can also integrate with Presto easily; we don't use it in our deployment, but I highly recommend it as well.

Here is the complete topology that we set up for data preprocessing. You can see we deploy everything on Kubernetes; of course, you can switch to OpenShift for sure. On top of it we have MinIO for cloud object storage, MongoDB for semi-structured data, and MySQL for structured data. Like I said earlier, we use the Iceberg table format, and a Hive metastore is used by Iceberg to store the metadata information. On the Presto layer, we use the Presto Helm chart to construct a three-node cluster: one node is the coordinator and the other two are workers. And on top of it, like I mentioned, we use Apache Zeppelin for visualization, plus a Jupyter notebook where data scientists can try out Presto SQL and do their data analytics work.

I wanted to run a live demo, but we found the network is not that stable, so we recorded a video earlier; let me show you. Here is the Presto cluster I mentioned earlier. In the Presto dashboard, I recently contributed a SQL client to the community, so you can run SQL queries directly in the UI. Here you can see we have connected to different data sources: Iceberg, MongoDB, MySQL, and some benchmark databases. Like I said, you can run SQL queries, and here I ran a federated query: it queries the MySQL database and MongoDB, joins the two tables from the two different data sources, and gives you the result. It's a very simple query. Another thing I want to show you is the Zeppelin visualization tool. It is very easy to compose data visualizations using Zeppelin: you can see the SQL query, run it, and it talks to the Presto cluster we constructed. Apache Zeppelin provides very fancy visualizations; you can see the SQL query we have, just run it, and it gives you the result very quickly and plots the chart.
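For reference, here is a minimal sketch of that kind of federated query issued through the Presto Python client, the same SDK used in the notebook demo that follows. All catalog, schema, table, and column names are invented for illustration, and it assumes MySQL, MongoDB, and Iceberg catalogs are configured on the cluster.

```python
# A minimal sketch, assuming the presto-python-client package and a cluster
# with mysql, mongodb, and iceberg catalogs configured; names are invented.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080, user="demo",
    catalog="iceberg", schema="demo_db",
)
cur = conn.cursor()

# One query joins a MySQL table with a MongoDB collection in place,
# without first moving the data into a single store.
cur.execute(
    """
    SELECT c.name, o.total
    FROM mysql.sales.customers AS c
    JOIN mongodb.store.orders AS o
      ON c.id = o.customer_id
    """
)
for name, total in cur.fetchall():
    print(name, total)

# The joined result can also be landed in an Iceberg table, ready to
# hand over to data scientists as Parquet underneath.
cur.execute(
    """
    CREATE TABLE iceberg.demo_db.customer_orders AS
    SELECT c.name, o.total
    FROM mysql.sales.customers AS c
    JOIN mongodb.store.orders AS o
      ON c.id = o.customer_id
    """
)
cur.fetchall()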
So I think that's my part, but I can also show you the other piece I mentioned earlier: the Presto Python SDK in a Jupyter notebook. Where did it go? Sorry, give me one moment. Yeah, it's here, sorry. This is the notebook. In the notebook you can see we use the Presto Python SDK, which easily helps you connect to Presto: you import the library, specify the server, and you can run queries directly in the notebook. You can see it just queries the results back, and here is the same federated query I issued in the SQL client UI. Then, because I want to demo how to ingest data into the Iceberg table, I clean up the table, run the federated query, and ingest the result into the Iceberg table. The last query reads that data directly from Iceberg, and since Iceberg uses the Parquet file format underneath, you can easily hand the data over to the data scientists.

OK, back to the slides. Here is the information to help you set up the environment we built, starting from Presto: the Presto repository, the many connectors you can configure, the Presto Helm chart we use to deploy Presto on a Kubernetes cluster, and Apache Iceberg. We also have two workshop labs to help users learn Presto, and Presto plus Iceberg. They are free and publicly available, with step-by-step instructions on how to use Presto, how to set it up, and how to set up Iceberg with Presto. That finishes my part, so I will hand over to Ted, who is going to share the solution for the ML lifecycle.

Thanks, Yi Hong. For the data challenges, we just talked about how Presto can be used to solve some of them. The other remaining challenge is automating the ML lifecycle, because after you turn data into features, typically the next step is model training. Then you may have to do hyperparameter tuning to optimize the model. After the model is trained, you will deploy it, but model accuracy may drop, and then you will need to start the whole process from the data again.

The heart of Kubeflow is the pipeline, also called KFP. We can use the pipeline to break the ML lifecycle into smaller tasks, or boxes; the lines and arrows determine the order in which each task is processed, like a flowchart diagram. Each box may run a containerized task like data preprocessing, training, or model deployment, and some of the tasks can be handled by pre-built components. We can use the Python DSL or a Jupyter notebook extension to compose pipelines, and once you have a pipeline, you can schedule it to run automatically. Kubeflow Pipelines is a metadata-driven pipeline: internally, the pipeline backend stores the runtime information of a pipeline run in a metadata store. Runtime information includes the status of a task, the availability of artifacts, and custom properties associated with each execution or artifact. The metadata store also enables pipeline step caching, which means that if the arguments are exactly the same, a task will be skipped and the old output is reused.
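As a rough illustration of the Python DSL, here is a minimal sketch of a two-step pipeline using the KFP v2 SDK; the component bodies and names are placeholders, not the pipeline from our demo.

```python
from kfp import compiler, dsl


@dsl.component  # each component runs as a containerized task
def preprocess(raw: str) -> str:
    # placeholder for real feature engineering
    return raw.strip().lower()


@dsl.component
def train(features: str) -> str:
    # placeholder for real model training
    return f"model trained on: {features}"


@dsl.pipeline(name="demo-pipeline")
def demo_pipeline(raw: str = "Some Raw Data"):
    # wiring outputs to inputs draws the "lines and arrows"; if the
    # arguments are unchanged between runs, step caching can skip a task
    step1 = preprocess(raw=raw)
    train(features=step1.output)


if __name__ == "__main__":
    # compile to a definition you can upload to the Kubeflow dashboard
    # or submit programmatically with kfp.Client()
    compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```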
So let's look into the Kubeflow core components and their usages. The first one is Kubeflow Notebooks. It provides a web-based development environment inside your Kubernetes cluster: an interactive environment for you to implement tasks like data analysis, training, and deployment. You may also import additional notebook images which provide other functionality, like Elyra, which lets you compose pipelines using a low-code, drag-and-drop approach.

For hyperparameter tuning, we use Katib. Once you have written your ML training code, you can parameterize it, and Katib runs trials to figure out the best parameters for the learning process. Katib also has a Python SDK to create Katib experiments, and the pipeline has pre-built components to run Katib experiments.

For model training, we use the Kubeflow Training Operator. It is used for fine-tuning and distributed training of ML models. It supports ML frameworks such as PyTorch, TensorFlow, XGBoost, and others, and it allows you to deploy ML training workloads using a Python SDK and custom resources.

And last, for deploying models, KServe is a standard model inference platform on Kubernetes, built for highly scalable use cases. It provides a standard inference protocol across ML frameworks and offers production model serving, including prediction and pre-processing, and it can also be used for monitoring, drift and bias detection, and so on.
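For a flavor of the Python SDK path, here is a minimal sketch of deploying a scikit-learn model as a KServe InferenceService, following the pattern from the KServe SDK documentation; the namespace and model storage URI are made up, and it assumes a cluster with KServe installed.

```python
# A minimal sketch, assuming the kserve Python SDK and a Kubernetes
# cluster with KServe installed; namespace and storage URI are made up.
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="sklearn-demo", namespace="kubeflow-user"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                # hypothetical location of a trained model
                storage_uri="s3://models/sklearn-demo"
            )
        )
    ),
)

# Creates the InferenceService; KServe then exposes the model behind
# its standard inference protocol.
KServeClient().create(isvc)
```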
Now let's take a look at a quick demo, which I recorded earlier because the network may be slow. Let's see how we can create a simple pipeline using a Kubeflow notebook. I'm using a notebook with the Elyra extension; as mentioned before, it offers a low-code, no-code approach to creating pipelines. I have three notebooks which I already created: one for data aggregation, one for loading and training the model, and another one for serving. I drag and drop the notebooks onto the canvas and connect them with lines so they run in sequence. Then I click the play button to execute the pipeline using Kubeflow Pipelines, and I can go to the Kubeflow dashboard to see the result. Oops, why is it zoomed in? Okay. Once we execute the pipeline, we can go to the dashboard and, while it's running, see its output: the logs, events, metadata, and so on. And once the pipeline finishes, we can use the KServe standard inference API to get the prediction result. So that's it for the demo.

Kubeflow has many distributions. We installed the IBM Cloud distribution on a standard Kubernetes cluster with additional components like KServe and the Elyra extension. Moreover, each Kubeflow distribution may patch different components. For example, OpenShift AI is based on Red Hat OpenShift and the Open Data Hub distribution; it offers additional open source components and an SDK for running large language model workloads, and it also supports third-party DataOps services for running large queries. IBM watsonx.ai is built on top of OpenShift AI.

So that is it for our talk. To summarize, we talked about solving the data challenges with an open lakehouse using Presto and Iceberg, we talked about solutions to the ML lifecycle challenges with Kubeflow, and we demoed how to create a simple pipeline to automate these solutions. Thank you for attending; I hope you enjoyed it. We were just told we have five minutes for questions, so if you have any questions, just raise your hand. But it's so bright, I actually cannot see.

So yeah, like we mentioned earlier, we went through a lot of different open source projects, and we think it's really interesting to bring all of them together, from the data all the way to the ML pipeline deployment. That's very important, because most of the time when you look at an ML solution, it only covers, for example, the data part, or only the model training, or maybe only the deployment. But we definitely need an end-to-end solution: starting from the data, preprocessing and manipulating it, and then handing it over to the data scientists to compose and train the model and make predictions. So we think it's really good.

Have you thought of something like a feature store, where data has a special meaning and a special use case because it is going to an ML model, and it comes with its own special training and serving needs?

Yes, that's a good question. That is actually another reason we picked Apache Iceberg as the table format: it is supported by a lot of feature stores. So you can easily connect the feature store of your choice and integrate it with our data sources. For example, when you do the data preprocessing in our scenario, we have MySQL, we have MongoDB, and maybe some data on S3 storage. You can use Presto to run that kind of federated query and dump the result into the Iceberg table format, and then your feature store can talk to the Iceberg tables and retrieve the data.

I was wondering about your data gathering. Are they all custom, bespoke solutions, or do you have some sort of Flink or Spark ingestion engine that you run to gather your data?

That's a good question. As you know, Ted and I develop and contribute to these open source projects, so we tried to find good scenarios and datasets, but your question is whether we have a scenario that does data ingestion, right? We don't have that, because in our scenario we just tried to compose the platform and provide it for other users. So if you have any good data ingestion scenario, you are definitely welcome to share it with us, and we can try it out on our platform.

So I think that's it. Thanks for attending again; this is the information we shared, and we'd like you to try it out and maybe give us feedback. Thank you. Thank you.