Hello everyone, welcome to our talk. The topic for today is integrating a high-performing feature store with KServe model serving. I'm Ted Chen, and I'm with my colleague Qin Huang today. We are software engineers from the IBM Silicon Valley Lab in California. I'm here to talk about feature stores and some of the options for setting up a high-performing Feast feature store. Qin will go over KServe with ModelMesh, show how to integrate the Feast feature store with ModelMesh in a Kubernetes cluster, and walk through an end-to-end demo.

First, a bit of background for people who are not familiar with feature stores. Features are individual values that are inputs to machine learning models to predict an outcome. Feature engineering, on the other hand, is the process of generating or extracting features from collected data. Many ML organizations store features in a centralized fashion, and each organization has different requirements for its feature store. But in general, a feature store can be seen as a data management layer that enables data scientists and ML engineers to create, share, and distribute ML features.

The term "feature store" was first introduced in Uber's machine learning platform, Michelangelo. Back in 2017, it was developed to build reliable, uniform, and reproducible pipelines for creating and managing training and prediction data at scale. Before the system was built, data scientists were building models on their laptops, and the engineering team was building one-off systems to serve models in production for each project. There was no established way to deploy models into production. Of course, Uber was not alone; many other companies were facing similar problems. Airbnb, Spotify, Pinterest, and Twitter were all looking for solutions to manage and operate their ML pipelines and deployment processes, so they were all building their own in-house feature stores to solve their own needs. Some were open source, others were closed source. The term "feature store" has since become more generic in recent years. In 2020 and 2021, there was an explosion of managed feature stores, such as the Tecton, Databricks, Vertex AI, and SageMaker feature stores, just to name a few.

Feast was one of them. It was originally founded by Gojek. Willem Pienaar, the creator of Feast, said it was developed to address the data challenges at Gojek in scaling machine learning for ride hailing, food delivery, digital payments, fraud detection, and other use cases. It was developed in 2018, open sourced in 2019, and joined the LF AI & Data Foundation in 2021. Feast is one of the most popular feature store projects on GitHub, currently at 3.3K stars.

There are three main concepts in Feast. The offline store is a serving layer for retrieving features for model training; it supports many data warehouses. The online store is a serving layer for retrieving the latest features, which are the features materialized, or synced, to the online store after you are done with your model training. Feast provides a list of choices for offline and online store backends, which we will go over later. The third one is the registry. It is a file-based object store that is generated or updated by running the feast apply command. It contains serialized feature metadata and materialization history. The registry is a central catalog which allows data scientists to search, discover, and collaborate on features.
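To make these three concepts concrete, here is a minimal, hypothetical feature_store.yaml sketch wiring a file-based registry, a Redis online store, and a file-based offline store under the local provider; the project name, paths, and connection string are illustrative, not our exact setup.

```yaml
project: driver_ranking          # illustrative project name
provider: local                  # no cloud provider; bring your own stores
registry: data/registry.db       # file-based registry updated by `feast apply`
online_store:
  type: redis                    # serves the latest, materialized features
  connection_string: "localhost:6379"
offline_store:
  type: file                     # e.g., Parquet files used for training data
```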
The feature repo is the single source of truth for feature definitions and the feature store config. It contains a feature_store.yaml config and Python files (for example, feature_store.py) which define the data sources, entities, feature views, and feature services stored in your chosen offline store. Running the feast apply CLI command refreshes the registry object. Feast provides a Python SDK and a CLI for the downstream ML training, inferencing, and operational tasks. For example, the get_historical_features SDK method is used for point-in-time feature retrieval from the offline store for model training. The feast materialize CLI command is used to sync the latest features into the online store. The get_online_features SDK method is then used to get the latest features from the online store, which is the integration point for our downstream model serving layer, as sketched below. The optional feature server is a front end to the online store: feature servers provide REST or gRPC front ends to the online store in case your program doesn't use the Python SDK. For this talk, we'll focus more on the online store.

Keep in mind that Feast does not solve the following problems. First, it does not aim to be an ETL tool: Feast is not a feature engineering tool. Feast assumes that you have already done the feature engineering work using an upstream ETL tool and that you have stored those features in your data warehouse; Feast instead provides the SDK to retrieve features from your data warehouse in a consistent way. Last, Feast is not a general-purpose data catalog. Feast is purely focused on cataloging features for machine learning pipelines and systems, and only to the extent of facilitating the reuse of features.

There are many ways to use the Feast SDK for feature serving. By default, it comes with three providers maintained by the Feast community, plus a vendor-specific provider. A provider is an implementation of the feature store components using a specific combination of offline store, online store, and registry in a specific environment. For example, the GCP provider uses Datastore, BigQuery, and GCS for the online store, offline store, and registry respectively. If you don't use any of the cloud providers, the choices for the offline store, online store, and registry fall under the local provider, which means SQLite or Redis for the online store, S3 or BigQuery for the offline store, plus any of the additional offline store plugins such as Snowflake, Hive, Postgres, Trino, and Spark; some of these are third-party or currently experimental. However, if your offline store is not listed, custom offline or online stores can be implemented by subclassing the abstract offline or online store classes.

Let's talk about our scenario. In our deployment, our model serving platform is KServe, which is based on Kubernetes, and we deploy our online store on Kubernetes. We won't go too much into the offline store and model training parts, since our focus is mainly model serving, which only requires the latest features. But on the Feast community website there are many examples and tutorials of end-to-end scenarios for populating the online store from scratch. For Kubernetes, the Redis-backed online store is recommended. There are two options for high-performance online feature retrieval with the Redis online store. The first is the Java feature server. It is an optional gRPC feature-serving front end to the Redis online store which can be deployed quickly, using helm install, as a service in your Kubernetes cluster.
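As an illustration of how the repo and the SDK fit together, here is a hedged Python sketch; the driver entity, feature view fields, and file paths mirror the driver-ranking demo later in the talk but are hypothetical, and the exact API shape varies somewhat across Feast versions.

```python
from datetime import datetime, timedelta

import pandas as pd
from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# --- feature definitions (e.g., feature_store.py), registered by `feast apply` ---
driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_source = FileSource(
    path="data/driver_stats.parquet",        # hypothetical offline data
    timestamp_field="event_timestamp",
)

driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="acc_rate", dtype=Float32),
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
)

# --- downstream usage via the SDK ---
store = FeatureStore(repo_path=".")

# Point-in-time retrieval from the offline store for model training.
entity_df = pd.DataFrame(
    {"driver_id": [1001, 1002], "event_timestamp": [datetime.utcnow()] * 2}
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()

# Sync the latest feature values into the online store
# (the SDK equivalent of the `feast materialize-incremental` CLI command).
store.materialize_incremental(end_date=datetime.utcnow())

# Low-latency lookup of the latest features, used by the serving layer.
online = store.get_online_features(
    features=["driver_hourly_stats:acc_rate", "driver_hourly_stats:conv_rate"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```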
A benchmark has shown that the Java feature server may achieve near sub-millisecond latency, depending on the number of features retrieved. In a study done by the Feast community, it was 7 to 10 times faster than its counterpart, the Python feature server option, when retrieving a row of 50 to 250 features. There are pre-built Python and Java gRPC clients to retrieve features, and in case your downstream model serving layer does not use a Python or Java client, additional clients in other languages can be generated from the protobuf definitions using the code generator. The second option is the newly developed Go-based Python SDK, which the community claims is much faster than the original Python SDK. We tried all three options here, and since our KServe transformer is Python-based, we chose the Go-based Python SDK for the integration (see the config sketch below). Next I'll pass it to Qin, who will talk about KServe with ModelMesh.

Thank you. My name is Qin Huang from IBM. I will give you a quick overview of KServe with ModelMesh and do a short demo of model inference using online features. Previously called KFServing, KServe is a CNCF incubating project. It's a standards-based model serving platform built on top of Kubernetes, aimed at supporting production-grade model serving use cases. It has a set of high-performance, high-abstraction interfaces so that users can deploy and run models in their favorite machine learning frameworks, including TensorFlow, PyTorch, XGBoost, scikit-learn, ONNX, and TensorRT. Currently, NVIDIA's Triton server, Seldon's MLServer, and PyTorch's TorchServe all support this inference protocol. It is open source and it runs anywhere Kubernetes runs, so you don't have to worry about vendor lock-in when using this serving solution.

As we all know, Kubernetes has certain resource limitations, such as the maximum number of pods in a node and the maximum number of IP addresses in a cluster. ModelMesh in KServe is designed specifically to address these limitations. It allows you to run thousands of models, and to change the models frequently as well, with high density and scalability. Basically, it serves multiple models per container. It has the logic to unload inactive models and load them back just in time whenever needed, so the utilization of the available compute resources is fully optimized. It also has the intelligence to manage in-memory model data across clusters of running pods, all based on the usage of those models over time.

Let's take a look at the ModelMesh architecture. Essentially, serving runtime deployments are created on demand to host compatible predictors, or models. On this chart, you see two runtime deployments, serving ten models in total. In each pod, there are three containers. The first one implements the ModelMesh logic. The second one is the adapter, or puller, which retrieves models from the S3 object store. And the third one is, of course, the model server, like Triton or MLServer, which does the model inference. There's also a Kubernetes service here that fronts all pods across all deployments, and external inference requests go through this particular service. One of the ModelMesh pods acts as the ingress pod and routes the requests to the other pods as needed. Finally, the etcd instance coordinates the operations and also persists model states.
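For reference, opting in to the Go-based online retrieval path is, as far as we know, a one-line change to feature_store.yaml; the flag below reflects the Feast docs around this time and may differ across versions, and the service name is a hypothetical in-cluster Redis endpoint. Python get_online_features calls then transparently use the Go implementation.

```yaml
project: driver_ranking
provider: local
registry: data/registry.db
online_store:
  type: redis
  connection_string: "feast-redis:6379"   # hypothetical in-cluster Redis service
go_feature_retrieval: true                # opt in to the Go-based online path
```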
We wanted to see how much we could pack into a single node with ModelMesh, so we ran a scalability test in a fairly small Kubernetes cluster with only 8 vCPUs and 64GB of memory. In the end, we were able to deploy 30K simple string models into two serving runtime pods. We then sent in thousands of concurrent inference requests to simulate a high-traffic scenario. ModelMesh handled it nicely, with single-digit milliseconds of latency when the QPS was less than 1,000. I think that's pretty impressive, right?

Here are the highlights for KServe with ModelMesh. First off, the standardized inference protocol works for common machine learning frameworks. We get GPU autoscaling and scale-to-zero by leveraging Knative. There's an explainer component that allows you to analyze model behaviors. Canary rollouts make it easy to shift traffic to a new model version during a model upgrade. Custom plugins can be developed as well, to do pre- and post-processing around the model prediction. The intelligent model placement and loading I talked about earlier helps to optimize performance and resource usage. And we have the efficiency, scalability, and flexibility to serve thousands of models.

Alright, next I'll go over how we put Feast and ModelMesh Serving together to make a prediction. KServe has a transformer component which is customizable for pre- and post-processing for inference. Naturally, we created a Feast transformer to retrieve real-time features and perform data transformation as needed; a sketch of such a transformer follows below. As Ted mentioned earlier, Feast provides multiple ways to fetch online features: via a REST server, a Java-based gRPC server, or a Go-based SDK. Our transformer can communicate with Feast in all three scenarios. In this diagram, for the demo coming up, we have taken the Go-based SDK approach, which is recommended for production use by the Feast community. The inference request comes into the transformer first and gets routed to Feast for real-time feature retrieval. Then we transform the augmented data into a protobuf message and send it over to ModelMesh via gRPC for inference. At the end, we transform the results in the protobuf message back into the expected data format for the outputs. Essentially, we applied the Go-based SDK and gRPC for inference to achieve the best end-to-end performance.

As for the demo, we wanted to find the best candidate for a driver request based on the most recent driver features. In this case, we use the driver's acceptance rate, conversion rate, and average daily trips. We would also like to serve multiple regions; for instance, San Francisco and San Jose might use similar but separately-trained models. The initial input will be driver IDs, and the output is the driver rankings. Let's see how this Feast transformer with ModelMesh can make it happen. We have everything set up in a three-node Kubernetes cluster. There are a couple of pods running the controllers for KServe and ModelMesh. As you can see, the Feast online feature store is a Redis server, and we have a cron job to update the features from an offline store in S3. We have a single serving runtime pod, including MLServer, to serve two predictors pre-trained in scikit-learn. Each transformer also gets its own pod to handle the incoming requests. Finally, two inference services will be deployed, and we will send two driver requests: one with two drivers in San Francisco and another with four drivers in San Jose. These requests will use the online features and eventually be processed by different models, as indicated in this diagram.
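To make the transformer flow concrete, here is a hedged, minimal sketch of what such a Feast transformer might look like with the KServe Python SDK. The class name, feature names, and entity key follow the driver demo, but the hostnames, paths, and payload shape are assumptions for illustration, not the exact code we ran.

```python
from typing import Dict

import kserve
from feast import FeatureStore


class DriverTransformer(kserve.Model):
    """Enriches incoming driver IDs with the latest Feast features."""

    def __init__(self, name: str, predictor_host: str, feast_repo: str):
        super().__init__(name)
        self.predictor_host = predictor_host  # ModelMesh service to forward to
        # With go_feature_retrieval enabled, this store uses the Go path.
        self.store = FeatureStore(repo_path=feast_repo)

    def preprocess(self, inputs: Dict) -> Dict:
        # Assumed incoming payload: {"instances": [[1001], [1002], ...]}.
        entity_rows = [{"driver_id": row[0]} for row in inputs["instances"]]
        features = self.store.get_online_features(
            features=[
                "driver_hourly_stats:acc_rate",
                "driver_hourly_stats:conv_rate",
                "driver_hourly_stats:avg_daily_trips",
            ],
            entity_rows=entity_rows,
        ).to_dict()
        # Re-assemble one [acc_rate, conv_rate, avg_daily_trips] row per driver.
        instances = [
            [features["acc_rate"][i],
             features["conv_rate"][i],
             features["avg_daily_trips"][i]]
            for i in range(len(entity_rows))
        ]
        return {"instances": instances}

    def postprocess(self, outputs: Dict) -> Dict:
        # E.g., turn raw model scores back into driver rankings here.
        return outputs


if __name__ == "__main__":
    model = DriverTransformer(
        name="driver-ranking",                 # hypothetical names and paths
        predictor_host="modelmesh-serving:8008",
        feast_repo="/mnt/feast/driver_repo",
    )
    kserve.ModelServer().start([model])
```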
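And here is a hedged sketch of what one of the two InferenceService specs might look like, with a scikit-learn predictor plus a transformer container; the image name, storage URI, and Feast-related arguments are illustrative placeholders rather than our exact deployment spec.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: driver-ranking-sf            # one service per region, e.g. SF and SJ
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    sklearn:
      storageUri: s3://models/driver-ranking-sf     # hypothetical model location
  transformer:
    containers:
      - name: feast-transformer
        image: example.io/driver-transformer:latest  # hypothetical image
        args:
          - --feast_repo=/mnt/feast/driver_repo
          - --entity_id=driver_id
          - --feature_refs=driver_hourly_stats:acc_rate,driver_hourly_stats:conv_rate,driver_hourly_stats:avg_daily_trips
```

With both services up, a request following the KServe v1 protocol, for example `curl -d '{"instances": [[1001], [1002]]}' http://<ingress>/v1/models/driver-ranking-sf:predict`, carries only driver IDs; the transformer fills in the latest features before the model ever sees the data.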
Okay, here's my terminal. I have a Kubernetes cluster with this Redis server for Feast, and I'm going to deploy two inference services. You can take a look at this one. As you can see, in my deployment spec there's a predictor as well as a transformer. The entity IDs and the feature keys are specified here, and I can also request additional resources if needed. It's pretty straightforward, so I'll go ahead and deploy them. Let's take a look at my pods. As you can see here, the serving runtime pod is being created, and pretty soon my transformer pods are up as well. Next, I'll send a couple of requests to do inference. You can see one here for San Jose, for instance, and the other one for San Francisco. I have a small script here to make a curl request using these input files. Let's send the request for San Francisco. As you can see, it quickly comes back with the results, after going through Feast and ModelMesh. Similarly, I can make the request for San Jose, and I get the driver rankings back for the four drivers. As you may notice, the model name is padded with a string here to make it unique across multiple namespaces. This concludes the short demo. Back to you, Ted.

Okay, thanks Qin for the demo. For a recap: in this talk we introduced the open-source Feast feature store, and showed how you can set up your own high-performing Feast online serving in a Kubernetes cluster and get the online features with the Go-based Feast SDK. We also showed you a demo of how to integrate the open-source ModelMesh model serving layer with Feast for multi-region model serving in a Kubernetes cluster. Coming up, we have a few more talks from IBM. Please participate if you are interested, and let us know if you have any questions. Thank you very much.