How is everyone doing? I'm surprised the room is still occupied for one of the last talks. Thank you for sticking around, and I hope you're enjoying the week. So let's get started. Is the recording on? OK.

First, let me introduce myself. My name is Yuan, and I'm a principal engineer at Red Hat; I joined about three months ago. I'm also a project lead for Argo and Kubeflow. Today I'm going to talk about production-ready AI platforms on Kubernetes.

Today's agenda: first, I'll talk about the AI landscape and ecosystem. Then I'll cover some of the elements to consider for production readiness when you are building an AI platform on Kubernetes. Keep in mind this won't be an exhaustive list, as there are obviously many other things to consider in production, especially when dealing with Kubernetes and cloud-native technologies. And at the very end, I'll walk through a reference platform using particular tech stacks for the different components.

I'd like to spend a few moments introducing my book, released this year, called Distributed Machine Learning Patterns. It covers the different patterns involved in building a large-scale distributed machine learning system, and the last section of the book walks through a reference implementation that gives readers an opportunity to build a real-life, production-level machine learning system hands-on. Take a moment to scan the QR code if you're interested in learning more about the book.

OK, let's dive into the content. The AI landscape and ecosystem keeps getting larger and larger, especially since the birth of large language models and GPT-style models. Before that, the field may have been known as statistics, data science, machine learning, or deep learning; now the boundaries between all those words are getting blurrier. Here are two screenshots: one from the Linux Foundation AI and Data landscape, and one from the CNCF landscape, which has a larger scope but also contains categories for machine learning and AI. Definitely check out the links for these two landscapes if you want to learn more. And if you missed Priyanka's opening keynote, she also showed a nice picture of the landscape of AI-related projects in the ecosystem — definitely worth checking out. Here are some example projects that are used very frequently when building a cloud-native AI platform: Kubeflow, Argo, TensorFlow, PyTorch, KServe, Volcano, Ray, and so on.

This diagram was created by the Cloud Native AI working group, which started recently. It describes the relationships between different roles — for example, data scientists and engineers — and between those roles and the different components of the infrastructure. Data scientists may focus on building predictive or generative models: predictive models for problems like classification and clustering, and generative models being the recently hot ones, such as large language models and large vision models. Hardware vendors, at the bottom, may focus on accelerating machine learning workloads using different hardware. And there are different engineers in between: data engineers and platform engineers often need to collaborate closely to build a platform that is user-friendly and, at the same time, production-ready.
So check out the diagram and the white paper released by the Cloud Native AI working group if you want to learn more about the details. Here's a QR code to scan if you want to join one of the community meetings, and I believe the meeting notes also link to the white paper.

There is also a dilemma — a gap between what data scientists care about and how much infrastructure is needed. For example, data scientists care most about the model, but the model is actually one of the easiest parts to handle on the infrastructure side. Definitely check out the talk I gave with Savin last year at KubeCon North America if you want to know more about the challenges we faced and the solutions we suggested.

The first perspective I want to talk about for production readiness is scalability, and there are different scalability challenges. One is horizontal scaling: you add more pods to distribute heavy workloads across more and more pods. Kubernetes provides its own Horizontal Pod Autoscaler based on utilization metrics, and Knative also provides an autoscaler that works through a more event-driven, request-driven approach. (A minimal autoscaler sketch follows at the end of this part.) Besides horizontal scaling, there's also vertical scaling: adding more resources to existing pods instead of adding more and more pods, which might bring the whole cluster down. Kubernetes provides a Vertical Pod Autoscaler, and there's also an addon resizer that helps adjust a workload's resources based on the cluster's nodes.

Beyond scaling pods, the cluster itself also needs to be scalable, and there are different approaches to that. One is the Cluster Autoscaler, which automatically adjusts the size of the Kubernetes cluster. Another thing to think about is algorithm scalability. Certain algorithms may not be suitable for large scale; some may only be suitable for batch, and some have to be modified to be online-friendly — there are online variants of those algorithms to use in production. So it really depends on your business requirements whether you want to run your workloads in a batch fashion or a streaming fashion.

Next are hardware acceleration and resource sharing across hardware accelerators. These are getting very popular — at this KubeCon you may see a lot of talks about resource sharing between multiple GPUs and so on, and it's definitely worth checking out those talks; I believe there are a ton of them this week. Batch scheduling is of course also a hot topic, and you'll find many talks from the Cloud Native AI Day that happened on the first day of the week.

The next perspective I want to talk about is reliability. One aspect is high availability and disaster recovery: if you are building a Kubernetes controller, you definitely want to support leader election, with multiple pods standing by to make sure that if one pod fails, another pod is ready to take over its tasks instead of waiting a long time for a new pod to start.
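To make the horizontal-scaling point above concrete, here is a minimal sketch of a Kubernetes HorizontalPodAutoscaler. The deployment name and the thresholds are illustrative assumptions, not something from the talk:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa      # illustrative name, not from the talk
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server        # hypothetical model-serving deployment
  minReplicas: 2              # keep a warm standby pod for availability
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # scale out when average CPU passes 80%
```

This is the utilization-metrics approach the talk mentions; the Knative autoscaler mentioned alongside it instead scales on request concurrency, which suits bursty inference traffic better.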
Elasticity and fault tolerance are also important, especially for training workloads: if one of the workers fails, you don't want to stop the whole training run, but rather continue from a checkpoint if needed. And if you have limited resources, you may want to start your distributed training as soon as some pods are available, with other pods joining after training has started.

To increase reliability, one thing worth mentioning is versioning — following GitOps principles to make sure everything you deploy in production is version-controlled in Git, so that whenever something goes wrong, things can be recovered to the desired state. You might also want to prevent vendor lock-in; one thing to consider is a hybrid cloud, so you don't get locked in by a single cloud vendor, and when bad things happen to a particular vendor, you can switch to another one reliably. Support and SLAs are also something to consider, especially if you are building a large platform that consists of many different infrastructure components and tech stacks. You want to make sure each of those components is supported well enough, either by your own engineers or by a third-party service provider.

Observability is a hot topic in the cloud-native Kubernetes world. The observability I'm talking about here is both system observability and statistical observability, especially when you are building machine learning models. You want to look at performance metrics based on statistics — for example, when you are tuning a model's hyperparameters, you want those metrics tracked so you can tell whether your experiments are improving. And there are operational metrics you want to look at as well, from a system or resource perspective; if you want to understand more about what your controllers are doing and where their bottlenecks are, those are good metrics to look into.

The next item is model explainability and visualization. Often, when you are building a very large and complex model, you want to understand how well the model performs and how trustworthy its decisions are. So you want to know what the model is actually doing, visualize how some of its layers behave, and decide whether the model can be trusted in industries like banking and insurance. Those are things to consider as well.

For pipeline tracing, I'm mostly talking about running a lot of experiments: you want to make sure each step in your experiment is traced properly, so you understand which dataset is used by which model, where the model is deployed, how the model changes over time, and the relationships between datasets and models — basically the lineage between the different objects used in your machine learning experiments.

The last item in observability I want to talk about is the audit log. If you are working with large data science teams, you may want to understand who is doing what in the cluster: who is using this dataset, who is building this model, and how people are collaborating with each other. There are things you can improve in the collaboration process.
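The talk doesn't show a configuration for this, but as one simplified illustration, Kubernetes' built-in audit logging can capture who touched which resources. The policy below is a hedged sketch — the resource group is chosen just as an example:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record full request/response for writes to Kubeflow resources,
# e.g. who created or deleted a training job or a notebook
- level: RequestResponse
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "kubeflow.org"   # example group; adjust to your platform's CRDs
# Log only metadata for reads, to keep audit log volume manageable
- level: Metadata
  verbs: ["get", "list", "watch"]
```

A policy like this, wired into the API server, gives you the "who is using this data, who is building this model" trail mentioned above at the Kubernetes API level; application-level lineage still needs the pipeline tracing discussed earlier.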
I'd like to also talk about flexibility. This is one of the measures you definitely want to look into when considering whether something is production-ready. Building machine learning models usually involves several different machine learning frameworks — for example, PyTorch, TensorFlow, XGBoost, and so on. You want to make sure your tech stack supports the different frameworks your data science teams actually need, and you may also want to look into language-specific SDKs to make sure your teams' requirements are met.

Standardized APIs are also important, especially when you are working with different frameworks. For example, you want your serving and inference platform to support a standardized protocol across frameworks, so that it stays relatively stable and it's relatively easy to onboard new frameworks.

The dataset is another thing to think about: is it large or small? Is it streaming data or batch data? Based on the characteristics of the dataset, you may want to use different technologies. It's a similar story for the model: depending on its size and the framework you used to train it, you might select tools accordingly. Models produced by different frameworks also have different performance characteristics, so you may want to look into different model serving runtimes to make sure the runtime is performant for the particular models you've trained. It's also sometimes challenging to integrate with different hardware accelerators, so you want your data science teams to have a friendly interface for building models that leverage them — for example, they may not need heavy GPUs for data processing (or maybe they do), while for model training you probably want GPUs, and serving has its own requirements. And whether you are building the platform for cloud, on-prem, or edge is also something to consider: instead of building one platform for each, you want one unified platform that works across environments, which also helps prevent vendor lock-in.

Before I dive into the reference implementation, I'd like to introduce Kubeflow. Kubeflow is the machine learning toolkit for Kubernetes. It consists of different tools applicable to different parts of the machine learning lifecycle: Kubeflow Notebooks for experimentation, especially for data scientists' day-to-day work; the Training Operator for distributed model training; Kubeflow Pipelines for stitching all the steps together in production; and Katib for model tuning. There's also a central dashboard for user access control and for visualizing your experiments. KServe is now a separate project — it was originally born in Kubeflow and successfully graduated into a standalone project dedicated to model serving. Kubeflow works across different machine learning frameworks such as TensorFlow and PyTorch.

For data processing, there are many frameworks you can use depending on your dataset. Here I want to focus on the use case where your dataset is extremely large, which is why I mention Apache Spark: it supports both batch and streaming, and it works very well with time series. I'd also like to welcome the Spark operator as an official part of the Kubeflow project — it now lives under the Kubeflow GitHub organization. As you can see on the right-hand side of the slide, it's pretty easy to deploy your Spark application.
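The slide's YAML isn't reproduced in this transcript, but a minimal SparkApplication along the lines described might look like this — the names, image, file path, and sizes are illustrative assumptions:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: data-processing            # illustrative name
spec:
  type: Python
  mode: cluster
  image: spark:3.5.0               # hypothetical Spark image
  sparkVersion: "3.5.0"
  # The Python file holding your data-processing logic, baked into the image
  mainApplicationFile: local:///opt/app/process_data.py
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark          # assumes an SA allowed to create executor pods
  executor:
    instances: 3                   # scale the processing out here
    cores: 2
    memory: 4g
```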
You basically specify a Python file that includes the logic — if it's data processing, it's simply a Python file you reference in the YAML — and then specify the computational resources in the same YAML.

If you are working with a lot of data-intensive applications, one thing to consider is the Fluid project. Fluid enables dataset warm-up and acceleration for data-intensive applications: if you have multiple large datasets stored in different places, Fluid leverages distributed caching in Kubernetes to speed things up. It also provides a dataset abstraction for heterogeneous data source management, so if you have data sources from different providers, it's a good way to connect them together. And it provides data-aware scheduling: when scheduling jobs, it understands how heavily each dataset and data source is used, in order to speed up the process.

Next, I'd like to talk about distributed model training. For those of you who are not familiar with it, the example here uses the collective communication pattern: if you have a large dataset, you can partition it, have multiple worker nodes consume the individual partitions independently, and then aggregate the results, updating the model using the gradients calculated on the different workers.

Kubeflow's Training Operator works with multiple machine learning frameworks, as I mentioned earlier, and it provides Python SDKs as well as APIs you can use to submit distributed training jobs. For example, there's PyTorchJob, which supports PyTorch distributed training. It supports multiple distributed training strategies — for example, all-reduce-based multi-worker training or parameter-server-based training — and there's also MPIJob, which is suitable for high-performance computing. It integrates well with job scheduling frameworks such as Kueue and Volcano. One last thing worth mentioning is elastic training, which the operator supports well: if one of the workers fails, it knows how to start a new worker and continue from there. Of course, that also depends on whether the machine learning framework you are using supports elastic training.

Here's an architecture diagram for distributed training with Kubeflow. If you have TensorFlow training code in Python, you can submit a job pretty easily just by configuring the resources and the distributed training strategy you want to use; the Training Operator then sets the environment variables for you and starts the workers and parameter servers based on the strategy you specified in the YAML spec. (A minimal job spec sketch follows at the end of this part.)

And here's an example by Andre for large-model fine-tuning: a simple Python API you can use to fine-tune your model in a distributed fashion. You take a Hugging Face model, point it at an additional dataset, add some tuning parameters and the configuration for LoRA, and then specify your computational resources. So this is pretty easy to use as well.
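Tying the Training Operator part back to something concrete, here is a minimal PyTorchJob sketch. The image, script, and replica counts are illustrative assumptions, not from the talk:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: train-model                       # illustrative name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure            # restart failed pods, per the fault-tolerance discussion
      template:
        spec:
          containers:
          - name: pytorch                 # the operator expects this container name
            image: my-registry/train:latest   # hypothetical training image
            command: ["python", "train.py"]   # your distributed training script
    Worker:
      replicas: 3                         # the operator wires up rendezvous env vars for you
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: my-registry/train:latest
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1         # one GPU per worker
```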
So next is model tuning. As I mentioned earlier, Katib is the tool inside Kubeflow that supports hyperparameter tuning. It also supports paradigms such as neural architecture search (NAS) and early stopping, it's agnostic to machine learning frameworks and languages, and since it's Kubernetes-native, it's deployable pretty much anywhere. It works well with the other Kubeflow components as well.

Here's the Katib architecture. Basically, the controller reconciles the Experiment resource; each experiment runs a number of trials; the metrics from those trials are gathered by the metrics collector; and a suggestion service is responsible for proposing the next parameters to try, based on the algorithm you specified for the hyperparameter tuning. If you want to learn more about the architecture, there's a reference paper linked in the slide.

Here's an example of using Katib to tune your machine learning model. On the right-hand side is a trial template, basically the specification describing how to train a single model with a particular set of parameters. On the left-hand side, you specify the hyperparameter tuning algorithm, its objective, the search space, and how far you want to go in tuning the model — for example, the maximum number of trials and so on.

Here's a screenshot of the Katib UI, where you can track how well your model performed during the experiments. You can see the training accuracy, the validation accuracy, and the parameters being searched over. It's a very easy-to-use experiment tracking tool.

Once you have trained your model and finished hyperparameter tuning, the next part is model serving. KServe is a model serving platform on Kubernetes that's highly scalable and standardized: it's performant and exposes a standardized protocol across different frameworks. It supports serverless workloads and event-based autoscaling built on Knative. There's also a sub-project called ModelMesh, which is also highly scalable and is based on density packing and intelligent routing. You can pre-process or post-process requests before the model is actually served by KServe, there are advanced deployment strategies like canary rollouts and inference pipelines, and you can create ensembles with an inference graph as well.

Here's a diagram for single-model serving. The KServe controller creates runtime deployments for each individual model; depending on whether you are using Knative, it uses either the Knative autoscaler or the Horizontal Pod Autoscaler from Kubernetes. It creates a transformer service if you include pre-processing or post-processing steps, and it creates the predictor service, which is responsible for processing your inference requests for either predictive or generative models.

The ModelMesh project is very useful for high-scale, high-density, and frequently changing models. It intelligently loads and unloads models from memory, balancing responsiveness to users against computational footprint. There's basically a routing layer that knows how to route model serving requests to the model runtime pods at the right time and in the right location.

Here's an example of serving a large model. On the right-hand side, you can easily deploy a Hugging Face Llama 2 model using a YAML spec that creates an InferenceService.
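The slide's YAML isn't captured in the transcript; based on KServe's Hugging Face serving runtime, it's roughly along these lines — the model ID and resources here are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama2               # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface                # KServe's Hugging Face runtime
      args:
      - --model_name=llama2
      - --model_id=meta-llama/Llama-2-7b-chat-hf   # hypothetical model reference
      resources:
        limits:
          nvidia.com/gpu: 1              # large models generally need an accelerator
```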
And once this is created — if you've already port-forwarded the service locally — you can easily send an inference request. For example, here we ask, "Where is the Eiffel Tower?", and the model tells me the location of the Eiffel Tower.

One problem I want to mention is that, especially when dealing with large models, model initialization takes a long time. There's a feature called modelcars that serves models stored in an OCI image, which significantly reduces startup time; in particular, if one pod goes down, it doesn't take another significant amount of time to start a new pod and begin serving the large model again. It has a lot of other advantages as well, and you can also leverage techniques such as image prefetching and lazy loading to significantly improve efficiency.

The next step, once you have your models and everything else, is that you often want to construct a workflow so you can easily reproduce the experiment in the future. Argo Workflows is a container-native workflow engine for Kubernetes. We're focusing on the machine learning use case here, but it's a generic workflow engine you can use for other cases as well. It includes CRDs and controllers, and multiple interfaces: a command line, a server, and different language SDKs. It provides a UI as well. Here's an example where you construct a basic DAG-shaped workflow — a diamond shape, from A to B and C and then D — from Python. You can do the same in YAML, but here we're using Python as a simplified example. (A YAML version of a diamond-shaped workflow appears at the end of this part.) And Argo Events is another project inside Argo that supports event-driven workflows: it supports different event sources and different triggers, and you can construct anything from very simple to very complex event workflows.

Here's a reference architecture for the entire workflow, if you want it to be event-driven. For example, once a data scientist commits something on GitHub, it triggers an Argo workflow that gets submitted to Kubernetes. The workflow handles data ingestion, deciding whether the data has been updated recently based on a cache store; then it goes to distributed training with TensorFlow via the Kubeflow Training Operator; then Katib does the hyperparameter tuning; and KServe handles the model serving at the end.

At the very end, the real-life situation is that data scientists need to iterate this lifecycle again and again — it's something they practice every day, from data processing to modeling and everything in between. So check out my talk with Savin from last year to learn more about the challenges there.
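As a YAML counterpart to the diamond-shaped DAG example mentioned above — the task names and the image are illustrative placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: diamond-           # Argo appends a random suffix per run
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        template: step
      - name: B
        dependencies: [A]
        template: step
      - name: C
        dependencies: [A]
        template: step
      - name: D
        dependencies: [B, C]       # D waits for both branches to finish
        template: step
  - name: step
    container:
      image: alpine:3.19           # placeholder; a real step would run your ML task
      command: [sh, -c, "echo step done"]
```

In the reference architecture, each of these placeholder steps would instead be a real stage — data ingestion, a Training Operator job, a Katib experiment, a KServe deployment.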
And last, I'd like to advertise my book again — I hope you don't mind. You can read it free online, and if you scan the QR code you'll get a big discount as well, in case you want to get it now. That's it. We have how many minutes? Two minutes. Okay. Any questions? Yes — one moment, just to wait for the mic.

Q: Thank you for the presentation, it was really good. In the past, when we investigated Volcano, we didn't see good interoperability with the Kubeflow Training Operator, but we will investigate that again. You also mentioned Ray. There is an operator, KubeRay — does Volcano also have good interoperability there, in your opinion?

A: Could you clarify what you mean? Is it model interoperability or something else?

Q: Is it working fine together, in your opinion? The Volcano scheduler and KubeRay, in this case.

A: Yeah — if they are Kubernetes-native, they should work well together, but you never know until you run them in production and start experimenting, right? That's something you learn when you actually start working on it. But I do know that with Ray you need to start your own separate Ray cluster: even though you have the Ray operator, it still needs some additional cluster setup, unlike the Training Operator, where all you need is the Kubernetes cluster.

Q: We'll try then, thank you.

Any other questions?

Q: Thanks also for the talk, I think it was great. Maybe a question more on the algorithmic side: what do you think about distributed machine learning — is there a trade-off? Do we reach the same model performance as the non-distributed version?

A: That's a great question. It's challenging to tune your distributed training, especially when you are using parameter-server-based distributed training, where you need to tune the number of parameter servers and the number of workers — that's the challenging part. Personally, I find the multi-worker strategy easier to use, if your model can work that way, because then you only have to add workers instead of also introducing parameter servers. In short, it takes time to find the optimal numbers.

Okay, any more questions? One more.

Q: Hi. Regarding Kubeflow: do you have plans to add RBAC inside Kubeflow — role-based access, or ABAC, or some other access model inside Kubeflow itself? Because by default it's flat; you cannot give specific access roles to people.

A: User access control is definitely something we need to continue improving in Kubeflow. If you have any specific suggestions, we'd like to take them during the community meetings or one of the working group meetings.

Q: Okay, thanks.

Thank you. I guess I'll take more questions offline. Ross is here to answer questions about modelcars in KServe, and Andrew is in the back to answer anything related to Kubeflow. Thank you for listening.