So this is a little bit different panel that we're going to be doing. We'll have a presentation and we'll be talking as well. One of our members, who's also a member of the Kubeflow Steering Committee and our working groups, is Johnu George. He's with Nutanix, a long-time community member. Unfortunately he couldn't be here with us today, but he did help us in making the slide deck and does an excellent job on the working groups and committees he's with. And now, if you all will introduce yourselves.

Absolutely. So hi everyone, my name is Andrey. I'm also part of the Kubeflow Steering Committee and I've been in this community for almost six years, so pretty much from the beginning. I saw the transitions from a Google project to an open project to the CNCF. I'm so excited to see so many people at this summit; I think it's one of the biggest summits we've had in the last six years. So I hope you'll be a part of the Kubeflow community and be here.

Hi everyone, I'm Yuki. I'm leading Kubeflow AutoML and Training with Andrey and Johnu. I'm also a member of the Kubernetes Batch Working Group and a Kueue maintainer. Thank you.

So these are the components that we're going to be looking at today, and you've also gotten a sneak peek of one of the announcements that's coming. It's not on this slide, but the Spark operator as well. So we have the training operator, Katib, and the MPI Operator, and we'll take a look at those today. And I don't know if you guys want to stand the whole time or if you want to sit, but there you go. Yeah, I can stand, I think, Amber.

So yeah, as Amber said, we're going to speak about Katib and the training operator during this session. Just to remind everyone, as you saw in the previous demos, Katib is a project dedicated to doing hyperparameter search in a cloud-native way. The idea is to connect open source ML and AI research libraries like Hyperopt, Optuna, and scikit-learn to Kubernetes and build a layer between the Kubernetes infrastructure and the ML libraries: building the SDKs, building the APIs, and building API access. Katib has several features like hyperparameter search, architecture search, some experiment tracking, tuning and optimization workflows, and UI capabilities.

Here we can move on. So this is the architecture. Again, we hear you: we try to address the problem with documentation and we try to improve our documentation, and this is how the architecture of Katib looks. From the user perspective, the user just needs to use the Kubeflow Python SDK, where they define what they want to optimize, how they want to optimize it, which search algorithm they want to use, how many resources they have, how many GPUs they have, and how many trials they want to run. Then they just use the Python SDK to start the experiment. Inside Katib, we have three different controllers. The first is the experiment controller, which is responsible for the Experiment custom resource; then the suggestion controller, which spawns the algorithm service. The algorithm service produces hyperparameters, and these hyperparameters go all the way to the trial controller, which does the evaluation on top of different custom resources. We support PyTorchJobs, Argo Workflows, and other job types, so we can orchestrate any type of custom resource on top of Kubernetes, which lets you run much more sophisticated hyperparameter experiments. And at the end, we collect the trial metrics into the Katib DB and go back to the evaluation step.
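To make that user flow concrete, here is a minimal sketch of what driving Katib from the Python SDK looks like, assuming the `kubeflow-katib` package; the objective function and parameter ranges are placeholders, and exact argument names may differ slightly between SDK releases.

```python
# Hedged sketch: tuning a simple objective with the Katib Python SDK.
import kubeflow.katib as katib

def objective(parameters):
    # User-defined training/evaluation code; Katib runs this inside each trial.
    lr = parameters["lr"]
    loss = (lr - 0.05) ** 2  # placeholder for a real training loop
    # The default metrics collector parses "name=value" lines from stdout.
    print(f"loss={loss}")

client = katib.KatibClient()
client.tune(
    name="sketch-experiment",
    objective=objective,
    parameters={"lr": katib.search.double(min=0.001, max=0.1)},
    objective_metric_name="loss",
    objective_type="minimize",
    algorithm_name="random",
    max_trial_count=12,
    parallel_trial_count=3,
    resources_per_trial={"cpu": "1"},
)
```

Each trial is spawned by the trial controller as its own workload on the cluster, which is the flow described in the architecture above.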
So this is the native hyperparameter optimization flow, and we tried to reimplement it using native Kubernetes infrastructure.

Yeah, so on our 2024 roadmap we're looking at several items. The first one, as we saw earlier in the presentations, is that we're trying to move closer to the newest LLM capabilities. We recently implemented the new train API to fine-tune your LLMs, so we now have an API which can easily fine-tune your LLMs, and I will speak about it a little later. Also, we want to graduate the Katib APIs to v1. We recently shipped some features, but there's still work to be done. For example, we want to define new APIs; some proposed API names are tuner and model search, and we want to support more things like feature engineering or model compression, so we're thinking about how we can make this API composable. We also want to support push-based metrics using the Python SDK, and support more parameter distributions, like log-uniform, which Optuna, for example, supports. And we have an integration with Jupyter on the books, because it's the main tool Kubeflow users use to interact with these components.

So then let me jump to the training operator. Just a reminder: the training operator, similar to Katib, is the framework which connects deep learning and ML libraries with Kubernetes infrastructure by utilizing Kubernetes APIs and also Python APIs. The training operator has several features, like distributed training capabilities where you can run several workers on multiple GPUs at really large scale, and we saw from the user survey that people are actually using this for really large-scale experiments, like training an LLM from scratch. We also support AllReduce-style training, we have MPI operator capabilities for HPC tasks, high-performance computing, and we support different job schedulers and elastic training as well, because we're kind of native to the PyTorch libraries.

Yeah, so this is an example of what you can do with the training operator. This shows how you can do AllReduce-style training with PyTorch, and the training operator will be responsible for setting it up. If you're familiar with PyTorch, PyTorch has the torchrun CLI which helps you spin up these workers, and we basically spin up all of the workers for you using this operator. From the user perspective, similar to Katib, they just set how many GPUs they have and they can start this PyTorch job from a function; I will show you in the demo how easy it is. The idea is that the training operator is responsible for spawning these workers, which then share the gradients between them to average them and actually train the model on multiple workers. Because if your model is too large, you need to distribute your model between workers, and you distribute your data. So this is one example of how you can use AllReduce with the training operator.

And if we go to the next slide, here is an example of how you can use TensorFlow distributed. This is just one example of what you can do with Kubeflow. The idea is that in TensorFlow you can distribute your data across multiple workers, and the Kubeflow training operator will be responsible for scheduling the appropriate workers and setting up the appropriate environment, so you can train very large models on multiple workers using multiple GPUs. And a parameter server is responsible for generating new weights based on the gradient results from all of your workers.
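As a rough illustration of the AllReduce pattern being described, here is what a native PyTorch DistributedDataParallel training function can look like; the model and data are placeholders, and the RANK/WORLD_SIZE and rendezvous environment variables are assumed to be injected by the operator (or torchrun) on each worker pod.

```python
# Hedged sketch of native PyTorch DDP, the pattern the training operator scales out.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_func():
    # NCCL backend for GPU collectives; env vars are set per worker by the operator.
    dist.init_process_group(backend="nccl")
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across workers
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 10).cuda(local_rank)
        loss = model(x).mean()
        optimizer.zero_grad()
        loss.backward()          # DDP averages gradients here
        optimizer.step()
        if rank == 0 and step % 20 == 0:
            print(f"step={step} loss={loss.item():.4f}")

    dist.destroy_process_group()
```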
Yeah, so right now I really want to speak about simplicity. We heard a lot today that, hey, Kubeflow is a great tool, but it's complex, right? We want to simplify this. And we hear you, and we did some work in the last year to simplify the complexity of Kubernetes specifically for the AI/ML community. And how do we do this? We try to have this kind of user flow. The idea is, we have data scientists who really want to work inside Jupyter notebooks; they don't want to leave the notebook and they want to do everything inside it. So they start a notebook server, they write their training or tuning scripts there, then they set how much resources they have, and then they use the SDKs or web UIs to scale it on top of Kubernetes infrastructure using the Kubeflow capabilities. So this is how it looks from the user's perspective.

And if we go to the next slide, here is how it looks from the API perspective. On the left side, you can see the native PyTorch distributed API. Basically, if you're familiar with PyTorch, you usually say: I want to use DistributedDataParallel with, for example, the NCCL (NVIDIA Collective Communications Library) backend; I want to wrap my model for distribution; I want to set up my optimizer; and I want to train my model. The idea is to use the native PyTorch API and then scale this function using the create_job API in the Kubeflow environment. Similar to KFP, if you're familiar with Kubeflow Pipelines lightweight Python components, we try to give the user a simple Python API which they can use to scale this function across multiple workers, and we serialize this function for you. Similar to Katib, you just pass this function into the tune API, where you define your objective, your hyperparameters, the trial thresholds, and the resources like GPUs. So very simple, very Pythonic: no Kubernetes, no YAMLs, no Docker, just pure Python from your notebook.

So yeah, right now I just want to give you a quick demo. As you saw previously, we've been speaking a lot about LLMs and how important they are for industry and for our community right now. Recently the community did some work to simplify fine-tuning for Kubeflow users, introducing a new train API. So I hope my demo is going to work. So, this is a notebook. In this notebook, what we're actually going to do is fine-tune BERT. If you're familiar with BERT, it stands for Bidirectional Encoder Representations from Transformers, one of the most famous language models, which was developed by Google. And we will try to fine-tune it on the Yelp review dataset. This example is similar to some Hugging Face examples you've seen on the internet; we tried to port those examples to use the Kubeflow training operator. So in this notebook, what are we going to do first? First of all, we're just going to download some samples from our dataset. Here we're using the Hugging Face datasets library to load some data and check some results. The dataset contains Yelp reviews with star ratings: the label indicates what star rating the user gave, and here we have the text. So this is the dataset we're going to use, and then we just need to define our training script to fine-tune the model. Just to speed up this experiment, let me submit it on the cluster first.
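Here is a hedged sketch of what submitting a function through the Training Python SDK's create_job API can look like, assuming the `kubeflow-training` package; parameter names follow the SDK as I understand it and may vary between releases, and the job name, function, and package list are illustrative.

```python
# Hedged sketch: scaling a Python function from a notebook with create_job.
from kubeflow.training import TrainingClient

def train_func(parameters):
    # Native PyTorch / Hugging Face code goes here; it runs on every worker.
    print("training with", parameters)

client = TrainingClient()
client.create_job(
    name="demo-pytorch-job",
    train_func=train_func,
    parameters={"lr": 0.05},
    num_workers=3,                                   # in the demo: one master and two workers
    resources_per_worker={"cpu": "4", "gpu": "1"},   # 1 GPU + 4 CPUs per worker
    packages_to_install=["transformers", "datasets"],
)

# Follow the job status and logs directly from the notebook.
print(client.get_job_conditions(name="demo-pytorch-job"))
client.get_job_logs(name="demo-pytorch-job", follow=True)
```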
So the idea, from the data scientist's perspective, is that they just need to define this training function, where they use the native Hugging Face Transformers APIs. If you're familiar with Hugging Face, this should be very simple to understand. You use the from_pretrained API to get the BERT model, then we get the tokenizer for this model, we download the dataset, and then use a map to apply the tokenizer to this Yelp review dataset. Here we distribute our data, because the main power of Kubeflow is that it can distribute your data across multiple workers. Rank and world size are the parameters we use to identify which worker this script is running on and how many workers we have, and we use them to shard the train and test datasets. Then we define the training arguments. Again, these are native Hugging Face Trainer arguments where you can say, hey, I want to evaluate every epoch, I want this output directory for my trainer inside this train function; I can also, for example, save model checkpoints if I want, and I define my Trainer here. So again, a native Hugging Face Trainer which can evaluate my training function. And at the end we call trainer.train and we export the model to S3 so we can do additional evaluations. So this is the function the user defines inside the notebook.

And the thing is, we're going to use this create_job API to scale this function onto multiple workers, so it will run not inside my notebook but inside the Kubernetes environment. Here you can see the training function, the job name, the parameters you can pass, like the bucket name, and the resources. I'm going to use one GPU and four CPUs for every worker, so in total it's three GPUs I'm going to use to fine-tune my LLM. These are the packages I'm going to install. If you've used pipelines, it looks similar: we have these packages to install, and we have this create_job function.

So I've already created this, and right now we can check our PyTorchJob conditions. We can see that the job is running. Since we're running this on three workers, we're actually using three different pods: the training operator will schedule three pods for you, one master and two workers. And we can also view the logs in my notebook. If we check the logs here, we can see that we download our BERT model, we map our review dataset to the BERT tokenizer, and then we actually start training. This model has around 108 million parameters to train. Then we do training, we do evaluations, we run it several times, and then we just export the model to S3. So then, as a data scientist, I just want to download my model to check the evaluations. I'm not going to run this because it would take time; my model is around 400 megabytes. Just to show you the evaluations: I'm again using the Hugging Face API to pass my model to a pipeline and check what kind of output it produces. I'm passing one of the good reviews and one of the bad reviews, doing a text sentiment analysis type of task. And as we can see, if I pass a good review, it actually gives me a good star rating; if I pass a bad review, it gives me a bad star rating.
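For reference, here is a condensed sketch of a training function in the style just described, assuming the Hugging Face transformers and datasets APIs; the output directory, hyperparameters, and omission of the S3 export are simplifications, not the exact values used in the session.

```python
# Hedged sketch of the BERT fine-tuning function passed to create_job.
import os
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def train_func():
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=5)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    # Yelp reviews with star labels, tokenized with the BERT tokenizer.
    dataset = load_dataset("yelp_review_full")
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], padding="max_length", truncation=True),
        batched=True)

    # Shard the data per worker; RANK and WORLD_SIZE are injected by the operator.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    train_ds = dataset["train"].shard(num_shards=world_size, index=rank)
    eval_ds = dataset["test"].shard(num_shards=world_size, index=rank)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="/tmp/bert-yelp",
                               evaluation_strategy="epoch",
                               num_train_epochs=1),
        train_dataset=train_ds,
        eval_dataset=eval_ds,
    )
    trainer.train()
    # In the demo the trained model is then exported to S3 for later evaluation.
```

This function is exactly what gets passed to create_job with three workers, as in the sketch earlier.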
So here I'm actually fine-tuning an LLM using the Kubeflow training operator. It's very simple, very Pythonic, purely from the notebook; I don't use any Dockerfiles or any YAMLs. That's how easy it is to do.

So what is next? How do we make it even easier? Usually data scientists want to iterate quickly; they want to fine-tune their model much faster without even defining the script. And more importantly, how do you distribute the data? Because imagine you use hundreds of workers or hundreds of GPUs: how are you going to download the data onto these workers? It is not very straightforward for data scientists, because they need to create a shareable PVC and think about other things. For this purpose, we developed the new train API, which simplifies the ability to fine-tune LLMs. I will speak about this on the next slide, but the idea is that, from the user perspective, they just need to use this new train API, this one. And as background: fine-tuning is the modern way people train models that have already been pre-trained. Instead of training your model from scratch, you take a model which has already been trained on hundreds of GPUs and adapt it to your specific use cases.

So in this train API, we say: I have this number of workers, and this number of processes per worker. If you know torchrun, you can specify how many processes you want to use per node, for example for multi-GPU, multi-worker training; these are similar parameters in this train API. Then the user defines the model provider and the dataset provider. We're supporting several model providers; we've focused on Hugging Face initially, where a user can say, I want to use this model URI and this transformer type to fine-tune my model. For dataset providers, we support S3 and Hugging Face right now, I believe, and we're planning to support more, like GCS or other providers. From the user perspective, they just need to say: I want to use this repo, this Hugging Face repo, and we have some parameters, for example if you just want to use 3,000 samples from the dataset rather than the full dataset for fine-tuning. And then the user just defines the trainer parameters. So instead of asking the user to define a training function, we pre-create this trainer for the user, and the user just specifies the training parameters and, for example, the LoRA config. If you're familiar with recent fine-tuning techniques, LoRA is one of the most efficient ways to fine-tune models, and you can specify the LoRA config directly in the trainer parameters. LoRA actually freezes some layers, so it reduces the number of trainable parameters for your very large model, which can significantly reduce the cost to fine-tune your model. And here, again, you just define your trainer parameters, you define your LoRA config, you define your resources, and you just submit this job. At the end, as I said, we use the storage initializer to download the model and the dataset, and we can view our logs directly in the notebook. Since we used LoRA, the number of trainable parameters is smaller.
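Here is a hedged sketch of the new train API for LLM fine-tuning; the class and argument names follow the kubeflow-training SDK as described in the talk and may differ slightly in the released version, and the model, dataset, and numbers are illustrative.

```python
# Hedged sketch: fine-tuning with the train API (model/dataset providers, LoRA).
import transformers
from peft import LoraConfig
from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

TrainingClient().train(
    name="bert-lora-finetune",
    num_workers=3,
    num_procs_per_worker=1,                          # processes per worker, as with torchrun
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "16G"},
    # Model provider: which pre-trained model to download and fine-tune.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Dataset provider: where to get the data (Hugging Face hub here; S3 also supported).
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:3000]",                        # only 3,000 samples, as in the talk
    ),
    # Pre-created trainer: Hugging Face TrainingArguments plus a LoRA config.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            num_train_epochs=1,
        ),
        lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
    ),
)
```

The storage initializer mentioned above downloads the model and dataset onto a shared volume before the workers start, so the user never writes a training loop.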
So it's just about 300,000 trainable parameters in this model. And as we can see, we get the same kind of results: my model has been fine-tuned at the end and I can use it for my experimentation.

So the new idea, if I go back to my slides, is that users can use new APIs for LLM fine-tuning that look like this. As I showed you in the demo, the user just defines what model they want to use, what dataset they want to use, the trainer parameters, and the number of resources they want to use to fine-tune the model. Then our orchestration layer looks like this: we have a model provider and a dataset provider, which are responsible for downloading data from different sources like Hugging Face or S3, and we distribute this data across multiple workers using a shareable PVC. Your PVC has to support the ReadOnlyMany access mode to distribute data across several workers. And we have this pre-created trainer, which defines the training loop for the user, so the user doesn't even need to worry about how to define the training script. This is very extensible, because imagine this could even apply to non-NLP tasks, maybe image classification or forecasting. You can basically define many different trainers for different users and customers, and it's very powerful in terms of the kinds of support we can offer. If you want to try this notebook, please follow this example; we've already uploaded this presentation to the schedule. We're looking forward to your feedback, because this is a new thing, and we'll try to extend it to support more features and offer you more capabilities for fine-tuning within Kubeflow.

So, just quickly, we have this roadmap of offering different capabilities for different projects. We already offer this train API in the training operator to do LLM fine-tuning. We also want to offer this tune API to do LLM tuning, and we want to extend it all the way to KServe, because Kubeflow already has the underlying infrastructure which gives us the ability to train, tune, and serve very large models. And this is kind of unique compared to other things you can do in the cloud-native environment. Yeah, so with that, let me pass it to Yuki to speak about what's new in the training operator.

Okay, let me explain what's new in the training operator. Since v1.7.0, we have started to support suspend semantics. This feature allows us to stop Kubeflow jobs without removing the jobs. In the next slide, I will introduce the training operator 2024 roadmap. The first item is the LLM train SDK API. As Andrey explained in the previous part, we recently implemented it; this API has not yet been released, so we will include this feature in the next training operator version, v1.8. The second one is extending the Python SDK trainers. Currently we support only the Hugging Face trainer, so we are planning to support other trainers, like image classification and DeepSpeed. The third one is supporting the JAX framework; we are planning to add a new job type, JAXJob. The fourth one is improving access to the data preparation step. In typical environments, data scientists perform fine-tuning against downloaded foundation models. If the model data is very large, like an LLM, it can waste significant CPU usage and download time.
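As a hedged illustration of the suspend feature mentioned above, the sketch below patches a PyTorchJob's runPolicy.suspend field with the Kubernetes Python client, assuming that field is available in your training operator version; the job name and namespace are placeholders.

```python
# Hedged sketch: suspend and resume a PyTorchJob without deleting it.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

def set_suspend(name: str, namespace: str, suspend: bool) -> None:
    # Patch only the suspend flag; the operator tears down or recreates the pods,
    # but the PyTorchJob object itself is kept.
    api.patch_namespaced_custom_object(
        group="kubeflow.org",
        version="v1",
        namespace=namespace,
        plural="pytorchjobs",
        name=name,
        body={"spec": {"runPolicy": {"suspend": suspend}}},
    )

set_suspend("demo-pytorch-job", "default", True)   # stop the job without removing it
set_suspend("demo-pytorch-job", "default", False)  # resume it later
```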
Currently, our trainer Python SDK allows us to mitigate such issues, as Andrey explained in the previous slides, but we aim to reduce the waste further by providing access to the data preprocessing phase, for example with something like Apache Arrow. The fifth possibility is webhook validation. The training operator has an internal validation mechanism, but its error messages can only be seen in the operator logs. So we are planning to introduce webhook validations; this feature allows us to surface errors at job creation time and bring a better UX. The last candidate is supporting MPIJob v2. Currently the training operator supports only MPIJob v1, and we can use MPIJob v2 only via the MPI Operator. So we will support MPIJob v2 in the training operator as well.

Next, I will introduce the MPI Operator. The MPI Operator is a standalone operator that serves only MPIJob v2. Let me use this diagram to describe the MPI Operator. Once an MPIJob is created, the MPI Operator creates launcher and worker pods and sets up a headless Service for the worker pods. After all worker pods complete initialization, the launcher pod launches the training processes in all worker pods.

As the next topic, let me introduce the MPI Operator's new features. These features are not released yet, but you can use them from the head branch. The first new feature is MPICH support. Before, we supported two types of MPI implementation, Open MPI and Intel MPI; in the next version, v0.5, we start to support MPICH. The second new feature is the launcher creation policy. This feature allows us to configure when the MPI Operator should create the launcher pod. The default startup policy is the same as before; the wait-for-workers-ready policy creates the launcher pod only after the workers are ready. In general, worker pods take a long time to start because container images bundled with NVIDIA CUDA libraries are very large, and downloading them takes a long time. If we use the wait-for-workers-ready policy, we can avoid unnecessary pod creation. But note that if we use both wait-for-workers-ready and gang scheduling, it can cause a conflict, and then the MPI Operator will fail to start the MPIJob. The next new feature is run launcher as worker. By default, the MPI Operator sets up the launcher pod only as a launcher. When run launcher as worker is turned off, we should schedule the launcher onto a node without GPUs, to avoid consuming GPU resources. But in HPC-scale environments, with different networks such as Ethernet and InfiniBand, that can increase debugging difficulty, so running the launcher as a worker can decrease the difficulty of troubleshooting.

In this slide, let me introduce the 2024 roadmap. The first possibility is a restart policy based on exit codes. In other frameworks, like PyTorchJob, we already support this restart policy, so we are planning to support it in MPIJob as well. The second is migration from plain pods to batch/v1 Indexed Jobs for the worker role. We already create the launcher role via an Indexed Job, but we still create worker pods directly; our motivation is to support batch/v1 Job-like features. The last candidate is multi-cluster MPIJob dispatching by Kueue. By supporting the managedBy feature in MPIJob, the same as the batch/v1 Job, we can submit MPIJobs into multiple clusters in environments managed by Kueue. Please join the Kueue sessions to learn more about multi-cluster job dispatching. Thank you.

So, a lot of exciting features from the MPI Operator as well.
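To tie the new MPI Operator options together, here is a hedged sketch of an MPIJob v2 manifest built in Python and submitted with the Kubernetes client; the field names reflect the kubeflow.org/v2beta1 API as I understand it (mpiImplementation, launcherCreationPolicy, runLauncherAsWorker), and the image and command are hypothetical, so check the CRD of your mpi-operator version before relying on them.

```python
# Hedged sketch: MPIJob v2 using MPICH, wait-for-workers-ready, and launcher-as-worker.
from kubernetes import client, config

mpijob = {
    "apiVersion": "kubeflow.org/v2beta1",
    "kind": "MPIJob",
    "metadata": {"name": "pi-mpich", "namespace": "default"},
    "spec": {
        "slotsPerWorker": 1,
        "mpiImplementation": "MPICH",                     # new in the upcoming release
        "launcherCreationPolicy": "WaitForWorkersReady",  # create launcher after workers are ready
        "runLauncherAsWorker": True,                      # launcher also runs MPI ranks
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {"spec": {"containers": [
                    {"name": "launcher", "image": "example.com/pi-mpich:latest",
                     "command": ["mpirun", "-n", "3", "/opt/pi"]}]}},
            },
            "Worker": {
                "replicas": 2,
                "template": {"spec": {"containers": [
                    {"name": "worker", "image": "example.com/pi-mpich:latest"}]}},
            },
        },
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v2beta1", namespace="default",
    plural="mpijobs", body=mpijob)
```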
So just the last thing, but the most important thing, I would say: a huge thank you to everyone who contributed to this release and previous releases. We try to recognize the most active folks from the last six months, and there have been a lot of other people contributing as well. And I think the best thing users can do to get closer to us is to start contributing code. We try to identify and recognize folks who are active in the community and who are helping us build better projects such as Katib, the training operator, and the MPI Operator. So a huge round of applause for everyone who has been involved in this recent effort around LLMs and other features.

And lastly, I just want to say a few things about the community. As Amber said before, we encourage everyone to get involved in the working groups. Every working group meets regularly, and the community is very open; we're open to any contribution. So please join the working group calls, please join the community call, let us know about your problems, let us know how we can help you. Also, Kubeflow has been accepted as a Google Summer of Code project, so we're really excited about all of the students and their participation, and we're looking forward to seeing more proposals in 2024. Also, if you're interested in training and tuning capabilities, please join our regular Wednesday calls at 2 p.m. and 5 p.m. UTC. We meet regularly, we have Slack channels on the Kubeflow Slack, and you can check the developer guides for the training operator and Katib. If you want to contribute, start with good first issues, help on existing issues, or submit a new proposal so we can review it. We're open to any discussions, any contributions, any new features you want to propose for these capabilities within the Kubeflow project.

So you've had a chance to hear about our training operator, Katib, and the others, and, as Andrey was saying, how you could get involved. But we also want to hear from you all; I think we've got about four or five minutes. We would really like to hear what you would like to see inside these working groups, or hear any questions you may have for us. I think we have a question back here. Bringing it to you. One, two.

Hello, so can I use Open MPI to trigger more general workloads? Imagine I have a Monte Carlo radiative transfer simulation. Can I use it for that? Because I see this as very applicable to machine learning, but our needs are a bit different in our project.

Thank you, that's a good question. So yes, the MPI Operator is not only for machine learning; we can use the MPI Operator for traditional HPC workloads. Does that make sense? Yeah, like traditional HPC, can I apply it here? Yes, the MPI Operator just runs an SSH server and sets up a headless Service, so you can start any programs, no restrictions. Yeah, just to add on this, it's a good question: we have users actually using the MPI Operator for HPC tasks, not machine learning, and they're already using it in production. So please feel free to reach out to us if you want to ask more about this.

Hi, yeah, thank you for the talk. So I see the PyTorch community came out with the TorchX project; it has this Kubernetes scheduler. How do you compare your training operator with TorchX? Yeah, it's a good question.
So I know TorchX. TorchX gives you the capability to use PyTorch in any environment, with Ray clusters or other kinds of schedulers. The thing is, TorchX, I think, previously came from the PyTorch Elastic community, and they were specifically focusing on PyTorch. With the training operator, we want to collaborate with the PyTorch community to think about how we can consolidate efforts, because TorchX uses similar techniques to what we do, but in a more Pythonic way. So if we have anyone here from the Torch community who wants to collaborate with us, please reach out, and we can build a better CLI for Kubeflow users and PyTorch users going forward. But our main idea is not to lock users into a specific framework. For example, if users want to use MPI, they're free to do it; if users want to use TensorFlow or DeepSpeed or anything else, we want to give them the infrastructure to do this. So we don't want to be focused on a specific framework for running the training.

Hi, does the training operator have capabilities for dealing with training runs that were, for some reason, disrupted? Yeah, it's a good question. So your question is about what happens if one of the nodes goes down: are you going to lose the training? Your training needs to be stateful, and all of these frameworks offer checkpointing capabilities, where you can basically checkpoint your training every epoch and export your model to any blob storage. It's a good thing to do to avoid overfitting and to avoid a single point of failure, because if one node goes down, you would otherwise lose your training progress. PyTorch also offers some fault tolerance capabilities, so we're trying to investigate them, specifically PyTorch Elastic: how PyTorch can gracefully recover your training if one node goes down. We're exploring different capabilities there.

And we're out of time. Yeah, so if you have any questions, feel free to reach out to me, and also to Yuki. We're here to answer all of your questions about training, tuning, and anything else about Kubeflow. Yeah, thanks everyone.