Hello everyone, welcome to our talk. A quick introduction for myself and Kay, my partner in this talk. My name is Mohamed Zayan, I work as a senior systems engineer at New Work SE, the company behind XING, Kununu and onlyfy. If you are in Germany or in the DACH region in Europe, you might be familiar with these. We're very glad to share our topic: how to deploy an AI Kubernetes cluster with Kubespray. This is my partner Mohamed, and my name is Kay. I'm from DaoCloud, I'm a principal engineer, and in my career I have deployed hundreds of Kubernetes clusters. So this is the agenda: in the first part we will introduce what Kubespray is, its main features, and some best practices. Then we will show how to build an AI-optimized cluster, then there is a demo, then we'll share our community, and last is the Q&A. So what is Kubespray? Kubespray is a Kubernetes deployment tool. It's a subproject of Kubernetes SIG Cluster Lifecycle, and it's entirely based on Ansible; actually, it's a collection of Ansible playbooks. There are many projects under cluster lifecycle management, such as kubeadm, kOps, Cluster API and Kubespray. Kubespray is a very good tool to deploy bare-metal clusters, and it can be used for production-ready environments. If you are using a public cloud, a managed Kubernetes such as EKS or AKS is very good; if you just want to develop, kind is a good choice; and if you want something production-ready, Kubespray is very good. So, the main features of Kubespray: as we said, it can be used with several cloud environments. You can use Terraform to provision your infrastructure and then use Kubespray, with the Ansible tooling, to configure and provision your cluster. And it's flexible; we will explain in the next slides how you can use it and what the supported providers are, whether public cloud, bare metal, or your private cloud.
Yeah, it can provide high-availability clusters, because you can have a clustered control plane, and also etcd and other components which you can deploy using Kubespray. And you have many configuration options. For the container runtime we support containerd, CRI-O, and lots of other options; the same for the CNI, which we will explain. It also supports the most popular Linux distributions; in the next slide we will show this. And it has continuous integration tests for all these configuration options and all the software you can use from within Kubespray. You can see here: you can use Kubespray to provision and deploy your clusters on cloud providers like AWS, Google Cloud, Equinix Metal, Huawei Cloud, and UpCloud. If you are hosting your own VMware vSphere environment, you can also do this, plus OpenStack, Hetzner, and NIFCLOUD. The supported operating systems, as you see, include Ubuntu, CentOS, AlmaLinux, Fedora, openSUSE, Amazon Linux, and Kylin, and I think Kay can talk about this later. For the container runtime options we have containerd, Docker, CRI-O, youki, and Kata Containers, and lots of options for the CNI: we have Calico, Cilium, plugins like Macvlan, and Multus, Flannel, Weave, Canal, and Kube-OVN. For CSI we support some plugins; you can see on the slide that we support AWS EBS, also vSphere, and also Cinder. You can also use Kubespray to deploy add-ons: when you create a cluster you can toggle on and off deploying ingress-nginx if you want to deploy it from Kubespray, also cert-manager, and software like Argo CD or Helm. We have a basic Docker registry, and recently we added Node Feature Discovery. So Kubespray is flexible with all this. You don't have all this by default, of course; it's configurable, and it's up to you to choose whatever you need for your cluster when you start using Kubespray. This is the cluster lifecycle operations slide of Kubespray.
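To make the add-on toggling concrete, here is a minimal sketch. It assumes Kubespray's convention of driving add-ons through inventory group variables; the variable names below (helm_enabled, ingress_nginx_enabled, cert_manager_enabled, argocd_enabled, registry_enabled) follow the sample inventory's addons file, but you should verify them against inventory/sample/group_vars/k8s_cluster/addons.yml in the Kubespray version you use.

```shell
# Write an illustrative add-ons variables file for a cluster inventory.
# Nothing here is enabled by default in Kubespray; each flag is opt-in.
mkdir -p group_vars/k8s_cluster
cat > group_vars/k8s_cluster/addons.yml <<'EOF'
helm_enabled: true
ingress_nginx_enabled: true
cert_manager_enabled: true
argocd_enabled: false          # left off: deploy Argo CD only if you need it
registry_enabled: false        # the basic Docker registry add-on
EOF
```

Kubespray picks these variables up from the inventory when you run the cluster playbook.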
You can create a new cluster. You can upgrade clusters with zero downtime: control plane components, for example if you want to bump your Kubernetes version, or individual components, like upgrading etcd alone because there are some CVEs or fixes. You can do this with Kubespray, specifically with Ansible and the power of tags and tasks. You can also scale your cluster live: you can add or remove nodes. You can reset a cluster. You can do several kinds of configuration management if you want to add some configuration to your kubelet, your API server, your controller manager, and so on; you can do all this with Kubespray. And we support performing etcd snapshots and backups during the upgrade, just to make the upgrade safe. Here is the release cycle; these are the currently supported versions of Kubespray. We also do semantic patch releases if there are fixes, or releases for major components, like Kubernetes itself or containerd, the most used things in Kubespray. Sometimes, because there are bug fixes or security fixes, we just do these releases, and these are the current releases you can use. We can use Kubespray to deploy in the public cloud. Kubespray is a very good solution for bare-metal environments, and it's also very good for cloud environments. We can use Terraform to create the virtual machines in the public cloud; there are lots of Terraform scripts in Kubespray. This slide shows the steps. Step one is to create the VMs and the network with Terraform, with the commands terraform init and terraform apply. After the machines and the network have been set up in the public cloud, we can install Kubernetes with Ansible, using the ansible-playbook command. We are now supporting AWS, Google Cloud, and other public clouds in Kubespray.
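The two-step flow just described can be sketched as a small script. The provider directory and inventory path below are illustrative; the real Terraform configurations live under contrib/terraform/&lt;provider&gt; in the Kubespray repository, and your inventory layout may differ.

```shell
# Save the public-cloud deployment flow as a script: Terraform provisions the
# machines, then Ansible installs Kubernetes on them with cluster.yml.
cat > deploy-cloud.sh <<'EOF'
#!/bin/sh
set -e
# Step 1: create the VMs and the network with Terraform (AWS is one example
# provider directory; others exist under contrib/terraform/).
terraform -chdir=contrib/terraform/aws init
terraform -chdir=contrib/terraform/aws apply
# Step 2: install Kubernetes on the provisioned machines with Ansible.
ansible-playbook -i inventory/mycluster/hosts.ini --become cluster.yml
EOF
chmod +x deploy-cloud.sh
```

Running the script requires Terraform, Ansible, and cloud credentials to be set up first.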
And Kubespray is also very good for air-gap deployment environments. We did a lot of work for that. We declared all the images and binary files in the Kubespray scripts, so we can set up one machine as a download machine: it generates a list of the images and files and downloads them from the internet. Then we can package them; the package is about three gigabytes. Then we copy it into the air-gap environment, use that machine as the Ansible host, and create a cluster with it. Another thing to note is that Kubespray does not do anything about the operating system, so we should use something like a DVD as the operating system's package repo so that it can work. Also note that Docker CE is a special case, because its dependencies are a little complex, so we should handle it manually. Kubespray also has lots of CI tests. When any pull request is submitted to Kubespray, it should pass about 40 tests, covering about 13 operating systems and other things we test. The whole test run takes about an hour. It's an amazing thing, and that's how Kubespray keeps up the quality of the project. Then I will introduce some best practices of Kubespray. One is NTP. We know that in an offline or experimental environment, time sync is very important, because etcd and the Kubernetes control plane need synchronized time. We can declare this in Kubespray and it will sync time automatically for you. For a production-ready environment, Kubespray supports tuning the operating system with sysctl to make the workload healthier; for example, the PID and file-descriptor limits can be increased. Another thing is high availability. With Kubespray it is easy to install a highly available Kubernetes cluster with multiple control planes. And I think there are two interesting things.
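The NTP best practice above is again driven by inventory variables. A minimal sketch, assuming the ntp_enabled / ntp_manage_config / ntp_servers variable names used in Kubespray's documentation; confirm them against the sample inventory of your Kubespray version, and replace the pool servers with your air-gap environment's own time source.

```shell
# Illustrative inventory variables that ask Kubespray to manage time sync,
# which etcd and the control plane depend on.
mkdir -p group_vars/all
cat > group_vars/all/ntp.yml <<'EOF'
ntp_enabled: true
ntp_manage_config: true
ntp_servers:
  - "0.pool.ntp.org iburst"   # replace with an internal NTP server when air-gapped
  - "1.pool.ntp.org iburst"
EOF
```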
One is that Kubespray can use nginx or HAProxy as a local proxy for the API server, which keeps the Kubernetes control plane highly available from inside the cluster. And we can use kube-vip or an external load balancer to expose the API server for kubectl. Another thing is that the Kubernetes certificates expire after one year, so we have a script to auto-renew them every month. Recently, we added a new feature: the Ansible collection. An Ansible collection is a standard distribution format for Ansible content, so Kubespray can now be combined with other Ansible collections. We can easily declare it in the collections requirements.yml file, and then use ansible-galaxy to install Kubespray, so it can be used as usual. The next thing is cluster hardening. To make a cluster more secure, Kubespray lets you do it easily with some configs, for example authorization, request timeouts, and auditing, which are very important. We have a document about this in the GitHub repo. From the keynote of KubeCon, we can see that more and more AI workloads are running on Kubernetes. This is a screenshot from OpenAI; it says: why Kubernetes? We are excited by the scheduling, the introspection, and the scalability of Kubernetes. So next, I will introduce how to make Kubernetes more AI-optimized. For an AI environment, I think the architecture is: Kubernetes runs on the CPU and GPU machines, above that sits the scheduling and AI infrastructure layer, and on top are the models and the AI applications. The AI workload is highly different from the web application workload: AI workloads are often batch jobs, while web applications are interactive. This creates a challenge for Kubespray. Some of us may know that Kubespray's GPU support was developed four years ago. It's a little outdated: it only supports the device plugin, and it installs the drivers from binaries, which is hard to maintain.
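The Ansible collection workflow mentioned above can be sketched like this. The git source matches the project's documented approach for consuming Kubespray as a collection; pin the version to a real release tag rather than master when you use it.

```shell
# Declare Kubespray in an Ansible collections requirements file, then install
# it with ansible-galaxy so it coexists with your other collections.
cat > requirements.yml <<'EOF'
collections:
  - name: https://github.com/kubernetes-sigs/kubespray
    type: git
    version: master          # pin a release tag in real use
EOF
# With Ansible installed, the actual install step would be:
#   ansible-galaxy collection install -r requirements.yml
```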
And there is a lack of a metrics exporter, a lack of Node Feature Discovery, and a lack of MIG, RDMA, and such things: the new GPU features are not supported by Kubespray. Kubespray also only supports the default Kubernetes scheduler; it does not support gang scheduling, capacity scheduling, or priority scheduling. And it also lacks AI applications. So this slide is my personal understanding of an AI-optimized Kubernetes cluster. We can look at it from the bottom to the top. At the bottom is the infrastructure: Kubespray should enhance the Terraform scripts to enable creating GPU machines in the public cloud. In the Kubernetes layer, we should support more GPU features, such as MIG and eventually DRA. Alongside that, the GPUs should support high-bandwidth connectivity and other things. On top of that, we should support schedulers for AI batch jobs and the queue. At the top is the AI framework, such as PyTorch, Hugging Face, Ray, or Kubeflow, but this is not part of Kubespray; those are applications. So, let me introduce the NVIDIA GPU integration. We want to use the GPU Operator to install the NVIDIA GPU integration. There are two parts: one part is the host-level components, the other part is the Kubernetes-level components. The NVIDIA Container Toolkit is a host-level component; it is what makes the container runtime, whether containerd, Docker, or CRI-O, support the GPU. Then there is the GPU driver; there is an official NVIDIA driver image, and it can be installed on many operating systems. On top of this, Kubernetes can support the GPU through the Kubernetes-level components of the GPU Operator, such as the device plugin for scheduling. GPU Feature Discovery gets the information about the GPU cards and labels the nodes with it. And MIG is very useful for GPU sharing, plus there is monitoring. That's all for the GPU Operator.
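As a sketch of the GPU Operator setup just described, the operator is typically installed from NVIDIA's Helm repository; it then deploys the driver, the container toolkit, the device plugin, GPU Feature Discovery, and the DCGM exporter for you. Chart options vary by release, so treat the flags below as a starting point, not a definitive invocation.

```shell
# Save the Helm-based GPU Operator install as a script; running it requires
# Helm and access to a Kubernetes cluster with NVIDIA GPU nodes.
cat > install-gpu-operator.sh <<'EOF'
#!/bin/sh
set -e
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# One chart installs both the host-level and Kubernetes-level components.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait
EOF
chmod +x install-gpu-operator.sh
```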
There are many sessions by NVIDIA at this KubeCon; I think if you are interested in that, you can join them. This is the basic usage of a GPU. This is a pod: we use a CUDA image, and then in the resource limits we set the GPU limit to two, so it can be scheduled by the device plugin. We can also declare a node selector on the GPU type; this label is applied by GPU Feature Discovery, so the pod is scheduled to the right machine. And this is advanced GPU usage. Besides the basic usage, we can use MIG. MIG lets multiple workloads share one GPU, which is very useful when we have multiple users in a cluster. And KubeVirt is a solution for running virtual machines on Kubernetes; with the GPU Operator, the virtual machine can use a virtual GPU device. With the DCGM exporter, we can create a dashboard in Grafana and monitor the GPUs. There is a lot of future work on GPUs, such as DRA and RDMA. DRA, which you saw in the keynote, is the future direction for GPUs. I have tried it, but I think it's still in development, so we will implement it after the other basic things are done. And GPUDirect RDMA can make AI workloads run faster in a multi-GPU environment. So that's the whole GPU theme. Next is why the scheduler is important for AI workloads. AI workloads are very different from web workloads: they are heavy batch jobs. A job built with Ray or Kubeflow joins a queue, waits in the queue, then is scheduled by the Kubernetes scheduler, and finally runs on a node as pods. AI workloads often require a large amount of compute resources, so the scheduler has to use everything effectively. Then, if the job is distributed, we need gang scheduling. And because GPUs are very expensive, these environments are usually multi-tenant, and in a multi-tenant environment the priority queue is very important.
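The basic GPU pod described above might look like the following. The CUDA sample image and the nodeSelector label value are illustrative; check the labels GPU Feature Discovery actually put on your nodes (for example with kubectl get node -o yaml) before pinning a pod to a GPU product.

```shell
# Write an example GPU pod manifest: an nvidia.com/gpu resource limit for the
# device plugin, and a nodeSelector on a GPU Feature Discovery label.
cat > gpu-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-sample
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
      resources:
        limits:
          nvidia.com/gpu: 2          # scheduled by the NVIDIA device plugin
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # illustrative GFD label
EOF
# kubectl apply -f gpu-pod.yaml
```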
So we can make high-priority jobs run first. That's why the queue is important. Then let me describe gang scheduling. Why is gang scheduling useful? Because in a distributed AI job, all the workers should run together at the same time, since they need to communicate with each other. If some of them are missing, the AI job cannot run well. So we combine all the pods into a pod group, and the pod group is scheduled by the scheduler as a unit. The scheduler watches whether there are enough resources: if there are enough resources to run all of the pods, it allocates them at the same time to the right nodes; if not, it does nothing. That's gang scheduling. To implement gang scheduling, we can choose scheduler-plugins or Volcano. There are two choices, and both are good. scheduler-plugins is an upstream project from Kubernetes SIG Scheduling, and Volcano is a CNCF project; both are good choices. To use scheduler-plugins, we can use a YAML file. You can see that we define a PodGroup named nginx, and we define a ReplicaSet that uses schedulerName to specify scheduler-plugins, and then a label to reference the pod group. That's how to use scheduler-plugins, and the plugin can also be installed as a Helm chart by Kubespray. These are the features of scheduler-plugins, such as gang scheduling, bin packing, and other things; it's very useful. Another thing is the queue. The queue is very useful, and there is also an upstream project for it named Kueue. It's a Kubernetes-native job queueing system. It can be integrated with Ray, Kubeflow, and others; for example, a Ray job can be put into a queue, so the queue can do resource quota management for it, and it can also do priority and preemption. So that's what we can do next. There is a discussion about Helm versus Ansible templates, because Kubespray is a project with a long history.
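The scheduler-plugins co-scheduling example described above can be sketched as follows. The PodGroup API group, the pod-group label key, and the scheduler name have all changed between scheduler-plugins releases, so treat every identifier below as an assumption to verify against the version you deploy.

```shell
# Write an example PodGroup plus a ReplicaSet whose pods are gang-scheduled:
# either all three pods are placed together, or none are.
cat > podgroup.yaml <<'EOF'
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: nginx
spec:
  minMember: 3                 # schedule only if all 3 members fit
---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        scheduling.x-k8s.io/pod-group: nginx   # ties the pods to the PodGroup
    spec:
      schedulerName: scheduler-plugins-scheduler  # name of the second scheduler
      containers:
        - name: nginx
          image: nginx
EOF
```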
All our applications are implemented with Ansible templates, and it has become very big; many people want to add new add-ons into it. So there is a discussion about whether we should use Helm instead of Ansible. I think it's a very good idea, because Helm can reduce the maintenance load and it's clearer. But many features are already implemented in Ansible, and migrating means breaking changes. Ansible also has some benefits, because it's ready for air-gap environments, which is a little hard to do with Helm. So my answer is that Kubespray should use Helm and Ansible together, both: Ansible is better for system components, and Helm is better for applications. So I think the GPU and AI features should be implemented with Helm. Then the demo. We can install Kubespray with git clone; it takes a minute to install. Then we configure it. This is a simple example: we have three nodes, with a GPU node, a control plane, and workers. Then we edit the config file to enable the features; we can see GPU, Helm, and so on. Then we disable the firewall, and next we apply it. So we can see Ansible working through everything automatically. It takes a while: now it is pre-installing some things in the operating system and doing some preparation, disabling swap and so on. It takes a lot of time because my network is slow. If we want to accelerate it, we can use the offline package to install the cluster; then it takes about ten minutes to be installed. So the cluster has been installed, and we can use it; it's ready. Then we can see that everything is ready, such as Calico and CoreDNS. Then the GPU Operator starts working. Now it is installing the NVIDIA GPU driver; the driver has to be compiled for the machine, so this also takes a lot of time. It's compiling, and finally it will be okay. Then I create a CUDA sample application, so it will run and pull the image.
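The demo flow just walked through follows the standard steps from the Kubespray README, sketched here as a script. The inventory name "mycluster" and the inventory file name are illustrative; older Kubespray versions use hosts.yaml generated by the inventory builder instead of inventory.ini.

```shell
# Save the demo's install flow: clone Kubespray, install its Python
# requirements, copy the sample inventory, then run the cluster playbook.
cat > demo.sh <<'EOF'
#!/bin/sh
set -e
git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
pip install -r requirements.txt
cp -rfp inventory/sample inventory/mycluster
# Edit inventory/mycluster to list your nodes and enable the features you
# want (GPU, Helm, ...), then deploy:
ansible-playbook -i inventory/mycluster/inventory.ini --become cluster.yml
EOF
chmod +x demo.sh
```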
And finally it looks like this. Okay, that's the demo. Yes, thanks, Kay, for the information about the AI stuff with Kubespray. And from our side, we would like to thank the community. We had 1,082 developers overall and more than 50 in a single release. We have 7,536 commits. We welcomed new contributors last year; they are mentioned here by their GitHub user names, and we still welcome more contributors. We encourage you, if you use Kubespray, to also support the project. I actually wanted to ask this question at the beginning: who's using Kubespray here? Wow, that's impressive. And of course I would expect that it's in production. Good, good. If you need our help, you want to discuss an idea, you want to introduce a new feature, you want to make a change, you want to do any kind of contribution: please do. As Alexandre Dumas said, and I'm not good in French, this quote was picked by Kay. We are in the Kubernetes Slack, and we have the #kubespray channel for general support, where lots of Kubespray users are there to ask questions and we are there to support people. You can also create issues in the GitHub project, and if you're interested in topics and questions about Kubespray development, please also join the other channel, #kubespray-dev. Here is also the path for the GitHub project, kubernetes-sigs/kubespray, and we are also available here at the project pavilion at kiosk PP19-B. So, I think we wanted to open a round of questions and answers, so please, if you have questions, there are two microphones here. Thank you so much. Thanks a lot for the talk, I learned a lot, and I have a question: you touched on the topic of the Kubespray release... Can you get closer to the microphone, please?
Oh yeah, sorry. You touched on the topic of the Kubespray release cycle, and I was curious whether it aligns with the Kubernetes core release cycle, and if that's the case, how you ensure that you remain up to date and don't fall behind. I think I can answer this question. It's voluntary contribution, and we as the maintainers discuss things before we make a release. In theory, you can use the master branch, but use it at your own risk, because it might have something that is not working; you can support the project by testing it. Normally, before every release, we build the release draft and check if we have something missing from the core components. A core component here would of course be the Kubernetes version, so we check if there is an upcoming release, or we take the most recent stable one. With every Kubespray release we support three Kubernetes releases. And we don't release Kubespray before checking things like etcd, Kubernetes, and Calico, many of the core components that I think everyone uses when running Kubespray. One other thing I forgot and wanted to say: for these core components in Kubespray, we also check the compatibility matrix, so whatever Kubernetes itself uses, like a specific etcd version, we just follow that. We don't bump to versions that are not stable; we just follow the recommendations from Kubernetes. Okay, that's why it's hard, right? But are you on par with the release cycle of the core itself, or do you need to lag behind a little until you make sure every other project is compatible? We are trying to do that. If I go back to the slide of releases, I think we can show it. Oops, sorry. Yes, you can see here that with the latest release we support Kubernetes 1.28, and with the upcoming
release, which will be soon, we will be supporting 1.29. And we do releases about every six weeks, or eight weeks maximum; so this is how it works and how we do it. But also, as you see, we have a release 2.23.3, because we had to release some fixes; something was introduced and it was mandatory to release it to the users of Kubespray. Okay, great, thanks a lot. You're welcome, thank you. Any more questions? Thank you everyone for attending. I'm around here if you have questions. Again, thank you.