Good morning, everyone. My name is Malin Patel, and I am the group product manager for Google Kubernetes Engine. The world is moving towards cloud native computing, and Kubernetes has become the de facto standard for it. In my opinion, Kubernetes is an ideal platform for AI/ML and high performance computing, and in this talk I'm going to cover the challenges and opportunities in making AI easy and efficient with Kubernetes.

There are three core reasons why I think Kubernetes is an ideal platform for AI/ML and HPC. Number one is portability. Kubernetes provides cloud native, open, standards-based APIs, which allows users to seamlessly port workloads from their laptop to a private data center to the cloud. This is very important for the AI/ML community because it allows them to reliably reproduce their results. The second is scalability. Kubernetes allows workloads to scale from a single node to thousands of nodes, and it supports autoscaling, auto-provisioning, GPUs, TPUs, and many other hardware accelerators, which helps train models faster and more cheaply. The third reason is productivity. Kubernetes makes data scientists and AI practitioners more productive by freeing them from having to manage their own workstations or servers. It lets them focus on their business-critical mission, which is to build and train models, without having to worry about the underlying infrastructure and compatibility issues.

However, there are many challenges that this community faces. GPU utilization is one of the core concerns for AI/ML practitioners, and poor utilization costs them dearly. In our Google Kubernetes Engine fleet, we have seen that overall GPU utilization across the fleet is quite low. And utilization is actually getting worse over time, as GPUs get more and more powerful: a single workload may not be able to saturate a really powerful GPU. The underutilization problem is even more acute for certain types of workloads, such as inference, gaming, notebooks, and visualization.

So why is that the case? Kubernetes allows fractional requests for CPUs, but it does not allow fractional requests for GPUs. You can ask Kubernetes for 0.5 CPU, and it knows how to give you 0.5 CPU. But you can't ask for 0.5 GPU. One GPU has to be fully allocated to one container, even if the container only needs a fraction of a GPU for its workload. This invariably leads to overprovisioning and sometimes to cost overruns. So as a community, we have the opportunity to make GPUs a Kubernetes-native resource, and that will hopefully address this challenge.
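To make that contrast concrete, here is a minimal sketch, using the official Kubernetes Python client, of what a training container's resource request looks like today. The container name and image are placeholders I made up for illustration; the point is that the CPU request can be fractional, like 500m, while an extended resource such as nvidia.com/gpu only accepts whole integers.

```python
from kubernetes import client

# CPU is a native Kubernetes resource, so fractional requests are fine.
# nvidia.com/gpu is an extended resource exposed by the device plugin,
# and extended resources must be whole integers: a value like "0.5"
# would be rejected by the API server.
trainer = client.V1Container(
    name="trainer",                      # placeholder name
    image="example.com/train:latest",    # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "2Gi", "nvidia.com/gpu": "1"},
        limits={"nvidia.com/gpu": "1"},  # the whole GPU is pinned to this container
    ),
)
```

So even if the process inside that container keeps the GPU only 20 percent busy, the scheduler still treats the device as fully consumed, which is exactly the overprovisioning I was describing.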
The other challenge that AI practitioners face is failure-resilient training. Kubernetes was designed with the fundamental assumption that pods are disposable and replaceable. We treat pods as cattle, not pets, which means they can be disposed of at any time. That assumption does not suit many distributed computing frameworks. The majority of distributed computing frameworks, especially the ones used for AI/ML, are very sensitive to disruptions. They are intolerant of disruptions such as preemptions, failures, or maintenance events. And the problem gets really acute when you do large-scale training with thousands of nodes: the probability that some node will encounter a disruption grows with the size of the training cluster and with the duration of the training. The way the community deals with this today is through checkpoint and restore.

So you frequently take checkpoints. However, these checkpoints are typically taken at the epoch boundary. So if a disruption arrives, all the work that those thousands of nodes have done since the last epoch is lost, which is not a good story from a cost and time saving point of view. There are frameworks like PyTorch Elastic which handle these kinds of disruptions gracefully, but the challenge is that those solutions are framework-specific. What this community needs is framework-agnostic elastic training. Our goal should be to support any framework without any code changes.

This gives two main benefits. If you have elastic training, you can run your training on spot VMs, which are a lot cheaper than on-demand VMs, so it saves a lot of cost. And it solves another problem, which is obtainability. As most of you know, GPUs are a scarce resource, and spot VMs are also a scarce resource. It's very hard to find, let's say, a thousand GPUs up front to start your training. If you have elastic training support, you can start training with however many spot GPUs are available to you, then scale up when more GPUs become available and scale down when you lose them. So that also addresses the obtainability challenge.

Another opportunity for this community is to enable native support for checkpoint migration and restoration in Kubernetes. The way it can work is that whenever the underlying infrastructure signals an impending maintenance event or preemption, Kubernetes transparently and gracefully takes a snapshot or checkpoint and stores it. This makes it work-conserving: with current checkpoint mechanisms you lose the work since the last epoch, but with transparent, on-demand checkpointing you conserve all the training work that has happened. So this is another opportunity for the community.

AI practitioners also face a lot of challenges when it comes to observability in Kubernetes. Kubernetes observability primitives were mainly designed to provide service-level indicators like uptime, CPU utilization, memory utilization, GPU utilization, and so on. They are ill suited to monitoring model health. When AI practitioners think about their model, the things they care about are model performance, meaning how accurate the model is, along with precision, recall, and F1 score. They also care a lot about data drift and concept drift. And they care about training-serving skew, which is typically detected with KL divergence or by comparing feature importance between training and serving. Last but not least, in many industries it is very important to know the fairness of the model, things like demographic parity or equal opportunity. In my opinion, it's very hard to get this kind of information from existing Kubernetes primitives like Prometheus metrics or logging.

To give you a concrete example: in typical Kubernetes observability, you rarely have to deal with events that are months or even years apart. In the AI world, this is a very common occurrence. Let's say you have a model to predict customer churn. A customer acquisition happens now, and at some point in the future the customer churns. In order to evaluate the accuracy of this model, you have to combine these two events, which may be spaced months or even years apart.
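To sketch what that stitching looks like, here is a minimal, hypothetical example of the join an observability pipeline would have to perform. The table and column names and the 365-day churn window are assumptions for illustration, not anything Kubernetes or GKE gives you today.

```python
import pandas as pd

# Hypothetical prediction events logged at serving time (months ago).
predictions = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "predicted_churn": [True, False, True],
    "predicted_at": pd.to_datetime(["2023-01-10", "2023-02-03", "2023-03-15"]),
})

# Hypothetical outcome events observed much later (or never, for retained customers).
churn_events = pd.DataFrame({
    "customer_id": [1, 3],
    "churned_at": pd.to_datetime(["2023-11-20", "2024-02-01"]),
})

# Join the two event streams on the customer and derive the ground-truth label.
# "Churned within 365 days of the prediction" is an assumed label definition.
joined = predictions.merge(churn_events, on="customer_id", how="left")
joined["actual_churn"] = (
    joined["churned_at"].notna()
    & (joined["churned_at"] - joined["predicted_at"] <= pd.Timedelta(days=365))
)

# Only once the delayed outcomes arrive can we compute model-health metrics.
tp = (joined["predicted_churn"] & joined["actual_churn"]).sum()
fp = (joined["predicted_churn"] & ~joined["actual_churn"]).sum()
fn = (~joined["predicted_churn"] & joined["actual_churn"]).sum()
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Existing Kubernetes metrics pipelines were never designed to hold one side of that join for months, which is why I say the current primitives are ill suited to model health.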
Here's another example. Say you have a model that predicts loan defaults. You issue a loan now, and a default may happen sometime in the future. To understand the accuracy of this model, you again have to combine those events. So in order to compute accuracy, precision, recall, and many other metrics, the observability solution needs to combine disparate events that may be spaced far apart in time. This is one of the areas where we as a community have an opportunity to extend the existing observability solutions for Kubernetes and make them suitable for the AI/ML community.

So in summary, I see a lot of exciting things happening in this space, and I'm super excited and energized to be working in it. And may the AI force be with you. Thank you all. If you have any questions or comments, I'll be around. Thank you.