Susan Wu: Hi, everyone. My name is Susan Wu. I'm an outbound product manager at Google Cloud. I'm here with my friends and colleagues, and I'm going to let them introduce themselves.

Clayton Coleman: Hi, I'm Clayton Coleman. I'm really excited this year: I can finally claim that I have 10-plus years of experience in Kubernetes. That's going straight on my resume, and hopefully I'll be hireable someplace.

Peter Pulliott: Peter Pulliott, Ampere Computing. It's a pleasure to be here in Paris this week. Great to see everybody. Looking forward to a wonderful week talking about Kubernetes and cloud native.

Ricardo: I'm Ricardo. I'm a computing engineer at CERN. I'm also part of the TOC and a TAG in the CNCF. Really happy to be here.

Susan Wu: Lu, would you introduce yourself to everyone?

Lu: Hi, I'm Lu. I'm a PMC maintainer of the open-source project Alluxio. Hopefully I can show how AI and Kubernetes work together here at KubeCon.

Susan Wu: Okay, Clayton, ready?

Clayton Coleman: I guess so.

Susan Wu: Enterprises are building a lot of AI platforms on Kubernetes, and I hear platform operators talk about abstracting Kubernetes away from the data scientists. What can we do to make Kubernetes simpler?

Clayton Coleman: Oh my gosh, what can't we do to make Kubernetes simpler? It's interesting: data scientists don't need to know about Kubernetes, but there are a bunch of things that their platform teams are going to need. I can simplify by saying that Kubernetes is supposed to be a cluster operating system, and the job of an operating system is to abstract hardware. You heard from Kevin that there's a bunch of exciting work going on in a number of SIGs in Kubernetes around abstracting the resource model and making accelerators easier to run, while exposing just the details that platform admins need. That's going to lead to better opportunities for scheduling and bin packing, for keeping those really expensive accelerators working. The point of Kubernetes has always been to run multiple workloads together, and that's what helps a lot of people achieve significant efficiency. On top of those accelerators and a better resource model, we really do need to bring batch frameworks like Slurm and Ray closer to Kubernetes; we need to be able to support them effectively. And finally, since training is really just the development part of the process, everybody has to go to production at some point, and I think Kubernetes needs to be the best place to run production inference workloads. So the focus will have to be constructs that make it easy to run those workloads and keep them running on top of accelerators, pretty much 24/7.

Susan Wu: Sounds like something our community should be proud to be part of, right?

Clayton Coleman: I hope so.

Susan Wu: Okay, Peter. Kubernetes is already well suited to handle resource allocation, so what compute choices should platform owners make, especially for performance or for sustainability?

Peter Pulliott: Kubernetes and the open-source ecosystem have been great for LLM innovation, and open-source small-parameter LLMs are becoming a more pragmatic and available choice. However, for small-parameter LLM inferencing, a GPU-only approach isn't necessarily sustainable. We need something that's affordable, available, and easy to use. LLM inference runs seamlessly today, out of the box, on Arm64 processors like Ampere's, across cloud providers, and it has proven price-performance per watt compared to the alternatives, specifically for this use case.
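To make Clayton's point about the resource model concrete, here is a minimal sketch, using the Kubernetes Python client, of how a workload asks for an accelerator: it is just a named, countable resource on a pod, and the scheduler bin-packs those requests across nodes. The image, namespace, and labels are hypothetical; nvidia.com/gpu is the conventional resource name exposed by NVIDIA's device plugin, and the dynamic resource allocation work Clayton alludes to is evolving this model further.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

# The platform hides the hardware details; the workload simply requests
# one accelerator by its extended-resource name.
pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="train-job", labels={"team": "ml"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/trainer:latest",  # hypothetical image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},  # device-plugin resource name
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-platform", body=pod)
```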
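And to ground Peter's point, CPU-only inference of a small, quantized open model is a few lines with a runtime such as llama.cpp, which runs on Arm64 (for example, Ampere) as well as x86. A hedged sketch with llama-cpp-python; the model path and thread count are illustrative:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a small quantized model from local disk; no GPU required.
llm = Llama(
    model_path="/models/small-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=2048,
    n_threads=8,  # roughly match the cores available to the pod
)

result = llm("Q: What does Kubernetes do? A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])
```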
Susan Wu: I think you have some credits you're giving away, right?

Peter Pulliott: Oh yeah. Oracle announced at the last KubeCon in Chicago that they're offering credits to use Ampere compute in their cloud, specifically for CNCF-based projects and open-source projects in general. It's extremely exciting. Get out there and try it.

Susan Wu: Ricardo, there's a well-known GPU shortage, and traditionally people experience really low GPU utilization. Why is that happening, and what techniques can operators use to increase their GPU utilization?

Ricardo: That's something we've been looking at for quite a while now. In research and scientific computing we have some experience running batch workloads, and building on what Clayton was mentioning, we can separate two different patterns of usage. One is the interactive kind, which today includes inference and even CI/CD workloads. These need immediate access, but they tend to be quite spiky, and there we see very low overall GPU utilization, on the order of 20 to 30%. There are things we can do to optimize this pattern: sharing the GPUs better, or partitioning them, so we make the best of them. Several techniques are possible, there's a lot of work in the community to support this better, and Clayton mentioned many of these ideas; it's something we'll continue pushing for. Then we have the batch workloads, which are more predictable but longer running. Here what we want is to make sure there is always enough work in the queue to make the best of the system, with no pauses in between. Some primitives that exist in traditional HPC systems are landing in the Kubernetes and cloud-native area as well: queues, better scheduling primitives, things like co-scheduling. That is vital to cover this type of workload too. I'll add something I heard back at the AI day in Chicago: the ideal situation to aim for is sharing resources, where online and offline workloads share the same pool in a sort of tidal co-location. I think that's the aim we should go for.

Susan Wu: I've got to ask you, Clayton: are there any monitoring frameworks that can monitor GPU performance?

Clayton Coleman: There are not enough, and I think this is actually an opportunity, both in the ecosystem and for vendors. There are a ton of great tools, whether for monitoring accelerator usage or the rich history of CPU and host-level monitoring. But as we move into this more complex era, we need better tools for understanding where capacity is going, how capacity is going to flow, whether the needs of the workloads are being satisfied, how the storage system is behaving, how data is flowing. We're just at the beginning of stitching together these massive AI supercomputers that all of us are going to end up running ten years from now, and the monitoring systems have to evolve with them.
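At the host level, the starting point Clayton describes can be as simple as polling the driver tooling. A minimal sketch that samples per-GPU utilization by shelling out to nvidia-smi (in production you would more likely scrape something like NVIDIA's DCGM exporter with Prometheus):

```python
import csv
import subprocess

# Ask the NVIDIA driver for per-GPU utilization and memory (assumes the
# nvidia-smi binary is present on the node).
cmd = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

for row in csv.reader(out.splitlines()):
    index, util, mem_used, mem_total = (field.strip() for field in row)
    print(f"GPU {index}: {util}% busy, {mem_used}/{mem_total} MiB used")
```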
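The queueing primitives Ricardo mentions are also visible in the core API now: a batch Job can be created in a suspended state, and a queue controller such as Kueue admits it only when capacity is free, so a gang of workers starts together instead of pods starving one another. A sketch under assumed names (image, namespace, and queue name are hypothetical; the label shown is the one Kueue keys on):

```python
from kubernetes import client, config

config.load_kube_config()

worker = client.V1Container(
    name="worker",
    image="registry.example.com/batch-train:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(
        name="nightly-train",
        labels={"kueue.x-k8s.io/queue-name": "gpu-batch"},  # queue to wait in
    ),
    spec=client.V1JobSpec(
        suspend=True,  # created paused; the queue unsuspends it when GPUs free up
        completions=4,
        parallelism=4,  # all four workers admitted together
        completion_mode="Indexed",
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[worker]),
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ml-platform", body=job)
```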
Susan Wu: That makes sense. Lu, we talked a lot about performance. I've seen you present on moving work off the GPU onto the CPU so you can speed up data loading and data pre-processing. What are some of the economic considerations you want to share with the group?

Lu: Talking about economic considerations, the GPU utilization rate is one thing we cannot avoid, building on Ricardo's insight about scheduling different kinds of jobs together to maximize it. Our approach focuses more on the traffic flow and the storage performance. The idea is to attach CPU machines to the GPU machines, so the CPU machines can take on the initial data tasks, like data loading, data caching, and data pre-processing, while the GPU machines focus on training and serving. That largely reduces the GPU wait time, the time we spend getting data ready for AI.

AI is all about getting information out of the data, so how do we get the data ready for it? Let's go into a little more detail. There are different approaches for different kinds of AI workloads; training and serving are quite different. I'll share one simple example of getting data ready for AI training by leveraging two open-source frameworks, Alluxio and Ray. Alluxio decouples the storage system from the AI framework, so the AI framework can access different storage systems through a unified namespace, using a local file system interface or a Python API that data scientists are already familiar with. Alluxio can also leverage the disk resources on the CPU machines for caching: it caches data for AI training, which is especially helpful when a training job needs to load the same data again and again. On the other side, Ray can leverage the CPU resources on those machines to facilitate the last-mile data pre-processing. With Ray, the data loading, caching, pre-processing, and training can all be parallelized and pipelined together, which improves the overall GPU utilization rate.
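A rough sketch of that pipeline with Ray Data, which streams batches from CPU preprocessing workers to the training step so the stages overlap; the mount path is hypothetical (with Alluxio it could be a FUSE mount that caches a remote object store on local disks), and the preprocessing step is illustrative:

```python
import ray

ray.init()  # on Kubernetes this would connect to a KubeRay-managed cluster

def preprocess(batch):  # batch is a pandas.DataFrame
    batch["text"] = batch["text"].str.lower()  # stand-in for real last-mile work
    return batch

# Hypothetical path; behind it, a cache layer such as Alluxio can keep hot
# data on the CPU nodes so repeated epochs avoid refetching from remote storage.
ds = ray.data.read_parquet("/mnt/alluxio/training-data/")

# CPU workers run preprocessing in parallel while batches stream onward.
ds = ds.map_batches(preprocess, batch_format="pandas")

for batch in ds.iter_batches(batch_size=1024, batch_format="pandas"):
    ...  # hand each batch to the GPU training step
```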
Susan Wu: I think there's an operator for Ray on Kubernetes, right?

Lu: There are a lot of integrations coming for Ray and Kubernetes, and I think that's a great opportunity for us to keep improving.

Susan Wu: Peter, we talked a lot about accelerating AI performance in software. What can we do to optimize for sustainability?

Peter Pulliott: Software acceleration is a great start for LLM inference, but we need more sustainable compute to actually scale the workload. For LLM inference, a GPU isn't always necessary. Ampere has done extensive testing with small-parameter LLM inference running on our compute, and the empirical evidence is intriguing: we're seeing significant savings and cost efficiency, in some cases 40% to 80% lower operating cost. And it's available today; you can give it a try. In fact, later today you can see it running in the Oracle booth and experience it for yourself.

Susan Wu: Ricardo, you're in infrastructure, and AI workloads seem to be different from typical Kubernetes workloads. Do you want to share some advice for folks who might be running and operating infrastructure for AI workloads, if not today, then maybe tomorrow?

Ricardo: Yeah, there are quite a lot of challenges. We've been discussing them over the last couple of months in different sessions at KubeCon and elsewhere. There are challenges internally if you're running your own data centers, which were traditionally designed to host CPUs; when you start increasing the density with things like GPUs, a lot of issues appear. The main lesson we've learned is to stay as flexible as possible in the infrastructure you can support. Given the scarcity of GPUs that Susan mentioned at the start, that means planning for multi-cluster workloads, and especially hybrid workloads, where the majority of your workloads, the predictable ones, run on-premises if you can actually get hold of GPUs these days, complemented by the ability to burst into external resources where GPUs might be available in larger numbers. This is a design decision that has to be made quite early: support multi-cluster, and be flexible about where the resources come from. It's a key decision to make from the start.

Susan Wu: Well said. Let me summarize. You heard a lot of points, but Kubernetes really does look like it's becoming the standard for AI platforms, would you agree? So let's work as a community to make these accelerated workloads run much better on Kubernetes. Another consideration: Ricardo talked about the different workloads. Some are long-running, some are short and spiky, so make your resource allocation decisions based on the usage patterns. Lu talked about speeding up data loading and pre-processing: attach CPUs to your GPU clusters. And lastly, something I heard from my own users and customers: choose the right specialized compute for the right AI model. Our job is to make it easier for research scientists and data scientists to iterate much faster. With that, I'm going to close. I want to thank the panelists; we'll be around for the hallway track, so look for us. Thank you, and please give my panelists a big round of applause.