Hi everyone, my name is Cathy Zhang. I sit on the Cloud Native Computing Foundation's Technical Oversight Committee, and I'm also one of the leads of the newly formed Cloud Native AI working group. I'm very happy to be here.

Hi, I'm Ricardo Aravena. I'm one of the leads of the Cloud Native AI working group in the CNCF, and I'm also a co-chair of the CNCF TAG Runtime.

Hi everyone, I'm Alolita Sharma. I'm a co-chair of the Observability TAG in the CNCF, part of the Governing Board of the CNCF, and a maintainer on the OpenTelemetry project, one of the observability projects the CNCF has. Super happy to be here today. Please have your questions ready for the Q&A later.

Hey everyone, I'm Madhuri, founder of Elotl. I've been in the container ecosystem since 2015; I worked on the Flocker project, which some of you may be old enough to remember. Before that I worked at VMware on virtualization and spent some time on databases at Oracle. I really love the cloud native ecosystem and community, and I'm excited to talk about all things AI and CNCF.

All right, it turns out all my panelists are from the Bay Area, so the Bay Area is well represented here in Paris. Before we get started and dive deep into the discussion, I'd like the audience to give some call-outs on what they are looking forward to from it. Any call-outs, shout-outs? You can just shout them out. Challenges for AI workloads? Anything else? LLMs, I hear. When I asked this question last time, someone asked me whether they could upgrade the Kubernetes control plane using LLMs. All right, rest assured that all of these points are going to be covered in this discussion.

To start off, I'll go to Cathy and dive right in. Accessing AI infrastructure is difficult right now; you need access to GPUs. So how do you see access to AI infrastructure being democratized from an open source perspective?

Okay. We all know that AI is evolving very fast, but access to the hardware, to GPUs, is still a challenge. When we think about AI training or inference we usually associate them with GPUs, but you can actually run AI inference on AI-accelerated CPUs. So if you have trouble accessing GPUs, you can try your AI inference applications on CPUs with the AI accelerators turned on.

Another thing that can help simplify access to and use of GPUs is to build an abstraction layer on top of the different hardware vendors' GPUs, with a generic API layer that is vendor neutral and works across different GPU architectures and different compute engines, whether that's a GPU, a CPU, an FPGA, or other accelerators. It isn't realistic to require AI developers or AI scientists to know the ins and outs of every vendor's GPU architecture, because they are all different. If each vendor has its own APIs, then as a user, as an AI developer, you have to understand all of those APIs, and once you develop your application against one set of APIs, it is hard to port it to another architecture, another vendor's GPU. A newly formed foundation called the UXL Foundation was created recently to address exactly this problem. The foundation will build that abstraction layer and a unified set of APIs based on the oneAPI open programming model, which can work across the different GPU architectures and different vendors. This will greatly simplify end users' use of the lower-level hardware and promote accessibility to GPUs: you can develop your app on one vendor's hardware and migrate it to another vendor's hardware, which makes your life much easier.
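To make the portability point concrete, here is a minimal sketch in the same spirit, using ONNX Runtime rather than the UXL/oneAPI stack the panel mentions, since the latter is still maturing. The model path, input shape, and provider list are illustrative assumptions; the point is that one inference call site can prefer a GPU backend and fall back to CPU.

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" and the input shape are placeholders. The provider list is a
# preference order: ONNX Runtime falls back to the CPU provider when the
# CUDA provider is unavailable, so the same call site runs on both.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # which providers were actually enabled

input_name = session.get_inputs()[0].name
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)  # illustrative batch
outputs = session.run(None, {input_name: batch})
```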
I really like the call-out of a hardware abstraction layer that is vendor neutral, and the unified API calls on top of it, so thanks for that. But taking the abstraction layer one step further, Ricardo, how do you see cloud native becoming more relevant to AI, specifically around OCI artifacts or distributed training?

Yes. The LLMs that are very popular now are typically distributed using the GGUF file format, and most of you are probably familiar with cloud native and the OCI spec, which is how containers are distributed. So I see a synergy there: GGUF and OCI working together so you can distribute your models across different cloud native environments and take advantage of registries like Harbor and other cloud native open source projects that already help you distribute container images. So that's one way.
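As a sketch of what that could look like in practice, here is a hedged example that pushes and pulls a GGUF file as an OCI artifact using the ORAS CLI. The registry reference and media type are made up for illustration, and it assumes the `oras` binary is installed and already logged in to the registry.

```python
import subprocess

# Hypothetical registry reference and media type, purely for illustration.
MODEL_REF = "registry.example.com/models/demo-llm:v1"

def push_model(gguf_path: str) -> None:
    # Push the GGUF file as an OCI artifact; any artifact-capable OCI
    # registry (Harbor, for example) can then store and replicate it.
    subprocess.run(
        ["oras", "push", MODEL_REF, f"{gguf_path}:application/vnd.example.gguf"],
        check=True,
    )

def pull_model(dest_dir: str) -> None:
    # Pull the artifact back down on whatever cluster needs to serve it.
    subprocess.run(["oras", "pull", MODEL_REF, "-o", dest_dir], check=True)

if __name__ == "__main__":
    push_model("model.gguf")
    pull_model("./models")
```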
I also think the de facto way of running AI training and inference is going to be on cloud native open source, ideally if we're talking about creating open source models, so that's another way this is going to happen going forward. And then there is how you can use AI technology to improve cloud native environments and architectures themselves. Generative AI is one thing, but predictive AI has been around for quite a while and has been used in several environments, for example anomaly detection on API latencies. That's been around, and generative AI is going to enhance it even further.
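As a toy illustration of that predictive side, here is a minimal rolling z-score detector over API latencies. The window size and threshold are arbitrary assumptions; real systems would use far more robust models.

```python
import math
from collections import deque

def make_latency_detector(window: int = 100, threshold: float = 3.0):
    """Flag latencies more than `threshold` standard deviations from the
    rolling mean. Window and threshold are arbitrary choices here."""
    history: deque = deque(maxlen=window)

    def observe(latency_ms: float) -> bool:
        anomalous = False
        if len(history) >= 30:  # wait for a minimal baseline
            mean = sum(history) / len(history)
            variance = sum((v - mean) ** 2 for v in history) / len(history)
            std = math.sqrt(variance) or 1e-9
            anomalous = abs(latency_ms - mean) / std > threshold
        history.append(latency_ms)
        return anomalous

    return observe

detect = make_latency_detector()
for v in [52, 48, 50, 51, 49] * 10 + [400]:
    if detect(v):
        print(f"anomaly: {v} ms")  # fires on the 400 ms outlier
```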
Yeah, thanks for that. I like the call-out on the use of predictive AI for anomaly detection, parts of which have already been used around Kubernetes, mostly through audit logs and the like. But talking about AI applications some more: Alolita, how do you see profiling of AI applications going forward?

I think, again, as we come upon the world of LLMs, which has become the buzzword now, it's not that work in this space, especially on building large-scale AI apps, hasn't been around. But it's always a sweet moment when you see cloud native infrastructure, at scale if you will, converge with the applications being run for AI and using AI. And especially, as our moderator called out: so you're building AI applications, how do you optimize them? For optimization you need observability, and specifically an understanding not only of the hardware, the infrastructure you run on, but also of the applications you are running on it.

Going back to profiling: profiling is one way of really understanding the behavior of software applications. It has been done for a long time, all the way down to the OS; many of you have worked with Linux profiling itself. In the AI application space the same applies. What is different today is the advent of large language models. What you're seeing now is that it matters not only to understand resource utilization, which observability, and profiling specifically, gives you, but also the performance of your software application, of your models and your model inference, and of the actual hardware used for AI applications, including GPUs: not only accelerated CPUs, but GPUs and other accelerators. Profiling has many aspects that are already covered, but as we go forward there are gaps in what you can observe today, especially for AI applications, and that's something I think we'll dive into.
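To ground that, here is a minimal profiling sketch using PyTorch's built-in profiler, chosen as one readily available example rather than anything the panel specifically endorses. The model and batch are placeholders, and GPU profiling is only enabled when a CUDA device is present.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Toy model and batch; real AI applications would profile actual inference.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
batch = torch.randn(8, 512)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():  # only add GPU profiling when a GPU exists
    activities.append(ProfilerActivity.CUDA)
    model, batch = model.cuda(), batch.cuda()

with profile(activities=activities, record_shapes=True) as prof:
    with torch.no_grad():
        model(batch)

# Where the time went, per operator, on CPU and (if present) GPU.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```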
Yeah, I can't wait for the cloud native goodness to be extended toward AI, especially from a profiling perspective; that's a very exciting space. But shifting focus to the end users now: Madhuri, what do the market trends look like? What works, and what doesn't work, for the users?

That's a really good question. From an end user's point of view, our end user persona, the data scientist, needs a cloud native platform that meets their needs: don't make me worry about where the GPU is located, what the price of the GPU is, or what these Kubernetes stack components are. Just give me an opinionated cloud native platform, recommended by the cloud native community, that lets me run my AI/ML applications and also reduces the nagging I get from the finance team and the CISO. The finance persona is the second stakeholder the data scientist needs to be concerned about: are you actually using the money the company is paying for you to use? Are you using exactly what you need, or are you spending more than what you're actually using? And the third stakeholder is the CISO: are your data gravity requirements being satisfied, is the platform you're using going to be HIPAA compliant, and so on.

What doesn't work is waiting. The time for this platform is now. We don't have the luxury we had with the container orchestrators, where we had Swarm, Mesos, and Nomad alongside Kubernetes and took five years to figure out where to converge. If the cloud native community doesn't come up with this cloud native platform recommendation within the next six months or so, the industry will go and build its own, and there will be a bunch of bespoke stacks running around that don't take advantage of all the community effort we've put into the cloud native standard. So we need to get to this platform as soon as possible.

Yeah, I understand the need for a reference architecture, and speaking of reference architectures, that's one of the deliverables of the working group that has been formed in the CNCF on artificial intelligence. Shifting gears to the working group: this morning the white paper on cloud native and AI was released by the working group, so a huge shout-out to all the folks involved. Ricardo, as one of the co-founders of that working group, how has your experience been, and what are your plans ahead?

Yeah, my experience has been great. A lot of folks interested in the space joined a community effort to put together maybe twenty pages on cloud native AI, talking about the challenges, the history of AI, the history of cloud native, and how they're merging together now, especially with generative AI, and some of the opportunities ahead. I'm pretty excited about things to come. Some of the things we're thinking about are creating a reference architecture, and also a landscape: you might be familiar with the cloud native landscape, but we want to create one that is more targeted toward cloud native AI. We also want to make it easier for people to get started with things like Kubeflow, maybe with a constrained environment where people can just play around and understand the full lifecycle of machine learning, which includes data prep, creating the models, storing the models as model artifacts, and then pulling those model artifacts and serving them in a microservice type of environment; a minimal sketch of that store-and-serve step follows this exchange. So there are a lot of exciting things, and we're looking forward to what's coming. We hope folks in the community get interested and join the working group.

Yeah, actually, we welcome you to join that working group. We have bi-weekly meetings, you can contribute, and together we can drive cloud native and AI forward.
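Here is that minimal store-and-serve sketch, assuming a scikit-learn-style model saved with joblib and served with FastAPI. The file names, toy data, and endpoint shape are illustrative, not anything the working group prescribes.

```python
# Stand-in for "create the model, store the artifact":
import joblib
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]  # toy data in place of a real data-prep step
y = [0, 0, 1, 1]
joblib.dump(LogisticRegression().fit(X, y), "model.joblib")

# Stand-in for "pull the artifact and serve it as a microservice"
# (run with: uvicorn this_module:app):
from fastapi import FastAPI

app = FastAPI()
model = joblib.load("model.joblib")

@app.post("/predict")
def predict(features: list[float]) -> dict:
    return {"prediction": int(model.predict([features])[0])}
```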
Absolutely, I love the way the working group is shaping up. So if you're interested in where the action is happening on all things cloud native and AI, that's the working group to join. But taking things ahead and talking more about observability and the observability TAG: Alolita, how are your efforts shaping up, and how is TAG Observability positioned to benefit AI?

Actually, I'm very proud to say that when we started the working group a few months back and announced it at KubeCon in Chicago, TAG Observability was one of the TAGs that co-sponsored the working group, so I'm super excited about the work the AI working group has done. In general, and this goes back to one of the comments Madhuri made, you're seeing a changing landscape. We all live and breathe in the world of Kubernetes and cloud native today, because we all operate in the cloud native computing space, and many of our toolchains, as well as the related open source projects in the CNCF, are foundationally revolving around the universe of Kubernetes. That also means that as the world of AI comes in, with different types of black-box models, different training pipelines, different workflows of data, there is a whole layer of assets and components entering this space. Not only does Kubernetes, as the foundational layer, need to handle them and have visibility into them, but so do the related areas we have in landscape.cncf.io, from app delivery all the way to security and observability. They also need projects that are aware of this new generation of applications coming into the space.

What that means for observability specifically is that there's a lot of discussion happening in the open source projects about what this should look like, because you want interoperable open standards. Do you have open specifications for what LLM metrics look like, what model behavior looks like, what you define as normal thresholds, if you will? Those discussions need to evolve in different directions: more white papers in specific areas, and semantic conventions, which are really the standard for being able to interoperate across different data workloads, AI applications, and models, as well as the lower-level observability components that exist today, like OpenTelemetry, which is a collection mechanism, for example. And, as Madhuri was saying, time is of the essence, and that work needs to accelerate. So I'm hoping that with some of the discussions going on in TAG Observability we can help accelerate it, and also join forces further with the AI working group itself to maintain that momentum.
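As a flavor of what LLM-aware semantic conventions could look like, here is a hedged OpenTelemetry Python sketch recording an inference-latency histogram. The metric and attribute names echo the still-evolving draft GenAI semantic conventions, so treat them as illustrative rather than settled.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Console exporter just for the sketch; production would export via OTLP.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("llm.serving.sketch")
duration = meter.create_histogram(
    "gen_ai.client.operation.duration",  # name echoes draft GenAI conventions
    unit="s",
    description="LLM inference duration",
)

# Attributes are illustrative; a real integration would follow whatever
# conventions the community settles on.
duration.record(1.42, {"gen_ai.system": "example", "gen_ai.request.model": "example-7b"})
```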
Absolutely. So if you're interested in defining the semantics for observing LLMs, you now know which TAG to join. Cathy, you are one of the tech leads of the working group and also a member of the Technical Oversight Committee of the CNCF. Which innovative areas do you see, particularly in the open source realm, as most relevant to artificial intelligence, and where can cloud native contribute?

I think cloud native can contribute in a few areas. One is performance. An AI inference application sometimes doesn't need a whole GPU to run; some inference applications may need just a fractional GPU. So cloud native can develop GPU-partition-aware, or virtual-GPU-aware, schedulers that place your AI inference workload, or maybe some fine-tuning, onto a fractional GPU, and that helps save cost.

Another area is distributed training, which is very popular today. Distributed training involves communication between GPUs on the same server or across servers, and the link speeds and costs differ: intra-server communication links are much faster than inter-server links. If you zoom in on a GPU server, you'll see different links connecting the different components. So if we can build schedulers that are aware of the GPU topology, the hardware topology, and the link costs, they can schedule your AI workload onto the set of GPUs that renders the best performance, because they take the link cost between components, across servers, across switches, into consideration.

Another thing cloud native can build is an autoscaler guided by the actual real-time GPU, CPU, and memory utilization metrics. Based on those metrics it can automatically scale out or in, scale up or down, to meet the current resource needs of the AI service. We all know that some AI application services have traffic peaks and valleys. Without an autoscaling mechanism, you'd probably need to plan for the peak to guarantee the service, but with such a mechanism you can save on resource cost and let the autoscaler automatically scale up and down based on your real-time usage needs. Sketches of all three ideas, fractional GPU requests, topology-aware placement, and utilization-driven scaling, follow this answer.
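First, a fractional GPU request. This hedged sketch uses the Kubernetes Python client to request a single NVIDIA MIG slice; the `nvidia.com/mig-1g.5gb` resource name exists when the NVIDIA device plugin exposes MIG devices in mixed mode, but the image, namespace, and resource name are assumptions about one particular cluster setup.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a reachable cluster and local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="fractional-gpu-inference"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="inference",
                image="registry.example.com/inference:latest",  # placeholder
                # Request a MIG slice instead of a whole GPU; the exact
                # resource name depends on the device plugin configuration.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/mig-1g.5gb": "1"},
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```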
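Second, topology awareness. Here is a toy placement function that picks the set of free GPUs minimizing total pairwise link cost; the four-GPU cost matrix, two GPUs per server with fast intra-server links and slow inter-server links, is entirely made up to show the shape of the idea.

```python
import itertools

# Illustrative relative link costs for 4 GPUs on 2 servers:
# GPUs 0-1 share server A, GPUs 2-3 share server B.
LINK_COST = [
    [0, 1, 20, 20],
    [1, 0, 20, 20],
    [20, 20, 0, 1],
    [20, 20, 1, 0],
]

def best_gpu_set(free_gpus, k):
    """Return the k free GPUs with the cheapest total pairwise communication."""
    def total_cost(combo):
        return sum(LINK_COST[a][b] for a, b in itertools.combinations(combo, 2))
    return min(itertools.combinations(free_gpus, k), key=total_cost)

print(best_gpu_set([0, 1, 2, 3], 2))  # -> (0, 1): co-located beats cross-server
```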
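Third, utilization-guided scaling. This sketch reuses the shape of the Kubernetes HPA rule, desired = ceil(current * currentMetric / targetMetric), applied to a GPU utilization reading; the metric source and bounds are placeholders.

```python
import math

def desired_replicas(
    current_replicas: int,
    current_gpu_util: float,  # e.g. average utilization across replicas, 0-100
    target_gpu_util: float = 70.0,
    min_replicas: int = 1,
    max_replicas: int = 16,
) -> int:
    """HPA-style proportional scaling on a GPU utilization metric."""
    if current_gpu_util <= 0:
        return current_replicas  # no signal, hold steady
    desired = math.ceil(current_replicas * current_gpu_util / target_gpu_util)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 95.0))  # traffic peak -> scale out to 6
print(desired_replicas(4, 20.0))  # valley -> scale in to 2
```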
Yeah, I really like those call-outs, thanks for that. We've heard from the leaders within the CNCF on what lies in the future, so it's only fair to flip the side and get a pulse from the end users. Madhuri, over to you: if the end users were to ask one or two things of the cloud native community, what would they be?

A cloud native platform that comes with a reference architecture satisfying the three stakeholders: the data scientist persona, the finance persona, and the CISO persona.

That's great. I think a sustainable and responsible future is what lies ahead of us. Moving forward, I'd like to keep one or two questions rolling that each of you can take a stab at. What areas would you recommend people keep in mind while building open source projects at the intersection of cloud native and AI?

Okay, I'll go first. I think the convergence of cloud native and AI will drive enterprise AI, edge AI, and AI PCs, leveraging technologies such as fractional GPU allocation and GPU-utilization-telemetry-guided autoscaling. So when you develop solutions for those scenarios, you need to pay attention to resource sharing and isolation, because in those scenarios the GPU resource, or even the AI-accelerated CPU resource, is limited. It's not like the public cloud; it's at your edge or in your AI PC.

Yeah, thanks for that. Ricardo, would you like to take a stab?

Yeah. One cloud native area that is very popular now, or has been gaining a lot of steam, is WebAssembly, and I think you could actually use that technology to serve machine learning models. One of the interesting areas is the edge, where you could create an inference service at the edge and make it something like an AI agent; there are a lot of conversations about using AI agents. Basically, if you have an LLM, you can partition the responses, the workload, across multiple models. At the edge you'd have one part of the compute happening in a smaller language model, and then send the request over to a centralized, larger language model to do the rest of the processing there. That's one of the architectures a lot of folks are talking about creating, and I think cloud native can help in that area, as can WebAssembly, because it's very lightweight at the edge. There are some projects that have actually been using this pattern; an example is WasmEdge, which is a CNCF project. So take a look at that, and I think that's something we'll see more of in the future: taking machine learning to the edge. A small sketch of the small-model-first pattern follows.
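Here is that cascade pattern as a minimal sketch: answer locally with a small model, escalate to a central large model only when confidence is low. Every name here is a hypothetical stand-in for a real inference client, and the threshold is arbitrary.

```python
CONFIDENCE_FLOOR = 0.7  # arbitrary escalation threshold

def small_model(prompt: str):
    """Placeholder for a small model running at the edge (e.g. in Wasm)."""
    return "draft answer", 0.55  # (text, self-reported confidence)

def large_model(prompt: str) -> str:
    """Placeholder for a network call to a centralized large model."""
    return "full answer"

def answer(prompt: str) -> str:
    text, confidence = small_model(prompt)
    if confidence >= CONFIDENCE_FLOOR:
        return text  # served entirely at the edge, no round trip
    return large_model(prompt)  # escalate only the hard cases

print(answer("What is the capital of France?"))
```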
Yes, in fact, given that there is AI everywhere, I would say the edge as well as the core is equally important from an infrastructure standpoint. Being able to observe AI applications at the edge as well as at the core means you can operate AI applications in both places. Fundamentally, right now, as Ricardo just said, you can only run smaller models on the client, on the edge, and that's due to many reasons, hardware included. And as the world shifts, most folks want guardrails, rightfully, for data privacy, security, and fair use, and those guarantees are best made at the edge. Being able to guarantee that the infrastructure, the observability, the security, and everything that goes with infrastructure is available end to end: that's the opportunity I see as the next boundary we need to address and work toward.

Absolutely. Madhuri?

Yeah, to add on to the edge conversation: providing cloud-agnostic compute for your emerging workloads, your LLM and generative AI workloads; cloud-agnostic compute that honors your CISO requirements and is also cost aware. I think that's a very interesting area where there is a lot of opportunity for the open source ecosystem at this intersection of cloud native and AI.

Yeah, totally. And actually, I would say that for that to happen, it's very important to have open standards and interoperability, and that's why cloud native open source matters.

Absolutely. It's all about fostering collaboration among the end users, researchers, maintainers, and collaborators, keeping open source at the heart of it. With that, I would like to open the floor for questions. Anyone with questions on top of their minds? What would you like to hear in terms of challenges for AI workloads, or anything along those lines?

Hello, thank you for your words. I'm just wondering: how is the relationship with the hardware developers? We know NVIDIA, for example, and how they connect with the cloud native community. That's my main question: how does this communication go between the developers of hardware and the cloud native AI community?

So I think the question is how hardware vendors and the cloud native community communicate and collaborate, is that right? You can take a stab at it.

This alludes to my earlier point about the need to go to market as soon as possible because of the demands of AI today. The way we are looking at it is: the hardware vendor plus the cloud provider providing Kubernetes out of the box, plus a multi-cluster platform sitting on top. Have an opinionated stack that stitches these three layers together; that enables us to go to market quickly. So the ecosystem is definitely working with hardware vendors. Happy to dig deeper into it in a follow-on.

Yeah, just a short add-on: I think we have a lot of folks from those vendors working in the CNCF community. We have folks from Intel and NVIDIA; I'm not really sure if we have anybody from AMD, but we're trying to work with all the different vendors so that we create standards that can work across all of them, and then all of the end users can benefit from them.

In the Cloud Native Computing Foundation there is also what's called the End User TAB. If you're an end user, you're welcome to join there, and if you see challenges in using cloud native infrastructure, you are very welcome to voice your opinions there and influence cloud native technologies. Also, as we mentioned, there's the Cloud Native AI working group. We would really like end users to join that working group, whether you're an AI developer, an AI scientist, or an AI user, to give us your feedback on the challenges and pain points, and then we can work together on solving them. We have quite a few cloud native experts in that working group, but we would like close collaboration with the end users. Also, as I mentioned before, down the road a closer integration of the upper orchestration layer with the hardware layer, across those several layers, will be very good for the larger community. Then we can provide a very simple user interface for allocating the lower-layer GPU resources or dynamically adjusting your resource boundaries, and make your life easier. You don't need to understand the infrastructure or the hardware; as a user, as an AI developer, you can concentrate on developing your AI models and let the infrastructure layer automatically take care of high availability, resource scheduling, resource allocation, resource scaling, resilience, and disaster recovery for you.

I think with that we come to an end. We have a lot of other questions, but we'll take them in the hallway. This has been a great discussion. Thank you so much for joining the panel, thank you everyone for attending, and have a great rest of the conference. Thank you.