Welcome to this session on the Future of Intelligent Cluster Operations. Can I get a show of hands of how many people made it to the keynotes this morning? Okay, so almost all of you have already heard about AI. So welcome to yet another talk about AI. Since we're going to talk about cluster operations, let's start with the life of a Kubernetes engineer. But before that, I'm Rajas, I work at VMware Broadcom. I'm also involved in the CNCF Technical Advisory Group Runtime, most specifically the working group around artificial intelligence in CNCF, as well as the working groups around Wasm and things like that. And I have Amin with me, how are you? Hi folks, my name is Amin, I'm part of the AWS EKS team. I mostly work on controllers and recently LLMnetes, so yeah, pretty excited to do this. That's good. Next slide. All right, so the life of a Kubernetes engineer. This is mostly what the life of a Kubernetes engineer looks like: most of the time we are on call. And then we'll walk you through some of the other scenarios that we go through. So this is what happens: you're good at deploying clusters, you're all happy, you go watch a movie, and then you find out that a cluster upgrade failed. Once the upgrade fails, you figure out that the APIs that were there were deprecated, and then eventually you have to restore the backup from etcd. Legend says that this is also a screenshot of someone who upgraded to 1.24 and didn't keep in mind that Dockershim was removed. Thoughts and prayers with them right now. The other scenario we mostly face is when we've added all of the features, we're doing well, our product's stock is going up, and then we find CVEs which we didn't resolve and which led to the stock tanking. So the point we're trying to make over here is that Kubernetes has become boring, yet there are problems which are pain points. In this screenshot, we can see that Reddit went down because of a Kubernetes cluster upgrade. Upgrading Kubernetes clusters is not very easy or very seamless as of now, for multiple reasons. Similarly, CVE scanning and bumping dependencies are also not very seamless. So all in all, debugging clusters when something goes down is still a pain point. What we're trying to focus on over here is how AI can help you. A shout out to K8sGPT, which recently got into the CNCF Sandbox. This is a step in the right direction, wherein we're trying to see how we can get AI to make Kubernetes better, or just make cluster operations better. What we're going to talk about right now is something where we are embedding LLMs backed by Kubernetes controllers, and we're going to see what that looks like. So it's LLMs plus Kubernetes controllers, and the cool kids these days are calling it LLMnetes. So it's a conceptual walkthrough of what LLMnetes is, what controllers backed by LLMs look like, and how the journey has been. So before getting started: everything in Kubernetes is a controller, from namespaces to pods to deployments, and LLMnetes is no exception. It's just another controller you install in your cluster. You're going to feed it a few CRDs, and it's going to do a bunch of actions in the cluster. So basically, LLMnetes is going to interact with an LLM endpoint. It could be OpenAI or Claude 3, it could be a local LLM. It's going to fetch some information from those models and apply it back to the cluster. For example: hey, how can I create a pod, or how can I create a load balancer, or how can I destroy my cluster?
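For reference, a CommandExec instance could look something like the sketch below. The group, version, and field names here are illustrative assumptions, not the project's actual schema:

```yaml
# Hypothetical CommandExec resource: apiVersion, kind, and spec fields
# are assumptions for illustration, not LLMnetes' real schema.
apiVersion: llmnetes.dev/v1alpha1
kind: CommandExec
metadata:
  name: expose-nginx
spec:
  # Plain-English instruction; the controller sends it to the configured
  # LLM endpoint and applies the returned manifests back to the cluster.
  command: "Create three nginx pods and expose them with a load balancer on port 80"
```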
We're going to jump real quick to a demo. Just a disclaimer: what's happening here is live, so whatever the LLM gives us back is going to be applied directly to the cluster, and if something doesn't work, it's very likely because LLMs are not deterministic, or sometimes they do way more than you asked them to. We're also calling these Schrödinger's demos right now: they may or may not work, so let's see how it goes. Let the demo gods be with us. Just one quick note: last KubeCon in Chicago, Rajas and I presented the same topic, LLMnetes, and we started with this concept called command exec. So instead of really giving a deployment spec, you can say, hey, create three nginx pods and expose them, maybe with a load balancer on port 80. Today, we have more CRDs. We have what we call cluster audits. For example, you can ask the controller to scan the images and tell you what's wrong with the pods and the deployments. You could also ask LLMnetes to break the load balancers. You don't specify whether it should really delete them or update the spec, but it can break them one way or another. So yeah, we have that screenshot just in case the LLM doesn't work, but now we're gonna do the demo. Is the font size okay? Maybe bigger? Is this good? Okay, sweet. All right, so we have a bunch of examples in here. We have a controller running locally against a kind cluster. So if I do cluster-info, I have my cluster running, and I can show, for example, the command exec. This command exec can deploy a cron job that will delete a pod randomly every two hours. It sounds like a chaos engineering task. So with time, Rajas and I decided to make this into what we call a chaos simulation CRD. And that became, for example, in here, a chaos simulation. Basically, you do nothing. You just give it a CRD called chaos simulation, and the LLM is gonna decide what to do with it. All right, so without any other comments on this, I'm just gonna go and apply the command exec to show, or also pray that it's gonna work. Sometimes it doesn't. All right, so we asked the command exec to create a cron job that will delete pods randomly in the cluster every two hours. And you can see in the status of the CRD that the command was processed successfully, and it was executed successfully as well. You can see the YAML file that was applied to the cluster. It says there's a CronJob, batch/v1beta1, which I believe is deprecated in 1.29, so this one is not gonna work. And you can see this, I'm not a cron job expert, but the schedule should be like maybe every 30 minutes or every two hours. And if I do, no, actually, we have some replacements. So we changed the v1beta1 to v1, and you can see that we have a cron job in the cluster running right now, trying to delete pods randomly in the default namespace. Another example we can show today is CVE scanning. So we can see here, we're gonna scan all the images of the pods and deployments running in the cluster, and then we're gonna return what CVEs were detected and what kind of actions we can take to address them. Before that, I can show I have a tiny pod running in here. So if I do, yeah, it's not nginx, just named nginx, but if I do get pods, we have an S3 controller running in the cluster and it's not working. I mean, it's in CrashLoopBackOff, but we're gonna run a CVE scan on this one. All right, let's follow this.
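To make the cluster audit idea concrete, a CVE-scan audit resource might look roughly like this; again, the schema shown is an assumption for illustration:

```yaml
# Hypothetical ClusterAudit resource for a CVE scan; field names are
# assumptions. The controller is described as running Trivy over the
# images of running pods/deployments, then asking the LLM for next steps.
apiVersion: llmnetes.dev/v1alpha1
kind: ClusterAudit
metadata:
  name: scan-cves
spec:
  type: cve-scan
  scope:
    namespaces: ["default"]   # hypothetical knob to limit the scan
```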
It's gonna take some time for the LLM to respond, and I think we're also running Trivy behind the scenes, so we're calling a binary to scan the images and then calling some APIs and LLMs to get some answers. It's a little bit slow, or maybe it crashed. History says that demos never really work live, but the point that we're trying to make over here is how we are auditing the cluster: we're auditing all the images of the cluster, we're passing those specs to Trivy and seeing which images have CVEs that need to be addressed, and then hitting the LLM API at the back end, which tells us what next steps we have to take. So that was the output that we were expecting over here, wherein the LLM endpoint would tell us: hey, these are the CVEs detected in your image, these are the next steps, these are the dependencies you've got to bump, and this is how it would be addressed. That didn't really work right now, but that was the point we were trying to put across. Okay, we have yet another crisis to solve: the HDMI has gone down. Oh, by the way, I talked about the life of a Kubernetes engineer in the beginning; this is what the life of a Kubernetes engineer at KubeCon looks like. Okay, it's back up, I guess. Okay, cool. So the CVE scan finished, and apparently it cannot connect to ECR. Oh, God, okay, let's give it one last try. It's always credentials. Okay, so we're trying CVE scanning again. Actually, it should be... Okay, so we got the controller running. Now we're gonna apply the cluster audit CVE YAML. We got that in, we just wait. But this time it should be faster because the image, I think, is cached. Oh, okay, sweet. So yeah, this is what it looks like, right? Look at the output over here that we're trying to focus on, wherein it says CVE so-and-so with a particular vulnerability around it, the package it was associated with, what version was installed, and then the actions to take. So it's not just telling you what went wrong with your cluster, but also what next step to take. Sweet. In this case, it was the Go networking library. This was a pretty recent CVE, I think from December. So if you haven't updated your libraries, please do. The last example. Yeah, and the point is, this is how LLMs at the back end of your Kubernetes controllers can actually help you drive your cluster operations. CVEs are just an example, right? There's CVE scanning, there are upgrades, and so on and so forth. Another quick example we have is auditing storage. For example, you can look at the PVCs and PVs and see what's wrong with them. We know that in a lot of cases users try to deploy storage solutions and they don't know what's really happening. So we developed this plugin that will go and scan your PVCs and tell you what's wrong with them, or what could be the problem. So I have, I think, a pending VPC in here, yeah. Sorry, PVC. And it can tell you there is a problem with this PVC, it's still in the Pending phase, and it's gonna suggest some of the things you should be able to do. And the more information you give about the provider, the more tips and hints it's gonna give you on how to address this. We have two more things to show. So before we get into the voice chat, or like talking to the cluster, we're gonna show the chaos simulation. So this one is the empty CRD, let's say, or the empty spec.
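The chaos simulation resource being described is essentially an empty spec; here is a sketch of what that might look like, with an assumed schema:

```yaml
# Hypothetical ChaosSimulation resource; apiVersion and kind are
# assumptions. The spec is intentionally empty: the LLM decides what
# chaos to inject (e.g. a cron job deleting random pods or services).
apiVersion: llmnetes.dev/v1alpha1
kind: ChaosSimulation
metadata:
  name: surprise-me
spec: {}
```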
We're gonna apply this chaos simulation, and basically we didn't instruct the controller to do anything in here. We're just like, hey, just simulate some chaos in the cluster. And we're gonna see what kind of action the controller takes. It's always a cron job that it starts with. So here it's trying to delete pods, right? Yes. Sometimes it goes for services, sometimes for pods, sometimes for both, but today it chose the pods, yeah. So this is an example of a cluster chaos simulation. In all of these examples, you can talk to the cluster in plain English: you talk to the cluster in plain English to scan for your CVEs, you talk to your cluster in plain English about your persistent volume claims, you talk to your cluster in plain English to simulate chaos. And then, with the help of an LLM, it generates the Kubernetes configuration required to do the requested action. The thing to note over here is that the controller also has write access to your cluster. So in case things go wrong, you and only you are responsible for this. Sweet. We can go to the next one. I think this is the trickiest part of the whole demo. It works maybe 20% of the time, but basically we're gonna try to talk to the cluster and be like, hey, can I upgrade to the next version, without really giving it a lot of information? So in this case, we have a tiny binary that's gonna help us take the voice, transcribe it into an LLMnetes resource, and apply it to the cluster. All right, I'm gonna start here. So: can I upgrade to the next version? All right, we can see that we have a cluster audit of type cluster upgrade. And if I do cluster audits in here... No, you can never upgrade it. Can you highlight this text, the output? Yeah, so this is the output of the LLM. And what we did behind the scenes is that we queried all the available resources in the cluster, and then we fed it some information about the deprecated APIs, and we're like, hey, can I really upgrade? The thing is that this is false information. Most of the time you can actually upgrade very easily from 1.27 to 1.28, because there are not a lot of deprecated APIs. Correct me if I'm wrong. A cron job maybe? I think that was 1.25. 1.25, okay. I think 1.27 to 1.28 is one of the safest. But clearly the LLM here is trying to be careful and telling you: no, actually you should go and check more, which is correct. Actually, if you think about it, you should also check the CRDs, because here we're only checking the native resources like pods, cron jobs, and all those things. So the point is, you can not only write plain English to your Kubernetes cluster but also talk to it in plain English, and then get a response from it which basically guides you through your cluster operations. This one was the hardest to get working because it's about upgrades. Upgrades are not safe, upgrades are not easy, but again, you're not relying on the LLM to do upgrades on its own; you're taking it as assistance for doing upgrades. Sweet. So we don't have to show the screenshots, because I think most of the things worked, almost, yeah. So in this case, the chaos CRD's deletion was random, either pods or services, or, I think, no, it deleted services randomly; you can see here the random pick from the services array. And for this one, we had the cluster upgrade check. So yesterday it said yes, today it said no, so it's unpredictable.
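Putting the voice example in resource terms, the transcribed request presumably lands in the cluster as something like the following; the kind, spec, and especially the status fields are invented for illustration:

```yaml
# Hypothetical ClusterAudit of type cluster-upgrade, including a made-up
# status block showing the kind of non-deterministic verdict described.
apiVersion: llmnetes.dev/v1alpha1
kind: ClusterAudit
metadata:
  name: upgrade-check
spec:
  type: cluster-upgrade   # "Can I upgrade to the next version?"
status:
  verdict: "Not recommended"   # yesterday it said yes, today it says no
  reason: "Review deprecated APIs and CRDs before upgrading"
```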
Well, that brings us to the next question, one of the most-asked questions: hey, when you ask an LLM "can I upgrade to the next version", what is really happening in its mind? It is really in this situation: should I say yes or should I say no? And what we've learned while building this prototype is that machine learning models without classifiers are just probabilities. And we cannot rely on probabilities to do things like cluster upgrades. LLMs are not deterministic. I've heard that some people can make them deterministic, but the ones that we're using here, the local ones and the OpenAI API, are not. So every time you run the same request, you get a different answer. If you want to upgrade your cluster, you're looking for precise data. You're looking for an exact answer. You cannot really afford mistakes. And the problem with LLMs, or machine learning models, is that they're about probabilities, and you cannot afford making mistakes here: you upgrade to the next version and everything goes down. So our learning from this prototype is that we should avoid asking LLMs to give insights or advice on tasks that need precise data, especially when filing taxes, unless you want to get a call from the IRS. Imagine asking an LLM to file taxes for you. That's crazy. So besides that, a really quick plot twist: if you wanted to do a cluster upgrade check for real, you need to parse the audit logs. That's the best way we have so far. You can check who's calling what, who's calling the deprecated API. And you need to scan the real cluster resources, not only the native ones, so the CRDs too. Get the real changelog, convert it to a JSON or YAML file, and then feed that to a deterministic system that is not an LLM. Ideally, that's what you need to perform a cluster upgrade check. So that brings us to questions wherein we're actually planning on doing something where the LLM doesn't know what the data is about. Like, you're planning on upgrading to 1.29: does it know about the deprecated APIs? Things like that. So I would just like to take a pulse from the audience. If you were to talk to a cluster and ask any questions for any particular cluster operation, what would that be? Any show of hands, any call-outs, any shout-outs? Yeah, I see one over there. So operator vulnerability upgrades is what I hear. Sounds good. Anything else? Resource consumption. Thank you. Anything else? Any show of hands? Any call-outs? Any shout-outs? All right, OK. So this is great, thanks for your inputs. So the point I'm trying to make over here is that everyone's cluster operations will be different. Every use case will be different. Every task will be different. And the LLM may not be able to serve you for every generalized task, or it may be able to serve a generalized task but not be accurate. So that's where we want to draw everyone towards the concept of local LLMs, wherein we don't want to force a particular LLM endpoint, but rather encourage cloud-native engineers, all of us, to adopt the art of training and fine-tuning LLMs for our own needs, and show what that looks like. So what we're going to focus on is the green box over here. This is the LLMnetes conceptual diagram that we talked about a while ago. Now we're going to focus on the green box over here. Moving on, this is what a local LLM might look like. You have a data set; in this case, it can be something like a kubectl data set. And you have a pre-trained model. A pre-trained model can be a LLaMA.
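To make the audit-log approach concrete, here is a minimal sketch of a Kubernetes audit policy that records enough metadata to spot callers of deprecated APIs. The resource groups chosen below are illustrative; the audit.k8s.io/v1 Policy API itself is standard Kubernetes:

```yaml
# Minimal sketch: log request metadata for a few workload APIs, then
# filter the resulting audit events offline for deprecated groupVersions
# (e.g. a requestURI containing /apis/batch/v1beta1/) to find out who is
# still calling them before you upgrade.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    resources:
      - group: "batch"
        resources: ["cronjobs"]
      - group: "policy"
        resources: ["podsecuritypolicies"]
  # Log nothing else, to keep audit volume manageable.
  - level: None
```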
Yeah, so this is what your data set looks like. If you move to the next slide, thank you. A pre-trained model can be a LLaMA, Mistral, Pythia, anything of that sort, an openly available model. You take that model, you take your data set, and pass them through something called Axolotl, wherein you fine-tune that model for your particular need. In this case, it's a kubectl data set, mapping English to kubectl commands. That's it. Axolotl is a project from the OpenAccess AI Collective which helps you fine-tune models. And we have done an extensive talk on this: the talk that Amin called out earlier in this session, about cluster operations as a service, which we gave in Chicago at Kubernetes AI Day, wherein we actually talk about what it means to fine-tune a model, what it means to take a data set, pass it through a model, and actually run inference on it. All right, so now you've taken the data set, you have a model, you have fine-tuned it, and you have your shiny new fine-tuned model available over here in blue. This model can eventually have, at the heart of it, maybe a PyTorch framework or a TensorFlow framework; it depends on what the pre-trained model was trained on. So to really embed this model in your controller, you need something called a unified abstraction layer, which ONNX offers you: a framework-independent format for machine learning and deep learning models through which you can transfer weights. You can convert your PyTorch model to an ONNX model, you can convert your TensorFlow model to an ONNX model, and then run inference on it. So this is the ONNX model which eventually gets embedded into your local LLM, which you can use. So for the resource utilization use case that you mentioned, you can actually train your local LLM for that particular use case, or for some operator vulnerability use case you can train your local LLM for that. The demo that we showed you was a framework for how you can embed all of these things in your cluster, but at the heart of it, you can bring your own model, you can bring your own LLM to the table. If you go to the slide before this: the tricky part over here is that your LLM may not give you accurate data every time. So in that case, whatever data it spits out, if it's inaccurate, you actually tell the LLM that it's inaccurate. You help the LLM learn. And that is when you pass all of those failing validation test cases back into your data set, over here in blue, as we have shown. And the cool kids call all of this MLOps. So this is another concept that we wanted to talk about and really focus on. We tried to do a demo of ONNX with a local model, and this is what the computer looked like when we tried to run it on a local machine, not a cloud machine. Funny story: we wanted to prepare a demo for this. Rajas sent me the weights maybe four days ago, something like that, and he was like, hey, they're on a drive, go and pick them up and run them with ONNX. I was like, yeah, I'll do it for sure. And I executed the model, and maybe two minutes later my laptop froze, it couldn't really function. And I was like, hmm, interesting. And then I remembered that the weights were like 60 gigabytes. So I was trying to load 60 gigabytes of weights into memory. The other interesting thing is that everything went to swap, but it still slowed the laptop a lot. Yeah, so the point is, we're still working on this. If you happen to be at Salt Lake City, chances are we may get local LLMs working over there. So stay tuned.
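For flavor, an Axolotl fine-tuning run is driven by a YAML config roughly along these lines. The model name, dataset path, and hyperparameters below are illustrative assumptions, not the exact config from the talk:

```yaml
# Illustrative Axolotl-style config for fine-tuning an open model on an
# English-to-kubectl instruction data set (values are assumptions).
base_model: mistralai/Mistral-7B-v0.1   # openly available pre-trained model
datasets:
  - path: ./kubectl-dataset.jsonl       # "plain English -> kubectl command" pairs
    type: alpaca                        # instruction/response prompt format
adapter: qlora                          # parameter-efficient fine-tuning
output_dir: ./kubectl-llm               # where the fine-tuned weights land
num_epochs: 3
learning_rate: 0.0002
```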
OK, with great power comes great responsibility. We've heard it a ton of times. But what next? What are we actually talking about, right? Next slide. We're talking about how responsible AI is the need of the hour. Explainability alone is not enough. Telling you what's wrong with your cluster is not enough; telling you what to do next in order to fix your cluster is the need of the hour. That is what we need to focus on. We need to focus on training our models and embedding them into Kubernetes controllers so that we get actionable outcomes, like the next steps to look forward to. The other thing is that you can't rely on your LLM completely. You give admin access to your LLM and it tells you to delete all of the pods, delete all of the namespaces; that's not really going to work out. So you should be able to disagree with the outcome of your LLM. And doing that as a human is not enough: you should be able to embed that in the tool that you bring. So the ability to disagree with the output of your LLM model, and then embedding that in the tool, is again something that leads us to responsible AI. So to basically cover all of these topics, we're coming to how we can have control over the data and the model execution, and then also think about security. Right now, we're taking one data set, taking one LLM, passing it through multiple filters, and then running it. We just can't run something like that in production. We need software supply chain principles. We need attestation. We need SHAs to validate that this data set came with this particular digest, that this model came with this particular digest. So I'm basically talking about how we can cloud-nativize the artifacts of AI. And all in all, LLMnetes is just a vehicle. It gives you a tool, the vehicle; you define the path where you want to go with your use case. What are the problems that you're going to solve? How accurate do you want the outcome to be? How much of an impact does the outcome have on your particular problems? So LLMnetes doesn't solve all of your problems. LLMs don't solve all of your problems. But this is a framework that you can embed. You can bring your own LLMs and then build nuanced use cases, add many more validation test cases, add much more testing, and then holistically solve your problem. So LLMs, LLMnetes, AI: these are more of an auxiliary, more of a tool that helps you. But at the end of the day, you need humans to vouch for all of these outcomes. Next steps. So what do we want to build next? We wanted to do this for this talk, but it was too big of a task. Our next goal is to have one LLM and basically embed two personas in it, where one is trying to fix the cluster and the other one is trying to destroy it. And what we want to do is: hey, we have data, we can have two LLMs take action on the cluster, break it and then try to fix it, log all those actions, and then feed that data back to the LLMs, to increase those LLMs' learning and their accuracy towards something that is very specific to Kubernetes. So yeah, hopefully Salt Lake City. Salt Lake City, Easter eggs, we don't know. But the point is, you can have some sort of reinforcement learning where one LLM teaches the other LLM how to fix a problem really well. I don't know, Salt Lake City, Easter eggs, let's see. With that, we come to an end. Thank you so much for attending. And we are up for questions now. One last thing: LLMnetes is open source.
So if you want to read the code, if you want to play with it, if you want to contribute, if you want to deploy it in a cluster, please go ahead and do it. We also like taking PRs. Thank you, folks. Sounds good. I see one question over there. You've got a mic. I see a couple of questions over here. OK. I have a couple of questions. The first is, do you plan to implement Bedrock as a back end? Sorry, I didn't get that. Do you plan to implement Bedrock as a back end? We don't have Bedrock on our roadmap as of now, but that's surely something we can think about. So if you would be open to it, you can create an issue on our repository. It's llmnetes/llmnetes. Thank you. And the second question is, do you plan to propose this as a sandbox project? That's something we've been thinking about. We've been looking for more contributors who can come and contribute to this project, and then we can build it to a level where we can donate it to CNCF as a sandbox project. Thanks for your questions. Thank you. So it's not deterministic, it's resource-intensive, it sometimes can give you, not wrong, but destructive answers and lead you down an evil path. Who might it be useful to at the moment: the Kubernetes expert, or someone who is a novice and might try to do something they didn't think of? So I would say this is useful to everyone. It cuts across domains. Someone who's early in their career and wants to play with this can try it to see what Kubernetes configuration looks like, or to play with AI; this can be a useful tool for them. The others could be SREs or cluster admins, people who actually work with their clusters all day long: what would it mean to simplify their cluster operations, what would it mean to make this better? We're not saying use it as is, but what are the challenges over here? This is more of a futuristic use case. As you mentioned, it's resource-intensive today, but we don't know how it's going to turn out in maybe a year or two. So we're making sure that the road is built, and then we can run faster over it. But thanks for that question. Hello? Just one thing. So regarding cluster upgrades, we know that it's something we do not advise using this for. However, for chaos simulation, we think it's a great tool. You want to break your cluster in spectacular ways; maybe you don't think about some edge case that the LLM is going to see. So maybe for deterministic tasks you don't want to use it, but for chaos simulation, why not? I think we're going to bet more on that route in the future. Right now, we're just exploring all the areas, seeing what the output is, and presenting that. True, but finding real use cases for it now is great. It's exciting, like chaos engineering. Thank you. Thank you so much. I see a couple of questions from here. Yeah. So first of all, thank you guys for the presentation. And I just have a question about two things. When I hear the term AI, I expect a lot from it. So when you are talking about cluster upgrades, that your LLM thing will really help us out with smooth cluster upgrades: how is this different from the other solutions out there? For example, you're working with AWS EKS, and EKS already introduced a very good feature, upgrade insights. And now you really see things in the UI, like, OK, these are the deprecated APIs, and you can just check there whether it is safe to upgrade or not. So the first thing is, how is this LLM useful there, and should we ignore the other solutions?
Another thing is, when you were showing the storage demo, you said the pod was pending: will this LLM help us identify the root cause? Or will it just be a generic message, like, OK, there are three or four options you can check, and definitely, in my opinion, you can check whether the storage class is there, et cetera? Will this LLM point you exactly at, OK, this is the problem in your cluster and you just need to change it? It can be anything. Yeah, so that's it. I'll take question one, you take question two. So for question one, fun fact, I worked on the cluster upgrade feature in EKS. I would say something: LLMnetes is fun. It's not something I would use in production. In EKS, behind the scenes, we have a deterministic system that cannot make a mistake, theoretically speaking. If it executes, it's going to tell you whether something is deprecated or not. And that's what you need for a production system. However, if you're playing with an EKS cluster or a kind cluster, I'm like, go ahead and play with LLMnetes. Just don't do that for production clusters. Yeah, thanks for that. And to answer your second question, it's exactly what you mentioned: it's not just telling you these are the things you can try, but actually auditing your cluster for actual problems and then giving you actionable next steps, these are the things that you can do. It's not analyzing it from an if-else perspective, but actually taking data from your cluster and then auditing it, right? Does that make sense? Sounds good. I think we can take more questions in the hallway, and we should head there for now. Thank you. Thank you, everyone. Yeah, thank you all. If you see us in the hallway, please stop us. We're super happy to chat about this. Yeah, we'll be in the hallway. Thank you so much.