We'll have two full sessions, and then three lightning talks. If you're giving a lightning talk and we haven't seen you yet, please come to the front so that we can switch quickly. To start this second part, we'll be hearing about cluster operations as a service, introducing LLM-backed controllers, with Rajas from VMware and Amine from AWS. So thank you very much.

Thank you, Ricardo. Am I audible in the last row? Welcome, everyone. We're going to talk about cluster operations as a service. You've talked to Siri and to all of these voice assistants before. Today, we're going to show you how you can talk to Kubernetes. I'm Rajas. I work at VMware as a senior member of technical staff. I'm also a tech lead of TAG Runtime in the CNCF and a contributor to Kubernetes. And I have Amine with me.

Hey, my name is Amine. I work for AWS, and I mostly spend my time doing open source and controllers. Thank you for coming today.

So first, why should you listen to us? Well, actually, you shouldn't, because even our own code laughs at our jokes. But we're going to justify why you should still listen to us. To get started, there are some myths around deep learning. This is a screenshot from course.fast.ai by Jeremy Howard, where Jeremy talks about a top-down, code-first approach to deep learning. First of all, maybe we should start the conversation by not treating AI as a black box. The myth that we need huge amounts of data to train a deep learning model or a large language model is not true. We certainly don't need to be PhDs to train models; as domain experts, we can still do that. And as we've already seen with ChatGPT, deep learning is not just relevant for vision anymore. We do need GPUs to train models today, but we don't need an AI lab. And it may not be a brain that you want to build; we just want something that helps us get our job done.

So to get started, we're going to talk about LLMs, or large language models, and then move on to the Kubernetes side of things. What we're going to talk about today is how the gap between domain experts and deep learning can be bridged by acquiring skills that improve our efficiency at our own tasks. We need to have a conversation about how we treat data, and not treat AI models as just an API that we hit for inference. We should also be considerate about the biases involved in the data and the ethics around the data used for training the model; we're going to address that point as well. And most importantly, we should apply some guidance to the inference done by the large language model and not completely rely on its output. What we're not going to talk about today is how ChatGPT solves all of our problems.

So what are LLMs? Well, we're not going to go through all of these definitions; in the spirit of natural language processing, we'll walk through this quickly by parsing it in English. An LLM is a model that predicts the next set of words or sentences given a prompt. It does this using a set of tokens behind the scenes. So what are tokens? Consider a string like this. If you tokenize it using the tokenizer used for a model — here we've used the one for davinci — you get a set of tokens. These are just numbers that the model understands. How do you interpret these numbers? If you decode them back, you get words, or pieces of words.
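To make the token idea concrete, here's a minimal sketch using OpenAI's tiktoken library. The example string and the encoding choice are assumptions for illustration (the slide used a davinci-style tokenizer); the point is just that text becomes a list of integers, and decoding maps them back to sub-word pieces rather than whole words.

```python
# Minimal tokenization sketch with tiktoken (pip install tiktoken).
# "p50k_base" is an assumption here, standing in for a davinci-style encoding.
import tiktoken

enc = tiktoken.get_encoding("p50k_base")

text = "Create three nginx pods that serve traffic on port 80"
tokens = enc.encode(text)            # the list of integers the model actually sees
print(tokens)                        # exact ids depend on the encoding

# Decoding maps the ids back into text fragments ("sub-words"), not whole words.
print([enc.decode([t]) for t in tokens])
```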
That's how a large language model parses words and sentences behind the scenes. At the heart of it, it's just a deep neural network. And what is a neural network? A neural network is just a mathematical function that is completely flexible and has billions of parameters you can tune. Initially it knows nothing, and then we can teach it to do basically anything, using an algorithm called stochastic gradient descent, by showing it examples of what we want it to learn. That's what a neural network is.

Now that we've talked about LLMs, if you want to play around with them, there is a site called nat.dev where, given a prompt, you can play with multiple language models. Here's one where I played with davinci, and this is what the language model predicted. A lot of the content we've covered in this presentation can be found in the course by Jeremy Howard, A Hackers' Guide to Language Models. It's a code-first approach, so if you want to hack around with training, inference, and fine-tuning, you should check it out.

Now that we've touched on large language models, let's move to the Kubernetes side of things. As you all know, everything in Kubernetes is driven by a controller. Whenever you create a namespace, a deployment, even Secrets and ConfigMaps, behind the scenes there is a controller trying to manage the state of your application. Say you create a Deployment with five replicas: behind the scenes, the Deployment controller, the ReplicaSet controller, and the pod controller all work together to make your application available. A controller is a binary running what we call the reconciliation loop: it observes the desired state and the current state, and takes actions to move the current state towards the desired state.

So Rajas and I, for the last few weeks, have been working on this concept. We were like, hey, we have LLMs and controllers, why not call it llmnetes? So we created this controller called llmnetes, and we're here to demonstrate how it works. For example, instead of creating a pod or a deployment, you can do something like this. You can ignore the API version, it's a little bit random, but you can, for example, have a kind called Command, and in the spec, instead of specifying the number of replicas or the image, you can say: hey, create three nginx pods that will serve traffic on port 80. Behind the scenes, a controller parses that query, does some magic, and creates the pods for you. Other things it can do, for example, are cluster audits: hey, I want you to go and audit my cluster, look at all the pods and the services in all the namespaces, and tell me why things are working, or why they aren't, actually. Another one is chaos simulation. For example, I want the controller to come up with new ideas for breaking my cluster: hey, I think I'm a very good Kubernetes user, I trust my application and everything in my cluster, go ahead and try to find new ways of breaking it, doing what we call chaos engineering.

Now, on to the demo. We're going to show you one of the CRDs we worked on. At the bottom here, we have the logs of our controller already running. Can you all see this? Is it big enough? Awesome. All right.
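To make the reconciliation-loop idea above concrete, here's a minimal, self-contained sketch. Every function in it is a hypothetical stand-in (real controllers are event-driven and talk to the Kubernetes API rather than sleeping in a loop), but the shape — observe, compare, act, repeat — is the same loop the Deployment and ReplicaSet controllers run.

```python
# Minimal sketch of a reconciliation loop. The stub functions below are
# hypothetical stand-ins, not llmnetes or any real controller code.
import time

def get_desired_replicas() -> int:
    return 5                      # e.g. spec.replicas from a Deployment

def get_current_replicas() -> int:
    return 3                      # e.g. number of Ready pods observed

def create_pod() -> None:
    print("creating a pod...")

def delete_pod() -> None:
    print("deleting a pod...")

def reconcile() -> None:
    desired, current = get_desired_replicas(), get_current_replicas()
    if current < desired:
        for _ in range(desired - current):
            create_pod()
    elif current > desired:
        for _ in range(current - desired):
            delete_pod()

while True:
    reconcile()
    time.sleep(10)                # real controllers are event-driven with periodic resyncs
```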
So we have some examples today, and the one we want to show off is the command exec. For example, we have this one, which says: deploy a cron job that will delete a pod randomly in my cluster every two hours; use the batch/v1 API and use the k8s/kubectl image. Let me check that I don't have anything here yet. Okay, I have my cluster, and my controller is running the query; we can see that in the logs. Now we wait for the magic behind the scenes to take action and see what really happened. All right. We can see here, for example, that the controller took an input and gave us an output, and in the output there is a YAML file, there is a command, and there is also an explanation of what happened. And if I do kubectl get cronjobs, I can see that 18 seconds ago a cron job started running in my cluster, scheduled every two hours in cron syntax. And if I do kubectl get cronjobs -o yaml, we can see what it does: it runs kubectl delete pod against one of the pods, picked at random by shuffling the list. And it's using the image, it should be k8s/kubectl, yeah, this one. So yeah, this is one of the CRDs we have here. There are a lot of other CRDs we could demonstrate, but for the sake of time we're not going to show everything. There's the chaos simulation, there's the cluster audit that we mentioned, and another one is Command, which is very simple. So yeah, that's it for the demo. What do we have next?

What we just saw was a controller backed by an LLM. The model we used was GPT-3.5, so we were just hitting the OpenAI API behind the scenes in the controller. But we don't necessarily have to do that. We can extend this and take it to the next level by not treating AI as a black box. For that, we can do something called fine-tuning. Here's a snapshot of the ULMFiT three-step approach, the algorithm that's behind a lot of today's large language models. What's shown here is a plain language model, which is very capable of predicting the next set of tokens or sentences. ULMFiT was trained on Wikipedia; a lot of the large language models accessible today are trained on data from across the internet. So this model, this neural network, is very capable of understanding the constructs of grammar and language, and it knows the context behind historical data, geographical data, math equations, and so on. We can take it and fine-tune it by showing it data that's closer to our task; in this case, data closer to Kubernetes configuration, closer to kubectl commands, and so on. That's the fine-tuning aspect of it. And then we can use something called classifier fine-tuning, where for whatever the model predicts, we use another LLM to tell it whether the label was right or wrong, or we use human intervention. In this way, we have control over the inference done by the model. We can also look at the false positives, the cases where the model gave wrong answers, feed those back into fine-tuning, and try to get the right answers out of it, or at least teach it the right answers.
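Going back to the cron job from the demo for a moment: the generated job shells out to kubectl, but just to illustrate what "delete a random pod" boils down to, here's a rough equivalent of that step using the official Kubernetes Python client. This is an illustrative sketch, not the manifest the controller actually generated; the namespace choice is an assumption.

```python
# Rough equivalent of the demo's "delete a random pod" step, using the official
# Kubernetes Python client (pip install kubernetes). The demo's generated
# CronJob shells out to kubectl instead; the namespace here is an assumption.
import random
from kubernetes import client, config

config.load_kube_config()            # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

namespace = "default"
pods = v1.list_namespaced_pod(namespace).items
if pods:
    victim = random.choice(pods)     # pick one pod at random, as in the demo
    print(f"deleting pod {victim.metadata.name}")
    v1.delete_namespaced_pod(name=victim.metadata.name, namespace=namespace)
```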
So how do we take fine-tuning on as a skill? Here's what we did. We tried out multiple experiments fine-tuning openly accessible models using a dataset called k8s-kubectl by ComponentSoft, which is available on Hugging Face. Hugging Face is a collection of datasets, models, MLOps pipelines, and so on. This is what the dataset looks like. If you take a closer look, you can parse the dataset: it has around 35K rows and a handful of columns. There's a column called objective, which sets the context for the row, in this case to print the address of the control plane and cluster services. The other columns we're interested in are command, which is the actual kubectl command to be generated, and question, where we give the model an instruction to generate a particular command.

So how do we train on this? We used something called Axolotl, from the OpenAccess AI Collective. Axolotl is a tool where you can fine-tune models using a YAML configuration. It supports fine-tuning many large language models, different fine-tuning techniques, and so on, and it ships configurations for multiple model families, so you can see Llama, Pythia, and so forth. We went ahead with Llama 2. Talking about GPUs, we used eight A10 GPUs in this case, but you can very well do it on a single GPU too. So how do you train? You run a single axolotl launch command and give it the YAML configuration. That's it, that's how you fine-tune the model. The configuration has links to where to fetch the dataset from and where to store the model. Here we store the model using something called QLoRA: Q is for quantized, and LoRA is how we make such a huge language model space-efficient. And then there's a set of hyperparameters you can tune. We kept them at the defaults at first, ran out of GPU memory, and training took a long time, so we tweaked things: we set the sequence length to 512 instead of 1024, and we set the number of epochs, which is the number of times the model goes through the dataset, to just one. We didn't tweak anything else, and we got pretty good results.

The skill a domain expert can pick up here is how to read what a training run looks like. Here's a snapshot of one. You get the training loss, and the idea is to minimize it; here we went from 0.5 to 0.09. But you don't want to minimize it to the point that you overfit the model. When I say overfit, I mean the model only understands the data it was given during training and doesn't generalize to other data later on; we want to avoid that, while still making sure the model is well versed in the data it was trained on. While doing this, you can tweak the hyperparameters: the learning rate, the weight decay, the optimizer, and all the other aspects of a neural network. We don't have to go into the details right now, but the point we're trying to make is that as domain experts, we can get into the fine details as and when we need to, experiment, and see how the model behaves. We don't have to learn all of this first and only then get to fine-tuning a model.
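For reference, here's roughly how you could pull up a dataset like the one described above with the Hugging Face datasets library and turn a row into a training prompt. The exact dataset id below is an assumption based on the talk; the column names (objective, question, command) are the ones described on the slide.

```python
# Loading a kubectl instruction dataset and building a prompt from one row.
# "ComponentSoft/k8s-kubectl" is an assumed dataset id, following the talk.
from datasets import load_dataset

ds = load_dataset("ComponentSoft/k8s-kubectl", split="train")
print(ds.num_rows)                  # roughly 35K rows, per the talk

example = ds[0]
prompt = (
    f"### Context:\n{example['objective']}\n\n"
    f"### Question:\n{example['question']}\n\n"
    f"### Answer:\n{example['command']}"
)
print(prompt)
```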
This experimentation is a lot like the reconciliation loop in Kubernetes: we know the desired state, we check the current state by running inference, and we move the current state towards the desired one by tweaking the training parameters.

Okay, the other thing we want to talk about is the prompt that is given to a model. In this case it has a context, a question, and an answer. The prompt format is very relevant for any large language model, because it's tied to how the model was trained; you can find it for models on, say, Hugging Face. This is the prompt that we used, this is an example of the training data, and this is how it looks when we push it into the prompt.

Here's an example with TinyLlama. We tried multiple experiments; TinyLlama has 1.1 billion parameters, and a smaller model takes less time to train. So we trained it, and here we're loading the model we trained and running inference on it. The objective is to create a job, and this is the question: create a job called log-processing-job using the redis image and run the command redis-server --version in the created job. When we run inference, this is what the model predicts. Everything is more or less right, except that the command it runs is redis and not redis-server.

So we extended this to Llama 2 with 7 billion parameters and trained it; here we're running inference on it, and this is what it looks like. The objective is to create a cron job, and we pass in: create a cron job named metrics-collection using the golang image; the cron job should run every minute and execute the command go run hello.go. When we run inference on it, the result is pretty close to what the kubectl command should look like.

We can now play around with it. If anyone in the audience has a suggestion for what we should feed to the model for inference, go ahead. We're running a custom one here to create a pod, but any suggestion from the audience? Nothing? Okay, let me write something, say a service: create a service of type LoadBalancer and expose port 80. Let's try that; I don't know if it will work. Or: create a pod and expose it as a service. Create a pod, do you have a suggestion for the name? Let's call it yolo, using the nginx image, and expose it as a service of type LoadBalancer on port 80. Sounds good. I don't know whether this will work or not, but let's give it a shot. This is kind of sort of right, almost there. But that's the point: the model may not give you an accurate answer every time. This was trained on 35K rows, 35K kubectl commands; it doesn't have context for everything under the sun. The point we're trying to make is that when the output is not what we expected, we can take it and feed it back into fine-tuning: whatever you predicted was almost there, but you could have done better. That's the point we want to make about fine-tuning. And the output the model gives is something that we as domain experts can parse; it's not something every LLM researcher necessarily knows about. This is where you need human intervention from the domain experts.
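Here's a minimal sketch of what that inference step could look like with the transformers and peft libraries: load the base Llama 2 7B model, attach the QLoRA adapter produced by the fine-tune, and generate a completion for a prompt in the context/question/answer format. The adapter path and prompt template are assumptions; this is not the exact demo code.

```python
# Inference sketch for a QLoRA-fine-tuned Llama 2 model. The adapter path and
# prompt template are illustrative assumptions, not the actual demo setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "meta-llama/Llama-2-7b-hf"
adapter = "./qlora-out"                        # where the fine-tune saved the LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

prompt = (
    "### Context:\nCreate a cron job.\n\n"
    "### Question:\nCreate a cron job named metrics-collection using the golang image. "
    "It should run every minute and execute the command go run hello.go.\n\n"
    "### Answer:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```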
You can extend the domain-expert point with an analogy Jeremy Howard also makes in one of his courses: when the internet came out, you had to know a lot about networking in general just to get a website to open in a browser, just to set up your router and modem and so on. Twenty years down the line, we're building businesses on the internet without knowing anything about the TCP/IP stack. In the same way, we should be able to take cloud native to a point where we use AI for cloud native and cloud native for AI, and bridge the gap between domain expertise and the deep learning skill set.

When we showed this to some of our colleagues, they said: are we going to talk to Kubernetes now, is that the next step? And we thought, why not? So we'll try another demo. I don't know if this will work; we haven't tested it. Let's see if we can get this working, where we'll try to literally talk to Kubernetes. This is the one where we need the demo gods the most. I think we have an 80% success rate. Let's go. Before we do, does the audience have any suggestion? Otherwise I'll just go with the LoadBalancer service; that's the only thing that comes to mind right now. Okay, I'll go with the LoadBalancer service for now; I'm not going to push the demo gods any further. Create a service called load balancer and expose port 80. It created something; let's see what it did. It's still running the query here, we're not done yet. We're still waiting for the query to run in the background. We were worried about the echo; we don't know if this will work. Looks like something came up. Not yet. Well, it says something. Okay, okay. So this is the manifest that it created. Can you show the manifest? Kind of sort of there? Maybe not the best manifest, but the point is it still created a manifest, even with all the noise around how I talked to it, the echo, and everything running in the background. That's one of the things I was worried about.

So how do you get involved? The project is called llmnetes. You can reach out to us, send us PRs, and open issues. You can tell us how you fine-tune your models and whether there's anything we can add to this operator. As of now, in the chaos simulation, we're still working on how to extend the level of chaos. We're looking at upgrades, not the control plane yet, but trying to see how we can upgrade Kubernetes clusters and whether that will work out or not. We're also trying to see if we can do CVE scans and things like that. There's also a Working Group on AI being formed in the CNCF, alongside TAG Runtime and TAG Observability. Here's the issue where the charter is being discussed, so if you have thoughts around it, if you want to talk about this, feel free to chime in on the issue. Thank you for making it to this talk and listening to us. All right.

One last thing to add: everything you saw today was made by a controller behind the scenes, and the controller is logging everything, what worked and what didn't. So we have a dataset of all the commands that worked and the commands that didn't, and we can continuously teach the model new things and let it learn from its mistakes. Right now we have maybe one week of training data, but if we do this for a full year, maybe we can do a control plane upgrade, why not? So there's a V2 coming up; that's what we're foreshadowing here. Any questions? Yes. We have time for some questions. If you have questions, there's a microphone in the middle.
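As a rough sketch of the logging idea mentioned just above — recording each instruction, the generated output, and whether it worked, so the examples can feed a later fine-tuning round — here's what that could look like. The file name and record shape are assumptions, not the controller's actual format.

```python
# Sketch of the feedback loop: log each instruction, the LLM's output, and a
# success flag for later fine-tuning. File name and fields are assumptions.
import json
from datetime import datetime, timezone

FEEDBACK_LOG = "llm_feedback.jsonl"

def record_outcome(instruction: str, generated: str, succeeded: bool) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "instruction": instruction,
        "generated": generated,
        "succeeded": succeeded,          # label for the next fine-tuning round
    }
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

record_outcome(
    "Create a service of type LoadBalancer and expose port 80",
    "apiVersion: v1\nkind: Service\n...",
    succeeded=True,
)
```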
There's one coming, I think. And could the two speakers coming next come forward, so we can speed things up. Yeah, go ahead.

Cool, so we saw some kubectl commands. What about generating YAML?

So the fine-tuning was done using kubectl commands, but the controller also generates configuration around it. Do you want to talk about the Command part? For example, take the command exec example we saw before, this one. Here the command is to deploy a cron job using the batch/v1 API, but behind the scenes it's not only a kubectl command. The only kubectl command is the one that applies a manifest, and you can see that here: the output of the model contains YAML file content, with an API version, a kind, and so on. Sorry, not this one, it must be the one before, it's this one. Here you can see we have a CronJob YAML file, so it's not just a single kubectl command. And here, at some point, we have kubectl apply, yes, this one. The command the controller ran behind the scenes is kubectl apply on this file; that's why there's /tmp in the path, we write the manifest there and then call kubectl apply. It's just a prototype, we could do something better, but yeah, it works like that. Thank you, thanks for the question.

Sorry, I didn't get that, upgrade the controller too? So what we're planning is to make sure the controller can talk to any fine-tuned model that we embed into it, so the model is trained on our data and already has the right context and instruction format built in. When we're talking to the controller, we're not giving it context and all of that; we're just giving it the instruction. Does that answer your question?

A usage scenario: let's say it's three o'clock in the morning, you get paged, and the application developers say the application is acting funny. Can you do an analysis of the cluster, is everything healthy, are there any anomalies? Is that something the application team could query through this chatbot, if you will, and maybe quarantine a node if it finds a bad node, that kind of thing?

Yep, there's a CR we've introduced called cluster audit; as of now we cover pods and services. Do you want to talk about it? Yeah, so there's this CR we created called cluster audit, and basically you give it a set of resources that you want to watch. For example, here I'm saying: hey, I want you to audit the pods and the services, but you could also add CRDs and nodes and things like that. And cluster audit is not only there to find problems; it can also explain why everything is working fine, or tell you ways to improve your application. So let's say you get paged at 3 a.m. and there's a problem: you run a cluster audit, it goes and watches the status and the conditions of each resource, and tries to understand where the problem really is, whether one of the deployments or resources is missing, or maybe a node doesn't have enough storage. All those things you can see in the conditions, so the controller is basically looking at the conditions to understand what's happening. There's also another project called k8sgpt that analyzes the cluster; you should check that out too.
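To make the earlier answer about YAML generation a bit more concrete, here's a minimal sketch of the flow described: ask the model for a manifest, write it to a file under /tmp, and run kubectl apply on it. The prompt, the GPT-3.5 call through the OpenAI Python SDK, and the function name are illustrative assumptions rather than the actual llmnetes code.

```python
# Sketch of the "generate a manifest, write it to /tmp, kubectl apply" flow.
# Prompt wording, file path, and function name are illustrative assumptions.
import subprocess

from openai import OpenAI      # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

def handle_command(spec_text: str) -> str:
    prompt = (
        "You are a Kubernetes assistant. Return only a YAML manifest for: "
        + spec_text
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    manifest = resp.choices[0].message.content

    path = "/tmp/llm-generated.yaml"       # write the manifest, then apply it
    with open(path, "w") as f:
        f.write(manifest)
    subprocess.run(["kubectl", "apply", "-f", path], check=True)
    return manifest

handle_command(
    "Deploy a CronJob that deletes a random pod every two hours, "
    "using the batch/v1 API."
)
```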
We had a link in the opening remarks for some of these projects. Yeah. So in the interest of time, maybe we'll take other questions in the hallway. If you have any feedback, please scan this QR code and leave feedback on Sched for us. Thank you very much. All right. Thank you.