Welcome to the second episode of ChatLoopBackOff. It is February 15th, 2024. This time it is bright and snowy outside; we got a little storm that just came through and dumped a bunch of snow in a really narrow band. It's also chilly, about 26 Fahrenheit or minus 3 Celsius, but that's still actually up from, I think it was 15 Fahrenheit, which would be like minus nine and a half Celsius. So this program deep-dives into the realms of the cloud native landscape. I'm your host, Jeffrey Sica, but many of you already know me as Jeefy. I'm thrilled to have you here this time because today we are focusing on K8sGPT. It's a newly accepted Sandbox project in the Cloud Native Computing Foundation. This livestream is a sibling to the similarly named ClashLoopBackOff, which is an event we put on at different KubeCons. In that event, we pit two community members against each other to accomplish a technical goal of my choosing, and it's meant to be a challenge, right? It's meant to be laid back for the audience, not for the participants, but it's generally fun. This stream is more of a self-induced version of ClashLoopBackOff, except it's just me. I look at different projects or techniques or ideas in the cloud native landscape that I haven't done much in-depth research into, and we sit here and learn them together. I like diving into topics with fresh eyes, and this is how I would normally learn if I were doing this off stream. So by doing this in the open, hopefully we can all walk away having learned something new. A few housekeeping items before we get started. During the stream, feel free to drop questions in chat; I'll get to as many as possible. We are here to try and learn together, so if you know the answer to something, or if I'm stuck and you have an idea for me to try, please speak up. We did that in the last episode and it worked great; I definitely missed some things and people helped me out.
All that said, this is an official livestream of the CNCF and is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would violate the Code of Conduct. Basically, respect all of your fellow participants, please respect me, don't be a jerk. This video will be put out onto our YouTube channel afterwards so folks that couldn't make it to the livestream can still learn along with us, just asynchronously. Before we get started diving into LLMs, there are a couple pieces of news I'd like to highlight from the last week in our little corner of the world. And these links, actually, I will show this afterwards, but we have a GitHub repo where I track any of the demos or commands that I ran and utilities that were downloaded. All of that is put into a GitHub repo, as well as a summary of what we did. All of these links will be in there too, so don't worry, you're not missing anything. Also a little bit of pseudo-personal news: there is gonna be a cat somewhere here, because we let our cat downstairs and she gets free rein of the house, and she's more important than this, no offense. She will be around, she'll probably push the mic over, and it'll be a good old time. All right, news time. There have been a couple really solid blog posts published on the CNCF blog. I promise I don't just shill CNCF news, but these have been really good blog posts. One of them was a deep dive into policy as code in the software supply chain. What they were really diving into was using things like Kyverno to enforce policies, but that's kind of, I don't wanna say basic, but that's what you would expect. What they really pushed for is how to define policies in the business landscape and then write those policies as machine-readable code, essentially, so that the business policies can be enforced.
They emphasize a lot about codifying policies and mainly leveraging the open source tools that we have out there for both implementation and evaluation of policies. So I will make sure that link gets dropped somewhere. Similarly, I should have mentioned the policy-as-code blog post was written by one of our TAGs, or Technical Advisory Groups; that was TAG Security. TAG Storage also wrote a blog post highlighting cloud native disaster recovery and how to do that for stateful workloads. In that, they outlined this concept of CNDR, or cloud native disaster recovery, and they talk about how to essentially track, monitor and evaluate how good you are at disaster recovery, and, again, going back to the policy thing, policies and techniques to try and prevent stateful workloads from just completely getting torpedoed, assuming you lose everything in your data center or your cloud. Last but not least, yesterday was Valentine's Day. It was also Kubernetes patch day. So it was patch release day: all of the currently supported Kubernetes versions had patch releases cut. Please check the release notes, see what's been updated, see if there's anything important that you need to update for. So that's that. Let's now look into what K8sGPT actually is and what it does, and there is a small chance, depending on what happens, that I might actually install it on my home lab and we'll get to mess around with that live. So let me move that there, let me move that here, and let's share my screen. Screen one, boom. All right, that looks good. Okay, does that look good? I'm gonna say that looks good. I might make it a little bigger. Yep, okay, cool. Like I mentioned, each episode winds up being a folder under ChatLoopBackOff, and actually, while I'm here, I will show that really quick: cncf, C-L-B-O.
So under the episodes, we'll have a little README, and it'll link to the YouTube video after the fact, the news that we've talked about, and then a summary of what I've done and any relevant links. Feels like that's the right way to do this. Also any artifacts that came out of it; these were the YAML files that we used to evaluate Gateway API and actually install Gateway API. So we're gonna do something similar here. Ahead of time, all I have done, actually, ahead of time, all I've done is installed k8sgpt, that's it. So how did I do that? Well, on this machine I am running a flavor of Fedora Silverblue called Universal Blue, specifically Bluefin, and it is kind of a container-based desktop. It's really awesome, highly recommend folks look at it. But one of the cool things is it actually ships with brew. So all I did was a brew install k8sgpt. It's gonna auto-update, but boom, k8sgpt is installed. And I guess the latest Homebrew version was built a month ago, cool. So we've got a pretty up-to-date version of k8sgpt. So let's see if they have a getting started guide. Get it now, yes. So what does K8sGPT actually do? Based on what we're reading here and what other people have told me, it uses the power of LLMs and AI, hype cycle, but to analyze workloads within your cluster, to potentially alert if there is something wrong, and to help you debug what is wrong. So if there's something that is clash loop back... wow, I fell into my own trap... something that is CrashLoopBackOff-ing, or something that's just throwing an error, it will help debug things to give you something more actionable than "oh, this pod is erroring out." It'll actually give you a bit more info. Codified SRE knowledge, AI cuts through the noise. I know these folks, they're awesome, so I'm gonna trust it. Getting started guide. So let's ensure k8sgpt is installed; safe to say that's good.
And let's connect to a Kubernetes cluster; there's documentation for setting up a kind cluster. I am all about that, so let's do it. Kind is just one of many methods to create a quick and, I won't even say dirty, but a quick and small Kubernetes cluster. There's k3s, there's minikube, there's kind. There are other projects out there; those tend to be the big three that a lot of people use. I tend to prefer kind for a couple reasons. One, well, three reasons. One, I know the creator and I'm friends with him, so I'm here to support. But that aside, the Kubernetes project itself actually uses kind for its own CI, so it's the closest way, in my opinion, of getting a vanilla open source Kubernetes cluster running. I haven't looked at minikube recently, but I know k3s does some different things in order to make it even smaller, and it does change some things; for example, it's not using etcd, it's using SQLite, among other things. The third main reason why I like using kind is you can take some of the API machinery configs, like kubeadm configs, pipe those into kind, and get a cluster in the shape that you want. Okay, so we have kind-k8sgpt. We also have kind-kind, but I guess kind-kind is an artifact from last week. So we're in kind-k8sgpt. We are in the default namespace. I'll even blow this up a little more just so it's easier to read. Cool. So let's see what's next. View the different commands. I am going to go through this kind of top to bottom, and then we might go off script. Now, authenticate with OpenAI. All right, so this does rely on a third-party service, OpenAI. Like I said, I have done no research. I'm curious if there is a way to get it to run off of a local LLM; that would be kind of interesting, especially if you're hosting your own, even if it's, you know, a seven-billion-parameter model. Okay, so: make sure you've created an account with OpenAI, generate a token from the backend. So, k8sgpt generate, let's see what that does.
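For anyone following along, the kind piece above can be sketched like this. The cluster name and the kubeadm patch are illustrative, not what ran on stream; the point is that kind's own config format lets you embed kubeadm configuration directly.

```shell
# Sketch of "pipe kubeadm configs into kind": kind's v1alpha4 Cluster config
# accepts per-node kubeadmConfigPatches. Values here are illustrative.
cat <<'EOF' > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            v: "4"   # example: bump API server log verbosity
EOF

# Creating the cluster needs Docker or Podman, so it's commented out here:
#   kind create cluster --name k8sgpt --config kind-config.yaml
#   kubectl config current-context   # kind prefixes contexts: kind-k8sgpt
grep -c 'kubeadmConfigPatches' kind-config.yaml
```

The `kind-` context prefix is why the contexts on stream show up as kind-k8sgpt and kind-kind.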
Copy the token for the next step. So I'm gonna wind up generating a new secret key and then putting it in here. So we might have to go a little bit on the other screen. So we're gonna copy this, put that over here. k8sgpt auth add. Okay, so you can specify a backend; hopefully that means what I was saying earlier is possible. AI model, gpt-3.5-turbo. Okay, we should be good. So, current context, get nodes. Okay, and then we need to create a bad pod. So let's make that YAML really quick. So, create a broken pod YAML; the pod has the wrong image tag. Yep, okay. So theoretically we do... oh, right: cncf, CLBO, 15, bad-pod. So the bad pod has been created and it should be throwing ErrImagePull. Let's just make sure that we are all on the same page. ErrImagePull, sweet. So next up it's probably gonna have us analyze this, yep. So let's do k8sgpt analyze and see what happens. "AI not used; --explain not set." Did I miss something? Or is this actually what it should be coming up with? It says this command will generate a list of issues, and in the example a message should be displayed highlighting the problem. Well, let's make sure I didn't screw up adding the OpenAI auth. auth list. Active: OpenAI, okay. Oh, LocalAI, sweet. So that kind of answered my question from earlier: you can actually run some local AIs. Yay for local LLMs. But it says "AI not used; --explain not set." Oh, I'm just dumb and I need to read further. Okay, so it's just dumping out the basic info, intelligently, the basic info of what the error is for the user to interpret. And then if you pass --explain, it will actually pipe the error to an LLM and read it out. So, theoretically: Kubernetes is unable to pull the image due to an error; verify the image name and version are correct. Yeah, we definitely need to do that. We don't have network issues, because if we did, I would not be streaming. And then, if the image isn't available, try a different one.
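A rough replay of the steps above, for the episode notes. The auth and analyze invocations need a live cluster and an OpenAI key, so they're left as comments, and the flag spellings are from memory of the k8sgpt docs, so double-check them locally. Pod and image names are illustrative.

```shell
# Register a backend and model (needs an OpenAI key; spellings are from
# memory of the docs, verify with `k8sgpt auth --help`):
#   k8sgpt auth add --backend openai --model gpt-3.5-turbo
#   k8sgpt auth list

# The broken pod: any tag that doesn't exist will do.
cat <<'EOF' > bad-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: broken-pod
  namespace: default
spec:
  containers:
    - name: broken-pod
      image: nginx:1.a.b.c   # deliberately invalid tag -> ErrImagePull
EOF

# Then, against a live cluster:
#   kubectl apply -f bad-pod.yaml
#   kubectl get pods                 # STATUS: ErrImagePull / ImagePullBackOff
#   k8sgpt analyze                   # plain finding, no LLM call
#   k8sgpt analyze --explain         # pipes the finding through the backend
grep -c 'invalid tag' bad-pod.yaml
```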
Congratulations, you have created a local Kubernetes cluster, deployed a broken pod and analyzed it. That's all I needed. That's all I needed to see. So there's another thing that I'm really interested in: they have an in-cluster operator. Right now, as we're using it, k8sgpt is a tool to help you, a person, an SRE, insert persona here, interact with a cluster by hand to help debug it. Does the in-cluster operator... well, now I hear the cat crying... does the in-cluster operator do any additional magic sauce? That's the question. Using OpenAI, create a secret with the token. Heck, let's just add this into kind really quick. Actually, what I might do is switch over to my home lab and see if something there is angry. kubectl get pods -A, grep -v Running, grep -v Completed. I'm probably making a lot of people upset by typing that. Nope, no angry pods. So let's get back to the kind demo. All right, so let's install this operator onto our kind cluster. So let's add this Helm repo, helm repo update, and then we'll just copy-pasta the whole thing in. It has been added. So it installs into the k8sgpt-operator-system namespace. I need to deploy a secret again, so I'm gonna pop that over quick, just creating a new key to copy-paste. Okay, so the secret has been created, and then we're deploying a K8sGPT resource. So: kind K8sGPT, name k8sgpt-sample, in the k8sgpt-operator-system namespace that we just created. Let's go a little off script. Let's do gpt-4-turbo; I think I'm allowed to use that. OpenAI model list, that's cool. Let's try this one. Secret is done, anonymized false, that's fine. Language is English, that's fine. I'm curious what the CRD actually allows. Sink? Slack? What? Ooh, now we're looking at cool things. So, reference, operator overview, there we go. Can run as a Kubernetes operator inside the cluster, provided... okay. So you can set up in-cluster metrics, and then anything that K8sGPT finds, Prometheus can scrape out and throw an alert.
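The operator path above, sketched out. The chart URL, release name, Secret shape, and the K8sGPT custom resource fields are reconstructed from memory of the operator docs, so treat all of them as assumptions and check the upstream reference before copying; only the file-writing parts run locally.

```shell
# Operator install (needs a cluster, so commented; names are assumptions):
#   helm repo add k8sgpt https://charts.k8sgpt.ai/
#   helm repo update
#   helm install release k8sgpt/k8sgpt-operator \
#     -n k8sgpt-operator-system --create-namespace

# The operator reads the OpenAI key from a Secret rather than the CLI's
# local config. Placeholder key; render to a file, don't paste into history.
OPENAI_TOKEN="sk-example-not-a-real-key"
cat <<EOF > k8sgpt-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: k8sgpt-sample-secret
  namespace: k8sgpt-operator-system
stringData:
  openai-api-key: ${OPENAI_TOKEN}
EOF

# The K8sGPT custom resource as configured on stream (field names are
# reconstructed, not verified against the CRD schema):
cat <<'EOF' > k8sgpt-resource.yaml
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-sample
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true          # allow out-of-cluster calls to the AI backend
    backend: openai
    model: gpt-4-turbo     # the off-script swap from gpt-3.5-turbo
    secret:
      name: k8sgpt-sample-secret
      key: openai-api-key
    anonymized: false
    language: english
EOF
#   kubectl apply -f k8sgpt-secret.yaml -f k8sgpt-resource.yaml
#   kubectl get results -n k8sgpt-operator-system   # the Result CRD output
grep -c 'kind: K8sGPT' k8sgpt-resource.yaml
```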
I think, I feel like I'm getting off track. I should probably just throw this in there and see what happens. I'm definitely coming back for you, though, because if we can have it, thanks, Copilot, if we can have it dump anything weird that it finds into Slack, I kind of like that pattern. I'd be curious if there are other sinks that already exist, but we'll dig into that in a second. Let's see if this version is right and up to date. We'll get back here. Oh, 0.3.26. All right, well, let's apply this to our kind cluster and see what happens. So deploy that, deploy the K8sGPT resource. Yay. Out-of-cluster traffic to AI backends: in the above example, enable AI is set to true, and this option allows the cluster deployment to use the backend for filtered and improved responses to users. Oh, that's kind of neat. So you can prevent it from reaching out to third-party AI services by setting that to false. That kind of makes sense. LocalAI and Azure OpenAI are supported, and they're also looking to do kind of a way to serve in-cluster models. That's neat. So when you run this, I'm assuming it's creating a result. Well, there's a Result CRD, and then we can just dump this out and it's gonna give the same thing from earlier. Oh, right, need to be in there. Okay, so there's the run that we created, with the updated version. So now let's go back and run this. And there we go. So: items, a Result, name default/broken-pod, namespace k8sgpt-operator-system. Kubernetes can't find the specified nginx image in Docker Hub; very similar to what we got before. There is a difference from what we got back earlier when we were just using the command line, because the command line was using gpt-3.5-turbo and we're using gpt-4-turbo for this broken pod. So I guess my next question is: it's the K8sGPT operator, so it's only going to act on new custom resources that are created. So it's not gonna run things automatically.
It's not gonna create these new runs on a cron schedule, is it? Let's get the K8sGPT resource before anything. Yep, here's the K8sGPT resource, and there are the Results. I kind of want to dig into this. I get the idea of wanting, I get the idea, I'll put it this way: I feel like this is half done. This has to have some way to run against a cluster on the regular and then give feedback back. Luna, hush, you are not the one I was expecting to be troubled by that. So let's see, is there a separate repo for the operator? Yes, there is. Okay, here's the scope of the managed K8sGPT workload. This example is, I believe, really close to where we were at. Okay, monitor multiple clusters. Sinks: okay, so there is Slack and there's Mattermost. And then setting an image pull secret, okay, so you can host it yourself. That makes sense. Let's look at the Helm chart quick. That's probably super small. I just blew up the wrong window. There we go. So we find the controller manager, image k8sgpt-operator. Interesting. I really would have expected this, oh, that's too small, I really would have expected this to be able to run periodically, or to essentially put a watch on all events, and then when an event comes up that is not expected, like ErrImagePull or just an outright error, anything that is the equivalent of not exiting zero or not running, it would then be able to run and output some sort of report. You know where y'all's roadmap's at? I wanna see if this is on the horizon, because this is, like, it's awesome. This is a very good example and application of what LLMs can do: being able to take somewhat nebulous errors and unwrap them, unravel them a bit. But also, I was somewhat expecting this to equally be able to surface alerts and issues as they came up. Charter, adopters list, backends, key features. So, data is anonymized before being sent to an AI backend. That's pretty nice. So it won't, okay, that's actually really cool to read.
So imagine this is in your cluster and you have pods and workloads, and the names of those workloads might indicate something you do not want publicly available. What will happen is it will anonymize any of the relevant information specific to your cluster. I like that a lot. I saw the whole anonymized config option in the YAML earlier and I wasn't 100% sure what they meant. Now I understand, and this is really, really good. Imagine you worked at a hospital, and a patient comes through, think like a research hospital, I guess. A patient comes through, and there happens to be some analysis being done against their lab results. You might actually have the workload name be the patient's medical record number or something, and you could be unknowingly exfiltrating HIPAA data, which would be bad. PHI, very bad. This, very good. That's awesome. K8sGPT stores config data locally, okay. So when you put your API key in here, make sure it's not a public computer, please and thank you. Remote caching. That also makes me super happy. So what this means is, if K8sGPT is running against your cluster and you have ten of the same error, exact same, it's the same pod, it's the same everything, except maybe the timestamp in the object, you can set up a cache so that it's not hitting and consuming valuable GPU resources, a.k.a. money, every time, and it just returns what it already got back from looking at an extremely similar error. By extremely similar, we're talking the only things it would probably ignore are the unique ID of the object as well as the date of the object. I'm just guessing here, but you have this holy trinity of how to identify an object: you have kind, you have namespace and you have name.
And if there's an error with that uniquely identified object and the same error is happening, maybe at a different time, maybe there's a new version of the object, but the deployment is still throwing the same error, that cache will prevent you from just wasting money getting the same information back. Love that. Custom analyzers: write your own analyzer. Oh, this is dope. Okay. Okay, so K8sGPT can also pipe the data that it gets back into an analyzer that you or other people can write. I really like that. I like how extensible that can be, and I just like the flexibility. There are some times where, me personally, I'll just want to write something custom just to see how things work in the background, and this will let me scratch that itch. Nice. Okay. So, remote caching, custom analyzers, boom. But there's not a way to have it run continually, or have it run when the operator detects an error. Am I... let's look at their features. Cluster event anonymization, deep-dive alerts. I wonder if that's also just out of scope of what they're trying to accomplish, because it is relatively easy to just create a cron job that would run, well, a cron job that would create the type of run that they would want, and the operator would act on that. That's something that I don't want to just sit here on screen and read GitHub issues for too long, so I am going to do that one offline. But, I don't even want to know what Copilot was trying to autocomplete there. Actually, thank you, Copilot, you're so smart: "ask Alex about periodic runs for K8sGPT." Because I feel like, I don't know, I feel like I'm a broken record right now: that functionality would take this to the next level. Not only when something wonky happens with a cluster, it then says, hey, we see these pods aren't getting scheduled.
Based on the error, we think it's because there are not enough compute resources. That saves an SRE three or four commands trying to look up the error, and also looking at the state of the cluster and trying to reconcile the two, because that type of error might come up and just say something like "insufficient resources," and you kind of have to dig a little bit. So what else around K8sGPT can we show? Integrations. Okay, let's see what this has. Trivy and Prometheus. So: k8sgpt integration activate trivy; once activated, it will install the Trivy operator into the cluster. Oh, now that's kind of cool. So just out of the box, out of the box, did it close, closed one extra tab, out of the box K8sGPT can install Trivy for you, integrate with Trivy, and then you can use different filters out of Trivy and also analyze the output from the Trivy operator. That's wonderful. That's really, really cool. Also being able to, not quite at a glance, but dump all the CVEs that were found, and here is the remediation, boom. That's really nice. Now what does the Prometheus one do? It doesn't just deploy resources in the cluster; it assumes that the Prometheus stack is already in the cluster and you specify the namespace. That is a reasonable assumption for the installation; otherwise it'll just error. And then, once there, K8sGPT has access to the relabel config and can validate the config. Okay, so this integration pulls out your Prometheus config and will do an analysis, and throw, not throw, if there's something wrong or off with your Prometheus config it'll actually output some meaningful information, including how to fix it. I'm down with this. And then again, the filters we talked about earlier: we can actually write our own filters if we so choose, assuming we follow their schema. This is cool.
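On the earlier musing about periodic runs: one rough way to fake them today would be a CronJob whose pod runs the CLI on a schedule. Everything here is hypothetical, not an upstream feature; the image path, service account, and in-cluster auth behavior are all assumptions you would need to verify.

```shell
# Hypothetical sketch: a CronJob that runs `k8sgpt analyze` every 30 minutes.
# Assumes the pod's service account has read access to the resources k8sgpt
# inspects, and that the CLI image exists at this path (a guess).
cat <<'EOF' > k8sgpt-cron.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: k8sgpt-periodic
  namespace: k8sgpt-operator-system
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: k8sgpt-periodic   # hypothetical SA with RBAC
          restartPolicy: Never
          containers:
            - name: analyze
              image: ghcr.io/k8sgpt-ai/k8sgpt:latest   # image path is a guess
              command: ["k8sgpt", "analyze", "--explain"]
EOF
#   kubectl apply -f k8sgpt-cron.yaml
grep -c 'kind: CronJob' k8sgpt-cron.yaml
```

You could equally have the CronJob recreate the operator's K8sGPT resource to trigger a fresh reconcile; either way it is a workaround, not the built-in periodic mode mused about above.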
It's not that I'm disappointed, it's just I really think there should be some sort of periodic monitoring with it, and maybe that's coming, or maybe that's a feature that vendors will try and bake in. I don't know, but this is awesome. Am I going to install this in my home lab? Probably not. I wind up having only one or two of the same errors in my home lab, and they have to do with running on kind of old, kind of janky gaming machines from like ten years ago that I repurposed. So it's hardware: a machine just kind of dies, and then I have to go reboot it, and that fixes it. So this seems a little too much for that, but I'm also very, very excited to see where this is going to be in six months. I was really interested in what they were doing before they joined the CNCF; I'd been following the project and the project health, at least, just not using it. Yeah, I can't imagine where they're going to be in six months, because from where they were six months ago, they've come a long way already. Of note, they have all their information in their community repo. If you are interested in helping out, if you are an SRE that also wants to try and leverage LLMs, this is probably a really good place to look, especially if you're an SRE dealing with Kubernetes specifically. Otherwise, I think that's where we're going to wrap this episode. I don't want to just surf around a GitHub repo too much on my own again. I will say, thank you all for joining. I will update the GitHub repo with today's episode and all the artifacts and links. And I will probably be reaching out in whatever, however and wherever I reach out; if I get an answer, I will update the episode notes with what I get back regarding some form of periodic checks. So otherwise, yeah, thank you all for joining. Hope this was useful. It was at least fun for me, so I hope you had fun too. Happy Thursday, bye.