Go ahead and get started. So thank you for coming, everyone. Welcome to Minimalistic Best Practices and Kubernetes Operators. Somebody should have given this talk a shorter name. Whose fault was that? So my name's Jonathan Burkhan. I work for IBM. And we're going to be talking about ways to make your operators better, specifically by cutting out a bunch of stuff you don't need.
So before we get into that, though, I want to introduce myself, tell you who I am, and maybe why you should listen to me. So like I said, I work for IBM, specifically as an open source developer. So I actually contribute to open source full time as my job. And for some reason, they pay me to do that. So right now, I am a steering committee member for Operator Framework, which is an incubating CNCF project that is not part of Kubernetes itself. It's sort of its own separate thing. But everything we make is related to Kubernetes. We make a bunch of tools for writing operators and upgrading operators and distributing operators and doing all kinds of things with operators. So I'm a member of the steering committee there. I'm also a maintainer of one of the subprojects, Operator SDK, which is a CLI tool for scaffolding operators. And in previous lifetimes, I worked as an open source contributor to various parts of Kubernetes itself. And then before that, I worked on an open source platform as a service called Cloud Foundry. And then there's just the link to the Operator Framework website. If you're interested in operators and somehow have not heard of it, check that out. It's probably got a bunch of tools that are going to make your life way easier.
OK, so before we get into the meat of it, let's talk about what an operator is. Because this is kind of a complex topic. Maybe people are not familiar with exactly what operators look like on the inside. So show of hands, who here has ever written an operator, worked on an operator, deployed an operator, used an operator?
OK, so it looks like pretty much everybody at least has some experience. You've got some rough idea of what an operator is. But just for completeness' sake, I'm going to go over, in sort of brief terms, my understanding of what an operator is.
So if you have used Kubernetes at some point in your life, you have probably done something that looks like the top workflow, where you use kubectl, you have some specification of some Kubernetes resources contained in a YAML file, and you say kubectl create this, or kubectl apply this, or kubectl patch this. And that's how you interact with the system. What that actually does is it just writes a line to the etcd store. It's just writing data to a database somewhere. And what actually makes that become reality is a process called a controller. And there's a bunch of controllers that are part of core Kubernetes itself. So when you say kubectl create pod, that's really just writing a line to a data store somewhere. And then there's a pod controller somewhere in the cluster that is sitting there watching that table. And it says, oh, something changed. I'm going to make that actually exist somewhere. I'm going to go do some things that will eventually result in a container that looks like the pod you requested existing somewhere in the system.
OK, so what does that have to do with operators? So an operator, sort of concisely summarized, is just replicating that exact same workflow that Kubernetes already uses internally to manage its core resources, but for your resource. So you're going to create a new type by extending the API of the cluster. And then you're going to be able to kubectl create your thing. And there's going to be some controller for your thing that exists somewhere in the cluster. And it's going to see that something changed. And it's going to go make that happen, whatever that means for your thing in particular. So this is just sort of a quote I wrote that concisely summarizes this.
An operator is a design pattern for creating software to run on Kubernetes. So really, it's an architectural pattern. It's an idea. It's not a specific concrete thing. Rather than statically creating our application out of native Kubernetes resources, we extend the Kube API by making our own custom resources and controllers that govern their behavior. And as a caveat, I will note, it is technically possible to write operators that do not, legally speaking, follow this strict definition. But almost all of the ones that I see and interact with in my day-to-day life follow this definition. You have some kind of application you want to run on Kubernetes, and you want to package it as an operator. And this is accomplished by teaching Kubernetes to fish rather than giving it a fish.
What is an operator in more literal terms? It's a bunch of Kube YAML. Literally, though, it is a custom resource definition that specifies the form of one or more custom resources that the user will interact with, creating these things to cause stuff to happen. It is a controller implementation. It is a server that runs in a container, usually on the cluster somewhere, that sort of manages the behavior of those resources. That controller is usually implemented, very literally again, as a Docker image. You have to write a bunch of Kube RBAC that controls the access the controller uses to see and watch and create things. And then it's about 10 metric short tons of Kube YAML that encapsulates all of the above.
So when we talk about minimizing our operator, these are generally the things I'm going to be talking about, usually ways to make these smaller, ways to make these smarter. Honestly, for a lot of this stuff, you can just cut out 90% of the stuff that gets blasted all over your operator when you use something like Operator SDK to generate it. Because operators are really not a one-size-fits-all kind of solution.
You really need to think about what you are trying to do, what you are trying to accomplish, and how that relates to these various pieces. And this is not the simplest of tasks, because operators, like Kubernetes itself internally, are a very complex topic. There are a lot of moving parts. Those moving parts are running on top of other moving parts. When you move those parts, the other parts move, and then it makes other parts move. So a little change can propagate a lot of complexity. Additionally, they're sort of a very admin-level thing. They're big and scary. They do things like have access to maybe all the secrets on the entire cluster, and you really don't want to mess around with stuff like that.
And the person whose responsibility it is to write it, generally a Kubernetes application developer, is usually very expert on their application and its domain-specific knowledge, but they maybe aren't exactly an expert in Kubernetes itself. Why would they be? They're writing their app. They're not a Kubernetes developer. They don't just sit around and think about operators all day like I do. And so it can be difficult to kind of teach that developer about that complexity, because they're not gonna write a bunch of operators. They're gonna turn their thing into an operator, and then they're gonna go back to working on their thing.
So I've tried to condense some of the knowledge, some of the wisdom that I've gained from my time working as a contributor to Operator Framework, into some sort of rough guidelines. These are not hyper-specific ideas like, oh, go delete this line. These are more general principles. Now, some of them are actually really simple, and two in particular are very easy to implement and will give you a huge bang for your buck if for some reason you weren't already doing them. So let's get into that.
Right, the solution. So what are we gonna do? Keep it simple. There isn't a need to reinvent the wheel for a lot of these.
And like I said, when you scaffold an operator with something like Operator SDK or Kubebuilder, you're gonna get blasted with a whole bunch of stuff, because when somebody just stands up an operator and they don't know what they're doing, we really want it to run the first time, or at least, if it doesn't run, make sure it's not our fault. So a lot of that stuff you can get rid of if you know what you're doing.
Some of the things I'm going to suggest are actually going to add complexity, which you might say, that's not minimalistic. How does adding complexity help? You're making things bigger. You're making things more complicated. There are a couple of things, especially if you do them at the start of writing your operator instead of in the middle or at the end, that are gonna save you a lot of time and headache later on. So in my estimation, that makes it worth it.
And also, if you haven't written an operator before, I realize some of these things I'm going to suggest might not make much sense. I might be using words that are not familiar if you aren't super familiar with how the internals of a controller work or stuff like that. If that's the case, stick with me. At the end of this talk, I've got some stuff linked that'll point you to a couple of tutorials where you can stand up a basic operator. Do that, and that should give you enough information to be dangerous. Come back, watch this. Hopefully it should make more sense.
Okay, so let's dive into it. Let's start with the CRDs. So a CRD, a custom resource definition, specifies the structure of a custom resource that will then exist on the system. They are sort of the heart of your operator. Even though the controller is the stuff that actually goes and makes the magic happen, the CRD is really what specifies the API that the user is going to interact with.
You're making a new Kubernetes resource. They're going to kubectl create it just the way they do everything else, they're going to write it in Helm charts, they're going to do all kinds of weird stuff with it, but they're going to interact with it by creating these objects. So really, this is where you need to start thinking about how the API of your object itself works. You want to be very considerate when you add things to it, because adding things is really easy, but taking things away is pretty difficult, unless you like opening it up and doing mutating webhook migrations, which is my favorite thing to do on a Saturday, but might not be yours. Additionally, if you just take the shortest, simplest path every time, you're like, oh, we need to add this new setting, stick in a bool that controls this one feature gate and forget about it, you're going to have eight million bools in there by the end. So add things, yes, but be aware.
Try and keep things versatile. It's much, much easier if you implement things in a way where you can add new features by changing content, things that are defined in the content of the CRD, rather than by changing the structure of it. That's going to be way easier, because then you can just roll the controller and you don't have to go through a messy migration path. It's going to make things in your life simpler and easier.
Another thing to take into consideration is that anything that your CRD depends on, if your controller needs to look at it, needs to touch it, needs to do noodley things to it with its controller noodles, you're going to have to give it permission to do that, which may have some ramifications that you're not really aware of. I mean, okay, it's going to have to have access to it, so that's pretty simple.
But if you end up, I don't know, watching all of the pods on a production cluster in your controller, that means your controller's going to get an event any time any pod changes anywhere in the system, and you're not going to have a fun time. So that's just something to be aware of, although this is something that you can control at a couple of other steps, which I'll talk about later on.
And then finally, especially for the CRD itself, it would really be a good idea to stick to a Kube-like API, because really what you are doing is just extending the Kube API itself. It would be good if you kind of stuck to the way things are supposed to be there. What does that mean? So Kubernetes itself is a declarative system. It's not an imperative system. It means you interact with it by just declaring state and then hoping that eventually, sometime later, things will happen that will make that state true. So you really want to stick with the declarative form. Don't have buttons, don't have annotations that cause things to happen when they're created, or weird stuff like that. Use the spec and the status: the spec is the way information flows into the system, status is the way it flows out. Yeah, stay away from annotations. I don't know why everybody loves those things so much.
Okay, so once you've got your CRD hashed out, that's really where you start. The controller's the next bit, and the controller itself is really just the reconciliation loop and a bunch of, you know, cruft around that. So briefly, if you don't know what the reconciliation loop is, your controller is sitting there watching the table for your CR or whatever. Every time it receives an event, which is usually when something changes, it's gonna fire and say, okay, this is the desired state of this object, this is the actual state of the system, I'm gonna try and move it from one towards the other.
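To make the reconciliation loop described above concrete, here is a minimal, stdlib-only Python model of it. This is a sketch under stated assumptions, not how Operator SDK or controller-runtime implement it: the `Widget` resource, the `FakeCluster` stand-in, and the `run_until_settled` driver are all hypothetical names invented for illustration. The spec carries the desired state in, the status carries the observed state out, and each pass moves the actual state one step toward the desired one.

```python
from dataclasses import dataclass, field


@dataclass
class Widget:
    """Hypothetical custom resource: spec flows in, status flows out."""
    name: str
    spec_replicas: int               # desired state, written by the user
    status_ready_replicas: int = 0   # observed state, written by the controller


@dataclass
class FakeCluster:
    """Stand-in for the real system: how many replicas actually exist."""
    running: dict = field(default_factory=dict)


def reconcile(widget: Widget, cluster: FakeCluster) -> bool:
    """Move the actual state one step toward the desired state.

    Returns True when another pass is needed, False when the two agree.
    """
    actual = cluster.running.get(widget.name, 0)
    if actual < widget.spec_replicas:
        cluster.running[widget.name] = actual + 1
    elif actual > widget.spec_replicas:
        cluster.running[widget.name] = actual - 1
    # Report what was observed; never write to the spec from here.
    widget.status_ready_replicas = cluster.running.get(widget.name, 0)
    return widget.status_ready_replicas != widget.spec_replicas


def run_until_settled(widget: Widget, cluster: FakeCluster, max_passes: int = 100) -> int:
    """Hypothetical driver: call reconcile until it reports no more work."""
    passes = 0
    while passes < max_passes:
        passes += 1
        if not reconcile(widget, cluster):
            break
    return passes
```

In this model the user only ever changes `spec_replicas`; the loop notices the difference and closes it a step at a time, whether the desired number went up or down.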
And if you're writing your own operator, especially if you're using Go, which would be my recommendation, you're just in arbitrary code execution, you can do whatever you want in your reconciliation loop. It's generally a good idea to keep things simple, though. What does that mean here in particular? Keep the steps small. You don't want to say, okay, somebody instantiated a new instance of my object that is made of these hundred different parts, I'm gonna go make all of those hundred parts at once. It's a bad idea, because you're probably gonna get to like part two and everything's gonna be different by then. So then you're gonna go make the rest of those parts, and you're gonna wait, it's 98 steps.
If there is something that's like 98 steps where you're really sure, I need to do those all at once, that is sort of possible, but maybe a better way to do that is to implement long workflows outside your controller itself. So particularly if you're interfacing with some off-cluster system, some API somewhere to provision resources somewhere off the Kubernetes cluster, that's the sort of thing that we see that often has a big lag time: oh, go provision this thing and eventually it'll come back. You can implement that off the controller itself. Have some API that the controller will hit and return, don't sit there and wait for it to happen, because like I said, stuff will change.
Rely on event-driven reconciles. This is one that just boggles my mind every time I see someone not do it. So Kubernetes, like I said, small iterative steps, you're supposed to be watching the tables, things change constantly. The default sync period for an operator, I think in Operator SDK, that's 10 hours by default. People seem to like to set that to a really small number so that the controller is sitting there watching and doing things. No, don't do that, don't touch that. If we could remove that, that would probably be a good idea, but people seem to need it.
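The advice above about long off-cluster workflows (hit the API and return, don't sit and wait) can be sketched as follows. This is a simplified stdlib-only Python model under stated assumptions: `FakeProvisioner`, `Result`, and the retry delays of 900 and 60 seconds are hypothetical and chosen only for illustration; none of them is a real Operator SDK or controller-runtime API.

```python
from dataclasses import dataclass
from typing import Optional


class FakeProvisioner:
    """Hypothetical off-cluster API; a server becomes ready after a few polls."""

    def __init__(self, polls_until_ready: int = 2):
        self._remaining = {}
        self._polls_until_ready = polls_until_ready

    def start(self, name: str) -> None:
        self._remaining.setdefault(name, self._polls_until_ready)

    def is_started(self, name: str) -> bool:
        return name in self._remaining

    def is_ready(self, name: str) -> bool:
        if self._remaining[name] > 0:
            self._remaining[name] -= 1
            return False
        return True


@dataclass
class Result:
    """What one reconcile pass hands back to the work queue."""
    done: bool
    retry_after_seconds: Optional[int] = None


def reconcile(name: str, provisioner: FakeProvisioner) -> Result:
    """Do one small step and return; never sleep inside the loop."""
    if not provisioner.is_started(name):
        provisioner.start(name)  # fire the request...
        return Result(done=False, retry_after_seconds=900)  # ...and come back later
    if not provisioner.is_ready(name):
        return Result(done=False, retry_after_seconds=60)
    return Result(done=True)
```

Each call re-reads the current state before deciding what to do, so if the world changed between passes the next pass simply sees the new state instead of acting on a stale plan.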
Use requeue after. So when you exit the reconciliation loop, there's a bunch of tricks you can do, requeue after, I think there's like a requeue in event, there's a bunch of different ways you can specify that say, keep reconciling, but wait a minute, or keep reconciling if we get this error specifically. And you can reason about the behavior of your operator because you know the domain-specific knowledge. You say, oh, we went and hit that off-cluster API that's gonna provision a server that's gonna take 15 minutes to come into existence because it's, I don't know, a bare metal server made by monkeys running in hamster wheels. I don't know, but you could build that in. You could say, okay, we know it's gonna take 15 minutes, so requeue after 15 minutes, don't bother trying to reconcile this again before then. And that adds up.
You can filter events, you can use predicates. There's lots of ways to cut down on the traffic coming in that's gonna fire all these reconciliation loops. And that also sort of applies to the traffic that goes back out of your controller. You can use patch if you are just making small changes, especially if you've got sort of a simple operator, one CRD, one set of controllers. There's apply, if you've got multiple things crashing around at once. And there's server-side apply now, which is even fancier. And then finally, don't update the status if it hasn't changed. So a lot of the time you're gonna fire up a reconciliation loop, you're gonna move one step, and nothing that's observable has really happened yet, so there isn't really a need to write the status back. You can just exit.
And these sort of lead us into the next thing, which is minimizing the API load. So your controller itself, you're gonna need to, you know, make sure that's running and performant internally. But you really also need to be a good citizen of the cluster as a whole. And here's this first bullet point. This is one of the things I was talking about earlier.
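Returning to the earlier point about not writing the status back when nothing observable has changed, here is a minimal stdlib-only Python sketch of that check. The `FakeAPIServer` class and the `write_status_if_changed` helper are hypothetical names made up for this illustration; a real controller would compare against the status it read from the API server through its client.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAPIServer:
    """Counts status writes so the saving is visible; not a real client."""
    stored_status: dict = field(default_factory=dict)
    status_writes: int = 0

    def update_status(self, name: str, status: dict) -> None:
        self.stored_status[name] = dict(status)
        self.status_writes += 1


def write_status_if_changed(api: FakeAPIServer, name: str, new_status: dict) -> bool:
    """Write the status only when it differs from what is already stored."""
    if api.stored_status.get(name) == new_status:
        return False  # nothing observable changed: skip the API call
    api.update_status(name, new_status)
    return True
```

With this guard, three reconcile passes that compute "Pending", "Pending", then "Ready" cost two writes instead of three, and each skipped write is also one less change event for every other watcher of that resource.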
This is one of the simplest things you can do. This is literally a one-line change. This is gonna, guaranteed, save you at least 10 headaches, 10 stupid mistakes: turn on metrics in your operator. If they aren't on right now, I want you to promise you're gonna go and edit that line, because Operator SDK uses controller-runtime, and so does Kubebuilder. Those implement plugin points for Prometheus by default. So if you have Prometheus installed in your cluster, it's a one-line change.
And this is going to save you so many headaches, because there are so many things you can't predict at design stage, and they can change depending on what's running on the operator. There can be all sorts of bottlenecks. You could say, I've got all these reconciliation loops that are piling up because something's logjammed, the pods on the, you know, some other resource, because there's other things going on. This is a production cluster, I don't know. And it can be very difficult to reason about that ahead of time, and it can be impossible to figure that out even if you have the, you know, running cluster, because you don't know. You don't know where it is in the reconciliation loop. You don't know that, you know, we've got 10,000 things queued up at this one specific step and then they all just kind of hang there for some reason. But if you turn on metrics, it's going to be really easy to figure this out. So please do that. And that's going to save you a lot of trouble. It's going to help you fix these bugs, and fixing these bugs is generally also going to make your operator a better citizen of the cluster itself. So everybody else will be happy too. Oh, I'm going to fall down these stairs.
Okay, watches are expensive. So your controller is sitting there on the cluster watching, generally, you know, the table for your custom resource. But it also might depend on other things that you need to watch.
Maybe you create something, you run it in a pod, you want to watch those pods for some reason. But like I said, you really don't want to watch all the pods in the cluster. So you've got to be smart about it. These add up. A lot of the time the cluster limits them, like only allows X globally, and a bunch of those are taken up by the core Kubernetes controllers themselves. So this is something you want to be aware of, and you want to be clever. Especially if you've got multiple controllers in a single operator that are related, you can make them share a cache. That cache, you can filter it, and all of this stuff is going to cut down. Yeah, okay. It's going to cut down on how much stuff you're asking for. It's going to cut down on how much stuff your operator is going to have to care about. Although this is also something you've got to be careful with, because if you filter it wrong, your operator is basically going to be dead to the world and it's not going to work.
A bunch of these things can actually be short-circuited, or at least you can be forced to implement them, if you do the other thing that is sort of a short and sweet recommendation, which is namespace scope your operator. So by default, and Operator SDK is guilty of this, if you say give me an operator, you're going to get a cluster-scoped operator, because that's simple-ish. But is it good? Is it the best solution? Maybe not, but it is simple. If you namespace scope your operator, though, and this is, again, one of those things that's way easier to do at the start instead of starting with cluster scope and then moving there later, this is going to force you to do a bunch of the stuff that I've already mentioned. It's going to force you to only care about the resources in the namespace that you're operating in. Therefore, your watches are going to be cheaper, because they're only going to be watching things in that namespace.
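The watch filtering and namespace scoping described above can be pictured with a small stdlib-only Python model. It is a sketch under stated assumptions: the `Event` type, `make_filter`, and `reconciles_triggered` are hypothetical names, and a real operator would express the same idea through its framework's cache options and predicates rather than code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical change notification for a pod."""
    namespace: str
    name: str
    labels: tuple = ()  # (key, value) pairs


def make_filter(namespace: str, required_labels: dict):
    """Build a predicate that keeps only events this operator cares about."""
    def keep(event: Event) -> bool:
        if event.namespace != namespace:
            return False  # namespace scoping drops everything else up front
        labels = dict(event.labels)
        return all(labels.get(k) == v for k, v in required_labels.items())
    return keep


def reconciles_triggered(events, keep) -> int:
    """Count how many events would actually reach the reconcile loop."""
    return sum(1 for e in events if keep(e))
```

Given four pod events spread over three namespaces, a filter for one namespace might let two through, and adding a label requirement on top might let only one through. The two filters stack, which is why the savings compound.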
You're going to be more performant, because you're not going to be ballooning up with all of your weird watches based on how much other stuff is running on the system. You're going to get a bunch of stuff I haven't mentioned for free. Like, it's going to be way easier to upgrade, because you can roll stuff one namespace at a time in a way that's not going to affect other users. It's going to let you do all kinds of weird multi-tenancy things, like maybe you've got different versions of the operator or different tiers that you want to be available for different people, and it's going to be way easier to control access to that instead of just having cluster-level RBAC or requiring some kind of third-party sign-in to be plugged into the system.
Finally, I would like to mention this. This is sort of beyond the scope of this talk, ha-ha. Multi-cluster scope is a thing. So rather than having one operator installed on one cluster, you could have a single operator installed across multiple clusters, and then you wouldn't even have to have controllers on those other clusters at all. That's kind of complicated, and I don't know if it would be my recommendation for someone just starting out, but I would like to mention it because there's a lot of scenarios where that ends up being more efficient.
RBAC. Okay, so operators are low-level, cluster admin kind of things. They're scary. And like I said, Operator SDK, we're guilty of this. When you make a new operator, when you scaffold something out, we're just going to liberally hose it down with a bunch of cluster admin privileges that you're probably not going to end up using. So I would say probably like 90% of those can get removed. So yeah, I would definitely go through there and really think, does your operator really need that? Secrets and cluster-wide watches are generally the sorts of things you want to avoid, which, if you implement it as a namespaced operator, is kind of fixed up for you anyways.
And this is also sort of a security vulnerability concern, because this means you're going to have to have a new service account that has all these privileges and belongs to the controller, and if other users were to somehow get access to that, it creates a security vulnerability in the cluster that you then have to keep up with.
And then finally, the controller itself is running in a container, which means there's a bunch of container stuff you can do, that you probably should already be doing but might not be, to make that simpler, smaller, faster. Again, a bunch of the stuff that gets generated by default is just, we're going to shove everything into that image that we think you could ever use. But once you've got your operator and you know what you need, you could probably remove 90% of that. Start with a minimalistic base image, only include the things you need. Have multiple build stages so you can have all the build artifacts you need, build whatever your thing is, and then remove those so they aren't included in the final product. This again is sort of simplistic stuff that, if you're running applications in containers, you probably are already familiar with, but it works on your operator too.
Okay, so to summarize, first two points. If you remember nothing else from this talk, do these two things. Turn on the metrics in your operators, thank me later. Make your operator namespace scoped. Now, this can be kind of finicky to do if you've already got your operator. I've done it once just for fun, it wasn't that bad. It wasn't that bad. Filter, filter, filter. There's going to be a bunch of stuff you can filter. You can filter your watches so you have fewer events coming in, you can filter your actions with predicates, or sorry. Yeah, filter events with predicates. Because a bunch of the stuff that's firing, a bunch of the events that are coming in, a bunch of the stuff going out is probably not actually relevant, so you can filter a lot of it out.
Cut out the cruft. A bunch of the stuff that's going to be included in your operator by default doesn't need to be there, and you can get rid of it. Many of these fixes are multiplicative. If you namespace your operator, but then also bother to go and implement filters on the watches, you could cut this down even further. Maybe you've got a bunch of pods running in your namespaced operator, but you really only care about the ones that are running this one particular part that exposes events that need stuff to happen in the reconciliation loop. You could filter it down even further so it's only those pods you're actually getting events for, et cetera, et cetera. And all this stuff adds up, because if you do that and then you have a bunch of these operators deployed in different namespaces on your clusters, you're going to be a fraction of a fraction of a fraction of the load that you used to be when you just had one-size-fits-all, cluster operator, watch everything, grab everything, read everything.
A little forethought goes a long way. This is maybe not the most helpful advice, think about this at the start rather than in the middle, but it's what I can offer. And then finally, some familiarity with the core Kube API helps a lot. So like I said, a lot of the people that write operators are application developers. They aren't necessarily the most familiar with the Kubernetes API itself, but it's all very extensively documented. I think I've got some links to some of it in here.
Yeah, so first link, Operator Framework is the thing I work on, it's got a bunch of tools that are super helpful if you haven't heard of them. And then I've got a link to the API conventions from the SIG Architecture docs. This is sort of a CliffsNotes of, like, why is the Kubernetes API the way it is? What style should you be shooting for? This is probably my first recommendation for reading material. It sort of summarizes a lot of the things I have just said, and this will help you twofold.
It'll help you, one, because you'll understand how Kubernetes works and how you can use Kubernetes itself better. And, two, you'll be able to emulate that style when you actually design the API of your operator itself. And then the third link is just, that's the Kube API specification itself. It's really more of a reference document, not very light reading, but you know, if you're into that, it's certainly very informative. And that is it for stuff that I have. I'd like to take questions now. Hopefully this thing doesn't explode. Can I get the microphone? Oh, oh, hello, hello. Oh, is it? Okay, yeah, questions.
Great talk, Jonathan, and thanks for working on Operator Framework, I've been using it for a little bit. One question that has come up for us working on an operator is, how do you actually test it?
Test the operator?
Test the operator. So if I come back six months later and I've got a bug I need to fix, and I want to prove that bug exists in my code and test it without needing to spin up a lot of infrastructure, is there a possibility for that?
Without spinning up a lot of infrastructure. Now you're getting difficult. Well, my recommendation again would be you should have done all this already. So why are you asking now? So there's a couple things you can do. You could have unit tests, and probably should, for all the stuff in your operator itself. In terms of actual integration-level tests and end-to-end tests, are you familiar with Scorecard?
The OpenSSF project?
No, the one that's part of Operator Framework.
Oh, no.
Okay, so Operator Framework has a solution for you. It's called Scorecard. What is Scorecard? Scorecard's a thing that lets you specify a bunch of integration tests that get run in a container against your operator, and then once you have the container built, you just, I forget, it's like, Scorecard, run. Scorecard, go do my thing. And it actually has a bunch of tests built in by default that are actually useful.
That'll test to make sure the API specification for your operators are flexible, that everything is plugged in and connected, and then you can build more on top of those that are domain-specific ones. So that would probably be my best bet, take a look at Scorecard.
Awesome, I'll check that out.
You're not a plant, are you? Ask me questions that are easy.
When you talk about namespace-scoped operators, do you mean having namespace-scoped permissions, or running the operator once per cluster with namespace permissions, or running the operator in each namespace? And then the question would be, on the scalability side, on the watchers and all that, if I'm watching multiple namespaces instead of cluster-wide, am I using more resources on the API level, more watchers, more memory, more everything?
Okay, when I say name, when I say namespace scope, I mean namespace scope the operator itself. So this means that for your operator, foobar, I'm gonna make an instance of it that exists in some namespace banana. And you're only gonna be able to create resources of, you know, your operator type in that namespace banana. And the upshot of this is, like I said, it forces you to do a bunch of the things that I just mentioned that are really good ideas to do anyways. And then you're gonna have multiple of your operators deployed on the cluster. You're gonna have one in banana, one in cherry, one in durian, one in whatever fruit that starts with E. And you're gonna be able to manage those separately. So you're gonna be able to upgrade them separately. They're gonna reconcile separately. And if you do your homework right, this will not be a big resource hog, but that can also be finicky depending on what exactly your operator is doing.
Okay, so you're saying, pack my custom resources in one namespace where the operator lives instead of having them spread out across multiple namespaces, if.
No, no, no.
So by default, if you just, like, say, make me an operator, this is my CRD, it's gonna just blast that. It's gonna be cluster-wide. It's gonna be on the cluster-level API. Everybody can see it from everywhere. But if you namespace it, then it'll only exist in one namespace at a time. So you'll be able to make foobar in the banana namespace. And there'll be a foobar in the cherry namespace that is not technically the same custom resource. So you'll be able to manage them, do things to them separately. Does that make sense?
Yeah, but we have a specific case, but I think we're gonna add a lot.
Everybody always has a specific case. Okay, any other questions? Come find me afterwards, by the way. I'd like to chat with you. Nobody? Okay. Okay, well, if that's the case, thank you all for coming. I'll be around if anybody would like to chat more about operators. Thanks.