Alright, let's get this started. My name is Joe Thompson, and today I'm asking you to stop writing operators. Relax, what I really mean is stop writing operators by default for managing Kubernetes apps. Now, of course, when I say something like that, naturally people wonder, who's this guy? Although some people use a different word than guy. And why do I care what he thinks about operators? As a roundabout answer to that question, here's some basic info on me, and that QR code has my contact info embedded in it as well. In a nutshell, I've been involved in some way or another with Kubernetes since 2015. I've worked at Red Hat, CoreOS, Capital One, and Mesosphere, and I'm currently a solutions engineer for HashiCorp. But before I got involved with Kubernetes, I was a system administrator and IT consultant for about 15 of my 20 years in IT up to that point. In short, I've been around long enough to see some operational patterns emerge multiple times. And a quick note, it looks like I'm not the only one who thinks the operator pattern may be getting a little overused right at this point. I submitted this talk this past July. In September, Devan Goodwin at Red Hat wrote a blog post called When Not to Write a Kubernetes Operator. Then later that same month, The New Stack published an article titled Kubernetes: When to Use and When to Avoid the Operator Pattern, which extensively quoted several people about why writing an operator is more of a last resort than a preferred solution, and what distinguishes operators from other things. And then a week after The New Stack article, I got the notice that this talk was accepted for KubeCon. To paraphrase the term William Gibson used about the invention of the steam engine, suddenly it seems to be operator skepticism time.
As a result, I'm recording this talk in October with a lot less apprehension than I had when submitting it, because nothing cures your imposter syndrome about something in Kubernetes like finding out the co-founder of Rancher agrees with you. So you're writing or adopting an app to deploy in Kubernetes, and you need to have some way to manage its operation. This is where a lot of people start asking themselves questions about operators like, what functions should my operator perform? What API permissions will my operator need to perform those functions? What framework should I write my operator in? I would like to ask you, any time the concept of an operator comes up from now on, to ask yourself these questions first. Should this operator be written at all? And should I write it? The rest of this talk is going to sound really dogmatic. First of all, if nobody leaves here thinking I'm completely wrong about something I said, then I'll feel like I didn't challenge the status quo enough. But even though I phrased things very emphatically here, I understand there are cases where people will write operators I think are unnecessary, for what might be good and sufficient reasons if somebody sat down and said, look Joe, we had to write one because of A, B, and C. Or maybe just because your boss said write it or you're fired. Although if they did that, I hope you send an emergency signal on your communicator and we can beam down and rescue you. I don't want anybody to leave thinking I'm saying the operator pattern is garbage or anything like that. I actually think exactly the opposite. It's a highly useful pattern in the right circumstances. But you can have too much of a good thing or have a good thing where it doesn't fit. Now let's start by distinguishing between an operator and things that are not operators. Back in 2016, CoreOS put up a blog post introducing what they called the operator pattern.
By which they meant, and I'm quoting here, application-specific operational knowledge encoded into software that leverages the powerful Kubernetes abstractions to run and manage the application correctly. CoreOS applied this concept specifically to stateful apps because those are the ones that typically need this kind of hand-holding through their life cycle. I want to really grind this axe for a few seconds because I've had conversations with people where they talk about wanting to write an operator for something I don't think needs one. And when I dig a little bit, they're using operator to mean any controller. Prior to operators, there was already a concept of domain-specific management of apps using extensions to the Kubernetes API, called a custom resource controller. Now some of what I'm saying in this talk will indeed apply to stateless CRD controllers as well. But mostly I'm going to be talking about stateful application operators and what a quagmire they can be. So I wanted to try to reset the conversation for an instant and bring some focus back to a useful distinction that's starting to fade a bit, but is very relevant here. So first question, should an operator for a given application exist? I think in several cases it shouldn't. The mental model I use to think about these cases is that operators are sort of like an adapter that you use to plug old VGA displays into new monitors that only have digital inputs. Sure, it works, but it's not as efficient or functional as it could be. Operators are a stopgap when either your workload or your platform can't handle the other directly. And I'll be up front here and say I have a bit of a professional bias against the whole idea of a dedicated manager process like an operator. Not things like cluster leader nodes, but components that don't serve workloads directly at all and are there strictly as things like install managers.
It's a common model and there are places where it makes sense or is necessary for practical reasons, but I always prefer to have my resources available to benefit my workloads rather than reserved for managing them. Now when do you not want an operator? There are basically three sets of circumstances worth talking about here. Case one, there's no state that needs managing. Either the app is not stateful or everything stateful about it can be properly managed without outside intervention using the available Kubernetes primitives. There's really nothing for an operator to do here. You might go ahead and write a CRD for it to make everybody's life easier, but Kubernetes and the app together can handle everything that needs handling. You might not even need a CRD controller, just a well-written Helm chart. It's a judgment call there, it's really up to you. But there's a retrospective case that's worth mentioning, which I'll go into later, and it has to do with that clause I snuck in there about available Kubernetes primitives. So you may in fact end up writing an operator with my approval here. And really, isn't that the only approval that matters? Next time you want to write some code and your scrum product owner says no, tell them I said it's okay. Case number two, the app has complex functional needs related to statefulness and you maintain it. In this case, creating an operator is a needless and, I think, harmful abstraction. Your app knows and can act on its own state better than anything outside it can. If you maintain the app and it needs to grow new capabilities to deploy on a platform like Kubernetes, then by all means give it those capabilities. But do it directly, build that code into your app. Splitting it out into an operator is just making life difficult for yourself as a developer, as well as for anybody that has to actually run and manage your app. Fundamentally, it's one more thing that can fail.
Some people would say, well, this is just a component of my app. It's a microservice. But I think there's an important distinction between operators as usually conceived and microservices. You can pick any of various technical definitions, but the basic idea is that microservices typically are discrete components that interact using standard interfaces. The internal details of microservices aren't relevant to each other. This is a useful abstraction. As long as the microservice I'm talking to is talking back to me using an API I understand and performing the functions I ask it to perform, I don't care what its internal state is or how it performs its functions. Loose coupling is part of the point of microservices to begin with. Operators break that abstraction wall down and tighten that coupling. There's also another aspect to consider, security. And this applies to any CRD controller, whether it's an operator or not. Controllers need API permissions to do what they do, which my friend and former co-worker Eric Chang pointed out back in January means if you have to pass a security audit at some point, you just made your life that much harder with every CRD controller you use. And even if there's no auditor looking over your shoulder, you still have that much more attack surface to worry about. If you're writing the app, you have the latitude to eliminate or at least minimize that attack surface. That leaves everything else. The app is stateful with complex functional needs related to its statefulness. You don't maintain it and none of the available alternatives for managing it inside or outside Kubernetes are viable. But before you jump on this, let's look at what some of those alternatives are. Okay, what can you do instead of writing an operator to make managing your application deployments easier? Well, trivially, you can write a CRD controller. Remember, a basic CRD controller doesn't try to manage anything statefully. 
That alone eliminates a lot of destructive failure modes. But as noted, a lot of the above advice applies equally well to CRD controllers. So saying write one of those instead isn't where I want to stop this section because it doesn't address most of those things. Probably the most effective option in a lot of cases is the default, nothing. Don't write anything or at any rate, if you have to write something, write as little as possible. This would be the case, for example, if what you're dealing with is a bunch of one-off issues that you never see twice, or if what you have is so lightly touched that it would take longer to automate its operations than to just do them. There's an XKCD cartoon for everything in tech, and this is a great one to keep on my wall that details this in chart form. If you expect to have to do something once in two years, and it takes you 10 minutes to do, it's probably not worth spending a week to automate it and test the automation unless you're compounding the time savings over an extraordinarily long lifetime. And consider, all code is a liability. It has to be written, it has to be tested, it has to be maintained, it can have bugs, and those bugs can lose your data, or worse, compromise other people's. If what you're automating is a routine change, a popular model for Kubernetes change management is the GitOps model. This was first described in detail by Weaveworks. You manage your cluster by applying an as-code model to it and to everything running in it, in a cluster-external development environment. In the most trivial case, this could be just writing a Helm chart or other templated artifact to deploy and maintain your app, then maintaining and using that as part of your CI/CD workflow. You can add layers of other management as well. For example, using Terraform to manage both the deployment of your Helm charts and changes to the clusters you deploy them on.
This gives you a single source of truth about every change to every app you run and potentially every cluster you run them on. You're still doing, essentially, management by static artifacts rather than live code. There are now several frameworks that take a no-code approach to writing operators, like KUDO. Generally, these are some kind of top-level controller that implements operator primitives. In that sense, it's the actual operator rather than the code you write. Then you write a config file of some kind to implement your specific application operations. These may be a happy medium for you, a sort of operator-light option that lets you farm out the hard parts of writing an operator to people who've made that their actual job. Just make sure the primitives you need are all there before you start. You can also have a live management process of some kind running outside the cluster. At first glance, this sounds like just writing an operator in a different place, but there are real problems with in-cluster operators that this solves. One of the issues I mentioned with Kubernetes operators is that, since they're running inside Kubernetes, they take resources away from other workloads in the cluster, but they're also limited themselves by the cluster's total resource availability and by the inherent scale limitations in representing things as CRDs and storing them in etcd. You can end up needing to scale up your cluster not to handle additional workloads, but just to give your application operator breathing room. Move that operator outside and not only are you no longer limited by running in the Kubernetes cluster, you now gain some additional flexibility in your manager. The one caveat is that now you've got something outside the cluster that needs to have credentials of some sort to authenticate to it. So you have to take due care with how you handle those credentials and that authentication.
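To put rough numbers on the "do nothing" option from a moment ago, here is a minimal sketch of the break-even arithmetic behind the once-in-two-years example. The function names, the 40-hour work week, and the five-year lifetime are assumptions added for illustration; only the "once in two years", "10 minutes", and "a week" figures come from the talk.

```python
def automation_payoff_hours(runs_per_year, minutes_per_run, years, hours_to_automate):
    """Hours gained (positive) or lost (negative) by automating a manual task."""
    manual_hours = runs_per_year * years * minutes_per_run / 60
    return manual_hours - hours_to_automate


def break_even_years(runs_per_year, minutes_per_run, hours_to_automate):
    """Years of manual work needed before the automation pays for itself."""
    return hours_to_automate * 60 / (runs_per_year * minutes_per_run)


# Once every two years, 10 minutes per run, one assumed 40-hour week to
# automate and test, judged over an assumed five-year lifetime.
payoff = automation_payoff_hours(
    runs_per_year=0.5, minutes_per_run=10, years=5, hours_to_automate=40
)
years_needed = break_even_years(
    runs_per_year=0.5, minutes_per_run=10, hours_to_automate=40
)

print(f"Payoff over 5 years: {payoff:.1f} hours")   # about -39.6 hours
print(f"Break-even after: {years_needed:.0f} years")  # 480 years
```

Under those assumptions the automation never comes close to paying for itself, and this count leaves out the ongoing cost of maintaining the code.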
Alright, we've talked about when an operator isn't needed and what you can do instead. When is one needed, I hear you asking? Two primary cases. Case number one, when you don't maintain a stateful app and you have to manage its statefulness on its own behalf inside Kubernetes. Not the ideal situation to be sure, but sometimes you just got to do the thing. Get real familiar with whatever system the app maintainers use to raise issues because you'll be using it in proportion to how complex the app is to manage. In fact, if you just need one or two state handling features, you should try to engage the maintainers and see if they're receptive to adding them in the app. It may turn out that they can do that for their own code faster than you can write an operator for someone else's. Case number two, when you do maintain a stateful app and you want to either start taking advantage of Kubernetes features that are in development but not released yet or you need to backport an app to versions of Kubernetes that don't have some statefulness management it needs. In these cases, the operator is a temporary shim that you're using to match up what your app wants with what the targeted versions of Kubernetes can do and you expect to discard it piece by piece and eventually entirely when you no longer need that support. Okay, you had the meeting or you cashed in your write one operator card or whatever and the answer to the first question is yes, we need an operator. Okay, get to work. No, no, wait a second. Remember there's still one question you need to answer. Should I write this operator? And I mean that not just in the sense of you the individual contributor but more broadly your entire organization. If you're just bound to do it, I can't stop you but I want you to go into it with the idea in your head that I mentioned earlier that all code is a liability. 
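Before moving on to that second question, the cases covered so far for the first one can be summarized in a rough sketch. The `App` fields and the `should_operator_exist` function are hypothetical names made up for this illustration, not part of any Kubernetes library, and the real decision involves more judgment than four booleans can capture.

```python
from dataclasses import dataclass


@dataclass
class App:
    # All fields are illustrative simplifications of the cases in the talk.
    needs_state_management: bool      # beyond what Kubernetes primitives handle
    you_maintain_it: bool
    alternatives_viable: bool         # doing nothing, GitOps, no-code framework, external manager
    needs_version_shim: bool = False  # unreleased or missing Kubernetes features


def should_operator_exist(app: App) -> str:
    # Case one: nothing stateful that Kubernetes and the app can't handle.
    if not app.needs_state_management:
        return "no: primitives, a Helm chart, or a CRD controller are enough"
    if app.you_maintain_it:
        # The exception: a temporary shim across Kubernetes versions.
        if app.needs_version_shim:
            return "yes: as a temporary shim you expect to discard"
        # Case two: you own the code, so put the logic in the app.
        return "no: build the state handling into the app itself"
    # Case three: someone else's app; try the alternatives first.
    if app.alternatives_viable:
        return "no: use one of the alternatives instead"
    return "yes: manage the app's state on its behalf"


print(should_operator_exist(App(True, False, False)))
```

The order of the checks mirrors the order of the talk: an operator is the answer only after every earlier branch has been ruled out.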
Okay, let's talk about why you shouldn't write an operator or possibly even some non-operator automation that needs writing. Reason number one, you don't have a good grasp of the operational need. That's a really abstract statement so let me put it in heuristic terms. You should not write an operator to handle operations unless you've spent at least as long recovering from their failure or absence as you realistically estimate it would take to write and thoroughly test the operator for them, and be very, very liberal with that coding time estimate. Why do I say to measure time recovering from failures specifically? Primarily because failure is where we learn what not to do, and that's one of the most important things to know when automating anything. It's where the smooth surface of a third-party app that you can't see inside of has cracks, and through the cracks you can see a little of what made them and maybe how to keep your automation from falling into them. It's also where you find out what the failure modes of your own apps are in complex, unanticipated real-world conditions. Bugs can be destructive, and one of the most destructive things to have a bug in is automated management. One of the best illustrations of how hard this part of operators is, ironically, is the etcd operator CoreOS wrote to introduce the concept. It turned out that running etcd in Kubernetes in an automated fashion was so complex and had so many hard-to-anticipate and hard-to-recover-from failure modes that one of the hoped-for goals of the etcd operator, to make etcd in the Kubernetes control plane totally self-hosted, was abandoned. Eventually the operator itself was abandoned by CoreOS as well. This is a floor, not a ceiling, by the way.
I'm not saying that when you hit that break-even point, it's time to fire up your code editor and get to work. And if you get there and you realize you don't feel comfortable with your understanding of something in the app's behavior, of course I say you definitely should not write that operator. Somebody out there is thinking, but I have these routine tasks that I need to automate. I don't spend a lot of time recovering from failure of them, but I do spend a lot of time doing them. Shouldn't I automate that with an operator? I'm not saying you shouldn't ever do that, but I think if that's all you're doing, you shouldn't automatically look to an operator as the default way to do it. The other options I covered earlier often work better for routine changes like that. Reason number two you shouldn't write a needed operator: the app itself is not stable in a compatibility sense. Operators are not something you want to be writing for 0.x versions of things with no forward compatibility promises. You can end up in a situation where you have to update your operator every time the application updates, and that's less useful than just adding the code into the application itself and maintaining it there. It might be a third-party application that you can't just do that with, and we'll talk about that in a minute, but in general, avoid, avoid, avoid writing operators for non-stable apps. This is a minimum, not a maximum, again. If that app is at 1.x, it doesn't mean it's actually stable, because not everybody out there is following semver, and not everybody who claims to be actually is. Trust your gut and take anything that looks like a compatibility violation seriously when deciding how you feel about it. Reason number three for you not to write code that needs writing. It's someone else's app and you haven't engaged with the team that maintains it yet.
The app maintainer will generally know best what the dependencies involving that app's statefulness are, and this is critical information you need to write an operator well. You may also find out there's either one already in progress or there are features coming that eliminate the need for an operator entirely, in which case your best move may be to sit tight, stay engaged with the process on that code, and wait for that release. Again, as noted above, you may end up forced into writing an operator in any of these situations by circumstance, but they're all to be avoided whenever possible. Let's say you've looked at all the reasons not to write an operator and all the things you can do besides write an operator and you still need to, or just want to for whatever reason, write one. What does the landscape look like and what should you do when you write it? The first thing you need is a good understanding of CRDs and controllers and to have a look around at the operator building landscape. The Kubernetes documentation has lots of good info on this and even links to some frameworks for building your operator. The ancestral custom controller frameworks are Kubebuilder and Metacontroller. More recently, Operator Framework and KUDO have arrived on the scene to some considerable fanfare, and there are other frameworks both older and newer that you can look at. I'm not going to try to give you a flow chart of which framework to choose, or none, based on criteria A, B, and C, but I will offer some general guidelines for evaluating any framework for fit. Does the framework allow you to write in a language you know well? I would be very skeptical of trying to get started on both a language and a framework at the same time. The tools you know best will tend to be the tools you do your best work with in most circumstances. Many frameworks require you to write in Go, and if you don't have a lot of Go expertise, that's a recipe for bugs and extra stress.
Does the framework's model of the world make sense to you and fit in with the rest of your operations? For example, Metacontroller is a little unusual in that it has you write your custom code as webhooks it can call. If this is how you already do everything, or it feels right to you and doesn't conflict with other requirements, great, Metacontroller may be a good choice. If not, things may change later, and another operator you write may be a perfect fit for Metacontroller then. Are there ecosystem services the framework's maintainer provides around it that you want to take advantage of? For example, consider the Operator Framework's OperatorHub.io here. Although, strictly speaking, you don't have to write your operator with the Operator SDK to get listed on the hub, doing so does make passing the criteria for getting listed basically automatic. If you want to provide an operator for your application to the public easily, these kinds of services may make a difference to you in choosing which framework to write it in. Lastly, when it comes to actually sitting down and writing your operator, I want you to keep three things in mind. Maintain loose coupling. You should not need to update your operator for every update of the application it manages. It's fine if the app gains major new features that need to be managed in the operator, but there should not be a routine need to update the operator as part of an app upgrade. Less is more. This is what, the third or fourth time I've said you should write no code whenever possible? That's because I really, really mean that. Not only should you seek to write as little code in the operator as possible, you should actively push code out of it whenever you can, either in favor of leveraging new features in Kubernetes that make it unnecessary, or into the operator-managed app. Write what you know. It's not just good advice for novelists. It's good for programmers too.
This means not only writing in a language and framework that are familiar and understandable to you. It means beginning with automating the processes that you understand the best and that have the biggest impact. As you work out these cases, you'll learn the application's behavior better and you can add other, lesser-used operations more confidently. But don't forget what I said in the last point. Only write what you need to in the operator. This slide has references and further reading. I know you can't click on the links in a video, but don't worry, there will be a link to these slides coming up. And with that, thank you very much. This is the link I promised, or you can scan that handy QR code right there. And I believe we now have some time for Q&A if anyone has questions.