Hi, my name is Liam White. I'm a senior software engineer at Tetrate, where we help companies adopt and scale Istio. We work with them to solve all sorts of technical and organizational problems, so if that sounds like a problem you have, my contact info will be on the final slide. As it's relevant to this talk, I'm also a maintainer emeritus on the Istio project. Emeritus is a fancy way of saying former. Over the past few years, I've worked on several Kubernetes operators with varying degrees of success, and this video is an effort to clarify my own thinking about what I've learned from those experiences, with the goal that everyone watching can learn from the pain of myself and my colleagues. To do this, we'll take software engineering principles and ground them in the operator ecosystem, to help us design better software and hopefully make better decisions.

So first, let's get everyone on the same page: what is the operator pattern, and what problem is it designed to solve? If we take a look at the Kubernetes documentation, it says the following: "The operator pattern aims to capture the key aim of a human operator who is managing a service or set of services. Human operators who look after specific applications and services have deep knowledge of how the system ought to behave, how to deploy it, and how to react if there are problems." The way I would summarize this is that the goal of building an operator is to codify knowledge that you have about operating a system or application.

This diagram is quite a lot. It's the architecture diagram for KubeBuilder, which is one of the frameworks you can build an operator with. There's obviously a lot of information in it, but at a very high level, what we do is watch the Kubernetes API server for events, and when an event occurs that we care about, we trigger the reconciliation loop that we've implemented. Often the events we're watching are on custom APIs that we write, i.e. CRDs, but they don't have to be. We could watch for changes to Deployments, for example, if what we want to implement is some kind of failover.

There are several libraries in a variety of languages that will automate away a lot of the boilerplate of building an operator so that we can focus on the actual logic, and beyond the obvious programming language differences, there are different approaches taken by some of these libraries. For example, KUDO allows you to build an operator using just YAML. The library I'm most familiar with in this list is KubeBuilder; in fact, the architecture diagram earlier was from KubeBuilder. But this isn't really a discussion about which library to use, it's about whether or not we should be building an operator in the first place, so we're not gonna talk too much about these libraries.

One other quick point of terminology to get straight is operator versus controller. The controller is the part that actually implements the reconciliation logic, and the operator is responsible for running one or more controllers. An operator does other things too, like running validating and mutating admission webhooks, but here we're primarily talking about it in the context of the control loop it's running. I will inevitably use these interchangeably as we go, but at the level of abstraction this video is discussing, it doesn't really matter. Controllers, operators, same thing.
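To make that concrete, here's a hedged sketch of what a custom API for that failover idea might look like. Everything in it, the group, the kind, and all of the fields, is invented for illustration; the point is just that a CRD gives the controller a typed, declarative spec to watch.

```yaml
# Hypothetical custom resource for the failover example above.
# The group, kind, and all fields are invented for illustration.
apiVersion: example.com/v1alpha1
kind: FailoverPolicy
metadata:
  name: checkout-failover
  namespace: shop
spec:
  targetDeployment: checkout   # the Deployment the controller also watches
  minAvailableReplicas: 2      # dropping below this triggers failover
  fallbackCluster: eu-west-1   # where traffic should shift when it triggers
```

The controller would register watches on both this resource and the Deployments it references; a change to either enqueues a reconcile request, and the reconcile loop's job is to converge the actual state of the cluster to whatever this spec declares.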
Those of you who are familiar with the internals of Kubernetes will know that it's just a collection of controllers and operators. At a high level, each resource in Kubernetes has its own controller. For example, when I apply a Deployment resource, the deployment controller is watching for new Deployment objects, and its job is just to create a ReplicaSet. The replica set controller is watching for ReplicaSet objects and creates the relevant Pods. Then the scheduler patches in the node name and off we go; the kubelet takes over at that point. It's effectively operators and controllers chained together all the way down.

So now that we've gone over some of the theory, let's take a look at a concrete example as a case study. The operator we're gonna talk about here is the Istio operator. I'm using this as an example because, first, it's public facing, but also because I was part of the working group responsible for building it. And the issues the Istio operator had reflect experiences I've seen elsewhere as well; they're not unique to the Istio operator. As a quick disclaimer, I did participate in the design phases of the operator, but I didn't contribute meaningful code to the implementation. And although I'm critical of the Istio operator here, the Istio project is pretty good at making tough choices when a technology or decision isn't working for users. The most obvious example of this is when the project moved away from a microservice-based architecture because the complexity overhead users had to deal with outweighed the benefits.

So with that disclaimer out of the way, let's talk about Istio. One of the project values of Istio has always been to optimize for configurability. As a result, you can set up Istio in all manner of weird and wonderful architectures across many different types of workloads. But this makes installing Istio non-trivial for users, because there is an inherent trade-off between configurability and simplicity. It's not linear, but it is causal. So the goal of the Istio operator was to tame some of this complexity by simplifying the experience of installing Istio. The announcement blog summarized it with four points, but for the purposes of this video we're gonna focus on two of them: that all API fields are validated, and that version-specific upgrade hooks can be easily and robustly implemented, which at the time was a significant problem users were having when upgrading Istio.

The operator worked as follows. A user would create a CR that describes their installation. When they applied it to their Kube cluster, it would be validated using a validating admission webhook; this is how we get our type safety. Once applied, the operator would take the CR, render the built-in Helm templates based on it to produce Kubernetes YAML, and then apply that to the cluster. One thing to mention here, which will become relevant later, is that the CR itself didn't actually look that different from its predecessor, which was a Helm values.yaml file. But there is definitely value an operator could provide here. For example, preventing upgrades if you're using Istio resources that have been deprecated or removed in the version you're about to install.
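Here's roughly what one of those installation CRs looked like; I'm reconstructing this from memory, so treat the exact fields as illustrative. Notice how close it is in shape to the values.yaml it replaced:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: example-istiocontrolplane
  namespace: istio-system
spec:
  profile: default             # base set of components to install
  meshConfig:
    accessLogFile: /dev/stdout # mesh-wide settings, much like Helm values
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
```

The admission webhook validates this against the API schema on apply, and the controller then renders the built-in charts from it and applies the result to the cluster.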
If we take a look at how this aligns with the operator documentation, I would say you could argue that the goals of the Istio operator were capturing the knowledge of a human operator. In this case, the human operators are the Istio maintainers, who are providing validation that the installation configuration you've provided is correct, as well as version-specific upgrade hooks. One other thing to mention is that the operator wasn't just providing syntactic validation. We would actually prevent certain types of behaviorally invalid configuration that was syntactically correct, so we were removing some of the footguns there as well.

This is functionality that can't fully be provided by Helm. Whilst there is support for upgrade hooks in Helm, a significant portion of Helm users, myself included, use the helm template command and then deploy the generated Kubernetes YAML. When you use Helm in this way, you don't get any of that hook logic. And this kind of makes sense, right? If I'm gonna flatten everything to a kubectl apply, there is no lifecycle: there's no pre-apply and no post-apply. This is legitimately where building an operator can provide some value.

This would all be fine in a world of no trade-offs. But if there were no trade-offs, we wouldn't have a job, right? So what are the trade-offs, the problems, of moving this type of logic into an operator? The first one is the surface-level obvious one: building any kind of reconciliation loop is easy to get wrong. There are a couple of reasons for this. First, you have to deal with merging configuration. Second, you have to deal with what happens when the operator gets stuck. A human operator can solve problems on the fly; we can usually unstick ourselves. But to encode this in an operator requires us to know about all of the failure modes ahead of time and to tell the operator about them. Depending on the complexity of the thing you're trying to automate, there may not be that many failure states, so this might not be a problem. But with something like Istio, or anything complex that allows users to deploy in a multitude of different architectures and customize to their heart's content, there are a lot of dimensions and therefore a lot of possible failure modes.

Now, we're software engineers; the way we validate this type of thing is we test it. But that brings me on to the second problem: testing this stuff is hard. Testing has gotten a lot better recently. For example, KubeBuilder now comes with a testing framework, but what you're actually testing with that framework is that you create the correct Kubernetes resources; you're testing your interaction with the Kubernetes API, not that the resources or the actions you take will behave the way you want them to. What we end up with is effectively validating syntax. It's slightly more than that, but if most of your logic can't be captured by the Kubernetes resources you're creating, then we can only really test it by doing the full thing. No amount of testing framework is gonna help here: you have to actually trigger your reconciliation loop with the given inputs in a "live" Kube cluster against a "live" system. And I'm using "live" in air quotes here, because what I really mean is that you have to go beyond unit and integration testing if you actually want to validate this stuff.
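To go back to the hook point for a second, this is the kind of lifecycle step you silently lose in the helm template workflow. A minimal sketch, with a hypothetical job and image, of a pre-upgrade hook like the deprecation check mentioned earlier:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: check-deprecated-apis           # hypothetical pre-upgrade check
  annotations:
    "helm.sh/hook": pre-upgrade         # helm upgrade runs this first and waits
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: check
        image: example.com/upgrade-checker:v1   # hypothetical image
        args: ["--fail-on-deprecated"]
```

With helm install or helm upgrade, Helm runs this Job before touching anything else and fails the upgrade if it fails. With helm template piped into kubectl apply, it's rendered as just another Job: no ordering, no gating, no lifecycle.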
Next: in my work as a platform lead over the last few years, the main thing I've learned is that installation is an extremely hard abstraction to get right. The reason is that users have different requirements. They want varying degrees of customizability and have different levels of familiarity with, and understanding of, your application. When you build an operator, particularly one built around ease of installation, you end up reducing a user's ability to customize things. If the user wants a feature, or wants to set a field that isn't exposed by the operator, then you have to build it into the next version of the operator; you can't just edit the binary. Whereas with a Helm chart, a user can just go in and modify it themselves if they really want to.

This was something the Istio operator attempted to solve with a JSON-patch-like, Kustomize-inspired syntax that allowed users to modify any field in the Kubernetes resources after they had been rendered by the operator. The problem is that this style of API is difficult to use. Despite having used it frequently, I still have to go look at the unit tests to get the syntax right. Anyone who uses jq and has to Google, or I guess now ask ChatGPT, to build their query knows this pain, right? And this isn't a critique of the API itself; I don't have a better solution to this problem. It's an inherent complexity of choosing to install using an operator versus just installing via Helm.

In a similar vein, by choosing to use an operator we've added a layer of indirection. Rather than interacting directly with Helm, users are now interacting with it indirectly. Instead of being able to see the rendered changes in a git diff, they can only see the changes to the CR; I don't necessarily know what that's gonna do in my Kube cluster. And for business-critical applications such as Istio, which controls all of your network traffic, making things obvious to users should be a primary goal.

This is where we should start talking about complexity. One definition of complexity I've seen is anything related to the structure of a software system that makes it hard to understand and modify the system. I would tweak this definition slightly by changing "hard" to "harder", since some of the things we're trying to automate are already hard, but I think the definition is useful here. Although this is all somewhat subjective, I believe that in the case of the Istio operator, it was a net increase in complexity for Istio installation once you got off the happy path. And that's the key point, because if your project values configurability, you can't just optimize for the happy path. We have to be able to support the users who decide to configure Istio in whatever weird way they want. And that's all before we consider that we have to maintain, update, and test all of this code. None of this is free, both in terms of engineering effort and the opportunity cost of doing other things.

Now, despite all that negativity, this isn't to say you shouldn't build an operator. It's to help you understand some of the trade-offs. Ultimately, you need to decide whether the value users would get from an operator is worth those trade-offs and aligns with the values of the project or product you're working on.
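For a feel of that overlay syntax, here's roughly what a post-render mutation looked like; this is from memory, so the exact path grammar is illustrative. Even a one-field tweak means writing path expressions against the rendered output:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        overlays:
        - apiVersion: apps/v1
          kind: Deployment
          name: istiod              # the rendered resource to patch
          patches:
          # Paths select into the rendered YAML; [name:discovery]
          # picks the list element whose name field is "discovery".
          - path: spec.template.spec.containers.[name:discovery].resources.requests.memory
            value: "4Gi"
```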
The null hypothesis, though, should be to use something like Helm. You need to be able to justify building an operator. And just like the change from microservices back to a monolith, Istio decided several releases later that the value users were getting from the Istio operator was not worth the trade-offs, and moved back to Helm as the primary installation tool. The operator is still supported, but new features are not prioritized.

So how do we think about whether the value of our operator exceeds the costs we've just described? The primary principle I like to use when thinking about this type of problem is "different layer, different abstraction". This is a concept from the book A Philosophy of Software Design. The idea is that in a well-designed system, each layer provides a different abstraction from the layers above and below it. If we look at the Istio operator, it provides some additional guardrails, but it's pretty much the same abstraction as a Helm install or a kubectl apply. An operator that does demonstrate this principle well is the Prometheus operator. In addition to installing Prometheus, the abstraction it offers is ServiceMonitors and PodMonitors: resources that let you describe the targets Prometheus should monitor. Rather than having to generate a bunch of scrape configuration files in a Prometheus-specific syntax, we're able to describe our targets in a Kubernetes-native way. We're not talking about individual endpoints to scrape; we're providing Kubernetes label selectors. Different layer, different abstraction.

The next principle is whether or not we're building shallow modules. This is another concept from A Philosophy of Software Design, and on the surface it seems similar to "different layer, different abstraction", but this one is about the API design itself rather than the level of the API. A shallow module is one whose interface is complicated relative to the functionality it provides. Basically, the benefit the module provides, i.e. not having to learn about the things it abstracts away, is negated by the cost of learning and using its interface. The JSON-patch-like, Kustomize-inspired post-render mutations in the Istio operator are a good example of a shallow module: the syntax is difficult to use, and we haven't really gained much functionality on top of what Helm gives us. In the Helm world, I can just pull the charts to my machine and edit them to achieve the same effect.

An example of an operator that gets this right is cert-manager. You might not consider cert-manager to be an operator, but if we go back to our definition: cert-manager creates the secrets required to deploy our services, it reacts when certificates are about to expire, i.e. it reacts to problems, and it encodes the knowledge of how to retrieve certificates using the ACME protocol. And Istio itself is a good example of something that follows this principle. I don't know if you've ever attempted to write Envoy configuration and deploy Envoy within a Kubernetes cluster by hand, but that's not how Envoy is designed to be used. Istio allows you to describe your mesh in a Kubernetes-native manner. It handles the sidecar injection, and it modifies the pod network namespace so that traffic gets automatically redirected. All of these things are encoding the knowledge of human operators. Istio is itself an operator.
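To ground the Prometheus example, here's what a ServiceMonitor looks like; the app labels are made up, but the shape is the real API. The operator turns this into Prometheus scrape configuration for you:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout-metrics
  namespace: shop
spec:
  selector:
    matchLabels:
      app: checkout        # any Service carrying this label becomes a target
  endpoints:
  - port: http-metrics     # the named Service port to scrape
    interval: 30s
```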
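And to ground the cert-manager one: you declare the certificate you want and which issuer should fulfil it, and the controller performs the ACME flow, stores the result in a Secret, and renews it before expiry. The names here are made up:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: shop-tls
  namespace: shop
spec:
  secretName: shop-tls       # Secret the controller creates and keeps renewed
  dnsNames:
  - shop.example.com
  issuerRef:
    name: letsencrypt-prod   # an ACME-configured ClusterIssuer
    kind: ClusterIssuer
```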
My final principle is to take off your solution blinders. What I mean by this is that often users or product will come to you with a solution to a problem rather than the problem itself. And once they give you this hammer, everything looks like a nail. Part of software engineering is to clearly enumerate the problem you're trying to solve. So take your time to work out what they're trying to achieve, and only then propose a solution. Force yourself to come up with alternatives to the most obvious solution, for yourself and for the stakeholders. For example, can you get 90% of the way there with a Helm chart? Because a Helm chart will give you more optionality going forwards, and most, if not all, of your users probably understand how Helm works.

So to summarize these principles: different layer, different abstraction; don't build shallow modules; and take off your solution blinders. None of these principles are revolutionary, or even specific to operators. They're all just useful software engineering principles. But I've found it valuable to have them as a mental checklist when I'm reviewing or writing design docs for operators, or really any significant piece of software.

Like I said at the start, if your company is looking for help adopting or scaling Istio, you can reach out to me at this email address. Similarly, if you have any questions about the video, send them my way; you're more likely to get a response on email than Twitter. So use that information as you will. And thanks for watching.