My name is Manuel. I work on the OpenShift Dedicated SRE team, where we build, run, and maintain a bunch of different operators to help us with our mission to keep customer clusters alive. In our case you can group those operators into two parts: operators that are involved in provisioning new clusters, and operators that are deployed to every single cluster and run on all the production clusters that we have. So as you can imagine, that's quite a number of different operator repositories to maintain. And when you maintain that many different operators, at some point you figure out what works when building operators and what doesn't. That's what this session is about. I want to share some of the best practices that evolved while we built and maintained those clusters. The number of operators is still growing, so there are new things to come, and hopefully I can help some of you avoid repeating the mistakes we made. This talk is split into two parts. First I'll give a quick introduction to operators, but not a too deep one. Then we'll talk about the best practices we found when building and maintaining operators in OpenShift Dedicated.

So first, let's think about a real-world example for an operator, and for that I want to introduce to you the famous barbecue operator. Imagine the job of a pit master: what do they do when a new patty arrives? They find a nice spot on the grill, and when the patty is getting color on the bottom, the pit master turns the patty around so it can get color on the other side as well. So what the barbecue operator, the pit master, is doing is guiding the patty, the resource it's managing, through several different stages. And I actually have a small state diagram of a patty in case you're not familiar with it.
So the first state is raw, arriving on the grill, and then the patty starts to get color on the bottom side. Then the patty gets turned around by the pit master and left there again until it's well done. Then it's removed from the grill and put on a bun, and finally there's probably some cleanup done by the pit master, like cleaning the grill. So the pit master is continuously watching the state of each patty to make sure it doesn't burn, and when they figure out that something needs to be done, then they operate. And if there's nothing to do, what does the barbecue operator do? Right: relaxing, drinking beer, doing nothing, but closely observing the state of all the resources it's managing. As a pit master you can't just turn the patty around all day long; you need to think, observe, and only take action when there's really something to do.

This is essentially what we call the reconcile loop in Kubernetes operators. When there's nothing to do, the operator sits there and watches resources. When an update happens or something changes, the operator takes action, and then goes back to the beer when there's nothing left to do. Now just replace patties with Kubernetes objects like deployments, pods, custom resources, whatever you want. The operators that you build for those resources do the same: they observe resources and update them, or update something else, until the actual state matches the desired state.

That brings us to a bit of a schoolbook definition. An operator allows you to operate on Kubernetes objects; those can be built-in Kubernetes objects, but they can also be custom objects. So you can build your own custom resources, defined by what we call a custom resource definition, and they are handled by the Kubernetes control plane just as the usual built-in types are handled. And you can use those operators.
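As a toy illustration of that loop, the patty's journey could be sketched as a tiny state machine in Go. The state names and the reconcilePatty function are invented for this example and not part of any real operator API:

```go
package main

import "fmt"

// PattyState models the stages from the state diagram (made up for
// illustration): raw, browning, flipped, well done, on the bun.
type PattyState int

const (
	Raw PattyState = iota
	BrowningBottom
	Flipped
	WellDone
	OnBun
)

// reconcilePatty moves the patty one step toward the desired state and
// reports whether it actually did anything. If the actual state already
// matches the desired state, it does nothing: the beer-drinking case.
func reconcilePatty(actual, desired PattyState) (PattyState, bool) {
	if actual >= desired {
		return actual, false // nothing to do
	}
	return actual + 1, true
}

func main() {
	state := Raw
	for {
		next, acted := reconcilePatty(state, OnBun)
		if !acted {
			break // desired state reached, operator goes idle
		}
		state = next
		fmt.Println("patty moved to state", state)
	}
}
```

Called again with the same actual and desired state, the function does nothing, which is exactly the "observe, and only act when there's something to do" behavior described above.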
For example, if you're deploying a complex application that doesn't fit the definition of the built-in Deployment resource, you can build your own software that brings in all the pieces, for example external dependencies plus some pods created on the cluster, and have your users create custom resources that your operator then acts on to deploy that application. In the end, what operators are used for when managing Kubernetes clusters is automating operational tasks. Previously, if you didn't have an operator, you probably went to the cluster and deployed your application yourself, creating all the external dependencies and so on. And if you want to automate that because it happens many times a day, for example, you can do that with an operator. If you haven't been doing all of that manually, you probably had a bash script to do it, which brings me to this statement: operators are the bash scripts of Kubernetes. And that's what it seems like, right? It's just the cool version of bash scripts, automated now with Go code or something like that.

But there is a substantial difference between bash scripts or manual actions and what operators do. Bash scripts usually don't run unattended, unless they run on an external server like Jenkins, and they also don't run within the cluster but outside of it, on an administrator's machine or on a Jenkins instance, for example. Operators, on the other hand, run inside your Kubernetes cluster, and they run unattended, just like every other business application you deploy to your cluster. That means they bring the same complexity and the same maintenance effort as any business application you deploy to the cluster, and that in turn means the readability of your code is as important as in any production software you have.
And that readability is probably easier to achieve with some Go code than with a bash script. I mean, I have seen operators that look almost like bash scripts, but you should avoid getting into that state with your operator.

Okay, let's take a closer look at the reconcile loop. For an operator, in most cases you have some input, which is either a built-in Kubernetes object or your own custom resource. In our example, let's say we have the custom deployment we talked about before, so the input is that custom resource. When that resource appears on the cluster, the operator will notice that a new resource has arrived, or that the resource changed, and will reconcile the resource until the actual state matches the desired state. The output of the operator is in many cases additional Kubernetes objects like Deployments, DaemonSets, or pods. It may also be some external dependency, like a resource on AWS. Let's say, for example, the application we deploy depends on an RDS database that we need to create; we can do that within the operator. And when that's done, the operator goes back to an idle state, waiting for other resources to appear or for updates to the watched resources.

So this is quite a nice thing to implement for complex deployments or other use cases. And speaking of other use cases: when you write one operator to do something and you find it useful, you will get to the point where you think, let's write another operator. That brings me to this statement: operators are pack animals. As I mentioned, in OpenShift Dedicated we have multiple different operators, and when you start automating things with operators, just as with bash scripts, you will end up having many of them. Does that mean a Kubernetes operator is the right choice to automate everything in your cluster?
Probably not, but it will help you get started quickly with a new project, and it gives you for free the ability to use the Kubernetes API with your own custom resource, just as you can use it for every other Kubernetes object. That means everybody who is able to create a Deployment on Kubernetes will be able to interact with the API of your operator, which is the custom resource. You can also use the authorization mechanisms that you already have in Kubernetes with your own custom resources, like granting some users access to create or change your custom resource type and denying it to others. You get all of that for free when building an operator, and that's quite nice.

Over time, different frameworks have evolved that let you build operators quickly by generating some boilerplate code for you. And when you're maintaining many operators, I can only encourage you to make them look familiar. Choose a framework you like, for example the Operator SDK, and stick to it. Then, when you're familiar with one of the operators your team maintains and you need to change something in another operator, you don't have to learn how that operator works from scratch; you already recognize familiar parts from the operators you know.

So the operator design looks quite similar to any other application, right? What I always say is that the entry point to the reconcile loop (when you use the Operator SDK, that's the Reconcile function) is kind of the main function of your code, where you put all your business logic. From within that business logic you reach out to external libraries, like the Kubernetes client to create a pod, or the AWS library to create RDS databases, or whatever else you need from other libraries. This is all very similar to production software, and that means operators actually are production software.
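A rough sketch of that shape in Go might look like the following. The CustomApp type and the cluster struct are made-up stand-ins; in a real operator the equivalents would be a CRD-backed type and wrapped client-go and AWS SDK clients:

```go
package main

import "fmt"

// CustomApp is a hypothetical custom resource; a real one would be a
// generated Go type backed by a custom resource definition.
type CustomApp struct {
	Name     string
	NeedsRDS bool
}

// cluster fakes the actual state of the world; the real thing would be
// the Kubernetes API plus AWS, reached through wrapped clients.
type cluster struct {
	deployments map[string]bool
	databases   map[string]bool
}

// reconcile drives the actual state toward what the custom resource
// declares: it creates the Deployment and the external RDS database if
// they are missing, and does nothing otherwise.
func (c *cluster) reconcile(app CustomApp) {
	if !c.deployments[app.Name] {
		c.deployments[app.Name] = true
		fmt.Println("created deployment for", app.Name)
	}
	if app.NeedsRDS && !c.databases[app.Name] {
		c.databases[app.Name] = true
		fmt.Println("created RDS database for", app.Name)
	}
}

func main() {
	c := &cluster{deployments: map[string]bool{}, databases: map[string]bool{}}
	c.reconcile(CustomApp{Name: "shop", NeedsRDS: true}) // creates both
	c.reconcile(CustomApp{Name: "shop", NeedsRDS: true}) // no-op: state already matches
}
```

The second call doing nothing is the idle, beer-drinking state of the operator: the actual state already matches what the resource declares.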
And even if you think you're just creating a better bash script, just replacing the script you have with an operator, I can promise you your operator will grow and you will need to maintain it. So it's better to develop your operator in a clean way right from the beginning. The first step is to make yourself aware of the architecture boundaries in your operator when you start designing it. When we go back to that picture, you need to know that there are architecture boundaries around your business logic. It's obvious that the external libraries you use sit behind an architecture boundary, since they are another system, but in the same way there is an architecture boundary between the framework you use to generate boilerplate code for your operator and the business logic itself. As in every software project, you will want to wrap the external dependencies, so you can make them fit the design and architecture of your software, exchange them easily, and write tests more easily by mocking them. And in the same way that you wrap the external libraries, you should also try to wrap the boilerplate code generated by the Operator SDK, to make it easier to upgrade or exchange the framework and to be able to mock it.

Okay, next I have a hint for you, one best practice, and that is: reconcile carefully. When you look at the reconcile loop, what you will often see is that while reconciling a resource you actually change the resource itself. Think about the patty again: when the pit master turns the patty around, the state of the patty has changed. That change will be picked up by your operator's framework, and it will probably trigger another run of the reconcile loop.
So you need to be aware that whenever you update the resource itself, there will be another run of the reconcile loop, and if you have such requests running in parallel, you can even run into nice race conditions between two competing reconcile runs. Essentially, you need to be aware at any time that there can be another reconcile run: someone may have changed the resource manually, or after a timeout there is another run of the reconcile loop. The best thing to do is make sure your Reconcile function is idempotent, so you know that when the same function is executed twice on the same resource, and the state is already as expected, the outcome is the same and there are no side effects. One easy way to achieve that is to split up the Reconcile function. Often you will have many different steps executed in that function, and you should split it into subroutines that can be executed independently and are idempotent on their own: they first check whether there is anything to do, perform the action if needed, and then return.

Another thing I want to encourage you to do is write tests, and write your tests early. That's no different for operators than for other projects; I just want to mention it here again because sometimes you think, I'm just automating this simple thing, do I really need to write a test for it? And I think you should. That's not only because tests help you find bugs, which they do, even though they still won't prevent every bug from sneaking in. They will also help you wrap external dependencies: when you start writing your tests early, you will get to the point where you want to test something that touches an external dependency, and you will need to wrap it in order to mock and test it. That means the goal of wrapping external dependencies is reached quite easily when you already start with writing tests.
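The subroutine split described a moment ago could be sketched like this in Go. The app struct and the ensure functions are hypothetical stand-ins for real cluster objects and API calls:

```go
package main

import "fmt"

// app fakes the state a reconcile loop would manage on a cluster.
type app struct {
	deploymentExists bool
	serviceExists    bool
}

// Each subroutine follows the same shape: check whether anything needs
// to be done, act only if so, and be safe to run any number of times.
func ensureDeployment(a *app) bool {
	if a.deploymentExists {
		return false // already in the desired state: no side effect
	}
	a.deploymentExists = true
	return true
}

func ensureService(a *app) bool {
	if a.serviceExists {
		return false
	}
	a.serviceExists = true
	return true
}

// reconcile composes the independent, idempotent steps and reports
// whether any of them changed anything.
func reconcile(a *app) bool {
	changed := false
	if ensureDeployment(a) {
		changed = true
	}
	if ensureService(a) {
		changed = true
	}
	return changed
}

func main() {
	a := &app{}
	fmt.Println("first run changed something:", reconcile(a))  // true
	fmt.Println("second run changed something:", reconcile(a)) // false
}
```

Because every step checks before it acts, a second reconcile run triggered by the first one's own updates is a harmless no-op instead of a source of races.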
And another reason why I say start writing tests early: when you have tests for your code early in the development phase, you will automatically end up with small, readable functions that are easy to test. When you start writing tests late, you often end up with sprawling functions with long parameter lists, and you think, okay, this is hard to test. Then either you write fewer tests than you would have wished because they are hard to write, or you write no tests at all and say this code is untestable, I won't do it. Writing tests early helps you start out with small, testable functions with short parameter lists, few return values, and wrapped external dependencies. And essentially it will help you avoid the anti-pattern of overstuffed Reconcile functions that you can find in some operators.

Tests also give you, as a developer, the confidence that what you changed didn't break the base functionality. If you have a meaningful test suite, you can be reasonably confident that your refactoring or your newly added feature doesn't break what's already there, in addition to knowing it doesn't break what you just built, because you wrote a test for it. And that's important for operators that run on a bunch of clusters: when I have something I want to deploy to production, to all the customer clusters we have, I want all the confidence I can get that my change doesn't break those clusters. What helps me get that confidence is a good test suite. So basically this is the main point I have for you: the best code you write is the code you use yourself, and you are the first user of your code when you write tests for it, and when you write the tests early.
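As a sketch of how a wrapped dependency makes this kind of testing possible, here is a made-up DatabaseProvider interface with a fake implementation that a test can assert against without ever touching AWS. None of these names come from a real library:

```go
package main

import "fmt"

// DatabaseProvider wraps the external AWS dependency behind an
// interface owned by our business logic; all names here are invented.
type DatabaseProvider interface {
	EnsureDatabase(name string) error
}

// rdsProvider is where the real AWS SDK calls would live in production.
type rdsProvider struct{}

func (rdsProvider) EnsureDatabase(name string) error {
	// the real implementation would call the AWS API here
	return nil
}

// fakeProvider records calls so a test can assert on them without AWS.
type fakeProvider struct{ created []string }

func (f *fakeProvider) EnsureDatabase(name string) error {
	f.created = append(f.created, name)
	return nil
}

// provisionApp is business logic that only ever sees the interface,
// so tests can hand it the fake instead of the real provider.
func provisionApp(db DatabaseProvider, app string) error {
	return db.EnsureDatabase(app + "-db")
}

func main() {
	f := &fakeProvider{}
	if err := provisionApp(f, "shop"); err != nil {
		panic(err)
	}
	fmt.Println("databases the fake saw:", f.created)
}
```

The business logic never knows whether it is talking to AWS or to the fake, which is exactly what wrapping the dependency buys you.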
And this will also automatically help you achieve all the goals I mentioned before, like wrapping dependencies, having readable code, and having small functions; all of that comes automatically when you start writing your tests early.
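To close, here is the kind of small, pure function that tends to fall out when you write tests first. The desiredReplicas function and its rules are invented for illustration; the point is the shape: a short parameter list, one return value, and no hidden dependencies:

```go
package main

import "fmt"

// desiredReplicas computes how many replicas an app should run.
// Invented for illustration: trivially testable because it depends on
// nothing but its inputs.
func desiredReplicas(specReplicas int, paused bool) int {
	if paused {
		return 0 // a paused app scales to zero
	}
	if specReplicas < 1 {
		return 1 // never drop below one replica by default
	}
	return specReplicas
}

func main() {
	fmt.Println(desiredReplicas(3, false)) // 3
	fmt.Println(desiredReplicas(3, true))  // 0
	fmt.Println(desiredReplicas(0, false)) // 1
}
```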