Okay, let's go ahead and get started. So welcome to the Operator Framework maintainer track session here at KubeCon EU 2024. By way of introduction, about who we are and why you should listen to us: my name is Jonathan Berkhahn. I'm a steering committee member for Operator Framework, I'm a maintainer for Operator SDK, and probably some other things I'm forgetting. I'm an open-source contributor who works for IBM. I've worked on Operator Framework, I've worked on Kubernetes, and I've worked on Cloud Foundry back in the day, if you've ever heard of that. And I'm here to talk about Operator Framework with my friend Varsha.

I work as an engineer at Red Hat. I'm also a steering committee member, a contributor to Operator SDK, a maintainer for SDK and OLM, a maintainer of Kubebuilder, and a few other things all around Kubernetes in general. So here we are to talk about Operator Framework.

So, show of hands: who knows anything on the slide? Who here has heard of operators? Hopefully that's most of you; otherwise I really don't know why you're here. Operator Framework, Operator SDK, Java Operator SDK, OPM, OLM: all of those are the various things that Operator Framework maintains. That's what we're going to be talking about, focusing mainly on Operator SDK and Operator Lifecycle Manager, because those are our two main big projects.

Just in case, for those of you who didn't raise your hand when I asked if you knew what an operator was, let's go over that very briefly. An operator is an architectural pattern for writing software that runs on top of Kubernetes, and the special thing about it is that rather than creating a whole bunch of Kubernetes YAML myself that statically describes a bunch of stuff that runs in pods on Kubernetes,
I am going to teach Kubernetes to fish by extending its API. So I'm going to write a whole bunch of complicated stuff, and the end result is that, the same way you can kubectl create a pod and that makes a pod, a container, come into existence somewhere in the back end, I'm going to extend the API so I can kubectl create my thing, and that will make my thing come into existence somewhere in the back end. And that can be as complicated or as simplistic as I want to make it. It could be a whole bunch of pods coming into existence that network together, talk to each other over services, connect to off-cluster resources, and who knows what else. So that's really the idea of an operator.

And like I said, Operator Framework is a CNCF incubating project that makes a variety of tools that make this process simpler, I promise. The main ones are Operator SDK and OLM; those are the big ones you're probably familiar with. We also make a couple of other ones, like OPM for packaging operators, Scorecard for testing operators, and a couple of other things like that. But today specifically we're going to be talking about the things that are new and exciting. I'm going to be talking about the Operator SDK side of things, and then Varsha will be discussing the advancements going on in OLM.

So on the Operator SDK side, things have been more or less stable. We've upgraded to the most recent version of Kubebuilder, which means we've moved to v4 plugins, if you're familiar with how Kubebuilder is internally architected. We've deprecated support for the old v3 stuff, and this is mostly so we can keep using the most recent versions of Kubernetes itself and Kustomize for crunching our YAML together.
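To make the "kubectl create my thing" idea concrete, here is a minimal sketch of what extending the API looks like. The group, kind, and field names are hypothetical, not from the talk; an operator watching `MyThing` objects would do the actual work:

```yaml
# A hypothetical CRD that teaches Kubernetes a new "MyThing" kind.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: mythings.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: MyThing
    plural: mythings
    singular: mything
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: integer
---
# With the CRD (and an operator reconciling it) installed,
# "kubectl create -f" on this manifest makes "my thing" come into existence.
apiVersion: example.com/v1alpha1
kind: MyThing
metadata:
  name: demo
spec:
  size: 3
```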
This is just to make sure we maintain compatibility with new versions of things moving forward.

Java Operator SDK recently joined the project, and they are doing a lot of active development on version 5.0 of their stuff. And I believe, theoretically, they promised me that as of right now, today, the snapshot release of that is available if you want to go try it. Basically, this is an opportunity to implement a bunch of breaking changes that have been piling up over the past couple of months. For more details you can click on that link and see some milestones, but let's briefly talk about some of the stuff they've done.

So again, dependency bumps: they've moved up to Java 17 and Fabric8 7.0, just keeping up with things. And then in terms of actual features, let's talk about some of the stuff they've got going on. They have been going through their code and changing things over, where they can, to use server-side apply, particularly in handling updates to statuses and handling finalizers. This is actually something I don't think we're up to yet on the Go side, just because server-side apply would change a lot of stuff in the back end. So that's cool.
They've also introduced some finer-grained control over how they handle reconciliation loops. You can check whether the next reconciliation is imminent, and generally have finer-grained control over how you return from your reconciliation loops in the Java controller.

They've also improved their handling of dependent resources. This is how, if you have a CRD, you can have dependent resources attached to it so that Kubernetes knows they belong to you. They've made some improvements to how that's handled, with some notable use cases they don't currently support, such as recreating external resources, or support for read-only and bulk setup. This mainly concerns off-cluster resources that are represented in the Kubernetes etcd but don't actually correspond to resources that exist on cluster; that's still handled manually.

Oh, right, and they added support for optional dependent resources. This is, say, my operator depends on some other CRD type or some other operator that may or may not exist on the system. Their example use case for this is Routes, which are specific to OpenShift. I can declare that when you create an instance of my CR, it goes and makes a Route happen if and only if Routes exist. This also eases some of the dependency awkwardness: an operator that depends on other operators can be hard to deploy, and this lets it not explode if stuff isn't there. It's a quality-of-life thing. So that's an example of that, where I create a Route if Routes exist.
Otherwise, I just make some Ingress stuff instead.

They also did some improvements to workflows. When I talk about workflows here, I mean Java Operator SDK workflows, not the Kubernetes workflows stuff that Google makes. They've extracted these into their own separate annotations and support some additional finer-grained control, so that workflows can be explicitly invoked from within your controller. If you click through on those links in the presentation, you'll see some code examples of those. And with that, I'm going to pass it over to Varsha to talk about OLM.

So, Operator Lifecycle Manager. Before getting into what's new with OLM, let's have a brief overview of what OLM is. I know many of you are familiar with the benefits of OLM, and with the troubles you're facing with it. We are hoping we can solve most of those, and I'll be going through the changes we are planning to make. The way I imagine OLM is: you have an application, and you have an operator which manages the lifecycle of the application. But who manages the lifecycle of the operator? That's what OLM does. Operator Lifecycle Manager helps you manage the lifecycle of an operator itself. It provides a few benefits, like the whole concept of catalog management; taking care of upgrades by enforcing CRD upgrade safety rules and by providing an extensive testing framework; and taking care of dependency resolution, which is very common when you build applications, where you want operator A to depend on operator B, which in turn depends on some other packages, and so on and so forth.

So, what's new with OLM? We are trying to build a whole new architecture, learning from the mistakes we have made in the past and taking in all the feedback we have from the community. And we have a few other interesting pieces of news, which I'll go over. So, OLM is moving to a new set of APIs. Now, why a new set of APIs? Why wasn't v0 just enough?
So, part of the reason is that the existing APIs, if you are aware of the whole InstallPlan and Subscription model, were too much for anyone who is new to OLM. It was very difficult for folks to get onboarded in the old architecture and get used to this concept of an InstallPlan, having to look at the status of multiple objects and aggregate them, and so on and so forth. So it was time for us to build a new architecture based on the needs of the community. Also, OLM was designed when CRDs were still in beta, so a lot of the things we had in mind while initially designing OLM are no longer relevant or prevalent. That's what we are evolving from, learning from the changes being made in Kubernetes in general.

So, learnings from v0. Today, OLM v0 has multiple user-facing APIs. A very simple example, without going into the architecture: you have an InstallPlan, which holds the list of resources that are going to get installed on the cluster; you have a Subscription, where you subscribe to a particular package to get updates; and the operator author has to build a bundle in a particular opinionated format called registry+v1. Now, what does the future of OLM look like? We'll be moving away from all this to a single top-level user-facing API. That way, as a user, if you want to install a particular package which has been approved by your cluster admin, you go in, you create a single API object, and it's a one-click solution to get everything installed on cluster.

The second major issue was that a few of the architectural decisions in v0 were partially imperative, so in v1 we want to bring in a fully GitOps-friendly approach and a one-click solution for everything. This also dovetails with the third point, which is having continuous reconciliation and the concept of rollbacks, because things sometimes break.
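For reference, the v0 Subscription just described looks roughly like this (a typical example, not one shown in the talk); OLM then generates the InstallPlan, with its multiple aggregated statuses, behind the scenes:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: my-operator
  namespace: operators
spec:
  channel: stable                 # channel to track for updates
  name: my-operator               # package name in the catalog
  source: operatorhubio-catalog   # CatalogSource to pull from
  sourceNamespace: olm
  installPlanApproval: Automatic  # or Manual, gating each InstallPlan
```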
Things are unpredictable, and v0 had a lot of problems in terms of allowing users to roll back to a previous version. Upgrade paths were clearly defined, but rollback was always complicated; it is in general. But v1 is trying to find ways to solve it.

The next one: v0 always let you upgrade to the latest version available in a channel, but in v1 we are trying to bring in the whole concept of an upgrade path, where you can choose the path you want your users to take. If you want folks to move from version 4.3.1 to, say, 4.4.2 based on the changes you made on a stable channel, then it's up to you. You can create your own upgrade path, rather than forcing users to upgrade to a particular version and making the whole thing more complicated.

The most important one, which I want to get into in detail, is that v0 had a very opinionated packaging format; there's another slide expanding on that particular point. If you're familiar with v0, we had this concept of registry+v1 bundles, which had this particular API called ClusterServiceVersion. Any operator needed to be packaged in that specific OLM-compatible format, no matter whether you already had it in the form of a Helm chart, in the form of plain manifests, or things like that. So there was a lot of additional work an operator author had to do to onboard their operator into an OLM v0 catalog. That's something we are trying to make easier with v1. We are trying to extend support in a way where, no matter how you package your bundle, your operator, you just tell us what to install and OLM does it for you.

Moving to v1, here is how the overall architecture will look. OLM v0 was monolithic: everything was bundled into one very big chunk of code, and it was difficult for us to even predict what was going to happen. So with v1, we are trying to break things into a set of focused and scoped components. We'll introduce these four components.
The first one is a user-facing API. This is where a user can go and say, "I want to install the Argo CD operator of this and this version." It'll be straightforward; it will contain the details OLM needs to get an operator installed on the cluster from a catalog.

The second one is catalogd. It's a package server: basically a controller running on the cluster which ensures that the operators present in the catalog are available on cluster so that users can go and install them.

The third one is deppy. We are working on a dependency resolver to make things easier, because installing operators is not straightforward. You have complex operators where one thing is dependent on another; for example, an operator X might depend on the cert-manager operator. So we need to make sure the dependencies are resolved, so that it's a one-click solution: if you are installing operator X, all the dependencies in that tree get installed too, and the user need not worry about making sure that nothing breaks.

The fourth one is an applier, and here we have some interesting news: we are also trying to make OLM compatible with kapp-controller. kapp-controller is one of the sub-projects of Carvel, so do check it out. It's also a CNCF open-source project, and it has a bunch of amazing features OLM could use to make sure the contents are applied on the cluster. Currently we have support for rukpak, which is an in-built applier present in the repository, and we are also working on integration with kapp-controller.

So how does the overall process look with respect to OLM? This is from the user's point of view.
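The "install X plus everything in its tree" behavior the resolver is responsible for can be sketched, very loosely, as a transitive walk over declared dependencies. This is a toy model for intuition, not deppy's actual algorithm, which also handles versions, channels, and conflicts:

```python
def resolve(package, deps, installed=frozenset()):
    """Return the list of packages to install, dependencies first,
    so that `package` and all of its transitive dependencies are present.
    `deps` maps a package name to the packages it requires."""
    to_install = []
    seen = set(installed)

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in deps.get(name, []):
            visit(dep)        # install dependencies before the dependent
        to_install.append(name)

    visit(package)
    return to_install

# Installing operator X pulls in cert-manager automatically:
deps = {"operator-x": ["cert-manager"], "cert-manager": []}
print(resolve("operator-x", deps))  # ['cert-manager', 'operator-x']
```

Anything already installed is skipped, which mirrors the "is it safe / is it already satisfied" pre-flight idea described below.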
So a user creates an API object, and there's catalogd running on cluster, which takes in that API. For example, the CR could say, "I want to install package X from this particular catalog." What catalogd gives you is the contents of the catalog, made available on cluster, helping the user choose the operator they want to install. Then we have dependency resolution. Dependency resolution basically involves OLM running a set of pre-flight checks to ensure that if operator X is being installed, nothing breaks on the cluster. Is it safe to install operator X? Is it safe to upgrade operator X? The dependency resolver gives us a signal, yes or no, whether we can install it. After OLM reads the signal, we currently use rukpak to install it, and we have work going on with respect to kapp-controller, where we will create one of the APIs kapp-controller provides, known as the App API. OLM creates an App CR, and then things become available on cluster, because kapp-controller is also running in the background.

So how does the overall workflow look in the new architecture?
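The App CR handed to kapp-controller would look something along these lines. The field values here are illustrative, and the imgpkg bundle source in particular is an assumption; kapp-controller supports several fetch sources:

```yaml
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  name: operator-x
  namespace: operators
spec:
  serviceAccountName: operator-x-installer
  fetch:
    - imgpkgBundle:
        image: registry.example.com/operator-x-bundle:1.2.3
  template:
    - ytt: {}
  deploy:
    - kapp: {}   # kapp-controller applies the rendered manifests
```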
So the operator author comes in. They create a bundle, packaged in whatever format, whether it's registry+v1, whether it's a set of plain manifests, and so on, and they add it to a particular catalog. This is very similar to how v0 works right now. But then catalogd, which is running on cluster, takes in this catalog image, unpacks it, and makes sure the individual packages are available on cluster for OLM to pick up and install as required.

Next we have the next persona, which is the user. The user comes in and creates an API object; I'm not going into the names, but keeping it generic, because everything is still in progress. The user-facing API basically tells OLM that the user wants to install package X, constrained to a particular version, from this particular channel. So what does OLM do then? It takes the packages available from catalogd, runs dependency resolution, and gets a yes-or-no signal. If it's a no, it gives out an error telling you it can't install the package because you already have some API on cluster which is going to cause conflicts, or it says, hey, no packages are available from the catalog that match your requirements. After that, we create an App CR, and this is when kapp-controller kicks in: it reads in the contents of the App CR and makes sure the contents are applied on the cluster and the operator is available.

So these are the three different personas: we have the persona of OLM itself doing an action; we have the persona of the operator author, where they make things easier in terms of bundling their operator and providing it into the catalog on cluster; and then we have the kapp-controller integration, which basically takes things in and applies them on the cluster.

Another interesting use case: Helm is all over the place.
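The single user-facing object in that workflow might look something like this. As noted above, the names are still in progress, so treat the kind, group, and field names here as placeholders rather than the final API:

```yaml
# Hypothetical shape of the single top-level v1 API: "install package X,
# constrained to a version range, from this channel".
apiVersion: olm.operatorframework.io/v1alpha1
kind: ClusterExtension
metadata:
  name: operator-x
spec:
  packageName: operator-x
  channel: stable
  version: ">=1.2.0 <1.3.0"
```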
Helm is an amazing package manager, so it's very common that users who build operators already have their operator packaged as a Helm chart. With v0, in the current scenario, if a user wants to enjoy the benefits of OLM, they would have to take the Helm chart, convert it into a registry+v1 bundle, and then apply it on cluster. With OLM v1, we are working towards an approach where users will be able to install packages from a Helm chart directly, and at the same time get the benefits of OLM in terms of upgrade safety, dependency resolution, and things like that.

The next part is the cluster extension controller, and this is the heart of OLM. The cluster extension controller is the engine which watches the user-facing API and reads in what the user provides. It calls the dependency resolver, bringing in contents from catalogd for the dependency resolution; it makes sure the App CR, or any other API, is created; and the applier does its job by applying the contents to the cluster. It also reads back the status of whatever action the applier has taken and reports it to the user. So this is the heart of OLM: the cluster extension controller, or the extension controller, which we are working on.

So, yeah, those are the updates about OLM v1 in general. Please do let us know how you feel about the new architecture, if there are any changes you would like to see, and things like that. We are very active upstream: we have a Slack channel called #olm-dev for OLM, and for SDK we have #operator-sdk-dev. We welcome community contributions, because this v1 architecture is basically built on the feedback the community has provided to us. Any questions? We'll be taking questions now. Show of hands.
I'm gonna be running around with the microphone, so please ask your question into the microphone so it's on the recording. All the way in the back here.

Excuse me. Thank you for your talk, and thank you for the development. I was wondering if you have the elements to do rollback. You know, when you upgrade your operator and you want to go back to the previous version of the operator, is that something in your version?

So with OLM, we have not figured out the rollback feature yet, but the concept we are thinking about is that we have this cluster extension controller, or the OLM controller, that kicks in, makes sure the downgrade is also safe by doing a set of pre-flight checks, and rolls the operator back to the previous version. So this feature is not in yet, but we have discussions and design documents around it. Also, the messiest part of doing something like that is probably going to be writing the conversion. Okay, just so you're aware. Anyone else? You could also use this opportunity to talk about what you all don't like in OLM v0.

Hi. We are running the controller framework in a large-scale scenario and we have quite a few resources, and I was wondering if something like sharding is planned, or anything you've considered or looked into. Because even if reconciliation takes a longer time, you want to do that in parallel, maybe on multiple resources at the same time.

I don't quite follow when you say sharding. Do you mean sharding like having multiple controllers that are responsible for reconciling different portions of a single resource? Okay. Well, to keep things simple, the short answer is no. I'm not aware of any attempts to implement something like that right now, especially because that would necessitate changes on the Kubernetes side of things. Like, how do you
generate a watcher that only watches a certain part of etcd for a single resource? And we don't control that, because we live in Kubernetes land, and Kubernetes does its stuff and we use it. That said, if I really wanted to do that, how would I do it? You could probably do something with namespaces, where the individual resources are namespace-scoped and you have multiple controllers that each watch only their own slice. Now, I can imagine all kinds of horrible ways that would blow up, but the general principle is to not let multiple controllers handle the same resource. We have SSA, which takes care of making sure that only a particular part of the spec is handled by a particular owner, but still.

But isn't there another way? Can't you do a filter on the cache and have each controller have a different cache, so it doesn't actually know about the other resources and only receives the updates for the ones you're filtering on?

Don't ever do that.

Okay, do we have any other questions? Okay, well, if not, thank you all for coming. Varsha and I will be around to talk if you want to chat about anything about operators. Thank you all.