Welcome everyone, thank you for joining us. Come on in, there are some additional empty seats. Well, a few of them, although they're going. Are these reserved for somebody, or are they for us? What does that say? There are a few more seats if you wander up this direction. Okay, we've got reserved ones, there's one here. We've got a lot of friendly people here at KubeCon, so make some friends. Maybe raise your hand if you've got an empty seat next to you and you're interested to meet a new KubeConner. But we've got tons of them up here. Come on in, make room. We're going to talk about CERN and what they've done with the Operator Framework to make their lives easier. I am Michael from Red Hat. This is Varsha; we're Operator Framework contributors and maintainers over various times. We also have here with us Francisco and Rajula; they're SREs at CERN. And we're going to talk about how we came together — the operator they created caught our eye. So they're going to introduce what they've done, we'll discuss amongst ourselves some observations about their work and their experience, and then invite you to ask your questions and join in that discussion in the latter portion of this. So thank you again for coming, and Rajula, please take it away. Just a great bit of intro about CERN. So we're the European Organization for Nuclear Research. We are the world's largest particle physics laboratory, home to the biggest particle accelerator. This is one of the detectors where we observe collisions of subatomic particles; this is called ATLAS. Just to give you a bit of a glimpse, this weighs around 7,000 tons, which is roughly the weight of the Eiffel Tower. This is the collider, the LHC: it's around 27 kilometers in circumference. And you see Lake Geneva and the Swiss Alps in the background. And it's not just physics that we do. CERN is also the place where the World Wide Web was invented, and we actually do a bit more than just that.
So we have a lot of member states around the world, and the main source of collaboration and communication happens through websites. So we host a lot of websites. We specifically come from the team that hosts publicly available websites like home.cern and atlas.cern. Just a quick glimpse of our infrastructure: we manage content management systems, basically Drupal instances, so that users can create, update, and delete content without a lot of technical knowledge. We want each of these instances to have a life cycle, to be autonomous and isolated, and we also want to enforce all the dependent resources. At the same time, instances should also be integrated with CERN-specific stuff like SSO, databases, and backups. So why operators? The first reason would be time saving, because we wanted to create and delete sites on the fly. At this point, we have around 1,600 websites, of them 800 or 900 production websites, and we create previews and test sites on the fly. And standardization: we wanted to reduce inconsistencies and errors from manual creation. With the operator, we have a standard configuration and a uniform flavor of quality of service. And scalability: we want to scale up. Say there's a new version of Drupal and we want to create a preview for all the production websites — we can just double the number of sites, and we have quite a large number of instances, like I mentioned. And access control: we also want fine-grained access control over what the users can do and what they can access on the site. And self-healing, of course: if someone deletes some of the resources, the operator should be able to heal everything and bring everything back to normal. This is the user portal that we have as the interface to the operator. Most of our users are non-technical users, so they use this interface: they just mention the URL they want and the version they want.
And this translates to a custom resource, which is basically just basic stuff: it spins up a site with a database, with integration with SSO and everything else. Is the operator publicly available? So a lot of stuff in our operator is specific to us right now. We use a lot of our internal databases, we integrate with our own SSO, and we assume that we retrieve images from and push images to our own image registry. We also assume our own GitLab instance. So it's technically not open source in a practical sense; that was a consequence of our design because of the academic environment. The code is publicly available, yet there are a lot of integrations that are coupled to the operator. So I'll hand over to Michael. Great, well, thank you for that introduction. So we've seen a little bit about what they've done, and now Varsha and I have had some discussions with Francisco and Rajula. We've looked through their code, and one of the things that really caught our eye and really led to this collaboration and this analysis we're doing together here is the fact that they've created an operator as real end users. They are the human operators themselves. We see a lot of discussion and attention given toward — and it's perfectly reasonable — vendors creating an operator to install their software for their customers and end users. But much of the operator pattern really shines when it is the SREs, the operators themselves, creating the Kubernetes operator to help them. And these guys even referred to this operator — it's two of them maintaining this incredible number of websites and its infrastructure around it — as the third member of their team, and the only one who's working today. So that's really neat. And it's fascinating, I think, to get that perspective and that view on what the experience has been like to develop and create an operator.
So one of the things that for me in particular caught my eye is the multi-tenant aspect and the self-service aspect of this. Just for a brief moment on multi-tenancy: what this means to us is that we have multiple users, or groups of users, we would call tenants, sharing some pool of resources. Probably most of us use an email service and a calendar service and some kind of office document product that's delivered to us over the internet as a service; those are all multi-tenant back-end kinds of systems. Tenants have discrete data that needs to stay separate. My email is separate from your email, separate from somebody else's email, and we depend on the services to keep those things separate. There's a common characteristic of delivering software as a service, or in this case some kind of software in a self-service fashion. We can write applications natively to handle that separation, but many vendors and software producers have older-school or existing software that is single tenant — something like Drupal or WordPress or some other kind of web application like that, or maybe something very domain specific. Maybe it's accounting software; I'm from the US, so taxes are on my mind right now, we just had that day go by. So if you have a single-tenant web application or some service like that, you can still offer it in a self-service way, perhaps even in a software-as-a-service subscription kind of way, by using Kubernetes namespaces. And that is what namespace-level tenancy is: every tenant that you have gets a unique namespace in a Kubernetes cluster, and we just deploy a copy of the application for them in that space. It may include a data store that's unique to them, or it may not — maybe there's a shared database management service that's external or in some way joined. But one way or another, we're enforcing that isolation with namespaces.
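The namespace-per-tenant layout described above can be sketched in plain manifests. This is a minimal illustration, not CERN's actual configuration: the namespace name, labels, and image are all hypothetical.

```yaml
# One namespace per tenant; the tenant's copy of the
# single-tenant app lives entirely inside it.
apiVersion: v1
kind: Namespace
metadata:
  name: site-home            # hypothetical: one namespace per site
  labels:
    tenant: home
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: drupal
  namespace: site-home       # isolation boundary is the namespace
spec:
  replicas: 1
  selector:
    matchLabels: {app: drupal}
  template:
    metadata:
      labels: {app: drupal}
    spec:
      containers:
        - name: drupal
          image: registry.example.org/drupal:9.4   # placeholder image
```

RBAC role bindings and quotas scoped to the namespace would complete the isolation, as the speakers describe later for their namespace template.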
This gives us a pretty easy on-ramp and an easy path to delivering a single-tenant application, potentially in a self-service or even a SaaS model. And that's what they've done here, so I think it's a really interesting example of that. Now I'll hand it off to Varsha. Coming on to the next observation: observability. We all know operators handle a lot of complex applications, and there are a lot of moving parts too. Having metrics, alerts, and visualization is an important aspect, and together they constitute what is known as operator observability. Collecting metrics helps in providing deeper insight into what an operator does and how it is behaving — whether it's behaving as expected. Having alerts helps SREs and cluster admins make sure that situations do not turn critical, so that we don't have to wake up in the night. And visualization helps us get a better understanding of these complex moving parts. Now, some of the key reasons why operator observability is important, and something which we advocate while writing operators: one, to quantitatively understand the deployment success rate — in turn, the number of successful deployments of the operator workloads which are running on the cluster. Two, having continuous health checks ensures that the operator is performing as expected at any instant of time. The third, and most important, is resource utilization. Having an idea of the resource utilization on the cluster by the operator helps provide a better understanding of how the operator is performing, and whether there is anything which we as operator authors can do to improve the performance. There are a lot of metrics which we get from controller-runtime, and we'll talk about that in a bit.
But the other two important aspects are scaling and availability. They measure how well an operator is able to scale, at what instant of time and at what workload; and availability is basically the percentage of time the operator and its workloads are available and in action, doing the stuff which we want them to do. So we have a doc on best practices for metrics related to Operator SDK. It has in-depth details on how alerts should be set up, what the labels are, and other suggested formats. Now, as Operator SDK users, what does it provide with respect to monitoring, metrics, and visualization? There are three things which the SDK provides us. The first one is the controller-runtime metrics. We all know that Operator SDK uses a library known as controller-runtime, and it exposes a set of controller, or reconciler, metrics to be specific. What they are and how they are useful, I'll briefly cover in the next slide. Moving on to the second point: when you build an operator with the SDK, you get a manifest which creates a ServiceMonitor. Using the ServiceMonitor you can integrate the operator metrics — you can expose the operator metrics and push them to the Prometheus Operator or any standalone Prometheus instance running on the cluster. So the metrics are available at the endpoint; it's just up to us to scrape them, see them, and visualize them. And the third, most interesting and beautiful aspect of this is the availability of a Grafana v1alpha1 plugin. This plugin is available in Kubebuilder, and can again be used with Operator SDK. What it does is scaffold out a JSON blob, and you can copy this JSON blob, put it into a Grafana instance, and you get a visualization of the metrics which you have collected.
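The scaffolded ServiceMonitor mentioned here looks roughly like the following. Label values, namespace, and the port name vary per project, so treat this as an approximate sketch of what the SDK generates rather than the exact manifest:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: controller-manager-metrics-monitor
  namespace: my-operator-system        # placeholder namespace
spec:
  endpoints:
    - path: /metrics
      port: https                      # name of the metrics service port
      scheme: https
  selector:
    matchLabels:
      control-plane: controller-manager  # matches the manager's metrics Service
```

With the Prometheus Operator installed, this is enough for the operator's `/metrics` endpoint to be scraped automatically.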
So we went through the Drupal operator code, and these are a few metrics which we thought would have been helpful to the SREs and to the CERN folks, and also in general to all of you when you start writing an operator or try getting metrics from an existing operator. Some of the controller-runtime metrics, to briefly go over, are the total number of reconciles, the number of errors during reconciliation, and the duration of a reconcile in a controller. These metrics give an overall idea of the performance of the controller. They help in monitoring the health of the controller and identifying any issue or any request which is possibly blocking or taking a lot more time than expected. The other two metrics which we thought would have been useful are the available backups and the scheduled backups, but the idea is to use the tools available: we can also extend them to create our own custom metrics, and this is something which we recommend. So the recommendation is to have as many metrics as possible which are custom to your application logic, and to have a visualization of them to better understand the operator performance. Moving on to the internal discussion. Great, let's have a seat. Live? Hello, okay. There's a switch on the bottom; mine is on. So we're going to have a bit of a discussion. This is fantastic, thank you for the chairs. We were envisioning maybe some couches and some coffees or something like this, so that's at least what we're going for here. We thought about some questions, some discussion, and a little more of the story of what the experience was like. So maybe could you guys just elaborate, to start, on how you got to the point of wanting to use Kubernetes at all and write an operator, and what was the experience like of actually creating this thing from the ground up? Okay, so first off, we were a bit lucky in the sense that we already had other Kubernetes-based services around.
It was logic-wise simpler, but it proved that we could have our CMSs on the same platform. So once we understood that Kubernetes would be a place where we could put our Drupal websites, our public websites, we then started the development of this operator, which is very logic heavy. It was written in Go, which was the more flexible option, more capable of being customized. And well, it was a learning process. We discussed this already before: it was, let's say, our first big operator. It was very logic intensive, and we did not realize in the beginning how we were making some mistakes. One very simple but crucial mistake is that we created one very long controller which was handling a lot of logic. So in the end, the reconciliation loop, as you can imagine, was very long. The operator was, and still is, super fast, but once it started to take more than just a few seconds, we started realizing that something design-wise was not great. So from that point on we were basically doing smaller adjustments and improvements on the operator in order to keep the performance we wanted. Now, before we dive into that topic a little deeper — I mean, you're both accomplished SREs — what was it like writing in Go? Were you already comfortable writing in Go? Was that a big learning curve itself, and how was controller-runtime and that whole experience around it? Okay, from my side, from my experience: I was not a very big developer at all. I had previous experience with many languages, but with Go it was very basic. Writing the Go operator — I didn't find it that big of a challenge. It was very easy to understand the code structure and the methods that I needed, when I needed them. So I would say it was fairly easy. Yeah, so as they mentioned, having a very long reconciler is something which is not useful, and a suggested pattern is to break down the reconciler into multiple different parts.
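The "break the reconciler into parts" advice can be sketched in plain Go. This is a toy model, not the CERN operator's code: the `Site` type and the phase functions are invented for illustration, and a real controller-runtime `Reconcile` would take a `context.Context` and return a `ctrl.Result`. The point is the shape — small idempotent phases sequenced by a short loop, instead of one monolithic function.

```go
package main

import (
	"errors"
	"fmt"
)

// Site is a hypothetical stand-in for the operator's custom resource.
type Site struct {
	Name       string
	HasDB      bool
	HasIngress bool
}

// Each phase is small and idempotent: safe to re-run on every reconcile.
func ensureDatabase(s *Site) error {
	if !s.HasDB {
		s.HasDB = true // pretend we provisioned the DB here
	}
	return nil
}

func ensureIngress(s *Site) error {
	if s.Name == "" {
		return errors.New("site has no name")
	}
	s.HasIngress = true
	return nil
}

// reconcile just sequences the phases and stops at the first error,
// keeping each piece of logic separately readable and testable.
func reconcile(s *Site) error {
	phases := []func(*Site) error{ensureDatabase, ensureIngress}
	for _, p := range phases {
		if err := p(s); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := &Site{Name: "home"}
	if err := reconcile(s); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s.HasDB, s.HasIngress)
}
```

A side benefit of this shape is that each phase can be unit-tested in isolation, which connects to the testing regrets mentioned at the end of the session.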
Now, one thing which we wanted to highlight as Operator SDK developers is that this can bring a kind of issue in terms of how many CRDs a controller manages. The suggested pattern is to have just one CRD per controller, and to quickly highlight the reason behind that: if there's a CRD which is being managed by multiple controllers, there can be contention on updates. Multiple controllers can update a single field, and that can cause race conditions — and debugging race conditions is definitely a major pain point, which we would suggest never getting into. So I think you guys also had some similar conclusions and ended up going from your long reconcile function to thinking about how to break that apart into multiple, right? What was that like? Well, in our case, we don't do what you would call the correct approach: we use more than one controller for the same custom resource, but we defined it so that one aspect of this resource — a sort of sub-resource — would only be managed by a separate controller. What happened, and this is how it actually went, is that our operator was managing the backups, and the backups were taking very long. So we separated that into a different controller, and then it's completely fine, because they are basically separated even though they use the same custom resource. Yeah, and there are a couple of knobs here, I guess, that we can try to turn to deal with that particular issue. One of the first things we think about: controller-runtime defaults, of course, to only doing a single reconcile at a time, and we can configure that — and in fact, that's a good idea. For any of you who are maintaining operators or thinking about creating operators, go look at that setting; think about enabling it to reconcile multiple at once. Do you recall what we call that? It's MaxConcurrentReconciles, that's right — one of the options there on the manager, I think, right? So did you start with that?
Did you try multiple, and how did that go? Yeah, we actually tried that, but our problem remained the same: now we have more reconciles, but they all basically get stuck in this one operation that is very time intensive but not very high priority. So we realized this is a low-priority task that takes a long time. For us, even though we increased the number of reconciles, it just made more sense to have these operations decoupled into different reconciliations. Yeah, that makes sense. Rajula, did you have something to add? Yeah, so there were some parts of the status — we wanted to always have the status as a priority, even with a slow task. The other reconciles were keeping the status up to date, so it was kind of useful. Interesting, yes. Status management is always a tricky thing to get right sometimes, yeah. And so another thing that caught our eye about your API in particular — I love looking at API choices and design; well done — is the fact that the Drupal version is actually embedded into the spec. Typically, the way we develop operators at Red Hat, and what we recommend operator developers usually do, is to couple the version of the operator with the version of the operands. So if you want to upgrade, for example, Drupal, you would upgrade the operator, and that enables you to limit the scope so that your operator logic always knows what version of the operand it's going to be dealing with, or maybe it's just upgrading from this one to that one. I think you needed some more flexibility than that, I understand, for your use case and your user base. Maybe tell us a bit more about how that went and what you did in the API here to actually get that flexibility. First, the version is basically the serving image; it's just an image tag. So it's not related to any of the resources, because everything else is the same resource. The version change on the custom resource will just change the serving image on the deployment. Anything else?
No, I was just going to say that the operator itself is basically agnostic to the version, because the expected operations for all of them are still the same — same MySQL database, et cetera — so it will still be able to do a database backup or restore and so on and so forth. So, like my colleague said, it's just a serving image. The operator is completely agnostic: we can provide a second or third image, and the operator will be able to serve all of them without any fuss. And I think through your portal — do I recall correctly? — you were able to give people some flexibility in terms of when their particular instance is going to get upgraded, for example? That's correct. So the version of the CMS evolves through time. Users have to upgrade, and if we just upgrade them all at once, we risk putting some websites offline because they have something that was not compatible with the latest version. So we announce ahead of time that we have a new version that they can try and install, and they can easily create a copy — which is just another custom resource that they create — and have a new website running with the latest version. They can compare, they can then change the URLs over to the copy, or they can just upgrade the previous website. And changing the version is basically: you go to the portal, you select a new version, that changes the custom resource, the operator reads it — okay, there's a new image to serve; is it a different version? If it's a different version, I have these operations to do — and then it upgrades, and then it's back online like nothing ever happened. So my colleague and I have nothing to do; our third colleague took care of that upgrade for us. So you can come to Amsterdam and speak at KubeCon. Exactly. Excellent. Well, maybe we should get them in on this. What do you think? Yeah. Great. Well, let's...
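So a version change through the portal boils down to a one-field edit on the custom resource. Roughly like this — the API group, kind, and field names below are our guess at the shape for illustration, not the operator's exact schema:

```yaml
apiVersion: drupal.webservices.cern.ch/v1alpha1   # assumed group/version
kind: DrupalSite
metadata:
  name: home
spec:
  siteURL: home.cern
  version: "9.4-1"   # effectively an image tag; bumping it swaps the serving image
```

Everything else in the reconciled resources stays identical, which is why the operator can serve any version without version-specific logic.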
Now it's your opportunity to ask some questions of some CERN folks and Operator Framework folks, and I've already got a hand up here. Please go ahead. I'm not sure if we have mics running around or... There are mics on the side. I see a couple of mics on the wings. Why don't I just come give this to you for now, and other folks, maybe if you want, make your way to one of those. Hi there. Thanks for the talk, it's very interesting. Just a quick question — or two, rather — going back to the GUI you brought up. How much flexibility does a user have? Can he go crazy and do a hundred of them? And what is behind it? So it dumps the custom resource somewhere, but the website where you create this request — is that running on the cluster where the custom resource will end up and trigger the operator? I'm not sure if I completely understood, but you're asking basically what connects the graphical interface with the cluster. Yeah, exactly, because in there you would request... Yeah. So in the website you create whatever you want, and that impersonates an operation for you behind the scenes, which would be either create or update this custom resource, or even delete it. So all the custom resource operations are done behind the scenes, and in the portal you just have the fancy buttons; it's basically impersonation. Okay, but how does it get there on the cluster itself? Yeah, it has a service account, so it's basically a dedicated account, and within the cluster it does impersonation. It applies the operation with the authenticated user's account, so you will only be able to do the things in the cluster that you actually can do. Okay, thank you very much. No worries. It's a bit more complex, but that's basically the storyline; we're happy to have a further discussion if you want a bit more detail on that. So your internal identity management is connected and set up for your cluster? Yeah, so we have role bindings within the cluster, et cetera.
So it's part of, let's say, our namespace template: when you create a new project, a new website, there are already roles there, so you can only work with that specific namespace and with specific resources within that namespace. That's basically how everything gets managed, and, well, that's it. Thank you for the amazing work. If I understand correctly, the two of you together handle a big load thanks to the operator, and it's a very efficient way of working. But what I wonder is: did it free up time for you to learn something better, learn something new? Or instead did you get squeezed to the last drop and just get more and more domains to handle, instead of free personal time? I would say that the operator definitely freed a lot of work from us, so we are definitely very happy with the solution we have. Actually, we are very happy, and we still agree that there are improvements to the operator that would release even more work from us. A lot of operations would have needed to be done by us — as soon as we notice, oh, this is something systematic that users might need, we introduce it into the operator, and from that moment on, our third colleague takes care of all these operations for us. So let's say in the short term we might have to do some extra development on the operator, but medium to long term it eases our lives in terms of daily tasks on the infrastructure. Well, here we are — someone is doing exercise today. Hello, hi. I happen to be one of the main users of this service. I know Francisco — hello — we've had some conversations, and my question is more about the downside of creating an operator, because I'm not very familiar with operators or maintaining them. Do you think there's any downside to having an operator, something that's software based, rather than having some other type of solution like Helm charts or an Argo CD instance or something? So, if you're asking whether we wanted something more of a GitOps kind of thing — in the sense that you have a repository and you have some
specific thing for that — well, you're at CERN, so there's also a service that allows you, if you want something custom with a specific image, Helm chart, whatever: there's a dedicated place for that. The goal of our service is providing software out of the box, so you as a user don't have to have a lot of technical knowledge, and you get a website up and running. Our operator basically enables us to do that very easily, without a lot of actual work to keep things running. Thanks — I'm not sure if that answered it, but thank you. Maybe I'll just add a thought to that, and then I'll come to you. You know, we talked a little bit about this as well: maybe GitOps is an option here, but that gets you a deployment story, and maybe a deprovisioning story also, but not the continuous operational story — the "let's go have fun and go to dinner in Amsterdam and not worry about what might happen, because there's an actor ready to react to unexpected or expected changes in state" story. That's the story we tell at the Operator Framework. Actually, one thing that I might add is that for some specific users that might want to do extra things with their website, we actually provide them permissions to, for example, work with the deployments, change the config maps, and so on and so forth. And one thing that you realize is that the operator even protects them from certain mistakes, because if you delete the deployment, don't worry — the deployment will be back very soon after, because the operator will enforce it. So it also helps us with probable accidental operations. That's it. We do sometimes see — and this gives an opportunity, potentially — a mix of these, where you might use GitOps or some kind of pipeline like that to actually create the custom resources that an operator then uses. That just moves the source of truth out of the cluster to a git repo, which has its advantages potentially, but it's
another thing to deal with — but that's a different topic. Okay, who's over here? Here you go. As you already mentioned, status updates tend to be tricky — I've been there. Are there any best practices or tips you'd like to share? Like, is it worth having a separate reconciler just for the status, decoupled from the main logic? What are your thoughts? Varsha, do you have any thoughts on this first? It depends on the business logic regarding the status updates; it depends on the application logic. I don't think we have a special recommendation on having a separate reconciler just for doing that, but in general what we suggest is to use status conditions, so that whenever a status is updated — when the operator updates the status — a third person who is looking at the operator status knows the reason behind what's happening. It's a way to communicate to the user why a status has been updated, what is happening in the reconciler loop, and what the cause of an error, a failure, or a success is. So using status conditions is something we would suggest, but having a separate reconciler just for doing so is not so common. Yeah, there are a few ideas and best practices. Sometimes, only write the status if you know there's a change to the status — don't just blindly write an update to the status on every reconcile; that can be a significant help. And then when you do need to do a write, doing a patch instead of a full update is a much lighter burden, at least on the API server. With OpenShift, for example, which is based on operators, you know, top to bottom, we're very sensitive about the burden we put on the API server, and so patching is a very important part of that, as opposed to the entire, you know, full update workflow. Those are a couple of ideas. Did you guys have anything to add?
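The two tips above — status conditions, and skipping the write when nothing changed — can be modeled in a few lines. The real helper is `meta.SetStatusCondition` from `k8s.io/apimachinery`, which similarly reports whether the condition list actually changed; this stdlib-only sketch mimics that changed-reporting behavior with an invented, simplified `Condition` type (the real one also tracks transition timestamps):

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// setCondition updates the slice in place and reports whether anything
// actually changed, so the caller can skip the status write entirely
// when nothing did — avoiding a pointless API server round trip.
func setCondition(conds *[]Condition, c Condition) bool {
	for i, ex := range *conds {
		if ex.Type == c.Type {
			if ex == c {
				return false // identical: no write needed
			}
			(*conds)[i] = c // same type, new content: update in place
			return true
		}
	}
	*conds = append(*conds, c) // first time we've seen this type
	return true
}

func main() {
	var conds []Condition
	ready := Condition{Type: "Ready", Status: "True", Reason: "SiteServing", Message: "all resources present"}
	fmt.Println(setCondition(&conds, ready)) // first set: changed
	fmt.Println(setCondition(&conds, ready)) // identical: skip the update
}
```

In a real reconciler, a `true` return would then be followed by a status patch (e.g. via the status writer with a merge-from patch) rather than a full object update.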
Someone over here — I promised to come back. I have a question: is there any work involved to keep your operator compatible with different Kubernetes versions? You cannot hear me? I could not hear it, I'm sorry. Is there any work involved to keep the operator compatible with different Kubernetes versions? You're asking if our operator is compatible with the Kubernetes version, and whether we've upgraded? Yeah — do you have to do any work maintaining your operator as the version of Kubernetes changes underneath it? It's a very good question. So we deployed the operator and went into production around two years ago. Two years ago is basically when we said, okay, this is good enough to start handling production workloads, and ever since, we have not upgraded the operator. The operator has remained more or less at the same version; we didn't have the need so far to upgrade it. We are actually planning on doing it in the upcoming months, but there's been no actual need. So up until now it has been very good with the upgrades: the cluster has been upgraded ever since to the latest versions, but the operator itself remains at the same version and is very compatible, so it's very easy to keep it running. Yeah, also the APIs we use — the client APIs — haven't changed much in Kubernetes, so it's totally fine that we are using an older client but a newer server. Well, one last question for you: if you were to do this again, maybe even in the near future, and start from scratch, what would you do differently? It's a very good question. There are a few things that we would do differently. We already learned the lesson about big reconcile loops; that would not happen again. A second thing is metrics. We are aware that the operator exposes metrics, but we never used them. At some point we realized our operator was slow, but we never had proper visualization of that. So one second thing is, from day one, we would
definitely monitor our operator workload: see how it handles the workload, how fast it is, and what's going on with it. So those are definitely two big points that we would take into consideration. A third point is that the architecture of the custom resource actually looks really good, and we would probably keep it the same: one custom resource to deploy and handle everything about the software. That we would keep for sure, because it works pretty well for us; it's one single point of information to provide everything for a user. Nice. Rajula, any thoughts from you too? I think we would also focus more on testing — unit testing. And also, we don't use all the latest features that come with the Operator SDK; we don't have that much time to actually focus on them. But if we were to write a new one, we would start with those first and follow up with new features. Very nice. Varsha, any parting thoughts? Yeah, probably testing, metrics, and the upgrade process — they covered it all. Yeah. Well, I guess we'll say goodbye. We have one last request for you: this is a QR code where you can tell us how we did. We love your feedback; even if you just had a good time, it's helpful for us to hear, and for our bosses to hear, who sent us here. But we'll stick around up here; we can stay in the room, I guess, for a little bit more time, and then we can go outside too. There are a lot more topics we could cover, so bring your questions. Last time we did something similar, we got some pretty good confessions about real-world horror stories and mistakes and things gone wrong. So come up, share, listen — and thank you for coming.