Thanks for making it to today's talk on the last day, right after lunch. In this talk, I'll be sharing a story with you about my experience, and my team's experience, of writing a Kubernetes operator: what challenges we faced and how we overcame them.

The content of this talk looks something like this. I'll share the problem context we started from, what the alternative solutions were apart from writing a Kubernetes operator, how they did not work for us, and how we transitioned into writing an operator. That was only half the story of finding the right solution for us; the technical challenges really started once we began implementing the operator and taking it to production. We will also look at some good resources that are already available, which you can look into if you want additional material or are just starting out in this space. Apart from that, we will do a recap of what we covered today and take any questions you have. Feel free to stop me in between if you have questions or something is not clear.

So let's understand what kind of application my team and I were working on. The application processed a large number of messages in a day, of multiple kinds. I cannot go into the details or architecture of the application, but you can think of it as something like a message processor, handling events ranging from a simple OTP message to email content that the marketing team would use. Once these messages were processed, they would be sent to different business applications: applications that want to send marketing content, or that process credit card transaction statements. Such applications would consume these events based on category and configuration. For example, a transactional statement might need to be read immediately, and even if the communication between my application and that business application fails, they want it retried immediately, up to 10 times. That might not be the case for a notifications team reading marketing messages.

Now, these configurations, which I will refer to as states throughout the presentation, were what we had to manage. There were around 15 to 20 fields that contributed to a state, and we were managing them in a centralized way: one team managed all these states. These states were replicated across multiple zones for multiple tenants; my testing zone alone had somewhere around 5 to 10 tenants, and that number kept growing with business needs.

Nominally, a team was managing all of this. But behind the scenes there was one SRE doing the work, and our SRE is Anand. On a good day, apart from handling incidents and other issues, Anand would manage 15 to 20 tickets. On bad days he would take on 50-plus tickets, at night, if there was some specific requirement from a business application. So the state of the entire operation was not great, and Anand was stressed out.

We needed a solution that would lessen Anand's burden, ensure that security was not compromised, and help us move from this centralized setup to a decentralized one, ideally in an automated way. So these were the solutions that we tried. We thought, OK, maybe giving Anand a helping hand would help us out.
But as I mentioned earlier, transferring the knowledge of what our application was doing was pretty complex, and the time we invested in training a new person was not yielding satisfactory results. So we thought, OK, maybe we can try some automated solutions. We built a Jenkins pipeline and a UI that could decentralize the work: we could hand it to the business application owners and say, these are your parameters, manage them yourselves, we don't want to do that repetitive task for you. But that led to issues: we were not able to properly authenticate and audit the resource changes that were happening, and we had a lot of duplication. Effectively, all we had done was offload the burden from Anand onto other teams. This did not look like a solution that would help us in the long run.

That is when we transitioned into writing a Kubernetes operator. I would emphasize here that when we started, we were not a team that knew a lot about Kubernetes or about writing operators. We just knew what we wanted to do, what our requirements were, and how the problem could roughly be solved with an operator. We were trying to manage states, we were doing repetitive tasks all along, and we knew we could easily convert our problem into a declarative model. We researched a bit and found that none of the existing controllers available could solve our problem. The mistake we made, or rather the inaccurate assumption, was that there wouldn't be much of a learning curve or much resource overhead; we found out later that this caused us real issues.

So once we knew we were going with the Kubernetes operator approach, we started by finalizing what our CRDs would look like and writing boilerplate code for those CRDs, or custom resource definitions, using the Kubebuilder framework. If any of you are looking to write one, choose the framework based on your use case and feasibility. We wrote some basic custom logic and validations for our use cases, and then we tested it in local environments.

How many of you are aware of what CRDs, or custom resource definitions, are? OK, quite a lot. Just in case someone doesn't know: you can loosely think of a CRD as something like a contract between your operator and what it reads. You specify what fields will be present and what version your CRD will use.

One of the mistakes we made as newbies when we were starting out was not understanding the importance of deciding what our scope would be. We were an infrastructure team, so we had lots of permissions, and we started out with cluster scope. We didn't realize at the time what security risks it could pose and how it could become a roadblock to offering a completely self-serve solution to our users. The second mistake we made was that we did not maintain the status of our CRs, which blocked us from understanding how our CRs were performing, from getting alerts on time, and from maintaining our application's stability.

Before I move on to the challenges we encountered during this process, this is how a custom operator might work. I, as a user, create a YAML file or something similar, which acts as my custom resource.
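That YAML has to match what the CRD defines. As a rough illustration, here is a minimal sketch of what such a CRD can look like when defined with Kubebuilder. The type, fields, and API group are hypothetical, not our actual CRD, but the sketch shows the two things we initially got wrong: the scope decision (here namespaced) and a status the operator keeps updated.

```go
// messagestate_types.go: a hypothetical, stripped-down custom resource.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MessageStateSpec is the desired state: the per-consumer configuration that a
// business application owner manages themselves. Our real CRD had 15 to 20 such fields.
type MessageStateSpec struct {
	// Category of messages this state applies to, e.g. "otp" or "marketing".
	Category string `json:"category"`
	// Tenant this configuration belongs to.
	Tenant string `json:"tenant"`
	// MaxRetries is how many times delivery should be retried on failure.
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=10
	MaxRetries int32 `json:"maxRetries,omitempty"`
}

// MessageStateStatus is the observed state. Skipping this was our second mistake:
// without it there is no easy way to see how a CR is doing or to alert on it.
type MessageStateStatus struct {
	Phase      string       `json:"phase,omitempty"` // e.g. Pending, Synced, Failed
	LastSynced *metav1.Time `json:"lastSynced,omitempty"`
	RetryCount int32        `json:"retryCount,omitempty"`
	LastError  string       `json:"lastError,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:scope=Namespaced

// MessageState is the custom resource users create; namespaced scope is what we
// ended up with, after starting (wrongly, for us) with cluster scope.
type MessageState struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MessageStateSpec   `json:"spec,omitempty"`
	Status MessageStateStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// MessageStateList contains a list of MessageState.
type MessageStateList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []MessageState `json:"items"`
}
```

With Kubebuilder, the controller-gen tooling turns Go types and markers like these into the actual CustomResourceDefinition manifests.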
My operator has a reconciliation loop that keeps watching for any updates happening to the CR and tries to make the actual state match the desired state (I'll show a small sketch of such a loop a little later), and then maybe a verification step, which we will talk about in later slides.

So we moved on from phase one, where we did a basic setup of our Kubernetes operator, to phase two, where we were testing and trying to see whether we could make updates to our CRs and maybe the CRD. We were testing with 5,000-plus CRs in our pre-prod environment, which looked much closer to our production environment than local testing did. And as I said earlier, we were testing with cluster scope. While we were testing and reaching out to the different teams who would use our operator, we realized most teams did not have permission to access the cluster-scoped CRDs and use their CRs properly. There were also security risks involved. So we had no option but to migrate our CRs and CRD. For us, this was the reason, but for others the reason could be quite different: maybe you make a state change that is not backward compatible and you are forced to migrate. We were also using Helm, and Helm doesn't support upgrading or deleting CRDs, so we were forced to execute the steps to update our CRDs manually.

Now, this is exactly what we wanted to avoid, the cluster going down, and it is exactly what we faced: the split-brain problem. While we were upgrading our CRDs, we realized later that most of the CRs which should have been deleted were not deleted, and the entire cluster went down due to the state difference we had. I have a demo of this, but not in the slides; if anyone wants it, I can share it offline. So this was our first lesson: once you are in production, avoid migrations for as long as possible. And if you are actually planning a migration, each and every step should be thoroughly thought through, because ultimately you are playing around with Kubernetes internals. Try to avoid it, and if you do it, plan it well. Apart from that, one mistake we made without realizing it was playing around with the finalizers of custom resources. I would say not just custom resources: don't play around with any Kubernetes finalizers.

So we were done with this. We understood what we shouldn't be doing, we recovered soon from the problem, and we had our backups in place, so we were good to go to production. Once we got into production, we started having issues around how to get insights and how to make it fully self-serve. The operator was working fine, we had resolved our technical issues, we had validations in place, but most people could not understand how to move to the new process of maintaining these states. Now, we did have basic documentation in place, but we did not have documentation that served the two different personas we had, and I think having that was the most critical part of making it self-serve. After we had proper documentation, we moved from around 15 to 20 tickets a day to two or three. It was a great savior for us.
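Coming back to the reconciliation loop I mentioned at the start of this phase: here is a minimal sketch of the shape such a loop can take, assuming controller-runtime and the hypothetical MessageState type from the earlier sketch. syncDownstream stands in for whatever external API calls your operator makes; none of this is our actual code.

```go
package controllers

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	logf "sigs.k8s.io/controller-runtime/pkg/log"

	v1alpha1 "example.com/message-operator/api/v1alpha1" // hypothetical API package
)

// MessageStateReconciler reconciles MessageState objects.
type MessageStateReconciler struct {
	client.Client
}

func (r *MessageStateReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := logf.FromContext(ctx)

	// Fetch the custom resource this event refers to; it may already be gone.
	var state v1alpha1.MessageState
	if err := r.Get(ctx, req.NamespacedName, &state); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Converge: push the desired configuration (spec) to the downstream system.
	if err := r.syncDownstream(ctx, &state); err != nil {
		log.Error(err, "failed to sync state downstream")
		return ctrl.Result{}, err // controller-runtime requeues with backoff
	}

	// Record what we observed so users and alerts can see the CR's health.
	state.Status.Phase = "Synced"
	state.Status.LastSynced = &metav1.Time{Time: time.Now()}
	if err := r.Status().Update(ctx, &state); err != nil {
		return ctrl.Result{}, err
	}

	// Re-verify periodically even when no new events arrive.
	return ctrl.Result{RequeueAfter: 10 * time.Minute}, nil
}

// syncDownstream is a placeholder for the calls to the external state APIs.
func (r *MessageStateReconciler) syncDownstream(ctx context.Context, s *v1alpha1.MessageState) error {
	return nil
}
```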
Later on, as people started using this process and as custom resources increased in our cluster, we realized that most people were complaining that state changes were not getting reflected on time. This was because we were not handling concurrency well for the needs of different zones. So we realized we should make it configurable, so that we could keep handling the growing number of tenants well.

We also realized, later than we should have, that because we did not have alerts in place on time, issues became critical. Metrics and alerts on the reconcile queue size and on the latency each CR took to get through the reconciliation loop would have helped us tremendously. So setting up observability while maturing your operator plays a big role.

Apart from that, we had high latency. We resolved that, but the fix meant processing lots of CRs at once. With this, we observed that a couple of our downstream applications were not implementing rate limiting up to the mark, and that made those applications go down and created a bigger issue for us. So if you have downstream applications that cannot handle these scenarios well, make sure you have configurable concurrency and rate limiting in place (there is a small sketch of both just before the recap).

Apart from this, we also understood that most of this overload was coming from stale resources. Stale resources were not only consuming resources in our cluster but also making the entire process slow. Having a systematic approach to cleaning up these resources on time can bring real advantages in terms of latency and resource savings. We were able to achieve this systematic cleanup and observability because we maintained the status of each CR. You might not need it initially, but as you go to the higher zones, make sure the status of your CRs is in place. And lastly, we added end-to-end testing, which helped us make sure the entire pipeline was not breaking due to changes made in downstream applications.

So that is how we achieved stability for our Kubernetes operator in all the production zones. Currently, we are running this in around five different production zones, each with somewhere around three to five tenants.

What I found useful in this entire journey were these talks: one on why you shouldn't be writing a Kubernetes operator in the first place, so first figure out whether it is necessary for you at all, and if there are other alternatives, go with them first; otherwise, one that goes through a checklist of whatever would be required to write one; then a bit on what CRDs are and how they can be useful for you; and why and how not to play with your finalizers. You can go through them; they are pretty good.
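Before the recap, here is a small sketch of the two knobs mentioned above, configurable reconcile concurrency and rate limiting toward downstream APIs, again assuming controller-runtime and the hypothetical types from earlier. The numbers are purely illustrative, not our production settings.

```go
package controllers

import (
	"context"

	"golang.org/x/time/rate"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	v1alpha1 "example.com/message-operator/api/v1alpha1" // hypothetical API package
)

type MessageStateReconciler struct {
	client.Client
	// Limits calls to downstream applications that cannot absorb bursts,
	// e.g. rate.NewLimiter(rate.Limit(10), 20) for 10 req/s with bursts of 20.
	DownstreamLimiter *rate.Limiter
}

// SetupWithManager wires the controller up with a configurable level of
// parallelism, so each zone (with its own tenant count) can tune it.
func (r *MessageStateReconciler) SetupWithManager(mgr ctrl.Manager, concurrency int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.MessageState{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: concurrency}).
		Complete(r)
}

func (r *MessageStateReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Wait for a rate-limiter token before touching the downstream API, so a
	// burst of reconciles cannot take the downstream application down.
	if err := r.DownstreamLimiter.Wait(ctx); err != nil {
		return ctrl.Result{}, err
	}
	// ... fetch the CR, sync downstream, update status (see the earlier sketch).
	return ctrl.Result{}, nil
}
```

Controller-runtime also exports workqueue depth and reconcile duration metrics on its metrics endpoint by default, which, as far as I know, is enough to build the queue-size and latency alerts mentioned above.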
So, just recapping everything we went through. First, explore all the alternatives you have before diving into writing a Kubernetes operator. Try to make estimations that are as accurate as possible when you plan to write one. Make sure you weigh the right considerations when designing your CRDs; this can create really big issues if you don't plan it well. Have thorough documentation for each and every persona you have. Manage the concurrency of your operator properly. Have effective observability in place before issues become big, so that you are able to catch them. Do proper resource cleanup, with a systematic approach to it. And make sure you have end-to-end testing in place. Thank you. Feel free to give any feedback you have on the slides. Any questions?

Hi, great talk. This is Shidon from University of Illinois. I'm not able to hear, sorry. Oh, sorry. How about now? Yeah. I'm Shidon from University of Illinois, and I'm particularly curious about the split-brain problem you mentioned in the talk. Could you elaborate a little more on how this happened? What exactly is the split-brain here, and how did you address it?

So are you asking how the split-brain problem occurred for us? Yes. OK. As I mentioned, we were running our CRDs with cluster scope. We manually went and changed all the required CRDs to namespace scope, and we had CR instances, in the thousands, which were running with that cluster scope. When we moved our CRD, most of our CR instances were not moved to the namespace. So the expected, desired result still assumed cluster scope, while most of our CRs were running in namespace scope. That was the point where we made a mistake with a finalizer: it did not work as expected, and due to that state difference it led to the split-brain problem, and our entire pre-prod cluster went down. But once we fixed that, instead of playing around with the finalizers, we patched the CRs, and then things were working fine. We did the same activity for all of our prod zones as well, and it worked fine.

So it's basically because you were trying to change the scope from cluster to namespace? Yeah. And it's not necessarily limited to that; for us it was the scope migration, but it could be any state you are trying to manage that is not backward compatible. So yeah, I think what we learned was that planning your CRDs well is super helpful. OK, thanks.

Yeah, so on the CRDs-with-Helm slide, we ran into something similar. We have several custom CRDs internally that we upgrade regularly, and we ran into this while attempting to write a Helm chart to deploy our Kubebuilder operator. Upon some research, it's actually a little misleading: you can absolutely upgrade CRDs with Helm, you just have to not put them in the CRDs folder. If you treat them just like any other manifest, they will apply and upgrade just as you would expect. This is the approach that several large operators, cert-manager being one of them, follow. You just lose the support for deletion, because if you delete your CRD, you delete every CR as well. So you can upgrade; you just take the risk that if you uninstall the chart, you uninstall everything else.

So was it Helm 3 or Helm 2 that you explored? I also saw that, but at the time we were using Helm 2, so we had our restrictions there. Which Helm version was it for you? This is recent, so this would be Helm 3. Yeah, OK. We were running Helm 2, and we had our restrictions. Yeah, with Helm 3 it's standard. Yeah, and they are considering other options going forward with Helm upgrades, so it should get easier.

Thanks for the session, it's helpful. So these operators, until they run into issues, they are great. Once you hit the issues, you get into all sorts of trouble, as I understand.
So my question to you is: one, you mentioned end-to-end testing, could you explain what kind of end-to-end testing you have done? And second, how did you handle errors? If it runs successfully, no issues, right? But when there are errors, it's hard to debug or troubleshoot.

Yeah. So I'll start with the second question, on how to handle errors. For us, we figured out three sets of scenarios that we would run into. Our operator would call an external API and, based on that, it would create the states. The first scenario was success: if it runs fine, we are good. The second was an error, which could either be a retryable error or a non-retryable error. If it's a non-retryable error, we just want to discard it and make sure the status records what kind of error it was. And for us, retryable errors were OK to retry 10 times. When you write the operator, you have custom logic for exponential backoff, so we would retry up to 10 times; if a retry succeeded, good, otherwise we would not retry further. You can play around with your own retry logic, but having that distinction between what is retryable and what is non-retryable is important (there is a small sketch of this right after this exchange).

And for end-to-end testing: we were managing multiple applications at the same time, and a few of them were getting updated pretty quickly. While creating the state, we were calling a couple of APIs, first for authentication and then for state creation. We realized that if those APIs or contracts changed, we would actually face issues with the operator, because once your operator, or any application, is stable, you don't want to touch it, and you tend to forget that something else might be causing the problem. Since we had alerts in place and lots of traffic, any failure would raise alerts pretty quickly. That is how it helped us do the testing whenever we made changes in any of the downstream applications. Got it. Thank you. Helpful.
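To make that retry answer concrete, here is a minimal sketch of the retryable versus non-retryable distinction, again assuming controller-runtime and the hypothetical MessageState type from earlier. The error classification and the cap of 10 retries are illustrative, not our exact logic.

```go
package controllers

import (
	"context"
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "example.com/message-operator/api/v1alpha1" // hypothetical API package
)

// errNonRetryable marks failures (e.g. validation errors from the external API)
// that will never succeed no matter how often we retry.
var errNonRetryable = errors.New("non-retryable")

const maxRetries = 10

type MessageStateReconciler struct {
	client.Client
}

// handleSyncError is called from Reconcile when the downstream sync fails.
func (r *MessageStateReconciler) handleSyncError(ctx context.Context, state *v1alpha1.MessageState, syncErr error) (ctrl.Result, error) {
	if errors.Is(syncErr, errNonRetryable) {
		// Non-retryable: record the failure in status and stop. Returning a nil
		// error means the item is not requeued, so it is never retried.
		state.Status.Phase = "Failed"
		state.Status.LastError = syncErr.Error()
		return ctrl.Result{}, r.Status().Update(ctx, state)
	}

	if state.Status.RetryCount >= maxRetries {
		// Retry budget exhausted: give up and surface it via status and alerts.
		state.Status.Phase = "Failed"
		state.Status.LastError = "retry budget exhausted: " + syncErr.Error()
		return ctrl.Result{}, r.Status().Update(ctx, state)
	}

	// Retryable: bump the counter and return the error, so controller-runtime
	// requeues this CR with its default exponential backoff.
	state.Status.RetryCount++
	if err := r.Status().Update(ctx, state); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, syncErr
}
```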
Split brain. So one thing I want to bring up is that the way we look at CRDs is basically that we're extending the API layer of Kubernetes by introducing new domain objects; that's what the definitions are. And if you think about it, those attributes and those contracts are always going to be changing. Now, I know you said try not to change it, but wouldn't version control on the CRD have helped? If you're doing a modification, let's say a scope change, you could have just bumped the version of the CRD. You're treating it just like an API, because it literally is the API layer. So the new version of your API is now supported at namespace scope, versus your previous version at cluster scope, and your controller is now looking at the new version of the object via the API. Wouldn't that fix the split?

So for us, the issue was that we wanted to manage these states across all the CR instances we had. If we were upgrading, the option was to patch all the existing CR instances to use the upgraded CRD version, and we didn't think that was a good option for us, which is why we had to migrate. So we did consider bumping the version, but it was not useful for this use case.

The other thing I'm trying to look at is this: in normal microservice development, you have a request ID for requests, normally a UUID that gets passed on, and then you use a context object throughout so you can correlate it in your logs for debugging. When it comes to operators, we're going through the Kubernetes APIs. So presumably, if there is a request ID associated with the CRUD action happening against the object, do you know if there is a way we can grab that inside the reconciliation in the controller, put it in the context object, pass it through the execution of the business logic, and store it in the logs and standard out? That way, when a request comes in on that object, we have an associated correlation ID that we can track if there is a failure or whatnot. Does that make sense?

I couldn't actually follow it entirely. Are you asking whether, for a particular request, we are able to follow the errors? And is this request actually the CR?

I mean, you make the request, either a create, update, or delete of a CR object, so that request is literally an API call, through kubectl or whatever. That's going to go into the controller's reconciliation and get picked up. Do you know if there is a request ID of some sort for that request that we can use as a correlation ID in our logs? Because inside the reconciliation we're logging what's happening, and those are either Kubernetes logs or whatever logging setup you're using, and you're sending that to centralized logging. If we have that request ID, we can use it as a correlation, so that if anything does happen, we can always refer back to it in our central logs. I don't know if you looked into that or not.

I'm not entirely sure about the question itself. Well, I just wanted to know if there is a request ID that you knew about that comes down into the reconciliation or not.

Yeah, so the CR name that we were using is what we used to search for any errors that would come up in the logs. For us, at least, that did solve the problem. OK, so the name. The CR name itself. And the CR name matched our state name; each state had a unique ID as well, so the CR name was similar to it. So we were able to follow the entire trace pretty well: whether it's going into the reconciliation loop, how many times it's going into the reconciliation loop, and whether there are any downstream errors that we encounter. OK, thank you. Thank you.

Any other questions? Okay, thank you.
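For anyone curious what that last answer looks like in code, here is a minimal sketch, assuming controller-runtime and the hypothetical MessageState reconciler from earlier: the reconcile request's namespace and name (which in our case matched the state's unique ID) are attached to the logger carried in the context, so every log line for that CR can be correlated in centralized logging. If I recall correctly, recent controller-runtime versions also attach a per-reconcile reconcileID to this logger, which is close to the request ID being asked about.

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	logf "sigs.k8s.io/controller-runtime/pkg/log"
)

type MessageStateReconciler struct {
	client.Client
}

func (r *MessageStateReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Attach the CR's identity to the logger and put it back into the context;
	// every log line written from this context then carries the same key, which
	// is what we grep for in centralized logging.
	log := logf.FromContext(ctx).WithValues("messagestate", req.NamespacedName)
	ctx = logf.IntoContext(ctx, log)

	log.Info("starting reconciliation")
	if err := r.syncDownstream(ctx); err != nil {
		log.Error(err, "downstream sync failed")
		return ctrl.Result{}, err
	}
	log.Info("reconciliation complete")
	return ctrl.Result{}, nil
}

// syncDownstream is a placeholder for the downstream calls; any logger pulled
// from ctx inside it carries the same correlation fields.
func (r *MessageStateReconciler) syncDownstream(ctx context.Context) error {
	logf.FromContext(ctx).Info("calling downstream API")
	return nil
}
```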