So, should we start? Yeah. Hello, everyone. My name is Tarun, and in today's talk we'll talk about the Dragonfly operator. I think the lunch was good, so hopefully my session also follows that. About me: I work remotely from Hyderabad at a company called Dragonfly DB, and the project we're talking about is closely related to my day-to-day work. Previously I was at Gitpod and also at Linkerd. I'm an amateur runner, and I like my coffee, just like a lot of folks here. You can follow me on Twitter or on my website. So, before we talk about the Dragonfly operator, we need to talk about what Dragonfly is, right? Dragonfly is an open-source in-memory store, like Redis. It supports the same API, but it tries to do a better job at the whole caching layer. It's a drop-in replacement, and you can expect much better performance and reliability than many other tools out there. So, why Dragonfly? Dragonfly essentially scales vertically better. With Redis and a lot of other alternatives, after a certain QPS you have to scale horizontally. With Dragonfly, we try to avoid that by making efficient use of cores, modern algorithms, modern hardware, and so on. On the same instance, you get significantly better performance out of Dragonfly than you would with Redis, and that's the whole idea. So, how is Dragonfly related to Kubernetes? Dragonfly ships as a binary and a container image, and you know where a lot of people run their container images: Kubernetes. That makes the team and the community behind Dragonfly responsible for a good experience when it runs on Kubernetes, and that's where our whole journey of focusing on Kubernetes started. Essentially, it makes sense to build relevant tools to maintain and manage Dragonfly on Kubernetes.
So, like many other applications and users, we started out with a Helm chart. Helm charts are simple, they're easy to use, and a lot of users love them. But they're really only great for stateless, static applications. If you want to do something more on the management layer for your application, Helm charts fall flat, because they don't do much there; if you need some kind of dynamic logic of your own, there's no option. They do a good job on the templating side: users give their configuration and get Kubernetes manifests on the other side. That's a simple job, but it's not the only useful part, right? A lot of applications, and especially databases, need much more management around their lifecycle, their state, et cetera. Helm charts don't do a good job there, and that's the reason we see so many operators around databases and other applications in open source. Specifically, Dragonfly requires automatic failover and other such features. It needs an external component that configures the instances and keeps that configuration updated through the lifecycle of the Dragonfly instances. If you're familiar with Redis, there's a whole project around this called Redis Sentinel: when you have multiple Redis instances, it is the component that runs Redis commands and makes sure all the instances are configured correctly, who's the master, who's the replica, all those things. Because Dragonfly is compatible with Redis, you can use Redis Sentinel with Dragonfly and get all those benefits. But whenever we spoke to Redis Sentinel users on Kubernetes, they did not like the experience. That's because Redis Sentinel knows Redis, but it doesn't know Kubernetes. Helm charts know Kubernetes, but they don't know Redis.
So it's like this split where Helm charts can only do Kubernetes things and Redis Sentinel can only do Redis things. We thought there should be a better middle layer: something that knows how to manage Dragonfly but is also aware of the Kubernetes APIs, how to configure resources, how to listen to events, et cetera. That's where we landed on the idea of the Dragonfly operator: an operator that runs and manages Dragonfly instances for users on Kubernetes. Let's first go through the goals. Just like Dragonfly, the operator is open source; you can check out the GitHub repository. The first goal of the Dragonfly operator is to manage the underlying StatefulSet resources for each Dragonfly. For example, in a company there could be multiple teams, and each team wants its own Dragonfly instance. One single Dragonfly operator should be able to maintain multiple Dragonfly instances and their underlying StatefulSets. Because Dragonfly, like Redis, is stateful: it keeps its state in memory, and when it shuts down it can write that state out as a snapshot and try to load it back up on start, just like Redis. The other goal is to always have a healthy master. This is a very Redis and in-memory-store thing: you can't just have one instance. Even though the major use case for Redis or Dragonfly is caching, a lot of applications use it for real-time data. For example, a lot of gaming companies store sessions and other real-time data in Redis for faster retrieval. This means you need an instance always running, and during any failure you should be able to automatically fail over to another instance.
You should not need manual intervention there; it should all be automatic. You should have a replica, and if the master goes down, you should just fail over to the replica with all the data, and it should all work. One of the important jobs of the operator is to do exactly this. The other important part is to allow upgrades of Dragonfly. For any stateful application, upgrades are complicated: there has to be a rollout procedure through which your clients are informed about the upgrade, and after the upgrade you should have the same data, with nothing lost during the transition. Stateful applications need some hand-holding, and this is the reason StatefulSets exist in Kubernetes, with their naming conventions and upgrade behavior that is very different from a Deployment. In the same way, the Dragonfly operator improves the upgrade path, essentially to have no data loss and to do a better job than plain StatefulSet applies. So, before we dive into the codebase of the Dragonfly operator, how do operators even work? An operator is essentially doing three main things. First, it is watching for a declaratively requested state. For example, when you apply a Deployment to a Kubernetes cluster, that is a state requested by the user: the user is asking for a Deployment with this configuration, and Kubernetes does the job of turning it into resources on the other side. The operator works the same way, but for custom resources: not the built-in Kubernetes resources, but new ones. It watches those resources, and then it manages other resources on the other side. And there's also a status component to this whole thing, because operators themselves are not stateful; they're stateless by default.
They have to store their state somewhere to know what they are doing, what the status of each requested state is, et cetera. So they use the status field on the custom resource to do all of this. We'll see that in the demo. And on the other side, we have the managed resources that the operator is maintaining. Let's take an example from this talk. Here is a custom resource. It's not available on any Kubernetes cluster by default; it is specific to the Dragonfly operator, which manages a resource kind called Dragonfly, hence the line with kind: Dragonfly. It has a name, and then it has its own configuration fields: the number of replicas and the resources for each Dragonfly instance. This is the example of the declared state; this is the custom resource we're talking about. But before we even talk about custom resources, there's the whole thing around the definition. This has the same relation as a class and an object in any programming language: the custom resource definition is the declaration of the schema, and the custom resource is one instance of that schema. Here, on the left, we have the custom resource called Dragonfly. And on the right, we have the custom resource definition, which defines the Dragonfly type. Essentially, you're telling Kubernetes that there's a new resource called Dragonfly that it should be aware of from now on, that its plural form is dragonflies, and a lot of things like that. We also pass the OpenAPI v3 schema here, which is used to validate those objects: whenever the user applies a resource of type Dragonfly, it is validated against the OpenAPI schema.
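To make the class-versus-object analogy concrete, here is an abridged sketch of what such a CRD can look like. The kind, plural form, and OpenAPI v3 schema are the pieces mentioned above; the group and version names are illustrative and may differ from the operator's actual manifests.

```yaml
# Abridged CRD sketch: tells Kubernetes about the new Dragonfly type.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: dragonflies.dragonflydb.io
spec:
  group: dragonflydb.io          # illustrative group name
  names:
    kind: Dragonfly
    plural: dragonflies          # "kubectl get dragonflies" works after this
    singular: dragonfly
  scope: Namespaced
  versions:
    - name: v1alpha1             # assumed version
      served: true
      storage: true
      schema:
        openAPIV3Schema:         # validates every applied Dragonfly object
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
```

Once this definition is applied, each Dragonfly object is an "instance" of it, exactly like an object of a class.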
So essentially, the architecture is this: we apply the objects to the API server, and the Dragonfly operator does some magic and creates the cache on the other side. It's not just a pod; it's a StatefulSet, like we discussed, and a Service that the user can use to talk to it. But essentially, there's a lot of magic inside the operator. And it can add more instances: you can ask for a new Dragonfly resource, maybe for your frontend team, and it will create a new StatefulSet and a new Service to back that up, and the clients can start using the instance. Now, if you look at what that magic is internally, for the Dragonfly operator it is two things. But, like many things, operators can be built in many ways; this is one way we went about it, and there's no single right way, it all depends on your use case. In our case, we have two reconcilers in the operator: one is the Dragonfly reconciler, and the other is the pod lifecycle reconciler. We'll talk about what each of those does, and we'll also look at the code. Now, let's do a demo first. We spoke about a bunch of things; before we do any of it, can you see the code, is the font good? Yeah, this is good. First, I have a kind cluster here, and if I do kubectl get pods, there's nothing on this cluster, essentially the bare bones of whatever kind gives you. Now, let's install the operator first. The operator manifests are present in this folder called manifests; let's look at what the operator ships with. It essentially creates a namespace first, and then it creates the custom resource definition that we spoke about, the schema of the custom resource called Dragonfly. And then there's a bunch of RBAC stuff.
Essentially, the operator has to have the required permissions to create the underlying StatefulSets, Secrets, Services, all of that. So it needs a bunch of permissions, and this is the RBAC that we get. And then there's the operator itself. The operator is not a stateful application, so it's a Deployment; it doesn't need any of the fancy stuff. Now that we have it installed, if we do kubectl get crds to list all the custom resource definitions, you can see that we have a new resource called Dragonfly. Now let's apply a sample resource. Here I have an example: we have kind: Dragonfly, there's a bunch of labels (the labels are not important), and here is the actual spec of the Dragonfly instance. We are essentially asking for two replicas with a specific set of resource requirements, CPU and memory requests and limits. Now let's apply that with kubectl apply; this is part of the config samples. You should see that a dragonfly-sample has been created. Now, if you watch the pods, you can see two pods automatically created, one after the other. And if we do kubectl get dragonflies, we see the dragonfly-sample that we just asked for. You can also do kubectl describe dragonfly dragonfly-sample: you see the same resource, but also a bunch of events from the operator. It has created resources, and it also did a bunch of other things; we'll get to those later. So, as you saw, the operator has created a bunch of things. And who did that? That's where we can go back to our slides: this is the job of the Dragonfly reconciler.
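The sample applied in the demo looks roughly like this. It's a sketch along the lines described above (two replicas, CPU and memory requirements); the exact field names and values are assumptions and may not match the operator's actual API.

```yaml
# Illustrative Dragonfly custom resource, as applied in the demo.
apiVersion: dragonflydb.io/v1alpha1   # assumed group/version
kind: Dragonfly
metadata:
  name: dragonfly-sample
spec:
  replicas: 2                 # one master plus replicas, managed by the operator
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 600m
      memory: 768Mi
```

Applying this with kubectl apply is what triggers the Dragonfly reconciler to create the StatefulSet and Service behind it.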
So before that, actually, let's look at the code. Operators are commonly built using the Kubebuilder framework; Kubebuilder is a popular framework for building and managing operators, and we use it to write this operator. Internally, Kubebuilder uses the Kubernetes Go client to talk to the API server. If we look at the codebase, the main important thing is here: we are attaching two things to the manager. One is the Dragonfly reconciler, and the other is the Dragonfly pod lifecycle reconciler. So what's the job of the Dragonfly reconciler? Its job is to create the StatefulSets and Services for each Dragonfly resource that is requested. When a user asks for a Dragonfly resource, it goes ahead and creates the StatefulSet and the Service required. Now, let's look at the code of the Dragonfly reconciler itself. Each reconciler essentially has one single method that runs for each event. And what events does it get? The Dragonfly reconciler watches the type called Dragonfly, as it should, and it owns two things: the StatefulSet and the Service that back the Dragonfly resource. So how does it create those? That happens in the reconcile loop. We get a request of type ctrl.Request, and we try to get the Dragonfly object behind it. Once we have the Dragonfly object, there are three essential things that we do here. First, if the status of the Dragonfly is empty, then we create resources: we essentially translate the Dragonfly object into a bunch of resources. Here you can see we create a StatefulSet, and there are a bunch of fields that you can set.
TLS secrets, annotations, a lot of things. And then we back that up with a Service; the Service is what the user uses to talk to Dragonfly, the database. Then we just apply those resources, update the status field, and exit. So that's what just happened. If we go back to the code, we'll see that we create a StatefulSet and a Service backing these resources up. Now, let's talk to the Dragonfly instance itself. Because we are outside of the cluster, I'm running a Redis pod and giving it the URL dragonfly-sample.default: default is the namespace, dragonfly-sample is the resource. If I run this command, I get a Redis client prompt that I can use to talk to the Dragonfly instance. It's downloading the Redis image, I think. Yeah. Now that we have the prompt, let's put some data in: SET event kubeday. Now if I do GET event, we see that the data is there; let's exit that container. So we were successfully able to create a Dragonfly instance, talk to it, insert some data, and come back. Before we go ahead, the important thing to note is that we asked for one instance of Dragonfly with two replicas: dragonfly-sample-0 and -1. But with Redis and Dragonfly, there's only one master at any time, and the other pods are replicas, to fail over to in the event of a problem. So in our case, which pod did we write our data into? We track this through labels: there's a label called role, with the values master and replica. If you run this command, you can see which pod is the replica and which is the master. And the Service always points to the pod with role equal to master, so the users are only ever talking to the master; the replica is always catching up with the data from the master.
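The master-only routing described here boils down to a label selector on the Service. A sketch of what that can look like; the label keys and values are assumptions, not necessarily the operator's exact ones.

```yaml
# Sketch: the Service selects only the pod currently labeled role=master,
# so clients always reach the master. On failover the operator moves the
# role=master label, and the Service follows automatically.
apiVersion: v1
kind: Service
metadata:
  name: dragonfly-sample
spec:
  selector:
    app: dragonfly-sample
    role: master            # flipped to the promoted pod on failover
  ports:
    - port: 6379            # Redis wire protocol port
      targetPort: 6379
```

This is why no client-side change is needed during failover: the Service name stays stable while the selector resolves to whichever pod holds the master role.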
Now, what happens if we delete the master? We saw that dragonfly-sample-0 is the master. If we delete the pod, the master has gone down. Now, if we do kubectl get pods with role equal to master, we see that sample-1 has automatically gotten the role=master label. Who updated that? This is the job of the operator. Which means that if we run the same kubectl run against the instance, even though we deleted the master, when we do GET event we should still see the data, because the operator saw that the pod went down and automatically promoted the other replica to master. And if we go back and check which pods are replicas, with kubectl get pods and role equal to replica, we see that sample-0 is now the new replica: once the pod came back, the operator automatically marked it as a replica of the new master. So whose job is all of this? This is where the pod lifecycle reconciler comes in, the second thing in the diagram. Essentially, the responsibility of the pod lifecycle reconciler is to listen for pod events: all pod events, checking whether each one is related to Dragonfly and whether there is a problem. If the master is going down, it promotes a replica to master; if there is a new Dragonfly pod, it marks it as a replica. So all the logic around new pods and deleted pods. It listens to the pod events, and then it runs Dragonfly commands on those pods to configure them as a master or as a replica. This is the thing I started the talk with: the Dragonfly operator knows about Dragonfly and the Redis commands, and it also knows the Kubernetes API. It knows both, and this is how we bring the business logic of Dragonfly into the Kubernetes layer. So now let's look at the code for how this works, the Dragonfly pod lifecycle reconciler.
If you look at the code here, the pod lifecycle reconciler listens for pod events, only that. But it only listens for pod events carrying the custom app-name label for Dragonfly, so that it can filter: it doesn't have to process every pod event in the cluster, since there are application pods the operator doesn't care about. So it filters down to only the pods with the Dragonfly label. Then, whenever it gets a request, it first checks whether the pod has a role label. If the pod doesn't have one, we check the phase. If the phase is resources-created, then we have a brand-new instance: we have to configure replication from scratch, because there was no master and no replica yet. But if the phase is ready, then we got a pod restart on an existing Dragonfly instance. We check whether a master exists: if there is a master, we configure this pod as a replica of that master. If there is no master, then we need to create a new master, so it runs the function that configures replication itself. And if you look at the example of configuring the pod as a replica, you see that it runs a bunch of commands: it uses the Go Redis client library and runs the SLAVEOF command. Then it updates the labels like I mentioned: it sets role equal to replica on the pod and updates the pod resource itself. So the Dragonfly pod lifecycle reconciler is responsible for all pod events, filtered down to the Dragonfly pod events, and based on the state of the pod: if it doesn't have a role label and a master exists, it configures the pod as a replica of that master.
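The decision tree just described can be modeled as a small pure function. This is a stdlib-only sketch of the logic, not the operator's actual controller-runtime code; the phase names, label values, and action strings are assumptions for illustration.

```go
package main

import "fmt"

// Pod models the few facts the pod lifecycle reconciler looks at for a
// Dragonfly pod event (illustrative; the real reconciler works on
// corev1.Pod objects and the Dragonfly status).
type Pod struct {
	Role         string // "", "master", or "replica" (value of the role label)
	Phase        string // Dragonfly status phase: "resources-created" or "ready"
	MasterExists bool   // whether some pod already carries role=master
}

// decide mirrors the decision tree described above and returns the
// action the reconciler would take for this pod event.
func decide(p Pod) string {
	if p.Role != "" {
		// Pod already carries a role label: nothing to reconfigure.
		return "nothing to do"
	}
	if p.Phase == "resources-created" {
		// Brand-new instance: replication has never been configured.
		return "configure initial replication (pick a master)"
	}
	// Instance is ready, so this is a new or restarted pod.
	if p.MasterExists {
		return "run SLAVEOF against the master, label role=replica"
	}
	return "promote this pod to master, label role=master"
}

func main() {
	// Example: a restarted pod on a ready instance whose master is gone.
	fmt.Println(decide(Pod{Phase: "ready", MasterExists: false}))
}
```

Running it walks the failover branch: with no master present, the pod is promoted; with a master present, it would instead be wired up as a replica via SLAVEOF and relabeled.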
If there is no master, then it promotes the replica itself to be the new master. So it has all the logic around pod lifecycles. Let's take an example: we have one master and one replica, and the master goes down. The pod lifecycle reconciler gets a request and checks whether there is a master. No, we don't have one, so it makes the replica the new master; and when the old pod comes back, it configures that pod as a replica of the new master. So the whole topology around who's the master and who's the replica is controlled by the pod lifecycle reconciler. Now, the other important thing the Dragonfly operator does is upgrades. Like we discussed, because this is a stateful workload, the Dragonfly operator knows better how to upgrade a Dragonfly instance. If there is a master and a replica, any operator, whether a Kubernetes operator or a human one, would know to upgrade the replica first. Why hit the master first? Once you have the replica ready with all the new data, then you go to the master, upgrade it, and then your cluster is all good. The Dragonfly operator tries to do the same; this is part of the same Dragonfly reconciler. So let's do the demo. I want to make the CPU request lower, so I've updated the resource. Now let's apply it. We can see that it is running the upgrade lifecycle, but starting with the replica. Here, sample-0 was the replica, so it started there and deleted that pod. Then we wait until the sample-0 pod is back up and ready as a replica again. Because when sample-0 was terminated, it lost all its data; when the new sample-0 came up with the new resource requirements, it had to be filled back up with the data from the master.
Once that is ready, the operator triggers the update for sample-1. As we saw, sample-1 was just terminated; now it's coming back up. In the meanwhile, when the master was being deleted, the replica had to become the new master, and that has already been done. So if we go back and check who has role equal to master, we see that sample-0 is now the master again. Now, if we run the same command to get the data, GET event, the data was persisted through the whole rollout. We started with the replica, deleted it, got a new replica with the new configuration, and then updated the master. The master went down, the replica got promoted to master, the old master is now a replica of the new master, and the data is still all there. So this is the thing we're talking about: the rollout is carried out by the operator because it knows how best to do it. And how does this happen? If you look at the code, first we check whether the StatefulSet's pod spec has changed, because only some kinds of configuration updates require a rollout; not all updates are the same. For example, adding a new label doesn't require the whole procedure, so we only roll when such a change happens. We update the status field to rolling-update and return. When the reconcile runs again, we see that a rolling update is in progress. We first list all the pods of this Dragonfly instance, then start with the replicas, checking whether each is on the latest version. We first want all the current replicas to be ready with the existing data; when they are ready, we start terminating each replica.
We start with the replicas, and once we've terminated all of them and they're back up and configured as replicas of the existing master again, we move on to the master. We delete the master, and at that point one of the already-upgraded replicas becomes the new master; the old master then comes back as a replica of this Dragonfly instance. That's how we do a rollout of the whole thing while making sure we understand Redis: we use Redis commands. Even in this example, when we update the master, we do that using the REPLTAKEOVER command. REPLTAKEOVER is a custom command specific to Dragonfly that hands mastership over in a coordinated way, so that clients don't lose writes during the transition and it's clear up to which point the data was inserted and available. And we do all this using the field called updateStrategy: on a StatefulSet, you can set the OnDelete update strategy, which means the StatefulSet is not automatically upgraded by Kubernetes. When you apply a new change, Kubernetes doesn't do anything; the operator is responsible for deleting the underlying pods. When the operator deletes an old pod, Kubernetes creates a new pod with the new configuration; without the operator deleting the pod, Kubernetes doesn't act on the update at all. That's the point of the OnDelete StatefulSet update strategy: Kubernetes waits until someone deletes a pod and only then creates the new pod with the new configuration, so the order of the upgrade, and when to do what, is left to the operator. We use that mechanism to prevent Kubernetes from doing the whole upgrade itself.
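In manifest form, the strategy described above is a single field on the StatefulSet the operator creates. This is an abridged sketch (template and other spec fields omitted); updateStrategy.type: OnDelete is the standard Kubernetes API for it.

```yaml
# Abridged: the StatefulSet behind a Dragonfly instance, with the
# OnDelete strategy so the operator controls the rollout order itself.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dragonfly-sample
spec:
  replicas: 2
  serviceName: dragonfly-sample
  updateStrategy:
    type: OnDelete   # Kubernetes waits; pods get the new spec only when deleted
```

With the default RollingUpdate strategy, Kubernetes would replace pods in reverse ordinal order on its own; OnDelete hands that sequencing, replicas first and master last, to the operator.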
So those are the three things that we do with the Dragonfly operator. Like I said, the code is open source; if you want to contribute or just look at it, it's all there. And that's the talk. Thank you.