Hi everyone, thanks for joining. I want to talk about scaling cluster operators beyond the single-cluster boundary, which we call multi-cluster controllers. This is TikTok's journey of how we implemented multi-cluster controllers, and also why we ended up building them in the first place. My name is Naveen, Naveen Moghula. I'm a tech lead with TikTok on the Edge Platform team. Let me give a little introduction about my team first. The Edge Platform team is responsible for the infrastructure behind all of our edge use cases: CDN-type workloads, live streaming, real-time communications, uploads, and so on. We decided to run Kubernetes on all of these points of presence. We have around 250 to 300 clusters across the globe; we take these tiny data centers, build Kubernetes on top of them, and provide that as a platform to internal engineering teams such as the TikTok CDN team and the live and upload teams. Whenever I say "globe" in this presentation, consider it globe minus China; this is non-China TikTok. The important thing I want to call out is that these are edge clusters, not main data center clusters. Some of them are very tiny, as small as 10 nodes, and they go all the way up to around 600 or 700 nodes, depending on demand and the popularity of the geolocation, so the clusters cover a wide range of sizes. The other point is that we are predominantly in the bare metal infrastructure business rather than the cloud business, for obvious reasons: with TikTok's bandwidth and traffic volume, the cloud would be far too expensive, so we run our own points of presence across the globe. But we have started exploring cloud wherever it makes sense; we have a presence in GKE and in EKS. At a high level we have control plane and data plane clusters. The data plane clusters are where the actual user workloads run, and by users I mean the TikTok CDN team and the other internal teams. The control plane clusters are where our platform services sit, and we use them to manage all the data plane clusters. One other thing: like any other platform team, we don't just install vanilla Kubernetes across all these locations and hand it over as a platform. We provide essential features like logs, metrics, config and secret management, plus internal services, and all of these platform features are provided by our team. Since this talk is about operators, I want to spend a little time on operators here.
So, like I said, we don't just ship vanilla Kubernetes clusters; we provide these platform features, and at the end of the day those features turn into operators: logs, metrics, and so on. Operators play a key role in how a platform team delivers these features. If something is already available in open source, we just reuse it; we're not going to reinvent anything, we take the components as they are and deploy them across all these clusters. Over the last two or three years, because of TikTok's growth, more and more platform feature requests came in from the user teams, and our team grew as well. Any time a platform feature request comes in, it basically turns into an operator, so the number of operators, and the need for more of them, kept increasing. The other factor is complex workflows. When you want to deploy an operator across these 300 clusters, you have to go create the namespaces, resource quotas, and network policies, and if you're using GitOps to deploy these things, you're probably dealing with Kustomize, adding more and more overlays for each specific cluster just to get the operator deployed. Another thing, if you're not in the edge ecosystem: I've been in the Kubernetes ecosystem for five or six years, and before TikTok I mainly worked on main data center clusters, where you might have 10 or 15 clusters of 10,000 or 15,000 nodes each. If you're on a cloud provider, you're probably using ASGs with the cluster autoscaler, which can scale up and down. In our case, since we're in the physical infrastructure business, any time you need extra resources, a couple more servers, it's not straightforward; it can take days, weeks, sometimes months depending on logistics. So there's a hard resource constraint in the edge ecosystem. Coming back to the platform features: every new feature means another operator, and deploying those operators comes with a cost. It's already a resource-constrained ecosystem with tiny clusters, and your operators take up some of those resources. On top of that we have demanding requirements like real-time processing, based on response times from these edge clusters: say you have two or three POPs in the same geolocation, and one POP responds in two milliseconds versus three milliseconds for the other; the routing (GTM) team uses that information to make real-time routing decisions, sending traffic to POP two instead of POP one. For any of these features you're adding more operators and consuming more resources on top of it. So it becomes a balancing act: these are tiny clusters.
The application teams want as many resources as possible on these tiny 10, 15, or 100-node clusters, and at the same time we're getting more and more platform feature requests. So you have to strike a balance. At one point we used to say that 20% of the resources are reserved for the platform team, because so many platform feature requests were coming in and we had to serve them. That obviously doesn't go over well with the application teams: they want more resources so they can serve more traffic from these edge locations and generate revenue, and we have our own reasons, namely that more features cost resources. Before multi-cluster controllers, whenever a new feature request came to us, we would decide to build an operator for it. The traditional approach is: use Kubebuilder, create an operator, then go create the namespaces, resource quotas, and so on across all 250 to 300 clusters and start deploying. But a lot of the time we ran into clusters that were completely full, with no capacity at all, and many clusters are reserved by the CDN team or other application teams, who hold the entire capacity. Even today you get into situations where a team has reserved 500 or 1,000 CPU cores but is only using 200 or 300. That problem isn't specific to our team or to TikTok; I saw the same thing at my previous job, and I was talking to cloud provider folks yesterday, from AWS and GCP, and they said they see the same trend: utilization is often below 20 or 25 percent. So working with the end users to reclaim some of those resources for platform features is always an option; it was there before, it's there today, and it will be there tomorrow. We'll keep working with those teams to use the underlying resources efficiently. But from an engineering point of view, we asked: beyond the resource negotiations, is there anything else we can improve so that we don't have to consume resources in the edge clusters at all for platform operators? That's when we went to the whiteboard and said: we use Kubebuilder as the framework every time we create a controller, so why don't we take a closer look at Kubebuilder and see whether there's an opportunity to avoid using so many resources across all these edge clusters? If you look at the left side, most of you probably recognize this architecture diagram, because it has been in the Kubebuilder book for the last four or five years.
At a high level, I've simplified it on the right side. You have one manager, which provides common functionality such as the Kubernetes client, and the cache, which is essentially the informers. Then you have controllers that register with the manager, and it's not a one-to-one relationship: you can register multiple controllers with one manager. When you create a controller for a particular resource, it registers with the informer, which is the cache. The cache gets built, the controller filters the events based on predicates or whatever you configure, puts items on the workqueue, and the reconciler picks the items off the queue one by one and processes them. When we started looking at it, what stood out was the manager: the manager is the component that picks up the kubeconfig (the in-cluster config) and talks to the API server, so the boundary is a single cluster. So why not make the manager aware of multiple clusters instead of just one? That's where we started, with just a diagram on the right: bulk up the manager to hold multiple clients, so each client can talk to its respective API server, and hold multiple caches as well. A controller registers with this manager the same way, and whenever it wants to talk to cluster one's API server, it goes through client one, and so on. In theory it looked great: just expand the manager so it can talk to more than one cluster. So we took the Kubebuilder and controller-runtime code as a reference and started enhancing the packages inside it: the controller component, the cluster component, the reconciler, and so on. If you look at that code, you'll see the interfaces accept only one client and one cluster, so we tried to make them accept arrays of clusters, arrays of clients, arrays of caches. Once we started doing that, we went down a deep rabbit hole, because we had to change every component: when you change the function signatures at the top layer, you have to change everything underneath that calls into them. We ended up writing this beautiful code, or complex code, however you want to call it, and changed every component in all of those packages.
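To make the single-cluster boundary concrete, here is a minimal Kubebuilder-style main.go sketch of that standard architecture: one manager, one kubeconfig, controllers registered against it. The NamespaceReconciler is just an illustrative example, not one of our actual controllers.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// NamespaceReconciler is a hypothetical reconciler used only for illustration.
type NamespaceReconciler struct {
	client.Client
}

func (r *NamespaceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the namespace from the cache-backed client and act on it; details omitted.
	var ns corev1.Namespace
	if err := r.Get(ctx, req.NamespacedName, &ns); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	return ctrl.Result{}, nil
}

func main() {
	// GetConfigOrDie picks up the in-cluster config (or local kubeconfig):
	// this single config is exactly the single-cluster boundary described above.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Register the controller with the manager: watch Namespaces, filter and
	// enqueue events, and let the reconciler process items off the workqueue.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Namespace{}).
		Complete(&NamespaceReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}

	// Starting the manager starts the shared cache (informers) and all controllers.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```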
The way we did it, as shown on the left-hand side: implement the manager to be aware of more than one cluster, implement the controllers to be aware of multiple caches and clients, start multiple caches instead of a single cache, and start all the controllers at the end. That's the sequence. The overall idea is: instead of having these distributed operators installed across all the clusters, run one single operator in a central cluster, deployed in the cloud where we have flexibility on resources, and let it do what it's supposed to do against all the clusters. We started working on it, and after a good amount of code we got to a stage where it worked. In our lower environments, one deployment was able to watch events across 20 or 25 different clusters and make whatever changes needed to be made in each particular cluster. So at a high level we went from distributed operators, deployed across all the clusters, creating namespaces and negotiating with other teams, to one single operator in the central cluster that can access all the clusters, list and watch across all of them, and perform whatever reconciliation is needed. It definitely helped with our deployment issues. It also helped with troubleshooting: in the edge world with 300 clusters, if one cluster is having problems, you'd normally have to get into that single cluster and figure out what's happening with that specific controller, plus all the cluster-specific issues like networking. Now troubleshooting is centralized in one place, which helped us too. And like I said, we no longer had to create namespaces, resource quotas, or anything else across all the clusters. It simplified things and everybody loved it. But during implementation, every time I created a PR, we had lots and lots of discussion and lots of comments on it, and for good reason: the engineers are all top class, but none of them had been exposed to the underlying libraries. Not many people know controller-runtime, because Kubebuilder and the community have worked hard to hide all those complexities under the rug so people can just write controllers using the framework. So every time I put up a PR, I had to explain how all of it worked. That's when it hit us: the solution works, but it's not going to scale, because this one decision, a centralized operator written this way, means all of my engineers, the whole team, need to understand the underlying libraries.
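Just to illustrate what "bulking up the manager" roughly meant in that first approach, here's a purely conceptual sketch: a manager-like struct holding one client and one cache per cluster. None of this is controller-runtime's real API; the actual change was made inside the forked packages' interfaces.

```go
// Illustrative sketch only, not controller-runtime's API and not our actual code.
package multicluster

import (
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// fatManager holds one client and one cache per cluster, keyed by cluster name,
// instead of the single client and single cache a normal manager carries.
type fatManager struct {
	clients map[string]client.Client
	caches  map[string]cache.Cache
}

// clientFor returns the client for a given cluster, so a controller can
// "talk to cluster one's API server through client one".
func (m *fatManager) clientFor(cluster string) client.Client {
	return m.clients[cluster]
}
```

The trouble, as described above, is that every interface that previously assumed a single client or cache now has to be threaded through with the cluster key, which is exactly the rabbit hole we fell into.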
And because we were growing a lot at that time, new engineers would ask, what is this? It took them a long time to onboard and adapt to these multi-cluster controllers we had implemented from scratch. Like I said, the community works hard to hide these complexities so people can write lots of controllers easily, but the approach we had taken went in exactly the opposite direction: people now needed to understand what a controller is, what a reconciler is, what the workqueue is, what the caches are, all of those library internals. So, the balancing act: the concept works, because it let us replace distributed operators with one single operator doing its job across all the clusters, and it reduced our resource problems, because we no longer had to go to four or five different teams asking them to give back reserved resources so we could deploy platform features. But on the other side, the complexity increased compared to the operators we used to write. What we wanted was to keep the concept, a centralized control plane with one single operator in one single central cluster, but write the code in a way that stays close to widely known frameworks, whether that's Kubebuilder or Operator SDK. That's what we realized from this. The last point is that people can solve a problem in many different ways, but our team culture is that code familiarity matters more than writing some complex code and just being happy that you solved the problem. You want as many people on your team as possible to understand your code so they can take it forward; you're not going to be there forever, and someone else has to maintain that code at some point. For that reason we went back to the whiteboard, scrapped the entire thing, and started over. If you look at this diagram, it's literally the very first Kubebuilder slide: you have one manager and you have controllers. Each manager is dedicated to one specific cluster. Instead of bulking up one single manager and making it aware of all the clusters, we changed it to cluster-specific managers: let each cluster have its own manager, so you're not changing anything in the Kubebuilder framework; you're introducing something on top of it instead. That something is what we call the multi-cluster manager. The multi-cluster manager's responsibility is to bring up the managers and shut them down whenever needed. So at a high level: each cluster has its own cluster-specific manager, the controllers are added to each cluster-specific manager, those managers are added to the multi-cluster manager, and the multi-cluster manager takes over the manager lifecycle events.
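Here is a minimal sketch of that multi-cluster manager idea: one standard controller-runtime manager per cluster, plus a thin wrapper that owns their lifecycle. The type and method names here are my own for illustration, not TikTok's actual code.

```go
package multicluster

import (
	"context"
	"sync"

	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

type clusterEntry struct {
	mgr    manager.Manager
	cancel context.CancelFunc
}

// MultiClusterManager starts and stops one cluster-specific manager per cluster.
type MultiClusterManager struct {
	mu       sync.Mutex
	clusters map[string]*clusterEntry
}

func NewMultiClusterManager() *MultiClusterManager {
	return &MultiClusterManager{clusters: map[string]*clusterEntry{}}
}

// Add creates a plain controller-runtime manager for one cluster from its
// kubeconfig; controllers are then registered against it exactly as they
// would be in a single-cluster Kubebuilder project.
func (m *MultiClusterManager) Add(name string, cfg *rest.Config) (manager.Manager, error) {
	mgr, err := ctrl.NewManager(cfg, ctrl.Options{
		// All per-cluster managers live in one process, so disable the default
		// metrics endpoint to keep them from fighting over the same port.
		Metrics: metricsserver.Options{BindAddress: "0"},
	})
	if err != nil {
		return nil, err
	}
	m.mu.Lock()
	m.clusters[name] = &clusterEntry{mgr: mgr}
	m.mu.Unlock()
	return mgr, nil
}

// Start runs every cluster-specific manager in its own goroutine and keeps the
// cancel funcs so an individual manager can be stopped later, for example when
// a cluster is deregistered or its kubeconfig changes.
func (m *MultiClusterManager) Start(ctx context.Context) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, e := range m.clusters {
		mgrCtx, cancel := context.WithCancel(ctx)
		e.cancel = cancel
		go func(mgr manager.Manager, mgrCtx context.Context) {
			// Start blocks until this cluster's context is cancelled.
			_ = mgr.Start(mgrCtx)
		}(e.mgr, mgrCtx)
	}
}

// Stop shuts down a single cluster's manager without touching the others.
func (m *MultiClusterManager) Stop(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if e, ok := m.clusters[name]; ok && e.cancel != nil {
		e.cancel()
		delete(m.clusters, name)
	}
}
```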
All right, so it's pretty much what I just described: multiple managers instead of one single manager. And there are no changes to the reconciler: however you write it with Kubebuilder, you can use literally the same reconciler here. Nothing changed from that point of view except a few Go-specific functions, nothing related to controller-runtime or any of the library internals. The other important thing: since the controllers are added to cluster-specific managers, and you write one single reconciler type, that reconciler needs to know which cluster it's actually working on. So we introduced one extra field that carries the cluster name. Now the reconciler knows which cluster it's operating on, and if for whatever reason you want specific logic for a specific cluster, the reconciler has the cluster name and can act accordingly. At a high level, I don't know if you can see it, but on the left-hand side is a sample Kubebuilder main.go: you create the manager at the top, add the controllers to that manager, and start the manager. The right-hand side is our multi-cluster approach. It starts by creating the multi-cluster manager, but before that, since it has to be aware of multiple clusters rather than one, it needs kubeconfigs for multiple clusters. So in the line just before creating the multi-cluster manager, we go pick up the cluster information, which in our case is stored as Secrets: one Secret per cluster, containing the cluster name and the kubeconfig for it. During bootstrap we read all those Secrets, so we end up with the list of clusters and their kubeconfigs, and then we create the multi-cluster manager. In the second section, we add the controllers to the cluster-specific managers: in a loop, pick up each cluster-specific manager and add the controllers to it. In this example we add two controllers: a multi-cluster event reconciler, which processes events across all the clusters, and a namespace reconciler, which does a bunch of namespace-related work whenever a namespace comes up. You get the point: you create the multi-cluster manager, add the controllers to the managers, and start the multi-cluster manager. It stays close to the Kubebuilder framework, and it's a centralized operator where instead of one single kubeconfig you hand over kubeconfigs for multiple clusters. That's how close we got to the Kubebuilder framework, and it helped a lot with our users.
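Here is a sketch of what that multi-cluster main.go could look like, reusing the hypothetical MultiClusterManager type from the earlier sketch (assumed to be in scope). The Secret layout, one Secret per cluster holding a kubeconfig, follows what I just described; the namespace ("cluster-registry"), the label selector, and the NamespaceReconciler are made-up names for illustration.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// NamespaceReconciler carries the extra ClusterName field so the same
// reconciler code knows which cluster an event came from.
type NamespaceReconciler struct {
	client.Client
	ClusterName string
}

func (r *NamespaceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Cluster-specific logic can branch on r.ClusterName if it ever needs to.
	return ctrl.Result{}, nil
}

func main() {
	ctx := ctrl.SetupSignalHandler()

	// Bootstrap: list the per-cluster Secrets in the central cluster; each one
	// holds a cluster's name and its kubeconfig.
	cs := kubernetes.NewForConfigOrDie(ctrl.GetConfigOrDie())
	secrets, err := cs.CoreV1().Secrets("cluster-registry").
		List(ctx, metav1.ListOptions{LabelSelector: "type=cluster-kubeconfig"})
	if err != nil {
		panic(err)
	}

	mcm := NewMultiClusterManager()
	for _, s := range secrets.Items {
		cfg, err := clientcmd.RESTConfigFromKubeConfig(s.Data["kubeconfig"])
		if err != nil {
			panic(err)
		}
		// One cluster-specific manager per Secret.
		mgr, err := mcm.Add(s.Name, cfg)
		if err != nil {
			panic(err)
		}
		// Register controllers exactly as in a single-cluster project, passing
		// the cluster name into the reconciler.
		if err := ctrl.NewControllerManagedBy(mgr).
			For(&corev1.Namespace{}).
			Complete(&NamespaceReconciler{Client: mgr.GetClient(), ClusterName: s.Name}); err != nil {
			panic(err)
		}
	}

	// Start all cluster-specific managers through the multi-cluster manager.
	mcm.Start(ctx)
	<-ctx.Done()
}
```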
We had one intern writing controllers on this framework, and it was easy for him to understand: this is what the open-source documentation shows, and this is how you do the same thing for multi-cluster controllers. People are now writing multi-cluster controllers a lot. Obviously, like any software product, it goes through lots of improvements and enhancements. Once we had been running it for six or seven months, we realized a couple of things, not quite issues, let's call them challenges. One is that in the edge world there are far more network connectivity problems than you'd like. A lot of the time these clusters get disconnected, and when one manager gets stuck it causes problems for all the controllers in that specific manager. Also, like I said, we're growing and adding more and more clusters, which means dynamic clusters: we pick up the clusters during bootstrap, but after the pod is running, if a new cluster comes up or an existing cluster goes away, we had no way to handle it. So what we did, at a very high level (this is mostly Go plumbing and nothing more): on the left side are the Secrets. In this multi-cluster controller, people can add a Secret for a specific cluster, or, if a cluster gets removed during deregistration, we remove its Secret. Based on the create, update, or delete event on that Secret informer, we perform the corresponding operation, shown on the right side. On create, we create the cluster-specific manager, add the controllers, and start the manager. If the Secret gets updated, and sometimes we do make changes, like moving the admin nodes from one side to the other, which requires a different kubeconfig, we literally stop the manager and do the same thing again. And if the Secret gets deleted, that means the cluster was deprecated, so we just stop the manager. All of this happens inside the pod itself: you're not restarting the pod or doing anything manual; it's taken care of automatically. Coming back to the network connectivity issues, we introduced health checks. Every 30 seconds we ping these clusters, and if there's a network connectivity issue for whatever reason, we stop the manager and restart it. We basically simulate a Secret update: since it's a Go channel, we just put a message on it saying an update happened, which stops the manager and then tries to restart it, until the health-check ping succeeds and we're good again. We've been running this for the last two years with no issues so far, and we now have around nine different controllers, all deployed in the central clusters.
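To make the health-check idea concrete, here's a sketch: every 30 seconds, probe a cluster's API server, and on failure push a synthetic "update" event onto the same channel that the Secret events use, so the existing stop-and-restart path for that cluster's manager kicks in. The event type and channel wiring here are assumptions for illustration, not our exact implementation.

```go
package multicluster

import (
	"context"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// clusterEvent mirrors the Secret informer events ("create", "update", "delete").
type clusterEvent struct {
	Cluster string
	Kind    string
}

// healthCheck probes one cluster on a 30-second ticker and reports failures as
// synthetic update events until ctx is cancelled.
func healthCheck(ctx context.Context, cluster string, cfg *rest.Config, events chan<- clusterEvent) {
	cs := kubernetes.NewForConfigOrDie(cfg)
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			probeCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
			// Hitting /healthz is one cheap way to check API server reachability.
			_, err := cs.Discovery().RESTClient().Get().AbsPath("/healthz").DoRaw(probeCtx)
			cancel()
			if err != nil {
				// Simulate a Secret update: the consumer stops that cluster's
				// manager and restarts it once the cluster is reachable again.
				events <- clusterEvent{Cluster: cluster, Kind: "update"}
			}
		}
	}
}
```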
We save a lot of resources in the edge clusters for sure, and this also solved dynamic cluster management: whenever a new cluster comes in, you don't have to restart the pods or anything; it's handled automatically. All right, so finally, the takeaways. The first is resource recognition: at least in the edge space, you have to understand that there are real resource constraints, especially if you're on physical server infrastructure. But you also can't just sit tight and say there are no resources so we can't provide more platform features; you have to balance that. On the other hand, don't go inventing completely different things if you can help it. It's always fun to invent, but if there's a way to stay close to the open-source frameworks, that's our team's preference: keep it close to open source so it's easy for new engineers to understand and onboard. And simplified deployment, which is one of those hidden benefits I couldn't be happier about: previously it used to take us multiple days to deploy a platform operator across all the clusters because, like I said, users often reserve all the resources and we have to work with them to reclaim some. This cleaned that up completely. All I need now is the central management cluster, which we control, and since those are cloud clusters we can bump up the resources if we need to. But I also want to close with this: it's an opinionated approach. I'm not going to claim it will work for every organization. Some operators you write as a DaemonSet; they have to run on a particular node, and there's no getting around that. So this is an opinionated, case-by-case approach; it may or may not work for your situation, but this is our journey at TikTok, where we had to go a bit outside the box to take our operators from a single cluster to multiple clusters. We've also been talking to the multi-cluster SIG, and hopefully once the cluster inventory is there, Kubebuilder can adopt it so operators can be aware of multiple clusters instead of a single one. But obviously we're quite far from that; it will probably take a year or a year and a half. That's our approach, and that's our journey with multi-cluster controllers. Thank you. I know it probably took more time than planned, but does anybody have any questions? Yes, we can talk about it.