Hi everyone, thank you for attending our talk, "A TikTok Story: How to Manage a Thousand Applications on the Edge with Argo CD." In today's talk, we'll do a case study on how TikTok is managing 3,000 applications across 100 global edge clusters with Argo CD, and we'll discuss some considerations, tips, and techniques for using Argo CD to manage cluster applications on the edge.

Before we start, some quick introductions. My name is Jesse Suen. I am an Argo project lead and co-founder and CTO of Akuity.io, which provides application delivery solutions powered by Argo. Speaking with me today is Chin Kun Li, and I'll let him introduce himself.

Thank you, Jesse. My name is Chin Kun Li, Tech Lead Manager of the Edge Platform Team at TikTok. We use Kubernetes and cloud-native technologies to manage TikTok's on-prem edge clusters and help developers deploy and manage applications on the edge.

This is an overview of TikTok's edge clusters. We have around 100 edge clusters distributed around the world, serving TikTok edge services such as video seeding cache, live streaming, gaming, and so on. The size of our edge clusters varies from 10 to 60 nodes per edge site. These are powerful server nodes, currently with 96 CPU cores and 256 GB of memory each. We also have a data center that can talk to all of those edge clusters. The data center functions as both the management control plane and the service data source for those edge clusters and edge services. For example, Argo CD runs in the data center to deploy edge services to those edge clusters.

Talking about edge cluster deployment, this is the high-level architecture of our deployment infrastructure. Each of our edge clusters is a standalone Kubernetes cluster. They talk to the data center, where we have Argo CD and the Git repo, to manage the deployment of edge services using GitOps. Our developers push the Kubernetes configuration for their edge services to the Git repository, and the central Argo CD controller pulls that configuration from the Git repository to sync and deploy it to the specified edge clusters.

The deployments of all edge services follow the same pattern. Usually an edge service is deployed to many edge clusters, and the functionality and behavior of that service are very similar, or the same, on all of them. For example, when the video seeding cache service is deployed to all the edge clusters, it serves the same seeding functionality for local users. As a result, the Kubernetes configuration of an edge service also follows the same pattern: the deployments of an edge service on all edge clusters share a large portion of common configuration, while a small portion is cluster-specific. For example, the replica count, resource quota, IP addresses, et cetera might be configured differently on different edge clusters for the same edge service, because different edge clusters may have different numbers of server nodes and different IP address ranges.

Here is an example of the configuration for such a deployment pattern. In this example, we use a sample nginx service deployed to three edge clusters. We can see that the majority of the Deployment and Service configuration of this nginx service is the same in all three clusters, except for some cluster-specific configuration: in this example, the replica count, image version, and external IP.
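To make this pattern concrete, here is a minimal sketch of what the shared base configuration for such a sample nginx edge service could look like. The file paths, names, image tag, and port are illustrative assumptions rather than TikTok's actual manifests; the fields flagged in the comments are the ones that typically end up cluster-specific.

```yaml
# deploy/base/deployment.yaml -- shared by all edge clusters (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1                 # replica count is patched per cluster
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.21   # image version may also be patched per cluster
          ports:
            - containerPort: 80
---
# deploy/base/service.yaml -- shared by all edge clusters (illustrative)
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
  # externalIPs is added per cluster in the overlay
```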
Although the actual configuration for a real edge service might be much more complicated than this, the idea is the same. For this kind of deployment and configuration pattern, in Kubernetes we can use tools like Helm or Kustomize to manage the common and cluster-specific configuration parts. In this example, we use Kustomize to show how we manage such configurations and how we structure our deploy directory for those configuration files.

In this deploy directory, we have a base directory for the common, shared configuration parts, and an overlays directory for the cluster-specific configurations. In the base directory, we have the deployment and service YAML files that define the common Deployment and Service configuration for this sample nginx application.

Here we show an example of what's in a cluster-specific overlay directory. In the kustomization.yaml file, we point to the base directory for the common, shared configuration and also apply the patches for the cluster-specific configurations. In this example, we apply the replica and image patch to the nginx Deployment configuration and the external IP patch to the nginx Service configuration.

We use Argo CD with the app of apps pattern to manage those configurations and deploy them to the corresponding edge clusters. We create a parent application for the edge service, and the parent application then generates a child application for each cluster to be deployed. Inside each child application, we specify the corresponding cluster's overlay directory as the application path, and we specify the corresponding edge cluster as the destination. The central Argo CD controller then processes those child applications, in this example using Kustomize to generate the Kubernetes manifests by reading the configuration from the specified path in the Git repository, and then deploys the generated manifests to the corresponding edge clusters.

Let's do a live demo to show how we use Argo CD with this app of apps pattern to deploy an edge service. The configuration we're using in this demo has already been uploaded to GitHub and the link is attached here; please feel free to check it out. In this demo, we will deploy the example nginx application to the three edge clusters shown in this diagram.

Let's get started with the demo. This is the GitHub repository for this demo. In this repository, we have the deploy directory, which contains the base and overlays directories. The common configuration for the Deployment and the Service is stored in the base directory, and the overlays directory holds the cluster-specific configurations. For demo purposes, I just added the cluster-specific Service external IP here, plus the kustomization.yaml file that first points to the base directory for the common, shared configuration and then applies this Service patch. In addition to the base and overlays directories, there is also the applications.yaml file. This is what the parent application will use to generate the child applications. In this example, we're going to generate three child applications for the three clusters. For the child application for cluster one, we can see that we specify the configuration path as the corresponding cluster-one overlay directory, and the same goes for cluster two and cluster three.
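Here is a minimal sketch of what one such cluster-specific overlay could look like. The directory name, patch file names, and the concrete values (replica count, image tag, external IP) are assumptions for illustration, not the exact files from the demo repository.

```yaml
# deploy/overlays/cluster-1/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                      # pull in the shared Deployment and Service
patches:
  - path: deployment-patch.yaml     # cluster-specific replicas and image version
  - path: service-patch.yaml        # cluster-specific external IP
---
# deploy/overlays/cluster-1/deployment-patch.yaml (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: nginx
          image: nginx:1.21.6
---
# deploy/overlays/cluster-1/service-patch.yaml (illustrative)
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  externalIPs:
    - 203.0.113.10
```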
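And here is a hedged sketch of how the app of apps wiring could look as Argo CD Application manifests: a parent application pointing at the deploy directory (which contains applications.yaml), and one of the child applications it generates. The repo URL, project, application names, and namespaces are placeholders assumed for illustration rather than the demo's exact manifests.

```yaml
# Parent application, created in the data-center Argo CD (illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-parent
  namespace: argocd
spec:
  project: kubecon-demo
  source:
    repoURL: https://github.com/example/edge-demo.git   # placeholder repo URL
    targetRevision: HEAD
    path: deploy                  # directory holding applications.yaml
  destination:
    server: https://kubernetes.default.svc   # the data-center cluster itself
    namespace: argocd             # child Application CRs are created here
---
# One of the child applications generated from applications.yaml (illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-cluster-1
  namespace: argocd
spec:
  project: kubecon-demo
  source:
    repoURL: https://github.com/example/edge-demo.git
    targetRevision: HEAD
    path: deploy/overlays/cluster-1   # cluster-specific overlay
  destination:
    name: edge-cluster-1              # the registered edge cluster
    namespace: default
```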
Okay, let's go to Argo CD to create this application. Let's first create this nginx application by creating the parent application here. For the parent application, we call it demo-parent, and we use the KubeCon demo project that I have already created. We set the repository to the GitHub repository we just saw, and we set the path to the deploy directory. This is the path that points to the applications.yaml file, which the parent application uses to generate the child apps. The destination here is the Kubernetes cluster and namespace where we're going to create this parent application, which is in the data center. Let's create this parent application.

Now the parent application has used the configuration we just saw in the GitHub repository to generate the three child applications, and we do a sync to create those three child applications. They're created; let's go back here. At this stage, the parent application is created and the three child applications are also created, but they're currently in the out-of-sync state. So we have created those Argo CD applications, the parent app and the three child apps, but we haven't synced yet to generate the Kubernetes manifests and deploy them to the edge clusters. We can sync and deploy by clicking "Sync Apps" and syncing those three out-of-sync child applications. This may take a minute for the three applications to be synced and deployed.

Now they are all synced. Let's go inside this demo cluster-one application. We can see it created the Service and the Deployment for this nginx example app. In the Service's live manifest, we can see the cluster-specific IP that we configured in the GitHub repository, and we can verify this is working by accessing that IP. We see it's working because we're able to see the nginx page. Also, in Argo CD we can go to the pod to see the logs. If we do a refresh here, we should be able to see the corresponding access log entry from my browser, and if I do another two refreshes, we can see two more accesses here. It's pretty useful.

Okay, this concludes the demo part. We have talked a lot about the TikTok edge infrastructure, our deployment architecture, and how we use Argo CD to manage our edge applications and deployments. Now let me talk a little bit about the challenges that we have seen so far using Argo CD in this way.

First of all is performance. Because we have over 3,000 applications managed in the central Argo CD controller, we do start to see that listing all of those applications becomes slower, especially for developers who have permission to see all 3,000 applications. Luckily for us, most of our developers only have permission to see their own applications, which is already around 200 to 300, so it's not that bad for them. But we do see the performance degrade as we have more and more applications in Argo CD.

Second is scalability. We see some imbalanced application handling across different edge clusters. As we manage more and more applications in the central Argo CD controller, we start to roll out more and more controller instances to scale our ability to handle those applications.
But currently Argo CD shards applications to controller instances by destination cluster, which means all the applications for one destination edge cluster will be assigned to the same controller instance. Some of our edge clusters are big and some are small, so when we deploy more applications to the big edge clusters, there are more corresponding Kubernetes resources to be handled for those clusters, yet all of those applications and resources can only be handled by a single Argo CD controller instance. As a result, some controller instances might be idle while others are busy, and we cannot easily resolve this problem by simply adding more controller instances.

Next is functionality. We'd like better project-level support from the Argo community, because currently we use Argo CD projects for multi-tenancy, to support the different teams and developers in our company. One issue with the current project support is that application names in different projects conflict. For example, if we have an nginx application in project one and another application named nginx in project two, they will conflict with each other. In addition to that, we'd often like to see better project-level views and management supported as well, for example for managing project-level applications, Kubernetes resources, repositories, et cetera.

Last but not least, for debuggability and reliability, we find that Argo CD currently lacks internal observability for tracing, such as an OpenTracing capability. As Argo CD users, we sometimes need to go into the Argo CD source code to do some troubleshooting, but today we don't have an easy way to trace and track a single request through the code, because tracing isn't enabled in Argo CD yet.

So those are the challenges we have seen so far. Despite them, we still think Argo CD is very helpful, useful, and reliable for our edge use case, and we do recommend Argo CD if you have a similar scenario. Next, Jesse from the Argo CD side will talk more about suggestions for using Argo CD in such edge scenarios, as well as how those challenges could be addressed.

Thank you, Jinkun. Now we'll discuss two different edge deployment models, which we call the centralized push model and the distributed pull model, and then we'll discuss some of the trade-offs between the two.

The first model, centralized push, is one you should already be familiar with, because it's the one that Jinkun just described. With this approach, you have a centralized instance of Argo CD which connects to and manages your edge clusters. This Argo CD instance renders the manifests from Git and pushes, or keeps applying, those manifests to the edge clusters. The biggest advantage of this model is that you're able to use Argo CD as it's intended, as a single pane of glass: from this control plane, you're able to view and manage your applications in any cluster. Second, it's the easiest to manage, since you only need to maintain that single Argo CD instance, and because this instance is running inside your data center, you can automate and create integrations against it. On the other hand, a single large Argo CD instance will require some performance tuning and sharding as your cluster fleet grows.
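As an illustration of that tuning, here is a minimal sketch of how Argo CD's controller sharding is typically configured: multiple application-controller replicas with a matching replica-count environment variable, and optionally a shard number pinned in a cluster secret for a large edge cluster. Only the sharding-related fields are shown (this would be applied as a patch over the upstream install manifests), and the names and values are illustrative; check the Argo CD high-availability docs for your version before relying on them.

```yaml
# Illustrative: run three application-controller shards (partial manifest,
# showing only the sharding-related fields).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"          # must match spec.replicas
---
# Illustrative cluster secret: the optional "shard" field pins this edge
# cluster to a specific controller shard.
apiVersion: v1
kind: Secret
metadata:
  name: edge-cluster-1
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: edge-cluster-1
  server: https://edge-cluster-1.example.com:6443   # placeholder API server
  shard: "1"
  config: |
    {
      "bearerToken": "<redacted>",
      "tlsClientConfig": { "insecure": false }
    }
```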
And you may encounter some of the same scaling issues that TikTok experienced. Secondly, because Argo CD needs to connect to these clusters, the Kubernetes API servers of the edge clusters do need to be reachable from Argo CD. And finally, because this is a centralized model, Argo CD does become a single point of failure.

The next model is the distributed pull model. With this approach, instead of running Argo CD centrally, you run an instance of Argo CD in every edge cluster, and these instances are configured to pull from a Git repository and auto-sync changes from a directory specific to that cluster. The biggest advantage of the distributed model is that scalability is no longer a concern: since each cluster has its own application controller, the work is distributed evenly amongst the edge clusters. Second, with this technique you can choose to close off Kubernetes API server access to these edge clusters and increase security, since you no longer need a central Argo CD instance to connect to them directly.

The biggest disadvantage with this approach is that since Argo CD is no longer running centrally, you lose out on a lot of its features. There's no longer a central control plane to view and manage your applications; instead, anytime you want visibility into your applications, you have to access the remote Argo CD instance. You're also less flexible in how you can control your deployments: because this technique relies solely on auto-syncing paths inside Git to create and update applications, you won't be able to perform manual syncs. Finally, with this approach you do need to make your Git repository available to the edge clusters, which may or may not be a problem for your environment.

Next, I'd like to describe some specific features of Argo CD which would be useful for managing edge clusters. The first is Argo CD Core. If you do decide to go with the distributed pull model and run Argo CD on every edge cluster, you should know about this installation mode. Essentially, you can decide to only install the core components, namely the application controller, the repo server, and Redis, and Argo CD will act more like a basic, bare-bones GitOps operator. This is a good option if you never need end users to access an Argo CD UI for these edge clusters. One nice thing about this mode is that you can still use the CLI and UI as long as you have Kubernetes access to the cluster; for example, you can run the argocd admin dashboard command and then visit the UI through localhost.

The second feature which is useful for edge is ApplicationSets. This is a feature in Argo CD which allows you to automatically generate applications from things like paths inside a Git repository, or clusters which get registered to Argo CD, or even some combination of the two. It was created as an alternative mechanism to the app of apps pattern, and this technique works for both the centralized push and the distributed pull model.

Here's an example of an ApplicationSet using the cluster generator, which would be useful for a centralized Argo CD model. The cluster generator creates a new application as soon as a cluster gets added to Argo CD. In the example we see here, this will create an nginx application targeting that cluster as soon as that cluster is added to Argo CD.
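Since the slide itself isn't reproduced in this transcript, here is a minimal sketch of an ApplicationSet along those lines, using the cluster generator; the repo URL, paths, project, and sync policy are placeholder assumptions, not the manifest from the slide.

```yaml
# Illustrative ApplicationSet: the cluster generator stamps out one nginx
# application per cluster registered to Argo CD.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: nginx-edge
  namespace: argocd
spec:
  generators:
    - clusters: {}                  # match every registered cluster
  template:
    metadata:
      name: 'nginx-{{name}}'        # {{name}} = registered cluster name
    spec:
      project: default
      source:
        repoURL: https://github.com/example/edge-demo.git   # placeholder
        targetRevision: HEAD
        path: 'deploy/overlays/{{name}}'
      destination:
        server: '{{server}}'        # API server URL of the generated cluster
        namespace: default
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```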
The next example uses a Git generator. This generator creates applications based on files or paths inside Git; in the example shown here, it will iterate over all the subdirectories in the cluster-3 directory and automatically create an application for each of them.

So we've been working closely with the TikTok team on many of the issues they encountered in their use case, and these are some of the many improvements that have been made as they scaled out their usage. First, they've made many performance and caching improvements in the API server. Second, with global edge clusters the networks tend to be unreliable, so many tuning options have been introduced to make this more robust.

In terms of future improvements, we're discussing ways in which Argo CD applications could live in different namespaces, which would help with things like name collisions as well as improve UI performance. If you do run a large number of application controller shards, we are introducing a utility that can help you automatically rebalance and reassign shards based on cluster size. And finally, other features the TikTok team has been contributing are OpenTelemetry support, as well as the ability to exec into pods through the Argo CD UI.

So that wraps up our talk on how you can leverage Argo CD to manage your clusters on the edge. If you have any questions, please connect with us on the CNCF Slack. And thank you for watching.