Hello, folks. Welcome to our talk. Today we'll be sharing our story of running 10,000 ArgoCD applications and our journey in tuning the performance of our ArgoCD instance. My name is Geary, and this is my colleague, Yudi. Both of us are infrastructure engineers at GoTo Financial. GoTo Financial is the financial arm of GoTo Group, the leading digital ecosystem in Indonesia. We provide various service offerings: motorcycle ride-hailing, food delivery, package delivery, an e-commerce platform, and many other services.

To start off, I'm going to give you a brief overview of the current state of our Kubernetes and ArgoCD setup. We maintain around 50 Kubernetes clusters across AWS, GCP, and private data centers in the Singapore and Indonesia regions. This amounts to 700 compute nodes, 15,000 CPUs, 120 terabytes of memory, and more than 30,000 pods. We took this ArgoCD dashboard snapshot two weeks back, and there's an interesting story behind it. When we were preparing the KubeCon proposal for this talk a few months back, we were at 7,000 ArgoCD apps, and looking at the growth rate of our ArgoCD applications, we predicted that by today we would reach above 10,000. That's why the title says 10,000 ArgoCD apps, and today we made it: we're now at 11,000 ArgoCD applications. These 11,000 applications come from 6,000 repositories across 60 different projects, and ArgoCD watches more than 380,000 objects in total. On our largest cluster, we run 2,000 applications, and ArgoCD watches over 40,000 objects there.

We adopted a simple centralized ArgoCD instance model, which is technically a push model, or what some of you refer to as the hub-and-spoke model. In the hub-and-spoke model, we have a management cluster where we run our single ArgoCD instance. This instance reads from common git providers, and using the manifests stored in those repositories, it pushes objects to all the clusters registered under it. There are a couple of benefits to this simple centralized setup. It's very easy to maintain and upgrade, because we only need to upgrade one ArgoCD instance regularly. It's very easy to integrate with our automation platform: internally we maintain a developer platform that is tightly coupled to the ArgoCD instance, so supporting only one ArgoCD version keeps the integration logic in the platform simple. It's very easy to manage centralized RBAC, because everything is in one place. And we get a single dashboard to view all the ArgoCD applications across every cluster we manage.

This developer platform is the primary interface for our product engineers; we don't let product engineers create ArgoCD applications by themselves. The platform leverages the standardized Helm charts that the platform team maintains; it generates the manifests, pushes them to a repository, and generates the ArgoCD applications. The platform has a grouping mechanism so that sets of ArgoCD applications point to the same repository; in some cases, one ArgoCD application can also point to one repository. The design in our platform is that one service can contain three to five ArgoCD applications, and these sets of applications can have different life cycles.
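As a rough illustration of what the platform produces (not our exact manifest; the names, repo URL, and layout here are hypothetical), a generated Application might look something like this:

```yaml
# A minimal sketch of a platform-generated Application.
# Service name, repo URL, and cluster name are all hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  # The platform derives names from a convention such as
  # <service>-<variant>-<cluster> to keep them globally unique.
  name: payments-stable-cluster-1
  namespace: argocd
spec:
  project: payments-team
  source:
    repoURL: https://git.example.com/deployments/payments.git
    targetRevision: main
    path: stable
  destination:
    # In the hub-and-spoke model, the central instance pushes
    # manifests into a registered remote cluster.
    name: cluster-1
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Because the platform generates every Application, it can enforce naming conventions and sync policies consistently across teams.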
So for instance, if a product engineer creates a service through our developer platform, we generate one canary application and one stable application. The stable application points to, let's say, a stable container image v1, and the canary application points to an updated version of the image, for instance v2. This separation of applications makes it very easy for us to perform our canary rollout strategy, promote to stable, and roll back the canary if necessary.

Now, in our platform we enable Istio by default, which means we inject Istio sidecars into every pod that we manage in our clusters. We manage the Istio configuration as a separate application, so objects like the VirtualService and DestinationRule are maintained in another application. In that application we configure the traffic routing, say 5% to canary and 95% to stable. The reason for the separation is that we can control each application independently. And if a service wants to expose domains to the public or to third-party partners, we configure that through yet another application, the Istio gateway application, which contains the Gateway object.

Another use case where we leverage ArgoCD a lot is maintaining our cluster runtime components. These components are standardized across all 50 clusters. We leverage ArgoCD ApplicationSets and the app-of-apps pattern on a monorepo, so a single repository manages and configures the runtime components for all 50 clusters. In the root of the repository, we maintain the root ApplicationSet, which generates a parent app for each cluster; in this example, there's a cluster-1 parent app. Each parent app manages the base cluster configuration as an app, as well as the runtime components as an ApplicationSet. The base app contains basic configurations like RBAC and LimitRanges that are standardized across all 50 clusters. If we need to customize a cluster beyond what is standardized, we do it through another application, the server-side apply patches application. These sets of applications are replicated for the rest of the clusters. And by the way, we also manage ArgoCD itself with ArgoCD.
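As a sketch of this pattern (the repository URL and directory layout are hypothetical, assuming one directory per cluster; the real root ApplicationSet may differ), the root ApplicationSet could look roughly like this:

```yaml
# A rough sketch of a root ApplicationSet over a cluster-runtime
# monorepo, assuming a hypothetical clusters/<name> layout.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: root-appset
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://git.example.com/platform/cluster-runtime.git
        revision: main
        directories:
          # One directory per target cluster, e.g. clusters/cluster-1.
          - path: clusters/*
  template:
    metadata:
      # One parent app per cluster, e.g. "cluster-1-parent".
      name: '{{path.basename}}-parent'
    spec:
      project: platform
      source:
        repoURL: https://git.example.com/platform/cluster-runtime.git
        targetRevision: main
        path: '{{path}}'
      destination:
        # Parent apps create Application/ApplicationSet objects, so they
        # target the management cluster; the children target the workload
        # cluster itself.
        server: https://kubernetes.default.svc
        namespace: argocd
      syncPolicy:
        automated:
          selfHeal: true
```

Each generated parent app then points at its cluster's directory, which in turn defines the base app, the runtime ApplicationSet, and the server-side apply patches app.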
There are a few challenges with this simple centralized ArgoCD instance. The first is that the instance requires connectivity to all the target clusters it manages. The consequence is that we need to establish tunnels or peering connectivity between the management cluster and all the workload target clusters, and it's not always possible to establish tunnels or peering; in those cases we go over the public network with mTLS. Another challenge is a limitation of ArgoCD functionality: application names must be globally unique, so even across different projects, two applications cannot share a name. Luckily for us, we always generate application names from the platform, so we can apply a convention that adds, say, the cluster name or team name as a suffix or prefix of the ArgoCD application name; that way, different teams in our platform can use names that would otherwise conflict. The next challenge is that this centralized ArgoCD instance is a single point of failure. And the last challenge, which we ran into a lot, is the performance of the central ArgoCD instance.

Along the way, we encountered slow reconciliation and sync issues; we watch the workqueue depth and app reconcile metrics for this. I'm pretty sure a lot of you experience the same if you manage more than 1,000 applications. The UI started loading very, very slowly: for us, it took around one to two minutes to load the entire home page of the ArgoCD dashboard. This is quite visible, and we got a lot of complaints from the product engineers. We faced frequent repo server OOM kills, which we spotted from the kube events; we have an alert for this. We faced a high rate of git API calls for both ls-remote and fetch; we watch the git request metrics for this. And we saw a high repo cache miss rate in the repo server, visible in the repo server logs. It turned out, as we found later, that this high repo cache miss rate was the root cause of why our ArgoCD was making git API calls at such a high rate. Finally, across our controller shards we saw imbalanced resource consumption per shard, and a noisy cluster problem. Next, Yudi is going to walk us through our journey in tuning the performance of our ArgoCD instance.

Thanks, Geary. Now let's talk about our performance tuning journey. In the next 20 minutes, we'll discuss all the config and parameter tunings we have done so far to support and scale our ArgoCD for 11k-plus apps. As a note, the tunings in this presentation are not in chronological order of when we implemented them; instead, we group them by component.

So first, let's take a look at the ArgoCD components, using this diagram from the official ArgoCD documentation. ArgoCD has four layers. First, the UI layer, which is mainly for user interaction: here we have the web app and the CLI. The web app we use through the browser, and the CLI lets us interact with ArgoCD from the terminal. Next, the application layer, which consists of the API server, also called the ArgoCD server; it serves API requests from the UI layer. Then we have the core layer; these components provide the main ArgoCD functionality: the app controller, the appset controller, and the repo server. The app controller reconciles and synchronizes Kubernetes objects according to their state in the repo; the appset controller generates applications based on templates; and the repo server receives manifest generation requests from the ArgoCD server and the app controller. Finally, we have the infra layer, the components ArgoCD depends on for its functionality: Redis for caching, the Kubernetes API for watching and applying kube objects, git, Helm, or Kustomize repositories, and Dex for authentication.

OK, now let's start with the ArgoCD server. As Geary mentioned before, we had very slow UI loads: it could take anywhere between 15 seconds and 2 minutes to load the ArgoCD home page, depending on the network connection. As a solution, we enabled the gzip compression feature in the ArgoCD server; all we needed was to set one environment variable to true. In our case, it improved load time by 5x on average and made the data size about 7x smaller.
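As a minimal sketch of enabling this, assuming the standard argocd-cmd-params-cm ConfigMap (whose keys are projected into the server as environment variables, here ARGOCD_SERVER_ENABLE_GZIP):

```yaml
# Sketch: enable gzip compression on the ArgoCD API server.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Compresses API responses; in our case this cut home-page
  # load time roughly 5x and response size roughly 7x.
  server.enable.gzip: "true"
```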
Next, still about the ArgoCD server, and it's actually not a tuning but more of a tip for us ArgoCD UI users: we can use selectors to show only the applications we want to see, filtering by label values, project, or namespace. In this example, we select one project in one namespace, and only the fraction of our total apps matching those selectors is shown. One great thing is that the selectors are saved for the next time we load the ArgoCD UI, which is quite handy.

Next is about Kubernetes CPU limits. This is actually not specific to ArgoCD but a general Kubernetes mechanism, and it applies to other use cases outside ArgoCD. The problem is that we noticed, through our monitoring system, that all our ArgoCD components were getting CPU throttled, and this impacted our reconciliation latency. In Kubernetes, CPU requests and limits are implemented using cgroups, which use CFS, the Completely Fair Scheduler, from the Linux kernel. CFS can guarantee or throttle a container's CPU depending on the proportion of the container's CPU shares or quota on the node. We don't really have time to go deeper into this mechanism, but if anyone is interested in a closer look, we've attached some references. As a solution to the CPU throttling, we removed the Kubernetes CPU limits, and our app controller and other components no longer got throttled.

Next, we move to the repo server. Geary mentioned the OOM kills happening to our repo server; they happened very frequently. As a solution, we increased the replicas and used an HPA so that the repo server pods scale automatically with memory usage. This distributes requests across more pods, so each pod gets fewer manifest generation requests from the server and app controller, translating into less memory usage. Alternatively, we could use the parallelism limit flag on the repo server to control how many manifest generation requests can be served in parallel, which also helps avoid the OOM kills. However, one tradeoff of that approach is that manifest generation throughput will be lower, since we limit its parallelism.

Next: the ArgoCD server and the app controller talk to the repo server, right? They are the clients of the repo server; they call it for manifest generation, and they have timeout configurations for that. As we grew, we started seeing those client timeout errors in our logs when we sync and refresh apps. As a solution, we increased the timeout configuration on both the ArgoCD server and the app controller. One thing to note is that you really need to set it on both components: I've seen people set it only on the app controller and miss the server, so they keep seeing the timeout errors.

Continuing with the repo server: from our repo server git metrics, we consistently saw very high git fetch request rates. ArgoCD caches generated manifests in Redis, with a 24-hour expiry by default. In cases where remote files change often even though the repository tag hasn't changed, for example when you git push --force or update a Helm chart while keeping the same version, a shorter expiry is desirable so those updates are picked up faster than every 24 hours. But in our use case, our Helm, Kustomize, and git remote references are already hermetic: we pin tags, and we never force-push or re-point a tag. So we could actually use a higher value for the expiry time, which we did by setting an environment variable. After extending the expiry, we immediately saw a dramatic drop in git fetch requests from the repo server.
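A sketch of the two repo server changes, with hypothetical numbers; the cache knob shown here is the repo server's --repo-cache-expiration option (ARGOCD_REPO_CACHE_EXPIRATION, default 24h), set via argocd-cmd-params-cm:

```yaml
# Sketch: raise the manifest cache expiry. Only safe because our
# git/Helm references are hermetic (pinned tags, no force-pushes).
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  reposerver.repo.cache.expiration: "168h"
---
# Sketch: scale the repo server on memory so each pod serves fewer
# manifest generation requests. min/max replicas are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: argocd-repo-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```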
Next, we move on to the monorepo. We mentioned that we use the app-of-apps pattern and ApplicationSets in our monorepo. Additionally, we use ArgoCD's multi-source apps feature, which allows us to define multiple sources for a single app. When we implemented this, we immediately saw very high git fetch and git ls-remote request rates, as seen in the two screenshots here. We investigated and found that it's potentially a bug in ArgoCD. We implemented an undocumented workaround, with the details available in this GitHub issue, and our git fetch requests dropped, as shown in the right screenshot. But we are still seeing high ls-remote request rates, so we think this is still an open issue; if anyone is interested, please check out the GitHub issue.

Still about the monorepo: we usually want a webhook from the repository to notify ArgoCD, so ArgoCD can pick up updates faster on every commit. In a monorepo, however, the ArgoCD webhook server refreshes all applications when it receives a webhook, even when the new change has no relation to most of those applications at all; it might touch only a subset of paths we actually care about, but the webhook refreshes all the apps anyway. In the refresh process, ArgoCD invalidates the cache for all apps and calls the Kubernetes API to annotate every application object with the special ArgoCD refresh annotation. This is a network-bound process, which can slow everything down, especially once you have more than 1,000 apps. The fix is to declare the specific paths each application depends on, so that on a webhook, ArgoCD only refreshes the applications whose paths actually changed. We do this with the special manifest-generate-paths annotation (argocd.argoproj.io/manifest-generate-paths), where we define the specific paths. After adopting this, webhook-triggered refreshes stopped hitting every app.

Next, let's move to the app controller. One of the first problems we had with the app controller was the workqueue: our app controller workqueue depth started to pile up and did not go down at all. We investigated, and we decided to increase the number of operation processors and status processors in the app controller, because these set how many reconciliations and synchronizations can happen concurrently at one time. We can tune them with these parameters, and in our case we use the numbers on the slide. One rule of thumb for configuring these values: for every 1,000 applications, use 50 status processors and 25 operation processors.
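As a sketch of those concurrency knobs, applying the rule of thumb above to a hypothetical 10,000-app instance (our exact numbers were on the slide), again via argocd-cmd-params-cm:

```yaml
# Sketch: app controller concurrency, sized by the 50/25-per-1,000-apps
# rule of thumb for a hypothetical 10,000-app instance.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Concurrent app reconciliations (refreshes): 10 x 50.
  controller.status.processors: "500"
  # Concurrent sync operations: 10 x 25.
  controller.operation.processors: "250"
```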
Next, about scaling. The app controller can be scaled out into multiple shards; it's horizontally shardable, and the sharding algorithm in ArgoCD works at the cluster level, so different sets of clusters are assigned to and served by different shards. To scale the app controller, we increase the replicas of the app controller StatefulSet, and we also set the ARGOCD_CONTROLLER_REPLICAS environment variable to the same value. After implementing the sharding across more pods, we noticed uneven CPU usage across shards: some shards had much higher CPU usage and some significantly lower. As previously mentioned, ArgoCD shards per cluster, not at the app level, so several large clusters can land on the same shard, and likewise several small clusters can land together on another, resulting in the uneven CPU usage. The new round-robin sharding algorithm, which only became available recently, might not help much for our use case either, because there's still a chance that large clusters end up on the same shard when ArgoCD round-robins the clusters across shards. So we decided to do manual shard allocation instead, to fine-tune our shard resources. To do this, we use the cluster secret that ArgoCD already reads, add a new field called shard, and put the shard number there. After implementing this, we saw much more even CPU usage across all our shards. Perhaps one discussion point about sharding: sharding at the application level would, we think, be a really great feature, and might be a good solution for the noisy cluster problem we mentioned. There's an open discussion about this on GitHub; if anyone is interested, please check it out.

Next: even after all the app controller tuning, we still saw very high app controller CPU usage and what we believe were slow reconciles. The app controller watches all field changes of tracked objects, and if any field differs from what's in the cache, ArgoCD starts refreshing the apps. These are Kubernetes objects which, as we know, can get very noisy: some fields are updated frequently even though they're fields ArgoCD doesn't need for reconciliation. We really only need the fields that appear in the manifests in our repositories; examples of noisy fields are the status fields or the resource generation field. ArgoCD has a fairly new feature for this, called ignore resource updates, which has only been available since v2.8, alongside the ignore differences feature. We can use these features to filter out, on high-churn objects, the fields ArgoCD doesn't need for its reconciliation process. In our case, we use the configuration on this slide; it's a bit small, but I hope you can see it. For example, we ignore HPA annotations, which update frequently; we ignore some ReplicaSet annotations; we ignore EndpointSlices, which can get very noisy; and we ignore the whole status field. One thing to note about status: if you have custom controllers, they might rely on some fields under status, so you may want to be more explicit about which status fields you ignore. In our case, we can just ignore the status field entirely. Upon implementing this, our CPU usage dropped dramatically, almost by half. For one fairly simple configuration, this was a really great win, so shout out to the contributors and maintainers for making this feature happen.
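As a sketch of that ignoreResourceUpdates configuration in argocd-cm, approximating the rules shown on the slide (the exact paths here are illustrative, not our production list):

```yaml
# Sketch: ignore high-churn fields so field changes that are irrelevant
# to reconciliation don't trigger app refreshes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Ignore everything under .status for all tracked resources.
  # Be careful if custom controllers rely on specific status fields.
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
      - /status
  # HPA-managed annotations churn frequently.
  resource.customizations.ignoreResourceUpdates.autoscaling_HorizontalPodAutoscaler: |
    jsonPointers:
      - /metadata/annotations
  # Controller-written ReplicaSet annotations churn on every rollout.
  resource.customizations.ignoreResourceUpdates.apps_ReplicaSet: |
    jsonPointers:
      - /metadata/annotations
  # EndpointSlices update on every pod churn event.
  resource.customizations.ignoreResourceUpdates.discovery.k8s.io_EndpointSlice: |
    jsonPointers:
      - /endpoints
```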
Next, last but not least, about the API client. Our in-house developer platform uses the ArgoCD API client library, and as we scaled, we started seeing HTTP/2 GOAWAY errors from our platform. We investigated, found, and fixed a bug in the API client library when using gRPC-web mode, which is shown in this GitHub issue, and it produced exactly that error. Also, the ArgoCD CLI may fall back to gRPC-web mode even when the flag is not specified, so you might want to check your ~/.config/argocd/config file for any configuration left there. Alternatively, just use native gRPC, because this bug only affects gRPC-web mode. That's all the tuning we can share, and we do have more ArgoCD improvements we look forward to, which Geary will now present.

So, we explored different alternatives to our centralized ArgoCD setup. The first alternative we looked at is the decentralized ArgoCD model. Instead of a centralized model, this is a pull model where each cluster installs its own ArgoCD instance: if we have 50 different clusters, there will be 50 different ArgoCD instances, each responsible only for reconciling its own local cluster. There are a couple of benefits over the centralized ArgoCD. The first is that the application controller workload gets distributed nicely across the clusters, so this model is easy to scale. The second is that no ArgoCD instance needs access to all the Kubernetes API servers in our infrastructure, so this model has a better security posture than the centralized one. However, there are many downsides we don't like. Instead of maintaining one instance, we would now need to maintain and upgrade 50 different ArgoCD instances; for our small team of platform engineers, who already regularly upgrade 50 clusters and 50 Istio installations, this would add a lot more work. Our platform would need to maintain multiple ArgoCD client versions and know which version to use for which cluster, adding unnecessary complexity to the platform logic. We would no longer have a centralized dashboard: to control and view the ArgoCD applications in a particular cluster, we would need to open that cluster's dashboard, and we would end up with 50 different ArgoCD dashboards. And anyway, if a cluster gets large, we would still need to tune that particular ArgoCD instance.

Another alternative we saw is an agent-based ArgoCD, a hybrid model between push and pull, which we saw in the Akuity Platform. In this hybrid model, each cluster still has an ArgoCD instance, but there is a central ArgoCD control plane that oversees the application state across all clusters, and the dashboard is still centralized. We really hope this model gets contributed upstream to the community, and we've seen a couple of works in progress there. The first pull request was merged a couple of months back: it enables an optional pull mechanism for ApplicationSets. With this PR, the ApplicationSet can generate Application objects into a remote cluster and let that remote cluster locally reconcile the Application objects that belong to it. The second feature is still an open issue on GitHub: an effort to implement a centralized UI for multiple ArgoCD instances. We are also excited about the ArgoCD UI improvements: there is an open issue right now to support server-side pagination on the ArgoCD dashboard, which would significantly speed up loading the home page. This already exists on Akuity, and they plan to bring it upstream; I hope it will be available upstream soon.

We were heavily inspired by TikTok's and Adobe's journeys in managing their thousands of ArgoCD applications, and we learned a lot from the ArgoCD best practices from Alex, as well as from the amazing ArgoCD documentation that gave us clear guidance on this performance tuning journey. Thank you for your time; we are now happy to take questions. We still have two more minutes, and please scan the QR code here to give us feedback. Thank you.

Audience: Hi, thanks for the talk. I wanted to ask: how fast and easy is it to shard up and down?

Speaker: Sorry, I didn't get it. How fast is what?

Audience: How easy and fast is it to shard up or down on the application server? You were showing you had four application servers running.

Speaker: Oh, okay, controller shards. Controller shards, right?
Speaker: Actually, it's a simple replica change, plus changing the environment variable we showed in the presentation. The ArgoCD app controller is actually pretty stateless; it can rebuild its own state from scratch. We might see slightly slow reconciles right after sharding the app controller up or down, but in our experience that lasts just two to three minutes. I hope that answers your question.

Audience: Hi, I have a question about centralized ArgoCD maintenance. What kind of guardrails do you have when migrating from one ArgoCD version to another? For example, recently we did a small migration from 2.7.2 to 2.8, hit performance issues, and rolled it back. With centralized ArgoCD, if something breaks, the whole deployment system might break. So what guardrails did you put in place?

Speaker: Okay, so on upgrading ArgoCD: each ArgoCD version can require its own special treatment, and those steps are actually described pretty clearly in each version's upgrade documentation. For example, I think from 2.7 to 2.8 we had a pretty smooth upgrade, and I think it was 2.5 to 2.6 where we had some major changes. So it depends on the version, but generally it's not that operationally intensive, and there is pretty clear guidance for us to follow. I think we are out of time; if you still have questions, let's continue in the hall. Thank you.