Hello, everyone. Unlike Kung Fu Panda's noodle soup, which has no secret ingredient, a good amount of effort has been baked into enterprise-scale Argo CD. Did you know Argo CD can support thousands of applications? Have you tried to connect your Argo CD with hundreds of Kubernetes clusters? What about the case with thousands of objects in a single application? We will dive deep into the Argo CD product and bring answers and best practices to you. My name is Hong Wang. I'm the founder and CEO of Akuity. My co-speaker, Yuan Tang, is also a founding engineer at Akuity. Both of us are Argo core maintainers and actively working on the Argo project. Special thanks to Alexander Matyushentsev for sharing his practices and helping with the content. We are from Akuity. We provide a vendor-supported, enterprise-grade distribution of Argo, and you can also get expert support and services from us, the project maintainers. The Argo project is a set of Kubernetes-native tools for deploying and running jobs and applications. It uses GitOps paradigms such as continuous delivery and progressive delivery and enables MLOps on Kubernetes. It is made of four independent Kubernetes-native projects: Workflows, Events, CD, and Rollouts. We see teams use different combinations of those projects to solve their unique challenges. If you would like to share your Argo journey, please reach out to me on the CNCF Argo Slack channel. We will invite you to present at our community meetings and even at our Argo conference. We have a very strong community. The project has been recognized and used by a lot of companies. It is being adopted as the de facto Kubernetes-native GitOps solution and also as a data processing engine. We were accepted as a CNCF incubating project, and we have 20,000 GitHub stars, 600 contributors, and 350 end-user companies. We are very proud of the current progress and enjoy being part of the open source community.
We are actively working towards CNCF graduation. Today's main topic is Argo CD, so I would like to give a high-level overview of our next milestone, which is version 2.3 and beyond. With a lot of scalability issues resolved in recent releases, we will shift more energy to integrating ecosystem projects, including ApplicationSet, Notifications, and Image Updater, into the main product. Those projects have been tested and validated for a while, and we believe they are mature enough to be admitted to enhance the out-of-the-box Argo CD experience. You can read the roadmap details by following the link here. Next, I will hand over to Akuity founding engineer Yuan Tang, who will present the secret ingredients of Argo CD. Thank you.

Before we dive deep into the scalability challenges, let's talk about GitOps in general. First, what is GitOps? One definition is that GitOps is a set of practices to manage infrastructure and application configurations using Git. That means any GitOps operator needs to automate the following steps in sequence. First, retrieve manifests from Git by cloning the Git repository, for example from GitHub. Second, compare the Git manifests with the live resources in the Kubernetes cluster, as kubectl diff does. Finally, push changes into the Kubernetes cluster, as kubectl apply does. This is exactly what Argo CD is doing. The GitOps workflow does not seem too difficult. However, the devil is in the details. Let's go ahead and find out what can go wrong and what you can do about it.

First, let's take a look at the Argo CD architecture. It has three main components, one for each GitOps operator function. The first is the Argo CD repo server, which is responsible for cloning the Git repository and extracting the Kubernetes resource manifests. The second is the Argo CD application controller, which fetches the managed Kubernetes cluster resources and compares the live resource manifests with the Git manifests for each application.
Finally, the Argo CD API server presents the diff results between the live manifests and the manifests stored in Git to the end user. Now you may be wondering: why are there so many components? Why not just package everything into one small application that performs all three GitOps functions? The reason is that Argo CD delivers GitOps functionality as a service to multiple teams. It is able to manage multiple clusters, retrieve manifests from multiple Git repositories, and serve multiple independent teams. In other words, you can enable GitOps for application engineers in your company without having to ask them to run and manage any additional software. This is very important, because if your organization is adopting Kubernetes and the application developers are not Kubernetes experts yet, this GitOps-as-a-service approach not only enforces best practices but also reduces the number of questions or issues the support team receives from developers, enabling self-service. This also means that Argo CD needs to manage potentially hundreds of Kubernetes clusters, retrieve manifests from thousands of Git repositories, and present the results to thousands of users. This is when things might become a little bit more complicated. Well, the good news is that Argo CD scales really well out of the box. Argo CD is optimized to run on top of Kubernetes, which enables users to take full advantage of Kubernetes scalability. This screenshot visualizes metrics exposed by a real Argo CD instance. As you can see, it manages almost 2,300 applications deployed across 26 clusters, with manifests stored in 500 Git repositories. That means around a hundred application developer teams are using that instance and leveraging GitOps without much overhead. Unfortunately, no application can scale indefinitely, and at some point you might need to tune your configuration to save resources and get better performance in some edge cases.
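Before looking at Argo CD's configuration knobs, it may help to recall the manual GitOps loop that Argo CD automates. A minimal sketch with plain Git and kubectl (the repository URL and manifest directory are placeholders):

```shell
# 1. Retrieve manifests from Git (placeholder repository URL)
git clone https://github.com/example/deployment-repo.git
cd deployment-repo

# 2. Compare the Git manifests with the live resources in the cluster
kubectl diff -f manifests/

# 3. Push the desired state into the cluster
kubectl apply -f manifests/
```

Argo CD runs this loop continuously for every application, which is why each step gets its own dedicated, independently scalable component.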
Let's get started and walk through some of the Argo CD configurations that you might need to change. First, Argo CD's controller runs with multiple workers. The workers form a pipeline that reconciles applications one by one in sequence. The default number of processors is 20, which is typically enough to handle hundreds of applications. However, if you get a thousand or more applications, you might start seeing delays of a few hundred milliseconds, and the delay can increase as you onboard more and more applications. One strategy to improve performance and reduce the delay is to increase the number of workers in the controller. You can modify the controller.status.processors setting in your Argo CD ConfigMap. A larger number of workers means that Argo CD will be processing more applications at the same time. Note that this also requires more memory and CPU, so don't forget to update your controller resource requests and limits accordingly.

With more and more applications, the controller is going to consume more memory and CPU. At some point, it makes sense to run multiple instances of the controller, where each uses a smaller amount of computational resources. To do so, you can leverage the controller sharding feature. Unlike stateless web applications, it is impossible to just run multiple identical instances of a Kubernetes controller. The challenging part for Argo CD is that the controller needs to know the state of the whole managed Kubernetes cluster in order to properly reconcile application resources. However, you can run multiple controller instances where each instance is responsible for a subset of the Kubernetes clusters you are managing. Sharding can be enabled by increasing the number of replicas of the Argo CD application controller StatefulSet. Don't forget to update the ARGOCD_CONTROLLER_REPLICAS environment variable with the same value.
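As a sketch, the sharding setup just described might look like the following (the replica count of 3 is only an example; field layout follows the standard upstream Argo CD manifests):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 3              # run three controller shards
  template:
    spec:
      containers:
      - name: argocd-application-controller
        env:
        - name: ARGOCD_CONTROLLER_REPLICAS
          value: "3"       # must match spec.replicas above
```

Keeping the environment variable in sync with `spec.replicas` is what lets each instance compute which subset of clusters it owns.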
This is required for the controller instances to know the total number of replicas and to trigger a restart to rebalance the work based on the updated configuration. As a result, each controller instance will do less work and consume less memory and CPU.

The next component that might require tuning is the Argo CD repo server. As I mentioned earlier, the repo server is responsible for retrieving the resource manifests from the Git repository. That means Argo CD needs to clone the repository and retrieve YAML files from the cloned copy. Cloning the Git repository is not the most challenging task. One of the GitOps best practices is to separate application source code from deployment manifests, so deployment repositories are typically small and don't require a lot of disk space. So if you have a repository with a bunch of plain YAML files, you should be fine and won't need to make any changes to the repo server configuration. The problem, however, is that deployment repositories usually don't contain plain YAML files. Instead, users prefer to use config management tools such as Kustomize, Helm, and Jsonnet. These tools help developers avoid duplicating YAML content and allow them to introduce changes more effectively. Of course, you could ask users to store the generated YAML in the deployment repository, but Argo CD has a better solution: it can run manifest generation on the fly. Argo CD supports multiple config management tools out of the box and allows you to configure any other config management tool as well. During manifest generation, the repo server execs or forks the appropriate config management tool binary and returns the generated manifests, which often requires memory and CPU. In order to ensure a fast manifest generation process, it is recommended to increase the number of repo server replicas. Typically, running three to four repo server instances is enough to handle hundreds or even thousands of repositories.
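Scaling the repo server per the recommendation above might look like this sketch (three replicas; the resource values are illustrative and should be sized for your own config management tools):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  replicas: 3              # more replicas = more parallel manifest generation
  template:
    spec:
      containers:
      - name: argocd-repo-server
        resources:
          requests:
            cpu: 250m      # illustrative; Helm/Kustomize/Jsonnet invocations
            memory: 512Mi  # are the dominant cost, so measure and adjust
```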
Argo CD aggressively caches generated manifests and doesn't need to regenerate them frequently. However, you might encounter some performance issues if you store deployment manifests in so-called monorepositories. A monorepository is simply a repository that contains a lot of applications. A real-world monorepo might have hundreds of applications, including infrastructure components as well as multiple microservices. Typically, monorepositories are used to represent the desired state of an entire cluster. This causes the following performance challenges. First, each commit to the monorepo invalidates the existing cache for all applications in that repo. That means Argo CD suddenly needs to regenerate manifests for hundreds of applications, which causes CPU or memory spikes. Second, some config management tools do not allow concurrent manifest generation. For example, multiple applications that rely on a Helm chart with conditional dependencies have to be processed sequentially. Generating lots of manifests introduces spikes of CPU and memory usage. The memory spike is the biggest problem, since it might cause OOM kills. To fix it, you can limit the number of concurrent manifest generations per repo server instance. The right number depends on how much memory you are ready to give to the repo server and how much memory your config management tool uses.

Next, I'd like to introduce another performance optimization technique that might help you avoid manifest generation spikes completely. Argo CD invalidates the manifest cache for all applications because it does not assume that the generated manifests depend only on files within the application-related directory. However, this is often the case. In order to avoid unnecessarily invalidating the cache when unrelated files are changed, you can configure commit webhooks and annotate Argo CD applications with the manifest-generate-paths annotation. The annotation value should contain a list of the directories the application depends on.
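A hypothetical application using this annotation might look like the following sketch (the application name, repository URL, and paths are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook                    # placeholder application name
  annotations:
    # Only commits touching these directories invalidate this app's
    # manifest cache. Paths starting with "." are resolved relative to
    # the application's spec.source.path; multiple entries are
    # separated with semicolons.
    argocd.argoproj.io/manifest-generate-paths: .;../shared
spec:
  source:
    repoURL: https://github.com/example/monorepo.git   # placeholder
    path: apps/guestbook
```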
So every time a webhook notifies Argo CD about a new commit, it will inspect the changed files listed in the webhook payload and reuse the generated manifests from a previous commit if the new commit does not touch any files related to that application.

The API server is a stateless API that scales well horizontally and does not require too many computational resources. The API server keeps an in-memory cache of all Argo CD applications, so if you are managing more than 5,000 applications with one Argo CD instance, you might want to consider raising its memory limit. Argo CD also exposes numerous Prometheus metrics. For example, the argocd_app_reconcile metric indicates application reconciliation performance, the workqueue_depth metric indicates the depth of the controller queue, and argocd_app_sync_total counts the number of application sync operations. You can use the community-maintained Grafana dashboard and review the high availability documentation for the relevant metrics. Those are some of the secret ingredients of Argo CD. Thank you for attending our session. We'll take any questions you have from here, and feel free to contact us for any follow-up questions you may have. Thank you.