Happy to see you all here today, though not so happy about the technical issues we've been having. Okay, so now I, Gio, and my friend here, Giri, will walk you through our journey to supporting 10,000 Argo CD apps. First, you can find us on social media. And a little about our company: we are a digital company in Indonesia. We'd also like you to understand where we are coming from, so here are some numbers about our Argo CD setup. We run 50 Kubernetes clusters across Singapore and Indonesia, with more than 30,000 pods running. And here is the growth of our Argo CD applications: you can see that we have already reached 11,000 applications, where not long ago there were only 7,000, and more and more teams keep onboarding onto our platform. Those 11,000 Argo CD applications, against some 60,000 [unclear], come to more than 38,000 objects in total, and in our largest cluster alone we have 2,000 applications and 40,000 objects. What we run today is the centralized model: a single Argo CD instance in a single management cluster, which connects out to all of our workload clusters. There are several reasons we use this simple model, which is easy to maintain. First, a deployment only needs the client library and an image, so it is very easy to integrate with our automation: we only need a single endpoint and a client library.
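In this centralized model, each of the roughly 50 workload clusters is registered with the single Argo CD instance. A minimal sketch of such a registration, using Argo CD's declarative cluster Secret format — the cluster name, endpoint, and credentials here are made up:

```yaml
# Hypothetical example: registering one workload cluster with the central
# Argo CD instance via a declarative cluster Secret.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-workload-01
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: workload-01
  server: https://workload-01.example.internal:6443
  config: |
    {
      "bearerToken": "<redacted>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded CA certificate>"
      }
    }
```

One Secret like this per cluster is what gives the single instance its fleet-wide view — and, as the talk notes later, also what forces the management cluster to hold credentials for every cluster.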
It is also very easy to wire Argo CD into our access control through policy.csv, and it is great to have a single dashboard to see all deployments across the roughly 50 clusters. Here is how we use Argo CD. We have our own developer platform, which we call GoPage, and we expose APIs to developers — deployment functions — so that they can create and manage their deployments, and several applications can share the same pipeline. Our automation then maps each of these onto its own Argo CD application. And here is how we structure those applications: we split each deployment into canary and stable applications, which makes it very easy for developers to roll out a canary, promote the canary to stable, and also roll back. Next, we also have separate applications for the Istio sidecar and the Istio gateway. This is where we manage traffic: we can shift traffic, and we can change the Istio gateway and its resources independently of the other applications. This makes it very easy for our platform to roll out changes separately from how developers deploy. Next — oh, for our runtime components we also use Argo CD. We have a monorepo. In this monorepo we have a root application, which generates the child applications. Each child application can have a base, plus overlays to deploy the same component with different settings, and where there is a difference between the base and the final manifests we want, we patch it with Kustomize. This model is used across all 50 different clusters. So, in a way, we also manage Argo CD using Argo CD. But this simplicity comes at a cost.
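The root-application setup described here can be sketched as an Argo CD app-of-apps Application: the root app points at a directory whose manifests are themselves Application objects. The repository URL and paths below are hypothetical:

```yaml
# Hypothetical root application (app-of-apps pattern): Argo CD renders the
# manifests under apps/ -- themselves Application objects -- and thereby
# creates and manages the child applications.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/monorepo.git
    targetRevision: main
    path: apps/            # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true          # remove child apps deleted from the repo
      selfHeal: true       # revert manual drift
```

The child applications can then each point at a Kustomize base plus overlay, which matches the base/overlay patching the talk describes.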
First, we have to maintain credentials and connectivity to all 50 clusters that we manage, and from time to time those connections act up, on the cluster side or elsewhere; we keep an eye on them through metrics. There was also an interesting quirk: we used to have to keep application names within the 63-character limit. That has actually changed, because we moved from the tracking label to the tracking annotation — the 63 characters are the Kubernetes limit on label values — so I don't think it is a problem any longer, but we still carry the technical debt of short application and service names. Then, as we grew, we started running into a different class of problems, and these hit our users directly. Okay, some of you may already have felt it: slow UI loading time, and we also had frequent repo-server OOM kills. Another issue is the high rate of Git API calls. This is reflected in some of the logs as connection issues with our GitLab, and it is also visible through the Argo CD git requests total metric. There were also issues of high repo-cache misses and imbalanced shards across clusters. Next, my friend Giri will show you our journey to tune the performance of our Argo CD instance. Alright, let's talk about our journey in tuning the parameters and configurations of our Argo CD instance to scale and support the more than 10,000 apps that we have. As a refresher, here are the Argo CD components — the Argo CD architecture. Basically, Argo CD consists of four different layers. The first layer is the UI layer, which is the primary layer that users interact with; it contains the web app and the CLI. The second layer is the application layer, which contains the API server, or the Argo CD server, which serves requests coming from the web app and the CLI.
The third layer is the core layer, the primary layer that serves the main functionality of Argo CD. It contains the app controller, the app set controller, and the repo server. The app controller watches application objects and reconciles them. The app set controller generates application objects based on certain templates, and the repo server serves manifest-generation requests coming from the app controller and the Argo CD server. The last layer, the infra layer, contains the external components that Argo CD depends on. There's Redis for caching purposes. There's the kube API, which Argo CD uses to watch Kubernetes objects and apply changes. There's Git, which contains the desired manifests that Argo CD refers to, and there's Dex, which handles authentication. Let's first talk about the Argo CD server component. As we grew, we started receiving complaints from our Argo CD users that the dashboard was very slow. I believe once you have more than 1,000 applications, you will start facing the same problem. For us, it took up to 2 minutes to load the entire Argo CD UI dashboard, which is really a pain. To solve this, we enabled gzip compression in the Argo CD server, and the impact was really great: it sped up the load time by 5x and made the transfer from server to browser 7x smaller. Next is more of a tip on using the Argo CD UI dashboard — it's not necessarily a tuning. Argo CD provides a feature called selectors. Users can use selectors to filter applications by labels, projects, or namespaces. As can be seen in this screenshot, we selected one project and one namespace: instead of loading the entire 10,000 application objects on the home page, only a subset of those applications is loaded, and it loads almost instantly, under 1 second. What's nice is that the selectors are saved for the next time we load the Argo CD UI, which is quite handy for users. Next is about Kubernetes CPU limits; this is no longer about the UI but about tuning the Argo CD components themselves.
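In current Argo CD versions the gzip switch mentioned here is a server flag that can be set through the argocd-cmd-params-cm ConfigMap; a minimal sketch:

```yaml
# Enable gzip compression on the Argo CD API server (maps to the server's
# --enable-gzip flag; on recent Argo CD releases this is enabled by default).
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  server.enable.gzip: "true"
```

The server deployment picks the value up from this ConfigMap, so a rollout restart of argocd-server is needed after changing it.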
As we grew, we started seeing a lot of our Argo CD components getting into CPU trouble. As can be seen in our monitoring dashboard, we started seeing CPU saturation — the spikes — especially during peak hours and when the number of synchronizations got high. Kubernetes implements the CPU request and limit using the cgroup mechanism, and the cgroup mechanism is based on the CFS, the Completely Fair Scheduler, in the Linux kernel. The CFS guarantees or throttles CPU utilization based on the proportions of the container shares — in cgroup terminology, the CPU shares and CPU quota — on that particular node. We don't have much time to talk more about this, but it's a pretty deep and interesting topic; if you're interested, you can see these references. With this, we decided to leave off the CPU limit. So, especially on our app controller, we don't use a CPU limit at all; we only use the CPU request in Kubernetes. As a result, we no longer see CPU saturation, and the components can use up the remaining available CPU on a particular node. Let's move on to the repo server. As Gio mentioned, as our number of repositories grew, we started seeing the repo server get OOM-killed quite frequently at that time. The repo server forks and execs git processes — if you exec into the repo server pod, you will see a bunch of git processes there, as well as the template generators, like Helm and Kustomize — and these consume memory. As we scaled, the repo server consumed a lot of memory and got OOM-killed. To scale this, it's a classic solution: we increased the replicas of the repo server and enabled HPA, the Horizontal Pod Autoscaler, with memory utilization as the target. With this, we spread the memory load across multiple replicas of the repo server. As an alternative, there's actually a feature in the repo server.
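A sketch of the repo-server autoscaling described above, assuming the standard argocd-repo-server Deployment name; the replica bounds and the 70% memory target are illustrative, not the values from the talk:

```yaml
# Hypothetical HPA scaling the Argo CD repo server on memory utilization,
# so manifest-generation load is spread before any single pod is OOM-killed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: argocd-repo-server
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70   # illustrative target
```

Note that memory-based scaling assumes the pods' memory requests are set realistically, since utilization is computed against the request.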
There's a flag called parallelism limit, which controls how many manifest-generation requests can be performed in parallel per repo-server pod, and this avoids OOM kills. However, there's a trade-off: we get a lower throughput of manifest generation once we set the parallelism limit. As I also mentioned earlier, the Argo CD server and the app controller talk to the repo server, and as we grew, we started seeing a bunch of timeout errors in our logs. To solve this, we had to tune the client timeout configuration in both the app controller and the server components. We really needed to set it in both: we thought setting it in only one component would solve the problem, but it has to be both. Then we encountered a problem with persistently high git fetch requests coming from the repo server, putting pressure on our git infrastructure. Argo CD caches the generated manifests in Redis, and by default the cache has an expiry time of 24 hours. When a remote file referenced by the manifests changes in the repository even though the repository tag hasn't changed — for instance, if our users apply changes without bumping the tag, or, in the case of Helm manifests, while staying behind the same Helm chart version — a shorter expiry time is preferable, so the repo server can pick the change up quicker than 24 hours. However, in our case the remote file references are already hermetic, which means we always do forward fixes and forward patches and increase the versions in the repository, so we can use a higher expiry time. After applying this, we saw our git request total metric decrease quite dramatically. Next is about the monorepo. Our monorepo, as Gio mentioned, uses the app-of-apps pattern of Argo CD.
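The three knobs just described — parallelism limit, client timeouts in both components, and cache expiry — all live in the argocd-cmd-params-cm ConfigMap in recent Argo CD versions. The numbers below are illustrative examples, not the values the team used:

```yaml
# Illustrative tuning knobs in argocd-cmd-params-cm (values are examples).
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Cap concurrent manifest generations per repo-server pod (0 = unlimited).
  reposerver.parallelism.limit: "10"
  # Client timeouts toward the repo server -- set in BOTH components,
  # as the talk stresses.
  controller.repo.server.timeout.seconds: "180"
  server.repo.server.timeout.seconds: "180"
  # Manifest cache expiry; can be raised when repo references are hermetic.
  reposerver.default.cache.expiration: "24h0m0s"
```

The affected deployments (repo server, app controller, API server) need a restart to pick up changed values.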
Additionally, we also use the multi-source apps feature, which allows one application to refer to multiple directories in the same repository with different generation strategies — one directory can be plain Kubernetes manifests, another can be Helm, and things like that. This feature put very high git fetch and git ls-remote request pressure on our git infrastructure. We investigated it and found that this was potentially a bug in Argo CD, so we discussed it with the Argo CD scalability group in the community, and we implemented an undocumented workaround — if you're interested, please look at the GitHub issue. We applied that undocumented workaround, the red-highlighted lines there. Right now it's still an open issue, but we brainstormed a solution with the maintainers, and I think the fix is coming soon. After applying the workaround, the git fetch requests dropped very dramatically, as on the right screenshot. However, we are still seeing very high ls-remote requests, and the bug issue is still open. And with a monorepo — if you have a monorepo in Argo CD, you would usually want to use the webhook integration as well. Because the monorepo contains multiple application objects, a change in one of the directories triggers Argo CD to refresh all the applications that refer to the same repository, even though those changes are not related at all. In the refresh process, Argo CD invalidates the cache for all those applications and calls the Kubernetes API to annotate every single application with some special annotations, and this is a network-bound process which slows down the entire Argo CD update process. It especially impacts reconciliation performance when you have more than 1,000 applications.
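Argo CD's mechanism for limiting this monorepo refresh blast radius is the manifest-generate-paths annotation on the Application; a sketch, with a hypothetical app name and repository path:

```yaml
# With this annotation set, a webhook-reported commit only refreshes apps
# whose listed paths contain the changed files; "." means the Application's
# own spec.source.path.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service
  namespace: argocd
  annotations:
    argocd.argoproj.io/manifest-generate-paths: .
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/monorepo.git
    targetRevision: main
    path: services/payments
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
```

Multiple paths can be given separated by semicolons, either relative to the source path or absolute within the repository.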
There's a feature in Argo CD for this, an annotation called manifest-generate-paths, which tells Argo CD which directories an application's manifests are generated from, so changes to unrelated paths no longer refresh the application. It speeds up the update process of Argo CD. Let's move to the app controller component. The app controller uses two kinds of queues, for reconciliation and synchronization purposes. As we grew, we started seeing our work queue depth piling up, and it never went down at all. This means tasks were being produced faster than they were consumed — there were too few consumers, or processors. Those workers are called operation processors and status processors in Argo CD. We needed to increase the number of processors so there are more consumers of the tasks. As a rule of thumb, for every 1,000 applications we set 50 status processors and 25 operation processors. And then the app controller can be scaled: the app controller is horizontally shardable. We can scale based on shards, and the sharding algorithm in Argo CD works at the cluster level, which means different clusters get assigned to different app-controller shards. Setting it up is pretty straightforward: we increase the number of replicas of the app controller and set the corresponding environment variable. With this, we distribute the reconcile and sync workloads to different pods, different shards. After implementing shards, we started seeing the shards consuming CPU unevenly — an imbalanced CPU consumption across the shards, where some shards consume much more CPU than others. Because Argo CD shards per cluster, not per app, the largest clusters can technically get assigned to the same shard, and likewise the smaller clusters can all land on the same shard as well, creating imbalanced workloads and a noisy-cluster problem. Argo CD released a new sharding algorithm, the round-robin sharding algorithm, but this might not help much either, because there's still a chance that the largest clusters get assigned to the same shard.
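To see why cluster-level sharding can produce the imbalance described here, a toy sketch helps — this is not Argo CD's actual hashing code, and the cluster names and app counts are made up:

```python
# Toy model of per-cluster sharding: clusters are hashed to shards with no
# regard for how many applications each cluster holds.
def assign_shards(clusters, num_shards):
    """Map each cluster name to a shard via a simple hash (a stand-in
    for the real algorithm)."""
    shards = {i: [] for i in range(num_shards)}
    for name in clusters:
        shards[sum(ord(c) for c in name) % num_shards].append(name)
    return shards

# Made-up app counts per cluster; one cluster is far larger than the rest.
apps = {"prod-large": 2000, "prod-b": 300, "dev-a": 150, "dev-b": 100}
shards = assign_shards(list(apps), num_shards=2)

# Per-shard load, measured in "number of apps to reconcile".
load = {s: sum(apps[c] for c in cs) for s, cs in shards.items()}
print(load)  # the big cluster drags all of its apps onto a single shard
```

With these made-up inputs, one shard ends up reconciling 2,150 apps while the other handles 400 — exactly the noisy-cluster imbalance the talk describes. Round-robin assignment evens out cluster counts but still cannot split one large cluster across shards, which is why the manual allocation described next was needed.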
So, to solve this issue, we did manual shard allocation to fine-tune the shard resources. Through this manual allocation we ensured our largest clusters are spread across the shards. To do this, we simply set the shard number in the cluster Secret object in Argo CD. After applying this, as the screenshot shows, we started seeing the CPU utilization spread more evenly across the shards. There's an open discussion about this; we think sharding at the app level in Argo CD would be better than at the cluster level. After all this tuning, we still saw the app controller consuming very high CPU, and we thought reconciliation could still be improved. Argo CD watches field changes on all the objects it manages, and some Kubernetes fields are very noisy — frequently and dynamically updated by Kubernetes itself. Things like the status and resource version fields are generated and updated by Kubernetes, but we don't really define them in the desired manifests. Those dynamic, frequently updated fields make for high-churn objects, which trigger unnecessary reconciliation and resource consumption in the app controller. In the recent 2.8 version there's a feature called ignore resource updates, alongside ignore differences. We can use this feature to filter out all the fields we don't really care about. In the example on the left-hand side, we decided to ignore the HPA annotations and the ReplicaSet annotation fields, which are pretty noisy, and we also ignore the status field.
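The two remedies above can be sketched declaratively — pinning a big cluster to a specific shard in its cluster Secret, and ignoring high-churn fields in argocd-cm. The cluster name, server URL, shard number, and the choice of ignored fields are all illustrative:

```yaml
# Pin a large cluster to a specific app-controller shard.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-large
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
stringData:
  name: prod-large
  server: https://prod-large.example.internal:6443
  shard: "2"
---
# Ignore high-churn fields so they stop triggering reconciliation
# (ignoreResourceUpdates, available from Argo CD 2.8).
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.ignoreResourceUpdatesEnabled: "true"
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
      - /status
  resource.customizations.ignoreResourceUpdates.apps_ReplicaSet: |
    jsonPointers:
      - /metadata/annotations/deployment.kubernetes.io~1revision
```

The `~1` in the last pointer is the JSON Pointer escape for `/` inside the annotation key; per-kind entries use the `<group>_<kind>` key format.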
Note that if you implement custom controllers you might rely on the status field, but in our case we don't really need the status fields on all the Kubernetes objects. As can be seen, the result was quite amazing: CPU utilization dropped by almost half — almost half, for a relatively simple feature with this amazing impact. It's really great; shout out to the Argo CD maintainers for making this happen. The last problem we encountered was related to the Argo CD API client library. As Gio mentioned, and as in the keynote this morning, we have an internal developer platform, and our platform uses the Argo CD API client library to interact with the Argo CD server. As we scaled, we started seeing HTTP/2 GOAWAY errors on our platform. We investigated and found a bug in the upstream API client library when using gRPC-Web mode. We fixed this upstream, and right now it's already fixed; I think it's going to be in 2.10 or above. Also, one note: if your users are using the Argo CD CLI, even though gRPC-Web mode is not set explicitly, it may fall back to gRPC-Web mode — if you are using it, please check your Argo CD config file on your local machine. As an alternative, we can use native gRPC to talk to the server, as opposed to gRPC-Web mode, to avoid this issue. Next, Gio is going to talk about how we can further improve our existing Argo CD setup. Now that we have taken a good look at how we manage Argo CD and identified these problems, there are several approaches that we want to explore. The first one is the decentralized model. This is the complete opposite of what we already have: an Argo CD instance in each of the clusters. It offers a couple of advantages. First, the app controller workload is now distributed across clusters, which makes it much easier to scale. And then, access to the Kubernetes API is now local-only; no longer maintaining connectivity from the management cluster narrows down our attack surface. But there are also trade-offs to this approach.
The first one is the maintenance and upgrade headache: we are a small platform engineering team, and we already have a lot to upgrade across our clusters, with the Istio mesh inside, so we want to avoid that kind of maintenance burden. Another one is to avoid the automation headache of maintaining several different Argo CD client versions. And then we would lose the centralized dashboard, because now that we have an Argo CD instance in each cluster, we would have 50 different Argo CD dashboards. And as you know, we have a single largest cluster, so we would still need the tuning we have done on that cluster, because its size is different. Besides the decentralized model, we also looked into the hybrid model. This was pioneered by Akuity, and as you can see, it has distributed controllers, where the workload is spread out across the workload clusters, and a single centralized control plane to view our resources throughout those clusters. We hope a similar model will also be supported in the community version, and in fact progress has been made. The first piece is an optional pull mechanism for ApplicationSets, so that an ApplicationSet in a remote cluster can generate applications in its own cluster, and the Argo CD instance in that cluster can reconcile on its own. The second is the issue for a centralized UI for multiple Argo CD instances; we can track that issue number to see where the future of this model in the community version is heading. Another nice thing to add is server-side pagination: this is already there in the Akuity version, and I believe they are planning to bring it upstream, targeting version 2.10 for the release. And we want to say thank you, since we were heavily inspired by the work being done at TikTok and Adobe, from Alex, and by the amazing Argo CD documentation. And that's it — we still have time for questions. Okay, now it's question time. Any questions? Okay, in front — can anyone please pass the mic? Then you can go to the mic in the center. I think there is a mic in the aisle; for questions, please go to the aisle mic. "May I ask a question about — you mentioned you optimized the UI load, right?" Sorry? "The Argo CD UI load. As we have more than 10,000 applications, the UI load is quite slow. I'd like to know: did you custom-patch Argo CD to solve this, or just use some..." Okay, so the question is: have we done anything custom to optimize the UI load of Argo CD. "I think we encountered the same problem, but our situation may be worse than yours. I'm from TikTok, and this presentation area is shared by my previous leader, the Akuity CTO. Our problems are tougher — for example, our UI loading time is like 3 minutes for the first load. The initial list alone costs maybe 20 seconds, and the full UI load is almost 3 minutes. After the caching — they cache all the RBAC results — the list load is like 20 seconds, and the UI load is still more than 1 minute. I'd like to know, have you guys done any improvements on that?" So the answer is: we haven't done anything custom beyond the gzip compression. And then, for us, we don't allow users to open the home page in its entirety, because we have a dev portal in our platform that links to the Argo CD applications each team owns. With this linking, we don't necessarily redirect users to the home page, so we can control the filters from our platform — say, we filter based on teams — and we don't load 10,000 objects on that home page. In our case, our users just bookmark the links. We've seen that also. We're over time already, so we can continue the questions offline. Okay, we have to move to the next session. Thank you so much.