Hi everyone. Welcome to this talk. Today I'm going to be talking about how to unleash ArgoCD observability superpowers. My name is Leonardo, but everybody knows me as Leo. I'm one of the Argo maintainers, and I help drive the contributors' meeting, so maybe you've already met me there. Unfortunately, Deng, who was supposed to join me today, wasn't able to come, but he provided me a recording and I'm going to play it during his part. So let's get started. I wanted to start with a really brief story of the problem this talk is trying to solve. Imagine that you are passionate about technology, you work for a company, and developers are having issues deploying their applications in Kubernetes. You have this great idea to install ArgoCD in your company's infrastructure, and suddenly it's a success: developers start using ArgoCD and deploying their applications. Then one day one of the developers comes to you saying they can't deploy their application. There's a problem going on in production and they ask for your help. You check the UI, and there's not much information there. So your instinct is to inspect the logs, you open the ArgoCD logs, and you have that feeling: you have to find which one is the faulty wire, right? And you don't know much about ArgoCD, so it's a little bit intimidating to understand the logs. This talk is about how you can use the observability functionality that is built into ArgoCD to help you whenever that day arrives. Basically, I'm going to be talking about metrics, and when I say metrics I mean Prometheus metrics. I'm also going to be talking about distributed tracing: how ArgoCD leverages distributed tracing and enables you to have a much better overview of what's going on in your ArgoCD instance. And Deng is going to be talking, in his recording, about how they run continuous profiling to find ArgoCD bottlenecks at large scale. Deng works for ByteDance, which is, by the way, the company behind TikTok, so they have a very large infrastructure there. Speaking a little bit about numbers, when I say running ArgoCD in large-scale environments: my team operates ArgoCD at Intuit. Just some really brief numbers here: Intuit has more than 14,000 employees in around 20 locations. We have more than 2,000 production services, and to support that we have 44 ArgoCD instances, which sum up to more than 16,000 ArgoCD applications. As you probably already know, an ArgoCD application can sync several resources, so it is quite a heavy-load environment that we have to support at Intuit. So let's start talking about metrics and how we leverage Prometheus metrics at Intuit. If you check the documentation, you're going to see that ArgoCD has several Prometheus metrics built in that you can use to really inspect how your applications are behaving. For example, with the first one here in the list, you can write queries to check the health status of your applications, and whenever that health status goes from healthy to degraded, you can configure alerts to be sent to your team so you can act on the issue. So there are several metrics that you can use.
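To make that concrete, here is a minimal sketch of a Prometheus alerting rule built on top of ArgoCD's documented argocd_app_info metric, which exposes the health status as a label. The group name, severity, and the five-minute window are hypothetical choices, not something the talk or ArgoCD prescribes:

groups:
  - name: argocd-health                # hypothetical group name
    rules:
      - alert: ArgoAppDegraded
        # argocd_app_info is an info-style gauge: a series with value 1 and
        # health_status="Degraded" means that application is degraded.
        expr: argocd_app_info{health_status="Degraded"} == 1
        for: 5m                        # avoid paging on brief blips
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} is Degraded"

A rule like this, loaded into Prometheus with Alertmanager routing on top, is the kind of query that turns the raw metric into the team alerts described here.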
And one thing I wanted to mention is that, at the top here, there are different components that you can see. So let me talk a little bit about the ArgoCD architecture, because things are going to be much easier to understand if you know a little bit about the internals of how ArgoCD is designed. Basically, users interact with the API server, and this is the same component that serves the UI. Most of the operations are provided by this component, which is pretty much a back-end service. We also have the repo server and, depending on your configuration, the CMP server. The boxes at the top, in orange, are basically back-end services, and the ones at the bottom, in green, are controllers. Going back to what I was showing in the previous slide: those metrics are exposed on specific ports that are already configured in ArgoCD. There's no specific configuration you need to do to have those metrics available; the only configuration is really pointing Prometheus to scrape those metrics from those ports, and you'll be able to interact with them. All right, but what do we actually do inside Intuit to leverage those metrics? We leverage Grafana dashboards. We have a dedicated alert dashboard, that's what we call it, where we basically wrote a collection of Prometheus queries on top of those metrics, and whenever an error happens that we know we need to act on, we send a PagerDuty alert. And the thing we like, which I think works really well for us: if there is a problem that needs human interaction to be solved, right in the PagerDuty alert we provide a link. We can just click that link, go straight to the dashboard, and get an overview of what happened, with some history. So that's something we found really useful: having alerts linked to dashboards provides some visual context whenever something goes wrong. What I'm showing here is the default dashboard that is provided in the ArgoCD repo. You can start using this dashboard today, if you're not already, by pointing it at your infrastructure. It's basically a general-purpose dashboard, but as I said, the biggest advantage comes when you start tweaking those metrics towards your needs and sending alerts whenever you need an action item. Okay, so what if you're running ArgoCD in some sort of multi-tenant environment, sharing the same infrastructure across different teams? If you're just querying one generic metric, how can you easily route alerts to a dedicated team, for example? This is one feature I wanted to quickly show you: ArgoCD provides a feature similar to one available in kube-state-metrics, if you're familiar with that. It allows you to define labels in your Application resource. Here you can see that I'm specifying two hypothetical labels: team, with the value platform, and BU (business unit), with the value infra. If you configure ArgoCD with the corresponding command parameters, ArgoCD is going to start emitting a new metric called argocd_app_labels with the values you defined in your Application resource, which allows you to join this metric with all the other metrics I was showing before. So you can do much better routing on those metrics and send specific alerts to specific teams. This is available today.
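As a hedged sketch of what that looks like, assuming a hypothetical application called payments-service and the documented --metrics-application-labels controller flag, the Application resource would carry the labels like this:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service             # hypothetical application
  namespace: argocd
  labels:
    team: platform                   # surfaced as label_team in argocd_app_labels
    business-unit: infra             # surfaced as label_business_unit (normalized)
spec:
  project: default
  source:
    repoURL: https://github.com/example/payments-deploy.git   # hypothetical repo
    targetRevision: HEAD
    path: manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: payments

With the application controller started with --metrics-application-labels team --metrics-application-labels business-unit, you should get series like argocd_app_labels{name="payments-service", label_team="platform", ...}, which you can then join with metrics such as argocd_app_info on the name label to route alerts per team; the exact join depends on your Prometheus setup.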
And moving forward: distributed tracing. This is a new feature we introduced in ArgoCD 2.4, and ArgoCD 2.4 embraced OpenTelemetry. So yeah, there was a little bit of work to get things updated, especially on the gRPC side, but I won't get into that. The truth is, it is available, and there's not much you need to do to enable ArgoCD to start emitting traces to any OpenTelemetry collector you have in your infrastructure. The only thing you basically need to do is edit one config map. If you're not familiar with it, this is something I wanted to highlight as well, because I'm not sure many developers are aware of it: ArgoCD provides a config map called argocd-cmd-params-cm. Before, we used to ask developers to patch ArgoCD containers to customize specific settings in the tool. But that's not always great because, for the OpenTelemetry configuration, for example, you would have to patch at least four different containers to add what is basically the same configuration for all of them. With argocd-cmd-params-cm, you just define the attribute, and ArgoCD automatically injects whatever you define into the appropriate containers. We have complete documentation about all the possible settings you can add in this config map. So this is just a tip: if you're not using it yet and you're still patching ArgoCD containers, this is maybe a good thing to start using. For the complete list of attributes you can configure through this config map, you don't have to memorize any URL; just remember that the ArgoCD documentation has a section called Operator Manual, inside it there's a page called Declarative Setup, and on that page you're going to find links to all the config maps we have.
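For the tracing case specifically, a minimal sketch of that config map edit might look like the following. The otlp.address key is the documented setting; the collector address itself is a hypothetical service in your cluster:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Where ArgoCD components should send OTLP traces; the collector
  # service name and namespace here are hypothetical.
  otlp.address: "otel-collector.observability.svc.cluster.local:4317"

After applying this, and restarting the relevant ArgoCD pods so they pick up the new parameter, the components start exporting traces to that collector.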
So, let's talk a little bit about distributed tracing in practice. What you see here is not the ArgoCD UI; this screenshot I extracted from the Jaeger UI. I configured ArgoCD running locally with Jaeger, which is an open source tool to collect OpenTelemetry traces. Here you can see that something is made available immediately in the UI, and most of the distributed tracing UIs I'm used to will provide a similar diagram: it shows you the interaction between the different components that make up your application. As I showed you before in the architecture diagram, ArgoCD is not a single pod that you just install and run; there are several components involved, and for ArgoCD to work properly those components need to collaborate. By leveraging distributed tracing, you can see those components and their interactions, and whenever one of those components fails, the UI will usually provide a way to navigate directly to the point of failure. For example, ArgoCD communicates with GitHub to get the desired state, then reconciles this desired state with the live state and provides you the diff that you're already familiar with from the ArgoCD UI. Just from this very simple example, you understand that ArgoCD depends on the connectivity between your ArgoCD installation and wherever you host your Git repository in order to calculate this diff. So if something goes wrong, if your ArgoCD is communicating with GitHub, for example, and GitHub is down, ArgoCD isn't able to calculate this diff for you at that point, and with tools like distributed tracing that is much easier to realize than by just looking at logs or even at the UI. All right, so I recorded a really quick navigation in the Jaeger UI, mainly to show you what it looks like and to give you a feeling for the type of information you can get from it. You can see here on the left that you can filter by the different services, and those services map to the different components I was showing you in the ArgoCD architecture diagram. You can also see the different operations. For example, ArgoCD needs to list applications to show you all the applications deployed in a specific ArgoCD instance, and that is served by a gRPC call that needs the collaboration of two components: the one represented in orange is the API server, and the one represented in green is the component inside ArgoCD that we call the repo server. So this is the gRPC call, and here I'm showing you the operation: you can see all the details, inspect it, expand the tags, and look into the details of each span. Okay, so that was a pretty simple and quick tour of the type of information you can extract by enabling traces, but those are all happy scenarios. Let me quickly show you what happens if something is not right. In this example, I manually killed the repo server and then, navigating in the UI, I invoked the same service to list the applications. As this operation requires connectivity between the API server and the repo server, and the repo server is down, in the distributed tracing view it's very easy to spot where the error is. If you click exactly on the point where you see this red circle, you can inspect the trace's tags, and you can see that there's something going on with the connectivity between the API server and the repo server. So this is the type of visibility that enabling ArgoCD distributed tracing in your infrastructure is going to give you. Okay, moving forward in the presentation: this is the time where I'm going to play Deng's recorded presentation. Let me start the recording. [Deng's recording] I'm Deng Zhou from the ByteDance Edge Platform team. Today, I'm going to present how we do continuous profiling on our Edge Platform. This is the agenda for this talk. We will start from the definition of continuous profiling. Then I will give an introduction to the high-level architecture of the ByteDance Edge Platform. After that, the deployment and setup of continuous profiling on the ByteDance platform will be discussed. Last but not least, two case studies about ArgoCD performance troubleshooting will be presented. So what is continuous profiling? Profiling is a dynamic method of analyzing the complexity of a program, such as CPU utilization or the frequency and duration of function calls. With profiling, you can locate exactly which parts of the application are consuming the most resources. Continuous profiling is a more powerful version of profiling that adds the dimension of time. There are two major types of profiling tools: instrumenting profilers and sampling profilers. An instrumenting profiler inserts code into function calls to collect function execution details, while a sampling profiler periodically collects the function call stacks to estimate where execution time is spent. This is the high-level picture of our Edge Platform architecture. At ByteDance, we have hundreds of edge PoPs (points of presence). Each PoP consists of one or more clusters and is managed by the Edge Platform.
So the Edge Platform is the control plane of our edge cluster federation. The platform is equipped with infra services like storage, monitoring, logging, tracing, and so on. It provides a unified console to expose services like GitOps, metrics, billing, IAM, and other product-level features. There are actually many edge workloads already running on the platform, sharing the resources and infrastructure: CDN, proxy, RTC, API, gaming, and other services. So why do we want continuous profiling on our Edge Platform? There are a few motivations. First of all, we are seeing scalability challenges in our system. We have hundreds of edge clusters all over the world and more than 5,000 applications running on them, all managed by a single ArgoCD instance, so we want to understand the performance issues and their root causes. Second, a traditional performance test environment is hard to set up for us. Our edge cluster connections are not reliable due to networking instability, which is very hard to mimic in a testing environment. Also, an application fleet of this scale is very hard to mimic in terms of patterns: even if we can have a similar number of applications, we are not able to mimic the production user activity and the remote cluster health status. Then, our Edge Platform uses a continuous delivery model, and we ship new features and fixes every week. As not all feature enablements need a corresponding edge PoP change or update, it is very costly to simulate all the changes. Last but not least, even if we could do it, we have no intention of setting up a very costly problem-reproduction environment, in terms of both money and engineering. Because of all that, using continuous profiling tools to collect information from the production environment for troubleshooting is the first-line solution for us. So this is the setup for continuous profiling on our Edge Platform. Every piece of software above Kubernetes is deployed in a GitOps manner and managed by ArgoCD. We have two profiling tools deployed, Parca and Pyroscope. Currently, only the control plane cluster has continuous profiling enabled. All our targeted services are running on the control plane cluster, including Thanos, Prometheus, ArgoCD, and other self-built services like the resource manager, billing, the ticket system, and so on.
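As a hedged illustration of that GitOps setup, here is a minimal sketch of what an ArgoCD Application managing one of the profiling tools could look like; the repo URL, path, and namespaces are all hypothetical, not ByteDance's actual configuration:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: parca                      # the profiler managed like any other workload
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/edge-platform-deploy.git  # hypothetical repo
    targetRevision: HEAD
    path: profiling/parca          # hypothetical path holding the Parca manifests
  destination:
    server: https://kubernetes.default.svc   # the control plane cluster
    namespace: profiling
  syncPolicy:
    automated:
      prune: true
      selfHeal: true               # keep the profiler itself reconciled by GitOps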
The first case I want to share today is about ArgoCD application list performance. As our internal SRE became an owner of multiple projects, we saw that their application listing wait time was several times longer than our admin's. This is a flame graph captured by Parca, one of the tools we are using, while the performance degradation lasted. From the flame graph, we can find that the application service's List handler takes most of the wait time, and inside it we realized that RBAC enforcement takes a significant amount of time. After checking the source code, it turns out that ArgoCD goes through the owner role list to check the permission of each application for the user when they list applications; 5,000 applications times the length of the owner list is the major source of latency. As a quick workaround, we gave the SRE the admin role instead of the owner list, and the list latency dropped from 40 seconds to 7 seconds. Beyond the workaround, we have redesigned the permission management to use project roles instead of global roles to avoid that checking. In addition, we figured out that ArgoCD did not have good caching of permission evaluation results, so we further added caching on that. With the fix, application listing drops to 2 seconds instead of 7. Another case we want to discuss here is about a CPU efficiency improvement for ArgoCD, with the help of continuous profiling. Our SRE observed that the ArgoCD application controller had very high CPU utilization most of the time, even when there was no user activity. So we took a look at the profiling data collected by Pyroscope, which is the other tool we use in our system. We observed that the function secretToCluster, which sounds like a quick lookup, consumed a significant amount of CPU. We wanted to know what happened inside this function, so we checked another profiling data point, as shown in the slides. It turns out that the unmarshalling of the cluster secret is the major CPU cost. The cluster secret is actually a very large data object, which contains all the information regarding the cluster, including the certificates and the connection credentials. But the code showed that the controller just needs to get the cluster name, yet it unmarshals the whole secret. After we reported this, ArgoCD upstream fixed it by adding an index, and the CPU utilization has been much better since then. For anyone who is interested in learning more about continuous profiling, here are the references. The first is about what continuous profiling is and how you can use it; the second is another explanation of continuous profiling and the different types of profilers; and the remaining ones are the two tools we have used and demonstrated in our cases. So these are the two findings we wanted to share today, among the many cases we encountered internally. Besides ArgoCD, we actually found performance problems in Thanos and in our self-built services as well. Continuous profiling is indeed an excellent tool which can help address the most unreproducible problems, and it comes with very little overhead. Thank you. [End of Deng's recording] Okay, so mainly to wrap up this talk, I have this last slide to provide a conclusion. In this talk, we spoke about where we are at Intuit and how we leverage Prometheus metrics, with a dedicated dashboard that we use internally. I also spoke about where we're heading: distributed tracing was enabled in ArgoCD 2.4. It's not configured in our infrastructure yet, it's a work in progress, but we're looking into going in that direction and improving the instrumentation we have inside ArgoCD. And Deng showed how they do continuous profiling at ByteDance to find problems that are really hard to reproduce, because they would require a really huge environment to show up. Okay, so this is what I had for you today. I don't know if we have time for questions. Any questions, anyone? So thank you.