Hello everybody. Thank you for attending ServiceMeshCon North America, and thank you for attending this session, where we will be talking about metrics merging in service mesh. My name is Lawrence Godbond and I'm a field engineer at solo.io, where we have several products related to application networking in cloud native environments. So let's jump into it.

Monitoring and metrics in general are very important, especially when you're talking about distributed systems, because you'll have applications running in several different places, spinning up and down, et cetera, and you need a way of understanding the behavior of those applications and those deployments. A common tool for this is Prometheus. Prometheus is a CNCF graduated project. It defines a very commonly used metrics format, provides a way of scraping those metrics, and gives you a way of querying that data so that you can do things like monitor and understand the behavior of your systems.

So looking at this diagram here, we have a pretty typical deployment: a few namespaces running in a Kubernetes cluster, different workloads running in those namespaces, and a need to monitor them. We'll be using metrics to do that today. We have Prometheus deployed in that same cluster in its own namespace, and we want Prometheus to scrape these applications so that we can get their metrics and then monitor and alert based on the runtime behavior of these applications. What that means is we need to configure Prometheus so that it knows, first of all, how to find the workloads it needs to scrape, and then how to actually scrape them: how do I find the metrics endpoint that I need to send a scrape request to? That's done using scrape configuration and service discovery in Prometheus.
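As a concrete sketch of what that looks like, here is a minimal scrape config using Kubernetes service discovery. It follows the widely used community convention for the `prometheus.io` annotations (discussed next); it is an illustration, not a config taken from this talk's slides:

```yaml
# Sketch: discover pods via the Kubernetes API and scrape only those
# that opt in with the prometheus.io annotations.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod   # service discovery hands Prometheus every pod as a candidate target
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor a custom metrics path from prometheus.io/path (default is /metrics).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address to use the port from prometheus.io/port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```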
So scrape configs tell Prometheus how to actually do the scraping and gather those metrics, and then service discovery tells Prometheus about the targets it needs to scrape. A very common way to do this, not quite a standard but close to it, is to use Kubernetes service discovery, which provides Prometheus with things like pods and services, so that Prometheus understands Kubernetes primitives. Then there's an additional, non-standard part: the prometheus.io annotations. You annotate a pod, and these annotations give Prometheus additional information. Looking at the annotations in this example, we have a path annotation set to /metrics, a port set to 8080, and scrape set to true. What we're telling Prometheus is: hey, I want you to scrape this pod, the port you need to talk to is 8080, and the path you can send the request to in order to get those metrics is /metrics. Combined with the given pod IP, a scrape request might look something like this: it goes to the pod IP, at the correct port, at /metrics, and the response contains the metrics for this given workload.

Now, when we're talking about a service mesh, let's quickly look at the architecture. You have workloads, and you have a data plane: the proxies that are usually deployed alongside these workloads as sidecars. And you have a control plane that configures those proxies based on the state of the environment, user config, and so on. So that's all great.
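To make the annotation-to-request mapping concrete, here is a small Python sketch of how Prometheus effectively combines a pod IP with the annotation values to build the scrape URL. The pod IP is hypothetical, and this is an illustration of the logic, not Prometheus's actual code:

```python
# The annotation values below mirror the example from the talk.
annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "8080",
    "prometheus.io/path": "/metrics",
}

def scrape_url(pod_ip: str, annotations: dict) -> str:
    """Build the URL a scrape request would be sent to for this pod."""
    port = annotations.get("prometheus.io/port", "80")
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{pod_ip}:{port}{path}"

# Hypothetical pod IP:
print(scrape_url("10.0.0.12", annotations))  # → http://10.0.0.12:8080/metrics
```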
That makes sense from a service mesh perspective. From a monitoring and metrics perspective, what it means is that Prometheus now has to know about more components to scrape, because now we have to scrape the application as well as the proxies and the control plane. The control plane is not too difficult; usually it's just a standard deployment. But it gets a little tricky when you have two components in a single pod, the proxy and the workload. Looking at Istio in particular, it gets even more complex, because there's an additional component called the Istio agent. This component performs several auxiliary tasks related to the workload, but it is its own component and you usually need to scrape its metrics too. That makes it challenging: now you have three different places in a given pod that you need to scrape.

So if we look at the previous example with the annotations, let's see what that would look like in Istio. Our pod has the correct annotations, so Prometheus is going to try to scrape our application metrics at port 8080, path /metrics. The request goes through the proxy, like in a typical service mesh environment, and gets routed to the application workload, which returns the metrics for the application workload. But that's a problem, because we're not getting the metrics for the proxy, and we're not getting the metrics for the agent. To solve this, Istio has a concept called metrics merging, which addresses this by exposing a single endpoint. Remember that the agent is a standalone component, so it can perform logic and tasks of its own. In this case, the agent is responsible for exposing an endpoint that Prometheus can scrape, and when it handles that request, it provides a merged document of all three sets of metrics: from the proxy, the application workload, and the agent itself.
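The merging idea can be sketched in a few lines of Python. This illustrates the concept only, not the agent's actual implementation (the real agent fetches each source over HTTP and also has to deal with details like duplicate metric families), and the sample metric names are made up:

```python
# Sketch of metrics merging: the agent answers one scrape request by
# concatenating three Prometheus text-format documents into one.
def merge_metrics(app: str, proxy: str, agent: str) -> str:
    """Return a single exposition document containing all three sources."""
    parts = [doc.strip() for doc in (app, proxy, agent) if doc.strip()]
    return "\n".join(parts) + "\n"

# Made-up example documents standing in for the three scrapes:
app_metrics = "http_requests_total 42\n"
proxy_metrics = "envoy_cluster_upstream_rq_total 120\n"
agent_metrics = "agent_outgoing_latency_count 7\n"

merged = merge_metrics(app_metrics, proxy_metrics, agent_metrics)
print(merged)
```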
So let's look at how Istio accomplishes this. When you inject a workload in Istio, that's typically done using a mutating webhook. In this example, a pod is being created and admitted through the Kubernetes API, so it goes through the webhook for injection. When a pod has the Prometheus annotations, those annotations are overwritten with new ones that point to the Istio agent's metrics endpoint, the one that is able to provide this aggregated view. The old annotations are essentially replaced during injection, but the original values are still stored so that the agent knows how to actually get to the application workload's metrics.

So now we have a pod whose annotations point to the agent's endpoint at port 15020, with the path /stats/prometheus. Prometheus generates a scrape request to the pod's IP at that port and path, and the request makes it to the agent's endpoint. Then it's up to the agent: since the original annotations were stored, it knows how to reach the application's metrics, so it makes a request to that location and gets back the application metrics. It also scrapes the metrics from the proxy; since the agent is responsible for launching the proxy, it already knows the proxy's well-known metrics endpoint, so it goes ahead and scrapes that too. Then it merges those two and additionally adds its own agent metrics. The end result is a full document that contains the metrics from all three components in the given pod.
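For reference, after injection the pod's Prometheus annotations end up pointing at the agent, roughly like this (15020 and /stats/prometheus are Istio's documented defaults for the merged endpoint; how the original values are preserved varies by Istio version):

```yaml
# Pod annotations after sidecar injection: the scrape target now points
# at the istio-agent's merged-metrics endpoint instead of the app.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "15020"              # istio-agent's merged-metrics port
    prometheus.io/path: "/stats/prometheus"  # merged-metrics path on the agent
```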
And now that can be returned to Prometheus, so from a single endpoint we have all of the metrics we need for a given pod. That works great, and in most cases it's transparent to the workload, as long as you're using those annotations; we'll come back to that in a second. But another thing to consider is that since the agent itself is exposing this endpoint, it's excluded from the proxy, and that means you don't get things like mutual TLS for free. In most cases, I would say, that's probably not a problem. But if you have strict security requirements that require even metrics data to be encrypted, you will not be able to use this solution and will have to use an alternative; there are some alternatives outlined on istio.io for handling the TLS portion. And as I just mentioned, this solution requires the Prometheus annotations to work. If you're using a tool that doesn't respect those annotations, it will not work either. A very common example of that is the Prometheus Operator, which has its own CRD-based mechanism: custom resources that tell Prometheus where to go to scrape. In that case, the annotation-based solution will not work.

In general, though, there are other alternatives. You could configure Prometheus directly to understand where the data planes live, whether through convention, well-known labeling, or so on, and then you don't have to worry about merging at all. Another interesting one: Kuma, another CNCF project, has a native Prometheus service discovery plugin, which gives Prometheus the ability to natively understand what a Kuma deployment looks like.
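As a sketch of that direct-configuration alternative in the Istio case: the sidecar proxy exposes Envoy's own stats at a well-known port (15090, path /stats/prometheus), so you can point a scrape job straight at it without any merging. The container port name used below matches the name Istio assigns to that port in injected pods, but treat the details as an illustration:

```yaml
# Sketch: scrape Envoy sidecars directly at the proxy's well-known
# stats endpoint, instead of relying on annotations and merging.
scrape_configs:
  - job_name: envoy-stats
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only the istio-proxy container's stats port.
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: http-envoy-prom
```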
So your data planes, the proxies and so on, Prometheus will understand them natively. It does require a change to the Prometheus deployment, which leads into the next, more sophisticated approach: a dynamic, native integration that configures Prometheus through a service, rather than through modifications to the Prometheus deployment. And if you're familiar with Envoy, you know about the xDS APIs; there may be a play there to let the control plane configure Prometheus. Since the control plane knows about your data planes and your workloads, it could dynamically configure Prometheus through a native integration as well.

But yeah, that's pretty much it. Just a quick plug: if you are attending in person or virtually, come check out the solo.io team. We have an in-person booth at booth S4 in the solution showcase, and we also have a virtual booth. Come talk to us, come check us out, and you can also enter for a chance to win some AirPods. And that is all I have. Thank you very much.