Hello and welcome to OpenTelemetry Collector Deployment Patterns. My name is Juraci Paixão Kröhling. I'm a software engineer at Red Hat, where I work on the distributed tracing team. I'm a maintainer on the Jaeger project and a collaborator on OpenTelemetry. For this conversation here today we're talking about patterns, but before we go into that, we're going to invest a couple of minutes talking about OpenTelemetry and the OpenTelemetry Collector.

These are the patterns that we're going to cover here today. The first one is the very basic pattern; if you've followed a quick start of the OpenTelemetry Collector, you know it already. The second pattern is the normalizer pattern, followed by a couple of variants of a pattern for deploying the OpenTelemetry Collector on Kubernetes. We'll talk about load balancing, multi-cluster, and multi-tenant scenarios as well. All of the patterns that we talk about here today are available in this repository here, which contains images, configuration file examples, and a deeper explanation of those patterns. You can download this slide deck either from the conference's website or from the link here, and I'm also sharing it right now on Twitter.

All right, so let's get started. OpenTelemetry is a project that was created from the merger of OpenTracing and OpenCensus. It is composed of a few big parts. The first one is the specification and conventions part: it is where the community gets together to determine the semantic conventions that we should all be following when instrumenting our applications, and it also defines the specifications for the telemetry data types, for traces, for metrics, for logs, and so on and so forth. We have a group taking care of the client APIs, or instrumentation APIs as they're also called. We have a group defining OTLP, the OpenTelemetry Protocol; in concrete terms it is protobuf-based, but it is a specification of how we can transmit telemetry data from one service to another, so it specifies both the messages and what the services should look like on the client and on the server side. And then the other big part of OpenTelemetry is the collector, which is what we are focusing on here today.

If you go to OpenTelemetry's website and open the collector documentation, you'll see that the definition of the collector is this: a vendor-agnostic way to receive, process, and export telemetry data. That hints at the internal architecture of the collector, which is composed of the following components: it has receivers, processors, exporters, and extensions, which are a special type of component that is not part of the pipeline. The collector can be viewed as a pipeline for the telemetry data. Data is either pushed to or pulled by receivers, and the receivers put data at the beginning of the pipeline. As data flows through the pipeline, it gets to the processors, which have the ability to look into this data and do some action with it, or just observe this data and perhaps create a new data point out of it. Once all the processors have finished doing their jobs with that specific data point or batch of data, the data reaches the exporters. Exporters can, again, be passive or active: they can be sending data actively to a final destination, or they might be making data available for other systems to extract from the collector.
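To make the pipeline idea concrete, here is a minimal sketch of a collector configuration wiring one receiver through one processor to one exporter. This is a hedged example: the logging exporter is used only so the sketch is self-contained.

```yaml
receivers:
  otlp:          # receives OTLP over gRPC, putting data at the start of the pipeline
    protocols:
      grpc:
processors:
  batch:         # groups data points into batches before they reach the exporter
exporters:
  logging:       # simply writes the received telemetry to the collector's own log
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
```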
Now, when we talk about receivers, we are talking about things that emulate a Jaeger server, that emulate the behavior of Prometheus, that implement OTLP on the server side, or that emulate a Zipkin server as well. Processors might be doing sampling; they might be changing attributes, adding, removing, or updating attributes on the data points themselves; they might be doing some batching; they might be doing some routing, and so on and so forth. And when it comes to exporters, we have a bunch of them for pretty much all the relevant systems and vendors out there. We have exporters for Jaeger, Zipkin, and OTLP. We have exporters that expose a Prometheus-compatible endpoint so that Prometheus-compatible systems can scrape data out of the process. And we have exporters for pretty much all of the commercial vendors out there.

Besides the collector proper, besides the OpenTelemetry Collector itself, we have a couple of repositories, a couple of projects, that are part of the same ecosystem. The first and biggest one is the contrib. The contrib is where all the non-core components live, including the vendor-specific ones. So if you want to use a vendor-specific component for the OpenTelemetry Collector, you'd probably download the contrib distribution first. Now, you might notice that the contrib binary is actually quite big, so you might want to consider using the OpenTelemetry Collector Builder to pick and choose which components you want as part of your own distribution of the collector. That binary then contains only the components that you need, so it's very slimmed down, very lightweight.

A configuration file for the collector looks much like the sketch above: we have sections for the extensions, the receivers, processors, and exporters, and we tie them all together under the service node. There we specify the extensions that we want for the process, and we specify the pipelines. The pipelines can be traces, metrics, and logs pipelines, and we might even have multiple pipelines for the same data type, so we can have multiple traces pipelines, for instance. In our example, we have only one receiver, one processor, and one exporter.

All right, so let's get started with the patterns then. The very first one is the basic pattern. In this case, we have our application instrumented with the OpenTelemetry SDK, exporting data via OTLP to an OpenTelemetry Collector located somewhere. That OpenTelemetry Collector then exports data to a final destination, in this case Jaeger. Note that throughout this presentation I'm using Jaeger as an example of the final destination, but you can replace it with pretty much anything you want: a different tracing solution, a specific vendor, or even something not tracing-specific; in most of these cases it could just as well be Prometheus. So this is our first pattern, and I hope it was enough to warm up. The second one is a variant of the first pattern: the fan-out pattern. Going back to our original image, we have the same application instrumented with the OpenTelemetry SDK, exporting data via OTLP to an OpenTelemetry Collector, and the OpenTelemetry Collector exports data to Jaeger and, additionally, to an external vendor.
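As a sketch, the fan-out can be just a second exporter in the same pipeline. The endpoints and the vendor name here are illustrative, and the receiver and processor are assumed from the earlier example:

```yaml
exporters:
  jaeger:                               # keep the raw data in-house
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  otlp/vendor:                          # hypothetical external vendor endpoint
    endpoint: ingest.example-vendor.com:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger, otlp/vendor]  # the same data fans out to both
```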
Now, the point here is that we still have the ability to own our data, right? We still have access to our raw data within our realm, within our infra. At the same time, we can send the same data to an external vendor and have a different view of that data.

The second pattern that we have today is the normalizer pattern. In this case, we might have a Prometheus client instrumenting our application for metrics, and we might have our application instrumented with OpenTracing for the traces, with the Jaeger client as the actual tracer. Data is then either made available to Prometheus or sent to Jaeger, and in this pattern we are using the collector as a drop-in replacement for Jaeger and Prometheus as far as contact with our application is concerned. So Prometheus is now scraping the OpenTelemetry Collector, not our application anymore. At the same time, the Jaeger client in the application is making a connection to the OpenTelemetry Collector thinking that it's a Jaeger collector. The point is that the OpenTelemetry Collector sitting in the middle between those systems has the ability to look into all the data points flowing through this pipeline, through this communication channel, and ensure that they all have the same set of basic labels. Let's say we want all data points to contain the name of the cluster they originated at: we can ensure that the collector has a couple of processors adding a cluster label to metrics and to traces.

Our third pattern is actually a couple of patterns for deploying on Kubernetes. The first one is using a sidecar. A sidecar on Kubernetes is basically a second container as part of your pod: you have your application pod with one container in it, and you add a second container with the OpenTelemetry Collector. That sidecar collector acts as an agent, sending data to another collector, possibly deployed in another namespace, the observability namespace in this case, and from that collector we export data to Jaeger. There are a few advantages to this approach. The first one is that if we decide to change the way we send data to Jaeger, or if our Jaeger location changes, or if we decide to not use Jaeger anymore, or to use another exporter in addition to Jaeger, then we just change one deployment in our observability namespace; we don't have to change any of the sidecars. The second advantage: we're not talking about only one application here, we're talking about multiple applications in multiple namespaces, so we get better client-side load balancing, because we have multiple instances of the clients making connections to one server. When we scale up our OpenTelemetry Collector in the observability namespace, all the sidecar collectors in the workload namespaces would then start using those new instances; load balancing is likely to work better when you have more clients rather than fewer. Another advantage of having sidecars on a per-application, per-pod, or per-deployment basis is that we can fine-tune the configuration of that collector to the needs of that application, of that deployment.
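As a hedged sketch of what that per-deployment tuning might look like, a sidecar for an important workload could get a bigger memory budget and longer retries; all values here are illustrative:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400              # a more generous memory budget for a critical workload
  batch:
exporters:
  otlp:
    endpoint: otel-collector.observability:4317  # hypothetical central collector address
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s    # keep retrying longer before giving up on a batch
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```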
So if we have a critical application, a critical deployment, we can have a very specific configuration like that for its collector, perhaps with more resilient retry mechanisms, and perhaps even more memory and CPU allocated to that specific process. Whereas for lower-criticality services we can have a more lightweight configuration for that agent, for that sidecar. Of course, managing hundreds of sidecars might be a headache, and that's why we might want to use the OpenTelemetry Operator here: the OpenTelemetry Operator can inject and manage sidecars on our behalf.

All right, the second variant of this pattern is not using a sidecar but using a DaemonSet. The advantages are very similar: we have a collector that is very close to our application, to our workload, making it easy for the application to offload data very quickly to the agent, in this case the DaemonSet. But we don't have many of the advantages of the sidecars. For instance, it is harder to do multi-tenancy in this scenario, because we have multiple namespaces running on the same node and only one collector running on that node, which means that collector is going to see data from all the tenants. So it's harder for us to manage tenancy at that level. Not impossible, of course, but harder. It is also harder to have multiple instances of the collector on that same node, so if we need to scale up for some reason, it's harder to do. Again, not impossible, but harder, especially when it comes to service discovery. Now, the biggest advantage of DaemonSets over sidecars is, as you can imagine, the overhead. The OpenTelemetry Collector itself is not big, or it doesn't have to be. As a sidecar, I would say the memory consumption of the collector itself should be around 5 to 10 megabytes, but that just adds up when you have hundreds of those instances. With a DaemonSet we only have one collector per node, so our overhead is lower. The idea is that each one of those collectors collects data from the applications running on its node, so it's local to the application, and then resiliently, securely, and safely sends data to a central collector.

Our next pattern is about load balancing. To explain why you need load balancing at the collector level, instead of just a regular gRPC or HTTP load balancer, we have to go back a little bit and understand how tracing actually works. If we have a user doing a transaction on service A, it's very likely, in a microservices architecture, that service A would make a connection to service B, to service C, to service D, and each one of them would then make downstream connections to whatever services they need to get information from. Now, the way tracing works is not that we wait for all the services to complete and then service A sends data to the tracing backend. It's not like that. The way it works is that service A is responsible for collecting and sending data related to its own operations, so only the spans belonging to service A are going to be sent from service A to a collector somewhere. The idea that we have here is that each collector should have a complete view of the trace, so that it can make a decision based on that trace.
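That is what the load-balancing exporter from contrib gives us. As a hedged sketch, a first-layer collector might look like this, discovering its backing collectors through a DNS resolver; the hostname is illustrative:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  loadbalancing:
    protocol:
      otlp:                 # how to talk to the second-layer collectors
        tls:
          insecure: true
    resolver:
      dns:                  # backends are discovered by resolving this hostname
        hostname: otel-collectors.observability.svc.cluster.local
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]   # consistently routes spans by trace ID
```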
Why does a collector need the complete trace? If we are doing tail-based sampling, for instance, we want to take a look at the whole trace and make a decision on whether we want to keep it or not. Or perhaps we are doing some analytics on the tracing data: we want to look at the whole trace, perhaps compress it in some way, perhaps just extract some metrics and discard the trace itself, and things like that. To do that, we need to ensure that all the spans for the same trace land at the same collector. What we do not want is spans for the same trace at different collectors. We ensure that by having two layers of OpenTelemetry Collectors. The first layer is a load-balancer layer: a basic OpenTelemetry Collector with the load-balancing exporter, like the sketch above. This load-balancing exporter splits the batches that it receives from the clients: it looks at each single span, extracts the trace ID, hashes it, and determines which collector should be receiving that data point, and then sends the data there. So you can have an HA-like deployment of the load balancer, with three replicas for instance, while at the same time having hundreds of instances of the collector backing that load balancer as the main destination for this data. Those collectors might then be doing their own processing and sending data to the final destination, like Jaeger or your other tracing system.

All right, the next pattern that we have is a multi-cluster pattern, and it is very similar to the ones we've seen before for Kubernetes, either sidecar or DaemonSet. The idea is that very close to our application we have a sidecar or an agent that receives information from that application and sends it to a central collector local to the cluster. That collector then makes decisions about data from the cluster itself: perhaps it is adding cluster-specific information to the data points, perhaps it is doing tail-based sampling at the cluster level. The point is that we centralize all the data at that collector, and that collector then makes a very secure connection to a collector on a control-plane cluster. That communication might have different resiliency and reliability requirements, and it may have different security requirements, so it makes sense to extract that logic into one collector that works at the boundary. And then on the other side, on the control-plane cluster, we have a collector that is receiving data from those workload clusters. Of course, we're not talking about one cluster only; we're talking about multiple clusters sending data to a central control-plane cluster. That one receives the data, processes it the way it should be processed, and sends it to the final destination. So this is the multi-cluster pattern.

And finally we have the multi-tenant pattern. In this case, we have data coming from different tenants: we might have one application that is multi-tenant, or we might have multiple tenants as clients of our application. I named it here "OpenTelemetry Collector as a service", but in the real world it's more likely to be something like the IT department owning your observability stack, with each department as a tenant that is then charged back based on the resources it consumes. So we have one central location for the data to arrive at.
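As a hedged sketch, that central collector might combine clean-up logic shared by all tenants with per-tenant routing, for example with the routing processor from contrib. The attribute names and endpoints below are illustrative, and this assumes the tenant is sent in the request context, for instance as a header:

```yaml
processors:
  attributes:
    actions:
      - key: user.email            # hypothetical PII attribute scrubbed for all tenants
        action: delete
  routing:
    from_attribute: tenant         # routes on a "tenant" value from the request context
    default_exporters: [jaeger/tenant-a]
    table:
      - value: tenant-a
        exporters: [jaeger/tenant-a]
      - value: tenant-b
        exporters: [jaeger/tenant-b]
exporters:
  jaeger/tenant-a:                 # one Jaeger per tenant
    endpoint: jaeger-tenant-a:14250
    tls:
      insecure: true
  jaeger/tenant-b:
    endpoint: jaeger-tenant-b:14250
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, routing]
      exporters: [jaeger/tenant-a, jaeger/tenant-b]
```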
That central OpenTelemetry Collector is then taking care of logic that applies to all the tenants, for instance doing security, doing data clean-up: perhaps we are removing some personally identifiable information from the spans, perhaps we are adding some other information to those spans, and then sending data to the final destination. In this case, two different Jaegers, one for each tenant, and we can then charge each of those tenants based on their usage. The OpenTelemetry Collector that is central here could be owned by IT itself.

All right, we have a bonus pattern here, and that is a per-signal deployment of the OpenTelemetry Collector. In this case, we have one collector taking care of the metrics part and one collector taking care of the tracing part. There are some reasons why we would want to do that. The first one is that the way we scale an OpenTelemetry Collector doing scraping is different from the way we scale in a push-based model. When we have a push-based model, we can just scale up the number of replicas we have, and the clients then find the new backends and send data directly to them. When we are pulling data out, as in a typical Prometheus scraping mechanism, we cannot just increase the number of replicas for the collector, because the more replicas we have, the more scrapers we have, and they are all going to the same targets. So we might want a different instance of the collector taking care of metrics than the one taking care of the tracing data points. Another reason to deploy per signal is that right now the maturity of each of those signals is not the same. The tracing components are very mature, whereas the metrics components are not that mature yet. So we might want to mitigate that risk: a highly available, production-quality deployment for the tracing pipeline, while the metrics pipeline runs something more fragile than the tracing one.

So those are the patterns that I had for today, and I think there are a few key takeaways here. The first one is that the OpenTelemetry Collector is very versatile. It's incredible how many things we can accomplish with the collector, and I think the key is understanding and getting to know the existing components. Once you know which components you have, you can start planning, drawing, and architecting how you're going to deploy the OpenTelemetry Collector. Another key thing is to understand that collectors can be chained together. If you go back to our patterns, most of them are about chaining OpenTelemetry Collectors together at different levels: we have sidecars talking to other collectors local to the cluster, those talking to collectors on a control-plane cluster, and so on, but they are all the same binaries, just configured differently. And finally, mix and match components and collector instances. Plan your deployment and mix and match; you don't actually have to use the same binary everywhere, you can build your own binaries depending on the needs you have at each of those levels.

All right, so those are some of the resources that you can use to continue from this conversation: some information about OpenTelemetry and the collector, the location of the contrib repository and the builder, and the patterns from this presentation.
If you've used the OpenTelemetry Collector before, I'm quite sure you have your own patterns. So here's my action item for you: go to this repository, fork it right now, and include your own patterns. Thank you very much.