Hi, I'm Andy Keller. I'm a Principal Engineer at observIQ and a maintainer on the opamp-go project within OpenTelemetry. And I'm Jacob Aronoff from ServiceNow. I'm a maintainer of the OpenTelemetry Operator project, where my focus is on making the OTel experience in Kubernetes as simple as possible. Let's say you just bought one of these fancy new electric cars. This one is perfect for a quick trip to the beach. You get all your stuff together, your friends, towels, everything you need, and as you get in your car it says: tow me to the nearest dealer, update my software. You'd be pretty upset, pretty angry, but it's okay. We have the ability to do remote software updates now, so no problem. You update your car and you're on the road. You get closer and closer to the beach until your car starts to break down again. Once again it says: tow me to the nearest dealer, I don't know what to do. And now you're extra upset. But what if the car could just tell you what's wrong and walk you through what to do? Or maybe it could even fix itself. That would be so much better. Or maybe next time you should just ride your bicycle to the beach. It really sucks when cars break down, but they keep breaking down. Ideally the instruments on your car's dashboard will have some info. Sometimes you'll need to open the hood and take a look yourself. It would be pretty awful if access to both of those key functions of your car's operation weren't available to you. Software also breaks all the time, and we can't always look inside. This necessitated the creation of agents that tell us about the health of our software through dashboards and instruments. Back before the cloud, this would mean you would SSH into an instance and run something like top or tail to see how things were doing. Then the cloud came along and changed it all. Logs and metrics were the primary ways of understanding your systems, and for the most part the collection and sending of that data was done by an agent. An agent in the observability realm is something that looks at how your code is running and reports it somewhere. Oh boy, there are a lot of agents. They're all configured slightly differently and do slightly different things. One way of deploying an agent is in its eponymous agent mode. This is close to how most agents have been deployed in the past. Essentially, for every application you run, you also run an agent that receives your telemetry, processes it, and exports it to your telemetry backend. One issue with agent mode is that it scales one to one with the number of applications you run. A different and potentially more efficient way of running your collector is as a gateway. In the gateway model, you send your telemetry through a load balancer to a pool of agents. This is possible because agents are inherently stateless. But if an agent is your primary mode of getting telemetry out of your system, how do you know when you need to go fix it? If you're unsure where your data is coming from, or you don't alert on missing data, it's very challenging to know if your agents are alive and effective. How would you drive your car without a speedometer? What would you do if you got a flat tire on the highway and you didn't have a spare? Or maybe you forgot to replenish your brake fluid before your big trip to Chicago. We trust that our cars always have these features to allow us to operate them effectively. In the same way, we trust that our agents reliably send data to our vendors.
But just like the myriad ways a car can break down, agents can break down in many ways too. The first one is a networking failure. If data can't leave your network, it's impossible to debug why without going into the cluster or doing some analysis on your cloud networking configuration. When you send data to a backend, you usually need some amount of authentication to ensure that only privileged agents can send telemetry. Otherwise, anybody could just junk up your data. Hopefully the logs from your agent would give you some hints if you have an authentication failure. For the dashboards and alerts you rely on, if the agents sending that data are running out of resources and aren't scaling, data could be dropped or lost. This could also happen when your cluster is out of extra instances to scale your telemetry collection. Ghost agents exist when you are running lots of agents that aren't providing any value, maybe from a vendor POC that isn't used anymore or some experiment someone forgot to tear down. Regardless, these agents could be costing you thousands of dollars a day, and you never even realized it because no one was reporting on it. The next failure is configuration. Your agents are healthy and you see that they're doing some work, but for some reason nothing is coming out the other end. You look through your recent commits and see that Helm translated your 0.01 sample rate to 1. Now you're dropping everything and no one said anything at all. A great way of understanding these failures is to introduce an observer for your observability. This observer's purpose is to help you understand when your agents are misbehaving or unhealthy. These observers can even recommend or automate fixing the bad agents. Observing the observer is like having insurance. You don't always need it, but like a good neighbor, OpAMP is there. I love that. Alright, in this section I'm going to provide an overview of OpAMP, including the motivation, goals, and capabilities of the protocol. I will then cover agent status and remote configuration in detail, and then I will describe the implementation of OpAMP in OpenTelemetry. The OpAMP name stands for Open Agent Management Protocol. The specification and the implementation are both open source, and while the protocol is not limited to use by OpenTelemetry, the community working on the protocol and the repositories containing the implementation reside within the OpenTelemetry CNCF project. The repo containing the specification and protobuf definitions is called opamp-spec and was initially contributed by Tigran Najaryan of Splunk in November of 2021, and the opamp-go repo, of which I'm a maintainer, contains the reference implementation of the OpAMP protocol for use in agents and servers. OpAMP is the result of contributions from many individuals and organizations. As Jacob just explained, it's important to be able to remotely observe and manage your telemetry agents. This requires being able to communicate with running agents deployed across data centers and in the cloud. Over the years, many observability vendors developed their own custom protocols for agent observability and remote management. Take our company, observIQ, as an example. Over the course of 10 years (this may be slightly embarrassing, but it's factual) we implemented custom protocols in three different languages using both HTTP and WebSockets. Many other companies have likely followed a similar path.
Every protocol was better than the previous one, but each was still a custom, proprietary protocol. A simple telemetry pipeline looks like this: there's an agent that is collecting data from workloads and sending signals to a telemetry backend. Once there, we can make use of this data to create dashboards, alerts, and so on. It's also useful for the agent to send its own telemetry to the backend so that you can observe your observability. That's what this blue line represents. Additionally, some vendors allow management of the agent, showing which agents are connected, their status, and possibly allowing the configuration to be modified. Here we have three vendors with three similar pipeline architectures using their own agents and protocols. There's been a shift toward many of these vendors contributing to the development of open-source agents. With an open-source agent like the OpenTelemetry Collector, we have standardized on the protocol for sending telemetry. This allows us to move toward a bring-your-own-agent model where end users are able to build and deploy their own distributions of the collector and can send the data to multiple telemetry backends. However, to allow these agents to be managed, we need to establish a protocol for communicating between the agents and an agent management platform. This is where OpAMP comes in. It provides a standard protocol for managing agents. It can be implemented once in each type of agent, and each vendor can implement their own agent management capabilities in the server. Some vendors may focus on providing the best agent management experience while allowing telemetry data to be sent to multiple telemetry vendors. The goal of OpAMP is to be an open protocol for communicating between agents and agent management platforms. It's vendor-neutral, without favoring any particular observability platform. Any agent in any language can implement the protocol, and any management server in any language can manage these agents. The protocol is easy to implement, with a reference implementation in Go. It is flexible, with support for communication via HTTP or WebSockets. And finally, and this is important, it allows for a partial implementation of the protocol. Some agents may only connect and report their status. Some management servers may only show connected agents. Others may allow for full remote management of the agents, including remote agent upgrades. Let's first look at the way agents connect to the management server using the OpAMP protocol. As I just mentioned, HTTP or WebSocket connections can be used for communication. To keep things simple, there are only two messages defined by the protocol, and their names couldn't be more obvious: agents send agent-to-server messages to the server, and servers send server-to-agent messages to the agent. Here are the details of those messages. In these slides, I reorganized the fields, removed the protobuf tag numbers, and simplified the comments for readability. I will still show many of the protobuf structures because I think it's important to see the details, but if they're hard to read, they are all available in the opamp-spec repo. Both of these messages contain an instance UID, which identifies the agent that is either the source or the target of the message. Capabilities flags on each message identify the capabilities of the agent and server, and this allows for partial implementation of the protocol. Each message contains sub-messages that serve different capabilities.
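To make those message shapes concrete, here is a minimal Go sketch of the two top-level messages and a few of the sub-messages described later. The field names are paraphrased from the opamp-spec protobufs, most fields are omitted, and these types are an illustration only, not the generated code from the spec.

```go
package main

import "fmt"

// Simplified sketches of the two OpAMP top-level messages and a few
// sub-messages. Paraphrased from opamp-spec; see the repo for the
// authoritative protobuf definitions.
type AgentToServer struct {
	InstanceUID  string // uniquely identifies the agent instance
	Capabilities uint64 // bitmask of capabilities the agent supports
	// Sub-messages are optional and can be omitted when nothing has
	// changed since the previous report.
	AgentDescription *AgentDescription
	Health           *ComponentHealth
	EffectiveConfig  *EffectiveConfig
}

type ServerToAgent struct {
	InstanceUID  string // the agent this message is addressed to
	Capabilities uint64 // bitmask of capabilities the server supports
	// Optional instructions, e.g. a new remote configuration offer.
	RemoteConfig *AgentRemoteConfig
}

type AgentDescription struct {
	IdentifyingAttributes    map[string]string // e.g. service.name, service.instance.id
	NonIdentifyingAttributes map[string]string // e.g. host.name, user-defined labels
}

type ComponentHealth struct {
	Healthy      bool
	Status       string                      // free-form status string
	ComponentMap map[string]*ComponentHealth // nests as deeply as the agent needs
}

type EffectiveConfig struct {
	ConfigMap map[string][]byte // config file or section name -> body
}

type AgentRemoteConfig struct {
	Config     map[string][]byte // same shape as EffectiveConfig
	ConfigHash []byte            // identifies this config in later status reports
}

func main() {
	// The initial status an agent might report on connection.
	status := AgentToServer{
		InstanceUID: "example-instance-uid",
		AgentDescription: &AgentDescription{
			IdentifyingAttributes: map[string]string{"service.name": "otelcol"},
		},
	}
	fmt.Printf("initial agent-to-server status: %+v\n", status)
}
```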
To minimize the amount of data transmitted with each message, sub-messages can be omitted if the information sent previously hasn't changed. I'll describe a few of these sub-messages in detail later. Using WebSockets, on startup the agent connects to the management server. The server can use the connection headers, client certificate, or anything else available on the HTTP request to authenticate the agent. After connecting, the client sends an initial agent-to-server message with its status. This includes the set of capabilities supported by the agent and any status corresponding to those capabilities. This may include, for example, its current configuration. The server then responds to the agent with a server-to-agent message, including its capabilities and any instructions for the agent. If the response contains changes that modify the status of the agent, the agent must send another agent-to-server message with its new status. At any point while the agent is connected, the server may send a message to the agent. For example, it could instruct it to change its configuration based on a user clicking in a management dashboard, or it could be a GitOps workflow calling an API endpoint on the server. Because there is a persistent connection, these messages can be sent to the agent immediately. If the message modifies the status of the agent, the agent must respond with its new status. Using HTTP, the agent sends its messages in the body of a POST request. The server responds to each request with a server-to-agent message. The server can still send a message to the agent, but it must wait to receive the next agent-to-server message; it can then send its message in the response. To make this work, the agent must periodically send empty messages to the server. This allows the HTTP and WebSocket transports to be functionally equivalent. It increases the latency of messages sent from the server, but it doesn't require a persistent connection to be maintained. Now that we know how messages are sent between the agent and server, let's discuss the capabilities enabled by the protocol. There are five core capabilities and an additional capability to extend the protocol with custom messages. Status includes the description of the agent and the status of any other capability supported by the agent. This is required for all implementations. If it's not obvious, in this picture the big square is the management server and the circles in the middle are agents. Remote configuration reports the configuration of the agent and can allow the configuration to be remotely modified by the management server. Package management reports the packages of an agent and can allow them to be modified, which may include upgrading the entire agent binary. Own telemetry settings report the current settings the agent uses to send its own logs, metrics, and traces to a telemetry backend. This capability can also support remote configuration of those settings, and this is what enables visibility into the agent and its collection. Connection settings allow the agent's other connection settings to be reported and potentially modified over its connection to the OpAMP server. Finally, custom messages allow additional messages to be sent between the agent and server; their format and contents are not defined by the protocol. This allows the connection to be reused for vendor-specific behavior outside the scope of the protocol. Support for individual capabilities of the agent is specified with the capabilities bitmask in the agent-to-server message.
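As a small illustration of how the capabilities bitmask enables partial implementations, here is a Go sketch. The flag names and bit positions are assumptions chosen for readability, not the exact values defined by the opamp-spec capability enums.

```go
package main

import "fmt"

// Illustrative capability flags; real values live in the opamp-spec enums.
const (
	AgentCapReportsStatus          uint64 = 1 << 0 // required for all agents
	AgentCapAcceptsRemoteConfig    uint64 = 1 << 1
	AgentCapReportsEffectiveConfig uint64 = 1 << 2
	AgentCapReportsOwnMetrics      uint64 = 1 << 3
	AgentCapReportsHealth          uint64 = 1 << 4
)

// supports reports whether a capability flag is set in the bitmask.
func supports(mask, flag uint64) bool { return mask&flag != 0 }

func main() {
	// An agent that reports status, health, and effective config,
	// but does not accept remote configuration.
	agentCaps := AgentCapReportsStatus | AgentCapReportsEffectiveConfig | AgentCapReportsHealth

	if !supports(agentCaps, AgentCapAcceptsRemoteConfig) {
		fmt.Println("server must not offer remote configuration to this agent")
	}
	if supports(agentCaps, AgentCapReportsEffectiveConfig) {
		fmt.Println("server can display this agent's running configuration")
	}
}
```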
Here I have the agent capabilities grouped by those that send information to the server and those that accept changes from the server. Similarly, server capabilities indicate the capabilities supported by the server. Here they are grouped by fields that identify what the server will accept and what the server will offer. The combination of agent and server capabilities is what allows the protocol to support partial implementation. Let's look at agent status reporting in more detail. Agents are required to report status upon connecting to the server. They must also report status any time that status changes. The status represents the current state of the agent for all of its supported capabilities. After connecting, all supported sub-messages should be reported. In subsequent messages, only the sub-messages that change need to be reported. I'll describe the first three sub-messages here in detail. The agent description contains attributes that describe the agent. The combination of identifying attributes should uniquely identify the agent. The names of these attributes should match the resource semantic conventions defined by OpenTelemetry, and these attributes should also be included in telemetry sent by the agent so that it can be easily identified. Non-identifying attributes describe things like where the agent runs and can include any user-defined attributes that the end user would like to associate with the agent. The component health sub-message contains detailed information about the status of the agent and its components. The recursive component health map field allows the health to nest as deeply as needed to describe the agent and all of its components. And the effective config message represents the configuration actually being used by the agent. It contains a config map where the keys are config file names or config section names, as appropriate for the agent. Each file contains a body and content type, but the content format and encoding of the configuration depend on the type of agent. Now let's look at remote configuration. The agent can support both read and write of its configuration. Looking at the agent and server capabilities, we see that there are separate capabilities for the agent reporting and the server accepting effective configuration. There are also separate capabilities for the server offering configuration and the agent accepting remote configuration. Enabling all of these flags allows the agent to be remotely configured. As I show the remote configuration process, I will simplify the communication to only show the sub-messages. So on connection, the agent sends the effective config as part of its initial status message. If the server determines that it should be using a different configuration, and the agent accepts remote configuration, it sends an agent remote config message to the agent containing the new configuration. The agent immediately updates its remote config status, indicating that it is applying the new configuration. When the new configuration has been applied, it updates its status. This includes its new effective config and a remote config status with the applied status. This interaction follows a simple rule of the protocol: when the agent's status changes, it must be reported to the server. If the config could not be applied, the remote config status is updated with a failed status and an optional error message.
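Here is a minimal Go sketch of that apply cycle from the agent's side. The applyConfig and reportStatus helpers are hypothetical stand-ins for real agent logic and the actual OpAMP client send; the status names loosely mirror the applying, applied, and failed states just described.

```go
package main

import (
	"errors"
	"fmt"
)

// RemoteConfigStatus mirrors, loosely, the status the agent reports back
// after receiving a remote configuration.
type RemoteConfigStatus struct {
	LastRemoteConfigHash []byte
	Status               string // "APPLYING", "APPLIED", or "FAILED"
	ErrorMessage         string
}

// applyConfig stands in for agent-specific logic: validate the config,
// write it out, and restart pipelines.
func applyConfig(body []byte) error {
	if len(body) == 0 {
		return errors.New("empty configuration")
	}
	return nil
}

// reportStatus stands in for sending the status inside the next
// agent-to-server message.
func reportStatus(s RemoteConfigStatus) {
	fmt.Printf("-> reporting remote config status: %+v\n", s)
}

// onRemoteConfig handles an agent remote config message: report applying,
// attempt the change, then report applied or failed.
func onRemoteConfig(hash, body []byte) {
	reportStatus(RemoteConfigStatus{LastRemoteConfigHash: hash, Status: "APPLYING"})

	if err := applyConfig(body); err != nil {
		reportStatus(RemoteConfigStatus{
			LastRemoteConfigHash: hash,
			Status:               "FAILED",
			ErrorMessage:         err.Error(),
		})
		return
	}
	reportStatus(RemoteConfigStatus{LastRemoteConfigHash: hash, Status: "APPLIED"})
}

func main() {
	onRemoteConfig([]byte{0x01}, []byte("receivers:\n  nginx: {}\n"))
	onRemoteConfig([]byte{0x02}, nil) // simulates a config the agent cannot apply
}
```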
Here is the agent remote config message that is sent from the server to instruct the agent to use a new config. It includes the config in the same format as the effective config message sent by the agent. It also includes a config hash that allows the configuration to be easily identified in subsequent messages. The remote config status is updated after an agent remote config message is received. It includes a last remote config hash that refers to the config hash sent in the agent remote config message, and it also indicates a status of applying, applied, or failed. If the status is failed, an optional error message can be included. As I mentioned in the beginning, there's more to the OpAMP protocol than we have time to cover, and I encourage you to check out the spec for more detail. OpAMP is currently being implemented in OpenTelemetry, and there are three parts to this. I'll cover two of them, and Jacob will cover the third. There's an OpAMP extension in the collector that can be configured to contact an OpAMP server. It reports the agent description, and it will report the component health and the effective config of the agent as well. It provides a read-only representation of the collector to an agent management server. The supervisor is the second part. It is a separate process that manages the OpenTelemetry Collector and implements both an OpAMP client and an OpAMP server. As you see in the diagram, the OpAMP extension in the collector reports to the OpAMP server in the supervisor, and the supervisor contains an OpAMP client that connects to the OpAMP backend. It relays the collector status to the OpAMP backend, and when it receives a new configuration, it can configure the collector by writing a new config and restarting the collector process. So now I'm going to give a demo of remote configuration using an OpAMP server. I recorded this because the Wi-Fi has not been great, as anybody has noticed. Okay, so what you see here is an OpAMP server called BindPlane, which is capable of remotely configuring agents. On the screen is a configuration that is sending Nginx logs to Grafana and S3. The configuration is deployed to 25 agents; we can see them below. This is version 15 of the configuration. We can see they are sending logs, and there's one processor that is reducing the number of logs going to Grafana. We're sending all logs to S3, but only error logs are being sent to Grafana. Suppose there's an issue and we decide that collecting info-level logs would help us diagnose it. We can remove this severity filter, save it, and then start a rollout to the agents. What you're seeing is a staged rollout sending configuration to a few agents at a time. You'll see the agents briefly go into a configuring status and then go back to connected. When it completes after a few seconds, we'll see that all logs are flowing to Grafana. After diagnosing the issue, I could come back, repeat this process, add the filter back, and deploy that again, and that would allow me to remotely manage this configuration. Now Jacob's going to talk about OpAMP in Kubernetes. Thanks, Andy. First, I'm going to describe the OpenTelemetry Operator, which we use for all of this. The OpenTelemetry Operator was created in August of 2019. The goal of the project is to make using OTel in a Kubernetes environment as easy as possible. The operator adheres to the standard operator pattern in Kubernetes. This means its main job is to reconcile its custom resources.
In this case, the operator provides these resources: the collector, an abstraction to simplify the deployment and management of OTel collectors; instrumentation, a resource to simplify automatic instrumentation of your applications; and the newest one, the OpAMP Bridge, which I'll be talking about later. There are many modes to deploy the collector. Today, I'll only focus on the three most common ones. This is what running the collector in deployment mode might look like. This is especially useful for traces and logs in your telemetry. DaemonSets are especially useful for things like Prometheus metric scraping or log collection, where you collect from each workload that lives on the same node. Finally, you can run a collector in a StatefulSet. A StatefulSet lets you take advantage of features like the target allocator, which allows you to distribute Prometheus scrape targets amongst your pool of collectors. This lets you horizontally scale your collectors with the number of pods you run. A StatefulSet also gives you stable naming, as well as the ability to connect to volumes for persistent queueing. As stated before, OTel is a huge ecosystem with so many features. You can enrich all of your telemetry with data from the Kube API. You can scrape all your Prometheus metrics. You can even run complex transformations in the OpenTelemetry Transformation Language (OTTL). Say you have a cluster that runs a collector deployment that does some Kubernetes attribute enrichment and forwards all of the OTLP traces to your backend. Maybe after that, you add a StatefulSet collector with the target allocator to scrape all of your metrics and send them via Prometheus remote write to a backend. And finally, you add a DaemonSet to get all of your cluster and application logs forwarded to an Elastic backend. Trying to pin down where a metric, trace, or log came from, or more importantly, where data is being dropped, can be a real challenge when you're running so many collectors. When one of these collectors is crash-looping, it can be hard to debug. Going back to the introduction: how do you debug when the agent you rely on for debugging information is unreliable? Beyond that, managing and wrangling these collector pools becomes a real hassle. If you're going through CI/CD without running any validation on your config, you could potentially push out a bad configuration and need to run through your whole CI/CD pipeline again to stop the crashing. Or maybe you heard of a new feature in the collector that you desperately want in all of your collector pools. Maybe a new service came online that's sending way too many spans and you need to sample it as soon as possible to not explode your spend. At this point, your head might be spinning at the possibilities. If you're in hundreds of clusters and have many different configurations in each, it gets incredibly unwieldy. Imagine using Argo CD to deploy your collectors and you want to add a new feature to all of your clusters that gets the cluster name and appends it to all of your telemetry. If each configuration is in its own values file, you need to update however many clusters you have times however many configurations you have and wait for them all to roll out. Well, today I'm happy to introduce the OpAMP Bridge. The bridge sits in your Kubernetes cluster and acts as a supervisor that reports on the expected configuration and current health of your collector pools. The bridge queries the Kube API using the OTel Operator's collector resource to get the status and health of each pool you run in that cluster.
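As a rough sketch of what that query could look like, the snippet below lists the operator's OpenTelemetryCollector resources using client-go's dynamic client. The group and version shown are assumptions that may not match the CRD version installed in your cluster, and this is not the bridge's actual implementation.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster with RBAC limited to the
	// collector CRD, as described above.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed group/version/resource for the operator's collector CRD;
	// check which version your operator installs.
	gvr := schema.GroupVersionResource{
		Group:    "opentelemetry.io",
		Version:  "v1beta1",
		Resource: "opentelemetrycollectors",
	}

	list, err := dyn.Resource(gvr).Namespace(metav1.NamespaceAll).
		List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, item := range list.Items {
		// Each item is one collector pool whose status and health a
		// bridge-like component could report over OpAMP.
		fmt.Printf("collector pool %s/%s\n", item.GetNamespace(), item.GetName())
	}
}
```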
You can even use an OpAMP server to remotely configure your collector pools. Recall the scenario I mentioned where you needed to immediately sample traces from a noisy application. Through the bridge, you're able to add in that sampling configuration in as little as one second. Your OpAMP server sends it to the bridge, and then the bridge simply applies it via the Kube API. The bridge is able to accept configuration for all of the OpAMP capabilities that Andy talked about earlier. The bridge also has an allow list to reject any unknown or unverified components in your collector. The bridge's default RBAC also limits it to only query the Kube API for information about the OTel collector CRDs. Now, time for a quick demo. As you can see here, it looks like we stopped receiving infrastructure metrics about 10 minutes ago. I can probably assume that one of our collector pools is unhappy. I'm now going to apply the OpAMP Bridge, which should tell me all about the state of my collectors and what I should do now. Here you can see the configuration for the bridge. Now I just apply the Helm chart with the bridge and wait for that to roll out. And now I'm going to go over to this little OpAMP server I wrote in Elixir for this demo. You can see here that the agent is connected and there are four collector pools as well. You can see that the cluster-stats pool only has zero out of one replicas available. So let me just pop over to the logs to see what's up. You can see in the logs there's a configuration problem: the verbosity is set to debug, which isn't right. I go back into my configuration, change it to what I know is correct, which is normal, and then I submit that change. You can see that it's immediately applied via the bridge, and if I go back to my metrics, I'm just going to wait for it to roll out. And there we go. Metrics are back and I've solved this incident, making sure that nobody's going to be woken up. Nobody else, I should say. The bridge is also able to only allow reporting for some collector pools. Here I have a logging collector and I'm going to try and edit it, and you'll see that the bridge denies that change, and my metrics are still happy. To wrap it up, we've shown today that agents are powerful observability tools. We've also shown that you probably need some form of insurance in the form of an observer. The OpAMP protocol is a vendor-neutral method of expressing agent management. The OTel collector and the operator are both implementing OpAMP, and we're looking forward to more contributions. Finally, we really encourage you to join the CNCF Slack and the OTel Agent Management Working Group channel, or the OTel Operator channel, to talk to us more. You can find us at these booths. Please leave some feedback via the QR code above. And I think we do have time for some questions as well, so please come up and ask us some questions. Thank you. Okay, do you want to come up to the mic over there? In some cases, we've come across backpressure issues with the OTel collector, so how would this help us troubleshoot and resolve that? That's a great question. Do you want to... Sorry, the question was: what if you see backpressure? I guess there are a couple of things. First of all, you could be identifying the backpressure via own telemetry, which is usually being sent to the backend, so you would see this in the telemetry being sent by the collector. If it's a configuration issue, you could send down a new config and hopefully resolve the issue that way.
And then you're also able to just... yeah, as Andy said, send down a new config. With the OpAMP Bridge, you're also able to change the number of replicas or configure the HPA for the collector as well. Do you want to go up to the mic? Yeah, thanks for the talk. I apologize if this was covered. I know the main use case for this is to manage collectors. Could this also be used to manage the configuration of your SDKs in your application code? It's 100% the goal. It's in an issue right now in the repo, and it's something where we still have some work to do in actually designing the architecture and implementing it in every language SDK. But I love the idea. I would accept contributions. It's intentionally simple and constrained, but also flexible, so it should be fairly easy to implement. But it'll be different in every SDK. Okay, cool. Thank you. Contributions welcome. Definitely. A question I had, and maybe this was made clear in the presentation while I missed it, is about how flexible the agent status is for that last backpressure scenario. If we have some sort of protection mechanism when we're in a backpressure mode, or backpressure signals are being sent, or maybe we're reading runtime memory information, could that be part of an agent status message that a backend is capable of reading and reacting to? Yeah, we've actually been doing a bunch of work for this exact scenario, such that the agent, the actual collector, would be able to report its component status, which is this recursive map, and ideally something like the memory limiter could say, you know, I'm under backpressure, at which point your backend could effectively respond to that. So it's definitely something we're thinking a lot about. Yeah, I'm just going to bring this up. This is fairly recent, actually, in the collector, and part of the goal of that implementation is to be able to report this up through OpAMP. So this allows that nested structure and allows a lot of status information, and the status was intentionally left as a string, with this recursive map that can be arbitrarily deep, to try to accommodate any agent scenario or any component that might have N levels deep of configuration or something. So that's the goal of the structure. Thank you for a great talk. I might misunderstand something, but the remote configuration updates the configuration of the OpenTelemetry products, right? Then how does this work with configurations managed in a GitOps style, or should we stop managing the configurations in GitOps? That's a good question. I still think that's a valid way of deploying and changing configurations, and this is not intended to replace that. I wouldn't advise against it, I guess I would say. I would just say it's kind of a both-and situation. I think there will be situations where people are primarily using the remote configuration in a read-only mode, where it's reporting the configuration to a dashboard of your agents so that you can see what those agents are actually running, and then if you notice an issue, you would maybe change it in your Git repo and use GitOps to deploy the new configuration. So that's a valid architecture as well. Thanks for the talk, it's very informative. I had a question about bad configs being pushed to agents. I was wondering where that problem is solved. There's a bad config somewhere in the configuration and we're pushing it to the agent. Do we expect the supervisor process to deal with it?
Do we expect that to be done at the server level, or do we expect the health checks to somehow validate that and revert to an old version? How does that work? Yeah, so I can tell you the intention is certainly to do this in a safe manner. It's kind of up to the implementer, and there are two implementations I'm aware of. One is in our observIQ collector, which is called the BindPlane Agent, a distribution of OpenTelemetry. It will save the old config, try the new config, and if that fails, revert, run the old config, and report its status as being an error, while still using the old config and continuing to collect telemetry. Similarly, the supervisor will write a new config, and if the collector fails to start, it will restore the old config and report the error status as well. I would encourage any other implementation in any other agent to work the same way, but right now those are the two. Yeah, and then in the bridge and operator group we're working on enhanced configuration validation as well. We have time for one more question, I think. Thank you for the talk. I've got two quick questions. On the config management, wouldn't it be possible to have a GitLab or GitHub URL in the editor instead of the whole config, and just upload that and say use this instead? That's a great idea. Then you don't have to worry about whether your database keeps the config or GitHub does. My question really is on the telemetry that you collect for the pipelines that you're showing in the UI: how big of a snapshot are you taking of the logs, metrics, and traces, and how much data is that consuming in the environment from the demo, when you go and look at the logs and verify that the metrics or traces you're getting are what you're expecting? That's using a custom processor we wrote that is just buffering the last roughly 100 signals. I say roughly 100 because we don't try to break up batches, since that would be wasteful if we're just going to throw them away anyway, but we try to keep it small. I think we are out of time, unfortunately, but please come ask us questions. We're happy to stay back and chat a little bit more.