And now I have the pleasure to welcome Braydon Keynes from Google Cloud, who's going to talk to us about overhead and how to evaluate your observability agent's performance, which is a really interesting topic. So give a round of applause for Braydon to get started.

All right, is it on? Oh yeah, there it is. Cool. So hi, my name is Braydon, and I'm going to be talking about — well, I worded it differently on this slide; maybe I should have checked before I wrote that. But I'm here to present the most boring-looking slides you've ever seen. Everybody had way cooler slides than me.

Observability agents are — oh, shoot, it's not changing. There we go. Observability agents are quickly becoming one of the most important parts of most modern infrastructure, especially in scenarios where you can't go change an application and add a bunch of instrumentation to get metrics, or you just need system metrics collected alongside your application. When you're installing a new piece of critical infrastructure on your VMs or in your clusters, one of the first big questions is: what's the overhead? What effect is installing this going to have on my system? Unfortunately, as important as this question is, there isn't really a straight answer. Observability agents are — yeah, I love this picture — by design an orchestration of essentially tiny little programs that you can put together to do any combination of things you want, which means that if you ask what the overhead is, you're not going to get an authoritative answer. But in this talk, I'm going to try to explain ways we can rephrase the question to get more valuable information.

First, I'll explain who I am. My name is Braydon Keynes. I'm a software developer at Google Cloud. The team I'm on is called Collection Services. We're focused on making the telemetry collection experience the best it can be, and the best way for us to do that is to work in the open, so I'm heavily involved in the Fluent Bit and OpenTelemetry projects to try to make them better for everyone. I'm also the creator and maintainer of a tool called the Amelfimp, so if you want to come yell at me about how it doesn't work on Helm charts or something, you can come find me after.

If you're a Linux user, you're probably familiar with a command like this. The Unix philosophy is tiny programs that each do one thing and do it well, and you string them together to get new use cases you couldn't have imagined on their own. In this example, I have a command where I grab some information about a process from the system, I pull some piece of information out of it, I filter out the information I don't want, and then I do a transformation to get some useful output. If you're familiar with observability agents, you'll recognize this pattern as the same sort of pipeline you're creating in your agent configurations. Observability agents generally follow the same pattern: you ingest the data, either through pull-based or push-based ingestion; you process that data to transform it into something more useful; and then there's a stage where you export it to different backends, which might involve some work to translate it to a different protocol for a compliant backend, or to send it to something very specific and bespoke like a major cloud provider, or to something like Prometheus.
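To make that pipeline shape concrete, here is a minimal sketch of an OpenTelemetry Collector configuration with the same ingest, process, export structure the talk describes; the endpoints and backend here are illustrative placeholders, not something from the talk.

```yaml
# Minimal collector pipeline: ingest -> process -> export.
receivers:
  otlp:                         # ingestion stage (push-based OTLP)
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}                     # processing stage (here, just batching)
exporters:
  otlphttp:                     # export stage; backend URL is a placeholder
    endpoint: https://backend.example.com
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```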
Unfortunately, observability agents don't look like the last slide when they're configured for production. They look more like this. You can configure your agent to have any number of different pipelines, and those pipelines can have varying amounts of complexity. That's why asking "what is the overhead" isn't always going to get an easy answer if you don't know what you're going to be doing with the agent. So I'm going to try to rephrase that one question into three different questions: Where is my overhead coming from? What can I do to improve it? And how do I evaluate it for myself?

To do this, first I have to define what I'm calling overhead and how I'm thinking about it in this talk. It's generally a combination of resource usage measured against throughput. For resource usage, there's memory: I mentioned resident set size here — I was the one who raised my hand and said that last time, and I got egg on my face, but I think for that scenario it made sense. For this scenario, I think resident set size makes sense too: it's the amount of memory the process is actually taking up in the system, which for an overhead snapshot might be more useful. But what Brian said is true: there's no one memory metric to rule them all. CPU usage is a little easier, because you can measure the CPU time of the process, so it's easier to get a solid answer there. Disk usage is important too, especially if you configure your agent to buffer data on disk — which I recommend — so that you don't lose data in tragic scenarios like network outages. And all of these are measured against throughput, where depending on the type of signal it might make more sense to talk about data points per second or bytes per second.

I'm going to go through each stage of the pipeline and talk about my experience finding potential performance challenges or misconfigurations in each one, starting with pull-based ingestion. Pull-based ingestion is the mechanism of going out somewhere and getting data on an interval. Most people will be familiar with things like Prometheus scrapes, where you're scraping a Prometheus endpoint to get text metrics. Another common one is the OpenTelemetry Collector's host metrics receiver, which gets information about either all processes or a specific process you ask for. Or you might be familiar with some of the third-party application receivers in OTel Contrib, like the Apache receiver, which queries the server-status URL.

As for the challenges with pull-based ingestion: the nice thing is that under most circumstances it's a very predictably sized workload, and it happens on an interval. So really the biggest question for measuring performance is just how much data you're actually going to be working with. My favorite example of this is Prometheus scrapes. The scraping library has a limitation where, because of the way the Prometheus metrics format works, you essentially need the entire buffer ready in order to properly parse all the metrics out of it, and that requires the Prometheus library to make an entire copy of the scrape in memory. We've had scenarios where we've helped customers with Prometheus setups with something like 160,000 metrics, and the agent takes gigabytes of memory to process that, especially on too short a scrape interval. If you're scraping too quickly, scraping lots of data, and especially not using things like scrape timeouts to control it, you're going to overrun your agent: it tries to finish the last scrape, then there's a new one, tries to finish this scrape, then there's a new one. You can really get yourself into a bad scenario that way.
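As a sketch of the interval and timeout knobs just mentioned: this is roughly what they look like on the OpenTelemetry Collector's prometheus receiver, which embeds standard Prometheus scrape_configs. The job name and target are made up for illustration.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: my-app             # hypothetical job
          scrape_interval: 60s         # longer intervals mean fewer full-scrape copies in memory
          scrape_timeout: 10s          # give up on a slow scrape before the next one starts
          static_configs:
            - targets: ["localhost:9090"]  # placeholder target
```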
Scrape implementation is something users don't always have good control over, but there are examples of scrapers that aren't implemented as efficiently as they could be. I'm quite familiar with OpenTelemetry's host metrics process scraper, which is not very efficient when it collects these metrics, because the underlying library it uses is targeted at a single process, and when you do that same thing over and over again for every process, you're doing a lot more work than you technically need to. This is something users don't really have control over, unless you want to contribute something better than what's there, but it's worth keeping an eye on.

Push-based ingestion is the opposite of pull-based: instead of going and fetching data from somewhere on an interval, you're opening yourself up to the world to receive data. I put file tailing here — if you're familiar with how file tailing is implemented, it's really more pull-based, but anyway, I'm counting it as push-based. You could be writing logs to a file, or you could be opening your agent up to something like Prometheus Remote Write, Jaeger, or OTLP.

The big challenge with push-based data is that, unlike pull-based, it's not a very predictable workload. You need much more intimate knowledge of what you can expect to be pushed into it, and there's nothing stopping something from writing way more logs in one second and then nothing for 10 minutes. Those bursts of data can very easily overwhelm the pipeline if you're not ready for them. Something I'll talk about later is backpressure and batching and how to deal with bursting data. If you're configured for filesystem buffering, you can overrun your disk really quickly, and if you're not limiting your memory, you can overrun your memory really quickly if you're not ready for these bursts.

Most agents have worked around this next one by now, but I decided to include it anyway, especially if you're running an old version of Fluent Bit or something. There can be issues where, if you're tailing a bunch of files and one file is way busier than the rest, the main process gets starved because it's so busy processing the one busy file and keeps accidentally prioritizing it. I haven't looked into how OpenTelemetry handles this, so if you know more about that, please come talk to me after, because I actually want to know how OpenTelemetry handles this scenario.
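On being ready for bursts, as mentioned above: here is a minimal sketch of bounding memory and spilling to disk, written in Fluent Bit's YAML config format. The paths and limits are illustrative, and exact key names may vary by Fluent Bit version.

```yaml
service:
  storage.path: /var/lib/fluent-bit/buffer  # illustrative on-disk buffer location
pipeline:
  inputs:
    - name: tail
      path: /var/log/app/*.log              # illustrative path
      mem_buf_limit: 50MB                   # cap in-memory buffering during bursts
      storage.type: filesystem              # spill to disk instead of growing memory
  outputs:
    - name: stdout
      match: '*'
      storage.total_limit_size: 500M        # bound disk usage so a burst can't fill the volume
```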
Processing is the step in the middle. It's where you try to do something useful with your data. Maybe you're filtering out unwanted data, or you're transforming it. You might be doing structured log parsing; JSON and regex log parsing are very popular. There are Kubernetes filters and processors in most popular agents that fetch metadata from the Kubernetes API to enrich your logs and data. And what's becoming more popular are Lua and Wasm processors in Fluent Bit, and Wasm processors in OpenTelemetry, that sit in the middle and let you do much more advanced processing in your pipeline, in case there's something the default processors won't do for you.

The biggest challenge here is that you're doing all of this on a pipeline that's handling megabytes of data a second. Even the smallest actions have a multiplicative effect on your overhead; especially with regex or JSON log parsing, the cost grows really quickly.

Where a plugin actually runs in the pipeline is becoming more important, especially now with Fluent Bit 2 and what Eduardo talked about: processors instead of the old filters. Processors can run in input and output threads, instead of the old way, where filters all ran in the main thread and used to bog down performance. Processors are much better, so if you're in a scenario where you can move to a newer version of Fluent Bit and try processors instead of filters, I highly recommend it — I've had a lot of success with them.

Another example is on the tail input. In OTel, Fluent Bit, and I think Vector, there are ways to specify a parser on a tail input that runs before data makes it to the rest of the pipeline. This could be a JSON parser or a multi-line parser, and it's really convenient for understanding the pipeline — you don't have to send everything through a JSON processor first. The problem is that if you overwork the tail plugin by also making it do the parsing, you can really hamstring how fast it can read data out of the file, because it's so busy doing other stuff. We had some issues with the Docker log parser on the tail input plugin in Fluent Bit, where we instead decided to move it off to a filter, and it was a crazy, crazy performance improvement compared to putting it on the tail plugin. So that's something to keep an eye out for — there's a sketch of that shape below.

JSON is a data format that for some reason we've universally decided is the default way to communicate with computers, but unfortunately parsing it is difficult and slow. It can be a real problem when you're trying to parse something deeply nested; if you're parsing deeply nested JSON logs, you can get into a lot of trouble. We had some trouble in the Google Stackdriver output plugin where someone was trying to log a MongoDB query that kept nesting deeper and deeper and deeper and was crashing the process. So really deeply nested data can be a real problem. But JSON is just slow in general, especially if you're running it on all your logs — if you're JSON-parsing every log you put through.
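Here is that sketch of the "move parsing off the tail input" advice, in Fluent Bit's YAML config format, with JSON parsing done in a separate filter so the tail plugin can spend its time reading. Paths and tags are illustrative; on Fluent Bit 2+, the same parser could instead run as a processor, per the processors-over-filters advice above.

```yaml
pipeline:
  inputs:
    - name: tail
      tag: app.*
      path: /var/log/app/*.log  # illustrative path
      # no parser here: keep the input focused on reading the file
  filters:
    - name: parser              # parse in a separate stage instead
      match: app.*
      key_name: log             # field that holds the raw JSON line
      parser: json
  outputs:
    - name: stdout
      match: '*'
```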
Exporting is the final stage. Now that you've done all this nice ingestion and processing on your data, you need to put it somewhere. You might be sending it to one of the major cloud providers, remote-writing it to an external Prometheus, or sending it to an OTLP-compatible vendor or even another agent. The biggest challenge with exporting is that most of it happens over the network, and all the problems with sending data over a network come into play here. Sending data over a network is never going to be as fast as moving data within the program, so it's very easy for your input to outpace your export. And that's where we start getting into problems with backpressure.

With batching and backpressure, you can handle bursts of data without losing it. If your output is so busy trying to send the network request — maybe one slow API request is slowing everything down — and you're not applying any backpressure upstream, you're just going to keep building up memory as you try to hold on to all those logs without dropping them. Batching is a good way to deal with this: if you batch your data up larger, you might use a bit more resources, and it might take a bit longer to buffer up enough data to send, but there are fewer overall requests that need to be made for your throughput. Dealing with batching and backpressure takes a lot of experimentation to find the right sizes, though, because the hammer has to drop somewhere. If you're pushing through a lot of data and your batch size isn't big enough, you're eventually going to lose data, especially if you have other things on your collector like memory limiters — anything that's trying to limit your usage means you'll eventually lose data if you push through too much. So it takes a lot of tweaking to get right.

On threading: if you're running Fluent Bit and your output plugin doesn't support threading, you should ask the maintainers to add it — though I don't think there are any that don't support it at the moment. If you're having trouble sending data fast enough, I highly recommend increasing the workers, or at least leaning into the threading implementation of the agent where you can, because exporting is really the only step in the pipeline that can easily be parallelized. Most backends can handle timestamps arriving a little out of order if one worker sends a little faster than another; usually they'll be able to reconcile all that. So it's really easy to set up: if you set it to eight workers, for example, that means a thread pool of eight workers sending data at the same time, and that can really open up your pipeline. The pipeline can dispatch data to the thread pool and let one of the workers deal with the slow part.
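As a sketch of the batching and parallel-sending knobs in OpenTelemetry Collector terms: the numbers below are illustrative starting points, not recommendations from the talk.

```yaml
processors:
  batch:
    send_batch_size: 8192       # bigger batches: fewer requests per unit of throughput
    timeout: 5s                 # but more latency and memory spent buffering
exporters:
  otlphttp:
    endpoint: https://backend.example.com  # placeholder backend
    sending_queue:
      enabled: true
      num_consumers: 8          # parallel senders, like Fluent Bit's output workers
      queue_size: 5000          # bounded queue: backpressure instead of unbounded memory growth
```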
So the last question we need to answer is how you evaluate this for yourself. If no one's going to give you an authoritative answer on which agent is best or what overhead you can expect, you need to figure it out for yourself, and the only way is to try running it. If you're able to replicate your production environment, install the agent, configure it, and watch some of the metrics I mentioned on the earlier slide, that is by far the best way to get an answer. But in case you can't easily replicate your production environment, I have some ideas for test workloads — some of the things I've done to do a bit of benchmarking on our own products.

Using a log generator is the obvious default. If you're testing a log pipeline, there's a really good log generator from AWS — called AWS log bench, or maybe they called it something bigger, but I'd call it AWS log bench. It lets you specify a size for your logs and a rate per second to send them. That's a good way to get answers like: if I'm at 100 megabytes a second of logs — some ridiculous number — what's the overhead going to look like? You can test at different volumes, and you can test it sending the logs through JSON parsers, or through a transform that modifies 14 fields, or something like that. You can probe the limits really well with a log generator. Log generators are very synthetic, though: they send the same logs at the same specified rate every second. It's not the most realistic environment, but it's a good start if you want to test the limits.

If scraping Prometheus is the main thing you're going to be doing with the agent, it's actually very easy to replicate: take a copy of the text scrape, and if you don't expect it to change much, set up a mock server that returns just that scrape. That's a really good way to get a sense of what resources you're going to need, because changing metric values isn't really where the resource usage comes from in Prometheus — it's the processing of the scrape itself. So if you can mock at least the shape of your scrape, that makes a good test workload.

If you're scraping metrics, you can also try to force high-cardinality scenarios to test the limits. This is especially important if you expect to be scraping something like a database. Database metrics are among the first to go a little crazy in terms of cardinality, because there will be a time series per table, or per replica, or per table per replica — it depends on the database. If you can find a way to force high-cardinality scenarios — for example, scraping an example database with tons and tons of tables — that's a good way to stress how the agent behaves when you push too many points through it.
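While any of these test workloads run, you also need to record the agent's own resource metrics. One way, as a sketch: the OpenTelemetry Collector can expose its own metrics (RSS, CPU seconds, queue sizes) for scraping. The address and level shown here are one possibility and depend on the collector version.

```yaml
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888   # then scrape otelcol_process_memory_rss,
                              # otelcol_process_cpu_seconds, etc. during the benchmark
```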
What if you don't like the answer? If you do this evaluation, figure out how much overhead your agent is going to have, and you don't like it, what do you do? Trying to do less is the obvious option. It sounds obvious when I put it that way, but generally, if you're doing more, you're using more resources. Find ways to reduce how much you're processing, reduce the size of your scrape, or offload the work somewhere else, which is my favorite option. If you have a backend that will do the JSON processing for you — I mean, that would be a dream world if it exists, because you could just shovel raw JSON over there and let them do the JSON processing. I'm also pretty bullish on aggregator nodes, where you have a pool of agents that lots of other agents can push data to. It's a lot easier to manage one location of agents and scale that out than to deal with agents all across your fleet doing too much or growing too big.

If you're truly in a scenario where you think you've hit unacceptable performance, or you've found a regression in some upgrade, then when you open an issue for the maintainers, make sure you include good information. It's hard to say authoritatively what the right information to include always is, but really, any information helps with performance issues, because a lot of the performance issues open in these repos are about things maintainers will never be able to access. Make sure you have a way to replicate the performance issue you're seeing, and include graphs or CSVs, Linux perf reports, or pprof profiles. Any of that is going to be very helpful for maintainers looking into performance issues.

And I think that's everything. So thank you. My name is Braydon. You can find me on the CNCF Slack or on Twitter — I had my Twitter handle up there at some point. Anyway, thank you.

We have another question. Great, thanks. Of course, the obvious question is which collector is the best, but I understand why you didn't want to answer that. Maybe you could answer: what do you use in Google Cloud, or what are customers of Google Cloud using, and are you considering a change, and why?

Yeah, I probably should have prefaced the talk with that a little more. One of the main things my team works on is the Google Cloud Ops Agent, which I call two agents in a trench coat, because under the hood it's Fluent Bit collecting logs and OpenTelemetry collecting metrics and traces. We have a central config layer that generates configs for the underlying OpenTelemetry and Fluent Bit, with recommended tunings for folks running primarily on plain VMs. For metrics, we use the host metrics receiver by default. We support a lot of third-party applications, so we use the Apache receiver, the NGINX receiver, all the database receivers. We also have support for Prometheus — a Prometheus receiver that a lot of folks have started using — and an OTLP receiver that fewer people are using. That's mainly what we use, and the reason I think about all this is that we're trying to make the best recommendation when we generate these configs for people. They don't really know what all the knobs are, so we've tried to find the right settings for the knobs. That's why I'm thinking about all of this.

Can you also speak to horizontally scaling these agents, in case the application is big and you have to send the logs and metrics to multiple agents? What are the best practices for that? Sorry, can I hear that again? Okay, yeah — can you speak to best practices for horizontally scaling the agents: if the number of application pods is growing, how do you decide which agent to send a particular log or metric to?

Right, so it's about best practices for scaling the agents and deciding what to send where. I don't have a lot of experience with scaling the agents beyond the aggregator-node idea I mentioned, so I don't have a good answer for that. In terms of deciding which agent to send to: if you're not doing a lot of processing, basically any agent can shovel data super fast — except for Fluentd. Every agent can send data super fast without any processing in the middle.
So it doesn't matter too much, and then it comes down a little more to functionality. But in terms of performance, yeah, you kind of just have to try it — I think that's the best way to tell. I've really liked working with Fluent Bit, though; there's not the big upfront cost you get with the OpenTelemetry Collector being a Go program. In our experience, we've seen more resource usage from it than from Fluent Bit by default, but it does depend on what you're doing with it. Thank you, Braydon.