Thank you for coming to my talk. I appreciate it. This is me. I'm Jason. I'm a software engineer at Splunk, but I'm not here to sell you anything; the opinions in this talk are mine alone and not necessarily those of my employer. I help out with the Java instrumentation project and with Android within OpenTelemetry, so that's my background. This is maybe the dumbest outline of a talk you'll ever see: what's the problem, who cares about that problem, and what are we going to do to make it maybe a little bit better? Hopefully we'll get through that today in 25 minutes.

So maybe some of you have spoken with this guy before. He says, "Well, I added this instrumentation to my project because I want to know what the heck is going on, and suddenly my HTTP response latency went up by a whopping 75%," and he is really concerned. Has anybody talked to this person before? Yeah? Okay. And then maybe he goes on: "Therefore, because I'm very smart here, I conclude that I can never put this into production. It's going to be too costly. It's going to burn through cores, and I could never have this in my environment." And then, I don't know what else he does. He goes back to hand-tuning assembly code or something. He probably says some other words. And that's fine; it comes from a good place. But I think it's not super well grounded, and in this talk I hope to give you a different way to think about overhead, because it's not as simple as most people think.

To generate telemetry, the computer has to do some work, and that work consumes CPU cycles. It has a cost associated with it. Instrumentation uses resources, and that's what we're calling overhead. You have an app; it does some work. You add instrumentation, and it does the work it did before, plus a little bit more work. As much as we would like instrumentation to be free, to be the smallest, most microscopic layer on top of your application, it's not. You might think that offloading some work to a collector solves the problem, or that sneaking in under the application with eBPF eliminates it, and it's still not true. If you have a collector on the same machine, or you have to put data on the wire to get to a collector, that's just a different problem. And sneaking in with eBPF — well, I'm sorry, that code still executes operations on your central processing unit.

So I mentioned compute. After the introduction of instrumentation, users will almost always notice an increase in CPU utilization. That is just the way it is; hopefully it's not that noticeable. And memory as well: every piece of telemetry, every span or metric measurement, has to take up something resident in memory, and you're going to incur some overhead cost there. And if you have a threaded platform, every thread your instrumentation uses also has a natural, built-in amount of overhead. Now, when your CPU usage goes up and your memory usage goes up, guess what happens? Your latency tends to increase overall. And if you're in an environment with dynamic allocation, that use of memory can also consume CPU in the form of garbage collections, further increasing latency.
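To make that "a little bit more work" concrete, here's a minimal sketch of what tracing instrumentation does on every request, using the OpenTelemetry Java API. The tracer name, route, and attribute are made up for illustration; the point is only that each request now also allocates a span, records attributes, and hands the finished span to an exporter, and all of that costs CPU cycles and memory.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class InstrumentedHandler {
    // Tracer name is illustrative; any instrumentation-scope name works.
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("demo-instrumentation");

    void handleRequest(String route) {
        // Extra work item 1: allocate and start a span (heap allocation,
        // a timestamp read, some SDK bookkeeping).
        Span span = tracer.spanBuilder(route).startSpan();
        try (Scope ignored = span.makeCurrent()) {
            doTheActualWork(); // the work the app did before instrumentation
            // Extra work item 2: attribute storage (more allocation).
            span.setAttribute("http.route", route);
        } finally {
            // Extra work item 3: end the span and queue it for export,
            // which later costs serialization, bandwidth, and GC.
            span.end();
        }
    }

    private void doTheActualWork() { /* application logic */ }
}
```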
Now, memory and CPU aren't the only forms of overhead. There's also startup time. It's not always a factor; some instrumentation requires no change in startup time. But I work in Java a bunch, and we have tons of instrumentation that gets applied at runtime, so every single class that's loaded goes through an instrumentation step, and that slows down the boot process for the app. Like I said, not every app is impacted by this; some systems have compile-time or build-time instrumentation. You'll also see an increase in bandwidth: if you have telemetry in hand and you want to get it somewhere, you've got to put it on the wire, so you're going to see an increase in data usage in almost every case. Usually it's within tolerance; it's okay. And you're also potentially going to see an increase in disk space, sometimes from the instrumentation itself, like the collector binary or a jar file, or whatever your instrumentation looks like. In the case of Android, which I'll talk about later, we have a mechanism that buffers telemetry on disk for a little while, so there's potentially an increase in disk space there too.

If there's one takeaway from this talk, it's this; I want you to really think about it. The person says, "My HTTP response latency went up by 75%." Right off the bat, that's probably a lie, and I think many of you know why: your service does not have just one latency. You have some sort of distribution curve; some users see one thing, others see another, and that changes constantly for most applications. Okay, fine, so really what we're talking about is something like median latency, and that we can talk about. So let's talk about it, shall we? And let's do so with a common language called math.

Take an application with no instrumentation. Its measured latency — the latency you can measure from outside the application — we'll call M_N, and it's equal to the latency of the application itself, L_A. This ignores any additional cost from proxies or load balancers and the like; we're calling that fixed and throwing it out for our purposes. So the measured latency is the same as the app's: M_N = L_A. Now you add instrumentation, which introduces some additional latency; call that L_I. The measured latency with instrumentation, M_W, is the sum of the application latency and the latency caused by instrumentation: M_W = L_A + L_I. So we have two equations so far. To compute the percentage change in latency after adding instrumentation — remember, I said percentage changes in latency are lies — you take the measured latency without and with, subtract them, divide by the original, and multiply by 100. Welcome to very basic math. Now substitute the other two equations into the percentage equation: replace M_N with L_A, then replace M_W with L_A + L_I. Don't fall asleep on me, stay with me. Simplify, and you get a very simple, straightforward inverse relationship: the percentage change in your latency is 100 times the instrumentation latency divided by the original application latency. Okay, who cares, right?
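Written out, with M_N the measured latency without instrumentation, M_W the measured latency with it, L_A the application's own latency, and L_I the instrumentation's, the derivation is:

```latex
\begin{align}
M_N &= L_A && \text{measured latency, no instrumentation} \\
M_W &= L_A + L_I && \text{measured latency with instrumentation} \\
\Delta\% &= 100 \cdot \frac{M_W - M_N}{M_N}
         = 100 \cdot \frac{(L_A + L_I) - L_A}{L_A}
         = 100 \cdot \frac{L_I}{L_A}
\end{align}
```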
Why is that interesting? What does it matter? It means that for a fixed instrumentation cost — say I really optimize my instrumentation and get it all the way down to one femtosecond, one microsecond, it doesn't matter what the unit is — the percentage change goes up as you drive L_A down. Meaning: as your application gets faster, the percentage overhead gets bigger. Let that sink in if it hasn't already. I know it's simple math, but if you're only talking about median latencies and you have a very fast microservice that performs very well, which all of us have these days, then any instrumentation, if you want to spin it as a percentage change in latency, will look significant on an application with very good performance. Okay, I think we got there.

Let's take a concrete example: two apps. App one is very fast and fine-tuned, with a three-millisecond mean latency. App two is a little slower, but I think a lot of us run apps that perform at 100 milliseconds. Then we add instrumentation costing an arbitrary two milliseconds. That takes app one's total latency to five milliseconds, and app two's all the way up to 102. Look at the percentage change: who's complaining more, the person operating app one or the person operating app two? They see very different things. Visualized as a percentage on the vertical axis, the instrumentation cost in red is a huge chunk of app one's bar and almost nothing on app two's. The person running app two doesn't care about it.

All right, how are we doing on time? Overhead is also workload dependent — sorry, I got distracted by the clock up here, and there's still a lot to get through. What this means is that your overhead will change as your workload changes: your request count, your throughput, your concurrency, your time of day. All of those things impact your overhead. And user input itself: every service that takes input from a user or an external source — and almost all of them do — is subject to that input changing the performance characteristics of your instrumentation, and therefore your overhead. Other things cause it too; we touched on these quickly. Data volumes, for example: a service that generates a thousand traces is probably going to have higher overhead than a service that only generates a couple.

If you think you have a problem with overhead, at the end of the day you have to measure it yourself. You just have to. You can't just sample a couple of things, reduce it down to a mean latency, and say, oh, this is my overhead. Actually calculating the overhead for your application is tricky, so here are a couple of quick notes about that. When you test, make sure you control your environment. Use data that's similar to your production workloads, in an environment that's similar to production, and compare apples to apples: before adding instrumentation versus after adding instrumentation. Controlling the environment means isolating your hardware as much as you can, even if that sometimes doesn't look like production — I know that's a bit of a contradiction. Then purge your database and do a warm-up cycle.
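As a sketch of that before-and-after comparison, here's roughly what an outside-in measurement loop might look like in plain Java. The endpoint URL, request counts, and warm-up size are all made-up numbers; the idea is to run it once against the uninstrumented build and once against the instrumented one, on the same isolated hardware, and compare the distributions, not just a mean.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LatencyProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/checkout")) // hypothetical endpoint
            .build();

        // Warm-up cycle: let the JIT, connection pools, and caches settle
        // so the before/after comparison is fair.
        for (int i = 0; i < 5_000; i++) {
            client.send(request, HttpResponse.BodyHandlers.discarding());
        }

        // Measurement pass: record per-request latency, not just an average.
        List<Long> nanos = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            long start = System.nanoTime();
            client.send(request, HttpResponse.BodyHandlers.discarding());
            nanos.add(System.nanoTime() - start);
        }

        Collections.sort(nanos);
        System.out.printf("p50=%.2fms p95=%.2fms p99=%.2fms%n",
            nanos.get(nanos.size() / 2) / 1e6,
            nanos.get((int) (nanos.size() * 0.95)) / 1e6,
            nanos.get((int) (nanos.size() * 0.99)) / 1e6);
    }
}
```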
If you skip the warm-up in an environment that has a JIT or anything like it, the results will be impacted and it won't be a fair comparison. And this is really important: when you're trying to measure the overhead of instrumentation, this is not the time for a heavy stress test. Running at max capacity is going to give you wildly varying results, and it will be very hard to get repeatable outcomes in most languages.

All right, once again, let's say you do have a problem with overhead. You've measured, you've tested. What can you do about it? What's the very first thing you can do? Well, maybe do nothing, right? Yeah, you could save a few pennies, but maybe it's okay. Maybe the time spent testing, optimizing, and tuning isn't worth it. This is the shortest, easiest path; if you can do this, it's probably the least amount of engineering. Step two: remove any signals that aren't helpful. If you're primarily looking at tracing to get a better understanding of what your environment looks like and how services talk to one another, maybe you don't need these logs — that looks like a lot of log volume, and maybe you can turn it off. Step three: upgrade your instrumentation. If you haven't done so in the last, oh, I don't know, month, six months, year — and I know a lot of people have not; they deploy it once and never upgrade — I can tell you firsthand that OpenTelemetry engineers are working really hard, around the clock, to make instrumentation more performant, and you're doing yourself a disservice if you don't upgrade. Step four: sampling is a thing that exists. I'm not going to go into too much detail here, but it's a way to throw out some of your traces before you have to put them on the wire, and there are different forms of it. And step five: in a runtime like Java, where we have a hundred different instrumentations that could be applied, you may not be interested in the telemetry coming out of some of them, and you can just turn them off. That will reduce the computational and memory overhead. There's a small sketch of those last two steps below.

Finally, a word of warning about manual instrumentation, because I know not everyone uses auto-instrumentation. What sometimes ends up happening is that an engineer gets paged once, goes back into the code on Monday, and starts sprinkling manual instrumentation everywhere, because they don't want to get paged again — and if they do, they want a very detailed record of everything that happened. So they target that anomalous case with a liberal sprinkling of manual instrumentation. And that's fine; it just has to be used carefully, and you have to remember to pull it back out. If you're an operator looking at some overhead, you also need to know what your developers might have added themselves, what kind of instrumentation is in there. So be wary; know that that's a factor that can play into it.
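Going back to steps four and five for a second, here's roughly what that can look like with the OpenTelemetry Java SDK and agent. The 10% sampling ratio is an arbitrary example, and the instrumentation module name in the comments is just a placeholder:

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public class SamplingSetup {
    public static SdkTracerProvider tracerProvider() {
        // Head sampling: keep roughly 1 in 10 traces so the other 90%
        // never incur recording and export (serialization + bandwidth) costs.
        return SdkTracerProvider.builder()
            .setSampler(Sampler.traceIdRatioBased(0.10))
            .build();
    }
}

// With the Java agent, the equivalent knobs are system properties:
//   -Dotel.traces.sampler=traceidratio
//   -Dotel.traces.sampler.arg=0.10
// and individual instrumentations can be switched off entirely, e.g.:
//   -Dotel.instrumentation.<name>.enabled=false
```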
With that, I'm going to end. Thank you for listening. I think we might have a couple of minutes for Q&A, so let's do that if people are into it.

Great talk, thanks. Quick question: how do I know the average of my overhead before instrumenting my code? Can you repeat the question? So you showed the math where I have the baseline and then I add the instrumentation. Yeah, yeah. How do I know the baseline? Yeah, so the baseline — in the case of, say, an HTTP service endpoint that you're testing and want to know the impact on — is something you can measure from the outside using a performance tool like k6 or LoadRunner or any of those tools that help you test service endpoint latency. You test with no instrumentation first. Using a testing tool, generate a load and look at how long it takes for that service to handle it. Awesome, thank you. Yeah, yeah. Other questions? Don't be shy, we're all friends here. Come on up.

Okay, so I have one. I'm coming from the serverless community, and sometimes the overhead is much bigger than what you showed. You showed two milliseconds, right? But in the case of cold starts in Lambda, you need to start the collector inside of it, so it's adding much more time. I know there are a couple of new processors, like the decoupled one and that kind of thing, but what are your arguments for a discussion with these serverless folks? Yeah, so the question is: what can I do if I'm running serverless, especially if I have to spin up a collector? Sometimes that startup cost can be very expensive. I think, you know, you touch on it with caching. I'm definitely not an expert in serverless or function-as-a-service style operational modes, but I would expect to move toward longer-living collector processes if that's where you're really having problems. In general it's like 1% of invocations that are cold starts, but people are overthinking it, I guess that's my opinion, right? Because I think we're getting a lot of value by instrumenting our applications; you can improve much more than that, like, hundredth-percentile case, right? But is there anything planned by OpenTelemetry here? For example, there's another way you could do the instrumentation: just use the endpoint of a gateway collector and send the telemetry there. But then you have the problem that if that gateway collector is down for some reason, you might get impacted even harder, right? And in the serverless case you're paying for every second of invocation, so it can pile up quite nicely. Yeah, there's a lot to unpack in what you just said, so I appreciate that. If you have 1% cold starts, I think you're doing pretty okay; I know that for that 1%, though, you have to eat the cost. And on the question of whether OpenTelemetry is doing something like this: I'm not aware of it. I think they probably are, and this might be a good thing to ask of the serverless community.

Good presentation, thanks. Thanks. Is there a way to measure the overhead of each instrumentation library from the Java agent? We don't have that yet. We really want this. Also, nice to meet you, by the way. Yeah, this has been brought up a couple of times, and I would love to hear from smart people who might have ideas on how to do it, because we have literally around a hundred instrumentations in the Java agent. Not all of them get applied, but for the ones that do, it's hard to decompose the overhead of a given call and attribute it to each instrumentation that was in play. What's the relative weight of each one's overhead? Yeah, we do not have that. Good question.
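An aside on that last question: the agent doesn't support per-instrumentation overhead attribution, but one crude approximation available today is an A/B measurement with individual instrumentation modules toggled off between runs, using the agent's real enable/disable flag pattern. This is only a sketch of that idea; the agent jar path, the "netty" module, and the service jar are all placeholders.

```java
import java.util.List;

public class PerModuleOverheadProbe {
    // Hypothetical A/B idea: run the same load test twice, once with one
    // instrumentation module disabled, and diff the latency distributions.
    public static Process launch(boolean disableNetty) throws Exception {
        List<String> cmd = new java.util.ArrayList<>(List.of(
            "java",
            "-javaagent:opentelemetry-javaagent.jar")); // placeholder path
        if (disableNetty) {
            // Real agent flag pattern; "netty" is just an example module.
            cmd.add("-Dotel.instrumentation.netty.enabled=false");
        }
        cmd.addAll(List.of("-jar", "my-service.jar")); // placeholder jar
        return new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

The difference between the two runs is a rough upper bound on that one module's contribution, subject to all the workload-dependence caveats from earlier in the talk.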
You mentioned Android: what's OpenTelemetry's current stance on the trade-off between human readability and efficiency of the transfer protocol, for things like mobile where that's at a premium? A little bit of a change of subject, but yeah, I think the idea is to keep it small, especially on mobile, and by using things like protobuf, to avoid incurring expensive marshaling. Does that answer your question? Okay. All right, if there are no more questions, we'll end it there. Thank you. Thank you, Jason.