Thanks for coming out at five o'clock on the last day of a conference — I really appreciate it. I'm going to take this opportunity to check in a little bit with who's here and where we're at, because I realized I wrote this for both people who've been using OpenTelemetry for years and people who are starting tomorrow. So if you would, raise your hand and tell me where you're at with monitoring with OpenTelemetry: is it something you're using now, or something you're thinking about using? Speak up, let us know — we're doing a rough sort of the room here, right? Okay. Anybody monitoring a hundred-or-more-microservice stack with OpenTelemetry? Any hands, out of the eight people left? Yes — one. All right, I hope this is good for you as well.

So: this is Controlling Data Overhead with the OpenTelemetry Collector. There are two possible interpretations of that topic line. Is this about controlling data overhead generally, like the data overhead of your services? It's largely the other one: some success stories from controlling the footprint of your observability tooling. Given that observability has some kind of impact — which is largely inevitable — how do you control it? Yesterday I did an intro to OpenTelemetry, and I found I wanted some more material from it, since people were fairly advanced, so a little of that talk got pulled into this one. Apologies if you've seen a couple of these slides before, but we'll be going through them in a new way.

Who am I to be speaking to you about this? I'm now with Signadot, which does development environments and testing on Kubernetes — not super closely related to what we're talking about today, which is nice; I didn't have to worry about this being too product-pitchy. Before that I was with TelemetryHub, which works directly with OpenTelemetry, and on New Relic's OpenTelemetry team; before that, seven years working with closed observability tools at New Relic. So that's my experience: a lot of work with enterprises trying to do pre-OTel distributed tracing and other metrics and measurements. I'm pretty familiar with the things that can fall apart — when I start talking about data cardinality and data explosion, that's stuff I've seen directly.

So, minding our time: both the cause of and the solution to most of our data problems with monitoring starts with the OpenTelemetry Collector. Naively, you can just instrument single lines of code and have them report straight up to a data source like Prometheus, but in general what you'll actually want is the collector sitting in between, gathering data and then batching it out to your data store. There's also a ton of really sophisticated and exciting work happening now on acting on the information your collector has, which is a really cool idea: if you're getting packets from everywhere in your stack, and you did all the work to set that up, could you not drive some of your orchestration or deployment work — say, measuring your canary deploys — by looking at the collector, maybe even cutting out the data store? It's an interesting idea.
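To picture that gather-and-batch shape, a minimal collector config along those lines could look roughly like this — receive, batch, export. The endpoints and batch sizes are placeholder values, not anything from my slides:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    # Buffer telemetry and flush it in batches rather than per event.
    send_batch_size: 512
    timeout: 5s

exporters:
  otlp:
    # Placeholder address -- point this at your actual data store.
    endpoint: backend.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```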
Let's zoom into that picture. (I did this yesterday too — two slides, one zoomed out, one zoomed in. I don't remember why that felt so important when I was writing this talk.) Not too surprisingly, the act of collecting and tabulating metrics, gathering logs, and batching them is one of the largest places where we see overhead — where measurement has an impact on our performance.

As a general concept, here's what's happening inside the collector. We have multiple receivers, and that's a place where there's tons and tons of growth in what can be received. One thing you'll see when you look at the support list for the different OpenTelemetry libraries is, say, "logging is still experimental for this language." That's logging natively in that library — but of course there are tons of logging receivers for the collector.

Then we have our processors. (Apologies — on my slide I managed to label the third column "receivers" again, and of course I grabbed the screenshot with the mistake in it.) I'd call these the top three uses: data scrubbing, normalization, and sampling. We'll talk about some more advanced use cases a little later, but data scrubbing means recognizing personally identifiable information in the stream of data you're sending — it turns up in metric names very frequently — and pulling it out. On normalization and sampling, there was a great line from the Adobe folks earlier today: any kind of normalization and sampling is a touchy issue with your team. You think, hey, we gather these traces and we're just going to send every fiftieth one — don't worry, we take ten thousand of these a day, so no problem; you still get a couple hundred a day, that's great, right? But people can get uncomfortable with that. So be aware that this step — while totally necessary for all the stuff we're going to get into — is an area where, by default, people kind of don't want to do it.

Then you have your exporters at the other end, which is where this stuff is going. That part is generally pretty simple, unless again you're doing really cool collector-side logic — I have an example later of generating some alerting, with a little bit of sophistication, through a logging pipeline — but essentially the outgoing pipelines are straightforward. Another view of how that might be implemented: I think it's important to realize the collector should not just be a data exporter. Its job isn't only to gather information from inside your stack and send it out. We really want to think of it as the point where normalization, especially, should be happening.
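To make those top three processor uses a little more concrete, here's a rough sketch of scrubbing and sampling in collector config — the attribute keys (user.email, http.url) are made-up examples, not from my slides:

```yaml
processors:
  # Scrubbing: delete or hash attributes that may carry PII.
  attributes/scrub:
    actions:
      - key: user.email
        action: delete
      - key: http.url
        action: hash

  # Sampling: keep a fixed percentage of traces
  # (the "send every fiftieth trace" idea, expressed as a percentage).
  probabilistic_sampler:
    sampling_percentage: 2
```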
One tip I have, especially as teams adopt this: you will learn all this stuff about the SDK, and you can get kind of excited. You'll think, I'm going to train the team on this and that and this other thing about the SDK, and do this other kind of cool normalization and calculation — I'll take in these two metrics, say units sold and unit cost, and I'll emit a final metric, total cost, right from the application code. Don't do that. If what you're doing right at the jump is going really far into logic at the instrumentation point, think about doing that in the collector instead. Because even if you and the engineers you work with directly are really excited about this, as you try to get the whole team on board you're going to have people who are not excited — who just want to put in working application code and do not want to worry about emitting the proper metrics. If you put that logic on the collector side, you're in a much better position.

Some of you were here for my presentation yesterday, so I'm going to zip through this next part pretty quick. This is the most basic of instrumentation: you have the library available, you create a tracer and generate trace spans, and you create a meter and a set of metrics. What's notable is that even in a basic config you make some decisions about batching and batch size, you set your check interval, and you set your memory limit. That's pretty critical. I know we can think, hey, if something really bad is happening, we want to gather more information — why would I ever limit memory? You definitely want to set that limit every single time, because in a bad situation — where you're generating, for example, a ton of traces or a ton of spans — you'll still get some information through with the limiter in place, and you don't want your failures to cascade. Then you have your pipeline, and we'll talk a little more about pipelining at a later point. Sorry I'm not reading all this text; I'm glossing over this version of the code, and we'll go into more detail about setting a memory usage alert with the collector in just a little bit.
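As a sketch of those always-set-them knobs — batch size, check interval, memory limit — this is roughly what they look like as collector processors. It assumes the otlp receiver and exporter from the earlier sketch, and the numbers are placeholders, not recommendations:

```yaml
processors:
  # Refuse data before the collector itself runs out of memory, so a
  # flood of spans degrades gracefully instead of cascading the failure.
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100
  batch:
    send_batch_size: 512
    timeout: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter conventionally goes first in the processor chain.
      processors: [memory_limiter, batch]
      exporters: [otlp]
```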
(Don't get old, everybody — reading glasses. There are aspects of getting old that stink.) Here we come to a more reasonable example of an actual config. We have a number of ignores on the errors we're generating, for various reasons — you can imagine the usual ones. Then we also start looking at span statements, where we say, hey, we don't want to include those in our distributed tracing. And then down in our metric statements — sorry, folks, this is very tiny on my little speaker preview — we're able to make some decisions about how we're going to do math on our metrics.

A good example of that is a cumulative metric. I don't know why my mind always goes to basic e-commerce stores, but: if you're repeatedly reporting "this is the price the item sold at," the final metric you actually want may be "how much did we make — what was our total revenue?" So you want to mark those metrics as cumulative. Another example you'll see really often is error counts, which you want marked as cumulative most of the time. The only possible value for an error count on a single report, a single event, is one — so if you find you have a bunch of cycles where you're just reporting one over and over again, it's because you didn't mark it as cumulative, and it keeps getting set to one instead of incrementing by one.

Okay, let's talk about a situation where we want to measure that we're overusing memory — sorry, I always think of this as the memory example: here, we're worried that we're overusing our network. We're looking at the actual host we're installed on, and we want to emit some host metrics purely to say "we have a problem here," because for some reason we're using a ton of network pipe. This is a pretty reasonable first-step ceiling warning to have in place, because there may be things — asynchrony, say — causing a ton of network traffic we don't expect.

We start by setting our collection interval. We configure a logging exporter, because logging is how we're going to do this alerting. Then we define our pipelines: a metrics pipeline that receives from hostmetrics, which is going to tell us about our network usage, and exports out to logging. Then we look with a strict match at the metric we care about — in this case, network usage. Then we need to enable the host-metrics receiver, and we do a metrics transform, and we even create the log message we want to emit. Remember, the collector is not its own SaaS; it's just a service running on your system. So the metrics transform says: when we see this, aggregate it by sum and generate this log message. Then we connect it all up. I'll share the full config file — certainly on Twitter, and I'll try to link it off this talk once it gets uploaded.
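Until then, I can't reproduce the full file here, but a rough sketch of that shape — hostmetrics in, a strict-match filter, a sum aggregation, logging out — might look like this. Treat the metric name and intervals as illustrative:

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      network:

processors:
  # Strict match on just the metric we care about.
  filter/network:
    metrics:
      include:
        match_type: strict
        metric_names:
          - system.network.io
  # Collapse per-interface/direction series into one total, aggregated by sum.
  metricstransform:
    transforms:
      - include: system.network.io
        action: update
        operations:
          - action: aggregate_labels
            label_set: []
            aggregation_type: sum

exporters:
  # The log line is our "alert" -- the collector is a service you run,
  # not its own SaaS, so something downstream watches these logs.
  logging:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [filter/network, metricstransform]
      exporters: [logging]
```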
So they may well have different performance. Okay — this was also in yesterday's talk, but I want to talk about it again, because it's the second thing to worry about when it comes to overuse in our observability. In this case we're monitoring a front-end site, doing Google-Analytics-style, real-user-monitoring-style measurement. I think we can agree that for most use cases in an org this information is pretty useful: this is the path, this is how many people hit it, this is how many of them were new. It's marketing analytics, more or less — we want to see how successful certain pages are. And in general these metric names, which we have listed as pages here, are pretty valuable, because there are reasons these pages might perform differently: they might be getting linked to differently, for one thing, but they're also loading different content.

Now, often without our intent, those same metrics can shift to something that looks like this: the same paths, but broken out by user ID. The one mistake I made here when inventing this fake data is that I didn't update the hits and new values, which should all be one — when the paths are keyed by user ID, the values are just going to be one, and two, and one, and two, over and over again. (I wonder if this is actually a table in my deck... oh, it would be so cool if it was. Oh, it is! Well — to be continued; I'll update that soon.)

What's happening there is that buzzword that gets passed around all the time: a transition from low-cardinality to high-cardinality data. You can see it in that hits, or traffic, value — again, if I'd set the example up exactly right. Before, we had a small number of metric series whose possible values spanned a large range: hits on a page could be anything. Now there's a whole bunch more series, and the possibility space of each one's values has become much smaller: broken out by user ID, there are only going to be two or three hits max on a particular path. A small number of series whose values range widely is low-cardinality data; a huge number of series whose possible values are tiny is high-cardinality data — and that's the problem case.

Now, there are reasons to want a metric like this, and I don't want to discount them too readily. A classic example: hey, we have enterprise users — whoa, actually, we don't have enterprise users, we have an enterprise user — and we really care about that enterprise user. How we're performing for Giant Name Brand really matters to us. So sometimes this is important for one particular value, which is absolutely something you can encode even at the collector level, though maybe you're really just filtering at the dashboard level. But most of the time you don't care about this, and what's surprising is that when you go to configure your way out of this garbage, you can get some pushback — "well, this tells me which users we're not performing for, so it must be great." The fundamental issue is that when we don't use our tables and our tools the way they're intended, we lose a lot of their flexibility and power. For example, normalizing and averaging across time spans isn't going to work well with this: if you have ten million different metric series, the fact that each series can be normalized across a time span isn't going to do much for you for compression.

So we want to shift back toward the low-cardinality shape. There's even a shift between slide one and slide three here, compacting it down a little further: these are, basically, hits to the docs site, and these are, basically, hits to the marketing blog — that kind of thing. But that's the more ideal situation.
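One rough sketch of enforcing that at the collector is simply dropping the per-user attribute, so the series collapse back toward per-page — here using the transform processor on a hypothetical user.id attribute:

```yaml
processors:
  transform/drop_user_id:
    metric_statements:
      - context: datapoint
        statements:
          # Illustrative attribute key -- remove the per-user dimension.
          - delete_key(attributes, "user.id")
```

Dropping the key alone leaves now-duplicate datapoints behind, so depending on your backend you may also want an aggregation step after it (metricstransform's aggregate_labels, say) to sum them back together.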
Okay — on to distributed tracing. The big thing I want to say about distributed tracing, which is, again, sort of the raison d'être for OpenTelemetry: the magic here is the collector-side logic that ties together spans happening on multiple different services that were all kicked off by the same request. Pretty neat stuff. But one of the dirtiest secrets is that very little of that data is actually viewed. There was a great question that came in yesterday at my talk, and also today at the Adobe and Intuit talks: how important is the actual detail on those traces? My thesis is: super not important. Very, very little of that data ever actually gets viewed, and a great deal of what actually matters about traces is tying together multiple services, rather than "on this service, which method being called took so long."

Now we're on to a fun story, which is DoorDash getting really deep in the weeds — which I actually really love. They have a whole write-up on this; I'll link it from the talk. I have the URL on the slide, but please don't try to type it — I'll share it out. DoorDash saw really significant increases in overhead when they were doing a certain level of distributed tracing, and — this is super interesting — they discovered the problem was related to signal batching. (Reading glasses again, I apologize.) The issue was that the way batching was implemented was what was causing this massive CPU overhead. What they ended up doing was implementing multiple different concepts for batching and for queuing — going in and actually modifying the collector code — and this resulted in massive, massive changes in performance. You can see them compared here; they ended up trying four different possibilities.
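You don't have to patch the collector to get the first-order wins, though — the stock batch processor and the standard exporter queue expose the knobs that matter most. A rough sketch, with placeholder numbers:

```yaml
processors:
  batch:
    send_batch_size: 1024
    send_batch_max_size: 2048
    timeout: 2s

exporters:
  otlp:
    endpoint: backend.example.com:4317
    # Queueing and retry knobs on the standard exporter helper.
    sending_queue:
      enabled: true
      num_consumers: 4   # parallel workers draining the queue
      queue_size: 5000   # batches buffered before backpressure or drops
    retry_on_failure:
      enabled: true
```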
This does get you to one of the kind of fundamental things. There were really great talks at this conference from both Intuit and Adobe, and one of the distinctions they drew between each other was that at Intuit, you generally know what the user's path through the service is. So Intuit was able to build this tool that was just magnificent: you log in, you say you have an incident, and it tells you it's affecting 2,100 users in these regions and it's been affecting them for this long — pretty great, right? Then you scroll down and you see a stack trace. Fantastic. But Adobe's not able to do that, for obvious reasons — it can't say that all users take this one path through its tooling.

In both of those cases, and in the DoorDash case, they're able to have people working on this stuff full time. I only want to express that as a subtlety you want to be aware of: not every solution is going to make sense for a 20-, 30-, or 40-person developer team. That's why doing some simple collector-side limiting and batching on what you're reporting is really key. What DoorDash was trying to do was get these really, really deep traces every single time, and they were able to have a team devote time to redoing the batching logic that was present in the collector. Great commit — but it's the result of trying to get that super-high-resolution data.

Okay. Along with all that, there's this concept of baggage in OpenTelemetry, and when I come back next year I want to see more baggage demonstrations from everybody. The ability to pass data around with a request between a whole lot of services — a consistent format that can be read as it goes into each new service — can be really powerful, and it can do a lot more than just observability. We're seeing it get used for security; we're seeing Signadot, who paid for me to be here, use it for testing — testing and experimenting as a developer on a single service, implemented with OpenTelemetry. That's pretty neat, and I'm hoping to see more applications of it in the near future.

All right, that was quick, because it's late, folks — my gosh, it's five o'clock on the last day of the conference. But I'm here if you have questions; if anything comes up, let's talk. If you want to raise your hand, or go grab the mic back there, feel free to ask. That's what I had. Again, you can find me online as serverless underscore mom. Thank you so much for coming out.