Thanks everyone for your patience. Also, this is exciting. I don't know, show of hands: anyone else here for the first time at KubeCon? Yeah, wow, a lot. So a lot of excitement. Hopefully we can have some fun during this session. Thanks everyone for joining. I know some of you may have just been hanging around after Charity's talk, but I appreciate you staying and hanging out for me. Hopefully you'll get something out of this: you'll learn a little bit more about how to make observability a regular part of your development workflow, and you'll take away some tips on how to get into that.

A little bit about me. I'm Jamie Danielson. I work at Honeycomb as a telemetry engineer, so I work on the instrumentation libraries, and I'm an approver on OpenTelemetry JavaScript. I apologize in advance for any of the groans I know I'll probably hear later: a lot of the examples are in JavaScript. We'll go through a lot of general concepts and practical applications, and keep in mind that even though a lot of it is JavaScript, it's applicable to other languages like Go or Python.

The biggest thing we want to get out of the way is that if you know how to log, you know how to trace. Most people are using logging of some form today, whether that's logging to your local console, printing to your terminal, or logs that are being aggregated and sent somewhere. And there's this thought of: how do I make that leap? How do I get from logs to traces? I've heard that traces are good, but it seems like a really big deal, and I don't know how to get everyone behind me on making this big change. What I'm here to tell you is that it's not as big of a leap as you might think.

Something like this might look a little bit familiar, where you just have all of these logs, and these logs are generally a request to an endpoint, handling that request, looking up an item, things like that. There's a lot of useful information in here, and I'm not going to tell you it's not useful. In fact, this is a lot of really important information. The hard part is that there's a lot of free text. It can be hard to see how these different logs correlate with each other, how they're affecting other logs, other things that are happening. And when it comes to searching, when you have hundreds, thousands, millions of these logs, a free-text search can take a while and can be hard to do.

So we want to start by looking at these logs as just a simplified version of traces, or at traces as really just a dressed-up, fancy form of logs. It's not that big of a mental leap, and this is going to be the best way to help you get from point A to point B. If we think about the anatomy of a log, at its simplest we have a timestamp, we have a level (info, warn, debug), and we have a message of some kind. That's the free text we talked about before. As we keep going, we want to add more and more useful information to it. If it's an HTTP request, we want a target, a URL, maybe a response code, things like that. And if we look at these structured logs and get to the next step, one of the biggest changes is just that we have a trace ID associated with them.
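As a rough illustration of that progression (these log lines, field names, and values are invented for this writeup, not taken from the actual slides):

```
# At its simplest: timestamp, level, free-text message
2024-03-19T10:32:01Z INFO looked up item 4567, found it in the cache

# Structured: the same information as fields, plus a trace ID
{"timestamp": "2024-03-19T10:32:01Z", "level": "info",
 "message": "item lookup", "http.target": "/items/4567",
 "http.status_code": 200,
 "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}
```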
It's still a lot of the same information, but now we're able to see how these pieces correlate with each other, how these different things that are all relevant to each other are connected with one single ID.

So here's one thing we want to look at. I'm working on this app locally, and I'm not doing a huge new feature. Before, we had someone hitting an endpoint and looking up an item, and I figured we want to add some caching logic here. So instead of always having to hit that endpoint, we add in some code that says, hey, let's put that into a cache. In this case it's just a simple Redis cache, but this way, whenever someone hits this endpoint, we're not always doing that same lookup. Maybe it can be a little bit faster. Again, this is a JavaScript example, but we're just adding in some simple caching logic.

Think about when you're actually going through this in your workflow. A common thing you might do is add in console logs or print statements as you go: hey, I found this thing in the cache. Whoops, this wasn't in the cache, I didn't get what I expected. Or yes, perfect, that is the cache entry I was hoping to get. The way we want to think about that is: we're already logging these things, and a lot of it might be free text, so how can we get a little bit better? Instead of that unstructured free text, we want to standardize it. Not only did we find a product in cache, we found this very specific product, this very specific book, The Sofa, whatever it might be. And we add in an attribute of app.incache instead of just "found this thing in the cache."

The way that looks in our code is that right next to where we have those console logs, we can add in these attributes (there's a sketch of this just below), and we'll get into it in more detail. As you can see, I'm doing this in generally the same place. But here's a big difference: when you're console logging locally, as soon as you kill that process, it's all gone. All of your logs are gone; they don't make it anywhere else after you're finished working. By adding those attributes onto a span, into a trace, they're saved somewhere you can go back and look at them, and you have a feel for how to look for them in the future.

So we looked at these logs earlier, this big batch of logs that, in this case, I happen to know all go together; when I'm looking at them as I work, I know they go together. But what would really be useful is if we could see them on a timeline and see how these individual pieces affect each other. When someone makes a request to an endpoint, it kicks off this background task, or it goes and looks up an item from a database or from an endpoint. We can see how these different events affect each other, break the work down into smaller units, and get a better mental model of what's actually happening.

Now that we've seen how similar those logs are to traces, that traces are just a little more dressed up, and that we can put these events into a timeline, we're ready to graduate to actual tracing. We're moving on now from logs to tracing.
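Here's that sketch: a minimal JavaScript illustration of setting an attribute right where the console log used to be. The attribute name app.incache is the one from the talk; the in-memory cache and lookup helper are hypothetical stand-ins for the Redis client and product lookup:

```js
const { trace } = require('@opentelemetry/api');

// Hypothetical stand-ins for the talk's Redis client and product lookup.
const cache = new Map();
const lookupProduct = async (id) => ({ id, title: 'The Sofa' });

async function getProduct(id) {
  const cached = cache.get(id);
  // Instead of console.log('found this thing in the cache'), attach the fact
  // to the currently active (auto-instrumented) span so it outlives the process.
  trace.getActiveSpan()?.setAttribute('app.incache', cached !== undefined);
  if (cached) return cached;
  const product = await lookupProduct(id);
  cache.set(id, product);
  return product;
}
```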
So you're thinking, okay, again, you're telling me this isn't a huge mental shift, this isn't a lot of work, but I'm not convinced. One thing we can do is start small and get value quickly. This is going to help us visualize things and get a little more comfortable with how our systems are working without really investing too much work from the start.

The way we can get started is with OpenTelemetry. Some of the code we saw before, adding those attributes to a span, that was OpenTelemetry. OpenTelemetry, if you don't know, is a standardized way of instrumenting, collecting, and sending off your telemetry data. It's standardized, it's vendor neutral, it's open source, and it lets you send your data to a variety of different backends.

One of the best ways to start is with automatic instrumentation. The beauty of automatic instrumentation is that you can instrument your code without actually touching your code. That's very exciting: less work. I love less work, and the idea that you can just run a few things and see some results, which we're going to see in a moment. And since we're at KubeCon, I definitely want to point out the OpenTelemetry Operator, a Kubernetes operator that, among other things, can inject that automatic instrumentation for you. The automatic instrumentation is just made up of the instrumentation libraries available in OpenTelemetry. There's not a whole lot to talk about on this slide; you can see there's some YAML. I'm not going to read out the entire YAML file, and there's a lot more than just this small block, but this is the general gist of it (I'll include a rough sketch of it in a moment). The docs around the operator are really great; they should be in the references slide at the very end of the talk. A lot of it really just involves updating some of the many YAML files you may already be using with Kubernetes, and you're then able to inject that automatic instrumentation.

So what do we get from that? It sounds simple, okay. For the code we were just looking at, what do we get when we enable this operator and these automatic instrumentations? Turns out we get a lot. This, again, comes from not touching my code in any way, simply enabling the automatic instrumentation. And it's really important to point out that this is just the local dev environment where I'm working on this caching logic I've added. Right now I'm using a dev environment in Honeycomb, but this could just as easily be sent to Jaeger, which is also open source, or you can use something called otel-desktop-viewer. There are a few different ways to do this locally without sending to a vendor or worrying about whether you're spending money on a development environment while you're still trying to get convinced. You can do this all locally to get a feel for what we're getting out of the box. So instead of just logging all these things to my terminal, maybe a console log of "hi, I'm here," "added this to the cache," "looking for that, trying to find it in there," we can now see this on a backend that supports OpenTelemetry.
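Here's that rough operator sketch: a minimal, hedged illustration, assuming the OpenTelemetry Operator is already installed in the cluster. The resource name and collector endpoint are placeholders, and the operator docs cover the full set of options:

```yaml
# An Instrumentation resource tells the operator how to inject auto-instrumentation.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
```

A workload then opts in with a pod annotation; for Node.js that's instrumentation.opentelemetry.io/inject-nodejs: "true".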
If you're not using Kubernetes, or you need a little more control over setting up your automatic instrumentation, maybe because you're more comfortable with something you write a bit more of yourself, there's another route. This is essentially what the operator is doing when it adds in automatic instrumentation for Node. It's usually a separate setup file, often just a copy-paste across different services or languages, and it gives us a little more granularity, as we'll see.

One of the things I wanted to look at with this automatic instrumentation in Honeycomb was: what other kinds of events did I get that maybe weren't related to the trace I was looking at? It turned out there were actually a lot of events being produced. We looked before at a single trace that had a lot of really good information, but there was all this other stuff: all these instrumentation-fs events, over 2,000 of them. Turns out I didn't actually need any of it; it wasn't that useful to me. So I think it's important to treat automatic instrumentation as a starting point: what can I get out of the box, how can I quickly get started, and then, from there, what changes can I make to be more effective and more specific about the kind of information I'm getting?

That fs instrumentation is file system instrumentation, and it's a little bit noisy, so maybe I don't want all of it. A simple way of disabling it is in that setup file we talked about before: we can add a few lines that say enabled: false. I don't actually want that; it's noisy. And when I go back to my trace, I'm still seeing most of the important things I want, but I'm getting way fewer of those other events that I'm ultimately paying for and that aren't giving me much value.

Now I'm looking at my trace and I'm still thinking: there's some extra middleware in there. I don't know if I care about all that either. I'm using Express here, and that's still some extra stuff I don't really need. So again, I can go into my little setup file, and in this case I'm just disabling those extra layers. I don't really need this middleware, so I add a few extra lines (the sketch below shows both of these tweaks). And now when I look at my new trace, it's a lot more manageable. It's not as unwieldy, and I can really see what's important to me and what makes sense in the trace I'm working with.

Now I want to think about how we tie these things together. With the automatic instrumentation we just looked at, I haven't touched my code, or I've barely touched it, but I've done enough to get some out-of-the-box instrumentation and see what's going on. I can see, at a higher level, how these things interact; I can see how I get from A to B to C to D and so forth. The part we looked at a little earlier, and want to tie in now, is manual instrumentation, or custom instrumentation. What that allows you to do is, instead of just knowing A to B to C, capture more information about B specifically, the things that are interesting and relevant to your app. You maybe want to add a little bit more context.
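Here's that setup-file sketch with both tweaks. The package names are the real OpenTelemetry Node ones; the service name is made up, and exporter configuration is left out (it's commonly driven by environment variables):

```js
// tracing.js: loaded before the app, e.g. node -r ./tracing.js app.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { ExpressLayerType } = require('@opentelemetry/instrumentation-express');

const sdk = new NodeSDK({
  serviceName: 'book-app', // hypothetical service name
  instrumentations: [
    getNodeAutoInstrumentations({
      // Turn off the noisy file-system spans we weren't getting value from.
      '@opentelemetry/instrumentation-fs': { enabled: false },
      // Drop the extra Express middleware layers from traces.
      '@opentelemetry/instrumentation-express': {
        ignoreLayersType: [ExpressLayerType.MIDDLEWARE],
      },
    }),
  ],
});

sdk.start();
```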
You want to add a little more detail about what's important in the code you're writing. So again, looking at the code block we had before, I'm not adding all that much beyond those specific attributes I was adding alongside my logs. The beautiful thing is that all of this code is being automatically instrumented, and I'm able to add attributes to that automatic instrumentation without doing much more than adding context to what's already being collected for me. And this is now going to be saved somewhere I can look at the data later and see that things are working the way I expect.

In this particular case, we wanted to see that something was added to the cache. One of the important reasons you use a cache is that you want things to be faster; you want a meaningful performance improvement. By adding these specific attributes, yes, I can see that things were put into Redis or taken out of Redis or whatever else. But when I start looking at the time spent on different services or calls to different endpoints, I can start slicing and dicing based on those attributes I've added to my code. In this particular case, I can see that some of the really slow requests were the ones where we weren't able to get the item from the cache, and maybe we need to spend a little more time speeding up that lookup when it's not in cache. But what I also know right now is that my cache is working. And I'm doing this all in local development. I haven't deployed anything; I haven't even pushed to GitHub yet. I'm keeping this local and seeing: yes, this is doing what I think it's going to do, and I can feel good about what's going to happen here.

Now I want to point out something important, which is that it's not all sunshine and roses. Observability is really wonderful, and what it's about is being able to deal with unknown unknowns. Because the fact of the matter is, things are going to go wrong, and things are going to happen in prod that you didn't expect when you were working in your development environment. You work on things locally and feel really good about all your different test cases; you've added in all this context and all these details, and it's doing what you think. But let's look at an actual incident we had at Honeycomb where observability really helped us keep working on that code and make it better.

What we see is a graph, and you'll notice I titled it "test in prod," because another important thing I want to point out is that testing in prod is not chaos engineering. It's not just throwing things out there, seeing what happens, and crossing your fingers. It's about having things instrumented, put out in prod, and then seeing how those things actually work in production. In this case, it was something my team was a little involved in: a new feature we were rolling out that we were really excited about, behind some feature flags. What's a little funny is that the new feature was related to logs (ha, don't do logs), but we were now able to ingest logs from OpenTelemetry, which was fairly new in OpenTelemetry at the time, and we were able to do this using OTLP/JSON, where previously we were always using protobuf.
Now, both of these things were behind feature flags in our code. I'm not going to go that far into feature flags, but I will say they're worth looking into; if anyone else here is talking about them, I recommend going, because they can really help you out. These feature flags were something we used almost like a control rod in this case: we wanted to be able to open up the flood gates at some point, once we saw that things were going well, and see how it was working.

And it turned out, as you can see from the graph, something went wrong one day. The important thing is that we had instrumented all of these services really, really well, and what we were able to see was: here's a feature flag, and this is a new ingest type. We could see these were the new logs coming through, we could see this was the new JSON type coming through, and we could see exactly which customer was sending this specific data. Because we had instrumented these services so well, we could narrow it down to the kind of data coming in, narrow it down to the customer, and, with those feature flags I talked about, disable this one code path for this one customer. That prevented us from having any kind of outage or really affecting other customers. We were able to contain the fire and narrow down what the problem was, and that meant we could then talk to that customer and get a better feel for what they were trying to do and what they were sending in, and maybe figure out how to reproduce that locally.

And we did reproduce it locally. What happened was that all of this data was coming in an unexpected format. It was a really big JSON blob that wasn't expected, but it was from a pretty big product, so it wasn't a total edge case. It was something that maybe we could have accounted for, but probably not, and that's the thing to keep in mind: you can't account for everything. So we instrumented it, reproduced it locally, and understood that one of the best things we could do was at least truncate that data if it's too big to fit in what we can handle. And what you'll notice in this code block, if you can read it up there, is that we added in more telemetry so that when we do need to truncate these fields, we can see which fields they are and how big they are (there's a rough sketch of that idea below). That's good for both us and the customer, because now, when they send in that data, they can know if something is wrong with one of their fields and know exactly which field to go work on. And because we could reproduce this locally, we could see what it looked like before we made the change and after we made the change.

You may have heard that at Honeycomb we're very big on dogfooding. We want to know that whatever we're working on shows up in Honeycomb the same way our customers are going to experience it. So this is just a small internal conversation where I said, hey, these PRs are up, this is ready to go. We were able to test it out internally, dogfood it, and feel pretty good. And we still had all this instrumentation for that particular customer and those code paths.
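A minimal sketch of that truncation idea follows. The length limit, field names, and span-event shape here are invented for illustration; the real change lived in Honeycomb's ingest code:

```js
const MAX_FIELD_LENGTH = 16 * 1024; // hypothetical per-field limit

function truncateLargeFields(record, span) {
  for (const [key, value] of Object.entries(record)) {
    if (typeof value === 'string' && value.length > MAX_FIELD_LENGTH) {
      record[key] = value.slice(0, MAX_FIELD_LENGTH);
      // The extra telemetry: record which field was truncated and how big it
      // was, so both we and the customer know exactly what to look at.
      span.addEvent('field.truncated', {
        'app.field.name': key,
        'app.field.original_length': value.length,
      });
    }
  }
  return record;
}
```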
So now we could feel confident re-enabling that feature flag for that customer, keeping an eye on it, and seeing how it went. And it was a success. They were able to figure out where they needed to fix their code and what they were trying to send in, and we still didn't have problems with other customers. One of the quotes I really liked, reading through this incident review (which was pretty long as we were sorting through it), was from one of our engineers. He literally said this, I'm not making it up: there's no way we would have been able to track it down to this specific customer sending this specific form of data without all of our telemetry. And that's important, because again, things will go wrong, but it's about how you can narrow down where those problems are, test them locally, and see that your fixes are going to be good before they get out to production. Sorry, I need some water.

What a lot of this comes down to is that observability is about confidence. It's not just about when it gets out to prod and what your customers are seeing; it's about knowing, while you're working on your code locally, that it's doing what you expect. That if you're implementing a new feature, you can see people using that new feature. If you're adding caching logic, you know that cache is getting hit when you expect it to.

And when it does come to prod: do you ever have that moment where there's a bug report, something has happened, you're looking at git blame, and your name is all over it, but you don't remember it? I don't remember working on that. It was last month; maybe you went on vacation, maybe you had a long weekend, maybe you just aren't remembering it. You need to be able to ask questions of your systems without knowing their inner workings. You need all that context, all those attributes, all that instrumentation to be able to slice and dice and know at least where to narrow things down.

And the best time to add that instrumentation, to add that context, is while you're writing the code. You're uniquely situated to instrument your code at that point, because when you're writing it, you know what you're trying to build and the outcome you're trying to have. So any time you're thinking, hey, this would be useful to know, or console logging "I am here" or "found this in the cache," that's going to be your trigger: I should probably at least add this as an attribute. Or if I'm seeing one of those auto-instrumented spans or traces and there's a larger gap, I want to break it down a little and understand what else is happening in that gap (the sketch below shows one way to do that). You want to be able to answer, even before it gets to production: do I know what this is going to look like when it's in production?

So again, especially with automatic instrumentation and looking at this locally: when you're dealing with a monolith or a really large system of services, you want to know how your code, how that small feature you're working on, fits in with the rest of the system. You want to know whether the changes you're making will have the impact you expect, and whether you're actually going to hit those code paths.
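If you do hit one of those gaps, a sketch like this shows one way to break it down with a custom span. The tracer name and the slow step are invented for illustration:

```js
const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('book-app'); // hypothetical tracer name
const computeRecommendations = async (book) => []; // hypothetical slow step

async function enrichBook(book) {
  // Wrap the mystery work in its own span so the gap in the
  // auto-instrumented trace becomes a named, measurable unit of work.
  return tracer.startActiveSpan('enrich-book', async (span) => {
    try {
      span.setAttribute('app.book.id', book.id);
      return await computeRecommendations(book);
    } finally {
      span.end();
    }
  });
}
```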
You want to be able to know if something's wrong. From that graph earlier, it was pretty obvious that something was wrong, but without that telemetry, would we have known? You want to know if someone's using your new code path or your new feature. It can be helpful to point out how often someone is using the new discount code functionality you added, or, if you have a call to action you want people to click, how often they're hitting it. By being able to check that locally, you can know it's going to work in production. So again: can I see that whatever I do here locally is what it's going to look like in production? And then, when something does go wrong, can I narrow down where, and iterate and improve from there?

As a recap of what we just talked about, the biggest takeaway is that if you know how to log, you know how to trace. We're all working with logs in different ways, whether we're logging to our console or our terminal, or we have logs getting sent to a log destination, a log sink, whatever else. We have these different forms of logs, but we want to put them in a way that makes sense, lets us correlate from A to B to C, and gives us additional information along the way.

We want to start small and get value quickly, especially if our teams are nervous, if people are feeling overwhelmed, if this feels like a big-bang change. I wouldn't recommend the big bang; you'll get pushback, maybe from other developers, maybe from management, maybe from yourself if you're not sure you're really sold on this. By starting small with automatic instrumentation, you can get a lot of value out of the box, get a feel for how things look now, and see where to dig in a little more to add the detail you need.

From there, as I mentioned, as great as things are looking locally, and as good a feel as you have for how it should look in production, you have to know that things will eventually look a little different. It doesn't always work as well as you'd like outside that contained lab environment. But by having as much detail as you can, you can know when things go wrong, where to narrow in, and where you need to improve. And finally, remember that observability is about confidence. It's about confidence when it gets to prod, but it's also confidence as you're working on your code, knowing it's doing what you think it's going to do and having the impact you think it's going to have.

And that's all I really had to talk about. If anyone wants to stop by the Honeycomb booth this week, booth N22, there are a couple of things there for feedback, and also some of the resources if you'd like to look into a little of what else I talked about. I think I have a few minutes to open it up for questions as well. If you're able to, there's a microphone in the middle so that others can hear; otherwise, if anyone wants to call out questions, I can repeat them back so we have them recorded.
So, just within your CI pipelines: one of the reasons I'm looking at this is that I could run a test today, run a test tomorrow, and then diff the resulting spans to see what was ultimately impacted in the app. Is that something that's already out there?

Yeah, so there are a few different options depending on where exactly in your CI pipeline you're thinking about. I'll start with one, which is instrumenting your CI pipeline itself, which I don't think is what you're talking about.

Okay, so this is more on the testing side. For example, we have smoke tests that we run for our distributions at Honeycomb. We have a small example app that runs in a container and sends its telemetry to a collector, and that collector then spits its output into a JSON file, and we have tests that run against that file to see if the instrumentation looks the way we expect. For example, if we're expecting metrics to come out of something, or if I have an application that hits an endpoint and then hits a cache, all of that should get sent to the collector and into the JSON file. We actually upload those artifacts as well, so we can take a look at them later, and we run tests against them to see if that's what we're expecting.

Well, it's confidence. It is about that, yeah.

Yeah, well, and what's so great about that too is that sometimes you get a built-in helper: here are some examples for other people who might also be trying to work with the code. Maybe someone is onboarding, maybe someone is new to using an application. By having those tests, like the smoke test, you're building out an example; you're seeing what it should actually look like. In this case, like I said, we send it to a collector and then to a file, but I could also change that endpoint and send it to Honeycomb, or send it to Jaeger or something else, and see what that looks like.

You see, like, pros and cons.

Right, exactly. And that's big. We use smoke tests specifically in this example; this particular example isn't quite as comprehensive as a full end-to-end test. What a smoke test asks is: does anything catch on fire? Is there anything really obviously wrong? Am I still getting spans? Do they still have the names I expect and reasonable durations? Are metrics coming out? And that helps us build confidence. All our unit tests pass, but we also see telemetry coming out, which is ultimately what we need at the end. So I'm a big fan of that.

Great talk. I have a question. So when you say run it locally, the collector is collecting and sending it to a local backend, or is it sending the traces to the deployed environment?

So your question is: if I'm sending telemetry to a collector, where does the collector send it from there? If I'm running locally, is it sending to a backend running locally, or to a backend deployed by a platform team or something like that? Yeah. So again, with the smoke tests I mentioned before, our collector sends just to a file. But what we can also do is send it to Jaeger; you can have all of this containerized, maybe in Docker or Kubernetes or something else. You can send it to a collector, and the collector can send it to two places as well. Let's say you want to see it in Jaeger, and you also want it printed out to the console in the collector.
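As an aside, a minimal collector config along those lines might look like this sketch. The endpoints are placeholders (Jaeger accepts OTLP on its own port, shown here as a Docker-style service name), and the exact exporter names can vary between collector versions:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp/jaeger: # forward to a local Jaeger instance over OTLP
    endpoint: jaeger:4317 # e.g. a Docker Compose service name
    tls:
      insecure: true
  debug: # print telemetry to the collector's own console
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, debug]
```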
That's another option. I don't recall if there are other collector talks happening this week; I do know there's a Contribfest on Wednesday with some hands-on collector work. But one of the great things about the collector is that you can receive from different places and send to different places, depending on your use case.

How do you do it at Honeycomb? Where do you send it, I mean, do you have a...?

We send to Honeycomb, yeah. When I mentioned dogfooding before, and I apologize, I should have talked more about it, what that means is that we have telemetry about our actual production Honeycomb: when you go to the Honeycomb UI, data about that is being sent to our dogfood environment. So we actually go into Honeycomb to see what's happening in Honeycomb. It's a bit of inception; it takes a little getting used to. But we do send it to another, internal version of Honeycomb so we can see what it looks like in the other environment. Does that make sense?

Yeah, that makes sense. We do the same thing. Thanks.

Okay, perfect, thank you.

Great talk, thanks. So you talked about instrumenting apps from the start, and I think that was really awesome. Of course, auto-instrumentation is not enough for apps that already exist. What sorts of things should dev teams do in the manual instrumentation area if they're not necessarily comfortable with it? They're already really happy with logs, but they should do more than just auto-instrumentation.

Yeah. One of the things I like seeing is if you can find an example of: I know this thing took two seconds. I know that from A to B took two seconds, only because I know this happened and two seconds later that happened. And a big question I remember asking a lot was: what happened in those two seconds, and how do I break that down further? When you can't see what happened in those two seconds, you get a lot of, well, I think this person worked on that part of the code; well, maybe it's the database; maybe it's this third-party API we're using. There can be a lot of questions and wondering where the problem actually is.

So start small. I would say start with whatever your HTTP request and response cycles are; anywhere you can track that is a good place to start. And start with any really common or gnarly services: if 50% of your customer escalations are coming from one service, that's a good place to start, and see if you have a better time answering customer questions or debugging problems on those really hot paths, or anything that's a critical path through your application. I might not care as much if things are a little slower when you go to look at, say, these groups; this was a book app that I did. But I do care if you want to add a book to your shelf, or if you're buying a book. Let's instrument that, because that's where the money comes in, when someone buys that book, and we can let the other stuff go for a little while until it becomes a bigger problem. Does that help?

Yeah, I think so. Okay. Thanks.

Thank you. I think I have time for one more. Hi. Yeah.
Yeah, so the question was: do you ever have a problem with over-instrumentation, where people get a little excited and want to instrument everything, add spans for everything, and maybe they're duplicative? That is a thing. To me it's reminiscent of that auto-instrumentation I looked at, where I realized I wasn't getting a ton of value out of what was there. And sometimes it really does get down to the nitty-gritty of, we need to cut our costs a little bit, and too many spans is one place where we're getting charged.

The thing I like to think about is: if I'm looking at those traces, whether in development or elsewhere, and I see maybe two spans that look similar and I'm not getting much out of that, try dropping one of them, even if you're just commenting out that part of the code for a few quick run-throughs, and see if you lose a lot of value. That way someone doesn't feel like they're losing a really hard-fought battle over getting that span in there; let's just comment it out, see how that feels and how that looks, and if it becomes more important later, maybe we add it back in, or maybe we realize we just need a little more context or more attributes there.

Thank you, everyone. Thank you.