Hello, my name is Ted Young. Today we're going to talk about how to take your existing logs and make them more awesome using OpenTelemetry.

As a prerequisite for this talk, I assume you write software, and because you write software, I assume you write logs. If you're the kind of developer who asks, "Why would I write logs? Who needs to know what my software is doing?" then I just don't even know where to begin. So let's assume you've already written some logs somewhere in your software, and you'd like to make better use of them.

So what is OpenTelemetry, and how can it improve these existing logs? In short, OpenTelemetry is a project that takes the whole "three pillars" concept of observability, smashes it into rubble, and turns it into a single braid of coherent, cross-correlated data. "But why would that actually be helpful to me if I already have logs?" Fair enough, you ask. The answer is that finding all the logs in a transaction is a horrible, horrible experience, because you lack context, and adding the context you need turns your logs into traces. Allow me to explain, starting with the problem: distributed transactions.

Let's say you've got a client that wants to upload a photo to a server. For one, you've already got two services, so you're already distributed. But we know it's never that simple, right? The server is really a proxy that's talking to an auth service, a scratch disk, and an app; the app is talking to cloud storage and maybe a data service; and we're talking to a couple of databases. You get the idea. Even the simplest system is actually multiple systems working in conjunction, and any transaction hits multiple services along the way. Scale that up to a real-life system and it's just crazy-making.

The problem is that logs need context in order for you to find them. All of the logs from all of the transactions happening at the same time on a single server are mixed together into a bag, and every server has its own bag of logs. That makes finding the logs for one particular transaction, from the client all the way to the database in the back end, really difficult. It's like looking for a needle in a stack of needles. You end up doing a lot of searching and filtering: you start with one log, maybe it has a timestamp or some kind of ID, and you use that to find a couple more logs, and from there a few more, and so on. Just stop and think for a second about how much time you spend collecting the data before you can even analyze it. It's actually a lot of time, and it's really painful. What's worse, it can't easily be automated, because it's an ad hoc process and the data isn't particularly well structured.

So how do you solve this problem? How do you identify which logs are part of which transaction? Evil magic? No. A trace ID. A trace ID is a single ID that you staple to every single log in a transaction, so that later you can look up all of those logs by that one ID. It's really simple. Instead of a kludgy process of searching and filtering to find a rough approximation of the logs that might be part of a transaction, you do a single lookup on that one trace ID. This is fundamentally simpler, and it lets you leverage the indexing properties of whatever logging database you're using.
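To make that concrete, here is a toy sketch in Python. The field names and ID values are made up, but the point is that every log in a transaction carries the same trace ID, so collecting them is a single indexed lookup rather than a search-and-filter expedition.

```python
# Hypothetical log records from different services in the same system.
# The first two belong to the same transaction, so they share a trace_id.
logs = [
    {"service": "proxy",   "message": "photo upload received", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    {"service": "auth",    "message": "token validated",       "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    {"service": "storage", "message": "object written",        "trace_id": "0af7651916cd43dd8448eb211c80319c"},
]

def logs_for_transaction(trace_id):
    # One lookup on the trace ID replaces the ad hoc searching and filtering.
    return [log for log in logs if log["trace_id"] == trace_id]

print(logs_for_transaction("4bf92f3577b34da6a3ce929d0e0e4736"))
```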
So if you think about it, why would you ever want logs that didn't have these trace IDs? How can we improve the effectiveness of these existing logs? Well, there are two options.

The first option is to install OpenTelemetry and use it to add trace IDs to your existing logs. This means you're still using your current logging system: you haven't changed where you look at the data or how you analyze it. What you're doing is attaching trace IDs to these logs on their way to your existing backend. The way to do this is to create a log appender. Almost any logging system today has some kind of plug-in or module system that lets you add an interceptor or an appender, something that takes the log as it's being made and attaches more data to it. And if OpenTelemetry is set up correctly, at any given moment there's already a span available in the background. To access that span, OpenTelemetry in every language has some form of "get current span" function that returns it. Once you have the span, you can pull a span context off of it; the span context is what contains the IDs you want. You can then take that span context and, using your log appender, however that thing works, attach a trace ID and a span ID to your log. The trace ID lets you look up every log in the transaction; the span ID identifies all of the logs that belong to this particular operation. Congratulations! You can now find your logs, and you've fundamentally improved your existing logging tool.

But what else can you do? You can also take your existing logs and add them to your traces. This would mean you've installed OpenTelemetry in order to gain access to a tracing tool, and one way to enrich the data coming into that tracing tool is to take your existing logs and move them over to your tracing system. How would you do that? In this case, you take the log and turn it into an event. Every span has an "add event" method. Every event has a name, which in this case would be the log message, and events can optionally take a timestamp for when they occurred. So if you're doing this out of band, you can take the timestamp off of the log; if you're doing this in band, you probably don't need it. Also, if your current logging tool is advanced enough to have structured data instead of just a single message string, add those logging attributes to your span event. Again, a couple of lines of code, and now all of your existing logs are being attached to your trace as events.
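Here's a minimal sketch of that first option, the log appender, in Python. It assumes the OpenTelemetry SDK is already set up, and it uses the standard library's logging filter hook as the "appender"; whatever logging library you use will have an equivalent plug-in point, and most languages also ship ready-made logging instrumentation that does roughly this for you.

```python
import logging

from opentelemetry import trace


class TraceContextFilter(logging.Filter):
    """Staple the current trace ID and span ID onto every log record."""

    def filter(self, record):
        # If OpenTelemetry is set up correctly, a span is already available
        # in the background; with no active span, the IDs come out as zeros.
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")  # hex, as shown in tracing tools
        record.span_id = format(ctx.span_id, "016x")
        return True


handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
logging.getLogger().addHandler(handler)

logging.getLogger().warning("photo upload received")
```

Every log line now flows to your existing backend carrying the trace ID for the transaction and the span ID for the operation it happened in.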
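And here's a sketch of the second option, turning a log into a span event, again in Python and with hypothetical attribute names: the log message becomes the event name, structured log fields become event attributes, and the log's own timestamp can be reused if you're replaying logs out of band.

```python
import time

from opentelemetry import trace


def record_log_as_event(message, attributes=None, timestamp_ns=None):
    """Attach an existing log to the currently active span as an event."""
    span = trace.get_current_span()
    span.add_event(
        message,                      # the event name is the log message
        attributes=attributes or {},  # structured log fields, if your logger has them
        timestamp=timestamp_ns,       # reuse the log's timestamp when working out of band
    )


# For example, inside a request handler that already has an active span:
record_log_as_event(
    "photo upload received",
    attributes={"account.id": "1234", "project.id": "abcd"},
    timestamp_ns=time.time_ns(),
)
```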
That's nice, but what do you actually get by switching from logging to tracing? The answer is context. Your traces provide context for what happened, but OpenTelemetry also provides resources, which are context for where things are happening. Allow me to explain.

Imagine you're looking at an HTTP server event. What kind of attributes would these HTTP logs have? Let's say you have a GET request for a particular project associated with a particular account. You would have a bunch of attributes that you'd be interested in indexing this log with. A couple of those attributes are unique to the event: the event itself, a description, and the time that this particular event occurred. But then there's a bunch of more generalized attributes that are not specific to this event, yet are very important for finding or contextualizing it.

Some of those attributes are static. These are what we call resources: things like the service name, the version of the service, the library that produced this event, the version of that library, what region or data center this was occurring in, what Kubernetes pod or container ID it was running in. It goes on and on. This is what you might think of as static context; it's consistent across every event that occurs in this service.

At the same time, you have what you might think of as dynamic context. These are attributes whose values change from request to request, but are nevertheless interesting for every event associated with that request: the request start time, the duration, the HTTP status, whether the operation was an error or not, and then application-specific attributes like, say, the account ID and the project ID. These attributes are not necessarily specific to any one event in the operation; every event that occurs within this operation would want to be indexed by these values.

And last but not least, this is tracing, so we want to be able to take all of these events and put them into a graph. In order to do that, you need several more attributes: the trace ID, which identifies the transaction; the span ID, which identifies the operation; and the parent span ID, which gives us causality. That's fundamentally how we turn this into a graph. Finally, there's the operation name, so that we can compare across different runs of the same operation.
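Here's a rough sketch of how those layers of context come together using the OpenTelemetry Python SDK. The service name, attribute keys, and values here are just placeholder examples; the trace ID, span ID, and parent span ID are filled in for you by the SDK.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Static context (resources): the same for every event this service ever emits.
resource = Resource.create({
    "service.name": "photo-service",
    "service.version": "1.4.2",
    "cloud.region": "us-east-1",
    "k8s.pod.name": "photo-service-7d9f",
})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)

# Dynamic context: changes per request, but indexes every event in the request.
with tracer.start_as_current_span("GET /projects/{id}") as span:
    span.set_attribute("http.status_code", 200)
    span.set_attribute("account.id", "1234")
    span.set_attribute("project.id", "abcd")
    # The individual event, carried along with all of the context above.
    span.add_event("project fetched")
```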
Why is all of this information useful? The answer lies in how we're actually using these logs. We're not just looking at individual transactions; we're trying to find correlations across many, many runs of the same transaction, so that we can find the root cause of whatever problems we're experiencing. For example, we might notice that a particular operation has high latency. Why is it slow? We might see that this operation is slow only when it's correlated with a particular Kafka node. What about errors? An operation might see a spike in errors, and you might notice that all of those new errors are associated with a particular set of project IDs instead of being evenly distributed across all of the requests: only requests with certain project IDs were resulting in an error. In all of these cases, that correlation tells the operator or developer where to look next, or gives them some insight into what the source of the problem might actually be.

This kind of analysis isn't what you would call metrics; we're not talking about looking at dashboards of data. And it's not logs; we're not talking about looking up all of the events in an individual transaction. This kind of analysis is trace analysis. It's what you can do when you take all of these events, turn them into a structured graph, and then do aggregate analysis on that graph. The reason the data needs to be structured is that the cause and the effect are often non-local. These correlations might occur between different attributes within a particular event, they might occur across different events in a trace, or they might have something to do with particular resources being used by those traces, like a particular Kafka node being the source of the problem. So hopefully, in that light, it's easy to see that tracing is not some niche tool just for looking at latency, or a "third pillar," or anything like that.

Tracing is about structured data that enables automated analysis. Now, I have to pause for a second and remind everyone that machines are never going to actually find the root cause for you. Don't believe the AIOps hype. The problem is that root-causing a system is subjective: you have to interpret the data and make judgment calls in order to understand what's truly wrong with your system, and machine learning, at least in the form we're talking about, is never going to be able to do that. Understanding the root cause of a problem in your system is effectively a form of the halting problem, so don't expect that from your machines. But what you can expect machines to do is find correlations in your structured data automatically. Automatically finding those correlations and presenting them to you saves you a ton of time when you're trying to root-cause a system. It saves you so much time, just endless amounts of time, and that's what makes tracing such a critical tool in your toolbox.

Thank you very much. Hopefully this talk was helpful, and you can go out there and get that extra juice out of your existing logs. If you liked this talk, I have a couple of others you might find relevant. If you want to learn more about OpenTelemetry and how it actually works under the hood, I recommend my talk on context propagation. If you want to learn more about how the OpenTelemetry project is structured from an open source standpoint, check out my talk on the value of design in OpenTelemetry. Do you have any questions, complaints, observations? Hit me up on Twitter; my handle is @tedsuo. Have fun. See you on the internet.