Hi, everyone. Welcome to our talk about correlating signals in OpenTelemetry. My name is Morgan McLean. I'm a director of product management at Splunk; specifically, I work on Splunk Observability Cloud. I'm also one of the co-founders of OpenTelemetry. Hi, I'm Jaana Dogan. I'm a principal engineer at AWS, working on containers and observability. We're going to assume this is a more advanced talk for people who are fairly familiar with OpenTelemetry. For those who aren't, here's a quick one-minute rundown. OpenTelemetry is a set of tools, protocols, and other components that you can use on your services and your infrastructure to capture machine-generated or custom data from them, and then send that information to a backend for processing. It's a big multi-vendor industry project, so there are a number of different backends you can send that information to. Specifically, OpenTelemetry includes the following. A Collector, which can be deployed as an agent on a host or as a network service, and which will capture data from the host operating system if one is present. It can also capture data from other sources, both OpenTelemetry sources like the SDKs and agents I'll describe in a second, and things like Prometheus endpoints, Zipkin generators, and various other locations. Next, OpenTelemetry includes SDKs and automatic language instrumentation agents. These are designed to capture signals like distributed traces and metrics from your application, whereas the Collector, in addition to collecting these, also captures the host metrics and host logs I described earlier. OpenTelemetry also includes a protocol that defines these different data types and allows them to be transmitted between OpenTelemetry components, or even to backends that support the native OpenTelemetry protocol. And finally, OpenTelemetry has a specification and semantic conventions for various use cases. OpenTelemetry is fairly opinionated, which means that if you want to, say, capture a trace or metric of HTTP traffic, OpenTelemetry has standard conventions for how to do that. This means that when you process that data, no matter how it was captured, as long as it came through OpenTelemetry, it will be consistent and correlatable with other pieces of information, which is a big part of our topic today. Specifically, OpenTelemetry captures distributed traces; that was the first data type we started the project with, and it's now considered GA across almost every single component of the project, something we're really proud of. OpenTelemetry also captures metrics. In most cases this is already in beta; hopefully much of it, or some of it, will be GA by the time you're watching this. We're recording this video a bit in advance of KubeCon, but there's a possibility it's still in beta at that time. And finally, we have logs. Logs are effectively in alpha. I realize I mistakenly wrote in the description of this session that they're in beta; they're not, they're in alpha. Ideally they'll be in beta by early-to-mid next year, but we're still working hard on those, just because logs were only added to the project later on. Finally, OpenTelemetry also captures resource metadata. So in summary, it captures everything you need from your applications and provides the components to do so.
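[Editor's note: to make the resource-metadata point concrete, here is a minimal, hedged sketch of wiring up the OpenTelemetry Go SDK so that every span carries resource metadata such as the service name, which is the anchor used later to tie traces, metrics, and logs together. The service name "checkout-service", the semconv version path, and the default OTLP endpoint (a local Collector) are illustrative assumptions, not something shown in this talk.]

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// initTracer wires the OpenTelemetry Go SDK to export spans over OTLP
// (by default to a Collector on localhost:4317) and attaches resource
// metadata — here just the service name — that backends use to tie
// traces, metrics, and logs back to the same service.
func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, err
	}

	res, err := resource.Merge(
		resource.Default(),
		resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName("checkout-service"), // illustrative name
		),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := initTracer(ctx)
	if err != nil {
		panic(err)
	}
	defer tp.Shutdown(ctx)
}
```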
So I mentioned this is an advanced session, and I mentioned that OpenTelemetry, and perhaps even we ourselves, are fairly opinionated about these things. One of the big benefits of OpenTelemetry is the correlation between the types of data it pulls in. Certainly, capturing metrics or logs from an application or infrastructure is not new; that's been done for many years, very successfully in many cases. But whenever you capture these types of information as silos, you are setting yourself up for failure. That's a scheme that may have worked in the past, where you capture and process logs over here and metrics over there, and maybe, if you're doing traces, capture and process them somewhere else. But there is so much value to be unlocked as service owners and operators by actually correlating this information. OpenTelemetry allows you to do this in a way that realistically wasn't possible before, and we're going to give you some examples to demonstrate why this is powerful. This isn't just Morgan and Jaana telling you about theory and saying wouldn't it be interesting if you could correlate this information; this is something really large organizations do very successfully in production today with OpenTelemetry. And we're really excited to talk about how you can improve your own organization and your own responses to outages and other challenges by performing these correlations. The succinct reason why: if you have a highly distributed system, something with, say, 100 different services or even more, a failure that happens ten levels down the stack is going to manifest itself in very, very strange ways at the top of the stack. Ways that, if you are only analyzing with logs, or only with metrics, or even with both but in isolation, or even with just traces, you really won't be able to understand what is going on, or it will take you a long time to do so. You can also use these correlations to gain more general production insights about how your services work, which will speed up your development velocity. So, as I mentioned, we're going to dive straight into examples. The first one is a mock e-commerce service. For this demo, I'm actually using the services included in Google's Hipster Shop demo. We've analyzed and visualized these in a tool, and I'm going to walk you through where the correlations provide value and where the pitfalls would be if we didn't perform any of these correlations. In this example, customers are experiencing extremely high latency with certain actions in this e-commerce service. Specifically, as you can see here, the checkout service is incredibly slow: the 90th percentile latency is about 8.4 seconds, which is ridiculously bad, particularly for a checkout action in e-commerce. At 8.4 seconds, you're going to have customers who order things and close the window, not realizing the transaction went through. You're going to have customers who have completely lost their faith in your e-commerce system. This is terrible, and it's the worst possible place in the stack for it to happen. Already, you can see that we're using some of these correlations to produce this visualization. What we've done is stack up all of the different distributed traces, as well as the service information, and used that to build this service topology map.
And so this map, based on all these aggregated traces, allows us to step a couple of levels down into the stack of services. From here, we can already start to form a few hypotheses about why the checkout service might be slow. We can see that we've got about half a second of latency from the checkout service to the cart service and to the payment service. Both of those alone, by the way, are probably unacceptable; that's really, really high latency. But that doesn't add up to the 8.4 seconds that so many of our customers are experiencing, so there's still something going on here that we can't quite explain. So we're going to take one step in and filter this information down to just the service-to-service latency metrics for the front-end checkout operation. Already, we're onto something you can't do without these data correlations. These numbers come from actual metrics captured between these services, but we've filtered them down to an operation that occurred between the front-end service and the checkout service. So when I look at the latency from, say, the cart service to the Redis database underneath it, this isn't all of those interactions; we're not just looking at the average latency there. We're only looking at the latency that occurred as a child of a front-end-to-checkout request. So we're performing a correlation between a metric and data from a distributed trace that was actually keyed in several levels up the stack. This is quite a complicated correlation, but because we're using OpenTelemetry and it propagates that information all the way down to the metrics, we can perform it on our backend very, very easily. We've done this, and we've also added in some external client latency. The external client latency isn't super relevant here, but from the front-end service to the checkout service, the payment service, and the cart service, we can see that the 90th percentile latency is roughly consistent with what we saw before. Obviously, front-end to checkout has gone down a bit, but we're still seeing about half a second to the payment service and the cart service, and that still doesn't really account for what's going on inside the checkout service. Additionally, we can start to form a few more hypotheses, or at least continue them. We do see there are a lot of errors in the payment service. That's interesting. What's probably more interesting is that those errors, which is what the red indicates, are not bubbling up to our clients via the front-end service. It's interesting that they're not obviously correlated with the latency, but we're going to need to dig in even further to find out what's going on here. One of the other correlations we can perform is correlating a service's application metrics with the metrics, logs, and other information coming back from the underlying host. Using OpenTelemetry, we've now pulled up CPU usage, memory usage, disk usage, and network usage from the exact hosts that are running our payment service, the one that is so slow. From here, we can't draw any conclusions about the cause of our error, but we can at least rule a few things out. We can see our CPU usage is fairly steady across these hosts, between 0 and 50%; that's probably nothing to be too concerned about. Memory usage appears to be around 40%; again, nothing to be particularly concerned about.
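[Editor's note: the "propagates that information all the way down" part deserves a concrete illustration. One way to do it with OpenTelemetry, sketched below purely under assumptions (the baggage key "entry.operation", the histogram name, and the service names are made up for illustration), is to put the entry-point operation into W3C baggage at the front end and record it as a metric attribute in downstream services; some backends instead derive this kind of breakdown directly from the trace data, so treat this as one possible mechanism, not the one used in the demo.]

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Propagate W3C trace context *and* baggage on outgoing requests so
	// downstream services can see which entry-point operation they serve.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
}

// Front end: tag the request context with the entry-point operation.
func tagEntryOperation(ctx context.Context, op string) context.Context {
	member, _ := baggage.NewMember("entry.operation", op) // hypothetical key
	bag, _ := baggage.New(member)
	return baggage.ContextWithBaggage(ctx, bag)
}

// Downstream service (e.g. the cart service): record latency with the
// propagated entry operation as a metric attribute, so the backend can
// slice this service's latency by the top-level operation that caused it.
var cartLatency, _ = otel.Meter("cartservice").Float64Histogram("cart.request.duration")

func recordLatency(ctx context.Context, seconds float64) {
	op := baggage.FromContext(ctx).Member("entry.operation").Value()
	cartLatency.Record(ctx, seconds, metric.WithAttributes(
		attribute.String("entry.operation", op),
	))
}

func main() {
	ctx := tagEntryOperation(context.Background(), "frontend.checkout")
	recordLatency(ctx, 0.42) // toy value; real code would time an RPC
}
```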
And so through this correlation, we can immediately rule out any kind of host performance issue impacting our application. And if the issue had been here, we would have used this same correlation to solve the problem. Either way, we needed correlations between different data types in OpenTelemetry to even get this far in our investigation and to rule this out. I've filtered the graph down a little more to the interactions between these services that we're really interested in. Again, we have a hypothesis that the checkout service is being slowed down by the payment service and the cart service, though that certainly doesn't account for all of our latency. And indeed, we can already form a hypothesis about the cause of slowness in the cart service: at 672 milliseconds to the Postgres database, that's something we should be targeting. It doesn't account for the massive latency our customers are experiencing, but even once we fix the main issue, 672 milliseconds of 99th percentile latency to a particular database is unacceptable and needs to be fixed. In a real investigation, we would certainly file a ticket for that. But now we're going to proceed with our investigation even further. To get further, we really need to understand what's happening in these individual requests. So what we're going to do is dive into some of the distributed traces that were used to generate this topology map. Technically, this isn't a correlation, because the topology map itself is generated from the trace data, but we're going to use further correlations from these traces to drive our analysis onward. Here's an example of one of these traces, and we can quickly start to see what was going on. When a customer calls from their browser to the checkout endpoint, there are a number of very short client interactions that take place; nothing to be too worried about there. But we can see these errors on the requests between the checkout service and the payment service. These are the errors we saw in the topology graph earlier, the ones we noted were not bubbling up to our clients. And now we can see why: it appears that when the checkout service calls the payment service, the payment service throws an error after a little bit of processing, and the checkout service basically repeats the operation until it succeeds. That's why the errors are not making it to the client; the calls are continually repeated until they succeed. It's also the major source of our latency, right? Because these calls, while individually small, are being repeated, here six times and possibly more in other examples, they cause the entire interaction to be slow. So from these different correlations, we've taken ourselves to the application and host metrics to rule out one possible cause of this high latency, and from the distributed trace, we can see that it's effectively these errors and retries that are causing it. And indeed, from this trace, because we've done the work to correlate it with logs, we can go and inspect the individual logs for this trace.
We could also do this in aggregate for a particular service if we wanted, but because we already had a trace we were interested in, I've now jumped, using direct correlations, because we've stamped the trace and span IDs into these logs, to all of the logs that were generated as a result of this single slow customer interaction. We can see some logs that have been flagged as errors; indeed, that's how we were flagging errors earlier on. And we can see that payment processing through this payment API failed due to an invalid API token. So from here, we can form our conclusion about what happened. The service is slow because the checkout service is effectively repeating these operations, because one or more instances of the payment service have an invalid API token. Clearly, some of the instances have the correct API token; otherwise these requests would never succeed at all. But we can now identify our main cause of latency, with the secondary cause being that very slow Postgres database we identified earlier. I've already talked a bit about how we did this while presenting it, but let me recap. To perform this scenario, we had to correlate application metrics with our services, and correlate our host metrics with our services. In OpenTelemetry that typically just requires keying a service name into the OpenTelemetry Collector for every instance of a service running on a particular host, and we can then immediately correlate this information, regardless of whatever backend you're using. Next, we needed to propagate request data all the way down the stack so we could correlate it with metrics further down. OpenTelemetry provides the tools you need to do this. These are techniques that have been honed and perfected at many large internet companies over the years, if only so they could reduce things like disk hotspots when similar queries were hitting the same disk spindles in their data centers, and perform other performance optimizations. Now, with OpenTelemetry, you can do this yourself. And finally, we also correlated logs with traces and services. This is a little more straightforward: it just involves stamping trace and span IDs into your logs, and also stamping a service name into your logs so you can filter on it later. With OpenTelemetry this becomes fairly simple and straightforward, which is something it hasn't been before, and it really unlocks this power for you, whether with backends that support it or with open source backends where you can add your own analytics or programming on top. OpenTelemetry really opens a lot of doors for you because of this. And like I said, this is all done directly with the OpenTelemetry SDKs and Collector. I described that before, but for those of you following along who want to pause the presentation, this slide describes in a little more detail where to make these changes. So Jaana has another example for you. Jaana, do you want to pick this up? Yeah, sure. Like in Morgan's example, we have a lot of difficulties when it comes to navigating data and understanding what we can actually use the telemetry data for. One example is this: we had a service that was causing latency, but when we looked at the traces or logs, we didn't see anything particularly new.
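[Editor's note: as a concrete illustration of that last point, here is a minimal, hedged sketch in Go of stamping the active trace and span IDs, plus a service name, onto log records, assuming the OpenTelemetry trace API and Go's standard log/slog logger; the field names and the "paymentservice" value are illustrative, and many logging libraries have OpenTelemetry integrations that do this automatically.]

```go
package main

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel/trace"
)

var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

// logWithTrace stamps the active trace and span IDs, plus a service name,
// onto a structured log record so the backend can join this log line to
// the exact trace (and service) it belongs to.
func logWithTrace(ctx context.Context, msg string, args ...any) {
	sc := trace.SpanContextFromContext(ctx)
	if sc.IsValid() {
		args = append(args,
			"trace_id", sc.TraceID().String(),
			"span_id", sc.SpanID().String(),
		)
	}
	args = append(args, "service.name", "paymentservice") // illustrative
	logger.InfoContext(ctx, msg, args...)
}

func main() {
	// With no active span this just logs the service name; inside an
	// instrumented request handler the trace and span IDs come along too.
	logWithTrace(context.Background(), "failed to charge card: invalid API token")
}
```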
We had some elevated CPU usage from the nodes, and breaking down the CPU usage in our monitoring dashboards revealed that a particular service, in that case a simulations engine, was impacted. We hadn't seen any other services in that cluster being impacted, so we decided to focus on that service. As I mentioned, searching for traces and logs involving that service revealed nothing new. We saw maybe a slight increase in the RPCs, but nothing particularly wrong with the networking or retries; there was nothing really new there. So this made us start to think about what else might be out there in terms of the telemetry data we're collecting. When we talk about correlations, we sometimes talk too much about metrics, traces, and logs, those three types of signals, but we actually collect a lot more. And when you can collect all these additional telemetry types with correlations, you can actually start utilizing them. We realized that in our crash dump storage there were crash dumps related to the nodes where this problem originated, so we started to take a look at them. We started to realize that the simulations engine was actually crashing after a particular RPC. So for each incoming request, this service was crashing, which caused the scheduler to reschedule it and fork a new process, and that was causing a cold start. It was adding a little bit of latency, but because this RPC was already an expensive call, we couldn't really see from the latency dashboards that there was additional latency, and it didn't alert because it was still within a reasonable range for us. But then taking a look at the code and debugging it locally made us realize that the service was actually crashing because of some sort of misuse of a device each time that request was handled, which caused all this rescheduling, re-forking of the process, additional latency, and some additional cost. It was not quite possible to see this from anything else, but we were quickly able to navigate the data, rule out some of the hypotheses we had, and focus on the right type of telemetry to figure out what had actually been going on. So we were able to fix the issue, push a new release, and mitigate the impact very quickly. The other example I want to give is very similar. Again, it's not quite a case where you couldn't gather anything by just looking at the data itself, but correlations made it much, much easier for us to rule things out. In this case, a service began to experience some latency and elevated CPU usage. We make releases once in a while, and sometimes there's a large number of changes going into the same release. When a release is small, it's easier for us to roll back, and we use rollbacks all the time. We have automated pipelines; if something goes wrong, rolling back is easy. We trigger a rollback and then try to figure out what's going on.
But since this was in a canary and we were trying to figure out what was going on, we decided to take a look at the data coming in. The first thing we looked at was: is any other service in the same cluster impacted in the same way? If there's something related to the cluster, or to the nodes in that cluster, that might be revealing. But we saw nothing else being impacted other than this service. We then very naturally started to look at the traces and logs to see if we could capture anything, and we saw no retries or any issues with the outbound requests. So we were able to rule out that it was related to networking, which most of the time is actually the cause of the problems we have. We were able to rule those things out very quickly, in a couple of minutes. Then we decided to take a look at the diffs, at what actually went into the new release, because that can sometimes give us some ideas. There weren't many differences in the source code, but there was one particular change we were curious about: a new library dependency we had onboarded a couple of days earlier, which included a new version of a compression library. So we were curious whether that was the cause, and we decided to enable profiling on the canary and retrieve some profiles. Because we can collect profiles per RPC by labeling them, it was easy to make the correlation between the latency and the specific handler: we were able to break down the profiles originating at a particular RPC and verify that CPU usage was indeed elevated for that particular RPC. And reading the code again, the RPC handler revealed that nothing had changed in that particular handler, but the new compression library was causing slightly more CPU usage, which was impacting our service. The new library was causing some 50% more CPU usage, CPU-intensive enough that not rolling back would have become a capacity problem for us in the future. So we decided to roll back, bring back the old version, and try to mitigate the performance regression in the new library. This is very interesting, because when we think about telemetry and correlations between telemetry, we sometimes restrict our analysis, but it's much broader than what we think about right now. I gave an example from profiles, but another example is runtime execution tracers. In Go 1.11, we worked on this interesting thing where you can actually get runtime traces annotated with your own custom labels, so you can do correlations between your distributed traces and your runtime events. In this particular case, we have a simple hello-world gRPC server, and you can see how much time is spent on execution, how much time is spent on network, system calls, blocking calls, the scheduler, and so on.
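[Editor's note: as a hedged illustration of the execution-tracer correlation described here, this is a small Go sketch using the runtime/trace tasks, regions, and log annotations available since Go 1.11; the task and region names, the logged trace_id value, and the trace.out output file are all illustrative, and the resulting file would be inspected with `go tool trace`.]

```go
package main

import (
	"context"
	"log"
	"os"
	"runtime/trace"
)

// handleCheckout annotates one request with a runtime/trace task, a region,
// and a log entry carrying the distributed trace ID, so scheduler, GC,
// syscall, and network-wait time in the runtime trace can later be lined up
// with the matching distributed trace.
func handleCheckout(ctx context.Context, distributedTraceID string) {
	ctx, task := trace.NewTask(ctx, "checkout") // task name is illustrative
	defer task.End()

	trace.Log(ctx, "trace_id", distributedTraceID)

	trace.WithRegion(ctx, "chargeCard", func() {
		// ... call the payment service here ...
	})
}

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Capture a runtime execution trace for the lifetime of the program;
	// inspect it afterwards with `go tool trace trace.out`.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	handleCheckout(context.Background(), "4bf92f3577b34da6a3ce929d0e0e4736") // example trace ID
}
```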
It requires some instrumentation to get this level of insight, but more and more tools, language runtimes, and others are interested in building these correlations because they are very useful. So, to recap, Morgan and I briefly covered a couple of different correlations. We talked about correlations among metrics, logs, and traces, and with other telemetry types such as crash dumps, profiles, and execution tracers. Another way to build correlations is through resource metadata: you can easily ask, give me all the data from this particular virtual machine, or give me all the data for this particular Kubernetes namespace and pod, so you can focus and navigate. The other area is custom annotations. They can be local, or, as Morgan mentioned, they can be propagated. If you have downstream services like storage systems, it's sometimes super useful to produce some of these annotations in the upstream services and propagate them all the way down, so the telemetry is collected with those annotations applied, and in your dashboards or query systems you can break the data down by upstream service or usage. That gives you an easier way to understand the impact of your upstream service on these storage systems or any other downstream service. So correlations are just so helpful for making navigation of the data easier, but they are also so useful for ruling out non-issues. We were able to quickly say, hey, something's wrong here, but it's not about networking, it's not about this particular cluster. We were able to see the same data from different perspectives and rule out non-issues very quickly. We have a bunch of interesting things coming as future endeavors, and I want to hand it back to Morgan so he can tell us what's coming up next. Yeah, thanks, Jaana. So there's a number of new things coming in this space. Probably one of the next things to come to OpenTelemetry eventually will be profiles. Jaana was talking about the Go profiling tools, and there are certainly a number of distributed profiling tools that have become quite popular in the last few years. So I would anticipate that eventually profiling will become another data type within OpenTelemetry, alongside traces, metrics, and logs. I don't think we're going to jump on this immediately; we need to finish our work on metrics and on logs before we risk spreading the project too thin by bringing in other data types. But it does seem like a natural next step to me, as someone who's inside the community. And it would be great to bring this in, because with profiles you can correlate performance information on the same host, or even on different hosts, all the way down to code, via the traces, metrics, and logs that are correlated with those profiles. It would be very, very powerful, and would honestly lead to, I think, the next generation of development and analysis tools, which would be really neat.
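[Editor's note: to make the profile-to-request correlation that both speakers touch on a little more concrete, here is a hedged sketch of labeling CPU profile samples by RPC with Go's runtime/pprof, in the spirit of the "profiles per RPC by labeling them" idea from the canary example; the label key "rpc", the wrapper function, and the method name are illustrative assumptions, and how the labels surface depends on the profiling pipeline in use.]

```go
package main

import (
	"context"
	"runtime/pprof"
)

// withRPCLabel runs handler with a profiler label attached, so any CPU
// samples collected while it executes carry the RPC name. Profiles can then
// be broken down per RPC and lined up with the latency seen in traces.
func withRPCLabel(ctx context.Context, rpcName string, handler func(context.Context)) {
	labels := pprof.Labels("rpc", rpcName) // label key is illustrative
	pprof.Do(ctx, labels, handler)
}

func main() {
	withRPCLabel(context.Background(), "/hipstershop.CheckoutService/PlaceOrder",
		func(ctx context.Context) {
			// ... real RPC handling work would happen here ...
		})
}
```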
Also, other things that could perhaps come to OpenTelemetry would be native support for core dumps, crashes, and exception call stacks. Call stacks would be somewhat duplicative of profiles, perhaps, but I think there's a lot we could do there. Something that's happening right now is that OpenTelemetry is gaining support for eBPF instrumentation, with a focus, initially at least, on capturing network information. That should open up a number of possibilities for correlating core kernel network data and network request/response data, things like DNS queries, even possibly DHCP, with specific requests that are part of an application. One imagines you could have a visualization of, say, a distributed trace where a span is taking a large amount of time, with extra annotations or visualizations on that span that show how much of it was the application being slow versus the network being slow, or even the specific network handshakes or network events that may have impacted it. And finally, as Jaana alluded to, correlations with language runtime traces would also be really, really powerful. And so that wraps up our presentation and description of different scenarios today. This is a topic we're super excited about, and I know a number of people in the community are super excited to discuss it. Again, my biggest takeaway for everyone from this is that various tools in the industry already take advantage of the correlations that OpenTelemetry provides. Indeed, much of this work was pioneered at various large internet companies over the last decade, where they performed it internally; OpenTelemetry makes it available to everyone in a very easy way. And as all of our own services grow more and more distributed, and as we break out functionality across more and more of these interconnected services, if you don't rely on these correlations, you are honestly setting yourself up for failure. We deal with more and more layers of abstraction every year. Certainly, projects like OpenTelemetry provide the performance data we need to make sense of those layers of abstraction, but that data is only so valuable until we actually use these correlations for analysis, as we showed you in our examples today. So thank you again very much. We hope you got a lot out of this session, and have a great day.