Hey, everyone. Thank you for attending. Today we're going to be covering comprehensive observability using deeply linked metrics and traces. My name is Ryan Allen, and I'm currently a software engineer at Chronosphere. At Chronosphere, we're building a hosted metrics and monitoring platform designed specifically for very large scale, high throughput use cases. The core storage tier within our platform is the open source metrics engine M3, which originated from Uber's high throughput metrics use case. Before working on metrics and monitoring at Chronosphere, I worked on platform architecture at a company called Applied Predictive Technologies.

Here's the agenda for diving into deeply linked metrics and traces. First, we'll cover the observability signals in the landscape and what an engineer's experience tends to look like when they get an alert. Next, we'll dive into the journey of designing and implementing this deeply linked metrics and traces feature. And lastly, we'll talk about a future where, once deeply linked metrics and traces are widely available, the on-call experience improves dramatically, because we can build features such as jumping directly from an alert in an email or PagerDuty to a request comparison, where we can immediately see the difference between a successful request and an unsuccessful one and identify the root cause of the issue.

To start, we'll cover the three primary observability signals that people leverage these days. The first is metrics. Metrics are usually the first signal an engineer gets when something goes wrong, and under the hood they are just many different time series that we track over time. We can set thresholds on them to alert off of, or track trends over time, and get a very high level view of the overall platform: less granular, but more expansive.

The first characteristic of metrics is that they're very structured. Whenever you emit a metric, you emit it with a particular name as well as a finite, specific set of tags, and each combination of name and tag values ultimately represents a single trend line that you can graph over time. As a result of being so structured, metrics are very fast and efficient to aggregate, both at query time, when you're querying from an alert or a graph, and at collection time. That means you can run many, many queries within short periods of time without running into performance issues, and on the collection side it's easy to emit many different metrics from a single service and collect them with a local agent. Being structured also makes metrics very efficient to store: when you're storing just a value associated with a time, it's much easier to store that efficiently in a storage tier or on disk than, say, some arbitrary blob of data.
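As a quick illustration (the talk doesn't show code), here's roughly what emitting that kind of structured metric looks like with the Prometheus Go client; the metric and label names are made up:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Each unique combination of the metric name and label values ("endpoint",
// "status_code") becomes its own time series that can be graphed and alerted on.
var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "http_requests_total",
	Help: "Total HTTP requests served.",
}, []string{"endpoint", "status_code"})

func main() {
	// Incrementing a counter is cheap at collection time; the collector just
	// scrapes the current values over HTTP.
	requestsTotal.WithLabelValues("/customer", "200").Inc()

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}
```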
Now, all of those are really great pros for metrics, but the downside to this very structured nature is that a metric doesn't carry any of the arbitrary information that might be vital in diagnosing a problem. It lacks the granular, per-event detail you can need when diagnosing something; an example of that type of information would be the exact HTTP request and response for a single request that happened to occur in your platform. So metrics are really best for high level observability and the initial alerting on what's going on, and more detailed root causing requires some other observability signal.

That's where logs come in. Logs are on the other side of the spectrum: this is the observability signal that is as old as time, and it just involves dumping out any arbitrary information that you can reference later. Logs are generally very unstructured. There are a lot of libraries and tools these days that help engineers be more structured in their logging so that logs are easier and more efficient to search. However, because logs can inherently contain any arbitrary information, they're intrinsically slower and less efficient to aggregate and search, and they're also less efficient to store, since you can't make specific assumptions about what a particular log looks like when you're persisting it. Those are the performance and operational downsides to storing any and all arbitrary information. The huge pro, and why logs are essential, is that they preserve per-event detail, so that as an engineer you should be able to find any information relevant to any particular event that occurred within the platform. As a result, they're the best tool for actually getting into the weeds and diagnosing a particular problem.

While metrics and logs are the two signals that are as old as time, traces are a more recent observability signal that has become very popular with the advent of microservice architectures and platforms becoming more and more complicated. The reason traces are so useful is that they give you a contextual flow and transparency into how a particular execution path behaves within a more complex architecture. You can see here how a trace ties together specific events, like entering and exiting specific services, and gives you an understanding of how much time is being spent in a particular step within a particular service. That makes it much easier to diagnose where, within a complex architecture, a particular issue is happening. Again, they provide transparency into the execution flow, which is really vital in microservice-oriented architectures.

In the nature of their data, traces sit in between metrics and logs. They're structured insofar as all of the steps within a trace need to be tied together by a request ID, and that's a very structured piece of information that flows throughout a given trace.
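As a small illustration of that (not code from the talk), here's roughly what instrumenting those steps looks like with the OpenTelemetry Go SDK; the service and operation names are made up to mirror the example that comes up later:

```go
package frontend

import (
	"context"

	"go.opentelemetry.io/otel"
)

// checkCustomer shows how each step ("span") in a request is tied back to the
// same trace via the context, giving the execution-flow view described above.
func checkCustomer(ctx context.Context, id string) error {
	tracer := otel.Tracer("frontend")

	// Parent span for the overall operation.
	ctx, span := tracer.Start(ctx, "GetCustomer")
	defer span.End()

	// Child spans inherit the trace ID from ctx, so the cache and DB lookups
	// show up as steps inside the same trace.
	_, cacheSpan := tracer.Start(ctx, "redis.Get")
	// ... look up in cache ...
	cacheSpan.End()

	_, dbSpan := tracer.Start(ctx, "mysql.Query")
	// ... fall back to the database ...
	dbSpan.End()

	return nil
}
```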
Traces are also unstructured, in that many of the libraries which emit traces allow you to store arbitrary data on spans as well. As a result of being at this in-between level, they're not nearly as efficient to store and query as metrics, so most organizations doing tracing at large volume have to sample down the total number of traces. You can do this intelligently, trying to say: here are all the distinct combinations of tag values associated with my traces, and within a particular time period let's store at least one trace from each distinct set so that we have a representative sample. That said, there are still limitations to how one can go about this, and sampling is one of the biggest challenges to comprehensively supporting traces within your platform.

Next we'll talk about the typical experience for an engineer when they get an alert. Let's say you're on call and you get an email alert saying request latency is high. Typically, embedded within that alert, you'll have a link to the source of the alert, which is often a graph pointing to the specific metric that triggered it. So we jump to our email and click the link to Grafana. Now we're on the dashboard associated with that alert, and we have to parse it to find which of these graphs has the metric we care about: okay, it's the bottom left, request latency. Let's go into it. Here we can see a divergence in how long these requests are taking, and we can also see the splits in time series tag values at the bottom. The information about these two trend lines is that they're coming from the customer GET endpoint, they're coming from the same Docker-internal host, they're coming from the HotROD job, and where they diverge is which histogram bucket they fell into in terms of total latency. From there we can say, all right, things are slower on one path than another, but from this information it's not clear why. So now we have to jump to some other signal to figure out the root cause.

That's where you would jump to something like a trace or a log, because what we really want to understand is the difference between the requests that are running slowly and the requests that are running fast. Usually what you do is jump to an open source trace visualizer like Jaeger. We go to the search function there and try to narrow our search to find a trace that represents one of the requests that actually fell into the longer latency bucket. We'd filter down to the customer endpoint, filter down to status code 200, and filter down to the request duration being at or above the threshold we were alerted on. As a result, we get some subset of traces that match those parameters.
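For reference, that manual narrowing can also be expressed against Jaeger's trace search HTTP API; the endpoint path and parameter names below are an assumption based on the API the Jaeger UI itself uses, so treat this as illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// The same narrowing we did in the Jaeger UI, expressed as a query against
	// Jaeger's search endpoint (assumed path and parameters).
	params := url.Values{}
	params.Set("service", "frontend")
	params.Set("operation", "HTTP GET /customer")
	params.Set("tags", `{"http.status_code":"200"}`)
	params.Set("minDuration", "500ms") // only traces above the latency threshold
	params.Set("limit", "20")

	resp, err := http.Get("http://localhost:16686/api/traces?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("search returned:", resp.Status)
}
```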
Now let's say we click into one of the traces we found. We see, okay, this is what the flow looks like for a trace that happens to be taking longer than the others, but it's not totally clear from this trace in isolation what the difference is between it and a good one. It looks like there's some time spent trying to talk to Redis, but that fails, so it talks to MySQL, which takes a pretty decent amount of time, and then there are a few other validate-customer steps at the end. It's still not clear what the ultimate root cause is for why this is slower than some other case. So usually what you'd do next is go back to the trace search, change the latency filter to the faster bucket, and find a sample trace that represents the good case rather than the bad one. Then we'd say, okay, here's a trace that is faster than the other. And we see, interesting: the Redis call here actually succeeded, and there is no MySQL step, which was the slow part of the other case. Obviously you can do this diff mentally, but Jaeger has a cool feature where you can compare two specific traces and visualize the step-by-step differences between them, the same way you would do a diff in code. So we grab those two trace IDs, use the compare functionality, and now we can very easily identify that the difference between the request that takes a long time and the request that doesn't is that the slow request has one additional MySQL step. Now we know, with very strong evidence, that the root cause of why this set of requests is taking longer is that something is specifically wrong with MySQL, or simply that MySQL is slower in this case because we're not hitting the Redis cache. So if we want to improve this, the way to do it is by specifically targeting the MySQL step.

That's the flow of what an engineer typically does when they get that first alert about a latency metric diverging from the norm in a bad way. So the question is: could we skip all of those steps in between, the manual searching and manual comparing, and instead give the on-call engineer a direct link so they can click once and jump straight into the details? The way we can achieve an experience like that is with something like deeply linking metrics and traces. The idea, at a high level, is this: instead of having metrics be the thing that causes the alert and nothing more, and traces be this other thing that gives us more contextual information, could we tightly couple them together, the same way you would couple two tables together with a foreign key? Then we could make them natively compatible, so that we can jump straight from a metric data point to the trace associated with that particular data point.
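Conceptually, that foreign-key coupling could look something like the following; this is a hypothetical shape for illustration, not M3's actual storage format:

```go
package deeplink

// LinkedDatapoint is a hypothetical shape of a deeply linked data point: the
// metric value and the trace that produced it travel together, the way a row
// carries a foreign key into another table.
type LinkedDatapoint struct {
	SeriesID  string            // metric name + tag values, e.g. request_latency_bucket{endpoint="/customer",le="1s"}
	Timestamp int64             // unix timestamp of the sample
	Value     float64           // the sample value
	TraceID   string            // the "foreign key" into the trace store
	Labels    map[string]string // the metric's tag values
}
```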
Today, as we saw in the example where we searched Jaeger for a trace matching the parameters associated with a particular metric, the way these are linked is not really a tight coupling but rather a loose link by convention. Which is to say, it is just convention that you set up your metrics so their tags are similar to the tags you include on traces, so that when you ultimately do that manual search step you can properly narrow down your search. It's important that these tags are similar so that you can consistently tie metrics to traces, but there's no enforcement of this, and it's not a native or tight coupling of the information. What would be great is if we could add that foreign-key-style, native coupling, and the way we could do that would be to make sure that every single data point on a metric's time series has a trace or span ID directly associated with it. If we lived in a world where that were so, you'd be able to click a particular data point on a graph and jump to the particular trace immediately from there.

Now, the question is how we can go about making that a reality. One of the first steps is to think about how we can include this information and propagate it in a way that allows us to consume it properly. OpenMetrics is the standard for how metrics are emitted and represented to whatever is collecting them and shipping them off to a tool like Prometheus. One aspect of OpenMetrics is that it has the concept of an exemplar, which is essentially a comment-like annotation that you can append to metric values as they're emitted. You can see here that http_requests_total for the search endpoint with status code 200 has the value 1725, and following the hash symbol we can include arbitrary metadata within the same protocol, which we can then consume on the other side. So one way we can propagate trace information alongside metric information is by leveraging this exemplar, and one of the changes we've made is to include a trace ID, embedded within that exemplar, with every single emission of a metric value in OpenMetrics.

The next step is: okay, the protocol now supports exposing trace IDs with metric values, but how do we actually emit that from code? OpenTelemetry is an instrumentation SDK you use to instrument your applications in code to emit both metrics and traces. By modifying this common library, we can make it abide by that use of exemplars, so that when we expose metric values, we include whatever span ID is associated with the context of that request. With that modification, whenever we emit a metric, we're including trace IDs as exemplars in code.
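As an illustration of both halves of that, the comment below shows roughly what an OpenMetrics sample line with an exemplar looks like, and the function shows one way to attach the current trace ID as an exemplar; this sketch uses the Prometheus Go client's exemplar support rather than the modified OpenTelemetry SDK the talk describes, and the metric and label names are made up:

```go
package instrumentation

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.opentelemetry.io/otel/trace"
)

// On the wire, an OpenMetrics exemplar rides after the '#' on the sample line,
// roughly:
//   http_requests_total{endpoint="/search",status_code="200"} 1725 # {trace_id="3f1a9c..."} 1.0
var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "http_requests_total",
	Help: "Total HTTP requests served.",
}, []string{"endpoint", "status_code"})

func recordRequest(ctx context.Context, endpoint, status string) {
	// Pull the trace ID for the in-flight request out of the context that
	// OpenTelemetry is already propagating.
	traceID := trace.SpanFromContext(ctx).SpanContext().TraceID().String()

	// Attach it as an exemplar when the counter is incremented, so the scrape
	// exposes the trace ID alongside the metric value.
	requestsTotal.WithLabelValues(endpoint, status).(prometheus.ExemplarAdder).
		AddWithExemplar(1, prometheus.Labels{"trace_id": traceID})
}
```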
A third step is to properly consume the exemplar that includes the trace ID. Our code is using OpenTelemetry to write out that trace ID in the exemplar, and the OpenMetrics protocol supports including this metadata in the metric exposition as an exemplar; now, on the Prometheus and M3 side, we need to be able to consume that additional metadata. On the Prometheus side, we modify the scraping path to make sure it includes these exemplars along with the values when it scrapes the data. After making sure we're scraping it properly, the next step is making sure we store that data properly as well. So an additional step we've added is having M3 store these exemplars alongside every single data point. We now get a data point scraped by Prometheus, and M3 has that metric value along with the exemplar; instead of just storing the data point as we normally would, we store the data point and, right beside it, the exemplar metadata.

The last step is, now that we have this metadata persisted in M3, how do we retrieve it properly? The final implementation piece is to modify M3 Query to read that information back out, so that when we're reading data point values, the result also includes the associated trace ID for every single data point we return. One other important nuance is that M3 Query also has to perform common aggregations for the queries it runs: sum, max, average, or more complex operations. It's important that as we roll up results, we keep the exemplars with the results as we roll them up.

We spoke earlier about sampling and how it's very much one of the biggest challenges with tracing. One of the reasons common sampling techniques are insufficient for ensuring you always have correct, representative traces for your metrics is this: say you're doing some sort of head-based sampling, where you try to choose at the start of a request which traces to keep based on the distinct combinations of tag values associated with the trace. The problem is you can miss cases where the information you actually need about the trace only exists at the end, like request latency. At the time you choose which traces to keep, you don't know whether a given trace will ultimately fall into the histogram bucket with the longest latency or the shortest latency for this endpoint, this status code, and so on. So one of the gaps in common sampling techniques is that we have to roll the dice in certain cases to try to get a representative sample, when ideally we'd want 100% certainty that the traces we're storing are truly representative of all the particular metric combinations we have.
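To make that roll-up nuance from a moment ago concrete, here is a hedged illustration, not M3's actual code, of an aggregation step that carries along the exemplar of whichever sample survives:

```go
package aggregation

// Sample pairs a value with the exemplar (trace ID) that was stored beside it.
// This is an illustrative shape, not M3's internal representation.
type Sample struct {
	Value   float64
	TraceID string
}

// maxWithExemplar rolls samples up the way a max() query step would, keeping
// the exemplar attached to whichever sample wins the aggregation, so the final
// query result still points at a real trace.
func maxWithExemplar(samples []Sample) Sample {
	if len(samples) == 0 {
		return Sample{}
	}
	out := samples[0]
	for _, s := range samples[1:] {
		if s.Value > out.Value {
			out = s
		}
	}
	return out
}
```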
I skipped over this earlier, but the way deeply linking metrics and traces actually solves this sampling problem, and we'll walk through the architecture in a moment, is that because we're storing trace IDs along with data points, we can use that tight coupling to tell the trace storage tier: store these specific traces, and the reason you should store exactly these is that these are the trace IDs we're including in our metrics persistent store. As long as the coordination of what to store is synchronized between the metric side and the trace side, we can always be 100% sure that the trace IDs stored with metrics are also actually stored in the trace storage itself.

We've talked about all the different pieces; now let's walk through the flow of how they work together. First, on the left-most side, we have an app that is going to be emitting both metrics and traces. Within that app we're using OpenTelemetry, the library we use to emit both the traces and the metrics, and it's the library making sure that trace IDs are included as exemplars with the metric values that get emitted. If we take the top half and walk through the metrics flow: the app exposes those metrics with exemplars abiding by the OpenMetrics standard, the protocol format that lets us append arbitrary metadata as exemplars, and OpenTelemetry is exposing the trace ID as that exemplar. Next, Prometheus scrapes that data, which includes the metric value as well as the exemplar, from the application. It then forwards that write to the M3 aggregation tier, which is in charge of persisting it to M3DB, the long term storage node. So again: Prometheus gets the metric value with the exemplar, it reaches the M3 aggregator, and the M3 aggregator writes out that data point with the exemplar to disk, which ultimately happens in M3DB.

Now let's talk about the tracing flow. In addition to OpenTelemetry exposing metrics for Prometheus to scrape, we also have a Jaeger collector reading the traces being emitted from our application. The last piece of the puzzle, to solve the sampling problem and ensure that we always store the traces we ultimately reference from each metric data point, is an additional trace holding cache tier that sits in front of the standard trace and span storage, something like Jaeger. Essentially, we keep 100% of the traces in memory there, but only for a short period of time, so we can wait for the M3 aggregator to signal which of these trace IDs we actually want to persist, rather than arbitrarily sampling some subset of spans up front, which might not be the ones we're storing on the metric side, or, in a traditional tracing setup, might not be a representative sample at all.
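Here's a minimal sketch of that holding-tier idea, assuming a simple in-memory map keyed by trace ID, a short TTL, and a Keep signal from the aggregation tier; the real component described in the talk would also have to handle scale, back-pressure, and forwarding into Jaeger's storage:

```go
package tracecache

import (
	"sync"
	"time"
)

// TraceHoldingCache buffers 100% of recent traces for a short window and only
// forwards the ones the metrics aggregation tier tells it to keep.
// Illustrative sketch, not the production component described in the talk.
type TraceHoldingCache struct {
	mu      sync.Mutex
	traces  map[string][]byte                  // traceID -> encoded spans
	persist func(traceID string, spans []byte) // e.g. write through to Jaeger storage
	ttl     time.Duration
}

// Hold buffers a freshly collected trace for up to ttl.
func (c *TraceHoldingCache) Hold(traceID string, spans []byte) {
	c.mu.Lock()
	c.traces[traceID] = spans
	c.mu.Unlock()

	// Expire the trace if nobody asked to keep it within the window.
	time.AfterFunc(c.ttl, func() {
		c.mu.Lock()
		delete(c.traces, traceID)
		c.mu.Unlock()
	})
}

// Keep is the signal from the aggregation tier: "I just persisted a data point
// that references this trace ID, so persist the trace too."
func (c *TraceHoldingCache) Keep(traceID string) {
	c.mu.Lock()
	spans, ok := c.traces[traceID]
	delete(c.traces, traceID)
	c.mu.Unlock()
	if ok {
		c.persist(traceID, spans)
	}
}
```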
So what we do is keep the traces in that holding tier and wait for the aggregation tier to say: definitely store this particular trace ID, because I just persisted it alongside this particular time series data point. Doing it this way ensures that on the metric side we're selecting a representative sample, because every single time series is itself one of the proper unique combinations of tag values, including things like histogram buckets, and on the trace side we then only need to store the specific subset of traces we're actually referencing. We can also store more than just that subset, but this gives us at least 100% coverage: a comprehensive sample of traces for every metric data point we care about. And lastly, once this data is stored in both M3 and Jaeger, we can read out metric values knowing which trace ID each value is associated with, so we can create a contextual link between, say, Grafana and Jaeger.

This is a visualization of what that looks like. Say we're reading data from M3 and viewing the M3 Query results on a graph; because M3 Query can read out the exemplar, which includes the trace ID, we can leverage a Grafana feature that lets you attach per-data-point links, so that from a particular data point you can launch out to a URL. We use that to create a link from Grafana to Jaeger using that particular trace ID. So that's the whole architecture change, across all of these different open source projects, to build out deeply linked metrics and traces.

Now let's talk about what this feature actually looks like when you leverage the implementation we just discussed, and specifically how it empowers us to create an experience where an engineer gets an alert and jumps directly to a request comparison. Cool, so I'll now jump to a demo. First, I'll describe what we have running locally. We're running the HotROD server, which simulates a bunch of metrics and traces we can use as examples; it functions as a demo ride sharing service. We're also running Grafana, which is visualizing a few of the metrics being emitted, and we're running Jaeger, which we'll see in a second.

Now let's look at an actual example of how we could jump from an alert directly to a trace view. Normally, let's say you have an alert that looks like this, where we're alerting on the metric for total customer requests, specifically for 500 error responses. The traditional flow would be that you click a Grafana link, and now we can see the metric over the time range associated with that alert: these ones are 200s, these ones are 500s. But again, the next step would have to be that manual searching through Jaeger. Instead, because we have the exemplars, we can link directly to either the trace itself or, even better, to a trace diff that shows the difference between the 500 case and a generic 200 case.
So, for example, in this 500 versus 200 comparison, we can see that the traces look exactly the same up until the last three customer steps. What that means is that the 500 case stopped short of doing those last three steps, so MySQL must be the root cause of this problem: the request must have ultimately failed in MySQL. And we can click into the trace and see that, yes, this trace got as far as MySQL and ultimately failed there. So that's how, with all of that machinery under the hood, with trace IDs being emitted alongside the specific metric values, we can build this alert experience that immediately generates a useful diff between the error-case trace and the normal-case trace. The other cool thing we can show here is how we embed these particular traces on specific data points: I can click here, or here, and either jump directly to the trace, or jump to the particular diff associated with that metric data point.

That's a demo of the whole architecture we just described running locally. Again, how is it working? OpenTelemetry and OpenMetrics are exposing trace IDs as exemplars, Prometheus is scraping them and forwarding them to M3, which reads and writes those exemplars, so whenever we're storing a data point or reading a data point, we're also storing or reading that exemplar. And lastly, we leverage Grafana features to make sure the links include those trace IDs and jump you to the Jaeger instance running nearby. This is what the result now looks like when a query returns not just a set of values for a particular metric, but also the span ID and trace ID for each data point.

One question might be: what other cool features could we build here? Some ideas: on the UI, being able to select a data point and say I want to compare it to this other data point, so that instead of an auto-generated diff based on 200 versus 500, I can manually specify which two data points to compare. The way to build these contextual links and integrations can really just be based on how we think about metrics. For example, in both graphing and alerting cases, we can expose the ability for users to say: for this particular metric, use this other metric as the comparison data point whenever we're doing trace diffs. An example would be: whenever request latency is in the slow bucket, use the fast bucket as the comparison; or whenever we're looking at RPC requests and there's an error case, use the success case as the control.
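As a rough sketch of how those per-metric comparison rules might be declared (hypothetical types and metric names, not an existing feature):

```go
package rules

// ComparisonRule is a hypothetical way to declare, per metric, which series
// should act as the "control" when auto-generating a trace diff for an alert.
type ComparisonRule struct {
	Metric       string            // e.g. "request_latency_bucket"
	AlertingTags map[string]string // the bad case, e.g. the slow bucket or a 500
	ControlTags  map[string]string // the good case to diff against
}

// Example rules mirroring the cases from the talk: slow vs. fast latency
// buckets, and error vs. success RPCs.
var rules = []ComparisonRule{
	{
		Metric:       "request_latency_bucket",
		AlertingTags: map[string]string{"le": "+Inf"},
		ControlTags:  map[string]string{"le": "1s"},
	},
	{
		Metric:       "rpc_requests_total",
		AlertingTags: map[string]string{"status_code": "500"},
		ControlTags:  map[string]string{"status_code": "200"},
	},
}
```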
For really commonly exposed metrics, like RPC metrics, we could probably even build plugins that know what these metric names and the various tag combinations are, so these comparisons get generated automatically for free, as opposed to having to set them up manually yourself.

So, cool, that's a summary of the whole architecture flow for natively tethering together metrics and traces, and as you can see, it really is a powerful tool for cutting out a lot of the manual labor between getting an alert on a metric and actually finding a relevant trace and understanding what's different between that trace and some other control trace. The demo we just showed is publicly available for folks to run at home; you can jump to it from this link, you just have to clone the repo, and there's a README that describes how to run it. The various pieces we described are in different states: a few of the PRs have been merged, including the one adding support for this exemplar metadata to OpenMetrics, as well as the exemplar support in the Prometheus client library for exposing exemplars. Some others are still in flight, and we're hoping to work with people in the communities to get all of that over the line. And lastly, here are the links to the various open source projects we used to build all of this: OpenMetrics, OpenTelemetry, Prometheus, M3, and Grafana. There's also another video of the CTO at my company Chronosphere, Rob Skillington, talking about some of the more interesting implementation details of how we've linked metrics and traces. So thank you for attending, and hopefully I can meet all of you during Q&A and answer your questions. Thanks a lot.