Hi everyone, my name is Albert, and today I'd like to talk to you about using your trace data for monitoring and alerting on application health, not just for debugging. If you're already using a distributed tracing solution like Jaeger, hopefully this will give you an appreciation for the additional value you could gain from your span data without adding more metrics instrumentation. And if you're new to distributed tracing, hopefully this will give you some motivation to start instrumenting your services and take advantage of the application health monitoring you can get for free along with the trace data.

Just a little bit about myself: I'm a Jaeger maintainer working at Logz.io on our distributed tracing product, which, funnily enough, is based on Jaeger. When I do have some spare time, I like to walk around the garden and literally get my hands dirty growing vegetables and fruit.

So traces are a gold mine of observability data: they're rich in context and detail, but the majority of traces are actually not that interesting. Once in a while you do find some nuggets of interesting traces, like those with errors or slow requests, but then the question is: what's the best way to find these little nuggets of gold amongst our mountain of spans?

To help explain how we do this, I want to draw an analogy to how pulsars are detected. Pulsars are essentially dead stars that are spinning very rapidly, emitting radio waves at a clock-like period, much like a lighthouse. When those signals arrive at Earth, they're mixed in with other noise from space or earth-bound sources like mobile phones, TV, or radio stations. As you can see from the first row, it's not really clear that there is a pulse amongst all that noise, but if we know the period of these pulses, we can cut the signal up into segments of that period and start adding those segments together. The idea is that the pulse signals accumulate while the noise cancels itself out, leaving a distinct pulse, as we see at the bottom here.

So the gold, the pulse in our analogy, is like the single example error or slow span from a service operation, and the mountain of soil, the noise, is like the millions of spans from hundreds or even thousands of service operations to search through. Much like adding the pulsar signals together to reveal a distinct pulse, we aggregate the spans to highlight the statistically significant issues in the rate, error, and duration metrics that we gather. These are also known as RED metrics.

So now that we have our aggregated span data, what are some real-world use cases we could apply it to? Here I list a few use cases that come to mind, the first being a high-level view of application health in our organization. A good use case for this is when we deploy a new version of a service with a new feature, and we want to make sure this doesn't impact the organization by increasing error rates or latencies for other services. Having these metrics, say sorted by error rate or latency, we can make sure that no additional spikes appear.
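To make that aggregation concrete, here is a minimal sketch of Prometheus recording rules computing per-service RED metrics. All the metric and label names here (`calls_total`, `latency_bucket`, `service_name`, `status_code`) are assumptions about how the aggregated span metrics are exposed; your setup may use different names.

```yaml
groups:
  - name: red-metrics
    rules:
      # Request rate per service over the last five minutes.
      - record: service:request_rate:5m
        expr: sum by (service_name) (rate(calls_total[5m]))

      # Error rate: the fraction of requests whose spans carry an error status.
      - record: service:error_rate:5m
        expr: |
          sum by (service_name) (rate(calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
            /
          sum by (service_name) (rate(calls_total[5m]))

      # 95th-percentile latency per service, from the latency histogram buckets.
      - record: service:latency:p95
        expr: |
          histogram_quantile(0.95,
            sum by (service_name, le) (rate(latency_bucket[5m])))
```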
In a similar vein, we could use these metrics to set up SLAs to monitor, to make sure that we don't exceed, say, our 1% SLA on error rates, and be alerted when we do (see the example alert rule sketched below). And finally, an interesting use case that was brought to my attention recently in the Jaeger Slack channel was the ability to detect and identify spans that exceed the expected latency, where we can measure this expected latency by computing the average latency across many spans for that service or operation.

So here we see a sneak peek at a proposal we're working on: a high-level view of application health per service, with the ability to drill down from a service into its operations and then into the subsequent traces. You can see here a list of services with their average latencies, request rates, and error rates, along with the impact, which is the latency multiplied by the request rate.

Here's an example of the OpenTelemetry Collector configuration required to enable span aggregation, and the idea here is to emphasise how easy it is to enable span aggregation in the OpenTelemetry Collector: it's literally as little as three lines of config. These config lines are the spanmetrics line here, telling it to export to the Prometheus exporter, and adding the spanmetrics processor into our pipeline (see the configuration sketch below). On the right-hand side I've illustrated a simplified architecture diagram just to give some context on where the spanmetrics processor resides in the bigger picture. We can see here instrumentation emitting spans to the OpenTelemetry Collector, where a receiver receives these spans and sends them to the spanmetrics processor. This processor then forks the data: firstly, the spans go down to the Jaeger exporter so that we can view these spans in Jaeger, and secondly, the aggregated metrics go to the Prometheus exporter and down to the metrics collector to persist those metrics, with Grafana querying that data. We also have work in progress where Jaeger is able to query this metrics data.

And here's a snapshot of a Grafana dashboard I put together just to illustrate the possible visualisations you could create from the spanmetrics data. For example, in this first panel here I've drawn up a histogram of the span latencies, where each column refers to the latency in milliseconds along with its count.

So the current status of the spanmetrics processor is that it's available for use in OpenTelemetry, and I'd encourage you to please try it out; any feedback is welcome. Coming soon to Jaeger is the ability to read spanmetrics data and use it to enhance the UI, with a similar idea to the mockup I showed earlier.

I know that it's a lot of information to take in with little background context, so I've added this resources slide for you to refer to later on: resources on the source code and documentation for the spanmetrics processor, along with Jaeger documentation, a link to our Slack channel, and also documentation for OpenTelemetry. If you want to learn more about how to configure it and how it's designed and architected, there's definitely some good documentation there.

That's all for now. Thanks very much for listening, and happy tracing!
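To make the SLA use case concrete, here's a minimal sketch of a Prometheus alerting rule for the 1% error-rate SLA mentioned above. It builds on the hypothetical `service:error_rate:5m` recording rule sketched earlier; the rule name, threshold, and labels are all illustrative.

```yaml
groups:
  - name: sla-alerts
    rules:
      - alert: ServiceErrorRateSLABreach
        # Fires when a service's error rate stays above the 1% SLA
        # for five consecutive minutes.
        expr: service:error_rate:5m > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.service_name }} error rate is above the 1% SLA"
```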
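And here's a sketch of the Collector configuration described above, based on the spanmetrics processor as it existed around the time of this talk. Exact keys and component names vary across Collector versions (in newer releases this functionality moved to the spanmetrics connector), so treat this as illustrative rather than definitive.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  # The three essential lines: declare the processor, point it at a
  # metrics exporter, and add it to the traces pipeline below.
  spanmetrics:
    metrics_exporter: prometheus
  batch:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"           # scraped by the metrics backend
  jaeger:
    endpoint: jaeger-collector:14250   # spans continue on to Jaeger
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics, batch]
      exporters: [jaeger]
    # A metrics pipeline is needed so the Prometheus exporter is active.
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```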