As more and more engineers adopt microservice architectures and cloud-native technologies, understanding the behavior and failure patterns of our systems is key to ensuring our applications perform well and deliver a great customer experience. Yet it is really hard, because those environments are highly dynamic, with auto-scaling and frequent deployments of new versions of our applications. I'm Ramon Gu, VP of Observability Products at Timescale, and in this CNCF webinar I will show you how you can use three open source projects, OpenTelemetry, Grafana and Promscale, to get insights about your systems that will help you deeply understand how they behave, or misbehave, so you can improve them and deliver a better experience to your users.

This is the agenda. First, I'll do a very quick introduction to OpenTelemetry and distributed tracing. Then I'll talk about Promscale, which is a free, open source observability backend that runs on top of TimescaleDB and PostgreSQL. And finally, I'll show how you can use OpenTelemetry, Promscale, Grafana and SQL to better understand your distributed systems, using a demo environment we've created that you can get up and running on your computer in just a few minutes. The demo environment we will use is available on GitHub at the URL you see at the top. We've also written a detailed blog post that covers everything about that demo environment: how to set it up, how to instrument your code with OpenTelemetry, and how to query the data and build Grafana dashboards. I'll be covering some parts of that in this session; if you want to dig deeper, I recommend you check that blog post. Ready? Let's get started.

For those who are not familiar with it, I'm going to take a few minutes to introduce OpenTelemetry and distributed tracing. OpenTelemetry is a new standard for instrumentation that is hosted by the Cloud Native Computing Foundation. It was born after two other open source projects joined forces: OpenTracing and OpenCensus. In the three years since the joint effort was announced, OpenTelemetry has become the second most active project, as well as the second with the most contributors, among all CNCF projects: only after Kubernetes, but well above Prometheus and other very popular projects. Most observability vendors, including us at Timescale, and all major cloud providers are contributing to the project.

Why is there so much interest in it? Well, first, it's vendor agnostic: instrument once and send telemetry anywhere, so your investment is future proof. Gone are the days when you had to re-instrument your systems when adopting a new observability tool. It also opens the door for the engineers creating the libraries, frameworks and tools that other developers use to build their applications to add instrumentation into the source code. For example, Kubernetes has started adding OpenTelemetry instrumentation into its code. Second, it's a standard that includes the three key telemetry signals, metrics, logs and traces, which share metadata and tags so you can more easily correlate them. It also defines a wire protocol and semantic conventions, making interoperability between OpenTelemetry and other tools much easier. And finally, it provides libraries that do automatic code instrumentation, dramatically reducing the effort required to instrument your code. In today's session we will focus on OpenTelemetry traces, since they hold a lot of valuable data for understanding distributed systems that metrics and logs cannot provide. But what is a trace?
A trace is a connected representation of the sequence of operations that were performed across all the microservices involved in fulfilling an individual request. For example, if you open an article from a news site in your browser, there would be multiple operations served by different microservices: read the article, read the comments for the article, and request ads to display with that article. Each of those operations is represented by a span, with its own child spans. A span can have zero or multiple children, and every span has exactly one parent, except the initial span in a trace, called the root span, which has no parent. Some of you may be familiar with Jaeger, a popular open source distributed tracing product. This slide shows a screenshot of the Jaeger UI, displaying an individual trace with all its spans and their parent-child relationships.

In the demo we will make heavy use of Promscale and its capabilities. Promscale is an open source observability backend for metrics and traces powered by SQL. As I mentioned, it's built on top of the proven, rock-solid foundation of TimescaleDB, which is a time-series database built on top of PostgreSQL, and as a result it lets you query the data using full SQL. We will use SQL extensively to derive insights from traces in our demo. On top of OpenTelemetry, Promscale also integrates with Prometheus for long-term metric storage and analysis, with Grafana to visualize the data, and with other tools like Jaeger, which we saw before. This is the high-level architecture, where you see Promscale using TimescaleDB to store the data, and integrations with Prometheus, OpenTelemetry, Grafana, Jaeger, and any tool that speaks SQL. As I noted, TimescaleDB is PostgreSQL with time-series superpowers; technically it's a PostgreSQL extension, and so it also gets access to all the capabilities PostgreSQL provides.

Enough of an introduction, let's start playing with OpenTelemetry, Promscale, Grafana and SQL. You can get this demo up and running on your own computer: you need to have Docker and Docker Compose installed, then download the GitHub repo and run docker-compose up. That is what the three commands on this slide do; I'll put a sketch of them just before we dive into the dashboards. I'm going to copy and paste them into a terminal so you see what happens. As you can see, this has cloned the repo, and then I just run Docker Compose, which will download all the different images, build them, and get the environment up and running. This will take a few minutes, so we're not going to watch all of it now, but you can do it on your laptop; we've tested it on macOS, Linux and Windows.

Going back to the slides, this is the high-level architecture of our demo system. It's a password generator that has been over-designed as a microservices application, connected to an OpenTelemetry observability stack. It has a load generator that makes requests to the generator service. The generator service then calls the upper, lower, digit and special services to get random uppercase characters, lowercase characters, digits and special characters to build a password. The lower service is written in Ruby and the rest in Python. All services have been instrumented with OpenTelemetry traces and send those traces to Promscale. As we saw before, you can get this demo up and running on your own laptop.

Okay, now let's go into Grafana and check the different dashboards we've built, which will show how to use SQL to derive insights from traces. Here I'm already logged in, and I have a demo environment that has been running for quite a bit.
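As a quick reference before we dive into the dashboards, getting the demo running is roughly these three commands (a sketch; the exact repository URL is on the slide and in the blog post, so treat the one here as an assumption):

    # clone the demo repository (URL assumed; check the slide / blog post)
    git clone https://github.com/timescale/opentelemetry-demo.git
    cd opentelemetry-demo
    # build and start all services, the load generator and the observability stack
    docker-compose up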
By default, the demo environment comes with these six dashboards, and we will be looking at them now. One thing to keep in mind is that the first time you try to log in to Grafana, it will ask for a login and password, and those are admin / admin. Those are the defaults that are set; you can change them if you want, but since this is for demo purposes, that's not that important.

We're going to start by looking at the request rate dashboard. The request rate dashboard simply shows the number of requests per second happening across the microservices. If we check the architecture of the application, we'll see that there is basically one entry point, which is the generator, so this is essentially measuring the request throughput of the generator. Throughput, in requests per second, is one of the golden signals for measuring application performance; the other two are error rate and latency, which we will look at next. But first, let's take a look at how these panels are built. We can take either of the two; let's take the one at the bottom. You'll see this is a standard time series panel from Grafana, and what we're running is this SQL query; you'll recognize the SELECT, FROM and WHERE clauses. In the SELECT, we use the TimescaleDB time_bucket function, which creates the buckets to be displayed and that we then group by, so the data is aggregated into one-second buckets. Then we count everything that happens within each bucket, which gives us the number of requests. We're counting all the spans, since the span view is what we have in the FROM clause, and count(*) gives us all the spans that meet the condition in the WHERE clause: parent_span_id is null. Those are entry requests into the system, which, as I mentioned, are basically requests to the generator service. (I'll show a sketch of this query in a moment.) And so this is the throughput we see: it comes in waves, the maximum gets to 11 requests per second, but we also see buckets with no requests at all.

Now let's take a look at another dashboard, the error rate dashboard. This dashboard, I think, gets a little more interesting. The previous one was showing the evolution of throughput over time; this one gives us more detail. Let's focus on this table. This table tells us, for each service and operation, what the error rate is: the percentage of errors across all executions of that operation in the system in, in this case, the last 30 minutes. So let's take a look at what this query looks like. It uses a subquery; again, this is an interesting thing, something that is available in SQL but that not all the query languages observability tools offer can do. The inner query does a select on the span view. This is a view that Promscale exposes, but you can think of it as a regular table; the distinction doesn't matter much for the purpose of explaining the SQL we use. So we're querying the span view, and in the span view we have the service name, which is self-explanatory, and the span name.
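Here is that request-rate query from before, roughly (a minimal sketch, assuming Promscale's ps_trace.span view with columns like parent_span_id and start_time, and Grafana's $__timeFilter macro for the dashboard's time range):

    -- Requests per second: count root spans (spans with no parent) per one-second bucket
    SELECT time_bucket('1 second', start_time) AS time,
           count(*) AS requests
    FROM ps_trace.span
    WHERE parent_span_id IS NULL      -- root spans = entry requests into the system
      AND $__timeFilter(start_time)   -- restrict to the Grafana time range
    GROUP BY 1
    ORDER BY 1;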
A span name is typically the name of the operation: it's the name of the span, but what it indicates is the name of that specific operation. Then we count how many of those spans have a status code of error, and we also count the total number of spans. We group by one and two, which means we group by service name and span name, so these two statistics are calculated per service name and span name, and that's why we see what we see in this table. We're also using two dashboard variables as filters. If we go up, since, as I said, this is a subquery, we have those results, and the only thing the outer query does is select the service name and the span name and calculate the error rate. We could actually have done everything within a single query, but to make it easier to read we used a subquery. And finally, we order by error rate descending, so the operations with the highest error rate show at the top.

With this information we can very quickly see that generator's generate is the operation with the highest error rate, but that is the top-level operation, so let's focus on the next level. At the next level we'll see that the upper service's process operation is the other one with a very high error rate. There are some other operations that have some errors, but their error rate is much lower. So probably we should go and check that method and see what's going on, why we have such a high error rate. As I mentioned, at the top you could also filter down to a specific service or operation. The other thing we're doing here is looking at the evolution over time. This panel is similar to the table, but it shows the evolution over time. If we open this query, we'll see that it's pretty much the same; the main difference is that we introduce a time projection in the SELECT, the time bucket, so we calculate the same statistic, the error rate per service and operation, on a per-minute basis, and plot it over time.

Okay, let's move to the next one: latency, or request durations. This is the third golden signal; we have throughput and error rate, which we've already seen, and then latency. Let's look at this chart here. It shows the evolution of duration over time, but we're not looking at averages; we're actually computing percentiles. How does this work? Again, let's take a look at how the query works. Here, once more, we use the time_bucket function that TimescaleDB provides to group the data into buckets of one minute, which then show up in the GROUP BY clause. And then we compute the percentiles: the 99th percentile, the 95th percentile, the 90th percentile, and the median, or 50th percentile. To do that we use the approx_percentile function provided by TimescaleDB, together with the percentile_agg function, which builds a sketch over the duration in milliseconds: a data structure that then allows us to compute an approximate percentile on top of it in a way that is much more performant.
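As a rough sketch, the error-rate table query could look like this (assuming the same ps_trace.span view, 'error' as the status-code value, and Grafana variables for the service and operation filters; all of those names are assumptions):

    -- Error rate per service and operation over the dashboard time range
    SELECT service_name,
           span_name,
           100.0 * errors / total AS error_rate_pct
    FROM (
        SELECT service_name,
               span_name,
               count(*) FILTER (WHERE status_code = 'error') AS errors, -- assumed enum value
               count(*) AS total
        FROM ps_trace.span
        WHERE $__timeFilter(start_time)
        GROUP BY 1, 2
    ) AS per_operation
    ORDER BY error_rate_pct DESC;

And the percentile panel, using the two hyperfunctions just mentioned, might look roughly like this:

    -- Approximate latency percentiles of root spans, per one-minute bucket
    SELECT time_bucket('1 minute', start_time) AS time,
           approx_percentile(0.99, percentile_agg(duration_ms)) AS p99,
           approx_percentile(0.95, percentile_agg(duration_ms)) AS p95,
           approx_percentile(0.90, percentile_agg(duration_ms)) AS p90,
           approx_percentile(0.50, percentile_agg(duration_ms)) AS median
    FROM ps_trace.span
    WHERE parent_span_id IS NULL
      AND $__timeFilter(start_time)
    GROUP BY 1
    ORDER BY 1;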
And then we're just plotting all of those here. So again, we can use the power of SQL and TimescaleDB to compute those percentiles, and we could compute any percentile we wanted.

Another thing that is interesting is this histogram of durations. It shows the distribution of latencies for requests; again, because all requests go through the generator, which is the single entry point into this microservices environment, this covers all generator requests. And what we see is that while the majority of the requests are processed in, let's say, two seconds or less, some of them are extremely slow. We even have requests that took 30 seconds. That's a lot of time. What may be going on there?

Here at the bottom we have another interesting thing: we're listing individual traces. Again, a trace maps to a request and how it went through the system, so we're looking at individual traces, when they happened, and how long they took, and this query shows the slowest ones. Let's take a look at it. For each trace we display the trace ID, the start time, and the duration, as we saw in the panel, and we apply this replace to the trace ID, which I'll explain in a second. Again we filter on parent_span_id is null, which means root spans, and a root span basically maps to a full trace. Then the only thing we do is sort; it's a very simple query. We're just searching for root spans and getting the top 10 slowest ones, because we sort by duration descending.

So why are we doing this replace? The trace IDs, when they get stored in Promscale, use a UUID format, so they have dashes in them. But you'll notice that this trace ID here is underlined; that's because we've made it a link, and we strip the dashes so the ID matches the format the link expects. If you click on any of those traces, it will open the Grafana UI showing that individual distributed trace; this view is basically based on the code from Jaeger. And so with this you don't need to copy and paste the trace ID: thanks to the amazing, very flexible linking capabilities Grafana provides, you can jump straight into that slow trace and try to understand what's going on. As you see, a lot of the spans are very quick, but there are always a few of them that are slow. And if you check closely, you'll see that the slow ones actually belong to the digit service, and specifically it's the random digit function that is slow. You can see it very quickly here. So you could go back to the random digit method, or function, in your code and try to understand why it is slow. Very quickly we've nailed down that the problem is related to this specific function. At least in this trace; we could look at other traces, and maybe the problems would be different, but in this case that is what is causing this trace to be slow. Okay, let's go back to our dashboard.
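That slowest-traces panel boils down to something like this (a sketch under the same ps_trace.span assumption; the link itself is configured in the panel settings, not in the query):

    -- Ten slowest requests: root spans ordered by duration, with a dash-less
    -- trace ID that the panel turns into a link to Grafana's trace view
    SELECT replace(trace_id::text, '-', '') AS trace_id,
           start_time,
           duration_ms
    FROM ps_trace.span
    WHERE parent_span_id IS NULL
      AND $__timeFilter(start_time)
    ORDER BY duration_ms DESC
    LIMIT 10;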
And now let's take a look at something even more interesting: service dependencies. This is a service map, and it helps us quickly understand how our different services interact. For this dependency map we're using Grafana's node graph panel, which at the time of this webinar is still in beta. The node graph panel expects two queries: one to retrieve the nodes in the graph, the first query here, where the nodes are the circles, and another one to retrieve the edges, which are the arrows.

This is the query to get the list of nodes. Basically, we just retrieve all service names that appear in spans within the currently selected time window in Grafana. id and title are two fields the node graph expects: id uniquely identifies a node and title is the label assigned to the node, and we use the service name for both.

The second query is more interesting: it has to return the arrows, the relationships between the services. This is something that is typically impossible with the limited query languages other observability backends provide, but because we can leverage the full capabilities of the SQL provided by PostgreSQL, we can do joins. In this case, we join the span view with itself to identify parent and child spans that are related to each other. To do that, we check that the span ID of the parent is the same as the parent span ID of the child span. We add two additional conditions. The first one is not strictly needed; it ensures that the parent and the child span are part of the same trace, which would only matter if two spans in different traces were assigned the same span ID, and that is very unlikely to happen. The other condition, the one at the bottom, is actually very important, because it ensures that we only look at parent-child relationships across services, that is, an operation in one service calling an operation in another service. We remove intra-service relationships, an operation calling another operation in the same service, because we don't want to show those in this map; here we're interested in cross-service dependencies.

This table here shows the same relationships as the service map, but in table format and with some additional stats: the number of calls, the total execution time, and the average execution time. If we look at the query behind it, in this case a Grafana table panel, it's pretty much the same join, but we project a different set of stats. We group by source, target, and span name, and then we show how many calls are happening from the source service to the target service and operation, the total execution time, which we get by summing the duration of those spans across the selected time window, and the average execution time of that span.
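Stripped down, both the map edges and this table come from the same self-join, something like this (a sketch; the view and column names are, as before, assumptions based on Promscale's schema):

    -- Edges of the service map: parent spans joined to their child spans,
    -- keeping only calls that cross a service boundary
    SELECT parent.service_name AS source,
           child.service_name  AS target,
           count(*)               AS calls,
           sum(child.duration_ms) AS total_exec_time_ms,
           avg(child.duration_ms) AS avg_exec_time_ms
    FROM ps_trace.span AS parent
    JOIN ps_trace.span AS child
      ON  child.parent_span_id = parent.span_id
      AND child.trace_id = parent.trace_id            -- guard against span-ID collisions
      AND child.service_name <> parent.service_name   -- cross-service calls only
    WHERE $__timeFilter(child.start_time)
    GROUP BY 1, 2;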
And here we can very quickly see that most of the time is actually spent in the generator calling the lower service, the lower service calling the digit service, and the generator calling the digit service. So it seems that the problem is actually in the digit service; that's a service that is very slow. We already saw that when we looked at that specific trace, where a lot of time was spent in the digit service, so this just reinforces it. And most likely it is not just an individual occurrence: it's happening consistently over time and across multiple requests.

We're now going to take a look at another way to explore and visualize trace data. Again we're going to be using the node graph panel, but in this case we're trying to solve a different problem. Imagine that one of your services is unexpectedly going through a big increase in load. Understanding where that load is coming from in a microservices environment is not easy, because you would need to check all the different upstream services that end up calling the service under pressure. So let's select that service here; let's go, for example, with the digit service. If we look at the digit service and the / operation, which is the entry point operation, here, we see in this tree that it is being called by the generator, through an HTTP request to this service, but it's also called by the lower service: there is a digit operation in the lower service that ends up calling digit, which, as we already saw in the service map, is probably wrong. And the interesting thing is that quite a bit of load goes to that service through this path: close to half of the load is generated via this path, and the other half via this path, which is the correct one. So the digit service is probably under pressure, doing double the work it needs to do, because we have something wrong in our code in this case. And there could be many more hops in the tree of spans, or operations, until we hit the service; we can use this visualization to quickly spot where most of the requests are coming from. The number inside each node shows the total count of spans for that specific operation, or, what is the same thing, the total number of times that operation has been executed in the selected time window, in this case 15 minutes.

Again, as I said, we're using the node graph panel, and we again have two queries. Since those two queries are a bit long, I'm going to move to a text editor to review them. The first thing to note is that going up the chain of calls like this would be very tedious without a powerful query language, because we basically need to recursively traverse the tree of spans upwards, across all traces that involve our problematic service. Luckily, we can leverage the power of SQL again, and in this case we use a recursive query. The way it works is that there is an initial query that gets executed, which is this one, where service and operation are the ones you selected from the dropdowns in Grafana.
It runs this query, which retrieves some data for all spans that match this specific service and operation. Then it runs the results, so x here is the results of this initial query, through this other query, which is basically a join: it takes the results from the previous step, reads the parent span ID, and looks it up in the table we're joining against, which is again the same span view, to retrieve the parents. So s in this case represents the parent of x: we go up one level and project all these values from the parent span. And because this is recursive, it does the same thing again: it takes the results we just got and runs this query against them again, looking for the parents of each of the spans that were returned, and it does that again and again until no results are returned. That's how the recursion works. The UNION ALL appends all those results, from the first query and from all the subsequent queries, as it navigates upstream through the spans.

Once it has built that table, it runs this final query on those results, which uses the service name and the span name, the operation, to generate an id: we generate one node for each combination of service name and span name. Something important to note here is that we are not excluding intra-service operations, because we're actually interested in seeing them, in case the increase in calls comes from an internal operation within the service rather than from something outside; it could be, for example, a new deployment we made that caused the problem. So we include intra-service operations as well. Then we add the span name as the title of the node and the service name as the subtitle, and we count the number of spans. We use a distinct count just in case the recursive traversal created duplicates for some span ID; in theory I don't think that's strictly needed, but it guarantees an accurate count. And then we group those results by service name and span name, so basically we group by node.

That is the query for the nodes; the edges use a very similar query. Again you see the join that traverses upwards, taking the current set of results, getting the parents and projecting them, but it also adds a bunch of additional information, because here we're interested in the edges. So we project the id of the relationship, which is built from the service name and span name of the source and the service name and span name of the child.
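Putting that together, the upstream traversal might look roughly like this (a sketch, assuming the ps_trace.span view and Grafana variables named $service and $operation; the real dashboard query projects more fields, and the exact node-graph field names are per Grafana's documentation):

    -- Recursively walk from the selected operation up through its callers
    WITH RECURSIVE upstream AS (
        -- anchor: all spans of the service/operation under investigation
        SELECT trace_id, span_id, parent_span_id, service_name, span_name
        FROM ps_trace.span
        WHERE service_name = '$service'
          AND span_name = '$operation'
          AND $__timeFilter(start_time)
        UNION ALL
        -- step: for every span found so far, fetch its parent span
        SELECT s.trace_id, s.span_id, s.parent_span_id, s.service_name, s.span_name
        FROM upstream AS x
        JOIN ps_trace.span AS s
          ON s.span_id = x.parent_span_id
         AND s.trace_id = x.trace_id
    )
    -- one node per service/operation, with how many times it was executed
    SELECT md5(service_name || span_name) AS id,
           span_name                      AS title,
           service_name                   AS "subTitle",
           count(DISTINCT span_id)        AS "mainStat"
    FROM upstream
    GROUP BY service_name, span_name;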
That id is essentially the relationship between two nodes in the graph we're displaying. Then we also project the target and the source: again, we do an MD5 of the service name and span name to compute the IDs. So here we project the id of the relationship, the target, and the source, and the node graph panel can then connect the dots between the services. The target and source need to use the same IDs, constructed, as you can see, the same way they were constructed in the nodes query, so that the node graph panel can identify those nodes and draw the connection with an arrow.

Okay, so we've seen how we can troubleshoot scenarios where a service is having issues by navigating upstream through the sequence of spans, across all the different traces, to understand how the service is being called and what the impact of things happening upstream is on the service we're looking at. We can do something similar using downstream spans. Let me make this bigger. Here I have selected generator and HTTP GET; let's actually select generator and the generate operation, because that is the entry point. This shows an entire map of all the requests that go through the service: all the different services and operations being called, how often, et cetera. And we're using the same technique we used in the upstream dependencies dashboard. The only difference is that in this case the join is the other way around: before we had x.parent_span_id matching s.span_id, and here we look for x.span_id matching s.parent_span_id. So we're just going downstream and projecting the children. Then we do the same operations as before; it's a very similar query, so I won't explain it in detail, but just as you can navigate upstream, you can also navigate downstream. And this gives you a very interesting map of all the different calls that happen from the service, in this case over the last 15 minutes, for all the requests to the generator service. It helps you understand, in detail, what's happening: how are all the different requests being processed?

Another interesting panel I'll explain in this dashboard is this one. It looks at the total execution time per operation, but it doesn't do that by blindly adding up the duration of all the spans for that specific operation; it looks at the time actually spent in the code of that operation. That is, it subtracts the time spent in child spans. You have an operation, it has some code, and then it makes requests to other operations that track their own spans. So instead of telling you that the high-level span is the one taking the longest, it looks at how much time is spent within the code of that span itself, so you can identify where the bottleneck is; otherwise, the operation at the top of the hierarchy would always show up as the slowest. A sketch of this self-time idea follows.
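Here's a minimal sketch of that self-time calculation (again assuming the ps_trace.span view; the real dashboard combines this idea with the recursive traversal shown earlier):

    -- Self time per operation: each span's own duration minus the time spent
    -- in its direct children; coalesce turns "no children" (NULL) into 0
    SELECT parent.service_name,
           parent.span_name,
           sum(parent.duration_ms
               - coalesce((SELECT sum(child.duration_ms)
                           FROM ps_trace.span AS child
                           WHERE child.parent_span_id = parent.span_id
                             AND child.trace_id = parent.trace_id), 0)
           ) AS self_time_ms
    FROM ps_trace.span AS parent
    WHERE $__timeFilter(parent.start_time)
    GROUP BY 1, 2
    ORDER BY self_time_ms DESC;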
But here that's not the case. If you look at this panel, the place where most of the time is spent, and we've already seen this over the course of this presentation, is the random digit method, or function, of the digit service: 88% of the time is spent there. So definitely this is the first place we should go to optimize the performance of our service. If we weren't subtracting the time of the child spans, the one at the top would have been generate password from the generator service, because that's the top-level one, so all the time adds up into the duration of that span.

So how do we do that? This is actually really important: it's this idea of looking at which code, specifically, the time is spent in. By taking the parent span duration and subtracting the time spent in the children, we get the actual execution time within that specific code, and that is really helpful for understanding bottlenecks. The way we do it is, again, with a recursive query that traverses all the spans and assigns to each span the actual time spent in it. The key part is what you see here: from the parent span's duration, we subtract the sum of the durations of all its children, that is, all the spans whose parent is this span. And the coalesce, what it does is, if this returns null, so there is no data, it turns it into a zero, meaning there is no time to subtract from the duration; that is only relevant for leaf spans, which don't have any children.

All right, we've reached the end of this webinar; I hope you enjoyed it. We showed that with OpenTelemetry, Promscale and Grafana you can get insights you didn't think were possible, thanks to the power of full SQL. I encourage all of you to download the OpenTelemetry demo today; all the software we've shown here is available on GitHub and is free to use. And if you have questions about Promscale or the demo environment, we're available in the Promscale channel of our Slack community, which you see here. Thank you for watching this webinar, and I hope to see you in our Slack community soon.