Welcome to the workshop. This is intended to be hands-on, so if you want to follow along, go to this link, either via the QR code or the URL up there, download the repo, and run docker compose up in the root of the repo. You'll be deploying a system that we'll use to build some Grafana dashboards and explore traces. I don't know how slow the Wi-Fi is here, so go ahead and kick it off now and you'll be ready when we're ready.

All right, so the workshop is called "Analyzing and Visualizing OpenTelemetry Traces with SQL." That's what we're going to do: drop some traces into a database and see what we can see. I'm John Pruitt. I'm an engineer at Timescale working on the Promscale product. Promscale is a backend for Prometheus metrics and OpenTelemetry traces, and this is a preview of what we're going to build today. I should say the postcards on the tables have the link and the QR code as well, so if you missed it before you can catch up. We're going to deploy a system that generates traces — no metrics, no logs — and we're going to build some Grafana dashboards using that data. This slide is focusing on the time-series aspect of traces, and then we're also going to do a second dashboard that focuses on the tree structure of traces and the topology of the system.

Yes — the question was whether these are manual traces or automatic. The repo actually has two different versions. Today we're going to use the manual version, so the code is instrumented manually. There's also a version in the repo that uses only auto-instrumentation. And please feel free to ask questions as we go.

So here's what we're going to do: first we'll talk about the demo system, just to make sure you understand what you just deployed; then a bit of tracing background, just enough to write the SQL; then we'll build a dashboard using the time-series aspect of tracing; then a second dashboard focusing on the tree structure; and then wrap up.

The demo system: again, here's the link, pull it down; it all runs in Docker, that's all you need. I'll leave the link up there for a second. When you get to the repo, you hit the Code button and you can either use git clone or download it as a zip. I also want to note that there's a workshop.md file in the root with notes, so basically everything I'm going to talk about is recorded in that markdown file as well. If you fall behind, or if later on you want to revisit and dig in a little deeper, you can consult that.
Operating the demo system: once you've got it downloaded, docker compose up starts it (I like to run it detached). If you want to pause it but not remove it, docker compose stop, and docker compose start brings it back up. If you want to get rid of it entirely, docker compose down.

So what are you deploying? We've implemented an absurd password generator. The point is not to build a good password generator; it's to emit interesting traces. We have four services that each return a random character: the lower service serves up lowercase characters, upper uppercase, digit digits, and special special characters. Then we have a generator service that uses those other services to produce a randomly generated password, and finally we've got a load script that continuously exercises the system — we're running three replicas of that. Each of these services, with the exception of the load generator, is instrumented with OpenTelemetry libraries to emit traces.

Also in Docker we've got an observability stack. The traces flow into the OpenTelemetry collector first. That's not strictly necessary, but it's in the box to play with; right now it's configured only to do batching, no sampling. The collector forwards those traces to the Promscale connector, which puts them into the Promscale database, which is a TimescaleDB/Postgres database. Also in the box you've got Grafana and Jaeger, both configured to look at the database and explore traces. If you want to connect directly to the database, it's on localhost:5999 and the database name is otel_demo, or you can jump directly onto the database container and use the client installed there. Grafana is at localhost:3000, Jaeger is at localhost:16686, and each of the services is also exposed so you can poke them independently if you want. So just to show you: we've got Grafana here, we've got Jaeger, we can jump in and find traces with Jaeger, and this is the password service — I can poke that... let me try that again... there we go.

So let's do some tracing background, just to make sure we're all on the same page and have a good foundation. We have to understand the data model to understand how to answer questions about our system and how to write the SQL. First of all, a trace is a collection of spans, and a span can have zero or more children, so in that way a trace is a tree structure. A trace is also a time series: each span has a start time, it has an end time, and therefore a duration, and each parent span encompasses all of its children. So span 1 is a parent, spans 2 and 6 are its children, and their start and end times fall within the time span of the parent.

You're probably familiar with looking at a trace in Jaeger: you can see the tree structure over here in the nested spans, and you can see the time-series aspect over here in the Gantt chart — the length of the lines is the duration, and you also get the start and end times. This is cool and it's a powerful tool, but typically you're only looking at one trace at a time. The cool thing about SQL, and what we're going to do today, is look at a few thousand traces at a time and see what we can glean from that — which is hard to do looking at a single trace at a time.
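Just as a taste of what that looks like on the SQL side — this is a sketch, not a query from the workshop repo — pulling one trace's spans out of the span view (described in a moment; written unqualified as span here, though depending on your Promscale version you may need to schema-qualify it, e.g. ps_trace.span) gives you the same information Jaeger draws as a tree plus a Gantt chart:

```sql
-- A sketch: one trace's spans, ordered by start time.
-- '...' is a placeholder for a real trace id.
SELECT span_id,
       parent_span_id,   -- the tree structure
       service_name,
       span_name,
       start_time,       -- the time-series aspect
       end_time
FROM   span
WHERE  trace_id = '...'
ORDER  BY start_time;
```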
Let's also look at this from the code side. This is a function in one of the services we deployed — it's Python. You can see on line 37 we create a span and give it a name. That span has a context, and the beginning and the end of that context determine the start time and the end time. You can attach attributes to a span. You can attach events to a span, which are just point-in-time markers. A span can have child spans: right here we've created a span with a different name; it's a child of the one that wraps it, and it has its own start and end time that's contained within the parent span. And finally, you can record exceptions on the span. This is just to give you a sense of what this looks like in code and how to translate what we see in the SQL back to what you'd see on the code side. The other thing I want to mention here: again, this is not a great password generator — it's built to make interesting traces — so these calls to work are intentionally slowing down the code just to make interesting data, and we've got random exceptions being purposefully emitted as well.

So, the tracing spec. Let's say we wanted to ingest these traces ourselves. This is a screenshot of the protobuf definition for a trace. There's a lot there — it would take us all day to handle it, probably more — so we're not going to do that. That's why we're piggybacking on Promscale, which has done the hard work for us. Promscale lands this data in a Postgres database across a number of tables, but today we're going to focus on only one view: the span view. Each row in the span view corresponds to a single span, and many traces are represented in the view. The view has many columns as well, and we're only going to look at eight of them — I think you'll be surprised at just how much value we can get out of eight columns.

Each span has a span ID and a trace ID. All of the spans belonging to a given trace share the same trace ID, but each span has its own span ID, so the combination of a trace ID and a span ID uniquely identifies a span. The service name column, unsurprisingly, gives you the name of the service that emitted the span. The span name corresponds to the name you give the span when you create it, which is typically a function name or some sort of operation. The start time, end time, and duration are the time-series aspect of a span. And using the trace ID and the parent span ID, you can find a span's parent, which gives you the tree structure of a trace.

Speaking of the tree structure, I want to elaborate on that last point. You've got the span ID and the parent span ID on each span, and the child's parent span ID points to the parent's span ID — saying that out loud sounds weird, so for me a visual helps. The other thing to note is that each trace has a single root span, and the root's parent span ID is null. That's how we identify the root.
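That parent/child relationship is just a self-join away. A minimal sketch (not from the workshop repo) that lists each span next to its parent — root spans come back with a null parent:

```sql
-- A sketch: every span alongside the operation that called it.
SELECT c.service_name || ': ' || c.span_name AS operation,
       p.service_name || ': ' || p.span_name AS parent_operation  -- null for roots
FROM   span c
LEFT   JOIN span p
       ON  p.trace_id = c.trace_id
       AND p.span_id  = c.parent_span_id
LIMIT  20;
```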
All right, next we're going to build a dashboard. We got through all of the boring part; now we're ready to do some fun stuff. Keep this picture in your mind: spans with start times, end times, and durations. Grafana has this time picker in the UI right here, and we're going to have to use it in our SQL. Grafana gives you a start time and an end time from that UI filter, and what we're going to do is compare the start times of the spans to the time window we've selected. As you can imagine, some spans will be fully contained in that window, some will overlap it, and some will be excluded from the window entirely. Note that that's true within a trace too: if this were a trace, some spans from the trace would be included and some excluded. It's just something to keep in mind. Any questions so far?

In Grafana there's a macro called $__timeFilter. You can use it to tie your queries to that UI widget, and the way we're going to do that is in the WHERE clause: we'll use the time filter macro and filter on the start time. It expands to something like "where the start time is between the beginning and the end of the window you've selected." If you want to query directly against Postgres without Grafana, the equivalent is something like WHERE start_time > now() - interval '15 minutes', which gives you all the spans in the last 15 minutes.

The first thing we're going to do is build this widget. Go over to Grafana — remember, Grafana is on localhost:3000 — go to Browse, and there's a demo folder. In the demo folder are a number of dashboards; the ones we're going to use for the workshop are workshop 1, workshop 1 finished, workshop 2, and workshop 2 finished. Workshop 1 and workshop 1 finished are copies of one another: in workshop 1 all of the queries are commented out, and we're going to talk about each one and uncomment and enable them as we go; in workshop 1 finished everything's already enabled, so if you want to cheat, just go to the finished one. Yes — oh, right, sorry: the first time you log in it's going to ask for a username and password; it's admin/admin, and then it'll prompt you to set a new password — use whatever you want. I should have mentioned that, thank you. Are you in there now? OK, cool.

So we jump into workshop 1 and you should see something that looks like this, and nothing's working yet. What we're going to do is enable this panel over here that says "number of traces in the time window," and when we do, it's going to look something like this. Behind this panel is one query. The query looks at the span view we've been talking about, uses the macro from Grafana to filter on the time window, and filters where the parent span ID is null. As I mentioned before, the root span of every trace has a null parent span ID, so basically we're saying: give me all of the root spans in the time window we've selected. Why is that important? We know every trace has a single root span, so by counting the number of root spans in the time window, we know the number of traces in the time window. So go back to Grafana, hit edit on the panel, and you should see some comments in the SQL; take those comments out, hit apply, and you should see a number come up. Right now I've got my filter set to the last five minutes, and in the last five minutes there are 712 traces. Again, this is a terrible password generator; it's horribly slow on purpose. Everybody good? I should also say we're going to start with easy queries and work our way up; each one builds on the next.
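That trace-count query looks roughly like this — a sketch, not necessarily the exact text in the dashboard:

```sql
-- Number of traces in the selected window: count root spans, since every
-- trace has exactly one root span (parent_span_id IS NULL).
SELECT count(*) AS num_traces
FROM   span s
WHERE  $__timeFilter(s.start_time)   -- Grafana's time-picker macro
  AND  s.parent_span_id IS NULL;
```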
Next we're going to do throughput; we're going to build a panel that looks like this. If every trace has a single root span and we don't have sampling turned on, then every trace corresponds to a request, and we can compute the throughput of the system by counting traces over time. You'll note the query uses the span view again, with the same filters as the last query; this time we're counting traces again, but we're going to group them into buckets of 10 seconds. We're using a function from TimescaleDB called time_bucket. You could do this with standard SQL functions like date_trunc; time_bucket is just more powerful and flexible. If we group by time and order by time, what we'll see for a five-minute window is 10-second slices, with a count for each 10-second slice. So jump over here, go to the throughput panel, hit edit — I didn't want you all to have to type, so we're going to make this easy and just uncomment the queries — and hit apply.

What do we see? At least on my system, there was a peak of 46 traces in a 10-second window in the last five minutes, and I've seen as few as five. So already we know there's variability in the throughput of this system over time, and if I scroll out to, say, 15 minutes, we see the pattern even more clearly: there are periods of time where the throughput is high and periods where it's low. I don't know how you would get that from Jaeger. We just looked at roughly 1,500 traces and we've already found a pattern in our system.

Next, we'll look at the slowest traces in the time window. We take the same exact query as before, the same filter logic, but this time we're not doing any aggregation; we just get the distinct trace IDs and their durations, and by ordering by the duration descending with a limit of 10, we get the 10 slowest traces in the time window. Jump back over here, top left, and enable the query. We can see that in the last 15 minutes the worst request latency was 7.2 seconds — pretty terrible, right? This is one place where Jaeger could help: you could grab these trace IDs of the top 10 worst, go look at them, and try to figure out what's wrong.

Now we'll look at a histogram of the latencies. We've seen the throughput and the top 10 slowest, but the top 10 slowest doesn't give us a picture of what the fastest or the average requests look like. This histogram breaks our latencies down into a number of buckets and counts within those buckets for the entire time window. The query is easy — it's essentially the same as the top 10 slowest, except we're not grabbing only the slowest ones, we're grabbing all of them and handing them to Grafana. Enable that query. So what does this tell us? For this 15-minute window, the vast majority of our requests are happening in — what is this bucket, let's say 800 milliseconds — under about 800 milliseconds, but we're seeing a long tail that drags out to roughly eight seconds. That's good to know; it paints another, more detailed picture of what our system looks like.

One signal a lot of people want to look at, especially SREs, is the P95 latency. It tells us that 95% of requests complete at or under this amount of latency, and we're going to plot it over time. To do that we start with the same basic query we've been using all along, apply the time_bucket to get 10-second buckets, and then use a couple of TimescaleDB functions — percentile_agg and approx_percentile — to get the P95 duration. Again, you could use standard SQL functions to do this; they'd just be a little less flexible and a little slower. We group by the time bucket and order by time, and that gives us the P95 plotted over time.
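A sketch of that P95 query, assuming the TimescaleDB Toolkit percentile functions are installed; duration_ms is an assumed column name here, and the dashboard's actual query may differ:

```sql
-- P95 latency per 10-second bucket, over root spans (one per trace/request).
SELECT time_bucket('10 seconds', s.start_time)                AS bucket,
       approx_percentile(0.95, percentile_agg(s.duration_ms)) AS p95_ms
FROM   span s
WHERE  $__timeFilter(s.start_time)
  AND  s.parent_span_id IS NULL
GROUP  BY bucket
ORDER  BY bucket;
```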
Jump in here, enable the query, and check it out. Now we know that not only does the throughput vary over time, so does the latency — which makes sense — and they're pretty much the inverse of one another: when the throughput is high, the P95 latency is low, and vice versa. It didn't necessarily have to be that way, but the fact that it is is another piece of information, another pattern we can see about the way our system runs. And again, this is just tracing data.

We generated a histogram of latency up here, and it covers the entire 15-minute window, which is useful, but it doesn't tell us whether latency varies over time beyond what the P95 shows. So now we'll build a histogram of latencies over time. You kind of have to think in 3D: for each 10-second bucket it gives you not only the slowest and the fastest, but the whole spread, and it shows where most of the traces land within that spread — the brighter the box, the more traces in that box. Let's do the query first. Again, it's a very simple query building on what we've been doing: we use time_bucket, we use the duration, and we let Grafana build the heat map for us. Jump in here, hit apply. The P95 latencies right above only tell us what the slow requests look like; now, with this picture, each vertical slice is broken down into buckets and the color corresponds to how many traces fall into each bucket. So we can see the variability in the latencies and we can see the spread — it's kind of like a sine wave. And what's wrong up here? OK. Any questions?

So you may say: this is great, I know my system is slow, and I know it's variably slow, but where do I go from there? What if I want to nail this down to a service or an operation — how do I figure out where to go optimize first? That's what we're going to build now: this donut chart, this pie chart. For every operation — which is basically a function — we're going to compute the amount of execution time spent in that operation over the 15-minute window. Then you can say: the function we're spending the most execution time in is the bottleneck; that's where we should spend our time optimizing.

This query is a little more advanced, so go back to the earlier picture: each span has zero or more children, and a parent span's time frame encompasses all of its children. If we want to know how much time is spent in a span exclusive of its children, we have to subtract the children's durations from the parent's — that's what lets us pinpoint whether the time was spent in the parent or in a child. So what we've got here is the same view and the same filter. We concatenate the service name and the span name and call that our operation, and then we sum the duration spent in the parent and subtract out the sum of the durations of the direct children. How do we do that? We reference the same view again and alias it as k, for "kids." We find the rows where the child's trace ID is the same as the parent's trace ID and the child's parent span ID is this span's span ID — that gives us all of the spans that are direct children of this span.
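Putting that together, the query looks roughly like this — a sketch of the idea, with duration_ms again an assumed column name:

```sql
-- Execution time per operation, excluding time spent in each span's direct
-- children ("kids"), summed over the selected time window.
SELECT s.service_name || ': ' || s.span_name AS operation,
       sum(
         s.duration_ms
         - coalesce((SELECT sum(k.duration_ms)
                     FROM   span k
                     WHERE  k.trace_id = s.trace_id
                       AND  k.parent_span_id = s.span_id
                       AND  $__timeFilter(k.start_time)   -- performance optimization
                    ), 0)
       ) AS total_exec_time_ms
FROM   span s
WHERE  $__timeFilter(s.start_time)
GROUP  BY 1
ORDER  BY 2 DESC;
```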
The time filter on the children there is a performance optimization. So if we find all the direct children, sum their durations, and subtract that from the parent's duration, we know how much time was spent in the parent excluding the children. Go over here to the bottom-left panel, edit, enable the query, hit apply. This query is a little slower; we could optimize it, but it would be harder to understand, so I left it slow.

OK, so what do we see? In this five-minute window we spent 10.4 minutes of execution time in the digit service, in the random digit operation. It vastly outstrips all of the rest of the system. And realize we're not just looking at one process or one service — we're looking at all five services and something like a thousand traces over a five-minute window. There may be all different kinds of requests happening in that window, and we've pinpointed a function that's our culprit. With this panel alone you've identified the bottleneck and you know exactly where to go spend your time optimizing. That's pretty cool.

Let's build a table with that data as well — maybe we also want to see the average and the P95 for each of those operations. We're going to use the exact same query to do it; all we do is wrap it. This is the query from the last panel, and we wrap it in an outer query, and instead of just using the total execution time, we average it and take the P95 of it. Go back to your panel, edit, and enable this. The table verifies what we saw in the pie chart: the digit service has a random digit operation that on average takes over a second to complete, at least on my laptop, and its P95 duration is 4.2 seconds — so not only is it slow on average, it's extremely slow in the slowest cases. We also see that the next slowest operation is in the same service: render digit, with a P95 of 363 milliseconds on my laptop. Between the two of these, you know where to go to improve things.

One thing we don't see here is whether that varies over time, so let's go looking for another pattern. We're going to build this — think of it as taking the pie graph, generating a pie graph every 10 seconds, and plotting them as a stacked bar graph. We take the same query as before — the parent's duration excluding the sum of the direct children's durations — and use time_bucket to plot it in 10-second windows. Edit that, enable the query, and I'll scale back out to 15 minutes.

OK, that may just look like a bunch of colors at first, but we can see that this orange bar is very big relative to the others and also varies in size: there are times when the orange bar is large and times when it's small. And if we scroll up, we can see that the size of the orange bar corresponds with what we see in the histogram of latencies, the P95 latencies, and the throughput. So what is this orange bar? It's the random digit function from the digit service — unsurprisingly, that's what all of the patterns were pointing to, but we've just confirmed it. What else can we see? Occasionally this blue bar is tall, and this red bar too, and those look periodic as well: the red bar is longer here, then it goes away, then it comes back here. So what is that? The red one is render digit from the digit service, which we saw right here. While this confirms that render digit is a problem, this panel shows us it's an intermittent, periodic problem, not a constant one. And what's this blue one? The blue one is process digit, I believe — that was hard to tell. In any case, we've found two problems: we've correlated a given function with the latency and throughput issues, and we can tell that the second function is an intermittent problem.
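For reference, the table version described above — the same per-span exclusive duration, wrapped in an outer query to average it and take its P95 — looks roughly like this sketch (same column-name assumptions as before):

```sql
-- Average and P95 of each operation's exclusive duration (time not spent in
-- direct children), over the selected window.
WITH span_self AS (
  SELECT s.service_name,
         s.span_name,
         s.duration_ms
           - coalesce((SELECT sum(k.duration_ms)
                       FROM   span k
                       WHERE  k.trace_id = s.trace_id
                         AND  k.parent_span_id = s.span_id
                         AND  $__timeFilter(k.start_time)), 0) AS self_ms
  FROM   span s
  WHERE  $__timeFilter(s.start_time)
)
SELECT service_name || ': ' || span_name                AS operation,
       avg(self_ms)                                     AS avg_ms,
       approx_percentile(0.95, percentile_agg(self_ms)) AS p95_ms
FROM   span_self
GROUP  BY 1
ORDER  BY 2 DESC;
```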
So we have a Grafana dashboard now, and we've used nothing but traces — no metrics, no logs. We started out with a very simple query and built upon it, so even the most advanced queries in here are understandable, and I think you get a lot out of this. Any questions before we move on?

Yes — the question is about sampling. Remember, we've got the collector in our stack, and right now I've got sampling turned off entirely. Theoretically, if you were sampling, the throughput panel obviously isn't really throughput anymore. If you knew you were doing, say, 10% sampling, you could maybe multiply it and scale it up to get an approximation, but it wouldn't be exact. The slowest-traces panel would give you the slowest of the sampled traces, so the more sampling you've got, the more imprecise these become, but they should still be representative. You do have to be careful, though: you wouldn't want to label that panel "throughput" if you were sampling, because that would not be correct. Any other questions? Let's go back to where we were.

We're going to switch gears now and build a second Grafana dashboard that, instead of looking at the time series, looks at the tree structure. This is powerful because you can't really get it with metrics — that's the beauty of traces: you start to get a topology of your system. So that's the territory we'll venture into now. Get this picture back in your mind: each span can have zero or more children, the root span has no parent, each span has a parent span ID that points to the span ID of its parent, and the root has a null parent span ID. If we were to query a given trace — using an explicit trace ID here — and start pulling out the span IDs and the parent span IDs, this is what it would look like: just a list of numbers. I don't know about you, but my brain can't make sense of that; it doesn't really tell me anything. But we can use SQL to make sense of it.

To do that, first I've got to make sure you know how to count to 10 in SQL. You may think: SELECT 1 — all right, I've counted to one, that seems easy enough. If I want to count to two: SELECT 1 UNION SELECT 2 — I've counted to two, great. I can continue that pattern to 10 and I've got 10 rows that count to 10. Another way would be to use the VALUES clause. Hopefully we can all admit that these are gross; nobody wants to do this. And if you're familiar with Postgres you may know the generate_series function — we're going to consider that cheating for now; that's the nice way to do it. If we wanted to count to a million, nobody's going to do it the UNION way; that's just gross.
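Spelled out, those look something like this (trimmed to three rows each for space):

```sql
-- Counting by hand: gross.
SELECT 1 UNION SELECT 2 UNION SELECT 3;          -- ...and so on, up to 10

-- The VALUES clause: still gross.
SELECT n FROM (VALUES (1), (2), (3)) AS t(n);    -- ...and so on, up to 10

-- generate_series: the "cheating" way.
SELECT n FROM generate_series(1, 10) AS n;
```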
So how can we do it? Recursion. Is everyone familiar with recursion? This is a lot of text, but recursion is when a body of code executes itself to solve its problem: a function can call itself, a query can call itself. It's a way of looping without a for loop or a while loop — taking a big problem and breaking it into smaller instances of the same problem in order to solve it. You may or may not have known that SQL can be recursive, and the syntax looks like this. WITH introduces a common table expression, and you use the RECURSIVE keyword to say this is a recursive common table expression, aliased here as x. The pattern is a query UNIONed with another query, and the second query references x — it's referencing itself. The first query is the initialization step, where the recursion starts; the second query is the iterative step; and finally, outside, you reference the results of all of the recursion.

So to count to 10 in recursive SQL: you have the WITH RECURSIVE CTE, you start with 1, you UNION ALL, you refer to x — the previous iteration — take the previous iteration's number and add one to it, and that's this iteration's number; finally you return the results. But there's one problem with this query, and it's the first rule of recursive SQL: don't forget to stop. If you ran it as written, it would either run forever or eventually throw some sort of error. So what do we add? A WHERE clause in the second part of the recursive CTE — where the number is less than 10 — and that lets us stop iterating and get exactly what we want.

That's just counting, but you can use recursion to explore tree structures — and graphs, for that matter. So we've got this picture in mind, and this is how we'll start our recursion: we find a span in each trace to start with. We filter the spans on the time window, on a service, and on a span name, and we get a bunch of spans of that operation in our time window — that's where we start in the trees. If we want to walk upstream in the tree structure, we use the results of the previous iteration: same trace, and this iteration's span ID is the previous iteration's parent span ID. Then we go one more level up: again, this iteration's span ID is the previous iteration's parent span ID. So we start here, we go up, and up, looping recursively until we stop — which is when the parent span ID is null and there's nothing left to match on. You can do the same thing in the opposite direction, downstream, and it's really easy: take the same query and swap that one relationship, so you look for rows where this span's span ID is the next span's parent span ID, and you walk downstream.

Why would you want to do this? You might say: I'm working on service X, we just got pounded in production, and we don't know where it's coming from — then you want to look upstream: who's sending me requests, and why are they sending more? Or: I'm responsible for service X, I'm getting the same workload, but requests are running more slowly — why? Then you may want to look downstream and say: something in my stack is running more slowly than it was before; show me everything that's downstream of me.
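As a quick recap before we build the panels, here's the counting example written out in full — the upstream and downstream trace queries follow exactly this shape:

```sql
-- Counting to 10 with a recursive CTE.
WITH RECURSIVE x(n) AS (
    SELECT 1          -- initialization step: where the recursion starts
    UNION ALL
    SELECT n + 1      -- iterative step: builds on the previous iteration
    FROM   x
    WHERE  n < 10     -- first rule of recursive SQL: don't forget to stop
)
SELECT n FROM x;
```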
All right, so we're going to build a table that looks like this. Go back into Grafana, back into the demo folder, and swap over to workshop 2. In workshop 2 we've added these two pickers up here: you can pick the service and you can pick the span name within that service, and it filters the whole dashboard on that selection.

OK, don't freak out — we're going to break this down. This right here is where we start: we look at the span table, filter by the time window, and filter by $service, which ties us to the picker for the service name, and $span, which ties us to the picker for the span name. We get the trace ID, span ID, parent span ID, service name, span name, and duration, and we start a counter at zero — that's where our recursion starts. Then, right here, for each iterative step we look at the previous step where we're in the same trace and this step's span ID is the parent of the previous one — so we're walking up the tree — and we increment the distance by one. So distance 1 is the direct caller of the span we selected, distance 2 is the parent of that, distance 3 is the parent of that. We pull all of those out, we throw out distance 0 — the one we picked — and we compute the P99 and P95 percentiles. Go over here, go into the panel, enable this query, hit apply, and if it's set to six hours, change it to five minutes. We're going to look at the digit service — we've said a lot about the digit service — and the "/" span name, which is where requests come into the digit service. We can see that upstream of "/" we're being called by the generator service and the lower service, we're being called directly by these HTTP GET spans, and indirectly by these operations up here. I'll pause for a second — any questions?

Now we do the same thing, but downstream. This is the exact same query; the only thing we've done is swap that one relationship. Drop over here, edit, enable it, hit apply. So now we know what is calling this operation and what it is calling — upstream and downstream. But my problem is I still can't really visualize this: it's a table with some numbers in it, but it doesn't paint a picture. So let's paint a picture. There's a panel in Grafana called the node graph, and that's what we're going to use. A node graph requires two queries: one query to identify the distinct nodes in the graph and one query to identify the edges. We'll build a picture of the upstream spans, and we'll start with a query to identify the nodes that are upstream. Each node in this picture is an operation, and each edge is a call. We take the same recursive query we had before for upstream, but we do a DISTINCT on the concatenation of the service name and the span name to give each node an ID — the Grafana widget needs an ID for each node — and we use the service name and the span name as the title and subtitle. You can't really see anything yet, but go ahead and enable it: there are two queries here; go to the first one, which has a comment that says "nodes," enable that one, and hit apply. Now you see we have nodes, but we don't have any edges.
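Before we wire up the edges, here's roughly what that recursive upstream walk looks like — a sketch using the dashboard variables $service and $span, with the same assumed column names as earlier; the dashboard's real query differs in details:

```sql
-- Walk upstream from the selected operation: dist 1 = direct caller,
-- dist 2 = its caller, and so on, until we hit root spans (null parent).
WITH RECURSIVE x AS (
    SELECT s.trace_id, s.span_id, s.parent_span_id,
           s.service_name, s.span_name, s.duration_ms, 0 AS dist
    FROM   span s
    WHERE  $__timeFilter(s.start_time)
      AND  s.service_name = '$service'   -- Grafana dashboard variables
      AND  s.span_name    = '$span'
    UNION ALL
    SELECT p.trace_id, p.span_id, p.parent_span_id,
           p.service_name, p.span_name, p.duration_ms, x.dist + 1
    FROM   span p
    JOIN   x ON p.trace_id = x.trace_id
            AND p.span_id  = x.parent_span_id   -- this span is the previous one's parent
)
SELECT dist, service_name, span_name,
       approx_percentile(0.95, percentile_agg(duration_ms)) AS p95_ms
FROM   x
WHERE  dist > 0            -- drop the spans we started from
GROUP  BY 1, 2, 3
ORDER  BY 1;
```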
So let's get the edges in here. It's the same query as before, although the recursive part has changed a little bit: with an edge it's not enough to know one span, you need to know two — the start and the end. On the first iteration we don't actually have enough information to draw an edge, so we start with a null child service name and a null child span name. On the second and following iterations we know the prior iteration and this iteration, so we're able to draw a line between the two. You'll notice we've got p for parent — the parent service name and the parent span name — and x is the previous iteration, so that gives us the child's service name and the child's span name. The Grafana widget needs an ID for the edge, plus the ID of the source node and the ID of the target node; with those three pieces of information it knows how to draw it. The source and the target are the concatenation of the service and span names, as before, and now we concatenate all four of those elements to give the edge an ID, and we do a DISTINCT on it.

All right, let's draw this picture: go to the second query, which gives us the edges, and zoom out a little bit. So we've looked at the last five minutes — which we know from the previous dashboard is probably between one thousand and twenty-five hundred traces — we've found the spans that start at the digit service's "/" operation, we've walked every single one of those traces upstream, and we've found the distinct call paths that came from outside the system and went through that digit "/" span. Now we've got a picture of what all of those look like. We can see this is the one we filtered on; we can see it's being called by the lower service and the generator service; we can see the lower service is in turn called by the generator service; and we can see that inside the generator service, this generate operation is where the request entered the system. So now we have a dynamic, real-time map of not just our service dependencies but the operations within those services — it's like an x-ray of all the services at once. And there's actually an Easter egg here: you might say, why on earth is the lower service calling the digit service? There's zero reason why — it's a purposeful bug for us to find, dig through, and go fix. Questions?

OK, let's do the same thing for downstream. We do the nodes first — it's the same query as before, we've just swapped that one relationship to go downstream instead of upstream. Go over to the dashboard, edit this panel, go to the first query, enable it. We've got a second query for the edges — again, the exact same query as the last one except for that one relationship change — hit apply. So now: we filtered on digit "/"; this shows us the map upstream of it, and this shows us the map downstream. We know that digit "/" calls digit process digit, digit random digit, and digit render digit, and we also know that process digit calls extra process digit. Between the two of these, for the last five minutes — somewhere between one and two thousand traces — you now know, for every request that went through the digit "/" chunk of code, the distinct call tree above it and below it. Everything that went through it. That's pretty cool.
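For reference, the "nodes" half of the downstream graph might look like this sketch. The node graph panel wants an id, a title, and a subtitle per node; check the panel's documentation for the exact field names it expects, since those (and the column names) are assumptions here:

```sql
-- One row per distinct operation reachable downstream of the selected span.
WITH RECURSIVE x AS (
    SELECT s.trace_id, s.span_id, s.service_name, s.span_name
    FROM   span s
    WHERE  $__timeFilter(s.start_time)
      AND  s.service_name = '$service'
      AND  s.span_name    = '$span'
    UNION ALL
    SELECT c.trace_id, c.span_id, c.service_name, c.span_name
    FROM   span c
    JOIN   x ON c.trace_id = x.trace_id
            AND c.parent_span_id = x.span_id    -- swapped relationship: walk to the children
)
SELECT DISTINCT
       service_name || '|' || span_name AS id,
       service_name                     AS title,
       span_name                        AS "subTitle"
FROM   x;
```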
So if we combine the previous dashboard and this one: the previous dashboard showed us there was a performance issue and showed us the exact operation where that issue was; now we can come over to this dashboard, filter on the operation we identified, and know exactly what's upstream of it and exactly what's downstream of it. So we know, if we optimize this thing, what it's going to impact — hopefully for the positive. Any questions? We're at just under an hour, so we've got about 15 minutes. Any questions about the workshop, the demo, the queries? Also, any thoughts — after seeing this, are there things you think we could answer with SQL and traces?

Yes — for the recording, the question is about the node graph and the recursion, and whether there are any loops in it. The nature of the traces, at least in this system, is that none of our requests go through the same chunk of code twice, and within a given trace each span has exactly one parent, so I don't see how you could have a span calling a span that has already happened. By the nature of the data, I think we're not in danger of infinite recursion. Right — and we're doing the DISTINCT; let's go back — you'll see right here we're using DISTINCT, so we're finding all of the distinct paths. It's not necessarily the case that any one request follows all of this — in fact it isn't: some go down this path and some go down that path — but we're building a picture of all of the distinct paths.

Yeah — so, this widget, I'm not super familiar with it. I know there's a way to color the nodes, so I think you could compute the throughput or the latency like we did before and use that to color them. I'd love to see a way to color the edges, or to have something impact their size, but I don't think this particular UI widget can do that. Right — send in some feature requests to Grafana.

What else? Promscale is open source; you can download it for free, and we have an associated APM dashboard for Grafana that's also free — we don't really have any enterprise features. And that's what this is about: you download it, you've got this data in the database, and unless you're strong with SQL and have some imagination, you don't really know what to do with it. I actually built this for myself: I'm an engineer on the team building Promscale, and I needed some demo data to play around with, to make sure my schema worked and that these queries actually worked — that's where it came from. Just Postgres, yeah.

Let me go back — the question is whether we have a Promscale exporter. What we have is the Promscale connector, which is written in Go, kind of like an agent: you have your traces forwarded to that, and it knows how to load them into the database. It also supports Prometheus metrics, so you can have metrics and traces forwarded to Promscale and put into the same database. What I haven't done here — and I hope to expand on this — is put both in the database, metrics and traces; then with SQL you can correlate the two and get even deeper insights. As for Jaeger: the Promscale connector implements the API that Jaeger needs to get traces.
So actually, right here, when you look at Jaeger in the system you're running, it's going through Promscale to get these traces to visualize — Promscale is the backend for Jaeger, and you can configure Jaeger that way. And obviously you can connect straight to the database with any Postgres driver, so if you prefer Tableau or some other data analytics tool, that's fully supported; it's standard Postgres, and SQL queries work with all of it. Metrics and traces are in the same database, in different tables and different schemas — although the schemas may change soon, we're kind of bringing them together — but you'll have certain tables that the metrics land in and certain tables that the traces land in, and then based on the metadata you attach to either, you can correlate the two. We have exemplars, yes. And on the roadmap we have plans for OpenTelemetry logs, so logs alongside metrics and traces, but that's all down the line. Good questions. Anything else? Any ideas — what else would you like to see?

That's a really good question. The question is: we give you a way to store the data, and with SQL a way to query it, but what's in the box versus out of the box? That's the beauty of SQL — it's a Turing-complete language for data analytics. The cool thing is we're giving you the data in a form, with a language to go with it, where you can answer whatever question comes to your mind — but you do have to know SQL. Part of the answer is the APM dashboards, where we give you Grafana dashboards that are already pre-built. I don't know what we might come up with in the future, but for now it's workshops and demos and blog posts, and we can keep talking.

I just wanted to put this up because we have a little Google Forms survey — it's like three minutes — so if you don't mind going in and giving us some feedback, that'll help us figure out what to build. Anything else? So now you have a running, trace-generating playground on your system: take it home and play around with it. If you have any questions, Timescale has a community Slack; jump in there, there's a Promscale channel, and you can find me or anyone else with the little tiger logo next to their name. You can also hit us up on the GitHub repo for the demo — file issues, contribute, that'd be great. Thank you — thank you very much.