to meet up. Thanks for joining us on a Saturday morning. So we have with us today Ananya Agarwal from Grafana, who is our first speaker. He will be presenting a talk about Grafana Tempo, which is a new tracing platform in Grafana's repertoire. Ananya has been working on OpenTelemetry and other tracing platforms. Actually, yes, he has been a maintainer with Jaeger also. So he has been working with tracing platforms for quite a long time, and, if I may ask, Ananya, is this your creation? Is this something that you conceptualized and came out with? Definitely not, I would not take the credit for that. Fair enough, but you are one of the maintainers, and it's always better to hear about a product from the horse's mouth, right?

And as a second speaker, we have Ankit Nayan from SigNoz. Now, SigNoz is an open source telemetry platform, which is possibly going to cover all three pillars in future: traces, logs and metrics. If I'm saying it correctly, Ankit, currently it has metrics and tracing; tracing is at the core and metrics are built on top of the traces, right? I have personally been fortunate to work with Ankit during 2020, during the whole lockdown period, and had some contributions towards that. It's a very interesting platform that Ankit and Pranay both are making, and it would be really nice to hear from them in the second talk how they planned it, what the whole progression and iteration of making this product was like for them, and what it actually takes to make an observability platform from scratch.

So, yeah, with that, I think... Do you want to also hear from the audience? Let's wait for like two, three minutes before we start, Ankit. Sure, we are at 11:02, so we can just wait till 11:05 or 11:10. Yeah, we have some folks joining in, right? So if anybody wants to tell us what they are doing, what got them interested in observability, and how they heard about us, that would be great. Yeah, great idea, Pranay. So, yeah, if anyone from the audience who has joined us on Zoom wants to share... let me just call the names. Pramod, do you want to share anything? Sure. Or if you can... I don't want to do a roll call, but it's just easier, because otherwise people will wonder who should talk first. Yeah, that's fine. Can you hear me? Yeah, I can hear you. We can hear you. Yeah, great.

So, yeah, I work as an independent consultant in data engineering and data infrastructure. I've worked for companies like Uber, Yahoo and Capillary in the past. Observability kind of comes as a requirement over there, because you need to make sure your systems are up and running. So that's why I was curious to see what this talk is all about. Interesting. And how did you hear about us? I think I got a message on some meetup thing, if I'm right, or maybe Twitter somewhere. Cool, man. Welcome. Thank you. Good to hear. Aakash? Aakash Mishra, if you want to share something. Hi, guys. Okay. Yeah, I'm kind of new to this observability thing. The employer I work for, we are running into some issues where we don't know what's happening with our applications, since it's our first time deploying on Kubernetes and stuff. So, just here to see how people do it. Awesome. Sreejan? Hey, no, man, you will have a whole talk to yourself. Hey, hey, everyone. So, I work at this startup.
I handle mostly infrastructure things. We are trying to bring in more observability into our architecture, and I am interested in all things observability. And I've been part of this Telegram group. Oh, nice. Yeah. So, yeah. Awesome, man. Welcome. Thanks. Sudhir, are you there? Hey, Sudhir here. Not much to share. I currently lead the observability platform at Freshworks. Oh, nice. You have a lot to share. So, I just thought, we use a lot of tools, let me just get in, see what's happening around, and try out new tools. Well, that's it, more or less. Awesome, man. We also have a Telegram group; I'll post the links during the talk, so maybe you can join us there also. I see TK Saurav also here. So, yeah. Cool.

A lot of folks have joined. I think we have around 12 people on the call and around 10 folks watching us live on YouTube. So, yeah, Pranay, would you say it's a good time to start with the first talk? Yeah, sure. And, for the audience, we take questions at the end of each talk. We have demarcated sessions, and at the end of the whole meetup we set aside 15 to 20 minutes to just hang out and discuss the stuff that we talked about. So the vibe of this whole thing is less a webinar and more a meetup, hanging out as a community. Pranay will share the Q&A link with us in the chat; I will do the same with our YouTube audience here. And as the talks progress, feel free to ask any question. We can't take all questions, but we'll select some at least and make sure the most interesting and most relevant questions get answered by the speakers. Cool. With that, I think it's good to mute ourselves and hand over the conversation to Ananya, who is going to talk about Grafana Tempo, something that he built at Grafana recently. Cool. So, over and out to you, Ananya, please.

Thanks, Pranay. Good morning, everyone. I will just go ahead and set up by sharing my screen. Should be up in a second. Yep, you can see it. Cool. Let me also make sure I can see all of us, because I want to be talking to my slides. All right. Cool. So, welcome, everyone, and thanks for joining on a Saturday morning. I wanted to have this meetup at 10, and then I spoke to Hashfire and was told that waking up at 10 a.m. on a Saturday would be a herculean task, so we moved it to 11. I've also been asked to make this super interesting, because Pranay is going to zone out after 30 minutes, so I'll try and do my best with that.

I'm Ananya Agarwal. I'm a software developer at Grafana Labs, and today we're going to talk about Grafana Tempo and our path to this amazing one million spans per second ingestion rate, and how we got here. We'll talk about the journey of its inception, what we felt was fundamentally missing in tracing platforms, how we built Tempo, and then our progress from there. Before that, a little about me. My journey in open source started back in college when I was doing a Google Summer of Code under the LLVM compiler infrastructure project. I was big on compilers at the time.
And after graduation, I was working as a site reliability engineer, discovering platforms where I could integrate observability, and this is how I came across Jaeger. I started contributing to Jaeger and OpenTelemetry, and that's how I'm here at Grafana today. Outside of work, I really enjoy reading; these are a couple of my favorite books that I've read recently. I've also really been getting into F1, I really enjoy watching it. I put up this picture of this car because every time the McLaren drives past, I read the sponsor as Dynatrace and not Darktrace. So that's cool.

To lay down the agenda for this talk real quick: if any of this sounded like crazy stuff to you, don't worry, we'll get started with an intro to distributed tracing. We'll talk about how it integrates into our debugging workflow, how it fits into the observability stack, and how we can use it to monitor our applications. Then we'll talk about some limitations and some of the problems that we came across with the existing solutions, which sort of inspired us to design this new system. We'll build on top of that to talk about what Tempo is and its features, and we'll have a quick demo in that section. And then finally, my favorite part, we'll talk about how we've scaled to a million spans per second.

So, cool. What is distributed tracing? Distributed tracing is a way to get fine-grained information about system performance. It gives us a bird's eye view into the performance of all of the services that make up our application, and it shows us the life cycle of a request as it passes through our system. I know that both the talks today are based on distributed tracing, so it must be the new big thing; let's hope it is. Distributed tracing is based on the concept of context propagation, and this is explained in the diagram on the right. Suppose we have a bunch of microservices, A to E, and our application receives an external request. Every time it receives an external request, the edge service, which in this case is A, assigns a unique ID to this request. This service A could be doing a couple of downstream calls: maybe it authenticates the request, passes it down to downstream services, which could be querying a database, and so on. All of these services do some quantum of work, but every time they make a downstream request, they pass this unique ID along as part of the context. This is where context propagation comes in. Each service's work is recorded as a span, and a trace is just a bunch of spans stitched together with the help of this common ID that is present in each of these spans. Each of these spans is emitted and collected in the tracing backend, and together they form the trace.
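A minimal sketch of what this context propagation looks like in code, using the OpenTelemetry Go API; the service and span names here are made up for illustration, and across process boundaries the same context would travel in request headers rather than a plain function argument:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// handleRequest plays the role of the edge service A: it starts the trace,
// and the returned context carries the trace ID to every downstream call.
func handleRequest(ctx context.Context) {
	ctx, span := otel.Tracer("service-a").Start(ctx, "handle-request")
	defer span.End()

	authenticate(ctx)  // same ctx, so the same trace ID is propagated
	queryDatabase(ctx) // each call records its own span under that trace
}

func authenticate(ctx context.Context) {
	_, span := otel.Tracer("service-b").Start(ctx, "authenticate")
	defer span.End()
	// ... some quantum of work, recorded as one span of the trace
}

func queryDatabase(ctx context.Context) {
	_, span := otel.Tracer("service-c").Start(ctx, "query-db")
	defer span.End()
	// ... spans from all services are later stitched together by the shared trace ID
}
```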
So let's see how tracing fits into a typical debugging workflow. Usually my debugging workflow starts with me getting alerted: something's breaking in production and I need to go figure out what's happening. After I get the alert, I open a beautiful Grafana dashboard. This is useful because it gives me a single place to see what's going on with my application. It uses metrics; I jump to a metric view. Metrics are great because they're aggregatable. You can see the health of the different services present in your application, and from a very high level it tells me which part of my application is actually breaking down. That particular service might be seeing elevated latencies or high error rates and so on. From here, once I have this dashboard open and I know which service is breaking down, I can go and type in some ad hoc queries. It might be because there are too many open connections, or my connection pool ran out, or I have a queue and requests are just waiting in line. There could be a lot of things, and I really like to modify my queries, check, drill down and inspect which part of my application is breaking down. An observability tool should allow you to do that: type out these ad hoc queries.

After I have this bird's eye view of the system, I want to see the events that are going on in that part of my application. I want to see the logs: hey, what is my application logging, and what's going wrong? From there, I jump into the distributed tracing view, which tells me that this particular request is seeing elevated latencies, and that leads me to the performance fix or bug fix and so on. But typically, this entire workflow breaks at this point. I go through all of these wonderful debugging steps, and then at the point where I need to view the trace, the workflow just breaks down: either the trace got sampled away, or I'm just not able to query it properly, or there could be a lot of reasons. After viewing my logs, I'm unable to go to the trace for that request. I know which request is having high latencies or errors, but I'm unable to view the trace.

This is because of a couple of limitations in tracing, and we were facing a lot of these, which is why we decided to build Tempo. Let's quickly walk through what those limitations are. First is sampling. Tracing is really high volume data, and typically what most applications do is use something like a probabilistic sampler, which samples requests upfront, at random. So what is the act of sampling? It's observing just a few samples out of your entire data set. The sampled-down data set is representative of the original data set, as you can see in this figure, but it does have some information missing; it obviously loses some information, it's not lossless. And because it's probabilistic, we could be throwing away some really interesting traces upfront. The probabilistic sampler doesn't know that a trace is going to go on to have high latencies or high error rates, and so with sampling we sometimes lose a lot of interesting data which we could have saved.

Second is the cost of storage and operation. Typically, tracing backends use an indexing engine; everything needs an indexing engine. We need something like Elasticsearch or Cassandra, which will tell us: hey, these are the traces for this particular service, this particular endpoint, and so on.
But these use block storage devices, which are costly both in terms of storage and in terms of operation, because they're just bulky to use, and managing a large Cassandra or Elasticsearch cluster... I'd like to know how many of us have had good experiences with that.

Third, there's the operational complexity of current tracing backends. Part of the problem is that we have all of this metadata coming out of the application, like which cluster it's running in, which namespace, the pod name and so on, and this metadata is repeatedly being re-indexed in metric backends and logging backends as well as in tracing backends. I wonder if there was a way in which I could reuse all of this indexing information present in the current systems and then discover the trace present in Tempo. And lastly, there's also very limited search. For instance, if I wanted to search based on a time range as well as a regex on an HTTP URL, it's really hard and really expensive; the query would scan a lot of indexes.

So we wanted to make trace storage super simple, and that's why we came up with Tempo. We obviously ran out of names, Thanos was taken, Loki was taken, and so we had to go with Tempo. So that's what we did. Tempo is a horizontally scalable, high volume, multi-tenant tracing backend that's cost effective and easy to operate. Well, that's a lot of terms, and we're going to dig into what each of these means. But first and foremost, that's the repository; it's open source. There's documentation, there are all sorts of beautiful guides that we've written for integration with existing observability tools and so on. Please do check it out.

So firstly, I mentioned it's easy to operate and cost effective. What does that mean really? It means that Tempo has very minimal dependencies, and even in a full blown, high ingestion rate environment, the only dependency we have is on a cache. Even that is optional, though we recommend it; we'll talk about how the cache comes into play. If you're running a small scale deployment, you can run it as a single binary: your application can send traces directly to the Tempo single binary, which is just dot slash tempo pointed to a config file and pointed to an S3 bucket or GCS bucket, and it should just run. Finally, the Grafana UI directly queries the Tempo binary and renders the traces. It uses object storage as a backend, which is super cost effective; compared to block storage devices it's a lot cheaper, and that's how it keeps costs down.

What this means is that we've made a trade-off in terms of search: Tempo is a key-value store only. Given a key, which is the trace ID, it can retrieve the trace JSON, but at the moment we do not have any other search capabilities. We'll talk about how we integrated with other observability tools to work around that. The key we're indexing, the trace ID, is a really high cardinality label, and we'll talk about some of the tricks we use to index it. If you ever navigate to the Tempo repository on GitHub, there is a folder called tempodb. We wrote it to be a generic key-value store built on top of object storage. You could plug any high cardinality label into it; maybe one day it'll be used to store images, we don't know. Today we've built it to work for traces, but tempodb can be used as a generic key-value store. So, cool.
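To make the "key-value store over object storage" idea concrete, here is a rough sketch of that contract in Go; the type and method names are illustrative, not Tempo's actual tempodb API:

```go
package main

import "context"

// TraceID is the only key Tempo needs to look up; it is a very high cardinality value.
type TraceID [16]byte

// ObjectStore abstracts S3 / GCS / Azure Blob Storage: flat blob reads and writes.
type ObjectStore interface {
	Put(ctx context.Context, key string, data []byte) error
	Get(ctx context.Context, key string) ([]byte, error)
}

// KVStore is the trade-off described in the talk: no search, just key to value.
type KVStore interface {
	// Write records a value (for example a marshalled trace) under its ID.
	Write(ctx context.Context, id TraceID, value []byte) error
	// Find returns the value for an ID by scanning blocks in the object store.
	Find(ctx context.Context, id TraceID) ([]byte, error)
}
```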
Second is the integration with existing observability tools. Grafana comes built in with a Tempo data source; you can query Tempo natively and it will render the traces in Grafana, you do not need any other visualization. This was built in collaboration with the Jaeger UI team. For Prometheus, the upstream client has support to record exemplars for histograms; we'll see a demo of that in a moment, and if you haven't heard of exemplars, I'll show you what they look like. Prometheus also has support for in-memory exemplar storage, which was recently merged, kudos to a huge team effort in getting that across the line; this is upstream now and available. Finally, in Loki, the Grafana data source for Loki has what is called derived fields, and we'll talk about how that helps in jumping from logs to traces. It also has this really advanced query language, LogQL v2, which helps in filtering and drilling down to the traces that we're super interested in.

Third, we have consistent metadata. What this means is that if your application is running on a container orchestration platform, for instance Kubernetes, and you're using the Grafana Agent to ingest traces from the application and send them to Tempo, then the Grafana Agent uses the Prometheus service discovery mechanism. It polls the Kubernetes metadata API to extract information about the different pods that are running in the namespace, and when it receives traces from these particular pods, it does a lookup on the metadata API, fetches all of the metadata like cluster and namespace, and attaches it as tags to the spans being ingested from the application, before sending them to Tempo. The Grafana Agent, by the way, natively supports the Zipkin, Jaeger, and OpenTelemetry formats, so it's a drop-in replacement. This consistent metadata helps us to transition easily between the different telemetry signals of metrics, logs, and traces.

And this is consistent metadata in action. It's a split view in Grafana; on the left, I'm viewing a trace, and as you can see, I have some metadata attached to the trace, all attached as tags to the spans. There's this little button here that says logs for the span, and when I click it, it opens a split view in Loki, where all of this metadata, well, it's configurable which labels you choose, but this metadata is translated into query selectors, and you can view the logs for that particular service in the same time range. This is super useful for jumping from traces to logs: you go to a trace, you see that this service was having elevated latencies, and now you want to view the logs for it, and you can directly jump into them.

Finally, all of the services that make up Tempo are available as containerized applications, and we can run them on any container orchestration platform. Each of these is horizontally scalable, so you can scale them out as and when your workload grows and expands. That is super helpful; you're not restricted to vertical scaling or anything. And yeah, okay, this is a little bit.
We also support Azure, but yeah, we'll get to that. So, cool. We built all of those cool things, and it got a really great response in the community. Kelsey actually tweeted saying Grafana is out here flexing with its new tracing backend, it integrates natively with object stores in the cloud like GCS and S3. And so did Yana, who mentioned that there's great momentum on the distributed tracing side, with a lot of companies launching open source distributed tracing products. This is great.

And with that, we will skip to a demo of Tempo. Everything that I'm running is available online at this link, so let me just navigate to it. Cool. This is a Docker Compose demo which is designed to show all of the capabilities of Tempo, and I already have it running, so we're just going to jump in. This is a toy application that we made to showcase the different features of Tempo; it looks very similar to another orange site, and we also put in some funny things up here in the links. You can interact with it; it's just a demo application. For the purposes of this demo, let's imagine that this is my application and I've been paged for it.

So the first thing I do, according to our debugging workflow, is to open our dashboard, and I see that I have this already in place. We have this intuitive layout for the dashboards, which we call the RED method: rate, errors and duration. You can see that we have three services, load balancer, app and DB, and they're called in that order: first the load balancer, then the application, and then the DB. And I see that something's happening at the application level: my latencies are randomly spiking, and some of my requests are also giving 500 errors. So now I want to drill down and figure out what's happening in this application. The first thing I do is select this region that's actually showing errors, and once I have the time range selected, I go into Explore mode. Here, as we discussed, it helps me fiddle around with the query; maybe I can add container CPU and other metrics and see if they correlate with this degraded performance and so on. For now, I'm not going to do that, because I know exactly what's going wrong. But okay, let's move on.

After this, I want to see the events: what is my application logging? This seems to be happening, but there must be some indication of what's going wrong in my application. So the easiest thing to do is to jump to the Loki data source. And if you notice, the selector was retained as I switched data sources, as well as the time range from when I selected the duration, and the query panel. So now I'm viewing logs for this particular service in that time window. This is super useful, and I see that, okay, something is going wrong, I have a broken pipe. Now, what if I wanted to see a trace for this? So now we're going to see a demo of LogQL v2, which is a super powerful query language in Loki, and what it allows us to do is parse this message out.
Logfmt is a log format which says that your message is logged as key=value pairs. There are a number of parsers available; there's JSON as well. Your log line could be logged in any of these formats and Loki will parse it out for you, and what this allows us to do is make each of these fields queryable. So, for instance, let's take this log line: it has the level, it has the status, it has the duration and it has the trace ID. What I'm going to do now is, once I parse it out, I'm going to say show me level equals info, so it filters by info, and show me status equals 500. Cool, now I have all of my requests which actually threw a 500, and now I can click on this and it will show me the trace in Tempo.

This is the derived fields feature in the Grafana data source, which allows us to link an existing field to an external data source. What I did here was link the trace ID field to the Tempo data source, and when I clicked on it, it opened the trace window in a split screen. Now I can expand on this, check the tags, and see what went wrong; it's likely an issue with the database. Okay, so I've figured out that it's not just the application, it's actually happening at the database level, and so on. I can use this to drill down and figure out what's happening.

And this is not the only thing I can do; I can filter on other fields as well. So I can also go ahead and say show me duration greater than 100 milliseconds. Okay, this is ambitious, but there are a few traces which took longer than 100 milliseconds and had a status of 500. So this is really cool: from my metrics dashboard, I was able to jump to the logs, I was able to filter the logs down to the requests I was interested in, and then I was able to jump to those traces and debug them. And this is extensible, this is super powerful, because as developers we can log anything out here. We could log the size of the request response and filter based on that; you can say which client is requesting a one GB response, and so on. You could log anything and filter based on that.

Another really cool workflow that I'm excited to show is exemplars. Let's go back to this panel. I'm not sure if you'll notice, but on the left we have the old graph panel in Grafana, and on the right we have this new time series panel, and there are all these funny dots over here. Let's see what those are. These are actually exemplars. So what are exemplars? Exemplars are metadata that you can attach to a metric, and they're typically not indexed by the metric engine; it's just a really high cardinality label. It could be a customer ID, it could be a trace ID. It's extra information that you add to the metric that is scraped and stored, but not indexed. But what you can do is, when you're plotting the metric, show the corresponding metadata for it. So here I have a latency histogram, and I'm recording histograms in my client application.
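As a rough illustration of what that client-side recording looks like with the Go Prometheus client (the label names, trace ID and port here are made up; exemplars are only exposed when the OpenMetrics format is enabled on the scrape endpoint):

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var reqDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "http_request_duration_seconds",
	Help: "Request latency.",
}, []string{"method", "status"})

// observe records a latency and attaches the current trace ID as an exemplar;
// the exemplar is scraped and stored alongside the bucket but never indexed.
func observe(method, status, traceID string, d time.Duration) {
	obs := reqDuration.WithLabelValues(method, status)
	if eo, ok := obs.(prometheus.ExemplarObserver); ok {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{"traceID": traceID})
	}
}

func main() {
	prometheus.MustRegister(reqDuration)
	observe("GET", "500", "2f6a8c3d9e", 141*time.Millisecond)

	// A scrape then produces a line roughly like:
	//   http_request_duration_seconds_bucket{method="GET",status="500",le="0.25"} 1 # {traceID="2f6a8c3d9e"} 0.141
	http.Handle("/metrics", promhttp.HandlerFor(prometheus.DefaultGatherer,
		promhttp.HandlerOpts{EnableOpenMetrics: true}))
	http.ListenAndServe(":8080", nil)
}
```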
As the request latency climbs for a response, the exemplars get recorded against that particular bucket. For instance, let's look at this particular spike. Here I can see that my application was throwing 500s and the latency spiked up, and you can see that for this particular exemplar, the value was 0.141, which is 141 milliseconds, and it's being plotted at roughly 141 milliseconds. So as the latency climbs, the exemplars are plotted at that latency on the y-axis, which makes it really easy to pick which ones you want to view. Another cool thing is that I can also see the status code over here; I can see all of the labels that were attached to that metric. So I can filter down and say only show me this particular route. Like, this is the metrics endpoint; if someone is hitting the metrics endpoint, I don't care about that, show me something else. Maybe this one, it's hitting the root. Okay, so I would be interested in this. I can filter down by the method, filter down by the status code, and only view those exemplars.

And what do these exemplars allow us to do? They allow us to directly view the trace for that particular request. So when I clicked on that... okay, I should have talked about this a little earlier, but there's this little query with Tempo icon here, which we added to our data source. We paired up the Prometheus and Tempo data sources and said that whenever you see an exemplar trace ID in Prometheus, link it to Tempo. And that is exactly what you saw happening: whenever I see a trace ID, I can click query with Tempo and jump to the trace. So this is like 142 milliseconds, and here it's 146 milliseconds because this was recorded at the app level... at the app level, it should still be 142 milliseconds. Yeah, 142 milliseconds, and at the load balancer level it'll be a little more than that. But the idea is that this makes it super simple to jump from metrics to traces: we can directly check the outliers and jump to those particular traces. And from here, once I've identified the service that's giving me all of these errors, I can use this other icon which says logs for the span, and what it does is, for the time window in which the span was recorded, it opens a split view and shows me the logs in Loki. And I can directly say: oh, it was a lock timeout, there's a lock and the queries are waiting behind it, that's probably it, I need to fix that in my code. So this sort of workflow is really useful to drill down and identify the super important traces, and yeah, we're really excited about this.

So this covers most of the workflows that we have in this application. Let me just quickly check that I haven't missed anything, and I will zoom past this because I have a couple of interesting things to cover after this. So we have metrics, we have logs, and we have traces. Cool. All of this is available in this repository, grafana/tns; feel free to check it out. It's under production/docker-compose, and all you need to do is run these two commands and you should have the same setup. With that, I will go back to my slides.

As you saw, what we've really been trying to do is ease the transition between these three pillars of observability: metrics, logs, and traces.
As we saw when we switched data sources from Prometheus to Loki in the Explore view, we went from metrics to logs. This was present in Loki 1.0: with the consistent metadata that you have across Prometheus and Loki, you can jump between the metrics and logs for the same service. Loki 2.0 actually allowed us to generate metrics from logs, and we'll see if we have time to cover that at the end. From logs to traces, this is something that came with derived fields, which Grafana 7.0 introduced. And from traces to logs, we saw that in Tempo. This one should actually say metrics to traces: from metrics, we can jump to traces using exemplars. And traces to metrics is still a work in progress. Each of these pillars has its own place in the observability stack. Metrics are aggregatable, they're useful for alerting. Logs are useful because you can check the health of a particular service and see what events are happening in it. And traces are more request scoped; they're useful for debugging, where you can really drill down and see what's going wrong with the application.

So, cool. Now on to my favorite part: scaling to a million spans per second. Whatever we've discussed until now was the founding principles of Tempo, the design decisions we took in order to get here. From here on, we were focused on scale. We really wanted to make this a high volume ingest system, and we got there. So let's talk a little bit about that.

First, okay, this should have come before this, but we added support for Azure Blob Storage. So not only can you store your traces in S3 and GCS, but also Azure Blob Storage, and this was a community PR, which is really great; it didn't come from one of the maintainers. It allows you to use any Azure Blob Storage compatible backend, and as you can see from the number of reactions on it, people were really looking forward to using it. So this was great.

Okay, now I have a couple of block diagrams here to simplify how Tempo works. When we talked about this earlier, we said that all of this ingestion will dump blocks in S3. So let's talk a little bit about what exactly happens in the storage, and what the decisions and improvements were that we had to make to actually scale to a million spans per second. First, all of your ingested data is dumped in the form of blocks in the S3 backend, and we need to query it. The way querying works is there's the Grafana UI and you point it at the querier; this was the case when we launched, and now we've improved on it, which is what I'm going to talk about: how we made our query path horizontally scalable using the query frontend. When we first started out, we just had a querier which was responsible for querying all of these blocks in the backend. You have a million spans per second getting crunched into blocks and dumped into S3 storage, and you have a querier which is querying all of these blocks and returning the response to Grafana.
Now, what happens is, as and when you grow your scale, you have more and more blocks to scan, and it becomes really intensive for a single querier to go through all of these blocks and return a sub-second response to Grafana. Even though it's all in parallel, each block queried in parallel, it's still a mammoth task for a querier to go through thousands of blocks, and if you have a high query rate it gets really intensive. What we really needed was a way to shard this block space and assign smaller pieces of work to each querier, and that's exactly what we did with the query frontend. We introduced a query frontend which shards out the block space and says: okay, hey, you, querier, you only query the first three blocks, and this other querier will query the next three blocks. This query frontend sits between the queriers and Grafana, shards the block space, and assigns a part of a shard to each of the queriers so that each of them can process in parallel. And if you have a really high QPS as well, all of this processing can happen in parallel. What this really allows us to do is horizontally scale our query path: as the number of blocks in the backend grows, you can add more queriers, and the same query frontend will split the work further and further, and the queriers can each take a small part of the work, do it in parallel, and return it to the query frontend. This reduces our query times a lot. If not for this, with a million spans per second we'd probably be looking at something like 10 seconds of latency; with this, we're able to keep it to around two seconds, so it's pretty fast.

Another important feature we added was exhaustive search. What really happens is, at ingest time, we wait a configurable amount of time before we call a trace finished; we say a trace is finished when we've waited, say, 10 seconds and all of the spans have reached the ingestion pipeline. But often that's not the case: we have really long-running traces which could span minutes and so on, and parts of a trace could end up in different blocks in the backend. What we needed was a way to merge all of these partial responses when we query them, and this is exactly what we did. Now, with exhaustive search, at each step the querier will combine traces that are split across blocks: if you give it the same trace ID, it'll combine the partial traces from across blocks and pipe them up to the query frontend, which will again combine the responses it receives from multiple queriers, and then give you the complete picture. This allows your trace to run for hours, and even if you have a short-running trace, your trace could be split because of compaction, which we'll talk about in a second.
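Roughly, the fan-out the query frontend does looks like this; this is only an illustrative sketch in Go, not Tempo's actual implementation, and the merge step that combines partial traces is reduced to collecting byte slices:

```go
package main

import "context"

// Block stands in for one block of trace data in the object store backend.
type Block struct{ ID string }

// Querier searches its assigned shard of blocks for a trace ID and returns
// whatever partial trace it finds (a trace can be split across blocks).
type Querier interface {
	FindTrace(ctx context.Context, traceID string, shard []Block) ([]byte, error)
}

// queryFrontend shards the block list across the available queriers, runs the
// lookups in parallel, and gathers the partial results for a final merge.
func queryFrontend(ctx context.Context, traceID string, blocks []Block, queriers []Querier) ([][]byte, error) {
	n := len(queriers)
	shardSize := (len(blocks) + n - 1) / n // ceiling division

	type result struct {
		partial []byte
		err     error
	}
	results := make(chan result, n)

	launched := 0
	for i, q := range queriers {
		start := i * shardSize
		if start >= len(blocks) {
			break
		}
		end := start + shardSize
		if end > len(blocks) {
			end = len(blocks)
		}
		launched++
		go func(q Querier, shard []Block) {
			p, err := q.FindTrace(ctx, traceID, shard)
			results <- result{partial: p, err: err}
		}(q, blocks[start:end])
	}

	// Collect: in Tempo this is where partial traces from different blocks and
	// queriers are combined into one complete trace; here we just gather them.
	var merged [][]byte
	for i := 0; i < launched; i++ {
		r := <-results
		if r.err != nil {
			return nil, r.err
		}
		if r.partial != nil {
			merged = append(merged, r.partial)
		}
	}
	return merged, nil
}
```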
So the next thing that we implemented was level compaction. This is pretty standard; we come across this concept in SSTables and so on. If you're ingesting a million spans per second, you're flushing 600 to 700, or close to even a thousand, blocks per hour. That's a lot, and because your querier needs to query all of these blocks, it's imperative to keep the number of blocks in the backend small; we need to keep the block list really small. What we needed for that was level compaction. Level compaction, as the name suggests, defines levels in compaction: when it sees a lot of blocks at compaction level zero, it crunches them together into level one blocks, and when you have a lot of level one blocks, it crunches those together into level two blocks, and so on. As the level increases, the size of the block increases, but your block list stays small. This was really useful, so we implemented it. And as you can see, if part of your trace is in block two and part of your trace is in block four, they could end up in very different blocks, and in some cases they might never actually be combined. We also run Tempo with a replication factor of three, so every ingested span is replicated across three ingesters, and hence three blocks, since each ingester writes its own block to the backend. So it's really important for compaction to also dedupe when it combines blocks; that's another reason why we need compaction, to dedupe and to keep the number of blocks in the backend small.

Finally, we also added compression. At first, it was hilarious: we used to store all of this as uncompressed protobuf in our object store backend. To understand how we implemented compression, we really need to understand how the query path works. Each of these blocks, if you blow it up, looks something like this: it has a bloom filter, an index and a data file. The bloom filter is really useful, because if there are 5,000 blocks in the backend, we don't want a querier to query every single block. A bloom filter is a probabilistic data structure: you can ask it, does this block have this particular trace ID, and it responds with a yes or a no. But because it's probabilistic, if it says no, the trace is definitely not in that block; if it says yes, it may or may not be there. This is called the false positive rate of the bloom filter, and we have bloom filters with a 0.05 (5%) false positive rate. So if we have 5,000 blocks, the bloom filters allow us to reduce our search space to roughly 5% of that, and we only need to search that many blocks. Once we issue a query, a querier will look up the bloom filter and ask a block: hey, does your bloom filter have this trace? If the bloom filter says yes, it does an index lookup. The index is very straightforward: it just stores the offsets of the traces that are present in the data file. So it does an index lookup and then fetches the trace. What we really had was something like this: our index stores the trace ID and the offset at which the trace is present in our backend, and in the data file, at that particular offset, you have the entire protobuf blob of the trace. Now, we couldn't just compress the entire data file together, because then you'd lose these boundaries and you wouldn't know which trace is where. So we had to compress each of these traces individually, and that's what we did: if you look at the difference between these two images, you see that each individual trace blob is compressed and then appended into the data file, and you can see that the offsets have also come down, because once you compress, the size of the data goes down.
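Putting the bloom filter, index and per-trace compression together, the per-block lookup a querier does can be sketched roughly like this; the structure and names are illustrative, and gzip stands in for whichever encoding a block actually uses:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"io"
)

// Bloom is a probabilistic membership check: "no" is definite, "yes" may be a false positive.
type Bloom interface{ MayContain(traceID []byte) bool }

// indexEntry maps a trace ID to where its individually compressed blob sits in the data file.
type indexEntry struct {
	traceID []byte
	offset  int64
	length  int64
}

// Block is one block in the object store backend: a bloom filter, an index, and a data file.
type Block struct {
	bloom Bloom
	index []indexEntry
	data  io.ReaderAt
}

// FindTrace is the per-block query path described above: bloom check first,
// then an index lookup, then read and decompress just that one trace blob.
func (b *Block) FindTrace(traceID []byte) ([]byte, bool, error) {
	if !b.bloom.MayContain(traceID) {
		return nil, false, nil // definitely not in this block, skip it entirely
	}
	for _, e := range b.index {
		if !bytes.Equal(e.traceID, traceID) {
			continue
		}
		buf := make([]byte, e.length)
		if _, err := b.data.ReadAt(buf, e.offset); err != nil {
			return nil, false, err
		}
		// Each trace is compressed on its own, so the index offsets stay usable.
		zr, err := gzip.NewReader(bytes.NewReader(buf))
		if err != nil {
			return nil, false, err
		}
		defer zr.Close()
		trace, err := io.ReadAll(zr)
		if err != nil {
			return nil, false, err
		}
		return trace, true, nil
	}
	return nil, false, nil // the bloom filter gave a false positive
}
```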
Compression was really important, and when we introduced compression, we also had to do versioning. What is versioning? We have this ingestion pipeline where our ingesters are continuously writing blocks to the backend. Now, if I want to roll out compression in my environment, I suddenly roll out a new version of the ingester which stores compressed data in the backend. But when I get a query from Grafana, my querier has to scan through all of these blocks, and some blocks have compressed data and some have uncompressed data. The querier goes: what is happening here? I don't understand this format of traces; some of your blocks have compressed data and some don't. So we added versioning, which allows you to run blocks with different encoding schemes. You could have one block with Snappy compressed data, one with Zstandard and one uncompressed, and you can throw any of those blocks in there and the querier will just pick it up and query it. So this is really powerful, enabled by versioning.

Lastly, I think this is the last one and I should breeze through it: we did index paging, which is really straightforward. As and when your block size explodes, because you're compacting smaller blocks into larger ones, the size of your index starts blowing up, because now you have so many traces to keep track of that your index runs into a few MBs. And now you can't really cache your index, because a few MB across 4,000 blocks is a few gigabytes of cache. So what we did was break the index down into pages, and at query time, if the bloom filter says, hey, this block has your trace, then we just look up that particular page; we go page by page of the index, look it up, and go to the data.

Okay, cool, just the final few things. Another thing that we implemented was compression between the agent and the distributor, so between the agent and Tempo. As you can see, our network throughput just tanked, in a good way: we went from 60 to 70 MBps down to just about 16, simply by enabling compression between the agent and the distributor. This allowed for huge egress savings, reduced network costs, and better performance overall. And finally, Tempo is now a real, true single binary. When we launched it, it had to be run together with a component called tempo-query, which would translate the OpenTelemetry format of traces: Tempo stores any of these formats, you can give it Zipkin, OpenTelemetry or Jaeger format, but at query time Grafana needed the Jaeger format, and tempo-query was the component that translated between the OpenTelemetry and Jaeger formats. Now we've actually dropped that; we have native integration in Grafana, it can directly query the query frontend and render your traces. That's pretty neat.

With all of these improvements, that brings us to our current scale. These are the numbers I hope everyone's been waiting for.
So today we have around 5,500 blocks in the backend, about 18 billion traces, and 33 terabytes of compressed data, all in object storage. This is a screenshot from our internal dashboard. The one million spans per second is not just a benchmark, it's actually our real ingestion rate; we've been seeing one million spans per second consistently over the past few weeks now. We also have this really cool command line tool which can query your blocks directly and dump out this information; that's how I got these numbers.

So, how to get involved: there are all these avenues for everyone, whether you're a user, you want to contribute, you want to check it out, or you want to use Grafana Cloud for storing your data. Grafana Cloud has a free plan; we were offering 5,000 spans per second, but we're changing that to one MB per second, with free storage for three days. We have docs with integration guides, so you can integrate with existing observability tools as we discussed, and we also have guides on how to get started in different languages like Python and Node.js. The repository is here; there's an example Docker Compose setup, the one I showed you was TNS, but we also have a similar one in the Tempo repo. If you have any questions, there's the Grafana Labs Slack, and there's a tempo channel; we all hang out there, you can come talk to us. With that, thank you, and I'll take any questions. 11:45, I think I was okay on time.

Wow, exactly. Amazing. Thanks for the talk. I had a couple of gaping moments, like when you described certain components and how you handled them, really, really great. And one particular burning question I had in my head before I came into the talk was whether this is going to be OTel compatible or not, given your background as a contributor to OTel and everything. So great to hear that's the case, and great to hear that the internal format is actually the Jaeger format. The internal format is actually OpenTelemetry; we store everything in OpenTelemetry. Great, I was just going to ask why rely on Jaeger when there is the OTel format, which is supposed to come up as a standard anyway. Right, exactly. So the community is moving towards that as a standard and we felt that is the right way to go. You can send us traces in any format, but we convert them internally into OpenTelemetry; once a trace touches Tempo, everything internally is OpenTelemetry protobuf. So would we be able to plug Tempo in as a backend for the OTel Collector itself, as storage, in that case? For sure. Yeah. In fact, I think there should be a blog post on this soon: you can use the OpenTelemetry Collector to send traces to Grafana Cloud. So there's a blog post on this; you can configure the collector to send directly to Grafana Cloud or to Tempo itself, and you can just go through that. Awesome. I'm going to come back to you for this link so that I can pipe it to our community on YouTube and in the Telegram group; it will be super useful. Having worked a bit on the OTel Collector and stuff, I find this extremely, extremely useful. And personally, speaking about my own work, it's going to be great for me. So, amazing. Rene, can you ping the link on Zoom? Yeah, for sure.
Okay, let me do that. Oh my God, where is this? We'll also put most of these links up on the Hasgeek page for later archival and lookup, so that everyone attending the talk can come back and ask questions or just discuss the material. Anyway, we'll first take questions from the Slido link; let's check there. Can you pick up the Slido questions? I think I'll just ping the link to Rene. Okay, sure, then we can pick them up. Where are you going to ping it? I pinged it on Zoom. Can you see the questions? Cool. Okay, so I'll just read these out. There are five of them; let's go through them. Yeah, you can just pick them in any order you want. Sure.

So: how are exemplars created? Is it automatic? How can we make sure that each anomaly pattern has an exemplar to look at? Today, exemplars are recorded by the client libraries; I think the Prometheus SDKs for Golang and Java have support for recording exemplars in histograms and counters, if I remember correctly, I'm not 100% sure about that. So if you're using the Prometheus library, you'd normally record an observation in a histogram when you observe the latency of a request. Today that call is just replaced with a record-with-exemplar variant, and you can drop in that replacement and the client will record exemplars for you.

Just asking a tacked-on question here: what does the exemplar data structure look like? It has, of course, the timestamp, and then does it have the trace ID and also the log sample at that particular moment? Because I'm hoping it links all three things together: the metric at that timestamp, the logs related to it, and one trace or multiple traces. I actually had a doc somewhere about this that I can't find at the moment, but let me just... I mean, we can take this up later, I know this is a deeper question. I saw it somewhere, and I think there's only one trace ID added as a label to that. Is there background noise from somewhere? Yeah, I think it's from your side. I think there was some background noise from your side.

So, I'm using this amazing notepad to show you what exemplars look like. This is my typical histogram, what it looks like in Prometheus: the HTTP request duration bucket, I have some labels, it's job, yada, yada, yada, and then exemplars just add this hash and then trace ID XYZ. So exemplars get added to the... The wire format of Prometheus? Correct. And it's now part of the OpenMetrics standard as well, which is what Prometheus implements, so any OpenMetrics compatible storage backend should have this built into it. These get scraped, but they don't get indexed in Prometheus.

So this is actually another cool thing that I want to show you here. If you look at this, you have this panel that's showing latency with histograms. Let's look at this one, it's more interesting. I actually have these three, so let me delete two of them so you can see. This is the P99, and the P95 and all of them. Correct, correct. So now I have this 99th percentile histogram and you can see that exemplars are plotted. There's actually an additional step: you see this exemplars toggle? You need to enable it to see exemplars; you can see that when I disable it, the exemplars just switch off.
So if you enable this, you start getting these exemplars, and they're a completely different query path from the regular metrics: metrics are indexed according to this metadata, while exemplars have a completely different query endpoint, and this toggle tells Grafana to start querying that endpoint on Prometheus. Right, so only if we enable this will the exemplars be queried. The way I'm thinking about it is that it's like an annotation to the metric format: certain metric timestamps are annotated with trace IDs, and together the format calls those exemplars, right? Amazing. Amazing.

Hey, I think for the next question, maybe we'll ask the participants to ask it themselves if they're here. I think Akhil asked the question. So Akhil, if you can unmute yourself and ask the question, I think that would be more interactive. Absolutely. Sure, definitely. So my question was mostly around correlation, and then eventually anomaly detection in some format. One of the big problems with OpenTelemetry is the ability to have a unified wire format for storage as well as during transmission. Is it possible to have that kind of correlation capability automatically built into this particular tool? And if so, how does it work?

So, for sure, let me go back to where I presented this. Okay. To talk about that a little, and thanks for asking, Akhil, that's a great question: if you're using the Grafana Agent to receive spans from your application and forward them to Tempo, you could use the same agent to scrape metrics as well as logs. The Grafana Agent has support to scrape Prometheus metrics, it embeds part of that code, and it also embeds promtail, which is Loki's log tailing agent, so you can tail logs with promtail and send them to Loki. What using the Grafana Agent allows you to do is have the same metadata on all of these telemetry signals, so you can correlate really well. OpenTelemetry also has a Loki exporter, and the industry is moving towards converging on a standard for all of these. I think eventually this is what will happen: you'll have consistent metadata across all of these telemetry signals and you'll be able to seamlessly transition between any of them. Today, it's possible with the Grafana Agent, because it attaches the same metadata to all three signals.

A couple of follow-up questions and then I'll be done. The first question is: is this agent available as a library that we can consume in our own custom consumer? And the second one: does this agent support event emission in response to correlation events? For example, if I detect some kind of correlation event, like an anomaly, can I trigger alerts using that? That's a good question. To answer the first question, is it available as a library: it's open source and available as a container; you can drop it into any of your orchestration platforms and it should just run out of the box. Secondly, to correlate, there is a lot of work happening in this area. We have this thing called a span metrics processor, and the whole point is that it allows you to generate metrics from traces, so you can track things like your HTTP request duration and the number of errors and so on. So you can build on top of these integrations between the telemetry signals.
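A hedged sketch of that span-metrics idea: every span the pipeline sees is folded into a Prometheus histogram, which Prometheus can then scrape; the metric and label names here are made up for illustration and are not the span metrics processor's actual output:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// spanLatency is the kind of metric a span-metrics style processor exposes:
// one histogram series per service/operation/status, fed from incoming spans.
var spanLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "span_duration_seconds",
	Help: "Latency of spans, derived from trace data.",
}, []string{"service", "operation", "status_code"})

// recordSpan would be called for every span received from the tracing pipeline.
func recordSpan(service, operation, statusCode string, duration time.Duration) {
	spanLatency.WithLabelValues(service, operation, statusCode).Observe(duration.Seconds())
}

func main() {
	prometheus.MustRegister(spanLatency)
	recordSpan("checkout", "HTTP GET /cart", "500", 230*time.Millisecond)
}
```

Alerting then happens in Prometheus and Alertmanager on top of a histogram like this, for instance when a latency quantile or error rate crosses a threshold.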
But in terms of alerting, you would still need something like Prometheus and an Alertmanager; that's how we design it. So what I meant was, can this hook up to those alerting services? For example, I detect a correlated anomaly across different data streams, which can span services, but across different data streams: logs, errors, traces and metrics. Is it possible for this particular agent to send or transmit those events to third party alerting services? Okay, so that's a great question. The way we do it today is you convert all of these signals into metrics, and then Prometheus will alert on top of those metrics. There's no auto-alerting as such, not that I'm aware of at least. This space is moving so fast that I'm not really sure if it's implemented already, but as far as I know, there's no auto-alerting on top of these signals. But, for instance, if you're receiving spans from an application and you're tracking the latency of a particular span from a particular service, you can expose that as a histogram and send it to Prometheus, and then in Prometheus you can alert if the histogram value exceeds a particular threshold. That's where we imagine this working: anything that you want to alert on, you convert those signals into metrics and pipe those metrics to Prometheus. The Grafana Agent can ship them to Prometheus, and then you can add Alertmanager to alert when they cross those thresholds. Sure, thanks a lot for that response. And by the way, this is a great project. We are in a similar boat: we are building a completely greenfield observability solution on top of OpenTelemetry. By the way, I'm from Atlassian, so we are doing everything from scratch. That's amazing. Is Atlassian also getting into observability? No, I mean, it's for an internal observability platform. If you have been following OpenTelemetry's eventing model, a large part of the push came from us. That's amazing. Great to have you here, Akhil. Good that you could join us. You could join us on our Telegram channel. I am already there, that's how I found out. Oh, awesome. Really good.

Cool. I think the next question is from Pramod. Pramod, if you want to unmute and... Yeah. Pranay, I think this should be the last question, because we're sort of running out of time and I wouldn't want to keep Ankit waiting. Yeah, so let's just do the last question and then we'll move on. Yeah. So my question is probably basic. I think Tempo captures all the data, there's no sampling going on. So is it a full replacement for the existing thing, or can you just use it to try out some use cases and then expand? So, thanks for that question. We have designed Tempo to be super high scale, and we're making it as cheap as possible to run so that you don't do any sampling: you store a hundred percent of your traces. This was what I started out with, that initial workflow: you should never reach the stage where you're unable to find a trace; you always have a hundred percent of your traces. But then again, sampling is needed in some cases, because generally most of the traces in an application look almost alike, and they're uninteresting, boring old traces; I don't need to store all of them. There is a lot of work happening in the tail sampling area, which is a way to not do this random upfront sampling. You can buffer all of your traces for a configurable amount of time and then evaluate some conditions on top of them: is the trace duration greater than five seconds? Does it have this many errors? Does it have this particular API call? You evaluate all of these conditions and only sample those traces.
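To make that tail-sampling idea concrete, here is a small illustrative sketch of the decision step only; the policies, field names and thresholds are made up, and real implementations such as the OpenTelemetry Collector's tail sampling processor are considerably more involved:

```go
package main

import "time"

// Span is a simplified span carrying only the fields the sampling policies need.
type Span struct {
	Duration time.Duration
	IsError  bool
	Endpoint string
}

// keepTrace looks at all buffered spans of one finished trace and decides
// whether to keep it: errors, a specific endpoint, or a slow span all qualify.
func keepTrace(spans []Span) bool {
	for _, s := range spans {
		if s.IsError {
			return true // traces with errors are always interesting
		}
		if s.Endpoint == "/checkout" {
			return true // example policy: always keep this API call
		}
		if s.Duration > 5*time.Second {
			return true // a slow span makes the whole trace worth keeping
		}
	}
	return false // an ordinary, boring trace can be sampled away
}
```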
You can buffer all of your traces for some configurable amount of time and then evaluate some conditions on top of them: is the trace duration greater than five seconds, does it have this many errors, does it have this particular API call? You can evaluate all of these conditions and only sample those traces. But if you don't want to sample, that's the direction we're heading with Tempo. We're pushing it to its limits; we want to make it easy and cheap to store a hundred percent of your traces. It really depends on the needs of the organization, but we're here to sort of... Yeah, to hook on to that question, Pramod: Ananya actually did a talk with us some time back on his work on scalable tail sampling in OpenTelemetry, and that was his starting sort of... That is how we got connected, and that was a great talk. Tail sampling on a distributed, sharded tracing database is hard, but of course it's possible. One question from my side, Ananya: does the Grafana agent right now allow for a head sampling configuration? I know the OpenTelemetry agent does. Can I ignore, let's say, all trace samples which are 2XX — can I ignore based on HTTP code and things like that? Does the Grafana agent allow for that as of now? So the Grafana agent actually just embeds the OpenTelemetry Collector. The Grafana agent does nothing but... Everything that you need, all the receivers and all the processors that you have in the OpenTelemetry Collector, can just be pulled into the Grafana agent. In fact, even our distributor layer in Tempo has a shim that uses the OpenTelemetry Collector to convert all of these formats, from Jaeger and so on; everything gets converted into OpenTelemetry. I love how Tempo is not reinventing the wheel, but hooking into the exact technologies that are becoming standard and leveraging them to offer something more on the backend — the scalability and the ingestion rate that we actually want, right? Great to hear that. So Pramod, what I will do for you is, if you could join our Telegram group, I could redirect you to Ananya's previous talk on tail sampling, and a lot of the questions you had regarding head and tail sampling would get answered through that. Cool, awesome. Ananya, I would also implore you to kindly join our Telegram group so that folks who are here and folks who are there can ask you questions; it's great to have like-minded folks in there. It's still very curated and the conversation is still very well structured. We have reached around 70-plus members, but till now we are having a great conversation over there, not too noisy. So feel free to join; Pranay has already pasted the invite link. Awesome. Awesome. Thanks. Thanks for listening in, thanks for joining. And for everyone: if you want to ask more questions about this, Hasgeek has pointed out to us that we can use Hasgeek's comments section, the comments page, as a Q&A platform. So whoever wants to, and folks who are watching on the live stream as well, feel free to go to the Hasgeek page linked in the description, and in the comments section keep asking questions, keep putting in opinions or things you want to discuss on this topic. And as a community, we'll try to take that up and answer it, right? Yeah.
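To make the tail-sampling conditions mentioned above a bit more concrete, here is a minimal sketch of the kind of policy you could express with the OpenTelemetry Collector's contrib tail_sampling processor — duration, errors, and a specific endpoint; the attribute key and thresholds are placeholders, not anything from the talk:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans this long before deciding per trace
    policies:
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 5000      # keep traces longer than 5 seconds
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]   # keep traces containing errored spans
      - name: checkout-endpoint
        type: string_attribute
        string_attribute:
          key: http.target        # hypothetical attribute to match on
          values: [/checkout]
```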
And also, any feedback you have on how we can organize this better? Absolutely, all feedback is welcome. As I said, the way Pranay and I envisioned this whole thing is to have the discussion as a community and not as a monologue, where someone is just telling you something and you're on the receiving end — rather, to be able to discuss the approach, the products and the solutions as a whole, right? As a community. Cool. So with that, I would give the platform to Ankit. Ankit is the co-founder of SigNoz, selected for the YC Winter 21 batch, I guess, and he's going to talk about this amazing open source telemetry product that they're making, called SigNoz. Hello, folks. Thank you, Hashfire, for introducing me. I guess the screen is visible to you. We at SigNoz are building a full-stack open source observability platform. Too many words to chew on at first, so: open source — we are completely open source, that covers the first point. And an observability product — we'll talk about what observability is and what constitutes the three pillars we talk about occasionally, right? So Ananya gave a great talk going deeper into S3 and indexing and increasing the ingestion rate. I'll take my perspective from a different point: I have explored Prometheus, I have explored Grafana, I have used Jaeger, and here is how and why we are coming up with this solution that I think would be helpful for you folks to start using. So that's the context, along with how useful distributed tracing is and how monitoring alone is not the observability we should expect. We'll work these things into the talk. To start with, we'll discuss the three pillars of observability. Observability is basically finding out the internal state of your system from what you have observed as the external output. Did we lose Ankit, or is it just me? I can't... can others hear Ankit? I think we lost Ankit. Let me just ping him. Wait for a few seconds, folks, I'm just pinging Ankit. Okay. Meanwhile, Hashfire, you can fill in. Sure. Sorry, I had just gone to fetch some water for myself. I think this is an intermittent problem; we might have had either a power outage or an ACT outage — in either case, apologies for the inconvenience. Yeah, for our YouTube folks, apologies, our speaker is facing some technical issues. Yeah, I think he's facing network issues; he'll be joining back in two minutes. Meanwhile, if anybody has questions they didn't get to ask before, we can talk about, for example, how Aakash mentioned they are building out internal tracing or observability. A lot of us... yeah, Ankit is back anyway, we'll take that up after. Okay, no issues. Happens, man. Go ahead. Sure, let me see. My net just went off, I don't know why. Cool. So I was saying that the three pillars of observability are metrics, traces and logs. Metrics are something we run aggregates on to get an overall idea of how the measured quantity is going. You can measure your breath count or your footsteps; those are metrics too. In the software world we measure different things, and the RED metrics are the most prominent things to measure for applications: R is the rate at which your application is receiving requests, E is the errors your application is returning, and D is the duration — the latency your application is serving at.
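As a rough sketch of what those three signals look like as Prometheus queries — assuming hypothetical metric names like http_requests_total and http_request_duration_seconds exposed by the application — RED boils down to something like:

```promql
# R — rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# E — errors: fraction of requests that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# D — duration: 99th percentile latency from a histogram
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```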
So these are the RED metrics that every application should be able to monitor. And one of the tools we have today is Prometheus, which is fantastic in terms of giving you the power to collect and analyze the metrics that you have. Prometheus is a pull-based system: it scrapes target URLs where metrics are exposed on an endpoint, at fixed intervals, and stores them. Prometheus is really good at that, and it has Alertmanager and other pieces that give you a complete picture of how to look into metrics. Now, we have been using logs for a long time. We are in the habit of dumping a few lines of introspection into our application so that, if something goes haywire, we can just pull up the logs and figure out which line number threw which error. We usually use Elastic to store these log lines, which enables us to do free-text search and visualize some aggregates, like status codes and durations. The world of logs is moving towards being more structured: if you have free-text log lines, it becomes very difficult to index, store and later query to find those keywords. Loki, again from Grafana Labs, is good at storing structured logs — and I don't mean structure is strictly required, it can store unstructured logs too — but structuring the logs gives you huge benefits in terms of storage, because you can index based on those keys only. Tracing is relatively new in this world; people have started actively using it only in the last three to six years. Tracing is basically when you track a request completely, from the time it enters your boundary of control, as it goes from one application to another application to database calls, until it returns to the user. If you can track the complete request, with the multiple events emitted over the duration of the request as it crosses different processes and boundaries, you get much deeper insight into what happened in the request, and that helps a lot in debugging issues. Tracing also maintains chronological order, so you can pair up events: this event happened after that one. It's a complete chronological order of events. And the smallest individual unit in a trace is called a span. A span is equivalent to an event, so whenever we talk about spans or events, consider that we are talking about the same thing. These are the logos I pulled up from Google: this is Prometheus, this is Jaeger, this is Elastic and this is Loki. Now I'm going to talk more about Prometheus — how I myself explored Prometheus, and what I felt was lacking in Prometheus and in Jaeger as a distributed tracing system, both combined together. When we started working with Prometheus, things were super easy to get started with. With a couple of members in the team, we could just spin up a Prometheus server. It has a few different components: there's Alertmanager, and the expression browser UI is also there. We can spin up a few exporters that expose metrics in the Prometheus format for the Prometheus server to scrape. So it's very easy to get CPU usage, memory, storage, network, and if you're using Kubernetes, you can also plug kube-state-metrics into Prometheus.
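A minimal sketch of that pull model — a Prometheus config scraping an application endpoint and a node exporter at a fixed interval; the job names and targets are placeholders:

```yaml
global:
  scrape_interval: 15s                     # how often Prometheus pulls each target
scrape_configs:
  - job_name: my-app                       # hypothetical application exposing /metrics
    static_configs:
      - targets: ['my-app:8080']
  - job_name: node
    static_configs:
      - targets: ['node-exporter:9100']    # host-level CPU/memory/disk/network metrics
```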
So these metrics come in very handy. Prometheus has a very powerful Alertmanager, I would say, and the power to run PromQL queries on metrics, slicing and dicing the dimensions and the key-value pairs sent to Prometheus, is actually very powerful and was not available earlier in many open source tools. Prometheus has also enabled alerts as code, so that you can review them later: you can write alerts as PromQL queries, store them, version them and review them later, right? And compared to other vendors like Datadog and New Relic, custom metrics are relatively much cheaper if you are using Prometheus, because Prometheus is very efficient in terms of storage and disk. So these are the pretty good things that Prometheus gave us as an open source community. When we started working with Prometheus at scale, though, we found a few things that were hard about it. Digging deeper into application metrics was one of them. A few of the application exporters give you some of the RED metrics, like the request rate and the latency buckets of your application. But the detailed application metrics — how much time is actually spent in database calls versus external calls versus the application logic, the kind of breakdown that vendors like New Relic and Datadog give you, the ability to go deeper into the application, like interrogating a slow SQL query — those are limitations of most of the client libraries and exporters we use. Most of the standardized metrics are available: if you are using MySQL, Mongo, Redis or RabbitMQ, there are exporters providing metrics about those systems, and standard exporters for language-specific or framework-specific applications exist too. Later on, imagine you have thousands of services and many applications to monitor: you then have to manage running exporters as sidecars as well, and you have to maintain the uptime of those exporters, otherwise Prometheus won't be able to scrape those metrics. And Prometheus is not horizontally scalable: if you have to increase the capacity of Prometheus, you have to increase the CPU and memory of the machine where it is hosted. There is a way to use Prometheus better by federating it — there are child Prometheus servers and a parent Prometheus, and the parent can scrape the children and maintain data at a different granularity that can be reviewed later — but overall, Prometheus is not horizontally scalable. Prometheus stores data on local disk by default, so if you have to set up long-term storage for Prometheus you need to choose something like Cortex or Thanos, which are pretty big in terms of architecture. Next: if you see something going wrong in your metrics — let's say you're having 20% errors in your application, or you're seeing a slow endpoint — Prometheus alerted you about it, and now you need to act on it, you need to find the root cause of why it happened. That becomes difficult to do with Prometheus. Prometheus just gives you the overall aggregate view of issues, not a way to drill down deeper into them. So this is how I tried to bring in the application metrics.
I put up my own client libraries in different languages that would export metrics in the Prometheus format from those applications: finding out RPS by status code, the application's P50/P90/P99 latencies by endpoint, and the breakdown by Redis, Mongo or the application logic — that is, where a request is spending most of its time. And if it is using Mongo and Redis: how many requests per second is the application making to Mongo, and is the application finding Mongo slow or not? So if there is an increase in latency here, you can figure out that it was the Mongo query that took time and not the application logic. These are the kinds of things I tried to figure out by building my own client libraries. This is another one: if you're making external calls to services — different downstream services like a cart service, PayPal, or a user service — you should be able to know whether your services are slow or whether the downstream services you depend on are actually slow and throwing errors. This is specifically built for those kinds of detections. Now, we found using Cortex a bit difficult. You have a lot of distributors, ingesters, Consul, queriers, and you have to use Cassandra as the storage. So it's quite a difficult architecture to manage for the ROI we get, which is just making Prometheus horizontally scalable with long-term storage. And it still doesn't bring in the RCA, the root cause analysis part. So we would like to go more towards distributed tracing and figure out what happened to each and every request as it traversed different downstream services. That's where distributed tracing comes in. In today's world of microservices architecture, there are a lot of microservices and all of them are horizontally scalable. You should be able to figure out where a request actually failed: in which microservice it failed, in which instance of that microservice, whether any of the machines are down, which deployment led to that issue. This kind of information is very useful when debugging systems, and distributed tracing can definitely help you there if you use it properly. This is a sample architectural diagram from Uber, where we see a lot of services being spun up and depending on each other. A typical request goes through hundreds of services before the response comes back, and if any of the last downstream services fails, the complete request will fail, and you should be able to figure out what happened. So distributed tracing comes in very handy. What distributed tracing does is start emitting events in each part of the request's traversal through different applications, and if anything goes wrong in any of the applications, you can see it in the dashboard. This is the sample HotROD application, a simple sample ride-hailing-on-demand application — something like the Uber app, where when you click on any of these buttons a driver and a car will arrive in a few minutes. It's a sample application (I've shared the link to it as well) that is used to demonstrate Jaeger, which is a widely used tool for distributed tracing. This is the UI of Jaeger, where you see the list of different traces. A trace is a complete request, and a span is one event in that request.
So a trace would typically consist of hundreds of spans, or events, emitted as it traversed different application boundaries. For example, this request had 51 spans, or 51 events, emitted during its lifetime. It went through multiple services — customer service, driver service, frontend service, route service and so on — before it went back to the user. It took 755 milliseconds, and you can filter these traces by service, by minimum duration, or by using the key-value pairs, the tags: every event that is emitted can have a set of key-value pairs annotated on it to describe it further, and if you use those key-value pairs to filter traces, you can do that in Jaeger. Now, if you click on any individual trace, it gives you a detailed trace view diagram. This typical trace view shows the distribution of the request over its timeline: this is the overall timeline of the request, and how different events played their role in making up that complete timeline. This frontend service called the customer service at the customer endpoint, which in turn called MySQL with a MySQL SELECT statement. In parallel, the frontend service called the driver service, and the driver service called Redis at the get-driver endpoint, and multiple Redis calls were made — it's kind of polling to get data from Redis. When you click on any of these particular events, you can drill down into more detail. Here you have 13 calls to Redis, and all these calls are made back to back: when the earlier event closes, the next event is emitted, so it's polling again and again — a sequential execution of the requests being sent to Redis. And when you click on, say, this MySQL statement, you can see the key-value pairs associated with that specific event. You can see the SQL query and so on, and you should be able to figure out why it is taking that long. So it helps in debugging. There are a few patterns you should be on the lookout for to figure out what can go wrong, by reading the patterns in the Jaeger trace view. This red marker shows you that there is an error with that event, that something went wrong there. If you are trying to optimize the time this request took, you can find the longest event or span and try to optimize that. Be wary of sequential execution, this staircase pattern: things are happening sequentially, and you might not know whether it's by design or an architectural choice — a few ORMs may execute a request sequentially before returning the results, while you did not know that internally it was going to be sequential. So this pattern shows you things that are happening sequentially, and you should have a look at whether you want it that way or not. Also be cautious when you see all the events ending at the same time. If all these five events end at the same time, it might mean that some sort of timeout is happening — the connection pool is already full and all the requests fail after some time. These patterns give you an idea about what is happening with the request you are looking at and help you debug better.
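For a sense of where those key-value tags come from in the first place, here is a small sketch of application code emitting a span with attributes using the OpenTelemetry Go SDK; the span name, attribute keys and helper function are illustrative, not from the demo:

```go
package payments

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// chargeCustomer emits one span (one event in the trace) and annotates it with
// key-value tags that a backend like Jaeger or SigNoz can later filter on.
func chargeCustomer(ctx context.Context, customerType, gateway string) error {
	tracer := otel.Tracer("payments") // instrumentation scope name
	ctx, span := tracer.Start(ctx, "charge-customer")
	defer span.End()

	span.SetAttributes(
		attribute.String("customer.type", customerType), // e.g. gold / silver
		attribute.String("payment.gateway", gateway),
	)

	if err := callGateway(ctx, gateway); err != nil {
		span.RecordError(err)                    // attaches the error as a span event
		span.SetStatus(codes.Error, err.Error()) // shows up as the red error marker
		return err
	}
	return nil
}

// callGateway stands in for the real downstream call.
func callGateway(ctx context.Context, gateway string) error { return nil }
```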
So I've tried to give you an overall idea of how you can utilize distributed tracing and incorporate it into your debugging lifecycle, going beyond metrics. Distributed tracing is the next step after what monitoring gives you: monitoring gave you an alert that something is going wrong — I'm seeing 30% errors, or slow latencies of one or two seconds — and now you need to drill down deeper, and distributed tracing helps you there. This is the complete architecture of Jaeger, which helps you visualize distributed tracing data. This is your application: you include a Jaeger client library for your language, and it starts sending data about requests, and the different events related to them, to the Jaeger agent. The Jaeger agent typically runs one per machine, and each machine can have multiple applications, so multiple applications send data to the Jaeger agent on that machine, and multiple Jaeger agents on different machines send data to the Jaeger collector. The Jaeger collector processes this data and puts it into a format that can be stored in databases — typically Elasticsearch or Cassandra is used to store this trace data. There is a Jaeger query service that queries on top of this database and sends the data to the UI to visualize; the UI is the one we were just looking at. That's the complete architecture of how Jaeger works. You can have some sampling control, from the Jaeger collector to the Jaeger agent to the client, so you can decide which traces you want to store for the longer term, and you can even run some Spark jobs before storing to the database, to get a few aggregates from the distributed tracing data. So now that we have a complete picture of Prometheus, Jaeger and distributed tracing data, we felt there was a need for a better distributed tracing platform that gives you more control over the distributed tracing data. Right now, a lot of distributed tracing data goes unused: the high dimensionality of the data in distributed traces is completely unutilized. You can only see those key-value pairs on an event in the trace graph; you cannot actually use them to filter traces out and run aggregates. I'll give an example. Suppose you have different classes of customers, like gold, platinum and silver, you have annotated the request with that customer type, and you would like to maintain different SLOs and SLIs, or alert levels, for different types of customers. From the tracing data, you should be able to get aggregates after filtering on that customer type: filter on customer type equals gold and see the 95th and 99th percentile request profiles and how the latencies look. That would help a lot. The same goes for a different use case: you have payment channels, different partner gateways that you use for payments, and if one of the payment channels is failing, you should be able to filter the trace data and run aggregates to see whether there are errors in that payment channel or not. These capabilities are not there in Jaeger today, and you cannot set alerts on Jaeger. The metrics pipeline and the distributed tracing data pipeline are completely independent of each other.
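To illustrate the kind of capability being described — in pure pseudo-SQL over a hypothetical flattened span table, not an actual Jaeger or SigNoz query — what you want to be able to express is roughly:

```sql
-- p99 latency per service, but only for traces tagged as gold customers
SELECT service_name,
       approx_percentile(duration_ms, 0.99) AS p99_latency_ms
FROM flattened_spans                              -- hypothetical table of span rows
WHERE customer_type = 'gold'                      -- span tag flattened into a column
  AND start_time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY service_name;
```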
This is the complete background I tried to pull together when we were trying to build a unified platform for observability. Hashfire helped me with some of this work, in the architecture discussion and implementation. What we tried to do was pull up a metrics data pipeline using Prometheus and Cortex, along with the tracing collector, ingester, storage and querier of Jaeger for the distributed tracing pipeline, and provide a unified frontend, which we built in ReactJS, to query from both Jaeger and Cortex. Here is the client infrastructure: we have different application client libraries sending data to the OTel collector. The OTel collector is the OpenTelemetry Collector — and if you don't know what OpenTelemetry is, it's basically a vendor-neutral instrumentation framework that gives you libraries in different languages that you can use to start sending data in the OpenTelemetry format. You send data to the OTel collectors on your machines, the OTel collector sends data through a proxy where we authenticate the client, and then the data goes to the Jaeger collector and the Prometheus server. These are independent: this is the tracing pipeline and this is the metrics pipeline. The Prometheus server then has a remote-write feature which sends the data to Cortex. We run Cortex, which includes the distributor, ingester and Memcached, and it writes the data to Cassandra. The Jaeger collector, on the other hand, writes to Kafka to handle the scale; then its ingester writes to Cassandra. Cortex has a Cortex querier and Jaeger has a Jaeger querier, and both of them are used independently by the frontend to query the metrics data and the trace data respectively. So this is what we tried to pull together, but we still felt there was a huge gap: the architecture becomes very big, with many different components to manage, and the metrics and tracing data remain independent pipelines. Both metrics and tracing data — as we were discussing after Ananya's talk, right? — the tracing data has a huge set of high-dimensional data, and the metrics data also has key-value pairs. So there has to be some correlation between metrics and traces for the two to intermix seamlessly, and that can be very usable in dashboards. That's why we thought of building SigNoz: to give you a single pane of view for metrics and traces, and first of all, the powerful trace filtering and aggregation capabilities that are not there yet in Jaeger. You can also set retention rules for the data easily from the dashboard, because it provides that. One other thing we felt was that there is a huge gap between open source and SaaS vendors. SaaS gives you a lot of things out of the box, but when you try to set up Prometheus and Jaeger and Cortex to be used in an enterprise, it takes one and a half or two months to set all of that up completely, right? While with SaaS, on the other hand, you can get started within a few minutes or hours. So that's a huge gap. Open source is good for adoption and to get started with, but it is very difficult to set things up and manage them later on. We try to bridge that gap with SigNoz: we try to give you out-of-the-box features, and managing the open source stack should be less of a pain than it used to be. We did a cost benchmark on storing trace data as compared to Datadog; I have posted the link there.
So we turned out to be 10x cheaper. We use Druid and S3 storage — object storage, which is very cheap in terms of what you need to pay AWS. Now, the code we've built is completely open source, so you can self-host it in your infrastructure and keep control of your observability data. You should not get caught in compliance and regulation issues, like your data not being CCPA or GDPR compliant, and you should have better data governance, right? Tracing data can contain sensitive data that you might not want to send to vendors unless you are sure of your regulations, so this comes in very handy in industries such as fintech and healthcare, where they cannot share data with many of the SaaS vendors today. Our architecture is built on a stream processing architecture using Kafka and Druid, so it can support huge scale. We are built on top of OpenTelemetry: we ask our clients to do OpenTelemetry instrumentation, we have docs on OpenTelemetry instrumentation, and we use the OpenTelemetry Collector to receive trace data and get it ingested into Kafka. Later on, we plan to build an anomaly detection framework, which will also be open source and can plug into Prometheus or into a tool like SigNoz, to help you automatically detect some of the anomalies that your observability data can reveal. So this is the architecture of SigNoz we came up with. Your applications have OTel libraries that send data to the OTel collector. The OTel collector has a Kafka exporter that can write to a Kafka topic; the topic carries OTLP spans. There are stream processing libraries in Go, which I have used to read a topic from Kafka in real time and write the data back into another topic after doing some processing. Right now, the processing we do is to extract some metadata and flatten the data so that it can be ingested into Druid — Druid does not work on nested data, it accepts flat data. So we have a stream processor that does that, and Druid can ingest data from Kafka in real time: Druid has ingestion supervisors that ingest real-time data from Kafka, and for long-term, cold deep storage Druid uses S3, which is very cheap. We built a query service in Go that queries data from Druid and serves the frontend, which is built in React. So these are the key components we have in SigNoz: the OpenTelemetry Collector, Kafka, stream processors, Apache Druid, and the query service plus the frontend. This is the complete backend you need to install to get things working with OpenTelemetry. I would like to give you a demo of SigNoz, how things look when you actually install it. So when SigNoz is installed and your applications are instrumented and sending data to it, this is the dashboard you see. I have used the same sample services that Uber used to demonstrate Jaeger, so we have the frontend service, customer service, driver service, route service — the HotROD ride-hailing application I showed for Jaeger. On the front screen we can see the latency percentiles, the error rate and the requests per second. All of this data is being crunched out of the distributed tracing data, and we now have the capability to run aggregates on that data.
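To make the flattening step described above a little more concrete, here is a rough sketch in Go of a processor that reads span JSON from one Kafka topic, flattens nested maps into dotted keys, and writes the result to another topic; the topic names, broker address and the segmentio/kafka-go client are assumptions for illustration, not the actual SigNoz code:

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

// flatten copies nested maps into dotted top-level keys, e.g.
// {"attributes":{"http":{"status_code":200}}} -> {"attributes.http.status_code":200},
// since Druid ingests flat rows rather than nested JSON.
func flatten(prefix string, in map[string]interface{}, out map[string]interface{}) {
	for k, v := range in {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		if nested, ok := v.(map[string]interface{}); ok {
			flatten(key, nested, out)
			continue
		}
		out[key] = v
	}
}

func main() {
	brokers := []string{"localhost:9092"} // hypothetical broker address

	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: brokers,
		Topic:   "otlp-spans", // hypothetical input topic of raw span JSON
		GroupID: "span-flattener",
	})
	defer reader.Close()

	writer := &kafka.Writer{
		Addr:  kafka.TCP(brokers...),
		Topic: "flattened-spans", // hypothetical output topic Druid ingests from
	}
	defer writer.Close()

	ctx := context.Background()
	for {
		msg, err := reader.ReadMessage(ctx)
		if err != nil {
			log.Fatalf("read: %v", err)
		}
		var span map[string]interface{}
		if err := json.Unmarshal(msg.Value, &span); err != nil {
			continue // skip malformed records
		}
		flat := map[string]interface{}{}
		flatten("", span, flat)
		out, _ := json.Marshal(flat)
		if err := writer.WriteMessages(ctx, kafka.Message{Key: msg.Key, Value: out}); err != nil {
			log.Printf("write: %v", err)
		}
	}
}
```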
If you want to drill down deeper into the frontend dashboard, you can see the 50th, 90th and 99th percentile profiles, the RPS, the endpoints and their performance, and the errors you are receiving, and you can change the time window too. There's also the external calls page we are building, which will be very helpful for debugging microservices: if your application is calling external things — different downstream services, or an external payment gateway or something — and they are slow or failing, you should be able to figure that out on this external calls page, so the application owner knows whether their application is actually slow or whether it's a downstream, dependent application that is slow. If you see some spike — say you are interested in this spike here and want to dig into how things went there — you can click on it and then click on view traces. What this does is take you to the list of traces, the list of requests emitted during that period of time from that service, automatically. You can filter further by latency: if you want to see all the requests that took more than a second, you can apply that filter and the list of traces is filtered accordingly. And suppose you want to drill down into any individual request: you can click the trace ID. A trace ID represents one trace, and one trace is one request. When you click on it, you see the list of events emitted by that trace. It's the same data I showed you in Jaeger, in a different format — this is called a flame graph. This is the frontend service, it calls the dispatch endpoint, then it calls the /customer endpoint, which runs a MySQL query. So you can figure out what is taking most of the time: the complete request takes 2.1 seconds and MySQL takes 1.65 seconds, so maybe you want to drill down there. In parallel the driver service is being called; you can click on that driver service and see that multiple Redis calls are being made, and you can see the parameter that is there in the Redis call. And if there is some error, you can see it here. Like I said, these sequential calls being made to Redis — you should be able to spot those from these diagrams. So it comes in very handy when you try to drill down into individual requests. Now I'll show you the power of the trace aggregation capabilities. You can filter on a service and on a key-value pair — say customer_type equals gold — and you want to know the 99th percentile latencies. We don't have customer_type equals gold in this demo, but we have an external URL key, whose value says that this service is calling an external URL hosted at 0.0.0.0 on port 8081 — that's the host and port it calls. You can apply that filter, and now we have the list of traces filtered by those criteria. The number of traces or events here can be in the thousands or millions per second, depending on the scale the application handles. Now you need to figure out whether it is normal for them to take 1.58 seconds or not, what the baseline is here, and you want to drill down deeper into some of the requests. So how do you pick a request to debug?
Going on from here, once you have the list of thousands of filtered traces, you want to run some aggregates on them — filter by duration, or see the P99 profile of those filtered traces. So this is a pretty powerful feature. And then there is the service map, which is yet to come: all the services will be shown here with their dependencies on each other, and if any one of them is unhealthy, we will show it here. So that's a brief look at what we have built. With SigNoz we are moving towards being a complete single pane, a unified view of observability for all the metrics, logs and traces. Moving forward, the next question is: what would you need to do to get to that dashboard, right? There are two steps to start using SigNoz. First, you install SigNoz. SigNoz is your complete full-stack observability backend that collects the data, analyzes it and visualizes it in a nice UI. You can follow the installation docs to install it — it's basically very easy: you git clone the repo, go to the deploy folder and run the install.sh script. This automatically installs Docker if it is not there, plus Docker Compose and all the different components like the OTel collector, backend and UI. The script basically uses docker-compose up as the underlying command to spin things up, but it checks a few things first, like whether Docker is there, whether Docker Compose is there, and whether the versions are compatible. Once you have installed SigNoz on a machine, you have the IP of that machine that you can send data to, and you can use the OpenTelemetry documentation to instrument your application with OpenTelemetry. We have auto-instrumentation support for Java, and we have written documentation for Java, Golang, Node.js and Python. So this is a sample Java application: you just download the agent jar file from OpenTelemetry, and this is the run command you need for your own application. Normally you would run java -jar your-application.jar, but now you need to set two variables, I would say: the first points to the SigNoz backend — the IP of the SigNoz backend and the port this Java application sends the data to — and the second is the name of the application that you want to see in the SigNoz dashboard. That's all you need to do to start seeing the data. It's very easy to get started, and you can even add more of your own custom spans and events by using the OpenTelemetry SDKs. If we have some time, we can go into why we chose Apache Druid — how much time do we have? We have a couple of minutes, we definitely have some more time, we can do this; I do want to break quickly to take a few questions from people, I see a few questions on Slido. Sure, sure, let me pick up a few. How many slides do you have left? I think just the one showing them Druid, so let me finish it up, I will take two minutes only, and then I'm done. So the reason we chose Apache Druid was that we needed an analytical database that can run real-time aggregates on distributed tracing data, and it works with real-time ingestion from Kafka. So it's useful.
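In concrete terms, the two setup steps described a moment ago — installing SigNoz and instrumenting a Java app — look roughly like this, with placeholders for the host, service name and file paths; the repository layout and the agent's environment variables should be checked against the current SigNoz and OpenTelemetry docs:

```bash
# Step 1: install the self-hosted backend
git clone https://github.com/SigNoz/signoz.git
cd signoz/deploy
./install.sh   # wraps docker-compose up, after checking Docker and Docker Compose

# Step 2: run a Java app with the OpenTelemetry Java agent, pointing it at the
# machine where SigNoz is installed and naming the service for the dashboard
OTEL_EXPORTER_OTLP_ENDPOINT="http://<signoz-host-ip>:4317" \
OTEL_RESOURCE_ATTRIBUTES="service.name=my-java-app" \
java -javaagent:./opentelemetry-javaagent.jar -jar my-application.jar
```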
Druid also lets us have a single platform for complete observability: we are trying to bring the three types of observability data into one data store, and we will try to incorporate logs into it as OpenTelemetry logs mature. Druid has fantastically scalable individual components — historicals, a coordinator, and other components that help Druid manage scale and store into S3 — and these components are individually very scalable. If you want to scale just the query layer, you can do that, and if you want to scale the ingestion layer, you can do that independently. It's very cost effective to write to S3 for deep storage, and Druid has proven it can handle the scale of a real-time analytical data store at Airbnb, Lyft, Netflix and Pinterest — you can go to the Druid powered-by page and see the list of companies using it. This is how SigNoz's use of Druid looks: SigNoz uses Druid to serve those search and aggregate queries, and this is the Druid UI. The real-time ingestion is going on here from the Kafka topic of flattened spans. After the interval you have specified — right now I have specified 20 minutes — each segment, Druid's data chunk, is stored into S3 for later querying. We can run SQL queries in the Druid console and query the data; basically, my Go query service converts the query into Druid SQL and gets the data from Druid. So this is the datasource that Druid has, and the data that lives in Druid. That's it. Thank you for being a wonderful audience. You can always reach out to me at ankit at signoz.io, on LinkedIn or Twitter — I am available to discuss all things observability. Pranay, we can... Okay, I think it's a good time to mention the Slack channel for SigNoz: we have a SigNoz Slack where we can all hang out and discuss stuff. We already have one. We already have one. Oh, awesome. Great, great. So yeah, I'll pipe in a couple of questions — first from YouTube, since last time we didn't get many questions from YouTube. Satya Bhatt from YouTube is asking — he's a long-time community member. His question is: on one side you mentioned horizontal scaling of Prometheus is difficult; on the other side, to him, this looks 10x more complicated, right? You say costs are 10x cheaper, but what about the lights-on and overall maintenance cost of the platform, including Kafka and Druid? It's not cheap to run Kafka and Druid at scale, right? Kafka also needs ZooKeeper, and then you're handling ZooKeeper failures. And while he acknowledges that ZooKeeper-less Kafka is on the horizon — we have been hearing about that since last year, it should land sometime this year or the next — his question is: as of now, this architecture, once deployed inside a cluster, would still be costly and still a bit more complicated than current solutions. So when we are talking about current solutions, are we talking about other open source tools or other vendors? I'm sure it won't be costlier than the SaaS vendors, because the way they charge you is completely different. I think we lost Ankit again — Ankit, I think there is a bit of network trouble at your end, your voice is chopping up. Okay, so I was saying: compared to a SaaS vendor, this is definitely going to be much cheaper.
First of all, every company that runs a SaaS spends just 20% or 25% of its revenue on its infrastructure and tech. So in those terms, it's definitely going to be around five times cheaper if you run the same systems at the same scale yourself, even compared to Datadog. Apart from that, as I discussed, using Prometheus along with Cortex and Jaeger at the scale I was describing — a million spans per second — is not going to be less troublesome than running SigNoz. I agree that Kafka and Druid are big beasts that need some manpower to handle, but when a company reaches that scale, I guess people resort to Kafka to handle it anyway. And yeah, we are exploring that: we have thought of plugging in a different queuing system apart from Kafka, maybe RabbitMQ or Redis, as a lightweight system that can queue the data to... Ankit, would switching off the video help on your end? An audio-only feed would probably consume less bandwidth, maybe we can try that. Ankit, do you want to switch off your video? Pranay, why don't you take a question in his stead, since he's having network connectivity issues? Satya has another question, which I also asked you on Twitter yesterday: aren't you folks planning a SaaS model for SigNoz at some point? Yeah, so currently we're focusing more on the self-hosted version, because that's where we are seeing lots of interest from the community and also from bigger enterprises who are interested in trying us out. So our current focus is there, but of course, in the long-term roadmap we will have a SaaS version. Currently, if you want to use SaaS, there are other tools like Datadog and you're much better covered there, but if you want to self-host, there are no good tools today, so that's where we're focusing currently. But of course, in the longer term, we will have a SaaS product also. Our goal is that this is a product which developers use, and ideally we think that any product which people use should be an open source product, not a proprietary closed-source product. So we are aiming for that vision; let's see where it goes. Sure. So I think another concern that pops up, which Satya raises here, is that Docker Compose is not really production ready, and a lot of us practitioners would consider it production ready when we have Helm charts or operators or things like that. But of course, that's always a longer-term thing. So when could we see SigNoz production ready in those terms — when it's very easy to operationalize the whole stack together? I think we already have Helm charts; we do have a Helm chart deployment path. Beyond that, we plan to have different configs for the different scales that Kafka and Druid can handle. Today, for demo purposes, this Kafka and Druid setup can run on a 4 GB RAM, two-CPU machine, but we will be providing different cluster setups — nano, micro, small, medium, large — along with an estimate of the scale that each setup can handle. So I guess that will help. I think we should now move towards the Slido questions and stop with the YouTube side of things. I know Satya has a couple more things, and Satya, I would say join us on our channel — both Ankit and Pranay will be there and you can keep piping in more queries and questions regarding the SigNoz platform.
But we need to move on to other attendees and their questions right now. So yeah, Ankit, will you kindly bring up the Slido link and maybe pick your favourite questions from there and try answering them? Or if anybody who asked a question there is here on Zoom, it's better that they unmute and ask it themselves. I don't think so... So there's a... This is Ashok, can you hear me? I'm the one who put in that instrumentation question. So yeah, I've been working on observability for a long time. I think I have interacted with the SigNoz folks as well, in a very limited capacity. But yeah, this is the thing: we are an enterprise, I work for Cisco, so the biggest challenge is how far should one go with instrumentation versus relying on these out-of-the-box tools. One thing I find very interesting is the way things work with something like Linkerd: out of the box you get — not exactly distributed tracing, but to an extent it can be enabled. Then there are also out-of-the-box telemetry dashboards from Grafana and such. So how would you contrast that? Because instrumentation is an effort, and as you can understand, in an enterprise, rewriting stuff is hard. That is one question. But there is another question too, which I did not ask: there's a lot that is already there. Telemetry and tracing are new things, but logging is a very old thing — how does one utilize whatever we get from the logs? For the sake of disclosure, we are using Fluentd for logging, and then we ship it to Elastic and serve it from there. Sure. Go ahead, guys. Yes, so I was thinking: the work we are talking about is basically taken care of by OpenTelemetry. OpenTelemetry is trying to standardize a lot of things in terms of instrumentation — the basic frameworks, the databases, everything we use can be instrumented through OpenTelemetry — and on the other hand, OpenTelemetry provides you SDKs that you can use to instrument data specific to your business. That is definitely the way forward, because there have been a lot of cases where enterprises became vendor locked-in, because they had already instrumented a huge load of code in their applications with vendor-specific libraries. So you definitely have to look into OpenTelemetry, and I think it's the standardized way going forward. Some other companies have also been trying to extract data from logs, converting data from logs into metrics and dashboards. That is also a way, but the logs then have to be structured and standardized going forward, and OpenTelemetry is working on that too. The basic things we try to standardize in logs are extracting the status code, the timestamp and the service name with regexes — these kinds of common key-value pairs can be separated out in each log line so that they can be efficiently indexed later on and a chart can easily be drawn from them. Yeah, in our case we are using Fluentd and Fluent Bit for that, and you can enrich data with things like Logstash — probably that's something you can do as well. So, hey Ashok — I think I chopped out last time, my network is a little unstable — to add on to what Ankit said, there's a lot of work happening in the instrumentation area in OpenTelemetry land as well.
And one thing to add is that a lot of work is happening in the auto-instrumentation area as well. So if you're using applications written in Java or .NET — and I think there are a couple of others — you can directly use these auto-instrumentation agents. They sit next to your application, either as a sidecar or as a process on the same node, and they extract this telemetry information and pass it on to any tracing backend. So you can get a lot of this for free, out of the box, without having to manually instrument your code. Yeah, yeah, I'm trying those out. But my point is, enterprise things are very slow to move. Right, understandably — a lot of teams. Yeah, I am using the OpenTelemetry auto-instrumentation; it's rapidly evolving, but I think it will take some time to mature. Yeah, yeah. I see one more question on Slido. Is the person who asked it here, or Hashfire, can you just pull it up? I have one question: the trace view that we saw looked very similar to a flame graph. Is that what it's using? Because flame graphs would normally come from aggregates, and this would be a very different application for it, so it's kind of interesting. Right, it's the same flame graph that Brendan Gregg developed. Right, okay. But how can you expand it to show tags and stuff? It's there in the library... sorry, I didn't get the question. So you can click on the individual events and it shows you the details. Ananya, do you mean the visualization? Yep, yep, the trace view. So if you click in the trace view, the bottom panel shows you — Ankit, do you want to quickly show that part? Once you click on the span, there's a panel at the bottom which shows you all the tags associated with that span. Okay, okay, I think I missed that. There are just two panels where different things are shown. Ankit can show you quickly. Yeah, so the tags are below, along with all the details of that span, and you can have different tags for each span. That's cool. That's really cool. Cool. I think we don't have any more questions... actually, I think there's one more. Hey, we can catch up on that offline. So this is a quick question: say we are already emitting events from, say, Micrometer, and on top of that we add OpenTelemetry as well. What kind of latency would that add? The OpenTelemetry libraries are said to add around 13% latency to the application when added, and that would be independent of Micrometer or any other libraries you're using. Okay, okay. So it's better if we stick to one of them, once we finalize what kind of metrics we want to use; otherwise it will add up unnecessarily if the existing platform is already using some other metrics library. Micrometer is sort of from a different era of computing altogether, and what we are talking about here is an evolution of the kind of tooling that Micrometer, Zabbix and all of that represent from back in the day. These are descendant tools that have learned from the mistakes of that time and are geared up to fix those and offer more value on top. So yeah. Yeah, it is just that, you know, there is the cost it adds to legacy applications, or any application for that matter. I think you have to do the benchmark yourself: run the same code base, one branch instrumented with Micrometer and a different branch instrumented with the OTel instrumentation library.
And then you have to benchmark the latencies — P99s, P95s, whatever percentiles you care about — and compare them to figure out which one actually works for you. And to be honest, it is true that current modern infrastructure, especially CNCF infrastructure, is very layered with abstractions, and each layer adds its own latencies, complexities, extra network calls and so on. Earlier infra was much simpler, when you ran natively on bare metal with a simple instrumentation library, sometimes even shell scripts — Zabbix and such used to rely on literally shell-script-based instrumentation. So it could be the case that those setups were more performant. But then you also have to weigh the leverage you get: yes, you're losing, say, 13% on the latency end, but is the information provided by the modern tracing platforms — whether it's Grafana Tempo, SigNoz, Jaeger or any of these — the enriched information you're getting out of them, the enriched correlation you're getting through things like exemplars, helping you understand your application at runtime much better than what you had with Micrometer? You have to do that sort of SWOT comparison, weigh it all in, and then decide which is more valuable to you. That is my two cents. As an infra engineer, I would do exactly the same with anything — I'm not going to use something just because it's out there. I'd benchmark my applications with both Tempo and Jaeger and then figure out what is right for us, right? I did, yes. Thanks, yes. All right, I think we are already on time. Awesome, I think, yeah, we are at one. But I see a few more names here. Yeah, I think this is a good time — the actual talk format is over now. Folks can unmute themselves and we'll move over to the hangout zone. And I would see, you know,