Hello everyone, welcome to Cloud Native Live, where we dive into the code behind Cloud Native. I am Annie Talastau, I'm a CNCF Ambassador as well as Senior Product Marketing Manager at Camunda, and I will be your host tonight. Every week we bring a new set of presenters to showcase how to work with cloud native technologies. They will build things, they will write things, and they will answer all of your questions, so you can join us every Wednesday to watch live. This week we have a really special program because we had a last-minute cancellation, but thankfully we had speakers step up. So thank you so much to our speakers this week for stepping up on late notice; it's even more amazing to have them with us. We have Alok and Chris here to talk about using OTel distributed tracing for real-time observability. As always, this is an official livestream of the CNCF, and as such it is subject to the CNCF Code of Conduct. Please don't add anything to the chat or the questions that would be in violation of that Code of Conduct — basically, please be respectful of your fellow participants as well as the presenters. With that, I'll hand it over to the speakers to kick off today's presentation.

Thank you, thank you, Annie. Hi everyone, thanks for joining us. This is Alok from OpsCruise. Today, as you probably found out as a last-minute surprise, I'm going to talk about OpenTelemetry, and the specific topic is tracing — in particular, how one can use tracing from the operations side of things. So, to set the stage, I'm going to start sharing my screen. Annie, let me know if this is showing up.

Yes, I think that works. Now we see the screen, all good.

Perfect, okay. So again, it's going to be all about OpenTelemetry and tracing, and you'll notice the specific term I'm using: real-time observability. That's the key thing I'm going to talk about; hopefully that gets your interest. Let's get this out of the way — and I'm probably preaching to the choir because you are all practitioners — the biggest challenge for cloud native is complexity, dependencies, dynamism; you all know that. The good news is you can get all the data you want. The question is whether you have adequate insight at scale, and tracing is arguably one of the most complex beasts of them all.

Let me give you a little background on what we do and how we came into this. OpsCruise, if you're not familiar with us, is probably the one observability company that builds entirely on open telemetry, meaning all open CNCF projects. You can see we collect data without any proprietary agents or code, so if you're into OpenTelemetry, open source, and the CNCF, this should be of interest to you. For example, we collect metrics from Prometheus; for logs, the open source tooling like Fluentd or Fluent Bit and Grafana Loki, which is built for Kubernetes; for traces, Jaeger, even Grafana Tempo and OpenZipkin, which is what we'll talk about; flows from Istio or eBPF; and of course Kubernetes events and changes. And we also capture cloud metrics.
So if you look at the bottom — metrics, logs, flows, and configs and changes — all of them are being collected from open source CNCF agents, running of course in the control plane, the Kubernetes plane, and the monitoring plane, not in the I/O path. What we do to help SRE and DevOps teams is pull all of this telemetry by collecting it from the native collectors — such as Prometheus, cAdvisor running as a DaemonSet, node exporter, and so on — and combine it together contextually. So we get the full dependency map, both service to service as well as service to the Kubernetes layer to the infrastructure. Then we understand what is happening by using machine learning to model the behavior of every service, and we also do automated causal analysis. The message I'll give you is that there is no reason to go with proprietary agents and so on when CNCF and OpenTelemetry provide all of that. This is the way the future is going, and this is where you want to be — where the puck is going to be.

So the question is: we know how to handle metrics, logs, and events, right? We can see them, we can capture them. But the bigger question is, what about tracing itself? Before I get to that, this slide shows the same thing as more of a workflow. On the left-hand side are all the open source CNCF projects for telemetry. What we actually do, as I said, is build out that dependency map, learn the behavior model using contextual knowledge and machine learning, and then use that behavior model and deviations from it to detect problems, whether from events or predictively from the ML. We then use information from there to get to events or logs and look at changes, and as part of that we want to look at traces as well, bringing them all together using a decision tree for causal analysis. So that's the larger framework, to set the background for you.

Specifically for today's topic — and I'll walk through it in the demo as well — the biggest challenge that we know ops teams have, whether you're on the SRE side, tech ops, or DevOps, is distributed tracing, primarily because it's complex. I'll give you two examples of where tracing becomes hard; I hope you can see this on the right-hand side as I walk through it. One of the biggest challenges in trying to find a particular trace is how we tag them: we usually start with the root span. If you look on the right-hand side here, this is the front end — this is actually an example I'll show later, Online Boutique from Google being used for tracing. The front end has an operation called receive cart, for this e-commerce application. Now, when requests come in, they can make calls to any of these services: they can go to get quote, get cart, convert, get supported currencies — all of these services are invoked when you are filling out your cart. And after that, of course, you get a quote and the product. All the spans that flow between them — this has been collapsed — would all start from the root span, receive cart.
On the other hand, you can also have a receive cart request that comes from the front end — same root span — but goes to add item, or gets the product information, and so on. So there will be traces that traverse this graph with the same root span. If you wanted to tag by root span in order to differentiate different traces, you would have to start adding a very complex set of tags across all of these, or awkward queries. Same entry point, different path. And as you can imagine, the performance of transactions going through these different paths across this set of services is very different. So trying to find a problem trace, which starts with the same root span but takes all these different paths, becomes very hard. Imagine you have thousands of transactions per second — or per minute, or even per hour — and you're trying to find that one. This is a non-trivial problem, as you're probably aware.

The second part is: imagine you've already found a specific trace and queried your way through to it. How do you know there is a problem? Usually you end up doing a manual search or query, because of the volume of data coming at you. It could be bad code, but it could also be that the service runs on Kubernetes and Kubernetes has caused a problem on the underlying container that implements the service. So these two broad problems, while they sound simple at the headline level, are non-trivial. Those of you working with traces will find that most of the time the developers know what they're looking for and checking. But if you're on the ops side, monitoring your e-commerce application or any kind of transactional application, and you have thousands of transactions every hour, imagine trying to find the problem and figure out where it is. Mean time to detect becomes larger because of the manual process, and then resolving it — figuring out which service was called and which operation is at fault — is non-trivial. This is a well-known problem in tracing, and it's not specific to distributed tracing with Jaeger or OpenZipkin; anyone using tracing typically has to go through this process. So how can we...

Let me go to audience questions and comments. There was a question on whether this is for Kubernetes, and a request for the link to the slides at some point.

We can send that, absolutely, yeah.

Okay, that covers the second question. On the first question: Kubernetes is one example, but no — at least from the tracing side, if you're using OpenTelemetry anywhere, with Jaeger, OpenZipkin, or even Tempo, the approach I'm talking about will work. Hope that answers the question. Annie, I'm looking at my screen, so if there are any other follow-ups I'll be glad to answer.

Yeah, we'll capture them.

Okay, so those are the two big issues. There's also an issue about how you store and retrieve traces. What happened in the past with proprietary tools, as opposed to using, say, OpenTelemetry, was that the traces lived somewhere you had to pull them from. With Jaeger having an open backend where you can persist them — whether it's a shared repository in the cloud, your own account, or your own storage, it doesn't matter — they are now accessible.
So when you want to pull in a trace, you can get it. One of the big advantages of OpenTelemetry is accessibility — openly capturing traces, adding things to them, and of course accessing them. But the two problems remain: how do we discriminate the trace of interest, given the complexity and the different paths requests take? And second, how do we detect problems? That's what I'm going to talk about today.

To set the stage, I'm going to introduce what's called a trace path. Jumping ahead — if you look at the bottom right, and I'll explain it in three parts — a trace path, in its simplest definition, is a unique pattern of traces that represents a business flow. The reason we do this is to compare it against the other ways we can watch real-time performance. Let me give you one of those things we already do, and some other folks might be using, namely flows. If you're using Istio or eBPF, which we do, you have aggregated metrics, say from service A to service B to service C to service D: request counts coming into service B, response times, errors. These are aggregated over the scraping interval. So if there's a problem with the service, you won't have visibility into which transaction or which operations behind the service caused it, but if there's a problem at the aggregated level you will capture it, and we can do this in real time; it's pretty well known. The advantage of flows is that they give you the structure and the dependencies at the highest, service-to-service level, but without any view into the operations.

Now contrast this with running a trace. The actual request may come from an operation — call it A1 — calling service B, but specifically operation B1; or A1 might be calling B2, and B2 could be calling C1; B1 could call C2, and C1 might be calling D. So as you can see, you can have many different paths: A1, B2, C1, D1, or A1, B1, C1, D1, and so on. The possible combinations become complex, and of course there might be multiple repeated calls between them as well. At the transaction level this is a lot more detail, and trying to see it in real time and visualize it when there are many of them — because this is not aggregated — is hard. So think of the flow as an aggregate snapshot of real-time performance, whereas for the detail you have to go to a specific trace, pull that one's data, and look at the typical flame graph. Oops, sorry, didn't mean to go there — wrong screen. The problem is you can't watch the full detail and the aggregate at the same time.

If you want to use traces and detect problems at this detail level, what you really want is alerts in real time. And to the earlier question about Kubernetes: you don't know whether the underlying infrastructure — whatever is managing the services that implement these operations in the application — has caused the problem, like configuration limits or an under-provisioned database. What ops wants is to detect the problem, drill down, and basically have this operational view, as opposed to going searching through services after someone calls you — and we typically know how that goes, right?
A user calls you saying, my requests are failing, I can't get this transaction done. Then you go and start searching, hopefully with the request ID — after the fact. How can we make this proactive? That's where the trace path comes in.

So let's dig into what a trace path really is. Think of a trace path as aggregating at the service-operation level — remember, we're talking about business flows. Remember the paths we discussed: service A can call service B and go on to service D; there might be another path that requests come through. There are a finite number of paths that might go from, say, service operation one to service operation D. So imagine representing the trace paths of your application with the service operations as vertices, and these edges you're seeing — for example, service A operation one calling service B operation one — as the edges between those vertices. Now, of course, different traces can go through these operations: a request might come in here on T2 hop one and go to T2 hop two; another request might come from T1 hop one, go to T3 hop two, and then go down. So the same service operations can appear on different trace paths — there's no exclusivity. The way we aggregate is that we take the requests going from one service operation to another and aggregate the flow metrics on that edge: typically request rate, response time, and error count.

Here's an example. Here's a trace T1 that I've shown here: T1 hop one, T1 hop two, T1 hop three, T1 hop four — hopefully you can see this. I've just walked through one of the trace paths listed here, T1. For that path, the aggregated average duration is 275 microseconds, the maximum is this, and I can label it. I can create another trace path, because I'm watching the frequencies of the traces: T2 goes from this hop to here, and then goes there, so it's only three service-pair connections. Now how do we get this? We look at all the transactions going through, step back, and group them on their common routes. Think of it as finding the most common routes from the entry point to wherever the traces end, and grouping them, because if you think about it, most traces will follow certain patterns — that's why we call it a business pattern. If you step back and collect and build this over a period of time, you get a set of trace paths that bubble up, based on these common traces that can be grouped into them. Of course, if I change a service or a different request type comes in, the trace paths can change. So trace paths by definition cannot be static; they are dynamic, and you have to update them as new request types come in and call different operations. By doing this aggregation, we can keep a live view of the different trace paths dynamically, and from there go down to the traces that are actually running through them. Now, I've given the example of performance metrics like duration, response time, and errors — I can even get request counts — but I can also add some specificity to the trace path, as I'll come back to in a moment.
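To make the grouping idea concrete, here is a minimal sketch of how traces could be collapsed into trace paths — purely illustrative and not OpsCruise's actual implementation; the span fields and the "collapse repeated calls onto one edge" rule are assumptions based on the description above.

```python
# Illustrative sketch only (not OpsCruise's implementation): group finished traces into
# "trace paths" by the set of service:operation -> service:operation edges they traverse,
# then aggregate simple performance stats per path. The Span fields are assumptions.
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None for the root span
    service: str
    operation: str
    duration_ms: float
    error: bool = False

def path_signature(trace: list[Span]) -> frozenset[tuple[str, str]]:
    """Unique pattern of caller->callee service-operation edges; repeated and looping
    calls collapse onto the same edge, which is what shrinks many traces into few paths."""
    by_id = {s.span_id: s for s in trace}
    edges = set()
    for s in trace:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent:
            edges.add((f"{parent.service}:{parent.operation}",
                       f"{s.service}:{s.operation}"))
    return frozenset(edges)

def group_into_trace_paths(traces: list[list[Span]]) -> dict:
    """Map every trace onto its trace path and aggregate count, duration, and errors."""
    groups = defaultdict(list)
    for trace in traces:
        groups[path_signature(trace)].append(trace)
    summary = {}
    for signature, members in groups.items():
        durations = [max(s.duration_ms for s in t) for t in members]  # approx: longest span per trace
        summary[signature] = {
            "traces": len(members),
            "avg_ms": mean(durations),
            "max_ms": max(durations),
            "errors": sum(any(s.error for s in t) for t in members),
        }
    return summary
```

Two traces that share the same root span (say, frontend: receive cart) but fan out to different downstream operations land in different groups here, which is exactly the discrimination problem described earlier.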
Coming back to specificity: for example, if the T1 path is, say, the user payment flow and it's high-value, I can tag it — specify and create a tag for it — so I can search for those trace paths that are of higher importance to me because they have the bigger impact. Or there may be trace paths that use a specific service or container I'm worried about — say it's a shared database — and I can tag those as well. Then if there's a problem, I can pick them out.

Now, how do we aggregate thousands of traces into something like one-tenth or one-hundredth as many trace paths? As you can see, we're not showing you all the calls that go between them; we aggregate the repetitive spans and loops. A request might go hop one, hop two, come back, make another call, and so on, but we're not showing all those actual spans — you're not seeing the flame graph. We aggregate the service operations and consolidate on the service operation. That gives us the big reduction: mapping all the traces we're seeing into a few trace paths that I can now monitor and capture aggregated metrics on. I'm going to pause there if there are any questions, Annie, because this is the core of it, and then I'll go through some examples in the demo.

Yeah, no questions so far, but that's a good reminder to the audience: if there are any questions or anything needs clarifying, you can ask immediately in the chat and we will get you an answer as soon as we can.

So again, just to summarize, and I'll give a couple of examples of where we collect trace paths: as someone pointed out, no, this is not specific to the infrastructure. It can work anywhere — your workload could be running on serverless, on a bare metal machine, in a container, on a VM, it doesn't matter. We are collecting the unique patterns of traces that represent the business flow. Coming back to this graph: instead of showing you all the different paths, we base our trace paths on these services, and maybe only these two — the blue and the red — might be the two dominant trace paths that essentially capture all the different traces traversing these operations. As an example, I could have 500 traces in an hour. Rather than seeing only an aggregated view without understanding which operations were involved, I can map those traces into just two trace paths. That's why operations can now look at those trace paths and look at performance at that level.

So let's take an example graph — how would we collect the data? Here's a typical case, the same overall graph: a request comes to A, A calls B, B calls C — there might be more requests here — C calls D, and the response goes back; four spans, as an example. At each timestamp — say T0 — I calculate, for this trace path, span A's counts over the last interval, the errors, and the response time, and then the same on each of the spans that make it up. Take the first three spans, A, B, C, and let's say nothing happened on D — D was not called — so I won't get any data for it. In that first timestamp, for those three spans on this trace path, I'll get these aggregated metrics. In the next interval — let's say I'm scraping at 30 seconds — I'll get the same set again.
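Again purely as an illustration of the interval aggregation just described — not OpsCruise's code; the field names, the 30-second interval, and the example record are assumptions — a per-interval roll-up might look something like this, keeping the contributing trace IDs so you can drill down later:

```python
# Rough sketch of the per-interval roll-up: for each scrape interval and each
# service-operation edge, keep request count, errors, total duration, and the IDs of the
# traces that contributed, so a spike can be traced back to the exact transactions.
from collections import defaultdict

INTERVAL_S = 30  # assumed scraping interval

def bucket(ts_epoch_s: float) -> int:
    """Align a timestamp to the start of its interval."""
    return int(ts_epoch_s // INTERVAL_S) * INTERVAL_S

def aggregate_edges(edge_records):
    """edge_records: iterable of dicts such as
    {"ts": 1700000012.3, "edge": ("frontend:receive cart", "cartservice:GetCart"),
     "duration_ms": 12.4, "error": False, "trace_id": "abc123"}"""
    agg = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0, "trace_ids": set()})
    for r in edge_records:
        slot = agg[(bucket(r["ts"]), r["edge"])]
        slot["count"] += 1
        slot["errors"] += int(r["error"])
        slot["total_ms"] += r["duration_ms"]
        slot["trace_ids"].add(r["trace_id"])      # drill-down: which traces contributed
    return {key: {**v, "avg_ms": v["total_ms"] / v["count"]} for key, v in agg.items()}
```

If an edge was never called in an interval — the "D was not called" case — it simply has no entry for that bucket, and when the error count for a bucket jumps, the stored trace IDs are the ones to pull up.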
So essentially, within that 30-second interval, I can also find out which traces actually went through — which ones contributed to the request count, the errors, and the response time. That's the data you want to process. And what do you want to show? Here are the two trace paths, here is the aggregated metric; oh, you want to go down into the specific traces seen in that first 30 seconds — okay, here are the two traces. So imagine your error count went high: you can then drill down and say, find me the trace that contributed the error. It's a two-step process: I've found the trace paths, I look at the aggregate trace path metrics on performance, and when I see a problem at the trace path level, I go find the trace that's the problem — as opposed to starting from a trace that someone told me about and then trying to figure out whether it was bad or not. This is proactive monitoring, and it's entirely custom to the data, the traces, and the type of requests coming in. Clear? Hopefully — keep posting questions. That should give you enough of a view that we can go into a demo and maybe clarify.

So what is the advantage, if you think about it? What we want for operations is to automatically group and identify these different traces into trace paths, even as they change dynamically, so that we get a real-time operational view. Anyone in ops can look and say: okay, over the last hour, these are the four trace paths that were most dominant and had the most requests coming in — I may have twenty, but these are the top ones — and one or two of them started having high response times (to be determined whether that's actually a problem) or more errors. And because we are collecting all the open telemetry, I want to point out that with these traces, I'm still combining them with logs, flows, and changes, and bringing them in contextually. So operations gets a real-time, contextual, holistic understanding of all of it. Now operationally you can look at things in real time, just like you do with log alerts, metrics from Prometheus, flows, changes from Kubernetes or whatever orchestration you're using — as well as traces, through trace paths. The advantage is that, just like we've done before with metrics or flows, I can now set up anomaly detection at the aggregate level, on the trace paths, and then use it for causal analysis, because I have all of them linked. Any questions? I can't see you, so I'll just pause a few seconds before I move on to the demo.

Yeah, no questions so far, other than the ones we answered already. But people might have more during the demo, so don't be afraid — ask away.

This slide is just to set the stage, because the demo relies on the environment we have. Our deployment architecture — and yes, I am using Kubernetes, and this is a CNCF, open source environment; of course you can have VMs as well. The blue and the green are the pods: the green ones implement the services for your application, and the blue is all of the open source instrumentation.
So for example, if you're using Prometheus for metrics, cAdvisor as a DaemonSet, node exporter; if you're using Loki, then Promtail across the nodes; and if you're using Jaeger, which we'll talk about today, Jaeger is already installed and we capture from it — which means the code that's running already has the Jaeger libraries enabled to produce those spans. So we have metrics, logs, and traces, and we capture them through one container — one pod per telemetry type per cluster. Essentially we sit passively in the Kubernetes cluster, not touching anything, and of course with Jaeger we don't have to touch the code. We also capture kube-state-metrics to get changes, and if you're running on cloud, we'll also pull from the cloud gateway. So basically, using these five pods to collect data from the open telemetry — or the cloud — we can capture everything we need. We can of course do the same with cloud metrics and VMs, and then push it all to the controller. That's what we'll be doing. So with that, if there are no questions, let me jump into the demo.

All right, please let me know if you can see this.

We can see it, but it is very small.

It is very small — I can zoom in. What you are seeing is, essentially, every container, pod, and node being captured here, and the flows between them. As an example, if I go in here, you can see the pod has metrics, events, logs, connections. You can see the direction of the shipping service here, going from here to here; in fact, if I highlight that, you can see the connections, and as I'm looking at it, you can even see the data flowing through. This is what can be done by pulling data in from those collectors — we actually build this service map directly from the collectors we talked about.

And there's already an audience question: what is the correlation between span A, B, and C, if any?

Sorry, say that again?

What is the correlation between span A, B, and C?

Oh, we are back on this — I think they're asking about this diagram, correct?

Possibly, possibly. We can talk through it more, and Jingdong, if you have more specifics, let us know.

Yeah, so the example here is: this is span A, the major span. Span B is when the request went from here to here; the request goes from B to C, et cetera. It's almost like showing you the flame graph. Let me see if I can show you an example here — I'm jumping around a bit, so pardon me.

No worries. And they confirmed that that's exactly what they meant.

Yeah. So in fact, I'm jumping ahead: here's the trace path summary that we'll show, and if I just pick one at random — let me pick the front end. If I look at the traces, what you're seeing is that aggregated service-operation to service-operation view. You can see it: service frontend with its catalog operation, service frontend with the ad service operation. So that's the aggregate — this is not the span level.

Yeah, they already said thanks — question answered. Excellent.

And then I think this is where I can jump to the trace, and here are your familiar spans.
I'm jumping ahead and already giving away the details of the demo, but here: this is the root span A — I think that's what they were asking about — and then the span goes from here to here, et cetera. But it's being aggregated, because the service was this one; these are the corresponding service operations, and it doesn't necessarily show all the detailed spans, which we can break down further here. That's the collapsing and aggregation we're talking about. Hopefully that answers the question from the person who asked it.

They said, cool, thanks so much.

Live interaction — absolutely love it. Okay, so for the purposes of our discussion on tracing and trace paths, I just want to give you an idea that we can go into any service. I'll give you another example — I'll pick one at random. Any container: we can look at its metrics, events, logs, and connections, all from the open source collectors. In fact, as I said, we can figure out where that application is running, what it's using, what node it's on, what infrastructure. I'm not going to go through all of that, but it's interesting: we know where that service is, what node it's running on, how much it's consuming, whether those services are healthy or not, and we will detect issues right there. So you can get all that detail. What we are not showing here is the same level of detail as the trace path — that's what I want to come back to next.

So going back to where we were, the app map. The app map gives you that structure, but again, this is not at the trace level; it's scraped and built at the aggregate level. What you were seeing as an example, as we zoom in, was at the flow level. The flow level gives us the connections — who's talking to whom. The front end here has requests coming inbound, and then it goes out to this guy, the load generator, as an example. You can capture aggregated metrics like latency, average and max, but this is at the service level — between these services. If you zoom in here, it says the server talks to this front end, it's sending so many bytes, and that response time of 28.648 comes from flow, which is eBPF, aggregated. We're not seeing all the transactions going between these two services. So what we want, as I was pointing out — and I'm going to jump back again if you don't mind...

Yeah, and there's an audience question as well, if we have a good spot for that one.

So I showed you the flow; we know traces, and we'll talk about how we trace. Go ahead — what's the question?

Yes, Jimmy asks: is anything instrumented out of the box? Do you need to write it yourself, or with OpenTelemetry?

Ah, thank you. For Jaeger, if you've already enabled the native client libraries — C, Python, whatever — and those libraries are sending through Jaeger, we will capture those spans. You can of course customize which ones you want to send. So when we say Jaeger here, we are assuming that Jaeger is instrumented, and although you don't see a collector for it here, our gateway basically goes and talks to the Jaeger backend and gets the data that's coming in.
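As background for that answer — this is a generic, minimal OpenTelemetry SDK sketch rather than anything OpsCruise-specific, and the service name, span names, and OTLP endpoint are placeholder assumptions — instrumenting a Python service so its spans reach a backend such as Jaeger can look roughly like this:

```python
# Minimal OpenTelemetry tracing setup (assumed: an OTLP-capable endpoint on localhost:4317,
# e.g. an OpenTelemetry Collector or a recent Jaeger that accepts OTLP directly).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "frontend"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# A root span ("receive cart") with a child span, mirroring the demo's example operations.
with tracer.start_as_current_span("receive cart") as span:
    span.set_attribute("cart.items", 3)          # hypothetical attribute for illustration
    with tracer.start_as_current_span("get cart"):
        pass                                     # downstream call would happen here
```

Recent Jaeger versions can ingest OTLP directly; otherwise these spans would typically flow through an OpenTelemetry Collector first.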
So instrumentation — think of it as almost like auto-instrumentation, enabled by the libraries. As far as the other telemetry goes, like metrics and logs — Promtail, cAdvisor — we can help you install these using a simple Helm chart for logs, metrics, and events. For Jaeger, you turn Jaeger on and our gateway collector will collect it. Hopefully that answered the question.

And if not, Jimmy, please elaborate if you want to ask a follow-up.

Yeah, happy to follow up with you — we'd love to talk. There are so many different aspects here, so let me keep the focus. Here's a different view of it, by namespace. You can see here our Online Boutique application, and here is our collector, by namespace — you can see where OpsCruise sits — and here's another application. All of these I just filtered by namespace, and we build this automatically. These are the collectors we install using a Helm chart; the Jaeger data is just being collected and we pull it in. All right.

Then we have another one already — people are asking really interesting questions, which is awesome. This one asks what use case or solution this demo is built upon — is it industry specific, like trace collection for a telecom workload or something like that?

We don't make any assumptions on the workload, and I'll explain. Going back to the issue here: it doesn't matter what your workload is. I gave the example of an e-commerce app, Online Boutique, because it's publicly available. But anything that uses tracing generates traces, those traces will have trace paths, and you want to detect problems — whether things are slowing down, having errors, or services are dropping. So this is orthogonal to the specific workload, just like collecting metrics doesn't depend on the workload. The idea is that you can collect all the metrics in real time, and the logs and so on, bring them together, and know the structure of the application — but how do we detect a problem at the trace, the transaction, level? That's the focus. So we don't really care what you're running. Obviously you care, because it could be a customer-facing application, it could be an IoT application, it could be machine data coming in, or a security application monitoring something else, sending requests at large volume, where you want to know whether you're capturing everything and not missing a problem. We are capturing traces and trying to help ops track whether there's a problem. Does that answer the question? I mean, I think—

But just for the sake of this particular demo, is it built on any specific industry?

That's a fair point, Chris. For the demo itself, what we're showing here uses an open source e-commerce application called Online Boutique. It's an e-commerce application because e-commerce is probably of the most interest to retail and other industries, but you could replace it with any other type of application, collect traces, and still detect problems. Telecom, IoT, manufacturing, security, whatever. Okay, great. So let's go into trace paths.
So the way we get to trace paths: what you saw was the app map level. This is the host map, and this is the infra map — I won't even go through those. Similarly, what we've created is a trace map view, and in the trace map view we use trace paths, because we want the real-time snapshot. Here's a summary-level view so you don't have to dig. What's happening, as we mentioned, is that we aggregate the traces we collect, and instead of showing you every one, I show you the trace paths we've been building — listed like a top ten or top five: what are the top five trace paths by request count in the last 15 minutes? Of course I can change that window to anything you want. Then what's the response time, where are the errors — the problems — and which specific services are involved, because the trace paths involve services and requests. So this gives you the aggregate view. If you're in ops and want the holistic picture of all the traces going on, this is one way to look at it: group them by trace paths, find issues or problems from the high-level snapshot, and drill down — as opposed to, hey, let me search for the traces that went wrong and figure out which one. As an example, here are all the trace paths, and this list can go on and on.

Yeah, and there's a question as well: can we get traces of GCP resources?

Traces of — I didn't quite understand the question.

GCP — it could be Google Cloud Platform, or something else if it's something specific.

Well, the question is: you're using OpenTelemetry collecting traces from what? Can you clarify? You say GCP, it's running on a cloud platform — what are you instrumenting, what application generates the trace? Maybe I'm not understanding the question, but I'll take a gander: if you're running the application on any cloud, on-prem, bare metal, one cloud or another, it doesn't matter. If it's instrumented with OpenTelemetry, we can collect it and we can do this. We are cloud independent, obviously — it's OpenTelemetry after all. I don't know if that answers the question.

Let's see — Muhammad, if you say yes or no, we'll know. And there's another one... yes, he'll follow up. Let me go through the demo and then I can take live questions.

So here's an example of the summary of all the trace paths, without grouping. What I'll point out — in fact, let me go to the previous one — is that we get an aggregate view of which ones have errors, latency, and so on. This is your top-level view. Red means there are errors, and there are a number of trace paths with errors that we are capturing. In fact, I can go back and ask, is that true over the last, say, four hours? It should group, because I'm collecting statistics — and if I go to all trace paths over the last four hours, we get a few more, and you can see the volume of errors went down. Okay, let's go back to the most recent 15 minutes, where we're collecting traces, because there are some errors there; it generates and updates. If I look at that, you'll notice something here — remember discriminating traces, I want to point this out. Do you see this front end? Receive cart, receive cart. At the top level, if that's the label you're using to discriminate, they look identical — but look at this.
There are a lot more traces on this path than on this one, even though they have the same front end name. In fact, if you hover, you can see this one has seven plus one plus one services involved; this one only has two. So, for example, if I go to receive cart, I can look at the trace path — and you can see here, this is going to populate as we speak. What it shows is the front end coming in, receive cart — hold on, here we go. This one makes calls to get quote, get cart, convert currencies, get the supported currencies list — obviously not in this order — and after that goes to the shipping service, and then this one calls this. These are the services and the corresponding operations, and as you're seeing, on this path we also show, on each edge, the corresponding response time, errors, and so on. If I go back — so that was the first receive cart, with more services. If I go to the second receive cart, there are only five service operations involved. Let it populate again — there we go. And you know what's interesting? There are no errors on this one. Same front end. So we're able to discriminate without going and figuring out how to tag everything on those services — we don't want to do that. Why should ops have to do that and figure it all out? You don't know how things will change. The trace path, even with the same front end, lets us discriminate — that's the whole point, differentiating traces. That's the one thing we want to automate, because there's no way anyone can do it by hand.

So, all services — here's an example again. As you can see, the errors are all related to the front end, so that's what we have to figure out. Going back to the highlights, let's take a look at the topmost error. If I look at the topmost errors, here's one — in fact, I can even see which one has the highest latency, although that's not an error; this one has the error. Let's look at errors. I'll let it populate — this is a live demo, guys, so I have to give it time. There you go. You can see from the summary we're aggregating as we process: this front end has, as you can see, one, two, three, four, five, six — really seven services. Receive cart calls these other six, and then the next service is the product catalog, which has get product. These are the ones involved. Again, we're not showing the spans — remember, we aggregated — but we know this front end has problems, and specifically on errors. So if I want to go down from the front end and detect what happened, I can click on that. Sorry, before that, let me just show you on this one what the inventory is. As I said, there are only those two services: the front end, which has, if you remember, six operations — one, two, three, four, five, six — and the product catalog, which only has one, get product. And where's the problem? Aha — the Hipster Shop ad service is having a higher number of errors. So I'm already seeing it. Now the question: can I get to those traces? At the aggregate level, the metrics are aggregated, and I can see the errors are consistently high. So if I go back — what does ops want to do? Find me the traces that are causing the problem; give me that trace. Here there are multiple — remember they loop through, and the spans are being aggregated.
So there are actually, as you can see — I forget the exact number here, I'm not seeing where it's listed — about ten of these that happened in the last 15 minutes. If I pick one at random for the same service, that pulls up your familiar flame graph and drills down to where the problem is. We're highlighting it because we know that service — it's the Hipster Shop ad service. And here are the tags: the description, the type. There's a transient failure here: error while dialing, connection refused. Again, from this I know the process, I even know the container and the pod. So that's the added baggage: I know which container to go to, and if it's not able to connect, I can go back and diagnose from the trace path, to the trace, to the underlying service — and as someone pointed out, if it's not Kubernetes, then whatever it's running on.

I want to pause there because I really want to leave at least 10 or 15 minutes for questions, but hopefully that gave you some background. Please follow up and post in the chat. The whole idea, if I were to summarize, is to make distributed tracing usable without custom coding, custom instrumentation, or proprietary agents — hey, OpenTelemetry and open source CNCF is the way to go. And to figure out problems in traces for operations, what we want to do with trace paths is automatically group and identify those traces, dynamically. Once we have that, we can bring in all the contextually linked data, detect problems, and go down to the specific traces. Skipping through the rest: as a pointer, if you want to try us, you can go to opscruise.com forward slash free forever, or reach us at info@opscruise.com, and check out our website. I'll be glad to take questions now. So with that, thank you so much for tuning in. Feel free to ask questions — I'll stop sharing.

We can see the stream window now — yes, perfect, now we're not seeing that anymore; it was a few too many windows at the same time. Okay, we have a few questions here, starting with the ones we touched on but now have more info for. Regarding the question about getting traces of Google Cloud Platform resources, the asker continued: yes, using the OTel collector — I got an update that we need to pass the current context only to the Google Cloud methods. Suppose we have a Google Cloud Pub/Sub queue and I want to collect the spans of the operations that happened at that Pub/Sub queue. And there's an additional comment attached: we already have OTel enabled for our services and many tools, but we want some observability for the Google Cloud resources also.

Off the top of my head, Mohammed, I would say that if you do that, then at a minimum we can start tracking those, because you've already instrumented them. Specifically how that ties in with everything else — it would be great to have a deeper dive with you. Maybe you can ping us and send Chris an email, chris@opscruise.com, because this could be very specific to your environment. But from what you've told me, if you've already instrumented it, it doesn't matter what target you're trying to monitor with tracing and capture — I think we should be able to, but we should follow up.
Okay, great — so follow up with Chris. Perfect. Then we had a question earlier on: in addition to the aggregate view, can we support thresholds on average time?

Sure, sure — I'm not sure I fully understood. Are you talking about average response time? Is that what the question was?

It was phrased as: can we support thresholds? Average time?

Yeah. So, one of the things I didn't talk about — there's a whole other session in it — is that absolutely, we can detect that. We already do; just to give you a flavor of this, for example—

We are not seeing your screen.

Oh, I'm going to a different screen — hang on, I forgot. Okay, let's see, am I sharing now or not?

Not yet.

Let's do this again. Okay, sorry about that.

Yes, now we see it.

All right. So what you can see here, if you're seeing my alert window — this is something I didn't talk about, outside the scope of this talk — is that we do capture SLO breaches, by looking at response time from flows when it goes high, as one example. We do this using ML methods, and there are multiple ways of doing it — different ways of solving the problem, like looking at the whole path when it goes up, and we also do root cause analysis. There are a couple of different ways to set this up. One is — let's take the front end as an example, at the service level. Let me see where we can set SLOs; I think it's one of these screens, give me a minute, I'm trying to find one that's easy to show. Let me look at NGINX, it's the entry point — should be the front end. Ah, here we go. This is a slightly different application, a shopping cart. If you zoom in on this NGINX that I clicked on, you can add an SLO, and you can see what it's doing. So by looking at the data, we constantly use a statistical method for this, but you can also set your own SLO — that's a threshold at the ingress level. Beyond that, as I said, we use ML techniques to look at the data as well. And we take the same approach when we're looking at — going back to the trace map — a trace path like this: you can take response time and say, hey, I'll set a response-time SLO on the front end cart checkout. We're enabling that now, and you can see it's already tracking anomalies, where it's higher than the threshold. So here's an example where we set three seconds. So there are a couple of ways of doing it: there are the ML methods, or you can set one yourself — the example I just gave used three seconds to detect problems on that path. Hope that answers the question.
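Just to make that threshold flavor concrete — a hypothetical illustration, not OpsCruise's detector, and the ML-based detection mentioned above is a separate topic — a simple SLO check over a trace path's per-interval average response times might look like this, reusing the three-second example from the demo:

```python
# Hypothetical SLO threshold check on a trace path's response-time series.
# `series` pairs an interval start (epoch seconds) with that interval's average latency in ms.
def slo_breaches(series: list[tuple[int, float]], threshold_ms: float = 3000.0):
    """Return the (interval_start, latency_ms) points that exceed the SLO threshold."""
    return [(ts, latency) for ts, latency in series if latency > threshold_ms]

# Example: alert on any 30-second interval whose average exceeded 3 seconds.
series = [(0, 850.0), (30, 3120.5), (60, 2750.0)]
print(slo_breaches(series))   # -> [(30, 3120.5)]
```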
Yeah, hopefully it does — let us know as well, or follow up. Send an email to me, alok@opscruise.com, or follow up with chris@opscruise.com.

Absolutely. Yeah, and that's the reason — I tried to put my name in the chat and it got flagged as spam.

So we've got to fix that. Anyway — sure, anybody else? I'd love to answer your questions.

You have a few comments. Yeah, there was a question from Vipo: do you have the capability for auto discovery of new services slash functions?

Absolutely, we do. That's the one nice thing about it — once you get the data... I'll jump back and forth if you don't mind, hopefully you don't get dizzy. When transactions change, this list gets updated, because as the transactions come in we are running the grouping on the backend. And in fact, it's not only new services or a service dropping — the trace paths have to be updated too; by definition they have to be dynamic. So we're not only discovering new services and services that drop, but also new service operations being called. Remember, a trace path runs between service operations. So the answer is absolutely: the highlights will change, the trace paths will change, and all of these will be updated as well.

Perfect, and we got a confirmation from the previous question asker: cool, thanks so much. So that covered it really well. Yeah, we are starting to near the end of our time today — we are having so much fun. There are still five minutes to go, so we'll see if anyone's typing away and sending a question in.

And I'm going to keep this slide up, in case people want to get hold of us or want to try it out. Besides info@opscruise.com, as I said, you can reach out to me at alok@opscruise.com, or I would say go to Chris — he's tracking this better than I am. It's just chris@opscruise.com. Chris, Christopher — he goes by either, he won't be upset.

Perfect. Any other resources or anything else?

Yeah, I'm glad you mentioned that — I should have brought it up. There is an e-book we just published, in fact just about a week or two ago, that Chris can send out, and it has a lot more detail. In fact, if you don't mind, I'm going to jump over and see if I can find it and show you what it looks like, since we've got a minute.

Yeah, and we actually got a question: what do you mean by free forever?

That's one for the salesperson — that's a good question. I would like to know the answer to that too. Chris will take care of that.

No, I'm saying I want to know what you mean.

Well, we do have customers who are using us, and if they don't use it, of course, we don't keep it running, because it's a resource on our side. So I will now, at the risk of sharing my whole screen, show you what this e-book looks like. Can you guys see that?

Not yet — oh, now we see the e-book, yes.

So this was just published, and it has the details on what we do, with samples, how it works, the specific problems, et cetera. I'm flipping through it because I had a hand in it. It's available with screenshots — some of the stuff you've already seen — and what we do. So this e-book is there; just ask us for it, send us an email, and Chris will be glad to get it to you.

Perfect — Chris has a lot of emailing to do.

Yes, we keep him busy.

That's perfect. So, final call for questions if there's anything that pops up. But thank you so much, it's been really lovely.

Great, and thanks for letting us host this and talk about trace paths, something we really believe in, and OpenTelemetry. Guys, if there's one message I'll give you: with OpenTelemetry and the CNCF, you've got everything you need — it's like a buffet out there.
You don't need anything else. And now, with OpenTelemetry and Jaeger, and being able to bring traces in in real time, you don't have to wait for some poor person to go searching, and ops doesn't have to be a second-class citizen to developers and apps. It's for everyone. All right, thanks guys — appreciate it, and looking forward to hearing from you.

Yeah, perfect — reach out to these folks, it's going to be great. So as the final wrap-up here, as you can see on the screen, the Q4 calendar for online programs is now open, so book your session, such as this one, from there. It's open until the end of the year, so there are plenty of chances for amazing content like this to be showcased. But as always, thank you everyone for joining the latest episode of Cloud Native Live. It was great to have a session about using OTel distributed tracing for real-time observability, and we really loved the interaction and the questions from the audience — thank you so much. As always, we dive into the code behind Cloud Native every Wednesday. In the coming weeks we have more great sessions coming up, so book those in the online programs calendar as well. Thank you for joining us today, and see you next week.

Hey, thanks everyone.