All right, so my name is Venkatesh Rangarajan. I'm a group product manager, and I'm going to talk on the topic of customer-centric observability at scale using AIOps and OpenTelemetry. Before I begin, I just wanted to introduce the organization that I represent today. It's Intuit. If you haven't heard of Intuit, it's a leading fintech company based out of Mountain View, California. Our mission is to power prosperity around the world, and we are participants in and maintainers of several cloud-native projects, some of which you can see here, like Argo, Istio, and Admiral. You can visit our booth to learn more about how we are contributing to the open-source movement. You must have heard of some of our products, and you may have used some of them: TurboTax, QuickBooks, Mailchimp, and Mint.

Before we begin, I want to introduce the part of the organization that I represent within Intuit. We are part of a team called DevX. DevX is the development platform team that builds a lot of the capabilities that other organizations use to build their products on top of. So when the TurboTax team wants to build some capability, we provide the foundational capabilities that the TurboTax team can then use. Similarly, if QuickBooks wants to build something, they rely on our runtime to build their infrastructure on top of. On the left-hand side are the major areas we contribute to. We contribute to the front-end experience; we call it the UX fabric. We also contribute to the platform experiences; this is the self-serve mechanism for our developers to discover, create, and manage the lifecycle of their assets. And we are responsible for the platform runtime; this is the serverless and container orchestration environment that we provide. Last but not least, and the topic of today's discussion, is platform operations. It's not sufficient to just provide these capabilities; it's really important to make sure that we have the right level of monitoring, learning, and resiliency built into them. That's one of the focuses of our organization.

If you look at the way we have pivoted, we call it the modern development platform, and this is a snapshot of the various capabilities that are part of it. On the top are the experiences; this is what a customer interacts with. So when you're a TurboTax customer or a QuickBooks customer, you are interacting with us either through a web app or a mobile app. To enable these experiences, improve development velocity, and make sure there is monitoring and security around these capabilities, we provide what we call the app fabric. App fabric is the front-end experience, and it's used by the application teams to build their functionality on top of. In addition to that, I mentioned providing a self-service capability where developers can come and manage the lifecycle of an asset. Think about it this way: it's a large organization, and a new developer has joined and wants to create a new service. That's quite daunting. So what we've done at Intuit is provide what we call a Paved Path. It gives you asset lifecycle management, governance, monitoring, observability, security, all of the things that are necessary to build a well-engineered application. So as a developer, you go there, we provide a Paved Path, there's a self-service experience, and you can create your asset.
And we want to make sure that it's engineered in the right way. So that's the top layer. We also provide the runtime, which is service communication through the service mesh and various other modalities. The runtime, as I mentioned, is the Intuit Kubernetes Service, plus serverless through a layer on top of AWS Lambda. And the bottommost layer: we cannot think of these layers in isolation. We have to always think observability first, operational excellence first, security first, and that's where this layer becomes important. Every asset that gets created using this self-service interface gets out-of-the-box observability: we have logs, metrics, traces, events, and alerts, all coming out of it. And as you know, observability is a data problem, so this generates a lot of data, and all of that data is stored in what we call the operational data lake, which we rely on heavily for our AIOps.

Moving on, before we go into observability, I just want to paint a full picture of how to think about this. Observability in itself is metrics, logs, and traces. But for you to be successful in observability, you've got to think about all the other supporting functions that enable the organization to succeed with these tools. The first is reliability engineering. What does that mean? Are the observability tools being used in the right way? Can we run a chaos experiment to see whether the organization is able to respond to an incident, use the tools in the right manner, whether the people and processes are able to respond in an appropriate fashion, and then mature as an organization? There has to be a certain rigor around this process. So that's the reliability engineering piece. There are going to be incidents, and when there are incidents, that's when the observability tools get used. That's where service management comes into the picture. What are the processes and procedures around service management? Are they using the right tools? Is alerting properly enabled? Is the context available? If an incident happens, the first question that gets asked is: how many customers were impacted? Are we able to answer those questions? Are we able to root-cause and analyze which service should be looked at? So that's the service management piece. And obviously, one of the underlying factors of observability is the cost associated with it. Making sure that the right patterns are used for the right solution is where the cost management piece comes in. We want to make sure that logs are used appropriately with the right level of sampling, that metrics are used appropriately, and the same with tracing. So the whole picture is that for an organization to succeed, it has to be operating at all these different levels.

So let's talk about the background and use cases. What does it mean to be customer-centric? If you think about observability, it's traditionally very system-centric. What does that mean? We look at service uptime. But we know that in the world of containers and serverless, infrastructure is fungible. Today you might be on containers, previously you were on EC2, and in the future you might be on serverless or some other stack. And what happens to observability at that point in time? It's pretty fungible, right? And there's a very weak correlation between an issue in the infrastructure and what the customer is experiencing.
So at Intuit, what we are trying to do is move the needle towards customer-centric observability, because what the customer is trying to accomplish on our platform is durable. They want to file taxes. They want to create an invoice. They want to be able to process a payment flow. This is durable irrespective of the underlying technology. So we are moving the needle towards that. In addition to system uptime, we are looking at functional uptime. What does that mean? Is a user able to complete the functional task that they hired us for? We want to really know that the customer is succeeding in accomplishing the task they're paying us for. And with customer-centric observability, we are able to have a direct correlation between an issue in the infrastructure and the impact on the customer: how many customers, what region, what kind of device. We are able to answer those questions. That, in a nutshell, is what customer-centric observability means. And any workflow or experience in an application has several teams participating in it. A payment flow might have 10 different teams participating in the orchestration of that workflow. With customer-centric observability, we are able to bridge and create a connection between all these different pieces of infrastructure and pinpoint exactly where an issue might be. So during an incident, we are not spending our time looking for a needle in a haystack; we are intelligently able to pinpoint exactly where we should be devoting our resources.

So, our high-level strategic goals. We want to be customer impact-driven. We want to rely on AIOps, because observability throws out a lot of data, and it's not humanly possible for us to be looking at so many dashboards, so many metrics, and so many alerts. So we want to have this one entity, which we call the anomaly score, drive a lot of the functions of observability: things like alerting, incident creation, and the remediation process. In addition to that, the anomaly score also drives the automatic rollbacks that we have invested in. Our goals are basically to reduce the mean time to detect to under five minutes and the mean time to restore to under 40 minutes, and to have 99.99% availability.

So let's talk about observability. This is a functional footprint of where we are in observability at Intuit. There are the traditionally common means of looking at the health of a system, such as logging; we rely on Splunk for logging. Then we have metrics. Metrics are emitted from various sources, like front-end metrics, service metrics, and infrastructure health metrics. And then there is the area we have been investing in over the past few years: tracing. Tracing helps us with root cause isolation. We want to reduce our mean time to isolate an issue, and tracing is an area we are investing in to accomplish that goal. But it's not sufficient to just provide these capabilities; it's really important that we provide interfaces so that when an issue occurs, the right information is bubbled up for the developer. That's where the user interface and the triaging flow come into the picture, and I'll share that in the demo later in the presentation. And last but not least, at the bottom is the ODL layer. Each one of these capabilities throws out a lot of metrics, logs, and various types of data.
We want to be able to correlate and make sense of all of this data and provide insights, and those insights can only be created if we are able to make sense of the data. So that's the bottom layer, the operational data lake.

If you think about observability, I've bucketed it into four major areas: emit, collect, process, and then provide the information to the right stakeholder in the right, actionable format so they can make sense of it. One of the areas where we have seen good success is providing these capabilities out of the box. What does that mean? If you're creating a service, we want to make sure metrics are emitted out of the box. If it is a front-end asset, the front-end assets are OpenTelemetry-compliant; they are publishing metrics and spans so that there is not a lot of developer toil in onboarding onto these capabilities. The next step is collection: we need a highly resilient collection pipeline that collects all of this data, then processes this information, correlating metrics, logs, and traces, and then provides the user interface to drive alerting, dashboarding, reporting, and various other modalities. I won't talk through this slide, but this is what our footprint looks like. I talked about the front end; the insights and UX layer is something that we built internally at Intuit, and then we have a combination of vendor tools and open-source tools. With regard to tracing, we rely on Tempo, which is an open-source product from Grafana, as our trace store, and I'll get into more detail when we talk about tracing.

All right, so let's start customer-centric observability with the topic of real user monitoring. The applications at Intuit are composed in what is called a micro front-end application architecture. What that means is that instead of shipping a whole monolith, the application is fragmented into pieces; it's a collection of plugins, and each plugin in the application is owned by a different team within Intuit. So when you're accessing QuickBooks Online, it's nothing but a collection of different plugins that generate the experience for you, put together at runtime. Why did we do this? It allows us to move faster: developer velocity is increased, and it allows teams to execute faster without big dependencies. It also allows each plugin team to release independently, and that way we are able to iterate faster. Those are the benefits of the micro front-end architecture. But at the same time, it introduces some challenges. If something is wrong with an application, how do you know who's responsible for it? If a particular page in an application is not performing, if there is an error or an issue, that's where we run into challenges, and I'll talk about how we solve for that.

That's where we invested in what we call real user monitoring (RUM), a front-end JavaScript library that's part of all app fabric applications. It captures app performance and availability from a real user's perspective. Again, this helps us measure things from a customer perspective instead of looking at just infrastructure numbers. This is something that is integrated out of the box. Talking about developer toil: building capabilities is one job, but being able to roll them out throughout an organization where there's a significant amount of technical debt is really hard.
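To give a rough idea of what such an out-of-the-box RUM bootstrap looks like in code, here is a minimal sketch using the public OpenTelemetry JavaScript and web-vitals packages; the endpoints, names, and wiring are illustrative placeholders, not Intuit's actual app fabric library.

```typescript
// Hypothetical sketch of an out-of-the-box RUM bootstrap: OpenTelemetry web tracing
// plus Google Web Vitals reporting. Endpoints and names are placeholders.
import { WebTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-web';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';

// 1. Every page gets a tracer provider that batches spans to the collection pipeline.
//    (Newer OTel JS SDKs take span processors in the constructor instead.)
const provider = new WebTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'https://rum.example.com/v1/traces' }) // placeholder endpoint
  )
);
provider.register();

// 2. Web Vitals are captured from the real user's browser and shipped as metrics.
function reportVital(metric: Metric): void {
  navigator.sendBeacon(
    'https://rum.example.com/v1/vitals', // placeholder endpoint
    JSON.stringify({ name: metric.name, value: metric.value, page: location.pathname })
  );
}
onLCP(reportVital);
onCLS(reportVital);
onINP(reportVital);
```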
That's why one of the core decisions we made was to provide this capability out of the box and integrate it. We also derive anomaly insights based on customer impact, so if a particular piece of functionality in the app is broken, we are able to bubble up the anomaly insights and drive alerting and remediation from them. And all of these insights are available in one location for our developers to look at and debug if there is an issue.

If you think about real user monitoring, there are two major components. One is the performance component, which relies on Google Web Vitals; we standardized on Web Vitals to look at the performance of the application. But there is this other concept, called a failed customer interaction. What does that mean? A customer interaction is a body of work, a unit of work, that the user is trying to perform with the application — for example, a payment flow, uploading a document, or a login interaction. That's a particular piece of functionality the user is trying to accomplish with our application. The developers have to instrument it — the instrumentation work involved is pretty minor — but they have to instrument it, saying, hey, I really care about this particular interaction in my application. These are measured as customer interactions, and we measure the health of our applications based on them. As part of the adoption, we talked to the various application teams, saying, hey, real user monitoring is part of the platform, but we really want you to instrument these key interactions so we really know the health of your application from the customer's perspective. So part of the job was running a program and driving that throughout the organization, instrumenting the key workflows. And if there is an impact, we now know what functionality within the application is broken. This is, again, out of the box in terms of the transportation, storage, and analysis of this data. FCI is OpenTelemetry-compliant, so as long as you're using our standard libraries, we create trace spans, and because we propagate context, we get the entire distributed flow across the infrastructure.

So, a quick look at our platform. As I mentioned, RUM has basically three major components. There is the front-end library that's part of the app fabric, and there is a backend, where we have a RUM service — a highly available, highly performant service listening to all the users or clients who are using our applications. It ingests all of the metrics and spans, and we persist this data in our operational data lake. At the same time, we rely on anomaly detection in real time to determine if there are any issues with the metrics we are collecting. And we provide a sophisticated interface on the front end for our developers to triage in case they see that the anomaly score is high. I'll talk about that a little bit more when we look at the demo.

All right, let's talk about tracing: Intuit's tracing platform. When we started embarking on the journey of adopting tracing, the first core principle was that we wanted it to be vendor-neutral and cost-effective. We know that with observability, cost is one of the biggest challenges, because if it's not rolled out properly, the cost just skyrockets, and then there's a real question of whether we are getting value from the tools we have invested in.
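Before going further into tracing, to make the failed-customer-interaction instrumentation concrete, here is a minimal sketch of marking a customer interaction as an OpenTelemetry span. The span name, attribute key, and helper function below are illustrative placeholders, not our actual library conventions.

```typescript
// Hypothetical sketch: marking a key customer interaction as an OpenTelemetry span
// so failures can be counted as failed customer interactions (FCIs).
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('qbo-payments-plugin'); // illustrative tracer name

export async function submitPayment(invoiceId: string): Promise<void> {
  // The span name identifies the unit of work the customer is trying to accomplish.
  await tracer.startActiveSpan('customer_interaction.submit_payment', async (span) => {
    span.setAttribute('interaction.critical', true); // illustrative attribute key
    try {
      await callPaymentService(invoiceId); // the team's existing business logic
    } catch (err) {
      // A failed interaction: the backend can count these per interaction and per user.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: 'payment submission failed' });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Placeholder for the application's existing call into the payment service.
async function callPaymentService(invoiceId: string): Promise<void> {
  /* ... */
}
```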
Coming back to cost: we really wanted to be conscious about it, so one of the core principles was to be vendor-neutral and built on open-source standards. Secondly, the other goal was that we didn't want to enable the anti-pattern of people misusing tracing, for example using it as a metrics store. We already have a solution for metrics and a solution for logs; what we really wanted was to impact our mean time to isolate with tracing. The focus was the journey from alerting to metrics to traces to logs — that's the journey we wanted to enable. That was the real core focus. And then, as I mentioned, for adoption to succeed in any organization, a lot of these features have to be out of the box; otherwise, you get sporadic adoption. Now think about tracing: tracing is about following a trace that flows all through your infrastructure. If you have not enabled tracing in 50% of your organization, you're going to run blind. For tracing to succeed, you need coverage across the organization, so one of the goals here was to provide this capability out of the box.

How did it benefit us? As I mentioned earlier, we had an over-reliance on fragmented and unstructured logs, and we don't have one Splunk — we have several Splunk indexes. So there was a fragmented experience around where and what to look for in the data, and when there is an incident, time is of the essence. That was one of the big challenges. Number two, it was really difficult to isolate. Because there are multiple Splunk indexes, you cannot very easily correlate an issue occurring in one service with an issue occurring in the front end in a search. That was the other big issue. And another issue was: how do you put all of this together end to end? If there's a customer interaction flow happening across the infrastructure, there was very limited capability for drawing that picture together for the on-call engineers, and it was difficult for us to correlate logs, metrics, and traces.

With the implementation of tracing and context propagation, we are able to correlate metrics, logs, and traces. We are able to build a service dependency graph. Think of a service dependency graph like a topology: when you look at a customer interaction flow, we want to see at an aggregate level what assets are participating in that flow. That's the service topology, and tracing allows us to build that service topology map. Now, you might want to go more granular — I want to look at one particular trace. That's where the call graph comes into the picture. You say, hey, I want to look at this particular trace because I see that it's performing poorly; click on that and you get the call graph. This allows us to have end-to-end traceability from the front end through the API gateway to all the backend services that are participating. And as I said, we're based on Grafana Tempo and open standards, and vendor-neutral too.

So, a quick look at our tracing platform. This is a logical architecture just to give an idea of how it looks. For instrumentation, we want to provide support out of the box as much as possible. These are the various flavors of services and front-end assets that benefit from tracing. We have the real user monitoring that I talked about earlier, which is the front end, and for services built using Java, we have Java SDK support. We also support service mesh.
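Whatever the flavor — front end, Java service, or mesh — what stitches these into one trace is context propagation. As a generic sketch (standard W3C trace context via the OpenTelemetry API, nothing Intuit-specific), it looks roughly like this:

```typescript
// Generic sketch of W3C trace context propagation between two services.
import { context, propagation, trace } from '@opentelemetry/api';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

propagation.setGlobalPropagator(new W3CTraceContextPropagator());

// Caller side: inject the active span's context into the outgoing request headers.
export async function callDownstream(url: string): Promise<Response> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers); // adds the `traceparent` header
  return fetch(url, { headers });
}

// Callee side: extract the incoming context so new spans join the caller's trace.
export function handleIncoming(requestHeaders: Record<string, string>): void {
  const parentCtx = propagation.extract(context.active(), requestHeaders);
  const span = trace
    .getTracer('downstream-service') // illustrative name
    .startSpan('handle-request', undefined, parentCtx);
  // ... do the work; the span now carries the caller's trace ID ...
  span.end();
}
```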
So if there is traffic going across the service mesh, we want to make sure that the trace context is propagated and trace spans are created, and similarly for async, the event bus, and other areas as well. With regard to our runtime, we have out-of-the-box support for our Kubernetes platform too. The trace spans are primarily stored in S3, but you want to be able to search the spans based on certain criteria, and that's where Elasticsearch comes into the picture. Grafana Tempo does provide a search capability, but we wanted something a little more sophisticated, with semantic-based search, so we had to build our own search cluster using Elasticsearch. It's a metadata search engine: it doesn't store all the spans, it just stores information about the spans. And the last piece is the user interface. Again, we want to provide a really sophisticated view for our developers to triage these things. That's the last piece in the picture.

Let's take a look at the tracing architecture in a little more detail. If you think about the producers, we have three main ones: number one, EC2; number two, Kubernetes; and number three, the front-end failed customer interactions coming in from the mobile and web apps. The trace spans get transported through the OTel collector and make their way into S3. In addition to that, as soon as a span makes its way into S3, we publish to a topic on an event bus, because we have different types of processors that process these spans. We don't actually post the spans to the event bus; we just publish a notification saying, hey, a fresh span is available for processing.

If you start from the top, we have a metrics aggregator; we extract metrics from our traces. That's the top portion over there. The service basically reads the spans from S3, extracts metrics, and writes them to an event bus. From the event bus, the metrics are then published to two different locations. We use Wavefront as one of our metrics dashboarding tools, and in addition to that we have a low-latency, high-performance Druid as another metrics tool, so the metrics get published there as well. Next is the trace ingester service. Again, as soon as a notification comes in, the trace ingester service reads the trace span and then publishes it to two locations. One is Tempo, the Grafana-based solution, which allows us to assemble a call graph: when you have multiple spans that are part of the same trace, you want the ability to put the whole call graph together, and that's what Tempo is useful for. But at the same time, we needed some sophisticated searches associated with traces. For that, we created our own Elasticsearch cluster, which stores the metadata associated with the spans and provides the search capability. All of this is consumed through a trace search API — a GraphQL API — which is consumed by our front-end observability experience. That's how we surface the traces for our on-call engineers and developers. Last, at the bottom here, is the service dependency piece. I talked to you about service dependency: this is a topology view that shows you a breakdown of the various assets or services that are participating in the orchestration of a workflow. This is the layer at the bottom that creates a service registry.
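As a rough illustration of how a dependency graph falls out of trace data, here is a simplified, hypothetical sketch that aggregates caller-to-callee edges from a batch of spans. The real pipeline reads spans from S3 and maintains a registry, so treat this as a toy version of the idea only.

```typescript
// Simplified, hypothetical sketch: derive service-to-service edges from a batch of spans.
// Real spans would be read from the trace store; this only shows the aggregation idea.
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  serviceName: string; // e.g. the `service.name` resource attribute
}

// Count calls between services by joining each span to its parent within the same trace.
export function buildDependencyEdges(spans: SpanRecord[]): Map<string, number> {
  const byId = new Map(
    spans.map((s): [string, SpanRecord] => [`${s.traceId}:${s.spanId}`, s])
  );
  const edges = new Map<string, number>(); // key: "callerService -> calleeService"

  for (const span of spans) {
    if (!span.parentSpanId) continue; // root spans have no caller
    const parent = byId.get(`${span.traceId}:${span.parentSpanId}`);
    if (!parent || parent.serviceName === span.serviceName) continue; // skip same-service hops
    const key = `${parent.serviceName} -> ${span.serviceName}`;
    edges.set(key, (edges.get(key) ?? 0) + 1);
  }
  return edges;
}
```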
That registry layer can say: for this particular workflow, based on the trace data, these are the different assets that are participating — and it allows us to paint that picture for our developers. So what does the dependency graph look like? I talked a lot about the dependency graph. This is a screenshot from an incident. When the incident happened, the asset in the middle — right here, you can see my mouse, yeah, right here — was the one that had an issue. But if you look at this, there are so many different assets and services participating in this flow. And when there is a P1 incident, there are so many people on the call. It's really important to understand the impact: which different teams need to be working on this? Which is the root CI, the root cause, that needs to respond to this incident? Which other teams need to be informed about a service disruption — maybe they are not the cause of the disruption, but they are participating in it? That's where the dependency graphs become really important.

Let me do a time check here. So what experience are we trying to enable with tracing? I talked about the reliance on the anomaly score. Think about the story here. There is an anomaly, the anomaly score goes up, maybe it exceeds the threshold, and an alert gets created. As soon as the on-call engineer gets the alert, they click on a link, they go to the service dependency graph, they look at the service dependencies, and they know the root CI among all the different assets — 15 or 20 different assets participating in a flow. They know exactly which particular asset to look at. They click on that asset, and then, using the search capabilities, we are able to bubble up the relevant traces appropriate for that issue. So instead of doing an open-ended search, where you go to a trace explorer and search for traces without knowing where to look or what to look for, we literally bubble up the right traces that you should be looking at. Clicking on one takes you to a call graph for that specific trace. At that point, you might actually know what the issue is right away, but maybe it is not sufficient, so we have links into logs. Clicking on those will take you to the exact log lines in Splunk to carry forward your debugging session.

Let me pause here, and let's go and talk about AIOps. Okay, so what is the need for AIOps in all of this? We talked about real user monitoring for customer-centric observability. We talked about tracing, to be able to paint the picture across the infrastructure. The last piece I wanted to cover today is AIOps. We are generating a large amount of high-cardinality data, and it's really difficult to understand what to look at when you have so much data. Take a service, for example. For this service, we might be interested in, say: I want to know all 4XX errors, I want to know all successful requests, I want to know how many requests are below a certain performance threshold, and I want to know what kinds of errors I'm encountering. For each one of these, you can create a configuration, you can create an alert. So if it were my service, as a developer, I'd have to maintain six different configurations. Now, let's say you have several different verbs associated with it — upload, get, put — and the number of alert configurations you need to maintain increases by a factor.
Let's say you want a little more detail, like the different clients connecting to your service: there is a front-end mobile app, there's a web app, there's a plug-in connecting to you. It becomes really hard to maintain all these configurations, and this is not static. It's dynamically moving, because as a service owner you don't know which service is going to connect to you and when, and you have to maintain different thresholds. There might be services that have really high throughput and services that have really low throughput in terms of request counts, but there might be an issue where there are errors and suddenly your alerts spike. So you would need to maintain separate configurations for each one of these.

This is where AIOps helps us. Instead of relying on individual configurations, we have a unified anomaly instead of independent alerts. This unified anomaly is the same anomaly across all of the use cases you're seeing at the top; it does not matter. If a new client shows up, the system learns automatically and adjusts its anomaly score. Number two, why is the anomaly score important? We use it to drive our incident response: when the anomaly score exceeds a certain threshold, we drive alerting, and from alerting we drive incident creation. Third, this anomaly score is always on and proactive. What does that mean? It works in real time — it's not a batch process — and when a new client comes up, the system learns over a period of time and adjusts its algorithm so that we are able to respond and give the right anomaly. It has a high signal-to-noise ratio, and we eliminate the eyes-on-glass approach, because what we don't want our developers to be doing is staring at dashboards. We really want to let them know if there is an issue and whether they should be responding to it.

All right, so we have 30 minutes. We're going to skip through this slide, walk into a demo, and then we can come back here. What you're looking at on the screen is what we call the development portal. I talked to you about how we provide a Paved Path for our developers. This is the starting point for any developer who wants to, let's say, create a new service. They will come here and create a new service, and in addition to the service lifecycle management, we also provide things like observability, governance, lifecycle management, and discovery of new applications and services. Let's say, as a developer, you want to consume a particular service; you can come and look for one. Let's say I want to connect to a bank. I will come here and look for any capabilities that already exist in the system that I can use, and we also show a maturity score for the various services. We want to make this all self-service, in a way that reduces the friction required for teams to connect to each other to build capabilities. With that, what we have here is a tab called observability, and a lot of what I talked about powers the user experience that you're seeing on the screen. Let's start by taking a look at incidents. What you're looking at on the screen here is incidents, and I see that there are a few recently closed incidents, so I'm going to open one of them up. All right, as soon as I open the incident, there are a few things that I wanted to observe.
First, we show the dependency graph associated with an incident. Even though this particular service had a health issue, you can see that we break this into the dependency graph and show all the different assets participating in the incident. Now, as an engineer, I might want to know a little bit more about what's happening with this particular asset, so as soon as I click on this incident, it takes me to a screen where I get more observability metrics associated with the asset. Number one, we lead with the anomaly score. As you see here right now, the anomaly score is low; that means there is no issue with this particular asset. The next thing we show is the dependency graph. This shows everywhere this particular service is being consumed, so if there is an issue or a high anomaly score, we want to know who might be impacted. The next question is: if there is an incident, I want to know whether there was a change that caused the incident, because 66% of the incidents that we have seen in the past were initiated by a change being deployed. So we bring that information in here. Again, this is contextual information that traditionally was really difficult for us to find; we are bringing all of it into one location so that at the time of an incident you're not hunting for it. Then, based on metrics, we are able to provide the availability numbers and the request and error counts.

And if you scroll down here, this is where the tracing data shows up. Instead of going and performing an open-ended search, we want to provide traces in the context of an application. So as a service owner, when I look at this, I can see the traces that are underperforming, and when I click on one, it opens up the call graph for me. In this particular service, the end-to-end instrumentation has not occurred; that's why the call graph ends at the API gateway layer. So let me pick an application which has proper instrumentation. This is one of our flagship applications, QBO — QuickBooks Online. First thing: this is a front-end application now. I've moved away from the service, and I'm looking at a front end. I can see that the anomaly score is 2.3, which is really not that concerning; anything over five, we start to worry about. I talked about customer interactions. You can see here how many customer interactions have happened in the past two hours, how many of them have failed, and how many users are impacted. If this were an incident, that is the first question everybody asks: how many users are impacted? And we are able to provide that information right away.

Within this application, when I pop this open, I'm able to see all the different services that are participating in building this experience. It's taking a little while, because I think there are quite a few different services and assets. All right. We don't expand all the leaf nodes here; we can click further and look at more detail, but this is the dependency graph for this application. You can see that the QuickBooks application relies on so many different services behind it in the infrastructure. As I click it open, you can see the multi-level hierarchy of how many different services are participating in the orchestration of this experience. And for each service, we also show its respective anomaly score. So if I'm a QBO engineer or a site reliability engineer, I come here.
If I see that one of my dependent services is having a health issue, that's something I should be concerned about, and at that point I will go and look at that service and see if there is an incident that is active. Was there a change recently deployed? If the anomaly score is high enough, then I can actually initiate the creation of an incident right then. Next question: are there any changes? No. This shows the total volume of interactions. And I talked about FCIs, or failed customer interactions; this is where we show them. What you're seeing on the screen are bodies of work that a customer or user is trying to accomplish with this application. I can see here things like reconciliation, editing a timesheet, exporting a PDF. All of these are individual tasks that the user is trying to accomplish with the application, and I can actually pop them open. I can see that, hey, for the transaction query there are about 2,400 interactions, 117 have failed, and 17 users have been impacted. Again, if this number were sufficiently high, it would have automatically triggered an alert, and from that alert we would have automatically created an incident. So we are actually able to create incidents for individual interactions within an application, not just for the application health or the service health — literally for a piece of work that a customer is trying to accomplish with our application.

If I scroll down here, it shows the dependency graph, and then here's where tracing comes in: I'm able to take a look at the traces associated with this. Again, this app is not fully instrumented, so let me open an app which has proper instrumentation to look at traces. I'm going to open one of our IDX plugins. This is a plugin that takes care of a lot of the interactions with third-party data providers such as eTrade, Bank of America, and others. And I'm going to open this refresh connection flow. As soon as I click on the refresh connection flow, I get another dependency graph. This is based on trace data. It shows all the various interactions, the functions that are getting called, the success or failure of those functions, and what they call subsequently. Again, this view alone is not super useful because there's a lot of information here, but if I open up my trace, it opens up this call graph, which shows me that when I'm trying to authenticate and acquire the profile, I'm seeing an issue. Maybe as a developer I want to look at more detail. At that point, I click on these access logs, and this takes you directly into the logs for that particular trace.

So let me put all of this together. As an on-call engineer, when there is an incident, you land on this page and you look at all the active incidents that are happening, and then you say, hey, I want to know a little bit more; I want to understand what's happening with the incident. You open it up and you look at the dependency graph: hey, these are the different assets, I see an incident in progress, let me open it up. You go look at the anomaly score. You look at the customer impact. You look at the traces associated with it. You open a trace, you get a call graph. From the call graph, you go into the logs. So the whole journey, from getting an alert to going to the exact log line within a service, automatically discovering the whole flow, happens in under one minute. That is what we are striving for.
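One detail worth calling out: the jump from a trace straight to the matching log lines in Splunk works because the trace and span IDs are stamped on the log entries. A generic sketch of that correlation (not the actual Splunk integration) might look like this:

```typescript
// Generic sketch: stamp the active trace and span IDs on every log line so a trace
// view can deep-link into the matching log search (e.g. a query on trace_id).
import { trace } from '@opentelemetry/api';

export function logWithTraceContext(message: string, fields: Record<string, unknown> = {}): void {
  const ctx = trace.getActiveSpan()?.spanContext();
  console.log(
    JSON.stringify({
      message,
      ...fields,
      trace_id: ctx?.traceId, // what a trace UI would query the log store for
      span_id: ctx?.spanId,
      timestamp: new Date().toISOString(),
    })
  );
}
```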
That alert-to-log-line journey is the real power of this approach, because we want to reduce our mean time to isolate and mean time to restore. One thing I didn't touch upon is that we also rely on these metrics to automatically roll things back. We rely on Argo Rollouts to automatically see whether, when there's a change, the metrics associated with that change are deteriorating the user experience in some way. If so, then we progressively roll back our changes as well. This again helps us with preventing change-related incidents and also remediating them before a customer encounters any issues. That's all the time I had, I believe. About a minute? All right, one question. Yes, sir?

We haven't looked at it yet, no. For the front end, there is no sampling. For FCIs, there's no sampling; we literally extract metrics and say, this many users are impacted. You saw that there was a separate path for metrics and a separate path for spans. So metrics we extract without sampling, and for the spans we have sampling — adaptive sampling that automatically adjusts based on error rates, request throughput, and other factors. I'll take one more question.

We are still in the process of figuring out whether we want to enable it for all the applications. What we are doing is looking at workflows instead of applications. We are identifying what the top 10 or top 100 most valuable workflows are. Payment might be one of them; new user registration might be one of them. Instead of saying, across the board, all services enable tracing, we ask: what are the most valuable workflows, and what are the services and front-end assets that participate in those workflows? Enable tracing for them. So we are taking a different, customer-centric approach: what do we really want tracing for? Instead of just rolling out tracing everywhere and then seeing if we got the right information. So, a slightly different, workflow-based approach.

All right, I think we're out of time. Yeah, I'll take one more. So Tempo, the Grafana-based Tempo, provides an API where you can say, for this particular trace, please assemble all the spans. The UX that you saw is something that we built, but there are many open-source options for you as well if you want to just assemble a call graph. I don't think that part is really complicated. The hard part is: how do you discover the right trace in the context of the workflow that you are solving for? That is the real challenge. The call graph is a solved problem; there are many, many different options for that. All right, thank you so much.