Hey, everyone. Thank you for joining us this evening. I realize it's past 5 p.m. on a Thursday, and we're all still here, so thank you again for that. Today we are here to share our journey around customer-centric observability and how, at Intuit, we've been able to successfully detect customer impact in under three minutes. I'm sure most of you are familiar with Intuit and what it does. Our mission is to power prosperity around the world with products like TurboTax, Credit Karma, QuickBooks, and Mailchimp. Behind these products are five major platforms that enable and accelerate innovation and drive consistency across all of these products. As you can see from these numbers, the scale we work at is enormous. The organization that Naga and I are part of is DevX, and our job is to build the platforms and tools that enable our developers to deliver products faster and operate them excellently. I'm Vinit Samhel, Group Dev Manager at Intuit, and I'm here with Naga, our principal engineer. As you may know, we received a CNCF award again this year, and we are grateful for that. We are both active contributors as well as maintainers of open-source projects. So let's take a step back and look at our journey. When we started, we were largely focused on system-centric monitoring. What that means is that across the thousands of services and the hundreds of web apps, plugins, and widgets we have across Intuit, each one of these components had its own way of monitoring and measuring its availability. But when any of these components had an outage, we lacked an understanding of customer impact. Our time to detect these problems was also extremely high. And after detecting a problem, isolating it to the specific experiences that had gone down, the experiences that were unavailable, and then working out how to root-cause and repair it, took even longer. All in all, at our scale, we were spending a ton of dollars but not necessarily getting the best ROI on observability. And most importantly, we realized that we were not leveraging our own data. So we took a step back and anchored around one of our core values at Intuit, which is customer obsession. What that means is that we fall in love with our customers' problems, and then we sweat every detail of the experience to deliver excellence. When you look at observability, it's no different. While focusing on our systems is very important, focusing on what customers are experiencing is even more important. The technology that powers our products evolves over time and the solutions change over time, but the end-user value you deliver is durable; that stays. So while measuring service availability is very important, what is even more important is measuring your customer experience availability, and that's what we're focused on. Obviously, experience availability also has a direct correlation to your business revenue, which is very measurable. So to measure experience availability, we established the most basic premise, which is the customer interaction. A customer interaction is defined as a value-driven action that the customer performs with the product experience. It seems very simple. Some examples: a user signs into the product, a user e-files their taxes, a user makes a payment. Now, it's important to clarify that we are not focused on micro-events and micro-interactions like clicks and scrolls.
But what we are focused on is: what value does the customer get when they actually click on that button? That's the value we're focused on. That is the customer interaction we're talking about. Now, we measure these interactions in three states. One is a successful interaction, where the customer was able to proceed further and do what they were seeking to do. Second, a degraded interaction, where a suboptimal experience was delivered to the customer. And finally, a failed customer interaction, where the experience was actually unavailable to the customer and they couldn't proceed further. By measuring failed customer interactions, we can exactly quantify the experience availability, and we'll shortly talk about how we do that. So to solve this problem, we built a platform for failed customer interactions. It's part of our real user monitoring offering. What we did was build a library which is integrated in all our products, both web and mobile, and this library enables our developers to add instrumentation; we'll shortly talk about the nature of this instrumentation, which emits specific telemetry. We have a scalable back-end that ingests, enriches, and processes this telemetry at massive scale and persists it in our operational data lake. And then, with the millions and billions of interactions going on at any given point in time, how do we make sense of it? We make sense of it with Numaproj, our open-source project, which sniffs out anomalies and issues from these interactions. It identifies patterns; it identifies issues and outages that our customers are experiencing. When we identify an anomaly, an alert-worthy problem, our developers are alerted with very contextual alerts. What they get is an alert that names the specific asset or component that's failing and the exact number of users experiencing that impact at that point in time. And we also give them a deep link into a curated UX, which kickstarts their journey from detection of the problem to root cause. I'm going to hand over to Naga now, who's going to walk us through the details of the implementation. Thank you, Vinit. Let's now dive into the details. So the first step is the instrumentation. This is where we ask the product teams to identify the key interactions in their applications, and the developers from those product teams go about instrumenting them using the simple interface shown here. This is where they can provide a user-friendly interaction name and supply it with additional key-value tags. Next, the developers go about writing the business logic as usual. This is where you may perform a few back-end API calls, parse the response, and render it in the UI. After that, you can end the interaction. When you end it, you can either mark it as success or failure; this is very specific to that interaction and the business logic related to it. When the interaction fails, you can also supply an additional reason for the failure. Now, this interface behind the scenes actually uses the OpenTelemetry JS library. So when the interaction is created, a trace context is generated behind the scenes, and as back-end API calls happen throughout the lifecycle of that interaction, this trace context is propagated to the back-end. This is how we join the front-end with the back-end and provide that end-to-end view.
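Intuit's FCI library itself isn't public, but a minimal sketch of what such an interaction wrapper might look like on top of the standard OpenTelemetry JS API is shown below. The `startInteraction` helper and the `fci.*` attribute names are assumptions for illustration, not the actual library interface.

```typescript
// Minimal sketch of an interaction wrapper over OpenTelemetry JS.
// Illustrative only: the real FCI library interface and attribute names are not public.
import { trace, Span, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('fci-sketch');

// Hypothetical helper: start an interaction with a friendly name and key-value tags.
export function startInteraction(name: string, tags: Record<string, string>): Span {
  const span = tracer.startSpan(name);
  span.setAttribute('fci.interaction', name);                   // assumed attribute name
  for (const [key, value] of Object.entries(tags)) {
    span.setAttribute(`fci.tag.${key}`, value);                 // assumed tag prefix
  }
  return span;
}

// Usage: wrap the business logic of a key interaction, e.g. sign-in.
export async function signIn(username: string, password: string): Promise<void> {
  const interaction = startInteraction('sign-in', { flow: 'auth' });
  try {
    const res = await fetch('/api/signin', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ username, password }),
    });
    if (!res.ok) throw new Error(`sign-in failed with status ${res.status}`);
    interaction.setStatus({ code: SpanStatusCode.OK });          // successful interaction
  } catch (err) {
    interaction.setStatus({ code: SpanStatusCode.ERROR });       // failed interaction
    interaction.setAttribute('fci.failure.reason', String(err)); // optional failure reason
  } finally {
    interaction.end();                                           // always end the interaction
  }
}
```

The point of a wrapper like this is that product developers only think in terms of interaction name, tags, and success or failure, while span creation and context handling stay hidden.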
This library is actually embedded in all of Intuit's products, both web and mobile. Let's look at the high-level design. As customers use our products, FCIs are generated in the background and sent to our OpenTelemetry Collector in the back-end. We have written a custom processor to extract FCI metrics from these spans. (That's a PagerDuty alert going off right there. Yeah.) So from these spans we extract things like success counts, error and degraded counts, and the latencies of the interaction across multiple dimensions. This is then aggregated over a one-minute window using a stream-processing pipeline based on Apache Beam and Flink, and sent to a Kafka topic, which is part of our operational data lake. Using this operational data lake, we not only solve observability use cases, but also use cases in other domains like security and development velocity. The aggregated FCI metrics are then persisted in Druid, an OLAP data store providing sub-second query performance, and also sent to Wavefront for all our alerting needs. The aggregated metrics are also analyzed by our AIOps system, Numaproj, which generates an anomaly score for each of these interactions. So using these FCI metrics, we are able to detect any issue pretty quickly. But to enable isolation, we also want the raw spans. To that end, from the OpenTelemetry Collector we use the S3 exporter to send those traces into Grafana Tempo, which is our trace store. This is how the data collection and analytics pipeline looks. Now, as an on-call engineer, you can use the UX or the GraphQL-based API that we have on top of this data lake to extract these metrics and traces from the data store. I'd like to highlight a couple of important use cases that Vinit had mentioned earlier. The first is quantifying the impact. Whenever there is an issue or an incident, the first question leadership asks is how many unique users are impacted. When we want to do these kinds of operations on the number of events we are collecting, which is in the range of millions of events over a short period of time, running count-distinct queries is computationally intensive and takes a lot of time. Since we deal with operational metrics, an approximate value with a fixed error bound is acceptable. To that end, there is a library of streaming algorithms provided by Apache DataSketches; they have algorithms for many types of use cases. Specifically in this case, we use something called the HyperLogLog algorithm. It is able to estimate fields with cardinality in the billions, with an error rate of just plus or minus 2%, using about 1.5 KB of memory. Apache Druid actually supports this and many other sketches via the DataSketches extension.
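As a concrete illustration of that approximate counting, here is a hedged sketch of an approximate count-distinct query against Druid's SQL endpoint using APPROX_COUNT_DISTINCT_DS_HLL from the DataSketches extension. The datasource and column names are hypothetical, not Intuit's actual schema.

```typescript
// Sketch: ask Druid for an approximate unique-user count for a failing interaction.
// Datasource and column names (fci_metrics, user_id, interaction, status) are hypothetical;
// APPROX_COUNT_DISTINCT_DS_HLL comes from Druid's DataSketches extension.
export async function uniqueUsersImpacted(interaction: string): Promise<number> {
  const response = await fetch('http://druid-router:8888/druid/v2/sql', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: `
        SELECT APPROX_COUNT_DISTINCT_DS_HLL(user_id) AS impacted_users
        FROM fci_metrics
        WHERE interaction = ? AND status = 'failed'
          AND __time >= CURRENT_TIMESTAMP - INTERVAL '10' MINUTE`,
      parameters: [{ type: 'VARCHAR', value: interaction }],
    }),
  });
  const rows = await response.json(); // e.g. [{ impacted_users: 4213 }]
  return rows[0]?.impacted_users ?? 0;
}
```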
The other problem is: how do we provide quality signals and rule out all the noise? What happened was, as we started collecting these FCI metrics, we observed that errors are inevitable in the front end. There are a lot of client-related dimensions that come in, in terms of the client device, the network, and the browsers that users are using, making it very difficult to separate all the noise from the actual signal. So we could not set up alerts based on a fixed error rate, because these error rates vary throughout the day for each interaction, as well as throughout the season. Each interaction had its own seasonality and trends. So for that purpose, we used AI/ML here: Numaproj, an open-source project that provides real-time analytics and AIOps on Kubernetes. Specifically for the FCI use case, it generates an anomaly score for each and every interaction, considering its seasonality and trend over a period of time. A score closer to zero means the interaction is doing fine; as it trends towards 10, it means customers are really facing issues using the experience that the system provides. What this enabled us to do, as a platform, is provide a very generic alert out of the box using the anomaly score, thereby making the dream of getting MTTD under five minutes possible.
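Conceptually, that out-of-the-box alert boils down to something like the sketch below. The anomaly-score lookup, the threshold, and the Slack webhook URL are all hypothetical; the real alerting runs through Wavefront and Numaproj rather than hand-rolled code like this.

```typescript
// Conceptual sketch of a generic, out-of-the-box alert on the anomaly score.
// Everything here is hypothetical; Intuit's actual alerting goes through Wavefront/Numaproj.
const ANOMALY_THRESHOLD = 7; // score ranges from 0 (healthy) to 10 (customers impacted)

// Hypothetical helper that reads the latest anomaly score for one interaction.
declare function fetchAnomalyScore(plugin: string, interaction: string): Promise<number>;

export async function checkInteraction(plugin: string, interaction: string): Promise<void> {
  const score = await fetchAnomalyScore(plugin, interaction);
  if (score < ANOMALY_THRESHOLD) return; // interaction looks healthy, no alert

  // Fire a contextual Slack alert via an incoming webhook (URL is a placeholder).
  await fetch('https://hooks.slack.com/services/T000/B000/XXXX', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text:
        `Anomaly score ${score.toFixed(1)} for interaction "${interaction}" ` +
        `in plugin "${plugin}" - customers may be impacted`,
    }),
  });
}
```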
With that, I'll just dive into the demo. In this demo, what I would like to show is the journey an on-call engineer usually takes whenever there is an issue, and how they use the tool we have built to debug it. So this is a demo web application, and you can see there is a login feature provided. I, as a customer, am trying to log in here, and say, for example, in this case, the sign-in functionality is not working as expected. One thing to note: we don't mark the FCI as failed for cases where users have supplied an invalid username or password. The reason for that is the system worked as expected; we mark this interaction as failed only when the system is really having a problem. Now, as more and more customers start facing this problem signing in, the anomaly score starts increasing for this specific interaction, eventually resulting in a Slack alert for an on-call engineer. So this is an example of the Slack alert that our on-call engineer receives. There's a lot of information here, so let me just go one by one. The first thing you see here is the plugin name. Think of a plugin as a microservice, but for the front end. It has all the UI functionality and features, it has the business logic that is making the back-end API calls, and so on, packaged as a feature, and then it is embedded into web applications like TTO, QuickBooks, et cetera. So if a feature is needed, we write it once and then use it in multiple places. The sign-in feature is actually part of this plugin. After that, we show the unique users impacted, using the HyperLogLog algorithm that I mentioned earlier, and it says that around 9% of them were impacted. It also provides other numbers like the total failed count as well as the total interaction count. It also calls out the specific interaction that is failing, sign-in in this case, and that it is impacting this specific web application. So now, as an on-call engineer, I clearly know there is some issue going on, and I already know how many unique users are impacted because of that problem. As a next step, I would like to isolate where this problem is coming from. There is a deep link available right in that Slack message, which takes the user to our dev portal, an internal tool, and to the asset in question, the plugin asset. At a quick glance, the on-call engineer can find out that over a period of two hours the plugin has been doing fine; it's only in the last 10 minutes that the issue has been cropping up. You see the same kind of information that you saw in Slack over here, but it also provides a link to the interaction that is having the problem. Let me click on that. Here you see that the plugin as such provides multiple interactions, and it's only the sign-in functionality that is having the problem. So let's look into that one. We provide the same level of breakdown, but at a minute level and for this specific interaction. The first thing I, as the plugin's on-call engineer, would like to do is ask: is this issue happening just in the front end? To rule that out, we provide multiple dimensions on which the on-call engineer can slice and dice the data and determine whether the issue is localized to the UI itself or not. In this specific example, you see that in production there are a couple of plugin versions running, and the errors are happening consistently across the board. Similarly, you can look at, say, the browser, and see whether the issue is happening only on a single browser. By looking at this data, as an on-call engineer I'm fairly certain that the plugin as such is doing okay; most likely the problem is in the back end. To that end, we provide exemplar traces, and you see that all the metadata from the metrics is used here to filter down to the traces that are of interest to the on-call engineer. As I click on one, it provides a sample trace for that specific interaction with this end-to-end view. One thing to note is that in this specific path there are four or five services, and in the fifth service there is a database call. As an on-call engineer, I don't know anything about this back end, but I'm now fairly certain that the issue is most likely in the bottommost span here, where the database call is going on. I can look at a few more examples and narrow down, or at least get an intuition, that most likely the problem is happening in that space. This is where I can then page the on-call engineer who owns that service and hand it over to him or her to troubleshoot this issue further.
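The exemplar-trace lookup in that demo is essentially a metadata-filtered trace search against Tempo. Below is a hedged sketch of what such a query could look like with Tempo's TraceQL search API; the span attribute names, URL, and port are assumptions rather than Intuit's actual setup.

```typescript
// Sketch: use metric metadata (plugin, interaction, error status) to pull exemplar
// traces from Grafana Tempo's search API with a TraceQL query. Attribute names and
// the Tempo URL are assumptions; adjust to whatever your spans actually carry.
export async function findExemplarTraces(plugin: string, interaction: string) {
  const traceql =
    `{ span.fci.plugin = "${plugin}" && ` +
    `span.fci.interaction = "${interaction}" && status = error }`;
  const url = `http://tempo:3200/api/search?q=${encodeURIComponent(traceql)}&limit=20`;
  const res = await fetch(url);
  const body = await res.json(); // { traces: [{ traceID, rootServiceName, ... }] }
  return body.traces ?? [];
}
```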
So as you saw in the demo, using the FCI metrics we were able to isolate as well as quantify the unique user impact, and then use the power of OpenTelemetry tracing to isolate where the issue is surfacing from. With that, I'll now hand it over to Vinit to wrap it up. Thanks, Naga. All right, so with the blueprint that you just saw, for every experience that has been instrumented with these FCIs, we are able to detect a problem in under three minutes. And as you see, we have AI-driven contextual alerts that exactly quantify the number of users impacted. This is the state we are in today. Isolation of these experiences is very obvious: we know exactly that the sign-in flow is broken or the payment flow is broken, and across the millions and billions of flows that we have across multiple product experiences, we know exactly where the problem is. And most importantly, we have end-to-end triageability. We've established a path where we detect the problem, then we quantify the blast radius, then we isolate the problem, and finally we lead the engineers to the root cause. With this platform, some numbers that I'd like to share: these numbers are from our last tax season, specifically for TurboTax. We served over 30 billion interactions, and we delivered over four nines of experience availability with this platform. And not only that, we've been able to save more than $2 million in vendor observability spend, just by using open-source technologies and some of our own open-source contributions as well. Now, what's next? So we have been able to establish an end-to-end dependency graph, if I may, because we are able to connect all the dots in our ecosystem. We know exactly where the problem is. Is it a front-end problem? Is it a back-end problem? We can isolate it, as you just saw. Clearly, we've nailed the detection problem. Now we want to reduce the mean time to recover, the mean time to repair. So our next focus is: how do we recover from these problems as fast as we can? That's where we are getting into API-level granularity: can we pinpoint the exact API operation that's failing, using AI insights again? We also want to manage costs at the scale we have, so we are looking into adaptive sampling as well. And obviously, these signals will inform our rollouts and enable automatic rollbacks as well. Lastly, there is a ton of meta information that we are gathering. As you saw, browsers and platforms are just some of the meta tags we have, but there is tons more meta information that we still have to parse out, and with all this meta information we can build deeper insights. Those are some of the problems we're working on right now. With that, we'll open it up to any questions. Yes, go ahead, yeah. Yeah, let me repeat the question: is the dev portal that we looked at a custom portal, or is it open source? So yes, that is a custom portal. That UX is built in-house, and today it is not open source, but the platform itself is built on open-source technologies, yeah. Just to get it right: so you inject the OpenTelemetry code into the browser context already, and all the way down to all the back-end calls, so you have a complete overview. How do you gather this? Do you use the standard endpoint, or is this something you implemented on your own to gather this from the front end and the back-end services? Did you do that on your own, or do you use something out of the box? So yeah, we use the OpenTelemetry tracing code here. There is a standard way where you generate a span and it creates trace context, like a trace ID and a parent span ID, and then you pass it on as a header whenever you are performing that back-end API call. And the important part is that the back-end services also have to play along in the trace journey; otherwise the trace gets dropped. So it's an OpenTelemetry tracing platform. It's classic OpenTelemetry; it adheres to those standards. But the callout here is that we've simplified the entire implementation down to just those three lines of code. So as a developer, you don't have to worry about OpenTelemetry and OpenTracing and all their complexities. You're just saying, hey, this interaction is important for me, and if this experience is unavailable, I want to know about it, yeah. And the collection part, do you collect that with OpenTelemetry, or how do you collect it? Yeah, we use the OpenTelemetry Collector to collect all the FCI spans.
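That standard propagation mechanism looks roughly like the minimal sketch below: plain OpenTelemetry JS injecting the W3C trace context headers onto an outgoing back-end call, assuming the SDK setup has registered the W3C propagator. This is generic OpenTelemetry, not Intuit-specific code.

```typescript
// Sketch of standard W3C trace-context propagation with OpenTelemetry JS.
// The active span's context (trace ID, parent span ID) is injected as headers
// (traceparent/tracestate) on the outgoing back-end API call, assuming the SDK
// has registered the W3CTraceContextPropagator as the global propagator.
import { context, propagation } from '@opentelemetry/api';

export async function callBackend(path: string, payload: unknown): Promise<Response> {
  const headers: Record<string, string> = { 'Content-Type': 'application/json' };
  propagation.inject(context.active(), headers); // adds traceparent (and tracestate)
  return fetch(path, { method: 'POST', headers, body: JSON.stringify(payload) });
}
```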
Thank you, great question. So just to repeat: do we present this to external developers on our platform? We are working on externalization; that is an initiative Intuit is working on. But as it stands today, this is not presented externally, though external FCI is definitely a possibility we're looking at in the future, yeah. So you're probably not ingesting all the failed FCIs. What kind of sampling rate do you decide on for the failed interactions? And do you do any sampling for the successful ones as well? Yeah, so right now it is 100% sampling, but we are also looking into adaptive sampling. One thing to note is that even though we are planning to do the sampling, it is mainly for the back-end API calls; for the front end, for these FCIs, we still want to collect that information, because that's the only place where we get real information on how the customer is getting that experience, and we use that information to actually quantify the impact as well. So even though we are going to do adaptive sampling, for the front end we will continue to collect these spans at 100%. Yeah, and just to underscore: the metrics we will ensure are 100% accurate. So the unique user impact, the FCI failures, that will be 100% accurate, based on 100% sampling. But beyond that, the persistence will be sampled, so we won't retain every trace after that point, and the trace propagation will be cut down, yeah. What are you using Kafka for? What are you using Kafka for? Oh, Kafka is a place to stitch multiple data streams together. The FCI metrics are stored in Kafka, from which Druid and other systems can easily read and persist the data. So it's a place for us to stage the data. Yeah, so there is enrichment processing going on within the Kafka pipeline, and then it goes into Druid. Multiple partitions, yeah. Hey, great presentation. I was wondering how you mark the result of the span on the trace. Is that just metadata on the span? Sorry, could you please repeat that? How do you mark the interaction as completed or failed on the trace? Yeah, so as I was showing earlier, when you mark an interaction as success or failure, that's an additional tag that is passed as an attribute on the span. Yes, and then in the OpenTelemetry Collector, we specifically sniff out that tag to mark that interaction as success or failure. Hi, this sounds a lot like HTTP status codes; I was thinking about it in that direction. Can you talk about the difference between what you would see with the status code versus what you see from this customer-interaction style? Yeah, so here's the thing. When you have a customer interaction, you may have multiple back-end API calls, and even if one service returns a 500, you may have multiple retries behind the scenes. So it's not always mapped directly to the status code of the service that is returning the response. That is why we are not using the status code as-is, like how it gets used in OpenTelemetry, and instead introduce a custom tag: even though the API may respond with a 500, it still gives the front-end developer the power to mark the interaction as success or failure. Yeah, and let me take a specific example as well. At Intuit, we work with different payment providers. So say our primary payment gateway is down, and the customer is trying to make a payment. Obviously the first call goes out and we get an error there. Now, if you just rely on the status code, that would be marked as an error. But if you fail over, transparently to the customer, to a secondary gateway, we want to know about that problem, because that's a degraded interaction; but for the customer, it was successful. So that is why there is a certain level of abstraction on it.
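That payment-gateway failover example would look roughly like the sketch below, reusing the hypothetical interaction wrapper from the earlier instrumentation sketch; the 'fci.state' attribute, gateway endpoints, and helper signature are all assumptions, not Intuit's actual tags.

```typescript
// Sketch of the failover example: the primary gateway errors, the secondary succeeds,
// so the interaction is marked degraded rather than failed. All names are hypothetical.
import { Span, SpanStatusCode } from '@opentelemetry/api';

// Same hypothetical helper sketched earlier in the instrumentation example.
declare function startInteraction(name: string, tags: Record<string, string>): Span;

export async function makePayment(amount: number): Promise<void> {
  const interaction = startInteraction('make-payment', { flow: 'billing' });
  try {
    let res = await fetch('/api/pay/primary', {
      method: 'POST',
      body: JSON.stringify({ amount }),
    });
    if (res.ok) {
      interaction.setAttribute('fci.state', 'success');
    } else {
      // Primary gateway failed (e.g. HTTP 500): retry transparently against the secondary.
      res = await fetch('/api/pay/secondary', {
        method: 'POST',
        body: JSON.stringify({ amount }),
      });
      if (res.ok) {
        // The customer still succeeded, but the experience was suboptimal: degraded, not failed.
        interaction.setAttribute('fci.state', 'degraded');
        interaction.setAttribute('fci.failure.reason', 'primary gateway unavailable');
      } else {
        interaction.setAttribute('fci.state', 'failed');
        interaction.setStatus({ code: SpanStatusCode.ERROR });
      }
    }
  } finally {
    interaction.end();
  }
}
```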
Good question; we are working towards that goal. So the library is not public yet, but yeah, we're working on that. So, one question: especially when we're trying to build these kinds of tools to log errors, we definitely tend to collect a lot of data, because we want to use that data for analytics purposes, to generate all these views. But that also gets into the anti-pattern where, when an issue is happening, you tend to log too much data and create a lot of side effects for the application. So what are the things you did to make sure you don't get into that kind of anti-pattern? Yeah, so all of this data is sent to a separate endpoint, and then internally we have our own retries, for example. This is operational data, so think of it as fire-and-forget from the application's perspective, not inducing overhead on the app itself. So when you say fire and forget, are these async threads that you're running? Yeah, we send it to a separate endpoint while the application flow continues making those same back-end API calls. And even if our endpoint is unavailable, these trace spans will simply be dropped. Thank you. And thank you for passing the microphone. Yeah.
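As a footnote to that last answer, the fire-and-forget idea can be sketched roughly as below; the endpoint path is hypothetical, and in practice the OpenTelemetry JS exporters handle this batching and error handling for you.

```typescript
// Sketch of fire-and-forget telemetry: spans go to a separate collector endpoint,
// never block the application flow, and are silently dropped if the endpoint is down.
// The endpoint path is a placeholder, not Intuit's actual collector URL.
export function sendTelemetry(payload: object): void {
  const body = JSON.stringify(payload);
  // sendBeacon queues the request in the browser and returns immediately.
  if (typeof navigator !== 'undefined' && typeof navigator.sendBeacon === 'function') {
    navigator.sendBeacon('/otel-collector/v1/traces', body);
    return;
  }
  // Fallback: keepalive fetch with errors swallowed, so the app never notices failures.
  fetch('/otel-collector/v1/traces', { method: 'POST', body, keepalive: true }).catch(() => {
    /* collector unavailable: drop the spans */
  });
}
```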