Good afternoon, everyone. How is everyone doing? My name is Amit Kalamkar. I lead observability and analytics at Intuit. I have with me Vigith Maurice, a principal engineer on my team. Today we want to talk about Intuit's observability journey using AIOps.

Here is the agenda for today. First, we'll talk a little bit about Intuit and our observability strategy. Then we have a live demo where we show this observability strategy in action. After that, Vigith will go over the architecture and the details of how we are achieving it. And lastly, we'll do questions and answers.

Most of you should know Intuit from our flagship products like TurboTax, QuickBooks, Mailchimp, Credit Karma, etc. We are one of the largest SaaS companies out there. On the top, you can see the five main platform areas within Intuit that power all these products. This ensures that we provide value to our customers while accelerating innovation. On the bottom, you can see the scale at which we operate; it is pretty large.

Vigith and I both belong to Developer Experience and Platform. Our group provides all the platforms, infrastructure, and capabilities these products need to develop and operate, and we also run at a very large scale: for example, around 1 million CPU cores. The investment we have made in these platforms has resulted in a 6x improvement in development velocity since 2019.

Intuit is very much a believer in open source. We not only use it, we also contribute a lot back. Intuit is a proud recipient of the CNCF End User Award in both 2019 and 2022. Intuit has also created and open-sourced projects like Argo and, recently, Numaproj, and has many contributors and mentors across different open source projects. So we are very much into open source, and we want to ensure we give back to the community.

Now let me give you a high-level idea of our platform at Intuit. If you look at the slide, this is our modern SaaS platform. We started the modernization in 2018 and modernized pretty much everything, front end and back end, all the boxes you see there. We also created paved roads for both containerized and serverless apps. This ensured that our developers have built-in automation for build, deployment, and scale management. As part of this modernization, we made a deliberate effort to instrument everything out of the box, so we get real-time events and metrics from all layers of our infrastructure and platform, and we store them in our operational data lake. We also have an open source AIOps platform called Numaproj, which we use to do real-time analytics on this data and generate actionable insights.

Now, let me talk a little bit about the observability strategy at Intuit. One of the principles we started with is that we wanted our observability to focus on how our customers are feeling. What is the customer impact? What is the revenue impact? We wanted to move away from system-centric observability to customer-centric observability. So all of our strategy, as you will see, is customer-centric, and we rely heavily on AIOps to achieve it.

Here are a few of the strategic goals we have at Intuit. We have a goal of MTTD of less than 5 minutes and MTTR of less than 40 minutes. Our availability target is 99.99% for tier 1 and tier 2 applications, and our performance SLA is less than 4 seconds. We have built a lot of capabilities to achieve these goals.
For example, for MTTD, we built RUM (real user monitoring) and FCI (failed customer interactions), as well as golden signals that come out of the box. These metrics are connected to automatic alerting and incident creation, and that is helping us reduce MTTD. On the MTTR side, we have centralized, schematized logging as well as log analytics, and we use distributed tracing and dependency graphs to isolate issues; that is helping us reduce MTTR. For the availability targets, we rely heavily on progressive rollouts as well as automatic rollbacks driven by AIOps. And on the performance side, we have implemented RUM performance metrics as well as anomaly scores on those metrics to achieve the performance targets.

Now let me talk a little bit about our observability pillars. We rely on logging to reduce MTTR; some of the things we do with logging include root cause analysis and stack traces. For metrics, we have implemented golden signals as well as FCI, and they automatically alert and create incidents when there is anomalous behavior. We use tracing to isolate issues; most of what we use there are call graphs and dependencies. All of this data is stored in the operational data lake. We use our AIOps platform, Numaproj, to analyze this data in real time, and then we show it on user interfaces and triaging flows that are curated for particular use cases. These real-time insights are also used to trigger incidents and alerts for developers. Now let me hand over to Vigith to show all this in action in a demo.

Thank you, Amit. Now let me show you what our triaging process looks like. For that, let's pretend that I am an application developer who just got paged for an alert. It all starts with the entry point, and we have many ways to get an alert. It could be through PagerDuty, but one way of getting the alert is seeing it in a Slack channel.

These alerts are very contextual. What that means is that when we send an alert, it describes what you are seeing. Let's take this particular line item: it shows the time the issue started, the number of users impacted, how many interactions were degraded versus failed, the exact interaction that failed, and the anomaly score that triggered the event. All alerts are based on anomaly scores. I'll get a little more into what an anomaly score is and how we quantify it, but the key thing is that alerts are based on anomaly scores. Not only that, in this contextual message we also give a way to see the metrics (if you click on the link, you will see the metrics), a link to the logging system, and, more importantly, a link that brings you into the observability tool that can be used to triage.
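To make the shape of these contextual alerts concrete, here is a minimal Python sketch of such a Slack-style alert. All field names and links are hypothetical illustrations, not Intuit's actual alert schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ContextualAlert:
    # Hypothetical fields mirroring what the talk describes: start time,
    # users impacted, degraded vs. failed counts, the failing
    # interaction, and the anomaly score that triggered the alert.
    started_at: datetime
    users_impacted: int
    degraded: int
    failed: int
    interaction: str
    anomaly_score: float  # normalized 0-10, interpretable Intuit-wide

    def to_slack_message(self) -> str:
        return (
            f"[{self.started_at.isoformat()}] `{self.interaction}`: "
            f"{self.users_impacted} users impacted "
            f"({self.degraded} degraded / {self.failed} failed), "
            f"anomaly score {self.anomaly_score:.1f}\n"
            # Hypothetical deep links: metrics, logs, and the triage tool.
            "metrics: <https://metrics.example/...> | "
            "logs: <https://logs.example/...> | "
            "triage: <https://o11y.example/...>"
        )

alert = ContextualAlert(
    started_at=datetime(2023, 4, 18, 9, 30, tzinfo=timezone.utc),
    users_impacted=1200, degraded=300, failed=900,
    interaction="search", anomaly_score=6.0,
)
print(alert.to_slack_message())
```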
If I click on this link, because this is the triage experience you will go through, you see something like this: this is the TurboTax web app page, and it has a live incident. The alert knows about the time, and that's where the time component comes into play; it automatically selects the time window. What you're seeing is a three-color representation of the anomaly score. The way to interpret it is written here: if your application is healthy, the score is between one and three, and everything is behaving within the normal operating pattern. If the score goes from four to ten, that is, from an amber gradient all the way to red, the application is heavily anomalous, deviating from its usual operating pattern.

One key point: this is not a dashboard, it's a tool. What that means is that we collect a lot of data, around 10 terabytes per day, and we cannot just throw all that information at the user, because it would confuse them and they could not prioritize what's happening. So we use anomaly scores inherently to bubble up insights. The first thing you see is anomaly insights: which interactions are failing and what their anomaly scores are. The higher the anomaly score, the more important it is.

Along the journey, you will also see that we summarize how many failed interactions there are and how many users are impacted, and these are dissected in different ways. When you are looking at the top-level entity, this is the total number of users impacted for the service or web app as a whole, but I will show how to dissect, or slice and dice, this further as we go through the demo.

The key thing developers use is the dependency graph. The moment they get an alert, they want to figure out whether the problem happened because of them or because of somebody else. In this view you can clearly see the triaging story: there is a widget that has a problem because of its backend. The story behind this highlighted line is clear: a search widget had a problem, it calls a search service, and the orchestration behind it has a problem. This kind of dependency graph can get quite complex; if you look into it, we even show how many changes have been applied, in real time. The point here is that the data we are showing is curated and merged from different data sources, so we can give the developer a unified experience as they walk through this dependency graph. And we even annotate the relationships using the anomaly score: the value six is there because this interaction has an anomaly score of six, even though the other service appears to be operating within its normal pattern. That is how important the data curation and the integration with anomaly scores are.

Now, let me go down the journey and help you understand a little more about what I meant by slice and dice. One could say it's not just about the total score: you can even filter by browser or plugin and get even more slicing and dicing of information, which I will skip in the interest of time.
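As a rough illustration of how insights are bubbled up, here is a small Python sketch: it maps a 0-10 anomaly score onto the green/amber/red interpretation described above and sorts interactions so the most anomalous surface first. The amber/red split at six and the sample data are assumptions for illustration.

```python
def score_band(score: float) -> str:
    """Interpret a normalized anomaly score (0-10) per the talk:
    1-3 is healthy, 4-10 ramps from amber to red."""
    if score <= 3:
        return "green (normal operating pattern)"
    if score <= 6:  # assumed midpoint of the amber-to-red gradient
        return "amber (deviating)"
    return "red (heavily anomalous)"

# Hypothetical per-interaction scores; the real tool derives these
# from roughly 10 TB of telemetry per day.
interactions = [
    {"name": "search", "score": 6.2},
    {"name": "checkout", "score": 1.4},
    {"name": "login", "score": 3.9},
]

# Bubble up insights: highest anomaly score first.
for item in sorted(interactions, key=lambda i: i["score"], reverse=True):
    print(f"{item['name']}: {item['score']:.1f} -> {score_band(item['score'])}")
```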
One important feature is that we have out-of-the-box trace analytics. If you see here, these fields are auto-populated from the anomaly scores: earlier we saw which interactions had the highest anomaly, and it auto-populated those fields and ran a search. So this is trace analytics; there is a better way to see it too, so let me show that after going through a little more detail.

As I said earlier, search was the interaction that was impacted, and there's a table here, let me scroll it for you; it is sorted by anomaly score, so search, at the top, is the most impacted one. This is where I can show you the slicing and dicing in a cleaner fashion. You can see that the failure rate has increased and the number of users impacted is 100%. This is very specific to this search interaction: we are able to slice information per interaction, and even at a much more granular level, by browser and things like that.

Now, as a developer, you know that you have a problem with search, and you get an even more curated version of this trace analytics. It clearly calls out, with more fields, that the CI (which stands for customer interaction) has failed, the type is interaction, and the plugin and widget are auto-populated, and it gives you a trace analytics view. If you click on this, you end up in a complete trace view, something like this. The red color shows that it's a failure, and if you scroll down, you will see why it failed.

The key thing I wanted to show you was the journey from an alert all the way to the debugging information: curating the information, collapsing it into a view, and a guided path where we use anomaly insights inherently to sort and filter the data so that one can reach the right set of debugging information. One more key thing I want to show is that we also have access logs: the moment you get to the trace, you can click on the access logs and see the logs related to that problem. Logs are a developer's best friend, so we take you all the way from the customer failure, from Slack, from our contextual alerting system, down to the logs, which includes tracing, to guide the user.
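To show what slice and dice might look like mechanically, here is a minimal Python sketch that groups hypothetical interaction events by (interaction, browser) and computes failure rate and impacted users per slice. The event shape is invented for illustration; the real system answers these queries from the operational data lake.

```python
from collections import defaultdict

# Hypothetical raw interaction events.
events = [
    {"interaction": "search", "browser": "chrome", "user": "u1", "failed": True},
    {"interaction": "search", "browser": "chrome", "user": "u2", "failed": True},
    {"interaction": "search", "browser": "safari", "user": "u3", "failed": True},
    {"interaction": "login",  "browser": "chrome", "user": "u4", "failed": False},
]

# Slice per (interaction, browser); finer slices just extend the key.
slices = defaultdict(lambda: {"total": 0, "failed": 0, "users": set()})
for e in events:
    s = slices[(e["interaction"], e["browser"])]
    s["total"] += 1
    s["failed"] += e["failed"]
    if e["failed"]:
        s["users"].add(e["user"])

for (interaction, browser), s in slices.items():
    rate = s["failed"] / s["total"]
    print(f"{interaction}/{browser}: failure rate {rate:.0%}, "
          f"{len(s['users'])} user(s) impacted")
```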
Now, let me talk about our architecture. If you have to build a system like this, you need much more than just observability data; it has to come from multiple layers. That's the reason we start with an operational data platform. What is an operational data platform? In the observability domain we talk about logs, metrics, and traces, but from our standpoint there is much more to it: we collect information in real time from Kubernetes clusters, from build systems, from the service mesh, and from our security systems.

There is also the developer portal. If you remember the demo, let me go back there and show you: every service at Intuit will have a tab here that says observability, and we have a golden entity that ties everything together. That is how all of this information comes in in real time, and observability is just one of those data sources. I will deep dive into observability shortly, but I want to convey the scale of our data platform architecture: this is not a bespoke architecture we built just for observability. It powers much more than observability. Look at the rightmost column, the use cases we support with this platform: runtime CVE and compliance detection, and cost attribution; remember, in the earlier slide I talked about one million cores, and you have to attribute that cost to each and every asset. We need a platform that can support that scale, and then observability, of course, is a key use case on top of it.

Now let's talk about how we do it, through two central aspects. First and foremost, we have a storage and cataloging system. This is only possible if you have clean data being ingested into a system that supports curation at ingestion. We need to make sure the data is clean and attributable, and that no dark data comes in, because data is costly: tens of terabytes a day will cost you a lot in just a matter of days. So every piece of data we ingest is contextual and useful.

Second is processing, discovery, and search. The central brain of this is Numaproj. Every piece of data coming in, in real time, can be assigned insights: we assign anomaly scores, we assign enrichments, and so forth, so that we have context about every input coming into the system. And this platform is extensible: when we adopted and extended it for our observability system, we added, for example, a Tempo store so that we can store traces and retrieve them later. The whole system is based on open source technology; there is nothing proprietary whatsoever. We use OpenSearch for inverted indexing, and we use Druid and Apache Flink; Apache Flink does the very heavy number crunching, because of the scale we operate at. So this is really a data engineering problem with a centralized data engine. The entire data set is exposed via an API, a GraphQL API, with response time SLAs in the milliseconds. The operational data platform carries the business logic, that's how multiple use cases fit on it, and we democratize the data we store, so there is a very clean interface to pull the data out.

There are two aspects to this. The first is data curation and the data lake: any data, structured or unstructured, events or metrics, is stored in the data lake, and this is only possible with a data mesh kind of architecture. We have a golden entity for every data set, so the operational data lake in the end becomes a warehouse for clean, documented, schematized operational data. It helps us drive faster automation and better decisions because it's real-time in nature; you will see shortly what I mean by that.

The second is the analytics we do on top of the data stored in the lake. The problem with systems like Flink and other data engineering tools is that the processing is done centrally; if you want to move the processing all the way to the source and get clean, enriched data at ingestion, you need a very lightweight system that can do lightweight stream processing with some analytics. That is why we built Numaproj, an open source project. It is a Kubernetes-native, language-agnostic, real-time data analytics platform. It has two parts, Numaflow and Numalogic. Numaflow is the central engine for doing computation on a stream, an unbounded input. Numalogic is a collection of ML models and libraries. Numalogic can work by itself as a library, but it works best when it runs on Numaflow, because then you can do the analytics on a streaming engine. That is how the two are married together.
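To give a feel for the "library that shines on a stream engine" idea, here is a toy stand-in for a Numalogic-style model: fit a baseline on a window of values, then score new points. This is not the actual Numalogic API; real Numalogic models are ML-based (e.g. neural networks over sliding windows), and the z-score here is only a placeholder.

```python
import statistics

class ToyAnomalyModel:
    """Stand-in for a Numalogic-style model: learn a normal operating
    pattern from a baseline window, then score incoming values."""

    def fit(self, baseline: list[float]) -> "ToyAnomalyModel":
        self.mean = statistics.fmean(baseline)
        self.std = statistics.pstdev(baseline) or 1.0
        return self

    def score(self, value: float) -> float:
        # Distance from the learned pattern, clipped into the 0-10
        # scale so scores stay interpretable everywhere.
        z = abs(value - self.mean) / self.std
        return min(z, 10.0)

# Usable standalone as a library; on Numaflow, the stream engine would
# hand the model a 10-minute sliding window of values instead.
model = ToyAnomalyModel().fit([0.9, 1.1, 1.0, 1.2, 0.8])
print(model.score(1.0))  # ~0: within the normal operating pattern
print(model.score(5.0))  # 10.0: clipped, heavily anomalous
```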
So what is Numaflow? I'll just give you a brief introduction. Let's assume you have a stream coming in, and as a developer you need to do some language-agnostic computation on it, stream-processing computation, in a way that is easy for anybody to do. The whole point is ease of use: any machine learning engineer or application developer should be able to do stream processing in a very lightweight, cost-efficient manner.

So let's say you have a source. The source is all about reading data, and that is what Numaflow gives you out of the box, with processing guarantees. The processing after the source is the user's part. It could be a map or a reduce function where you do data processing using windows. One example: Numalogic uses a neural network for some of its computations, and it needs a sliding window of 10 minutes. How do you do that? You just declare it in the UDF: group by 10 minutes, use sliding window as the strategy, and your UDF will get the data laid out in that format. The output is a flat map: zero, one, or more results. You can also do conditional forwarding, meaning you can choose which path an event takes; this is useful for A/B testing and performance analysis of ML models and so forth. And then we persist the data to a sink: it could be blob stores like S3, writing back to Kafka, and so on.

Now, let's take a step back and see how the observability architecture fits into the operational data platform. The front end design is based on a micro-frontend architecture, meaning it's composed of a lot of smaller composite pieces, like plugins and widgets, and every front end is auto-instrumented; that's the key thing. In the developer portal, you say you want to create a web app or a plugin, and it scaffolds code that is already instrumented for FCI. Developers don't have to do anything; the core library handles it. When a UI is rendered to the user, through a mobile app or a web app, it creates the root span and then forwards the span all the way to the back end, end to end.

When we get the span, our OpenTelemetry collector collects it and sends it directly to the operational data lake. The way we do that is we write the metadata to Kafka and the data itself to S3, because we handle millions of spans a minute; this is a very high-throughput system. Then we ingest it and do the inverted indexing and so on. In the meantime, we also use Numaproj to listen to this stream and create RED metrics (rate, errors, duration) to understand how each interaction and each operation is behaving, with an anomaly score computed right away. That is the reason we are able to bubble up insights. Those results also go into the operational data lake, and the UI just pulls in the data, merges it all together, and presents it as a visualization to the developer. That is how we fit the observability design on top of our operational data platform.
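Here is a compact sketch of the RED-metrics step just described: folding a stream of spans into per-operation rate, error, and duration aggregates per minute, the numbers that then get an anomaly score. The span shape is hypothetical and greatly simplified; the real pipeline runs on Numaflow at millions of spans per minute.

```python
from collections import defaultdict

# Hypothetical, simplified spans; in the real system these arrive via
# the OpenTelemetry collector (metadata on Kafka, payloads on S3).
spans = [
    {"op": "search",   "minute": "09:30", "error": True,  "ms": 950},
    {"op": "search",   "minute": "09:30", "error": False, "ms": 120},
    {"op": "checkout", "minute": "09:30", "error": False, "ms": 80},
]

red = defaultdict(lambda: {"requests": 0, "errors": 0, "durations": []})
for s in spans:
    m = red[(s["op"], s["minute"])]
    m["requests"] += 1              # R: rate (requests per bucket)
    m["errors"] += s["error"]       # E: errors
    m["durations"].append(s["ms"])  # D: duration samples

for (op, minute), m in red.items():
    avg = sum(m["durations"]) / len(m["durations"])
    # Each (operation, minute) point would then be fed to the scorer.
    print(f"{op}@{minute}: rate={m['requests']}, errors={m['errors']}, "
          f"avg duration={avg:.0f} ms")
```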
One last thing I want to talk about is how we do streaming AIOps, because at scale it can get tricky. The way we do it is: any data provider, whether it is Kafka, Prometheus, or anything else, always streams the data in. If it is Prometheus, you can think of it as remote write; if it is Kafka, it's a Kafka read; either way, it is streaming in nature.

Then we do feature engineering: this is where we analyze the metric, understand what kind of metric it is, and do pre-processing on the features. Then the model scores the data, and post-processing happens: this is where the score gets normalized between zero and ten, because we want to make sure the anomaly score is interpretable by everybody and stays the same across Intuit; everybody knows what zero means and what ten means.

Now, operational systems are very dynamic in nature, meaning you might see new interactions being added, for example a new customer interaction, or old interactions being deprecated, so sometimes we will not have a model, and we want a system that can scale on demand. What happens is, if we do not find a model, it triggers inline training: it trains the model, loads it into the model store, and runs the inference. That way we have an end-to-end system where we don't have to intervene; we auto-detect the changes, adapt, and pick the right set of inferencing and model-scoring systems.

So that's all we had. Argo is something Intuit started, and it is well adopted across the community and across the industry. The same team has started Numaproj, and we are quite sure we will make a big impact with it; please come contribute, and we are hiring. Here's a QR code if you want to learn more about Numaproj. Lastly, we'll open it up for questions and answers.