Good afternoon, everyone. My name is Amit Lankar. I am here with my colleague, Vijit Morris, and today we are going to talk about how we are using operational data at Intuit to provide meaningful and actionable insights. This is an overview of the talk: we will start with the problem statement and how we are solving it with an operational data lake, talk about the applications that run on our operational data lake, get into a demo, and then talk about what we are doing next.

First, who we are. Vijit and I both work for the modern SaaS team within Intuit; it's a platform team. Intuit, as you folks might know, has been around since 1983. We have 5,000-plus developers in 21 locations, and we serve more than 50 million customers. On Kubernetes at Intuit, we have around 1,200 services currently running on 200-plus clusters with 10k-plus nodes. We have a lot of open source contributions, including Argo, which is a CNCF incubation project, as well as Keiko and Admiral.

So, the objective of this initiative was: how can we derive real-time, actionable insights from the operational data we have? When we started, there were a few problems we encountered. We were generating a lot of operational data, but it was in silos. There was no standard way to correlate this data and provide insight. Each team used its own dashboards to figure out insights within its own silo. And there was no easy way to provide a scalable ML platform. Why did this become critical right now? As more and more of our services moved to Kubernetes, complexity increased due to the dynamic cloud-native platform, and we were deploying much more often, which increased the risk. At the same time, Intuit went through a DevOps transformation, which required specialized data-driven tooling.

So, how do we solve this? We created a platform called the Operational Data Lake. The Operational Data Lake is nothing but a warehouse of clean, documented, and schematized operational data, which we collect and process in real time. We collect this data across the different lifecycle stages of an application, including development, build, test, production, security, and so on, and then provide real-time analytics using ML algorithms. Another principle we wanted to follow is that we democratize this operational data, so it's available through self-service for anybody at Intuit to use.

Our platform is divided into four parts: one is collection, second is processing, third is real-time ML analytics, and the fourth is storage. Let's get into a little more detail on each of these. For collection, we use Kafka for real-time ingestion. We also support batch ingestion for things like structured logs. And yes, data governance is in there, moderated by data stewards. For processing, we do real-time processing using Apache Beam. We do cataloging using Apache Atlas; we want to make sure everything is cataloged and discoverable for folks who want to use this data. We also provide enrichment using a golden entity, which in our case is the asset ID. On the analytics side, we created a standardized ML platform that highlights anomalous behavior in any stream of data, which enables us to provide a guided debugging option. As far as visualization is concerned, we created a standardized pattern for both visualization and pre-aggregation. We also offer interactive exploration using Druid. And all this data lands in long-term stores, S3 and ELK, for batch analysis.
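To make the collection pillar a little more concrete, here is a minimal sketch of what publishing one schematized operational event into Kafka could look like. The topic name, field names, and the kafka-python client are illustrative assumptions, not Intuit's actual code; the point is simply that every event is schematized and carries the asset ID so it can be cataloged and joined downstream.

```python
# Illustrative sketch only: a schematized operational event published to Kafka.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "asset_id": "example-service-123",   # golden entity used for enrichment and joins
    "source": "kubernetes",              # which producer emitted this event
    "kind": "deployment_rollout",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "payload": {"cluster": "cluster-a", "namespace": "payments", "replicas": 5},
}

producer.send("odl.operational-events", value=event)  # hypothetical topic name
producer.flush()
```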
Let's go over the ODL architecture at a high level. On the left-hand side, you can see we have a set of producers. It includes our Kubernetes platform, our build platform, and the API gateway; we also get security data and OTel data. All this data is collected in real time through Kafka. We ensure that everything is cataloged so that the data is clean. Once the data is collected, we process it using Apache Beam for different use cases. All this data is stored in different caches as well as data stores to serve those use cases. We have a query engine and support different BI tools so that people can create reports out of it. And on the right-hand side, you can see the different consumers and use cases we support, including security, observability, cost optimization, and so on.

These are some of the applications that are currently using our operational data platform and providing actionable insight. First, we have a security use case. We collect the data related to security, including lineage, so we know what piece of software is running, right from the time we build a container until the end of its lifecycle when it's running in production. We can figure out which CVE affects which piece of code, not just at build time but at runtime, and do triaging for that.

Next is cost reporting and analysis. With different controllers we have written, we can attribute the cost of any resource, CPU or memory, to a particular service and account owner, not just for reporting but also for optimizing cost across Intuit.
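As a toy sketch of that kind of attribution, with made-up unit prices, labels, and services rather than Intuit's actual controllers: resource requests are rolled up by asset ID and account owner to get a cost per service.

```python
# Toy cost-attribution sketch: assumed prices and labels, not Intuit's controllers.
from collections import defaultdict

CPU_PRICE_PER_VCPU_HOUR = 0.031   # hypothetical unit price
MEM_PRICE_PER_GIB_HOUR = 0.004    # hypothetical unit price

pods = [
    # (asset_id, account_owner, cpu_request_vcpu, mem_request_gib, hours_running)
    ("svc-payments", "team-payments", 0.5, 1.0, 24),
    ("svc-payments", "team-payments", 0.5, 1.0, 24),
    ("svc-identity", "team-identity", 2.0, 4.0, 24),
]

# Roll pod-level resource requests up to a daily cost per service and owner.
costs = defaultdict(float)
for asset_id, owner, cpu, mem, hours in pods:
    costs[(asset_id, owner)] += (cpu * CPU_PRICE_PER_VCPU_HOUR +
                                 mem * MEM_PRICE_PER_GIB_HOUR) * hours

for (asset_id, owner), dollars in sorted(costs.items()):
    print(f"{asset_id} ({owner}): ${dollars:.2f}/day")
```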
Another major use case we are solving is development velocity. We get data from our build and deployment platforms to figure out our release velocity and then provide meaningful insights into how we can increase development velocity. And last but not least is our observability application called Fuzzy, which we will go into a little more in the next segment. The main objective of Fuzzy is: can we reduce MTTD and MTTI for all services running at Intuit? Now over to Vijit, who will go over our Fuzzy observability application and show us a demo.

Thank you, Amit. Now I'm going to talk about Fuzzy, our observability application built on top of ODL. Why did we build Fuzzy? We found a couple of issues in the current way of doing things. The mean time to detect is too long. In most cases, we would love to detect in a standardized way when a customer is impacted, whether internal or external; there should be a standard way to do that. As soon as we get into an incident, the first thing the incident responder or the service owner looks for is how to ascertain the customer impact: who are the customers being impacted, what is the level of impact, which endpoints are affected, the percentages, the numbers, and so forth. Once we know that, we would love to solve the problem, that is, isolate the source and the root cause. The source would be the source service, while the root cause would be the causal element. The root cause is a little tricky, so in those cases where we cannot find it, we need the domain experts to come in and debug or triage the problem. When the domain experts come in, what they find is that there is a large latency in data retrieval. This is due to the exponential growth in the data, the organic growth as more systems are added. So we wanted a near real-time interactive system where they could explore the data and triage the problem. That is why we built Fuzzy.

What is the charter of Fuzzy? Fuzzy has three main objectives: one, make it obvious which services are affected; second, which is the causal service; and third, which is the causal element. Once we have this, we believe, and we can show, that any non-expert will be able to triage the problem. By non-expert, what we mean is we don't need a domain-level expert to triage; we should be able to isolate the problem and see what is happening very quickly. We will talk about how ML is helping us with that. And once we are able to reduce the mean time to isolate, we will reduce the mean time to resolve.

How are we doing it? Fuzzy, being an ODL app, uses the same four pillars of ODL: collection, processing, ML, and analytics. In collection, we collect all the data that is exposed by the cloud provider, in our case AWS: CloudWatch, CloudTrail, CloudEvents, all kinds. From the platform side, that is Kubernetes, we get the audit logs, and we get the Kubernetes objects, object changes, and events using our controllers. We have a standardized way of scraping metrics, which is Prometheus, so for any application that runs on Kubernetes we can get the time-series metrics out by scraping Prometheus, and we send those to Kafka. Recently, we also started getting sampled OpenTelemetry metrics.

Now that we collect different data from different systems and applications, we do processing. Windowing is the most important processing we do, where we window the data into a fixed or a session window. The reason we window is that we want to make sure the consumers of this data won't have trouble putting out-of-order events into order and so forth, so we put it into a fixed granularity. Once we put it into fixed windows, we also make sure the asset ID is attached, so that we can correlate all these metrics and join them. Kubernetes data will be very specific to Kubernetes, but we have to inject the asset ID, so we have to derive it; that is one main job of processing too. Once we have clean data, which is correlatable, we write it into multiple different stores for different speeds of retrieval: Redis, Elasticsearch, Druid, and so forth.
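Here is a minimal, runnable sketch of that fixed-window step, using Apache Beam's DirectRunner on a few in-memory events instead of the real Kafka source; the field layout and the 60-second window size are illustrative assumptions, not Fuzzy's actual schema.

```python
# Sketch of fixed-window aggregation over curated operational events (illustrative data).
import apache_beam as beam
from apache_beam.transforms import window

# (event_time_seconds, asset_id, http_status) stand-ins for curated ODL events
events = [
    (5,  "svc-payments", 200),
    (12, "svc-payments", 500),
    (47, "svc-payments", 503),
    (70, "svc-identity", 200),
]

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create(events)
        | "Timestamp" >> beam.Map(
            lambda e: window.TimestampedValue((e[1], 1 if e[2] >= 500 else 0), e[0]))
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))   # 60-second windows
        | "ErrorsPerService" >> beam.CombinePerKey(sum)                # errors per service per window
        | "Emit" >> beam.Map(print)
    )
```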
On the ML side, we have unsupervised models that try to detect anomalous measurements in this data. These ML models are trained per service, because each service is very unique in its behavior. A quick example: for one application, Redis cache misses are okay, but for another, misses are not okay. So it's tuned per service to be very precise. On the analytics side, we use this ML score, the anomaly score, as the guiding factor for guided debugging. When I say guided debugging, what I mean is, let's say you can divide your application into segments: your infrastructure, your application, your AWS, and your Kubernetes. If you are seeing a high anomaly on the AWS side, you really don't have to debug the application side, because it's clear that the problem is happening on the AWS side. This way the user can narrow down to the problem very fast, and the guiding factor is the anomaly score we generate. Our analytics platform also builds hierarchical views, so we can see the view at the Intuit level, the BU level, and the scrum team level, and see the impact radius. The data is very rich, and we would love to share it with everybody at Intuit, so we expose the entire data set over GraphQL for consumption.

Now, the Fuzzy architecture. This architecture is very similar to the ODL architecture because it's built on top of ODL. I have highlighted in green the ODL components used by Fuzzy. We have IKS controllers that pull in all the data. API GW is the API gateway; any two services that talk to each other always talk through the API gateway. We use black-box inference to gather data about the interactions: who is talking to whom, what status codes they are emitting, what endpoints they talk to, what the latency is, and so forth. There is a plethora of information there. We use Argo CD and Argo Rollouts, and these generate data that helps us understand when a deployment started, whether things have gone out of sync, how fast a rollout is progressing, and so forth; this gives us a lot of insight into deployments. The dev portal is where we create a service; that's the starting point for a developer, and it has our golden entity, the asset ID, so we get that data and use the asset ID for enriching. OTel is our OpenTelemetry collector; this is very new. All the data is written to Kafka. We do curation and cleanup and store it in the deep store for big-data analysis over a longer window. All the data is cataloged. On the processing side, you see SPP and the tumbling processors. SPP does most of the heavy lifting: the windowing, cleaning, joining, and so forth. The tumbling processors are very efficient code whose aim is to write to Redis in different data structures, optimized for that. The main point of the tumbling processors is for the Fuzzy UI to be able to slice and dice the data as the user requires, so we write it in different formats. I will talk more about ML in the coming slide. Then, of course, we have GraphQL, which exposes the entire data set. We use different kinds of cache stores based on the requirements.

Now the ML architecture. This architecture is built on Kubernetes because we wanted a scalable system that could run thousands of models. It uses Argo Workflows, Argo CD, and Argo Events to make it scalable and resilient. What we do is, the dispatcher gets pre-aggregated, clean data. It looks at each message, sees which service the message belongs to, and forwards it to the stream detector. The stream detector has a model per service, so three things can happen. When a message comes into the stream detector, it may see that, hey, I already have a model for this in cache, and assign the anomaly score. The other case is that it does not have a model, so it looks in the model store, gets the latest model, and assigns a score. Sometimes you might not even find a model in the model store, because it's a newly onboarded application. We create models dynamically; that's the key thing. So it asks the on-demand training system to train a model, and the model trainer writes the newly trained model back to the model store so that the service can be assigned a score when new messages come in. The publish adapter publishes to multiple endpoints. Today, we read from Kafka and we publish to Kafka; that is the default. So we write the anomaly score back to a Kafka topic so others can consume it and use the data as they see fit. We also do GitOps, because we want to make sure that users can add new models without redeploying the entire thing. So if somebody adds a new NLP-based model, they just need to declare it in Git, saying, hey, this is a new model for this service. Once it is merged, the new model will be auto-trained and be available to assign scores to the metrics.
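As a simplified sketch of that stream-detector flow (look in the local cache, fall back to the model store, or train on demand), the snippet below uses scikit-learn's IsolationForest as a stand-in for whatever unsupervised model is actually used per service; the function names, feature layout, and the in-memory "store" are illustrative assumptions.

```python
# Simplified per-service anomaly scoring: cache -> model store -> on-demand training.
import numpy as np
from sklearn.ensemble import IsolationForest

model_cache = {}   # asset_id -> fitted model (hot path)
model_store = {}   # asset_id -> fitted model (stand-in for the real model store)

def train_on_demand(asset_id, history):
    """Stand-in for the on-demand training system."""
    model = IsolationForest(random_state=0).fit(history)
    model_store[asset_id] = model          # persist the newly trained model
    return model

def score(asset_id, features, history):
    model = model_cache.get(asset_id) or model_store.get(asset_id)
    if model is None:                      # newly onboarded service: train dynamically
        model = train_on_demand(asset_id, history)
    model_cache[asset_id] = model
    # Lower decision_function values are more anomalous; map roughly onto a 0-10 score.
    raw = model.decision_function(np.array([features]))[0]
    return float(np.clip((0.5 - raw) * 10, 0, 10))

# Example per-window features such as [error_count, p99_latency_ms, pod_restarts]
history = np.array([[1, 120, 0], [0, 110, 0], [2, 130, 0], [1, 115, 0]])
print(score("svc-payments", [40, 900, 3], history))   # unusual window -> high score
```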
Now let's get into the details of Fuzzy through a demo. This is the incident commander view of Fuzzy, the first page. The whole point of Fuzzy is that it does not require any onboarding; any user at Intuit, any service at Intuit, the moment it is created, gets Fuzzy for free. What the front page of Fuzzy shows is the top 10 services at Intuit that are behaving most anomalously, out of the thousands of services we have. The way to interpret this is that a score of zero means all is good, while a score of 10 means it is most anomalous. From this, a user can understand that there are some anomalous services at Intuit, and they would want to dig in more to understand what the causal service and the causal event are that could have triggered this. For that purpose, they go to the detailed view of Fuzzy.

The most important information they look for, as a service owner, is the impact radius. That's what they care about the most: to understand who is being impacted because my application is behaving badly. For that we have an application dependency graph. Let's say I am the owner of one of these applications, and my service has a score of 10; that means there is an issue. The graph clearly shows which clients are affected because of the anomalous behavior in my application, and what their scores are, who is being impacted because of it. We also show a service-to-service anomaly score on the edge, showing how the calls used to be and how the calls are now, and we highlight the score; a 10 means it's a total impact. I can also see that, hey, I depend on entitlement, and entitlement seems to have an issue; that could be the causal service that is triggering this incident.
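Reading such a dependency graph programmatically is straightforward; here is a purely illustrative sketch (made-up service names, scores, and structure, not Fuzzy's actual data model) of ranking a service's upstream dependencies by their anomaly scores to surface a causal-service candidate.

```python
# Illustrative only: rank upstream dependencies by anomaly score to surface a
# causal-service candidate. Service names, scores, and structure are made up.
anomaly_scores = {            # per-service scores: 0 = all good, 10 = most anomalous
    "classic": 10.0,
    "entitlement": 9.4,
    "identity": 0.7,
}
dependencies = {              # service -> upstream services it calls
    "classic": ["entitlement", "identity"],
}

def causal_candidate(service):
    """Return the most anomalous upstream dependency of `service`, if any."""
    upstream = dependencies.get(service, [])
    return max(upstream, key=lambda dep: anomaly_scores.get(dep, 0.0), default=None)

print(causal_candidate("classic"))   # -> "entitlement"
```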
As the owner of entitlement, what I would want to know is: what is the causal event? So let's assume this is the detailed view, which can be used to understand what's going on. The way to interpret this set of charts is that we show each hop as a segment, and there are three major hops we show as of today. Hop number one is the API gateway; any two services at Intuit interact with each other via the API gateway. So it's clear that, hey, there are some issues: you see a spike in errors, and you can see the error rate is around 8% against the general rate we usually see. This can be a lot of data, so we summarize it and show it on a heat map so that people know what's happening. If you see here, we can even do a quick summarization by clicking over a span and see, over a window of five or 15 minutes, what the error distribution is. Now that I see there are a lot of errors, I would like to know what's happening. Clearly, I can see that the load balancer sitting in front of the service is also showing a spike in errors. The user can click into it and see whether the error is coming from the target or being generated by the load balancer itself; that is clearly shown. Now we would like to know what is triggering that. That is where we look into the Kubernetes namespace to see how the events look at that namespace level, and we see that, hey, there is something at the deployment; something is happening.

Once you click on it, it shows that, hey, somebody has run a restart on the deployment, which is causing the issue. This clearly allows a non-expert, to be frank, to see: I see an error, I see a deployment, and here is what caused the deployment. These are all interactive; you can click to see what's happening. I have 26 events in here; you can click on them and see the sequence of events that happened and how it unfolded. Since there is a lot of data, we also summarize it with an exclamation point; this exclamation is all about bubbling up the most causal reason, the one you might be interested in, which could be triggering the incident. For example, if you click here, it shows that, hey, the deployment is failing because the pods are unhealthy. Ideally there should be only one exclamation point, the one which shows the real event, and we are trying to improve our machine learning to do that: to show the causal event rather than showing a set of exclamations. We also show a quick insight into the pod metrics: how many pods are running and so forth, and what the CPU usage is, just for the user to see what's happening. They can search for their pod and see, OK, I'm running five pods and this is how the memory distribution looks; these are also color-coded between 0 and 100.

So from here, I hope I was able to convince you that, from an API gateway error in the system, or an incident, we are able to find the root cause very fast. Now I will hand it over to Amit for the closing.

Thank you, Vijit. As you can see in the demo, we have made very good progress on the operational data lake and on using that operational data to solve some of these use cases. So what's coming next? We want to make sure we support customers with bring-your-own-metrics, we are expanding our self-service capabilities for open-ended debugging, we are also expanding our ML for more on-demand models, and we are planning to open source this. Thanks. Thank you for attending our talk.