First of all, thank you for coming to the last session of the day. My name is Amit Kalamkar. I lead Observability and Analytics at Intuit. With me is Vigith Maurice, a principal engineer in my group. Today we want to talk about how we are using GenAI and AIOps to reduce MTTR at Intuit.

Here's the agenda. We will talk about Intuit, our operational excellence goals and how we are achieving them. We have a nice demo for you. Then we'll talk about Numaproj, which powers all of this, and how you can use Numaproj for different use cases.

Most of you will know Intuit from our flagship products: QuickBooks, TurboTax, Credit Karma, Mailchimp. We are also the creators of Argo. All of these flagship products are powered by five platform areas, which ensure that we drive customer value as well as innovation. Intuit is a 100% SaaS company, and from the numbers on the screen you can see we operate at a pretty large scale.

Intuit is also deeply committed to open source. We not only use it for our applications and platform; we have active committers and contributors on a lot of open source projects. The biggest success story for us is Argo. As most of you might know, Intuit created Argo, and now it's used worldwide and is one of the fastest-growing projects in the CNCF. Our latest open source project is called Numaproj. It is used for stream processing, AIOps, and analytics. We just released 1.0 this week, and we are getting good community engagement on that project as well. We are also proud recipients of the CNCF End User Award, twice, and we are thankful for that.

Let me start with the Intuit development platform journey. Like most companies, we started by moving to the cloud. It was mostly lift and shift; the workloads were traditionally monoliths running on EC2. We started modernizing this platform back in 2018, when we began adopting cloud-native technologies and containerizing with Kubernetes. That helped us increase our developer productivity 6x. Now we are focused on building a next-generation platform that is AI-based.

This is what our next-generation platform looks like. We call it Modern SaaS AI, and it has four pillars. The first is AI-powered app experiences: here we use GenAI to provide new experiences to our customers, including assisted experiences. The second is GenAI-assisted development, which means helping our developers with coding, debugging, and testing; we use things like GitHub Copilot here. The third is an AI-powered, app-centric runtime and traffic management: here we use the data we collect, plus AI, to make the platform simpler for our developers, so they can concentrate on just writing their business logic. And last but not least is smart operations using AIOps. All of this is possible because of the investments we have made in our operational data platform as well as Numaproj.

Let me describe, at a high level, how our operational data platform works. On the left-hand side you can see that we collect a lot of data in real time: data from Git, from observability, from Kubernetes. We then use Numaproj capabilities to ensure this data is clean and attributable, and we store it in different stores depending on the use case: for real-time we use Druid, for long-term we use S3, and so on. Then we use Numaproj capabilities again to analyze this data in real time.
We use traditional statistical models, AI models, and LLMs to drive actionable insights. These insights are then consumed by different use cases, which you can see on the right-hand side of the slide: security use cases, cost use cases, development productivity use cases, and of course operational excellence, which is our most important use case.

So let me talk about operational excellence at Intuit. Being one of the largest SaaS companies, operational excellence is always a priority: we want to make sure our experiences and products are available, and that we give our customers a delightful experience. These are the four pillars of our operational excellence work.

The first is reducing MTTD, mean time to detect. If there is an issue, we want to know about it. We have invested here in things like automatic golden signals, both for platforms and for services. We also built something we call failed customer interactions, which captures how our customers actually experience our product. For example, if you are trying to upload a W-2 in TurboTax and it fails, we want to know about it, so we used OpenTelemetry to instrument that (see the instrumentation sketch below). All of this has helped us reduce our mean time to detect to less than five minutes.

The second pillar is reducing MTTR, mean time to repair. If there is an issue, we want to fix it as soon as possible. There are two things we have done here. One is resiliency patterns: we use our automatic golden signals to fail over if there are any issues. The second is automatic rollback using progressive delivery, and we'll talk more about that today. There are two more pillars. One is performance: we use Google's Web Vitals to make sure we give a delightful experience to our users. The fourth is cost: we want to make sure cost stays manageable. All of this is powered by Numaproj.

Now let's talk specifically about reducing MTTR, and how we are using progressive delivery for that. Before we get into the solution, let me share some data points. One-third of our incidents are caused by change, and I think a lot of you can relate to that. Second, we saw that our MTTR was high because, although we have all the data (we have events, we have logs, we have traces), people spend time hunting for the exact piece of information that will help them triage. We wanted to address that, and we also wanted to provide quick summarization and triaging capability for incidents.

So what we did was integrate both AIOps and GenAI into Argo CD and Argo Rollouts. We built two net-new capabilities. The first is AIOps: we collect data within the cluster, analyze it, run an anomaly detection model, and make a decision on whether to roll back automatically, depending on how the change is behaving. The second is what we call Numaproj Assist. Here we run an LLM within the cluster itself; we ingest logs, traces, events, all types of data, and summarize it for our developers so they know why a certain thing is happening. Now let me hand it over to Vigith to show this in a demo.

Thank you, Amit. Let me first start with a small demo plan. What I'm going to do is roll out a change that has a bug in it, and you will see how we roll it back and also how we explain the reason for the rollback.
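As an illustration of the failed-customer-interaction instrumentation mentioned above, here is a minimal sketch using the OpenTelemetry Python metrics API. The meter name, counter name, and attributes are hypothetical; the talk does not show Intuit's actual instrumentation.

    from opentelemetry import metrics

    # Hypothetical names; Intuit's actual scheme is not shown in the talk.
    meter = metrics.get_meter("customer.interactions")
    fci_counter = meter.create_counter(
        "failed_customer_interactions",
        description="Customer-facing interaction failures (e.g. W-2 upload)",
    )

    def record_w2_upload(success: bool) -> None:
        # Count each failure so detection pipelines can alert on it
        # within the five-minute MTTD target described in the talk.
        if not success:
            fci_counter.add(1, {"interaction": "w2_upload", "product": "turbotax"})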
Developers can then see the reason for the rollback through a summary of logs, events, and metrics. Based on lessons learned, I'm going to play a recording of the demo.

Here's how to read it: on the right-hand side is our demo app, and the yellow fishes represent success. Once in a while you will see a red fish; that represents an error. While the fishes are swimming, we are going to make a buggy change. We use GitOps for every change, so we merge this PR, and once we synchronize the change in Argo CD, you can see that the new canary deployment has been deployed. So we have the stable and the canary running at the same time.

Once the canary starts taking traffic, you will see more errors: more red fishes coming in. This means there is a bug in the deployment, and the analysis has started. You'll see the analysis run starting. The analysis template here shows anomaly scores; in case you can't see it in the back, it's basically numbers saying the anomaly score is around nine. I'll explain what that means in a second, but the analysis clearly says there is a problem. We wait for five analysis runs before a rollback is triggered; only three of them have run so far, and each one takes around 15 seconds (a sketch of such an analysis template follows below).

One key thing, as Amit was saying earlier, is that developers have to see the metrics of the deployment, and that's the metrics extension we added to Argo CD. What we show here are the golden signals, but it could be any metric you want. These metrics are fetched from Prometheus, so you can add whatever golden signal you think is relevant; at Intuit we show error rate, latency, and a few others.

Now that we have those metrics coming in, we also have to quantify, in a generalized fashion, whether your deployment is good. What you see here is a blue line at zero, which is the anomaly score of your stable application. And you might see a small change here, an orange line coming in: that is the canary's anomaly score. If you look at the values, the blue line is at zero, meaning your stable is doing quite well, and the canary is showing a value of seven.

Let me explain what that means. The anomaly score at Intuit is normalized between zero and ten. Anything less than three means your canary is operating within the normal operating pattern with respect to your stable. Anything more than three means it has started deviating from the normal operating pattern, and the highest value it can reach is ten, meaning it is totally deviant from the normal pattern. So anything less than three, we are good; anything more than three, we auto-roll-back.

And that's what we see: we have a seven, and if you come down here, you will see that the analysis template has finished running. That's why, if you look at the demo app, you don't see any red anymore; it's all yellow. We also see only the stable hash, the old good pods, running; the canary has been removed because the analysis template finished.

Now, one question developers ask at this point is: why did it get rolled back? This is where the summarization comes in, powered by Numaproj. This is a new tab, the Numaproj Assist tab, where we show the anomaly score and some key metrics.
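To make the rollback mechanics above concrete, here is a minimal sketch of the kind of Argo Rollouts AnalysisTemplate described: five runs at roughly 15-second intervals, failing when the anomaly score reaches three. The metric name, Prometheus address, and query labels are assumptions, not Intuit's actual template.

    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: anomaly-score-check
    spec:
      args:
        - name: canary-hash
      metrics:
        - name: anomaly-score
          interval: 15s      # one analysis run roughly every 15 seconds
          count: 5           # five runs before the final verdict
          failureLimit: 1
          # below three is within the normal operating pattern
          successCondition: result[0] < 3
          provider:
            prometheus:
              address: http://prometheus.monitoring.svc:9090
              query: |
                anomaly_score{rollouts_pod_template_hash="{{args.canary-hash}}"}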
So we can see that the error rate is going up; the anomaly score is around 9.28, driven by the error rate. And it gives you the summary. It's actually giving you two summaries. The first one is based on OpenAI's GPT-3.5 Turbo model. It starts with a summary of the data that came in: it says there are 60 error logs coming in, and it also gives you a potential root cause. It says, hey, the errors are likely caused by too many open Redis connections. So as a developer, you clearly know what's happening. And you're not that worried, because we already rolled the change back. So this is a very good summary of what's happening and the potential root cause.

That was the OpenAI model on top. On the bottom, we also have our own fine-tuned custom model. This is way cheaper: it's an in-cluster deployment of our own LLM, fine-tuned on Intuit data. The results are very promising and very close to the OpenAI one. The reason we do this is that at scale we have to run a lot of analyses, and we wanted a cheap way of doing it with accurate results, because it is fine-tuned on Intuit data.

Given this GenAI integration, you can use it not only for deployment rollbacks; you can use it at any time of day. For example, say you get paged during the day: there's a problem and you want to debug. Let me show you. This is a running Argo CD deployment, and if I click here, you can see there is no deployment going on; this is normal day-to-day operation. Right before the talk, I triggered an OOM (you can trigger one by just clicking here; this is a demo app). And if I click here, it takes you to the right place and tells you what's happening. Let me zoom in a bit. Numaproj Assist is saying, hey, there are around this many errors, and the container was terminated due to an OOM error. I hope that wasn't too much zooming. But you see, we use it at runtime too: you don't have to wait for a rollback. Any time there is an error, you can come here and look at Numaproj Assist, which will give you a summary based on all the data we gather and collect.

Now let me get back to my slides and talk about how we do it. Here is the very high-level architecture. We have the application deployment here. First is the AIOps platform, which at runtime reads all the metrics and writes an anomaly score back to Prometheus; Argo Rollouts watches that score and decides whether to roll back, based on whether the score is greater or less than three. The bottom part of the screen is the LLM piece: an ingestion component gathers logs, metrics, and events, does some pre-processing, and then calls our LLMs, both the custom LLM and the OpenAI one. We write the result back to a store, which is what the Argo CD Numaproj Assist tab shows (a sketch of this step follows below). We also have a training pipeline that gathers this data and fine-tunes the model at intervals, as traffic patterns change.

Now let's pause here, because what you're seeing is a very advanced platform. We use cutting-edge technologies: for example, we have a very scalable operational data platform, and we use AIOps to make decisions.
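A rough sketch of the summarization step in the architecture above, assuming the OpenAI Python client. The prompt, model name, and function shape are illustrative; an in-cluster fine-tuned model could be reached the same way via a different base_url.

    from openai import OpenAI

    client = OpenAI()  # for an in-cluster model, point base_url at the local endpoint

    def summarize_incident(error_logs: list[str], anomaly_score: float) -> str:
        # Pre-processing: dedupe and cap the logs so the prompt stays small.
        sample = "\n".join(sorted(set(error_logs))[:60])
        prompt = (
            f"Canary anomaly score: {anomaly_score:.2f} on a 0-10 scale.\n"
            f"Recent error logs:\n{sample}\n"
            "Summarize what is happening and suggest a likely root cause."
        )
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content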
We use Argo Rollouts for progressive delivery, and we even have an LLM integrated. So the natural question is: did we really have to boil the ocean to get to this place? And the answer is no, we didn't. Then what did we do? The key thing, from the very beginning, was to make sure we could stream data in and do stream processing at a very large scale.

When somebody talks about stream processing or real-time processing, the first thing that comes to mind is data engineers working on Flink or Spark on a Java code base. There's a perception that streaming is only for data engineers, that it isn't accessible. We wanted to change that. We wanted to make streaming for everyone: application developers, ML engineers, DevOps, SREs, even product managers. The key is to make streaming easy for everyone: you should be able to pick it up in five minutes and keep going. So the rest of the talk is about how we do that at Intuit, at scale, and that's where the open source Numaproj comes into play. I'll walk you through how we do streaming at Intuit and how it can be easy for everyone.

So what does Numaproj include? Numaproj is a collection of Kubernetes-native, open source, language-agnostic, real-time data analytics tools. It has three main components. First is Numaflow, a massively parallel real-time data processing platform; this is where the data movement happens. Numalogic is a collection of ML models we have been using for a couple of years on real-time operational data. And lastly, there is a control plane for managing these resources. For this talk, we'll focus on just Numaflow, because it's a big topic.

So again, what is Numaflow? It's a massively parallel data processing engine, where the data movement happens, built on three core philosophies. First, it is very native to Kubernetes: if you have a specification, it runs on the edge, on-prem, or in the cloud with that same specification; you don't have to change anything. Second, and most important, it should be very easy to use: you can write in any language you like, so it's very easy to adopt. And lastly, it is scalable and cost-efficient: it can autoscale all the way down to zero and back up from zero to many. In our deployment at Intuit, we found it is about 30% cheaper than equivalent Java-based systems like Flink or Spark.

We are also the creators of Argo, as you know, and the community has been asking: if Argo Workflows is for batch, what is the streaming equivalent? That's Numaflow; one way to see Numaflow is as the streaming equivalent of Argo Workflows. And we fixed a few things Argo had problems with. For example, there's a lot of pod churn in Argo: for every event you create a pod, so you churn out a lot of pods and fragment your etcd. These are big problems at scale, because a company like Intuit has lots of workflows running. We wanted to make sure those problems got fixed in Numaflow, so we don't have that kind of etcd fragmentation or state mutation.

Now, I have been talking about a pipeline. What really is a pipeline?
A pipeline is this: you read from somewhere (that is the source), you do something to the data (some kind of transformation, a user-defined function, it could be anything), and then you write it out somewhere (there is a destination for the message). That is the simplest Numaflow pipeline you can imagine: you are reading, doing something, and writing it out.

Now let me walk you through one use case as a demo. Numaflow comes with a lot of built-ins: Kafka, HTTP, and so on. So let's take a use case where you read from HTTP and you write to a log sink, and we'll use a Hugging Face model to do sentiment analysis, meaning I send in a sentence and it tells me whether the sentiment is positive or negative.

Before I demo this, the first question is: how easy is it to write a UDF? If you look here, this is all you have to do. You have a handler (this is Python code, but it could be any language); you are given an input, and you have to return an output. If you look at this code, there is no streaming aspect to it. It doesn't talk about retries; it doesn't talk about anything else. You are given an input, and all you need to do is produce an output. The platform guarantees that the message will be forwarded exactly once, that retries are taken care of, and that autoscaling is taken care of; we understand the streaming semantics, back pressure and so forth. As a user, all you need to do is provide this handler, this simple piece of code, and you can replace it with anything you can think of.

And this is the Kubernetes YAML for the same pipeline. It is nothing but a graph: you have a set of vertices (a source, the inference step, and a sink), and you have edges that connect your source, your user-defined function, and your sink. That makes a graph.

Now let me show you a demo; this is a real one. This is the UI we built. This is a cluster summary; I'm just showing it from my local machine. What you're seeing are the namespaces we have and the pipelines that are running. In this case, I've already deployed the sentiment analysis pipeline, because the model is large and takes a while to download. But it's very easy to create a pipeline: you click here, fill in the input, and if I submit it, it will create the pipeline. You can do these operations from the UI itself, and we have RBAC, backed by GitHub, and things like that.

Coming back to the sentiment analysis pipeline: it's the same shape as the slide. You have input, inference, output, so it's quite self-explanatory. What I'm going to do is write some data to the input, and the output should come out the other end. Okay, let's send a sentence: "Writing documentation is like trying to explain quantum physics to a toddler." I send the data, and the request goes through. And if you look, the output is here, and it clearly says it's a negative sentiment, which fits: it's about how difficult writing documentation is. Now let me send something that sounds better: "Kubernetes is like the ultimate wingman, always there to support your app." I send it, and there it is: a positive sentiment.
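For reference, here is a sketch of what that handler might look like, using a Hugging Face pipeline for the inference. The datum object and return convention are schematic; the actual Numaflow Python SDK wraps a handler like this in a small gRPC server, and its exact signatures may differ.

    from transformers import pipeline

    # Load the Hugging Face model once at startup (it is large, as noted above).
    sentiment = pipeline("sentiment-analysis")

    def handler(keys, datum):
        # The platform hands us one message and expects the transformed output;
        # retries, exactly-once forwarding, back pressure, and autoscaling are
        # Numaflow's job, not the handler's.
        text = datum.value.decode("utf-8")
        result = sentiment(text)[0]  # e.g. {"label": "NEGATIVE", "score": 0.99}
        return [f"{text} -> {result['label']}".encode("utf-8")]

And the pipeline spec itself is the graph of vertices and edges described above; only the container image name here is hypothetical.

    apiVersion: numaflow.numaproj.io/v1alpha1
    kind: Pipeline
    metadata:
      name: sentiment
    spec:
      vertices:
        - name: input
          source:
            http: {}                           # built-in HTTP source
        - name: inference
          udf:
            container:
              image: example/sentiment-udf:v1  # hypothetical image
        - name: output
          sink:
            log: {}                            # built-in log sink
      edges:
        - from: input
          to: inference
        - from: inference
          to: output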
If you think about it, the example is very simple, but what if your input were a Slack webhook, and on a product release you just passed the announcements through it? In fact, that was a product manager's use case; he built it during a hack day. So you can imagine how simple it is. And this code, by the way, is open source; you can take a look at it, and it's very small. I wasn't lying about the code. So that is all for this demo; let me get back to the slides.

Now, I showed you a simple pipeline, but that does not mean Numaflow cannot do complex things. Pipelines come in different shapes and sizes. You can do multi-source, for example. You can do join operations: joins into UDFs and merges out of UDFs. Numaflow is more fire-and-forget, because it can autoscale and it understands node migration and pod migration, so at Intuit we mostly run it in fire-and-forget mode. Then, how do you reconfigure a pipeline at runtime? We support something called side inputs, which can broadcast messages so a pipeline reconfigures itself. And lastly, it can even support cycles: for example, you read a message and decide to reprocess the same message with additional context. There is much more, but these are the key ones.

Now let me talk about a few use cases where the community and Intuit are using Numaflow. First is streaming analytics, that is, number crunching on data as it comes in. You saw this in observability: we compute golden signals. How do you compute availability? How do you detect errors? That's streaming analytics. MLOps and ML inference is what Numaproj Assist was doing earlier, and also anomaly detection. And lastly, people also use it for event-driven applications; you can think of these like streaming workflows, where you get an event (which could be metadata, among other things) and you do some processing. With this, I will hand it over to Amit to talk about a few success stories.

Thank you, Vigith. We wanted to walk you through some success stories so you can get an idea of where you can use Numaflow. First is streaming analytics. I mentioned golden signals when we talked about operational excellence; the golden signal pipelines at Intuit use Numaflow. Some features I want to highlight: it is multi-language, written in both Java and Go, and most importantly, it is about 30% more efficient than the equivalent Flink job we were running (a toy sketch of this kind of golden-signal computation follows below). The second is a community example: B-Cube is using it for digital signal processing. They use the same pipeline on-prem, in their cloud, and on the edge. The footprint of Numaflow is so small that you can run it on any edge device; we have successfully run it on a Raspberry Pi.

The second set of use cases is MLOps. We talked about Numaproj Assist; the highlights there are in-cluster analytics and AIOps, and you can use it for LLMs, both for creating training data sets and for doing A/B testing. The other MLOps use case is anomaly detection, which is the most widely used at Intuit: any developer can take any time-series data and generate anomaly scores. Again, it's DIY and simple to use, so you don't need to be a data engineer or ML engineer to do this.

The last set of use cases is event-driven. First is our metadata service. We get a lot of data from our cloud provider, and we want to process it as soon as possible.
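As a toy illustration of the golden-signal number crunching mentioned in the streaming-analytics story above (not Intuit's actual code), a vertex might maintain a sliding-window error rate like this:

    from collections import deque

    class ErrorRateWindow:
        """Sliding-window error rate over the last N observations."""

        def __init__(self, size: int = 60):
            self.window = deque(maxlen=size)

        def observe(self, status_code: int) -> float:
            # Record one request and return the current error rate; a real
            # vertex would emit this downstream for alerting and scoring.
            self.window.append(1 if status_code >= 500 else 0)
            return sum(self.window) / len(self.window)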
The key feature for the metadata service is scalability: we go from zero to hundreds of pods, process the data, and then go back to zero again. The metadata service is also almost 90% more cost-efficient than the equivalent Lambda we used to run. And the last, but not least, success story: one of the largest automotive companies is using Numaflow for real-time map processing. Think of it as running on the edge; it scales up and it has to be reliable, because your navigation depends on it. One of the quotes they gave us is that it's fire-and-forget: they run it for months without even touching it.

Again, the idea is to give you a sense of how you can use it. Here is our QR code; you can go to GitHub, where the demos you saw are available for you to download. Everything is open source, and you'll find other examples and documentation there. And if you like it, please go ahead and star the repo; that's how the community grows. So thank you. We have time for questions.

Can you elaborate more on the anomaly detection part? Does that require each application developer to define their own SLO, or do you have a unified algorithm at the platform level to do all the golden-signal anomaly detection?

Sure. If I got the question right, the first part was how we do anomaly detection. Our anomaly detection looks at the application's normal running state and compares it with how it is performing now. It's an autoencoder with a sliding window of 10 inputs, and the key thing is that we compare the current pattern with the historical pattern, so it can understand time of day, week over week, and so forth. We also avoid the cold-start problem: we start with just 180 inputs, meaning 180 minutes, less than three hours of data, and we are already able to produce an anomaly score; from there we go up to the last 10 days of history. We compare against that and give an anomaly score. The output is normalized, so anybody at Intuit knows what a value of five means: five is a medium-level anomaly, and ten means completely anomalous. That's how we standardized it. What was the second question? You had two questions, right? Okay, thank you.

Really nice talk. I had a question about real-time incident analysis using AI and MLOps. Do you actually do that? Say, for example, you are connected to a payment platform and there's an issue with that platform; how do you detect that inside your infrastructure?

So, how do we detect it? Remember Amit talking about failed customer interactions: for example, a W-2 upload is failing, and how do we come to know about it? Let me show you the real thing; the beauty of this is the UI, if it loads... there you go. This is the pipeline. We get data from Kafka all the time; there are thousands of interactions happening, and we do inference: based on the current input, we infer whether the current scenario is anomalous. Then we pre-process and post-process, which is where we normalize the score, decide that this is an anomaly, and send it downstream.
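One plausible way to produce the normalized score described in this answer (a sketch under assumptions; Numalogic's actual normalization may differ) is to compare the current window's autoencoder reconstruction error against a historical baseline and clip to the 0-10 scale:

    import numpy as np

    def anomaly_score(current_errors: np.ndarray, baseline_errors: np.ndarray) -> float:
        # Compare the current window's mean reconstruction error with the
        # historical baseline; clip to the 0-10 scale where <3 is normal
        # and >3 triggers action.
        mu, sigma = baseline_errors.mean(), max(baseline_errors.std(), 1e-6)
        z = (current_errors.mean() - mu) / sigma
        return float(np.clip(z, 0.0, 10.0))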
Downstream, for example, we send it to Kafka for further analytics, we send it for alerting, and we also do training on the fly, because this is a zero-configuration system: if a new interaction shows up, we recognize that it is something new, that we don't have a model for it yet, and we do online training for it. That is how we use Numaproj. The key thing with Numaflow, as you can see, is that you can do all of this in a single pipeline, and this is a production pipeline that actually does the anomaly detection. To tie it back: the output of this creates incidents. If the score is greater than three, an incident is fired at Intuit.

And do you have automatic recovery? MTTR is the tricky bit. As Amit was saying, we have resiliency, like multi-region deployments, and for changes we roll everything back. But it is not a foolproof, complete solution. Some incidents you really have to debug, though we do give you the power to isolate, meaning you can actually see what is going wrong to an extent; we measure mean time to isolate, but some parts of MTTR are not automated. Thank you.

[Inaudible question.] For changes, yes: for changes we use progressive delivery and it automatically rolls back. We use Numaproj as well as Argo CD and Argo Rollouts, both open source products. Argo CD and Argo Rollouts are used for deployment at Intuit, and as we discussed in the talk, if progressive delivery detects an anomaly, we roll the change back. Just to summarize: we do not use LLMs for the mathematical use cases here; we use autoencoders for anomaly detection, if you're curious.

One quick one: what's the preferred tenancy for Numaflow? Is it similar to Argo Workflows, where you run it per cluster, or is it more of a roll-your-own, per-namespace sort of situation? Today it is just within the cluster, but there is nothing stopping us from going multi-cluster, because one of the edges could be in another cluster. As of today, though, we run it single-cluster and single-namespace. Thank you. You're welcome. Thank you.