Today, we want to talk about how we integrated observability into Argo CD and Argo Rollouts, backed by AIOps, and how that is helping us reduce MTTD and MTTR. My name is Amit Kalamkar. I lead observability and analytics at Intuit. With me is Vigith Maurice, a principal engineer for that tech track. Here's the agenda for today. We will talk about a specific problem, change-induced incidents, and how we addressed it using AIOps. We have a demo for both Argo CD and Argo Rollouts. Then we'll talk about Numaproj, our open-source AIOps project, and how it powers all of this. And, time permitting, we will take some questions.

Most of you know Intuit from our flagship products: QuickBooks, TurboTax, Credit Karma. All these products are powered internally by five platform areas. These platform areas ensure that we provide value to customers and accelerate innovation. Vigith and I belong to developer experiences and platforms. Our group is responsible for all build time and runtime at Intuit. Just to give you an idea of scale, we run around 2,000-plus services on Kubernetes, and the investment we have made in modernizing the internal platforms has resulted in a 6x improvement in development velocity at Intuit. Intuit is very much bought into open source. We not only use it, but we have active contributors and maintainers for a lot of CNCF projects, including Argo, Istio, and others. We are also one of the largest end-user companies, and with new capabilities we are open-sourcing, like Numaproj, we are continuing to collaborate with other end-user companies.

Let me first give you an idea of our tech ecosystem at Intuit. We started modernizing this platform in 2018. We modernized pretty much everything: front-end platforms, back-end platforms. As part of this, we moved all our container workloads to Kubernetes. We also created paved roads, both for serverless and for services, which gave our developers end-to-end automation from commit to deploy. As part of this automation, we made a deliberate effort to instrument all layers of our platform, infrastructure and applications, out of the box. So we get real-time events from all over the place, and we store them in an operational data lake. We use this data to derive actionable insights for different areas like operational excellence, cost, security, and so forth. To derive these actionable insights at scale, we needed an AIOps platform that could scale, and that's how Numaproj was born. Numaproj is a Kubernetes-native data processing and analytics project. It consists of two parts: Numaflow, which deals with data processing and makes it easy, scalable, and reliable; and Numalogic, which we are open-sourcing along with a lot of the models we use internally, and which runs on Numaflow. You will learn more about this project as we go through the presentation.

One of the core principles at Intuit is innovation. We want to make sure we innovate, and innovate fast. We do over 1,000 releases per day in production, and that's only possible because of the investment we made in Argo. Most people here probably know Argo; Intuit created Argo and open-sourced it, and over 100 companies now use it across the world. Being one of the largest SaaS companies, operational excellence is always at the forefront for Intuit. We want to make sure that our products are always available, and if there are issues, we resolve them fast.
That way, MTTD and MTTR for our incidents stay low, and we are always looking at ways to improve them. One area we saw that needed improvement is change-related incidents. We found that one-third of our incidents were caused by change, and their MTTD was higher. Then we dug a little deeper into the data to understand why. There were a few reasons. One, our deployment and operational experiences were disjoint: people were deploying, but they had to go to a different dashboard to figure out what was happening. Two, people had to go through hundreds of metrics and really know the application very well to understand the quality of the change, so resolution depended on how well the developer knew the application. Three, we didn't have an automatic, AIOps-based rollback through Argo Rollouts. All of that resulted in a higher MTTD.

So what we did, the solution, was to bring our AIOps-based observability into Argo CD and Argo Rollouts. We did three things. One, we added a metrics tab to Argo CD, so as soon as you deploy, you can check whichever metrics are relevant to the application right there. Two, instead of going through hundreds of metrics and making a manual judgment, we run a multivariate model powered by Numaproj, which gives you one signal that tells you the quality of your change, whether it's good or bad. Three, we removed the human from the equation: we integrated with Argo Rollouts so that if the change is bad, it automatically rolls back. And that has helped us reduce MTTD and MTTR. So let me hand it over to Vigith to show a demo.

Let me start off with a demo where the persona is a service developer. Let's pretend I'm a service developer and I'm going to make a change. The change I'm about to make has a small bug inside it, and we will use the three features I just mentioned: the metrics we have integrated, the time-series anomaly score, and the automated rollback that mitigates the problem. First, let me introduce the demo app. What you see here is that when the browser renders, it talks to the backend, and the backend gives you two things. One is the type of fish; here you see an octopus, this is version one we are running. The other is the color of the fish: yellow represents a happy state, meaning we are getting successful responses from the backend. Once in a while, you will see a red, unhappy octopus going around. This means the backend returned an unhealthy response. The key thing to note is that it's okay to have one or a few errors as long as it does not violate your SLA or SLO.

Now the change I'm going to make is a change in the backend, and the way we make that change is, let's go to the pull request. I'm going to merge this PR in and synchronize the change. This is an Argo Rollouts change, meaning this is progressive delivery where a new canary gets deployed as part of the change I just made. You will see that the old pods are getting terminated and the new pods are coming up. At the same time, if you look at the right-hand side, there is a new set of slow-moving red, one could say evil, fishes moving around. This is the new version of the application. It is slow, and you can see it's erroring out. The point is that in real life, unlike the demo, I don't deploy and then go to TurboTax Online to see whether there's a red fish. These are very subtle errors.
The way you detect these kinds of errors is by looking at your application's metrics. This demo is just to show the effect of such a bad change manifested in the UI. So what you need is to look at the metrics and quantify whether the release is good or bad. Along the same lines, look at what we see in Argo CD. You see that the resources are shown as healthy, even though the application behind them is unhealthy. Health here means your resources are healthy; it does not represent the state of your application. Now, traditionally, one would go to their metrics provider to look at the graphs. This is the first thing we said: we have metrics integrated with Argo CD. When you open it, what you see here is the default template. The tab name says "golden signals"; that's just the default configuration we have deployed at Intuit. It's a collection of metrics we think are vital for an application, from the RED family (requests, errors, duration) and the USE family (utilization, saturation, errors). This is out of the box for any application at Intuit, and you can reconfigure it to whatever you think is right. The tab itself is totally GitOps driven.

The way you read these metrics is the good old way: the x-axis is per minute; that's how we have configured it. The latency on the left is in seconds. The current stable hash, 5B7, the blue line, is showing around 0.04 seconds of latency, while the new canary hash is showing around 0.8 seconds. That's what you clearly see with the new canary versus the old octopus release. I showed averages, but that doesn't mean you have to look at averages; even the metric is config driven, and you can point it to percentiles or whatever you think is right. I'll talk about how we get the metrics down the line. We also have other metrics like status code, the total number of requests returning 200, including the canary. So if you look here, we do see some successful yellow happy fishes too; it's just that the number of successful requests is very low. Along the same lines, we have errors, and you can clearly see there's a spike in errors. We have HTTP traffic. And just to show that this charting library is very powerful, we added a pie chart out there; it doesn't mean much, but it shows the power of the charting library. We also have utilization metrics per deployment.

By looking at this, one could say, hey, my deployment is unhealthy: my errors are higher, I see higher latencies, and come to that conclusion. But what we found is that in real life you have way more metrics than that, and you need to be a very experienced application developer on that platform to really qualify or quantify whether a deployment is good or bad. You need to analyze a lot of metrics holistically. What we wanted was one single score that can decide it for you. That is the AIOps part of it. This is a multivariate AIOps platform that looks at multiple metrics, which again is configurable; you can configure as many metrics as you want it to look at, and it gives you one single score to make the decision. The way you interpret the score is: if the score is between zero and three, the application is operating within its normal pattern; between three and seven, it is deviating; and beyond seven, it is deviating heavily.
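To make that interpretation concrete, here is a tiny sketch that buckets a unified 0-10 anomaly score into the bands just described. The band labels are paraphrased from the talk; the function itself is plain Python and not part of any Numaproj API.

```python
def interpret_score(score: float) -> str:
    """Bucket a 0-10 unified anomaly score into the bands described in the talk."""
    if score < 3:
        return "normal operating pattern"
    if score < 7:
        return "deviating"
    return "heavily deviating"

assert interpret_score(1.2) == "normal operating pattern"
assert interpret_score(9.6) == "heavily deviating"  # the demo canary landed near 10
```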
And if you look at the right-hand side, the rollback has happened automatically. I'll get to how that happened. But the key thing is that while I was explaining, things happened and we mitigated the problem. This anomaly detection, unlike the demo with the yellow fishes and the red fishes, is a real thing. We have been using it at Intuit for the last four years. We use it to create incidents when there is a spike in anomalous user behavior on the front end, and when we see anomalous behavior when services talk to each other. So this is real, and that's the reason we open-sourced it: we felt the community could use it. That is the Numalogic part of it, where we open-sourced these models so you can use them.

Lastly, let me get into how the rollback happened. We have an analysis template; I might have to increase the font here a little bit. What you see here is an anomaly score that looks only at your canary deployment. Unlike the application-level score, this one focuses just on the canary, the small delta you have just deployed. The success criterion for this deployment is that the anomaly score should be less than three. The moment it sees the score is more than three, it automatically rolls back. So this is effectively an A/B comparison between your stable hash and the canary hash; that's the power of this. You can compute anomalies at different levels: the AIOps tab looks at the application as a whole, while here we are looking at just the canary. And you can see that it failed, because the score came very, very close to 10. Of course, anything greater than three is a failure, but you can clearly see the number of errors was high, and hence we rolled back.

I think that's the demo, and I just want to finish by recapping the three things you saw. One is the metrics: if you look here, even in the metrics, after the deployment is over the values are back down and the new canary is no longer seen. Then you saw the AIOps score that looks at the entire system as a whole. And lastly, you saw the canary analysis using AIOps. Those are the three things. Now let me get into the architecture and how we are doing it.

Okay, so how are we doing it? When a user loads the Argo CD UI, it talks to the Argo CD server. Here we use the Argo CD extension framework to see whether observability has been turned on. We developed a new component for this project called the Argo CD metrics server, which can pull metrics from any metrics provider you have. We use Prometheus, hence Prometheus here, but it can talk to any metrics provider; this is a decoupled architecture. That is the first thing you saw, the metrics being collected by Prometheus. Since this is a decoupled architecture, to show the anomaly score all Numaproj needs to do is write the anomaly score back to Prometheus; that's how the AIOps tab you saw is fed. Now, for Argo Rollouts, all you need to do is point the analysis template at Prometheus. Numaproj writes another anomaly score just for the canary deployment, so the analysis template used by Argo Rollouts can read that score and make the rollback decision.
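To illustrate that last step, here is a minimal sketch of the decision the canary analysis effectively makes: read the canary-scoped anomaly score from Prometheus and pass only while it stays below three. In the real setup this check lives in an Argo Rollouts AnalysisTemplate with a Prometheus provider; the metric name, label names, and endpoint below are illustrative assumptions, not the exact query used at Intuit.

```python
import requests

PROM_URL = "http://prometheus.example.com"  # assumed Prometheus endpoint

def canary_anomaly_score(app: str, canary_hash: str) -> float:
    """Fetch the canary-scoped unified anomaly score (hypothetical metric name)."""
    query = (
        f'rollouts_unified_anomaly{{app="{app}",'
        f'rollouts_pod_template_hash="{canary_hash}"}}'
    )
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def should_promote(app: str, canary_hash: str, threshold: float = 3.0) -> bool:
    """Mirror of the analysis success condition: keep promoting only while score < threshold."""
    return canary_anomaly_score(app, canary_hash) < threshold
```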
Now this brings up a second question: how do we compute the anomaly? This anomaly is computed even when you are not rendering the UI and even when you are not deploying; it is always on. This is a streaming system that constantly inspects your system and says whether it is doing well. The way it happens is that Prometheus has a feature called remote write. For every metric it scrapes, it pushes the metric as a real-time stream into our Numaproj pipeline. Numaflow, on which Numalogic runs, does the feature engineering, like scaling the metrics and so forth. Once that is done, it does the inferencing and assigns an anomaly score. Once the anomaly score has been computed, we do post-processing. This normalizes it into a human-understandable format between zero and 10; otherwise it would be too complex to interpret. And this is an N-plus-one anomaly score: we compute one anomaly score for each metric in the multivariate set, plus one unified anomaly score.

Since this is an operational system, things change all the time; we cannot keep reconfiguring the anomaly system by hand. It auto-discovers new applications and new configurations that come online. That means we have to do inline training all the time, so we trigger a training job, which fetches the data from Prometheus and stores the model in model storage. For model storage we use MLflow out of the box, and training runs as workflows. This is, conceptually, the streaming system we have been running at Intuit for the last four years.

Since we have been doing streaming for a while, we ran into a couple of challenges with real-time streaming. Problem number one is a lot of boilerplate code for streaming. What we found was that our ML engineers were spending more time writing streaming code and dealing with the streaming infrastructure, for example creating Kafka topics and figuring out how to reliably consume and produce data if you use Kafka, rather than doing what they do best, which is ML exploration and ML experiments. That was one big problem for us. The second was ad hoc code; ad hoc here really means non-standard. For example, we deployed streaming systems as a Deployment, which is mostly suited to north-south traffic, while streaming traffic is east-west. There is much more to it: you really need to understand the pipeline, and streaming systems behave more like fluid dynamics. It is not a north-south, single-transaction flow; context is very important. All of this made it very difficult to do quick experimentation and extension. For example, we want to try a new model, we want to try different kinds of feature engineering, and we want to add more enrichment; this made that very hard.

This does not mean there are no other data streaming systems out there. There are excellent systems like Apache Flink, Spark Streaming, and Samza, to name a few. But those are complex data engineering tools, meant to do data engineering in a centralized way. In the earlier demo, the whole Numaproj pipeline sits alongside your application: it is so lightweight that it sits in your application namespace and computes the anomaly there, so you don't need to ship the data off to a central system. We moved the whole problem to the data producer side.
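As a rough picture of the scoring stages described above (preprocess, inference, post-process into N per-metric scores plus one unified score on a 0-10 scale), here is a minimal sketch. The z-score logic stands in for the real Numalogic model, and the clipping threshold and the way the unified score is aggregated are illustrative assumptions.

```python
import numpy as np

def preprocess(window: np.ndarray) -> np.ndarray:
    """Feature engineering: scale each metric column to zero mean, unit variance."""
    mean = window.mean(axis=0)
    std = window.std(axis=0) + 1e-9
    return (window - mean) / std

def inference(scaled: np.ndarray) -> np.ndarray:
    """Raw per-metric anomaly signal; here simply the magnitude of the latest z-score."""
    return np.abs(scaled[-1])

def postprocess(raw_scores: np.ndarray, clip_at: float = 5.0) -> dict:
    """Normalize raw scores into 0-10 and add the unified (N+1th) score."""
    per_metric = np.clip(raw_scores, 0.0, clip_at) / clip_at * 10.0
    return {"per_metric": per_metric.tolist(), "unified": float(per_metric.mean())}

if __name__ == "__main__":
    # 60 one-minute samples of 4 metrics (e.g. latency, errors, 2xx rate, CPU)
    window = np.random.default_rng(0).normal(size=(60, 4))
    print(postprocess(inference(preprocess(window))))
```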
So we needed a very lightweight, cost-efficient system, rather than the complex data engineering tools and frameworks out there. For these reasons, we ran experiments and ended up building our own system. This is Numaflow. Numaflow is a distributed, DAG-based (directed acyclic graph) stream processing system, native to Kubernetes, designed with the idea that stream processing should be very easy for both application engineers and ML engineers. The way it works is you listen to a stream. It is an unbounded stream: the data keeps coming and never ends. It reads the data, and it tracks event time and watermarks; those are some advanced features, mentioned for completeness. Then it passes the data to the first user-defined function (UDF). All the UDF cares about is that it is given a key and a value, and the output can be zero, one, or many keys and values. Once the output has been computed, it can choose which path to take, either all paths or a subset of paths. This is for A/B testing, weighted experiments, and so on; internally we call it conditional forwarding. You can chain as many UDFs as you want. At the end, there is a sink. The sink's main job is to make sure we persistently write the data to another stream or to a blob store. You can easily write a user-defined sink, because you might come up with many different places to write to, and we define a very easy way to write UDFs and sinks.

So what are the features of Numaflow? The most important value, the one we hold core to ourselves, is that it should be very, very easy to use. You should be able to get it up and running in five minutes, and, we would even say, learn it in less than five minutes. It is language agnostic, meaning you can write UDFs and sinks in whatever language you like. It is lightweight and native to Kubernetes; you should be able to run it at the edge. It is cost efficient: it can autoscale to zero, meaning if there is no traffic, it scales itself down to zero so no cost is incurred, and it can resume from where it left off. With complex systems like this, what we need is separation of concerns. Streaming systems are complex and can get very big with a large DAG, so separation of concerns is very important: we separate functions into units, and we guarantee that the data transfer between them has exactly-once semantics, so there are no duplication problems. We also support weaker delivery semantics like at-least-once, but the key thing is we support exactly-once.

With all this, we have standardized streaming. What that means is, in the demo you saw, we were reading from Prometheus and writing the anomaly score back to Prometheus. You can easily replace the source with Kafka and the sink with something else, and the core ML logic still works. This is how we use the same platform for customer-centric anomaly detection and more: for the core anomaly detection, all you need to do is change the source and the sink, and things will work. We also do back-pressure handling, meaning we can understand the resistance in the data flow and autoscale; this is the case where your consumer is slower than your producer.
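Going back to the UDF contract described above, here is an SDK-free sketch of what a map-style UDF conceptually looks like: a key and a value come in, zero or more keyed messages go out, and each message carries a tag so the platform can forward it only down matching edges (conditional forwarding). This is illustrative only; the real Numaflow SDKs define their own message types and handler signatures.

```python
import json
from dataclasses import dataclass, field

@dataclass
class Message:
    key: str
    value: bytes
    tags: list[str] = field(default_factory=list)  # edges this message may take

def scale_udf(key: str, value: bytes) -> list[Message]:
    """Example UDF: scale a metric sample and route it by severity."""
    sample = json.loads(value)                                   # e.g. {"metric": "latency", "value": 0.8}
    scaled = {"metric": sample["metric"], "value": sample["value"] * 1000}  # seconds -> milliseconds
    tag = "slow-path" if scaled["value"] > 500 else "fast-path"
    return [Message(key=key, value=json.dumps(scaled).encode(), tags=[tag])]

if __name__ == "__main__":
    out = scale_udf("demo-app", b'{"metric": "latency", "value": 0.8}')
    print(out[0].tags, out[0].value)  # ['slow-path'] b'{"metric": "latency", "value": 800.0}'
```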
So we autoscale based on back pressure, and we have watermarks for correctness. Lastly, anything built on Numaflow should be operationally excellent. It is a fire-and-forget model, because your input never ends; it is an unbounded stream. You turn it on and you can forget about it, because we take care of automatic retries, pod migration, node migration, and so on.

Now let me quickly introduce Numalogic. Numalogic is our ML side, and it powers the real-time analytics. It is actually a library, but it works very well on top of Numaflow. The way we use it today is that a stream comes in, we do pre-processing, inference, and post-processing, and write the result to a sink. We have many different models, for example autoencoders, clustering logic, and so forth, for different use cases, but at a very high level this is how the deployment looks at Intuit. We also do inline training using the same system. If the model has become stale, or you need a different way to train it, you can pass in that information and retrain. The sink here is the MLflow sink, so there is a feedback loop back to the inference system.

The features it provides: Numalogic is a repository of AIOps models that are very well suited to time-series data. It comes with a data processing toolkit that helps you analyze and understand operational data, which is time-series in nature. Online training is available out of the box. A/B experimentation is one core feature we really hold onto, because we experiment a lot with ML models. And the point is that earlier our ML developers had a tough time with streaming and everything around it; this improves ML development velocity. It is cookie-cutter development: you write your UDFs, you deploy them, and you get every feature of Numaflow I talked about, plus everything Numalogic provides on top of it. Honestly, these slides don't do justice to Numaflow or Numalogic because it is a very deep system; come to our booth and we can talk more about it. But in the interest of time, I'll hand it back to Amit to talk about the road ahead.

Thank you, Vigith, for the awesome demo and introduction to Numaproj. So, what's next for us? We are obviously adopting this across our Argo CD and Argo Rollouts clusters. We also have a very rich roadmap for Numaproj. On the Numaflow side, some of the things we are open-sourcing include windowing and aggregations, which we use internally. On the Numalogic side, we will be open-sourcing other models we use internally, like forecasting. These are our GitHub repos; if you go to any of them, there are a lot of examples and documentation on how to get started. Take a look, and if you like it, you can star it. You can also, like Vigith said, come to our booth. There are other use cases for Numaproj you can look at, including streaming workflows, image processing, and automatic scale-up and scale-down, so you can visit and have a discussion. Lastly, before I end, I also want to thank NATS; we have been collaborating with them on Numaproj around JetStream. So thank you, and we are open for questions.
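As a footnote to the inline-training discussion above, here is an illustrative sketch of that feedback loop: inference checks whether the stored model is stale, triggers a retrain on a recent window, and the new artifact is written back for inference to pick up. The in-memory registry stands in for a real model store such as MLflow, and the baseline model and retraining policy are assumptions for illustration only.

```python
import time
import numpy as np

registry: dict[str, dict] = {}   # stand-in model store: {app: {"mean", "std", "trained_at"}}
MAX_MODEL_AGE_S = 24 * 3600      # retrain daily (assumed policy)

def train(app: str, window: np.ndarray) -> None:
    """Fit a trivial per-metric baseline and 'publish' it to the registry."""
    registry[app] = {"mean": window.mean(axis=0),
                     "std": window.std(axis=0) + 1e-9,
                     "trained_at": time.time()}

def score(app: str, sample: np.ndarray, recent_window: np.ndarray) -> float:
    """Inference path: retrain if the model is missing or stale, then score 0-10."""
    model = registry.get(app)
    if model is None or time.time() - model["trained_at"] > MAX_MODEL_AGE_S:
        train(app, recent_window)          # feedback loop: inline retraining
        model = registry[app]
    z = np.abs((sample - model["mean"]) / model["std"])
    return float(np.clip(z.mean(), 0, 10))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    history = rng.normal(size=(240, 4))
    print(score("demo-app", rng.normal(size=4) + 5, history))  # shifted sample -> high score
```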