Good afternoon, everyone. Is everyone enjoying ArgoCon so far? Today we want to talk about how we incorporated observability, powered by our AIOps platform, into Argo CD. My name is Amit Kalamkar; I lead observability and analytics at Intuit, and I have with me Vigith Maurice, a principal engineer at Intuit. I am representing Numaflow, he is representing Argo, so we have all our bases covered.

Here is our agenda for today. We will talk about the problem we were having, which is change-induced incidents, and how we solved it. Then we will have a live demo, wireless gods permitting; otherwise we have a backup. Then we will get into Numaproj and how it is powering these experiences, and finally we will talk about our roadmap.

Let me start with Intuit. Most of you should be familiar with Intuit through products like TurboTax and QuickBooks. We are one of the largest fintech companies out there, and also one of the largest SaaS companies. We are deeply committed to open source: we are not only using it internally, we also have a lot of maintainers and contributors across many projects, Argo to name a few. We are also one of the top end-user companies in the CNCF, and with these new capabilities we are building, we want to make sure we continue this collaboration with other end-user companies out there.

Let me first ground everybody in our tech ecosystem. We started our modernization journey in 2018 and modernized pretty much everything: the front end, the back end, the platform infrastructure. As part of this, we moved all our container workloads onto Kubernetes, and we created paved roads so that it becomes easy for our developers to deploy; the GitOps automation is based on Argo. We also deliberately ensured that the whole stack, the platform and infrastructure, is instrumented, so we get real-time events into our operational data lake, and we built the AIOps platform, which we will talk about a little later, to derive actionable insights from them.

Let me also touch a little on the scale at which we operate. We have more than 2,500 services currently running in production on Kubernetes. As far as our data lake is concerned, we generate more than 2.5 billion real-time events per day, and the AIOps platform is already generating more than 16 million predictions per day, increasing day by day as we add more use cases. One of the core principles of Intuit is innovation: we want to innovate fast and have those innovations reach our customers, and that is only possible because of the investment we have made in our GitOps automation as well as Argo CD and Argo Rollouts. As many of you know, Argo CD came out of Intuit, and now it's used by hundreds of companies across the world.

So now let me talk about the problem we were tackling. As Pratik mentioned in today's keynote, operational excellence is one of the priorities at Intuit. We want to ensure that all of our core capabilities are always available, and that if we do have incidents, hopefully we don't, our MTTD and MTTR are manageable. When we looked into the data, we found that one-third of our incidents were caused by change, and we wanted to address that. So we dove a little deeper to figure out why, and there were two main reasons.
One, we found there was a disjoint experience between the deployment side, the Argo CD UX that most people use, and the observability experiences we have. Two, for developers it was a little harder to figure out the quality of a change, and we wanted to address that because it was increasing our MTTD. So what we did was bring our observability experiences into Argo CD. It gave us two things. One, a new capability where the metrics are incorporated directly into Argo CD, so developers can look at Argo CD itself and see any metric they want, whether it's a golden-signal metric or any custom metric, right at the place where they are deploying. Two, for the quality of change, we gave them one signal, using Numaproj, which Vigith will talk about, where they can look at that one signal and say whether the change is good or bad; and if it is bad, they can revert it. So now let me hand it over to Vigith for the demo.

Thank you, Amit. Before the demo, let me talk about the demo plan. Let me pretend that I'm a service developer who wants to make a change, and the change I have has a bug inside. We will see how the new additions to Argo CD, that is, the metrics view and the AI-powered anomaly score, can be used to quantify the quality of the change. So, demo time.

First, let me introduce a small demo app. What you see are some yellow fishes moving across the screen. Each fish represents a request this browser is generating every second; you can imagine around one to five requests per second. The angry red fish represents an error coming from the back end. The change I'm going to make is to the back end, not the front end, and we'll see the experience the customer gets because of the change, and how we can detect the problem by looking at the Argo CD observability.

It starts with a Git pull request. We are going to merge this pull request in; this is the change we have. It's merging in, and let me refresh the screen here, it's a little slow. Let me sync in the changes. So there's nothing new here so far, right? It's the same Argo CD you are used to seeing. We do represent health here as a heart, the green icon, but it doesn't represent the health of the application running on the resource; rather, it shows the health of the resource itself. For this reason, what people traditionally do is switch away from here and go to the metrics provider's UI. For example, at Intuit we use Wavefront; it could be Prometheus, Datadog, whatnot. That is the disjoint experience I was talking about.

To solve that problem, we have integrated a metrics tab here, which shows the metrics as you see. I'll talk in detail later about how we are getting the metrics, but for the demo I'll just talk about what we are seeing. These are the KPIs you think are most relevant for your application. The configuration you see here is the default we have deployed at Intuit, the reason being that at Intuit, every application deployed has a concept of golden signals. A golden signal is nothing but a family of metrics from RED, meaning requests, errors, and duration, slash USE, with USE standing for utilization, saturation, and errors. We call them golden signals, a term popularized by Google, actually.
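As a rough illustration, here is what such a golden-signal view configuration could look like, sketched as a Python structure of PromQL queries. The field names, metric names, and the `{{app}}` placeholder are illustrative assumptions, not the actual schema of our metrics server:

```python
# Hypothetical golden-signal view configuration, sketched as a Python
# structure of PromQL queries. Field names, metric names, and the
# "{{app}}" placeholder are illustrative, not the real schema.
GOLDEN_SIGNAL_VIEW = {
    "tab": "Golden Signals",
    "interval": "1m",  # one-minute granularity, as shown in the demo
    "rows": [
        {
            "title": "Average latency (s)",
            # Swap this for a histogram_quantile() expression to show
            # P99/P75 instead of the average; only the config changes.
            "query": (
                'sum(rate(http_request_duration_seconds_sum{app="{{app}}"}[1m]))'
                ' / sum(rate(http_request_duration_seconds_count{app="{{app}}"}[1m]))'
            ),
        },
        {
            "title": "Requests per minute",
            "query": 'sum(rate(http_requests_total{app="{{app}}"}[1m])) * 60',
        },
        {
            "title": "5xx errors per second",
            "query": 'sum(rate(http_requests_total{app="{{app}}",code=~"5.."}[1m]))',
        },
    ],
}
```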
What you see in the view is at one-minute granularity: everything in this metrics view is one minute on the x-axis, and we show the value for that interval. For example, latency: the way you interpret it is that on September 20th at 13:40, the deployment 59FF has a latency of around 0.2; the latency axis on the left-hand side is in seconds. The key thing is that what I showed is average latency, but that does not mean we can only show averages; it could be P99 or P75. The only thing you need to change is the configuration for getting the metrics from the back end, from your metrics provider. It is very configurable; the whole page is configurable. In fact, you can configure it like this: what you see here is the golden-signal view, with multiple metrics coming in. You change your expression, you get the metrics, and the UI renders them.

Now that the deployment is going through, let's see what the user experience is. The purple fish represent, let's say, the more enlightened fishes, but we see a lot of errors: there are more angry fishes in play, meaning something is going wrong with the deployment. You can see this using the metrics; that's the point of metrics. We scrape the metrics roughly every 30 seconds, so let me switch back and see. You can see that, hey, I see latency going up; I also see my 200s dropping, okay, and a spike in 500s. The total number of requests coming in per minute is around 990, and that remains constant. We also show more metrics, like utilization metrics; right here we are color-coding each pod and mapping each deployment ID to that pod. We see nothing much different there, but we have put a pie chart here just to show the power of the charting library available in Argo CD.

So now you clearly see there is a problem, right? The question is: is that enough? Is it enough to make the decision to roll back? If I am an experienced developer on this project, I could make the decision by looking at the key metrics I know about. But that's not enough; the idea is that everybody should be able to make that decision very easily, and to do that, you should look at the system holistically rather than at an individual metric. It boils down to this: how can we provide a single score that represents the quality of your deployment? That is where the AIOps platform comes into play.

So if you look here, it clearly shows there's a problem; it just depends on how we interpret the data. At 13:49, right after the deployment (the deployment is shown by this flash icon), you see that the anomaly score has gone up: it's showing around 9.97. To interpret this score, you need to understand how the anomaly score is generated: it is a three-segment score. Where you draw the segments can depend on your use case, but at the end of it, zero to three means the application is behaving within its normal operating pattern; three to seven means most of the metrics or patterns are deviating from the normal operating pattern; and above seven means it has deviated completely, meaning the deployment is bad. At that point, one could come in and roll back. So this is the single score that can tell you whether your deployment is good, bad, or ugly.
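In code, interpreting the score is as simple as banding it on those two thresholds. A minimal sketch (the function name is ours; the thresholds of 3 and 7 come from the talk):

```python
def interpret_anomaly_score(score: float) -> str:
    """Band a normalized 0-10 anomaly score into the three segments
    described in the talk (thresholds of 3 and 7)."""
    if score < 3:
        return "normal"     # within the normal operating pattern
    if score < 7:
        return "deviating"  # most metrics drifting from the normal pattern
    return "bad"            # total deviation: a rollback candidate

# The demo's post-deployment score of ~9.97 lands squarely in "bad":
assert interpret_anomaly_score(9.97) == "bad"
```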
This score is generated by analyzing many different metrics; that's the whole point, it's a multivariate system. Earlier we talked about analyzing the system holistically, and this multivariate score is also configurable. For example, if I wanted to compute an anomaly on the number of fishes, versus on the golden signals along with latency and errors, I could do it. All you need to do is change the configuration that's in place, so the system learns about those metrics and then gives you an anomaly score based on them. The model, in fact, is an autoencoder. This system has been at Intuit for more than three to four years; we have been using it, and it's very powerful: it creates incidents at Intuit, so it's that powerful. And that's the reason we thought, hey, we have such a great system that can detect problems, why not integrate it into Argo CD and make it available to the community, so everybody can make use of this cool feature we are using at scale?

Now that I am done with the demo, let's go back to the architecture and how things play behind the scenes. What happens is that when the user loads the UI, it talks to the Argo CD server, the good old way. The Argo CD server, through the Argo CD extension framework, knows that observability is turned on, and it then talks to something called the Argo CD metrics server, which is something we developed for this project. The Argo CD metrics server holds the configuration for all the metrics views and so forth, and it also acts as a proxy to the metrics provider. We use Prometheus, but it could be a different one, Datadog, whatnot. This makes the data available to the UI; the UI gets the data through this request flow and data flow. It is a decoupled architecture, meaning that to make any metric available in Argo CD, all you need to do is make sure the metric is available in your metrics provider. This is also how the AIOps system is able to surface its anomaly score: totally decoupled, meaning it can write the score back to the provider, Prometheus, and you see it in the UI.

Now let me get a little more into the anomaly system, because it's a very interesting system. I'll go in small batches here, because there's a lot. How are we getting the metrics? This is a real-time system: the anomaly score is not computed when you open the page; it's always there. It starts with a Prometheus feature: Prometheus can stream metrics over the HTTP remote-write protocol to our anomaly system. So every metric Prometheus gets, it forwards, streams is the right word, to our pipeline, which does feature engineering. Feature engineering is something like, you can imagine, standard scalers and logarithm transformations, to make sure the data is compatible with our inference system. It also does some grouping, a group-by, so the anomaly detection system can understand the multivariate aspect. Once that is done, it forwards to the ML inference system, where a score is assigned. That score is then sent to a post-processing system that normalizes it between 0 and 10, so that the score is intuitive; otherwise it would be all over the place. We want the user to be able to make a decision based on the score, so it should be universal. And this is the happy path, meaning the feature engineering happens and the ML inference assumes the model is in memory, so it can do the scoring.
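To make that happy path concrete, here is a hedged sketch of the transformation and normalization steps just described; the exact transforms and ranges are assumptions, not our production code:

```python
import math

def engineer_features(values, mean, std):
    """Feature engineering as described: a logarithm transform followed
    by standard scaling, so the stream is compatible with the inference
    model. `mean` and `std` would come from training-time statistics."""
    out = []
    for v in values:
        log_v = math.log1p(max(v, 0.0))            # logarithm transformation
        out.append((log_v - mean) / (std or 1.0))  # standard scaler
    return out

def normalize_score(raw, raw_min, raw_max):
    """Post-processing: squash the raw model output into 0..10 so the
    score reads the same way for every application."""
    clipped = min(max(raw, raw_min), raw_max)
    return 10.0 * (clipped - raw_min) / ((raw_max - raw_min) or 1.0)
```

A group-by keyed on application or endpoint would sit in front of this, so the inference step sees all of a group's metrics together, which is what gives the score its multivariate aspect.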
But this is an AIOps system, meaning it has to adapt as things change on the back side. For example, a new application comes in, or an existing application exposes a new endpoint. That means we are seeing a metric for the first time, so we don't have a model in place, and there should be an online system able to handle that. So this is what happens: in case the inference system does not have a model, it triggers online training. The provider is, again, Prometheus; it reads the data, trains the model, and puts it into model storage. We use MLflow here, and that is what the ML inference system talks to to get the model in place. So once it is running, it's always running: you can change your application, it detects the changes, and it will still assign an anomaly score. Our system is good enough that if you have 180 data points, it can give you an anomaly score. You shouldn't have to change code or configuration every time you add an endpoint, because then during RCAs we would be saying the problem wouldn't have been there if only I had added that configuration. We always detect the changes and do online training.
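A hedged sketch of that online-training path, with every object name hypothetical (the model store could be backed by MLflow, the provider by Prometheus):

```python
def score_with_online_training(metric_key, window, model_store, provider, trainer):
    """If no model exists for this metric group (say, a brand-new
    endpoint), trigger online training from the provider's history,
    persist the model, then score as usual. All names here are
    illustrative, not the real system's API."""
    model = model_store.load(metric_key)            # e.g. backed by MLflow
    if model is None:                               # first time seeing this metric
        history = provider.range_query(metric_key)  # read history from Prometheus
        model = trainer.train_autoencoder(history)  # online training
        model_store.save(metric_key, model)
    return model.score(window)                      # ML inference as usual
```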
This architecture has been in place at Intuit for a while; we use it to create incidents, which means it should be, in a way, a proven architecture. We can't wake up half of Intuit because our anomaly score is bad. I'm not talking about alerts here; we literally create incidents. It means it has to work like clockwork. When you read about it, it sounds like a very beautiful system, but if you look at it at a microscopic level, each component has subtleties, and those become barriers to building a stream processing system. I will talk about those barriers at a very high level; there are many, but I'm covering just a few.

So what are the barriers to building an AIOps system? First, if you look into every component, everybody has to know streaming. Most of our ML engineers became experts in streaming at one point, because they all needed to know how to read from and write to Kafka. It sounds easy, but Kafka is a dumb-broker, smart-client architecture, which means the client has to be very, very competent. There is a lot more to having reliable streaming in each component, and each component should stay lightweight; that's a barrier.

Second is ad hoc code. Ad hoc really means non-standard, and by non-standard we mean that we need to look at the streaming platform as an end-to-end system, not at the component level. Think about it: in the nth vertex, the nth component, you have a backlog. The source should know about it; otherwise you will end up with data loss, because the source will keep pushing while the last component is backlogged. It will never catch up, and in most streaming systems the data will eventually be dropped and you end up with data loss. And mostly what happens is people try to deploy this connected system as microservices, which is the worst: each microservice only knows about one thing at a time, but you need to look at the entire pipeline holistically to make it work.

It's also very difficult to extend the platform, because with raw streaming, if I want to, say, fork out data for A/B testing, I need to create a new Kafka topic or streaming topic, then have my own consumer group; it gets quite complex. So what actually happens is people put more code onto the same component, thinking it's easier for the time being because they are just plugging and playing, or experimenting. But soon the experiment gets into production and we see issues. So it's very important that in AIOps we get to experiment and play in a reliable and operationally excellent way.

This does not mean there are no data engineering frameworks that can do streaming; there are excellent systems. For example, we ourselves use Apache Flink, with the Apache Beam abstraction. Then there is Apache Samza, and many, many other players, but they are built for data engineering, not for AIOps or simple data processing; their aim is something like 2 billion TPS. And think about it: if I were to run this demo with a Flink cluster next to the application cluster, who would be eating more cost? Your Flink cluster would eat away a lot of money, because it's very complex, and the barrier to entry for streaming is very high: the knowledge of how to really do streaming can get very, very complex, and it comes at a cost. If you do data engineering in a centralized system, you need Flink or the like. But if you move the problem to the source level, like what we saw today where we cut MTTD by detecting a problem as soon as it happens, you need a lightweight system sitting alongside your source that can detect the problems; and of course that solution is not cheap with those frameworks either.

So what does the ideal AIOps platform look like? I think we can all agree on this. It should be very easy to use; one could say it should be up and running in minutes, and possibly even learning in less than five minutes. It should be truly language agnostic: I should be able to write each component in any language I like, without knowing how the streaming itself is done behind the scenes. It should be lightweight, meaning I should be able to move it all the way to an edge system, and it should be native to Kubernetes, because Kubernetes reaches a lot of lightweight edge systems. Cost efficiency is very, very important: as the day goes by, your input arrives at different throughputs, and if there is no input, the system should detect that and scale down to zero, meaning you incur no cost at that point. Separation of concerns is paramount here: when you build complex systems like streaming systems, each component should worry only about what it actually does; it shouldn't have to worry about how streaming works, or how to make sure data is forwarded only once. Exactly-once semantics is tricky; it is, so to speak, the hardest of all the delivery semantics, and the point is the platform should also support lower-order semantics like at-most-once or at-least-once out of the box.

Then there is standardized streaming. What this means is that it shouldn't matter whether you use Kafka, an HTTP streaming endpoint, or a TCP endpoint; the rest of the ML inference system should just work. And it's a true story: today I showed Prometheus with HTTP remote write as the input, while at scale we use Kafka as the input, because we have a centralized way of doing AIOps; when you do it per cluster, you use HTTP, but the rest of the anomaly detection system doesn't change at all. If I were to swap the sink, for example, instead of writing to Prometheus I could write to Wavefront for all I care, and the detection system, with the ML engineer's code, still works as is. There's one small point, called watermarks; it's very, very important, as it's what lets you reason about completeness. I'm moving past it here just by mentioning it, because it's a tricky but very interesting topic.
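For the curious, here is a tiny conceptual sketch of one common watermark definition, the minimum event time observed across inputs; this is a general streaming idea, not necessarily how Numaflow implements it:

```python
def watermark(latest_event_time_per_input: dict) -> int:
    """One common watermark definition: the minimum of the latest event
    times seen across all inputs. Any window ending at or before the
    watermark can be considered complete and safely closed."""
    return min(latest_event_time_per_input.values())

# Three upstream partitions, as Unix timestamps:
wm = watermark({"p0": 1_700_000_060, "p1": 1_700_000_055, "p2": 1_700_000_058})
# wm == 1_700_000_055, so windows ending at or before it are complete.
```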
Anything built on this ideal system should be operationally excellent, meaning it's fire and forget. Think about it: your input is an infinite source; it never ends. It's not like a batch job that starts, ends, and then gives you something to look at. This is a fire-and-forget system: you move it to the source, it auto-scales, and it can handle retries, spot migrations, node migrations; none of that should matter. In fact, we look at the system through queuing theory at each vertex, and that makes it beautiful, I would say. Anything you write on this ideal AIOps system should be operationally excellent.

With this, let me introduce Numaflow. This is the platform we built based on our learnings from four to five years. One could summarize Numaflow as a Kubernetes-native, DAG-based (DAG meaning directed acyclic graph), distributed stream processing platform, developed with the mindset that stream processing should be made easy for every application developer and every ML engineer. Now, I'm doing injustice to Numaflow by covering it in just one slide, but let me try to do it some justice.

First, you read from the source. You have an infinite, unbounded problem, so you're going to read from the source; a bounded stream is a subset of an unbounded stream, so implicitly we support bounded streams too. Anyway, as you read from the source, you assign an event time and a watermark; this is to track the progress of time, and time is very important in an unbounded system. Once you read from the source, the next vertex is a UDF: every message goes to a user-defined function, which is nothing but a container. The input to the UDF is a key and a value; the key is a string, and the value is a byte array, which could be anything. The output is also key-value pairs, but there could be zero of them, think of a filtering system where you drop data you're not happy with; or one, that's the good old map; or more than one, that's a flat map, a superset of the others. So a UDF takes a key-value input and produces zero, one, or more key-value pairs.

Once the output has been computed, you can decide which path the message should take. A good analogy: if you're doing experimentation, you want the same data to go down two paths, so model A runs on it and model B also runs on it. In some cases the UDF is a decision UDF: say I analyze the data and find it belongs to a low-cardinality system, the data is sparse, not dense, so I'm going to use something like a SOM model, whereas if it were dense I would use an autoencoder model. You can make decisions on the message and forward it to the right place; that is called conditional forwarding. After conditional forwarding, you can have any number of these steps; for simplicity I just put two there, and then you persist to a sink. The last vertices are sinks. A sink is different from a UDF, because you need to do batch writing and things like that; it should be efficient and able to scale, so sinks are handled separately. Numaflow considers a pipeline to run between what we call terminals: a source and a sink. Whatever happens in between, from the moment we read from the source to the moment we write to an external sink, follows exactly-once semantics and all the goodies we talked about on the previous slide.
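To make the UDF contract concrete, here is a hedged Python sketch of the shape just described: key/value in; zero, one, or many key/value pairs out, with tags driving conditional forwarding. This is a conceptual illustration, not the actual Numaflow SDK API:

```python
from dataclasses import dataclass, field

@dataclass
class Msg:
    key: str
    value: bytes
    tags: list = field(default_factory=list)  # drive conditional forwarding

def is_sparse(value: bytes) -> bool:
    # Hypothetical stand-in for real cardinality/density analysis.
    return value.count(b"0") > len(value) // 2

def udf(key: str, value: bytes) -> list[Msg]:
    """A user-defined function in the shape the talk describes:
    zero outputs (filter), one (map), or many (flat map)."""
    if not value:
        return []  # filter: drop messages we're not happy with
    if is_sparse(value):
        # decision UDF: sparse, low-cardinality data goes to the SOM path
        return [Msg(key, value, tags=["som-model"])]
    # dense data goes to the autoencoder path
    return [Msg(key, value, tags=["autoencoder"])]
```

For the A/B experimentation case, the same UDF could return the message twice with both tags, so model A and model B each receive it.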
Please come to the booth; we can talk a lot more about Numaflow and how Argo CD is using it, because this was just one slide trying to summarize a very interesting topic and a very cool feature, so please feel free to drop by. With this, I will hand it over to Amit.

Did everybody like the demo? Yeah? So Vigith went into a lot of detail, but I want you to remember one thing: Numaflow takes care of everything Vigith talked about. You should be up and running in 5 to 10 minutes; if you are not, please let me know, because then we have a problem.

So what's coming ahead? First, we want to roll out this Argo CD observability to all our clusters within Intuit. Second, we want to integrate with Argo Rollouts so that we remove the human from the equation: Argo Rollouts looks at this one anomaly signal and automatically rolls back. And we feel Numaflow has a lot of interesting use cases which we are already working on, in-cluster analytics, including automatic scaling, which directly affects cost, as well as automatic alerting. So please go and look at our Git repos under Numaproj; if you like them, star them. And as always, we are hiring, so if anybody is interested in working on this awesome stuff, please reach out to one of us. Thank you. Yes, you can scan the code; it will take you to the GitHub repo, and there are a lot of examples there you can try out.

[Audience] Sorry, I couldn't... The charts at the beginning, where you showed response time for the old deployment and the new deployment, and also the configuration where you set the data sources: it looked a bit like Kayenta, including the configuration. How is it related? Is Kayenta a part that works with Numaflow, and are they interconnected? Or how do these pieces fit together?

So the configuration we showed was for the metrics view. But down the line, we also have configuration for which metrics should be fed into the anomaly detection system: what multivariate features you would like to have, and for those you just need to provide the metric names. I would say it's all in one place; we are trying to keep it alongside the deployment of your application, where you can say: these are the metrics I'm interested in viewing, because these are my application KPIs, and these are the metrics that should be input to the multivariate anomaly system. And to the second part of your question, everything is real-time.

[Audience] Yeah, what about latency?

So the end-to-end latency of Numaflow is in milliseconds. The only latency that gets added is however complex the UDF in between is: if you are using a very complex model running on very cheap hardware, it could take, for example, around 0.5 seconds, but everything is real-time. If you were to use a GPU-based system, even that would not be a problem; 0.5 seconds is the end-to-end for the anomaly detection system. The asterisk is how often you scrape the metrics from the application: if you scrape every 15 seconds, that is actually the delaying factor, not the end-to-end inference system. Okay, thank you.