So thanks everyone for coming. We're going to be talking about real-time Argo Rollouts analytics, powered by a notification engine. We'll talk about how we did some analytics with the help of the notification engine at Intuit. My name is Henrik Blixt. I'm a group product manager at Intuit working on our development platform and developer experience, and I'm also one of the Argo maintainers.

Hi, I'm Vijay Agrawal. I'm a senior engineering manager at Intuit, and I lead the CI/CD systems there.

So just quickly, what we'll go through today: I'll give you a brief overview of the Intuit platform, what we've been up to the last few years, and why we're doing what we're doing. We'll go a little bit into some of the operational challenges we've seen in our platform, which led up to what we're going to talk about today. Then Vijay will walk us through what we did and how we did it, and we'll wrap up with a little view into the future, some ideas on how we can take this to the next level.

So quickly, Intuit. We've been working on our development platform for the last few years, and we've transitioned to a modern, cloud-native platform. I think you've all seen the CNCF cloud native landscape, with a ton of different things on it; we use a good chunk of that in our platform. With the help of it we've been able to abstract away some of the complexities, automate a lot of things, and increase our development velocity about nine times over the last few years, which is a pretty amazing achievement. And we run pretty much everything we talk about here today at scale at Intuit, in our environment.
I think it's up to 320 clusters now, so pretty much everything runs on Kubernetes in the public cloud. More relevant to the talk today, we have about 16,000 namespaces spread across those clusters, roughly 25,000 applications running in Argo CD, and about 13,000 or so Rollout objects in there.

All of this would not be possible without open source. You've probably heard about Argo, since you're here today; that's one of the open source projects we're heavily invested in, but there are a number of other ones that we maintain, some that we created, and some that we contribute to. Whatever we do, we try to do open source, upstream first. We make sure we don't have a downstream fork, and we give whatever we do back to the community. Thanks to that, we've also been acknowledged twice by the CNCF as an end user award recipient, most recently in 2022, and we're very engaged in the end user community. We don't sell any cloud native services, so what we're doing today is not a sales pitch: this is how we use it, why we're using it, and where we want to go with it.

So why are we here today? We've used progressive delivery and Argo Rollouts for years now. It was unveiled in San Diego many years ago, I guess that was KubeCon 2019 in San Diego, and we've been using it in production pretty much ever since. It's one of the default options in our development platform: if you spin up a new service at Intuit, you're going to be set up for Argo Rollouts, which is awesome and helps us a lot. But as we're getting more and more usage of this platform, as a product manager I want to understand how the platform is being used. How are my users really using these features? Sure, we have some operational data; we can see how many clusters we have, how many rollouts, and so on. But how are my users really using the platform?
That will help me as a product manager see both how we can improve the usage of these features within Intuit to make our developers more productive, and also whether there are any gaps we can work with the community on to improve the Argo ecosystem upstream.

But what we realized was that to do this, we need a lot more insight into what is being used and how. Some of the questions we're trying to answer: what features are actually being used? Yes, we know you use Argo Rollouts, but what are you really using in Argo Rollouts? It could be things like which traffic strategy you use. For example, if you're using a service mesh, that might have implications on our service mesh implementation; we might need to do some integration work with the service mesh team, and maybe there are some cool integrations we can do there. We vend a bunch of templates automatically when your service is initially created, but how many users have really changed that basic template? How many people have tweaked them? What metrics are you using to drive your analysis runs? Have you come up with a bunch of custom metrics? If so, are those metrics something we should incorporate into the platform, or are people using something that maybe they shouldn't be using?
There's also more on the troubleshooting and RCA side: what's causing rollouts to not go well? Are there performance issues? Are analysis runs taking longer than they should? Maybe that's something we need to look at in Prometheus, or maybe something else is going on. So we need deeper insights, a lot more data on how this is being used.

So Henrik asked a lot of the questions that we are asking ourselves, and we tried to build a generic framework that can answer those questions whenever we need that data. Traditionally, when you needed data insights, we would create a requirement for a developer, the developer would go update the code, we'd deploy that code to production, verify the data, and then create a dashboard out of it. That can take anywhere from days to weeks, depending on how long your development-to-production cycle is. What we set out to do was improve this, and what we did, at least for rollout insights, is remove all those middle steps. When you need data insights, you can directly go and create the appropriate query to build a dashboard and visualize it; all the data is available at your fingertips. With this, we don't need a developer dependency for data gathering and insights. We'll show how we accomplished this.

One of the core components is that we moved to centralized logging. Before this, we were logging in a few different places, in different instances, and a few months back we took the initiative to centralize it. As mentioned, we have 320 clusters, and there is a rollout controller running in each of those clusters, watching all the Rollout objects configured within that cluster. All of those controllers send that data to a centralized server; we'll see how we're doing that in the next few slides, but that's the core of centralized logging. It has enabled us to better manage our logging, and it has provided standardization, since everything flows in the same format. It has accelerated our MTTR as well, because now we can get those insights much faster: all the data is available in one place, and we can build correlations across the platform when we do see issues. It enables us to create central monitoring and alerting across the platform. It provides better data security, because we have to manage just one instance instead of many, and we can configure effective RBAC on the data. And cost is improved, because you're sharing the resources across all your clusters.

We used the notification engine to accomplish a lot of what we'll talk about today, so let's talk a little bit about what the notification engine is, how it's configured, and how we're using it. At a high level, the notification engine is a configuration-driven Golang library that gives you the ability to configure notifications for cloud-native applications. We're using it for Rollouts and for CD, but you can use it for any cloud-native application. The notification engine provides a Kubernetes informer watching a particular resource type, in this case Rollout resources, and through the Kubernetes API server it detects when a resource is added, deleted, or updated. When it detects a change, it looks at the configuration in the notification engine, which is basically a bunch of triggers and templates, and based on those it sends out the appropriate notification, whether to Slack, PagerDuty, or a custom webhook, whatever you have configured.

The key features of the notification engine: it provides out-of-the-box integration with a bunch of different providers, you can customize it with whatever templates you need, and you can even create custom triggers for whatever your needs are. It provides all these notifications in real time, for any resource changes you want. The benefits we have seen are improved visibility and transparency into our deployment processes. For example, if a rollout is completed, we can send an automated alert to Slack saying the rollout is completed; or if a rollout is aborted, we can create an incident so somebody can take a look, which results in faster mitigation of incidents when they do happen.

Let's look at the notification config. At a high level, if you want to add a notification to the system, we add a subscription, which contains a recipient, the receiver for the message, and a trigger, which defines when the message should be sent. The recipient in this case is a webhook; it could be Slack, PagerDuty, Teams, whatever you want it to be. Here we configure the URL where the data should be sent, and then we configure a trigger. This is a custom trigger; we provide some out-of-the-box triggers as well. The `when` condition is what I want to highlight: it says the current pod hash is not equal to the stable hash, which means the rollout is in progress, and the abort status flag is true, which means the rollout is aborted. We want to send this notification once per revision, so we configure that too. Whenever this condition is met, the notification engine will send a notification, and the format of the notification is defined by the template. The important thing to highlight is that we send the whole Rollout object in the message, and you will see the power of sending the whole manifest later on: we can ask any question related to that manifest when we're querying the data, so we don't need to create individual metrics for each data point. We get a generic framework for everything.

Now we'll see how we're deploying this at Intuit. We have those controllers, each with a bunch of Rollout objects, and these objects are typically watching GitHub repos managed by individual app developers. We have approximately 4,000 repos and 13,000 Rollout objects under management as of now, and it's growing. We added the notification engine, which watches for changes in these Rollout objects and takes action when those changes happen. We have a centralized webhook receiver, which takes the notifications coming from the notification engine and processes them. Right now we use six basic triggers: when a rollout is added, deleted, aborted, started, completed, or fully promoted.
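To make the subscription, trigger, and template just described concrete, here is a minimal sketch in the ConfigMap format that Argo Rollouts notifications use. This is an illustration reconstructed from the description above, not Intuit's actual configuration: the service name, URL, trigger/template names, and template body are all made up, and the exact status field names should be checked against the Rollouts docs for your version.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
data:
  # Recipient: a custom webhook receiver (could equally be Slack, PagerDuty, Teams, ...).
  # The URL is illustrative.
  service.webhook.rollout-insights: |
    url: https://insights.example.internal/events

  # Custom trigger: the rollout is mid-update (current pod hash differs from the
  # stable hash) AND the abort flag is set; fire at most once per revision.
  trigger.on-rollout-aborted: |
    - send: [rollout-aborted]
      when: rollout.status.currentPodHash != rollout.status.stableRS and rollout.status.abort
      oncePer: rollout.status.currentPodHash

  # Template: ship the entire Rollout object, so any question can be answered
  # later at query time without pre-defining per-field metrics.
  template.rollout-aborted: |
    webhook:
      rollout-insights:
        method: POST
        body: |
          {"event": "rollout-aborted", "rollout": {{toJson .rollout}}}
```

A Rollout would then opt in via a subscription annotation along the lines of `notifications.argoproj.io/subscribe.on-rollout-aborted.rollout-insights: ""` (again, the names here are illustrative).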
We are interested in these events, and when they happen we send the whole manifest to the webhook receiver. The webhook receiver takes that and forwards it to a central logging system; in this case we use Splunk, and Athena for longer retention, and then we send some pre-computed metrics to Prometheus as well, for different use cases that we have. Once it's in the central logging system, you can correlate it with other data. Say you're receiving data from the service mesh: you can correlate that with these data points to create even richer insights and build more dynamic dashboards.

That's what we do at a high level, and we're working on some future enhancements to this system. We're adding message queue support, because sometimes notifications are missed: if the webhook receiver was unavailable for whatever reason, you lose some notifications. Secondly, we're adding support for runtime values. We don't just want to know what strategy you're using; when a rollout gets aborted, we want to see the actual values on the analysis template at the time it was aborted, so we can tweak our defaults and learn from it.

Now let's look at some of the interesting data we gathered and the data points you can create from it. This chart shows how our deployments are going, how many are using canary versus blue-green on a daily basis. As we roll out an AIOps-based strategy, we can watch and track that adoption here. Similarly, we can track live how traffic routers are being used at Intuit from a Rollouts perspective: a lot of rollouts are not using a traffic router, some are using Istio, some are using ALB, and some are using a combination. We can track all of that, and when an incident does come, we can prioritize it accordingly by how many people it's impacting and reach out to the affected service owners based on criticality.

This view shows how many rollouts complete on a daily basis. You can see that on Saturday and Sunday people don't do deployments, as expected, but if we see a drop on weekdays, we can take action. Again, you can see how many are aborted, very few, as expected, and some are fully promoted, and we can try to understand why people are promoting fully and take action to improve rollouts.

This graph provides a view into the pause time people configure when doing rollouts. 600 seconds is the default template we went with, so you see that usage, but a lot of people have moved to zero. They're not using rollouts as effectively, but those are pre-prod environments, so in that sense it's expected: they want to move faster in pre-prod, and that's why the pause time is zero. So we can see how people have deviated from the defaults. Similarly for the number of steps: four is the default we ship with, and we can see how people have changed that over time. Some people are using 14 steps, and we want to know what those 14 steps are and what the use cases are, so we can tweak our defaults for the different use cases we have rather than shipping one static default template. We can see which services are using them and what the steps are, and later we can reach out to those people, either to understand their use case or, if we see it at a broader scale, to make adjustments to our templates.

This shows which analysis templates people are actually using. In this case, memory, error rate, CPU utilization, and request latency are heavily used, and some people are using things like pod restarts. As we roll out the AIOps-based templates, we'll see those pick up here as we go along. Again, abort is a critical piece of rollout functionality, so we want to understand, when a rollout is aborted, what is happening behind the scenes and why. We can see whether it was aborted due to CPU utilization or memory utilization, or because the success rate was falling. We also have a "missing data" category, where we were unable to receive the data from Prometheus; if rollouts are aborted due to that, we get to know, and if we see that metric trending in the wrong direction, we know our defaults are not correct and we need to adjust them. We can go even deeper and see, when a rollout is aborted, which metric it was aborted on, what values were configured, and when it was aborted. We have the timestamps and everything, so we can track it actively.

Thanks, Vijay. So most of what Vijay talked about is already in production.
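All of the dashboards just described are aggregations over the stored events: because each event carries the full manifest, a new question is just a new query. A rough sketch of that idea in Python follows; the event wrapper and sample data are assumptions for illustration (not Intuit's actual schema), though the nested fields mirror the Rollout spec.

```python
from collections import Counter

# Toy events as a webhook receiver might store them. Each event carries the
# full Rollout manifest, so new dashboard questions need no new collection
# code. The "event"/"rollout" wrapper is an assumed shape for illustration.
events = [
    {"event": "rollout-completed",
     "rollout": {"metadata": {"name": "svc-a"},
                 "spec": {"strategy": {"canary": {"steps": [
                     {"setWeight": 25}, {"pause": {"duration": 600}},
                     {"setWeight": 50}, {"pause": {"duration": 600}}]}}}}},
    {"event": "rollout-aborted",
     "rollout": {"metadata": {"name": "svc-b"},
                 "spec": {"strategy": {"blueGreen": {}}}}},
]

def strategy_of(rollout):
    """Classify a Rollout manifest as canary or blueGreen."""
    return "canary" if "canary" in rollout["spec"]["strategy"] else "blueGreen"

def pause_seconds(rollout):
    """Total configured pause duration across canary steps, in seconds."""
    steps = rollout["spec"]["strategy"].get("canary", {}).get("steps", [])
    return sum(s.get("pause", {}).get("duration", 0) for s in steps)

# "Canary vs. blue-green" and "pause time" dashboards as one-line queries:
strategies = Counter(strategy_of(e["rollout"]) for e in events)
print(strategies["canary"], strategies["blueGreen"])   # 1 1
print(pause_seconds(events[0]["rollout"]))             # 1200
```

In practice these aggregations run as Splunk or Athena queries rather than Python, but the principle is the same: the query, not the collector, defines the metric.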
We're seeing some of the benefits of this already; some of it is fairly new, so we still have hopes and ideas about where we'll see more benefits as we learn. One of the key benefits is that it improves our best practices. Having all this insight lets us understand our users better, and understanding users better means we can help them more: we can figure out the best ways to deploy and use Argo Rollouts, improve our templates, and make sure their configuration suits their needs. So we'll have fewer misconfigurations and fewer issues from users using our templates and platform.

For me as a product manager, this is also awesome because it helps me understand how users are really using the product. I'm using the word "product" deliberately this morning; I know it's an open-source project, but I'm calling it a product. Yes, I talk to a lot of internal users, and I talk to people in the community, but having this data gives me another level of understanding that I can't really get from one-to-one conversations or surveys.

It also helps the platform team, like Vijay and his team, when they run this for all our users. They get a lot of data that helps them do root cause analysis faster and more easily. Instead of reaching out to the user and going back and forth, they have all the data at their fingertips. In addition, we can use this to take proactive action: we get these notifications.
We can see that someone configured something that's maybe not according to best practices and reach out to them proactively, and hopefully in the future even take automatic, proactive actions to help them out, steering them in the right direction and making sure that what they have is something that's going to work for their service.

Last but not least, reducing end-user involvement, and I mean that in a good way. We do want end-user involvement, but end users don't necessarily want to spend time with the platform team debugging issues. I'm sure all of you who have had platform issues and talked to a platform team have gotten frustrated with how long it takes and all the questions they ask. By using this, we can do a lot more without going back and forth with the user: if we have all the data, we can just get on it, do it faster, and do it more efficiently.

Some of the things we're looking at to enhance this and make it even better: getting even further into the data. Vijay already talked about getting some of the runtime data, making sure we can do more correlation between the manifest and what's actually in the running system; that will help us get an even better understanding of why things happened, when they happened, and how they happened. Of course, like everyone else, we're looking at AI and seeing how we can use it here to help with log analysis, correlations, and automating some of the predictions and root cause analysis that we're spending way too much time on today. There's the message queue support Vijay talked about. We don't have any automation or alerting based on this yet, but you can see that maybe that's something we can do in the future; and even with data gathering alone, we want to make sure we have consistent and correct data.

Looking at the success we've seen with this for Argo Rollouts, and since the notifications engine is there for Argo CD as well, why not see how we can use this in Argo CD? We've learned a lot about our rollouts over the last few months as we've been doing this, so we want to see how we can take this to Argo CD as well. There are ways you can use other things, like Google Analytics, to get some of this user data in CD, but you're not going to get the level of depth or the level of data that you get by using this natively with the notifications engine. Using this for Argo CD to look at how people are using rollbacks and which features of the product they're using is going to give us the same level of depth and understanding of Argo CD that we've been getting for Argo Rollouts.

There's only so much you can cover in 25 minutes; I hope this was at least a good teaser into what we've done. Vijay and I are going to be here all week, so feel free to reach out to us at the booth, at the bar, wherever you see us, and we're happy to spend hours talking about this. Thanks, everyone, for coming.