So, the talk we're going to deal with today is on Prometheus Data Analysis and Event Notifications for Progressive Delivery. I'm Ravi Hari, Principal Software Engineer at Intuit. Let me give a brief overview of who we are. We are Intuit. The company was founded in 1983 and went IPO in 1993, and we operate out of 19 locations; I work out of Bangalore. We had 9.6 billion dollars in revenue last year and have about 100 million customers. We contribute to a lot of open-source projects.

Let me give a brief intro to Argo. Argo is a set of Kubernetes-native tools that manage running jobs and applications on Kubernetes. It basically makes running these jobs and applications seamless, and it goes with the tagline "get stuff done with Kubernetes." Argo adopts the GitOps paradigm for continuous delivery and progressive delivery, and it also enables running MLOps on Kubernetes. There are multiple projects in Argo, like Argo CD, Argo Rollouts, Argo Workflows, and Argo Events. Today, we are primarily going to talk about Argo Rollouts for progressive delivery.

So, what is Argo Rollouts? It's an open-source Kubernetes deployment controller. It's a drop-in replacement for the Deployment that we get by default out of Kubernetes, and it supports multiple strategies like Blue-Green and Canary. Let's delve a little bit into why we needed it. The out-of-the-box Deployment controller in Kubernetes provides two types of strategies: one is Recreate, the other is RollingUpdate. With neither of them can we do Blue-Green, Canary, or progressive delivery. That is one of the reasons Argo Rollouts stepped in and provided all these different capabilities.

So, let's look at what Blue-Green, Canary, and progressive delivery are in a little more detail. Let's start with Blue-Green. Say we have version one, and we have two services, an active one and a preview one.
You can also call them stable and desired, and both point to the first version initially. As we bring in the newer version, the preview service points to revision two. If we test it out and everything looks good, we can promote revision two to be the stable version, and then the traffic from the stable service also goes to the latest version, revision two. Once we are comfortable with revision two, we can delete the old set of pods. This is how Blue-Green generally works. How do we specify this in Argo Rollouts? We specify it with a Rollout object with a strategy of blueGreen, and then we specify the active service, the preview service, and so on.

Let's look at Canary. In Canary, say we have revision one, and both the stable and desired services point to revision one initially. As we start deploying revision two as a canary, the percentage of traffic that goes to revision two is based on the number of pods created in revision two. Say in a set of four pods, if a new pod gets created in revision two, 25% of the traffic goes to it. The difference between this and a rolling update is that a Rollout gives you a lot of features to control revision two at this particular point, do your analysis and testing, and promote it only when you feel comfortable. A Deployment, by contrast, continuously keeps updating: as the newer pods become stable, it terminates the older ones and brings up newer ones continuously. Argo Rollouts with Canary gives you the ability to pause here and do your analysis. Once you are comfortable, you can promote a higher percentage of traffic, and after a certain point, delete the older version and completely switch over to the newer one. So, how do we specify a Canary deployment?
We specify it with a canary strategy and give it weights, and the weights can be distributed across different steps. Each step can have a different weight percentage, and we can specify a pause duration, however long we want, at every weight percentage we are interested in.

Now, an extension to this Canary is ALB Canary. We have seen that the percentage of traffic earlier depended only on the number of replica pods coming up in the canary. However, if we want to do traffic routing while doing a canary, we can configure ALB Canary and split the percentage of traffic that should be sent to the newer version, revision two, by configuring traffic routing in Argo Rollouts. Initially we can send, say, 5% of the traffic, and later, once we are comfortable, we can send all of the traffic and shift the stable revision to revision two as well, after everything is stable. How do we specify this? With the traffic routing configuration in the Rollout, as we have seen, and analysis templates can also be used here to analyze how your traffic is behaving with the newer version of the pods. In the ALB Ingress, there is an annotation: as the percentage of traffic moves to the newer revision based on the set weights and steps we give, the weight in the ALB Ingress for the desired service gets updated, and accordingly that percentage of traffic gets routed to revision two. This is how ALB Ingress with Canary works. We can configure this with various other ingresses and meshes like Istio or NGINX; Argo Rollouts provides a plugin capability for this.

Now we have seen the different types of Blue-Green and Canary, but what is progressive delivery?
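Putting those pieces together, a Canary Rollout with weighted steps, pauses, and ALB traffic routing might look roughly like the following sketch; the service, ingress, and image names here are made up for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-rollout
spec:
  replicas: 4
  strategy:
    canary:
      canaryService: demo-canary     # the desired / canary service
      stableService: demo-stable     # the stable service
      trafficRouting:
        alb:
          ingress: demo-ingress      # ALB Ingress whose weight annotation gets updated
          servicePort: 80
      steps:
      - setWeight: 5                 # send 5% of traffic to revision 2
      - pause: {duration: 10m}       # pause and observe before increasing
      - setWeight: 40
      - pause: {duration: 10m}
      - setWeight: 80
      - pause: {}                    # pause indefinitely until manually promoted
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: example/demo:v2
```

Without the `trafficRouting` section, the weights are approximated purely by the ratio of canary to stable pods, as described above.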
So, the ability to introduce observability into the delivery process and to control the blast radius of our newer deployments is primarily what constitutes progressive delivery. With progressive delivery, we can pick the metrics we are interested in monitoring as we progress with revision two, the newer version of the application, and define which metric conditions count as success and which as failure. We can also enable auto-promotion to the newer version in the case of Blue-Green deployments, or, in the case of Canary with continuous analysis, if the newer version fails at any step, we can automatically abort and roll back to the previous version. All of these things constitute progressive delivery, and Argo Rollouts provides these automated features out of the box.

So, let's look at what kind of data analysis we can do with Prometheus to qualify a progressive delivery for an application, and let's understand how this works internally in Argo Rollouts. Argo Rollouts itself is a controller that manages these deployment objects as well as the ReplicaSets, Services, Ingresses, and so on. On top of that, Argo Rollouts has an embedded analysis controller. We can define an AnalysisTemplate, and as the rollout goes through the list of steps defined in the analysis portion of the canary or blue-green strategy, the controller takes that AnalysisTemplate and converts it into an AnalysisRun. At the time of the AnalysisRun, it goes and fetches data from the desired metric provider; primarily we use Prometheus in Kubernetes.
So, it fetches the data from Prometheus and checks the conditions specified for success and failure: for example, the total number of requests that succeeded or failed, or that the CPU usage or memory usage of a container should be less than, say, 80%. All the different conditions we want to require for a successful promotion can be evaluated, and the rollout then proceeds to the next set of steps as these analysis runs execute continuously in the background or inline.

So, how do we define an AnalysisTemplate? An AnalysisTemplate can be defined quite simply: we write a Prometheus query and declare the Prometheus URL we want to query in the cluster, for example as an environment variable. Primarily, we just have to write our query and the condition under which we consider it successful. If we want to be a little more verbose and add additional details, we can: we can specify the Prometheus endpoint URL and port, how many times we want to run the analysis, and at what interval each analysis run should happen. All of these can be specified, and the template can be injected into canary or blue-green rollouts.

In Blue-Green, the place we can use these analysis runs is primarily pre- and post-promotion. Once we deploy revision two, before we promote our traffic completely to the newer version, we can run all the analysis checks and ensure the newer pods all came up successfully and there are no issues. After we switch the traffic, we can again run the post-promotion analysis on the pods and ensure nothing breaks and everything is successful.
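A Prometheus-backed AnalysisTemplate of the kind just described might look like this sketch; the Prometheus address, metric name, query, and thresholds are illustrative, not prescriptive:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name                 # passed in from the Rollout at run time
  metrics:
  - name: success-rate
    interval: 1m                       # how often to take a measurement
    count: 5                           # how many measurements to take
    successCondition: result[0] >= 0.95
    failureLimit: 1                    # abort the rollout after this many failures
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

The minimal form is just the `provider.prometheus` block with a `query` and a `successCondition`; `interval`, `count`, and `failureLimit` are the more verbose knobs mentioned above.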
So, we can add these pre-promotion and post-promotion analyses with AnalysisTemplates, query the data from Prometheus, and ensure our applications are stable after promoting to the newer version. That is Blue-Green. How about Canary? There are multiple ways. One is background analysis: when we specify the analysis on top of the steps in the canary strategy, it keeps running continuously across all the steps we have defined, for example setWeight 40%, setWeight 60%, and so on. At every step, it continuously runs this analysis and ensures the newer version of the pod, of the application, is stable at that percentage of the canary deployment. If it fails at any point, it automatically rolls back to the previous version, whose ReplicaSet hash Argo Rollouts already keeps track of.

The other approach is inline analysis. Inline analysis acts as a blocking step: at each step where we define it, the rollout stops there, does the analysis at that time, gets the data back, and qualifies whether the analysis run is successful; only then does it proceed further. This is primarily useful for blocking checks, including heavier-weight ones, like running benchmarks at 40% or 80% of the traffic and ensuring everything works fine with your newer version.

So, we have seen analysis. When we want to bring observability to application developers, it's not enough just to run the analysis; we also need feedback. Let's say developers update their GitHub repo and want to fire and forget; it's also good to notify them whether everything went fine or something had a problem.
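The three hook points just described could be wired up roughly as follows. These are trimmed to the strategy portions of two hypothetical Rollouts, both referencing made-up template names; `spec.selector` and `spec.template` are omitted for brevity:

```yaml
# Blue-Green: run analysis before and after the traffic switch
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-bluegreen
spec:
  strategy:
    blueGreen:
      activeService: demo-active
      previewService: demo-preview
      prePromotionAnalysis:           # must pass before traffic is switched
        templates:
        - templateName: success-rate
      postPromotionAnalysis:          # verified again after the switch
        templates:
        - templateName: success-rate
---
# Canary: a background analysis runs across all steps, while an
# inline `analysis` step blocks the rollout until that run passes
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-canary
spec:
  strategy:
    canary:
      analysis:                       # background analysis
        templates:
        - templateName: success-rate
      steps:
      - setWeight: 40
      - pause: {duration: 10m}
      - analysis:                     # inline, blocking analysis step
          templates:
          - templateName: benchmark
      - setWeight: 60
```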
So, events are really helpful in this case. How do events work with Argo Rollouts? Extending the previous architecture, Argo Rollouts also embeds a notification engine; it's a controller in the Argo project now. Basically, when we run an analysis, we generate an event for whether that analysis succeeded or failed. If we complete the rollout, that's an event we generate; if we update a step in the rollout, that's another event. All these events, written to the event recorder, are pushed to the notification engine, and the notification engine can then send them, based on our subscription, to any of these channels. It could be email, Slack, Telegram, PagerDuty, or Opsgenie; all of these are integrated, and many more are available. We can choose which notification service we want to be alerted on, based on the type of trigger we are interested in.

So, let's look at what it takes to configure events in Argo Rollouts. This is an example for Slack. First, we need to set the Slack bot token in a secret for Argo Rollouts, and then we set up the config map. In the config map, we consume the token and write a trigger for one of the conditions we want to be alerted on, say rollout completed, and then we want to send a message to Slack with all the details, so we write the message in the Slack format. If it's email, we write the message in an email format, and so on. We configure our config map with this data, and then, in the Rollout, we add annotations saying we want to be notified on a specific channel; that is, we subscribe to specific triggers.
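Put concretely, the Slack wiring just described might look like this sketch. The trigger and template names, the channel name, and the message text are illustrative; the secret and config map names follow the convention the Argo Rollouts notification engine expects:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: argo-rollouts-notification-secret
stringData:
  slack-token: <slack-bot-token>      # the Slack bot token
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
data:
  service.slack: |
    token: $slack-token               # consumed from the secret above
  template.rollout-completed: |
    message: Rollout {{.rollout.metadata.name}} completed successfully.
  trigger.on-rollout-completed: |
    - send: [rollout-completed]
      when: rollout.status.phase == 'Healthy'
---
# Subscribing a Rollout to that trigger on a specific Slack channel:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-rollout
  annotations:
    notifications.argoproj.io/subscribe.on-rollout-completed.slack: rollouts-demo
```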
Until recently, we were doing it like this, and we realized that if we have to subscribe to multiple channels for the same set of triggers, the cardinality is high and our Rollout object becomes very clumsy; it is not neat. So, we have come up with a new feature in notifications: a new annotation we call subscriptions, where, for a common set of triggers, we can define multiple Slack channels or multiple services in one place. For example, in the first case here, we are notified of rollout updated and rollout completed on a Slack channel; but for failures, we want not only the Slack channel but also email, and in the same annotation we can simply add the new service there. With that, we get notified of failure events and so on.

Let me do a quick demo. I am port-forwarding the metrics here, and my Argo Rollouts is running here. Let me show you the Argo Rollouts object. This is similar to what we have seen earlier: for rollout updated, completed, or rollout paused, I want to be notified in this channel; for analysis successful, rollout aborted, and other failure scenarios in analysis, I want to be notified in the data-analysis channel, and I want to be emailed at this address. In terms of the rollout specification, this is a canary strategy, with setWeight percentages and pause durations defined; here is a success case for the analysis, and here a failure case. Let me run this quickly. Let me show you the Slack channel here. So, this object has been created. I am not getting updates; let me show you my recorded video. I took a backup in case, for whatever reason, the internet connection is not great. Let me show the demo in my recorded video. Yeah.
So, this is the object I just walked through; let me go ahead and deploy it. Now that we have deployed the Argo Rollout with the blue version, you will see that an event got generated here, and now I am updating from version one to version two, that is, from the blue version to the green version. When I deploy, it will also execute the analysis, because this is promoting a newer version of the pods. You see that we are getting notifications for the green version of the pods in the rollouts-demo channel, where we have subscribed to rollout updated and rollout completed. And for analysis, as we have just seen, if the analysis is successful, we get a notification in the data-analysis channel that the analysis run was successful. I will also show you what happens if there is a failure. On paused events, again, we get a notification in another channel. As we go through the rest of the steps, there is a failure step we have defined. At that step, it fails, and we see a notification, primarily on the data-analysis channel. There you go: here you can see the analysis failed, we aborted the rollout, and we switched back to the previous version. Essentially, we are notified with different color patterns; all of these colors and schemes you can configure in your config map. We have set up two notification channels: one is Slack, and we should also get email. Let me quickly show you that. Yeah, I am getting notified in my email. This one was for the previous run, and the latest one is here, where initially the first analysis was successful and then we ran into an analysis failure, so we got an event here and then the rollout got aborted.
So, that is how we can configure these notifications, get alerted immediately, and look into what has happened. These channels need not be limited to Slack and email; it can be PagerDuty, Opsgenie, and others. So if certain folks on the team are on call on a particular day, they get notified and can fix the problem immediately. That is pretty much what I have today. Any questions? We have time for one question, I think. My browser froze, so I can't check the time. Any questions? We have five minutes. Perfect. Thank you. Any questions? We don't bite. No questions? There we go.

[Audience] Thanks for the presentation. What I missed from the receivers for the alerts, especially since we are at Prometheus Day, is Prometheus Alertmanager. Do you have support for that? Because, in the end, I am thinking the purpose is to have one tool for controlling all the alert rules and so on, since I couldn't write a configuration for every event channel, etc.

[Ravi] We have support for Alertmanager as well. Since it's Prometheus Day, I thought most people would cover Alertmanager, so I was just showing the other channels.