Welcome everybody. This is Bridging Argo Notifications. I am Andre Marcelo Tanner, here with my colleague Alistair Israel, who unfortunately couldn't be here today, but we'll have him with us via the power of video recording.

Quickly about us: we are staff engineers based in Toronto, Canada, also previously from the Philippines, so we're representing. We're available in these Slack groups, so in case you are there, look us up. We work for a company called Ada, an AI customer agent company. The reason I'm sharing this is that we have about 75 developers that we support. As you probably know, AI is heating up. We've been doing AI for seven-plus years, but now everybody else is doing it, and we have to keep moving faster and delivering quality to our customers.

The story goes like this. We have this monolith. I'm sure you've seen this at your company or at other places. We like to think it's majestic, but, you know, it's a monolith. To deploy, people had to line up; they had to get in the queue to deploy their code. That is how it was before. This is the monolith that's been in our company since the beginning, and, you know, it's Python, but it provides the most value. Even though we'd like to do microservices or other new things, this is the thing that provides the most value. And when developers deploy, it's scary. They don't want to break other parts of the code, and they're watching all their monitors and dashboards while they're deploying.

So we wanted to figure out a solution to deploy and get things out into production faster. Among other solutions, what we wanted was to change the paradigm of how you deploy at our company from a synchronous to an asynchronous pattern. It wasn't simple, but this is the story of how we did that through Argo notifications. I'm going to start by sharing a short video that my colleague
recorded about what we did, and then afterwards I'll go into more depth on Argo notifications. All right, so let's go to the video. By the way, quick raise of hands: how many people here use Argo notifications? How many people here use Argo CD or Argo Rollouts? Okay, see, everybody uses Argo CD and Argo Rollouts, so you are using Argo notifications, or you can use it. Just a little hint. But yeah, let's get started.

Hello ArgoCon, Alistair Israel here, staff engineer at Ada. First of all, let me thank you for having me, even if just virtually, and for the opportunity, along with my colleague Andre Marcelo Tanner, to share with you some learnings from our journey with Argo notifications.

So first off, a bit of history. At Ada it used to take up to 30 minutes for just our main monolith's CI pipeline. Deployments were manually triggered by clicking a button and could take another 30 or more minutes to fully deploy on our largest cluster. So when we set out to improve our entire software delivery process, we decided early on to adopt the Argo ecosystem tools, because we all know how great they are, right? But initially we wanted to keep the same process that everyone was already used to, so we ended up adopting a manual queuing system to coordinate releases. The net result, practically, was that engineers could end up spending hours just waiting in the queue.

We all agreed there had to be a better way. Wouldn't it be nicer if the developer could just mark their PRs as ready for merge, walk away, and get informed later on when it was successfully deployed, or if it failed for whatever reason? So we needed something better. Since we adopted the GitOps repository pattern, this meant we were pushing manifests to the GitOps repo, which Argo CD was configured to watch, sync, and deploy from. But we needed to set the status of the GitHub deployments in the upstream project repo, not in the GitOps repo. We also wanted to send Slack notifications both on public channels and as direct messages, and
this was across multiple Kubernetes clusters, each with their own Argo CD and Rollouts installations. And we did not want to deal with the complexity of opening up our Kubernetes or Argo APIs over the internet, across AWS accounts, regions, and VPCs.

Now, why couldn't we just use Argo notifications out of the box? We quickly realized that Argo notifications, while useful, was inadequate for our needs. Argo notifications can perform simple HTTP requests for payload delivery, but it couldn't be used to perform requests to query or look up information, such as determining the upstream GitHub repo that pushed to the GitOps repo, or fetching the PR information associated with a commit, or even just looking up a Slack ID given a person's email address. It also couldn't, or at least it wasn't obvious how to, perform a sequence of events. For example, sending a Slack direct message to a user requires at least two API calls: first to open the channel, and then to send the actual message. Complex logic and loops were impossible, or incredibly hard, or quickly became cumbersome and tedious. We were basically programming in templates.

So in the end we wrote a web service middleware in Ruby, which we called Argo bridge, because it bridged Argo notifications with all the other systems that we needed to interact with. It could have been written in Python or Go or Elixir, but the initial proof of concept was an AWS Lambda, so we went with a more lightweight runtime. But anything, once you put it into production, becomes operational, and you need to support it and maintain it for the next couple of years. And we've been adding more and more functionality to it in response to our user requests.

All right, so that is part one, but I wanted to give a quick recap while we're going through that. We basically wanted to do really complex, or what we thought were complex, notifications, and we didn't want to do those in just YAML, or we didn't see a feature set in notifications that could do what we needed. And so
we basically built our own webhook service that takes in events and then does things in code. We call that Argo bridge. We're not releasing it, technically, but we built this in-house. It was sort of a hack days project kind of thing, but it was so useful that we ended up using it. Instead of putting all the logic in notifications, because we were also taking notifications from Argo CD and Argo Rollouts and then using that to send more useful information to our developers, we put that as code in a service that's, you know, actual code. But what was the effect? What happened after that? Did it work? So, one more short video from Alistair, and then we'll go to Argo notifications.

So how are we now at Ada, and how have the developer experience and overall software delivery improved? The biggest win, we consider, is the paradigm shift from sync to async. We said goodbye to the merge queue and to highly synchronous, manually orchestrated deployments, and embraced fully asynchronous, continuous, and progressive delivery with Argo CD and Argo Rollouts. We're now sending enriched, context-sensitive, and threaded Slack notifications. We're also able to send direct messages to users with actionable alerts and information, and this has quickly become popular, company-wide and even on a personal level. We've increased the total number of deployments per day while at the same time reducing our pipeline times from an average of around one hour to, these days, under 10 minutes.

Now, before we ended up where we are now, we did try a couple of other approaches. Notably, if you ask the Argo folks, you're probably directed to Argo Workflows and Events for anything more complex than payload delivery. We did an early proof of concept using these, and while we think everything we ended up doing was entirely possible in Argo Workflows and Events, we quickly came to the realization that we'd still end up programming in YAML, and everyone knows how that is. So we elected to use a
general-purpose programming language instead, with more mature testing libraries. We also considered front-loading all the information we could and adding it as annotations to the Kubernetes objects. This is also viable, but it also meant performing all the computation and information lookup in GitHub Actions, so more YAML up front, and we would still have to deal with the complexities of loops and nested conditionals around this.

Now, we've learned quite a bit, having had this in production for over a year. We realized that yes, we can still use simple Argo notifications to post events to Datadog. This not only provided a mechanism to better observe the behavior and sequence of events of our Argo Rollouts and services, it also acted as a fallback when troubleshooting our notifications bridge. So if it's entirely possible using Argo notifications, please use that. If not, and if you're capable, build it and submit a PR to the Argo project.

We realized that copying and pasting inevitably introduced errors across triggers, subscriptions, and templates in many different places, and all of this makes everything harder to troubleshoot. So by relying on Argo notifications for simple event delivery, we could then concentrate our maintenance and debugging on a single service.

Initially, we used this to send Slack messages for every event on every cluster, sometimes multiple times. This just created noise, and we learned to start threading our event notifications based on user feedback. The way my colleague Andre did that was quite clever: instead of maintaining state in our service, we used Slack as the source of state. We were able to keep Argo bridge itself stateless and simpler to maintain, and this was met with a lot of approval all around. So now our engineers are saying that just the rollout succeeded message is enough, and we're doing that too. To aggregate that for all deployments, however, would require maintaining state, so that's probably where we're going.

Once again, this has been a pleasure.
Alistair Israel here, thanking all of you at ArgoCon and wishing you a good day.

All right, thank you, Alistair, for that. What I want to emphasize here is that with this webhook service and Argo notifications, we actually managed to shift the paradigm for our developers from synchronous to asynchronous. You no longer watch your deployment for an hour or 30 minutes, waiting for it to succeed or fail. You commit your code, you go and do something else, and then you get the notifications when it's actually deploying. Because Kubernetes, Argo CD, Argo in general, it's all asynchronous, right? And trying to put synchronous workflows into that is really hard. We tried. We tried waiting synchronously in our GitHub pipelines and trying to watch for deployments. It works, but it fails a lot. So really shifting the mindset to just waiting for the notifications, which will tell you if you need to do something about it, really worked.

Then it became only about making sure we deliver the right notifications at the right time. We ended up delivering too many notifications, and then we got complaints like, you know, make them more clear. So, a different problem, but it's a better problem than "please make the synchronous pipeline tell me if it's working or not," right? And there were other changes too, to reduce pipeline times. But if you have a big monolith, think about it: the more deployments you have, and the more contributions to a central code base with a single deployment pipeline, the more you're going to end up having to batch code. Deployments will take long. If you do canary deploys, you can't really make many small deploys; their deploys will eventually go out. So notifications are the best way to reduce the cognitive load and let folks move forward with whatever other tasks they want to work on. And yeah, it's been really successful.

And this is an example where we pushed it into Datadog, because that's our observability platform. It
made it also easier for us to then do other things with this data, rather than searching through notification logs to see when different notifications went out.

So the other portion of the talk is: what is notifications, right? The Argo notifications engine is a subproject under argoproj. It is actually a Go library that's being used by both Argo CD and Argo Rollouts. It supports a ton of services, and probably more, and you can see webhooks over there; that's what we used for our service. But if you use any of these services, you can just use it if you're using Argo CD or Argo Rollouts. And the documentation goes into more detail on exactly which features are supported.

A little bit of history: it started off as Argo CD notifications, then it was converted into a Go library, and then it was merged officially into Argo CD, and then into Argo Rollouts, from that Go library. So both Argo CD and Argo Rollouts use that Argo notifications engine library. Actually, any CNCF project can use the notifications engine to have all of that supported. As far as I know, only those two projects use it. Maybe we'll get Kargo using it, I heard. But yeah, it's really a powerful notifications engine.

In Argo CD you have all these variables, app, context, secret, service type, that you can use in your notifications, and you can configure them per application or per project. So if you want all applications under a certain project to get some type of notification, you can do that. If you want a namespace to have notifications, and you're using the Argo CD app-per-namespace feature, you can do that too.

The thing about the notifications engine is that it supports different features in different projects. The notifications engine isn't versioned, so if you want a certain feature in Argo CD, Argo CD has to pull in that version of the notifications engine. Same for Argo Rollouts. So in the current 2.10.2 there's a hash there; that's the version of notifications. But usually you just
look at the documentation, because the documentation will tell you what features are supported. We're also working to improve the documentation.

Argo Rollouts works the same way. It doesn't support as many objects, because both projects include the Go library and then use it in slightly different ways. Argo Rollouts sends a lot of events directly from the code, so it sends things like on rollout failed, on analysis, et cetera. It doesn't use triggers yet for most of those, but it can; you can use triggers. It sends the rollout and context objects, and soon, I heard from Zach, we're getting the .analysis object, so you can send notifications with the result of your Argo Rollouts analysis. It also has per-namespace configs, if you're using the self-service notification feature in Argo Rollouts. And then that's the version.

So, a quick overview if you haven't used it before. This is what it looks like. You configure your services in a config map in Argo CD and Argo Rollouts, with any private information in a secret. You make these templates, and this is basically what your users see in the various services. Here we have Slack, and if you're sending email notifications, you can just configure that in a template. This is all YAML, and the templates use Go templating, so you can do code in there. I don't recommend it, and this is why we made our own service, but you can do that, as far as Go templating allows.

And then you have triggers. Triggers are like: when this condition happens, send these notifications. This uses the expr expression language; check out the URL for more on how to use that. So there's a trigger language there for how to do conditions. And then, when you want your application or rollout to receive a notification, you set a subscription. You can set global subscriptions in the config map, or in your Argo CD application or rollout YAML you can configure per-application subscriptions, like
for this application, send to this Slack channel, send to this email, for example.

There are some advanced features. If you are already using it, you know there are functions available that you can use. Check the documentation per project for what they support, so you don't have to do this complex logic yourself. For some of these, maybe we didn't need to write our own code; we could have used one of these pre-built things and not have to maintain it. Prometheus metrics: we did not know about this when we built our own service, but they have Prometheus metrics, so that would also make it easier to observe what's going on in the notifications.

When you make a notification, how do you test it? You could actually go into Slack or check your email, but you can also trigger it via the CLI. Both projects have a CLI for testing the notifications. And, I think for Argo CD at least, you can trigger on a selector, so you don't have to put it in all your app YAML; you can put in a selector based on labels and trigger notifications based on that.

So what's coming next for the project? Well, personally, we're working on a Datadog PR for notifications, and we're trying to get some of the features that we built in our Argo bridge back into notifications, so there's less for us to maintain. But also, the things we built aren't unique to what our company does, so we would rather support those in an open source project than keep them in our own Ruby program. These are some of the things we're working on. And then also, at our company, we're trying to use more notifications engine features rather than the custom code we built. In newer versions of notifications in Rollouts and CD there are a lot of new GitHub features, so thank you to all the contributors who are contributing to that.

But yeah, if you want to see what Argo bridge looks like, maybe not today, but soon there will be some sample code up there on
that URL. And rate, subscribe, click. Questions? There's a mic here in front. If you have questions, come up. I think we have a few minutes for questions. Thank you.

Yeah, go ahead.

No, it's more like an example of a webhook service. The code we wrote in-house isn't exactly pretty, so we're extracting it out and just making an unsupported example, like a sample implementation.

Yeah. Yes, basically it's a service that takes an event and then does something with that event.

Yeah. Yeah, we ran into trouble there. Like, we know our users' emails from Git, but what's their Slack ID, and how do you get that? So we had to do more stuff to do that. Thank you.

Thank you. So yes, they can go for lunch, but I think it's more of a human problem, of your code and whether it breaks things. As part of the process of moving to an asynchronous paradigm in our deployment process, we put in tools to quickly roll back. Since we use Argo CD, it's just a git revert and you can go back to the old version, so anybody can trigger that. Basically, we have an emergency rollback feature that allows anybody, if someone broke the code, to go back to the previous version with a single click. That alleviates some of the worry, the "what if this breaks and I'm not watching it" kind of fears.

Yeah, at our company any developer that has access to that code base can do the rollback, not just the developer who made the commit. But we also do have a good ownership culture, where the team or the developer who made a commit is responsible for their code, and when there's an incident, they will do the whole process.

Yes, I'd say so. Having that culture, and also, I'd say, before this there was a lot of hesitation around moving to this asynchronous model, but we did have the support of our technical leadership to, like, let's just go for it and try it. If it doesn't work, we
can always move back to the synchronous way. But we haven't moved back, so we're all good. Thank you. I think we're out of time. Thank you very much.
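
The lookup-then-message sequence described in the talk (find a Slack ID from an email, open the direct-message channel, then post) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Argo bridge code, which was written in Ruby and is not shown in the talk. The function name `notify_author`, the event fields (`app`, `status`, `author_email`), and the injected `call` helper are all hypothetical. The three Slack Web API method names (`users.lookupByEmail`, `conversations.open`, `chat.postMessage`) are real, and the response shapes are reduced to the fields used here.

```python
from typing import Any, Callable, Dict

# Hypothetical helper type: call(method, params) performs one Slack Web API
# request and returns the decoded JSON response. Injecting it keeps the
# sketch free of network code and easy to test with a fake.
SlackCall = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def notify_author(event: Dict[str, Any], call: SlackCall) -> Dict[str, Any]:
    """Send a direct message about a deployment event to the commit author.

    `event` is an assumed payload shape that an Argo notifications webhook
    template might post: {"app": ..., "status": ..., "author_email": ...}.
    """
    # 1. Look up the Slack user ID from the author's email address.
    user = call("users.lookupByEmail", {"email": event["author_email"]})
    user_id = user["user"]["id"]

    # 2. Open (or reuse) the direct-message channel with that user.
    dm = call("conversations.open", {"users": user_id})
    channel_id = dm["channel"]["id"]

    # 3. Post the actual message into the DM channel.
    text = f"{event['app']}: {event['status']}"
    return call("chat.postMessage", {"channel": channel_id, "text": text})
```

In a real service, `call` would wrap an authenticated HTTP client. The point of the sketch is only that the three dependent calls are a few lines in a general-purpose language, where each step can use the result of the previous one.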