Hi, everyone. I'm Larissa. I'm a computer scientist in Adobe Experience Platform. I've mostly been developing backend services, but I've always had an interest in automation and in making it easier for developers to fail fast through CI/CD.

And I'm Ionuț. I'm a senior computer scientist, also with Adobe Experience Platform. I've mostly been helping teams build reliable services, and a big part of that has been building robust CD pipelines.

Our talk today is about advanced deployment pipelines with Argo, specifically with Argo CD, Argo Rollouts, Argo Workflows, and Argo Events. I noticed that all the presenters were advertising their talks with which Argo components they're using, so: we are using all of them. Let's get started.

At Adobe, the journey towards continuous deployment started with us deploying to Mesos clusters with some homegrown deployment tools. Then, with the adoption of Kubernetes, teams could get direct access to clusters and leverage the full power of them, and they started bringing in and maintaining their own tools, such as Spinnaker and Ansible. We've been Spinnaker users for a while, building continuous deployment pipelines over and over for lots of projects, to deploy to production in a safe, quick, and recoverable manner. But today, to reduce this technology sprawl and to concentrate the effort on a single CI/CD platform, Adobe is making the switch to Kubernetes-native patterns and tools such as GitOps and Argo. So we set out to move our robust, thoroughly tested Spinnaker pipelines to Argo.

Why did we do that, and why didn't we reduce the technology sprawl to Spinnaker only? Well, a Kubernetes-native GitOps tool has many advantages over Spinnaker, and the move to Argo was backed by a few concrete issues. First, deployment drift: with Spinnaker, your deployment can drift away from the Git state. Your source of truth for what's running in production is somewhere between your last Spinnaker logs and whatever your users might have done in the meantime, as opposed to a GitOps tool, which reapplies the Git state whenever drift is detected. There's also a learning curve with Spinnaker: there's application code, there's Kubernetes code, and there's also a Spinnaker-specific language, while developing on top of Argo is just writing Kubernetes manifests. And we also ran into bugs with Spinnaker, because Spinnaker steps are hard to test, whereas Argo Workflows boils down to code that runs in containers and can easily be isolated and tested.

But the move to Argo wasn't without its challenges. The main ones were: no promotion out of the box, no rollback out of the box, and the question of deployment configuration changes. Nobody is really talking about those: should they be handled by the GitOps controller only, or should they go through some kind of pipeline?

Let's talk about the first one, promotion. First, it's important to say that we are doing trunk-based development: we have a single, long-lived branch called main. With GitOps, that implies we are handling environments as directories, and deployments are reduced to Argo CD tracking only the main branch. Here you can see the file hierarchy we use: each path maps to a Kubernetes namespace and holds the configuration files for a particular environment and region. Also, irrespective of the deployment tool we use, we always promote our changes through environments.
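To make the environments-as-directories layout concrete, here is a minimal sketch of an Argo CD Application pinned to one such directory on main. The repo URL, paths, and names are hypothetical placeholders, not our actual setup.

```yaml
# Hypothetical hierarchy in the deployment repo, one directory per
# environment and region, all on the main branch:
#   envs/dev/us-east-1/   envs/stage/us-east-1/   envs/prod/us-east-1/
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-dev-us-east-1   # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/team/deployments.git  # hypothetical repo
    targetRevision: main           # the single long-lived branch
    path: envs/dev/us-east-1       # environment as a directory
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service-dev      # each path maps to a namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true               # reapply the Git state whenever drift is detected
```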
In each environment, we run some tests or health checks to start to gain confidence in our release candidate. We first deploy to dev, then to stage, and in the end to prod, stopping the line if something fails and rolling back all the previous environments.

If environments are directories in the main branch, promotion comes down to orchestrating commits between Git paths. And although there is no promotion out of the box with Argo CD, you can easily build it, and you have the building blocks for it in Argo Workflows. The solution we came up with is the workflow you see right here. You can have the CI as part of the workflow if you want, or you can have the CI trigger this workflow through Argo Events. The promotion starts with the first environment: it goes and updates the container image tag with the newly built one from the CI step, but only in the dev Git path, and it commits those changes to the main branch. Then the workflow runs a sync step that deploys these changes to dev. As part of the sync, we also run some acceptance tests, post-deployment, in a post-sync hook. Afterwards, the workflow sends a notification via Slack with the status of the deployment. And we do the same for the rest of the environments. The end result is what you see on the right of the screen: commits for every environment and region. The important thing to notice here, and I'll come back to this shortly, is that we stamp each commit with the workflow UID.

But what happens if something goes wrong, if the tests fail or the deployment fails? Well, you most likely want to recover from that as fast as possible to minimize the impact, and what's faster than an automated rollback? So what we do is, in an exit handler, we revert all the deployment commits from the current workflow run. On the right of the screen, you can see some code from this exit handler: it runs if the workflow failed, and it runs the template revert-commit. What revert-commit does is this script right here: it first fetches all the commits performed by the current workflow, by that workflow UID I was talking about earlier, then it takes them one by one, reverts them, and pushes those changes to the main branch. The end result is what you see on the right of the screen: compensating commits for all the deployment commits from the current workflow run.

Now, the third challenge: what about deployment configuration changes? So far we saw that we are validating application code. Remember, I said that when we promote a change, we are basically updating only the image tag. But what about deployment changes? Well, the common practice with GitOps, I think, is that developers just go and edit those files directly in the Git state and have the GitOps controller sync them. If some kind of promotion is required, developers need to do it themselves, by first merging the change to dev, then to stage, and so on. And if something fails in this process, rollbacks are the responsibility of the developer as well. If there are multiple edits in a short time, it becomes even more of a challenge. So the solution we came up with is to have developers make these edits somewhere else in the first place.
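As a rough sketch of the rollback idea, assuming the workflow UID is stamped into each deployment commit's message, the exit handler could look something like this. The repo, bot identity, and template names are hypothetical, and credentials handling is omitted.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: promote-
spec:
  entrypoint: promote-all-environments   # the promotion DAG, omitted here
  onExit: exit-handler                   # always runs once the workflow finishes
  templates:
    - name: exit-handler
      steps:
        - - name: revert-commits
            template: revert-commits
            when: "{{workflow.status}} != Succeeded"   # only on failure
    - name: revert-commits
      script:
        image: alpine/git:latest
        command: [sh]
        source: |
          set -e
          git clone https://git.example.com/team/deployments.git /repo  # hypothetical repo
          cd /repo
          git config user.email "bot@example.com"      # hypothetical bot identity
          git config user.name "promotion-bot"
          # Fetch all commits performed by the current workflow run,
          # identified by the workflow UID stamped into the commit message.
          commits=$(git log --grep="workflow-uid: {{workflow.uid}}" --format=%H origin/main)
          # Revert them one by one (newest first) as compensating commits.
          for c in $commits; do
            git revert --no-edit "$c"
          done
          git push origin main
```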
They can change application code and deployment configuration for all environments in a single commit and have those changes promoted by an Argo workflow, which will of course roll back the changes if something goes wrong. Let's see how the promote step changes in this case to account for deployment configuration changes: for each environment and region, we first copy the manifests the developer just edited into the Git state, and then continue with what we did for application code, updating the image tag and committing those changes to Git.

Thanks. So we saw why we migrated from Spinnaker to Argo, and we also looked at some of the challenges we met along the way while adopting Argo. Now let's talk a bit about some patterns. Here's the thing: you're probably familiar with trying to build this automated pipeline that takes you from dev to stage to production, and it's really a balancing act. It has to be fast: you don't want your developers to wait. It has to be stable: you don't want failures to occur for no reason. And it has to be safe: you have to make sure that whatever you just changed gets to production in good order.

The first pattern, one we've been using for a few years now and that I've seen other people use as well, is that we take every change that goes into a PR and deploy it to a dedicated environment for that PR. These PR environments are isolated, or semi-isolated; it depends on the service, its dependencies, and how easy it is for us to set up. Whenever a PR gets opened, a provisioning flow spawns a new namespace dedicated to that PR, maybe provisions some secrets, creates a dedicated Argo CD application that is also committed to Git, and maybe creates some DNS entries. From then on, whenever there are changes in that PR, whenever a developer pushes an update, we deploy that code automatically into that environment, and finally we also run functional tests against it; a sketch of such a per-PR application follows below. What this helps us do is continuously validate the changes that go into a PR before we even consider merging that code to main. That drastically reduces the number of bugs and failures we see in the main branch.
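Here is a minimal sketch of what the per-PR Argo CD application committed by the provisioning flow might look like; the repo, branch, path, and naming convention are all hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-pr-1234        # hypothetical: one application per open PR
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/team/my-service.git  # hypothetical repo
    targetRevision: feature/my-change   # the PR branch; every push gets redeployed
    path: deploy/preview                # hypothetical manifests for PR environments
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service-pr-1234       # the dedicated namespace spawned for this PR
  syncPolicy:
    automated:
      prune: true     # keep the environment in lockstep with the PR branch
```

Teardown on PR close would then amount to deleting this application and its namespace.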
The second thing we have to ask ourselves is: okay, but what about impact? We've diminished the risk per failure, but what about the impact of a failure, in the cases where, say, the functional tests don't catch a problem, or a failure is so deeply rooted that it only appears after running for hours? Well, we have a particular problem in Experience Platform: there's a network of services, mostly for data collection, deployed across the globe. Today we have seven regions. If we were ever to do a simultaneous rollout across the globe, any kind of failure would have a great impact on our users and, eventually, on our business. So what we do is use wave deployments. Just like the name suggests, this is about splitting your production deployments into multiple waves. We start with, say, a smaller region, deploy it to production, and let it bake for a bit. If everything checks out, we proceed to the next waves, and so on, until we have deployed to all regions. This is all done using Argo Workflows and a DAG; it's nothing very complicated in the end. The only thing I'd add is that some teams, depending on their requirements, have a wave zero where they deploy to a dark-launch environment. Most of them call it pre-prod, because it's identical to production in terms of configuration and deployment configuration; however, it doesn't get any customer traffic, any real traffic. It only gets synthetic traffic, and it's also used for testing. So we exercise the code there before we deploy it to any real users.

That's all great, but what about the deployment level itself? How do you make sure the deployment is doing okay? Probably some of you, or maybe most of you, are using Argo Rollouts, and there are some very standard ways of doing this. You can run an analysis on a rollout based on Prometheus metrics; at least that's what we do. Prometheus is quite standard at Adobe, we use it to monitor just about anything. You can also do a Mann-Whitney analysis, which is basically a statistical analysis using confidence intervals. It can tell you whether there's a significant degradation between a baseline version of your application and the canary, the version you're trying to roll out. This is very useful when you're looking at multiple things at once and plain Prometheus analysis just doesn't cut it: if you're considering business metrics like success rate or latency, but you're also looking at how your application is doing internally, say JVM memory or CPU usage or other indicators, then you probably want this kind of analysis.

And finally, another pattern is using tests as a rollout gate, and I want to go into this one a bit. What we do is create an analysis backed by a job, a regular Kubernetes job, whose purpose is just to run functional tests. Then we create an experiment, which of course has to run against the version of the service we're trying to roll out, and the experiment uses the analysis we just defined, so it basically runs the functional tests. Finally, in the rollout, we use the experiment as the first step, so it gates the rollout on the result of the experiment. If the tests fail, the analysis fails, the experiment fails, and therefore the rollout will not happen. If the tests succeed, the analysis succeeds, the experiment succeeds, and the rollout proceeds. It's basically a way of pushing the execution of functional tests to before the canary gets any kind of user traffic, so it really helps us further diminish the risk of failure.
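A condensed sketch of this tests-as-a-gate pattern with Argo Rollouts: an analysis backed by a plain Kubernetes Job runs the functional tests, an experiment spins up the canary version and runs that analysis against it, and the rollout uses the experiment as its first step. Image names, replica counts, and durations are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: functional-tests
spec:
  metrics:
    - name: functional-tests
      provider:
        job:
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: tests
                    # hypothetical image that runs the functional test suite
                    image: registry.example.com/my-service-tests:latest
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.2.3   # hypothetical
  strategy:
    canary:
      steps:
        # Step 1: the experiment runs pods of the new version plus the test
        # analysis; if the tests fail, the analysis fails, the experiment
        # fails, and the rollout stops before the canary gets user traffic.
        - experiment:
            duration: 10m
            templates:
              - name: canary-under-test
                specRef: canary
            analyses:
              - name: functional-tests
                templateName: functional-tests
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 100
```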
And finally, you've probably noticed that we've been talking a lot about continuous deployment, because that's the model we've been using for a few years now, and it has really helped us. But at the same time, we do realize that in some cases continuous deployment is really not the answer for everything.

For one thing, continuous deployment typically equates to synchronous workflows: you have to take each commit and promote it from dev to stage to prod, and so on. This becomes a problem in some cases: if you have a large team working on a repo and a high number of changes, if you have a slow rollout to production in order to minimize risk, or if you just have a long deployment time for whatever reason and can't drastically reduce it. These are all actual use cases from Adobe. It becomes problematic because, with these synchronous workflows, your developers have to wait for older changes to be deployed before their new changes go out.

A second perspective is that of the business, because here's the thing: some organizations, some groups, are afraid of continuous deployment, and I think rightfully so, because with continuous deployment you no longer control what gets released and when. You release every change: once it's in main, you push it to production. And that doesn't work for everyone. If you want to make releases a business decision, to decide exactly when you release a piece of code, then continuous delivery is probably a better model.

So how do we do this? It's pretty simple, really. We split the deployment workflow in two. The main deployment workflow pushes the code all the way through the lower environments: dev, stage, and, for teams that use it, pre-prod. Then there's a second workflow that looks at the state of the code in that last environment, stage or pre-prod, takes the application version from there, and pushes it to production. These are two different workflows, and the second one you can trigger by hand, or it can run on a schedule. The difference: if you want an actual release-as-a-button, you'll want to trigger it manually; but if you're just plagued by the problem of too many commits queuing up, you probably want to release on a schedule, maybe every three hours, once a day, or once a week. It depends. I'll show a small sketch of the scheduled variant at the end of this section.

This is what it looks like. You can see how the workflow expands into two steps per environment-and-region pair: it looks up the application version in the previous environment, then promotes it and waits for the Argo CD application to be synced and healthy.

Finally, beyond the pros and cons, one or two other things I'd like to add. If you're using synchronous commit promotions and continuous deployment, you can extend the rollback, in case of failure, to also revert the commit that was pushed to the application repository and caused the failure. That's one thing. The second is that if you're using continuous delivery with these asynchronous state promotions, deploying from time to time from stage or pre-prod to production, then because you're pushing more changes at once, maybe one to ten commits, if you have a failure you'll have a harder time figuring out which commit was really the problem.
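Here is that sketch of the scheduled variant: a CronWorkflow that reads whatever version is currently in the stage path and stamps it into the production paths. The schedule, repo layout, and file format are hypothetical, and in practice you would also wait for the Argo CD applications to become synced and healthy, as we do.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: promote-stage-to-prod
spec:
  schedule: "0 */3 * * *"   # e.g. every three hours; submit by hand instead for release-as-a-button
  workflowSpec:
    entrypoint: promote
    templates:
      - name: promote
        script:
          image: alpine/git:latest
          command: [sh]
          source: |
            set -e
            git clone https://git.example.com/team/deployments.git /repo  # hypothetical repo
            cd /repo
            git config user.email "bot@example.com"   # hypothetical bot identity
            git config user.name "promotion-bot"
            # Take the application version currently deployed in stage...
            tag=$(grep 'tag:' envs/stage/us-east-1/values.yaml | awk '{print $2}')
            # ...and promote it to every production region.
            for region in us-east-1 eu-west-1; do
              sed -i "s/tag: .*/tag: $tag/" "envs/prod/$region/values.yaml"
            done
            git commit -am "Promote $tag from stage to prod"
            git push origin main   # Argo CD syncs the prod applications from here
```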
So, just to do a quick recap. One: what we noticed really early on when adopting Argo was that we needed to build promotion and rollback into our workflows ourselves. Two: we feel you have to take deployment configuration changes through the same diligence and process as application code changes. Three: PR preview environments are really helpful; they're pretty easy to set up and they help you identify and resolve failures long before they even reach the main branch. Four: wave deployments can help you minimize the impact of eventual failures in your production rollouts. Five: you can use progressive delivery in very standard ways; Argo Rollouts is really well documented, you can set up an analysis pretty easily and use it to safely roll out changes to your users. And finally: asynchronous state promotions are sometimes a good way to mitigate the problems of continuous deployment.

And with that, I think we're through with our presentation. Thank you for listening. If you have any questions, we're open now.