Welcome. Thanks for being here. We're going to talk to you about how we use progressive delivery with Argo Rollouts at Adobe: how we set it up, what went well, what didn't work so well, and some issues we found along the way. Hopefully you can get some ideas for using Argo Rollouts in your everyday work, if you are not using it already. My name is Natalia. I'm working as a software development engineer at Adobe. I love math and coding, and this is my second talk at KubeCon; the previous one was in Chicago. So yeah, glad to be here. And I'm Carlos. I'm a principal scientist, also on the same product at Adobe. I've been contributing to a lot of open source projects, and one of them I started was the Kubernetes plugin for Jenkins, which turned 10 years old last December. So it's been a long way.

A short introduction about Experience Manager, so you understand a little bit what problems we are facing. AEM is a distributed Java application that uses a lot of open source components from the Apache Software Foundation and obviously a bunch of other open source libraries. A particularity is that it has a huge market of extension developers, so it's like a platform where people write code, and we deploy that code into the cloud as part of our product. We are running AEM on Kubernetes as part of this cloud service, and it's running on Azure. We have more than 45 clusters across multiple regions, because it's a content management system and customers want to run as close as possible to their own customers. So we have the US, Europe, Australia, Singapore, Japan, India, and we're ready to adopt any new region that customers ask for. We have a dedicated team managing clusters for multiple products, so we don't have all the permissions you would have in a cluster that you run yourself. There's a team that builds the clusters and keeps them updated, and we have limited permissions on those clusters. That's another particularity. On AEM we have multiple teams building services in different languages, and we try to take a "you build it, you run it" approach. We use a lot of APIs and a lot of Kubernetes operator patterns to build these services and to add functionality to the main AEM core application, the Java application.

On the scale: we have 17,000 environments, what we call Adobe Experience Manager environments. This translates to over 100,000 Kubernetes Deployment objects and more than 6,000 namespaces. We've been doing progressive rollouts at the environment level for some years already, meaning we can roll out changes to specific namespaces, or to a percentage of namespaces, in a controlled manner, with something we built ourselves. But the reason to look for something else was: how do we avoid issues in production, not just during tests? I think that's what everybody looking at progressive delivery is looking for. Okay, tests are not enough; we want to check that things in production are working fine. We also have to consider issues introduced by our own Adobe code, things that we write, but also things that customers write. If we make a mistake or a customer makes a mistake, we want to protect the production environments from that. And we have to do it for 17,000 unique services. Full end-to-end testing is expensive.
We do end-to-end testing at a bigger scale, but obviously as things grow this becomes costlier and takes more time. It also doesn't cover all the corner cases you might think of. Testing is for things you know may fail; I think progressive delivery is for things you don't know may fail. That's a good analogy. And when we have testing that fails, or production rollouts that fail, we have to do the analysis: is this a problem with our AEM release? Is this a problem with the customer code? Is this a temporary flaky issue? All of that is costly and time-consuming. Releases can get delayed because we detect something while rolling out to production; now we have to stop the train and look at what happened. Is this something that is blocking the release or not? And if we don't detect it and we roll out something that is broken, 100% of the traffic for a particular customer could be affected.

So, I guess you all know, that's why we're moving to Argo Rollouts; that's why we're all here. Is anybody using Argo Rollouts already? Just a few people. I'm not going to go into a lot of the details, because the other talks already covered how to configure it and set it up; I'm going to stay a bit more high level on how we did it and why we did it. We set up canary deployments with automatic rollback based on real-world traffic and error metrics, using metrics from Prometheus that we already have for alerts. We are now feeding these to Argo Rollouts.

One interesting case: Argo Rollouts works with one Deployment, but what if I have two Deployments that go together? We have two Deployments that we call author and publish. Author is used by people creating content; publish serves that content. If there is an issue with either of them, we want to roll back both. You can do this with Argo Rollouts by using the same metric for both Deployments. Argo Rollouts only looks at one specific Deployment, but when you use a metric that includes both of them and the metric fails for either one, both will be rolled back. In the future we could also consider a Helm rollback, because every time we do a new release with Helm, maybe we don't want to just roll back the Deployment. Maybe, once Argo Rollouts has decided this is broken, we want to roll back the whole Helm chart, so we don't end up with a mixture of a newer Helm chart and old Deployments.

The way we did it: we have a Rollout object and we use a workloadRef in the Rollout. This way we can point to an existing Deployment from the Rollout, so we don't have to delete the Deployment objects; we just create the new Rollout object in Helm and point it at the Deployment object. We configure the Rollout to use the pod metadata labeling feature: stable pods get the label role=stable and canary pods get the label role=canary, so you can have metrics for both of them separately and look only at the canary metrics. In the future you could also have traffic going only to the canary if you wanted, because they have separate labels; you could have separate services and separate ingresses. That would be interesting if you want to deploy and preview the canary before promoting.
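As a rough sketch of what this setup might look like, here is a minimal Rollout that references an existing Deployment through workloadRef, labels canary and stable pods differently, and runs an analysis backed by Prometheus. The names, weights, address, query, and thresholds are illustrative placeholders, not our actual manifests.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: publish                      # hypothetical name for illustration
spec:
  replicas: 2
  # Adopt the existing Deployment's pod template instead of inlining one,
  # so the Deployment can stay in the Helm chart during the migration.
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: publish
  strategy:
    canary:
      # Label canary and stable pods differently so metrics can be
      # filtered to the canary pods only.
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
      steps:
        - setWeight: 25
        - pause: {duration: 5m}
      analysis:
        templates:
          - templateName: error-ratio
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-ratio
spec:
  metrics:
    - name: request-error-ratio
      interval: 1m
      # Tolerate a single failed measurement (for example while Prometheus
      # is still ingesting data); a second failure degrades the rollout.
      failureLimit: 1
      successCondition: result[0] < 0.1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # placeholder address
          # The query covers both tiers, so an error spike in either
          # author or publish fails the analysis for both Rollouts.
          query: |
            max(request_error_ratio_5m{role="canary",tier=~"author|publish"})
```

The exact thresholds and intervals depend on how quickly your metrics react; the important part is that the query only selects canary pods via the label set by canaryMetadata.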
The analysis template is the part that decides whether this is a successful rollout or not. We look at the metric with a success condition that it stays below 10%; the metric here is an error ratio. We point to the Prometheus service running in the namespace, and the metric is the request error ratio over five minutes, with pod and tier labels. This is where we use both tiers, both deployments: these are metrics coming from the deployment called author and the deployment called publish, so we can say that if either of those fails, roll it back. And because we have two Rollout objects, one for author and one for publish, if the metric fails for one of them, the other one is also going to roll back. So both of them get rolled back at the same time.

So what is good about Argo Rollouts? Automatic rollback on high error rates: that's the key feature we were looking for. We don't want to halt the release train and we don't want to break a customer. This allows us to do non-blocking rollouts across environments and investigate the issues afterwards; we get alerts when a rollout fails and we can look at those later. It also reduces the blast radius, so if we make a mistake or a customer makes a mistake, only a percentage of the traffic is affected. And we get more frequent releases: we validate with real traffic and, at the end of the day, get more velocity. If you look at the DORA metrics and things like that, that's what makes a team one of the top performers.

What is bad about rollouts? The migration requires orchestration to avoid running duplicate pods. When you create a new Rollout for an existing Deployment, even when you use workloadRef to reference it, you now have pods running at the same time owned by both the Deployment and the Rollout. And this is a problem when you have thousands of services. Typically people would go and say, oh, when my rollout is successful, I scale down the Deployment, and you only have to do this once, when you migrate. But when you have tens of thousands of services, that's a bit harder. So a colleague on our team contributed a PR so you don't have to care about this that much anymore: the Deployment is scaled down automatically after the migration to Rollouts. This PR is going to be released in Rollouts 1.7, which is not out yet as of this talk; I checked right before. It adds a new scaleDown attribute on the workloadRef. You can say scaleDown: never, so the old Deployment is never scaled down, which is how it worked before; onsuccess, so the Deployment is scaled down after the Rollout becomes successful; or progressively, so as new replicas come up in the Rollout, replicas of the Deployment are removed.
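As a sketch, assuming the Rollouts 1.7 syntax described above, the workloadRef gains one extra field (the Deployment name here is a hypothetical placeholder):

```yaml
  # Inside the Rollout spec (Argo Rollouts 1.7+):
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: publish          # hypothetical Deployment name
    # never         - keep the Deployment's replicas (previous behavior)
    # onsuccess     - scale the Deployment down once the Rollout is healthy
    # progressively - remove Deployment replicas as Rollout replicas come up
    scaleDown: onsuccess
```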
So, things that were so-so. We started with simple rollouts, watching for degraded statuses, and we noticed things like: okay, Prometheus is not reachable. That's something you have to account for, because when Prometheus is not reachable, Argo Rollouts is going to mark your rollout as degraded.

The other one we ran into was upgrades with object deletion. One thing we use a lot, or try to use more, is immutable resources: immutable ConfigMaps and Secrets. When you create a ConfigMap or a Secret, you can set immutable: true. This signals the Kubernetes API: okay, this is not going to change, I don't need to watch for changes. Every time Kubernetes watches an object, it uses memory on the API server, and at our scale this creates a high load on the API server. So one thing we were doing was using immutable objects for things we know are not going to change. Now, the typical pattern with Helm is that every time you make a change to the Secret or the ConfigMap, you have to change the name, and you use a hash of the contents in the name; that's a common Helm pattern. So every time you change the content, you get a new Secret, and the old Secret is deleted, because Helm is going to do that. Imagine in this example that Helm does an upgrade: the new Secret is created and the old one is deleted; secret-1 is created, secret-0 is deleted. If the new pods fail to start, Argo Rollouts keeps the old deployment running and scales down the new one, but the old deployment cannot create new pods because its Secret no longer exists. So as soon as the existing pods are recycled, you get an outage. That's one tricky thing we are still figuring out how to handle, and something you need to take into account. It's not a problem only with rollouts; rollouts just make it worse, because a rollout can take more time than a normal deployment. But it's something to watch for.
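A minimal sketch of that Helm pattern, assuming a hypothetical chart with a `config` value: the name includes a hash of the contents, so any change produces a new object, and the object itself is marked immutable.

```yaml
# templates/configmap.yaml (illustrative only)
{{- $content := .Values.config | toYaml }}
apiVersion: v1
kind: ConfigMap
metadata:
  # Name changes whenever the content changes, e.g. my-config-5f3a9c21.
  name: my-config-{{ $content | sha256sum | trunc 8 }}
# Tell the API server this object will never be updated in place,
# so it doesn't need to keep watching it for changes.
immutable: true
data:
  application.yaml: |
{{ $content | indent 4 }}
```

Nothing in this pattern keeps the previous my-config-&lt;hash&gt; around after an upgrade, which is exactly the rollback pitfall described above.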
You also need good metrics. You need to account for the canary and stable labels in the metrics: you don't want to check the metrics of the stable pods, you want to look only at the canary pods. And what happens when you have environments with low traffic? If you don't get the metric, maybe you assume the rollout is successful, but then when traffic starts arriving you realize something is broken. So that's also a tricky part. One more annoyance: when you adopt the Rollout object, you suddenly need to change runbooks, tooling, and training, because now you're going to have Deployment objects that are scaled to zero. If somebody is not aware of that, they can mistakenly go and say, oh, I need to scale up this Deployment because it's at zero. Now you have a mix of both kinds of objects.

We're going to show you a little demo. We had to record it to speed it up. In this demo we're going to show you how we did it with Argo Rollouts in our environment. Here you can see that we are spinning up a new... well, actually, no, it's not playing. One sec, technical problems. It's not a live demo, but it's always demo time; no matter what you do, it's always demo time. From the beginning? All right, now. We have this recording of an environment where we started a new rollout. You can see that revision 22 is the new rollout. On the other side of the screen you can see the pods: we had two pods running for the previous version and a new pod which is starting up. I made the recording so we can make it faster. Once this... yeah, there you go, the pod is initializing. We are deploying a new version which contains a bug, so what is going to happen is that we have a URL which is returning 500 errors. As soon as it is ready, which is going to be in a little bit (there you go, because I sped it up), behind the scenes we are receiving requests which are returning 500s.

For this demo we just made a lot of requests to this endpoint, because it's a demo environment. And then we have the analysis run, which is running; you can see it on revision 22, checking the health of that metric Carlos was pointing out. Now we can see we have one error for the analysis run. There you go. As I will show you later, we have a limit: with one error we're okay, with more than one error it fails. So Argo just continues; it starts the new pod and says, all right, I have one error, but it is fine. As soon as I have two (there you go), it is marked as degraded. At this point Argo is going to say: all right, the new deployment failed, I mark it as degraded, and then I'm going to start rolling back to the previous stable version, which in this example is the previous one, revision 21. So now revision 21 is going to be... well, actually at the top of the screen, it's a big screen, we can also see that the rollout is marked as degraded, because the metric actually hit an error. Now you can see that, little by little, it's doing another rollout: revision 22 is being scaled down, its pods are terminating, and the previous one, revision 21, is spinning up new pods again because that is the healthy one. So we are rolling back on one side, and you can also see the new pods coming up on the other; you can see them right there. This is how Argo Rollouts works. It was a short demo because we sped it up, but this is roughly enough time to do the rollout, detect that it failed, and roll it back.

Now if we go back... let's see, let's get back to the slides. This is the analysis run that we saw in the demo. Here you can see there were six measurement points, and four of them were marked as successful. You can also see some are NaN, not a number; that's because Prometheus ingests the data little by little, so that's something you need to configure properly: you need to account for that Prometheus delay. And then we had two failures, and since we configured it to mark the run as failed as soon as we have more than one error, it was marked as degraded.

Okay, so to sum it up: progressive delivery, we think it's a great idea, and Argo Rollouts is a great implementation. There are only a few things you have to take into account when you adopt it, and hopefully you got some ideas here. Thank you for being here, and I think you can leave feedback using the QR code on the schedule page. If you have some questions...

Yes, I have a question. I wanted to come back to the issues you wanted to iron out, especially the issue with the missing ConfigMap that you mentioned: if you name ConfigMaps by content and you need to roll back, the old ConfigMap is gone. We have exactly the same problem, and we have found a solution; let me share it with you and see if it would work for you. We've written some automation that basically retains the old ConfigMaps in Git. So after you render using Helm and you get the new ConfigMap with a new name, instead of replacing the old one we save the old ones as well, the last 10 or whatever number you want to keep, so they all show up as synced in Argo. Later, when we deploy again, we just remove the oldest one and keep the last 10. That way we don't have this problem.
We can now roll back and we don't have this error. Yeah, there are multiple ways to solve this. We are not using Argo CD here, only Argo Rollouts, but you can keep the old ones around for a bit. You can label them, you can instruct Helm not to delete them, and you could maybe have a post-upgrade job in Helm that deletes the old ones so you don't keep them around forever. There are multiple ways; you can also do it with an operator. You just need to be aware of it. It's a bit of "you have to deprecate things before deleting them", in case this happens. Is there any one of these solutions you would recommend, of the ones you mentioned? We are not sure yet what we're going to do, because one of the other options we have is to put the secrets together: instead of having many immutable Secrets, we could have just one Secret that is not immutable. That doesn't overload the API server so much, and maybe that's fine. But probably, because in our use case we have another operator that triggers these things, the Helm upgrades and all that, we can do the deletion from there. So that could be another option for us.

Hi, thank you for the talk. I wanted to ask if you have faced any drawbacks using the workloadRef to the Deployment? Not other than the migration problems and having to scale down the Deployment; other than that, nothing. Okay, that's great to know. Thank you.

I think maybe one more? Yes. Great talk. I have a question about manual feature gating or developer access, because I presume you have many developers deploying in many different ways, or maybe they're using a common model. But I can also suspect that some of them would like to see the visual Argo Rollouts UI, see how it's working, and maybe manually allow the next analysis or the next phase to happen. Is that something you are using? And if so, is it any good, the user experience part of it? That would be interesting, but we are doing this for thousands of deployments, so going and looking at "oh, how is this one doing?" is not going to fix anything. But yes, we are storing the value on our own: we have a kind of overarching operator that looks at the Helm upgrades and at Argo, and that operator gets the status; that status is stored and can be observed by the other tools that are triggering it. How easy is it to get the status with all the information filled in? We grab the information from the Rollout status and put it in the other operator in a way that makes sense for the clients triggering the upgrades. So you essentially get a timeline? Yes. Very nice. I think that's time. Thank you all. Thank you everyone.