Hello everyone, I hope you can hear me. I'm Larissa, and today we're going to talk about tailored deployment strategies: Argo Rollouts for diverse applications. But first, a little bit about myself. I'm a computer scientist with the Adobe Experience Platform group. I've mostly been developing backend services, but I've always had an interest in automation, and over the last year and a half I've worked on updating our CI/CD pipelines with tools such as Argo CD, Argo Workflows, Argo Events, and Argo Rollouts. Today we're going to talk about a use case for the last one, namely Argo Rollouts.

In Experience Platform, we've always had a focus on quality. Every change we make goes through a series of checks and validations before it reaches production, and everything is automated, of course. You can see on the screen all the checks we run before we merge a PR. We start with basic code verification, then run a series of static analyses, from vulnerability checks to license checks to code analysis and security scans. If everything checks out, we deploy that code into an ephemeral environment and run a battery of comprehensive functional tests against it. We do all of this to reduce risk, improve confidence in our releases, and reduce the change failure rate.

When it comes to deployment, we take a staged approach. We have multiple environments leading up to production, and we deploy the code into each of these environments and then test it. Automatic promotion to the next environment, or an automatic rollback, happens based on the result of the deployment and the tests. In production we're even more careful: we deploy in groups, starting with the smaller regions first and continuing with the larger ones. Again, we do all of this because we want to reduce risk, reduce the change failure rate, and shrink the blast radius in case there is an issue.

So you may be thinking that we're doing a lot of checks and this should be good enough. Actually, it isn't. On one hand, as a growing business, Experience Platform is looking at increasing its footprint and expects to open new regional deployments around the world. With new regions we'll have multiple parallel deployments; we cannot just deploy everything sequentially, it would take too long, and monitoring them will become more complex. On the other hand, we'll always be focused on stability, and although we employ a lot of checks, there are some use cases where the area of impact can be pretty big. This was our motivation for looking into progressive delivery.

Of course, a simple answer to that was Argo Rollouts. Why? Because it's cloud native, because it's very easy to learn and understand, and because it basically lets you configure every type of progressive deployment out there. But the fact that Rollouts is so flexible in what you can configure raises the question: what do we configure for our use case? What's the best strategy? It was also important for us to leverage the mechanisms we already have today, like our tests.

Now I'm going to walk you through a simplified view of a feature we have in Experience Platform called Audit, and we'll use the components implementing this feature as our test subject. Today it is implemented like this.
We have events produced on a Kafka topic, which are consumed, processed, and ingested into a data store by this Kafka consumer component. From the data store, they are exposed through this RESTful web service. It's important to note that these two microservices are both backwards and forwards compatible with each other. When it comes to the post-deployment testing I mentioned earlier, we actually test the entire system: our functional tests generate data on the Kafka topic and validate the behavior of the whole system through that RESTful web service. These tests are packaged in a container and run as a Kubernetes Job in the CI/CD pipeline.

So let's see how we can apply Rollouts to each of these components, starting with the RESTful web service, which is a very good candidate for the blue-green deployment strategy. Why? Because it can be tested without any live traffic, and we can reuse some of the tests we already have to do that testing. We need to split the tests up a bit and have a dedicated container with component tests for this API. These tests are then referenced in an AnalysisTemplate, similar to what you see on the screen. The important thing to note here is that the RESTful web service's URL is passed in as a parameter, and it needs to point to the preview service, the one targeting the preview pods, basically the green part of the blue-green deployment. This analysis is then referenced from the rollout spec as a pre-promotion analysis, meaning the rollout controller is instructed to run these tests against the preview service before it is promoted and receives any live traffic. Therefore, if there is an issue that can be caught through testing, we have no impact here.

Another thing to notice is that the preview replica count is set to one. This keeps us from using too many resources while we run the blue and green deployments side by side; we're basically spinning up just a single pod. You can also see that we don't set the replica count here, because this rollout is managed by a horizontal pod autoscaler. HPA works great with Rollouts because it only manages the replicas of the active ReplicaSet, the one actually receiving live traffic.

A takeaway here is that this approach is very similar to what we used to do with a pre-prod environment. What we called pre-prod was an environment very similar to production in terms of configuration, but it did not receive any live traffic, only synthetic traffic. The scope of the pre-prod environment was basically to test the production configuration with zero impact. But of course, pre-prod environments are expensive to run, because alongside a new deployment of the service they require additional resources, in our case a database. With blue-green you get the same amount of validation, but you only run double the number of pods right before the cutover; until that point, because we set the preview replica count to one, we're using just one extra pod.
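To make that concrete, here is a minimal sketch of what such a blue-green Rollout and its pre-promotion AnalysisTemplate could look like. The resource names, image references, port, and URLs are illustrative assumptions, not our actual configuration.

```yaml
# AnalysisTemplate that runs the API component tests as a Kubernetes Job
# against whatever URL is passed in through the "service-url" argument.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: audit-api-component-tests
spec:
  args:
  - name: service-url
  metrics:
  - name: component-tests
    provider:
      job:
        spec:
          backoffLimit: 0
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: tests
                image: registry.example.com/audit-api-component-tests:latest  # hypothetical test image
                env:
                - name: SERVICE_URL
                  value: "{{args.service-url}}"
---
# Blue-green Rollout for the RESTful web service. The pre-promotion analysis
# points the tests at the preview service, so the green pods are validated
# before they receive any live traffic.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: audit-api
spec:
  # no "replicas" field: an HPA manages the replica count of the active ReplicaSet
  selector:
    matchLabels:
      app: audit-api
  template:
    metadata:
      labels:
        app: audit-api
    spec:
      containers:
      - name: audit-api
        image: registry.example.com/audit-api:v2   # hypothetical image
        ports:
        - containerPort: 8080
  strategy:
    blueGreen:
      activeService: audit-api-active       # receives live traffic
      previewService: audit-api-preview     # targets only the preview (green) pods
      previewReplicaCount: 1                # keep resource usage low while both versions run
      prePromotionAnalysis:
        templates:
        - templateName: audit-api-component-tests
        args:
        - name: service-url
          value: http://audit-api-preview:8080
```

The same packaged component tests we already run post-deployment can be reused here simply by pointing them at the preview service instead of the public endpoint.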
Now let's see what strategy we can apply to the streaming application, that Kafka consumer. If I were to apply the same strategy, blue-green, and create a new version of my application with the same configuration as the first one, acting as the green part of my blue-green (or actually the yellow one in the picture), then I would have mistakenly deployed a canary. If I deploy the same configuration on both versions of the application, they share the same consumer group, and as a consequence they share the traffic coming through that Kafka topic. So let's say I intentionally change the consumer group. Then I can no longer promote my V2 application, because the whole idea of blue-green is to promote a new version to become the stable one, and if it has a different configuration than the stable one, I cannot promote it anymore. Because blue-green is all about bringing up new instances and testing them without any live traffic, it's simply not a fit here.

Well, you could say, okay, expose that new version of the application to traffic through a canary and test it like that. Unfortunately, this doesn't quite work either, for the same reason: shared consumer groups. I cannot target the V2 pods, my canary pods, with the test traffic, because both the live traffic and the test traffic are shared by my stable and my canary deployments. So I have no guarantee that I'm actually testing the canary deployment.

What we ended up doing is a canary, but with metric analysis. All my pods expose an HTTP endpoint for metrics, which is scraped by a Prometheus instance. The canary runs with live traffic, and we assess how well it's doing using some key metrics. The key metrics for our use case are the error rate, which should not be greater than 1% (an error in our system means we were not able to process a message and we parked it on a dead letter queue), and a second metric that simply checks we are actually producing events to the data store. Both metrics have an initial delay of one minute, to allow the canary pods to come up and to allow for a first consumer group rebalance as the canary pod is added to the consumer group. These two metrics are then referenced from the rollout spec of my Kafka consumer rollout as a background analysis, meaning they run for the entire duration of the rollout (a sketch of this setup follows below).

But how exactly do we scrape those metrics from the canary pods? I could use a ServiceMonitor targeting the canary service. This correctly identifies my canary pods during a rollout. However, once the rollout is promoted to the stable version, I have an issue: the canary service now basically points to my stable pods too, because the canary service uses the rollout's pod template hash to target the pods, and the pod template hash stays the same after the rollout is promoted, since there is essentially no change in the spec. So in this case, my second ServiceMonitor will actually point to the stable pods and collect data from the stable version of the application. Apart from the issue of duplicating metrics, I have another issue if I want to run a new rollout: depending on the time window I use, my first data points might be skewed, because they actually contain data from the stable service.
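Here is that sketch: a minimal illustration of a canary Rollout with a background metric analysis evaluated against Prometheus. The metric names, queries, the Prometheus address, the replica counts, and all resource names are illustrative assumptions, not our actual configuration; only the 1% error-rate threshold and the one-minute initial delay come from the setup described above.

```yaml
# AnalysisTemplate with the two key metrics. The initial delay gives the canary
# pod time to start and lets the consumer group finish its first rebalance
# before the metrics are judged.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: kafka-consumer-metrics
spec:
  metrics:
  - name: error-rate
    initialDelay: 1m
    interval: 1m
    failureLimit: 1
    # messages parked on the dead letter queue must stay at or below 1%
    successCondition: result[0] <= 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090   # hypothetical Prometheus address
        query: |
          sum(rate(audit_consumer_dlq_messages_total[2m]))
            /
          sum(rate(audit_consumer_messages_consumed_total[2m]))
  - name: events-ingested
    initialDelay: 1m
    interval: 1m
    failureLimit: 1
    # we must still be writing events to the data store
    successCondition: result[0] > 0
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: sum(rate(audit_events_ingested_total[2m]))
---
# Canary Rollout for the Kafka consumer. The analysis is attached at the
# strategy level rather than as a step, so it runs in the background for the
# entire duration of the rollout.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: audit-kafka-consumer
spec:
  replicas: 5
  selector:
    matchLabels:
      app: audit-kafka-consumer
  template:
    metadata:
      labels:
        app: audit-kafka-consumer
    spec:
      containers:
      - name: consumer
        image: registry.example.com/audit-kafka-consumer:v2   # hypothetical image
  strategy:
    canary:
      analysis:
        templates:
        - templateName: kafka-consumer-metrics
      steps:
      - setWeight: 20          # with a shared consumer group this only controls the pod ratio
      - pause: {duration: 10m} # enough time to collect data points at a 15s scrape interval
```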
An alternative to that ServiceMonitor approach is a Rollouts feature called ephemeral metadata, which lets you define in the spec labels and annotations to attach to the canary and the stable pods respectively. Of interest to us is the canary metadata, which Rollouts adds to the pods during the canary. I can then leverage a PodMonitor that selects, based on the labels I just added, only the pods that are part of the canary deployment. The good part of this approach is that when the canary gets promoted, my PodMonitor no longer selects any pods, because all the pods now carry the stable label; the canary label was an ephemeral one.

Now let's look at some of the takeaways we got from running rollouts with analysis for a streaming application. The first one is about how long to run the canary for. Of course, it depends on the application; the answer is long enough for the metrics to be scraped and for sufficient data points to be collected for your analysis. In our case, because it's a moderate-traffic application and we're scraping every 15 seconds, anywhere from five to ten minutes is enough to assess whether the rollout is okay or not. Another thing to take into consideration is how frequently you change that application, because if you're running rollouts for days and you have frequent changes to the service, you might still want to test each change one at a time, so delivery speed will basically be reduced. Another takeaway was about the interval to compute metrics over: we need to be aware of consumer rebalances, which happen every time we add new canary pods, so time windows as short as one minute might not be enough, because you still need to wait for the rebalance to finish and for the metrics to be collected. A further takeaway is that although metrics do help us reduce impact when rolling out a new version of the application, they are not comprehensive enough to test the entire functionality of the deployment, so they are not a replacement for tests. This is actually a second reason we want to fail fast and run the rollout for such a short time: we still have tests to run afterwards. We still run the component tests for this component, and also the tests for the entire system post-deployment.

A final note I wanted to make is about running Argo Rollouts with Argo CD, and especially the rollback use case. As you know, Argo Rollouts will roll back your change if there is an issue, by scaling down the ReplicaSet that has issues and scaling up the stable one. We actually do more than that: we closely follow the GitOps principle and reflect that change back into Git through a Git revert. There are multiple reasons why you would want to do this. One of them is that if the failed rollout is retried multiple times without a fix, and that rollout is producing some impact, through a canary for example, then you are basically multiplying that impact with every retry. Another reason is that you usually don't know what happened when a rollback does occur with Rollouts, so you might want to roll back all the other regions you have in production until you figure out the root cause and whether it's actually a configuration issue or something more than that.

This was all from me. If you have any questions, there is a mic up there; if not, we can take them after the talk. Thank you.