So, hello everyone, I am Florentine and this is my colleague Anka. We are both software engineers at Adobe, where our main focus is on developing and maintaining real-time, data-intensive applications, and we are really excited to be here today to talk about how we automated our release process for continuous deployment. In this presentation we'll follow our attempt to create a completely hands-off deployment pipeline for one of our business-critical services. To give you a sneak peek: we've seen that it can be done, but we won't lie to you, it takes time and effort to build such a reliable pipeline before it becomes effortless. At Adobe we are involved as engineers in the full life cycle of the application, from development to operations, so by removing repetitive work, engineers can focus on what they do best: innovating. This is why our leitmotif today is "automate it". The techniques we'll be presenting are technology agnostic, so they can be adopted regardless of the CI or CD tools you and your teams are using. For example, even though we are using Spinnaker and Jenkins today to deploy our applications into Kubernetes clusters, we'll keep the same practices when we migrate to something else. The business-critical service we implemented continuous deployment for is part of Adobe Experience Platform Edge Network, which offers various digital marketing capabilities, from data collection to delivering real-time personalized experiences. More specifically, our service is the server-side orchestrator that coordinates access to all Adobe Experience Cloud solutions present in the seven geographically distributed regions. So our service is on the critical path, impacting the availability of the entire solution stack. Now let's have a look at how the release process used to look some time ago, and during the presentation we'll have the chance to see how it evolved. Starting from left to right, the release engineer starts the release process by deploying a new version in a non-production environment called stage. After double-checking that the version is performing well, they start the production deployment by deploying two canary pods in two distinct regions. The next step is an exhaustive manual analysis, requiring close monitoring of the canaries' Grafana dashboards to establish whether they are performing well, both functionally and non-functionally. Based on the results, the release engineer decides whether to do the global rollout or not. Once the new version is running in all seven regions, a final post-release validation check is performed to make sure that everything is in tip-top shape. One thing that immediately stands out is that five out of the ten pipeline stages require an engineer's input. So we think we can do better, right? But why would we want to change anything in the first place? Manual deployments can be a frustrating and exhausting experience for the release engineer, who has to track around 165 metrics in each of the seven regions, so the chance of missing something is pretty high. And this is why today we'll go through our multi-year journey of automating our deployment.
There are a lot of techniques that we've implemented along the way, so we'll give a brief overview of the first two sections, and then we'll do a really deep dive into the release validation part, covering automated canary analysis and wave deployments, where we'll share our challenges and how we overcame them to make our continuous deployment plan a reality. Now let's first talk about how we improved our CI quality and strengthened our pre-release validation in order to catch issues early, before pushing them to production. We follow a trunk-based development model, which means that we aim to deliver software at a high rate. In a continuous deployment setup, our engineers might feel hesitant about merging changes frequently to the main branch, so it's very important not to compromise on safety. This starts with high-quality code in the main branch, which is ensured by a strong CI pipeline. In addition, enabling automatic code promotion to non-production environments creates a tight feedback loop for both functional and non-functional validation. With these quality checks in place, we are able to accomplish our objective of ensuring that the main branch is releasable at any time. We've made some improvements to our Jenkins CI pipeline by adding multiple new stages, but the most important one is the addition of functional validation before changes are even merged into the main branch. This way, our engineers can make sure that their changes won't break any existing functionality. For this, we trigger a Spinnaker pipeline. The Spinnaker pipeline deploys a preview version of the application into a specific namespace, which we call PR preview, in a Kubernetes stage cluster. This preview version is connected to all stage dependencies, which allows functional validation to take place. Engineers can also manually test their changes in the same namespace whenever needed. Once the functional tests are complete, the test results are reported back to the CI pipeline. With this type of functional validation, we make sure that the application can be successfully deployed and behaves as expected. After functionally validating the PR, we are confident in merging it into the main branch, once it receives at least two approvals of course, and automatically promoting it to non-production environments. Using Jenkins, we trigger Spinnaker pipelines that deploy the new version to stage and pre-prod, and functional tests are run against these environments. If functional tests fail, both environments are rolled back automatically to the latest stable version, and the team is notified via Slack; we'll sketch this promotion-and-rollback flow below. Notably, we have added an additional environment, pre-production, which is a shadow of our production environment. It behaves exactly like production, but it doesn't receive any live traffic, as it is accessible at a different host. Since the application running here is connected to all production dependencies, we can functionally validate it as we would have done in production, without actually pushing the changes to production. While functional validation is crucial, non-functional validation is just as important, and we evaluate the performance of each commit merged to the main branch by using Gatling to generate load and test different production-like scenarios. We deploy the new version, which we call the canary, alongside a fixed version that we already know performs well, called the baseline.
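Before we look at how those two pods are compared, here is the minimal sketch of the automatic promotion-and-rollback flow we mentioned above. The helper functions, environment names, and Slack message are hypothetical stand-ins for our actual Jenkins and Spinnaker pipelines; this only illustrates the shape of the logic.

```python
# Minimal sketch of the post-merge promotion flow: deploy the new version to
# stage and pre-prod, run the functional suite against each, and roll BOTH
# environments back to the last stable version if anything fails.

ENVIRONMENTS = ["stage", "pre-prod"]

def deploy(env: str, version: str) -> None:
    # Stand-in for triggering the Spinnaker deploy pipeline for `env`.
    print(f"deploying {version} to {env}")

def run_functional_tests(env: str) -> bool:
    # Stand-in for running the functional suite against `env`.
    print(f"running functional tests against {env}")
    return True

def notify_slack(message: str) -> None:
    # Stand-in for posting to the team's Slack channel.
    print(f"slack: {message}")

def promote(version: str, last_stable: str) -> bool:
    for env in ENVIRONMENTS:
        deploy(env, version)
    if all(run_functional_tests(env) for env in ENVIRONMENTS):
        return True
    # Any failure rolls back both environments and notifies the team.
    for env in ENVIRONMENTS:
        deploy(env, last_stable)
    notify_slack(f"Functional tests failed for {version}; rolled back to {last_stable}")
    return False
```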
Both these pods receive the exact same traffic, so that we can analyze their metrics to see whether a performance regression was introduced. The metrics are compared automatically using the same technique we use for in-release validation, automated canary analysis, and we will explore it in more detail during the deep dive into automated canary analysis. Now let's come back to our initial to-do list and check the items corresponding to the first two sections, namely CI quality and pre-release validation, and let's see what reliable pre-release validation allowed us to safely change in our CD pipeline. We automated performance analysis, and we were able to automatically promote changes to non-production environments, thanks to functional validation. This way, we were able to eliminate the first manual step, which leaves us with four. Okay, so the next stage in our journey is to automate the canary analysis, so let's dive in. In order to validate a release, we deploy a new pod called the canary, which runs the new version, and a pod called the baseline running the old version, alongside the existing pods, called stable. We retrieve the metrics from the baseline and the canary in order to visualize and compare them in Grafana. Our validation dashboard is called Deployment Compare, and here we track the 165 metrics. Each graph shows a metric from the baseline and the canary respectively, and the release engineer's job is to spot any pattern difference between the two lines. This is a tedious process, of course, as we would need to check seven regions, and this is why we were running canaries in only two of them. In order to automate this step, we keep the same setup, only now we redirect the metrics to Kayenta, an open-source tool which uses a statistical algorithm to compare metrics and detect pattern differences. Kayenta will not know what to analyze out of the box; you will need to configure it to track the same metrics you were tracking manually. You can track various metrics, from purely technical ones to business-specific ones. Configuring Kayenta is easy: you just query your metrics backend as you do in Grafana. In our case, this is a parameterized Prometheus query. We configure the canary analysis to run for three hours in order to catch all kinds of behaviors and to make sure that there is no latent issue. It is also important to let the application warm up so that metrics can stabilize before starting to analyze them. We do an evaluation each hour, so that we can spot problems early and stop the analysis without needing to wait for the full three hours. So we have three evaluations in total, and each one of them yields a score. In our case, the final evaluation run needs to have a score of at least 99 to pass, while intermediary ones need to have a score of at least 95. It is important to have this tolerance interval, as sometimes there might be a small difference that is detected but corrected by the last evaluation run. Some of the metrics you track may have a severe impact on your application, and you would like to stop the analysis early if a difference is detected for them. These metrics can be marked as critical, and if they fail, the analysis is stopped irrespective of the obtained score. It's important to do so in order to avoid a prolonged impact. Okay, so now we have automated canary analysis completely set up, but we had a few challenges to make it reliable.
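Before getting into those challenges, here is a minimal sketch of the evaluation gating just described, with the hourly scores, the 95/99 thresholds, and the critical-metric early stop. The data structures and function names are hypothetical; Kayenta implements this for us, and this only illustrates the decision logic.

```python
# Sketch of the hourly evaluation gating: intermediate runs must score >= 95,
# the final run >= 99, and a failing critical metric aborts the analysis
# immediately, regardless of the overall score.

INTERMEDIATE_THRESHOLD = 95
FINAL_THRESHOLD = 99

def canary_analysis(evaluations: list[dict]) -> str:
    """`evaluations` holds one entry per hourly run:
    {"score": float, "failed_critical_metrics": list[str]}."""
    for i, run in enumerate(evaluations):
        if run["failed_critical_metrics"]:
            return f"aborted early: critical metrics failed {run['failed_critical_metrics']}"
        is_final = i == len(evaluations) - 1
        threshold = FINAL_THRESHOLD if is_final else INTERMEDIATE_THRESHOLD
        if run["score"] < threshold:
            return f"failed at evaluation {i + 1}: score {run['score']} < {threshold}"
    return "passed"

# Example: a small dip in the first evaluation that recovers by the final run still passes.
print(canary_analysis([
    {"score": 96.0, "failed_critical_metrics": []},
    {"score": 97.5, "failed_critical_metrics": []},
    {"score": 99.2, "failed_critical_metrics": []},
]))
```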
Microservices are very popular today, and one of the issues we faced, and you may too, is that we were observing a lot of differences when comparing response latencies between the baseline and the canary. The differences were small, but they were present. As opposed to manual analysis, Kayenta is very accurate and was detecting these small latency differences. An interesting fact is that we were detecting these differences even when comparing a version with itself. So we embarked on a journey to understand why we were reporting different latencies and how we could solve it in order to stabilize the analysis. Firstly, not all of your requests are equal. You have requests that you process normally, which take longer, and you also have bad requests that you terminate immediately. Bad requests usually do not arrive at a consistent rate, so one of your analysis pods may observe more of them than the other, and the overall observed latency will differ. This is why you need to be more granular when comparing latency. Let's see an example. From the graph, we can observe that the successful requests are equally distributed across baseline and canary, so the observed latencies will be the same. Now let's say that the canary observed two bad requests while the baseline observed none. Bad requests are fast, so they will pull the computed latency to the left, yielding a difference. So, to keep things simple, we recommend using the route and status-code dimensions in order to compare apples with apples; we'll show a small worked example of this below. Another thing is that Kubernetes indeed abstracts away the infrastructure, but the infrastructure may still have an impact on your analysis. For example, if your cluster uses different types of nodes, they may yield different performance. You can also be impacted if your pods are evicted in the middle of the analysis due to cluster autoscaling, or you can simply be impacted by a noisy neighbor. To account for this, you should use dedicated worker tiers just for the canary analysis, where the nodes have a taint, and we configure our pods with a toleration for that taint and with affinity rules in order to target those nodes. Another thing is that pods should have enough resources and not be impacted by external factors such as CPU throttling. This will definitely skew your metrics, because your analysis pods may be impacted asymmetrically. One of the solutions could be to use CPU pinning, where your pods get full CPU cores exclusively to themselves, and they will not be throttled at all. Or, if you are using dedicated nodes, you can set no CPU limit and let the analysis pods use the full CPU of the node; in these cases, you should monitor the CPU usage. The solution we are using is to have enough capacity in the cluster by having autoscaling in place using a horizontal pod autoscaler, and in this case we track CPU throttling in our analysis. Latency differences may also be observed if your analysis pods are scheduled across availability zones, and in order to fix this you can use affinity rules to schedule your pods in the same AZ. The aforementioned changes really brought more stability, but they didn't fix the issue for good. The thing is that not all pods with which your service is communicating are equal; they are impacted by the same issues that we've discussed before. So it really depends on whom you are talking to, as you may get a difference in response latency, even if it is the same service. So how could we solve this? One key observation gave us an idea.
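As promised, here is the small worked example for the per-dimension comparison. The route name, status codes, and latency numbers are made up: a couple of fast bad requests pull the canary's aggregate latency down, while the per-route, per-status view stays identical on both pods.

```python
# Toy example (made-up numbers): comparing aggregate latency mixes fast rejected
# requests with normal ones, while grouping by (route, status) compares apples
# with apples. Tuples are (route, status, latency_ms).
from statistics import mean

baseline = [("/collect", 200, 40), ("/collect", 200, 42), ("/collect", 200, 41)]
canary   = [("/collect", 200, 40), ("/collect", 200, 42), ("/collect", 200, 41),
            ("/collect", 400, 2),  ("/collect", 400, 3)]   # two fast bad requests

def overall(samples):
    return mean(latency for _, _, latency in samples)

def by_dimension(samples):
    grouped = {}
    for route, status, latency in samples:
        grouped.setdefault((route, status), []).append(latency)
    return {key: mean(values) for key, values in grouped.items()}

print(overall(baseline), overall(canary))   # aggregate: ~41 vs ~25.6 -> spurious difference
print(by_dimension(baseline))               # {('/collect', 200): 41}
print(by_dimension(canary))                 # {('/collect', 200): 41, ('/collect', 400): 2.5}
```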
We were observing a difference in latency only for low values, while for high latencies there were no differences; everything was fine. So we took a closer look at our timing metrics. We were using histograms. A histogram divides the time axis into intervals called buckets, and each bucket has a counter which is incremented for each request whose response latency falls between its boundaries. We can observe that at the beginning of the axis the intervals are more fine-grained, compared to the end, where they are more coarse-grained. This means that low latencies are measured more precisely, and the thing is that we don't want to be that precise in our measurements, as the actual environment cannot allow for such precision. So what we did was to remove some buckets, so that the low end is also more coarse-grained. One thing to note is that when we compute percentile latencies, Prometheus will place the requested percentile in a bucket, but instead of returning the interval, it will use linear interpolation to return a scalar value between the bucket boundaries. With the previous change to coarser buckets, we made sure that the requested percentile lands in the same bucket on the canary as well as on the baseline. Unfortunately, this does not guarantee that the interpolated value will be the same. Let's have a look at the formula used to compute this value, which is roughly: value = bucketStart + (bucketEnd − bucketStart) × (rank − countBelowBucket) / countInBucket, where the bucket is the one selected for the requested percentile and rank is the percentile times the total number of observations. We can see that the computed value depends on the number of observations in that bucket, and the problem is that observations across the canary and the baseline will never match perfectly due to the environment's imprecision, as even identical requests may yield a different processing time. So we need to allow for some entropy in the system, and the solution is to use effect sizes, a tolerance interval which allows you to specify a threshold by which the metrics can differ before they are considered failed. With these changes, we were really able to get rid of the latency differences and get a step closer to stabilizing the analysis. However, our journey did not end right there. The automated canary analysis wasn't stable yet, because we faced another challenge, the one with low-frequency metrics. Such metrics lead to small sample sizes for both the control and the experiment group, which affects the confidence of the analysis results. If the sample size is too small, we risk receiving either false positives or false negatives. Now let's see an example of an analysis that yielded a false positive. There was not enough traffic for this specific route, which means that the baseline and canary pods didn't receive the exact same number of requests. The analysis detected a difference which, percentage-wise, is quite significant, but in terms of the number of data points it was not. To ensure that our analysis remains statistically significant, we must exclude such metrics from it. We accomplished this by using a Prometheus query that filters out any metrics that do not have a consistent traffic rate over the last hour. The goal is to include only metrics for which Prometheus has recorded a non-zero value at each scrape interval, which in our case is every 15 seconds. Now let's examine the traffic rate for a particular route, for both baseline and canary pods, with a low and inconsistent traffic rate. There is a clear difference between the two plots, but the number of non-zero data points for this metric is below our predefined threshold of 240, which follows from the 15-second scrape interval over one hour that we explained earlier.
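As a concrete illustration of this filtering rule, here is a minimal sketch. The sample data and the PromQL-style expression in the comment are assumptions about how such a filter could look, not our exact query; the 15-second scrape interval and the 240-point threshold are the ones we just mentioned.

```python
# Sketch of the low-traffic filter: only include a metric in the canary analysis
# if it had a non-zero value at every 15s scrape interval over the last hour,
# i.e. at least 240 non-zero data points.

SCRAPE_INTERVAL_SECONDS = 15
WINDOW_SECONDS = 3600
MIN_DATA_POINTS = WINDOW_SECONDS // SCRAPE_INTERVAL_SECONDS   # 240

# A PromQL-style expression of the same idea (illustrative, not our exact query):
# count_over_time((sum by (route) (rate(http_requests_total[1m])) > 0)[1h:15s]) >= 240

def has_consistent_traffic(samples: list[float]) -> bool:
    """`samples` are the values scraped over the last hour for one metric."""
    non_zero = sum(1 for value in samples if value > 0)
    return non_zero >= MIN_DATA_POINTS

# A route with sporadic traffic gets excluded; a low but steady one is kept.
sporadic = [0.0] * 200 + [0.4] * 40   # only 40 non-zero points out of 240
steady   = [0.8] * 240                # under 1 RPS, but consistently non-zero
print(has_consistent_traffic(sporadic))   # False -> exclude from the analysis
print(has_consistent_traffic(steady))     # True  -> keep in the analysis
```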
So we can confidently disregard this metric from the analysis, because the observed difference is most likely a false positive. Now, in comparison, let's look at a traffic rate that is low but consistent on both baseline and canary pods. The traffic rate is quite low, even under one RPS, but it's consistently above zero, which means that we have enough data points to make the analysis meaningful. We checked one more item off our roadmap, so let's see how our deployment looks now. With this update, we were able to eliminate one of the most tedious parts of our release process, manual canary analysis, which leaves us with three manual steps. And thanks to automated canary analysis, we are now able to launch canary pods in all seven of our regions, whereas before we could manage monitoring only two. There are cases when a release engineer might want to continue with the release despite an automated canary analysis failure, such as when performance improvements have been achieved. But is it safe to roll out globally when automated canary analysis passes? We've talked about false positives and how we can mitigate them, but false negatives are particularly risky, because they hide a negative impact that becomes visible only after the global rollout. In this case, there was a difference between the baseline and the canary, but upon reviewing the plot we can see that there are very few non-zero data points for this specific metric, which resulted in a sample size that wasn't large enough for the statistical test to confidently declare that there was indeed an effect there, so it just concluded that there was no true impact. As a side note, we ran into this issue before we started excluding metrics with a low and inconsistent traffic rate from our analysis, but it is still a clear example that such metrics do not yield reliable results, be it false positives or false negatives; a release engineer wouldn't have been able to spot the difference during manual canary analysis either. We were able to notice the negative impact on our Grafana dashboards following the global rollout, and in this case the issue affected our availability, so we had to perform a rollback. Sometimes there are issues that can be pinpointed only in production, with a reasonable amount of traffic, and what we can do in this particular case is just limit the impact as much as possible. Consequently, we brainstormed ways to prevent similar setbacks from affecting all of our global traffic in the future, so let's see how we leveraged wave deployments for this purpose. Wave deployments allow for progressive monitoring of changes, making it easier to spot issues early, and they limit the blast radius, because when negative effects are present, only a part of the traffic is affected rather than all of it. Despite the reduced risk, they still keep a good balance between safety and speed, because you can still deploy changes frequently while minimizing the chances of a major issue. Our service is globally distributed across seven regions, which differ in terms of size and traffic pattern, and we decided to split our deployment into waves, grouping together regions with similar sizes and rolling them out progressively from the smallest to the largest. In this way, any regression would be isolated to a smaller group of regions instead of affecting the entire system. The deployment process was thus divided into three waves, each of which progressively increased in size.
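To make this concrete, here is a minimal sketch of a region-based wave rollout. The region names, wave sizes, and validation helper are illustrative, not our actual topology; the point is only the smallest-first ordering and the halt at the first failing wave.

```python
# Sketch of region-based wave deployment: group regions of similar size into
# waves, roll them out from the smallest wave to the largest, and halt the
# rollout at the first wave that fails validation.

REGION_WAVES = [
    ["region-a"],                                        # wave 1: smallest region
    ["region-b", "region-c"],                            # wave 2: medium regions
    ["region-d", "region-e", "region-f", "region-g"],    # wave 3: largest group
]

def deploy_region(region: str, version: str) -> None:
    print(f"deploying {version} to {region}")

def wave_is_healthy(regions: list[str]) -> bool:
    # Stand-in for the per-wave validation (functional tests and alert monitoring).
    print(f"validating wave {regions}")
    return True

def rollout(version: str) -> bool:
    for wave in REGION_WAVES:
        for region in wave:
            deploy_region(region, version)
        if not wave_is_healthy(wave):
            print(f"halting rollout: wave {wave} failed validation")
            return False
    return True
```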
We conduct functional tests on each wave after its regions have been upgraded to the new version. There is then an interval of 30 minutes in which the pipeline monitors for any alerts that might have been triggered following the wave rollout's completion, and if functional tests fail or alerts get triggered, the pods are reverted to the previous stable version. However, in this scenario the entire wave is affected. Can we take an additional step and limit the negative impact inside a wave? We were able to achieve this by implementing a traffic-based wave in addition to the initial region-based ones. This one involves directing a third of our traffic to a group of canary pods, which allows us to conduct a second automated canary analysis with a larger sample size, to catch any false negatives that the first run of automated canary analysis might have missed. Now let's take a closer look at what happens inside a region during the traffic-based wave. Our goal is to ensure that the group of canary pods receives a substantial portion of the traffic, 33% in our case, while still maintaining the region's overall capacity and avoiding over-provisioning. We now run automated canary analysis for a period of one hour, since the traffic directed to the canary pods generates enough data points to make the analysis reliable this time. Based on the automated canary analysis findings, we then decide whether to go for a full rollout in that region, or to just destroy the canary pods and halt the release right there. With the two-step wave rollout, we were able to remove another manual step from our CD pipeline, which leaves us with two manual steps. And with that we've ticked off the release validation section, and there is only one section left to cover. Now we are finally ready to automatically promote a release to production. As we are following a trunk-based development model, we commit changes frequently, and due to the fact that our continuous delivery pipeline has many quality gates, the rollout is slow. Moving each change through this pipeline wouldn't scale, and this is why we decided to take an asynchronous approach where we have a daily triggered deployment, which offers decent velocity while also keeping the release changeset small. And to be fair with you, by daily we mean Monday to Thursday. We replaced the manual trigger with a cron-triggered Jenkins pipeline, which is in charge of retrieving the latest version, creating the Git release, and starting the deployment pipeline; we'll sketch this trigger logic below. Once we removed the manual trigger, we introduced a new stage which checks whether there is any ongoing incident for our service, in which case we don't start the deployment and put extra pressure on it. Otherwise the pipeline remained the same, and we only have one manual step left. The question is whether we still need it, or could we simply remove it? For this last manual step there is no new technique, as we already have multiple quality gates, and no regression should find its way past the wave deployment step. But of course there can be misconfigurations in the pipeline. This is actually the step we are at at the present moment: we are iterating, building a track record of how the pipeline behaves, and fine-tuning and fixing things along the way. But even with this last manual step, the release process has been greatly improved, and the experience is much more tolerable for the release engineer.
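And here is the small sketch of the scheduled trigger we mentioned. The helpers are placeholders standing in for the real Jenkins cron trigger, Git release, and incident-management integrations; only the Monday-to-Thursday schedule and the incident check come from what we described.

```python
# Sketch of the daily deployment trigger: runs Monday to Thursday, skips the
# rollout when there is an ongoing incident, otherwise cuts a release from the
# latest main-branch version and starts the deployment pipeline.
from datetime import date

DEPLOY_DAYS = {0, 1, 2, 3}   # Monday .. Thursday (weekday() numbering)

def ongoing_incident() -> bool:
    # Placeholder: would query the incident-management system.
    return False

def latest_main_version() -> str:
    # Placeholder: would look up the latest validated version from the main branch.
    return "1.2.3"

def create_git_release(version: str) -> None:
    print(f"creating git release {version}")

def start_deployment_pipeline(version: str) -> None:
    print(f"starting deployment pipeline for {version}")

def daily_release() -> None:
    if date.today().weekday() not in DEPLOY_DAYS:
        print("not a deployment day, skipping")
        return
    if ongoing_incident():
        print("ongoing incident, skipping today's deployment to avoid extra pressure")
        return
    version = latest_main_version()
    create_git_release(version)
    start_deployment_pipeline(version)

daily_release()
```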
So, it has been a long journey, so let's have a quick final recap. Completely automating deployments requires various quality gates in order to ensure the same or even a higher safety standard compared to manual ones. We've seen some of the techniques we adopted, such as having a functional test suite and performance analysis to automatically evaluate functional and non-functional aspects with synthetic traffic. Having a production shadow environment gives us the possibility to validate even before hitting production, and in-release techniques such as automated canary analysis and wave deployments let us validate releases against live traffic. By doing so, the deployment rollout will definitely be slower, but this is not a bad thing, and we have a saying that good deployments are slow deployments. One of the major steps and challenges in automating this whole pipeline was to stabilize the canary analysis, and we saw many techniques to avoid false positives and negatives. Our recommendation is to implement them in an iterative way and on a per-need basis. Also, sometimes you can catch a regression only in production, and by adopting a wave pattern the impact can be reduced without breaking any SLAs. Building a strong CD pipeline is an iterative process, so we are always on the lookout for how to improve it, and next on our radar is to also analyze and compare application logs. And of course, if you have any suggestions for us, we are more than happy to hear them. Finally, you can adopt any of these techniques, as we have been using only open-source technologies to implement them, and as we mentioned in the beginning, you can adopt them even if you have a different technology stack than ours. Thank you, this was it, and I think we also have time for questions. Thank you. No, we aren't. So we have been using Spinnaker, classical CD, nothing with GitOps, but we are looking at Argo and possibly we'll migrate to Argo in the future. But we'll keep the same practices; all of this we'll just need to migrate on top of Argo. We don't know yet exactly how we'll do that, but this is our plan, possibly we'll migrate to Argo. So yeah, eventually we'll get to GitOps in the near future, but keeping the same practices. Yeah, I'll pass the mic. Yeah, I have a question. So, overall everything is very impressive and it's very interesting what you've done, but from your canary analysis I see it's mostly done for stateless types of applications, or just for Deployments, not for StatefulSets and not for applications which work heavily with databases. Is that so, or do you also do this analysis for stateful applications and applications which have a shared connection, for example to a database, and if you do, how do you handle the rollout and wave deployment in this case? So, the application that we build is stateless, you are correct. What we call data planes are highly traffic-intensive, but they are stateless, and we also have control planes, which work with databases. The problem with those services is that they don't have that much traffic, because they are control planes and they receive only some traffic, and on those services you cannot run canary analysis, because you have basically no metrics and nothing there to analyze. So for those services we rely mostly on functional validation, on the functional test suite. I hope we answered your question. My question is: what do you do if you have already deployed to production and you notice an issue?
Do you have another pipeline to roll back, or do you run the same pipeline with the stable image? So yeah, for hotfixes of course we have a backdoor where we just go and release it. The pipeline is configurable: in Spinnaker we have a Boolean variable for each region, and if you have all of them on true it will just do a full global rollout, and you can select in which regions you want to deploy. It depends on a per-case basis, but we have a backdoor to skip the automated canary analysis, the wave deployment, and all these automated steps in case it is needed, for example to perform a hotfix or an immediate rollback to the previous version. We also have a variable with which we can turn off the automatic deployments in Jenkins, the cron that triggers this whole deployment process, and if we have an incident that is affecting our services, we just turn that off so as not to put extra pressure on the regions, and we focus on fixing the issue. On the people side, I was wondering if you had much pushback on removing those manual validation gates, and if you did, how did you work your way through it with the other teams and such? Do you want to take this one, Anka? I think we can both answer it. Yes, so it's a really good question, and yes, we faced pushback, and it's hard to convince people, because right now we're doing the release with one engineer from our team; we have an on-call schedule and everyone gets to be the release engineer. So until now, someone was really overseeing the whole process and detecting all these problems, and now it's hard to say, hey, don't even look, it will just go automatically; it's a hard sell, and this is why, at the step we are at, we still keep that manual check. Basically it's nothing: it's just for an engineer to come and have a look over one of our dashboards, a cross-region comparison, to see that everything is in shape, to validate the release and say, okay, it's fine, or perform a rollback if needed. This is also why we wanted to have this daily cron trigger: to build a track record and be able to say, look, in six months we didn't have any incident, or we had just one incident that would have happened either way, even if a human had been there. So basically I think the solution is to build a track record over time in order to convince others that, okay, now we can really look the other way, not watch every release, and let the pipeline do its thing. Yeah, I mean, it's understandable that, since our service is super critical to our business, we can encounter resistance towards fully hands-off deployments, but also, as we presented at the start, there are 165 metrics to track for each region, and we started from there. It's easier to let the machine automatically compare these metrics instead of having an engineer just looking at dashboards and possibly missing something. Even before, when we were working on other services, it was funny: over time we would notice, hey, when did our peak drop? We had something like 1,000 RPS for a single machine when we were running on EC2 instances, and then we were down to 800, and how did this happen? We were always validating the releases. Because, yes, you may still miss something.
As we were saying, Kayenta is very accurate and was detecting these differences, while as humans, looking in Grafana, we didn't even spot those one or two milliseconds; we couldn't see them anyway. So yeah, we think it's best to let computers do the repetitive work. But you need to prove it, to have a track record; this is the key. And do it incrementally. We started with inoffensive changes, let's say, just automatically promoting them to production and seeing how our entire CI/CD pipeline behaves, and then we moved on to more and more dangerous changesets, let's say, to see how automated canary analysis and wave deployments and so on work. Okay, we hope we answered your question. Thank you. Thank you. Thank you, everyone.