Hello, everyone. Welcome to our talk. Today, we'll be talking about how at Databricks we have integrated Argo Rollouts to achieve fully automated, zero-downtime releases. My name is Rohit Agarwal, and I'm a software engineer at Databricks. I'm currently focusing on application traffic infrastructure and safe deployments. Some of my ongoing projects aim to make deployments zero-downtime, safer, and more stable using technologies like Argo Rollouts. I'm also an active contributor to the Argo project. I'm here with my colleague, Gavin.

Hello, everyone. My name's Gavin Cleiger. I'm also a software engineer at Databricks, where I work on our release platform, and I lead our health-mediated release efforts.

A little bit about Databricks. We are a unified analytics platform helping our customers make better predictions with their data. We have offerings across all three major clouds, and we are actively hiring, by the way, if you want to go check our careers page. Now, to provide a sense of the magnitude of releases at Databricks: we operate in over 100 regions across all three major cloud providers. We have over 300 microservices, and we add new microservices every week. As a result, we conduct over 10,000 releases every day. Here is an example of a fully automated Spinnaker pipeline from a developer's perspective, which we use for one of our microservices. We conduct releases in different phases, or waves, grouping customers based on their usage. As you can see, there are around 100 contexts in this pipeline. At this scale, everything must be automated, from version creation to release to health checks to rollbacks. I'll now hand it over to Gavin to elaborate on what makes our releases safe.

So before we talk about the architecture we've used at Databricks to make releases safe, I want to talk a little bit about the primary factors that determine the outage severity of a release-related regression. There are two primary factors that we want to mitigate when we're talking about reducing the severity of regressions from releases. The first is the number of impacted customers, or your blast radius. Basically, we want to catch an issue while it's affecting a small proportion of traffic and customers. The second factor is the duration of the outage. We want to roll back and revert a breaking change as quickly as possible. These two factors are the primary foundation of our health-mediated release effort.

Now, I want to talk briefly about how a standard release looks in Kubernetes with these two factors in mind, before we take a look at the architecture with Argo. In a standard Kubernetes release with a deployment using a rolling update, you'll see the following. There will be a stable replica set managing pods for your service running on your old version. There will be a canary replica set running pods for your service on the new version. And the rolling update will move either a pod or a group of pods from the old replica set to the new replica set over time. Now, in Kubernetes, this is basically done as fast as possible within the constraints you've defined. So as soon as some pods are ready and available in the new replica set, it'll start working on the next set of pods. If we think back to the principles we just talked about regarding release safety, we'll notice that this already starts to violate our principle of limiting blast radius. Most releases with a deployment are going to complete on the order of minutes.
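For reference, a minimal sketch of what that standard rolling-update setup looks like on a deployment; the service name, image, replica count, and surge settings here are illustrative rather than Databricks' actual configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                # hypothetical service name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%               # how many extra pods may exist during the update
      maxUnavailable: 25%         # how many pods may be unavailable during the update
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: registry.example.com/my-service:v2   # the new version being rolled out
```

Kubernetes drives this update forward as fast as maxSurge and maxUnavailable allow; there is no built-in notion of pausing to evaluate the health of the new version.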
So if you have a regression that becomes apparent seven minutes after a release, well, it's going to be in all of your pods, and it will be affecting all of your traffic. That's not a good blast radius.

Now, let's talk about the case when there is a regression. We can see here that the pods from our service are emitting metrics. In this case, our metrics provider is Prometheus and M3, but you can imagine any metrics provider you'd like. In the case of an outage, there will typically be an alert fired based on those metrics for your service. A developer will receive that alert. They'll have to respond to the alert. Perhaps they look at a dashboard, try to guesstimate the health of their service, and then they take an action to roll back. In this case, you can imagine they trigger a Spinnaker pipeline to roll back, however you manage that at your company. Again, thinking about our second principle of release safety, we see another problem. The time to roll back is high here. There is a human in the loop. Someone has to get a page and look at those dashboards. We're talking minutes and minutes, extending an outage. Additionally, any system with a human in the loop is not going to scale well. When you reach a cadence of releasing multiple times a day across numerous services, across dozens or even hundreds of contexts, this is no longer feasible. Human beings cannot investigate every single release, and if they do, the release cadence will slow to a screeching halt, because teams are just not going to want to release that frequently.

All right. So we're now going to talk about how Databricks solves this, and to start, I'm going to introduce two concepts from the Argo Rollouts project. The first concept I'd like to introduce is the rollout itself, the titular rollout. A rollout can be thought of as a custom resource that is a drop-in replacement for a Kubernetes deployment. It is almost exactly identical to a Kubernetes deployment, except for two crucial differences. You'll notice the strategy field is different, and it allows us to specify a much more granular way of managing and updating our pods. In this case, you'll see a steps field. In this rollout, it's configured to update 20% of your pods, wait 10 minutes, update another 40%, wait 10 minutes, and so on. These are example numbers, but you get the sense of what we're accomplishing here. Instead of just updating all your pods over the course of six minutes, you can roll them out gradually, and that allows you to limit blast radius. You'll also notice an analysis field. This is the crucial feature of Argo Rollouts: an analysis defines a series of health checks for your service. As Argo updates the pods for your rollout, it will continuously run those health checks in the background. When it detects that your service is unhealthy, it will immediately trigger a rollback automatically, with no human in the loop. Otherwise, at the end of your release, things look exactly as they do using a standard deployment. And again, a rollout really is identical to a deployment except for the strategy field.

Okay, so we talked about this analysis that's running health checks against your service. What does that look like? Well, it's another custom resource in Kubernetes, also part of the Argo project, called an analysis template. An analysis template is literally just a list of metrics to be run for your service during a release.
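To make the rollout half of this concrete, here is a minimal sketch of a Rollout along those lines, assuming a hypothetical analysis template named service-health; the percentages and durations are only examples:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-service
  template:                        # same pod template you would put in a deployment
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: registry.example.com/my-service:v2
  strategy:
    canary:
      steps:
      - setWeight: 20              # update roughly 20% of pods
      - pause: {duration: 10m}
      - setWeight: 60              # another 40%
      - pause: {duration: 10m}
      - setWeight: 100
      analysis:
        templates:
        - templateName: service-health   # health checks run in the background throughout
```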
We can see on the slide that I've defined a single example metric, but you can imagine you'd have numerous metrics evaluating the health of your service to get a holistic view of health. A metric is pretty simple to define. Again, here we're using an M3 or Prometheus metric, but Argo supports numerous metric providers out of the box, pretty much all the big names you can think of. If we look, we see the query. This basically defines the check to be run against your metric store. In this case, we're looking at a 4XX count. You'll also notice a failure condition. This can be thought of as a red line, and if our metric violates that red line a certain number of times, Argo will say, hey, something's wrong, we need to roll back.

You'll notice, if you look even closer at the metric, we also have a field here: hmr_role equals canary (HMR being our shorthand for health-mediated release). So what is this? One of the benefits of Argo Rollouts is that it will dynamically label pods during a release. This means that pods in your canary replica set, running the new version, can have a Kubernetes label attached to them dynamically that says canary. Similarly, pods in your stable replica set, running the old version, can have a label that says stable. Why is this useful? Well, it allows us to target our health checks exactly against pods running our updated binary. This means even if you have 40 pods for a service and you've only updated one of them, you can still get a pretty good signal on the health of your new version by focusing on just that single pod.

All right, and just to make this even more concrete one last time, we can see here an example of an analysis run that failed. Specifically, we've highlighted the metric that failed. You can see here we're looking at an RPC error rate. It's specified that if it repeatedly drops below an SLO of around four nines, that should trigger a rollback. And here we can see it did that three times, and so Argo has rolled back. You can imagine numerous metrics just like this being evaluated simultaneously all the way through your release.

Okay, so now that we've built our foundation, let's take a look at what this looks like all together. You'll notice first that we've replaced our deployment with the rollout resource. Again, what this means is that we're going to have a release strategy that updates our pods gradually and limits our blast radius. You'll also notice that the Argo Rollouts controller is now in the picture. This is a microservice, the only microservice in the Argo Rollouts project. It will continuously run our health checks against our metrics provider for the duration of a release. Now, in the happy path, as our health checks pass, Argo will tell the rollout to continue proceeding along our specified steps. What about the unhealthy path? Well, Argo will see a health check or a series of health checks fail, and it will then interface directly with the rollout to trigger an instantaneous rollback. What this means is it will scale the canary replica set back down to zero and scale your stable replica set back up to the desired number of pods for your service. This is basically as fast as you can possibly roll back a pod workload in Kubernetes. And this will, by the way, maintain all the safety constraints that you have in your deployments today: max unavailable, et cetera.
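A sketch of the kind of analysis template just described, assuming a Prometheus/M3 backend and assuming the canary label Argo applies is surfaced on the metrics as hmr_role; the address, query, and red line are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: service-health
spec:
  metrics:
  - name: http-4xx-count
    interval: 1m                               # re-evaluate every minute during the release
    failureLimit: 3                            # three violations trigger a rollback
    failureCondition: result[0] > 10           # the red line: more than 10 4XX/s is unhealthy
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090   # hypothetical metrics endpoint
        query: |
          sum(rate(http_requests_total{app="my-service", code=~"4..", hmr_role="canary"}[5m]))
```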
Okay, so now, one of the first things you'll think about when starting to implement a solution with Argo Rollouts is managing health checks, because pretty much everything else works out of the box, but you have to come up with some definition of health for your services. So I thought it would be helpful to walk through some of the best practices we've discovered at Databricks solving this very problem.

The first useful thing to do is to build a default library of health checks that works for your services across the board. Now you may be thinking, well, where do I start? Like I mentioned earlier, it turns out we've all pretty much started already. Think about your alerting configuration: there's probably a standard set of alerts for services at your company. Things like pod restarts, CPU utilization, even RPC error rates depending on your framework, all sorts of generic metrics that are pretty much the same across the board for 90% of services. And since, as we saw earlier, a health check is basically just a query against a metrics provider plus a red line, it's very easy to convert alerts into health checks and vice versa. You can even have a common source of truth.

All right. Now, once you've established this library of health checks that defines a basic level of service health, teams are going to start coming to you and asking for even more. So, for example, the GraphQL team may come to you and say, man, we love this generic layer of health checks. Now we roll back automatically when there are weird Kubernetes-level problems or when we have RPC errors, but we have this specific metric for our service, this specific error count, that we really want to act on. Can we define a custom metric? And so you want to provide a very easy interface through templating for developers to write their own custom health checks. It turns out this is pretty simple, because, as we saw, a health check is again basically a query and a red line. And teams probably already have custom alerts that they can just pull over. Additionally, Argo has a dry run feature, so when a team, or even you in your generic library, is introducing a new check, you can enable it in a dry run mode where it won't affect the outcome of your analysis. We'll hear a little more about that later.

All right, the third point I want to make is more of a cultural principle, which is that you want to have a self-healing and self-improving feedback loop. When there's an outage at Databricks, part of the post-mortem process is asking the question: could a health check have prevented this regression from being introduced by a release? If so, we add that health check. And in that way, over time, your service health checks become more and more comprehensive. On the flip side of the coin, every now and then you'll have a rollback that isn't all that helpful. When those happen, you want to make sure you're evaluating: is this red line configured properly? Is this check asking the question we want to answer? If you implement those cultural principles, your library will just improve gradually over time.

All right, thanks, Gavin. I'll take a minute to highlight the dry run feature that Gavin was mentioning earlier. But before we dive deep into dry runs and their benefits and necessity, I'd like to provide some background on the dry run feature that we have in Argo Rollouts.
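As a sketch of that templating interface: an analysis template can declare arguments, so a team only supplies its own query and red line, and a new check can be listed under dryRun (the mode covered next) so it cannot affect the outcome while it matures. The names and values are illustrative, and the exact dry-run syntax shown is an assumption based on newer Argo Rollouts releases:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: team-custom-health
spec:
  args:
  - name: custom-query             # the team's own Prometheus query
  - name: red-line                 # the team's own threshold
  dryRun:
  - metricName: team-custom-metric # evaluate this check without letting it fail the analysis
  metrics:
  - name: team-custom-metric
    interval: 1m
    failureLimit: 3
    failureCondition: "result[0] > {{args.red-line}}"
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: "{{args.custom-query}}"
```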
We wanted to make this process of defining new health checks in production as smooth as possible, so we went in and contributed this dry run feature to Argo Rollouts. Essentially, the main question a dry run answers is: how would Argo have evaluated a given query? Dry runs can be configured at the analysis template level, or at the rollout or experiment level. It's important to note that dry runs do not affect the final state of the rollout. We receive a summary of the results at the end, and you can use it to do further analysis to see which checks are succeeding and which checks you want to promote to regular, wet-run checks.

Here is an example of what the dry run configuration looks like in an analysis template. As we can see here, we have a few metrics for the total number of 5XX and 4XX errors coming out of an RPC service, and we have marked the total-5XX-errors metric as dry-run, meaning it won't impact the final state of your rollout. To see an example of how the dry run mode actually works: an analysis template can consist of both wet-run and dry-run checks. As previously noted, any failures in dry-run mode won't impact the final state of your rollout. It's good practice for a developer to start adding checks in dry-run mode and then promote the checks that consistently succeed to wet-run mode. They can collect metrics until they are fully confident in the maturity of their new checks before graduating them to wet-run checks. For instance, in this example, as you can see, there are eight wet-run checks, all successful. There are also some dry-run checks that failed, but they did not impact the final state of the rollout, which succeeded, and we did not trigger a rollback.

Let's look at a hypothetical user journey for implementing a new health check. Initially, a developer would add the new health check in dry-run mode and collect metrics over the next several runs. If there are no false positives, the check would then be promoted to a wet-run check. Alternatively, if there are any false positives, the developer would continue iterating and refining the check until all the false positives are eliminated.

Next, let's talk about visibility a little bit. One of the key advantages of Argo Rollouts is that developers do not have to actively monitor all their releases. However, Argo provides an excellent user interface for monitoring releases. As shown here, it provides a step-by-step breakdown of the entire release, including the analysis results of the various health checks at each step. Additionally, it can trigger a notification webhook to create a Jira ticket for you, or send an automatic email in the event of a rollback.

All right, now let's talk about some of the more advanced features that we use at Databricks. One cool feature I want to highlight is traffic shifting. Previously, Gavin highlighted the significance of limiting the blast radius. By default, Argo Rollouts provides a staged, pod-by-pod update, which is a considerable improvement over the traditional deployment method. But we can do even better. Using Argo's native integration with Istio, we can do more granular, percentage-based traffic routing. What exactly does this mean? Let's take a closer look at an example. Consider a microservice with three pods. With the traditional setup, without Istio, the smallest amount of traffic we can send to the new version that you're canarying is 25%.
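The relevant part of a Rollout spec for that kind of setup might look roughly like the following, assuming an existing Istio VirtualService and stable/canary Services; all names, weights, and durations are illustrative:

```yaml
spec:
  strategy:
    canary:
      canaryService: my-service-canary       # Service selecting canary pods (hypothetical)
      stableService: my-service-stable       # Service selecting stable pods (hypothetical)
      trafficRouting:
        istio:
          virtualService:
            name: my-service-vsvc            # existing VirtualService (hypothetical)
            routes:
            - primary                        # the route whose weights Argo manages
      steps:
      - setCanaryScale:
          replicas: 1                        # one canary pod, ~25% of a four-pod fleet
      - setWeight: 2                         # but only about 2% of traffic goes to it
      - pause: {duration: 10m}
      - setWeight: 10
      - pause: {duration: 10m}
```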
Argo Rollouts' integration with Istio allows us to do more fine-grained traffic routing. As we can see here, even though 25% of the pods belong to the new replica set, we are only routing a tiny percentage of the traffic to the new version. This helps limit the blast radius and avoids impacting your biggest users. I'll now hand it over to Gavin to discuss how these features operate in practice.

Great. Yeah, so out of the box, we've talked about how Argo Rollouts gives you these automatic rollbacks, where it will scale up your stable replica set and scale down your canary replica set. But what if you're deploying more than just a rollout? At Databricks, it's not uncommon to deploy a rollout, some config maps, some secrets, perhaps even other workloads. And we want the property of atomicity, meaning if we update five entities together, they either all end up in a happy updated state or they all get rolled back. So how do we accomplish this? With a very simple wrapper around kubectl that we call kube-cfg. You can see the interface is exactly the same as a standard kubectl invocation, but under the hood we've added a tiny bit of functionality. First, we identify the resources that are going to be updated or created by our kubectl call, and we snapshot the stable versions that are currently running. Next, we execute our kubectl command as usual to trigger our release. And lastly, we simply monitor the pod updates. So we're looking at the Argo rollout. If it ends happily and everything's updated, then we exit; things are great. If we see the rollout is marked as failed and has been aborted, meaning it's been rolled back, then we just reapply all the stable manifests we snapshotted earlier. And so this is the simplest way to achieve total atomicity, and suddenly you can roll back basically anything automatically, using Argo as your oracle for "are things good?" If not, things go back.

Okay. Now, let's say you've rolled out Argo Rollouts across your company. All your services running as deployments are happy. But what comes next is that all of your teams using stateful sets or daemon sets come to you and they say, man, everyone else has this great health-mediated release. Outages have been going down. Our manager wants us to have something similar. We're getting a lot of pressure from leadership. How can we get health-mediated releases? How can we get these automatic rollbacks? And luckily, Argo Rollouts comes with a custom resource out of the box that can be used for just this. It's called the analysis run. Earlier, we talked about an analysis template. If you remember, it was just a sequence of health checks that would be automatically run during your rollout release. Well, an analysis run is almost identical, just a bunch of health checks. The only difference is that the user specifies how long it's supposed to run; it's not tied to the actual service release.

So here's how you can use this. Again, we do our kube-cfg apply. We pass in our Kubernetes manifests, probably your stateful set, and you also pass in your analysis run. Under the hood, things look very similar: identify the resources being updated, retain the stable versions, execute your kubectl apply, and lastly, monitor, this time, the analysis run. So again, we're just going to poll the analysis run, however you want to implement it, and ask: is this analysis healthy? Are things going well, has it completed successfully? If so, great.
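A minimal sketch of such a standalone analysis run; it has the same shape as an analysis template's metrics, with count and interval bounding how long it runs. The name, query, address, and threshold here are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
metadata:
  name: my-statefulset-release-check          # hypothetical name for this particular release
spec:
  metrics:
  - name: rpc-error-rate
    interval: 1m
    count: 30                                 # measure once a minute for ~30 minutes, then finish
    failureLimit: 3
    failureCondition: result[0] > 0.01        # more than 1% errors counts as a violation
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(rpc_errors_total{app="my-stateful-service"}[5m]))
            /
          sum(rate(rpc_requests_total{app="my-stateful-service"}[5m]))
```

A wrapper like the one described above then only has to watch whether this run finishes in a Successful or a Failed phase.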
If instead we notice that the analysis has failed, then we reapply the stable manifests. And so we can basically outsource all of this health mediation to the Argo controller, even for non-deployment workloads.

Okay, one last thing. Once you've built out this framework and you've got HMR enabled for your services that are deployments and stateful sets and daemon sets, and you've got a great library of health checks, and maybe you're even doing traffic shifting, you're going to get to the question of: how can we do even better at validating the health of our service? Out of the box, Argo has first-party support for custom webhooks. So far, as part of your analysis, we've seen talking to M3 or your metrics provider to do basic querying. But you can also just talk to an API endpoint, and that API endpoint can do whatever kind of evaluation you want. So at Databricks, to take things to the next level, we've built a service that we call the evaluation service, with a pretty simple API, for various kinds of evaluation that can't be performed as simple queries against a metrics provider.

One such form of analysis is A/B analysis. Before, we discussed red-line analysis, where we compare a metric to a predefined static threshold. In A/B analysis, we simply compare metrics from our canary pods to metrics from our stable pods and see whether they're from the same statistical distribution. And this has some great properties. If we go to the next slide, we can see an example of a metric where an RPC error rate violated its red line. But notice that in yellow, the canary violated the red line, and in green, the stable also violated this red line. Now, if something like this happened with a red-line check, you might roll back. But should we? Well, the problem isn't isolated to the new version of our service; it's across both versions. So rolling back is probably not the answer, and this might be an anomaly or something unrelated to the version of our service. With an A/B analysis, your evaluation service can just say, oh, these look like the same distribution, things are fine, I'm not going to roll back. On the other hand, if things look like they're not from the same distribution and it's in a bad direction, like, oh, my CPU utilization is way up or my error rate is up, then an A/B analysis would roll that back.

All right, looks like we have about four minutes left. This is the end of our prepared content, and I want to make sure we have some time to take some questions here. So Rohit and I will happily take a few minutes of questions. I guess people can raise their hand and we'll go from there.

Okay, so thanks for the presentation. I wanted to ask, in your case, how do you deal with long-running processes, usually workers that do not accept, let's say, HTTP-based traffic, but will still be there running forever? Do you use Argo Rollouts for such scenarios, and if you do, how do you manage to release them?

Yeah, so we use Argo Rollouts exclusively for updating Kubernetes manifests. If something's a long-running process, we'll look at the metrics for the service before and after the update and see what they look like. And so that's how we use Argo. We would not use Argo for something that's not a native Kubernetes entity. Does that answer the question?

Nice presentation, thank you. I wanted to ask, have you considered just using the built-in support for Kayenta, for instance, for statistical analysis? I'm referring to the last part, about extending analysis with a custom service.
Sure, so he's speaking of Kayenta, which is a component of Spinnaker. It's its own standalone service, but it's from the same project. Kayenta offers some A/B analysis engines to perform this kind of analysis. We did experiment with it. When we built our evaluation service, we actually reused a lot of the logic from Kayenta, which is open source, so we kept the actual statistical evaluations, but we wanted control over our own API for evaluations. Additionally, Kayenta was a little difficult to work with out of the box, but it's very simple to build an evaluation service that does the same statistical work with your own nice API, and then you have a little more control. But yes, Kayenta is an open-source service you can use for A/B analysis if you want, and you could certainly use web calls to it. I think it has a UI, too. Any other questions?

Hi, thanks for the presentation. You already mentioned atomicity and your kube-cfg tool. I was wondering: if you go from version one to version two of your deployment via a rollout, but you do the same via a config map and secret, they can also introduce errors. Are there mechanisms that prevent you from ending up with a rollback scenario where you would be rolled back to version one but still have version two of the secrets or config maps?

Yeah, so the way we handle rollbacks with our wrapper, kube-cfg, is that we snapshot everything that's being updated by the kubectl command, and that's exactly what we revert. So the way the system works, you're always supposed to end up in the exact same state you started in. There is a little bit of complexity here. You can imagine if you have multiple workloads or multiple processes that are updating the same underlying resources. For example, if you have three release pipelines that are all touching the same config map, and then sometimes you update the config map with a rollout, you can get into some weird behavior. So best practice is to just not allow that, right? Every resource should be managed by one release process only. Anyone else?

Thanks for the great talk. A question: do you know if there's support coming for resources other than deployments in Argo Rollouts, so that daemon sets and stateful sets, and maybe also Argo applications, would then be supported for rollouts?

Yeah, I can't speak too much to the project's super long-term roadmap. It's an open-source project. I will say there are open tasks to build a new controller. It's actually pretty difficult: the way a deployment manages pods is completely different from how a stateful set manages pods, and so doing that will require basically a new controller. So the work there is non-trivial, but I would expect it in a few years; that's where the project's heading. Like I said, today you can still sort of get health-mediated releases and automatic rollbacks by just leveraging the analysis run to perform the analysis and so forth. You just have to trigger it and look at it yourself. Yeah.

Okay, the last question. So does the rollout object actually have a deployment in the backend, or does it manage replica sets itself? It is a drop-in replacement for a deployment, so you will no longer have a deployment; you will have a rollout. And the rollout is built on the standard deployment framework, so it manages pods very similarly. We saw the same two replica sets, et cetera. The only difference is a somewhat more sophisticated strategy.
So the actual way it updates pods is more sophisticated than a standard rolling update. But yeah, there is no deployment here; your deployment is now a rollout. And like I said, if you want to convert a deployment to a rollout, you change the kind and then you just update your strategy. It's very simple. Thanks. Yeah.

Okay. Thank you. If you have any other questions, catch the guys off stage. Thanks a lot. Thank you all.