Good afternoon, everybody, and welcome to our talk. Today, we're going to talk about how we at Databricks are using the Argo ecosystem, combining it with multiple cloud-native technologies such as Prometheus, Kubernetes, and Envoy, to support our model-serving system and scale it to meet our customers' requirements. We'll talk about a couple of challenges and how Argo is helping us address them. This talk will be fairly introductory, given the time constraints and the number of challenges we want to cover. Awesome. Your presenters for today: Hi, this is Rohit Agarwal. I work on the traffic platform team at Databricks. My team mostly handles the ingress and egress stack, things like service mesh. And I'm here with my colleague Arjun. Hey, I'm Arjun Dikuna, and I work on the GenAI Serving team at Databricks. I focus on the ML ingress stack. Before I continue, I'd like to introduce Databricks to those of you who don't know about it. Databricks is a SaaS platform with a very simple mission: we want to democratize data and AI. What does that mean? We give you a unified platform where every individual, not just engineers, has the tools and capabilities required to make sense of their data and drive better data-driven decisions. This includes your ETL workloads, your model-training stack, and your model-serving stack, stitched together with really good lineage tracking and combined with a governance framework. We've grown rapidly over the past few years; we're over 7,000 Bricksters strong globally, and we are hiring, so do check out our careers page. Let's talk about the AI lifecycle. When you start with the AI lifecycle, you will generally start with some raw data set. You take that data set, clean it, and build your training data from there, and optionally a feature store as well. From there, you train a number of models on that data and evaluate them on some metric that you care about. Then you choose the best-performing model and deploy it, which means that with all the investment you have made in the entire lifecycle, the actual business value is only realized once you deploy that model. So you need to make sure that the deployment is a good, production-grade deployment. Enter GenAI serving with Databricks. Databricks allows you to deploy any custom model that you care about, trained on any framework you've been using: TensorFlow, PyTorch, whichever. It supports CPU and GPU serving. In addition, Databricks also curates a list of top foundation model APIs so that you can start experimenting with model serving without having to go through the training process. And you can also use GenAI serving as a gateway to external model providers, such as OpenAI. You get all of this with a unified UI, API, and SDK to manage all these types of AI models. It is serverless out of the box, which means our customers can make the availability-versus-cost trade-off using request-based autoscaling. It also comes with a scale-to-zero option, which is really useful when you're developing your models and testing them out before you productionize them. As with any real-time serving system, it has SLA-backed availability, low latency overhead, and secure deployment. It integrates cleanly with your feature stores, which could be online in nature, and your vector indexes to power the RAG applications that everyone cares about today.
And then you have governance with Unity Catalog, and request-response logging with inference tables to help you debug errors. Cool, our next slide, one more. Since its launch last year, it's been less than a year since we GA'd GenAI serving, we serve 1,000 weekly active customers, and there are about 5,000-plus active endpoints being queried. We support a max of about 25,000 QPS per cloud region, and it's currently GA on AWS and Azure, with GCP coming soon. So let's talk about the first focus of our talk, which is the GenAI serving ingress stack, and see how Argo helps us release this ingress stack and update it with high confidence and stability. The ingress stack is crucial to our serving system. As we said, it supports up to 25K QPS for our customers. In addition, it does authentication and authorization verification, it does request-based and concurrency-based rate limiting, and it exposes metrics for consumption both by Databricks engineers internally and by our customers externally, helping power their dashboards. Which means we really need to be careful with this ingress stack. Next slide. Now, updating this ingress stack is tricky, right? Because our customers use GenAI serving for a myriad of applications across multiple verticals: healthcare, banking, manufacturing. My favorite use case is GoGuardian using GenAI serving to monitor the internet usage of school children to try and prevent self-harm tendencies. Given these mission-critical use cases, it's imperative that whenever we update the ingress stack, we have zero downtime. We cannot have any 4xx or 5xx errors during an update. We cannot have any regression in latency or performance; a lot of these applications are very latency-sensitive. And finally, if there is ever an issue during a rollout, it should roll back automatically to restore the last-known-good (LKG) state. Wow, updating the stack is kind of scary. So you could say, hey, let's not update the stack. Let's deploy it once, make a really awesome ingress stack, and call it a day. Unfortunately, that is not good deployment practice. With security patches coming in regularly and improvements we can constantly make in our ingress system, we have to update the stack on a pretty regular cadence. So if we have to update the stack, let's look at what makes a release safe. There are two primary factors to consider when you look at the impact of an outage caused by a bad release. The first is the number of impacted customers, and the second is the duration of that outage. These two factors give you the two important principles for reducing the impact of an outage. The first principle is to reduce the blast radius, so you can catch breakages before they affect a large number of customers. The second is to reduce the rollback time: roll back as soon as you find an issue so that you can limit the impact of the outage. Cool. So let's look at how a standard Kubernetes deployment works today. You have a deployment and you create a new version of your service. What the deployment does is manage two replica sets: the old, stable replica set that's running the current stable version of your service, and the new canary replica set where it's going to deploy your new service. What the deployment's rolling update does is move from the older replica set to the new replica set as soon as possible: it creates the new replica set's pods and terminates the older pods right away.
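For reference, a minimal sketch of that baseline, assuming hypothetical names and values rather than our actual manifest, looks something like this:

```yaml
# Minimal sketch of a standard Deployment rolling update
# (hypothetical names, image, and values; not the real ingress manifest).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-ingress
spec:
  replicas: 10
  selector:
    matchLabels:
      app: ml-ingress
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # new pods are created as fast as the surge budget allows
      maxUnavailable: 25%  # and old pods are terminated just as quickly
  template:
    metadata:
      labels:
        app: ml-ingress
    spec:
      containers:
        - name: envoy
          image: example.registry/ml-ingress:v2   # the new version rolls straight to 100%
```

With this strategy, the only gate between versions is pod readiness; there are no pauses and no metric checks along the way.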
But that kind of violates the first principle of a safe rollout: if something goes wrong, it completely takes down our service and impacts our customers immediately. Let's say there's a bug that only hits the new service after five minutes. Once the rollout is complete, it affects every single customer using that service. The second problem is what happens during this update process. You can collect metrics; we use Prometheus internally, and we have a lot of alerts set up on these metrics. So let's say some alert fires during the rollout process. It generally pages an on-call. Everything's on fire. The on-call looks at the standard operating procedures, looks at dashboards, and tries to figure out what's going wrong. And when they realize that, hey, there is some issue with the rollout, they trigger a rollback. This is where our second principle gets affected: look at the end-to-end time between when the on-call was first alerted and when they actually roll back the deployment. This is non-negligible, and the longer it is, the larger the impact of the outage. So what do we do? We decided to start using Argo Rollouts to update our ML ingress stack. An Argo Rollout is essentially a drop-in replacement for a Deployment, with a couple of really interesting fine-grained controls that we're going to talk about quickly. You'll notice that the Rollout changes the strategy field and gives you analysis and steps. Let's take a look at steps first. Now, rather than a zero-to-one rollout, you break your rollout up into steps, so we're changing how fast we can roll out at any particular time. In our example, we first update 20% of our pods, wait 10 minutes, then update 40% more, wait some more time, and so on and so forth. During this entire process, there are constant background health checks running via the analysis template, and I'll come to that in a second. So our first principle is met, because that principle says we're going to reduce the impact by reducing the blast radius, by breaking our deployment up into smaller, more fine-grained steps. And at the end of the rollout, the result is the same as if a regular deployment had happened. So with this foundation, let's look at what changes in how we roll out the ML serving stack. Instead of a Deployment, it's a Rollout now, with an Argo Rollouts controller, a microservice that's controlling this rollout and constantly querying Prometheus for the health checks we care about. Instead of a zero-to-one move from the old replica set to the new canary replica set, it first takes those 20% of your pods, runs the background health checks, and if everything is green and the rollout is healthy so far, it continues the rollout. Now, what happens if something is bad? Some health check fails based on the analysis template, and Argo is immediately able to tell the rollout: hey, this is not a healthy update, let's roll back now. So we've met both principles we really care about: a more staged rollout to reduce the blast radius, and if something goes wrong, the human in the loop is removed and the Argo controller can immediately roll it back, reducing the duration of the outage and thereby the impact of that outage.
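To make this concrete, here is a minimal sketch of what such a Rollout could look like. The names, image, weights, and wait times are illustrative assumptions, not our production configuration; the referenced analysis template is sketched in the next section.

```yaml
# Minimal sketch of an Argo Rollout with staged canary steps and
# background analysis (hypothetical names and values).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ml-ingress
spec:
  replicas: 10
  selector:
    matchLabels:
      app: ml-ingress
  template:
    metadata:
      labels:
        app: ml-ingress
    spec:
      containers:
        - name: envoy
          image: example.registry/ml-ingress:v2
  strategy:
    canary:
      # Background analysis: the controller keeps querying Prometheus while
      # the steps below progress; a failed check aborts and rolls back.
      analysis:
        templates:
          - templateName: ingress-health
      steps:
        - setWeight: 20            # update ~20% of pods first
        - pause: {duration: 10m}   # bake time while health checks run
        - setWeight: 60            # then update more
        - pause: {duration: 10m}
        - setWeight: 100           # finish the rollout
```

The key difference from the plain Deployment above is the canary strategy: the controller advances through the steps only while the background analysis stays healthy, and otherwise aborts back to the stable replica set.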
An analysis is basically a set of background metrics that are constantly being verified, to check whether some metric is going haywire and the update is unhealthy. This is an example that we actually use during our analysis, and it's one of many; you can choose the list of metrics that you care about. In our case, we have a prober called the heatseeker that's constantly checking that the ingress stack is up and running. If we see any 5xx errors during the rollout over a period of 10 seconds, and it happens twice (the failure limit is two), we trigger a rollback. That's how we're trying to keep our ingress stack safe and keep supporting all these really good use cases of our customers. We don't just add metrics out of the box; we first test them so that we're sure these metrics will hold up and not give any false positives, and I'll leave the mic to Rohit to talk about this part. Thank you, Arjun. In this second half of the talk, I'll mainly focus on workflows and how we use Argo Workflows for a lot of use cases here at Databricks. But first I'll take a minute to highlight the feature that Arjun just mentioned. We added the capability for dry runs in Argo Rollouts, and I think it went out in version 1.2. What happens when a developer is trying to add a new check? If they add this check as a wet run and anything goes wrong, we initiate a false-positive rollback. To prevent this scenario, we have this feature called dry run. Whenever a developer is trying to add a new check, they can simply mark it as a dry run. Dry-run checks can be part of an analysis template, and you can also have them as part of rollouts and experiments. If a dry-run check fails, we don't initiate a rollback. We simply give you a result at the end of the rollout that you can analyze, and based on that, you can tweak your checks and make refinements. When you're happy with the check over n successful runs, you can then graduate these checks to wet runs, and then they start impacting the state of the rollouts. Just to see an example, here is the analysis template that we shared before; this is the analysis template that we use to update our ingress stack. Here we have two metrics. One is for 5xx errors, which is marked in dry-run mode, and the other is for 4xx errors. In this case, if you hit the failure condition, say you are getting 10 5xx errors in the last five-minute interval, this dry-run check would fail, but it won't impact the final status of your rollout. On the other hand, if you fail the second check, which is a wet-run check, getting 10 4xx errors in a five-minute window, it would actually initiate a rollback. To speak quickly about how dry-run mode actually works: an analysis template can consist of both wet-run checks and dry-run checks. It's good practice for any developer who is trying to introduce new checks to start with a dry run. They can collect the metrics over n successful runs, and when they're confident about the maturity of the new checks, they can graduate them to wet runs. For instance, in this example, there are seven wet-run checks and five dry-run checks. While some of the dry-run checks failed, since all the wet-run checks were successful, the final state of the rollout is still green.
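As a rough sketch of what an analysis template along these lines could look like, here is one with a dry-run 5xx check alongside a wet-run 4xx check. The Prometheus address, Envoy metric names, queries, and thresholds below are illustrative assumptions, not our actual checks, and the dryRun section follows the dry-run feature described above.

```yaml
# Rough sketch of an analysis template with one dry-run metric and one
# wet-run metric (addresses, queries, and thresholds are placeholders).
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ingress-health
spec:
  # Metrics listed here are evaluated but never fail the rollout (dry run).
  dryRun:
    - metricName: ingress-5xx-errors
  metrics:
    - name: ingress-5xx-errors            # new check, still being tuned
      interval: 1m
      failureLimit: 2
      failureCondition: result[0] > 10    # more than 10 5xx errors in the 5m window
      provider:
        prometheus:
          address: http://prometheus.example.svc.cluster.local:9090   # placeholder
          query: |
            sum(increase(envoy_http_downstream_rq_xx{envoy_response_code_class="5"}[5m]))
    - name: ingress-4xx-errors            # wet run: a failure here triggers rollback
      interval: 1m
      failureLimit: 2
      failureCondition: result[0] > 10    # more than 10 4xx errors in the 5m window
      provider:
        prometheus:
          address: http://prometheus.example.svc.cluster.local:9090   # placeholder
          query: |
            sum(increase(envoy_http_downstream_rq_xx{envoy_response_code_class="4"}[5m]))
```

Metrics under dryRun are still measured and reported at the end of the run, but their failures never abort the rollout, which is what lets a developer trial a new check safely.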
Now let's look at a hypothetical journey for a developer who is trying to introduce new checks. They are trying to add two new checks here. One is to measure the increase in 408 errors from Envoy, and the second is to look at the increase in the number of connection timeouts coming from Envoy. They don't yet know the correct red lines. So in this case, they add these two new checks as dry-run checks and collect this data over the next n runs. If there are no false positives, they simply graduate these checks to wet runs, and then they start impacting the final state of the rollout. But if there are false positives, the developer continues to refine these metrics and define the correct red lines, and until all the false positives are eliminated, they keep running them as dry runs. Okay, so that's about safe rollouts and dry runs. Next, we want to talk about workflows. We use Argo Workflows for various use cases here at Databricks, with different sets of goals and challenges, and I want to share some of the interesting workflows that we have for model serving. We'll explore some of these in detail today to get a comprehensive understanding. I want to start with capacity planning. Arjun earlier talked about model backends and how model serving's request-based autoscaling works. This autoscaling requires setting up the compute that we run model-serving models on and bootstrapping these machines with our observability stack, health checkers, and networking daemon sets, and this is all very time-consuming. So in order to speed things up, we maintain a small warm pool. We have two main goals here. One is availability, which means we want enough machines in the warm pool so that we can handle the incoming workload. The second is efficiency: compute costs a lot of dollars, so we don't want idle machines sitting in the warm pool. We have an Argo workflow which periodically scrapes the usage metrics from our ingress Envoy proxy containers. We aggregate these metrics and send them to a machine learning model, and we use the output from this proprietary machine learning model to decide the fleet capacity. This happens in all the regions; we have about 70 different regions across all three cloud providers, so this is happening in every region. And then finally we upscale and downscale the warm-pool fleet size based on what the machine learning model tells us. There are some clear wins here. It's really, really easy to set up the whole pipeline. As I mentioned, there are 70 regions; if we make any change, we can just apply it everywhere, and we don't have to manually go into each region and do the same thing. Also, we get observability at every workflow step. If any step fails, we trigger alerts and our on-call goes and looks into it. Just to see this in practice, let's take a look at the workflow. In the first step, we scrape the metrics from the Envoy containers. This is a very simplified view; I'm hiding one thing. Prometheus is the one which scrapes the metrics, because there are around 20 Envoy containers running, and the Argo workflow simply gets the data from Prometheus. The next step is to aggregate this data over all the interesting dimensions that we want and then send it to the machine learning model. And then finally we use the output coming from the machine learning model to decide the warm-pool size.
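A minimal sketch of how a pipeline like this could be expressed as a scheduled Argo workflow is below. The schedule, template names, and container images are placeholders, and the actual forecasting model and resizing logic are internal, so this only illustrates the structure.

```yaml
# Minimal sketch of a periodic capacity-planning workflow
# (hypothetical schedule, names, and images).
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: warm-pool-capacity
spec:
  schedule: "0 * * * *"              # illustrative hourly cadence
  workflowSpec:
    entrypoint: plan-capacity
    templates:
      - name: plan-capacity
        steps:
          - - name: scrape-usage     # pull aggregated Envoy usage from Prometheus
              template: scrape-usage
          - - name: predict-demand   # feed the usage to the forecasting model
              template: predict-demand
              arguments:
                parameters:
                  - name: usage
                    value: "{{steps.scrape-usage.outputs.result}}"
          - - name: resize-warm-pool # upscale or downscale the warm pool
              template: resize-warm-pool
              arguments:
                parameters:
                  - name: target-size
                    value: "{{steps.predict-demand.outputs.result}}"
      - name: scrape-usage
        container:
          image: example.registry/metrics-scraper:latest    # placeholder image
      - name: predict-demand
        inputs:
          parameters:
            - name: usage
        container:
          image: example.registry/demand-model:latest       # placeholder image
          args: ["--usage", "{{inputs.parameters.usage}}"]
      - name: resize-warm-pool
        inputs:
          parameters:
            - name: target-size
        container:
          image: example.registry/pool-resizer:latest       # placeholder image
          args: ["--target-size", "{{inputs.parameters.target-size}}"]
```

Because each stage is its own workflow step, a failure in any one of them is visible and alertable on its own, which is where the per-step observability mentioned above comes from.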
If we think that the pool needs to be resized and we need more capacity, we request machines and add them to our warm pool. And if we want to downsize the pool, we just let go of the additional capacity. There are some very interesting patterns that we've seen. This whole thing is still in pilot, so we are continuously refining the model, but we see, for example, a huge capacity demand in the middle of the night, and then in the middle of the day we let go of everything. The next use case I want to talk about is container builds. When a user creates a model and queries their machine learning model, behind the scenes we package everything up and deploy it as a service. The entire build process is divided into three steps. This is done to aim for better separation between what Databricks owns and what the user owns, so that we can surface errors to the right party. If it's a build error from a user installing a dependency which doesn't exist, we just want to tell the user that there is some problem with their requirements file. Whereas if it's a Databricks error, say we fail to create the Docker image or we fail to push it, then we need that alert to go to our on-call so they can look into it. The two main goals here are: first, we want to build the model-serving container, execute it, and update the UI the user is looking at to mark the endpoint as ready; second, we want to deliver the build logs I talked about back to the user, so if there is any problem with the model they are trying to deploy, they can just go look at those logs. To achieve this, we again use an Argo workflow. We retrieve various artifacts from different sources; it can be PyPI, S3 buckets, or GitHub. We construct a Docker image from all these artifacts and push it to our registry. In the third step, we deploy a Kubernetes service leveraging the Docker image that we pushed. We update the UI state based on the health-check probes: once the service comes up, we have some readiness checks, and once those readiness checks pass, we update the UI state to ready and the user can send traffic to the API. And in the event of failures, we have logs for both Databricks and the user; if it's a Databricks error, we just go and alert our on-call. The benefits are, again, that the entire pipeline is very easy for us to set up, we can set up alerts to notify us in case of any issues, and we can enhance observability at every step of this workflow. Just to see this in practice, as I mentioned, there are three steps. The first is a pre-build step, in which we fetch the base image and then install various dependencies. These dependencies come from things like wheels, PyPI, and S3 buckets. This is the step where, if it fails, we ship the logs back to the customer or user, and they are responsible for fixing it. The next step is the build step, in which we package everything up and create a Docker image; if anything goes wrong in this step, we alert our on-call. The final step is a post-build step, in which we take the Docker image we built and pushed in the last step and use it to spin up the machine learning models. This is also the step where we probe, so once these services are up and running, we go and update the state in the UI.
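Here is a minimal sketch of how that three-stage build could be laid out as an Argo workflow; the template names, images, and the model-id parameter are hypothetical placeholders rather than the real pipeline.

```yaml
# Minimal sketch of the three-stage container-build workflow
# (hypothetical names, images, and parameters).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: model-container-build-
spec:
  entrypoint: build-and-deploy
  arguments:
    parameters:
      - name: model-id
        value: "example-model"            # placeholder model identifier
  templates:
    - name: build-and-deploy
      steps:
        - - name: pre-build               # fetch base image + user dependencies (PyPI, wheels, S3);
            template: pre-build           # failures here are shipped back to the user as build logs
        - - name: build                   # assemble and push the Docker image;
            template: build               # failures here page the Databricks on-call
        - - name: post-build              # deploy the service, wait for readiness probes,
            template: post-build          # then mark the endpoint as ready in the UI
    - name: pre-build
      container:
        image: example.registry/model-prebuilder:latest     # placeholder image
        args: ["--model-id", "{{workflow.parameters.model-id}}"]
    - name: build
      container:
        image: example.registry/image-builder:latest        # placeholder image
    - name: post-build
      container:
        image: example.registry/endpoint-deployer:latest    # placeholder image
```

Splitting the pipeline at these boundaries is what lets a pre-build failure go back to the user as build logs while a build or post-build failure pages the on-call instead.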
Here's a screenshot illustrating the user-facing interface. It's a little hard to see, but the state of the endpoint is still "creating," and in the highlighted section there is an error saying that the requirements file we got has a dependency which cannot be resolved or installed. The third use case that we use Argo workflows for is metrics delivery. One of the key features of model serving is the emission and delivery of metrics. These metrics include things like QPS, latencies, and CPU usage, so that people can tweak their models accordingly, and customers today have their own dashboards and their own alerting based on these metrics. The primary objective for us is to capture these metrics at a periodic interval and deliver them back to the control plane, because the control plane is where our UI lives, and once we deliver them to the control plane, we can show these things per model, per customer in the UI. To achieve this, we again have a workflow which periodically scrapes the Envoy containers. We aggregate the per-endpoint, per-customer metrics over all the interesting dimensions and then store these metrics in a time-series database located in our control plane. Just to see how it works: model serving comprises two main components. There is a data plane, where the model-serving workloads are actually running; the ingress stack, authentication and authorization, rate limiting, everything is running on the data plane. And then we have a control plane, which basically enables customers to manage their endpoints; this is where the UI is and where customers create new models. Based on the SLA, a new workflow run occurs every X minutes. In the first step, it scrapes metrics from Prometheus, which aggregates metrics from all the Envoy containers. Then we aggregate these metrics over all the dimensions that we want. In the third step, we send this data to a metrics-collector service that we have in the control plane, and we persist it into persistent storage. Our UI just hits this persistent storage and shows these metrics per endpoint and per customer. Here is a screenshot of the per-endpoint metrics. At the top you can see we have metrics, events, and logs, which we also pipe using the same mechanism. There are graphs for latency and QPS, and we show things like CPU usage and memory usage. So there are a ton of metrics that you can define alerts on. These were just a few examples. We use many, many more workflows; we have workflows for autoscaling, real-time config delivery for our services, and so on, but in the interest of time I'll skip those today. So that concludes our prepared material for today. I think we still have some time, so we'll take any questions you may have. Thank you. No questions? I have one. Great presentation. I have a question on transitioning to Argo Rollouts. How did you actually get your developers to transition to it? Because there are a few topics in Argo Rollouts, since this is DevOps- or platform-engineering-driven, and the developers have to learn how Argo Rollouts works and how to write Prometheus queries to know what conditions will cause the rollout to roll back. Yeah, that's actually a great question. So today, we have a wrapper over kubectl; we call it kube-cfg. We just take the regular deployment file and generate a rollout CRD on the fly.
We generate that rollout CRD on the fly and deploy it. Then, for the metrics, we provide our developers with some out-of-the-box metrics, like database errors and basic service errors, and we give them a framework so they can easily define the checks they need; that's also why we have the dry-run feature. If you want to do something advanced and you're not sure about it, you can just start with a dry run and, with time, graduate it to a wet run. But yeah, it's all maintained by the DevEx team. Okay, so it's kind of centralized, and that's some overhead you have to maintain for the sake of everybody? That is correct, yeah. Our DevEx team gives us a library, and every service just uses that library to get out-of-the-box checks, and they can add on top of that. Out of curiosity, how big is your DevEx team now? 10 people. All right, thank you. Yeah, I think we'll be at the back of the room if you have more questions. Thank you.